Maximum Conditional Likelihood via 
Bound Maximization and the CEM 
Algorithm 
Tony Jebara and Alex Pentland 
Vision and Modeling, MITMedia Laboratory, Cambridge MA 
http://www.media.mit.edu/  jebara 
{ jebara,sandy }media.mit.edu 
Abstract 
We present the CEM (Conditional xpectation Maximization) al- 
gorithm as an extension of the EM (xpectation Maximization) 
algorithm to conditional density estimation under missing data. A 
bounding and maximization process is given to specifically optimize 
conditional likelihood instead of the usual joint likelihood. We ap- 
ply the method to conditioned mixture models and use bounding 
techniques to derive the models update rules. Monotonic conver- 
gence, computational efficiency and regression results superior to 
EM are demonstrated. 
I Introduction 
Conditional densities have played an important role in statistics and their merits 
over joint density models have been debated. Advantages in feature selection, ro- 
bustness and limited resource allocation have been studied. Ultimately, tasks such 
as regression and classification reduce to the evaluation of a conditional density. 
However, popularity of maximum joint likelihood and EM techniques remains strong 
in part due to their elegance and convergence properties. Thus, many conditional 
problems are solved by first estimating joint models then conditioning them. This 
results in concise solutions such as the Nadarya-Watson estimator [2], Xu's mixture 
of experts [7], and Amari's era-neural networks [1]. However, direct conditional 
density approaches [2, 4] can offer solutions with higher conditional likelihood on 
test data than their joint counter-parts. 
Maximum Conditional Likelihood via Bound Maximization and CEM 495 
(a) a = -4.2 
L] = -2.4 
s 10 ls 2o 
(b) Lb:-5.2 L,=-1.8 
Figure 1' Average Joint (x, y) vs. Conditional (ylx) Likelihood Visualization 
Popat [6] describes a simple visualization example where 4 clusters must be fit with 
2 Gaussian models as in Figure 1. Here, the model in (a) has a superior joint likeli- 
hood (La > Lb) and hence a better p(x, y) solution. However, when the models are 
conditioned to estimate p(y[x), model (b) is superior (L, > L}). Model (a) yields 
a poor unimodal conditional density in y and (b) yields a bi-modal conditional 
density. It is therefore of interest to directly optimize conditional models using con- 
ditional likelihood. We introduce the CEM (Conditional zpectation Maa:iraization) 
algorithm for this purpose and apply it to the case of Gaussian mixture models. 
2 EM and Conditional Likelihood 
For joint densities, the tried and true EM algorithm [3] maximizes joint likelihood 
over data. However, EM is not as useful when applied to conditional density estima- 
tion and maximum conditional likelihood problems. Here, one typically resorts to 
other local optimization techniques such as gradient descent or second order Hessian 
methods [2]. We therefore introduce CEM, a variant of EM, which targets condi- 
tional likelihood while maintaining desirable convergence properties. The CEM 
algorithm operates by directly bounding and decoupling conditional likelihood and 
simplifies M-step calculations. 
In EM, a complex density optimization is broken down into a two-step iteration 
using the notion of missing data. The unknown data components are estimated via 
the E-step and a simplified maximization over complete data is done in the M-step. 
In more practical terms, EM is a bound maximization: the E-step finds a lower 
bound for the likelihood and the M-step maximizes the bound. 
M 
p(xi,Yi[O) :  p(m, xi,Yi[O) 
(1) 
Consider a complex joint density p(xi, yi[{3) which is best described by a discrete 
(or continuous) summation of simpler models (Equation 1). Summation is over the 
'missing components' m. 
Ai 
: EiN=xlog(p(xi,YilOt)) -log(p(xi,yilOt-x)) 
M P(m'x"y'lOt) where 
--> Ei51 m=1 him log p(m,X,,y,lt_) 
him -- P(m'x"y'IO'-) 
 ,__  P(n,X,,Y,[ - ) 
By appealing to Jensen's inequality, EM obtains a lower bound for the incremental 
log-likelihood over a data set (Equation 2). Jensen's inequality bounds the log- 
arithm of the sum and the result is that the logarithm is applied to each simple 
(2) 
496 T. Jebara and A. Pentland 
model p(m, xi, yi]O) individually. It then becomes straightforward t.o compute the 
derivatives with respect to 0 and set. to zero for maximization (M-step). 
5I 
P(Yilxi ): Z p(m yi[xi ): Z3m/=l p(m, xi,Yi[) 
(3) 
However, the elegance of EM is compromised when we consider a conditioned density 
as in Equation 3. The corresponding incremental conditional log-likelihood, AF, is 
shown in Equation 4. 
Alc 
(4) 
The above is a difference between a ratio of joints and a ratio of marginals. If 
Jensen's inequality is applied to the second term in Equation 4 it yields an upper 
bound since the term is subtracted (this would compromise convergence). Thus, 
only the first ratio can be lower bounded with Jensen (Equation 5). 
31 
N M p(rn,xi,YilOt ) Y-n:l P(n,xil Or) 
AI > Z  hilgp,;7372 ) -log M () 
i=x : =x P(n, xil Or-x) 
Note the lingering logarithm of a sum which prevents a simple M-Step. At this point, 
one would resort to a Generalized EM (GEM) approach which requires gradient or 
second-order ascent techniques for the M-step. For example, Jordan et al. overcome 
the difficult M-step caused by EM with an Iteratively Re-Weighted Least Squares 
algorithm in the mixtures of experts architecture [4]. 
3 Conditional Expectation Maximization 
The EM algorithm can be extended by substituting Jensen's inequality for a dif- 
ferent bound. Consider the upper variational bound of a logarithm x - 1 > log(x) 
(which becomes a lower bound on the negative log). The proposed logarithm's 
bound satisfies a number of desiderata: (1) it makes contact at the current op- 
erating point , (2) it is tangential to the logarithm, (3) it is a tight bound, (4) 
it is simple and (5) it is the variational dual of the logarithm. Substituting this 
linear bound into the incremental conditional log-likelihood maintains a true lower 
bounding function Q (Equation 6). 
N M p(m, xi,YilOt ) 
Al & Q(t'Ot-)- Z Z hilgp,r.i.i-W-2 ) - 
i'-1 m=l 
zM 
,: P(n, xilO t) 
zM 
,= p(n, xi]O t-) 
+1 
(6) 
The Mixture of Experts formalism [4] offers a graceful representation of a conditional 
density using experts (conditional sub-models) and gates (marginal sub-models). 
The (2 function adopts this form in Equation 7. 
The current operating point is 1 since the ()t model in the ratio is held fixed at the 
previous iteration's value t-x 
Maximum Conditional Likelihood via Bound Maximization and CEM 497 
-.iN= EM= {hi(logp(yilm, xi, 0 t) + logp(m, xil Or) - zim)- rgp(m, xg IO) +  } 
where zi = log(p(m, xi,YilO*-t)) and ri - ( E=x P(n,xil Or-x) )- 
Computing this (2 function forms the CE-step in the Conditional Expectation Max- 
imization algorithm and it results in a simplified M-step. Note the absence of the 
logarithm of a sum and the alecoupled models. The form here allows a more straight- 
forward computation of derivatives with respect to t and a more tractable M-Step. 
For continuous missing data, a similar derivation holds. 
At this point, without loss of generality, we specifically attend to the case of a condi- 
tioned Gaussian mixture model and derive the corresponding M-Step calculations. 
This serves as an implementation example for comparison purposes. 
(7) 
4 CEM and Bound Maximization for Gaussian Mixtures 
In deriving an efficient M-step for the mixture of Gaussians, we call upon more 
bounding techniques that follow the CE-step and provide a monotonically conver- 
gent learning algorithm. The form of the conditional model we will train is obtained 
by conditioning a joint mixture of Gaussians. We write the conditional density 
in a experts-gates form as in Equation 8. We use unnormalized Gaussian gates 
./'(x;/, E) = exp(-(x- ]/)T.v-I (x- ]/)) since conditional models do not require 
true marginal densities over x (i.e. that necessarily integrate to 1). Also, note that 
the parameters of the gates (c, tt, E) are independent of the parameters of the 
experts (um, F TM, 2m). 
Both gates and experts are optimized independently and have no variables in com- 
mon. An update is performed over the experts and then over the gates. If each 
of those causes an increase, we converge to a local maximum of conditional log- 
likelihood (as in Expectation Conditional Maximization [5]). 
M 
p(ylx, ) : 
To update the experts, we hold the gates fixed and merely take derivatives of the Q 
function with respect to the expert parameters ( = {v", F TM, } ) and set them 
to 0. Each expert is effectively decoupled from other terms (gates, other experts, 
etc.). The solution reduces to maximizing the log of a single conditioned Gaussian 
and is analytically straightforward. 
c) " i himalgA/'(Y';""+P"x"f") '-- 0 
(9) 
(8) 
Similarly, to update the gate mixing proportions, derivatives of the Q function are 
taken with respect to c, and set to 0. By holding the other parameters fixed, the 
update equation for the mixing proportions is numerically evaluated (Equation 10). 
N N 
:=  
i=1 i:1 
498 T. Jebara and A. Pentland 
(a) f Function (b) Bound on tt (c) g Function (d) Bound on E 
Figure 2: Bound Width Computation and Example Bounds 
4.1 Bounding Gate Means 
Taking derivatives of Q and setting to 0 is not as straightforward for the case of 
the gate means (even though they are decoupled). What is desired is a simple 
update rule (i.e. computing an empirical mean). Therefore, we further bound the 
Q function for the M-step. The Q function is actually a summation of sub-elements 
Qi, and we bound it instead by a summation of quadratic functions on the means 
(Equation 11). 
N M 
i=1 m=l 
N M 
Z Z ki.- w,,JJ.- c,,ll  (11) 
i=1 m=l 
Each quadratic bound has a location parameter ci, (a centroid), a scale parameter 
wi, (narrowness), and a peak value at ki,. The sum of quadratic bounds makes 
contact with the Q function at the old values of the model t- where the gate 
mean was originally 
/% and the covariance is E "* To facilitate the derivation, 
one may assume that the previous mean was zero and the covariance was identity 
if the data is appropriately whitened with respect to a given gate. 
The parameters of each quadratic bound are solved by ensuring that it contacts the 
corresponding Qi, function at t- and they have equal derivatives at contact (i.e. 
tangential contact). Solving these constraints yields quadratic parameters for each 
gate rn and data point i in Equation 12 (ki, is omitted for brevity). 
The tightest quadratic bound occurs when wir is minimal (without violating the 
inequality). The expression for wi, reduces to finding the minimat value, wi*, , as in 
Equation 13 (here p2 = xTxi). The f function is computed numerically only once 
and stored as a lookup table (see Figure 2(a)). We thus immediately compute the 
optimal wi*, and the rest of the quadratic bound's parameters obtaining bounds as 
in Figure 2(b) where a Qir is lower bounded. 
1 2 
max , ,e-5 c eCP-cp - i ]i, , ]i, (13) 
* P } + = riarne- f(p) +  
Wire = ri m C {e- c2 2 
The gate means tt are solved by maximizing the sum of the M x N parabolas which 
bound (2. The update is tt = (y. wi*,cir ) ( w?,) -. This mean is subsequently 
unwhitened to undo earlier data transformations. 
Maximum Conditional Likelihood via Bound Maximization and CEM 499 
(a) Data (b) CEM p(ylx) (d) EM fit 
(e) EM p(ylx) (f) EM 
Figure 3: Conditional Density Estimation for CEM and EM 
4.2 Bounding Gate Covariances 
Having derived the update equation for gate means, we now turn our attention 
to the gate covariances. We bound the Q function with logarithms of Gaussians. 
Maximizing this bound (a sum of log-Gaussians) reduces to the maximum-likelihood 
estimation of a covariance matrix. The bound for a Qi sub-component is shown 
in Equation 14. Once again, we assume the data has been appropriately whitened 
with respect to the gate's previous parameters (the gate's previous mean is 0 and 
previous covariance is identity). Equation 15 solves for the log-Gaussian parameters 
(again p2 = x/r xi). 
The computation for the minimal wir simplifies to wi*, = riomg(p). The g function 
is derived and plotted in Figure 2(c). An example of a log-Gaussian bound is 
shown in Figure 2(d) a sub-component of the Q function. Each sub-component 
corresponds to a single data point as we vary one gate's covariance. All M x N 
log-Gaussian bounds are computed (one for each data point and gate combination) 
and are summed to bound the Q function in its entirety. 
To obtain a final answer for the update of the gate covariances E" we simply 
maximize the sum of log Gaussians (parametrized by wi*, ki, ci,). The update is 
E = (-] wci,ci,?) (Y] wr) -. This covariance is subsequently unwhitened, 
inverting the whitening transform applied to the data. 
5 Results 
The CEM algorithm updates the conditioned mixture of Gaussians by computing 
him and rir in the CE steps and interlaces these with updates on the experts, 
mixing proportions, gate means and gate covariances. For the mixture of Gaussians, 
each CEM update has a computation time that is comparable with that of an EM 
update (even for high dimensions). However, conditional likelihood (not joint) is 
monotonically increased. 
Consider the 4-cluster (x, y) data in Figure 3(a). The data is modeled with a con- 
ditional density p(ylx) using only 2 Gaussian models. Estimating the density with 
CEM yields the p(y[x) shown in Figure 3(b). CEM exhibits monotonic conditional 
likelihood growth (Figure 3(c)) and obtains a more conditionally likely model In 
500 T. Jebara and A. Pentland 
Algorithm CCNO [ CCN5 I C4.5 
Abalone 24.86% 26.25% 21.5% 
0.0% 22.32% 26.63% 
Table 1: Test Results. Class label regression accuracy data. (CNN0=cascade- 
correlation, 0 hidden units, CCN$=$ hidden LD=linear discriminant). 
the EM case, a joint p(x, y) clusters the data as in Figure 3(d). Conditioning it 
yields the p(y[2') in Figure 3(e). Figure 3(f) depicts EM's non-monotonic evolution 
of conditional log-likelihood. EM produces a superior joint likelihood but an infe- 
rior conditional likelihood. Note how the CEM algorithm utilized limited resources 
to capture the multimodal nature of the distribution in y and ignored spurious bi- 
modal clustering in the 2' feature space. These properties are critical for a good 
conditional density p(y]:c). 
For comparison, standard databases were used from UCI 2. Mixture models were 
trained with EM and CEM, maximizing joint and conditional likelihood respectively. 
Regression results are shown in Table 1. CEM exhibited, monotonic conditional log- 
likelihood growth and out-performed other methods including EM with the same 
2-Gaussian model (EM2 and CEM2). 
6 Discussion 
We have demonstrated a variant of EM called CEM which optimizes conditional 
likelihood efficiently and monotonically. The application of CEM and bound maxi- 
mization to a mixture of Gaussians exhibited promising results and better regression 
than EM. In other work, a MAP framework with various priors and a deterministic 
annealing approach have been formulated. Applications of the CEM algorithm to 
non-linear regressor experts and hidden Markov models are currently being investi- 
gated. Nevertheless, many applications CEM remain to be explored and hopefully 
others will be motivated to extend the initial results. 
Acknowledgements 
Many thanks to Michael Jordan and Kris Popat for insightful discussions. 
References 
[1] S. Amari. Information geometry of em and em algorithms for neural networks. Neural 
Networks, 8(9), 1995. 
f ] C. Bishop. Neural Networks for Pattern Recognition. Oxford Press, 1996. 
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via 
the em algorithm. Journal of the Royal Statistical Society, B39, 1977. 
[4] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the em algorithm. 
Neural Computation, 6:181-214, 1994. 
[5] X. Meng and D. Rubin. Maximum likelihood estimation via the ecm algorithm: A 
general framework. Biometrika, 80(2), 1993. 
[6] A. Popat. Conjoint probabilistic subband modeling (phd. thesis). Technical Report 
461, M.I.T. Media Laboratory, 1997. 
[7] L. Xu, M. Jordan, and G. Hinton. An alternative model for mixtures of experts. In 
Neural Information Processing Systems 7, 1995. 
2 http://www.ics.uci.edu/,,mlearn/M LRepository.html 
