Batch and On-line Parameter Estimation of 
Gaussian Mixtures Based on the Joint Entropy 
Yoram Singer 
AT&T Labs 
singer@research.att.com
Manfred K. Warmuth 
University of California, Santa Cruz 
manfred@cse.ucsc.edu
Abstract 
We describe a new iterative method for parameter estimation of Gaus- 
sian mixtures. The new method is based on a framework developed by 
Kivinen and Warmuth for supervised on-line learning. In contrast to gra- 
dient descent and EM, which estimate the mixture's covariance matrices, 
the proposed method estimates the inverses of the covariance matrices. 
Furthermore, the new parameter estimation procedure can be applied in 
both on-line and batch settings. We show experimentally that the new method typically converges faster than EM, usually requiring about half as many iterations.
1 Introduction 
Mixture models, in particular mixtures of Gaussians, have been a popular tool for density 
estimation, clustering, and unsupervised learning with a wide range of applications (see
for instance [5, 2] and the references therein). Mixture models are one of the most useful 
tools for handling incomplete data, in particular hidden variables. For Gaussian mixtures 
the hidden variables indicate for each data point the index of the Gaussian that generated it. 
Thus, the model is specified by a joint density between the observed and hidden variables. 
The common technique used for estimating the parameters of a stochastic source with hid- 
den variables is the EM algorithm. In this paper we describe a new technique for estimating 
the parameters of Gaussian mixtures. The new parameter estimation method is based on a 
framework developed by Kivinen and Warmuth [8] for supervised on-line learning. This 
framework was successfully used in a large number of supervised and unsupervised problems (see for instance [7, 6, 9, 1]).
Our goal is to find a local minimum of a loss function which, in our case, is the negative 
log likelihood induced by a mixture of Gaussians. However, rather than minimizing the 
loss directly we add a term measuring the distance of the new parameters to the old ones. 
This distance is useful for iterative parameter estimation procedures. Its purpose is to keep 
the new parameters close to the old ones. The method for deriving iterative parameter 
estimation can be used in batch settings as well as on-line settings where the parameters 
are updated after each observation. The distance used for deriving the parameter estimation 
method in this paper is the relative entropy between the old and new joint density of the 
observed and hidden variables. For brevity we term the new iterative parameter estimation 
method the joint-entropy (JE) update. 
The JE update shares a common characteristic with the Expectation-Maximization (EM) [4, 10] algorithm in that it first calculates the same expectations. However, it replaces the maximization
step with a different update of the parameters. For instance, it updates the inverse of the 
covariance matrix of each Gaussian in the mixture, rather than the covariance matrices 
themselves. We found in our experiments that the JE update often requires half as many 
iterations as EM. It is also straightforward to modify the proposed parameter estimation
method for the on-line setting where the parameters are updated after each new observation.
As we demonstrate in our experiments with digit recognition, the on-line version of the 
JE update is especially useful in situations where the observations are generated by a non- 
stationary stochastic source. 
2 Notation and preliminaries 
Let S be a sequence of training examples \{x_1, x_2, \ldots, x_N\} where each x_j is a d-dimensional vector in \mathbb{R}^d. To model the distribution of the examples we use m d-dimensional Gaussians. The parameters of the i-th Gaussian are denoted by \theta_i and they include the mean vector and the covariance matrix:

\mu_i = E(x \mid i) \qquad C_i = E\big( (x - \mu_i)(x - \mu_i)^T \mid i \big) .

The density function of the i-th Gaussian, denoted P(x \mid i), is

P(x \mid i) = (2\pi)^{-d/2} \, |C_i|^{-1/2} \exp\big( -\tfrac{1}{2} (x - \mu_i)^T C_i^{-1} (x - \mu_i) \big) .
We denote the entire set of parameters of a Gaussian mixture by \Theta = \{\theta_i\}_{i=1}^m = \{w_i, \mu_i, C_i\}_{i=1}^m, where w = (w_1, \ldots, w_m) is a non-negative vector of mixture coefficients such that \sum_{i=1}^m w_i = 1. We denote by P(x \mid \Theta) = \sum_{i=1}^m w_i P(x \mid i) the likelihood of an observation x according to a Gaussian mixture with parameters \Theta. Let \tilde\theta_i and \theta_i be two Gaussian distributions. For brevity, we denote by E_{\tilde\theta_i}(Z) and E_{\theta_i}(Z) the expectation of a random variable Z with respect to \tilde\theta_i and \theta_i. Let f be a parametric function whose parameters constitute a matrix A = (a_{ij}). We denote by \partial f / \partial A the matrix of partial derivatives of f with respect to the elements in A. That is, the ij element of \partial f / \partial A is \partial f / \partial a_{ij}. Similarly, let B = (b_{ij}(z)) be a matrix whose elements are functions of a scalar z. Then, we denote by dB/dz the matrix of derivatives of the elements in B with respect to z; namely, the ij element of dB/dz is db_{ij}(z)/dz.
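The notation above translates directly into code. The following NumPy sketch (ours, not from the paper; the function names are illustrative) evaluates the density P(x \mid i) of a single Gaussian and the mixture likelihood P(x \mid \Theta):

```python
import numpy as np

def gaussian_pdf(x, mu, C):
    """Density P(x|i) of a d-dimensional Gaussian with mean mu, covariance C."""
    d = len(mu)
    diff = x - mu
    # Compute (x-mu)^T C^{-1} (x-mu) via a linear solve instead of inverting C.
    quad = diff @ np.linalg.solve(C, diff)
    return (2 * np.pi) ** (-d / 2) * np.linalg.det(C) ** (-0.5) * np.exp(-0.5 * quad)

def mixture_likelihood(x, w, mus, Cs):
    """Mixture likelihood P(x|Theta) = sum_i w_i P(x|i)."""
    return sum(wi * gaussian_pdf(x, mu, C) for wi, mu, C in zip(w, mus, Cs))
```

The negative log-likelihood loss used later is then simply the mean of -np.log(mixture_likelihood(x, w, mus, Cs)) over the sample.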
3 The framework for deriving updates 
Kivinen and Warmuth [8] introduced a general framework for deriving on-line parameter 
updates. In this section we describe how to apply their framework for the problem of 
parameter estimation of Gaussian mixtures in a batch setting. We later discuss how a 
simple modification gives the on-line updates. 
Given a set of data points S in \mathbb{R}^d and a number m, the goal is to find a set of m Gaussians that minimizes the loss on the data, denoted loss(S \mid \Theta). For density estimation the natural loss function is the negative log-likelihood of the data,

loss(S \mid \Theta) \stackrel{\mathrm{def}}{=} -\frac{1}{|S|} \ln P(S \mid \Theta) = -\frac{1}{|S|} \sum_{x \in S} \ln P(x \mid \Theta) .

The best parameters, which minimize the above loss, cannot be found analytically. The common approach is to use iterative methods such as EM [4, 10] to find a local minimizer of the loss.
In an iterative parameter estimation framework we are given the old set of parameters \Theta^t and we need to find a set of new parameters \Theta^{t+1} that induces a smaller loss. The framework introduced by Kivinen and Warmuth [8] deviates from the common approaches in that it also requires the new parameter setting to stay "close" to the old one, which incorporates all that was learned in the previous iterations. The distance of the new parameter setting \Theta^{t+1} from the old setting \Theta^t is measured by a non-negative distance function \Delta(\Theta^{t+1}, \Theta^t). We now search for a new set of parameters \Theta^{t+1} that minimizes the distance summed with the loss multiplied by \eta. Here \eta is a non-negative number measuring the relative importance of the distance versus the loss; this parameter \eta will become the learning rate of the update. More formally, the update is found by setting \Theta^{t+1} = \arg\min_{\Theta} U^t(\Theta) where

U^t(\Theta) = \Delta(\Theta, \Theta^t) + \eta \, loss(S \mid \Theta) + \lambda \Big( \sum_{i=1}^m w_i - 1 \Big) .

(We use a Lagrange multiplier \lambda to enforce the constraint that the mixture coefficients sum to one.) By choosing the appropriate distance function and \eta = 1 one can show that EM becomes the above update.
For most distance functions and learning rates the minimizer of the function U^t(\Theta) cannot be found analytically, as both the distance function and the log-likelihood are usually non-linear in \Theta. Instead, we expand the log-likelihood using a first order Taylor expansion around the old parameter setting. This approximation degrades the further the new parameter values are from the old ones, which further motivates the use of the distance function \Delta(\Theta, \Theta^t) (see also the discussion in [7]). We now seek a new set of parameters \Theta^{t+1} = \arg\min_{\Theta} V^t(\Theta) where

V^t(\Theta) = \Delta(\Theta, \Theta^t) + \eta \big( loss(S \mid \Theta^t) + (\Theta - \Theta^t) \cdot \nabla_{\Theta} loss(S \mid \Theta^t) \big) + \lambda \Big( \sum_{i=1}^m w_i - 1 \Big) .   (1)

Here \nabla_{\Theta} loss(S \mid \Theta^t) denotes the gradient of the loss at \Theta^t. We use the above method, Eq. (1), to derive the updates of this paper. For density estimation, it is natural to use the relative entropy between the new and old density as a distance. In this paper we use the joint density between the observed variables (data points) and hidden variables (the indices of the Gaussians). This motivates the name joint-entropy update.
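To illustrate how the choice of distance function shapes the resulting update (this one-dimensional example is ours, not from the paper): with the squared distance, minimizing the linearized objective of Eq. (1) recovers plain gradient descent, while with the relative entropy over the simplex of mixture weights it yields an exponentiated-gradient step of the kind that reappears in the weight update of Sec. 5.

```python
import numpy as np

def squared_distance_update(theta_old, grad, eta):
    """Minimize 0.5*(theta - theta_old)**2 + eta*(theta - theta_old)*grad.
    Setting the derivative (theta - theta_old) + eta*grad to zero gives
    ordinary gradient descent."""
    return theta_old - eta * grad

def relative_entropy_update(w_old, grad, eta):
    """Minimize sum_i w_i*log(w_i/w_old_i) + eta*sum_i (w_i - w_old_i)*grad_i
    subject to sum_i w_i = 1.  The stationarity condition
    log(w_i/w_old_i) + 1 + eta*grad_i + lambda = 0 yields an
    exponentiated-gradient step, renormalized onto the simplex."""
    w = w_old * np.exp(-eta * grad)
    return w / w.sum()
```

Here grad stands for the gradient of the loss; a component with a more negative loss gradient receives a multiplicatively larger weight.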
4 Entropy based distance functions 
We first consider the relative entropy between the new and old parameters of a single Gaussian. Using the notation introduced in Sec. 2, the relative entropy between two Gaussian distributions denoted by \tilde\theta_i and \theta_i is

\Delta(\tilde\theta_i, \theta_i) = E_{\tilde\theta_i} \Big( \ln \frac{P(x \mid \tilde\theta_i)}{P(x \mid \theta_i)} \Big) .

Using standard (though tedious) algebra we can rewrite the expectations as follows:

\Delta(\tilde\theta_i, \theta_i) = \frac{1}{2} \ln \frac{|C_i|}{|\tilde C_i|} + \frac{1}{2} \mathrm{tr}\big( C_i^{-1} \tilde C_i \big) - \frac{d}{2} + \frac{1}{2} (\tilde\mu_i - \mu_i)^T C_i^{-1} (\tilde\mu_i - \mu_i) .   (2)
The relative entropy between the new and the old mixture models is

\Delta(\tilde\Theta, \Theta) \stackrel{\mathrm{def}}{=} \int P(x \mid \tilde\Theta) \ln \frac{P(x \mid \tilde\Theta)}{P(x \mid \Theta)} \, dx = \int \sum_{i=1}^m \tilde w_i P(x \mid \tilde\theta_i) \, \ln \frac{\sum_{i=1}^m \tilde w_i P(x \mid \tilde\theta_i)}{\sum_{i=1}^m w_i P(x \mid \theta_i)} \, dx .   (3)
Ideally, we would like to use the above distance function in V^t to give us an update of \tilde\Theta in terms of \Theta. However, there is no closed form expression for Eq. (3). Although the relative entropy between two Gaussians is a convex function in their parameters, the relative entropy between two Gaussian mixtures is non-convex. Thus, the function V^t(\tilde\Theta) may have multiple minima, making the problem of finding \arg\min V^t(\tilde\Theta) difficult.
In order to sidestep this problem we use the log-sum inequality [3] to obtain an upper bound on the distance function \Delta(\tilde\Theta, \Theta). We denote this upper bound by \tilde\Delta(\tilde\Theta, \Theta):

\tilde\Delta(\tilde\Theta, \Theta) = \sum_{i=1}^m \tilde w_i \ln \frac{\tilde w_i}{w_i} + \sum_{i=1}^m \tilde w_i \int P(x \mid \tilde\theta_i) \ln \frac{P(x \mid \tilde\theta_i)}{P(x \mid \theta_i)} \, dx = \sum_{i=1}^m \tilde w_i \ln \frac{\tilde w_i}{w_i} + \sum_{i=1}^m \tilde w_i \, \Delta(\tilde\theta_i, \theta_i) .   (4)
We call the new distance function \tilde\Delta(\tilde\Theta, \Theta) the joint-entropy distance. Note that in this distance the parameters of \theta_i and w_i are "coupled" in the sense that it is a convex combination of the distances \Delta(\tilde\theta_i, \theta_i). In particular, \tilde\Delta(\tilde\Theta, \Theta) as a function of the parameters \tilde w_i, \tilde\mu_i, \tilde C_i no longer remains constant when the parameters of the individual Gaussians are permuted. Furthermore, \tilde\Delta(\tilde\Theta, \Theta) is sufficiently convex so that finding the minimizer of V^t is possible (see below).
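The joint-entropy distance of Eq. (4) is straightforward to compute in closed form. Below is our own sketch (illustrative names; the per-component relative entropy follows Eq. (2)):

```python
import numpy as np

def gaussian_kl(mu_new, C_new, mu_old, C_old):
    """Closed-form relative entropy between two Gaussians, Eq. (2)."""
    d = len(mu_old)
    C_old_inv = np.linalg.inv(C_old)
    diff = mu_new - mu_old
    return 0.5 * (np.log(np.linalg.det(C_old) / np.linalg.det(C_new))
                  + np.trace(C_old_inv @ C_new) - d
                  + diff @ C_old_inv @ diff)

def joint_entropy_distance(w_new, mus_new, Cs_new, w_old, mus_old, Cs_old):
    """Upper bound of Eq. (4): relative entropy between the weight vectors plus
    a convex combination (under the new weights) of per-component distances."""
    weight_term = sum(wn * np.log(wn / wo) for wn, wo in zip(w_new, w_old))
    component_term = sum(wn * gaussian_kl(mn, Cn, mo, Co)
                         for wn, mn, Cn, mo, Co
                         in zip(w_new, mus_new, Cs_new, mus_old, Cs_old))
    return weight_term + component_term
```

When the component parameters coincide, only the weight term survives, which makes the "coupling" of the weights and the Gaussian parameters easy to see.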
5 The updates 
We are now ready to derive the new parameter estimation scheme. This is done by setting the partial derivatives of V^t, with respect to \tilde\Theta, to 0. That is, our problem consists of solving the following equations:

\frac{\partial \tilde\Delta(\tilde\Theta, \Theta)}{\partial \tilde w_i} - \frac{\eta}{|S|} \frac{\partial \ln P(S \mid \Theta)}{\partial w_i} + \lambda = 0, \qquad \frac{\partial \tilde\Delta(\tilde\Theta, \Theta)}{\partial \tilde\mu_i} - \frac{\eta}{|S|} \frac{\partial \ln P(S \mid \Theta)}{\partial \mu_i} = 0, \qquad \frac{\partial \tilde\Delta(\tilde\Theta, \Theta)}{\partial \tilde C_i} - \frac{\eta}{|S|} \frac{\partial \ln P(S \mid \Theta)}{\partial C_i} = 0 .

We now use the fact that C_i, and thus C_i^{-1}, is symmetric. The derivatives of \tilde\Delta(\tilde\Theta, \Theta), as defined by Eq. (4) and Eq. (2), with respect to \tilde w_i, \tilde\mu_i and \tilde C_i, are

\frac{\partial \tilde\Delta(\tilde\Theta, \Theta)}{\partial \tilde w_i} = \ln \frac{\tilde w_i}{w_i} + 1 + \frac{1}{2} \ln \frac{|C_i|}{|\tilde C_i|} - \frac{d}{2} + \frac{1}{2} \mathrm{tr}\big( C_i^{-1} \tilde C_i \big) + \frac{1}{2} (\tilde\mu_i - \mu_i)^T C_i^{-1} (\tilde\mu_i - \mu_i)   (5)

\frac{\partial \tilde\Delta(\tilde\Theta, \Theta)}{\partial \tilde\mu_i} = \tilde w_i \, C_i^{-1} (\tilde\mu_i - \mu_i)   (6)

\frac{\partial \tilde\Delta(\tilde\Theta, \Theta)}{\partial \tilde C_i} = \frac{\tilde w_i}{2} \big( {-\tilde C_i^{-1}} + C_i^{-1} \big) .   (7)
To simplify the notation throughout the rest of the paper we define the following variables:

\beta_i(x) \stackrel{\mathrm{def}}{=} \frac{P(x \mid \theta_i)}{P(x \mid \Theta)} \quad \mathrm{and} \quad \tilde\beta_i(x) \stackrel{\mathrm{def}}{=} \frac{w_i}{\tilde w_i} \, \beta_i(x) .

The partial derivatives of the log-likelihood are computed similarly:

\frac{\partial \ln P(S \mid \Theta)}{\partial w_i} = \sum_{x \in S} \beta_i(x)   (8)

\frac{\partial \ln P(S \mid \Theta)}{\partial \mu_i} = \sum_{x \in S} w_i \, \beta_i(x) \, C_i^{-1} (x - \mu_i)   (9)

\frac{\partial \ln P(S \mid \Theta)}{\partial C_i} = -\frac{1}{2} \sum_{x \in S} w_i \, \beta_i(x) \big( C_i^{-1} - C_i^{-1} (x - \mu_i)(x - \mu_i)^T C_i^{-1} \big) .   (10)
We now need to decide on an order for updating the parameter classes w_i, \mu_i, and C_i. We use the same order that EM uses, namely w_i, then \mu_i, and finally C_i. (After doing one pass over all three groups we start again using the same order.) Using this order results in a simplified set of equations, as several terms in Eq. (5) cancel out. Denote the size of the sample by N = |S|. We now need to sum the derivatives from Eq. (5) and Eq. (8) while using the fact that the Lagrange multiplier \lambda simply ensures that the new weights \tilde w_i sum to one. By setting the result to zero, we get that

\tilde w_i \leftarrow \frac{w_i \exp\big( \frac{\eta}{N} \sum_{x \in S} \beta_i(x) \big)}{\sum_{j=1}^m w_j \exp\big( \frac{\eta}{N} \sum_{x \in S} \beta_j(x) \big)} .   (11)

Similarly, we sum Eq. (6) and Eq. (9), set the result to zero, and get that

\tilde\mu_i \leftarrow \mu_i + \frac{\eta}{N} \sum_{x \in S} \tilde\beta_i(x) \, (x - \mu_i) .   (12)

Finally, we do the same for C_i. We sum Eq. (7) and Eq. (10) using the newly obtained \tilde w_i and \tilde\mu_i, and get that

\tilde C_i^{-1} \leftarrow C_i^{-1} + \frac{\eta}{N} \sum_{x \in S} \tilde\beta_i(x) \big( C_i^{-1} - C_i^{-1} (x - \mu_i)(x - \mu_i)^T C_i^{-1} \big) .   (13)
We call the new iterative parameter estimation procedure the joint-entropy (JE) update. To summarize, the JE update is composed of the following alternating steps: we first calculate for each observation x the value \beta_i(x) = P(x \mid \theta_i) / P(x \mid \Theta), and then update the parameters as given by Eq. (11), Eq. (12), and Eq. (13). The JE update and EM differ in several aspects. First, EM uses a simpler update for the mixture weights w_i. Second, EM uses the expectations (with respect to the current parameters) of the sufficient statistics [4] for \mu_i and C_i to find new sets of mean vectors and covariance matrices. The JE update uses a (slightly different) weighted average of the observations and, in addition, it adds the old parameters. The learning rate \eta determines the proportions used in summing the old parameters and the newly estimated parameters. Last, EM estimates the covariance matrices C_i whereas the new update estimates the inverses, C_i^{-1}, of these matrices. Thus, it is potentially more stable numerically in cases where the covariance matrices are nearly singular.
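Putting the pieces together, one batch iteration of the JE update can be sketched as follows. This is our own illustrative implementation of our reading of Eqs. (11), (12), and (13), not code from the paper; note that only the inverse covariances are ever stored:

```python
import numpy as np

def je_batch_step(X, w, mus, C_invs, eta):
    """One batch JE iteration.  X: (N, d) sample; w: (m,) mixture weights;
    mus: (m, d) means; C_invs: (m, d, d) inverse covariances; eta: learning rate."""
    N, d = X.shape
    m = len(w)
    dens = np.empty((N, m))
    for i in range(m):
        diff = X - mus[i]
        quad = np.einsum('nd,de,ne->n', diff, C_invs[i], diff)
        # |C_i|^{-1/2} = |C_i^{-1}|^{1/2}, so no matrix inversion is needed.
        dens[:, i] = ((2 * np.pi) ** (-d / 2)
                      * np.linalg.det(C_invs[i]) ** 0.5 * np.exp(-0.5 * quad))
    mix = dens @ w                        # P(x|Theta) for every sample point
    beta = dens / mix[:, None]            # beta_i(x) = P(x|theta_i) / P(x|Theta)
    # Eq. (11): multiplicative (exponentiated-gradient style) weight update.
    w_new = w * np.exp(eta / N * beta.sum(axis=0))
    w_new /= w_new.sum()
    beta_t = (w / w_new) * beta           # tilde-beta_i(x)
    mus_new = np.empty_like(mus)
    C_invs_new = np.empty_like(C_invs)
    for i in range(m):
        diff = X - mus[i]
        # Eq. (12): pull the mean toward a weighted average of the sample.
        mus_new[i] = mus[i] + eta / N * beta_t[:, i] @ diff
        # Eq. (13): update the inverse covariance directly.
        v = diff @ C_invs[i]              # rows are C_i^{-1} (x - mu_i)
        outer = np.einsum('n,nd,ne->de', beta_t[:, i], v, v)
        C_invs_new[i] = C_invs[i] + eta / N * (beta_t[:, i].sum() * C_invs[i] - outer)
    return w_new, mus_new, C_invs_new
```

With eta = 0 the parameters are left unchanged, and the new weights always sum to one.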
To obtain an on-line procedure we update the parameters after each new observation. That is, rather than summing over all x \in S, for a new observation x_t we update
Figure 1: Left: comparison of the convergence rate of EM and the JE update with different
learning rates. Right: example of a case where EM initially increases the likelihood faster 
than the JE update. 
the parameters and get a new set of parameters \Theta^{t+1} using the current parameters \Theta^t. The new parameters are then used for inducing the likelihood of the next observation x_{t+1}. The on-line parameter estimation procedure is composed of the following steps:

1. Set: \beta_i(x_t) = \frac{P(x_t \mid \theta_i)}{P(x_t \mid \Theta)} .

2. Parameter updates:

(a) w_i \leftarrow w_i \exp(\eta \beta_i(x_t)) \big/ \sum_{j=1}^m w_j \exp(\eta \beta_j(x_t))

(b) \mu_i \leftarrow \mu_i + \eta \tilde\beta_i(x_t) \, (x_t - \mu_i)

(c) C_i^{-1} \leftarrow C_i^{-1} + \eta \tilde\beta_i(x_t) \big( C_i^{-1} - C_i^{-1} (x_t - \mu_i)(x_t - \mu_i)^T C_i^{-1} \big)

where \tilde\beta_i(x_t) = (w_i / \tilde w_i) \, \beta_i(x_t), with w_i the weight before step (a) and \tilde w_i the weight after it. To guarantee convergence of the on-line update one should use a diminishing learning rate, that is, \eta_t \to 0 as t \to \infty (for further motivation see [11]).
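The per-observation procedure above can be sketched directly (again our own illustrative code, mirroring the batch update with a single observation):

```python
import numpy as np

def je_online_step(x, w, mus, C_invs, eta):
    """Process one observation x with the on-line JE update (steps 1 and 2).
    Returns new (w, mus, C_invs) without modifying the inputs."""
    m, d = mus.shape
    dens = np.empty(m)
    for i in range(m):
        diff = x - mus[i]
        dens[i] = ((2 * np.pi) ** (-d / 2) * np.linalg.det(C_invs[i]) ** 0.5
                   * np.exp(-0.5 * diff @ C_invs[i] @ diff))
    beta = dens / (w @ dens)            # step 1: beta_i(x_t)
    w_new = w * np.exp(eta * beta)      # step 2(a)
    w_new /= w_new.sum()
    beta_t = (w / w_new) * beta         # tilde-beta_i(x_t)
    mus_new = np.empty_like(mus)
    C_invs_new = np.empty_like(C_invs)
    for i in range(m):
        diff = x - mus[i]
        mus_new[i] = mus[i] + eta * beta_t[i] * diff            # step 2(b)
        v = C_invs[i] @ diff
        C_invs_new[i] = C_invs[i] + eta * beta_t[i] * (C_invs[i]
                                                       - np.outer(v, v))  # 2(c)
    return w_new, mus_new, C_invs_new
```

In a non-stationary setting one would call this with a slowly diminishing sequence of learning rates \eta_t, as discussed above.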
6 Experiments 
We conducted numerous experiments with the new update. Due to the lack of space we de- 
scribe here only two. In the first experiment we compared the JE update and EM in batch 
settings. We generated data from Gaussian mixture distributions with varying number of 
components (m = 2 to 100) and dimensions (d = 2 to 20). Due to the lack of space 
we describe here results obtained from only one setting. In this setting the examples were generated by a mixture of 5 components with w = (0.4, 0.3, 0.2, 0.05, 0.05). The mean vectors were the 5 standard unit vectors in the Euclidean space \mathbb{R}^5 and we set all of the covariance matrices to the identity matrix. We generated 1000 examples. We then ran EM and the JE update with different learning rates (\eta = 1.9, 1.5, 1.1, 1.05). To make sure that all the runs would end in the same local maximum we first performed three EM iterations. The results are shown on the left hand side of Figure 1. In this setting, the JE update with high learning rates achieves much faster convergence than EM. We would like to note that this behavior is by no means esoteric: most of our experiments yielded similar results.
We found a different behavior in low dimensional settings. On the right hand side of Figure 1 we show convergence rate results for a mixture containing two components, each of which is a one-dimensional Gaussian. The means of the two components were located at 1 and -1, both with variance 2. Thus, there is a significant "overlap" between the two Gaussians constituting the mixture. The mixture weight vector was (0.5, 0.5). We generated 50 examples according to this distribution and initialized the parameters as follows: \mu_1 = 0.01, \mu_2 = -0.01, \sigma_1^2 = \sigma_2^2 = 2, w_1 = w_2 = 0.5. We see that initially
EM increases the likelihood much faster than the JE update. Eventually, the JE update converges faster than EM when using a small learning rate (in the example appearing in Figure 1 we set \eta = 1.05). However, in this setting, the JE update diverges when learning rates larger than \eta = 1.1 are used. This behavior underscores the advantages of both methods. EM uses a fixed learning rate and is guaranteed to converge to a local maximum of the likelihood, under conditions that typically hold for mixtures of Gaussians [4, 12]. The JE update, on the other hand, incorporates a learning rate and in many settings it converges much faster than EM. However, the superior performance in high dimensional cases comes at a price in low dimensional "dense" cases: a very conservative learning rate, which is hard to tune, needs to be used. In these cases, EM is a better alternative, offering almost the same convergence rate without the need to tune any parameters.
Acknowledgments Thanks to Duncan Herring for careful proofreading and providing
us with interesting data sets. 
References 
[1] E. Bauer, D. Koller, and Y. Singer. Update rules for parameter estimation in Bayesian
networks. In Proc. of the 13th Annual Conf. on Uncertainty in AI, pages 3-13, 1997.
[2] C.M. Bishop. Neural Networks and Pattern Recognition. Oxford Univ. Press, 1995. 
[3] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991.
[4] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum-likelihood from incomplete 
data via the EM algorithm. Journal of the Royal Statistical Society, B39:1-38, 1977. 
[5] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973. 
[6] D.P. Helmbold, J. Kivinen, and M.K. Warmuth. Worst-case loss bounds for sigmoided
neurons. In Advances in Neural Information Processing Systems 7, pages 309-315, 1995.
[7] D.P. Helmbold, R.E. Schapire, Y. Singer, and M.K. Warmuth. A comparison of new
and old algorithms for a mixture estimation problem. Machine Learning, 27(1):97-119, 1997.
[8] J. Kivinen and M.K. Warmuth. Additive versus exponentiated gradient updates for 
linear prediction. Information and Computation, 132(1):1-64, January 1997.
[9] J. Kivinen and M.K. Warmuth. Relative loss bounds for multidimensional regression 
problems. In Advances in Neural Information Processing Systems 10, 1997. 
[10] R.A. Redner and H.F. Walker. Mixture densities, maximum likelihood and the EM
algorithm. SIAM Review, 26(2), 1984.
[11] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite
Mixture Distributions. Wiley, 1985.
[12] C.F.J. Wu. On the convergence properties of the EM algorithm. Annals of Statistics,
11:95-103, 1983.
