Deterministic Annealing Variant 
of the EM Algorithm 
Naonori Ueda Ryohei Nakano 
ueda@cslab.kecl.ntt.j p nakano@cslab.kecl.ntt.j p 
NTT Communication Science Laboratories 
Hikaridai, Seika-cho, Soraku-gun, 
Kyoto 619-02 Japan 
Abstract 
We present a deterministic annealing variant of the EM algorithm 
for maximum likelihood parameter estimation problems. In our 
approach, the EM process is reformulated as the problem of min- 
imizing the thermodynamic free energy by using the principle of 
macimum entropy and statistical mechanics analogy. Unlike simu- 
lated annealing approaches, this minimization is deterministically 
performed. Moreover, the derived algorithm, unlike the conven- 
tional EM algorithm, can obtain better estimates free of the initial 
parameter values. 
1 INTRODUCTION 
The Expectation-Maximization (EM) algorithm (Dempster, Laird & Rubin, 1977) 
is an iterative statistical technique for computing maximum likelihood parameter 
estimates from incomplete data. It has generally been employed to a wide variety 
of parameter estimation problems. Recently, the EM algorithm has also been suc- 
cessfully employed as the learning algorithm of the hierarchical mixture of experts 
(Jordan & Jacobs, 1993). In addition, it has been found to have some relationship 
to the learning of the Boltzmann machines (Byrne, 1992). 
This algorithm has attractive features such as reliable global convergence, low cost 
per iteration, economy of storage, and ease of programming, but it is not free from 
problems in practice. The serious practical problem associated with the algorithm 
546 Naonori Ueda, Ryohei Nakano 
is the local maxima problem. This problem makes the performance dependent on 
the initial parameter value. Indeed, the EM algorithm should be performed from 
as wide a choice of starting values as possible according to some ad hoc criterion. 
To overcome this problem, we adopt the principle of statistical mechanics. Namely, 
by using the principle of maximum entropy, the thermodynamic free energy is de- 
fined as an effective cost function that depends on the temperature. The maxi- 
mization of log-likelihood is done by minimizing the cost function. Unlike simulated 
anneMing (Geman & Geman, 1984) where stochastic search is performed on the 
given energy surface, this cost function is deterministically optimized at each tem- 
perature. 
Such deterministic annealing (DA) approach has been successfully adopted for vec- 
tor quantization or clustering problems (Rose et al., 1992; Buhmann et al., 1993; 
Wong, 1993). Recently, Yuile et al.(Yuile, Stolorz, &; Utans, 1994) have shown that 
the EM algorithm can be used in conjunction with the DA. In our previous paper, 
independent of Yuile's work, we presented a new EM algorithm with DA for mixture 
density estimation problems (Ueda & Nakano, 1994). The aim of this paper is to 
generalize our earlier work and to derive a DA variant of the general EM algorithm. 
Since the EM algorithm can be used not only for the mixture estimation problems 
but also for other parameter estimation problems, this generalization is expected to 
be of value in practice. 
2 GENERAL THEORY OF THE EM ALGORITHM 
Suppose that a measure space Y of "unobservable data" exists corresponding to a 
measure space , of "observable data". An observable data sample (6 ,) with 
density p(; 61) is called incomplete and (, y) with joint density p(, y; 61) is called 
complete, where y is an unobservable data sample 1 corresponding to . Note that 
 6 7 n and y 6 7 m. 61 is parameter of the density distribution to be estimated. 
Given incomplete data samples X = {aa[k = 1,...,N}, the goal of the EM al- 
gorithm is to compute the maximum-likelihood estimate of 61 that maximizes the 
following log-likelihood function: 
N 
k=l 
logp(a ;61), (1) 
by using the following complete data log-likelihood function: 
N 
k=l 
log p(z, y; 61). 
In the EM algorithm, the parameter 61 is iteratively estimated. Suppose that 61(0 
denotes the current estimate of 61 after the tth iteration of the algorithm. Then 
61(t+l) at the next iteration is determined by the following two steps: 
In such unsupervised learning as mixture problems, y reduces to an integer value 
(y  {1, 2,..., C}, where C is the number of components), indicating the component from 
which a data sample x originates. 
Deterministic Annealing Variant of the EM Algorithm 547 
E-step: 
complete data log-likelihood given X and o(t): 
Compute the Q-function defined by the conditional expectation of the 
Q(o; o(0) dej E{Lc(O;X)IX, 
M-step: Set 0 (t+) equal to O which maximizes Q(O, o(t)). 
(a) 
It has theoretically been shown that an iterative procedure for maximizing Q over 
9 will cause the likelihood L to monotonically increase, e.g., L(O (t+l)) >_ L(O(t)). 
Eventually, L(O (0) converges to a local maximum. The EM algorithm is especially 
useful when the maximization of the Q-function can be more easily performed than 
that of L. 
By substituting Eq. 2 into Eq. 3, we have 
Q(O;O (t)) = ... {logp(,ya;O)} IIr(uili;o('))au-..dy 
k:l j=l 
N 
= /{logp(a,ya;O)}p(yala;o(t))dy. 
k:l 
O that maximizes Q(O; 0 (t)) should satisfy OQ/00 = O, or equivalently, 
k f( logp(:r, y; O))p(y[a; O(O)dy: O. 
k=l 
(4) 
(5) 
Here, p(yalaa, 0 (t)) denotes the posterior and can be computed by the following 
Bayes rule: 
p(l, o(')) p(,; 
 = (6) 
f p(aa,ya; O(t))dy ' 
It can be interpreted that the missing information is estimated by using the poste- 
rior. However, because the reliability of the posterior highly depends on the param- 
eter O (t), the performance of the EM algorithm is sensitive to an initial parameter 
value (0). This has often caused the algorithm to become trapped by some local 
maxima. In the next section, we will derive a new variant of the EM algorithm as 
an attempt at global maximization of the Q-function in the EM process. 
3 DETERMINISTIC ANNEALING APPROACH 
3.1 DERIVATION OF PARAMETERIZED POSTERIOR 
Instead of the posterior given in Eq. 6, we introduce another posterior f(ya[aa). 
The function form of f will be specified later. Using f(yalaa), we consider a new 
function instead of Q, say E, defined as: 
N 
E de__f Z f {-- logp(a, y; O)}f(y la)dy. (7) 
k=l 
548 Naonori Ueda, Ryohei Nakano 
(Note: E is always nonnegative.) One can easily see that (-E) is also the condi- 
tional expectation of the complete data log-likelihood but it differs from Q in that 
the expectation is taken with respect to f(y la) instead of the posterior given by 
rq. 6. In other words, if f(y:laq:) = p(y:laq:; 61(t)), then E -- -Q. 
Since we do not have a priori knowledge about f(y: Igr:), we apply the principle of 
maximum entropy to specify it. That is, by maximizing the entropy given by: 
N 
H: - /{log f(yl)}f(y I)dy, (8) 
k--1 
with respect to f, under the constraints of Eq. 7 and f fdy = 1, we can obtain 
the following Gibbs distribution: 
1 
f(yla) = Zak exp{--/3(--logp(a,y; O))}, (9) 
where Zk = f exp{-/3(-- logp(, y; O))}dy, and is called the partition func- 
tion. The parameter/3 is the Lagrange multiplier determined by the value E. From 
an analogy of the annealing, 1//3 corresponds to the "temperature". 
By simplifying Eq. 9, we obtain a new posterior parameterized by/3, 
P(a' Y}; O) (10) 
f(y - f Y}; 61) dye' 
Clearly, when/3 = 1, reduces to the original posterior given in Eq. 6. The 
effect of/3 will be explained later. 
Since z,...,zv are identically and independently distributed observations, the 
partition function Z(61) for X becomes tin z. Therefore, 
N 
Z(61) = 1-I/p(y,y;61)dy. (11) 
Once the partition function is obtained explicitly, using statistical mechanics anal- 
ogy, we can define the free energy as an effective cost function that depends on the 
temperature: 
F(61) der 1 log Z(61) 
= 
N 
i / 
:-y.log p(a;,y;61)du. (12) 
k---1 
At equilibrium, it is well known that a thermodynamic system settles into a config- 
uration that minimizes its free energy. Hence, 61 should satisfy 0F(61)/061 = O. 
It follows that 
k/{-6611ogp(a;,y;61)}f(y]a;)dy =0. (13) 
Interestingly, we have arrived at the same equation as the result of the maximization 
of the Q-function, except that the posterior p(y:laq:; 61(0) in Eq. 5 is replaced by 
Deterministic Annealing Variant of the EM Algorithm 5 4 9 
3.2 ANNEALING VARIANT OF THE EM ALGORITHM 
Let Qp(61; 61(0) be the expectation of the complete data log-likelihood by the pa- 
rameterized posterior f(y: la:). Then, the following deterministic annealing variant 
of the EM algorithm can be naturally derived to maximize -Fs(61). 
[Annealing EM (AEM) algorithm] 
1. Set/3 -/3,in (0 </3,in << 1). 
2. Arbitrarily choose an initial estimate 61 . Set t - 0. 
3. Iterate the following two steps until convergence2: 
E-step: Compute 
N 
k=l 
' Ya; 61)} f p(aa, ya; 
M-step: Set 61(t+) equal to 61 which maximizes Q(61; 61(0). 
4. Increase/3. 
5. If/3 </3,,o, set t - t -I- 1, and repeat from step 3; otherwise stop. 
(14) 
One can see that in the proposed algorithm, an outer loop is added to the original 
EM algorithm for the annealing process. An important distinction to keep in mind 
is that unlike simulated annealing, the optimization in step 3 is deterministically 
performed at each/3. Now let's consider the effect of the posterior parameterization 
of Eq. 10. The annealing process begins at small/3 (high temperature). Clearly, 
at this time, since f(yl,e) becomes uniform,-Fp(61) has only one global maxi- 
mum. Hence, the maximum can be easily found. Then by gradually increasing/3, 
the influence of each a is gradually localized. At /3 > 0, function Qs will have 
several local maxima. However, at each step, it can be assumed that the new global 
maximum is close to the previous one. Hence, by this assumption, the algorithm 
can track the global maximum at each/3 while increasing/3. Clearly, when/3 = 1 
the parameterized posterior coincides with the original one. Moreover, noting that 
-F(61) = L(61),/3,,o ought be one. 
4 Demonstration 
To visualize how the proposed algorithm works, we consider a simple one- 
dimensional, two-component normal mixture problem. The mixture is given by 
p(x;m,m2) : 0.37 exp{-(x- m/2 } + 0.7exp{-(x- m)}. In this 
case, 61 = (m,m2), y:  {1,2}, and therefore, the joint density p(x, 1;m,m2) = 
0.3 exp{-(x - m)2}, while p(x, 2; m, rna) = 0.7 exp{-(x - rn2)2}. 
One hundred samples in total were generated from this mixture with ml = -2 
and rn2 = 2. Figure 1 shows contour plots of the -F(m, ra2)/N surface. It is 
interesting to see how F(m, m2) varies with/3. One can see that a finer and truer 
structure emerges by increasing/3. Note that as explained before, when /3 = 1, 
2When the sequence converges to a saddle point (e.g., when the Hessian matrix of 
-F(O) has at least one positive eigen value), a local hue search in the direction of the 
eigen vector corresponding to the largest eigen value should be performed to escape the 
solution from the saddle point. 
550 Naonori Ueda, Ryohei Nakano 
4 
2 
0 
-2 
-4 
4 
2 
0 
-2 
-4 
-4 -2 0 2 4 
/Tit 
(a) [1--0.1 
-4 -2 0 2 4 
(d) [l = 0.5 
-4 -2 0 2 4 
(b) [l = 0.25 
-4 -2 0 2 4 
(e) I = 0.7 
-4 -2 0 2 4 
/Tit 
(c) p = 0.3 
-4 -2 0 2 4 
/fil 
(f) p=l 
Figure 1' Contour plots of -F(mbm2)/N surface. 
C+" denotes a global maximum at each 
-4 -2 0 2 4 
ml 
t 0) 0) (4, - 1) 
-4 -2 0 2 4 
/fil 
,, ,, ==(-2,-4) 
Figure 2: Trajectories for EM and AEM procedures. 
Deterministic Annealing Variant of the EM Algorithm 551 
-F(m,m2) = L(m, m2). The maximum of-F (or L) occurs at rh = -2.0 and 
rh2 = 1.9. As initial points, (rh?), rh? )) = (4,-1), (-2,-4) were tested. Note that 
these points are close to a local maximum. Figure 2 shows how both algorithms 
converge from these starting points a. 
The original EM algorithm converges to (rh?),rh?)) = (2.002,-1.687)(Figure 
2(a)) and to 9)) = (1.999,-1.696)(Figure 2(b)). In contrast, the pro- 
posed AEM algorithm converges to (rh? s), rh? a)) = (-2.022, 1.879)(Figure 2(a)) 
and (h?a),rh? a)) = (-2.022, 1.879)(Figure 2(b)). Since these starting points are 
close to the local maximum point, the original EM algorithm becomes trapped by 
the local maximum point, while the proposed algorithm successfully converges to 
the global maximum in both cases. 
5 Discussion 
A new view of the EM algorithm has recently been presented (Hathaway, 1986; Neal 
& Hinton, 1993). This view states that the EM steps can be regarded as a grouped 
version of the method of coordinate ascent of the following objective function: 
N N 
j de__ E Er(Yk){lgp('Y; 61)} -- E Er(Yk){lgp(Y)} 
(15) 
k k 
over p(ya) and 61. That is, the E-step corresponds to the maximization of J with 
respect to p(ya) with fixed 61, while the M-step corresponds to that of J with 
respect to 61 with fixed p(y}). Neal & Hinton mentioned that apart from the sign 
change, J is analogous to the "free energy" well known in statistical physics. 
It is worth mentioning another interpretation of the derived parameterized posterior; 
i.e., it plays a central role in the AEM algorithm. Taking the logarithm on both 
sides of Eq. 10, we have 
] 1 
-11og p(za,Ua;61)dUa=-logp(za,Ua;61)+logj'(Ua]za). (16) 
Moreover, taking the conditional expectation 4 given a}, summing over k, and using 
Eq. 12, we have 
N 1 N 
F(61) = E El(Y,lZ,){-lgp(z'Y; 61)}+ E El(Y, lZ,){lgf(Y}lZ})}' (17) 
k'-I k=l 
From Eqs. 7 and 8, Eq. 17 can be rewritten as F(61) = E- H. Noting that 1// 
corresponds to the "temperature", one can see that this expression exactly agrees 
with the free energy, while Eq. 15 is completely without the "temperature". In 
other words, F(61) can be interpreted as an annealing variant of-J. Clearly, 
when/ = 1, F(61) _= -J. 
SAlthough the AEM procedure was actually performed for successive / (/,, -/ x 
1.4), the results are superimposed on -F(m, m)/N surface for convenience. 
Since the LHS of Eq. 16 is independent of y, it does not change after expectation. 
552 Naonori Ueda, Ryohei Nakano 
The proposed algorithm is also applicable to the learning of the (Generalized) Ra- 
dial Basis Function (RBF, GRBF) networks. Indeed, Nowlan (Nowlan, 1990), for 
instance, proposes a maximum likelihood competitive learning algorithm for the 
RBF networks. In his study, "soft competition" and "hard competition" are ex- 
perimentally compared and it is shown that the soft competition can give better 
performance. In our algorithm, on the other hand, the soft model exactly corre- 
sponds to the case 3 = 1, while the hard model corresponds to the case 3 -- cx. 
Consequently, both models can be regarded as special cases in our algorithm. 
Acknowledgement s 
We would like to thank Dr. Tsukasa Kawaoka, NTT CS labs, for his encouragement. 
References 
J. Buhmann & H. Kuhnel. (1993) Complexity optimized data clustering by com- 
petitive neural networks. Neural Computation, 5:75-88. 
W. Byrne. (1992) Alternating minimization and Boltzmann machine learning. 
IEEE Trans. Neural Networks, 3:612-620. 
A. P. Dempster, N.M. Laird & D. B. Rubin. (1977) Maximum-likelihood from 
incomplete data via the EM algorithm. J. Royal Statist. Soc. Set. B (methodolog- 
ical), 39:1-38. 
S. Geman & D. Geman. (1984) Stochastic relaxation, Gibbs distribution and the 
Baysian restortion in images. IEEE Trans. Pattern Anal. Machine Intell., 6,6:721- 
741. 
R. J. Hathaway. (1986) Another interpretation of the EM algorithm for mixture 
distributions. Statistics  Probability Letters, 4: 53-56. 
M. I. Jordan & R. A. Jacob. (1993) Hierarchical mixtures of experts and the EM 
algorithm. MIT Dept. of Brain and Cognitive Science preprint. 
R. M. Neal & G. E. Hinton. (1993) A new view of the EM algorithm that justifies 
incremental and other variants. submitted to Biometrika. 
S. J. Nowlan. (1990) Maximum likelihood competitve learning. in D. S. Touretzky 
et al. eds., Advances in Neural Information Systems 2, Morgan Kaufmann. 574-582. 
K. Rose, E. Gurewitz & G. C. Fox. (1992) Vector quantization by deterministic 
annealing. IEEE Trans. Information Theory, 38,4:1249-1257. 
N. Ueda & R. Nakano. (1994) Mixture density estimation via EM algorithm with 
deterministic annealing. in proc. IEEE Neural Networks for Signal Processing, 
69-77. 
Y. Wong. (1993) Clustering data by melting. Neural Computation, 5:89-104. 
A. L. Yuille, P. Stolorz & J. Utans. (1994) Statistical physics, mixtures of distribu- 
tions, and the EM algorithm. Neural Computation, 6:334-340. 
