Convergence of The Wake-Sleep Algorithm 
Shiro Ikeda 
PRESTO, JST 
Wako, Saitama, 351-0198, Japan 
shiro@brain.riken.go.jp 
Shun-ichi Amari 
RIKEN Brain Science Institute 
Wako, Saitama, 351-0198, Japan 
amari@brain.riken.go.jp 
Hiroyuki Nakahara 
RIKEN Brain Science Institute 
hirobrain.riken.go.jp 
Abstract 
The W-S (Wake-Sleep) algorithm is a simple learning rule for models with hidden
variables. It has been shown that this algorithm can be applied to a factor analysis
model, which is a linear version of the Helmholtz machine. But even for a factor
analysis model, general convergence has not been proved theoretically. In this article,
we describe a geometrical understanding of the W-S algorithm in contrast with the EM
(Expectation-Maximization) algorithm and the em algorithm. As a result, we prove the
convergence of the W-S algorithm for the factor analysis model. We also show the
condition for convergence in general models.
1 INTRODUCTION 
The W-S algorithm[5] is a simple Hebbian learning algorithm. Neal and Dayan applied the
W-S algorithm to a factor analysis model[7]. This model can be seen as a linear version of
the Helmholtz machine[3]. As mentioned in [7], the convergence of the W-S algorithm
has not been proved theoretically even for this simple model.
From the similarity of the W-S and the EM algorithms, and also from empirical results, the
W-S algorithm seems to work for a factor analysis model. But there is an essential
difference between the W-S and the EM algorithms. In this article, we show the em
algorithm[2], which is the information geometrical version of the EM algorithm, and
describe the essential difference. From the result, we show that the similarity alone
cannot explain why the W-S algorithm works. However, even with this difference, the W-S
algorithm works on the factor analysis model, and we can prove it theoretically. We show
the proof and also show the condition under which the W-S algorithm works in general models.
2 FACTOR ANALYSIS MODEL AND THE W-S ALGORITHM 
A factor analysis model with a single factor is defined as the following generative model,

Generative model: $x = \mu + y g + \epsilon$,

where $x = (x_1, \ldots, x_n)^T$ is an $n$-dimensional real-valued visible input,
$y \sim \mathcal{N}(0, 1)$ is the single invisible factor, $g$ is a vector of "factor
loadings", $\mu$ is the overall mean vector, which is set to zero in this article, and
$\epsilon \sim \mathcal{N}(0, \Sigma)$ is the noise with a diagonal covariance matrix,
$\Sigma = \mathrm{diag}(\sigma_i^2)$. In a Helmholtz machine, this generative model is
accompanied by a recognition model, which is defined as,

Recognition model: $y = r^T x + \delta$,

where $r$ is the vector of recognition weights and $\delta \sim \mathcal{N}(0, s^2)$ is
the noise.
When data $x_1, \ldots, x_N$ are given, we want to find the MLE (Maximum Likelihood
Estimator) of $g$ and $\Sigma$. The W-S algorithm can be applied[7] to learn this model.
Wake-phase: From the training set $\{x_s\}$, choose a number of $x$ randomly and, for each
data point, generate $y$ according to the recognition model $y = r_t^T x + \delta$,
$\delta \sim \mathcal{N}(0, s_t^2)$. Update $g$ and $\Sigma$ as follows using these $x$'s
and $y$'s, where $\alpha$ is a small positive number and $\beta$ is slightly less than 1.

$$g_{t+1} = g_t + \alpha \langle (x - g_t y)\, y \rangle \tag{1}$$

$$\sigma_{i,t+1}^2 = \beta \sigma_{i,t}^2 + (1 - \beta) \langle (x_i - g_{i,t}\, y)^2 \rangle \tag{2}$$

where $\langle \cdot \rangle$ denotes averaging over the chosen data.

Sleep-phase: According to the updated generative model $x = y g_{t+1} + \epsilon$,
$y \sim \mathcal{N}(0, 1)$, $\epsilon \sim \mathcal{N}(0, \mathrm{diag}(\sigma_{t+1}^2))$,
generate a number of $x$ and $y$, and update $r$ and $s^2$ as,

$$r_{t+1} = r_t + \alpha \langle (y - r_t^T x)\, x \rangle \tag{3}$$

$$s_{t+1}^2 = \beta s_t^2 + (1 - \beta) \langle (y - r_t^T x)^2 \rangle. \tag{4}$$

By iterating these phases, the algorithm tries to find the MLE as the converged point.
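The two phases can be sketched in a few lines of code. The following is a minimal illustration of updates (1)-(4), not the authors' implementation: the function name `wake_sleep_step`, the batch size, the default values of `alpha` and `beta`, and the synthetic data are all assumptions made for this example.

```python
import random

def wake_sleep_step(data, g, sigma2, r, s2, alpha=0.01, beta=0.99,
                    batch=50, rng=random):
    """One wake-phase followed by one sleep-phase, following (1)-(4)."""
    n = len(g)
    # Wake-phase: sample data and draw y from the recognition model.
    dg = [0.0] * n
    dsig = [0.0] * n
    for _ in range(batch):
        x = rng.choice(data)
        y = sum(ri * xi for ri, xi in zip(r, x)) + rng.gauss(0.0, s2 ** 0.5)
        for i in range(n):
            err = x[i] - g[i] * y
            dg[i] += err * y / batch        # <(x - g_t y) y>
            dsig[i] += err * err / batch    # <(x_i - g_{i,t} y)^2>
    g = [g[i] + alpha * dg[i] for i in range(n)]                          # (1)
    sigma2 = [beta * sigma2[i] + (1 - beta) * dsig[i] for i in range(n)]  # (2)
    # Sleep-phase: draw (x, y) from the updated generative model.
    dr = [0.0] * n
    ds = 0.0
    for _ in range(batch):
        y = rng.gauss(0.0, 1.0)
        x = [g[i] * y + rng.gauss(0.0, sigma2[i] ** 0.5) for i in range(n)]
        err = y - sum(ri * xi for ri, xi in zip(r, x))
        for i in range(n):
            dr[i] += err * x[i] / batch     # <(y - r_t^T x) x>
        ds += err * err / batch             # <(y - r_t^T x)^2>
    r = [r[i] + alpha * dr[i] for i in range(n)]                          # (3)
    s2 = beta * s2 + (1 - beta) * ds                                      # (4)
    return g, sigma2, r, s2

# Hypothetical usage on synthetic single-factor data (values are illustrative).
rng = random.Random(0)
true_g = [1.0, 0.5, -0.5]
data = []
for _ in range(200):
    y = rng.gauss(0.0, 1.0)
    data.append([gi * y + rng.gauss(0.0, 0.5) for gi in true_g])
g, sigma2, r, s2 = [0.1] * 3, [1.0] * 3, [0.0] * 3, 1.0
for _ in range(200):
    g, sigma2, r, s2 = wake_sleep_step(data, g, sigma2, r, s2, rng=rng)
```

Note that (2) and (4) are convex combinations of a positive value and a non-negative average, so the variances stay positive throughout the iteration.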
For the following discussion, let us define two probability densities p and q, where p is the 
density of the generative model, and q is that of the recognition model. 
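Let $\theta = (g, \Sigma)$; the generative model gives the density function of $x$ and $y$ as,

$$p(y, x; \theta) = \exp\left( -\frac{1}{2} \begin{pmatrix} y & x^T \end{pmatrix} A \begin{pmatrix} y \\ x \end{pmatrix} - \psi(\theta) \right) \tag{5}$$

$$A = \begin{pmatrix} 1 + g^T \Sigma^{-1} g & -g^T \Sigma^{-1} \\ -\Sigma^{-1} g & \Sigma^{-1} \end{pmatrix}, \qquad \psi(\theta) = \frac{1}{2} \left( \sum_i \log \sigma_i^2 + (n + 1) \log 2\pi \right),$$

while the recognition model gives the distribution of $y$ conditional on $x$ as the following,

$$q(y \mid x; \eta) \sim \mathcal{N}(r^T x, s^2),$$

where $\eta = (r, s^2)$. From the data $x_1, \ldots, x_N$, we define,

$$C = \frac{1}{N} \sum_{s=1}^{N} x_s x_s^T, \qquad q(x) \sim \mathcal{N}(0, C).$$

With this $q(x)$, we define $q(y, x; \eta)$ as,

$$q(y, x; \eta) = q(x)\, q(y \mid x; \eta) = \exp\left( -\frac{1}{2} \begin{pmatrix} y & x^T \end{pmatrix} B \begin{pmatrix} y \\ x \end{pmatrix} - \psi(\eta) \right) \tag{6}$$

$$B = \frac{1}{s^2} \begin{pmatrix} 1 & -r^T \\ -r & s^2 C^{-1} + r r^T \end{pmatrix}, \qquad \psi(\eta) = \frac{1}{2} \left( \log s^2 + \log |C| + (n + 1) \log 2\pi \right).$$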
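As a check on the quadratic form of the generative density, the exponent can be expanded from the product of the factor prior and the conditional density of $x$ given $y$. The step below is a sketch added for clarity and uses only the definitions of the generative model given above.

```latex
\begin{aligned}
\log\bigl( p(y)\, p(x \mid y) \bigr)
 &= -\tfrac{1}{2} y^2 - \tfrac{1}{2} (x - y g)^T \Sigma^{-1} (x - y g) - \psi(\theta) \\
 &= -\tfrac{1}{2} \Bigl[ (1 + g^T \Sigma^{-1} g)\, y^2
      - 2 y\, g^T \Sigma^{-1} x + x^T \Sigma^{-1} x \Bigr] - \psi(\theta).
\end{aligned}
```

The bracket is the quadratic form of the stacked vector $(y, x^T)^T$ whose blocks are $1 + g^T \Sigma^{-1} g$, $-g^T \Sigma^{-1}$, $-\Sigma^{-1} g$ and $\Sigma^{-1}$. The same expansion applied to $q(x)\, q(y \mid x; \eta)$ gives the blocks of the recognition-side matrix.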
3 THE EM AND THE em ALGORITHMS FOR A FACTOR ANALYSIS MODEL
It has been noted[5][7] that the W-S algorithm is similar to the EM algorithm[4]. But there
is an essential difference between them. In this section, we first show the EM algorithm.
We also describe the em algorithm[2], which gives us the information geometrical
understanding of the EM algorithm. With these results, we will show the difference between
the W-S and the EM algorithms in the next section.
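The EM algorithm consists of the following two steps.

E-step: Define $Q(\theta, \theta_t)$ as,

$$Q(\theta, \theta_t) = \frac{1}{N} \sum_{s=1}^{N} E_{p(y \mid x_s; \theta_t)} \left[ \log p(y, x_s; \theta) \right].$$

M-step: Update $\theta$ as,

$$\theta_{t+1} = \operatorname*{argmax}_{\theta} Q(\theta, \theta_t),$$

$$g_{t+1} = \frac{(1 + g_t^T \Sigma_t^{-1} g_t)\, C \Sigma_t^{-1} g_t}{g_t^T \Sigma_t^{-1} C \Sigma_t^{-1} g_t + 1 + g_t^T \Sigma_t^{-1} g_t}, \qquad \Sigma_{t+1} = \mathrm{diag}\left( C - g_{t+1} \frac{g_t^T \Sigma_t^{-1} C}{1 + g_t^T \Sigma_t^{-1} g_t} \right). \tag{7}$$

$E_p[\cdot]$ denotes taking the average with respect to the probability distribution $p$.
The iteration of these two steps converges to give the MLE.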
The EM algorithm only uses the generative model, but the em algorithm[2] also uses the
recognition model. The em algorithm consists of the e- and m-steps, which are defined as
the e- and m-projections[1] between the two manifolds $M$ and $D$. The manifolds are
defined as follows.

Model manifold $M$: $M \stackrel{\mathrm{def}}{=} \{ p(y, x; \theta) \mid \theta = (g, \mathrm{diag}(\sigma_i^2)),\ g \in \mathbb{R}^n,\ 0 < \sigma_i < \infty \}$.

Data manifold $D$: $D \stackrel{\mathrm{def}}{=} \{ q(y, x; \eta) \mid \eta = (r, s^2),\ r \in \mathbb{R}^n,\ 0 < s < \infty \}$.
Here $q(x)$ includes the matrix $C$, which is defined by the data, so $D$ is called the
"data manifold".
Figure 1: Information geometrical understanding of the em algorithm
Figure 1 schematically shows the em algorithm. It consists of two steps, the e- and
m-steps. On each step, the parameters of the recognition and generative models are
updated, respectively.
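e-step: Update $\eta$ as the e-projection of $p(y, x; \theta_t)$ on $D$.

$$\eta_{t+1} = \operatorname*{argmin}_{\eta} KL(q(\eta), p(\theta_t)) \tag{8}$$

$$r_{t+1} = \frac{\Sigma_t^{-1} g_t}{1 + g_t^T \Sigma_t^{-1} g_t}, \qquad s_{t+1}^2 = \frac{1}{1 + g_t^T \Sigma_t^{-1} g_t}, \tag{9}$$

where $KL(q(\eta), p(\theta))$ is the Kullback-Leibler divergence defined as,

$$KL(q(\eta), p(\theta)) = E_{q(y, x; \eta)} \left[ \log \frac{q(y, x; \eta)}{p(y, x; \theta)} \right].$$

m-step: Update $\theta$ as the m-projection of $q(y, x; \eta_{t+1})$ on $M$.

$$\theta_{t+1} = \operatorname*{argmin}_{\theta} KL(q(\eta_{t+1}), p(\theta)) \tag{10}$$

$$g_{t+1} = \frac{C r_{t+1}}{s_{t+1}^2 + r_{t+1}^T C r_{t+1}}, \qquad \Sigma_{t+1} = \mathrm{diag}\left( C - g_{t+1} r_{t+1}^T C \right). \tag{11}$$

By substituting (9) for $r_{t+1}$ and $s_{t+1}^2$ in (11), it is easily proved that (11) is
equivalent to (7), and the em and EM algorithms are equivalent.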
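The equivalence of (7) with the composition of (9) and (11) can also be checked numerically. The sketch below is an illustration under assumptions: the helper names (`matvec`, `dot`, `em_update`, `e_step`, `m_step`) and the example matrix `C` are hypothetical, chosen only to exercise the formulas for $n = 3$.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def em_update(C, g, sigma2):
    """EM update (7), written directly in terms of g_t and Sigma_t."""
    n = len(g)
    sg = [g[i] / sigma2[i] for i in range(n)]   # Sigma_t^{-1} g_t
    kappa = 1.0 + dot(g, sg)                    # 1 + g_t^T Sigma_t^{-1} g_t
    Csg = matvec(C, sg)                         # C Sigma_t^{-1} g_t
    denom = dot(sg, Csg) + kappa
    g_new = [kappa * c / denom for c in Csg]
    # C is symmetric, so (g_t^T Sigma_t^{-1} C)_i equals Csg[i].
    sigma2_new = [C[i][i] - g_new[i] * Csg[i] / kappa for i in range(n)]
    return g_new, sigma2_new

def e_step(g, sigma2):
    """e-step (9): recognition parameters from the generative ones."""
    sg = [gi / si for gi, si in zip(g, sigma2)]
    kappa = 1.0 + dot(g, sg)
    return [v / kappa for v in sg], 1.0 / kappa

def m_step(C, r, s2):
    """m-step (11): generative parameters from the recognition ones."""
    Cr = matvec(C, r)
    denom = s2 + dot(r, Cr)
    g_new = [c / denom for c in Cr]
    sigma2_new = [C[i][i] - g_new[i] * Cr[i] for i in range(len(r))]
    return g_new, sigma2_new

# Illustrative symmetric positive definite matrix standing in for the data matrix C.
C = [[1.5, 0.8, 0.6], [0.8, 1.04, 0.48], [0.6, 0.48, 0.66]]
g, sigma2 = [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]

g_em, sig_em = em_update(C, g, sigma2)      # one EM iteration via (7)
r, s2 = e_step(g, sigma2)                   # one em iteration via (9) ...
g_pm, sig_pm = m_step(C, r, s2)             # ... followed by (11)
max_gap = max(abs(a - b) for a, b in zip(g_em + sig_em, g_pm + sig_pm))
```

Here `max_gap` measures the largest difference between the two routes; it should vanish up to floating-point rounding.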
4 THE DIFFERENCE BETWEEN THE W-S AND THE EM ALGORITHMS
The wake-phase corresponds to a gradient flow of the M-step[7] in the stochastic sense.
But the sleep-phase is not a gradient flow of the E-step. In order to see this clearly, we
show the details of the W-S phases in this section.
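First, we show the averages of (1), (2), (3), and (4),

$$g_{t+1} = g_t - \alpha (s_t^2 + r_t^T C r_t) \left( g_t - \frac{C r_t}{s_t^2 + r_t^T C r_t} \right) \tag{12}$$

$$\Sigma_{t+1} = \Sigma_t - (1 - \beta) \left( \Sigma_t - \mathrm{diag}\left( C - 2 C r_t g_t^T + (s_t^2 + r_t^T C r_t)\, g_t g_t^T \right) \right) \tag{13}$$

$$r_{t+1} = r_t - \alpha \left( \Sigma_{t+1} + g_{t+1} g_{t+1}^T \right) \left( r_t - \frac{\Sigma_{t+1}^{-1} g_{t+1}}{1 + g_{t+1}^T \Sigma_{t+1}^{-1} g_{t+1}} \right) \tag{14}$$

$$s_{t+1}^2 = s_t^2 - (1 - \beta) \left( s_t^2 - \left( (1 - g_{t+1}^T r_t)^2 + r_t^T \Sigma_{t+1} r_t \right) \right). \tag{15}$$

As the K-L divergence is rewritten as $KL(q(\eta), p(\theta))$,

$$KL(q(\eta), p(\theta)) = \frac{1}{2} \mathrm{tr}(B^{-1} A) - \frac{n + 1}{2} + \psi(\theta) - \psi(\eta),$$

the derivatives of this K-L divergence with respect to $\theta = (g, \Sigma)$ are,

$$\frac{\partial}{\partial g} KL(q(\eta), p(\theta)) = 2 (s^2 + r^T C r)\, \Sigma^{-1} \left( g - \frac{C r}{s^2 + r^T C r} \right) \tag{16}$$

$$\frac{\partial}{\partial \Sigma} KL(q(\eta), p(\theta)) = \Sigma^{-2} \left( \Sigma - \mathrm{diag}\left( C - 2 C r g^T + (s^2 + r^T C r)\, g g^T \right) \right). \tag{17}$$

With these results, we can rewrite the wake-phase as,

$$g_{t+1} = g_t - \frac{\alpha}{2} \Sigma_t \frac{\partial}{\partial g_t} KL(q(\eta_t), p(\theta_t)) \tag{18}$$

$$\Sigma_{t+1} = \Sigma_t - (1 - \beta)\, \Sigma_t^2 \frac{\partial}{\partial \Sigma_t} KL(q(\eta_t), p(\theta_t)). \tag{19}$$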
Since $\Sigma$ is a positive definite matrix, the wake-phase is a gradient flow of the
m-step, which is defined as (10).
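The averages of (1)-(4) follow from the second moments of the two models: under the recognition side, $\langle x y \rangle = C r$ and $\langle y^2 \rangle = s^2 + r^T C r$; under the generative side, $\langle y x \rangle = g$ and $\langle x x^T \rangle = \Sigma + g g^T$. The sketch below implements one averaged wake-sleep sweep from these moments and checks that the point where $C = \Sigma + g g^T$ and the recognition parameters match the posterior is left unchanged. The function name `averaged_ws_sweep`, the step sizes, and the example parameters are assumptions made for this illustration.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def averaged_ws_sweep(C, g, sigma2, r, s2, alpha=0.05, beta=0.95):
    """One averaged wake-phase followed by one averaged sleep-phase."""
    n = len(g)
    # Wake-phase: <x y> = C r and <y^2> = s^2 + r^T C r.
    Cr = matvec(C, r)
    y2 = s2 + dot(r, Cr)
    g_new = [g[i] + alpha * (Cr[i] - y2 * g[i]) for i in range(n)]
    d = [C[i][i] - 2.0 * Cr[i] * g[i] + y2 * g[i] ** 2 for i in range(n)]
    sigma2_new = [beta * sigma2[i] + (1 - beta) * d[i] for i in range(n)]
    # Sleep-phase: <y x> = g and <x x^T> = Sigma + g g^T, with updated g, Sigma.
    gr = dot(g_new, r)
    r_new = [r[i] + alpha * (g_new[i] - sigma2_new[i] * r[i] - g_new[i] * gr)
             for i in range(n)]
    e = (1.0 - gr) ** 2 + sum(sigma2_new[i] * r[i] ** 2 for i in range(n))
    s2_new = beta * s2 + (1 - beta) * e
    return g_new, sigma2_new, r_new, s2_new

# Illustrative parameters: C is built so that the model fits it exactly.
g_star = [1.0, 0.8, 0.6]
sigma2_star = [0.5, 0.4, 0.3]
C = [[(sigma2_star[i] if i == j else 0.0) + g_star[i] * g_star[j]
      for j in range(3)] for i in range(3)]
kappa = 1.0 + sum(gi * gi / si for gi, si in zip(g_star, sigma2_star))
r_star = [gi / si / kappa for gi, si in zip(g_star, sigma2_star)]
s2_star = 1.0 / kappa

# One sweep started at the matched point should return the same parameters.
fixed = averaged_ws_sweep(C, g_star, sigma2_star, r_star, s2_star)

# A run from a different start, for inspection only.
state = ([0.5, 0.5, 0.5], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 1.0)
for _ in range(1000):
    state = averaged_ws_sweep(C, *state)
```

Working with averaged moments removes the sampling noise, which makes the fixed-point property easy to inspect.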
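On the other hand, $KL(p(\theta), q(\eta))$ is,

$$KL(p(\theta), q(\eta)) = \frac{1}{2} \mathrm{tr}(A^{-1} B) - \frac{n + 1}{2} + \psi(\eta) - \psi(\theta).$$

The derivatives of this K-L divergence with respect to $r$ and $s^2$ are,

$$\frac{\partial}{\partial r} KL(p(\theta), q(\eta)) = \frac{2}{s^2} (\Sigma + g g^T) \left( r - \frac{\Sigma^{-1} g}{1 + g^T \Sigma^{-1} g} \right) \tag{20}$$

$$\frac{\partial}{\partial (s^2)} KL(p(\theta), q(\eta)) = \frac{1}{(s^2)^2} \left( s^2 - \left( (1 - g^T r)^2 + r^T \Sigma r \right) \right). \tag{21}$$

Therefore, the sleep-phase can be rewritten as,

$$r_{t+1} = r_t - \frac{\alpha}{2} s_t^2 \frac{\partial}{\partial r_t} KL(p(\theta_{t+1}), q(\eta_t)) \tag{22}$$

$$s_{t+1}^2 = s_t^2 - (1 - \beta) (s_t^2)^2 \frac{\partial}{\partial (s_t^2)} KL(p(\theta_{t+1}), q(\eta_t)). \tag{23}$$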
These are also a gradient flow, but because of the asymmetry of the K-L divergence, (22)
and (23) are different from the on-line version of the e-step, which minimizes
$KL(q(\eta), p(\theta))$ rather than $KL(p(\theta), q(\eta))$. This is the essential
difference between the EM and W-S algorithms. Therefore, we cannot prove the convergence
of the W-S algorithm based on the similarity of these two algorithms[7].
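The asymmetry of the K-L divergence is easy to see even in one dimension. The snippet below is a small illustration, not part of the original analysis: it evaluates the closed-form divergence between two univariate Gaussians in both directions. The function name `kl_gauss` and the chosen variances are assumptions made for this example.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) for one-dimensional Gaussians."""
    return 0.5 * (v1 / v2 + (m1 - m2) ** 2 / v2 - 1.0 + math.log(v2 / v1))

# Same pair of densities, opposite argument order.
forward = kl_gauss(0.0, 1.0, 0.0, 4.0)
reverse = kl_gauss(0.0, 4.0, 0.0, 1.0)
```

The two values differ, so a gradient flow on one divergence is in general not a gradient flow on the other, even though both vanish when the two densities coincide.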
Figure 2: The Wake-Sleep algorithm
5 CONVERGENCE PROPERTY 
We want to prove the convergence property of the W-S algorithm. If we could find a
Lyapunov function for the W-S algorithm, convergence would be guaranteed[7], but we could
not find one. Instead of finding a Lyapunov function, we take the continuous-time limit
and examine the behavior of the parameters and of the K-L divergence
$KL(q(\eta_t), p(\theta_t))$.
$KL(q(\eta), p(\theta))$ is a function of $g$, $r$, $\Sigma$ and $s^2$. The derivatives
with respect to $g$ and $\Sigma$ are given in (16) and (17). The derivatives with respect
to $r$ and $s^2$ are,

$$\frac{\partial}{\partial r} KL(q(\eta), p(\theta)) = 2 (1 + g^T \Sigma^{-1} g)\, C \left( r - \frac{\Sigma^{-1} g}{1 + g^T \Sigma^{-1} g} \right) \tag{24}$$

$$\frac{\partial}{\partial (s^2)} KL(q(\eta), p(\theta)) = 1 + g^T \Sigma^{-1} g - \frac{1}{s^2}. \tag{25}$$
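On the other hand, we set the flows of $g$, $r$, $\Sigma$ and $s^2$ to follow the updating
due to the W-S algorithm, that is,

$$\frac{d g_t}{dt} = -\alpha' (s_t^2 + r_t^T C r_t) \left( g_t - \frac{C r_t}{s_t^2 + r_t^T C r_t} \right) \tag{26}$$

$$\frac{d r_t}{dt} = -\alpha' (\Sigma_t + g_t g_t^T) \left( r_t - \frac{\Sigma_t^{-1} g_t}{1 + g_t^T \Sigma_t^{-1} g_t} \right) \tag{27}$$

$$\frac{d \Sigma_t}{dt} = -\beta' \left( \Sigma_t - \mathrm{diag}\left( C - 2 C r_t g_t^T + (s_t^2 + r_t^T C r_t)\, g_t g_t^T \right) \right) \tag{28}$$

$$\frac{d (s_t^2)}{dt} = -\beta' \left( s_t^2 - \left( (1 - g_t^T r_t)^2 + r_t^T \Sigma_t r_t \right) \right). \tag{29}$$

With these results, $dKL(q(\eta_t), p(\theta_t))/dt$ is,

$$\frac{dKL(q(\eta_t), p(\theta_t))}{dt} = \frac{\partial KL}{\partial g} \frac{dg}{dt} + \frac{\partial KL}{\partial r} \frac{dr}{dt} + \frac{\partial KL}{\partial \Sigma} \frac{d\Sigma}{dt} + \frac{\partial KL}{\partial (s^2)} \frac{d(s^2)}{dt}. \tag{30}$$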
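The first three terms on the right side of (30) are apparently non-positive. Only the
fourth one is not clear.

$$\frac{\partial KL}{\partial (s^2)} \frac{d(s^2)}{dt} = -\beta' \left( s_t^2 - \left( (1 - g_t^T r_t)^2 + r_t^T \Sigma_t r_t \right) \right) \left( 1 + g_t^T \Sigma_t^{-1} g_t - \frac{1}{s_t^2} \right)$$

$$= -\beta' \frac{1 + g_t^T \Sigma_t^{-1} g_t}{s_t^2} \left( s_t^2 - \left( (1 - g_t^T r_t)^2 + r_t^T \Sigma_t r_t \right) \right) \left( s_t^2 - \frac{1}{1 + g_t^T \Sigma_t^{-1} g_t} \right).$$

$KL(q(\eta_t), p(\theta_t))$ does not decrease when $s_t^2$ stays between
$(1 - g_t^T r_t)^2 + r_t^T \Sigma_t r_t$ and $1/(1 + g_t^T \Sigma_t^{-1} g_t)$, but if the
following equation holds, these two are equivalent,

$$r_t = \frac{\Sigma_t^{-1} g_t}{1 + g_t^T \Sigma_t^{-1} g_t}. \tag{31}$$

From the above results, the flows of $g$, $r$ and $\Sigma$ decrease
$KL(q(\eta_t), p(\theta_t))$ at any time. $s_t^2$ converges to
$(1 - g_t^T r_t)^2 + r_t^T \Sigma_t r_t$, but it does not always decrease
$KL(q(\eta_t), p(\theta_t))$. But since $r_t$ does converge to satisfy (31) independently
of $s_t^2$, finally $s_t^2$ converges to $1/(1 + g_t^T \Sigma_t^{-1} g_t)$.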
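The claim that the two bounds on the recognition variance coincide once the recognition weights match the posterior can be verified directly. The worked step below is a sketch added for clarity; the shorthand $\kappa_t = 1 + g_t^T \Sigma_t^{-1} g_t$ is introduced here and is not a symbol of the original text.

```latex
\begin{aligned}
r_t = \frac{\Sigma_t^{-1} g_t}{\kappa_t}
 \;\Longrightarrow\;&
 1 - g_t^T r_t = 1 - \frac{\kappa_t - 1}{\kappa_t} = \frac{1}{\kappa_t},
 \qquad
 r_t^T \Sigma_t r_t = \frac{g_t^T \Sigma_t^{-1} g_t}{\kappa_t^2}
                    = \frac{\kappa_t - 1}{\kappa_t^2}, \\
 &(1 - g_t^T r_t)^2 + r_t^T \Sigma_t r_t
   = \frac{1}{\kappa_t^2} + \frac{\kappa_t - 1}{\kappa_t^2}
   = \frac{1}{\kappa_t}
   = \frac{1}{1 + g_t^T \Sigma_t^{-1} g_t}.
\end{aligned}
```

So the interval in which the fourth term can be positive shrinks to a single point as $r_t$ approaches this value.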
6 DISCUSSION 
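This factor analysis model has a special property that $p(y \mid x; \theta)$ and
$q(y \mid x; \eta)$ are equivalent when the following conditions are satisfied[7],

$$r = \frac{\Sigma^{-1} g}{1 + g^T \Sigma^{-1} g}, \qquad s^2 = \frac{1}{1 + g^T \Sigma^{-1} g}. \tag{32}$$

From this property, minimizing $KL(p(\theta), q(\eta))$ and $KL(q(\eta), p(\theta))$ with
respect to $\eta$ leads to the same point. Writing the two divergences as,

$$KL(p(\theta), q(\eta)) = E_{p(x; \theta)} \left[ \log \frac{p(x; \theta)}{q(x)} \right] + E_{p(y, x; \theta)} \left[ \log \frac{p(y \mid x; \theta)}{q(y \mid x; \eta)} \right] \tag{33}$$

$$KL(q(\eta), p(\theta)) = E_{q(x)} \left[ \log \frac{q(x)}{p(x; \theta)} \right] + E_{q(y, x; \eta)} \left[ \log \frac{q(y \mid x; \eta)}{p(y \mid x; \theta)} \right], \tag{34}$$

both (33) and (34) include $\eta$ only in the second term on the right side. If (32)
holds, those two terms are 0. Therefore $KL(p(\theta), q(\eta))$ and
$KL(q(\eta), p(\theta))$ are minimized at the same point.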
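The matching conditions on $r$ and $s^2$ can be read off from the joint density of the generative model. The step below is a sketch added for clarity: it collects the terms of the exponent of $p(y, x; \theta)$ that depend on $y$ and completes the square.

```latex
\begin{aligned}
p(y \mid x; \theta)
 &\propto \exp\Bigl( -\tfrac{1}{2} (1 + g^T \Sigma^{-1} g)\, y^2 + y\, g^T \Sigma^{-1} x \Bigr) \\
 &\propto \exp\Biggl( -\frac{1 + g^T \Sigma^{-1} g}{2}
     \Bigl( y - \frac{g^T \Sigma^{-1} x}{1 + g^T \Sigma^{-1} g} \Bigr)^{2} \Biggr),
\end{aligned}
```

so $p(y \mid x; \theta)$ is Gaussian with mean $g^T \Sigma^{-1} x / (1 + g^T \Sigma^{-1} g)$ and variance $1 / (1 + g^T \Sigma^{-1} g)$. A linear recognition model with Gaussian noise can therefore represent it exactly.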
We can use this result to modify the W-S algorithm. If the factor analysis model does not
alternate the wake- and sleep-phases but "sleeps well" until convergence, it will find the
$\eta$ which is equivalent to the e-step in the em algorithm. Since the wake-phase is a
gradient flow of the m-step, this procedure will converge to the MLE. This algorithm is
equivalent to what is called the GEM (Generalized EM) algorithm[6].
The reason why the GEM and the W-S algorithms work is that $p(y \mid x; \theta)$ is
realizable with the recognition model $q(y \mid x; \eta)$. If it is not realizable with the
recognition model, the W-S algorithm won't converge to the MLE. We show an example and
conclude this article.
Suppose that the average of $y$ in the recognition model is not a linear function of
$r$ and $x$ but comes through a nonlinear function $f(\cdot)$ as,

Recognition model: $y = f(r^T x) + \delta$,

where $f(\cdot)$ is a function of a single input and output and
$\delta \sim \mathcal{N}(0, s^2)$ is the noise. In this case, the generative model is not
realizable by the recognition model in general, and minimizing (33) with respect to $\eta$
leads to a different point from minimizing (34). $KL(p(\theta), q(\eta))$ is minimized
when $r$ and $s^2$ satisfy,

$$E_{p(x; \theta)} \left[ f(r^T x)\, f'(r^T x)\, x \right] = E_{p(y, x; \theta)} \left[ y\, f'(r^T x)\, x \right] \tag{35}$$

$$s^2 = 1 + E_{p(y, x; \theta)} \left[ -2 y f(r^T x) + f^2(r^T x) \right], \tag{36}$$

while $KL(q(\eta), p(\theta))$ is minimized when $r$ and $s^2$ satisfy,

$$(1 + g^T \Sigma^{-1} g)\, E_{q(x)} \left[ f(r^T x)\, f'(r^T x)\, x \right] = E_{q(x)} \left[ f'(r^T x)\, x x^T \right] \Sigma^{-1} g \tag{37}$$

$$s^2 = \frac{1}{1 + g^T \Sigma^{-1} g}. \tag{38}$$

Here, $f'(\cdot)$ is the derivative of $f(\cdot)$. If $f(\cdot)$ is a linear function,
$f'(\cdot)$ is a constant and (35), (36) and (37), (38) give the same $\eta$ as (32), but
these are different in general.
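The linear case can be checked by hand. The step below is a sketch added for clarity, taking $f(u) = u$ (so $f' = 1$) and using the moments $E_{p(x; \theta)}[x x^T] = \Sigma + g g^T$ and $E_{p(y, x; \theta)}[y x] = g$ of the generative model.

```latex
\begin{aligned}
\text{(35):}\quad
 & (\Sigma + g g^T)\, r = g
 \;\Longleftrightarrow\;
 r = (\Sigma + g g^T)^{-1} g = \frac{\Sigma^{-1} g}{1 + g^T \Sigma^{-1} g}, \\
\text{(36):}\quad
 & s^2 = E_{p(y, x; \theta)}\bigl[ (y - r^T x)^2 \bigr]
       = 1 - 2 r^T g + r^T (\Sigma + g g^T)\, r
       = 1 - r^T g
       = \frac{1}{1 + g^T \Sigma^{-1} g}.
\end{aligned}
```

Both conditions reduce to the values of $r$ and $s^2$ at which the recognition model matches the posterior of the generative model. The same substitution in (37) and (38) gives the same point, which is why the two minimizations agree when $f$ is linear.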
We studied a factor analysis model and showed that the W-S algorithm works on this
model. From further analysis, we could show that the reason why the algorithm works
on the model is that the generative model is realizable by the recognition model. We also
showed, with a simple example, that the W-S algorithm doesn't converge to the MLE if the
generative model is not realizable.
Acknowledgment 
We thank Dr. Noboru Murata for very useful discussions on this work. 
References 
[1] Shun-ichi Amari. Differential-Geometrical Methods in Statistics, volume 28 of Lecture
Notes in Statistics. Springer-Verlag, Berlin, 1985.
[2] Shun-ichi Amari. Information geometry of the EM and em algorithms for neural
networks. Neural Networks, 8(9):1379-1408, 1995.
[3] Peter Dayan, Geoffrey E. Hinton, and Radford M. Neal. The Helmholtz machine.
Neural Computation, 7(5):889-904, 1995.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete
data via the EM algorithm. J. R. Statistical Society, Series B, 39:1-38, 1977.
[5] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The "wake-sleep" algorithm for
unsupervised neural networks. Science, 268:1158-1160, 1995.
[6] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and
Extensions. Wiley series in probability and statistics. John Wiley & Sons, Inc., 1997.
[7] Radford M. Neal and Peter Dayan. Factor analysis using delta-rule wake-sleep
learning. Neural Computation, 9(8):1781-1803, 1997.
