Two Approaches to Optimal Annealing 
Todd K. Leen 
Dept of Comp. Sci. 2 Engineering 
Oregon Graduate Institute of 
Science and Technology 
P.O.Box 91000, Portland, 
Oregon 97291-1000 
tleen@cse.ogi.edu 
Bernhard Schottky and David Saad 
Neural Computing Research Group 
Dept of Comp. Sci. 2z Appl. Math. 
Aston University 
Birmingham, B4 7ET, UK 
schottba{saadd) @aston.ac.uk 
Abstract 
We employ both master equation and order parameter approaches 
to analyze the asymptotic dynamics of on-line learning with dif- 
ferent learning rate annealing schedules. We examine the relations 
between the results obtained by the two approaches and obtain new 
results on the optimal decay coefficients and their dependence on 
the number of hidden nodes in a two layer architecture. 
I Introduction 
The asymptotic dynamics of stochastic on-line learning and it's dependence on the 
annealing schedule adopted for the learning coefficients have been studied for some 
time in the stochastic approximation literature [1, 2] and more recently in the neural 
network literature [3, 4, 5]. The latter studies are based on examining the Kramers- 
Moyal expansion of the master equation for the weight space probability densities. 
A different approach, based on the deterministic dynamics of macroscopic quantities 
called order parameters, has been recently presented [6, 7]. This approach enables 
one to monitor the evolution of the order parameters and the system performance 
at all times. 
In this paper we examine the relation between the two approaches and contrast the 
results obtained for different learning rate annealing schedules in the asymptotic 
regime. We employ the order parameter approach to examine the dependence of 
the dynamics on the number of hidden nodes in a multilayer system. In addition, 
we report some lesser-known results on non-standard annealing schedules 
302 T. K. Leen, B. Schottky and D. Saad 
2 Master Equation 
Most on-line learning algorithms assume the form Wt+l = Wt -3' rlo/tP H(wt,xt) 
where wt is the weight at time t, xt is the training example, and H(w, x) is the 
weight update. The description of the algorithm's dynamics in terms of weight 
space probability densities starts from the master equation 
vhere ('")x indicates averaging with respect to the measure on x, P(w,t) is the 
probability density on weights at time t, and 5(...) is the Dirac function. One may 
use the Kramers-Moyal expansion of Eq.(1) to derive a partial differential equation 
for the weight probability density (here in one dimension for simplicity) [3, 4] 
 (_l)i (.0)i 
V(w,t) = i! V(w,t) ] (2) 
i----1 
Following Is], we make a small noise expansion for (2) by decomposing the weight 
trajectory into a deterministic and stochastic pieces 
w + or = (3) 
vhere b(t) is the deterministic trajectory, and  are the fluctuations. Apart from 
the factor (r/0/tP)7 that scales the fluctuations, ts is identical to the formulation 
for constant leaning in [3]. The proper value for the unspecffied exponent 7 
emerge from homogeneity requirements. Next, the dependence of the jump moments 
Hi(w, x) on 0 is explited by a Taylor series expansion about the deterministic 
at .  coefficients in ts series exposion are denoted 
Finally one rewrites (2) in terms of & d  d the expansion of the jump moments, 
ting ce to trsform the &fferentiM operators in accordance with (3). 
These transformations leave equations of motion for  and the density H(, t) on 
the fluctuations 
aW =  F (n(O,x) L (4) 
7  (--1)' (_,) ()'(-2 v)+n. 
&n = - oe( n) + (- ) ,  o((-,)n). () 
m=2 i=1 
For stochtic descent H(w,x) = -Vw E(w,x) d (4) describes the evolution of 
  descent on the average cost. The fluctuation equation (5) requir further 
mipulation whose form depends on the context. For the usuM ce of descent 
in a quadratic minimum (?) = -G, minus the cost function cvature), we take 
' = 1/2 to insure that for y m, ter in the sum are homogeneous in o/t p 
For constant leng rate (p = 0), rescng time  t  0t allows (5) to be written 
in a form convenient for perturbative alysis in  Typically, the limit 0  0 is 
invoked and oy the lower order terms in 0 reted (e.g. [3]). These comprise 
a diffusion operator, whi results in a Gaussi appromation for equibrium 
densities. Higher order terms have been successfully used to cMculate corrections 
to the equibri moments in powers of 0 [8]. 
Two Approaches to Optimal Annealing 303 
Of primary interest here is the case of annealed learning, as required for convergence 
of the parameter estimates. Again assuming a quadratic bowl and 7 - 1/2, the first 
few terms of (5) are 
o,n=-:- 5 + 
As t   the right hd side of (6) is dominat by the first tee terms (sce 
0 < p  1). Precisely which terms dominate depends on p. 
We will first review the clsic ce p = 1. Asymptoticly   w*, a loc 
optima. The first three leading terms on the right hd side of (6) e l of order 
1It. For t  , we dkcd the remaining terms. From the restg equation we 
recover a Gaussian equibrium stribution for , or equivently for  (w - w*)  
v vhere v is called the weight eor. The ymptotically normal distribution for 
v h viance a kom which the ymptotic expired squared weight error 
c be derived 
lim E[lvl 2] = a  = 2a*-i  (7) 
where G*  G(w*) is the curvature at the lo optimum. 
Positive a requkes 0 > 1/(2G*). If this condition is not met the eted 
squared weight offset converges  (l/t) -2"' , slower th 1It [5, for exmple, 
and referenc therein]. The above corms the clsi results [1] on ymptotic 
normMiry d convergence rate for 1It neng. 
For the ce 0 < p < 1, the second d third terms on the right hd side of (6) 
will doanate  t  m. Ag, we have a Gaussian eqbrim density for . 
Consequently v is ymptoticaHy normM with ice a leading to the 
expected squed weight error 
E[II = ) 
Notice that the convergence is slower than 1It and that there is no critical value of 
the learning rate to obtain a sensible equilibrium distribution. (See [9] for earlier 
results on 1It p annealing.) 
The generalization error follovs the same decay rate as the expected weight offset. 
In one dimension, the expected squared weight offset is directly related to excess 
generalization error (the generalization error minus the least generalization error 
achievable) eg - G E[v2]. In multiple dimensions, the expected squared weight 
offset, together with the maximum and minimum eigenvalues of G* provide upper 
and lower bounds on the excess generalization error proportional to E[[v[2], with 
the criticality condition on G* (for p -- 1 )replaced with an analogous condition on 
its eigenvalues. 
3 Order parameters 
In the Master equation approach, one focuses attention on the weight space distri- 
bution P(w, t) and calculates quantities of interested by averaging over this density. 
An alternative approach is to choose a smaller set of macroscopic variables that are 
sufficient for describing principal properties of the system such as the generaliza- 
tion error (in contrast to the evolution of the weights w vhich are microscopic). 
304 T. K. Leen, B. Schottky and D. Saad 
Formally, one can replace the parameter dynamics presented in Eq.(1) by the cor- 
responding equation for macroscopic observables which can be easily derived from 
the corresponding expressions for w. By choosing an appropriate set of macroscopic 
variables and invoking the thermodynamic limit (i.e., looking at systems where the 
number of parameters is infinite), one obtains point distributions for the order pa- 
rameters, rendering the dynamics deterministic. 
Several researchers [6, 7] have employed this approach for calculating the trdning 
dynamics of a soft committee machine (SCM) . The SCM maps inputs x 6 R S to 
a scalar, through a model p(w,x) = ./__' g(wi- x). The activation function of 
the hidden units is g(u) -- err(u/v/) and wi is the set of input-to-hidden adaptive 
veights for the i = 1... K hidden nodes. The hidden-to-output weights are set 
to 1. This architecture preserves most of the properties of the learning dynamics 
and the evolution of the generalization error as a general two-layer network, and 
the formalism can be easily extended to accommodate adaptive hidden-to-output 
weights [10]. 
Input vectors x are independently drawn with zero mean and unit variance, and the 
corresponding targets y are generated by deterministic teacher network corrupted 
2 The teacher 
by additive Gaussian output noise of zero mean and variance 
network is also a SCM, with input-to-hidden weights w?. The order parameters 
sufficient to close the dynamics, and to describe the network generalization error 
are overlaps between various input-to-hidden vectors wi  w -- 
 * Ty 
Rin, and w,w m = 
Network performance is measured in terms of the generalization error ea(w ) _= 
{1/2[ p(w, x)- y ]2)x. The generalization error can be expressed in closed form in 
terms of the order parameters in the thermodynamic limit (N - c). The dynamics 
of the latter are also obtained in closed form [7]. These dynamics are coupled non- 
linear ordinary differential equations whose solution can only be obtained through 
numerical integration. However, the asymptotic behavior in the case of annealed 
learning is amenable to analysis, and this is one of the primary results of the paper. 
We assume an isotropic teacher Tnm = 5m and use this symmetry to reduce the 
system to a vector of four order parameters u T = (r, q, s, c) related to the overlaps 
by tin = 5in(1 + r) + (1 --5in)S and Qik ---- 5i(1 +q) + (1 -- 5i)c. 
With learning rate annealing and limt u = 0 we describe the dynamics in this 
vicinity by a linearization of the equations of motion in [7]. The linearization is 
d 
u = r/Mu + r/2 rr2b, (9) 
2 is the noise variance, b T 2 (0, 1/v/-,0, 1/2), r/ o/tP, and M is 
where rr =  = 
M .._ 
2 
3v/rr 
3 -(K-1) V/3) 
4 -3 (K- 1)v 
2 
3v(c- 2) + 
(K- 1)v h 
-(K 1)v 
0 
2 
-3v(/- 2) +  
, (10) 
The asymptotic equations of motion (9) were derived by dropping terms of order 
O(r]]ul] ) and higher, and terms of order O(r 2 u). While the latter are linear in the 
order parameters, they are dominated by the ru and rab terms in (9) as t - c. 
Two Approaches to Optimal Annealing 305 
This choice of truncations sheds light on the approach to equilibrium that is not 
implicit in the master equation approach. In the latter, the dominant terms for 
the asymptotics of (6) were identified by time scale of the coefficients, there was 
no identification of system observables that signal when the asymptotic regime 
is entered. For the order parameter approach, the conditions for validity of the 
asymptotic approximations are cast in terms of system observables r/u vs r/2u vs 
2 2 
The solution to (9) is 
where Uo -= u(to) and 
7(t, t0)=exp M 
2 
u(t) = "(t, to) u0 + rr /(t, to) b 
(11) 
dr r/(r) and /(t, to)= dr 7(t, r) r/2(r). (12) 
The asymptotic order parameter dynamics allow us to compute the generalization 
error (to first order in u) 
=- (c-2s) 
(13) 
Using the solution of Eq.(11), the generalization error consists of two pieces: a 
contribution depending on the actual initial conditions u0 and a contribution due 
to the second term on the r.h.s. of Eq.(11), independent of u0. The former decays 
more rapidly than the latter, and we ignore it in what follows. Asymptotically, 
the generalization error is of the form et = (T2v(ClOl(t) qt_ C202(t)), vhere ci are /f 
dependent coefficients, and 0 i are eigenmodes that evolve as 
Oi '- 1 + ctirlo ' 
,vith eigenvalues (Fig. l(a)) 
-- 2 and a2 - -- + 2(K - 1 
71' -- 71' 
(15) 
The critical learning rate r] tit , above which the generalization decays as 1/t is, for 
K_>2, 
r]0cri t (1 1) /V 
=max . -, - -- (16) 
01 02 4 - 2 
For r/0 > r] tit both modes Oi, i = 1, 2 decay as i/t, and so 
22(C1 C 2 )1 2 1 
el = --rrvr/5 1 + Ctlr/O + 1 + o2r/o ' ----- try f(r/o,K) - (17) 
Minimizing the prefactor f(r/0,K) in (17) minimizes the asymptotic error. The 
values r/ pt (K) are shovn in Fig. l(b), where the special case of If = 1 (see below) 
is also included: There is a significant difference between the values for K = 1 and 
K = 2 and a rather weak dependence on K for K >_ 2. The sensitivity of the 
generalization error decay factor on the choice of r/0 is shown in Fig. 1(c). 
The influence of the noise strength on the generalization error can be seen directly 
2 is just a prefactor scaling the 1It decay. Neither 
from (17): the noise variance rr 
the value for the critical nor for the optimal r/0 is influenced by it. 
306 T. K. Leen, B. Schottky and D. Saad 
The calculation above holds for the case K = i (where c and s and the mode 0x are 
absent). In this case 
flPt(K -- 1)-- 2flgrit(/i ' -' 1)- 
2 2 
(18) 
Finally, for the general annealing schedule of the form r/= rlo/tP with 0 < p < 1 
the equations of motion (11) can be investigated, and one again finds 1/t p decay. 
4 Discussion and summary 
We employed master equation and order parameter approaches to study the conver- 
gence of on-line learning under different annealing schedules. For the 1/t annealing 
schedule, the small noise expansion provides a critical value of r/0 (7) in terms of 
the curvature, above which x/v is asymptotically normal, and the generalization 
decays as 1/t. The approach is general, but requires knowledge of the first two jump 
moments in the asymptotic regime for calculating the relevant properties. 
By restricting the order parameters approach to a symmetric task characterized 
by a set of isotropic teacher vectors, one can explicitly solve the dynamics in the 
asymptotic regime for any number of hidden nodes, and provide explicit expressions 
for the decaying generalization error and for the critical (16) and optimal learning 
rate prefactors for any number of hidden nodes K. Moreover, one can study the 
sensitivity of the generalization error decay to the choice of this prefactor. Simi- 
lax results have been obtained for the critical learning rate prefactors using both 
methods, and both methods have been used to study general 1/tP annealing. How- 
ever, the order parameters approach enables one to gain a complete description of 
the dynamics and additional insight by restricting the task examined. Finally the 
order parameters approach expresses the dynamics in terms of ordinary differential 
equations, rather than partial differential equations; a clear advantage for numerical 
investigations. 
The order parameter approach provides a potentially helpful insight on the passage 
into the asymptotic regime. Unlike the truncation of the small noise expansion, the 
truncation of the order parameter equations to obtain the asymptotic dynamics is 
couched in terms of system observables (c.f. the discussion following (10)). That 
is, one knows exactly which observables must be dominant for the system to be 
in the asymptotic regime. Equivalently, starting from the full equations, the or- 
der parameters approach can tell us when the system is close to the equilibrium 
distribution. 
Although we obtained a full description of the asymptotic dynamics, it is still unclear 
how relevant it is in the larger picture which includes all stages of the training 
process, as in many cases it takes a prohibitively long time for the system to reach 
the asymptotic regime. It would be interesting to find a way of extending this 
framework to gain insight into earlier stages of the learning process. 
Acknowledgements: DS and BS would like to thank the Leverhulme Trust for their 
support (F/250/K). TL thanks the International Human Frontier Science Program (SF 
473-96), and the NSF (ECS-9704094) for their support. 
Two Approaches to Optimal Annealing 307 
-4 
-6 
-8 
-10 
10t 1 
-12 
2 4 
6 8 10 12 
K 
14 16 18 
2O 
2O 
18 
16 
14 
 o 
 8 
6 
2 
0 
2 4 6 8 10 12 14 16 18 
o 
2o I .o  
is ...... f, K) ' 'l 
16f ..." 
14 
10 
8 
6 
4 
2  
0 
I 2 3 4 5 6 7 8 9 10 
(b) K 
Figure 1: (a) The dependence of the 
eigenvalues of M on the the num- 
ber of hidden units K. Note that 
the constant eigenvalue cl domi- 
nates the convergence for K _> 2. 
(b) _opt 
'/0 and the resulting general- 
ization prefactor f(rlPt,K) of (17) 
as a function of K. (c) The depen- 
dence of the generalization error de- 
cay prefactor f(rlo,K ) on the choice 
of r/0. 
References 
[1] V. Fabian. Ann. Math. Statist., 39, 1327 1968. 
[2] L. Goldstein. Technical Report DRB-306, Dept. of Mathematics, University of 
Southern California, LA, 1987. 
[3] T. M. Heskes and B. Kappen, Phys. Rev. A 44, 2718 (1991). 
[4] T. K. Leen and J. E. Moody. In Giles, Hanson, and Cowan, editors, Advances in 
Neural Information Processing Systems, 5, 451, San Mateo, CA, 1993. Morgan 
Kaufmann. 
[5] T. K. Leen and G. B. Orr. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, 
Advances in Neural Information Processing Systems 6, 477 ,San Francisco, CA., 
1994. Morgan Kaufmann Publishers. 
[61 M. Biehl and H. Schwarze, J. Phys. A 28, 643 (1995). 
[7l D. Saad and S.A. Sona Phys. Rev. Lett. 74, 4337 (1995) and Phys. Rev. E 52 
4225 (1995). 
[8] G. B. Orr. Dynamics and Algorithms for Stochastic Search. PhD thesis, Oregon 
Graduate Institute, October 1996. 
[9] Naama Barkai. Statistical Mechanics of Learning. PhD thesis, Hebrew Univer- 
sity of Jerusalem, August 1995. 
[10] P. Riegler and M. Biehl J. Phys. A 28, L507 (1995). 
