Optimal Stochastic Search and 
Adaptive Momentum 
Todd K. Leen and Genevieve B. Orr 
Oregon Graduate Institute of Science and Technology 
Department of Computer Science and Engineering 
P.O.Box 91000, Portland, Oregon 97291-1000 
Abstract 
Stochastic optimization algorithms typically use learning rate 
schedules that behave asymptotically as (t) = o/t. The ensem- 
ble dynamics (Leen and Moody, 1993) for such algorithms provides 
an easy path to results on mean squared weight error and asymp- 
totic normality. We apply this approach to stochastic gradient 
algorithms with momentum. We show that at late times, learning 
is governed by an effective learning rate e# - o/(1 -/) where 
/ is the momentum parameter. We describe the behavior of the 
asymptotic weight error and give conditions on e# that insure 
optimal convergence speed. Finally, we use the results to develop 
an adaptive form of momentum that achieves optimal convergence 
speed independent of o. 
I Introduction 
The rate of convergence for gradient descent algorithms, both batch and stochastic, 
can be improved by including in the weight update a "momentum" term propor- 
tional to the previous weight update. Several authors (Tugay and Tanik, 1989; 
Shynk and Roy, 1988) give conditions for convergence of the mean and covariance 
of the weight vector for momentum LMS with constant learning rate. However 
stochastic algorithms require that the learning rate decay over time in order to 
achieve true convergence of the weight (in probability, in mean square, or with 
probability one). 
477 
478 Leen and O 
This paper uses our previous work on weight space probabilities (Leen and Moody, 
1993; Orr and Leen, 1993) to study the convergence of stochastic gradient algo- 
rithms with annealed learning rates of the form/ = Io/t, both with and without 
momentum. The approach provides simple derivations of previously known results 
and their extension to stochastic descent with momentum. Specifically, we show 
that the mean squared weight misadjustment drops off at the maximal rate o 1/t 
only if the effective learning rate Ieff -/o/(1 - 1) is greater than a critical value 
which is determined by the Hessian. 
These results suggest a new algorithm that automatically adjusts the momentum 
coefficient to achieve the optimal convergence rate. This algorithm is simpler than 
previous approaches that either estimate the curvature directly during the descent 
(Venter, 1967) or measure an auxilliary statistic not directly involved in the opti- 
mization (Darken and Moody, 1992). 
2 Density Evolution and Asymptotics 
We consider stochastic optimization algorithms with weight w E R N. We confine 
attention to a neighborhood of a local optimum w, and express the dynamics in 
terms of the weight error v -- w - w,. For simplicity we treat the continuous time 
algorithm  
v(t) 
= (t) H[v(t),x(t)] (1) 
dt 
where (t) is the learning rate at time t, H is the weight update function and 
x(t) is the data fed to the algorithm at time t. For stochastic gradient algorithms 
H = -Vv (v, x(t) ), minus the gradient of the instantaneous cost function. 
Convergence (in mean square) to w. is characterized by the average squared norm 
of the weight error E [ Iv 12 ] - Trace C where 
c _= f (2) 
is the weight error correlation matrix and P(v, t) is the probability density at v and 
time t. In (Leen and Moody, 1993) we show that the probability density evolves 
according to the Kramers-Moyal expansion 
8P(v,t) 
 (_1) i v 
i----1 Jl,...ji=l 
Ovj Ovj2 ... Ovj 
1Although algorithms are executed in discrete time, continuous time formulations are 
often advantagous for analysis. The passage from discrete to continuous time is treated 
in various ways depending on the needs of the theoretical exposition. Kushner and Clark 
(1978) define continous time functions that interpolate the discrete time process in order 
to establish an equivalence between the asymptotic behavior of the discrete time stochastic 
process, and solutions of an associated deterministic differential equation. Heskes et al. 
(1992) draws on the results of Bedeaux et al. (1971) that link (discrete time) random 
walk trajectories to the solution of a (continuous time) master equation. Heskes' master 
equation is equivalent to our Kramers-Moyal expansion (3). 
Optimal Stochastic Search and Adaptive Momentum 479 
where Hj denotes the j}n component of the N-component vector H, and {.. 
denotes averaging over the density of inputs. Differentiating (2) with respect to 
time, using (3) and integrating by parts, we obtain the equation of motion for the 
weight error correlation 
at = (t) any e(v,t)[ <H(,x):O>x + <H,x)> 
/(t) 2 / dNv P(v,t) (H(v,x) H(v,x) ' ) (4) 
2.1 Asymptotics of the Weight Error Correlation 
Convergence of v can be understood by studying the late time behavior of (4). 
Since the update function H(v, x) is in general non-linear in v, the time evolution 
of the correlation matrix Cij is coupled to higher moments E[vi vj vk...] of the 
weight error. However, the learning rate is assumed to follow a schedule/(t) that 
satisfies the requirements for convergence in mean square to a local optimum. Thus 
at late times the density becomes sharply peaked about v - 0 2. This suggests 
that we expand H(v, x) in a power series about v - 0 and retain the lowest order 
non-trivial terms in (4) leaving: 
dC 
= -(t) [ (R c) + (CRT)] + (t). D , (5) 
dt 
where R is the Hessian of the average cost function {  ), and 
D = (H(O,x) H(O,x)  ) (6) 
is the diffusion matrix, both evaluated at the local optimum w,. (Note that R " = 
R.) We use (5) with the understanding that it is valid for large t. The solution to 
(5) is 
C(t) = U(t, to) C(to) Ur(t, to) + 
where the evolution operator U(t2, tl) is 
U(tp.,tl) 
fia u()' rr(t, ) o v r (t, ) 
(7) 
= exp [-i' a-u0-) ] (s) 
We assume, without loss of generality, that the coordinates are chosen so that R is 
diagonal (D won't be) with eigenvalues Ai, i = 1... N. Then with/(t) = lao/t we 
obtain 
E[[v[ 2] = Trace[C(t)] = i.. 1 Cii(to) 
+ (2uo Ai - 1) t to T . (9) 
2In general the density will have nonzero components outside the basin of w.. We are 
neglecting these, for the purpose of calculating the second moment of the the local density 
in the vicinity of w.. 
480 Leen and On' 
We define 
1 
- (lO) 
Icrit -- 2 Amin 
and identify two regimes for which the behavior of (9) is fundamentally different: 
1. Io > Icrit: E[lvl ] drops off asymptotically as 1/t. 
1 o A., ) 
2. Po < Icrit: E[lvl 2 ] drops off asymptotically as ( ? )<2 
i.e. more slowly than 1/t. 
Figure 1 shows results from simulations of an ensemble of 2000 networks trained by 
LMS, and the prediction from (9). For the simulations, input data were drawn from 
a gaussian with zero mean and variance R = 1.0. The targets were generated by 
a noisy teacher neuron (i.e. targets =w.x + , where <) = 0 and <') = as). The 
upper two curves in each plot (dotted) depict the behavior for o < Icrit -- 0.5. 
The remaining curves (solid) show the behavior for o > Icrit  
'7, 
00 ]000 5000 50000 ]00 000 5000 50000 
t t 
Fig. l' LEFT - Simulation results from an ensemble of 2000 one-dimensional 
LMS algorithms with R = 1.0, 62 _ 1.0 and /t = tto/t. RIGHT - Theo- 
retical predictions from equation (9). Curves correspond to (top to bottom) 
o = 0.2, 0.4, 0.6, 0.8, 1.0, 1.5. 
By minimizing the coefficient of 1/t in (9), the optimal learning rate is found to 
be Iopt = 1/X,in. This formalism also yields asymptotic normality rather simply 
(Orr and Leen, 1994). These conditions for "optimal" (i.e. l/t) convergence of 
the weight error correlation and the related results on asymptotic normality have 
been previously discussed in the stochastic approximation literature (Darken and 
Moody, 1992; Goldstein, 1987; White, 1989; and references therein) . The present 
formal structure provides the results with relative ease and facilitates the extension 
to stochastic gradient descent with momentum. 
3 Stochastic Search with Constant Momentum 
The discrete time algorithm for stochastic optimization with momentum is: 
v(t + 1) = v(t) + (t) H[v(t),x(t)] +/3 f(t) 
(ii) 
Optimal Stochastic Search and Adaptive Momentum 481 
(t + 1) 
or in continuous time, 
v(t) 
dt 
= v(t + 1)-(t) 
= f(t) + !(t) H[v(t),x(t)] + (/3- 1) f(t), 
(12) 
- I(t) H[v(t),x(t)] +/3 f(t) (13) 
aa(t) 
dt 
= i(t) H[v(t),x(t)] + (/3- 1) f(t). 
(14) 
As before, we are interested in the late time behavior of E[lvl 2 ]. To this end, we 
define the 2N-dimensional variable Z _= (v, f)T and, following the arguments of 
the previous sections, expand H[v(t), x(t)] in a power series about v - 0 retaining 
the lowest order non-trivial terms. In this approximation the correlation matrix 
 -- E[ZZ T] evolves according to 
with 
= K + K T +/(t) 2  (15) 
dt 
) - ) 
K = -u(t)n (/3- 1) , D = D D ' 
I is the N x N identity matrix, and R and D are defined as before. The evolution 
operator is now 
- ] 
U(t2,tl) = exp dr K(r) (17) 
1 
and the solution to (15) is 
U = (t, to) U(to) uT(t, to) + dr t2(r) U(t,r)  uT(t,r) (18) 
The squared norm of the weight error is the sum of first N diagonal elements of C. 
In coordinates for which R is diagonal and with/(t) = Io/t, we find that for t >> to 
E[lvl2]  1 (to) + 
/0 2 Dii 
(1 -/3) (2/oAi - 1 +/3) 
to 
(19) 
This reduces to (9) when/3 = 0. Equation (19) defines two regimes of interest: 
1. /z0/(1 --/3) > lZcrit: E[lvl '] drops off asymptotically as 1/t. 
2. to/(1 -/3) < tcrit: E[Ivl 2] drops off asymptotically as 
2taOmin 
i.e. more slowly than 1It. 
482 Leen and Orr 
The form of (19) and the conditions following it show that the asymptotics of 
gradient descent with momentum are governed by the effective learning rate 
Figure 2 compares simulations with the predictions of (19) for fixed/z0 and various 
/3. The simulations were performed on an ensemble of 2000 networks trained by 
LMS as described previously but with an additional momentum term of the form 
given in (11). The upper three curves (dotted) show the behavior of E[Ivl 9'] for 
lteff < ltcrit. The solid curves show the behavior for l%# > lZcrit. 
The derivation of asymptotic normality proceeds similarly to the case without mo- 
mentum. Again the reader is referred to (Orr and Leen, 1994) for details. 
100 1000 5000 50000 
t 
Fig.2: 
gorithms with momentum with R - 
Theoretical predictions from equation 
/3 = 0.0, 0.4, 0.5, 0.6, 0.7, 0.8 . 
100 1000 5000 50000 
t 
LEFT - Simulation results from an ensemble of 2000 one-dimensional LMS al- 
1.0, cr 2 = 1.0, and /z0 = 0.2. RIGHT- 
(19). Curves correspond to (top to bottom) 
4 Adaptive Momentum Insures Optimal Convergence 
The optimal constant momentum parameter is obtained by minimizing the coeffi- 
cient of 1It in (19). Imposing the restriction that this parameter is positive s gives 
/3opt = max(O, 
(20) 
As with lZopt, this result is not of practical use because, in general, ,X,i, is unknown. 
For 1-dimensional linear networks, an alternative is to use the instantaneous esti- 
mate of A, X(t) = x2(t) where x(t) is the network input at time t. We thus define 
the adaptive momentum parameter to be 
max(O, 1 -/zox ') (1-dimension). 
(21) 
An algorithm based on (21) insures that the late time convergence is optimally fast. 
An alternative route to achieving the same goal is to dispense with the momentum 
term and adaptively adjust the learning rate. Vetnet (1967) proposed an algorithm 
3E[ivl 2] diverges for 1/31 > 1. For -1 < /3 < O, E[lvl 2] appears to converge but oscil- 
lations are observed. Additional study is required to determine whether/3 in this range 
might be useful for improving learning. 
Optimal Stochastic Search and Adaptive Momentum 483 
that iteratively estimates  for 1-D algorithms and uses the estimate to adjust 
Darken and Moody (1992) propose measuring an auxiliary statistic they call "drift" 
that is used to determine whether or not /Zo > lZcrit. The adaptive momentum 
scheme generalizes to multiple dimensions more easily than Vetner's algorithm, 
and, unlike Darken and Moody's scheme, does not involve calculating an auxiliary 
statistic not directly involved with the minimization. 
A natural extension to N dimensions is to define a matrix of momentum coefficients, 
? - ! - lzo x x T, where ! is the N x N identity matrix. By zeroing out the negative 
eigenvalues of % we obtain the adaptive momentum matrix 
/3actap - I- CXX T, where c = min(/o, 1/(xTx)). (22) 
Log( E[ Ivl 2] ) Log( E[ Ivl 2] ) 
6 
4 
3 4 
2 2 
1 o 
0 -2 
-1 
I 2 3  Log(t) -4 
1 2 3 4 5 
Fig.3: Simulations of 2-D LMS with 1000 networks initialized at vo - (.2,.3) and with 
ff2 = 1, hi = .4, 2 --' 4, and itctit = 1.25. LEFT- / = 0, RIGHT - / =/aaapt. Dashed 
curves correspond to adaptive momentum. 
Log(t) 
Figure 3 shows that our adaptive momentum not only achieves the optimal con- 
vergence rate independent of the learning rate parameter/o but that the value of 
log(E[Iv]9']) at late times is nearly independent of/o and smaller than when mo- 
mentum is not used. The left graph displays simulation results without momentum. 
Here, convergence rates clearly depend on/o and are optimal for/o >/crit - 1.25. 
When/o is large there is initially significant spreading in v so that the increased 
convergence rate does not result in lower log(E[]v]9']) until very late times (t > 105). 
The graph on the right shows simulations with adaptive momentum. Initially, the 
spreading is even greater than with no momentum, but log(E[Iv[9']) quickly decreases 
to reach a much smaller value. In addition, for t > 300, the optimal convergence 
rate (slope--I) is achieved for all three values of/o and the curves themselves lie 
almost on top of one another. In other words, at late times (t > 300), the value of 
log(E[lv[9']) is independent of/o when adaptive momentum is used. 
5 Summary 
We have used the dynamics of the weight space probabilities to derive the asymp- 
totic behavior of the weight error correlation for annealed stochastic gradient algo- 
rithms with momentum. The late time behavior is governed by the effective learning 
rate/eff-- /o/(1 -/3). For learning rate schedules/o/t, if/eff > 1/(2 Am/n), then 
the squared norm of the weight error v -- co - co. falls off as 1/t. From these results 
we have developed a form of momentum that adapts to obtain optimal convergence 
rates independent of the learning rate parameter. 
484 Leen and Orr 
Acknowledgments 
This work was supported by grants from the Air Force Office of Scientific Research 
(F49620-93-1-0253) and the Electric Power Research Institute (RP8015-2). 
References 
D. Bedeaux, K. Laktos-Lindenberg, and K. Shuler. (1971) On the Relation Between 
Master Equations and Random Walks and their Solutions. Journal of Mathematical 
Physics, 12:2116-2123. 
Christian Darken and John Moody. (1992) Towards Faster Stochastic Gradient 
Search. In J.E. Moody, S.J. Hanson, and R.P. Lipmann (eds.) Advances in Neural 
Information Processing Systems, vol. J. Morgan Kaufmann Publishers, San Mateo, 
CA, 1009-1016. 
Larry Goldstein. (1987) Mean Square Optimality in the Continuous Time Robbins 
Monro Procedure. Technical Report DRB-306, Dept. of Mathematics, University 
of Southern California, LA. 
H.J. Kushner and D.S. Clark. (1978) Stochastic Approximation Methods for Con- 
strained and Unconstrained Systems. Springer-Verlag, New York. 
Tom M. Heskes, Eddy T.P. Slijpen, and Bert Kappen. (1992) Learning in Neural 
Networks with Local Minima. Physical Review A, 46(8):5221-5231. 
Todd K. Leen and John E. Moody. (1993) Weight Space Probability Densities in 
Stochastic Learning: I. Dynamics and Equilibria. In Giles, Hanson, and Cowan 
(eds.), Advances in Neural Information Processing Systems, vol. 5, Morgan Kauf- 
mann Publishers, San Mateo, CA, 451-458. 
G. B. Orr and T. K. Leen. (1993) Weight Space Probability Densities in Stochastic 
Learning: II. Transients and Basin Hopping Times. In Giles, Hanson, and Cowan 
(eds.), Advances in Neural Information Processing Systems, vol. 5, Morgan Kauf- 
mann Publishers, San Mateo, CA, 507-514. 
G. B. Orr and T. K. Leen. (1994) Momentum and Optimal Stochastic Search. In 
M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend (eds.), 
Proceedings of the 1993 Connectionist Models Summer School, 351-357. 
John J. Shynk and Sumit Roy. (1988) The LMS Algorithm with Momentum Up- 
dating. Proceedings of the IEEE International Symposium on Circuits and Systems, 
2651-2654. 
Mehmet Ali Tugay and Yalin Tanik. (1989) Properties of the Momentum LMS 
Algorithm. Signal Processing, 18:117-127. 
J. H. Venter. (1967) An Extension of the Robbins-Monro Procedure. Annals of 
Mathematical Statistics, 38:181-190. 
Halbert White. (1989) Learning in Artificial Neural Networks: A Statistical Per- 
spective. Neural Computation, 1:425-464. 
