Using Curvature Information for 
Fast Stochastic Search 
Genevieve B. Orr 
Dept of Computer Science 
Willamette University 
900 State Street 
Salem, OR 97301 
gorr@willamette.edu 
Todd K. Leen 
Dept of Computer Science and Engineering 
Oregon Graduate Institute of 
Science and Technology 
P.O.Box 91000, Portland, Oregon 97291-1000 
tleen@cse.ogi.edu 
Abstract 
We present an algorithm for fast stochastic gradient descent that 
uses a nonlinear adaptive momentum scheme to optimize the late 
time convergence rate. The algorithm makes effective use of cur- 
vature information, requires only (9(n) storage and computation, 
and delivers convergence rates close to the theoretical optimum. 
We demonstrate the technique on linear and large nonlinear back- 
prop networks. 
Improving Stochastic Search 
Learning algorithms that perform gradient descent on a cost function can be for- 
mulated in either stochastic (on-line) or batch form. The stochastic version takes 
the form 
= + 
where cot is the current weight estimate, Pt is the learning rate, G is minus the 
instantaneous gradient estimate, and xt is the input at time t x. One obtains the 
corresponding batch mode learning rule by taking/ constant and averaging G over 
all x. 
Stochastic learning provides several advantages over batch learning. For large 
datasets the batch average is expensive to compute. Stochastic learning eliminates 
the averaging. The stochastic update can be regarded as a noisy estimate of the 
batch update, and this intrinsic noise can reduce the likelihood of becoming trapped 
in poor local optima [1, 2]. 
We assume that the inputs are i.i.d. This is achieved by random sampling with re- 
placement from the training data. 
Using Curvature Information for Fast Stochastic Search 607 
The noise must be reduced late in the training to allow weights to converge. After 
settling within the basin of a local optimum w., learning rate annealing allows con- 
vergence of the weight error v = w - w.. It is well-known that the expected squared 
weight error, E[[v[ 2] decays at its maximal rate oc 1It with the annealing schedule 
Io/t. Furthermore to achieve this rate one must have 0 > Icrit = 1/(2Amin) where 
Amin is the smallest eigenvalue of the Hessian at w. [3, 4, 5, and references therein]. 
Finally the optimal io, which gives the lowest possible value of E[[v[ 2] is 0 = 1/A. 
In multiple dimensions the optimal learning rate matrix is (t) = (l/t)7-/-,where 
7-/is the Hessian at the local optimum. 
Incorporating this curvature information into stochastic learning is difficult for two 
reasons. First, the Hessian is not available since the point of stochastic learning is 
not to perform averages over the training data. Second, even if the Hessian were 
available, optimal learning requires its inverse - which is prohibitively expensive to 
compute 2 
The primary result of this paper is that one can achieve an algorithm that behaves 
optimally, i.e. as if one had incorporated the inverse of the full Hessian, without 
the storage or computational burden. The algorithm, which requires only O(n) 
storage and computation (n = number of weights in the network), uses an adaptive 
momentum parameter, extending our earlier work [7] to fully non-linear problems. 
We demonstrate the performance on several large back-prop networks trained with 
large datasets. 
Implementations of stochastic learning typically use a constant learning rate during 
the early part of training (what Darken and Moody [4] call the search phase) to ob- 
tain exponential convergence towards a local optimum, and then switch to annealed 
learning (called the converge phase). We use Darken and Moody's adaptive search 
then converge (ASTC) algorithm to determine the point at which to switch to 1It 
annealing. ASTC was originally conceived as a means to insure 0 > Icrit during 
the annealed phase, and we compare its performance with adaptive momentum as 
well. We also provide a comparison with conjugate gradient optimization. 
1 Momentum in Stochastic Gradient Descent 
The adaptive momentum algorithm we propose was suggested by earlier work on 
convergence rates for annealed learning with constant momentum. In this section 
we summarize the relevant results of that work. 
Extending (1) to include momentum leaves the learning rule 
(2) 
where  is the momentum parameter constrained so that 0 <  < 1. Analysis of 
the dynamics of the expected squared weight error E[ Ivl 2 ] with it = io/t learning 
rate annealing [7, 8] shows that at late times, learning proceeds as for the algorithm 
without momentum, but with a scaled or effective learning rate 
_ uo (3) 
This result is consistent with earlier work on momentum learning with small, con- 
stant !, where the same result holds [9, 10, 11] 
2Venter [6] proposed a 1-D algorithm for optimizing the convergence rate that estimates 
the Hessian by time averaging finite differences of the gradient and scalin the learning 
rate by the inverse. Its extension to multiple dimensions would require O(n ) storage and 
O(n 3) time for inversion. Both are prohibitive for large models. 
608 G. B. Orr and T. K. Leen 
If we allow the effective learning rate to be a matrix, then, following our comments 
in the introduction, the lowest value of the misadjustment is achieved when Pe# = 
7-/-x [7, 8]. Combining this result with (3) suggests that we adopt the heuristic s 
= ! - (4) 
where opt is a matrix of momentum parameters, I is the identity matrix, and P0 
is a scalar. 
We started with a scalar momentum parameter constrained by 0 < / < 1. The 
equivalent constraint for our matrix/opt is that its eigenvalues lie between 0 and 
1. Thus we require P0  1/Amax where Amax is the largest eigenvalue of 
A scalar annealed learning rate P0/t combined with the momentum parameter/?opt 
ought to provide an effective learning rate asymptotically equal to the optimal learn- 
ing rate 7-/-. This rate 1) is achieved without ever performing a matrix inversion 
on 7-/ and 2) is independent of the choice of P0, subject to the restriction in the 
previous paragraph. 
We have dispensed with the need to invert the Hessian, and we next dispense with 
the need to store it. First notice that, unlike its inverse, stochastic estimates of 
are readily available, so we use a stochastic estimate in (4). Secondly according to 
(2) we do not require the matrix/op, but rather/?opt times the last weight up- 
date. For both linear and non-linear networks this dispenses with the O(n 2) storage 
requirements. This algorithm, which we refer to as adaptive momentum, does not 
require explicit knowledge or inversion of the Hessian, and can be implemented very 
efficiently as is shown in the next section. 
2 Implementation 
The algorithm we propose is 
t+x = + C(t,xt ) + (5) 
where Awt -= w - w_  d t is a stochtic estimate of the Hessian at time t. 
We first consider a single layer fdforwd line network. Since the weights con- 
necting the inputs to different outputs e independent of each other we need only 
discuss the ce for one output node. Each output node is then treated identically. 
For one output node d N inputs, the Hessi is  = (xxT) e x where 
indicates eectation over the inputs x and where x T is the trspose of x. The 
single-step estimate of the hessi is then just , = xtx. The momentum term 
becomes 
- = = - (6) 
Written in his way, we noe ha here is no matrix multiplication, jus he vector 
dot product w, d vector addition that e both O(n). For M output nodes, 
he Mgori is hen O(N) where N = NM is the total number weights in he 
network. 
For nonlinear networks the problem is somewhat more complicated. To compute 
tAt we use the algorithm developed by Pearlmutter [12] for computing the prod- 
uct of the hessian times an arbitrary vector. 4 The equivalent of one forward-back 
awe refer to (4) as a heuristic since we have no theoretical results on the dynamics of 
the squared weight error for learning with this matrix of momentum parameters. 
4We actually use a slight modification that calculates the Iinearized Hessian times a 
vector: DfDf Aw where Df is the Jacobian of the network output (vector) with respect 
to the weights, and  indicates a tensor product. 
Using Curvature Information for Fast Stochastic Search 609 
1%=0.1 I 
I 2 3 4 ,5 
a) Log(t) b) 
Log( E[ Ivl 2 ] ) 
-1 
-2 
-3 
Log( E[ Ivl 2 ] ) 
'1 ______.._;... I B--adaptive I 
-2 =u'u ---x i. 
-3 =0'01 
1 2 Log(t)3 4 
Figure 1: 2-D LMS Simulations: Behavior of log(E[Iv12]) over an ensemble of 1000 net- 
2 
works with A1 = .4 and A] = 4, a = 1. a) P0 = 0.1 with various p. Dashed curve 
corresponds to adaptive momentum. b) p adaptive for various p0. 
propagation is required for this calculation. Thus, to compute the entire weight up- 
date requires two forward-backward propagations, one for the gradient calculation 
and one for computing t Awt. 
The only constraint on p0 is that P0 < 1/,X,a. We use the on-line algorithm 
developed by LeCun, Simard, and Pearlmutter [13] to find the largest eigenvalue 
prior to the start of training. 
3 Examples 
In the following two subsections we examine the behavior of annealed learning with 
adaptive momentum on networks previously trained to a point close to an optimum, 
where the noise dominates. We look at very simple linear nets, large linear nets, and 
a large nonlinear net. In section 3.3 we couple adaptive momentum with automatic 
switching from constant to annealed learning. 
3.1 Linear Networks 
We begin with a simple 2-D LMS network. Inputs xt are gaussian distributed with 
zero mean and the targets d at each timestep t are dt= w,Txt + et where et is zero 
mean gaussian noise, and w, is the optimal weight vector. The weight error at time 
t is just v m COt -- CO,. 
Figure i displays results for both constant and adaptive momentum with averages 
computed over an ensemble of 1000 networks. Figure (la) shows the decay of E[[vl 2] 
for p0 = 0.1 and various values of/7. As momentum is increased, the convergence 
rate increases. The optimal scalar momentum parameter is/7  (1 -PoAmin) = .96. 
Adaptive momentum achieves essentially the same rate of convergence without prior 
knowledge of the Hessian. 
Figure lb shows the behavior of E[Ivl 2] for various p0 when adaptive momentum 
is used. One can see that after a few hundred iterations the value of E[Ivl 2] is 
independent of P0 (in all cases P0 < 1/Ama < Pcrit ). 
Figure 2 shows the behavior of the misadjustment (mean squared error in ex- 
cess of the optimum I for a 4-D LMS problem with a large condition number 
p --- ,ma/,mi, = 10 . We compare 3 cases: 1) the optimal learning rate matrix 
P0 = 7-/-x without momentum, 2) P0 = .5 with the optimal constant momentum 
matrix  = I- p07/, and 3) P0 = .5 with the adaptive momentum. All three 
cases show similar behavior, showing the efficacy with which the matrix momentum 
610 G. B. Orr and T. K. Leen 
Misadj 
10.  " p'O = H'I' j = 0 
',_' adapt 
0.1 .... 
10. 100. 1000. 10000.100000. 6 
t 1.10 
Misadj .026 
O.Ol . 
0.001 Ia(a;",,,, 
0.0001 ", 
10. 100. 1000. 10000. 5 6 
t 10 10 
Figure 2: 4-D LMS with p = 105: Plot 
displays misadjustment. Annealing starts at 
t = 10. For ftdp and ft = I --P0, we use 
P0 = .5. Each curve is an average of 10 runs. 
Figure 3: Linear Prediction: p0 = 0.26. 
Curves show constant learning rate, anneal- 
ing started at t = 50 without momentum, 
and with adaptive momentum. 
mocks up the optimal learning rate matrix P0 = 7-/-x, and lending credence to the 
stochastic estimate of the Hessian used in adaptive momentum. 
We next consider a large linear prediction problem (128 inputs, 16 outputs and 
eigenvalues ranging from 1.06 x 10 -s to 19.98 - condition number p = 1.9 x 100) s. 
Figure 3 displays the misadjustment for 1) annealed learning with 
2) annealed learning with  -- 0, and 3) constant learning rate (for comparison 
purposes). As before, we have first trained (not shown completely) at constant 
learning rate P0 -' .026 until the MSE and the weight error have leveled out. As 
can be seen aclapt does much better than annealing without momentum. 
3.2 Phoneme Classification 
We next use phoneme classification as an example of a large nonlinear problem. 
The database consists of 9000 phoneme vectors taken from 48 50-second speech 
monologues. Each input vector consists of 70 PLP coefficients. There are 39 target 
classes. The architecture was a standard fully connected feedforward network with 
71 (includes bias) input nodes, 70 hidden nodes, and 39 output nodes for a total of 
7700 weights. 
We first trained the network with constant learning rate until the MSE flattened 
out. At that point we either annealed without momentum, annealed with adaptive 
momentum, or used ASTC (which attempts to adjust P0 to be above Pcit - see 
next section). When annealing was used without momentum, we found that the 
noise went away, but the percent of correctly classified phonemes did not improve. 
Both the adaptive momentum and ASTC resulted in significant increases in the 
percent correct, however, adaptive momentum was significantly better than ASTC. 
In the next section, we examine this problem in more detail. 
3.3 Switching on Annealing 
A complete algorithm must choose an appropriate point to change from constant p 
search to annealed learning. We use Moody and Darken's ASTC algorithm [4, 14] 
to accomplish this. ASTC measures the roughness of trajectories, switching to 1It 
annealing when the trajectories become very rough - an indication that the noise 
in the updates is dominating the algorithm's behavior. In an attempt to satisfy 
5 Prediction of a 4 x 4 block of image pixels from the surrounding 8 blocks. 
Using Curvature Information for Fast Stochastic Search 611 
5O 
4O 
30 
o 
220 
lO 
Adaptive M'nt um 
5O 
4O 
30 
o 
20 
10 
3 1 O0 1000 100000 0 20 50 1 O0 
t a) b) epoch 
Figure 4: Phoneme Classification: Percent Correct a) ASTC without momentum (bottom 
curve) and adaptive momentum (top) as function of the number of input presentations. 
b) Conjugate Gradient Descent - one epoch equals one pass through the data, i.e. 9000 
input presentations. 
!ao > lait, ASTC can also switch back to constant learning when trajectories 
become too smooth. 
We return to the phoneme problem using three different training methods: 1) ASTC 
without momentum (with switching back and forth between annealed and constant 
learning), 2) adaptive momentum with annealing turned on when ASTC first sug- 
gests the transition (but no subsequent return to constant learning rate), and 3) 
standard conjugate gradient descent. 
Figure 4a compares ASTC (no momentum) with adaptive momentum (using ASTC 
to turn on annealing). After annealing is turned on, the classification accuracy 
improves far more quickly with adaptive momentum. 
Figure 4b displays the classification performance as a function of epoch using con- 
jugate gradient descent (CGD). After 100 passes through the 9000 example dataset 
(900,000 presentations), the classification accuracy is 39.6%, or 7% below adaptive 
momentum's performance at 100,000 presentations. Note also that adaptive mo- 
mentum is continuing to improve the optimization, while the ASTC and conjugate 
gradient descent curves have flattened out. 
The cpu time used for the optimization was about the same for the CGD and adap- 
tive momentum algorithms. It thus appears that our implementation of adaptive 
momentum costs about 9 times as much per pattern as CGD. We believe that the 
performance can be improved. Our complexity analysis [8] predicts a 3:1 cost ratio, 
rather than 9:1, and optimization comparable to that applied to the CGD code  
should enhance the run-time performance of CGD. 
For this problem, the performance of the two algorityms on the test set (no shown 
on graph) is not much different (31.7% for CGD versus 33.4% for adaptive momen- 
tum. Howver we are concerned here with the efficiency of the optimization, not 
generalization performance. The latter depends on dataset size and regularization 
techniques, which can easily be combined with any optimizer. 
4 Summary 
We have presented an efficient O(n) stochastic algorithm with few adjustable param- 
eters that achieves fast convergence during the converge phase for both linear and 
nonlinear problems. It does this by incorporating curvature information without 
6CGD was performed using nopt written by Etienne Barnard and made available 
through the Center for Spoken Language Understanding at the Oregon Graduate Institute. 
612 G. B. Orr and T. K. Leen 
explicit computation of the Hessian. We also combined it with a method (ASTC) 
for detecting when to make the transition between search and converge regimes. 
Acknowledgments 
The authors thank Yann LeCun for his helpful critique. This work was supported 
by EPRI under grant RP8015-2 and AFOSR under grant FF4962-93-1-0253. 
References 
[1] Genevieve B. Orr and Todd K. Leen. Weight space probability densities in stochastic 
learning: II. Transients and basin hopping times. In Giles, Hanson, and Cowan, 
editors, Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA, 
1993. Morgan Kaufmann. 
[2] William Finnoff. Diffusion approximations for the constant learning rate backprop- 
agation algorithm and resistence to local minima. In Giles, Hanson, and Cowan, 
editors, Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA, 
1993. Morgan Kaufmann. 
[3] Larry Goldstein. Mean square optimality in the continuous time Robbins Monro 
procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern 
California, LA, 1987. 
[4] Christian Darken and John Moody. Towards faster stochastic gradient search. In J.E. 
Moody, S.J. Hanson, and R.P. Lipmann, editors, Advances in Neural Information 
Processing Systems . Morgan Kaufmann Publishers, San Mateo, CA, 1992. 
[5] Halbert White. Learning in artificial neural networks: A statistical perspective. Neu- 
ral Computation, 1:425-464, 1989. 
[6] J. H. Venter. An extension of the robbins-monro procedure. Annals of Mathematical 
Statistics, 38:117-127, 1967. 
[7] Todd K. Leen and Genevieve B. Orr. Optimal stochastic search and adaptive mo- 
mentum. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural 
Information Processing Systems 6, San Francisco, CA., 1994. Morgan Kaufmann Pub- 
lishers. 
[8] Genevieve B. Orr. Dynamics and Algorithms for Stochastic Search. PhD thesis, 
Oregon Graduate Institute, 1996. 
[9] Mehmet Ali Tugay and Yalcin Tanik. Properties of the momentum LMS algorithm. 
Signal Processing, 18:117-127, 1989. 
[10] John J. Shynk and Sumit Roy. Analysis of the momentum LMS algorithm. IEEE 
Transactions on Acoustics, Speech, and Signal Processing, 38(12):2088-2098, 1990. 
[11] W. Wiegerinck, A. Komoda, and T. Heskes. Stochastic dynamics of learning with 
momentum in neural networks. Journal of Physics A, 27:4425-4437, 1994. 
[12] Barak A. Pearlmutter. Fast exact multiplication by the hessian. Neural Computation, 
6:147-160, 1994. 
[13] Yann LeCun, Patrice Y. Simard, and Barak Pearlmutter. Automatic learning rate 
maximization by on-line estimation of the hessian's eigenvectors. In Giles, Hanson, 
and Cowan, editors, Advances in Neural Information Processing Systems, vol. 5, San 
Mateo, CA, 1993. Morgan Kaufmann. 
[14] Christian Darken. Learning Rate Schedules for Stochastic Gradient Algorithms. PhD 
thesis, Yale University, 1993. 
