Diffusion Approximations for the 
Constant Learning Rate 
Backpropagation Algorithm and 
Resistence to Local Minima 
William Finnoff 
Siemens AG, Corporate Research and Development 
Otto-Hahn-Ring 6 
8000 Munich 83, Fed. Rep. Germany 
Abstract 
In this paper we discuss the asymptotic properties of the most com- 
monly used variant of the backpropagation algorithm in which net- 
work weights are trained by means of a local gradient descent on ex- 
amples drawn randomly from a fixed training set, and the learning 
rate  of the gradient updates is held constant (simple backpropa- 
gation). Using stochastic approximation results, we show that for 
 -+ 0 this training process approaches a batch training and pro- 
vide results on the rate of convergence. Further, we show that for 
small  one can approximate simple back propagation by the sum 
of a batch training process and a Gaussian diffusion which is the 
unique solution to a linear stochastic differential equation. Using 
this approximation we indicate the reasons why simple backprop- 
agation is less likely to get stuck in local minima than the batch 
training process and demonstrate this empirically on a number of 
examples. 
1 INTRODUCTION 
The original (simple) backpropagation algorithm, incorporating pattern for pattern 
learning and a constant learning rate ]  (0, oo), remains in spite of many real (and 
459 
460 Finnoff 
imagined) deficiencies the most widely used network training algorithm, and a vast 
body of literature documents its general applicability and robustness. In this paper 
we will draw on the highly developed literature of stochastic approximation the- 
ory to demonstrate several asymptotic properties of simple backpropagation. The 
close relationship between backpropagation and stochastic approximation methods 
has been long recognized, and various properties of the algorithm for the case of 
decreasing learning rate rln+ < r/,, n  N were shown for example by White 
[W,89a], [W,89b] and Darken and Moody [D,M,91]. ttornik and Kuan tit,K,91] 
used comparable results for the algorithm with constant learning rate to derive 
weak convergence results. 
In the first part of this paper we will show that simple backpropagation has the 
same asymptotic dynamics as batch training in the small learning rate limit. As 
such, anything that can be expected of batch training can also be expected in simple 
backpropagation as long as the learning rate of the algorithm is very small. In the 
special situation considered here (in contrast to that in [H,K,91]) we will also be 
able to provide a result on the speed of convergence. As such, anything that can be 
expected of batch training can also be expected in simple backpropagation as long 
as the learning rate of the algorithm is very small. In the next part of the paper, 
Gaussian approximations for the difference between the actual training process and 
the limit are derived. It is shown that this difference, (properly renormalized), con- 
verges to the solution of a linear stochastic differential equation. In the final section 
of the paper, we combine these results to provide an approximation for the simple 
backpropagation training process and use this to show why simple backpropagation 
will be less inclined to get stuck in local minima than batch training. This ability 
to avoid local minima is then demonstrated empirically on several examples. 
2 NOTATION 
Define the. parametric version of a single hidden layer network activation function 
with h inputs, m outputs and q hidden units 
f: R d x R h  R'n,(O,x)  (fx(O,x),...,fm(O,x)), 
by setting for a:  R h,  = (a:, ..., a:t, 1), 0 = (7:,/:) and u = 1, ..., m, 
where "denotes the transpose of  and d = m(q + 1) + q(h + 1) denotes the 
number of weights in the network. Let ((Yk, Xk))=L...,' be a set of training exam- 
ples, consisting of targets (Y)=,...,, and inputs (X)k=L...O.. We then define the 
parametric error function 
I]Y- f(O,)11 
and for every 0, the cummulative gradient 
Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm 461 
1 k ou 
h(o) = - 
k=l 
, Xk, 0). 
3 APPROXIMATION WITH THE ODE 
We will be considering the asymptotic properties of network training processes 
induced by the starting value 00, the gradient (or direction) function -.- the 
learning rate r/and an infinite training sequence (Yn,Xn)nEN, where each (yn,xn) 
example is drawn at random from the set ((Y1, X1), ...,(YT, XT)). One defines the 
discrete parameter process 0 = 0" = (0n)nEZ+ of weight updates by setting 
0o for n = 0 
0v On_l) fornN 
o" = o."- (y,.,z., 
and the corresponding continuous parameter process (0n/))[0,oo), by setting 
OU O,  
o"(t) = o"- (t - (n- 1).)7(v.,x., ._1) 
for t  [(n - 1)r/, nr/), n  N. The first question that we will investigate is that 
of the 'small learning rate limit' of the continuous parameter process 0 n i.e. the 
limiting properties of the family 0 nfor r/--+ 0. We show that the family of (stochas- 
tic) processes (0")n0 converges with probability one to a limit process , where  
denotes the solution to the cummulative gradient equation, 
a(t) = Oo + h(a())&. 
Here, for 00 = a = constant, this solution is deterministic. This result corresponds 
to a 'law of large numbers' for the weight update process, in which the small learning 
rate (in the limit) averages out the stochastic fluctuations. 
Central to any application of the stochastic approximation results is the derivation 
0v and h. That is the subject of 
of local Lipschitz and linear growth bounds for  
the following, 
Lemma(3.1) i) There exists a constant K > 0 so that 
sup (y, :, o) < K( + I1011) 
(,)  - 
and 
IIh(O)11 _< K(X + IIO11). 
ii) For every G > 0 there exists a constant L6 so that for any 0, '  [-G, G] a, 
462 Finnoff 
and 
0) < 
(,) (y, x, - - 
37 - 
lib(o) - h(O)ll <_ Lal10 - 011. 
Proof: The calculations on which this result are based are tedious but straightfor- 
ward, making repeated use of the fact that products and sums of locally Lipschitz 
continuous functions are themselves locally Lipschitz continuous. It is even possible 
to provide explicit values for the constants given above. 
Denoting with P (resp. E) the probability (resp. mathematical expectation) of the 
processes defined above, we can_present the results on the probability of deviations 
of the process 0 from the limit 0. 
Theorem(3.2) Let r, 5  (0, o). Then there exists a constant Br (which 
doesn't depend on r/) so that 
i) E (sup,<r liOn(s)- (s)l[ ') <_ Brrl. 
ii)P (sup,<r IlO(s) -(s)l I > 5) _< bB,.rl. 
Proof: The first part of the proof requires that one finds bounds for 0nt) and (t) 
for t 6 [0, r]. This is accomplished using the results of Lemma(3.1) and Gronwall's 
Lemma. This places r/independent bounds on Br. The remainder of the proof uses 
Theorem(9), 1.5, Part II of [Ben,Met,Pri,87]. The required conditions (A1), (A2) 
follow directly from our hypotheses, and (A3), (A4) from Lemma(3.1). Due to the 
boundedness of the variables (Yn, xn)nN and 00, condition (A5) is trivially fulfilled. 
It should be noted that the constant Br is usually dependent on r and may indeed 
increase exponentially (in r) unless it is possible to show that the training process 
remains in some bounded region for t --, oo. This is not necessarily due exclusively 
to the difference between the stochastic approximation and the discrete parameter 
cummulative gradient process, but also to the the error between the discrete (Euler 
approximation) and continuous parameter versions of (3.3). 
4 GAUSSIAN APPROXIMATIONS 
In this section we will give a Gaussian approximation for the difference between 
the training process 0 n and the limit . Although in the limit these coincide, for 
r/ > 0 the training process fluctuates away from the limit in a stochastic fashion. 
The following Gaussian approximation provides an estimate for the size and nature 
Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm 463 
of these fluctuations depending on the second order statistics (variance/covariance 
matrix) of the weight update process. Define for any t  [0, oo), 
o"(t) - 
ovi, ,0), (resp. h i 
Further, for i = 1, ..., dwe denote with  [y, z (0)) the i-th coordinate 
vector of x, = 
-g(y, 0) (resp. h(O)). Then define for i,j 1,...,d, 0  R a 
= 7 \ oo . 
Thus, for any n  N, 0  R a, R(O) represents the covariance matrix of the random 
elements -q-g(y,, xn, 0). We can then define for the symmetric matrix R(0) a further 
R aXa valued matrix RS(O) with the property that R(0)= R(O)(R(O)) T. 
The following result represents a central limit theorem for the training process. This 
permits a type of second order approximation of the fluctuations of the stochastic 
training process around its deterministic limit. 
Theorem(4.1): Under the assumptions given above, the distributions of the 
processes O" r/> 0, converge weakly (in the sense of weak convergence of measures) 
for r/-- 0 to a uniquely defined measure {0}, where 0 denotes the solution to the 
following stochastic differential equation 
t Oh -  jo  
= (O(s))O(s)as + 
where W denotes a standard d-dimensional Brownian motion (i.e. with covariance 
matrix equal to the identity matrix). 
Proof: The proof here uses Theorem(7), 4.4, Part II of [Ben,Met,Pri,87]. As 
noted in the proof of Theorem(3.2), under our hypotheses, the conditions (A1)- 
(.A5) are fulfilled. Define for i,j = 1,...,d, (y,z)  I m+t, 0  R d, wij(y,z,O) = 
p' (y, x, O)p 5 (y, r, O)-h i (O)hJ (0), and v = p. Under our hypotheses, h has continuous 
first and second order derivatives for all 0  R d and the function R = (Rff)ij=l,...,d 
as well as W = (wiY)i,i=x,...,d fulfill the remaining requirements of (AS) as follows: 
(A8)i) and (A8)ii) are trivial consequence of the definition of R and W. Finally, 
setting Pa = p4 = 0 and p = 1, (AS)iii) then can be derived directly from the 
definitions of W and R and Lemma(5.1)ii). 
5 RESISTENCE TO LOCAL MINIMA 
In this section we combine the results of the two preceding sections to provide 
a Gaussian approximation of simple backpropagation. Recalling the results and 
464 Finnoff 
notation of Theorem(3.2) and Theorem(4.1) we have for any t  [0, oo), 
o,(t) = + + 
Using this approximation we have: 
-For 'very small' learning rate r/, simple backpropagation and batch learning will 
produce essentially the same results since the stochastic portion of the process 
(controlled by r/) will be negligible. 
-Otherwise, there is a nonnegligible stochastic element in the training process which 
can be approximated by the Gaussian diffusion 0. 
-This diffusion term gives simple backpropagation a 'quasi-annealing' character, in 
which the cummulative gradient is continuously perturbed by the Gaussian term , 
allowing it to escape local minima with small shallow basins of attraction. 
It should be noted that the rest term will actually have a better convergence rate 
than the indicated o(r/). The calculation of exact rates, though, would require a 
generalized version of the Berry-Essen theorem. To our knowledge, no such results 
are available which would be applicable to the situation described above. 
6 EMPIRICAL RESULTS 
The imperviousness of simple backpropagation to local minima, which is part of 
neural network 'folklore' is documented here in four examples. A single hidden 
layer feedforward network with b = tanh, ten hidden units n and one output was 
trained with both simple backpropagation and batch training using data gener- 
ated by four different models. The data consisted of pairs (yi,:i), il,...,T, 
T  N with targets Yi ( R and inputs xi = (x,...,x/)  [-1,1] , where 
yi = g((x,...,xi))+ ui, for j,K  N. The first experiment was based on 
an additive structure g having the following form with j = 5 and K = 10, 
5 
g((x, ..., x)) = -]k=x sin(a:x/), a  R. The second modelhad a product struc- 
ture g with j = 3, K = 10 and g((x, ..., x/a)) = I-I=x x/, a:  R. The third struc- 
ture considered was constructed with j = 5 and K = 10, using sums of Radial Basis 
Functions (RBF's)as follows: g((x, x)) = s (,5 (e',?) ) 
..., (-1) exp  
These points were chosen by independent drawings from a uniform distribution on 
[-1, 1] 5. The final experiment was conducted using data generated by a feedforward 
network activation function. For more details concerning the construction of the 
examples used here consult [F,H,Z,92]. 
For each model three training runs were made using the same vector of starting 
weights for both simple backpropagation and batch training. As can be seen, in all 
but one example the batch training process got stuck in a local minimum producing 
much worse results than those found using simple backpropagation. Due to the 
wide array of structures used to generate data and the number of data sets used, it 
would be hard to dismiss the observed phenomena as being example dependent. 
Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm 465 
Error x I0 '3 
Net 
800.00 
....... i simple BP 
........ 6 I ' atch L. 
 '';: .......... 4. ................................................... 
L._ ..t,_ .........  ..............................  ................. "- 
) Epochs 
0.00 !00.00 200.00 
Error x 10 ': 
Product Mapping 
0.() 200.X) 200. 'O 300.00 z()., '() 
smpie BP 
Batch L. 
Eor x i0'-' 
Sums of RBF's 
0.00 iO0.00 200.00 
smple P 
Batch .. 
Sums of sin's 
Error x 10 -3 
500.00 
0,00 I00.00 200.00 "00.?3 
:4mpie BP 
3arch 
EIochs 
466 Finnoff 
7 REFERENCES 
[Ben,Met,Pri,87] Benveniste, A., Mdtivier, M., Priouret, P., Adaptive Algorithms 
and Stochastic Approa:imations, Springer Verlag, 1987. 
[Bou,85] Bouton C., Approximation Gaussienne d'algorithmes stochastiques a dy- 
namique Markovienne. Thesis, Paris VI, (in French), 1985. 
[Da,M,91] Darken C. and Moody J., Note on learning rate schedules for stochastic 
optimization, inAdvances in Neural Information Processing Systems 3, Lipp- 
mann, R. Moody, J., and Touretzky, D., ed., Morgan Kaufmann, San Mateo, 
1991. 
[F,H,Z,92] Improving model selection by nonconvergent methods. To appear in 
Neural Networks. 
[H,K,91], Hornik, K. and Kuan, C.M., Convergence of Learning Algorithms with 
constant learning rates, IEEE 7Yans. on Neural Networks 2, pp. 484-489, 
(1991). 
[Wh,89a] White, H., Some asymptotic results for learning in single hidden-layer 
feedforward network models, Jour. Amer. Star. Ass. 84, no. 408, p. 1003- 
1013, 1989. 
[W,89b] White, H., Learning in artificial neural networks: A statistical perspective, 
Neural Computation 1, p.425-464, 1989. 
