606 Ahmad, Tesauro and He 
Asymptotic Convergence of Backpropagation: 
Numerical Experiments 
Subutai Ahmad 
ICSI 
1947 Center St. 
Berkeley, CA 94704 
Gerald Tesauro 
IBM Watson Labs. 
P.O. Box 704 
Yorktown Heights, NY 
10598 
Yu He 
Dept. of Physics 
Ohio State Univ. 
Columbus, OH 43212 
ABSTRACT 
We have calculated, both analytically and in simulations, the rate 
of convergence at long times in the backpropagation learning al- 
goritlun for networks with and without hidden units. Our basic 
finding for units using the standard sigrnoid transfer function is 1It 
convergence of the error for large t, with at most logarithmic cor- 
rections for networks with hidden units. Other transfer functions 
may lead to a slower polynomial rate of convergence. Our analytic 
calculations were presented in (Tesauro, He & Ahamd, 1989). Here 
we focus in more detail on our empirical measurements of the con- 
vergence rate in numerical simulations, which corm our analytic 
results. 
1 INTRODUCTION 
Backpropagation is a popular learning algoritlun for multilayer neural networks 
which minimizes a global error function by gradient descent (Werbos, 1974: Parker, 
1985; LeCun, 1985; Pumelhart, Hinton & Williams, 1986). In this paper, we ex- 
amine the rate of convergence of backpropagation late in learning when all of the 
errors are small. In this limit, the learning equations become more amenable to an- 
alytic study. By expanding in the small differences between the desired and actual 
output states, and retaining only the dominant terms, one can explicitly solve for 
the leading-order behavior of the weights as a function of time. This is true both for 
Asymptotic Convergence of Backpropagation: Numerical Experiments 607 
single-layer networks, and for multilayer networks containing hidden units. We con- 
fLrm our analysis by empirical measurements of the convergence rate in numerical 
simulations. 
In gradient-descent learning, one minimizes an error function E according to: 
OE 
a : - a () 
where A is the change in the weight vector at each time step, and the learning 
rate  is a small numerical constant. The convergence of equation 1 for single-layer 
networks with general error functions and transfer functions is studied in section 2. 
In section 3, we examine two standard modifications of gradient-descent: the use 
of a "margin" variable for turning off the error backpropagation, and the inclusion 
of a "momentum" term in the learning equation. In section 4 we consider networks 
with hidden units, and in the final section we summarize our results and discuss 
possible extensions in future work. 
2 CONVERGENCE IN SINGLE-LAYER NETWORKS 
The input-output relationship for single-layer networks takes the form: 
yp: g/'-S) (2) 
where  represents the state of the input units for pattern p,  is the real-valued 
weight vector of the network, g is the input-output transfer function (for the moment 
unspecified), and yp is the output state for pattern p. We assume that the transfer 
function approaches 0 for large negative inputs and 1 for large positive inputs. 
For convenience of analysis, we rewrite equation 1 for continuous time as: 
 ar _ ar ay _ - ] ar g'(ap) () 
- - 
where E is the individual error for pattern p, hp = .g" is the total input activation 
of the output unit for pattern p, and the summation over p is for an arbitrary subset 
of the possible training patterns. E is a function of the difference between the 
actual output yp and the desired output dp for pattern p. Examples of common 
error functions are the quadratic error El = (Yl - d)2 d the "cross-entropy" 
e:or (Hton, 1987) Ep: dp logvp + (1 - dp) log(1 - 
stead of solving equation 3 for the weights ectly, it is more convenient to work 
with the outputs yp. The outputs evolve accordg to: 
, a 0'().  () 
q 
Let us now consider the situation late in learning when the output states are ap- 
proaching the desired values. We degree new variables r = Ylo - d, and assume 
608 Ahmad, Tesauro and He 
G .83 
-G.33 
-1 ,SO 
-2.67 
-3.83 
-S.00 I I I I,, I q 
0.00 1.67 3.33 S.00 6.67 8.33 10.00 
Figure 1: Plots of In(error) rs. In(epochs) for single-layer networks learning the 
majority function using standard backpropagation without momentum. Four differ- 
ent learning runs starting from different random initial weights are shown. In each 
case, the asymptotic behavior is approximately E ,.,, l/t, as seen by comparison 
with a reference llne of slope -1. 
that qp is small for all p. For reasonable error functions, the individual errors Ep 
will go to zero as some power of q, i.e., E ,,, /. (For the quadratic error, 7 = 2, 
and for the cross-entropy error, 7 -' 1.) Similarly, the slope of the transfer function 
should approach zero as the output state approaches 1 or 0, and for reasonable 
transfer functions, this will again follow a power law, i.e., g'(h) ,,, . Using the 
definitions of q, 7 and/5, equation 4 becomes: 
The absolute value appears because g is a non-decreasing function. Let q, be the 
slowest to approach zero among all the q's. We then have for ,: 
Upon integrating we obtain 
q, ~ t -1/(2/+-2); E ~ q,'r ~ t-/(2/+-2) 
(?) 
When/5 = 1, i.e., ' ,,,, , the error function approaches zero like l/t, independent 
of 7. Since/ = 1 for the standard sigrnoid function g(z) = (1 + e-')- 1, one expects 
to see 1It behavior in the error function in this case. This behavior was in fact first 
Asymptotic Convergence of Backpropagation: Numerical Experiments 609 
seen in the numerical experiments of (Ahmad, 1988; Ahmad & Tesauro, 1988). The 
behavior was obtained at relatively small t, about 20 cycles through the traJxting 
set. Figure 1 illustrates this behavior for single-layer networks learning a data set 
containing 200 randomly chosen instances of the majority function. In each case, 
the behavior at long times in this plot is approximately a straight line, indicating 
power-law decrease of the error. The slopes are in each case within a few percent 
of the theoretically predicted value of-1. 
It turns out that 1 = 1 gives the fastest possible convergence of the error function. 
This is because 1 < 1 yields transfer functions which do not saturate at finite values, 
and thus are not allowed, while 1 > 1 yields slower convergence. For example, if 
we take the transfer function to be g(x) = 0.511 + (2/r)tan -1 x], then : 2. In 
this case, the error function will go to zero as E ~ t -/(+2). In particular, when 
3 MODIFICATIONS OF GRADIENT DESCENT 
One common modification to strict gradient-descent is the use of a "margin" variable 
/ such that, if the difference between network output and teacher signal is smaller 
than /, no error is backpropagated. This is meant to prevent the network from 
devoting resources to making its output arbitrarily close to the teacher signal, which 
is usually unnecessary. It is clear from the structure of equations 5, 6 that the margin 
will not affect the basic 1 ]t error convergence, except in a rather trivial way. When 
a margin is employed, certain driving terms on the right-hand side of equation 5 
will be set to zero as soon as they become small enough. However, as long as some 
non-zero driving terms are present, the basic polynomial solution of equation 7 will 
be unaltered. Of course, when all the driving terms disappear because they are all 
smaller than the margin, the network will stop learning, and the error will remain 
constant at some positive value. Thus the prediced behavior is 1It decrease in the 
error followed eventually by a rapid transition to constant non-zero error. This 
agrees with what is seen numerically in Figure 2. 
Another popular generalization of equation 1 includes a "momentum" term: 
OE 
a(t) = + 1) 
In continuous time, this takes the form: 
; + (i-) = -- 
(s) 
OE 
q_zrning this into an equation for the evolution of outputs gives: 
II p I 
, oy, g (lO) 
Once again, exapanding yp, Ep and g' in small p yields a second-order differential 
equation for p in terms of a sum over other q. As in equation 6, the sum will be 
610 Ahmad, Tesauro and He 
0.83 
-0.33 
.SO 
-2.67 
-3,83 
-S.OG 
0.2 
0.1 , 
0,0.025 
: I I I I 
.00 1.67 3.33 S.00 6.67 8.33 10.00 
Figure 2: Plot of In(error) vs. In(epochs) for various values of margin variable 
as indicated. In each case there is a 1/t decrease in the error followed by a sudden 
transition to constant error. Tlds transition occnzs earlier for larger values of . 
controlled by some dominant term r, and the equation for this term is: 
- 1  2 r), 2p +,- 1 
where C1, C2 and Ca are numerical constants. For polynomial solutions 
the first two terms are of order t '-2, and can be neglected relative to the third term 
which is of order t - 1 The resulting equation thus has exactly the same form as 
in the zero momentum case of section 2, and therefore the rate of convergence is 
the same as in equation 7. This is demonstrated numerically in Figure 3. We can 
see that the error behaves as 1It for large t regardless of the value of momentum 
constant c. Furthermore, although it is not required by the analytic theory, the 
numerical prefactor appears to be the same in each case. 
Finally, we have also considered the effect on convergence of schemes for adaptively 
altering the learning rate constant e. It was shown analytically in (Tesauro, He & 
Ahmad, 1989) that for the scheme proposed by Jacobs (1988), in which the learning 
rate could in principle increase linearly with time, the error would decrease as 1It 2 
for sigmoid units, instead of the 1It result for fixed e. 
4 CONVERGENCE IN NETWORKS WITH HIDDEN UNITS 
We now consider networks with a single hidden layer. In (Tesauro, He & Ahmad, 
1989), it was shown that if the hidden units saturate late in learning, then the 
convergence rate is no different from the single-layer rate. This should be typical 
Asymptotic Convergence of Backpropagation: Numerical Experiments 611 
2.00 
0.63 
-0.33 
.SO* 
-2.67' 
-3.8; 
: t I I I q 
0.00 1.67 3.33 5.00 6.67 8.33 10.00 
Figure a: Plot of In(error) vs. In(epochs) for single-layer networks learning the 
majority function, with momentum constant a = 0, 0.25, 0.5, 0.75, 0.99. Each run 
starts from the same random initial weights. Asymptotic 1/t behavior is obtained 
in each case, with the same numerical prefactor. 
of what usually happens. However, assuming for purposes of argument that the 
hidden units do not saturate, when one goes through a small r/ expansion of the 
learning equation, one obtains a coupled system of equations of the following form: 
, ~ .+*-[ + n 1 (2) 
where 11 represents the magnitude of the second layer weights, and for convenience 
all indices have been suppressed and all terms of order 1 have been written simply 
as 1. 
For/9 > 1, this system has polynomial solutions of the form r/,,, t', 11 ~ t A, with 
z = -a/(a, + 4/9 - 4) ana x = z(, +/9 - - It is interesting to note that these 
solutions converge slightly faster than in the single-layer case. For example, with 
7 = 2 and/ = 2, r/,,, t -a/ in the multilayer case, but as shown previously, r/goes 
to zero only as t-/4 in the single-layer case. We emphasize that this slight speed-up 
will only be obtaLued when the hidden unit states do not saturate. To the extent 
that the hidden units saturate and their slopes become small, the convergence rate 
will return to the single-layer rate. 
When/9 = 1 the above polynomial solution is not possible. Instead, one can verify 
that the following is a self-consistent leading order soh6on to equations 12, 13: 
rl ~ t-l*ln-21a*t (14) 
612 Ahmad, Tesauro and He 
5.00 
2.50 
0.00 
-2.50 
-S.00 
-7.50 
0 Hidden Units 
3 Hidden Units 
10 Hidden Units 
50 Hidden Units 
-tO.00 I i I I I  
2 3 5 6 7 9 tO 
Figure 4: Plot of In(error) rs. In(epochs) for networks with varying numbers of 
hidden units (as indicated) learning majority function data set. Approximate 
behavior is obtained in each case. 
fl ~ In/at (15) 
Recall that in the single-layer ease, r/~ -/. Therefore, the effect of multiple layers 
could provide at most only a logaritlunie speed-up of convergence when the hidden 
units do not saturate. For practical purposes, then, we expect the convergence of 
networks with hidden units to be no different empiricaLly rom networks without 
hidden units. This is in fact what our simulations find, as illustrated in Figure 4. 
5 DISCUSSION 
We have obtaLed results for the asymptotic convergence of gradient-descent learn- 
ing which are valid for a wide variety of error functions atd transfer functions. We 
typically expect the same rate of convergence to be obtained regardless of whether 
or not the network has hidden units. However, it may be possible to obtain a slight 
polynomial speed-up when/9 > 1 or a logarithmic speed-up when/9 -- 1. We point 
out that in all cases, the sigmoid provides the maximum possible convergence rate, 
and is therefore a "good" transfer function to use in that sense. 
We have not attempted analysis of networks with multiple layers of hidden units; 
however, the analysis of (Tesauro, He & Ahmad, 1989) suggests that, to the extent 
that the hidden unit states saturate and the g' factors vanish, the rate of convergence 
would be no different even in networks with arbitrary numbers of hidden layers. 
Another important finding is that the expected rate of convergence does not depend 
on the use of all 2" input patterns in the training set. The same behavior should 
Asymptotic Convergence of Backpropagation: Numerical Experiments 613 
be seen for general subsets of training data. This is also in agreemen with our 
numerical results, and with the results of (Ahamd, 1988; Ahmand & Tesauro, 1988). 
In conclusion, a combination of analysis and numerical simulations has led to insight 
into the late stages of gradient-descent learning. It might also be possible to extend 
our approach to times earlier in the learning process, when not all of the errors 
are small. One might also be able to analyze the numbers, sizes and shapes of the 
basins of attraction for gradient-descent learning in feed-forward networks. Another 
important issue is the behavior of the generalization performance, i.e., the error on 
a set of test patterns not used in training, which was not addressed in this paper. 
Finally, our analysis might provide insight into the development of new algorithms 
which might scale more favorably than backpropagation. 
leferences 
S. Ahmad. (1988) A study of scaling and generalization in neural networks. Master's 
Thesis, Univ. of Illinois at Urbana-Champaign, Dept. of Computer Science. 
S. Abroad it G. Tesauro. (1988) Scaling and generalization in neural networks: a 
case study. In D. S. Touretzky et al. (eds.), Proceedings of the 1988 Connectionist 
Models Summer School, 3-10. San Mateo, CA: Morgan Kaufmann. 
G. E. Hinton. (1987) Connectionist learning procedures. Technical Report No. 
CMU-CS-87-115, Dept. of Computer Science, Carnegie-Mellon University. 
R. A. Jacobs. (1988) Increased rates of convergence through learning rate adapta- 
tion. Neural Networks 1:295-307. 
Y. Le Cun. (1985) A learning procedure for asymmetric network. Proceedings of 
Cognitiva /Pars) 85:599-604. 
D. B. Parker. (1985) Learning-logic. Technical Report No. TR-4?, MIT Center for 
Computational Research in Economics and Management Science. 
D. E. Rumelhart, G. E. Hinton, it R. J. Williams. (1986) Learning representations 
by back-propagating errors. Nature 828:533-536. 
G. Tesauro, Y. He it S. Ahmad. (1989) Asymptotic convergence ofbackpropagation. 
Neural Computation 1:382-391. 
P. Werbos. (1974) Ph.D. Thesis, Harvard University. 
