Adaptive Back-Propagation in On-Line 
Learning of Multilayer Networks 
Ansgar H. L. West ,2 and David Saad 2 
Department of Physics, University of Edinburgh 
Edinburgh EH9 3JZ, U.K. 
aNeural Computing Research Group, University of Aston 
Birmingham B4 7ET, U.K. 
Abstract 
An adaptive back-propagation algorithm is studied and compared 
with gradient descent (standard back-propagation) for on-line 
learning in two-layer neural networks with an arbitrary number 
of hidden units. Within a statistical mechanics framework, both 
numerical studies and a rigorous analysis show that the adaptive 
back-propagation method results in faster training by breaking the 
symmetry between hidden units more efficiently and by providing 
faster convergence to optimal generalization than gradient descent. 
1 INTRODUCTION 
Multilayer feedforward perceptrons (MLPs) are widely used in classification and 
regression applications due to their ability to learn a range of complicated maps [1] 
from examples. When learning a map fo from N-dimensional inputs  to scalars ( 
the parameters {W} of the student network are adjusted according to some training 
algorithm so that the map defined by these parameters fw approximates the teacher 
fo as close as possible. The resulting performance is measured by the generalization 
error eg, the average of a suitable error measure  over all possible inputs % = {}. 
This error measure is normally defined as the squared distance between the output 
of the network and the desired output, i.e., 
e = l[fv()- fo()] a. 
(1) 
One distinguishes between two learning schemes: batch learning, where training 
algorithms are generally based on minimizing the above error on the whole set of 
given examples, and on-line learning, where single examples are presented serially 
and the training algorithm adjusts the parameters after the presentation of each 
324 A.H.L. WEST, D. SAAD 
example. We measure the efficiency of these training algorithms by how fast (or 
whether at all) they converge to an "acceptable" generalization error. 
This research has been motivated by recent work [2] investigating an on-line learn- 
ing scenario of a general two-layer student network trained by gradient descent on a 
task defined by a teacher network of similar architecture. It has been found that in 
the early stages of training the student is drawn into a suboptimal symmetric phase, 
characterized by each student node imitating all teacher nodes with the same degree 
of success. Although the symmetry between the student nodes is eventually broken 
and the student converges to the minimal achievable generalization error, the ma- 
jority of the training time may be spent with the system trapped in the symmetric 
regime, as one can see in Fig. 1. To investigate possible improvements we introduce 
an adaptive back-propagation algorithm, which improves the ability of the student 
to distinguish between hidden nodes of the teacher. We compare its efficiency with 
that of gradient descent in training two-layer networks following the framework 
of [2]. In this paper we present numerical studies and a rigorous analysis of both 
the breaking of the symmetric phase and the convergence to optimal performance. 
We find that adaptive back-propagation can significantly reduce training time in 
both regimes by breaking the symmetry between hidden units more efficiently and 
by providing faster exponential convergence to zero generalization error. 
2 DERIVATION OF THE DYNAMICAL EQUATIONS 
The student network we consider is a soft committee machine [3], consisting of 
K hidden units which are connected to N-dimensional inputs  by their weight 
vectors W- {W} (i = 1,...,K). All hidden units are connected to the linear 
output unit by couplings of unit strength and the implemented mapping is there- 
fore fw() = i= g(xi), where xi - W/. is the activation of hidden unit i and g(.) 
is a sigmoidal transfer function. The map f0 to be learned is defined by a teacher 
network of the same architecture except for a possible difference in the number of 
hidden units M and is defined by the weight vectors B = {B} (n = 1,..., M). 
Training examples are of the form (,), where the components of the input 
vectors  are drawn independently from a zero mean unit variance Gaussian dis- 
tribution; the outputs are  M 
= y,= g(y), where y = B . is the activation of 
teacher hidden unit n. 
An on-line training algorithm A is defined by the update of each weight in re- 
sponse to the presentation of an example (, ), which can take the general form 
W/+ - W/ + .Ai({/}, W,,), where {/} defines parameters adjustable by 
the user. In the case of standard back-propagation, i.e., gradient descent on the 
error function defined in Eq. (1)' A/gd(r/, W,, ) = (rl/N)5 ' with 
 = 6g'(z) = [ - fw()]gt(zf), 
where the only user adjustable parameter is the learning rate r/ scaled by 1/N. 
One can readily see that the only term that breaks the symmetry between different 
hidden units is  ' 
g (x i ), i.e., the derivative of the transfer function g(.). The fact that 
a prolonged symmetric phase can exist indicates that this term is not significantly 
different over the hidden units for a typical input in the symmetric phase. 
The rationale of the adaptive back-propagation algorithm defined below is therefore 
to alter the f-term, in order to magnify small differences in the activation between 
hidden units. This can be easily achieved by altering f(xi) to f(xi), where  
plays the role of an inverse "temperature". Varying  changes the range of hidden 
unit activations relevant for training, e.g., for  > i learning is more confined to 
Adaptive Back-Propagation in On-line Learning of Multilayer Networks 325 
small activations, when compared to gradient descent (/ = 1). The whole adaptive 
back-propagation training algorithm is therefore: 
.A abpf  5g'(lxf)  rl 
i [r/,fi, W , ,(") = - (3) 
-i 
with 5  in Eq. (2). To compare the adaptive back-propagation algorithm with 
normal gradient descent, we follow the statistical mechanics calculation in [2]. Here 
we will only outline the main ideas and present the results of the calculation. 
As we are interested in the typical behaviour of our training algorithm we average 
over all possible instances of the examples . We rewrite the update equations (3) 
in  as equations in the order parameters describing the overlaps between student 
nodes Oij = -, student and teacher nodes Rin = 'n and teacher nodes 
Tnm = Bn .Bin. The generalization error eg, measuring the typical performance, can 
be expressed in these variables only [2]. The order parameters Qij and in are the 
new dynamical variables, which are self-averaging with respect to the randomness 
in the training data in the thermodynamic limit (N  ). If we interpret the 
normalized example number a = /N  a continuous time variable, the update 
equations for the order parameters become first order coupled differential equations 
d lin 
do 
dQij 
do 
(4) 
All the integrals in Eqs. (4) and the generalization error can be calculated explicitly 
if we choose g(x) - erf(x/v) as the sigmoidal activation function [2]. The exact 
form of the resulting dynamical equations for adaptive back-propagation is similar to 
the equations in [2] and will be presented elsewhere [4]. They can easily be integrated 
numerically for any number of K student and M teacher hidden units. For the 
remainder of the paper, we will however focus on the realizable case (K -- M) and 
uncorrelated isotropic teachers of unit length T,m - 
The dynamical evolution of the overlaps Qij and Ri follows from integrating the 
equations of motion (4) from initial conditions determined by the random initial- 
ization of the student weights W. Whereas the resulting norms Qii of the student 
vector will be order O(1), the overlaps Qij between student vectors, and student- 
teacher vectors/i will be only order O(1/x/). The random initialization of the 
weights is therefore simulated by initializing the norms Qii and the overlaps Qij and 
/i from uniform distributions in the [0, 0.5] and [0, 10 -x] interval respectively. 
In Fig. i we show the difference of a typical evolution of the overlaps and the 
generalization error for fi - 12 and/ - i (gradient descent) for K - 3 and /- 0.01. 
In both cases, the student is drawn quickly into a suboptimal symmetric phase, 
characterized by a finite generalization error (Fig. le) and no differentiation between 
the hidden units of the student: the student norms Qii and overlaps Qij are similar 
(Figs. lb,ld) and the overlaps of each student node with all teacher nodes Rin are 
nearly identical (Figs. la, lc). The student trained by gradient descent (Figs. lc,ld) 
is trapped in this unstable suboptimal solution for most of the training time, whereas 
adaptive back-propagation (Figs. la,lb) breaks the symmetry significantly earlier. 
The convergence phase is characterized by a specialization of the different student 
nodes and the evolution of the overlap matrices Q and R to their optimal value T, 
except for the permutational symmetry due to the arbitrary labeling of the student 
nodes. Clearly, the choice/ - 12 is suboptimal in this regime. The student trained 
with / - i converges faster to zero generalization error (Fig. le). In order to 
optimize/ seperately for both the symmetric and the convergence phase, we will 
examine the equations of motions analytically in the following section. 
326 A.H.L. WEST, D. SAAD 
1.0 
0,8- 
0.6- 
0.2- 
0.0 
1.0 
0.8- 
0.6- 
0.2- 
0.0 
(a) 
'-'--- R11 
 R12 
R13 
I R2 
 R 
R23 
L R 
R32 
R33 ...... 
.2-' 
' ' ' I ' ' ' I ' ' ' I ' ' ' I 
0 20000 40000  60000 80000 0 
(b)  1 __ 
0 
/ 
! ......... 
I 0 
0.2 .2 
,,, ,,, ,, ,-, , ,  0.0 
0 20000 40000 ( 60000 80000 0 
.04 
Figure 1: Dynamical evolution of the 
student-teacher overlaps Rin (a,c), the 
student-student overlaps Qij (b,d), and .03- 
the generalization error (e) as a function 6g 
of the normalized example number ct .02- 
for a student with three hidden nodes 
learning an isotropic three-node teacher 
(TnrnmSnm). The learning rate r/:0.01 .01 
is fixed but the value of the inverse tem- 
perature varies (a,b): /3-12 and (c,d): 
/-1 (gradient descent). 0.0 
.0 , , 
(c) 
20000 40000  60000 
:1: 
''1'''1 
80O0O 
20000 40000  60000 
(e) ,8- 12  
0 20000 40000 60000 
3 ANALYSIS OF THE DYNAMICAL EQUATIONS 
In the case of a realizable learning scenario (K=M) and isotropic teachers 
(Trn=Srrn) the order parameter space can be very well characterized by sim- 
ilar diagonal and off-diagonal elements of the overlap matrices Q and R, i.e., 
Qiy = QSij + C(1 - 5ij) for the student-student overlaps and, apart from a rela- 
beling of the student nodes, by /in = /Sin q-q(X- 5in) for the student-teacher 
overlaps. As one can see from Fig. 1, this approximation is particularly good in the 
symmetric phase and during the final convergence to perfect generalization. 
3.1 SYMMETRIC PHASE AND ONSET OF SPECIALIZATION 
Numerical integration of the equations of motion for a range of learning scenarios 
show that the length of the symmetric phase is especially prolonged by isotropic 
teachers and small learning rates q. We will therefore optimize the dynamics (4) in 
Adaptive Back-Propagation in On-line Learning of Multilayer Networks 32 7 
the symmetric phase with respect to fi for isotropic teachers in the small r/regime, 
where terms proportional to r/2 can be neglected. The fixed point of the truncated 
equations of motion 
Q*=C*- 1 and R*=S* = QV/7  1 
2K- 1 = v/K (27r- 1) (5) 
is independent of fi and thus identical to the one obtained in [2]. However, the 
symmetric solution is an unstable fixed point of the dynamics and the small pertur- 
bations introduced by the generically nonsymmetric initial conditions will eventually 
drive the student towards specialization. 
To study the onset of specialization, we expand the truncated differential equa- 
tions to first order in the deviations q = Q- Q*, c = C- C*, r = R- R*, and 
s = S- S* from the fixed point values (5). The linearized equations of motion 
take the form dv/da = M.v, where v = (q, c, r, s) and M is a 4 x 4 matrix whose 
elements are the first derivatives of the truncated update equations (4) at the fixed 
point with respect to v. Perturbations or modes which are proportional to the 
eigenvectors vi of M will therefore decrease or increase exponentially depending on 
whether the corresponding eigenvalue ,ki is negative or positive. For the onset of 
specialization only the modes are relevant which are amplified by the dynamics, i.e., 
the ones with positive eigenvalue. For them we can identify the inverse eigenvalue 
as a typical escape time ri from the symmetric phase. 
We find only one relevant perturbation for q = c = 0 and s = -r/(K - 1). This can 
be confirmed by a closer look at Fig. 1. The onset of specialization is signaled by the 
breaking of the symmetry between the student-teacher overlaps, whereas significant 
differences from the symmetric fixed point values of the student norms and overlaps 
occur later. The escape time r associated with the above perturbation is 
r v/2K- l(2K + fi)a/2 (6) 
Minimization of r with respect [o  yields opt = 4K, i.e., the optimal  scales 
with the number of hidden units, and 
Topt: 9 2K - 1 
Trapping in the symmetric phe is therefore always inversely proportional [o 
learning rate . In the large K limit it is proportional [o the number of hid- 
den nodes K (r  2K/) for gradient descent, whereas it is independent of K 
[r for the optimized adaptive back-propagation algorithm. 
3.2 CONVERGENCE TO OPTIMAL GENERALIZATION 
In order to predict the optimal learning rate /.]opt and inverse temperature pt 
for the convergence, we linearize the full equations of motion (4) around the zero 
generalization error fixed point R* = Q* = 1 and $* = C* = 0. The matrix M 
of the resulting system of four coupled linear differential equations in q = 1 - Q, 
c = C, r = 1 - R, and s - S is very complicated for arbitrary fi, K and r/, and its 
eigenvalues and eigenvectors can therefore only be analysed numerically. 
We illustrate the solution space with K = 3 and two fi values in Fig. 2a. We find 
that the dynamics decompose into four modes: two slow modes associated with 
eigenvalues &l and ,k2 and two fast modes associated with eigenvalues ,k3 and ,4, 
which are negative for all learning rates and whose magnitude is significantly larger. 
328 A.H.L. WEST, D. SAAD 
q-'- ) 
..... '( P)  ,, 
......... x=Cx) t /I e.o 
o.o //I 
!/I 
] ',.'....:,.., p '"')...-/:..-....:.] 1.85 
"" '  I I I I I 
I'''' / 
o.o 0.s W 1.0 1.s 5 lo K so 100 s0010oo 
Figure 2: (a) The eigenva]ues A, A2 (see text) as a function of the learning rate 
K = 3 for two values of fl: fl --- 1, and fl --- flopt __ 1.8314. For comparison we 
2A2 and find that the optima] learning rate /opt is given by the condition A -- 
for flopt, but by the minimum of A for fl = 1. (b) The optimal inverse temperature 
flopt as a function of the number of hidden units K saturates for large K. 
The fast modes decay quickly and their influence on the long-time dynamics is neg- 
ligible. They are therefore excluded from Fig. 2a and the following discussion. The 
eigenvalue 2 is negative and linear in r/. The eigenvalue At is a non-linear function 
of both  and r/ and negative for small r/. For large r/, l becomes positive and 
training does not converge to the optimal solution defining the maximum learn- 
ing rate r/max as (r/max ) = 0. For all r/< r/max the generalization error decays 
exponentionally to eg* = 0. 
In order to identify the corresponding convergence time r, which is inversely pro- 
portional to the modulus of the eigenvalue associated with the slowest decay mode, 
we expand the generalization error to second order in q, c, r and s. We find that 
the mode associated with the linear eigenvalue A2 does not contribute to first order 
terms, and controls only second order term in a decay rate of 22. The learning 
rate r/pt which provides the fastest asymptotic decay rate opt of the generalization 
error is therefore either given by the condition  (r/pt) = 22(r/pt) or alternatively 
by minr/() if  > 22 at the minimum of  (see Fig. 2a). 
We can further optimize convergence to optimal generalization by minimizing the 
decay rate opt() with respect to  (see Fig. 2b). Numerically, we find that the 
optimal inverse temperature opt saturates for large K at flopt  2.03. For large K, 
we find an associated optimal convergence time rPt(fl pt)  2.90K for adaptive 
back-propagation optimized with respect to r/ and , which is an improvement 
by 17% when compared to rPt(1)~ 3.48K for gradient descent optimized with 
respect to r/. The optimal and maximal learning rates show an asymptotic 1/K 
behaviour and r]Pt(fl pt) - 4.78/K, which is an increase by 20% compared to gra- 
dient descent. Both algorithms are quite stable as the maximal learning rates, for 
which the learning process diverges, are about 30% higher than the optimal rates. 
4 SUMMARY AND DISCUSSION 
This research has been motivated by the dominance of the suboptimal symmetric 
phase in on-line learning of two-layer feedforward networks trained by gradient 
descent [2]. This trapping is emphasized for inappropriate small learning rates 
but exists in all training scenarios, effecting the learning process considerably. We 
Adaptive Back-Propagation in On-line Learning of Multilayer Networks 329 
proposed an adaptive back-propagation training algorithm [Eq. (3)] parameterized 
by an inverse temperature /, which is designed to improve specialization of the 
student nodes by enhancing differences in the activation between hidden units. Its 
performance has been compared to gradient descent for a soft-committee student 
network with K hidden units trying to learn a rule defined by an isotropic teacher 
(T, = 5,) of the same architecture. 
A linear analysis of the equations of motion around the symmetric fixed point for 
small learning rates has shown that optimized adaptive back-propagation character- 
ized by opt __ 4K breaks the symmetry significantly faster. The effect is especially 
pronounced for large networks, where the trapping time of gradient descent grows 
r cr K/r I compared to r o l/r/for opt. With increasing network size it seems to 
become harder for a student node trained by gradient descent to distinguish be- 
tween the many teacher nodes and to specialize on one of them. In the adaptive 
back-propagation algorithm this effect can be eliminated by choosing opt (X /. 
An open question is how the choice of the optimal inverse temperature is effected for 
large learning rates, where r/2-terms cannot be neglected, as unbounded increase of 
the learning rate causes uncontrolled growth of the student norms. However, the full 
equations of motion are very difficult to analyse in the symmetric phase. Numerical 
studies indicate that opt is smaller but still scales with K and yields an overall 
decrease in training time which is still significant. We also find that the optimal 
learning rate opt, which exhibits the shortest symmetric phase, is significantly lower 
in this regime than during convergence [4]. 
During convergence, independent of which algorithm is used, the time constant for 
decay to zero generalization error scales with K, due to the necessary rescaling of 
the learning rate by 1/K as the typical quadratic deviation between teacher and 
student output increases proportional to K. The reduction in training time with 
adaptive back-propagation is 17% and independent of the number of hidden units in 
contrast to the symmetric phase, where a factor K is gained. This can be explained 
by the fact that each student node is already specialized on one teacher node and 
the effect of other nodes in inhibiting further specialization is negligible. In fact, 
at first it seems rather surprising that anything can be gained by not changing the 
weights of the network according to their error gradient. The optimal setting of 
/ > 1, together with training at a larger learning rate, speeds up learning for small 
activations and slows down learning for highly activated nodes. This is equivalent 
to favouring rotational changes of the weight vectors over pure length changes to a 
degree determined by/. 
We believe that the adaptive back-propagation algorithm investigated here will 
be beneficial for any multilayer feedforward network and hope that this work will 
motivate further theoretical research into the efficiency of training algorithms and 
their systematic improvement. 
References 
[1] C. Cybenko, Math. Control Signals and Systems 2,303 (1989). 
[2] D. Saad and S. A. Solla, Phys. Rev. E 52, 4225 (1995). 
[3] M. Biehl and H. Schwarze, J. Phys. A 28, 643 (1995). 
[4] A. West and D. Saad, in preparation (1995). 
