Dynamics of Generalization in Linear Perceptrons 
Anders Krogh 
Niels Bohr Institute 
Blegdamsvej 17 
DK-2100 Copenhagen, Denmark 
John A. Hertz 
NORDITA 
Blegdamsvej 17 
DK-2100 Copenhagen, Denmark 
Abstract 
We study the evolution of the generalization ability of a simple linear perceptron
with $N$ inputs which learns to imitate a "teacher perceptron". The
system is trained on $p = \alpha N$ binary example inputs and the generalization
ability is measured by testing for agreement with the teacher on all $2^N$
possible binary input patterns. The dynamics may be solved analytically
and exhibits a phase transition from imperfect to perfect generalization
at $\alpha = 1$. Except at this point, the generalization ability approaches its
asymptotic value exponentially, with critical slowing down near the transition;
the relaxation time is $\propto (1-\sqrt{\alpha})^{-2}$. Right at the critical point,
the approach to perfect generalization follows a power law $\propto t^{-1/2}$. In
the presence of noise, the generalization ability is degraded by an amount
$\propto (\sqrt{\alpha}-1)^{-1}$ just above $\alpha = 1$.
1 INTRODUCTION 
It is very important in practical situations to know how well a neural network will 
generalize from the examples it is trained on to the entire set of possible inputs. This 
problem has been the focus of much recent work [1-11]. All this work, however,
deals with the asymptotic state of the network after training. Here we study
a very simple model which allows us to follow the evolution of the generalization 
ability in time under training. It has a single linear output unit, and the weights 
obey adaline learning. Despite its simplicity, it exhibits nontrivial behaviour: a dy- 
namical phase transition at a critical number of training examples, with power-law 
decay right at the transition point and critical slowing down as one approaches it 
from either side. 
2 THE MODEL 
Our simple linear neuron has an output $V = N^{-1/2} \sum_i w_i \xi_i$, where $\xi_i$ is the $i$th input.
It learns to imitate a teacher [1] whose weights are $u_i$ by training on $p$ examples of
input-output pairs $(\xi^\mu, \zeta^\mu)$ with
$$ \zeta^\mu = N^{-1/2} \sum_i u_i \xi_i^\mu \qquad (1) $$
generated by the teacher. The adaline learning equation [11] is then
$$ \dot{w}_i = \frac{1}{\sqrt{N}} \sum_{\mu=1}^{p} \left( \zeta^\mu - V^\mu \right) \xi_i^\mu . \qquad (2) $$
By introducing the difference between the teacher and the pupil,
$$ v_i = u_i - w_i , \qquad (3) $$
and the training input correlation matrix
$$ A_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu , \qquad (4) $$
the learning equation becomes
$$ \dot{v}_i = -\sum_j A_{ij} v_j . \qquad (5) $$
We let the example inputs $\xi_i^\mu$ take the values $\pm 1$, randomly and independently, but it
is straightforward to generalize to any distribution of inputs with
$$ \langle \xi_i \xi_j \rangle = \delta_{ij} . \qquad (6) $$
For a large number of examples ($p = O(N) \gg 1$), the resulting generalization ability
will be independent of just which $p$ of the $2^N$ possible binary input patterns we
choose. All our results will then depend only on the fact that we can calculate the 
spectrum of the matrix A. 
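As a concrete check on this setup, the matrix $A$ and the learning dynamics (5) can be simulated directly. The following minimal sketch (NumPy; the sizes $N = 200$, $\alpha = 1.5$ and the random seed are illustrative choices, not from the paper) solves $\dot{v} = -Av$ exactly in the eigenbasis of $A$ and tracks $\frac{1}{N}\sum_i v_i^2$, the squared teacher-pupil difference of the next section:

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 200, 1.5                        # illustrative sizes, not from the paper
p = int(alpha * N)

xi = rng.choice([-1.0, 1.0], size=(p, N))  # p = alpha*N random binary inputs
A = xi.T @ xi / N                          # training input correlation matrix, eq. (4)

u = rng.standard_normal(N)
u *= np.sqrt(N) / np.linalg.norm(u)        # teacher weights, normalized to length sqrt(N)

# Solve dv/dt = -A v, eq. (5), with w(0) = 0 (so v(0) = u) in the eigenbasis of A.
eps, Q = np.linalg.eigh(A)
v0 = Q.T @ u

def gen_error(t):
    """(1/N) sum_r v_r(0)^2 exp(-2 eps_r t), the generalization error of Section 3."""
    return float(np.sum(v0**2 * np.exp(-2.0 * eps * t)) / N)

print(gen_error(0.0))    # ~1.0 (tabula rasa)
print(gen_error(200.0))  # ~0 for alpha > 1: perfect generalization
```

For $\alpha < 1$ the same script plateaus at roughly $1 - \alpha$ instead of decaying to zero, anticipating the phase transition discussed below.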
3 GENERALIZATION ABILITY 
To measure the generalization ability, we test whether the output of our perceptron 
with weights wi agrees with that of the teacher with weights ui on all possible binary 
inputs. Our objective function, which we call the generalization error, is just the 
square of the error, averaged over all these inputs:
$$ F = \frac{1}{N 2^N} \sum_{\{\xi\}} \Big[ \sum_i (u_i - w_i)\, \xi_i \Big]^2 = \frac{1}{N} \sum_i v_i^2 . $$
(We used that $2^{-N} \sum_{\{\xi\}} \xi_i \xi_j$ is zero unless $i = j$.) That is, $F$ is just proportional to
the square of the difference between the teacher and pupil weight vectors. With the
$N^{-1}$ normalization factor, $F$ will then vary between 1 (tabula rasa) and 0 (perfect
generalization) if we normalize $\vec{u}$ to length $\sqrt{N}$. During learning, $w_i$ and thus $v_i$
depend on time, so $F$ is a function of $t$. The complementary quantity $1 - F(t)$
could be called the generalization ability.
In the basis where $A$ is diagonal, the learning equation (5) is simply
$$ \dot{v}_r = -\epsilon_r v_r , \qquad (7) $$
where the $\epsilon_r$ are the eigenvalues of $A$. This has the solution
$$ v_r(t) = u_r e^{-\epsilon_r t} , \qquad (8) $$
where it is assumed that the weights are zero at time $t = 0$ (we will come back to
the more general case later). Thus we find
$$ F(t) = \frac{1}{N} \sum_r u_r^2 e^{-2 \epsilon_r t} . \qquad (9) $$
Averaging over all possible training sets of size $p$, this can be expressed in terms of
the density of eigenvalues of $A$, $\rho(\epsilon)$:
$$ F(t) = \int d\epsilon \, \rho(\epsilon) \, e^{-2 \epsilon t} . \qquad (10) $$
In the following it will be assumed that the length of $\vec{u}$ is normalized to $\sqrt{N}$, so the
prefactor disappears.
For large N, the eigenvalue density is (see, e.g. [11], where it can be obtained simply 
from the imaginary part of the Green's function in eq.(57)) 
$$ \rho(\epsilon) = \frac{1}{2\pi\epsilon} \sqrt{(\epsilon_+ - \epsilon)(\epsilon - \epsilon_-)} + (1-\alpha)\, \theta(1-\alpha)\, \delta(\epsilon) , \qquad (11) $$
where
$$ \epsilon_\pm = (1 \pm \sqrt{\alpha})^2 \qquad (12) $$
and $\theta(\,)$ is the unit step function. The density has two terms: a 'deformed semicircle'
between the roots $\epsilon_-$ and $\epsilon_+$, and for $\alpha < 1$ a delta function at $\epsilon = 0$ with weight
$1 - \alpha$. The delta-function term appears because no learning takes place in the
subspace orthogonal to that spanned by the training patterns. For $\alpha > 1$ the
patterns span the whole space, and therefore the delta function is absent.
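Both pieces of (11) are easy to verify against a sampled matrix. A minimal numerical sketch (NumPy; $N = 1000$, $\alpha = 0.5$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha = 1000, 0.5
p = int(alpha * N)

xi = rng.choice([-1.0, 1.0], size=(p, N))
A = xi.T @ xi / N
eps = np.linalg.eigvalsh(A)

eps_minus = (1 - np.sqrt(alpha))**2   # band edges, eq. (12): ~0.086 and ~2.914
eps_plus = (1 + np.sqrt(alpha))**2

zero_frac = np.mean(eps < 1e-8)       # weight of the delta function at eps = 0
print(zero_frac)                      # ~ 1 - alpha = 0.5
print(eps[eps > 1e-8].min())          # ~ eps_minus
print(eps.max())                      # ~ eps_plus
```

The $p$ training patterns span only a $p$-dimensional subspace, so exactly $N - p$ eigenvalues vanish, reproducing the $(1-\alpha)\,\delta(\epsilon)$ term; the nonzero spectrum fills the band $[\epsilon_-, \epsilon_+]$.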
The results at infinite time are immediately evident. For $\alpha < 1$ there is a nonzero
limit, $F(\infty) = 1 - \alpha$, while $F(\infty)$ vanishes for $\alpha \geq 1$, indicating perfect generalization
(the solid line in Figure 1). While on the one hand it may seem remarkable
that perfect generalization can be obtained from a training set which forms an infinitesimal
fraction of the entire set of possible examples, the meaning of the result
is just that $N$ points are sufficient to determine an $(N-1)$-dimensional hyperplane
in $N$ dimensions.
Figure 2 shows $F(t)$ as obtained numerically from (10) and (11). The qualitative
form of the approach to $F(\infty)$ can be obtained analytically by inspection. For
$\alpha \neq 1$, the asymptotic approach is governed by the smallest nonzero eigenvalue $\epsilon_-$.
Thus we have critical slowing down, with a divergent relaxation time
$$ \tau = \frac{1}{2\epsilon_-} = \frac{1}{2(1-\sqrt{\alpha})^2} \qquad (13) $$
Figure 1: The asymptotic generalization error as a function of $\alpha$. The full line
corresponds to $\lambda = 0$, the dashed line to $\lambda = 0.2$, and the dotted line to $w_0 = 1$ and
$\lambda = 0$.
as the transition at $\alpha = 1$ is approached. Right at the critical point, the eigenvalue
density diverges for small $\epsilon$ like $\epsilon^{-1/2}$, which leads to the power law
$$ F(t) \propto \frac{1}{\sqrt{t}} \qquad (14) $$
at long times. Thus, while exactly $N$ examples are sufficient to produce perfect
generalization, the approach to this desirable state is rather slow. A little bit
above $\alpha = 1$, $F(t)$ will also follow this power law for times $t \ll \tau$, going over to
(slow) exponential decay at very long times ($t \gg \tau$). By increasing the training set
size well above $N$, one can achieve exponentially fast generalization.
Below $\alpha = 1$, where perfect generalization is never achieved, there is at least the
consolation that the approach to the generalization level the network does reach is
exponential (though with the same problem of a long relaxation time just below the
transition as just above it).
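These regimes can all be seen by evaluating (10) with the density (11) numerically. In the sketch below (NumPy), the substitution $\epsilon = \epsilon_- + (\epsilon_+ - \epsilon_-)\sin^2\theta$ is our own device for taming the inverse-square-root behaviour of the density near the band edges; it is not from the paper.

```python
import numpy as np

def F_of_t(t, alpha, n=20000):
    """Evaluate F(t) = integral of rho(eps) exp(-2 eps t), eqs. (10)-(11),
    by the midpoint rule after substituting eps = lo + (hi - lo) sin^2(theta)."""
    lo, hi = (1 - np.sqrt(alpha))**2, (1 + np.sqrt(alpha))**2
    dth = (np.pi / 2) / n
    theta = (np.arange(n) + 0.5) * dth
    e = lo + (hi - lo) * np.sin(theta)**2
    # rho(eps) d(eps) rewritten as w(theta) d(theta); smooth in theta
    w = ((hi - lo)**2 / np.pi) * (np.sin(theta) * np.cos(theta))**2 / e
    band = np.sum(w * np.exp(-2.0 * e * t)) * dth
    return band + max(1.0 - alpha, 0.0)   # delta function at eps = 0 for alpha < 1

print(F_of_t(0.0, 0.8))                         # ~1.0
print(F_of_t(1e4, 0.8))                         # -> 1 - alpha = 0.2
print(F_of_t(100.0, 1.0) / F_of_t(400.0, 1.0))  # -> ~2: F ~ t^(-1/2) at alpha = 1
```

The last line quadruples $t$ and the error halves, consistent with the $t^{-1/2}$ law of eq. (14).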
4 EXTENSIONS 
In this section we briefly discuss some extensions of the foregoing calculation. We 
will see what happens if the weights are non-zero at t = 0, discuss weight decay, 
and finally consider noise in the learning process. 
Weight decay is a simple and frequently-used way to limit the growth of the weights, 
which might be desirable for several reasons. It is also possible to approximate the 
problem with binary weights using a weight decay term (the so-called spherical 
model, see [11]). We consider the simplest kind of weight decay, which comes in as 
an additive term, $-\lambda w_i = -\lambda(u_i - v_i)$, in the learning equation (2), so the equation
Figure 2: The generalization error as a function of time for a few different values
of $\alpha$ ($\alpha = 0.8$, $1.0$, $1.2$).
(5) for the difference between teacher and pupil is now
$$ \dot{v}_i = -\sum_j (A_{ij} + \lambda \delta_{ij})\, v_j + \lambda u_i . \qquad (15) $$
Apart from the last term, this just shifts the eigenvalue spectrum by $\lambda$.
In the basis where $A$ is diagonal we can again write down the general solution to
this equation:
$$ v_r(t) = \frac{\lambda u_r}{\epsilon_r + \lambda} \left( 1 - e^{-(\epsilon_r+\lambda)t} \right) + v_r(0)\, e^{-(\epsilon_r+\lambda)t} . \qquad (16) $$
The square of this is
$$ v_r^2 = \frac{u_r^2}{(\epsilon_r+\lambda)^2} \Big[ \lambda \big( 1 - e^{-(\epsilon_r+\lambda)t} \big) + (\epsilon_r+\lambda)\, e^{-(\epsilon_r+\lambda)t} \Big]^2 + w_r(0)^2\, e^{-2(\epsilon_r+\lambda)t} . \qquad (17) $$
As in (10) this has to be integrated over the eigenvalue spectrum to find the averaged
generalization error. Assuming that the initial weights are random, so that $\langle w_r(0) \rangle = 0$
(the cross term then averages to zero), and that they have a relative variance given by
$$ \frac{1}{N} \sum_r \langle w_r(0)^2 \rangle = w_0^2 , \qquad (18) $$
the average of $F(t)$ over the distribution of initial conditions now becomes
$$ F(t) = \int d\epsilon\, \rho(\epsilon) \left\{ \frac{ \big[ \lambda \big( 1 - e^{-(\epsilon+\lambda)t} \big) + (\epsilon+\lambda)\, e^{-(\epsilon+\lambda)t} \big]^2 }{ (\epsilon+\lambda)^2 } + w_0^2\, e^{-2(\epsilon+\lambda)t} \right\} . \qquad (19) $$
(Again it is assumed that the length of $\vec{u}$ is $\sqrt{N}$.)
For $\lambda = 0$ we see the result is the same as before except for a factor $1 + w_0^2$ in front
of the integral. This means that the asymptotic generalization error is now
$$ F(\infty) = \begin{cases} (1 + w_0^2)(1-\alpha) & \text{for } \alpha < 1 \\ 0 & \text{for } \alpha \geq 1 , \end{cases} \qquad (20) $$
which is shown as a dotted line in Figure 1 for $w_0 = 1$. The excess error can easily
be understood as a contribution to the error from the non-relaxing part of the initial
weight vector in the subspace orthogonal to the space spanned by the patterns. The
relaxation times are unchanged for $\lambda = 0$.
For $\lambda > 0$ the relaxation times become finite even at $\alpha = 1$, because the smallest
eigenvalue is shifted by $\lambda$, so (13) is now
$$ \tau = \frac{1}{2(\epsilon_- + \lambda)} = \frac{1}{2\left[ (1-\sqrt{\alpha})^2 + \lambda \right]} . \qquad (21) $$
In this case the asymptotic error can easily be obtained numerically from (19), and
is shown by the dashed line in Figure 1. It is smaller than for $\lambda = 0$ for $w_0 \geq 1$ at
sufficiently small $\alpha$. This is simply because the weight decay makes the part of $\vec{w}(0)$
orthogonal to the pattern space decay away exponentially, thereby eliminating the
excess error due to large initial weight components in this subspace.
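The asymptotic error with weight decay can likewise be read off from (19): as $t \to \infty$ only the $\lambda/(\epsilon+\lambda)$ term survives in the band, while the $\delta(\epsilon)$ part contributes $1 - \alpha$ for $\alpha < 1$. A sketch (NumPy, using a $\sin^2\theta$ substitution of our own to handle the band edges, and assuming $w_0 = 0$ for simplicity):

```python
import numpy as np

def F_inf(alpha, lam, n=20000):
    """Asymptotic generalization error from eq. (19) with w0 = 0:
    only the lambda/(eps + lambda) term survives as t -> infinity."""
    lo, hi = (1 - np.sqrt(alpha))**2, (1 + np.sqrt(alpha))**2
    dth = (np.pi / 2) / n
    theta = (np.arange(n) + 0.5) * dth       # eps = lo + (hi - lo) sin^2(theta)
    e = lo + (hi - lo) * np.sin(theta)**2
    w = ((hi - lo)**2 / np.pi) * (np.sin(theta) * np.cos(theta))**2 / e
    band = np.sum(w * (lam / (e + lam))**2) * dth
    return band + max(1.0 - alpha, 0.0)      # delta function at eps = 0

print(F_inf(0.5, 0.0))   # ~0.5 = 1 - alpha: eq. (20) with w0 = 0
print(F_inf(0.5, 0.2))   # > 0.5: weight decay biases the learned weights
```

With $w_0 = 0$ the decay term only adds error, in line with the dashed curve of Figure 1 lying above the full curve except where large initial weights ($w_0 \geq 1$) dominate.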
This phase transition is very sensitive to noise. Consider adding a noise term $\eta_i(t)$
to the right-hand side of (2), with
$$ \langle \eta_i(t)\, \eta_j(t') \rangle = 2T\, \delta_{ij}\, \delta(t - t') . \qquad (22) $$
Here we restrict our attention to the case $\lambda = 0$. Carrying the extra term through
the succeeding manipulations leads, in place of (7), to
$$ \dot{v}_r = -\epsilon_r v_r + \eta_r(t) . \qquad (23) $$
The additional term leads to a correction (after Fourier transforming)
$$ \delta v_r(\omega) = \frac{\eta_r(\omega)}{-i\omega + \epsilon_r} \qquad (24) $$
and thus to an extra (time-independent) piece of the generalization error $F(t)$:
$$ \delta F = \frac{1}{N} \sum_r \int \frac{d\omega}{2\pi} \frac{\langle |\eta_r(\omega)|^2 \rangle}{\omega^2 + \epsilon_r^2} = \frac{1}{N} \sum_r \frac{T}{\epsilon_r} . \qquad (25) $$
For $\alpha > 1$, where there are no zero eigenvalues, we have
$$ \delta F = T \int_{\epsilon_-}^{\epsilon_+} d\epsilon\, \frac{\rho(\epsilon)}{\epsilon} = \frac{T}{\alpha - 1} , \qquad (26) $$
which has the large-$\alpha$ limit $T/\alpha$, as found in equilibrium analyses (also for threshold
perceptrons [2,3,5,6,7,8,9]). Equation (26) gives a generalization error which
diverges as one approaches the transition at $\alpha = 1$:
$$ \delta F \simeq \frac{T}{2(\sqrt{\alpha} - 1)} , \qquad \alpha \to 1^+ . \qquad (27) $$
Equation (25) blows up for $\alpha < 1$, where some of the $\epsilon_r$ are zero. This divergence
just reflects the fact that in the subspace orthogonal to the training patterns, $\vec{v}$ feels
only the noise and so exhibits a random walk whose variance diverges as $t \to \infty$.
Keeping more careful track of the dynamics in this subspace leads to
$$ \delta F(t) = 2T(1-\alpha)\, t + T \int_{\epsilon_-}^{\epsilon_+} d\epsilon\, \frac{\rho(\epsilon)}{\epsilon} , \qquad (28) $$
where the second, time-independent term diverges as $\alpha \to 1^-$.
5 CONCLUSION 
Generalization in the linear perceptron can be understood in the following picture. 
To get perfect generalization the training pattern vectors have to span the whole 
input space -- N points (in general position) are enough to specify any hyperplane. 
This means that perfect generalization appears only for $\alpha \geq 1$. As $\alpha$ approaches
1 the relaxation time (i.e., the learning time) diverges, signaling a phase transition,
as is common in physical systems. Noise has a severe effect on this transition. It 
leads to a degradation of the generalization ability which diverges as one reduces 
the number of training examples toward the critical number. 
This model is of course much simpler than most real-life training problems. How- 
ever, it does allow us to examine in detail the dynamical phase transition separating 
perfect from imperfect generalization. Further extensions of the model can also be 
solved and will be reported elsewhere. 
References 
[1] Gardner, E. and B. Derrida: Three Unfinished Works on the Optimal Storage
Capacity of Networks. Journal of Physics A 22, 1983-1994 (1989).
[2] Schwartz, D.B., V.K. Samalam, S.A. Solla, and J.S. Denker: Exhaustive Learning.
Neural Computation 2, 371-382 (1990).
[3] Tishby, N., E. Levin, and S.A. Solla: Consistent Inference of Probabilities in
Layered Networks: Predictions and Generalization. Proc. IJCNN Washington
1989, vol. 2, 403-410. Hillsdale: Erlbaum (1989).
[4] Baum, E.B. and D. Haussler: What Size Net Gives Valid Generalization? Neural
Computation 1, 151-160 (1989).
[5] Györgyi, G. and N. Tishby: Statistical Theory of Learning a Rule. In Neural
Networks and Spin Glasses, eds. W.K. Theumann and R. Koeberle. Singapore:
World Scientific (1990).
[6] Hansel, D. and H. Sompolinsky: Learning from Examples in a Single-Layer
Neural Network. Europhysics Letters 11, 687-692 (1990).
[7] Vallet, F., J. Cailton and P. Refregier: Linear and Nonlinear Extension of the
Pseudo-Inverse Solution for Learning Boolean Functions. Europhysics Letters
9, 315-320 (1989).
[8] Opper, M., W. Kinzel, J. Kleinz, and R. Nehl: On the Ability of the Optimal
Perceptron to Generalize. Journal of Physics A 23, L581-L586 (1990).
[9] Levin, E., N. Tishby, and S.A. Solla: A Statistical Approach to Learning and
Generalization in Layered Neural Networks. AT&T Bell Labs, preprint (1990).
[10] Györgyi, G.: Inference of a Rule by a Neural Network with Thermal Noise.
Physical Review Letters 64, 2957-2960 (1990).
[11] Hertz, J.A., A. Krogh, and G.I. Thorbergsson: Phase Transitions in Simple
Learning. Journal of Physics A 22, 2133-2150 (1989).
