Learning in large linear percepttons and 
why the thermodynamic limit is relevant 
to the real world 
Peter Sollich 
Department of Physics, University of Edinburgh 
Edinburgh EH9 3JZ, U.K. 
P. Solliched. ac .uk 
Abstract 
We present a new method for obtaining the response function 6 
and its average G from which most of the properties of learning 
and generalization in linear perceptrons can be derived. We first 
rederive the known results for the 'thermodynamic limit' of infinite 
perceptron size N and show explicitly that 6 is self-averaging in 
this limit. We then discuss extensions of our method to more gen- 
eral learning scenarios with anisotropic teacher space priors, input 
distributions, and weight decay terms. Finally, we use our method 
to calculate the finite N corrections of order 1IN to G and discuss 
the corresponding finite size effects on generalization and learning 
dynamics. An important spin-off is the observation that results 
obtained in the thermodynamic limit are often directly relevant to 
systems of fairly modest, 'real-world' sizes. 
1 INTRODUCTION 
One of the main areas of research within the Neural Networks community is the issue 
of learning and generalization. Starting from a set of training examples (normally 
assumed to be input-output pairs) generated by some unknown 'teacher' rule l], one 
wants to find, using a suitable learning or training algorithm, a student A/' (read 
'Neural Network') which generalizes from the training set, i.e., predicts the outputs 
corresponding to inputs not contained in the training set as accurately as possible. 
208 Peter Sollich 
If the inputs are N-dimensional vectors x C 7 N and the outputs are scalars y C 7, 
then one of the simplest functional forms that can be assumed for the student iV' is 
the linear perceptron, which is parametrized in terms of a weight vector wv  7 N 
and implements the linear input-output mapping 
wvx. (1) 
A commonly used learning algorithm for the linear perceptron is gradient descent 
on the training error, i.e., the error that the student  makes on the training set. 
Using the standard squared output deviation error meure, the training error for a 
given set ofp training examples { (x , y), y = 1... p} is Et =   (y - y(x ))a = 
Z ,(y _ wx,/)a. To prevent the student from fitting noise in the training 
1 2 
data, a quadratic weight decay term Aw is normally added to the training error, 
with the value of the weight decay parameter A determining how strongly large 
weight vectors are penalized. Gradient descent is thus performed on the function 
1 2 
E = Et + Aw, and the corresponding learning dynamics is, in a continuous time 
approximation, dw/dt = -VE. As discussed in detail by Krogh and Hertz 
(1992), this results in an exponential approach of w to its ymptotic value, with 
decay constants given by the eigenvalues of the matrix M, defined by (1 denotes 
the N x N identity matrix) 
M=AI+A, A=E.x(x) T' 
To examine what generalization performance is achieved by the above learning 
algorithm, one h to make an sumption about the functional form of the teacher. 
The simplest such sumption is that the problem is learnable, i.e., that the teacher, 
like the student, is a linear perceptron. A teacher  is then specified by a weight 
x ouou = 
that the test inputs for which the student is ked to predict the corresponding 
outputs are drawn from an isotropic Gaussian distribution P(x)  exp(-  2 
, x ). The 
generalization error, i.e., the average error that a student  makes on a random 
input when compared to teacher , is given by 
Inserting the learning dynamics w: w(t), the generalization acquires a time 
dependence, which in its exact form depends on the specific training set, teacher, and 
initial value of the student weight vector, w(t = 0). We shall confine our attention 
to the average of this time-dependent generalization error over all possible training 
sets and teachers; to avoid clutter, we write this average simply as eg(t). We sume 
that the inputs x  in the training set are chosen independently and randomly from 
the same distribution as the test inputs, and that the corresponding training outputs 
are the teacher outputs corrupted by additive noise, y = yv(x ) + , where the  
have zero mean and variance . If we further sume an isotropic Gaussian prior on 
1 
the teacher weight vectors, P(wv)  exp(-w), then the average generalization 
error for t   is (Krogh and Hertz, 1992) 
g(t ) =  a+A(a --A) , (3) 
where G is the average of the scalled response function over the training inputs: 
G = (6)p({,), 6 = trM . (4) 
Learning in Large Linear Perceptrons 209 
The time dependence of the average generalization error for finite but large t is an 
exponential approach to the asymptotic value (3) with decay constant  + amin, 
where amin is the lowest eigenvalue occurring in the average eigenvalue spectrum of 
the input correlation matrix A (Krogh and Hertz, 1992). This average eigenvalue 
spectrum, which we denote by p(a), can be calculated from the average response 
function according to (Krogh, 1992) 
p(a) = i lim ImGIx=-,-i, (5) 
71' e-- O+ 
where we have assumed p(a) to be normalized, fda p(a)= 1. 
Eqs. (3,5) show that the key quantity determining learning and generalization in 
the linear perceptton is the average response function G defined in (4). This 
function has previously been calculated in the 'thermodynamic limit', N -- o<> 
at a = pin = const., using a diagrammatic expansion (Hertz et al., 1989) and the 
replica method (Opper, 1989, Kinzel and Opper, 1991). In Section 2, we present 
what we believe to be a much simpler method for calculating G, based only on 
simple matrix identities. We also show explicitly that 6 is self-averaging in the 
thermodynamic limit, which means that the fluctuations of 6 around its average G 
become vanishingly small as N - cx>. This implies, for example, that the gener- 
alization error is also self-averaging. In Section 3 we extend the method to more 
general cases such as anisotropic teacher space priors and input distributions, and 
general quadratic penalty terms. Finite size effects are considered in Section 4, 
where we calculate the O(1/N) corrections to G, g(t - o<>) and p(a). We discuss 
the resulting effects on generalization and learning dynamics and derive explicit con- 
ditions on the perceptton size N for results obtained in the thermodynamic limit 
to be valid. We conclude in Section 5 with a brief summary and discussion of our 
results. 
2 THE BASIC METHOD 
Our method for calculating the average response function G is based on a recursion 
relation relating the values of the (unaveraged) response function 6 for p and p + 1 
training examples. Assume that we are given a set of p training examples with 
corresponding matrix M. By adding a new training example with input x, we 
obtain the matrix M+ M  T 
= + NXX . It is straightforward to show that the 
inverse of M+ can be expressed as 
(M)-i = Mr 1 - 
 Mrl xxTMrl 
1 + xTMlx ' 
(One way of proving this identity is to multiply both sides by M+ and exploit the 
1 T -1 
fact that M+M} 1 = I + Nxx M,r .) Taking the trace, we obtain the following 
recursion relation for 6: 
6(p+ 1) = 6(p)- 1 xTM2x (6) 
i ,,rT llr- 1 
Ni+N 
i T 
Now denote zi = Nx lvt x (i = 1,2). With x drawn randomly from the assumed 
x2), the zi can readily be shown to be random 
input distribution P(x) cr exp(- 
210 Peter $ollich 
variables with means and (co-)variances 
(zi) = trM i (AziAzj) = 2 Mri-j 
, N,tr  
Combining this with the fact that tr M k < NA -k = O(N), we have that the 
fluctuations Azi of the zi around their average values are O(1/v/-); inserting this 
i tr M. 2 q- O(N_a/2) 
= 6(p)- Nl+trM  
i o6(p) 
into (6), we obtain 
6(p+ 1) 
(7) 
Starting from 6(0) = l/A, we can apply this recursion p - aN times to obtain 
6(p) up to terms which add up to at most O(pN -a/a) = O(1/V). This shows 
that 6 is self-averaging in the thermodynamic limit: whatever the training set, the 
value of 6 will always be the same up to fluctuations of O(1/v). In fact, we shall 
show in Section 4 that the fluctuations of 6 are only O(1/N). This means that the 
O(N -3/) fluctuations from each iteration of (7) are only weakly correlated, so that 
they add up like independent random variables to give a total fluctuation for 6(p) 
of O((p/Na) /2) = O(1/N). 
We have seen that, in the thermodynamic limit, 6 is identical to its average G 
because its fluctuations are vanishingly small. To calculate the value of G in the 
thermodynamic limit as a function of a and A, we insert the relation 6(p+1)-6(p) = 
06(ot)/0o +O(1/N ) into eq. (7) (with 6 replaced by G) and neglect all finite N 
corrections. This yields the partial differential equation 
OG OG 1 
0a 0A i+G =0' (8) 
which can readily be solved using the method of characteristic curves (see, e.g., 
John, 1978). Using the initial condition G[,=0 = 1/, gives a/(1 + G) = 1/G- A, 
which leads to the well-known result (see, e.g., Hertz et al., 1989) 
1( 
G= 2A 1-ct-A+V/(1-a-A) a+4A . (9) 
In the complex A plane, G has a pole at A = 0 and a branch cut arising from 
the root; according to eq. (5), these singularities determine the average eigenvalue 
spectrum p(a) of A, with the result (Krogh, 1992) 
1 
p(a) = (1 - a)O(1 - a)5(a) + a X/(a+ - a)(a - a_), (10) 
where O(z) is the Heaviside step function, O(z) = 1 for z > 0 and 0 otherwise. 
The root in eq. (10) only contributes when its argument is non-negative, i.e., for a 
between the 'spectral limits' a_ and a+, which have the values a+ -- (1 q- v/-) a. 
3 
EXTENSIONS TO MORE GENERAL LEARNING 
SCENARIOS 
We now discuss some extensions of our method to more general learning sce- 
narios. First, consider the case of an anisotropic teacher space prior, P(wv) cr 
Learning in Large Linear Perceptrons 2 1 1 
1 T Wv) , with symmetric positive definite Sl. v. This leaves the defini- 
exp(--w v 51,  
[ion of[he response function unchanged; eq. (3), however, has [o be replaced by 
%(t -- oo)= 1/2{a2G + A[a  - A(tr Sv)IOG/OA}. 
As a second extension, sume that the inputs are drawn from an anisotropic distri- 
bution, P(x)  exp(--xT-lx). It can then be shown that the ymptotic value 
of the average generalization error is still given by eq. (3) if the response function is 
1 
redefined to be 6 = tr M . This modified response function can be calculated 
 follows: First we rewrite 6 = tr (}-1 + X)-I, where X =  E,(")T  is 
the correlation matrix of the transformed input examples  = -l/x. Since the 
 are distributed according to P()  exp(-(k)), the problem is thus reduced 
to finding the response function 6 = tr (L + A) - for iratropically distributed 
inputs and L = A - The recursion relations between q(p + 1) and 6(P) derived 
in the previous section remain valid, and result, in the thermodynamic limit, in a 
differential equation for the aver ale response function G analogous to eq. (8). The 
initial condition is now Gla=0 = tr L -1, and one obtains an implicit equation for 
G, 
G=tr L+ 1+1 , (11) 
where in the ce of an anisotropic input distribution considered here, L = A- 
If  h a particularly simple form, then the dependence of G on a and A can be 
obtained analytically, but in general eq. (11) h to solved numerically. 
Finally, one can also investigate the effect of a general quadratic weight decay term, 
I T 
wAw, in the energy function E. The expression for the average generalization 
error becomes more cumbersome than eq. (3) in this ce, but the result can still be 
expressed in terms of the average response function G = {6) = {tr (A + A)-), 
which can be obtained  the solution ofeq. (11) for L = A. 
4 FINITE N CORRECTIONS 
So far, we have focussed attention on the thermodynamic limit of perceptrons of 
infinite size N. The results are clearly only approximately valid for real, finite 
systems, and it is therefore interesting to investigate corrections for finite N. This 
we do in the present section by calculating the O(1/N) corrections to G and p(a). 
For details of the calculations and results of computer simulations which support 
our theoretical analysis, we refer the reader to (Sollich, 1994). 
First note that, for A = 0, the exact result for the average response function is 
G]=0 = (a- 1- l/N) - for a > 1 + 1IN (see, e.g., Eaton, 1983), which clearly 
admits a series expansion in powers of 1IN. We assume that a similar expansion 
also exists for nonzero ,, and write 
G = Go + G/N + O(1/N). 
(12) 
Go is the value of G in the thermodynamic limit as given by eq. (9). For finite N, the 
fluctuations A6 = 6--G of 6 around its average value G become relevant; for , = 0, 
the variance of these fluctuations is known to have a power series expansion in l/N, 
and again we assume a similar expansion for finite ,, {(A6) a} = Aa/N + O(1/Na), 
212 Peter SolItch 
where the first term is O(1/N) and not O(1) because, as discussed in Section 2, the 
fluctuations of  for large N are no greater than O(1/v/-). To calculate G and A 2, 
one starts again from the recursion relation (6), now expanding everything up to 
second order in the fluctuation quantities Azi and A6. Averaging over the training 
inputs and collecting orders of 1IN yields after some straightforward algebra the 
known eq. (8) for Go and two linear partial differential equations for G and A 2, 
the latter obtained by squaring both sides of eq. (6). Solving these, one obtains 
A 2=0, Gt= G(1-AG) 
(1 + AG) 2 (13) 
In the limit A -- 0, G1 = 1/(a - 1) 2 consistent with the exact result for G quoted 
above; likewise, the result A 2 ---- 0 agrees with the exact series expansion of the 
variance of the fluctuations of 6 for A = 0, which begins with an O(1/N 2) term 
(see, e.g., Barber et al., 1994). 
(a) (b) 
0.5 1.5 
0.4 _e 0 A-0.001 
0.3- , ........ A--0.1 1.0- 
0.5- 
0.:2- 
0.1- -',k--, 0.0 ................... 
g,1 
0.0 ' - .... - . ..... 2..-_'2; -0.5 
I I I I I 
0.0 0.5 1.0 o 1.5 2.0 0.0 0.5 1.0 
........ A=O.1 
ot 1.5 2.0 
Figure 1: Average generalization error: Result for N - x>, eg,0, and coefficient of 
O(1/N) correction, eg,. (a) Noise free teacher, a 2 = 0. (b) Noisy teacher, a 2 = 0.5. 
Curves are labeled by the value of the weight decay parameter A. 
From the 1IN expansion (12) of G we obtain, using eq. (3), a corresponding ex- 
pansion of the asymptotic value of the average generalization error, which we write 
as eg(t - oo) = g,0 + g,/N + O(1/N2). It follows that the thermodynamic limit 
result for the average generalization error, g,0, is a good approximation to the true 
result for finite N as long as N >> Nc = In Figure 1, we plot g,0 and 
g,1 for several values of A and rr 2. It can be seen that the relative size of the first 
order correction [g,/g,ol and hence the critical system size Nc for validity of the 
thermodynamic limit result is largest when A is small. Exploiting this fact, Nc can 
be bounded by 1/(1 -a) for a < 1 and (3a + 1)/[a(ct- 1)] for ct > 1. It follows, 
for example, that the critical system size Nc is smaller than 5 as long as a < 0.8 
or a > 1.72, for all A and rr 2. This bound on Nc can be tightened for non-zero 
A; for X > 2, for example, one has Nc < (2A- 1)/(A+ 1) 2 < 1/3. We have thus 
shown explicitly that thermodynamic limit calculations of learning and generaliza- 
tion behaviour can be relevant for fairly small, 'real-world' systems of size N of the 
order of a few tens or hundreds. This is in contrast to the widespread suspicion 
Learning in Large Linear Perceptrons 213 
among non-physicists that the methods of statistical physics give valid results only 
for huge system sizes of the order of N  10 2a. 
(a) (b) 
Po 
a_ a+ 
Pl 
Figure 2: Schematic plot of the average eigenvalue spectrum p(a) of the input 
correlation matrix A. (a) Result for N - oe, po(a). (b) O(1/N) correction, px(a). 
Arrows indicate &peaks and are labeled by the corresponding heights. 
We now consider the O(1/N) correction to the average eigenvalue spectrum of the 
input correlation matrix A. Setting p(a) = po(a) + p(a)/N + O(1/N2), po(a) is 
the N - cx> result given by eq. (10), and from eq. (13) one derives 
1 
p(a)-- 5(a-a+)+5(a-a_) 
1 1 
v/(a+ _ .)(. _ a_)' 
Figure 2 shows sketches of po(a) and p(a). Note that fdap(a) = 0 as expected 
since the normalization of p(a) is independent of N. Furthermore, there is no 
O(1/N) correction to the &peak in po(a) at a = 0, since this peak arises from the 
N - p zero eigenvalues of A for c - pin < 1 and therefore has a height of 1 - a for 
any finite N. The &peaks in p (a) at the spectral limits a+ and a_ are an artefact 
of the truncated 1IN expansion: p(a) is determined by the singularities of G as 
a function of A, and the location of these singularities is only obtained correctly 
by resumming the full 1IN expansion. The &peaks in pl(a) can be interpreted 
as 'precursors' of a broadening of the eigenvalue spectrum of A to values which, 
when the whole 1IN series is resummed, will lie outside the N - cx> spectral 
range [a_, a+]. The negative term in p (a) represents the corresponding 'flattening' 
of the eigenvalue spectrum between a_ and a+. We can thus conclude that the 
average eigenvalue spectrum of A for finite N will be broader than for N -- x, 
which means in particular that the learning dynamics will be slowed down since the 
smallest eigenvalue amin of A will be smaller than a_. From our result for p (a) we 
can also deduce when the N - cx> result po(a) is valid for finite N; the condition 
turns out to be N >> a/[(a+- a)(a- a_)]. Consistent with our discussion of the 
broadening of the eigenvalue spectrum of A, N has to be larger for a near the 
spectral limits a_, a+ if p0(a) is to be a good approximation to the finite N average 
eigenvalue spectrum of A. 
214 Peter $ollich 
5 SUMMARY AND DISCUSSION 
We have presented a new method, based on simple matrix identities, for calculating 
the response function 6 and its average G which determine most of the properties 
of learning and generalization in linear perceptrons. In the thermodynamic limit, 
N -- c, we have recovered the known result for G and have shown explicitly that 6 
is self-averaging. Extensions of our method to more general learning scenarios have 
also been discussed. Finally, we have obtained the O(1/N) corrections to G and the 
corresponding corrections to the average generalization error, and shown explicitly 
that the results obtained in the thermodynamic limit can be valid for fairly small, 
'real-world' system sizes N. We have also calculated the O(1/N) correction to the 
average eigenvalue spectrum of the input correlation matrix A and interpreted it 
in terms of a broadening of the spectrum for finite N, which will cause a slowing 
down of the learning dynamics. 
We remark that the O(1/N) corrections that we have obtained can also be used 
in different contexts, for example for calculations of test error fluctuations and 
optimal test set size (Barber el al., 1994). Another application is in an analysis of 
the evidence procedure in Bayesian inference for finite N, where optimal values of 
'hyperparameters' like the weight decay parameter A are determined on the basis of 
the training data (G Marion, in preparation). We hope, therefore, that our results 
will pave the way for a systematic investigation of finite size effects in learning and 
generalization. 
References 
D Barber, D Saad, and P SolItch (1994). Finite size effects and optimal test set size 
in linear perceptrons. Submitted to J. Phys. A. 
M L Eaton (1983). Multivariate Statistics - A Vector Space Approach. Wiley, New 
York. 
J A Hertz, A Krogh, and G I Thorbergsson (1989). Phase transitions in simple 
learning. J. Phys. A, 22:2133-2150. 
F John (1978). Partial Differential Equations. Springer, New York, 3rd ed. 
W Kinzel and M Opper (1991). Dynamics of learning. In E Domany, J L van Hem- 
men, and K Schulten, editors, Models of Neural Networks, pages 149-171. Springer, 
Berlin. 
A Krogh (1992). Learning with noise in a linear perceptron. J. Phys. A, 25:1119- 
1133. 
A Krogh and J A Hertz (1992). Generalization in a linear perceptron in the presence 
of noise. J. Phys. A, 25:1135-1147. 
M Opper (1989). Learning in neural networks: Solvable dynamics. Europhysics 
Letters, 8:389-392. 
P Sollich (1994). Finite-size effects in learning and generalization in linear percep- 
trons. J. Phys. A, 27:7771-7784. 
