OneLine Learning with Restricted 
Training Sets: 
Exact Solution as Benchmark 
for General Theories 
H.C. Rae 
hamish.rae@kcl.ac.uk 
P. Sollich 
psollich@mth.kcl.ac.uk 
Department of Mathematics 
King's College London 
The Strand 
London WC2R 2LS, UK 
A.C.C. Coolen 
tcoolen@mth.kcl.ac.uk 
Abstract 
We solve the dynamics of on-line Hebbian learning in perceptrons 
exactly, for the regime where the size of the training set scales 
linearly with the number of inputs. We consider both noiseless 
and noisy teachers. Our calculation cannot be extended to non- 
Hebbian rules, but the solution provides a nice benchmark to test 
more general and advanced theories for solving the dynamics of 
learning with restricted training sets. 
1 Introduction 
Considerable progress has been made in understanding the dynamics of supervised 
learning in layered neural networks through the application of the methods of sta- 
tistical mechanics. A recent review of work in this field is contained in [1]. For 
the most part, such theories have concentrated on systems where the training set is 
much larger than the number of updates. In such circumstances the probability that 
a question will be repeated during the training process is negligible and it is possible 
to assume for large networks, via the central limit theorem, that the local field dis- 
tribution is Gaussian. In this paper we consider restricted training sets; we suppose 
that the size of the training set scales linearly with N, the number of inputs. The 
probability that a question will reappear during the training process is no longer 
negligible, the assumption that the local fields have Gaussian distributions is not 
tenable, and it is clear that correlations will develop between the weights and the 
Learning with Restricted Training Sets.' Exact Solution 317 
questions in the training set as training progresses. In fact, the non-Gaussian char- 
acter of the local fields should be a prediction of any satisfactory theory of learning 
with restricted training sets, as this is clearly demanded by numerical simulations. 
Several authors [2, 3, 4, 5, 6, 7] have discussed learning with restricted training sets 
but a general theory is difficult. A simple model of learning with restricted training 
sets which can be solved exactly is therefore particularly attractive and provides 
a yardstick against which more difficult and sophisticated general theories can, in 
due course, be tested and compared. We show how this can be accomplished for 
on-line Hebbian learning in perceptrons with restricted training sets and we ob- 
tain exact solutions for the generalisation error and the training error for a class of 
noisy teachers and students with arbitrary weight decay. Our theory is in excellent 
agreement with numerical simulations and our prediction of the probability density 
of the student field is a striking confirmation of them, making it clear that we are 
indeed dealing with local fields which are non-Gaussian. 
2 Definitions 
We study on-line learning in a student perceptron $, which tries to perform a task 
defined by a teacher perceptron characterised by a fixed weight vector B* E RN. 
We assume, however, that the teacher is noisy and that the actual teacher output 
T and the corresponding student response S are given by 
T' {-1,1} N--> {-1,1} T() = sgn[B.], 
S' {-1,1} N {-1,1} S()= sgn[J.], 
where the vector B is drawn independently of  with probability p(B) which may 
depend explicitly on the correct teacher vector B*. Of particular interest are the 
following two choices, described in literature as output noise and Gaussian input 
noise, respectively: 
p(B) = A 5(B+B*) + (l-A) 5(B-B*) (1) 
where A _> 0 represents the probability that the teacher output is incorrect, and 
N 
p(B)= 2-- / e . (2) 
The variance E 2/.iV has been chosen so as to achieve appropriate scaling for N --> 
Our learning rule will be the on-line Hebbian rule, i.e. 
J(e+ 1) = (1 - )J(e) +  (e) sgn[B(e). (e)] (3) 
where the non-negative parameters 3' and r/are the decay rate and the learning rate, 
respectively. At each iteration step  an input vector (/) is picked at random from 
a training set consisting of p = cN randomly drawn vectors u E {-1, 1} :, / = 
1,...p. This set remains unchanged during the learning dynamics. At the same 
time the teacher selects at random, and independently of (), the vector B(f), 
according to the probability distribution p(B). Iterating equation (3) gives 
3' t/ I -  (e) sgn[B(e). (e)] (4) 
J(m)= 1- J0 +  =0 
We assume that the (noisy) teacher output is consistent in the sense that if a 
question  reappears at some stage during the training process the teacher makes 
the same choice of B in both cases, i.e. if () = (') then also B() = B('). This 
consistency allows us to define a generalised training set/) by including with the p 
318 H. C. Rae, P. Sollich and A. C. C. Coolen 
questions the corresponding teacher vectors: 
/5 = {(i,B),...,(P, BP)} 
There are two sources of randomness in this problem. First of all there is the random 
realisation of the 'path'  = {((0), B(0)), ((1), B(1)), ..., ((g), B(g)),...}. This 
is simply the randomness of the stochastic process that gives the evolution of the 
vector J. Averages over this process will be denoted as (...). Secondly there is the 
randomness in the composition of the training set. We will write averages over all 
training sets as (...)sets. We note that 
1 P 
(f[(g), B(t?)]) =  Z f(UBU (for all g) 
and that averages over all possible realisations of the training set are given by 
(f[(, B ), (2, B2),,,,, (v, Bp)])sets 
where {" 6 {-1, 1} . We normalise B* so that [B*] 2 = 1 and choose the time unit 
t = miN. We finally assume that J0 and B* are statistically independent of the 
training vectors {", and that they obey &(0),B; = O(N-}) for all i. 
3 Explicit Microscopic Expressions 
At the m-th stage of the learning process the two simple scalar observables O[J] = 
j2 and R[J] = B*  J, and the joint distribution of fields x = J. , y = B*  , z = 
B   (calculated over the questions in the training set/)), are given by 
O[J(m)] = j2(m) R[J(m)] = B* . J(m) (5) 
1 
e[x,y, z; a(,,)] =  y] [x - a(m) . ] [y - S* . ] [z - S .  ] (6) 
For infinitely large systems one can prove that the fluctuations in mean-field ob- 
servables such as {Q, R, P}, due to the randomness in the dynamics, will vanish [6]. 
Furthermore one assumes, with convincing support from numerical simulations, that 
for N %  the evolution of such observables, observed for different random realisa- 
tions of the training set, will be reproducible (i.e. the sample-to-sample fluctuations 
will also vanish, which is called 'self-averaging'). Both properties are central ingre- 
dients of all current theories. We are thus led to the introduction of the averages of 
the observables in (5,6), with respect to the dynamical randomness and with respect 
to the randomness in the training set (to be carried out in precisely this order)' 
q(t) = lira ( (q[a(tm)]))s,s a(t) = lim (([J(tN)]))ss (7) 
Pt(x,y,z) = lira ((P[x,y,z;J(tN)]))sets (8) 
N 
A fundamental ingredient of our calculations will be the average (i sgn(B.{))(, B), 
calculated over all realisations of ({, B). We find, for a wide class of p(B), that 
(gi sgn(B  =pm? + o(m -3/2) 
where, for example, 
Learning with Restricted Training Sets: Exact Solution 319 
p =  (1-2A) 
P= +E2 
(output noise) (10) 
(Gaussian input noise) 
(11) 
4 Averages of Simple Scalar Observables 
Calculation of Q(t) and R(t) using (4, 5, 7, 9) to execute the path average and the 
average over sets is relatively straightforward, albeit tedious. We find that 
e-'t (1 _ e-'t ) rl 2 
Q(t) = e-2VtOo + 2rlpRo + --ff(1--e -2vt) 
and that 
+ r/2 (1-e-Vt) 2 1 
(12) 
R(t) = e-VtRo + r/p7- (1 - e -vt) (13) 
where p is given by equations (10, 11) in the examples of output noise and Gaussian 
input noise, respectively. We note that the generalisation error is given by 
Eg 1 [ V/] 
= - arccos R(t)/ (14) 
All models of the teacher noise which have the same p will thus have the same 
generalisation error at any time. This is true, in particular, of output noise and 
Gaussian input noise when their respective parameters A and E are related by 
1 
1 - 2,X = (15) 
/1 + E 2 
With each type of teacher noise for which (9) holds, one can thus associate an 
effective output noise parameter A. Note, however, that this effective teacher error 
probability A will in general not be identical to the true teacher error probability 
associated with a given p(B), as can immediately be seen by calculating the latter 
for the Gaussian input noise (2). 
5 Average of the Joint Field Distribution 
The calculation of the average of the joint field distribution starting from equation 
(8) is more difficult. Writing cr = (1 -v/N), and expressing the 5 functions in terms 
of complex exponentials, we find that 
= f dd)di ei(x+yO+z)lim (e 
Pt(x,y,z) J 
x H  e -[i"N-w-'(') sg (B')] (16) 
=0 u=l sets 
tN 
In this expression we replace  by  and B 1 by B, and abbreviate S = e=0['' ']' 
Upon writing the latter product in terms of the auxiliary variables v = (l .)/ 
and o: = B"  , we find that for large N 
logs  X(: sgn[B. ],t) irl:u (1-e -*t) 
where u, u2 are the random variables given by 
7122U2 
4' 
(1 --e -2vt) (17) 
320 H. C. Rae, P. Sollich and A. C. C. Coolen 
i 1 
P v>l 
and with 
X(w, t) I f0  [e -['"e'{-"l - 1] 
= - ds (18) 
A study of the statistics of ui and u2 shows that limN,c u2 = 1, and that 
u = pB*   + a-/2u (N  ), 
where u is a Gaussian random variable with mean equal to zero and variance unity. 
On the basis of these results and equations (16, 17) we find that 
8 3 
x lim (e -i[-"J'+B"+B' (19) 
where Q and R are given by the expressions (12,13) (note: Q-R 2 is independent 
of p, i.e. of the distribution p(B)). Let x0 = J0', y = B*., z = B.. 
We assume that, given y, z is independent of x0. This condition, which reflects in 
some sense the property that the teacher noise preserves the perceptron structure, 
is certainly satisfied for the models which we are considering and is probably true 
of all reasonable noise models. The joint probability density then has the form 
p(x0, y, z) = p(xo lY)P(Y, z). Equation (19) then leads to the following expression for 
the conditional probability of x, given y and z: 
f d ei[x_Ry]_2[Q_R2]+(  sgn[z],t) (20) 
z) = 
We observe that this probability distribution is the same for all models with the 
same p and that the dependence on z is through r = sgn[z], a directly observable 
quantity. The training error and the student field probability density are given by 
r = f,  0(--)(XI,,)P(I,)P(, ) (21) 
r=l 
= (22) 
in which P(y) = (2) e  . We note that the dependence of Err and Pt(x) on 
the specific noise model arises solely through P(rly) which we find is given by 
1 
P(ry) = AO(-ry) + (1 - A)O(ry) P(rly) = (1 + rerf[y/E]) 
in the output noise and Gaussian input noise models, respectively. In order to sim- 
plify the numerical computation of the remaining integrals one can further reduce 
the number of integrations analytically. Details will be reported elsewhere. 
6 Comparison with Numerical Simulations 
It will be clear that there is a large number of parameters that one could vary in 
order to generate different simulation experiments with which to test our theory. 
Here we have to restrict ourselves to presenting a number of representative results. 
Figure 1 shows, for the output noise model, how the probability density Pt(x) of 
Learning with Restricted Training Sets.' Exact Solution 321 
0.2 
0.1 
0.0 
-10 
10 -10 
0 10 -10 0 10 - 
x x 
o o 1o 
x 
Figure 1: Student field distribution P(g) for the case of output noise, at different 
times (left to right: t= 1,2,3,4), for c =-)'= , J0 =r/= 1, ,k =0.2. Histograms: 
distributions measured in simulations, (N= 10,000). Lines: theoretical predictions. 
the student field x - J- develops in time, starting as a Gaussian at t - 0 
and evolving to a highly non-Gaussian distribution with a double peak by time 
t = 4. The theoretical results give an extremely satisfactory account of the numerical 
simulations. Figure 2 compares our predictions for the generalisation and training 
errors Eg and Err with the results of numerical simulations, for different initial 
conditions, Eg(0) = 0 and Eg(0) = 0.5, and for different choices of the two most 
important parameters ,k (which controls the amount of teacher noise) and c (which 
measures the relative size of the training set). The theoretical results are again in 
excellent agreement with the simulations. The system is found to have no memory of 
its past (which will be different for some other learning rules), the asymptotic values 
of Eg and Err being independent of the initial student vector. In our examples Ea is 
consistently larger than Err, the difference becoming less pronounced as c increases. 
Note, however, that in some circumstances Err can also be larger then Ea. Careful 
inspection shows that for Hebbian learning there are no true overfitting effects, not 
even in the case of large ,k and small 'y (for large amounts of teacher noise, without 
regularisation via weight decay). Minor finite time minima of the generalisation 
error are only found for very short times (t < 1), in combination with special 
choices for parameters and initial conditions. 
7 Discussion 
Starting from a microscopic description of Hebbian on-line learning in perceptrons 
with restricted training sets, of size p = cN where N is the number of inputs, 
we have developed an exact theory in terms of macroscopic observables which has 
enabled us to predict the generalisation error and the training error, as well as the 
probability density of the student local fields in the limit N  . Our results are in 
execellent agreement with numerical simulations (as carried out for systems of size 
N = 5,000) in the case of output noise; our predictions for the Gaussian input noise 
model are currently being compared with the results of simulations. Generalisations 
of our calculations to scenarios involving, for instance, time-dependent learning 
rates or time-dependent decay rates are straightforward. Although it will be clear 
that our present calculations cannot be extended to non-Hebbian rules, since they 
322 H. C. Rae, P. Sollich and A. C. C. Coolen 
0.5 
0.4 
0.3 
0.2 
0.1 
0.0 
Or=4.0 
or=0.5 
or=0.5 
or=4.0 
< or=0.5 
0.3 
0.2 
0.1 
0.0 
0 10 20 30 
t 
L=o.25 
40 0 i 0 20 30 40 
t 
Figure 2: Generalisation errors (diamonds/lines) and training errors (circles/lines) 
as observed during on-line Hebbian learning, as functions of time. Upper two graphs: 
,k = 0.2 and a 6 {0.5,4.0} (upper left: Es(O ) = 0.5, upper right: Es(O ) = 0). Lower 
two graphs: a = 1 and ,k E {0.0,0.25} (lower left: Es(O ) = 0.5. lower right: 
Es(O ) = 0.0). Markers: simulation results for an N = 5,000 system. Solid lines: 
predictions of the theory. In all cases J0 = r/= 1 and 'y = 0.5. 
ultimately rely on our ability to write down the microscopic weight vector J at 
any time in explicit form (4), they do indeed provide a significant yardstick against 
which more sophisticated and more general theories can be tested. In particular. 
they have already played a valuable role in assessing the conditions under which a 
recent general theory of learning with restricted training sets, based on a dynamical 
version of the replica formalism, is exact [6, 7]. 
References 
[1] Mace C.W.H.and Coolen A.C.C. (1998) Statistics and Computing 8, 55 
[2] Horner H. (1992a), Z.Phys. B 86, 291; (1992b), Z.Phys. B 87, 371 
[3] Krogh A. and Hertz J.A. (1992) J. Phys. A: Math. Gen. 25, 1135 
[4] Sollich P. and Barber D. (1997) Europhys. Lett. 38,477 
[5] Sollich P. and Barber D. (1998) Advances in Neural Information Processing 
Systems 10, Eds. Jordan M., Kearns M. and Solla S. (Cambridge: MIT) 
[6] Coolen A.C.C. and Saad D., King's College London preprint KCL-MTH-98-08 
[7] Coolen A.C.C. and Saad D. (1998) (in preparation) 
