Online learning from finite training sets: 
An analytical case study 
Peter Sollich* 
Department of Physics 
University of Edinburgh 
Edinburgh EH9 3JZ, U.K. 
P. Solliched. ac. uk 
David Barber t 
Neural Computing Research Group 
Department of Applied Mathematics 
Aston University 
Birmingham B4 7ET, U.K. 
D. Sarberason. ac. uk 
Abstract 
We analyse online learning from finite training sets at non- 
infinitesimal learning rates r/. By an extension of statistical me- 
chanics methods, we obtain exact results for the time-dependent 
generalization error of a linear network with a large number of 
weights N. We find, for example, that for small training sets of 
size p  N, larger learning rates can be used without compromis- 
ing asymptotic generalization performance or convergence speed. 
Encouragingly, for optimal settings of r/ (and, less importantly, 
weight decay A) at given final learning time, the generalization per- 
formance of online learning is essentially as good as that of offiine 
learning. 
I INTRODUCTION 
The analysis of online (gradient descent) learning, which is one of the most common 
approaches to supervised learning found in the neural networks community, has 
recently been the focus of much attention [1]. The characteristic feature of online 
learning is that the weights of a network ('student') are updated each time a new 
training example is presented, such that the error on this example is reduced. In 
offline learning, on the other hand, the total error on all examples in the training 
set is accumulated before a gradient descent weight update is made. Online and 
* Royal Society Dorothy Hodgkin Research Fellow 
t Supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory 
for Neural Networks 
Online Learning from Finite Training Sets: An Analytical Case Study 275 
offiine learning are equivalent only in the limiting case where the learning rate 
/ -4 0 (see, e.g., [2]). The main quantity of interest is normally the evolution of 
the generalization error: How well does the student approximate the input-output 
mapping ('teacher') underlying the training examples after a given number of weight 
updates? 
Most analytical treatments of online learning assume either that the size of the 
training set is infinite, or that the learning rate / is vanishingly small. Both of 
these restrictions are undesirable: In practice, most training sets are finite, and non- 
infinitesimal values of / are needed to ensure that the learning process converges 
after a reasonable number of updates. General results have been derived for the 
difference between online and offiine learning to first order in /, which apply to 
training sets of any size (see, e.g., [2]). These results, however, do not directly 
address the question of generalization performance. The most explicit analysis of 
the time evolution of the generalization error for finite training sets was provided by 
Krogh and Hertz [3] for a scenario very similar to the one we consider below. Their 
r I -4 0 (i.e., offline) calculation will serve as a baseline for our work. For finite r/, 
progress has been made in particular for so-called soft committee machine network 
architectures [4, 5], but only for the case of infinite training sets. 
Our aim in this paper is to analyse a simple model system in order to assess how the 
combination of non-infinitesimal learning rates /and finite training sets (containing 
c examples per weight) affects online learning. In particular, we will consider 
the dependence of the asymptotic generalization error on / and c, the effect of 
finite  on both the critical learning rate and the learning rate yielding optimal 
convergence speed, and optimal values of /and weight decay A. We also compare 
the performance of online and offiine learning and discuss the extent to which infinite 
training set analyses are applicable for finite c. 
2 MODEL AND OUTLINE OF CALCULATION 
We consider online training of a linear student network with input-output relation 
y __-- wTx/x/. 
Here x is an N-dimensional vector of real-valued inputs, y the single real output 
and w the wei._t vector of the network. ,T, denotes the transpose of a vector and 
the factor 1//N is introduced for convenience. Whenever a training example (x, y) 
is presented to the network, its weight vector is updated along the gradient of the 
squared error on this example, i.e., 
(y - wTx/x/)  /(yx/x/ kxxTw) 
Aw = -/ Vw = - 
where r/is the learning rate. We are interested in online learning from finite train- 
ing sets, where for each update an example is randomly chosen from a given set 
{(x ', y'),! a -- 1...p) of p training examples. (The case of cyclical presentation of 
examples [6] is left for future study.) If example p is chosen for update n, the weight 
vector is changed to 
w,+l = {1 - /[x'(x') T + 7]}w, + ly'x'/x/ 
(1) 
Here we have also included a weight decay 7. We will normally parameterize the 
strength of the weight decay in terms of A = 7c (where c = p/N is the number 
276 P. Sollich and D. Barber 
of examples per weight), which plays the same role as the weight decay commonly 
used in offiine learning [3]. For simplicity, all student weights are assumed to be 
initially zero, i.e., w,=0 = 0. 
The main quantity of interest is the evolution of the 9eneralization error of the 
student. We assume that the training examples are generated by a linear 'teacher', 
i.e., y' = w. Tx'/X/+ ' , where  is zero mean additive noise of variance r 2. The 
teacher weight vector is taken to be normalized to w. 2 = N for simplicity, and the 
input vectors are assumed to be sampled randomly from an isotropic distribution 
over the hypersphere x 2 = N. The generalization error, defined as the average of 
the squared error between student and teacher outputs for random inputs, is then 
'g 2]v(W--W*)a  2 where v.=w.-w, 
:  2N Vn ' 
In order to make the scenario analytically tracable, we focus on he limit N   
of a large number of input components and weights, taken a constant number of 
examples per weight a = pin and updates per weigh ('learning time') t = n/N. In 
this limit, he generalization error cg(t) becomes self-averaging and can be calculated 
by averaging boh over he random selection of examples from a given training set 
and over all training sets. Our results can be straightforwardly extended to the ce 
of perceptton teachers with a nonlinear ransfer function,  in [7]. 
The usual statistical mechanical approach to the online learning problem expresses 
, 1 T whose 
he generalization error in terms of 'order parameters like R = Nw nw. 
(self-averaging) time evolution is determined from appropriately averaged update 
equations. This method works because for infinite raining ses, the average or- 
der pameter updates can again be expressed in erms of the order parameters 
alone. For finite training ses, on the oher hand, the updates involve new order 
parameters such  R -  T 
- w n Aw., where A is the correlation matrix of the 
training inputs, A = _x(x) T. Their time evolution is in turn determined 
by order parameters invoking higher powers of A, yielding an infinite hierarchy 
of order parameters. We solve this problem by considering instead order parame- 
ter (generating) functions [8] such  a generalized form of the generalization error 
e(t; h) =  Texp(hA)v,. This allows powers of A to be obtained by differentia- 
vn 
tion with respect to h, resulting in a closed system of (partial differential) equations 
for e(t; h) and R(t; h)=  T exp(hA)w.. 
Wn 
The resulting equations and details of their solution will be given in a future publi- 
cation. The final solution is most easily expressed in terms of the Laplace transform 
of the generalization error 
= = + + (2) 
The functions c(z) (i = 1...4) can be expressed in closed form in terms of a, a 
and A (and, of course, z). The Laplace transform (2) yields directly the ymptotic 
value of the generalization error,  = cg(l  ) = lim40 zg(z), which can be 
calculated analytically. For finite learning times l, %(l) is obtained by numerical 
inversion of the Laplace transform. 
3 RESULTS AND DISCUSSION 
We now discuss the consequences of our main result (2), focusing first on the asymp- 
totic generalization error ca, then the convergence speed for large learning times, 
Online Learning from Finite Training Sets: An Analytical Case Study 277 
ct--0.5 ct=l 
0.4 
0.3 
0.2 
0.1 
1..  
5 0.4 
0.4 
0.3 
0.2 
O. 
0.5 
Figure 1: Asymptotic generalization error eoo vs r/and A. a as shown, er e = 0.1. 
and finally the behaviour at small t. For numerical evaluations, we generally take 
er e = 0.1, corresponding to a sizable noise-to-signal ratio of  m 0.32. 
The asymptotic generalization error eoo is shown in Fig. 1 as a function of r/and A 
for a = 0.5, 1, 2. We observe that it is minimal for A = ere and r/- 0, as expected 
from corresponding results for offiine learning [3]  We also read off that for fixed A, 
eoo is an increasing function of r/: The larger r/, the more the weight updates tend 
to overshoot the minimum of the (total, i.e., offiine) training error. This causes a 
diffusive motion of the weights around their average asymptotic values [2] which 
increases coo. In the absence of weight decay (A - 0) and for a < 1, however, coo 
is independent of r/. In this case the training data can be fitted perfectly; every 
term in the total sum-of-squares training error is then zero and online learning does 
not lead to weight diffusion because all individual updates vanish. In general, the 
relative increase eoo(r/)/eoo(r/- 0)- i due to nonzero r/depends significantly on a. 
For r; = 1 and a = 0.5, for example, this increase is smaller than 6% for all A (at 
er e = 0.1), and for a = i it is at most 13%. This means that in cases where training 
data is limited (p  N), r/can be chosen fairly large in order to optimize learning 
speed, without seriously affecting the asymptotic generalization error. In the large 
a limit, on the other hand, one finds eoo= (er2/2)[1/a + r;/(2 - r/)]. The relative 
increase over the value at r/= 0 therefore grows linearly with a; already for a = 2, 
increases of around 50% can occur for r/= 1. 
Fig. 1 also shows that Coo diverges as r/approaches a critical learning rate rio: As 
r/-+ r/c, the 'overshoot' of the weight update steps becomes so large that the weights 
eventually diverge. From the Laplace transform (2), one finds that r/c is determined 
by r/ce4(z = 0) - 1; it is a function of c and A only. As shown in Fig. 2b-d, 
increases with A. This is reasonable, as the weight decay reduces the length of the 
weight vector at each update, counteracting potential weight divergences. In the 
small and large c limit, one has r/c - 2(1 + A) and r/c = 2(1 + /c), respectively. 
For constant A, r/c therefore decreases e with a (Fig. 2b-d). 
We now turn to the large t behaviour of the generalization error es(t ). For small 
the most slowly decaying contribution (or 'mode') to eg(t) varies as exp(-ct), its 
1The optimal value of the unscaledweight decay decreases with c as 7 = a2/, because 
for large training sets there is less need to counteract noise in the training data by using 
a large weight decay. 
2Conversely, for constant 7, r/c increases with c from 2(lq-7c ) to 2(1+ 7): For large c, 
the weight decay is applied more often between repeat presentations of a training example 
that would otherwise cause the weights to diverge. 
278 P. Sollich and D. Barber 
decay constant c = r/[A q- (v/ - 1):]/a scaling linearly with r/, the size of the weight 
updates, as expected (Fig. 2a). For small a, the condition ct >> 1 for eg(t) to have 
reached its asymptotic value c is r/(1 q- A)(t/a) > 1 and scales with t/a, which is 
the number of times each training example has been used. For large a, on the other 
hand, the condition becomes r/t  1: The size of the training set drops out since 
convergence occurs before repetitions of training examples become significant. 
For larger r/, the picture changes due to a new 'slow mode' (arising from the de- 
nominator of (2)). Interestingly, this mode exists only for r/above a finite threshold 
rimin '- 2/(C1/2q- c-1/2-- 1). For finite a, it could therefore not have been predicted 
from a small r/ expansion of eg(t). Its decay constant cslow decreases to zero as 
r/ -+ r/c, and crosses that of the normal mode at r/x(a, A) (Fig. 2a). For r/ > 
the slow mode therefore determines the convergence speed for large t, and fastest 
convergence is obtained for r/= r/x. However, it may still be advantageous to use 
lower values of r/in order to lower the asymptotic generalization error (see below); 
values of r/ > r/x would deteriorate both convergence speed and asymptotic per- 
formance. Fig. 2b-d shows the dependence of ]min, ]x and r/c on a and . For 
 not too large, r/x has a maximum at a m 1 (where r/x mrk) , while decaying as 
r/x - l+2a -1/: ..w r/c for larger a. This is because for a .. 1 the (total training) er- 
ror surface is very anisotropic around its minimum in weight space [9]. The steepest 
1 
directions determine r/c and convergence along them would be fastest for r/= 
(as in the isotropic case). However, the overall convergence speed is determined by 
the shallow directions, which require maximal r/m r/c for fastest convergence. 
Consider now the small t behaviour of g(t). Fig. 3 illustrates the dependence of 
%(t) on r/; comparison with simulation results for N - 50 clearly confirms our 
calculations and demonstrates that finite N effects are not significant even for such 
fairly small N. For a = 0.7 (Fig. 3a), we see that nonzero r/acts as effective update 
noise, eliminating the minimum in %(t) which corresponds to over-training [3]. 
is also seen to be essentially independent of r/ as predicted for the small value of 
) ---- 10 -4 chosen. For a = 5, Fig. 3b clearly shows the increase of e with r/. It 
also illustrates how convergence first speeds up as r/is increased from zero and then 
slows down again as r/c m 2 is approached. 
Above, we discussed optimal settings of r/ and A for minimal asymptotic gener- 
alization error . Fig. 4 shows what happens if we minimize %(t) instead for 
a given final learning time t, corresponding to a fixed amount of computational 
effort for training the network. As t increases, the optimal r/ decreases towards 
zero as required by the tradeoff between asymptotic performance and convergence 
(a) 
(b) 
0 I 2 3 4 5 
(c) 
Figure 2: Definitions of ]min, ]x and r/c, and their dependence on a (for A as shown). 
Online Learning from Finite Training Sets: An Analytical Case Study 279 
0.55 
0.5 
0.45 
0.4 
0.35 
0.3 
0.2! 
0.; 
0 
(a) a = 0.7 
5 10 15 20 25 30 
0. 
0.4 
0.3 
0.2 
0.1 
0 
0 
_.=' (b) 
5 10 15 20 t 
Figure 3: g vs t for different r/. Simulations for N = 50 are shown by symbols 
(standard errors less than symbol sizes). A=10 -4, er2=0.1, a as shown. The learning 
rate r/increases from below (at large t) over the range (a) 0.... 1.95, (b) 0.... 1.7. 
1.0 
0.6 
0.4. f' 
0.2 
0.0 ' 
0 10 
'rl 
(a) 
Figure 4: 
20 30 40 50 
t 
Optimal r/ and 
1.10 , , , ,  .  ,  , 
I 
H', /// ,' 
.04 I / /  
II  / / '.' 
[I, II 
I I i I 
}.00 II_bt ,t  ,  ,  , 
0 10 20 30 40 50 
t 
} vs given final learning 
0.30 
0.25 
0.20 
0.15 
0.10 
0.05 
0.00 
time t, 
(c) 
--:--:-: 
0 10 20 30 40 50 
t 
and resulting eg. 
Solid/dashed lines: a = 1 / a = 2; bold/thin lines: online/offiine learning. er 2 = 0.1. 
Dotted lines in (a): Fits of form r/= (a + hint)It to optimal r/for online learning. 
speed. Minimizing eg(t) m e+ const. exp(-ct) , 1 q-r/2 q- c3 exp(-c4r//) leads to 
r/opt = (a + hint)It (with some constants a, b, c...4). Although derived for small r/, 
this functional form (dotted lines in Fig. 4a) also provides a good description down 
to fairly small t , where r/opt becomes large. The optimal weight decay A increases a 
with t towards the limiting value er 2. However, optimizing A is much less impor- 
tant than choosing the right r/: Minimizing g(t) for fixed A yields almost the same 
generalization error as optimizing both r/and A (we omit detailed results here4). It 
is encouraging to see from Fig. 4c that after as few as t = 10 updates per weight 
with optimal r/, the generalization error is almost indistinguishable from its optimal 
value for t --+ oo (this also holds if A is kept fixed). Optimization of the learning 
rate should therefore be worthwhile in most practical scenarios. 
In Fig. 4c, we also compare the performance of online learning to that of offiine 
learning (calculated from the appropriate discrete time version of [3]), again with 
3One might have expected the opposite effect of having larger  at low t in order to 
'contain' potential divergences from the larger optimal learning rates r/. However, smaller 
 tends to make the asymptotic value coo less sensitive to large values of r/ as we saw 
above, and we conclude that this effect dominates. 
4For fixed  < a 2 , where eg(t) has an over-training minimum (see Fig. 3a), the asymp- 
totic behaviour of r/op changes to r/op oct -z (without the In t factor), corresponding to a 
fixed effective learning time r/t required to reach this minimum. 
280 P. SoJlich and D. Barber 
optimized values of ? and A for given t. The performance loss from using online 
instead of offiine learning is seen to be negligible. This may seem surprising given 
the effective noise on weight updates implied by online learning, in particular for 
small t. However, comparing the respective optimal learning rates (Fig. 4a), we see 
that online learning makes up for this deficiency by allowing larger values of /to 
be used (for large a, for example, r/c(Offiine) = 2/or << r/c(online) = 2). 
Finally, we compare our finite c results with those for the limiting case 
Good agreement exists for any learning time t if the asymptotic generalization error 
 (c < cx>) is dominated by the contribution from the nonzero learning rate /(as is 
the case for c -+ cx:). In practice, however, one wants /to be small enough to make 
only a negligible contribution to  (c < cx>); in this regime, the c -+ cx> results are 
essentially useless. 
4 CONCLUSIONS 
The main theoretical contribution of this paper is the extension of the statistical 
mechanics method of order parameter dynamics to the dynamics of order parameter 
(generating) functions. The results that we have obtained for a simple linear model 
system are also of practical relevance. For example, the calculated dependence on 
/of the asymptotic generalization error  and the convergence speed shows that, 
in general, sizable values of /can be used for training sets of limited size (c m 1), 
while for larger c it is important to keep learning rates small. We also found a 
simple functional form for the dependence of the optimal /on a given final learning 
time t. This could be used, for example, to estimate the optimal /for large t from 
test runs with only a small number of weight updates. Finally, we found that for 
optimized /online learning performs essentially as well as offiine learning, whether 
or not the weight decay A is optimized as well. This is encouraging, since online 
learning effectively induces noisy weight updates. This allows it to cope better than 
offiine learning with the problem of local (training error) minime[ in realistic neural 
networks. Online learning has the further advantage that the critical learning rates 
are not significantly lowered by input distributions with nonzero mean, whereas for 
offiine learning they are significantly reduced [10]. In the future, we hope to extend 
our approach to dynamic (t-dependent) optimization of / (although performance 
improvements over optimal fixed /may be small [6]), and to more complicated net- 
work architectures in which the crucial question of local minime[ can be addressed. 
References 
[9] 
[lO] 
[1] See for example: The dynamics of online learning. Workshop at NIPS'95. 
[2] T. Heskes and B. Kappen. Phys. Rev. A, 44:2718, 1991. 
[3] A. Krogh and J. A. Hertz. J. Phys. A, 25:1135, 1992. 
[4] D. Saad and S. Solla. Phys. Rev. E, 52:4225, 1995; also in NIPS-& 
[5] M. Biehl and H. Schwarze. J. Phys. A, 28:643-656, 1995. 
[6] Z.-Q. Luo. Neut. Uomp., 3:226, 1991; T. Heskes and W. Wiegerinck. IEEE 
Trans. Neut. Netw., 7:919, 1996. 
[7] P. Sollich. J. Phys. A, 28:6125, 1995. 
[8] L. L. Bonilla, F. G. Padilla, G. Parisi and F. Ritort. Europhys. Lett., 34:159, 
1996; Phys. Rev. B, 54:4170, 1996. 
J. A. Hertz, A. Krogh and G.I. Thorbergsson. J. Phys. A, 22:2133, 1989. 
T. L. H. Watkin, A. Rau and M. Biehl. Rev. Modem Phys., 65:499, 1993. 
