Statistical Dynamics of Batch Learning 
S. Li and K. Y. Michael Wong 
Department of Physics, Hong Kong University of Science and Technology 
Clear Water Bay, Kowloon, Hong Kong 
{phlisong, phkywong}@ust.hk
Abstract 
An important issue in neural computing concerns the description of 
learning dynamics with macroscopic dynamical variables. Recent
progress on on-line learning only addresses the often unrealistic
case of an infinite training set. We introduce a new framework to
model batch learning of restricted sets of examples, widely applicable
to any learning cost function, and fully taking into account the
temporal correlations introduced by the recycling of the examples. 
For illustration we analyze the effects of weight decay and early 
stopping during the learning of teacher-generated examples. 
1 Introduction
The dynamics of learning in neural computing is a complex multi-variate process. 
The interest on the macroscopic level is thus to describe the process with
macroscopic dynamical variables. Recently, much progress has been made on modeling
the dynamics of on-line learning, in which an independent example is generated for 
each learning step [1, 2]. Since statistical correlations among the examples can be 
ignored, the dynamics can be simply described by instantaneous dynamical
variables.
However, most studies on on-line learning focus on the ideal case in which the
network has access to an almost infinite training set, whereas in many applications,
the collection of training examples may be costly. A restricted set of examples 
introduces extra temporal correlations during learning, and the dynamics is much 
more complicated. Early studies briefly considered the dynamics of Adaline
learning [3, 4, 5], which has recently been extended to linear perceptrons learning nonlinear
rules [6, 7]. Recent attempts, using the dynamical replica theory, have been made 
to study the learning of restricted sets of examples, but so far exact results are
published for simple learning rules such as Hebbian learning, beyond which appropriate
approximations are needed [8]. 
In this paper, we introduce a new framework to model batch learning of restricted 
sets of examples, widely applicable to any learning rule which minimizes an
arbitrary cost function by gradient descent. It fully takes into account the temporal
correlations during learning, and is therefore exact for large networks. 
2 Formulation 
Consider the single layer perceptron with N >> 1 input nodes {ξ_j} connecting to a
single output node by the weights {J_j}. For convenience we assume that the inputs
ξ_j are Gaussian variables with mean 0 and variance 1, and the output state S is a
function f(x) of the activation x at the output node, i.e.

S = f(x); \quad x = \vec{J} \cdot \vec{\xi} \equiv \frac{1}{\sqrt{N}} \sum_j J_j \xi_j. \qquad (1)

The network is assigned to "learn" p \equiv \alpha N examples which map inputs {ξ_j^μ} to the
outputs {S_μ} (μ = 1, ..., p). S_μ are the outputs generated by a teacher perceptron
{B_j}, namely

S_\mu = f(y_\mu); \quad y_\mu = \vec{B} \cdot \vec{\xi}^\mu \equiv \frac{1}{\sqrt{N}} \sum_j B_j \xi_j^\mu. \qquad (2)
Batch learning by gradient descent is achieved by adjusting the weights {J_j}
iteratively so that a certain cost function in terms of the student and teacher
activations {x_μ} and {y_μ} is minimized. Hence we consider a general cost function

E = -\sum_\mu g(x_\mu, y_\mu). \qquad (3)

The precise functional form of g(x, y) depends on the adopted learning algorithm.
For the case of binary outputs, f(x) = sgn x. Early studies on the learning dynamics
considered Adaline learning [3, 4, 5], where g(x, y) = -(S - x)^2/2 with S = sgn y.
For recent studies on Hebbian learning [8], g(x, y) = xS.
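As a concrete illustration of the setup (1)-(3), the following sketch generates a teacher-labeled training set and evaluates the Adaline cost. The variable names and parameter values are ours, and the 1/\sqrt{N} activation scaling follows (1):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200                                 # input dimension, N >> 1
alpha = 0.8                             # load parameter: p = alpha * N
p = int(alpha * N)

B = rng.standard_normal(N)              # teacher weights {B_j}
xi = rng.standard_normal((p, N))        # inputs xi_j^mu: mean 0, variance 1
y = xi @ B / np.sqrt(N)                 # teacher activations y_mu, eq. (2)
S = np.sign(y)                          # binary teacher outputs S_mu = sgn(y_mu)

def adaline_cost(J):
    """E = -sum_mu g(x_mu, y_mu) with the Adaline cost g(x, y) = -(S - x)^2 / 2."""
    x = xi @ J / np.sqrt(N)             # student activations x_mu, eq. (1)
    return 0.5 * np.sum((S - x) ** 2)
```

For the zero student J = 0 the cost is p/2, since every activation misses its ±1 target by exactly one unit.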
To ensure that the perceptron is regularized after learning, it is customary to
introduce a weight decay term. Furthermore, to avoid the system being trapped in
local minima, noise is often added in the dynamics. Hence the gradient descent
dynamics is given by

\frac{dJ_j(t)}{dt} = \frac{1}{\sqrt{N}} \sum_\mu g'(x_\mu(t), y_\mu)\, \xi_j^\mu - \lambda J_j(t) + \eta_j(t), \qquad (4)

where, here and below, g'(x, y) and g''(x, y) respectively represent the first and
second partial derivatives of g(x, y) with respect to x, λ is the weight decay strength,
and η_j(t) is the noise term at temperature T with

\langle \eta_j(t) \rangle = 0 \quad \mathrm{and} \quad \langle \eta_j(t)\, \eta_k(s) \rangle = 2T \delta_{jk} \delta(t - s). \qquad (5)
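A direct Euler discretization of the Langevin dynamics (4)-(5) can be sketched as follows. The Adaline gradient g'(x, y) = S - x is assumed here, and the step size and step count are illustrative choices of ours:

```python
import numpy as np

def batch_langevin(xi, S, lam=0.5, T=0.0, dt=0.05, steps=200, seed=1):
    """Euler integration of dJ_j/dt = (1/sqrt(N)) sum_mu g' xi_j^mu - lam J_j + eta_j,
    eq. (4)-(5); the noise increment over a step dt has variance 2 T dt."""
    rng = np.random.default_rng(seed)
    p, N = xi.shape
    J = np.zeros(N)
    for _ in range(steps):
        x = xi @ J / np.sqrt(N)                    # student activations
        gprime = S - x                             # Adaline: g'(x, y) = S - x
        drift = (xi.T @ gprime) / np.sqrt(N) - lam * J
        J = J + dt * drift + np.sqrt(2 * T * dt) * rng.standard_normal(N)
    return J
```

At T = 0 and λ = 0 with a load α < 1, this drives the training activations toward their targets, so the Adaline cost decays toward zero.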
3 The Cavity Method 
Our theory is the dynamical version of the cavity method [9, 10, 11]. It uses a 
self-consistency argument to consider what happens when a new example is added 
to a training set. The central quantity in this method is the cavity activation, which 
is the activation of a new example for a perceptron trained without that example. 
Since the original network has no information about the new example, the cavity 
activation is stochastic. Specifically, denoting the new example by the label 0, its 
cavity activation at time t is 
h_0(t) = \vec{J}(t) \cdot \vec{\xi}^0. \qquad (6)
For large N and independently generated examples, ho(t) is a Gaussian variable. 
Its covariance is given by the correlation function C(t, s) of the weights at times t 
and s, that is, 
/ho(t)o(s)) = f(t). f(s) -- C(t,s), 
where ? and  are assumed to be independent for j  k. The distribution is 
further specified by the teacher-student correlation R(t), given by 
(ho(t)yo} - f(t) . t -- R(t). 
(8) 
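The covariance C(t, s) and teacher-student correlation R(t) above are easy to check numerically: for any fixed student J and teacher B, draw fresh examples and compare the empirical moments of the cavity activation against the dot products. A sketch (function name and sample sizes are ours; the dot products carry a 1/N here, matching the 1/\sqrt{N} in (1)):

```python
import numpy as np

def cavity_moments(J, B, n_new=200000, seed=2):
    """Empirical second moments of the cavity activation h0 = J . xi^0 over
    fresh Gaussian examples, to compare with C = J.J and R = J.B
    (dot products normalized by N)."""
    rng = np.random.default_rng(seed)
    N = len(J)
    xi0 = rng.standard_normal((n_new, N))     # fresh examples, never trained on
    h0 = xi0 @ J / np.sqrt(N)                 # cavity activations, eq. (6)
    y0 = xi0 @ B / np.sqrt(N)                 # teacher activations of the same examples
    return np.mean(h0 ** 2), np.mean(h0 * y0)
```

As the number of fresh examples grows, the first moment approaches J·J and the second J·B, consistent with h_0 being Gaussian with the stated covariance.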
Now suppose the perceptron incorporates the new example at the batch-mode
learning step at time s. Then the activation of this new example at a subsequent time
t > s will no longer be a random variable. Furthermore, the activations of the 
original p examples at time t will also be adjusted from x_μ(t) to x_μ^0(t) because
of the newcomer, which will in turn affect the evolution of the activation of example 
0, giving rise to the so-called Onsager reaction effects. This makes the dynamics 
complex, but fortunately for large p ~ N, we can assume that the adjustment from
x_μ(t) to x_μ^0(t) is small, and perturbative analysis can be applied.
Suppose the weights of the original and new perceptron at time t are {J_j(t)} and
{J_j^0(t)} respectively. Then a perturbation of (4) yields

\left(\frac{d}{dt} + \lambda\right)\left(J_j^0(t) - J_j(t)\right) = \frac{1}{\sqrt{N}}\, g'(x_0(t), y_0)\, \xi_j^0 + \frac{1}{N}\sum_\mu g''(x_\mu(t), y_\mu)\, \xi_j^\mu \sum_k \xi_k^\mu \left(J_k^0(t) - J_k(t)\right). \qquad (9)
The first term on the right hand side describes the primary effects of adding example 
0 to the training set, and is the driving term for the difference between the two 
perceptrons. The second term describes the secondary effects due to the changes 
to the original examples caused by the added example, and is referred to as the 
Onsager reaction term. One should note the difference between the cavity and 
generic activations of the added example. The former is denoted by h_0(t) and
corresponds to the activation in the perceptron {J_j(t)}, whereas the latter, denoted
by x_0(t) and corresponding to the activation in the perceptron {J_j^0(t)}, is the one
used in calculating the gradient in the driving term of (9). Since their notations
are sufficiently distinct, we have omitted the superscript 0 in x_0(t), which appears
in the background examples x_μ^0(t).
The equation can be solved by the Green's function technique, yielding

J_j^0(t) - J_j(t) = \frac{1}{\sqrt{N}}\sum_k \int ds\, G_{jk}(t, s)\, g_0'(s)\, \xi_k^0, \qquad (10)

where g_0'(s) \equiv g'(x_0(s), y_0) and G_{jk}(t, s) is the weight Green's function satisfying

G_{jk}(t, s) = G^{(0)}(t - s)\,\delta_{jk} + \frac{1}{N}\sum_\mu \sum_l \int dt'\, G^{(0)}(t - t')\, g''(x_\mu(t'), y_\mu)\, \xi_j^\mu \xi_l^\mu\, G_{lk}(t', s). \qquad (11)
G^{(0)}(t - s) = \Theta(t - s) \exp(-\lambda(t - s)) is the bare Green's function, and \Theta is the
step function. The weight Green's function describes how the effect of example 0
propagates from weight J_k at learning time s to weight J_j at a subsequent time t,
including both primary and secondary effects. Hence all the temporal correlations
have been taken into account.
For large N, the equation can be solved by a diagrammatic approach similar to [5]. 
The weight Green's function is self-averaging over the distribution of examples and
is diagonal, i.e. \lim_{N\to\infty} G_{jk}(t, s) = G(t, s)\,\delta_{jk}, where

G(t, s) = G^{(0)}(t - s) + \alpha \int dt_1 \int dt_2\, G^{(0)}(t - t_1)\, \langle g''(t_1)\, D(t_1, t_2)\rangle\, G(t_2, s). \qquad (12)
D(t, s) is the example Green's function given by 
Vu(t,s) = 6(t- s) + f dt'G(t,t')g(t')Du(t',s). (13) 
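For the Adaline rule, g'' = -1 for every example, so (12)-(13) close into a pair of matrix equations on a discrete time grid. A sketch of a fixed-point solver follows; the discretization scheme and grid parameters are our choices, not taken from the paper:

```python
import numpy as np

def adaline_greens(alpha, lam, t_max=5.0, n=100, tol=1e-12, max_iter=500):
    """Solve (12)-(13) with g'' = -1 on a uniform time grid.
    Matrices absorb the integration measure dt; delta(t - s) -> identity."""
    dt = t_max / n
    t = np.arange(n) * dt
    # bare Green's function G0(t - s) = Theta(t - s) exp(-lam (t - s)), with measure
    G0 = np.tril(np.exp(-lam * (t[:, None] - t[None, :]))) * dt
    G = G0.copy()
    for _ in range(max_iter):
        D = np.linalg.inv(np.eye(n) + G)     # (13): D = delta - G D when g'' = -1
        G_new = G0 - alpha * G0 @ D @ G      # (12): <g'' D> = -D
        if np.max(np.abs(G_new - G)) < tol:
            G = G_new
            break
        G = G_new
    return t, G / dt                         # strip the measure again
```

At α = 0 the solver returns the bare Green's function exactly; for moderate α and λ it converges to a G(t, s) that depends only on t - s, consistent with the time translation invariance exploited in Section 4.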
This allows us to express the generic activations of the examples in terms of their 
cavity counterparts. Multiplying both sides of (10) by \xi_j^0/\sqrt{N} and summing over j, we get

x_0(t) - h_0(t) = \int ds\, G(t, s)\, g_0'(s). \qquad (14)
This equation is interpreted as follows. At time t, the generic activation xo(t) 
deviates from its cavity counterpart because its gradient term g_0'(s) was present
in the batch learning steps at previous times s. This gradient term propagates its
influence from time s to t via the Green's function G(t, s). Statistically, this equation 
enables us to express the activation distribution in terms of the cavity activation 
distribution, thereby getting a macroscopic description of the dynamics. 
To solve for the Green's functions and the activation distributions, we further need 
the fluctuation-response relation derived by linear response theory,

C(t, s) = \alpha \int dt'\, G^{(0)}(t - t')\, \langle g'(t')\, x(s)\rangle + 2T \int dt'\, G^{(0)}(t - t')\, G(s, t'). \qquad (15)

Finally, the teacher-student correlation is given by

R(t) = \alpha \int dt'\, G^{(0)}(t - t')\, \langle g'(t')\, y\rangle. \qquad (16)
4 A Solvable Case 
The cavity method can be applied to the dynamics of learning with an arbitrary cost 
function. When it is applied to the Hebb rule, it yields results identical to the exact 
results in [8]. Here we present the results for the Adaline rule to illustrate features 
of learning dynamics derivable from the study. This is a common learning rule and
bears resemblance to the more common back-propagation rule. Theoretically, its
dynamics is particularly convenient for analysis since g''(x, y) = -1, rendering the
weight Green's function time translation invariant, i.e. G(t, s) = G(t - s). In this
case, the dynamics can be solved by Laplace transform.
To monitor the progress of learning, we are interested in three performance
measures: (a) Training error ε_t, which is the probability of error for the training
examples. It is given by ε_t = ⟨Θ(−x sgn y)⟩, where the average is taken over the
joint distribution p(x, y) of the training set. (b) Test error ε_test, which is the
probability of error when the inputs ξ_j^μ of the training examples are corrupted by an
additive Gaussian noise of variance Δ². This is a relevant performance measure
when the perceptron is applied to process data which are the corrupted versions of
the training data. It is given by ε_test = ⟨H(x sgn y / (Δ√C(t, t)))⟩, where
H(u) = \int_u^\infty Dz is the Gaussian tail function. When Δ² = 0, the test error
reduces to the training error. (c) Generalization error ε_g, which is the probability
of error for an arbitrary input ξ_j when the teacher and student outputs are
compared. It is given by ε_g = arccos[R(t)/√C(t, t)]/π.
Figure 1(a) shows the evolution of the generalization error at T = 0. When the
weight decay strength varies, the steady-state generalization error is minimized at
the optimum

\lambda_{\mathrm{opt}} = \frac{\pi}{2} - 1, \qquad (17)
which is independent of α. It is interesting to note that in the case of the linear
perceptron, the optimal weight decay strength is also independent of α and only
determined by the output noise and unlearnability of the examples [5, 7]. Similarly,
here the student is only provided the coarse-grained version of the teacher's
activation in the form of binary bits.
For λ < λ_opt, the generalization error is a non-monotonic function of the learning time.
Hence the dynamics is plagued by overtraining, and it is desirable to introduce early
stopping to improve the perceptron performance. Similar behavior is observed in
linear perceptrons [5, 6, 7].
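Early stopping in this setting simply means monitoring ε_g(t) along the trajectory and halting at its minimum. A small simulation sketch (Adaline rule, T = 0; the angle between J and B gives ε_g directly; parameter values are illustrative choices of ours):

```python
import numpy as np
from math import acos, pi

def generalization_curve(xi, S, B, lam=0.05, dt=0.1, steps=400):
    """Run Adaline batch gradient descent at T = 0 and record
    eps_g(t) = arccos(J.B / (|J||B|)) / pi at every step."""
    p, N = xi.shape
    J = np.zeros(N)
    eg = []
    for _ in range(steps):
        x = xi @ J / np.sqrt(N)
        J = J + dt * ((xi.T @ (S - x)) / np.sqrt(N) - lam * J)
        overlap = J @ B / (np.linalg.norm(J) * np.linalg.norm(B))
        eg.append(acos(max(-1.0, min(1.0, overlap))) / pi)
    return np.array(eg)
```

The early-stopping error is eg.min(), attained at step eg.argmin(); for weight decay below the optimum this minimum typically lies before the end of the run, which is the overtraining effect discussed above.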
To verify the theoretical predictions, simulations were done with N = 500 and using
50 samples for averaging. As shown in Fig. l(a), the agreement is excellent. 
Figure l(b) compares the generalization errors at the steady-state and the early 
stopping point. It shows that early stopping improves the performance for λ < λ_opt,
which becomes near-optimal when compared with the best result at λ = λ_opt. Hence
early stopping can speed up the learning process without significant sacrifice in the 
generalization ability. However, it cannot outperform the optimal result at steady- 
state. This agrees with a recent empirical observation that a careful control of the 
weight decay may be better than early stopping in optimizing generalization [12]. 
Figure 1: (a) The evolution of the generalization error at T = 0 for α = 0.5, 1.2
and different weight decay strengths λ. Theory: solid line, simulation: symbols.
(b) Comparing the generalization error at the steady state (t → ∞) and at the early
stopping point for α = 0.5, 1.2 and T = 0.
In the search for optimal learning algorithms, an important consideration is the 
environment in which the performance is tested. Besides the generalization
performance, there are applications in which the test examples have inputs correlated
with the training examples. Hence we are interested in the evolution of the test 
error for a given additive Gaussian noise Δ in the inputs. Figure 2(a) shows, again,
that there is an optimal weight decay parameter λ_opt which minimizes the test error.
Furthermore, when the weight decay is weak, early stopping is desirable. 
Figure 2(b) shows the value of the optimal weight decay as a function of the input
noise variance Δ². To the lowest order approximation, λ_opt ∝ Δ² for sufficiently
large Δ². The dependence of λ_opt on input noise is rather general, since it also holds
in the case of random examples [13]. In the limit of small Δ², λ_opt vanishes as
Δ² for α < 1, whereas λ_opt approaches a nonzero constant for α > 1. Hence for
α < 1, weight decay is not necessary when the training error is optimized, but when
the perceptron is applied to process increasingly noisy data, weight decay becomes
more and more important in performance enhancement.
Figure 2(b) also shows the phase line λ_ot(Δ²) below which overtraining occurs.
Again, to the lowest order approximation, λ_ot ∝ Δ² for sufficiently large Δ².
However, unlike the case of generalization error, the line for the onset of overtraining
does not coincide exactly with the line of optimal weight decay. In particular, for 
an intermediate range of input noise, the optimal line lies in the region of
overtraining, so that the optimal performance can only be attained by tuning both the
weight decay strength and learning time. However, at least in the present case, 
computational results show that the improvement is marginal. 
Figure 2: (a) The evolution of the test error for Δ² = 3, T = 0 and different weight
decay strengths λ (λ_opt ≈ 1.5, 3.6 for α = 0.5, 1.2 respectively). (b) The lines of
the optimal weight decay and the onset of overtraining for α = 0.5. Inset: The same
data with λ_ot − λ_opt (magnified) versus Δ².
5 Conclusion 
Based on the cavity method, we have introduced a new framework for modeling the 
dynamics of learning, which is applicable to any learning cost function, making it 
a versatile theory. It takes into full account the temporal correlations generated by 
the use of a restricted set of examples, which is more realistic in many situations 
than theories of on-line learning with an infinite training set. 
While the Adaline rule is solvable by the cavity method, it is still a relatively 
simple model approachable by more direct methods. Hence the justification of the 
method as a general framework for learning dynamics hinges on its applicability to 
less trivial cases. In general, g_μ''(t') in (13) is not a constant and D_μ(t, s) has to
be expanded as a series. The dynamical equations can then be considered as the
starting point of a perturbation theory, and results in various limits can be derived,
e.g. the limits of small a, large a, large A, or the asymptotic limit. Another area for 
the useful application of the cavity method is the case of batch learning with very 
large learning steps. Since it has been shown recently that such learning converges 
in a few steps [6], the dynamical equations remain simple enough for a meaningful 
study. Preliminary results along this direction are promising and will be reported 
elsewhere. 
An alternative general theory for learning dynamics, the dynamical replica theory,
has recently been developed [8]. It yields exact results for Hebbian learning,
and approximate results for more non-trivial cases. Based on certain self-averaging 
assumptions, the theory is able to approximate the dynamics by the evolution of 
single-time functions, at the expense of having to solve a set of saddle point
equations in the replica formalism at every learning instant. On the other hand, our
theory retains the functions G(t, s) and C(t, s) with double arguments, but develops
naturally from the stochastic nature of the cavity activations. Contrary to a
suggestion [14], the cavity method can also be applied to on-line learning with
restricted sets of examples. It is hoped that by adhering to an exact formalism, 
the cavity method can provide more fundamental insights when the studies are 
extended to more sophisticated multilayer networks of practical importance. 
The method enables us to study the effects of weight decay and early stopping. It 
shows that the optimal strength of weight decay is determined by the imprecision 
in the examples, or the level of input noise in anticipated applications. For weaker 
weight decay, the generalization performance can be made near-optimal by early 
stopping. Furthermore, depending on the performance measure, optimality may 
only be attained by a combination of weight decay and early stopping. Though 
the performance improvement is marginal in the present case, the question remains 
open in the more general context. 
We consider the present work as the beginning of an in-depth study of learning 
dynamics. Many interesting and challenging issues remain to be explored. 
Acknowledgments 
We thank A. C. C. Coolen and D. Saad for fruitful discussions during NIPS. This 
work was supported by the grant HKUST6130/97P from the Research Grant
Council of Hong Kong.
References 
[1] D. Saad and S. Solla, Phys. Rev. Lett. 74, 4337 (1995).
[2] D. Saad and M. Rattray, Phys. Rev. Lett. 79, 2578 (1997).
[3] J. Hertz, A. Krogh and G. I. Thorbergsson, J. Phys. A 22, 2133 (1989).
[4] M. Opper, Europhys. Lett. 8, 389 (1989).
[5] A. Krogh and J. A. Hertz, J. Phys. A 25, 1135 (1992). 
[6] S. Bös and M. Opper, J. Phys. A 31, 4835 (1998).
[7] S. Bös, Phys. Rev. E 58, 833 (1998).
[8] A. C. C. Coolen and D. Saad, in On-line Learning in Neural Networks, ed. D. Saad 
(Cambridge University Press, Cambridge, 1998). 
[9] M. Mzard, G. Parisi and M. Virasoro, Spin Glass Theory and Beyond (World Sci- 
entific, Singapore) (1987). 
[10] K. Y. M. Wong, Europhys. Lett. 30, 245 (1995).
[11] K. Y. M. Wong, Advances in Neural Information Processing Systems 9, 302 (1997). 
[12] L.K. Hansen, J. Larsen and T. Fog, IEEE Int. Conf. on Acoustics, Speech, and Signal 
Processing 4, 3205 (1997). 
[13] Y. W. Tong, K. Y. M. Wong and S. Li, to appear in Proc. of IJCNN'99 (1999). 
[14] A. C. C. Coolen and D. Saad, Preprint KCL-MTH-99-33 (1999). 
