Dynamics of Training 
Siegfried BSs* 
Lab for Information Representation 
RIKEN, Hirosawa 2-1, Wako-shi 
Saitama 351-01, Japan 
Manfred Opper 
Theoretical Physics III 
University of Wiirzburg 
97074 Wiirzburg, Germany 
Abstract 
A new method to calculate the full training process of a neural net- 
work is introduced. No sophisticated methods like the replica trick 
are used. The results are directly related to the actual number of 
training steps. Some results are presented here, like the maximal 
learning rate, an exact description of early stopping, and the neces- 
sary number of training steps. Further problems can be addressed 
with this approach. 
I INTRODUCTION 
Training guided by empirical risk minimization does not always minimize the ex- 
pected risk. This phenomenon is called overfitting and is one of the major problems 
in neural network learning. In a previous work [BSs 1995] we developed an approx- 
imate description of the training process using statistical mechanics. To solve this 
problem exactly, we introduce a new description which is directly dependent on the 
actual training steps. As a first result we get analytical curves for empirical risk and 
expected risk as functions of the training time, like the ones shown in Fig. 1. 
To make the method tractable we restrict ourselves to a quite simple neural net- 
work model, which nevertheless demonstrates some typical behavior of neural nets. 
The model is a single layer perceptron, which has one N-dim. layer of adjustable 
weights l between input  and output z. The outputs are linear, i.e. 
 (it = 1, P) are 
We are interested in supervised learning, where examples x i ..., 
given for which the correct output z. is known. To define the task more clearly 
and to monitor the training process, we assume that the examples are provided 
by another network, the so called teacher network. The teacher is not restricted to 
linear outputs, it can have a nonlinear output function g.(h.). 
* email: boes*zoo. riken. go. jp and opper*physik. uni-wuerzburg. de 
142 S. Biis and M. Opper 
Learning by examples attempts to minimize the error averaged over all examples, 
i.e. ET :-- 1/2 < (z,  --z) 2 >(i), which is called training error or empirical risk. In 
fact what we are interested in is a minimal error averaged over all possible inputs 
5, i.e Ec :-- 1/2 < (z, - z) 2 >(Input}, called generalization error or expected risk. 
It can be shown [see BSs 1995] that for random inputs, i.e. all components xi are 
independent and have zero means and unit variance, the generalization error can 
be described by the order parameters R and Q, 
1 [G - 2H R(t) 
Ec(t) =  
q-Q(t)] with<...>a- e 2 ..-, (2) 
with the two parmeters G --( [g,(h)]  >a and H -(g,(h) h >. The order param- 
eters are defined as: 
1 N 1 N 
R(t) =<  Y]Wi*Wi(t) >{,.}, Q(t) =<  y] (Wi(t)) >{,.} . (3) 
i=1 i=1 
As a novelty in this paper we average the order parameters not as usual in statistical 
mechanics over many example realizations (x }, but over many teacher realizations 
(W/* }, where we use a spherical distribution. This corresponds to a Bayesian average 
over the unknown teacher. A study of the static properties of this model was done 
by Saad [1996]. Further comments about the averages can be found in the appendix. 
In the next section we introduce our new method briefly. Readers, who do not 
wish to go into technical details in first reading, can turn directly to the results (15) 
and (16). The remainder of the section can be read later, as a proof. In the third 
section results will be presented and discussed. Finally, we conclude the paper with 
a summary and a perspective on further problems. 
2 DYNAMICAL APPROACH 
Basically we exploit the gradient descent learning rule, using the linear student, i.e 
g'(h) = i and z 
- , 
O(P.ET) 
W(t + 1) = w(t) -  
P 
= vr,(t) +  , 
(4) 
For P < N, the weights are linear combinations of the example inputs x, if Wi(O) = 
0, 
P 
I , 
w(t) =: V  (t) x. (5) 
After some algebra a recursion for er,(t) can be found, i.e. 
,=1 
-,v  x'' (t) +,, (6) 
i=1 
where the term in the round brackets defines the overlap matrix C. From the 
geometric series we know the solution of this recursion, and therefore for the weights 
Dynamics of Training 143 
r  (Hebbian), 
It fulfills the initial conditions Wi(0) - 0 and Wi(1) - rlY]=l z.  xi 
and yields after infinite time steps the so called Pseudo-inverse weights, i.e. 
P 
i  , (c-1). ;'. (8) 
i() = v ,,= 
This is valid as long as the examples are linearly independent, i.e. P < N. Remarks 
about the other case (P > N) will follow later. 
With the expression (7) we can calculate the behavior of the order parameters 
for the whole training process. For R(t) we get 
= aH 
( lp ) 
For the average we have used expression (21) from the appendix. Similarly we get 
for the other order parameter, 
P 
Q() :   
,y,r,ff---- 1 
.r, - (r, - 
C ]  
[E-(E-/C) t] 
C va 
' Vx >{w,-} 
x <z,z, i:xi 
a(G-H2) k [ C 
:  
/z=l 
--1 (F -- (F- C)/)2]/z/z 
ctH 2 P 
+-p E 
Again we have applied an identity (20) from the appendix and we did some matrix 
algebra. Note, up to this point the order parameters were calculated without any 
assumption about the statistics of the inputs. The results hold, even without the 
thermodynamic limit. 
The trace can be calculated by an integration over the eigenvalues, thus we attain 
integrals of the following form, 
with l-- {0, t, 2t} and m = {-1,0, 1}. 
These integrals can be calculated once we know the density of the eigenvalues 
p(). The determination of this density can be found in recent literature calculated 
by Opper [1989] using replicas, by Krogh [1992] using perturbation theory and by 
Sollich [1994] with matrix identities. We should note, that the thermodynamic limit 
and the special assumptions about the inputs enter the calculation here. All authors 
found 
1 
P(f) = 2rctf V/(max -- )( -- min) , (12) 
144 S. BSs and M. Opper 
for a < 1. The maximal and the minimal eigenvalues are max,min :-- (1 4- - )2. So 
all that remains now is a numerical integration. 
Similarly we can calculate the behavior of the training error from 
P 
=<  
(13) 
For the overdetermined case (P > N) we can find a recursion analog to (6), 
The term in the round brackets defines now the matrix Bij. The calculation is 
therefore quite similar to above with the matrix B playing the role of matrix C. The 
density of the eigenvalues p(A) for matrix B is the one from above (12) multiplied 
by a. 
Altogether, we find the following results in the case of a < 1, 
G G-H  ( 1 2i 1 + i2_[) H 2 
Ec (t, a, r/) = + 2 a 1-a - - 2 a(1-It)' 
_ H 2 it 
= C + 
2 ' 
and in the case of a > 1, 
Ec(t,a,) = G- H2 ( 1 i2t ) Hi t 
2 1 + 2It_ + +--- 
or-1 - ' 
ET (t, or, rl) -- G - H 2 ( 1 1,2t x H 2 it 
c 2 c 
(15) 
(16) 
If t - x) all the time-dependent integrals  and I t vanish. The remaining first 
two terms describe, in the limit a - x), the optimal convergence rate of the errors. 
In the next section we discuss the implications of this result. 
3 RESULTS 
First we illustrate how well the theoretical results describe the training process. If 
we compare the theory with simulations, we find a very good correspondence, see 
Fig. 1. 
Trying other values for the learning rate we can see that there is a maximal 
learning rate. It is twice the inverse of the maximal eigenvalue of the matrix B, i.e. 
2 2 
17max-- (max : (1 q- %/-)2 ' (17) 
This is consistent with a more general result, that the maximal learning is twice the 
inverse of the maximal eigenvalue of the Hessian. In the case of the linear perceptron 
the matrix B is identical to the Hessian. 
As our approach is directly related to the actual number of training steps we can 
examine how training time varies in different training scenarios. Training can be 
stopped if the training error reaches a certain minimal value, i.e if ET(t) _< Ein+ e. 
Or, in cross-validated early stopping, we will terminate training if the generalization 
error starts to increase, i.e. if EG(t q- 1) > EG(t). 
Dynamics of Training 145 
E G 
E T 
0.4 
0.3 
0.2 
0.1 
0.0 
I I 
50 100 150 
training steps 
Figure 1: Behavior of the generalization error Ec (upper line) and the training error 
ET (lower line) during the training process. As the loading rate a - P/N - 1.5 is 
near the storage capacity (a - 1) of the net overfitting occurs. The theory describes 
the results of the simulations very well. Parameters: learning rate /= 0.1, system 
size N - 200, and g.(h) -- tanh(yh) with gain 7 -- 5. 
Fig. 2 shows that in exhaustive training the training time diverges for a near 1, 
in the region where also the overfitting occurs. In the same region early stopping 
shows only a slight increase in training time. 
Furthermore, we can guess from Fig. 2 that asymptotically only a few training 
steps are necessary to fulfill the stopping criteria. This has to be specified more 
precisely. First we study the behavior of Ec after only one training step, i.e. t -- 1. 
Since we interested in the limit of many examples (a - x) we choose the learning 
rate as a fraction of the maximal learning rate (17), i.e. /- /0/max- Then we can 
--1 
calculate the behavior of Ec(t -- 1, a, max) analytically. We find that only in the 
case of /0 = 1, the generalization error can reach its asymptotic minimum Einf. The 
rate of the convergence is a - like in the optimal case, but the prefactor is different. 
However, already for t - 2 we find, 
70 (t = 2, ct, r/= max) -'-- EG -- Einf -- 2 ct--2- + O (lS) 
If ct is large, so that we can neglect the ct -2 term, then two batch training steps are 
already enough to get the optimal convergence rate. These results are illustrated in 
Fig. 3. 
4 SUMMARY 
In this paper we have calculated the behavior of the learning and the training error 
during the whole training process. The novel approach relates the errors directly to 
the actual number of training steps. It was shown how good this theory describes the 
training process. Several results have been presented, such as the maximal learning 
rate and the training time in different scenarios, like early stopping. If the learning 
rate is chosen appropriately, then only two batch training steps are necessary to 
reach the optimal convergence rate for sufficiently large a. 
146 $. B6s and M. Opper 
le+02 
le+01 
eps--O.001 ................ 
eps--O.010 ....... 
early stop 
le+00 ........ ' ' 
le-01 le+00 le+01 le+02 
P/N 
Figure 2: Number of necessary training steps to fulfill certain stopping criteria. The 
upper lines show the result if training is stopped when the training error is lower 
than E in d-e, with e - 0.001 (dotted line) and e = 0.01 (dashed line). The solid line 
is the early stopping result where training is stopped, when the generalization error 
started to increase, Ec(t d- 1) > EG (t). Simulation results are indicated by marks. 
Parameters: learning rate r/ - 0.01, system size N - 200, and g.(h) - tanh(7h ) 
with gain 7 - 5. 
Further problems, like the dynamical description of weight decay, and the relation 
of the dynamical approach to the thermodynamic description of the training process 
[see BSs, 1995] can not be discussed here due to lack of space. These problems are 
examined in an extended version of this work [BSs and Opper 1996]. It would be very 
interesting if this method could be extended towards other, more realistic models. 
A APPENDIX 
Here we add some identities which are necessary for the averages over the teacher 
weight distributions, eqs. (9) and (10). In the statistical mechanics approach one 
assumes that the distribution of the local fields h is Gaussian. This becomes true, if 
one averages over random inputs xi, with first moments zero and one, which is the 
usual approach [see BSs 1995 and ref.]. In principle it is also possible to average over 
many tasks, i.e many teacher realizations l*, which is done here. The Gaussian 
local fields h.  fulfill, 
< >= o, < >= 
This implies 
= '2. 
In the second identity we first calculated the diagonal term and for the non-diagonal 
term we made an expansion assuming small correlations. Similarly the following 
Dynamics of Training 147 
le+00 
le-01 
le-02 
]G 
le-03 
le-04 
le-05 
le-01 
"  exh. - ............ 
: ...... opt. - .............. 
1 e+00 1 e+01 1 e+02 1 e+03 1 e+04 
P/N 
Figure 3: Behavior of/2c = Ec - rinf after t training steps. Results for t - 1, 2 
and 3 are given. For large enough a it is already after t = 2 training steps possible 
to reach the optimal convergence (solid line). If t = 3 the optimal result is reached 
--1 
even faster. Parameters: learning rate r/ = max and g.(h) -- tanh(yh ) with gain 
7-5. 
identity can be proved, 
< z,h,  >(,:>= v H +(Cv-)H. (2) 
Acknowledgment: We thank Shun-ichi Amari for many discussions and E. 
Helle, A. Stevenin-Barbier for proofreading and valuable comments. 
References 
B6s S. (1995), 'Avoiding overfitting by finite temperature learning and cross- 
validation', in Int. Conference on Artificial Neural Networks 95 (ICANN'95), 
edited by EC2 & Cie, Vol.2, p.111-116. 
BSs S., and Opper M. (1996), 'An exact description of early stopping and weight 
decay', submitted. 
Kinzel W., and Opper M. (1995), 'Dynamics of learning', in Models of Neural 
Networks I, edited by E. Domany, J. L. van Hemmen and K. Schulten, Springer, 
p.157-179. 
Krogh A. (1992), 'Learning with noise in a linear perceptron', J. Phys. A 25, 
p.1135-1147. 
Opper M. (1989), 'Learning in neural networks: Solvable dynamics', Europhys. 
Lett. 8, p.389-392. 
Saad D. (1996), 'General Gaussian priors for improved generalization', submitted 
to Neural Networks. 
Sollich P. (1995), 'Learning in large linear perceptrons and why the thermodynamic 
limit is relevant to the real world', in NIPS 7, p.207-214. 
