Statistical Theory of Overtraining - Is Cross-Validation Asymptotically Effective?
S. Amari, N. Murata, K.-R. Müller*
Dept. of Math. Engineering and Inf. Physics, University of Tokyo 
Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan 
M. Finke 
Inst. f. Logik, University of Karlsruhe 
76128 Karlsruhe, Germany 
H. Yang 
Lab. f. Inf. Representation, RIKEN, 
Wakoshi, Saitama, 351-01, Japan 
Abstract 
A statistical theory for overtraining is proposed. The analysis
treats realizable stochastic neural networks, trained with
Kullback-Leibler loss in the asymptotic case. It is shown that the
asymptotic gain in the generalization error is small if we perform
early stopping, even if we have access to the optimal stopping time.
Considering cross-validation stopping, we answer the question: in
what ratio should the examples be divided into training and testing
sets in order to obtain the optimum performance? In the
non-asymptotic region, cross-validated early stopping always
decreases the generalization error. Our large-scale simulations on a
CM5 are in good agreement with our analytical findings.
1 Introduction
In training multilayer feed-forward neural networks, there is a folklore that the
generalization error decreases in an early period of training, reaches a minimum and
then increases as training goes on, while the training error decreases monotonically.
Therefore, it is considered advantageous to stop training at an adequate time or to
use regularizers (Hecht-Nielsen [1989], Hassoun [1995], Wang et al. [1994], Poggio
and Girosi [1990], Moody [1992], LeCun et al. [1990] and others). To avoid
overtraining, the following stopping rule has been proposed based on cross-validation:
*Permanent address: GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany. 
E-mail: Klaus@first.gmd.de 
Divide all the available examples into two disjoint sets. One set is used for train- 
ing. The other set is used for testing such that the behavior of the trained network 
is evaluated by using the test examples and training is stopped at the point that 
minimizes the testing error. 
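The stopping rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the simulations: the function name `train_with_early_stopping`, the callbacks `step` and `test_error`, and the toy quadratic problem (training minimum `W_HAT`, testing minimum `W_TILDE`, step size `EPS`) are all hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def train_with_early_stopping(
    w_init: Vector,
    step: Callable[[Vector], Vector],
    test_error: Callable[[Vector], float],
    n_steps: int,
) -> Tuple[Vector, int, float]:
    """Run n_steps training iterations; return the iterate with the lowest test error."""
    w = list(w_init)
    best_w, best_n, best_err = list(w), 0, test_error(w)
    for n in range(1, n_steps + 1):
        w = step(w)            # one training iteration on the training set
        err = test_error(w)    # evaluate on the disjoint testing set
        if err < best_err:
            best_w, best_n, best_err = list(w), n, err
    return best_w, best_n, best_err


# Hypothetical toy problem: both errors are quadratic bowls with different minima.
W_HAT = [1.0, 0.0]     # minimizer of the training error (assumed)
W_TILDE = [0.5, 0.5]   # minimizer of the testing error (assumed)
EPS = 0.1              # learning rate (assumed)


def gd_step(w: Vector) -> Vector:
    # Gradient step on the training error 0.5 * |w - W_HAT|^2.
    return [wk - EPS * (wk - hk) for wk, hk in zip(w, W_HAT)]


def toy_test_error(w: Vector) -> float:
    return 0.5 * sum((wk - tk) ** 2 for wk, tk in zip(w, W_TILDE))


best_w, best_n, best_err = train_with_early_stopping(
    [-3.0, 0.0], gd_step, toy_test_error, n_steps=200
)
```

In this toy setting the trajectory runs along a straight line towards `W_HAT`, and the selected iterate is the point on that line closest to `W_TILDE`, which is reached well before convergence.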
The present paper gives a mathematical analysis of the so-called overtraining
phenomenon to elucidate the folklore. We analyze the asymptotic case where the number
$t$ of examples is very large. Our analysis treats 1) a realizable stochastic machine,
2) Kullback-Leibler loss (the negative log-likelihood loss), 3) asymptotic behavior
where the number $t$ of examples is sufficiently large (compared with the number $m$
of parameters). We first show that asymptotically the gain in the generalization
error is small even if we could find the optimal stopping time. We then answer the
question: in what ratio should the examples be divided into training and testing
sets in order to obtain the optimum performance? We give a definite answer to this
problem. When the number $m$ of network parameters is large, the best strategy is
to use almost all $t$ examples in the training set and to use only a fraction
$1/\sqrt{2m}$ of the examples in the testing set. For example, when $m = 100$, this
means that only 7% of the training patterns are to be used in the set determining
the point for early stopping.
Our analytic results were confirmed by large-scale computer simulations of
three-layer continuous feedforward networks where the number m of modifiable
parameters is m = 100. When t > 30m, the theory fits well with simulations, showing
that cross-validation is not necessary, because the generalization error becomes worse
when test examples are used to obtain an adequate stopping time. For an intermediate
range, where t < 30m, overtraining surely occurs and cross-validation stopping
strongly improves the generalization ability.
2 Stochastic feedforward networks 
Let us consider a stochastic network which receives input vector $x$ and emits
output vector $y$. The network includes a modifiable vector parameter
$w = (w_1, \ldots, w_m)$ and is denoted by $N(w)$. The input-output relation of the
network $N(w)$ is specified by the conditional probability $p(y|x; w)$. We assume (a)
that there exists a teacher network $N(w_0)$ which generates training examples
for the student $N(w)$, and (b) that the Fisher information matrix
$$G_{ij}(w) = E\left[\frac{\partial}{\partial w_i} \log p(x, y; w)\, \frac{\partial}{\partial w_j} \log p(x, y; w)\right]$$
exists, is non-degenerate and is smooth in $w$, where $E$ denotes the expectation
with respect to $p(x, y; w) = q(x)\, p(y|x; w)$.
The training set $D_t = \{(x_1, y_1), \ldots, (x_t, y_t)\}$ consists of $t$ independent
examples generated by the distribution $p(x, y; w_0)$ of $N(w_0)$. The maximum
likelihood estimator (m.l.e.) $\hat{w}$ is the one that maximizes the likelihood of
producing $D_t$, or equivalently minimizes the training error or empirical risk function
$$R_{\mathrm{train}}(w) = -\frac{1}{t} \sum_{i=1}^{t} \log p(x_i, y_i; w). \qquad (2.1)$$
The generalization error or risk function $R(w)$ of network $N(w)$ is the expectation
with respect to the true distribution,
$$R(w) = -E_0[\log p(x, y; w)] = H_0 + D(w_0 \| w) = H_0 + E_0\left[\log \frac{p(x, y; w_0)}{p(x, y; w)}\right], \qquad (2.2)$$
where $E_0$ denotes the expectation with respect to $p(x, y; w_0)$, $H_0$ is the entropy
of the teacher network and $D(w_0 \| w)$ is the Kullback-Leibler divergence from
probability distribution $p(x, y; w_0)$ to $p(x, y; w)$, or the divergence of $N(w)$ from
$N(w_0)$. Hence, minimizing $R(w)$ is equivalent to minimizing $D(w_0 \| w)$, and the
minimum is attained at $w = w_0$. The asymptotic theory of statistics proves that the
m.l.e. $\hat{w}$ is asymptotically subject to the normal distribution with mean $w_0$ and
variance $G^{-1}/t$, where $G^{-1}$ is the inverse of the Fisher information matrix $G$. We
can expand, for example, the risk
$$R(w) = H_0 + \frac{1}{2}(w - w_0)^T G(w_0)(w - w_0) + O\left(t^{-3/2}\right)$$
to obtain
$$\langle R_{\mathrm{gen}}(\hat{w}) \rangle = H_0 + \frac{m}{2t} + O\left(t^{-3/2}\right), \qquad \langle R_{\mathrm{train}}(\hat{w}) \rangle = H_0 - \frac{m}{2t} + O\left(t^{-3/2}\right), \qquad (2.3)$$
as asymptotic result for training and test error (see Murata et al. [1993] and Amari
and Murata [1990]). An extension of (2.3) including higher order corrections was
recently obtained by Müller et al. [1995].
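The leading terms of (2.3) can be checked numerically in a model where everything is available in closed form. The sketch below is an illustration only and is not taken from the paper: it assumes a unit-variance Gaussian location model $p(x; w) = N(w, I_m)$ with $w_0 = 0$, for which $G$ is the identity, the m.l.e. is the sample mean, and $R(w) - H_0 = \frac{1}{2}|w - w_0|^2$ holds exactly. The function name `simulate_gaps` and all parameter values are hypothetical.

```python
import math
import random


def simulate_gaps(m=4, t=40, trials=2000, seed=0):
    """Monte Carlo estimates of <R_gen> - H0 and <R_train> - H0 for N(w, I_m), w0 = 0."""
    rng = random.Random(seed)
    h0 = 0.5 * m * math.log(2.0 * math.pi * math.e)  # entropy of N(0, I_m)
    gen_total = 0.0
    train_total = 0.0
    for _ in range(trials):
        data = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(t)]
        # The m.l.e. of the mean is the sample mean.
        w_hat = [sum(x[k] for x in data) / t for k in range(m)]
        # Generalization error minus H0 is the KL divergence 0.5 * |w_hat - w0|^2.
        gen_total += 0.5 * sum(v * v for v in w_hat)
        # Training error (2.1): average negative log-likelihood at w_hat.
        sq = sum((x[k] - w_hat[k]) ** 2 for x in data for k in range(m))
        r_train = 0.5 * m * math.log(2.0 * math.pi) + sq / (2.0 * t)
        train_total += r_train - h0
    return gen_total / trials, train_total / trials


gen_gap, train_gap = simulate_gaps()
predicted = 4 / (2 * 40)  # m / (2t) for the default arguments
```

With these settings `gen_gap` should lie close to `predicted` and `train_gap` close to `-predicted`, up to Monte Carlo noise; the training estimate is the noisier of the two because it also carries the sampling fluctuation of the empirical entropy.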
Let us consider the gradient descent learning rule (Amari [1967], Rumelhart et al.
[1986], and many others), where the parameter $\hat{w}(n)$ at the $n$th step is modified by
$$\hat{w}(n + 1) = \hat{w}(n) - \epsilon \frac{\partial R_{\mathrm{train}}(\hat{w}(n))}{\partial w}, \qquad (2.4)$$
and where $\epsilon$ is a small positive constant. This is batch learning where all the
training examples are used for each iteration of modifying $\hat{w}(n)$.¹ The batch process
is deterministic and $\hat{w}(n)$ converges to $\hat{w}$, provided the initial $w(0)$ is included in
its basin of attraction. For large $n$ we can argue that $\hat{w}(n)$ approaches $\hat{w}$
isotropically and the learning trajectory follows a linear ray towards $\hat{w}$ (for details
see Amari et al. [1995]).
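As a small illustration of the batch rule (2.4), the sketch below iterates it for a unit-variance Gaussian location model, where the gradient of the training error is simply the current parameter minus the sample mean. The model, the function name `batch_gradient_descent`, the step size and the starting point are assumptions made for the example; they are not the networks studied in this paper.

```python
import random


def batch_gradient_descent(data, w_init, eps=0.1, n_steps=200):
    """Iterate w(n+1) = w(n) - eps * dR_train/dw for the model p(x; w) = N(w, I).

    Here R_train(w) = const + (1/(2t)) * sum_i |x_i - w|^2, so the gradient
    is w - mean(x) and the m.l.e. is the sample mean.
    """
    t = len(data)
    m = len(w_init)
    w_hat = [sum(x[k] for x in data) / t for k in range(m)]
    w = list(w_init)
    trajectory = [list(w)]
    for _ in range(n_steps):
        grad = [w[k] - w_hat[k] for k in range(m)]      # uses all t examples
        w = [w[k] - eps * grad[k] for k in range(m)]
        trajectory.append(list(w))
    return trajectory, w_hat


rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
trajectory, w_hat = batch_gradient_descent(data, w_init=[5.0, -5.0, 5.0])
```

In this particular model the distance to the m.l.e. shrinks by the factor `1 - eps` at every step, so the trajectory is exactly a straight ray into the m.l.e.; for general networks this holds only approximately and only close to convergence.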
3 Virtual optimal stopping rule 
During learning, as the parameter $\hat{w}(n)$ approaches $\hat{w}$, the generalization
behavior of network $N(\hat{w}(n))$ is evaluated by the sequence $R(n) = R(\hat{w}(n))$,
$n = 1, 2, \ldots$ The folklore says that $R(n)$ decreases in an early period of learning
but increases later. Therefore, there exists an optimal stopping time $n_{\mathrm{opt}}$ at
which $R(n)$ is minimized. The stopping time $n_{\mathrm{opt}}$ is a random variable depending
on $\hat{w}$ and the initial $w(0)$. We now evaluate the ensemble average
$\langle R(n_{\mathrm{opt}}) \rangle$.
The true $w_0$ and the m.l.e. $\hat{w}$ are in general different, and they are apart of
order $1/\sqrt{t}$. Let us compose a sphere $S$ whose center is at $(1/2)(w_0 + \hat{w})$
and which passes through both $w_0$ and $\hat{w}$, as shown in Fig. 1b. Its diameter is
denoted by $d$, where $d^2 = |\hat{w} - w_0|^2$, with the squared norm taken in the Fisher
metric, $|a|^2 = a^T G a$, and
$$E_0[d^2] = E_0[(\hat{w} - w_0)^T G (\hat{w} - w_0)] = \frac{1}{t}\,\mathrm{tr}(G^{-1} G) = \frac{m}{t}. \qquad (3.1)$$
Let $A$ be the ray, that is the trajectory $\hat{w}(n)$ starting at $\hat{w}(0)$, which is not in
the neighborhood of $w_0$. The optimal stopping point $w^*$ that minimizes
$$R(n) = H_0 + \frac{1}{2}|\hat{w}(n) - w_0|^2 \qquad (3.2)$$
is given by the first intersection of the ray $A$ and the sphere $S$.
Since $w^*$ is the point on $A$ such that $w_0 - w^*$ is orthogonal to $A$, it lies on the
sphere $S$ (Fig. 1b). When ray $A$ approaches $\hat{w}$ from the opposite side of $w_0$ (the
right-hand side in the figure), the first intersection point is $\hat{w}$ itself. In this case,
the optimal stopping never occurs until it converges to $\hat{w}$.
Let $\theta$ be the angle between the ray $A$ and the diameter $w_0 - \hat{w}$ of the sphere $S$.
We now calculate the distribution of $\theta$ when the rays are isotropically distributed.
¹ We can alternatively use on-line learning, studied by Amari [1967], Heskes and Kappen
[1991], and recently by Barkai et al. [1994] and Saad and Solla [1995].
Lemma 1. When ray $A$ is approaching $\hat{w}$ from the side in which $w_0$ is included, the
probability density of $\theta$, $0 \le \theta \le \pi/2$, is given by
$$\frac{1}{I_{m-2}} \sin^{m-2}\theta, \quad \text{where } I_m = \int_0^{\pi/2} \sin^m\theta \, d\theta. \qquad (3.3)$$
The detailed proof of this lemma can be found in Amari et al. [1995]. Using the
density of $\theta$ given by Eq. (3.3), we arrive at the following theorem.
Theorem 1. The average generalization error at the optimal stopping point is
given by
$$\langle R(n_{\mathrm{opt}}) \rangle = H_0 + \frac{1}{2t}\left(m - \frac{1}{2}\right). \qquad (3.4)$$
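Proof. When ray $A$ is at angle $\theta$, $0 \le \theta < \pi/2$, the optimal stopping point $w^*$
is on the sphere $S$. It is easily shown that $|w^* - w_0| = d \sin\theta$. This is the case
where $A$ is from the same side as $w_0$ (from the left-hand side in Fig. 1b), which occurs
with probability 0.5, and the average of $(d \sin\theta)^2$ is
$$E_0[(d \sin\theta)^2] = \frac{E_0[d^2]}{I_{m-2}} \int_0^{\pi/2} \sin^2\theta \, \sin^{m-2}\theta \, d\theta = \frac{m}{t} \frac{I_m}{I_{m-2}} = \frac{m}{t}\left(1 - \frac{1}{m}\right).$$
When $\theta$ is $\pi/2 \le \theta \le \pi$, that is, $A$ approaches $\hat{w}$ from the opposite side,
it does not stop until it reaches $\hat{w}$, so that $|w^* - w_0|^2 = |\hat{w} - w_0|^2 = d^2$. This
occurs with probability 0.5. Averaging the two cases gives
$E_0[|w^* - w_0|^2] = (m - 1/2)/t$, and half of this value is the excess of the
generalization error over $H_0$. Hence, we proved the theorem.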
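The geometric argument of the proof can be reproduced by a short Monte Carlo experiment. The sketch below is an illustration under stated assumptions, not part of the original analysis: it works in coordinates where $G$ is the identity and $w_0 = 0$, draws the m.l.e. from $N(0, I/t)$, draws the incoming ray direction uniformly on the unit sphere, and stops at the point of the ray closest to $w_0$. The function name `average_excess_risk` and the values of `m`, `t` and `trials` are hypothetical.

```python
import math
import random


def average_excess_risk(m=10, t=100, trials=20000, seed=0):
    """Return (<R(n_opt)> - H0, <R(w_hat)> - H0) for isotropic straight rays."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(t)
    total_opt = 0.0
    total_full = 0.0
    for _ in range(trials):
        # m.l.e. distributed as N(w0, I/t) with w0 = 0.
        w_hat = [rng.gauss(0.0, 1.0) * scale for _ in range(m)]
        # Isotropic unit direction u; the ray consists of the points w_hat + s*u, s >= 0,
        # traversed with s decreasing to 0.
        g = [rng.gauss(0.0, 1.0) for _ in range(m)]
        norm = math.sqrt(sum(v * v for v in g))
        u = [v / norm for v in g]
        d2 = sum(v * v for v in w_hat)
        proj = sum(a * b for a, b in zip(w_hat, u))
        # Closest point of the ray to w0: s* = max(0, -proj).
        # If proj < 0 the ray passes w0's side first and |w*|^2 = d^2 * sin^2(theta);
        # otherwise the best point is w_hat itself.
        best = d2 - proj * proj if proj < 0.0 else d2
        total_opt += 0.5 * best
        total_full += 0.5 * d2
    return total_opt / trials, total_full / trials


avg_opt, avg_full = average_excess_risk()
theory_opt = (10 - 0.5) / (2 * 100)   # (m - 1/2) / (2t) from Theorem 1
theory_full = 10 / (2 * 100)          # m / (2t) from (2.3)
```

Up to sampling noise, `avg_opt` should agree with `theory_opt` and `avg_full` with `theory_full`, so the gain from ideal stopping is about `1 / (4 * t)` regardless of `m`.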
The theorem shows that, if we could know the optimal stopping time $n_{\mathrm{opt}}$ for
each trajectory, the generalization error would decrease by $1/(4t)$ compared with
(2.3), which has the effect of decreasing the effective dimension by 1/2. This effect
is negligible when $m$ is large. The optimal stopping time is of the order $\log t$.
However, it is impossible to know the optimal stopping time. If we stop learning at an
estimated optimal time $\hat{n}_{\mathrm{opt}}$, we have a small gain when the ray $A$ is from the
same side as $w_0$ but we have some loss when ray $A$ is from the opposite direction.
This shows that the gain is even smaller if we use a common stopping time
$\bar{n}_{\mathrm{opt}}$ independent of $\hat{w}$ and $w(0)$, as proposed by Wang et al. [1994].
However, the point is that there is no direct means of estimating either
$n_{\mathrm{opt}}$ or $\bar{n}_{\mathrm{opt}}$ other than, for example, cross-validation. Hence,
we analyze cross-validation stopping in the following.
4 Optimal stopping by cross-validation 
The present section studies asymptotically two fundamental problems: 1) Given t 
examples, how many examples should be used in the training set and how many 
in the testing set? 2) How much gain can one expect by the above cross-validated 
stopping? 
Let us divide $t$ examples into $rt$ examples of the training set and $r't$ examples of
the testing set, where $r + r' = 1$. Let $\hat{w}$ be the m.l.e. from the $rt$ training
examples, and let $\tilde{w}$ be the m.l.e. from the other $r't$ testing examples. Since the
training examples and testing examples are independent, $\hat{w}$ and $\tilde{w}$ are subject
to independent normal distributions with mean $w_0$ and covariance matrices
$G^{-1}/(rt)$ and $G^{-1}/(r't)$, respectively.
Let us compose the triangle with vertices $w_0$, $\hat{w}$ and $\tilde{w}$. The trajectory $A$
starting at $w(0)$ enters $\hat{w}$ linearly in the neighborhood. The point $w^*$ on the
trajectory $A$ which minimizes the testing error is the point on $A$ that is closest to
$\tilde{w}$, since the testing error defined by
$$R_{\mathrm{test}}(w) = \frac{1}{r't} \sum_i \{-\log p(x_i, y_i; w)\}, \qquad (4.1)$$
where summation is taken over the $r't$ testing examples, can be expanded as
$$R_{\mathrm{test}}(w) = H_0 - \frac{1}{2}|\tilde{w} - w_0|^2 + \frac{1}{2}|w - \tilde{w}|^2. \qquad (4.2)$$
Let $S$ be the sphere centered at $(\hat{w} + \tilde{w})/2$ and passing through both $\hat{w}$
and $\tilde{w}$. Its diameter is given by $d = |\hat{w} - \tilde{w}|$. Then, the optimal stopping
point $w^*$ is given by the intersection of the trajectory $A$ and the sphere $S$. When the
trajectory comes from the opposite side of $\tilde{w}$, it does not intersect $S$ until it
converges to $\hat{w}$, so that the optimal point is $w^* = \hat{w}$ in this case. Omitting the
detailed proof, the generalization error of $w^*$ is given by Eq. (??), so that we
calculate the expectation
$$E[|w^* - w_0|^2] = \frac{m}{tr} - \frac{1}{2t}\left(\frac{1}{r} - \frac{1}{r'}\right).$$
Lemma 2. The average generalization error by the optimal cross-validated stopping
is
$$\langle R(w^*, r) \rangle = H_0 + \frac{2m - 1}{4rt} + \frac{1}{4r't}. \qquad (4.3)$$
We can then calculate the optimal division rate
$$r_{\mathrm{opt}} = 1 - \frac{\sqrt{2m - 1} - 1}{2(m - 1)} \quad \text{and} \quad r_{\mathrm{opt}} = 1 - \frac{1}{\sqrt{2m}} \ \ (\text{large } m \text{ limit}) \qquad (4.4)$$
of examples, which minimizes the generalization error. So for large $m$ only
$(1/\sqrt{2m}) \times 100\%$ of examples should be used for testing and all others for
training. For example, when $m = 100$, this shows that 93% of examples are to be used
for training and only 7% are to be kept for testing. From Eq. (4.4) we obtain as
optimal generalization error for large $m$
$$\langle R(w^*, r_{\mathrm{opt}}) \rangle = H_0 + \frac{m}{2t}\left(1 + \sqrt{\frac{2}{m}}\right). \qquad (4.5)$$
This shows that the generalization error asymptotically increases slightly by
cross-validation compared with non-stopped learning, which uses all the examples for
training.
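The optimal division rate follows from setting the derivative of (4.3) with respect to $r$ to zero, with $r' = 1 - r$, which gives $r'/r = 1/\sqrt{2m - 1}$. The sketch below evaluates (4.3) to (4.5) numerically and cross-checks the closed form against a brute-force grid search. The function names `excess_risk`, `optimal_train_fraction` and `grid_search_train_fraction`, and the values of `m` and `t`, are hypothetical choices for the illustration.

```python
import math


def excess_risk(m: int, t: int, r: float) -> float:
    """Eq. (4.3) minus H0, with training fraction r and testing fraction 1 - r."""
    return (2 * m - 1) / (4 * r * t) + 1 / (4 * (1 - r) * t)


def optimal_train_fraction(m: int) -> float:
    """Exact r_opt of Eq. (4.4); requires m > 1."""
    return 1 - (math.sqrt(2 * m - 1) - 1) / (2 * (m - 1))


def grid_search_train_fraction(m: int, t: int, steps: int = 100000) -> float:
    """Brute-force minimizer of excess_risk over r in (0, 1)."""
    candidates = (k / steps for k in range(1, steps))
    return min(candidates, key=lambda r: excess_risk(m, t, r))


m, t = 100, 10000
r_opt = optimal_train_fraction(m)
r_grid = grid_search_train_fraction(m, t)
risk_cv = excess_risk(m, t, r_opt)                       # exact minimum of (4.3)
risk_large_m = m / (2 * t) * (1 + math.sqrt(2 / m))      # Eq. (4.5)
risk_no_stopping = m / (2 * t)                           # Eq. (2.3), all data for training
```

For `m = 100` the exact testing fraction `1 - r_opt` is about 6.6%, while the large-`m` formula $1/\sqrt{2m}$ gives about 7.1%. In both cases `risk_cv` exceeds `risk_no_stopping`, which is the asymptotic statement made above.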
5 Simulations
We use standard feed-forward classifier networks with $N$ inputs, $H$ sigmoid hidden
units and $M$ softmax outputs (classes). The output activity $O_l$ of the $l$th output
unit is calculated via the softmax squashing function
$$p(y = C_l | x; w) = O_l = \frac{\exp(h_l)}{1 + \sum_{k=1}^{M} \exp(h_k)}, \quad l = 1, \ldots, M, \qquad O_0 = \frac{1}{1 + \sum_{k=1}^{M} \exp(h_k)},$$
where $h_l = \sum_j w^O_{lj} s_j - \theta^O_l$ is the local field potential. Each output $O_l$
codes the a-posteriori probability of being in class $C_l$; $O_0$ denotes a zero class
for normalization purposes. The $m$ network parameters consist of biases $\theta$ and
weights $w$. When $x$ is input, the activity of the $j$-th hidden unit is
$$s_j = \left[1 + \exp\left(-\sum_{k=1}^{N} w^H_{jk} x_k - \theta^H_j\right)\right]^{-1}, \quad j = 1, \ldots, H.$$
The input layer is connected to the hidden layer via $w^H$, the hidden layer is
connected to the output layer via $w^O$, but no short-cut connections are present.
Although the network is completely deterministic, it is constructed to approximate
class conditional probabilities (Finke and Müller [1994]).
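A forward pass through such a classifier can be sketched as follows. This is an illustrative reading of the formulas in this section, not the simulation code: the dictionary layout, the function names `init_params`, `count_params` and `forward`, and the uniform random initialization are assumptions.

```python
import math
import random


def init_params(n_in, n_hidden, n_out, rng):
    """Random weights and biases for an n_in - n_hidden - n_out classifier (assumed init)."""
    return {
        "w_h": [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)],
        "theta_h": [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)],
        "w_o": [[rng.uniform(-1.0, 1.0) for _ in range(n_hidden)] for _ in range(n_out)],
        "theta_o": [rng.uniform(-1.0, 1.0) for _ in range(n_out)],
    }


def count_params(params):
    """Number m of modifiable parameters: (N + 1) * H + (H + 1) * M."""
    return (sum(len(row) for row in params["w_h"]) + len(params["theta_h"])
            + sum(len(row) for row in params["w_o"]) + len(params["theta_o"]))


def forward(x, params):
    """Return [O_0, O_1, ..., O_M], the class probabilities for input x."""
    # Hidden activities s_j = 1 / (1 + exp(-sum_k w_jk x_k - theta_j)).
    s = [1.0 / (1.0 + math.exp(-sum(w * xk for w, xk in zip(row, x)) - th))
         for row, th in zip(params["w_h"], params["theta_h"])]
    # Local fields h_l = sum_j w_lj s_j - theta_l.
    h = [sum(w * sj for w, sj in zip(row, s)) - th
         for row, th in zip(params["w_o"], params["theta_o"])]
    # Softmax with an extra zero class O_0 for normalization.
    denom = 1.0 + sum(math.exp(v) for v in h)
    return [1.0 / denom] + [math.exp(v) / denom for v in h]


rng = random.Random(0)
params = init_params(8, 8, 4, rng)          # the 8-8-4 architecture
probs = forward([rng.random() for _ in range(8)], params)
n_params = count_params(params)
```

For the 8-8-4 architecture the parameter count is (8 + 1) * 8 + (8 + 1) * 4 = 108, and the five outputs (zero class plus four classes) sum to one by construction.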
The examples $\{(x_1, y_1), \ldots, (x_t, y_t)\}$ are produced randomly, by drawing $x_i$,
$i = 1, \ldots, t$, from a uniform distribution independently and producing the labels
$y_i$ stochastically from the teacher classifier. Conjugate gradient learning with
line search on the empirical risk function Eq. (2.1) is applied, starting from some
random initial vector. The generalization ability is measured using Eq. (2.2) on a
large test set (50000 patterns). Note that we use Eq. (2.1) on the cross-validation
set, because only the empirical risk is available on the cross-validation set in a
practical situation. We compare the generalization error for three settings: exhaustive
training (no stopping), early stopping (controlled by the cross-validation set) and
optimal stopping (controlled by the large test set). The simulations were performed on
a parallel computer (CM5). Every curve in the figures takes about 8h of computing
time on a partition of size 128 or 256 of the CM5, i.e. we perform 128 to 256 parallel
trials. This setting enabled us to do extensive statistics (cf. Amari et al. [1995]).
Fig. 1a shows the results of simulations, where N = 8, H = 8, M = 4, so that
the number m of modifiable parameters is m = (N + 1)H + (H + 1)M = 108. We
observe clearly that saturated learning without early stopping is the best in the
asymptotic range of t > 30m, a range which is often inaccessible in practical
applications because of the limited size of the data sets. Cross-validated early
stopping does not improve the generalization error here, so that no overtraining is
observed on the average in this range. In the asymptotic area (Figure 1) we observe
that the smaller the percentage of the training set that is used to determine the
point of early stopping, the better the generalization ability. When we
use cross-validation, the optimal size of the test set is about 7% of all the examples,
as the theory predicts.
Clearly, early stopping does improve the generalization ability to a large extent in
an intermediate range for t < 30m (see Müller et al. [1995]). Note that our theory
also gives a good estimate of the optimal size of the early stopping set in this
intermediate range.
[Figure 1. Panel (a): plot against 1/t (horizontal axis from 5e-5 to 5e-4, vertical
axis from 0.005 to 0.05) with curves labeled opt., 20%, 33%, 42% and no stopping.
Panel (b): geometric sketch showing the ray A.]
Figure 1: (a) R(w) plotted as a function of 1/t for different sizes r' of the early
stopping set for an 8-8-4 classifier network. "opt." denotes the use of a very large
cross-validation set (50000) and "no stopping" addresses the case where 100% of the
training set is used for exhaustive learning. (b) Geometrical picture to determine
the optimal stopping point w*.
6 Conclusion 
We proposed an asymptotic theory for overtraining. The analysis treats realizable
stochastic neural networks, trained with Kullback-Leibler loss.
It is demonstrated both theoretically and in simulations that asymptotically the gain
in the generalization error is small if we perform early stopping, even if we have
access to the optimal stopping time. For cross-validation stopping we showed for
large $m$ that optimally only a fraction $r'_{\mathrm{opt}} = 1/\sqrt{2m}$ of the examples
should be used to determine the point of early stopping in order to obtain the best
performance. For example, if $m = 100$ this corresponds to using 93% of the $t$ training
patterns for training and only 7% for testing where to stop. Yet, even if we use
$r'_{\mathrm{opt}}$ for cross-validated stopping, the generalization error is always increased
compared to exhaustive training. Note, nevertheless, that this asymptotic range is
often inaccessible in practical applications because of the limited size of the
data sets.
In the non-asymptotic region simulations show that cross-validated early stopping 
always helps to enhance the performance since it decreases the generalization error. 
In this intermediate range our theory also gives a good estimate of the optimal size 
of the early stopping set. In future we will consider higher order correction terms 
to extend our theory to give also a quantitative description of the non-asymptotic 
region. 
Acknowledgements: We would like to thank Y. LeCun, S. Bös and K. Schulten
for valuable discussions. K.-R. M. thanks K. Schulten for warm hospitality during
his stay at the Beckman Inst. in Urbana, Illinois. We acknowledge computing time
on the CM5 in Urbana (NCSA) and in Bonn, supported by the National Institutes
of Health (P41RR05969) and the EC S & T fellowship (FTJ3-004, K.-R. M.).
References 
Amari, S. [1967], IEEE Trans., EC-16, 299-307.
Amari, S., Murata, N. [1993], Neural Computation 5, 140.
Amari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H. [1995], Statistical Theory
of Overtraining and Overfitting, Univ. of Tokyo Tech. Report 95-06, submitted.
Barkai, N., Seung, H. S. and Sompolinsky, H. [1994], On-line learning of
dichotomies, NIPS'94.
Finke, M. and Müller, K.-R. [1994], in Proc. of the 1993 Connectionist Models
Summer School, Mozer, M., Smolensky, P., Touretzky, D. S., Elman, J. L. and Weigend,
A. S. (Eds.), Hillsdale, NJ: Erlbaum Associates, 324.
Hassoun, M. H. [1995], Fundamentals of Artificial Neural Networks, MIT Press.
Hecht-Nielsen, R. [1989], Neurocomputing, Addison-Wesley.
Heskes, T. and Kappen, B. [1991], Physical Review, A44, 2718-2762.
LeCun, Y., Denker, J. S., Solla, S. [1990], Optimal brain damage, NIPS'89.
Moody, J. E. [1992], The effective number of parameters: An analysis of
generalization and regularization in nonlinear learning systems, NIPS 4.
Murata, N., Yoshizawa, S., Amari, S. [1994], IEEE Trans., NNS, 865-872.
Müller, K.-R., Finke, M., Murata, N., Schulten, K. and Amari, S. [1995], A numerical
study on learning curves in stochastic multilayer feed-forward networks, Univ.
of Tokyo Tech. Report METR 95-03 and Neural Computation, in press.
Poggio, T. and Girosi, F. [1990], Science, 247, 978-982.
Rissanen, J. [1986], Ann. Statist., 14, 1080-1100.
Rumelhart, D., Hinton, G. E., Williams, R. J. [1986], in PDP, Vol. 1, MIT Press.
Saad, D., Solla, S. A. [1995], PRL, 74, 4337 and Phys. Rev. E, 52, 4225.
Wang, Ch., Venkatesh, S. S., Judd, J. S. [1994], Optimal stopping and effective
machine complexity in learning, to appear (revised and extended version of NIPS'93).
