Optimal Stopping and Effective Machine 
Complexity in Learning 
Changfeng Wang 
Department of Systems Sci. and Eng. 
University of Pennsylvania 
Philadelphia, PA, U.S.A. 19104 
Santosh S. Venkatesh 
Department of Electrical Engineering 
University of Pennsylvania 
Philadelphia, PA, U.S.A. 19104 
J. Stephen Judd 
Siemens Corporate Research 
755 College Rd. East, 
Princeton, NJ, U.S.A. 08540
Abstract 
We study the problem of when to stop learning a class of feedforward networks
- networks with linear output neurons and fixed input weights - when they are
trained with a gradient descent algorithm on a finite number of examples. Under
general regularity conditions, it is shown that there are in general three distinct
phases in the generalization performance in the learning process, and in particular,
the network has better generalization performance when learning is stopped at a
certain time before the global minimum of the empirical error is reached. A notion
of effective size of a machine is defined and used to explain the trade-off between
the complexity of the machine and the training error in the learning process.
The study leads naturally to a network size selection criterion, which turns out to
be a generalization of Akaike's Information Criterion for the learning process. It is
shown that stopping learning before the global minimum of the empirical error has
the effect of network size selection.
1 INTRODUCTION 
The primary goal of learning in neural nets is to find a network that gives valid generalization. In
achieving this goal, a central issue is the trade-off between the training error and network complexity.
This usually reduces to a problem of network size selection, which has drawn much research effort in
recent years. Various principles, theories, and intuitions, including Occam's razor, statistical model
selection criteria such as Akaike's Information Criterion (AIC) [1] and many others [5, 1, 10, 3, 11], all
quantitatively support the following PAC prescription: between two machines which have the same
empirical error, the machine with smaller VC-dimension generalizes better. However,
these methods or criteria do not necessarily lead to optimal (or nearly optimal) generalization
performance. Furthermore, all of these methods are valid only at the global minimum of the empirical
error function (e.g., the likelihood function for AIC), and it is not clear from these methods how the
generalization error is affected by network complexity or, more generally, how a network generalizes
during the learning process. This paper addresses these issues.
Recently, it has often been observed that when a network is trained by a gradient descent
algorithm, there exists a critical region in the training epochs where the trained network generalizes
best, and after that region the generalization error will increase (frequently called over-training). Our
numerical experiments with gradient-type algorithms in training feedforward networks also indicate
that in this critical region, as long as the network is large enough to learn the examples, the size
of the network plays little role in the (best) generalization performance of the network. Does this
mean we must revise Occam's principle? How should one define the complexity of a network and go
about tuning it to optimize generalization performance? When should one stop learning? Although
relevant learning processes were treated by numerous authors [2, 6, 7, 4], formal theoretical
studies of these problems are still lacking.
Under rather general regularity conditions (Section 2), we give in Section 3 a theorem which
relates the generalization error at each epoch of learning to that at the global minimum of the
training error. Its consequence is that for any linear machine whose VC-dimension is finite but large
enough to learn the target concept, the number of iterations needed for the best generalization to
occur is of the order of the logarithm of the sample size, rather than at the global minimum of
the training error; it also provides bounds on the improvement expected. Section 4 deals with the
relation between the size of the machine and generalization error by appealing to the concept of
effective size. Section 5 concerns the application of these results to the problem of network size
selection, where the AIC is generalized to cover the time evolution of the learning process. Finally,
we conclude the paper with comments on practical implementation and further research in this
direction.
2 THE LEARNING MACHINE 
The machine we consider accepts input vectors X from an arbitrary input space and produces scalar
outputs
$$y = \sum_{i=1}^{d} \alpha_i^* \psi_i(X) + \xi. \qquad (1)$$
Here, $\alpha^* = (\alpha_1^*, \ldots, \alpha_d^*)'$ is a fixed vector of real weights, for each i, $\psi_i(X)$ is a fixed real function
of the inputs, with $\psi(X) = (\psi_1(X), \ldots, \psi_d(X))'$ the corresponding vector of functions, and $\xi$ is a
random noise term. The machine (1) can be thought of as a feedforward neural network with a fixed
front end and variable weights at the output. In particular, the functions $\psi_i$ can represent fixed
polynomials (higher-order or sigma-pi neural networks), radial basis functions with fixed centers, a
fixed hidden-layer of sigmoidal neurons, or simply a linear map. In this context, N.J. Nilsson [8]
has called similar structures $\Phi$-machines.
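As an illustration of the machine (1), the following is a minimal sketch that assumes Gaussian radial basis functions with fixed centers as the front end. The helper names `make_basis` and `sample_machine`, the centers, the width, and the weight values are hypothetical choices made only for this example; they are not taken from the paper.

```python
import numpy as np

def make_basis(centers, width=1.0):
    """Return psi(X): fixed Gaussian radial basis functions (a hypothetical front end)."""
    def psi(X):
        X = np.atleast_2d(X)                          # (n, input_dim)
        diff = X[:, None, :] - centers[None, :, :]    # (n, d, input_dim)
        return np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * width ** 2))  # (n, d)
    return psi

def sample_machine(psi, alpha_star, X, sigma, rng):
    """Draw outputs y = psi(X)' alpha* + xi, with xi ~ N(0, sigma^2) independent of X."""
    features = psi(X)
    noise = rng.normal(0.0, sigma, size=features.shape[0])
    return features @ alpha_star + noise

rng = np.random.default_rng(0)
centers = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)    # d = 5 fixed centers in one dimension
psi = make_basis(centers)
alpha_star = np.array([1.0, -0.5, 0.25, 0.0, 2.0])    # only these output weights are learned
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = sample_machine(psi, alpha_star, X, sigma=0.1, rng=rng)
```

Only the output weights are adjustable here; the front end `psi` stays fixed, which is what makes the learning problem linear in the parameters.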
We consider the problem of learning from examples a relationship between a random variable Y
and an n-dimensional random vector X. We assume that this function is given by (1) for some fixed
integer d, the random vector X and random variable $\xi$ are defined on the same probability space,
that $E[\xi \mid X] = 0$, and $\sigma^2(X) = \mathrm{Var}(\xi \mid X) = \text{constant} < \infty$ almost surely. The smallest eigenvalue of
the matrix $\psi(x)\psi(x)'$ is assumed to be bounded from below by the inverse of some square integrable
function.
Note that it can be shown that the VC-dimension of the class of $\Phi$-machines with d neurons
is d under the last assumption. The learning-theoretic properties of the system will be determined
largely by the eigenstructure of the matrix $\Phi \equiv E[\psi(X)\psi(X)']$. Accordingly, let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ denote the eigenvalues of $\Phi$.
The goal of the learning is to find the true concept $\alpha^*$ given independently drawn examples (X, y)
from (1). Given any hypothesis (vector) $w = (w_1, \ldots, w_d)'$ for consideration as an approximation
to the true concept $\alpha^*$, the performance measure we use is the mean-square prediction (or ensemble)
error
$$\mathcal{E}(w) = E\,(y - \psi(X)'w)^2. \qquad (2)$$
Note that the true concept $\alpha^*$ is the mean-square solution
$$\alpha^* = \arg\min_w \mathcal{E}(w) = \Phi^{-1}E\,(\psi(X)\,y), \qquad (3)$$
and the minimum prediction error is given by $\mathcal{E}(\alpha^*) = \min_w \mathcal{E}(w) = \sigma^2$.
Let n be the number of samples of (X, Y). We assume that an independent, identically
distributed sample $(X^{(1)}, y^{(1)}), \ldots, (X^{(n)}, y^{(n)})$, generated according to the joint distribution
of (X, Y) induced by (1), is provided to the learner. To simplify notation, define the matrix
$\Psi \equiv [\psi(X^{(1)}) \cdots \psi(X^{(n)})]$ and the corresponding vector of outputs $y = (y^{(1)}, \ldots, y^{(n)})'$. In
analogy with (2) define the empirical error on the sample by
$$\mathcal{E}_n(w) = \frac{1}{n}\sum_{l=1}^{n} \big(y^{(l)} - \psi(X^{(l)})'w\big)^2.$$
Let $\hat\alpha$ denote the hypothesis vector for which the empirical error on the sample is minimized:
$\nabla\mathcal{E}_n(\hat\alpha) = 0$. Analogously with (3) we can then show that
$$\hat\alpha = (\Psi\Psi')^{-1}\Psi y = \hat\Phi^{-1}\Big(\frac{1}{n}\Psi y\Big), \qquad (4)$$
where $\hat\Phi = \frac{1}{n}\Psi\Psi'$ is the empirical covariance matrix, which is almost surely nonsingular for large n.
The terms in (4) are the empirical counterparts of the ensemble averages in (3).
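The least-squares solution (4) can be checked numerically. The sketch below assumes the convention that the feature matrix has one column per example (shape d by n), and uses synthetic Gaussian features and hypothetical weight values purely for illustration.

```python
import numpy as np

def empirical_error(Psi, y, w):
    """Mean squared residual on the sample; Psi has shape (d, n), one column per example."""
    residual = y - Psi.T @ w
    return float(residual @ residual) / y.size

def least_squares(Psi, y):
    """Return (alpha_hat, Phi_hat), with Phi_hat = Psi Psi' / n the empirical covariance."""
    n = y.size
    Phi_hat = Psi @ Psi.T / n
    alpha_hat = np.linalg.solve(Phi_hat, Psi @ y / n)
    return alpha_hat, Phi_hat

rng = np.random.default_rng(0)
d, n, sigma = 4, 500, 0.3
Psi = rng.normal(size=(d, n))                      # synthetic features (illustrative only)
alpha_star = np.array([1.0, -2.0, 0.5, 0.0])       # hypothetical true weights
y = Psi.T @ alpha_star + sigma * rng.normal(size=n)

alpha_hat, Phi_hat = least_squares(Psi, y)
# Gradient of the empirical error at alpha_hat; it should vanish up to rounding.
grad = -2.0 / n * Psi @ (y - Psi.T @ alpha_hat)
```

Solving the normal equations with `np.linalg.solve` avoids forming an explicit inverse; for ill-conditioned feature matrices a least-squares routine would be the safer choice.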
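The gradient descent algorithm is given by:
$$\alpha_{t+1} = \alpha_t - \epsilon\Big(\hat\Phi\,\alpha_t - \frac{1}{n}\Psi y\Big), \qquad (5)$$
where $\alpha_t = (\alpha_{t,1}, \alpha_{t,2}, \ldots, \alpha_{t,d})'$, t is the number of iterations, and $\epsilon$ is the rate of learning (the bracketed term is the gradient of the empirical error up to a constant factor, which is absorbed into $\epsilon$). From this
we can get
$$\alpha_t = (I - \Lambda(t))\,\hat\alpha + \Lambda(t)\,\alpha_0, \qquad (6)$$
where $\Lambda(t) = (I - \epsilon\hat\Phi)^t$, and $\alpha_0$ is the initial weight vector.
The limit of $\alpha_t$ is $\hat\alpha$ when t goes to infinity, provided $\hat\Phi$ is positive definite and the learning rate
$\epsilon$ is small enough (i.e., smaller than the reciprocal of the largest eigenvalue of $\hat\Phi$). This implies that the gradient
descent algorithm converges to the least squares solution, starting from any point in $\mathbb{R}^d$.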
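The closed form (6) can be compared against the iteration itself. The following sketch assumes the update is written as a step along the empirical covariance times the current error, with any constant factor of the gradient absorbed into the learning rate, and it picks a step size below the reciprocal of the largest eigenvalue so that every factor 1 - eps * lambda_i lies in (0, 1). The data and names are hypothetical.

```python
import numpy as np

def gradient_descent(Phi_hat, b, alpha0, eps, steps):
    """Iterate alpha <- alpha - eps * (Phi_hat @ alpha - b), where b = Psi y / n."""
    alpha = alpha0.copy()
    for _ in range(steps):
        alpha = alpha - eps * (Phi_hat @ alpha - b)
    return alpha

def closed_form(Phi_hat, alpha_hat, alpha0, eps, t):
    """Evaluate (I - Lambda(t)) alpha_hat + Lambda(t) alpha0 with Lambda(t) = (I - eps Phi_hat)^t."""
    identity = np.eye(len(alpha0))
    Lam = np.linalg.matrix_power(identity - eps * Phi_hat, t)
    return (identity - Lam) @ alpha_hat + Lam @ alpha0

rng = np.random.default_rng(1)
d, n = 3, 200
Psi = rng.normal(size=(d, n))                          # synthetic features, shape (d, n)
y = Psi.T @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=n)
Phi_hat = Psi @ Psi.T / n
b = Psi @ y / n
alpha_hat = np.linalg.solve(Phi_hat, b)

eps = 0.5 / np.linalg.eigvalsh(Phi_hat).max()          # safely inside the convergent range
alpha0 = np.zeros(d)
iterated = gradient_descent(Phi_hat, b, alpha0, eps, 25)
predicted = closed_form(Phi_hat, alpha_hat, alpha0, eps, 25)
converged = gradient_descent(Phi_hat, b, alpha0, eps, 2000)
```

The two routes agree at every t because the error alpha_t - alpha_hat is multiplied by the same matrix at each step, so the initial error is attenuated geometrically along each eigen-direction.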
3 GENERALIZATION DYNAMICS AND STOPPING TIME 
3.1 MAIN THEOREM OF GENERALIZATION DYNAMICS 
Even if the true concept (i.e., the precise relation between Y and X in the current problem) is in the 
class of models we consider, it is usually hopeless to find it using only a finite number of examples, 
except in some trivial cases. Our goal is hence less ambitious; we seek to find the best approximation 
of the true concept, the approach entailing a minimization of the training or empirical error, and 
then taking the global minimum of the empirical error $\hat\alpha$ as the approximation. As we have seen, the
procedure is unbiased and consistent. Does this then imply that training should always be carried 
out to the limit? Surprisingly, the answer is no. This assertion follows from the next theorem. 
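Theorem 3.1 Let $M_n > 0$ be an arbitrary real constant (possibly depending on n), and suppose
assumptions A1 to A3 are satisfied; then the generalization dynamics in the training process are
governed by the following equation:
$$\mathcal{E}(\alpha_t) = \mathcal{E}(\alpha^*) + \sum_{i=1}^{d}\Big[\lambda_i\delta_i^2(1-\epsilon\lambda_i)^{2t} + \frac{\sigma^2}{n}\big(1-(1-\epsilon\lambda_i)^t\big)^2\Big] + O(n^{-3/2}),$$
uniformly for all initial weight vectors $\alpha_0$ in the d-dimensional ball $\{\alpha^* + \delta : \|\delta\| \le M_n,\ \delta \in \mathbb{R}^d\}$
and for all t > 0. []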
3.2 THREE PHASES IN GENERALIZATION 
By Theorem 3.1, the mean generalization error at each epoch of the training process is characterized
by the following function:
$$\phi(t) \equiv \sum_{i=1}^{d}\Big[\lambda_i\delta_i^2(1-\epsilon\lambda_i)^{2t} + \frac{\sigma^2}{n}\big(1-(1-\epsilon\lambda_i)^t\big)^2\Big].$$
The analysis of the evolution of generalization with training is facilitated by treating $\phi(\cdot)$ as a
function of a continuous time parameter t. We will show that there are three distinct phases in
generalization dynamics. These results are given in the following by several corollaries of
Theorem 3.1.
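The non-monotone behavior described here can be seen in a small numerical sketch. It assumes that the time-varying part of the mean generalization error is a sum over eigen-directions of an initial-error term, lambda_i * delta_i^2 * (1 - eps*lambda_i)^(2t), and a complexity term, (sigma^2/n) * (1 - (1 - eps*lambda_i)^t)^2; this form and all numerical values below are assumptions made for illustration.

```python
import numpy as np

def phi(t, lam, delta, sigma2, n, eps):
    """Assumed time-varying part of the mean generalization error, summed over eigen-directions."""
    decay = (1.0 - eps * lam) ** np.asarray(t, dtype=float)[..., None]   # shape (T, d)
    initial_error = lam * delta ** 2 * decay ** 2      # shrinks as training proceeds
    complexity = sigma2 / n * (1.0 - decay) ** 2       # grows as training proceeds
    return np.sum(initial_error + complexity, axis=-1)

lam = np.array([1.0, 0.5, 0.1])      # hypothetical eigenvalues
delta = np.array([1.0, 1.0, 1.0])    # hypothetical initial error per eigen-direction
sigma2, n, eps = 1.0, 1000, 0.5

t = np.arange(0, 2001)
curve = phi(t, lam, delta, sigma2, n, eps)
t_best = int(np.argmin(curve))       # epoch with the smallest value on the grid
limit = sigma2 * len(lam) / n        # value approached as t grows without bound
```

In this example the curve first falls steeply while the initial error is eliminated, dips slightly below its limiting value, and then creeps back up toward it, which is the qualitative picture of the three phases.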
Without loss of generality, we assume the initial weight vector is picked in a region with
$\|\delta\| \le M_n = O(n^0)$, and in particular, $|\delta_i| = O(n^0)$. Let $r_t \equiv t\,\ln(1/[1-\epsilon\lambda_d])/\ln n$; then for all $0 \le t < t_1 \equiv \ln n/\big(2\ln(1/[1-\epsilon\lambda_d])\big)$, we have $0 \le r_t < \frac{1}{2}$, and thus
$$\sum_{i=1}^{d}\lambda_i\delta_i^2(1-\epsilon\lambda_i)^{2t} \;\le\; \phi(t) \;\le\; \sum_{i=1}^{d}\lambda_i\delta_i^2(1-\epsilon\lambda_i)^{2t} + \frac{\sigma^2 d}{n}.$$
The quantity $(1-\epsilon\lambda_i)^{2t}$ in the first term of the above inequalities is related to the elimination of
initial error, and can be defined as the approximation error (or fitting error); the last term is related
to the effective complexity of the network at t (in fact, an order $O(1/n)$ shift of the complexity error).
The definition and observations here will be discussed in more detail in the next section.
We call the learning process during the time interval $0 \le t \le t_1$ the first phase of learning.
Since in this interval $\phi(t) = O(n^{-2r_t})$ is a monotonically decreasing function of t, the generalization
error decreases monotonically in the first phase of learning. At the end of the first phase of learning
$\phi(t_1) = O(1/n)$, therefore the generalization error is $\mathcal{E}(\alpha_{t_1}) = \mathcal{E}(\alpha^*) + O(1/n)$. As a summary of these
statements we have the following corollary.
Corollary 3.2 In the first phase of learning, the complexity error is dominated by the approximation
error, and within an order of $O(1/n)$, the generalization error decreases monotonically in the learning
process to $\mathcal{E}(\alpha^*) + O(1/n)$ at the end of the first phase. []
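For $t > t_1$, we can show by Theorem 3.1 that the generalization dynamics is given by the following
equation, where $\delta_{t_1} \equiv \alpha(t_1) - \alpha^*$ and $\alpha_\infty = \hat\alpha$ is the limit of training:
$$\mathcal{E}(\alpha_{t_1+t}) = \mathcal{E}(\alpha_\infty) - \frac{2\sigma^2}{n}\sum_{i=1}^{d}(1-\epsilon\lambda_i)^t\Big[1 - \frac{1}{2}(1+\rho_i)(1-\epsilon\lambda_i)^t\Big] + O(n^{-3/2}),$$
where $\rho_i \equiv \lambda_i\delta_i^2(t_1)\,n/\sigma^2$, which is, with probability approaching one, of order $O(n^0)$.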
Without causing confusion, we still use $\phi(\cdot)$ for the new time-varying part of the generalization
error. The function $\phi(\cdot)$ has much more complex behavior after $t_1$ than in the first phase of learning.
As we will see, it decreases for some time, and finally begins to increase again. In particular, we
find the best generalization at that t where $\phi(t)$ is minimized. (It is noted that $\delta_{t_1}$ is a random
variable now, and the following statements of the generalization dynamics hold with
probability approaching one as $n \to \infty$.)
Define the optimal stopping time $t_{\min} \equiv \arg\min\{\mathcal{E}(\alpha_t) : t \in [0, \infty]\}$, i.e., the epoch corresponding
to the smallest generalization error. Then we can prove the following corollaries:
Corollary 3.3 The optimal stopping time $t_{\min} = O(\ln n)$, provided $\sigma^2 > 0$. In particular, the
following inequalities hold:
1. $t_\ell \le t_{\min} \le t_u$, where $t_\ell \equiv t_1 + \min_i \frac{\ln(1+\rho_i)}{\ln(1/[1-\epsilon\lambda_i])}$ and $t_u \equiv t_1 + \max_i \frac{\ln(1+\rho_i)}{\ln(1/[1-\epsilon\lambda_i])}$ are both
finite real numbers. That is, the smallest generalization error occurs before the global minimum
of the empirical error is reached.
2. $\phi(\cdot)$ (tracking the generalization error) decreases monotonically for $t < t_\ell$ and increases
monotonically to zero for $t > t_u$; furthermore, $t_{\min}$ is unique if $t_\ell + \frac{\ln 2}{\ln(1/[1-\epsilon\lambda_1])} \ge t_u$. []
In accordance with our earlier definitions, we call the learning process during the time interval
between $t_1$ and $t_u$ the second phase of learning, and the rest of the time the third phase of learning.
According to Corollary 3.3, for $t > t_u$ sufficiently large, the generalization error is uniformly
better than at the global minimum, $\hat\alpha$, of the empirical error, although the minimum generalization error
is achieved between $t_\ell$ and $t_u$. The generalization error is reduced by at least

over that for $\hat\alpha$ if we stop training at a proper time. For a fixed number of learning examples,
the larger the ratio d/n, the larger the improvement in generalization error if the algorithm is
stopped before the global minimum $\hat\alpha$ is reached.
4 THE EFFECTIVE SIZE OF THE MACHINE 
Our concentration on dynamics and our seeming disregard for complexity do not conflict with the
learning-theoretic focus on VC-dimension; in fact, the two attitudes fit nicely together. This section
explains the generalization dynamics by introducing the concept of effective complexity of the
machine. It is argued that early stopping in effect sets the effective size of the network to a value
smaller than its VC-dimension.
The effective size of the machine at time t is defined to be $d(t) \equiv \sum_{i=1}^{d}\big[1-(1-\epsilon\lambda_i)^t\big]^2$, which
increases monotonically to d, the VC-dimension of the network, as $t \to \infty$. This definition is justified
after the following theorem:
Theorem 4.1 Under the assumptions of Theorem 3.1, the following equation holds uniformly for
all $\alpha_0$ such that $\|\delta\| \le M_n$:
$$\mathcal{E}(\alpha_t) = \mathcal{E}(\alpha^*, t) + \frac{\sigma^2}{n}\,d(t) + O(n^{-3/2}). \qquad (7)$$
In the limit of learning, we have by letting $t \to \infty$ in the above equation,
$$\mathcal{E}(\hat\alpha) = \mathcal{E}(\alpha^*) + \frac{\sigma^2}{n}\,d + O(n^{-3/2}). \qquad (8)$$
Hence, to an order of $O(n^{-3/2})$, the generalization error at the limit of training breaks into two parts:
the approximation error $\mathcal{E}(\alpha^*)$, and the complexity error $\frac{\sigma^2}{n}d$. Clearly, the latter is proportional to
d, the VC-dimension of the network. For all d's larger than necessary, $\mathcal{E}(\alpha^*)$ remains a constant,
and the generalization error is determined solely by $\frac{\sigma^2}{n}d$. The term $\mathcal{E}(\alpha^*, t)$ differs from $\mathcal{E}(\alpha^*)$ only in
terms of initial error, and is identified to be the approximation error at t. Comparison of the above
two equations thus shows that it is reasonable to define $\frac{\sigma^2}{n}d(t)$ as the complexity error at t, and
justifies the definition of d(t) as the effective size of the machine at the same time. The quantity
d(t) captures the notion of the degree to which the capacity of the machine is used at t. It depends
on the machine parameters, the algorithm being used, and the marginal distribution of X. Thus, we
see from (7) that the generalization error at epoch t falls into the same two parts as it does at the
limit: the approximation error (fitting error) and the complexity error (determined by the effective
size of the machine).
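A short numerical sketch may help make the effective size concrete. It assumes the form d(t) = sum over i of [1 - (1 - eps*lambda_i)^t]^2, i.e. each eigen-direction contributes the squared fraction of its initial error that has been removed by epoch t; the eigenvalues and learning rate are hypothetical.

```python
import numpy as np

def effective_size(t, lam, eps):
    """Assumed effective size: sum_i [1 - (1 - eps*lam_i)^t]^2."""
    a_t = (1.0 - eps * np.asarray(lam)) ** t
    return float(np.sum((1.0 - a_t) ** 2))

lam = np.array([2.0, 1.0, 0.2, 0.01])   # hypothetical eigenvalues, d = 4
eps = 0.4                                # keeps every 1 - eps*lam_i inside (0, 1)

epochs = (0, 1, 10, 100, 10000)
sizes = [effective_size(t, lam, eps) for t in epochs]
```

Directions with large eigenvalues are "switched on" almost immediately, while the direction with the smallest eigenvalue contributes to the effective size only after many epochs, so stopping early leaves the effective size well below d.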
As we have shown in the last section, during the first phase of learning, the complexity error is of
higher order in n compared to the fitting error, if the initial error
is of order $O(n^0)$ or larger. Thus a decrease of the fitting error (which is proportional to the training
error, as we will see in the next section) implies a decrease of the generalization error. However,
when the fitting error is brought down to the order $O(1/n)$, the decrease of the fitting error will no longer
imply a decrease of the generalization error. In fact, by the above theorem, the generalization
error at $t_1 + t$ can be written as
$$\mathcal{E}(\alpha_{t_1+t}) = \mathcal{E}(\alpha^*) + \sum_{i=1}^{d}\lambda_i\delta_i^2(t_1)(1-\epsilon\lambda_i)^{2t} + \frac{\sigma^2}{n}\,d(t) + O(n^{-3/2}).$$
The fitting error and the complexity error compete at order $O(1/n)$ during the second phase of learning.
After the second phase of learning, the complexity error dominates the fitting error, still at the
order of $O(1/n)$. Furthermore, if we define , -- i [-' p]*, then by the above equation and Corollary 3.3,
we have
Corollary 4.2 At the optimal stopping time, the following upper bound on the generalization error
holds,
$$\mathcal{E}(\alpha_{t_{\min}}) \le \mathcal{E}(\alpha^*) + \frac{\sigma^2}{n}(1-\kappa)\,d + O(n^{-3/2}).$$
Since $\kappa$ is a quantity of order $O(n^0)$, $(1-\kappa)d$ is strictly smaller than d. Thus stopping training at
$t_{\min}$ has the same effect as using a smaller machine of size less than $(1-\kappa)d$ and carrying training
out to the limit! A more detailed analysis reveals how the effective size of the machine is affected
by each neuron in the learning process (omitted due to the space limit).
REMARK: The concept of effective size of the machine can be defined similarly for an arbitrary
starting point. However, to compare the degree to which the capacity of the machine has been used
at t, one must specify at what distance between the hypothesis $\alpha$ and the truth $\alpha^*$ such a comparison
is started. While each point in the d-dimensional Euclidean space can be regarded as a hypothesis
(machine) about $\alpha^*$, it is intuitively clear that each of these machines has a different capacity to
approximate it. But it is reasonable to think that all of the machines that are on the same sphere
$\{\alpha : |\alpha - \alpha^*| = r\}$, for each r > 0, have the same capacity in approximating $\alpha^*$. Thus, to compare
the capacity being used at t, we must specify a specific sphere as the starting point; defining the
effective size of the machine at t without specifying the starting sphere is clearly meaningless. As
we have seen, r   is found to be a good choice for our purposes. 
5 NETWORK SIZE SELECTION 
The next theorem relates the generalization error and training error at each epoch of learning, and
forms the basis for choosing the optimal stopping time as well as the best size of the machine during
the learning process. In the limit of the learning process, the criterion reduces to the well-known
Akaike Information Criterion (AIC) for statistical model selection. Comparison of the two criteria
reveals that our criterion will result in better generalization than AIC, since it incorporates the
information of each individual neuron rather than just the total number of neurons as in the AIC.
Theorem 5.1 Assuming the learning algorithm converges, and the conditions of Theorem 3.1 are
satisfied; then the following equation holds:
$$\mathcal{E}(\alpha_t) = (1 + o(1))\,\mathcal{E}_n(\alpha_t) + C(d,t) + o(1/n), \qquad (9)$$
where $C(d,t) = \frac{2\sigma^2}{n}\sum_{i=1}^{d}\big[1-(1-\epsilon\lambda_i)^t\big]$. []
According to this theorem, we find an asymptotically unbiased estimate of $\mathcal{E}(\alpha_t)$ to be $\mathcal{E}_n(\alpha_t) + C(d,t)$
when $\sigma^2$ is known. This results in the following criterion for finding the optimal stopping
time and network size:
$$\min\{\mathcal{E}_n(\alpha_t) + C(d,t) : d, t = 1, 2, \ldots\} \qquad (10)$$
When t goes to infinity, the above criterion becomes:
$$\min\Big\{\mathcal{E}_n(\hat\alpha) + \frac{2\sigma^2 d}{n} : d = 1, 2, \ldots\Big\} \qquad (11)$$
which is the AIC for choosing the best size of networks. Therefore, (10) can be viewed as an
extension of the AIC to the learning process. To understand the differences, consider the case
when $\xi$ has the normal distribution $N(0, \sigma^2)$. Under this assumption, the Maximum Likelihood
(ML) estimation of the weight vectors is the same as the Mean Square estimation. The AIC was
obtained by minimizing the Kullback-Leibler distance between the density function $f_{\hat\alpha_{ML}}(X)$,
with $\hat\alpha_{ML}$ being the ML estimation of $\alpha$, and the true density f. This is equivalent to
minimizing $\lim_{t\to\infty} E\,(Y - f_{\alpha_t}(X))^2 = E\,(Y - f_{\hat\alpha_{ML}}(X))^2$ (assuming the limit and the expectation
are interchangeable). Now it is clear that while AIC chooses networks only at the limit of learning,
(10) does this in the whole learning process. Observe that the matrix $\Phi$ is now exactly the Fisher
Information Matrix of the density function $f_\alpha(X)$, and $\lambda_i$ is a measure of the capacity of $\psi_i$ in
fitting the relation between X and Y. Therefore our criterion incorporates the information about
each specific neuron provided by the Fisher Information Matrix, which is a measure of how well
the data fit the model. This implies that there are two aspects in finding the trade-off between the
model complexity and the empirical error in order to minimize the generalization error: one is to
have the smallest number of neurons and the other is to minimize the utilization of each neuron.
The AIC (and in fact most statistical model selection criteria) are aimed at the former, while our
criterion incorporates the two aspects at the same time. We have seen in the earlier discussions that
for a given number of neurons, this is done by using the capacity of each neuron in fitting the data
only to the degree $1-(1-\epsilon\lambda_i)^{t_{\min}}$ rather than to its limit.
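For a fixed number of neurons, the stopping-time part of the criterion can be sketched as follows. The sketch assumes the penalty takes the form C(d, t) = (2 sigma^2 / n) * sum_i [1 - (1 - eps*lambda_i)^t], that the noise variance is known, and, as a hypothetical plug-in that the text does not prescribe, that the eigenvalues of the empirical covariance matrix stand in for the lambda_i. The function name and data are illustrative only.

```python
import numpy as np

def stopping_criterion(Psi, y, sigma2, eps, max_steps, alpha0=None):
    """Score each epoch by training error plus the assumed penalty C(d, t).

    Psi has shape (d, n). Eigenvalues of the empirical covariance are used as a
    plug-in for lambda_i (an assumption of this sketch).
    Returns (epoch with the smallest score, array of scores for t = 1..max_steps).
    """
    d, n = Psi.shape
    Phi_hat = Psi @ Psi.T / n
    b = Psi @ y / n
    lam = np.linalg.eigvalsh(Phi_hat)
    alpha = np.zeros(d) if alpha0 is None else alpha0.copy()
    scores = []
    for t in range(1, max_steps + 1):
        alpha = alpha - eps * (Phi_hat @ alpha - b)            # one gradient step
        train_err = np.mean((y - Psi.T @ alpha) ** 2)          # empirical error at epoch t
        penalty = 2.0 * sigma2 / n * np.sum(1.0 - (1.0 - eps * lam) ** t)
        scores.append(train_err + penalty)
    scores = np.array(scores)
    return int(np.argmin(scores)) + 1, scores

rng = np.random.default_rng(2)
d, n, sigma2 = 5, 100, 0.25
Psi = rng.normal(size=(d, n))                                   # synthetic features
y = Psi.T @ rng.normal(size=d) + np.sqrt(sigma2) * rng.normal(size=n)
eps = 0.5 / np.linalg.eigvalsh(Psi @ Psi.T / n).max()
t_stop, scores = stopping_criterion(Psi, y, sigma2, eps, max_steps=500)
```

As the number of epochs grows, the penalty tends to 2 sigma^2 d / n and the score tends to the AIC value at the least-squares solution, so the AIC appears as the last point of this curve. Repeating the computation for several values of d and taking the smallest score overall would give a joint choice of size and stopping time.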
6 CONCLUDING REMARKS 
To the best of our knowledge, the results described in this paper provide for the first time a precise
language to describe overtraining phenomena in learning machines such as neural networks. We
have studied formally the generalization process of a linear machine when it is trained with a
gradient descent algorithm. The concept of effective size of a machine was introduced to break the
generalization error into two parts: the approximation error and the error caused by a complexity
term which is proportional to effective size; the former decreases monotonically and the latter increases
monotonically in the learning process. When the machine is trained on a finite number of examples,
there are in general three distinct phases of learning according to the relative magnitude of the
fitting and complexity errors. In particular, there exists an optimal stopping time $t_{\min} = O(\ln n)$
for minimizing generalization error which occurs before the global minimum of the empirical error
is reached. These results lead to a generalization of the AIC in which the effect of certain network
parameters and time of learning are together taken into account in the network size selection process.
For practical application of neural networks, these results demonstrate that training a network
to its limits is not desirable. From the learning-theoretic point of view, the concept of effective
dimension of a network tells us that we need more than the VC-dimension of a machine to describe
the generalization properties of a machine, except in the limit of learning.
The generalization of the AIC reveals some unknown facts in statistical model selection theory:
namely, the generalization error of a network is affected not only by the number of parameters
but also by the degree to which each parameter is actually used in the learning process. Occam's
principle therefore stands in a subtler form: Make minimal use of the capacity of a network for
encoding the information provided by learning samples.
Our results hold for weaker assumptions than were made herein about the distributions of X
and $\xi$. The case of machines that have vector (rather than scalar) outputs is a simple generalization.
Also, our theorems have recently been generalized to the case of general nonlinear machines and are
not restricted to the squared error loss function.
While the problem of inferring a rule from the observational data has been studied for a long
time in learning theory as well as in other contexts such as Linear and Nonlinear Regression, the
study of the problem as a dynamical process seems to open a new avenue for looking at the problem.
Many problems are open. For example, it is interesting to know what could be learned from a finite
number of examples in a finite number of iterations in the case where the size of the machine is not
small compared to the sample size.
Acknowledgments 
C. Wang thanks Siemens Corporate Research for support during the summer of 1992 when this
research was initiated. The work of C. Wang and S. Venkatesh has been supported in part by the
Air Force Office of Scientific Research under grant F49620-93-1-0120.
References 
[1] Akaike, H. (1974) Information theory and an extension of the maximum likelihood principle.
Second International Symposium on Information Theory, Ed. B.N. Krishnaiah, North Holland,
Amsterdam, 27-41.
[2] Baldi, P. and Y. Chauvin (1991) Temporal evolution of generalization during learning in linear
networks. Neural Computation 3, 589-603.
[3] Chow, G. C. (1981) A comparison of the information and posterior probability criteria for model
selection. Journal of Econometrics 16, 21-34.
[4] Hansen, Lars Kai (1993) Stochastic linear learning: exact test and training error averages.
Neural Networks 4, 393-396.
[5] Haussler, D. (1989) Decision theoretical generalization of the PAC model for neural networks
and other learning applications. Preprint.
[6] Heskes, Tom M. and Bert Kappen (1991) Learning processes in neural networks. Physical Review
A, Vol. 44, No. 4, 2718-2726.
[7] Krogh, Anders and John A. Hertz. Generalization in a linear perceptron in the presence of
noise. Preprint.
[8] Nilsson, N.J. Learning Machines. New York: McGraw Hill.
[9] Pinelis, I., and S. Utev (1984) Estimates of moments of sums of independent random variables.
Theory of Probability and Its Applications 29, 574-577.
[10] Rissanen, J. (1987) Stochastic complexity. J. Royal Statistical Society, Series B, Vol. 49, No. 3,
223-265.
[11] Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics 6, 461-464.
[12] Sazonov, V. (1982) On the accuracy of normal approximation. Journal of Multivariate Analysis
12, 371-384.
[13] Senatov, V. (1980) Uniform estimates of the rate of convergence in the multi-dimensional central
limit theorem. Theory of Probability and Its Applications 25, 745-758.
[14] Vapnik, V. (1992) Measuring the capacity of learning machines (I). Preprint.
[15] Weigend, S.A. and Rumelhart (1991) Generalization through minimal networks with application
to forecasting. INTERFACE'91 - 23rd Symposium on the Interface: Computing Science and
Statistics, ed. E. M. Keramidas, pp. 362-370. Interface Foundation of North America.
