Generalization Dynamics in 
LMS Trained Linear Networks 
Yves Chauvin* 
Psychology Department 
Stanford University 
Stanford, CA 94305 
Abstract 
For a simple linear case, a mathematical analysis of the training and gener- 
alization (validation) performance of networks trained by gradient descent 
on a Least Mean Square cost function is provided as a function of the learn- 
ing parameters and of the statistics of the training data base. The analysis 
predicts that generalization error dynamics are very dependent on a pri- 
ori initial weights. In particular, the generalization error might sometimes 
weave within a computable range during extended training. In some cases, 
the analysis provides bounds on the optimal number of training cycles for 
minimal validation error. For a speech labeling task, predicted weaving 
effects were qualitatively tested and observed by computer simulations in 
networks trained by the linear and non-linear back-propagation algorithm. 
I INTRODUCTION 
Recent progress in network design demonstrates that non-linear feedforward neu- 
ral networks can perform impressive pattern classification for a variety of real-world 
applications (e.g., Le Cun et al., 1990; Waibel et al., 1989). Various simulations and 
relationships between the neural network and machine learning theoretical litera- 
tures also suggest that too large a number of free parameters ("weight overfitting") 
could substantially reduce generalization performance. (e.g., Baum, 1989 1989). 
A number of solutions have recently been proposed to decrease or eliminate the 
overfitting problem in specific situations. They range from ad hoc heuristics to 
theoretical considerations (e.g., Le Cun et al., 1990; Chauvin, 1990a; Weigend et al., 
*Also with Thomson-CSF, Inc., 630 Hansen Way, Suite 250, Palo Alto, CA 94304. 
89O 
Generalization Dynamics in LMS Trained Linear Net-works 891 
In Press). For a phoneme labeling application, Chauvin showed that the overfitting 
phenomenon was actually observed only when networks were overtrained far beyond 
their "optimal" performance point (Chauvin, 1990b). Furthermore, generalization 
performance of networks seemed to be independent of the size of the network during 
early training but the rate of decrease in performance with overtraining was indeed 
related the number of weights. 
The goal of this paper is to better understand training and generalization error dy- 
namics in Least-Mean-Square trained linear networks. As we will see, gradient de- 
scent training on linear networks can actually generate surprisingly rich and insight- 
ful validation dynamics. Furthermore, in numerous applications, even non-linear 
networks tend to function in their linear range, as if the networks were making use 
of non-linearities only when necessary (Weigend et al., In Press; Chauvin, 1990a). 
In Section 2, I present a theoretical illustration yielding a better understanding of 
training and validation error dynamics. In Section 3, numerical solutions to ob- 
tained analytical results make interesting predictions for validation dynamics under 
overtraining. These predictions are tested for a phonemic labeling task. The ob- 
tained simulations suggest that the results of the analysis obtained with the simple 
theoretical framework of Section 2 might remain qualitatively valid for non-linear 
complex architectures. 
2 THEORETICAL H, LUSTRATION 
2.1 ASSUMPTIONS 
Let us consider a linear network composed of n input units and n output units fully 
connected by a n.n weight matrix W. Let us suppose the network is trained to 
reproduce a noiseless output "signal" from a noisy input "signal" (the network can 
be seen as a linear filter). We write F as the "signal", N the noise, X the input, Y 
the output, and D the desired output. For the considered case, we have X = F+N, 
Y = W X and D = F. 
The statistical properties of the data base are the following. The signal is zero-mean 
with covariance matrix CF. We write Ai and ei as the eigenvalues and eigenvectors 
of CF (el are the so-called principal components; we will call Ai the "signal power 
spectrum"). The noise is assumed to be zero-mean, with covariance matrix Cs - 
P.I where I is the identity matrix. We assume the noise is uncorrelated with the 
signal: 'rN -- O. We suppose two sets of patterns have been sampled for training 
and for validation. We write OF, Cs and CFN the resulting covariance matrices for 
the training set and C., Cv and CN the corresp_onding matrices for the validation 
set. We assume CF -- C  F, CFN  CN  CFN -- O, aN -- 12.I and Cv 
with yt > y. (Numerous of these assumptions are made for the sake of clarity of 
explanation: they can be relaxed without changing the resulting implications.) 
The problem considered is much simpler than typical realistic applications. How- 
ever, we will see below that (i) a formal analysis becomes complex very quickly 
(ii) the validation dynamics are rich, insightful and can be mapped to a number 
of results observed in simulations of realistic applications and (iii) an interesting 
number of predictions can be obtained. 
892 Chauvin 
2.2 LEARNING 
The network is trained by gradient descent on the Least Mean Square (LMS) error: 
AW -- -77wE where r/ is the usual learning rate and, in the case considered, 
E - P(Fp - Yp)T(Fp- Yp). We can write the gradient as a function of the 
various covariance matrices: VwE  (I - W)Cr  (I - 2W)Cr - WC. From 
the general assumptions, we get: 
VwE  CF - WCF - WCN (1) 
We assume now that the principal components ei are also eigenvectors of the weight 
matr W at iteration k with corresponding eigenvalue aik: Wk.ei = aikei. We can 
then compute the image of each eigenvector ei at iteration k + 1: 
W+l.ei = .Ai.ei + ai}[1 - .(Ai + y)].ei (2) 
Therefore, ei is aO an eigenvector of W}+ and ai,k+l satisfies the induction: 
Assuming W0 = 0, we can compute the alpha-dynamics of the weight matrk W: 
a,, - Ai + 7 [1 - (1 - .Ai + .))'] (4) 
As k goes to infinity, provided ? < 1/AM + , a approaches A/(A + y), which 
corresponds to the optimal (Wiener) value of the linear filter implemented by the 
network. We will write the convergence rates ai = 1 -?Ai -?y. These rates depend 
on the signal "power spectrum", on the noe power and on the learning rate . 
If we now assume Wo.ei = aio.ei with ai0  0 (this assumption can be made more 
general), we get: 
Ai biaS] (5) 
where bi = 1 -io- iov/i. Figure 1 represents possible alpha dynamics for 
arbitrary vMues of i with i0  0  0. 
We can now compute the learning error dynamics by expanding the LMS error term 
E at time k. Using the general assumptions on the covariance matrices, we find: 
E =  Ei =  ,(1 -ai) 2 + Yai (6) 
i i 
Therefore, training error is a sum of err comnents, each of them being a 
quadratic function of ai. Figure 2 represents a training error component Ei as 
a function of . Knowing the alpha-dynamics, we can write these error components 
as a function of k: 
= + + Aiba) (7) 
It is easy to see that E is a monotonic decreasing function (generated by gradient 
descent) which converges to the bottom of the quadratic error surface, yielding the 
residual asymptotic error: 
(8) 
E = Ai + i 
i 
Generalization Dynamics in LMS Trained Linear Networks 893 
1.0 
Cio 
0.0 , , 
0 20 
 I  I  I  I 
40 6O 80 100 
Number of Cycles 
.i=4 
.i = .7 
'i = .2 
Figure 1: Alpha dynamics for different values of Ai with r/= .01 and ai0 = r0 y 0. 
The solid lines represent the optimal values of ri for the training data set. The 
dashed lines represent corresponding optimal values for the validation data set. 
LMS 
I I 
I I 
I I 
I I I I 
.X .X i 
0 )q+u' 
Ot ik 
Figure 2: Training and validation error dynamics as a function of ci. The dashed 
curved lines represent the error dynamics for the initial conditions ai0. Each training 
error component follows the gradient of a quadratic learning curve (bottom). Note 
the overtraining phenomenon (top curve) between a? (optimal for validation) and 
rioo (optimal for training). 
894 Chauvin 
2.3 GENERALIZATION 
Considering the general assumptions on the statistics of the data base, we can 
compute the validation error E  (Note that "validation error" strictly applies to the 
validation data set. "Generalization error" can qualify the validation data set or 
the whole population, depending on context.): 
(9) 
where the alpha-dynamics are imposed by gradient descent learning on the training 
data set. Again, the validation error is a sum of error components El, quadratic 
functions of ci. However, because the alpha-dynamics are adapted to the training 
sample, they might generate complex dynamics which will strongly depend on the 
inital values ci0 (Figure 1). Consequently, the resulting error components E are not 
monotonic decreasing functions anymore. As seen in Figure 2, each of the validation 
error components might (i) decrease (ii) decrease then increase (overtraining) or 
(iii) increase as a function of ci0. For each of these components, in the case of 
overt raining, it is possible to compute the value of ci} at which training should be 
stopped to get minimal validation error: 
Log x,+,,, + Logx,_,,o(X,+,,, ) (10) 
k = Log(1 - lAi - lv) 
However, the validation error dynamics become much more complex when we con- 
sider sums of these components. If we assume ci0 = 0, the minimum (or minima) 
of E  can be found to correspond to possible intersections of hyper-ellipsoids and 
power curves. In general, it is possible to show that there exists at least one such 
minimum. It is also possible to find simple bounds on the optimal training time for 
minimal validation error: 
" "-' Log "-' 
t'g'x-Z-4'P' < k* < (11) 
;og(i -  - ) - - ;og(1 -  - ) 
These bounds are tight when the noise power is small compared to the signal "power 
spectrum". For ai0  0, a formal analysis of the validation error dynamics becomes 
intractable. Because some error components might increase while others decrease, 
it is possible to imagine multiple minima and maxima for the total validation error 
(see simulations below). Considering each component's dynamics, it is nonetheless 
possible to compute bounds within which E' might vary during training: 
- ,.  (" + .) (la) 
  +. 5 s 5 ( + . 
i i 
Because of the "exponential" hature of training (Figure 1), it is possible to imagine 
that this "weaving" effect might still be observed after a long training period, when 
the training error itself has become stable. Furthermore, whereas the training error 
will qualitatively show the same dynacs, validation error will very much depend 
on i0: for sufficiently large initial weights, validation dynacs ght be very 
dependent on particular simulation "runs". 
Generalization Dynamics in LMS Trained Linear Networks 895 
Figure 3: Training (bottom curves) and validation (top curves) error dynamics in 
a two-dimensional case for A - 17, A2 - 1.7, y - 2, y -- 10, c0 - 0 as c.0 varies 
from 0 to 1.6 (bottom-up) in .2 increments. 
3 SIMULATIONS 
3.1 CASE STUDY 
Equations 7 and 9 were simulated for a two-dimensional case (n - 2) with A1 -- 
17, A - 1.7, y - 2, y = 10 and c10 - 0. The values of a0 determined the 
relative dominance of the two error components during training. Figure 3 represents 
training and validation dynamics as a function of k for a range of values of 
As shown analytically, training dynamics are basically unaffected by the initial 
conditions of the weight matrix W0. However, a variety of validation dynamics 
can be observed as c0 varies from 0 to 1.6. For 1.6 _ c20 _ 1.4, the validation 
error is monotically decreasing and looks like a typical "gradient descent" training 
error. For 1.2 _ c0 _ 1.0, each error component in turn imposes a descent rate: 
the validation error looks like two "connected descents". For .8 _ a0 _ .6, E is 
monotically decreasing with a slow convergence rate, forcing the validation error to 
decrease long after E has become stable. This creates a minimum, followed by a 
maximum, followed by a minimum for E . Finally, for .4 _ c0 _ 0, both error 
components have a single minimum during training and generate a single minimum 
for the total validation error E . 
3.2 PHONEMIC LABELING 
One of the main predictions obtained from the analytical results and from the 
previous case study is that validation dynamics can demonstrate multiple local 
minima and maxima. To my knowledge, this phenomenon has not been described in 
the literature. However, the theory also predicts that the phenomenon will probably 
appear very late in training, well after the training error has become stable, which 
might explain the absence of such observations. The predictions were tested for a 
phonemic labeling task with spectrograms as input patterns and phonemes as output 
896 Chauvin 
patterns. Various architectures were tested (direct connections or back-propagation 
networks with linear or non-linear hidden layers). Due to the limited length of 
this article, the complete simulations will be reported elsewhere. In all cases, as 
predicted, multiple mimina/maxima were observed for the validation dynamics, 
provided the networks were trained way beyond usual training times. Furthermore, 
these generalization dynamics were very dependent on the initial weights (provided 
sufficient variance on the initial weight distribution). 
4 DISCUSSION 
It is sometimes assumed that optimal learning is obtained when validation error 
starts to increase during the course of training. Although for the theoretical study 
presented, the first minimum of E t is probably always a global minimum, inde- 
pendently of ci0, simulations of the speech labeling task show it is not always the 
case with more complex architectures: late validation minima can sometimes (albeit 
rarely) be deeper than the first "local" minimum. These observations and a lack 
of theoretical understanding of statistical inference under limited data set raise the 
question of the significance of a validation data set. As a final comment, we are 
not really interested in minimal validation error (E ) but in minimal generalization 
error (Et). Understanding the dynamics of the "population" error as a function 
of training and validation errors necessitates, at least, an evaluation of the sample 
statistics as a function of the number of training and validation patterns. This is 
beyond the scope of this paper. 
Acknowledgements 
Thanks to Pierre Baldi and Julie Holmes for their helpful comments. 
References 
Baum, E. B. & Haussler, D. (1989). What size net gives valid generalization? Neural 
Computation, 1, 151-160. 
Chauvin, Y. (1990a). Dynamic behavior of constrained back-propagation networks. 
In D. S. Touretzky (Ed.), Neural Information Processing Systems (Vol. 2) (pp. 
642-649). San Mateo, CA: Morgan Kaufman. 
Chauvin, Y. (1990b). Generalization performance of overtrained back-propagation 
networks. In L. B. Almeida & C. J. Wellekens (Eds.), Lecture Notes in Com- 
puter Science (Vol. 412) (pp. 46-55). Berlin: Germany: Springer-Verlag. 
Cun, Y. L., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., 
& Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation 
network. In D. S. Touretzky (Ed.), Neural Information Processing Systems 
(Vol. 2) (pp. 396-404). San Mateo, CA: Morgan Kaufman. 
Waibel, A., Sawai, H., & Shikano, K. (1989). Modularity and scaling in large 
phonemic neural networks. IEEE Transactions on Acoustics, Speech and Signal 
Processing, ASSP-37, 1888-1898. 
Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (In Press). Predicting the 
future: a connectionist approach. International Journal of Neural Systems. 
