Learning with Noise and Regularizers in 
Multilayer Neural Networks 
David Saad 
Dept. of Comp. Sci. & App. Math. 
Aston University 
Birmingham B4 7ET, UK 
D.Saad@aston.ac.uk 
Sara A. Solla 
AT&T Research Labs 
Holmdel, NJ 07733, USA 
solla@research.att.com 
Abstract 
We study the effect of noise and regularization in an on-line 
gradient-descent learning scenario for a general two-layer student 
network with an arbitrary number of hidden units. Training ex- 
amples are randomly drawn input vectors labeled by a two-layer 
teacher network with an arbitrary number of hidden units; the ex- 
amples are corrupted by Gaussian noise affecting either the output 
or the model itself. We examine the effect of both types of noise 
and that of weight-decay regularization on the dynamical evolu- 
tion of the order parameters and the generalization error in various 
phases of the learning process. 
I Introduction 
One of the most powerful and commonly used methods for training large layered 
neural networks is that of on-line learning, whereby the internal network parameters 
{J} are modified after the presentation of each training example so as to minimize 
the corresponding error. The goal is to bring the map fj implemented by the 
network as close as possible to a desired map ] that generates the examples. Here 
we focus on the learning of continuous maps via gradient descent on a differentiable 
error function. 
Recent work [1]-[4] has provided a powerful tool for the analysis of gradient-descent 
learning in a very general learning scenario [5]: that of a student network with N 
input units, K hidden units, and a single linear output unit, trained to implement a 
continuous map ?ore an N-dimensional input space  onto a scalar (. Examples of 
the target task f are in the form of input-output pairs (, (). The output labels 
( to independently drawn inputs  are provided by a teacher network of similar 
Learning with Noise and Regularizers in Multilayer Neural Networks 261 
architecture, except that its number M of hidden units is not necessarily equal to 
K. 
Here we consider the possibility of a noise process p that corrupts the teacher 
output. Learning from corrupt examples is a realistic and frequently encountered 
scenario. Previous analysis of this case have been based on various approaches: 
Bayesian [6], equilibrium statistical physics [7], and nonequilibrium techniques for 
analyzing learning dynamics [8]. Here we adapt our previously formulated tech- 
niques [2] to investigate the effect of different noise mechanisms on the dynamical 
evolution of the learning process and the resulting generalization ability. 
2 The model 
We focus on a soft committee machine [1], for which all hidden-to-output weights 
are positive and of unit strength. Consider the student network: hidden unit i 
receives information from input unit r through the weight Jir, and its activation 
under presentation of an input pattern  - (El,... ,Es) is xi = Ji  , with Ji = 
(Jil,  ., JiN) defined as the vector of incoming weights onto the i-th hidden unit. 
The output of the student network is cr(J,) = E/K=i g (Ji. ), where g is the 
activation function of the hidden units, taken here to be the error function g(x) -- 
erf(x/V), and J -- {Ji}l<i<K is the set of input-to-hidden adaptive weights. 
The components of the input vectors  are uncorrelated random variables with zero 
mean and unit variance. Output labels ( are provided by a teacher network of sim- 
ilar architecture: hidden unit n in the teacher network receives input information 
through the weight vector B = (Bnl,..., BnN), and its activation under presenta- 
tion of the input pattern  is y = B  . In the noiseless case the teacher output 
is given by (0  M (B ). Here we concentrate on the architecturally 
-- En:l g ' 
matched case M = K, and consider two types of Gaussian noise: additive output 
noise that results in ( = p + -K= g (n. ), and model noise introduced as 
fluctuations in the activations y of the hidden units, ( K 
= -=1 g (P + n .). 
The random variables p and pn  are taken to be Gaussian with zero mean and 
variance a 2 . 
The error made by a student with weights J on a given input  is given by the 
quadratic deviation 
e(J,) --  [ G(J,) - (0 = g(xi)--g(yn) , 
i--1 n--1 
(1) 
measured with respect to the noiseless teacher (it is also possible to measure 
performance as deviations with respect to the actual output ( provided by the 
noisy teacher). Performance on a typical input defines the generalization error 
eg(J)  < e(J,) >{], through an average over all possible input vectors  to 
be performed implicitly through averages over the activations x = (Xl,. ., XK) and 
Y = (Yl,..., YK). These averages can be performed analytically [2] and result in a 
compact expression for eg in terms of order parameters: Qik ---- Ji 'Jk, Rin = Ji 'Sn, 
and Tm -- B. Bin, which represent student-student, student-teacher, and teacher- 
teacher overlaps, respectively. The parameters Tm are characteristic of the task to 
be learned and remain fixed during training, while the overlaps Qik among student 
hidden units and Rin between a student and a teacher hidden units are determined 
by the student weights J and evolve during training. 
A gradient descent rule on the error made with respect to the actual output provided 
262 D. Saad and S. A. Solla 
by the noisy teacher results in j,+x = j + N 5'  for the update of the student 
weights, where the learning rate /has been scaled with the input size N, and 
depends on the type of noise. The time evolution of the overlaps Ri, and Qi can 
be written in terms of similar difference equations. We consider the large N limit, 
and introduce a normalized number of examples a = It/N to be interpreted as a 
continuous time variable in the N   limit. The time evolution of Ri, and Qi 
is thus described in terms of first-order differential equations. 
3 Output noise 
The resulting equations of motion for the student-teacher and student-student over- 
laps are given in this case by: 
dRi,, 
= < >, (2) 
da 
dQit: 
= r I < 5 x > +r I < 5 x > +rl 2 < 5 5 > +riCo ' < g'(xi) g'(x) >, 
da 
where each term is to be averaged over all possible ways in which an example 
could be chosen at a given time step. These averages have been performed using 
the techniques developed for the investigation of the noiseless case [2]; the only 
difference due to the presence of additive output noise is the need to evaluate the 
fourth term in the equation of motion for Qi, proportional to both r/ and a . 
We focus on isotropic uncorrelated teacher vectors: T,rn = T 6,m, and choose T = 1 
in our numerical examples. The time evolution of the overlaps lin and Qik follows 
from integrating the equations of motion (2) from initial conditions determined by 
a random initialization of the student vectors {Ji}l<i<K. Random initial norms 
Qii for the student vectors are taken here from a uniform distribution in the [0, 0.5] 
interval. Overlaps Qik between independently chosen student vectors Ji and J, 
or tin between Ji and an unknown teacher vector B, are small numbers of order 
1/x/- for N >> K, and taken here from a uniform distribution in the [0, 10 -x2] 
interval. 
We show in Figures 1.a and 1.b the evolution of the overlaps for a noise variance 
a 2 = 0.3 and learning rate r/= 0.2. The example corresponds to M = K = 3. The 
qualitative behavior is similar to the one observed for M = K in the noiseless case 
extensively analyzed in [2]. A very short transient is followed by a long plateau 
characterized by lack of differentiation among student vectors: all student vectors 
have the same norm Qii -- Q, the overlap between any two different student vectors 
takes a unique value Qik -- C for i - k, and the overlap lin between an arbitrary 
student vector i and a teacher vector n is independent of i (as student vectors are 
indistinguishable in this regime) and of n (as the teacher is isotropic), resulting 
in Iin - R. This phase is characterized by an unstable symmetric solution; the 
perturbation introduced through the nonsymmetric initialization of the norms 
and overlaps lin eventually takes over in a transition that signals the onset of 
specialization. 
This process is driven by a breaking of the uniform symmetry of the matrix of 
student-teacher overlaps: each student vector acquires an increasingly dominant 
overlap R with a specific teacher vector which it begins to imitate, and a gradually 
decreasing secondary overlap $ with the remaining teacher vectors. In the example 
of Figure 1.b the assignment corresponds to i = 1 - n = 1, i = 2 - n = 3, and 
i = 3 - n = 2. A relabeling of the student hidden units allows us to identify R 
with the diagonal elements and $ with the off-diagonal elements of the matrix of 
student-teacher overlaps. 
Learning with Noise and Regularizers in Multilayer Neural Networks 263 
(a) 
1.0- I ..... 
0.5- 
0.0- 
0 
Q,, -- Q,2 ..... Q,, 
..... Q ........... Q2, ...... 
I I I 
1.2 
1.0 t 
0.8 
(b) 
0.6- 
0.4- 
0.2- 
0.0- 
0 
I I I 
2000 4000 6000 
0.03- 
0.025 - 
0.02 - 
0.015- 
0.01 
0.005 
0.0 
(c) 
....... a. 0, \\\ 
O'.0 3 
I I t [ 1 
0 500 1000 1500 2000 2500 3000 
ot 
(d) 
........ Zo 1 
......... Z 
0.03 - 
0.025 - 
0.02- 
0.015 
0.01 
0.005 
4'10 3 6'10 3 8'103 
Figure 1: Dependence of the overlaps and the generalization error on the normal- 
ized number of examples ct for a three-node student learning corrupted examples 
generated by an isotropic three-node teacher. (a) student-student overlaps Qi and 
(b) student-teacher overlaps tin for tr 2 = 0.3. The generalization error is shown in 
(c) for different values of the noise variance r 2, and in (d) for different powers of 
the polynomial learning rate decay, focusing on ct > ct0 (asymptotic regime). 
Asymptotically the secondary overlaps $ decay to zero, while Ri, --  indicates 
full alignment for T,, = 1. As specialization proceeds, the student weight vectors 
grow in length and become increasingly uncorrelated. It is interesting to observe 
that in the presence of noise the student vectors grow asymptotically longer than 
the teacher vectors: Qii -+ Qoo > 1, and acquire a small negative correlation with 
each other. Another detectable difference in the presence of noise is a larger gap 
between the values of Q and C' in the symmetric phase. Larger norms for the 
student vectors result in larger generalization errors: as shown in Figure 1.c, the 
generalization error increases monotonically with increasing noise level, both in the 
symmetric and asymptotic regimes. 
For an isotropic teacher, the teacher-student and student-student overlaps can thus 
be fully characterized by four parameters: Qi = Q6i +C(1-6ik ) and Rin = R6in + 
$(1 -5i,). In the symmetric phase the additional constraint R = $ reflects the lack 
of differentiation among student vectors and reduces the number of parameters to 
three. 
The symmetric phase is characterized by a fixed point solution to the equations 
264 D. Saad and S. A. Solla 
of motion (2) whose coordinates can be obtained analytically in the small noise 
approximation: R* - 1/v/K(2K - 1) 4- r (r 2 r, , Q* = 1/(2K - 1) 4- r/(r 2 qs , and 
C* = 1/(2K - 1) + r (r 2 cs, with rs, qs, and cs given by relatively simple functions 
of K. The generalization error in this regime is given by: 
* K (-Karcsin( I )) '/(2K-1) a/2 
eg ---- -- - 4- 2- (2/x 7 4- 1)1/2 ; (3) 
note its increase over the corresponding noiseless value, recovered for (r 2 = 0. 
The asymptotic phase is characterized by a fixed point solution with R* - $*. The 
coordinates of the asymptotic fixed point can also be obtained analytically in the 
small noise approximation: R* = 1 + 
and C* - -r/(r 2 with qa, and ca given by rational functions of K with 
-- Ca, ray Sa, 
corrections of order r/. The asymptotic generalization error is given by 
% = 6r /(r2K' (4) 
Explicit expressions for the coefficients rs, qs, cs, ra, sa, qa, and ca will not be given 
here for lack of space; suffice it to say that the fixed point coordinates predicted on 
the basis of the small noise approximation are found to be in excellent agreement 
with the values obtained from the numerical integration of the equations of motion 
for a 2 < 0.3. 
It is worth noting in Figure 1.c that in the small noise regime the length of the 
symmetric plateau decreases with increasing noise. This effect can be investigated 
analytically by linearizing the equations of motion around the symmetric fixed point 
and identifying the positive eigenvalue responsible for the escape from the symmetric 
phase. This calculation has been carried out in the small noise approximation, to 
obtain A = (2/)K(2K- 1)-1/2(2K + 1) -a/2 + Aer2r/, where A is positive and 
increases monotonically with K for K > 1. A faster escape from the symmetric 
plateau is explained by this increase of the positive eigenvalue. The calculation 
is valid for a2r/ << 1; we observe experimentally that the trend is reversed as 
increases. A small level of noise assists in the process of differentiation among 
student vectors, while larger levels of noise tend to keep student vectors equally 
ignorant about the task to be learned. 
The asymptotic value (4) for the generalization error indicates that learning at finite 
r/ will result in asymptotically suboptimal performance for a 2 > 0. A monotonic 
decrease of the learning rate is necessary to achieve optimal asymptotic performance 
with e;: 0. Learning at small r/results in long trapping times in the symmetric 
phase; we therefore suggest starting the training process with a relatively large value 
of r/and switching to a decaying learning rate at c - c0, after specialization begins. 
We propose r/ = r/0 for c _ c0 and r/ = r/0/(c - c0) z for c > c0. Convergence 
to the asymptotic solution requires z _ 1. The value z: 1 corresponds to the 
fastest decay for r/(c0; the question of interest is to determine the value of z which 
results in fastest decay for eg(a). Results shown in Figure 1.d for a > a0 = 4000 
correspond to M: K = 3, r/0: 0.7, and a 2 = 0.1. Our numerical results indicate 
optimal decay of eg(c) for z = 1/2. A rigorous justification of this result remains 
to be found. 
4 Model noise 
The resulting equations of motion for the student-teacher and student-student over- 
laps can also be obtained analytically in this case; they exhibit a structure remark- 
Learning with Noise and Regularizers in Multilayer Neural Networks 265 
0.02- 
ff.o 
.......... O-:.0 
..... 0'.0 
0.03 - 
0.02 - 
0.01 - 
0.0   0.0  I  I  
0'10  5'10  1'10' 50 60 70 80 90 
10 20 30 40 
(x K 
Figure 2: Left - The generalization error for different values of the noise variance 
(r2; training examples are corrupted by model noise. Right - 5raax as a function of 
K. 
ably similar to those for the noiseless case reported in [2], except for some changes 
in the relevant covariance matrices. 
A numerical investigation of the dynamical evolution of the overlaps and generaliza- 
tion error reveals qualitative and quantitative differences with the case of additive 
output noise: 1) The sensitivity to noise is much higher for model noise than for 
output noise. 2) The application of independent noise to the individual teacher 
hidden units results in an effective anisotropic teacher and causes fluctuations in 
the symmetric phase; the various student hidden units acquire some degree of dif- 
ferentiation and the symmetric phase can no longer be fully characterized by unique 
values of Q and C. 3) The noise level does not affect the length of the symmetric 
phase. 
The effect of model noise on the generalization error is illustrated in Figure 2 for 
M: K = 3, r/= 0.2, and various noise levels. The generalization error increases 
monotonically with increasing noise level, both in the symmetric and asymptotic 
regimes, but there is no modification in the length of the symmetric phase. The 
dynamical evolution of the overlaps, not shown here for the case of model noise, 
exhibits qualitative features quite similar to those discussed in the case of additive 
output noise: we observe again a noise-induced widening of the gap between Q and 
C in the symmetric phase, while the asymptotic phase exhibits an enhancement of 
the norm of the student vectors and a small degree of negative correlation between 
them. 
Approximate analytic expressions based on a small noise expansion have been ob- 
tained for the coordinates of the fixed point solutions which describe the symmetric 
and asymptotic phases. In the case of model noise the expansions for the symmetric 
solution are independent of r/and depend only on r 2 and K. The coordinates of 
the asymptotic fixed point can be expressed as: R* 1 + 
Q* = 1 + a  qa, C*: -a  ca, with coefficients ra, sa, qa, and ca given by rational 
functions of K with corrections of order . The important difference with the out- 
put noise ce is that the asymptotic fixed point is shifted from its noiseless position 
even for  = 0. It is therefore not possible to achieve optimal asymptotic perfor- 
mance even if a decaying learning rate is utilized. The asymptotic generalization 
error is given by 
g = 12 
266 D. Saad and S. A. Solla 
Note that the asymptotic generalization error remains finite even as q --* 0. 
5 Regularlzers 
A method frequently used in real world training scenarios to overcome the effects of 
noise and parameter redundancy (K > M) is the use of regularizers such as weight 
decay (for a review see [6]). 
Weight-decay regularization is easily incorporated within the framework of on-line 
learning; it leads to a rule for the update of the student weights of the form j+l = 
J +  6  -  J. The corresponding equations of motion for the dynamical 
evolution of the teacher-student and student-student overlaps can again be obtained 
analytically and integrated numerically from random initial conditions. 
The picture that emerges is basically similar to that described for the noisy case: the 
dynamical evolution of the learning process goes through the same stages, although 
specific values for the order parameters and generalization error at the symmetric 
phase and in the asymptotic regime are changed as a consequence of the modification 
in the dynamics. 
Our numerical investigations have revealed no scenario, either when training from 
noisy data or in the presence of redundant parameters, where weight decay im- 
proves the system performance or speeds up the training process. This lack of 
effect is probably a generic feature of on-line learning, due to the absence of an 
additive, stationary error surface defined over a finite and fixed training set. In 
off-line (batch) learning, regularization leads to improved performance through the 
modification of such error surface. These observations are consistent with the ab- 
sence of 'overfitting' phenomena in on-line learning. One of the effects that arises 
when weight-decay regularization is introduced in on-line learning is a prolongation 
of the symmetric phase, due to a decrease in the positive eingenvalue that controls 
the onset of specialization. This positive eigenvalue, which signals the instability of 
the symmetric fixed point, decreases monotonically with increasing regularization 
strength 7, and crosses zero at '7rnax = r/max- The dependence of max on K is 
shown in Figure 2; for '7 > '7ma the symmetric fixed point is stable and the system 
remains trapped there for ever. 
The work reported here focuses on an architecturally matched scenario, with M = 
K. Over-realizable cases with K > M show a rich behavior that is rather less 
amenable to generic analysis. It will be of interest to examine the effects of different 
types of noise and regularizers in this regime. 
Acknowledgement: D.S. acknowledges support from EPSRC grant GR/L19232. 
References 
[1] M. Biehl and H. Schwarze, J. Phys. A 28, 643 (1995). 
[2] D. Saad and S.A. Solla, Phys. Rev. E 52, 4225 (1995). 
[3] D. Saad and S.A. Solla, preprint (1996). 
[4] P. Riegler and M. Biehl, J. Phys. A 28, L507 (1995). 
[5] G. Cybenko, Math. Control Signals and Systems 2, 303 (1989). 
[6] C.M. Bishop, Neural networks for pattern recognition, (Oxford University Press, Ox- 
ford, 1995). 
[7] T.L.H. Warkin, A. Rau, and M. Biehl, Rev. Mod. Phys. 65,499 (1993). 
[8] K.R. Milllet, M. Finke, N. Murata, K. Schulten, and S. Amari, Neural Computation 
8, 1085 (1996). 
