A Realizable Learning Task which 
Exhibits Overfitting 
Siegfried BSs 
Laboratory for Information Representation, RIKEN, 
Hirosawa 2-1, Wako-shi, Saitama, 351-01, Japan 
email: boes@zoo.riken.go.jp 
Abstract 
In this paper we examine a perceptron learning task. The task is 
realizable since it is provided by another perceptron with identi- 
cal architecture. Both perceptrons have nonlinear sigmoid output 
functions. The gain of the output function determines the level of 
nonlinearity of the learning task. It is observed that a high level 
of nonlinearity leads to overfitting. We give an explanation for this 
rather surprising observation and develop a method to avoid the 
overfitting. This method has two possible interpretations, one is 
learning with noise, the other cross-validated early stopping. 
I Learning Rules from Examples 
The property which makes feedforward neural nets interesting for many practical 
applications is their ability to approximate functions, which are given only by ex- 
amples. Feed-forward networks with at least one hidden layer of nonlinear units 
are able to approximate each continuous function on a N-dimensional hypercube 
arbitrarily well. While the existence of neural function approximators is already 
established, there is still a lack of knowledge about their practical realizations. Also 
major problems, which complicate a good realization, like overfitting, need a better 
understanding. 
In this work we study overfitting in a one-layer perceptron model. The model 
allows a good theoretical description while it exhibits already a qualitatively similar 
behavior as the multilayer perceptron. 
A one-layer perceptron has N input units and one output unit. Between input 
and output it has one layer of adjustable weights Wi, (i -- 1,..., N). The output z 
is a possibly nonlinear function of the weighted sum of inputs xi, i.e. 
N 
z=g(h), with h= v .= 
A Realizable Learning Task Which Exhibits Overfitting 219 
The quality of the function approximation is measured by the difference between 
the correct output z, and the net's output z averaged over all possible inputs. In 
the supervised learning scheme one trains the network using a set of examples x_ u 
(/ -- 1,...,P), for which the correct output is known. It is the learning task to 
minimize a certain cost function, which measures the difference between the correct 
output z, u and the net's output zU averaged over all examples. 
Using the mean squared error as a suitable measure for the difference between 
the outputs, we can define the training error ET and the generalization error EG 
as 
P 
I 1 
:= - , Ec := < - (2) 
The development of both errors as a function of the number P of trained examples 
is given by the learning curves. Training is conventionally done by gradient descend. 
For theoretical purposes it is very useful to study learning tasks, which are pro- 
vided by a second network, the so-called teacher network. This concept allows a 
more transparent definition of the difficulty of the learning task. Also the monitor- 
ing of the training process becomes clearer, since it is always possible to compare 
the student network and the teacher network directly. 
Suitable quantities for such a comparison are, in the perceptron case, the following 
order parameters, 
r :: iiWi---   Wi* Wi, q := IIWII = -.(Wi) 2 . (3) 
i=1 i=1 
Both have a very transparent interpretation, r is the normalized overlap between 
the weight vectors of teacher and student, and q is the norm of the student's weight 
vector. These order parameters can also be used in multilayer learning, but their 
number increases with the number of all possible permutations between the hidden 
units of teacher and student. 
2 The Learning Task 
Here we concentrate on the case in which a student perceptton has to learn a 
mapping provided by another perceptton. We choose identical networks for teacher 
and student. Both have the same sigmoid output function, i.e. g,(h) = g(h) = 
tanh(?h). Identical network architectures of teacher and student are realizable tasks. 
In principle the student is able to learn the task provided by the teacher exactly. 
UnreaIizabIe tasks can not be learnt exactly, there remains always a finite error. 
If we use uniformally distributed random inputs x_ and weights W, the weighted 
sum h in (1) can be assumed as Gaussian distributed. Then we can express the 
generalization error (2) by the order parameters (3), 
1 {tanh [TZ1]- tanh [q(rz1 q- V -- r 2 z2)] 
with the Gaussian measure 
Dz := oo Vr2_  exp - (5) 
From equation (4) we can see how the student learns the gain 7 of the teachers 
output function. It adjusts the norm q of its weights. The gain 7 plays an important 
role since it allows to tune the function tanh(7h) between a linear function (7 << 1) 
and a highly nonlinear function (7 >> 1). Now we want to determine the learning 
curves of this task. 
220 S. BOS 
3 Emergence of Overfitting 
3.1 Explicit Expression for the Weights 
Below the storage capacity of the perceptton, i.e. a = 1, the minimum of the training 
error ET is zero. A zero training error implies that every example has been learnt 
exactly, thus 
(6) 
The weights with minimal norm that fulfill this condition are given by the Pseu- 
doinverse (see Hertz et al. 1991), 
/,,= 1 i=1 
Note, that the weights are completely independent of the output function g(h) - 
g,(h). They are the same as in the simplest realizable case, linear perceptron learns 
linear perceptron. 
3.2 Statistical Mechanics 
The calculation of the order parameters can be done by a method from statistical 
mechanics which applies the commonly used replica method. For details about the 
replica approach see Hertz et al. (1991). The solution of the continuous perceptron 
problem can be found in BSs et al. (1993). Since the results of the statistical me- 
chanics calculations are exact only in the thermodynamic limit, i.e. N -+ oo, the 
variable c is the more natural measure. It is defined as the fraction of the number 
of patterns P over the system size N, i.e. c :- PIN. In the thermodynamic limit 
N and P are infinite, but c is still finite. Normally, reasonable system sizes, such 
as N > 100, are already well described by this theory. 
Usually one concentrates on the zero temperature limit, because this implies that 
the training error ET accepts its absolute minimum for every number of presented 
examples P. The corresponding order parameters for the case, linear perceptron 
learns linear student, are 
q = v vfS, r = vfS. (8) 
The zero temperature limit can also be called exhaustive training, since the student 
net is trained until the absolute minimum of ET is reached. 
For small c and high gains ?, i.e levels of nonlinearity, exhaustive training leads 
to overfitting. That means the generalization error EG(a) is not, as it should, 
monotonously decreasing with a. It is one reason for overfitting, that the training 
follows too strongly the examples. The critical gain ?c, which determines whether 
the generalization error EG(c 0 is increasing or decreasing function for small values 
of a, can be determined by a linear approximation. For small a, both order param- 
eters (3) are small, and the student's tanh-function in (4) can be approximated by 
a linear function. This simplifies the equation (4) to the following expression, 
Ea(e) = Ea(0) -  [ 2H(?) - 7], 
with H(?):=/Dz tanh(?z)z. (9) 
Since the function H(?) has an upper bound, i.e. V//r, the critical gain is reached 
if ?c - 2H(%). The numerical solution gives % - 1.3371. If ? is higher, the slope 
of Ea(a) is positive for small a. In the following considerations we will use always 
the gain ? - 5 as an example, since this is an intermediate level of nonlinearity. 
A Realizable Learning Task Which Exhibits Overfitting 221 
1.0 
0.8 
0.6 
0.4 
0.2 
0.0 
0.0 
I I I I 
lOO.O 
10.0 ..... 
2.0 .......... 
1.0 ..... 
0.2 0.4 0.6 0.8 1.0 
P/N 
Figure 1: Learning curves E(a) for the problem, tanh-perceptron learns tanh- 
perceptron, for different values of the gain ?. Even in this realizable case, exhaustive 
training can lead to overfitting, if the gain ? is high enough. 
3.3 How to Understand the Emergence of Overfitting 
Here the evaluation of the generalization error in dependence of the order parameters 
r and q is helpful. Fig. 2 shows the function EG(r, q) for r between 0 and 1 and q 
between 0 and 1.27. 
The exhaustive training in realizable cases follows always the line q(r) - ?r 
independent of the actual output function. That means, training is guided only by 
the training error and not by the generalization error. If the gain ? is higher than 
%, the line EG = Ea(0, 0) starts with a lower slope than q(r) = ?r, which results 
in overfitting. 
4 How to Avoid Overfitting 
From Fig. 2 we can guess already that q increases too fast compared to r. Maybe the 
ratio between q and r is better during the training process. So we have to develop 
a description for the training process first. 
4.1 Training Process 
We found already that the order parameters for finite temperatures (T > 0) of 
the statistical mechanics approach are a good description of the training process 
in an unrealizable learning task (BSs 1995). So we use the finite temperature order 
parameters also in this task. These are, again taken from the task 'linear perceptton 
learns linear perceptton', 
V/( ) (1 +a)a-2a r(a,a)= /() a 2 -a (10) 
q(a,a) = ? a 2 -a ' (l+a)a-2a' 
with the temperature dependent variable 
a:= 1+[/3(Q-q)] -x (11) 
222 S. BOS 
q 
6.0 
5.0 
4.0 
3.0 
2.0 
1.o 
: I : I 
..."local min.". ..... 
.: abs. mifi. 
; local mih. - ..... :: 
I 
I 
I 
I 
I 
I , 
: 
 ." I I 
 I ' I 
0.0 
0.0 0.2 0.4 0.6 0.8 1.0 
Figure 2: Contour plot of Ea(r,q) defined by (4), the generalization error as a 
function of the two order parameters. Starting from the minimum EG -- 0 at (r, q) -- 
(1, 5) the contour lines for EG -- 0.1, 0.2, ..., 0.8 are given (dotted lines). The dashed 
line corresponds to EG(0,0) -- 0.42. The solid lines are parametric curves of the 
order parameters (r, q) for certain training strategies. The straight line illustrates 
exhaustive training, the lower ones the optimal training, which will be explained in 
Fig. 3. Here the gain ? = 5. 
The zero temperature limit corresponds to a -- 1. We will show now that the 
decrease of the temperature dependent parameter a from oo to 1, describes the 
evolution of the order parameters during the training process. In the training process 
the natural parameter is the number of parallel training steps t. In each parallel 
training step all patterns are presented once and all weights are updated. Fig. 3 
shows the evolution of the order parameters (10) as parametric curves (r, q). 
The exhaustive learning curve is defined by a - I with the parameter c (solid 
line). For each c the training ends on this curve. The dotted lines illustrate the 
training process, a runs from infinity to 1. Simulations of the training process have 
shown that this theoretical curve is a good description, at least after some training 
steps. We will now use this description of the training process for the definition of 
an optimized training strategy. 
4.2 Optimal temperature 
The optimized training strategy chooses not a - 1 or the corresponding temperature 
T - 0, but the value of a (i.e. temperature), which minimizes the generalization 
error Eo. In the lower solid curve indicating the parametric curve (r, q) the value of 
a is chosen for every a, which minimizes Eo. The function Ea(a) has two minima 
between a - 0.5 and 0.7. The solid line indicates always the absolute minimum. 
The parametric curves corresponding to the local minima are given by the double 
dashed and dash-dotted lines. Note, that the optimized value a is always related 
to an optimized temperature through equation (11). But the parameter a is also 
related to the number of training steps t. 
A Realizable Learning Task Which Exhibits Overfitting 223 
q 
6.0 
5.0 
4.0 
3.0 
2.0 
1.o 
I 
local min. - .... 
abs. min. 
local min. - ..... 
simulation   
I I I 
; 
: 
; 
; 
: ! 
 I ,.'.' I 
0.6 
0.0 : li 
0.0 0.2 0.4 0.8 1.0 
Figure 3: Training process. The order parameters (10) as parametric curves (r, q) 
with the parameters a and a. The straight solid line corresponds to exhaustive 
learning, i.e. a = I (marks at a -- 0.1, 0.2,... 1.0). The dotted lines describe the 
training process for fixed c. Iterative training reduces the parameter a from c 
to 1. Examples for a - 0.1, 0.2, 0.3, 0.4, 0.9, 0.99 are given. The lower solid line is 
an optimized learning curve. To achieve this curve the value of a is chosen, which 
minimizes EG absolutely. Between c  0.5 and 0.7 the error Ec has two minima; 
the double-dashed and dash-dotted lines indicate the second, local minimum of Ec. 
Compare with Fig. 2, to see which is the absolute and which the local minimum of 
Eo. A naive early stopping procedure ends always in the minimum with the smaller 
q, since it is the first minimum during the training process (see simulation indicated 
with errorbars). 
4.3 Early Stopping 
Fig. 3 and Fig. 2 together indicate that an earlier stopping of the training process 
can avoid the overfitting. But in order to determine the stopping point one has 
to know the actual generalization error during the training. Cross-validation tries 
to provide an approximation for the real generalization error. The cross-validation 
error Ecv is defined like ET, see (2), on a set of examples, which are not used 
during the training. Here we calculate the optimum using the real generalization 
error, given by r and q, to determine the optimal point for early stopping. It is a 
lower bound for training with finite cross-validation sets. Some preliminary tests 
have shown that already small cross-validation sets approximate the real E quite 
well. Training is stopped, when EG increases. The resulting curve is given by the 
errorbars in Fig. 3. The errorbars indicate the standard deviation of a simulation 
with N -- 100 averaged over 50 trials. 
In Fig. 4 the same results are shown as learning curves EG(c). There one can see 
clearly that the early stopping strategy avoids the overfitting. 
5 Summary and outlook 
In this paper we have shown that overfitting can also emerge in realizable learning 
tasks. The calculation of a critical gain and the contour lines in Fig. 2 imply, that 
224 S. BI3S 
EG 
0.5 
0.4 
0.3 
0.2 
0.1 
exh. 
local min. - .... 
abs. min. 
local min. - ..... 
simulation   
0.0     
0.0 0.2 0.4 0.6 0.8 1.0 
P/N 
Figure 4: Learning curves corresponding to the parametric curves in Fig. 3. The 
upper solid line shows again exhaustive training. The optimized finite temperature 
curve is the lower solid line. From a = 0.6 exhaustive and optimal training lead to 
identical results (see marks). The simulation for early stopping (errorbars) finds the 
first minimum of Ec. 
the reason for the overfitting is the nonlinearity of the problem. The network adjusts 
slowly to the nonlinearity of the task. We have developed a method to avoid the 
overfitting, it can be interpreted in two ways. 
Training at a finite temperature reduces overfitting. It can be realized, if one 
trains with noisy examples. In the other interpretation one learns without noise, 
but stops the training earlier. The early stopping is guided by cross-validation. It 
was observed that early stopping is not completely simple, since it can lead to a 
local minimum of the generalization error. One should be aware of this possibility, 
before one applies early stopping. 
Since multilayer perceptrons are built of nonlinear perceptrons, the same effects 
are important for multilayer learning. A study with large scale simulations (Miiller 
et al. 1995) has shown that overfitting occurs also in realizable multilayer learning 
tasks. 
Acknowledgments 
I would like to thank S. Amari and M. Opper for stimulating discussions, and M. 
Herrmann for hints concerning the presentation. 
References 
S. BSs. (1995) Avoiding overfitting by finite temperature learning and cross- 
validation. International Conference on Artificial Neural Networks'95 Vol.2, p.111. 
S. BSs, W. Kinzel & M. Opper. (1993) Generalization ability of percepttons with 
continuous outputs. Phys. Rev. E 47:1384-1391. 
J. Hertz, A. Krogh & R. G. Palmer. (1991) Introduction to the Theory of Neural 
Computation. Reading: Addison-Wesley. 
K. R. Miiller, M. Finke, N. Murata, K. Schulten & S. Amari. (1995) On large scale 
simulations for learning curves, Neural Computation in press. 
