The Effective Number of Parameters: 
An Analysis of Generalization and Regularization 
in Nonlinear Learning Systems 
John E. Moody 
Department of Computer Science, Yale University 
P.O. Box 2158 Yale Station, New Haven, CT 06520-2158 
Internet: moody@cs.yale.edu, Phone: (203)432-1200 
Abstract 
We present an analysis of how the generalization performance (expected 
test set error) relates to the expected training set error for nonlinear learn- 
ing systems, such as multilayer perceptrons and radial basis functions. The 
principal result is the following relationship (computed to second order) 
between the expected test set and training set errors: 
(1) 
2 is the effective noise 
Here, n is the size of the training sample , O'eff 
variance in the response variable(s),  is a regularization or weight decay 
parameter, and p,ij(A) is the effective number of parameters in the non- 
linear model. The expectations ( ) of training set and test set errors are 
taken over possible training sets  and training and test sets f t respec- 
tively. The effective number of parameters p,ij, (A) usually differs from the 
true number of model parameters p for nonlinear or regularized models; 
this theoretical conclusion is supported by Monte Carlo experiments. In 
addition to the surprising result that p,ii (A)  p, we propose an estimate 
of (1) called the generalized prediction error(GPE) which generalizes well 
established estimates of prediction risk such as Akaike's FPE and AIC, 
Mallows Cp, and Barron's PSE to the nonlinear setting. 1 
GPE and p,fi(A) were previously introduced in Moody (1991). 847 
848 Moody 
1 Background and Motivation 
Many of the nonlinear learning systems of current interest for adaptive control, 
adaptive signal processing, and time series prediction, are supervised learning sys- 
tems of the regression type. Understanding the relationship between generalization 
performance and training error and being able to estimate the generalization per- 
formance of such systems is of crucial importance. We will take the prediction risk 
(expected test set error) as our measure of generalization performance. 
2 Learning from Examples 
Consider a set of n real-valued input/output data pairs (n) = {i _ (xi, yi); i _ 
1,..., n} drawn from a stationary density E(). The observations can be viewed as 
being generated according to the signal plus noise model 2 
yi : It(x i) + ei (2) 
where yi is the observed response (dependent variabl. e), x i are the independent 
variables sampled with input probability density f(x), e' is independent, identically- 
distributed (iid) noise sampled with density (I)(e) having mean 0 and variance or2, 3 
and It(z) is the conditional mean, an 
perspective, the density E() - E(x 
components, the conditional density 
Z(x,y) : 
unknown function. From the signal plus noise 
, y) can be represented as the product of two 
 (yl x) and the input density 
 (ylz) 
q?(y - it(x)) f(x) (3) 
The learning problem is then to find an estimate fi(x) of the conditional mean It(x) 
on the basis of the training set (n). 
In many real world problems, few a priori assumptions can be made about the 
functional form of It(x). Since a parametric function class is usually not known, 
one must resort to a nonparametric regression approach, whereby one constructs an 
estimate fi(x) = f(x) for It(x) from a large class of functions . known to have good 
approximation properties (for example, . could be all possible radial basis func- 
tion networks and multilayer perceptrons). The class of approximation functions is 
usually the union of a countable set of subclasses (specific network architectures) 4 
,4 C - for which the elements of each subclass f(w, x)  ,4 are continuously 
parametrized by a set of p = p(.A) weights w = {w; c = 1,...,p}. The task of 
finding the estimate f(x) thus consists of two problems: choosing the best architec- 
h 
ture A and choosing the best set of weights  given the architecture. Note that in 
2The assumption of additive noise e which is independent of x is a standard assumption 
and is not overly restrictive. Many other conceivable signal/noise models can be trans- 
formed into this form. For example, the mnltiphcative model y = it(x)(1 + e) becomes 
t 
y : it'(x) + for the transformed variable y' = log(y). 
3Note that we have made only a minimal assumption about the noise e, that it is has 
finite variance a2 independent of x. Specifically, we do not need to make the assumption 
that the noise density (I)(e) is of known form (e.g. ganssian) for the following development. 
4For example, a "fully connected two layer perceptton with five internal units". 
The Ejectire Number of Parameters 849 
the nonparametric setting, there does not typically exist a function f(w*, z) 
with a finite number of parameters such that f(w*,z): p(z) for arbitrary p(z). 
For this reason, the estimators (z) = f(, z) will be biased estimators of 
The first problem (finding the architecture A) requires a search over possible ar- 
chitectures (e.9. network sizes and topologies), usually starting with small archi- 
tectures and then considering larger ones. By necessity, the search is not usually 
exhaustive and must use heuristics to reduce search complexity. (A heuristic search 
procedure for two layer networks is presented in Moody and Utans (1992).) 
The second problem (finding a good set of weights for f(w,z)) is accomplished by 
minimizing an objective function: 
x = argmin U(, w,(n)) (4) 
The objective function U consists of an error function plus a regularizer: 
u(, w, (.)) =. ,,.(w, (.)) +  S(w) () 
Here, the error S,.(w,(n)) menures the "distance" between the target response 
values y and the fitted values f(w, zi)  
  [v' f(w, )] , (o) 
,,.(w,(.)) =  , 
i=1 
and S(w) is a regularization or weight-decay function which bies the solution 
toward functions with a priori "desirable" characteristics, such as smoothness. The 
parameter  ) 0 is the regularization or weight decay parameter and must itself be 
optimizedfi 
The most familiar example of an objective function uses the squared erroff 
[u, f(w,)]: [u- f(w, )]  nd  wright dy tm: 
n p 
v(X,w,e(.)) = (-i(w,))+  (w ) (7) 
i=1 
The first term is the sum of squared errors (SSE) of the model f(w,z) with respect 
to the training data, while the second term penalizes either small, medium, or 
large weights, depending on the form of 9(w). Two common examples of weight 
decay hnctions are the ridge regression form 9(w ) = (w) 2 (which penalizes large 
a a 2 0 2 a 2 
wigh) nd h aumh fom a( ) = () /[( ) + () ] (which pni 
weights of intermediate values near w). 
 By biased, we mean that the mean squared bi is nonzero: MSB = f p(x)(((x)}- 
p())2d > 0. Here, p() is some positive weighting function on the input space and 
(} denotes an expected vMued taken over possible trning sets (n). For unbiedness 
(MSB = O) to occur, there must effist a set of weights w* such that 
and the learned weights  must be "close to" w*. For "near unbiedness", we must have 
w*  argminMSB(w) such that (MSB(w*)  O) and  "close to" w*. 
aThe optimization of A will be discussed in Moody (1992). 
 Other error functions, such  those used in generzed linear models (see for example 
McCuagh and Nelder 1983) or robust statistics (see for example Huber 1981) are more 
appropriate than the squared error if the noise is known to be non-gaussian or the data 
contMns many outers. 
8 50 Moody 
An example of a regularizer which is not explicitly a weight decay term is: 
S(w) - d:Cz(:)llO=f(w, x)11 = This is a smoothing term which penalizes functional fits with high curvature. 
(8) 
3 Prediction Risk 
With fi(x)= f(&[(n)],x)denoting an estimate of the true regression function/(x) 
trained on a data set (n), we wish to estimate the prediction risk P, which is the 
expected error of fi(x) in predicting future data. In principle, we can either define 
P for models fi(x) trained on arbitrary training sets of size n sampled from the 
unknown density (y[x)f(x) or for training sets of size n with input density equal 
to the empirical density defined by the single training set available: 
i2'() -- _1  5(- ') (9) 
For such training sets, the n inputs x i are held fixed, but the response variables yi 
are sampled with the conditional densities (y[z). Since () is known, but () 
is generally not known a priori, we adopt the latter approach. 
For a large ensemble of such training sets, the expected training set error is 8 
i:1 
1 -.[yi,f([{(n)],xi)] ' qqyilxi)dy i (10) 
i--1 
For a future exemplar (x,z) sampled with density q(zlx)(x), the prediction risk 
P is defined as: 
P=/[z,.f([{(n)],x)]q(zlx)(x)(Iq(yilxi)dyi}dzdx (11) 
i=1 
Again, however, we don't assume that (x) is known, so computing (11) is not 
possible. 
Following Akaike (1970), Barron (1984), and numerous other authors (see Eubank 
1988), we can define the prediction risk P as the expected test set error for test sets 
of size n '(n) = {i, _ (xi,zi); i = 1,...,n} having the empirical input density 
f'(x). This expected test set error has form: 
1  [zi ' f([(n)],xi)] (12) 
<test(,)){{, ---- i=1 , 
= /l[zi,f([(n)],xi)]{Iql(yi,xi)ql(zi,xi)dyidzi } 
i=1 i=1 
SFollowing the physics convention, we use angled brackets ( ) to denote expected values. 
The subscripts denote the random variables being integrated over. 
The Elective Number of Parameters 8 51 
We take (12) as a proxy for the true prediction risk P. 
In order to compute (12), it will not be necessary to know the precise functional 
form of the noise density (I)(e). Knowing just the noise variance cr 2 will enable an 
exact calculation for linear models trained with the SSE error and an approximate 
calculation correct to second order for general nonlinear models. The results of 
these calculations are presented in the next two sections. 
4 The Expected Test Set Error for Linear Models 
The relationship between expected training set and expected test set errors for linear 
models trained using the SSE error function with no regularizer is well known in 
statistics (Akaike 1970, Barron 1984, Eubank 1988). The exact relation for test and 
training sets with density (9): 
<,et)e' = <,ai)e + 2c?P (13) 
As pointed out by Barron (1984), (13) can also apply approximately to the case of 
a nonlinear model f(w, x) trained by minimizing the sum of squared errors SSE. 
This approximation can be arrived at in two ways. First, the model f(, x) can be 
treated as locally linear in a neighborhood of . This approximation ignores the 
hessian and higher order shape of f(w,x) in parameter space. Alternatively, the 
model f(w, x) can be assumed to be locally quadratic in parameter space w and 
unbiased. 
However, the extension of (13) as an approximate relation for nonlinear models 
breaks down if any of the following situations hold: 
The SSE error function is not used. (For example, one may use a robust error 
measure (Huber 1981) or log likelihood error measure instead.) 
A regularization term is included in the objective function. (This introduces bias.) 
The locally linear approximation for f(w, x) is not good. 
The unbiasedhess assumption for f(w, x) is incorrect. 
5 The Expected Test Set Error for Nonlinear Models 
For neural network models, which are typically nonparametric (thus biased) and 
highly nonlinear, a new relationship is needed to replace (13). We have derived 
such a result correct to second order for nonlinear models: 
+ 
(14) 
This result differs from the classical result (13) by the appearance of p,ff(A) (the 
effective number of parameters),  (the effective noise variance in the response 
Tef f 
variable(s)), and a dependence on A (the regularization or weight decay parameter). 
A full derivation of (14) will be presented in a longer paper (Moody 1992). The 
result is obtained by considering the noise terms e i for both the training and test 
852 Moody 
sets as perturbations to an idealized model fit to noise free data. The perturbative 
expansion is computed out to second order in the es subject to the constraint that 
the estimated weights G minimize the perturbed objective function. Computing 
expectation values and comparing the expansions for expected test and training 
errors yields (14). It is important to re-emphasize that deriving (14) does not 
require knowing the precise form of the noise density (I>(e). Only a knowledge of rr 2 
is assumed. 
The effective number of parameters Pelf(A) usually differs from the true number 
of model parameters p and depends upon the amount of model bias, model non- 
linearity, and on our prior model preferences (eg. smoothness) as determined by 
the regularization parameter A and the form of our regularizer. The precise form of 
Pef () is 
1 
peii() -- trG _--  ETiU;JToi , (15) 
where G is the generalized influence matrix which generalizes the standard influence 
or hat matrix of linear regression, T is the n x p matrix of derivatives of the training 
error function 
  ng(w,(n)) (16) 
Oy i Ow  , 
and U is the inverse of the hessian of the total objective function 
= 0 0 (17) 
Ow  OwO ' 
In the general case that cr2(x) varies with location in the input space x, the effective 
noise variance 2 is a weighted average of the noise variances cr2(xi). For the 
Te f f 
uniform signal plus noise model model we have described above,  = cr 2. 
Te f f 
6 The Effects of Regularization 
In the neural network community, the most commonly used regularizers are weight 
decay functions. The use of weight decay is motivated by the intuitive notion that 
it removes unnecessary weights from the model. An analysis of p,if () with weight 
decay ( > 0) confirms this intuitive notion. Furthermore, whenever cr 2 > 0 and 
n < cx), there exists some optirnal > 0 such that the expected test set error (12) is 
minimized. This is because weight decay methods yield models with lower model 
variance, even though they are biased. These effects will be discussed further in 
Moody (1992). 
For models trained with squared error (SSE) and quadratic weight decay g(w ) - 
(w) 2, the assumptions of unbiasedhess u or local linearizability lead to the following 
expression for p,ii() which we call the linearized effective number of parameters 
ptin(A)- E t _l_ A (18) 
Strictly speaking, a model with quadratic weight decay is unbiased only if the "true" 
weights are O. 
The Elective Number of Parameters 853 
Implied, Linearized, and Full P--effective 
Linearized 
! Fudl 
Weight Decay Parameter (Lambda) 
Figure 1' The full pe/y(A) (15) agrees with the implied pi,,p(A) (19) to within 
experimental error, whereas the linearized pu,(A) (18) does not (except for very 
large A). These results verify the significance of (14) and (15) for nonlinear models. 
Here, , is the a ta eigenvalue of the p x p matrix K = TIT, with T as defined in 
(16). 
The form of peyy (A) can be computed easily for other weight decay functions, such 
as the Rumelhart form !/(w ) = (w')2/[(w) v' + (w)2]. The basic result for all 
weight decay or regularization functions, however, is that pyy(A) is a decreasing 
function of A with pj,y(O) = p and p,j,y(c) = O, as is evident in the special case 
(18). If the model is nonlinear and biased, then p,H(O) generally differs from p. 
7 Testing the Theory 
To test the result (14)in a nonlinear context, we computed the full peri(A) (15), 
the linearized pu,(A) (18), and the implied number of parameters pi,,p(A) (19) for a 
nonlinear test problem. The value of Pimp(A) is obtained by computing the expected 
training and test errors for an ensemble of training sets of size n with known noise 
variance cr and solving for Peyy() in equation (14): 
2r ' 
(19) 
The ^s indicate Monte Carlo estimates based on computations using a finite ensem- 
ble (10 in our experiments) of training sets. The test problem was to fit training 
sets of size 50 generated as a sum of three sigmoids plus noise, with the noise sam- 
pled from the uniform density. The model architecture f(w, x) was also a sum of 
three sigmoids and the weights & were estimated by minimizing (7) with quadratic 
weight decay. See figure 1. 
854 Moody 
8 
GPE: An Estimate of Prediction Risk for Nonlinear 
Systems 
A number of well established, closely related criteria for estimating the prediction 
risk for linear or linearizable models are available. These include Akaike's FPE 
(1970), Akaike's AIC (1973) Mallow's Cio (1973), and Barron's PSE (1984). (See 
also Akaike (1974) and Eubank (1988).) These estimates are all based on equation 
(13). 
The generalized prediction error (GPE) generalizes the classical estimators FPE, 
AIC, Cp, and PSE to the nonlinear setting by estimating (14) as follows: 
The estimation process and the quality of the resulting GPE estimates will be 
described in greater detail elsewhere. 
Acknowledgements 
The author wishes to thank Andrew Barron and Joseph Chang for helpful conversations. 
This research was supported by AFOSR grant 89-0478 and ONR grant N00014-89-J-1228. 
References 
H. Akaike. (1970). Statistical predictor identification. Ann. Inst. Star. Math., 22:203. 
H. Akaike. (1973). Information theory and an extension of the maximum likelihood 
principle. In nd Intl. Syrup. on Information Theory, Akademia Kiado, Budapest, 267. 
H. Akaike. (1974). A new look at the statistical model identification. IEEE Trans. 
Auto. Control, 19:716-723. 
A. Barron. (1984). Predicted squared error: a criterion for automatic model selection. In 
Self-Organizing Methods in Modeling, S. Farlow, ed., Marcel Dekker, New York. 
R. Eubank. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, 
New York. 
P. J. Huber. (1981). Robust Statistics. Wiley, New York. 
C. L. Mallows. (1973). Some comments on Cp. Technometrics 15:661-675. 
P. McCullagh and J.A. Nelder. (1983). Generalized Linear Models. Chapman and Hall, 
New York. 
J. Moody. (1991). Note on Generalization, Regularization, and Architecture Selection 
in Nonhnear Learning Systems. In B.H. Juang, S.Y. Kung, and C.A. Kamm, editors, 
Neural Networks for Signal Processing, IEEE Press, Piscataway, NJ. 
J. Moody. (1992). Long version of this paper, in preparation. 
J. Moody and J. Utans. (1992). Principled architecture selection for neural networks: 
apphcation to corporate bond rating prediction. In this volume. 
