For valid generalization, the size of the 
weights is more important than the size 
of the network 
Peter L. Bartlett 
Department of Systems Engineering 
Research School of Information Sciences and Engineering 
Australian National University 
Canberra, 0200 Australia 
Peter.Bartlettllanu.edu.au 
Abstract 
This paper shows that if a large neural network is used for a pattern 
classification problem, and the learning algorithm finds a network 
with small weights that has small squared error on the training 
patterns, then the generalization performance depends on the size 
of the weights rather than the number of weights. More specifi- 
cally, consider an t-layer feed-forward network of sigmoid units, in 
which the sum of the magnitudes of the weights associated with 
each unit is bounded by A. The misclassification probability con- 
verges to an error estimate (that is closely related to squared error 
on the training set) at rate O((cA)t(t+l)/2v/(logn)/m ) ignoring 
log factors, where m is the number of training patterns, n is the 
input dimension, and c is a constant. This may explain the gen- 
eralization performance of neural networks, particularly when the 
number of training examples is considerably smaller than the num- 
ber of weights. It also supports heuristics (such as weight decay 
and early stopping) that attempt to keep the weights small during 
training. 
I Introduction 
Results from statistical learning theory give bounds on the number of training exam- 
ples that are necessary for satisfactory generalization performance in classification 
problems, in terms of the Vapnik-Chervonenkis dimension of the class of functions 
used by the learning system (see, for example, [13, 5]). Baum and Haussler [4] 
used these results to give sample size bounds for multi-layer threshold networks 
Generalization and the Size of the Weights in Neural Networks 135 
that grow at least as quickly as the number of weights (see also [7]). However, 
for pattern classification applications the VC-bounds seem loose; neural networks 
often perform successfully with training sets that are considerably smaller than the 
number of weights. This paper shows that for classification problems on which neu- 
ral networks perform well, if the weights are not too big, the size of the weights 
determines the generalization performance. 
In contrast with the function classes and algorithms considered in the VC-theory, 
neural networks used for binary classification problems have real-valued outputs, 
and learning algorithms typically attempt to minimize the squared error of the 
network output over a training set. As well as encouraging the correct classification, 
this tends to push the output away from zero and towards the target values of 
{-1, 1}. It is easy to see that if the total squared error of a hypothesis on m 
examples is no more than me, then on no more than me/(1 - )2 of these examples 
can the hypothesis have either the incorrect sign or magnitude less than a. 
The next section gives misclassification probability bounds for hypotheses that are 
"distinctly correct" in this way on most examples. These bounds are in terms of 
a scale-sensitive version of the VC-dimension, called the fat-shattering dimension. 
Section 3 gives bounds on this dimension for feedforward sigmoid networks, which 
imply the main results. The proofs are sketched in Section 4. Full proofs can be 
found in the full version [2]. 
2 Notation and bounds on misclassification probability 
Denote the space of input patterns by X. The space of labels is {-1, 1}. We 
assume that there is a probability distribution P on the product space X x {-1, 1}, 
that reflects both the relative frequency of different input patterns and the relative 
frequency of an expert's classification of those patterns. The learning algorithm 
uses a class of real-valued functions, called the hypothesis class H. An hypothesis h 
is correct on an example (x,y) if sgn(h(x)) = y, where sgn(c)' ll -+ {-1, 1} takes 
value i iff a >_ 0, so the misclassification probability (or error) of h is defined as 
erp(h) = P{(x,y)  X x {-1,1}'sgn(h(x))  y}. 
The crucial quantity determining misclassification probability is the fat-shattering 
dimension of the hypothesis class H. We say that a sequence x,..., xd of d points 
from X is shattered by H if functions in H can give all classifications of the sequence. 
That is, for all b = (bl,...,bin)  {-1, 1} m there is an h in H satisfyingsgn(h(xi)) = 
bi. The VC-dimension of H is defined as the size of the largest shattered sequence.  
For a given scale parameter 7 > 0, we say that a sequence x,..., xd of d points 
from X is 7-shattered by H if there is a sequence r,..., rd of real values such that 
for all b = (b,...,bm)  {-1, 1} m there is an h in H satisfying (h(xi)- ri)bi _ 7' 
The fat-shattering dimension of H at '7, denoted fatH(7), is the size of the largest 
y-shattered sequence. This dimension reflects the complexity of the functions in the 
class H, when examined at scale 7. Notice that fatH(7) is a nonincreasing function 
of 7. The following theorem gives generalization error bounds in terms of fatH (7). 
A related result, that applies to the case of no errors on the training set, will appear 
in [12]. 
Theorem I Define the input space X, hypothesis class H, and probability distri- 
bution P on X x {-1,1} as above. Let 0 < J < 1/2, and 0 < 7 < 1. Then, 
with probability 1 -  over the training sequence (x, y),..., (xm, ym) of rn labelled 
In fact, according to the usual definition, this is the VC-dimension of the class of 
thresholded versions of functions in H. 
136 P. L. Bartlett 
examples, every hypothesis h in H satisfies 
1 
erp(h) < --I{i'lh(x)l < 7 orsgn(h(xi))  Y}l + e(7, rn, J), 
m 
where 
e(7, re, J) = 2 (dln(50em/d) log(1250m) + ln(4/5)) 
-- , 
m 
and d-- fatH(7/16). 
(1) 
2.1 Comments 
It is informative to compare this result with the standard VC-bound. In that case, 
the bound on misclassification probability is 
1 (c )/2 
ere(h) < --I{i 'sgn(h(xi))  Yi}l+ (dlog(m/d) +log(i/J)) , 
m 
where d = VCdim(H) and c is a constant. We shall see in the next section that 
there are function classes H for which VCdim(H) is infinite but fatH(7) is finite 
for all 7 > 0; an example is the class of functions computed by any two-layer neu- 
ral network with an arbitrary number of parameters but constraints on the size of 
the parameters. It is known that if the learning algorithm and error estimates are 
constrained to make use of the sample only by considering the proportion of train- 
ing examples that hypotheses misclassify, there are distributions P for which the 
second term in the VC-bound above cannot be improved by more than log factors. 
Theorem i shows that it can be improved if the learning algorithm makes use of 
the sample by considering the proportion of training examples that are correctly 
classified and have Ih(x)l < 7. it is possible to give a lower bound (see the full 
paper [2]) which, for the function classes considered here, shows that Theorem 1 
also cannot be improved by more than log factors. 
The idea of using the magnitudes of the values of h(xi) to give a more precise 
estimate of the generalization performance was first proposed by Vapnik in [13] 
(and was further developed by Vapnik and co-workers). There it was used only for 
the case of linear hypothesis classes. Results in [13] give bounds on misclassification 
probability for a test sample, in terms of values of h on the training and test data. 
This result is extended in [11], to give bounds on misclassification probability (that 
is, for unseen data) in terms of the values of h on the training examples. This is 
further extended in [12] to more general function classes, to give error bounds that 
are applicable when there is a hypothesis with no errors on the training examples. 
Lugosi and Pintdr [9] have also obtained bounds on misclassification probability in 
terms of similar properties of the class of functions containing the true regression 
function (conditional expectation of y given x). However, their results do not extend 
to the case when the true regression function is not in the class of real-valued 
functions used by the estimator. 
It seems unnatural that the quantity 7 is specified in advance in Theorem 1, since 
it depends on the examples. The full paper [2] gives a similar result in which the 
statement is made uniform over all values of this quantity. 
3 The fat-shattering dimension of neural networks 
Bounds on the VC-dimension of various neural network classes have been established 
(see [10] for a review), but these are all at least linear in the number of parameters. 
In this section, we give bounds on the fat-shattering dimension for several neural 
network classes. 
Generalization and the Size of the Weights in Neural Networks 137 
We assume that the input space X is some subset of 1 n. Define a sigmoid unit 
as a function from 1 n to 1, parametrized by a vector of weights to 6 1 n. The 
unit computes z  rr(z  w), where rr is a fixed bounded function satisfying a 
Lipchitz condition. (For simplicity, we ignore the offset parameter. It is equivalent 
to including an extra input with a constant value.) A multi-layer feed-forward 
sigmoid network of depth  is a network of sigmoid units with a single output unit, 
which can be arranged in a layered structure with  layers, so that the output of 
a unit passes only to the inputs of units in later layers. We will consider networks 
in which the weights are bounded. The relevant norm is the 1 norm: for a vector 
w E 11 , define Iltollx - g Itoil. The following result gives a bound on the fat- 
shattering dimension of a (bounded) linear combination of real-valued functions, in 
terms of the fat-shattering dimension of the basis function class. We can apply this 
result in a recursive fashion to give bounds for single output feed-forward networks. 
Theorem 2 Let F be a class of functions that map from X to t-M/2, M/2], such 
that 0  F and, for all f in F, -f  F. For A > O, define the class H of 
weight-bounded linear combinations of functions from F as 
H = togf  k e N, f  F, lltoll i A . 
Suppose 7 > 0 is such that d = fatF(q`/(32A)) > 1. Then fatH(q`) < 
(cM2A2d/7 2) log2(MAd/q`), for some constant c. 
Gurvits and Koiran [6] have shown that the fat-shattering dimension of the class of 
two-layer networks with bounded output weights and linear threshold hidden units 
is O ((A2n2/q` ) log(n/q,)), when X = l n. As a special case, Theorem 2 improves 
this result. 
Notice that the fat-shattering dimension of a function class is not changed by more 
than a constant factor if we compose the functions with a fixed function satisfying 
a Lipschitz condition (like the standard sigmoid function). Also, for X = 11 ' and 
H = {x -> xi} we have fatu(7)  logn for all 7- Finally, for H = {x  w. x'w  
"} we have fat(7)  n for all 7. These observations, together with Theorem 2, 
give the following corollary. The (.) notation suppresses log factors. (Formally, 
f = (g) if f = o(g +) for all a > 0.) 
Corollary 3 If X  n and H is the class of two-layer sigmoid networks with the 
weights in the output unit satisfying Ilwlll 5 A, then fatH(7) = 6 (A2n/72). 
if x = {  ' IIll 5 B} ad the hidden unit weights are also bounded, then 
fatH(,) = O (BaA6(logn)/74). 
Applying Theorem 2 to this result gives the following result for deeper networks. 
Notice that there is no constraint on the number of hidden units in any layer, only 
on the total magnitude of the weights sociated with a processing unit. 
Corollary 4 For some constant c, if X  n and H is the class of depth g sigmoid 
networks in which the weight vector w associated with each unit beyond the first 
layer satisfies mlmll, 5 A, then fatH(7) = 6 (n(cA)t<t-x)/72<t-x)). 
f x = {  ' ml11 5 B} d the weights in the first layer units also satisfy 
Ilwllx 5 A, then ht(7) = 0 (B2(cA)t(t+)/72t logn). 
In the first part of this corollary, the network h fat-shattering dimension similar 
to the VC-dimension of a linear network. This formalizes the intuition that when 
the weights are small, the network operates in the "linear part" of the sigmoid, and 
so behaves like a linear network. 
138 P. L. Bartlett 
3.1 Comments 
Consider a depth  sigmoid network with bounded weights. The last corollary and 
Theorem i imply that if the training sample size grows roughly as B2At2/e2, then 
the misclassification probability of a network is within e of the proportion of training 
examples that the network classifies as "distinctly correct." 
These results give a plausible explanation for the generalization performance of 
neural networks. If, in applications, networks with many units have small weights 
and small squared error on the training examples, then the VC-dimension (and 
hence number of parameters) is not as important as the magnitude of the weights 
for generalization performance. 
It is possible to give a version of Theorem i in which the probability bound is 
uniform over all values of a complexity parameter indexing the function classes 
(using the same technique mentioned at the end of Section 2.1). For the case of 
sigmoid network classes, indexed by a weight bound, minimizing the resulting bound 
on misclassification probability is equivalent to minimizing the sum of a sample error 
term and a penalty term involving the weight bound. This supports the use of two 
popular heuristic techniques, weight decay and early stopping (see, for example, [8]), 
which aim to minimize squared error while maintaining small weights. 
These techniques give bounds on the fat-shattering dimension and hence generaliza- 
tion performance for any function class that can be expressed as a bounded number 
of compositions of either bounded-weight linear combinations or scalar Lipschitz 
functions with functions in a class that has finite fat-shattering dimension. This 
includes, for example, radial basis function networks. 
4 Proofs 
4.1 Proof sketch of Theorem I 
For A .C S, where (S, p) is a pseudometric space, a set T C S is an e-cover of A if 
for all a in A there is a t in T with p(t, a) < e. We define N'(A, e, p) as the size of the 
smallest e-cover of A. For x = (x,...,Xm)  X m, define the pseudometric 
on the set $ of functions defined on X by dt(r)(f,g) = max/If(xi)-g(xi)l. For a 
set A of functions, denote maxrEx. Af(A,e, dtoo(r)) by Afoo(A,e, rn). Alon et al. [1] 
have obtained the following bound on Afoo in terms of the fat-shattering dimension. 
Lemma 5 For a class F of functions that map from {1,..., n) to {1,..., b} with 
fatr(1) _ d, log2 Afoo(F, 2, n)< 1 + log2(nb2)log2 ('./a=o ()bi), provided that n _ 
1 + log 2 ('./a=o ()bi). 
For 7 > 0 define %  ]R -+ [-7, 7] as the piecewise-linear squashing function satis- 
fying % (c) = 7 if c _ 7, r (c) = -7 if c < -7, and % (c) = c otherwise. For a 
class H of real-valued functions, define % (H) as the set of compositions of r with 
functions in H. 
Lemma 6 For X, H, P, J, and 7 as in Theorem 1, 
-- rrt  
i i{i.lh(xi) I < 7 orsgn(h(xi))  Yi}l ) < d. 
m 
+ 
Generalization and the Size of the Weights in Neural Networks 139 
The proof of the lemma relies on the observation that 
pm {z . 3h E H, erp(h) > e + 1 i{i . lh(xi) 1< 7 or sgn(h(xi))  
We then use a standard symmetrization argument and the permutation argument 
introduced by Vapnik and Chervonenkis to bound this probability by the probability 
under a random permutation of a double-length sample that a related property 
holds. For any fixed sample, we can then use Pollard's approach of approximating 
the hypothesis class using a (7/2)-cover, except that in this case the appropriate 
cover is with respect to the oo pseudometric. Applying Hoeffding's inequality gives 
the lemma. 
To prove Theorem 1, we need to bound the covering numbers in terms of the fat- 
shattering dimension. It is easy to apply Lemma 5 to a quantized version of the 
function class to get such a bound, taking advantage of the range constraint imposed 
by the squashing function r. 
4.2 Proof sketch of Theorem 2 
For x = (Xl,... ,Zrn ) 6 X rn , define the pseudometric dtx() on the class of func- 
tions defined on X by dtx(a:)(f,g) =  ']im__l If(xi)- g(xi)l. Similarly, define 
dta(a:)(f,g) = ( ']im:l(f(xi)-g(xi))2) 1/2. If A is a set of functions defined on 
X, denote maxzcx. A/'(A,7, dtx(z)) by A/'i(A, 7, m), and similarly for A/'2(A, 7, m ). 
The idea of the proof of Theorem 2 is to first derive a general upper bound on an 
 covering number of the class H, and then apply the following result (which is 
implicit in the proof of Theorem 2 in [3]) to give a bound on the fat-shattering 
dimension. 
Lemma 7 For a class F of [0, 1]-valued functions on X satisfying fatr(47) _> d, 
we have log 2 Af (F, 7, d) _ d/32. 
To derive an upper bound on Aft(7, H, m), we start with the bound that Lemma 5 
implies on the too covering number A/'oo(F, 7, m) for the class F of hidden unit 
functions. Since dt(f,g) _< dt.(f,g), this implies the following bound on the t2 
covering number for F (provided m satisfies the condition required by Lemma 5, 
and it turns out that the theorem is trivial otherwise). 
(4ernMN (9rnM 2 ) 
log2Af2(F, 7, m ) < 1 +dlog 2 77 '] 1g2 7 2 ' (2) 
Next, we use the following result on approximation in 2, which A. Barron attributes 
to Maurey. 
Lemma 8 (Maurey) Suppose G is a Hilbert space and F C_ G has Ilfll < b for all 
f in F. Let f be an element from the convex closure of F. Then for all k _> 1 and all 
I[ II 
c > b2-11fll 2, there are functio,s {h,..., C F such that f - f, < 
This implies that any element of H can be approximated to a particular accuracy 
(with respect to 2) using a fixed linear combination of a small number of elements 
of F. It follows that we can construct an t2 cover of H from the 2 cover of F; using 
Lemma 8 and Inequality 2 shows that 
f 36rnMA 2 
72 ' 
140 P. L. Bartlett 
Now, Jensen's inequality implies that dt()(f, g) _< dta()(f, g), which gives a bound 
on Af (H, 7, m). Comparing with the lower bound given by Lemma 7 and solving 
for m gives the result. A more refined analysis for the neural network case involves 
bounding Af2 for successive layers, and solving to give a bound on the fat-shattering 
dimension of the network. 
Acknowledgement s 
Thanks to Andrew Barron, Jonathan Baxter, Mike Jordan, Adam Kowalczyk, Wee 
Sun Lee, Phil Long, John Shawe-Taylor, and Robert Slaviero for helpful discussions 
and comments. 
References 
[10] 
[11] 
[12] 
[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive di- 
mensions, uniform convergence, and learnability. In Proceedings of the 1993 
IEEE Symposium on Foundations of Computer Science. IEEE Press, 1993. 
[2] P. L. Bartlett. The sample complexity of pattern classification with neu- 
ral networks: the size of the weights is more important than the size 
of the network. Technical report, Department of Systems Engineering, 
Australian National University, 1996. (available by anonymous ftp from 
syseng. anu. edu. au: pub/peter/TR96d. ps). 
[3] P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for real- 
valued function classes. Technical report, Australian National University and 
Princeton University, 1996. 
[4] E. Baum and D. Haussler. What size net gives valid generalization? Neural 
Computation, 1(1):151-160, 1989. 
[5] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability 
and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929-965, 1989. 
[6] L. Gurvits and P. Koiran. Approximation and learning of convex superposi- 
tions. In Computational Learning Theory: EUROCOLT'95, 1995. 
[7] D. Haussler. Decision theoretic generalizations of the PAC model for neural 
net and other learning applications. Inform. Comput., 100(1):78-150, 1992. 
[8] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural 
Computation. Addison-Wesley, 1991. 
[9] G. Lugosi and M. Pint6r. A data-dependent skeleton estimate for learning. In 
Proc. 9th Annu. Conference on Comput. Learning Theory. ACM Press, New 
York, NY, 1996. 
W. Maass. Vapnik-Chervonenkis dimension of neural nets. In M. A. Arbib, 
editor, The Handbook of Brain Theory and Neural Networks, pages 1000-1003. 
MIT Press, Cambridge, 1995. 
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A frame- 
work for structural risk minimisation. In Proc. 9th Annu. Conference on Corn- 
put. Learning Theory. ACM Press, New York, NY, 1996. 
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural 
risk minimization over data-dependent hierarchies. Technical report, 1996. 
V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer- 
Verlag, New York, 1982. 
