MLP can provably generalise much better 
than VC-bounds indicate. 
A. Kowalczyk and H. Ferra
Telstra Research Laboratories 
770 Blackburn Road, Clayton, Vic. 3168, Australia 
({a.kowalczyk, h.ferra}@trl.oz.au)
Abstract 
Results of a study of the worst case learning curves for a particular
class of probability distributions on the input space of MLP with
hard threshold hidden units are presented. It is shown in particular
that, in the thermodynamic limit for scaling by the number
of connections to the first hidden layer, although the true learning
curve behaves as $\approx \alpha^{-1}$ for $\alpha \approx 1$, its VC-dimension based bound
is trivial (= 1) and its VC-entropy bound is trivial for $\alpha \le 6.2$. It
is also shown that bounds following the true learning curve can be
derived from a formalism based on the density of error patterns.
1 Introduction 
The VC-formalism and its extensions link the generalisation capabilities of a binary
valued neural network with its counting function¹, e.g. via upper bounds implied by
VC-dimension or VC-entropy on this function [17, 18]. For linear perceptrons the
counting function is constant for almost every selection of a fixed number of input
samples [2], and essentially equal to its upper bound determined by VC-dimension
and Sauer's Lemma. However, in the case of multilayer perceptrons (MLP) the
counting function depends essentially on the selected input samples. For instance,
it has been shown recently that for MLP with sigmoidal units, although the largest
number of input samples which can be shattered, i.e. VC-dimension, is $\Omega(w^2)$
[6], there is always a non-zero probability of finding a $(2w + 2)$-element input sample
which cannot be shattered, where $w$ is the number of weights in the network [16].
In the case of MLP using Heaviside rather than sigmoidal activations (McCulloch-Pitts
neurons), a similar claim can be made: VC-dimension is $\Omega(w_1 \log_2 \mathcal{H}_1)$ [13, 15],
where $w_1$ is the number of weights to the first hidden layer of $\mathcal{H}_1$ units, but there is
a non-zero probability of finding a sample of size $w_1 + 2$ which cannot be shattered
[7, 8]. The results on these "hard to shatter samples" for the two MLP types
differ significantly in terms of the techniques used for their derivation. For the sigmoidal
case the result is "existential" (based on recent advances in "model theory") while
in the Heaviside case the proofs are constructive, defining a class of probability
distributions from which "hard to shatter" samples can be drawn randomly; the
results in this case are also more explicit in that a form for the counting function
may be given [7, 8].

¹ Known also as the partition function in computational learning theory.
Can the existence of such hard to shatter samples be essential for the generalisation
capabilities of MLP? Can they be an essential factor for improvement of theoretical
models of generalisation? In this paper we show that at least for the McCulloch-Pitts
case with specific (continuous) probability distributions on the input space
the answer is "yes". We estimate "directly" the real learning curve in this case and
show that its bounds based on VC-dimension or VC-entropy are loose at low learning
sample regimes (for training samples having fewer than $12 \times w_1$ examples) even for
the linear perceptron. We also show that a modification to the VC-formalism given
in [9, 10] provides a significantly better bound. This latter part is a more rigorous
and formal extension and re-interpretation of some results in [11, 12]. All the results
are presented in the thermodynamic limit, i.e. for MLP with $w_1 \to \infty$ and training
sample size increasing proportionally, which simplifies their mathematical form.
2 Overview of the formalism 
On a sample space $X$ we consider a class $H$ of binary functions $h : X \to \{0, 1\}$
which we shall call a hypothesis space. Further we assume that there are given a
probability distribution $\mu$ on $X$ and a target concept $t : X \to \{0, 1\}$. The quadruple
$\mathcal{L} = (X, \mu, H, t)$ will be called a learning system.
In the usual way, with each hypothesis $h \in H$ we associate the generalization error
$\epsilon_h \stackrel{def}{=} E_X[|t(x) - h(x)|]$ and the training error
$\epsilon_{h,\bar{x}} \stackrel{def}{=} \frac{1}{m} \sum_{i=1}^{m} |t(x_i) - h(x_i)|$ for
any training m-sample $\bar{x} = (x_1, \ldots, x_m) \in X^m$.
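Given a learning threshold $0 \le \lambda \le 1$, let us introduce an auxiliary random variable
$\epsilon^{\max}_{\lambda}(\bar{x}) \stackrel{def}{=} \max\{\epsilon_h \,;\, h \in H \ \&\ \epsilon_{h,\bar{x}} \le \lambda\}$ for $\bar{x} \in X^m$, giving the worst
generalization error of all hypotheses with training error $\le \lambda$ on the m-sample $\bar{x} \in X^m$.²
The basic object of interest in this paper is the learning curve³ defined as
$$\epsilon^{wc}_{\lambda}(m) \stackrel{def}{=} E_{X^m}\left[\epsilon^{\max}_{\lambda}(\bar{x})\right].$$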
2.1 Thermodynamic limit 
Now we introduce the thermodynamic limit of the learning curve. The underlying
idea of such asymptotic analysis is to capture the essential features of learning
systems of very large size. Mathematically it turns out that in the thermodynamic
limit the functional forms of learning curves simplify significantly and analytic
characterizations of these are possible.

² In this paper max(S), where $S \subset \mathbb{R}$, denotes the maximal element in the closure of $S$,
or $\infty$ if no such element exists. Similarly, we understand min(S).
³ Note that our learning curve is determined by the worst generalisation error of acceptable
hypotheses and in this respect differs from "average generalisation error" learning
curves considered elsewhere, e.g. [3, 5].
We are given a sequence of learning systems (a learning sequence, for short)
$\mathcal{L}_N = (X_N, \mu_N, H_N, t_N)$, $N = 1, 2, \ldots$, and a scaling $N \mapsto \tau_N \in \mathbb{R}^+$
with the property $\tau_N \to \infty$; the scaling
can be thought of as a measure of the size (complexity) of a learning system, e.g.
VC-dimension of $H_N$. The thermodynamic limit of scaled learning curves is defined
for $\alpha > 0$ as follows⁴:
$$\epsilon^{wc}_{\lambda\infty}(\alpha) \stackrel{def}{=} \limsup_{N \to \infty} \epsilon^{wc}_{\lambda,N}(\lfloor \alpha\tau_N \rfloor). \tag{1}$$
Here, and below, the additional subscript N refers to the N-th learning system.
2.2 Error pattern density formalism 
This subsection briefly presents a thermodynamic version of a modified VC formalism
discussed previously in [9]; more details and proofs can be found in [10]. The
main innovation of this approach comes from splitting error patterns into error shells
and using estimates on the size of these error shells rather than the total number
of error patterns. We shall see in the examples discussed in the following section that
this improves results significantly.
The space $\{0, 1\}^m$ of all binary m-vectors naturally splits into $m + 1$ error pattern
shells $E^m_i$, $i = 0, 1, \ldots, m$, with the i-th shell composed of all vectors with exactly i
entries equal to 1. For each $h \in H$ and $\bar{x} = (x_1, \ldots, x_m) \in X^m$, let $\vec{v}_h(\bar{x}) \in \{0, 1\}^m$
denote a vector (error pattern) having 1 in the j-th position if and only if $h(x_j) \ne t(x_j)$.
As the i-th error shell has $\binom{m}{i}$ elements, the average error pattern density
falling into this error shell is
$$\Delta^m_i \stackrel{def}{=} \binom{m}{i}^{-1} E_{X^m}\left[\#\left(E^m_i \cap \{\vec{v}_h(\bar{x}) \,;\, h \in H\}\right)\right] \qquad (i = 0, 1, \ldots, m), \tag{2}$$
where # denotes the cardinality of a set⁵.
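As a minimal illustration of definition (2), the sketch below estimates error pattern densities for a toy class that is not taken from the paper: threshold functions h(x) = 1 iff x >= theta on [0, 1), learned against the constant target t = 0. All function names are hypothetical. For this class each shell contains exactly one realisable error pattern whatever the sample, so the density of the i-th shell should come out as 1 / C(m, i).

```python
import random
from math import comb

def error_pattern(h, t, sample):
    """Binary vector with 1 at position j iff h(x_j) != t(x_j)."""
    return tuple(int(h(x) != t(x)) for x in sample)

def shell_densities(hypotheses, t, sample):
    """Fraction of each error shell covered by realised error patterns."""
    m = len(sample)
    patterns = {error_pattern(h, t, sample) for h in hypotheses}
    counts = [0] * (m + 1)
    for p in patterns:
        counts[sum(p)] += 1  # sum(p) is the index of the shell containing p
    return [counts[i] / comb(m, i) for i in range(m + 1)]

def average_densities(make_hypotheses, t, m, trials, rng):
    """Monte Carlo average of shell densities, uniform distribution on [0, 1)."""
    totals = [0.0] * (m + 1)
    for _ in range(trials):
        sample = [rng.random() for _ in range(m)]
        dens = shell_densities(make_hypotheses(sample), t, sample)
        totals = [a + b for a, b in zip(totals, dens)]
    return [a / trials for a in totals]

def thresholds_for(sample):
    """Toy class (an assumption for illustration): h(x) = 1 iff x >= theta.
    One theta per gap between sorted sample points is enough to realise
    every dichotomy this class can produce on the sample."""
    pts = sorted(sample)
    cuts = [0.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [1.0]
    return [lambda x, th=th: int(x >= th) for th in cuts]

def target(x):
    """Constant target t = 0."""
    return 0

densities = average_densities(thresholds_for, target, m=5, trials=200,
                              rng=random.Random(0))
```

Richer classes fill the shells far less evenly, and it is this shell-by-shell information that the estimates on error shell sizes exploit.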
Theorem 1 Given a sequence of learning systems $\mathcal{L}_N = (X_N, \mu_N, H_N, t_N)$, a scaling
$\tau_N$ and a function $\varphi : \mathbb{R}^+ \times (0, 1) \to \mathbb{R}^+$ such that
$$\ln\left(\Delta^m_{i,N}\right) \le -\tau_N\, \varphi\left(\frac{m}{\tau_N}, \frac{i}{m}\right) + o(\tau_N), \tag{3}$$
for all $m, N = 1, 2, \ldots$, $0 < i < m$.
Then
$$\epsilon^{wc}_{\lambda\infty}(\alpha) \le \epsilon_{\lambda,\beta}(\alpha), \tag{4}$$
for any $0 \le \lambda < 1$ and $\alpha, \beta > 0$, where
$$\epsilon_{\lambda,\beta}(\alpha) \stackrel{def}{=} \max\left\{ \epsilon \in (0, 1) \,;\, \exists_{0 \le y \le \lambda,\ \epsilon \le x \le 1}\ \ \alpha\left(\mathcal{H}(y) + \beta\mathcal{H}(x)\right) - \varphi\left(\alpha + \alpha\beta, \frac{y + \beta x}{1 + \beta}\right) \ge 0 \right\}$$
and $\mathcal{H}(y) \stackrel{def}{=} -y \ln y - (1 - y) \ln(1 - y)$ denotes the entropy function.

⁴ We recall that $\lfloor x \rfloor$ denotes the largest integer $\le x$ and $\limsup_{N \to \infty} x_N$ is defined as
$\lim_{N \to \infty}$ of the monotonic sequence $N \mapsto \sup\{x_N, x_{N+1}, \ldots\}$. Note that in contrast to
the ordinary limit, lim sup always exists.
⁵ Note the difference to the concept of error shells used in [4], which are partitions of the
finite hypothesis space H according to the generalisation error values. Both formalisms
are related though, and the central result in [4], Theorem 4, can be derived from our
Theorem 1.
3 Main results: applications of the formalism 
3.1 VC-bounds 
We consider a learning sequence $\mathcal{L}_N = (X_N, \mu_N, H_N, t_N)$, $t_N \in H_N$ (realisable
case) and the scaling of this sequence by VC-dimension [17], i.e. we assume
$\tau_N = d_{VC}(H_N) \to \infty$. The following bound for the N-th learning system can be derived
for $\lambda = 0$ (consistent learning case) [1, 17]:
$$\epsilon^{wc}_{0,N}(m) \le \int_0^1 \min\left(1,\ 2 \cdot 2^{-m\epsilon/2} \left(\frac{2em}{d_{VC}(H_N)}\right)^{d_{VC}(H_N)}\right) d\epsilon. \tag{5}$$
In the thermodynamic limit, i.e. as $N \to \infty$, we get for any $\alpha > 1/e$
$$\epsilon^{wc}_{0\infty}(\alpha) \le \min\left(1,\ \frac{2 \log_2(2e\alpha)}{\alpha}\right). \tag{6}$$
Note that this bound is independent of the probability distributions $\mu_N$.
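To see where the limiting VC bound (6) starts to carry information, the following sketch evaluates min(1, 2 log2(2 e alpha) / alpha) and locates by bisection the scaled sample size at which it first drops below 1. The function names are illustrative, and the bisection assumes a single crossing in the chosen interval (the expression is decreasing for alpha > 1/2).

```python
from math import e, log2

def vc_bound(alpha):
    """Thermodynamic-limit VC bound (6): min(1, 2*log2(2*e*alpha)/alpha)."""
    if alpha <= 1 / e:
        raise ValueError("bound (6) is stated for alpha > 1/e")
    return min(1.0, 2 * log2(2 * e * alpha) / alpha)

def nontrivial_from(bound, lo, hi, tol=1e-9):
    """Bisection for the point where a bound first drops below 1.
    Assumes bound(lo) == 1, bound(hi) < 1 and a single crossing in between."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bound(mid) < 1.0:
            hi = mid
        else:
            lo = mid
    return hi

alpha_star = nontrivial_from(vc_bound, 1.0, 50.0)
```

The crossing lies close to alpha = 12, i.e. the bound is trivial until the sample holds roughly twelve times as many examples as the VC-dimension.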
3.2 Piecewise constant functions 
Let PC(d) denote the class of piecewise constant binary functions on the unit
segment [0, 1) with up to $d \ge 0$ discontinuities and with their values defined as
1 at all these discontinuity points. We consider here the learning sequence
$\mathcal{L}_N = ([0, 1), \mu_N, PC(d_N), t_N)$, where $\mu_N$ is any continuous probability distribution on
[0, 1), $d_N$ is a monotonic sequence of positive integers diverging to $\infty$, and the targets
$t_N \in PC(d_{t_N})$ are such that the limit $\delta_t \stackrel{def}{=} \lim_{N \to \infty} d_{t_N}/d_N$ exists. (Without loss of
generality we can assume that all $\mu_N$ are the uniform distribution on [0, 1).)
For this learning sequence the following can be established. 
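Claim 1. The function $\varphi(\alpha, x)$, defined for $\alpha > 1$ and $0 \le x \le 1$ by an explicit
expression [formula illegible in the source] and as 0 otherwise, satisfies assumption (3)
with respect to the scaling $\tau_N = d_N$.
Claim 2. The following two sided bound on the learning curve holds:
$$\frac{1}{2\alpha_-}\left(1 + \ln(2\alpha_-)\right) \le \epsilon^{wc}_{\lambda\infty}(\alpha) \le \frac{1}{2\alpha_+}\left(1 + \ln(2\alpha_+)\right) \tag{7}$$
for $\alpha > 1$, $0 \le \lambda \le 1$ and $0 \le \delta_t \le$ [upper limit illegible in the source], where
$\alpha_-$ and $\alpha_+$ are given by expressions that are likewise illegible in the source.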
We outline the main steps of proof of these two claims now. 
For Claim 1 we start with a combinatorial argument establishing that, in the
particular case of a constant target, $\Delta^m_{i,N}$ is given by a finite sum over $j \le d_N/2$
[expression illegible in the source] for $d_N + d_{t_N} < \min(2i, 2(m - i))$, and equals 1
otherwise.
Next we observe that the above sum admits an exponential estimate [expression
illegible in the source].
This easily gives Claim 1 for constant target ($\delta_t = 0$). Now we observe that this
particular case gives an upper bound for the general case (of non-constant target)
if we use the "effective" number of discontinuities $d_N + d_{t_N}$ instead of $d_N$.
For Claim 2 we start with the estimate [12, 11]
$$\frac{\lfloor d_N/2 \rfloor}{m+1}\left(1 + \sum_{j=\lfloor d_N/2 \rfloor + 1}^{m+1} \frac{1}{j}\right) \le \epsilon^{wc}_{0,N}(m) \le \frac{\lfloor d_N/2 \rfloor + 1}{m+1}\left(1 + \sum_{j=\lfloor d_N/2 \rfloor + 2}^{m+1} \frac{1}{j}\right)$$
derived from the Mauldon result [14] for the constant target $t_N = const$, $m \ge d_N$.
This implies immediately the expression
$$\epsilon^{wc}_{0\infty}(\alpha) = \frac{1}{2\alpha}\left(1 + \ln(2\alpha)\right) \tag{8}$$
for the constant target, which extends to the estimate (7) with a straightforward
lower and upper bound on the "effective" number of discontinuities in the case of a
non-constant target.
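The passage from the finite-sample estimate to the limit (8) can be checked numerically. The sketch below assumes the two-sided estimate has the form (k/(m+1)) (1 + sum_{j=k+1}^{m+1} 1/j) with k = floor(d/2) on the lower side and k = floor(d/2) + 1 on the upper side; this is the expected total length of the k largest of the m + 1 gaps left by m uniform points on [0, 1), the kind of quantity treated in Mauldon's work on random division of an interval. The function names and the chosen values of d and alpha are illustrative.

```python
from math import log

def k_largest_gaps_mean(m, k):
    """Expected total length of the k largest of the m + 1 gaps created by
    m independent uniform points: (k/(m+1)) * (1 + sum_{j=k+1}^{m+1} 1/j)."""
    n = m + 1
    return k / n * (1 + sum(1 / j for j in range(k + 1, n + 1)))

def finite_sample_estimate(m, d):
    """Assumed lower and upper estimates for constant target, consistent learning."""
    k = d // 2
    return k_largest_gaps_mean(m, k), k_largest_gaps_mean(m, k + 1)

def tc0(alpha):
    """Thermodynamic limit (8): (1 + ln(2*alpha)) / (2*alpha)."""
    return (1 + log(2 * alpha)) / (2 * alpha)

d, alpha = 2000, 3.0
lower, upper = finite_sample_estimate(int(alpha * d), d)
limit = tc0(alpha)
```

With d = 2000 both finite-sample values already agree with the limit to about three decimal places, which is the sense in which the estimate "implies immediately" expression (8).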
3.3 Link to multilayer perceptron 
Let $MLP^n(w_1)$ denote the class of functions from $\mathbb{R}^n$ to {0, 1} which can be
implemented by a multilayer perceptron (feedforward neural network) with $\ge 1$
hidden layers, with $w_1$ connections to the first hidden layer and the first hidden
layer composed entirely of fully connected, linear threshold logic units (i.e. units
able to implement any mapping of the form $(x_1, \ldots, x_n) \mapsto \theta(a_0 + \sum_{i=1}^{n} a_i x_i)$ for
$a_i \in \mathbb{R}$). It can be shown from the properties of the Vandermonde determinant (c.f.
[7, 8]) that if $f : [0, 1) \to \mathbb{R}^n$ is a mapping with coordinates composed of linearly
independent polynomials (generic situation) of degree $\le n$, then
$$PC(w_1) = f^{*}MLP^n(w_1) \stackrel{def}{=} \{h \circ f \,;\, h \in MLP^n(w_1)\}. \tag{9}$$
This implies immediately that all results for learning the class of PC functions in
Section 3.2 are applicable (with obvious modifications) to this class of multilayer
perceptrons with probability distribution concentrated on the 1-dimensional curves
of the form f([0, 1)) with f as above.
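The mechanism behind this link can be seen on a single unit: composed with a polynomial curve, a linear threshold unit becomes the sign of a polynomial of degree at most n in the curve parameter, hence a piecewise constant function with at most n discontinuities. The sketch below checks this on a grid. It assumes the particular curve f(x) = (x, x^2, ..., x^n) and the convention theta(s) = 1 iff s >= 0; both are illustrative choices, as are all function names.

```python
import random

def threshold_unit(a, point):
    """Linear threshold unit theta(a_0 + sum_i a_i * x_i), theta(s) = 1 iff s >= 0."""
    return int(a[0] + sum(ai * xi for ai, xi in zip(a[1:], point)) >= 0)

def moment_curve(x, n):
    """Assumed example of f: coordinates x, x^2, ..., x^n are linearly
    independent polynomials of degree <= n."""
    return [x ** k for k in range(1, n + 1)]

def count_jumps(values):
    """Number of positions where a 0/1 sequence changes value."""
    return sum(u != v for u, v in zip(values, values[1:]))

def jumps_along_curve(a, n, grid_size):
    """Jumps of x -> unit(f(x)) observed on a uniform grid of [0, 1)."""
    grid = (j / grid_size for j in range(grid_size))
    return count_jumps([threshold_unit(a, moment_curve(x, n)) for x in grid])

rng = random.Random(1)
n = 4
jump_counts = [
    jumps_along_curve([rng.uniform(-1, 1) for _ in range(n + 1)], n, grid_size=2000)
    for _ in range(50)
]
```

Every entry of `jump_counts` should be at most n, since each observed jump requires a root of the degree-n polynomial between two neighbouring grid points.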
[Figure 1: plot of learning curve estimates; horizontal axis: scaled training sample size ($\alpha$), 0.0 to 20.0; vertical axis: 0.0 to 1.0. Legend: VC Entropy; EPD, $\delta_t = 0.2$; EPD, $\delta_t = 0.0$; TC+, $\delta_t = 0.2$; TC0, $\delta_t = 0.0$; TC-, $\delta_t = 0.2$.]
Figure 1: Plots of different estimates for the thermodynamic limit of learning curves for
the sequence of multilayer perceptrons as in Claim 3 for consistent learning ($\lambda = 0$).
Estimates on the true learning curve from (7) are for $\delta_t = 0$ ('TC0') and $\delta_t = 0.2$ ('TC+'
and 'TC-' for the upper and lower bound, respectively). Two upper bounds of the
form (4) from the modified VC-formalism for $\varphi$ as in Claim 1 and $\beta = 1$ are plotted
for $\delta_t = 0.0$ and $\delta_t = 0.2$ (marked EPD). For comparison, we plot also the bound
(10) based on the VC-entropy; VC bound (5) being trivial for this scaling, $\equiv 1$, c.f.
Corollary 2, is not shown.

However, we can go a step further. We can extend such a distribution to a continuous
distribution on $\mathbb{R}^n$ with support "sufficiently close" to the curve f([0, 1)),
with changes to the error pattern densities $\Delta^m_{i,N}$, the learning curves, etc. as small
as desired. This observation implies the following result:
Claim 3 For any sequence of multilayer perceptrons, $MLP^{n_N}(w_{1N})$, $w_{1N} \to \infty$,
there exists a sequence of continuous probability distributions $\mu_N$ on
$\mathbb{R}^{n_N}$ with properties as follows. For any sequence of targets
$t_N \in MLP^{n_N}(w_{1t_N})$, both Claim 1 and Claim 2 of Section 3.2 hold for the learning
sequence $(\mathbb{R}^{n_N}, \mu_N, MLP^{n_N}(w_{1N}), t_N)$ with scaling $\tau_N \stackrel{def}{=} w_{1N}$ and
$\delta_t = \lim_{N \to \infty} w_{1t_N}/w_{1N}$. In particular bound (4) on the learning curve holds for $\varphi$
as in Claim 1.
Corollary 2 If additionally the number of units in the first hidden layer $\mathcal{H}_{1N} \to \infty$,
then the thermodynamic limit of VC-bound (5) with respect to the scaling $\tau_N = w_{1N}$
is trivial, i.e. = 1 for all $\alpha > 0$.
Proof. The bound (5) is trivial for $m \le 12 d_N$, where $d_N \stackrel{def}{=} d_{VC}(MLP^{n_N}(w_{1N}))$.
As $d_N = \Omega(w_{1N} \log_2(\mathcal{H}_{1N}))$ [13, 15] for any continuous probability on the input
space, this bound is trivial for any $\alpha = \frac{m}{w_{1N}} < \frac{12 d_N}{w_{1N}} \to \infty$ as $N \to \infty$. []
There is a possibility that VC dimension based bounds are applicable but fail to capture
the true behavior because of their independence from the distribution. One option
to remedy the situation is to try a distribution-specific estimate such as VC entropy
(i.e. the expectation of the logarithm of the counting function $\Pi_N(x_1, \ldots, x_m)$,
which is the number of dichotomies realised by the perceptron for the m-tuple
of input points [18]). However, in our case, $\Pi_N(x_1, \ldots, x_m)$ has the lower bound
$2 \sum_{i=0}^{\min(w_{1N}/2,\, m-1)} \binom{m-1}{i}$, for $x_1, \ldots, x_m$ in general position, which is virtually the
expression from Sauer's lemma with VC-dimension replaced by $w_{1N}/2$. Thus using
VC entropy instead of VC dimension (and Sauer's Lemma) we cannot hope for
a better result than bounds of the form (5) with $w_{1N}/2$ replacing VC-dimension,
resulting in the bound
$$\epsilon^{wc}_{0\infty}(\alpha) \le \min\left(1,\ \alpha^{-1} \log_2(4e\alpha)\right) \qquad (\alpha > 1/e) \tag{10}$$
in the thermodynamic limit with respect to the scaling $\tau_N = w_{1N}$. (Note that
more "optimistic" VC entropy based bounds can be obtained if a prior distribution
on the hypothesis space is given and taken into account [3].)
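For a rough sense of the gap between the VC-entropy bound and the true curve, the sketch below evaluates bound (10), read as min(1, log2(4 e alpha) / alpha), against the constant-target learning curve (8), (1 + ln(2 alpha)) / (2 alpha). The function names are illustrative, and the bisection assumes a single crossing of the level 1 inside the search interval.

```python
from math import e, log, log2

def vc_entropy_bound(alpha):
    """Bound (10): min(1, log2(4*e*alpha) / alpha)."""
    return min(1.0, log2(4 * e * alpha) / alpha)

def tc0(alpha):
    """Learning curve (8) for a constant target: (1 + ln(2*alpha)) / (2*alpha)."""
    return (1 + log(2 * alpha)) / (2 * alpha)

def first_nontrivial(bound, lo=1.0, hi=50.0, tol=1e-9):
    """Bisection for the point where the bound first drops below 1.
    Assumes bound(lo) == 1, bound(hi) < 1 and a single crossing in between."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bound(mid) < 1.0:
            hi = mid
        else:
            lo = mid
    return hi

alpha_star = first_nontrivial(vc_entropy_bound)
true_value_at_star = tc0(alpha_star)
```

At the point where (10) as written here first becomes informative (alpha close to 6), expression (8) has already fallen below 0.3.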
The plots of learning curves are shown in Figure 1. 
Acknowledgement. The permission of the Director of Telstra Research Laboratories
to publish this paper is gratefully acknowledged.
References 
[1] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the
Vapnik-Chervonenkis dimension. Journal of the ACM, 36:929-965, 1989.
[2] T.M. Cover. Geometrical and statistical properties of linear inequalities with
applications to pattern recognition. IEEE Trans. Elec. Comp., EC-14:326-334, 1965.
[3] D. Haussler, M. Kearns, and R. Schapire. Bounds on the Sample Complexity of Bayesian
Learning Using Information Theory and VC Dimension. Machine Learning, 14:83-113,
1994.
[4] D. Haussler, M. Kearns, H.S. Seung, and N. Tishby. Rigorous learning curve bounds
from statistical mechanics. In Proc. COLT'94, pages 76-87, 1994.
[5] S.B. Holden and M. Niranjan. On the Practical Applicability of VC Dimension
Bounds. Neural Computation, 7:1265-1288, 1995.
[6] P. Koiran and E.D. Sontag. Neural networks with quadratic VC-dimension. In Proc.
NIPS 8, pages 197-203, The MIT Press, Cambridge, Ma., 1996.
[7] A. Kowalczyk. Counting function theorem for multi-layer networks. In Proc. NIPS
6, pages 375-382. Morgan Kaufmann Publishers, Inc., 1994.
[8] A. Kowalczyk. Estimates of storage capacity of multi-layer perceptron with threshold
logic hidden units. Neural Networks, to appear.
[9] A. Kowalczyk and H. Ferra. Generalisation in feedforward networks. Proc. NIPS 6,
pages 215-222, The MIT Press, Cambridge, Ma., 1994.
[10] A. Kowalczyk. An asymptotic version of EPD-bounds on generalisation in learning
systems. 1996. Preprint.
[11] A. Kowalczyk, J. Szymanski, and R.C. Williamson. Learning curves from a modified
VC-formalism: a case study. In Proc. of ICNN'95, pages 2939-2943, IEEE, 1995.
[12] A. Kowalczyk, J. Szymanski, P.L. Bartlett, and R.C. Williamson. Examples of learning
curves from a modified VC-formalism. Proc. NIPS 8, pages 344-350, The MIT
Press, 1996.
[13] W. Maass. Neural Nets with superlinear VC-dimension. Neural Computation, 6:877-884,
1994.
[14] J.G. Mauldon. Random division of an interval. Proc. Cambridge Phil. Soc., 47:331- 
336, 1951. 
[15] A. Sakurai. Tighter bounds of the VC-dimension of three-layer networks. In Proc. of 
the 1993 World Congress on Neural Networks, 1993. 
[16] E. Sontag. Shattering all sets of k points in "general position" requires (k - 1)/2 
parameters. Report 96-01, Rutgers Center for Systems and Control, 1996. 
[17] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 
1982. 
[18] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995. 
