Generalisation in Feedforward Networks 
Adam Kowalczyk and Herman Ferra 
Telecom Australia, Research Laboratories 
770 Blackburn Road, Clayton, Vic. 3168, Australia 
(a.kowalczyk@trl.oz.au, h.ferra@trl.oz.au) 
Abstract 
We discuss a model of consistent learning with an additional re- 
striction on the probability distribution of training samples, the 
target concept and hypothesis class. We show that the model pro- 
vides a significant improvement on the upper bounds of sample 
complexity, i.e. the minimal number of random training samples 
allowing a selection of the hypothesis with a predefined accuracy 
and confidence. Further, we show that the model has the poten- 
tial for providing a finite sample complexity even in the case of 
infinite VC-dimension as well as for a sample complexity below 
VC-dimension. This is achieved by linking sample complexity to 
an "average" number of implementable dichotomies of a training 
sample rather than the maximal size of a shattered sample, i.e. 
VC-dimension. 
I Introduction 
A number of fundamental results in computational learning theory [1, 2, 11] links the 
generalisation error achievable by a set of hypotheses with its Vapnik-Chervonenkis 
dimension (VC-dimension, for short) which is a sort of capacity measure. They 
provide in particular some theoretical bounds on the sample complexity, i.e. a 
minimal number of training samples assuring the desired accuracy with the desired 
confidence. However there are a few obvious deficiencies in these results: (i) the 
sample complexity bounds are unrealistically high (c.f. Section 4.), and (ii) for 
some networks they do not hold at all since VC-dimension is infinite, e.g. some 
radial basis networks [7]. 
216 Adam Kowalczyk, Herman Ferra 
One may expect that there are at least three main reasons for this state of affairs: 
(a) that the VC-dimension is too crude a measure of capacity, (b) since the bounds 
are universal they may be forced too high by some malicious distributions, (c) 
that particular estimates themselves are too crude, and so might be improved with 
time. In this paper we will attack the problem along the lines of (a) and (b) since 
this is most promising. Indeed, even a rough analysis of some proofs of lower 
bound (e.g. [1]) shows that some of these estimates were determined by clever 
constructions of discrete, malicious distributions on sets of "shattered samples" (of 
the size of VC-dimension). Thus this does not necessarily imply that such bounds 
on the sample complexity are really tight in more realistic cases, e.g. continuous 
distributions and "non-malicious" target concepts, the point eagerly made by critics 
of the formalism. The problem is to find such restrictions on target concepts and 
probability distributions which will produce a significant improvement. The current 
paper discusses such a proposition which significantly improves the upper bounds 
on sample complexity. 
2 A Restricted Model of Consistent Learning 
First we introduce a few necessary concepts and some basic notation. We assume 
we are given a space of samples X with a probabilitll measure/, a set H of binary 
functions X - {0, 1} called the httpothesis space and a target concept t E H. For an 
mv  = (,...,)  X  ,d :   {0,} h wo ((),...,())  
{0, 1}  will be denoted by h(). We define two projections  and ,, of H onto 
{0, }'  fonow ()  
= h() = (h(z), ...,h(z,)) and t,(h) a, (1/_ hi) = 
[t- h[() for every h  H. Below we shall use the notation [S[ for the cardinality 
of the set S. The average density of the sets of projections (H) or t,(H) in 
{0, 1}  is defined as 
PrH(5) a, i(H)[/2, = 
(equivalently, this is the probability of a random vector in {0, 1}  belonging to the 
set (H)). Now we define two associated quantities: 
f = 2-' f (x) 
= m Pr(). 
We recall that dH ' max{r, ; 3x-l*r(H)[ -- 2 } is called the Vapnik- 
Chervonenkis dimension (VC-dimension) of H [1, 11]. If dH < oe then Sauer's 
lemma implies the estimates (c.f. [1, 2, 10]) 
15rH, m(r,) _< PrH, max(r,) <_ 2-n(dH, r,) _< 2 -n (er*/dH) rl" , (2) 
where (d, r,) a,__/y]/a:0 () (we assume () a,__/0 if i > 
Now we are ready to formulate our main assumption in the model. We say that the 
space of hypotheses H is (U', C)-uniform around t  2 x if for every set S C {0, 
Generalisation in Feedforward Networks 217 
The meaning of this condition is obvious: we postulate that on average the number 
of different projections *rt,z(h) of hypothesis h E H falling into $ has a bound 
proportional to the probability P-rs,(n) of random vector in {0, 1}  belonging to 
the set rt,z(//). Another heuristic interpretation of (3) is as follows. Imagine that 
elements of *rt,g(//) are almost uniformly distributed in {0, 1} , i.e. with average 
density pz  Clt,(H)l/2 . Thus the "mass" of the volume ISl i It,(H) Sl  
PzlSl and so its average f It,z(H)Sl(d) has the estimate  Isl p(d)  
ClSlrH,(). 
Of special interest is the particular case of consistent learning [1], i.e. when the 
target concept and the hypothesis fully agree on the training sample. In this case, 
for any  > 0 we introduce the notation 
x = 0 ) 
where er,z(h) a Ei I t- hl(i)/m and er,,(h) a f it - hl()(d) denote error 
rates on the training sample  = (,...,) and X, respectively. Thus Q(m) is 
the set of all samples for which there exists a hypothesis in H with no error on 
the sample and the error at least  on X. 
Theorem 1 If the hypothesis space H is (kt ra, C)-uniform around t  H then for 
any e > 8/rn 
(;) 
2 (4) 
<_ C'15rs,(2rn)(3/2) ra <_ C'(2ern/ds)ax(3/4) ra. [] (5) 
Proof of the theorem is given in the Appendix. 
Given ,5 > 0. The integer rni;(5,) a--/min{rn > 0; /'(Q) < 5} will be called 
the sample complezity following the terminology of computational learning theory 
(c.f. [1]). Note that in our case the sample complexity depends also (implicitly) on 
the target concept t, the hypothesis space H and the probability measure 
Corollary 2 If the hypothesis space H is (! n, C)-uniform around t  H for any 
n > O, then 
rni;(5,) _< maz{8/,min{rn; 2C15rs,(2rn)(3/2) ra < 5}} (6) 
C 
<_ maz{8/,6.9ds + 2.41og -}. [] (7) 
The estimate (7) of Corollary 2 reduces to the estimate 
mL(5, ) < 6.9 as 
independent of 5 and . This predicts that under the assumption of the corollary 
a transition to perfect generalisation occurs for training samples of size < 6.9 ds, 
which is in agreement with some statistical physics predictions showing such tran- 
sition occurring below m 1.Sds for some simple neural networks (c.f. [5]). 
Proof outline. Estimate (6) follows from the arst estimate (5). Estimate (7) can 
be derived from the second bound in (5) (virtually by repeating the proof of [1, 
Theorem 8.4.1] with substitution of 41og(4/3) for  and 5/ for 5). Q.E.D. 
218 Adam Kowalczyk, Herman Ferra 
10 200 
looo 
10 0 
10-oo 
10 '2 
10 
1 
10- 
8(m,0) d.= zo 
1 0 1 0 2 10 3 10 4 
m 
10'2 
10 '3oo 10 '3 
lO 
?:.:-:-:-:.:.:.:-:.:.:.:.:.:.:.:.:.: :::::::::::::::::::::::::::::::::::: 
i:!:i:i:i:i:!:i:i:i:i:i: 
10 2 10 3 1 
rn 
Figure 1: Plots of estimates on lrH,(m) and (m,)/C _ (m,O)/C = 
(3/2)2mlfirH,(2rn) (Fig. a) for the analytic threshold neuron (/SrH,(rn) = 
(dH,2m)(3/4) 2m and (Fig. b) for the abstract perceptton according to the es- 
timate (11) on PrH,(rn) /or rnz = 50, rn = 1000 and p = 0.015. The upper 
bound on the sample complexity, rnL(e, 5), corresponds to the abscissa value of the 
intersection of the curve (m, 0) with the level 5/C (c.. point A in Fig. b). In 
this manner for 5/C = 0.01 we obtain estimates rnL < 602 and rn < 1795 in the 
case of Fig. a for dH -- 100 and dH = 300 and rn <_ 697 in the case of Fig. b 
(dH = m -- 1000), respectively. 
3 An application to feedforward networks 
In this section we shall discuss the problem of estimation of 15rH,(rn) which is 
crucial for application of the above formalism. 
First we discuss an example of an anal!ltic threshold neuron on R  [6] when H is 
the family of all functions R  -* {0, 1}, z  0(E=  wiai(z)), where ai: R  -* R 
are fixed real analytic functions and 0 the ordinary hard threshold. In this case d 
equals the number of linearly independent functions among a, aa, ..., aa. For any 
continuous probability distribution t on R  we have: 
Pr(,) = (I,(d,m)/2 ' (Vm and VE E (R')  with probability 1), (8) 
and consequently 
15rH,(m) -- PrH, max(m) -- (I)(dH, m)/2 rn for every m. 
(9) 
Note that this class of neural networks includes as particular cases, the linear thresh- 
old neuron (if k -- n+ 1 and a, ..., an+ are chosen as 1, z, ..., zn) and higher order 
networks (if ai(z) are polynomials); in the former case (8) follows also from the 
classical result of T. Cover [3]. 
Now we discuss the more complex case of a linear threshold rnultila?ier perceptton 
on X = R , with H defined as the family of all functions R  -* {0, 1} that 
such an architecture may implement and p is any continuous probability measure 
Generalisation in Feedforward Networks 219 
on R . In this case there exist two functions, Bzo(rn) and Bp(m), such that 
Bzo (rn) _ PrH(,) _ B,,(rn) for any a? E (p,.n)rn with probability 1. In other words, 
Prs(.) takes values within a "bifurcation region" similar to the shaded region in 
Fig. 1.b. Further, it is known that B(rn) - 1 for rn C nh,1 4. 1, B,(rn) -- 1 for 
m _< ds and, in general, Bo, (rn) _ ff'(nhl 4. 1, rn) and B(m) _< (ds, m), where 
hi is the number of neurons in the first hidden layer. Given this we can say that 
15rs,(rn) takes values somewhere within the "bifurcation region". It is worth noting 
that the width of the "bifurcation region", which approximately equals 2(ds- nhl) 
(since it is known that values on the boundaries have a positive probability of being 
randomly attained) increases with increasing hi since [9] 
(nhl log2(hl)) -- maxp(n - p)(hl/2 - 2 p) _< ds. (10) 
p 
Estimate (6) is better than (7) in general, and in particular, if lSrs,(rn) "drops" 
to 0 much quicker than B,p - (ds, rn), we may expect that it will provide an 
estimate of sample complexity rnL even below ds; if 15rs,(rn) is close to B,p, then 
the difference between both estimates will be negligible. In order to clarify this 
issue we shall consider now a third, abstract example. 
We introduce the abstract perceptton defined as the set of hypotheses H on a 
probabilistic space (X, !) with the following property for a random rn 4. 1-tuple 
(,z) E X " x X: 
dvc(,,z) - dvc(,) 4. 1 with probability - 1 if dvc(,) < rnl and 
with probability p if rnl _< dvc (.) < rn2, and dvc (7., z) = dvc (), 
otherwise. 
Here 0 _ rnl _ rn2 _ oo are two (integer) constants, 0 _ p _ 1 is another constant 
and dvc(zl,...,z, 0 is the maximal n such that Ir(=,,...,=,)(H)[ - 2  for some 
1 _< il < ... < i _ rn. It can easily be seen that ds - rn2 in this case and that 
the threshold analytic neuron is a particular example of abstract perceptton (with 
p - 0 and rl - r - ds, Bo(m) -- B, - (ml,m)/2"). Note further, that 
if 0 < p < 1, then for any rn _ rnl, dvc(7.) - rnl with probability ) 0, and for 
any rn _ rn2, dye(S) - ds - rn2 with probability ) 0. In this regard the abstract 
perceptton resembles the linear threshold multilayer perceptton (with rnl and rn2 
corresponding to nhl 4- 1 and dH, respectively). However, the main advantage of 
this model is that we can derive the following estimate: 
15rs,(rn) _ 2_rn  rn-- rnl pm_m_i( 1 _ p)iff(min(rn-/,rn),rn) (11) 
i--0 
Using this estimate we find that for sufficiently low p (and sufficiently large rn2) 
the sample complexity upper bound (6) is determined by rnl and can even be lower 
than rn2 -_ ds (c.f. Figure 1.b). In particular, the sample complexity determined 
by Eqns. (6) and (11) can be finite even if ds - rn2 - oo (c.f. the curve (rn, O) 
for p -- .05 in Fig. 1.b which is the same for rn2 - 1000 and rn2 -- 
4 Discussion 
The paper strongly depends on the postulate (3) of (k , C)-uniformity. We admit 
that this is an ad hoc assumption here as we do not give examples when it is 
2 2 0 Adam Kowalczyk, Herman Ferra 
satisfied nor a method to determine the constant C. From this point of view our 
results at the current stage have no predictive power, perchaps only explanatory 
one. The paper should be viewed as an attempt in the direction to explain within 
VC-formalism some known generalisation properties of neural networks which are 
out of the reach of the formalism to date, such as the empirically observed peak 
generalisation for backpropagation network trained with samples of the size well 
below VC-dimension [8] or the phase transitions to perfect generalisation below 
1.5.xVC-dimension [5]. We see the formalism in this paper as one of a number 
of possible approaches in this direction. There are other possibilities here as well 
(e.g. [5, 12]) and in particular other, weaker versions of (t , C)-uniformity can be 
used leading to similar results. For instance in Theorem 1 and Corollary 2 it was 
enough to assume (p', C)-uniformity for a special class of sets S (S 
0,j , c.f. the 
Appendix); we intend to discuss other options in this regard on another occasion. 
Now we relate this research to some previous results (e.g. [2, 4]) which imply the 
following estimates on sample complexity (c.f. [1, Theorems 8.6.1-2]): 
(dH-1 ln(5)/e) < rnL(5'e) < [4 ( dillg212 
* _ -- + log2 , (12) 
where the lower bound is proved for all  < 1/8 and 5 <_ 1/100; here rn(5,) is 
the "universal" sample complexity, i.e. for all target concepts t and all probability 
distributions p. For  - 5 = 0.01 and dH >> 1 this estimate yields 3d < 
rn(.01, .01) < 4000d. These bounds should be compared against estimates of 
Corollary 2 of which (7) provides a much tighter upper bound, rnL(.01, .01) _< 6.9d, 
if the assumption on (t m, C)-uniformity of the hypothesis space around the target 
concept t is satisfied. 
5 Conclusions 
We have shown that under appropriate restriction on the probability distribution 
and target concept, the upper bound on sample complexity (and "perfect general- 
isation") can be lowered to m 6.9x VC-dimension, and in some cases even below 
VC-dimension (with a strong possibility that multilayer perceptton could be such). 
We showed that there are other parameters than VC-dimension potentially impact- 
ing on generalisation capabilities of neural networks. In particular we showed by 
example (abstract perceptton) that a system may have finite sample complexity 
and infinite VC dimension at the same time. 
The formalism of this paper predicts transition to perfect generalisation at relatively 
low training sample sizes but it is too crude to predict scaling laws for learning curves 
(c.f. [5, 12] and references in there). 
Acknowledgement. The permission of Managing Director, Research and Infor- 
mation Technology, Telecom Australia, to publish this paper is gratefully acknowl- 
edged. 
References 
[1] M. Anthony and N. Biggs. 
Computational Learning Theory. Cambridge Uni- 
Generalisation in Feedforward Networks 2 21 
[2] 
[3] 
[4] 
[51 
[6] 
[7] 
[8] 
[9] 
[lO] 
[11] 
[12] 
versity Press, 1992. 
A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and 
the Vapnik-Chervonenkis dimensions. Journal of the ACM, 36:929-965, (Oct. 
1989). 
T.M. Cover. Geometrical and statistical properties of linear inequalities with 
applications to pattern recognition. IEEE Trans. Elec. Comp., EC-14:326- 
334, 1965. 
A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound 
on the number of examples needed for learning. Information and Computation, 
82:247-261, 1989. 
D. Hausler, M. Kearns, H.S. Seung, and N. Tishby. Rigorous learning curve 
bounds from statistical mechanics. Technical report, 1994. 
A. Kowalc.yk. Separating capacity of analytic neuron. In Proc. ICNN'9J, 
Orlando, 1994. 
A. Macintyre and E. Sontag. Finitehess results for sigmoidal "neural" networks. 
In Proc. of the 5th Annual A CM Syrup. Theory of Comp., pages 325-334, 1993. 
G.L. Martin and J.A. Pitman. Recogni.ing handprinted letters and digits using 
backpropagation learning. Neural Cornput., 3:258-267, 1991. 
A. Sakurai. Tighter bounds of the VC-dimension of three-layer networks. In 
Proceedin#s of the 1993 World Con9ress on Neural Networks, 1993. 
N. Sauer. On the density of family of sets. Journal of Combinatorial Theory 
(Series A), 13:145-147, 1972). 
V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer- 
Verlag, 1982. 
V. Vapnik, E. Levin, and Y. Le Cun. Measuring the vc-dimension of a learning 
machine. Neural Computation, 6 (5):851-876, 1994). 
6 Appendix: 
Sketch of the proof of Theorem I 
We divide it into 
Stage 2. Now we use a combinatorial argument to estimate k2r(7 j). We consider 
the 2m-element commutative group Grn of transformations of X rn x X rn -. X 2rn 
generated by all "co-ordinate swaps" of the form 
,..., , y,..., y)  (, ...,-,,,+,...,,,...,,-,,+,...,), 
The proof is a modification of the proof of [1, Theorem 8.3.1]. 
three stages. 
Stage 1. Let 
7 j dd {(z,y) E X rn x X rn -. X 2rn; 3aerzh = 0 & e%h = j/m} (13) 
for j G {0, 1, ..., m}. Using a Chernoff bound on he "ail" of binomial distribution 
ifi can be shown [1, Lemma 8.3.2] ha for m  8/ 
(q()) 5 2  2() (4) 
  F/2l 
222 Adam Kowalczyk. Herman Ferra 
for 1 _< i _< m. We assume also that G, transforms {0, 1} 
in a similar fashion. Note that 
(,,(,)(h)) = (,,(,(h) (for  E 0n). (lS) 
As transformation a E G preserve the measure t  on X  x X  we obtain 
2( ) = IGI( ) =  f(ddx(a(,) 
a6G 
= f m(dZd  X(a(,). (16) 
a6G 
Let S0,? aj {: (x,)  {0,1} x {0,1}  ; x = 0 & IIAll: J} and 
s? 'J 4  40, 1} =; ll11 = J}, where ll11 '= x + '.. +  for any  - 
(,..., ,)6 {0,1} '. Then ISml: (?), a(S0,? ) c S?  for any a 6 G, and 
 = {(z, 3 E x  
Thus from Eqn. (16) we obtain 
2'( ) < f ,'(aa 
(17) 
  xr.() 
a6G, 
= f(.   
.(.)()ns 
= f u'(aal{,,(,)(n) a s?}l 2 -. 
Applying now the condition of (, C)-uniformity (Eqn. 3), Eqn. 17 and dividing 
by 2 m we ge 
Stage 3. On substitution of the above estimate into (14) we obtain estimate (4). 
 o dw () u oUw  r.X,] (?).- _< ( + /.). q... 
