Examples of learning curves from a modified 
VC-formalism. 
A. Kowalczyk & J. Szymafski 
Telstra Research Laboratories 
770 Blackburn Road, 
Clayton, Vic. 3168, Australia 
{ a. kowalczyk,j.szymanski }@trl. oz. au) 
P.L. Bartlett & R.C. Williamson 
Department of Systems Engineering 
Australian National University 
Canberra, ACT 0200, Australia 
{bartlett,williams }@syseng.anu.edu.au 
Abstract 
We examine the issue of evaluation of model specific parameters in a 
modified VC-formalism. Two examples are analyzed: the 2-dimensional 
homogeneous perceptron and the 1-dimensional higher order neuron. 
Both models are solved theoretically, and their leaming curves are com- 
pared against true learning curves. It is shown that the formalism has 
the potential to generate a variety of leaming curves, including ones 
displaying "phase transitions." 
1 Introduction 
One of the main criticisms of the Vapnik-Chervonenkis theory of leaming [15] is that the 
results of the theory appear very loose when compared with empirical data. In contrast, 
theory based on statistical physics ideas [1] provides tighter numerical results as well as 
qualitatively distinct predictions (such as "phase transitions" to perfect generalization). 
(See [5, 14] for a fuller discussion.) A question arises as to whether the VC-theory can 
be modified to give these improvements. The general direction of such a modification is 
obvious: one needs to sacrifice the universality of the VC-bounds and introduce model (e.g. 
distribution) dependent parameters. This obviously can be done in a variety of ways. Some 
specific examples are VC-entropy [15], empirical VC-dimensions [ 161, efficient complexity 
[17] or (, C)-uniformity [8, 9] in a VC-formalism with error shells. An extension of the 
last formalism is of central interest to this paper. It is based on a refinement of the 
"fundamental theorem of computational learning" [2] and its main innovation is to split the 
set of partitions of a training sample into separate "error shells", each composed of error 
vectors corresponding to the different error values. 
Such a split introduces a whole range of new parameters (the average number of pattems 
in each of a series of error shells) in addition to the VC dimension. The difficulty of 
determining these parameters then arises. There are some crude, "obvious" upper bounds 
Examples of Learning Curves from a Modified VC-formalism 345 
on them which lead to both the VC-based estimates [2, 3, 15] and the statistical physics 
based formalism (with phase transitions) [5] as specific cases of this novel theory. Thus 
there is an obvious potential for improvement of the theory with tighter bounds. In particular 
we find that the introduction of a single parameter (order of uniformity), which in a sense 
determines shifts in relative sizes of error shells, leads to a full family of shapes of leaming 
curves continuously ranging in behavior from decay proportional to the inverse of the 
training sample size to "phase transitions" (sudden drops) to perfect generalization in small 
training sample sizes. We present initial comparison of the leaming curves from this new 
formalism with "true" leaming curves for two simple neral networks. 
2 Overview of the formalism 
The presentation is set in the typical PAC-style; the notation follows [2]. We consider 
a space X of samples with a probability measure , a subspace// of binary functions 
X --,, {0, 1} (dichotomies) (called the hypothesis space) and a target hypothesis t  tt. 
For each h //andeach m-sample4 = (at, ..., a:,.,.,)  X "r' (m  {1, 2, ...}),we denoteby 
df 1 m 
eh,a =  Y4=i It-hl(a:i) the empiricalerror of h on , d by ea = fx [t-hJ()(dz) 
the exct error of h  H. 
For each m  {1, 2, ...} let us consider the rmdom variable 
hrH 
defined as the maximal expected error of an hypothesis h //consistent with t on 4. The 
learning curve oftt, defined as the expected value of e", ", 
Ex,[e = fx (4  X (2) 
is of central interest to us. Upper bounds on it can be derived from basic PAC-estimates as 
follows. For e > 0 we denote by H, '/'=/{h  H; ea > e} the subset of e-bad hypotheses 
and by 
Q ad {4  x '; 3e.. e, = o} = {4  x '; 3he- e, = o & e >_ e} (3) 
the subset of m-samples for which there exists an e-bad hypothesis consistent with the 
target t. 
Lemma I I'""(Q?) < b(e, rn.), then e(rn.) <_ fo  rnin(1, b(e, ra))!(de), and equality 
in the assumption implies equality in the conclusion. [] 
Proof outline. If the assumption holds, then 9(e, m) a,_.=/1 - rain(l, b(e, m)) is a lower 
bound on the cumulative distribution of the random variable (1). Thus Ex, [e "] _< 
fore 9(e, re)de and integration by parts yields the conclusion. 
Given 4: (z, ..., z,.,.,)  X "r', let us introduce the transformation (projection) -,.://--,, 
{0, 1} "r' allocating to each h //the vector 
- ..., - 
called the error pattern ofh on 4. For a subset G C H, let *rt,(G) = {*rt,.(h): h  G}. 
The space {0,1}  is the disjoint union of error shells  a {(,...,,)  
{0,1} ;  +--. +, = i} for/ = 0, 1,...,m, and I,.(H) g[ is the number 
346 A. KOWALCZYK, J. SZYMANSKI, P. L. BARTLETT, R. C. WILLIAMSON 
of different error pattems with i errors which can be obtained for h //. We shall employ 
the following notation for its average: 
(4) 
The central result of this paper, which gives a bound on the probability of the set Q as 
in Lemma 1 in terms of [//[, will be given now. It is obtained by modification of the 
proof of [8, Theorem 1] which is a refinement of the proof of the "fundamental theorem of 
computational leaming" in [2]. It is a simplified version (to the consistent learning case) of 
the basic estimate discussed in [9, 7]. 
Theorem 2 For any integer k _> 0 and 0 _< e, 7 <_ 1 
where A,,. r 
, (5) 
( (:) )-' 
= x-'L'rJ ei(1 - e) -j fork > 0 andA,o,.r 
d.f 1 - z..,j=o , 
Since error shells are disjoint we have the following relation: 
a.f 2_ /X 
I'z(n)l'(d.) = 2 -' Y] Iml? _< n.(,)/2 ' 
i=0 
(6) 
where r(h) a'd ro,(h), Inl? 'd Io17 and n.(,) 'd maxx. I'()1 is the 
growth function [2] of H. (Note that assuming that the target t _= 0 does not affect the 
cardinality of -t,(H).) If the VC-dimension of m, d = dvc(It), is finite, we have the 
well-known estimate [2] 
d 
j-O 
(7) 
Corollm'y 3 (i) If the VC-dimension d of H is finite and m > S/e, then p'(Qp) _< 
( ii) If H has finite cardinality, then p' ( q ) _< 5'h H, (1 -- eh )'. 
Proof. (i) Use the estimate A,,/2 _< 2 for k _> 8/e resulting from the Chemoffbound 
and set 'r = el2 and k = m in (5). (ii) Substitute the following crude estimate: 
I,17 _<  I.,lT' _<  I17 _< Pu < (emla) , 
i=0 i=0 
into the previous estimate. (iii) Set k = 0 into (i) and use the estimate 
_ = -- -- h) h'[-I 
The inequality in Corollary 3.i (ignoring the factor of 2) is the basic estimate of the VC- 
formalism (c.f. [2]); the inequality in Corollary 3.ii is the union bound which is the starting 
point for the statistical physics based formalism developed in [5]. In this sense both of 
these theories are unified in estimate (5) and all their conclusions (including the prediction 
Examples of Learning Curves from a Modified VC-formalism 347 
10  
10  
10 -t 
10 -z 10-z 
4 0 
(b) 
10 20 30 40 50 
m/d 
Figure 1' (a) Examples of upper bounds on the learning curves for the case of finite VC- 
dimension d = dvc(H) implied by Corollary 4.ii for C,o,,.,., = const. They split into five 
distinct "bands" of four curves each, according to the values of the order of uniformity w = 
2, 3, 4, 5, 10 (in the top-down order). Each band contains a solidline (Co,,.,.,  1, d = 100), 
a dotted line (Co,,.,., -- 100, d = 100), a chain line (C,o,, = 1, d = 1000) and a broken line 
(C,, ---- 100, d = 1000). 
(b) Various learning curves for the 2-dimensional homogeneous perceptron. Solid lines 
(top to bottom): (i) - for the VC-theory bound (Corollary 3.ii) with VC-dimension d = 2; 
(ii) - for the bound (for Eqn. 5 and Lemma 1) with 7 = ,/ = m and the upper bounds 
<_ I17 = 2 for i --- 1, ..., m- 1 and _< I17 - I for i = 0, m; (iii)- as in 
(ii) but with the exact values for as in (11); (iv) - true learning curve (Eqn. 13). The 
w-uniformity bound for w = 2 (with the minimal C,o,,r, satisfying (9), which turn out to be 
= const = 1) is shown by dotted line; for w = 3 the chain line gives the result for minimal 
C,o,, and the broken line for C,o,, set to 1. 
of phase transitions to perfect generalization for the Ising perceptron for a = raid < 1.448 
in the thermodynamic limit [5]) can be derived from this estimate, and possibly improved 
with the use of tighter estimates on I//,l. 
We now formally introduce a family of estimates on I//[? in order to discuss a potential 
of our formalism. For any m, e and w >_ 1.0 there exists C,o,, > 0 such that 
(for 0 < i < m). 
(8) 
We shall call such an estimate an w-uniformity bound. 
Corollary 4 (i) ff an w-uniformity bound (8) holds, en 
(9) 
(ii) if additionally d = dvc(H) < oo, then 
Z (2-m(2era/d)a) 
 I-1 (lO) 
3 Examples of learning curves 
In this section we evaluate the above formalism on two examples of simple neural networks. 
348 A. KOWALCZYK, J. SZYMANSKI, P. L. BARTLETT, R. C. WILLIAMSON 
10 0 
10 -! 
20 
...... ' ............... .7.....  ':: 
-,.co = 2: doLLed line 
co = 3 : chain line 
.... I .... I .... I .... ll,ll 0 
0 10 20 30 40 50 0 
m/(d + !) 
15 
I0 
/ . .............. = a 
, ,,/' , I .... J .... J .... J .... 
100 200 300 400 500 
m 
Figure 2: (a) Different learning curves for the higher order neuron (analogous to Fig. 1.b). 
Solid lines (top to bottom) (i) - for the VC-theory bound (Corollary 3.ii) with VC-dimension 
d + 1 = 21; (ii) - for the bound (5) with = e and the upper bounds IJl? _< IJl? with 
Il? given by (15); (iii) - true leaming curve (the upper bound given by (18)). The w- 
uniformity bound/approximation are plotted as chain and dotted lines for the mimmal 
satisfying (8), and as broken (long broken) line for C,o,, = const = 1 with w = 2 (w = 3). 
(b) Plots of the minimal value of C,,,,,, satisfying condition of w-umformity bound (8) for 
higher order neuron and selected values of w. 
3.1 2-dimensional homogeneous perceptron 
We consider X &-=/R: and H defined as the family of all functions (x, 2) -} 0(xwx + 
t:w:), where (w, w:)  R: and O(r) is defined as 1 if r >_ 0 and 0, otherwise, and the 
probability measure  on R: has rotational symmetry with respect to the origin. Fix an 
arbitrary target t //. In such a case 
2(1 - e)" - (1 - 2e) " (for i = 0 and 0 _< e _< 1/2), 
,, = 1 (for/= m), (11) 
i ("r')  (1- )"r'- (otherwise) 
2 E=o  
In pc we find at 1 for i: 0, m d = 2, otheise, d 
Pn(m)=JHl/2=(l+2+'..+2+l)/2=m/2 -. (12) 
i=0 
d e true leng cue is 
e(m) = 1.5(m+ 1) -x (13) 
e latter expression msffits from Lemma 1 d e uEity 
(={ 2(1-e)-(1-2e)  (for0<egl/2), 
' ' 2(1 e) (forl<el), (14) 
Different learning curves (bounds and approximations) for homogeneous perceptron are 
plotted in Figure 1.b. 
3.2 1-dimensional higher order neuron 
We consider X aL-/ [0, 1] C R with a continuous probability distribution/. Define the 
hypothesis space//C {0, 1} x as the set of all functions of the form Oop(z) where p is a 
Examples of Learning Curves from a Modified VC-formalism 349 
polynomial of degree < d on R. Let the target be constant, t = 1. It is easy to see that H 
restricted to a finite subset of [0, 1] is exactly the restriction of the family of all functions 
 C {0, 1}[ ,q with up to d"jumps" from 0 to 1 or 1 toOandthusdvc(H) = d+l. With 
probability 1 an m-sample  = (zx, ..., z,.,.,) from X "r' is such that zi 7  zj for i  j. For 
such a generic , I',(//) n 1 = const = IHl. This observation was used to derive 
the following relations for the computation of [HI : 
min(d,ra- 1) 
Il? - Iq(O>l? q-Iq(o>l_ i, (15) 
=0 
for 0 _< i _< m, where Ir() I, for  = 0, 1, ..., d, is defined as follows. We initialize 
= lfori: x,..., m-x, Iqr)lg = 0 
for i: 0, 1, ..., m, $ = 2, 3, ..., d, d en, remly, for $ ) 2 we set [(*)I aj 
-x I(-)1}_+ ifs is odd md I(O)lp  E5  I(-)1 ifs is even. 
(Here [(O)l is defined by e relation (4) wi e tget t  1 for the hypoesis space 
H (*) C  compos of fmchons having the vue 1 near 0 d exactly $ jmps in (0, 1), 
exactly at envies of ; similly  for H, [H(*)l = Ix,iH(o)  for a genehc 
m-sample   (0, 1).) 
yzing  embding of R into R a, d ing  argment based on the Vdermonde 
determint as in [6, 13], itc be prov at e paaitionction  is given by Cover's 
co, ting ction [4], d at 
ra d 
i=0 i=0 
/2 "r'. (16) 
For the uniform distribution on [0, 1] and a generic . E [0, 1] ' let ), (.) denote the sum of 
k largest segments of the partition of [0, 1] into m + 1 segments by the entries of .. Then 
La/2j ()  ()  La/2j+x(). (17) 
An explicit expression for the expected value of ), is known [11], thus a very tight bound 
on the true learning curve /(m) defined by (2) can be obtained: 
Ld/2J 1+  <e(m)< Ld/2j +1 +  . (18) 
m+l - - m+l 
j=Ld12J +1 j=Ld12J +2 
Numerical results are shown in Figure 2. 
4 Discussion and conclusions 
The basic estimate (5) of Theorem 1 has been used to produce upper bounds on the learning 
curve (via Lemma 1) in three different ways: (i) using the exact values of coefficients 
(Fig. la), (ii) using the estimate 1//17 -< I//17 and the values of I17 and (iii) 
using the w-uniformity bound (8) with minimal value of C,o,,.r, and as an "approximation" 
with C,o,,.,., - const = 1. Both examples of simple learning tasks considered in the paper 
allowed us to compare these results with the true leaming curves (or their tight bounds) 
which can serve as benchmarks. 
Figure 1.a implies that values of parameter w in the w-uniformity bound (approximation) 
governing a distribution of error patterns between different error shells (c.f. [10]) has a 
350 A. KOWALCZYK, J. SZYMANSKI, P. L. BARTLETT, R. C. WILLIAMSON 
significant impact on leaming curve shapes, changing from slow decrease to rapid jumps 
("phase transitions") in generalization. 
Figure 1.b proves that one loses tightness of the bound by using I.r-/l? rather than [H I, and 
even more is lost if w-uniformity bounds (with variable C,o,,) are employed. Inspecting 
Figures 1.b and 2.a we find that approximate approaches consisting of replacing Itl? 
by a simple estimate (w-uniformity) can produce learning curves very close to Irl?- 
leaming curves suggesting that an application of this formalism to leaming systems where 
neither IHl? nor Itl? can by calculated might be possible. This could lead to a sensible 
approximate theory capturing at least certain qualitative properties of leaming curves for 
more complex learning tasks. 
Generally, the results of this paper show that by incorporating the limited knowledge of the 
statistical distribution of error patterns in the sample space one can dramatically improve 
bounds on the learning curve with respect to the classical universal estimates of the VC- 
theory. This is particularly important for "practical" training sample sizes (m _< 12 x 
VC-dimension) where the VC-bounds are void. 
Acknowledgement. The permission of Director, Telstra Research Laboratories, to publish 
this paper is gratefully acknowledged. A.K. acknowledges the support of the Australian 
Research Council. 
References 
[13] 
[14] 
[151 
[161 
[171 
[1] S. Amari, N. Fujita, and S. Shinomoto. Four types of learning curves. Neural Computation, 
4(4):605-618, 1992. 
[2] M. Anthony and N. Biggs. ComputationalLearning Theory. Cambridge University Press, 1992. 
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the Vapnik- 
Chervonenkis dimensions. Journal of the ACM, 36:929-965, (Oct. 1989). 
[4] T.M. Cover. Geometrical and statistical properties of linear inequalities with applications to 
pattern recognition. IEEE Trans. Elec. Comp., EC-14:326-334, 1965. 
[5] D. Haussler, M. Kearns, H.S. Seung, and N. Tishby. Rigorous learning curve bounds from 
statistical mechanics. In Proc. 7th Ann. ACM Conf. on Comp. Learn. Theory, pages 76-87, 
1994. 
[6] A. Kowalczyk. Estimates of storage capacity of multi-layer perceptton with threshold logic 
hidden units. Neural Networks, to appear. 
[7] A. Kowalczyk. VC-formalism with explicit bounds on error shells size distribution. A 
manuscript, 1994. 
[8] A. Kowalczyk and H. Ferra. Generalisation in feedforward networks. Adv. in NIPS 7, The MIT 
Press, Cambridge, 1995. 
[9] A. Kowalczyk, J. Szymanski, and H. Ferra. Combining statistical physics with VC-bounds on 
generalisation in learning systems. In Proc. ACNN'95, Sydney, 1995. University of Sydney. 
[10] A. Kowalczyk, J. Szymanski, and R.C. Williamson. Learning curves from a modified vc- 
formalism: a case study. In Proceedings of ICNN'95, Perth (CD-ROM), volume VI, pages 
2939-2943, Rundle Mall, South Australia, 1995. IEEE/Causal Production. 
[11] J.G. Mauldon. Random division of an interval. Proc. Cambridge Phil. Soc., 47:331-336, 1951. 
[12] K.R. Muller, M. Finke, N. Murata, and S. Amari. On large scale simulations for learning curves. 
In Proc. ACNN'95, pages 45--48, Sydney, 1995. University of Sydney. 
A. Sakurai. n-h-1 networks store no less n Ji + 1 examples but sometimes no more. In 
Proceedings of the 1992 International Conference on Neural Networks, pages 111-936-111-941. 
IEEE, June 1992. 
H. Sompolinsky, H.S. Seung, and N. Tishby. Statistical mechanics of learning curves. Physical 
Reviews, A45:6056-6091, 1992. 
V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982. 
V. Vapnik, E. Levin, and Y. Le Cun. Measuring the VC-dimension of a learning machine. Neural 
Computation, 6 (5):851-876, 1994. 
C. Wang and S.S. Venkantesh. Temporal dynamics of generalisation in neural networks. Adv. 
in NIPS 7, The MIT Press, Cambridge, 1995. 
