Estimating Average-Case Learning Curves 
Using Bayesian, Statistical Physics and 
C Dimension Methods 
David Haussler 
University of California 
Santa Cruz, California 
Michael Kearns* 
AT&T Bell Laboratories 
Murray Hill, New Jersey 
Manfred Opper 
Institut fiir Theoretische Physik 
Universitht Giessen, Germany 
Robert Schapire 
AT&T Bell Laboratories 
Murray Hill, New Jersey 
Abstract 
In this paper we investigate an average-case model of concept learning, and 
give results that place the popular statistical physics and VC dimension 
theories of learning curve behavior in a common framework. 
I INTRODUCTION 
In this paper we study a simple concept learning model in which the learner attempts 
to infer an unknown target concept f, chosen from a known concept class ' of {0, 1}- 
valued functions over an input space X. At each trial i, the learner is given a point 
xi  X and asked to predict the value of f(xi). If the learner predicts f(xi) 
incorrectly, we say the learner makes a mistake. After making its prediction, the 
learner is told the correct value. 
This simple theoretical paradigm applies to many areas of machine learning, includ- 
ing much of the research in neural networks. The quantity of fundamental interest 
in this setting is the learning curve, which is the function of rn defined as the prob- 
*Contact author. Address: AT&T Bell Laboratories, 600 Mountain Avenue, Room 
2A-423, Murray Hill, New Jersey 07974. Electronic mail: mkearns@research.att.com. 
855 
856 Haussler, Kearns, Opper, and Schapire 
ability the learning algorithm makes a mistake predicting f(Xm+l ), having already 
seen the examples (Xl, f(xl)), ..., (xm, f(xm)). 
In this paper we study learning curves in an average-case setting that admits a prior 
distribution over the concepts in '. We examine learning curve behavior for the 
optimal Bayes algorithm and for the related Gibbs algorithm that has been studied 
in statistical physics analyses of learning curve behavior. For both algorithms we 
give new upper and lower bounds on the learning curve in terms of the Shannon 
information gain. 
The main contribution of this research is in showing that the average-case or 
Bayesian model provides a unifying framework for the popular statistical physics 
and VC dimension theories of learning curves. By beginning in an average-case set- 
ting and deriving bounds in information-theoretic terms, we can gradually recover 
a worst-case theory by removing the averaging in favor of combinatorial parameters 
that upper bound certain expectations. 
Due to space limitations, the paper is technically dense and almost all derivations 
and proofs have been omitted. We strongly encourage the reader to refer to our 
longer and more complete versions [4, 6] for additional motivation and technical 
detail. 
2 NOTATIONAL CONVENTIONS 
Let X be a set called the instance space. A concept class :1: over X is a (possibly 
infinite) collection of subsets of X. We will find it convenient to view a concept 
f  ' as a function f: X - {0, 1}, where we interpret f(x) = 1 to mean that 
x 6 X is a positive example of f, and f(x) = 0 to mean x is a negative example. 
The symbols P and 73 are used to denote probability distributions. The distribution 
 is over ', and 73 is over X. When ' and X are countable we assume that these 
distributions are defined as probability mass functions. For uncountable ' and X 
they are assumed to be probability measures over some appropriate a-algebra. All 
of our results hold for both countable and uncountable ' and X. 
We use the notation Ele, [x(f)] for the expectation of the random variable X with 
respect to the distribution P, and Prle[cond(f)] for the probability with respect 
to the distribution P of the set of all f satisfying the predicate cond(f). Everything 
that needs to be measurable is assumed to be measurable. 
3 INFORMATION GAIN AND LEARNING 
Let . be a concept class over the instance space X. Fix a target concept f  :1: and 
an infinite sequence of instances x = Xl, ..., x,, x,+l,    with x,  X for all m. 
For now we assume that the fixed instance sequence x is known in advance to the 
learner, but that the target concept f is not. Let P be a probability distribution 
over the concept class '. We think of P in the Bayesian sense as representing the 
prior beliefs of the learner about which target concept it will be learning. 
In our setting, the learner receives information about f incrementally via the label 
Estimating Average-Case Learning Curves 857 
sequence f(x),...,f(x,),f(z,+), .... At time m, the learner receives the label 
j(x.). For any m >_ 1 we define (with respect to x, j) the ruth version space 
(,,,f) = {]  . ]()= f(),...,]()= f()} 
and the ruth volume V(x,f) = P[(x,f)]. We define 0(x,f) =  for all 
x and f, so Vf(x,f) = 1. The version space at time m is simply the cls of 
all concepts in W consistent with the first m labels of f (with respect to x), and 
the ruth volume is the meure of this cls under . For the first part of the 
paper, the infinite instance sequence x and the prior  are fixed; thus we simply 
write W(f) and V(f). Later, when the sequence x is chosen randomly, we will 
reintroduce this dependence explicitly. We adopt this notational practice of omitting 
any dependence on a fixed x in many other places  well. 
For each m  0 let us define the ruth posterior distribution (x, f) =  by 
restricting  to the ruth version space W(f); that is, for all (meurable) S C W, 
v[ = v[  ()]/v[(/)] = v[  (f)]/v(D. 
Having already seen f(x),..., f(x), how much information (suming the prior 
) does the learner expect to gain by seeing f(x+)? If we let +(x,f) (ab- 
breviated +(f) since x is fixed for now) be a random variable whose value is 
the (Shannon) information gained from /(x+i), then it can be shown that the 
expected information is 
rI[+i(f)] = rI -log  = rI[-logx+i(f)] (1) 
where we define the (m + 1)st volume ratio by  (x,f) X+i(f) - 
+1 ()/(). 
We now return to our learning problem, which we define to be that of predicting the 
label f(x+x) given only the previous labels f(xi),...,f(x). The first learning 
algorithm we consider is called the Bayes optimal classification algorithm, or the 
Bayes algorithm for short. For any m and b  {0, 1), define f(x, f) = W(f) = 
{f  W(x, f)' f(x+i)= b). Then the Bayes algorithm is: 
If 7.[1'1(f)] > 
If 7.[]'(f)] < 
If 7. []'l(f)] = 
(f)], predict f(x.+ ) = 1. 
(f)], predict f(Xm+l) -- 0. 
(f)], flip a fair coin to predict f(x.+l). 
It is well known that if the target concept f is drawn at random according to 
the prior distribution P, then the Bayes algorithm is optimal in the sense that it 
minimizes the probability that f(x,+l) is predicted incorrectly. Furthermore, if we 
let Bayes+ (x, f) (abbreviated Bayes+l (f) since x is fixed for now) be a random 
variable whose value is 1 if the Bayes algorithm predicts f(x,+i) correctly and 0 
otherwise, then it can be shown that the probability of a mistake for a random f is 
E.t[Bayesm+i(f)] = E.t [O ( -- Xm+i(f))] . (2) 
Despite the optimality of the Bayes algorithm, it suffers the drawback that its 
hypothesis at any time m may not be a member of the target class '. (Here we 
858 Haussler, Kearns, Opper, and Schapire 
define the hypothesis of an algorithm at time m to be the (possibly probabilistic) 
mapping f' X - {0, 1) obtained by letting ](x) be the prediction of the algorithm 
when x,,+ -- x.) This drawback is absent in our second learning algorithm, which 
we call the Gibbs algorithm [6]: 
Given f(xx),..., f(x,,), choose a hypothesis concept ] randomly from 
Given Xm+l, predict f(x,+x)= /(m+l). 
The Gibbs algorithm is the "zero-temperature" limit of the learning algorithm stud- 
ied in several recent papers [2, 3, 8, 9]. If we let Gibbs+i(x,f ) (abbreviated 
Gibbs+ x (f) since x is fixed for now) be a random variable whose value is 1 if the 
Gibbs algorithm predicts f(x,+) correctly and 0 otherwise, then it can be shown 
that the probability of a mistake for a random f is 
El'[Gibbs+l(f)] = Ele'[1 - Xm+(f)]. 
(3) 
Note that by the definition of the Gibbs algorithm, Equation (3) is exactly the 
average probability of mistake of a consistent hypothesis, using the distribution on 
' defined by the prior. Thus bounds on this expectation provide an interesting 
contrast to those obtained via VC dimension analysis, which always gives bounds 
on the probability of mistake of the worst consistent hypothesis. 
4 THE MAIN INEQUALITY 
In this section we state one of our main results: a chain of inequalities that upper 
and lower bounds the expected error for both the Bayes and Gibbs .algorithms 
by simple functions of the expected information gain. More precisely, using the 
characterizations of the expectations in terms of the volume ratio Xm+l (f) given 
by Equations (1), (2) and (3), we can prove the following, which we refer to as the 
main inequality: 
_ E.t,[Bayesm+l(f)] 
_< E.t,[Gibbsm+l(f)] 
1 
_  E.f7:' [m+l(f)]. (4) 
Here we have defined an inverse to the binary entropy function 7/(p) = -p log p- 
(1 - p)log(1 -p) by letting -i(q), for q  [0, 1], be the unique p 6 [0,1/2] such 
that 7-/(p) = q. Note that the bounds given depend on properties of the particular 
prior P, and on properties of the particular fixed sequence x. These upper and 
lower bounds are equal (and therefore tight) at both extremes Ele,[27,,+(f)] = 1 
(maximal information gain) and Ele,[27,,+(f)] = 0 (minimal information gain). 
To obtain a weaker but perhaps more convenient lower bound, it can also be shown 
that there is a constant co > 0 such that for all p > 0, -i(p) _ cop/log(2/p). 
Finally, if all that is wanted is a direct comparison of the performances of the Gibbs 
and Bayes algorithms, we can also show: 
El7v[Bayes,+l(f)] _< El7v[Gibbs,+(f)] _< 2El6,[Bayes,+(f)]. (5) 
Estimating Average-Case Learning Curves 859 
5 THE MAIN INEQUALITY: CUMULATIVE VERSION 
In this section we state a cumulative version of the main inequality: namely, bounds 
on the expected cumulative number of mistakes made in the first m trials (rather 
than just the instantaneous expectations). 
First, for the cumulative information gain, it can be shown that E]67 [Eim=i 27i(f)] = 
Efe, [- log V,,(f)]. This expression has a natural interpretation. The first m in- 
stances xl,..., xm of x induce a partition II(x) of the concept class ' defined 
by II(x) = II = {',,(x,f)  f 6 '}. Note that IIIl is always at most 2", 
but may be considerably smaller, depending on the interaction between . and 
xi,...,x,. It is clear that Efe[-logV,(f)] = -Y-en P[r] 1og7[r]. Thus the 
expected cumulative information gained from the labels of Xl,..., x,, is simply the 
entropy of the partition II under the distribution P. We shall denote this entropy 
by 7/ (II (x)) 7/ (x)  
= = 7/,. Now analogous to the main inequality for the 
instantaneous case (Inequality (4)), we can show: 
log(2rn/7) _< rn7 -1 7-/ _< EIsa, 
_< Sibbs(j)] < (6) 
i=1 Bayesi(f) 
I , 
Here we have applied the inequality 7/-(p) >_ cop/log(2/p) in order to give the 
lower bound in more convenient form. As in the instantaneous case, the upper 
and lower bounds here depend on properties of the particular P and x. When the 
cumulative information gain is maximum (7/ = m), the upper and lower bounds 
are tight. 
These bounds on learning performance in terms of a partition entropy are of special 
importance to us, since they will form the crucial link between the Bayesian setting 
and the Vapnik-Chervonenkis dimension theory. 
6 
MOVING TO A WORST-CASE THEORY: BOUNDING 
THE INFORMATION GAIN BY THE VC DIMENSION 
Although we have given upper bounds on the expected cumulative number of mis- 
takes for the Bayes and Gibbs algorithms in terms of 7/(x), we are still left with the 
problem of evaluating this entropy, or at least obtaining reasonable upper bounds 
on it. We can intuitively see that the "worst case" for learning occurs when the 
partition entropy 7(x) is as large as possible. In our context, the entropy is qual- 
itatively maximized when two conditions hold: (1) the instance sequence x induces 
a partition of ' that is the largest possible, and (2) the prior P gives equal weight 
to each element of this partition. 
In this section, we move away from our Bayesian average-case setting to obtain 
worst-case bounds by formalizing these two conditions in terms of combinatorial 
parameters depending only on the concept class .. In doing so, we form the link 
between the theory developed so far and the VC dimension theory. 
860 Haussler, Kearns, Opper, and Schapire 
The second of the two conditions above is easily quantified. Since the entropy of 
a partition is at most the logarithm of the number of classes in it, a trivial upper 
bound on the entropy which holds for all priors P is /(x) _ loglII(x)l. VC 
dimension theory provides an upper bound on log In(x)l as follows. 
For any sequence x = xi, x2,... of instances and for m _ 1, let dim,(', x) denote 
the largest d _ 0 such that there exists a subsequence xi,..., xia of Xl,. ., x,, with 
IH((xi,..., xia))l = 2a; that is, for every possible labeling of xi,..., xia there is 
some target concept in ' that gives this labeling. The Vapnik-Chervonenkis (VC) 
dimension of ' is defined by dim(') = max{dim,(', x)  m _ 1 and xi,x2,...  
X}. It can be shown [7, 10] that for all x and m _ d _ 1, 
m 
log IH(x)l _ (1 + o(1)) dim,(, x) log dim,(', x) (7) 
where o(1) is a quantity that goes to zero as a = m/dim,,(', x) goes to infinity. 
In all of our discussions so far, we have assumed that the instance sequence x is 
fixed in advance, but that the target concept f is drawn randomly according to P. 
We now move to the completely probabilistic model, in which f is drawn according 
to P, and each instance x, in the sequence x is drawn randomly and independently 
according to a distribution 7) over the instance space X (this infinite sequence of 
draws from D will be denoted x  D*). Under these assumptions, it follows from 
Inequalities (6) and (7), and the observation above that 7/(x) _ log III(x)l that 
for any  and any D, 
1 
_ Exev. [log [II(x)[] 
' din (., x) log m ] 
_ (1 + o(1))Exev. 2 dim.(',x) 
o(1)) dir m 
_ (1 + ') log dim(.)' (8) 
The expectation Exv* [log II(x)l] is the VC entropy defined by Vapnik and Cher- 
vonenkis in their seminal paper on uniform convergence [11]. 
In terms of instantaneous mistake bounds, using more sophisticated techniques [4], 
we can show that for any/ and any D, 
[ dim_' x)] dim() 
Elep,xev* [Bayesm(x, f)] _ Exev* -- ' _ (9) 
m 
[2 dim. (', x)] 2 dim(') 
gl,xv.[Gibbs,(x,f)] _ gxv. _ (10) 
m m 
Haussler, Littlestone and Warmuth [5] construct specific D,/ and ' for which the 
last bound given by Inequality (8) is tight to within a factor of 1/ln(2)  1.44; thus 
this bound cannot be improved by more than this factor in general/ Similarly, the 
It follows that the expected total number of mistakes of the Bayes and the Gibbs 
algorithms differ by a factor of at most about 1.44 in each of these cases; this was not 
previously known. 
Estimating Average-Case Learning Curves 861 
bound given by Inequality (9) cannot be improved by more than a factor of 2 in 
general. 
For specific D, 79 and .T', however, it is possible to improve the general bounds 
given in Inequalities (8), (9) and (10) by more than the factors indicated above. 
We calculate the instantaneous mistake bounds for the Bayes and Gibbs algorithms 
in the natural case that ' is the set of homogeneous linear threshold functions 
on R e and both the distribution D and the prior 79 on possible target concepts 
(represented also by vectors in R e ) are uniform on the unit sphere in R e . This 
class has VC dimension d. In this case, under certain reasonable assumptions used 
in statistical mechanics, it can be shown that for m >> d >> 1, 
0.44d 
E.te,,xev.[Bayesm(x,.f)]  
m 
(compared with the upper bound of dim given by Inequality (9) for any class of 
VC dimension d) and 
0.62d 
E.te,,xev.[Gibbsm(x,f)]  
(compared with the upper bound of 2d/m in Inequality (10)). The ratio of these 
asymptotic bounds is x/. We can also show that this performance advantage of 
Bayes over Gibbs is quite robust even when 79 and/) vary, and there is noise in the 
examples [6]. 
7 OTHER RESULTS AND CONCLUSIONS 
We have a number of other results, and briefly describe here one that may be of 
particular interest to neural network researchers. In the case that the class .T' has 
infinite VC dimension (for instance, if .T' is the class of all multi-layer perceptrons 
of finite size), we can still obtain bounds on the number of cumulative mistakes by 
decomposing ' into ', '2,...,-i,  ., where each 'i has finite VC dimension, and 
by decomposing the prior 7 9 over . as a linear sum 79 -- Y4= rti79i, where each 79i 
is an arbitrary prior over -i, and i=l rti = 1. A typical decomposition might let 
.T'i be all multi-layer perceptrons of a given architecture with at most i weights, in 
which case di = O(ilog i) [1]. Here we can show an upper bound on the cumulative 
mistakes during the first m examples of roughly 7/{rti } + [Y4__x rtid] log m for both 
the Bayes and Gibbs algorithms, where 7/{eta} = - i__ rti logrti. The quantity 
i__ rtidi plays the role of an "effective VC dimension" relative to the prior weights 
{rti}. In the case that x is also chosen randomly, we can bound the probability of 
mistake on the ruth trial by roughly (7/{cti} + [Zi__ rtidi] log m). 
In our current research we are working on extending the basic theory presented 
here to the problems of learning with noise (see Opper and Haussler [6]), learning 
multi-valued functions, and learning with other loss functions. 
Perhaps the most important general conclusion to be drawn from the work pre- 
sented here is that the various theories of learning curves based on diverse ideas 
from information theory, statistical physics and the VC dimension are all in fact 
closely related, and can be naturally and beneficially placed in a common Bayesian 
framework. 
862 Haussler, Kearns, Opper, and Schapire 
Acknowledgements 
We are greatly indebted to Ron Rivest for his valuable suggestions and guidance, 
and to Sara Solla and Naftali Tishby for insightful ideas in the early stages of this 
investigation. We also thank Andrew Barron, Andy Kahn, Nick Littlestone, Phil 
Long, Terry Sejnowski and Haim Sompolinsky for stimulating discussions on these 
topics. This research was supported by ONR grant N00014-91-J-1162, AFOSR 
grant AFOSR-89-0506, ARO grant DAAL03-86-K-0171, DARPA contract N00014- 
89-;1-1988, and a grant from the Siemens Corporation. This research was conducted 
in part while M. Kearns was at the M.I.T. Laboratory for Computer Science and the 
International Computer Science Institute, and while R. Schapire was at the M.I.T. 
Laboratory for Computer Science and Harvard University. 
References 
[10] 
[11] 
[1] E. Baum and D. Haussler. What size net gives valid generalization? Neural 
Computation, 1(1):151-160, 1989. 
[2] J. Denker, D. Schwartz, B. Wittner, S. Solla, R. Howard, L. Jackel, and J. Hop- 
field. Automatic learning, rule extraction and generalization. Complex Systems, 
1:877-922, 1987. 
[3] G. Gy6rgi and N. Tishby. Statistical theory of learning a rule. In Neural 
Networks and Spin Glasses. World Scientific, 1990. 
[4] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complex- 
ity of Bayesian learning using information theory and the VC dimension. In 
Computational Learning Theory: Proceedings of the Fourth Annual Workshop. 
Morgan Kaufmann, 1991. 
[5] D. Haussler, N. Littlestone, and M. Warmuth. Predicting (0, 1)-functions on 
randomly drawn points. Technical Report UCSC-CRL-90-54, University of 
California Santa Cruz, Computer Research Laboratory, Dec. 1990. 
[6] M. Opper and D. Haussler. Calculation of the learning curve of Bayes optimal 
classification algorithm for learning a perceptton with noise. In Computational 
Learning Theory: Proceedings of the Fourth Annual Workshop. Morgan Kauf- 
mann, 1991. 
[7] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory 
(Series A), 13:145-147, 1972. 
[8] H. Sompolinsky, N. Tishby, and H. Seung. Learning from examples in large 
neural networks. Physics Review Letters, 65:1683-1686, 1990. 
[9] N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in 
layered networks: predictions and generalizations. In IJCNN International 
Joint Conference on Neural Networks, volume II, pages 403-409. IEEE, 1989. 
V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer- 
Verlag, New York, 1982. 
V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of rela- 
tive frequencies of events to their probabilities. Theory of Probability and its 
Applications, 16(2):264-80, 1971. 
