Generalization in decision trees and DNF: 
Does size matter? 
Mostefa Golea l, Peter L. Bartlettl% Wee Sun Lee 2 nd Llew Mason 1 
Department of Systems Engineering 
Research School of Information 
Sciences and Engineering 
Australian National University 
Canberra, ACT, 0200, Australia 
2 School of Electrical Engineering 
University College UNSW 
Australian Defence Force Academy 
Canberra, ACT, 2600, Australia 
Abstract 
Recent theoretical results for pattern classification with thresh- 
olded real-valued functions (such as support vector machines, sig- 
moid networks, and boosting) give bounds on misclassification 
probability that do not depend on the size of the classifier, and 
hence can be considerably smaller than the bounds that follow from 
the VC theory. In this paper, we show that these techniques can 
be more widely applied, by representing other boolean functions 
as two-layer neural networks (thresholded convex combinations of 
boolean functions). For example, we show that with high probabil- 
ity any decision tree of depth no more than d that is consistent with 
m training examples has misclassification probability no more than 
( 1 (Neff VCdim(L/) log 2 mlogd)) 1/2) where/g is the class of 
node decision functions, and Neff _ N can be thought of as the 
effective number of leaves (it becomes small as the distribution on 
the leaves induced by the training data gets far from uniform). 
This bound is qualitatively different from the VC bound and can 
be considerably smaller. 
We use the same technique to give similar results for DNF formulae. 
* Author to whom correspondence should be addressed 
260 M. Gotea, P Bartlett, W. S. Lee and L Mason 
1 INTRODUCTION 
Decision trees are widely used for pattern classification [2, 7]. For these problems, 
results from the VC theory suggest that the amount of training data should grow 
at least linearly with the size of the tree[4, 3]. However, empirical results suggest 
that this is not necessary (see [6, 10]). For example, it has been observed that the 
error rate is not always a monotonically increasing function of the tree size[6]. 
To see why the size of a tree is not always a good measure of its complexity, consider 
two trees, A with NA leaves and B with Ns leaves, where Ns << NA. Although A 
is larger than B, if most of the classification in A is carried out by very few leaves 
and the classification in B is equally distributed over the leaves, intuition suggests 
that A is actually much simpler than B, since tree A can be approximated well by 
a small tree with few leaves. In this paper, we forrealize this intuition. 
We give misclassification probability bounds for decision trees in terms of a new 
complexity measure that depends on the distribution on the leaves that is induced 
by the training data, and can be considerably smaller than the size of the tree. 
These results build on recent theoretical results that give misclassification probabil- 
ity bounds for thresholded real-valued functions, including support vector machines, 
sigmoid networks, and boosting (see [1, 8, 9]), that do not depend on the size of the 
classifier. We extend these results to decision trees by considering a decision tree as 
a thresholded convex combination of the leaf functions (the boolean functions that 
specify, for a given leaf, which patterns reach that leaf). We can then apply the 
misclassification probability bounds for such classifiers. In fact, we derive and use 
a refinement of the previous bounds for convex combinations of base hypotheses, 
in which the base hypotheses can come from several classes of different complexity, 
and the VC-dimension of the base hypothesis class is replaced by the average (un- 
der the convex coefiqcients) of the VC-dimension of these classes. For decision trees, 
the bounds we obtain depend on the effective number of leaves, a data dependent 
quantity that reflects how uniformly the training data covers the tree's leaves. This 
bound is qualitatively different from the VC bound, which depends on the total 
number of leaves in the tree. 
In the next section, we give some definitions and describe the techniques used. We 
present bounds on the misclassification probability of a thresholded convex combina- 
tion of boolean functions from base hypothesis classes, in terms of a misclassification 
margin and the average VC-dimension of the base hypotheses. In Sections 3 and 4, 
we use this result to give error bounds for decision trees and disjunctive normal 
form (DNF) formulae. 
2 
GENERALIZATION ERROR IN TERMS OF MARGIN 
AND AVERAGE COMPLEXITY 
We begin with some definitions. For a class 7/of {-1, 1 }-valued functions defined on 
the input space X, the convex hull co(7/) of7/is the set of [-1, 1f-valued functions of 
the form '-i aihi, where ai _> O, '-i ai = 1, and hi  7/. A function in co(7/) is used 
for classification by composing it with the threshold function, sgn: IR  {-1, 1}, 
which satisfies sgn(a) = i iff a _> 0. So f  co(7/) makes a mistake on the pair 
(x,y)  X x {-1,1} iff sgn(f(x))  y. We assume that labelled examples (x,y) 
are generated according to some probability distribution Z) on X x {-1, 1}, and we 
let Pv [E] denote the probability under Z) of an event E. If S is a finite subset 
of Z, we let Ps [E] denote the empirical probability of E (that is, the proportion 
of points in S that lie in E). We use Ev [.] and Es [.] to denote expectation in a 
similar way. For a function class H of {-1, 1}-valued functions defined on the input 
Generalization in Decision Trees and DNF: Does Size Matter? 261 
space X, the growth function and VC dimension of H will be denoted by IIy(m) 
and VCdim(H) respectively. 
In [8], Schapire et al give the following bound on the misclassification probability 
of a thresholded convex combination of functions, in terms of the proportion of 
training data that is labelled to the correct side of the threshold by some margin. 
(Notice that Pv [sgn(f(x))  y] <_ Pv [yf(x) <_ 0].) 
Theorem I ([8]) Let D be a distribution on X x (-1,1), 7/ a hypothesis class 
with VCdim(H) = d < cx, and 5 > O. With probability at least 1 -  over a training 
set $ of m examples chosen according to D, every function f 6 co(7/) and every 
0  0 satisfy 
rv[yf(x)_<o]<_rs[yf(x)<_o]q-o   o: 
q- log(1/$)) 
In Theorem 1, all of the base hypotheses in the convex combination f are elements 
of a single class 7/with bounded VC-dimension. The following theorem generalizes 
this result to the case in which these base hypotheses may be chosen from any of k 
classes, 7/x,... , 7/k, which can have different VC-dimensions. It also gives a related 
result that shows the error decreases to twice the error estimate at a faster rate. 
Theorem 2 Let D be a distribution on X x {-1, 1), 7/x,... ,7/k hypothesis classes 
with VCdim(Hi) = di, and 5 > O. With probability at least i - 5 over a training 
) 
set $ of m examples chosen according to D, every function f 6 co Ui=x 7/i and 
every 0 > 0 satisfy both 
rv [yf (x) _< O] _< r s [yf (x) _< 8] + 
0  (dlogm 
+ logk)log(mO2/d)+ log(i/J)) x/2) 
rv [yf(x) _< 0] _< 2rs [yf(x) _< 19] + 
O (- (2 (dlogm+ logk)log(mO2/d) + log(1/5))) , 
where d = Y.i aidj and the ai and ji are defined by f = Y-i aihi and hi  7/ji for 
ji e {1,... ,k). 
Proof sketch: We shall sketch only the proof of the first inequality of the the- 
orem. The proof closely follows the proof of Theorem i (see [8]). We consider 
a number of approximating sets of the form CN,t = { (l/N)= hi: hi  7/t 1' 
where I = (h,...,lN)  {1,...,k}  and N  N. Define C = UtC,t. 
) 
For a given f = Y-i aihi from co Ui= 7/i , we shall choose an approximation 
g  CN by choosing h,... ,hN independently from {h,h2,..., }, according to 
the distribution defined by the coefficients a. Let Q denote this distribution 
on CN. As in [8], we can take the expectation under this random choice of 
g e C to show that, for any 0 > 0, rv [yf(x) < 0] <_ Ea~ [Pt)[yg(x) < 8/2]] + 
exp(-NO/8). Now, for a given I E {1,... ,k} , the probability that there is 
a g in CN,t and a 0 > 0 for which Pv[yg(x) <_ 8/2] > Ps[yg(x) _< 8/2] + eiv,t 
is at most 8(N + 1)rI= (2*"a*' exp(-rnev,t/32). Applying the union bound 
k d,, ! 
262 M. Golea, P Bartlett, W. S. Lee and L Mason 
(over the values of /), taking expectation over g . Q, and setting eN,t -' 
- In 8(N + 1) rl/N=l 2era  dr, k N/SN shows that, with probability at least 
i - 5N, every f and 0 > 0 satisfy Pv[yf(x) _ 0] _ Eg[Ps[yg(x) _ S/2]] + 
Eg [eN,t]. As above, we can bound the probability inside the first expectation 
in terms of Ps [yf(x) _ 9]. Also, Jensen's inequality implies that Eg [eN,t] _ 
(- (ln(8(N + 1)/SN)+ Nlnk + N Eiaidj, ln(2em))) / . Setting 5N = 5/(N(N + 
1)) and N= [ln(-)] gives the result. I 
Theorem 2 gives misclassification probability bounds only for thresholded convex 
combinations of boolean functions. The key technique we use in the remainder of the 
paper is to find representations in this form (that is, as two-layer neural networks) 
of more arbitrary boolean functions. We have some freedom in choosing the convex 
coefficients, and this choice affects both the error estimate Ps [yf(x) _< 0] and the 
average VC-dimension d. We attempt to choose the coefficients and the margin 0 
so as to optimize the resulting bound on misclassification probability. In the next 
two sections, we use this approach to find misclassification probability bounds for 
decision trees and DNF formulae. 
3 DECISION TREES 
A two-class decision tree T is a tree whose internal decision nodes are labeled with 
boolean functions from some class 
from {-1, q.1}. For a tree with N leaves, define the leaf functions, hi: X --> {--1, 1} 
by hi(x) - I iff x reaches leaf i, for i - 1,... , N. Note that hi is the conjunction 
of all tests on the path from the root to leaf i. 
For a sample $ and a tree T, let Pi = Ps [hi(x) -' 1]. Clearly, P = (P,..., PN) is 
a probability vector. Let eri  {-1, q.1} denote the class assigned to leaf i. Define 
the class of leaf functions for leaves up to depth j as 
7/i = {h' h=ulAuA'"Au,.Ir_j, 
It is easy to show that VCdim(7/i) _ 2jVCdim(L/) ln(2ej). Let di denote the depth 
of leaf i, so hi  7Q, and let d = maxi 
The boolean function implemented by a decision tree T can be written as a 
thresholded convex combination of the form T(x) = sgn(f(x)), where f(x) -- 
E/N=1WiO'i ((hi(x) + 1)/2) = -]/N= wio'ihi(x)/2 q. ']/N=i wieri/2, with wi > 0 and 
-]iN=l wi = 1. (To be precise, we need to enlarge the classes 7/i slightly to be closed 
under negation. This does not affect the results by more than a constant.) We first 
assume that the tree is consistent with the training sample. We will show later how 
the results extend to the inconsistent case. 
The second inequality of Theorem 2 shows that, for fixed 6 > 0 there is a con- 
stant c such that, for any distribution 29, with probability at least I- 6 over 
the sample S we have rv IT(x)  y] _< 2rs [yf(x) <_ O] + olw ']iN:l widiB, where 
B = --VCdim(L/)logm log d. Different choices of the wis and the 0 will yield dif- 
ferentmestimates of the error rate of T. We can assume (wlog) that P1 > --- > PN. 
A natural choice is wi = Pi and Pj+ _< 0 < Pj for some j  {1,..., N}hicff-gives 
N B 
rv[T(x)y]<_2 E Piq- 0, (1) 
i:j-F1 
Generalization in Decision Trees and DNF: Does Size Matter? 263 
where  - './/= Pidi. We can optimize this expression over the choices of j  
(1... , N) and 0 to give a bound on the misclassification probability of the tree. 
Let p(P,U) -- .iN__(Pi- l/N) 2 be the quadratic distance between the prob- 
ability vector P - (Px,...,PN) and the uniform probability vector U - 
(1/N, 1/N,...,1/N). Define Ne -- N (1 - p(P,U)). The parameter Ne is a mea- 
sure of the effective number of leaves in the tree. 
Theorem 3 For a fixed J > O, there is a constant c that satisfies the following. Let 
l) be a distribution on X x {-1, 1}. Consider the class of decision trees of depth up 
to d, with decision functions in 5t. With probability at least 1 -  over the training 
set $ (of size m), every decision tree T that is consistent with S has 
pv [T(x)  y] _ c ( Nee VCdim( 1g2 m lg d) 
where Nee is the effective number of leaves of T. 
1/2 
, 
Proof: Supposing that 0 _ (/N) 1/2 we optimize (1) by choice of 0. If the chosen 
0 is actually smaller than (3/N) 1/2 then we show that the optimized bound still 
holds by a standard VC result. If 0 _ (3IN) /2 then //=j+l Pi _ 02Nee/. So (1) 
implies that Pv IT(x)  y] _ 202Ne/3 + B/O 2. The optimal choice of 0 is then 
(32B/Nee) /4 So if (tB/Nee) /4 _ (/N) /2, we have the result. Otherwise, 
the upper bound we need to prove satisfies 2(2NeeB) 1/2 > 2NB, and this result is 
implied by standard VC results using a simple upper bound for the growth function 
of the class of decision trees with N leaves. I 
Thus the parameters that quantify the complexity of a tree are: a) the complexity 
of the test function class 51, and b) the effective number of leaves Ner. The effective 
number of leaves can potentially be much smaller than the total number of leaves 
in the tree [5]. Since this parameter is data-dependent, the same tree can be simple 
for one set of Pis and complex for another set of Pis. 
For trees that are not consistent with the training data, the procedure to estimate 
the error rate is similar. By defining Qi = Ps [Y(i -- -1 [ hi(x) - 1] and P/ = 
Pi(1 - Qi)/(1 - Ps IT(x)  y]) we obtain the following result. 
Theorem 4 For a fixed J > O, there is a constant c that satisfies the following. Let 
l) be a distribution on X x {-1, 1}. Consider the class of decision trees of depth up 
to d, with decision functions in 51. With probability at least 1 -  over the training 
set S (of size m), every decision tree T has 
(Ne VCdim(51) log 2 rn logd) x/a 
Pv IT(x)  y] _< Ps IT(x)  y] + c , 
m 
where c is a universal constant, and Nee = N(1 - p(P, U)) is the effective number 
of leaves ofT. 
Notice that this definition of Nee generalizes the definition given before Theorem 3. 
4 DNF AS THRESHOLDED CONVEX COMBINATIONS 
A DNF formula defined on {-1, 1) ' is a disjunction of terms, where each term is a 
conjunction of literals and a literal is either a variable or its negation. For a given 
DNF formula g, we use N to denote the number of terms in g, t to represent the ith 
264 M. Golea, P. Bartlett, W. S. Lee and L Mason 
term in f, Li to represent the set of literals in ti, and Ni the size of Li. Each term ti 
can be thought of as a member of the class 7/N,, the set of monomials with Ni liter- 
als. Clearly, 17/jl = The DNF g can be written as a thresholded convex combi- 
nation of the form g(x) = -sgn(-f(x)) = -sgn (- -./N=x wi ((ti + 1)/2)). (Recall 
that sgn(a) = I ifil _> 0.) Further, each term ti can be written as a thresholded con- 
vex combination of the form ti(x) = sgn(fi(x)) = sgn (,,, vik ((lk(x) - 1)/2)). 
Assume for simplicity that the DNF is consistent (the results extend easily to the 
inconsistent case). Let 7 + (7-) denote the fraction of positive (negative) exam- 
ples under distribution T). Let Pv+ ['] (Pz>- [']) denote probability with respect 
to the distribution over the positive (negative) examples, and let Ps+ ['] (Ps- [']) 
be defined similarly, with respect to the sample S. Notice that Pv [g(x)  y] = 
?+Pv+ [g(x) = -1] + ?-Pv- [(i) ti(x) = 1], so the second inequality of Theorem 2 
shows that, with probability at least 1 - 6, for any 0 and any Ois, 
Pv[g(x)Y]_<7 + 2P$+[f(x)_<O]q- q-?- 2P$-[-fi(x)_<0i]+ 
i=1 
where  = y.=l wNi and B = c (lognlog 2 m + log(N/J))/m. As in the case of 
decision trees, different choices of 0, the 0s, and the weights yield different estimates 
of the error. For an arbitrary order of the terms, let Pi be the fraction of positive 
examples covered by term t but not by terms ti-,..., tx. We order the terms such 
that for each i, with ti_i,...,ti fixed, Pi is maximized, so that P _> ... _> P, 
and we choose wi = Pi. Likewise, for a given term ti with literals lx,..., l in an 
arbitrary order, let p(i) be the fraction of negative examples uncovered by literal 
l but not uncovered by l_,...,lx. We order the literals of term ti in the same 
greedy way as above so that P?) >_ ... _> P(), and we choose vi = p(i). For 
Pj+x <0<P1andP(0 <0<P(0 wherel<j<Nandl<ji<Ni, weget 
-- jiq-1 -- jiq-l' .... 
Po[g(x)y]<? + 2 y Pi+- +?-. 2 . 
i=j+l i=1 k=j +1 
Now, let P = (P,...,Pv) and for each term i let p(i) = (p(i) P}) Define 
Nfr = N(1 - p(P, U)) and N (0 - N(1 - p(p(i) U)), where U is the relevant 
uniform distribution in each case. The parameter Ne is a measure of the effective 
number of terms in the DNF formula. It can be much smaller than N; this would be 
the case if few terms cover a large fraction of the positive examples. The parameter 
Ne (0 is a measure of the effective number of literals in term t. Again, it can be 
much smaller than the actual number of literals in ti: this would be the case if few 
literals of the term uncover a large fraction of the negative examples. 
Optimizing over 0 and the Ois as in the proof of Theorem 3 gives the following 
result. 
Theorem 5 For a fixed 5 > O, there is a constant c that satisfies the following. Let 
D be a distribution on X x {-1, 1}. Consider the class of DNF formulae with up to 
N terms. With probability at least I - (5 over the training set $ (of size m), every 
DNF formulae g that is consistent with $ has 
N 
_ " t' h/(i) IQ 1/2 
ro [g(x)  y] < ?+(NeadB) / + ,- 
i=1 
where d N ?+ 
= maxi= 1Ni, = Pv [Y = +1] and B = c(lognlog 2 m + log(N/6))/m. 
Generalization in Decision Trees and DNF: Does Size Matter? 265 
5 CONCLUSIONS 
The results in this paper show that structural complexity measures (such as size) of 
decision trees and DNF formulae are not always the most appropriate in determining 
their generalization behaviour, and that measures of complexity that depend on the 
training data may give a more accurate descriptinn. Our analysis can be extended 
to multi-class classification problems. A similar analysis implies similar bounds 
on misclassification probability for decision lists, and it seems likely that these 
techniques will also be applicable to other pattern classification methods. 
The complexity parameter, Neff described here does not always give the best possi- 
ble error bounds. For example, the effective number of leaves Neff in a decision tree 
can be thought of as a single number that summarizes the probability distribution 
over the leaves induced by the training data. It seems unlikely that such a number 
will give optimal bounds for all distributions. In those cases, better bounds could be 
obtained by using numerical techniques to optimize over the choice of t and wis. It 
would be interesting to see how the bounds we obtain and those given by numerical 
techniques reflect the generalization performance of classifiers used in practice. 
Acknowledgements 
Thanks to Yoav Freund and Rob Schapire for helpful comments. 
References 
[1] P.L. Bartlett. For valid generalization, the size of the weights is more important 
than the size of the network. In Neural Information Processing Systems 9, pages 
134-140. Morgan Kaufmann, San Mateo, CA, 1997. 
[2] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and 
Regression Trees. Wadsworth, Belmont, 1984. 
[3] A. Ehrenfeucht and D. Haussler. Learning decision trees from random exam- 
ples. Information and Computation, 82:231-246, 1989. 
[4] U.M. Fayyad and K.B. Irani. What should be .ninimized in a decision tree? 
In AAAI-90, pages 249-754, 1990. 
[5] R. C. Holte. Very simple rules perform well on most commonly used databases. 
Machine learning, 11:63-91, 1993. 
[6] P.M. Murphy and M.J. Pazzani. Exploring the decision forest: An empirical 
investigation of Occam's razor in decision tree induction. Journal of Artificial 
Intelligence Research, 1:257-275, 1994. 
[7] J.R. Quinlan. C.5: Programs for Machine Learning. Morgan Kaufmann, 
1992. 
[8] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: 
a new explanation for the effectiveness of voting methods. In Machine Learning: 
Proceedings of the Fourteenth International Conference, pages 322-330, 1997. 
[9] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A frame- 
work for structural risk minimisation. In Proc. 9th COLT, pages 68-76. ACM 
Press, New York, NY, 1996. 
[10] G.L. Webb. Further experimental evidence against the utility of Occam's razor. 
Journal of Artificial Intelligence Research, 4:397-417, 1996. 
