Boosting with Multi-Way Branching in 
Decision Trees 
Yishay Mansour 
David McAllester 
AT&T Labs-Research 
180 Park Ave 
Florham Park NJ 07932 
{mansour, dmac}@research.att.com 
Abstract 
It is known that decision tree learning can be viewed as a form 
of boosting. However, existing boosting theorems for decision tree 
learning allow only binary-branching trees and the generalization to 
multi-branching trees is not immediate. Practical decision tree al- 
gorithms, such as CART and C4.5, implement a trade-off between 
the number of branches and the improvement in tree quality as 
measured by an index function. Here we give a boosting justifica- 
tion for a particular quantitative trade-off curve. Our main theorem 
states, in essence, that if we require an improvement proportional 
to the log of the number of branches then top-down greedy con- 
struction of decision trees remains an effective boosting algorithm. 
I Introduction 
Decision trees have been proved to be a very popular tool in experimental machine 
learning. Their popularity stems from two basic features -- they can be constructed 
quickly and they seem to achieve low error rates in practice. In some cases the 
time required for tree growth scales linearly with the sample size. Efficient tree 
construction allows for very large data sets. On the other hand, although there 
are known theoretical handicaps of the decision tree representations, it seem that 
in practice they achieve accuracy which is comparable to other learning paradigms 
such as neural networks. 
While decision tree learning algorithms are popular in practice it seems hard to 
quantify their success in a theoretical model. It is fairly easy to see that even 
if the target function can be described using a small decision tree, tree learning 
algorithms may fail to find a good approximation. Kearns and. Mansour [6] used 
the weak learning hypothesis to show that standard tree learning algorithms perform 
boosting. This provides a theoretical justification for decision tree learning similar 
Boosting with Multi-Way Branching in Decision Trees 301 
to justifications that have been given for various other boosting algorithms, such as 
AdaBoost [4]. 
Most decision tree learning algorithms use a top-down growth process. Given a 
current tree the algorithm selects some leaf node and extends it to an internal node 
by assigning to it some "branching function" and adding a leaf to each possible 
output value of this branching function. The set of branching functions may differ 
from one algorithm to another, but most algorithms used in practice try to keep the 
set of branching functions fairly simple. For example, in C4.5 [7], each branching 
function depends on a single attribute. For categorical attributes, the branching 
is according to the attribute's value, while for continuous attributes it performs a 
comparison of the attribute with some constant. 
Of course such top-down tree growth can over-fit the data -- it is easy to construct 
a (large) tree whose error rate on the training data is zero. However, if the class of 
splitting functions has finite VC dimension then it is possible to prove that, with 
high confidence of the choice of the training data, for all trees T the true error rate 
of T is bounded by (T) -{- O (I) where (T) is the error rate of T on the 
training sample, ITI is the number of leaves of T, and rn is the size of the training 
sample. Over-fitting can be avoided by requiring that top-down tree growth produce 
a small tree. In practice this is usually done by constructing a large tree and then 
pruning away some of its nodes. Here we take a slightly different approach. We 
assume a given target tree size s and consider the problem of constructing a tree T 
with ITI = s and (T) as small as possible. We can avoid over-fitting by selecting a 
small target value for the tree size. 
A fundamental question in top-down tree growth is how to select the branching 
function when growing a given leaf. We can think of the target size as a "budget". 
A four-way branch spends more of the tree size budget than does a two-way branch 
-- a four-way branch increases the tree size by roughly the same amount as two two- 
way branches. A sufficiently large branch would spend the entire tree size budget in 
a single step. Branches that spend more of the tree size budget should be required to 
achieve more progress than branches spending less of the budget. Naively, one would 
expect that the improvement should be required to be roughly linear in the number 
of new leaves introduced -- one should get a return proportional to the expense. 
However, a weak learning assumption and a target tree size define a nontrivial 
game between the learner and an adversary. The learner makes moves by selecting 
branching functions and the adversary makes moves by presenting options consistent 
with the weak learning hypothesis. We prove here that the learner achieve a better 
value in this game by selecting branches that get a return considerably smaller than 
the naive linear return. Our main theorem states, in essence, that the return need 
only be proportional to the log of the number of branches. 
2 Preliminaries 
We assume a set Y of instances and an unknown target function f mapping Y 
to {0, 1}. We assume a given "training set" S which is a set of pairs of the form 
(x, f(x)). We let 7/be a set of potential branching functions where each h 6 7/is 
a function from A' to a finite set Rh -- we allow different functions in 7/ to have 
different ranges. We require that for any h 6 7/ we have I Rhl >_ 2. An 7/-tree is 
302 Y. Mansour and D. McAllester 
a tree where each internal node is labeled with an branching function h 6 7/ and 
has children corresponding to the elements of the set Rh. We define [T[ to be the 
number of leaf nodes of T. We let L(T) be the set of leaf nodes of T. For a given tree 
T, leaf node  of T and sample $ we write St to denote the subset of the sample $ 
reaching leaf . For  6 T we define/dr to be the fraction of the sample reaching leaf 
, i.e., [St[/[S[. We define t to be the fraction of the pairs (x, fix)) in St for which 
f(x) = 1. The training error of T, denoted e(T), is -teL(,)ldt min(t, 1 - t). 
3 The Weak Learning Hypothesis and Boosting 
Here, as in [6], we view top-down decision tree learning as a form of Boosting [8, 3]. 
Boosting describes a general class of iterative algorithms based on a weak learning 
hypothesis. The classical weak learning hypothesis applies to classes of Boolean 
functions. Let 7/2 be the subset of branching functions h 6 7/ with [Rh[ = 2. For 
d > 0 the classical d-weak learning hypothesis for 7/2 states that for any distribution 
on A' there exists an h 6 7/2 with Prt(h(x)k fix)) _ 1/2-d. Algorithms designed 
to exploit this particular hypothesis for classes of Boolean functions have proved to 
be quite useful in practice [5]. 
Kearns and Mansour show [6] that the key to using the weak learning hypothesis 
for decision tree learning is the use of an index function I: [0, 1] -+ [0, 1] where 
I(q) <_ 1, I(q) _> min(q,(1- q)) and where I(T)is defined to be 
Note that these conditions imply that g(T) <_ I(T). For any sample W let qw be 
the fraction of pairs (x, f(x))  W such that f(x) = 1. For any h 6 7/ let Ta be 
the decision tree consisting of a single internal node with branching function h plus 
a leaf for each member of IRal. Let Iw(Ta) denote the value of I(Ta) as measured 
with respect to the sample W. Let A(W, h) denote I(qw)- Iw(Th). The quantity 
A(W, h) is the reduction in the index for sample W achieved by introducing a single 
branch. Also note that/dtA($t, h) is the reduction in I(T) when the leaf is replaced 
by the branch h. Kearns and Mansour [6] prove the following lemma. 
Lemma 3.1 (Kearns & Mansour) Assuming the d-weak learning hypothesis for 
7t2, and taking I(q) to be 2v/q(1 - q), we have that for any sample W there exists 
an h  7t2 such that A (W, h) _> Y6 I(qw). 
This lemma motivates the following definition. 
Definition I We say that 7t2 and I satisfies the y-weak tree-growth hypothesis if 
for any sample W from 2( there exists an h  7t2 such that A(W, h) >_ 7I(qw). 
Lemma 3.1 states, in essence, that the classical weak learning hypothesis implies the 
weak tree growth hypothesis for the index function I(q) = 2v/q(1 - q). Empirically, 
however, the weak tree growth hypothesis seems to hold for a variety of index 
functions that were already used for tree growth prior to the work of Kearns and 
Mansour. The Ginni index I(q) = 4q(1- q) is used in CART [1] and the entropy 
I(q) = -qlog q-(1- q)log(1- q)is used in C4.5 [7]. It has long been empirically 
observed that it is possible to make steady progress in reducing I(T) for these 
choices of I while it is difficult to make steady progress in reducing g(T). 
We now define a simple binary branching procedure. For a given training set $ 
and target tree size s this algorithm grows a tree with ITI = s. In the algorithm 
Boosting with Multi-Way Branching in Decision Trees 303 
0 denotes the trivial tree whose root is a leaf node and Tt,h denotes the result of 
replacing the leaf  with the branching function h and a new leaf for each element 
of Rh. 
WHILE (ITI < s) DO 
 - argmax ]() 
h - argmax7t  A($, h) 
T - T,; 
END-WHILE 
n-1 n-1 _ff 
We now define e(n) to be the quantity ]-Ii= (1- ). Note that e(n) <_ I-Ii= e = 
e-'7 Zia__"l 1 1/i < e_, 7 Inn ____ n-3' ' 
Theorem 3.2 (Kearns & Mansour) If Yt2 and I satisfy the y-weak tree growth 
hypothesis then the binary branching procedure produces a tree T with g(T) _ I(T) _ 
e(lTI) _< 
Proof: The proof is by induction on the number of iterations of the procedure. 
We have that I(0) _ 1 = e(1) so the initial tree immediately satisfies the condi- 
tion. We now assume that the condition is satisfied by T at the begining of an 
iteration and prove that it remains satisfied by Tt,a at the end of the iteration. 
Since I(T) = "]-tT 15I() we have that the leaf  selected by the procedure is 
such that StI(qt) _ !. By the y-weak tree growth assumption the function 
h selected by the procedure has the property that A(St, h) _ yI(t). We now 
have that I(T) - I(Tt,a) = ItA(St, h) _ btyI(qt) _ y--. This implies that 
- = - - )e(ITI) = e(ll + = 
4 Statement of the Main Theorem 
We now construct a tree-growth algorithm that selects multi-way branching func- 
tions. As with many weak learning hypotheses, the y-weak tree-growth hypothesis 
can be viewed as defining a game between the learner and an adversary. Given a 
tree T the adversary selects a set of branching functions allowed at each leaf of the 
tree subject to the constraint that at each leaf  the adversary must provide a binary 
branching function h with A(S/, h) _ yI({t). The learner then selects a leaf  and 
a branching function h and replaces T by Tt,n. The adversary then again selects 
a new set of options for each leaf subject to the y-weak tree growth hypothesis. 
The proof of theorem 3.2 implies that even when the adversary can reassign all op- 
tions at every move there exists a learner strategy, the binary branching procedure, 
guaranteed to achieves a final error rate of ITI -. 
Of course the optimal play for the adversary in this game is to only provide a single 
binary option at each leaf. However, in practice the "adversary" will make mistakes 
and provide options to the learner which can be exploited to achieve even lower 
error rates. Our objective now is to construct a strategy for the learner which can 
exploit multi-way branches provided by the adversary. 
We first say that a branching function h is acceptable for tree T and target size 
304 Y. Mansour and D. McAllester 
s if either IRal = 2 or ITI < e(IRaDs7/(21Ra D. We also define g(k) to be the 
quantity (1- e(k))/7. It should be noted that g(2) = 1. It should also be noted 
that e(k) ~ e -'nk and hence for 71nk small we have e(k) ~ 1 - 71nk and hence 
g(k) ~ In k. We now define the following multi-branch tree growth procedure. 
T=0 
WHILE (IT] < s) DO 
 4- argmax t /3t I (l) 
h 4- argmaxa/' h acceptable for T and s 
T 4- Tt,a; 
END-WHILE 
A run of the multi-branch tree growth procedure will be called 7-boosting if at each 
iteration the branching function h selected has the property that A(St, h)/g(IRal) > 
7I(qt). The 7-weak tree growth hypothesis implies that a(&,h)/g(lahl) > 
7I(qt)/g(2) = 7I(t). Therefore, the 7-weak tree growth hypothesis implies that 
every run of the multi-branch growth procedure is ?-bootsing. But a run can be 
?-bootsing by exploiting mutli-way branches even when the ?-weak tree growth 
hypothesis fails. The following is the main theorem of this paper. 
Theorem 4.1 If T is produced by a 7-boosting run of the multi-branch tree-growth 
procedure then I(T) <_ e(lT D < IT[ -'. 
5 Proof of Theorem 4.1 
To prove the main theorem we need the concept of a visited weighted tree, or VW- 
tree for short. A VW-tree is a tree in which each node rn is assigned both a rational 
weight w, 6 [0, 1] and an integer visitation count v, > 1. We now define the 
following VW tree growth procedure. In the procedure T,o is the tree consisting of 
a single root node with weight w and visitation count 1. The tree Tt,,o,...,,ok is the 
result of inserting k new leaves below the leaf  where the ith new leaf has weight 
wi and new leaves have visitation count 1. 
w 4- any rational number in [0, 1] 
T 4- T, 
FOR ANY NUMBER OF STEPS REPEAT THE FOLLOWING 
 4- argmaxt e(v),o 
vl 4- vl q- 1 
OPTIONALLY T 4- Tt,,o,...,,ov WITH w +... wv < e(vt)wt 
We first prove an analog of theorem 3.2 for the above procedure. For a VW-tree T 
we define ITI to be Y]lsi;(a, vt and we define I(T) to be Y]tsi;(a, e(vt)wt. 
Lemma 5.1 The VW procedure maintains the invariant that I(T) < e(ITI). 
Proof.' The proof is by induction on the number of iterations of the algorithm. 
The result is immediate for the initial tree since e(1) = 1. We now assume that 
< e(ITI) at the start of an iteration and show that this remains true at the 
end of the iteration. 
Boosting with Multi-Way Branching in Decision Trees 305 
We can associate each leaf with vt "subleaves" each of weight e(vt)wt/vt. We have 
that ITI is the total number of these subleaves and I(T) is the total weight of these 
subleaves. Therefore there must exist a subleaf whose weight is at least I(T)/IT I. 
Hence there must exist a leaf  satisfying e(vt)wt/vt _ I(T)/ITI. Therefore this 
relation must hold of the leaf  selected by the procedure. 
Let T t be the tree resulting from incrementing vt. We now have I(T) - I(T t) = 
e(vt)wt-e(vt+ 1)wt = e(vt)wt-(1-)e(vt)wt = e(vt)wt _ 7!. So we have 
_< _< - )e(ITI) = 
Finally, if the procedure grows new leaves we have that the I(T) does not increase 
and that IT I remains the same and hence the invariant is maintained. [] 
For any internal node m in a tree T let C(m) denote the set of nodes which are 
children of m. A VW-tree will be called locally-well-formed if for every internal 
node m we have that v, = IC(m)l, that Y'.,c(,,)w, _ e(IC(m)])w,,. A VW-tree 
will be called globally-safe if maxt(T) e(vt)wt/vt _ minmV(T) e(vt -- 1)wt/(vt-- 1) 
where N(T) denotes the set of internal nodes of T. 
Lemma 5.2 If T is a locally well-formed and globally safe VW-tree, then T is a 
possible output of the VW growth procedure and therefore I(T) _ e(lTI). 
Proof: Since T is locally well formed we can use T as a "template" for making 
nondeterministic choices in the VW growth procedure. This process is guaranteed 
to produce T provided that the growth procedure is never forced to visit a node 
corresponding to a leaf of T. But the global safety condition guarantees that any 
unfinished internal node of T has a weight as least as large as any leaf node of T. 
We now give a way of mapping ']/-trees into VW-trees. More specifically, for any 
7/-tree T we define VW(T) to be the result of assigning each node m in T the weight 
1m I(qm), each internal node a visitation count equal to its number of children, and 
each leaf node a visitation count equal to 1. We now have the following lemmas. 
Lemma 5.3 If T is grown by a ?-boosting run of the multi-branch procedure then 
VW(T) is locally well-formed. 
Proof: Note that the children of an internal node m are derived by selecting 
a branching function h for the node m. Since the run is y-boosting we have 
/x($t,h)/g(IRhl) >_ ?I(qt). Therefore A($t,h)= (I(t)- Is(Th)) _ I(t)(1- 
c(IRl)). This implies that Is(Th) _ Multiplying by/t and trans- 
forming the result into weights in the tree VW(T) gives the desired result. [] 
The following lemma now suffices for theorem 4.1. 
Lemma 5.4 If T is grown by a 7-boosting run of the multi-branch procedure then 
VW(T) is globally safe. 
Proof: First note that the following is an invariant of a v-boosting run of the 
multi-branch procedure. 
max wt _ min wt 
t(vw()) ,,v(vw()) 
306 E Mansour and D. McAllester 
The proof is a simple induction on ?-boosting tree growth using the fact that the 
procedure always expands a leaf node of maximal weight. 
We must now show that for every internal node m and every leaf  we have that 
wl _< e(k- 1)w,/(k- 1) where k is the number of children of to. Note that if k = 2 
then this reduces to wl <_ w, which follows from the above invariant. So we can 
assume without loss of generality that k  2. Also, since e(k)/k  e(k - 1) / (k - 1), 
it suffices to show that wt _< e(k)w,/k. 
Let m be an internal node with k  2 children and let T  be the tree at the time 
rn was selected for expansion. Let wt be the maximum weight of a leaf in the final 
tree T. By the definition of the acceptability condition, in the last s/2 iterations 
we are performing only binary branching. Each binary expansion reduces the index 
by at least ? times the weight of the selected node. Since the sequence of nodes 
selected in the multi-branch procedure has non-increasing weights, we have that in 
any iteration the weight of the selected node is at least wt. Since there are at least 
s/2 binary expansions after the expansion of m, each of which reduces I by at least 
?w, we have that s?wl/2 _< I(T') so w <_ 2I(T')/(?s). The acceptability condition 
can be written as 2/(?s) _< e(k)/(kIT'l) which now yields wl _< I(T)e(k)/(klT'l). 
But we have that I(T)/IT'I <_ w, which now yields wl <_ e(k)w,/k as desired. [] 
References 
[1] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. 
Classification and Regression Trees. Wadsworth International Group, 1984. 
[2] 
Tom Dietterich, Michael Kearns and Yishay Mansour. 
Learning Framework to understand and improve C4.5. 
Learning, 96-104, 1996. 
Applying the Weak 
In Proc. of Machine 
[3] Yoav Freund. Boosting a weak learning algorithm by majority. Information and 
Computation, 121(2):256-285, 1995. 
[4] 
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of 
on-line learning and an application to boosting. In Computational Learning 
Theory: Second European Conference, EuroCOLT '95, pages 23-37. Springer- 
Verlag, 1995. 
[5] 
Yoav Freund and Robert E. Schapire. Experiments with a new boosting al- 
gorithm. In Machine Learning: Proceedings of the Thirteenth International 
Conference, pages 148-156, 1996. 
[6] 
Michael Kearns and Yishay Mansour. On the boosting ability of top-down 
decision tree learning. In Proceedings of the Twenty-Eighth A CM Symposium 
on the Theory of Computing, pages 459-468, 1996. 
[7] J. Ross Quinlan. C.5: Programs for Machine Learning. Morgan Kaufmann, 
1993. 
[8] Robert E. Schapire. The strength of weak learnability. Machine Learning, 
5(2):197-227, 1990. 
