Data-Dependent Structural Risk 
Minimisation for Perceptron Decision 
Trees 
John Shawe-Taylor 
Dept of Computer Science 
Royal Holloway, University of London 
Egham, Surrey TW20 0EX, UK 
Eraall: jstOdcs.rhbnc.ac.uk 
ble!!o Cristi,,n!ni 
Dept of Engineering Mathematics 
University of Bristol 
Bristol BS8 1TR, UK 
Eraall: nello.cristianiniObristol.ac.uk 
Abstract 
Perceptron Decision Trees (a! known as Linear Machine DTs, 
etc.) are analysed in order that data-dependent Structural Risk 
Minlml,ation can be applied. Data-dependent analysis is per- 
formed which indicates that choosing the maximal margin hyper- 
plancs at the decision nodes will improve the generalization. The 
analysis uses a novel technique to bound the generalization error in 
terms of the margins at individual nodes. Experiments performed 
on real data sets confirm the validity of the approach. 
Introduction 
Neural network researchers have traditionally tackled classification problems by 
sembling perceptton or sigmoid nodes into feedforward neural networks. In this 
paper we consider a less common approach where the percepttons are used as deci- 
sion nodes in a decision tree structure. The approach has the advantage that more 
efficient heuristic algorithms exist for these structures, while the advantages of in- 
herent parallelism are if anything greater as all the percepttons can be evaluated in 
pardel, with the path through the tree determined in a very fast post-processing 
phase. 
Classical Decision Trees (DTs), like the ones produced by popular packages as 
CART [5] or C4.5 [9], partition the input space by means of axis-parallel hyperplanes 
(one at each internal node), hence inducing categories which are represented by 
(axis-parallel) hyperrectangles in such a space. 
A natural extension of that hypothesis space is obtained by associating to each 
internal node hyperplanes in general position, hence partitioning the input space 
by means of polygonal (polyhedral) categories. 
Data-Dependent SRM for Perceptron Decision Trees 337 
This approach has been pursued by many researchers, often with different motiva- 
tions, and hence the resulting hypothesis space has been given a number of different 
names: multivariate DTs [6], oblique DTs [8], or DTs using linear combinations of 
the attributes [5], Linear Machine DTs, Neural Decision Trees [12], Perceptton Trees 
[13], etc. 
We will call them Perceptton Decision Trees (PDTs), as they can be regarded as 
binary trees having & simple perceptton associated to each decision node. 
Different algorithms for Top-Down induction of PDTs from data have been pro- 
posed, based on different principles, [10], [5], [8], 
Experimental study of learning by means of PDTs indicates that their performances 
are sometimes better than those of traditional decision trees in terms of generaliz&- 
tion error, and usually much better in terms of tree-size [8], [6], but on some data 
set PDTs can be outperformed by normal DTs. 
We investigate an alternative strategy for improving the generaliz&tion of these 
structures, namely placing m,dmal margin hyperplanes at the decision nodes. By 
use of & novel analysis we are able to demonstrate that improved generalization 
bounds can be obtained for this approach. Experiments confirm that such a method 
delivers more accurate trees in all tested databases. 
2 Generalized Decision Trees 
Definition 2.1 Generalized Decision Trees (GDT). 
Given a space X and a set of boolean functions 
Y = {f: X - {0, 1}}, the class GDT(Y) of Generalized Decision Trees over Y are 
functions which can be implemented using a binary tree where each internal node 
is labeled with an element of Y, and each leag is labeled with either 1 or 0. 
To evaluate a particular tree T on input z {5 X, All the boolean functions associated 
to the nodes are assigned the same argument z {5 X, which is the argument ofT(z). 
The values assumed by them determine a unique path from the root to a leaf: at 
each internal node the left (respectively right) edge to a child is taken if the output 
of the function associated to that internal node is 0 (respectively 1). The value of 
the function at the assignment of a z {5 X is the value associated to the leag reached. 
We say that input z reaches a node of the tree, if that node is on the evaluation 
path for z. 
In the following, the nodes are the internal nodes of the binary tree, and the leaves 
are its external ones. 
Examples. 
 Given X = {0, 1}", a Boolean Decision Tree (BDT) is & GDT over 
BDT = {: J(x) = xl, x {5 X} 
 Given X = R , a C$.5-1ike Decision e [CDT]   GDT over 
T d of deion rees defined on  CongeRoRs 8pe e he oRpR of 
common gohms llke C4.b d CART, d we   hem - for shorg 
- CDTs. 
 Given X = R , a Peepn Decision e (PDT)   GDT over 
PDT = {x:  6 R+}, 
where we have umM that the puts have bn auent th a coor- 
dinate of constt ue, hence implementg a tMholded perceptton. 
338 J. Shawe-Taylor and N. Cristianini 
3 Data-dependent SRM 
We begin with the definition of the fat-shattering dimension, which was first intro- 
duced in [7], and has been used for several problems in learning since [1, 4, 2, 3]. 
Definition 3.1 Let Y be a set of real valued functions. We say that a set of points 
X is 7-shattered by Y relative to r = (r)Ex if there are real numbers r indexed 
bit z  X such that for all binarid vectors b indexed bit X, there is a function fb  Y 
satis filing 
>r+7 ifb-I 
fb (z)  r. - 7 othervoise. 
The fat shattering dimension fat j, of the set Y is a function from the positive real 
numbers to the integers which maps a value 7 to the size of the largest 7-shattered 
set, if this is finite, or infinitit otherwise. 
As an example which will be relevant to the subsequent analysis consider the class: 
We quote the following result from [11]. 
Corollary 3.2 [11] Let ;Piin be restricted to points in a ball of n dimensions of 
radius R about the origin and with thresholds [01 _ R. Then 
fa%in('r) < + + 
The following theorem bounds the generalization of a classifier in terms of the 
fat shattering dimension rather than the usual Vapniir-Chervonenkis or Pseudo 
dimension. 
Let To denote the threshold function at O: T0: -- {0, 1}, To(a) = 1 itf a > O. For 
a class of functions Y, To(Y) = {To(f): f 
Theorem 3.3 [11] Consider a real valued function class Y having fat shattering 
function bounded above blt the function a. fat: ] - N which is continuous from 
the right. Fix O  . If a learner correctlit classifies m independentlit generated 
examples  with h = To(l) To(Y) such that er.(h) = 0 and 'r = ml. I.t(,d - ol, 
then with confidence 1 -- I the expected error of h is bounded from above by 
e(m,k,') = m2----- (klog (8--) log(32m) + log ($?) ) , 
where k = afar(7/8 ). 
The importance of this theorem is that it can be used to explain how a classifier 
can give better generalization than would be predicted by a classical analysis of its 
VC dimension. Essentially expanding the margin performs an automatic capacity 
control for function classes with small fat shattering dimensions. The theorem shows 
that when a large margin is achieved it is as if we were working in a lower VC class. 
We should stress that in general the bounds obtained should be better for cases 
where a large margin is observed, but that a priori there is no guarantee that such 
a margin will occur. Therefore a priori only the classical VC bound can be used. In 
view of corresponding lower bounds on the generalization error in terms of the VC 
dimension, the a posteriori bounds depend on a favourable probability distribution 
making the actual learning task easier. Hence, the result will only be useful if 
the distribution is favourable or at least not adversarial. In this sense the result 
is a distribution dependent result, despite not being distribution dependent in the 
Data-Dependent SRM fo r Perceptron Decision Trees 339 
traditional sense that assumptions about the distribution have had to be made in 
its derivation. The benign behaviour of the distribution is automatically estimated 
in the learning process. 
In order to perform a similar analysis for perceptton decision trees we will consider 
the set of margins obtained at each of the nodes, bounding the generalization as a 
function of these values. 
4 Generalisation analysis of the Tree Class 
It turns out that bounding the fat shattering dimension of PDT's viewed as real 
function classifiers is difficult. We will therefore do a direct generalization analysis 
mlmlclrig the proof of Theorem 3.3 but taking into account the margins at each of 
the decision nodes in the tree. 
Definition 4.1 Let (X, d) be a (pseudo-) metric space, let A be a subset of X and 
 > O. A set B C_ X is an e-cover for A if, for ever3t a  A, there exists b  B such 
that d(a, b) < e. The e-covering number of A, 3f(e, A), is the minimal cardinalitt 
of an e-cover for A (if there is no such finite cover then it is defined to be oo). 
We write Jq'(e, Y, x) for the e-covering number of Y with respect to the too pseudo- 
metric measuring the mvimum discrepancy on the sample x. These numbers are 
bounded in the following Lemma. 
lemma 4.2 (Alon eta/. [1]) Let Y be a class of functions X -- [0, 1] and P a 
distribution over X. Choose 0 <  < 1 and let d = faty(e/4). Then 
[ 4m  los(a,,,/()) 
E(Jq'(e,Y,x))_< 2 --j , 
vhere the expectation E is taken u.r.t. a sample x  X rn drawn according to p,n. 
Corollary 4.3 [11] Let  be a class of functions X -- [a, b] and P a distribution 
over X. Choose 0 <  < 1 and let d = faty(e/4). Then 
< 2 , 
vhere the expectation E is over samples x  X 'n dravn according to p,n. 
We are now in a position to tackle the main lemma which bounds the probability 
over a double sample that the first half has zero error and the second error greater 
than an appropriate e. Here, error is interpreted as being differently classified at 
the output of tree. In order to simplify the notation in the following lemma we 
assume that the decision tree has K nodes. We !no denote fat0 li n (7) by fat(7) to 
simplify the notation. 
lemmaa 4.4 Let T be a perceptton decision tree with K decision nodes with margins 
71 , 7a,..., 7 ar at the decision nodes. If it has correctly classified m labelled examples 
generated independently according to the unknown (but fated) distribution P, then 
we can bound the following probability to be less than J, 
pa,n {xy: 3 a tree T: T correctly classifies x, 
fraction of y misclassified > e(m, K, J) } < J, 
4,,,, K, = os(4) + 
,,here D = E,=l log(4e/},) and }, = fat(,,/S). 
340 J. Shawe-Taylor and N. Cristianini 
Proof: Using the standard permutation argument, we may fix a sequence xy and 
bound the probability under the uniform distribution on swapping permutations 
that the sequence satisfies the condition stated. We consider generating minimal 
7/2-covers Bxar for each value of k, where 7 = mln{7': fat(7'/8) _< k}. Suppose 
that for node i of the tree the margin 7  of the hyperplane vi satisfies fat(7 i/8) =/q. 
We can therefore find fl  Bxa whose output values are within 7 i/2 of vi. We now 
consider the tree T' obtained by replacing the node percepttons vi of T with the 
corresponding fl. This tree performs the same classification function on the first 
half of the sample, and the margin remains larger than 7 i -71,/2 > 7,/2. If a 
point in the second half of the sample is incorrectly classified by T it will either 
still be incorrectly classified by the adapted tree T' or will at one of the decision 
nodes i in T' be closer to the decision boundary than 7,/2. The point is thus 
distinguishable from left hand side points which are both correctly classified and 
have margin greater than 7,/2 at node i. Hence, that point must be kept on the 
right hand side in order for the condition to be satisfied. Hence, the fraction of 
permutations that can be allowed for one choice of the functions from the covers 
is 2 -'n. We must take the union bound over all choices of the functions from the 
covers. Using the techniques of [11] the numbers of these choices is bounded by 
Corollory 4.3 as follows 
II/K=! 2(8rn)t' 1g(4rn/t') = 2 K (8rn) D , 
where D = Y=l kl log(4em/ki). The value of  in the lemma statement therefore 
ensures that this the union bound is less than 5. 
Using the standard lernma due to Vapnik [14, page 168] to bound the error proba- 
bilities in terms of the discrepancy on a double sample, combined with Lemma 4.4 
gives the following result. 
Theorem 4.5 Suppose we are able to classi an m sample of labelled ezamples 
uting a perceptton decision tree with K nodes and obtaining margins 71 at node i, 
then we can bound the generalisation error with probabilitt greater than 1 -5 to be 
less than 
+ log + 1), ) 
vhere D = E=l k, log(4em/k,) and k, - fat(7,/8). 
Proof: We must bound the probabilities over different architectures of trees and 
different margins. We simply have to choose the values of  to ensure that the 
individual l's are sufficiently small that the total over all possible choices is less 
than 5. The details are omitted in this abstract. 
5 Experiments 
The theoretical results obtained in the previous section imply that an algorithm 
which produces large margin splits should have a better generalization, since in- 
creasing the margins in the internal nodes, has the effect of decreasing the bound 
on the test error. 
In order to test this strategy, we have performed the following experiment, divided 
in two parts: first run a standard perceptton decision tree algorithm and then for 
each decision node generate a maximal margin hyperplane implementing the same 
dichotomy in place of the decision boundary generated by the algorithm. 
Data-Dependent SRM for Perceptron Decision Trees 341 
Input: Random m sample x with corresponding dasshication b. 
Algorithm: Find a perceptton decision tree T which correctly classhies the sample 
using a standard algorithm; 
Let k = number of decision nodes of T; 
From tree T create T  by executing the following loop: 
For each decision node i replace the weight vector vi by the vector v 
which realises the maximal margin hyperplane agreeing with vl on the 
set of inputs reaching node i; 
Let the margin of vl on the inputs reaching node i be 71; 
Output: Classifier T , with bound on the generalisation error in terms of the num- 
ber of decision nodes K and D = Y-=l k, log(4em/k,) where k, = fat(7/8). 
Note that the classification of T and T t agree on the sample and hence, that T t is 
consistent with the sample. 
As a PDT learning algorithm we have used OC1 [8], created by Murthy, Kasif and 
Salzberg and freely available over the internet. It is a randomi.ed algorithm, which 
performs simulated annealing for learning the percepttons. The details about the 
randomi.ation, the pruning, and the splitting criteria can be found in [8]. 
The data we have used for the test are 4 of the 5 sets used in the original OC1 
paper, which are publicly available in the UCI data repository [16]. 
The results we have obtained on these data are compatible with the ones reported 
in the original OC1 paper, the differences being due to different divisions between 
training and testing sets and their sizes; the absence in our experiments of cross- 
validation and other techniques to estimate the predictive accuracy of the PDT; 
and the inherently randomi.ed nature of the algorithm. 
The second stage of the experiment involved finding - for each node - the hyperplane 
which performes the same split as performed by the OC1 tree but with the maximal 
margin. This can be done by considering the subsample reaching each node as 
perfectly divided in two parts, and feeding the data accordingly relabelled to an 
algorithm which finds the optimal split in the linearly separable case. The maximal 
margin hyperplanes are then placed in the decision nodes and the new tree is tested 
on the same testing set. 
The data sets we have used are: Wiscounsin Breast Cancer, Pima Indians Diabetes, 
Boston Housing transformed into a classification problem by thresholdlug the price 
at $ 21.000 and the classical Iris studied by Fisher (More informations about the 
databases and their authors are in [8]). All the details about sample sizes, number 
of attributes and results (training and testing accuracy, tree size) are summarized 
in table 1. 
We were not particularly interested in achieving a high testing accuracy, but rather 
in observing if improved performances can be obtained by increasing the margin. 
For thi. reason we did not try to optimize the performance of the original classifier 
by using cross-validation, or a convenient training/testing set ratio. The relevant 
quantity, in this experiment, is the different in the testing error between a PDT 
with arbitrary margins and the same tree with optimized margins. This quantity 
has turned out to be always positive, and to range from 1.7 to 2.8 percent of gain, 
on test errors which were already very low. 
train OC1 test FAT test #trs #rs artrib. classes nodes 
CANC 96.53 93.52 95.37 249 108 9 2 1 
IRIS 96.67 96.67 98.33 90 60 4 3 2 
DIAB 89.00 70.48 72.45 209 559 8 2 4 
HOUS 95.90 81.43 84.29 306 140 13 2 7 
342 J. Shawe-Taylor and N. Cristianini 
References 
[1] 
[2] 
[3] 
[4] 
[5] 
[6] 
[9] 
[10] 
[11] 
[12] 
[13] 
[14] 
[15] 
[16] 
Ncga Alon, Shai Ben-David, Nicolb Ces-Bianchi and David Hanssler, "Scale- 
sensitive Dimensions, Uniform Convergence, nd Le&rnability," in Proceedings 
of the Conference on Foundations of Computer Science (FOCS), (1993). Also 
to appear in Journal of the A CM. 
Martin Anthony and Peter Bartlett, "Function learning from interpolation", 
Technical Report, (1994). (An extended abstract appeared in Computational 
Learning Theory, Proceedings 2nd European Conference, EuroCOLT'95, pages 
211-221, ed. Paul Vitanyi, (Lecture Notes in Artificial Intelligence, 904) 
Springer-Verlag, Berlin, 1995). 
Peter L. Bartlett and philip M. Long, "Prediction, Learning, Uniform Con- 
vergence, and Scale-Sensitive Dimensions," Preprint, Department of Systems 
Engineering, Australian National University, November 1995. 
Peter L. Bartlett, Philip M. Long, and Robert C. Williamson, "Fat-shattering 
and the learnability of Real-valued Functions," Journal of Computer and Sys- 
tem Sciences, 52(3), 434-452, (1996). 
Breiman L., Friedman J.H., Olshen R.A., Stone C.J., "Classification and Re- 
gression Trees", Wadsworth International Group, Belmont, CA, 1984. 
Brodley C.E., Utgoff P.E., Multivariate Decision Trees, Machine Learning 19, 
pp. 45-77, 1995. 
Michael J. Kearns and Robert E. Schapire, "Efficient Distribution-free Learning 
of Probabili.tic Concepts," pages 382-391 in Proceedings of the 31st Symposium 
on the Foundations of Computer Science, IEEE Computer Society Press, Los 
A1Amltos, CA, 1990. 
Murthy S.K., Kasif S., Salzberg S., A System for Induction of Oblique Decision 
Trees, Journal of Artificial Intelligence Research, 2 (1994), pp. 1-32. 
QulnlAn J.R., "C4.5: Programs for Machine Learning", Morgan Kaufmann, 
1993. 
Sankar A., Mammone R.J., Growing and Pruning Neural Tree Networks, IEEE 
Transactions on Computers, 42:291-299, 1993. 
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, Martin Anthony, 
Structural Risk Minimization over Data-Dependent Hierarchies, NeuroCOLT 
Technical Report NC-TR-96-053, 1996. 
(t;p://t ;p. dcs. rhbnc. ac. uk/pub/neuocol;/;ech 'epor;s). 
J.A. Sirat, and J.*P. Nadal, "eural trees: a new tool for classification" Net* 
work, 1, pp. 42-438, 1990 
Utgoff p.E., Perceptton Trees: a Case Study in Hybrid Concept Representa* 
tions, Connection Science 1 (1989), pp. 377-391. 
Springer*Verlag, New York, 1982. 
Verlag, New York, 1995 
University of California, Irvine - Machine Learn;ng Repository, 
h;;p://www. ics. uci. edu/ mlea_n/P. LReposit oFy. h;ml 
