Optimizing Classifiers for Imbalanced 
Training Sets 
Grigoris Karakoulas 
Global Analytics Group 
Canadian Imperial Bank of Commerce 
161 Bay St., BCE-11, 
Toronto ON, Canada M5J 2S8 
Email: karakoul@cibc.ca
John Shawe-Taylor 
Department of Computer Science 
Royal Holloway, University of London 
Egham, TW20 0EX 
England 
Email: jst@dcs.rhbnc.ac.uk
Abstract 
Following recent results [9, 8] showing the importance of the fat-shattering dimension in explaining the beneficial effect of a large margin on generalization performance, the current paper investigates the implications of these results for the case of imbalanced datasets and develops two approaches to setting the threshold. The approaches are incorporated into ThetaBoost, a boosting algorithm for dealing with unequal loss functions. The performance of ThetaBoost and the two approaches is tested experimentally.
Keywords: Computational Learning Theory, Generalization, fat-shattering, large 
margin, pac estimates, unequal loss, imbalanced datasets 
1 Introduction
Shawe-Taylor [8] demonstrated that the output margin can also be used as an 
estimate of the confidence with which a particular classification is made. In other 
words if a new example has an output value well clear of the threshold we can 
be more confident of the associated classification than when the output value is 
closer to the threshold. The current paper applies this result to the case where 
there are different losses associated with a false positive than with a false negative.
If a significant number of data points are misclassified we can use the criterion 
of minimising the empirical loss. If, however, the data is correctly classified the 
empirical loss is zero for all correctly separating hyperplanes. It is in this case that 
the approach can provide insight into how to choose the hyperplane and threshold. 
In summary, the paper suggests ways in which a hyperplane should be optimised for imbalanced datasets where the loss associated with misclassifying the less prevalent class is higher.
2 Background to the Analysis 
Definition 2.1 [3] Let F be a set of real-valued functions. We say that a set of points X is γ-shattered by F if there are real numbers r_x indexed by x ∈ X such that for all binary vectors b indexed by X, there is a function f_b ∈ F realising dichotomy b with margin γ. The fat-shattering dimension Fat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest γ-shattered set, if this is finite, or infinity otherwise.
In general we are concerned with classifications obtained by thresholding real-valued functions. The classification values will be {-1, 1} instead of the usual {0, 1} in order to simplify some expressions. Hence, typically we will consider a set F of functions mapping from an input space X to the reals. For the sake of simplifying the presentation of our results we will assume that the threshold used for classification is 0. The results can be extended to other thresholds without difficulty. Hence we implicitly use the classification functions H = T(F) = {T(f): f ∈ F}, where T(f) is the function f thresholded at 0. We will say that f has γ margin on the training set {(x_i, y_i): i = 1, ..., m} if min_{1 ≤ i ≤ m} y_i f(x_i) = γ. Note that a positive margin implies that T(f) is consistent.
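As a small illustration of the margin just defined, the following sketch computes the sample margin of a real-valued function; the function and variable names are ours and the toy data is for illustration only.

```python
import numpy as np

def sample_margin(f, X, y):
    """Margin of a real-valued function f on the sample {(x_i, y_i)}:
    min_i y_i * f(x_i).  A positive value means the thresholded classifier
    T(f) = sign(f) is consistent with the sample."""
    return np.min(y * f(X))

# Example with a linear function f(x) = w.x on a two-point sample.
w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.5], [0.2, 1.0]])
y = np.array([1, -1])
print(sample_margin(lambda Z: Z @ w, X, y))   # 0.8
```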
Definition 2.2 Given a real-valued function f: X → [-1, 1] used for classification by thresholding at 0, and a probability distribution P on X × {-1, 1}, we use er_P(f) to denote the probability er_P(f) = P{(x, y): yf(x) < 0}. Further suppose 0 ≤ η ≤ 1; then we use er_P(f|η) to denote the probability

er_P(f|η) = P{(x, y): yf(x) ≤ 0 | |f(x)| ≥ η}.

The probability er_P(f|η) is the probability of misclassification of a randomly chosen example given that it has a margin of η or more.
We consider the following restriction on the set of real-valued functions. 
Definition 2.3 The real-valued function class F is closed under addition of constants if f ∈ F and c ∈ ℝ imply f + c ∈ F.
Note that the linear functions (with threshold weights) used in perceptrons [9] satisfy this property, as do neural networks with linear output units. Hence, this property applies to the Support Vector Machine and the neural network examples.
We now quote a result from [8]. 
Theorem 2.4 [8] Let F be a class of real-valued functions closed under addition of constants with fat-shattering dimension bounded by Fat_F(γ), which is continuous from the right. With probability at least 1 - δ over the choice of a random m-sample (x_i, y_i) drawn according to P the following holds. Suppose that for some f ∈ F, η > 0,

1. y_i f(x_i) ≥ -η + 2γ for all (x_i, y_i) in the sample,
2. n = |{i: y_i f(x_i) ≥ η + 2γ}|,
3. n > 3 √( 2m (2d ln(288m) log₂(12em) + ln(32m²/δ)) ),

where d = Fat_F(γ/6). Then the probability that a new example with margin η is misclassified is bounded by

(3/n) (2d log₂(288m) log₂(12em) + log₂(32m²/δ)).
3 Unequal Loss Functions 
We consider the situation where the loss associated with an example is different 
for misclassification of positive and negative examples. Let L_h(x, y) be the loss associated with the classification function h on example (x, y). For the analysis considered above the loss function is taken to be L_h(x, y) = |h(x) - y|, that is, 1 if the point x is misclassified and 0 otherwise. This is also known as the discrete loss.
In this paper we consider a different loss function for classification functions. 
Definition 3.1 The loss function L_β is defined as L_β(x, y) = β if y = 1 and h(x) ≠ y, L_β(x, y) = 1 if y = -1 and h(x) ≠ y, and L_β(x, y) = 0 otherwise.
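As a concrete reading of Definition 3.1, a minimal sketch of the unequal loss for labels in {-1, +1} (the function name is ours):

```python
def loss_beta(h_x, y, beta):
    """Unequal loss of Definition 3.1: a misclassified positive example costs
    beta, a misclassified negative example costs 1, and a correct
    classification costs 0."""
    if h_x == y:
        return 0.0
    return beta if y == 1 else 1.0
```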
We first consider the classical approach of minimizing the empirical loss, that is, the loss on the training set. Since the loss function is no longer binary, the standard theoretical results that can be applied are much weaker than for the binary case. The algorithmic implications will, however, be investigated under the assumption that we are using a hyperplane parallel to the maximal margin hyperplane. The empirical risk is given by ER(h) = (1/m) Σ_{i=1}^{m} L_β(x_i, y_i) for the training set {(x_i, y_i): i = 1, ..., m}.
Assuming that the training set can be correctly classified by the hypothesis class, this criterion will not be able to distinguish between consistent hypotheses, hence giving no reason to depart from the standard maximal margin choice. However,
there is a natural way to introduce the different losses into the maximal margin 
quadratic programming procedure [1]. Here, the constraints given are specified as 
yi({w. Zi) q- O) _ 1, i -- 1, 2,..., rn. In order to force the hyperplane away from the 
positive points which will incur greater loss, a natural heuristic is to set Yi = -1 for 
negative examples and yi - 1//3 for positive points, hence making them further from 
the decision boundary. In the case where consistent classification is possible, the 
effect of this will be to move the hyperplane parallel to itself so that the margin on 
the positive side is β times that on the negative side. Hence, to solve the problem we simply use the standard maximal margin algorithm [1] and then replace the threshold θ with

b = (1/(1 + β)) [(w · x+) + β (w · x-)],    (1)

where x+ (x-) is one of the closest positive (negative) points.
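A minimal sketch of the threshold of Eqn (1), assuming the weight vector w and the closest positive and negative points are available from a maximal margin solution (the names are illustrative):

```python
import numpy as np

def shifted_threshold(w, x_pos, x_neg, beta):
    """Threshold b of Eqn (1): placed between the closest positive and
    negative points so that the margin on the positive side is beta times
    the margin on the negative side."""
    return (np.dot(w, x_pos) + beta * np.dot(w, x_neg)) / (1.0 + beta)
```

With beta = 1 this recovers the usual midpoint between the two closest points.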
The alternative approach we wish to employ is to consider other movements of the hyperplane parallel to itself while retaining consistency. Let γ₀ be the margin of the maximal margin hyperplane. We consider a consistent hyperplane h_η with margin γ₀ + η to the positive examples and γ₀ - η to the negative examples. The basic analytic tool is Theorem 2.4, which will be applied once for the positive examples and once for the negative examples (note that classifications are in the set {-1, 1}).
Theorem 3.2 Let h₀ be the maximal margin hyperplane with margin γ₀, and let h_η be as above with η < γ₀. Set γ+ = (γ₀ + η)/2 and γ- = (γ₀ - η)/2. With probability at least 1 - δ over the choice of a random m-sample (x_i, y_i) drawn according to P the following holds. Suppose that for h₀

1. n₀ = |{i: y_i h₀(x_i) ≥ 2η + γ₀}|,
2. n₀ ≥ 3 √( 2m (d ln(288m) log₂(12em) + ln(8/δ)) ),

and let d+ = Fat(γ+/6) and d- = Fat(γ-/6). Then we can bound the expected loss by

(3/n₀) (2 max(βd+, d-) log₂(288m) log₂(12em) + β log₂(32m²/δ)).
Proof: Using Theorem 2.4 we can bound the probability of error given that the correct classification is positive in terms of the expression with fat-shattering dimension d+ and n = n₀, while for a negative example we can bound the probability of error in terms of the expression with fat-shattering dimension d- and n = n₀. Hence, the expected loss can be bounded by the maximum of the second bound and the first bound multiplied by β, with a factor β in front of the second log term. □
The bound obtained suggests a way of optimising the choice of η, namely to minimise the expression using the fat-shattering dimension of linear functions [9], i.e. to balance the terms β/(γ₀ + η)² and 1/(γ₀ - η)². Solving for η in terms of γ₀ and β gives

η = γ₀ (√β - 1)/(√β + 1).    (2)

This choice of η does not in general agree with that suggested by the choice of the threshold b in the previous section. In a later section we report on initial experiments investigating the performance of these different choices.
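The shift of Eqn (2) is equally direct to compute; a sketch under the same linear-function assumption (illustrative names):

```python
import numpy as np

def margin_shift(gamma0, beta):
    """Shift eta of Eqn (2): balances the terms beta/(gamma0 + eta)^2 and
    1/(gamma0 - eta)^2 arising from the fat-shattering bound."""
    s = np.sqrt(beta)
    return gamma0 * (s - 1.0) / (s + 1.0)
```

For β = 1 the shift is zero, recovering the standard maximal margin hyperplane; as β grows the hyperplane moves towards the negative class, but never by more than γ₀.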
4 The ThetaBoost Algorithm 
The above idea for adjusting the margin in the case of an unequal loss function can
also be applied to the AdaBoost algorithm [2] which has been shown to maximise 
the margin on the training examples and hence the generalization can be bounded 
in terms of the margin and the fat-shattering dimension of the functions that can 
be produced by the algorithm [6]. We will first develop a boosting algorithm for 
unequal loss functions and then extend it for adjustable margin. More specifically, 
assume: (i) a set of training examples (x₁, y₁), ..., (x_m, y_m), where x_i ∈ X and y_i ∈ Y = {-1, +1}; (ii) a weak learner that outputs hypotheses h: X → {-1, +1}; and (iii) the unequal loss function L_β(·) of Definition 3.1.
We assign initial weight D₁(i) = w+ to the n+ positive examples and D₁(i) = w- to the n- negative examples, where w+ n+ + w- n- = 1. The values can be set so that w+/w- = β or they can be adjusted using a validation set. The generalization of AdaBoost to the case of an unequal loss function is given as the AdaUBoost algorithm in Figure 1. We adapt Theorem 1 in [7] for this algorithm.
Theorem 4.1 Assuming the notation and algorithm of Figure 1, the following bound holds on the training error of H:

w+ |{i: H(x_i) ≠ y_i, y_i = 1}| + w- |{i: H(x_i) ≠ y_i, y_i = -1}| ≤ ∏_{t=1}^{T} Z_t.    (3)
The choice of w+ and w- will force uneven probabilities of misclassification on the training set, but to ensure that the weak learners concentrate on misclassified positive examples we define Z (suppressing the subscript t) as

Z = Σ_i D(i) exp(-α β_i y_i h(x_i)).    (4)
Thus, to minimize training error we should seek to minimize Z with respect to α (the voting coefficient) on each iteration of boosting. Following [7], we introduce the notation W++, W+-, W-+ and W--, where for s₁, s₂ ∈ {-1, +1}

W_{s₁ s₂} = Σ_{i: y_i = s₁, h(x_i) = s₂} D(i).    (5)
By equating to zero the first derivative of (4) with respect to α, Z'(α), and using (5) we have

-exp(-α/β) W++/β + exp(α/β) W+-/β + exp(α) W-+ - exp(-α) W-- = 0.

Letting Y = exp(α) we get a polynomial in Y:

C₁ Y^(1-1/β) + C₂ Y^(1+1/β) + C₃ Y² + C₄ = 0,    (6)

where C₁ = -W++/β, C₂ = W+-/β, C₃ = W-+ and C₄ = -W--.
The root of this polynomial can be found numerically. Since Z''(α) > 0, Z'(α) can have at most one zero and this gives the unique minimum of Z(α). The solution for α from (6) is used (as α_t) when taking the distance of a training example from the standard threshold on each iteration of the AdaUBoost algorithm in Figure 1, as well as when combining the weak learners in H(x).
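Since Z is convex in α, Z'(α) is increasing and the root of (6) can be bracketed and located by bisection. The following is a minimal sketch of this step, assuming the W quantities of Eqn (5) have already been accumulated under the current distribution D; the function and argument names are ours, not from the original algorithm.

```python
import numpy as np

def solve_alpha(W_pp, W_pm, W_mp, W_mm, beta, lo=-20.0, hi=20.0, tol=1e-10):
    """Find the voting coefficient alpha that minimizes
    Z(alpha) = W_pp*exp(-alpha/beta) + W_pm*exp(alpha/beta)
             + W_mp*exp(alpha) + W_mm*exp(-alpha)
    by bisection on the increasing derivative Z'(alpha).
    W_pp/W_pm: weight of positive examples classified +1 / -1;
    W_mp/W_mm: weight of negative examples classified +1 / -1."""
    def dZ(a):
        return (-W_pp / beta * np.exp(-a / beta)
                + W_pm / beta * np.exp(a / beta)
                + W_mp * np.exp(a)
                - W_mm * np.exp(-a))
    # If the weak learner makes no weighted mistakes the minimizer runs off
    # to +infinity; the bracket [lo, hi] then simply caps alpha at hi.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dZ(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```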
The ThetaBoost algorithm searches for a positive and a negative support vector (SV) point such that the hyperplane separating them has the largest margin. Once these SV points are found we can then apply formulas (1) and (2) of Section 3 to compute values for adjusting the threshold. See Figure 1 for the complete algorithm.
Algorithm AdaUBoost(X, Y, β)
1. Initialize D₁(i) as described above.
2. For t = 1, ..., T:
   - train weak learner using distribution D_t;
   - get weak hypothesis h_t;
   - choose α_t ∈ ℝ;
   - update: D_{t+1}(i) = D_t(i) exp[-α_t β_i y_i h_t(x_i)] / Z_t,
     where β_i = 1/β if y_i = 1 and 1 otherwise, and Z_t is a normalization
     factor such that Σ_i D_{t+1}(i) = 1.
3. Output the final hypothesis: H(x) = sgn(Σ_{t=1}^{T} α_t h_t(x)).

Algorithm ThetaBoost(X, Y, β, δ_M)
1. H(x) = AdaUBoost(X, Y, β);
2. Remove from the training dataset the false positive and borderline points;
3. Find the smallest H(x+) and mark this as the SV+; and remove any negative points with value greater than H(SV+);
4. Find the first negative point that is next in ranking to the SV+ and mark this as SV-; and compute the margin as the sum of the distances, d+ and d-, of SV+ and SV- from the standard threshold;
5. Check for candidate SV-'s that are near to the current one and change the margin by at least δ_M;
6. Use SV+ and SV- to compute the theta threshold from Eqns (1) and (2);
7. Output the final hypothesis: H(x) = sgn(Σ_{t=1}^{T} α_t h_t(x) - θ).

Figure 1: The AdaUBoost and ThetaBoost algorithms.
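For concreteness, the following is a compact Python sketch of the AdaUBoost loop of Figure 1, reusing solve_alpha from the sketch above. The weak_learner argument is our assumption: any routine that takes (X, y, D) and returns a hypothesis h with h(X) in {-1, +1} will do, and the variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def confusion_weights(D, y, pred):
    """W_{s1 s2} of Eqn (5): total weight of examples with label s1 and
    weak-hypothesis output s2."""
    return (D[(y == 1) & (pred == 1)].sum(),    # W++
            D[(y == 1) & (pred == -1)].sum(),   # W+-
            D[(y == -1) & (pred == 1)].sum(),   # W-+
            D[(y == -1) & (pred == -1)].sum())  # W--

def adauboost(X, y, beta, weak_learner, T=50):
    """Sketch of AdaUBoost (Figure 1) for labels y in {-1, +1}."""
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    # Initial weights with w+/w- = beta and w+ * n+ + w- * n- = 1.
    w_minus = 1.0 / (beta * n_pos + n_neg)
    D = np.where(y == 1, beta * w_minus, w_minus)
    beta_i = np.where(y == 1, 1.0 / beta, 1.0)   # beta_i = 1/beta for positives
    hyps, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)                # train on distribution D_t
        pred = h(X)
        Wpp, Wpm, Wmp, Wmm = confusion_weights(D, y, pred)
        alpha = solve_alpha(Wpp, Wpm, Wmp, Wmm, beta)   # root of Eqn (6)
        D = D * np.exp(-alpha * beta_i * y * pred)
        D = D / D.sum()                          # divide by normalization Z_t
        hyps.append(h)
        alphas.append(alpha)
    def H(Xnew, theta=0.0):
        # theta = 0 gives AdaUBoost; ThetaBoost replaces it with the
        # adjusted threshold of Eqns (1) and (2).
        return np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hyps)) - theta)
    return H
```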
5 Experiments 
The purpose of the experiments reported in this section is two-fold: 
(i) to compare the generalization performance of AdaUBoost against that of standard AdaBoost on imbalanced datasets;
(ii) to examine the two formulas for choosing the threshold in ThetaBoost and 
evaluate their effect on generalization performance. 
For the evaluations in (i) and (ii) we use two performance measures: the average loss L_β and the geometric mean of accuracy (g-mean) [4]. The latter is defined as g = √(precision × recall), where

precision = (# positives correct) / (# positives predicted);  recall = (# positives correct) / (# true positives).
The g-mean has recently been proposed as a performance measure that, in contrast 
to accuracy, can capture the "specificity" trade-off between false positives and true 
positives in imbalanced datasets [4]. It is also independent of the distribution of 
examples between classes. 
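Since the comparison below relies on this measure, a small sketch of the g-mean as defined above may be useful; labels and predictions are assumed to be in {-1, +1} and the function name is ours.

```python
def g_mean(y_true, y_pred):
    """Geometric mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    if predicted_pos == 0 or actual_pos == 0:
        return 0.0
    return ((tp / predicted_pos) * (tp / actual_pos)) ** 0.5
```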
For our initial experiments we used the satimage dataset from the UCI repository [5] and used a uniform D₁. The dataset is about classifying neighborhoods of pixels in a satellite image. It has 36 continuous attributes and 6 classes. We picked class 4 as the goal class since it is the least prevalent one (9.73% of the dataset). The dataset comes as a training set (4435 examples) and a test set (2000 examples).
Table 1 shows the performance on the test set of AdaUBoost, AdaBoost and C4.5 
for different values of the β parameter. It should be pointed out that the latter two algorithms minimize the total error assuming an equal loss function (β = 1). In the case of equal loss AdaUBoost simply reduces to AdaBoost. As observed from the table, the higher the loss parameter, the bigger the improvement of AdaUBoost over the other two algorithms. This is particularly apparent in the values of g-mean.
β        AdaUBoost            AdaBoost             C4.5
values   avgLoss   g-mean     avgLoss   g-mean     avgLoss   g-mean
1        0.0545    0.773      0.0545    0.773      0.0885    0.724
2        0.0895    0.865      0.0831    0.773      0.136     0.724
4        0.13      0.889      0.1662    0.773      0.231     0.724
8        0.1785    0.898      0.3324    0.773      0.421     0.724
16       0.267     0.89       0.664     0.773      0.801     0.724

Table 1: Generalization performance on the SatImage dataset.
Figure 2 shows the generalization performance of ThetaBoost in terms of average loss (β = 2) for different values of the threshold θ. The latter ranges from the largest margin of negative examples, which corresponds to SV-, to the smallest margin of positive examples, which corresponds to SV+. This range includes the values of b and η given by formulas (1) and (2). In this experiment δ_M was set to 0.2. As depicted in the figure, the margin defined by b achieves better generalization performance than the margin defined by η. In particular, b is closer to the value of θ that gives the minimum loss on this test set. In addition, ThetaBoost with b performs better than AdaUBoost on this test set. We should emphasise, however, that the differences are not significant and that more extensive experiments are required before the two approaches can be ranked reliably.
Figure 2: Average loss L_β (β = 2) on the test set as a function of the threshold θ.
6 Discussion 
In the above we built a theoretical framework for optimally setting the margin given an unequal loss function. By applying this framework to boosting we developed AdaUBoost and ThetaBoost, which generalize AdaBoost, a well-known boosting algorithm, to take account of unequal loss functions and to adjust the margin in imbalanced datasets. Initial experiments have shown that both these factors improve the generalization performance of the boosted classifier.
References 
[1] Corinna Cortes and Vladimir Vapnik, Machine Learning, 20, 273-297, 1995. 
[2] Yoav Freund and Robert Schapire, pages 148-156 in Proceedings of the International Conference on Machine Learning, ICML'96, 1996.
[3] Michael J. Kearns and Robert E. Schapire, pages 382-391 in Proceedings of the 31st Symposium on the Foundations of Computer Science, FOCS'90, 1990.
[4] Kubat, M., Holte, R. and Matwin, S., Machine Learning, 30, 195-215, 1998. 
[5] Merz, C.J. and Murphy, P.M. (1997). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[6] R. Schapire, Y. Freund, P. Bartlett and W. Sun Lee, pages 322-330 in Proceedings of the International Conference on Machine Learning, ICML'97, 1997.
[7] Robert Schapire and Yoram Singer, in Proceedings of the Eleventh Annual 
Conference on Computational Learning Theory, COLT'98, 1998. 
[8] John Shawe-Taylor, Algorithmica, 22, 157-172, 1998. 
[9] John Shawe-Taylor, Peter Bartlett, Robert Williamson and Martin Anthony, 
IEEE Trans. Inf. Theory, 44 (5) 1926-1940, 1998. 
