An Incremental Nearest Neighbor 
Algorithm with Queries 
Joel Ratsaby*
N.A.P. Inc. 
Hollis, New York 
Abstract 
We consider the general problem of learning multi-category classifi- 
cation from labeled examples. We present experimental results for 
a nearest neighbor algorithm which actively selects samples from 
different pattern classes according to a querying rule instead of the 
a priori class probabilities. The amount of improvement of this 
query-based approach over the passive batch approach depends on 
the complexity of the Bayes rule. The principle on which this al- 
gorithm is based is general enough to be used in any learning algo- 
rithm which permits a model-selection criterion and for which the 
error rate of the classifier is calculable in terms of the complexity 
of the model. 
1 INTRODUCTION 
We consider the general problem of learning multi-category classification from
labeled examples. In many practical learning settings the time or sample size available
for training is limited. This may have adverse effects on the accuracy of the resulting
classifier. For instance, in learning to recognize handwritten characters a typical
time limitation confines the training sample size to the order of a few hundred
examples. It is important to make learning more efficient by obtaining only training
data which contains significant information about the separability of the pattern
classes, thereby letting the learning algorithm participate actively in the sampling
process. Querying for the class labels of specifically selected examples in the input
space may lead to significant improvements in the generalization error (cf. Cohn,
Atlas & Ladner, 1994; Cohn, 1996). However, in learning pattern recognition this
is not always useful or possible. In the handwriting recognition problem, the
computer could ask the user for labels of selected patterns generated by the computer;
*The author's coordinates are: Address: Hamered St. #2, Ra'anana, ISRAEL. Email:
jer@ee.technion.ac.il
however, the labels of such patterns are not necessarily representative of the user's
handwriting style but rather of his reading-recognition ability. On the other hand, it is
possible to let the computer (learner) select particular pattern classes, not necessarily
according to their a priori probabilities, and then obtain randomly drawn patterns 
according to the underlying unknown class-conditional probability distribution. We 
refer to such selective sampling as sample querying. Recent theory (cf. Ratsaby, 
1997) indicates that such freedom to select different classes at any time during the 
training stage is beneficial to the accuracy of the classifier learnt. In the current 
paper we report on experimental results for an incremental algorithm which utilizes 
this sample-querying procedure. 
2 THEORETICAL BACKGROUND 
We use the following setting. Given are M distinct pattern classes, each with a class-
conditional probability density f_i(x), 1 ≤ i ≤ M, x ∈ ℝ^d, and a priori probabilities
p_i, 1 ≤ i ≤ M. The densities f_i(x), 1 ≤ i ≤ M, are assumed to be unknown, while
the p_i are assumed to be known or easily estimable, as is the case in learning character
recognition. For a sample-size vector m = [m_1, ..., m_M] with Σ_{i=1}^M m_i = m, denote
by ζ^m = {(x_j, y_j)}_{j=1}^m a sample of labeled examples consisting of m_i examples from
pattern class i, where the labels y_j, 1 ≤ j ≤ m, are chosen not necessarily at random from
{1, 2, ..., M}, and the corresponding x_j are drawn at random i.i.d. according to the
class-conditional probability density f_{y_j}(x). The expected misclassification error of
a classifier c is referred to as the loss of c and is denoted by L(c). It is defined as the
probability of misclassifying a randomly drawn x with respect to the underlying
mixture probability density function f(x) = Σ_{i=1}^M p_i f_i(x). The loss is commonly
represented as L(c) = E 1{c(x) ≠ y}, where 1{· ∈ A} is the indicator function of a set
A, expectation is taken with respect to the joint probability distribution f_y(x) p(y),
where p(y) is a discrete probability distribution taking value p_i at i, 1 ≤ i ≤ M,
and y denotes the label of the class whose density f_y(x) was used to draw x.
The loss L(c) may also be written as L(c) = Σ_{i=1}^M p_i E_i 1{c(x) ≠ i}, where E_i denotes
expectation with respect to f_i(x). The pattern recognition problem is to learn, based
on ζ^m, the optimal classifier, also known as the Bayes classifier, which by definition
has minimum loss, which we denote by L*.
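This sampling setting can be sketched in code. The snippet below assumes, purely for illustration, Gaussian class-conditional densities in ℝ²; in the learning problem itself the f_i are unknown, and only the per-class sample sizes m_i are under the learner's control:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_sample(m_vec, means):
    """Draw m_vec[i] i.i.d. examples from class i's (here Gaussian) density."""
    xs, labels = [], []
    for i, (m_i, mu) in enumerate(zip(m_vec, means)):
        xs.append(rng.normal(loc=mu, scale=1.0, size=(m_i, 2)))
        labels.extend([i] * m_i)
    return np.vstack(xs), np.array(labels)

# sample-size vector m = [5, 3]: five examples from class 0, three from class 1
X, y = draw_sample([5, 3], means=[np.array([0.0, 0.0]), np.array([3.0, 3.0])])
```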
A multi-category classifier c is represented as a vector c(x) = [c_1(x), ..., c_M(x)]
of boolean classifiers, where c_i(x) = 1 if c(x) = i, and c_i(x) = 0 otherwise, 1 ≤
i ≤ M. The loss L(c) of a multi-category classifier c may then be expressed as
the p_i-weighted average of the losses of its component classifiers, i.e., L(c) = Σ_{i=1}^M p_i L(c_i),
where for a boolean classifier c_i the loss is defined as L(c_i) = E_i 1{c_i(x) ≠ 1}. As
an estimate of L(c) we define the empirical loss L_m(c) = Σ_{i=1}^M p_i L_{m_i}(c_i), where
L_{m_i}(c_i) = (1/m_i) Σ_{j : y_j = i} 1{c(x_j) ≠ i}, which may also be expressed as
L_{m_i}(c_i) = (1/m_i) Σ_{j : y_j = i} 1{c_i(x_j) ≠ 1}.
The family of all classifiers is assumed to be decomposed into a multi-structure
S = S_1 × S_2 × ... × S_M, where S_i is a nested structure (cf. Vapnik, 1982) of
boolean families B^i_{j_i}, j_i = 1, 2, ..., for 1 ≤ i ≤ M, i.e., S_1 = B^1_1, B^1_2, ..., B^1_{j_1}, ...,
S_2 = B^2_1, B^2_2, ..., B^2_{j_2}, ..., up to S_M = B^M_1, B^M_2, ..., B^M_{j_M}, ..., where k^i_{j_i}
denotes the VC-dimension of B^i_{j_i} and B^i_{j_i} ⊆ B^i_{j_i + 1}, 1 ≤ i ≤ M. For any fixed
positive-integer vector j ∈ Z_+^M consider the class of vector classifiers H(j) =
B^1_{j_1} × B^2_{j_2} × ... × B^M_{j_M} ≡ H, where we take the liberty of dropping the multi-
index j and writing k instead of k(j). Define by G the subfamily of H consisting
of classifiers c that are well-defined, i.e., ones whose components c_i, 1 ≤ i ≤ M,
satisfy ∪_{i=1}^M {x : c_i(x) = 1} = ℝ^d and {x : c_i(x) = 1} ∩ {x : c_j(x) = 1} = ∅, for
1 ≤ i ≠ j ≤ M.
From the Vapnik-Chervonenkis theory (cf. Vapnik, 1982; Devroye, Gyorfi & Lu-
gosi, 1996) it follows that the loss of any boolean classifier c_i ∈ B^i_{j_i} is, with
high confidence, related to its empirical loss as L(c_i) ≤ L_{m_i}(c_i) + ε(m_i, k^i_{j_i}), where
ε(m_i, k^i_{j_i}) = const √(k^i_{j_i} ln m_i / m_i), 1 ≤ i ≤ M; henceforth we denote by
const any constant which does not depend on the relevant variables in the expres-
sion. Let m = [m_1, ..., m_M] and k ≡ k(j) = [k^1_{j_1}, ..., k^M_{j_M}].
Define ε(m, k) = Σ_{i=1}^M p_i ε(m_i, k^i_{j_i}). It follows that the deviation between the em-
pirical loss and the loss is bounded uniformly over all multi-category classifiers in
a class G by ε(m, k). We henceforth denote by c* the optimal classifier in G, i.e.,
c* = argmin_{c ∈ G} L(c), and by ĉ = argmin_{c ∈ G} L_m(c) the empirical loss minimizer
over the class G.
The above implies that the classifier ĉ has a loss which is no more than L(c*) +
ε(m, k). Denote by k* the minimal complexity of a class G which contains the
Bayes classifier. We refer to it as the Bayes complexity and henceforth assume
k*_i < ∞, 1 ≤ i ≤ M. If k* were known, then based on a sample of size m with a
sample-size vector m = [m_1, ..., m_M] a classifier ĉ whose loss is bounded from
above by L* + ε(m, k*) may be determined, where L* = L(c*) is the Bayes loss.
This bound is minimal with respect to k by definition of k*, and we refer to it as the
minimal criterion. It can be further minimized by selecting a sample-size vector
m* = argmin ε(m, k*) over all vectors m with Σ_{i=1}^M m_i = m. This basically says that more examples
should be queried from pattern classes which require more complex discriminating
rules within the Bayes classifier. Thus sample-querying via minimization of the
minimal criterion makes learning more efficient through tuning the subsample sizes
to the complexity of the Bayes classifier. However, the Bayes classifier depends
on the underlying probability distributions, which in most interesting scenarios are
unknown; thus k* should be assumed unknown. In (Ratsaby, 1997) an incremental
learning algorithm, based on Vapnik's structural risk minimization, generates a
random complexity sequence k̂(n), corresponding to a sequence of empirical loss
minimizers ĉ_{k̂(n)} over G_{k̂(n)}, which converges to k* with increasing time n for learning
problems with a zero Bayes loss. Based on this, a sample-query rule which achieves
the same minimization is defined without the need to know k*. We briefly describe
the main ideas next.
At any time n, the criterion function is ε(·, k̂(n)) and is defined over the m-domain
Z_+^M. A gradient-descent step of a fixed size is taken to minimize the current cri-
terion. After a step is taken, a new sample-size vector m(n + 1) is obtained and
the difference m(n + 1) − m(n) dictates the sample query at time n, namely, the
increment in subsample size for each of the M pattern classes. With increasing n
the vector sequence m(n) gets closer to an optimal path, defined as the set which
is comprised of the solutions to the minimization of ε(m, k*) under all different
constraints of the form Σ_i m_i = m, where m runs over the positive integers. Thus for
all large n the sample-size vector m(n) is optimal in that it minimizes the minimal
criterion ε(·, k*) for the current total sample size. This constitutes the sample-
querying procedure of the learning algorithm. The remaining part does empirical
loss minimization over the current class G_{k̂(n)} and outputs ĉ_{k̂(n)}. Since by assumption
the Bayes classifier is contained in G_{k*}, it follows that for all large n the loss satisfies
L(ĉ_{k̂(n)}) ≤ L* + min ε(m', k*) over all vectors m' with Σ_i m'_i equal to the current
total sample size, which is basically the minimal criterion mentioned above. Thus
the algorithm produces a classifier ĉ_{k̂(n)} with a
minimal loss even when the Bayes complexity k* is unknown.
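The effect of minimizing the criterion over the sample-size vector can be illustrated with a small sketch. It drops the constant in ε, uses ε(m, k) = Σ_i p_i √(k_i ln m_i / m_i) directly, and grows the total sample one example at a time along the greedy descent direction; the class with the larger complexity k_i ends up with the larger subsample:

```python
import math

def eps(m, k, p):
    """Criterion eps(m, k) = sum_i p_i * sqrt(k_i * ln(m_i) / m_i), const dropped."""
    return sum(pi * math.sqrt(ki * math.log(mi) / mi)
               for pi, ki, mi in zip(p, k, m))

def optimal_path(k, p, m_total, m0=3):
    """Greedily allocate one example at a time so as to minimize eps."""
    m = [m0] * len(k)
    while sum(m) < m_total:
        j = min(range(len(k)),
                key=lambda j: eps(m[:j] + [m[j] + 1] + m[j + 1:], k, p))
        m[j] += 1
    return m

# two equiprobable classes; class 0 needs a far more complex boundary (k = 40 vs 5)
m = optimal_path(k=[40, 5], p=[0.5, 0.5], m_total=100)
```

The trace of m over increasing total sample size is a discrete analogue of the optimal path described above.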
In the next section we consider specific model classes consisting of nearest-neighbor 
classifiers on which we implement this incremental learning approach. 
3 INCREMENTAL NEAREST-NEIGHBOR ALGORITHM
Fix and Hodges (cf. Silverman & Jones, 1989) introduced the simple but powerful
nearest-neighbor classifier which, based on a labeled training sample {(x_i, y_i)}_{i=1}^m,
x_i ∈ ℝ^d, y_i ∈ {1, 2, ..., M}, when given a pattern x, outputs the label y_i corre-
sponding to the example whose x_i is closest to x. Every example in the training
sample is used for this decision (we refer to such an example as a prototype); thus
the empirical loss is zero. The condensed nearest-neighbor algorithm (Hart, 1968)
and the reduced nearest-neighbor algorithm (Gates, 1972) are procedures which
aim at reducing the number of prototypes while maintaining a zero empirical loss.
Thus, given a training sample of size m, after running either of these procedures a
nearest-neighbor classifier having a zero empirical loss is generated based on m' ≤ m
prototypes. Learning in this manner may be viewed as a form of empirical loss
minimization with a complexity-regularization component which puts a penalty
proportional to the number of prototypes.
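The nearest-neighbor rule itself is short; the following minimal sketch classifies a pattern by the label of its closest prototype (Euclidean distance is assumed):

```python
import numpy as np

def nn_classify(x, proto_X, proto_y):
    """Return the label of the prototype closest to x (1-NN rule)."""
    dists = np.sum((proto_X - np.asarray(x)) ** 2, axis=1)  # squared distances
    return int(proto_y[int(np.argmin(dists))])
```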
A cell boundary ∂_{i,j} of the Voronoi diagram (cf. Preparata & Shamos, 1985)
corresponding to a multi-category nearest-neighbor classifier c is defined as the
(d − 1)-dimensional perpendicular-bisector hyperplane of the line connecting the
x-components of two prototypes x_i and x_j. For a fixed l ∈ {1, ..., M}, the collection
of Voronoi cell-boundaries based on pairs of prototypes of the form (x_i, l), (x_j, q),
where q ≠ l, forms the boundary which separates the decision region labeled l from
its complement and represents the boolean nearest-neighbor classifier c_l. Denote
by k_l the number of such cell boundaries and denote by s_l the number of proto-
types from a total of m_l examples from pattern class l. The value of k_l may be
calculated directly from knowledge of the s_l prototypes, 1 ≤ l ≤ M, using
various algorithms. The boolean classifier c_l is an element of an infinite class of
boolean classifiers based on partitions of ℝ^d by arrangements of k_l hyperplanes of
dimensionality d − 1, where each cell of a partition is labeled either 0 or 1.
It follows (cf. Devroye et al., 1996) that the loss of a multi-category nearest-
neighbor classifier c which consists of s_l prototypes out of m_l examples, 1 ≤ l ≤ M,
is bounded as L(c) ≤ L_m(c) + ε(m, k), where the a priori probabilities are taken as
known, m = [m_1, ..., m_M], k = [k_1, ..., k_M] and ε(m, k) = Σ_l p_l ε(m_l, k_l), where
ε(m_l, k_l) = const √(((d + 1) k_l ln m_l + ln(e k_l / d)) / m_l). Letting k* denote the Bayes
complexity, ε(·, k*) represents the minimal criterion.
The next algorithm uses the Condense and Reduce procedures in order to generate a
sequence of classifiers ĉ_{k̂(n)} with a complexity vector k̂(n) which tends to k* as n →
∞. A sample-querying procedure referred to as Greedy Query (GQ) chooses at any
time n to increment the single subsample of the pattern class j*(n) for which m_{j*(n)} is the
direction of maximum descent of the criterion ε(·, k̂(n)) at the current sample-size
vector m(n). For the part of the algorithm which utilizes a Delaunay-triangulation
procedure we use the fast Fortune's algorithm (cf. O'Rourke, 1994), which can be used
only for dimensionality d = 2. Since we are interested only in counting Voronoi
borders between adjacent Voronoi cells, an efficient computation is also possible
for dimensions d > 2 by resorting to linear programming for computing the
adjacencies of facets of a polyhedron, cf. Fukuda (1997).
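The boundary count can also be obtained from Delaunay adjacency: in general position, two Voronoi cells share a (d − 1)-dimensional boundary exactly when their sites are joined by a Delaunay edge. The sketch below (an illustration using `scipy.spatial.Delaunay`, not the Fortune/LP machinery cited above) counts, for each class l, the Delaunay edges joining a prototype of class l to a prototype of a different class:

```python
import numpy as np
from scipy.spatial import Delaunay

def complexity_vector(proto_X, proto_y, num_classes):
    """k_l = number of Voronoi boundaries between class l and other classes."""
    tri = Delaunay(proto_X)
    edges = set()
    for simplex in tri.simplices:          # collect unique Delaunay edges
        for a in simplex:
            for b in simplex:
                if a < b:
                    edges.add((int(a), int(b)))
    k = [0] * num_classes
    for a, b in edges:                     # count only cross-class edges
        if proto_y[a] != proto_y[b]:
            k[proto_y[a]] += 1
            k[proto_y[b]] += 1
    return k
```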
Incremental Nearest Neighbor (INN) Algorithm
Initialization: (Time n = 0)
Let the increment size Δ be a fixed small positive integer. Start with m(0) =
[c, ..., c], where c is a small positive integer. Draw ζ^{m(0)} = {ζ^{m_j(0)}}_{j=1}^M, where
ζ^{m_j(0)} consists of m_j(0) randomly drawn i.i.d. examples from pattern class j.
While (number of available examples ≥ Δ) Do:
1. Call Procedure CR: k̂(n) = CR(ζ^{m(n)}).
2. Call Procedure GQ: m(n + 1) = GQ(n).
3. n := n + 1.
End While
// Used up all examples.
Output: NN-classifier ĉ_{k̂(n)}.
Procedure Condense-Reduce (CR)
Input: Sample ζ^{m(n)} stored in an array A[·] of size m(n).
Initialize: Make only the first example A[1] be a prototype.
// Condense
Do:
  ChangeOccurred := FALSE.
  For i = 1, ..., m(n):
  • Classify A[i] based on the available prototypes using the NN-rule.
  • If not correct then
    - Let A[i] be a prototype.
    - ChangeOccurred := TRUE.
  • End If
  End For
While (ChangeOccurred).
// Reduce
Do:
  ChangeOccurred := FALSE.
  For i = 1, ..., m(n):
  • If A[i] is a prototype then classify it using the remaining prototypes by the
    NN-rule.
  • If correct then
    - Make A[i] not a prototype.
    - ChangeOccurred := TRUE.
  • End If
  End For
While (ChangeOccurred).
Run Delaunay-Triangulation. Let k̂(n) = [k̂_1, ..., k̂_M], where k̂_l denotes the number
of Voronoi-cell boundaries associated with the s_l prototypes of class l.
Return: (NN-classifier with complexity vector k̂(n)).
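The Condense and Reduce passes above can be made runnable as follows. This is a sketch: the Delaunay step is omitted, and the index-based prototype bookkeeping is an implementation choice, not taken from the paper:

```python
import numpy as np

def nn_label(x, P_X, P_y):
    """1-NN label of x among the prototypes (P_X, P_y)."""
    return int(P_y[int(np.argmin(np.sum((P_X - x) ** 2, axis=1)))])

def condense_reduce(X, y):
    """Return indices of a prototype set with zero empirical loss."""
    proto = [0]                        # start with the first example
    changed = True
    while changed:                     # Condense: add misclassified examples
        changed = False
        for i in range(len(X)):
            if nn_label(X[i], X[proto], y[proto]) != y[i]:
                proto.append(i)
                changed = True
    changed = True
    while changed:                     # Reduce: drop redundant prototypes
        changed = False
        for i in list(proto):
            rest = [j for j in proto if j != i]
            if rest and nn_label(X[i], X[rest], y[rest]) == y[i]:
                proto.remove(i)
                changed = True
    return proto
```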
Procedure Greedy-Query (GQ)
Input: Time n.
j*(n) := argmax_{1 ≤ j ≤ M} (−∂ε(m, k̂(n))/∂m_j) evaluated at m = m(n).
Draw: Δ new i.i.d. examples from class j*(n). Denote them by ζ^Δ.
Update Sample: ζ^{m_{j*(n)}(n+1)} := ζ^{m_{j*(n)}(n)} ∪ ζ^Δ, while ζ^{m_i(n+1)} := ζ^{m_i(n)}, for
1 ≤ i ≠ j*(n) ≤ M.
Return: (m(n) + Δ e_{j*(n)}), where e_j is an all-zero vector except for a 1 at the jth element.
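A sketch of Procedure GQ in code, using the nearest-neighbor deviation term from the previous section (as reconstructed, with the constant dropped). Since the subsample sizes are integers, the steepest-descent direction is approximated here by a finite difference of size Δ; `draw_from_class` stands for the sampling oracle and is an assumption of this sketch:

```python
import math

def eps(m, k, p, d=2):
    """eps(m, k) = sum_l p_l * sqrt(((d+1)*k_l*ln(m_l) + ln(e*k_l/d)) / m_l)."""
    return sum(p_l * math.sqrt(((d + 1) * k_l * math.log(m_l)
                                + math.log(math.e * k_l / d)) / m_l)
               for p_l, k_l, m_l in zip(p, k, m))

def greedy_query(m, k, p, delta, draw_from_class):
    """One GQ step: query delta examples from the steepest-descent class."""
    def drop(j):
        m2 = list(m)
        m2[j] += delta
        return eps(m, k, p) - eps(m2, k, p)   # decrease of the criterion
    j_star = max(range(len(m)), key=drop)
    new_examples = draw_from_class(j_star, delta)
    m_new = list(m)
    m_new[j_star] += delta
    return j_star, m_new, new_examples
```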
3.1 EXPERIMENTAL RESULTS 
We ran algorithm INN on several two-dimensional (d = 2) multi-category classifi-
cation problems and compared its generalization error versus total sample size m
with that of batch learning; the latter uses Procedure CR (but not Procedure GQ)
with uniform subsample proportions, i.e., m_i = m/M, 1 ≤ i ≤ M.
We ran three classification problems, each consisting of 4 equiprobable pattern classes
with a zero Bayes loss. The generalization curves represent the average, over 15 inde-
pendent learning runs, of the empirical error on a fixed-size test set. Each run (both
for INN and Batch learning) consists of 80 independent experiments, where each
differs by 10 in the sample size used for training and the maximum sample size is
800. We call an experiment a success if INN results in a lower generalization error
than Batch. Let p be the probability of INN beating Batch. We wish to reject the
hypothesis H_0 that p = 1/2, which says that INN and Batch are approximately equal
in performance. The results are displayed in Figure 1 as a series of pairs, the first
picture showing the pattern classes of the specific problem while the second shows
the learning curves for the two learning algorithms. Algorithm INN outperformed
the simple Batch approach with a reject level of less than 1%; the latter ignores the
inherent Bayes complexity and uses an equal subsample size for each of the pattern
classes. In contrast, the INN algorithm learns, incrementally over time, which of
the classes are harder to separate and queries more from those pattern classes.
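The reject level quoted above follows from a sign test on the success counts. A sketch (the success count 55 used below is a made-up illustration, not a figure from the experiments):

```python
from math import comb

def sign_test_pvalue(successes, n):
    """One-sided P(S >= successes) for S ~ Binomial(n, 1/2), under H0: p = 1/2."""
    return sum(comb(n, t) for t in range(successes, n + 1)) / 2 ** n

# e.g. if INN beat Batch in 55 of 80 experiments, H0 is rejected below the 1% level
p_value = sign_test_pvalue(55, 80)
```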
References 
Cohn D., Atlas L., Ladner R. (1994). Improving Generalization with Active Learn-
ing. Machine Learning, Vol. 15, p. 201-221.
Devroye L., Gyorfi L., Lugosi G. (1996). "A Probabilistic Theory of Pattern Recog-
nition", Springer-Verlag.
Fukuda K. (1997). Frequently Asked Questions in Geometric Computation.
Technical report, Swiss Federal Institute of Technology, Lausanne. Available at
ftp://ftp.ifor.ethz.ch/pub/fukuda/reports.
Gates, G. W. (1972). The Reduced Nearest Neighbor Rule. IEEE Trans. Info.
Theory, p. 431-433.
Hart P. E. (1968). The Condensed Nearest Neighbor Rule. IEEE Trans. on Info.
Theory, Vol. IT-14, No. 3.
O'Rourke J. (1994). "Computational Geometry in C". Cambridge University Press.
Ratsaby, J. (1997). Learning Classification with Sample Queries. Elec-
trical Engineering Dept., Technion, CC PUB #196. Available at URL
http://www.ee.technion.ac.il/jer/iandc.ps.
Rivest R. L., Eisenberg B. (1990). On the sample complexity of pac-learning using
random and chosen examples. Proceedings of the 1990 Workshop on Computational
Learning Theory, p. 154-162, Morgan Kaufmann, San Mateo, CA.
Silverman B. W. and Jones M. C. (1989). E. Fix and J. L. Hodges (1951): An impor-
tant contribution to nonparametric discriminant analysis and density estimation --
commentary on Fix and Hodges (1951). International Statistical Review, 57(3),
p. 233-247.
Vapnik V.N., (1982), "Estimation of Dependences Based on Empirical Data", 
Springer-Verlag, Berlin. 
[Figure 1 appears here: three pairs of plots, each showing the four pattern classes of
a problem and the corresponding learning curves (generalization error versus total
number of examples) for Batch and INN.]
Figure 1. Three different Pattern Classification Problems and Learning 
Curves of the INN-Algorithm compared to Batch Learning. 
