Convergence of a Neural Network Classifier 
John S. Baras 
Systems Research Center 
University of Maryland 
College Park, Maryland 20705 
Anthony LaVigna 
Systems Research Center 
University of Maryland 
College Park, Maryland 20705 
Abstract 
In this paper, we prove that the vectors in the LVQ learning algorithm 
converge. We do this by showing that the learning algorithm performs 
stochastic approximation. Convergence is then obtained by identifying 
the appropriate conditions on the learning rate and on the underlying 
statistics of the classification problem. We also present a modification to 
the learning algorithm which we argue results in convergence of the LVQ 
error to the Bayesian optimal error as the appropriate parameters become 
large. 
1 Introduction
Learning Vector Quantization (LVQ) originated in the neural network community
and was introduced by Kohonen (Kohonen [1986]). Extensive simulation studies
reported in the literature demonstrate the effectiveness of LVQ as a classifier, and
LVQ has generated considerable interest because its training times are significantly
shorter than those of backpropagation networks.
In this paper we analyse the convergence properties of LVQ. Using a theorem from
the stochastic approximation literature, we prove that the update algorithm converges
under suitable conditions. We also present a modification to the algorithm
which provides for more stable learning. Finally, we discuss the decision error
associated with this "modified" LVQ algorithm.
2 A Review of Learning Vector Quantization 
Let $\{(x_i, d_{x_i})\}_{i=1}^{N}$ be the training data or past observation set. This means that $x_i$
is observed when pattern $d_{x_i}$ is in effect. We assume that the $x_i$'s are statistically
independent (this assumption can be relaxed). Let $\theta_j$ be a Voronoi vector and let
$\Theta = \{\theta_1, \ldots, \theta_k\}$ be the set of Voronoi vectors. We assume that there are many more
observations than Voronoi vectors (Duda & Hart [1973]). Once the Voronoi vectors
are initialized, training proceeds by taking a sample $(x_j, d_{x_j})$ from the training set,
finding the closest Voronoi vector and adjusting its value according to equations (1)
and (2). After several passes through the data, the Voronoi vectors converge and
training is complete.
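Suppose $\theta_c$ is the closest vector. Adjust $\theta_c$ as follows:
$$\theta_c(n+1) = \theta_c(n) + \alpha_n\,(x_j - \theta_c(n)) \tag{1}$$
if $d_{\theta_c} = d_{x_j}$ and
$$\theta_c(n+1) = \theta_c(n) - \alpha_n\,(x_j - \theta_c(n)) \tag{2}$$
if $d_{\theta_c} \neq d_{x_j}$. The other Voronoi vectors are not modified.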
This update has the effect that if $x_j$ and $\theta_c$ have the same decision then $\theta_c$ is moved
closer to $x_j$, whereas if they have different decisions then $\theta_c$ is moved away from $x_j$.
The constants $\{\alpha_n\}$ are positive and decreasing, e.g., $\alpha_n = 1/n$. We are concerned
with the convergence properties of $\Theta(n)$ and with the resulting detection error.
For ease of notation, we assume that there are only two pattern classes. The 
equations for the case of more than two pattern classes are given in (LaVigna 
[1989]). 
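As a minimal sketch of the update in equations (1) and (2), the following Python functions perform one LVQ step on observations stored as lists of floats. The names `nearest_index` and `lvq_step` are illustrative and do not come from the original description, and squared Euclidean distance is assumed as the meaning of "closest".

```python
def nearest_index(thetas, x):
    """Return the index of the Voronoi vector closest to x (squared Euclidean distance)."""
    def sqdist(theta):
        return sum((t - xi) ** 2 for t, xi in zip(theta, x))
    return min(range(len(thetas)), key=lambda i: sqdist(thetas[i]))


def lvq_step(thetas, decisions, x, d_x, alpha):
    """Apply one LVQ update in place and return the index of the adjusted vector.

    thetas    : list of Voronoi vectors (each a list of floats)
    decisions : class label attached to each Voronoi vector
    x, d_x    : observation and its true pattern label
    alpha     : current step size alpha_n
    """
    c = nearest_index(thetas, x)
    # Equation (1): attract when the decisions agree; equation (2): repel otherwise.
    sign = 1.0 if decisions[c] == d_x else -1.0
    thetas[c] = [t + sign * alpha * (xi - t) for t, xi in zip(thetas[c], x)]
    return c


thetas = [[0.0], [4.0]]
decisions = [1, 2]
lvq_step(thetas, decisions, [1.0], 1, alpha=0.5)   # same decision: first vector moves toward x
lvq_step(thetas, decisions, [3.0], 1, alpha=0.5)   # different decision: second vector moves away
print(thetas)
```

Only the closest vector changes in each call, which matches the statement that the other Voronoi vectors are not modified.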
3 Convergence of the Learning Algorithm 
The LVQ algorithm has the general form
$$\theta_i(n+1) = \theta_i(n) + \alpha_n\,\gamma(d_{x_n}, d_{\theta_i(n)}, x_n, \Theta_n)\,(x_n - \theta_i(n)) \tag{3}$$
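where $x_n$ is the currently chosen past observation. The function $\gamma$ determines
whether there is an update and what its sign should be, in agreement with (1) and (2), and is given by
$$\gamma(d_{x_n}, d_{\theta_i}, x_n, \Theta_n) = \begin{cases} -1 & \text{if } d_{x_n} \neq d_{\theta_i} \text{ and } x_n \in V_{\theta_i} \\ \phantom{-}1 & \text{if } d_{x_n} = d_{\theta_i} \text{ and } x_n \in V_{\theta_i} \\ \phantom{-}0 & \text{otherwise.} \end{cases} \tag{4}$$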
Here $V_{\theta_i}$ represents the set of points closest to $\theta_i$ and is given by
$$V_{\theta_i} = \{x : \|\theta_i - x\| < \|\theta_j - x\|,\ j \neq i\}, \quad i = 1, \ldots, k. \tag{5}$$
The update in (3) is a stochastic approximation algorithm (Benveniste, Metivier &
Priouret [1987]). It has the form
$$\Theta_{n+1} = \Theta_n + \alpha_n\,H(\Theta_n, z_n) \tag{6}$$
where $\Theta$ is the vector with components $\theta_i$; $H(\Theta, z)$ is the vector with components
defined in the obvious manner from (3) and $z_n = (x_n, d_{x_n})$ is the random pair
consisting of the observation and the associated true pattern number. If the appropriate
conditions are satisfied by $\alpha_n$, $H$, and $z_n$, then $\Theta_n$ approaches the solution
of
$$\frac{d}{dt}\bar{\Theta}(t) = h(\bar{\Theta}(t)) \tag{7}$$
for the appropriate choice of $h(\Theta)$.
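The stochastic approximation form can be sketched in Python for scalar observations as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the sign function is taken as +1 when the decisions agree and -1 when they differ, so that the step reproduces (1) and (2) when combined with the factor (x - theta_i) of (3).

```python
def gamma(d_x, d_theta, x, thetas, i):
    """Sign of the update for vector i: +1 attract, -1 repel, 0 if x is outside its Voronoi cell."""
    dists = [abs(t - x) for t in thetas]
    in_cell = all(dists[i] < dists[j] for j in range(len(thetas)) if j != i)
    if not in_cell:
        return 0
    return 1 if d_x == d_theta else -1


def H(thetas, decisions, z):
    """Vector field H(Theta, z) of (6), with components gamma * (x - theta_i) as in (3)."""
    x, d_x = z
    out = []
    for i, theta in enumerate(thetas):
        g = gamma(d_x, decisions[i], x, thetas, i)
        out.append(g * (x - theta) if g else 0.0)
    return out


def sa_step(thetas, decisions, z, alpha):
    """One stochastic approximation step Theta_{n+1} = Theta_n + alpha_n * H(Theta_n, z_n)."""
    return [t + alpha * h for t, h in zip(thetas, H(thetas, decisions, z))]


theta0 = [0.0, 4.0]
decisions = [1, 2]
print(H(theta0, decisions, (1.0, 1)))              # only the first component is nonzero
print(sa_step(theta0, decisions, (3.0, 1), 0.5))   # the second vector is repelled
```

Writing the update through a single function `H` makes it easier to see that only the random pair z_n and the step size change from one iteration to the next.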
For the two pattern case, we let $p_1(x)$ represent the density for pattern 1 and $\pi_1$
represent its prior. Likewise for $p_0(x)$ and $\pi_0$. It can be shown (Kohonen [1986])
that
$$h_i(\Theta) = \int_{V_{\theta_i}} (x - \theta_i)\, q_i(x)\, dx \tag{8}$$
where
$$q_i(x) = \begin{cases} p_1(x)\pi_1 - p_0(x)\pi_0 & \text{if } d_{\theta_i} = 1 \\ p_0(x)\pi_0 - p_1(x)\pi_1 & \text{if } d_{\theta_i} = 0. \end{cases} \tag{9}$$
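To make the mean field of (8) and (9) concrete, the sketch below approximates it numerically for scalar observations. Everything specific here is an assumption chosen for illustration: unit-variance Gaussian class densities centered at +1 and -1, equal priors, a midpoint rule on a truncated interval, and the function names.

```python
import math


def normal_pdf(x, mean, std=1.0):
    """Gaussian density, used here only as an illustrative class-conditional density."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def mean_field(thetas, decisions, p1, p0, pi1=0.5, pi0=0.5, lo=-10.0, hi=10.0, n=20000):
    """Approximate h_i(Theta) of (8) for scalar observations with a midpoint rule on [lo, hi]."""
    h = [0.0] * len(thetas)
    dx = (hi - lo) / n
    for k in range(n):
        x = lo + (k + 0.5) * dx
        # Voronoi cell membership: x belongs to the cell of its nearest vector.
        i = min(range(len(thetas)), key=lambda j: abs(thetas[j] - x))
        # q_i(x) from (9): its sign depends on the decision attached to theta_i.
        q = p1(x) * pi1 - p0(x) * pi0
        if decisions[i] == 0:
            q = -q
        h[i] += (x - thetas[i]) * q * dx
    return h


p1 = lambda x: normal_pdf(x, 1.0)     # assumed density for pattern 1
p0 = lambda x: normal_pdf(x, -1.0)    # assumed density for pattern 0
h_vals = mean_field([-1.0, 1.0], [0, 1], p1, p0)
print(h_vals)   # equal magnitude, opposite sign: each vector drifts outward
```

For this symmetric configuration the two components cancel, and the positive second component indicates that the vector with decision 1 is drawn further into the region where pattern 1 dominates.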
If the following hypotheses hold then using techniques from (Benveniste, Metivier & 
Priouret [1987]) or (Kushner & Clark [1978]) we can prove the convergence theorem 
below: 
[H.1] $\{\alpha_n\}$ is a nonincreasing sequence of positive reals such that
[H.2] Given d,, Xn are independent and distributed according to Pa, (x). 
[H.3] The pattern densities, pi(x), are continuous. 
Theorem 1 Assume that [H.1]-[H.3] hold. Let $\Theta^*$ be a locally asymptotically stable
equilibrium point of (7) with domain of attraction $D^*$. Let $Q$ be a compact subset
of $D^*$. If $\Theta_n \in Q$ for infinitely many $n$ then
$$\lim_{n \to \infty} \Theta_n = \Theta^* \quad \text{a.s.} \tag{10}$$
Proof: (see (LaVigna [1989])) 
Hence if the initial locations and decisions of the Voronoi vectors are close to a
locally asymptotically stable equilibrium of (7) and if they do not move too much then
the vectors converge.
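The role of the equilibrium in the theorem can be illustrated by integrating the ordinary differential equation (7) numerically. The sketch below is only an example under assumptions that are not in the paper: scalar observations, unit-variance Gaussian densities centered at +1 (pattern 1) and -1 (pattern 0), equal priors, a midpoint rule for the integral in (8), and forward-Euler time stepping.

```python
import math


def normal_pdf(x, mean):
    """Unit-variance Gaussian density (an illustrative choice of class density)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def h(thetas, decisions, lo=-8.0, hi=8.0, n=1600):
    """Midpoint-rule approximation of h_i(Theta) in (8) for scalar observations."""
    out = [0.0] * len(thetas)
    dx = (hi - lo) / n
    for k in range(n):
        x = lo + (k + 0.5) * dx
        i = min(range(len(thetas)), key=lambda j: abs(thetas[j] - x))
        q = 0.5 * normal_pdf(x, 1.0) - 0.5 * normal_pdf(x, -1.0)   # p1*pi1 - p0*pi0
        if decisions[i] == 0:
            q = -q
        out[i] += (x - thetas[i]) * q * dx
    return out


def integrate_ode(thetas, decisions, dt=0.5, steps=100):
    """Forward-Euler integration of d(Theta)/dt = h(Theta), equation (7)."""
    thetas = list(thetas)
    for _ in range(steps):
        drift = h(thetas, decisions)
        thetas = [t + dt * d for t, d in zip(thetas, drift)]
    return thetas


theta_star = integrate_ode([-1.0, 1.0], decisions=[0, 1])
print(theta_star)   # the drift at this point is close to zero
```

Starting from a symmetric initial condition with decisions that match the dominant pattern on each side, the trajectory settles at a point where the drift vanishes, which is the kind of equilibrium the theorem refers to.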
Given the form of (8) one might try to use Lyapunov theory to prove convergence
with
$$L(\Theta) = \sum_{i=1}^{K} \int_{V_{\theta_i}} \|x - \theta_i\|^2\, q_i(x)\, dx \tag{11}$$
as a candidate Lyapunov function. This function will not work, as is demonstrated
by the following calculation in the one dimensional case. Suppose that $K = 2$ and
$\theta_1 < \theta_2$ then
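$$\frac{\partial}{\partial\theta_1} L(\Theta) = \frac{\partial}{\partial\theta_1} \sum_{i=1}^{2} \int_{V_{\theta_i}} \|x - \theta_i\|^2\, q_i(x)\, dx \tag{12}$$
$$= \frac{\partial}{\partial\theta_1} \left( \int_{-\infty}^{(\theta_1+\theta_2)/2} \|x - \theta_1\|^2\, q_1(x)\, dx + \int_{(\theta_1+\theta_2)/2}^{\infty} \|x - \theta_2\|^2\, q_2(x)\, dx \right) \tag{13}$$
$$= -\int_{-\infty}^{(\theta_1+\theta_2)/2} (x - \theta_1)\, q_1(x)\, dx + \|(\theta_1 - \theta_2)/2\|^2\, q_1((\theta_1+\theta_2)/2) \tag{16}$$
$$= -h_1(\Theta) + \|(\theta_1 - \theta_2)/2\|^2\, q_1((\theta_1+\theta_2)/2). \tag{17}$$
Likewise
$$\frac{\partial}{\partial\theta_2} L(\Theta) = -h_2(\Theta) + \|(\theta_2 - \theta_1)/2\|^2\, q_2((\theta_1+\theta_2)/2). \tag{18}$$
Therefore, with $\dot{\Theta} = h(\Theta)$ from (7) and $q_2 = -q_1$ when the two vectors carry different decisions,
$$\nabla L(\Theta) \cdot \dot{\Theta} = -h_1(\Theta)^2 - h_2(\Theta)^2 + \|(\theta_1 - \theta_2)/2\|^2\, q_1((\theta_1+\theta_2)/2)\,(h_1(\Theta) - h_2(\Theta)). \tag{19}$$
In order for this to be a Lyapunov function (19) would have to be nonpositive everywhere,
which is not the case. The problem with this candidate occurs because the integrand
$q_i(x)$ is not strictly positive, as it is for ordinary vector quantization and
adaptive K-means.
Figure 1: A possible distribution of observations and two Voronoi vectors.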
4 Modified LVQ Algorithm 
The convergence results above require that the initial conditions be close to the
stable points of (7) in order for the algorithm to converge. In this section we
present a modification to the LVQ algorithm which increases the number of stable
equilibria of equation (7) and hence increases the chances of convergence. First
we present a simple example which emphasizes a defect of LVQ and suggests an
appropriate modification to the algorithm.
Let ○ represent an observation from pattern 2 and let △ represent an observation
from pattern 1. We assume that the observations are scalar. Figure 1 shows a
possible distribution of observations. Suppose there are two Voronoi vectors $\theta_1$ and
$\theta_2$ with decisions 1 and 2, respectively, initialized as shown in Figure 1. At each
update of the LVQ algorithm, a point is picked at random from the observation set
and the closest Voronoi vector is modified. We see that during this update, it is
possible for $\theta_2(n)$ to be pushed towards $\infty$ and $\theta_1(n)$ to be pushed towards $-\infty$,
hence the Voronoi vectors may not converge.
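The divergence described in this example can be reproduced with a small deterministic sketch. The data below are hypothetical and are not the configuration of Figure 1: one pattern-2 observation at -1, one pattern-1 observation at +1, and Voronoi vectors whose decisions disagree with the observations nearest to them. For reproducibility the observations are presented in a fixed order rather than at random, and the same step size is used for both samples of a pass; both are simplifying assumptions.

```python
def lvq_step(thetas, decisions, x, d_x, alpha):
    """One scalar LVQ update, equations (1)-(2); returns the index of the moved vector."""
    c = min(range(len(thetas)), key=lambda i: abs(thetas[i] - x))
    sign = 1.0 if decisions[c] == d_x else -1.0
    thetas[c] += sign * alpha * (x - thetas[c])
    return c


# Hypothetical data: the pattern-2 observation lies on the left, the pattern-1 one on the right,
# while theta_1 (decision 1) starts on the left and theta_2 (decision 2) on the right.
observations = [(-1.0, 2), (1.0, 1)]
thetas = [-2.0, 2.0]
decisions = [1, 2]

history = [list(thetas)]
for epoch in range(1, 201):
    alpha = 1.0 / epoch            # same step size for both samples of a pass (a simplification)
    for x, d_x in observations:
        lvq_step(thetas, decisions, x, d_x, alpha)
    history.append(list(thetas))

print(history[0], history[-1])     # theta_1 drifts left, theta_2 drifts right
```

Each repelling step multiplies the distance between a vector and its nearest observation by (1 + alpha_n), and since the product of these factors grows without bound for alpha_n = 1/n, the vectors keep moving outward even though the data are bounded.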
Recall that during the update procedure in (3), the Voronoi cells are changed by
changing the location of one Voronoi vector. After an update, the majority vote of
the observations in each new Voronoi cell may not agree with the decision previously
assigned to that cell. This discrepancy can cause the divergence of the algorithm.
In order to prevent this from occurring, the decisions associated with the Voronoi
vectors should be updated to agree with the majority vote of the observations that
fall within their Voronoi cells. Let
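$$g_i(\Theta; N) = \begin{cases} 1 & \text{if } \frac{1}{N}\sum_{j=1}^{N} 1_{\{x_j \in V_{\theta_i}\}} 1_{\{d_{x_j} = 1\}} > \frac{1}{N}\sum_{j=1}^{N} 1_{\{x_j \in V_{\theta_i}\}} 1_{\{d_{x_j} = 2\}} \\ 2 & \text{otherwise.} \end{cases} \tag{20}$$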
Then $g_i$ represents the decision of the majority vote of the observations falling in
$V_{\theta_i}$. With this modification, the learning for $\theta_i$ becomes
$$\theta_i(n+1) = \theta_i(n) + \alpha_n\,\gamma(d_{x_n}, g_i(\Theta_n; N), x_n, \Theta_n)\,(x_n - \theta_i(n)). \tag{21}$$
This equation has the same form as (3), with the function $H(\Theta, z)$ now defined from (21).
The divergence in the example above happens because the decisions of the Voronoi vectors do not agree
with the majority vote of the observations closest to each vector. As a result, the
Voronoi vectors are pushed away from the origin. This phenomenon occurs even
though the observation data are bounded. The point here is that, if the decision
associated with a Voronoi vector does not agree with the majority vote of the
observations closest to that vector then it is possible for the vector to diverge. A
simple solution to this problem is to correct the decisions of all the Voronoi vectors
after every adjustment so that their decisions correspond to the majority vote. In
practice this correction would only be done during the beginning iterations of the
learning algorithm since that is when $\alpha_n$ is large and the Voronoi vectors are moving
around significantly. With this modification it is possible to show convergence to the
Bayes optimal classifier (LaVigna [1989]) as the number of Voronoi vectors becomes
large.
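A minimal sketch of the majority vote correction, for scalar observations and pattern labels 1 and 2, might look as follows. The function names, the toy data, and the step size schedule 1/(n+1) are assumptions made for illustration; in particular the relabeling is applied at every step here, whereas the text notes that in practice it would be used only during the early iterations.

```python
def cell_index(thetas, x):
    """Index of the Voronoi cell (nearest vector) containing the scalar observation x."""
    return min(range(len(thetas)), key=lambda i: abs(thetas[i] - x))


def majority_decisions(thetas, observations):
    """Equation (20): label each vector by the majority class in its Voronoi cell (ties give 2)."""
    votes = [[0, 0] for _ in thetas]          # votes[i] = [count of class 1, count of class 2]
    for x, d_x in observations:
        votes[cell_index(thetas, x)][d_x - 1] += 1
    return [1 if v1 > v2 else 2 for v1, v2 in votes]


def modified_lvq_step(thetas, observations, x, d_x, alpha):
    """One update of the form (21): relabel by majority vote, then apply the LVQ step."""
    decisions = majority_decisions(thetas, observations)
    c = cell_index(thetas, x)
    sign = 1.0 if decisions[c] == d_x else -1.0
    thetas[c] += sign * alpha * (x - thetas[c])
    return decisions


# Hypothetical data: same toy configuration that makes unmodified LVQ drift outward.
observations = [(-1.0, 2), (1.0, 1)]
thetas = [-2.0, 2.0]
for epoch in range(1, 201):
    alpha = 1.0 / (epoch + 1)
    for x, d_x in observations:
        decisions = modified_lvq_step(thetas, observations, x, d_x, alpha)
print(thetas, decisions)
```

Because each vector now carries the label of the observations in its own cell, every update on this toy data is attracting, and the vectors approach the observations instead of moving away from them.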
5 Decision Error
In this section we discuss the error associated with the modified LVQ algorithm.
Two results are discussed. The first is a simple comparison between LVQ
and the nearest neighbor algorithm. The second is that if the number of Voronoi
vectors is allowed to go to infinity at an appropriate rate as the number of observations
goes to infinity, then it is possible to construct a convergent estimator of
the Bayes risk. That is, the error associated with LVQ can be made to approach
the optimal error. As before, we concentrate on the binary pattern case for ease of
notation.
5.1 Nearest Neighbor 
If a Voronoi vector is assigned to each observation then the LVQ algorithm reduces
to the nearest neighbor algorithm. For that algorithm, it was shown (Cover & Hart
[1967]) that its large-sample probability of error is less than twice that of the
Bayes optimal classifier. More specifically, let $r^*$ be the Bayes optimal risk and let $r$ be
the nearest neighbor risk. It was shown that
$$r^* \le r \le 2r^*(1 - r^*) \le 2r^*. \tag{22}$$
Hence in the case of no iteration, the risk associated with LVQ is that of
the nearest neighbor algorithm.
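The bounds in (22) are easy to tabulate. The short sketch below evaluates them for a few values of the Bayes risk in the two pattern case; the function name `nn_risk_bounds` is illustrative.

```python
def nn_risk_bounds(r_star):
    """Return (lower, upper) bounds on the nearest neighbor risk r from (22), two-class case."""
    if not 0.0 <= r_star <= 0.5:
        raise ValueError("two-class Bayes risk must lie in [0, 0.5]")
    return r_star, 2.0 * r_star * (1.0 - r_star)


for r_star in (0.0, 0.1, 0.25, 0.5):
    lower, upper = nn_risk_bounds(r_star)
    print(f"r* = {r_star:.2f}:  {lower:.3f} <= r <= {upper:.3f} <= {2 * r_star:.3f}")
```

The upper bound 2r*(1 - r*) coincides with the lower bound at r* = 0 and r* = 0.5, so the nearest neighbor rule is only guaranteed to match the optimal risk at these two extremes.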
5.2 Other Choices for Number of Voronoi Vectors 
We saw above that if the number of Voronoi vectors equals the number of observations
then LVQ coincides with the nearest neighbor algorithm. Let $k_N$ represent the
number of Voronoi vectors for an observation sample size of $N$. We are interested
in determining the probability of error for LVQ when $k_N$ satisfies (1) $\lim_{N \to \infty} k_N = \infty$
and (2) $\lim_{N \to \infty} (k_N/N) = 0$. In this case, there are more observations than vectors and
hence the Voronoi vectors represent averages of the observations. It is possible to
show that with $k_N$ satisfying (1)-(2) the decision error associated with modified
LVQ can be made to approach the Bayesian optimal decision error as $N$ becomes
large (LaVigna [1989]).
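As one example of a schedule satisfying conditions (1) and (2), the sketch below uses the integer square root of the sample size. This particular choice is an assumption made for illustration; the text only requires that the number of vectors grow without bound while its ratio to the sample size vanishes.

```python
import math


def num_voronoi_vectors(n_obs):
    """Illustrative schedule k_N = floor(sqrt(N)), at least 1 (an assumed choice)."""
    return max(1, math.isqrt(n_obs))


for n_obs in (100, 10_000, 1_000_000):
    k = num_voronoi_vectors(n_obs)
    print(n_obs, k, k / n_obs)     # k grows while k / N shrinks
```

Any schedule that grows more slowly than N, such as a fractional power of N, would serve the same illustrative purpose.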
6 Conclusions 
We have shown convergence of the Voronoi vectors in the LVQ algorithm. We have
also presented the majority vote modification of the LVQ algorithm. This modification
prevents divergence of the Voronoi vectors and results in convergence for a
larger set of initial conditions. In addition, with this modification it is possible to
show that as the appropriate parameters go to infinity the decision regions associated
with the modified LVQ algorithm approach the Bayesian optimal (LaVigna
[1989]).
7 Acknowledgements 
This work was supported by the National Science Foundation through grant CDR-8803012,
Texas Instruments through a TI/SRC Fellowship and the Office of Naval
Research through an ONR Fellowship.
8 References 
A. Benveniste, M. Metivier & P. Priouret [1987], Algorithmes Adaptatifs et Approximations Stochastiques, Masson, Paris.
T. M. Cover & P. E. Hart [1967], "Nearest Neighbor Pattern Classification," IEEE Transactions on Information Theory IT-13, 21-27.
R. O. Duda & P. E. Hart [1973], Pattern Classification and Scene Analysis, John Wiley & Sons, New York, NY.
T. Kohonen [1986], "Learning Vector Quantization for Pattern Recognition," Technical Report TKK-F-A601, Helsinki University of Technology.
H. J. Kushner & D. S. Clark [1978], Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, New York-Heidelberg-Berlin.
A. LaVigna [1989], "Nonparametric Classification using Learning Vector Quantization," Ph.D. Dissertation, Department of Electrical Engineering, University of Maryland.
