A Network of Localized Linear Discriminants 
Martin S. Glassman 
Siemens Corporate Research 
755 College Road East 
Princeton, NJ 08540 
msg@siemens.siemens.com
Abstract 
The localized linear discriminant network (LLDN) has been designed to address 
classification problems containing relatively closely spaced data from different 
classes (encounter zones [1], the accuracy problem [2]). Locally trained hyperplane
segments are an effective way to define the decision boundaries for these
regions [3]. The LLD uses a modified perceptron training algorithm for effective 
discovery of separating hyperplane/sigmoid units within narrow boundaries. The 
basic unit of the network is the discriminant receptive field (DRF) which combines 
the LLD function with Gaussians representing the dispersion of the local training 
data with respect to the hyperplane. The DRF implements a local distance mea- 
sure [4], and obtains the benefits of networks of localized units [5]. A constructive 
algorithm for the two-class case is described which incorporates DRF's into the 
hidden layer to solve local discrimination problems. The output unit produces a 
smoothed, piecewise linear decision boundary. Preliminary results indicate the 
ability of the LLDN to efficiently achieve separation when boundaries are narrow 
and complex, in cases where both the "standard" multilayer perceptron (MLP) 
and k-nearest neighbor (KNN) yield high error rates on training data. 
1 The LLD Training Algorithm and DRF Generation 
The LLD is defined by the hyperplane normal vector V and its "midpoint" M (a translated 
origin [1] near the center of gravity of the training data in feature space). Incremental 
corrections to V and M accrue for each training token feature vector Yj in the training 
set, as illustrated in figure 1 (exaggerated magnitudes). The surface of the hyperplane is 
appropriately moved either towards or away from Yj by rotating V, and shifting M along 
the axis defined by V M is always shifted towards Yj in the "radial" direction Rj (which is 
the componerit of Dj orthogonal to V, where Dj = Y. - M): 
[Figure 1 panels: token on correct side of hyperplane; token on wrong side of hyperplane]
Figure 1: LLD incremental correction vectors associated with training token Yj are shown
above, and the corresponding LLD update rules below:
[update rule equations not recoverable from the source]
The batch mode summation is over tokens in the local training set, and n is the iteration
index. The polarity of $\Delta V_j$ and $\Delta M_j$ is set by Sc (c = the class of Yj), where Sc = 1 if Yj is
classified correctly, and Sc = -1 if not. Corrections for each token are scaled by a sigmoidal
error term: $\epsilon_j = 1/(1 + \exp((S_c \eta/\lambda)\,|V^T D_j|))$, a function of the distance of the token to
the plane, the sign of Sc, and a data-dependent scaling parameter: $\lambda = |V^T(B_0 - B_1)|$, where
$\eta$ is a fixed (experimental) scaling parameter. The scaling of the sigmoid is proportional
to an estimate of the boundary region width along the axis of V. Bc is a weighted average
of the class c token vectors: $B_c(n+1) = (1-\alpha)B_c(n) + \alpha W_c \sum_{j \in c} \epsilon_{j,c}(n)\,Y_j$, where $\epsilon_{j,c}$
is a sigmoid with the same scaling as $\epsilon_j$, except that it is centered on Bc instead of M,
emphasizing tokens of class c nearest the hyperplane surface. For small $\eta$'s, Bc will settle
near the cluster center of gravity, and for large $\eta$'s, Bc will approach the tokens closest to
the hyperplane surface. (The rate of the movement of Bc is limited by the value of $\alpha$, which
is not critical.) The inverse of the number of tokens in class c, Wc, balances the weight
of the corrections from each class. If a more Bayesian-like solution is required, the slope
of $\epsilon$ can be made class dependent (for example, replacing $\eta$ with $\eta_c \propto W_c$). Since the
slope of the sigmoid error term is limited and distribution dependent, the use of Wc, along
with the nonlinear weighting of tokens near the hyperplane surface, is important for the
development of separating planes in relatively narrow boundaries (the assumption is that
the distributions near these boundaries are non-Gaussian). The setting of $\eta$ simultaneously
(for convenience) controls the focus on the "inner edges" of the class clusters and the slope
of the sigmoid relative to the distance between the inner edges, with some resultant control
over generalization performance. This local scaling of the error also aids the convergence
rate. The range of good values for $\eta$ has been found to be reasonably wide, and identical
values have been used successfully with speech, ECG, and synthetic data; it could also
be set/optimized using cross-validation. Separate adaptive learning rates ($\mu(n)$, $\gamma(n)$, and
$\beta(n)$) are used in order to take advantage of the distinct nature of the geometric function of
each component. Convergence is also improved by maintaining M within the local region;
this controls the rate at which the hyperplane can sweep through the boundary region,
making the effect of $\Delta V$ more predictable. The LLD normal vector update is simply:
$V(n+1) = (V(n) + \Delta V)/\|V(n) + \Delta V\|$, so that V is always normalized to unit magnitude.
The midpoint is just shifted: $M(n+1) = M(n) + \Delta M_V + \Delta M_R$ (its shifts along V and in the
radial direction).
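As an illustration only, the geometric quantities used by the LLD update can be sketched in Python. The sketch assumes the error term has the form 1/(1 + exp((Sc η/λ)|V·D|)), with λ taken as the distance between the two B points projected onto V, as read from the description above. The helper names (`radial_component`, `boundary_width`, `error_term`, `normalized_update`) are hypothetical, and the correction vector passed to `normalized_update` is taken as given rather than derived.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def radial_component(v, m, y):
    """Split D = y - m into its signed length along the unit normal v
    and its radial part R (the component of D orthogonal to v)."""
    d = [yi - mi for yi, mi in zip(y, m)]
    along = dot(v, d)
    r = [di - along * vi for di, vi in zip(d, v)]
    return along, r

def boundary_width(v, b0, b1):
    """lambda = |V^T (B0 - B1)|: boundary region width estimate along v."""
    return abs(dot(v, [p - q for p, q in zip(b0, b1)]))

def error_term(v, m, y, correct, eta, lam):
    """Sigmoidal error term; S = +1 for a correctly classified token, -1 otherwise.
    ASSUMPTION: this functional form is a reading of the text, not a verified formula."""
    s = 1.0 if correct else -1.0
    along, _ = radial_component(v, m, y)
    return 1.0 / (1.0 + math.exp(s * (eta / lam) * abs(along)))

def normalized_update(v, delta_v):
    """V(n+1) = (V(n) + dV) / ||V(n) + dV||, keeping v at unit magnitude."""
    w = [vi + di for vi, di in zip(v, delta_v)]
    norm = math.sqrt(dot(w, w))
    return [wi / norm for wi in w]
```

Under this reading, a correctly classified token far from the plane contributes an error term near zero, a misclassified token far from the plane contributes a term near one, and any token lying on the plane contributes exactly 0.5.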
[Figure 2 diagram; vector labels: Bk,c, Mk, Oj,k,c. Annotations: lambda, estimate of the boundary region width; sigma(V), dispersion of the training data in the discriminant direction (V); sigma(R), dispersion of the training data in all directions orthogonal to V]
Figure 2: Vectors and parameters associated with the DRF for class c, for LLD k
DRF's are used to localize the response of the LLD to the region of feature space in which
it was trained, and are constructed after completion of LLD training. Each DRF represents
one class, and the localizing component of the DRF is a Gaussian function based on simple
statistics of the training data for that class. Two measures of the dispersion of the data are
used: $\sigma_V$ ("normal" dispersion), obtained using the mean average deviation of the lengths of
$P_{j,k,c}$, and $\sigma_R$ ("radial" dispersion), obtained correspondingly using the $O_{j,k,c}$'s. (As shown,
$P_{j,k,c}$ is the normal component, and $O_{j,k,c}$ the radial component of $Y_j - B_{k,c}$.) The output in
response to an input vector Yj from the class c DRF associated with the LLD k is
$\phi_{j,k,c} = \delta_{k,c}\,(\epsilon_{j,k} - 0.5) / \exp(d_{P,j,k,c}^2 + d_{O,j,k,c}^2)$, $\epsilon_{j,k} = 1/(1 + \exp((\eta/\lambda_k)\,|V_k^T(Y_j - M_k)|))$.
Two components of the DRF incorporate the LLD discriminant; one is the sigmoid error
function used in training the LLD but shifted down to a value of zero at the hyperplane
surface. The other is $\delta_{k,c}$, which is 1 if Yj is on the class c side of LLD k, and zero if
not. (In retrospect, for generalization performance, it may not be desirable to introduce
this discontinuity to the discriminant component.) The contribution of the Gaussian is
based on the normal and radial dispersion weighted distances of the input vector to $B_{k,c}$,
$d_{P,j,k,c}$ and $d_{O,j,k,c}$.
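A rough Python sketch of a single DRF response is given here to make the structure of the output concrete. It is not a verified implementation: the definitions of the two dispersion weighted distances are not legible in the source, so the sketch assumes they are the lengths of the normal and radial components of Yj - Bk,c divided by the corresponding dispersions. The sign convention of the sigmoid is taken literally from the formula as read, and `class_sign` (which side of the hyperplane belongs to class c) is a hypothetical parameter.

```python
import math

def split_normal_radial(y, b, v):
    """Lengths of the normal (P) and radial (O) components of y - b
    with respect to the unit normal v."""
    d = [yi - bi for yi, bi in zip(y, b)]
    along = sum(vi * di for vi, di in zip(v, d))
    radial = [di - along * vi for di, vi in zip(d, v)]
    return abs(along), math.sqrt(sum(r * r for r in radial))

def drf_output(y, v, m, b, sigma_v, sigma_r, class_sign, eta, lam):
    """Response of one class-c DRF to input y.
    class_sign is +1 if class c lies on the +v side of the hyperplane, else -1."""
    signed = sum(vi * (yi - mi) for vi, yi, mi in zip(v, y, m))
    # Indicator: 1 on the class-c side of the LLD, 0 otherwise.
    delta = 1.0 if signed * class_sign > 0 else 0.0
    # Sigmoid as read from the text; it equals 0.5 on the hyperplane,
    # so (eps - 0.5) is zero there and lies in (-0.5, 0] elsewhere.
    eps = 1.0 / (1.0 + math.exp((eta / lam) * abs(signed)))
    p_len, o_len = split_normal_radial(y, b, v)
    d_p = p_len / sigma_v  # ASSUMPTION: normal distance scaled by sigma_V
    d_o = o_len / sigma_r  # ASSUMPTION: radial distance scaled by sigma_R
    return delta * (eps - 0.5) / math.exp(d_p ** 2 + d_o ** 2)
```

With these assumptions the magnitude of the response is largest near Bk,c on the class c side, decays as a Gaussian with distance from Bk,c, and is exactly zero on the other side of the hyperplane.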
2 Network Construction 
Segmentation of the boundary between classes is accomplished by "growing" LLD's within 
the boundary region. An LLD is initialized using a closely spaced pair of tokens from each 
class. The LLD is grown by adding nearby tokens to the training set, using the k-nearest 
neighbors to the LLD midpoint at each growth stage as candidates for permanent inclusion. 
Candidate DRF's are generated after incremental training of the LLD to accommodate each 
new candidate token. Two error measures are used to assess the effect of each candidate: the
peak value of $\epsilon_j$ over the local training set, and a measure of misclassification
error due to the receptive fields of the candidate DRF's extending over the entire training
set. The candidate token with the lowest average misclassification measure is permanently
added, as long as both its peak $\epsilon_j$ and its misclassification measure are below fixed
thresholds. Growth of the LLD is halted if no candidate has both error measures below
threshold. The two thresholds directly affect the granularity
of the DRF representation of the data; they need to be set to minimize the number of DRF's
generated, while allowing sufficient resolution of local discrimination problems. They
should perhaps be adaptive so as to encourage coarse grained solutions to develop before
fine grain structure.
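The selection step at each growth stage can be sketched as follows. This is a minimal illustration under stated assumptions: the two error measures are supplied as precomputed numbers for each candidate, the function name `select_candidate` and the tuple layout are hypothetical, and the sketch picks the lowest misclassification measure among the candidates that pass both thresholds, which is one possible reading of the rule above.

```python
def select_candidate(candidates, peak_threshold, misclass_threshold):
    """Choose the token to add permanently at one growth stage.

    candidates: list of (token_id, peak_error, misclass_error), measured after
    incrementally training the LLD with that candidate included.
    Returns the chosen token_id, or None when growth should halt.
    """
    admissible = [
        c for c in candidates
        if c[1] < peak_threshold and c[2] < misclass_threshold
    ]
    if not admissible:
        return None  # no candidate has both measures below threshold
    # Lowest misclassification measure wins among admissible candidates.
    return min(admissible, key=lambda c: c[2])[0]
```

Lower thresholds halt growth sooner and therefore yield more, smaller DRF's; higher thresholds allow each LLD to cover more of the boundary.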
Figure 3: Four "snapshots" in the growth of an LLD/DRF pair. The upper two are "close-ups."
The initial LLD/DRF pair is shown in the upper left, along with the seed pair. Filled
rectangles and ellipses represent the tokens from each class in the permanent local training
set at each stage. The large markers are the B points, and the cross is the LLD midpoint.
The amplitude of the DRF outputs is coded in greyscale.
At this point the DRF's are fixed and added to the network; this represents the addition of 
two new localized features available for use by the network's output layer in solving the 
global discrimination problem. In this implementation, the output "layer" is a single LLD 
used to generate a two-class decision. The architecture is shown below: 
[Figure 4 diagram: input data feeds the LLD's and DRF's (labels Bk,0, Bk,1, Vk), whose outputs are the features for the output discriminant function (LLD with sigmoid); the error measure on training tokens is used to seed new LLD's or halt training]
Figure 4: LLDN architecture for a two-dimensional, two-class problem 
The output unit is completely retrained after addition of a new DRF pair, using the entire training
set. The output of the network to the input Yj is: $q_j = 1/(1 + \exp((\eta/\lambda_o)\,V_o^T(\Phi_j - M_o)))$,
where $\lambda_o = |V_o^T(B_{o,0} - B_{o,1})|$ and $\Phi_j = [\phi_{j,1}, \ldots, \phi_{j,p}]$ is the p dimensional vector of DRF
outputs presented to the output unit. $V_o$ is the output LLD normal vector, $M_o$ the midpoint,
and the $B_{o,c}$'s the cluster edge points in the internal feature space. The output error for each
token is then used to select a new seed pair for development of the next LLD/DRF pair.
If all tokens are classified with sufficient confidence, of course, construction of the LLDN 
is complete. There are three possibilities for insufficient confidence: a token is covered 
by a DRF of the wrong class, it is not yet covered sufficiently by any DRF's, or it is in a 
region of "conflict" between DRF's of different classes. A heuristic is used to prevent the 
repeated selection of the same seed pair tokens, since there is no guarantee that a given DRF 
will significantly reduce the error for the data it covers after output unit retraining. This 
heuristic alternates between the types of error and the class for selection of the primary seed 
token. Redundancy in DRF shapes is also minimized by error-weighting the dispersion 
computations so that the resultant Gaussian focuses more on the higher error regions of the 
local training data. A simple but reasonably effective pruning algorithm was incorporated 
to further eliminate unnecessary DRF's. 
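The role of the output unit in seeding can be illustrated with a short Python sketch. It assumes the output has the sigmoid form 1/(1 + exp((η/λ)V·(Φ - M))) taken literally from the text, that targets are coded as 0 or 1, and that the primary seed token is simply the one with the largest output error. The names `network_output` and `worst_token` are hypothetical, and the alternation heuristic described above is not modelled.

```python
import math

def network_output(phi, v_out, m_out, eta, lam_out):
    """Output of the network for one token, given its vector of DRF outputs phi.
    v_out, m_out: normal vector and midpoint of the output LLD."""
    z = sum(v * (p - m) for v, p, m in zip(v_out, phi, m_out))
    return 1.0 / (1.0 + math.exp((eta / lam_out) * z))

def worst_token(outputs, targets):
    """Index of the token with the largest output error |target - output|.
    ASSUMPTION: targets are 0 or 1; the source does not state the coding."""
    errors = [abs(t - q) for q, t in zip(outputs, targets)]
    return max(range(len(errors)), key=errors.__getitem__)
```

A token whose DRF output vector coincides with the output midpoint receives an output of exactly 0.5, the least confident value, so such tokens are natural candidates for the next seed pair.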
Figure 5: Network response plots illustrating network development. The upper two
sequences begin with the first LLD/DRF pair, and the bottom two plots show final
network responses for these two problems. A solution to a harder version of the nested
squares problem is on the lower left.
3 Experimental Results 
The first experiment demonstrates comparative convergence properties of the LLD and a 
single hyperplane trained by the standard generalized delta rule (GDR) method (no hidden 
units, single output unit "network" is used) on 14 linearly separable, minimal consonant 
pair data sets. The data is 256 dimensional (time/frequency matrix, described in [6]), with 
80 exemplars per consonant. The results compare the best performance obtainable from 
each technique. The LLD converges roughly 12 times faster in iteration counts. The GDR 
often fails to completely separate f/th, f/v, and s/sh; in the results in figure 6 it fails on
the f/th data set at a plateau of 25% error. In both experiments described in this paper,
networks were run for relatively long times to ensure confidence in declaring failure to
solve the problem.
Figure 6: Training a single hyperplane
Figure 7: Error rates vs. geometries
[Plots for figures 6 and 7; recoverable labels: GDR, "(does not separate)", % width, % offset]
The second experiment involves complete networks on synthetic two-dimensional problems.
Two examples of the nested squares problem (random distributions
of tokens near the surface of squares of alternating class, 400 tokens total) are shown in 
figure 5. Two parameters controlling data set generation are explored: the relative boundary 
region width, and the relative offset from the origin of the data set center of gravity (while 
keeping the upper right corner of the outside square near the (1,1) coordinate); all data is
kept within the unit square (except for geometry number 2). Relative boundary widths of 
29%, 4.4%, and 1% are used with offsets of 0%, 76%, and 94%. The best results over 
parameter settings are reported for each network for each geometry. Four MLP architectures 
were used: 2:16:1, 2:32:1, 2:64:1, and 2:16:16:1; all of these converge to a solution for 
the easiest problem (wide boundaries, no offset), but all eventually fail as the boundaries 
narrow and/or the offset increases. The worst performing net (2:64:1) fails for 7/8 problems 
(maximum error rate of 49%); the best net (2:16:16:1) fails in 3/8 (maximum of 24% 
error). The LLDN is 1 to 3 orders of magnitude faster in cpu time when the MLP does 
converge, even though it does not use adaptive learning rates in this experiment. (The 
average running time for the LLDN was 34 minutes; for the MLP's it was 3481 minutes
[Stardent 3040, single cpu], a figure that includes non-converging runs. The 2:16:16:1 net
did, however, take 4740 minutes to solve problem 6, which was solved in 7 minutes by the 
LLDN.) The best LLDN's converge to zero errors over the problem set (fig. 6), and are not 
too sensitive to parameter variation, which primarily affects convergence time and number
of DRF's generated. In contrast, finding good values for learning rate and momentum for 
the MLP's for each problem was a time-consuming process. The effect of random weight 
initialization in the MLP is not known because of the long running times required. The 
KNN error rate was estimated using the leave-one-out method, and yields error rates of 
0%, 10.5%, and 38.75% (for the best k's) respectively for the three values of boundary 
width. The LLDN is insensitive to offset and scale (like the KNN) because of the use 
of the local origin (M) and error scaling ($\lambda$). While global offset and scaling problems
for the MLP can be ameliorated through normalization and origin translation, this method 
cannot guarantee elimination of local offset and scaling problems. The LLDN's utilization 
of DRF's was reasonably efficient, with the smallest networks (after pruning) using 20, 32, 
and 54 DRF's for the three boundary widths. A simple pruning algorithm, which starts up 
after convergence, iteratively removes the DRF's with the lowest connection weights to the 
output unit (which is retrained after each link is removed). A range of roughly 20% to 40% 
of the DRF's were removed before developing misclassification errors on the training sets. 
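The pruning loop can be sketched as below. This is only an outline of the procedure as described: retraining of the output unit and the counting of training errors are represented by hypothetical callbacks (`retrain`, `count_errors`), and the sketch stops just before a removal would introduce misclassification errors.

```python
def prune(weights, retrain, count_errors):
    """Iteratively drop the DRF with the smallest output connection weight.

    weights: dict mapping DRF id -> connection weight to the output unit.
    retrain(ids): hypothetical callback returning retrained weights for ids.
    count_errors(ids, weights): hypothetical callback returning the number
    of misclassified training tokens for the reduced network.
    """
    active = dict(weights)
    while len(active) > 1:
        weakest = min(active, key=lambda k: abs(active[k]))
        trial_ids = [k for k in active if k != weakest]
        trial = retrain(trial_ids)  # output unit retrained after each removal
        if count_errors(trial_ids, trial) > 0:
            break  # keep the last error-free network
        active = trial
    return active
```

Using the weight magnitude as the removal criterion is an assumption; the text says only that the DRF's with the lowest connection weights are removed.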
The LLDN was also tested on the "two-spirals" problem, which is known to be difficult for
the standard MLP methods. Because of the boundary segmentation process, solution of the 
two-spirals problem was straightforward for the LLDN, and could be tuned to converge in 
as little as 2.5 minutes on an Apollo DN10000. The solution shown in fig. 5 uses 50 DRF's
(not pruned). The generalization pattern is relatively "nice" (for training on the sparse 
version of the data set), and perhaps demonstrates the practical nature of the smoothed 
piecewise linear boundary for nonlinear problems. 
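For reference, the leave-one-out estimate used for the KNN error rates in this section can be sketched in a few lines of Python. This is a generic illustration, not the code used in the experiments: it assumes squared Euclidean distance and majority voting, and the function name `loo_knn_error` is hypothetical.

```python
from collections import Counter

def loo_knn_error(points, labels, k):
    """Leave-one-out error rate of a k-nearest-neighbor classifier.
    Each token is classified by the k nearest of the remaining tokens."""
    errors = 0
    for i, p in enumerate(points):
        # Squared distances from token i to every other token, with labels.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), labels[j])
            for j, q in enumerate(points) if j != i
        )
        votes = Counter(label for _, label in dists[:k])
        predicted = votes.most_common(1)[0][0]
        errors += predicted != labels[i]
    return errors / len(points)
```

When tokens of opposite classes are closer to each other than to tokens of their own class, as in a narrow boundary region, the held-out token is outvoted by its neighbors and the estimated error rises, which is consistent with the trend reported for the narrower boundary widths.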
4 Discussion 
The effect of LLDN parameters on generalization performance needs to be studied. In 
the nested squares problem it is clear that the MLP's will have better generalization when 
they converge; this illustrates the potential utility of a multi-scale approach to developing 
localized discriminants. A number of extensions are possible: Localized feature selection 
can be implemented by simply zeroing components of V. The DRF Gaussians could 
model the radial dispersion of the data more effectively (in greater than two dimensions) by 
generating principal component axes which are orthogonal to V. Extension to the multiclass 
case can be based on DRF sets developed for discrimination between each class and all 
other classes, using the DRF's as features for a multi-output classifier. The use of multiple 
hidden layers offers the prospect of more complex localized receptive fields. Improvement 
in generalization might be gained by including a procedure for merging neighboring DRF's. 
While it is felt that the LLD parameters should remain fixed, it may be advantageous to 
allow adjustment of the DRF Gaussian dispersions as part of the output layer training. A 
stopping rule for LLD training needs to be developed so that adaptive learning rates can be 
utilized effectively. This rule may also be useful in identifying poor token candidates early 
in the incremental LLD training. 
References 
[1] J. Sklansky and G.N. Wassel. Pattern Classifiers and Trainable Machines. Springer-
Verlag, New York, 1981.
[2] S. Makram-Ebeid, J.A. Sirat, and J.R. Viala. A rationalized error backpropagation
learning algorithm. Proc. IJCNN, 373-380, 1988.
[3] J. Sklansky and Y. Park. Automated design of multiple-class piecewise linear classifiers.
Journal of Classification, 6:195-222, 1989.
[4] R.D. Short and K. Fukunaga. A new nearest neighbor distance measure. Proc. Fifth
Intl. Conf. on Pattern Rec., 81-88.
[5] R. Lippmann. A critical overview of neural network pattern classifiers. Neural Networks
for Signal Processing (IEEE), 267-275, 1991.
[6] M.S. Glassman and M.B. Starkey. Minimal consonant pair discrimination for speech
therapy. Proc. European Conf. on Speech Comm. and Tech., 273-276, 1989.
