Nonlinear Discriminant Analysis using 
Kernel Functions 
Volker Roth & Volker Steinhage 
University of Bonn, Institut of Computer Science III 
RSmerstrasse 164, D-53117 Bonn, Germany 
{ roth, steinhag} @ cs. uni-bonn. de 
Abstract 
Fishers linear discriminant analysis (LDA) is a classical multivari- 
ate technique both for dimension reduction and classification. The 
data vectors are transformed into a low dimensional subspace such 
that the class centroids are spread out as much as possible. In 
this subspace LDA works as a simple prototype classifier with lin- 
ear decision boundaries. However, in many applications the linear 
boundaries do not adequately separate the classes. We present a 
nonlinear generalization of discriminant analysis that uses the ker- 
nel trick of representing dot products by kernel functions. The pre- 
sented algorithm allows a simple formulation of the EM-algorithm 
in terms of kernel functions which leads to a unique concept for un- 
supervised mixture analysis, supervised discriminant analysis and 
semi-supervised discriminant analysis with partially unlabelled ob- 
servations in feature spaces. 
I Introduction 
Classical linear discriminant analysis (LDA) projects N data vectors that belong to 
c different classes into a (c- 1)-dimensionai space in such way that the ratio of be- 
tween group scatter $$ and within group scatter Sw is maximized [1]. LDA formally 
consists of an eigenvalue decomposition of S  SB leading to the so called canonical 
vatlares which contain the whole class specific information in a (c- 1)-dimensional 
subspace. The canonical variates can be ordered by decreasing eigenvalue size in- 
dicating that the first variates contain the major part of the information. As a 
consequence, this procedure allows low dimensional representations and therefore 
a visualization of the data. Besides from interpreting LDA only as a technique for 
dimensionality reduction, it can also be seen as a multi-class classification method: 
the set of linear discriminant functions define a partition of the projected space into 
regions that are identified with class membership. A new observation x is assigned 
to the class with centroid closest to x in the projected space. 
To overcome the limitation of only linear decision functions some attempts have 
been made to incorporate nonlinearity into the classical algorithm. HASTIE et al. 
[2] introduced the so called model of Flexible Discriminant Analysis: LDA is refor- 
mulated in the framework of linear regression estimation and a generalization of this 
method is given by using nonlinear regression techniques. The proposed regression 
techniques implement the idea of using nonlinear mappings to transform the input 
data into a new space in which again a linear regression is performed. In real world 
Nonlinear Discriminant Analysis Using Kernel Functions 569 
applications this approach has to deal with numerical problems due to the dimen- 
sional explosion resulting from nonlinear mappings. In the recent years approaches 
that avoid such explicit mappings by using kernel functions have become popular. 
The main idea is to construct algorithms that only afford dot products of pattern 
vectors which can be computed efficiently in high-dimensional spaces. Examples of 
this type of algorithms are the Support Vector Machine [3] and Kernel Principal 
Component Analysis [4]. 
In this paper we show that it is possible to formulate classical linear regression 
and therefore also linear discriminant analysis exclusively in terms of dot products. 
Therefore, kernel methods can be used to construct a nonlinear variant of dis- 
criminant analysis. We call this technique Kernel Discriminant Analysis (KDA). 
Contrary to a similar approach that has been published recently [5], our algorithm is 
a real multi-class classifier and inherits from classical LDA the convenient property 
of data ,isualization. 
2 Review of Linear Discriminant Analysis 
Under the assumption of the data being centered (i.e. Y-i xi = 0) the scatter ma- 
trices $s and Sw are defined by 
c 1 nj (x?)(x) T 
SB: Ej=I -. El,m=l ) ) (1) 
c . 1 . x? ) ) 1 n 
- Z --Z (2) 
where nj is the number of patterns x? ) that belong to class j. 
LDA chooses a transformation matrix V that maximizes the objective function 
Ivrssvl 
J(V) = iVTSwV I . (3) 
The columns of an optimal V are the generalized eigenvectors that correspond to 
the nonzero eigenvalues in Ssvi = hi Swvi. 
In [6] and [7] we have shown, that the standard LDA algorithm can be restat- 
ed exclusively in terms of dot products of input vectors. The final equation is an 
eigenvalue equation in terms of dot product matrices which are of size N x N. Since 
the solution of high-dimensional generalized eigenvalue equations may cause numer- 
ical problems (N may be large in real world applications), we present an improved 
algorithm that reformulates discriminant analysis as a regression problem. More- 
over, this version allows a simple implementation of the EM-algorithm in feature 
spaces. 
3 Linear regression analysis 
In this section we give a brief review of linear regression analysis which we use as 
"building block" for LDA. The task of linear regression analysis is to approximate 
the regression function by a linear function 
r(x) = E(YlA' = x) m c + x T/. (4) 
on the basis of a sample (y,x),... ,(yN,XN). Let now y denote the vector 
(y,...,yN) T and X denote the data matrix which rows are the input vectors. 
Using a quadratic loss function, the optimal parameters c and / are chosen to 
minimize the average squared residual 
= Ilu-  1 + x11 + 
IN denotes a N-vector of ones, [2 denotes a ridge-type penalty matrix [2 = el which 
penalizes the coefficients of/. Assuming the data being centered, i.e y//v= xi = O, 
the parameters of the regression function are given by: 
c = X -1 yN 
i= Yi =: Uy, / = (XrX + eI) -Xry. (6) 
570 V. Roth and V. Steinhage 
4 LDA by optimal scoring 
In this section the LDA problem is linked to linear regression using the framework 
of penalized optimal scoring. We give an overview over the detailed derivation in 
[2] and [8]. Considering again the problem with c classes and N data vectors, 
the class-memberships are represented by a categorical response variable  with 
c levels. It is useful to code the n responses in terms of the indicator matrix 
Z: Zi,5 = 1, if the i-th data vector belongs to class j, and 0 otherwise. The point 
of optimal scoring is to turn categorical variables into quantitative ones by assigning 
scores to classes: the score vector 0 assigns the real number 0 5 to the j-th level of 
. The vector Z then represents a vector of scored training data and is regressed 
onto the data matrix X. The simultaneous estimation of scores and regression 
coefficients constitutes the optimal scoring problem: minimize the criterion 
ASR(O, fi) = N-l[llZO - Still 4. (7) 
under the constraint -[[Z0[[ 2 = 1. According to (6), for a given score 0 the 
minimizing fi is given by 
rios = ( xrx + fi)- x rzo, (8) 
and the partially minimized criterion becomes: 
minASR(,fi) - 1- N-iT zT M(f)Z, 
(9) 
where M(fl) = X(XTX + fl)-X T denotes the regularized hat or smoother matrix. 
Minimizing of (9) under the constraint [[ZO[[  = 1 can be performed by the 
following procedure: 
1. Choose an initial matrix O0 satisfying the constraint N-10oZTZOo = I 
and set O) = ZOo 
2. Run a multi-response regression of  onto X: ) = M(f)O) = XB, 
where B is the matrix of regression coefficients. 
3. Eigenanalyze o)T)) to obtain the optimal scores, and update the matrix of 
regression coefficients: B* = BW, with W being the matrix ofeigenvectors. 
It can be shown, that the final matrix B* is, up to a diagonal scale matrix, equivalent 
to the matrix of LDA-vectors, see [8]. 
5 Ridge regression using only dot products 
The penalty matrix  in (5) assures that the penalized d x d covariance matrix 
,, - xTx 4- eI is a symmetric nonsingular matrix. Therefore, it has d eigenvectors 
ai with accomplished positive eigenvalues ?i such that the following equations hold: 
-- ---- --aia i (10) 
j--1 i----1 i 
The first equation implies that the first l leading eigenvectors ai with eigenvalues 
?i  e have an expansion in terms of the input vectors. Note that l is the number 
of nonzero eigenvalues of the unpenalized covariance matrix xTX. Together with 
(6), it follows for the general case, when the dimensionality d may extend l, that fi 
can be written as the sum of two terms: an expansion in terms of the vectors xi 
with coefficients ai and a similar expansion in terms of the remaining eigenvectors: 
N d d 
Z OiXi 4- Z jaj -- XToI 4- Z jaj, (11) 
fi "-- i----1 j----l+l j----l+l 
with e = (a ... an) T. However, the last term can be dropped, since every eigen- 
vector aj, j = l + 1,..., d is orthogonal to every vector xi and does not influence 
the value of the regression function (4). 
The problem of penalized linear regression can therefore be stated as minimizing 
Nonlinear Discriminant Analysis Using Kernel Functions 571 
= [ Ilu - I + 
A stationary vector c is determined by 
a = (XX r + 
(12) 
(13) 
Let now the dot product matrix K be defined by Kij -- :riT:rj and let for a given test 
point (a:t) the dot product vector kt be defined by kt = Xxt. With this notation 
the regression function of a test point (xt) reads 
(,)  + [ ( + )- 
= y. (14) 
This equation requires only dot products and we can apply the kernel trick. The 
final equation (14), up to the constant term u, h also been found by SUDERS et 
al., [9]. They restated ridge regression in dual variables and optimized the resulting 
criterion function with a lagrange multiplier technique. Note that our derivation, 
which is a direct generalization of the standard linear regression formalism, leads in 
a natural way to a cls of more general regression functions including the constant 
term. 
6 LDA using only dot products 
Setting/ = xTc as in (11) and using the notation of section 5, for a given score 
0 the optimal vector c is given by: 
Otos -' (XX T q- -)-I zo. (15) 
Analogous to (9), the partially minimized criterion becomes: 
minASR(O,a) = 1- N-Or Zr g7I(fi)ZO, (16) 
ot 
with 
/f/(fl) = XX'(XX T + )- = K(K + eI) - 
1 
To minimize (16) under the constraint llzoll 2 - 1 the procedure described in 
section 4 can be used when M(fl) is substituted by /f/(fl). The matrix Y which 
rows are the input vectors projected onto the column vectors of B* is given by: 
Y -- XB* = K(K + eI) -1ZOoW. (17) 
Note that again the dot product matrix K is all that is needed to calculate Y. 
7 The kernel trick 
The main idea of constructing nonlinear algorithms is to apply the linear methods 
not in the space of observations but in a feature space F that is related to the former 
by a nonlinear mapping b: //v _> F, x --> b(x). 
Assuming that the mapped data are centered in F, i.e. Yi c)(xi) = 0, the present- 
ed algorithms remain formally unchanged if the dot product matrix K is computed 
in F: Kij = (qb(xi)  qb(xj)). As shown in [4], this assumption can be dropped by 
1 n 
writing ) instead of the mapping b: p(xi) := qb(xi) -  Yi= qb(xi). 
Computation of dot products in feature spaces can be done efficiently by using k- 
ernel functions k(xi, xj) [3]: For some choices of k there exists a mapping b into 
some feature space F such that k acts as a dot product in F. Among possible 
kernel functions there are e.g. Radial Basis Function (RBF) kernels of the form 
k(m, u) -- exp(-I[m - ull 2/c). 
8 The EM-algorithm in feature spaces 
LDA can be derived as the maximum likelihood method for normal populations 
with different means and common covariance matrix E (see [11]). Coding the class 
membership of the observations in the matrix Z as in section 4, LDA maximizes 
the (complete data) log-likelihood function 
572 14. Roth and 14. Steinhage 
This concept can be generalized for the case that only the group membership of 
Nc < N observations is known ([14], p.679): the EM-algorithm provides a conve- 
nient method for maximizing the likelihood function with missing data: 
E-step: set Pki = Prob(xi E class k) 
fZik, if the class membership of xi has been observed 
Pi = ) v.c?!_ , otherwise, cfi(xi) (x exp[-1/2(xi - k)T'.-l(xi -- k)] 
M-step: set 
N N 
i  1 
rk -- -- ),Pki, Ilk -' Nrk PkiXi, 
N i--1 i--1 
c N 
E =   YPki (xi - !a)(xi - !ak) T 
k=l i=1 
The idea behind this approach is that even an unclassified observation can be used 
for estimation if it is given a proper weight according to its posterior probability 
for class membership. The M-step can be seen as weighted mean and covariance 
maximum likelihood estimates in a weighted and augmented problem: we augment 
the data by replicating the N observations c times, with the 1-th such replication 
having observation weights Pti. The maximization of the likelihood function can be 
achieved via a weighted and augmented LDA. It turns out that it is not necessary to 
explicitly replicate the observations and run a standard LDA: the optimal scoring 
version of LDA described in section 4 allows an implicit solution of the augmented 
problem that still uses only N observations. Instead of using a response indicator 
matrix Z, one uses a blurred response Matrix , whose rows consist of the current 
class probabilities for each observation. At each M-step this  is used in a multiple 
linear regression followed by an eigen-decomposition. A detailed derivation is given 
in [11]. Since we have shown that the optimal scoring problem can be solved in fea- 
ture spaces using kernel functions this is also the case for the whole EM-aigorithm: 
the E-step requires only differences in Mahalonobis distances which are supplied by 
KDA. 
After iterated application of the E- and M-step an observation is classified to the 
class k with highest probability p. This leads to a unique framework for pure 
mixture analysis (Nc = 0), pure discriminant analysis (Nc = N) and the semi- 
supervised models of discriminant analysis with partially unclassified observations 
(0 < Nc < N) in feature spaces. 
9 Experiments 
Waveform data: We illustrate KDA on a popular simulated example, taken from 
[10], p.49-55 and used in [2, 11]. It is a three class problem with 21 variables. The 
learning set consisted of 100 observations per class. The test set was of size 1000. 
The results are given in table 1. 
Table 1: Results for waveform data. The values are averages over 10 simulations. 
The 4 entries above the line are taken from [11]. QDA: quadratic discriminant 
analysis, FDA: flexible discriminant analysis, MDA: mixture discriminant analysis. 
Technique Training Error [%] Test Error [%] 
LDA 
QDA 
FDA (best model parameters) 
MDA (best model parameters) 
KDA (RBF kernel, a = 2, e = 1.5) 
12.1(0.6) 19.1(0.6) 
3.9(0.4) 20.5(0.6) 
10.0(0.6) 19.1(0.6) 
13.9(0.5) 15.5(0.5) 
10.7(0.6) 14.1(0.7) 
Nonlinear Discriminant Analysis Using Kernel Functions 573 
The Bayes risk for the problem is about 14% [10]. KDA outperforms the other 
nonlinear versions of discriminant analysis and reaches the Bayes rate within the 
error bounds, indicating that one cannot expect significant further improvement 
using other classifiers. Figure 1 demonstrates the data visualization property of 
KDA. Since for a 3 class problem the dimensionality of the projected space equals 
2, the data can be visualized without any loss of information. In the left plot one 
can see the projected learn data and the class centroids, the right plot shows the 
test data and again the class centroids of the learning set. 
Figure 1: Data visualization with KDA. Left: learn set, right: test set 
To demonstrate the effect of using unlabeled data for classification we repeated 
the experiment with waveform data using only 20 labeled observations per class. 
We compared the the classification results on a test set of size 300 using only the 
labeled data (error rate E) with the results of the EM-model which considers 
the test data as incomplete measurements during an iterative maximization of the 
likelihood function (error rate E2). Using a RBF kernel ( = 250), we obtained the 
following mean error rates over 20 simulations: E = 30.5(3.6)%, E2 = 17.1(2.7)%. 
The classification performance could be drastically improved when including the 
unlabelled data into the learning process. 
Object recognition: We tested KDA on the MPI Chair Database . It consists of 
89 regular spaced views form the upper viewing hemisphere of 25 different classes 
of chairs as a training set and 100 random views of each class as a test set. The 
available images are downscaled to 16 x 16 pixels. We did not use the additional 
4 edge detection patterns for each view. Classification results for several classifiers 
are given in table 2. 
Table 2: Test error rates (%). Support Vector Machine, Multi Layer Perceptron, 
Oriented Filter, taken from [12]. 
SVM MLP I 201.F 0 KDA, RBFkernel KDApoly. kernel 
2.0 7.2 1.9 2.1 
For a comparison of the computational performance we also trained the SVM-light 
implementation (V 2.0) on the data, [13]. In this experiment with 25 classes the 
KDA algorithm showed to be significantly faster than the SVM: using the RBF- 
kernel, KDA was 3 times faster, with the polynomial kernel KDA was 20 times 
faster than $VM-light. 
10 Discussion 
In this paper we present a nonlinear version of classical linear discriminant analysis. 
The main idea is to map the input vectors into a high- or even infinite dimensional 
feature space and to apply LDA in this enlarged space. Restating LDA in a way that 
only dot products of input vectors are needed makes it possible to use kernel repre- 
sentations of dot products. This overcomes numerical problems in high-dimensional 
The database is available via ftp://ftp.mpik-tueb.mpg.de/pub/chair_dataset/ 
574 V. Roth and E Steinhage 
feature spaces. We studied the classification performance of the KDA classifier on 
simulated waveform data and on the MPI chair database that has been widely used 
for benchmarking in the literature. For medium size problems, especially if the 
number of classes is high, the KDA algorithm showed to be significantly faster than 
a SVM while leading to the same classification performance. From classical LDA 
the presented algorithm inherits the convenient property of data visualization, since 
it allows low dimensional views of the data vectors. This makes an intuitive inter- 
pretation possible, which is helpful in many practical applications. The presented 
KDA algorithm can be used as the maximization step in an EM algorithm in feature 
spaces. This allows to include unlabeled observation into the learning process which 
can improve classification results. Studying the performance of KDA for other clas- 
siftcation problems as well as a theoretical comparison of the optimization criteria 
used in the KDA- and SVM-algorithm will be subject of future work. 
Acknowledgements 
This work was supported by Deutsche Forschungsgemeinschaft, DFG. We heavily 
profitted from discussions with Armin B. Cremers, John Held and Lothat Hermes. 
References 
[1] R. Duda and P. Hart, Pattern Classification and Scene Analysis. Wiley &: Sons, 1973. 
[2] T. Hastie, R. Tibshirani, and A. Buja, "Flexible discriminant analysis by optimal 
scoring," JASA, vol. 89, pp. 1255-1270, 1994. 
[3] V. N. Vapnik, Statistical learning theory. Wiley &: Sons, 1998. 
[4] B. SchSlkopf, A. Smola, and K.-R. Muller, "Nonlinear component analysis as a kernel 
eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998. 
[5] S. Mika, G. R/itsch, J. Weston, B. SchSlkopf, and K.-R. Milllet, "Fisher discrimi- 
nant analysis with kernels," in Neural Networks for Signal Processing IX (Y.-H. Hu, 
J. Larsen, E. Wilson, and S. Douglas, eds.), pp. 41-48, IEEE, 1999. 
[6] V. Roth and V. Steinhage, "Nonlinear discriminant analysis using kernel functions," 
Tech. Rep. IAI-TR-99-7, Department of Computer Science III, Bonn University, 1999. 
[7] V. Roth, A. Pogoda, V. Steinhage, and S. SchrSder, "Pattern recognition combining 
feature- and pixel-based classification within a real world application," in Muster- 
erkennung 1999 (W. FSrstner, J. Buhmann, A. Faber, and P. Faber, eds.), Informatik 
aktuell, pp. 120-129, 21. DAGM Symposium, Bonn, Springer, 1999. 
[8] T. Hastie, A. Buja, and R. Tibshirani, "Penalized discriminant analysis," ArmStat, 
vol. 23, pp. 73-102, 1995. 
[9] S. Saunders, A. Gammermann, and V. Vovk, "Ridge regression learning algorithm in 
dual variables," tech. rep., Royal Holloway, University of London, 1998. 
[10] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Re- 
gression Trees. Monterey, CA: Wadsworth and Brooks/Cole, 1984. 
[11] T. Hastie and R. Tibshirani, "Discriminant analysis by gaussian mixtures," JRSSB, 
vol. 58, pp. 158-176, 1996. 
[12] B. SchSlkopf, Support Vector Learning. PhD thesis, 1997. R. Oldenbourg Verlag, 
Munich. 
[13] T. Joachims, "Making large-scale svm learning practical," in Advances in Kernel 
Methods - Support Vector Learning (B. SchSlkopf, C. Burges, and A. Smola, eds.), 
MIT Press, 1999. 
[14] B. Flury, A First Course in Multivariate Statistics. Springer, 1997. 
