Invariant Feature Extraction and 
Classification in Kernel Spaces 
Sebastian Mika , Gunnar Ritsch  , Jason Weston 2, 
Bernhard Sch51kopf 3, Alex Smola 4, and Klaus-Robert Miiller  
1 GMD FIRST, Kekulstr. 7, 12489 Berlin, Germany 
2 Barnhill BioInformatics, 6709 Waters Av., Savannah, GR 31406, USA 
3 Microsoft Research Ltd., 1 Guildhall Street, Cambridge CB2 3NH, UK 
4 Australian National University, Canberra, 0200 ACT, Australia 
{mika, raetsch, klaus}@first.gmd.de, jasonw@dcs.rhbnc.ac.uk 
bsc@microsoft.com, Alex. Smola.anu.edu.au 
Abstract 
We incorporate prior knowledge to construct nonlinear algorithms 
for invariant feature extraction and discrimination. Employing a 
unified framework in terms of a nonlinear variant of the Rayleigh 
coefficient, we propose non-linear generalizations of Fisher's dis- 
criminant and oriented PCA using Support Vector kernel functions. 
Extensive simulations show the utility of our approach. 
I Introduction 
It is common practice to preprocess data by extracting linear or nonlinear features. 
The most well-known feature extraction technique is principal component analysis 
PCA (e.g. [3]). It aims to find an orthonormal, ordered basis such that the i-th 
direction describes as much variance as possible while maintaining orthogonality to 
all other directions. However, since PCA is a linear technique, it is too limited to 
capture interesting nonlinear structure in a data set and nonlinear generalizations 
have been proposed, among them Kernel PCA [14], which computes the principal 
components of the data set mapped nonlinearly into some high dimensional feature 
space r. 
Often one has prior information, for instance, we might know that the sample is 
corrupted by noise or that there are invariances under which a classification should 
not change. For feature extraction, the concepts of known noise or transformation 
invariance are to a certain degree equivalent, i.e. they can both be interpreted as 
causing a change in the feature which ought to be minimized. Clearly, invariance 
alone is not a sufficient condition for a good feature, as we could simply take the 
constant function. What one would like to obtain is a feature which is as invariant 
as possible while still covering as much of the information necessary for describing 
the particular data. Considering only one (linear) feature vector w and restricting 
to first and second order statistics of the data one arrives at a maximization of the 
so called Rayleigh coefficient 
wTsw 
J(w) = wVSvw , (1) 
Invariant Feature Extraction and Classification in Kernel Spaces 52 7 
where w is the feature vector and $I, $v are matrices describing the desired and 
undesired properties of the feature, respectively (e.g. information and noise). If $I 
is the data covariance and Sv the noise covariance, we obtain oriented PCA [3]. 
If we leave the field of data description to perform supervised classification, it is 
common to choose $ as the separability of class centers (between class variance) 
and $v to be the within class variance. In that case, we recover the well known 
Fisher Discriminant [7]. The ratio in (1) is maximized when we cover much of 
the information coded by $ while avoiding the one coded by $v. The problem is 
known to be solved, in analogy to PCA, by a generalized symmetric eigenproblem 
Sw = Svw [3], where  C ll is the corresponding (biggest) eigenvalue. 
In this paper we generalize this setting to a nonlinear one. In analogy to [8, 14] 
we first map the data via some nonlinear mapping  to some high-dimensional fea- 
ture space r and then optimize (1) in r. To avoid working with the mapped data 
explicitly (which might be impossible if r is infinite dimensional) we introduce sup- 
port vector kernel functions [11], the well-known kernel trick. These kernel functions 
k(x, y) compute a dot product in some feature space r, i.e. k(x, y) - ((x). (y)). 
Formulating the algorithms in r using  only in dot products, we can replace any 
occurrence of a dot product by the kernel function k. Possible choices for k which 
have proven useful e.g. in Support Vector Machines [2] or Kernel PCA [14] are Gaus- 
sian RBF, k(x,y): exp(-llx- yll2/c), or polynomial kernels, k(x,y) = (x. y)a, 
for some positive constants c C ll and d  N, respectively. 
The remainder of this paper is organized as follows: The next section shows how to 
formulate the optimization problem induced by (1) in feature space. Section 3 con- 
siders various ways to find Fisher's Discriminant in r; we conclude with extensive 
experiments in section 4 and a discussion of our findings. 
2 Kernelizing the Rayleigh Coefficient 
To optimize (1) in some kernel feature space r we need to find a formulation which 
uses only dot products of -images. As numerator and denominator are both scalars 
this can be done independently. Furthermore, the matrices $ and $v are basically 
covariances and thus the sum over outer products of -images. Therefore, and due 
to the linear nature of (1) every solution w  r can be written as an expansion in 
terms of mapped training data , i.e. 
i=1 
(2) 
To define some common choices in r let X - {x,... , x} be our training sample 
and, where appropriate, X U X2 = X, X Cl X2 = , two subclasses (with Ixil = i). 
We get the full covariance of X by 
I 1 
C =  Z (I,(x) - m)(I,(x) - m) n- with m = j Z (x), 
(3) 
SB and Sw are operators on a (finite-dimensional) subspace spanned by the I,(xi) (in 
a possibly infinite space). Let w -- v + v2, where v E Span((xi): i = 1,... ,f) and 
v2 _L Span((xi): i = 1,... , f). Then for $ -- $w or $ = $B (which are both symmetric) 
o,so) = 
= 
= 
As Vl lies in the span of the (I)(:Ci) and S only operates on this subspace there exist an 
expansion of w which maximizes J(w). 
528 S. Mika, G. Riitsch, J. Weston, B. Sch6lkopf A. J. Smola and K.-R. Miiller 
which could be used as St in oriented Kernel PCA. For SN we could use an estimate 
of the noise covariance, analogous to the definition of C but over mapped patterns 
sampled from the assumed noise distribution. The standard formulation of the 
Fisher discriminant in O r, yielding the Kernel Fisher Discriminant (KFD) [8] is 
given by 
SW : Z Z ((x) - mi)((x) -- mi) y and SB -- (m2 -- m)(m2 -- m) -r, 
i----1,2 x Afl 
the within-class scatter Sw (as SN), and the between class scatter Sis (as St). Here 
mi is the sample mean for patterns from class i. 
To incorporate a known invariance e.g. in oriented Kernel PCA, one could use the 
tangent covariance matrix [12], 
1 
T = t---  Z (I,(x) - (,x))((x) - (,x)) q- for some small t > 0. (4) 
Here t is a local 1-parameter transformation. T is a finite difference approximation 
t of the covariance of the tangent of t at point (x) (details e.g. in [12]). Using 
St - C and $N -- T in oriented Kernel PCA, we impose invariance under the local 
transformation t. Crucially, this matrix is not only constructed from the training 
patterns X. Therefore, the argument used to find the expansion (2) is slightly 
incorrect. Neverthless, we can assume that (2) is a reasonable approximation for 
describing the variance induced by T. 
Multiplying either of these matrices from the left and right with the expansion (2), 
we can find a formulation which uses only dot products. For the sake of brevity, we 
only give the explicit formulation of (1) in Or for KFD (cf. [8] for details). Defining 
1 
(i)j: -. Ea:exi k(xj,x) we can write (1) for KFD as 
= (ru)2 
a--ffNa = dNa , (5) 
where N = KK n-- y.i=,2eilil, / = /2 -/1, M - //'1-, and Kij = k(xi, xj). 
The results for other choices of $I and SN in Or as for the cases of oriented kernel 
PCA or transformation invariance can be obtained along the same lines. Note that 
we still have to maximize a Rayleigh coefficient. However, now it is a quotient in 
terms of expansion coefficients a, and not in terms of w C Or which is a potentially 
infinite-dimensionai space. Furthermore, it is well known that the solution for this 
special eigenproblem is in the direction of N - (/2 -/) [7], which can be solved 
using e.g. a Cholesky factorization of N. The projection of a new pattern x onto 
w in Or can then be computed by 
(w . I,(x)) = Z ai k(xi, x). (6) 
i=1 
3 Algorithms 
Estimating a covariance matrix with rank up to/? from  samples is ill-posed. Fur- 
thermore, by performing an explicit centering in Or each covariance matrix loses one 
more dimension, i.e. it has only rank g- I (even worse, for KFD the matrix N has 
rank - 2). Thus the ratio in (1) is not well defined anymore, as the denomina- 
tor might become zero. In the following we will propose several ways to deal with 
this problem in KFD. Furthermore we will tackle the question how to solve the 
optimization problem of KFD more efficiently. So far, we have an eigenproblem of 
size  x g. If g becomes large this is numerically demanding. Reformulations of the 
original problem allow to overcome some of these limitations. Finally, we describe 
the connection between KFD and RBF networks. 
Invariant Feature Extraction and Classification in Kernel Spaces 529 
3.1 Regularization and Solution on a Subspace 
As noted before, the matrix N has only rank/ - 2. Besides numerical problems 
which can cause the matrix N to be not even positive, we could think of imposing 
some regularization to control capacity in r. To this end, we simply add a multiple 
of the identity matrix to N, i.e. replace N by Nu where 
N := N + . (7) 
This can be viewed in different ways: (i) for t > 0 it makes the problem feasible 
and numerically more stable as N u becomes positive; (ii) it can be seen as decreas- 
ing the bias in sample based estimation of eigenvalues (cf. [6]); (iii) it imposes a 
regularization on lieill 2, favoring solutions with small expansion coefficients. Fur- 
thermore, one could use other regularization type additives to N, e.g. penalizing 
IIw]] 2 in analogy to SVM (by adding the kernel matrix Kij = k(xi, xj)). 
To optimize (5) we need to solve an  x  eigenproblem, which might be intractable 
for large . As the solutions are not sparse one can not directly use efficient algo- 
rithms like chunking for Support Vector Machines (cf. [13]). To this end, we might 
restrict the solution to lie in a subspace, i.e. instead of expanding w by (2) we write 
m 
= (8) 
i=1 
with m < l. The patterns zi could either be a subset of the training patterns X 
or e.g. be estimated by some clustering algorithm. The derivation of (5) does not 
change, only K is now m x  and we end up with m x m matrices N and M. Another 
advantage is, that it increases the rank of N (relative to its size) although there 
still might be some need for regularization. 
3.2 Quadratic optimization and Sparsification 
Even if N has full rank, maximizing (5) is underdetermined: if c is optimal, then so 
is any multiple thereof. Since c-rMc = (cn-tz) 2, M has rank one. Thus we can seek 
for a vector c, such that anna is minimal for fixed c-rtz (e.g. to 1). The solution 
is unique and we can find the optimal c by solving the quadratic optimization 
problem: 
min anna subject to cn-tz = 1. (9) 
Although the quadratic optimization problem is not easier to solve than the eigen- 
problem, it has an appealing interpretation. The constraint cn-/z - I ensures, that 
the average class distance, projected onto the direction of discrimination, is con- 
stant, while the intra class variance is minimized, i.e. we maximize the average 
margin. Contrarily, the SVM approach [2] optimizes for a large minimal margin. 
Considering (9) we are able to overcome another shortcoming of KFD. The solu- 
tions c are not sparse and thus evaluating (6) is expensive. To solve this we can 
add an/-regularizer AIIc]l to the objective function, where A is a regularization 
parameter allowing us to adjust the degree of sparseness. 
3.3 Connection to RBF Networks 
Interestingly, there exists a close connection between RBF networks (e.g. [9, 1]) and 
KFD. If we add no regularization and expand in all training patterns, we find that 
an optimal c is given by c - K-y, where K is the symmetric, positive matrix of 
all kernel elements k(xi, xj) and y the :t:1 label vector 2. A RBF-network with the 
2To see this, note that N can be written as N = KDK where D - I-yy-y2y  has 
rank f - 2, while Yi is the vector of 1/v/'s for patterns from class i and zero otherwise. 
530 S. Mika, G. Rtitsch, d. Weston, B. Sch61kopf A. d. Smola andK.-R. Miiller 
Banana 
B.Cancer 
Diabetes 
German 
Heart 
Image 
Ringnorm 
F.Sonar 
Splice 
Thyroid 
Titanic 
Twonorm 
Waveform 
RBF 
10.8+-0.06 
27.6+.0.47 
24.3+.0.19 
24.7+.0.24 
17.6+.0.33 
3.3+.0.06 
1.7+.0.02 
34.4+.0.20 
10.0-2:0.10 
4.5+0.21 
23.3+.0.13 
2.9+.0.03 
10.7+0.11 
AB 
12.3:l:0.07 
30.4+.0.47 
26.5+.0.23 
27.54-0.25 
20.3+0.34 
2.7+.0.07 
1.9+.0.03 
35.7+.0.18 
22.6+.0.12 
3.04-0.03 
10.8+0.06 
ABR 
9+o. oj 
26.5+.0.45 
23.8+0.18 
24.3+0.21 
16.5+.0.35 
2.7'+.0.06 
1.6+. O. 01 
34.2+.0.22 
9.5+0.0? 
4.6+.0.22 
22.6+.0.12 
2.7+.0.02 
9.8+0.08 
SVM 
11.5+.0.07 
2 6. 0::t: 0.  ? 
23.5+0.17 
23.6+0.21 
16.0+0.33 
3. O::t: O. 06 
1.7+.0.01 
32.4+.0.18 
10.9+0.07 
4.8+.0.22 
22.4+.0.10 
3.0+.0.02 
9.9+0.0 
KFD 
10.8+.0.05 Table 1: Com- 
25.8+0.46 parison between 
23.2+.0.16 KFD, single 
23.7+.0.22 RBF classifier, 
16.1+.0.3 AdaBoost (AB) 
4.8+.0.06 ' 
1.5+.0.01 regul. Ada- 
33.2+.0.17 Boost (ABR) 
10.5+.0.06 and SVMs (see 
4.2+0.21 text). Best re- 
23.2+0.20 suit in bold face, 
2.6+0.02 second best in 
9.9+.0.0J italics. 
same kernel at each sample and fixed kernel width gives the same solution, if the 
mean squared error between labels and output is minimized. Also for the case of 
restricted expansions (8) there exists a connection to RBF networks with a smaller 
number of centers (cf. [4]). 
4 Experiments 
Kernel Fisher Discriminant Figure I shows an illustrative comparison of the 
features found by KFD, and Kernel PCA. The KFD feature discriminates the two 
classes, the first Kernel PCA feature picks up the important nonlinear structure. 
To evaluate the performance of the KFD on real data sets we performed an extensive 
comparison to other state-of-the-art classifiers, whose details are reported in [8]. 3 
We compared the Kernel Fisher Discriminant and Support Vector Machines, both 
with Gaussian kernel, to AdaBoost [5], and regularized AdaBoost [10] (cf. table 1). 
For KFD we used the regularized within-class scatter (7) and computed projections 
onto the optimal direction w 6 9 r by means of (6). To use w for classification we 
have to estimate a threshold. This can be done by e.g. trying all thresholds between 
two outputs on the training set and selecting the median of those with the smallest 
empirical error, or (as we did here) by computing the threshold which maximizes the 
margin on the outputs in analogy to a Support Vector Machine, where we deal with 
errors on the trainig set by using the SVM soft margin approach. A disadvantage 
of this is, however, that we have to control the regularization constant for the slack 
variables. The results in table I show the average test error and the standard 
If K has full rank, the null space of D, which is spanned by Yl and Y2, is the null space 
of N. For & ---- K-ly we get &-N& -- 0 and &-/ : 0. As we are free to fix the constraint 
c-/ to any positive constant (not just 1), & is also feasible. 
SThe breast cancer domain was obtained from the University Medical Center, 
Inst. of Oncology, Ljubljana, Yugoslavia. Thanks to M. Zwitter and M. Sok- 
lic for the data. All data sets used in the experiments can be obtained via 
http://www. first. gmd. de/~raetsch/. 
Figure 1: Comparison of feature 
found by KFD (left) and first 
Kernel PCA feature (right). De- 
picted are two classes (informa- 
tion only used by KFD) as dots " 
and crosses and levels of same 
feature value. Both with polyno- 
mial kernel of degree two, KFD 
with the regularized within class 
scatter (7) (/-- 10-3). '-" 
Invariant Feature Extraction and Classification in Kernel Spaces 531 
deviation of the averages' estimation, over 100 runs with different realizations of 
the datasets. To estimate the necessary parameters, we ran 5-fold cross validation 
on the first five realizations of the training sets and took the model parameters to 
be the median over the five estimates (see [10] for details of the experimental setup). 
Using prior knowledge. A toy example (figure 2) shows a comparison of Ker- 
nel PCA and oriented Kernel PCA, which used $ as the full covariance (3) and 
as noise matrix Sv the tangent covariance (4) of (i) rotated patterns and (ii) along 
the x-axis translated patterns. The toy example shows how imposing the desired 
invariance yields meaningful invariant features. 
In another experiment we incorporated prior knowledge in KFD. We used the USPS 
database of handwritten digits, which consists of 7291 training and 2007 test pat- 
terns, each 256 dimensional gray scale images of the digits 0...9. We used the 
regularized within-class scatter (7) (/ = 10- ) as Sv and added to it an multiple A 
of the tangent covariance (4), i.e. Sv = Nu + AT. As invariance transformations we 
have chosen horizontal and vertical translation, rotation, and thickening (cf. [12]), 
where we simply averaged the matrices corresponding to each transformation. The 
feature was extracted by using the restricted expansion (8), where the patterns zi 
were the first 3000 training samples. As kernel we have chosen a Gaussian of width 
0.3. 256, which is optimal for SVMs [12]. For each class we trained one KFD which 
classified this class against the rest and computed the 10-class error by the winner- 
takes-all scheme. The threshold was estimated by minimizing the empirical risk on 
the normalized outputs of KFD. 
Without invariances, i.e. A = 0, we achieved a test error of 3.7%, slightly better than 
a plain SVM with the same kernel (4.2%) [12]. For A = 10 -3, using the tangent 
covariance matrix led to a very slight improvement to 3.6%. That the result was not 
significantly better than the corresponding one for KFD (3.7%) can be attributed 
to the fact that we used the same expansion coefficients in both cases. The tangent 
covariance matrix, however, lives in a slightly different subspace. And indeed, a 
subsequent experiment where we used vectors which were obtained by clustering a 
larger dataset, including virtual examples generated by the appropriate invariance 
transformation, led to 3.1%, comparable to an SVM using prior knowledge (e.g. [12]; 
best SVM result 2.9% with local kernel and virtual support vectors). 
5 Conclusion 
In the task of learning from data it is equivalent to have prior knowledge about 
e.g. invariances or about specific sources of noise. In the case of feature extraction, 
we seek features which are sufficiently (noise-) invariant while still describing in- 
teresting structure. Oriented PCA and, closely related, Fisher's Discriminant, use 
particularly simple features, since they only consider first and second order statis- 
tics for maximizing the Rayleigh coefficient (1). Since linear methods can be too 
restricted in many real-world applications, we used Support Vector Kernel functions 
to obtain nonlinear versions of these algorithms, namely oriented Kernel PCA and 
Kernel Fisher Discriminant analysis. 
Our experiments show that the Kernel Fisher Discriminant is competitive or in 
Figure 2: Comparison of first 
features found by Kernel PCA 
and oriented Kernel PCA (see ' 
text); from left to right: KPCA, 
OKPCA with rotation and 
translation invariance; all with 
Gaussian kernel. 
532 S. Mika, G. Riitsch, J. Weston, B. SchOlkopf A. J. Smola and K.-R. Mailer 
some cases even superior to the other state-of-the-art algorithms tested. Interest- 
ingly, both SVM and KFD construct a hyperplane in r which is in some sense 
optimal. In many cases, the one given by the solution w of KFD is superior to 
the one of SVMs. Encouraged by the preliminary results for digit recognition, we 
believe that the reported results can be improved, by incorporating different invari- 
ances and using e.g. local kernels [12]. 
Future research will focus on further improvements on the algorithmic complexity 
of our new algorithms, which is so far larger than the one of the SVM algorithm, 
and on the connection between KFD and Support Vector Machines (cf. [16, 15]). 
Acknowledgments This work was partially supported by grants of the DFG (JA 
379/5-2,7-1,9-1) and the EC STORM project number 25387 and carried out while 
BS and AS were with GMD First. 
References 
[1] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford Univ. Press, 1995. 
[2] B. Boser, I. Guyon, and V.N. Vapnik. A training algorithm for optimal margin 
classifiers. In D. Haussler, editor, Proc. COLT, pages 144-152. ACM Press, 1992. 
[3] K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks. Wiley, New 
York, 1996. 
[4] B.Q. Fang and A.P. Dawid. Comparison of full bayes and bayes-least squares criteria 
for normal discrimination. Chinese Journal of Applied Probability and Statistics, 
12:401-410, 1996. 
[5] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning 
and an application to boosting. In EuroCOLT 93. LNCS, 1994. 
[6] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical 
Association, 84(405):165-175, 1989. 
[7] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San 
Diego, 2nd edition, 1990. 
[8] S. Mika, G. Riitsch, J. Weston, B. SchSlkopf, and K.-R. Mfiller. Fisher discriminant 
analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, 
Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999. 
[9] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. 
Neural Computation, 1(2):281-294, 1989. 
[10] G. Riitsch, T. Onoda, and K.-R. Mfiller. Soft margins for adaboost. Technical Report 
NC-TR-1998-021, Royal Holloway College, University of London, UK, 1998. 
[11] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific 
& Technical, Harlow, England, 1988. 
[12] B. SchSlkopf. Support vector learning. Oldenbourg Verlag, 1997. 
[13] B. SchSlkopf, C.J.C. Burges, and A.J. Smola, editors. Advances in Kernel Methods - 
Support Vector Learning. MIT Press, 1999. 
[14] B. SchSlkopf, A.J. Smola, and K.-R. Mfiller. Nonlinear component analysis as a kernel 
eigenvalue problem. Neural Computation, 10:1299-1319, 1998. 
[15] A. Shashua. On the relationship between the support vector machine for classification 
and sparsifted fisher's linear discriminant. Neural Processing Letters, 9(2):129-139, 
April 1999. 
[16] S. Tong and D. Koller. Bayes optimal hyperplanes - maximal margin hy- 
perplanes. Submitted to IJCAI'99 Workshop on Support Vector Machines 
(robotics. stanford. edu/~koller/), 1999. 
