Understanding stepwise generalization of 
Support Vector Machines: a toy model 
Sebastian Risau-Gusman and Mirta B. Gordon 
DRFMC/SPSMS CEA Grenoble, 17 av. des Martyrs 
38054 Grenoble cedex 09, France 
Abstract 
In this article we study the effects of introducing structure in the 
input distribution of the data to be learnt by a simple perceptron. 
We determine the learning curves within the framework of Statis- 
tical Mechanics. Stepwise generalization occurs as a function of 
the number of examples when the distribution of patterns is highly 
anisotropic. Although extremely simple, the model seems to cap- 
ture the relevant features of a class of Support Vector Machines 
which was recently shown to present this behavior. 
1 Introduction 
A new approach to learning has recently been proposed as an alternative to feedfor- 
ward neural networks: the Support Vector Machines (SVM) [1]. Instead of trying to 
learn a non-linear mapping between the input patterns and internal representations,
like in multilayered perceptrons, the SVMs choose a priori a non-linear kernel that 
transforms the input space into a high dimensional feature space. In binary classi- 
fication tasks like those considered in the present paper, the SVMs look for linear 
separation with optimal margin in feature space. The main advantage of SVMs 
is that learning becomes a convex optimization problem. The difficulty of having
many local minima, which hinders the training of multilayered neural networks,
is thus avoided. One of the questions raised by this approach is why SVMs do not
overfit the data in spite of the extremely large dimensions of the feature spaces 
considered. 
Two recent theoretical papers [2, 3] studied a family of SVMs with the tools of 
Statistical Mechanics, predicting typical properties in the limit of large dimensional 
spaces. Both papers considered mappings generated by polynomial kernels, and 
more specifically quadratic ones. In these, the input vectors x ∈ R^N are trans-
formed into N(N+1)/2-dimensional feature vectors Φ(x). More precisely, the map-
ping Φ₁(x) = (x, x₁x, x₂x, ..., x_k x) has been studied in [3] as a function of k,
the number of quadratic features, and Φ₂(x) = (x, x₁x/N, x₂x/N, ..., x_N x/N) has
been considered in [2], leading to different results. These mappings are particu-
lar cases of quadratic kernels. In particular, in the case of learning quadratically
separable tasks with mapping Φ₂, the generalization error decreases towards a lower
bound when the number of examples grows proportionally to N, followed by a further decrease
when the number of examples increases proportionally to the dimension of the feature
space, i.e. to N². In fact, this behavior is not specific to SVMs. It also arises
in the typical case of Gibbs learning (defined below) in quadratic feature spaces [4]: 
on increasing the training set size, the quadratic components of the discriminating 
surface are learnt after the linear ones. In the case of learning linearly separable 
tasks in quadratic feature spaces, the effect of overfitting is harmless, as it only 
slows down the decrease of the generalization error with the training set size. In 
the case of mapping (1, overfitting is dramatic, as the generalization error at any 
given training set size increases with the number k of features. 
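As a concrete illustration of the two mappings, the following minimal sketch (ours, not code from [2, 3]; the names phi_1 and phi_2 are hypothetical, and the 1/N compression of the quadratic block follows the scaling written above) builds both feature vectors for a single pattern:

import numpy as np

def phi_1(x, k):
    # Phi_1 of [3]: (x, x_1 x, ..., x_k x); k quadratic blocks, no compression
    return np.concatenate([x] + [x[i] * x for i in range(k)])

def phi_2(x):
    # Phi_2 of [2]: (x, x_1 x / N, ..., x_N x / N); quadratic part compressed by 1/N
    N = x.shape[0]
    return np.concatenate([x] + [x[i] * x / N for i in range(N)])

x = np.random.randn(10)
print(phi_1(x, k=3).shape, phi_2(x).shape)   # (40,) (110,)

The only difference between the two is the fixed compression of the quadratic block, which is precisely the effect the toy model below is designed to mimic.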
The aim of the present paper is to understand the influence of the mapping scaling- 
factor on the generalization performance of the SVMs. To this end, it is worth
remarking that the features Φ₂ may be obtained by compressing the quadratic subspace
of Φ₁ by a fixed factor. In order to mimic this contraction, we consider a linearly
separable task in which the input patterns have a highly anisotropic distribution, so 
that the variance in one subspace is much smaller than in the orthogonal directions. 
We show that in this simple toy model, the generalization error as a function of 
the training set size exhibits a cross-over between two different behaviors: a rapid 
decrease corresponding to learning the components in the uncompressed space, fol- 
lowed by a slow improvement in which mainly the components in the compressed 
space are learnt. The latter would correspond, in this highly stylized model, to 
learning the scaled quadratic features in the SVM with mapping Φ₂.
The paper is organized as follows: after a short presentation of the model, we de- 
scribe the main steps of the Statistical Mechanics calculation. The order parameters 
characterizing the properties of the learning process are defined, and their evolution
as a function of the training set size is analyzed. The two regimes of the generaliza- 
tion error are described, and we determine the training set size per input dimension 
at the crossover, as a function of the pertinent parameters. Finally we discuss our 
results, and their relevance to the understanding of the generalization properties of 
SVMs. 
2 The model 
We consider the problem of learning a binary classification task from examples. 
The training data set D_α contains P = αN N-dimensional patterns (ξ^μ, τ^μ)
(μ = 1, ..., P), where τ^μ = sign(ξ^μ · w*) is given by a teacher of weights
w* = (w*_1, w*_2, ..., w*_N). Without any loss of generality we consider normalized
teachers: w* · w* = N. We assume that the components ξ_i (i = 1, ..., N) of the input
patterns ξ are independent, identically distributed random variables drawn from
a zero-mean Gaussian distribution, with variance σ² along N_c directions and unit
variance in the N_u remaining ones (N_c + N_u = N):

P(\boldsymbol{\xi}) = \prod_{i \in N_c} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\xi_i^2}{2\sigma^2}\right) \prod_{i \in N_u} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\xi_i^2}{2}\right).    (1)
We take σ² < 1 without any loss of generality, as the case σ² > 1 may be deduced
from the former through a straightforward rescaling of N_c and N_u. Hereafter, the
subspace of dimension N_c and variance σ² will be called the compressed subspace. The
corresponding orthogonal subspace, of dimension N_u = N − N_c, will be called the
uncompressed subspace.
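A minimal numerical sketch of this data model (our own illustration; names such as sample_patterns are hypothetical) draws patterns with variance σ² on the first N_c coordinates and unit variance on the rest, and labels them with a random teacher normalized to w*·w* = N:

import numpy as np

def sample_patterns(P, N, n_c, sigma2, rng):
    # variance sigma2 on the first Nc coordinates (compressed), unit variance on the rest
    Nc = int(n_c * N)
    std = np.ones(N)
    std[:Nc] = np.sqrt(sigma2)
    return rng.standard_normal((P, N)) * std

rng = np.random.default_rng(0)
N, n_c, sigma2 = 200, 0.9, 1e-2
w_star = rng.standard_normal(N)
w_star *= np.sqrt(N) / np.linalg.norm(w_star)     # teacher normalization w*.w* = N
xi = sample_patterns(P=1000, N=N, n_c=n_c, sigma2=sigma2, rng=rng)
tau = np.sign(xi @ w_star)                         # teacher labels tau = sign(xi . w*)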
We study the typical generalization error of a student perceptron learning the clas- 
sification task, using the tools of Statistical Mechanics. The pertinent cost function 
is the number of misclassified patterns: 
E(\mathbf{w}; \mathcal{D}_\alpha) = \sum_{\mu=1}^{P} \Theta\left(-\tau^\mu\, \mathbf{w} \cdot \boldsymbol{\xi}^\mu\right),    (2)

where Θ is the Heaviside step function.
The weight vectors in version space correspond to a vanishing cost (2). Choosing a
w at random from the a posteriori distribution

P(\mathbf{w} \mid \mathcal{D}_\alpha) = Z^{-1} P_0(\mathbf{w}) \exp\left(-\beta E(\mathbf{w}; \mathcal{D}_\alpha)\right)    (3)

in the limit β → ∞ is called Gibbs learning. In eq. (3), β is equivalent to an
inverse temperature in the Statistical Mechanics formulation, the cost (2) being the
energy function. We assume that P_0, the a priori distribution of the weights, is
uniform on the hypersphere of radius √N:

P_0(\mathbf{w}) = (2\pi e)^{-N/2}\, \delta(\mathbf{w} \cdot \mathbf{w} - N).    (4)

The normalization constant (2πe)^{N/2} is the leading order term of the hypersphere's
surface in N-dimensional space. Z is the partition function ensuring the correct
normalization of P(w | D_α):

Z(\beta; \mathcal{D}_\alpha) = \int d\mathbf{w}\, P_0(\mathbf{w}) \exp\left(-\beta E(\mathbf{w}; \mathcal{D}_\alpha)\right).    (5)
In general, the properties of the student are related to those of the free energy
F(β; D_α) = −ln Z(β; D_α)/β. In the limit N → ∞ with the training set size per
input dimension α = P/N constant, the properties of the student weights become
independent of the particular training set D_α. They are deduced from the averaged
free energy per degree of freedom, calculated using the replica trick:

f(\beta) = -\frac{1}{N\beta}\, \overline{\ln Z(\beta; \mathcal{D}_\alpha)} = -\frac{1}{N\beta} \lim_{n \to 0} \frac{\ln \overline{Z^n(\beta; \mathcal{D}_\alpha)}}{n},    (6)

where the overline represents the average over D_α, composed of patterns selected
according to (1). In the case of Gibbs learning, the typical behavior of any intensive
quantity is obtained in the zero temperature limit β → ∞. In this limit, only error-
free solutions, with vanishing cost, have non-vanishing posterior probability (3).
Thus, Gibbs learning corresponds to picking at random a student in version space,
i.e. a vector w that classifies correctly the training set D_α, with a probability
proportional to P_0(w).
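Sampling the version space uniformly is costly; a rough numerical proxy (our simplification, not the procedure analyzed in this paper) is to run the perceptron algorithm from a random initialization until the training set is classified without error, and to estimate ε_g on fresh patterns drawn from the same distribution:

import numpy as np

def zero_error_student(xi, tau, rng, max_updates=100_000):
    # perceptron algorithm run until the training set is classified with zero error
    P, N = xi.shape
    w = rng.standard_normal(N)
    for _ in range(max_updates):
        wrong = np.flatnonzero(tau * (xi @ w) <= 0)
        if wrong.size == 0:
            break
        mu = wrong[0]
        w += tau[mu] * xi[mu]                      # standard perceptron update
    return w * np.sqrt(N) / np.linalg.norm(w)      # rescale to w.w = N

def generalization_error(w, w_star, xi_test):
    # fraction of test patterns on which student and teacher disagree
    return np.mean(np.sign(xi_test @ w) != np.sign(xi_test @ w_star))

Together with the data-generation sketch above, this yields simulated learning curves ε_g(α) that can be compared with the theory below.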
In the case of an isotropic pattern distribution, which corresponds to σ² = 1 in
(1), the properties of cost function (2) have been extensively studied [5]. The cases
of patterns drawn from two Gaussian clusters whose symmetry axis is the same as [6],
or different from [7], the teacher's axis have recently been addressed. Here we consider
the problem where, instead of having a single direction along which the patterns'
distribution is contracted (or expanded), there is a finite fraction of compressed
dimensions. In this case, all the properties of the student's
perceptron may be expressed in terms of the following order parameters, which have
to satisfy corresponding extremum conditions of the free energy: 
qb 1 
= (7) 
i6 N. 
ua 1 
= / N w,aw,) (s) 
i N 
324 S. Risau-Gusman and M. B. Gordon 
1 
= waw;) (9) 
iNc 
1 
k = < E waw) (10) 
i N 
1 
Q = ( E (wi)2) (11) 
iN 
where ⟨···⟩ indicates the average over the posterior (3); a, b are replica indices,
and the subscripts c and u stand for compressed and uncompressed respectively.
Notice that we do not impose that Q, the typical squared norm of the student's
components in the compressed subspace, be equal to the corresponding teacher's
norm Q* = Σ_{i∈N_c} (w*_i)²/N.
3 Order parameters and learning curves 
Assuming that the order parameters are invariant under permutation of replicas, 
we can drop the replica indices in equations (7) to (11). We expect that this 
hypothesis of replica symmetry is consistent, as it is in other cases of perceptrons
learning realizable tasks. The problem is thus reduced to the determination of 
five order parameters. Their meaning becomes clearer if we consider the following 
combinations: 
q_c = \frac{q_c^{ab}}{Q},    (12)

q_u = \frac{q_u^{ab}}{1 - Q},    (13)

R_c = \frac{k_c}{\sqrt{Q}\sqrt{Q^*}},    (14)

R_u = \frac{k_u}{\sqrt{1 - Q}\sqrt{1 - Q^*}},    (15)

Q = \frac{1}{N} \left\langle \sum_{i \in N_c} (w_i)^2 \right\rangle.    (16)
q_c and q_u are the typical overlaps between the components of two student vectors in
the compressed and the uncompressed subspaces respectively. Similarly, R_c and R_u
are the corresponding overlaps between a typical student and the teacher. In terms
of this set of parameters, the typical generalization error is ε_g = (1/π) arccos R, with

R = \frac{\sigma^2 R_c \sqrt{Q Q^*} + R_u \sqrt{(1 - Q)(1 - Q^*)}}{\sqrt{\sigma^2 Q + (1 - Q)}\, \sqrt{\sigma^2 Q^* + (1 - Q^*)}}.    (17)
Given α, the general solution of the extremum conditions depends on the three
parameters of the problem, namely σ², Q* and n_c ≡ N_c/N. An interesting case
is the one where the teacher's anisotropy is consistent with that of the patterns'
distribution, i.e. Q* = n_c. In this case, it is easy to show that Q = Q*, q_c = R_c and
q_u = R_u. Thus,

R = \frac{n_u R_u + \sigma^2 n_c R_c}{n_u + \sigma^2 n_c},    (18)

where n_u ≡ N_u/N, and R_u and R_c are given by the following equations:

\frac{R_c}{1 - R_c} = \frac{\sigma^2}{\sigma^2 n_c + n_u}\, \frac{\alpha}{\pi \sqrt{1 - R}} \int Dt\, \frac{\exp(-R t^2 / 2)}{H(t \sqrt{R})},    (19)
Figure 1: Order parameters and generalization error for the case Q* = n_c = 0.9,
σ² = 10⁻². The curves for the case of spherically distributed patterns are shown for
comparison. The inset shows the first step of learning and its plateau (see text).
\frac{R_c}{1 - R_c} = \sigma^2\, \frac{R_u}{1 - R_u},    (20)

where Dt = dt\, e^{-t^2/2} / \sqrt{2\pi} and H(x) = \int_x^{\infty} Dt. If σ² = 1, we recover the equations
corresponding to Gibbs learning of isotropic pattern distributions [5].
The order parameters are represented as a function of α in Figure 1, for a particular
choice of n_c and σ². R_u grows much faster than R_c, meaning that it is easier to
learn the components in the uncompressed subspace. As a result, R (and therefore the
generalization error ε_g) presents a cross-over between two behaviors. At small α,
both R_u ≪ 1 and R_c ≪ 1, so that R(α, σ²) ≃ R_G(α (n_u + σ⁴ n_c)/(n_u + σ² n_c)²), where
R_G is the overlap for Gibbs learning with an isotropic (σ² = 1) distribution [5].
Learning the anisotropic distribution is faster (in α) than learning the isotropic
one. If σ² ≪ 1 the anisotropy is very large and R increases like R_G but with an
effective training set size per input dimension ≃ α/n_u > α. On increasing α,
there is an intermediate regime in which R_u increases but R_c ≪ 1, so that R ≃
R_u n_u/(n_u + σ² n_c). The corresponding generalization error seems to reach a plateau
corresponding to R_u = 1 and R_c = 0. At α ≫ 1, R(α, σ²) ≃ R_G(α); the asymptotic
behavior is independent of the details of the distribution, as in [7]. The crossover
between these two regimes, when σ² ≪ 1, occurs at α₀ ≃ √2 (n_u + σ² n_c)/(σ² n_c).
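For the parameters of Figure 1 (n_c = 0.9, n_u = 0.1, σ² = 10⁻²), plugging numbers into the expressions above gives, as a rough consistency check of ours: a plateau at R ≃ n_u/(n_u + σ² n_c) = 0.1/0.109 ≃ 0.92, i.e. ε_g ≃ (1/π) arccos(0.92) ≃ 0.13, and a crossover scale α₀ ≃ √2 × 0.109/0.009 ≃ 17.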
The cases Q* = 1 and Q* = 0 are also of interest. Q* = 1 corresponds to a teacher
having all its weight components in the compressed subspace, whereas Q* = 0
Figure 2: Generalization errors as a function of α for different teachers (Q* = 1,
Q* = 0.9 and Q* = 0), for the case n_c = 0.9 and σ² = 10⁻². The curve for
spherically distributed patterns [5] is included for comparison. The inset shows the
large-α behavior.
corresponds to a teacher orthogonal to the compressed subspace, i.e. with all the 
components in the uncompressed subspace. They correspond respectively to tasks 
where either the uncompressed or the compressed components are irrelevant for the 
patterns' classification. In Figure 2 we show all the generalization error curves,
including the generalization error ε_g^G for an isotropic distribution [5] for comparison.
The behaviour of ε_g(α) is very sensitive to the value of Q*. If Q* = 1, the teacher
lies in the compressed subspace, where learning is difficult. Consequently, ε_g(α) >
ε_g^G(α), as expected. On the contrary, for Q* = 0, only the components in the
uncompressed space are relevant for the classification task. In this subspace learning
is easy and ε_g(α) < ε_g^G(α). For Q* ≠ 0, 1 there is a crossover between these regimes,
as already discussed. All the curves merge in the asymptotic regime α → ∞, as
may be seen in the inset of Figure 2.
4 Discussion 
We analyzed the typical learning behavior of a toy perceptron model that helps to
clarify some aspects of generalization in high dimensional feature spaces. In
particular, it captures an element essential to obtaining stepwise learning, which is
shown to stem from the compression of high order features. The components in the
compressed space are more difficult to learn than those that are not compressed. Thus,
if the training set is not large enough, mainly the latter are learnt.
Our results allow us to understand the importance of the scaling of high order features
in the SVM kernels. In fact, with SVMs one has to choose a priori the kernel that
maps the input space to the feature space. If high order features are suitably
compressed, hierarchical learning occurs. That is, low order features are learnt
first; higher order features are only learnt if the training set is large enough. In the
cases where the higher order features are irrelevant, it is likely that they will not
hinder the learning process. This interesting behavior allows overfitting to be avoided.
Computer simulations currently in progress, of SVMs generated by quadratic kernels
with and without the 1/N scaling, show a behavior consistent with the theoretical
predictions [2, 3]. These may be understood with the present toy model.
References 
[1] V. Vapnik (1995) The nature of statistical learning theory. Springer Verlag, 
New York. 
[2] R. Dietrich, M. Opper, and H. Sompolinsky (1999) Statistical Mechanics of 
Support Vector Networks. Phys. Rev. Lett. 82, 2975-2978. 
[3] A. Buhot and M. B. Gordon (1999) Statistical mechanics of support vector 
machines. ESANN'99-European Symposium on Artificial Neural Networks Pro- 
ceedings, Michel Verleysen ed. 201-206; A. Buhot and M. B. Gordon (1998) 
Learning properties of support vector machines. Cond-Mat/9802179. 
[4] H. Yoon and J.-H. Oh (1998) Learning of higher order perceptrons with tunable
complexities. J. Phys. A: Math. Gen. 31, 7771-7784.
[5] G. Györgyi and N. Tishby (1990) Statistical Theory of Learning a Rule. In
Neural Networks and Spin Glasses (W. K. Theumann and R. Köberle, eds., World
Scientific), 3-36.
[6] R. Meir (1995) Empirical risk minimization. A case study. Neural Comp. 7,
144-157.
[7] C. Marangi, M. Biehl, S. A. Solla (1995) Supervised Learning from Clustered
Examples. Europhys. Lett. 30 (2), 117-122.
