The Entropy Regularization 
Information Criterion 
Alex J. Smola 
Dept. of Engineering and RSISE 
Australian National University 
Canberra ACT 0200, Australia 
Alex. Smola@anu.edu.au 
John Shawe-Tayior 
Royal Holloway College 
University of London 
Egham, Surrey TW20 0EX, UK 
john @dcs.rhbnc.ac.uk 
Bernhard Schiilkopf 
Microsoft Research Limited 
St. George House, 1 Guildhall Street 
Cambridge CB2 3NH 
bsc@microsoft.com 
Robert C. Williamson 
Dept. of Engineering 
Australian National University 
Canberra ACT 0200, Australia 
Bob. Williamson @anu.edu.au 
Abstract 
Effective methods of capacity control via uniform convergence bounds 
for function expansions have been largely limited to Support Vector ma- 
chines, where good bounds are obtainable by the entropy number ap- 
proach. We extend these methods to systems with expansions in terms of 
arbitrary (parametrized) basis functions and a wide range of regulariza- 
tion methods covering the whole range of general linear additive models. 
This is achieved by a data dependent analysis of the eigenvalues of the 
corresponding design matrix. 
1 INTRODUCTION 
Model selection criteria based on the Vapnik-Chervonenkis (VC) dimension are known to 
be difficult to obtain, worst case, and often not very tight. Yet they have the theoretical 
appeal of providing bounds, with few or no assumptions made. 
Recently new methods [8, 7, 6] have been developed which are able to provide a better 
characterization of the complexity of function classes than the VC dimension, and more- 
over, are easily obtainable and take advantage of the data at hand (i.e. they employ the 
concept of luckiness). These techniques, however, have been limited to linear functions 
or expansions of functions in terms of kernels as happens to be the case in Support Vector 
(SV) machines. 
In this paper we show that the previously mentioned techniques can be extended to expan- 
sions in terms of arbitrary basis functions, covering a large range of practical algorithms 
such as general linear models, weight decay, sparsity regularization [3], and regularization 
networks [4]. 
The Entropy Regularization Information Criterion 343 
2 SUPPORT VECTOR MACHINES 
Support Vector machines carry out an effective means of capacity control by minimizing a 
weighted sum of the training error 
Remp[f] .- 1  
'- -- c(xi,Yi, f(xi)) (1) 
m 
i=1 
and a regularization term Q[f] =  ][wl[ 2' i.e. they minimize the regularized risk functional 
Rreg[f] := Reinpill q- AQ[f] - 1 Z c(xi, yi,f(xi)) + 11wll 2. 
m 
i=1 
(2) 
Here X := {x,... Xm) C X denotes the training set, Y := {y,... Ym) C  the cor- 
responding labels (target values), C,  the corresponding domains,  > 0 a regularization 
constant, c: C x  x  -+ I a cost function, and f: C -+  is given by 
f(x) := (x, w>, or in the nonlinear case f(x) :-- ((x), w>. 
(3) 
Here : C -+ :T is a map into a feature space :T. Finally, dot products in feature space can 
be written as ((x), (x')) = k(x, x') where k is a so-called Mercer kernel. 
For n E IN, I ' denotes the n-dimensional space of vectors x = (x,..., x,). We de- 
fine spaces  as follows: as vector spaces, they are identical to I ', in addition, they are 
endowed with p-norms: 
) /P 
Ilxlle; :- Ilxllp- Ixjl p for0 <p < oo 
j=l 
Ilxlle:o :- maxj:,...,n Ixjl forp = oo 
We write ep = e. Furthermore let Ut; := {x: Ilxllt; <_  } be the unit -ball. 
For model selection purposes one wants to obtain bounds on the richness of the map $x 
Sx: w -> (f(x),..., f(xm)) = (((Y(x), w),..., ((I)(xm),W}). 
(4) 
where w is restricted to an 2 unit ball of some radius A (this is equivalent to choosing an 
appropriate value of  an increase in  decreases A and vice versa). By the "richness" 
of Sx specifically we mean the  e-covering numbers N(e, Sx(AUG),) of the set 
Sx (AU,). In the standard COLT notation, we mean 
51'(e,$x(AU;),moo) := min {n 
There exists a set {z,... z,} C F such that for all } 
z  $x(AUG ) we have min IIz- zillt < e 
l<i<n -- 
See [8] for further details. 
When carrying out model selection in this case, advanced methods [6] exploit the distribu- 
tion of X mapped into feature space :T, and thus of the spectral properties of the operator 
Sx by analyzing the spectrum of the Gram matrix G = [gij]ij, where gij :- k (xi, x j). 
All this is possible since k(xi,xj) can be seen as a dot product of xi,xj mapped into 
some feature space :T, i.e. k(xi, Xj) -- ((I)(xi), (Xj)). This property, whilst true for SV 
machines with Mercer kernels, does not hold in general case where f is expanded in terms 
of more or less arbitrary basis functions. 
344 A. J. Smola, J. Shawe-Taylor, B. Sch6lkopf and R. C. Vlliamson 
3 THE BASIC PROBLEMS 
One basic problem is that when expanding f into 
f(a:) --  aifi(a:) where cti  IR (5) 
i=1 
with fi(a:) being arbitrary functions, it is not immediately obvious how to regard f as a 
dot product in some feature space. One can show that the VC dimension of a set of r 
linearly independent functions is r. Hence one would intuitively try to restrict the class of 
admissible models by controlling the number of basis functions n in terms of which f can 
be expanded. 
Now consider an extreme case. In addition to the n basis functions fi defined previously, 
we are given r further basis functions f, linearly independent of the previous ones, which 
differ from fi only on a small domain 2C , i.e. filx\x' = fJlx\x'. Since this new set of 
functions is linearly independent, the VC dimension of the joint set is given by 2r. On the 
other hand, if hardly any data occurs on the domain 2C , one would not notice the difference 
between fi and ft. In other words, the joint system of functions would behave as if we 
only had the initial system of r basis functions. 
An analogous situation occurs if f = fi + #i where  is a small constant and #i was 
bounded, say, within [0, 1]. Again, in this case, the additional effect of the set of functions 
fJ would be hardly noticable, but still, the joint set of functions would count as one with VC 
dimension 2r. This already indicates, that simply counting the number of basis functions 
may not be a good idea after all. 
Figure 1: From left to right: (a) initial set of functions f,..., f5 (dots on the :r-axis 
indicate sampling points); (b) additional set of functions f,..., f which differ globally, 
but only by a small amount; (c) additional set of functions f,..., f which differ locally, 
however by a large amount; (d) spectrum of the corresponding design matrices - the bars 
denote the cases (a)-(c) in the corresponding order. Note that the difference is quite small. 
On the other hand, the spectra of the corresponding design matrices (see Figure 1) are very 
similar. This suggests the use of the latter for a model selection criterion. 
Finally we have the practical problem that capacity control, which in SV machines was 
carried out by minimizing the length of the "weight vector" w in feature space, cannot be 
done in an analogous way either. There are several ways to do this. Below we consider 
three that have appeared in the literature and for which there exist effective algorithms. 
1 2. 
Example 1 (Weight Decay) Define Q[f] :-  5-]i cti , i.e. the coefficients cti of the func- 
tion expansion are constrained to an 2 ball. In this case we can consider the following 
operator $()'  -+ moo, where 
$?'ct->(f(xi),...,f(xm))-((f(xx),ct},...,(f(xm),ct))=Fa (6) 
Here f(x) := (fi(x),... fn(X)), Fij := fi(xj), ct := (cti,...,Ctn) and c  AUto for 
some A > O. 
The Entropy Regularization Information Criterion 345 
Example 2 (Sparsity Regularization) In this case Q[f] := Yi ]ci[, i.e the coefficients 
ci of the function expansion are constrained to an  ball to enforce sparseness [3]. Thus 
$()'  ---> o with $() mapping c as in (6) except c  AUto. This is similar to expan- 
sions encountered in boosting or in linear programming machines. 
Example 3 (Regularization Networks) Finally one could set Q [f] :-- a -r Q a for some 
positive definite matrix Q. For instance, Q ij could be obtained from { P f i , P f j ) where P is 
a regularization operator penalizing non-smooth functions [4]. In this case a lives inside 
some n-dimensional ellipsoid. By substituting a' := Q  a one can reduce this setting to the 
case of example 1 with a different set of basis functions (f' (x) -- Q- f(x)) and consider 
an evaluation operator S?'  -+ t given by 
S(f):a '  (f(x),...,f(x,)) - ((Q- f(x),a'},...,(Q- f(xm),a')) 
(7) 
where c   AUt for some A > 0 and Fij = fi(xj) as in example 1. 
Example 4 (Support Vector Machines) An important special case of example 3 are Sup- 
port Vector Machines where we have Qij = k(xi,xj) and fi(x) -- k(xi, x), hence Q = F. 
Hence the possible values generated by a Support Vector Machine can be written as 
S?:a'  (f(x),...,f(xm)) = ((Q- f(x),a'),...,(Q- f(xm),a'>)= 
(8) 
where a   AUt for some A > O. 
4 ENTROPY NUMBERS 
Covering numbers characterize the difficulty of learning elements of a function class. En- 
tropy numbers of operators can be used to compute covering numbers more easily and 
more tightly than the traditional techniques based on VC-like dimensions such as the fat 
shattering dimension [1]. Knowing el(Sx) = e (see below for the definition) tells one 
that log N(e, F, ) <_ l, where F is the effective class of functions used by the regu- 
larised learning machines under consideration. In this section we summarize a few basic 
definitions and results as presented in [8] and [2]. 
The Ith entropy number el (F) of a set F with a corresponding metric d is the precision up 
to which F can be approximated b_y l elements of F; i.e. for all f E F there exists some 
fi  {f,...,.h} such that d(f, fi) <_ el. Hence el(F) is the functional inverse of the 
covering number of F. 
The entropy number of an bounded linear operator T: A --> B between normed linear 
spaces A and B is defined as el(T) := ei(T(UA)) with the metric d being induced by 
II  liB. The dyadic entropy numbers el are defined by el :-- e2t+ (the latter quantity is 
often more convenient to deal with since it corresponds to the log of the covering number). 
We make use of the following three results on entropy numbers of the identity mapping 
from n into n diagonal operators, and products of operators. Let 
Pl P2' 
 TI   ' i TI  
ldpx,p.'pl --+ p ' dp,p: x - x 
The following result is due to Schtitt; the constants 9.94 and 1.86 were obtained in [9]. 
Proposition 1 (Entropy numbers for identity operators) Be   IN. Then 
! 
el(ida,2) _< 9.94 log 1 + - 
1 
& el(id,o ) <_ 1.86 log 1 +  
(9) 
346 A. J. Smola, J. Shawe-Taylor, B. Sch61kopf and R. C. I4411iamson 
Proposition 2 (Carl and Stephani [2, p. 11]) Let E, F, G be Banach spaces, R: F --> 
G, and S: E -> F. Then, for n, t 6 N, 
Note that the latter two inequalities follow directly from the fact that q (R) = IlRII for all 
R: F -+ G by definition of the operator norm IIRII. 
Proposition 3 Let al > a2 _> ... >_ aj _> ... _> O, 1 _< p _< c and 
Dx = (ax, a2x2,..., ajxj,...) 
(11) 
for x = (x, x2,..., xj,...)  p be the diagonal operator from p into itself, generated 
by the sequence (aj)j. Then for all n  N, 
n--1 1 n--1 1 
sup _< _< ... o-j);. 
(12) 
5 THE MAIN RESULT 
We can now state the main theorem which gives bounds on the entropy numbers of $() for 
the first three examples of model selection described above (since Support Vector Machines 
are a special case of example 3 we will not deal with it separately). 
Proposition 4 Let f be expanded in a linear combination of basis functions as f :- 
n 
-i= aifi and the coefficients a restricted to one of the convex sets as described in the 
examples 1 to 3. Moreover denote by Fij := fj(xi) the design matrix on a particular 
sample X, and by Q the regularization matrix in the case of example 3. Then the following 
bound on $x holds. 
1. In the case of weight decay (ex. 1) (with li + 12 >_ 1 + 1) 
et(S( )) _< 1.96 (l -x log(1 + rnll))  et2 (E). 
(13) 
2. In the case of weight sparsity regularization (ex. 2) (with l + 12 q- 13 _> 1 + 2) 
et(S( )) < 18.48 (I -x log(1 + mllx))  
1 
eta(E) (/-1og(1 + m/13))  . (14) 
3. Finally, in the case of regularization networks (ex. 3) (with l + 12 _> 1 + 1) 
e,(S(ff )) _< 1.96 (l - log(1 + m/l))  
(15) 
Here E is a diagonal scaling operator (matrix) with (i, i) entries v/ and ( v/-)i are the 
eigenvalues (sorted in decreasing order) of the matrix FF -r in the case of examples 1 and 
2, and FQ-F -r in the case of example 3. 
The entropy number of E is readily bounded in terms of (al)i by using (3). One can see 
that the first setting (weight decay) is a special case of the third one, namely when Q - 1, 
i.e. when Q is just the identity matrix. 
Proof The proof relies on a factorization of S? (i = 1, 2, 3) in the following way. First 
we consider the equivalent operator x mapping from  to  and perform a singular 
value decomposition [5] of the latter into x -- VEW where V, W are operators of norm 
1, and E contains the singular values of S?, i.e. the singular values of F and FQ- 
The Entropy Regularization Information Criterion 347 
respectively. 
FF T or FQ-1F T. Consequently we can factorize S? as in the diagram 
The latter, however, are identical to the square root of the eigenvalues of 
Finally, in order to compute the entropy number of the overall operator one only has to 
use the factorization of Sx into S? = id2,VEW for i  {1,3} and into S( ) = 
id2m,VEWid,2 for example 2, and apply Proposition 2 several times. We also exploit 
the fact that for singular value decompositions 11vII, IlWll <_ 1. I 
The present theorem allows us to compute the entropy numbers (and thus the complexity) 
of a class of functions on the current sample X. Going back to the examples of section 3, 
which led to large bounds on the VC dimension one can see that the new result is much less 
susceptible to such modifications: the addition of f{,... f to f,... f, does not change 
the eigenspectrum E of the design matrix significantly (possibly only doubling the nominal 
value of the singular values), if the functions f differ from fi only slightly. Consequently 
also the bounds will not change significantly even though the number of basis functions 
just doubled. 
Also note that the current error bounds reduce to the results of [6] in the SV case: here 
Oij '- Fij -- k(xi, Xj) (both the design matrix F and the regularization matrix Q are 
determined by kernels) and therefore FQ-F -- Q. Thus the analysis of the singular 
values of FQ-F leads to an analysis of the eigenvalues of the kernel matrix, which is 
exactly what is done when dealing with SV machines. 
6 ERROR BOUNDS 
To use the above result we need a bound on the expected error of a hypothesis f in terms 
of the empirical error (training error) and the observed entropy numbers e,(:Y). We use [6, 
Theorem 4.1] with a small modification. 
Theorem 1 Let Y be a set of linear functions as described in the previous examples with 
en (Sx) as the corresponding bound on the observed entropy numbers of Y on the dataset 
X. Moreover suppose that for a fixed threshold b  11 for some f  Y, sgn(f - b) correctly 
classifies the set X with a margin 7 :- min<i<m [f(:ei) - b[. 
Finally let U := min{n  N with e,(Sx) < 7/8.001} and a(U, 6) := 3.08(1 + In -}). 
Then with confidence 1 - 6 over X (drawn randomly from P" where P is some probability 
distribution) the expected error of sgn(f - b) is bounded from above by 
= 2 (U(l+a(U,)log(5--)log(17m))+log(-)). (17) 
U, 6) 
The proof is essentially identical to that of [6, Theorem 4.1] and is omitted. [6] also shows 
how to compute e, (Sx) efficiently including an explicit formula for evaluating et (E). 
7 DISCUSSION 
We showed how improved bounds could be obtained on the entropy numbers of a wide 
class of popular statistical estimators ranging from weight decay to sparsity regularization 
348 A. J. Smola, J. Shawe-Taylor, B. Sch6lkopf and R. C. Vlliamson 
(with SV machines being a special case thereof). The results are given in a way that is 
directly useable for practicioners without any tedious calculations of the VC dimension or 
similar combinatorial quantities. In particular, our method ignores (nearly) linear depen- 
dent basis functions automatically. Finally, it takes advantage of favourable distributions 
of data by using the observed entropy numbers as a base for stating bounds on the true 
entropy numbers with respect to the function class under consideration. 
Whilst this leads to significantly improved bounds (we achieved an improvement of ap- 
proximately two orders of magnitude over previous VC-type bounds involving only the 
radius of the data R and the weight vector IIwll in the experiments) on the expected risk, 
the bounds are still not good enough to become predictive. This indicates that possibly 
rather than using the standard uniform convergence bounds (as used in the previous sec- 
tion) one might want to use other techniques such as a PAC-Bayesian treatment (as recently 
suggested by Herbrich and Graepel) in combination with the bounds on eigenvalues of the 
design matrix. 
Acknowledgements: This work was supported by the Australian Research Council and a 
grant of the Deutsche Forschungsgemeinschaft SM 62/1-1. 
References 
[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive Dimen- 
sions, Uniform Convergence, and Learnability. J. of the ACM, 44(4):615-631, 1997. 
[2] B. Carl and I. Stephani. Entropy, compactness, and the approximation of operators. 
Cambridge University Press, Cambridge, UK, 1990. 
[3] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. Tech- 
nical Report 479, Department of Statistics, Stanford University, 1995. 
[4] F. Girosi, M. Jones, and T Poggio. Regularization theory and neural networks archi- 
tectures. Neural Computation, 7:219-269, 1995. 
[5] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cam- 
bridge, 1992. 
[6] B. Sch61kopf, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Generalization 
bounds via eigenvalues of the gram matrix. Technical Report NC-TR-99-035, Neuro- 
Colt2, University of London, UK, 1999. 
[7] J. Shawe-Taylor and R. C. Williamson. Generalization performance of classifiers in 
terms of observed covering numbers. In Proc. EUROCOLT'99, 1999. 
[8] R. C. Williamson, A. J. Smola, and B. Sch61kopf. Generalization performance of 
regularization networks and support vector machines via entropy numbers of compact 
operators. NeuroCOLT NC-TR-98-019, Royal Holloway College, 1998. 
[9] R. C. Williamson, A. J. Smola, and B. Sch61kopf. A Maximum Margin Miscellany. 
Typescript, 1999. 
