Inference for the Generalization Error 
Claude Nadeau 
CIRANO 
2020, University, 
Montreal, Qc, Canada, H3A 2A5 
j cnadeauOaltavista. net 
Yoshua Bengio 
CIRANO and Dept. IRO 
Universitd de Montrdal 
Montreal, Qc, Canada, H3C 3J7 
bengioyOiro. umontreal. ca 
Abstract 
In order to to compare learning algorithms, experimental results reported 
in the machine learning litterature often use statistical tests of signifi- 
cance. Unfortunately, most of these tests do not take into account the 
variability due to the choice of training set. We perform a theoretical 
investigation of the variance of the cross-validation estimate of the gen- 
eralization error that takes into account the variability due to the choice 
of training sets. This allows us to propose two new ways to estimate 
this variance. We show, via simulations, that these new statistics perform 
well relative to the statistics considered by Dietterich (Dietterich, 1998). 
1 Introduction 
When applying a learning algorithm (or comparing several algorithms), one is typically 
interested in estimating its generalization error. Its point estimation is rather trivial through 
cross-validation. Providing a variance estimate of that estimation, so that hypothesis test- 
ing and/or confidence intervals are possible, is more difficult, especially, as pointed out in 
(Hinton et al., 1995), if one wants to take into account the variability due to the choice of 
the training sets (Breiman, 1996). A notable effort in that direction is Dietterich's work (Di- 
etterich, 1998). Careful investigation of the variance to be estimated allows us to provide 
new variance estimates, which turn out to perform well. 
Let us first lay out the framework in which we shall work. We assume that data are avail- 
able in the form Z = {Z,..., Zn}. For example, in the case of supervised learning, 
Zi = (Xi, Yi)  Z C_ R p+q, where p and q denote the dimensions of the Xi's (inputs) 
and the Y/'s (outputs). We also assume that the Zi's are independent with Zi ,',, P(Z). 
Let (D; Z), where D represents a subset of size nl < n taken from Z, be a function 
,Z TM x ,Z -->  For instance, this function could be the loss incurred by the decision 
that a learning algorithm trained on D makes on a new example Z. We are interested in 
estimating ntz -- E[(Z; Zn+)] where Zn+ "' P(Z) is independent of Z. Subscript 
n stands for the size of the training set (Z here). The above expectation is taken over Z 
and Zn+, meaning that we are interested in the performance of an algorithm rather than 
the performance of the specific decision function it yields on the data at hand. According to 
Dietterich's taxonomy (Dietterich, 1998), we deal with problems of type 5 through 8, (eval- 
uating learning algorithms) rather then type 1 through 4 (evaluating decision functions). We 
call n/z the generalization error even though it can also represent an error difference: 
 Generalization error 
We may take 
(D; Z) = (D; (X, Y)) = Q(F(D)(X), Y), (1) 
308 C. Nadeau and Y. Bengio 
where F(D) (F(D) : lRv  lRq) is the decision function obtained when training an 
algorithm on D, and Q is a loss function measuring the inaccuracy of a decision. For 
instance, we could have Q(0, y) = I[0  y], where I[ ] i the indicator function, for 
classification problems and Q(O, y) -II - u II 2, where is II' I is the Euclidean norm, for 
"regression" problems. In that case ,it is what most people call the generalization error. 
 Comparison of generalization errors 
Sometimes, we are not interested in the performance of algorithms per se, but instead in 
how two algorithms compare with each other. In that case we may want to consider 
(D;Z) = (D;(X,Y)) =Q(FA(D)(X),Y) -Q(FB(D)(X),Y), (2) 
where FA(D) and FB(D) are decision functions obtained when training two algorithms 
(A and B) on D, and Q is a loss function. In this case ,/z would be a difference of 
generalization errors as outlined in the previous example. 
The generalization error is often estimated via some form of cross-validation. Since there 
are various versions of the latter, we lay out the specific form we use in this paper. 
 Let Sj be a random set of nx distinct integers from {1,...,n}(n < n). Here nx 
represents the size of the training set and we shall let n2 - n - nx be the size of the test 
set. 
 Let $x,... $j be independent such random sets, and let $ = {1,..., n} \ $j denote the 
complement of Sj. 
 Let Z$ s = {Zli 6 $j} be the training set obtained by subsampling Z according to the 
random index set Sj. The corresponding test set is Z$ = {Zili  $ }. 
 Let L(j, i) = (Z$s; Zi). According to (1), this could be the error an algorithm trained 
on the training set Z$ makes on example Zi. According to (2), this could be the difference 
of such errors for two different algorithms. 
 Let j = K Y'k=x L(j, i) where if,..., i3c are randomly and independently drawn 
from $. Here we draw K examples from the test set Z$ with replacement and compute 
the average error committed. The notation does not convey the fact that j depends on K, 
n and n2. 
^ -- Y'4$? L(j, i) denote what bj becomes as K increases 
 Let ttf = limx-}oo j = = , 
without bounds. Indeed, when sampling infinitely often from Zs, each Zi (i  S) is 
x , yielding the usual "average test egor". The use of K is 
chosen with relative frequency  
just a mathematical device to make the test exmples sampled independently from S. 
Then the cross-validation estimate of the generalization egor consider in this paper is 
n2gK 1   
nJ =  j' 
=1 
We note that this an unbiased estimator of n = E[E(Z  , Zn+)] (not the same as n). 
This paper is about the estimation of the vance of n=  
m a  We first study theoretically 
this vmiance in section 2, leading to two new vmiance estimators developped in section 3. 
Stion 4 shows p of a simulation study we peffomed to see how the proposed statistics 
behave compared to statistics already in use. 
2 Analysis of Vat[ 
n/J J 
Here we study Vat[ n r,c This is important to understand why some inference proce- 
n tJ J' 
dures about m tt presently in use are inadequate, as we shall underline in section 4. This 
investigation also enables us to develop estimators of Vat[ n/j ] in section 3. Before we 
proceed, we state the following useful lemma, proved in (Nadeau and Bengio, 1999). 
Inference for the Generalization Error 309 
Lemma 1 Let U1,..., Uk be random variables with common mean fi, common variance 
6 and Cov[Ui, Uj] = % Vi  j. Let rr =  be the correlation between Ui and Uj (i  j). 
Let  = k -1 I I 
Ei=l Ui a S = 1 
- i=i (Ui - 0): be the sample mean and sample 
variance respectively. Then E[Sb] = 6 - 7 and Var[O] = 7 + (-) - 6 ( + ) 
To study Vat[ n;',K1 
m J  we need to define the following covariances. 
 Let a0 = ao(nl) = Var[L(j, i)] when i is randomly drawn from S. 
 Let o- = rr(n,n2) = Cov[L(j,i),L(j,i')] for i and i' randomly and indepen- 
dently drawn from $. 
 Let rr2 = a2(n,n2) = C[L(j,i),L(j',i')], with j  j', i and i' randomly and 
independently drawn from S and S, respectively. 
* Letaa = aa(n) = C[L(j,i),L(j,i')] fori, i'  S and/ i'. is is not the 
same as a. In fact, it may be shown that 
a = Cov[L(j,i) L(j,i')] ao (n2- 1)a a0 
, = + = as + (3) 
2 2 
Let us look at the mean and viance of  and  . Concerning expectations, we 
obviously have E[] = n and thus E[ = . From Lemma 1, we have 
o- which implies 
Var[fiF] = Vat[ lim fi2] = lim Var[fij] = 
It can also be shown that Cov[fi2, fly] = a2, j  f, and therefore (using Lemma 1) 
mJ l = 2 + j = 2 + K -- 
j (4) 
We shall often encounter a0, a, a2, aa in the future, so some knowlge about those quan- 
tities is valuable. Here's what we can say about them. 
Proposition 1 For given n and n2, we have 0  a2'  al  ao and 0 
Proof See (Nadeau and Bengio, 1999). 
A natural question about the estimator n2 ?,K is how hi, n2, K and J affect its viance. 
Proposition 2 The variance of n r,K is non-increasing in J, K and 
nJ 
Proof See (Nadeau and Bengio, 1999). 
Clearly, increasing K leads to smaller variance because the noise introduced by sampling 
with replacement from the test set disappears when this is done over and over again. Also, 
averaging over many train/test (increasing J) improves the estimation of m/. Finally, all 
things equal elsewhere (r fixed among other things), the larger the size of the test sets, the 
better the estimation of m/. 
^K 
The behavior of Vat[ n2j ] with respect to rl is unclear, but we conjecture that in most 
n 
situations it should decrease in n. Our argument goes like this. The variability in n2 fi 
comes from two sources: sampling decision rules (training process) and sampling testing 
examples. Holding r2, J and K fixed freezes the second source of variation as it solely de- 
pends on those three quantities, not r. The problem to solve becomes: how does n affect 
the first source of variation? It is not unreasonable to say that the decision function yielded 
by a learning algorithm is less variable when the training set is large. We conclude that the 
first source of variation, and thus the total variation (that is Vat[ n2r, Kll is decreasing in 
nJ 
r. We advocate the use of the estimator 
310 C. Nadeau and Y. Bengio 
as it is easier to compute and has smaller variance than 
Var[ 
nllj ] = lim Vat[ n2f ] = 0'2 + 
K-- oo 
where p =  = Cr[, ]. 
3 Estimation of Vat[ 
1 'J (J' '1, '2 held constant). 
0'1-0'2 (l-p) (6) 
j =0' p+  ' 
We are interested in estimating n2,,2 Vat[ 2b] where ,:b is as defined in (5). We 
nl 'J - 1 1 
provide two different estimators of Vat[ n ^ oo 
m/J ]' The first is simple but may have a positive 
or negative bias for the actual variance. The second is meant to be conservative, that is, 
if our conjecture of the previous section is correct, its expected value exceeds the actual 
variance. 
1st Method: Corrected Resampled t-Test. Let us recall that 
5 2 be the sample variance of the 's. According to Lemma 1, 
E[a 2] = (1 - p) = .... p + = 
so that ( + l)2 is an unbiased estimator of Vat[ 
n2hoo _ 1 J 
- Let 
Vg/'[ n2 h] 
= 1 p , (7) 
7+l-p 
The only problem is 
that p = p(n n2) = a2(m,n2) the correlation between the / s, is unknown and 
difficult to estimate. We use a naive surrogate for p as follows. Let us recall that 
la ? i (Zs' Zi). For the purpose of building our estimator, let us make the 
^ = , 
approximation that (Z&; Zi) depends only on Zi and n. Then it is not hard to show (see 
(Nadeau and Bengio, 1999)) that the correlation between the O,s becomes ' There- 
n+n2 ' 
fore our first estimator of Vat[ 
nlJ ] 1--po] wherepo = po(nl,n2) = n+n2' 
('} ) "2 ;'1 accord- 
n_z 52 This will tend to overestimate or underestimate Var[ 
that is + n  
ing to whether Po > p or Po < p. Note that this first method basically does not require 
any more computations than that already performed to estimate generalization error by 
cross-validation. 
2nd Method: Conservative Z. Our second method aims at overestimating Vat[ 
which will lead to conservative inference, that is tests of hypothesis with actual size less 
than the nominal size. This is important because techniques currently in use have the 
opposite defect, that is they tend to be liberal (tests with actual size exceeding the nominal 
size), which is typically regarded as less desirable than conservative tests. 
Estimating n,,.2 unbiasedly is not trivial as hinted above. However we may estimate 
nl J 
unbiasedly n2 0'. = Vat[ n  c n n 52 
n i njlwheren = [J--n2 <n.Let n j be the unbiased 
estimator, developed below, of the above variance. We argued in the previous section that 
Vat[ nm Vat[ n2;',oo] Therefore n2-2 
n[ tj ] > n *j . n[ oj will tend to overestimate n2rr2 that is 
-- nl "J' 
E[ n22] n20'2 n2rr2 
 rl "J' 
n2 0.2 
Here's how we may estimate n  without bias. For simplicity, assume that n is even. 
We have to randomly split our data Z into two distinct data sets, D and D, of size  
each. Let () be the statistic of interest ( n2 c c 
n t  computed on D. This involves, among 
other things, drawing J train/test subsets from D. Let ) be the statistic computed on 
D. Then () and ) are independent since D and D  independent data sets, so 
that (() ()+) )2 = )2 
2 + 
2  ((1) - 1) is an unbiased estimate 
of 2 a. This splitting process may be repeat M times. This yields Dm and D, with 
InJbrence for the Generalization Error 311 
Dm tO D = Z[ , D, Cl D = 0 for m = 1,..., M. Each split yields a pair (fi(m), bm)) 
that is such that  ^ ^ c 2 n2 _2 
5 (/(m) -/(m)) is unbiased for n[ oj. This allows us to use the following 
n2 
unbiased estimator of [ o-.' 
M 
'5-, = 1 
n[ 2M E ((m) -- m)) 2- (8) 
m=l 
Note that, according to Lemma 1 Vat[ n22  ^c 2 
: -- (m)) ](/' 1._) 
' n al  Var[((m) + with 
r -- Corr[(b(i) ^c 2 
-/(i)) , ((J) - j))2] for i  j. Simulations suggest that r is usually 
close to 0, so that the above variance decreases roughly like -- for M up to 20, say. The 
second method is therefore a bit more computation intensive, since requires to perform 
cross-validation M times, but it is expected to be conservative. 
4 Simulation study 
We consider five different test statistics for the hypothesis H0  m P = p0. The first three 
are methods already in use in the machine learning community, the last two are the new 
methods we put forward. They all have the following form 
rejectHoif ] 
 > c. (9) 
Table 1 describes what they are  We performed a simulation study to inves- 
tigate the size (probability of rejecting the null hypothesis when it is true) and 
the power (probability of rejecting the null hypothesis when it is false) of the 
five test statistics shown in Table 1. We consider the problem of estimating gen- 
eralization errors in the Letter Recognition classification problem (available from 
www. ics. uci. edu/pub/machine-learnin-databases). The learning algo- 
rithms are 
1. Classification tree 
We used the function tree in Splus version 4.5 for Windows. The default argu- 
ments were used and no pruning was performed. The function predict with option 
type="class" was used to retrieve the decision function of the tree: FA (Z$) (X). 
Here the classification loss function LA(j, i) = I[FA(Z$ s)(Xi)  Y/] is equal 
to 1 whenever this algorithm misclassifies example i when the training set is 
otherwise it is 0. 
2. First nearest neighbor 
We apply the first nearest neighbor rule with a distorted distance metric to pull 
down the performance of this algorithm to the level of the classification tree (as 
in (Dietterich, 1998)). We have LB (j, i) equal to 1 whenever this algorithm mis- 
classifies example i when the training set is $j; otherwise it is 0. 
In addition to inference about the generalization errors n ]ZA and m B associated with 
those two algorithms, we also consider inference about nIZA_B -- nIZA -- nIZB -- 
E[LA_8(j,i)] where LA_8(j,i) = LA(j,i) -- Ls(j,i). 
We sample, without replacement, 300 examples from the 20000 examples available in the 
Letter Recognition data base. Repeating this 500 times, we obtain 500 sets of data of the 
form {Z1,..., Za00}. Once a data set Z a = {Z,... Za00} has been generated, we may 
When comparing two classifiers, (Nadeau and Bengio, 1999) show that the t-test is closely re- 
lated to McNemar's test described in (Dietterich, 1998). The 5 x 2 cv procedure was developed in 
(Dietterich, 1998) with solely the comparison of classifiers in mind but may trivially be extended to 
other problems as shown in (Nadeau and Bengio, 1999). 
312 C. Nadeau and Y. Bengio 
t-test (McNemar) n2 $V(L(1,i)) tn2_l,l_a/2 > 1 
nl 
resampl t    - 2 
7a tJ-l,l-a/2 1 + J > 1 
nl 
n/2  
Dietterich's 5 x 2 cv n/2l see (Diettefich, 1998) t5,1_a/2 
n2 2 
1' conservative Z -2  -  2 " < 1 
n n J Zl-a/2 
2: corn resampled t n + ts_x,x_./ + 
Table 1: Description of five test statistics in relation to the rejection criteria shown in (9). 
Z v and tk,v refer to the quarttile p of the N(0, 1) and Student tk distribution respectively. 
2 is as defined above (7) and SV(L(1, i)) is the sample variance of the L(1, i)'s involved 
in ,2 . The ratio (which comes from proper application of Lemma 1, except for 
Dietterich's 5 x 2 cv and the Conservative Z) indicates if a test will tend to be conservative 
(ratio less than 1) or liberal (ratio greater than 1). 
perform hypothesis testing based on the statistics shown in Table 1. A difficulty arises 
however. For a given n (n - 300 here), those methods don't aim at inference for the same 
generalization error. For instance, Dietterich's 5 x 2 cv test aims at n/2[, while the others 
2. for 
aim at m P where nl would usually be different for different methods (e.g. 
9. for the resampled t test statistic, for instance). In order 
the t test statistic, and nl = - 
to compare the different techniques, for a given n, we shall always aim at n/2P, i.e. use 
= n2 f with J > 1, normal usage would call for 
n However, for statistics involving n 
nl . 
n Therefore, for those statistics, we 
nl to be 5 or l0 times larger than n2, not nl = n2 = . 
n n/10 c 
also use nl =  and n2 =  so that  = 5. To obtain ./2 s we simply throw out 40% 
of the data. For the conservative Z, we do the variance calculation as we would normally 
n2 2 n/10.2 
do (n2 =  for instance) to obtain ./2_.2os = 2./5 s' However, in the numerator we 
compute both f and -2 o 
./2 s ./2 s instead of 
= ---2 P, as explained above. 
-/106.2 
Note that the rationale that led to the conservative Z statistics is maintained, that is 2n/5 
overestimates both Var[ nn512] and Var[ nnS;]  E [ n/16'2] n/lr'ml 
2,/5 s] > Vat[ > 
-- n/2 IJ J 
Vat[ n/2 
n/21J 1' 
Figure 1 shows the estimated power of different statistics when we are interested in/z4 and 
/4-. We estimate powers by computing the proportion of rejections of H0. We see that 
tests based on the t-test or resampled t-test are liberal, they reject the null hypothesis with 
probability greater than the prescribed a = 0.1, when the null hypothesis is true. The other 
tests appear to have sizes that are either not significantly larger the 10% or barely so. Note 
that Dietterich's 5 x 2cv is not very powerful (note that its curve has the lowest power on 
the extreme values of rnu0). To make a fair comparison of power between two curves, one 
should mentally align the size (bottom of the curve) of these two curves. Indeed, even the 
resampled t-test and the conservative Z that throw out 40% of the data are more powerful. 
That is of course due to the fact that the 5 x 2 cv method uses J = 1 instead of J = 15. 
This is just a glimpse of a much larger simulation study. When studying the corrected 
resampled t-test and the conservative Z in their natural habitat (nl " 
= 5-6 and n2 = 00)' we 
see that they are usually either right on the money in term of size, or slightly conservative. 
Their powers appear equivalent. The simulations were performed with J up to 25 and M 
up to 20. We found that taking J greater than 15 did not improve much the power of the 
In./brence for the Generalization Error 313 
Figure 1: Powers of the tests about H0  /A = /0 (left panel) and Ho ' A-B = 0 
(right panel) at level a - 0.1 for varying/0. The dotted vertical lines correspond to the 
95% confidence interval for the actual/A or IA-B, therefore that is where the actual size 
of the tests may be read. The solid horizontal line displays the nominal size of the tests, 
i.e. 10%. Estimated probabilities of rejection laying above the dotted horizontal line are 
significatively greater than 10% (at significance level 5%). Solid curves either correspond 
to the resampled t-test or the corrected resampled t-test. The resampled t-test is the one that 
has ridiculously high size. Curves with circled points are the versions of the ordinary and 
corrected resampled t-test and conservative Z with 40% of the data thrown away. Where it 
matters J = 15, M = 10 were used. 
statistics. Taking M = 20 instead of M = 10 does not lead to any noticeable difference 
in the distribution of the conservative Z. Taking M = 5 makes the statistic slightly less 
conservative. See (Nadeau and Bengio, 1999) for further details. 
5 Conclusion 
This paper addresses a very important practical issue in the empirical validation of new 
machine learning algorithms: how to decide whether one algorithm is significantly better 
than another one. We argue that it is important to take into account the variability due to 
the choice of training set. (Dietterich, 1998) had already proposed a statistic for this pur- 
pose. We have constructed two new variance estimates of the cross-validation estimator 
of the generalization error. These enable one to construct tests of hypothesis and confi- 
dence intervals that are seldom liberal. Furthermore, tests based on these have powers that 
are unmatched by any known techniques with comparable size. One of them (corrected 
resampled t-test) can be computed without any additional cost to the usual K-fold cross- 
validation estimates. The other one (conservative Z) requires M times more computation, 
where we found sufficiently good values of M to be between 5 and 10. 
References 
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 
24 (6):2350-2383. 
Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning 
algorithms. Neural Computation, 10 (7): 1895-1924. 
Hinton, G., Neal, R., Tibshirani, R., and DELVE team members (1995). Assessing learning proce- 
dures using DELVE. Technical report, University of Toronto, Department of Computer Science. 
Nadeau, C. and Bengio, Y. (1999). Inference for the generalisation error. Technical Report in prepa- 
ration, CIRANO. 
