Neural Network Model Selection Using 
Asymptotic Jackknife Estimator and 
Cross-Validation Method 
Yong Liu 
Department of Physics and 
Institute for Brain and Neural Systems 
Box 1843, Brown University 
Providence, RI, 02912 
Abstract 
Two theorems and a lemma are presented about the use of jackknife es- 
timator and the cross-validation method for model selection. Theorem 1 
gives the asymptotic form for the jackknife estimator. Combined with the 
model selection criterion, this asymptotic form can be used to obtain the 
fit of a model. The model selection criterion we used is the negative of the 
average predictive likehood, the choice of which is based on the idea of the 
cross-validation method. Lemma 1 provides a formula for further explo- 
ration of the asymptotics of the model selection criterion. Theorem 2 gives 
an asymptotic form of the model selection criterion for the regression case, 
when the parameters optimization criterion has a penalty term. Theorem 
2 also proves the asymptotic equivalence of Moody's model selection cri- 
terion (Moody, 1992) and the cross-validation method, when the distance 
measure between response y and regression function takes the form of a 
squared difference. 
1 INTRODUCTION 
Selecting a model for a specified problem is the key to generalization based on the 
training data set. In the context of neural network, this corresponds to selecting 
an architecture. There has been a substantial amount of work in model selection 
(Lindley, 1968; Mallows, 1973; Akaike, 1973; Stone, 1977; Atkinson, 1978; Schwartz, 
599 
600 Liu 
1978; Zellner, 1984; MacKay, 1991; Moody, 1992; etc.). In Moody's paper (Moody, 
1992), the author generalized Akaike Information Criterion (AIC) (Akaike, 1973) 
in the regression case and introduced the term effective number of parameters. It 
is thus of great interest to see what the link between this criterion and the cross- 
validation method (Stone, 1974) is and what we can gain from it, given the fact 
that AIC is asymptotically equivalent to the cross-validation method (Stone, 1977). 
In the method of cross-validation (Stone, 1974), a data set, which has a data point 
deleted from the original training data set, is used to estimate the parameters of a 
model by optimizing a parameters optimization criterion. The optimal parameters 
thus obtained are called the jackknife estimator (Miller, 1974). Then the predictive 
likelihood of the deleted data point is calculated, based on the estimated parame- 
ters. This is repeated for each data point in the original training data set. The fit 
of the model, or the model selection criterion, is chosen as the negative of the aver- 
age of these predictive likelihoods. However, the computational cost of estimating 
parameters for different data point deletion is expensive. In section 2, we obtained 
an asymptotic formula (theorem 1) for the jackknife estimator based on optimizing 
a parameters optimization criterion with one data point deleted from the training 
data set. This somewhat relieves the computational cost mentioned above. This 
asymptotic formula can be used to obtain the model selection criterion by plugging 
it into the criterion. Furthermore, in section 3, we obtained the asymptotic form 
of the model selection criterion for the general case (Lemma 1) and for the special 
case when the parameters optimization criterion has a penalty term (theorem 2). 
We also proved the equivalence of Moody's model selection criterion (Moody, 1992) 
and the cross-validation method (theorem 2). Only sketchy proofs are given when 
these theorems and lemma are introduced. The detail of the proofs are given in 
section 4. 
2 APPROXIMATE JACKKNIFE ESTIMATOR 
Let the parameters optimization criterion, with data set co: {(a:i, Yi), i - 1, ..., n) 
and parameters 0, be Co(O), and let co-i denote the data set with ith data point 
deleted from co. If we denote  and O_i as the optimal parameters for criterion C (0) 
and Co_c(0), respectively, X7 a as the derivative with respect to 0 and superscript t 
as transpose, we have the following theorem about the relationship between  and 
Theorem I If the criterion function Co (a) is an infinite- order differeniiable func- 
tion and its derivatives are bounded around . The estimator -i (also called jack- 
knife estimator (Miller, 197)) can be approzimated as 
- -(v0vco() - 
(1) 
in c(o) = co(o) - co_,(0). 
Proof. Use the Taylor expansion of equation VoCo_(_i): 0 around . Ignore 
terms higher than the second order. 
Model Selection Using Asymptotic Jackknife Estimator & Cross-Validation Method 601 
Ezample 1: Using the generalized mazimum likelihood method from Bayesian 
analysis  (Berger, 1985), if r(19) is the prior on the parameters and the observations 
are mutually independent, for which the distribution is modeled as ylx ....f(ylx, 9), 
the parameters optimization criterion is 
Co(0)= log[ H f(yi[z,,O)r(O) ]=  logf(y,[.z,,O) + log(0). (2) 
Thus C(O) = logf(yz,O). If we ignore the influence of the deleted data point in 
the d nominator of equation 1, we have 
b_, - b  -(VoVC())-XVologf(y, lg,,). (3) 
Ezample : In the special case of example 1, with noninformative prior r(0): 1, 
the criterion is the ordinary log-likelihood function, thus 
_i-tJm-[ y. VoVlogf(yjla:j, ) ]-Vologf(yilxi,). (4) 
3 
CROSS-VALIDATION METHOD AND MODEL 
SELECTION CRITERION 
Hereafter we use the negative of the average predictive likelihood, or, 
T,(w)- 1 )' logf(y, lm, _,) (5) 
n 
(;gi,yl) 
as the model selection criterion, in which n is the size of the training data set w, 
rn G :M denotes parametric probability models f(yl and :M is the set of all the 
models in consideration. It is well known that T(w) is an unbiased estimator of 
r(90, (-)), the risk of using the model m and estimator , when the true parameters 
are 0 and the training data set is w (Stone, 1974; Efron and Gong, 1983; etc.), i.e., 
: 
= E{-logf(ylx,(w)) } 
1 
: E{ k  lgf(YJlxJ'(w)) } (6) 
in which w = {(xj,yj) , j = 1, ... k} is the test data set, (-) is an implicit 
function of the training data set w and it is the estimator we decide to use after 
we have observed the training data set w. The expectation above is taken over the 
randomness of w, , y and w. The optimal model will be the one that minimizes 
this criterion. This procedure of using _ and T (w) to obtain an estimation of risk 
is often called the cross-validation method (Stone, 1974; Efron and Gong, 1983). 
Remark: After we have obtained  for a model, we can use equation 1 to calculate 
-i for each i, and put the resulting -i into equation 5 to get the fit of the model, 
thus we will be able to compare different models m 
t Strictly speaking, it is a method to find the posterior mode. 
602 Liu 
Lemma 1 / the probability model f(y]oz, 0), as a function of O, is differeniiable up 
infinite order and its derivatives are bounded around . The approzimaiion go 
model selection criterion, equation 5, can be written as 
1 1 
()  -;  logf(ylx,)-   Vlogf(ylxi,)(_ -) (7) 
(Zi,yi) (0 
Proof. Igoring the terms higher than the second order of the Taylor expansion of 
logf(yj [x,_i) around t} will yield the result. 
Example  (continued): Using equation 4, we have, for the model selection criterion, 
7-,.(w)- 1  logf(yilxi, O)- 
n 
(,y.) 
  Vlogf(yi[i,)A-Vologf(yi[i,). (8) 
in which A = (,y) VsVlogf(yj[j,). If the model f(y[,O) is the true 
one, the second term is asymptotically equal to p, the number of parameters in the 
model. So the model selection criterion is 
- log-likelihood + number of parameters of the model. 
This is the well known Akaike's Information Criterion (AIC) (Akaike, 1973). 
Eample /(continued): Consider the probability model 
1 
f(y[,O) = Zexp(-f(y, W())) (9) 
in which Z is a normalization factor, f(y, W()) is a distance measure between y and 
regression knction W(). f(') as function of 0 is assumed differentiable. Denoting  
U(O,,w) = (,,y,) f(Yi, W(i))- 2alog(0), we have the following theorem, 
Theorem 2 For the model specified in equation 9 and the parameters optimization 
criterion specified in equation  (example 1), under regular condition, the unbiased 
estimator of 
1 
'( i (Y' "()) ) (0) 
asymptotically equals to 
(,y)eo 
+ 
V(yi,l(xi)){VoVU(,X,w)}-Vo(yi,l(ozi)). (11) 
'For example, r(O[X)= A/;(O, a'/X), this corresponds to 
U(O,X,w)=  (yi,ls(zi)) q- XO' + const(A,a'). 
Model Selection Using Asymptotic Jackknife Estimator & Cross-Validation Method 603 
For the case when (y, r/s(z))= (y-t/s(z)) 2, we get, for the asymptotic equivalency 
of the equation 11, 
2(r 2 1 
+ --- x 
n 2 
0 (12) 
Oyi 
in which w -- {(zi, Yi), i -- 1, ..., n) is the training data set, wn -- {(zi,Yi), i ---- 
1 
1, ..., k} is the test data set, and - 5(Y, 
Proof. This result comes directly from theorem 1 and lemma 1. Some asymptotic 
technique has to be used. 
Remark: The result in equation 12 was first proposed by Moody (Moody, 1992). 
The effective number of parameters formulated in his paper corresponds to the 
summation in equation 12. Since the result in this theorem comes directly from 
the asymptotics of the cross-validation method and the jackknife estimator, it gives 
the equivalency proof between Moody's model selection criterion and the cross- 
validation method. The detailed proof of this theorem, presented in section 4, is 
in spirit the same as the one presented in Stone's paper about the proof of the 
asymptotic equivalence of AIC and the cross-validation method (Stone, 1977). 
4 DETAILED PROOF OF LEMMAS AND THEOREMS 
In order to prove theorem 1, lemma 1 and theorem 2, we will present three auxiliary 
lemmas first. 
Lemma 2 For random variable sequence n and Yn, if limn-ozn =  and 
lirn_o Yn = , then n and Yn are asymptotically equivalent. 
Proof. This comes from the definition of asymptotic equivalence. Because asymp- 
totically the two random variable will behave the same as random variable z. 
Lemma 3 Consider the summation ..i h(zi, Yi)g(i, z). If E(h(g, y)[g, z) is a 
constant c independent of z, y, z, then the summation is asymptotically equivalent 
tO c.ig(i,z ). 
Proof. According to the theorem of large number, 
lim 1 E h(zi,yi)g(zz,z) 
i 
= E(h(g,y)g(,z)) 
= E(E(h(z,y)[g,z)g(z,z)) = cE(g(z,z)) 
which is the same as the limit of  ]] g(z, z). Using lemma 2, we get the result of 
this lemma. 
Lemma 4 If 7(') and g(t,.) are differenttable up to the second order, and the 
model y- ?(z) + e with e- A;(O, ) is the true model, the second derivative with 
604 Liu 
respect to 19 of 
U(0, ,w): Y(yi - r/0(mi)) 2 + g(0, ) 
i=1 
evaluated at the minimum of Lt, i.e., , is asymptotically independent of random 
variable {yi,i = 1, ..., n}. 
Proof. Explicit calculation of the second derivative of/d with respect to 19, evaluated 
at , gives 
v0vu(, ,) = 2  vo.(....)v,(,) 
i--1 
- 2 (y - ,())v0.() 
i=1 
+ V0%g(, ) 
As n approaches infinite, the effect of the second term in /d vanishes,  approach 
the mean squared error estimator with infinite amount of data points, or the true 
parameters 0o of the model (consistency of MSE estimator (Jennrich, 1969)), E(y - 
r/(m)) approaches E(y- r/00 (m)) which is 0. According to lemma 2 and lemma 3, the 
second term of this second derivative vanishes asymptotically. So as n approaches 
infinite, the second derivative of/d with respect to 0, evaluated at , approaches 
vov;u(oo),X,) = 2  vo.oo(,)v.oo() + vov;g(Oo, X) 
i=1 
which is independent of {Yi, i = 1, ..., n}. According to lemma 2, the result of this 
lemma is readily obtained. 
Now we give the detailed proof of theorem 1, lemma 1 and theorem 2. 
Proof of Theorem 1. The jackknife estimator g_ satisfies, VoCo_(-) : 0. 
The Taylor expansion of the left side of this equation around $ gives 
VoC_,() + V0VC_.()(_ - ) + O(1_ - 1) = 0 
According to the definition of t} and (-i, their difference is thus a small quantity. 
Also because of the boundhess of the derivatives, we can ignore higher order terms 
in the Taylor expansion and get the approximation 
_ -   -(vovc_,())-voc_,() 
Since tJ satisfies VoCo(() = 0, we can rewrite this equation and obtain equation 1. 
Proof of Lemma 1. The Taylor expansion of logf(yi lai, tJ_i) around  is 
logf(ul,, _,) = logf(u I,, ) + VlogI(u I, )(_, - ) + o(l_i - 
Putting this into equation 5 and ignoring higher order terms for the same argument 
as that presented in the proof of theorem 1, we readily get equation 7. 
Proof of Theorem 2. Up to an additive constant dependent only on  and ct 2, 
the optimization criterion, or equation 2, can be rewritten as 
1 
c(0): 2 u(0, , o) (3) 
Model Selection Using Asymptotic Jackknife Estimator & Cross-Validation Method 605 
Now putting equation 9 and 13 into equation 3, we get, 
-i -   -(VVa(, , ct)}-lv(yi, i(i)) (14) 
Putting equation 14 into equation 7, we get, for the model selection criterion, 
1 
z()    5( ()) + 
n 
(,)e 
!  i v( ,())(vv.(. ))-v5(.,()) (1) 
. 2 2  , 
(,) 
Recall the discussion associated with equation 6 and now 
1 1. 
(-  ogi(,))=(  2;(,())) () 
(,)e (,)e 
after some simple algebra, we can obtain the unbiased estimator of equation 10. 
The result is equation 15 multiplied by 2 , or equation 11. Thus we prove the first 
part of the theorem. 
Now consider the case when 
5(. ,()) = (- ,()) () 
The second term of equation 11 now becomes 
!  4( - .())V.()(VV"(,,))-V.() () 
(,) 
As n approaches infinite,  approach the true parameters 90, VV(.) approaches 
VV0(.) and E((y- Va())) asymptotically equals to . Using lemma 4 and 
lemma 3, we get, for the asymptotic equivalency of equation 18, 
!  v;,()(vv;.(..))-v,() (1) 
n 
1 
If we use notation (9, w)=  (,,)e (y, V()), with (y, V(z)) of the form 
specified in equation 17, we can get, 
V5(S,) = -V.() (0) 
Combining this with equation 19 and equation 11, we can readily obtain equation 12. 
5 SUMMARY 
In this paper, we used asymptotics to obtain the jackknife estimator, which can 
be used to get the fit of a model by plugging it into the model selection criterion. 
Based on the idea of the cross-validation method, we used the negative of the 
average predicative likelihood as the model selection criterion. We also obtained 
the asymptotic form of the model selection criterion and proved that when the 
parameters optimization criterion is the mean squared error plus a penalty term, 
this asymptotic form is the same as the form presented by (Moody, 1992). This 
also served to prove the asymptotic equivalence of this criterion to the method of 
cross-validation. 
606 Liu 
Acknowledgement s 
The author thanks all the members of the Institute for Brain and Neural Systems, 
in particular, Professor Leon N Cooper for reading the draft of this paper, and Dr. 
Nathan Intrator, Michael P. Perrone and Harel Shouval for helpful comments. This 
research was supported by grants from NSF, ONR and ARO. 
References 
Akaike, H. (1973). Information theory and an extension of the maximum likeli- 
hood principle. In Petrov and Czaki, editors, Proceedings of the rd International 
Symposium on Information Theory, pages 267-281. 
Atkinson, A. C. (1978). Posterior probabilities for choosing a regression model. 
Biometrika, 65:39-48. 
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer- 
Verlag. 
Efron, B. and Gong, G. (1983). A leisurely look at the bootstrap, the jackknife and 
cross-validation. Amer. Slat., 37:36-48. 
Jennrich, R. (1969). Asymptotic properties of nonlinear least squares estimators. 
Ann. Math. Slat., 40:633-643. 
Lindley, D. V. (1968). The choice of variables in multiple regression (with discus- 
sion). J. Roy. Slat. Soc., Ser. B, 30:31-66. 
MacKay, D. (1991). Bayesian methods for adaptive models. PhD thesis, California 
Institute of Technology. 
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15:661-675. 
Miller, R. G. (1974). The jackknife - a review. Biometrika, 61:1-15. 
Moody, J. E. (1992). The effective number of parameters, an analysis of general- 
ization and regularization in nonlinear learning system. In Moody, J. E., Hanson, 
S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing 
System . Morgan Kaufmann Publication. 
Schwartz, G. (1978). Estimating the dimension of a model. Ann. Slat, 6:461-464. 
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions 
(with discussion). J. Roy. Slat. Soc., Ser. B. 
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation 
and Akaike's criterion. J. Roy. Slat. Soc., Ser. B, 39(1):44-47. 
Zellner, A. (1984). Posterior odds ratios for regression hypotheses: General consid- 
eration and some specific results. In Zellner, A., editor, Basic Issues in Economet- 
rics, pages 275-305. University of Chicago Press. 
