Cross-Validation Estimates IMSE 
Mark Plutowski t, 
Shinichi Sakata  
Halbert White 
Department of Computer Science and Engineering 
 Department of Economics 
, Institute for Neural Computation 
University of Clifornia, San Diego 
Abstract 
Integrated Mean Squared Error (IMSE) is a version of the usual 
mean squared error criterion, averaged over all possible training 
sets of a given size. If it could be observed, it could be used 
to determine optimal network complexity or optimal data sub- 
sets for efficient training. We show that two common methods of 
cross-validating average squared error deliver unbiased estimates 
of IMSE, converging to IMSE with probability one. These esti- 
mates thus make possible approximate IMSE-based choice of net- 
work complexity. We also show that two variants of cross validation 
measure provide unbiased IMSE-based estimates potentially useful 
for selecting optimal data subsets. 
I Summary 
To begin, assume we are given a fixed network architecture. (We dispense with 
this assumption later.) Let z N denote a given set of N training examples. Let 
QN(z N) denote the expected squared error (the expectation taken over all possible 
examples) of the network after being trained on z N. This measures the quality of 
fit afforded by training on a given set of N examples. 
Let IMSEN denote the Integrated Mean Squared Error for training sets of size 
N. Given reasonable assumptions, it is straightforward to show that IMSEN ---- 
E[QN(ZN)] -- o '2, where the expectation is now over all training sets of size N, Z N 
is a random training set of size N, and a2 is the noise variance. 
Let Cs '- CN(z N) denote the "delete-one cross-validation" squared error measure 
for a network trained on z N. Cs is obtained by training networks on each of the N 
training sets of size N- 1 obtained by deleting a single example; the measure follows 
391 
392 Plutowski, Sakata, and White 
by computing squared error for the corresponding deleted example and averaging the 
results. Let GN, M = GN, M(Z N, M) denote the "generalization" measure obtained 
by separating the available data of size N + M into a training set z N of size N, and 
a validation ("test") set M of size M; the measure follows by training on z N and 
computing averaged squared error over M. 
We show that CN is an unbiased estimator of E[QN_x(zN)], and hence, estimates 
IMSEN_x up to noise variance. Similarly, GN, M is an unbiased estimator of 
E[QN(ZN,M]. Given reasonable conditions on the estimator and on the data 
generating process we demonstrate convergence with probability I of GN, M and 
aN to E[QN(ZN)] as N and M grow large. 
A direct consequence of these results is that when choice is restricted to a set of 
network architectures whose complexity is bounded above a priori, then choosing 
the architecture for which either Cs (or CS, M) is minimized leads to choice of 
the network for which IMSEN is nearly minimized for all N (respectively, N, M) 
sufficiently large. 
We also provide results for training sets sampled at particular inputs. Conditional 
IMSE is an appealing criterion for evaluating a particular choice of training set in 
the presence of noise. These results demonstrate that delete-one cross-validation es- 
timates average MSE (the average taken over the given set of inputs,) and that hold- 
out set cross-validation gives an unbiased estimate of E[QN(ZN)I Zv = (z N, yN)], 
given a set of N input values z N for which corresponding (random) output values 
yN are obtained. Either cross-validation measure can therefore be used to select 
a representative subset of the entire dataset that can be used for data compaction, 
or for more efficient training (as training can be faster on smaller datasets) [4]. 
2 Definitions 
2.1 Learning Task 
We consider the learning task of determining the relationship between a random 
vector X and a random scalar Y, where X takes values in a subset X of r, and Y 
takes values in a subset Y of g. (e.g. X = r and Y = g). We refer to X as the 
input space. The learning task is thus one of training a neural network with r inputs 
and one output. It is straightforward to extend the following analysis to networks 
with multiple targets. We make the following assumption on the observations to be 
used in the training of the networks. 
Assumption I X is a Borel subset of r and Y is a Borel subset of . Let Z = 
X x Y and f2 _-- Z O =- xxZ. Let (f2,',7 ) be a probability space with .T = 
The observations on Z -- (X ,Y) to be used in the training of the network are a 
realization of an i.i.d. stochastic process {Zi -= (X, Y/): f2 - X x Y}. 
When w 6 f2 is fixed, we write zi = Zi(w) for each i = 1, 2, .... Also write Z v = 
(Zx,..., Zv) and z  -- (zx,..., z). 
Assumption 1 allows uncertainty caused by measurement errors of observations 
as well as a probabilistic relationship between X and Y . It, however, does not 
prevent a deterministic relation ship between X and Y such that Y = g(X) for 
some measurable mapping g  Rr __ R. 
Cross-Validation Estimates IMSE 393 
We suppose interest attaches to the conditional expectation of Y given X, written 
g(z) = E(YIX ). The next assumption guarantees the existence of E(YilXi) and 
E(eilXi), ei -= Yi - E(YilXi). Next, for convenience, we assume homoscedasticity 
of the conditional variance of Y/ given Xi. 
Assumption 2 E(Y 2) < cx. 
Assumption 3 E(qlXx) - where  is a strictly positive constant. 
2.2 Network Model 
Let fP(., .)  X x Wp -- y be a network function with the "weight space" Wp, 
where p denotes the dimension of the "weight space" (the number of weights.) We 
impose some mild conditions on the network architecture. 
Assumption 4 For each p  {1,2,...,fi}, i6  N, Wp is a compact subset ofp, 
and fP  X x Wp --  satisfies the following conditions: 
1. if(., w)  X - Y is measurable for each w  Wp ; 
. if(z, .)' W r -- Y is continuous for all z  X. 
We further make a joint assumption on the underlying data generating process 
and the network architecture to assure that the training dataset and the networks 
behaves appropriately. 
Assumption 5 There ezists a function D  X - R+ = [0, c) such that for each 
z  X and w  We, Ire(z, w)l _< D(z), and E [(D(X)) ] < c. 
Hence, fP is square integrable for each w p  W p. We will measure network per- 
formance using mean squared error, which for weights w p is given by .(wP;p)  
E [(Y - fP(X, wp)2]. The optimal weights are the weights that minimize .(wP;p). 
The set of all optimal weights are given by W p* =- {w*  Wp  A(w*;p) _ 
A(w;p) for any w 6 Wp}. The index of the best network is p*, given by the smallest 
p minimizing minopEwp .(wP;p), p  {1,2,...,}. 
2.3 Least-Squares Estimator 
When assumptions 1 and 4 hold, the nonlinear least-squares estimator exists. 
mally, we have 
For- 
Lemma I Suppose that Assumptions I and d hold. Then 1. For each N 6 N, 
there ezists a measurable function 1N(.;p) ' Z N -- W e such that 1N(ZN;p) solves 
 N-XEiN=x(yi_f(Xi,w))  
the following problem with probability one: mmoEW p  
. A(.;p)  Wp -- R is continuous on Wp, and W p* is not empty. 
For convenience, we also define bv  9 -- We by bv(co ) = 1N(ZN (CO); p) for 
each co  9. Next let i,ie,...,iv be distinct natural numbers and let ,v = 
 ., ir). Then lv(, v) given above solves N Ej=x( ij - f(Xij, wp))2 with 
Z ' y. 
probability one. In particular, we will consider the estimate using the dataset z_i 
made by deleting the ith observation from z v. Let Z i be a random matrix made 
394 Plutowski, Sakata, and White 
by deleting the ith row from Z N. Thus, IN-1 (z_Ni;p) is a measurable least squares 
estimator and we can consider its probabilistic behavior. 
3 Integrated Mean Squared Error 
Integrated Mean Squared Error (IMSE) has been used to regulate network complex- 
ity [9]. Another (conditional) version of IMSE is used as a criterion for evaluating 
training examples [5, 6, 7, 8]. The first version depends only on the sample size, 
not the particular sample. The second (conditional) version depends additionally 
upon the observed location of the examples in the input space. 
3.1 Unconditional IMSE 
The (unconditional) mean squared error (MSE) of the network output at a partic- 
ular input value  is 
MN(i:;p) = E [{g(i) - f(i:,lN(ZN;p))}2] . (1) 
Integrating MSE over all possible inputs gives the unconditional IMSE: 
= f 
(2) 
= E (3) 
where p is the input distribution. 
3.2 Conditional IMSE 
To evaluate exemplars obtained at inputs :r N, we modify Equation (1) by condi- 
tioning on z v, giving 
Mv(lzV;p) - E [{g(5:)- f(i:,lv(Zv))}2lXV - 
The conditional IMSE (given inputs z N) is then 
IMSEN(xN;p) -- f mN(zlzN;p)p(dz) (4) 
= E (5) 
4 Cross-Validation 
Cross-validatory measures have been used successfully to assess the performance of 
a wide range of estimators [10, 11, 12, 13, 14, 15]. Cross-validatory measures have 
been derived for various performance criteria, including the Kullback-Liebler Infor- 
mation Criterion (KLIC) and the Integrated Squared Error (ISE, asymptotically 
equivalent to IMSE) [16]. Although provably inappropriate in certain applications 
[17, 18], optimality and consistency results for the cross-validatory measures have 
been obtained for several estimators, including linear regression, orthogonal series, 
splines, histograms, and kernel density estimators [16, 19, 20, 21, 22, 23, 24]. The 
authors are not aware of similar results applicable to neural networks, although two 
more general, but weaker results do apply [26]. A general result applicable to neural 
networks shows asymptotic equivalence between cross-validation and Akaike's Cri- 
terion for network selection [25, 29], as well as between cross-validation and Moody's 
Criterion [30, 29]. 
Cross-Validation Estimates IMSE 395 
4.1 Expected Network Error 
Given our assumptions, we can relate cross-validation to IMSE. For clarity and 
notational convenience, we first introduce a measure of network error closely related 
to IMSE. For each weight wp  Wp, we have defined the mean squared error A(wp; p) 
in Section 2.2. We define QN to map each dataset to the mean squared error of the 
estimated network 
QN(zN;p) ---- A(1N(zN;p);p). 
When Assumption 3 holds, we have 
A(wP;p) -- E[(g(X)- f(X, wP)) 2] +o '2 
= E [(g(XN+x) -- f(XN+x, wr)) 2] + r 2 
as is easily verified. We therefore have 
Qv(zV ;p) = E [(g(Xv+x) - f(XN+x,IN(ZN;p)))2IZN = Z v] + O '2. 
Thus, by using the law of iterated expectations, we have 
E [QN(ZN;p)] -- IMSEN(p)+ er 2. 
Likewise, given x N  X N, 
p)lxV = = (6) 
4.2 Cross-Validatory Estimation of Error 
In practice we work with observable quantities only. In particular, we must esti- 
mate the error of network p over novel data ("generalization") from a finite set of 
examples. Such an estimate is given by the delete-one cross-validation measure: 
N 
1 
CN(zN;P) --  Y (Yi -- f(xi,lN-x(z-Ni;P))) 2 (7) 
i----1 
where z N. denoS.es the training set obtained by deleting the ith example. Using 
z_/-instet of z  avoids a downward bias due to testing upon exampies used in 
training, as we show below (Theorem 3.) Another version of cross-validation is 
commonly used for evaluating "generalization" when an abundant supply of novel 
data is available for use as a "hold-out" set: 
M 
1 
GN'M(zN';?M;P) = M  (Oi- f('i,lN(zN;P))) 2 (8) 
, 
i--1 
where M _ (ZN+X,...,ZN+M). 
5 Expectation of the Cross-Validation Measures 
We now consider the relation between cross-validation measure and IMSE. We ex- 
amine delete-one cross-validation first. 
Proposition 1 (Unbiasedness of Cv) Let Assumptions I through 5 hold. Then 
for given N, Cs is an unbiased estimator of IMSEN-x (p) + tr 2' 
E [Cv(ZV;p)] = IMSEv_x(p) +o '2. (9) 
396 Plutowski, Sakata, and White 
With hold-out set cross-validation, the validation set Z M gives i.i.d. information 
regarding points outside of the training set Z N. 
Proposition 2 (Unbiasedness of GN, M) Let Assumptions I through 5 hold. Let 
,M _-- (ZN+x,..., ZM) . Then for given N and M, GN, M is an unbiased estimator 
of IMSEv (p) + a 2: 
= 
The latter result is appealing for large M, N. We expect delete-one cross-validation 
to be more appealing when training data is not abundant. 
6 
Expectation of Cross-Validation when Sampling at 
Selected Inputs 
We obtain analogous results for training sets obtained by sampling at a given set 
of inputs x N. We first consider the result for delete-one cross-validation. 
Proposition 3 (Expectation of Cs given xN Let Assumptions I through 5 
hold. Then, given a particular set of inputs, x , CN is an unbiased estimator 
of average MSEN-x + rr 2, the average taken over xN : 
N 
1 
E [Cv(ZV,P)lXV-x v] - y.MN_x(xi[x_/;p) + , 
i=1 
where x_Ni is a matrix made by deleting the ith row of x N. 
This essentially gives an estimate of MSEv_x limited to x  x v, losing a degree of 
freedom while providing no estimate of the MSE off of the training points. For this 
average to converge to IMSEv_, it will suffice for the empirical distribution of 
xN, v, to converge to/N, i.e., fiV =Y /N. We obtain a stronger result for hold-out 
set cross-validation. The hold-out set gives independent information on MSEv off 
of the training points, resulting in an estimate of IMSEv for given x v. 
Proposition 4 (Expectation of GV,M given v) Let Assumptions I through 5 
hold. Let M = (Zv+x,..., ZV+M) 
GN,M is an unbiased estimator of of lMSEN(xN;p) + 
7 Strong Convergence of Hold-Out Set Cross-Validation 
Our conditions deliver not only unbiasedhess, but also convergence of hold-out set 
cross-validation to IMSEv, with probability 1. 
Theorem 1 (Convergence of Hold-Out Set w.p. 1) Let Assump- 
tions I through 5 hold. Also let ,M __-- (ZN+x,...,ZN+M) . Ill or some A > 0 
a sequence {MN} of natural numbers satisfies MN > AN for any N = 1,2,..., 
then 
p) p)] 
, , - ; , O, asN--oo. 
Cross-Validation Estimates IMSE 397 
8 Strong Convergence of Delete-One Cross-Validation 
Given an additional condition (uniqueness of optimal weights) we can show strong 
convergence for delete-one cross-validation. First we establish uniform convergence 
of the estimators P(Zi) to optimal weights (uniformly over 1 < i < N.) 
Theorem 2 Let Assumptions I through 5 hold. Let Z} be the dataset made by 
deleting the kth observation from Z N. Then 
max d (/N_(Z_i;p),W p') --0 a.s.- as N --, () 
where d(w, WP*) _--inf.EWp. IIw- w*ll. 
This convergence result leads to the next result that the delete-one cross validation 
measure does not under-estimate the optimized MSE, namely, infopEW, 
Theorem 3 Let Assumptions I through 5 hold. Then 
liminfCN(ZN;p) >_ min A(w;p) a.s.-7 . 
N-*oo wW 
When the optimum weight is unique, we have a stronger result about convergence 
of the delete-one cross validation measure. 
Assumption 6 W p* is a singleton, i.e., W p* has only one element. 
Theorem 4 Let Assumptions I through 6 hold. Then 
CN (ZN;p) -- E [QN(ZN;p)] -- 0 a.s. as N -- 
O Conclusion 
Our results justify the intuition that cross-validation measures unbiasedly and con- 
sistently estimate the expected squared error of networks trained on finite training 
sets, therefore providing means of obtaining IMSE-approximate methods of select- 
ing appropriate network architectures, or for evaluating particular choice of training 
set. 
Use of these cross-validation measures therefore permits us to avoid underfitting 
the data, asymptotically. Note, however, that although we also thereby avoid over- 
fitting asymptotically, this avoidance is not necessarily accomplished by choosing 
a minimally complex architecture. The possibility exists that IMSEN-x(p) = 
IMSEN-x(p + 1). Because our cross-validated estimates of these quantities are 
random we may by chance observe CN(ZN;p) > CN(ZN;p + 1) and therefore se- 
lect the more complex network, even though the less complex network is equally 
good. Of course, because the IMSE's are the same, no performance degradation 
(overfitting) will result in this solution. 
Acknowledgement s 
This work was supported by NSF grant IRI 92-03532. We thank David Wolpert, 
Jan Larsen, Jeff Racine, Vjachislav Krushkal, and Patrick Fitzsimmons for valuable 
discussions. 
398 Plutowski, Sakata, and White 
References 
[41 
[51 
[7] 
[8] 
[9] 
[lO] 
[11] 
[12] 
[13] 
[14] 
[15] 
[16] 
[1] White, H. 1989. "Learning in Artificial Neu- 
ral Networks: A Statistical Perspective." Neu- 
ral Computation, I 4, pp.425-464. MIT Press, 
Cambridge, MA. 
[2] Plutowski, Mark E., Shinichi Sakata, and Hal- 
bert White. 1993. "Cross-validation delivers 
strongly consistent unbiased estimates of Inte- 
grated Mean Squared Error." To appear. 
[3] Plutowski, Mark E., and Halbert White. 1993. 
"Selecting concise training sets from clean data." 
IEEE Transactions on Neural Networks. 4, 3, 
pp.305-318. 
Plutowski, Mark E., Garrison Cottrell, and Hal- 
bert White. 1992. "Learning Mackey-Glass from 
25 examples, Plus or Minus 2." (Presented 
at 1992 Neural Information Processing Systems 
conference.) Jack D. Cowan, Gerald Tesauro, 
Joshua Aspector (eds.), Advances in neural in- 
formation processing systems 6, San Mateo, CA: 
Morgan Kaufmann Publishers. 
Fedorov, V.V. 1972. Theory of Optimal Ex- 
periments. Academic Press, New York. 
Box,G, and N.Draper. 1987. Empirical Model- 
Building and P. esponse Surfaces. Wiley, New 
York. 
Khuri, A.I., and J.A.Cornell. 1987. P. esponse 
Surfaces (Designs and Analyses). Marcel 
Dekker, Inc., New York. 
Faraway, Julian J. 1990. "Sequential design for 
the nonparametric regression of curves and sur- 
faces." Technical Report 177, Department of 
Statistics, The University of Michigan. 
Geman, Stuart, Elle Bienenstock, and lCten 
Doursat. 1992. "Neural networks and the 
bias/variance dilemma." Neural Computation. 4, 
1, 1-58. 
Stone, M. 1959. "Application of a measure of in- 
formation to the design and comparison of re- 
gression experiments." Annals Math. Star. 30 
55-69 
"Cross-validatory choice and assessment of sta- 
tistical predictions." J.R. Statist. Soc. B. 86, 2, 
111-47. 
]Bowman, Adrian W. 1984. "An alternative 
method of cross-validation for the smoothing of 
density estimates." Biometrika (1984), 71, 2, pp. 
353-60. 
Bowman, Adrian W., Peter Hall, D.M. Titter- 
ington. 1984. "Cross-validation in nonparametric 
estimation of probabilities and probability den- 
sities." Biometrika (1984), 71, 2, pp. 341-51. 
Matron, M. 1987. "A comparison of cross- 
validation techniques in density estimation." The 
Annals of Statistics. 15, 1, 152-162. 
Wahba, Grace. 1990. Spline Models for Ob- 
servational Data. v. 59 in the CBMS-NSF Re- 
glonal Conference Series in Applied Mathemat- 
ics, SIAM, Philadelphia, PA, March 1990. Soft- 
cover, 169 pages, bibliography, author index. 
ISBN 0-89871-244-0 
Hall, Peter. 1983. "Large sample optimality of 
least squares cross-vMidation in density estima- 
tion." The Annals of Statistics. 11, 4, 1156-1174. 
[17] Schuster, E.F., and G.G. Gregory. 1981. "On the 
nonconsistency of maximum likelihood nonpara- 
metric density estlmators". Comp. Sci.  Statis- 
tics: Proc. 13th Symp. on the Interface. W.F. 
Eddy ed. 295-298. Springer-Verlag. 
[18] Chow, Y.-S. and S.Geman, L.-D. Wu. 1983. 
"Consistent cross-validated density estimation." 
The Annals of Statistics. 11, 1, 25-38. 
[19] ]Bowman, Adrian W. 1980. "A note on consis- 
tency of the kernel method for the analysis of 
categorical data." Biometrika (1980), 67, 3, pp. 
682-4. 
[20] Stone, Charles J. 1984 "An asymptotically opti- 
mal window selection rule for kernel density 
timates." The Annals of Statistics. 12, 4, 1285- 
1297. 
[21] Li, Ker-Chau. 1986. "Asymptotic optimality of 
C L and generalized cross-vMidation in ridge re- 
gression with application to spline smoothing." 
The Annals of Statistics. 14, 3, 1101-1112. 
[22] Li, Ker-Chau. 1987. "Asymptotic optimality for 
C'p, C'L, cross-validation, and generalized cross- 
validation: discrete index set." The Annals of 
Statistics. 15, 3, 958-975. 
[23] Utreras, Florencio I. 1987. "On generalized cross- 
validation for multivariate smoothing spline 
functions." SIAM J. Sci. Star. Cornput. 8, 4, July 
1987. 
[24] Andrews, Donald W.K. 1991. "Asymptotic opti- 
mality of generalized C' L, cross-vMidation, and 
generalized cross-validation in regression with 
heteroskedastic errors." Journal of Econometrics. 
47 (1991) 359-377. North-Holland. 
[25] Stone, M. 1977. "An asymptotic equivalence of 
choice of model by cross-validation and Akaike's 
criterion." J. Roy. Stat. Soc. Ser B, 39, 1, 44-47. 
[26] Stone, M. 1977. "Asymptotics for and against 
cross-validation." Biometrika. 64, 1, 29-35. 
[27] BillingsIcy, Patrick. 1986. Probability and 
Measure. Wiley, New York. 
[28] Jennrich, R. 1969. "Asymptotic properties of 
nonlinear least squares estimators." Ann. Math. 
Stat. 40, 633-643. 
[29] Liu, Yong. 1993. "Neural network model selec- 
tion using asymptotic jackknife estimator and 
cross-validation method." In Giles, C.L., Han- 
son, S.J., and Cowan, J.D. (eds.), Advances in 
neural information processing systems 5, San 
Mateo, CA: Morgan Kaufmann Publishers. 
[30] Moody, John E. 1992. "The effective number 
of parameters, an analysis of generalization and 
regularization in nonlinear learning system." In 
Moody, J.E., Hanson, S.J., and Lippmann, R.P., 
(eds.), Advances in neural information process- 
ing systems 4, San Mateo, CA: Morgan Kauf- 
mann Publishers. 
[31] Bailey, Timothy L. and Charles Elkan. 1993. 
"Estimating the accuracy of learned concepts." 
To appear in Proc. International Joint Confer- 
ence on Artificial Intelligence. 
[32] White, Halbert. 1993. Estimation, Inference, 
and Specification Analysis. Manuscript. 
[33] White, Halbert. 1984. Asymptotic Theory for 
Econometricians. Academic Press. 
[34] Klein, Erwin and Anthony C. Thompson. 1984 
Theory of correspondences : including ap- 
plications to mathematical economics. Wi- 
ley. 
