The Bias-Variance Tradeoff and the Randomized GACV
Grace Wahba, Xiwu Lin and Fangyu Gao 
Dept of Statistics 
Univ of Wisconsin 
1210 W Dayton Street 
Madison, WI 53706
wahba, xiwu, fgao@stat.wisc.edu
Dong Xiang 
SAS Institute, Inc. 
SAS Campus Drive 
Cary, NC 27513 
sasdxxunx. sas. com 
Ronald Klein, MD and Barbara Klein, MD 
Dept of Ophthalmology
610 North Walnut Street 
Madison, WI 53706 
kleinr, kleinbepi. ophth.wisc. edu 
Abstract 
We propose a new in-sample cross validation based method (randomized 
GACV) for choosing smoothing or bandwidth parameters that govern the 
bias-variance or fit-complexity tradeoff in 'soft' classification. Soft clas- 
sification refers to a learning procedure which estimates the probability 
that an example with a given attribute vector is in class 1 vs class 0. The 
target for optimizing the tradeoff is the Kullback-Leibler distance
between the estimated probability distribution and the 'true' probabil- 
ity distribution, representing knowledge of an infinite population. The 
method uses a randomized estimate of the trace of a Hessian and mimics 
cross validation at the cost of a single relearning with perturbed outcome 
data. 
1 INTRODUCTION 
We propose and test a new in-sample cross-validation based method for optimizing the
bias-variance tradeoff in 'soft classification' (Wahba et al 1994), called ranGACV
(randomized Generalized Approximate Cross Validation). Summarizing from Wahba et al (1994),
we are given a training set consisting of $n$ examples, where for each example we have a
vector $t \in \mathcal{T}$ of attribute values, and an outcome $y$, which is either 0 or 1.
Based on the training data it is desired to estimate the probability $p$ of the outcome 1
for any new examples in the
future. In 'soft' classification the estimate $\hat{p}(t)$ of $p(t)$ is of particular
interest, and might be used by a physician to tell patients how they might modify their
risk $p$ by changing (some component of) $t$, for example, cholesterol as a risk factor for
heart attack. Penalized likelihood estimates are obtained for $p$ by assuming that the logit
$f(t)$, $t \in \mathcal{T}$, which satisfies $p(t) = e^{f(t)}/(1 + e^{f(t)})$, is in some
space $\mathcal{H}$ of functions. Technically $\mathcal{H}$ is a reproducing kernel Hilbert
space, but you don't need to know what that is to read on. Let the training set be
$\{y_i, t_i, i = 1, \dots, n\}$. Letting $f_i = f(t_i)$, the negative log likelihood
$\mathcal{L}\{y_i, t_i, f_i\}$ of the observations, given $f$, is
$$\mathcal{L}\{y_i, t_i, f_i\} = \sum_{i=1}^{n} [-y_i f_i + b(f_i)], \tag{1}$$
where $b(f) = \log(1 + e^{f})$. The penalized likelihood estimate of the function $f$ is the
solution to: Find $f \in \mathcal{H}$ to minimize $I_\lambda(f)$:
$$I_\lambda(f) = \sum_{i=1}^{n} [-y_i f_i + b(f_i)] + J_\lambda(f), \tag{2}$$
where $J_\lambda(f)$ is a quadratic penalty functional depending on parameter(s)
$\lambda = (\lambda_1, \dots, \lambda_q)$ which govern the so called bias-variance tradeoff.
Equivalently the components of $\lambda$ control the tradeoff between the complexity of $f$
and the fit to the training data. In this paper we sketch the derivation of the ranGACV
method for choosing $\lambda$, and present some preliminary but favorable simulation results,
demonstrating its efficacy. This method is designed for use with penalized likelihood
estimates, but it is clear that it can be used with a variety of other methods which contain
bias-variance parameters to be chosen, and for which minimizing the Kullback-Leibler (KL)
distance is the target. In the work of which this is a part, we are concerned with $\lambda$
having multiple components. Thus, it will be highly convenient to have an in-sample method
for selecting $\lambda$, if one that is accurate and computationally convenient can be found.
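As a minimal sketch of the objective just described, the following NumPy code evaluates the negative log likelihood (1) and a penalized objective of the form (2) on a toy vector of logits. The quadratic penalty here is an assumption made for illustration: a hypothetical second-difference matrix `M` scaled by `lam` stands in for $J_\lambda(f)$, which the paper defines through the reproducing kernel instead.

```python
import numpy as np

def b(f):
    """b(f) = log(1 + e^f), computed in a numerically stable way."""
    return np.logaddexp(0.0, f)

def prob_from_logit(f):
    """p = e^f / (1 + e^f)."""
    return 1.0 / (1.0 + np.exp(-f))

def neg_log_lik(y, f):
    """Equation (1): sum_i [-y_i f_i + b(f_i)]."""
    return np.sum(-y * f + b(f))

def penalized_objective(y, f, penalty_matrix):
    """Negative log likelihood plus a quadratic penalty f' M f.

    penalty_matrix (M) is a hypothetical stand-in for J_lambda(f).
    """
    return neg_log_lik(y, f) + f @ penalty_matrix @ f

# Toy data: four 0/1 outcomes and four candidate logit values.
y = np.array([0.0, 1.0, 1.0, 0.0])
f = np.array([-1.0, 0.5, 2.0, 0.0])

# Assumed roughness penalty: lam times the sum of squared second differences.
lam = 0.1
D = np.diff(np.eye(4), n=2, axis=0)   # (2 x 4) second-difference operator
M = lam * D.T @ D

objective_value = penalized_objective(y, f, M)
```

Increasing `lam` in this sketch makes rough logit vectors more expensive, which is the fit-complexity tradeoff that $\lambda$ controls.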
Let $p_\lambda$ be the estimate and $p$ be the 'true' but unknown probability function and
let $p_i = p(t_i)$, $p_{\lambda i} = p_\lambda(t_i)$. For in-sample tuning, our criterion for
a good choice of $\lambda$ is the KL distance
$$KL(p, p_\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left[ p_i \log \frac{p_i}{p_{\lambda i}} + (1 - p_i) \log \frac{(1 - p_i)}{(1 - p_{\lambda i})} \right].$$
We may replace $KL(p, p_\lambda)$ by the comparative KL distance (CKL), which differs from
KL by a quantity which does not depend on $\lambda$. Letting $f_{\lambda i} = f_\lambda(t_i)$,
the CKL is given by
$$CKL(p, p_\lambda) \equiv CKL(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [-p_i f_{\lambda i} + b(f_{\lambda i})]. \tag{3}$$
$CKL(\lambda)$ depends on the unknown $p$, and it is desired to have a good estimate or proxy
for it, which can then be minimized with respect to $\lambda$.
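The claim that KL and CKL differ by a quantity not depending on the estimate can be checked numerically. The sketch below uses made-up 'true' probabilities and two arbitrary candidate logit vectors (all hypothetical, generated only for the check) and compares the gap KL minus CKL for both candidates.

```python
import numpy as np

def sigmoid(f):
    """p = e^f / (1 + e^f)."""
    return 1.0 / (1.0 + np.exp(-f))

def kl(p, p_hat):
    """KL distance between true p and estimate p_hat, averaged over the t_i."""
    return np.mean(p * np.log(p / p_hat)
                   + (1.0 - p) * np.log((1.0 - p) / (1.0 - p_hat)))

def ckl(p, f_hat):
    """Comparative KL, equation (3), written in terms of the estimated logit."""
    return np.mean(-p * f_hat + np.logaddexp(0.0, f_hat))

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=200)   # hypothetical true probabilities
f_a = rng.normal(size=200)                   # two arbitrary candidate logits
f_b = rng.normal(size=200)

gap_a = kl(p_true, sigmoid(f_a)) - ckl(p_true, f_a)
gap_b = kl(p_true, sigmoid(f_b)) - ckl(p_true, f_b)
# Both gaps equal mean[p log p + (1 - p) log(1 - p)], which involves p only.
```

Because the gap involves only the true $p$, minimizing CKL over $\lambda$ selects the same $\lambda$ as minimizing KL.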
It is known (Wong 1992) that no exact unbiased estimate of $CKL(\lambda)$ exists in this
case, so that only approximate methods are possible. A number of authors have tackled this
problem, including Utans and Moody (1993), Liu (1993), Gu (1992). The iterative UBR method
of Gu (1992) is included in GRKPACK (Wang 1997), which implements general smoothing spline
ANOVA penalized likelihood estimates with multiple smoothing parameters. It has been
successfully used in a number of practical problems, see, for example, Wahba et al
(1994, 1995). The present work represents an approach in the spirit of GRKPACK but which
employs several approximations, and may be used with any data set, no matter how large,
provided that an algorithm for solving the penalized likelihood equations, either exactly
or approximately, can be implemented.
2 THE GACV ESTIMATE 
In the general penalized likelihood problem the minimizer $f_\lambda(\cdot)$ of (2) has a
representation
$$f_\lambda(t) = \sum_{\nu=1}^{M} d_\nu \phi_\nu(t) + \sum_{i=1}^{n} c_i Q_\lambda(t_i, t) \tag{4}$$
where the $\phi_\nu$ span the null space of $J_\lambda$, $Q_\lambda(s, t)$ is a reproducing
kernel (positive definite function) for the penalized part of $\mathcal{H}$, and
$c = (c_1, \dots, c_n)'$ satisfies $M$ linear conditions, so that there are (at most) $n$
free parameters in $f_\lambda$. Typically the unpenalized functions $\phi_\nu$ are low degree
polynomials. Examples of $Q(t_i, \cdot)$ include radial basis functions and various kinds of
splines; minor modifications include sigmoidal basis functions, tree basis functions and so
on. See, for example, Wahba (1990, 1995), Girosi, Jones and Poggio (1995). If
$f_\lambda(\cdot)$ is of the form (4) then $J_\lambda(f_\lambda)$ is a quadratic form in $c$.
Substituting (4) into (2) results in $I_\lambda$ a convex functional in $c$ and $d$, and $c$
and $d$ are obtained numerically via a Newton-Raphson iteration, subject to the conditions on
$c$. For large $n$, the second sum on the right of (4) may be replaced by
$\sum_{k=1}^{K} c_{i_k} Q_\lambda(t_{i_k}, t)$, where the $t_{i_k}$ are chosen via one of
several principled methods.
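To make the representation (4) concrete, here is a small sketch that evaluates a function of that form at new points. Several choices are assumptions made only for illustration: a one-dimensional attribute, a null space spanned by $\{1, t\}$ (so $M = 2$), a Gaussian radial basis function standing in for $Q_\lambda$, and arbitrary coefficient values. The $M$ linear conditions on $c$ are not enforced here.

```python
import numpy as np

def rbf_kernel(s, t, scale=0.2):
    """Gaussian radial basis function, a hypothetical stand-in for Q_lambda(s, t).

    Returns the matrix with entries Q(s_j, t_i) for all pairs.
    """
    return np.exp(-((s[:, None] - t[None, :]) ** 2) / (2.0 * scale ** 2))

def evaluate_f(t_new, t_train, d, c):
    """f(t) = d_1 * 1 + d_2 * t + sum_i c_i Q(t_i, t), as in equation (4)."""
    null_part = d[0] + d[1] * t_new               # unpenalized low-degree polynomial
    kernel_part = rbf_kernel(t_new, t_train) @ c  # penalized part
    return null_part + kernel_part

t_train = np.linspace(0.0, 1.0, 5)            # the t_i (or a subset t_{i_k})
d = np.array([0.5, -1.0])                     # arbitrary null-space coefficients
c = np.array([0.3, -0.1, 0.0, 0.2, -0.4])     # arbitrary kernel coefficients

t_new = np.array([0.0, 0.25, 0.9])
f_new = evaluate_f(t_new, t_train, d, c)      # estimated logits at new points
p_new = 1.0 / (1.0 + np.exp(-f_new))          # corresponding probabilities
```

Replacing `t_train` by a subset of $K$ representative points gives the reduced basis described for large $n$.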
To obtain the GACV we begin with the ordinary leaving-out-one cross validation function
$CV(\lambda)$ for the CKL:
$$CV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [-y_i f_{\lambda i}^{[-i]} + b(f_{\lambda i})], \tag{5}$$
where $f_\lambda^{[-i]}$ is the solution to the variational problem of (2) with the $i$th data
point left out and $f_{\lambda i}^{[-i]}$ is the value of $f_\lambda^{[-i]}$ at $t_i$.
Although $f_\lambda(\cdot)$ is computed by solving for $c$ and $d$, the GACV is derived in
terms of the values $(f_1, \dots, f_n)'$ of $f$ at the $t_i$. Where there is no confusion
between functions $f(\cdot)$ and vectors $(f_1, \dots, f_n)'$ of values of $f$ at
$t_1, \dots, t_n$, we let $f = (f_1, \dots, f_n)'$. For any $f(\cdot)$ of the form (4),
$J_\lambda(f)$ also has a representation as a non-negative definite quadratic form in
$(f_1, \dots, f_n)'$. Letting $\Sigma_\lambda$ be twice the matrix of this quadratic form we
can rewrite (2) as
$$I_\lambda(f, y) = \sum_{i=1}^{n} [-y_i f_i + b(f_i)] + \frac{1}{2} f' \Sigma_\lambda f. \tag{6}$$
Let $W = W(f)$ be the $n \times n$ diagonal matrix with $\sigma_{ii} = p_i(1 - p_i)$ in the
$ii$th position. Using the fact that $\sigma_{ii}$ is the second derivative of $b(f_i)$, we
have that $H = [W + \Sigma_\lambda]^{-1}$ is the inverse Hessian of the variational problem
(6). In Xiang and Wahba (1996), several Taylor series approximations, along with a
generalization of the leaving-out-one lemma (see Wahba 1990) are applied to (5) to obtain an
approximate cross validation function $ACV(\lambda)$, which is a second order approximation
to $CV(\lambda)$. Letting $h_{ii}$ be the $ii$th entry of $H$, the result is
$$CV(\lambda) \approx ACV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [-y_i f_{\lambda i} + b(f_{\lambda i})] + \frac{1}{n} \sum_{i=1}^{n} \frac{h_{ii}\, y_i (y_i - p_{\lambda i})}{[1 - h_{ii} \sigma_{ii}]}. \tag{7}$$
Then the GACV is obtained from the ACV by replacing $h_{ii}$ by
$\frac{1}{n} \sum_{i=1}^{n} h_{ii} \equiv \frac{1}{n} \mathrm{tr}(H)$ and replacing
$1 - h_{ii} \sigma_{ii}$ by $\frac{1}{n} \mathrm{tr}[I - (W^{1/2} H W^{1/2})]$, giving
$$GACV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [-y_i f_{\lambda i} + b(f_{\lambda i})] + \frac{\mathrm{tr}(H)}{n} \frac{\sum_{i=1}^{n} y_i (y_i - p_{\lambda i})}{\mathrm{tr}[I - (W^{1/2} H W^{1/2})]}, \tag{8}$$
where $W$ is evaluated at $f_\lambda$. Numerical results based on an exact calculation of (8)
appear in Xiang and Wahba (1996). The exact calculation is limited to small $n$, however.
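For small $n$ the exact ACV (7) and GACV (8) can be computed directly from the inverse Hessian. The sketch below does this on a toy problem. It is an illustration under stated assumptions, not the authors' implementation: the fit works directly with the vector $f$ as in (6), and `Sigma` is a hypothetical penalty matrix (a scaled second-difference penalty plus a small ridge) standing in for $\Sigma_\lambda$.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def fit_newton(y, Sigma, n_iter=50, tol=1e-10):
    """Minimize sum[-y_i f_i + b(f_i)] + 0.5 f' Sigma f by Newton-Raphson."""
    f = np.zeros(len(y))
    for _ in range(n_iter):
        p = sigmoid(f)
        grad = -y + p + Sigma @ f                 # gradient of (6)
        hess = np.diag(p * (1.0 - p)) + Sigma     # W + Sigma
        step = np.linalg.solve(hess, grad)
        f = f - step
        if np.max(np.abs(step)) < tol:
            break
    return f

def acv_and_gacv(y, f, Sigma):
    """Exact ACV (7) and GACV (8); forms the n x n inverse Hessian explicitly."""
    n = len(y)
    p = sigmoid(f)
    w = p * (1.0 - p)                             # sigma_ii
    H = np.linalg.inv(np.diag(w) + Sigma)
    h = np.diag(H)                                # h_ii
    obs = np.mean(-y * f + np.logaddexp(0.0, f))
    acv = obs + np.mean(h * y * (y - p) / (1.0 - h * w))
    trace_term = n - np.sum(w * h)                # tr[I - W^{1/2} H W^{1/2}]
    gacv = obs + (np.trace(H) / n) * np.sum(y * (y - p)) / trace_term
    return acv, gacv

# Toy problem (all settings below are assumptions made for this sketch).
rng = np.random.default_rng(0)
n = 60
t = (np.arange(1, n + 1) - 0.5) / n
y = (rng.uniform(size=n) < sigmoid(2.0 * np.sin(10.0 * t))).astype(float)
D = np.diff(np.eye(n), n=2, axis=0)
Sigma = 10.0 * D.T @ D + 1e-3 * np.eye(n)

f_hat = fit_newton(y, Sigma)
acv_value, gacv_value = acv_and_gacv(y, f_hat, Sigma)
```

The explicit matrix inverse is the step that limits this calculation to small $n$.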
3 THE RANDOMIZED GACV ESTIMATE 
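Given any 'black box' which, given $\lambda$ and a training set $\{y_i, t_i\}$, produces
$f_\lambda(\cdot)$ as the minimizer of (2), and thence
$f_\lambda = (f_{\lambda 1}, \dots, f_{\lambda n})'$, we can produce randomized estimates of
$\mathrm{tr}H$ and $\mathrm{tr}[I - W^{1/2} H W^{1/2}]$ without having any explicit
calculations of these matrices. This is done by running the 'black box' on perturbed data
$\{y_i + \delta_i, t_i\}$. For the $y_i$ Gaussian, randomized trace estimates of the Hessian
of the variational problem (the 'influence matrix') have been studied extensively and shown
to be essentially as good as exact calculations for large $n$, see for example Girard (1998).
Randomized trace estimates are based on the fact that if $A$ is any square matrix and
$\delta$ is a zero mean random $n$-vector with independent components with variance
$\sigma_\delta^2$, then $E\,\delta' A \delta = \sigma_\delta^2\, \mathrm{tr}A$. See Gong et al
(1998) and references cited there for experimental results with multiple regularization
parameters. Returning to the 0-1 data case, it is easy to see that the minimizer
$f_\lambda(\cdot)$ of $I_\lambda$ is continuous in $y$, notwithstanding the fact that in our
training set the $y_i$ take on only values 0 or 1. Letting
$f_\lambda^{y} = (f_{\lambda 1}, \dots, f_{\lambda n})'$ be the minimizer of (6) given
$y = (y_1, \dots, y_n)'$, and $f_\lambda^{y+\delta}$ be the minimizer given data
$y + \delta = (y_1 + \delta_1, \dots, y_n + \delta_n)'$ (the $t_i$ remain fixed), Xiang and
Wahba (1997) show, again using Taylor series expansions, that
$f_\lambda^{y+\delta} - f_\lambda^{y} \approx [W(f_\lambda^{y}) + \Sigma_\lambda]^{-1} \delta$.
This suggests that $\frac{1}{\sigma_\delta^2} \delta'(f_\lambda^{y+\delta} - f_\lambda^{y})$
provides an estimate of $\mathrm{tr}[W(f_\lambda^{y}) + \Sigma_\lambda]^{-1}$. However, if we
take the solution $f_\lambda^{y}$ to the nonlinear system for the original data $y$ as the
initial value for a Newton-Raphson calculation of $f_\lambda^{y+\delta}$, things become even
simpler. Applying a one step Newton-Raphson iteration gives
$$f_\lambda^{y+\delta,1} = f_\lambda^{y} - \left[ \frac{\partial^2 I_\lambda}{\partial f' \partial f}(f_\lambda^{y}, y + \delta) \right]^{-1} \frac{\partial I_\lambda}{\partial f}(f_\lambda^{y}, y + \delta). \tag{9}$$
Since $\frac{\partial I_\lambda}{\partial f}(f_\lambda^{y}, y + \delta) = -\delta + \frac{\partial I_\lambda}{\partial f}(f_\lambda^{y}, y) = -\delta$,
and $\left[ \frac{\partial^2 I_\lambda}{\partial f' \partial f}(f_\lambda^{y}, y + \delta) \right]^{-1} = \left[ \frac{\partial^2 I_\lambda}{\partial f' \partial f}(f_\lambda^{y}, y) \right]^{-1}$,
we have
$f_\lambda^{y+\delta,1} = f_\lambda^{y} + \left[ \frac{\partial^2 I_\lambda}{\partial f' \partial f}(f_\lambda^{y}, y) \right]^{-1} \delta$,
so that $f_\lambda^{y+\delta,1} - f_\lambda^{y} = [W(f_\lambda^{y}) + \Sigma_\lambda]^{-1} \delta$.
The result is the following ranGACV function: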
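$$ranGACV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} [-y_i f_{\lambda i} + b(f_{\lambda i})] + \frac{\delta'(f_\lambda^{y+\delta,1} - f_\lambda^{y})}{n} \frac{\sum_{i=1}^{n} y_i (y_i - p_{\lambda i})}{[\delta'\delta - \delta' W(f_\lambda^{y})(f_\lambda^{y+\delta,1} - f_\lambda^{y})]}. \tag{10}$$
To reduce the variance in the term after the '+' in (10), we may draw $R$ independent
replicate vectors $\delta_1, \dots, \delta_R$, and replace the term after the '+' in (10) by
its average over the replicates,
$$\frac{1}{R} \sum_{r=1}^{R} \frac{\delta_r'(f_\lambda^{y+\delta_r,1} - f_\lambda^{y})}{n} \frac{\sum_{i=1}^{n} y_i (y_i - p_{\lambda i})}{[\delta_r'\delta_r - \delta_r' W(f_\lambda^{y})(f_\lambda^{y+\delta_r,1} - f_\lambda^{y})]},$$
to obtain an $R$-replicated $ranGACV(\lambda)$ function.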
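The construction of (10) and its $R$-replicated version can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the 'black box' is played by a Newton-Raphson solver for (6) that works on the vector $f$ directly, `Sigma` is a hypothetical penalty matrix standing in for $\Sigma_\lambda$, and the toy data and settings are made up for the example.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def newton_step(f, y, Sigma):
    """One Newton-Raphson step on (6) for data y, starting from f (equation (9))."""
    p = sigmoid(f)
    grad = -y + p + Sigma @ f
    hess = np.diag(p * (1.0 - p)) + Sigma
    return f - np.linalg.solve(hess, grad)

def fit_newton(y, Sigma, n_iter=50, tol=1e-10):
    """Stand-in 'black box': iterate Newton steps from f = 0 until the step is tiny."""
    f = np.zeros(len(y))
    for _ in range(n_iter):
        f_new = newton_step(f, y, Sigma)
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f

def ran_gacv(y, f_y, Sigma, deltas):
    """R-replicated ranGACV; deltas has one perturbation vector per row."""
    n = len(y)
    p = sigmoid(f_y)
    w = p * (1.0 - p)                                 # diagonal of W(f_y)
    obs = np.mean(-y * f_y + np.logaddexp(0.0, f_y))
    resid = np.sum(y * (y - p))
    terms = []
    for delta in deltas:
        # One relearning step on perturbed outcomes y + delta.
        diff = newton_step(f_y, y + delta, Sigma) - f_y
        numer = delta @ diff                          # ~ sigma^2 tr(H)
        denom = delta @ delta - delta @ (w * diff)    # ~ sigma^2 tr(I - W H)
        terms.append((numer / n) * resid / denom)
    return obs + np.mean(terms)

# Toy problem (assumed settings).
rng = np.random.default_rng(1)
n = 60
t = (np.arange(1, n + 1) - 0.5) / n
y = (rng.uniform(size=n) < sigmoid(2.0 * np.sin(10.0 * t))).astype(float)
D = np.diff(np.eye(n), n=2, axis=0)
Sigma = 10.0 * D.T @ D + 1e-3 * np.eye(n)

f_hat = fit_newton(y, Sigma)
sigma_delta = 0.001
R = 5
deltas = sigma_delta * rng.standard_normal((R, n))
ran_gacv_value = ran_gacv(y, f_hat, Sigma, deltas)
```

Note that `ran_gacv` never forms $H$ itself; it only needs the fitted values before and after the perturbation, so the factor $\sigma_\delta^2$ cancels between numerator and denominator.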
4 NUMERICAL RESULTS 
In this section we present simulation results which are representative of more extensive
simulations to appear elsewhere. In each case, $K \ll n$ was chosen by a sequential
clustering algorithm. In that case, the $t_i$ were grouped into $K$ clusters and one member
of each cluster selected at random. The model is fit. Then the number of clusters is doubled
and the model is fit again. This procedure continues until the fit does not change. In the
randomized trace estimates the random variates were Gaussian. Penalty functionals were
(multivariate generalizations of) the cubic spline penalty functional
$\lambda \int_0^1 (f''(x))^2\,dx$, and smoothing spline ANOVA models were fit.
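The Gaussian randomized trace estimate used here rests on the identity $E\,\delta' A \delta = \sigma_\delta^2\, \mathrm{tr}A$. The short sketch below checks it on an arbitrary test matrix. The matrix, its size, and the number of replicates are assumptions chosen only to make the check quick; many more replicates are used than in the experiments so that the Monte Carlo error is small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary symmetric test matrix (a stand-in for an inverse Hessian).
n = 50
B = rng.standard_normal((n, n))
A = B @ B.T / n

sigma_delta = 0.001
R = 2000                                        # replicates for this check only
deltas = sigma_delta * rng.standard_normal((R, n))

# delta_r' A delta_r for every replicate r, then average and rescale.
quad_forms = np.einsum('ri,ij,rj->r', deltas, A, deltas)
trace_estimate = quad_forms.mean() / sigma_delta ** 2
trace_exact = np.trace(A)
```

Only products of $A$ with vectors are needed, which is why a single relearning on perturbed data can replace an explicit trace calculation.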
4.1 EXPERIMENT 1. SINGLE SMOOTHING PARAMETER 
In this experiment $t \in [0, 1]$, $f(t) = 2\sin(10t)$, $t_i = (i - .5)/500$,
$i = 1, \dots, 500$. A random number generator produced 'observations' $y_i = 1$ with
probability $p_i = e^{f_i}/(1 + e^{f_i})$, to get the training set. $Q_\lambda$ is given in
Wahba (1990) for this cubic spline case, $K = 50$. Since the true $p$ is known, the true CKL
can be computed. Fig. 1(a) gives a plot of $CKL(\lambda)$ and 10 replicates of
$ranGACV(\lambda)$. In each replicate $R$ was taken as 1, and $\delta$ was generated anew as
a Gaussian random vector with $\sigma_\delta = .001$. Extensive simulations with different
$\sigma_\delta$ showed that the results were insensitive to $\sigma_\delta$ from 1.0 to
$10^{-6}$. The minimizer of CKL is at the filled-in circle and the 10 minimizers of the 10
replicates of ranGACV are the open circles. Any one of these 10 provides a rather good
estimate of the $\lambda$ that goes with the filled-in circle. Fig. 1(b) gives the same
experiment, except that this time $R = 5$. It can be seen that the minimizers of ranGACV
become even more reliable estimates of the minimizer of CKL, and the CKL at all of the
ranGACV estimates are actually quite close to its minimum value.
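The data-generating step of this experiment is simple enough to sketch directly. The seed and the helper name `true_ckl` are assumptions introduced for illustration; the design points, the true logit, and the Bernoulli sampling follow the description above.

```python
import numpy as np

n = 500
i = np.arange(1, n + 1)
t = (i - 0.5) / n                           # t_i = (i - .5)/500
f_true = 2.0 * np.sin(10.0 * t)             # true logit f(t) = 2 sin(10 t)
p_true = np.exp(f_true) / (1.0 + np.exp(f_true))

rng = np.random.default_rng(0)              # seed chosen arbitrarily
y = (rng.uniform(size=n) < p_true).astype(float)   # y_i = 1 with probability p_i

def true_ckl(f_hat):
    """CKL of a candidate logit vector, computable here because p_true is known."""
    return np.mean(-p_true * f_hat + np.logaddexp(0.0, f_hat))
```

Evaluating `true_ckl` at the fit obtained for each $\lambda$ on a grid would trace out a curve of the kind the text compares against ranGACV.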
4.2 EXPERIMENT 2. ADDITIVE MODEL WITH $\lambda = (\lambda_1, \lambda_2)$
Here $t \in [0, 1] \times [0, 1]$. $n = 500$ values of $t_i$ were generated randomly
according to a uniform distribution on the unit square and the $y_i$ were generated
according to $p_i = e^{f_i}/(1 + e^{f_i})$ with $t = (x_1, x_2)$ and
$f(t) = 5 \sin 2\pi x_1 - 3 \sin 2\pi x_2$. An additive model as a special case of the
smoothing spline ANOVA model (see Wahba et al, 1995), of the form
$f(t) = \mu + f_1(x_1) + f_2(x_2)$ with cubic spline penalties on $f_1$ and $f_2$ was used.
$K = 50$, $\sigma_\delta = .001$, $R = 5$. Figure 1(c) gives a plot of
$CKL(\lambda_1, \lambda_2)$ and Figure 1(d) gives a plot of $ranGACV(\lambda_1, \lambda_2)$.
The open circles mark the minimizer of ranGACV in both plots and the filled-in circle marks
the minimizer of CKL. The inefficiency, as measured by
$CKL(\hat{\lambda}) / \min_\lambda CKL(\lambda)$, is 1.01. Inefficiencies near 1 are typical
of our other similar simulations.
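The inefficiency measure can be computed from two grids of values over $(\lambda_1, \lambda_2)$. In the sketch below the function name and the small grids are hypothetical, made-up numbers used only to show the calculation; they are not results from the paper.

```python
import numpy as np

def inefficiency(ckl_values, criterion_values):
    """CKL at the minimizer of the tuning criterion, divided by the minimum CKL.

    Both arrays are indexed by the same grid of smoothing parameters.
    """
    idx = np.unravel_index(np.argmin(criterion_values), criterion_values.shape)
    return ckl_values[idx] / np.min(ckl_values), idx

# Hypothetical 2 x 2 grids over (lambda_1, lambda_2), for illustration only.
ckl_grid = np.array([[0.60, 0.55],
                     [0.52, 0.50]])
ran_gacv_grid = np.array([[0.70, 0.60],
                          [0.51, 0.53]])

ratio, chosen = inefficiency(ckl_grid, ran_gacv_grid)
```

A ratio of exactly 1 would mean the criterion picked the CKL-optimal grid point; values slightly above 1 indicate a small loss.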
4.3 EXPERIMENT 3. COMPARISON OF ranGACV AND UBR 
This experiment used a model similar to the model fit by GRKPACK for the risk of
progression of diabetic retinopathy given $t = (x_1, x_2, x_3)$ = (duration, glycosylated
hemoglobin, body mass index) in Wahba et al (1995) as 'truth'. A training set of 669
examples was generated according to that model, which had the structure
$f(x_1, x_2, x_3) = \mu + f_1(x_1) + f_2(x_2) + f_3(x_3) + f_{1,3}(x_1, x_3)$. This
(synthetic) training set was fit by GRKPACK and also using $K = 50$ basis functions with
ranGACV. Here there are $p = 6$ smoothing parameters (there are 3 smoothing parameters in
$f_{1,3}$) and the ranGACV function was searched by a downhill simplex method to find its
minimizer. Since the 'truth' is known, the CKL for the ranGACV minimizer $\hat{\lambda}$ and
for the GRKPACK fit using the iterative UBR method were computed. This was repeated 100
times, and the 100 pairs of CKL values appear in Figure 1(e). It can be seen that the UBR
and ranGACV give similar CKL values about 90% of the time, while the ranGACV has lower CKL
for most of the remaining cases.
4.4 DATA ANALYSIS: AN APPLICATION 
Figure 1(f) represents part of the results of a study of association at baseline of
pigmentary abnormalities with various risk factors in 2585 women between the ages of 43 and
86 in the Beaver Dam Eye Study, R. Klein et al (1995). The attributes are: $x_1$ = age,
$x_2$ = body mass index, $x_3$ = systolic blood pressure, $x_4$ = cholesterol. $x_5$ and
$x_6$ are indicator variables for taking hormones, and history of drinking. The smoothing
spline ANOVA model fitted was
$f(t) = \mu + d_1 x_1 + d_2 x_2 + f_3(x_3) + f_4(x_4) + f_{34}(x_3, x_4) + d_5 I(x_5) + d_6 I(x_6)$,
where $I$ is the indicator function. Figure 1(f) represents a cross section of the fit for
$x_5$ = no, $x_6$ = no,
$x_2$, $x_3$ fixed at their medians and $x_1$ fixed at the 75th percentile. The dotted lines
are the Bayesian confidence intervals, see Wahba et al (1995). There is a suggestion of a
borderline inverse association of cholesterol. The reason for this association is uncertain.
More details will appear elsewhere.
Principled soft classification procedures can now be implemented in much larger data sets
than previously possible, and the ranGACV should be applicable in general learning.
References 
Girard, D. (1998), 'Asymptotic comparison of (partial) cross-validation, GCV and randomized
GCV in nonparametric regression', Ann. Statist. 26, 315-334.
Girosi, F., Jones, M. & Poggio, T. (1995), 'Regularization theory and neural networks 
architectures', Neural Computation 7, 219-269. 
Gong, J., Wahba, G., Johnson, D. & Tribbia, J. (1998), 'Adaptive tuning of numerical 
weather prediction models: simultaneous estimation of weighting, smoothing and physical 
parameters', Monthly Weather Review 125, 210-231. 
Gu, C. (1992), 'Penalized likelihood regression: a Bayesian analysis', Statistica Sinica 
2, 255-264. 
Klein, R., Klein, B. & Moss, S. (1995), 'Age-related eye disease and survival. the Beaver 
Dam Eye Study', Arch Ophthalmol 113, 1995. 
Liu, Y. (1993), Unbiased estimate of generalization error and model selection in neural 
network, manuscript, Department of Physics, Institute of Brain and Neural Systems, Brown 
University. 
Utans, J. & Moody, J. (1993), Selecting neural network architectures via the prediction 
risk: application to corporate bond rating prediction, in 'Proc. First Int'l Conf. on Artificial 
Intelligence Applications on Wall Street', IEEE Computer Society Press. 
Wahba, G. (1990), Spline Models for Observational Data, SIAM. CBMS-NSF Regional 
Conference Series in Applied Mathematics, v. 59. 
Wahba, G. (1995), Generalization and regularization in nonlinear learning systems, in 
M. Arbib, ed., 'Handbook of Brain Theory and Neural Networks', MIT Press, pp. 426- 
430. 
Wahba, G., Wang, Y., Gu, C., Klein, R. & Klein, B. (1994), Structured machine learning 
for 'soft' classification with smoothing spline ANOVA and stacked tuning, testing and 
evaluation, in J. Cowan, G. Tesauro & J. Alspector, eds, 'Advances in Neural Information 
Processing Systems 6', Morgan Kaufmann, pp. 415-422.
Wahba, G., Wang, Y., Gu, C., Klein, R. & Klein, B. (1995), 'Smoothing spline ANOVA for 
exponential families, with application to the Wisconsin Epidemiological Study of Diabetic 
Retinopathy', Ann. Statist. 23, 1865-1895. 
Wang, Y. (1997), 'GRKPACK: Fitting smoothing spline analysis of variance models to data
from exponential families', Commun. Statist. Sim. Comp. 26, 765-782.
Wong, W. (1992), Estimation of the loss of an estimate, Technical Report 356, Dept. of 
Statistics, University of Chicago, Chicago, IL.
Xiang, D. & Wahba, G. (1996), 'A generalized approximate cross validation for smoothing
splines with non-Gaussian data', Statistica Sinica 6, 675-692, preprint TR 930 available
via www.stat.wisc.edu/~wahba -> TRLIST.
Xiang, D. & Wahba, G. (1997), Approximate smoothing spline methods for large data sets 
in the binary case, Technical Report 982, Department of Statistics, University of Wisconsin, 
Madison WI. To appear in the Proceedings of the 1997 ASA Joint Statistical Meetings, 
Biometrics Section, pp 94-98 (1998). Also in TRLIST as above. 
[Figure 1: six panels (a)-(f). Curves labeled CKL and ranGACV. Horizontal axes: log lambda
in (a) and (b); log lambda1 in (c) and (d); ranGACV in (e); Cholesterol (mg/dL) in (f).]
Figure 1: (a) and (b): Single smoothing parameter comparison of ranGACV and CKL.
(c) and (d): Two smoothing parameter comparison of ranGACV and CKL. (e): Comparison
of ranGACV and UBR. (f): Probability estimate from Beaver Dam Study.
