Structured Machine Learning For 'Soft' 
Classification with Smoothing Spline 
ANOVA and Stacked Tuning, Testing 
and Evaluation 
Grace Wahba 
Dept of Statistics 
University of Wisconsin 
Madison, WI 53706 
Yuedong Wang 
Dept of Statistics 
University of Wisconsin 
Madison, WI 53706 
Chong Gu 
Dept of Statistics 
Purdue University 
West Lafayette, IN 47907 
Ronald Klein, MD 
Dept of Ophthalmalogy 
University of Wisconsin 
Madison, WI 53706 
Barbara Klein, MD 
Dept of Ophthalmalogy 
University of Wisconsin 
Madison, WI 53706 
Abstract 
We describe the use of smoothing spline analysis of variance (SS- 
ANOVA) in the penalized log likelihood context, for learning 
(estimating) the probability p of a '1' outcome, given a train- 
ing set with attribute vectors and outcomes. p is of the form 
p(t) - el(t)/(1 + el(t)), where, if t is a vector of attributes, f 
is learned as a sum of smooth functions of one attribute plus a 
sum of smooth functions of two attributes, etc. The smoothing 
parameters governing f are obtained by an iterative unbiased risk 
or iterative GCV method. Confidence intervals for these estimates 
are available. 
1. Introduction to 'soft' classification and the bias-variance tradeoff. 
In medical risk factor analysis records of attribute vectors and outcomes (0 or 1) 
for each example (patient) for n examples are available as training data. Based on 
the training data, it is desired to estimate the probability p of the i outcome for any 
415 
416 Wahba, Wang, Gu, Klein, and Klein 
new examples in the future, given their attribute vectors. In 'soft' classification, the 
estimate p of p is of particular interest, and might be used, say, by a physician to 
tell a patient that if he reduces his cholesterol from t to t', then he will reduce his 
risk of a heart attack from p(t) to/5(t'). We assume here that p varies 'smoothly' 
with any continuous attribute (predictor variable). 
It is long known that smoothness penalties and Bayes estimates are intimately re- 
lated (see e.g. Kimeldorf and Wahba(1970, 1971), Wahba(1990) and references 
there). Our philosophy with regard to the use of priors in Bayes estimates is to 
use them to generate families of reasonable estimates (or families of penalty func- 
tionals) indexed by those smoothing or regularization parameters which are most 
relevant to controlling the generalization error. (See Wahba(1990) Chapter 3, also 
Wahba(1992)). Then use cross-validation, generalized cross validation (GCV), un- 
biased risk estimation or some other performance oriented method to choose these 
parameter(s) to minimize a computable proxy for the generalization error. A person 
who believed the relevant prior might use maximum likelihood (ML) to choose the 
parameters, but ML may not be robust against an unrealistic prior (that is, ML 
may not do very well from the generalization point of view if the prior is off), see 
Wahba(1985). One could assign a hyperprior to these parameters. However, except 
in cases where real prior information is available, there is no reason to believe that 
the use of hyperpriors will beat out a performance oriented criterion based on a good 
proxy for the generalization error, assuming, of course, that low generalization error 
is the true goal. 
O'Sullivan et a/(1986) proposed a penalized log likelihood estimate of f, this work 
was extended to the SS-ANOVA context in Wahba, Gu, Wang and Chappell(1993), 
where numerous other relevant references are cited. This paper is available by 
ftp from ftp. stat.wisc.edu, cd pub/wahba in the file soft-class.ps.Z. An 
extended bibliography is available in the same directory as ml-bib.ps. The SS- 
ANOVA allows a variety of interpretable structures for the possible relationships 
between the predictor variables and the outcome, and reduces to simple relations 
in some of the attributes, or even, to a two-layer neural net, when the data suggest 
that such a representation is adequate. 
2. Soft classification and penalized log likelihood risk factor estimation 
To describe our 'worldview', let t be a vector of attributes, t  g2  T, where g2 is 
some region of interest in attribute space T. Our 'world' consists of an arbitrarily 
large population of potential examples, whose attribute vectors are distributed in 
some way over g2 and, considering all members of this 'world' with attribute vectors 
in a small neighborhood about t, the fraction of them that are 1's is p(t). Our 
training set is assumed to be a random sample of n examples from this population, 
whose outcomes are known, and our goal is to estimate p(t) for any t  fl. In 'soft' 
classification, we do not expect one outcome or the other to be a 'sure thing', that 
is we do not expect p(t) to be 0 or i for large portions of g2. 
Next, we review penalized log likelihood risk estimates. Let the training data be 
{Yi, t(i), i = 1, ...n} where yi has the value 1 or 0 according to the classification of 
example i, and t(i) is the attribute vector for example i. If the n examples are a 
random sample from our 'world', then the likelihood function of this data, given 
"Soft" Classification with Smoothing Spline ANOVA 417 
v('), is 
l ikel ihood { y, p } = II= xp( t ( i) )u' (1 - p( t ( i) ) ) -u ' , (1) 
which is the product of n Bernoulli likelihoods. Define the logit f(t) by f(t) = 
log[p(t)/(1 - p(t))], then p(t) = e.t(t)/(1 + e.t(t)). Substituting in f and taking logs 
gives 
- log likelihood{y,f) -- (y,f) =  log(1 + e l(t(i))) - yif(t(i)). (2) 
i=1 
We estimate f assuming that it is in some space 7/of smooth functions. (Technically, 
7/ is a reproducing kernel Hilbert space, see Vahba(1990), but you don't need to 
know what this is to read on). The fact that f is assumed 'smooth' makes the 
methods here very suitable for medical data analysis. The penalized log likelihood 
estimate fx of f will be obtained as the minimizer in 7/of 
(y, f) + xJ(f) 
(a) 
where J(f) is a suitable 'smoothness' penalty. A simple example is, T = [0, 1] and 
J(f) = fo(f(m)(t)?dt, in which case fx is a polynomial spline of degree 2m- 1. If 
Jam(f)= -- ax!:7]aa! '" ,Ox...Oxj d Ildxj. (4) 
o +...+o,=m c c j 
then fx is a thin plate spline. The thin plate spline is a linear combination of 
polynomials of degree m or less in d variables, and certain radial basis functions. 
For more details and other penalty functionals which result in rbf's, see Wahba(1980, 
1990, 1992). 
The likelihood function (y, f) will be maximized if p(t(i)) is 1 or 0 according as 
yi is i or 0. Thus, in the (full-rank) spline case, as A -+ 0, fx tends to 
at the data points. Therefore, by letting A be small, we can come close to fitting 
the data points exactly, but unless the 1's and O's are well separated in attribute 
space, fx will be a very 'wiggly' function and the generalization error (not precisely 
defined yet) may be large. 
The choice of A represents a tradeoff betwcen overfitting and underfitting the data 
(bias-variance tradeoff). It is important in practice good value of A. We now define 
what we mean by a good value of X. Given the family px, A >_ 0, we want to choose 
X so that px is close to the 'true' but unknown p so that, if new examples arrive with 
attribute vector in a neighborhood of t, px(t) will be a good estimate of the fraction 
of them that are 1's. 'Closeness' can be defined in various reasonable ways. We use 
the Kullbach-Leibler (KL) distance (not a real distance!). The KL distance between 
two probability measures (g,) is defined as KL(g,) = Ea[log(g/)] , where E a 
means expectation given g is the true distribution. If (t) is some probability 
measure on T, (say, a proxy for the distribution of the attributes in the population), 
then define KL (p, Px) (for Bernoulli random variables) with respect to  as 
KL.(p, px) = f [p(t)log (p(t)  (1-p(t) 
[,p-) ] + (1 -- p(t))log )] dr(t). 
,1 - px(t) 
(5) 
418 Wahba, Wang, Gu, Klein, and Klein 
Since KLv is not computable from the data, it is necessary to develop a computable 
proxy for it. By a computable proxy is meant a function of A that can be calculated 
from the training set which has the property that its minimizer is a good estimate 
of the minimizer of KLv. By letting px('t) - eld(t)/(1 q- e l(t)) it is seen that to 
minimize KL, it is only necessary to minimize 
[log(1 q- e Ix(t)) - p(t)f(t)]dy(t) (6) 
over A since (5) and (6) differ by something that does not depend on A. Leaving- 
out-half cross validation (CV) is one conceptually simple and generally defensible 
(albeit possibly wasteful) way of choosing A to minimize a proxy for KL(p, px). 
The n examples are randomly divided in half and the first n/2 examples are used 
to compute Px for a series of trial values of A. Then, the remaining n/2 examples 
are used to compute 
A 2 Z [log(1 q- e lx(t(i))) -- yifx(t(i))] (7) 
KLcv( ) = . 
,=+1 
for the trial values of A. Since the expected value of Yi is p(t(i)), (7) is, for each A an 
unbiased estimate of () with dr the sampling distribution of the {t(1), ..., t(n/2)}. 
A would then be chosen by minimizing (7) over the trial values. It is inappropriate to 
just evaluate (7) using the same data that was used to obtain fx, as that would lead 
to overfitting the data. Variations on (7) are obtained by successively leaving out 
groups of data. Leaving-out-one versions of (7) may be defined, but the computation 
may be prohibitive. 
3. Newton-Raphson Iteration and the Unbiased Risk estimate of A. 
We use the unbiased risk estimate given in Craven and Wahba(1979) for smoothing 
spline estimation with Gaussian errors, which has been adapted by Gu(1992a) for 
the Bernoulli case. To describe the estimate we need to describe the Newton- 
Raphson iteration for minimizing (3). Let b(f) -- log(1 q- el), then (y,f) - 
y4n=l[b(f(t(i)) - yif(t(i))]. It is easy to show that Eyi = f(t(i)) = b'(f(t(i)) and 
vat yi = p(t(i))(1- p(t(i)) = b"(f(t(i)). Represent f either exactly by using a 
basis for the (known) n-space of functions containing the solution, or approximately 
by suitable approximating basis functions, to get 
N 
f 
k=l 
Then we need to find c = (Cl,..., cs)' to minimize 
n N N 
n 
Ix(c) = Z b( cB(t( i) ) ) - yi( cB(t( i) ) ) q- - Ac 2c, 
i=1 k=l k--1 
where Z is the necessarily non-negative definite matrix determined 
J( c:B) = c'Zc. The gradient VIx and the Hessian V2Ix of Ixare given by 
i oI; I 
 = X'(p - y) + nAZc, 
(8) 
(o) 
by 
(10) 
"Soft" Classification with Smoothing Spline ANOVA 419 
{V2Ix}/, - O=Ix = X'WcX + nAE, (11) 
OcjOc 
where X is the matrix with ijth entry Bj (t(i)), Pc is the vector with ith entry 
do('(')) where f(.) 
= = = cB(.), and We is the diagonal 
given by pc(t(i)) (l+e('('))) 
matrix with iith entry pc(t(i))(1 - pc(t(i))). Given the tth Newton-Raphson iterate 
c , c (t+x) is given by 
c (t+) = c  -- (X'Wc(oX + nAE)-l(X'(pdo -- y) + nAEc ) (12) 
and c(t+) is the minimizer of 
1/2 2 
I?(c) = IIz()- w:(oXcl[ + nc'c. (13) 
where z(/), the sincalled pseudmdata, is given by 
z  = W -12' z/-(t) (14) 
c(t) I,--Pc(O) 'c(O .... 
The 'predicted' value (t) -,/2 - 
= c(o c, where c is the minimizer of (13), is related to 
the pseudo-data z(t) by 
() = A()()(), (la) 
where A(t) () is the smoother matrix given by 
wl/ ' W(,)X + nS)-  X' w/(16) 
A(t) () = "() '  .... 
In Wahba(1990), Section 9.2 , it w proposed to obtain a GCV score for 
in (9)  follows: For fixed X, iterate (12) to convergence. Define V(t)() 
[[(I- t(O())z(Oll/(t(I-1()())). Letting L be the converged value of 
compute 
ll(I - A() ())z() l[2 ll?(y- pc(=))ll 2 
v()(x) = ((- A()()))  ((- A()())) (1) 
and minimize V () with respect to . Gu(1992a) showed tha (since the variance 
is known once the mean is known here) that the unbied risk estimate U() in 
Craven and Wahba can also be adapted to this problem  
1 _/e [e 2 
v(o() =-II(,) (v-n(,))l +- A(O(). (18) 
He also proposed an alternating iteration, different than that described in 
Wahba(1990), namely, given cU) = c()(U)), find  = (+x) to minimize (18). 
Given (+), do a Newton step to get c (+x), get X(+e) by minimizing (18), continue 
until convergence. He showed that the alternating ierafion gave better estimates of 
 using V than the iteration in Wahba(1990),  menured by the KL-distance. His 
results (with the alternating iteration) suggested U had somewhat of an advantage 
over V, and that is what we are using in the present work. Zhao et al, his volume, 
have used V successfully with the alternating iteration. 
The defion of  ghere &fiers from ghe defion here by a facgor of n/2. Please 
noge he gypoapcM eor in (9.2.18) ghere where  shoed be 2X. 
420 Wahba, Wang, Gu, Klein, and Klein 
4. Smoothing spline analysis of variance (SS-ANOVA) 
In SS-ANOVA, f(t) = f(t, ...,ta) is decomposed as 
f(t) = ll + E fa(ta) + ]] fa(ta,t) + '" (19) 
where the terms in the expansion are uniquely determined by side conditions which 
generalize the side conditions of the usual ANOVA decompositions. Let the logitf (t) 
be of the form (19) where the terms are summed over a 6 , a,  6 , etc. where 
 indexes terms which are chosen to be retained in the model after a model selec- 
tion procedure. Then fx,, an estimate of f, is obtained  the minimizer of 
C(y, fx,o)+ A&(f) (20) 
where 
= X] + X] + ... 
(21) 
The J, JAB,'" are quadratic 'smoothness' penalty functionMs, and the O's satisfy 
a single constraint. For certain spline-like smoothness penalties, the minimizer of 
(20) is known to be in the span of a certain set of n functions, and the vector 
c of coefficients of these functions can (for fixed (A,0)) be chosen by the Newton 
Raphson iteration. Both A and the O's are estimated by the unbiased risk estimate 
of Gu using RKPACK(available from ne;libCresearch. a;;. cora) as a subroutine 
at each Newton iteration. Details of smoothing spline ANOVA decompositions may 
be found in Wahba(1990) and in Gu and Wahba(1993) (also available by ftp to 
f;p.s;a;.wiac.edu, cd to pub/wahba , in the file aaanova.pa.Z). In Wahba et 
a/(1993) op cit, we estimate the risk of diabetes given some of the attributes in the 
Pima-Indian data base. There JM was chosen partly by a screening process using 
paramteric GLIM models and partly by a leaving out approximately 1/3 procedure. 
Continuing work involves development of confidence intervals based on Gu(1992b), 
development of numerical methods suitable for very large data sets based on Gi- 
rard's(1991) randomized trace estimation, and further model selection issues. 
In the Figures we provide some preliminary analyses of data from the Wiscon- 
sin Epidemiological Study of Diabetic Retinopathy (WESDR, Klein et al 1988). 
The data used here is from people with early onset diabetes participating in the 
WESDR study. Figure l(left) gives a plot of body mass index (brai) (a mea- 
sure of obesity) vs age (age) for 669 instances (subjects) in the WESDR study 
that had no diabetic retinopathy or nonproliferative retinopathy at the start of 
the study. Those subjects who had (progressed) retinopathy four years later, are 
marked as  and those with no progression are marked as .. The contours are 
lines of estimated posterior standard deviation of the estimate /5 of the proba- 
bility of progression. These contours are used to delineate a region in which /5 
is deemed to be reliable. Glycosylated hemoglobin (gly), a measure of blood 
sugar control. was also used in the estimation of p. A model of the form 
p = e l/(1 + el), f(age, gly, brai) = p + fl (age) + b. gly + fa(bmi) + fla(age, bmi) 
was selected using some of the screening procedures described in Wahba et a/(1993), 
along with an examination of the estimated multiple smoothing parameters, which 
indicated that the linear term in gly was sufficient to describe the (quite strong) 
dependence on gly. Figure l(right) shows the estimated probability of progression 
"Soft" Classification with Smoothing Spline ANOVA 421 
given by this model. Figure 2(left) gives cross sections of the fitted model of Figure 
l(right), and Figure 2(right) gives another cross section, along with its confidence 
interval. Interesting observations can be made, for example, persons in their late 
20's with higher gly and bmJ. are at greatest risk for progression of the disease. 
Figure 1: Left: Data and contours of constant posterior standard deviation at 
the median gly, as a function of age and brai. Right: Estimated probability of 
progression at the median gly, as a function of age and brai. 
o. ql bmi gly=q2 
 ...... q2bmi 
c - - q3 bmi 
 -- - q4 bmi 
o 
10 20 30 40 )  
age (Y0 
gly----q3 
x',..:-, 
10 20 30 40 50 60 
age (yr) 
10 20 30 40 50 60 
age (yr) 
Figure 2: Left: Eight cross sections of the right panel of Figure 1, Estimated prob- 
ability of progression as a function of age, at four levels of brai by two of gly. 
ql,...q4 are the quartiles at .125, .375, .625 and .875. Right: Cross section of the 
right panel of Figure I for brai and gly at their medians, as a function of age, 
with Bayesian 'condifidence interval' (shaded) which generalizes Gu(1992b) to the 
multivariate case. 
422 Wahba, Wang, Gu, Klein, and Klein 
Acknowledgement s 
Supported by NSF DMS-9121003 and DMS-9301511, and NEI-NIH EY09946 and 
EY03083 
References 
Craven, P. & Wahba, G. (1979), 'Smoothing noisy data with spline functions: 
estimating the correct degree of smoothing by the method of generalized cross- 
validation', Numer. Math. 31,377-403. 
Girard, D. (1991), 'Asymptotic optimality of the fast randomized versions of GCV 
and CL in ridge regression and regularization', Ann. Statist. 19, 1950-1963. 
Gu, C. (1992a), 'Cross-validating non-Gaussian data', J. Cornput. Graph. Stats. 
1, 169-179. 
Gu, C. (1992b), 'Penalized likelihood regression: a Bayesian analysis', Statistica 
Sinica 2,255-264. 
Gu, C. & Wahba, G. (1993), 'Smoothing spline ANOVA with component-wise 
Bayesian "confidence intervals" ', J. Computational and Graphical Statistics 2, 1-21. 
Kimeldorf, G. &; Wahba, G. (1970), 'A correspondence between Bayesian estimation 
of stochastic processes and smoothing by splines', Ann. Math. Statist. 41,495-502. 
Klein, R., Klein, B., Moss, S. Davis, M.,  DeMets, D. (1988), Glycosylated 
hemoglobin predicts the incidence and progression of diabetic retinopathy, JAMA 
260, 2864-2871. 
O'Sullivan, F., Yandell, B. &; Raynor, W. (1986), 'Automatic smoothing of regres- 
sion functions in generalized linear models', J. Am. 5'tat. Soc. 81, 96-103. 
Wahba, G. (1980), Spline bases, regularization, and generalized cross validation for 
solving approximation problems with large quantities of noisy data, in W. Cheney, 
ed., 'Approximation Theory III', Academic Press, pp. 905-912. 
Wahba, G. (1985), 'A comparison of GCV and GML for choosing the smoothing 
parameter in the generalized spline smoothing problem', Ann. Statist. 13, 1378- 
1402. 
Wahba, G. (1990), Spline Models for Observational Data, SIAM. CBMS-NSF Re- 
gional Conference Series in Applied Mathematics, vol. 59. 
Wahba, G. (1992), Multivariate function and operator estimation, based on smooth- 
ing splines and reproducing kernels, in M. Casdagli &; S. Eubank, eds, 'Nonlinear 
Modeling and Forecasting, SFI Studies in the Sciences of Complexity, Proc. Vol 
XII', Addison-Wesley, pp. 95-112. 
Wahba, G., Gu, C., Wang, Y. & Chappell, R. (1993), Soft classification, a. k. 
a. risk estimation, via penalized log likelihood and smoothing spline analysis of 
variance, to appear, Proc. Santa Fe Workshop on Supervised Machine Learning, D. 
Wolpert and A. Lapedes, eds, and Proc. CLNL92, T. Petsche, ed, with permission 
of all eds. 
