ADAPTIVE SPLINE NETWORKS 
Jerome H. Friedman 
Department of Statistics and 
Stanford Linear Accelerator Center 
Stanford University 
Stanford, CA 94305 
Abstract 
A network based on splines is described. It automatically adapts the num- 
ber of units, unit parameters, and the architecture of the network for each 
application. 
I INTRODUCTION 
In supervised learning one has a system under study that responds to a set of 
simultaneous input signals x...xn). The response is characterized by a set of 
output signals y, y2,"', Ym). The goal is to learn the relationship between the 
inputs and the outputs. This exercise generally has two purposes: prediction and 
understanding. With prediction one is given a set of input values and wishes to 
predict or forecast likely values of the corresponding outputs without having to 
actually run the system. Sometimes prediction is the only purpose. Often, however, 
one wishes to use the derived relationship to gain understanding of how the system 
works. Such knowledge is often useful in its own right, for example in science, or it 
may be used to help improve the characteristics of the system, as in industrial or 
engineering applications. 
The learning is accomplished by taking training data. One observes the outputs 
produced by the system in response to varying sets of input values 
(Yli '''Ymi [ Xli''' Xni]l N. (1) 
These data (1) are then used to train an "artificial" system (usually a computer 
program) to learn the input/output relationship. The underlying framework or 
model is usually taken to be 
y- f(x...xn)+e, k= l,m (2) 
675 
676 Friedman 
with ave(ek I Xl''' Xn) -- 0. Here (2) yk is the kth responding output signal, f is 
a single valued deterministic function of an n-dimensional argument (inputs) and 
 is a random (stochastic) component that reflects the fact that (if nonzero) y 
is not completely specified by the observed inputs, but is also responding to other 
quantities that are neither controlled nor observed. In this framework the learning 
goal is to use the training data to derive a function f(xl ... Xn) that can serve as a 
reasonable approximation (estimate) of the true underlying ("target") function f 
(2). The supervised learning problem can in this way be viewed as one of function 
or surface approximation, usually in high dimensions (n >> 2). 
2 SPLINES 
There is an extensive literature on the theory of function approximation (see Cheney 
[1986] and Chui [1988], and references therein). From this literature spline methods 
have emerged as being among the most successful (see deBoor [1978] for a nice in- 
troduction to spline methods). Loosely speaking, spline functions have the property 
that they are the smoothest for a given flexibility and vice versa. This is impor- 
tant if one wishes to operate under the least restrictive assumptions concerning 
fk(xl... Xn) (2), namely, that it is relatively smooth compared to the noise  but 
is otherwise arbitrary. A spline approximation is characterized by its order q [q = 1 
(linear), q = 2 (quadratic), and q = 3 (cubic) are the most popular orders]. The 
procedure is to first partition the input variable space into a set of disjoint regions. 
The approximation f(xl ... Xn) is taken to be a separate n-dimensional polynomial 
in each region with maximum degree q in any one variable, constrained so that f 
and all of its derivatives to order q - 1 are continuous across all region boundaries. 
Thus, a particular spline approximation is determined by a choice for q, which tends 
not to be very important, and the particular set of chosen regions, which tends to 
be crucial. The central problem associated with spline approximations is how to 
choose a good set of associated regions for the problem at hand. 
2.1 TENSOR-PRODUCT SPLINES 
The most popular method for partitioning the input variable space is by the tensor 
or outer product of interval sets on each of the n axes. Each input axis is partitioned 
into K + i intervals delineated by K points ("knots"). The regions in the n- 
dimensional space are taken to be the (K + 1) n intersections of all such intervals. 
Figure I illustrates this procedure for K -- 4 knots on each of two axes producing 
25 regions in the corresponding two-dimensional space. 
Owing to the regularity of tensor-product representations, the corresponding spline 
approximation can be represented in a simple form as a basis function expansion. 
Let x- (x...xn). Then 
f(x) =  wt Bt(x) (3) 
where {wt} are the coefficients (weights) for each respective basis function Bt(x), 
and the basis function set {Be(x)} is obtained by taking the tensor product of the 
set of functions 
' ),=0, {(xj - (4) 
Adaptive Spline Networks 677 
over all of the axes, j = 1, n. That is, each of the K + q + 1 functions on each axis j 
(j = 1, n) is multiplied by all of the functions (4) corresponding to all of the other 
axes k (k = 1,n; k  j). As a result the total number of basis functions (3) defining 
the tensor-product spline approximation is 
(K + q + 1) n . (5) 
The functions comprising the second set in (4) are known as the truncated power 
functions: 
(xj -tkj) = { 0 < (6) 
(xj -- tj) q xj > tj 
and there is one for each knot location tj (k = 1,K) on each input axis j (j = 1, n). 
Although conceptually quite simple, tensor-product splines have severe limitations 
that preclude their use in high dimensional settings (n >> 2). These limitations 
stem from the exponentially large number of basis functions that are required (5). 
For cubic splines (q = 3) with five inputs (n = 5) and only five knots per axis 
(K = 5) 59049 basis functions are required. For n = 6 that number is 531441, and 
for n = 10 it is approximately 3.5 x 109. This poses severe statistical problems 
in fitting the corresponding number of weights unless the training sample is large 
compared to these numbers, and computational problems in any case since the 
computation grows as the number of weights (basis functions) cubed. These are 
typical manifestations of the so-called "curse-of-dimensionality" (Bellman [1961]) 
that afflicts nearly all high-dimensional problems. 
3 ADAPTIVE SPLINES 
This section gives a very brief overview of an adaptive strategy that attempts 
to overcome the limitations of the straightforward application of tensor-product 
splines, making practical their use in high-dimensional settings. This method, called 
MARS (multivariate adaptive regression splines), is described in detail in Friedman 
[1991] along with many examples of its use involving both real and artificially gen- 
erated data. (A FORTRAN program implementing the method is available from 
the author.) 
The method (conceptually) begins by generating a tensor-product partition of the 
input variable space using a large number of knots, K < N, on each axis. Here 
N (1) is the training sample size. This induces a very large (K + 1)" number 
of regions. The procedure then uses the training data to select particular unions 
of these (initially large number of) regions to define a relatively small number of 
(larger) regions most suitable for the problem at hand. 
This strategy is implemented through the basis function representation of spline 
approximations (3). The idea is to select a relatively small subset of basis functions 
{B(x))0 M C ,,{Be(x)) uge (7) 
small 
from the very large set (3) (4) (5) i,,duced by the initial tensor-product partition. 
The particular subset for a problem at hand is obtained through standard statistical 
variable subset selection, treating the basis functions as the "variables". At the 
678 Friedman 
first step the best single basis function is chosen. The second step chooses the basis 
function that works best in conjunction with the first. At the rath step, the one 
that works best with the ra- 1 already selected, is chosen, and so on. The process 
stops when including additional basis functions fails to improve the approximation. 
3.1 ADAPTIVE SPLINE NETWORKS 
This section describes a network implementation that approximates the adaptive 
spline strategy described in the previous section. The goal is to synthesize a good 
set of spline basis functions (7) to approximate a particular system's input/output 
relationship, using the training data. For the moment, consider only one output y; 
this is generalized later. The basic observation leading to this implementation is 
that the approximation takes the form of sums of products of very simple functions, 
namely the truncated power functions (6), each involving a single input variable, 
and 
= 
k=l 
(8) 
M 
/(x) = wmB(x). (9) 
Here 1 _< j(k) _< n is an input variable and 1 <_ K,, _ n is the number of factors in 
the product (interaction level). 
The network is comprised of an ordered set of interconnected units. Figure 2 shows 
a diagram of the interconnections for a (small) network. Figure 3 shows a schematic 
diagram of each individual unit. Each unit has as its inputs all of the system inputs 
x    z, and all of the outputs from the previous units in the network B0  -. BM. It 
is also characterized by three parameters' j, , t. The triangles in Figure 3 represent 
selectors. The upper triangle selects one of the system inputs, zj; the left triangle 
selects one of the previous unit outputs, Be. These serve as inputs, along with the 
parameter t, to two internal units that each produce an output. The first output 
is Be-(zj - t)_ and the second is Be. (t - zd).. The whole unit thereby produces 
two outputs BM+t and BM+2, that are available to serve as inputs to future units. 
In addition to units of this nature, there is an initial unit (B0) that produces the 
constant output B0 = 1, that is also available to be selected as an input to all units. 
The output of the entire network, , is a weighted sum (9) of all of the unit outputs 
(including B0 = 1). This is represented by the bottom trapezoid in Figure 2. 
The parameters associated with the network are the number of units Nu, the pa- 
rameters associated with each one 
(lO) 
and the weights in the final adder 
{wl}o M=2'Nu. (11) 
The goal of training the network is to choose values for these parameters (10) (11) 
that minimize average future prediction error (squared), that is the squared error on 
Adaptive Spline Networks 679 
(test) data not used as part of the training sample. An estimate of this quantity is 
provided by the generalized cross-validation model selection criterion (Craven and 
Wahba [1979]) 
N 5.Nu+ 1. (12) 
1 E(y i_fi)2 / 1 N 
GCV =  
The numerator in (12) is the average squared-error over the training data. The 
denominator is an (inverse) penalty for adding units. The quantity (5.Nu+ 1) is just 
the number of adjustable parameters in the network. This GCV criterion (12) has 
its roots in ordinary (leave-one-out) cross-validation and serves as an approximation 
to it (see Craven and Wahba [1979]). 
The training strategy used is a semi-greedy one. The units are considered in order. 
For the ruth unit the weights of all later units are set to zero, that is 
W2m+l -- ''' -- W2Mmx -- 0 
where Mmax is the maximum number of units in the network. The GCV criterion 
(12) is then minimized with respect to the parameters of the ruth unit (m, jm, tin), 
and the weights associated with all previous units as well as the unit under con- 
sideration {wk) m, given the parameter values associated with the previous units 
{i,ji,ti} -. This optimization can be done very rapidly, O(nm2N), using least 
squares updating formulae (see Friedman [1991]). This process is repeated un- 
til Mmx units have been added to the network. A post optimization procedure 
(weight elimination) is then applied to select an optimal subset of weights to be set 
to zero, so as to minimize the GCV criterion (12). This will (usually) decrease the 
GCV value since it includes a penalty for increasing the number of nonzero weights 
The semi-greedy training strategy has the advantage of being quite fast. The total 
computation is O(nNMmax) where n is the number of system inputs, N is the 
training sample size, and Mmax is the maximum number of units to be included 
in the network (before weight elimination). On a SUN SPARCstation, small to 
moderate sized problems train in seconds to minutes, and very large ones in a 
few hours. A potential disadvantage of this strategy is possible loss of prediction 
accuracy compared to a more thorough optimization strategy. This tends not to 
be the case. Experiments with more complete optimization seldom resulted in 
even moderate improvement. This is because units added later to the network can 
compensate for the suboptimal settings of parameters introduced earlier. 
Figure 4 illustrates a (very small) network that might be realized with the MARS 
procedure. The number above each unit is the system input that it selected. The 
letter within each unit represents its knot parameter. The first unit necessarily has 
as its input the constant B0 = 1. Its first output goes to the final adder but was not 
selected as an input to any future units. Its second output serves as the selected 
input to the next two units, but was eliminated from the adder by the final weight 
elimination, and so on. The final approximation for this network is 
q 
S(x) = w0 + w(xa - s) + w,.(s- xa)( - 
q 
+w( - )$ (, - ,)$ + w,(- )$(, - x,)_(, - v)+ 
+w(- )$(- )$ (v- 
680 Friedman 
Two possible network topologies that might be realized are of special interest. One 
is where all units happen to select the constant line B0 = 1 as their unit input. In 
this case the resulting approximation will be a sum of spline functions each involving 
only one input variable. This is known as an additive function (no interactions) 
J 
f(x) = fi(xj). (13) 
j=l 
An additive function has the property that the functional dependence on any vari- 
able is independent of the values of all other input variables up to an overall additive 
constant. Additive function approximations are important because many true un- 
derlying functions f(x) (2) are close to additive and thus well approximated by 
additive functions. MARS can realize additive functions as a subclass of its poten- 
tial models. 
Another potential network topology that can be realized by MARS is one in which 
every unit output serves either as an input to one (and only one) other unit or goes 
to the final weighted adder (but not both). This is a binary tree topology similar to 
those generated by recursive partitioning strategies like CART (Breiman, Friedman, 
Olshen and Stone [1984]). In fact, if one were to impose this restriction and employ 
q = 0 splines, the MARS strategy reduces to that of CART (see Friedman [1991]). 
Thus, MARS can also realize CART approximations as a subclass of its potential 
models. 
MARS can be viewed as a generalization of CART. First by allowing q > 0 splines 
continuous approximations are produced. This generally results in a dramatic in- 
crease in accuracy. In addition, all unit outputs are eligible to contribute to the 
final adder, not just the terminal ones; and finally, all previous unit outputs are 
eligible to be selected as inputs for new units, not just the currently terminal ones. 
Both additive and CART approximations have been highly successful in largely 
complementary situations: additive modeling when the true underlying function is 
close to additive, and CART when it dominately involves high order interactions 
between the input variables. MARS unifies both into a single framework. This 
lends hope that MARS will be successful at both these extremes as well as the 
broad spectrum of situations in between where neither works well. 
Multiple response outputs Y'"Ym (1) (2) are incorporated in a straightforward 
manner. The internal units and their interconnections are the same as described 
above and shown in Figures 2 and 3. Only the final weighted adder unit (Figure 2) 
is modified to incorporate a set of weights 
(14) 
for each response output (k = 1, m). The approximation for each output is 
M 
= = 1, 
The numerator in he GCV criterion (12) is replaced by 
m N 
i 
k=l i=1 
Adaptive Spline Networks 681 
and it is minimized with respect to the internal network parameters (10) and all of 
the weights (14). 
4 DISCUSSION 
This section (briefly) compares and contrasts the MARS approach with radial basis 
functions and sigmoid "back-probagation" networks. An important consequence 
of the MARS strategy is input variable subset selection. Each unit individually 
selects the best system input so that it can best contribute to the approximation. 
It is often the case that some or many of the inputs are never selected. These will be 
inputs that tend to have little or no effect on the output(s). In this case excluding 
them from the approximation will greatly increase statistical accuracy. It also aids 
in the interpretation of the produced model. In addition to global variable subset 
selection, MARS is able to do input variable subset selection locally in different 
regions of the input variable space. This is a consequence of the restricted support 
(nonzero value) of the basis functions produced. Thus, if in any local region, the 
target function (2) depends on only a few of the inputs, MARS is able to use this to 
advantage even if the relevant inputs are different in different local regions. Also, 
MARS is able to produce approximations of low interaction order even if the number 
of selected inputs is large. 
Radial basis functions are not able to do local (or usually even global) input variable 
subset selection as a part of the procedure. All basis functions involve all of the 
inputs at the same relative strength everywhere in the input variable space. If the 
target function (2) is of this nature they will perform well in that no competing 
procedure will do better, or likely even as well. If this is not the case, radial 
basis functions are not able to take advantage of the situation to improve accuracy. 
Also, radially symmetric basis functions produce approximations of the highest 
possible interaction order (everywhere in the input space). This results in a marked 
disadvantage if the target function tends to dominately involve interactions in at 
most a few of the inputs (such as additive functions (13)). 
Standard networks based on sigmoidal units of linear combinations of inputs share 
the properties described above for radial basis functions. Including "weight elim- 
ination" (Rumelhart [1988]) provides an (important) ability to do global (but not 
local) input variable subset selection. The principal differences between MARS and 
this approach center on the use of splines rather than sigmoids, and products rather 
than linear combinations of the input variables. Splines tend to be more flexible in 
that two spline functions can closely approximate any sigmoid whereas it can take 
many sigmoids to approximate some splines. MARS' use of product expansions en- 
ables it to produce approximations that are local in nature. Local approximations 
have the property that if the target function is badly behaved in any local region 
of the input space, the quality of the approximation is not affected in the other re- 
gions. Also, as noted above, MARS can produce approximations of low interaction 
order. This is difficult for approximations based on linear combinations. 
Both radial basis functions and sigmoidal networks produce approximations that 
are difficult to interpret. Even in situations where they produce high accuracy, 
they provide little information concerning the nature of the target function. MARS 
approximations on the other hand can often provide considerable interpretable in- 
682 Friedman 
formation. Interpreting MARS models is discussed in detail in Friedman [1991]. 
Finally, training MARS networks tends to be computationally much faster than 
other types of learning procedures. 
References 
Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press, 
Princeton, NJ. 
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification 
and Regression Trees. Vadsworth, Belmont, CA. 
Cheney, E. W. (1986). Multivariate Approximation Theory: Selected Topics. 
Monograph: SIAM CBMS-NSF Regional Conference Series in Applied Mathemat- 
ics, Vol. 51. 
Chui, C. K. (1988). Multivariate Splines. Monograph: SIAM CBMS-NSF Regional 
Conference Series in Applied Mathematics, Vol. 54. 
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. 
Estimating the correct degree of smoothing by the method of generalized cross- 
validation. Numerische Mathematik 31 317-403. 
deBoor, C. (1978). A Practical Guide to Splines. Springer-Verlag, New York, NY. 
Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). 
Annals o Statistics, March. 
Rumelhart, D. E. (1988). Learning and generalization. IEEE International Confer- 
ence on Neural Networks, San Diego, plenary address. 
Adaptive Spline Networks 683 
FIGURE 2 
FIGURE 1 
FIGURE 3 
FIGURE 4 
