Iterative Construction of 
Sparse Polynomial Approximations 
Terence D. Sanger 
Massachusetts Institute 
of Technology 
Room E25-534 
Cambridge, MA 02139 
tds@ai.mit.edu 
Richard S. Sutton 
GTE Laboratories 
Incorporated 
40 Sylvan Road 
Waltham, MA 02254 
sutton@gte.com 
Christopher J. Matheus 
GTE Laboratories 
Incorporated 
40 Sylvan Road 
Waltham, MA 02254 
matheus@gte.com 
Abstract 
We present an iterative algorithm for nonlinear regression based on con- 
struction of sparse polynomials. Polynomials are built sequentially from 
lower to higher order. Selection of new terms is accomplished using a novel 
look-ahead approach that predicts whether a variable contributes to the 
remaining error. The algorithm is based on the tree-growing heuristic in 
LMS Trees which we have extended to approximation of arbitrary poly- 
nonrials of the input features. In addition, we provide a new theoretical 
justification for this heuristic approach. The algorithm is shown to dis- 
cover a known polynomial from samples, and to make accurate estimates 
of pixel values in an image-processing task. 
1 INTRODUCTION 
Linear regression attempts to approximate a target function by a model that is a 
linear combination of the input features. Its approximation ability is thus limited 
by the available features. We describe a method for adding new features that are 
products or powers of existing features. Repeated addition of new features leads 
to the construction of a polynomial in the original inputs, as in (Gabor 1961). 
Because there is an infinite number of possible product terms, we have developed 
a new method for predicting the usefulness of entire classes of features before they 
are included. The resulting nonlinear regression will be useful for approximating 
functions that can be described by sparse polynomials. 
1064 
Iterative Construction of Sparse Polynomial Approximations 1065 
Figure 1: Network depiction of linear regression on a set of features xi. 
2 THEORY 
Let {Xi}in__l be the set of features already included in a model that attempts to 
predict the function f. The output of the model is a linear combination 
where the ci's are coefficients determined using linear regression. The model can 
also be depicted as a single-layer network as in figure 1. The approximation error 
is e = f - f, and we will attempt to minimize E[e 2] where E is the expectation 
operator. 
The algorithm incrementally creates new features that are products of existing 
features. At each step, the goal is to select two features x r and xq already in the 
model and create a new feature XpXq (see figure 2). Even if XpXq does not decrease 
the approximation error, it is still possible that XpXqX r will decrease it for some 
xr. So in order to decide whether to create a new feature that is a product with 
xr, the algorithm must "look-ahead" to determine if there exists any polynomial a 
in the xi's such that inclusion of ax r would significantly decrease the error. If no 
such polynomial exists, then we do not need to consider adding any features that 
are products with 
Define the inner product between two polynomials a and b as (alb) = E[ab] where 
the expected value is taken with respect to a probability measure p over the (zero- 
mean) input values. The induced norm is Ilall - and let P be the set of 
polynomials with finite norm. {P, (.[.)} is then an infinite-dimensional linear vector 
space. The Weierstrass approximation theorem proves that P is dense in the set of 
all square-integrable functions over p, and thus justifies the assumption that any 
function of interest can be approximated by a member of P. 
Assume that the error e is a polynomial in P. In order to test whether aXp partic- 
ipates in e for any polynomial a 6 P, we write 
e = arx r + b r 
1066 Sanger, Sutton, and Matheus 
1.. / .) 
pq 
Xl Xp 2q n 
Figure 2: Incorporation of a new product term into the model. 
where a e and by are polynomials, and a e is chosen to minimize Ilaee- ell' = 
E[(aez e - e)2]. The orthogonality principle then shows that aez e is the projection 
of the polynomial e onto the linear subspace of polynomials zeP. Therefore, by is 
orthogonal to zeP, so that E[beg ] = 0 for all g in zeP. 
We now write 
E[e 2] = ' ' 2E[aeebe] [b'] ' ' 
z[%%] + + = [%] + 
since E[arxrbr] = 0 by orthogonality. If arxr were included in the model, it would 
thus reduce E[e =] by E = = = = 
Z[apxp]. 
[aexr] , so we wish to choe x r to maxize Un- 
fortunately, we have no drect meurement of a r. 
3 METHODS 
Although E[arxr ] cannot be measured directly, Sanger (1991) suggests choosing x e 
to maximize E[e2xr 2] instead, which is directly measurable. Moreover, note that 
'[':'e'"e] + + 
2 4 
= 
and thus E[e2x2_] is related to the desired but unknown value 2 2 
E[aexe]. Perhaps 
better would be to use 
E[e,xr2 ] , 4 
__ E[apxp] 
which can be thought of as the regression of (ax)x r against x r. 
More recently, (Sutton and Matheus 1991) suggest using the regression coefficients 
of e  against x for all i as the bis for comparison. The regression coefficients wi 
are called "potentials", and lead to a linear approximation of the squared error: 
2 
i=1 
Iterative Construction of Sparse Polynomial Approximations 1067 
If a new term arx r were included in the model of f, then the squared error would 
2 Thus 
be bp  which is orthogonal to any polynomial in xrP and in particular to 
2 in (1) would be zero after inclusion of arx r, and wrE[xr ] is an 
the coefficient of x r 
approximation to the decrease in mean-squared error E[e ] - E[b] which we can 
expect from inclusion of arx r. We thus choose x r by maximizing wrE[x]. 
This procedure is a form of look-ahead which allows us to predict the utility of a 
high-order term apXp without actually including it in the regression. This is perhaps 
most useful when the term is predicted to make only a small contribution for the 
optimal at, because in this case we can drop from consideration any new features 
that include x r. 
We can choose a different variable xq similarly, and test the usefulness of incorporat- 
ing the product XpXq by computing a "joint potential" wpq which is the regression of 
the squared error against the model including a new term  2 
xpxq. The joint potential 
attempts to predict the magnitude of the term    
We now use this method to choose a single new feature xpxq to include in the model. 
For all pairs xixj such that xi and xj individually have high potentials, we perform 
a third regression to determine the joint potentials of the product terms xixj. Any 
term with a high joint potential is likely to participate in f. We choose to include the 
new term xrx q with the largest joint potential. In the network model, this results in 
the construction of a new unit that computes the product of x r and xq, as in figure 
2. The new unit is incorporated into the regression, and the resulting error e will 
be orthogonal to this unit and all previous units. Iteration of this technique leads 
to the successive addition of new regression terms and the successive decrease in 
mean-squared error E[e=]. The process stops when the residual mean-squared error 
drops below a chosen threshold, and the final model consists of a sparse polynomial 
in the original inputs. 
We have implemented this algorithm both in a non-iterative version that computes 
coefficients and potentials based on a fixed data set, and in an iterative version that 
uses the LMS algorithm (Widrow and Hoff 1960) to compute both coefficients and 
potentials incrementally in response to continually arriving data. In the iterative 
version, new terms are added at fixed intervals and are chosen by maximizing over 
the potentials approximated by the LMS algorithm. The growing polynomial is 
efficiently represented as a tree-structure, as in (Sanger 1991a). 
Although the algorithm involves three separate regressions, each is over only O(n) 
terms, and thus the iterative version of the algorithm is only of O(n) complexity 
per input pattern processed. 
4 RELATION TO OTHER ALGORITHMS 
Approximation of functions over a fixed monomial basis is not a new technique 
(Gabor 1961, for example). However, it performs very poorly for high-dimensional 
input spaces, since the set of all monomials (even of very low order) can be pro- 
hibitively large. This has led to a search for methods which allow the generation of 
sparse polynomials. A recent example and bibliography are provided in (Grigoriev 
e! al. 1990), which describes an algorithm applicable to finite fields (but not to 
1068 Sanger, Sutton, and Matheus 
Figure 3: Products of hidden units in a sigmoidal feedforward network lead to a 
polynomial in the hidden units themselves. 
real-valued random variables). 
The GMDH algorithm (Ivakhnenko 1971, Ikeda et al. 1976, Barron et al. 1984) 
incrementally adds new terms to a polynomial by forming a second (or higher) 
order polynomial in 2 (or more) of the current terms, and including this polynomial 
as a new term if it correlates with the error. Since GMDH does not use look-ahead, 
it risks avoiding terms which would be useful at future steps. For example, if the 
polynomial to be approximated is xyz where all three variables are independent, 
then no polynomial in x and y alone will correlate with the error, and thus the 
term xy may never be included. However, x2y 2 does correlate with x2yz , so 
the look-ahead algorithm presented here would include this term, even though the 
error did not decrease until a later step. Although GMDH can be extended to 
test polynomials of more than 2 variables, it will always be testing a finite-order 
polynomial in a finite number of variables, so there will always exist target functions 
which it will not be able to approximate. 
Although look-ahead avoids this problem, it is not always useful. For practical 
purposes, we may be interested in the best Nth-order approximation to a function, 
so it may not be helpful to include terms which participate in monomials of order 
greater than N, even if these monomials would cause a large decrease in error. 
For example, the best 2nd-order approximation to x  q- y000 q- z000 may be x , 
even though the other two terms contribute more to the error. In practice, some 
combination of both infinite look-ahead and GMDH-type heuristics may be useful. 
5 APPLICATION TO OTHER STRUCTURES 
These methods have a natural application to other network structures. The inputs 
to the polynomial network can be sinusolds (leading to high-dimensional Fourier 
representations), Gaussians (leading to high-dimensional Radial Basis Functions) 
or other appropriate functions (Sanger 1991a, Sanger 1991b). Polynomials can 
Iterative Construction of Sparse Polynomial Approximations 1069 
even be applied with sigmoidal networks as input, so that 
xi = a ( si.iz.i) 
where the z/'s are the original inputs, and the sij's are the weights to a sigmoidal 
hidden unit whose value is the polynomial term xi. The last layer of hidden units 
in a multilayer network is considered to be the set of input features xi to a linear 
output unit, and we can compute the potentials of these features to determine the 
hidden unit x r that would most decrease the error if arx r were included in the 
model (for the optimal polynomial at). But a r can now be approximated using a 
subnetwork of any desired type. This subnetwork is used to add a new hidden unit 
&rxr that is the product of x r with the subnetwork output r, as in figure 3. 
In order to train the &r subnetwork iteratively using gradient descent, we need to 
compute the effect of changes in &r on the network error  = E[(.f- ])2]. We have 
0 
= -2El(y- 
where sa is the weight from the new hidden unit to the output. Without loss of 
generality we can set sa,, = 1 by including this factor within r. Thus the error 
term for iteratively training the subnetwork &r is 
which can be used to drive a standard backpropagation-type gradient descent al- 
gorithm. This gives a method for constructing new hidden nodes and a learning 
algorithm for training these nodes. The same technique can be applied to deeper 
layers in a multilayer network. 
6 EXAMPLES 
We have applied the algorithm to approximation of known polynomials in the pres- 
ence of irrelevant noise variables, and to a simple image-processing task. 
Figure 4 shows the results of applying the algorithm to 200 samples of the polyno- 
mial 2 + 3xxx2 + 4xax4x with 4 irrelevant noise variables. The algorithm correctly 
finds the true polynomial in 4 steps, requiring about 5 minutes on a Symbolics Lisp 
Machine. Note that although the error did not decrease after cycle 1, the term x4x 
was incorporated since it would be useful in a later step to reduce the error as part 
of xax4x in cycle 2. 
The image processing task is to predict a pixel value on the succeeding scan line 
from a 2x5 block of pixels on the preceding 2 scan lines. If successful, the resulting 
polynomial can be used as part of a DPCM image coding strategy. The network 
was trained on random blocks from a single face image, and tested on a different 
image. Figure 5 shows the original training and test images, the pixel predictions, 
and remaining error . Figure 6 shows the resulting 55-term polynomial. Learning 
this polynomial required less than 10 minutes on a Sun Sparcstation 1. 
1070 Sanger, Sutton, and Matheus 
200 samples of y ---- 2 q- 3XlX 2 q- 4x3x4x $ 
with 4 additional irrelevant inputs, x6-x 9 
Original MSE: 1.0 
Cycle 1: 
MSE: 0.967 
Terms: X 1 X 2 X 3 X 4 
Coeffs: -0.19 0.14 0.24 0.31 
Potentials: 0.22 0.24 0.25 0.32 
Top Pairs: ($ 4) (5 3) (4 3) (4 4) 
New Term: X10 =X4X $ 
Cycle 2: 
MSE: 0.966 
Terms: X 1 X 2 X 3 X 4 
Coeffs: -0.19 0.14 0.24 0.30 
Potentials: 0.25 0.22 0.25 0.05 
Top Pairs: (10 3) (10 1) (10 2) (10 10) 
New Term: Xll =X10X 3 =X3X4X 5 
Cycle 3: 
MSE: 0.349 
Terms: X 1 X 2 X 3 X 4 
Coeffs: 0.04 -0.26 0.09 0.37 
Potentials: 0.52 0.59 0.03 0.02 
Top Pair,: (2 1) (2 9) (2 2) (1 9) 
New Term: X12 =X 1X 2 
Cycle 4: 
MSE: 0.000 
Terms: X 1 X 2 X 3 X 4 
Coeffs: -0.00 -0.00 -0.00 0.00 
Solution: 
2 q- 3X 1X 2 q- 4X3X4X 5 
X$ X 6 X 7 X 8 X 9 
0.17 0.48 0.03 0.05 0.58 
0.33 0.01 0.08 0.01 0.05 
X5 X6 X7 X8 X9 X10 
0.18 0.48 0.03 0.05 0.57 0.05 
0.02 0.03 0.08 0.02 0.03 0.47 
X5 X6 X7 X8 X9 Xlo 
-0.04 0.27 0.10 0.22 0.42 -0.26 
-0.08 0.03 -0.05 -0.06 0.05 -0.05 
X 5 X6 X7 X 8 X9 X10 
-0.00 0.00 0.00 0.00 0.00 -0.00 
Xll 
4.07 
0.05 
Xll X12 
4.00 3.00 
Figure 4: A simple example of polynomial learning. 
:::::::::::::::::::::::::::: ' +:::-':- '"::'i'?:"...::::i:!:!:i:i:i::?. 
:::::::::::::::::::::: -":.' .' 8.:,i:i::::::::-" '"..:' --.:::::::::::."-:::: 
Figure 5: Original, predicted, and error images. The top row is the training image 
(RMS error 8.4), and the bottom row is the test image (RMS error 9.4). 
Iterative Construction of Sparse Polynomial Approximations 1071 
--40.1z 0 -{- --23.9z 1 -{- --5.4z 2 -{- --17.1z3- [- 
(1.1z 5 -!' 2-4z 8 -!' --1.1z 2 -!' --1-Sz 0 -!' --2.0z 1 -!' 1.3z 4 + 2.3z 6 '{- 3.1z 7 '{- --25.6)z4-{- 
( 
(--2.9x 9 + 3-Ox 8 '{- --2.9x 4 -!' --2.8x 3 '{- --2.9x 2 + --1.9x 5 -!' --6.3x 0 '{- --5.2x 1 -I' 2.5x 6 '[' 6.7x 7 '[' 1-1)z9+ 
(3-9z 8 '}' x 5 '{- 3.3z 4 '{- 1-6z 3 -!' 1 .lx 2 -{- 2.9x 6 -}- 5.0x 7 -{- 16.1)z$-l- 
--2.3x 3 + --2.12 -!' --1.6Zl -!' 1.1z 4 -!' 2.1z6 -!' 3.57 + 28.6)zS'l- 
87.1z 6 + 128.1x 7 -{- 80.5z8- {- 
( 
(--2.6z 9 + --2.4z 5 -{- --4.5z 0 -{- --3.9z 1 + 3.4x 6 + 7.3x 7 -{- --2.5)x9+ 
21.7z8 + --16-0z4 Jr --12.1z3 Jr --8.8z2 Jr 31.4)z9+ 
2.6 
Figure 6: 55-term polynomial used to generate figure 5. 
Acknowledgments 
We would like to thank Richard Brandau for his helpful comments and suggestions 
on an earlier draft of this paper. This report describes research done both at 
GTE Laboratories Incorporated, in Waltham MA, and at the laboratory of Dr. 
Emilio Bizzi in the department of Brain and Cognitive Sciences at MIT. T. Sanger 
was supported during this work by a National Defense Science and Engineering 
Graduate Fellowship, and by NIH grants 5R37AR26710 and 5R01NS09343 to Dr. 
Bizzi. 
References 
Barron R. L., Mucciardi A. N., Cook F. J., Craig J. N., Barron A. R., 1984, Adaptive 
learning networks: Development and application in the United States of algorithms 
related to GMDH, In Farlow S. J., ed., Self-Organizing Melhods in Modeling, pages 
25-65, Marcel Dekker, New York. 
Gabor D., 1961, A universal nonlinear filter, predictor, and simulator which opti- 
mizes itself by a learning process, Proc. IEE, 108B:422-438. 
Grigoriev D. Y., Karpinski M., Singer M. F., 1990, Fast parallel algorithms for 
sparse polynomial interpolation over finite fields, SIAM J. Cornpuling, 19(6):1059- 
1063. 
Ikeda S., Ochiai M., Sawaragi Y., 1976, Sequential GMDH algorithm and its ap- 
plication to river flow prediction, IEEE Trans. Syslems, Man, and Cybernetics, 
SMC-6(7):473-479. 
Ivakhnenko A. G., 1971, Polynomial theory of complex systems, IEEE Trans. 
Syslems, Man, and Cybernelics, SMC-1(4):364-378. 
Sanger T. D., 1991a, Basis-function trees as a generalization of local variable se- 
lection methods for function approximation, In Lippmann R. P., Moody J. E., 
Touretzky D. S., ed.s, Advances in Neural Informalion Processing Systems 3, pages 
700-706, Morgan Kaufmann, Proc. NIPS'90, Denver CO. 
Sanger T. D., 1991b, A tree-structured adaptive network for function approximation 
in high dimensional spaces, IEEE Trans. Neural Nelworks, 2(2):285-293. 
Sutton R. S., Matheus C. J., 1991, Learning polynomial functions by feature con- 
struction, In Proc. Eighth Intl. Workshop on Machine Learning, Chicago. 
Widrow B., Hoff M. E., 1960, Adaptive switching circuits, In IRE WESCON Cony. 
Record, Part , pages 96-104. 
