Some Approximation Properties of Projection 
Pursuit Learning Networks 
Ying Zhao Christopher O. Atkeson 
The Artificial Intelligence Laboratory 
Massachusetts Institute of Technology 
Cambridge, MA 02139 
Abstract 
This paper will address an important question in machine learning: What 
kind of network architectures work better on what kind of problems? A 
projection pursuit learning network has a very similar structure to a one 
hidden layer sigmoidal neural network. A general method based on a 
continuous version of projection pursuit regression is developed to show 
that projection pursuit regression works better on angular smooth func- 
tions than on Laplacian smooth functions. There exists a ridge function 
approximation scheme to avoid the curse of dimensionality for
approximating functions in $L^2(\phi_d)$.
1 INTRODUCTION 
Projection pursuit is a nonparametric statistical technique to find "interesting" 
low dimensional projections of high dimensional data sets. It has been used for 
nonparametric fitting and other data-analytic purposes (Friedman and Stuetzle,
1981, Huber, 1985). Approximation properties have been studied by Diaconis &
Shahshahani (1984) and Donoho & Johnstone (1989). It was first introduced into
the context of learning networks by Barron & Barron (1988). A one hidden layer
sigmoidal feedforward neural network approximates f(x) using the structure (Figure
1(a)):
$\hat{g}(x) = \sum_{j=1}^{k} \alpha_j \, \sigma(p_j \theta_j^T x + \delta_j)$ (1)
Figure 1: (a) A One Hidden Layer Feedforward Neural Network. (b) A Projection 
Pursuit Learning Network. 
where $\sigma$ is a sigmoidal function, $\theta_j$ are direction parameters with $\|\theta_j\| = 1$, and
$\alpha_j$, $p_j$, $\delta_j$ are function parameters. A projection pursuit learning network based
on projection pursuit regression (PPR) (Friedman and Stuetzle, 1981) or a ridge
function approximation scheme (RA) has a very similar structure (Figure 1(b)):
$\hat{g}(x) = \sum_{j=1}^{k} g_j(\theta_j^T x)$ (2)
where $\theta_j$ are also direction parameters with $\|\theta_j\| = 1$. The corresponding function
parameters are ridge functions $g_j$, which are arbitrary smooth functions to be learned
from the data. Since $\sigma$ is replaced by a more general smooth function $g_j$, projection
pursuit learning networks can be viewed as a generalization of one hidden layer
sigmoidal feedforward neural networks. This paper will discuss some approximation
properties of PPR:
1. Projection pursuit learning networks work better on angular smooth functions
than on Laplacian smooth functions. Here "work better" means that for fixed
complexities of hidden unit functions and a certain accuracy, fewer hidden units
are required. For the two dimensional case (d = 2), Donoho and Johnstone
(1989) show this result using equispaced directions. The equispaced directions
may not be available when d > 2. We use a set of directions generated from
zeros of orthogonal polynomials and uniformly distributed directions on a unit
sphere instead. The analysis method in D & J's paper is limited to two dimen-
sions. We apply the theory of spherical harmonics (Muller, 1966) to develop a
continuous ridge function representation of any arbitrary smooth function and
then employ different numerical integration schemes to discretize it for cases
when d > 2.
2. The curse of dimensionality can be avoided when a proper ridge function ap-
proximation is applied. Once a continuous ridge function representation is es-
tablished for any function in $L^2(\phi_d)$, a Monte Carlo type of integration scheme
can be applied which has an RMS error convergence rate $O(N^{-1/2})$, where N is
the number of ridge functions in the linear combinations. This is a similar
result to Barron's result (Barron, 1991) except that we have fewer restrictions
on the underlying function class.
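The relationship between the two structures (1) and (2) can be illustrated with a small sketch. The parameter values below are made up for illustration, and the function names are hypothetical; the only point is that choosing each ridge function as a scaled, shifted sigmoid makes the projection pursuit network reproduce the sigmoidal network.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoidal_network(x, thetas, alphas, ps, deltas):
    """Structure (1): sum_j alpha_j * sigma(p_j * theta_j^T x + delta_j)."""
    return sum(a * sigmoid(p * dot(th, x) + dl)
               for th, a, p, dl in zip(thetas, alphas, ps, deltas))

def ppr_network(x, thetas, ridge_functions):
    """Structure (2): sum_j g_j(theta_j^T x) with arbitrary smooth g_j."""
    return sum(g(dot(th, x)) for th, g in zip(thetas, ridge_functions))

# Illustrative (made-up) parameters; both directions have unit norm.
thetas = [(1.0, 0.0), (0.6, 0.8)]
alphas, ps, deltas = [2.0, -1.0], [3.0, 0.5], [0.1, -0.2]

# Choosing g_j(t) = alpha_j * sigma(p_j * t + delta_j) recovers structure (1).
ridges = [lambda t, a=a, p=p, dl=dl: a * sigmoid(p * t + dl)
          for a, p, dl in zip(alphas, ps, deltas)]

x = (0.3, -1.2)
print(sigmoidal_network(x, thetas, alphas, ps, deltas))
print(ppr_network(x, thetas, ridges))
```

Any other smooth one dimensional function can be passed in place of the sigmoid-based ridge functions, which is the extra freedom of structure (2).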
Figure 2: (a) A radial basis element $J_{014}$. (b) A harmonic basis element $J_{810}$.
2 SMOOTHNESS CLASSES AND A L2(4d) BASIS 
We use $L^2(\phi_d)$ as our underlying function space with Gaussian measure $\phi_d =
(\frac{1}{2\pi})^{d/2} e^{-|x|^2/2}$ and norm $\|f\|^2 = \int_{R^d} f^2 \phi_d \, dx$. The smoothness classes characterize the rate of
convergence. Let $\Delta_d$ be the Laplacian operator and $\Lambda_d$ be the Laplacian-Beltrami
operator (Muller, 1966). The smoothness classes can be defined as:
Definition 1 The function $f \in L^2(\phi_d)$ will be said to have Cartesian smoothness
of order p if it has p derivatives and these derivatives are all in $L^2(\phi_d)$. It will
be said to have angular smoothness of order q if $\Lambda_d^q f \in L^2(\phi_d)$. It will be said
to have Laplacian smoothness of order r if $\Delta_d^r f \in L^2(\phi_d)$. Let $\mathcal{F}_p$ be the class of
functions with Cartesian smoothness p, $\mathcal{A}_p^q$ be the class of functions with Cartesian
smoothness p and angular smoothness q, and $\mathcal{L}_p^r$ be the class of functions with
Cartesian smoothness p and Laplacian smoothness r.
We derive an orthogonal basis in $L^2(\phi_d)$ from the eigenfunctions of a self-adjoint
operator. The basis element is defined as:
$J_{njm}(x) = \gamma_m \, r^n S_{nj}(\xi) \, L_m^{\nu}(r^2/2)$ (3)
where $x = r\xi$, $n = 0, \ldots, \infty$, $m = 0, \ldots, \infty$, $j = 1, \ldots, N(d,n)$, $\gamma_m = (-2)^m m!$, $\nu =
n + \frac{d-2}{2}$. $S_{nj}(\xi)$ are linearly independent spherical harmonics of degree n in d
dimensions (Muller, 1966). $L_m^{\nu}(\cdot)$ is a Laguerre polynomial. The advantage of
the basis comes from its representation as a product of a spherical harmonic and
a radial polynomial. Specifically, $J_{0jm}$ is a radial polynomial for n = 0 and $J_{nj0}$
is a harmonic polynomial for m = 0. Figure 2(a),(b) show a radial basis element
and a harmonic basis element when n + 2m = 8. The basis element $J_{njm}$ has an
orthogonality:
,i,i.,i 2n+2m-I d 
r(.=(r,().(.,()) = _.._ .t (" +" + )V(" + ) (4) 
where E denotes expectation with respect to $\phi_d$. Since it is a basis in $L^2(\phi_d)$, any
function $f \in L^2(\phi_d)$ has an expansion in terms of basis elements $J_{njm}$:
$f = \sum_{n,j,m} c_{njm} J_{njm}$ (5)
The circular harmonic $e^{in\theta}$ is a special case of the spherical harmonic $S_{nj}(\xi)$. In two
dimensions, d = 2, N(d,n) = 2 and $x = (r\cos\theta, r\sin\theta)$. The spherical harmonic
$S_{nj}(\xi)$ can be reduced to the following: $S_{n1}(\xi) = \frac{1}{\sqrt{\pi}}\cos n\theta$, $S_{n2}(\xi) = \frac{1}{\sqrt{\pi}}\sin n\theta$,
which is the circular harmonic.
Smoothness classes can also be defined qualitatively from expansions of functions
in terms of basis elements $J_{njm}$ with coefficients $c_{njm}$. Since $\|f\|^2 = \sum c_{njm}^2 \|J_{njm}\|^2$, one can think of
$p_{njm}(f) = c_{njm}^2 \|J_{njm}\|^2 / \|f\|^2$ as representing the distribution of energy in f among the
different modes of oscillation $J_{njm}$. If f is Cartesian smooth, $p_{njm}(f)$ peaks around
small n + 2m. If f is angular smooth, $p_{njm}(f)$ peaks around small n. If f is
Laplacian smooth, $p_{njm}(f)$ peaks around small m. To explain why PPR works
better on angular smooth functions than on Laplacian smooth functions, we first
examine how to represent these $L^2(\phi_d)$ basis elements systematically in terms of
ridge functions and then use the expansion (5) to derive an error bound of RA for
general smooth functions.
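The energy distribution over modes can be computed directly once expansion coefficients and squared norms of the basis elements are available. The sketch below uses hypothetical coefficients and unit squared norms purely for illustration; the function names are not from the paper.

```python
def energy_distribution(coeffs, sq_norms):
    """Share of ||f||^2 carried by each mode: c^2 * ||J||^2 / ||f||^2,
    keyed by the index triple (n, j, m)."""
    energy = {k: c * c * sq_norms[k] for k, c in coeffs.items()}
    total = sum(energy.values())
    return {k: e / total for k, e in energy.items()}

def marginal(p, index):
    """Sum the energy over all modes sharing n (index 0) or m (index 2)."""
    out = {}
    for key, value in p.items():
        out[key[index]] = out.get(key[index], 0.0) + value
    return out

# Hypothetical coefficients with unit squared norms, for illustration only.
coeffs = {(0, 1, 0): 1.0, (0, 1, 1): 0.8, (0, 1, 2): 0.6, (2, 1, 0): 0.1}
sq_norms = {k: 1.0 for k in coeffs}

p = energy_distribution(coeffs, sq_norms)
print(marginal(p, 0))  # energy by angular index n
print(marginal(p, 2))  # energy by radial index m
```

In this made-up example almost all of the energy sits at n = 0 while it is spread over several values of m, which is the pattern described above for an angular smooth function.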
3 CONTINUOUS RIDGE FUNCTION SCHEMES 
There exists a continuous ridge function representation for any function $f(x) \in
L^2(\phi_d)$ which is an integral of ridge functions through all possible directions:
$f(x) = \int_{\Omega_d} g(x^T\eta, \eta)\, d\omega_d(\eta)$ (6)
This works intuitively because any object is determined by any infinite set of radio-
graphs. More precisely, any function $f(x) \in L^2(\phi_d)$ can be approximated arbitrarily
well by a linear combination of ridge functions $\sum_k g(x^T\eta_k, \eta_k)$ provided infinitely
many combination units (Jones, 1987). As $k \to \infty$, we have (6). The natural
discrete approximation to (6) has the form $f_N(x) = \sum_{j=1}^{N} w_j\, g(x^T\eta_j, \eta_j)$, which
becomes the usual PPR (2). We proved a continuous ridge function representation
of basis elements $J_{njm}$, which is shown in Lemma 1.
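Lemma 1 The continuous ridge function representation of $J_{njm}$ is:
$J_{njm}(x) = \lambda_{nmd} \int_{\Omega_d} H_{n+2m}(x^T\eta)\, S_{nj}(\eta)\, d\omega_d(\eta)$ (7)
where $\lambda_{nmd}$ is a constant and $H_{n+2m}(\cdot)$ is a Hermite polynomial.
Therefore any function $f \in L^2(\phi_d)$ has a continuous ridge function representation
(6) with
$g(x^T\eta, \eta) = \sum_{n,j,m} c_{njm}\, \lambda_{nmd}\, H_{n+2m}(x^T\eta)\, S_{nj}(\eta)$ (8)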
Gaussian quadrature and Monte Carlo integration schemes can be used to discretize
(6).
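Lemma 1 can be checked numerically in the simplest setting. The sketch below assumes d = 2, the probabilists' Hermite polynomial $He_2(t) = t^2 - 1$, and unnormalized angular factors 1 and $\cos 2\theta$, so the constants differ from $\lambda_{nmd}$; these conventions are assumptions, not taken from the paper. Integrating the Hermite ridge function over all directions then gives a radial polynomial (n = 0, m = 1) or a harmonic polynomial (n = 2, m = 0).

```python
import math

def he2(t):
    """Probabilists' Hermite polynomial of degree 2 (assumed convention)."""
    return t * t - 1.0

def ridge_integral(x, angular_weight, n_dirs=64):
    """Integral over the unit circle of He_2(x . eta) * angular_weight(theta),
    eta = (cos theta, sin theta), using equispaced angles. The integrand is a
    trigonometric polynomial of degree at most 4, so the rule is exact for
    n_dirs > 4 up to rounding."""
    h = 2.0 * math.pi / n_dirs
    total = 0.0
    for k in range(n_dirs):
        theta = k * h
        proj = x[0] * math.cos(theta) + x[1] * math.sin(theta)
        total += he2(proj) * angular_weight(theta)
    return total * h

x = (1.5, 0.5)
r2 = x[0] ** 2 + x[1] ** 2

radial = ridge_integral(x, lambda th: 1.0)                  # n = 0, m = 1
harmonic = ridge_integral(x, lambda th: math.cos(2 * th))   # n = 2, m = 0

print(radial, math.pi * (r2 - 2.0))                         # radial polynomial
print(harmonic, 0.5 * math.pi * (x[0] ** 2 - x[1] ** 2))    # harmonic polynomial
```

The first result depends on x only through $|x|^2$, and the second is proportional to $x_1^2 - x_2^2$, which satisfies Laplace's equation.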
4 GAUSSIAN QUADRATURE 
Since $\int_{\Omega_d} g(x^T\eta, \eta)\, d\omega_d(\eta) = \int_{-1}^{1} \int_{\Omega_{d-1}} g(x^T\eta, \eta)\,(1 - t_{d-1}^2)^{\frac{d-3}{2}}\, dt_{d-1}\, d\omega_{d-1}(\eta_{d-1})$,
simple product rules using Gaussian quadrature formulae can be used here. $t_{ij}$, $i =
d-1, \ldots, 1$, $j = 1, \ldots, n$ are zeros of orthogonal polynomials with weights $(1 - t^2)^{\frac{i-2}{2}}$.
$N = n^{d-1}$ points on the unit sphere $\Omega_d$ can be formed using $t_{ij}$ through
(9)
Figure 3: Directions (a) for a radial polynomial (b) for a harmonic polynomial
If $g(x^T\eta, \eta)$ is a polynomial of degree at most 2n - 1 (in terms of $t_1, \ldots, t_{d-1}$), then
$N = n^{d-1}$ points (directions) are sufficient to represent the integral exactly. This
can be demonstrated with two examples by taking d = 3.
Example 1: a radial function
 
(10)
d = 3, n = 3. $n^2 = 9$ directions from (9) are sufficient to represent this polynomial
with $t_1 = 0, \frac{\sqrt{3}}{2}, -\frac{\sqrt{3}}{2}$ (zeros of a degree 3 Chebyshev polynomial). More directions
are needed to represent a harmonic function with exactly the same number of terms
of monomials but with different coefficients.
Exple 2:  hmoc function 
where S4y(Q) = (35( 4 - 3 3 + 3),Q = (3 + . n = 5, n 
25 ections om m (g) e sucient to reprent the polyno th t) = 
0,0.90618,-0.90618,0.537,-0.53847 d tx = c*,j = 1,...,5. The 
tbution of the dectio on a ut sphere e shown  Figure 3(a) d (b). 
In general, $N = (n + m + 1)^{d-1}$ directions are sufficient to represent $J_{njm}$ exactly
by using zeros of orthogonal polynomials. If p = n + 2m (the degree of the
basis) is fixed, $N = (p - m + 1)^{d-1} = (\frac{p}{2} + 1)^{d-1}$ is minimized when n = 0, which
corresponds to the radial basis element. N is maximized when m = 0, which is the
harmonic element. Using definitions of smoothness classes in Section 2, we can
show that ridge function approximation works better on angular smooth functions.
The basic result is as follows:
Theorem 1 Let $f \in \mathcal{A}_p^q$ and let $R_N f$ denote a sum of ridge functions which best
approximates f by using a set of directions generated by zeros of orthogonal polynomials.
Then
Er = IIRr f ' 
- l[I.,, < BN-- + BN-'-r (12) 
This error bound says that ridge function approximation does take advantage of 
angular smoothness. Radial functions are the most angular smooth functions with 
q = +oo and harmonic functions are the least angular smooth functions when the 
Cartesian smoothness p is fixed. Therefore ridge function approximation works 
better on angular smooth functions than on Laplacian smooth functions. Radial 
and harmonic functions are the two extreme cases. 
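The direction counts behind these two extreme cases can be tabulated with a few lines. This is a minimal sketch of the count $N = (n + m + 1)^{d-1}$ stated before Theorem 1; the function names are hypothetical.

```python
def directions_needed(n, m, d):
    """N = (n + m + 1)^(d - 1) directions for the basis element J_njm."""
    return (n + m + 1) ** (d - 1)

def counts_for_degree(p, d):
    """Direction counts for every (n, m) with n + 2m = p."""
    return {(p - 2 * m, m): directions_needed(p - 2 * m, m, d)
            for m in range(p // 2 + 1)}

print(counts_for_degree(4, 3))
# {(4, 0): 25, (2, 1): 16, (0, 2): 9}
```

For degree p = 4 in d = 3 the harmonic element (m = 0) needs 25 directions and the radial element (n = 0) needs 9, matching the two examples.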
5 UNIFORMLY DISTRIBUTED DIRECTIONS ON $\Omega_d$
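Instead of using directions from zeros of orthogonal polynomials, N uniformly
distributed directions on $\Omega_d$ are an alternative way of generalizing equispaced directions.
This is a Monte Carlo type of integration scheme on $\Omega_d$.
To approximate the integral (7), N uniformly distributed directions $\eta_1, \ldots, \eta_N$
on $\Omega_d$ drawn from the density $f(\eta) = 1/\omega_d$ on $\Omega_d$ are used:
$J^N_{njm}(x) = \frac{\lambda_{nmd}\, \omega_d}{N} \sum_{k=1}^{N} H_{n+2m}(x^T\eta_k)\, S_{nj}(\eta_k)$ (13)
The mean value for $J^N_{njm}(x)$ is
$m_N(x) = \lambda_{nmd} \int_{\Omega_d} H_{n+2m}(x^T\eta)\, S_{nj}(\eta)\, d\omega_d(\eta) = J_{njm}(x)$ (14)
The variance is
$\sigma_N^2(x) = \frac{\sigma^2(x)}{N}$ (15)
where $\sigma^2(x) = \frac{1}{\omega_d} \int_{\Omega_d} [\lambda_{nmd}\, \omega_d\, H_{n+2m}(x^T\eta)\, S_{nj}(\eta) - J_{njm}(x)]^2\, d\omega_d(\eta)$. Therefore
$J^N_{njm}(x)$ approximates $J_{njm}(x)$ with a rate $\sigma(x)/\sqrt{N}$. The difference between a
radial basis and a harmonic basis is $\sigma(x)$. Let us average $\sigma(x)$ with respect to $\phi_d$.
For a fixed n + 2m = p, $\|\sigma(x)\|^2$ is minimized at n = 0 (a radial element) and
maximized at m = 0 (a harmonic element).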
The same justification e be done for a gener function f  L'(Od) with (6) and 
(8). A  scheme is: 
mv(x) = f(x). rv(x ) = -, r2(x) = fnVdg2(x',.,)d(,)- f2(x) nd 
11()11 
= c,mllJmll,am. Since a,jm is sm when n is sm, large when m 
is sm d recM1 the distribution P,jm(f) in Section 3, IIl?/11fll is small when 
f is smooth. Among these smooth functions, if f is angular smooth, 
is smer th that  f is Laplian smooth. The RMS error convergence rate 
 = llll is consequently smer for f being angul smooth than for f being 
Laplian smooth. But both rates e O(N-) no matter what cls the underlying 
function belon to. The difference h the constant which is related to the distribu- 
tion of energy in f ong the different modes of osclations (angular or Lapisinn). 
The radi and harmonic functions are two extremes. 
6 THE CURSE OF DIMENSIONALITY IN RA 
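Generally, if N directions $\eta_1, \eta_2, \ldots, \eta_N$ on $\Omega_d$ drawn from any distribution $p(\eta)$
on the sphere $\Omega_d$ are used to approximate (6),
$f_N(x) = \frac{1}{N} \sum_{k=1}^{N} \frac{g(x^T\eta_k, \eta_k)}{p(\eta_k)}$ (17)
then $m_N = f(x)$ and $\sigma_N^2 = \frac{\sigma^2(x)}{N}$, where $\sigma^2(x) = \int_{\Omega_d} \frac{1}{p(\eta)}\, g^2(x^T\eta, \eta)\, d\omega_d(\eta) - f^2(x)$. And
$\|\sigma(x)\|^2 = \int_{\Omega_d} \frac{1}{p(\eta)}\, \|g(x^T\eta, \eta)\|^2\, d\omega_d(\eta) - \|f\|^2 = c_f$ (18)
Then
$\|f_N(x) - f(x)\|^2 \le \frac{\|\sigma(x)\|^2}{N} = \frac{c_f}{N}$ (19)
That is, $f_N(x) \to f(x)$ with a rate $O(N^{-1/2})$. Equation (19) shows that there is no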
curse of dimensionality if a ridge function approximation scheme (17) is used for
f(x). The same conclusion can be drawn when sigmoidal hidden unit function neu-
ral networks are applied to Barron's class of underlying functions (Barron, 1991).
But our function class here is the function class that can be represented by a con-
tinuous ridge function (6), which is a much larger function class than Barron's. Any
function $f \in L^2(\phi_d)$ has a representation (6) (Section 3). Therefore, for any func-
tion $f \in L^2(\phi_d)$, there exists a node function $g(x^T\eta, \eta)$ and a related ridge function
approximation scheme (17) to approximate f(x) with a rate $O(N^{-1/2})$, which has no
curse of dimensionality. In other words, if we are allowed to choose a node function
$g(x^T\eta, \eta)$ according to the property of data, which is the characteristic of PPR, then
a ridge function approximation scheme can avoid the curse of dimensionality. That
is a generalization of Barron's result that the curse of dimensionality goes away if
certain types of node function (e.g., cos and $\sigma$) are considered.
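Scheme (17) with a non-uniform direction density, and the resulting $N^{-1/2}$ RMS rate, can be sketched in d = 2. The piecewise-constant density and the node function below are illustrative assumptions chosen so that sampling is easy and the target value $|x|^2 - 2$ is known in closed form; none of the names come from the paper.

```python
import math
import random

def sample_angle(rng):
    """Sample theta from a piecewise-constant density on [0, 2*pi):
    probability 0.75 on the upper half circle, 0.25 on the lower half."""
    if rng.random() < 0.75:
        return rng.uniform(0.0, math.pi)
    return rng.uniform(math.pi, 2.0 * math.pi)

def density(theta):
    """Density p(eta) matching sample_angle."""
    return 0.75 / math.pi if theta < math.pi else 0.25 / math.pi

def node_function(x, theta):
    """g(x . eta, eta) = He_2(x . eta) / pi; its integral over the
    circle is |x|^2 - 2 (illustrative choice)."""
    t = x[0] * math.cos(theta) + x[1] * math.sin(theta)
    return (t * t - 1.0) / math.pi

def ridge_estimate(x, n_dirs, rng):
    """Scheme (17): (1/N) * sum_k g(x . eta_k, eta_k) / p(eta_k)."""
    total = 0.0
    for _ in range(n_dirs):
        theta = sample_angle(rng)
        total += node_function(x, theta) / density(theta)
    return total / n_dirs

def rms_error(x, n_dirs, trials, rng):
    """Root mean square error of the estimate over repeated trials."""
    target = x[0] ** 2 + x[1] ** 2 - 2.0
    sq = [(ridge_estimate(x, n_dirs, rng) - target) ** 2 for _ in range(trials)]
    return math.sqrt(sum(sq) / trials)

rng = random.Random(1)
x = (1.5, 0.5)
e1 = rms_error(x, 100, 400, rng)
e4 = rms_error(x, 400, 400, rng)
print(e1, e4, e1 / e4)  # quadrupling N should roughly halve the RMS error
```

Changing the density changes the constant in the error but not the rate, which is the distinction drawn between $c_f$ and the exponent of N.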
The smoothness of an underlying function determines the size of the constant $c_f$. As
shown in the previous section, if $p(\eta) = 1/\omega_d$ (i.e., uniformly distributed directions),
then angular smooth functions have smaller $c_f$ than Laplacian smooth functions do.
Choosing different $p(\eta)$ does not change this conclusion. But a properly chosen $p(\eta)$
reduces $c_f$ in general. If f(x) is smooth enough, the node function $g(x^T\eta, \eta)$ can be
computed from the Radon transform $R_\eta f$ of f in the direction $\eta$, which is defined as
$R_\eta f(s) = \int_{\eta^\perp} f(s\eta + y)\, dy$ (20)
and we proved: $g(x^T\eta, \eta) = F^{-1}(F_\eta(t)\, |t|^{d-1})\, |_{s = x^T\eta}$, where $F_\eta(t)$ is the Fourier
transform of $R_\eta f(s)$ and $F^{-1}$ denotes the inverse Fourier transform. In practice,
learning $g(x^T\eta, \eta)$ is usually replaced by a smoothing step which seeks a one di-
mensional function of $x^T\eta$ that best fits the residual in this direction (Friedman and
Stuetzle, 1981, Zhao and Atkeson, 1991).
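The smoothing step can be illustrated with a minimal sketch. It assumes the direction is already fixed and uses a Gaussian-kernel weighted average as a stand-in smoother; this is not the smoother of Friedman and Stuetzle, and the data, bandwidth and function names are all made up for illustration.

```python
import math
import random

def project(X, theta):
    """Project each input point onto the direction theta."""
    return [sum(a * b for a, b in zip(x, theta)) for x in X]

def kernel_smoother(t_train, r_train, bandwidth):
    """Return a one dimensional function fitted to (t, residual) pairs by
    Gaussian-kernel weighted averaging (a stand-in for the smoother)."""
    def g(t):
        w = [math.exp(-0.5 * ((t - ti) / bandwidth) ** 2) for ti in t_train]
        return sum(wi * ri for wi, ri in zip(w, r_train)) / sum(w)
    return g

def fit_ridge_term(X, residual, theta, bandwidth=0.2):
    """One smoothing step: learn g so that g(theta^T x) fits the residual."""
    return kernel_smoother(project(X, theta), residual, bandwidth)

rng = random.Random(0)
theta = (0.6, 0.8)  # fixed unit direction (assumed known here)
X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(300)]
y = [math.sin(2.0 * t) for t in project(X, theta)]  # synthetic ridge target

g = fit_ridge_term(X, y, theta)
fitted = [g(t) for t in project(X, theta)]
mse_before = sum(v * v for v in y) / len(y)
mse_after = sum((v - f) ** 2 for v, f in zip(y, fitted)) / len(y)
print(mse_before, mse_after)
```

In a full projection pursuit fit this step would alternate with a search over directions and be repeated on the remaining residual for each new hidden unit.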
7 CONCLUSION 
As we showed by discretizing a continuous ridge function representation, PPR works
better on angular smooth functions than on Laplacian smooth functions. PPR
can avoid the curse of dimensionality by learning node functions from data.
Acknowledgments 
Support was provided under Air Force Office of Scientific Research grant AFOSR- 
89-0500. Support for CGA was provided by a National Science Foundation Presi- 
dential Young Investigator Award, an Alfred P. Sloan Research Fellowship, and the 
W. M. Keck Foundation Associate Professorship in Biomedical Engineering. Spe-
cial thanks go to Prof. Zhengfang Zhou and Prof. Peter Huber of the Math Dept. at
MIT, who provided useful discussions.
References 
Barron, A. R. and Barron, R. L. (1988) "Statistical Learning Networks: A 
Unifying View." Computing Science and Statistics: Proceedings of 20th Symposium
on the Interface. Ed Wegman, editor, Amer. Statist. Assoc., Washington, D.C., 
192-203. 
Barron, A. R. (1991) "Universal Approximation Bounds for Superpositions of 
A Sigmoidal Function". TR. 58. Dept. of Stat., Univ. of Illinois at Urbana-
Champaign. 
Donoho, D. L. and Johnstone, I. (1989). "Projection-based Approximation, 
and Duality with Kernel Methods". Ann. Statist., 17, 58-106. 
Diaconis, P. and Shahshahani, M. (1984) "On Non-linear Functions of Linear 
Combinations", SIAM J. Sci. Stat. Comput. 5, 175-191.
Friedman, J. H. and Stuet.le, W. (1981) "Projection Pursuit Regression". J. 
Arner. Star. Assoc., 76,817-823. 
Huber, P. J. (1985) "Projection Pursuit (with discussion)", Ann. Statist., 13, 
435-475. 
Jones, L. (1987) "On A Conjecture of Huber Concerning the Convergence of Pro- 
jection Pursuit Regression". Ann. Statist., 15, 880-882. 
Muller, C. (1966), Spherical Harmonics. Lecture Notes in Mathematics, no. 17.
Zhao, Y. and C. G. Atkeson (1991) "Projection Pursuit Learning", Proc. 
IJCNN-91-SEATTLE. 
