Splines, Rational Functions and Neural Networks 
Robert C. Willimnson 
Department of Systems Engineering 
Australian National University 
Canberra, 2601 
Australia 
Peter L. Bartlett 
Department of Electrical Engineering 
University of Queensland 
Queensland, 4072 
Australia 
Abstract 
Connections between spline approximation, approximation with rational 
functions, and feedforward neural networks are studied. The potential 
improvement in the degree of approximation in going from single to two 
hidden layer networks is examined. Some results of Bitman and So]omjak 
regarding the degree of approximation achievable when knot positions are 
chosen on the basis of the probability distribution of examples rather than 
the function values are extended. 
1 INTRODUCTION 
Feedforward neural networks have been proposed as parametrized representations 
suitable for nonlinear regression. Their approximation theoretic properties are still 
not well understood. This paper shows some connections with the more widely 
known methods of spline and rational approximation. A result due to Vitushkin is 
applied to determine the relative improvement in degree of approximation possible 
by having more than one hidden layer. Furthermore, an approximation result rel- 
evant to statistical regression originally due to Birman and So]omjak for Sobo]ev 
space approximation is extended to more genera] Besov spaces. The two main 
results are theorems 3.1 and 4.2. 
1040 
Splines, Rational Functions and Neural Networks 1041 
2 SPLINES AND RATIONAL FUNCTIONS 
The two most widely studied nonlinear approximation methods are splines with free 
knots and rational functions. It is natural to ask what connection, if any, these have 
with neural networks. It is already known that splines with free knots and rational 
functions are closely related, as Petrushev and Popov's remarkable result shows: 
Theorem 2.1 ([10, chapter 8]) Let 
R,(f)p :-- inf(llf- flip: r a rational function of degree n} 
S (f)p := inf{[lf- slip: s a spline of degree k- 1 with n - 1 free knots}. 
If f  Lp[a,b],oo < a < b < oo, l < p < , k  1, O < a < k, then 
Rn(f)p = O(n -) if and only if S(f)p = O(n-). 
In both cases the efficacy of the methods can be understood in terms of their 
flexibility in partitioning the domain of definition: the partitioning amounts to a 
"balancing" of the error of local linear approximation [4]. 
There is an obvious connection between single hidden layer neural networks and 
splines. For example, replacing the sigmoid (1 d- e-) - by the piecewise linear 
function ([x + 11- Ix - 11)/2 results in networks that are in one dimension splines, 
and in d dimensions can be written in "Canonical Piecewise Linear" form [3]: 
f(x) := a + b T x +  ci[ctiT x --/i] 
i--1 
defines f: d _ , where a, ci,/3i   and b, ai  d. Note that canonical piece- 
wise linear representations are unique on a compact domain if we use the form 
+ [ct/Tx 1[. Multilayer piecewise linear nets are not generally canon- 
f(x) :- z_i=x ci - 
ical piecewise linear: Let g(x) := Ixd-y-ll-lxd-yd- ll-lx-yd- ll-]x-y- lld-xd-y. 
Then g(.) is canonical piecewise linear, but [g(x)l (a simple two-hidden layer net- 
work) is not. 
The connection between certain single hidden layer networks and rational functions 
has been exploited in [13]. 
3 COMPOSITIONS OF RATIONAL FUNCTIONS 
There has been little effort in the nonlinear approximation literature in understand- 
ing nonlinearly parametrized approximation classes "more complex" than splines or 
rational functions. Multiple hidden layer neural networks are in this more complex 
class. As a first step to understanding the utility of these representations we now 
consider the degree of approximation of certain smooth function classes via ratio- 
nal functions or compositions of rational functions in the sup-metric. A function 
b: 1 - 1 is rational of degree r if b can be expressed as a ratio of polynomials in 
x  I of degree at most r. Thus 
71' 
Zi----1 gtigi 
(3.1) 
1042 Williamson and Bartlett 
Let a(f, 5) := inf{l[f- 0ll: deg  } denote the degree of approximation of 
f by a rational function of degree 7r or less. Let b := 5 o p, where 5 and p are 
rational functions: p: ]R x Op - ll, 5: ll x O4 -- IR, both of degree 7r. Let F be 
some function space (metrized by [[. [[) and let a(F, .) := sup{a(f, .): f e F} 
denote the degree of approximation of the function class F. 
Theorem 3.1 Let IF := Wff(2) denote the Sobolev space of functions from a 
compact subset 2 C ll to ll with s :- [aJ continuous derivatives and the sth 
derivative satisfying a Lipschitz condition with order c-s. Then there exist positive 
constants Cl and c2 not depending on 7r such that for sufficiently large 7r 
(3.2) O'r(iFot,p'  1 (W)  
and 
( 
(3.3) o'-(iF, ,) )_ c2 47rlog(7r + 1) 
Note that (3.2) is tight: it is achievable. Whether (3.3) is achievable is unknown. 
The proof is a consequence of theorem 3.4. The above result, although only for 
rational functions of a single variable, suggests that no great benefit in terms of 
degree of approximation is to be obtained by using multiple hidden layer networks. 
3.1 PROOF OF THEOREM 
Definition 3.2 Let F a C I a. A map r: I'd -* IR is called a piecewise rational 
function of degree k with barrier b of order q if there is a polynomial b of degree 
q in x  F d such that on any connected component ofTi C Fd\ {x' b(x): 0}, r is 
a rational function on 7i of degree k: 
r := PL(x) 
Q,i (x) d,i, d,i e  
Note that at any point x  7  -, (i  j), r is nol necessarily single valued. 
Definition 3.3 Let IF be some function class defined on a set G metrized with II-II 
and letO-- v. Then Fp,;q: G x O-- IR, Fq: (x,O) - F(x,O) where 
F(x, O) is a piecewise rational function of 0 of degree k or less with barrier 
(possibly depending on x) of order q; 
2. For all f  IF there is a 0  O such that Ilf- F(., 0)ll ; 
is called an -representation of IF of degree k and order q. 
Theorem 3.4 ([12, page 191, theorem 1]) If F; q is an s-representation ofiF 
of degree k and order q with barrier b not depending on x, then for sufficiently small 
(3.4) ulog[(q+l)(k+l)]>c(3) 1/ 
where c is a constant not dependent on s, , k or q. 
Splines, Rational Functions and Neural Networks 1043 
Theorem 3.4 holds for any v-representation F and therefore (by rearrangement of 
(3.4) and setting v - 27r) 
1 
(3.5) (, r) _> c 
(27r log[(q + 1)(k + 1)])  
Now b0 given by (3.1) is, for any given and fixed z  , a piecewise rational function 
of 0 of degree 1 with barrier of degree 0 (no barrier is actually required). Thus (3.5) 
immediately gives (3.2). 
Now consider 0 =  o p, where 
   j 
(x e 
: , (y  ) nd p:  
i = 1 3i yi j = 1 5j X j 
Direct substitution and rearrangement gives 
where we write 0 = [a, fl, 7, 5] and for simplicity set 4 = p = ' Thus dim 0 = 
4 =: v. For arbitrary but fixed x, g, is a rational function of degree k = . No 
barrier is needed so q = 0 and hence by (3.4), 
( 1 
rr (IFs, p) >_ c2 4r log(r + 1) 
3.2 OPEN PROBLEMS 
An obvious further question is whether results as in the previous section hold for 
multivariable approximation, perhaps for multivariable rational approximation. 
A popular method of d-dimensional nonlinear spline approximation uses dyadic 
splines [2,5, 8]. They are piecewise polynomial representations where the partition 
used is a dyadic decomposition. Given that such a partition E is a subset of a 
partition generated by the zero level set of a barrier polynomial of degree _< [El, can 
Vitushkin's results be applied to this situation? Note that in Vitushkin's theory 
it is the parametrization that is piecewise rational (PR), not the representation. 
What connections are there in general (if any) between PR representations and PR 
parametrizations? 
4 DEGREE OF APPROXIMATION AND LEARNING 
Determining the degree of approximation for given parametrized function classes is 
not only of curiosity value. It is now well understood that the statistical sample 
complexity of learning depends on the size of the approximating class. Ideally 
the approximating class is small whilst well approximating as large as possible 
an approximated class. Furthermore, in order to make statements such as in [1] 
regarding the overall degree of approximation achieved by statistical learning, the 
classical degree of approximation is required. 
1044 Williamson and Bartlett 
For regression purposes the metric used is Lp,,, where 
]]f- gllp, := 
where/ is a probability measure. 
[J(f(x)- g(x))P dlu(x)] 1/p 
Ideally one would like to avoid calculating the 
degree of approximation for an endless series of different function spaces. Fortu- 
nately, for the case of spline approximation (with free knots) this not necessary 
because (thanks to Petrusher and others) there now exist both direct and converse 
theorems characterizing such approximation classes. Let ,S'n(f)p denote the error 
of n knot spline approximation in Lp[O, 1]. Let I denote the identity operator and 
T(h) the translation operator (T(h)(f,x):= f(x + h)) and let A := (T(h)- i)k, 
k = 1,2,..., be the difference operators. The modulus of smoothness of order k for 
f e Lp(Q) is 
w(f,t)p :- E IlAaf(')llLgn)  
Petrushev [9] has obtained 
Theorem 4.1 Let -= (ct/d + l/p) -. Then 
(4.1) -][n'S,(f)p] 1- < oo 
n----1 
if and only if 
(4.2) [t-w(f,t)] dt 
The somewhat strange quantity in (4.2) is the norm of f in a Besov space B,;k. 
Note that for  large enough, r < 1. That is, the smoothness is measured in an Lp 
(p < 1) space. More generally [11], we have (on domain [0, 1]) 
[[fllB,q; := (t-eck(f,t)p) q 
Besov spaces are generalizations of classical smoothness spaces such  Sobolev 
spaces (see [11]). 
We are interested in approximation in Lp, and following Birman and Solomjak [2] 
ask what degree of approximation in Lp, can be obtained when the knot positions 
are chosen according to y rather than f. This is of interest because it makes the 
problem of determining the parameter values on the bis of observations linear. 
Theorem 4.2 Let f  Lp, where I u  L, for some A > 1 and is absolutely contin- 
uous. Choose the n knot positions of a spline approximant v to f on the basis of I u 
only. Then for all such f there is a constant c not dependent on n such that 
(4.3) 
where  = (ct + (1 - ,-1)p-1)-1 and p < . The constant c depends on lu and ,. 
Splines, Rational Functions and Neural Networks 1045 
If p _) 1 and a _( p, for any ct ( a -1 for all f under the conditions above, there is 
a v such that 
(4.4) Ilf- vllLp, < cn-+k-illfll&% 
and again c depends on I u and A but does not depend on n. 
Proof First we prove (4.3). Let [0,1] be partitioned by E. Thus if v is the 
approximant to f on [0, 1] we have 
For any  )_ 1, 
if(x) - v(x)lPdla(x) - If- v[  x dx 
% If- vl p(x--x)-xdx  dx 
where  = p(1 - -)- Now Pe[rushev and Popov [10, p.21] have shown 
there exis[s a polynomial of degree k on  = [r, s] such 
where 
(z 
(-)/ dt 
and a := ( + -)- 0 <  <  and k > 1. Le[ IE] =: n and choose E = 
(i = [ri, si]) such 
 d:-IId/d11(o,x). 
Thus IId/d11() = -/lld/d11(o,x)..ene 
(4.) Ill 1  
Since (by hypo[hesis) p < a, Holder's inequali[y gives 
I1 I  
-  I, 5 clld/d11 Ilfl]B() 
Now for arbi[rary par[i[ions E of [0, 1] Pe[rushev and Popov [10, page 21] have 
shown 
1046 Williamson and Bartlett 
where B.; k: B.,.; = B([0, 1]). Hence 
IIf-vll[p,  clldMdxllL ,,+1- ilfll: k 
and so 
(4.6) Ill- vllLp, _< clldMdxll}/ff 
withr=(c+b-1) - b=p(1--l) -1 Hencea=(c+l--) -. Thus given c 
' 
and p, choosing different A adjusts the a used to measure f on the right-hand side 
of (4.6). This proves (4.3). 
Note that because of the restriction that p ( a, c > 1 is only achievable for p ( 1 
(which is rarely used in statistical regression [6]). Note also the effect of the term 
II/dxll/ff. When A - i this is identically 1 (since / is a probability measure). 
When  ) i it measures the departure from uniform distribution, suggesting the 
degree of approximation achievable under non-uniform distributions is worse than 
under uniform distributions. 
Equation (4.4) is proved similarly. When a < p with p >_ 1, for any c _< l/a, we 
can set  :: (1 -  + po) -1 >_ 1. From (4.5) we have 
Ill- vl p clldMdxll Y] IIflIB() 
_< 
and therefore 
1 
5 CONCLUSIONS AND FURTHER WORK 
In this paper a result of Vitushkin has been applied to "multi-layer" rational ap- 
proximation. Furthermore, the degree of approximation achievable by spline ap- 
proximation with free knots when the knots are chosen according to a probability 
distribution has been examined. 
The degree of approximation of neural networks, particularly multiple layer net- 
works, is an interesting open problem. Ideally one would like both direct and con- 
verse theorems, completely characterizing the degree of approximation. If it turns 
out that from an approximation point of view neural networks are no better than 
dyadic splines (say), then there is a strong incentive to study the PAC-like learning 
theory (of the style of [7]) for such spline representations. We are currently working 
on this topic. 
Splines, Rational Functions and Neural Networks 1047 
Acknowledgements 
This work was supported in part by the Australian Telecommunications and Elec- 
tronics Research Board and OTC. The first author thanks Federico Girosi for pro- 
viding him with a copy of [4]. The second author was supported by an Australian 
Postgraduate Research Award. 
References 
[1] A.R. Barron, Approximation and Estimation Bounds for Artificial Neural Networks, 
To appear in Machine Learning, 1992. 
[2] M.S. Bitman and M. Z. Solomjak, Piecewise-Polynomial Approximations of Func- 
tions of the Classes W, Mathematics of the USSR- Sbornik, 2 (1967), pp. 295- 
317. 
[3] L. Chua and A. -C. Deng, Canonical Piecewise-Linear Representation, IEEE Trans- 
actions on Circuits and Systems, 35 (1988), pp. 101-111. 
[4] R.A. DeVote, Degree of Nonlinear Approximation, in Approximation Theory VI, 
Vohnne 1. C. K. Chui, L. L. Schumaker and J. D. Ward, eds., Academic Press, 
Boston, 1991, pp. 175-201. 
[5] R.A. DeVore, B. Jawerth and V. Popov, Compression of Wavelet Decompositions, 
To appear in American Journal of Mathematics, 1992. 
[6] H. Ekblom, Lp-methods for Robust Regression, BIT, 14 (1974), pp. 22-32. 
[7] D. Haussler, Decision Theoretic Generalizations of the PAC Model for Neural Net 
and Other Learning Applications, Report UCSC-CRL-90-52, Baskin Center for 
Computer Eugineering and Information Sciences, University of California, Santa 
Crnz, 1990. 
[8] P. Oswald. On the Degree of Nonlinear Spline Approximation in Besov-Sobolev 
Spaces, .]ounM of Approximation Theory, 61 (1990), pp. 131-157. 
[9] P. P. Petrushev, Direct. and Converse Theorems for Spline and Rational Approxi- 
mation and Besov Spaces, in Function Spaces and Applications (Lecture Notes 
in Mattematics 1502), M. Cwikel, J. Peetre, Y. Sagher and H. Wallin, eds., 
Springer-Verlag, Berlin, 1988, pp. 363-377. 
[10] P.P. Petrusher and V. A. Popov, Rational Approximation of Real Functions, Cam- 
bridge University Press, Cambridge, 1987. 
[11] H. Triebel. Theory of Function Spaces, Birkhuser Verlag, Basel, 1983. 
[12] A.G. \;ilushkin, Theory of the Transmission and Processing of Information, Perg- 
amon Press, Oxford, 1961, Originally published as Otsenka stozhnosti zadachi 
tabulirovaniya (Estimation of the Complexity of the Tabulation Problem), Fiz- 
matgiz, Moscow, 1959. 
[13] R. C. Williamson and U. Hehnke, Existence and Uniqueness Results for Neural 
Network Approximations, Submitted, 1992. 
