ε-Entropy and the Complexity of
Feedforward Neural Networks
Robert C. Williamson 
Department of Systems Engineering 
Research School of Physical Sciences and Engineering 
Australian National University 
GPO Box 4, Canberra, 2601, Australia. 
Abstract 
We develop a new feedforward neural network representation of Lipschitz
functions from $[0, \rho]^n$ into [0, 1] based on the level sets of the function. We
give an upper bound, in terms of n, ρ, L and $\epsilon_r$, on the number of nodes
needed to represent f to within uniform error $\epsilon_r$, where L is the Lipschitz
constant. We also give an upper bound on the number of bits needed to represent
the weights in the network in order to achieve this approximation. We compare
this bound with the ε-entropy of the functional class under consideration.
1 INTRODUCTION
We are concerned with the problem of the number of nodes needed in a feedforward
neural network in order to represent a function to within a specified accuracy.
All results to date (e.g. [7, 10, 15]) have been in the form of existence theorems,
stating that there does exist a neural network which achieves a certain accuracy of
representation, but no indication is given of the number of nodes necessary in order
to achieve this. The two techniques we use are the notion of ε-entropy (also known
as metric entropy), originally introduced by Kolmogorov [16], and a representation
of a function in terms of its level sets, which was used by Arnold [1]. The place of
the current paper with respect to other works in the literature can be judged from
table 1.

Table 1: Hierarchy of theoretical problems to be solved.
ABSTRACT
1. Determination of the general approximation properties of feedforward
neural networks. (Non-constructive results of the form mentioned
above [15].)
2. Explicit constructive approximation theorems for feedforward neural
networks, indicating the number (or bounds on the number) of nodes
needed to approximate a function from a given class to within a given
accuracy. (This is the subject of the present paper. We are unaware of
any other work along these lines apart from [6].)
3. Learning in general. That is, results on learning that are not dependent
on the particular representation chosen. The exciting new results using
the Vapnik-Chervonenkis dimension [4, 9] fit into this category, as do
studies on the use of Shortest Description Length principles [2].
4. Specific results on capabilities of learning in a given architecture [11].
5. Specific algorithms for learning in a specific architecture [14].
CONCRETE

We study the question of representing a function f in the class
$F^{(\rho_1,\ldots,\rho_n),n}_{L,C}$, which is the space of real-valued functions
defined on the n-dimensional closed interval $\times_{i=1}^{n} [0, \rho_i]$ with
Lipschitz constant L and bounded in absolute value by C. If $\rho_i = \rho$ for
$i = 1, \ldots, n$, we denote the space $F^{\rho,n}_{L,C}$. The error measure we use
is the uniform or sup metric:

$$\|f - \tilde{f}\| = \sup_x |f(x) - \tilde{f}(x)|, \qquad (1)$$

where $\tilde{f}$ is the approximation of f.
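As a numerical illustration of the sup metric (1), the error can be estimated by sampling on a grid. This is only a sketch: the name `sup_error`, the grid resolution and the two example functions are assumptions made here, and a grid maximum can only underestimate the true supremum.

```python
import itertools


def sup_error(f, f_approx, rho, n, points_per_axis=21):
    """Estimate sup_x |f(x) - f_approx(x)| on a uniform grid over [0, rho]^n."""
    axis = [rho * k / (points_per_axis - 1) for k in range(points_per_axis)]
    return max(abs(f(x) - f_approx(x)) for x in itertools.product(axis, repeat=n))


def f(x):
    """Hypothetical target: Lipschitz on [0, 1]^2 with values in [0, 1]."""
    return 0.5 * (x[0] + x[1])


def f_const(x):
    """Hypothetical (crude) approximation: a constant."""
    return 0.5


print(sup_error(f, f_const, rho=1.0, n=2))
```

The grid has `points_per_axis ** n` points, so this direct check is only practical for small n.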
2 ε-ENTROPY OF FUNCTIONAL CLASSES
The ε-entropy $\mathcal{H}_\epsilon$ gives an indication of the number of bits required to represent
with accuracy ε an arbitrary function f in some functional class. It is defined as
the logarithm to base 2 of the number of elements in the smallest ε-cover of the
functional class. Kolmogorov [16] has proved that

$$\mathcal{H}_\epsilon(F^{\rho,n}_{L,C}) = B(n) \left(\frac{\rho L}{\epsilon}\right)^n, \qquad (2)$$

where B(n) is a constant which depends only on n. We use this result as a yardstick
for our neural network representation. A more powerful result is [18, p.86]:
_- 2) (f) % 
(i-2 
(i-3 
cq- 2)(.0 
-2) Oe) 
f 
Figure 1' Illustration of some level sets of a function on R 2. 
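Theorem 1 Let p be a non-negative integer and let $\alpha \in (0, 1]$. Set $s = p + \alpha$. Let
$F^{\rho,n}_{s,L,C}(0)$ denote the space of real functions f defined on $[0, \rho]^n$ all of whose partial
derivatives of order p satisfy a Lipschitz condition with constant L and index α,
and are such that

$$\left| \frac{\partial^{k_1 + k_2 + \cdots + k_n} f(0)}{\partial x_1^{k_1} \partial x_2^{k_2} \cdots \partial x_n^{k_n}} \right| \le C \quad \text{for } \sum_{i=1}^{n} k_i \le p. \qquad (3)$$

Then for sufficiently small ε,

$$A(s,n)\, \rho^n \left(\frac{L}{\epsilon}\right)^{n/s} \le \mathcal{H}_\epsilon\left(F^{\rho,n}_{s,L,C}(0)\right) \le B(s,n)\, \rho^n \left(\frac{L}{\epsilon}\right)^{n/s}, \qquad (4)$$

where A(s, n) and B(s, n) are positive constants depending only on s and n.

We discuss the implication of this below.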
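The definition of ε-entropy given at the start of this section can be made concrete on a small finite family. The sketch below is an illustration under stated assumptions, not a computation from the paper: the family (piecewise-linear functions on [0, 1] with slopes ±L on K equal pieces, all starting at 0.5) and the names `knot_values`, `sup_distance` and `greedy_cover` are hypothetical. A greedy cover need not be the smallest one, so the logarithm printed is only an upper bound on the ε-entropy of this family.

```python
import itertools
import math


def knot_values(slopes, start=0.5, step=0.25):
    """Values at the knots of a piecewise-linear function with the given slopes."""
    values = [start]
    for s in slopes:
        values.append(values[-1] + s * step)
    return values


def sup_distance(u, v):
    """Sup distance between two piecewise-linear functions sharing the same knots.

    Their difference is piecewise linear, so its maximum is attained at a knot.
    """
    return max(abs(a - b) for a, b in zip(u, v))


def greedy_cover(family, eps):
    """Pick centres greedily so that every member lies within eps of some centre."""
    centres = []
    for g in family:
        if all(sup_distance(g, c) > eps for c in centres):
            centres.append(g)
    return centres


L, K = 1.0, 4  # hypothetical Lipschitz constant and number of linear pieces
family = [knot_values(s, step=1.0 / K) for s in itertools.product((-L, L), repeat=K)]

for eps in (0.5, 0.25, 0.1):
    cover = greedy_cover(family, eps)
    # log2 of the cover size: bits needed to name one centre of this cover
    print(eps, len(cover), math.log2(len(cover)))
```

Each member of the family is within ε of the centre that names it, which is the sense in which the logarithm of the cover size counts bits needed for accuracy ε.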
3 A NEURAL NETWORK REPRESENTATION BASED ON LEVEL SETS
We develop a new neural network architecture for representing functions from $[0, \rho]^n$
into [0, 1] (the restriction of the range to [0, 1] is just a convenience and can be
easily dropped). The basic idea is to represent approximations $\tilde{f}$ of the function
f in terms of the level sets of f (see figure 1). Then neural networks are used to
approximate the above-sets $S_\alpha(f)$ of f, where
$S_\alpha(f) := \{x : f(x) \ge \alpha\} = \bigcup_{\beta \ge \alpha} l_\beta(f)$
and $l_\alpha(f)$ is the αth level set: $l_\alpha(f) = \{x : f(x) = \alpha\}$. The approximations
$\tilde{S}_\alpha(f)$ can be implemented using three layer neural nets with threshold logic neurons. These
approximations are of the form

$$\tilde{S}_{\alpha_i}(f) = \bigcup_{m=1}^{C_{\alpha_i}} \bigcup_{\lambda_m=1}^{\Lambda_m} \left\{ n\text{-rectangle of dimensions } \theta_1^{\lambda_m} \times \cdots \times \theta_n^{\lambda_m} \right\}, \qquad (5)$$

in which the inner union is an isothetic approximation to the mth component of
$S_{\alpha_i}(f)$. Here $\theta_j^{\lambda_m}$ is the "width" in the jth dimension of the $\lambda_m$th rectangular part of the
mth component (disjoint connected subset) of the ith approximate above-set
$\tilde{S}_{\alpha_i}$, $C_{\alpha_i}$ is the number of components of the above-set $S_{\alpha_i}(f)$, and $\Lambda_m$ is the number of
convex n-rectangles (parts) that are required to form an ε-cover for the mth component.
Each n-rectangle is built from half-spaces with normals $u_j = (u_j^{(1)}, \ldots, u_j^{(n)})$,
$u_j^{(k)} = \delta_{jk}$, where $S(h_{w,\theta})$ is the n-half-space defined by the hyperplane
$h_{w,\theta}$:

$$S(h_{w,\theta}) = \{x : h_{w,\theta}(x) \ge 0\}, \qquad (6)$$

where $h_{w,\theta}(x) = w \cdot x - \theta$ and $w = (w_1, \ldots, w_n)$.
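The half-space construction can be sketched with threshold logic units as follows. This is an illustration under assumptions rather than the paper's exact network: the helper names `half_space`, `rectangle` and `union`, the use of lower and upper corners to describe each rectangle, and the L-shaped example set are all hypothetical.

```python
def half_space(w, theta):
    """Threshold unit: indicator of {x : w.x - theta >= 0}."""
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else 0


def rectangle(lower, upper):
    """Indicator of the isothetic n-rectangle with corners `lower` and `upper`.

    Built as an AND gate (itself a threshold unit) over 2n axis-aligned half-spaces.
    """
    n = len(lower)
    units = []
    for j in range(n):
        u = [1.0 if k == j else 0.0 for k in range(n)]         # unit normal u_j
        units.append(half_space(u, lower[j]))                   # x_j >= lower_j
        units.append(half_space([-c for c in u], -upper[j]))    # x_j <= upper_j
    # Fires only when all 2n first-layer units fire.
    return lambda x: 1 if sum(h(x) for h in units) - 2 * n >= 0 else 0


def union(parts):
    """OR gate: fires when at least one part indicator fires."""
    return lambda x: 1 if sum(p(x) for p in parts) - 1 >= 0 else 0


# Hypothetical above-set: an L-shaped region covered by two rectangles.
l_shape = union([rectangle([0.0, 0.0], [1.0, 0.5]),
                 rectangle([0.0, 0.0], [0.5, 1.0])])
print(l_shape((0.25, 0.75)), l_shape((0.75, 0.75)))
```

Every gate here is a single comparison of a weighted sum against a threshold, which is why one threshold-unit type suffices for the half-spaces, the AND and the OR.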
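The function f is then approximated by

$$\tilde{f}_N(x) = \frac{1}{N} \sum_{i=1}^{N} 1_{S_{\alpha_i}(f)}(x) + \frac{1}{2N}, \qquad (7)$$

where the levels $\alpha_i$, $i = 1, \ldots, N$, are equally spaced in [0, 1] and $1_S$ is the
indicator function of a set S. The approximation $\tilde{f}_N(x)$ is then further
approximated by implementing (5) using N 3-layer neural nets in parallel:

$$f_{NN}(x) = \frac{1}{N} \sum_{i=1}^{N} 1_{\tilde{S}_{\alpha_i}(f) \cap [0, \rho]^n}(x) + \frac{1}{2N}, \qquad (8)$$

where $x = (x_1, \ldots, x_n)$ and each $\tilde{S}_{\alpha_i}(f)$ is expanded as in (5). Reading (8)
from the outside in, the nested operations are carried out by the last layer and by
the third, second and first layers of each net; the number of terms at the second
level is the number of nodes in the second layer. The last layer combines the
above-sets in the manner of (7). The general architecture of the network is shown
in figure 2.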
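The way a sum of above-set indicators approximates f can be checked numerically. The sketch below uses the exact above-sets, as in (7), and makes three assumptions that should be read as illustrative choices: levels α_i = i/N, equal weights 1/N, and an offset 1/(2N). With these choices the staircase stays within 1/(2N) of any f with values in [0, 1]. The name `level_set_approximation` and the test function are hypothetical.

```python
def level_set_approximation(f, N):
    """Return x -> (1/N) * #{i : f(x) >= i/N} + 1/(2N), for i = 1, ..., N."""
    levels = [i / N for i in range(1, N + 1)]  # assumed levels alpha_i = i / N

    def f_tilde(x):
        # Each term is the indicator of the above-set {x : f(x) >= alpha_i}.
        count = sum(1 for alpha in levels if f(x) >= alpha)
        return count / N + 1.0 / (2 * N)

    return f_tilde


def ramp(x):
    """Hypothetical target on [0, 1] with Lipschitz constant 1."""
    return x


N = 10
ramp_tilde = level_set_approximation(ramp, N)
worst = max(abs(ramp(k / 1000) - ramp_tilde(k / 1000)) for k in range(1001))
print(worst, 1.0 / (2 * N))
```

Replacing each exact indicator by the output of a threshold network for the corresponding approximate above-set gives the parallel architecture described above, at the cost of an additional error from the set approximation.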
4 NUMBER OF BITS NEEDED TO REPRESENT THE WEIGHTS OF THE NETWORK
The two main results of this paper are bounds on the number of nodes needed in such
a neural network in order to represent $f \in F^{\rho,n}_{L,C}$ with uniform error $\epsilon_r$, and bounds
on the number of bits needed to represent the weights in such an approximation.

Theorem 2 The number of nodes needed in a neural network of the above architecture
in order to represent any $f \in F^{\rho,n}_{L,C}$ to within $\epsilon_r$ in the sup-metric is given
by
+ ;+ 
Figure 2: The neural network architecture we adopt. The input x has dimension n,
the ith subnetwork $NN_i$ approximates the above-set $S_{\alpha_i}$, and the output is $f_{NN}(x)$.
This theorem is proved in a straightforward manner by taking account of all the
errors incurred in the approximation of a worst-case function in $F^{\rho,n}_{L,C}$.
Since comparing the number of nodes alone is inadequate for comparing the
complexity of neural nets (because the nodes themselves could implement quite complex
functions) we have also calculated the number of bits needed to represent all of the
weights (including zero weights which denote no connection) in order to achieve an
$\epsilon_r$-approximation.¹
Theorem 3 The number of bits needed to specify the weights in a neural network
with the above architecture in order to represent an arbitrary function $f \in F^{\rho,n}_{L,C}$
with accuracy $\epsilon_r$ in the sup-metric is bounded above by
v4"e,. ' (10) 
Equation 10 can be compared with (2) to see that the neural net representation is
close to optimal. It is suboptimal by a factor of O(). The $4^n$ term is considered
subsumed into the B(n) term in (2).
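To make the bit-counting idea concrete, the sketch below tallies the storage for every weight and threshold, zeros included, in N parallel three-layer nets followed by a combining layer. It is not the bound of Theorem 3: the layer sizes `first` and `second`, the single third-layer unit per net, the fixed `bits_per_weight` and the name `weight_bits` are all assumptions made for illustration.

```python
def weight_bits(n, N, first, second, bits_per_weight):
    """Bits to store all weights and thresholds of N parallel 3-layer nets.

    Assumes dense layers, so a zero weight (no connection) still costs storage.
    """
    per_net = (
        first * (n + 1)         # first layer: n weights plus a threshold per unit
        + second * (first + 1)  # second layer: one weight per first-layer unit
        + (second + 1)          # third layer: a single unit over the second layer
    )
    output_layer = N + 1        # last layer: one weight per net plus an offset
    return bits_per_weight * (N * per_net + output_layer)


# Hypothetical sizes: inputs in 2 dimensions, 8 nets, 16 half-spaces and
# 4 rectangles per net, 8 bits per stored value.
print(weight_bits(n=2, N=8, first=16, second=4, bits_per_weight=8))
```

Counting zero weights explicitly is what makes the total depend on the product of adjacent layer sizes rather than on the number of active connections.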
¹ The idea of using the number of bits as a measure of network complexity has also
recently been adopted in [5].
5 FURTHER WORK 
Theorem 3 shows that the complexity of representing an arbitrary $f \in F^{\rho,n}_{L,C}$ is
exponential in n. This is not so much a limitation of the neural network as an
indication that our problem is too hard. Theorem 1 shows that if smoothness
constraints are imposed, then the complexity can be considerably reduced. It is an
open problem to determine whether the construction of the network presented in
this paper can be extended to make good use of smoothness constraints.
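The effect of smoothness on complexity can be seen with a little arithmetic. The sketch below assumes the ε-entropy grows like ρ^n (L/ε)^(n/s), the standard rate for smoothness classes of order s, and drops the constants A(s, n) and B(s, n). That functional form and the name `log2_entropy_rate` are assumptions made for this illustration.

```python
import math


def log2_entropy_rate(eps, n, s=1.0, rho=1.0, lipschitz=1.0):
    """log2 of rho**n * (lipschitz / eps)**(n / s), with constants dropped."""
    return n * math.log2(rho) + (n / s) * math.log2(lipschitz / eps)


# Rows: input dimension n. Columns: smoothness s = 1 (Lipschitz only), 2, 4.
for n in (2, 5, 10):
    row = [round(log2_entropy_rate(0.01, n, s), 1) for s in (1, 2, 4)]
    print(n, row)
```

Under this assumed rate, smoothness of order s divides the exponent by s, so a smooth class in dimension n behaves like a Lipschitz class in dimension n/s.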
Of course the most important question is whether functions can be learned using
neural networks. Apropos of this is Stone's result on rates of convergence in
non-parametric regression [17]. Although we do not have space to give details here,
suffice it to say that he shows that the gains suggested by Theorem 1 by imposing
smoothness constraints in the representation problem are also achievable in the
learning problem. A more general statement of this type of result, making explicit
the connexion with ε-entropy, is given by Yatracos [19]:
Theorem 4 Let M be an $L_1$-totally bounded set of measures on a probability space.
Let the metric defined on the space be the $L_1$-distance between measures. Then there
exists a uniformly consistent estimator $\hat{\theta}_i$ for some parameter θ from a possibly
infinite dimensional family of measures contained in M whose rate of convergence in i
asymptotically satisfies the equation
a i -- 
(11) 
where $\mathcal{H}_\epsilon$ is the ε-entropy of this family.
Similar results have been discussed by Ben-David et al. [3] (who have made use of
Dudley's (loose) relationships between ε-entropy and Vapnik-Chervonenkis
dimension [8]) and others [12, 13]. There remain many open problems in this field. One of
the main difficulties however is the calculation of $\mathcal{H}_\epsilon$ for non-trivial function classes.
One of the most significant results would be a complete and tight determination of
the ε-entropy for a feedforward neural network.
Acknowledgements
This work was supported in part by a grant from ATERB. I thank Andrew Paice
for many useful discussions.
References 
[1] V. I. Arnold, Representation of Continuous Functions of Three Variables by the
Superposition of Continuous Functions of Two Variables, Matematicheskii Sbornik
(N.S.), 48 (1959), pp. 3-74, Translation in American Mathematical Society
Translations Series 2, 28 (1959) pp. 61-147.
[2] A. R. Barron, Statistical Properties of Artificial Neural Networks, in Proceedings of
the 28th Conference on Decision and Control, 1989, pp. 280-285.
[3] S. Ben-David, A. Itai and E. Kushilevitz, Learning by Distances, in Proceedings of 
the Third Annual Workshop on Computational Learning Theory, M. Fulk and 
J. Case, eds., Morgan Kaufmann, San Mateo, 1990, pp. 232-245. 
[4] A. Blumer, A. Ehrenfeucht, D. Haussler and M. K. Warmuth, Learnability and 
the Vapnik-Chervonenkis Dimension, Journal of the Association for Computing 
Machinery, 36 (1989), pp. 929-965. 
[5] J. Bruck and J. W. Goodman, On the Power of Neural Networks for Solving Hard 
Problems, Journal of Complexity, 6 (1990), pp. 129-135. 
[6] S. M. Carroll and B. W. Dickinson, Construction of Neural Nets using the Radon
Transform, in Proceedings of the International Joint Conference on Neural
Networks, 1989, pp. 607-611, (Volume I).
[7] G. Cybenko, Approximation by Superpositions of a Sigmoidal Function, Mathemat- 
ics of Control, Signals, and Systems, 2 (1989), pp. 303-314. 
[8] R. M. Dudley, A Course on Empirical Processes, in École d'Été de Probabilités
de Saint-Flour XII-1982, R. M. Dudley, H. Kunita and F. Ledrappier, eds.,
Springer-Verlag, Berlin, 1984, pp. 1-142, Lecture Notes in Mathematics 1097.
[9] A. Ehrenfeucht, D. Haussler, M. Kearns and L. Valiant, A General Lower Bound on 
the Number of Examples Needed for Learning, Information and Computation, 
82 (1989), pp. 247-261. 
[10] K.-I. Funahashi, On the Approximate Realization of Continuous Mappings by
Neural Networks, Neural Networks, 2 (1989), pp. 183-192.
[11] S. I. Gallant, A Connectionist Learning Algorithm with Provable Generalization and
Scaling Bounds, Neural Networks, 3 (1990), pp. 191-201.
[12] S. van de Geer, A New Approach to Least-Squares Estimation with Applications, 
The Annals of Statistics, 15 (1987), pp. 587-602. 
[13] R. Hasminskii and I. Ibragimov, On Density Estimation in the View of Kolmogorov's
Ideas in Approximation Theory, The Annals of Statistics, 18 (1990), pp. 999-1010.
[14] R. Hecht-Nielsen, Theory of the Backpropagation Neural Network, in Proceedings 
of the International Joint Conference on Neural Networks, 1989, pp. 593-605, 
Volume 1. 
[15] K. Hornik, M. Stinchcombe and H. White, Multilayer Feedforward Networks are
Universal Approximators, Neural Networks, 2 (1989), pp. 359-366.
[16] A. N. Kolmogorov and V. M. Tihomirov, ε-Entropy and ε-Capacity of Sets in
Functional Spaces, Uspehi Mat. Nauk (N.S.), 14 (1959), pp. 3-86, Translation in American
Mathematical Society Translations, Series 2, 17 (1961) pp. 277-364.
[17] C. J. Stone, Optimal Global Rates of Convergence for Nonparametric Regression,
The Annals of Statistics, 10 (1982), pp. 1040-1053.
[18] A.G. Vitushkin, Theory of the Transmission and Processing of Information, Perg- 
amon Press, Oxford, 1961, Originally published as Otsenka slozhnosti zadachi 
tabulirovaniya (Estimation of the Complexity of the Tabulation Problem), Fiz- 
matgiz, Moscow, 1959. 
[19] Y.G. Yatracos, Rates of Convergence of Minimum Distance Estimators and Kol- 
mogorov's Entropy, The Annals of Statistics, 13 (1985), pp. 768-774. 
