How to Choose an Activation Function 
H. N. Mhaskar 
Department of Mathematics 
California State University 
Los Angeles, CA 90032 
hmhaska@ calstatela.edu 
C. A. Micchelli 
IBM Watson Research Center 
P.O. Box 218 
Yorktown Heights, NY 10598 
cam@watson.ibm.corn 
Abstract 
We study the complexity problem in artificial feedforward neural networks 
designed to approximate real valued functions of several real variables; i.e., 
we estimate the number of neurons in a network required to ensure a given 
degree of approximation to every function in a given function class. We 
indicate how to construct networks with the indicated number of neurons 
evaluating standard activation functions. Our general theorem shows that 
the smoother the activation function, the better the rate of approximation. 
I INTRODUCTION 
The approximation capabilities of feedforward neural networks with a single hidden 
layer has been studied by many authors, e.g., [1, 2, 5]. In [10], we have shown that 
such a network using practically any nonlinear activation function can approximate 
any continuous function of any number of real variables on any compact set to any 
desired degree of accuracy. 
A central question in this theory is the following. If one needs to approximate 
a function from a known class of functions to a prescribed accuracy, how many 
neurons will be necessary to accomplish this approximation for all functions in the 
class? For example, Barron shows in [1] that it is possible to approximate any 
function satisfying certain conditions on its Fourier transform within an L 2 error 
of O(1/n) using a feedforward neural network with one hidden layer comprising of 
n 2 neurons, each with a sigmoidal activation function. On the contrary, if one is 
interested in a class of functions of s variables with a bounded gradient on [-1, 1] s, 
319 
320 Mhaskar and Micchelli 
then in order to accomplish this order of approximation, it is necessary to use at 
least O(n s) number of neurons, regardless of the activation function (cf. [3]). 
In this paper, our main interest is to consider the problem of approximating a 
function which is known only to have a certain number of smooth derivatives. We 
investigate the question of deciding which activation function will require how many 
neurons to achieve a given order of approximation for all such functions. We will 
describe a very general theorem and explain how to construct networks with various 
activation functions, such as the Gaussian and other radial basis functions advocated 
by Girosi and Poggio [13] as well as the classical squashing function and other 
sigmoidal functions. 
In the next section, we develop some notation and briefly review some known facts 
about approximation order with a sigmoidal type activation function. In Section 
3, we discuss our general theorem. This theorem is applied in Section 4 to yield 
the approximation bounds for various special functions which are commonly in use. 
In Section 5, we briefly describe certain dimension independent bounds similar to 
those due to Barron [1], but applicable with a general activation function. Section 
6 summarizes our results. 
2 SIGMOIDAL-TYPE ACTIVATION FUNCTIONS 
In this section, we develop some notation and review certain known facts. For the 
sake of concreteness, we consider only uniform approximation, but our results are 
valid also for other LP-norms with minor modifications, if any. Let s >_ I be the 
nmnber of input variables. The class of all continuous functions on [-1, 1] s will be 
denoted by C s. The class of all 2r- periodic continuous functions will be denoted 
by C s*. The uniform norm in either case will be denoted by I I' I I. Let 
denote the set of all possible outputs of feedforward neural networks consisting of 
n neurons arranged in I hidden layers and each neuron evaluating an activation 
function rr where the inputs to the network are from R s. It is customary to assume 
more a priori knowledge about the target function than the fact that it belongs 
to C s or C s* For example, one may assume that it has continuous derivatives of 
order r >_ I and the sum of the norms of all the partial derivatives up to (and 
including) order r is bounded. Since we are interested mainly in the relative error 
in approximation, we may assume that the target function is normalized so that this 
sum of the norms is bounded above by 1. The class of all the functions satisfying 
this condition will be denoted by W (or W* if the functions are periodic). In this 
paper, we are interested in the universal approcimation of the classes W (and their 
periodic versions). Specifically, we are interested in estimating the quantity 
(2.1) sup 
where 
(2.2) 
E',,t,s,o(f) := inf Ilf-PII. 
PIIn,l,s,, 
The quantity E,,t,s,o(f) measures the theoretically possible best order of approxi- 
mation of an individual function f by networks with n neurons. We are interested 
How to Choose an Activation Function 321 
in determining the order that such a network can possibly achieve for all functions 
in the given class. An equivalent dual formulation is to estimate 
(2.3) /,t,s,o(Wf) := min{m e Z  sup Em,t,s,o(f) _< l/n). 
This quantity measures the minimum number of neurons required to obtain accuracy 
of 1In for all functions in the class W. An analogous definition is assumed for W* 
in place of W. 
Let / denote the class of all s-variable trigonometric polynomials of order at most 
n and for a continuous function f, 2r-periodic in each of its s variables, 
(2.4) E,(f) := min Ilf- PII. 
We observe that/H can be thought of as a subclass of all outputs of networks with 
a single hidden layer comprising of at most (2n + 1) * neurons, each evaluating the 
activation function sin x. It is then well known that 
(2.5) sup E(f) < cn-". 
lewd. 
Here and in the sequel, c, c,... will denote positive constants independent of the 
functions and the number of neurons involved, but generally dependent on the other 
parameters of the problem such as r, s and rr. Moreover, several constructions for 
the approximating trigonometric polynomials involved in (2.5) are also well known. 
In the dual formulation, (2.5)states that if rr(x):= sinx then 
(2.6) ln,l,s,sin (Wr s*) -- 
It can be proved [3] that any "reasonable" approzimation process that aims to ap- 
proximate all functions in W* up to an order of accuracy 1In must necessarily 
depend upon at least O(n s/") parameters. Thus, the activation function sin x pro- 
vides optimal convergence rates for the class Wf*. 
The problem of approximating an r times continuously differentiable function 
f : R s  R on [-1,1] s can be reduced to that of approximating another 
function from the corresponding periodic class as follows. We take an infinitely 
many times differentiable function p which is equal to I on [-2, 2] s and 0 outside 
of [-r, r] s. The function fP can then be extended as a 2r-periodic function. This 
function is r times continuously differentiable and its derivatives can be bounded 
by the derivatives of f using the Leibnitz formula. A function that approximates 
this 2r-periodic function also approximates f on [-1, 1] s with the same order of 
approximation. In contrast, it is not customary to choose the activation function 
to be periodic. 
In [10] we introduced the notion of a higher order sigmoidal function as follows. Let 
k >_ 0. We say that a function rr  R -- R is sigmoidal of order k if 
(2.7) lina  - 1, lira  - 0, 
and 
(2.8) I()1 - c(1 q-Ixl) , x GR. 
322 Mhaskar and Micchelli 
A sigrnoidal function of order 0 is thus the customary bounded sigmoidal function. 
We proved in [10] that for any integer r ) 1 and a sigmoidal function tr of order 
r-l, wehave 
W? ( O(n /") if s - 1, 
(2.9) /,,,,,o(r) = O(n,/,.+(,+2,.)/,., ) if s _ 2. 
Subsequently, Mhaskar showed in [6] that if tr is a sigmoidal function of order k ) 2 
and r > 1 then, with I - O(logr/logk)), 
(2.10) 
k.,,.,o(w;) = 
Thus, an optimal network can be constructed using a sigmoidal function of higher 
order. During the course of the proofs in [10] and [6], we actually constructed the 
networks explicitly. The various features of these constructions from the connec- 
tionist point of view are discussed in [7, 8, 9]. 
In this paper, we take a different viewpoint. We wish to determine which acti- 
vation function leads to what approximation order. As remarked above, for the 
approximation of periodic functions, the periodic activation function sin x provides 
an optimal network. Therefore, we will investigate the degree of approximation by 
neural networks first in terms of a general periodic activation function and then 
apply these results to the case when the activation function is not periodic. 
3 A GENERAL THEOREM 
In this section, we discuss the degree of approximation of periodic functions using 
periodic activation functions. It is our objective to include the case of radial basis 
functions as well as the usual "first order" neural networks in our discussion. To 
encompass both of these cases, we discuss the following general formuation. Let 
s >_ d ) I be integers and b G C d*. We will consider the approximation of functions 
in C s* by linear combinations of quantities of the form d(Ax + t) where A is a d x s 
matrix and t G R d. (In general, both A and t are parameters of the network.) When 
d: s, A is the identity matrix and b is a radial function, then a linear combination 
of n such quantities represents the output of a radial basis function network with n 
neurons. When d = I then we have the usual neural network with one hidden layer 
and periodic activation function b. 
We define the Fourier coefficients of d by the formula 
1 /_ c(t)e_im.tdt, m  Z  
(3.1) (ln) .- (2r)a ,,,]a ' 
Let 
(3.2) S4, _C {m  Z a  (m) y O} 
and assume that there is a set J co taining d x s matrices with integer entries such 
that 
(3.3) Z s ={ATm  nSz, ACJ} 
How to Choose an Activation Function 323 
where A r denotes the transpose of A. If d = 1 and (1) :/- 0 (the neural network 
case) then we may choose S, = {1} and d to be Z  (considered as row vectors). 
If d = s and 6 is a function with none of its Fourier coefficients equal to zero (the 
radial basis case) then we may choose S 4,= Z  and d = {Ix}. For m G Z , we 
let km be the multi-integer with minimum magnitude such that m = ATkm for 
some A - Am  d. Our estimates will need the quantities 
(3.4) mr := min{l(km)l ' -2n _< m < 2n} 
and 
(3.5) Nr := max{lkml ' -2n < m < 2n} 
where Ikml is the maximum absolute value of the components of km. In the neural 
network case, we have m = I(1)1 and N = 1. In the radial basis case, N, = 2n. 
Our main theorem can be formulated as follows. 
THEOREM 3.1. Let s >_ d >_ 1, n > 1 and N > Nr be integers, f  C *, c  C a*. 
It is possible to construct a network 
(3.6) 
Gr,N,,(f;x) :=  djb(Ajx + tj) 
such that 
(3.7) IIf } 
- Ilfll  
In (3.6), the sum contains at most O(nN a) terms, Aj  J, tj  R a, and dj are 
linear functionals of f, depending upon n, N, 
The estimate (3.7) relates the degree of approximation of f by neural networks 
explicitly in terms of the degree of approximation of f and  by trigonometric poly- 
nomials. Well known estimates from approximation theory, such as (2.5), provide 
close connections between the smoothness of the functions involved and their degree 
of trigonometric polynomial approximation. In particular, (3.7) indicates that the 
smoother the function  the better will be the degree of approximation. 
In [11], we have given explicit constructions of the operator G,,N,. The formulas in 
[11] show that the network can be trained in a very simple manner, given the Fourier 
coefficients of the target function. The weights and thresholds (or the centers in the 
case of the radial basis networks) are determined universally for all functions being 
approximated. Only the coefficients at the output layer depend upon the function. 
Even these are given explicitly as linear combinations of the Fourier coefficients of 
the target function. The explicit formulas in [11] show that in the radial basis case, 
the operator G,,N,O actually contains only O(n + N)  summands. 
4 APPLICATIONS 
In Section 3, we had assumed that the activation function & is periodic. If the 
activation function rr is not periodic, but satisfies certain decay conditions near 
324 Mhaskar and Micchelli 
c, it is still possible to construct a periodic function for which Theorem 3.1 can 
be applied. Suppose that there exists a function p in the linear span of.Ao,. := 
{rr(Ax + t) : A E J, t E Rd}, which is integrahie on R d and satisfies the condition 
that 
(4.1) I(x)l _< cllxll -, for some r > d. 
Under this assumption, the function 
(4.2) )(x) :=  )(x- 2rk) 
keZ d 
is a 2r-periodic function integrahie on [-r, r] . We can then apply Theorem 3.1 
with o instead of . In G,,N,o, we next replace po by a function obtained by 
judiciously truncating the infinite sum in (4.2). The error made in this replacement 
can be estimated using (4.1). Knowing the number of evaluations of rr in the 
expression for b as a finite linear combination of elements of.Ao,., we then have an 
estimate on the degree of approximation of f in terms of the number of evaluations of 
rr. This process was applied on a number of functions rr. The results are summarized 
in Table 1. 
5 DIMENSION INDEPENDENT BOUNDS 
In this section, we describe certain estimates on the L 2 degree of approximation that 
are independent of the dimension of the input space. In this section, II' II denotes 
the L 2 norm on [-1, 1]  (respectively [-r, r] ) and we approximate functions in the 
class SF defined by 
(5.1) SF, :- {f e C  Ilfllsr, := If(m)[ _< 1}. 
mZ' 
Analogous to the degree of approximation from F/,, we define the n-th degree of 
approximation of a function f  C * by the formula 
(5.2) e,,(f) := inf Ilf-  f(m)enn'Xll 
AcZ',IAIin nleA 
where we require the norm involved to be the L 2 norm. In (5.2), there is no need 
to assume that n is an integer. 
Let b be a square integrahie 27r-periodic function of one variable. We define the L 2 
degree of approximation by networks with a single hidden layer by the formula 
(5.3) (2) 
,,,,(f) := inf III-PII 
PH,,,,, 
where m is the largest integer not exceeding n. Our main theorem in this connection 
is the following 
THEOREM 5.1. Lets>_ 1 be an integer, f G SF, c) G L21 
integers n, N _ 1, 
and ((1) - 0. Then, for 
(5.4) 
How to Choose an Activation Function 325 
Table 1: Order of magnitude of r,t,,o(Wd) for different o"s 
Function o' r,,,o Remarks 
Sigmoidal, order r- 1 n / s = d = 1, 1 = 1 
Sigmoidal, order r - 1 n*/"+( +2")/" s _> 2, d = 1, I = 1 
x ,ifx>0,0, ifx<0. nlr+( 2r+)lr k>_2, s>2, d= 1,1= 1 
(1 q- e-') -' n/"(logn) 2 s >_ 2, d - 1, l- 1 
Sigmoidal, order k n / k ) 2, s ) 1, d = 1, 
I: O(logr/logk)) 
exp(-lxl/2) n 2/ s = d> 2,1 = 1 
[xl(loglxl) * n (s/r)(2+(3s+2)/) s = d >_ 2, k > O, k + s even, 
5-0ifsodd, lifseven, l- 1 
where {Sr} is a sequence of positive numbers, 0 _< 5, _< 2, depending upon f such 
that 6, -- 0 as n -- co. Moreover, the coefficients in the network that yields (5.4) 
are bounded, independent of n and N. 
We may apply Theorem 5.1 in the same way as Theorem 3.1. For the squashing 
activation function, this gives an order of approximation O(n -/2) with a network 
consisting of n(log n)  neurons arranged in one hidden layer. With the truncated 
power function x[ (cf. Table 1, entry 3) as the activation function, the same 
order of approximation is obtained with a network with a single hidden layer and 
O(n +/()) neurons. 
6 CONCLUSIONS. 
We have obtained estimates on the number of neurons necessary for a network with 
a single hidden layer to provide a given accuracy of all functions under the only a 
priori assumption that the derivatives of the function up to a certain order should 
exist. We have proved a general theorem which enables us to estimate this number 
326 Mhaskar and Micchelli 
in terms of the growth and smoothness of the activation function. We have explicitly 
constructed networks which provide the desired accuracy with the indicated number 
of neurons. 
Acknowledgement s 
The research of H. N. Mhaskar was supported in part by AFOSR grant 2-26 113. 
References 
1. B^RRON, A. R., Universal approximation bounds for superposition of a 
sigmoidal function, IEEE Trans. on Information Theory, 39. 
2. CYBENKO, G., Approximation by superposition of sigmoidal functions, 
Mathematics of Control, Signals and Systems, 2, # 4 (1989), 303-314. 
3. DEVORE, R., HOWARD, R. aND MlCCHELLI, C.A., Optimal nonlinear 
approximation, Manuscripta Mathematica, 63 (1989), 469-478. 
4. HECHT-NILESEN, R., Thoery of the backpropogation neural network, IEEE 
International Conference on Neural Networks, I (1988), 593-605. 
5. HORNIK, K., STINCHCOMBE, M. AND WHITE, H., Multilayerfeedforward 
networks are universal approximators, Neural Networks, 2 (1989), 359-366. 
6. MHASK^R, H. N., Approximation properties of a multilayered feedfor- 
ward artificial neural network, Advances in Computational Mathematics 
I (1993), 61-80. 
7. MH^SKAR, H. N., Neural networks for localized approximation of real 
functions, in "Neural Networks for Signal Processing, III", (Kamm, Huhn, 
Yoon, Chellappa and Kung Eds.), IEEE New York, 1993, pp. 190-196. 
8. MHASKAR, H. N., Approximation of real functions using neural networks, 
in Proc. of Int. Conf. on Advances in Cornput. Math., New Delhi, India, 
1993, World Sci. Publ., H. P. Dikshit, C. A. Micchelli eds., 1994. 
9. MHASKAR, H. N., Noniterative training algorilhms for neural networks, 
Manuscript, 1993. 
10. MHASKAR, H. N. AND MICCHELLI, C. A., Approximation by superposi- 
tion of a sigmoidal function and radial basis functions, Advances in Ap- 
plied Mathematics, 13 (1992), 350-373. 
11. MHASKAR, H. N. AND MICCHELLI, C. A., Degree of approximation by 
superpositions of a fixed function, in preparation. 
12. MHASKAR, H. N. AND MICCHELLI, C. A., Dimension independent bounds 
on the degree of approximation by neural networks, Manuscript, 1993. 
13. POGGIO, T. AND GIROSI, F., Regularization algorithms for learning that 
are equivalent to multilayer networks, Science, 247 (1990), 978-982. 
