The Capacity of a Bump 
Gary William Flake* 
Institute for Advanced Computer Studies
University of Maryland 
College Park, MD 20742 
Abstract 
Recently, several researchers have reported encouraging experimental results
when using Gaussian or bump-like activation functions in multilayer
perceptrons. Networks of this type usually require fewer hidden layers
and units and often learn much faster than typical sigmoidal networks.
To explain these results we consider a hyper-ridge network, which is a
simple perceptron with no hidden units and a ridge activation function. If
we are interested in partitioning p points in d dimensions into two classes,
then in the limit as d approaches infinity the capacities of a hyper-ridge and
a perceptron are identical. However, we show that for p >> d, which is the
usual case in practice, the ratio of hyper-ridge to perceptron dichotomies
approaches p/(2(d + 1)).
1 Introduction 
A hyper-ridge network is a simple perceptron with no hidden units and a ridge activation 
function. With one output this is conveniently described as $y = g(h) = g(w \cdot x - b)$,
where $g(h) = \operatorname{sgn}(1 - h^2)$. Instead of dividing an input-space into two classes with a
single hyperplane, a hyper-ridge network uses two parallel hyperplanes. All points in the 
interior of the hyperplanes form one class, while all exterior points form another. For more 
information on hyper-ridges, learning algorithms, and convergence issues the curious reader 
should consult [3]. 
We wouldn't go so far as to suggest that anyone actually use a hyper-ridge for a real-world
problem, but it is interesting to note that a hyper-ridge can represent linearly inseparable
mappings such as XOR, NEGATE, SYMMETRY, and COUNT(m) [2, 3]. Moreover,
hyper-ridges are very similar to multilayer perceptrons with bump-like activation functions, 
such as a Gaussian, in the way the input space is partitioned. Several researchers [6, 2, 3, 5] 
have independently found that Gaussian units offer many advantages over sigmoidal units. 
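To make the XOR claim concrete, the following is a minimal sketch of a single hyper-ridge unit. The weights and bias are hypothetical values picked by hand for illustration, not taken from [2] or [3], and the unit returns -1 on the boundary where sgn would be 0.

```python
def ridge(h):
    """Ridge activation g(h) = sgn(1 - h^2).

    Returns +1 strictly between the hyperplanes h = -1 and h = +1, else -1.
    (Boundary points, where sgn would give 0, are mapped to -1 here.)
    """
    return 1 if 1.0 - h * h > 0 else -1


def hyper_ridge(w, b, x):
    """Single hyper-ridge unit y = g(w . x - b)."""
    h = sum(wi * xi for wi, xi in zip(w, x)) - b
    return ridge(h)


# Hypothetical parameters chosen by hand (an assumption for this example).
W_XOR, B_XOR = (2.0, 2.0), 2.0
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

xor_outputs = {x: hyper_ridge(W_XOR, B_XOR, x) for x in INPUTS}
for x, y in xor_outputs.items():
    print(x, "->", y)
```

With these values the inputs (0, 1) and (1, 0) give h = 0 and fall between the two hyperplanes, while (0, 0) and (1, 1) give h = -2 and h = 2 and fall outside.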
*Current address: Adaptive Information and Signal Processing Department, Siemens Corporate
Research, 755 College Road East, Princeton, NJ 08540. Email: flake@scr.siemens.com
In this paper we derive the capacity of a hyper-ridge network. Our first result is that 
hyper-ridges and simple perceptrons are equivalent in the limit as the input dimension 
size approaches infinity. However, when the number of patterns is far greater than the
input dimension (as is the usual case) the ratio of hyper-ridge to perceptron dichotomies
approaches p/(2(d + 1)), giving some evidence that bump-like activation functions offer an
advantage over the more traditional sigmoid.
The rest of this paper is divided into three more sections. In Section 2 we derive the number 
of dichotomies for a hyper-ridge network. The capacities for hyper-ridges and simple 
perceptrons are compared in Section 3. Finally, in Section 4 we give our conclusions. 
2 The Representation Power of a Hyper-Ridge 
Suppose we have p patterns in the pattern-space, $\mathbb{R}^d$, where d is the number of inputs of our
neural network. A dichotomy is a classification of all of the points into two distinct sets.
Clearly, there are at most $2^p$ dichotomies that exist. We are concerned with the number of
dichotomies that a single hyper-ridge node can represent. Let the number of dichotomies
of p patterns in d dimensions be denoted as D(p, d).
For the case of D(1, d), when p = 1 there are always two and only two dichotomies since 
one can trivially include the single point or no points. Thus, D(1, d) = 2. 
For the case of D(p, 1), all of the points are constrained to fall on a line. From this set 
pick two points, say $x_a$ and $x_b$. It is always possible to place a ridge function such that
all points between $x_a$ and $x_b$ (inclusive of the end points) are included in one set, and all
other points are excluded. Thus, there are p dichotomies consisting of a single point, p - 1 
dichotomies consisting of two points, p - 2 dichotomies consisting of three points, and 
so on. No other dichotomies besides the empty set are possible. The number of possible 
hyper-ridge dichotomies in one dimension can now be expressed as 
$$D(p, 1) = \sum_{i=1}^{p} i + 1 = \frac{1}{2}\, p(p + 1) + 1, \tag{1}$$
with the extra dichotomy coming from the empty set. 
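The one-dimensional count can be checked by brute force. The sketch below assumes distinct points on a line and treats a ridge as selecting a contiguous run of the sorted points. The function names are ours, introduced only for this check.

```python
def ridge_dichotomies_1d(points):
    """Enumerate the subsets of distinct collinear points one ridge can select."""
    xs = sorted(points)
    subsets = {frozenset()}  # the empty set: the ridge covers no point
    for i in range(len(xs)):
        for j in range(i, len(xs)):
            # All points between xs[i] and xs[j], end points included.
            subsets.add(frozenset(xs[i:j + 1]))
    return subsets


def d_one_dim(p):
    """Closed form of Equation 1: p(p + 1)/2 + 1."""
    return p * (p + 1) // 2 + 1


# (p, enumerated count, closed form) for a few small p.
counts = [(p, len(ridge_dichotomies_1d(range(p))), d_one_dim(p))
          for p in range(1, 8)]
for row in counts:
    print(row)
```

For every p tried, the enumerated count and the closed form agree, for example 7 selectable subsets for p = 3.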
To derive the general form of the recurrence relationship, we would have to resort to 
techniques similar to those used by Cover [1], Nilsson [7], and Gardner [4]. Because of 
space considerations, we do not give the full derivation of the general form of the recurrence 
relationship in this paper, but instead cite the complete derivation given in [3]. The short 
version of the story is that the general form of the recurrence relationship for hyper-ridge 
dichotomies is identical to the equivalent expression for simple perceptrons: 
$$D(p, d) = D(p - 1, d) + D(p - 1, d - 1). \tag{2}$$
All differences between the capacity of hyper-ridges and simple perceptrons are, therefore, 
a consequence of the different base cases for the recurrence expression. 
To get Equation 2 into closed form, we first expand D(p, d) a total of p times, yielding 
$$D(p, d) = \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, d - i). \tag{3}$$
For Equation 3 it is possible for the second argument of D(1, d - i) to become zero or negative.
The two identities D(p, 0) = p + 1 and D(p, -1) = 1 are the only choices that are
consistent with the recurrence relationship expressed in Equation 2. With this in mind, there
are three separate cases that we need to be concerned with: p < d + 2, p = d + 2, and
p > d + 2. When p < d + 2,
$$D(p, d) = \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, d - i) = 2 \sum_{i=0}^{p-1} \binom{p-1}{i} = 2^p, \tag{4}$$
since the second argument of every D(1, d - i) is greater than or equal to zero. When
p = d + 2, the second argument of the last term in the summation, D(1, d - i) with
i = p - 1, is equal to -1. Thus we can expand Equation 3 in this case to
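$$\begin{aligned}
D(p, d) &= \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, d - i) = \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, p - 2 - i) \\
&= \sum_{i=0}^{p-2} \binom{p-1}{i} D(1, p - 2 - i) + 1 = 2 \sum_{i=0}^{p-2} \binom{p-1}{i} + 1 \\
&= 2\left(2^{p-1} - 1\right) + 1 = 2^p - 1.
\end{aligned} \tag{5}$$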
Finally, when p > d + 2, the second arguments of some of the last terms D(1, d - i) are
always negative. We can disregard all terms with d - i < -1, taking D(1, d - i) equal to zero
in these cases (which is consistent with the recurrence relationship),
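$$\begin{aligned}
D(p, d) &= \sum_{i=0}^{p-1} \binom{p-1}{i} D(1, d - i) = \sum_{i=0}^{d+1} \binom{p-1}{i} D(1, d - i) \\
&= \sum_{i=0}^{d} \binom{p-1}{i} D(1, d - i) + \binom{p-1}{d+1}.
\end{aligned} \tag{6}$$
Combining Equations 4, 5, and 6 gives
$$D(p, d) = \begin{cases}
2 \sum_{i=0}^{d} \binom{p-1}{i} + \binom{p-1}{d+1} & \text{for } p > d + 2 \\
2^p - 1 & \text{for } p = d + 2 \\
2^p & \text{for } p < d + 2
\end{cases} \tag{7}$$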
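As a sanity check on the closed form, the sketch below evaluates the recurrence of Equation 2 directly, using the base cases stated in the text (D(1, d) = 2 for d >= 0, D(p, -1) = 1, and zero for smaller second arguments), and compares it with Equation 7. The function names are ours.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def d_hyper_recursive(p, d):
    """Equation 2 evaluated recursively with the hyper-ridge base cases."""
    if d < -1:
        return 0  # terms the text disregards
    if d == -1:
        return 1  # D(p, -1) = 1
    if p == 1:
        return 2  # D(1, d) = 2 for d >= 0
    return d_hyper_recursive(p - 1, d) + d_hyper_recursive(p - 1, d - 1)


def d_hyper_closed(p, d):
    """Closed form of Equation 7, for p >= 1 and d >= 0."""
    if p < d + 2:
        return 2 ** p
    if p == d + 2:
        return 2 ** p - 1
    return 2 * sum(comb(p - 1, i) for i in range(d + 1)) + comb(p - 1, d + 1)


# Pairs (p, d) where the two computations disagree; expected to be empty.
mismatches = [(p, d) for p in range(1, 25) for d in range(0, 10)
              if d_hyper_recursive(p, d) != d_hyper_closed(p, d)]
print("mismatches:", mismatches)
```

The d = 1 column also reproduces Equation 1: both functions give 11 for p = 4, which equals p(p + 1)/2 + 1.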
3 Comparing Representation Power 
Cover [1], Nilsson [7], and Gardner [4] have all shown that D(p, d) for simple perceptrons
obeys the rule
$$D(p, d) = \begin{cases}
2 \sum_{i=0}^{d} \binom{p-1}{i} & \text{for } p > d + 2 \\
2^p - 2 & \text{for } p = d + 2 \\
2^p & \text{for } p < d + 2
\end{cases} \tag{8}$$
The interesting case is when p > d + 2, since that is where Equations 7 and 8 differ the 
most. Moreover, problems are more difficult when the number of training patterns greatly 
exceeds the number of trainable weights in a neural network. 
Let $D_h(p, d)$ and $D_p(p, d)$ denote the number of dichotomies possible for hyper-ridge
networks and simple perceptrons, respectively. Additionally, let $C_h$ and $C_p$ denote the
respective capacities. We should expect both $D_h(p, d)/2^p$ and $D_p(p, d)/2^p$ to be at or around
1 for small values of p/(d + 1). At some point, for large p/(d + 1), the $2^p$ term should
dominate, making the ratio go to zero. The capacity of a network can loosely be defined as
the value of p/(d + 1) such that $D(p, d)/2^p = \frac{1}{2}$. This is more rigorously defined as
$$C = \left\{ c : \lim_{d \to \infty} \frac{D(c(d+1), d)}{2^{c(d+1)}} = \frac{1}{2} \right\},$$
which is the point at which the transition occurs in the limit as the input dimension goes to
infinity.
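For finite d, the loose definition can be evaluated exactly. The sketch below finds, for each architecture, the smallest p/(d + 1) at which D(p, d)/2^p has dropped to 1/2. The helper names and the compact single-expression forms of Equations 7 and 8 are our own rewriting, not notation from the derivation.

```python
from fractions import Fraction
from math import comb


def d_perceptron(p, d):
    """Equation 8; capping the sum at p - 1 covers all three cases."""
    return 2 * sum(comb(p - 1, i) for i in range(min(d, p - 1) + 1))


def d_hyper(p, d):
    """Equation 7; comb(n, k) is 0 for k > n, so one expression suffices."""
    return d_perceptron(p, d) + comb(p - 1, d + 1)


def finite_capacity(count, d):
    """Smallest p/(d + 1) with count(p, d) / 2^p at or below 1/2."""
    p = 1
    while Fraction(count(p, d), 2 ** p) > Fraction(1, 2):
        p += 1
    return Fraction(p, d + 1)


for d in (5, 20, 100):
    print(d, finite_capacity(d_perceptron, d), finite_capacity(d_hyper, d))
```

For d = 5, 20, and 100 the perceptron crossing comes out at exactly 2, while the hyper-ridge crossing is 13/6, 43/21, and 203/101. It shrinks toward 2 as d grows, which matches the behaviour described for the capacity curves.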
Figures 1, 2, and 3 illustrate and compare $C_p$ and $C_h$ at different stages. In Figure 1
the capacities are illustrated for perceptrons and hyper-ridges, respectively, by plotting
$D(p, d)/2^p$ versus p/(d + 1) for various values of d. On par with our intuition, the ratio
$D(p, d)/2^p$ equals 1 for small values of p/(d + 1) but decreases to zero as p/(d + 1) increases.
Figure 2 and the left diagram of Figure 3 plot $D(p, d)/2^p$ versus p/(d + 1) for perceptrons
and hyper-ridges, side by side, with values of d = 5, 20, and 100. As d increases, the two
curves become more similar. This fact is further illustrated in the right diagram of Figure 3
where the plot is of $D_h(p, d)/D_p(p, d)$ versus p for various values of d. The ratio clearly
approaches 1 as d increases, but there is significant difference for smaller values of d.
The differences between $D_p$ and $D_h$ can be more explicitly quantified by noting that
$$D_h(p, d) = D_p(p, d) + \binom{p-1}{d+1}$$
for p > d + 2. This difference clearly shows up in the plots comparing the two capacities.
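The gap between the two counts can be traced back to the base cases alone. The sketch below evaluates the sum of Equation 3 twice, once with the hyper-ridge base cases from Section 2 and once with a perceptron base case that we assume here (D(1, k) = 2 for k >= 0 and 0 otherwise, which reproduces Equation 8). The function names are ours.

```python
from math import comb


def base_hyper(k):
    """D(1, k) for a hyper-ridge: 2 for k >= 0, 1 for k = -1, else 0."""
    return 2 if k >= 0 else (1 if k == -1 else 0)


def base_perceptron(k):
    """Assumed perceptron analogue: 2 for k >= 0, else 0."""
    return 2 if k >= 0 else 0


def dichotomies(p, d, base):
    """Equation 3: sum over i of C(p-1, i) * D(1, d - i)."""
    return sum(comb(p - 1, i) * base(d - i) for i in range(p))


# Difference D_h - D_p on a grid with p > d + 2.
gaps = {(p, d): dichotomies(p, d, base_hyper) - dichotomies(p, d, base_perceptron)
        for d in range(0, 6) for p in range(d + 3, d + 12)}
print(all(gap == comb(p - 1, d + 1) for (p, d), gap in gaps.items()))
```

Only the i = d + 1 term differs between the two sums, and that term is the binomial coefficient that separates the two counts.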
We will now show that the capacities are identical in the limit as d approaches infinity. To
do this, we will prove that the capacity curves for both hyper-ridges and perceptrons cross
1/2 at p/(d + 1) = 2. This fact is already widely known for perceptrons. Because of space
limitations we will handwave our way through the lemma and corollary proofs. The curious
reader should consult [3] for the complete proofs.
Lemma 3.1
$$\lim_{n \to \infty} \frac{1}{2^{2n}} \binom{2n}{n} = 0.$$
Short Proof Since n approaches infinity, we can use Stirling's formula as an approximation 
of the factorials. 
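A quick numerical look at the lemma, as a sketch: the leading-order Stirling estimate of the central binomial coefficient gives a ratio of about 1/sqrt(pi n), which the code compares with the exact value. The function names are ours.

```python
from math import comb, pi, sqrt


def central_ratio(n):
    """C(2n, n) / 2^(2n), using exact integers before the final division."""
    return comb(2 * n, n) / 4 ** n


def stirling_estimate(n):
    """Leading-order Stirling approximation of the same ratio."""
    return 1.0 / sqrt(pi * n)


table = [(n, central_ratio(n), stirling_estimate(n)) for n in (1, 10, 100, 1000)]
for n, exact, approx in table:
    print(f"n={n:5d}  exact={exact:.6f}  stirling={approx:.6f}")
```

Both columns decay like the inverse square root of n, so the ratio tends to zero, although slowly.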
Corollary 3.2 
For all positive integer constants a, b, and c,
$$\lim_{n \to \infty} \frac{1}{2^{2n+a}} \binom{2n+b}{n+c} = 0.$$
Short Proof When adding the constants b and c to the combination, the whole combination
can always be represented as $\binom{2n}{n} \cdot y$, where y is some multiplicative constant. Such
a constant can always be factored out of the limit. Additionally, large values of a only
increase the growth rate of the denominator.
Lemma 3.3 For p/(d + 1) = 2, $\lim_{d \to \infty} D_p(p, d)/2^p = \frac{1}{2}$.
Short Proof Consult any of Cover [1], Nilsson [7], or Gardner [4] for full proof. 
[Figure 1: two panels, horizontal axis p/(d + 1), curves for d = 5, 20, and 100]
Figure 1: On the left, $D_p(p, d)/2^p$ versus p/(d + 1), and on the right, $D_h(p, d)/2^p$ versus
p/(d + 1) for various values of d. Notice that for perceptrons the curve always passes
through 1/2 at p/(d + 1) = 2. For hyper-ridges, the point where the curve passes through 1/2
decreases as d increases.
[Figure 2: two panels, horizontal axis p/(d + 1), curves for perceptron and hyper-ridge]
Figure 2: On the left, capacity comparison for d = 5. There is considerable difference for
small values of d, especially when one considers that the capacities are normalized by $2^p$.
On the right, comparison for d = 20. The difference between the two capacities is much
more subtle now that d is fairly large.
[Figure 3: left panel, horizontal axis p/(d + 1); right panel, horizontal axis p, one curve per value of d]
Figure 3: On the left, capacity comparison for d = 100. For this value of d, the capacities
are visibly indistinguishable. On the right, $D_h(p, d)/D_p(p, d)$ versus p for various values of
d. For small values of d the capacity of a hyper-ridge is much greater than a perceptron.
As d grows, the ratio asymptotically approaches 1.
Theorem 3.4
For p/(d + 1) = 2,
$$\lim_{d \to \infty} \frac{D_h(p, d)}{2^p} = \frac{1}{2}.$$
Proof Taking advantage of the relationship between perceptron dichotomies and hyper-ridge
dichotomies allows us to expand $D_h(p, d)$,
$$\lim_{d \to \infty} \frac{D_h(p, d)}{2^p} = \lim_{d \to \infty} \frac{D_p(p, d)}{2^p} + \lim_{d \to \infty} \frac{1}{2^p} \binom{p-1}{d+1}.$$
By Lemma 3.3, and substituting 2(d + 1) for p, we get:
$$\frac{1}{2} + \lim_{d \to \infty} \frac{1}{2^{2d+2}} \binom{2d+1}{d+1}.$$
Finally, by Corollary 3.2 the right limit vanishes, leaving us with 1/2.
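The statement can be illustrated at finite d. The sketch below evaluates both normalized counts exactly at p = 2(d + 1) for growing d. The compact forms of the two counts are our rewriting and assume p > d + 2, which holds at this value of p.

```python
from fractions import Fraction
from math import comb


def d_perceptron(p, d):
    """Perceptron count for p > d + 2: 2 * sum_{i<=d} C(p-1, i)."""
    return 2 * sum(comb(p - 1, i) for i in range(d + 1))


def d_hyper(p, d):
    """Hyper-ridge count for p > d + 2: perceptron count plus C(p-1, d+1)."""
    return d_perceptron(p, d) + comb(p - 1, d + 1)


def normalized_at_two(d):
    """(D_p / 2^p, D_h / 2^p) evaluated exactly at p = 2(d + 1)."""
    p = 2 * (d + 1)
    return (Fraction(d_perceptron(p, d), 2 ** p),
            Fraction(d_hyper(p, d), 2 ** p))


rows = [(d,) + tuple(float(v) for v in normalized_at_two(d))
        for d in (5, 20, 100, 1000)]
for d, perceptron, hyper in rows:
    print(f"d={d:5d}  perceptron={perceptron:.4f}  hyper-ridge={hyper:.4f}")
```

The perceptron value is exactly 1/2 at every d, while the hyper-ridge value starts near 0.61 for d = 5 and drifts down toward 1/2. The convergence is slow because the extra term decays only like the inverse square root of d.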
Superficially, Theorem 3.4 would seem to indicate that there is no difference between the
representation power of a perceptron and a hyper-ridge network. However, since this result
is only valid in the limit as the number of inputs goes to infinity, it would be interesting to
know the exact relationship between $D_p(p, d)$ and $D_h(p, d)$ for finite values of d.
In the right diagram of Figure 3 values of $D_h(p, d)/D_p(p, d)$ are plotted against various
values of p. The figure is slightly misleading since the ratio appears to be linear in p,
when, in fact, the ratio is only approximately linear in p. If we normalize the ratio by 1/p
and recompute the ratio in the limit as p approaches infinity, the inverse of the normalized
ratio becomes linear in d. Theorem 3.5 establishes this rigorously.
Theorem 3.5
$$\lim_{p \to \infty} \frac{1}{p} \frac{D_h(p, d)}{D_p(p, d)} = \frac{1}{2(d + 1)}.$$
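Proof First, note that we can simplify the left hand side of the expression to
$$\lim_{p \to \infty} \frac{1}{p} \frac{D_h(p, d)}{D_p(p, d)} = \lim_{p \to \infty} \frac{1}{p} \, \frac{D_p(p, d) + \binom{p-1}{d+1}}{D_p(p, d)} = \lim_{p \to \infty} \frac{1}{p} \, \frac{\binom{p-1}{d+1}}{D_p(p, d)}. \tag{9}$$
In the next step, we will invert Equation 9, making it easier to work with. We need to show
that the new expression is equal to 2(d + 1).
$$\begin{aligned}
\lim_{p \to \infty} \frac{p \, D_p(p, d)}{\binom{p-1}{d+1}}
&= \lim_{p \to \infty} 2p \sum_{i=0}^{d} \frac{(p-1)!}{i! \, (p-i-1)!} \, \frac{(d+1)! \, (p-d-2)!}{(p-1)!} \\
&= \lim_{p \to \infty} 2p \sum_{i=0}^{d} \frac{(d+1)! \, (p-d-2)!}{i! \, (p-i-1)!} \\
&= \lim_{p \to \infty} \frac{p}{p-1-d} \, 2(d+1) \sum_{i=0}^{d} \frac{d! \, (p-d-1)!}{i! \, (p-i-1)!} \\
&= \lim_{p \to \infty} 2(d+1) \sum_{i=0}^{d} \frac{d! \, (p-d-1)!}{i! \, (p-i-1)!}.
\end{aligned} \tag{10}$$
In Equation 10, the summation can be reduced to 1 since
$$\lim_{p \to \infty} \frac{d! \, (p-d-1)!}{i! \, (p-i-1)!} = \begin{cases} 0 & \text{when } 0 \le i < d \\ 1 & \text{when } i = d. \end{cases}$$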
Thus, Equation 10 is equal to 2(d + 1), which proves the theorem. 
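The limiting behaviour of the summation can be inspected at finite p. In the sketch below each summand of Equation 10 is written as a ratio of falling factorials (our rearrangement, which avoids building very large factorials), and the last expression of Equation 10 is compared with the inverted ratio it was derived from. The function names are ours.

```python
from fractions import Fraction
from math import comb, perm


def term(p, d, i):
    """Summand d! (p-d-1)! / (i! (p-i-1)!), as (d!/i!) / ((p-i-1)!/(p-d-1)!)."""
    return Fraction(perm(d, d - i), perm(p - i - 1, d - i))


def equation_10(p, d):
    """2(d + 1) times the sum of the summands for i = 0..d, at finite p."""
    return 2 * (d + 1) * sum(term(p, d, i) for i in range(d + 1))


def inverted_ratio(p, d):
    """p * D_p(p, d) / C(p-1, d+1), the quantity whose limit is evaluated."""
    d_p = 2 * sum(comb(p - 1, i) for i in range(d + 1))
    return Fraction(p * d_p, comb(p - 1, d + 1))


D = 4  # example input dimension, chosen only for illustration
values = [(p, float(equation_10(p, D))) for p in (10**2, 10**3, 10**4, 10**5)]
for p, value in values:
    print(f"p={p:6d}  2(d+1)*sum={value:.5f}  target={2 * (D + 1)}")
```

At finite p the inverted ratio equals p/(p - 1 - d) times the value returned by `equation_10`, and both factors settle as p grows, so the printed values approach 2(d + 1) = 10 from above.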
Theorem 3.5 is valid only in the case when p >> d, which is typically true in interesting
classification problems. The result of the theorem gives us a good estimate of how many
more dichotomies are computable with a hyper-ridge network when compared to a simple
perceptron. When p >> d the equation
$$\frac{D_h(p, d)}{D_p(p, d)} \approx \frac{p}{2(d + 1)} \tag{11}$$
is an accurate estimate of the ratio between the capacities of the two architectures.
For example, taking d = 4 and p = 60 and applying the values to Equation 11 yields the
ratio of 6, which should be interpreted as meaning that one could store six times as many
mappings in a hyper-ridge network as one could in a simple perceptron. Moreover,
Equation 11 is in agreement with the right diagram of Figure 3 for all values of p >> d.
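The worked example can be compared with the exact counts. This sketch uses the p > d + 2 forms of the perceptron and hyper-ridge counts from Section 3. The function and variable names are ours.

```python
from math import comb


def d_perceptron(p, d):
    """Perceptron dichotomies for p > d + 2."""
    return 2 * sum(comb(p - 1, i) for i in range(d + 1))


def d_hyper(p, d):
    """Hyper-ridge dichotomies for p > d + 2."""
    return d_perceptron(p, d) + comb(p - 1, d + 1)


d, p = 4, 60
exact = d_hyper(p, d) / d_perceptron(p, d)   # exact ratio of the two counts
estimate = p / (2 * (d + 1))                 # right-hand side of Equation 11

print(f"D_p = {d_perceptron(p, d)}, D_h = {d_hyper(p, d)}")
print(f"exact ratio = {exact:.4f}, estimate = {estimate:.4f}")
```

For these values the perceptron count is 978812 and the hyper-ridge count is 5985198, giving an exact ratio of about 6.11 against the estimate of 6.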
4 Conclusion 
An interesting footnote to this work is that the VC dimension [8] of a hyper-ridge network
is identical to that of a simple perceptron, namely d. However, the real difference between
perceptrons and hyper-ridges is more noticeable in practice, especially when one considers
that linearly inseparable problems are representable by hyper-ridges.
We also know that there is no such thing as a free lunch and that generalization is sure 
to suffer in just the cases when representation power is increased. Yet given all of the 
comparisons between MLPs and radial basis functions (RBFs) we find it encouraging that 
there may be a class of approximators that is a compromise between the local nature of 
RBFs and the global structure of MLPs. 
References 
[1] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities
with applications in pattern recognition. IEEE Transactions on Electronic Computers,
14:326-334, 1965.
[2] M. R. W. Dawson and D. P. Schopflocher. Modifying the generalized delta rule to train
networks of non-monotonic processors for pattern classification. Connection Science,
4(1), 1992.
[3] G. W. Flake. Nonmonotonic Activation Functions in Multilayer Perceptrons. PhD
thesis, University of Maryland, College Park, MD, December 1993.
[4] E. Gardner. Maximum storage capacity in neural networks. Europhysics Letters,
4:481-485, 1987.
[5] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: from
regularization to radial, tensor and additive splines. Technical Report A.I. Memo No.
1430, C.B.C.L. Paper No. 75, MIT AI Laboratory, 1993.
[6] E. Hartman and J. D. Keeler. Predicting the future: Advantages of semilocal units.
Neural Computation, 3:566-578, 1991.
[7] N. J. Nilsson. Learning Machines: Foundations of Trainable Pattern Classifying
Systems. McGraw-Hill, New York, 1965.
[8] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies
of events to their probabilities. Theory of Probability and Its Applications, 16:264-280,
1971.
