CONNECTIVITY VERSUS ENTROPY 
Yaser S. Abu-Mostafa 
California Institute of Technology 
Pasadena, CA 91125 
ABSTRACT 
How does the connectivity of a neural network (number of synapses per 
neuron) relate to the complexity of the problems it can handle (measured by 
the entropy)? Switching theory would suggest no relation at all, since all Boolean 
functions can be implemented using a circuit with very low connectivity (e.g., 
using two-input NAND gates). However, for a network that learns a problem 
from examples using a local learning rule, we prove that the entropy of the 
problem becomes a lower bound for the connectivity of the network. 
INTRODUCTION 
The most distinguishing feature of neural networks is their ability to spon- 
taneously learn the desired function from 'training' samples, i.e., their ability 
to program themselves. Clearly, a given neural network cannot just learn any 
function; there must be some restrictions on which networks can learn which
functions. One obvious restriction, which is independent of the learning aspect, 
is that the network must be big enough to accommodate the circuit complex- 
ity of the function it will eventually simulate. Are there restrictions that arise 
merely from the fact that the network is expected to learn the function, rather 
than being purposely designed for the function? This paper reports a restriction 
of this kind. 
The result imposes a lower bound on the connectivity of the network (num- 
ber of synapses per neuron). This lower bound can only be a consequence of 
the learning aspect, since switching theory provides purposely designed circuits 
of low connectivity (e.g., using only two-input NAND gates) capable of imple- 
menting any Boolean function [1,2]. It also follows that the learning mechanism 
must be restricted for this lower bound to hold; a powerful mechanism can be
designed that will find one of the low-connectivity circuits (perhaps by
exhaustive search), and hence the lower bound on connectivity cannot hold in general.
© American Institute of Physics 1988
Indeed, we restrict the learning mechanism to be local; when a training sample 
is loaded into the network, each neuron has access only to those bits carried by 
itself and the neurons it is directly connected to. This is a strong assumption 
that excludes sophisticated learning mechanisms used in neural-network models, 
but may be more plausible from a biological point of view. 
The lower bound on the connectivity of the network is given in terms of 
the entropy of the environment that provides the training samples. Entropy is a 
quantitative measure of the disorder or randomness in an environment or, equiv- 
alently, the amount of information needed to specify the environment. There 
are many different ways to define entropy, and many technical variations of this 
concept [3]. In the next section, we shall introduce the formal definitions and 
results, but we start here with an informal exposition of the ideas involved. 
The environment in our model produces patterns represented by N bits
$x = x_1 \cdots x_N$ (pixels in the picture of a visual scene if you will). Only h different
patterns can be generated by a given environment, where $h < 2^N$ (the entropy
is essentially $\log_2 h$). No knowledge is assumed about which patterns the
environment is likely to generate, only that there are h of them. In the learning
process, a huge number of sample patterns are generated at random from the 
environment and input to the network, one bit per neuron. The network uses 
this information to set its internal parameters and gradually tune itself to this 
particular environment. Because of the network architecture, each neuron knows 
only its own bit and (at best) the bits of the neurons it is directly connected to 
by a synapse. Hence, the learning rules are local: a neuron does not have the 
benefit of the entire global pattern that is being learned. 
After the learning process has taken place, each neuron is ready to perform 
a function determined by what it has learned. The collective interaction of the
functions of the neurons is what defines the overall function of the network. The 
main result of this paper is that (roughly speaking) if the connectivity of the 
network is less than the entropy of the environment, the network cannot learn 
about the environment. The idea of the proof is to show that if the connectivity 
is small, the final function of each neuron is independent of the environment, 
and hence to conclude that the overall network has accumulated no information 
about the environment it is supposed to learn about. 
FORMAL RESULT 
A neural network is an undirected graph (the vertices are the neurons and the
edges are the synapses). Label the neurons $1, \dots, N$ and define $K_n \subseteq \{1, \dots, N\}$
to be the set of neurons connected by a synapse to neuron n, together with
neuron n itself. An environment is a subset $e \subseteq \{0,1\}^N$ (each $x \in e$ is a sample
from the environment). During learning, $x_1, \dots, x_N$ (the bits of x) are loaded
into the neurons $1, \dots, N$, respectively. Consider an arbitrary neuron n and
relabel everything to make $K_n$ become $\{1, \dots, K\}$, where $K = |K_n|$. Thus the neuron sees the
first K coordinates of each x.
Since our result is asymptotic in N, we will specify K as a function of N;
$K = \alpha N$ where $\alpha = \alpha(N)$ satisfies $\lim_{N \to \infty} \alpha(N) = \alpha_0$ ($0 < \alpha_0 < 1$). Since the
result is also statistical, we will consider the ensemble of environments
\[ \mathcal{E} = \{\, e \subseteq \{0,1\}^N \;:\; |e| = h \,\} \]
where $h = 2^{\beta N}$ and $\beta = \beta(N)$ satisfies $\lim_{N \to \infty} \beta(N) = \beta_0$ ($0 < \beta_0 < 1$). The
probability distribution on $\mathcal{E}$ is uniform; any environment $e \in \mathcal{E}$ is as likely
to occur as any other.
The neuron sees only he firs K coordinates of each x generated by he 
environment e. For each e, we define the function : {0,1}   {0,1,2,--.} 
where 
n(a...a) =l{x6e [ z,=a, fork=l,--.,K}l 
and the normalized version 
The function v describes the relative frequency of occurrence for each of the 2 r 
binary vectors a:l -" zr as x = zl '" z Jr runs through all h vectors in e. In other 
words, /specifies the projection of e as seen by the neuron. Clearly, v(a) _> 0 
for all a  {0, 1} r and Z&E{O,1}K v(a) = 1. 
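The definitions of n and its normalized version can be tried out on a toy environment. The following Python sketch is an illustration only and is not part of the original paper: the helper names `projection_counts` and `normalized`, the representation of patterns as bit strings, and the four-pattern environment are all assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction

def projection_counts(env, K):
    """Return n: for each K-bit prefix a, the number of x in env starting with a."""
    return Counter(x[:K] for x in env)

def normalized(counts, h):
    """Return nu = n / h as exact fractions (relative frequencies)."""
    return {a: Fraction(c, h) for a, c in counts.items()}

# Hypothetical toy environment with N = 4 and h = 4.
# The neuron is assumed to see only the first K = 2 bits of each pattern.
env = {"0000", "0011", "0110", "1101"}
n = projection_counts(env, 2)
nu = normalized(n, len(env))

print(dict(n))           # counts per observed 2-bit prefix
print(sum(nu.values()))  # the relative frequencies sum to 1
```

Prefixes that never occur (here `"10"`) are simply absent from the dictionaries, which corresponds to n(a) = 0.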
Corresponding to two environments $e_1$ and $e_2$, we will have two functions $\nu_1$
and $\nu_2$. If $\nu_1$ is not distinguishable from $\nu_2$, the neuron cannot tell the difference
between $e_1$ and $e_2$. The distinguishability between $\nu_1$ and $\nu_2$ can be measured
by
\[ d(\nu_1, \nu_2) = \frac{1}{2} \sum_{a \in \{0,1\}^K} |\nu_1(a) - \nu_2(a)| \]
The range of $d(\nu_1, \nu_2)$ is $0 \le d(\nu_1, \nu_2) \le 1$, where '0' corresponds to complete
indistinguishability while '1' corresponds to maximum distinguishability. We
are now in a position to state the main result.
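As a small worked instance of the distinguishability measure, the sketch below computes d for hand-picked environments. It is an illustration under assumptions: the function names `nu` and `distance` and the three toy environments are hypothetical and do not come from the paper.

```python
from collections import Counter
from fractions import Fraction

def nu(env, K):
    """Relative frequency of each K-bit prefix among the patterns of env."""
    h = len(env)
    return {a: Fraction(c, h) for a, c in Counter(x[:K] for x in env).items()}

def distance(nu1, nu2):
    """d(nu1, nu2) = 1/2 * sum over a of |nu1(a) - nu2(a)|."""
    keys = set(nu1) | set(nu2)
    return Fraction(1, 2) * sum(abs(nu1.get(a, 0) - nu2.get(a, 0)) for a in keys)

# Hypothetical environments with N = 4, h = 2; the neuron sees K = 2 bits.
e1 = {"0000", "0011"}
e2 = {"1100", "1111"}   # disjoint prefixes from e1
e3 = {"0001", "0010"}   # different patterns, same prefixes as e1

print(distance(nu(e1, 2), nu(e2, 2)))  # maximally distinguishable
print(distance(nu(e1, 2), nu(e3, 2)))  # indistinguishable to this neuron
```

The pair (e1, e3) shows that two different environments can look identical to a neuron that sees only K of the N bits.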
Let e and es be independently selected environments from  according to the 
uniform probability distribution. d(vl, v) is now a random variable, and we are 
interested in the expected value E(d(vl,v2)). The case where E(d(vl,v2)) -- 0 
corresponds to the neuron getting no information about the environment, while 
the case where E(d(Vl,V2)) = I corresponds to the neuron getting maximum 
information. The theorem predicts, in the limit, one of these extremes depending 
on how the connectivity (ao) compares to the entropy (/o). 
Theorem.
1. If $\alpha_0 > \beta_0$, then $\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 1$.
2. If $\alpha_0 < \beta_0$, then $\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 0$.
The proof is given in the appendix, but the idea is easy to illustrate
informally. Suppose $h = 2^{K+10}$ (corresponding to part 2 of the theorem). For most
environments $e \in \mathcal{E}$, the first K bits of $x \in e$ go through all $2^K$ possible
values approximately $2^{10}$ times each as x goes through all h possible values once.
Therefore, the patterns seen by the neuron are drawn from the fixed ensemble of
all binary vectors of length K with essentially uniform probability distribution,
i.e., $\nu$ is the same for most environments. This means that, statistically, the
neuron will end up doing the same function regardless of the environment at
hand.
What about the opposite case, where $h = 2^{K-10}$ (corresponding to part 1 of
the theorem)? Now, with only $2^{K-10}$ patterns available from the environment,
the first K bits of x can assume at most $2^{K-10}$ values out of the possible $2^K$
values a binary vector of length K can assume in principle. Furthermore, which
values can be assumed depends on the particular environment at hand, i.e.,
$\nu$ does depend on the environment. Therefore, although the neuron still does
not have the global picture, the information it has says something about the
environment.
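The two regimes can also be observed numerically at small sizes. The following Monte Carlo sketch is an assumption-laden illustration and not the paper's method: the sizes (N = 12, with K = 3 or K = 9), the number of trials, the fixed seed, and all function names are hypothetical choices. Patterns are encoded as integers below 2^N, and the K visible bits are taken to be the top K bits.

```python
import random
from collections import Counter

def sample_env(N, h, rng):
    """Draw h distinct N-bit patterns uniformly, without replacement."""
    return rng.sample(range(2 ** N), h)

def nu(env, N, K):
    """Relative frequency of each K-bit prefix (top K bits of each pattern)."""
    h = len(env)
    counts = Counter(x >> (N - K) for x in env)
    return {a: c / h for a, c in counts.items()}

def distance(nu1, nu2):
    """d = 1/2 * sum over prefixes of |nu1 - nu2|."""
    keys = set(nu1) | set(nu2)
    return 0.5 * sum(abs(nu1.get(a, 0.0) - nu2.get(a, 0.0)) for a in keys)

def mean_distance(N, K, h, trials=200, seed=0):
    """Average d over independently drawn pairs of environments."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        e1 = sample_env(N, h, rng)
        e2 = sample_env(N, h, rng)
        total += distance(nu(e1, N, K), nu(e2, N, K))
    return total / trials

# Connectivity below entropy: K = 3 visible bits, h = 2^9 patterns.
low_connectivity = mean_distance(N=12, K=3, h=2 ** 9)
# Connectivity above entropy: K = 9 visible bits, h = 2^3 patterns.
high_connectivity = mean_distance(N=12, K=9, h=2 ** 3)
print(low_connectivity, high_connectivity)
```

At these sizes the first average should come out small and the second close to 1, in line with the informal argument, although a finite-N simulation only suggests the limiting behavior and does not prove it.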
ACKNOWLEDGEMENT 
This work was supported by the Air Force Office of Scientific Research under 
Grant AFOSR-86-0296. 
APPENDIX 
In this appendix we prove the main theorem. We start by discussing some
basic properties about the ensemble of environments $\mathcal{E}$. Since the probability
distribution on $\mathcal{E}$ is uniform and since $|\mathcal{E}| = \binom{2^N}{h}$, we have
\[ \Pr(e) = \binom{2^N}{h}^{-1} \]
which is equivalent to generating e by choosing h elements $x \in \{0,1\}^N$ with
uniform probability (without replacement). It follows that
\[ \Pr(x \in e) = \frac{h}{2^N} \]
while for $x_1 \ne x_2$,
\[ \Pr(x_1 \in e,\, x_2 \in e) = \frac{h}{2^N} \times \frac{h-1}{2^N - 1} \]
and so on.
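These membership probabilities can be checked by brute force when N is tiny. The sketch below enumerates every environment for the hypothetical values N = 3 and h = 3 (chosen only so that the enumeration is small) and compares the exact frequencies with the formulas above; patterns are encoded as the integers 0 to 7.

```python
from fractions import Fraction
from itertools import combinations

N, h = 3, 3                                # assumed toy sizes
universe = range(2 ** N)                   # all N-bit patterns as integers
envs = list(combinations(universe, h))     # every environment of size h

# Fraction of environments containing the pattern 0.
p_single = Fraction(sum(0 in e for e in envs), len(envs))
# Fraction of environments containing both patterns 0 and 1.
p_pair = Fraction(sum(0 in e and 1 in e for e in envs), len(envs))

print(len(envs), p_single, p_pair)
```

By symmetry the same values hold for any other pattern or pair of distinct patterns.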
The functions $n$ and $\nu$ are defined on K-bit vectors. The statistics of $n(a)$
(a random variable for fixed a) is independent of a:
\[ \Pr(n(a_1) = m) = \Pr(n(a_2) = m) \]
which follows from the symmetry with respect to each bit of a. The same holds
for the statistics of $\nu(a)$. The expected value $E(n(a)) = h2^{-K}$ (h objects going
into $2^K$ cells), hence $E(\nu(a)) = 2^{-K}$. We now restate and prove the theorem.
Theorem.
1. If $\alpha_0 > \beta_0$, then $\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 1$.
2. If $\alpha_0 < \beta_0$, then $\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 0$.
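Proof.
We expand $E(d(\nu_1, \nu_2))$ as follows
\[
\begin{aligned}
E(d(\nu_1, \nu_2)) &= E\Big( \frac{1}{2} \sum_{a \in \{0,1\}^K} |\nu_1(a) - \nu_2(a)| \Big) \\
&= \frac{1}{2h} \sum_{a \in \{0,1\}^K} E(|n_1(a) - n_2(a)|) \\
&= \frac{2^K}{2h} E(|n_1 - n_2|)
\end{aligned}
\]
where $n_1$ and $n_2$ denote $n_1(0 \cdots 0)$ and $n_2(0 \cdots 0)$, respectively, and the last step
follows from the fact that the statistics of $n_1(a)$ and $n_2(a)$ is independent of a.
Therefore, to prove the theorem, we evaluate $E(|n_1 - n_2|)$ for large N.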
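1. Assume $\alpha_0 > \beta_0$. Let n denote $n(0 \cdots 0)$, and consider $\Pr(n = 0)$. For n to
be zero, all $2^{N-K}$ strings x of N bits starting with K 0's must not be in the
environment e. Hence
\[ \Pr(n = 0) = \Big(1 - \frac{h}{2^N}\Big)\Big(1 - \frac{h}{2^N - 1}\Big) \cdots \Big(1 - \frac{h}{2^N - 2^{N-K} + 1}\Big) \]
where the first term is the probability that $0 \cdots 00 \notin e$, the second term is the
probability that $0 \cdots 01 \notin e$ given that $0 \cdots 00 \notin e$, and so on.
\[
\begin{aligned}
\Pr(n = 0) &\ge \big(1 - h2^{-N}(1 - 2^{-K})^{-1}\big)^{2^{N-K}} \\
&\ge \big(1 - 2h2^{-N}\big)^{2^{N-K}} \\
&\ge 1 - 2h2^{-N}\,2^{N-K} \\
&= 1 - 2h2^{-K}
\end{aligned}
\]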
Hence, $\Pr(n_1 = 0) = \Pr(n_2 = 0) = \Pr(n = 0) \ge 1 - 2h2^{-K}$. However, $E(n_1) =
E(n_2) = h2^{-K}$. Therefore,
\[
\begin{aligned}
E(|n_1 - n_2|) &= \sum_{i=0}^{h} \sum_{j=0}^{h} \Pr(n_1 = i,\, n_2 = j)\,|i - j| \\
&= \sum_{i=0}^{h} \sum_{j=0}^{h} \Pr(n_1 = i)\Pr(n_2 = j)\,|i - j| \\
&\ge \sum_{j=0}^{h} \Pr(n_1 = 0)\Pr(n_2 = j)\,j + \sum_{i=0}^{h} \Pr(n_1 = i)\Pr(n_2 = 0)\,i
\end{aligned}
\]
which follows by throwing away all the terms where neither i nor j is zero (the
term where both i and j are zero appears twice for convenience, but this term is
zero anyway). Continuing,
\[
\begin{aligned}
E(|n_1 - n_2|) &\ge \Pr(n_1 = 0)E(n_2) + \Pr(n_2 = 0)E(n_1) \\
&\ge 2(1 - 2h2^{-K})\,h2^{-K}
\end{aligned}
\]
Substituting this estimate in the expression for $E(d(\nu_1, \nu_2))$, we get
\[
\begin{aligned}
E(d(\nu_1, \nu_2)) &= \frac{2^K}{2h} E(|n_1 - n_2|) \\
&\ge \frac{2^K}{2h} \times 2(1 - 2h2^{-K})\,h2^{-K} \\
&= 1 - 2h2^{-K} \\
&= 1 - 2 \times 2^{(\beta - \alpha)N}
\end{aligned}
\]
Since $\alpha_0 > \beta_0$ by assumption, this lower bound goes to 1 as N goes to infinity.
Since 1 is also an upper bound for $d(\nu_1, \nu_2)$ (and hence an upper bound for the
expected value $E(d(\nu_1, \nu_2))$), $\lim_{N \to \infty} E(d(\nu_1, \nu_2))$ must be 1.
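The estimate of Pr(n = 0) used in part 1 can be checked exactly for small parameters. The sketch below is an illustration under assumed values (N = 12, K = 9, h = 8, all hypothetical): it evaluates the product of conditional probabilities with exact rational arithmetic, confirms that it agrees with the equivalent ratio of binomial coefficients, and compares it with the bound 1 - 2h2^{-K}. The function names are not from the paper.

```python
from fractions import Fraction
from math import comb

def pr_n_zero(N, K, h):
    """Exact Pr(n = 0): none of the 2^(N-K) strings starting with K zeros is in e."""
    p = Fraction(1)
    for j in range(2 ** (N - K)):
        # Probability that the next such string is excluded, given the
        # previous j strings were excluded.
        p *= 1 - Fraction(h, 2 ** N - j)
    return p

def lower_bound(K, h):
    """The bound 1 - 2h 2^-K, which also bounds E(d) from below in part 1."""
    return 1 - Fraction(2 * h, 2 ** K)

N, K, h = 12, 9, 8   # assumed toy sizes with K well above log2(h)
p = pr_n_zero(N, K, h)
closed_form = Fraction(comb(2 ** N - 2 ** (N - K), h), comb(2 ** N, h))

print(float(p), float(closed_form), float(lower_bound(K, h)))
```

The closed form counts the environments that avoid all 2^{N-K} strings with the given prefix, so it must coincide with the product.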
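2. Assume $\alpha_0 < \beta_0$. Consider
\[
\begin{aligned}
E(|n_1 - n_2|) &= E(|(n_1 - h2^{-K}) - (n_2 - h2^{-K})|) \\
&\le E(|n_1 - h2^{-K}|) + E(|n_2 - h2^{-K}|) \\
&= 2E(|n - h2^{-K}|)
\end{aligned}
\]
To evaluate $E(|n - h2^{-K}|)$, we estimate the variance of n and use the fact
that $E(|n - h2^{-K}|) \le \sqrt{\mathrm{var}(n)}$ (recall that $h2^{-K} = E(n)$). Since $\mathrm{var}(n) =
E(n^2) - (E(n))^2$, we need an estimate for $E(n^2)$. We write $n = \sum_{a \in \{0,1\}^{N-K}} \delta_a$,
where
\[ \delta_a = \begin{cases} 1, & \text{if } 0 \cdots 0a \in e; \\ 0, & \text{otherwise.} \end{cases} \]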
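In this notation, $E(n^2)$ can be written as
\[
\begin{aligned}
E(n^2) &= E\Big( \sum_{a \in \{0,1\}^{N-K}} \sum_{b \in \{0,1\}^{N-K}} \delta_a \delta_b \Big) \\
&= \sum_{a \in \{0,1\}^{N-K}} \sum_{b \in \{0,1\}^{N-K}} E(\delta_a \delta_b)
\end{aligned}
\]
For the 'diagonal' terms (a = b),
\[ E(\delta_a \delta_a) = \Pr(\delta_a = 1) = h2^{-N} \]
There are $2^{N-K}$ such diagonal terms, hence a total contribution of $2^{N-K} \times
h2^{-N} = h2^{-K}$ to the sum. For the 'off-diagonal' terms ($a \ne b$),
\[
\begin{aligned}
E(\delta_a \delta_b) &= \Pr(\delta_a = 1,\, \delta_b = 1) \\
&= \Pr(\delta_a = 1)\Pr(\delta_b = 1 \mid \delta_a = 1) \\
&= \frac{h}{2^N} \times \frac{h-1}{2^N - 1}
\end{aligned}
\]
There are $2^{N-K}(2^{N-K} - 1)$ such off-diagonal terms, hence a total contribution of
\[ 2^{N-K}(2^{N-K} - 1) \times \frac{h(h-1)}{2^N(2^N - 1)} < (h2^{-K})^2\, \frac{2^N}{2^N - 1} \]
to the sum. Putting the contributions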
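from the diagonal and off-diagonal terms together, we get
\[
\begin{aligned}
\mathrm{var}(n) &= E(n^2) - (E(n))^2 \\
&< h2^{-K} + (h2^{-K})^2 \Big( \frac{2^N}{2^N - 1} - 1 \Big) \\
&= h2^{-K} + (h2^{-K})^2\, \frac{1}{2^N - 1} \\
&= h2^{-K} \Big( 1 + \frac{h2^{-K}}{2^N - 1} \Big) \\
&< 2h2^{-K}
\end{aligned}
\]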
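The last step follows since $h2^{-K}$ is much smaller than $2^N - 1$. Therefore,
$E(|n - h2^{-K}|) \le \sqrt{\mathrm{var}(n)} < (2h2^{-K})^{1/2}$. Substituting this estimate in the
expression for $E(d(\nu_1, \nu_2))$, we get
\[
\begin{aligned}
E(d(\nu_1, \nu_2)) &= \frac{2^K}{2h} E(|n_1 - n_2|) \\
&\le \frac{2^K}{2h} \times 2E(|n - h2^{-K}|) \\
&< \frac{2^K}{h} \times (2h2^{-K})^{1/2} \\
&= \sqrt{2} \times 2^{(\alpha - \beta)N/2}
\end{aligned}
\]
Since $\alpha_0 < \beta_0$ by assumption, this upper bound goes to 0 as N goes to infinity.
Since 0 is also a lower bound for $d(\nu_1, \nu_2)$ (and hence a lower bound for the
expected value $E(d(\nu_1, \nu_2))$), $\lim_{N \to \infty} E(d(\nu_1, \nu_2))$ must be 0. ∎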
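The variance estimate in part 2 can likewise be evaluated exactly for small parameters. The sketch below is an illustration under assumed values (N = 12, K = 3, h = 2^9, all hypothetical): it sums the diagonal and off-diagonal contributions with exact rational arithmetic, without applying the inequality used in the proof, and compares the result with the bound 2h2^{-K}. The function names are not from the paper.

```python
from fractions import Fraction
from math import sqrt

def variance_n(N, K, h):
    """Exact var(n) from the diagonal and off-diagonal terms of E(n^2)."""
    T, M = 2 ** N, 2 ** (N - K)          # all patterns; patterns with prefix 0...0
    mean = Fraction(h * M, T)            # E(n) = h 2^-K
    diagonal = M * Fraction(h, T)        # M terms, each equal to h 2^-N
    off_diagonal = M * (M - 1) * Fraction(h * (h - 1), T * (T - 1))
    return diagonal + off_diagonal - mean ** 2

def upper_bound_on_expected_d(K, h):
    """The bound (2^K / h) * sqrt(2h 2^-K) = sqrt(2 * 2^K / h) on E(d)."""
    return sqrt(2 * 2 ** K / h)

N, K, h = 12, 3, 2 ** 9   # assumed toy sizes with K well below log2(h)
var = variance_n(N, K, h)

print(float(var), 2 * h / 2 ** K)        # exact variance versus the bound 2h 2^-K
print(upper_bound_on_expected_d(K, h))   # resulting bound on E(d)
```

Since n counts how many of h draws without replacement land among the M prefix-matching patterns, the exact value coincides with the variance of a hypergeometric distribution.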
REFERENCES 
[1] Y. Abu-Mostafa, "Neural networks for computing?," AIP Conference
Proceedings 151, Neural Networks for Computing, J. Denker (ed.), pp. 1-6, 1986.
[2] Z. Kohavi, Switching and Finite Automata Theory, McGraw-Hill, 1978.
[3] Y. Abu-Mostafa, "The complexity of information extraction," IEEE Trans.
on Information Theory, vol. IT-32, pp. 513-525, July 1986.
[4] Y. Abu-Mostafa, "Complexity in neural systems," in Analog VLSI and Neural
Systems by C. Mead, Addison-Wesley, 1988.
