412 
CAPACITY FOR PATTERNS AND SEQUENCES IN KANERVA'S SDM 
AS COMPARED TO OTHER ASSOCIATIVE MEMORY MODELS 
James D. Keeler 
Chemistry Department, Stanford University, Stanford, CA 94305 
and RIACS, NASA-AMES 230-5 Moffett Field, CA 94035. 
e.rnail: jdk hydra.riacs. edu 
ABSTRACT 
The information capacity of Kanerva's Sparse, Distributed Memory (SDM) and Hopfield-type 
neural networks is investigated. Under the approximations used here, it is shown that the to- 
tal information stored in these systems is proportional to the number connections in the net- 
work. The proportionality constant is the same for the SDM and Hopfield-type models in- 
dependent of the particular model, or the order of the model. The approximations are 
checked numerically. This same analysis can be used to show that the SDM can store se- 
quences of spatiotemporal patterns, and the addition of time-delayed connections allows the 
retrieval of context dependent temporal patterns. A minor modification of the SDM can be 
used to store correlated patterns. 
INTRODUCTION 
Many different models of memory and thought have been proposed by scientists over the 
years. In (1943) McCulloch and Pitts proposed a simple model neuron with two states of activity 
(on and off) and a large number of inputs.  Hebb (1949) considered a network of such neurons and 
postulated mechanisms for changing synaptic strengths 2 to learn memories. The learning rule 
considered here uses the outer-product of patterns of +Is and -Is. Anderson (1977) discussed the 
effect. of iterative feedback in such a system) Hopfield (1982) showed that for symmetric connec- 
tions, ' the dynamics of such a network is governed by an energy function that is analogous to the 
energy function of a spin glass? Numerous investigations have been carried out on similar 
modelsfi --a 
Several limitations of these binary interaction, outer-product models have been pointed out. 
For example, the number of patterns that can be stored in the system (its capacity) is limited to a 
fraction of the length of the pattern vectors. Also, these models are not very successful at storing 
correlated patterns or temporal sequences. 
Other models have been proposed to overcome these limitations. For example, one can 
allow higher-order interactions among the neurons. 9't In the following, I focus on a model 
developed by Kanerva (1984) called the Spame, Distributed Memory (SDM) model.  The SDM 
can be viewed as a three layer network that uses an outer-product learning between the second and 
third layer. As discussed below, the SDM is more versatile than the above mentioned networks 
because the number of stored patterns can increased independent of the length of the pattern, and 
the SDM can be used to store spatiotemporal patterns with context retrieval, and store correlated 
patterns. 
The capacity limitations of outer-product models can be alleviated by using higher-order 
interaction models or the SDM, but a price must be paid for this added capacity in terms of an 
increase in the number of connections. How much information is gained per connection? It is 
shown in the following that the total information stored in each system is proportional to the 
number of connections in the network, and that the proportionality constant is independent of the 
particular model or the order of the model. This result also holds if the connections are limited to 
one bit of precision (clipped weights). The analysis presented here requires certain simplifying 
assumptions. The approximate results are compared numerically to an exact calculation developed 
by Chou.2 
SIMPLE OUTER-PRODUCT NEURAL NETWORK MODEL 
As an example or a simple first-order neural network Model, I consider in detail the model 
developed by Hopfieldfi This model will be used to introduce the mathematics and the concepts 
that will be generalized for the analysis of the $DM. The "neurons" are simple two-state 
American Institute of Physics 1988 
413 
threshold devices: The state of the i 'h neuron, ui, is either either +1 (on), or -1 (off). Consider a 
set of n such neurons with net input (local field), hi, to the i 'h neuron given by 
hi = Tii uj, (1) 
where T././ represents the interaction suength between the i ' neuron and the j''. The state of each 
neuron is updated asynchronously (at random) according to the rule 
Ui ('--g (hi), (2) 
where the function g is a simple threshold function g (x) = sign (x). 
Suppose we are given M randomly chosen patterns (strings of length n of :iris) which we 
wish to store in this system. Denote these M memory patterns as pattern vectors: 
p,X = (p  ,pX ..... pp), a -- 1,2,3 ..... M. For example, p might look like 
(+1,-1,+1,-I,-1 ..... +1). One method of storing these patterns is the outer-product (Hebbian) learn- 
ing rule: Start with T-0, and accumulate the outer-products of the pattern vectors. The resulting 
connection matrix is given by 
M 
r:j = rii = 0. (3) 
a=l 
The system described above is a dynamical system with attracting fixed points. To obtain 
an approximate upper bound on the total information stored in this network, we sidestep the issue 
of the basins of attraction, and we check to see if each of the pattsms stored by Eq. (3) is actually 
a fixed point of (2). Suppose we are given one of the patterns, p, say, as the initial configuration 
of the neurons. I will show that p is expected to be a fixed point of Eq. (2). After inserting (3) 
for T into (1), the net input to the i"' neuron becomes 
h: = f,pix[  p?p?]. (4) 
a= j 
The important term in the sum on a is the one for which a = [3. This term represents the "sig- 
nal" between the input p and the desired output. The rest of the sum represents "noise" result- 
ing from crosstalk with all of the other stored patterns. The expression for the net input becomes 
h i = signall + noisei where 
signali = p,.[' p? p?], (5) 
J 
noisei = ' plaX[ ' p? PPl. (6) 
Summing on all of the j} in (6) yields signall = (n-1)pi . Since n is positive, the sign of 
the signal term and pi It will be the same. Thus, ff the noise term were exacfiy zero, the signal 
would give the same sign as : with a magnitude of = n a, and p would be a fixed point of (2). 
Moreover, patterns close to pVwould give nearly the same signal, so that p shotfid be an attract- 
ing fixed point. 
For randomly chosen patterns, <noise > = 0, where < > indicates statistical expectation, and 
its variance will be o 2 = (n-1)a(M-1). The probability that there will be an error on recall ofpi  
is given by the probability that the noise is greater than the signal. For n large, the noise distribu- 
tion is approximately gaussian, and the probability that there is an error in the i t' bit is 
1 
P' = 2do I e-2r:edx' (7) 
I signal I 
INFORMATION CAPACITY 
The number of paRems that can be stored in the network is known as its capacity. 3'v* How- 
ever, for a fair comparison between all of the models discussed here, it is more relevant to com- 
pare the total number of bits (total information) stored in each model rather than the number of 
414 
patterm. This allows comparison of information storage in models with different lengths of the 
pattern vectors. If we view the memory model as a black box which receives input bit strings and 
outputs them with some small probability of error in each bit, then the definition of bit-capacity 
used here is exactly the definition of channel capacity used by Shannon? 
Define the bit-capacity as the number of bits that can be stored in a network with fixed pro- 
bability of getting an error in a recalled bit, i.e. Pe = constant in (10). Explicitly, the bit-capacity 
is given by TM 
B = bit capacity = nMq, (8) 
where q = (1 + pelog2p, + (1-p,)log2(1-p,)). Note that rl=l for p,=0. Setting p, to a constant is 
tantamount to keeping the signal-to-noise ratio (fidelity) constant, where the fidelity, R, is given by 
R = Isignal I/. Explicitly, the relation between (constant) pe and R, is just R = -l(1 -p,), 
where 
R 
I(R ) = (1/2Jr) 'A I e-t2/2dt' (9) 
Hence, the bit-capacity of these networks can be investigated by examining the fidelity of the 
models as a function of n, M, and R. From (8) and (9) the fidelity of the Hopfield model is is 
R 2 = n/(n(M-1))  (n>l). Solving for M in terms of (fixed) R and q, the bit-capacity becomes 
B = q[(n 2/R 2)+n ]. 
The results above can be generalized to models with d '' order interactions? 't8 The resulting 
expression for the bit-capacity for d '' order interaction models is just 
It d+l 
B = (10) 
Hence, we see that the number of bits stored in the system increases with the order d. However, 
to store these bits, one must pay a price by including more connections in the connection tensor. 
To demonstrate the relationship between the number of connections and the information stored, 
define the information capacity, 7, to be the total information stored in the network divided by the 
number of bits in the connection tensor (note that thi.q is different than the definition used by Abu- 
Mostafa et al.)? Thus  is just the bit-capacity divided by the number of bits in the tensor T, 
and represents the efficiency with which information is stored in the network. Since T has n 
elements, the information capacity is found to be 
rl (11) 
= R2b , 
where b is the number of bits of precision used per tensor element (b > 10g2M for no clipping of 
the weights). For large n, the information stored per neuronal coimection is . q/R 2b, indepen- 
dent of the order of the model (compare this result to that of Peretto, et al.). ' To illustrate this 
point, suppose one decides that the maximum allowed probability of getting an error in a recalled 
bit is p, = 1/1000, then this would fix the !!ainirnum value of R at 3.1. Thus, to store 10,000 bits 
with a probability of getting an error of a recalled bit of 0.001, equation (15) states that it would 
take --'96,000b bits, independent of the order of the model, or =0.1n patterm can be stored with 
probability 1/1000 of getting an error in a recalled bit. 
KANERVA'S SDM 
Now we focus our attention on Kanerva's Sparse, Distributed Memory model (SDM). n The 
SDM can be viewed as a 3-layer network with the middle layer playing the role of hidden units. 
To get an autoassociative network, the output layer can be fed back into the input layer, effectively 
making thi.q a two layer network. The first layer of the SDM is a layer of n, :kl input units (the 
input address, a), the middle layer is a layer of m, hidden units, s, and the third layer consists of 
the n +1 output units (the data, d). The connections between the input units and the hidden units 
are random weights of :t:1 and are given by the rn xn matrix A. The connections between the hid- 
den units and the output units are given by the n xm connection matfix C, and these matrix ele- 
ments are modified by an outer-product learning rule (C ill analogous to the matrix T of the 
Hopfield model). 
415 
Given an input pattern a, the hidden unit activations are determined by 
s = 0,(Aa), 
(12) 
where 0, is the Hamming-distance threshold function: The k h element is 1 if the input a is at 
most r Hamming units away from the k"' row in A, and 0 if it is further than r units away, i.e., 
l if A(n-xi) <-0 
0(x); = 0 if A(n-xi)>r. (13) 
The hidden-units vector, or select vector, s, is mostly Os with an average of 15m Is, where 15 is 
some small number dependent on r; 15n:1. Hence, s represents a large, sparsely coded vector of Os 
and ls representing the input address. The net input, h, to the final layer can be simply expressed 
as the product of C with s: 
h = C s. (14) 
F'mally, the output data is given by d = g(h), where gi (hi) = sign (hi). 
To store the M patterns, pt,p2,... pt, form the outer-product of these pattern vectors and 
their correspooding select vectors, 
M 
C = p%=r, (15) 
where T denotes the transpose of the vector, and where each select vector is formed by the 
corresponding address, s'= 0(Ap'). The storage algorithm (15) is an outer-product learning rule 
similar to (3). 
Suppose that the M patterns (p,p2,... pSt) have been stored according to (15). Follow, ing 
the analysis presented for the Hopfield model, I show that if the system is presented with pP as 
input, the output will be p, (i.e. p is a fixed point). Setting a = p in (16) and separating terms 
as before, the net input (18) becomes 
M 
h = dl(s-s ) +  p=(s=-s). (16) 
where the first term represents the signal and the second is the noise. Recall that the select vectors 
have an average of m ls and the remainder Os, so that the expected value of the signal is &n s 0. 
Assuming that the addresses and data are randomly chosen, the expected value of the noise 
is zero. To evaluate the fidelity, I make certain approximations. First, I assume that the select vec- 
tors are independent of each other. Second, I assume that the variance of the signal alone is zero 
or small compared to the variance of noise term alone. The first assumption will be valid for 
m152<I, and the second assumption will be valid for MS>l. With these assumptions, we can 
easily calculate the variance of the noise term, because each of the select vectors are i.i.d. vectors 
of length rn with mostly Os and =m Is. With these assumptions, the fidelity is given by 
R 2 = (17) 
[(M-l)(l+152m (1-1/m))]' 
In the limit of large m, with 15m = constant, the number of stored bits scales as 
mn 
B = q[R2(l+152 m + n]. (18) 
) 
If we divide this by the number of elements in C, we find the information capacity,  = q/R 2b, 
just as before, so the information capacity is the same for the two models. (If we divide the bit 
capacity by the number of elements in C and A then we get =q/Re(b+l), which is about the 
same for large M.) 
A few comments before we continue. First, it should be pointed out that the assumption 
made by Kanerva t and Keeler TM that the variance of the signal term is much less than that of 
the noise is not valid over the entire range. If we took this in.to account, then the magnitude of the 
denominator would be increased by the variance of the signal tenn. Further, if we read at a dis- 
tance I away from the write address, then it is easy to see that the signal changes to be m S(1), 
where (1) the overlap of two spheres of radius r length I apart in the binomial space n 
416 
(8 -- (0) ). The fidelity for reading at a distance I away from the write address is 
R: = m282(1) 
m (l )(1---(l )) + (M-1)m 82+(M-1 'm 2(1-1/m )' 
(19) 
Compare this to the formula derived by Chou, t2 for the exact signal-to-noise ratio: 
R z = m28:(1) 
m (l )(1-(1 )) + (M-1)m gn +(M-1)nm2( 1-1/m ))' 
(20) 
where g r is e average overlap of the spheres of radius r bmomially distributed with parameters 
(n,1/2) '-'arid o  is the square of this overlap. The difference in these two formulas lies in the 
denominator in the terms 82 verses g. and 8 ' rs. c.. The difference comes from the fact that 
Chou correctly calculates the overlap of the spheres without using the independence assumption. 
How do these formula's differ? First of all, it is found numerically that 82 is identical with 
I1 .. Hence, the only difference comes from 8 ' verses c.. For m 8: < 1, the 8 ' term is negligi- 
ble compared to the other terms in the denominator. In addition, 8 a and c 2 are approximately 
equal for large n and r=n/2. Hence, in the limit n-oo the two formulas agree over most of the 
range ifM=O. lm, m:2  . However, for finite n, the two formulas can disagree when m8Z=l (see 
Figure 1). 
Signal-to-Noise 
2O 
Ratios 
zq. 
zq. 
Zq. (ao) 
20 40 60 
Hamming Radius 
Figure 1: A comparison of the fidelity calculations of the SDM for typical n, M, andre 
values. Equation (17) was derived assuming no variance of the signal term, and is shown 
by the + line. Equation (19) uses the approximation that all of the select vectors are 
indep.e.ndent denoted by the o line. Equation (20) (*'s) is the exact derivation done by 
Chou t2. The values used here were n = 150, m = 2000, M = 100. 
417 
Equation (20) suggests that flrl i is a best read-write Hamming radius for the SDM. By set- 
ting I = 0 in (19) and by setting ..d5 = 0, we get an approximate expression for the best Ham- 
ming radius: 8 ={2Mrr -1/3. This trend is qualitatively shown in Figure 2. 
49.0 
dLsLomce 
52. 
SS. g 
Figure 2: Numerical investigation of the capacity of the SDM. The vertical axis is the per- 
cent of recovered patterns with no errors. The x-axis (left to right) is the Hamming dis- 
tance used for reading and writing. The y-axis (back to forwar. d) is the number of patterns 
that were written into the memory. For this investigation, n = 128, rn = 1024, and M 
ranges from 1 to 501. Note the similarity of a cross-section of this graph at constant M 
with Figure 1. This calculation was performed by David Colan at RIACS, NASA-Ames. 
Figure 1 indicates that the formula (17) that neglected the variance of the signal term is 
incorrect over much of the range. However, a variant of the SDM is to constrain the number of 
selected locations to be constant; circuitry for doing this is easily built. n The variance of the sig- 
nal term would be zero in that case, and the approximate expression for the fidelity is given by Eq. 
(17). There are certain problems where it would be better to keep 8 = constant, as in the case of 
correlated patterns (see below). 
The above analysis was done assuming that the elements (weights) in the outer-product 
matrix are not clipped i.e. that there are enough bits to store the largest value of any matrix ele- 
menL It is interesting to consider what happens if we allow these values to be represented by only 
a few bits. If we consider the case case b = 1, i.e. the weights are clipped at one bit, it is easy 
to show t7 that y2q/.R z for the d 'h order models and for thd SDM, which yields  = 0.07 for rea- 
sonable R, (this is substantially less than Willshaw's 0.69). 
418 
SEQUENCES 
In an autoassociative memory, the system relaxes to one of the stored patterns and stays 
fixed in time until a new input is presented. However, there are many problems where the recalled 
panems must change sequenfially in time. For example, a song can be remembered as a string of 
notes played in the correct sequence; cyclic panems of muscle contractions axe essential for walk- 
ing, riding a bicycle, or dribbling a basketball. As a first step we consider the very simplisfic 
sequence production as put forth by Hopfield (1982) and Kanerva (1984). 
Suppose that we wished to store a sequence of panems in the SDM. Let the pattern vectors 
be given by (pl,p2 ..... pSt). This sequence of patterns could be stored by having each pattem 
point to the next pattern in the sequence. Thus, for the SDM, the panems would be stored as 
input-output pairs (a",d'z), where a ' = p' and d'x= pi for tz = 1,2,3,...34-1. Convergence to 
this sequence works as follows: If the SDM is presented with an address that is close to p the 
read data will be close to p2. Iterating the system with p2 as the new input address, the read data 
will be even closer to pa. As this itemfive process continues, the read data will converge to the 
stored sequence, with the next pattern in the sequence being presented at each time step. 
The convergence statisfics are essenfially the same for sequential patterns as that shown 
above for autoassociative panems. Presented with p,X as an input address, the signal for the stored 
sequence is found as before 
<signal> = &n p=+. (21) 
Thus, given p", the read data is expected to be p,X+. Assuming that the patterns in the sequence 
are randomly chosen, the mean value of the noise is zero, with variance 
<oa> = (M-1)5'-rn (l+52(rn -1)). (22) 
Hence, the length of a sequence t_hat can be stored in the SDM increases linearly with m for large 
/9l. 
Attempting to store sequences like this in the Hopfield model is not very successful due to 
the asynchronous updating use in the Hopfield model. A synchronously updated outer-product 
model (for example [6]) would work just as described for the SDM, but it would still be limited to 
storing fraction of the word size as the maximum sequence length. 
Another method for storing sequences in Hop. field-like networks has been proposed indepen- 
dently by Kleinfeld  and Sompolinsky and Kanter. 23 These models relieve the problem created by 
asynchronous updating by using a time-delayed sequential term. This time-delay storage algorithm 
has different dynamics than the synchronous SDM model. In the time-delay algorithm, the system 
allows time for the units to relax to the first pattern before proceeding on to the next pattern, 
whereas in the synchronous algorithms, the sequence is recalled imprecisely from imprecise input 
for the first few iterations and then correctly after that. In other words, convergence to the 
sequence takes place "on the fly" in the synchronous models -- the system does not wait to zero 
in on the fast pattern before proceeding on to recover the following panems. This allows the syn- 
chronous algorithms to proceed k times as fast as the asynchronous time-delay algorithms with 
half as many (variable) matrix elements. This difference should be able to be detected in biological 
systems. 
TIME DELAYS AND HYSTERESIS: FOLDS 
The above scenario for storing sequences is inadequate to explain speech recognifion or pat- 
tern generation. For example, the above algorithm cannot store sequences of the form ABAC, or 
overlapping sequences. In Kanerva's original work, he included the concept of time delays as a 
general way of storing sequences with hysteresis. The problem addressed by this is the following: 
Suppose we wish to store two sequences of panems that overlap. For example, the two pattern 
sequences (a,b,c,d,e,f,...) and (x,yz,d,w,v,...) overlap at the panera d. If the system only has 
knowledge of the present state, then when given the input d, it cannot decide whether to output w 
or e. To store two such sequences, the system must have some knowledge of the immediate past. 
Kanerva incorporates this idea into the SDM by using "folds." A system with F+I folds has a 
time history of F past states. These F states may be over the past F time steps or they may go 
even further back in time, skipping some time steps. The algorithm for reading from the SDM with 
folds becomes 
d(t+l) = g(C's(t) + C t's(t-x) +    + C F's(t--xt:)), (23) 
419 
where s(t---= 0(Aa(t---l)). 
I 2 .. 2x t...t 2, 
(P/,P ..... m  .... ,v ... 
C  +t  
=wEEp , 
M 1 
To store the Q pattem sequences (p,p ..... Pt ), 
,p{2e), construct the matrix of the [yh fold as follows: 
(24) 
= 0(Ap ), and w is a 
where any vector with a superscript less than 1 is taken to be zero, 
weighting factor that would normally decrease with increasing 13. 
Why do these folds work? Suppose that the system is presented with the pattern sequence 
1 2 114'I ' h 
(Pt,Pt ..... p ), wth eac pattern presented sequentially as input until the zl, time step. For 
simplicity, assme that wl = 1 for all 13. Each term in Eq. (39) will conuibute a signal similar to 
the signal for the single-f61d system. Thus, on the z 'h time step, the signal term coming firore Eq. 
(39) is <signal(t+l)> =Frnp +t. The signal will have this value until the end of the pattem 
sequence is reached. The mean of the noise terms is zero, with variance 
<noise2> = F(M-1)82m(l+82(m-1)). Hence, the signal-to-noise ratio is 4' times as strong as it 
is for the SDM without folds. 
Suppose further that the second stored pattern sequence happens to match the first stored 
sequence at t = '. The signal term would then be 
signal(t+l) = FiSrn p-t + &np-t. (25) 
With no history of the past (F = 1) the signal is split between p[+t and p+t, and the output is 
ambiguous. However, for F>I, the signal for the first pattern sequence dominates and allows 
retrieval of the remainder of the correct sequence. This formulation allows context to aid in the 
retrieval of stored sequences, and can differentiate between overlapping sequences by using time 
delays. 
The above formulation is still too simplistic in terms of being able to do real recognition 
problems such as speech recognition. First, the above algorithm can only recall sequences at a 
fixed time rate, whereas speech recognition occurs at widely varying rates. Second, the above 
algorithm does not allow for deletions in the incoming data. For example "seqnce" can be recog: 
nized as "sequence" even though some letters are missing. Third, as pointed out by Lashley"" 
speech processing relies on hierarchical structures. 
Although Kanerva's original algorithm is too simplistic, a straightforward modification 
allows retrieval at different rates with deletions. To achieve this, we can add on the time-delay 
terms with weights which are smeared out in time. Kanerva's (1984) formulation can thus be 
viewed as a discrete-time formulation of that put forth by Hopfield and Tank, (1987)? Explicitly 
we could write 
h =   Ws(t-_), (26) 
[= t =- 
where the coefficients W. are a discrete approximation to a smooth function which spreads the 
delayed signal out over ume. As a further step, we could modify these weights dynamically to 
optimize the signal coming out. The time-delay patterns could also be placed in a hierarchical 
structure as in the matched filter avalanche structure put forth by Grossberg et al. (1986). 26 
CORRELATED PATTERNS 
In the above associative memories, all of the patterns were taken to be randomly chosen, 
uniformly distributed binary vectors of length n. However, there are many applications where the 
set of input paRems is not uniformly disuibuted; the input patterns are correlated. In mathematical 
terms, the set < of input patterns would not be uniformly distributed over the entire space of 2" 
possible patterns. Let the probabfli.ty distribution function for the Hamming distance between two 
randomly chosen vectors pX and p from the distribution : be given by the function p(d(pX-pl)), 
where d(x-y) is the Hamming distance between x and y. 
The SDM can be generalized from Kanerva's original formulation so that correlated input 
patterns can be associated with output paRems. For the moment, assume that the distribution set 
c and the probability density function p(x) are known a priori. Instead of constructing the rows 
of the matrix A from the entire space of 2" patterns, construct the rows of A from the distribution 
r,. Adjust the Hamming distance r so that  = bn = constant number of locations are selected. 
420 
In other words, adjust r so that the value of 8 is the same as given above, where  is determined 
by 
 = (27) 
2  
This implies that r would have to be adjusted dynamically. This could be done, for example, by a 
feedback loop. Circuitry for doing this is easily built," and a similar structure appears in the 
Golgi cells in the Cerebellum?. 
Using the same distribution for the rows of A as the distribution of the patterns in  and 
using (27) to specify the choice of r, all of the above analysis is applicable (assuming randomly 
chosen output patterns). If the outputs do not have equal Is and -Is the mean of the noise is not 
0. However, if the distribution of outputs is also known, the system can still be made to work by 
storing l/p+ and l/p_ for Is and -Is respectively, where p is the probability of getting a 1 or a -1 
respectively. Using this storage algorithm, all of the above formulas hold, (as long as the distribu- 
tion is smooth enough and not extremely dense). The SDM will be able to recover data stored 
with correlated inputs with a fidelity given by Equation (17). 
What if the distribution function : is not known a priori? In that case, we would need to 
have the matrix A learn the distribution p(x). There are many ways to build A to mimic p. One 
such way is to start with a random A matrix and modify the entries of  randomly chosen rows of 
A at each step according to the statistics of the most recent input patterns. Another method is to 
use competitive learning zs-3 to achieve the proper distribution of A. 
The competitive learning algorithm is a method for adjusting the weights Ai; between the 
first and second layer to match this probability density functton, p(x). The t 'n row of the address 
matrix A can be viewd as a vector A,. The competitive learning algorithm holds a competition 
between these vectors, and a few vectors that are the closest (within the Hamming sphere r) to the 
input pattern x are the winners. Each of these winners are then modified slightly in the direction 
of x. For large enou m, this algorithm almost always converges to a distribution of the Ai that 
is the same as p(x)/' The updating equation for the selected addresses is just 
A? = nf - X(A? - x) (28) 
Note for . = 1, this reduces to the so-called unary representation of Baum et al. 3 Vfnich gives 
the maximum efficiency in terms of capacity. 
DISCUSSION 
The above analysis said nothing about the basins of attraction of these memory states. A 
measure of the performance of a content addressable memory shotlid also say something about the 
average radius of convergence of the basin of attraction. The basins are in general quite compli- 
cated ' and have been investigated numerically for the unclipped models and values of n and m 
ranging in the 100s. 2 The basins of attraction for the SDM and the d=l model are very similar in 
their characteristics and their average radius of convergence. However, the above results give an 
upper bound on the capacity by looking at the fixed points of the system (if there is no fixed point, 
there is no basin). 
In summary, the above arguments show that the total information stored in outer-product 
neural networks is a constant times the number of connections between the neurons. This constant 
is independent of the order of the model and is the same (q/R2b) for the SDM as well as higher- 
order Hopfield-type networks. The advantage of going to an architecture like the SDM is that the 
number of patterns that can be stored in the network is independent of the size of the pattern, 
whereas the number of stored patterns is limited to a fraction of the word size for the Willshaw or 
Hopfield architecture. The point of the above analysis is that the efficiency of the SDM in terms 
of information stored per bit is the same as for Hopfield-type models. 
It was also demonstrated how sequences of patterns can be stored m the SDM, and how time 
delays can be used to recover contextual information. A minor modification of the SDM could be 
used to recover time sequences at slightly different rates of presentation. Moreover, another minor 
modification allows the storage of correlated patterns in the SDM. With these modifications, the 
SDM presents a versatile and efficient tool for investigating properties of associative memory. 
421 
Acknowledgements: 
howledged. This work was supported by DARPA contract # 86-A227500-000. 
Discussions with John Hopfield and Pentti Kanerva are gratefully ack- 
REFERENCES 
[1] McCulloch, W. S. & Pitts, W. (1943), Bull. Math. Biophys. $, 115-133. 
[2] Hebb, D. O. (19,19) The Organization of Behavior. John Wiley, New York. 
[3] Anderson, J. A., Solverstein, J. W., Ritz, S. A. & Jones, R. S. (1977) Psych. Rev., 84, 
,112-,151. 
[4] Hopfield, J. J. (1982) Proc. Natn'l. Acad. Sci. USA 79 255,1-2558. 
[5] Kirkpatrick, S. & Sherrington, D. (1978) Phys Rev. 17,138,1-4405. 
[6] Little, W. A. & Shaw, G. L.(1978)Math. Biosci. 39, 281-289. 
[7] Nakano, K. (1972), Association - A model of associative memory, IEEE Trans. Sys. Man 
Cyber. 2, 
[8] Willshaw, D. J., Buneman, O. P. & Longuet-Higgins, H. C., (1969) Nature, 222 960-962. 
[9] Lee, Y. C.; Doolen, G.; Chen, H. H.; Sun, G. Z.; Maxwell, T.; Lee, H. Y.; & Giles, L. 
(1985) Physica, 22D, 276-306. 
[10] Baldi, P., and Venkatesh, S.S., (1987) Phys. Rev. Lett. $8, 913-916. 
[11] Kanerva, P. (1984) Self-propagating Search: A Unified Theory of Memory, Stanford 
University Ph.D. Thesis, and Bradford Books (M1T Press). In press (1987 es0. 
[12] Chou, P. A., The capacity of Kanerva's Associative Memory these proceedings. 
[13] McEliece, R. J., Posner, E. C., Rodemich, E. R., & Venkatesh, S.S. (1986), IEEE Trans. 
on Information Theory. 
[1,1] Amit, D. J., Gutfreund, H. & Sompolinsky, H. (1985) Phys. Rev. Lett. $$, 1530-1533. 
[15] Shannon, C. E., (19,18), Bell Syst. Tech. J., 27, 379,623 (Reprinted in Shannon and Weaver 
19,19). 
[16] Kleinfeld, D. & Pendergraft, D. B., (1987) Biophys. J. $1, ,17-53. 
[17] Keeler, J. D. (1986), Cornpan'son of Sparsely Distributed Memory and Hopfield-type Neural 
Network Models, RIACS Technical Report 86-31, also submitted to o r. Cog. Sci. 
[18] Keeler, J. D. (1987) Physics Letters 124A, 53-58. 
[19] Abu-Mostafa, Y. & St. Jacques, (1985), IEEE Trans. on Info. Theor., 31, ,161. 
[18] Keeler, J. D., Basins of Attraction of Neural Network Models AIP Conf. Proc. #151, Ed: 
John Denker, Neural Networks for Computing, Snowbird Utah, (1986). 
[20] Peretto, P. & J.J. Niez, (1986) Biol. Cybem., $4. 53-63. 
[21] Keeler, J. D., Ph.D. Dissertation. Collective phenomena of coupled lattice maps: Reaction- 
diffusion systems and neural networks. Department of Physics, University of California, San 
Diego, (1987). 
[22] Kleinreid, D. (1986). Proc. Nat. Acad. Sci. 83 9469-9,173. 
[23] Sompolinsky, H. & Kanter, I. (1986). Physical Review Letters. 
[24] Lashley, K. S. (1951). Cerebral Mechanisms in Behavior. Edited by Jeffress, L. A. Wiley, 
New York, 112-136. 
[25] Hopfield, J. J. & Tank, D. W. (1987). ICNN San Diego preprint. 
[26] Grossberg, S. & Stone, G. (1986). Psychological Review, 93, ,16-7,1 
[27] Mart, D. (1969). A orournal of Phisiology, 202, 437-t70. 
[28] Grossberg, S. (1976). Biological Cybernetics 23, 121-134. 
[29] Kohonen, T. (1984) Self-organization and associative memory. Springer-Verlag, Berlin. 
[30] Rumelhart, D. E. & Zipset, D. or. Cognitive Sci., 9, (1985), 75. 
[31] Baum, E., Moody J., Wilczek F. (1987). Preprint for Biological Cybernetics 
