Statistically Efficient Estimation Using 
Cortical Lateral Connections 
Alexandre Pouget 
alex@salk.edu 
Kechen Zhang 
zhang@salk.edu 
Abstract 
Coarse codes are widely used throughout the brain to encode sen- 
sory and motor variables. Methods designed to interpret these 
codes, such as population vector analysis, are either inefficient, i.e., 
the variance of the estimate is much larger than the smallest possi- 
ble variance, or biologically implausible, like maximum likelihood. 
Moreover, these methods attempt to compute a scalar or vector 
estimate of the encoded variable. Neurons are faced with a simi- 
lar estimation problem. They must read out the responses of the 
presynaptic neurons, but, by contrast, they typically encode the 
variable with a further population code rather than as a scalar. 
We show how a non-linear recurrent network can be used to per- 
form these estimation in an optimal way while keeping the estimate 
in a coarse code format. This work suggests that lateral connec- 
tions in the cortex may be involved in cleaning up uncorrelated 
noise among neurons representing similar variables. 
I Introduction 
Most sensory and motor variables in the brain are encoded with coarse codes, i.e., 
through the activity of large populations of neurons with broad tuning to the vari- 
ables. For instance, direction of visual motion is believed to be encoded in visual 
area MT by the responses of a large number of cells with bell-shaped tuning, as 
illustrated in figure 1-A. 
Neurophysiological recordings have shown that, in response to an object moving 
along a particular direction, the pattern of activity across such a population would 
look like a noisy hill of activity (figure l-B). On the basis of this activity, _, the 
best that can be done is to recover the conditional probability of the direction of 
motion, 0, given the activity, p(01, ). A slightly less ambitious goal is to come up 
with a good "guess", or estimate, , of the direction, 0, given the activity. Because 
of the stochastic nature of the noise, the estimator is a random variable, i.e, for 
AP is at the Institute for Computational and Cognitive Sciences, Georgetown Univer- 
sity, Washington, DC 20007 and KZ is at The Salk Institute, La Jolla, CA 92037 . This 
work was funded by McDonnell-Pew and Howard Hughes Medical Institute. 
98 A. PougetandK. Zhang 
A 
B 
3 
2.5 
-' 2 
 1o$ 
1 
0.5 
0 
100 200 300 
Direction (deg) 
100 200 300 
Preferred Direction (deg) 
Figure 1: A- Tuning curves for 16 direction tuned neurons. B- Noisy pattern of 
activity (o) from 64 neurons when presented with a direction of 180 . The ML 
estimate is found by moving an "expected" hill of activity (dotted line) until the 
squared distance with the data is minimized (solid line) 
the same image, t will vary from trial to trial. A good estimator should have 
the smallest possible variance across those trials because the variance determines 
how well two similar directions can be discriminated using this estimator. The 
Cram6r-Rao bound provides an analytical lower bound for this variance given the 
noise in the system and the unit tuning curves [5] Typically, computationally 
simple estimators, such as optimum linear estimator (OLE), are very inefficient; 
their variances are several times the bound. By contrast, Bayesian or maximum 
likelihood (ML) estimat0rs (which are equivalent for the case under consideration 
in this paper) can reach this bound but require more complex calculations [5]. 
These decoding technics are valuable for a neurophysiologist interested in reading 
out the population code but they are not directly relevant for understanding how 
neural circuits perform estimation. In particular, they all provide the estimate in a 
format which is incompatible with what we know of sensory representations in the 
cortex. For example, cells in V4 are estimating orientation from the noisy responses 
of orientation tuned V1 cells, but, unlike ML or OLE which provide a scalar esti- 
mate, V4 neurons retain orientation in a coarse code format, as demonstrated by 
the fact that V4 cells are just as broadly tuned to orientation as V1 neurons. 
Therefore, it seems that a theory of estimation in biological networks should have 
two critical characteristics: 1- it should preserve the estimate in a coarse code and 
2- it should be efficient, i.e., the variance should be close to the Cramdr-Rao bound. 
We explore in this paper various network architectures for performing estimations 
with coarse code using lateral connections. We start by briefly describing several 
classical estimators such as OLE or ML. Then, we consider linear and non-linear 
recurrent networks and compare their performances with the classical estimators. 
2 Classical Methods 
The simplest estimators are linear of the form tOLr = lgrT. Better performance 
can be obtained with a center of mass estimator (COM), tCOM = Y-i Oiai/Y-i ai; 
however, in the case of a periodic variable, such as direction of motion, the best 
one-shot method known is the complex estimator (COMP), tCOMP = phase(z) 
N 
where z = Y-=I aeik [5]. This estimator consists in fitting a cosine through 
the pattern of activity, like the one shown in figure l-B, and using the phase of 
Statistically Efficient Estimations Using Cortical Lateral Connections 99 
A B 
40. 
20, 
1 oo 
Activity over Time 
5( 
0 100 200 300 
preferred Direction (,deg} 
Figure 2: A- Circular network of 64 units. Only the connections originating from 
one unit are shown. B- Activity over time in the non-linear network when initialized 
with a random pattern at t = 0. The activity of the units are plotted as a function 
of their position along the circle which is equivalent to their preferred direction of 
motion with appropriate choice of weights. 
the best cosine fit as the estimate of direction. This method is suboptimal if the 
data were not generated by cosine tuning functions as in the case illustrated in 
figure 1-A. It is possible to obtain optimum performance by fitting the curve that 
was actually used to generate the data, i.e, the actual tuning curves of the units. 
A maximum likelihood estimate, defined as being the direction maximizing p(ff]0), 
involves exactly this type of curve fitting, a process illustrated in figure 1-B [5]. The 
estimate is computed by finding first the "expected" hill- the hill that would be 
obtained in a noise free system- minimizing the distance with the data. In the case 
of gaussian noise, the appropriate distance measure to minimize is the euclidian 
squared distance. The final position of the peak of the hill corresponds to the 
maximum likelihood estimate, M. 
3 Recurrent Networks 
Consider a circular network of 64 units fully connected like the one depicted in 
figure 2-A. With an appropriate choice of weights and activation function, this 
network will develop a hill-shaped pattern of activity in response to a transient 
input as illustrated in figure 2-B. If we initialize this networks with activity patterns 
- = {ai} corresponding to the responses of 64 direction tuned units (figure 1), we 
can use the final position of the hill across the neuronal array after relaxation as 
an estimate of the direction, 0. The variance of this estimator will depend on the 
exact choice of activation function and weights. 
3.1 Linear Network 
We first consider a network of 64 units whose dynamics is governed by the following 
difference equation: 
( ) 
o,(t + ,t) = o(t) + ,t + wjoj(t) 
The dynamics of such networks is well understood [3]. If each unit receives the 
same weight vector , then the weight matrix W is symmetric. In this case, the 
100 A. Pouget and K. Zhang 
network dynamics amplifies or suppresses the Fourier component of the initial input 
pattern, {ai}, independently by a factors equal to the corresponding component of 
the Fourier transform, 5, of . For example, if the first component of v is more 
than one (resp. less than one) the first Fourier component of the initial pattern of 
activity will be amplified (resp. suppressed). 
Thus, we can choose W such that the network amplifies selectively the first Fourier 
component of the data while suppressing the others. The network would be unstable 
but if we stop after a large, yet fixed, number of iterations, the activity pattern would 
look like a cosine function of direction with a phase corresponding to the phase of 
the first Fourier components of the data. In other words, the network would end 
up fitting a cosine function in the data which is equivalent to the COMP method 
described above. A network for orientation selectivity proposed by Ben-Yishai et 
al [1] is closely related to this linear network. 
Although this method keeps the estimate in a coarse code format, it suffers two 
problems: it is unclear how it could be extended to non periodic variables, such as 
disparity, and it is suboptimal since it is equivalent to the COMP estimator. 
3.2 Non-Linear Network 
We consider next a network of 64 units fully connected whose dynamics is governed 
by the following difference equations: 
oi(t) = g(ui(t))= 6.3 (log (1 + e 5+''(t)) )0.s (2) 
( 
.,(t + = .,(t) + -.,(t) + w,joj(t) (3) 
Zhng (1996) has demonstrated that with pproprite symmetric weights, 
this network develops  stable hill of ctivity in response to n rbitmry transient 
input pttern (h)(figure 2-B). The shape of the hill is fully specified by the weights 
nd ctiwtion function where,, by controt, the final position of the hill on the 
neuronl rray depends only on the initial input. Therefore, like ML, the network 
fits n "expected" function through the dt. We first present  set of simulations 
in which we investigated whether ML nd the network place the hill t the sme 
loction. 
Methods: The simulations consisted estimating the value of the direction of a 
moving bar based on the activity, _ = {ai}, of 64 input units with hill-shaped 
tuning to direction corrupted by noise. We used circular normal functions like the 
ones showed in figure 1-A to model the mean activities, fi(O): 
fi(O) = 3exp(7(cos(O -- Oi) - 1)) + 0.3 (4) 
The value 0.3 corresponds to the mean spontaneous activity of each unit. The peak, 
Oi, of the circular normal functions were uniformly spread over the interval [0 , 360]. 
The activities, {ai}, depended on the noise distribution. We used two types of noise, 
 i nd Poisson distributed: 
normally distributed with fixed vrince,  = 
P(ai = alO) = 1 ( (a fi(O)) fi(O)ke -L() 
oexp 2 ), P(ai = kJO) = (5) 
Our results compare he sandard deviation of four esfimators, OLE, GOM, GOMP 
nd ML o the non-linear recurren network (RN) wih lranenl inputs (the inpu 
patterns are shown on he firs iteration only). In he case of ML, we used he 
Statistically Efficient Estimations Using Cortical Lateral Connections 101 
Noise with Normal Distribution 
20 
)15 
glO 
'o 5 
0 
OLE COM COMP ML RN 
Noise with Poisson Distribution 
I 
oll . I 
OLE COM COMP ML RN 
Figure 3: Histogram of the standard deviations of the estimate for all five methods 
Cramdr-Rao bound to compute the standard deviation as described in Seung and 
Sompolinsky (1993). The weights in the recurrent network were chosen such that 
the final pattern of activity in the network have a profile very similar to the tuning 
function fi(O). 
Results: Since the preferred direction of two consecutive units in the network 
are more than 5  apart, we first wonder whether RN estimates exhibit a bias 
a difference between the mean estimate and the true direction-- in particular for 
directions between the peaks of two consecutive units. Our simulations showed no 
significant bias for any of the orientations tested (not shown). Next, we compared 
standard deviations of the estimates for all five methods and for the two types 
of noise. The RN method was found to outperform the OLE, COM and COMP 
estimators in both cases and to match the Cramdr-Rao bound for gaussian noise 
(figure 3) as suggested by our analysis. For noise with Poisson distribution, the 
standard deviation for RN was only 0.344  above ML (figure 3). 
We also estimated numerically--ORN/Oai 10=170  , the derivative of the RN estimate 
with respect to the initial activity of each of 64 units for an orientation of 170  . This 
derivative in the case of ML matches closely the derivative of the cell tuning curve, 
f/(O). In other words, in ML, units contribute to the estimate according to the 
amplitude of the derivative of the tuning curve. As shown in figure the same is true 
for RN, --ORN/Oai 10=70o matches closely the derivative of the units tuning curves. 
In contrast, the same derivatives for the COMP estimate, (dotted line), or the 
COM estimate, (dash-dotted line), do not match the profile of f/(O). In particular, 
units with preferred direction far away from 170  , i.e. units whose activity is just 
noise, end up contributing to the final estimate, hindering the performance of the 
estimator. 
We also looked at the standard deviation of the RN as a function of time, i.e., 
the number of iterations. Reaching a stable state can take up to several hundred 
iterations which could make the RN method too slow for any practical purpose. 
We found however that the standard deviation decreases very rapidly over the first 
5-6 iterations and reaches asymptotic values after around 20 iterations (figure 4-B). 
Therefore, there is no need to wait for a perfectly stable pattern of activity to obtain 
minimum standard deviation. 
Analysis: One way to determine which factors control the final position of the 
hill is to find a function, called a Lyapunov function, which is minimized over time 
by the network dynamics. Cohen and Grossberg (1983) have shown that network 
characterized by the dynamical equation above and in which the input pattern {sIi} 
102 A. Pouget and K. Zhang 
A 
-0.5 
-1 
"-. / -- RN 
"" "t'  " COMP 
.. "'j."i..t...-. COM 
 "I' '" 
[ '  'x 
100 200 300 
Preferred Direction (deg) 
B 
''15 
o 
-o 5 
0 
0 20 40 
Time (# of iterations) 
6O 
Figure 4: A- Comparison of g'(O) (solid line), --OO/Oailo=17oo for RN, COMP and 
COM. All functions have been normalized to one. B- Standard deviation as a 
function of the number of iterations for RN. 
is clamped, minimizes a Lyapunov function of the form: 
L = - -,wijg(ui)g(uj) + E zgt(z)dz- sE Iig(ui). (6) 
i,j  i 
The last term is the dot product between the input pattern, {sIi}, and the current 
activity pattern, {g(ui)}, on the neuronal array. Here s is simply a scaling factor 
for the input pattern. The dynamics of the network will therefore tend to minimize 
-5',i Iig(ui), or equivalently, to maximize the overlap between the stable pattern 
and the input pattern. The other terms however are also dependent on Ii because 
the shape of the final stable activity profile depends on the input pattern. Therefore 
the network will settle into a compromise between maximizing overlap and getting 
the right profile given the clamped input. 
We can show however that, for small input (i.e., as the scaling factor s --* 0), 
the dominant term in the Lyapunov function is the dot product. To see this, we 
consider the Taylor expansion of Lyapunov function L with respect to s.' First, let 
{Ui} denote the profile of the stable activity {ui} in the limit of zero input (s -- 0), 
and then write the corresponding vMue of the Lyapunov function at zero input  
L0. Now keeping only the first-order terms of s in the Taylor expansion, we obtMn: 
L  Lo - s E Iig(Ui). 
(7) 
This means that the dot product is the only first order term of s, and disturbances 
to the shape of the final activity profile contribute only to higher order terms of 
s, which are negligible when s is small. Notice that in the limit of zero input, the 
shape of the activity profile {Ui} is fixed, and the only thing unknown is its peak 
position. Because L0 is a constant, the global minimum of the Lyapunov function 
here should correspond to a peak position which maximizes the dot product. The 
difference between ui and Ui is negligible for sufficiently small input because, by 
definition, ui -- Ui as s -- O. Consequently, for small input, the network will 
converge to a solution maximizing primarily 5'4 Iig(ui), which is mathematically 
equivalent to minimizing the square distance between the input and the output 
pattern. 
Therefore, if we use an activity pattern, _ -- {ai}, as the input to this network, 
the stable hill should have its peak at a position very close to the direction corre- 
Statistically Ejyicient Estimations Using Cortical Lateral Connections 103 
sponding to the maximum likelihood estimate (under the assumption of gaussian 
noise), provided the network is not attracted into a local minimum of the Liapunov 
function. This result is valid when using a small clamped input but our simulations 
show that a transient input is sufficient to reach the Cramdr-Rao bound. 
4 Discussion 
Our results demonstrate that it is possible to perform efficient unbiased estimation 
with coarse coding using a neurally plausible architecture. Our model relies on 
lateral connections to implement a prior expectation on the profile of the activity 
patterns. As a consequence, units determine their activation according to their 
own input and the activity of their neighbors. This approach shows that one of 
the advantages of coarse code is to provide a representation which simplifies the 
problem of cleaning up uncorrelated noise within a neuronal population. 
Unlike OLE, COM and COMP, the RN estimate is not the result of a voting process 
in which units vote from their preferred direction, Oi. Instead, units turn out to 
contribute according to the derivatives of their tuning curves, f[(O), as in the case 
of ML. This feature allows the network to ignore background noise, that is to say, 
responses due to other factors beside the variable of interest. This property also 
predicts that discrimination of directions around the vertical (90  ) would be most 
affected by shutting off the units tuned at 60  and 120  . This prediction is consistent 
with psychophysical experiments showing that discrimination around the vertical 
in human is affected by prior adaptation to orientations displaced from the vertical 
by 4-00  [4]. 
Our approach can be readily extended to any other periodic sensory or motor vari- 
ables. For non periodic variables such as the disparity of a line in an image, our 
network needs to be adapted since it currently relies on circular symmetrical weights. 
Simply unfolding the network will be sufficient to deal with values around the center 
of the interval under consideration, but more work is needed to deal with boundary 
values. We can also generalize this approach to arbitrary mapping between two 
coarse codes for variables x and y where y is a function of x. Indeed, a coarse code 
for x provides a set of radial basis functions of x which can be subsequently used to 
approximate arbitrary functions. It is even conceivable to use a similar approach 
for one-to-many mappings, a common situation in vision or robotics, by adapting 
our network such that several hills can coexist simultaneously. 
References 
[1] R. Ben-Yishai, R. L. Bar-Or, and H. Sompolinsky. Proc. Natl. Acad. $ci. USA, 
92:3844-3848, 1995. 
[2] M. Cohen and S. Grossberg. IEEE Trans. $MC, 13:815-826, 1983. 
[3] M. Hirsch and S. Smale. Differential equations, dynamical systems and linear 
algebra. Academic Press, New York, 1974. 
[4] D. M. Regan and K. I. Beverley. J. Opt. $oc. Am., 2:147-155, 1985. 
[5] H. S. Seung and H. Sompolinsky. Proc. Natl. Acad. Sci. USA, 90:10749-10753, 
1993. 
