Deriving Receptive Fields Using An 
Optimal Encoding Criterion 
Ralph Linsker 
IBM T. J. Watson Research Center 
P.O. Box 218, Yorktown Heights, NY 10598 
Abstract 
An information-theoretic optimization principle ('infomax') has 
previously been used for unsupervised learning of statistical reg- 
ularities in an input ensemble. The principle states that the input- 
output mapping implemented by a processing stage should be cho- 
sen so as to maximize the average mutual information between 
input and output patterns, subject to constraints and in the pres- 
ence of processing noise. In the present work I show how infomax, 
when applied to a class of nonlinear input-output mappings, can 
under certain conditions generate optimal filters that have addi- 
tional useful properties: (1) Output activity (for each input pat- 
tern) tends to be concentrated among a relatively small number 
of nodes. (2) The filters are sensitive to higher-order statistical 
structure (beyond pairwise correlations). If the input features are 
localized, the filters' receptive fields tend to be localized as well. 
(3) Multiresolution sets of filters with subsampling at low spatial 
frequencies - related to pyramid coding and wavelet representations 
- emerge as favored solutions for certain types of input ensembles. 
I INTRODUCTION 
In unsupervised network learning, the development of the connection weights is 
influenced by statistical properties of the ensemble of input vectors, rather than by 
the degree of mismatch between the network's output and some 'desired' output. 
An implicit goal of such learning is that the network should transform the input 
so that salient features present in the input are represented at the output in a 
953 
954 Linsker 
more useful form. This is often done by reducing the input dimensionaltry in a way 
that preserves the high-variance components of the input (e.g., principal component 
analysis, Kohonen feature maps). 
The principle of maximum information preservation ('infomax') is an unsupervised 
learning strategy that states (Linsker 1988): From a set of allowed input-output 
mappings (e.g., parametrized by the connection weights), choose a mapping that 
maximizes the (ensemble-averaged) Shannon information that the output vector 
conveys about the input vector, in the presence of noise. Such a mapping maximizes 
the ensemble-averaged mutual information (MI) between input and output. 
This paper (a) summarize earlier results on infomax solutions for linear networks, 
(b) identifies some limitations of these solutions (ways in which very different filter 
sets are equally optimal from the infomax standpoint), and (c) shows how, by adding 
a small nonlinearity to the network, one can remove these limitations and at the 
same time improve the utility of the output representations. We show that infomax, 
acting on the modified network, tends to favor sparsely coded representations and 
(depending on the input ensemble) sets of filters that span multiple resolution scales 
(related to wavelets and 'pyramid coding'). 
2 INFOMAX IN LINEAR NETWORKS 
For definiteness and brevity, we consider a linear network having a particular type 
of noise model and input statistical properties. For a more detailed discussion of 
related models see (Linsker 1989). 
Since the computation of the MI (which involves the output entropy) is in general 
intractable for continuous-valued output vectors, previous work (and the present 
paper) makes use of a surrogate MI, which we will call the 'as-if-Gaussian' MI. This 
quantity is, by definition, computed as though the output vectors comprised a mul- 
tivariate Gaussian distribution having the same mean and covariance as the actual 
distribution of output vectors. Although expedient, this substitution has lacked a 
principled justification. The Appendix shows that, under certain conditions, using 
this 'surrogate MI' (and not the full MI) is indeed appropriate and justified. 
Denote the input vector by S _= {Si} (Si is the activity at input node i), the output 
vector by Z = {Zn), the matrix of connection weights by C _= {Cni), noise at the 
input nodes by N =- {Ni), and noise at the output nodes by , = {'n). Then our 
processing model is, in matrix form, Z - C($ 4- N) 4- . Assume that N and , are 
Gaussian random variables, (S) = (N = (,) = O, (_SN T } = (S ,T ) = (N ,T) = O, 
and, for the covariance matrices, (SS") = Q, (NN ') = 7I, (''?) = HI'. (Angle 
brackets denote an ensemble average, superscript T denotes transpose, and I and 
I' denote unit matrices on the input and output spaces, respectively.) In general, 
MI = Hz - (Hzls) where Hz is the output entropy and Hzls is the entropy of the 
output for given S. Replacing MI by the 'as-if-Gaussian' MI means replacing Hz 
by the expression for the entropy of a multivariate Gaussian distribution, which is 
(apart from an irrelevant constant term) Hz a = (1/2) In det Q', where Q' -- (zz T) = 
CQC ' + riCC ' +/3I' is the output covariance. Note that, when S is fixed, Z = 
CS+(CN+) is a Gaussian distribution centered on CS, so that we have (Hzls) = 
(1/2)lndetQ" where Q" = ((CN 4- ,)(CN 4- )) - ,CC  4- HI'. Therefore the 
Deriving Receptive Fields Using An Optimal Encoding Criterion 955 
'as-if-Gaussian' MI is 
MI' = (1/2)[lndet Q'- lndet Q"]. 
(1) 
The variance of the output at node n (prior to adding noise v,) is V, = ([C($ + 
N)]}) = (CQC ' + CCa'),,. We will constrain the dynamic range of each output 
node (limiting the number of output values that can be discriminated from one 
another in the presence of output noise) by requiring that V, = 1 for each n. 
Subject to this constraint, we are to find a matrix C that maximizes MI'. For a 
local Hebbian algorithm that accomplishes this maximization, see (Linsker 1992). 
Here, in order to proceed analytically, we consider a special case of interest. 
Suppose that the input statistics are shift-invariant, so that the covariance (SiSj) 
is a function of (j - i). We then use a shift-invariant filter Ansatz, Cni = C(i - n). 
Infomax then determines the optimal filter gain as a function of spatial frequency; 
i.e., the magnitude of the Fourier components c(k) of C(i - n). The derivation is 
summarized below. 
Denote by q(k), q'(k), and q"(k) the Fourier transforms of Q - i), Q'(m - n), 
and Q"(m- n) respectively. Since Q' = CQCr+ yCC:"+/5I', therefore 
q'(k) = [q(k) + r/) [ c(k) 12 +/5. Similarly, q"(k) = r/ I c(k) 12 +/5. We obtain 
MI' = (1/2)Ea[lnq'(k) - lnq"(k)]. Each node's output variance V is equal to 
v = V] K number of terms in the sum over k. 
To maximize MI' subject to the constraint on V we use the Lagrange multiplier 
method; that is, we mimize MI"  MI' + (V - 1) with respect to each 
This yields an equation for each k that is quadratic in ]c(k) ]a. The unique solution 
is 
q(k) (1 + [1 2,K ]/a 
-1 + .] } (2) 
if the RHS is positive, and zero otherwise. The Lagrange multiplier (< 0) is chosen 
so that the {I c(k)I} satisfy V = 1. 
Starting from a differently-stated goal (that of reducing redundancy subject to a 
limit on information loss), which turns out to be closely related to infomax, (Atick 
& Redlich 1990a) found an expression for the optimal filter gain that is the same 
as that of Eq. 2 except for the choice of constraint. 
Filter properties found using this approach are related to those found in early stages 
of biological sensory processing. Smoothing and bandpass (contrast-enhancing) 
filters emerge as infomax solutions (Linsker 1989, Atick & Redlich 1990a) in certain 
cases, and good agreement with retinal controt sensitivity measurements has been 
found (Atick & Redlich 1990b). 
Nonetheless, the value of the infomax solution Eq. 2 is limited in two important 
ways. First, the phases of the {c(k)} are left undetermined. Any choice of phases is 
equally good at maximizing MIt in a linear network. Thus the real-space response 
function C(i - n), which determines the receptive field properties of the output 
nodes, is nonunique (and indeed may be highly nonlocalized in space). 
Second, it is useful to extend the solution Ansatz to allow a number of different filter 
types a = 1,..., A at each output site, while continuing to require that each type 
956 Linsker 
satisfy the shift-invariance condition Cni(a)  C(i - n; a). For example, one may 
want to model a topographic 'retinocortical' mapping in which each patch of cortex 
(each 'site') contains multiple filter types, yet each patch carries out the same set of 
processing functions on its input. For this Ansatz, one again obtains Eq. 2 (deriva- 
tion omitted here), but with [ c(k) [2 on the LHS replaced by Eap(a)[ c(k;a)]2, 
where c(k; a)is the F.T. of C(i-n; a), and p(a) is the fraction of the total number 
of filters (at each site) that are of type a. The partitioning of the overall (sum- 
squared) gain among the multiple filter types is thus left undetermined. 
The higher-order statistical structure of the input (beyond covariance) is not being 
exploited by infomax in the above analysis, because (1) the network is linear and (2) 
only pairwise correlations among the output activities enter into MI . We shall show 
that if we make the network even mildly nonlinear, MI  is no longer independent of 
the choice of phases or of the partitioning of gain among multiple filter types. 
3 NETWORK WITH WEAK NONLINEARITY 
We consider the weakly nonlinear input-output relation Zn = U + eU--EiCniNi-{- 
v, where U _= iCniSi, for small e. This differs from the linear network analyzed 
above by the term in U. (For simplicity, terms nonlinear in the noise are not 
included.) The cubic term increases the signal-to-noise ratio selectively when U, is 
large in absolute value. We maximize MI  as defined in Eq. 1. 
Heuristically, the new term will cause infomax to favor solutions in which some 
output nodes have large (absolute) activity values, over solutions in which all output 
nodes have moderate activities. The output layer can thus encode information 
about the input vector (e.g., signal the presence of a feature) via the high activity 
of a small number of nodes, rather than via the particular activity values of many 
nodes. This has several (interrelated) potential advantages. (1) The concentration 
of activity among fewer nodes is a type of sparse coding. (2) The resulting output 
representation may be more resistant to noise. (3) The presence of a feature can be 
signaled to a later processing stage using fewer connections. (4) Since the particular 
nodes that have high activity depend upon the input vector, this type of mapping 
transforms a set of continuous-valued inputs at each site into a partially place-coded 
representation. A model of this sort may thus be useful for understanding better 
the formation of place-coded representations in biological systems. 
3.1 MATHEMATICAL DETAILS 
This section may be skipped without loss of continuity. In matrix form, U _-- C'S, 
W _-- U for each n, and Z - U + eW + CN + v. Keeping terms through first 
order in e, the output covariance is Q' -- (ZZ T) = CQC T + CC T + 13I' + eF, 
where F _= (wuT)+(uwT). [As an aside, Fn,. = (U,U,.(U +U)) resembles the 
covariance (U, Urn), except that presentations having large U +U are given greater 
weight in the ensemble average.] For shift-invariant input statistics and one filter 
type Cni -- C(i-n), taking the Fourier transform yields q'(k) = [q(k)+r/][ c(k; a) [2+ 
/3 + el(k) where f(k) is the F.T. of F(m - n) -- Fnm. So lndet Q' = E lnq'(k) = 
Z ln{[q(k) 
+7] Ic(k) [ +lS}+e52g(k) where g(k) 
Using a Lagrange multiplier as before, the quantity to be maximized is MI" = 
Deriving Receptive Fields Using An Optimal Encoding Criterion 957 
, ,  . ''"V'IV'VV 
Figure 1: Breaking of phase degeneracy. See text for discussion. 
MI"(e = 0) + (e/2)Eg(k). 
Now suppose there are multiple filter types a = 1,..., A at each output site. For 
each k define d(k) to be the A x A matrix whose elements are: d(k) _-- [q(k) + 
rl]c(k;a)c*(k;b)+ [/3/p(a)]&, where 5 is the Kronecker delta. Also define f(k)to 
be the A x A matrix each of whose elements f(k) is the F.T. of F(m - n; a, b) 
where F(m - n; a, b) . (Un(a)W,.(b)) + (Wn(a) (b)). Then the O(e) part of MI tt 
is: (e/2)E'I'r{[d(k)]- f(k)}. Note that [d(k)]- is the inverse of the matrix d(k), 
and that 'Tr' denotes the trace. [Outline of derivation: In the basis defined by the 
Fourier harmonics, Q' is block diagonal (one A x A block for each k). So In det Q' = 
Ea lndetq'(k) where each q'(k) is an A x A matrix of the form q[(k) + eq[(k). 
Expanding In det qt(k) through O(e) yields the stated result.] 
The infomax calculation to lowest order in e [i.e., O(e)] is the same as for the linear 
network. Here, for simplicity, we determine the sum-squared gain, Eap(a)l c(k; a) [2, 
as in the linear case; then seek to maximize the new term, of O(e), subject to 
this constraint on the value of the sum-squared gain. How the nonlinear term 
breaks phase and gain-apportionment degeneracies is of interest here; a small O(e) 
correction to the sum-squared gain is not. 
4 ILLUSTRATIVE RESULTS 
Two examples will show how adding the nonlinear perturbative term to the net- 
work's output breaks a degeneracy among different filter solutions. In each case the 
input space is a one-dimensional 'retina' with wraparound. 
4.1 BREAKING THE PHASE DEGENERACY 
In this example (see Figure 1) there is one filter type at each output site. We consider 
two types of input ensembles: (1) Each input vector (Fig. la shows one example) 
is drawn from a multivariate Gaussian distribution (so there is no higher-order 
statistical structure beyond pairwise correlations). The input covariance matrix 
Q(j - i) is a Gaussian function of the distance between the sites. (2) Each input 
958 Linsker 
vector is a random sum of Gaussian 'bumps': Si = Ejaj[s(i- j) - so] where s(i- j) 
is a Gaussian (shown in Fig. lb for j=20; there are 64 nodes in all); so is the mean 
value of s(i -j); and each aj is independently and randomly chosen (with constant 
probability) to be 1 or 0. This ensemble does have higher-order structure, with 
each input presentation being characterized by the presence of localized features 
(the bumps) at particular locations. 
The infomax solution for [ c(k) [2 is plotted versus spatial frequency k in Fig. lc 
for a particular choice of noise parameters (r/,/5). As stated earlier, MIt for a linear 
network is indifferent to the phases of the Fourier components {c(k)}. A particular 
random choice of phases produces the real-space filter C(i - n) shown in Fig. ld, 
which spans the entire 'retina.' Setting all phases to zero produces the localized 
filter shown in Fig. if. If the Gaussian 'bump' of Fig. lb is presented as input to a 
network of filters each of which is a shifted version of Fig. ld, the linear response of 
the network (i.e., the convolution of the 'bump' with the filter) is shown in Fig. le. 
Replacing the filter of Fig. ld by that of Fig. if, but keeping the input the same, 
produces the output response shown in Fig. lg. 
The cubic nonlinearity causes MI  to be larger for the filter of Fig. if than for that of 
Fig. ld. Heuristically, if we focus on the diagonal elements of the output covariance 
Q', the nonlinear term is 2e(U4). Maximizing MI' favors increasing this term (sub- 
ject to a constraint on output variance) hence favors filter solutions for which the 
U, distribution is non-Gaussian with a preponderance of large values. Projection 
pursuit methods also use a measure of the non-Gaussianity of the output distri- 
bution to construct filters that extract 'interesting' features from high-dimensional 
data (cf. Intrator 1992). 
4.2 
BREAKING THE PARTITIONING DEGENERACY FOR 
MULTIPLE FILTER TYPES 
In this example (see Fig. 2), the input ensemble comprises a set of self-similar 
patterns (each is a sine-Gabor 'ripple' as in Fig. 2a) that are related by translation 
and dilation (scale change over a factor of 80). Figure 2b shows the input power 
spectrum vs. k; the scaling region goes as 1/k. Figure 2c shows the infomax solution 
for the gain [c(k; a) I vs. k when there is just one filter type. When the input SNR 
is large (as in the scaling region) the infomax filters 'whiten' the output; note the 
flat portion of the output power spectrum (Fig. 2d). [We modify the infomax 
solution by extending the power-law form of[ c(k) I to low k (dotted line in Figs. 
2c,d). This avoids artifacts resulting from the rapid increase in [ c(k) I, which is 
in turn caused by our having omitted low-k patterns from the input ensemble for 
reasons of numerical efficiency.] The dotted envelope curve in Figure 2e shows the 
sum-squared gain Eap(a) [c(k) [2 when multiple filter types a are allowed. The 
quantity plotted is just the square of that shown in Fig. 2c, but on a linear rather 
than log-log plot (note values greater than 5 are cut off to save space). 
The network nonlinearity has the following effect. We first allow two filter types 
to share the overall gain. Optimizing MI  over various partitionings, we find that 
infomaxfavors a crossover between filter types at k  400. Allowing three, then four, 
filter types produces additional crossovers at lower k. For an Ansatz in which each 
filter's share of the sum-squared gain is tapered linearly near its cutoff frequencies, 
Deriving Receptive Fields Using An Optimal Encoding Criterion 959 
the best solution found for each p(a) Ic(k)12 is shown in Fig. 2e (semilog plot vs. 
k). Figure 2fplots the corresponding [ c(k; a) ] vs. k on a linear scale. Note that the 
three lower-k filters appear roughly self-similar. (The peak in the highest-k filter is 
an artifact due to the cutoff of the input ensemble at high k.) The four real-space 
filters C(i-n; a) are plotted vs. (i-n) in Fig. 2g [phases chosen to make C(i-n; a) 
antisymmetric]. 
The resulting filters span multiple resolution scales. The density p(a) is less for the 
lower-frequency filters (spatial subsampling). When more filter types are allowed, 
the increase in MI' becomes progressively less. Although in our model the filters are 
present with density p at each output site, a similar MI' is obtained if one spaces 
adjacent filters of type a by a distance  lip(a). The resulting arrangement of 
filters resembles the 'tiling' of the joint space and spatial-frequency domain that is 
used in wavelet and 'pyramid coding' approaches to image processing. [The infomax 
filters overlap, rather than disjointly tiling the (, k) domain.] 
Using the infomax method, the region of (a, k) space spanned by an optimal filter 
has an aspect ratio that depends upon the relative distances - along the a and k axes 
- over which the input feature is 'coherent' (possesses higher-order correlations). 
One may thus be able to use infomax to predict relationships between statistical 
measures of coherence in natural scenes and observed (a, k) aspect ratios for, e.g., 
orientation-selective cells. See (Field 1989) for a discussion of this issue that is not 
based on infomax. 
5 
APPENDIX: HEURISTIC JUSTIFICATION FOR 
USING A SURROGATE, 'AS-IF-GAUSSIAN,' 
MUTUAL INFORMATION 
The mutual information between input S and output Z is MI = 
f dSdZPszln(Psz/PsPz) : f dSPsKD(Pzls;Pz) where KD(Pzls;Pz) = 
f dZPzIsln(PzIs/Pz) is a Kullback divergence. So, maximizing MI means maxi- 
mizing the average (over $) of KD(PzIs; Pz). 
What does the KD represent? Suppose that the network has somehow learned the 
distribution Pz. Before being presented with a particular input S, the network 'ex- 
pects' an output vector drawn from Pz. The actual output response to S, however, 
is a vector drawn from Pzis. The KD measures the 'surprise' (i.e., the amount of 
information gained) upon seeing the actual distribution Pzls when one expected 
Pz. Infomax maximizes this average 'surprise.' 
However, the network cannot in general have access to the full distribution Pz, 
which contains far too much information (including all higher-order statistics) to 
be stored in the connections and nodes of the network. Let us suppose for defi- 
niteness that the system remembers only the mean and the covariance matrix of 
Z. Define Pz a to be the multivariate Gaussian distribution that has the same mean 
and covariance as Pz. Then we may think of the system as a priori 'expecting' the 
output vector to be drawn from the distribution Pz a. 
We accordingly modify the principle so that we maximize the aver- 
age (over S) of KD(PzIs;P) (note the superscript G). This equals 
960 Linsker 
ooo [ 
0.1 I i 
1 
o. 1 
1 lO lOO lOOO 
1 
0.'{ t , , 0 
1 lO lOO lOOO 
Figure 2: Partitioning among multiple filter types. See text. 
f dSPs fdZPzlsln(Pzls/P): (-Hzls)s - fdZPz lnP (where H denotes en- 
tropy). Using a property of the Gaussian distribution, we have - f dZPz In Pz : 
- f dZP lnP: H. We conclude that the average of KD equals Hz  -(Hzls)s, 
which is exactly equal to the surrogate 'as-if-Gaussian' MI defined preceding Eq. 1. 
This argument provides a principled justification for using the surrogate MI, when 
the system has stored information about the output vectors' mean and covariance, 
but not about higher-order statistics. 
References 
J. J. Atick & A. N. Redlich. (1990a) Towards a theory of early visual processing. 
Neural Computation 2:308-320. 
J. J. Atick & A. N. Redlich. (1990b) Quantitative tests of a theory of retinal 
processing: contrast sensitivity curves. Inst. Adv. Study IASSNS-HEP-90/51. 
D. J. Field. (1989) What the statistics of natural images tell us about visual coding. 
In Proc. SPIE 1077:269-276. 
N. Intrator. (1992) Feature extraction using an unsupervised neural network. Neu- 
ral Computation 4:98-107. 
R. Linsker. (1988) Self-organization in a perceptual network. Computer 21(3)'105- 
117. 
R. Linsker. (1989) An application of the principle of maximum information preser- 
vation to linear systems. In D. S. Touretzky (ed.), Advances in Neural Information 
Processing Systems 1, 186-194. San Mateo, CA: Morgan Kaufmann. 
R. Linsker. (1992) Local synaptic learning rules suffice to maximize mutual infor- 
mation in a linear network. Neural Computation 4(5):691-702. 
