Bayesian Self-Organization 
Alan L. Yuille 
Division of Applied Sciences 
Harvard University 
Cambridge, MA 02138 
Stelios M. Smirnakis 
Lyman Laboratory of Physics 
Harvard University 
Cambridge, MA 02138 
Lei Xu * 
Dept. of Computer Science 
HSH ENG BLDG, Room 1006 
The Chinese University of Hong Kong 
Shatin, NT 
Hong Kong 
Abstract 
Recent work by Becker and Hinton (Becker and Hinton, 1992) 
shows a promising mechanism, based on maximizing mutual in- 
formation assuming spatial coherence, by which a system can self- 
organize itself to learn visual abilities such as binocular stereo. We 
introduce a more general criterion, based on Bayesian probability 
theory, and thereby demonstrate a connection to Bayesian theo- 
ries of visual perception and to other organization principles for 
early vision (Atick and Redlich, 1990). Methods for implementa- 
tion using variants of stochastic learning are described and, for the 
special case of linear filtering, we derive an analytic expression for 
the output. 
1 Introduction 
The input intensity patterns received by the human visual system are typically 
complicated functions of the object surfaces and light sources in the world. It 
* Lei Xu was a research scholar in the Division of Applied Sciences at Harvard University 
while this work was performed. 
lOOl 
1002 Yuille, Smirnakis, and Xu 
seems probable, however, that humans perceive the world in terms of surfaces and 
objects (Nakayama and Shimojo, 1992). Thus the visual system must be able to 
extract information from the input intensities that is relatively independent of the 
actual intensity values. Such abilities may not be present at birth and hence must 
be learned. It seems, for example, that binocular stereo develops at about the age 
of two to three months (Held, 1987). 
Becker and Hinton (Becker and Hinton, 1992) describe an interesting mechanism 
for self-organizing a system to achieve this. The basic idea is to assume spatial 
coherence of the structure [o be extracted and to train a neural network by maxi- 
mizing the mutual information between neurons with disjoint receptive fields. For 
binocular stereo, for example, the surface being viewed is assumed flat (see (Becker 
and Hinton, 1992) for generalizations of this assumption) and hence has spatially 
constant disparity. The intensity patterns, however, do not have any simple spatial 
behaviour. Adjusting the synaptic strengths of the network to maximize the mutual 
information between neurons with non-overlapping receptive fields, for an ensem- 
ble of images, causes the neurons to extract features that are spatially coherent - 
thereby obtaining the disparity [fig. l]. 
maximize I (a;b) 
0 1 1 1 
1 0 1 1 
0 0 I 1 
1 0 0 1 
Figure 1: In Hinton and Becker's initial scheme (Becker and Hinton, 1992), max- 
imization of mutual information between neurons with spatially disjoint receptive 
fields leads to disparity tuning, provided they train on spatially coherent patterns 
(i.e. those for which disparity changes slowly with spatial position) 
Workers in computer vision face a similar problem of estimating the properties of 
objects in the world from intensity images. It is commonly stated that vision is ill- 
posed (Poggio et al, 1985) and that prior assumptions about the world are needed 
to obtain a unique perception. It is convenient to formulate such assumptions by 
the use of Bayes' theorem P(SID) = P(DIS)P(S)/P(D). This relates the proba- 
Bayesian Self-Organization 1003 
bility P(SID) of the scene $ given the data D to the prior probability of the scene 
P(S) and the imaging model P(DIS ) (P(D) can be interpreted as a normalization 
constant). Thus a vision theorist (see (Clark and Yuille, 1990), for example) deter- 
mines an imaging model P(DIS), picks a set of plausible prior assumptions about 
the world P(S) (such as natural constraints (Marr, 1982)), applies Bayes' theorem, 
and then picks an interpretation $* from some statistical estimator of P(S[D) (for 
example, the maximum a postertort (MAP) estimator S* = ARG{MAXsP(SI D) }.) 
An advantage of the Bayesian approach is that, by nature of its probabilistic formu- 
lation, it can be readily related to learning with a teacher (Kersten et al, 1987). It is 
unclear, however, whether such a teacher will always be available. Moreover, from 
Becker and Hinton's work on self-organization, it seems that a teacher is not always 
necessary. This paper proposes a way for generalizing the self-organization ap- 
proach, by starting from a Bayesian perspective, and thereby relating it to Bayesian 
theories of vision. The key idea is to force the activity distribution of the outputs to 
be close to a pre-specified prior distribution Pp(S). We argue that this approach is 
in the same spirit as (Becker and Hinton, 1992), because we can choose the prior dis- 
tribution to enforce spatial coherence, but it is also more general since many other 
choices of the prior are possible. It also has some relation to the work performed by 
Atick and Redlich (Atick and Redlich, 1990) for modelling the early visual system. 
We will take the viewpoint that the prior Pp(S) is assumed known in advance by 
the visual system (perhaps by being specified genetically) and will act as a self- 
organizing principle. Later we will discuss ways that this might be relaxed. 
2 Theory 
We assume that the input D is a function of a signal Y, that the system wants 
to determine and a distractor N [fig.2]. For example E might correspond to the 
disparities of a pair of binocular stereo images and N to the intensity patterns. The 
distribution of the inputs is PD(D) and the system assumes that the signal E has 
distribution Pp (E). 
Let the output of the system be S - C(D,'7) where C is a function of a set 
of parameters '7 to be determined. For example, the function C(D,7) could be 
represented by a multi-layer perceptron with the -7's being the synaptic weights. 
By approximation theory, it can be shown that a large variety of neural networks 
can approximate any input-output function arbitrarily well given enough hidden 
nodes (Hornik et al, 1989). 
The aim of self-organizing the network is to ensure that the parameters '7 are chosen 
so that the outputs $ are as close to the E as possible. We claim that this can be 
achieved by adjusting the parameters '7 so as to make the derived distribution of the 
outputs Poo(S : 7) = f 5(S - G(D, 7))Po(D)[dD] as close as possible to Pp(S). 
This can be seen to be a consistency condition for a Bayesian theory as from Bayes 
formula we obtain the equation: 
P(S]D)Pr>(D)[dD] = / P(DIS)Pp(S)[dD] = Pp(S). (1) 
1004 Yuille, Smirnakis, and Xu 
which is equivalent to our condition, provided we choose to identify P(SID) with 
6(S- G(D,')')). 
To make this more precise we must define a measure of similarity between the two 
distributions Pp(S) and PDD(S'')'). An attractive measure is the Kullback-Leibler 
distance (the entropy of PD relative to Pp): 
log 
Pp(s) [dS]. 
(2) 
D=F(E,N) 
g(z) 
S=G(D, 7) 
v ?o.(s.r) 
Pot> ( S : 7)log( 
?.. (s : r).)a s 
Figure 2: The parameters 7 are adjusted to minim,ze the Kullback-Leibler dis- 
tance between the prior (Pp) distribution of the true signal (E) and the derived 
distribution (PDD) of the network output (S). 
This measure can be divided into two parts: (i) -f PDD(S'')')Iog Pr(S)[dS] and 
(ii) f Poo(S: 7)log Poo(S'7)[dS]. The second term encourages variability of the 
output while the first term forces similarity to the prior distribution. 
Suppose that Pp(S) can be expressed as a Markov random field (i.e. the spatial 
distribution of Pp($) has a local neighbourhood structure, as is commonly assumed 
in Bayesian models of vision). Then, by the Hammersely-Clifford theorem, we can 
write Pt(S) = e-tz,(S)/z where Er(S ) is an energy function with local connections 
(for example, Ep(S) = Ei($i - Si+l?),  is an inverse temperature and Z is a 
normalization constant. 
Then the first term can be written (Yuille et al, 1992) as 
- f Poo(S' 7)log Pp(S)[dS] = i(Er(G(D , ')')))o + log Z. 
Bayesian Self-Organization 1005 
We can ignore the logZ term since it is a constant (independent of 7). Mini- 
mizing the first term with respect to '7 will therefore try to minimize the energy 
of the outputs averaged over the inputs- {Ep(G(D, 7))}D - which is highly desir- 
able (since it has a close connection to the minimal energy principles in (Poggio 
et al, 1985, Clark and Yuille, 1990)). It is also important, however, to avoid the 
trivial solution G(D, 7) - 0 as well as solutions for which G(D, 7) is very small 
for most inputs. Fortunately these solutions are discouraged by the second term: 
f PDD(D, ,7)log PD(D, ,7)[dD], which corresponds to the negative entropy of the 
derived distribution of the network output. Thus, its minimization with respect to 
,7 is a maximum entropy principle which will encourage variability in the outputs 
G(D, ,7) and hence prevent the trivial solutions. 
3 Reformulating for Implementation. 
Our theory requires us to minimize the Kullback-Leibler distance, equation 2, with 
respect to ,7. We now describe two ways in which this could be implemented using 
variants of stochastic learning. First observe that by substituting the form of the 
derived distribution into equation 2 and integrating out the $ variable we obtain: 
f ,7) 
KL(,7) = PD(D)log Pp(G(D,,7)) 
(4) 
Assuming a representative sample { D":/ e A} of inputs we can approximate KL(,7) 
by -],^ log[Pt)t)(G(D", ,7): ,7)/Pp(G(D", ,7))]. We can now, in principle, perform 
stochastic learning using backpropagation: pick inputs D, at random and update 
the weights '7 using log[PDD(G(D , ,7): ,7)/Pp(G( D", ,7))] as the error function. 
To do this, however, we need expressions for PDD(G(D",,7) : ,7) and its deriva- 
tive with repect to ,7- If the function G(D,,7) can be restricted to being 1-1 (in- 
creasing the dimensionality of the output space if necessary) then we can obtain 
(Yuille et al, 1992) analytic expressions Pt)t)(G(D, ,7): ,7) - Pt)(D)/] det(OG/OD)l 
and ((1OgPDD(G(D,,7): ,7)/C'},7) =-((G/(D)-(c2G/cD(,7), where [-1] denotes 
the matrix inverse. Alternatively we can perform additional sampling to estimate 
PDD(G(D, ,7): ,7) and ((logPDD(G(D, ,7): ,7)/c,7) directly from their integral rep- 
resentations. (This second approach is similar to (Becker and Hinton, 1992) though 
they are only concerned with estimating the first and second moments of these 
distributions.) 
4 Connection to Becker and Hinton. 
The Becker and Hinton method (Becker and Hinton, 1992) involves maximizing the 
mutual information between the output of two neuronal units $, $2 [fig.l]. This is 
given by: 
I(S, $2): - < log ?DD($t) 
where the first two terms correspond to maximizing the entropies of 
while the last term forces 
1006 Yuille, Smirnakis, and Xu 
By contrast, our version tries to minimize the quantity  
< log ?z>z>(&, s2) > - f logPp(S,S2)Pz>z>(S,S2)[d]. 
If we then ensure that Pp(S, S2) = 5(S - S2) our second term will force S  S2 
and our first term will maximize the entropy of the joint distribution of S, S2. We 
argue that this is effectively the same as (Becker and Hinton, 1992) since maxi- 
mizing the joint entropy of $, $2 with $ constrained to equal $2 is equivalent to 
maximizing the individual entropies of $ and $2 with the same constraint. 
To be more concrete, we consider Becker and Hinton's implementation of the mutual 
information maximization principle in the case of units with continuous outputs. 
They assume that the outputs of units 1,2 are Gaussian  and perform steepest 
descent to maximize the symmetrized form of the mutual information between $ 
and S2: 
v(s) v(s2) 
Ism;s2 = log V(S - S2) +log V(S - S2): log V(S)+log V(S2)-2 log V(S - S2) 
(5) 
where V() stands for variance over the set of inputs. They assume that the difference 
between the two outputs can be expressed as uncorrelated additive noise, S = 
S2 + N. We reformalize their criterion as maximizing EsH(V(S2), V(N)) where 
E,H(V(S2),V(N))=log{V(S2)+ V(N)}+logV(S2)-21ogV(N). (6) 
For our scheme we make similar assumptions about the distributions of S and 
S2. We see that < logPDD(S,S2) >= -log{< S >< S 2 > - < SS2 >2} __ 
-log{V(S2)V(N)} (since < SS2 >:< (S2 + N)S2 >= V(S2) and < S 2 >= 
V(S2) + V(N)). Using the prior distribution Pp(S,S2)   -r(s-s2) our criterion 
corresponds to minimizing Ersx(V(S2), V(N)) where: 
EYSX(V(S2), V(N)): -log V(S2)- log V(N)q- rV(N). 
(7) 
It is easy to see that maximizing EBn(V(S2), V(N)) will try to make V(S2) as 
large as possible and force V(N) to zero (recall that, by definition, V(N) _> 0). 
Minimizing our energy will try to make V($2) as large as possible and will force 
V(N) to 1/r (recall that r appears as the inverse of the variance of a Gaussian 
prior distribution for S - S2 so making r large will force the prior distribution to 
approach 5($ - $2).) Thus, provided r is very large, our method will have the 
same effect as Becker and Hinton's. 
5 Application to Linear Filtering. 
We now describe an analysis of these ideas for the case of linear filtering. 
approach will be contrasted with the traditional Wiener filter approach. 
We assume for simplicity that these Gaussians have zero mean. 
Our 
Bayesian Self-Organization 1007 
Consider a process of the form D(): E()+N() where D() denotes the input to 
the system, E() is the true signal which we would like to predict, and N() is the 
noise corrupting the signal. The resulting Wiener filter Aw () has fourier transform 
fiw - s,z/(z,z + N,N) where z,z and N,N are the power spectrum of the 
signal and the noise respectively. 
By contrast, let us extract a linear filter Ab by applying our criterion. In the case 
that the noise and signal are independent zero mean Gaussian distributions this 
filter can be calculated explicitly (Yuille et al, 1992). It has fourier transform with 
squared magnitude given by ]fibl 2 - s,s/(s,s + Ndv). Thus our filter can be 
thought of as the square root of the Wiener filter. 
It is important to realize that although our derivation assumed additive Gaussian 
noise our system would not need to make any assumptions about the noise distribu- 
tion. Instead our system would merely need to assume that the filter was linear and 
then would automatically obtain the "correct" result for the additive Gaussian noise 
case. We conjecture that the system might detect non-Gauusian noise by finding it 
impossible to get zero Kullback-Liebler distance with the linear ansatz. 
6 Conclusion 
The goal of this paper was to introduce a Bayesian approach to self-organization 
using prior assumptions about the signal as an organizing principle. We argued that 
it was a natural generalization of the criterion of maximizing mutual information 
assuming spatial coherence (Becker and Hinton, 1992). Using our principle it should 
be possible to self-organize Bayesian theories of vision, assuming that the priors 
are known, the network is capable of representing the appropriate functions and 
the learning algorithm converges. There will also be problems if the probability 
distributions of the true signal and the distractor are too similar. 
If the prior is not correct then it may be possible to detect this by evaluating 
the goodness of the Kullback-Leibler fit after learning 2. This suggests a strategy 
whereby the system increases the complexity of the priors until the Kullback-Leibler 
fit is sufficiently good (this is somewhat similar to an idea proposed by Mumford 
(Mumford, 1992)). This is related to the idea of competitive priors in vision (Clark 
and Yuille, 1990). One way to implement this would be for the prior probability 
itself to have a set of adjustable parameters that would enable it to adapt to different 
classes of scenes. We are currently (Yuille et al, 1992) investigating this idea and 
exploring its relationships to Hidden Markov Models. 
Ways to implement the theory, using variants of stochastic learning, were described. 
We sketched the relation to Becker and Hinton. 
As an illustration of our approach we derived the filter that our criterion would give 
for filtering out additive Gaussian noise (possibly the only analytically tractable 
case). This had a very interesting relation to the standard Wiener filter. 
2This is reminiscent of Barlow's suspicious coincidence detectors (Barlow, 1993), where 
we might hope to determine if two variables x & y are independent or not by calculating 
the Kullback-Leibler distance between the joint distribution P(x,y) and the product of 
the individual distributions P(x)P(y). 
1008 Yuille, Smimakis, and Xu 
Acknowledgement s 
We would like to thank DARPA for an Air Force contract F49620-92-J-0466. Con- 
versations with Dan Kersten and David Mumford were highly appreciated. 
References 
J.J. Atick and A.N. Redlich. "Towards a Theory of Early Visual Processing". 
Neural Computation. Vol. 2, No. 3, pp 308-320. Fall. 1990. 
H.B. Barlow. "What is the Computational Goal of the Neocortex?" To appear in: 
Large scale neuronal theories of the brain. Ed. C. Koch. MIT Press. 1993. 
S. Becker and G.E. Hinton. "Self-organizing neural network that discovers surfaces 
in random-dot stereograms". Nature, Vol 355. pp 161-163. Jan. 1992. 
J.J. Clark and A.L. Yuille. Data Fusion for Sensory Information Processing 
Systems. Kluwer Academic Press. Boston/Dordrecht/London. 1990. 
R. Held. "Visual development in infants". In The encyclopedia of neurosclence, 
vol. 2. Boston: Birkhauser. 1987. 
K. Hornik, S. Stinchocombe and H. White. "Multilayer feed-forward networks are 
universal approximators". Neural Networks 4, pp 251-257. 1991. 
D. Kersten, A.J. O'Toole, M.E. Sereno, D.C. Knill and J.A. Anderson. "Associative 
learning of scene parameters from images". Optical Society of America, Vol. 26, 
No. 23, pp 4999-5006. 1 December, 1987. 
D. Marr. Vision. W.H. Freeman and Company. San Francisco. 1982. 
D. Mumford. "Pattern Theory: a unifying perspective". Dept. Mathematics 
Preprint. Harvard University. 1992. 
K. Nakayama and S. Shimojo. "Experiencing and Perceiving Visual Surfaces". 
Science. Vol. 257, pp 1357-1363.4 September. 1992. 
T. Poggio, V. Torre and C. Koch. "Computational vision and regularization the- 
ory". Nature, 317, pp 314-319. 1985. 
A.L. Yuille, S.M. Smirnakis and L. Xu. "Bayesian Self-Organization". Harvard 
Robotics Laboratory Technical Report. 1992. 
PART IX 
SPEECH AND SIGNAL 
PROCESSING 
