A Phase Space Approach to Minimax 
Entropy Learning and the Minutemax 
Approximations 
James M. Coughlan 
Smith-Kettlewell Inst. 
San Francisco, CA 94115 
A. L. Yuille 
Smith-Kettlewell Inst. 
San Francisco, CA 94115 
Abstract 
There has been much recent work on measuring image statistics 
and on learning probability distributions on images. We observe 
that the mapping from images to statistics is many-to-one and 
show it can be quantified by a phase space factor. This phase 
space approach throws light on the Minimax Entropy technique for 
learning Gibbs distributions on images with potentials derived from 
image statistics and elucidates the ambiguities that are inherent to 
determining the potentials. In addition, it shows that if the phase 
factor can be approximated by an analytic distribution then this 
approximation yields a swift "Minutemax" algorithm that vastly 
reduces the computation time for Minimax entropy learning. An 
illustration of this concept, using a Gaussian to approximate the 
phase factor, gives a good approximation to the results of Zhu 
and Mumford (1997) in just seconds of CPU time. The phase 
space approach also gives insight into the multi-scale potentials 
found by Zhu and Mumford (1997) and suggests that the forms of 
the potentials are influenced greatly by phase space considerations. 
Finally, we prove that probability distributions learned in feature 
space alone are equivalent to Minimax Entropy learning with a 
multinomial approximation of the phase factor. 
I Introduction 
Bayesian probability theory gives a powerful framework for visual perception (Knill 
and Richards 1996). This approach, however, requires specifying prior probabilities 
and likelihood functions. Learning these probabilities is difficult because it requires 
estimating distributions on random variables of very high dimensions (for example, 
images with 200 x 200 pixels, or shape curves of length 400 pixels). An important 
762 ,I. M. Coughlan and A. L. Yuille 
recent advance is the Minimax Entropy Learning theory. This theory was developed 
by Zhu, Wu and Mumford (1997 and 1998) and enables them to learn probability 
distributions for the intensity properties and shapes of natural stimuli and clutter. 
In addition, when applied to real world images it has an interesting link to the work 
on natural image statistics (Field 1987), (Ruderman and Bialek 1994), (Olshaussen 
and Field 1996). We wish to simplify Minimax and make the learning easier, faster 
and more transparent. 
In this paper we present a phase space approach to Minimax Entropy learning. This 
approach is based on the observation that the mapping from images to statistics 
is many-to-one and can be quantified by a phase space factor. If this phase space 
factor can be approximated by an analytic function then we obtain approximate 
"Minutemax" algorithms which greatly speed up the learning process. In one version 
of this approximation, the unknown parameters of the distribution to be learned 
are related linearly to the empirical statistics of the image data set, and may be 
solved for in seconds or less. Independent of this approximation, the Minutemax 
framework also illuminates an important combinatoric aspect of Minimax, namely 
the fact that many different images can give rise to the same image statistics. This 
"phase space" factor explains the ambiguities inherent in learning the parameters 
of the unknown distribution, and motivates the approximation that reduces the 
problem to linear algebra. Finally, we prove that probability distributions learned in 
feature space alone are equivalent to Minimax Entropy learning with a multinomial 
approximation of the phase factor. 
2 A Phase Space Perspective on Minimax 
We wish to learn a distribution P(I) on images, where I denotes the set of pixel 
values I(x, y) on a finite image lattice, and each value I(x, y) is quantized to a finite 
set of intensity values. (In fact, this approach is general and applies to any patterns, 
not just images.) We define a set of image statistics b (I), b2(I),..., b$(I), which 
we concatenate as a single vector function (I). If these statistics have empirical 
mean o< (I) > on a dataset of images (we assume a large enough dataset for 
the law of large numbers to apply; see Zhu and Mumford (1997) for an analysis 
of the errors inherent in this assumption) then the maximum entropy distribution 
PM(I) with these empirical statistics is an exponential (Gibbs) distribution of the 
form 
eX.(I) 
PM(I) = , (1) 
z(7,) 
where the potential Y is set so that < (I) >M = o 
In summary, the goal of Minimax Learning is to to find an appropriate set of 
image filters for the domain of interest (i.e. maximally informative filters) and to 
estimate  given o Extensive computation is required to determine ; the phase 
space approach to Minimax Lear'ning motivates approximations that make X easy 
to estimate. 
2.1 Image Histogram Statistics 
The statistics we consider (following Zhu, Wu and Mumford (1997, 1998)) are de- 
fined as histograms of the responses of one or more filters applied across an entire 
image. Consider a single filter f (linear or non-linear) with response fx(I) centered 
at position x in the image. Without loss of generality, we will assume the filter has 
quantized integer responses from I through fma. 
A Phase Space Approach to Minimax Entropy Learning and the Minutemax Approximations 763 
For notational convenience we transform the filter response fx(I) to a binary repre- 
sentation x (I), defined as a column vector with fmax components: 
where index z ranges from I through f,a. This vector is composed of all zeros 
except for the entry corresponding to the filter response, which is set to one. The 
image statistics vector is then a histogram vector defined as the average of the 
1 
x(I)'s over all N pixels: (I) = W Yx x(I). The entries in (I) then sum to 1. 
(We can generalize to the case of multiple filters f(), f(2),..., f(m), as detailed in 
Coughlan and Yuille (1999).) 
2.2 The Phase Factor 
The original Minimax distribution PM (I) induces a distribution PM () on the statis- 
tics themselves, without reference to a particular image: 
eA.o 
PM(O) =  55o,$(i)PM(I) = g(o) Z(X) 
I 
(2) 
where g() is a combinatoric phase space factor, with a corresponding normalized 
combinatoric distribution (), defined by: 
g(&) =  5$o,$(),and () = g()lQtV, (3) 
I 
where the phase space factor g() counts the number of images I having statistics 
. N is the number of pixels and Q is the number of pixel intensity levels, i.e. 
QV is the total number of possible images I. It should be emphasized that the 
phase factor depends only on the set of filters chosen and is independent of the true 
distribution P(I). Thus the phase factor can be computed offline, independent of 
the image data set. 
In this paper we will discuss two useful approximations to g()' a Gaussian ap- 
proximation, which yields the swift approximation for learning, and a multinomial 
approximation, which establishes a connection between Minimax and standard fea- 
ture learning. 
2.3 The Non-Uniqueness of the Potential 
Given a set of filters and their empirical mean statistics  is the potential X uniquely 
specified? Clearly, any solution for X may be shifted by an additive constant 
(Xi --> Xi = Xi + k for all i), yielding a different normalization constant Z(;) 
but preserving PM(I). In this section we show that other, non-trivial ambiguities 
in  which preserve PM(I) can exist, stemming from the fact that some values of 
 are inconsistent with every possible image I and hence never arise (in any possi- 
ble image dataset). These "intrinsic" ambiguities are inherent to Minimax and are 
independent of the true distribution P(I). We will also discuss a second type of 
possible ambiguity which depends on the characteristics of the image dataset used 
for learning. 
We can uncover the intrinsic ambiguities in  by examining the covariance C of 
(). (See Coughlan and Yuille (1999) for details on calculating the mean 5' and 
covariance C for any set of linear filters or non-linear filters that are scalar functions 
764 J. M. Coughlan and A. L. Yuille 
of linear filters.) Defining the set of all possible statistics values (I) = ($' g()  0), 
the null space of C reflects degeneracy (i.e. flatness) in (I). The following theorem, 
proved in Coughlan and Yuille (1999), shows that  is determined only up to a 
hyperplane whose dimension is the nullity of C. 
Theorem 1 (Intrinsic Ambiguity in ). Cfi = 0 if and only if e(Y'+)'$(I)/Z(+ 
fi) and eL$()/Z() are identical distributions on I. 
In addition to this intrinsic ambiguity in , it is also possible that different values of 
 may yield distinct distributions which nevertheless have the same mean statistics 
(  ) on the image dataset. (As shown in Coughlan and Yuille (1999), there is a 
convex set of distributions, of which the true distribution P(I) is a member, which 
share the same mean statistics (  ).) This second kind of ambiguity stems from 
the fact that the mean statistics convey only a fraction of the information that 
is contained in the true distribution P(I). To resolve this second ambiguity it is 
necessary to extract more information from the image data set. The simplest way 
to achieve this is to use a larger (or more informative) set of filters to lower the 
entropy of PM(I) (this topic is discussed in more detail in Zhu, Wu and Mumford 
(1997, 1998), Coughlan and Yuille (1999)). Alternatively, one can extend Minimax 
to include second-order statistics, i.e. the covariance of  in addition to its mean t 
This is an important topic for future research. 
3 The Minutemax Approximations 
We now illustrate the phase space approach by showing that suitable approximations 
of the phase space factor g() make it easy to estimate the potential  given the 
empirical mean t The resulting fast approximations to Minimax Learning are 
called "Minutemax" algorithms. 
3.1 The Gaussian Approximation of g() 
If the phase space factor g() may be approximated as a multi-variate Gaussian 
(see Coughlan and Yuille (1999) for a justification of this approximation) then the 
probability distribution PM() -- g()eL$/Z() reduces to another multi-variate 
Gaussian. (Note that we are making the Gaussian approximation in  space-the 
space of all possible image statistics histograms-and not filter response (feature) 
space.) As we will see, this result greatly simplifies the problem of estimating the 
potential . 
Recall that the mean and covariance of () are denoted by 5' and C, respectively. 
The null space of C has dimension n and is spanned by vectors i (x), i(2)...if(n). 
As discussed in Theorem 1, for all feasible values of  (i.e. all  e (I)) and all fi in 
the null space, fl.  is a constant k. Thus we have that 
n 
g9.u88() c (H 5$.a,,k)e-($"-e")Tcgx($"-e")' (4) 
where the subscript r denotes projection onto the rank of C. Thus P9,ss () oc 
ggauss()e Lq oc (H=x 5q.a,k) e-($"-e")Tcg(q"-e")+L$ Completing the square 
in the exponent yields P9-8() c (Hix 55.,) e-($"-7")C2($"-7") where  
A Phase Space Approach to Minimax Entropy Learning and the Minutemax Approximations 765 
Figure 1: From left to right:  5'and - (as computed by the Gaussian Minutemax 
approximation) for first filter alone. 
is the projection of any  that satisfies = '+ CX. Since P9auss() is a Gaussian 
we have <  >gauss=  =  and so we can write a linear equation relating X and 
It can be shown (Zhu - private communication) that solving this equation is equiv- 
alent to one step of Newton-Raphson for minimization of an appropriate cost func- 
tion. This will fail to be a good approximation if the cost function is highly non- 
quadratic. As explained in Coughlan and Yuille (1999), the Gaussian approximation 
is also equivalent to a second-order perturbation expansion of the partition function 
Z(); higher-order corrections can be made by computing higher-order moments of 
3.2 Experimental Results 
We tested the Gaussian Minutemax procedure on two sets of filters: a single (fine 
scale) image gradient filter OI/Ox, and a set of multi-scale image gradient filters 
defined at three scales, similar to those used by Zhu and Mumford (1997). In both 
sets, the fine scale gradient filter is linear with kernel (1,-1), representing a dis- 
cretization of O/Ox. In the second set, the medium scale filter kernel is (U2,-U2)/4 
and the coarse scale kernel is (U4, -U4)/16, where Un denotes the n x n matrix of all 
ones. The responses of the medium and coarse filters were rounded (i.e. quantized) 
to the nearest integer, thus adding a non-linearity to these filters. Finally, o/was 
measured on a data set of over 100 natural images; the fine scale components of o/ 
are shown in the first panel of Figure (1) and were empirically very similar to the 
medium and coarse scale components. 
A  that solves d= '+ C is shown in the third panel of Figure (1) for the first 
filter (along with ' in the second panel) and in the three panels of Figure (2) for 
the multi-scale filter set. The form of  is qualitatively similar to that obtained by 
Zhu and Mumford (1997) (bearing in mind that Zhu disregarded any filter responses 
with magnitude above Q/2, i.e. his filter response range is half of ours). In addition, 
the eigenvectors of C with small eigenvalues are large away from the origin, so one 
should not trust the values of the potentials there (obtained by any algorithm). 
Zhu and Mumford (1997) report interactions between filters'applied at different 
scales. This is because the resulting potentials appear different than the potential 
at the fine scale even though the histograms appear similar at all scales. We argue, 
however, that some of this "interaction" is due to the different phase factors at 
different scales. In other words the potentials would look different at different scales 
even if the empirical histograms were identical because of differing phase factors. 
3.3 The Multinomial Approximation of g() 
Many learning theories simply make probability distributions on feature space. How 
do they differ from Minimax Entropy Learning which works on image space? By 
766 J. M. Coughlan and A. L. Yuille 
Figure 2: From left to right: the fine, medium and coarse components of -, as 
computed by the Gaussian Minutemax approximation. 
Figure 3: Left to right: t if, and - as given by multinomial approximation for the 
O/Ox filter at fine scale. 
examining the phase factor we will show that the two approaches are not identical 
in general. The feature space learning ignores the coupling between the filters 
which arise due to how the statistics are obtained. More precisely, the probability 
distribution obtained on feature space, PF, is equivalent to the Minimax distribution 
PM if, and only if, the phase factor is multinomial. 
We begin the analysis by considering a single filter. As before we define the com- 
binatoric mean E' = y]$(). The multinomial approximation of 0() is equiv- 
alent to assuming that the combinatoric frequencies of filter responses are inde- 
pendent from pixel to pixel. Since the combinatoric frequency of filter response 
j E {1, 2,..., f,ax} is cj and there are Nc)j pixels with response j, we have: 
N 
(NqS) ! 
(S) 
using Pmuu(& or gmuu(&e x'$. Therefore Pmuu(& is also a multinomial. Shifting 
the ,kj's by an appropriate additive constant, we can make the constant of propor- 
tionality in the above equation equal to 1. In this case we have < c)j >rnult = cje Aj/N 
and ,kj = Nlog(dj/cj) by setting < cfj >mutt to the empirical mean dj. 
Note that if any component dj of the empirical mean is close to 0 then by the 
previous equation any small perturbations in dj (e.g. from sampling error) will 
yield large changes in ,kj, making the estimate of that component unstable. 
We can generalize the multinomial approximation of g(q) to the multiple filter 
case merely by factoring Omult() into separate multinomials, one for each filter. 
Of course, this approximation neglects all interactions among filters (and among 
pixels). 
A Phase Space Approach to Minimax Entropy Learning and the Minutemax Approximations 767 
3.4 The Multinomial Approximation and Feature Learning 
The connection between the multinomial approximation and feature learning is 
straightforward once we consider a distribution on the feature vector f This dis- 
tribution (denoted PF for "feature") is constructed assuming independent filter 
responses from pixel to pixel and with statistics matching the empirical mean  
P,(f) - 1-I/=x d(f), where fi denotes the filter response at pixel i. Then it follows 
that PF() is a multinomial: PF() ylf.a 7qj N! 
= lj=x d l-lf,a . Since dj = cje xj/N 
 j=l (Nqi)! ' 
we have our main result that Pr (& = P.,u (&. 
4 Conclusion 
The main point of this paper is to introduce the phase space factor to quantify the 
mapping between images and their feature statistics. This phase space approach 
can: (i) provide fast approximate "Minutemax" algorithms, (ii) clarify the relation- 
ship between probability distributions learned in feature and image space, and (iii) 
to determine intrinsic ambiguities in the  potentials. 
Acknowledgements 
We acknowledge stimulating discussions with Song Chun Zhu. Funding was pro- 
vided by the Smith-Kettlewell Institute Core Grant and the Center for Imaging 
Sciences ARO grant DAAN04-95-1-0494. 
References 
Coughlan, J.M. and Yuille, A.L. "The Phase Space of Minimax Entropy Learning". 
In preparation. 1999. 
Field, D. J. "Relations between the statistics of natural images and the response 
properties of cortical cells". Journal of the Optical Society 4,(12), 2379-2394. 1987. 
D.C. Knill and W. Richards. (Eds). Perception as Bayesian Inference. Cam- 
bridge University Press. 1996. 
Olshausen, B. A. and Field, D. J. "Emergence of simple-cell receptive field properties 
by learning a sparse code for natural images". Nature. 381,607-609. 1996. 
B.D. Ripley. Pattern Recognition and Neural Networks. Cambridge Univer- 
sity Press. 1996. 
Ruderman, D. and Bialek, W. "Statistics of Natural Images: Scaling in the Woods". 
Physical Review Letters. 73, Number 6,(8 August 1994), 814-817. 1994. 
S.C. Zhu, Y. Wu, and D. Mumford. "Minimax Entropy Principle and Its Applica- 
tion to Texture Modeling". Neural Computation. Vol. 9. no. 8. Nov. 1997. 
S.C. Zhu and D. Mumford. "Prior Learning and Gibbs Reaction-Diffusion". IEEE 
Trans. on PAMI vol. 19, no. 11. Nov. 1997. 
S-C Zhu, Y-N Wu and D. Mumford. FRAME: Filters, Random field And Maximum 
Entropy: Towards a Unified Theory for Texture Modeling. Int'l Journal of 
Computer Vision 27(2) 1-20, March/April. 1998. 
