Features as Sufficient Statistics 
D. Geiger * 
Department of Computer Science 
Courant Institute 
and Center for Neural Science 
New York University 
geigercs. nyu. edu 
A. Rudra ? 
Department of Computer Science 
Courant Institute 
New York University 
archi,cs. nyu. edu 
L. Maloney; 
Departments of Psychology and Neural Science 
New York University 
ltmcns. nyu. edu 
Abstract 
An image is often represented by a set of detected features. We get 
an enormous compression by representing images in this way. Fur- 
thermore, we get a representation which is little affected by small 
amounts of noise in the image. However, features are typically 
chosen in an ad hoc manner. We show how a good set of fea- 
tures can be obtained using sufficient statistics. The idea of sparse 
data representation naturally arises. We treat the 1-dimensional 
and 2-dimensional signal reconstruction problem to make our ideas 
concrete. 
I Introduction 
Consider an image, I, that is the result of a stochastic image-formation process. The 
process depends on the precise state, f, of an environment. The image, accordingly, 
contains information about the environmental state f, possibly corrupted by noise. 
We wish to choose feature vectors (I) derived from the image that summarize this 
information concerning the environment. We are not otherwise interested in the 
contents of the image and wish to discard any information concerning the image 
that does not depend on the environmental state f. 
*Supported by NSF grant 5274883 and AFOSR grants F 49620-96-1-0159 and F 49620- 
96-1-0028 
Partially supported by AFOSR grants F 49620-96-1-0159 and F 49620-96-1-0028 
tSupported by NIH grant EY08266 
Features as Sufficient Statistics 795 
We develop criteria for choosing sets of features (based on information theory and 
statistical estimation theory) that extract from the image precisely the information 
concerning the environmental state. 
2 Image Formation, Sufficient Statistics and Features 
As above, the image I is the realization of a random process with distribution 
Pznvronment(f). We are interested in estimating the parameters f of the environ- 
mental model given the image (compare [4]). We assume in the sequel that f, the 
environmental parameters, are themselves a random vector with known prior dis- 
tribution. Let b(I) denote a feature vector derived from the the image I. Initially, 
we assume that b(I) is a deterministic function of I. 
For any choice of random variables, X, Y, define[2] the mutual information of X 
and Y to be M(X;Y) = x,yP(X,Y)lg 
P(X)P(Y)' The information about 
the environmental parameters contained in the image is then M(f;I), while the 
information about the environmental parameters contained in the feature vector 
b(I) is then M(f; qb(I)). As a consequence of the data processing inequality[2], 
M(f;c(I)) _< M(f;I). 
A vector b(I), 'of features is defined to be su]ficient if the inequality above is an 
equality. We will use the terms feature and statistic interchangeably. The definition 
of a sufficient feature vector above is then just the usual definition of a set of jointly 
su.ficient statistics[2]. 
To summarize, a feature vector q(I) captures all the information about the envi- 
ronmental state parameters f precisely when it is sufficent. 1 
Graded Sufficiency: A feature vector either is or is not sufficient. For every 
possible feature vector b(I), we define a measure of its failure to be sufficent: 
Surf(b(I)) = M (f; I) - M (f; q(I)). This su.ficency measure is always non-negative 
and it is zero precisely when b is sufficient. We wish to find feature vectors b(I) 
where Suff(q(I)) is close to 0. We define b(I) to be e-sufficient if Suff((I)) _ e. In 
what follows, we will ordinarily say sufficient, when we mean e-sufficient. 
The above formulation of feature vectors as jointly sufficient statistics, maximiz- 
ing the mutual information, M(f, qb(I)), can be expressed as the Kullback-Leibler 
distance between the conditional distributions, P(f[I) and 
E[D(P(fII) II P(flqb(I)))] = M(f;I)- M(f;qb(I)), (1) 
where the symbol E denotes the expectation with respect to I, D denotes the 
Kullback-Leibler (K-L) distance, defined by D(fllg)- f(x)log(f(x)/g(x)) 
Thus, we seek feature vectors b(I) such that the conditional distributions, P(flI) 
and P(fl((I)) in the K-L sense, averaged across the set of images. However, this 
optimization for each image could lead to over-fitting. 
3 Sparse Data and Sufficient Statistics 
The notion of sufficient statistics may be described by how much data can be re- 
moved without increasing the K-L distance between P(f[4(I)) and P(f]I). Let us 
An information-theoretic framework has been adopted in neural networks by others; 
e.g., [5] [9][6] [1][8]. However, the connection between features and sufficiency is new. 
2We won't prove the result here. The proof is simple and uses the Markov chain 
property to say that P(f, I, qb(I)) = P(I, qb(I))P(flI, qb(I)) = P(I)P(flI). 
796 D. Geiser, A. Rudra and L. T. Maloney 
formulate the approach more precisely, and apply two methods to solve it. 
3.1 Gaussian Noise Model and Sparse Data 
We are required to construct P(f]I) and P(f]b(I)). Note that according to Bayes' 
rule P(f]b(I)) -- P(b(I)]f)P(f)/P(b(I)). We will assume that the form of 
the model P(f) is known. In order to obtain P(q3(I)]f) we write P(b(I)]f) - 
Computing P(fl q(I)): Let us first assume that the generative process of the image 
I, given the model f, is Gaussian i.i.d. , i.e., P(I]f) = 
where i = 0, 1, ..., N- 1 are the image pixel index for an image of size N. Fur- 
ther, P(Ii]fi) is a function of (Ii - fi) and Ii varies from -co to +co, so that the 
normalization constant does not depend on fi. Then, P(f]I) can be obtained by 
normalizing P(f)P(I]f). 
i 
where Z is the normalization constant. 
Let us introduce a binary decision variable si = O, 1, which at every image pixel i 
decides if that image pixel contains "important" information or not regarding the 
model f. Our statistic  is actually a (multivariate) random variable generated 
from I according to 
/(1 - si) e- 
This distribution gives i = Ii with probability 1 (Dirac delta function) when si = 0 
(data is kept) and gives bi uniformly distributed otherwise (si = 1, data is removed). 
We then have 
 - - 2.= (li-O) 
= II o. s,) -,-. 
i 2ro' e 
The conditional distribution of b on f satisfies the properties that we mentioned in 
connection with the posterior distribution of f on I. Thus, 
rs(f[b) = (1/Zs)r(f)(He-2'%"A? (fi-ll)2(1-si)) (2) 
i 
where Z, is a normalization constant. 
It is also plausible to extend this model to non-Gaussian ones, by simply modifying 
the quadratic term (fi - Ii)  and keeping the sparse data coefficient (1 - si). 
3.2 Two Methods 
We can now formulate the problem of finding a feature-set, or finding a sufficient 
statistics, in terms of the variables si that can remove data. More precisely, we can 
find s by minimizing 
Features as Sufficient Statistics 797 
Z(s,I) = D(P(fl I) IIPs(fl(I))) + A (x - 
i 
(3) 
It is clear that the K-L distance is minimized when si = 0 everywhere and all the 
data is kept. The second term is added on to drive the solution towards a minimal 
sufficient statistic, where the parameter A has to be estimated. Note that, for A 
very large, all the data is removed (si = 1), while for A = 0 all the data is kept. 
We can further write (3) as 
Z(s,I) -  P(flI) log(P(flI)/Ps(fl(I))) + A (! - sd 
! 
( ----l--(!i-ll)2(1--(1--sl))) 
- _P(flI)log (Z/Z) 11 
_ , + 
! i 
s, 
= to9 _ ,[ _, (f, _ ,)2)] +  (x _ s,). 
i i 
where Ep[.] denotes the expectation taken with respect to the distribution P. 
If we let si be a continuous variable the minimum E(s, I) will occur when 
as _ (r,[(f _ i)2]_ r[(A - )]) - A (4) 
0 - Os'--. ' 
We note that the Hessian matrix 
c92E -- Ep.[(fi - /-i)2(fj _/-j)2]_ EPo[(fi - Ii)2]Ep.[(f - Ij)2], (5) 
Hs[i,j]- OsiOsy 
is a correlation matrix, i.e., it is positive semi-definite. Consequently, E(s) is convex. 
Continuation Method on A: 
In order to solve for the optimal vector s we consider the continuation method on 
the parameter A. We know that s = 0, for A = 0. Then, taking derivatives of (4) 
with respect to A, we obtain 
OE Os OE Osj 
E OSiOSj OA O/OSi -- 0 -- O/ -- E H-l[i'J]' 
It was necessary the Hessian to be invertible, i.e., the continuation method works 
because E is convex. The computations are expected to be mostly spent on esti- 
mating the Hessian matrix, i.e., on computing the averages Zp,[(fi- Ii)a(fj - I)2], 
Zp,[(fi- Ii)2], and Zpo[(fj -iy)2]. Sometimes these averages can be exactly com- 
puted, for example for one dimensional graph lattices. Otherwise these averages 
could be estimated via Gibbs sampling. 
The above method can be very slow, since these computations for Hs have to be 
repeated at each increment in A. We then investigate an alternative direct method. 
A Direct Method: 
Our approach seeks to find a "large set" of si = I and to maintain a distribution 
Ps(flb(I)) close to P(flI), i.e., to remove as many data points as possible. For this 
798 D. Geiser, A. Rudra and L T. Maloney 
o 
o 
,=, 0'4 h 
10 
3O 4O 5O 
& vdu# lot I, fred$4 
i 0.1s i 
60 0 10 
o ,o ,o ,o ,o (b) 
(a) 
20 30 4 0 0 
20 30 4 0 0 
 0.7  , 
0 10 20 30 40 50 60 
Figure 1: (a). Complete results for step edge showing the image, the effective 
variance and the computed s-value (using the continuation method). (b) Complete 
results for step edge with added noise. 
goal, we can investigate the marginal distribution 
where Pe//(fi) is an effective marginal distribution that depends on all the other 
values of I besides the one at pixel i. 
How to decide if si - 0 or si = I directly from this marginal distribution 
The entropy of the first term Hz (fi) - f dfiP&(fi)logP& (fi) indicates how much 
fi is conditioned by the data. The larger the entropy the less the data constrain 
fi, thus, there is less need to keep this data. The second term entropy He/l(fi) -- 
f dfiPel/(fi)logPel/(fi) works the opposite direction. The more/i is constrained 
by the neighbors, the lesser the entropy and the lesser the need to keep that data 
point. Thus, the decision to keep the data, si - 0, is driven by minimizing the 
"data" entropy H (fi) and maximizing the neighbor entropy Hell (fi). The relevant 
quantity is Hell(fi)-H& (fi). When this is large, the pixel is kept. Later, we will see 
a case where the second term is constant, and so the effective entropy is maximized. 
For Gaussian models, the entropy is the logarithm of the variance and the appro- 
priate ratio of variances may be considered. 
4 Example: Surface Reconstruction 
To make this approach concrete we apply to the problem of surface reconstruction. 
First we consider the I dimensional case to conclude that edges are the important 
features. Then, we apply to the two dimensional case to conclude that junctions 
followed by edges are the important features. 
Features as Sufficient Statistics 799 
4.1 1D Case: Edge Features 
Various simplifications and manipulations can be applied for the case that the model 
f is described by a first order Markov model, i.e., P(f) = 1-[i Pi(fi, fi_l).Then the 
posterior distribution is 
1 
P(fl) = II e , 
i 
where pi are smoothing coefficients that may vary from pixel to pixel according to 
 with and 
how much intensity change occurs ar pixel i, e.g., p/ - Pl+p(,__)2 P 
p to be estimated. We have assumed that the standard deviation of the noise is 
homogeneous, to simplify the calculations and analysis of the direct method. Let 
us now consider both methods, the continuation one and the direct one to estimate 
the features. 
Os5 
Continuation Method: Here we apply ox = Y'.i H-l[i,J] by computing Hs[i,j], 
given by (5), straight forwardly. We use the Baum-Welch method [2] for Markov 
chains to exactly compute Ep. [(f i-Ii)2 (f j-Ij )2], Ep. [(fi-Ii)2], and Ep. [(f j-Ij )2]. 
The final result of this algorithm, applied to a step-edge data (and with noise added) 
is shown in Figure 1. Not surprisingly, the edge data, both pixels, as well as the 
data boundaries, were the most important data, i.e., the features. 
Direct Method: We derive the same result, that edges and boundaries are the 
most important data via an analysis of this model. We use the result that 
f e_(/' 
P(f[I) - dfo ...df-i df+l ...dfr-1 P(flI) - z'e -2'--k('-'? , 
where A v is obtained recursively, in log2 N steps (for simplicity, we are assuming 
N to be an exact power of 2), as follows 
K K 
Ai+Ki+K 
K 
where K  {1,2,4,8,...,N}, pK 
K K 
+ + + (6) 
- 
x, r, +,, r,_.,,+ ,.  = 1/(2a2),  1 
),2: +2: ++: , and X i Pi = pi , I'i = Ii i. 
The effective variance is given by vareLt(fi) = 1/(2A v) while the data variance is 
given by var(fi) = er 2. Since var(fi) does not depend on any pixel i, maximizing 
the ratio vare.tf/var (as the direct method suggested) as equivalent to maximizing 
either the effective variance, or the total variance (see figure(1). 
Thus, the lower is A the lower is si. We note that A/K increases with K, and p/K 
decreases with K. Consequently A K increases less and less as K increases. In a 
perturbative sense A most contribute to A v and is defined by the two neighbors 
values Pi and fii+l, i.e., by the edge information. The larger are the intensity edges 
the smaller are Pi and therefore, the smaller will A be. Moreover, A/N is mostly 
defined by A (in a perturbative sense, this is where most of the contribution comes). 
Thus, we can argue that the pixels i with intensity edges will have smaller values 
for A v and therefore are likely to have the data kept as a feature (si - 0). 
4.2 2D Case: Junctions, Corners, and Edge Features 
Let us investigate the two dimensional version of the 1D problem for surface recon- 
struction. Let us assume the posterior 
--.) +m )21, 
P(flI) = e -I--- ('-')2+"7(' (I'-f"- 
800 D. Geiser, A. Rudra and L. T. Maloney 
where pjh are the smoothing coefficients along vertical and horizontal direction, 
that vary inversely according to the V'I along these direction. We can then approx- 
imately compute (e.g., see [3]) 
1 N N 2 
e- (f-r o) 
where, analogously to the 1D case, we have 
i,j-Ktij i,j+Ki,j+K 
Xi,j-K Xi,j+K 
K  v,K  v,K 
- i-K'jtiJ ' i+K'jgi+K'J 
xiK_K,j K 
Xi+K,j 
]ij i,+K 
K K h,K v,K h,K v,K /,2K ,K ,K 
where Xi,j = Aq + Pi + ij - i,j+K - i+K,j, and .. = 
The larger is the effective variance at one site (i,j), the smaller is l v, the more 
likely that image portion to be a feature. The larger the intensity gradient along 
h,. The smaller is pi% 'v the smaller will be contribution 
h, v, at (i,j), the smaller 
to 12. In a perturbative sense ([3]) 12 makes the largest contribution to l v. Thus, 
at one site, the more intensity edges it has the larger will be the effective variance. 
Thus, T-junctions will produce very large effective variances, followed by corners, 
followed by edges. These will be, in order of importance, the features selected to 
reconstruct 2D surfaces. 
5 Conclusion 
We have proposed an approach to specify when a feature set has sufficient infor- 
mation in them, so that we can represent the image using it. Thus, one can, in 
principle, tel] what kind of feature is likely to be important in a given model. Two 
methods of computation have been proposed and a concrete analysis for a simple 
surface reconstruction was carried out. 
References 
[1] A. Berger and S. Della Pietra and V. Della Pietra "A Maximum Entropy Approach 
to Natural Language Processing" Computational Linguistics, Vol.22 (1), pp 39-71, 
1996. 
[2] T. Cover and J. Thomas. Elements of Information Theory. Wiley Interscience, New 
York, 1991. 
[3] D. Geiger and J. E. Kogler. Scaling Images and Image Feature via the Renormaliza- 
tion Group. In Proc. IEEE Conf. on Computer Vision 4 Pattern Recognition , New 
York, NY, 1993. 
[4] G. Hinton and Z. Ghahramani. Generative Models for Discovering Sparse Distributed 
Representations To Appear Phil. Trans. of the Royal Society B, 1997. 
[5] R. Linsker. Self-Organization in a Perceptual Network. Computer, March 1988, 
105-117. 
[6] J. Principe, U. of Florida at Gainesville Personal Communication 
[7] T. Sejnowski. Computational Models and the Development of Topographic Projec- 
tions Trends Neurosci, 10, 304-305. 
[8] S.C. Zhu, Y.N. Wu, D. Mumford. Minimax entropy principle and its application to 
texture modeling Neural Computation 1996 B. 
[9] P. Viola and W.M. Wells III. "Alignment by Maximization of Mutual Information". 
In Proceedings of the International Conference on Computer Vision. Boston. 1995. 
