Learning a Continuous Hidden Variable 
Model for Binary Data 
Daniel D. Lee 
Bell Laboratories 
Lucent Technologies 
Murray Hill, NJ 07974 
ddlee@bell-labs.com
Haim Sompolinsky 
Racah Institute of Physics and 
Center for Neural Computation 
Hebrew University 
Jerusalem, 91904, Israel 
haim@fiz.huji.ac.il
Abstract 
A directed generative model for binary data using a small number of hidden continuous units is investigated. A clipping nonlinearity distinguishes the model from conventional principal components analysis. The relationships between the correlations of the underlying continuous Gaussian variables and the binary output variables are utilized to learn the appropriate weights of the network. The advantages of this approach are illustrated on a translationally invariant binary distribution and on handwritten digit images.
Introduction 
Principal Components Analysis (PCA) is a widely used statistical technique for representing data with a large number of variables [1]. It is based upon the assumption that although the data is embedded in a high dimensional vector space, most of the variability in the data is captured by a much lower dimensional manifold. In particular for PCA, this manifold is described by a linear hyperplane whose characteristic directions are given by the eigenvectors of the correlation matrix with the largest eigenvalues. The success of PCA and closely related techniques such as Factor Analysis (FA) and PCA mixtures clearly indicates that much real world data exhibits the low dimensional manifold structure assumed by these models [2, 3].
However, the linear manifold structure of PCA is not appropriate for data with binary valued variables. Binary values commonly occur in data such as computer bit streams, black-and-white images, on-off outputs of feature detectors, and electrophysiological spike train data [4]. The Boltzmann machine is a neural network model that incorporates hidden binary spin variables, and in principle, it should be able to model binary data with arbitrary spin correlations [5]. Unfortunately, the
Figure 1: Generative model for N-dimensional binary data using a small number 
P of continuous hidden variables. 
computational time needed for training a Boltzmann machine renders it impractical 
for most applications. 
In these proceedings, we present a model that uses a small number of continuous hidden variables rather than hidden binary variables to capture the variability of binary valued visible data. The generative model differs from conventional PCA because it incorporates a clipping nonlinearity. The resulting spin configurations have an entropy related to the number of hidden variables used, and the resulting states are connected by small numbers of spin flips. The learning algorithm is particularly simple, and is related to PCA by a scalar transformation of the correlation matrix.
Generative Model 
Figure 1 shows a schematic diagram of the generative process. As in PCA, the model assumes that the data is generated by a small number P of continuous hidden variables yj. Each of the hidden variables is assumed to be drawn independently from a normal distribution with unit variance:

P(y_j) = exp(-y_j^2 / 2) / sqrt(2π).  (1)
The continuous hidden variables are combined using the feedforward weights Wij, 
and the N binary output units are then calculated using the sign of the feedforward 
activations:

x_i = Σ_{j=1}^{P} W_ij y_j,  (2)

s_i = sgn(x_i).  (3)
Since binary data is commonly obtained by thresholding, it seems reasonable that a proper generative model should incorporate such a clipping nonlinearity. The generative process is similar to that of a sigmoidal belief network with continuous hidden units at zero temperature. The nonlinearity will alter the relationship between the correlations of the binary variables and the weight matrix W, as described below.
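The generative process above can be sketched in a few lines of NumPy (an illustrative sketch; the weights W here are arbitrary random values rather than learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 8, 2                      # visible spins, hidden variables
W = rng.standard_normal((N, P))  # arbitrary feedforward weights

def sample_spins(W, num_samples, rng):
    """Draw y ~ N(0, I), then clip the activations x = W y to binary spins."""
    y = rng.standard_normal((num_samples, W.shape[1]))
    x = y @ W.T                  # x_i = sum_j W_ij y_j  (Equation 2)
    return np.sign(x)            # s_i = sgn(x_i)        (Equation 3)

s = sample_spins(W, 1000, rng)
print(s.shape)                   # (1000, 8)
```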
The real-valued Gaussian variables xi are exactly analogous to the visible variables 
of conventional PCA. They lie on a linear hyperplane determined by the span of 
the matrix W, and their correlation matrix is given by: 
C^xx = <x x^T> = W W^T.  (4)
Figure 2: Binary spin configurations si in the vector space of continuous hidden 
variables yj with P = 2 and N = 3.
By construction, the correlation matrix C^xx has rank P, which is much smaller
than the number of components N. Now consider the binary output variables 
si = sgn(xi). Their correlations can be calculated from the probability distribution 
of the Gaussian variables xi: 
(C^ss)_ij = <s_i s_j> = ∫ Π_k dx_k P(x) sgn(x_i) sgn(x_j),  (5)

where

P(x) = (2π)^{-N/2} |C^xx|^{-1/2} exp[-(1/2) x^T (C^xx)^{-1} x].  (6)
The integrals in Equation 5 can be done analytically, and yield the surprisingly simple result:

(C^ss)_ij = (2/π) sin^{-1}[ (C^xx)_ij / sqrt((C^xx)_ii (C^xx)_jj) ].  (7)
Thus, the correlations of the clipped binary variables C^ss are related to the correlations of the corresponding Gaussian variables C^xx through the nonlinear arcsine function. The normalization in the denominator of the arcsine argument reflects the fact that the sign function is unchanged by a scale change in the Gaussian variables.
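The arcsine relation of Equation 7 is straightforward to verify numerically. The following sketch (our own illustration, with arbitrary random weights) compares a Monte Carlo estimate of the spin correlations against the analytic formula:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 4, 2
W = rng.standard_normal((N, P))
Cxx = W @ W.T                                # Gaussian correlations, rank P

# Monte Carlo estimate of the spin correlations <s_i s_j>
x = rng.standard_normal((200000, P)) @ W.T
s = np.sign(x)
Css_mc = s.T @ s / len(s)

# Equation 7: (C^ss)_ij = (2/pi) arcsin(C^xx_ij / sqrt(C^xx_ii C^xx_jj))
d = np.sqrt(np.diag(Cxx))
ratio = np.clip(Cxx / np.outer(d, d), -1.0, 1.0)  # guard rounding at +/-1
Css_eq7 = (2.0 / np.pi) * np.arcsin(ratio)

print(np.max(np.abs(Css_mc - Css_eq7)))      # agrees to Monte Carlo accuracy
```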
Although the correlation matrix C^ss and the generating correlation matrix C^xx are easily related through Equation 7, they have qualitatively very different properties. In general, the correlation matrix C^ss will no longer have the low rank structure of C^xx. As illustrated by the translationally invariant example in the next section, the spectrum of C^ss may contain a whole continuum of eigenvalues even though C^xx has only a few nonzero eigenvalues.
PCA is typically used for dimensionality reduction of real variables; can this model be used for compressing the binary outputs si? Although the output correlations C^ss no longer display the low rank structure of the generating C^xx, a more appropriate measure of data compression is the entropy of the binary output states. Consider how many of the 2^N possible binary states will be generated by the clipping process. The equation x_i = Σ_j W_ij y_j = 0 defines a P-1 dimensional hyperplane in the P-dimensional state space of hidden variables yj; these hyperplanes are shown as dashed lines in Figure 2. These hyperplanes partition the half-space where si = +1 from the
Figure 3: Translationally invariant binary spin distribution with N = 256 units. Representative samples from the distribution are illustrated on the left, while the eigenvalue spectra of C^ss and C^xx are plotted on the right.
region where si = -1. Each of the N spin variables will have such a dividing hyperplane in this P-dimensional state space, and all of these hyperplanes will generically be unique. Thus, the total number of spin configurations si is determined by the number of cells bounded by N dividing hyperplanes in P dimensions. The number of such cells is approximately N^P for N >> P, a well-known result from the theory of perceptrons [6]. To leading order for large N, the entropy of the binary states generated by this process is then given by S = P log N. Thus, the entropy of the spin configurations generated by this model is directly proportional to the number of hidden variables P.
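The cell-counting argument can be checked directly for small N and P. This sketch (our own illustration) samples the hidden space densely, counts the distinct spin configurations produced, and compares against Cover's exact count of cells cut by N generic hyperplanes through the origin in P dimensions, 2 Σ_{k=0}^{P-1} C(N-1, k) [6]:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
N, P = 6, 2
W = rng.standard_normal((N, P))

# count distinct sign patterns reached by dense sampling of y
y = rng.standard_normal((200000, P))
configs = np.unique(np.sign(y @ W.T), axis=0)

# Cover's count of cells bounded by N generic hyperplanes through
# the origin of a P-dimensional space [6]
n_cells = 2 * sum(comb(N - 1, k) for k in range(P))

print(n_cells)       # 12 for N = 6, P = 2 (versus 2^6 = 64 possible states)
print(len(configs))  # approaches n_cells as the sampling becomes dense
```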
How is the topology of the binary spin configurations si related to the PCA manifold structure of the continuous variables xi? Each of the generated spin states is represented by a polytope cell in the P-dimensional vector space of hidden variables. Each polytope has at least P + 1 neighboring polytopes, each related to it by a single spin flip or a small number of flips. Therefore, although the state space of binary spin configurations is discrete, the continuous manifold structure of the underlying Gaussian variables in this model is manifested as binary output configurations with low entropy that are connected by small Hamming distances.
Translationally Invariant Example 
In principle, the weights W could be learned by applying maximum likelihood to this generative model; however, the resulting learning algorithm involves analytically intractable multi-dimensional integrals. Alternatively, approximations based upon mean field theory or importance sampling could be used to learn the appropriate parameters [7]. However, Equation 7 suggests a simple learning rule that is also approximate, but is much more computationally efficient [8]. First, the binary correlation matrix C^ss is computed from the data. Then the empirical C^ss is mapped into the appropriate Gaussian correlation matrix using the nonlinear transformation C^xx = sin(πC^ss/2). This results in a Gaussian correlation matrix where the variances of the individual xi are fixed at unity. The weights W are then calculated using the conventional PCA algorithm. The correlation matrix C^xx is diagonalized, and the eigenvectors with the largest eigenvalues are used to form the columns of
W to yield the best low rank approximation C^xx ≈ W W^T. Scaling the variables xi will result in a correlation matrix C^xx with slightly different eigenvalues but with the same rank.
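The full learning rule is compact enough to state as code. The sketch below (our own illustration of the procedure just described) computes the empirical C^ss, applies the sin(πC^ss/2) map, and diagonalizes; the sanity check recovers the unit-variance correlations of a known generating model:

```python
import numpy as np

def learn_weights(S, P):
    """Learn feedforward weights from binary data S (samples x N)
    via the scalar transformation of the correlation matrix."""
    Css = S.T @ S / len(S)                  # empirical spin correlations
    Cxx = np.sin(np.pi * Css / 2.0)         # map to Gaussian correlations
    evals, evecs = np.linalg.eigh(Cxx)      # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:P]       # keep the P largest
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0.0))

# sanity check: recover the (normalized) correlations of a known model
rng = np.random.default_rng(3)
W_true = rng.standard_normal((16, 2))
S = np.sign(rng.standard_normal((200000, 2)) @ W_true.T)
W_hat = learn_weights(S, P=2)

d = np.sqrt(np.diag(W_true @ W_true.T))
C_norm = (W_true @ W_true.T) / np.outer(d, d)    # unit-variance target
print(np.max(np.abs(W_hat @ W_hat.T - C_norm)))  # small estimation error
```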
The utility of this transformation is illustrated by the following simple example. 
Consider the distribution of N = 256 binary spins shown in Figure 3. Half of the spins are chosen to be positive, and the location of the positive bump is arbitrary under the periodic boundary conditions. Since the distribution is translationally invariant, the correlations C^ss_ij depend only on the relative distance between spins |i - j|. The eigenvectors are the Fourier modes, and their eigenvalues correspond to their overlap with a triangle wave. The eigenvalue spectrum of C^ss is plotted in Figure 3, sorted by rank. In this particular case, the correlation matrix C^ss has N/2 positive eigenvalues with a corresponding range of values.
Now consider the matrix C^xx = sin(πC^ss/2). The eigenvalues of C^xx are also shown in Figure 3. In contrast to the many different eigenvalues of C^ss, the spectrum of the Gaussian correlation matrix C^xx has only two positive eigenvalues, with all the rest exactly equal to zero. The corresponding eigenvectors are a cosine and a sine function. The generative process can thus be understood as a linear combination of the two eigenmodes to yield a sine function with arbitrary phase. This function is then clipped to yield the positive bump seen in the original binary distribution. In comparison with the eigenvalues of C^ss, the eigenvalue spectrum of C^xx makes obvious the low rank structure of the generative process. In this case, the original binary distribution can be constructed using only P = 2 hidden variables, whereas it is not clear from the eigenvalues of C^ss what the appropriate number of modes is. This illustrates the utility of determining the principal components from the calculated Gaussian correlation matrix C^xx rather than working directly with the observable binary correlation matrix C^ss.
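This example can be reproduced in a few lines. The sketch below (our own illustration) builds all N cyclic shifts of the bump pattern, forms C^ss exactly, and confirms that sin(πC^ss/2) has exactly two positive eigenvalues:

```python
import numpy as np

N = 256
base = np.where(np.arange(N) < N // 2, 1.0, -1.0)   # one positive half
S = np.array([np.roll(base, k) for k in range(N)])  # all cyclic shifts

Css = S.T @ S / N                 # triangle-wave correlations C^ss_ij
Cxx = np.sin(np.pi * Css / 2.0)   # equals cos(2*pi*(i - j)/N): rank 2
evals = np.sort(np.linalg.eigvalsh(Cxx))[::-1]

print(int(np.sum(evals > 1e-6)))  # 2 positive eigenvalues (cosine and sine)
```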
Handwritten Digits Example 
This model was also applied to a more complex data set. A large set of 16 x 16 
black and white images of handwritten twos were taken from the US Post Office 
digit database [9]. The pixel means and pixel correlations were directly computed 
from the images. The generative model needs to be slightly modified to account for the non-zero means in the binary outputs. This is accomplished by adding fixed biases θi to the Gaussian variables xi before clipping:

s_i = sgn(θ_i + x_i).  (8)

The biases θi can be related to the means of the binary outputs through the expression:

θ_i = sqrt(2) erf^{-1}(<s_i>).  (9)
This allows the biases to be directly computed from the observed means of the 
binary variables. Unfortunately, with non-zero biases, the relationship between the Gaussian correlations C^xx and binary correlations C^ss is no longer the simple expression found in Equation 7. Instead, the correlations are related by the following integral equation:

(C^ss)_ij = <s_i><s_j> + ∫_0^{(C^xx)_ij} dρ [2 / (π sqrt(1 - ρ^2))] exp[-(θ_i^2 + θ_j^2 - 2ρθ_iθ_j) / (2(1 - ρ^2))].  (10)
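Equations 9 and 10 can be implemented with standard numerical routines. The sketch below is our own illustration (it assumes SciPy is available): it computes a bias from an observed mean, and inverts the integral equation for a single pair of indices by quadrature plus root finding. With zero biases the round trip reduces to the arcsine relation of Equation 7:

```python
import numpy as np
from scipy.special import erf, erfinv
from scipy.integrate import quad
from scipy.optimize import brentq

def bias_from_mean(m):
    """Equation 9: theta_i = sqrt(2) * erfinv(<s_i>)."""
    return np.sqrt(2.0) * erfinv(m)

def css_from_cxx(c, ti, tj):
    """Equation 10: spin correlation <s_i s_j> from c = (C^xx)_ij."""
    def integrand(rho):
        q = (ti**2 + tj**2 - 2 * rho * ti * tj) / (2 * (1 - rho**2))
        return 2.0 / (np.pi * np.sqrt(1 - rho**2)) * np.exp(-q)
    val, _ = quad(integrand, 0.0, c)
    # at c = 0 the spins are independent: <s_i s_j> = <s_i><s_j>
    return erf(ti / np.sqrt(2)) * erf(tj / np.sqrt(2)) + val

def cxx_from_css(css, ti, tj):
    """Invert Equation 10 numerically for (C^xx)_ij."""
    return brentq(lambda c: css_from_cxx(c, ti, tj) - css, -0.999, 0.999)

# zero-bias round trip recovers the arcsine relation of Equation 7
print(round(cxx_from_css(2 / np.pi * np.arcsin(0.5), 0.0, 0.0), 6))  # 0.5
```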
Given the empirical pixel correlations C^ss for the handwritten digits, the integral
in Equation 10 is numerically solved for each pair of indices to yield the appropriate 
Figure 4: Eigenvalue spectra of C^ss and C^xx for handwritten images of twos. The inset shows the P = 16 most significant eigenvectors of C^xx arranged by rows. The right side of the figure shows a nonlinear morph between two different instances of a handwritten two using these eigenvectors.
Gaussian correlation matrix C^xx. The correlation matrices are diagonalized and the resulting eigenvalue spectra are shown in Figure 4. The eigenvalues for C^xx again exhibit a characteristic drop that is steeper than the falloff in the spectrum of the binary correlations C^ss. The corresponding eigenvectors of C^xx with the 16 largest positive eigenvalues are depicted in the inset of Figure 4. These eigenmodes represent common image distortions such as rotations and stretching, and appear qualitatively similar to those found by the standard PCA algorithm.
A generative model with weights W corresponding to the P = 16 eigenvectors shown in Figure 4 is used to fit the handwritten twos, and the utility of this nonlinear generative model is illustrated in the right side of Figure 4. The top and bottom images in the figure are two different examples of a handwritten two from the data set, and the generative model is used to morph between the two examples. The hidden values yi for the original images are first determined for the different examples, and the intermediate images in the morph are constructed by linearly interpolating in the vector space of the hidden units. Because of the clipping nonlinearity, this induces a nonlinear mapping in the outputs, with binary units being flipped in a particular order as determined by the generative model. In contrast, morphing using conventional PCA would result in a simple linear interpolation between the two images, and the intermediate images would not look anything like the original binary distribution [10].
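The morphing procedure can be sketched as follows. The paper does not spell out how the hidden values yi are inferred from a binary image; the least-squares projection through the pseudoinverse of W used below is our own simplifying assumption, and the weights here are random stand-ins for learned ones:

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 64, 4
W = rng.standard_normal((N, P))   # stand-in for learned weights

def infer_hidden(s, W):
    # least-squares projection (an assumed inference rule, not from the paper)
    return np.linalg.pinv(W) @ s

def morph(s_a, s_b, W, steps=5):
    """Interpolate linearly in hidden space and clip back to spins."""
    y_a, y_b = infer_hidden(s_a, W), infer_hidden(s_b, W)
    return np.array([np.sign(W @ ((1 - t) * y_a + t * y_b))
                     for t in np.linspace(0.0, 1.0, steps)])

s_a = np.sign(W @ rng.standard_normal(P))   # two example "images"
s_b = np.sign(W @ rng.standard_normal(P))
frames = morph(s_a, s_b, W)
print(frames.shape)                         # (5, 64)
```

The intermediate frames flip spins in a definite order fixed by W, rather than blending pixel values linearly as PCA would.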
The correlation matrix C^xx also happens to contain some small negative eigenvalues. Even though the binary correlation matrix C^ss is positive definite, the transformation in Equation 10 does not guarantee that the resulting matrix C^xx will also be positive definite. The presence of these negative eigenvalues indicates a shortcoming of the generative process for modeling this data. In particular, the clipped Gaussian model is unable to capture correlations induced by global
constraints in the data. As a simple illustration of this shortcoming in the generative model, consider the binary distribution defined by the probability density P({s}) ∝ lim_{β→∞} exp(-β Σ_{ij} s_i s_j). The states in this distribution are defined by the constraint that the sum of the binary variables is exactly zero: Σ_i s_i = 0. Now, for N ≥ 4, it can be shown that it is impossible to find a Gaussian distribution whose visible binary variables match the negative correlations induced by this sum constraint.
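This failure is easy to exhibit numerically for N = 4. Enumerating the zero-sum states gives C^ss with off-diagonal entries -1/3, and the mapped matrix sin(πC^ss/2) acquires a negative eigenvalue, so no Gaussian model can reproduce these correlations (illustrative sketch):

```python
import numpy as np
from itertools import product

N = 4
# all binary states satisfying the constraint sum_i s_i = 0
states = np.array([s for s in product([-1, 1], repeat=N) if sum(s) == 0])

Css = states.T @ states / len(states)   # diagonal 1, off-diagonal -1/3
Cxx = np.sin(np.pi * Css / 2.0)         # off-diagonal sin(-pi/6) = -1/2
evals = np.sort(np.linalg.eigvalsh(Cxx))

print(round(Css[0, 1], 4))   # -0.3333
print(round(evals[0], 4))    # -0.5: not a valid Gaussian correlation matrix
```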
These examples illustrate the value of using the clipped generative model to learn the correlation matrix of the underlying Gaussian variables rather than using the correlations of the outputs directly. The clipping nonlinearity is convenient because the relationship between the hidden variables and the output variables is particularly easy to understand. The learning algorithm differs from other nonlinear PCA models and autoencoders because the inverse mapping function need not be explicitly learned [11, 12]. Instead, the correlation matrix is directly transformed from the observable variables to the underlying Gaussian variables. The correlation matrix is then diagonalized to determine the appropriate feedforward weights. This results in an extremely efficient training procedure that is directly analogous to PCA for continuous variables.
Acknowledgements

We acknowledge the support of Bell Laboratories, Lucent Technologies, and the US-Israel Binational Science Foundation. We also thank H. S. Seung for helpful discussions.
References

[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag.
[2] Bartholomew, DJ (1987). Latent Variable Models and Factor Analysis. London: Charles Griffin & Co. Ltd.
[3] Hinton, GE, Dayan, P, & Revow, M (1996). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks 8, 65-74.
[4] Van Vreeswijk, C, Sompolinsky, H, & Abeles, M (1999). Nonlinear statistics of spike trains. In preparation.
[5] Ackley, DH, Hinton, GE, & Sejnowski, TJ (1985). A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169.
[6] Cover, TM (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Comput. 14, 326-334.
[7] Tipping, ME (1999). Probabilistic visualisation of high-dimensional binary data. Advances in Neural Information Processing Systems 11.
[8] Christoffersson, A (1975). Factor analysis of dichotomized variables. Psychometrika 40, 5-32.
[9] LeCun, Y, et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541-551.
[10] Bregler, C, & Omohundro, SM (1995). Nonlinear image interpolation using manifold learning. Advances in Neural Information Processing Systems 7, 973-980.
[11] Hastie, T, & Stuetzle, W (1989). Principal curves. Journal of the American Statistical Association 84, 502-516.
[12] Demers, D, & Cottrell, G (1993). Nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 5, 580-587.
