Learning a Hierarchical Belief Network of 
Independent Factor Analyzers 
H. Attias* 
hagai@gatsby.ucl.ac.uk
Sloan Center for Theoretical Neurobiology, Box 0444 
University of California at San Francisco 
San Francisco, CA 94143-0444 
Abstract 
Many belief networks have been proposed that are composed of 
binary units. However, for tasks such as object and speech recog- 
nition which produce real-valued data, binary network models are 
usually inadequate. Independent component analysis (ICA) learns 
a model from real data, but the descriptive power of this model 
is severely limited. We begin by describing the independent factor
analysis (IFA) technique, which overcomes some of the limitations 
of ICA. We then create a multilayer network by cascading single- 
layer IFA models. At each level, the IFA network extracts real- 
valued latent variables that are non-linear functions of the input 
data with a highly adaptive functional form, resulting in a hier- 
archical distributed representation of these data. Whereas exact 
maximum-likelihood learning of the network is intractable, we de- 
rive an algorithm that maximizes a lower bound on the likelihood, 
based on a variational approach. 
1 Introduction 
An intriguing hypothesis for how the brain represents incoming sensory informa- 
tion holds that it constructs a hierarchical probabilistic model of the observed data. 
The model parameters are learned in an unsupervised manner by maximizing the 
likelihood that these data are generated by the model. A multilayer belief net- 
work is a realization of such a model. Many belief networks have been proposed 
that are composed of binary units. The hidden units in such networks represent 
latent variables that explain different features of the data, and whose relation to the 
*Current address: Gatsby Computational Neuroscience Unit, University College London,
17 Queen Square, London WC1N 3AR, U.K.
362 H. Attias 
data is highly non-linear. However, for tasks such as object and speech recognition 
which produce real-valued data, the models provided by binary networks are often 
inadequate. Independent component analysis (ICA) learns a generative model from
real data, and extracts real-valued latent variables that are mutually statistically 
independent. Unfortunately, this model is restricted to a single layer and the latent 
variables are simple linear functions of the data; hence, underlying degrees of free- 
dom that are non-linear cannot be extracted by ICA. In addition, the requirement 
of equal numbers of hidden and observed variables and the assumption of noiseless 
data render the ICA model inappropriate. 
This paper begins by introducing the independent factor analysis (IFA) technique. 
IFA is an extension of ICA that allows different numbers of latent and observed
variables and can handle noisy data. The paper proceeds to create a multilayer 
network by cascading single-layer IFA models. The resulting generative model produces
a hierarchical distributed representation of the input data, where the latent
variables extracted at each level are non-linear functions of the data with a highly 
adaptive functional form. Whereas exact maximum-likelihood (ML) learning in 
this network is intractable due to the difficulty in computing the posterior density 
over the hidden layers, we present an algorithm that maximizes a lower bound on 
the likelihood. This algorithm is based on a general variational approach that we 
develop for the IFA network. 
2 Independent Component and Independent Factor Analysis
Although the concept of ICA originated in the field of signal processing, it is actually 
a density estimation problem. Given an L' x 1 observed data vector y, the task is 
to explain it in terms of an L x 1 vector x of unobserved 'sources' that are mutually 
statistically independent. The relation between the two is assumed linear, 
y = Hx + u, (1) 
where H is the 'mixing' matrix; the noise vector u is usually assumed zero-mean
Gaussian with a covariance matrix $\Lambda$. In the context of blind source separation
[1]-[4], the source signals x should be recovered from the mixed noisy signals y with
no knowledge of H, $\Lambda$, or the source densities $p(x_i)$, hence the term 'blind'. In the
density estimation approach, one regards (1) as a probabilistic generative model for 
the observed p(y), with the mixing matrix, noise covariance, and source densities 
serving as model parameters. In principle, these parameters should be learned by 
ML, followed by inferring the sources via a MAP estimator. 
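The generative reading of (1) can be illustrated with a minimal Python sketch. The function names, the diagonal noise covariance, and the Laplacian source density are illustrative assumptions made here, not choices stated in the text.

```python
import random

def sample_laplacian(rng):
    """Draw one unit-scale Laplacian sample (assumed example of a non-Gaussian source)."""
    magnitude = rng.expovariate(1.0)
    return magnitude if rng.random() < 0.5 else -magnitude

def sample_linear_mixture(H, noise_std, rng, sample_source=sample_laplacian):
    """Draw (x, y) from y = H x + u with mutually independent sources x_i.

    H is an L' x L list of rows. noise_std holds one standard deviation per
    output, i.e. the noise covariance is assumed diagonal for simplicity.
    """
    num_sources = len(H[0])
    x = [sample_source(rng) for _ in range(num_sources)]
    y = [sum(h * x_j for h, x_j in zip(row, x)) + rng.gauss(0.0, sd)
         for row, sd in zip(H, noise_std)]
    return x, y

rng = random.Random(0)
H = [[1.0, 0.5], [0.2, 1.0], [0.7, -0.3]]  # L' = 3 outputs, L = 2 sources
x, y = sample_linear_mixture(H, [0.1, 0.1, 0.1], rng)
```

Note that the number of outputs (3) differs from the number of sources (2), which is the non-square, noisy setting discussed above.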
For Gaussian sources, (1) is the factor analysis model, for which an EM algorithm 
exists and the MAP estimator is linear. The problem becomes interesting and more 
difficult for non-Gaussian sources. Most ICA algorithms focus on square (L' = L), 
noiseless (y = Hx) mixing, and fix p(xi) using prior knowledge (but see [5] for the 
case of noisy mixing with a fixed Laplacian source prior). Learning H occurs via 
gradient-ascent maximization of the likelihood [1]-[4]. Source density parameters 
can also be adapted in this way [3],[4], but the resulting gradient-ascent learning is
rather slow. This state of affairs presented a problem to ICA algorithms, since the 
ability to learn arbitrary source densities that are not known in advance is crucial: 
using an inaccurate p(xi) often leads to a bad H estimate and failed separation. 
This problem was recently solved by introducing the IFA technique [6]. IFA 
employs a semi-parametric model of the source densities, which allows learning 
them (as well as the mixing matrix) using expectation-maximization (EM). Specif- 
ically, p(xi) is described as a mixture of Gaussians (MOG), where the mixture 
Hierarchical IFA Belief Networks 363
components are labeled by $s = 1, \ldots, n_i$ and have means $\mu_{i,s}$ and variances $\gamma_{i,s}$:
$p(x_i) = \sum_s p(s_i = s)\, G(x_i - \mu_{i,s}, \gamma_{i,s})$.$^1$ The mixing proportions are parametrized
using the softmax form: $p(s_i = s) = \exp(a_{i,s}) / \sum_{s'} \exp(a_{i,s'})$. Beyond noiseless
ICA, an EM algorithm for the noisy case (1) with any L, L' was also derived in
[6] using the MOG description.$^2$ This algorithm learns a probabilistic model
$p(y \mid W)$ for the observed data, parametrized by $W = (H, \Lambda, \{a_{i,s}, \mu_{i,s}, \gamma_{i,s}\})$. A
graphical representation of this model is provided by Fig. 1, if we set n = 1 and
$y_j^0 = b_{i,s}^1 = \nu_{i,s}^1 = 0$.
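As a small illustration of the MOG source model with softmax mixing proportions, the following Python sketch evaluates $p(x_i)$ for a single source. The function names and the numerical values are hypothetical and chosen only for the example.

```python
import math

def softmax(a):
    """Mixing proportions p(s) = exp(a_s) / sum_s' exp(a_s'), computed stably."""
    m = max(a)
    e = [math.exp(v - m) for v in a]
    z = sum(e)
    return [v / z for v in e]

def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mog_density(x, a, mu, gamma):
    """Source density p(x) = sum_s p(s) G(x - mu_s, gamma_s)."""
    weights = softmax(a)
    return sum(w * gaussian_pdf(x, m, g) for w, m, g in zip(weights, mu, gamma))

# Hypothetical two-component source: a bimodal, clearly non-Gaussian density.
a, mu, gamma = [0.0, 1.0], [-2.0, 1.0], [0.5, 2.0]
density_at_zero = mog_density(0.0, a, mu, gamma)
```

Because the proportions pass through a softmax, the parameters $a_{i,s}$ are unconstrained, which is convenient for gradient-based or EM updates.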
3 Hierarchical Independent Factor Analysis 
In the following we develop a multilayer generalization of IFA, by cascading duplicates
of the generative model introduced in [6]. Each layer $n = 1, \ldots, N$ is composed
of two sublayers: a source sublayer which consists of the units $x_i^n$, $i = 1, \ldots, L_n$, and
an output sublayer which consists of $y_j^n$, $j = 1, \ldots, L'_n$. The two are linearly related
via $y^n = H^n x^n + u^n$ as in (1); $u^n$ is a Gaussian noise vector with covariance $\Lambda^n$.
The nth-layer source $x_i^n$ is described by a MOG density model with parameters $a_{i,s}^n$,
$\mu_{i,s}^n$, and $\gamma_{i,s}^n$, in analogy to the IFA sources above.
The important step is to determine how layer n depends on the previous layers. We 
choose to introduce a dependence of the ith source of layer n only on the ith output 
of layer n - 1. Notice that matching $L_n = L'_{n-1}$ is now required. This dependence
is implemented by making the means and mixture proportions of the Gaussians
which compose $p(x_i^n)$ dependent on $y_i^{n-1}$. Specifically, we make the replacements
$\mu_{i,s}^n \to \mu_{i,s}^n + \nu_{i,s}^n y_i^{n-1}$ and $a_{i,s}^n \to a_{i,s}^n + b_{i,s}^n y_i^{n-1}$. The resulting joint density for
layer n, conditioned on layer n - 1, is
$$p(s^n, x^n, y^n \mid y^{n-1}, W^n) = \prod_{i=1}^{L_n} p(s_i^n \mid y_i^{n-1})\, p(x_i^n \mid s_i^n, y_i^{n-1})\, p(y^n \mid x^n), \qquad (2)$$
where $W^n$ are the parameters of layer n and
$$p(s_i^n = s \mid y_i^{n-1}) = \frac{\exp(a_{i,s}^n + b_{i,s}^n y_i^{n-1})}{\sum_{s'} \exp(a_{i,s'}^n + b_{i,s'}^n y_i^{n-1})}, \qquad p(y^n \mid x^n) = G(y^n - H^n x^n, \Lambda^n),$$
$$p(x_i^n \mid s_i^n = s, y_i^{n-1}) = G(x_i^n - \mu_{i,s}^n - \nu_{i,s}^n y_i^{n-1}, \gamma_{i,s}^n).$$
The full model joint density is given by the product of (2) over $n = 1, \ldots, N$ (setting
$y^0 = 0$). A graphical representation of layer n of the hierarchical IFA network is
given in Fig. 1. All units are hidden except $y^N$.
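The layered generative process can be sketched as ancestral sampling: starting from $y^0 = 0$, each layer draws a state from the softmax gate, then a source from the selected Gaussian, then a linearly mixed noisy output. The dictionary layout of a layer, the function names, and the diagonal noise covariance are assumptions made for this illustration only.

```python
import math
import random

def softmax(c):
    m = max(c)
    e = [math.exp(v - m) for v in c]
    z = sum(e)
    return [v / z for v in e]

def sample_layer(y_prev, layer, rng):
    """Draw (s, x, y) for one layer given the previous layer's output y_prev.

    layer is a dict (hypothetical layout) with per-unit, per-state lists
    'a', 'b', 'mu', 'nu', 'gamma', a mixing matrix 'H' (list of rows), and
    'noise_var' (diagonal noise covariance, an assumed simplification).
    """
    s, x = [], []
    for i, y_i in enumerate(y_prev):
        probs = softmax([a + b * y_i for a, b in zip(layer["a"][i], layer["b"][i])])
        s_i = rng.choices(range(len(probs)), weights=probs)[0]
        mean = layer["mu"][i][s_i] + layer["nu"][i][s_i] * y_i
        x.append(rng.gauss(mean, math.sqrt(layer["gamma"][i][s_i])))
        s.append(s_i)
    y = [sum(h * x_j for h, x_j in zip(row, x)) + rng.gauss(0.0, math.sqrt(var))
         for row, var in zip(layer["H"], layer["noise_var"])]
    return s, x, y

def sample_network(layers, rng):
    """Ancestral sampling through all layers; returns the observed output y^N."""
    y = [0.0] * len(layers[0]["a"])  # y^0 = 0
    for layer in layers:
        _, _, y = sample_layer(y, layer, rng)
    return y

# Tiny two-layer example with one state per unit and zero variances, so the
# sample is deterministic and easy to check by hand.
layer1 = {"a": [[0.0], [0.0]], "b": [[0.0], [0.0]], "mu": [[1.0], [2.0]],
          "nu": [[0.0], [0.0]], "gamma": [[0.0], [0.0]],
          "H": [[1.0, 0.0], [0.0, 1.0]], "noise_var": [0.0, 0.0]}
layer2 = {"a": [[0.0], [0.0]], "b": [[0.0], [0.0]], "mu": [[0.5], [0.0]],
          "nu": [[2.0], [-1.0]], "gamma": [[0.0], [0.0]],
          "H": [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]], "noise_var": [0.0, 0.0, 0.0]}
y_top = sample_network([layer1, layer2], random.Random(0))
```

The number of units in layer 2 equals the number of outputs of layer 1, reflecting the matching requirement between consecutive layers.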
To gain some insight into our network, we examine the relation between the nth-layer
source $x_i^n$ and the (n-1)th-layer output $y_i^{n-1}$. This relation is probabilistic
and is determined by the conditional density
$p(x_i^n \mid y_i^{n-1}) = \sum_{s_i^n} p(s_i^n \mid y_i^{n-1})\, p(x_i^n \mid s_i^n, y_i^{n-1})$.
Notice from (2) that this is a MOG density. Its $y_i^{n-1}$-dependent mean is
given by
$$\bar{x}_i^n = f_i^n(y_i^{n-1}) = \sum_s p(s_i^n = s \mid y_i^{n-1})\, (\mu_{i,s}^n + \nu_{i,s}^n y_i^{n-1}), \qquad (3)$$
$^1$ Throughout this paper, $G(x, \Sigma) = |2\pi\Sigma|^{-1/2} \exp(-x^T \Sigma^{-1} x / 2)$.
$^2$ However, for many sources the E-step becomes intractable, since the number $\prod_i n_i$
of source state configurations $s = (s_1, \ldots, s_L)$ depends exponentially on L. Such cases are
treated in [6] using a variational approximation.
Figure 1: Layer n of the hierarchical IFA generative model.
and is a non-linear function of $y_i^{n-1}$ due to the softmax form of $p(s_i^n \mid y_i^{n-1})$.
By adjusting the parameters, the function $f_i^n$ can assume a very wide range of
forms: suppose that for state $s_i^n = s$, $a_{i,s}^n$ and $b_{i,s}^n$ are set so that $p(s_i^n = s \mid y_i^{n-1})$ is
significant only in a small, continuous range of $y_i^{n-1}$ values, with different ranges
associated with different s's. In this range, $f_i^n$ will be dominated by the linear
term $\mu_{i,s}^n + \nu_{i,s}^n y_i^{n-1}$. Hence, a desired $f_i^n$ can be produced by placing oriented
line segments at appropriate points above the $y_i^{n-1}$-axis, then smoothly joining them
together by the $p(s_i^n \mid y_i^{n-1})$. Using the algorithm below, the optimal form of $f_i^n$
will be learned from the data. Therefore, our model describes the data $y^N$ as a
potentially highly complex function of the top layer sources, produced by repeated
application of linear mixing followed by a non-linearity, with noise allowed at each
stage.
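The line-segment picture can be made concrete with a short Python sketch of the mean function in (3) for a single unit. The parameter values are hypothetical: two states with sharp softmax gates that switch at zero, and slopes of -1 and +1, so that the resulting function approximates the absolute value.

```python
import math

def softmax(c):
    m = max(c)
    e = [math.exp(v - m) for v in c]
    z = sum(e)
    return [v / z for v in e]

def mean_function(y, a, b, mu, nu):
    """f(y) = sum_s p(s | y) (mu_s + nu_s * y), with softmax gate p(s | y)."""
    gate = softmax([a_s + b_s * y for a_s, b_s in zip(a, b)])
    return sum(p * (m + n * y) for p, m, n in zip(gate, mu, nu))

# Hypothetical parameters: state 0 dominates for y < 0, state 1 for y > 0.
a, b = [0.0, 0.0], [-20.0, 20.0]
mu, nu = [0.0, 0.0], [-1.0, 1.0]
values = [mean_function(y, a, b, mu, nu) for y in (-2.0, 0.0, 2.0)]
```

Smaller values of `b` would blend the two segments more gradually, which is how the gate controls the smoothness of the joins.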
4 Learning and Inference by Variational EM
The need for summing over an exponentially large number of source state configurations
$(s_1^n, \ldots, s_{L_n}^n)$, and integrating over the softmax functions $p(s_i^n \mid y_i^{n-1})$, makes
exact learning intractable in our network. Thus, approximations must be made.
In the following we develop a variational approach, in the spirit of [8], to hierarchical
IFA. We begin, following the approach of [7] to EM, by bounding the log-likelihood
from below:
$$\mathcal{L} = \log p(y^N) \ge \sum_n \Big\{ E \log p(y^n \mid x^n) + \sum_i \big[ E \log p(x_i^n \mid s_i^n, y_i^{n-1}) + E \log p(s_i^n \mid y_i^{n-1}) \big] \Big\} - E \log q,$$
where E denotes averaging over the hidden
layers using an arbitrary posterior $q = q(s^{1 \ldots N}, x^{1 \ldots N}, y^{1 \ldots N-1} \mid y^N)$. In exact EM,
q at each iteration is the true posterior, parametrized by $W^{1 \ldots N}$ from the previous
iteration. In variational EM, q is chosen to have a form which makes learning
tractable, and is parametrized by a separate set of parameters $V^{1 \ldots N}$. These are
optimized to bring q as close to the true posterior as possible.
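The general principle behind this bound can be checked numerically on a toy model. The sketch below uses a one-dimensional Gaussian mixture with a single discrete hidden state, which is a deliberately simplified stand-in and not the hierarchical network itself; all names and values are illustrative assumptions.

```python
import math

def log_gaussian(y, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - mean) ** 2 / var

def log_likelihood(y, prior, means, var):
    """Exact log p(y) for a mixture with discrete hidden state s."""
    return math.log(sum(p * math.exp(log_gaussian(y, m, var))
                        for p, m in zip(prior, means)))

def lower_bound(y, prior, means, var, q):
    """E_q[log p(y, s)] - E_q[log q(s)] for an arbitrary distribution q over s."""
    return sum(q_s * (math.log(p) + log_gaussian(y, m, var) - math.log(q_s))
               for q_s, p, m in zip(q, prior, means) if q_s > 0.0)

def true_posterior(y, prior, means, var):
    joint = [p * math.exp(log_gaussian(y, m, var)) for p, m in zip(prior, means)]
    z = sum(joint)
    return [j / z for j in joint]

y, prior, means, var = 0.7, [0.3, 0.7], [-1.0, 2.0], 1.0
exact = log_likelihood(y, prior, means, var)
loose = lower_bound(y, prior, means, var, [0.5, 0.5])          # arbitrary q
tight = lower_bound(y, prior, means, var, true_posterior(y, prior, means, var))
```

The bound is strict for an arbitrary q and becomes an equality when q is the true posterior, which is why variational EM tries to bring q as close to that posterior as its restricted form allows.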
E-step. We use a variational posterior that is factorized across layers. Within layer
n it has the form
$$q(s^n, x^n, y^n \mid V^n) = \prod_{i=1}^{L_n} v_{i,s_i}^n \; G(z^n - \rho^n, \Sigma^n), \qquad z^n = (x^n, y^n)^T \qquad (4)$$
for n < N, and $q(s^N, x^N \mid V^N) = \prod_i v_{i,s_i}^N \, G(x^N - \rho^N, \Sigma^N)$. The variational parameters
$V^n = (\rho^n, \Sigma^n, \{v_{i,s}^n\})$ depend on the data $y^N$. The full N-layer posterior is
simply a product of (4) over n. Hence, given the data, the nth-layer sources and
outputs are jointly Gaussian whereas the states $s_i^n$ are independent.$^3$
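Even with the variational posterior (4), the term $E \log p(s_i^n \mid y_i^{n-1})$ in the lower
bound cannot be calculated analytically, since it involves integration over the
softmax function. Instead, we calculate a further lower bound on this term. Let
$c_s = a_{i,s}^n + b_{i,s}^n y_i^{n-1}$ and drop the unit and layer indices i, n; then
$\log p(s \mid y) = -\log(1 + e^{-c_s} \sum_{s' \ne s} e^{c_{s'}})$. Borrowing an idea from [8], we multiply and divide by
$e^{\eta_s c_s}$ under the logarithm sign and use Jensen's inequality to get
$E \log p(s \mid y) \ge -\eta_s E c_s - \log E[e^{-\eta_s c_s} + e^{-(1+\eta_s) c_s} \sum_{s' \ne s} e^{c_{s'}}]$.
Averaging over the states with the weights $v_s^n$, this results in a bound that can
be calculated in closed form:
$$E \log p(s_i^n \mid y_i^{n-1}) \ge \sum_s \Big[ -v_s^n \eta_s^n \bar{c}_s^n - v_s^n \log\Big( e^{f_s^n} + \sum_{s' \ne s} e^{f_{ss'}^n} \Big) \Big], \qquad (5)$$
where $\bar{c}_s^n = a_s^n + b_s^n \rho_y^{n-1}$, $f_s^n = -\eta_s^n \bar{c}_s^n + (\eta_s^n b_s^n)^2 \Sigma_{yy}^{n-1}/2$,
$f_{ss'}^n = -(1 + \eta_s^n) \bar{c}_s^n + \bar{c}_{s'}^n + [(1 + \eta_s^n) b_s^n - b_{s'}^n]^2 \Sigma_{yy}^{n-1}/2$, and the subscript i is omitted
(so that $\rho_y^{n-1}$ and $\Sigma_{yy}^{n-1}$ stand for the ith element and the ith diagonal element, respectively). We also defined
$\rho^n = (\rho_x^n, \rho_y^n)^T$, and similarly $\Sigma_{xx}^n$, $\Sigma_{yy}^n$, $\Sigma_{xy}^n = (\Sigma_{yx}^n)^T$ are the subblocks of $\Sigma^n$. Since
(5) holds for arbitrary $\eta_{i,s}^n$, the latter are treated as additional variational parameters
which are optimized to tighten this bound.$^4$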
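The Jensen step can be checked numerically for a single state s. The sketch below compares $E \log p(s \mid y)$, computed by trapezoidal quadrature under $y \sim N(\rho, \Sigma)$, with the closed-form bound $-\eta \bar{c}_s - \log(e^{f_s} + \sum_{s' \ne s} e^{f_{ss'}})$ obtained from the Gaussian moment-generating function. Function names, the quadrature settings, and the parameter values are assumptions made for illustration.

```python
import math

def log_softmax_prob(s, y, a, b):
    """log p(s | y) for the softmax gate with c_k = a_k + b_k * y."""
    c = [a_k + b_k * y for a_k, b_k in zip(a, b)]
    m = max(c)
    return c[s] - m - math.log(sum(math.exp(c_k - m) for c_k in c))

def expected_log_prob(s, a, b, rho, var, num_points=4001, width=8.0):
    """E log p(s | y) for y ~ N(rho, var), by trapezoidal quadrature."""
    sd = math.sqrt(var)
    step = 2.0 * width * sd / (num_points - 1)
    total = 0.0
    for k in range(num_points):
        y = rho - width * sd + k * step
        weight = 0.5 if k in (0, num_points - 1) else 1.0
        pdf = math.exp(-0.5 * (y - rho) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        total += weight * pdf * log_softmax_prob(s, y, a, b) * step
    return total

def jensen_bound(s, a, b, rho, var, eta):
    """Closed-form lower bound on E log p(s | y) for a given eta."""
    c_bar = [a_k + b_k * rho for a_k, b_k in zip(a, b)]
    exponents = [-eta * c_bar[s] + 0.5 * (eta * b[s]) ** 2 * var]        # f_s
    for k in range(len(a)):
        if k != s:
            slope = (1.0 + eta) * b[s] - b[k]
            exponents.append(-(1.0 + eta) * c_bar[s] + c_bar[k]
                             + 0.5 * slope ** 2 * var)                   # f_{s s'}
    m = max(exponents)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return -eta * c_bar[s] - log_sum

a, b = [0.0, 0.5, -0.3], [1.0, -0.5, 0.2]   # hypothetical gate parameters
exact = expected_log_prob(0, a, b, rho=0.4, var=1.0)
bounds = [jensen_bound(0, a, b, 0.4, 1.0, eta) for eta in (0.0, 0.3, 0.7, 1.0)]
```

Each entry of `bounds` lies below `exact`. Scanning over `eta` and keeping the largest value mimics, in a crude way, the optimization of the additional variational parameters.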
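To optimize the variational parameters $V^{1 \ldots N}$, we equate the gradient of the lower
bound on $\mathcal{L}$ to zero and obtain
$$\begin{pmatrix} (H^T \Lambda^{-1} H)^n + A^n & -(H^T \Lambda^{-1})^n \\ -(\Lambda^{-1} H)^n & (\Lambda^{-1})^n + B^{n+1} \end{pmatrix} \begin{pmatrix} \rho_x^n \\ \rho_y^n \end{pmatrix} = \begin{pmatrix} \alpha^n + C^n \rho_y^{n-1} \\ -\beta^{n+1} + C^{n+1} \rho_x^{n+1} + F_\rho^{n+1} \end{pmatrix}, \qquad (6)$$
$$\Sigma^n = \begin{pmatrix} (H^T \Lambda^{-1} H)^n + A^n & -(H^T \Lambda^{-1})^n \\ -(\Lambda^{-1} H)^n & (\Lambda^{-1})^n + B^{n+1} - F_\Sigma^{n+1} \end{pmatrix}^{-1}, \qquad (7)$$
where $A_{ij}^n = \sum_s (v_{i,s}/\gamma_{i,s})^n \delta_{ij}$, $B_{ij}^n = \sum_s (v_{i,s} \nu_{i,s}^2/\gamma_{i,s})^n \delta_{ij}$, $C_{ij}^n = \sum_s (v_{i,s} \nu_{i,s}/\gamma_{i,s})^n \delta_{ij}$,
$\alpha_i^n = \sum_s (v_{i,s} \mu_{i,s}/\gamma_{i,s})^n$, and $\beta_i^n = \sum_s (v_{i,s} \mu_{i,s} \nu_{i,s}/\gamma_{i,s})^n$. (All parameters within $(\cdots)^n$ belong to layer n.)
$F_\rho^{n+1}$ and $F_\Sigma^{n+1}$ contain the corresponding derivatives of the layer-(n+1) bound (5), summed over s,
with respect to $\rho_y^n$ and $\Sigma_{yy}^n$ (the latter multiplied by 2). For the
state posteriors we have
$$v_s^n = \frac{1}{z^n} \exp\Big\{ -\frac{1}{2} \log \gamma_s^n - \frac{1}{2\gamma_s^n} \big[ (\rho_x^n - \mu_s^n - \nu_s^n \rho_y^{n-1})^2 + \Sigma_{xx}^n + (\nu_s^n)^2 \Sigma_{yy}^{n-1} \big] - \eta_s^n \bar{c}_s^n - \log\Big( e^{f_s^n} + \sum_{s' \ne s} e^{f_{ss'}^n} \Big) \Big\}, \qquad (8)$$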
$^3$ It is easy to introduce more structure into (4) by allowing the means $\rho_i^n$ to depend
on $s_i^n$, and the covariances $\Sigma_{ij}^n$ to depend on $s_i^n, s_j^n$, thus making the approximation more
accurate (but more complex) while maintaining tractability.
$^4$ An alternative approach to handle $E \log p(s_i^n \mid y_i^{n-1})$ is to approximate the required
integral by, e.g., the maximum value of the integrand, possibly including Gaussian corrections.
The resulting approximation is simpler than (5); however, it is no longer guaranteed
to bound the log-likelihood from below.
where the unit subscript i is omitted (i.e., $\Sigma_{xx}^n = \Sigma_{xx,ii}^n$); $z^n = z_i^n$ is set such that
$\sum_s v_s^n = 1$. A simple modification of these equations is required for layer n = N.
The optimal $V^{1 \ldots N}$ are obtained by solving the fixed-point equations (6-8) iteratively
for each data vector $y^N$, keeping the generative parameters $W^{1 \ldots N}$ fixed.
Notice that these equations couple layer n to layers n ± 1. The additional parameters
$\eta_{i,s}^n$ are adjusted using gradient ascent on (5). Once learning is complete, the
inference problem is solved since the MAP estimate of the hidden unit values given
the data is readily available from $\rho_i^n$ and $v_{i,s}^n$.
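M-Step. In terms of the variational parameters obtained in the E-step, the new
generative parameters are given by
$$H^n = (\rho_y^n \rho_x^{nT} + \Sigma_{yx}^n)(\rho_x^n \rho_x^{nT} + \Sigma_{xx}^n)^{-1},$$
$$\Lambda^n = \rho_y^n \rho_y^{nT} + \Sigma_{yy}^n - H^n (\rho_x^n \rho_y^{nT} + \Sigma_{xy}^n),$$
$$\begin{pmatrix} \mu_s^n \\ \nu_s^n \end{pmatrix} = \begin{pmatrix} v_s^n & v_s^n \rho_y^{n-1} \\ v_s^n \rho_y^{n-1} & v_s^n [(\rho_y^{n-1})^2 + \Sigma_{yy}^{n-1}] \end{pmatrix}^{-1} \begin{pmatrix} v_s^n \rho_x^n \\ v_s^n \rho_x^n \rho_y^{n-1} \end{pmatrix},$$
$$\gamma_s^n = \frac{1}{v_s^n}\, v_s^n \big[ (\rho_x^n - \mu_s^n - \nu_s^n \rho_y^{n-1})^2 + \Sigma_{xx}^n + (\nu_s^n)^2 \Sigma_{yy}^{n-1} \big],$$
omitting the subscript i as in (8), and are slightly modified for layer N. In batch
mode, averaging over the data is implied and the $v_s^n$ do not cancel out. Finally, the
softmax parameters $a_{i,s}^n$, $b_{i,s}^n$ are adapted by gradient ascent on the bound (5).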
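In batch mode, the update of the source means, slopes, and variances amounts to a posterior-weighted linear regression of the source on the previous output. The sketch below implements that reading for one unit and one state, assuming the source and the previous-layer output are independent under the layer-factorized posterior. The function name and argument layout are illustrative.

```python
def update_source_parameters(v, rho_x, sigma_xx, rho_y, sigma_yy):
    """Batch M-step for one unit i and one state s.

    Each argument is a list over data vectors: state posteriors v, posterior
    mean and variance of x_i^n, and posterior mean and variance of y_i^{n-1}.
    Returns (mu, nu, gamma).
    """
    s0 = sum(v)
    s_y = sum(w * ry for w, ry in zip(v, rho_y))
    s_yy = sum(w * (ry * ry + syy) for w, ry, syy in zip(v, rho_y, sigma_yy))
    s_x = sum(w * rx for w, rx in zip(v, rho_x))
    s_xy = sum(w * rx * ry for w, rx, ry in zip(v, rho_x, rho_y))
    # Solve the 2 x 2 weighted normal equations for (mu, nu).
    det = s0 * s_yy - s_y * s_y
    mu = (s_yy * s_x - s_y * s_xy) / det
    nu = (s0 * s_xy - s_y * s_x) / det
    # Weighted average of the expected squared residual.
    gamma = sum(w * ((rx - mu - nu * ry) ** 2 + sxx + nu * nu * syy)
                for w, rx, sxx, ry, syy in zip(v, rho_x, sigma_xx, rho_y, sigma_yy)) / s0
    return mu, nu, gamma

# Hypothetical posterior statistics lying exactly on the line x = 1 + 2 y.
v = [1.0, 0.5, 1.0, 0.2]
rho_y = [0.0, 1.0, 2.0, 3.0]
rho_x = [1.0, 3.0, 5.0, 7.0]
zeros = [0.0, 0.0, 0.0, 0.0]
mu, nu, gamma = update_source_parameters(v, rho_x, zeros, rho_y, zeros)
```

With zero posterior variances the update recovers the line exactly. A non-zero posterior variance of the source enters the new variance additively.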
5 Discussion 
The hierarchical IFA network presented here constitutes a quite general framework 
for learning and inference using real-valued probabilistic models that are strongly 
non-linear but highly adaptive. Notice that this network includes both continuous
$x_i^n$, $y_i^n$ and binary $s_i^n$ units, and can thus extract both types of latent variables.
In particular, the uppermost units $s_i^1$ may represent class labels in classification
tasks. The models proposed in [9]-[11] can be viewed as special cases where $x_i^n$ is
a prescribed deterministic function (e.g., rectifier) of the previous outputs $y_i^{n-1}$: in
the IFA network, a deterministic (but still adaptive) dependence can be obtained
by setting the variances $\gamma_{i,s}^n = 0$. Note that the source $x_i^n$ in such a case assumes
only the values $\mu_{i,s}^n$, and thus corresponds to a discrete latent variable.
The learning and inference algorithm presented here is based on the variational 
approach. Unlike variational approximations in other belief networks [8],[10] which 
use a completely factorized approximation, the structure of the hierarchical IFA 
network facilitates using a variational posterior that allows correlations among hid- 
den units occupying the same layer, thus providing a more accurate description of 
the true posterior. It would be interesting to compare the performance of our varia- 
tional algorithm with the belief propagation algorithm [12] which, when adapted to 
the densely connected IFA network, would also be an approximation. Markov chain 
Monte Carlo methods, including the more recent slice sampling procedure used in 
[11], would become very slow as the network size increases. 
It is possible to consider a more general non-linear network along the lines of hierarchical
IFA. Notice from (2) that given the previous layer output $y^{n-1}$, the
mean output of the next layer is $\bar{y}_i^n = \sum_j H_{ij}^n f_j^n(y_j^{n-1})$ (see (3)), i.e. a linear
mixing preceded by a non-linear function operating on each output component separately.
However, if we eliminate the sources $x_j^n$, replace the individual source
states $s_j^n$ by collective states $s^n$, and allow the linear transformation to depend on
$s^n$, we arrive at the following model: $p(s^n = s \mid y^{n-1}) \propto \exp(a_s^n + b_s^{nT} y^{n-1})$,
$p(y^n \mid s^n = s, y^{n-1}) = G(y^n - h_s^n - H_s^n y^{n-1}, \Lambda^n)$. Now we have
$\bar{y}^n = \sum_s p(s^n = s \mid y^{n-1})(h_s^n + H_s^n y^{n-1}) \equiv F(y^{n-1})$, which is a more general non-linearity.
Finally, the blocks $\{y^n, x^n, s^n \mid y^{n-1}\}$ (Fig. 1), or alternatively the blocks $\{y^n, s^n \mid y^{n-1}\}$
described above, can be connected not only vertically (as in this paper) and
horizontally (creating layers with multiple blocks), but in any directed acyclic graph
architecture, with the variational EM algorithm extended accordingly.
Acknowledgements 
I thank V. de Sa for helpful discussions. Supported by The Office of Naval Research 
(N00014-94-1-0547), NIDCD (R01-02260), and the Sloan Foundation. 
References 
[1] Bell, A.J. and Sejnowski, T.J. (1995). An information-maximization approach 
to blind separation and blind deconvolution. Neural Computation 7, 1129-1159.
[2] Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. 
IEEE Signal Processing Letters 4, 112-114. 
[3] Pearlmutter, B.A. and Parra, L.C. (1997). Maximum likelihood blind source sep- 
aration: A context-sensitive generalization of ICA. Advances in Neural Information 
Processing Systems 9 (Ed. Mozer, M.C. et al), 613-619. MIT Press. 
[4] Attias, H. and Schreiner, C.E. (1998). Blind source separation and deconvolu- 
tion: the dynamic component analysis algorithm. Neural Computation 10, 1373- 
1424. 
[5] Lewicki, M.S. and Sejnowski, T.J. (1998). Learning nonlinear overcomplete 
representations for efficient coding. Advances in Neural Information Processing 
Systems 10 (Ed. Jordan, M.I. et al), MIT Press. 
[6] Attias, H. (1999). Independent factor analysis. Neural Computation, in press. 
[7] Neal, R.M. and Hinton, G.E. (1998). A view of the EM algorithm that justifies 
incremental, sparse, and other variants. Learning in Graphical Models (Ed. Jordan, 
M.I.), Kluwer Academic Press. 
[8] Saul, L.K., Jaakkola, T., and Jordan, M.I. (1996). Mean field theory of sigmoid
belief networks. Journal of Artificial Intelligence Research 4, 61-76. 
[9] Frey, B.J. (1997). Continuous sigmoidal belief networks trained using slice sampling.
Advances in Neural Information Processing Systems 9 (Ed. Mozer, M.C. et
al). MIT Press. 
[10] Frey, B.J. and Hinton, G.E. (1999). Variational learning in non-linear Gaussian 
belief networks. Neural Computation, in press. 
[11] Ghahramani, Z. and Hinton, G.E. (1998). Hierarchical non-linear factor analy- 
sis and topographic maps. Advances in Neural Information Processing Systems 10 
(Ed. Jordan, M.I. et al), MIT Press. 
[12] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kauf- 
mann, San Mateo, CA. 
