Independent Factor Analysis with 
Temporally Structured Sources 
Hagai Attias 
hagai@gatsby. ucl.ac.uk 
Gatsby Unit, University College London 
17 Queen Square 
London WC1N 3AR, U.K. 
Abstract 
We present a new technique for time series analysis based on dy- 
namic probabilistic networks. In this approach, the observed data 
are modeled in terms of unobserved, mutually independent factors, 
as in the recently introduced technique of Independent Factor Anal- 
ysis (IFA). However, unlike in IFA, the factors are not i.i.d.; each 
factor has its own temporal statistical characteristics. We derive a 
family of EM algorithms that learn the structure of the underlying 
factors and their relation to the data. These algorithms perform 
source separation and noise reduction in an integrated manner, and 
demonstrate superior performance compared to IFA. 
I Introduction 
The technique of independent factor analysis (IFA) introduced in [1] provides a 
tool for modeling L-dim data in terms of L unobserved factors. These factors 
are mutually independent and combine linearly with added noise to produce the 
observed data. Mathematically, the model is defined by 
yt - Hxt q- ut , 
(1) 
where xt is the vector of factor activities at time t, Yt is the data vector, H is the 
L  x L mixing matrix, and ut is the noise. 
The origins of IFA lie in applied statistics on the one hand and in signal processing 
on the other hand. Its statistics ancestor is ordinary factor analysis (FA), which as- 
sumes Gaussian factors. In contrast, IFA allows each factor to have its own arbitrary 
distribution, modeled semi-parametrically by a 1-dim mixture of Gaussians (MOG). 
The MOG parameters, as well as the mixing matrix and noise covariance matrix, 
are learned from the observed data by an expectation-maximization (EM) algorithm 
derived in [1]. The signal processing ancestor of IFA is the independent component 
analysis (ICA) method for blind source separation [2]-[6]. In ICA, the factors are 
termed sources, and the task of blind source separation is to recover them from the 
observed data with no knowledge of the mixing process. The sources in ICA have 
non-Gaussian distributions, but unlike in IFA these distributions are usually fixed 
by prior knowledge or have quite limited adaptability. More significant restrictions 
Dynamic Independent Factor Analysis 387 
are that their number is set to the data dimensionality, i.e. L = L  ('square mix- 
ing'), the mixing matrix is assumed invertible, and the data are assumed noise-free 
(ut = 0). In contrast, IFA allows any L, L  (including more sources than sensors, 
L > L'), as well as non-zero noise with unknown covariance. In addition, its use of 
the flexible MOG model often proves crucial for achieving successful separation [1]. 
Therefore, IFA generalizes and unifies FA and ICA. Once tle model has been 
learned, it can be used for classification (fitting an IFA model for each class), com- 
pleting missing data, and so on. In the context of blind separation, an optimal 
reconstruction of the sources xt from data is obtained [1] using a MAP estimator. 
However, IFA and its ancestors suffer from the following shortcoming: They are 
oblivious to temporal information since they do not attempt to model the temporal 
statistics of the data (but see [4] for square, noise-free mixing). In other words, the 
model learned would not be affected by permuting the time indices of {yt }. This is 
unfortunate since modeling the data as a time series would facilitate filtering and 
forecasting, as well as more accurate classification. Moreover, for source separation 
applications, learning temporal statistics would provide additional information on 
the sources, leading to cleaner source reconstructions. 
To see this, one may think of the problem of blind separation of noisy data in terms 
of two components: source separation and noise reduction. A possible approach 
might be the following two-stage procedure. First, perform noise reduction using, 
e.g., Wiener filtering. Second, perform source separation on the cleaned data us- 
ing, e.g., an ICA algorithm. Notice that this procedure directly exploits temporal 
(second-order) statistics of the data in the first stage to achieve stronger noise re- 
duction. An alternative approach would be to exploit the temporal structure of 
the data indirectly, by using a temporal source model. In the resulting single-stage 
algorithm, the operations of source separation and noise reduction are coupled. This 
is the approach taken in the present paper. 
In the following, we present a new approach to the independent factor problem 
based on dynamic probabilistic networks. In order to capture temporal statistical 
properties of the observed data, we describe each source by a hidden Markov model 
(HMM). The resulting dynamic model describes a multivariate time series in terms 
of several independent sources, each having its own temporal characteristics. Section 
2 presents an EM learning algorithm for the zero-noise case, and section 3 presents 
an algorithm for the case of isotropic noise. The case of non-isotropic noise turns out 
to be computationally intractable; section 4 provides an approximate EM algorithm 
based on a variational approach. 
Notation: The multivariable Gaussian density is denoted by G(z, E) =l 2rY:, 1-1/2 
exp(--zTE-z/2). We wo. rk with T-point time blocks denoted X:T = {xt}=l. The 
ith coordinate of xt is xl. For a function f, (f(x:T)} denotes averaging over an 
ensemble of x:- blocks. 
2 Zero Noise 
The MOG source model employed in IFA [1] has the advantages that (i) it is capable 
of approximating arbitrary densities, and (ii) it can be learned efficiently from data 
by EM. The Gaussians correspond to the hidden states of the sources, labeled by 
s. Assume that at time t, source i is in state sl - s. Its signal xl is then generated 
i In order 
by sampling from a Gaussian distribution with mean uis and variance 's. 
to capture temporal statistics of the data, we endow the sources with temporal 
structure by introducing a transition matrix i between the states. Focusing on 
388 H. Attias 
a time block t - 1, ..., T, the resulting probabilistic model is defined by 
i 
p(s = s l i i p(s = s) = 7rs , 
St-1 -- st) __ as's , 
i = s) -- (x //s e) P(Yl:T) =l det G I T p(Xl:T) (2) 
p(x l s t - , , , 
where p(xi:r) is the joint density of all sources x i i = 1, L at all time points, and 
the last equation follows from xt = Gyt with G = H - being the unmixing matrix. 
As usual in the noise-free scenario (see [2]; section 7 of [1]), we are assuming that 
the mixing matrix is square and invertible. 
The graphical model for the observed density P(Y:T I w) defined by (2) is 
i i i 
parametrized by W = {Gij,l.t}, es,7rs,as, s}. This model describes each source as 
a first-order HMM; it reduces to a time-independent model if i i Whereas 
a$/$  7r$. 
temporal structure can be described by other means, e.g. a moving-average [4] or 
autoregressive [6] model, the HMM is advantageous since it models high-order tem- 
poral statistics and facilitates EM learning. Omitting the derivation, maximization 
with respect to Gij results in the incremental update rule 
T 
5G = G- -  (xt)xtTG, (3) 
where (x) = Es 7[(s)(x ' ' 
- s)/es, and the natural gradient [3] was used; e is an 
appropriately chosen learning rate. For the source parameters we obtain the update 
rules 
u,; = E,,t(8)x ,= , 
Et 7j(s) ' es Et 7(s) , as, s = Et ' (4) 
i = ?(s.). We used the standard HMM 
with the initial probabilities updated via 7r s 
notation 7(s) = p(s = s I X:T), (s',s) -- p(s}_ = s',s = s I X:T). These 
posterior densities are computed in the E-step for each source, which is given in 
terms of the data via x = yj Gijyt j, using the forward-backward procedure [7]. 
The algorithm (3-4) may be used in several possible generalized EM schemes. An 
efficient one is given by the following two-phase procedure: (i) freeze the source 
parameters and learn the separating matrix G using (3); (ii) freeze G and learn the 
source parameters using (4), then go back to (i) and repeat. Notice that the rule (3) 
is similar to a natural gradient version of Bell and Sejnowski's ICA rule [2]; in fact, 
the two coincide for time-independent sources where ck(xi) - -01ogp(xi)/Oxi. We 
also recognize (4) as the Baum-Welch method. Hence, in phase (i) our algorithm 
separates the sources using a generalized ICA rule, whereas in phase (ii) it learns 
an HMM for each source. 
Remark. Often one would like to model a given L'-variable time series in terms 
of a smaller number L < L' of factors. In the framework of our noise-free model 
yt -- Hxt, this can be achieved by applying the above algorithm to the L largest 
principal components of the data; notice that if the data were indeed generated by L 
factors, the remaining L'-L principal components would vanish. Equivalently, one 
may apply the algorithm to the data directly, using a non-square L x L' unmixing 
matrix G. 
Results. Figure 1 demonstrates the performance of the above method on a 4 x 4 
mixture of speech signals, which were passed through a non-linear function to mod- 
ify their distributions. This mixture is inseparable to ICA because the source model 
used by the latter does not fit the actual source densities (see discussion in [1]). We 
also applied our dynamic network to a mixture of speech signals whose distributions 
Dynamic Independent Factor Analysis 389 
0.8 
0.7 
0.8 
, 0.5 
0.4 
0.3 
0.2 
0.1 
0 
--2 0 2 
x 
3 
HMM--ICA ICA 
4 --2 0 2 --2 0 2 
Figure 1: Left: Two of the four source distributions. Middle: Outputs of the EM algo- 
rithm (3-4) are nearly independent. Right: the outputs of ICA [2] are correlated. 
were made Gaussian by an appropriate non-linear transformation. Since temporal 
information is crucial for separation in this case (see [4],[6]), this mixture is in- 
separable to ICA and IFA; however, the algorithm (3-4) accomplished separation 
successfully. 
3 Isotropic Noise 
We now turn to the case of non-zero noise ut  0. We assume that the noise is white 
and has a zero-mean Gaussian distribution with covariance matrix A. In general, 
this case is computationally intractable (see section 4). The reason is that the E- 
step requires computing the posterior distribution p(S0:T, X:T I Y:T) not only over 
the source states (as in the zero-noise case) but also over the source signals, and 
this posterior has a quite complicated structure. We now show that if we assume 
isotropic noise, i.e. Aij = Sij, as well as square invertible mixing as above, this 
posterior simplifies considerably, making learning and inference tractable. This is 
done by adapting an idea suggested in [8] to our dynamic probabilistic network. 
We start by pre-processing the data using a linear transformation that makes their 
covariance matrix unity, i.e., (ytyt T) = I ('sphering'). Here (-) denotes averaging 
over T-point time blocks. From (1) it follows that HSH T = 'I, where S = (xtxt z) 
is the diagonal covariance matrix of the sources, and ' = 1 - . This, for a square 
invertible H, implies that HTH is diagonal. In fact, since the unobserved sources 
can be determined only to within a scaling factor, we can set the variance of each 
source to unity and obtain the othogonality property HH - 'I. It can be shown 
that the source posterior now factorizes into a product over the individual sources, 
-- p(SO:T,X]: T 
p(S0:T, XX: I YX:) Hi i 
ixi 
P(SO:T, I:T lYl:T) (x 
Yl:T), where 
 vtp(s t I s_1) 
i i 
VoP(So) . (5) 
The means and variances at time t in (5), as well as the quantities v, depend on 
both the data yt and the states s; in particular,  = ('j Hjiyt j + Aus)/(A'vs + A) 
and a = Av/(A'ys + A), using s = s; the expression for the v are omitted. The 
transition probabilities are the same as in (2). Hence, the posterior distribution 
(5) effectively defines a new HMM for each source, with yt-dependent emission and 
transition probabilities. 
To derive the learning rule for H, we should first compute the conditional mean it 
of the source signals at time t given the data. This can be done recursively using 
(5) as in the forward-backward procedure. We then obtain 
1  
H = v/7C(CZC)-/2 , C =  Eytit t . (6) 
390 H. Attias 
This fractional form results from imposing the orthogonality constraint HTH = 'I 
using Lagrange multipliers and can be computed via a diagonalization procedure. 
The source parameters are computed using a learning rule (omitted) similar to the 
noise-free rule (4). It is easy to derive a learning rule for the noise level  as well; in 
fact, the ordinary FA rule would suffice. We point out that, while this algorithm has 
been derived for the case L = L ', it is perfectly well defined (though sub-optimah 
see below) for  _< L . 
4 Non-Isotropic Noise 
The general case of non-isotropic noise and non-square mixing is computationally 
intractable. This is because the exact E-step requires summing over all possible 
source configurations (st s  ) at all times tl. t; -- l, ..., T. The intractability 
'"', tL ' ", 
problem stems from the fact that, while the sources are independent, the sources 
conditioned on a data vector Y:T are correlated, resulting in a large number of 
hidden configurations. This problem does not arise in the noise-free case, and can 
be avoided in the case of isotropic noise and square mixing using the orthogonality 
property; in both cases, the exact posterior over the sources factorizes. 
The EM algorithm derived below is based on a variational approach. This approach 
was introduced in [9] in the context of sigmoid belief networks, but constitutes a 
general framework for ML learning in intractable probabilistic networks; it was 
used in a HMM context in [10]. The idea is to use an approximate but tractable 
posterior to place a lower bound on the likelihood, and optimize the parameters by 
maximizing this bound. 
A starting point for deriving a bound on the likelihood/2 is Neal and Hinton's Ill] 
formulation of the EM algorithm: 
T L 
: 1ogp(y1:T)_> E Eqlogp(yt Ixt)+ E Eqlogp(s:T,X:T) -- Eqlogq , (7) 
t=l i=1 
where Eq denotes averaging with respect to an arbitrary posterior density over the 
hidden variables given the observed data, q = q(so:r, Xl:r I yl:r). Exact EM, 
as shown in [11], is obtained by maximizing the bound (7) with respect to both 
the posterior q (corresponding to the E-step) and the model parameters W (M- 
step). However, the resulting q is the true but intractable posterior. In contrast, in 
variational EM we choose a q that differs from the true posterior, but facilitates a 
tractable E-step. 
E-Step. We use q(sb:r,x:r [ Y:T) = I-li i 
q(so:r [ y:r) Htq(x 
parametrized as 
q(s sl i i i i i 
= st- = y:r) 
q(xt l Y:r) = 6(xt -pt,E:t). (8) 
Thus, the variational transition probabilities in (8) are described by multiplying the 
i by the parameters i subject to the normalization constraints. 
original ones 
/s,t' 
The source signals x at time t are jointly Gaussian with mean Pt and covariance 
It. The means, covariances and transition probabilities are all time- and data- 
dependent, i.e., Pt = f(y:T, t) etc. This parametrization scheme is motivated by 
  i there become 
the form of the posterior in (5); notice that the quantities r/, a, vs, t 
the variational parameters p, E?, A}, t of (8). A related scheme was used in [10] in 
a different context. Since these parameters will be adapted independently of the 
model parameters, the non-isotropic algorithm is expected to give superior results 
compared to the isotropic one. 
Dynamic Independent Factor Analysis 391 
o 
-lO 
. -3c 
-40 
-50 
Mixing Reconstruction 
( 
w --10 
--15 
--20 
5 10 15 --5 0 5 10 15 
SNR (dB) SNR (dB) 
Right: quality of the source 
Figure 2: Left: quality of the model parameter estimates. 
reconstructions. (See text). 
Of course, in the true posterior the xt are correlated, both temporally among them- 
selves and with st, and the latter do not factorize. To best approximate it, the 
variational parameters V = (p, E?, ,k/s,t) are optimized to maximize the bound on 
/2, or equivalently to minimize the KL distance between q and the true posterior. 
This requirement leads to the fixed point equations 
Pt = ( HTA-1H + Bt)-i(HTA-lyt + bt) , ]t = (HTA -1H + Bt) -1 , 
';,t = 1 exp[-logv -(p-u})2+Ei] 
' (9) 
where B? i ' ' 
-- Es[t($)/l]ij, b Es i i li 
fit (s)/z, the z ensure nor- 
= / 8, and factors 
malization. The HMM quantities '7(s) are computed by the forward-backward 
procedure using the variational transition probabilities (8). The variational param- 
eters are determined by solving eqs. (9) iteratively for each block Yl.T; in practice, 
we found that less then 20 iterations are usually required for convergence. 
M-Step. The update rules for W are given for the mixing parameters by 
-1 
and for the source parameters by 
where the (s', s) are computed using the variational transition probabilities (8). 
Notice that the learning rules for the source parameters have the Baum-Welch form, 
in spite of the correlations between the conditioned sources. In our variational 
approach, these correlations are hidden in V, as manifested by the fact that the 
fixed point equations (9) couple the parameters V across time points (since 
i 
depends on ,ks,t=:T) and sources. 
Source Reconstruction. From q(xt I Yl:r) (8), we observe that the MAP source 
estimate is given by it = Pt(Yl:T), and depends on both W and V. 
Results. The above algorithm is demonstrated on a source separation task in Fig- 
ure 2. We used 6 speech signals, transformed by non-linearities to have arbitrary 
one-point densities, and mixed by a random 8 x 6 matrix H0. Different signal- 
to-noise (SNR) levels were used. The error in the estimated H (left, solid line) is 
T I T 
quantified by the size of the non-diagonal elements of (H H)- H Ho relative to the 
H = ytpT ptPt  + Et , A = 7 YtYt - YtPt H ), (10) 
t 
392 H. Attias 
diagonal; the results obtained by IFA [1], which does not use temporal information, 
are plotted for reference (dotted line). The mean squared error of the reconstructed 
sources (right, solid line) and the corresponding IFA result (right, dashed line) are 
also shown. The estimate and reconstruction errors of this algorithm are consis- 
tently smaller than those of IFA, reflecting the advantage of exploiting the temporal 
structure of the data. Additional experiments with different numbers of sources and 
sensors gave similar results. Notice that this algorithm, unlike the previous two, 
allows both  (  and   . We also considered situations where the number of 
sensors was smaller than the number of sources; the separation quality was good, 
although, as expected, less so than in the opposite case. 
5 Conclusion 
An important issue that has not been addressed here is model selection. When ap- 
plying our algorithms to an arbitrary dataset, the number of factors and of HMM 
states for each factor should be determined. Whereas this could be done, in princi- 
ple, using cross-validation, the required computational effort would be fairly large. 
However, in a recent paper [12] we develop a new framework for Bayesian model 
selection, as well as model averaging, in probabilistic networks. This framework, 
termed Variational Bayes, proposes an EM-like algorithm which approximates full 
posterior distributions over not only hidden variables but also parameters and model 
structure, as well as predictive quantities, in an analytical manner. It is currently 
being applied to the algorithms presented here with good preliminary results. 
One field in which our approach may find important applications is speech technol- 
ogy, where it suggests building more economical signal models based on combining 
independent low-dimensional HMMs, rather than fitting a single complex HMM. 
It may also contribute toward improving recognition performance in noisy, multi- 
speaker, reverberant conditions which characterize real-world auditory scenes. 
References 
[1] Attias, H. (1999). Independent factor analysis. Neut. Comp. 11, 803-851. 
[2] Bell, A.J. & Sejnowski, T.J. (1995). An information-maximization approach to blind 
separation and blind aleconvolution. Neut. Comp. 7, 1129-1159. 
[3] Amari, S., Cichocki, A. & Yang, H.H. (1996). A new learning algorithm for blind signal 
separation. Adv. Neut. Info. ?roc. Sys. 8, 757-763 (Ed. by Touretzky, D.S. et al). MIT 
Press, Cambridge, MA. 
[4] Pearlmutter, B.A. & Parra, L.C. (1997). Maximum likelihood blind source separation: 
A context-sensitive generalization of ICA. Adv. Neut. Info. Proc. Sys. 9, 613-619 (Ed. 
by Mozer, M.C. et al). MIT Press, Cambridge, MA. 
[5] Hyv'inen, A. & Oja, E. (1997). A fast fixed-point algorithm for independent compo- 
nent analysis. Neut. Comp. 9, 1483-1492. 
[6] Attias, H. & Schreiner, C.E. (1998). Blind source separation and aleconvolution: the 
dynamic component analysis algorithm. Neut. Comp. 10, 1373-1424. 
[7] Rabiner, L. & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice Hall, 
Englewood Cliffs, NJ. 
[8] Lee, D.D. & Sompolinsky, H. (1999), unpublished; D.D. Lee, personal communication. 
[9] Saul, L.K., Jaakkola, T., and Jordan, M.I. (1996). Mean field theory of sigmoid belief 
networks. J. Art. Int. Res. 4, 61-76. 
[10] Ghahramani, Z. & Jordan, M.I. (1997). Factorial hidden Markov models. Mach. 
Learn. 29, 245-273. 
[11] Neal, R.M. & Hinton, G.E. (1998). A view of the EM algorithm that justifies incre- 
mental, sparse, and other variants. Learning in Graphical Models, 355-368 (Ed. by Jordan, 
M.I.). Kluwer Academic Press. 
[12] Attias, H. (2000). A variational Bayesian framework for graphical models. Adv. Neut. 
Info. Proc. Sys. 12 (Ed. by Leen, T. et al). MIT Press, Cambridge, MA. 
