Factorial Hidden Markov Models 
Zoubin Ghahramani 
zoubin@psyche.mit.edu 
Department of Computer Science 
University of Toronto 
Toronto, ON M5S 1A4 
Canada 
Michael I. Jordan 
jordan@psyche.mit.edu 
Department of Brain & Cognitive Sciences 
Massachusetts Institute of Technology 
Cambridge, MA 02139 
USA 
Abstract 
We present a framework for learning in hidden Markov models with 
distributed state representations. Within this framework, we de- 
rive a learning algorithm based on the Expectation-Maximization 
(EM) procedure for maximum likelihood estimation. Analogous to 
the standard Baum-Welch update rules, the M-step of our algo- 
rithm is exact and can be solved analytically. However, due to the 
combinatorial nature of the hidden state representation, the exact 
E-step is intractable. A simple and tractable mean field approxima- 
tion is derived. Empirical results on a set of problems suggest that 
both the mean field approximation and Gibbs sampling are viable 
alternatives to the computationally expensive exact algorithm. 
I Introduction 
A problem of fundamental interest to machine learning is time series modeling. Due 
to the simplicity and efficiency of its parameter estimation algorithm, the hidden 
Markov model (HMM) has emerged as one of the basic statistical tools for modeling 
discrete time series, finding widespread application in the areas of speech recogni- 
tion (Rabiner and Juang, 1986) and computational molecular biology (Baldi et al., 
1994). An HMM is essentially a mixture model, encoding information about the 
history of a time series in the value of a single multinomial variable (the hidden 
state). This multinomial assumption allows an efficient parameter estimation algo- 
rithm to be derived (the Baum-Welch algorithm). However, it also severely limits 
the representational capacity of HMMs. For example, to represent 30 bits of in- 
formation about the history of a time sequence, an HMM would need 23 distinct 
states. On the other hand an HMM with a distributed state representation could 
achieve the same task with 30 binary units (Williams and Hinton, 1991). This paper 
addresses the problem of deriving efficient learning algorithms for hidden Markov 
models with distributed state representations. 
Factorial Hidden Markov Models 4 73 
The need for distributed state representations in HMMs can be motivated in two 
ways. First, such representations allow the state space to be decomposed into 
features that naturally decouple the dynamics of a single process generating the 
time series. Second, distributed state representations simplify the task of modeling 
time series generated by the interaction of multiple independent processes. For 
example, a speech signal generated by the superposition of multiple simultaneous 
speakers can be potentially modeled with such an architecture. 
Williams and Hinton (1991) first formulated the problem of learning in HMMs with 
distributed state representation and proposed a solution based on deterministic 
Boltzmann learning. The approach presented in this paper is similar to Williams 
and Hinton's in that it is also based on a statistical mechanical formulation of hidden 
Markov models. However, our learning algorithm is quite different in that it makes 
use of the special structure of HMMs with distributed state representation, resulting 
in a more efficient learning procedure. Anticipating the results in section 2, this 
learning algorithm both obviates the need for the two-phase procedure of Boltzmann 
machines, and has an exact M-step. A different approach comes from Saul and 
Jordan (1995), who derived a set of rules for computing the gradients required for 
learning in HMMs with distributed state spaces. However, their methods can only 
be applied to a limited class of architectures. 
2 Factorial hidden Markov models 
Hidden Markov models are a generalization of mixture models. At any time step, 
the probability density over the observables defined by an HMM is a mixture of 
the densities defined by each state in the underlying Markov model. Temporal 
dependencies are introduced by specifying that the prior probability of the state at 
time t depends on the state at time t- 1 through a transition matrix, P (Figure la). 
Another generalization of mixture models, the cooperative vector quantizer (CVQ; 
Hinton and Zemel, 1994 ), provides a natural formalism for distributed state repre- 
sentations in HMMs. Whereas in simple mixture models each data point must be 
accounted for by a single mixture component, in CVQs each data point is accounted 
for by the combination of contributions from many mixture components, one from 
each separate vector quantizer. The total probability density modeled by a CVQ 
is also a mixture model; however this mixture density is assumed to factorize into 
a product of densities, each density associated with one of the vector quantizers. 
Thus, the CVQ is a mixture model with distributed representations for the mixture 
components. 
Factorial hidden Markov models 1 combine the state transition structure of HMMs 
with the distributed representations of CVQs (Figure lb). Each of the d underlying 
t at time t and transition probability matrix 
Markov models has a discrete state s i 
Pi. As in the CVQ, the states are mutually exclusive within each vector quantizer 
and we assume real-valued outputs. The sequence of observable output vectors is 
generated from a normal distribution with mean given by the weighted combination 
of the states of the underlying Markov models: 
i=1 
where C is a common covariance matrix. The k-valued states si are represented as 
We refer to HMMs with distributed state as factorial HMMs as the features of the 
distributed state factorize the total state representation. 
4 74 Z. GHAHRAMANI, M. I. JORDAN 
discrete column vectors with a 1 in one position and 0 everywhere else; the mean of 
the observable is therefore a combination of columns from each of the Wi matrices. 
a) 
(ooooo) 
P 
b) 
(ooooo)  
) 
Figure 1. a) Hidden Markov model. b) Factorial hidden Markov model. 
We capture the above probability model by defining the energy of a sequence of T 
states and observations, {(st, yt)}tT= , which we abbreviate to {s,y}, as: 
yt _ y Vis C -1 
7/({s,y}) -  t--1 i----1 
-- -- S i Ais i , (1) 
i=1 t=l i=1 
= sit ) such that Ej-1 e[A ---- 1 and denotes matrix 
where [Ai]t logP(sjl t--1 k 
transpose. Priors for the initial state, s 1, are introduced by setting the second term 
I 
in (1) to --]-{=1 si logTri. The probability model is defined from this energy by 
the Boltzmann distribution 
1 exp{-7t({s, y}) }. 
P({s,y}) =  
(2) 
Note that like in the CVQ (Ghahramani, 1995), the unclamped partition function 
Z - / d{y) y.exp{-7/({s, y})), 
{s} 
evaluates to a constant, independent of the parameters. This can be shown by 
first integrating the Gaussian variables, removing all dependency on {y}, and then 
summing over the states using the constraint on e[A'] . 
The EM algorithm for Factorial HMMs 
As in HMMs, the parameters of a factoriM HMM can be estimated via the EM 
(Baum-Welch) algorithm. This procedure iterates between assuming the current 
parameters to compute probabilities over the hidden states (E-step), and using 
these probabilities to maximize the expected log likelihood of the parameters (M- 
step). 
Using the likelihood (2), the expected log likelihood of the parameters is 
Q(n'l ) = (-7/({s, y)) -logZ)c, (3) 
Factoffal Hidden Markov Models 4 75 
where ( -- {Wi,Pi, C)id__l denotes the current parameters, and (. denotes ex- 
pectation given the clamped observation sequence and b. Given the observation 
sequence, the only random variables are the hidden states Expanding equation (3) 
and limiting the expectation to these random variables we find that the statistics 
t t t t t t-1 t 
that need to be computed for the E-step are (si)c, ($i$j)c, and ($i$i )c. Note 
that in standard HMM notation (Rabiner and Juang, 1986), (s)c corresponds to 
7t and t t--i t t 
(s,s, corresponds to 6, where no analogue when there 
is only a single underlying Markov model. The M-step uses these expectations to 
maximize Q with respect to the parameters. 
The constant partition function allowed us to drop the second term in (3). There- 
fore, unlike the Boltzmann machine, the expected log likelihood does not depend 
on statistics collected in an unclamped phase of learning, resulting in much faster 
learning than the traditional Boltzmann machine (Neal, 1992). 
M-step 
Setting the derivatives of Q with respect to the output weights to zero, we obtain 
a linear system of equations for W: 
t 
N,t j 
wnew= 
L ]V,t . 
the vector and 
where s and W are matrix of concatenated si and. Wi, 
respectively,-. N denotes summation over a data set of N sequences, and i is the 
Moore-Penrose pseudo-inverse. To estimate the log transition probabilities we solve 
 [A,b, 
OQ/O[Ai]jt - 0 subject to the constraint .j e ' - 1, obtaining 
[A ]new EN, t(8J 8il 1)c 
t i./t - log (4) 
- 1st t- ' 
Z-N,t,jk ijSil 1) c 
The covariance matrix can be similarly estimated: 
anew .-- Eyy'- E y(S)'c(SS')(s)cy'. 
N,t N,t 
The M-step equations can therefore be solved analytically; furthermore, for a single 
underlying Markov chain, they reduce to the traditional Baum-Welch re-estimation 
equations. 
E-step 
Unfortunately, as in the simpler CVQ, the exact E-step for factorial HMMs is com- 
putationally intractable. For example, the expectation of the jth unit in vector i at 
time step t, given {y), is: 
(sj)c = P(sj = 11{y),) 
k 
t __ 
- 1,... ,s2 -- 1,..., sa,2,,- ll{y}, 
jl ,...,jhi,...,ja 
Although the Markov property can be used to obtain a forward-backward-like fac- 
torization of this expectation across time steps, the sum over all possible configura- 
tions of the other hidden units within each time step is unavoidable. For a data set 
4 76 Z. GHAHRAMANI, M. I. JORDAN 
of N sequences of length T, the full E-step calculated through the forward-backward 
procedure has time complexity O(NTked). Although more careful bookkeeping can 
reduce the complexity to O(NTdkd+i), the exponential time cannot be avoided. 
This intractability of the exact E-step is due inherently to the cooperative nature 
of the model the setting of one vector only determines the mean of the observable 
if all the other vectors are fixed. 
Rather than summing over all possible hidden state patterns to compute the exact 
expectations, a natural approach is to approximate them through a Monte Carlo 
method such as Gibbs sampling. The procedure starts with a clamped observable 
sequence {y} and a random setting of the hidden states {s}}. At each time step, 
each state vector is updated stochastically according to its probability distribution 
t,,, P(sl{y} {sj.j  
conditioned on the setting of all the other state vectors: s i , 
i or r y t}, 4). These conditional distributions are straightforward to compute and 
a frill pass of Gibbs sampling requires O(NTkd) operations. The first and second- 
t t t' tt-l'x 
order statistics needed to estimate (si)c, and (''i ?c are 
(sisj) collected using 
the sj's visited and the probabilities estimated during this sampling process. 
Mean field approximation 
A different approach to computing the expectations in an intractable system is 
given by mean field theory. A mean field approximation for factorim HMMs can be 
obtained by defining the energy function 
1 t' t 
7({s, y}) =  E [yt _ tit]' C-1 [yt _ tit] _  si logmi ' 
t t,i 
which results in a completely factorized approximation to probability density (2): 
I , 
P(ls, y)) l-Iexpl- [yt-tt t] C-l[yt-ttt]}H(m)" (5) 
t t ,i ,j 
In this approximation, the observables are independently Gaussian distributed with 
mean t and each hidden state vector is multinomially distributed with mean mi. 
This approximation is made as tight  possible by chosing the mean field parame- 
ters t and m that minimize the Fullback-Liebler divergence 
(PIIP)  (logP)p - (log P)p 
where (.)p denotes expectation over the mean field distribution (5). With the 
observables clamped, t can be set equal to the observable yt. Minimizing rE(PlIP) 
with respect to the mean field parameters for the states results in a fixed-point 
equation which can be iterated until convergence: 
1  
t diag{WiC-m}_ 1 (6) 
tnew _ {W/C-1 [yt_ t] + WlC-lWimi 
t&l 
+Aim - + Ai, i } 
m t 
where t  i Wi i and a{.} is the softmax exponential, normalized over each 
hidden state vector. The first term is the projection of the error in the observable 
onto the weights of state vector ithe more a hidden unit can reduce this error, the 
larger its mean field parameter. The next three terms arise from the fact that 2 _ 
is equal to mij and not mj. The last two terms introduce dependencies forward 
and backward in time. Each state vector is ynchronously updated using (6), at 
a time cost of O(NTkd) per iteration. Convergence is diagnosed by monitoring 
the  divergence in the mean field distribution between successive time steps; in 
practice convergence is very rapid (about 2 to 10 iterations of (6)). 
Factoffal Hidden Markov Models 477 
Table 1: Comparison of factorial HMM on four problems of varying size 
d k Alg # Train Test Cycles Time/Cycle 
3 2 HMM 5 649 8 358 :t:81 33 :t: 19 1.1s 
Exact 877 :t: 0 768 0 22 +6 3.0s 
Gibbs 710 :t: 152 627 :t: 129 28 :t: 11 6.0 s 
MF 755 + 168 670 :t: 137 32 :t: 22 1.2 s 
3 3 HMM 5 670 26 -782  128 23  10 3.6s 
Exact 568  164 276  62 35  12 5.2 s 
Gibbs 564  160 305  51 45  16 9.2 s 
MF 495 83 326 62 38 22 1.6s 
5 2 HMM 5 588 37 -2634 566 18  1 5.2s 
Exact 223  76 159  80 31  17 6.9 s 
Gibbs 123  103 73  95 40  5 12.7s 
MF 292  101 237  103 54 29 2.2s 
5 3 HMM 3 1671,1678,1690 -,-,- 14,14,12 90.0 s 
Exact -55,-354,-295 -123,-378,-402 90,100,100 51.0 s 
Gibbs -123,-160,-194 -202,-237,-307 100,73,100 14.2 s 
MF -287,-286,-296 -364,-370,-365 100,100,100 4.7 s 
Table 1. Data was generated from a factoffal HMM with d underlying Markov models of 
k states each. The training set was 10 sequences of length 20 where the observable was a 
4-dimensional vector; the test set was 20 such sequences. HMM indicates a hidden Markov 
model with k a states; the other algorithms are factoffal HMMs with d underlying k-state 
models. Gibbs sampling used 10 samples of each state. The algorithms were run until 
convergence, as monitored by relative change in the likelihood, or a maximum of 100 cycles. 
The # column indicates number of runs. The Train and Test columns show the log likelihood 
& one standard deviation on the two data sets. The last column indicates approximate time 
per cycle on a Silicon Graphics R4400 processor running Matlab. 
3 Empirical Results 
We compared three EM algorithms for learning in factoffal HMMs--using Gibbs 
sampling, mean field approximation, and the exact (exponential) E step--on the 
basis of performance and speed on randomly generated problems. Problems were 
generated from a factorial HMM structure, the parameters of which were sam- 
pled from a uniform [0, 1] distribution, and appropriately normalized to satisfy the 
sum-to-one constraints of the transition matrices and priors. Also included in the 
comparison was a traditional HMM with as many states (k d) as the factorial HMM. 
Table i summarizes the results. Even for moderately large state spaces (d _ 
3 and k _ 3) the standard HMM with k d states suffers from severe overfitting. 
Furthermore, both the standard HMM and the exact E-step factorial HMM are 
extremely slow on the larger problems. The Gibbs sampling and mean field approx- 
imations offer roughly comparable performance at a great increase in speed. 
4 Discussion 
The basic contribution of this paper is a learning algorithm for hidden Markov 
models with distributed state representations. The standard Baum-Welch proce- 
dure is intractable for such architectures as the size of the state space generated 
from the cross product of d k-valued features is O(k), and the time complexity of 
Baum-Welch is quadratic in this size. More importantly, unless special constraints 
are applied to this cross-product HMM architecture, the number of parameters also 
4 78 Z. GHAHRAMANI, M. I. JORDAN 
grows as O(k2d), which can result in severe overfitting. 
The architecture for factorial HMMs presented in this paper did not include any 
coupling between the underlying Markov chains. It is possible to extend the algo- 
rithm presented to architectures which incorporate such couplings. However, these 
couplings must be introduced with caution as they may result either in an exponen- 
tial growth in parameters or in a loss of the constant partition function property. 
The learning algorithm derived in this paper assumed real-valued observables. The 
algorithm can also be derived for HMMs with discrete observables, an architecture 
closely related to sigmoid belief networks (Neal, 1992). However, the nonlinearities 
induced by discrete observables make both the E-step and M-step of the algorithm 
more difficult. 
In conclusion, we have presented Gibbs sampling and mean field leaining algorithms 
for factorial hidden Markov models. Such models incorporate the time series mod- 
eling capabilities of hidden Markov models and the advantages of distributed rep- 
resentations for the state space. Future work will concentrate on a more efficient 
mean field approximation in which the forward-backward algorithm is used to com- 
pute the E-step exactly within each Markov chain, and mean field theory is used to 
handle interactions between chains (Saul and Jordan, 1996). 
Acknowledgement s 
This project was supported in part by a grant from the McDonnell-Pew Foundation, by a 
grant from ATR Human Information Processing Research Laboratories, by a grant from 
Siemens Corporation, and by grant N00014-94-1-0777 from the Office of Naval Research. 
References 
Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. (1994). Hidden Markov models 
of biological primary sequence information. Proc. Nat. Acad. Sci. (USA), 91(3):1059- 
1063. 
Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In Tesauro, G., Touret- 
zky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 
7. MIT Press, Cambridge, MA. 
Hinton, G. and Zemel, R. (1994). Autoencoders, minimum description length, and 
Helmholtz free energy. In Cowan, J., Tesauro, G., and Alspector, J., editors, Ad- 
vances in Neural Information Processing Systems 6. Morgan Kaufmanm Publishers, 
San Francisco, CA. 
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56:71- 
113. 
Rabiner, L. and Juang, B. (1986). An Introduction to hidden Markov models. IEEE 
Acoustics, Speech  Signal Processing Magazine, 3:4-16. 
Saul, L. and Jordan, M. (1995). Boltzmann chains and hidden Markov models. In Tesauro, 
G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing 
Systems 7. MIT Press, Cambridge, MA. 
Saul, L. and Jordan, M. (1996). Exploiting tractable substructures in Intractable net- 
works. In Touretzky, D., Mozer, M., and Hasselmo, M., editors, Advances in Neural 
Information Processing Systems 8. MIT Press. 
Williams, C. and Hinton, G. (1991). Mean field networks that learn to discriminate tem- 
porally distorted strings. In Touretzky, D., Elman, J., Sejnowski, T., and Hinton, G., 
editors, Connectionist Models: Proceedings of the 1990 Summer School, pages 18-22. 
Morgan Kaufmann Publishers, Man Mateo, CA. 
