Exploiting Tractable Substructures 
in Intractable Networks 
Lawrence K. Saul and Michael I. Jordan 
{lksaul, j ordar}*psyche. mir. edu 
Center for Biological and Computational Learning 
Massachusetts Institute of Technology 
79 Amherst Street, E10-243 
Cambridge, MA 02139 
Abstract 
We develop a refined mean field approximation for inference and 
learning in probabilistic neural networks. Our mean field theory, 
unlike most, does not assume that the units behave as independent 
degrees of freedom; instead, it exploits in a principled way the 
existence of large substructures that are computationally tractable. 
To illustrate the advantages of this framework, we show how to 
incorporate weak higher order interactions into a first-order hidden 
Markov model, treating the corrections (but not the first order 
structure) within mean field theory. 
1 INTRODUCTION 
Learning the parameters in a probabilistic neural network may be viewed as a 
problem in statistical estimation. In networks with sparse connectivity (e.g. trees 
and chains), there exist efficient algorithms for the exact probabilistic calculations 
that support inference and learning. In general, however, these calculations are 
intractable, and approximations are required. 
Mean field theory provides a framework for approximation in probabilistic neural 
networks (Peterson & Anderson, 1987). Most applications of mean field theory, 
however, have made a rather drastic probabilistic assumption--namely, that the 
units in the network behave as independent degrees of freedom. In this paper we 
show how to go beyond this assumption. We describe a self-consistent approxi- 
mation in which tractable substructures are handled by exact computations and 
only the remaining, intractable parts of the network are handled within mean field 
theory. For simplicity we focus on networks with binary units; the extension to 
discrete-valued (Potts) units is straightforward. 
Exploiting Tractable Substructures in Intractable Networks 487 
We apply these ideas to hidden Markov modeling (Rabiner & Juang, 1991). The 
first order probabilistic structure of hidden Markov models (HMMs) leads to net- 
works with chained architectures for which efficient, exact algorithms are available. 
More elaborate networks are obtained by introducing couplings between multiple 
HMMs (Williams & Hinton, 1990) and/or long-range couplings within a single HMM 
(Stolorz, 1994). Both sorts of extensions have interesting applications; in speech, 
for example, multiple HMMs can provide a distributed representation of the artic- 
ulatory state, while long-range couplings can model the effects of coarticulation. In 
general, however, such extensions lead to networks for which exact probabilistic cal- 
culations are not feasible. One would like to develop a mean field approximation for 
these networks that exploits the tractability of first-order HMMs. This is possible 
within the more sophisticated mean field theory described here. 
2 MEAN FIELD THEORY 
We briefly review the basic methodology of mean field theory for networks of binary 
(4-1) stochastic units (Parisi, 1988). For each configuration {$} = {$1, $,..., $N }, 
we define an energy E{S} and a probability P{$} via the Boltzmann distribution: 
P{S} = Z ' (1) 
where /3 is the inverse temperature and Z is the partition function. When it is 
intractable to compute averages over P{S}, we are motivated to look for an ap- 
proximating distribution Q{$}. Mean field theory posits a particular parametrized 
form for Q{$}, then chooses parameters to minimize the Kullback-Liebler (KL) 
divergence: 
KL(QllP) = y] Q{$} In [P--J' (2) 
{s} 
Why are mean field approximations valuable for learning? Suppose that P{$} 
represents the posterior distribution over hidden variables, as in the E-step of an 
EM algorithm (Dempster, Laird, & Rubin, 1977). Then we obtain a mean field 
approximation to this E-step by replacing the statistics of P{S} (which may be 
quite difficult to compute) with those of Q{$} (which may be much simpler). If, in 
addition, Z represents the likelihood of observed data (as is the case for the example 
of section 3), then the mean field approximation yields a lower bound on the log- 
likelihood. This can be seen by noting that for any approximating distribution 
Q{$}, we can form the lower bound: 
lnZ = lnEe-z{s} (3) 
{s} 
[e-E{s} 
= lnEQ{$}- [ Q{$} (4) 
{s} 
_ EQ{$}[-E{$}-lnQ{$}], (5) 
{s} 
where the last line follows from Jensen's inequality. The difference between the left 
and right-hand side of eq. (5) is exactly KL(QIIP); thus the better the approximation 
to P{$}, the tighter the bound on In Z. Once a lower bound is available, a learning 
procedure can maximize the lower bound. This is useful when the true likelihood 
itself cannot be efficiently computed. 
488 L.K. SAUL, M. I. JORDAN 
2.1 Complete Factorizability 
The simplest mean field theory involves assuming marginal independence for the 
units $i. Consider, for example, a quadratic energy function 
- = + 
i<j i 
and the factorized approximation: 
. (7) 
i 
The expectations under this mean field approximation are (Si) = rni and (SiSj) = 
mired for i  j. The best approximation of this form is found by minimizing the 
KL-divergence, 
KL(Q,,P) '-z[(l+mi)ln(l+mi) (1-mi) 
 2 2 +  In  (8) 
$ 
- Z Jijmirnj - Z himi + In Z, 
i<j i 
with respect to the mean field parameters mi. Setting the gradients of eq. (8) equal 
to zero, we obtain the (classical) mean field equations: 
tanh-X(mi) = Z Jijmj + hi. (9) 
2.2 Partial Factorizability 
We now consider a more structured model in which the network consists of interact- 
ing modules that, taken in isolation, define tractable substructures. One example 
of this would be a network of weakly coupled HMMs, in which each HMM, taken 
by itself, defines a chain-like substructure that supports efficient probabilistic cal- 
culations. We denote the interactions between these modules by parameters K?, 
where the superscripts/ and y range over modules and the subscripts i and j index 
units within modules. An appropriate energy function for this network is: 
- E{S} =  ... K..S.   
ij 
The first term in this energy function contains the intra-modular interactions; the 
lt term, the inter-modular ones. 
We now consider a mean field approximation that maintains the first sum over 
modules but dispenses with the inter-modular corrections: 
i 
The parameters of this mean field approximation are H; they will be chosen to 
provide a self-consistent model of the inter-modular interactions. We eily obtain 
the following expectations under the mean field approximation, where   : 
Exploiting Tractable Substructures in Intractable Networks 489 
Note that units in the same module are statistically correlated and that these cor- 
relations are assumed to be taken into account in calculating the expectations. We 
assume that an efficient algorithm is available for handling these intra-modular cor- 
relations. For example, if the factorized modules are chains (e.g. obtained from 
a coupled set of HMMs), then computing these expectations requires a forward- 
backward pass through each chain. 
The best approximation of the form, eq. (11), is found by minimizing the KL- 
divergence, 
KL(QIIP)=ln(Z/ZQ)+E(H-h)($I- EK'($$I, (14) 
3 
with respect to the mean field parameters H. To compute the appropriate gradi- 
ents, we use the fact that derivatives of expectations under a Boltzmann distribu- 
tion (e.g. O(Sf)/OH) yield cumulants (e.g. (SfS) -(Sf)(S)). The conditions 
for stationarity are then: 
O= (H-h)[(SS)- (S)(S)]- KJ  [(SSS)- (5S)(3)] . (15) 
Substituting the expectations from eqs. (12) and (13), we find that KL(QIIP) is 
minimized when 
O=VH-h-VKi(S)}[(S)-(S)(S) ]. (16) 
The resulting mean field equations are: 
v j 
These equations may be solved by iteration, in which the (assumed) tractable algo- 
rithms for averaging over Q{$} are invoked as subroutines to compute the expecta- 
tions ($f/on the right hand side. Because these expectations depend on HI, these 
equations may be viewed as a self-consistent model of the inter-modular interac- 
tions. Note that the mean field parameter H plays a role analogous to-tanh-X(mi) 
in eq. (9) of the fully factorized case. 
2.3 Inducing Partial Factorizability 
Many interesting networks do not have strictly modular architectures and can only 
be approximately decomposed into tractable core structures. Techniques are needed 
in such cases to induce partial factorizability. Suppose for example that we are given 
an energy function 
i<j i 
(18) 
for which the first two terms represent tractable interactions and the last term, 
intractable ones. Thus the weights Jij by themselves define a tractable skeleton 
network, but the weights Kij spoil this tractability. Mimicking the steps of the 
previous section, we obtain the mean field equations: 
i 
(19) 
490 L.K. SAUL, M. I. JORDAN 
In this case, however, the weights Kij couple units in the same core structure. Be- 
cause these units are not assumed to be independent, the triple correlator 
does not factorize, and we no longer obtain the decoupled update rules of eq. (17). 
Rather, for these mean field equations, each iteration requires computing triple 
correlators and solving a large set of coupled linear equations. 
To avoid this heavy computational load, we instead manipulate the energy function 
into one that can be partially factorized. This is done by introducing extra hidden 
variables Wij = 4-1 on the intractable links of the network. In particular, consider 
the energy function 
i<j i i<j 
(20) 
The hidden variables Wij in eq. (20) serve to decouple the units connected by 
the intractable weights Kij. However, we can always choose the new interactions, 
KJ ) and K ), so that 
e -tz{s} = y]. e -z{s'w}. (21) 
Eq. (21) states that the marginal distribution over {S} in the new network is iden- 
tical to the joint distribution over {S} in the original one. Summing both sides of 
eq. (21) over {S}, it follows that both networks have the same partition function. 
The form of the energy function in eq. (20) suggests the mean field approximation: 
Q{S, w} = zo i<j i i<j 
where the mean field parameters Hi have been augmented by a set of additional 
mean field parameters Hij that account for the extra hidden variables. In this 
expression, the variables $i and Wij act as decoupled degrees of freedom and the 
methods of the preceding section can be applied directly. We consider an example 
of this reduction in the following section. 
3 EXAMPLE 
Consider a continuous-output HMM in which the probability of an output t at 
time t is dependent not only on the state at time t, but also on the state at time 
t + A. Such a context-sensitive HMM may serve as a flexible model of anticipatory 
coarticulatory effects in speech, with A  50ms representing a mean phoneme 
lifetime. Incorporating these interactions into the basic HMM probability model, 
we obtain the following joint probability on states and outputs: 
T-1 T--A 
t=l t=l 
1 {1 
(2a.)D/2 exp - [2t- 
(23) 
Denoting the likelihood of an output sequence by Z, we have 
Z= P{2}= y]P{S, 2}. (24) 
{s} 
We can represent this probability model using energies rather than transition prob- 
abilities (Luttrell, 1989; Saul and Jordan, 1995). For the special case of binary 
Exploiting Tractable Substructures in Intractable Networks 491 
states, this is done by choosing weights J, K, and he related to the parameters of 
the HMM and the output sequence as follows 1' 
j _ 1 In 
4 
'a++a__ 
.a+_a_+. 
1(+ - _)  (17+ - 17_), 
he -- 
(25) 
v+ +v_ + + + - 
2 
Here, a++ is the probability of transitioning from the ON state to the ON state 
(and similarly for the other a parameters), while + and 1+ are the mean outputs 
associated with the ON state at time steps t and t + A (and similarly for U_ and 
1_). Given these definitions, we obtain an equivalent expression for the likelihood: 
Z =  exp --e0 +  Ztt+x +  htt +  Ktt+A , (27) 
{S} t=l t=l t=l 
where e0 is a placeholder for the terms in In P(S, } that do not depend on (S}. 
We can interpret Z  the partition function for the chained network of T binary 
units that represents the HMM unfolded in time. The nearest neighbor connec- 
tivity of this network reflects the first order structure of the HMM; the long-range 
connectivity reflects the higher order interactions that model sensitivity to context. 
The exact likelihood can in principle be computed by summing over the hidden 
states in eq. (27), but the required forward-backward algorithm scales much worse 
than the ce of first-order HMMs. Because the likelihood can be identified  a 
partition function, however, we can obtain a lower bound on its value from mean 
field theory. To exploit the tractable first order structure of the HMM, we induce a 
partially factorizable network by introducing extra link variables on the long-range 
connections,  described in section 2.3. The resulting mean field approximation 
uses the chained structure  its backbone and should be accurate if the higher 
order effects in the data are weak compared to the bic first-order structure. 
The above scenario w tested in numerical simulations. In actuality, we imp]e- 
mented a generalization of the model in eq. (23): our HMM had non-binary hidden 
states and a coarticulation model that incorporated both left and right context. 
This network w trained on several artificial data sets according to the following 
procedure. First, we fixed the "context" weights to zero and used the Bantu-Welch 
algorithm to estimate the first order structure of the HMM. Then, we lifted the 
zero constraints and re-estimated the parameters of the HMM by a mean field EM 
algorithm. In the E-step of this algorithm, the true posterior P(S]} w approx- 
imated by the distribution Q(sI} obtained by solving the mean field equations; 
in the M-step, the parameters of the HMM were updated to match the statistics of 
Q(S]}. Figure 1 shows the type of structure captured by a typical network. 
4 CONCLUSIONS 
Endowing networks with probabilistic semantics provides a unified framework for in- 
corporating prior knowledge, handling missing data, and performing inferences un- 
der uncertainty. Probabilistic calculations, however, can quickly become intractable, 
so it is important to develop techniques that both approximate probability distri- 
butions in a flexible manner and make use of exact techniques wherever possible. In 
1There are boundary corrections to ht (not shown) for t = I and t > T - A. 
492 L.K. SAUL, M. I. JOVAN 
15 
10 
-10 
-10 10 15 
-15    
-15 -5 0 5 20 
Figure 1' 2D output vectors {)t} sampled from a first-order HMM and a context- 
sensitive HMM, each with n = 5 hidden states. The latter's coarticulation model 
used left and right context, coupling )t to the hidden states at times t and t q- 5. 
At left: the five main clusters reveal the basic first-order structure. At right' weak 
modulations reveal the effects of context. 
this paper we have developed a mean field approximation that meets both these ob- 
jectives. As an example, we have applied our methods to context-sensitive HMMs, 
but the methods are general and can be applied more widely. 
Acknowledgement s 
The authors acknowledge support from NSF grant CDA-9404932, ONR grant 
N00014-94-1-0777, ATR Research Laboratories, and Siemens Corporation. 
References 
A. Dempster, N. Laird, and D. Rubin. (1977) Maximum likelihood from incomplete 
data via the EM algorithm. J. Roy. Star. Soc. B39:1-38. 
B. H. Juang and L. R. Rabiner. (1991) Hidden Markov models for speech recogni- 
tion, Technometrics 33: 251-272. 
S. Luttrell. (1989) The Gibbs machine applied to hidden Markov model problems. 
Royal Signals and Radar Establishment: SP Research Note 99. 
G. Parisi. (1988) Statistical field theory. Addison-Wesley: Redwood City, CA. 
C. Peterson and J. R. Anderson. (1987) A mean field theory learning algorithm for 
neural networks. Complex Systems 1:995-1019. 
L. Saul and M. Jordan. (1994) Learning in Boltzmann trees. Neural Comp. 6: 
1174-1184. 
L. Saul and M. Jordan. (1995) Boltzmann chains and hidden Markov models. 
In G. Tesauro, D. Touretzky, and T. Leen, eds. Advances in Neural Information 
Processing Systems 7. MIT Press: Cambridge, MA. 
P. Stolorz. (1994) Recursive approaches to the statistical physics of lattice proteins. 
In L. Hunter, ed. Proc. 27th Hawaii Intl. Conf. on System Sciences V: 316-325. 
C. Williams and G. E. Hinton. (1990) Mean field networks that learn to discriminate 
temporally distorted strings. Proc. Connectionist Models Summer School: 18-22. 
