Tractable Variational Structures for 
Approximating Graphical Models 
David Barber Wim Wiegerinck 
{davidb, wimw} mbfys. kun. nl 
RWCP* Theoretical Foundation SNN t University of Nijmegen 
6525 EZ Nijmegen, The Netherlands. 
Abstract 
Graphical models provide a broad probabilistic flamework with ap- 
plications in speech recognition (Hidden Markov Models), medical 
diagnosis (Belief networks) and artificial intelligence (Boltzmann 
Machines). However, the computing time is typically exponential 
in the number of nodes in the graph. Within the variational flame- 
work for approximating these models, we present two classes of dis- 
tributions, decimatable Boltzmann Machines and Tractable Belief 
Networks that go beyond the standard factorized approach. We 
give generalised mean-field equations for both these directed and 
undirected approximations. Simulation results on a small bench- 
mark problem suggest using these richer approximations compares 
favorably against others previously reported in the literature. 
I Introduction 
Graphical models provide a powerful framework for probabilistic inference[l] but 
suffer intractability when applied to large scale problems. Recently, variational ap- 
proximations have been popular [2, 3, 4, 5], and have the advantage of providing 
rigorous bounds on quantities of interest, such as the data likelihood, in contrast to 
other approximate procedures such as Monte Carlo methods[1]. One of the original 
models in the neural networks community, the Boltzmann machine (BM), belongs 
to the class of undirected graphical models. The lack of a suitable algorithm has 
hindered its application to larger problems. The deterministic BM algorithm[6], a 
variational procedure using a factorized approximating distribution, speeds up the 
learning of BMs, although the simplicity of this approximation can lead to unde- 
sirable effects[7]. Factorized approximations have also been successfully applied to 
sigmoid belief networks[4]. One approach to producing a more accurate approxi- 
mation is to go beyond the class of factorized approximating models by using, for 
example, mixtures of factorized models. However, it may be that very many mix- 
ture components are needed to obtain a significant improvement beyond using the 
factorized approximation[5]. In this paper, after describing the variational learn- 
*Real World Computing Paxtnership 
tFoundation for Neural Networks 
184 D. Barber and W. Wiegerinck 
ing flamework, we introduce two further classes of non-factorized approximations, 
one undirected (decimatable BMs in section (3)) and the other, directed (Tractable 
Belief Networks in section (4)). To demonstrate the potential benefits of these 
methods, we include results on a toy benchmark problem in section (5) and discuss 
their relation to other methods in section (6). 
2 ariational Learning 
We assume the existence of a graphical model P with known qualitative structure 
but for which the quantitative parameters of the structure remain to be learned from 
data. Given that the variables can be considered as either visible (V) or hidden 
(H), one approach to learning is to carry out maximum likelihood on the visible 
variables for each example in the dataset. Considering the KL divergence between 
the true distribution P(HIV ) and a distribution Q(H), 
Q(H) 
KL(Q(H),P(HIV)) =  Q(H)ln ):,(HiV) _> 0 
H 
and using P(HIV ) = P(H, V)/P(V) gives the bound 
lnP(V) > -  Q(H)lnQ(H) +  Q(H)lnP(H,V) (1) 
H H 
Betraying the connection to statistical physics, the first term is termed the "en- 
tropy" and the second the "energy". One typically chooses a variational distribu- 
tion Q so that the entropic term is "tractable". We assume that the energy E(Q) 
is similarly computable, perhaps with recourse to some extra variational bound (as 
in section (5)). By tractable, we mean that all necessary marginals and desired 
quantities are computationally feasible, regardless of the issue of the scaling of the 
computational effort with the graph size. Learning consists of two iterating steps: 
first optimize the bound (1) with respect to the parameters of Q, and then with 
respect to the parameters of P(H, V). We concentrate here on the first step. For 
clarity, we present our approach for the case of binary variables si  {0, 1}, i = 1..N. 
We now consider two classes of approximating distributions Q. 
3 Undirected Q: Decimatable Boltzmann Machines 
Boltzmann machines describe probability distributions parameterized by a sym- 
metric weight matrix J 
1 
Q(s) =  exp,  ----  Jij$isj '- s'Js (2) 
ij 
where the normalization constant, or "partition function" is Z = , exp . For 
convenience we term the diagonals of J the "biases", hi = Jii. Since In Z(J, h) is a 
generating function for the first and second order statistics of the variables s, the 
entropy is tractable provided that Z is tractable. For general connection structures, 
J, computing Z is intractable as it involves a sum over 2 v states; however, not all 
Boltzmann machines are intractable. A class of tractable structures is described by 
a set of so-called decimation rules in which nodes from the graph can be removed 
one by one, fig(I). Provided that appropriate local changes are made to the BM 
parameters, the partition function of the reduced graph remains unaltered (see eg 
[2]). For example, node c in fig(l) can be removed, provided that the weight matrix 
J and bias h are transformed, J  J', h  h', with J}c = Jc = h' = 0 and 
1 (1 + e he) (1 + e hc+2("'*+"'*)) 1 + enc+2aa' b.c 
Jb = Jab +  In (1 + e h+2Ja) (1 + e h+2&) ' hta/b = ha/b + In 1 + e hc (3) 
Tractable Variational Structures for Approximating Graphical Models 185 
Figure 1: A decimation rule for BMs. We can remove the upper node on the left so 
that the partition function of the reduced graph is the same. This requires a simple 
change in the parameters J, h coupling the two nodes on the right (see text). 
By repeatedly applying such rules, Z is calculable in time linear in N. 
3.1 Fixed point (Mean Field) Equations 
Using (2) in (1), the bound we wish to optimize with respect to the parameters 
0 = (J, h) of Q has the form ((.../ denotes averages with respect to Q) 
B(O) = - {qb) + In Z + E(O) 
(4) 
where E(O) is the energy. Differentiating (4) with respect to Jij (i  j) gives 
OB OE 
OJi = - E Fij,ktJkt + OJi 
kl 
(5) 
where Fij,kt = (sissst) - (sis) (sst) is the Fisher information matrix. A similar 
expression holds for the bias parameters, h, so that we can form a linear fixed point 
equation in the total parameter set 0 where the derivatives of the bound vanish. 
This suggests the iterative solution, 0 'ew = F-XVof where the right hand side is 
evaluated at the current parameter values, 0 ta. 
4 Directed Q: Tractable Belief Networks 
Belief networks are products of conditional probability distributions, 
Q(H) = IX Q(Hilri) (6) 
iH 
in which ri denotes the parents of node i (see for example, [1]). The efficiency 
of computation depends on the underlying graphical structure of the model and is 
exponential in the maximal clique size (of the moralized triangulated graph [1]). We 
now assume that our model class consists of belief networks with a fixed, tractable 
graphical structure. The entropy can then be computed efficiently since it decouples 
into a sum of averaged entropies per site i (Q(z'i) -= 1 if ri = b), 
H iH rrl H 
(7) 
Note that the conditional entropy at each site i is trivial to compute since the values 
required can be read off directly from the definition of Q (6). By assumption, the 
marginals Q(z'i) are tractable, and can be found by standard methods, for example 
using the Junction Tree Algorithm[I]. 
To optimize the bound (1), we parameterize Q via its conditional probabilities, 
qi(wi) = Q(Hi = llwi). The remaining probability (2(Hi = 01wi) follows from 
186 D. Barber and . Wiegerinck 
normalization. We therefore have a set {qi(7ri)171'i -- (0... 0),... , (1... 1)) of vari- 
ational parameters for each node in the graph. Setting the gradient of the bound 
with respect to the qi(7ri)'S equal to zero yields the equations 
+ 
qi(7ri) -- or Q(Tri) 
(8) 
with 
Li,, = - y [X7i,,Q(wj)] y Q(HjIwj)In Q(HjIwy ) 
j rj Hj 
(9) 
where a (z) = 1/(1 + e-Z). The gradient Vi,, is with respect to qi(wi). The 
explicit evaluation of the gradients can be performed efficiently, since all that need 
to be differentiated are at most scalar functions of quantities that depend again 
only linearly on the parameters qi(ri). To optimize the bound, we iterate (8) till 
convergence, analogous to using factorized models[4]. However, the more powerful 
class of approximating distributions described by belief networks should enable a 
much tighter bound on the likelihood of the visible units. 
5 Application to Sigmoid Belief Networks 
We now describe an application of these non-factorized approximations to a par- 
ticular class of directed graphical models, sigmoid belief networks[8] for which the 
conditional distributions have the form 
P(si=l[wi)--cr (Wijsjq-ki) 
J 
(10) 
Wij - 0 if j ( ri. The joint distribution then has the form 
P(H,V) = Hexp[zisi - ln(1 + ea)] (11) 
i 
where zi = Y'}j Wijsj + ki. In (11) it is to be understood that the visible units are 
set to their observed values. In the lower bound (1), unfortunately, the average of 
In P(H, V) is not tractable, since (ln [1 + eZ]) does not decouple into a polynomial 
number of single site averages. Following [4] we use therefore the bound 
{ln [1 + eZ]) < {z)+ In (e -ez + e (-e) } 
(12) 
where  is a variational parameter in [0, 1]. We can then define the energy function 
E(Q, ) =  Wij (sisj)+  [zi (si)- i kiwi- i In (e -i + e(-')i)(13 ) 
ij i ' ' 
where /i = ki - Yj jWji. Expect for the final term, the energy is a function of 
first or second order statistics of the variables. For using a BM as the variational 
distribution, the final terms of (13) (e -'' ) = Y-H e-'z'/Z are simply the ratio of 
two partition functions, with the one in the numerator having a shifted bias. This 
is therefore tractable, provided that we use a tractable BM Q. 
Similarly, if we are using a Belief Network as the variational distribution, all but the 
last term in (13) is trivially tractable, provided that Q is tractable. We write the 
terms (e -'') = e -h' Y}.I-t R(H), where R(H) = Hj R(Hj]wj) and R(Hj[wj) ------ 
Tractable Variational Structures for Approximating Graphical Models 187 
(a) Directed graph 
toy problem. Hidden 
units are black 
 0  0.02 0.04 
(b) Decimatable BM - 25 parame- 
ters, mean: 0.0020. 
    o 0.02 0.04 O--'-O---O---O o 0.02 
(c) disconnected ('standard mean (d) chain - 19 parameters, 
field') - 16 parameters, mean: 0.01529. Max. clique size: 2 
0.01571. Max. clique size: 1 
mean: 
  o 0.02 0.04   o 0.02 
0.04 
(e) trees- 20 parameters, mean: 
0.0089. Max. clique size: 2 
(f) network - 28 parameters, mean: 
0.00183. Max. clique size: 3 
Figure 2: (a) Sigmoid Belief Network for which we approximate In P(V). (b): BM approx- 
imation. (c,d,e,f): Structures of the directed approximations on H. For each structure, 
histograms of the relative error between the true log likelihood and the lower bound is 
plotted. The horizontal scale has been fixed to [0,0.05] in all plots. The maximum clique 
size refers to the complexity of computation for each approximation, which is exponential 
in this quantity. The number of parameters includes the vector . 
Q(HjIy) exp(-iJijHj). R and Q have the same graphical structure and we can 
therefore use message propagation techniques again to compute (e -z' ). 
To test our methods numerically, we generated 500 networks with parameters 
{Wij, kj} drawn randomly from the uniform distribution over [-1, 1]. The lower 
bounds v for several approximating structures are compared with the true log 
likelihood, using the relative error  = .T'v/lnP(V) - 1, fig. 2. These show that 
considerable improvements can be obtained when non-factorized variational dis- 
tributions are used. Note that a 5 component mixture model ( 80 variational 
parameters) yields  = 0.01139 on this problem [5]  These results suggest there- 
fore that exploiting knowledge of the graphical structure of the model is useful. For 
instance, the chain (fig. 2(b)) with no graphical overlap with the original graph 
shows hardly any improvement over the standard mean field approximation. On 
the other hand, the tree model (fig. 2(c)), which has about the same number of 
parameters, but a larger overlap with the original graph, does improve considerably 
over the mean field approximation (and even over the 5 component mixture model). 
By increasing the overlap, as in fig. 2(d), the improvement gained is even greater. 
1In which Q = y'}j AjQ(Hll j) where Q(HII ) = 1-Ii I ain' (1 - lai) -"' and y.j Xj = 1. 
188 D. Barber and W. Wiegerinck 
6 Discussion 
In this section, we briefly explain the relationship of the introduced methods to 
other, "non-factorized" methods in the literature, namely node-elimination[9] and 
substructure variation[10]. 
6.1 Graph Partitioning and Node Elimination 
A further class of approximating distributions Q that could be considered are those 
in which the nodes can be partitioned into clusters, with independencies between 
the clusters. For expositional clarity, consider two partitions, s = (s,s2), and 
define Q to be factorized over these partitions 2, Q = Q(s)Q(s). Using this Q 
in (1), we obtain (with obvious notational simplifications) 
lnP(V) > -(lnQ])l -(lnQ2)2 + (lnP), 2 (14) 
A functional derivative with respect to Q and 12 gives the optimal forms: 
Q1 = exp (lnP)2/Z Q2 = exp (lnP)/z2 (15) 
If we substitute this form for 12 in (14) and use Z2 = Y exp (ln P), we obtain 
In P(V) > - (lnQ) + In  exp (lnP)x (16) 
In general, the final term may not have a simple form. In the case of approximating 
a BM P, lnP = s .JSl + 2Sl 'J1252 q- $2'J2252 - lnZv. Used in (16), we get: 
in P(V) > - (lnQ) -lnZp + (Sl.Jls) + In  exp(s2.J22s2 + 2s2.J2(s)) 
(17) 
so that the final term of (17) is the normalizing constant of a BM with connection 
matrix J and whose diagonals are shifted by J2 (s). One can therefore identify a 
set of nodes s which, when eliminated, reveal a tractable structure on the nodes s. 
The nodes that were removed are compensated for by using a variational distribution 
Qx(s). If P is a BM, then the optimal Q has its weights fixed to those of P 
restricted to variables s, but with variable biases shifted by J, (s). Restricting 
Q to factorized models, we recover the node elimination bound [9] which can 
readily be improved by considering non-factorized distributions Q (for example 
those introduced in this paper), see fig(3). Note, however, that there is no a- 
priori guarantee that using such partitioned approximations will lead to a better 
approximation than that obtained from a tractable variational distribution defined 
on the whole graph, but which does not have such a product form. Using a product 
of conditional distributions over clusters of nodes is developed more fully in [11]. 
6.2 Substructure Variation 
The process of using a Q defined on the whole graph but for which only a subset of 
the connections are adaptive is termed substructure variation [10]. In the context of 
BMs, Saul et al [2] identified weights in the original intractable distribution P that, 
if set to zero, would lead to a tractable graph Q(s) = P(slh , J, Jintractabte -- 0). To 
compensate for these removed weights they allowed the biases in Q to vary such 
that the KL divergence between Q and P is minimized. In general, this is a weaker 
method than one in which potentially all the parameters in the approximating 
network are adaptive, such as using a decimatable BM. 
2In the case of fully connected BMs, for computing with a Q which is the product of 
K partitions (each of which is fully connected say), the computing time reduces from 2 v 
for the "intractable" P to K2 v/K for Q, which can be a considerable reduction. 
Tractable Variational Structures for Approximating Graphical Models 189 
o o o 
o o 
(a) Intractable Model (b) "Naive" mean field (c) Node elimination (d) Partioning 
Figure 3: (a) A non-decimatable 5 node BM. (b) The standard factorized approxi- 
mation. (c) Node Elimination (d) Partitioning, where a richer distribution is con- 
sidered on the eliminated nodes. A solid line denotes a weight fixed to those in the 
original graph. A solid node is fixed, and an open node represents a variable bias. 
7 Conclusion 
Finding accurate, controllable approximations of graphical models is crucial if their 
application to large scale problems is to be realised. We have elucidated two general 
classes of tractable approximations, both based on the Kullback-Leibler divergence. 
Future interesting directions include extending the class of distributions to higher 
order Boltzmann Machines (for which the class of decimation rules is greater), and to 
mixtures of these approaches. Higher order perturbative approaches are considered 
in [12]. These techniques therefore facilitate the approximating power of tractable 
models which can lead to a considerable improvement in performance. 
References
[1] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert Systems and Probabilistic Network 
Models. Springer, 1997. 
[2] L. K. Saul and M. I. Jordan. Boltzmann Chains and Hidden Markov Models. In 
G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information 
Processing Systems, pages 435-442. MIT Press, 1995. NIPS 7. 
[3] T. Jaakkola. Variational Methods for Inference and Estimation in Graphical Models. 
PhD thesis, Massachusetts Institute of Technology, 1997. 
[4] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean Field Theory for Sigmoid Belief 
Networks. Journal of Artificial Intelligence Research, 4:61-76, 1996. 
[5] C.M. Bishop, N. Lawrence, T. Jaakkola, and M. I. Jordan. Approximating Posterior 
Distributions in Belief Networks using Mixtures. MIT Press, 1998. NIPS 10. 
[6] C. Peterson and J. R. Anderson. A Mean Field Theory Learning Algorithm for Neural 
Networks. Complex Systems, 1:995-1019, 1987. 
[7] Conrad C. Galland. The limitations of deterministic Boltzmann machine learning. 
Network: Computation in Neural Systems, 4:355-379, 1993. 
[8] R. Neal. Connectionist learning of Belief Networks. Artificial Intelligence, 56:71-113, 
1992. 
[9] T. S. Jaakkola and M. I. Jordan. Recursive Algorithms for Approximating Probabil- 
ities in Graphical Models. MIT Press, 1996. NIPS 9. 
[10] L. K. Saul and M. I. Jordan. Exploiting Tractable Substructures in Intractable Net- 
works. MIT Press, 1996. NIPS 8. 
[11] W. Wiegerinck and D. Barber. Mean Field Theory based on Belief Networks for 
Approximate Inference. 1998. ICANN 98. 
[12] D. Barber and P. van de Laar. Variational Cumulant Expansions for Intractable 
Distributions. Journal of Artificial Intelligence Research, 1998. Accepted. 
