Triangulation by Continuous Embedding 
Marina Meilfi and Michael I. Jordan 
{mmp, jordan}@ai.mit.edu 
Center for Biological & Computational Learning 
Massachusetts Institute of Technology 
45 Carleton St. E25-201 
Cambridge, MA 02142 
Abstract 
When triangulating a belief network we aim to obtain a junction 
tree of minimum state space. According to (Rose, 1970), searching 
for the optimal triangulation can be cast as a search over all the 
permutations of the graph's vertices. Our approach is to embed 
the discrete set of permutations in a convex continuous domain D. 
By suitably extending the cost function over D and solving the 
continous nonlinear optimization task we hope to obtain a good 
triangulation with respect to the aformentioned cost. This paper 
presents two ways of embedding the triangulation problem into 
continuous domain and shows that they perform well compared to 
the best known heuristic. 
1 INTRODUCTION. WHAT IS TRIANGULATION ? 
Belief networks are graphical representations of probability distributions over a set 
of variables. In what follows it will be always assumed that the variables take 
values in a finite set and that they correspond to the vertices of a graph. The 
graph's arcs will represent the dependencies among variables. There are two kinds of 
representations that have gained wide use: one is the directed acyclic graph model, 
also called a Bayes net, which represents the joint distribution as a product of the 
probabilities of each vertex conditioned on the values of its parents; the other is the 
undirected graph model, also called a Markov field, where the joint distribution is 
factorized over the cliques  of an undirected graph. This factorization is called a 
junction tree and optimizing it is the subject of the present paper. The power of both 
models lies in their ability to display and exploit existent marginal and conditional 
independencies among subsets of variables. Emphasizing independencies is useful 
A clique is a fully connected set of vertices and a maximal clique is a clique that is 
not contained in any other clique. 
558 M. Meil and M. I. Jordan 
from both a qualitative point of view (it reveals something about the domain under 
study) and a quantitative one (it makes computations tractable). The two models 
differ in the kinds of independencies they are able to represent and often times 
in their naturalness in particular tasks. Directed graphs are more convenient for 
learning a model from data; on the other hand, the clique structure of undirected 
graphs organizes the information in a way that makes it immediately available to 
inference algorithms. Therefore it is a standard procedure to construct the model 
of a domain as a Bayes net and then to convert it to a Markov field for the purpose 
of querying it. 
This process is known as decomposition and it consists of the following stages: 
first, the directed graph is transformed into an undirected graph by an operation 
called moralization. Second, the moralized graph is triangulated. A graph is called 
triangulated if any cycle of length > 3 has a chord (i.e. an edge connecting two 
nonconsecutive vertices). If a graph is not triangulated it is always possible to add 
new edges so that the resulting graph is triangulated. We shall call this procedure 
triangulation and the added edges the fill-in. In the final stage, the junction tree 
(Kjaerulff, 1991) is constructed from the maximal cliques of the triangulated graph. 
We define the state space of a clique to be the cartesian product of the state spaces 
of the variables associated to the vertices in the clique and we call weight of the 
clique the size of this state space. The weight of the junction tree is the sum of the 
weights of its component cliques. All further exact inference in the net takes place 
in the junction tree representation. The number of computations required by an 
inference operation is proportional to the weight of the tree. 
For each graph there are several and usually a large number of possible triangu- 
lations, with widely varying state space sizes. Moreover, triangulation is the only 
stage where the cost of inference can be influenced. It is therefore critical that the 
triangulation procedure produces a graph that is optimal or at least "good" in this 
respect. 
Unfortunately, this is a hard problem. No optimal triangulation algorithm is known 
to date. However, a number of heuristic algorithms like mazimum cardinality search 
(Tarjan and Yannakakis, 1984), lezicographic search (Rose et at., 1976) and the 
minimum weight heuristic (MW) (Kjaerulff, 1990) are known. An optimization 
method based on simulated annealing which performs better than the heuristics 
on large graphs has been proposed in (Kjaerulff, 1991) and recently a "divide and 
conquer" algorithm which bounds the maximum clique size of the triangulated graph 
has been published (Becker and Geiger, 1996). All but the last algorithm are based 
on R,ose's (Rose, 1970) elimination procedure: choose a node v of the graph, connect 
all its neighbors to form a clique, then eliminate v and all the edges incident to it 
and proceed recursively. The resulting filled-in graph is triangulated. 
It can be proven that the optimal triangulation can always be obtained by applying 
Pose's elimination procedure with an appropriate ordering of the nodes. It follows 
then that searching for an optimal triangulation can be cast as a search in the space 
of all node permutations. The idea of the present work is the following: embed 
the discrete search space of permutations of n objects (where n is the number of 
vertices) into a suitably chosen continuous space. Then extend the cost to a smooth 
function over the continuous domain and thus transform the discrete optimization 
problem into a continuous nonlinear optimization task. This allows one to take 
advantage of the thesaurus of optimization methods that exist for continuous cost 
functions.The rest of the paper will present this procedure in the following sequence: 
the next section introduces and discusses the objective function; section 3 states 
the continuous version of the problem; section 4 discusses further aspects of the 
optimization procedure and presents experimental results and section 5 concludes 
Triangulation by Continuous Embedding 
the paper. 
559 
2 THE OBJECTIVE 
In this section we introduce the objective function that we used and we discuss its 
relationship to the junction tree weight. First, some notation. Let G = (V, E) be a 
graph, its vertex set and its edge set respectively. Denote by n the cardinality of the 
vertex set, by r,the number of values of the (discrete) variable associated to vertex 
v G V, by # the elimination ordering of the nodes, such that #v = i means that 
node v is the i-th node to be eliminated according to ordering , by n(v) the set of 
neighbors of v G V in the triangulated graph and by C,= {v} U {u 6 n(v) I #u > 
#v}. 2 Then, a result in (Golumbic, 1980) allows us to express the total weight of 
the junction tree obtained with elimination ordering # as 
J#)- yismax(C)II r (1) 
v6V u6C 
where ismax(C, is a variable which is 1 when C,is a maximal clique and 0 oth- 
erwise. As stated, this is the objective of interest for belief net triangulation. Any 
reference to optimality henceforth will be made with respect to J*. 
This result implies that there are no more than n maximal cliques in a junction tree 
and provides a method to enumerate them. This suggests defining a cost function 
that we call the raw weight J as the sum over all the cliques C,(thus possibly 
including some non-maximal cliques): 
J(#) = Y II r 
(2) 
J is the cost function that will be used throughout this paper. A reason to use 
it instead of J* in our algorithm is that the former is easier to compute and to 
approximate. How to do this will be the object of the next section. But it is 
natural to ask first how well do the two agree? 
Obviously, J is an upper bound for J*. Moreover, it can be proved that if r = min r 
J#) _< J(#)_< J#)r(1-) 
and therefore that J is less than a fraction 1/(r- 1) away from J*. The upper 
bound is attained when the triangulated graph is fully connected and all r are 
equal. 
In other words, the differece between J and J* is largest for the highest cost tri- 
angulation. We also expect this difference to be low for the low cost triangulation. 
An intuitive argument for this is that good triangulations are associated with a 
large number of smaller cliques rather than with a few large ones. But the former 
situation means that there will be only a small number of small size non-maximal 
cliques to contribute to the difference J- J*, and therefore that the agreement with 
J* is usually closer than (3) implies. This conclusion is supported by simulations 
(Meil and Jordan, 1997). 
2Both n(v) asad C depend on # but we chose not to emphasize this in the notation 
for the sake of readability. 
560 M. Meil and M. I. Jordan 
3 THE CONTINUOUS OPTIMIZATION PROBLEM 
This section shows two ways of defining J over continuous domains. Both rely on 
a formulation of J that eliminates explicit reference to the cliques C; we describe 
this formulation here. 
Let us first define new variables !auv and eu u, v = 1, .., n. For any permutation 
1 if u < v { 
/u = 0 otherw-se eu = 
1 if the edge (u,v)  Et0 F# 
0 otherwise 
where F# is the set of fill-in edges. 
In other words, / represent precedence relationships and e represent the edges 
between the n vertices. Therefore, they will be called precedence variables and edge 
variables respectively. With these variables, J can be expressed as 
vV uV 
(4) 
In (4), the product/ae acts as an indicator variable being 1 iff "u 6 C" is true. 
For any given permutation, finding the tt variables is straightforward. Computing 
the edge variables is possible thanks to a result in (Rose et al., 1976). It states that 
an edge (u, v) is contained in F# iff there is a path in G between u and v containing 
only nodes w for which w < min(u, v). Formally, euv = e = 1 iff there exists 
a path P = (u, Wl, we,...v) such that 
wiSP 
So far, we have succeeded in defining the cost d associated with any permutation 
in terms of the variables/ and e. In the following, the set of permutations will be 
embedded in a continuous domain. As a consequence, / and e will take values in 
the interval [0, 1] but the form of d in (4) will stay the same. 
The first method, called /-continuous embedding (/-CE) assumes that the vari- 
ables /u, [0, 1] represent independent probabilities that u < v. For any 
permutation, the precedence variables have to satisfy the transitivity condition. 
Transitivity means that if u < v and v < w, then u < w, or, that for 
any triple (/,/o,/ou) the assignments (0, 0, 0) and (1, 1, 1) are forbidden. Ac- 
cording to the probabilistic interpretation of/ we introduce a term that penalizes 
the probability of a transitivity violation: 
R(!a) - E P[(u, v, w) nontransitive] (5) 
u<v<w 
= + (1 - - uw)(1 - 
u<v<w 
>_ P[assignment non transitive] 
(6) 
(7) 
In the second approach, called O-continuous embedding (O-CE), the permutations 
are directly embedded into the set of doubly stochastic matrices. A doubly stochastic 
matrix 0 is a matrix for which the elements in a row or column sum to one. 
Z Oij = Z Oij = 10ij_>O for i,j = l, ..n. (8) 
Triangulation by Continuous Embedding 561 
When Oij are either 0 or 1, implying that there is exactly one nonzero element 
in each row or column, the matrix is called a permutation matriz. Oij = 1 and 
#i = j both mean that the position of object i is j in the given permutation. The 
set of doubly stochastic matrices O is a convex polytope of dimension (n - 1) 2 
whose extreme points are the permutation matrices (Balinski and Russakoff, 1974). 
Thus, every doubly stochastic matrix can be represented as a convex combination 
of permutation matrices. To constrain the optimum to be a an extreme point, we 
use the penalty term 
R(O) = Zoij(1 - 00) (9) 
The precedence variables are defined over O as 
_ I <i 
ll = 1 
u, v, i, j = 1, ..n 
and v :fi u 
Now, for both embeddings, the edge variables can be computed from p as follows 
max HwP wu] lwv 
P{paths u-*v} 
for (u,v) 6 E or u = v 
otherwise 
The above assignments give the correct values for p and e for any point representing 
a permutation. Over the interior of the domain, e is a continuous, piecewise differen- 
tiable function. Each eu, (u, v) E can be computed by a shortest path algorithm 
between u and v, with the length of (wl, w2) G E defined as (- 
0-CE is an interior point method whereas in/-CE the current point, although inside 
[0, 1] n(n-1)/2, isn't necessarily in the convex hull of the hypercube's corners that 
represent permutations. The number of operation required for one evaluation of J 
and its gradient is as follows: O(n 4) operations to compute p from 0, O(n 3 log n) to 
compute e O(n a) for 0J and O(n ) for oJ and oa afterwards. Since computing/ is 
the most computationally intensive step,/-CE is a clear win in terms of computation 
cost. In addition, by operating directly in the/ domain, one level of approximation 
is eliminated, which makes one expect /-CE to perform better than 0-CE. The 
results in the following section will confirm this. 
4 EXPERIMENTAL RESULTS 
To assess the performance of our algorithms we compared their results with the 
results of the minimum weight heuristic (MW), the heuristic that scored best in 
empirical tests (Kjaerulff, 1990). The lowest junction tree weight obtained in 200 
runs of MW was retained and denoted by Jw' Tests were run on 6 graphs of 
different sizes and densities: 
graph h9 h12 d10 m20 a20 d20 
n - IVI 9 12 10 20 20 20 
density .33 .25 .6 .25 .45 .6 
rmin/rmax/ravg 2/2/2/ 3/3/3 6/15/10 2/8/5 6/15/10 6/15/10 
lgl0 J4W 2.43 2.71 7.44 5.47 12.75 13.94 
The last row of the table shows the log10 Jw. We ran 11 or more trials of each 
of our two algorithms on each graph. To enforce the variables to converge to a 
permutation, we minimized the objective J + AR, where A > 0 is a parameter 
562 M. Meil and M. I. Jordan 
lOO 
30 
10 
3 
1 
0.3; 
20 
10 
h9 h12 d10 m20 a20 d20 h9 h12 d10 m20 a20 d20 
a b 
Figure 1: Minimum, maximum (solid line) and median (dashed line) values of Jg, w 
obtained by 0-CE (a) and tt-CE (b). 
that was progressively increased following a deterministic annealing schedule and 
R is one of the aforementioned penalty terms. The algorithms were run for 50- 
150 optimization cycles, usually enough to reach convergence. However, for the 
it-embedding on graph d20, there were several cases where many tt values did not 
converge to 0 or 1. In those cases we picked the most plausible permutation to be 
the answer. 
The results are shown in figure 1 in terms of the ratio of the true cost obtained by 
the continuous embedding algorithm (denoted by J*) and Jw. For the first two 
graphs, h9 and h12, Juw is the optimal cost; the embedding algorithms reach it 
most trials. On the remaining graphs, tt-CE clearly outperforms 0-CE, which also 
performs poorer than MW on average. On all0, a20 and m20 it also outperforms 
the MW heuristic, attaining junction tree weights that are 1.6 to 5 times lower 
on average than those obtained by MW. On d20, a denser graph, the results are 
similar for MW and tt-CE in half of the cases and worse for tt-CE otherwise. The 
plots also show that the variability of the results is much larger for CE than for 
MW. This behaviour is not surprising, given that the search space for CE, although 
continuous, comprises a large number of local minima. This induces dependence on 
the initial point and, as a consequence, nondeterministic behaviour of the algorithm. 
Moreover, while the number of choices that MW has is much lower than the upper 
limit of n!, the "choices" that CE algorithms consider, although soft, span the space 
of all possible permutations. 
5 CONCLUSION 
The idea of continuous embedding is not new in the field of applied mathematics. 
The large body of literature dealing with smooth (sygmoidal) functions instead 
of hard nonlinearities (step functions) is only one example. The present paper 
shows a nontrivial way of applying a similar treatment to a new problem in a new 
field. The results obtained by -embedding are on average better than the standard 
MW heuristic. Although not directly comparable, the best results reported on 
triangulation (Kjaerulff, 1991; Becker and Geiger, 1996) are only by little better 
than ours. Therefore the significance of the latter goes beyond the scope of the 
present problem. They are obtained on a hard problem, whose cost function has no 
feature to ease its minimization (J is neither linear, nor quadratic, nor is it additive 
Triangulation by Continuous Embedding 563 
w.r.t. the vertices or the edges) and therefore they demonstrate the potential of 
continuous embedding as a general tool. 
ColaterMly, we have introduced the cost function J, which is directly amenable 
to continuous approximations and is in good agreement with the true cost J*. 
Since minimizing J may not be NP-hard, this opens a way for investigating new 
triangulation methods. 
Acknowledgements 
The authors are grateful to Tommi Jaakkola for many discussions and to Ellie 
Bonsaint for her invaluable help in typing the paper. 
References 
Balinski, M. and Russakoff, R. (1974). On the assignment polytope. SIAM Rev. 
Becker, A. and Geiger, D. (1996). A sufficiently fast algorithm for finding close to 
optimal junction trees. In UAI 96 Proceedings. 
Golumbic, M. (1980). Algorithmic Graph Theory and Perfect Graphs. Academic 
Press, New York. 
Kjmrulff, U. (1990). Triangulation of graphs-algorithms giving small total state 
space. Technical Report R 90-09, Department of Mathematics and Computer 
Science, Aalborg University, Denmark. 
Kjmrulff, U. (1991). Optimal decomposition of probabilistic networks by simulated 
annealing. Statistics and Computing. 
Meil&, M. and Jordan, M. I. (1997). An objective function for belief net triangula- 
tion. In Madigan, D., editor, AI and Statistics, number 7. (to appear). 
Rose, D. J. (1970). Triangulated graphs and the elimination process. Journal of 
Mathematical Analysis and Applications. 
Rose, D. J., Tarjan, R. E., and Lueker, E. (1976). Algorithmic aspects of vertex 
elimination on graphs. SIAM J. Comput. 
Tarjan, R. and Yannakakis, M. (1984). Simple linear-time algorithms to test chordal- 
ity of graphs, test acyclicity of hypergraphs, and select reduced acyclic hyper- 
graphs. SIAM J. Cornput. 
