Inference in Multilayer Networks via 
Large Deviation Bounds 
Michael Kearns and Lawrence Saul 
AT&T Labs -- Research 
Shannon Laboratory 
180 Park Avenue A-235 
Florham Park, NJ 07932 
{mkearns, lsaul}@research.att.com
Abstract 
We study probabilistic inference in large, layered Bayesian net- 
works represented as directed acyclic graphs. We show that the 
intractability of exact inference in such networks does not preclude 
their effective use. We give algorithms for approximate probabilis- 
tic inference that exploit averaging phenomena occurring at nodes 
with large numbers of parents. We show that these algorithms 
compute rigorous lower and upper bounds on marginal probabili- 
ties of interest, prove that these bounds become exact in the limit 
of large networks, and provide rates of convergence. 
1 Introduction
The promise of neural computation lies in exploiting the information processing 
abilities of simple computing elements organized into large networks. Arguably one
of the most important types of information processing is the capacity for proba- 
bilistic reasoning. 
The properties of undirected probabilistic models represented as symmetric networks 
have been studied extensively using methods from statistical mechanics (Hertz et 
al, 1991). Detailed analyses of these models are possible by exploiting averaging 
phenomena that occur in the thermodynamic limit of large networks. 
In this paper, we analyze the limit of large, multilayer networks for probabilistic 
models represented as directed acyclic graphs. These models are known as Bayesian 
networks (Pearl, 1988; Neal, 1992), and they have different probabilistic semantics 
than symmetric neural networks (such as Hopfield models or Boltzmann machines). 
We show that the intractability of exact inference in multilayer Bayesian networks 
does not preclude their effective use. Our work builds on earlier studies of varia- 
tional methods (Jordan et al, 1997). We give algorithms for approximate proba- 
bilistic inference that exploit averaging phenomena occurring at nodes with N >> 1 
parents. We show that these algorithms compute rigorous lower and upper bounds 
on marginal probabilities of interest, prove that these bounds become exact in the 
limit $N \to \infty$, and provide rates of convergence.
2 Definitions and Preliminaries 
A Bayesian network is a directed graphical probabilistic model, in which the nodes 
represent random variables, and the links represent causal dependencies. The joint 
distribution of this model is obtained by composing the local conditional probability 
distributions (or tables), Pr[child | parents], specified at each node in the network.
For networks of binary random variables, so-called transfer functions provide a 
convenient way to parameterize conditional probability tables (CPTs). A transfer 
function is a mapping $f: [-\infty, \infty] \to [0, 1]$ that is everywhere differentiable and
satisfies $f'(x) \ge 0$ for all x (thus, f is nondecreasing). If $f'(x) \le \alpha$ for all x, we say
that f has slope $\alpha$. Common examples of transfer functions of bounded slope include
the sigmoid $f(x) = 1/(1 + e^{-x})$, the cumulative gaussian $f(x) = \int_{-\infty}^{x} dt\, e^{-t^2}/\sqrt{\pi}$,
and the noisy-OR $f(x) = 1 - e^{-x}$. Because the value of a transfer function f
is bounded between 0 and 1, it can be interpreted as the conditional probability 
that a binary random variable takes on a particular value. One use of transfer 
functions is to endow multilayer networks of soft-thresholding computing elements 
with probabilistic semantics. This motivates the following definition: 
Definition 1 For a transfer function f, a layered probabilistic f-network has:
• Nodes representing binary variables $\{X_i^\ell\}$, $\ell = 1, \ldots, L$ and $i = 1, \ldots, N$.
Thus, L is the number of layers, and each layer contains N nodes.
• For every pair of nodes $X_j^{\ell-1}$ and $X_i^\ell$ in adjacent layers, a real-valued weight
$\theta_{ij}^{\ell-1}$ from $X_j^{\ell-1}$ to $X_i^\ell$.
• For every node $X_i^1$ in the first layer, a bias $p_i$.
We will sometimes refer to nodes in layer 1 as inputs, and to nodes in layer L as
outputs. A layered probabilistic f-network defines a joint probability distribution
over all of the variables $\{X_i^\ell\}$ as follows: each input node $X_i^1$ is independently set
to 1 with probability $p_i$, and to 0 with probability $1 - p_i$. Inductively, given binary
values $X_j^{\ell-1} = x_j^{\ell-1} \in \{0, 1\}$ for all of the nodes in layer $\ell - 1$, the node $X_i^\ell$ is set
to 1 with probability $f(\sum_{j=1}^N \theta_{ij}^{\ell-1} x_j^{\ell-1})$.
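The generative process in Definition 1 can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names, the nested-list weight layout `weights[l][i][j]`, and the choice of the sigmoid as f are ours and do not come from the paper.

```python
import math
import random


def sigmoid(x):
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_network(biases, weights, f=sigmoid, rng=random):
    """Draw one joint sample from a layered probabilistic f-network.

    biases  : list of N input biases p_i (layer 1).
    weights : list of L-1 matrices (hypothetical layout); weights[l][i][j]
              is the weight from node j in one layer to node i in the next.
    Returns a list of L layers, each a list of N binary values.
    """
    # Layer 1: each input is set to 1 independently with probability p_i.
    layer = [1 if rng.random() < p else 0 for p in biases]
    layers = [layer]
    for theta in weights:
        # Weighted sums entering the next layer, given the sampled parents.
        sums = [sum(t * x for t, x in zip(row, layer)) for row in theta]
        # Each child is set to 1 with probability f(weighted sum).
        layer = [1 if rng.random() < f(s) else 0 for s in sums]
        layers.append(layer)
    return layers
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible. The plain sigmoid above overflows for very large negative arguments, which the scaling assumption on the weights introduced in Section 3 would normally rule out.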
Among other uses, multilayer networks of this form have been studied as hierarchical
generative models of sensory data (Hinton et al, 1995). In such applications,
the fundamental computational problem (known as inference) is that of estimating
the marginal probability of evidence at some number of output nodes, say the first
$K \le N$. (The computation of conditional probabilities, such as diagnostic queries,
can be reduced to marginals via Bayes rule.) More precisely, one wishes to estimate
$\Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$ (where $x_i \in \{0, 1\}$), a quantity whose exact computation
involves an exponential sum over all the possible settings of the uninstantiated
nodes in layers 1 through L - 1, and is known to be computationally intractable 
(Cooper, 1990). 
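To make the exponential sum concrete, here is a brute-force sketch that enumerates every setting of layers 1 through L - 1. It is usable only for tiny networks, and serves as a reference point rather than as a method proposed in the paper. The function name and the `weights[l][i][j]` layout are assumptions made for illustration.

```python
import itertools
import math


def sigmoid(x):
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def exact_marginal(biases, weights, evidence, f=sigmoid):
    """Exact Pr[X_1^L = x_1, ..., X_K^L = x_K] by enumeration.

    weights[l][i][j] (hypothetical layout) is the weight from node j in
    layer l+1 to node i in layer l+2; evidence lists x_1..x_K for the first
    K output nodes. Cost grows as 2^(N * (L - 1)).
    """
    n = len(biases)
    n_enum = len(weights)  # layers 1 .. L-1 are summed over
    total = 0.0
    for config in itertools.product((0, 1), repeat=n * n_enum):
        layers = [config[k * n:(k + 1) * n] for k in range(n_enum)]
        prob = 1.0
        # Probability of the input layer under the biases.
        for p, x in zip(biases, layers[0]):
            prob *= p if x else 1.0 - p
        # Probability of each intermediate layer given the one below it.
        for l in range(1, n_enum):
            for i in range(n):
                q = f(sum(t * x for t, x in zip(weights[l - 1][i], layers[l - 1])))
                prob *= q if layers[l][i] else 1.0 - q
        # Probability of the evidence at the first K output nodes.
        for i, x in enumerate(evidence):
            q = f(sum(t * v for t, v in zip(weights[-1][i], layers[-1])))
            prob *= q if x else 1.0 - q
        total += prob
    return total
```

Output nodes without evidence are simply left out of the product, since their probabilities sum to one.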
3 Large Deviation and Union Bounds 
One of our main weapons will be the theory of large deviations. As a first illustration
of this theory, consider the input nodes $\{X_j^1\}$ (which are independently set to 0 or 1
according to their biases $p_j$) and the weighted sum $\sum_{j=1}^N \theta_{ij}^1 X_j^1$ that feeds into the
ith node $X_i^2$ in the second layer. A typical large deviation bound (Kearns & Saul,
1997) states that for all $\epsilon > 0$, $\Pr[|\sum_{j=1}^N \theta_{ij}^1 (X_j^1 - p_j)| > \epsilon] \le 2e^{-2\epsilon^2/(N\Theta^2)}$, where
$\Theta$ is the largest weight in the network. If we make the scaling assumption that
each weight $\theta_{ij}^1$ is bounded by $\tau/N$ for some constant $\tau$ (thus, $\Theta \le \tau/N$), then we
see that the probability of large (order 1) deviations of this weighted sum from its
mean decays exponentially with N. (Our methods can also provide results under
the weaker assumption that all weights are bounded by $O(N^{-\alpha})$ for $\alpha > 1/2$.)
How can we apply this observation to the problem of inference? Suppose we are
interested in the marginal probability $\Pr[X_i^2 = 1]$. Then the large deviation bound
tells us that with probability at least $1 - \delta$ (where we define $\delta = 2e^{-2N\epsilon^2/\tau^2}$), the
weighted sum at node $X_i^2$ will be within $\epsilon$ of its mean value $\mu_i = \sum_{j=1}^N \theta_{ij}^1 p_j$. Thus,
with probability at least $1 - \delta$, we are assured that $\Pr[X_i^2 = 1]$ is at least $f(\mu_i - \epsilon)$
and at most $f(\mu_i + \epsilon)$. Of course, the flip side of the large deviation bound is that
with probability at most $\delta$, the weighted sum may fall more than $\epsilon$ away from $\mu_i$.
In this case we can make no guarantees on $\Pr[X_i^2 = 1]$ aside from the trivial lower
and upper bounds of 0 and 1. Combining both eventualities, however, we obtain
the overall bounds:
$$(1 - \delta) f(\mu_i - \epsilon) \le \Pr[X_i^2 = 1] \le (1 - \delta) f(\mu_i + \epsilon) + \delta. \qquad (1)$$
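As a minimal sketch, Equation (1) can be evaluated directly for a single second-layer node. The function name and argument order are hypothetical. The caller is assumed to supply a constant `tau` with every weight bounded by tau/N in magnitude, and clamping the throw-away probability at 1 is our own convenience, not part of the equation.

```python
import math


def sigmoid(x):
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def single_node_bounds(biases, theta_row, tau, eps, f=sigmoid):
    """Lower and upper bounds of Equation (1) on Pr[X_i^2 = 1].

    biases    : input biases p_j.
    theta_row : weights theta_ij into node i, assumed to satisfy
                |theta_ij| <= tau / N.
    eps       : half-width of the interval around the mean weighted sum.
    """
    n = len(biases)
    mu = sum(t * p for t, p in zip(theta_row, biases))  # mean weighted sum
    # Throw-away probability delta = 2 exp(-2 N eps^2 / tau^2), clamped at 1.
    delta = min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / tau ** 2))
    lower = (1.0 - delta) * f(mu - eps)
    upper = (1.0 - delta) * f(mu + eps) + delta
    return lower, upper
```

For example, with N = 100 inputs of bias 0.5, all weights equal to 1/N, tau = 1 and eps = 0.2, the throw-away probability is 2e^{-8}, and the two bounds bracket f(0.5) from either side.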
Equation (1) is based on a simple two-point approximation to the distribution over
the weighted sum of inputs, $\sum_{j=1}^N \theta_{ij}^1 X_j^1$. This approximation places one point,
with weight $1 - \delta$, at either $\epsilon$ above or below the mean $\mu_i$ (depending on whether
we are deriving the upper or lower bound); and the other point, with weight $\delta$, at
either $-\infty$ or $+\infty$. The value of $\delta$ depends on the choice of $\epsilon$: in particular, as $\epsilon$
becomes smaller, we give more weight to the $\pm\infty$ point, with the trade-off governed
by the large deviation bound. We regard the weight given to the $\pm\infty$ point as a
throw-away probability, since with this weight we resort to the trivial bounds of 0
or 1 on the marginal probability $\Pr[X_i^2 = 1]$.
Note that the very simple bounds in Equation (1) already exhibit an interesting
trade-off, governed by the choice of the parameter $\epsilon$: namely, as $\epsilon$ becomes smaller,
the throw-away probability $\delta$ becomes larger, while the terms $f(\mu_i \pm \epsilon)$ converge to
the same value. Since the overall bounds involve products of $f(\mu_i \pm \epsilon)$ and $1 - \delta$,
the optimal value of $\epsilon$ is the one that balances this competition between probable
explanations of the evidence and improbable deviations from the mean. This trade-off
is reminiscent of that encountered between energy and entropy in mean-field
approximations for symmetric networks (Hertz et al, 1991).
So far we have considered the marginal probability involving a single node in the
second layer. We can also compute bounds on the marginal probabilities involving
K > 1 nodes in this layer (which without loss of generality we take to be the nodes
$X_1^2$ through $X_K^2$). This is done by considering the probability that one or more
of the weighted sums entering these K nodes in the second layer deviate by more
than $\epsilon$ from their means. We can upper bound this probability by $K\delta$ by appealing
to the so-called union bound, which simply states that the probability of a union of
events is bounded by the sum of their individual probabilities. The union bound
allows us to bound marginal probabilities involving multiple variables. For example,
consider the marginal probability $\Pr[X_1^2 = 1, \ldots, X_K^2 = 1]$. Combining the large
deviation and union bounds, we find:
$$(1 - K\delta) \prod_{i=1}^K f(\mu_i - \epsilon) \le \Pr[X_1^2 = 1, \ldots, X_K^2 = 1] \le (1 - K\delta) \prod_{i=1}^K f(\mu_i + \epsilon) + K\delta. \qquad (2)$$
A number of observations are in order here. First, Equation (2) directly leads to
efficient algorithms for computing the upper and lower bounds. Second, although
for simplicity we have considered $\epsilon$-deviations of the same size at each node in the
second layer, the same methods apply to different choices of $\epsilon_i$ (and therefore $\delta_i$)
at each node. Indeed, variations in $\epsilon_i$ can lead to significantly tighter bounds, and
thus we exploit the freedom to choose different $\epsilon_i$ in the rest of the paper. This
results, for example, in bounds of the form:
$$\left(1 - \sum_{i=1}^K \delta_i\right) \prod_{i=1}^K f(\mu_i - \epsilon_i) \le \Pr[X_1^2 = 1, \ldots, X_K^2 = 1], \quad \text{where } \delta_i = 2e^{-2N\epsilon_i^2/\tau^2}. \qquad (3)$$
The reader is invited to study the small but important differences between this
lower bound and the one in Equation (2). Third, the arguments leading to bounds
on the marginal probability $\Pr[X_1^2 = 1, \ldots, X_K^2 = 1]$ generalize in a straightforward
manner to other patterns of evidence besides all 1's. For instance, again just
considering the lower bound, we have:
$$\left(1 - \sum_{i=1}^K \delta_i\right) \prod_{i: x_i = 0} [1 - f(\mu_i + \epsilon_i)] \prod_{i: x_i = 1} f(\mu_i - \epsilon_i) \le \Pr[X_1^2 = x_1, \ldots, X_K^2 = x_K] \qquad (4)$$
where $x_i \in \{0, 1\}$ are arbitrary binary values. Thus together the large deviation
and union bounds provide the means to compute upper and lower bounds on the 
marginal probabilities over nodes in the second layer. Further details and conse- 
quences of these bounds for the special case of two-layer networks are given in a 
companion paper (Kearns & Saul, 1997); our interest here, however, is in the more 
challenging generalization to multilayer networks. 
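The lower bound of Equation (4), with a separate interval half-width at each evidence node, can be sketched as follows. The function name and argument layout are assumptions made for illustration, and returning 0 when the union bound is vacuous is our own convention (0 is always a valid lower bound).

```python
import math


def sigmoid(x):
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def evidence_lower_bound(mus, eps, evidence, n_parents, tau, f=sigmoid):
    """Lower bound of Equation (4) on Pr[X_1^2 = x_1, ..., X_K^2 = x_K].

    mus       : mean weighted sums mu_i at the K evidence nodes.
    eps       : per-node half-widths eps_i > 0.
    evidence  : binary values x_i.
    n_parents : N, the number of parents of each node.
    tau       : constant such that every weight is at most tau / N in magnitude.
    """
    # delta_i = 2 exp(-2 N eps_i^2 / tau^2), one term per evidence node.
    deltas = [2.0 * math.exp(-2.0 * n_parents * e * e / tau ** 2) for e in eps]
    keep = 1.0 - sum(deltas)
    if keep <= 0.0:
        return 0.0  # union bound is vacuous; fall back to the trivial bound
    prod = 1.0
    for mu, e, x in zip(mus, eps, evidence):
        # x_i = 1 contributes f(mu_i - eps_i); x_i = 0 contributes 1 - f(mu_i + eps_i).
        prod *= f(mu - e) if x == 1 else 1.0 - f(mu + e)
    return keep * prod
```

Setting all `evidence` entries to 1 recovers the lower bound of Equation (3).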
4 Multilayer Networks: Inference via Induction 
In extending the ideas of the previous section to multilayer networks, we face the 
problem that the nodes in the second layer, unlike those in the first, are not inde- 
pendent. But we can still adopt an inductive strategy to derive bounds on marginal 
probabilities. The crucial observation is that conditioned on the values of the incoming
weighted sums at the nodes in the second layer, the variables $\{X_i^2\}$ do become
independent. More generally, conditioned on these weighted sums all falling "near"
their means (an event whose probability we quantified in the last section), the
nodes $\{X_i^2\}$ become "almost" independent. It is exactly this near-independence
that we now formalize and exploit inductively to compute bounds for multilayer 
networks. The first tool we require is an appropriate generalization of the large 
deviation bound, which does not rely on precise knowledge of the means of the 
random variables being summed. 
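Theorem 1 For all $1 \le j \le N$, let $X_j \in \{0, 1\}$ denote independent binary random
variables, and let $|\tau_j| \le \tau$. Suppose that the means are bounded by $|E[X_j] - p_j| \le \Delta_j$,
where $0 < \Delta_j \le p_j \le 1 - \Delta_j$. Then for all $\epsilon > \frac{1}{N} \sum_{j=1}^N |\tau_j| \Delta_j$:
$$\Pr\left[\left|\frac{1}{N} \sum_{j=1}^N \tau_j (X_j - p_j)\right| > \epsilon\right] \le 2 \exp\left(-\frac{2N}{\tau^2} \left(\epsilon - \frac{1}{N} \sum_{j=1}^N |\tau_j| \Delta_j\right)^2\right) \qquad (5)$$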
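The proof of this result is omitted due to space considerations. Now for induction,
consider the nodes in the $\ell$th layer of the network. Suppose we are told that for
every i, the weighted sum $\sum_{j=1}^N \theta_{ij}^{\ell-1} X_j^{\ell-1}$ entering into the node $X_i^\ell$ lies in the
interval $[\mu_i^\ell - \epsilon_i^\ell, \mu_i^\ell + \epsilon_i^\ell]$, for some choice of the $\mu_i^\ell$ and the $\epsilon_i^\ell$. Then the mean of
node $X_i^\ell$ is constrained to lie in the interval $[p_i^\ell - \Delta_i^\ell, p_i^\ell + \Delta_i^\ell]$, where
$$p_i^\ell = \frac{1}{2} \left[ f(\mu_i^\ell - \epsilon_i^\ell) + f(\mu_i^\ell + \epsilon_i^\ell) \right] \qquad (6)$$
$$\Delta_i^\ell = \frac{1}{2} \left[ f(\mu_i^\ell + \epsilon_i^\ell) - f(\mu_i^\ell - \epsilon_i^\ell) \right] \qquad (7)$$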
Here we have simply run the leftmost and rightmost allowed values for the incoming
weighted sums through the transfer function, and defined the interval around the
mean of unit $X_i^\ell$ to be centered around $p_i^\ell$. Thus we have translated uncertainties
on the incoming weighted sums to layer $\ell$ into conditional uncertainties on the
means of the nodes $X_i^\ell$ in layer $\ell$. To complete the cycle, we now translate these
into conditional uncertainties on the incoming weighted sums to layer $\ell + 1$. In
particular, conditioned on the original intervals $[\mu_i^\ell - \epsilon_i^\ell, \mu_i^\ell + \epsilon_i^\ell]$, what is the probability
that for each i, $\sum_{j=1}^N \theta_{ij}^\ell X_j^\ell$ lies inside some new interval $[\mu_i^{\ell+1} - \epsilon_i^{\ell+1}, \mu_i^{\ell+1} + \epsilon_i^{\ell+1}]$?
In order to make some guarantee on this probability, we set $\mu_i^{\ell+1} = \sum_{j=1}^N \theta_{ij}^\ell p_j^\ell$
and assume that $\epsilon_i^{\ell+1} > \sum_{j=1}^N |\theta_{ij}^\ell| \Delta_j^\ell$. These conditions suffice to ensure that
the new intervals contain the (conditional) expected values of the weighted sums
$\sum_{j=1}^N \theta_{ij}^\ell X_j^\ell$, and that the new intervals are large enough to encompass the incoming
uncertainties. Because these conditions are a minimal requirement for establishing
any probabilistic guarantees, we shall say that the $[\mu_i^\ell - \epsilon_i^\ell, \mu_i^\ell + \epsilon_i^\ell]$ define a valid
set of $\epsilon$-intervals if they meet these conditions for all $1 \le i \le N$. Given a valid set
of $\epsilon$-intervals at the $(\ell + 1)$th layer, it follows from Theorem 1 and the union bound
that the weighted sums entering nodes in layer $\ell + 1$ obey
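$$\Pr\left[\left|\sum_{j=1}^N \theta_{ij}^\ell X_j^\ell - \mu_i^{\ell+1}\right| > \epsilon_i^{\ell+1} \text{ for some } 1 \le i \le N\right] \le \sum_{i=1}^N \delta_i^{\ell+1} \qquad (8)$$
where
$$\delta_i^{\ell+1} = 2 \exp\left(-\frac{2N}{\tau^2} \left(\epsilon_i^{\ell+1} - \sum_{j=1}^N |\theta_{ij}^\ell| \Delta_j^\ell\right)^2\right) \qquad (9)$$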
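One step of this induction, from the intervals at layer l to those at layer l + 1, might be sketched as below. The parameterization by a per-node `slack` (the amount by which the new half-width exceeds its floor) is our own device for guaranteeing validity, and the function name and return layout are hypothetical.

```python
import math


def sigmoid(x):
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def propagate_layer(mu, eps, theta, slack, tau, f=sigmoid):
    """Apply Equations (6), (7) and (9) for one layer.

    mu, eps : centers and half-widths of the intervals at layer l.
    theta   : theta[i][j] is the weight from node j in layer l to node i
              in layer l+1, assumed to satisfy |theta_ij| <= tau / N.
    slack   : slack[i] > 0, the excess of eps_i^{l+1} over its floor
              sum_j |theta_ij| Delta_j (an assumed parameterization).
    Returns (p, spread, new_mu, new_eps, violation), where violation[i]
    is delta_i^{l+1} from Equation (9).
    """
    n = len(mu)
    p = [0.5 * (f(m - e) + f(m + e)) for m, e in zip(mu, eps)]       # Eq. (6)
    spread = [0.5 * (f(m + e) - f(m - e)) for m, e in zip(mu, eps)]  # Eq. (7)
    # New centers: mu_i^{l+1} = sum_j theta_ij p_j.
    new_mu = [sum(t * pj for t, pj in zip(row, p)) for row in theta]
    # Validity floor on the new half-widths, plus the chosen slack.
    floor = [sum(abs(t) * d for t, d in zip(row, spread)) for row in theta]
    new_eps = [fl + s for fl, s in zip(floor, slack)]
    # Eq. (9): the exponent depends only on the slack above the floor.
    violation = [2.0 * math.exp(-2.0 * n * s * s / tau ** 2) for s in slack]
    return p, spread, new_mu, new_eps, violation
```

Summing the returned `violation` list gives the right hand side of Equation (8).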
In what follows, we shall frequently make use of the fact that the weighted sums
$\sum_{j=1}^N \theta_{ij}^\ell X_j^\ell$ are bounded by intervals $[\mu_i^{\ell+1} - \epsilon_i^{\ell+1}, \mu_i^{\ell+1} + \epsilon_i^{\ell+1}]$. This motivates the
following definitions.
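Definition 2 Given a valid set of $\epsilon$-intervals and binary values $\{X_j^\ell = x_j^\ell\}$ for the
nodes in the $\ell$th layer, we say that the $(\ell + 1)$st layer of the network satisfies its
$\epsilon$-intervals if $|\sum_{j=1}^N \theta_{ij}^\ell x_j^\ell - \mu_i^{\ell+1}| < \epsilon_i^{\ell+1}$ for all $1 \le i \le N$. Otherwise, we say that
the $(\ell + 1)$st layer violates its $\epsilon$-intervals.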
Suppose that we are given a valid set of $\epsilon$-intervals and that we sample from the joint
distribution defined by the probabilistic f-network. The right hand side of Equation
(8) provides an upper bound on the conditional probability that the $(\ell + 1)$st layer
violates its $\epsilon$-intervals, given that the $\ell$th layer did not. This upper bound may be
vacuous (that is, larger than 1), so let us denote by $\delta^{\ell+1}$ whichever is smaller: the
right hand side of Equation (8), or 1; in other words, $\delta^{\ell+1} = \min\{\sum_{i=1}^N \delta_i^{\ell+1}, 1\}$.
Since at the $\ell$th layer, the probability of violating the $\epsilon$-intervals is at most $\delta^\ell$, we
are guaranteed that with probability at least $\prod_{\ell > 1} [1 - \delta^\ell]$, all the layers satisfy
their $\epsilon$-intervals. Conversely, we are guaranteed that the probability that any layer
violates its $\epsilon$-intervals is at most $1 - \prod_{\ell > 1} [1 - \delta^\ell]$. Treating this as a throw-away
probability, we can now compute upper and lower bounds on marginal probabilities
involving nodes at the Lth layer exactly as in the case of nodes at the second layer.
This yields the following theorem.
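Theorem 2 For any subset $\{X_1^L, \ldots, X_K^L\}$ of the outputs of a probabilistic f-network,
for any setting $x_1, \ldots, x_K$, and for any valid set of $\epsilon$-intervals, the marginal
probability of partial evidence in the output layer obeys:
$$\prod_{\ell > 1} [1 - \delta^\ell] \prod_{i: x_i = 1} f(\mu_i^L - \epsilon_i^L) \prod_{i: x_i = 0} [1 - f(\mu_i^L + \epsilon_i^L)] \qquad (10)$$
$$\le \Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$$
$$\le \prod_{\ell > 1} [1 - \delta^\ell] \prod_{i: x_i = 1} f(\mu_i^L + \epsilon_i^L) \prod_{i: x_i = 0} [1 - f(\mu_i^L - \epsilon_i^L)] + \left(1 - \prod_{\ell > 1} [1 - \delta^\ell]\right) \qquad (11)$$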
Theorem 2 generalizes our earlier results for marginal probabilities over nodes in the 
second layer; for example, compare Equation (10) to Equation (4). Again, the upper 
and lower bounds can be efficiently computed for all common transfer functions. 
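A compact sketch of how the bounds of Theorem 2 might be evaluated layer by layer is given below. For brevity it uses one scalar `slack` for every node and layer, so that each half-width is its validity floor plus `slack`; this is a simplification of ours, since the theorem allows a different half-width at every node. The function name and the `weights[l][i][j]` layout are likewise assumptions, and at least two layers are assumed.

```python
import math


def sigmoid(x):
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def theorem2_bounds(biases, weights, evidence, slack, tau, f=sigmoid):
    """Lower and upper bounds (10) and (11) on the evidence probability.

    weights  : list of L-1 matrices (L >= 2); weights[l][i][j] is the weight
               from node j to node i in the next layer, |weight| <= tau / N.
    evidence : binary values for the first K output nodes.
    slack    : scalar > 0 added to each validity floor (assumed scheme).
    """
    n = len(biases)
    p = list(biases)      # input means are known exactly ...
    spread = [0.0] * n    # ... so Delta_j^1 = 0
    keep = 1.0            # running product of (1 - delta^l) over l > 1
    for theta in weights:
        mu = [sum(t * pj for t, pj in zip(row, p)) for row in theta]
        eps = [sum(abs(t) * d for t, d in zip(row, spread)) + slack
               for row in theta]
        # Eqs. (8)-(9): union bound over the N nodes, clamped at 1.
        delta_layer = min(1.0, n * 2.0 * math.exp(-2.0 * n * slack ** 2 / tau ** 2))
        keep *= 1.0 - delta_layer
        lo = [f(m - e) for m, e in zip(mu, eps)]
        hi = [f(m + e) for m, e in zip(mu, eps)]
        p = [0.5 * (a + b) for a, b in zip(lo, hi)]        # Eq. (6)
        spread = [0.5 * (b - a) for a, b in zip(lo, hi)]   # Eq. (7)
    lower = upper = keep
    for x, a, b in zip(evidence, lo, hi):
        lower *= a if x == 1 else 1.0 - b
        upper *= b if x == 1 else 1.0 - a
    return lower, upper + (1.0 - keep)
```

On networks small enough to enumerate, the returned pair can be checked against the exact marginal; for large N the slack can be made small while the throw-away probability stays negligible.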
5 Rates of Convergence 
To demonstrate the power of Theorem 2, we consider how the gap (or additive
difference) between these upper and lower bounds on $\Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$
behaves for some crude (but informed) choices of the $\{\epsilon_i^\ell\}$. Our goal is to derive
the rate at which these upper and lower bounds converge to the same value as we
examine larger and larger networks. Suppose we choose the $\epsilon$-intervals inductively
by defining $\Delta_j^1 = 0$ and setting
$$\epsilon_i^{\ell+1} = \sum_{j=1}^N |\theta_{ij}^\ell| \Delta_j^\ell + \sqrt{\frac{\gamma \tau^2 \ln N}{N}} \qquad (12)$$
for some $\gamma > 1$. From Equations (8) and (9), this choice gives $\delta^{\ell+1} \le 2N^{1-2\gamma}$ as
an upper bound on the probability that the $(\ell + 1)$th layer violates its $\epsilon$-intervals.
Moreover, denoting the gap between the upper and lower bounds in Theorem 2 by
G, it can be shown that:
$$G \le [\ldots]\,[1 - f(\mu_i^L - \epsilon_i^L)] + \frac{2L}{N^{2\gamma - 1}} \qquad (13)$$
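The arithmetic behind the choice in Equation (12) is easy to check numerically: substituting the square-root term into Equation (9) and summing over the N nodes of a layer should give exactly $2N^{1-2\gamma}$. The helper names below are hypothetical.

```python
import math


def eq12_slack(n, tau, gamma):
    """Square-root term of Equation (12): sqrt(gamma * tau^2 * ln(N) / N)."""
    return math.sqrt(gamma * tau ** 2 * math.log(n) / n)


def layer_violation_bound(n, tau, gamma):
    """Sum over N nodes of Equation (9) when eps exceeds its floor by eq12_slack.

    Analytically this equals 2 * N**(1 - 2 * gamma).
    """
    s = eq12_slack(n, tau, gamma)
    return n * 2.0 * math.exp(-2.0 * n * s ** 2 / tau ** 2)
```

For instance, with N = 100, tau = 2 and gamma = 1.5 the per-layer violation bound is 2 * 100^{-2} = 2e-4, while the slack itself shrinks like sqrt(ln(N)/N) as N grows.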
Let us briefly recall the definitions of the parameters on the right hand side of this
equation: $\alpha$ is the maximal slope of the transfer function f, N is the number of
nodes in each layer, K is the number of nodes with evidence, $\tau = N\Theta$ is N times the
largest weight in the network, L is the number of layers, and $\gamma > 1$ is a parameter
at our disposal. The first term of this bound essentially has a $1/\sqrt{N}$ dependence on
N, but is multiplied by a damping factor that we might typically expect to decay
exponentially with the number K of outputs examined. To see this, simply notice
that each of the factors $f(\mu_j + \epsilon_j)$ and $[1 - f(\mu_j - \epsilon_j)]$ is bounded by 1; furthermore,
since all the means $\mu_j$ are bounded, if N is large compared to $\gamma$ then the $\epsilon_j$ are
small, and each of these factors is in fact bounded by some value $\beta < 1$. Thus
the first term in Equation (13) is bounded by a constant times $\beta^{K-1} K \sqrt{\ln(N)/N}$.
Since it is natural to expect the marginal probability of interest itself to decrease
exponentially with K, this is desirable and natural behavior.
Of course, in the case of large K, the behavior of the resulting overall bound can
be dominated by the second term $2L/N^{2\gamma - 1}$ of Equation (13). In such situations,
however, we can consider larger values of $\gamma$, possibly even of order K; indeed, for
sufficiently large $\gamma$, the first term (which scales like $\sqrt{\gamma}$) must necessarily overtake
the second one. Thus there is a clear trade-off between the two terms, as well as an
optimal value of $\gamma$ that sets them to be (roughly) the same magnitude. Generally
speaking, for fixed K and large N, we observe that the difference between our upper
and lower bounds on $\Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$ vanishes as $O(\sqrt{\ln(N)/N})$.
6 An Algorithm for Fixed Multilayer Networks 
We conclude by noting that the specific choices made for the parameters $\epsilon_i^\ell$ in
Section 5 to derive rates of convergence may be far from the optimal choices for a
fixed network of interest. However, Theorem 2 directly suggests a natural algorithm
for approximate probabilistic inference. In particular, regarding the upper and lower
bounds on $\Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$ as functions of $\{\epsilon_i^\ell\}$, we can optimize these
bounds by standard numerical methods. For the upper bound, we may perform
gradient descent in the $\{\epsilon_i^\ell\}$ to find a local minimum, while for the lower bound, we
may perform gradient ascent to find a local maximum. The components of these
gradients in both cases are easily computable for all the commonly studied transfer
functions. Moreover, the constraint of maintaining valid $\epsilon$-intervals can be enforced
by maintaining a floor on the $\epsilon$-intervals in one layer in terms of those at the previous
one. The practical application of this algorithm to interesting Bayesian networks
will be studied in future work.
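As a rough illustration of this idea, and not the algorithm deferred to future work, the sketch below runs gradient ascent on the two-layer lower bound of Equation (3). Several choices are ours: the gradient is estimated by finite differences rather than in closed form, a step is accepted only if it improves the bound (otherwise the step size is halved), and the validity floor is taken to be a small positive constant, which is appropriate for the second layer where the input uncertainties are zero. All names are hypothetical.

```python
import math


def sigmoid(x):
    """Sigmoid transfer function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def lower_bound(eps, mus, n, tau, f=sigmoid):
    """Two-layer lower bound of Equation (3) as a function of the eps_i."""
    delta = sum(2.0 * math.exp(-2.0 * n * e * e / tau ** 2) for e in eps)
    return (1.0 - delta) * math.prod(f(m - e) for m, e in zip(mus, eps))


def ascend(eps, objective, step=0.5, iters=200, h=1e-6, floor=1e-9):
    """Finite-difference gradient ascent that only accepts improving steps."""
    eps = list(eps)
    best = objective(eps)
    for _ in range(iters):
        grad = []
        for i in range(len(eps)):
            bumped = eps[:]
            bumped[i] += h
            grad.append((objective(bumped) - best) / h)
        # Keep every eps_i above the floor so the intervals stay valid.
        cand = [max(floor, e + step * g) for e, g in zip(eps, grad)]
        val = objective(cand)
        if val > best:
            eps, best = cand, val
        else:
            step *= 0.5  # overshoot: shrink the step and retry
    return eps, best
```

A call such as `ascend([0.5, 0.5], lambda e: lower_bound(e, [0.3, -0.2], 100, 1.0))` starts from deliberately loose intervals and tightens them until the growth of the throw-away probability offsets the gain in the product term. The same routine applied to the negated upper bound would perform the corresponding descent.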
References 
Cooper, G. (1990). Computational complexity of probabilistic inference using 
Bayesian belief networks. Artificial Intelligence 42:393-405. 
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural
computation. Addison-Wesley, Redwood City, CA. 
Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for
unsupervised neural networks. Science 268:1158-1161. 
Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1997). An introduction to 
variational methods for graphical models. In M. Jordan, ed. Learning in Graphical 
Models. Kluwer Academic. 
Kearns, M., & Saul, L. (1998). Large deviation methods for approximate probabilistic
inference. In Proceedings of the 14th Annual Conference on Uncertainty in
Artificial Intelligence.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence 
56:71-113. 
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plau- 
sible Inference. Morgan Kaufmann, San Mateo, CA. 
