116 
THE BOLTZMANN PERCEPTRON NETWORK: 
A MULTI-LAYERED FEED-FORWARD NETWORK 
EQUIVALENT TO THE BOLTZMANN MACHINE 
Eyal Yair and Allen Gersho 
Center for Information Processing Research 
Departmere of Electrical & Computer Engineering 
University of California, Santa Barbara, CA 93106 
ABSTRACT 
The concept of the stochastic Boltzmann machine (BM) is attractive for 
decision making and pattern classification purposes since the probability of 
attaining the network states is a function of the network energy. Hence, the 
probability of attaining particular energy minima may be associated with the 
probabilities of making certain decisions (or classifications). However, 
because of ill stochastic nature, the complexity of the BM is fairly high and 
therefore such networks are not very likely to be used in practice. In this 
paper we suggest a way to alleviate this drawback by converting the sto- 
elmstic BM into a deterministic network which we call the Boltzmann Per- 
ceplxon Network (BPN). The BPN is functionally equivalent to the BM but 
has a feed-forward structure and low complexity. No annealing is required. 
The conditions under which such a conversion is feasible are given. A 
learning algorithm for the BPN based on the conjugate gradient method is 
also provided which is somewhat akin to the backpropagation algorithm. 
INTRODUCTION 
In decision-making applications, it is desirable to have a network which computes the pro- 
babilities of deciding upon each of M possible propositions for any given input pattern. In 
principle, the Boltzmann machine (BM) (Hinton, Sejnowsld and Ackley, 1984) can provide 
such a capability. The network is composed of a set of binary units connected through sym- 
metric connection links. The units are randomly and asynchronously changing their values 
in {0,1} according to a stochastic Iransition rule. The transition rule used by Hinton et. al. 
defines the probability of a unit to be in the 'on' state as the logistic function of the energy 
change resulting by changing the value of that unit. The BM can be described by an 
ergodic Markov chain in which the thermal equilibrium probability of attaining each state 
obeys the Boltzmann distribution which is a function of only the energy. By associating the 
set of possible propositions with subsell of network states, the probability of deciding upon 
each of these propositions can be measured by the probability of a,alning the correspond- 
ing set of states. This probability is also affected by the temperature. As the temperature 
This wod was supported by the Wizmann Foundation for scientific research, by the 
University of California MICRO program, and by Bell Communications Research, Inc. 
The Boltzmann Perceptron Network 
increases, the Boltzmann probability distribution become more uniform and thus the deci- 
sion made is 'vague'. The lower the temperature the greater is the probability of a_n__aining 
states with lower energy thereby leading to more 'distinctive' decisions. 
This approach, while very attractive in principle, has two major drawbacks which make the 
complexity of the computations become non-feasible for nontrivial problems. The first is 
the need for thermal equilibrium in order to obtain the Boltzmann distribution. To make 
distinctive decisions a low temperature is required. This implies slower convergence 
towards thermal equilibrium. Generally, the method used to reach thermal equilibrium is 
simulated annealing (SA) (Kirkpatrick et. al., 1983) in which the temperature starts from a 
high value and is gradually reduced to the desired final value. In order to avoid 'freezing' 
of the network, the cooling schedule should be fairly slow. SA is thus time consuming and 
computationally expensive. The second drawback is due to the stochastic nature of the com- 
putation. Since the network state is a random vector, the desired probabilities have to be 
estimated by accumulating statistics of the network behavior for only a finite period of 
time. Hence, a trade-off between speed and accuracy is unavoidable. 
In this paper, we propose a mechanism to alleviate the above computational drawbacks by 
converting the stochastic BM into a functionally equivalent deterministic network, which 
we call the Boltzmann Perceptron Network (BPN). The BPN circumvents the need for a 
Monte Carlo type of computation and instead evaluates the desired probabilities using a 
multilayer perceptron-like network. The very time consuming learning pn:rss for the BM 
is similarly replaced by a deterministic learning scheme, somewhat akin to the backpropa- 
gation algorithm, which is computationally affordable. The similarity between the learning 
algorithm of a BM having a layered structure and that of a two-layer perceptron has been 
recently pointed out by Hopfield (1987). In this paper we further elaborate on such an 
equivalence between the BM and the new perceptton-like network, and give the conditions 
under which the conversion of the stochastic BM into the deterministic BPN is possible. 
Unlike the original BM, the BPN is virtually always in thermal equilibrium and thus SA is 
no longer required. Nevertheless, the temperature still plays the same role and thus varying 
it may be beneficial to control the 'sofmess' of the decisions made by the BPN. Using the 
BPN as a soft classifier is described in details in (Yair and Gersho, 1989). 
THE BOLTZMANN PERCEPTRON NETWORK 
Suppose we have a network of K units connected through symmetric connection links with 
no self-feedback, so that the connection matrix r is symmetric and zero-diagonal. Let us 
categorize the units into three different types: input, output and hidden units. The input 
pattern will be supplied to the network by clamping the input units, denoted by 
x_= (x,..,xi ,..,xt) r, with this pattern. x_ is a real-valued vector in R . The output of the net- 
work will be observed on the output units E=(y,..,y,,..,yt) r, which is a binary vector. 
The remaining units, denoted v=(v,..,vi,..,vj) r, are the hidden units, which are also 
binary-valued. The hidden and output units are asynchronously and randomly changing 
their binary values in {0,1 } according to inputs they receive from other units. 
The state of the network will be denoted by the vector u which is partitioned as follows: 
u r =(x_r,vr,r). The energy associated with state u is denoted by Eu and is given by: 
-E,, = urF_u + ur$ (1) 
where J is a vector of bias values, partitioned to comply with the partition of _u as follows: 
b r = (_rr,_c_r,sr). 
118 Yair and Gersho 
The transition from one state to another is performed by selecting each time one unit, say 
unit k, at random and determine its output value according to the following stochastic rule: 
set the output of the unit to 1 with probability p,, and to 0 with a probability 1-p,. The 
parameter p, is determined locally by the k-th unit as a function of the energy change 
in the following fashion: 
pk=g(AF.k) ; g(x) =a 1 (2) 
1 +e-Ix 
AE, = {E,,(unit k is off) - E,, (unit k is on) }, and I= 1/r is a control parameter. T is called 
the temperature and g (-) is the logistic function. With this transition rule the thermal equili- 
brium probability P,, of attaining a state _u obeys the Boltzmann distribution: 
p, 1 -  E. 
= e (3) 
z 
where Z,, called the partition function, is a normalization factor (independent of v_ and y.) 
such that the sum of Pu over all the 2 s+u possible states will sum to unity. 
In order to use the network in a dterministic fashion rather than accumulate statistics while 
observing its random behavior, we should be able to explicitly compute the probability of 
This 
auaining a certain vector ig on the output units while x is clamped on the input units. 
probability, denoted by Py ,,, can be expressed as: 
(4) 
where B is the set of all binary vectors of length J, and v ,y Ix denotes a state 1t in which 
a specific input vector x is clamped. The explicit evaluation of the desired probabilities 
therefore involves the computation of the partition function for which the number of opera- 
tions grows exponentially with the number of units. That is, the complexity is O(2J+s). 
Obviously, this is computationally unacceptable. Nevertheless, we shall see that under a 
certain restriction on the connection malTix F the explicit computation of the desired proba- 
bilities becomes possible with a complexity of 0 (JM) which is computationally feasible. 
Let us assume that for each input pattern we have M possible propositions which are asso- 
ciated with the M output vectors: lu = {It ,..J_ ,..J }, where I_,,, is the m-th column of the 
M,,M identity matrix. Any state of the network having output vector i[=I_,,, (for any m) 
will be denoted by v ,m Ix and will be called a feasible state. All other state vectors v ,y Ix 
for x;l_,,, will be considered as intermediate steps between two feasible states. This 
redefinition of the network states is equivalent to redefining the states of the underlying 
Markov model of the network and thus conserves the equilibrium Boltzmann distribution. 
The probability of proposition m for a given input x, denoted by P,, ,,, will be taken as the 
probability of obtaining output vector x=l_,,, given that the output is one of the feasible 
values. That is, 
P,, = Pr {X = I_,,, I x, lelu } (5) 
which can be computed from (4) by restricting the state space to the 2sM feasible state 
vectors and by setting =. The partition function, conditioned on restricting : to lie in 
the set of feasible outputs, lu, is denoted by Z and is given by: 
The Boltzmann Perceptron Network 119 
Let us now partition the connection matrix V to comply with the partition of the state vec- 
tor and rewrite the energy for the feasible state v,m ix as: 
-E,,,,, =_vT(Rx_+Q +D2Y_+.g.)+ I_r(Wx+D I: +S)+ x_T(AD_x+f) . (7) 
Since x is clamped on the input units, the last term in the energy expression serves only as 
a bias term for the energy which is independent of the binary units _v and ]t. Therefore, 
without any loss of generality it may be assumed that D=0 and f=0. The second term, 
denoted by T,, ,,,, can be simplified since D3 has a zero diagonal. Hence, 
I 
T, =  w xi + s, (8) 
i=l 
The absence of the matrix D3 in the energy expression means that interconnections between 
output units have no effect on the probability of attaining output vectors lu, and may be 
assumed absent without any loss of generality. 
Defining L,, x(A) to be: 
= 'rat+q"' +VV+r0 ] , (9) 
L,, x(z) a OTm, +In[ e- 
_B 
in which q, is the m-th column of Q, the desired probabilities, P,, , for m=l,..,M are 
obtained using (4) and (7) as a function of these quantities as follows: 
I   e 
= -- e with: = . (10) 
Zz m=l 
The complexity of evaluating the desired probabilities P,,  is still exponential with the 
number of hidden units J due to the sum in (9). However, if we impose the restriction that 
D2=0, namely, the hidden units are not directly interconnected, then this sum becomes 
separable in the components vj and thus can be decomposed into the product of only J 
terms. This restricted connectivity of course imposes some restrictions on the capability of 
the network compared to that of a fully connected network. On the other hand, it allows the 
computation of the desired probabilities in a deterministic way with the atuactive complex- 
ity of only O (JM) operations. The tedious estimation of conditional expectations com- 
monly required by the learning algorithm for a BM and the annealing procedure are 
avoided, and an accurate and computationally affordable scheme becomes possible. We thus 
suggest a Wade-off in which the operation and learning of the BM are emendously 
simplified and the exact decision probabilities are computed (rather than their statistical 
estimates) at the expense of a restricted connectivity, namely, no interconnections are 
allowed between the hidden units. Hence, in our scheme, the connection matrix, 1", 
becomes zero block-diagonal, meaning that the network has connections only between units 
of different categories. This su'ucture is shown schematically in Figure 1. 
Soft 
IComptitiol',, 
Figure 1. Schematic architecture of 
the stochastic BM. 
Figure 2. Block diagram of the corresponding 
deterministic BPN. 
120 Yair and Gersho 
By applying the property D2= 0 to (9), the sum over the space of hidden units, which can 
be explicitly written as the sum over all the J components of v, can be decomposed using 
the separability of the different vj components into a sum of only J terms as follows: 
./ 
L, x() = IT,, + f(V ') (11a) 
j=l 
I 
where: f (x) a ln(l+e,, ) and V.'? I,, a 
= j =rjixi 
i=l 
f (.) is called the activation function. 
(11b) 
Note that as I is increased f (.) approaches the linear 
P, = + e -t""'(x) ; where: L., x()=L, x()-L,n x() . (12) 
n=l 
Eqs. (8) and (11) describe a two-layer feed-forward perceptton-like subnetwork which uses 
the nonlinearity f (.). It evaluates the quantity L,(3) which we call the score of the m-th 
proposition. Eq. (12) describes a competition between the scores L, x() generated by the M 
subnetworks (m=l,..,M) which we call a soft competition with lateral inhibition. That is, If 
several scores receive relatively high values compared to the others, they will share, accord- 
ing to their relative strengths, the total amount (unity) oi probability, while inhibiting the 
remaining probabilities to approach zero. For example, if one of the scores, say Lk x(), is 
large compared to all the other scores, then the exponentiation of the pairwise score 
differences will result in Pk,=I while the remaining probabilities will approach zero. 
Specifically, for any n-k, P,= exp (-L,,, x()), which is essentially zero ff L x() is 
sufficiently high. In other words, by being large compared to the others, L x() won the 
competition so that the corresponding probability P,,, approaches unity, while all the 
remaining probabilities have been attenuated by the high value of Lk x() to approach zero. 
Let us examine the effect of the gain I on this competition. When I is increased, the slope 
of the activation function f (.) is increased thereby accentuating the differences between the 
M contenders. In the limit when I--o, one of the L, x() will always be sufficiently large 
compared to the others, and thus only one proposition will win. The competition then 
becomes a winner-take-all competition. In this case, the network becomes a max/mum a 
posteriori (MAP) decision scheme in which the L, x() play the role of nonlinear discrim- 
inant functions and the most probable proposition for the given input pattern is chosen: 
P,,=I for k=argmax{L,x(x)} and P,,,=0 for nk. (13) 
This results coincides with our earlier observation that the temperature controls the 'soft- 
ness' of the decision. The lower the temperature, the 'harder' is the competition and the 
more distinctive are the decisions. However, in contrast to the stochastic network, there is 
no need to grad,_a!!y 'cool' the network to achieve a desired (low) temperature. Any 
desired value of I is directly applicable in the BPN scheme. The above notion of soft com- 
petition has its merits in a wider scope of applications apart from its role in the BPN 
classifier. In many competitive schemes a soft competition between a set of contenders has 
threshold function in which a zero response is obtained for a negative input and a linear 
one (with slope I) for a positive input. 
Finally, the desired probabilities P,,, can be expressed as a function of the L, x(x) 
(m=l...,M) in an expression which can be regarded as the generalization of the logistic 
function to M inputs: 
The Boltzmann Perceptton Network 121 
a substantial benefit in comparison to the winner-take-all paro___digrn. The above competition 
scheme which can be implementcA by a two-layer feed-forward network thus offers a valu- 
able scheme for such purposes. 
The block diagram of the BPN is depicted in Figure 2. The BPN is thus a four-layer feed- 
forward deterministic network. It is comprised of a two-layer perceptron-like network fol- 
lowed by a two-layer competition network. The competition can be 'hard' (winner-take-all) 
or 'soft' (graded decision) and is governed by a single gain parameter I. 
THE LEARNING ALGORITHM 
Let us denote the BPN output by the M-dimensional probabihty vector P_, where: 
P_ =(Pl,..,Pm,..,Pt) r. For any given set of weights O, the BPN realizes some deter- 
ministic mapping W: R  -, [0,1 so that  = P x(). The objective of learning is to 
determine the mapping P (by estimating the set of parameters _) which 'best' explains a 
finite set of examples given from the desired mapping and are called the training set. The 
training set is specified by a set of N patterns { x_l,..,.n,.., } (denoted for simplicity by 
{ x_}), the a pr/or/probability for each training pattern x_: Q x(F.), and the desired mapping 
for each pattern x:  =(Qt,..,Q,t,..,Qt,t) r, where Q,t =Pr {proposition m Ix} is the 
desired probability. For each input pattern presented to the BPN, the actual output probabil- 
ity vector P. is, in general, different from the desired one, Q.. We denote by Gt the dis- 
tortion between the actual P. and the desired  for an input pattern _x. Thus, our task is to 
determine the network weights (and the bias values) 0 so that, on the average, the distortion 
over the whole set of waining patterns will be minimized. Adopting the original distortion 
measure suggested for Boltzmann machines, the average distortion, G (.0.), is given by: 
M 
G (.O.) = ' Q x() Gt (.O_) ; Gt (.O.) =  Q, t ln[ Q, t / P,n t (.O_) ] (14) 
which is always non-negative since P. and  are probability vectors. To minimize G (.0.) a 
gradient based minimization search is used. Specifically, a Partial Conjugate Gradient 
(PCG) search (Fletcher and Reeves, 1964; Luenberger, 1984) was found to be significantly 
more efficient than the ordinary steepest descent approach which is so widely used in mul- 
tilayer percepttons. A further discussion supporting this finding is given in (Yair and 
Gersho, 1989). For each set of weights we thus have to be able to compute the gradient 
g= V0 G of the cost function G(.0_). Let us denote the components of the 'instantaneous' 
  
gradient by 
G't=Gt/r. To get the full gradient, the instantaneous components should be accumu- 
lacl while th input patterns are presented (one at a time) to the network, until one full 
cycle through the whole waining set is completed. 
It is straightforward to show that the gradient may be evaluated in a recursive manner, in a 
fashion somewhat similar to the evaluation of the gradient by the backpropagation algo- 
rithm used for feed-forward networks (Rumelhart et. al., 1986). The evaluation of the gra- 
dient is accomplished by propagating the errors e,,t=Q,t-P,,t through a linear net- 
work, termed the Error Propagation Network (EPN), as follows: 
 t (15) 
c It q It  
; : ; = x, . 
n=l 
b'"'t=g(V7't), which can be 
The only new variables required are the b' t, given by 
122 Yair and Gersho 
mlz 
easily obtained by applying the logistic nonlinearity to V . The above en'or propagation 
scheme can also be written in a maix form. Let us define the following notation: 
e. = (ei,,..,e,,,,,..,eu,) r. E will be a diagonal MxM mau-ix whose diagonal is e.. 
B = [b'], a lxM malrix. Let Ja denote a column vector of length M whose com- 
ponents are all l's. Similarly we will define the vectors: GX'= (..,Gzx',..) r, and the 
Hence, the 
malxices: G x = [GX  ] with the appropriate dimensions (for any X,  and 1). 
error propagation can be written as: 
G ; G = G x 
G 'z =-[SB zEz ; G c z =G lt J.u 
(16) 
The EPN is depicted in Figure 3. This is a linear system composed of inner and outer pro- 
ducts between matrices, which can be efficiently implemented using a neural network. The 
gradient g is used in the PCG update formula in which a new set of weights is created and 
is used for the next update iteration. The learning scheme of the BPN is given in Figure 4. 
G  
G  
_x _pxn 
_xl 
0_ 
Figure 3. The Erro Propagation Network (EPN). iag' is 
a diagonalizationoperator. ight' and left' are 
right and left multipliers, respectively. 
Figure 4. The learning scheme. The BPN outputs, _Px, are 
comlxred with the desired probabilities, _Qx  The resulted 
errors, e_x ,propagate through the EPN to fon the gradient 
g_ which is used in the PCG alg. to create the new weights. 
SIMULATION RESULTS 
We now present several simulation results of two-class classification problems with Gaus- 
sian sources. That is, we have two propositions represented by class 0 or class 1. Suppose 
there are L random sources (i=I,..,L) over the input space, some of them are attributed to 
class 0, and the others to class 1. Each time, a source is chosen according to an a priori 
probability P (i). When chosen, the i-th source then emits a pattern _x according to a proba- 
bility density Q i x(x). Measuring a pauem x it is desired to decide upon the most probable 
origin class - in the binary decision problem (MAP classifier), or obtain some estimate to 
Q, the probability that class 1 emitted this pattern - in the soft classification problem. In 
the learning phase, a training set of size N was presented to the network, and the weights 
were iteratively modified by the learning scheme (Figure 4) until convergence. The final 
weights were used to consu'uct a BPN classifier, which was then tested for new input pat- 
terns. The output classification probability of the BPN, P  x(, was compared with the 
true (Bayesian) conditional probability, Q. x(3) which was computed analytically. Results 
are shown in Figures 5-7. In Figure 5, two symmeU'ic equi-probable Gaussian sources with 
substantial overlap were used, one for each class. The network was trained on N = 8 pat- 
terns with gain  = 1. Figure 5b shows how the BPN performs for the problem given in Fig- 
ure 5a. For  = 1, i.e., when the same gain is used in both the training and classification 
phases, there is an almost perfect match between the BPN output, P,. (x), denoted in the 
figures by '=1', and the true curve, Q,.(x). For I=10, the high gain winner-take-all 
The Boltzmann Perceptton Network ! 23 
competition is taking place and the classifier becomes, practically, a binary (yes/no) deci- 
sion network. In Fig. 6 disconnected decision regions were formed by using four sources, 
two of which were attributed to each class. Again, a nearly perfect match can be seen 
between actual (l= 1) and desired (Q,) outputs. Also, the simplicity of making 'hard' 
classification decisions by increasing the gain is again illustrated (l= 10). In Fig. 7 the 
classifier was required to find the boundary (expressed by Q,=0.5) between two 2D 
classes. 
'4i''' I .... I .... I''' 1 
(,.a) 
0.3 
0ol -- 
-4 - 0  4 
put pattern -  
0.5 
0.4 
0.3 
0.2 
0.1 
0.0 
.{'' 
- (6.a) 
-10 
o 1o 
input pattern - x 
(?.a) / 
/ 
/ 
I class 
0 2 4 8 8 10 
1.0 
0.5 
(S.b) , 
Q,/ J=l 
=1 
-2 0 2 
input pattern - x 
(7.b) 
-5 0 6 10 
input pattern - x 
Figure 6. 
-4 4 0 E 4 8 5 
Figure 5. Figure 7. 
Figure $: Classification for Gaussian sources. (5.a) The two sources. (5.b) 'Soft' 05= 1) and 'hard' 
05= 10) classifications versus Q.. J indicates the number of hickie units used. 
Figure 6: Classification for disconnected decision regions. (6.a) The sources used: dashed lines indi- 
cate class 0 and solid lines - class 1. (6.b) Soft (l= 1) and hard (l= 10) classifications versus Qi 
Figure 7: Classification in a 2D space. (7.a) The two classes and the U'ue boundary indicated by 
Qi,. =0.5. (7.b) The boundary found by the BPN. marked by Pi,.=0.5, versus the true one. 
References 
Fletcher, R., Reeves, C.M. (1964). Function minimization by conjugate gradients. Computer J., 7, 149-154. 
Hinton, G.E., Sejnowski T.R., & Ackley D.H. (1984). Boltnnann machines: constraint satisfaction networks that 
learn. Carnesic-Melon Technical Report, CMU-CS-84-119. 
Hopfield, J.J. (1987). Learning algorithms and probability distributions in feed-forward and feed-back networks. 
Proc. Natl. Acad. $i. USA, g4, 8429-8433. 
Kirkpatrick, S., Gelart, C.D., & Vecchi M.P. (1983). Optimization by simulated annealing. Science, 220, 671-680. 
Luenberger, D.G. (1984). Linear and nonlinear prosrammin 8, Addison-Wesley, Reading, Mass. 
Rumlhart, D.E., Hinton, G.E., & Williams RJ. (1986). Learning internal representations by error propagation. In 
D.E. Rumelhart & l.g McClelhnd (Eds.), Parallel Distributed Processing., M1T Pmss/Bra.dford Books. 
Yair. E., & Gersho, A. (1989). The Boltzmann pemcptron network: a soft classifier. Submitted to the Journal of 
Nepal Networks, December, 1988. 
