A mean field algorithm for Bayes learning in large feed-forward neural networks
Manfred Opper 
Institut für Theoretische Physik
Julius-Maximilians-Universität, Am Hubland
D-97074 Würzburg, Germany
opper@physik.uni-wuerzburg.de
Ole Winther 
CONNECT 
The Niels Bohr Institute 
Blegdamsvej 17 
2100 Copenhagen, Denmark 
vintherconnec. nbi. dk 
Abstract 
We present an algorithm which is expected to realise Bayes optimal 
predictions in large feed-forward networks. It is based on mean field 
methods developed within statistical mechanics of disordered systems.
We give a derivation for the single layer perceptron and show
that the algorithm also provides a leave-one-out cross-validation 
test of the predictions. Simulations show excellent agreement with 
theoretical results of statistical mechanics. 
1 INTRODUCTION
Bayes methods have become popular as a consistent framework for regularization 
and model selection in the field of neural networks (see e.g. [MacKay, 1992]). In 
the Bayes approach to statistical inference [Berger,1985] one assumes that the prior 
uncertainty about parameters of an unknown data generating mechanism can be 
encoded in a probability distribution, the so called prior. Using the prior and 
the likelihood of the data given the parameters, the posterior distribution of the 
parameters can be derived from Bayes rule. From this posterior, various estimates 
for functions of the parameter, like predictions about unseen data, can be calculated. 
However, in general, those predictions cannot be realised by specific parameter 
values, but only by an ensemble average over parameters according to the posterior 
probability. 
Hence, exact implementations of Bayes method for neural networks require averages 
over network parameters which in general can be performed by time consuming 
Monte Carlo procedures. There are however useful approximate approaches for 
calculating posterior averages which are based on the assumption of a Gaussian 
form of the posterior distribution [MacKay, 1992]. Under regularity conditions on 
the likelihood, this approximation becomes asymptotically exact when the number 
of data is large compared to the number of parameters. This Gaussian ansatz 
for the posterior may not be justified when the number of examples is small or 
comparable to the number of network weights. A second cause for its failure would 
be a situation where discrete classification labels are produced from a probability 
distribution which is a nonsmooth function of the parameters. This would include 
the case of a network with threshold units learning a noise free binary classification 
problem. 
In this contribution we present an alternative approximate realization of Bayes 
method for neural networks, which is not based on asymptotic posterior normal- 
ity. The posterior averages are performed using mean field techniques known from 
the statistical mechanics of disordered systems. Those are expected to become exact 
in the limit of a large number of network parameters under additional assumptions on 
the statistics of the input data. Our analysis follows the approach of
[Thouless, Anderson & Palmer, 1977] (TAP) as adapted to the simple perceptron
by [Mézard, 1989].
The basic setup of the Bayes method is as follows: We have a training set consisting
of $m$ input-output pairs $D_m = \{(s^\mu, \sigma^\mu),\ \mu = 1, \ldots, m\}$, where
the outputs are generated independently from a conditional probability distribution
$P(\sigma^\mu | w, s^\mu)$. This probability is assumed to describe the output
$\sigma^\mu$ to an input $s^\mu$ of a neural network with weights $w$ subject to a
suitable noise process. If we assume that the unknown parameters $w$ are randomly
distributed with a prior distribution $p(w)$, then according to Bayes theorem our
knowledge about $w$ after seeing $m$ examples is expressed through the posterior
distribution

$$p(w | D_m) = Z^{-1} p(w) \prod_{\mu=1}^{m} P(\sigma^\mu | w, s^\mu) \tag{1}$$

where $Z = \int dw\, p(w) \prod_{\mu=1}^{m} P(\sigma^\mu | w, s^\mu)$ is called the
partition function in statistical mechanics and the evidence in Bayesian terminology.
Taking the average with respect to the posterior eq. (1), which in the following will
be denoted by angle brackets, gives Bayes estimates for various quantities. For
example the optimal predictive probability for an output $\sigma$ to a new input $s$
is given by $P^{\rm Bayes}(\sigma | s) = \langle P(\sigma | w, s) \rangle$.
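As a point of comparison for the mean field approach, posterior averages over
eq. (1) can be estimated by brute force for very small models. The following is a
minimal sketch and not part of the original method: it assumes a standard Gaussian
prior, uses self-normalised importance sampling from that prior, and the names
`posterior_average` and `noise_free_sign` are hypothetical.

```python
import math
import random

def posterior_average(g, likelihood, data, n_weights, n_samples=20000, seed=0):
    """Self-normalised importance-sampling estimate of <g(w)> under eq. (1).

    Weights are drawn from a standard Gaussian prior (an assumption of this
    sketch); each sample is weighted by the product of its likelihood terms.
    """
    rng = random.Random(seed)
    numerator, evidence = 0.0, 0.0
    for _ in range(n_samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(n_weights)]
        weight = 1.0
        for s, sigma in data:
            weight *= likelihood(sigma, w, s)
        numerator += weight * g(w)
        evidence += weight          # estimates Z up to the factor 1/n_samples
    return numerator / evidence

def noise_free_sign(sigma, w, s):
    """Toy likelihood: 1 if sign(w . s) agrees with sigma, else 0."""
    field = sum(wi * si for wi, si in zip(w, s))
    return 1.0 if sigma * field > 0 else 0.0

# One training example in one dimension: input 1.0 with label +1.
data = [([1.0], +1)]
mean_w = posterior_average(lambda w: w[0], noise_free_sign, data, n_weights=1)
p_plus = posterior_average(lambda w: noise_free_sign(+1, w, [1.0]),
                           noise_free_sign, data, n_weights=1)
```

In this toy case the posterior is a half-Gaussian on $w > 0$, so the predictive
probability of the label $+1$ for the input 1.0 is one and the posterior mean is
$\sqrt{2/\pi}$. The cost of such sampling grows quickly with the number of weights,
which is the motivation for approximate schemes.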
In section 2 exact equations for the posterior averaged weights $\langle w \rangle$
are derived for arbitrary networks. In section 3 we specialize these equations to a
perceptron and develop a mean field ansatz in section 4. The resulting system of mean
field equations is presented in section 5. In section 6 we consider Bayes optimal
predictions
and a leave-one-out estimator for the generalization error. We conclude in section 7 
with a discussion of our results. 
2 A RESULT FOR POSTERIOR AVERAGES FROM GAUSSIAN PRIORS
In this section we will derive an interesting equation for the posterior mean of the 
weights for arbitrary networks when the prior is Gaussian. This average of the 
weights can be calculated for the distribution (1) by using the following simple and 
well known result for averages over Gaussian distributions. 
Let $v$ be a Gaussian random variable with zero mean. Then for any function $f(v)$,
we have

$$\langle v f(v) \rangle_G = \langle v^2 \rangle_G \left\langle \frac{df(v)}{dv} \right\rangle_G \tag{2}$$

Here $\langle \ldots \rangle_G$ denotes the average over the Gaussian distribution of
$v$. The relation is easily proved from an integration by parts.
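The Gaussian identity can be checked numerically. The sketch below is an
illustration only: the test function $f(v) = \tanh(v)$, the variance 0.7 and the
helper name `gaussian_average` are choices made here, and the averages are computed
with a plain trapezoidal rule.

```python
import math

def gaussian_average(g, variance, n_points=4001, width=10.0):
    """Average g(v) over a zero-mean Gaussian with the given variance,
    using the trapezoidal rule on [-width*sd, +width*sd]."""
    sd = math.sqrt(variance)
    step = 2.0 * width * sd / (n_points - 1)
    norm = math.sqrt(2.0 * math.pi * variance)
    total = 0.0
    for k in range(n_points):
        v = -width * sd + k * step
        density = math.exp(-0.5 * v * v / variance) / norm
        edge = 0.5 if k in (0, n_points - 1) else 1.0   # trapezoid end weights
        total += edge * density * g(v)
    return total * step

variance = 0.7
# Left-hand side <v f(v)> with f = tanh.
lhs = gaussian_average(lambda v: v * math.tanh(v), variance)
# Right-hand side <v^2> <f'(v)> with f'(v) = 1 / cosh(v)^2.
rhs = variance * gaussian_average(lambda v: 1.0 / math.cosh(v) ** 2, variance)
```

Both sides agree to within the quadrature error, as expected from the integration
by parts argument.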
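In the following we will specialize to an isotropic Gaussian prior
$p(w) = (2\pi)^{-N/2} e^{-\frac{1}{2} w \cdot w}$. In [Opper & Winther, 1996]
anisotropic priors are treated as well. Applying (2) to each component of $w$ and the
function $\prod_{\mu=1}^{m} P(\sigma^\mu | w, s^\mu)$, we get the following equations

$$\langle w \rangle = Z^{-1} \int dw\, w\, p(w) \prod_{\mu=1}^{m} P(\sigma^\mu | w, s^\mu)
  = \sum_{\mu=1}^{m} \frac{\langle \nabla_w P(\sigma^\mu | w, s^\mu) \rangle_\mu}{\langle P(\sigma^\mu | w, s^\mu) \rangle_\mu} \tag{3}$$

Here
$\langle \ldots \rangle_\mu = \frac{\int dw\, p(w) \ldots \prod_{\nu \neq \mu} P(\sigma^\nu | w, s^\nu)}{\int dw\, p(w) \prod_{\nu \neq \mu} P(\sigma^\nu | w, s^\nu)}$
is a reduced average over a posterior where the $\mu$-th example is kept out of the
training set and $\nabla_w$ denotes the gradient with respect to $w$.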
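The step from (2) to the posterior mean of the weights can be spelled out as
follows. This is a sketch of the standard argument; the shorthand
$P_\mu \equiv P(\sigma^\mu | w, s^\mu)$ and the reduced partition function $Z_\mu$
are introduced here for illustration only.

```latex
\begin{align*}
Z \langle w_i \rangle
  &= \int dw\, p(w)\, w_i \prod_{\nu} P_\nu
   = \int dw\, p(w)\, \frac{\partial}{\partial w_i} \prod_{\nu} P_\nu
   && \text{by (2), unit prior variance} \\
  &= \sum_{\mu} \int dw\, p(w) \Big( \prod_{\nu \neq \mu} P_\nu \Big)
     \frac{\partial P_\mu}{\partial w_i}
   = \sum_{\mu} Z_\mu \Big\langle \frac{\partial P_\mu}{\partial w_i} \Big\rangle_\mu , \\
Z &= \int dw\, p(w) \Big( \prod_{\nu \neq \mu} P_\nu \Big) P_\mu
   = Z_\mu \langle P_\mu \rangle_\mu ,
  \qquad Z_\mu \equiv \int dw\, p(w) \prod_{\nu \neq \mu} P_\nu .
\end{align*}
```

Dividing the first relation by $Z$ and using $Z = Z_\mu \langle P_\mu \rangle_\mu$
in each term of the sum gives the ratio of reduced averages for every example.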
3 THE PERCEPTRON 
In the following, we will utilize the fact that for neural networks, the probability (1)
depends only on the so called internal fields $\Delta = \frac{1}{\sqrt{N}} w \cdot s$.
A simple but nontrivial example is the perceptron with $N$ dimensional input vector $s$
and output $\sigma(w, s) = {\rm sign}(\Delta)$. We will generalize the noise free model
by considering label noise in which the output is flipped, i.e. $\sigma \Delta < 0$,
with a probability $(1 + e^{\beta})^{-1}$. (For simplicity, we will assume that $\beta$
is known such that no prior on $\beta$ is needed.) The conditional probability may thus
be written as

$$P(\sigma | \Delta) \equiv P(\sigma | w, s) = \frac{e^{-\beta \Theta(-\sigma \Delta)}}{1 + e^{-\beta}} \tag{4}$$

where $\Theta(x) = 1$ for $x > 0$ and 0 otherwise. Obviously, this is a nonsmooth
function of the weights $w$, for which the posterior will not become Gaussian
asymptotically.
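The label noise model can be written as a short function. This is a minimal sketch;
the function name `output_probability` and the example values are chosen here for
illustration.

```python
import math

def output_probability(sigma, field, beta):
    """P(sigma | field) for a perceptron whose label is flipped with
    probability 1 / (1 + exp(beta))."""
    wrong = 1.0 if sigma * field < 0 else 0.0      # Theta(-sigma * field)
    return math.exp(-beta * wrong) / (1.0 + math.exp(-beta))

beta = 0.5
p_flip = output_probability(-1, 2.0, beta)   # label disagrees with sign(field)
p_keep = output_probability(+1, 2.0, beta)   # label agrees with sign(field)
```

The two probabilities sum to one, and `p_flip` equals $(1 + e^{\beta})^{-1}$. As a
function of the field the probability jumps at zero, which is the nonsmoothness
referred to in the text.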
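For this case (3) reads

$$\langle w \rangle
  = \frac{1}{\sqrt{N}} \sum_{\mu=1}^{m} \frac{\langle P'(\sigma^\mu | \Delta^\mu) \rangle_\mu}{\langle P(\sigma^\mu | \Delta^\mu) \rangle_\mu}\, s^\mu
  = \frac{1}{\sqrt{N}} \sum_{\mu=1}^{m} \frac{\int d\Delta\, f_\mu(\Delta) P'(\sigma^\mu | \Delta)}{\int d\Delta\, f_\mu(\Delta) P(\sigma^\mu | \Delta)}\, s^\mu \tag{5}$$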
$f_\mu(\Delta)$ is the density of $\Delta^\mu = \frac{1}{\sqrt{N}} w \cdot s^\mu$, when
the weights $w$ are randomly drawn from a posterior, where example
$(s^\mu, \sigma^\mu)$ was kept out of the training set. This result states
that the weights are linear combinations of the input vectors. It gives an example 
of the ability of Bayes method to regularize a network model: the effective number 
of parameters will never exceed the number of data points. 
4 MEAN FIELD APPROXIMATION 
So far, no approximations have been made to obtain eqs. (3,5). In general
$f_\mu(\Delta)$ depends on the entire set of data $D_m$ and cannot be calculated
easily. Hence, we look for a useful approximation to these densities.
We split the internal field into its average and fluctuating parts, i.e. we set
$\Delta^\mu = \langle \Delta^\mu \rangle_\mu + v^\mu$, with
$v^\mu = \frac{1}{\sqrt{N}} (w - \langle w \rangle_\mu) \cdot s^\mu$. Our mean field
approximation is based on the assumption of a central limit theorem for the
fluctuating part of the internal field, $v^\mu$, which enters in the reduced average
of eq. (5). This means, we assume that the non-Gaussian fluctuations of $w_i$ around
$\langle w_i \rangle_\mu$, when multiplied by $s_i^\mu$, will sum up to make $v^\mu$ a
Gaussian random variable. The important point here is that for the reduced average,
the $w_i$ are not correlated to the $s_i^\mu$!$^1$
We expect that this Gaussian approximation is reasonable, when $N$, the number
of network weights, is sufficiently large. Following ideas of [Mézard, Parisi &
Virasoro, 1987] and [Mézard, 1989], who obtained mean field equations for a variety of
disordered systems in statistical mechanics, one can argue that in many cases this
assumption may be exactly fulfilled in the 'thermodynamic limit' $m, N \to \infty$
with $\alpha = m/N$ fixed. According to this ansatz, we get

$$f_\mu(\Delta) = \frac{1}{\sqrt{2\pi\lambda_\mu}} \exp\left( -\frac{(\Delta - \langle \Delta^\mu \rangle_\mu)^2}{2\lambda_\mu} \right)$$

in terms of the second moment of $v^\mu$,
$\lambda_\mu := \frac{1}{N} \sum_{i,j} s_i^\mu s_j^\mu \left( \langle w_i w_j \rangle_\mu - \langle w_i \rangle_\mu \langle w_j \rangle_\mu \right)$.
To evaluate (5) we need to calculate the mean $\langle \Delta^\mu \rangle_\mu$ and the
variance $\lambda_\mu$. The first problem is treated easily within the Gaussian
approximation.
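$$\begin{aligned}
\langle \Delta^\mu \rangle_\mu
  &= \langle \Delta^\mu \rangle - \langle v^\mu \rangle \\
  &= \langle \Delta^\mu \rangle - \frac{\langle v^\mu P(\sigma^\mu | \Delta^\mu) \rangle_\mu}{\langle P(\sigma^\mu | \Delta^\mu) \rangle_\mu} \\
  &= \langle \Delta^\mu \rangle - \lambda_\mu \frac{\langle P'(\sigma^\mu | \Delta^\mu) \rangle_\mu}{\langle P(\sigma^\mu | \Delta^\mu) \rangle_\mu}
\end{aligned} \tag{6}$$

In the third line (2) has been used again for the Gaussian random variable $v^\mu$.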
So far, the calculation of the variance $\lambda_\mu$ for general inputs is an open
problem. However, we can make a further reasonable ansatz, when the distribution of
the inputs is known. The following approximation for $\lambda_\mu$ is expected to
become exact in the thermodynamic limit if the inputs of the training set are drawn
independently from a distribution, where all components $s_i$ are uncorrelated and
normalized, i.e. $\overline{s_i} = 0$ and $\overline{s_i s_j} = \delta_{ij}$. The bars
denote expectation over the distribution of inputs. For the generalisation to a
correlated input distribution see [Opper & Winther, 1996].

$^1$ Note that the fluctuations of the internal field with respect to the full
posterior mean (which depends on the input $s^\mu$) are non-Gaussian, because the
different terms in the sum become slightly correlated.

Our basic mean field assumption is that the fluctuations of the $\lambda_\mu$ with the
data set can be neglected so that we can replace them by their averages
$\overline{\lambda_\mu}$. Since the reduced posterior averages are not correlated with
the data $s_i^\mu$, we obtain
$\lambda_\mu \simeq \frac{1}{N} \sum_i \left( \langle w_i^2 \rangle_\mu - \langle w_i \rangle_\mu^2 \right)$.
Finally, we replace the reduced average by the expectation over the full posterior,
neglecting terms of order $1/N$. Using $\sum_i \langle w_i^2 \rangle = N$, which
follows from our choice of the Gaussian prior, we get
$\lambda_\mu \simeq \lambda = 1 - \frac{1}{N} \sum_i \langle w_i \rangle^2$. This
depends only on known quantities.
5 MEAN FIELD EQUATIONS FOR THE PERCEPTRON 
(p'(t` at'))t' 
(5) and (6) give a selfconsistent set of equations for the variable x  _= (p(at'at'))t' 
We finally get 
m 
1 
(w> - (7) 
with 
= (1 - e-a)e-Zt`'/2 (8) 
+ (1- 
1 
(9) 
A=i- 
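The ratio $x_\mu$ can be cross-checked numerically. Under the Gaussian ansatz the
average of the noisy-sign likelihood (4) over a field with mean
$\langle \Delta^\mu \rangle_\mu$ and variance $\lambda$ has a closed form, and
$x_\mu$ is the derivative of its logarithm with respect to that mean. The sketch
below compares the closed form with a finite difference; the function names and
test values are hypothetical.

```python
import math

def gauss_tail(x):
    """H(x): probability that a standard Gaussian variable exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def mean_likelihood(sigma, field_mean, lam, beta):
    """<P(sigma | Delta)> for Delta ~ N(field_mean, lam) and the label noise model."""
    z = sigma * field_mean / math.sqrt(lam)
    a = 1.0 - math.exp(-beta)
    return (math.exp(-beta) + a * gauss_tail(-z)) / (1.0 + math.exp(-beta))

def x_closed_form(sigma, field_mean, lam, beta):
    """<P'>/<P> evaluated analytically under the Gaussian ansatz."""
    z = sigma * field_mean / math.sqrt(lam)
    a = 1.0 - math.exp(-beta)
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi * lam)
    return sigma * a * density / (math.exp(-beta) + a * gauss_tail(-z))

def x_numerical(sigma, field_mean, lam, beta, h=1e-5):
    """Central finite difference of log <P> with respect to the field mean."""
    up = math.log(mean_likelihood(sigma, field_mean + h, lam, beta))
    down = math.log(mean_likelihood(sigma, field_mean - h, lam, beta))
    return (up - down) / (2.0 * h)

analytic = x_closed_form(-1, 0.3, 0.6, 0.5)
numeric = x_numerical(-1, 0.3, 0.6, 0.5)
```

The sign of the result always equals the label, so each example pulls the averaged
weights towards classifying it correctly, with a strength that decays for examples
that are already classified with a large margin.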
These mean field equations can be solved by iteration. It is useful to start with a
small number of data and then to increase the number of data in steps of 1 to 10.
Numerical work shows that the algorithm works well even for small system sizes,
$N \simeq 15$.
6 BAYES PREDICTIONS AND LEAVE-ONE-OUT 
After solving the mean field equations we can make optimal Bayesian classifications
for new data $s$ by choosing the output label with the largest predictive probability.
In case of output noise this reduces to
$\sigma^{\rm Bayes}(s) = {\rm sign} \langle \sigma(w, s) \rangle$. Since the posterior
distribution is independent of the new input vector we can apply the Gaussian
assumption again to the internal field $\Delta$, and obtain
$\sigma^{\rm Bayes}(s) = \sigma(\langle w \rangle, s)$, i.e. for the simple perceptron
the averaged weights implement the Bayesian prediction. This will not be the case for
multi-layer neural networks.
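The reduction to the averaged weights can be made explicit. The sketch below assumes
that the internal field for the new input is Gaussian with mean
$\langle \Delta \rangle = \frac{1}{\sqrt{N}} \langle w \rangle \cdot s$ and some
variance $\lambda_s > 0$; the symbol $\lambda_s$ is introduced here for illustration
only.

```latex
\begin{align*}
\langle \sigma(w,s) \rangle
  &= \langle \mathrm{sign}(\Delta) \rangle
   = 1 - 2 H\!\left( \frac{\langle \Delta \rangle}{\sqrt{\lambda_s}} \right),
  \qquad H(x) = \int_x^\infty \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2}, \\
\mathrm{sign}\, \langle \sigma(w,s) \rangle
  &= \mathrm{sign}\, \langle \Delta \rangle
   = \mathrm{sign}\!\left( \tfrac{1}{\sqrt{N}} \langle w \rangle \cdot s \right)
   = \sigma(\langle w \rangle, s).
\end{align*}
```

The second line uses that $1 - 2H(x)$ is positive exactly when $x > 0$, so the
variance $\lambda_s$ affects the confidence of the prediction but not the predicted
label.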
We can also get an estimate for the generalization error which occurs on the
prediction of new data. The generalization error for the Bayes prediction is defined
by $\epsilon^{\rm Bayes} = \langle \Theta(-\sigma(s) \langle \sigma(w, s) \rangle) \rangle_s$,
where $\sigma(s)$ is the true output and $\langle \ldots \rangle_s$ denotes average
over the input distribution. To obtain the leave-one-out estimator of $\epsilon$ one
removes the $\mu$-th example from the training set and trains the network using only
the remaining $m - 1$ examples. The $\mu$-th example is used for testing. Repeating
this procedure for all $\mu$, an unbiased estimate for the Bayes generalization error
with $m - 1$ training data is obtained as the mean value

$$\epsilon^{\rm Bayes}_{\rm loo} = \frac{1}{m} \sum_{\mu=1}^{m} \Theta\left( -\sigma^\mu \langle \sigma(w, s^\mu) \rangle_\mu \right)$$

which is exactly the type of reduced averages which are calculated within our
approach. Figure 1 shows a result of simulations of our algorithm when the inputs are
uncorrelated and the outputs are generated from a teacher perceptron with fixed
noise rate $\beta$.

Figure 1: Error vs. $\alpha = m/N$ for the simple perceptron with output noise
$\beta = 0.5$ and $N = 50$ averaged over 200 runs. The full lines are the simulation
results (upper curve shows prediction error and the lower curve shows training error).
The dashed line is the theoretical result for $N \to \infty$ obtained from statistical
mechanics [Opper & Haussler, 1991]. The dotted line with larger error bars is the
moving control estimate.
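The leave-one-out estimate can be sketched in a few lines once the reduced mean
fields $\langle \Delta^\mu \rangle_\mu$ are available. The sketch assumes the
Gaussian ansatz, under which the sign of $\langle \sigma(w, s^\mu) \rangle_\mu$
equals the sign of $\langle \Delta^\mu \rangle_\mu$; the function name and the
example numbers are hypothetical.

```python
def leave_one_out_error(labels, cavity_fields):
    """Fraction of examples whose label disagrees with the sign of the mean
    internal field computed with that example left out."""
    errors = sum(1 for sigma, field in zip(labels, cavity_fields)
                 if sigma * field < 0)
    return errors / len(labels)

# Four examples with made-up reduced mean fields.
labels = [+1, -1, +1, +1]
cavity_fields = [0.4, 0.2, -0.1, 1.3]
estimate = leave_one_out_error(labels, cavity_fields)   # two of four disagree
```

No retraining is needed for this estimate, because the reduced averages are already
part of the self-consistent solution.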
7 CONCLUSION 
In this paper we have presented a mean field algorithm which is expected to imple- 
ment a Bayesian optimal classification well in the limit of large networks. We have 
explained the method for the single layer perceptron. An extension to a simple
multilayer network, the so called committee machine with a tree architecture, is
discussed in [Opper & Winther, 1996]. The algorithm is based on a Gaussian assumption
for the distribution of the internal fields, which seems reasonable for large networks.
The main problem so far is the restriction to ideal situations such as a known
distribution of inputs which is not a realistic assumption for real world data. However,
this assumption only entered in the calculation of the variance of the Gaussian field. 
More theoretical work is necessary to find an approximation to the variance which 
is valid in more general cases. A promising approach is a derivation of the mean
field equations directly from an approximation to the free energy $-\ln Z$. Besides a
deeper understanding this would also give us the possibility to use the method with 
the so called evidence framework, where the partition function (evidence) can be 
used to estimate unknown (hyper-) parameters of the model class [Berger, 1985]. It 
will further be important to extend the algorithm to fully connected architectures. 
In that case it might be necessary to make further approximations in the mean field 
method. 
ACKNOWLEDGMENTS 
This research is supported by a Heisenberg fellowship of the Deutsche Forschungs- 
gemeinschaft and by the Danish Research Councils for the Natural and Technical 
Sciences through the Danish Computational Neural Network Center (CONNECT). 
REFERENCES 
Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis,
Springer-Verlag, New York.
MacKay, D. J. (1992) A practical Bayesian framework for backpropagation networks,
Neural Comp. 4, 448.
Mézard, M., Parisi, G. & Virasoro, M. A. (1987) Spin Glass Theory and Beyond,
Lecture Notes in Physics, 9, World Scientific.
Mézard, M. (1989) The space of interactions in neural networks: Gardner's
calculation with the cavity method, J. Phys. A 22, 2181.
Opper, M. & Haussler, D. (1991) in IVth Annual Workshop on Computational
Learning Theory (COLT91), Morgan Kaufmann.
Opper, M. & Winther, O. (1996) A mean field approach to Bayes learning in
feed-forward neural networks, Phys. Rev. Lett. 76, 1964.
Thouless, D. J., Anderson, P. W. & Palmer, R. G. (1977) Solution of 'Solvable model
of a spin glass', Phil. Mag. 35, 593.
