Bayesian Methods for Mixtures of Experts 
Steve Waterhouse 
Cambridge University 
Engineering Department 
Cambridge CB2 1PZ 
England 
Tel: [+44] 1223 332754 
srw1001@eng.cam.ac.uk 
David MacKay 
Cavendish Laboratory 
Madingley Rd. 
Cambridge CB3 0HE 
England 
Tel: [+44] 1223 337238 
mackay@mrao.cam.ac.uk 
Tony Robinson 
Cambridge University 
Engineering Department 
Cambridge CB2 1PZ 
England. 
Tel: [+44] 1223 332815 
ajr@eng.cam.ac.uk 
ABSTRACT 
We present a Bayesian framework for inferring the parameters of 
a mixture of experts model based on ensemble learning by varia- 
tional free energy minimisation. The Bayesian approach avoids the 
over-fitting and noise level under-estimation problems of traditional 
maximum likelihood inference. We demonstrate these methods on 
artificial problems and sunspot time series prediction. 
INTRODUCTION 
The task of estimating the parameters of adaptive models such as artificial neural 
networks using Maximum Likelihood (ML) is well documented, e.g. Geman, Bienenstock 
& Doursat (1992). ML estimates typically lead to models with high variance, 
a phenomenon known as "over-fitting". ML also yields over-confident predictions; in 
regression problems, for example, ML underestimates the noise level. This problem 
is particularly dominant in models where the ratio of the number of data points in 
the training set to the number of parameters in the model is low. In this paper we 
consider inference of the parameters of the hierarchical mixture of experts (HME) 
architecture (Jordan & Jacobs 1994). This model consists of a series of "experts," 
each modelling different processes assumed to be underlying causes of the data. 
Since each expert may focus on a different subset of the data which may be ar- 
bitrarily small, the possibility of over-fitting of each process is increased. We use 
Bayesian methods (MacKay 1992a) to avoid over-fitting by specifying prior belief 
in various aspects of the model and marginalising over parameter uncertainty. 
The use of regularisation or "weight decay" corresponds to the prior assumption 
that the model should have smooth outputs. This is equivalent to a prior $P(\Theta \mid \alpha)$ on 
the parameters $\Theta$ of the model, where $\alpha$ are the hyperparameters of the prior. Given 
a set of priors we may specify a posterior distribution of the parameters given data 
$D$, 
$$P(\Theta \mid D, \alpha, \mathcal{H}) \propto P(D \mid \Theta, \mathcal{H})\, P(\Theta \mid \alpha, \mathcal{H}), \tag{1}$$
where the variable $\mathcal{H}$ encompasses the assumptions of model architecture, type of 
regularisation used and assumed noise model. Maximising the posterior gives us 
the most probable parameters $\Theta_{MP}$. We may then set the hyperparameters either 
by cross-validation, or by finding the maximum of the posterior distribution of the 
hyperparameters $P(\alpha \mid D)$, also known as the "evidence" (Gull 1989). In this paper 
we describe a method, motivated by the Expectation Maximisation (EM) algorithm 
of Dempster, Laird & Rubin (1977) and the principle of ensemble learning by vari- 
ational free energy minimisation (Hinton & van Camp 1993, Neal & Hinton 1993) 
which achieves simultaneous optimisation of the parameters and hyperparameters 
of the HME. We then demonstrate this algorithm on two simulated examples and a 
time series prediction task. In each task the use of the Bayesian methods prevents 
over-fitting of the data and gives better prediction performance. Before we describe 
this algorithm, we will specify the model and its associated priors. 
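
As a brief worked illustration of the link between weight decay and a prior (not part of the original derivation), assume a single model $f(x, \Theta)$ with Gaussian output noise of precision $\beta$ and a zero-mean Gaussian prior of precision $\alpha$ on the parameters $\Theta$. The function name $f$ and the single shared noise level are assumptions made only for this sketch. Taking the negative logarithm of the posterior then gives:

```latex
\begin{align*}
-\log P(\Theta \mid D, \alpha, \mathcal{H})
  &= -\log P(D \mid \Theta, \mathcal{H}) - \log P(\Theta \mid \alpha, \mathcal{H}) + \text{const} \\
  &= \frac{\beta}{2} \sum_{n=1}^{N} \bigl(y^{(n)} - f(x^{(n)}, \Theta)\bigr)^2
     + \frac{\alpha}{2}\, \Theta^T \Theta + \text{const}.
\end{align*}
```

Under these assumptions, maximising the posterior is the same as minimising the sum of squared errors plus a weight decay penalty on $\Theta^T \Theta$ whose relative strength is $\alpha/\beta$.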
MIXTURES OF EXPERTS 
The mixture of experts architecture (Jordan & Jacobs 1994) consists of a set of 
"experts" which perform local function approximation. The expert outputs are 
combined by a "gate" to form the overall output. In the hierarchical case, the 
experts are themselves mixtures of further experts, thus extending the network in 
a tree structured fashion. The model is a generative one in which we assume that 
data are generated in the domain by a series of J independent processes which 
are selected in a stochastic manner. We specify a set of indicator variables 
$Z = \{z_j^{(n)}\}_{j=1 \ldots J,\; n=1 \ldots N}$, where $z_j^{(n)}$ is 1 if the output $y^{(n)}$ was generated 
by expert $j$ and zero otherwise. Consider the case of regression over a data set 
$D = \{x^{(n)} \in \mathbb{R}^k,\; y^{(n)} \in \mathbb{R}^p,\; n = 1 \ldots N\}$ with $p = 1$. We specify that the conditional 
probability of the scalar output $y^{(n)}$ given the input vector $x^{(n)}$ at exemplar $(n)$ is 
$$P(y^{(n)} \mid x^{(n)}, \Theta) = \sum_{j=1}^{J} P(z_j^{(n)} = 1 \mid x^{(n)}, \xi_j)\, P(y^{(n)} \mid x^{(n)}, w_j, \beta_j), \tag{2}$$
where $\{\xi_j \in \mathbb{R}^k\}$ is the set of gate parameters, and $\{(w_j \in \mathbb{R}^k), \beta_j\}$ the set of expert 
parameters. In this case, $P(y^{(n)} \mid x^{(n)}, w_j, \beta_j)$ is a Gaussian: 
$$\phi_j^{(n)} = P(y^{(n)} \mid x^{(n)}, w_j, \beta_j) = \left(\frac{\beta_j}{2\pi}\right)^{1/2} \exp\left(-\frac{\beta_j}{2}\left(y^{(n)} - \hat{y}_j^{(n)}\right)^2\right), \tag{3}$$
where $1/\beta_j$ is the variance of expert $j$¹, and $\hat{y}_j^{(n)} = f_j(x^{(n)}, w_j)$ is the output of expert 
$j$, giving a probabilistic mixture model. In this paper we restrict the expert output 
to be a linear function of the input, $f_j(x^{(n)}, w_j) = w_j^T x^{(n)}$. We model the action of 
selecting process $j$ with the gate, the outputs of which are given by the softmax 
function of the inner products of the input vector² and the gate parameter vectors. 
The conditional probability of selecting expert $j$ given input $x^{(n)}$ is thus: 
$$g_j^{(n)} = P(z_j^{(n)} = 1 \mid x^{(n)}, \xi_j) = \frac{\exp(\xi_j^T x^{(n)})}{\sum_{i=1}^{J} \exp(\xi_i^T x^{(n)})}. \tag{4}$$
A straightforward extension of this model also gives us the conditional probability 
$h_j^{(n)}$ of expert $j$ having been selected given input $x^{(n)}$ and output $y^{(n)}$, 
$$h_j^{(n)} = P(z_j^{(n)} = 1 \mid y^{(n)}, x^{(n)}, \Theta) = \frac{g_j^{(n)} \phi_j^{(n)}}{\sum_{i=1}^{J} g_i^{(n)} \phi_i^{(n)}}. \tag{5}$$
¹ Although $\beta_j$ is a parameter of expert $j$, in common with MacKay (1992a) we consider 
it as a hyperparameter on the Gaussian noise prior. 
² In all notation, we assume that the input vector is augmented by a constant term, 
which avoids the need to specify a "bias" term in the parameter vectors. 
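
The quantities in equations (2) to (5) can be sketched in a few lines of Python for a single exemplar. This is a minimal illustration assuming linear experts and a flat (non-hierarchical) gate; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def gate_probs(x, xi):
    """Equation (4): softmax over the gate activations xi_j^T x."""
    acts = [dot(xi_j, x) for xi_j in xi]
    m = max(acts)  # subtract the maximum for numerical stability
    e = [math.exp(a - m) for a in acts]
    total = sum(e)
    return [v / total for v in e]

def expert_density(y, x, w_j, beta_j):
    """Equation (3): Gaussian with mean w_j^T x and variance 1 / beta_j."""
    err = y - dot(w_j, x)
    return math.sqrt(beta_j / (2.0 * math.pi)) * math.exp(-0.5 * beta_j * err * err)

def mixture_density(y, x, xi, w, beta):
    """Equation (2): gate-weighted sum of the expert densities."""
    g = gate_probs(x, xi)
    return sum(gj * expert_density(y, x, wj, bj) for gj, wj, bj in zip(g, w, beta))

def responsibilities(y, x, xi, w, beta):
    """Equation (5): posterior probability that each expert generated y."""
    g = gate_probs(x, xi)
    joint = [gj * expert_density(y, x, wj, bj) for gj, wj, bj in zip(g, w, beta)]
    total = sum(joint)
    return [v / total for v in joint]

# Hypothetical toy setting: two experts, input augmented with a constant 1.
x = [1.0, 0.5]
xi = [[0.0, 2.0], [0.0, -2.0]]   # gate parameter vectors
w = [[0.0, 1.0], [0.0, -1.0]]    # expert weight vectors
beta = [4.0, 1.0]                # expert noise precisions
g = gate_probs(x, xi)
h = responsibilities(0.5, x, xi, w, beta)
```

In this toy setting the first expert predicts the observed output exactly, so its posterior responsibility `h[0]` is larger than its prior gate probability `g[0]`.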
PRIORS 
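We assume a separable prior on the parameters $\Theta$ of the model: 
$$P(\Theta \mid \alpha, \mu) = \prod_{j=1}^{J} P(\xi_j \mid \mu)\, P(w_j \mid \alpha_j), \tag{6}$$
where $\{\alpha_j\}$ and $\{\mu\}$ are the hyperparameters for the parameter vectors of the experts 
and the gate respectively. We assume Gaussian priors on the parameters of the 
experts $\{w_j\}$ and the gate $\{\xi_j\}$, for example: 
$$P(w_j \mid \alpha_j) = \left(\frac{\alpha_j}{2\pi}\right)^{k/2} \exp\left(-\frac{\alpha_j}{2}\, w_j^T w_j\right). \tag{7}$$
For simplicity of notation, we shall refer to the set of all smoothness hyperparameters 
as $\alpha = \{\mu, \alpha_j\}$ and the set of all noise level hyperparameters as $\beta = \{\beta_j\}$. 
Finally, we assume Gamma priors on the hyperparameters $\{\mu, \alpha_j, \beta_j\}$ of the priors, 
for example: 
$$P(\log \beta_j \mid \rho_\beta, \nu_\beta) = \frac{1}{\Gamma(\rho_\beta)} \left(\frac{\beta_j}{\nu_\beta}\right)^{\rho_\beta} \exp\left(-\frac{\beta_j}{\nu_\beta}\right), \tag{8}$$
where $\nu_\beta, \rho_\beta$ are the hyper-hyperparameters which specify the range in which we 
expect the noise levels $\beta_j$ to lie. 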
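
As a small sketch of how these priors might be evaluated in practice, the following Python functions compute the logarithms of the Gaussian weight prior and of the Gamma prior expressed as a density over the log of the noise level. The function names and the example values are hypothetical; only the functional forms follow the priors described above.

```python
import math

def log_gaussian_prior(w_j, alpha_j):
    """Log of the Gaussian prior (alpha/2pi)^(k/2) exp(-alpha/2 w^T w)."""
    k = len(w_j)
    sq_norm = sum(v * v for v in w_j)
    return 0.5 * k * math.log(alpha_j / (2.0 * math.pi)) - 0.5 * alpha_j * sq_norm

def log_gamma_prior_on_log_beta(beta_j, rho, nu):
    """Log of the Gamma prior written as a density over log(beta_j)."""
    return rho * math.log(beta_j / nu) - beta_j / nu - math.lgamma(rho)

# Hypothetical values, for illustration only.
lp_w = log_gaussian_prior([0.3, -0.2], alpha_j=1.0)
lp_beta = log_gamma_prior_on_log_beta(2.0, rho=2.0, nu=1.5)
```

Working in the log domain avoids underflow when the prior terms are later added to log likelihood terms.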
INFERRING PARAMETERS USING ENSEMBLE LEARNING 
The EM algorithm was used by Jordan & Jacobs (1994) to train the HME in a 
maximum likelihood framework. In the EM algorithm we specify a complete data set 
$\{D, Z\}$ which includes the observed data $D$ and the set of indicator variables $Z$. Given 
$\Theta^{(m-1)}$, the E step of the EM algorithm computes a distribution $P(Z \mid D, \Theta^{(m-1)})$ over 
$Z$. The M step then maximises the expected value of the complete data likelihood 
$P(D, Z \mid \Theta)$ over this distribution. In the case of the HME, the indicator variables 
$Z = \{\{z_j^{(n)}\}_{j=1}^{J}\}_{n=1}^{N}$ specify which expert was responsible for generating the data at 
each time. 
We now outline an algorithm for the simultaneous optimisation of the parameters 
$\Theta$ and hyperparameters $\alpha$ and $\beta$, using the framework of ensemble learning by 
variational free energy minimisation (Hinton & van Camp 1993). Rather than 
optimising a point estimate of $\Theta$, $\alpha$ and $\beta$, we optimise a distribution over these 
parameters. This builds on Neal & Hinton's (1993) description of the EM algorithm 
in terms of variational free energy minimisation. 
We first specify an approximating ensemble $Q(w, \xi, \alpha, \beta, Z)$ which we optimise so that 
it approximates the posterior distribution $P(w, \xi, \alpha, \beta, Z \mid D, \mathcal{H})$ well. The objective 
function chosen to measure the quality of the approximation is the variational free 
energy, 
$$F(Q) = \int dw\, d\xi\, d\alpha\, d\beta\, dZ\; Q(w, \xi, \alpha, \beta, Z) \log \frac{Q(w, \xi, \alpha, \beta, Z)}{P(w, \xi, \alpha, \beta, Z, D \mid \mathcal{H})}, \tag{9}$$
where the joint probability of parameters $\{w, \xi\}$, hyperparameters $\{\alpha, \beta\}$, missing 
data $Z$ and observed data $D$ is given by 
P(w, , a, ll, Z, DlY-[) = 
J N (n) 
POt) I-[ P(jIIt)P(9)P(wjI9)P([3IPj, ) ]-I (P(z) n) = iI re(n), j)P(Y(n)l(n), wj, )) z; (JO) 
j=l n=l 
The free energy can be viewed as the sum of the negative log evidence $-\log P(D \mid \mathcal{H})$ 
and the Kullback-Leibler divergence between $Q$ and $P(w, \xi, \alpha, \beta, Z \mid D, \mathcal{H})$. $F$ is 
bounded below by $-\log P(D \mid \mathcal{H})$, with equality when $Q = P(w, \xi, \alpha, \beta, Z \mid D, \mathcal{H})$. 
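
The decomposition of the free energy into negative log evidence plus a Kullback-Leibler divergence can be checked numerically on a toy problem. The sketch below assumes a hypothetical model with a single binary latent variable and one fixed observation, so that the integral in the free energy becomes a two-term sum; the joint probabilities are illustrative numbers only.

```python
import math

# Hypothetical toy model: joint[z] = P(z, D | H) for z in {0, 1} and a fixed D.
joint = [0.12, 0.28]
evidence = sum(joint)                      # P(D | H)
posterior = [p / evidence for p in joint]  # P(z | D, H)

def free_energy(q):
    """Discrete analogue of the free energy: sum_z Q(z) log(Q(z) / P(z, D | H))."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, joint) if qz > 0.0)

def kl(q, p):
    """Kullback-Leibler divergence KL(Q || P) between discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0.0)

q = [0.5, 0.5]                    # an arbitrary approximating distribution
f_q = free_energy(q)              # equals -log(evidence) + kl(q, posterior)
f_opt = free_energy(posterior)    # attains the lower bound -log(evidence)
```

Any choice of `q` other than the true posterior gives a free energy strictly above `-math.log(evidence)`, which is why minimising the free energy pulls the approximating distribution towards the posterior.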
We constrain the approximating ensemble $Q$ to be separable in the form 
$Q(w, \xi, \alpha, \beta, Z) = Q(w)\,Q(\xi)\,Q(\alpha)\,Q(\beta)\,Q(Z)$. We find the optimal separable distribution 
$Q$ by considering separately the optimisation of $F$ over each separate ensemble 
component $Q(\cdot)$ with all other components fixed. 
Optimising Qw(w) and Q(). 
As a functional of Qw(w), F is 
J r  ..(n)J r.,(n) 
yj I log Qw(w) 
F: dwQw(w) -w/w/+ nd__14j 2ky -- + 
+ const (11) 
where for any variable a, a denotes f da Q(a) a. Noting that the w dependent terms 
are the log of a posterior distribution and that a divergence f Q log(Q/P) is minimised 
by setting Q = P, we can write down the distribution Qw(w) that minimises this 
expression. For given data and Q, Q, Qz, Q, the optimising distribution Q?t(w) is 
Q?'(w): II op, ( 
Qwj (wj) = II exp - 
J J 
const (12) 
This is a set of J Gaussian distributions with means {}, which can be found 
exactly by quadratic optimisation. We denote the variance covariance matrices of 
Qopt / x 
wj twj) by {Zwj}. The analogous expression for the gates Qpt() is obtained in a 
similar fashion and is given by 
: Q (Y): H exp -''.y .y + n2  y logg n) 
J Y 
(13) 
opt 
We approximate each Q{j (y) by a Gaussian distribution fitted at its maximum 
 = j with variance covariance matrix Z. 
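
For linear experts, the quadratic optimisation for one expert has a closed form: the precision matrix is the prior precision plus the responsibility-weighted sum of outer products of the inputs, and the mean solves the corresponding linear system. The sketch below assumes two-dimensional inputs (a constant term plus one scalar) so that the matrix inverse can be written out by hand; the function name and the example data are hypothetical.

```python
def expert_posterior(X, y, zbar, alpha_bar, beta_bar):
    """Mean and covariance of the Gaussian over one expert's weights.

    Assumes 2-D inputs [1, x]. The precision matrix is
    A = alpha_bar * I + beta_bar * sum_n zbar_n x_n x_n^T,
    the covariance is A^-1 and the mean is A^-1 (beta_bar * sum_n zbar_n y_n x_n).
    """
    a00, a01, a11 = alpha_bar, 0.0, alpha_bar
    b0 = b1 = 0.0
    for (x0, x1), yn, zn in zip(X, y, zbar):
        a00 += beta_bar * zn * x0 * x0
        a01 += beta_bar * zn * x0 * x1
        a11 += beta_bar * zn * x1 * x1
        b0 += beta_bar * zn * yn * x0
        b1 += beta_bar * zn * yn * x1
    det = a00 * a11 - a01 * a01
    cov = [[a11 / det, -a01 / det], [-a01 / det, a00 / det]]
    mean = [cov[0][0] * b0 + cov[0][1] * b1, cov[1][0] * b0 + cov[1][1] * b1]
    return mean, cov

# Hypothetical data lying exactly on y = 1 + 2x, all assigned to this expert.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
mean, cov = expert_posterior(X, y, [1.0, 1.0, 1.0, 1.0], alpha_bar=1e-6, beta_bar=1.0)
```

With a very small prior precision the mean is close to the least squares fit; exemplars with zero responsibility do not influence the result, which is how each expert comes to specialise on a subset of the data.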
Optimising Q(Z) 
By a similar procedure, the optimal distribution QPt(z) is given by 
Q'Pt(Z)= nI][. {exp(sJn))/i:exp(s?))} (14) 
where s)) = fm() -  [(y() - 9))(wftP))2 + m()zl m()] (15) 
2 
and j is the value of j computed above. The standard E-step gives us a distribution 
of Z given a fixed value of parameters and the data, as shown in equation (5). In 
this case, by finding the optimal Qz(Z) we obtain the alternative expression of (15), 
with dependencies on the uncertainty of the experts' predictions. Ideally (if we 
did not made the assumption of a separable distribution Q) Qz might be expected 
to contain an additional effect of the uncertainty in the gate parameters. We can 
introduce this by the method of MacKay (1992b) for marginalising classifiers, in the 
case of binary gates. 
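
A minimal sketch of the responsibility computation in (14) and (15) for a single exemplar is given below. It follows the expression for the activations exactly as stated in (15), assumes linear experts, and uses hypothetical function names and toy values. The extra quadratic term is the predictive variance contributed by the uncertainty in each expert's weights.

```python
import math

def quad_form(x, S):
    """x^T S x for a square matrix S given as nested lists."""
    n = len(x)
    return sum(x[a] * S[a][b] * x[b] for a in range(n) for b in range(n))

def q_z(x, y, xi_hat, w_hat, cov_w, beta_bar):
    """Softmax over the activations of (15) for one exemplar."""
    s = []
    for xi_j, w_j, S_j, b_j in zip(xi_hat, w_hat, cov_w, beta_bar):
        gate_act = sum(a * b for a, b in zip(xi_j, x))
        err = y - sum(a * b for a, b in zip(w_j, x))
        s.append(gate_act - 0.5 * b_j * (err * err + quad_form(x, S_j)))
    m = max(s)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in s]
    total = sum(e)
    return [v / total for v in e]

# Hypothetical toy case: two experts that agree on the prediction, but the
# second has much larger uncertainty in its weights.
x = [1.0, 0.5]
q = q_z(
    x, 0.5,
    xi_hat=[[0.0, 0.0], [0.0, 0.0]],
    w_hat=[[0.0, 1.0], [0.0, 1.0]],
    cov_w=[[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]],
    beta_bar=[2.0, 2.0],
)
```

Here the expert with the more uncertain weights receives the smaller responsibility even though both experts make the same mean prediction. When all weight covariances are zero and the noise levels are equal, the result coincides with the standard E-step of equation (5).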
Optimising Q,(a) and Qo() 
Finally, for the hyperparameter distributions, the optimal values of ensemble func- 
tions give values for a/and j as 
An analogous procedure is used to set the hyperparameters {t} of the gate. 
MAKING PREDICTIONS 
In order to make predictions using the model, we must marginalise over the parameters 
and hyperparameters to get the predictive distribution. We use the optimal 
distributions $Q^{opt}(\cdot)$ to approximate the posterior distribution. 
For the experts, the marginalised outputs are given by $\hat{y}_j^{(N+1)} = f_j(x^{(N+1)}, \hat{w}_j)$, with 
variance $\sigma_{y_j}^2 = x^{(N+1)T} \Sigma_{w_j} x^{(N+1)} + \sigma_j^2$, where $\sigma_j^2 = 1/\bar{\beta}_j$. We may also marginalise 
over the gate parameters (MacKay 1992b) to give marginalised outputs $g_j^{(N+1)}$ for the gates. 
The predictive distribution is then a mixture of Gaussians, with mean and variance 
given by its first and second moments, 
$$\hat{y}^{(N+1)} = \sum_{i=1}^{J} g_i^{(N+1)} \hat{y}_i^{(N+1)}, \qquad \sigma_y^2 = \sum_{i=1}^{J} g_i^{(N+1)} \left( \sigma_{y_i}^2 + \left(\hat{y}_i^{(N+1)}\right)^2 \right) - \left(\hat{y}^{(N+1)}\right)^2. \tag{16}$$
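
The moments of the predictive mixture can be computed directly from the gate outputs and the per-expert means and variances. The following is a minimal sketch using the standard mean and variance of a mixture of Gaussians; the function names and the toy numbers are hypothetical.

```python
def expert_predictive_variance(x, cov_w, beta_bar):
    """x^T Sigma_w x + 1 / beta_bar for one expert."""
    n = len(x)
    quad = sum(x[a] * cov_w[a][b] * x[b] for a in range(n) for b in range(n))
    return quad + 1.0 / beta_bar

def predictive_moments(g, means, variances):
    """Mean and variance of a mixture of Gaussians with mixing weights g."""
    mean = sum(gi * mi for gi, mi in zip(g, means))
    second = sum(gi * (vi + mi * mi) for gi, mi, vi in zip(g, means, variances))
    return mean, second - mean * mean

# Hypothetical toy case: two equally weighted experts with unit variance.
mix_mean, mix_var = predictive_moments([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
```

In this toy case the mixture variance exceeds the variance of either expert, because disagreement between the expert means adds to the predictive uncertainty.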
SIMULATIONS 
Artificial Data 
In order to test the performance of the Bayesian method, we constructed two arti- 
ficial data sets. Both data sets consist of a known function corrupted by additive 
zero mean Gaussian noise. The first data set, shown in Figure 1(a), consists of 
100 points from a piecewise linear function in which the leftmost portion is corrupted 
with noise of variance 3 times greater than the rightmost portion. The 
second data set, shown in Figure 1(b), consists of 100 points from the function 
$g(t) = 4.26(e^{-t} - 4e^{-2t} + 3e^{-3t})$, corrupted by Gaussian noise of constant variance 
0.44. We trained a number of models on these data sets, and they provide a typical 
set of results for the maximum likelihood and Bayesian methods, together with the 
error bars on the Bayesian solutions. The model architecture used was a 6-deep 
binary hierarchy of linear experts. In both cases, the ML solutions tend to overfit 
the noise in the data set. The Bayesian solutions, on the other hand, are both 
smooth functions which are better approximations to the underlying functions. 
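
A data set of the second kind can be generated along the following lines. The function $g(t)$, the number of points and the noise variance are as described above; the input range, the uniform sampling of inputs and the random seed are assumptions made for this sketch, since they are not specified here.

```python
import math
import random

def g(t):
    """The known function g(t) = 4.26 (e^-t - 4 e^-2t + 3 e^-3t)."""
    return 4.26 * (math.exp(-t) - 4.0 * math.exp(-2.0 * t) + 3.0 * math.exp(-3.0 * t))

def make_dataset(n=100, t_min=0.0, t_max=3.0, noise_var=0.44, seed=0):
    """Sample n noisy points; t_min, t_max and seed are assumed, not from the paper."""
    rng = random.Random(seed)
    ts = [rng.uniform(t_min, t_max) for _ in range(n)]
    ys = [g(t) + rng.gauss(0.0, math.sqrt(noise_var)) for t in ts]
    return ts, ys

ts, ys = make_dataset()
```

Note that `random.gauss` takes a standard deviation, so the variance of 0.44 is converted with a square root.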
Figure 1: The effect of regularisation on fitting known functions corrupted with noise. [Panels (a) and (b); curves: function + noise, ML solution, Bayesian solution, error bars.] 
Time Series Prediction 
The Bayesian method was also evaluated on a time series prediction problem. This 
consists of yearly readings of sunspot activity from 1700 to 1979, and was first 
considered in the connectionist community by Weigend, Huberman & Rumelhart 
(1990), who used an MLP with 8 hidden tanh units, to predict the coming year's 
activity based on the activities of the previous 12 years. This data set was chosen 
since it consists of a relatively small number of examples and thus the probability 
of over-fitting sizeable models is large. In previous work, we considered the use of a 
mixture of 7 experts on this problem. However, due to the over-fitting inherent 
in ML, we were constrained to using cross validation to stop the training 
early. This also constrained the selection of the model order, since the branches of 
deep networks tend to become "pinched off" during ML training, resulting in local 
minima during training. The Bayesian method avoids this over-fitting of the gates 
and allows us to use very large models. 
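
Predicting each year from the previous 12 years amounts to sliding a window over the series. The sketch below shows one way to build the lag vectors; the function name is hypothetical and the series is a placeholder rather than the real sunspot record.

```python
def make_lagged(series, lag=12):
    """Return (inputs, targets): each input holds the `lag` values preceding its target."""
    inputs = [series[i - lag:i] for i in range(lag, len(series))]
    targets = [series[i] for i in range(lag, len(series))]
    return inputs, targets

# Placeholder values standing in for the 221 yearly readings from 1700 to 1920.
years = list(range(1700, 1921))
dummy_series = [float(year % 11) for year in years]
X_train, t_train = make_lagged(dummy_series, lag=12)
```

With 221 yearly readings and a lag of 12, this yields 209 input-target pairs, which agrees with the number of training examples quoted for this task.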
Table 1: Single step prediction on the Sunspots data set using a lag vector of 12 years. 
NMSE is the mean squared prediction error normalised by the variance of the entire 
record from 1700 to 1979. The models used were: WHR: Weigend et al.'s MLP result; 
1HME7_CV: mixture of 7 experts trained via maximum likelihood and using a 10% 
cross validation scheme; 8HME2_ML & 8HME2_Bayes: 8-deep binary HME, trained via 
maximum likelihood (ML) and Bayesian method (Bayes). 
MODEL          Train NMSE    Test NMSE    Test NMSE 
               1700-1920     1921-1955    1956-1979 
WHR            0.082         0.086        0.35 
1HME7_CV       0.061         0.089        0.27 
8HME2_ML       0.052         0.162        0.41 
8HME2_Bayes    0.079         0.089        0.26 
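
The NMSE figures in Table 1 follow the definition in the caption: mean squared prediction error divided by the variance of the entire record. A minimal sketch is given below; the function name is hypothetical, and the use of the population variance (dividing by the number of readings rather than one less) is an assumption.

```python
def nmse(targets, predictions, record):
    """Mean squared prediction error normalised by the variance of the whole record."""
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)
    mean = sum(record) / len(record)
    variance = sum((r - mean) ** 2 for r in record) / len(record)
    return mse / variance

# Tiny hypothetical example: the record has variance 5 and the MSE is 2.5.
score = nmse([2.0, 4.0], [1.0, 6.0], record=[0.0, 2.0, 4.0, 6.0])
```

A predictor that always outputs the mean of the record would score close to 1 on data drawn from that record, so values well below 1 indicate useful predictions.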
Table 1 shows the results obtained using a variety of methods on the sunspots 
task. The Bayesian method performs significantly better on the test sets than the 
maximum likelihood method (8HME2_ML), and is competitive with the MLP of 
Weigend et al (WHR). It should be noted that even though the number of param- 
eters in the 8 deep binary HME (4992) used is much larger than the number of 
training examples (209), the Bayesian method still avoids over-fitting of the data. 
This allows us to specify large models and avoids the need for prior architecture 
selection, although in some cases such selection may be advantageous, for example 
if the number of processes inherent in the data is known a priori. 
In our experience with linear experts, the smoothness prior on the output function 
of the expert does not have an important effect; the prior on the gates and the 
Bayesian inference of the noise level are the important factors. We expect that the 
smoothness prior would become more important if the experts used more complex 
basis functions. 
DISCUSSION 
The EM algorithm is a special case of the ensemble learning algorithm presented 
here: the EM algorithm is obtained if we constrain $Q_\Theta(\Theta)$ and $Q_\beta(\beta)$ to be delta 
functions and fix $\alpha = 0$. The Bayesian ensemble works better because it includes 
regularisation and because the uncertainty of the parameters is taken into account 
when predictions are made. It could be of interest in future work to investigate how 
other models trained by EM, such as hidden Markov models, could benefit from the 
ensemble learning approach. 
The Bayesian method of avoiding over-fitting has been shown to lend itself naturally 
to the mixture of experts architecture. The Bayesian approach can be implemented 
practically with only a small computational overhead and gives significantly better 
performance than the ML model. 
References 
Dempster, A. P., Laird, N.M. & Rubin, D. B. (1977), 'Maximum likelihood from 
incomplete data via the EM algorithm', Journal of the Royal Statistical Society, 
Series B 39, 1-38. 
Geman, S., Bienenstock, E. & Doursat, R. (1992), 'Neural networks and the 
bias/variance dilemma', Neural Computation 4(1), 1-58. 
Gull, S. F. (1989), Developments in maximum entropy data analysis, in J. Skilling, 
ed., 'Maximum Entropy and Bayesian Methods, Cambridge 1988', Kluwer, 
Dordrecht, pp. 53-71. 
Hinton, G. E. & van Camp, D. (1993), Keeping neural networks simple by min- 
imizing the description length of the weights, To appear in: Proceedings of 
COLT-93. 
Jordan, M. I. & Jacobs, R. A. (1994), 'Hierarchical Mixtures of Experts and the 
EM algorithm', Neural Computation 6, 181-214. 
MacKay, D. J. C. (1992a), 'Bayesian interpolation', Neural Computation 4(3), 415- 
447. 
MacKay, D. J. C. (1992b), 'The evidence framework applied to classification net- 
works', Neural Computation 4(5), 698-714. 
Neal, R. M. & Hinton, G. E. (1993), 'A new view of the EM algorithm that jus- 
tifies incremental and other variants'. Submitted to Biometrika. Available at 
URL:ftp://ftp.cs.toronto.edu/pub/radford/www. 
Weigend, A. S., Huberman, B. A. & Rumelhart, D. E. (1990), 'Predicting the future: 
a connectionist approach', International Journal of Neural Systems 1, 193-209. 
