Improved Gaussian Mixture Density 
Estimates Using Bayesian Penalty Terms 
and Network Averaging 
Dirk Ormoneit 
Institut fiir Informatik (H2) 
Technische Universitit Miinchen 
80290 Miinchen, Germany 
ormoneit @inf ormatik. tu-muenchen. de 
Volker Tresp 
Siemens AG 
Central Research 
81730 Miinchen, Germany 
Volker. Tresp@zfe.siemens. de 
Abstract 
We compare two regularization methods which can be used to im- 
prove the generalization capabilities of Gaussian mixture density 
estimates. The first method uses a Bayesian prior on the parame- 
ter space. We derive EM (Expectation Maximization) update rules 
which maximize the a posterior parameter probability. In the sec- 
ond approach we apply ensemble averaging to density estimation. 
This includes Breiman's "bagging", which recently has been found 
to produce impressive results for classification networks. 
I Introduction 
Gaussian mixture models have recently attracted wide attention in the neural net- 
work community. Important examples of their application include the training of 
radial basis function classifiers, learning from patterns with missing features, and 
active learning. The appeal of Gaussian mixtures is based to a high degree on the 
applicability of the EM (Expectation Maximization) learning algorithm, which may 
be implemented as a fast neural network learning rule ([Now91], [Orm93]). Severe 
problems arise, however, due to singularities and local maxima in the log-likelihood 
function. Particularly in high-dimensional spaces these problems frequently cause 
the computed density estimates to possess only relatively limited generalization ca- 
pabilities in terms of predicting the densities of new data points. As shown in this 
paper, considerably better generalization can be achieved using regularization. 
Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms 543 
We will compare two regularization methods. The first one uses a Bayesian prior 
on the parameters. By using conjugate priors we can derive EM learning rules 
for finding the MAP (maximum a posteriori probability) parameter estimate. The 
second approach consists of averaging the outputs of ensembles of Gaussian mixture 
density estimators trained on identical or resampled data sets. The latter is a form 
of "bagging" which was introduced by Breiman ([Bre94]) and which has recently 
been found to produce impressive results for classification networks. By using the 
regularized density estimators in a Bayes classifier ([THA93], [HT94], [KL95]), we 
demonstrate that both methods lead to density estimates which are superior to the 
unregularized Gaussian mixture estimate. 
2 Gaussian Mixtures and the EM Algorithm 
Consider theproblem of estimating the probability density of a continuous random 
vector x 6 7 a based on aset x* = {xkll _< k _< rn} ofiid. realizations ofx. As a den- 
sity model we choose the class of Gaussian mixtures p(xlO ) = 
where the restrictions tq > 0 and Ei=l tgi '-- 1 apply. O denotes the parameter 
vector (tq, Iti, Ei):I- The p(xli , Iti, Zi) are multivariate normal densities: 
p(xli, It,, i) - (2,r)- Ii 1-1/2 exp [- 1/2(x - Iti )- 1( __ Iti )]. 
The Gaussian mixture model is well suited to approximate a wide class of continuous 
probability densities. Based on the model and given the data x*, we may formulate 
the log-likelihood as 
l(O) = log P(xklO) - E 1gE mp(xli, iti Ed. 
k=l k=l i=1 ' 
Maximum likelihood parameter estimates 6 may efficiently be computed with the 
EM (Expectation Maximization) algorithm ([DLR77]). It consists of the iterative 
application of the following two steps: 
1. In the E-step, based on the current parameter estimates, the posterior 
probability that unit i is responsible for the generation of pattern x a is 
estimated as 
tqP(x li' iti' Ei) (1) 
h/: Eint=l tq,p(xli', iti,, Ei,) ' 
2. In the M-step, we obtain new parameter estimates (denoted by the prime): 
m Ek=l hi k:rk 
t I E h/ (2) It: , (3) 
tgi -- m k:l En__l h i 
t-- Ekm__l hik(gCk __ Itti)(gCk __ It)t 
-- (4) 
Note that n'i is a scalar, whereas It'i denotes a d-dimensional vector and 
is a d x d matrix. 
It is well known that training neural networks as predictors using the maximum 
likelihood parameter estimate leads to overfitting. The problem of overfitting is 
even more severe in density estimation due to singularities in the log-likelihood 
function. Obviously, the model likelihood becomes infinite in a trivial way if we 
concentrate all the probability mass on one or several samples of the training set. 
544 D. ORMONEIT, V. TRESP 
In a Gaussian mixture this is just the case if the center of a unit coincides with 
one of the data points and E approaches the zero matrix. Figure I compares the 
true and the estimated probability density in a toy problem. As may be seen, 
the contraction of the Gaussians results in (possibly infinitely) high peaks in the 
Gaussian mixture density estimate. A simple way to achieve numerical stability 
is to artificially enforce a lower bound on the diagonal elements of E. This is a 
very rude way of regularization, however, and usually results in low generalization 
capabilities. The problem becomes even more severe in high-dimensional spaces. 
To yield reasonable approximations, we will apply two methods of regularization, 
which will be discussed in the following two sections. 
o? 
Figure 1' True density (tefO and unregu/arized density estimation (right). 
3 Bayesian Regularization 
In this section we propose a Bayesian prior distribution on the Gaussian mixture 
parameters, which leads to a numerically stable version of the EM algorithm. We 
first select a family of prior distributions on the parameters which is conjugate*. 
Selecting a conjugate prior has a number of advantages. In particular, we obtain 
analytic solutions for the posterior density and the predictive density. In our case, 
the posterior density is a complex mixture of densities t. It is possible, however, to 
derive EM-update rules to obtain the MAP parameter estimates. 
A conjugate prior of a single multivariate normal density is a product of a normal 
density N(/ql/2,l-lEi) and a Wishart density Wi(E-llc,fi)([Bun94]). A proper 
conjugate prior for the the mixture weightings/ -- (/l, ...,/) is a Dirichlet density 
D(/17) t. Consequently, the prior of the overall Gaussian mixture is the product 
D(17) 1-Iin___l N(Izilft,l-li)Wi(-ll,fl ). Our goal is to find the MAP parameter 
estimate, that is parameters which assume the maximum of the log-posterior 
lp(O) : Z:=llOgZ=l/qP(xkli,/zi,Ei)+logD(/17 ) 
+ :=111og N(/zil/, ]-lEi)-[-log Wi(E-llc, fi)]. 
As in the unregularized case, we may use the EM-algorithm to find a local maximum 
*A family F of probability distributions on O is said to be conjugateif, for every r E F, 
the posterior r(O[x) also belongs to F ([Rob94]). 
tThe posterior distribution can be written as a sum of n ' simple terms. 
tThose densities are defined as follows (b and c are normalizing constants): 
D(l'r) 
Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms 545 
of lp (6}). The E-step is identical to (1). The M-step becomes 
m Ekm=l hik3jk ._ 111 
, E: h + -  () :  
(6) 
r/ = E, h( x - ;)(x - ;)' + .(; - )(; - )' + 2/ (7) 
E?=i h', + 2.- d 
As typical for conjugate priors, prior knowledge corresponds to a set of artificial 
training data which is also reflected in the EM-update equations. In our experi- 
ments, we focus on a prior on the variances which is implemented by/? : 0, where 
0 denotes the d x d zero matrix. All other parameters we set to "neutral" values: 
7i=1 i:l<i<n, a=(d+l)/2, 1=0, fi=/I d 
I a is the d x d unity matrix. The choice of a introduces a bias which favors large 
variances. The effect of various values of the scalar/ on the density estimate is 
illustrated in figure 2. Note that if/ is chosen too small, overfitting still occurs. If 
it is chosen to large, on the other hand, the model is too constraint to recognize the 
underlying structure. 
O7 
o 
Figure 2: Regularized density estimates (left:/ = 0.05, right:'/ = 0.1). 
Typically, the optimal _value for/ is not known a priori. The simplest procedure 
consists of using that fi which leads to the best performance on a validation set, 
analogous to the determination of the optimal weight decay parameter in neural 
network training. Alternatively, / might be determined according to appropriate 
Bayesian methods ([Mac91]). Either way, only few additional computations are 
required for this method if compared with standard EM. 
4 Averaging Gaussian Mixtures 
In this section we discuss the averaging of several Gaussian mixtures to yield im- 
proved probability density estimation. The averaging over neural network ensembles 
has been applied previously to regression and classification tasks ([PC93]). 
There are several different variants on the simple averaging idea. First, one may 
train all networks on the complete set of training data. The only source of dis- 
agreement between the individual predictions consists in different local solutions 
found by the likelihood maximization procedure due to different starting points. 
Disagreement is essential to yield an improvement by averaging, however, so that 
this proceeding only seems advantageous in cases where the relation between train- 
ing data and weights is extremely non-deterministic in the sense that in training, 
If A is distributed according to Wi(Ala, j?), then E[A -] = (a-(d + 1)/2)-/. In our 
case A is E -, so that E[Ei] -- e./ for a - (d + 1)/2. 
546 D. ORMONEIT, V. TRESP 
different solutions are found from different random starting points. A straightfor- 
ward way to increase the disagreement is to train each network on a resampled 
version of the original data set. If we resample the data without replacement, the 
size of each training set is reduced, in our experiments to 70% of the original. The 
averaging of neural network predictions based on resampling with replacement has 
recently been proposed under the notation "bagging" by Breiman ([Bre94]), who 
has achieved dramatically improved results in several classification tasks. He also 
notes, however, that an actual improvement of the prediction can only result if the 
estimation procedure is relatively unstable. As discussed, this is particularly the 
case for Gaussian mixture training. We therefore expect bagging to be well suited 
for our task. 
5 Experiments and Results 
To assess the practical advantage resulting from regularization, we used the density 
estimates to construct classifiers and compared the resulting prediction accuracies 
using a toy problem and a real-world problem. The reason is that the generaliza- 
tion error of density estimates in terms of the likelihood based on the test data 
is rather unintuitive whereas performance on a classification problem provides a 
good impression of the degree of improvement. Assume we have a set of N labeled 
data z* = {(xk,lk)lk = 1,...,N}, where lk E T: {1,...,C} denotes the class label 
of each input x/. A classifier of new inputs x is yielded by choosing the class l 
with the maximum posterior class-probability p(llx ). The posterior probabilities 
may be derived from the class-conditional data likelihood p(xll ) via Bayes theorem: 
p(llx ) - p(xll)p(l)/p(x ) o p(xll)p(l)o The resulting partitions of the input space are 
optimal for the true p(llx ). A viable way to approximate the posterior p(llx ) is to 
estimate p(xll ) and p(l) from the sample data. 
5.1 Toy Problem 
In the toy classification problem the task is to discriminate the two classes of circu- 
latory arranged data shown in figure 3. We generated 200 data points for each class 
and subdivided them into two sets of 100 data points. The first was used for train- 
ing, the second to test the generalization performance. As a network architecture 
we chose a Gaussian mixture with 20 units. Table I summarizes the results, begin- 
ning with the unregularized Gaussian mixture which is followed by the averaging 
and the Bayesian penalty approaches. The three rows for averaging correspond to 
the results yielded without applying resampling (local max.), with resampling with- 
Figure 3: Toy Classification Task. 
Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms 54 7 
out replacement (70% subsets), and with resampling with replacement (bagging). 
The performances on training and test set are measured in terms of the model log- 
likelihood. Larger values indicate a better performance. We report separate results 
for class A and B, since the densities of both were estimated separately. The final 
column shows the prediction accuracy in terms of the percentage of correctly clas- 
sifted data in the test set. We report the average results from 20 experiments. The 
numbers in brackets denote the standard deviations a of the results. Multiplying a 
with '19;95%[x/ = 0.4680 yields 95% confidence intervals. The best result in each 
category is underlined. 
Algorithm 
unreg. 
Averaging: 
1ocaJ max. 
70% subset 
bagging 
Penalty: 
1 = 0.01 
1 = 0.02 
1 ---- 0.05 
1 = 0.1 
Log-Likelihood 
Training Test Accuracy 
A B A B 
-120.8 (13.3) -120.4 (10.8) -224.9 (32.6) -241.9 (34.1) 
-115.6 (6.0) 
-106.8 (5.8) 
-83.8 (4.9) 
-112.6 (6.6) 
-105.1 (6.7) 
-83.1 (7.1) 
-200.9 (13.9) 
-188.8 (9.5) 
-194.2 (7.3) 
-209.1 (16.3) 
-196.4 (11.3) 
-200.1 (11.3) 
-149.3 (18.5) 
-156.0 (16.5) 
-173.9 (24.3) 
-183.0 (21.9) 
-146.5 (5.9) 
-153.0 (4.8) 
-167.0 (15.8) 
-181.9 (21.1) 
-186.2 (13.9) 
-177.1 (ll.8) 
-182.0 (20.1) 
-184.6 (21.0) 
-182.9(11.6) 
-174.9 (7.0) 
-173.9 (14.3) 
-182.5 (21.1) 
80.6% (2.8) 
81.8% (3.1) 
83.2% (2.9) 
82.6% (3.4) 
83.1% (2.9) 
84.4% (6.3) 
81.5% (5.9) 
78.5% (5.1) 
Table 1: Performances in the toy classification problem . 
As expected, all regularization methods outperform the maximum likelihood ap- 
proach in terms of correct classification. The performance of the Bayesian regu- 
larization is hereby very sensitive to the appropriate choice of the regularization 
parameter 8. Optimality of  with respect to the density prediction and optimality 
with respect to prediction accuracy on the test set roughly coincide (for/ = 0.02). 
Averaging is inferior to the Bayesian approach if an optimal/ is chosen. 
5.2 BUPA Liver Disorder Classification 
As a second task we applied our methods to a real-world decision problem from 
the medical environment. The problem is to detect liver disorders which might 
arise from excessive alcohol consumption. Available information consists of five 
blood tests as well as a measure of the patients' daily alcohol consumption. We 
subdivided the 345 available samples into a training set of 200 and a test set of 145 
samples. Due to the relatively few data we did not try to determine the optimal 
regularization parameter using a validation process and will report results on the 
test set for different parameter values. 
Algorithm 
Accuracy 
unregularized 
64.8 
Bayesian penalty (fl_ = 0.05) 
Bayesian penalty ([ = 0.10) 
Bayesian penalty (/ = 0.20) 
averaging (local maxima) 
averaging (70 % subset) 
averaging (bagging) 
65.5 
66.9 
61.4 
65.5 
72.4 
71.0 % 
Table 2: Performances in the liver disorder classification problem . 
548 D. ORMONEIT, V. TRESP 
The results of our experiments are shown in table 2. Again, both regularization 
methods led to an improvement in prediction accuracy. In contrast to the toy prob- 
lem, the averaged predictor was superior to the Bayesian approach here. Note that 
the resampling led to an improvement of more than five percent points compared 
to unresampled averaging. 
6 Conclusion 
We proposed a Bayesian and an averaging approach to regularize Gaussian mixture 
density estimates. In comparison with the maximum likelihood solution both ap- 
proaches led to considerably improved results as demonstrated using a toy problem 
and a real-world classification task. Interestingly, none of the methods outperformed 
the other in both tasks. This might be explained with the fact that Gaussian mix- 
ture density estimates are particularly unstable in high-dimensional spaces with 
relatively few data. The benefit of averaging might thus be greater in this case. 
Averaging proved to be particularly effective if applied in connection with resam- 
pling of the training data, which agrees with results in regression and classification 
tasks. If compared to Bayesian regularization, averaging is computationally expen- 
sive. On the other hand, Baysian approaches typically require the determination of 
hyper parameters (in our case ), which is not the case for averaging approaches. 
References 
[Bre94] 
[B.n94] 
[DLR77] 
[HT941 
[:L95] 
[Mac91] 
[Now91] 
[Orm93] 
[PC93] 
[Rob94] 
[THA93] 
L. Breiman. Bagging predictors. TechnicaJ report, UC Berkeley, 1994. 
W. Buntine. Operations for learning with graphicaJ models. Journal of Artificial 
Intelligence Research, 2:159-225, 1994. 
A. P. Dempster, N.M. Laird, and D. B. Rubin. Maximum hkehhood from 
incomplete data via the EM aJgorithm. J. Royal Statistical Society B, 1977. 
T. Hastie and R. Tibshirani. Discriminant anaJysis by gaussian mixtures. Tech- 
nical report, AT&T Bell Labs and University of Toronto, 1994. 
N. Kambhatla and T. K. Leen. Classifying with gaussian mixtures and clusters. 
In Advances in Neural Information Processing Systems 7. Morgan Kaufman, 
1995. 
D. MacKay. Bayesian Modelling and Neural Networks. PhD thesis, CaJ. ifornia 
Institute of Technology, Pasadena, 1991. 
S. J. Nowlan. Soft Competitive Adaption: Neural Network Learning Algorithms 
based on Fitting Statistical Mictures. PhD thesis, School of Computer Science, 
Carnegie Mellon University, Pittsburgh, 1991. 
D. Ormoneit. Estimation of probability densities using neuraJ networks. Master's 
thesis, Technische Universitit Mfinchen, 1993. 
M. P. Pertone and L. N. Cooper. When networks disagree: Ensemble methods for 
hybrid NeuraJ networks. In Neural Networks for Speech and Image Processing. 
Chapman Hall, 1993. 
C. P. Robert. The Bayesian Choice. Springer-Verlag, 1994. 
V. Tresp, J. Hollatz, and S. Ahmad. Network structuring and training using 
rule-based knowledge. In Advances in Neural Information Processing Systems 5. 
Morgan Kaufman, 1993. 
