Bayesian averaging is well-temperated 
Lars Kai Hansen 
Department of Mathematical Modelling 
Technical University of Denmark B321 
DK-2800 Lyngby, Denmark 
lkhansen @imm. dtu. dk 
Abstract 
Bayesian predictions are stochastic just like predictions of any other 
inference scheme that generalize from a finite sample. While a sim- 
ple variational argument shows that Bayes averaging is generaliza- 
tion optimal given that the prior matches the teacher parameter 
distribution the situation is less clear if the teacher distribution is 
unknown. I define a class of averaging procedures, the temperated 
likelihoods, including both Bayes averaging with a uniform prior 
and maximum likelihood estimation as special cases. I show that 
Bayes is generalization optimal in this family for any teacher dis- 
tribution for two learning problems that are analytically tractable: 
learning the mean of a Gaussian and asymptotics of smooth learn- 
ers. 
1 Introduction 
Learning is the stochastic process of generalizing from a random finite sample of 
data. Often a learning problem has natural quantitative measure of generalization. 
If a loss function is defined the natural measure is the generalization error, i.e., the 
expected loss on a random sample independent of the training set. Generalizability 
is a key topic of learning theory and much progress has been reported. Analytic 
results for a broad class of machines can be found in the litterature [8, 12, 9, 10] 
describing the asymptotic generalization ability of supervised algorithms that are 
continuously parameterized. Asymptotic bounds on generalization for general ma- 
chines have been advocated by Vapnik [11]. Generalization results valid for finite 
training sets can only be obtained for specific learning machines, see e.g. [5]. A 
very rich framework for analysis of generalization for Bayesian averaging and other 
schemes is defined in [6]. 
Averaging has become popular as a tool for improving generalizability of learning 
machines. In the context of (time series) forecasting averaging has been investigated 
intensely for decades [3]. Neural network ensembles were shown to improve general- 
ization by simple voting in [4] and later work has generalized these results to other 
types of averaging. Boosting, Bagging, Stacking, and Arcing are recent examples 
of averaging procedures based on data resampling that have shown useful see [2] 
for a recent review with references. However, Bayesian averaging in particular is 
attaining a kind of cult status. Bayesian averaging is indeed provably optimal in a 
266 L. K. Hansen 
number various ways (admissibility, the likelihood principle etc) [1]. While it fol- 
lows by construction that Bayes is generalization optimal if given the correct prior 
information, i.e., the teacher parameter distribution, the situation is less clear if 
the teacher distribution is unknown. Hence, the pragmatic Bayesians downplay the 
role of the prior. Instead the averaging aspect is emphasized and "vague" priors are 
invoked. It is important to note that whatever prior is used Bayesian predictions 
are stochastic just like predictions of any other inference scheme that generalize 
from a finite sample. 
In this contribution I analyse two scenarios where averaging can improve gener- 
alizability and I show that the vague Bayes average is in fact optimal among the 
averaging schemes investigated. Averaging is shown to reduce variance at the cost 
of introducing bias, and Bayes happens to implement the optimal bias-variance 
trade-off. 
2 Bayes and generalization 
Consider a model that is smoothly parametrized and whose predictions can be 
described in terms of a density function x. Predictions in the model are based on a 
given training set: a finite sample D N 
-- {xa}a= of the stochastic vector x whose 
density - the teacher - is denoted p(xlOo ). In other words the true density is assumed 
to be defined by a fixed, but unknown, teacher parameter vector 00. The model, 
denoted H, involves the parameter vector 0 and the predictive density is given by 
p(xlD, H) - / p(xl O, H)p(OID, Z)dO 
(1) 
p(OID , H) is the parameter distribution produced in training process. In a maxi- 
mum likelihood scenario this distribution is a delta function centered on the most 
likely parameters under the model for the given data set. In ensemble averaging 
approaches, like boosting bagging or stacking, the distribution is obtained by train- 
ing on resampled traning sets. In a Bayesian scenario, the parameter distribution 
is the posterior distribution, 
p(OID, H ) --- P(DIO, H)P(OIH) 
f p( D[O', H)p( O'lH)dO' 
(2) 
where p(OIH ) is the prior distribution (probability density of parameters if D is 
empty). In the sequel we will only consider one model hence we suppress the model 
conditioning label H. 
The generalization error is the average negative log density (also known as simply 
the "log loss" - in some applied statistics works known as the "deviance") 
r(DlO0) = / -logp(xlD)p(xlOo)dx, (3) 
The expected value of the generalization error for training sets produced by the 
given teacher is given by 
F(00) = / / -logp(x[D)p(x[Oo)dxp(D[Oo)dD. 
(4) 
This does not limit us to conventional density estimation; pattern recognition and 
many functional approximations problems can be formulated as density estimation prob- 
lems as well. 
Bayesian Averaging is Well- Temperated 267 
Playing the game of "guessing a probability distribution" [6] we not only face a 
random training set, we also face a teacher drawn from the teacher distribution 
p(00). The teacher averaged generalization must then be defined as 
r = f r(Oo)p(eo)deo. (5) 
This is the typical generalization error for a random training set from the randomly 
chosen teacher - produced by the model H. The generalization error is minimized 
by Bayes averaging if the teacher distribution is used as prior. To see this, form the 
Lagrangian functional 
[q(xlD)]- f f f-logq(xlD)p(x[Oo)dxp(DlOo)dDp(Oo)dOo+A f q(xlD)dx (6) 
defined on positive functions q(xID ). The second term is used to ensure that q(xlD ) 
is a normalized density in x. Now compute the variational derivative to obtain 
( 1 / 
5q(xlD ) = q(x[D) p(xlO)p(DlO)P(O)dO + '' (7) 
Equating this derivative to zero we recover the predictive distribution of Bayesian 
averaging, 
p(DlO)p(O) 
q(xID ) - p(xlO ) f P(DlO,)p(O,)do, dO, (8) 
where we used that A - f p(DlO)p(O)dO is the appropriate normalization constant. 
It is easily verified that this is indeed the global minimum of the averaged gener- 
alization error. We also note that if the Bayes average is performed with another 
prior than the teacher distribution p(0o), we can expect a higher generalization er- 
ror. The important question from a Bayesian point of view is then: Are there cases 
where averaging with generic priors (e.g. vague or uniform priors) can be shown to 
be optimal? 
3 Temperated likelihoods 
To come closer to a quantative statement about when and why vague Bayes is the 
better procedure we will analyse two problems for which some analytical progress is 
possible. We will consider a one-parameter family of learning procedures including 
both a Bayes and the maximum likelihood procedure, 
p(OI,D,H) = f p(DlO')dO" (9) 
where / is a positive parameter (plying the role of an inverse temperature). The 
family of procedures are all averaging procedures, and/ controls the width of the 
average. Vague Bayes (here used synonymously with Bayes with a uniform prior) 
is recoved for/ - 1, while the maximum posterior procedure is obtained by cooling 
to zero width/ - o. 
In this context the generalization design question can be frased as follows: is there 
an optimal temperature in the family of the temperated likelihoods? 
3.1 Example: 1D normal variates 
Let the teacher distribution be given by 
1 ( 1 (x_00)2) 
p(xlOo ) -  exp 2/r2 
(10) 
268 L. K. Hansen 
The model density is of the same form with 0 unknown and er 2 assumed to be 
known. For N examples the posterior (with a uniform prior) is, 
p(OlD) = 2--- exp - , 
(11) 
with 7 = 1/N Y'o zoo. The temperated likelihood is obtained by raising to the fi'th 
power and normalizing, 
p(O]D, fi) = 2---- 2 exp  . 
(12) 
The predictive distribution is found by integrating w.r.t. 0, 
p(xlD,) = p(x10)p(01D,)d0 = (7- , (la) 
with  =  (1 + 1/N). We note that this distribution is wider for all the averaging 
procedures than it is for maximum likelihood (  ), i.e., less variant. Por very 
small  the predictive distribution is almost independent of the data set, hence 
highly bied. 
It is straightforward to compute the generalization error of the predictive distribu- 
tion for general . Pirst we compute the generalization error for the specific training 
set D, 
r(D, fi, Oo) = -logp(x[D, fi)p(x]Oo)dx = log + 2 ((- O)2 +2), 
(14) 
The average generalization error is then found by averaging w.r.t the sampling 
distribution using  (Oo,a2/N)., 
f  a2(1 ) 
F(fi) = F(D, fi)dDp(DlOo) = log + 2  + i , (15) 
We first note that the generalization error is independent of the teacher 0 param- 
eter, this happened because  is a "location" parameter. The fi-dependency of the 
averaged generalization error is depicted in Figure 1. Solving OF(fi)/Ofi = 0 we find 
that the optimal fi solves 
+1 =a2 +1  fi=l (16) 
Note that this result holds for any N and is independent of the teacher parameter. 
The Bayes averaging at unit temperature is optimal for any given value of 0, hence, 
for any teacher distribution. We may say that the vague Bayes scheme is robust 
to the teacher distribution in this ce. Clearly this is a much stronger optimality 
than the more general result proven above. 
3.2 Bias-variance tradeoff 
It is interesting to decompose the generalization error in Eq. 15 in bias and variance 
components. We follow Heskes [7] and define the bias error as the generalization 
error of the geometric average distribution, 
B(fi) --/-log(x)p(xlOo)dx , 
(17) 
Bayesian Averaging is Well-Temperated 269 
0.7  
0.6 GENERALIZATION ' 
0.5 BAY 
0.1 
0  I 315 ; 4'5 
0 0.5 1 15 2 25 3 
TEMPERATURE 
A04 
V03 
Figure l: Bias-variance trade-off as function of the width of the temperated likeli- 
hood ensemble (temperature - 1/fi) for N - 1. The bias is computed as the gen- 
eralization error of the predictive distribution obtained from the geometric average 
distribution w.r.t. training set fluctuations as proposed by Heskes. The predictive 
distribution produced by Bayesian averaging corresponds to unit temperature (ver- 
tical line) and it achieves the minimal generalization error. Maximum-likelihood 
estimation for reference is recovered as the zero width/temperature limit. 
with 
(z) = Z-l exp (f log[p(z,D)]p(DlOo)dD)  
Inserting from Eq. (13), we find 
Integrating over the teacher distribution we find, 
1 
B(fi) = log + -- 
1 ) 
24(- Oo) . 
(18) 
(19) 
O-2 
24 (20) 
The variance error is given by V(fi) = I'(fi) - B(fi), 
o- 2 
V(fi)- 2No. 
(21) 
We can now quantify the statements above. By averaging a bias is introduced -the 
predictive distribution becomes wider- which decrease the variance contribution 
initially so that the generalization error being the sum of the two decreases. At still 
higher temperatures the bias becomes too strong and the generalization error start 
to increase. The Bayes average at unit temperature is the optimal trade-off within 
the given family of procedures. 
270 L. K. Hansen 
3.3 Asymptotics for smoothly parameterized models 
We now go on to show that a similar result also holds for general learning prob- 
lems in limit of large data sets. We consider a system parameterized by a finite 
dimensional parameter vector 0. For a given large training set and for a smooth 
likelihood function, the temperated likelihood is approximately Gaussian centered 
at the maximum posterior parameters[13], hence the normalized temperated poste- 
rior reads 
P(OlfiD, H) = [ fiNA(D'Oz') exp (----50'A(D,Oz,)50) (22) 
2r 
where 40 = 0- OM, with OM = OM (D) denoting the maximum likelihood solution 
for the given training sample. The second derivative or Hessian matrix is given by 
i  A(x O) (23) 
A(O,O) =  , 
A(x,O) = OOOO' logp(x[O) (24) 
The predictive distribution is given by 
V(l"  = f p(lO)p(Ol,D)dO (5) 
we write p(x[O) = exp(-e(x[O)) and expand e(x[O) around OML to second order, we 
find 
p(x[O)  p(xlOML)exp (--a(xlO)'50- 50'A(xlO)50 ) . (26) 
We are then in position to perform the integration over the posterior to find the 
normalized predictive distribution, 
p(xl,W ) p(xlOM). INA(D)I exp ( x (xlOM),A(xlOM)a(xlOM)) 
INA(D)+A(x)I 
(7) 
Proceeding  above, we compute the generalization error 
r(, 00) = f f -ogp(l, D)p(lOo)dp(DlOo)dD (8) 
For suciently smooth likelihoods, fluctuations in the maximum likelihood param- 
eters will be asymptotic normal, see e.g. [8], and furthermore fluctuations in A(D) 
can be neglected, this means that we can approximate, 
1 
A(x) + A(D)  ( + 1)A0, A0 = A(xlOo)p(xlOo)dx (29) 
where A0 is the averaged Fisher information matrix. With these approximations 
(valid as N  ) the generalization error can be found, 
a i a 1 + v (30) 
r(3, 00)  r() +  log 1 +  1 + 3' 
with d = dim(0) denoting the dimension of the parameter vector. Like in the 1D 
example (Eq. (15)) we find the generalization error is ymptotically independent 
of the teacher parameters. It is minimized for  = 1 and we conclude that Bayes 
is well-temperated in the asymptotics and that this holds for any teacher distri- 
bution. In the Bayes literature this is refered to  the prior is overwhelmed by 
data [1]. Decomposing the errors in bi and variance contributions we find similar 
results  for in 1D example, Bayes introduces the optimal bias by averaging at unit 
temperature. 
Bayesian Averaging is Well-Temperated 271 
4 Discussion 
We have seen two examples of Bayes averaging being optimal, in particular improv- 
ing on maximum likelihood estimation. We found that averaging introduces a bias 
and reduces variance so that the generalization error (being the sum of bias and 
variance) initially decrease. Bayesian averaging at unit temperature is the optimal 
width of the averaging distribution. For larger temperatures (widths) the bias is 
too strong and the generalization error increases. Both examples were special in the 
sense that they lead to generalization errors that are independent of the random 
teacher parameter. This is not generic, of course, rather the generic case is that a 
mis-specified prior can lead to arbitrary large learning catastrophes. 
Acknowledgments 
I thank the organizers of the 1999 Max Planck Institute Workshop on Statistical 
Physics of Neural Networks Michael Biehl, Wolfgang Kinzel and Ido Kanter, where 
this work was initiated. I thank Carl Edward Rasmussen, Jan Larsen, and Manfred 
Opper for stimulating discussions on_Bayesian averaging. This work was funded by 
the Danish Research Councils through the Computational Neural Network Center 
CONNECT and the THOR Center for Neuroinformatics. 
References 
[1] C.P. Robert: The Bayesian Choice - A Decision-Theoretic Motivation. Springer Texts 
in Statistics, Springer Verlag, New York (1994). A. Ohagan: Bayesian Inference. 
Kendall's Advanced Theory of Statistics. Vol 2B. The University Press, Cambridge 
(1994). 
[2] L. Breiman: Using adaptive bagging to debias regressions. Technical Report 547, 
Statistics Dept. U.C. Berkeley, (1999). 
[3] R.T. Clemen Combining forecast: A review and annotated bibliography. Journal of 
Forecasting 5, 559 (1989). 
[4] L.K. Hansen and P. Salamon: Neural Network Ensembles. IEEE Transactions on 
Pattern Analysis and Machine Intelligence, 12, 993-1001 (1990). 
[5] L.K. Hansen: Stochastic Linear Learning: Exact Test and Training Error Averages. 
Neural Networks 6, 393-396, (1993) 
[6] D. Haussler and M. Opper: Mutual Information, Metric Entropy, and Cumulative 
Relative Entropy Risk Annals of Statistics 25 2451-2492 (1997) 
[7] T. Heskes: Bias/Variance Decomposition/or Likelihood-Based Estimators. Neural 
Computation 10, pp 1425-1433, (1998). 
[8] L. Ljung: System Identification: Theory/or the User. Englewood Cliffs, New Jersey: 
Prentice-Hall, (1987). 
[9] J. Moody: "Note on Generalization, Regularization, and Architecture Selection in 
Nonlinear Learning Systems," in B.H. Juang, S.Y. Kung & C.A. Kamm (eds.) Pro- 
ceedings of the first IEEE Workshop on Neural Networks/or Signal Processing, Pis- 
cataway, New Jersey: IEEE, 1-10, (1991). 
[10] N. Murata, S. Yoshizawa & S. Amax/: Network Information Criterion -- Deter- 
mining the Number of Hidden Units for an Artificial Neural Network Model. IEEE 
Transactions on Neural Networks, vol. 5, no. 6, pp. 865-872, 1994. 
[11] V. Vapnik: Estimation of Dependences Based on Empirical Data. Springer-Verlag 
New York (1982). 
[12] H. White, "Consequences and Detection of Misspecified Nonlinear Regression Mod- 
els," Journal of the American Statistical Association, 76(374), 419-433, (1981). 
[13] D.J.C MacKay: Bayesian Interpolation, Neural Computation 4, 415-447, (1992). 
