Regression with Input-Dependent Noise: 
A Bayesian Treatment 
Christopher M. Bishop 
C. H. Bshopaston. ac. uk 
Cazhaow S. Qazaz 
qazazcsaston. ac. uk 
Neural Computing Research Group 
Aston University, Birmingham, B4 7ET, U.K. 
http://. ncrg. aston. ac. uk/ 
Abstract 
In most treatments of the regression problem it is assumed that 
the distribution of target data can be described by a deterministic 
function of the inputs, together with additive Gaussian noise hav- 
ing constant variance. The use of maximum likelihood to train such 
models then corresponds to the minimization of a sum-of-squares 
error function. In many applications a more realistic model would 
allow the noise variance itself to depend on the input variables. 
However, the use of maximum likelihood to train such models would 
give highly biased results. In this paper we show how a Bayesian 
treatment can allow for an input-dependent variance while over- 
coming the bias of maximum likelihood. 
1 Introduction 
In regression problems it is important not only to predict the output variables but 
also to have some estimate of the error bars associated with those predictions. An 
important contribution to the error bars arises from the intrinsic noise on the data. 
In most conventional treatments of regression, it is assumed that the noise can be 
modelled by a Gaussian distribution with a constant variance. However, in many 
applications it will be more realistic to allow the noise variance itself to depend on 
the input variables. A general framework for modelling the conditional probability 
density function of the target data, given the input vector, has been introduced in 
the form of mixture density networks by Bishop (1994, 1995). This uses a feed- 
forward network to set the parameters of a mixture kernel distribution, following 
Jacobs et al. (1991). The special case of a single isotropic Gaussian kernel function 
was discussed by Nix and Weigend (1995), and its generalization to allow for an 
arbitrary covariance matrix was given by Williams (1996). 
These approaches, however, are all based on the use of maximum likelihood, which 
can lead to the noise variance being systematically under-estimated. Here we adopt 
an approximate hierarchical Bayesian treatment (MacKay, 1991) to find the most 
probable interpolant and most probable input-dependent noise variance. We com- 
pare our results with maximum likelihood and show how this Bayesian approach 
leads to a significantly reduced bias. 
In order to gain some insight into the limitations of the maximum likelihood ap- 
proach, and to see how these limitations can be overcome in a Bayesian treatment, it 
is useful to consider first a much simpler problem involving a single random variable 
(Bishop, 1995). Suppose that a variable $z$ is known to have a Gaussian distribution, 
but with unknown mean $\mu$ and unknown variance $\sigma^2$. Given a sample $D \equiv \{z_n\}$ 
drawn from that distribution, where $n = 1, \ldots, N$, our goal is to infer values for the 
mean and variance. The likelihood function is given by 
$$p(D|\mu,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (z_n - \mu)^2 \right\}. \qquad (1)$$
A non-Bayesian approach to finding the mean and variance is to maximize the 
likelihood jointly over $\mu$ and $\sigma^2$, corresponding to the intuitive idea of finding the 
parameter values which are most likely to have given rise to the observed data set. 
This yields the standard result 
$$\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} z_n, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (z_n - \hat{\mu})^2. \qquad (2)$$
It is well known that the estimate $\hat{\sigma}^2$ for the variance given in (2) is biased since 
the expectation of this estimate is not equal to the true value 
$$\mathcal{E}[\hat{\sigma}^2] = \frac{N-1}{N} \sigma_0^2 \qquad (3)$$
where $\sigma_0^2$ is the true variance of the distribution which generated the data, and 
$\mathcal{E}[\cdot]$ denotes an average over data sets of size $N$. For large $N$ this effect is small. 
However, in the case of regression problems there is generally a much larger number 
of degrees of freedom in relation to the number of available data points, in which 
case the effect of this bias can be very substantial. 
The problem of bias can be regarded as a symptom of the maximum likelihood 
approach. Because the mean $\hat{\mu}$ has been estimated from the data, it has fitted some 
of the noise on the data and this leads to an under-estimate of the variance. If the 
true mean is used in the expression for $\hat{\sigma}^2$ in (2) instead of the maximum likelihood 
expression, then the estimate is unbiased. 
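The size of this bias is easy to check by simulation. The following is a minimal sketch, assuming NumPy is available; the sample size, number of simulated data sets and random seed are illustrative choices and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
N = 4                            # points per data set; small so the bias is visible
true_var = 1.0                   # variance of the generating distribution
n_sets = 200_000                 # number of simulated data sets

# Each row is one data set drawn from a zero-mean Gaussian.
z = rng.normal(0.0, np.sqrt(true_var), size=(n_sets, N))
mu_hat = z.mean(axis=1, keepdims=True)

# Maximum likelihood variance estimate from (2), using the estimated mean.
var_ml = ((z - mu_hat) ** 2).mean(axis=1)
# The same formula with the true mean (zero) substituted for the estimate.
var_true_mean = (z ** 2).mean(axis=1)

print("average ML estimate:          ", var_ml.mean())
print("prediction (N-1)/N * variance:", (N - 1) / N * true_var)
print("average using the true mean:  ", var_true_mean.mean())
```

With these settings the first two printed numbers should agree closely, while the third should be close to the true variance.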
By adopting a Bayesian viewpoint this bias can be removed. The marginal likelihood 
of $\sigma^2$ should be computed by integrating over the mean $\mu$. Assuming a 'flat' prior 
$p(\mu)$ we obtain 
$$p(D|\sigma^2) = \int p(D|\sigma^2,\mu)\, p(\mu)\, d\mu \qquad (4)$$
$$\propto \frac{1}{\sigma^{N-1}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (z_n - \hat{\mu})^2 \right\}. \qquad (5)$$
Maximizing (5) with respect to $\sigma^2$ then gives 
$$\tilde{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (z_n - \hat{\mu})^2 \qquad (6)$$
which is unbiased. 
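The marginalization in (4) can also be carried out numerically, which gives a direct check on (5) and (6). The sketch below assumes NumPy; the four data values and both grids are hypothetical choices made only for illustration. It integrates the likelihood (1) over the mean on a grid and compares the location of the maximum with the maximum of the conditional likelihood.

```python
import numpy as np

# Illustrative data (hypothetical values, not taken from the paper).
z = np.array([-0.3, 1.2, 0.4, -1.1])
N = z.size
mu_hat = z.mean()
ss = np.sum((z - mu_hat) ** 2)              # sum of squared deviations

var_grid = np.linspace(0.05, 5.0, 1000)     # candidate values of sigma^2
mu_grid = np.linspace(-10.0, 10.0, 2001)    # integration grid for the mean
d_mu = mu_grid[1] - mu_grid[0]

# Sum of squared deviations for every candidate mean, shape (len(mu_grid),).
sq = np.sum((z[None, :] - mu_grid[:, None]) ** 2, axis=1)

# Log of the likelihood (1) on the full grid, shape (len(var_grid), len(mu_grid)).
log_lik = (-0.5 * N * np.log(2.0 * np.pi * var_grid)[:, None]
           - sq[None, :] / (2.0 * var_grid[:, None]))

# Marginal likelihood (4): Riemann sum over the mean under a flat prior.
marginal = np.exp(log_lik).sum(axis=1) * d_mu

# Conditional likelihood: (1) evaluated at the estimated mean.
log_cond = -0.5 * N * np.log(2.0 * np.pi * var_grid) - ss / (2.0 * var_grid)

var_ml = var_grid[np.argmax(log_cond)]
var_marginal = var_grid[np.argmax(marginal)]

print("maximum of conditional likelihood:", var_ml, "  ss/N     =", ss / N)
print("maximum of marginal likelihood:   ", var_marginal, "  ss/(N-1) =", ss / (N - 1))
```

Up to the grid resolution, the conditional likelihood peaks at the sum of squares divided by N and the marginal likelihood peaks at the sum of squares divided by N-1.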
This result is illustrated in Figure 1 which shows contours of $p(D|\mu,\sigma^2)$ together 
with the marginal likelihood $p(D|\sigma^2)$ and the conditional likelihood $p(D|\hat{\mu},\sigma^2)$ 
evaluated at $\mu = \hat{\mu}$. 
[Figure 1 panels: left, contour plot with horizontal axis "mean"; right, curves with horizontal axis "likelihood".] 
Figure 1: The left hand plot shows contours of the likelihood function $p(D|\mu,\sigma^2)$ given 
by (1) for 4 data points drawn from a Gaussian distribution having zero mean and unit 
variance. The right hand plot shows the marginal likelihood function $p(D|\sigma^2)$ (dashed 
curve) and the conditional likelihood function $p(D|\hat{\mu},\sigma^2)$ (solid curve). It can be seen that 
the skewed contours result in a value of $\hat{\sigma}^2$, which maximizes $p(D|\hat{\mu},\sigma^2)$, which is smaller 
than $\tilde{\sigma}^2$ which maximizes $p(D|\sigma^2)$. 
2 Bayesian Regression 
Consider a regression problem involving the prediction of a noisy variable $t$ given 
the value of a vector $\mathbf{x}$ of input variables.¹ Our goal is to predict both a regression 
function and an input-dependent noise variance. We shall therefore consider two 
networks. The first network takes the input vector $\mathbf{x}$ and generates an output 
$y(\mathbf{x}; \mathbf{w})$ which represents the regression function, and is governed by a vector of 
weight parameters $\mathbf{w}$. The second network also takes the input vector $\mathbf{x}$, and 
generates an output function $\beta(\mathbf{x}; \mathbf{u})$ representing the inverse variance of the noise 
distribution, and is governed by a vector of weight parameters $\mathbf{u}$. The conditional 
distribution of target data, given the input vector, is then modelled by a normal 
distribution $p(t|\mathbf{x}, \mathbf{w}, \mathbf{u}) = \mathcal{N}(t|y, \beta^{-1})$. From this we obtain the likelihood function 
$$p(D|\mathbf{w}, \mathbf{u}) = \frac{1}{Z_D} \exp\left( -\sum_{n=1}^{N} \beta_n E_n \right) \qquad (7)$$
where $\beta_n = \beta(\mathbf{x}_n; \mathbf{u})$, 
$$Z_D = \prod_{n=1}^{N} \left( \frac{2\pi}{\beta_n} \right)^{1/2}, \qquad E_n = \frac{1}{2} \left( y(\mathbf{x}_n; \mathbf{w}) - t_n \right)^2 \qquad (8)$$
and $D \equiv \{\mathbf{x}_n, t_n\}$ is the data set. 
¹ For simplicity we consider a single output variable. The extension of this work to 
multiple outputs is straightforward. 
Some simplification of the subsequent analysis is obtained by taking the regression 
function, and $\ln \beta$, to be given by linear combinations of fixed basis functions, as in 
MacKay (1995), so that 
$$y(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \qquad \beta(\mathbf{x}; \mathbf{u}) = \exp\left( \mathbf{u}^T \boldsymbol{\psi}(\mathbf{x}) \right) \qquad (9)$$
where we choose one basis function in each network to be a constant $\phi_0 = \psi_0 = 1$ so 
that the corresponding weights $w_0$ and $u_0$ represent bias parameters. 
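To make the model (9) and the likelihood (7)-(8) concrete, the following sketch evaluates the negative log likelihood for given weight vectors. It assumes NumPy and Gaussian basis functions; the function names, the centres, the width and the example data are hypothetical and serve only as an illustration.

```python
import numpy as np

def gaussian_design(x, centres, width):
    """Design matrix: a constant column (bias) followed by Gaussian basis functions."""
    x = np.asarray(x, dtype=float)
    rbf = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), rbf])

def neg_log_likelihood(w, u, Phi, Psi, t):
    """Negative log of (7): sum_n beta_n E_n + ln Z_D."""
    y = Phi @ w                       # regression function, first part of (9)
    beta = np.exp(Psi @ u)            # inverse noise variance, second part of (9)
    E = 0.5 * (y - t) ** 2            # per-point error from (8)
    log_ZD = 0.5 * np.sum(np.log(2.0 * np.pi / beta))
    return np.sum(beta * E) + log_ZD

# Example usage with illustrative inputs (assumed values).
x = np.linspace(-4.0, 4.0, 10)
t = np.sin(x)
centres = np.linspace(-4.0, 4.0, 4)
width = centres[1] - centres[0]
Phi = gaussian_design(x, centres, width)   # basis for y(x; w)
Psi = gaussian_design(x, centres, width)   # basis for ln beta(x; u)

w = np.zeros(Phi.shape[1])
u = np.zeros(Psi.shape[1])                 # u = 0 corresponds to unit noise variance
print(neg_log_likelihood(w, u, Phi, Psi, t))
```

Taking the exponential in the second network guarantees a positive inverse variance for any value of the weights, which is why the bias weight of that network sets the overall noise level on a log scale.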
The maximum likelihood procedure chooses values $\hat{\mathbf{w}}$ and $\hat{\mathbf{u}}$ by finding a joint 
maximum over $\mathbf{w}$ and $\mathbf{u}$. As we have already indicated, this will give a biased result 
since the regression function inevitably fits part of the noise on the data, leading 
to an over-estimate of $\beta(\mathbf{x})$. In extreme cases, where the regression curve passes 
exactly through a data point, the corresponding estimate of $\beta$ can go to infinity, 
corresponding to an estimated noise variance of zero. 
The solution to this problem has already been indicated in Section 1 and was first 
suggested in this context by MacKay (1991, Chapter 6). In order to obtain an 
unbiased estimate of $\beta(\mathbf{x})$ we must find the marginal distribution of $\beta$, or 
equivalently of $\mathbf{u}$, in which we have integrated out the dependence on $\mathbf{w}$. This leads to a 
hierarchical Bayesian analysis. 
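We begin by defining priors over the parameters $\mathbf{w}$ and $\mathbf{u}$. Here we consider isotropic 
Gaussian priors of the form 
$$p(\mathbf{w}|\alpha_w) \propto \exp\left( -\frac{\alpha_w}{2} \|\mathbf{w}\|^2 \right) \qquad (10)$$
$$p(\mathbf{u}|\alpha_u) \propto \exp\left( -\frac{\alpha_u}{2} \|\mathbf{u}\|^2 \right) \qquad (11)$$
where $\alpha_w$ and $\alpha_u$ are hyper-parameters. At the first stage of the hierarchy, we 
assume that $\mathbf{u}$ is fixed to its most probable value $\mathbf{u}_{MP}$, which will be determined 
shortly. The most probable value of $\mathbf{w}$, denoted by $\mathbf{w}_{MP}$, is then found by 
maximizing the posterior distribution² 
$$p(\mathbf{w}|D, \mathbf{u}_{MP}, \alpha_w) = \frac{p(D|\mathbf{w}, \mathbf{u}_{MP})\, p(\mathbf{w}|\alpha_w)}{p(D|\mathbf{u}_{MP}, \alpha_w)} \qquad (12)$$
where the denominator in (12) is given by 
$$p(D|\mathbf{u}_{MP}, \alpha_w) = \int p(D|\mathbf{w}, \mathbf{u}_{MP})\, p(\mathbf{w}|\alpha_w)\, d\mathbf{w}. \qquad (13)$$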
Taking the negative log of (12), and dropping constant terms, we see that $\mathbf{w}_{MP}$ is 
obtained by minimizing 
$$S(\mathbf{w}) = \sum_{n=1}^{N} \beta_n E_n + \frac{\alpha_w}{2} \|\mathbf{w}\|^2 \qquad (14)$$
where we have used (7) and (10). For the particular choice of model (9) this 
minimization represents a linear problem which is easily solved (for a given $\mathbf{u}$) by 
standard matrix techniques. 
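For the linear model, setting the gradient of (14) to zero gives a weighted, regularized least-squares system. The sketch below shows one way this could be solved, assuming NumPy; the polynomial basis, the data and the inverse variances are hypothetical stand-ins chosen to keep the example short.

```python
import numpy as np

def solve_w_mp(Phi, t, beta, alpha_w):
    """Minimize S(w) in (14) for fixed beta_n by solving the normal equations
    (Phi^T B Phi + alpha_w I) w = Phi^T B t, where B = diag(beta)."""
    A = Phi.T @ (beta[:, None] * Phi) + alpha_w * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ (beta * t))

def S(w, Phi, t, beta, alpha_w):
    """The objective (14)."""
    E = 0.5 * (Phi @ w - t) ** 2
    return np.sum(beta * E) + 0.5 * alpha_w * (w @ w)

# Illustrative problem (all values are assumptions).
rng = np.random.default_rng(0)
x = np.linspace(-4.0, 4.0, 10)
Phi = np.column_stack([np.ones_like(x), x, x ** 2])   # hypothetical basis
t = np.sin(x) + 0.1 * rng.normal(size=x.size)
beta = 1.0 / (0.05 * x ** 2 + 0.01)                   # assumed inverse variances
alpha_w = 0.1

w_mp = solve_w_mp(Phi, t, beta, alpha_w)
grad = Phi.T @ (beta * (Phi @ w_mp - t)) + alpha_w * w_mp
print("w_MP:", w_mp)
print("largest gradient component at w_MP:", np.max(np.abs(grad)))
```

Data points with large inverse variance receive proportionally more weight in the fit, so the regression function is pulled towards regions where the noise is believed to be small.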
At the next level of the hierarchy, we find $\mathbf{u}_{MP}$ by maximizing the marginal posterior 
distribution 
$$p(\mathbf{u}|D, \alpha_u, \alpha_w) = \frac{p(D|\mathbf{u}, \alpha_w)\, p(\mathbf{u}|\alpha_u)}{p(D|\alpha_u, \alpha_w)}. \qquad (15)$$
The term $p(D|\mathbf{u}, \alpha_w)$ is just the denominator from (12) and is found by integrating 
over $\mathbf{w}$ as in (13). For the model (9) and prior (10) this integral is Gaussian and 
can be performed analytically without approximation. Again taking logarithms and 
discarding constants, we have to minimize 
$$M(\mathbf{u}) = \sum_{n=1}^{N} \beta_n E_n + \frac{\alpha_u}{2} \|\mathbf{u}\|^2 - \frac{1}{2} \sum_{n=1}^{N} \ln \beta_n + \frac{1}{2} \ln |\mathbf{A}| \qquad (16)$$
where $|\mathbf{A}|$ denotes the determinant of the Hessian matrix $\mathbf{A}$ given by 
$$\mathbf{A} = \sum_{n=1}^{N} \beta_n \boldsymbol{\phi}(\mathbf{x}_n) \boldsymbol{\phi}(\mathbf{x}_n)^T + \alpha_w \mathbf{I} \qquad (17)$$
and $\mathbf{I}$ is the unit matrix. The function $M(\mathbf{u})$ in (16) can be minimized using 
standard non-linear optimization algorithms. We use scaled conjugate gradients, in 
which the necessary derivatives of $\ln |\mathbf{A}|$ are easily found in terms of the eigenvalues 
of $\mathbf{A}$. 
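One way the derivatives of the log determinant could be obtained from an eigendecomposition is sketched below, assuming NumPy. It uses the standard identity that the derivative of ln|A| with respect to a parameter is the trace of the inverse of A times the derivative of A, together with the fact that under (9) the derivative of beta_n with respect to u_k is beta_n psi_k(x_n). The function name and the random test matrices are hypothetical.

```python
import numpy as np

def log_det_A_and_grad(u, Phi, Psi, alpha_w):
    """Return ln|A| for the matrix A in (17) and its gradient with respect to u."""
    beta = np.exp(Psi @ u)
    A = Phi.T @ (beta[:, None] * Phi) + alpha_w * np.eye(Phi.shape[1])
    lam, V = np.linalg.eigh(A)            # eigenvalues and eigenvectors of A
    log_det = np.sum(np.log(lam))
    proj = Phi @ V                        # components of phi(x_n) along each eigenvector
    q = np.sum(proj ** 2 / lam, axis=1)   # phi(x_n)^T A^{-1} phi(x_n)
    grad = Psi.T @ (beta * q)             # sum_n beta_n psi(x_n) q_n
    return log_det, grad

# Finite-difference check on random inputs (illustrative values only).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 5))
Psi = rng.normal(size=(10, 5))
u = 0.3 * rng.normal(size=5)
alpha_w = 0.1

log_det, grad = log_det_A_and_grad(u, Phi, Psi, alpha_w)
eps = 1e-6
fd = np.zeros_like(u)
for k in range(u.size):
    e = np.zeros_like(u)
    e[k] = eps
    fd[k] = (log_det_A_and_grad(u + e, Phi, Psi, alpha_w)[0]
             - log_det_A_and_grad(u - e, Phi, Psi, alpha_w)[0]) / (2.0 * eps)

print("analytic gradient:         ", grad)
print("finite-difference gradient:", fd)
```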
In summary, the algorithm requires an outer loop in which the most probable value 
$\mathbf{u}_{MP}$ is found by non-linear minimization of (16), using the scaled conjugate 
gradient algorithm. Each time the optimization code requires a value for $M(\mathbf{u})$ or 
its gradient, for a new value of $\mathbf{u}$, the optimum value for $\mathbf{w}_{MP}$ must be found by 
minimizing (14). In effect, $\mathbf{w}$ is evolving on a fast time-scale, and $\mathbf{u}$ on a slow 
time-scale. The corresponding maximum (penalized) likelihood approach consists of a 
joint non-linear optimization over $\mathbf{u}$ and $\mathbf{w}$ of the posterior distribution $p(\mathbf{w}, \mathbf{u}|D)$ 
obtained from (7), (10) and (11). Finally, the hyperparameters are given fixed 
values $\alpha_w = \alpha_u = 0.1$ as this allows the maximum likelihood and Bayesian approaches 
to be treated on an equal footing. 
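The nested structure of the algorithm can be sketched as follows, assuming NumPy. This is not the scaled conjugate gradient implementation used in the paper: for brevity the outer loop is replaced by plain gradient descent with a backtracking line search and finite-difference gradients. The objective follows (16) as printed, and the toy data, function names and iteration count are hypothetical.

```python
import numpy as np

def design(x, centres, width):
    """Constant column plus Gaussian basis functions."""
    rbf = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.size, 1)), rbf])

def inner_solve(u, Phi, Psi, t, alpha_w):
    """Fast time-scale: minimize (14) exactly for the current u."""
    beta = np.exp(Psi @ u)
    A = Phi.T @ (beta[:, None] * Phi) + alpha_w * np.eye(Phi.shape[1])
    w_mp = np.linalg.solve(A, Phi.T @ (beta * t))
    return w_mp, beta, A

def M(u, Phi, Psi, t, alpha_w, alpha_u):
    """Objective (16), with E_n evaluated at w_MP(u)."""
    w_mp, beta, A = inner_solve(u, Phi, Psi, t, alpha_w)
    E = 0.5 * (Phi @ w_mp - t) ** 2
    return (np.sum(beta * E) + 0.5 * alpha_u * (u @ u)
            - 0.5 * np.sum(np.log(beta))
            + 0.5 * np.sum(np.log(np.linalg.eigvalsh(A))))

def finite_diff_grad(f, u, eps=1e-6):
    """Central-difference gradient; each evaluation re-solves for w_MP."""
    grad = np.zeros_like(u)
    for k in range(u.size):
        e = np.zeros_like(u)
        e[k] = eps
        grad[k] = (f(u + e) - f(u - e)) / (2.0 * eps)
    return grad

def fit(Phi, Psi, t, alpha_w=0.1, alpha_u=0.1, n_iter=300):
    """Slow time-scale: descend M(u), starting from u = 0."""
    f = lambda u: M(u, Phi, Psi, t, alpha_w, alpha_u)
    u = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        g = finite_diff_grad(f, u)
        f_old = f(u)
        step = min(1.0, 1.0 / (np.max(np.abs(g)) + 1e-12))
        # Halve the step until M decreases sufficiently (NaN counts as failure).
        while step > 1e-10 and not (f(u - step * g) <= f_old - 1e-4 * step * (g @ g)):
            step *= 0.5
        if step <= 1e-10:
            break
        u = u - step * g
    w_mp, _, _ = inner_solve(u, Phi, Psi, t, alpha_w)
    return w_mp, u

# Toy data (assumed form: sinusoid with noise variance proportional to x^2).
rng = np.random.default_rng(0)
x = rng.uniform(-4.0, 4.0, size=10)
t = np.sin(x) + np.sqrt(0.05 * x ** 2) * rng.normal(size=x.size)
centres = np.linspace(-4.0, 4.0, 4)
Phi = Psi = design(x, centres, centres[1] - centres[0])

w_mp, u_mp = fit(Phi, Psi, t)
print("M at u = 0: ", M(np.zeros(Psi.shape[1]), Phi, Psi, t, 0.1, 0.1))
print("M at u_MP:  ", M(u_mp, Phi, Psi, t, 0.1, 0.1))
```

Because the inner problem is solved exactly inside every evaluation of the objective, the outer optimizer only ever sees a function of the noise-network weights, which is the separation of time-scales described above.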
² Note that the result will be dependent on the choice of parametrization since the 
maximum of a distribution is not invariant under a change of variable. 

3 Results and Discussion 
As an illustration of this algorithm, we consider a toy problem involving one input 
and one output, with a noise variance which has an $x^2$ dependence on the input 
variable. Since the estimated quantities are noisy, due to the finite data set, we 
consider an averaging procedure as follows. We generate 100 independent data sets 
each consisting of 10 data points. The model is trained on each of the data sets in 
turn and then tested on the remaining 99 data sets. Both the $y(x; \mathbf{w})$ and $\beta(x; \mathbf{u})$ 
networks have 4 Gaussian basis functions (plus a bias) with width parameters chosen 
to equal the spacing of the centres. 
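A data-generation step of this kind might be sketched as follows, assuming NumPy. The paper specifies the number of data sets, the number of points per set, the number of basis functions and the rule for their widths; the input range, the use of sin(x) as the sinusoid, the noise scale, the uniform input distribution and the placement of the centres are assumptions made here for illustration.

```python
import numpy as np

def make_data_sets(n_sets=100, n_points=10, x_range=(-4.0, 4.0),
                   noise_scale=0.05, seed=0):
    """Draw independent data sets from a sinusoid with x^2-dependent noise variance.

    x_range, noise_scale, seed and the choice of sin(x) are assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(x_range[0], x_range[1], size=(n_sets, n_points))
    noise_var = noise_scale * x ** 2                  # variance grows as x^2
    t = np.sin(x) + np.sqrt(noise_var) * rng.normal(size=x.shape)
    return x, t, noise_var

def basis_layout(n_basis=4, x_range=(-4.0, 4.0)):
    """Equally spaced centres with width equal to the spacing of the centres."""
    centres = np.linspace(x_range[0], x_range[1], n_basis)
    width = centres[1] - centres[0]
    return centres, width

x, t, noise_var = make_data_sets()
centres, width = basis_layout()
print("data shape:", x.shape, " centres:", centres, " width:", width)

# Train on set i, test on the other 99 sets: index bookkeeping only.
i = 0
test_idx = np.delete(np.arange(x.shape[0]), i)
x_test, t_test = x[test_idx].ravel(), t[test_idx].ravel()
print("test points for training set 0:", x_test.size)
```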
Results are shown in Figure 2. It is clear that the maximum likelihood results are 
biased and that the noise variance is systematically underestimated. By contrast, 
[Figure 2 panels: top row, maximum likelihood; bottom row, Bayesian. Each panel has horizontal axis x.] 
Figure 2: The left hand plots show the sinusoidal function (dashed curve) from which 
the data were generated, together with the regression function averaged over 100 training 
sets. The right hand plots show the true noise variance (dashed curve) together with the 
estimated noise variance, again averaged over 100 data sets. 
the Bayesian results show an improved estimate of the noise variance. This is borne 
out by evaluating the log likelihood for the test data under the corresponding 
predictive distributions. The Bayesian approach gives a log likelihood per data point, 
averaged over the 100 runs, of -1.38. Due to the over-fitting problem, maximum 
likelihood occasionally gives extremely large negative values for the log likelihood 
(when $\beta$ has been estimated to be very large, corresponding to a regression curve 
which passes close to an individual data point). Even omitting these extreme 
values, the maximum likelihood still gives an average log likelihood per data point of 
-17.1 which is substantially smaller than the Bayesian result. 
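The sensitivity of the test log likelihood to an over-estimated inverse variance can be seen from a small calculation. The sketch below assumes NumPy and a plug-in Gaussian predictive distribution with mean y and inverse variance beta at each test point; the function name and the numerical values are hypothetical.

```python
import numpy as np

def mean_log_likelihood(y, beta, t):
    """Average of ln N(t_n | y_n, 1/beta_n) over the test points."""
    return np.mean(0.5 * np.log(beta) - 0.5 * np.log(2.0 * np.pi)
                   - 0.5 * beta * (t - y) ** 2)

# Three test points with identical predictions (illustrative values).
y = np.zeros(3)
t = np.array([0.5, -1.0, 0.8])

beta_ok = np.full(3, 1.0)                 # moderate inverse variance everywhere
beta_over = np.array([1.0, 1.0e4, 1.0])   # one grossly over-estimated inverse variance

print("moderate beta:      ", mean_log_likelihood(y, beta_ok, t))
print("over-estimated beta:", mean_log_likelihood(y, beta_over, t))
```

A single test point at which the inverse variance is far too large contributes a penalty proportional to that inverse variance, which is enough to dominate the average.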
We are currently exploring the use of Markov chain Monte Carlo methods (Neal, 
1993) to perform the integrations required by the Bayesian analysis numerically, 
without the need to introduce the Gaussian approximation or the evidence frame- 
work. Recently, MacKay (1995) has proposed an alternative technique based on 
Gibbs sampling. It will be interesting to compare these various approaches. 
Acknowledgements: This work was supported by EPSRC grant GR/K51792, 
Validation and Verification of Neural Network Systems. 
References 
Bishop, C. M. (1994). Mixture density networks. Technical Report 
NCRG/94/001, Neural Computing Research Group, Aston University, Birm- 
ingham, UK. 
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford Univer- 
sity Press. 
Jacobs, R. A., M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991). Adaptive 
mixtures of local experts. Neural Computation 3 (1), 79-87. 
MacKay, D. J. C. (1991). Bayesian Methods for Adaptive Models. Ph.D. thesis, 
California Institute of Technology. 
MacKay, D. J. C. (1995). Probabilistic networks: new models and new methods. 
In F. Fogelman-Soulié and P. Gallinari (Eds.), Proceedings ICANN'95 
International Conference on Artificial Neural Networks, pp. 331-337. Paris: EC2 
& Cie. 
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo 
methods. Technical Report CRG-TR-93-1, Department of Computer Science, 
University of Toronto, Canada. 
Nix, A.D. and A. S. Weigend (1995). Learning local error bars for nonlinear 
regression. In G. Tesauro, D. S. Touretzky, and T. K. Leen (Eds.), Advances in 
Neural Information Processing Systems, Volume 7, pp. 489-496. Cambridge, 
MA: MIT Press. 
Williams, P.M. (1996). Using neural networks to model conditional multivariate 
densities. Neural Computation 8 (4), 843-854. 
