Bias, Variance and the Combination of 
Least Squares Estimators 
Ronny Meir 
Faculty of Electrical Engineering 
Technion, Haifa 32000 
Israel 
rmer)ee. t echnon. ac. 1 
Abstract 
We consider the effect of combining several least squares estimators 
on the expected performance of a regression problem. Computing 
the exact bias and variance curves as a function of the sample size 
we are able to quantitatively compare the effect of the combination 
on the bias and variance separately, and thus on the expected error 
which is the sum of the two. Our exact calculations demonstrate
that the combination of estimators is particularly useful in the case 
where the data set is small and noisy and the function to be learned 
is unrealizable. For large data sets the single estimator produces 
superior results. Finally, we show that by splitting the data set 
into several independent parts and training each estimator on a 
different subset, the performance can in some cases be significantly 
improved. 
Key words: Bias, Variance, Least Squares, Combination. 
1 INTRODUCTION 
Many of the problems related to supervised learning can be boiled down to the 
question of balancing bias and variance. While reducing bias can usually be ac- 
complished quite easily by simply increasing the complexity of the class of models 
studied, this usually comes at the expense of increasing the variance in such a way 
that the overall expected error (which is the sum of the two) is often increased. 
Thus, many efforts have been devoted to the issue of decreasing variance, while at- 
tempting to keep the concomitant increase in bias as small as possible. One of the 
methods which has become popular recently in the neural network community is 
variance reduction by combining estimators, although the idea has been around in 
the statistics and econometrics literature at least since the late sixties (see Granger 
1989 for a review). Nevertheless, it seems that not much analytic work has been 
devoted to a detailed study of the effect of noise and an effectively finite sample size 
on the bias/variance balance. It is the explicit goal of this paper to study in detail 
a simple problem of linear regression, where the full bias/variance curve can be 
computed exactly for any effectively finite sample size and noise level. We believe 
that this simple and exactly solvable model can afford us insight into more complex 
non-linear problems, which are at the heart of much of the recent work in neural 
networks. 
A further aspect of our work is related to the question of the partitioning of the 
data set between the various estimators. Thus, while most studies assume that each
estimator is trained on the complete data set, it is possible to envisage a situation 
where the data set is broken up into several subsets, using each subset of data to form 
a different estimator. While such a scheme seems wasteful from the bias point of 
view, we will see that in fact it produces superior forecasts in some situations. This, 
perhaps surprising, result is due to a large decrease in variance resulting from the
independence of the estimators, in the case where the data subsets are independent. 
2 ON THE COMBINATION OF ESTIMATORS 
The basic objective of regression is the following: given a finite training set, D,
composed of n input/output pairs, $D = \{(x_\mu, y_\mu)\}_{\mu=1}^{n}$, drawn according to an
unknown distribution P(x, y), find a function ('estimator'), f(x; D), which 'best'
approximates y. Using the popular mean-squared error criterion and taking expectations
with respect to the data distribution one finds the well-known separation of the
error into bias and variance terms respectively (Geman et al. 1992)
$$\epsilon(x) = \left(E_D f(x; D) - E[y|x]\right)^2 + E_D\left[f(x; D) - E_D f(x; D)\right]^2 \qquad (1)$$
We consider a data source of the form $y = g(x) + \eta$, where the 'target' function g(x)
is an unknown (and potentially non-linear) function and $\eta$ is a Gaussian random
variable with zero mean and variance $\sigma^2$. Clearly then $E[y|x] = g(x)$.
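The decomposition in Eq. (1) can be checked numerically. The following is a minimal sketch and not part of the original analysis: the tanh target, the sizes (n = 30, d = 5), the noise level and the test input `x0` are all illustrative assumptions. The data set is redrawn many times, a least-squares estimator is fitted each time, and the two terms are estimated at the fixed input.

```python
import numpy as np

def bias_variance_at_point(x0, g, n=30, d=5, sigma=0.5, n_datasets=2000, seed=0):
    """Monte Carlo estimate of the two terms of Eq. (1) at a fixed input x0."""
    rng = np.random.default_rng(seed)
    w0 = rng.standard_normal(d)            # hypothetical fixed 'teacher' vector
    target = g(w0 @ x0)                    # E[y|x0] = g(x0) for zero-mean noise
    preds = np.empty(n_datasets)
    for t in range(n_datasets):
        # draw a fresh data set D and fit a linear least-squares estimator
        X = rng.standard_normal((n, d)) / np.sqrt(d)
        y = g(X @ w0) + sigma * rng.standard_normal(n)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds[t] = w_hat @ x0
    bias = (preds.mean() - target) ** 2    # (E_D f - E[y|x])^2
    variance = preds.var()                 # E_D (f - E_D f)^2
    mse = np.mean((preds - target) ** 2)   # error measured against E[y|x]
    return bias, variance, mse

x0 = np.ones(5) / np.sqrt(5)
bias, variance, mse = bias_variance_at_point(x0, np.tanh)
print(f"bias={bias:.4f}  variance={variance:.4f}  sum={bias + variance:.4f}  mse={mse:.4f}")
```

For sample moments the identity bias + variance = mse holds up to rounding, so the printed sum and mse should agree.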
In the usual scenario for parameter estimation one uses the complete data set, D, 
to form an estimator f(x; D). In this paper we consider the case where the data set 
nc 
D is broken up into K subsets (not necessarily disjoint), such that D =, ,k=._. , 
and a separate estimator is found for each subset. The full estimator is then given 
by the linear combination (Granger 1989) 
K 
k=l 
The optimal values of the parameters $b_k$ can be easily obtained if the data
distribution, P(x,y), is known, by simply minimizing the mean-squared error (Granger
1989). In the more typical case where this distribution is unknown, one may
resort to other schemes such as least-squares fitting for the parameter vector
$b = \{b_1, \ldots, b_K\}$. The bias and variance of the combined estimator can be simply
expressed in this case, and are given by
$$B(x) = \left(\sum_{k=1}^{K} b_k \overline{f_k(x)} - g(x)\right)^2 ; \qquad V(x) = \sum_{k,k'} b_k b_{k'} \left[\overline{f_k(x) f_{k'}(x)} - \overline{f_k(x)}\;\overline{f_{k'}(x)}\right] \qquad (3)$$
where the overbars denote an average with respect to the data. It is immediately 
apparent that the variance term is composed of two contributions. The first term, 
corresponding to $k = k'$, simply computes a weighted average of the single estimator
variances, while the second term measures the average covariance between the
different estimators. While the first term in the variance can be seen to decay as $1/K$
in the case where all the weights $b_k$ are of the same order of magnitude, the second
term is finite unless the covariances between estimators are very small. It would 
thus seem beneficial to attempt to make the estimators as weakly correlated as pos- 
sible in order to decrease the variance. Observe that in the extreme case where the 
data sets are independent of each other, the second term in the variance vanishes 
identically. Note that the bias term depends only on single estimator properties 
and can thus be computed from the theory of the single estimator. As mentioned 
above, however, the second term in the variance expression explicitly depends on 
correlations between the different estimators, and thus requires the computation of 
quantities beyond those of single estimators. 
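The split of the combined variance into a single-estimator part and a cross-covariance part can be illustrated with a short simulation. This is a sketch under assumed settings (a linear target, K = 4 estimators, uniform weights, each estimator fitted on its own independently drawn subset); none of these numbers come from the paper. The covariance matrix of the K predictions at a fixed input is estimated over many redraws of the data.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_sub, K, sigma, n_datasets = 5, 20, 4, 0.5, 2000   # illustrative sizes
w0 = rng.standard_normal(d)                            # hypothetical teacher vector
x0 = np.ones(d) / np.sqrt(d)                           # fixed test input
b = np.full(K, 1.0 / K)                                # uniform combination weights

preds = np.empty((n_datasets, K))
for t in range(n_datasets):
    for k in range(K):
        # each estimator sees its own independent subset of the data
        X = rng.standard_normal((n_sub, d)) / np.sqrt(d)
        y = X @ w0 + sigma * rng.standard_normal(n_sub)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        preds[t, k] = w_hat @ x0

C = np.cov(preds, rowvar=False, bias=True)   # K x K covariance of the predictions
var_combined = (preds @ b).var()             # variance of the combined prediction
diag_term = np.sum(b**2 * np.diag(C))        # k = k' contribution
offdiag_term = b @ C @ b - diag_term         # k != k' contribution

print(f"combined variance : {var_combined:.5f}")
print(f"diagonal term     : {diag_term:.5f}")
print(f"off-diagonal term : {offdiag_term:.5f}")
```

With independent subsets the off-diagonal term is zero in expectation, so in the simulation it should be small compared with the diagonal term, and the diagonal term itself is roughly 1/K of a single estimator's variance.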
3 THE SINGLE LINEAR ESTIMATOR 
Before considering the case of a combination of estimators, we first review the case
of a single linear estimator, given by $f(x; D) = \hat{w}^T \cdot x$, where $\hat{w}$ is estimated from
the data set D. Following Bös et al. (1993) we further assume that the data arises
through an equation of the form $y = g(x) + \eta$ with $g(x) = g(w_0^T \cdot x)$. Looking back
at equations (3) it is clear that the bias and variance are explicit functions of x
and the weight vector $w_0$. In order to remove the explicit dependence we compute
in what follows expectations with respect to the probability distribution of x and
$w_0$, denoted respectively by $E_x[\cdot]$ and $E_0[\cdot]$. Thus, we define the averaged bias and
variance by $B = E_0 E_x[B(x; w_0)]$ and $V = E_0 E_x[V(x; w_0)]$, and the expected error
is then $\epsilon = B + V$.
In this work we consider least-squares estimation which corresponds to minimizing
the empirical error, $\epsilon_{emp}(w, D) = \|Xw - Y\|^2$, where X is the $n \times d$ data matrix,
Y is the $n \times 1$ output vector and w is a $d \times 1$ weight vector. The components of
the 'target' vector Y are given by $y_\mu = g(w_0^T \cdot x_\mu) + \eta_\mu$, where $\eta_\mu$ are i.i.d. normal
random variables with zero mean and variance $\sigma^2$. Note that while we take the
estimator itself to be linear we allow the target function g(.) to be non-linear. This
is meant to model the common situation where the model we are trying to fit is
inadequate, since the correct model (even if it exists) is usually unknown.
Thus, the least squares estimator is given by $\hat{w} \in \arg\min_w \epsilon_{emp}(w, D)$. Since in
this case the error-function is quadratic it possesses either a unique global minimum
or a degenerate manifold of minima, in the case where the Hessian matrix, $X^T X$,
is singular.
The solution to the least squares problem is well known (see for example Scharf 
1991), and will be briefly summarized. When the number of examples, n, is smaller 
than the input dimension, d, the problem is underdetermined and there are many 
solutions with zero empirical error. The solutions can be written out explicitly in 
the form 
r = XT(XXT)-Y + (I- XT(XXT)-X) V (n < d), (4) 
where V is an arbitrary d-dimensional vector. It should be noted that any vector w 
satisfying this equation (and thus any least-squares estimator) becomes singular as 
n approaches d from below, since the matrix XX T becomes singular. The minimal 
norm solution, often referred to as the Moore-Penrose solution, is given in this case 
by the choice V = 0. It is common in the literature to neglect the study of the 
underdetermined regime since the solution is not unique in this case. We however 
will pay specific attention to this case, corresponding to the often prevalent situa- 
tion where the amount of data is small, attempting to show that the combination 
of estimators approach can significantly improve the quality of predictors in this 
regime. Moreover, many important inverse problems in signal processing fall into 
this category (Scharf 1991). 
In the overdetermined case, n > d (assuming the matrix X to be of full rank), 
a zero error solution is possible only if the function g(.) is linear and there is no 
noise, namely $E[\eta^2] = 0$. In any other case, the problem becomes unrealizable and
the minimum error is non-zero. In any event, in this regime the unique solution
minimizing the empirical error is given by
$$\hat{w} = (X^T X)^{-1} X^T Y \qquad (n > d). \qquad (5)$$
It is easy to see that this estimator is unbiased for linear g(.).
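The two closed-form solutions can be compared against standard NumPy routines. This is an illustrative sketch with arbitrary random data; the sizes and variable names are assumptions. It shows that Eq. (4) with V = 0 coincides with the pseudo-inverse solution, that any other V still fits the data exactly but with a larger norm, and that Eq. (5) matches an ordinary least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8

# Underdetermined case (n < d): Eq. (4) with V = 0 is the minimum-norm solution.
n = 5
X = rng.standard_normal((n, d)) / np.sqrt(d)
Y = rng.standard_normal(n)
w_mp = X.T @ np.linalg.solve(X @ X.T, Y)      # X^T (X X^T)^{-1} Y
w_pinv = np.linalg.pinv(X) @ Y                # Moore-Penrose pseudo-inverse

# Any other V adds a component from the null space of X:
# the data are still interpolated, but the norm grows.
V = rng.standard_normal(d)
P_null = np.eye(d) - X.T @ np.linalg.solve(X @ X.T, X)
w_other = w_mp + P_null @ V

# Overdetermined case (n > d): Eq. (5), the unique least-squares solution.
n2 = 20
X2 = rng.standard_normal((n2, d)) / np.sqrt(d)
Y2 = rng.standard_normal(n2)
w_ols = np.linalg.solve(X2.T @ X2, X2.T @ Y2)  # (X^T X)^{-1} X^T Y
w_lstsq, *_ = np.linalg.lstsq(X2, Y2, rcond=None)

print(np.allclose(w_mp, w_pinv), np.allclose(X @ w_other, Y))
print(np.linalg.norm(w_mp), np.linalg.norm(w_other))
print(np.allclose(w_ols, w_lstsq))
```

Solving the linear system with `np.linalg.solve` rather than forming an explicit inverse is a numerical convenience and does not change the formulas.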
In order to compute the bias and variance for this model we use Eqs. (3) with
$K = 1$ and $b_1 = 1$. In order to actually compute the expectations with respect to
x and the weight vector $w_0$ we assume in what follows that the random vector x
is distributed according to a multi-dimensional normal distribution of zero mean
and covariance matrix (1/d)I. The vector $w_0$ is similarly distributed with unit
covariance matrix. The reason for the particular scaling chosen for the covariance
matrices will become clear below. In the remainder of the paper we will be concerned
with exact calculations in the so called thermodynamic limit: $n, d \to \infty$ with
$\alpha = n/d$ finite. This limit is particularly useful in that the central limit theorem allows
one to make precise statements about the behavior of the system, for an effectively
finite sample size, $\alpha$. We note in passing that in the thermodynamic limit, $d \to \infty$,
we have $\sum_i x_i^2 \to 1$ with probability 1 and similarly for $(1/d) \sum_i w_{0i}^2$. Using these
simple distributions we can, after some algebra, directly compute the bias and
variance. Denoting $R = E_0[\overline{\hat{w}}^T \cdot w_0]$, $r = E_0\|\overline{\hat{w}}\|^2$ and $Q = E_0\overline{\|\hat{w}\|^2}$, one can show
that the bias and variance are given by
$$B = r - 2\,\overline{ug}\,R + \overline{g^2} ; \qquad V = Q - r. \qquad (6)$$
In the above equations we have used $\overline{g^2} = \int Du\, g^2(u)$ and $\overline{ug} = \int Du\, u\, g(u)$, where
the Gaussian measure Du is defined by $Du = (e^{-u^2/2}/\sqrt{2\pi})\, du$. We note in passing
that the same result is obtained for any i.i.d. variables, $x_i$, with zero mean and
variance 1/d. We thus note that a complete calculation of the expected bias and 
variance requires the explicit computation of the variables R, r and Q defined 
above. In principle, with the explicit expressions (4) and (5) at hand one may 
proceed to compute all the quantities relevant to the evaluation of the bias and 
variance. Unfortunately, it turns out that a direct computation of r, R and Q using 
these expressions is a rather difficult task in the theory of random matrices, keeping 
in mind the potential non-linearity of the function g(.). A way to solve the problem 
can be undertaken via a slightly indirect route, using tools from statistical physics. 
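The two Gaussian averages of the target function that enter the bias (the mean of g(u)^2 and the mean of u g(u) for a standard normal u) are one-dimensional integrals and can be evaluated by Gauss-Hermite quadrature. The sketch below is an illustration only; the helper name `gaussian_averages` and the choice of tanh as a non-linear target are assumptions, not taken from the paper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_averages(g, deg=60):
    """Return (mean of g(u)^2, mean of u*g(u)) for u ~ N(0, 1)."""
    u, w = hermegauss(deg)            # nodes/weights for the weight exp(-u^2/2)
    w = w / np.sqrt(2.0 * np.pi)      # normalise so the weights sum to one
    g_u = g(u)
    return float(np.sum(w * g_u**2)), float(np.sum(w * u * g_u))

# Linear target: both averages equal 1.
g2_lin, ug_lin = gaussian_averages(lambda u: u)

# Hypothetical non-linear target g = tanh.
g2_tanh, ug_tanh = gaussian_averages(np.tanh)

print(g2_lin, ug_lin)
print(g2_tanh, ug_tanh, g2_tanh - ug_tanh**2)
```

By the Cauchy-Schwarz inequality the difference between the mean of g^2 and the square of the mean of u g(u) is non-negative, and it vanishes for a linear target.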
The variables R and Q above have been recently computed by Bös et al. (1993)
using replicas and by Opper and Kinzel (1994) by a direct calculation. The variable 
r can be computed along similar lines resulting in the following expressions for the 
bias and variance (given for brevity for the Moore-Penrose solution): 
[Eq. (7): expressions for B and V in the regimes $\alpha < 1$ and $\alpha > 1$; lost in extraction]
We see from this solution that for $\alpha > 1$ the bias is constant, while the variance
is monotonically decreasing with the sample size $\alpha$. For $\alpha < 1$, there are of course
multiple solutions corresponding to different normalizations Q. It is easy to see,
however, that the Moore-Penrose solution gives rise to the smallest variance of all
least-squares estimators (the bias is unaffected by the normalization of the solution).
The expected (or generalization) error is given simply by $\epsilon = B + V$, and is thus
smallest for the Moore-Penrose solution. Note that this latter result is independent
of whether the function g(.) is linear or not. We note in passing that in the simple
case where the target function g(.) is linear and the data is noise-free ($\sigma^2 = 0$) one
obtains the particularly simple result $\epsilon = 1 - \alpha$ for $\alpha < 1$ and $\epsilon = 0$ above $\alpha = 1$.
Note however that in any other case the expected error is a non-linear function of
the normalized sample size $\alpha$.
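The linear, noise-free result can be checked by direct simulation at a finite but moderately large dimension. The sketch below rests on two assumptions that are not spelled out in the text: the error is measured with a 1/d normalization, so that for x with covariance (1/d)I the expected squared prediction difference equals the squared distance between the estimated and true weight vectors divided by d, and d = 200 is taken as large enough to approximate the thermodynamic limit.

```python
import numpy as np

def mp_error_linear_noise_free(alpha, d=200, n_trials=40, seed=3):
    """Average error of the Moore-Penrose estimator for a linear, noise-free target."""
    rng = np.random.default_rng(seed)
    n = int(round(alpha * d))
    errs = []
    for _ in range(n_trials):
        w0 = rng.standard_normal(d)                    # teacher, unit covariance
        X = rng.standard_normal((n, d)) / np.sqrt(d)   # inputs, covariance I/d
        y = X @ w0                                     # linear target, no noise
        w_hat = X.T @ np.linalg.solve(X @ X.T, y)      # minimum-norm solution (n < d)
        # E_x (w_hat.x - w0.x)^2 = |w_hat - w0|^2 / d for x ~ N(0, I/d)
        errs.append(np.sum((w_hat - w0) ** 2) / d)
    return float(np.mean(errs))

alphas = [0.25, 0.5, 0.75]
errors = [mp_error_linear_noise_free(a) for a in alphas]
for a, e in zip(alphas, errors):
    print(f"alpha={a:.2f}  simulated error={e:.3f}  1 - alpha={1 - a:.3f}")
```

The simulated values should lie close to 1 - alpha, with deviations that shrink as d and the number of trials grow.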
4 COMBINING LINEAR ESTIMATORS 
Having summarized the situation in the case of a single estimator, we proceed to
the case of K linear estimators. In this case we assume that the complete data
set, D, is broken up into K subsets $D^{(k)}$ of size $n_k = \alpha_k d$ each. In particular we
consider two extreme situations: (i) The data sets are independent of each other,
namely $D^{(k)} \cap D^{(k')} = \emptyset$ for all $k \neq k'$, and (ii) $D^{(k)} = D$ for all k. We refer to the first
situation as the non-overlapping case, and to the second case, where all estimators
are trained on the same data set, as the fully-overlapping case. Denoting by $\hat{w}^{(k)}$
the k'th least-squares estimator, based on training set $D^{(k)}$, we define the following
quantities:
$$R^{(k)} = E_0\left[\overline{\hat{w}^{(k)}}^T \cdot w_0\right], \quad r^{(k,k')} = E_0\left[\overline{\hat{w}^{(k)}}^T \cdot \overline{\hat{w}^{(k')}}\right], \quad \rho^{(k,k')} = E_0\left[\overline{\hat{w}^{(k)T} \cdot \hat{w}^{(k')}}\right]. \qquad (8)$$
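Making use of Eqs. (3) and the probability distribution of $w_0$ one then
straightforwardly finds for the mixture estimator
$$B = \sum_{k,k'} b_k b_{k'} r^{(k,k')} - 2\,\overline{ug} \sum_{k} b_k R^{(k)} + \overline{g^2} ; \qquad V = \sum_{k} b_k^2 \left[\rho^{(k,k)} - r^{(k,k)}\right] + \sum_{k \neq k'} b_k b_{k'} \left[\rho^{(k,k')} - r^{(k,k')}\right]. \qquad (9)$$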
Now, the computation of the parameters $\rho^{(k,k)}$ and $R^{(k)}$ is identical to the case of
a single estimator and can be directly read from the results of Bös et al. (1993)
with the single modification of rescaling the sample size $\alpha$ to $\alpha_k$. The only
interaction between estimators arises from the terms $\rho^{(k,k')}$ and $r^{(k,k')}$ which couple
two different estimators. Note however that $r^{(k,k')}$ is independent of the degree of
overlap between the data sets $D^{(k)}$ since each estimator is averaged independently.
In fact, a straightforward, if lengthy, calculation leads to the conclusion that in
general $r^{(k,k')} = R^{(k)} R^{(k')}$. For the case where the data sets are independent of each
other, namely $D^{(k)} \cap D^{(k')} = \emptyset$, it is obvious that $\rho^{(k,k')} = r^{(k,k')}$. One then finds
$\rho^{(k,k')} = r^{(k,k')} = R^{(k)} R^{(k')}$ for independent data sets. In the case where the data
sets are identical, $D^{(k)} = D$ for all k, the variable $\rho^{(k,k')}$ has been computed by
Bös et al. (1993). At this point we proceed by discussing individually the two data
generation scenarios described in section 1.
4.1 NON-OVERLAPPING DATA SETS 
As shown above, since the data sets are independent of each other one has in this
case $\rho^{(k,k')} = r^{(k,k')}$, implying that the second term in the expression for the variance
in Eq. (9) vanishes. Specializing now to the case where the sizes of the data sets
are equal, namely $\alpha_k = n_k/d = \alpha/K$ for all k, we expect the variables $R^{(k)}$ and
$Q^{(k)}$ to be independent of the specific index k. In the particular case of a uniform
mixture ($b_k = 1/K$ for all k) we obtain
$$B_K^{(u)} = R_K^2 - 2\,\overline{ug}\,R_K + \overline{g^2} ; \qquad V_K^{(u)} = (Q_K - R_K^2)/K, \qquad (10)$$
where the subscript K indicates that the values of the corresponding variables must
be taken with respect to $\alpha_K = \alpha/K$. The specific values of the parameters $R_K$
and $Q_K$ can be found directly from the solution of Bös et al. (1993) by replacing
$\alpha$ by $\alpha/K$ and dividing the variance term by a further factor of K. Since the bias
is a non-increasing function of $\alpha$ it is always increased (or at best unchanged) by
combining estimators from non-overlapping data subsets. It can also be seen from
the above equation that as K increases the variance can be made arbitrarily small,
albeit at the expense of increasing the bias. It can thus be expected that for any
sample size $\alpha$ there is an optimal number of estimators minimizing the expected
error, $\epsilon = B + V$. In fact, for small $\alpha$ one may expand the equations for the bias
and variance obtaining the expected error from which it is easy to see that the
optimal value of K is given by $K^* \approx (\overline{g^2} + \sigma^2)/\overline{ug}^2$. As could be expected we
see that the optimal value of K scales with the noise level $\sigma^2$, since the effect of
combining multiple independent estimators is to reduce the effective noise variance
by a factor of K. As the sample size $\alpha$ increases and outweighs the effect of the
noise, we find that the bias increase due to the data partitioning dominates the
decrease in variance and the combined estimator performs more and more poorly
relative to the single estimator. Finally, for $\alpha > K$ we find the choice $K = 1$ always
yields a lower expected error. Thus, while we have shown that for small sample
sizes the effect of splitting the data set into independent subsets is helpful, this
is no longer the case if the sample size is sufficiently large, in which case a single
estimator based on the complete data set is superior. For $\alpha \to \infty$, however, one
finds that all uniform mixtures converge (to leading order) at the same rate, namely
$\epsilon_K(\alpha) \approx \epsilon_\infty + (\epsilon_\infty + \sigma^2)/\alpha$, where $\epsilon_\infty = \overline{g^2} - \overline{ug}^2$. For finite values of $\alpha$, however,
the value of K has a strong effect on the quality of the estimator.
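The qualitative statements of this section can be reproduced from a simple "effective noise" sketch. What follows is a reconstruction under stated assumptions, not a transcription of the paper's own single-estimator formulas. The assumptions are that the part of g(u) uncorrelated with u acts as additional independent noise of variance $\epsilon_\infty = \overline{g^2} - \overline{ug}^2$, and that R, r and Q carry an overall 1/d normalization. For the Moore-Penrose solution this gives

```latex
% Single estimator, effective-noise sketch (assumption, see lead-in)
\begin{align*}
\alpha < 1 &: \quad B = \overline{g^2} - \alpha(2-\alpha)\,\overline{ug}^2 ,
  \qquad V = \alpha(1-\alpha)\,\overline{ug}^2 + \frac{\alpha}{1-\alpha}\,(\epsilon_\infty + \sigma^2) , \\
\alpha > 1 &: \quad B = \epsilon_\infty ,
  \qquad V = \frac{\epsilon_\infty + \sigma^2}{\alpha - 1} .
\end{align*}

% Uniform mixture of K non-overlapping estimators: replace alpha by alpha/K
% and divide the variance by K. To first order in alpha:
\begin{align*}
\epsilon_K(\alpha) &\approx \overline{g^2} - \frac{2\alpha}{K}\,\overline{ug}^2
  + \frac{\alpha}{K^2}\,(\overline{g^2} + \sigma^2) , \\
\frac{\partial \epsilon_K}{\partial K} = 0
  \;&\Longrightarrow\; K^* = \frac{\overline{g^2} + \sigma^2}{\overline{ug}^2} .
\end{align*}

% For alpha > K every sub-estimator is overdetermined:
\begin{align*}
\epsilon_K(\alpha) = \epsilon_\infty + \frac{\epsilon_\infty + \sigma^2}{\alpha - K}
  \;\xrightarrow{\;\alpha \to \infty\;}\; \epsilon_\infty + \frac{\epsilon_\infty + \sigma^2}{\alpha} .
\end{align*}
```

Under these assumptions the error for $\alpha > K$ increases with K, so K = 1 is best there, while all K share the same leading-order rate at large $\alpha$. As a consistency check, a linear noise-free target ($\overline{g^2} = \overline{ug} = 1$, $\sigma^2 = 0$) gives $B + V = 1 - \alpha$ for $\alpha < 1$ and zero above $\alpha = 1$.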
4.2 FULLY OVERLAPPING DATA SETS
We focus now on the case where all estimators are formed from the same data set
D, namely $D^{(k)} = D$ for all k. Since there is a unique solution to the least-squares
estimation problem for $\alpha > 1$, all least-squares estimators coincide in this regime.
Thus, we focus here on the case $\alpha < 1$, where multiple least-squares estimators
coexist. We further assume that only mixtures of estimators of the same norm Q
are allowed. We obtain for the uniform mixture
$$B_K^{(u)} = R^2 - 2\,\overline{ug}\,R + \overline{g^2} ; \qquad V_K^{(u)} = (Q - R^2) - (Q - q)\left(1 - \frac{1}{K}\right) \qquad (\alpha < 1) \qquad (11)$$
Clearly the expression for the bias in this case is identical to that obtained for
the single estimator, since all estimators are based on the same data set and the
bias term depends only on single estimator properties. The variance term, however,
is modified due to the correlation between the estimators expressed through the
variable $\rho^{(k,k')}$, which enters Eq. (11) as q for $k \neq k'$. Since the variance for the case
of a single estimator is $Q - R^2$ and since $q < Q$ it is clear in this case that the
variance is reduced while the bias remains unchanged. Thus we conclude that the
mixture of estimators in this case indeed produces superior performance to that of
a single estimator of the same norm Q. However, it can be
seen that in the case of the Moore-Penrose solution, corresponding to choosing the
smallest possible norm Q, the expected error is minimal. We thus conclude that for
$\alpha < 1$ the Moore-Penrose pseudo-inverse solution yields the lowest expected error,
and this cannot be improved on by combining least-squares estimators obtained
from the full data set D.
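The variance of the uniform fully-overlapping mixture follows from a short counting argument. The steps below are a sketch that assumes the pairwise form of the variance in Eq. (3), uniform weights $b_k = 1/K$, the factorization $r^{(k,k')} = R^2$ for every pair, and the shorthand Q for the equal-index overlap and q for the overlap between two different solutions of the same norm.

```latex
\begin{align*}
V_K^{(u)}
  &= \frac{1}{K^2} \sum_{k,k'} \left[ \rho^{(k,k')} - R^2 \right] \\
  &= \frac{1}{K^2} \left[ K\,(Q - R^2) + K(K-1)\,(q - R^2) \right] \\
  &= \frac{Q - R^2}{K} + \left(1 - \frac{1}{K}\right)(q - R^2) \\
  &= (Q - R^2) - (Q - q)\left(1 - \frac{1}{K}\right) .
\end{align*}
```

For K = 1 this reduces to the single-estimator variance $Q - R^2$, and for large K it tends to $q - R^2$, so the reduction is controlled entirely by how far q lies below Q.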
Recall that we have shown in the previous section that (for small and noisy data 
sets) combining estimators formed using non-overlapping data subsets produced 
results superior to those of any single estimator trained on the complete data set. 
An interesting conclusion of these results is that splitting the data set into non- 
overlapping subsets is a better strategy than training each estimator with the full 
data. As mentioned previously, the basic reason for this is the independence of 
the estimators formed in this fashion, which helps to reduce the variance term 
more drastically than in the case where the estimators are dependent (having been 
exposed to overlapping data sets). 
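The comparison between one estimator on the full data and a uniform mixture over disjoint subsets can be tried directly. The following is a minimal sketch, assuming a linear target with additive noise, a finite dimension d = 120 standing in for the thermodynamic limit, and an error measured with a 1/d normalization; the function name `expected_error` and all parameter values are illustrative.

```python
import numpy as np

def expected_error(alpha, K, sigma, d=120, n_trials=40, seed=4):
    """Average error of a uniform mixture of K pseudo-inverse estimators,
    each fitted on its own disjoint block of alpha*d/K examples."""
    rng = np.random.default_rng(seed)
    n_sub = int(round(alpha * d / K))
    errs = []
    for _ in range(n_trials):
        w0 = rng.standard_normal(d)                  # teacher, unit covariance
        w_mix = np.zeros(d)
        for _ in range(K):
            X = rng.standard_normal((n_sub, d)) / np.sqrt(d)
            y = X @ w0 + sigma * rng.standard_normal(n_sub)
            # pinv gives the minimum-norm solution for n_sub < d
            # and the ordinary least-squares solution for n_sub > d
            w_mix += np.linalg.pinv(X) @ y / K
        # excess error relative to the noise-free target, for x ~ N(0, I/d)
        errs.append(np.sum((w_mix - w0) ** 2) / d)
    return float(np.mean(errs))

# Small, noisy sample: splitting is expected to help.
err_single = expected_error(alpha=0.5, K=1, sigma=1.0)
err_split = expected_error(alpha=0.5, K=2, sigma=1.0)

# Large sample: the single estimator is expected to win.
err_single_big = expected_error(alpha=3.0, K=1, sigma=1.0)
err_split_big = expected_error(alpha=3.0, K=2, sigma=1.0)

print(f"alpha=0.5: K=1 -> {err_single:.3f}, K=2 -> {err_split:.3f}")
print(f"alpha=3.0: K=1 -> {err_single_big:.3f}, K=2 -> {err_split_big:.3f}")
```

Other noise levels and values of K can be explored by changing the arguments; the ordering between the two strategies is expected to flip as the sample size grows.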
5 CONCLUSIONS 
In this paper we have studied the effect of combining different estimators on the 
performance of linear regression. In particular we have focused on the case of linear 
least-squares estimation, computing exactly the full bias and variance curves for the 
case where the input dimension is very large (the so called thermodynamic limit). 
While we have focused specifically on the case of linear estimators, it should not be 
hard to extend these results to simple non-linear functions of the form $f(w^T \cdot x)$ (see
section 2). The case of a combination of more complex estimators (such as
multi-layered neural networks) is much more demanding, as even the case of a single such
network is rather difficult to analyze.
Several positive conclusions we can draw from our study are the following. First, the 
general claim that combining experts is always helpful is clearly fallacious. While 
we have shown that combining estimators is beneficial in some cases (such as small 
noisy data sets), this is not the case in general. Second, we have shown that in some 
situations (specifically unrealizable rules and small sample size) it is advantageous 
to split the data into several non-overlapping subsets. It turns out that in this case 
the decrease in variance resulting from the independence of the different estimators
is larger than the concomitant increase in bias. It would be interesting to try to 
generalize our results to the case where the data is split in a more efficient manner. 
Third, our results agree with the general notion that when attempting to learn 
an unrealizable function (whether due to noise or to a mismatch with the target 
function) the best option is to learn with errors. 
Ultimately one would like to have a general theory for combining empirical estima- 
tors. Our work has shown that the effect of noise and finite sample size is expected 
to produce non-trivial effects which are impossible to observe when considering only 
the asymptotic limit. 
Acknowledgements
The author thanks Manfred Opper for a very helpful conversation and the Ollen- 
dorff center of the Electrical Engineering department at the Technion for financial 
support. 
References 
S. Bös, W. Kinzel and M. Opper 1993, The generalization ability of perceptrons
with continuous outputs, Phys. Rev. A 47:1384-1391. 
S. Geman, E. Bienenstock and R. Doursat 1992, Neural networks and the
bias/variance dilemma, Neural Computation 4:1-58. 
C.W.J. Granger 1989, Combining forecasts - twenty years later, J. of Forecast. 
8:167-173. 
M. Opper and W. Kinzel 1994, Statistical mechanics of generalization, in Physics 
of Neural networks, van Hemmen, J.S., E. Domany and K. Schulten eds., Springer- 
Verlag, Berlin. 
L.L. Scharf 1991, Statistical Signal Processing: Detection, Estimation and Time
Series Analysis, Addison-Wesley, Massachusetts. 
