Learning Model Bias 
Jonathan Baxter 
Department of Computer Science 
Royal Holloway College, University of London 
j onedes. rhbnc. ac. uk 
Abstract 
In this paper the problem of learning appropriate domain-specific 
bias is addressed. It is shown that this can be achieved by learning 
many related tasks from the same domain, and a theorem is given 
bounding the number tasks that must be learnt. A corollary of the 
theorem is that if the tasks are known to possess a common inter- 
nal representation or preprocessing then the number of examples 
required per task for good generahsation when learning n tasks si- 
multaneously scales like O(a + ), where O(a) is a bound on the 
minimum number of examples requred to learn a single task, and 
O(a + b) is a bound on the number of examples required to learn 
each task independently. An experiment providing strong qualita- 
tive support for the theoretical results is reported. 
I Introduction 
It has been argued (see [6]) that the main problem in machine learning is the biasing 
of a learner's hypothesis space sufficiently well to ensure good generahsation from 
a small number of examples. Once suitable biases have been found the actual 
learning task is relatively trivial. Exisiting methods of bias generally require the 
input of a human expert in the form of heuristics, hints [1], domain knowledge, 
etc. Such methods are clearly limited by the accuracy and reliabihty of the expert's 
knowledge and also by the extent to which that knowledge can be transferred to the 
learner. Here I attempt to solve some of these problems by introducing a method 
for automatically learning the bias. 
The central idea is that in many learning problems the learner is typically em- 
bedded within an environment or domain of related learning tasks and that the 
bias appropriate for a single task is likely to be appropriate for other tasks within 
the same environment. A simple example is the problem of handwritten character 
recognition. A preprocessing stage that identifies and removes any (small) rota- 
tions, dilations and translations of an image of a character will be advantageous for 
170 j. BAXTER 
recognising all characters. If the set of all individual character recognition problems 
is viewed as an environment of learning tasks, this preprocessor represents a bias 
that is appropriate to all tasks in the environment. It is likely that there are many 
other currently unknown biases that are also appropriate for this environment. We 
would like to be able to learn these automatically. 
Bias that is appropriate for all tasks must be learnt by sampling from many tasks. 
If only a single task is learnt then the bias extracted is likely to be specific to that 
task. For example, if a network is constructed as in figure I and the output nodes 
are simultaneously trained on many similar problems, then the hidden layers are 
more likely to be useful in learning a novel problem of the same type than if only a 
single problem is learnt. In the rest of this paper I develop a general theory of bias 
learning based upon the idea of learning multiple related tasks. The theory shows 
that a learner's generalisation performance can be greatly improved by learning 
related tasks and that if sufficiently many tasks are learnt the learner's bias can be 
extracted and used to learn novel tasks. 
Other authors that have empirically investigated the idea of learning multiple re- 
lated tasks include [5] and [8]. 
2 Learning Bias 
For the sake of argument I consider learning problems that amount to minimizing 
the mean squared error of a function h over some training set D. A more general 
formulation based on statistical decision theory is given in [3]. Thus, it is assumed 
that the learner receives a training set of (possibly noisy) input-output pairs D -- 
((a:l, yl),..., (aZm, Ym)}, drawn according to a probability distribution P on X x Y 
(X being the input space and Y being the output space) and searches through its 
hypothesis space 7/ for a function h: X - Y minimizing the empirical error, 
m 
(h, D)-- 1 E(h(xi ) _ yi)2. (1) 
rrt 
i----1 
The true error or generalisation error of h is the expected error under P: 
E(h, P) = fx (h(x) - y)' dP(x, y). (2) 
xY 
The hope of course is that an h with a small empirical error on a large enough 
training set will also have a small true error, i.e. it will genera!ise well. 
I model the environment of the learner as a pair (P, Q) where P = {P} is a set of 
learning tasks and Q is a probability measure on P. The learner is now supplied 
not with a single hypothesis space 7/but with a hypothesis space family  -- {7/}. 
Each 7/ E  represents a different bias the learner has about the environment. For 
example, one 7/ may contain functions that are very smooth, whereas another 7/ 
might contain more wiggly functions. Which hypothesis space is best will depend 
on the kinds of functions in the environment. To determine the best 7/ E  for 
(P, Q), we provide the learner not with a single training set D but with n such 
training sets D1,..., Dn. Each Di is generated by first sampling from P according 
to Q to give Pi and then sampling rrt times from X x Y according to Pi to give 
Di -- ((Xil,Yil),...,(Xim, Yim). The learner searches for the hypothesis space 
7/  with minimal empirical error on D1, ..., Dn, where this is defined by 
1  inf (h, Di). (3) 
*(7/,D1,...,Dn)- ni=l he 
Learning Model Bias 171 
Figure 1: Net for learning multiple tasks. Input xij from training set Di is prop- 
agated forwards through the internal representation f and then only through the 
output network gi. The error [gi(f(xij))- yij] 2 is similarly backpropagated only 
through the output network 9i and then f. Weight updates are performed after all 
training sets Di, ..., Dn have been presented. 
The hypothesis space 7/with smallest empirical error is the one that is best able to 
learn the n data sets on average. 
There are two ways of measuring the true error of a bias learner. The first is how 
well it generalises on the n tasks P1,..., P,, used to generate the training sets. 
Assuming that in the process of minimising (3) the learner generates n functions 
hi,..., h,, C 7/ with minimal empirical error on their respective training sets 1, the 
learner's true error is measured by: 
i----1 
(4) 
Note that in this case the learner's empirical er- 
1 
ror is given by D1,..., : Zi=l (hi, The second way 
of measuring the generalisation error of a bias learner is to determine how good 7/ 
is for learning novel tasks drawn from the environment (P, Q)' 
z*(7/, Q): f, inf 
h7-t 
(5) 
A learner that has found an 7/ with a small value of (5) can be said to have learnt 
to learn the tasks in 7 > in general. To state the bounds ensuring these two types of 
generalisation a few more definitions must be introduced. 
Definition 1 Let  - {7/} be a hypothesis space family. Let  - {h  7/:7/  
}. For any h: X -+ Y, define a map h: X x Y -+ [0, 1] by h(x, y) = (h(x)- y)2. 
Note the abuse of notation: h stands for two different functions depending on its 
argument. Given a sequence of n functions fz : (hl, . . . , hn) let fz: (X x y)n _ [0, 1] 
1 n 
be the function (Xl, yl,...,x,,yn) -+  i=l hi(xi,yl). Let 7/n be the set of all such 
functions where the hi are all chosen from 7/. Let ]I = {7/n:7/  H}. For each 
7/ E  define 7/*: P --> [0, 1] by 7/*(P)--infhet E(h,P)and let *: {7/*: 7/E }. 
This assumes the infimum in (3) is attained. 
172 J. BAXTER 
Definition 2 Given a set of functions 7/from any space Z to [0, 1], and any prob- 
ability measure on Z, define the pseudo-metric dp on 7/ by 
ap(h, h'): fz lb(z) - h'(z)l 
Denote the smallest e-cover of (7/,dp) by A/'(e, 7/, dp). Define the c-capacity of 7/ 
by 
c(e, 7/) 
P 
where the supremum is over all discrete probability measures P on g. 
Definition 2 will be used to define the c-capacity of spaces such as Ilql* and [1I], 
where from definition I the latter is [E ] = { E 7/': 7/E H}. 
The following theorem bounds the number of tasks and examples per task required 
to ensure that the hypothesis space learnt by a bias learner will, with high proba- 
bility, contain good solutions to novel tasks in the same environment 2. 
Theorem 1 Let the n training sets D1,..., Dn be generated by sampling n times 
from the environment P according to Q to give P1,..., Pn, and then sampling m 
times from each Pi to generate Di. Let : {7/} be a hypothesis space family and 
^ 
suppose a learner chooses 7/   minimizing (3) on D1, ..., Dn. For all e > 0 and 
0<5<1, if 
n --- 
1 C (e, IE*)) 
0 -ffln 3 ' 
then 
{ ^ ^ } 
Pr D1,...,D,:IE*(7/,D1,...,D,,.)-E*(7-,Q)I>e <3. 
The bound on m in theorem I is the also the number of examples required per 
task to ensure generalisation of the first kind mentioned above. That is, it is 
the number of examples required in each data set Di to ensure good generalisa- 
tion on average across all n tasks when using the hypothesis space family SI. If 
we let rn(, n, e, 5) be the number of examples required per task to ensure that 
Pr{D1,...,Dn:ln(hl,...,hn, D1,...,Dn) - En(hl,...,hn, P1,...,Pn)l > e} < 
5, where all hi 6 7/for some fixed 7/ 6  then 
G(IFhn, e, 3) = re(H, 1, e, 3) 
m ( lI-]I, n, e, 3 ) 
represents the advantage in learning n tasks as opposed to one task (the ordinary 
learning scenario). Can a( ,,e, 3) the ,-task gain of Using the fact [] that 
c (e, ) _< c (e, [L) _< c (e, )  , 
and the formula for m from theorem 1, we have, 
1 _< G(It n, e, 3) _< n. 
2The bounds in theorem 1 can be improved to O (3) if all 7  H are convex and the 
error is the squared loss [7]. 
Learning Model Bias 173 
Thus, at least in the worst case analysis here, learning n tasks in the same environ- 
ment can result in anything from no gain at all to an n-fold reduction in the number 
of examples required per task. In the next section a very intuitive analysis of the 
conditions leading to the extreme values of G(H, n, e, 3) is given for the situation 
where an internal representation is being learnt for the environment. I will also say 
more about the bound on the number of tasks (n) in theorem 1. 
3 Learning Internal Representations with Neural Networks 
In figure I n tasks are being learnt using a common representation f. In this 
case [E7] is the set of all possible networks formed by choosing the weights in the 
representation and output networks. E is the same space with a single output node. 
If the n tasks were learnt independently (i.e. without a common representation) then 
each task would use its own copy of H,,, i.e. we wouldn't be forcing the tasks to all 
use the same representation. 
Let Wt be the total number of weights in the representation network and Wo 
be the number of weights in an individual output network. Suppose also that all 
the nodes in each network are Lipschitz bounded e. Then it can be shown [3] that 
1 
In C (e, [I  ]) - O ((Wo + -n ) In ) and In C (e, ]E* ) - O (WR In ). Substituting 
these bounds into theorem I shows that to generalise well on average on n tasks 
I 1in] ) _ 
using a common representation requires ra -- 0( [(Wo + wn-)ln  +  
b 
O (a + ) examples of each task. In addition, if n- O (WRln ) then with high 
probability the resulting representation will be good for learning novel tasks from 
the same environment. Note that this bound is very large. However it results from a 
worst-case analysis and so is highly likely to be beaten in practice. This is certainly 
borne out by the experiment in the next section. 
The learning gain G(H, n, e) satisfies G(H, n, e)  Wo+Wa Thus, if Ws >> Wo, 
Wo + wl ' 
G  n, while if Wo >> Wn then G  1. This is perfectly intuitive: when Wo >> 
Wa the representation network is hardly doing any work, most of the power of 
the network is in the ouput networks and hence the tasks are effectively being 
learnt independently. However, if Wa >> Wo then the representation network 
dominates; there is very httle extra learning to be done for the individual tasks 
once the representation is known, and so each example from every task is providing 
full information to the representation network. Hence the gain of n. 
Note that once a representation has been learnt the sampling burden for learning a 
novel task will be reduced to m = O ( [Wo In  + In ]) because only the output 
network has to be learnt. If this theory apphe; to human learning then the fact 
that we are able to learn words, faces, characters, etcwith relatively few examples 
(a single example in the case of faces) indicates that our "output networks" are very 
small, and, given our large ignorance concerning an appropriate representation, the 
representation network for learning in these domains would have to be large, so we 
would expect to see an n-task gain of nearly n for learning within these domains. 
SA node a: P -  is Lipschitz bounded if there exists a constant c such that la(x) - 
a(x')l < cll- x'll for an  Note that this rules out threshold nodes, but sigmoid 
squashing functions are okay as long as the weights are bounded. 
174 j. BAXTER 
4 Experiment: Learning Symmetric Boolean Functions 
In this section the results of an experiment are reported in which a neural network 
was trained to learn syraraetric 4 Boolean functions. The network was the same as 
the one in figure 1 except that the output networks gi had no hidden layers. The 
input space X: {0, 1} 1 was restricted to include only those inputs with between 
one and four ones. The functions in the environment of the network consisted of all 
possible symmetric Boolean functions over the input space, except the trivial "con- 
stant 0" and "constant 1" functions. Training sets D1,..., Dn were generated by 
first choosing n functions (with replacement) uniformly from the fourteen possible, 
and then choosing m input vectors by choosing a random number between 1 and 4 
and placing that many l's at random in the input vector. The training sets were 
learnt by miniraising the empirical error (3) using the backpropagation algorithm 
as outlined in figure 1. Separate simulations were performed with n ranging from 
1 to 21 in steps of four and m ranging from 1 to 171 in steps of 10. Further details 
of the experimental procedure may be found in [3], chapter 4. 
Once the network had sucessfully learnt the n training sets its generalization ability 
was tested on all n functions used to generate the training set. In this case the 
generalisation error (equation (4)) could be computed exactly by calculating the 
network's output (for all n functions) for each of the 385 input vectors. The gener- 
alisation error as a function of n and m is plotted in figure 2 for two independent 
sets of simulations. Both simulations support the theoretical result that the number 
of examples m required for good generalisation decreases with increasing n (cf the- 
orem 1). For training sets D1,. ., D that led to a generalisation error of less than 
Figure 2: Learning surfaces for two independent simulations. 
0.01, the representation network f was extracted and tested for its true error, where 
this is defined as in equation (5) (the hypothesis space 7/ is the set of all networks 
formed by attaching any output network to the fixed representation network f). 
Although there is insufficient space to show the representation error here (see [3] 
for the details), it was found that the representation error monotonically decreased 
with the number of tasks learnt, verifying the theoretical conclusions. 
The representation's output for all inputs is shown in figure 3 for sample sizes 
(n, ra) = (1,131),(5,31)and (13,31). All outputs corresponding to inputs from 
the same category (i.e. the same number of ones) are labelled with the same symbol. 
The network in the n = I case generalised perfectly but the resulting representation 
does not capture the symmetry in the environment and also does not distinguish 
the inputs with 2, 3 and 4 "1's" (because the function learnt didn't), showing that 
4A symmetric Boolean function is one that is invariant under interchange of its inputs, 
or equivalently, one that only depends on the number of "1's" in its input (e.g. parity). 
Learning Model Bias 175 
learning a single function is not sufficient to learn an appropriate representation. 
By n = 5 the representation's behaviour has improved (the inputs with differing 
numbers of l's are now well separated, but they are still spread around a lot) and 
by n = 13 it is perfect. As well as reducing the samphng burden for the n tasks in 
(1,11) (5,1) (1:, 1) 
node  
0  /nodo 2 2 0 
node 1  0 
Figure 3: Plots of the output of a representation generated from the indicated (n, ra) 
sample. 
the training set, a representation learnt on sufficiently many tasks should be good 
for learning novel tasks and should greatly reduce the number of examples required 
for new tasks. This too was experimentally verified although there is insufficient 
space to present the results here (see [3]). 
5 Conclusion 
I have introduced a formal model of bias learning and shown that (under mild 
restrictions) a learner can sample sufficiently many times from sufficiently many 
tasks to learn bias that is appropriate for the entire environment. In addition, the 
number of examples required per task to learn n tasks independently was shown 
to be upper bounded by O(a + b/n) for appropriate environments. See [2] for an 
analysis of bias learning within an Information theoretic framework which leads to 
an exact a + b/n-type bound. 
References 
[1] Y. S. Abu-Mostafa. Learning from Hints in Neural Networks. Journal of Com- 
plecity, 6:192-198, 1989. 
[2] J. Baxter. A Bayesian Model of Bias Learning. Submitted to COLT 1996, 1995. 
[3] J. Baxter. Learning Internal Representations. PhD thesis, De- 
partment of Mathematics and Statistics, The Fhnders University of 
South Australia, 1995. Draft copy in Neuroprose Archive under 
"/pub/neuroprose/Thesis/baxter.thesis.ps.Z". 
[4] J. Baxter. Learning Internal Representations. In Proceedings of the Eighth Inter- 
national Conference on Computational Learning Theory, Santa Cruz, California, 
1995. ACM Press. 
[5] R. Caruana. Learning Many Related Tasks at the Same Time with Backpropa- 
gation. In Advances in Neural Information Processing 5, 1993. 
[6] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the 
bias/variance dilemma. Neural Cornput., 4:1-58, 1992. 
[7] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Sample Complexity of Agnostic 
Learning with Squared Loss. In preparation, 1995. 
[8] T. M. Mitchell and S. Thrum Learning One More Thing. Technical Report 
CMU-CS-94-184, CMU, 1994. 
