Active Learning for Function Approximation
Kah Kay Sung 
(sung@ai.mit.edu) 
Massachusetts Institute of Technology 
Artificial Intelligence Laboratory 
545 Technology Square 
Cambridge, MA 02139 
Partha Niyogi 
(pn@ai.mit.edu) 
Massachusetts Institute of Technology 
Artificial Intelligence Laboratory 
545 Technology Square 
Cambridge, MA 02139 
Abstract 
We develop a principled strategy to sample a function optimally for 
function approximation tasks within a Bayesian framework. Using 
ideas from optimal experiment design, we introduce an objective 
function (incorporating both bias and variance) to measure the de- 
gree of approximation, and the potential utility of the data points 
towards optimizing this objective. We show how the general strat- 
egy can be used to derive precise algorithms to select data for two 
cases: learning unit step functions and polynomial functions. In 
particular, we investigate whether such active algorithms can learn 
the target with fewer examples. We obtain theoretical and empir- 
ical results to suggest that this is the case. 
1 INTRODUCTION AND MOTIVATION 
Learning from examples is a common supervised learning paradigm that hypothe- 
sizes a target concept given a stream of training examples that describes the concept. 
In function approximation, example-based learning can be formulated as synthesiz- 
ing an approximation function for data sampled from an unknown target function 
(Poggio and Girosi, 1990). 
Active learning describes a class of example-based learning paradigms that seeks out 
new training examples from specific regions of the input space, instead of passively 
accepting examples from some data generating source. By judiciously selecting
examples instead of allowing for possible random sampling, active learning techniques
can conceivably have faster learning rates and better approximation results than 
passive learning methods. 
This paper presents a Bayesian formulation for active learning within the function 
approximation framework. Specifically, here is the problem we want to address: Let 
$D_n = \{(x_i, y_i) \mid i = 1, \ldots, n\}$ be a set of $n$ data points sampled from an unknown
target function $g$, possibly in the presence of noise. Given an approximation function
concept class, $\mathcal{F}$, where each $f \in \mathcal{F}$ has prior probability $P_{\mathcal{F}}[f]$, one can use
regularization techniques to approximate $g$ from $D_n$ (in the Bayes optimal sense)
by means of a function $\hat{g} \in \mathcal{F}$. We want a strategy to determine at what input
location one should sample the next data point, $(x_{n+1}, y_{n+1})$, in order to obtain
the "best" possible Bayes optimal approximation of the unknown target function $g$
with our concept class $\mathcal{F}$.
The data sampling problem consists of two parts: 
1) Defining what we mean by the "best" possible Bayes optimal ap- 
proximation of an unknown target function. In this paper, we propose an 
optimality criterion for evaluating the "goodness" of a solution with respect to an 
unknown target function. 
2) Formalizing precisely the task of determining where in input space 
to sample the next data point. We express the above mentioned optimality 
criterion as a cost function to be minimized, and the task of choosing the next 
sample as one of minimizing the cost function with respect to the input space 
location of the next sample point. 
Earlier work (Cohn, 1991; MacKay, 1992) has tried to use similar optimal experiment
design (Fedorov, 1972) techniques to collect data that would provide maximum
information about the target function. Our work differs from theirs in several re- 
spects. First, we use a different, and perhaps more general, optimality criterion for 
evaluating solutions to an unknown target function, based on a measure of function 
uncertainty that incorporates both bias and variance components of the total output 
generalization error. In contrast, MacKay and Cohn use only variance components 
in model parameter space. Second, we address the important sample complexity 
question, i.e., does the active strategy require fewer examples to learn the target 
to the same degree of uncertainty? Our results are stated in PAC-style (Valiant, 
1984). After completion of this work, we learnt that Sollich (1994) had also recently 
developed a similar formulation to ours. His analysis is conducted in a statistical 
physics framework. 
The rest of the paper is organized as follows: Section 2 develops our active sampling
paradigm. In Sections 3 and 4, we consider two classes of functions for which active 
strategies are obtained, and investigate their performance both theoretically and 
empirically. 
2 THE MATHEMATICAL FRAMEWORK 
In order to optimally select examples for a learning task, one should first have a 
clear notion of what an "ideal" learning goal is for the task. We can then measure 
an example's utility in terms of how well the example helps the learner achieve the 
goal, and devise an active sampling strategy that selects examples with maximum
potential utility. In this section, we propose one such learning goal: to find an
approximation function $\hat{g} \in \mathcal{F}$ that "best" estimates the unknown target function
g. We then derive an example utility cost function for the goal and finally present 
a general procedure for selecting examples. 
2.1 EVALUATING A SOLUTION TO AN UNKNOWN TARGET:
THE EXPECTED INTEGRATED SQUARED DIFFERENCE
Let $g$ be the target function that we want to estimate by means of an approximation
function $\hat{g} \in \mathcal{F}$. If the target function $g$ were known, then one natural measure of
how well (or badly) $\hat{g}$ approximates $g$ would be the Integrated Squared Difference
(ISD) of the two functions:
$$\delta(\hat{g}, g) = \int (g(x) - \hat{g}(x))^2 \, dx. \qquad (1)$$
In most function approximation tasks, the target $g$ is unknown, so we clearly cannot
express the quality of a learning result, $\hat{g}$, in terms of $g$. We can, however,
obtain an expected integrated squared difference (EISD) between the unknown target,
$g$, and its estimate, $\hat{g}$, by treating the unknown target $g$ as a random variable
from the approximation function concept class $\mathcal{F}$. Taking into account the
$n$ data points, $D_n$, seen so far, we have the following a-posteriori likelihood for
$g$: $P(g \mid D_n) \propto P_{\mathcal{F}}[g] \, P(D_n \mid g)$. The expected integrated squared difference (EISD)
between an unknown target, $g$, and its estimate, $\hat{g}$, given $D_n$, is thus:
$$E_{\mathcal{F}}[\delta(\hat{g}, g) \mid D_n] = \int_{g \in \mathcal{F}} P(g \mid D_n) \, \delta(\hat{g}, g) \, dg \propto \int_{g \in \mathcal{F}} P_{\mathcal{F}}[g] \, P(D_n \mid g) \, \delta(\hat{g}, g) \, dg. \qquad (2)$$
2.2 SELECTING THE NEXT SAMPLE LOCATION 
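We can now express our learning goal as minimizing the expected integrated squared
difference (EISD) between the unknown target $g$ and its estimate $\hat{g}$. A reasonable
sampling strategy would be to choose the next example from the input location that
minimizes the EISD between $g$ and the new estimate $\hat{g}_{n+1}$. How does one predict
the new EISD that results from sampling the next data point at location $x_{n+1}$?
Suppose we also know the target output value (possibly noisy), $y_{n+1}$, at $x_{n+1}$.
The EISD between $g$ and its new estimate $\hat{g}_{n+1}$ would then be
$E_{\mathcal{F}}[\delta(\hat{g}_{n+1}, g) \mid D_n \cup (x_{n+1}, y_{n+1})]$, where $\hat{g}_{n+1}$ can be recovered from
$D_n \cup (x_{n+1}, y_{n+1})$ via regularization. In reality, we do not know $y_{n+1}$, but we
can derive for it the following conditional probability distribution:
$$P(y_{n+1} \mid x_{n+1}, D_n) \propto \int_{f \in \mathcal{F}} P(D_n \cup (x_{n+1}, y_{n+1}) \mid f) \, P_{\mathcal{F}}[f] \, df. \qquad (3)$$
This leads to the following expected value for the new EISD, if we sample our next
data point at $x_{n+1}$:
$$\mathcal{U}(\hat{g}_{n+1} \mid D_n, x_{n+1}) = \int_{-\infty}^{\infty} P(y_{n+1} \mid x_{n+1}, D_n) \, E_{\mathcal{F}}[\delta(\hat{g}_{n+1}, g) \mid D_n \cup (x_{n+1}, y_{n+1})] \, dy_{n+1}. \qquad (4)$$
Clearly, the optimal input location to sample next is the location that minimizes the
cost function in Equation 4 (henceforth referred to as the total output uncertainty),
i.e.,
$$\hat{x}_{n+1} = \arg\min_{x_{n+1}} \mathcal{U}(\hat{g}_{n+1} \mid D_n, x_{n+1}). \qquad (5)$$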
2.3 SUMMARY OF ACTIVE LEARNING PROCEDURE 
We summarize the key steps involved in finding the optimal next sample location: 
1) Compute P(glD,). This is the a-posteriori likelihood of the different functions 
g given Dn, the n data points seen so far. 
2) Fix a new point x,+ to sample. 
3) Assume a value y,+ for this x,+. One can compute P(g[D, U (x,+l,y,+l)) 
and hence the expected integrated squared difference between the target and its new 
estimate. This is given by E:[6(.n+,g)IDn U (xn+,yn+)]. See also Equation 2. 
4) At the given x,+, y,+ has a probability distribution given by Equation 3. 
Averaging over all Yn+l'S, we obtain the total output uncertainty for x,+, given by 
bl(.O,+lD,,x,+) in Equation 4. 
5) Sample at the input location that minimizes the total output uncertainty cost 
function. 
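Steps 1 to 5 can be sketched numerically. The following is a minimal illustration rather than the authors' implementation: it assumes a finite hypothesis class stored as a matrix `H` of function values on a grid, noise-free observations, and the posterior mean as the estimate $\hat{g}$. All function and variable names are hypothetical.

```python
import numpy as np

def posterior(prior, H, data):
    """Step 1: P(g | D_n) for noise-free data given as (grid index, y) pairs."""
    post = np.array(prior, dtype=float)
    for j, y in data:
        post[~np.isclose(H[:, j], y)] = 0.0   # discard inconsistent hypotheses
    return post / post.sum()

def eisd(post, H, dx):
    """Equation 2, using the posterior mean as the estimate (an assumption)."""
    g_hat = post @ H                            # pointwise posterior mean
    isd = ((H - g_hat) ** 2).sum(axis=1) * dx   # ISD of g_hat to each hypothesis
    return float(post @ isd)

def total_output_uncertainty(post, H, j, dx):
    """Steps 3-4: new EISD averaged over the possible outputs at grid index j."""
    u = 0.0
    for y in np.unique(H[post > 0, j]):
        match = np.isclose(H[:, j], y)
        p_y = post[match].sum()                 # predictive probability of y
        new_post = np.where(match, post, 0.0) / p_y
        u += p_y * eisd(new_post, H, dx)
    return u

def next_sample_index(prior, H, data, dx):
    """Steps 2 and 5: try every grid point and return the cheapest one."""
    post = posterior(prior, H, data)
    costs = [total_output_uncertainty(post, H, j, dx) for j in range(H.shape[1])]
    return int(np.argmin(costs))

# Toy check: step functions on [0, 1] with thresholds between grid points.
xs = np.linspace(0.0, 1.0, 101)
thresholds = 0.5 * (xs[:-1] + xs[1:])
H = (xs[None, :] >= thresholds[:, None]).astype(float)
prior = np.full(len(thresholds), 1.0 / len(thresholds))
data = [(20, 0.0), (60, 1.0)]                   # g(0.2) = 0 and g(0.6) = 1
j_next = next_sample_index(prior, H, data, dx=xs[1] - xs[0])
print(xs[j_next])
```

In this toy setting the selected grid point lies halfway between the two observed inputs. With noisy outputs, the hard consistency test in `posterior` would have to be replaced by a likelihood weighting.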
3 EXAMPLE 1: UNIT STEP FUNCTIONS 
To demonstrate the usefulness of the above procedure, let us first consider the 
following simple class of indicator functions parameterized by a single parameter a 
which takes values in [0, 1]. Thus 
$$\mathcal{F} = \{1_{[a,1]} \mid 0 \le a \le 1\}$$
We obtain a prior $P(g = 1_{[a,1]})$ by assuming that $a$ has an a-priori uniform distribution
on $[0, 1]$. Assume that data, $D_n = \{(x_i, y_i);\ i = 1, \ldots, n\}$, consistent with some
unknown target function $1_{[a_t,1]}$ (which the learner is to approximate) has been obtained.
We are interested in choosing a point $x \in [0, 1]$ to sample which will provide
us with maximal information. Following the general procedure outlined above we
go through the following steps.
For ease of notation, let $x_R$ be the rightmost point belonging to $D_n$ whose $y$ value
is 0, i.e., $x_R = \max_{i=1,\ldots,n}\{x_i \mid y_i = 0\}$. Similarly let $x_L = \min_{i=1,\ldots,n}\{x_i \mid y_i = 1\}$ and
let $w = x_L - x_R$.
1) We first need to get $P(g \mid D_n)$. It is easy to show that
$$P(g = 1_{[a,1]} \mid D_n) = \frac{1}{w} \ \text{if } a \in [x_R, x_L]; \quad 0 \ \text{otherwise.}$$
2) Suppose we sample next at a particular $x \in [0, 1]$; we would obtain $y$ with the
distribution
$$P(y = 0 \mid D_n, x) = \frac{x_L - x}{x_L - x_R} = \frac{x_L - x}{w} \ \text{if } x \in [x_R, x_L]; \quad 1 \ \text{if } x \le x_R; \quad 0 \ \text{otherwise.}$$
For a particular $y$, the new data set would be $D_{n+1} = D_n \cup (x, y)$ and the corresponding
EISD can be easily obtained using the distribution $P(g \mid D_{n+1})$. Averaging
this over $P(y \mid D_n, x)$ as in step 4 of the general procedure, we obtain
U(.O,+ID,,x) = ( w2/12 
(1/12w)((xL - x) 3 + (x- xn) ) 
if x _< x or x >_ x/; 
otherwise 
Active Learning for Function Approximation 597 
Clearly the point which minimizes the expected total output uncertainty is the mid- 
point of x: and 
Thus applying the general procedure to this special case reduces to a binary search
learning algorithm which queries the midpoint of $x_R$ and $x_L$. An interesting question
at this stage is whether such a strategy provably reduces the sample complexity,
and if so, by how much. It is possible to prove the following theorem, which shows
that for a certain pre-decided total output uncertainty value, the active learning
algorithm takes fewer examples to learn the target to the same degree of total output
uncertainty than a random drawing of examples according to a uniform distribution.
Theorem 1 Suppose we want to collect examples so that we are guaranteed with
high probability (i.e. probability > $1 - \delta$) that the total output uncertainty is less
than $\epsilon$. Then a passive learner would require at least $\frac{1}{\sqrt{48\epsilon}} \ln(1/\delta)$ examples while
the active strategy described earlier would require at most $(1/2)\ln(1/12\epsilon)$ examples.
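The gap between the two strategies can also be seen empirically. The sketch below does not prove or reproduce the theorem; it only counts how many queries each strategy needs before the uncertainty $w^2/12$ drops below a chosen $\epsilon$. It assumes target thresholds drawn uniformly from $[0, 1]$ (matching the prior), and the names, seed, and value of `eps` are hypothetical.

```python
import random

def active_queries_needed(eps):
    """Midpoint queries until the uncertainty w^2 / 12 drops below eps."""
    w, n = 1.0, 0
    while w * w / 12.0 >= eps:
        w *= 0.5                        # each midpoint query halves w
        n += 1
    return n

def passive_queries_needed(a, eps, rng):
    """Uniform random queries until w^2 / 12 drops below eps for threshold a."""
    x_r, x_l, n = 0.0, 1.0, 0
    while (x_l - x_r) ** 2 / 12.0 >= eps:
        x = rng.random()
        n += 1
        if x < a:
            x_r = max(x_r, x)           # label 0 tightens the left end
        else:
            x_l = min(x_l, x)           # label 1 tightens the right end
    return n

rng = random.Random(0)                  # fixed seed so the run is repeatable
eps = 1e-4
n_active = active_queries_needed(eps)
trials = [passive_queries_needed(rng.random(), eps, rng) for _ in range(200)]
n_passive_mean = sum(trials) / len(trials)
print(n_active, n_passive_mean)
```

The active count is deterministic because binary search always halves the interval, whereas the passive count varies from run to run and is therefore averaged over many targets.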
4 EXAMPLE 2: THE CASE OF POLYNOMIALS 
In this section we turn our attention to a class of univariate polynomials (from
$[-5, 5]$ to $\mathbb{R}$) of maximum degree $K$, i.e.,
$$\mathcal{F} = \left\{ g(a_0, \ldots, a_K) = \sum_{i=0}^{K} a_i x^i \right\}$$
As before, the prior on $\mathcal{F}$ is obtained here by assuming a prior on the parameters;
in particular we assume that $a = (a_0, a_1, \ldots, a_K)$ has a multivariate normal distribution
$N(0, S)$. For simplicity, it is assumed that the parameters are independent,
i.e., $S$ is a diagonal matrix with $S_{i,i} = \sigma_i^2$. In this example we also incorporate noise
(distributed normally according to $N(0, \sigma^2)$). As before, there is a target $g \in \mathcal{F}$
which the learner is to approximate on the basis of data. Suppose the learner is in
possession of a data set $D_n = \{(x_i, y_i = g(x_i) + \eta);\ i = 1 \ldots n\}$ and is to receive
another data point. The two options are 1) to sample the function at a point $x$
according to a uniform distribution on the domain $[-5, 5]$ (passive learning) and 2)
to follow our principled active learning strategy to select the next point to be sampled.
4.1 ACTIVE STRATEGY 
Here we derive an exact expression for $\hat{x}_{n+1}$ (the next query point) by applying the
general procedure described earlier. Going through the steps as before,
1) It is possible to show that $P(g(a) \mid D_n) = P(a \mid D_n)$ is again a multivariate normal
distribution $N(\mu, \Sigma_n)$ where $\mu = \frac{1}{\sigma^2} \Sigma_n \sum_{i=1}^{n} y_i \bar{x}_i$, $\bar{x}_i = (1, x_i, x_i^2, \ldots, x_i^K)^T$ and
$$\Sigma_n^{-1} = S^{-1} + \frac{1}{\sigma^2} \sum_{i=1}^{n} \bar{x}_i \bar{x}_i^T$$
2)Computation of he total output uncertainty bl(On+l ID., z) requires several steps. 
Taking advantage of the Gaussian distribution on both the parameters a and the 
noise, we obtain (see Niyogi and Sung, 1995 for details): 
598 Kah Kay Sung, Partha Niyogi 
15 
10 
Noise=O,1 
15 
10 
Noise= 1,0 
5 5 
0 5O 0 5O 
Figure 1: Comparing active and passive learning average error rates at different sample 
noise levels. The two graphs above plot log error rates against number of samples. See text 
for detailed explanation. The sohd and dashed curves are the active and passive learning 
error rates respectively. 
where A is a matrix of numbers whose i, j element is f_5 5 t(i+J-2)dt. E,+ has the 
same form as En and depends on the previous data, the priors, noise and xn+. 
When minimized over xn+, we get 5:n+ as the maximum utility location where 
the active learner should next sample the unknown target function. 
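The query-selection step of Section 4.1 can be sketched as follows. This is an illustration under stated assumptions, not the expression from Niyogi and Sung (1995): it takes the estimate to be the posterior mean, in which case the expected ISD is $E[(a - \mu)^T A (a - \mu)] = \mathrm{Tr}(A \Sigma_{n+1})$, and it minimizes that quantity over a finite grid of candidate locations. All names, the candidate grid, and the example prior are hypothetical.

```python
import numpy as np

def features(x, K):
    """Monomial vector (1, x, x^2, ..., x^K)."""
    return float(x) ** np.arange(K + 1)

def moment_matrix(K, lo=-5.0, hi=5.0):
    """A[i, j] = integral of t^(i+j) over [lo, hi], with 0-based i and j."""
    p = np.add.outer(np.arange(K + 1), np.arange(K + 1)) + 1
    return (hi ** p - lo ** p) / p

def posterior_precision(xs, prior_var, noise_var):
    """Inverse posterior covariance: S^-1 + (1/sigma^2) * sum of x_i x_i^T."""
    K = len(prior_var) - 1
    precision = np.diag(1.0 / np.asarray(prior_var, dtype=float))
    for x in xs:
        phi = features(x, K)
        precision += np.outer(phi, phi) / noise_var
    return precision

def next_query(xs, prior_var, noise_var, candidates):
    """Candidate x minimizing trace(A @ Sigma_{n+1}); no y values are needed."""
    K = len(prior_var) - 1
    A = moment_matrix(K)
    precision = posterior_precision(xs, prior_var, noise_var)
    costs = []
    for x in candidates:
        phi = features(x, K)
        cov_next = np.linalg.inv(precision + np.outer(phi, phi) / noise_var)
        costs.append(np.trace(A @ cov_next))
    return float(candidates[int(np.argmin(costs))])

# Linear case (K = 1) with unit prior variances, unit noise, and no data yet.
candidates = np.linspace(-5.0, 5.0, 101)
x_first = next_query([], prior_var=[1.0, 1.0], noise_var=1.0, candidates=candidates)
print(x_first)
```

Because the cost depends only on the sampled inputs, `next_query` never touches the observed outputs. For the linear case shown, the first query falls at an end of the domain.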
4.2 SIMULATIONS 
We have performed several simulations to compare the performance of the active 
strategy developed in the previous section to that of a passive learner (who receives 
examples according to a uniform random distribution on the domain [-5, 5]). The 
following issues have been investigated. 
1) Average error rate as a function of the number of examples: Is it 
indeed the case that the active strategy has superior error performance for the same 
number of examples? To investigate this we generated 1000 test target polynomial 
functions (of maximum degree 9) according to the following Gaussian prior on 
the parameters: for each $a_i$, $P(a_i) = N(0, 0.9^i)$. For each target polynomial, we
collected data according to the active strategy as well as the passive (random) 
strategy for varying number of data points. Figure 1 shows the average error rate 
(i.e., the integrated squared difference between the actual target function and its 
estimate, averaged over the 1000 different target polynomials) as a function of the 
number of data points. Notice that the active strategy has a lower error rate than
the passive strategy for the same number of examples; this is particularly true for small
numbers of data points. The active strategy uses the same priors that generate the test
target functions. We show results of the same simulation performed at two noise 
levels (noise standard deviation 0.1 and 1.0). In both cases the active strategy 
outperforms the passive learner indicating robustness in the face of noise. 
2) Incorrect priors: How sensitive is the active learner to possible differences 
between its prior assumptions on the class $\mathcal{F}$ and the true priors? We repeated the
function learning task of the earlier case with the test targets generated in the same 
way as before. The active learner assumes a slightly different Gaussian prior and 
polynomial degree from the target ($\mathrm{Std}(a_i) = 0.7^i$ and $K = 7$ for the active learner
versus $\mathrm{Std}(a_i) = 0.8^i$ and $K = 8$ for the target). Despite its inaccurate priors, the
active learner outperforms the passive case.
Figure 2: Active learning results with different Gaussian priors for coefficients, and a
lower a priori polynomial degree K (panel title: Different Gauss Prior & Deg.). See text
for detailed explanation. The solid and dashed curves are the active and passive learning
error rates respectively.
3) The distribution of points: How does the active learner choose to sample 
the domain for maximally reducing uncertainty? There are a few sampling trends 
which are noteworthy here. First, the learner does not simply sample the domain 
on a uniform grid. Instead it chooses to cluster its samples typically around $K + 1$
locations for concept classes with maximum degree $K$, as borne out by simulations
where $K$ varies from 5 to 9. One possible explanation for this is that it takes only
$K + 1$ points to determine the target in the absence of noise. Second, as the noise
increases, although the number of clusters remains fixed, they tend to be distributed 
away from the origin. It seems that for higher noise levels, there is less pressure to 
fit the data closely; consequently the prior assumption of lower order polynomials 
dominates. For such lower order polynomials, it is profitable to sample away from 
the origin as it reduces the variance of the resulting fit. (Note the case of linear 
regression). 
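The error measure used in item 1) above, the integrated squared difference between a target polynomial and its estimate, has a simple form in coefficient space: for coefficient vectors $a$ and $b$ it equals $(a - b)^T A (a - b)$ with $A_{ij} = \int_{-5}^{5} t^{i+j} \, dt$ (0-based indices). The sketch below evaluates it for a passive learner only. It is not the simulation code behind the figures: it assumes that $0.9^i$ denotes the prior variance of $a_i$, uses a low degree ($K = 3$) to keep the linear algebra well conditioned, and all names and the seed are hypothetical.

```python
import numpy as np

def moment_matrix(K, lo=-5.0, hi=5.0):
    """A[i, j] = integral of t^(i+j) over [lo, hi], with 0-based i and j."""
    p = np.add.outer(np.arange(K + 1), np.arange(K + 1)) + 1
    return (hi ** p - lo ** p) / p

def isd(a, b, lo=-5.0, hi=5.0):
    """Integrated squared difference of two polynomials given by coefficients."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d @ moment_matrix(len(d) - 1, lo, hi) @ d)

def posterior_mean(xs, ys, prior_var, noise_var):
    """Posterior mean of the coefficients under a zero-mean Gaussian prior."""
    K = len(prior_var) - 1
    Phi = np.vander(np.asarray(xs, dtype=float), K + 1, increasing=True)
    precision = np.diag(1.0 / np.asarray(prior_var, dtype=float))
    precision = precision + Phi.T @ Phi / noise_var
    return np.linalg.solve(precision, Phi.T @ np.asarray(ys, dtype=float) / noise_var)

def passive_trial(K, prior_var, noise_std, n, rng):
    """Draw a target from the prior, sample n uniform inputs, return the ISD."""
    a = rng.normal(0.0, np.sqrt(prior_var))
    xs = rng.uniform(-5.0, 5.0, size=n)
    ys = np.vander(xs, K + 1, increasing=True) @ a
    ys = ys + rng.normal(0.0, noise_std, size=n)
    a_hat = posterior_mean(xs, ys, prior_var, noise_std ** 2)
    return isd(a, a_hat)

rng = np.random.default_rng(0)            # fixed seed so the run is repeatable
prior_var = 0.9 ** np.arange(4)           # assumption: 0.9^i is a variance
err = {n: float(np.mean([passive_trial(3, prior_var, 0.1, n, rng)
                         for _ in range(200)]))
       for n in (5, 50)}
print(err)
```

An active learner could be compared on the same targets by replacing the uniform draw of `xs` with inputs chosen by a query-selection rule.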
Remarks 
1) Notice that because the class of polynomials is linear in its model parameters, $a$,
the new sample location ($\hat{x}_{n+1}$) does not depend on the $y$ values actually observed
but only on the x values sampled. Thus if the learner is to collect n data points, 
it can pre-compute the n points at which to sample from the start. In this sense 
the active algorithm is not really adaptive. This behavior has also been observed 
by MacKay (1992) and Sollich (1994). 
2) Needless to say, the general framework from optimal design can be used for 
any function class within a Bayesian framework. We are currently investigating 
the possibility of developing active strategies for Radial Basis Function networks. 
While it is possible to compute exact expressions for $\hat{x}_{n+1}$ for such RBF networks
with fixed centers, for the case of moving centers, one has to resort to numerical 
minimization. For lack of space we do not include those results in this paper. 
Figure 3: Distribution of active learning sample points as a function of (i) noise strength
and (ii) a-priori polynomial degree. The horizontal axis of both graphs represents the input
space [-5, 5]. Each circle indicates a sample location. The left graph shows the distribution
of sample locations (on x-axis) for different noise levels (indicated on y-axis). The right
graph shows the distribution of sample locations (on x-axis) for different assumptions on
the maximum polynomial degree K (indicated on y-axis).
5 CONCLUSIONS 
We have developed a Bayesian framework for active learning using ideas from opti- 
mal experiment design. Our focus has been to investigate the possibility of improved 
sample complexity using such active learning schemes. For a simple case of unit 
step functions, we are able to derive a binary search algorithm from a completely 
different standpoint. Such an algorithm then provably requires fewer examples for 
the same error rate. We then show how to derive specific algorithms for the case 
of polynomials and carry out extensive simulations to compare their performance 
against the benchmark of a passive learner with encouraging results. This is an 
application of the optimal design paradigm to function learning and seems to bear 
promise for the design of more efficient learning algorithms. 
References 
D. Cohn. (1991) A Local Approach to Optimal Queries. In D. Touretzky (ed.), Proc. of
1990 Connectionist Summer School, San Mateo, CA, 1991. Morgan Kaufmann Publishers.
V. Fedorov. (1972) Theory of Optimal Experiments. Academic Press, New York, 1972.
D. MacKay. (1992) Bayesian Methods for Adaptive Models. PhD thesis, Caltech, 1992.
P. Niyogi and K. Sung. (1995) Active Learning for Function Approximation: Paradigms 
from Optimal Experiment Design. Tech Report AIM-1483, AI Lab., MIT, In Preparation. 
M. Plutowski and H. White. (1991) Active Selection of Training Examples for Network 
Learning in Noiseless Environments. Tech Report CS91-180, Dept. of Computer Science 
and Engineering, University of California, San Diego, 1991. 
T. Poggio and F. Girosi. (1990) Regularization Algorithms for Learning that are Equiva- 
lent to Multilayer Networks. Science, 247:978-982, 1990. 
P. Sollich. (1994) Query Construction, Entropy, and Generalization in Neural Network Models.
Physical Review E, 49:4637-4651, 1994.
L. Valiant. (1984) A Theory of the Learnable. Proc. of the 1984 STOC, p436-445, 1984.
