Learning from queries for maximum 
information gain in imperfectly learnable 
problems 
Peter Sollich  David Saad
Department of Physics, University of Edinburgh 
Edinburgh EH9 3JZ, U.K. 
P.Solliched.ac.uk, D.Saaded.ac.uk 
Abstract 
In supervised learning, learning from queries rather than from 
random examples can improve generalization performance signif- 
icantly. We study the performance of query learning for problems 
where the student cannot learn the teacher perfectly, which occur 
frequently in practice. As a prototypical scenario of this kind, we 
consider a linear perceptron student learning a binary perceptron 
teacher. Two kinds of queries for maximum information gain, i.e., 
minimum entropy, are investigated: Minimum student space en- 
tropy (MSSE) queries, which are appropriate if the teacher space 
is unknown, and minimum teacher space entropy (MTSE) queries, 
which can be used if the teacher space is assumed to be known, but 
a student of a simpler form has deliberately been chosen. We find 
that for MSSE queries, the structure of the student space deter- 
mines the efficacy of query learning, whereas MTSE queries lead 
to a higher generalization error than random examples, due to a 
lack of feedback about the progress of the student in the way queries 
are selected. 
1 INTRODUCTION
In systems that learn from examples, the traditional approach has been to study 
generalization from random examples, where each example is an input-output pair 
with the input chosen randomly from some fixed distribution and the corresponding 
output provided by a teacher that one is trying to approximate. However, random 
examples contain less and less new information as learning proceeds. Therefore, 
generalization performance can be improved by learning from queries, i.e., by choos- 
ing the input of each new training example such that it will be, together with its 
expected output, in some sense 'maximally useful'. The most widely used mea- 
sure of 'usefulness' is the information gain, i.e., the decrease in entropy of the 
post-training probability distributions in the parameter space of the student or the 
teacher. We shall call the resulting queries 'minimum (student or teacher space) 
entropy (MSSE/MTSE) queries'; their effect on generalization performance has re- 
cently been investigated for perfectly learnable problems, where student and teacher 
space are identical (Seung et al., 1992, Freund et al., 1993, Sollich, 1994), and was
found to depend qualitatively on the structure of the teacher. For a linear percep- 
tron, for example, one obtains a relative reduction in generalization error compared 
to learning from random examples which becomes insignificant as the number of 
training examples, p, tends to infinity. For a perceptron with binary output, on the 
other hand, minimum entropy queries result in a generalization error which decays 
exponentially as p increases, a marked improvement over the much slower algebraic 
decay with p in the case of random examples. 
In practical situations, one almost always encounters imperfectly learnable problems, 
where the student can only approximate the teacher, but not learn it perfectly. 
Imperfectly learnable problems can arise for two reasons: Firstly, the teacher space 
(i.e., the space of models generating the data) might be unknown. Because the 
teacher space entropy is then also unknown, MSSE (and not MTSE) queries have 
to be used for query learning. Secondly, the teacher space may be known, but a 
student of a simpler structure might have deliberately been chosen to facilitate or 
speed up training, for example. In this case, MTSE queries could be employed as 
an alternative to MSSE queries. The motivation for doing this would be strongest 
if, as in the learning scenario that we consider below, it is known from analyses 
of perfectly learnable problems that the structure of the teacher space allows more 
significant improvements in generalization performance from query learning than 
the structure of the student space. 
With the above motivation in mind, we investigate in this paper the performance 
of both MSSE and MTSE queries for a prototypical imperfectly learnable prob- 
lem, in which a linear perceptron student is trained on data generated by a binary 
perceptron teacher. Both student and teacher are specified by an N-dimensional 
weight vector with real components, and we will consider the thermodynamic limit 
$N \to \infty$, $p \to \infty$, $\alpha = p/N = \mathrm{const}$. In Section 2 below we calculate the
generalization error for learning from random examples. In Sections 3 and 4 we compare
the result to MSSE and MTSE queries. Throughout, we only outline the neces- 
sary calculations; for details, we refer the reader to a forthcoming publication. We 
conclude in Section 5 with a summary and brief discussion of our results. 
2 LEARNING FROM RANDOM EXAMPLES 
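We denote students and teachers by $\mathcal{N}$ (for 'Neural network') and $\mathcal{V}$ (for 'element
of the Version space', see Section 4), respectively, and their corresponding weight
vectors by $w_{\mathcal{N}}$ and $w_{\mathcal{V}}$. For an input vector $x$, the outputs of a given student and
teacher are
$$y_{\mathcal{N}} = \frac{1}{\sqrt{N}}\, x^T w_{\mathcal{N}}, \qquad y_{\mathcal{V}} = \mathrm{sgn}\left(\frac{1}{\sqrt{N}}\, x^T w_{\mathcal{V}}\right).$$
Assuming that inputs are drawn from a uniform distribution over the hypersphere
$x^2 = N$, and taking as our error measure the standard squared output difference
$\frac{1}{2}(y_{\mathcal{N}} - y_{\mathcal{V}})^2$, the generalization error, i.e., the average error between student $\mathcal{N}$ and
teacher $\mathcal{V}$ when tested on random test inputs, is given by
$$\epsilon_g(\mathcal{N},\mathcal{V}) = \frac{1}{2}\left[Q_{\mathcal{N}} + 1 - 2\sqrt{\frac{2}{\pi}}\,\frac{R}{\sqrt{Q_{\mathcal{V}}}}\right], \qquad (1)$$
where we have set $R = \frac{1}{N} w_{\mathcal{N}}^T w_{\mathcal{V}}$, $Q_{\mathcal{N}} = \frac{1}{N} w_{\mathcal{N}}^2$, $Q_{\mathcal{V}} = \frac{1}{N} w_{\mathcal{V}}^2$.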
As our training algorithm we take stochastic gradient descent on the training
error $E_t$, which for a training set $\Theta^{(p)} = \{(x^\mu, y^\mu = y_{\mathcal{V}}(x^\mu)),\ \mu = 1 \ldots p\}$ is
$E_t = \frac{1}{2} \sum_\mu (y^\mu - y_{\mathcal{N}}(x^\mu))^2$. A weight decay term $\frac{1}{2}\lambda w_{\mathcal{N}}^2$ is added for
regularization, i.e., to prevent overfitting. Stochastic gradient descent on the resulting
energy function $E = E_t + \frac{1}{2}\lambda w_{\mathcal{N}}^2$ yields a Gibbs post-training distribution of students,
$P(w_{\mathcal{N}}|\Theta^{(p)}) \propto \exp(-E/T)$, where the training temperature $T$ measures the amount
of stochasticity in the training algorithm. For the linear perceptron students
considered here, this distribution is Gaussian, with covariance matrix $T M_{\mathcal{N}}^{-1}$, where
($1_N$ denotes the $N \times N$ identity matrix)
$$M_{\mathcal{N}} = \lambda 1_N + \frac{1}{N} \sum_{\mu=1}^{p} x^\mu (x^\mu)^T.$$
Since the length of the teacher weight vector $w_{\mathcal{V}}$ does not affect the teacher outputs,
we assume a spherical prior on teacher space, $P(w_{\mathcal{V}}) \propto \delta(w_{\mathcal{V}}^2 - N)$, for which $Q_{\mathcal{V}} = 1$.
Restricting attention to the limit of zero training temperature, it is straightforward
to calculate from eq. (1) the average generalization error obtained by training on
random examples
$$\epsilon_g - \epsilon_{g,\min} = \frac{1}{\pi}\left[\lambda_{\mathrm{opt}}\, G + \lambda(\lambda_{\mathrm{opt}} - \lambda)\frac{\partial G}{\partial \lambda}\right], \qquad (2)$$
with the function $G = \langle \frac{1}{N} \mathrm{tr}\, M_{\mathcal{N}}^{-1} \rangle_{P(\{x^\mu\})}$ given by (Krogh and Hertz, 1992)
$$G = \frac{1}{2\lambda}\left[1 - \alpha - \lambda + \sqrt{(1 - \alpha - \lambda)^2 + 4\lambda}\right]. \qquad (3)$$
In eq. (2) we have explicitly subtracted the minimum achievable generalization error,
$\epsilon_{g,\min} = \frac{1}{2}(1 - 2/\pi)$, which is nonzero since a linear perceptron cannot approximate a
binary perceptron perfectly. At finite $\alpha$, the generalization error is minimized when
the weight decay is set to its optimal value $\lambda = \lambda_{\mathrm{opt}} = \pi/2 - 1$. Note that since
both $G$ and $\partial G/\partial \lambda$ tend to zero as $\alpha \to \infty$, the generalization error for random
examples approaches the minimum achievable generalization error in this limit.
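The behaviour described by eqs. (2) and (3) can be explored numerically. The Python sketch below is an illustration under stated assumptions: the prefactor of eq. (2) is taken to be $1/\pi$, the value required by the $\alpha = 0$ limit (where $G = 1/\lambda$ and the untrained student has $\epsilon_g = 1/2$), and $\partial G/\partial\lambda$ is estimated by a central finite difference instead of being evaluated analytically. All function names are hypothetical.

```python
import math

LAMBDA_OPT = math.pi / 2.0 - 1.0  # optimal weight decay quoted in the text


def g_random(alpha, lam):
    """Eq. (3): G for random examples (Krogh and Hertz, 1992)."""
    a = 1.0 - alpha - lam
    d = math.sqrt(a * a + 4.0 * lam)
    if a <= 0.0:
        # Algebraically identical form that avoids cancellation at large alpha.
        return 2.0 / (d - a)
    return (a + d) / (2.0 * lam)


def dg_dlam(g, alpha, lam, h=1e-6):
    """Central finite-difference estimate of dG/dlambda (assumes lam > h)."""
    return (g(alpha, lam + h) - g(alpha, lam - h)) / (2.0 * h)


def excess_error(g, alpha, lam):
    """Eq. (2): eps_g - eps_g_min, with the prefactor 1/pi assumed above."""
    bracket = (LAMBDA_OPT * g(alpha, lam)
               + lam * (LAMBDA_OPT - lam) * dg_dlam(g, alpha, lam))
    return bracket / math.pi


if __name__ == "__main__":
    for alpha in (0.0, 1.0, 2.0, 5.0, 10.0):
        row = [excess_error(g_random, alpha, lam) for lam in (0.01, 0.1, 1.0, LAMBDA_OPT)]
        print(alpha, ["%.4f" % value for value in row])
```

Scanning the weight decay at fixed $\alpha$ in this way shows the excess error taking its smallest value at $\lambda = \lambda_{\mathrm{opt}}$, and increasing $\alpha$ drives it towards zero for every $\lambda$.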
3 MINIMUM STUDENT SPACE ENTROPY QUERIES 
Figure 1: Relative improvement $\kappa$ in generalization error due to MSSE queries, for
weight decay $\lambda$ = 0.01, 0.1, 1.

We now calculate the generalization performance resulting from MSSE queries. For
the training algorithm introduced in the last section, the student space entropy
(normalized by $N$) is given by
$$S_{\mathcal{N}} = -\frac{1}{2N} \ln \det M_{\mathcal{N}},$$
where we have omitted an unimportant constant which depends on the training
temperature only. This entropy is minimized by choosing each new query along
the direction corresponding to the minimal eigenvalue of the existing $M_{\mathcal{N}}$ (Sollich,
1994). The expression for the resulting average generalization error is given by
eq. (2) with $G$ replaced by its analogue for MSSE queries (Sollich, 1994)
$$G_Q = \frac{\Delta\alpha}{\lambda + [\alpha] + 1} + \frac{1 - \Delta\alpha}{\lambda + [\alpha]},$$
where $[\alpha]$ is the greatest integer less than or equal to $\alpha$ and $\Delta\alpha = \alpha - [\alpha]$. We define
the improvement factor $\kappa$ as the ratio of the generalization error (with the minimum
achievable generalization error subtracted as in eq. (2)) for random examples to that
for MSSE queries. Figure 1 shows $\kappa(\alpha)$ for several values of the weight decay $\lambda$.
Comparing with existing results (Sollich, 1994), we find that $\kappa$ is exactly the same
as if our linear student were trying to approximate a linear teacher with additive
noise of variance $\lambda_{\mathrm{opt}}$ on the outputs. For large $\alpha$, one can show (Sollich, 1994) that
$\kappa = 1 + 1/\alpha + O(1/\alpha^2)$ and hence the relative reduction in generalization error due
to querying tends to zero as $\alpha \to \infty$. We investigate in the next section whether it
is possible to improve generalization performance more significantly by using MTSE
queries.
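The improvement factor can be evaluated directly from the two expressions for $G$. The Python sketch below is a minimal illustration: it uses the fact that the prefactor of eq. (2) cancels in the ratio, estimates $\partial G/\partial\lambda$ by a central finite difference, and uses hypothetical function names throughout.

```python
import math

LAMBDA_OPT = math.pi / 2.0 - 1.0


def g_random(alpha, lam):
    """Eq. (3): G for random examples."""
    a = 1.0 - alpha - lam
    d = math.sqrt(a * a + 4.0 * lam)
    # The first form avoids cancellation when a is negative.
    return 2.0 / (d - a) if a <= 0.0 else (a + d) / (2.0 * lam)


def g_query(alpha, lam):
    """G_Q for MSSE queries: eigenvalues lam+[alpha]+1 and lam+[alpha]."""
    whole = math.floor(alpha)
    frac = alpha - whole
    return frac / (lam + whole + 1.0) + (1.0 - frac) / (lam + whole)


def bracket(g, alpha, lam, h=1e-6):
    """lam_opt*G + lam*(lam_opt - lam)*dG/dlam, the G-dependent part of eq. (2)."""
    dg = (g(alpha, lam + h) - g(alpha, lam - h)) / (2.0 * h)
    return LAMBDA_OPT * g(alpha, lam) + lam * (LAMBDA_OPT - lam) * dg


def kappa(alpha, lam):
    """Improvement factor: excess error for random examples over MSSE queries."""
    return bracket(g_random, alpha, lam) / bracket(g_query, alpha, lam)


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0, 5.0, 20.0, 50.0):
        print(alpha, kappa(alpha, LAMBDA_OPT), 1.0 + 1.0 / alpha)
```

At $\lambda = \lambda_{\mathrm{opt}}$ the derivative term drops out and $\kappa$ reduces to $G/G_Q$, which makes the approach to $1 + 1/\alpha$ at large $\alpha$ easy to inspect.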
4 MINIMUM TEACHER SPACE ENTROPY QUERIES 
We now consider the generalization performance achieved by MTSE queries. We 
remind readers that such queries could be used if the teacher space is known, but 
a student of a simpler functional form has deliberately been chosen. The aim in using
MTSE rather than MSSE queries would be to exploit the structure of the teacher 
space if this is known (for perfectly learnable problems) to make query learning very 
efficient compared to random examples. 
For the case of noise free training data under consideration, the posterior probability
distribution in teacher space given a certain training set is proportional to the prior
distribution on the version space (the set of all teachers that could have produced
the training set without error) and zero everywhere else. From this the (normalized)
teacher space entropy can be derived to be, up to an additive constant,
$$S_{\mathcal{V}} = \frac{1}{N} \ln V(p),$$
where the version space volume $V(p)$ is given by ($\Theta(z) = 1$ for $z > 0$ and 0 otherwise)
$$V(p) = \int dw_{\mathcal{V}}\, P(w_{\mathcal{V}}) \prod_{\mu=1}^{p} \Theta\left(\frac{1}{\sqrt{N}}\, y^\mu w_{\mathcal{V}}^T x^\mu\right).$$
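For small $N$ the version space volume can be estimated by brute force: draw candidate teachers from the prior and count the fraction that reproduce all training outputs. The Python sketch below does this for random (not query) examples. It is an illustration only: Gaussian vectors stand in for the spherical prior and for the input distribution, which is harmless here because only signs enter, and the sizes and names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def version_space_fractions(n, p_max, n_candidates, rng):
    """Monte Carlo estimate of V(p) for p = 0, ..., p_max random examples."""
    def gauss_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    true_teacher = gauss_vec()
    candidates = [gauss_vec() for _ in range(n_candidates)]
    fractions = [1.0]  # V(0) = 1: the whole prior is consistent with no data
    for _ in range(p_max):
        x = gauss_vec()
        label_positive = dot(true_teacher, x) > 0
        # Keep only candidates that give the same output as the true teacher.
        candidates = [w for w in candidates if (dot(w, x) > 0) == label_positive]
        fractions.append(len(candidates) / n_candidates)
    return fractions


if __name__ == "__main__":
    rng = random.Random(0)
    n = 10  # assumed toy dimension
    fractions = version_space_fractions(n, 12, 5000, rng)
    for p, v in enumerate(fractions):
        entropy = math.log(v) / n if v > 0 else float("-inf")
        print(p, v, entropy, 2.0 ** (-p))
```

The last printed column is the volume $2^{-p}$ that ideal bisection would give, for comparison with the estimated volume under random examples.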
It can easily be verified that this entropy is minimized$^1$ by choosing queries $x$ which
'bisect' the existing version space, i.e., for which the hyperplane perpendicular to
$x$ splits the version space into two equal halves (Seung et al., 1992, Freund et al.,
1993). Such queries lead to an exponentially shrinking version space, $V(p) = 2^{-p}$,
and hence a linear decrease of the entropy, $S_{\mathcal{V}} = -\alpha \ln 2$. We consider instead
queries which achieve qualitatively the same effect, but permit a much simpler
analysis of the resulting student performance. They are similar to those studied in
the context of a learnable problem by Watkin and Rau (1992), and are defined as
follows. The $(p+1)$th query is obtained by first picking a random teacher vector
$w_p$ from the version space defined by the existing $p$ training examples, and then
picking the new training input $x_{p+1}$ from the distribution of random inputs but
under the constraint that $x_{p+1}^T w_p = 0$.
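The second step of this construction, drawing a random input subject to the orthogonality constraint, can be sketched in a few lines of Python. The example assumes that the vector $w_p$ has already been sampled from the version space (that sampling step is not shown), and the function name is hypothetical. As a usage example it also measures the spread of the query along a second direction at angle $\theta$ to $w_p$, which should be reduced by a factor $\sin\theta$; this anticipates the Gaussian form stated in eq. (4).

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def approximate_mtse_query(w_p, rng):
    """Random input on the hypersphere x^2 = N, constrained to x^T w_p = 0."""
    n = len(w_p)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    coeff = dot(g, w_p) / dot(w_p, w_p)
    x = [gi - coeff * wi for gi, wi in zip(g, w_p)]  # remove component along w_p
    scale = math.sqrt(n / dot(x, x))                 # rescale so that x^2 = N
    return [xi * scale for xi in x]


if __name__ == "__main__":
    rng = random.Random(0)
    n, theta = 30, 0.3  # assumed dimension and angle between w_p and the true teacher
    w_p = [1.0] + [0.0] * (n - 1)
    w_true = [math.cos(theta), math.sin(theta)] + [0.0] * (n - 2)  # unit length
    components = [dot(approximate_mtse_query(w_p, rng), w_true) for _ in range(5000)]
    spread = math.sqrt(sum(c * c for c in components) / len(components))
    print("spread along true teacher:", spread, " sin(theta):", math.sin(theta))
```

Projecting an isotropic Gaussian vector and rescaling gives a uniform draw from the intersection of the hypersphere with the hyperplane perpendicular to $w_p$, which is the constrained input distribution described above.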
For the calculation of the student performance, i.e., the average generalization error,
achieved by the approximate MTSE queries described above, we use an approximation
based on the following observation. As the number of training examples,
$p$, increases, the teacher vectors $w_p$ from the version space will align themselves
with the true teacher $w_{\mathcal{V}}^0$; their components along the direction of $w_{\mathcal{V}}^0$ will increase,
whereas their components perpendicular to $w_{\mathcal{V}}^0$ will decrease, varying widely across
the $N-1$ dimensional hyperplane perpendicular to $w_{\mathcal{V}}^0$. Following Watkin and Rau
(1992), we therefore assume that the only significant effect of choosing queries $x_{p+1}$
with $x_{p+1}^T w_p = 0$ is on the distribution of the component of $x_{p+1}$ along $w_{\mathcal{V}}^0$. Writing
this component as $x^0_{p+1} = x_{p+1}^T w_{\mathcal{V}}^0/|w_{\mathcal{V}}^0|$, its probability distribution can readily be
shown to be
$$P(x^0_{p+1}) \propto \exp\left(-\tfrac{1}{2}\left(x^0_{p+1}/s_p\right)^2\right), \qquad (4)$$
where $s_p$ is the sine of the angle between $w_p$ and $w_{\mathcal{V}}^0$. For finite $N$, the value of $s_p$
is dependent on the $p$ previous training examples that define the existing version
space and on the teacher vector $w_p$ sampled randomly from this version space. In
the thermodynamic limit, however, the variations of $s_p$ become vanishingly small
and we can thus replace $s_p$ by its average value, which is a function of $p$ alone.
In the thermodynamic limit, this average value becomes a continuous function of
$\alpha = p/N$, the number of training examples per weight, which we denote simply by
$s(\alpha)$. The calculation can then be split into two parts: First, the function $s(\alpha)$ is
obtained from a calculation of the teacher space entropy using the replica method,
generalizing the results of Györgyi and Tishby (1990). The average generalization
error can then be calculated by using an extension of the response function method
described in (Sollich, 1994b) or by another replica calculation (now in student space)
as in (Dunmur and Wallace, 1993).

$^1$ More precisely, what is minimized is the value of the entropy after a new training
example $(x, y)$ is added, averaged over the distribution of the unknown new training output
$y$ given the new training input $x$ and the existing training set; see Sollich (1994).

Figure 2: MTSE queries: Teacher space entropy, $S_{\mathcal{V}}$ (with value for random examples
plotted for comparison), and $\ln s$, the log of the sine of the angle between the
true teacher and a random teacher from the version space.
Figure 2 shows the effects of (approximate) MTSE queries in teacher space. For large
$\alpha$ values, the teacher space entropy decreases linearly with $\alpha$, with gradient $-c$, $c \approx 0.44$,
whereas the entropy for random examples, also shown for comparison, decreases
much more slowly (asymptotically like $-\ln \alpha$, see (Györgyi and Tishby, 1990)). The
linear $\alpha$-dependence of the entropy for queries corresponds to an average reduction
of the version space volume with each new training example by a factor of $\exp(-c) \approx 0.64$,
which is reasonably close to the factor $\frac{1}{2}$ for proper bisection of the version
space. This justifies our choice of analysing approximate MTSE queries rather than
true MTSE queries, since the former achieve qualitatively the same results as the
latter.
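The conversion from the entropy gradient to a per-example volume factor is a one-line calculation, reproduced here in Python as a check. The value $c = 0.44$ is the approximate gradient quoted above; the derived quantity `queries_per_halving` is an additional illustration introduced for this example.

```python
import math

c = 0.44  # approximate asymptotic gradient of the teacher space entropy (from the text)

# S_V = (1/N) ln V decreases by c per unit of alpha = p/N,
# so ln V decreases by c per training example.
shrink_per_query = math.exp(-c)

# Proper bisection halves the version space with every query.
bisection_factor = 0.5

# Number of approximate MTSE queries needed, on average, to halve the volume.
queries_per_halving = math.log(2.0) / c

print(shrink_per_query, bisection_factor, queries_per_halving)
```

On this estimate roughly one and a half approximate queries do the work of one ideal bisecting query, so the volume still shrinks exponentially in $p$.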
Before discussing the student performance achieved by (approximate) MTSE
queries, we note from Figure 2 that $\ln s(\alpha)$ decreases linearly with $\alpha$ for large $\alpha$,
with the same gradient as the teacher space entropy. Hence $s(\alpha) \propto \exp(-c\alpha)$ for
large $\alpha$, and MTSE queries force the average teacher from the version space to
approach the true teacher exponentially quickly. It can easily be shown that if we
were learning with a binary perceptron student, i.e., if the problem were perfectly
learnable, then this would result in an exponentially decaying generalization error,
$\epsilon_g \propto \exp(-c\alpha)$. MTSE queries would thus lead to a marked improvement in
generalization performance over random examples (for which $\epsilon_g \propto 1/\alpha$, see (Györgyi and
Tishby, 1990)). It is this significant benefit (in teacher space) of query learning that
provides the motivation for using MTSE queries in imperfectly learnable problems
such as the one considered here.
Figure 3: Generalization error for MTSE queries (higher curves of each pair) and
random examples (lower curves), for weight decay $\lambda$ = 0.01, 0.1, 1. The curves for
random examples (which are virtually indistinguishable from one another already
at $\alpha = 10$) converge to the minimum achievable generalization error $\epsilon_{g,\min}$ (dotted
line) as $\alpha \to \infty$.

The results plotted in Figure 3 for the average generalization error achieved by the
linear perceptron student show, however, that MTSE queries do not have the
desired effect. Far from translating the benefits in teacher space into improvements
in generalization performance for the linear student, they actually lead to a
deterioration of generalization performance, i.e., a larger generalization error than that
obtained for random examples. Worse still, they 'mislead' the student to such an
extent that the minimum achievable generalization error is not reached even for an
infinite number of training examples, $\alpha \to \infty$. How does this happen? It can be
verified that the angle between the student and teacher weight vectors tends to zero
for $\alpha \to \infty$ as expected, while $Q_{\mathcal{N}}$, the normalized squared length of the student
weight vector, approaches
$$Q_{\mathcal{N}}(\alpha \to \infty) = \frac{2}{\pi}\left(\frac{\bar{s}}{\lambda + \overline{s^2}}\right)^2, \qquad (5)$$
where $\bar{s} = \int d\alpha\, s(\alpha)$, $\overline{s^2} = \int d\alpha\, s^2(\alpha)$. Unless the weight decay parameter $\lambda$
happens to be equal to $\bar{s} - \overline{s^2}$, this is different from the optimal asymptotic value,
which is $2/\pi$. This is the reason why in general the linear student does not reach
the minimum possible generalization error even as $\alpha \to \infty$. The approach of $Q_{\mathcal{N}}$
to its non-optimal asymptotic value can cause an increase in the generalization
error for large $\alpha$ and a corresponding minimum of the generalization error at some
finite $\alpha$, as can be seen in the plots for $\lambda = 0.01$ and 0.1 in Figure 3. For $\lambda = 0$,
eq. (5) has the following intuitive interpretation: As $\alpha$ increases, the version space
shrinks around the true teacher $w_{\mathcal{V}}^0$, and hence MTSE queries become 'more and
more orthogonal' to $w_{\mathcal{V}}^0$. As a consequence, the distribution of training inputs along
the direction of $w_{\mathcal{V}}^0$ is narrowed down progressively (compare eq. (4)). Trying to
find a best fit to the teacher's binary output function over this narrower range of
inputs, the linear student learns a function which is steeper than the best fit over
the range of random inputs (which would give minimum generalization error). This
corresponds to a suboptimally large length of the student weight vector in agreement
with eq. (5): $Q_{\mathcal{N}}(\alpha \to \infty) > 2/\pi$ for $\lambda = 0$ because $\overline{s^2} < \bar{s}$.
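The steepening of the best linear fit can be seen in a one-dimensional toy calculation. The Python sketch below is an illustration under simplifying assumptions: it considers only the input component along the true teacher, draws it from a Gaussian of fixed width $s$ as in eq. (4), and omits weight decay. The least-squares slope of the fit $\mathrm{sgn}(x) \approx qx$ is then $q = \sum|x| / \sum x^2$, which tends to $\sqrt{2/\pi}/s$ for many samples. The function names are hypothetical.

```python
import math
import random


def best_fit_slope(s, n_samples, rng):
    """Least-squares slope q of the fit sgn(x) ~ q*x for x drawn from N(0, s^2)."""
    xs = [rng.gauss(0.0, s) for _ in range(n_samples)]
    # Minimising sum (sgn(x) - q x)^2 over q gives q = sum|x| / sum x^2.
    return sum(abs(x) for x in xs) / sum(x * x for x in xs)


def slope_theory(s):
    """Large-sample limit of the slope: E|x| / E[x^2] = sqrt(2/pi) / s."""
    return math.sqrt(2.0 / math.pi) / s


if __name__ == "__main__":
    rng = random.Random(0)
    for s in (1.0, 0.5, 0.25):
        print(s, best_fit_slope(s, 20000, rng), slope_theory(s))
```

For $s = 1$ (random inputs) the slope is $\sqrt{2/\pi}$, corresponding to the optimal $Q_{\mathcal{N}} = 2/\pi$; any narrowing of the inputs ($s < 1$) gives a steeper fit and hence a longer student weight vector.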
Summarizing the results of this section, we have found that although MTSE queries
are very beneficial in teacher space, they are entirely misleading for the linear
student, to the extent that the student does not learn to approximate the teacher
optimally even for an infinite number of training examples.
5 SUMMARY AND DISCUSSION 
We have found in our study of an imperfectly learnable problem with a linear student 
and a binary teacher that queries for minimum student and teacher space entropy, 
respectively, have very different effects on generalization performance. Minimum 
student space entropy (MSSE) queries essentially have the same effect as for a linear 
student learning a noisy linear teacher, apart from a nonzero minimum value of the 
generalization error due to the unlearnability of the problem. Hence the structure 
of the student space is the dominating influence on the efficacy of query learning. 
Minimum teacher space entropy queries (MTSE), on the other hand, perform worse 
than random examples, leading to a higher generalization error even for an infinite 
number of training examples. With the benefit of hindsight, we note that this makes 
intuitive sense since the teacher space entropy, according to which MTSE queries 
are selected, contains no feedback about the progress of the student in learning the 
required generalization task, and thus MTSE queries cannot be guaranteed to have 
a positive effect. 
Our results, then, are a mixture of good and bad news for query learning for max- 
imum information gain in imperfectly learnable problems: The bad news is that 
MTSE queries, due to a lack of feedback information about student progress, are 
not enough to translate significant benefits in teacher space into similar improve- 
ments of student performance and may in fact yield worse performance than random 
examples. The good news is that for MSSE queries, we have found evidence that 
the structure of the student space is the key factor in determining the efficacy of 
query learning. If this result holds more generally, then statements about the ben- 
efits of query learning can be made on the basis of how one is trying to learn only, 
independently of what one is trying to learn--a result of great practical significance. 
References 
A P Dunmur and D J Wallace (1993). Learning and generalization in a linear 
perceptron stochastically trained with noisy data. J. Phys. A, 26:5767-5779. 
Y Freund, H S Seung, E Shamir, and N Tishby (1993). Information, prediction, 
and query by committee. In S J Hanson, J D Cowan, and C Lee Giles, editors, 
NIPS 5, pages 483-490, San Mateo, CA, Morgan Kaufmann. 
G Györgyi and N Tishby (1990). Statistical theory of learning a rule. In
W Theumann and R Köberle, editors, Neural Networks and Spin Glasses, pages
3-36. Singapore, World Scientific.
A Krogh and J A Hertz (1992). Generalization in a linear perceptron in the presence
of noise. J. Phys. A, 25:1135-1147.
P Sollich (1994). Query construction, entropy, and generalization in neural network
models. Phys. Rev. E, 49:4637-4651.
P Sollich (1994b). Finite-size effects in learning and generalization in linear
perceptrons. J. Phys. A, 27:7771-7784.
H S Seung, M Opper, and H Sompolinsky (1992). Query by committee. In
Proceedings of COLT '92, pages 287-294, New York, ACM.
T L H Watkin and A Rau (1992). Selecting examples for perceptrons. J. Phys. A,
25:113-121.
