Incorporating Test Inputs into Learning 
Zehra Cataltepe 
Learning Systems Group 
Department of Computer Science 
California Institute of Technology 
Pasadena, CA 91125 
zehra@cs. caltech. edu 
Malik Magdon-Ismail 
Learning Systems Group 
Department of Electrical Engineering 
California Institute of Technology 
Pasadena, CA 91125 
magdon@cco. cal tech. edu 
Abstract 
In many applications, such as credit default prediction and medical im- 
age recognition, test inputs are available in addition to the labeled train- 
ing examples. We propose a method to incorporate the test inputs into 
learning. Our method results in solutions having smaller test errors than 
that of simple training solution, especially for noisy problems or small 
training sets. 
1 Introduction 
We introduce an estimator of test error that takes into consideration the test inputs. The 
new estimator, augmented error, is composed of the training error and an additional term 
computed using the test inputs. In some applications, such as credit default prediction and 
medical image recognition, we do have access to the test inputs. In our experiments, we 
found that the augmented error (which is computed without looking at the test outputs but 
only test inputs and training examples) can result in a smaller test error. In particular, it 
tends to increase when the test error increases (overtraining) even if the simple training 
error does not. (see figure (1)). 
In this paper, we provide an analytic solution for incorporating test inputs into learning in 
the case of linear, noisy targets and linear hypothesis functions. We also show experimental 
results for the nonlinear case. 
Previous results on the use of unlabeled inputs include Castelli and Cover [2] who show that 
the labeled examples are exponentially more valuable than unlabeled examples in reducing 
the classification error. For mixture models, Shahshahani and Landgrebe [7] and Miller 
and Uyar [6] investigate incorporating unlabeled examples into learning for classification 
problems and using EM algorithm, and show that unlabeled examples are useful especially 
when input dimensionality is high and the number of examples is small. In our work we 
only concentrate on estimating the test error better using the test inputs and our method 
438 Z Cataltepe and M. Magdon-Ismail 
2. = 
1. = 
1( 
Training error 
- - Test error 
Augmented error 
8 
log(pass) 
Figure 1' The augmented error, computed not looking at the test outputs at all, follows the 
test error as overtraining occurs. 
extends to the case of unlabeled inputs or input distribution information. Our method is 
also applicable for regression or classification problems. 
In figure 1, we show the training, test and augmented errors, while learning a nonlinear 
noisy target function with a nonlinear hypothesis. As overtraining occurs, the augmented 
error follows the test error. In section 2, we explain our method of incorporating test inputs 
into learning and give the analytical solutions for linear target and hypothesis functions. 
Section 3 includes theory about the existence and general form of the new solution. Section 
4 discusses experimental results. Section 5 extends our solution to the case of knowing the 
input distribution, or knowing extra inputs that are not necessarily test inputs. 
2 Incorporating Test Inputs into Learning 
In learning-from-examples, we assume we have a training set: {(Xl, f),..., (XN, fN)} 
with inputs Xn and possibly noisy targets fn. Our goal is to choose a hypothesis 
#, among a class of hypotheses G, minimizing the test error on an unknown test set 
{(yl,hl),...,(yM,hM)}. 
Using the sample mean square error as our error criterion, the training error of hypothesis 
g is: 
N 
= 
r----1 
Similarly the test error of g is: 
M 
E(g) = M E (g(Ym)-hm): 
m=l 
Expanding the test error: 
1 
(gO = 
M 
M M M 
2 1  
E g (Ym) M Eg(ym) h'n+ Eh'n 
m=X m=l m=X 
Incorporating Test Inputs into Learning 439 
The main observation is that, when we know the test inputs, we know the first term exactly. 
Therefore we need only approximate the remaining terms using the training set: 
I M 2 /v 
E(g,)  M Eg2(ym)--Eg'(Xn)fn+ Efn2 (1) 
m=l n----1 n----1 
M N 
1 z 1 2 
m----1 n----1 
We scale the addition to the training error by an augmentation parameter c to obtain a 
more general error function that we call the augmented error: 
Ea(gv) = E0(gv)+a Egv(Ym) N Egv(Xn) 
m----1 n----1 
where ct = 0 corresponds to the training error Eo and a = 1 corresponds to equation (1). 
The best value of the augmentation parameter depends on a number of factors including the 
target function, the noise distribution and the hypothesis class. In the following sections 
we investigate properties of the best augmentation parameter and give a method of finding 
the best augmentation parameter when the hypothesis is linear. 
3 Augmented Solution for the Linear Hypothesis 
In this section we assume hypothesis functions of the form #,(x) = vTx. From here 
onwards we will denote the functions by the vector that multiplies the inputs. When the 
hypothesis is linear we can find the minimum of the augmented error analytically. 
Let Xa x v be the matrix of training inputs, Ya x M be the matrix of test inputs and fv x  
contain the training targets. The solution Wo minimizing the training error Eo is the least 
squares solution [5]: Wo = (---)- f. 
The augmented error E. (v) = Eo (v) + av z (vvT 
M 
mented error w.: 
where R = [- (X -----r) 
w. = (I - aR)-lw0 
--1 yyT 
M 
least mean squares solution w0. 
v is minimized at the aug- 
(2) 
When a = 0, the augmented solution w. is equal to the 
4 Properties of the Augmentation Parameter 
Assume a linear target and possibly noisy training outputs: f = w*ZX +e where (ee z) = 
rre2 IN x N. 
Since the specific realization of noise e is unknown, instead of minimizing the test error 
directly, we focus on minimizing (E (w.)),,, the expected value of the test error of the 
augmented solution with respect to the noise distribution: 
_- 0 ,)w. 
M ((I-aR) -- 
+tr (I - art) -rrz(I- aR) - XXW 
(3) 
440 Z Cataltepe and M. Magdon-Ismail 
where we have used (eTAe)e = a2tr (A) and tr(A) denotes the trace of matrix A. When 
a = 0, we have: 
(E(w0))e = ae tr (4) 
N 
Now, we prove the existence of a nonzero augmentation parameter a when the outputs are 
noisy. 
2 
Theorem 1: If cre > 0 and tr (R (I - R))  0, then there is an a  0 that minimizes the 
expected test error (E (wa)). 
Proof: Since B-(a) - -B - -aa ...- (a) for any matrix B whose elements are 
scalar functions of a [3], the derivative of {E (w)), with respect to a at a = 0 is: 
d a=o 
2tr R( 
- yyr ) a2 
M = 2tr(R(I- R)) 
If the derivative is < 0 (> 0 respectively), then (E (wa)) is minimized at some a > 0 
(a < 0 respectively). [] 
The following proposition gives an approximate formula for the best a. 
Theorem 2: If N and M are large, and the training and test inputs are drawn i.i.d from 
an input distribution with covariance matrix (xxT) = a2I, then the a* minimizing 
(E (w,)),,x,y, the expected test error of the augmented solution with respect to noise and 
inputs, is approximately: 
2 
d er e 
a*N 2 ,' (5) 
0'xW' W' 
Proof: is given in the appendix. [] 
This formula determines the behavior of the best a. The best a: 
decreases as the signal-to-noise ratio increases. 
increases as  increases, i.e. as we have less examples per input dimension. 
4.1 w. as an Estimator of w* 
The mean squared error (m.s.e.) of any estimator  of w*, can be written as [1]: 
m.s.e() = bias() + variance(r) 
When a is independent of the specific realization e of the noise: 
m.s.e.(wa) = w *T (,-(I--aRT) -') (,- (I-R) -1) W* 
0'2 ((xT) -1 ) 
+tr (I - art) - (I - aR) - 
Incorporating Test Inputs into Learning 441 
Hence the m.s.e. of the least square estimator wo is: 
m.s.e.(w0) v e tr ( - 
N 
w0 is the minimum variance unbiased linear estimator of w*. Although w, is a biased 
estimator if aR k 0, the following proposition shows that, when there is noise, there is an 
a  0 minimizing the m.s.e. of w,: 
2 (R + R T)  0, then there is an a  0 that 
Theorem 3: If rre > 0 and tr 
minimizes the m.s.e. of w,. 
Proof: is similar to the proof of proposition 1 and will be skipped 
As N and M get large, R = I - (x-----T) -1YYr --} 0 and wa = (I - aR)-w0 ---} w0. 
M 
Hence, for large N and M, the bias and variance of w, approach 0, making w, an unbiased 
and consistent estimator of w*. 
5 A Method to Find the Best Augmentation Parameter 
m 
1.5 
1.4 
1.3 
1.1 
0.9 
0.8 
o 
Liver data, d=6, M=60 
20 40 60 80 100 120 140 160 
N'.num/r of training examples 
u,I 
5.5  ", 
5 
4.5 
4  
Bond dala, d=11, M=50 
augmented ero, with estimated alpha - 
80 100 120 140 
N.'number of training examp 
Figure 2: Using the augmented error results in smaller test error especially when the num- 
ber of training examples is small. 
Given only the training and test inputs X and Y, and the training outputs f, in this section 
we propose a method to find the best a minimizing the test error of wa. 
Equation (3) gives a formula for the expected test error which we want to minimize. How- 
2 In equation (3), we replace 
ever, we do not know the target w* and the noise variance a e . 
(X 3" w, -f) 3" (X 3"w, -f) 
2 by , where wa is given by equation (2). Then we 
w'* by wa and rr e N-d-1 
find the a minimizing the resulting approximation to the expected test error. 
We experimented with this method of finding the best c on artificial and real data. The 
results of experiments for liver data  and bond data 2 are shown in figure 2. In the liver 
fip://fip.ics.uci.edu/pub/machine-leaming-databases/liver-disorders/bupa.data 
2We thank Dr. John Moody for providing the bond data. 
442 Z Cataltepe and M. Magdon-Ismail 
database the inputs are different blood test results and the output is the number of drinks 
per day. The bond data consists of financial ratios as inputs and rating of the bond from 
AAA to B- or lower as the output. 
We also compared our method to the least squares (w0) and early stopping using different 
validation set sizes for linear and noisy problems. The table below shows the results. 
SNR mean E(.0) ]l mean E(,o) ' v" - T mean c(,0) , N,, = - 
" 0.01 0.650 4- 0.16" 0.126  0.003 " 0.12 : 0.004 
1 0.830 4- 0.007 . 13 4- 0.02 .075 4- 0.020 
100 1.001 4- 0.002 2.373 4- 0.040 2.073 4- 0.042 
Table 1' Augmented solution is consistently better than the least squares whereas early 
stopping gives worse results as the signal-to-noise ratio (SNR) increases. Even averaging 
early stopping solutions did not help when SNR = 100 (E(wa ,top)_ 1.245 q- 0.018 
E(wo) - 
when Nv = - and 1.307 q- 0.021 for Nv = -). For the results shown, d = 11, N = 30 
training examples were used, No is the number of validation examples. 
6 Extensions 
When the input probability distribution or the covariance matrix of inputs, instead of test 
inputs are known, ____r can be replaced by {xx T) = E and our methods are still applica- 
ble. 
If the inputs available are not test inputs but just some extra inputs, they can still be incor- 
porated into learning. Let us denote the extra K inputs {z,..., zK } by the matrix ZdxK. 
Then the augmented error becomes: 
xx) 
E(v) = Eo (v) + ot K + N N v 
The augmented new solution and its expected test error are same as in equations (2) and 
(3), except we have Rz = I - zzr instead of R. 
K 
Note that for the linear hypothesis case, the augmented error is not necessarily a regularized 
version of the training error, because the matrix rrr xxr is not necessarily a positive 
M N 
definite matrix. 
7 Conclusions and Future Work 
We have demonstrated a method of incorporating inputs into learning when the target and 
hypothesis functions are linear, and the target is noisy. We are currently working on ex- 
tending our method to nonlinear target and hypothesis functions. 
Appendix 
Proof of Theorem 2: When the spectral radius of otR is less than 1 (or is small and/or 
N and M are large), we can approximate (I -dR) -1  I + aR [4], and similarly, 
(I - aR T) -  I + aR T. Discarding any terms with powers of a greater than 1, and 
Incorporating Test Inputs into Learning 443 
solving for a in a(r(w))',",r - (a(r(w)),) 
da -- da x,y 
- O: 
N (W r yyr R2w\ N tr2w T (R2)x,y w 
3// / x,y 
The last step follows since we can write r 2 I + 
--' O'X V ' ---- O'x -- 
(_._)--1 1 (I +  + ) + O (N---) for magices V and V u such that 
and =   
(V)x = (Vu)y = 0 and (Vff) and (V) u me constant with respect to N and M. For 
lmg e M we can approximate ( R 2 )x,y = a (R2)x,y. 
Ignoring terms 
M /x,y' 
input distribution. Similarly [  \ = 3I' Therefore: 
\M/y 
d tr d 
X gw-w 3 + 3 
of O (N'---), (/12 __ /l>x,y (2 + v__\ and (R2)x,y = 
It can be shown that [   - I for a constant A depending on the 
kNI- N 
2 
0' e 
Acknowledgments 
We would like to thank the Caltech Learning Systems Group: Prof. Yaser Abu-Mostafa, Dr. 
Amir Atiya, Alexander Nicholson, Joseph Sill and Xubo Song for many useful discussions. 
References 
[1] Bishop, C. (1995)Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 
1995. 
[2] Castelli, V. & Cover T. (1995) On the Exponential Value of Labeled Samples. Pattern 
Recognition Letters, Vol. 16, Jan. 1995, pp. 105-111. 
[3] Devijver, P. A. & Kittler, J. (1982) Pattern Recognition: A Statistical Approach, pp. 
434. Prentice-Hall International, London. 
[4] Golub, G. H. & Van Loan C. E (1993) Matrix Computations, The Johns-Hopkins Uni- 
versity Press, Baltimore, MD. 
[5] Hocking, R. R. (1996) Methods and Applications of Linear Models. John Wiley & 
Sons, NY. 
[6] Miller, D. J. & Uyar, S. (1996), A Mixture of Experts Classifier with Learning Based 
on Both Labeled and Unlabeled Data. In G. Tesauro, D. S. Touretzky and T.K. Leen (eds.), 
Advances in Neural Information Processing Systems 9. Cambridge, MA: MIT Press. 
[7] Shahshahani, B. M. & Landgrebe, D. A. (1994) The Effect of Unlabeled Samples in 
Reducing Small Sample Size Problem and Mitigating the Hughes Phonemenon. IEEE 
Transactions on Geoscience and Remote Sensing, Vol. 32 No. 5, Sept 1994, pp. 1087- 
1095. 
