Transductive Inference for Estimating 
Values of Functions 
Olivier Chapelle*, Vladimir Vapnik *'t, Jason Westont''* 
* AT&T Research Laboratories, Red Bank, USA. 
Royal Holloway, University of London, Egham, Surrey, UK. 
t Barnhill BioInformatics.com, Savannah, Georgia, USA. 
{ chapelle, vlad, weston) research. art. corn 
Abstract 
We introduce an algorithm for estimating the values of a function 
at a set of test points Xt+l,   , Xt+ra given a set of training points 
(Xl, Yl),-.., (xt, Yt) without estimating (as an intermediate step) 
the regression function. We demonstrate that this direct (transduc- 
rive) way for estimating values of the regression (or classification 
in pattern recognition) can be more accurate than the tradition- 
al one based on two steps, first estimating the function and then 
calculating the values of this function at the points of interest. 
1 Introduction 
Following [6] we consider a general scheme of transductive inference. Suppose there 
exists a function y* = fo(x) from which we observe the measurements corrupted 
with noise 
((/1,Yl),...(xt,Yt)), Yi -- Y n u i. (1) 
Find an algorithm A that using both the given set of training data (1) and the given 
set of test data 
(Xt+i,... ,Xg+ra) (2) 
selects from a set of functions {x -> f(x)} a function 
y = f(x)= fA(XlXl,Yl,...,xt,yt,xt+l,...,Xt+m) (3) 
and minimizes at the points of interest the functional 
R(A)-E(t (y--fA(Xi[Xl,Yl,...,xt,yt,Xt+l,...,Xt+m)) 2) (4) 
\i=+1 
where expectation is taken over x and 6. For the training data we are given the 
vector x and the value y, for the test data we are only given x. 
Usually, the problem of estimating values of a function at points of interest is 
solved in two steps: first in a given set of functions f(x, (), ( E A one estimates 
the regression, i.e the function which minimizes the functional 
R(a) =/((y- f(x,a))2dF(x,y), (5) 
422 O. Chapelle, V.N. Vapnik and J. Weston 
(the inductive step) and then using the estimated function y = f(x, at) we calculate 
the values at points of interest 
= i(x;,-t) (6) 
(the deductive step). 
Note, however, that the estimation of a function is equivalent to estimating its val- 
ues in the continuum points of the domain of the function. Therefore, by solving 
the regression problem using a restricted amount of information, we are looking 
for a more general solution than is required. In [6] it is shown that using a di- 
rect estimation method one can obtain better bounds than through the two step 
procedure. 
In this article we develop the idea introduced in [5] for estimating the values of a 
function only at the given points. 
The material is organized as follows. In Section 1 we consider the classical (induc- 
tive) Ridge Regression procedure, and the leave-one-out technique which is used to 
measure the quality of its solutions. Section 2 introduces the transductive method 
of inference for estimation of the values of a function based on this leave-one-out 
technique. In Section 3 experiments which demonstrate the improvement given 
by transductive inference compared to inductive inference (in both regression and 
pattern recognition) are presented. Finally, Section 4 summarizes the results. 
2 Ridge Regression and the Leave-One-Out procedure 
In order to describe our transductive method, let us first discuss the classical two- 
step (inductive plus deductive) procedure of Ridge Regression. Consider the set of 
functions linear in their parameters 
n 
= (7) 
i----1 
To minimize the expected loss (5), where F(x, y) is unknown, we minimize the 
following empirical functional (the so-called Ridge Regression functional [1]) 
I t 
Jte,p() = - f(x, + 11ll (S) 
i----1 
where 7 is a fixed positive constant, called the regularization parameter. The min- 
imum is given by the vector of coefficients 
OZt ---- OZ(Xl, Yl,..-, xt, Yt) ---- (KTK 4' ")'I) -1K:ry (9) 
where 
Y = (Yl,..., yt) T, 
and K is a matrix with elements: 
(10) 
Kij =ckj(xi), i= l,...,e, j=l,...,n. (11) 
The problem is to choose the value - which provides small expected loss for training 
on a sample St - {(Xl,yl),...,(xt, yt)}. 
For this purpose, we would like to choose '7 such that fv minimizing (8) also mini- 
mizes 
 = f(y* - fv(x*[$t))dF(x*,y*)dF($t). (12) 
Transductive Inference for Estimating Values of Funca'ons 423 
Since F(x, y) is unknown one cannot estimate this minimum directly. To solve this 
problem we instead use the leave-one-out procedure, which is an almost unbiased 
estimator of (12). The leave-one-out error of an algorithm on the training sample 
St is 
1 
Ttoo(,) =   (y - f(xl& \ (x,yd)  . (13) 
i=1 
The leave-one-out procedure consists of removing from the training data one el- 
ement (say (xi,yi)), constructing the regression function only on the basis of the 
remaining training data and then testing the removed element. In this fashion one 
tests all/ elements of the training data using/ different decision rules. The mini- 
mum over 3' of (13) we consider as the minimum over 3' of (12) since the expectation 
of (13) coincides with (12) [2]. 
For Ridge Regression, one can derive a closed form expression for the leave-one-out 
error. Denoting 
A;1: (KT K ..]_ 3'_/-)-1 (14) 
the error incurred by the leave-one-out procedure is [6] 
where 
rl(3') --  i--1  "-- k/T J 
kt -- (ql(Xt)...,q}n(Xt)) T. 
(15) 
(16) 
Let 3' - ,),0 be the minimum of (15). Then the vector 
yO = K,(KTK + 3'oi)-1KT Y 
where 
K* - ( q(Xt+l) ... qn(Xt+l) 
ql(Xt+m) ... qn(Xt+m) 
is the Ridge Regression estimate of the unknown values (Yt*+l,'  , 
(17) 
(18) 
3 Leave-One-Out Error for Transductive Inference 
In transductive inference, our goal is to find an algorithm A which minimizes the 
functional (4) using both the training data (1) and the test data (2). We suggest the 
following method: predict ( t+l .. Yt+m) by finding those values which minimize 
the leave-one-out error of Ridge Regression training on the joint set 
(Xl, Yl ),''', (It, Yt), (It+l, Y+i ),''', (It+m, Y+m)' (19) 
This is achieved in the following way. Suppose we treat the unknown values 
(Yt*+l,.",Yt*+m) as variables and for some fixed value of these variables we min- 
imize the following empirical functional 
, (y- f(x,)? + y}. (y - (x,)? +ll[I . 
*P(lY"" Y): e + m 
i=+1 
(20) 
This functional differs only in the second term from the functional (8) and corre- 
sponds to performing Ridge Regression with the extra pairs 
(+ 1, y,% 1),.--, (,+, y% ). (21) 
424 O. Chapelle, K N. Vapnik and J. Weston 
Suppose that vector Y* - (y{,... ,y) is taken from some set Y* E Y such that 
the pairs (21) can be considered as a sample drawn from the same distribution as 
the pairs (Xl,yi),... (xl,y* 
, t)- In this case the leave-one-out error of minimizing 
(20) over the set (19) approximates the functional (4). We can measure this leave- 
one-out error using the same technique as in Ridge Regression. Using the closed 
form (15) one obtains 
, 1 Z (22) 
Tl(fflY1""'Y)--+m i= 1  ;Z^kiii J 
where we denote 2 = (Xl,..., xt+),  (Yl,-.., Yt, Yt+l,' ", Yt+m) , and 
A;  = (RrR + ,0 -1 (2) 
Rij = (fij(2i), i = l,...,g + m, j = l,...,n. 
k t ---- ((l(t)...,(n(:t)) r. 
(24) 
(25) 
Now let us rewrite the expression (22) in an equivalent form to separate the terms 
with  from the terms with x. Introducing 
C = I - RA; 1R T, (26) 
and the matrix M with elements 
MO =  CikCk 
k----1 Ckk2 
(27) 
we obtain the equivalent expression of (22) 
1 
Ttoo(ly , - 
'"'Y) +m 
(28) 
In order for the * which minimize the leave-one-out procedure to be valid it 
is required that the pairs (21) are drawn from the same distribution as the pairs 
(21, y),..., (xi, y). To satisfy this constraint we choose vectors Y* from the set 
Y: {Y*: IIY* - Y11 _< R) (29) 
where the vector yo is the solution obtained from classical Ridge Regression. 
To minimize (28) under constraint (29) we use the functional 
TZ(7) = ?TM- + PllY* - Yll (30) 
where 7' is a constant depending on R. 
Now, to find the values at the given points of interest (2) all that remains is to find 
the minimum of (30) in Y*. Note that the matrix M is obtained using only the 
vectors 5. Therefore, to find the minimum of this functional we rewrite Equation 
(30) as 
where 
T;(ff)=yTMoY+2y*TMiY+y'TM2 * +PllY* - Y11  (31) 
Mo M1 
(32) 
Transductive Inference for Estimating Values of Functions 425 
and M0 is a  x  matrix, M1 is a  x m matrix and M2 is a m x m matrix. Taking 
the derivative of (31) in Y* we obtain the condition for the solution 
2M1Y + 2M2Y* - 23'*Y  + 23'*Y* = 0 (33) 
which gives the predictions 
Y* = (3"1 + M2)-(-MY + 3',yO). (34) 
In this algorithm (which we will call Transductive Regression) we have two param- 
eters to control: 3' and if*. The choice of ff can be found using the leave-one-out 
estimator (15) for Ridge Regression. This leaves * as the only free parameter. 
4 Experiments 
To compare our one-step transductive approach with the classical two-step ap- 
proach we performed a series of experiments on regression problems. We also de- 
scribe experiments applying our technique to the problem of pattern recognition. 
4.1 Regression 
We conducted computer simulations for the regression problem using two datasets 
from the DELVE repository: boston and kin-32fh. 
The boston dataset is a well-known problem where one is required to estimate 
house prices according to various statistics based on 13 locational, economic and 
structural features from data collected by the U.S Census Service in the Boston 
Massachusetts area. 
The kin-32fh dataset is a realistic simulation of the forward dynamics of an 8 link 
all-revolute robot arm. The task is to predict the distance of the end-effector from 
a target, given 32 inputs which contain information on the joint positions, twist 
angles and so forth. 
Both problems are nonlinear and contain noisy data. Our objective is to com- 
pare our transductive inference method directly with the inductive method of 
Ridge Regression. To do this we chose the set of basis functions c)i(x) : 
exp (-II x -xill2/2rr2), i = 1,... ,, and found the values of 3' and rr for Ridge 
Regression which minimized the leave-one-out bound (15). We then used the same 
values of these parameters in our transductive approach, and using the basis func- 
tions c)i(x) - exp (-IIx - 5i112/2rr 2) , i = 1,... , + m, we then chose a fixed value 
of 3'*. 
For the boston dataset we followed the same experimental setup as in [4], that 
is, we partitioned the training set of 506 observations randomly 100 times into a 
training set of 481 observations and a testing set of 25 observations. We chose the 
values of 3' and rr by taking the minimum average leave-one-out error over five more 
random splits of the data stepping over the parameter space. The minimum was 
found at '7 - 0.005 and log a - 0.7. For our transductive method, we also chose 
the parameter 3'*: 10. In Figure la we plot mean squared error (MSE) on the test 
set averaged over the 100 runs against log rr for Ridge Regression and Transductive 
Regression. Transductive Regression outperforms Ridge Regression, especially at 
the minimum. 
To observe the influence of the number of test points m on the generalization ability 
of our transductive method, we ran further experiments, setting if* = f/2m for 
426 O. Chapelle, E N. Vapnik and J. Weston 
9.2 
9 
8.8 
8.6 
8 
7.8 
7.6 
0.14 
0.13 
m0.12 
I-- 
' I -- Transductive Regression 
,, 1--- Ridge Regre,n I 
o'.4 o'.5 o; ; 1.2 
Log sigma 
0.15 ', 
, -- Transductive Regression 
', - - - Ridge Regression 
0.1 ", 
0'091 1.5 2 
Log sigma 
8 
7.95 
7.9 
7.85 
uJ 
 7.8 
7'75 t 
-- Transductive Regression 
- -- R dge Regress on 
0.14 
0.135 
0,13 
0.125 
0.12 
uJ_ 0.115 
0.11 
0.105 
0.1 
0.095 
0.09 
5 10 15 20 25 
Test Set Size 
I -- Transductive Regression 
[ - -- Ridge Regression 
2'.5 5'0 100 10 200 250 
Test Set Size 
(a) 
Figure 1: A comparison of Transductive Regression to Ridge Regression on the 
boston dataset: (a) error rates for varying rr, (b) varying the test set size, m, and 
on the kin-32fh dataset: (c) error rates for varying rr, (d) varying the test set size. 
different values of m. In Figure lb we plot m against MSE on the testing set, at 
log rr - 0.7. The results indicate that increasing the test set size gives improved 
performance in Transductive Regression. For Ridge Regression, of course, the size 
of the testing set has no influence on the generalization ability. 
We then performed similar experiments on the kin-32fh dataset. This time, as we 
were interested in large testing sets giving improved performance for Transductive 
Regression we chose 100 splits where we took a subset of only 64 observations for 
training and 256 for testing. Again the leave-one-out estimator was used to find the 
values ' - 0.1 and log a = 2 for Ridge Regression, and for Transductive Regression 
we also chose the parameter '' - 0.1. We plotted MSE on the testing set against 
log rr (Figure lc) and the size of the test set m for log rr = 2.75 (also, '' = 50/m) 
(Figure ld) for the two algorithms. For large test set sizes our method outperforms 
Ridge Regression. 
4.2 Pattern Recognition 
This technique can also be applied for pattern recognition problems by solving them 
based on minimizing functional (8) with y - +1. Such a technique is known as a 
Linear Discriminant (LD) technique. 
Transductive Inference for Estimating Values of Functions 42 7 
AB ABt SVM TLD 
Postal - - 5.5 4.7 
Banana 12.3 10.9 11.5 11.4 
Diabetes 26.5 23.8 23.5 23.3 
Titanic 22.6 22.6 22.4 22.4 
Breast Cancer 30.4 26.5 26.0 25.7 
Heart 20.3 16.6 16.0 15.7 
Thyroid 4.4 4.6 4.8 4.0 
Table 1: Comparison of percentage test error of AdaBoost (AB), Regularized Ad- 
aBoost (ABt), Support Vector Machines (SVM) and Transductive Linear Discrim- 
ination (TLD) on seven datasets. 
Table 1 describes results of experiments on classification in the following problems: 
2 class digit recognition (0 - 4 versus 5 - 9) splitting the training set into 23 runs 
of 317 observations and considering a testing set of 2000 observations, and six 
problems from the UCI database. We followed the same experimental setup as 
in [3]: the performance of a classifier is measured by its average error over one 
hundred partitions of the datasets into training and testing sets. Free parameter(s) 
are chosen via validation on the first five training datasets. The performance of the 
transductive LD technique was compared to Support Vector Machines, AdaBoost 
and Regularized AdaBoost [3]. 
It is interesting to note that in spite of the fact that LD technique is one of the sim- 
plest pattern recognition techniques, transductive inference based upon this method 
performs well compared to state of the art methods of pattern recognition. 
5 Summary 
In this article we performed transductive inference in the problem of estimating 
values of functions at the points of interest. We demonstrate that estimating the 
unknown values via a one-step (transductive) procedure can be more accurate than 
the traditional two-step (inductive plus deductive) one. 
References 
[1] A. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthog- 
onal problems. Technometrics, 12(1):55-67, 1970. 
[2] A. Luntz and V. Brailovsky. On the estimation of characters obtained in statis- 
tical procedure of recognition,. Technicheskaya Kibernetica, 1969. [In Russian]. 
[3] G. R5tsch, T. Onoda, and K.-R. Miiller. Soft margins for adaboost. Technical 
report, Royal Holloway, University of London, 1998. TR-98-21. 
[4] C. Saunders, A. Gammermann, and V. Vovk. Ridge regression learning algo- 
rithm in dual variables. In Proccedings of the 15th International Conference on 
Machine Learning, pages 515-521. Morgan Kaufmann, 1998. 
[5] V. Vapnik. Estimating of values of regression at the point of interest. In Method 
of Pattern Recognition. Sovetskoe Radio, 1977. [In Russian]. 
[6] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer- 
Verlag, New York, 1982. 
