Training Data Selection 
for Optimal Generalization 
in Trigonometric Polynomial Networks 
Masashi Sugiyama*and Hidemitsu Ogawa 
Department of Computer Science, Tokyo Institute of Technology, 
2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan. 
sugi@cs. titech. ac.jp 
Abstract 
In this paper, we consider the problem of active learning in trigonomet- 
ric polynomial networks and give a necessary and sufficient condition of 
sample points to provide the optimal generalization capability. By ana- 
lyzing the condition from the functional analytic point of view, we clarify 
the mechanism of achieving the optimal generalization capability. We 
also show that a set of training examples satisfying the condition does 
not only provide the optimal generalization but also reduces the compu- 
tational complexity and memory required for the calculation of learning 
results. Finally, examples of sample points satisfying the condition are 
given and computer simulations are performed to demonstrate the effec- 
tiveness of the proposed active learning method. 
1 Introduction 
Supervised learning is obtaining an underlying rule from training examples, and can 
be formulated as a function approximation problem. If sample points are actively 
designed, then learning can be performed more efficiently. In this paper, we discuss 
the problem of designing sample points, referred to as active learning, for optimal 
generalization. 
Active learning is classified into two categories depending on the optimality. One 
is global optimal, where a set of all training examples is optimal (e.g. Fedorov 
[3]). The other is greedy optimal, where the next training example to sample is 
optimal in each step (e.g. MacKay [5], Cohn [2], Fukumizu [4], and Sugiyama and 
Ogawa [10]). In this paper, we focus on the global optimal case and give a new ac- 
tive learning method in trigonometric polynomial networks. The proposed method 
does not employ any approximations in its derivation, so that it provides exactly 
the optimal generalization capability. Moreover, the proposed method reduces the 
computational complexity and memory required for the calculation of learning re- 
sults. Finally, the effectiveness of the proposed method is demonstrated through 
computer simulations. 
* http://ogawa-www.cs.titech.ac.jp/~sugi. 
Training Data Selection for Optimal Generalization 625 
2 Formulation of supervised learning 
In this section, the supervised learning problem is formulated from the functional 
analytic point of view (see Ogawa [7]). Then, our learning criterion and model are 
described. 
2.1 Supervised learning as an inverse problem 
Let us consider the problem of obtaining the optimal approximation to a target 
function f(x) of L variables from a set of M training examples. The training 
examples are made up of sample points Xm E T), where T) is a subset of the L- 
dimensional Euclidean space R , and corresponding sample values Ym  C: 
n M 
{(Xm, Ym) l Ym---- f(Xm) + m}m=l, (1) 
where Ym is degraded by zero-mean additive noise nm. Let n and y be M- 
dimensional vectors whose m-th elements are nm and Ym, respectively. y is called 
a sample value vector. In this paper, the target function f(x) is assumed to be- 
long to a reproducing kernel Hilbert space H (Aronszajn [1]). If H is unknown, 
then it can be estimated by model selection methods (e.g. Sugiyama and Ogawa 
[9]). Let K(-, .) be the reproducing kernel of H. If a function Pm(X) is defined 
as Pm(X) = K(x, Xm), then the value of f at a sample point Xm is expressed as 
f(Xm) = (f, )m), where (., .) stands for the inner product. For this reason, )m is 
called a sampling function. Let A be an operator defined as 
M 
A =  (era  m), (2) 
m=l 
where em is the m-th vector of the so-called standard basis in C M and (.  r) 
stands for the Neumann-Schatten product 1. A is called a sampling operator. Then, 
the relationship between f and y can be expressed as 
y=Af +n. 
Let us denote a mapping from y to a learning result f0 by X: 
fo = Xy, 
(4) 
where X is called a learning operator. Then, the supervised learning problem is 
reformulated as an inverse problem of obtaining X providing the best approximation 
f0 to f under a certain learning criterion. 
2.2 Learning criterion and model 
As mentioned above, function approximation is performed on the basis of a learning 
criterion. Our purpose of learning is to minimize the generalization error of the 
learning result f0 measured by 
JG = Enllf0 - fl[ 2, (5) 
where En denotes the ensemble average over noise. In this paper, we adopt projec- 
tion learning as our learning criterion. Let A*, T(A*), and P(A*) be the adjoint 
operator of A, the range of A*, and the orthogonal projection operator onto T(A*), 
respectively. Then, projection learning is defined as follows. 
XFor any fixed g in a Hilbert space Hx and any fixed f in a Hilbert space H2, the 
Neumann-Schatten product (f  ) is an operator from Hx to H2 defined by using any 
h C nx as (f  )h = (h,g)f. 
626 M. Sugiyama and H. Ogawa 
Definition 1 (Projection learning) (Ogawa [6]) An operator X is called the 
projection learning operator if X minimizes the functional Jp IX] = En II Xn 112 under 
the constraint XA = P(A*). 
It is well-known that Eq.(5) can be decomposed into the bias and variance: 
JG = [IP(A.)f- f[[2 + EnllXnll 2. (6) 
Eq.(6) implies that the projection learning criterion reduces the bias to a certain 
level and minimizes the variance. 
Let us consider the following function space. 
Definition 2 (Trigonometric polynomial space) Let x = ((1),(2),...,(L))T. 
For i _< 1 _< L, let Nl be a positive integer and Z)l = [-r, r]. Then, a function 
space H is called a trigonometric polynomial space of order (N1, N2,-.., N) if H 
is spanned by 
exp(int (0) (7) 
/=1 nl=-N1 ,n2------N2,"',nL------NL 
defined on D1 x D2 x ... x D, and the inner product in H is defined as 
(f,g)- (2r)  .-. f(x)g(x)d(1)d(2)...d (). (8) 
The dimension/ of a trigonometric polynomial space of order (N1, N2,..., N) is 
/ ---- rIL=i(2Ni q- 1), and the reproducing kernel of this space is expressed as 
where 
K((),() ') = { 
L 
K(x, x') = H (9) 
/=1 
-(t)') / . (t)-(t)' 
sin (2N+1)((0 _ '/ran 
2 2 
2Nt+l 
if (t)  0)', (10) 
if () -- (t)'. 
3 Active learning in trigonometric polynomial space 
The problem of active learning is to find a set M 
{Xm}m= 1 of sample points providing 
the optimal generalization capability. In this section, we give the optimal solution 
to the active learning problem in the trigonometric polynomial space. 
Let A t be the Moore-Penrose generalized inverse  of A. Then, the following propo- 
sition holds. 
Proposition 1 If the noise covariance matrix Q is given as Q = a2 with a 2 > O, 
then the projection learning operator X is expressed as X = A t. 
Note that the sampling operator A is uniquely determined by M 
{Xm}m= 1 (see Eq.(2)). 
From Eq. (6), the bias of a learning result f0 becomes zero for all f in H if and only 
if iV(A) = {0}, where iV(.) stands for the null space of an operator. For this reason, 
2An operator X is called the Moore-Penrose generalized inverse of an operator A if X 
satisfies AXA = A, XAX = X, (AX)* = AX, and (XA)* = XA. 
Training Data Selection for Optimal Generalization 62 7 
C M 
Figure 1' Mechanism of noise suppression by Theorem 1. If a set M 
(Xm)m=l of sample 
points satisfies A*A - MI, then XAf -- f, [[Xnl[[ - 1 
11111, and = O. 
we consider the case where a set M 
{Xm}m=l of sample points satisfies Af(A) = {0). 
In this case, Eq.(6) is reduced to 
aG = Enlln*nll 2 , 
which is equivalent to the noise variance in H. Consequently, the problem of active 
x M 
learning becomes the problem of finding a set { m}m=l of sample points minimizing 
Eq.(11) under the constraint Af(A) = {0}. 
First, we derive a condition for optimal generalization in terms of the sampling 
operator A. 
Theorem 1 Assume that the noise covariance matrix Q is given as Q = a2I with 
cr 2 > O. Then, JG in Eq. (11) is minimized under the constraint iV(A) = {0} if and 
only if 
A*A=M, (12) 
where I denotes the identity operator on H. In this case, the minimum value of J 
is a2tz/M, where tz is the dimension of H. 
Eq.(12) implies that i M 
{)m}m=l forms a pseudo orthonormal basis (Ogawa [8]) 
in H, which is an extension of orthonormal bases. The following lemma gives 
interpretation of Theorem 1. 
Lemma 1 
x M 
When a set { m}m=l of sample points satisfies Eq.(12), it holds that 
XAf = f for all f E H, (13) 
IlAfl[ = v/-llfll for all f e H, (14) 
1 '(A), 
IlXull - 11ull for u  (15) 
0 for u  T4(A)  . 
1 
Eqs.(14) and (15) imply that A becomes an isometry and vX becomes a 
partial isometry with the initial space T(A), respectively. Let us decompose the 
noise n as n = nl q- n2, where nl G T(A) and n2  T(A) . Then, the sample value 
vector y is rewritten as y = Af + nl q- n2. It follows from Eq.(13) that the signal 
component Af is transformed into the original function f by X. From Eq.(15), X 
suppresses the magnitude of noise nl in T(A) by 1 and completely removes the 
628 M. Sugiyama and H. Ogawa 
Xl X2 23 X4 X5 X6 
--71' C 2.__E.  71' 
M 
(a) Theorem 2 
I X4 X5 
, x x 2 
(b) Theorem3 
Figure 2: Two examples of sample points such that Condition (12) holds (p = 3 
and M = 6). 
noise n2 in 7(A) -L. This analysis is summarized in Fig.1. Note that Theorem 1 and 
its interpretation are valid for all Hilbert spaces such that K(x, x) is a constant for 
any x. 
In Theorem 1, we have given a necessary and sufficient condition to minimize JG 
in terms of the sampling operator A. Now we give two examples of sample points 
M 
{Xm}m= 1 such that Condition (12) holds. From here on, we focus on the case when 
the dimension L of the input x is 1 for simplicity. However, the following results 
can be easily scaled to the case when L > 1. 
Theorem 2 Let M _> I, 
constant such that -r < 
determined as 
then Eq. (12) holds. 
where I is the dimension of H. Let c be an arbitrary 
2,r M 
c _< -r + . If a set {Xm}m= 1 of sample points is 
1), 
(16) 
Theorem 3 Let M = kl where k is a positive integer. 
constant such that -r < c < -r + 2 
determined as 
2r 
Xm -- C + r, where 
then Eq. (12) holds. 
Let c be an arbitrary 
x M 
If a set { m}m=l of sample points is 
r = m- 1 (mod/), 
(17) 
Theorem 2 means that M sample points are fixed to 2r/M intervals in the domain 
[-r, r] and sample values are gathered once at each point (see Fig.2 (a)). In con- 
trast, Theorem 3 means that / sample points are fixed to 2r// intervals in the 
domain and sample values are gathered k times at each point (see Fig.2 (b)). 
Now, we discuss calculation methods of the projection learning result fo(x). Let hm 
be the m-th column vector of the M-dimensional matrix (AA*) t. Then, for general 
sample points, the projection learning result fo(x) can be calculated as 
M 
fo(x) --  (y, hm}m(X). (18) 
m=l 
When we use the optimal sample points satisfying Condition (12), the following 
theorems hold. 
Theorem 4 
lated as 
When Eq. (12) holds, the projection learning result fo(x) can be calcu- 
M 
fo(X) ---- M  YmCm(X). (19) 
m----1 
Training Data Selection for Optimal Generalization 629 
Theorem 5 When sample points are determined following Theorem 3, the projec- 
tion learning result fo(x) can be calculated as 
i ' i k 
So(x)- where y-;= (20) 
p=l q=l 
In Eq.(18), the coefficient of )m(X) is obtained by the inner product (y, hm). In 
contrast, it is replaced with Ym/M in Eq. (19), which implies that the Moore-Penrose 
generalized inverse of AA* is not required for calculating fo(x). This property 
is quite useful when the number M of training examples is very large since the 
calculation of the Moore-Penrose generalized inverse of high dimensional matrices 
is sometimes unstable. In Eq.(20), the number of basis functions is reduced to/ 
and the coefficient of bp(X) is obtained by pp/U, where pp is the mean sample values 
at Xp. 
For general sample points, the computational complexity and memory required for 
calculating fo(x) by Eq.(18) are both O(M2). In contrast, Theorem 4 states that 
if a set of sample points satisfies Eq.(12), then both the computational complexity 
and memory are reduced to O(M). Hence, Theorem 1 and Theorem 4 do not only 
provide the optimal generalization but also reduce the computational complexity 
and memory. Moreover, if we determine sample points following Theorem 3 and 
calculate the learning result f0 (x) by Theorem 5, then the computational complexity 
and memory are reduced to O(/). This is extremely efficient since/ does not depend 
on the number M of training examples. The above results are shown in Tab.1. 
4 Simulations 
In this section, the effectiveness of the proposed active learning method is demon- 
strated through computer simulations. 
Let H be a trigonometric polynomial space of order 100, and the noise covariance 
matrix Q be Q - I. Let us consider the following three sampling schemes. 
(A) Optimal sampling: Training examples are gathered following Theorem 3. 
(B) Experimental design: Eq.(2) in Cohn [2] is adopted as the active learning 
criterion. The value of this criterion is evaluated by 30 reference points. The 
sampling location is determined by multi-point-search with 3 candidates. 
(C) Passive learning: Training examples are given unilaterally. 
Fig.3 shows the relation between the number of training examples and the gener- 
alization error. The horizontal and vertical axes display the number of training 
examples and the generalization error JG measured by Eq.(5), respectively. The 
solid line shows the sampling scheme (A). The dashed and dotted lines denote the 
averages of 10 trials of the sampling schemes (B) and (C), respectively. When the 
number of training examples is 201, the generalization error of the sampling scheme 
(A) is 1 while the generalization errors of the sampling schemes (B) and (C) are 
3.18 x 104 and 8.75 x 104, respectively. This graph illustrates that the proposed sam- 
pling scheme gives much better generalization capability than the sampling schemes 
(B) and (C) especially when the number of training examples is not so large. 
5 Conclusion 
We proposed a new active learning method in the trigonometric polynomial space. 
The proposed method provides exactly the optimal generalization capability and 
630 M. Sugiyama and H. Ogawa 
Table 1: Computational complexity 
and memory required for projection . . 
Optimal sampling 
 l "' "'" .... 
i- 2 .,. : ...... 
learning. 
Computational 
Calculation Complexity 
methods and 
Memory 
Eq.(18) O(M ) 
Theorem 4 O(M) 
Theorem 5  O(u) 
300 400 500 600 
M = k/ where/ is the dimension The number of training examples 
of H and k is a positive integer. Figure 3: Relation between the number of 
training examples and the generalization er- 
ror. 
at the same time, it reduces the computational complexity and memory required 
for the calculation of learning results. The mechanism of achieving the optimal 
generalization was clarified from the functional analytic point of view. 
References 
[1] N. Aronszajn. Theory of reproducing kernels. Transactions on American Mathemat- 
ical Society, 68:337-404, 1950. 
[2] D. Cohn. Neural network exploration using optimal experiment design. In J. Cowan 
et al. (Eds.), Advances in Neural Information Processing Systems 6, pp. 679-686. 
Morgan-Kaufmann Publishers Inc., San Mateo, CA, 1994. 
[3] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972. 
[4] K. Fukumizu. Active learning in multilayer perceptrons. In D. Touretzky et al. (Eds.), 
Advances in Neural Information Processing Systems 8, pp. 295-301. The MIT Press, 
Cambridge, 1996. 
[5] D. MacKay. Information-based objective functions for active data selection. Neural 
Computation, 4(4):590-604, 1992. 
[6] H. Ogawa. Projection filter regularization of ill-conditioned problem. In Proceedings 
of SPIE, 808, Inverse Problems in Optics, pp. 189-196, 1987. 
[7] H. Ogawa. Neural network learning, generalization and over-learning. In Proceedings 
of the ICIIPS'9, International Conference on Intelligent Information Processing  
System, vol. 2, pp. 1-6, Beijing, China, 1992. 
[8] H. Ogawa. Theory of pseudo biorthogonal bases and its application. In Research 
Institute for Mathematical Science, RIMS Kokyuroku, 1067, Reproducing Kernels 
and their Applications, pp. 24-38, 1998. 
[9] M. Sugiyama and H. Ogawa. Functional analytic approach to model selection-- 
Subspace information criterion. In Proceedings of 1999 Workshop on Information- 
Based Induction Sciences (IBIS'99), pp. 93-98, Syuzenji, Shizuoka, Japan, 1999 
(Its complete version is available at ftp://ftp.cs.titech.ac.jp/pub/TR/99/TR99- 
0009.ps.gz). 
[10] M. Sugiyama and H. Ogawa. Incremental active learning in consideration of bias, 
Technical Report of IEICE, NC99-56, pp. 15-22, 1999 (Its complete version is avail- 
able at ftp://ftp.cs.titech.ac.jp/pub/TR/99/TR99-OO10.ps.gz). 
