Discovering hidden features with 
Gaussian processes regression 
Francesco Vivarelli 
Centro Ricerche Ambientali 
Montecatini, 
via Ciro Menotti, 48 
48023 Marina di Ravenna 
Italy 
fvivarellicramont. it 
Christopher K. I. Williams 
Division of Informatics 
The University of Edinburgh, 
5 Forrest Hill, 
Edinburgh, EH1 2QL 
United Kingdom 
ckiwdai. ed. ac. uk 
Abstract 
In Gaussian process regression the covariance between the outputs 
at input locations x and x' is usually assumed to depend on the 
distance (x - x') T W (x - x'), where W is a positive definite mat- 
rix. W is often taken to be diagonal, but if we allow W to be a 
general positive definite matrix which can be tuned on the basis of 
training data, then an eigen-analysis of W shows that we are ef- 
fectively creating hidden features, where the dimensionality of the 
hidden-feature space is determined by the data. We demonstrate 
the superiority of predictions using the general matrix over those 
based on a diagonal matrix on two test problems. 
I Introduction 
Over the last few years Bayesian approaches to prediction with neural networks have 
come to the fore. Following an argument in Neal (1996) concerning the equivalence 
between infinite neural networks and certain Gaussian processes, Gaussian process 
(GP) prediction has also become popular, and Rasmussen (1996) has demonstrated 
good performance of GP predictors on a number of tasks. 
In Gaussian process prediction as applied by Rasmussen (1996), Williams and 
Rasmussen (1996) and others, the covariance between the outputs at locations x 
and x' is usually assumed to depend on the distance (x - x')TW(x -- x'), where 
W is a positive definite, diagonal matrix. This means that different dimensions in 
the input space can have different relevances to the prediction problem (c.f. MacKay 
and Neal's idea of Automatic Relevance Determination (Neal, 1996)). However, 
some of the reasoning about the success of neural networks and methods such as 
projection pursuit regression suggests that discovering relevant directions in feature 
space is important; clearly the ARD model is a special case, where these directions 
614 F.. 4varelli and C. K. I. Williams 
are parallel to the axes in the input feature space. In this paper we allow W to be a 
general positive semidefinite matrix (defining a Mahalanobis distance in the input 
space), thereby allowing general directions in the input space to be selected. We 
then compare the performance of GP predictors using the diagonal and full distance 
matrices on some regression problems. 
The structure of the paper is as follows. GPs for regression are introduced in Section 
2, where we also explain the rSle played by the distance matrix W and the criterion 
used to compare the generalisation performances of the diagonal and the general 
distance matrices. The two methods have been compared on two regression tasks 
and the results of our experiments are shown in Section 3. A summary of the work 
done and some open questions are presented in Section 4. 
2 Gaussian processes and prediction 
In this paper we use Gaussian process models as predictors. Consider a stochastic 
process Y (x), with the input observable x belonging to some input space X C_ I e. 
Gaussian processes are a subset of stochastic processes that can be defined by 
specifying the mean and covariance functions, /(x) = [Y (x)] and Cp (x,x ) - 
[ (x) (x)] respectively. For the work below we shall set / (x) -- 0. Al- 
though the GP formulation provides a prior over functions, for our purposes it 
suffices to note that the y-values  (xl),  (x2),...,  (x n) corresponding to x- 
values x 1, x2,..., x n have a multivariate Gaussian distribution A; (0, Kp), where 
(Kp)ij = Cp (xi, xJ). The specific form of the covariance function that we shall use 
is 
2exp[ 1 (x -- x')T ] 
(x,x') = % -5 w (x - x') . 
(1) 
When W is a diagonal matrix the entry wii is the inverse of the squared correla- 
tion length-scale of the process along the direction i. In particular, we note that 
this model is closely related to the Automatic Relevance Determination method of 
MacKay and Neal (Neal, 1996), as a small lengthscale along a certain direction of 
the space highlights the relevance of the corresponding input feature (assuming that 
the inputs are normalised). 
For the prediction problem, let us suppose to have n data points D, - 
{ (xl,tl), (x,t),..., (x",t ") }, where t i is the output-value corresponding to the 
input x i. The t's are assumed to be generated from the true y-values by adding 
 Given the assumption of a Gaussian process prior 
Gaussian noise of variance 
over functions, it is a standard result (e.g. Whittle, 1963) that the predictive dis- 
tribution p(t]x,Z)n)corresponding to a new input is Af ()(x),c 2 (x)), with mean 
and variance 
(x) = k T (x) K-t 
2 _ k T 
cy 2 (x) - Cp (x,x) 4- cy v (x) K-lk (x), (3) 
where K = Kp + a2I, k T (x) = (Cp (x,x ) ,Cp (x, x2),... ,Cp (x,x)) and t  = 
(,l,t2,...t n) . 
This method of prediction assumes that the process y (x) we are modelling is really 
a function of the observable x. However it is often the case that for real world 
problems the y is actually a function of a set of hidden features z  Z C q which 
arise from a combination of the manifest variables x. In particular we wish to study 
the problem in which the hidden features are a linear combination of the observable 
Discovering Hidden Features with Gaussian Processes Regression 615 
coordinates through a q x d matrix M, where q < d (i.e.z = Mx). In this case, 
the covariance of the function y is specified by Equation I but turns out to depend 
upon the estimation of the distance between hidden features (z- z')T (z- z'). 
Sincez=Mx,(z-z ')=M(x-x ) andW=MTM. 
A GP model depends on the parameters which describe the covariance function 
2 and the elements of W). The training of a GP can be carried out by 
(i.e.a, crv 
either estimating the parameters of the covariance function (for example, using the 
maximum likelihood method) or using a Bayesian approach and sampling from the 
posterior distribution over the parameters (Williams and Rasmussen, 1996). We 
follow the first approach, maximising the logarithm of the likelihood 
/2 = logp (D, I0 ) = -- 
1 log det K - -tTK-it -- n 
2  log 2r (4) 
where K -1 depends upon 0, the vector of parameters of the covariance function. 
The number of free parameters depends on the number of non-zero elements of the 
matrix W. Usually, W is chosen to be diagonal and the number of free parameters 
2 and cry,) We notice that this parametrisation 
is d + 2 (the d diagonal elements, ap . 
of W allows the discovery of relevant directions in the observed space; it does not 
lead to an estimation of a general mapping of X onto the feature space Z as the 
relevant directions are parallel to the axes in the input manifest space. 
If q is not known in advance, it is preferable to use a general symmetric posit- 
ive semidefinite matrix W. A parametrisation of such a matrix follows from the 
Choleski decomposition as W = uTu, where U is an upper triangular matrix with 
positive entries on the diagonal (Williams, 1996). Hence the factorisation of U turns 
out to be 
exp [Ul,1] Ul,2 ... t/1,d 
U = 0 exp [u2,2] ... u2, (5) 
0 0 ... ua,a ' 
......... exp 
The elements on the diagonal are positive because of the exponential. Being sym- 
metric, W has at most d (d + 1)/2 independent entries and thus the total number 
of free parameters of the GP model is 2 + d (d + 1)/2. 
We note that such a full distance matrix IV allows an estimation of the matrix M 
from an eigenvalue decomposition of W = VAV T, where A is a diagonal matrix of 
the eigenvalues of W and V is the matrix of the eigenvectors. The dimension of the 
hidden feature space Z can be inferred by the number of relevant eigenvalues of the 
matrix A (which are the inverse of the squared correlation lengths of the process 
along the directions of the hidden space). The directions of the hidden feature 
space are defined by the eigenvectors corresponding to the relevant eigenvalues; 
in particular the matrix composed by these eigenvectors gives an estimate of the 
mapping from X to Z. In the following the diagonal and the general full correlation 
matrices are designated by Wd and 
It is important to observe that the predictor obtained using Wf is not equivalent to 
an additive model (Hastie and Tibshirani, 1990), as the predictor is a multivariate 
function of z rather than being an additive function of the components of z. How- 
ever, it would be possible to produce an additive function in the GP context, using a 
covariance function which is the sum of one-dimensional covariance functions based 
on projections of x. 
616 E I4varelli and C. K. I. Williams 
2.1 Generalisation error 
Consider predicting the value of a function y(x) with a predictor (x). A commonly- 
used measure of the generalisation error given a dataset Dn is the average squared 
error 
E g (T)n) -- / (y(x) - v (x))2P (x) dx. 
(6) 
The average generalisation error E g (n) for a dataset of size n is obtained by aver- 
aging over the choice of training dataset, i.e. E g (n) = v [Eg (/)n)]. Eg (/)n) can 
sometimes be evaluated analytically or by numerical integration, but it is usually 
necessary to use samples to perform the average over training datasets 
In order to investigate the generalisation capabilities of GPs using a diagonal and 
full distance matrices Wd and Wf, we trained the GP predictors on some regression 
tasks. The generalisation errors are compared by looking at the relative error 
(7) 
where E (/)) and E (/)) are the generalisation errors reported using a diagonal 
and a full distance matrix respectively. This ratio allow us to perform a fair compar- 
ison between the pairwise differences of the generalisation errors for each dataset 
and the actual value E (/)). The expected value p (n) is the average over the 
sampling of the training data T)n: p(n): v [p (T))]. 
3 Results 
We have conducted experiments to compare the generalisation capabilities of a GP 
predictor with full and diagonal distance matrices. In this section we illustrate the 
results we obtained by training a GP on two regression tasks, the regression of a 
trigonometric function (Section 3.1), and the regression of a high-interaction surface 
(Section 3.2). 
3.1 Regression of a trigonometric function 
In the first experiments, a GP has been trained on observations drawn from the 
function y (z) = sin (2rz) corrupted by Gaussian noise of mean zero and variance 
2 10 -4 10 -3, 10 -2 10 - 1. The hidden feature z E I has been generated from 
the observable variables x E lt 2 through the transformation z = mTx, where m T = 
(1/v, 1/v) and x -- iV (0, lI). We wish to infer the process y (z) (which is actually 
a function of the one-dimensional feature z) by using a GP on the manifest space 
I 2 . We evaluated the expected generalisation errors of Equation 6 by Gaussian 
quadrature (Press et al., 1992) and estimated the expected relative error p (n) by 
averaging over 10 different samples of the training set. 
The parameters of the covariance function are optimised on each of the 10 training 
datasets by maximising the likelihood (see Equation 4) with the conjugate gradient 
algorithm (Press et al., 1992) with 50 (for Wd) and 70 (for Wf) iterations for the 
largest training sets with 256 data. 
Figure I reports the value of p (n) on the vertical axis as a function of the amount 
of training data (x axis). The variance of the noise has been set to 0.01 in Figure 
l(a) and 0.1 in Figure l(b). 
Discovering Hidden Features with Gaussian Processes Regression 617 
1 
0.75 
0.5 
0.25 
0 
-0.1 
0.7 
0.5 
0.25 
-0.2 
8 16 32 64 128 256 8 16 32 64 128 256 
n n 
(a) (b) 
Figure 1: The Figures report on the y axis the graphs of p (n) (see Equation 7) as a 
function of the amount of training data (x axis); the noise level is set to 0.01 (Figure l(a)) 
and to 0.1 (Figure l(b)). The error bars are generated by the minimum and the maximum 
value of p (T)) which occurred over the 10 training datasets. 
The plots show that the use of Wf significantly improves the generalisation perform- 
ance with respect to a diagonal matrix as the relative error p (n) lies well above zero, 
within its confidence interval. This is particularly highlighted in Figure l(a) where 
for datasets larger than 32 data, p (n) is larger than 75%. We notice that for small 
datasets, p (n) is close to zero, as the distribution of its values are spread out around 
zero with wide confidence intervals. This is due to the fact that with small amounts 
of data it is not possible to train the GP properly; in particular, as the number of 
free parameters of Wf is larger than that of Wd, the former needs larger datasets for 
the training than the latter in order to avoid overfitting. A fully Bayesian treatment 
of the training of a GP (see Section 2) would not be so seriously affected by this 
problem since the prediction of the GP would be marginalised over the posterior 
distribution of the parameters. For large datasets, the relative error declines after 
having reached its maximum value; this agrees with the intuition that with large 
amounts of data, both methods will be become good predictors. Similar remarks 
2 0.1) although we notice that the relative 
apply also to Figure l(b) (where rrv = 
error p (n) assumes lower values due to the higher noise variance. 
The better perromance of Wi with respect to Wd can be explained by an eigen- 
analysis of the two distance matrices. Since one eigenvalue of Wi is much larger than 
the other (O (10) vs. O (10 -4)), the full rank distance matrix is able to discover 
the relevant true dimension of the process. The eigenvector corresponding to the 
larger eigenvalue represents the operator which maps the space of the observables 
onto the hidden feature space. Wa fails to find out the effective dimension of the 
problem as it is characterised by two eigenvalues of similar magnitude (O (10)). 
3.2 A high-interaction surface 
We also tested our method on an example taken from Breiman (1993) which is 
concerned with a regression problem of a surface in a high dimensional space. The 
target function is y (x) = rr (z) + rr (z2) + cr (z3), where rr (z) is the sigmoid function 
a(z) = exp[z]/(1 + exp[z]). The hidden features zl, z2 and z3 are derived from 
the transformation zi = 2 (li -- 2), i = 1 ... 3, where the li are the normalised in- 
ner products m/Tx. The observed variables x E I  are uniformly distributed over 
618 E I4varelli and C. K. I. Williams 
0.5 
0.25 
0 
,-0.25 
-0.5 
-0.75 
-1 
0.8 
0.6 
,0.4 
0.2 
o w 
64 128 256 512 0 ' ' ; :  !   
n 1 2 5 6 7 8 9 10 
(a) (b) 
Figure 2: Figure 2(a) reports on the y axis the graph ofp (n) (see Equation 7) as a function 
of the amount of training data (x axis); the error bars are generated by the minimum and 
the maximum value of p(7) that occurred over the 10 training datasets. Figure 2(a) 
shows the graph of the ten eigenvalues of the Wd (*) and the Wf (o) distance matrices 
obtained using one training set of 512 data. The lower values reached by training sets with 
64 and 128 data are -1.06 and -1.97 respectively. 
[0,111; the three vectors mi are m T - (10,9,3,7,-6,-5,-9,-3,-2,-1), m T - 
(-1,-2,-3,-4,-5,-6,7,8,9,10) and m = (-1,-2,-3,4,5,4,-3,-2,-1,0). 
The values of the true function are also corrupted by Gaussian noise of mean zero; 
the variance of the noise was such that the ratio between the standard deviations 
of the signal y (x) and the the noise was 4.0, as in Breiman (1993). 
We have run experiments, training GPs with diagonal and full distance matrices 
on 10 data sets of size 64, 128, 256 and 512; in his work, Breiman used training 
sets with 400 datapoints. The GP's parameters are optimised on each of the 10 
training datasets by maximising the likelihood (see Equation 4) with the conjugate 
gradient algorithm (Press et al., 1992). The generalisation errors of W and Wf have 
been estimated using 1024 test data points; the relative generalisation error p (n) 
(c.f. Equation 7) is shown in Figure 2(a). We observe that for datasets of size 512 the 
use of Wf significantly reduces the relative error with respect the diagonal matrix. 
Models trained with smaller training sets do not have such good generalisation 
performance because the larger number of parameters in W I (57) overfits the data. 
An eigenvalue decomposition of the distance matrices shows that W I is able to dis- 
cover the underlying structure of the process. Figure 2(b) displays the eigenvalues 
of W I and W optimised for one of the training sets of 512 data. W I is characterised 
by three large eigenvalues, whose eigenvectors indicate the three main directions in 
the feature space; thus the full matrix is able to find out three out of ten directions 
which are responsible of the variation of the function. Conversely, Wd fails to dis- 
cover the hidden features in the data; since all the eigenvalues have almost the same 
magnitude, all the input dimensions of the observed variable are equally relevant in 
training the GP. 
The eigenvectors e/, i = 1, 2, 3 of Wf define a basis in the space generating a 
subspace of features. In order to verify the subspace spanned by the e/ actually 
overlaps the hidden feature space, we tried to express the former set of vectors as a 
linear combination the latter. Thus we computed the singular values (Press et al., 
1992) of the matrix composed by the normalised vectors mi and the basis ' 
e i s. As 
Discovering Hidden Features with Gaussian Processes Regression 619 
three out of six singular values are negligible with respect to the others (O (10 -2) 
vs. O (1)), the original hidden transformation can be well approximated as a linear 
combination of the new basis of eigenvectors showing that the eigenspace of Wf is 
a good approximation of the hidden feature space. 
4 Discussion 
In this paper we have shown how to discover hidden features with GP regression. We 
also note that this technique could be applied to problems where Gaussian process 
predictors are used in classification problems. An attractive feature of the method 
is that it allows the appropriate dimensionality of the z space to be discovered. If 
we wish to restrict the maximum dimensionality of Z to be q then one could use a 
 T  
distance matrix of rank-q, i.e.(M) (M). 
The idea of allowing a general transformation of the input space has been mentioned 
before in the literature, for example in (Girosi et al., 1995). However, Girosi et al 
suggest setting the parameters in Wf by cross-validation; we believe that this is not 
very practical in high-dimensional spaces. The results obtained show that the use 
of a full distance matrix can reduce significantly the relative error with respect to 
the use of a diagonal distance matrix. As the training of the GP has been carried 
out maximising the logarithm of the likelihood, this effect was particularly evident 
when larger amounts of data were used; this problem can be reduced when a full 
Bayesian approach to the GP regression is used. 
Currently we are investigating how the input-dimensionality of the affects GP re- 
gression with a general distance matrix W (for a fixed dimensionality of Z). 
Acknowledgements 
This research forms part of the "Validation and Verification of Neural Network Systems" 
project funded jointly by EPSRC (GR/K 51792) and British Aerospace. We thank Dr. 
Andy Wright of BAe for helpful discussions. 
References 
Breiman, L. (1993). Hinging hyperplanes for regression, classification and function ap- 
proximation. IEEE Trans. on Information Theory, 39(3):999-1013. 
Girosi, F., Jones, M., and Poggio, T. (1995). Regularization Theory and Neural Networks 
Architectures. Neural Computation, 7(2):219-269. 
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and 
Hall, London. 
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. Lecture Notes in 
Statistics 118. 
Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (1992). Numerical Recipes in 
C. The Art of Scientific Computing. Cambridge University Press. second edition. 
Rasmussen, C. E. (1996). Evaluation of Gaussian Processes and Other Methods for Non- 
linear Regression. PhD thesis, Dept. of Computer Science, University of Toronto. 
Whittle, P. (1963). Prediction and regulation by linear least square methods. English 
Universities Press. 
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In 
Touretzky, M. C. and Mozer, M. C. and Hasselmo, M. E., editors, Advances in Neural 
Information Processing Systems 8, pages 514-520. MIT Press. 
Williams, P.M. (1996). Conditional multivariate densities. Neural Computation, 8(4). 
