The Generalisation Cost of RAMnets 
Richard Rohwer and Michat Morciniec 
rohwerrj s. aston. a. uk morinims. aston. a. uk 
Neural Computing Research Group 
Aston University 
Aston Triangle, Birmingham B4 7ET, UK. 
Abstract 
Given unlimited computational resources, it is best to use a crite- 
rion of minimal expected generalisation error to select a model and 
determine its parameters. However, it may be worthwhile to sac- 
rifice some generalisation performance for higher learning speed. 
A method for quantifying sub-optimality is set out here, so that 
this choice can be made intelligently. Furthermore, the method 
is applicable to a broad class of models, including the ultra-fast 
memory-based methods such as RAMnets. This brings the added 
benefit of providing, for the first time, the means to analyse the 
generalisation properties of such models in a Bayesian framework. 
I Introduction 
In order to quantitatively predict the performance of methods such as the ultra-fast 
RAMnet, which are not trained by minimising a cost function, we develop a Bayesian 
formalism for estimating the generalisation cost of a wide class of algorithms. 
We consider the noisy interpolation problem, in which each output data point yi 
results from adding noise to the result y = f(x) of applying unknown function f 
to input data point x, which is generated from a distribution P (x). We follow a 
similar approach to (Zhu k Rohwer, to appear 1996) in using a Gaussian process to 
define a prior over the space of functions, so that the expected generalisation cost 
under the posterior can be determined. The optimal model is defined in terms of 
the restriction of this posterior to the subspace defined by the model. The optimum 
is easily determined for linear models over a set of basis functions. We go on to 
compute the generalisation cost (with an error bar) for all models of this class, 
which we demonstrate to include the RAMnets. 
254 R. Rohwer and M. Morciniec 
Section 2 gives a brief overview of RAMnets. Sections 3 and 4 supply the formalism 
for computing expected generalisation costs under Gaussian process priors. Numer- 
ical experiments with this formalism are presented in Section 5. Finally, we discuss 
the current limitations of this technique and future research directions in Section 6. 
2 RAMnets 
The RAMnet, or n-tuple network is a very fast 1-pass learning system that of- 
ten gives excellent results competitive with slower methods such as Radial Basis 
Function networks or Multi-layer Perceptrons (Rohwer & Morciniec, 1996). Al- 
though a semi-quantitative theory explains how these systems generalise, no formal 
framework has previously been given to precisely predict the accuracy of n-tuple 
networks. 
Essentially, a RAMnet defines a set of "features" which can be regarded as Boolean 
functions of the input variables. Let the a th feature of x be given by a {0, 1}-valued 
function qba(x). We will focus on the n-tuple regression network (Allinson & Kotcz, 
1995), which outputs 
E+a(x)Eyi+(xi) Eyir(x, 
a i _ i 
y= f(x)= Eq(x)Eq(xi ) - Eu(x, xi) (1) 
a i i 
'Y )}i=1' 
in response to input x, if trained on the set of N samples {X(N)Y(N)} -- {(x i i N 
Here U(x, x') -' y'}. qba(x)qba(x') can be seen to play the role of a smoothing kernel, 
provided that it turns out to have a suitable shape. It is well-know that it does, for 
appropriate choices of feature sets. The strength of this method is that the sums 
over training data can be done in one pass, producing a table containing two totals 
for each feature. Only this table is required for recognition. 
It is interesting to note that there is a familiar way to expand a kernel into the form 
U(x,x') =  q,(z)q,(z'), at least when U(z,z') = U(x - z'), if the range of  is 
not restricted to {0, 1}: an eigenfunction expansion 1. Indeed, principal component 
analysis a applied to a Gaussian with variance V shows that the smallest feature 
set for a given generalisation cost consists of the (real-valued) projections onto 
the leading eigenfunctions of V. Be that as it may, the treatment here applies to 
arbitrary feature sets. 
3 Bayesian inference with Gaussian priors 
Gaussian processes provide a diverse set of priors over function spaces. To 
avoid mathematical details of peripheral interest, let us approximate the infinite- 
dimensional space of functions by a finite-dimensional space of discretised functions, 
so that function f is replaced by high-dimensional vector f, and f(x) is replaced 
by fz, with f(x)  fz within a volume Ax around x. We develop the case of scalar 
functions f, but the generalisation to vector-valued functions is straightforward. 
In physics, this is essentially the mode function expansion of U -, the differential 
operator with Green's function U. 
2V- needs to be a compact operator for this to work in the infinite-dimensional limit. 
The Generalisation Cost of RAMnets 255 
We assume a Gaussian prior on f, with zero mean and covariance V/a: 
p(f)=(1/Za)e- fTv-f 
(2) 
where Za = det(-V). The overall scale of variation of f is controlled by a. 
Illustrative samples of the functions generated from various choices of covariance 
are given in (Zhu & Rohwer, to appear 1996). With qz// denoting the (possibly 
position-dependent) variance of the Gaussian output noise, the likelihood of outputs 
Y(N) given function f and inputs X(N) is 
/ E(fx ' _ yi)qf.l(fx, _ yi) 
P (y(N)IX(N), f) - (1/Z/)exp- 
i 
where Z = H -qz, = det [-Q] with Qff = qz.6ij. 
i 
Because f and X(N) are independent the joint distribution is 
P (Y(N), fiX(N)) = P (y(N)lf, X(N)) P(f) = (eb**b+*)/(ZZ)e 
(4) 
where 6x,x, is understood to be i whenever x i is in the same cell of the discreti- 
sation as x, and Azz, = aVzz, +/Y'.q,Sz,z,6,,z,, bz = /3Y.yiq,Sz,z., and 
i i 
 _OZyi - i 
= q, y . One can readily verify that 
$ 
(5) 
where K is the N x N matrix defined by 
(6) 
The posterior is readily determined to be 
P (fIX(N),Y(N)) = 
P (Y(N), fiX(N)) 
P (Y(N) IX(N)) 
= det [2rA] - e - (f-[)xA-(f-;) (7) 
where f = Ab is the posterior mean estimate of the true function f. 
4 Calculation of the expected cost and its variance 
m 
Let us define the cost of associating an output f of the model with an input x that 
actually produced an output y as 
m m 
C(f:, y) = (f: - y)2r 
where r is a position dependent cost weight. 
256 R. Rohwer and M. Morciniec 
The average of this cost defines a cost functional, given input data X(N): 
C(f,flx(v)) = C(f, y)P (zlx(v)) P(ylx,f)dxdy. 
(8) 
This form is obtained by noting that the function f carries no information about 
the input point x, and the input data X(N) supplies no information about y beyond 
that supplied by f. The distributions in (8) are unchanged by further conditioning 
m m 
on yuv), so we could write C(f,flxuv)) = C(f,flxuv),yuv)). This cost functional 
therefore has the posterior expectation value 
(Clx(v), y(v)) = / C(, ylx(N) )P (x, Y, flX(N), Y(N)) dxdydf 
and variance 
(9) 
- / C(', 1x(v))2P (flxuv), y(v))dr- 
(10) 
Plugging in the distributions (2) (applied to a single sample), (3) and (7) leads to: 
* m  m 
(Clx(v),y(v)) =tr [AR] + (f- f)TR(f-- f)+ 
tr [QR] 
(11) 
where the diagonal matrices R and q have the elements R, = P(xlX)rAxS., 
and Q, = q5,. 
Similar calculations lead to the expression for the variance 
var [C(,flx(N),Y(N))] = tr [ARAR] + tr [ARRFF] . 
(12) 
$71. * 
where the elements of F are F, = (f, - f)5,,. 
m 
Note that the RAMnet (1) has the form f = y]jyi linear in the output data 
i 
yuv), with J, = U(x,xi)/y]j U(x, xj). Let us take V to have the form V(x, x') = 
p(x)G(x - x')p(x'), combining translation-invariant and non-invariant factors in a 
plausible way. Then with the sums over x replaced by integrals, (11) becomes 
explicitly 
2 (Clx(), y(v)) 
+z 
yU / dxP (XlX(N)) rzJz,,zJzzvy v . 
(13) 
The Generalisation Cost of RAMnets 257 
1.5 
0.5 
0 
-1 
-1 
a) True, optimal and suboptimal functions 
-0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.6 
-0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 Or. 8 
,T 
350 
b) Distribution of the cost C 
0.01 0.012 0.014 0.016 0.018 
Figure 1: a) The lower figure shows the input distribution. The upper figure shows 
the true function f generated from a Gaussian prior with covariance matrix V (dot- 
ted line), the optimal function f - Ab (solid line) and the suboptimal solution 
7 
f (dashed line). b)The distribution of the cost function obtained by generating 
functions from the posterior Gaussian with covariance matrix A and calculating 
the cost according to equation 14. The mean and one standard deviation calcu- 
lated analytically and numerically are shown by the lower and upper error bars 
respectively. 
Taking P (xlx(N)) to be Gaussian (the maximum likelihood estimate would be 
reasonable) and p, r, and q uniform, the first four integrals are straightforward. 
The latter two involve the model J, and were evaluated numerically in the work 
reported below. 
5 Numerical results 
We present one numerical example to illustrate the formalism, and another to illus- 
trate its application. 
For the first illustration, let the input and output variables be one dimensional real 
numbers. Let the input distribution P(x) be a Gaussian with mean/tx = 0 and 
standard deviation a = 0.2. Nearly all inputs then fall within the range [-1, 1], 
which we uniformly quantise into 41 bins. The true function f is generated from a 
Gaussian distribution with tf = 0 and 41 x 41 covariance matrix V with elements 
V, - e-I-'l. 50 training inputs x were generated from the input distribution 
and assigned corresponding outputs y = x + e, where e is Gaussian noise with zero 
mean and standard deviation x/r// _= 0.01. The cost weight rx = 1. 
The inputs were thermometer coded 3 over 256 bits, from which 100 subsets of 30 
bits were randomly selected. Each of the 100 x 230 patterns formed over these 
bits defines a RAMnet feature which evaluates to I when that pattern is present 
in the input x. (Only those features which actually appear in the data need to be 
tabulated.) The 50 training data points were used in this way to train an n-tuple 
aThe first 256(z + 1)/2 bits are set to 1, and the remaining bits to 0. 
258 R. Rohwer and M. Morciniec 
2-$ 
a) Neal's regression problem 
-2 -1 0 1 2 
3: 
b)Mean cost (CIx (N),y(N)) as a function of f and  
0.01 
0,02 0.04 0.06 0,( 0 1 0.12 0,14 0,16 0.18 0.2 
Figure 2: a) Neal's Regression problem. The true function f is indicated by a dotted 
* m 
line, the optimal function f is denoted by a solid line and the suboptimal solution f 
is indicated by a dashed line. Circles indicate the training data. b) Dependence of 
the cost prediction on the values of parameters c and f. The cost evaluated from 
the test set is plotted as a dashed line, predicted cost is shown as a solid line with 
one standard deviation indicated by a dotted line. 
regression network. 
figure la. 
The input distribution and functions f, 
f, f are plotted in 
A Gaussian distribution with mean  and posterior covariance matrix A was then 
used to generate 104 functions. For each such function fp, the generalisation cost 
C =  E P (xlx(N)) (fPr -- )2. (14) 
was computed. A histogram of these costs appears in figure lb, together with the 
theoretical and numerically computed average generalisation cost and its variance. 
Good agreement is evident. 
Another one-dimensional problem illustrates the use of this formalism for predicting 
the generalisation performance of a RAMnet when the prior over functions can only 
be guessed. The true function, taken from (Neal, 1995) is given by 
0.3 + 0.4x + 0.5 sin(2.7x)+ 1.1/(1 
(15) 
where the Gaussian noise variable e has mean / = 0 and standard deviation 
V///? -- 0.1. The cost weight rx -- 1. The training and test set each comprised 
100 data-points. The inputs were generated by the standard normal distribution 
(px = 0, ax = 1) and converted into the binary strings using a thermometer code. 
The input range [-3, 3] was quantised into 61 uniform bins. 
The training set and the functions f, f, f are shown on figure 2a for a - 0.1. The 
function space covariance matrix was defined to have the Gaussian form Vrr, -- 
- X (z-z' ? 
2 
e ' where o'f = 1.0. 
The Generalisation Cost of RAMnets 259 
err is the correlation length of the functions, which is of order 1, judging from 
figure 2a. The overall scale of variation is 1/x/, which appears to be about 3, 
so a should be about 1/9. Figure 2b shows the expected cost as a function of a 
for various choices of err, with error bars on the err - 1.0 curve. The actual cost 
m 
computed from the test set according to C --  -](yi _fx)2 is plotted with a dashed 
i 
line. There is good agreement around the sensible values of a and err. 
6 Conclusions 
This paper demonstrates that unusual models, such as the ultra-fast RAMnets 
which are not trained by directly optimising a cost function, can be analysed in a 
Bayesian framework to determine their generalisation cost. Because the formalism 
is constructed in terms of distributions over function space rather than distributions 
over model parameters, it can be used for model comparison, and in particular to 
select RAMnet parameters. 
The main drawback with this technique, as it stands, is the need to numerically 
integrate two expressions which involve the model. This difficulty intensifies rapidly 
as the input dimension increases. Therefore, it is now a research priority to search 
for RAMnet feature sets which allow these integrals to be performed analytically. 
It would also be interesting to average the expected costs over the training data, 
producing an expected generalisation cost for an algorithm. The Y(N) integral is 
straightforward, but the X(N) integral is difficult. However, similar integrals have 
been carried out in the thermodynamic limit (high input dimension) (Sollich, 1994), 
so the investigation of these techniques in the current setting is another promising 
research direction. 
7 Acknowledgements 
We would like to thank the Aston Neural Computing Group, and especially Huaiyu 
Zhu, Chris Williams, and David Saad for helpful discussions. 
References 
Allinson, N.M., k Kotcz, A. 1995. N-tuple Regression Network. to be published in 
Neural Networks. 
Neal, R. 1995. Introductory documentation for software implementing Bayesian 
learning for neural networks using Markov chain Monte Carlo techniques. Tech. 
rept. Dept of Computer Science, University of Toronto. 
Rohwer, R., k Morciniec, M. 1996. A theoretical and experimental account of the 
n-tuple classifier performance. Neural Computation, 8(3), 657-670. 
Sollich, Peter. 1994. Finite-size effects in learning and generalization in linear per- 
ceptrons. J. Phys. A, 27, 7771-7784. 
Zhu, H., k Rohwer, R. to appear 1996. Bayesian regression fil- 
ters and the issue of priors. Neural Computing and Applications. 
ftp://cs. aston. ac. uk/neural/zhuh/reg_fil_prior. ps. Z. 
