NeuroScale: Novel Topographic Feature 
Extraction using RBF Networks 
David Lowe 
D.Lowe@aston.ac.uk
Michael E. Tipping 
E. Tppngast on. ac. uk 
Neural Computing Research Group 
Aston University, Aston Triangle, Birmingham B4 7ET, UK
http://www.ncrg.aston.ac.uk/
Abstract 
Dimension-reducing feature extraction neural network techniques 
which also preserve neighbourhood relationships in data have tra- 
ditionally been the exclusive domain of Kohonen self organising 
maps. Recently, we introduced a novel dimension-reducing feature 
extraction process, which is also topographic, based upon a Radial 
Basis Function architecture. It has been observed that the gener- 
alisation performance of the system is broadly insensitive to model 
order complexity and other smoothing factors such as the kernel 
widths, contrary to intuition derived from supervised neural net- 
work models. In this paper we provide an effective demonstration 
of this property and give a theoretical justification for the apparent 
'self-regularising' behaviour of the 'NEUROSCALE' architecture. 
1 'NeuroScale': A Feed-forward Neural Network Topographic Transformation
Recently an important class of topographic neural network based feature extraction 
approaches, which can be related to the traditional statistical methods of Sammon 
Mappings (Sammon, 1969) and Multidimensional Scaling (Kruskal, 1964), has
been introduced (Mao and Jain, 1995; Lowe, 1993; Webb, 1995; Lowe and Tipping, 
1996). These novel alternatives to Kohonen-like approaches for topographic feature 
extraction possess several interesting properties. For instance, the NEUROSCALE 
architecture has the empirically observed property that the generalisation
performance does not seem to depend critically on model order complexity, contrary to
intuition based upon knowledge of its supervised counterparts. This paper presents 
evidence for their 'self-regularising' behaviour and provides an explanation in terms 
of the curvature of the trained models. 
We now provide a brief introduction to the NEUROSCALE philosophy of nonlinear 
topographic feature extraction. Further details may be found in (Lowe, 1993; Lowe 
and Tipping, 1996). We seek a dimension-reducing, topographic transformation of 
data for the purposes of visualisation and analysis. By 'topographic', we imply that 
the geometric structure of the data be optimally preserved in the transformation, 
and the embodiment of this constraint is that the inter-point distances in the feature 
space should correspond as closely as possible to those distances in the data space. 
The implementation of this principle by a neural network is very simple. A Radial 
Basis Function (RBF) neural network is utilised to predict the coordinates of the 
data point in the transformed feature space. The locations of the feature points are 
indirectly determined by adjusting the weights of the network. The transformation 
is determined by optimising the network parameters in order to minimise a suitable 
error measure that embodies the topographic principle. 
The specific details of this alternative approach are as follows. Given an
m-dimensional input space of N data points $\mathbf{x}_q$, an n-dimensional feature space of
points $\mathbf{y}_q$ is generated such that the relative positions of the feature space points
minimise the error, or 'STRESS', term:
$$E = \sum_{p}^{N} \sum_{q>p} (d^{*}_{qp} - d_{qp})^{2}, \qquad (1)$$
where the $d^{*}_{qp}$ are the inter-point Euclidean distances in the data space:
$d^{*}_{qp} = \sqrt{(\mathbf{x}_q - \mathbf{x}_p)^{T}(\mathbf{x}_q - \mathbf{x}_p)}$, and the $d_{qp}$ are the corresponding distances in the feature
space: $d_{qp} = \sqrt{(\mathbf{y}_q - \mathbf{y}_p)^{T}(\mathbf{y}_q - \mathbf{y}_p)}$.
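As a minimal sketch of the STRESS term of equation (1), the following evaluates it for a given pair of configurations. The function names `euclidean` and `stress` and the three example points are illustrative assumptions, not taken from the paper; only the Python standard library is used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def stress(X, Y):
    """Sum over pairs p < q of (data distance - feature distance)^2."""
    total = 0.0
    for p in range(len(X)):
        for q in range(p + 1, len(X)):
            d_star = euclidean(X[q], X[p])  # distance in the data space
            d = euclidean(Y[q], Y[p])       # distance in the feature space
            total += (d_star - d) ** 2
    return total

# Hypothetical example: three 3-D points and a 2-D configuration
# obtained by simply dropping the third coordinate.
X = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
Y = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
print(stress(X, Y))
```

The pair whose separation lies entirely in the retained coordinates contributes nothing to the sum, while the pairs involving the third point are penalised because their separation along the dropped coordinate is lost.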
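The points $\mathbf{y}_q$ are generated by the RBF, given the data points as input. That is,
$\mathbf{y}_q = \mathbf{f}(\mathbf{x}_q; W)$, where $\mathbf{f}$ is the nonlinear transformation effected by the RBF with
parameters (weights and any kernel smoothing factors) $W$. The distances in the
feature space may thus be given by $d_{qp} = \| \mathbf{f}(\mathbf{x}_q) - \mathbf{f}(\mathbf{x}_p) \|$ and so more explicitly
by
$$d_{qp} = \sqrt{ \sum_{l=1}^{n} \Big( \sum_{k} w_{lk} \big[ \phi_k(\| \mathbf{x}_q - \boldsymbol{\mu}_k \|) - \phi_k(\| \mathbf{x}_p - \boldsymbol{\mu}_k \|) \big] \Big)^{2} }, \qquad (2)$$
where $\phi_k(\cdot)$ are the basis functions, $\boldsymbol{\mu}_k$ are the centres of those functions, which are
fixed, and $w_{lk}$ are the weights from the basis functions to the output.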
The topographic nature of the transformation is imposed by the STRESS term which 
attempts to match the inter-point Euclidean distances in the feature space with 
those in the input space. This mapping is relatively supervised because there is no 
specific target for each yq; only a relative measure of target separation between each 
yq, yp pair is provided. In this form it does not take account of any additional in- 
formation (for example, class labels) that might be associated with the data points, 
but is determined strictly by their spatial distribution. However, the approach may 
be extended to incorporate the use of extra 'subjective' information which may be 
used to influence the transformation and permits the extraction of 'enhanced', more 
informative, feature spaces (Lowe and Tipping, 1996). 
Combining equations (1) and (2) and differentiating with respect to the weights in 
the network allows the partial derivatives of the STRESS $\partial E / \partial w_{lk}$ to be derived
for each pattern pair. These may be accumulated over the entire pattern set and 
the weights adjusted by an iterative procedure to minimise the STRESS term E. 
Note that the objective function for the RBF is no longer quadratic, and so a 
standard analytic matrix-inversion method for fixing the final layer weights cannot 
be employed. 
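A minimal sketch of this training procedure follows, assuming Gaussian basis functions and plain fixed-step gradient descent; the paper specifies only "an iterative procedure", so the optimiser, the step size of 0.01, the 200 iterations and the toy data are all illustrative assumptions. The per-pair derivative used is the chain-rule result: minus two times (data distance minus feature distance), divided by the feature distance, times the output difference in dimension l, times the difference in activation of basis function k.

```python
import math

def design_matrix(X, centres, width):
    """Phi[q][k]: Gaussian basis k (an assumed kernel) evaluated at point q."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * width ** 2))
             for c in centres] for x in X]

def outputs(Phi, W):
    """Y[q][l] = sum over k of W[l][k] * Phi[q][k]."""
    return [[sum(w * p for w, p in zip(row, phi)) for row in W] for phi in Phi]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def stress(X, Y):
    return sum((dist(X[q], X[p]) - dist(Y[q], Y[p])) ** 2
               for p in range(len(X)) for q in range(p + 1, len(X)))

def stress_gradient(X, Phi, W):
    """Derivative of STRESS w.r.t. W[l][k], accumulated over pairs p < q."""
    Y = outputs(Phi, W)
    grad = [[0.0] * len(W[0]) for _ in W]
    for p in range(len(X)):
        for q in range(p + 1, len(X)):
            d = dist(Y[q], Y[p])
            if d == 0.0:
                continue  # derivative undefined for coincident images; skipped
            coeff = -2.0 * (dist(X[q], X[p]) - d) / d
            for l in range(len(W)):
                for k in range(len(W[0])):
                    grad[l][k] += (coeff * (Y[q][l] - Y[p][l])
                                   * (Phi[q][k] - Phi[p][k]))
    return grad

# Hypothetical toy problem: five 3-D points, three centres, two outputs.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
     (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
Phi = design_matrix(X, centres=X[:3], width=1.0)
W = [[0.3, -0.2, 0.5], [0.1, 0.4, -0.6]]

initial = stress(X, outputs(Phi, W))
for _ in range(200):
    g = stress_gradient(X, Phi, W)
    W = [[w - 0.01 * gw for w, gw in zip(row, grow)] for row, grow in zip(W, g)]
final = stress(X, outputs(Phi, W))
print(initial, final)
```

Because the basis function centres are fixed, the design matrix is computed once; only the output weights change between iterations.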
We refer to this overall procedure as 'NEUROSCALE'. Although any universal
approximator may be exploited within NEUROSCALE, using a Radial Basis Function
network allows more theoretical analysis of the resulting behaviour, despite the fact 
that we have lost the usual linearity advantages of the RBF because of the STRESS 
measure. A schematic of the NEUROSCALE model is given in figure 1, and illustrates 
the rôle of the RBF in transforming the data space to the feature space.
Figure 1: The NEUROSCALE architecture.
2 Generalisation 
In a supervised learning context, generalisation performance deteriorates for over- 
complex networks as 'overfitting' occurs. By contrast, it is an interesting empirical 
observation that the generalisation performance of NEUROSCALE, and related mod- 
els, is largely insensitive to excessive model complexity. This applies both to the 
number of centres used in the RBF and in the kernel smoothing factors which them- 
selves may be viewed as regularising hyperparameters in a feed-forward supervised 
situation. 
This insensitivity may be illustrated by Figure 2, which shows the training and 
test set performances on the IRIS data (for 5-45 basis functions trained and tested 
on 45 separate samples). To within acceptable deviations, the training and test set 
STRESS values are approximately constant. This behaviour is counter-intuitive when 
compared with research on feed forward networks trained according to supervised 
approaches. We have observed this general trend on a variety of diverse real world 
problems, and it is not peculiar to the IRIS data. 
[Figure 2 plot: 'Training and Test Errors' (vertical scale x 10^-3) against Number of Basis Functions, 5 to 45.]
Figure 2: Training and test errors for NEUROSCALE Radial Basis Functions with
various numbers of basis functions. Training errors are on the left, test errors are
on the right.
There are two fundamental causes of this observed behaviour. Firstly, we may 
derive significant insight into the necessary form of the functional transformation 
independent of the data. Secondly, given this prior functional knowledge, there 
is an appropriate regularising component implicitly incorporated in the training 
algorithm outlined in the previous section. 
2.1 Smoothness and Topographic Transformations 
For a supervised problem, in the absence of any explicit prior information, the 
smoothness of the network function must be determined by the data, typically neces- 
sitating the setting of regularising hyperparameters to counter overfitting behaviour. 
In the case of the distance-preserving transformation effected by NEUROSCALE, an 
understanding of the necessary smoothness may be deduced a priori. 
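Consider a point $\mathbf{x}_q$ in input space and a nearby test point $\mathbf{x}_p = \mathbf{x}_q + \boldsymbol{\epsilon}_{pq}$, where
$\boldsymbol{\epsilon}_{pq}$ is an arbitrary displacement vector. Optimum generalisation demands that the
distance between the corresponding image points $\mathbf{y}_q$ and $\mathbf{y}_p$ should thus be $\| \boldsymbol{\epsilon}_{pq} \|$.
Considering the Taylor expansions around the point $\mathbf{y}_q$ we find
$$\begin{aligned}
\| \mathbf{y}_p - \mathbf{y}_q \|^{2} &= \sum_{l=1}^{n} (\boldsymbol{\epsilon}_{pq}^{T} \mathbf{g}_{ql})^{2} + O(\epsilon^{4}), \\
&= \boldsymbol{\epsilon}_{pq}^{T} \Big( \sum_{l=1}^{n} \mathbf{g}_{ql} \mathbf{g}_{ql}^{T} \Big) \boldsymbol{\epsilon}_{pq} + O(\epsilon^{4}), \\
&= \boldsymbol{\epsilon}_{pq}^{T} G_q \boldsymbol{\epsilon}_{pq} + O(\epsilon^{4}),
\end{aligned} \qquad (3)$$
where the matrix $G_q = \sum_{l=1}^{n} \mathbf{g}_{ql} \mathbf{g}_{ql}^{T}$ and $\mathbf{g}_{ql}$ is the gradient vector
$(\partial y_l(q)/\partial x_1, \ldots, \partial y_l(q)/\partial x_m)^{T}$ evaluated at $\mathbf{x} = \mathbf{x}_q$. For structure preservation
the corresponding distances in input and output spaces need to be retained for all
values of $\boldsymbol{\epsilon}_{pq}$: $\| \mathbf{y}_p - \mathbf{y}_q \|^{2} = \boldsymbol{\epsilon}_{pq}^{T} \boldsymbol{\epsilon}_{pq}$, and so $G_q = I$ with the requirement that second-
and higher-order terms must vanish. In particular note that measures of curvature
proportional to $(\partial^{2} y_l(q)/\partial x_i^{2})^{2}$ should vanish. In general, for dimension reduction,
we cannot ensure that exact structure preservation is obtained since the rank of $G_q$
is at most n, which is less than the input dimension m, and hence $G_q$ can never
equate to the identity matrix. However, when minimising STRESS we are locally
attempting to minimise the residual $\| I - G_q \|$, which is achieved when all the
vectors $\boldsymbol{\epsilon}_{pq}$ of interest lie within the range of $G_q$.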
2.2 The Training Mechanism 
An important feature of this class of topographic transformations is that the STRESS 
measure is invariant under arbitrary rotations and translations of the output
configuration. The algorithm outlined previously tends towards those configurations 
that generally reduce the sum-of-squared weight values (Tipping, 1996). This is 
achieved without any explicit addition of regularisation, but rather it is a feature 
of the relative supervision algorithm. 
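The invariance of STRESS to rigid motions of the output configuration can be illustrated directly. In this sketch a 2-D configuration is rotated and shifted, and the STRESS is evaluated before and after; the data, the rotation angle and the shift are arbitrary illustrative values.

```python
import math

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def stress(X, Y):
    """Sum over pairs p < q of (data distance - feature distance)^2."""
    return sum((dist(X[q], X[p]) - dist(Y[q], Y[p])) ** 2
               for p in range(len(X)) for q in range(p + 1, len(X)))

def rotate_translate(Y, angle, shift):
    """Rotate each 2-D point about the origin, then add a constant shift."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * y1 - s * y2 + shift[0], s * y1 + c * y2 + shift[1])
            for y1, y2 in Y]

# Hypothetical data points and an arbitrary 2-D output configuration.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 3.0)]
Y = [(0.1, 0.0), (1.0, 0.2), (-0.3, 1.9), (2.0, 2.5)]
Y_moved = rotate_translate(Y, angle=0.7, shift=(5.0, -2.0))

print(stress(X, Y), stress(X, Y_moved))
```

The two values agree to rounding error, since only inter-point distances enter the measure. Many different weight settings therefore attain the same STRESS, which leaves the training algorithm free to settle on a solution with small weights.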
The effect of this reduction in weight magnitudes on the smoothness of the network 
transformation may be observed by monitoring an explicit quantitative measure of 
total curvature: 
C=E E,  ' (4) 
q l i 
where q ranges over the patterns, i over the input dimensions and I over the output 
dimensions. 
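One way to monitor such a curvature measure in practice is by central finite differences, as in the sketch below. The step size `h` and the two example maps are assumptions for illustration; for an RBF network the second derivatives could instead be obtained analytically.

```python
import math

def curvature(f, points, h=1e-3):
    """Sum over patterns, outputs and inputs of the squared second
    derivative of each output w.r.t. each input, by central differences."""
    total = 0.0
    for x in points:
        y0 = f(x)
        for i in range(len(x)):
            x_plus, x_minus = list(x), list(x)
            x_plus[i] += h
            x_minus[i] -= h
            y_plus, y_minus = f(x_plus), f(x_minus)
            for l in range(len(y0)):
                second = (y_plus[l] - 2.0 * y0[l] + y_minus[l]) / (h * h)
                total += second ** 2
    return total

def linear_map(x):
    """Hypothetical linear transformation: zero curvature everywhere."""
    return [2.0 * x[0] - x[1], 0.5 * x[1]]

def curved_map(x):
    """Hypothetical nonlinear transformation with non-zero curvature."""
    return [x[0] ** 2, math.sin(x[1])]

points = [(0.0, 0.0), (0.5, 1.0), (-1.0, 2.0)]
print(curvature(linear_map, points), curvature(curved_map, points))
```

The linear map yields a value that is zero up to rounding, whereas the curved map yields a clearly positive total, which is the kind of contrast the measure is intended to expose during training.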
Figure 3 depicts the total curvature of NEUROSCALE as a function of the training 
iterations on the IRIS subset data for a variety of model complexities. As predicted, 
curvature generally decreases during the training process, with the final value inde- 
pendent of the model complexity. Theoretical insight into this phenomenon is given 
in (Tipping, 1996). 
This behaviour is highly relevant, given the analysis of the previous subsection. That 
the training algorithm implicitly reduces the sum-of-squares weight values implies 
that there is a weight decay process occurring with an associated smoothing effect. 
While there is no control over the magnitude of this element, it was shown that 
for good generalisation, the optimal transformation should be maximally smooth. 
This self-regularisation operates differently to regularisers normally introduced to 
stabilise the ill-posed problems of supervised neural network models. In the latter 
case the regulariser acts to oppose the effect of reducing the error on the training 
set. In NEUROSCALE the implicit weight decay operates with the minimisation of 
STRESS since the aim is to 'fit' the relative input positions exactly. 
That there are many RBF networks which satisfy a given STRESS level may be
seen by training a network a posteriori on a predetermined Sammon mapping of a
data set by a supervised approach (since then the targets are known explicitly). In
general, such a posteriori trained networks do not have a low curvature and hence
do not show as good a generalisation behaviour as networks trained according to
the relative supervision approach. The method by which NEUROSCALE reduces
curvature is to select, automatically, RBF networks with minimum norm weights.
This is an inherent property of the training algorithm used to reduce the STRESS
criterion.
[Figure 3 plot: 'Curvature against Time for NeuroScale'; horizontal axis: Epoch Number; curves for 15, 30 and 45 basis functions.]
Figure 3: Curvature against time during the training of a NEUROSCALE
mapping on the Iris data, for networks with 15, 30 and 45 basis functions.
2.3 An example 
An effective example of the ease of production of good generalising transformations 
is given by the following experiment. A synthetic data set comprised four Gaussian 
clusters, each with spherical variance of 0.5, located in four dimensions with centres 
at $(x_c, 0, 0, 0)$: $x_c \in \{1, 2, 3, 4\}$. A NEUROSCALE transformation to two dimensions
was trained using the relative supervision approach, using the three clusters at
$x_c = 1$, 3 and 4. The network was then tested on the entire dataset, with the fourth
cluster included, and the projections are given in Figure 4 below. 
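A data set of this kind might be generated as sketched below. The number of points per cluster (50) and the random seed are assumptions, since the text does not state them, and "spherical variance of 0.5" is interpreted here as a variance of 0.5 in each coordinate.

```python
import random

def make_clusters(centres_x, n_per_cluster, variance, rng):
    """Spherical Gaussian clusters in 4-D centred at (x_c, 0, 0, 0)."""
    sigma = variance ** 0.5  # standard deviation from per-coordinate variance
    data, labels = [], []
    for xc in centres_x:
        mean = (float(xc), 0.0, 0.0, 0.0)
        for _ in range(n_per_cluster):
            data.append(tuple(rng.gauss(m, sigma) for m in mean))
            labels.append(xc)
    return data, labels

rng = random.Random(0)  # fixed seed for repeatability (arbitrary choice)
all_X, labels = make_clusters([1, 2, 3, 4], 50, 0.5, rng)

# Train on the clusters at x_c = 1, 3 and 4; test on the entire data set,
# which additionally contains the held-out cluster at x_c = 2.
train_X = [x for x, c in zip(all_X, labels) if c != 2]
test_X = all_X
print(len(train_X), len(test_X))
```

Holding out the cluster at x_c = 2 means the test set contains points from a region of input space that was never sampled during training, lying between the training clusters along the first coordinate.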
The apparently excellent generalisation to test data not sampled from the same 
distribution as the training data is a function of the inherent smoothing within 
the training process and also reflects the fact that the test data lay approximately 
within the range of the matrices Gq determined during training. 
[Figure 4 plot: two panels, 'NeuroScale trained on 3 linear clusters' and 'NeuroScale tested on 4 linear clusters'.]
Figure 4: Training and test projections of the four clusters. Training STRESS was
0.00515 and test STRESS 0.00532.
3 Conclusion
We have described NEUROSCALE, a parameterised RBF Sammon mapping approach
for topographic feature extraction. The NEUROSCALE method may be viewed as a
technique which is closely related to Sammon mappings and nonlinear metric MDS,
with the added flexibility of producing a generalising transformation.
A theoretical justification has been provided for the empirical observation that the
generalisation performance is not affected by model order complexity issues. This
counter-intuitive result is based on arguments of necessary transformation
smoothness coupled with the apparent self-regularising aspects of NEUROSCALE. The
relative supervision training algorithm implicitly minimises a measure of curvature by
incorporating an automatic 'weight decay' effect which favours solutions generated
by networks with small overall weights.
Acknowledgements 
This work was supported in part under the EPSRC contract GR/J75425, "Novel 
Developments in Learning Theory for Neural Networks". 
References 
Kruskal, J. B. (1964). Multidimensional scaling by optimising goodness of fit to a 
nonmetric hypothesis. Psychometrika, 29(1):1-27.
Lowe, D. (1993). Novel 'topographic' nonlinear feature extraction using radial basis 
functions for concentration coding in the 'artificial nose'. In 3rd IEE Interna- 
tional Conference on Artificial Neural Networks. London: IEE. 
Lowe, D. and Tipping, M. E. (1996). Feed-forward neural networks and topographic 
mappings for exploratory data analysis. Neural Computing and Applications, 
4:83-95. 
Mao, J. and Jain, A. K. (1995). Artificial neural networks for feature extraction 
and multivariate data projection. IEEE Transactions on Neural Networks, 
6(2):296-317. 
Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE 
Transactions on Computers, C-18(5):401-409. 
Tipping, M. E. (1996). Topographic Mappings and Feed-Forward Neural Networks. 
PhD thesis, Aston University, Aston Street, Birmingham B4 7ET, UK.
Available from http://www.ncrg.aston.ac.uk/.
Webb, A. R. (1995). Multidimensional scaling by iterative majorisation using radial 
basis functions. Pattern Recognition, 28(5):753-759.
