A Neural Network for Feature Extraction 
Nathan Intrator 
Div. of Applied Mathematics, and 
Center for Neural Science 
Brown University 
Providence, RI 02912 
ABSTRACT 
The paper suggests a statistical framework for the parameter esti- 
mation problem associated with unsupervised learning in a neural 
network, leading to an exploratory projection pursuit network that 
performs feature extraction, or dimensionality reduction. 
1 INTRODUCTION 
The search for a possible presence of some unspecified structure in a high
dimensional space can be difficult due to the curse of dimensionality problem,
namely the inherent sparsity of high dimensional spaces. Due to this problem,
uniformly accurate estimations for all smooth functions are not possible in high
dimensions with practical sample sizes (Cox, 1984, Barron, 1988).
Recently, exploratory projection pursuit (PP) has been considered (Jones, 1983) as a 
potential method for overcoming the curse of dimensionality problem (Huber, 1985), 
and new algorithms were suggested by Friedman (1987), and by Hall (1988, 1989). 
The idea is to find low dimensional projections that provide the most revealing 
views of the full-dimensional data emphasizing the discovery of nonlinear effects 
such as clustering. 
Many of the methods of classical multivariate analysis turn out to be special cases 
of PP methods. Examples are principal component analysis, factor analysis, and 
discriminant analysis. The various PP methods differ by the projection index opti- 
mized. 
Neural networks seem promising for feature extraction, or dimensionality reduction, 
mainly because of their powerful parallel computation. Feature detecting functions 
of neurons have been studied in the past two decades (von der Malsburg, 1973, Nass
and Cooper, 1975, Cooper et al., 1979, Takeuchi and Amari, 1979). It has also been shown
that a simplified neuron model can serve as a principal component analyzer (Oja, 
1982). 
This paper suggests a statistical framework for the parameter estimation problem 
associated with unsupervised learning in a neural network, leading to an exploratory 
PP network that performs feature extraction, or dimensionality reduction, of the 
training data set. The formulation, which is similar in nature to PP, is based on 
a minimization of a cost function over a set of parameters, yielding an optimal 
decision rule under some norm. First, the formulations of single and multiple
feature extraction are presented. Then a new projection index (cost function) that
favors directions possessing multimodality, where the multimodality is measured
in terms of the separability property of the data, is presented. This leads to the
synaptic modification equations governing learning in Bienenstock, Cooper, and
Munro (BCM) neurons (1982). A network is presented based on the multiple feature
extraction formulation, and both the linear and nonlinear neurons are analysed.
2 SINGLE FEATURE EXTRACTION 
We associate a feature with each projection direction. With the addition of a 
threshold function we can say that an input possesses a feature associated with that
direction if its projection onto that direction is larger than the threshold. In these 
terms, a one dimensional projection would be a single feature extraction. 
The approach proceeds as follows: Given a compact set of parameters, define a 
family of loss functions, where the loss function corresponds to a decision made by 
the neuron whether to fire or not for a given input. Let the risk be the averaged 
loss over all inputs. Minimize the risk over all possible decision rules, and then 
minimize the risk over the parameter set. In case the risk does not yield a meaningful 
minimization problem, or when the parameter set over which the minimization takes 
place can be restricted by some a priori knowledge, a penalty, i.e. a measure on the
parameter set, may be added to the risk. 
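Define the decision problem $(\Omega, \mathcal{F}_\Omega, P, L, \mathcal{A})$, where
$\Omega = (x^{(1)}, \ldots, x^{(n)})$, $x^{(i)} \in R^N$, is a fixed set of input vectors,
$(\Omega, \mathcal{F}_\Omega, P)$ the corresponding probability space,
$\mathcal{A} = \{0, 1\}$ the decision space, and $\{L_\theta\}_{\theta \in B^M}$,
$L_\theta : \Omega \times \mathcal{A} \to R$ is the family of loss functions. $B^M$ is a
compact set in $R^M$. Let $\mathcal{D}$ be the space of all decision rules.
The risk $R_\theta : \mathcal{D} \to R$ is given by:

$$R_\theta(\delta) = \sum_{i=1}^{n} P(x^{(i)}) L_\theta(x^{(i)}, \delta(x^{(i)})). \tag{2.1}$$

For a fixed $\theta$, the optimal decision $\delta_\theta$ is chosen so that:

$$R_\theta(\delta_\theta) = \min_{\delta \in \mathcal{D}} \{R_\theta(\delta)\}. \tag{2.2}$$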
Since the minimization takes place over a finite set, the minimizer exists. In
particular, for a given $x^{(i)}$ the decision $\delta_\theta(x^{(i)})$ is chosen so that
$L_\theta(x^{(i)}, \delta_\theta(x^{(i)})) < L_\theta(x^{(i)}, 1 - \delta_\theta(x^{(i)}))$.

Now we find an optimal $\hat{\theta}$ that minimizes the risk, namely, $\hat{\theta}$ will be such that:

$$R_{\hat{\theta}}(\delta_{\hat{\theta}}) = \min_{\theta \in B^M} \{R_\theta(\delta_\theta)\}. \tag{2.3}$$

The minimum with respect to $\theta$ exists since $B^M$ is compact.
$R_\theta(\delta_\theta)$ becomes a function that depends only on $\theta$, and when $\theta$
represents a vector in $R^N$, $R_\theta$ can be viewed as a projection index.
3 MULTI-DIMENSIONAL FEATURE EXTRACTION 
In this case we have a single layer network of interconnected units, each performing 
a single feature extraction. All units receive the same input and the interaction be- 
tween the units is via lateral inhibition. The formulation is similar to single feature 
extraction, with the addition of interaction between the single feature extractors. 
Let $Q$ be the number of features to be extracted from the data. The multiple
decision rule $\delta_\theta = (\delta_\theta^{(1)}, \ldots, \delta_\theta^{(Q)})$ takes values in
$\mathcal{A} = \{0, 1\}^Q$. The risk of node $k$ is given by:
$R_\theta^{(k)}(\delta) = \sum_{i=1}^{n} P(x^{(i)}) L_\theta^{(k)}(x^{(i)}, \delta^{(k)}(x^{(i)}))$,
and the total risk of the network is $R_\theta(\delta) = \sum_{k=1}^{Q} R_\theta^{(k)}(\delta)$.
Proceeding as before, we can minimize over the decision rules $\delta$ to get
$\delta_\theta$, and then minimize over $\theta$ to get $\hat{\theta}$, as in equation (2.3).
The coupling of the equations via the inhibition, and the relation between the
different features extracted, are exhibited in the loss function for each node and will
become clear through the next example.
4 FINDING THE OPTIMAL $\theta$ FOR A SPECIFIC LOSS FUNCTION
4.1 A SINGLE BCM NEURON - ONE FEATURE EXTRACTION 
In this section, we present an exploratory PP method with a specific loss function.
The differential equations performing the optimization turn out to be a good
approximation of the law governing synaptic weight modification in the BCM theory
for learning and memory in neurons. The formal presentation of the theory and
some theoretical analysis are given in (Bienenstock, 1980, Bienenstock et al., 1982);
mean field theory for a network based on these neurons is presented in (Scofield
and Cooper, 1985, Cooper and Scofield, 1988); more recent analysis based on the
statistical viewpoint is in (Intrator, 1990); computer simulations and the biological
relevance are discussed in (Saul and Clothiaux, 1986, Bear et al., 1987, Cooper et al., 1988).
We start with a short review of the notations and definitions of BCM theory. 
Consider a neuron with input vector $x = (x_1, \ldots, x_N)$, synaptic weights vector
$m = (m_1, \ldots, m_N)$, both in $R^N$, and activity (in the linear region) $c = x \cdot m$.
Define Om: E[(a-rn) ] (c, Ore) -- c   
- 5cO,, b(c, O,) = c  4cO, The input 
a, which is a stochastic process, is assumed to be of Type II ;o mixing, bounded, and 
piecewise constant. The ;o mixing property specifies the dependency of the future 
of the process on its past. These assumptions are needed for the approximation of 
the resulting deterministic equation by a stochastic one and are discussed in detail 
in (Intrator, 1990). Note that c represents the linear projection of a onto m, and 
we seek an optimal projection in some sense. 
The BCM synaptic modification equations are given by: $\dot{m} = \mu(t) \phi(x \cdot m, \Theta_m) x$,
$m(0) = m_0$, where $\mu(t)$ is a global modulator which is assumed to take into account
all the global factors affecting the cell, e.g., the beginning or end of the critical
period, state of arousal, etc.

Rewriting the modification equation as $\dot{m} = \mu(t)(x \cdot m)(x \cdot m - \frac{4}{3}\Theta_m)x$, we see
that unlike a classical Hebb-Stent rule, the threshold $\Theta_m$ is dynamic. This gives
the modification equation the desired stability, with no extra conditions such as
saturation of the activity, or normalization of $\|m\|$, and also yields a statistically
meaningful optimization.
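
A discrete-time version of the modification equation can be sketched as follows. This is a minimal illustration, not the author's simulation code: the Euler step size `mu`, the running-average rate `tau` used to track the threshold $E[(x \cdot m)^2]$, and the function names are all assumptions of the sketch.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def phi(c: float, theta: float) -> float:
    """Modification function phi(c, Theta) = c^2 - (4/3) c Theta."""
    return c * c - (4.0 / 3.0) * c * theta


def bcm_step(m: Sequence[float], x: Sequence[float], theta: float,
             mu: float = 0.01, tau: float = 0.1) -> Tuple[List[float], float]:
    """One Euler step of dm/dt = mu * phi(x.m, Theta) * x for a single input x.

    The threshold is tracked by an exponential running average of c^2,
    which stands in for the expectation E[(x.m)^2] (an assumption).
    """
    c = dot(x, m)
    m_new = [mi + mu * phi(c, theta) * xi for mi, xi in zip(m, x)]
    theta_new = (1.0 - tau) * theta + tau * c * c
    return m_new, theta_new


if __name__ == "__main__":
    weights, threshold = bcm_step([1.0, 0.0], [2.0, 1.0], 3.0, mu=0.1, tau=0.5)
    print(weights, threshold)
```

Because the threshold moves with the average squared activity, the sign of the update changes at $c = \frac{4}{3}\Theta_m$: activity below that point weakens the weights along the input, activity above it strengthens them.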
Returning to the statistical formulation, we let $\theta = m$ be the parameter to be
estimated according to the above formulation and define an appropriate loss function 
depending on the cell's decision whether to fire or not. The loss function represents 
the intuitive idea that the neuron will fire when its activity is greater than some 
threshold, and will not otherwise. We denote the firing of the neuron by a = 1. 
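Define $K = -\mu \int_0^{\Theta_m} \hat{\phi}(s, \Theta_m)\,ds$. Consider the following loss function:

$$L_\theta(x, a) = L_m(x, a) =
\begin{cases}
\ldots, & (x \cdot m) \ge \Theta_m,\ a = 1 \\
\ldots, & (x \cdot m) < \Theta_m,\ a = 1 \\
\ldots, & (x \cdot m) \le \Theta_m,\ a = 0 \\
\ldots, & (x \cdot m) > \Theta_m,\ a = 0
\end{cases} \tag{4.1}$$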
It follows from the definition of $L_\theta$ and from the definition of $\delta_\theta$ in (2.2) that

$$L_m(x, \delta_m) = -\mu \int_0^{(x \cdot m)} \hat{\phi}(s, \Theta_m)\,ds = -\frac{\mu}{3}\{(x \cdot m)^3 - E[(x \cdot m)^2](x \cdot m)^2\}. \tag{4.2}$$
The above definition of the loss function suggests that the decision of a neuron
whether to fire or not is based on a dynamic threshold $(x \cdot m) > \Theta_m$. It turns out
that the synaptic modification equations remain the same if the decision is based
on a fixed threshold. This is demonstrated by the following loss function, which
leads to the same risk as in equation (4.3): $K = -\mu \int_0^{\Theta_m} \hat{\phi}(s, \Theta_m)\,ds$,

$$L_\theta(x, a) = L_m(x, a) =
\begin{cases}
\ldots, & (x \cdot m) \ge 0,\ a = 1 \\
\ldots, & (x \cdot m) < 0,\ a = 1 \\
\ldots, & (x \cdot m) \le 0,\ a = 0 \\
\ldots, & (x \cdot m) > 0,\ a = 0
\end{cases} \tag{4.1'}$$
The risk is given by:

$$R_\theta(\delta_\theta) = -\frac{\mu}{3}\{E[(x \cdot m)^3] - E^2[(x \cdot m)^2]\}. \tag{4.3}$$
The following graph represents the $\phi$ function and the associated loss function
$L_m(x, \delta_m)$ of the activity $c$.
[Figure 1 panels: THE $\phi$ FUNCTION; THE LOSS FUNCTION]
Fig. 1: The Function $\phi$ and the Loss Functions for a Fixed $m$ and $\Theta_m$.
From the graph of the loss function it follows that for any fixed $m$ and $\Theta_m$, the loss
is small for a given input $x$ when either $x \cdot m$ is close to zero or negative, or when
$x \cdot m$ is larger than $\Theta_m$. This suggests that the preferred directions for a fixed
$\Theta_m$ will be such that the projected single dimensional distribution differs from normal
in the center of the distribution, in the sense that it has a multi-modal distribution
with a distance between the two peaks larger than $\Theta_m$. Rewriting (4.3) we get

$$\frac{R_\theta(\delta_\theta)}{E^2[(x \cdot m)^2]} = -\frac{\mu}{3}\left\{\frac{E[(x \cdot m)^3]}{E^2[(x \cdot m)^2]} - 1\right\}. \tag{4.4}$$
The term $E[(x \cdot m)^3]/E^2[(x \cdot m)^2]$ can be viewed as some measure of the skewness of
the distribution, which is a measure of deviation from normality and therefore an
interesting direction (Diaconis and Freedman, 1984), in accordance with Friedman's
(1987) and Hall's (1988, 1989) argument that it is best to seek projections that
differ from the normal in the center of the distribution rather than in the tails.
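
As a rough illustration of how this index ranks directions, the sketch below evaluates the moment-based risk $-\frac{\mu}{3}\{E[c^3] - E^2[c^2]\}$ and the ratio $E[c^3]/E^2[c^2]$ on sample averages. The four-point data set and the function names are hypothetical, and replacing expectations by sample means is an assumption of the example.

```python
from typing import Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def projection_moments(data: Sequence[Sequence[float]], m: Sequence[float]) -> Tuple[float, float]:
    """Sample estimates of E[c^2] and E[c^3] for the projection c = x . m."""
    cs = [dot(x, m) for x in data]
    n = len(cs)
    return sum(c * c for c in cs) / n, sum(c ** 3 for c in cs) / n


def risk(data: Sequence[Sequence[float]], m: Sequence[float], mu: float = 1.0) -> float:
    """Moment-based risk -(mu/3) * (E[c^3] - E[c^2]^2), with sample means for E."""
    second, third = projection_moments(data, m)
    return -(mu / 3.0) * (third - second ** 2)


def moment_ratio(data: Sequence[Sequence[float]], m: Sequence[float]) -> float:
    """The ratio E[c^3] / E[c^2]^2 that the text reads as a skewness measure."""
    second, third = projection_moments(data, m)
    return third / second ** 2


# Hypothetical data: the first coordinate is skewed, the second is symmetric.
DATA = [(0.0, -1.0), (0.0, 1.0), (0.0, -1.0), (3.0, 1.0)]

if __name__ == "__main__":
    for direction in [(1.0, 0.0), (0.0, 1.0)]:
        print(direction, risk(DATA, direction), moment_ratio(DATA, direction))
```

On this toy data the skewed coordinate gives the lower risk, so a minimizer of the index would prefer it over the symmetric coordinate.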
Since the risk is continuously differentiable, its minimization can be done via the
gradient descent method with respect to $m$, namely:

$$\frac{\partial m_i}{\partial t} = -\frac{\partial}{\partial m_i} R_\theta(\delta_\theta) = \mu E[\phi(x \cdot m, \Theta_m) x_i]. \tag{4.5}$$
Notice that the resulting equation represents an averaged deterministic equation
of the stochastic BCM modification equations. It turns out that under suitable
conditions on the mixing of the input $x$ and the global function $\mu$, equation (4.5) is
a good approximation of its stochastic version.
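
The relation between the risk and the averaged modification equation can be checked numerically. The sketch below compares $\mu E[\phi(x \cdot m, \Theta_m) x_i]$ with a central finite-difference estimate of the negative risk gradient and then takes one descent step. The data set, step size, and function names are assumptions of the example, with expectations replaced by sample means.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def phi(c: float, theta: float) -> float:
    return c * c - (4.0 / 3.0) * c * theta


def risk(data, m, mu: float = 1.0) -> float:
    """Sample version of -(mu/3) * (E[c^3] - E[c^2]^2) with c = x . m."""
    cs = [dot(x, m) for x in data]
    n = len(cs)
    second = sum(c * c for c in cs) / n
    third = sum(c ** 3 for c in cs) / n
    return -(mu / 3.0) * (third - second ** 2)


def neg_gradient(data, m, mu: float = 1.0) -> List[float]:
    """Analytic form mu * E[phi(x.m, Theta) x_i], with Theta = E[(x.m)^2]."""
    cs = [dot(x, m) for x in data]
    n = len(cs)
    theta = sum(c * c for c in cs) / n
    return [mu * sum(phi(c, theta) * x[i] for c, x in zip(cs, data)) / n
            for i in range(len(m))]


def numeric_neg_gradient(data, m, mu: float = 1.0, h: float = 1e-6) -> List[float]:
    """Central finite differences of -risk, used only as a cross-check."""
    grad = []
    for i in range(len(m)):
        up, down = list(m), list(m)
        up[i] += h
        down[i] -= h
        grad.append(-(risk(data, up, mu) - risk(data, down, mu)) / (2.0 * h))
    return grad


def descent_step(data, m, step: float = 0.01, mu: float = 1.0) -> List[float]:
    """One explicit Euler step of the averaged equation."""
    return [mi + step * gi for mi, gi in zip(m, neg_gradient(data, m, mu))]


DATA = [(0.0, -1.0), (0.0, 1.0), (0.0, -1.0), (3.0, 1.0)]
M0 = [0.5, 0.5]

if __name__ == "__main__":
    print(neg_gradient(DATA, M0), numeric_neg_gradient(DATA, M0))
    print(risk(DATA, M0), risk(DATA, descent_step(DATA, M0)))
```

The check covers only the deterministic, averaged equation; how closely the stochastic version follows it depends on the mixing conditions mentioned above.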
When the nonlinearity of the neuron is emphasized, the neuron's activity is then
defined as $c = \sigma(x \cdot m)$, where $\sigma$ usually represents a smooth sigmoidal function.
$\Theta_m$ is then defined as $E[\sigma^2(x \cdot m)]$, and the loss function is similar to the one
given by equation (4.1) except that $(x \cdot m)$ is replaced by $\sigma(x \cdot m)$. The gradient of
the risk is given by: $-\nabla_m R_m(\delta_m) = \mu E[\phi(\sigma(x \cdot m), \Theta_m)\sigma' x]$, where $\sigma'$ represents
the derivative of $\sigma$ at the point $(x \cdot m)$. Note that $\sigma$ may represent any nonlinear
function, e.g. radial symmetric kernels.
4.2 THE NETWORK - MULTIPLE FEATURE EXTRACTION 
In this case we have $Q$ identical nodes, which receive the same input and inhibit
each other. Let the neuronal activity be denoted by $c_k = x \cdot m_k$. We define the
inhibited activity $\tilde{c}_k = c_k - \eta \sum_{j \ne k} c_j$, and the threshold
$\tilde{\Theta}_m^k = E[\tilde{c}_k^2]$. In a more
general case, the inhibition may be defined to take into account the spatial location
of adjacent neurons, namely, $\tilde{c}_k = \sum_j \lambda_{jk} c_j$, where $\lambda_{jk}$ represents different types
of inhibitions, e.g. Mexican hat. Since the following calculations are valid for both
kinds of inhibition we shall introduce only the simpler one.
The loss function is similar to the one defined in a single feature extraction with the
exception that the activity $c = x \cdot m$ is replaced by $\tilde{c}$. Therefore the risk for node $k$ is
given by: $R_k = -\frac{\mu}{3}\{E[\tilde{c}_k^3] - (E[\tilde{c}_k^2])^2\}$, and the total risk is given by
$R = \sum_{k=1}^{Q} R_k$.
The gradient of $R$ is given by:

$$\frac{\partial R}{\partial m_k} = -\mu[1 - \eta(Q - 1)]E[\phi(\tilde{c}_k, \tilde{\Theta}_m^k)x]. \tag{4.6}$$
Equation (4.6) demonstrates the ability of the network to perform exploratory pro- 
jection pursuit in parallel, since the minimization of the risk involves minimization 
of nodes 1,..., Q, which are loosely coupled. 
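
A per-input update for the whole layer might look like the sketch below. It takes the uniform lateral inhibition and the factor $1 - \eta(Q - 1)$ from equation (4.6) as given. The stochastic (single input) form of the update, the running-average thresholds with rate `tau`, and all function names are assumptions of the example.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def phi(c: float, theta: float) -> float:
    return c * c - (4.0 / 3.0) * c * theta


def inhibited_activities(weights: Sequence[Sequence[float]], x: Sequence[float],
                         eta: float) -> List[float]:
    """c~_k = c_k - eta * sum_{j != k} c_j, with c_k = x . m_k."""
    cs = [dot(x, m) for m in weights]
    total = sum(cs)
    return [c - eta * (total - c) for c in cs]


def network_step(weights: Sequence[Sequence[float]], thetas: Sequence[float],
                 x: Sequence[float], eta: float, mu: float = 0.01,
                 tau: float = 0.1) -> Tuple[List[List[float]], List[float]]:
    """One update of all Q nodes for a single input x.

    Each node moves along mu * [1 - eta (Q - 1)] * phi(c~_k, Theta~_k) * x.
    Thresholds follow a running average of c~_k^2 (an assumption).
    """
    q = len(weights)
    coupling = 1.0 - eta * (q - 1)
    c_tilde = inhibited_activities(weights, x, eta)
    new_weights = [[mi + mu * coupling * phi(ck, th) * xi for mi, xi in zip(m, x)]
                   for m, ck, th in zip(weights, c_tilde, thetas)]
    new_thetas = [(1.0 - tau) * th + tau * ck * ck for th, ck in zip(thetas, c_tilde)]
    return new_weights, new_thetas


if __name__ == "__main__":
    w, t = network_step([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0], [2.0, 1.0],
                        eta=0.5, mu=0.1, tau=0.5)
    print(w, t)
```

Every node needs only the shared input and the summed activity of the others, which is why the nodes can be updated in parallel.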
The parameter $\eta$ represents the amount of lateral inhibition in the network, and
is related to the amount of correlation between the different features sought by
the network. Experience shows that when $\eta \approx 0$, the different units may all
become selective to the simplest feature that can be extracted from the data. When
$\eta(Q - 1) \approx 1$, the network becomes selective to those inputs that are very far apart
(under the $\ell_2$ norm), yielding a classification of a small portion of the data, and
mostly unresponsiveness to the rest of the data. When $0 < \eta(Q - 1) < 1$, the
network becomes responsive to substructures that may be common to several different
inputs, namely extracts invariant features in the data. The optimal value of $\eta$ has
been estimated by data driven techniques.
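When the nonlinearity of the neuron is emphasized the activity is defined (as in
the single neuron case) as $c_k = \sigma(x \cdot m_k)$. $\tilde{c}_k$, $\tilde{\Theta}_m^k$, and $R_k$ are defined as before. In
this case $\frac{\partial \tilde{c}_k}{\partial m_j} = -\eta \sigma'(x \cdot m_j)x$,
$\frac{\partial \tilde{c}_k}{\partial m_k} = \sigma'(x \cdot m_k)x$, and equation (4.6) becomes:

$$\frac{\partial R}{\partial m_k} = -\mu E\Big[\phi(\tilde{c}_k, \tilde{\Theta}_m^k)\Big(\sigma'(x \cdot m_k) - \eta \sum_{j \ne k} \sigma'(x \cdot m_j)\Big)x\Big]. \tag{4.7}$$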
4.3 OPTIMAL NETWORK SIZE 
A major problem in network solutions to real world problems is optimal network 
size. In our case, it is desirable to try and extract as many features as possible on 
one hand, but it is clear that too many neurons in the network will simply inhibit 
each other, yielding sub-optimal results. The following solution was adopted: We 
replace each neuron in the network with a group of neurons which all receive the 
same input, and the same inhibition from adjacent groups. These neurons differ 
from one another only in their initial synaptic weights. The output of each neuron 
is replaced by the average group activity. Experiments show that the resulting 
network is more robust to noise and outliers in the data. Furthermore, it is observed
that groups that become selective to a true feature in the data possess a much
smaller inter-group variance of their synaptic weight vector than those which do
not become responsive to a coherent feature. We found that eliminating neurons
with large inter-group variance and retraining the network may yield improved
feature extraction properties.
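
The grouping and pruning heuristic can be sketched as follows. This is one possible reading rather than the procedure actually used: the variance is computed among the weight vectors of the neurons that share a group, the cutoff `max_variance` is a hypothetical tuning parameter, and whole groups are dropped when they exceed it.

```python
from typing import List, Sequence

Group = Sequence[Sequence[float]]  # weight vectors of the neurons in one group


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def group_output(group: Group, x: Sequence[float]) -> float:
    """Average activity of the group, used in place of a single neuron's output."""
    return sum(dot(x, m) for m in group) / len(group)


def weight_variance(group: Group) -> float:
    """Mean squared distance of the group's weight vectors from their mean."""
    n, dim = len(group), len(group[0])
    mean = [sum(m[i] for m in group) / n for i in range(dim)]
    return sum(sum((m[i] - mean[i]) ** 2 for i in range(dim)) for m in group) / n


def prune(groups: Sequence[Group], max_variance: float) -> List[Group]:
    """Keep only groups whose weight vectors agree; retraining would follow."""
    return [g for g in groups if weight_variance(g) <= max_variance]


# Hypothetical weights: the first group agrees on a direction, the second does not.
COHERENT = [[1.0, 0.0], [1.1, 0.0], [0.9, 0.0]]
SCATTERED = [[1.0, 0.0], [-1.0, 2.0], [0.0, -2.0]]

if __name__ == "__main__":
    print(weight_variance(COHERENT), weight_variance(SCATTERED))
    print(len(prune([COHERENT, SCATTERED], max_variance=0.1)))
```

The cutoff would have to be chosen from the data; the sketch only shows where such a criterion enters.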
The network has been applied to speech segments, in an attempt to extract some 
features from CV pairs of isolated phonemes (Seebach and Intrator, 1988). 
5 DISCUSSION 
The PP method based on the BCM modification function has been found capable of
effectively discovering nonlinear data structures in high dimensional spaces. Using
a parallel processor and the presented network topology, the pursuit can be done
faster than in the traditional serial methods.
The projection index is based on polynomial moments, and is therefore
computationally attractive. When only the nonlinear structure in the data is of interest, a
sphering transformation (Huber, 1981, Friedman, 1987) can be applied first to the
data to remove all the location, scale, and correlational structure from the data.
When compared with other PP methods, the highlights of the presented method are 
i) the projection index concentrates on directions where the separability property as 
well as the non-normality of the data is large, thus giving rise to better classification 
properties; ii) the degree of correlation between the directions, or features extracted 
by the network can be regulated via the global inhibition, allowing some tuning of 
the network to different types of data for optimal results; iii) the pursuit is done on 
all the directions at once thus leading to the capability of finding more interesting 
structures than methods that find only one projection direction at a time; iv) the
network's structure suggests a simple method for size-optimization. 
Acknowledgements
I would like to thank Professor Basilis Gidas for many fruitful discussions. 
Supported by the National Science Foundation, the Office of Naval Research, and 
the Army Research Office. 
References 
Barron A. R. (1988) Approximation of densities by sequences of exponential families. 
Submitted to Ann. Statist. 
Bienenstock E. L. (1980) A theory of the development of neuronal selectivity. Doctoral
dissertation, Brown University, Providence, RI.
Bienenstock E. L., L. N Cooper, and P. W. Munro (1982) Theory for the development
of neuron selectivity: orientation specificity and binocular interaction in visual cortex.
J. Neurosci. 2:32-48.
Bear M. F., L. N Cooper, and F. F. Ebner (1987) A Physiological Basis for a Theory of
Synapse Modification. Science 237:42-48.
Cooper L. N, F. Liberman, and E. Oja (1979) A theory for the acquisition and loss
of neuron specificity in visual cortex. Biol. Cyb. 33:9-28.
Cooper L. N, and C. L. Scofield (1988) Mean-field theory of a neural network. Proc. Natl.
Acad. Sci. USA 85:1973-1977.
Cox D. D. (1984) Multivariate smoothing spline functions. SIAM J. Numer. Anal.
21:789-813.
Diaconis P., and D. Freedman (1984) Asymptotics of Graphical Projection Pursuit. The
Annals of Statistics 12:793-815.
Friedman J. H. (1987) Exploratory Projection Pursuit. Journal of the American Statistical
Association 82(397):249-266.
Hall P. (1988) Estimating the Direction in which a Data Set is Most Interesting. Probab.
Theory Rel. Fields 80:51-78.
Hall P. (1989) On Polynomial-Based Projection Indices for Exploratory Projection Pursuit.
The Annals of Statistics 17:589-605.
Huber P. J. (1981) Projection Pursuit. Research Report PJH-6, Harvard University, Dept.
of Statistics.
Huber P. J. (1985) Projection Pursuit. The Annals of Statistics 13:435-475.
Intrator N. (1990) An Averaging Result for Random Differential Equations. In Press.
Jones M. C. (1983) The Projection Pursuit Algorithm for Exploratory Data Analysis.
Unpublished Ph.D. dissertation, University of Bath, School of Mathematics.
von der Malsburg C. (1973) Self-organization of orientation sensitivity cells in the striate
cortex. Kybernetik 14:85-100.
Nass M. M., and L. N Cooper (1975) A theory for the development of feature detecting
cells in visual cortex. Biol. Cybernetics 19:1-18.
Oja E. (1982) A Simplified Neuron Model as a Principal Component Analyzer. J. Math.
Biology 15:267-273.
Saul A., and E. E. Clothiaux (1986) Modeling and Simulation III: Simulation of a Model for
Development of Visual Cortical Specificity. J. of Electrophysiological Techniques
13:279-306.
Scofield C. L., and L. N Cooper (1985) Development and properties of neural networks.
Contemp. Phys. 26:125-145.
Seebach B. S., and N. Intrator (1988) A Learning Mechanism for the Identification of
Acoustic Features. (Society for Neuroscience).
Takeuchi A., and S. Amari (1979) Formation of topographic maps and columnar
microstructures in nerve fields. Biol. Cyb. 35:63-72.
