One-unit Learning Rules for 
Independent Component Analysis 
Aapo Hyv'rinen and Erkki Oja 
Helsinki University of Technology 
Laboratory of Computer and Information Science 
Rakentajanaukio 2 C, FIN-02150 Espoo, Finland 
email: {Aapo. Hyvarinen, Erkki. Oj a}hut. f i 
Abstract 
Neural one-unit learning rules for the problem of Independent Com- 
ponent Analysis (ICA) and blind source separation are introduced. 
In these new algorithms, every ICA neuron develops into a sepa- 
rator that finds one of the independent components. The learning 
rules use very simple constrained Hebbian/anti-Hebbian learning 
in which decorrelating feedback may be added. To speed up the 
convergence of these stochastic gradient descent rules, a novel com- 
putationally efficient fixed-point algorithm is introduced. 
1 Introduction 
Independent Component Analysis (ICA) (Comon, 1994; Jutten and Herault, 1991) 
is a signal processing technique whose goal is to express a set of random vari- 
ables as linear combinations of statistically independent component variables. The 
main applications of ICA are in blind source separation, feature extraction, and 
blind deconvolution. In the simplest form of ICA (Comon, 1994), we observe rn 
scalar random variables x, ..., Xm which are assumed to be linear combinations of 
n unknown components s, ...Sn that are zero-mean and mutually statistically inde- 
pendent. In addition, we must assume n _ m. If we arrange the observed variables 
xi into a vector x -- (x, x2, ..., Xm) T and the component variables sj into a vector 
s, the linear relationship can be expressed as 
x---- As (1) 
Here, A is an unknown rn x n matrix of full rank, called the mixing matrix. Noise 
may also be added to the model, but it is omitted here for simplicity. The basic 
One-unit Learning Rules for Independent Component Analysis 481 
problem of ICA is then to estimate (separate) the realizations of the original inde- 
pendent components sj, or a subset of them, using only the mixtures xi. This is 
roughly equivalent to estimating the rows, or a subset of the rows, of the pseudoin- 
verse of the mixing matrix A. The fundamental restriction of the model is that 
we can only estimate non-Gaussian independent components, or ICs (except if just 
one of the ICs is Gaussian). Moreover, the ICs and the columns of A can only be 
estimated up to a multiplicative constant, because any constant multiplying an IC 
in eq. (1) could be cancelled by dividing the corresponding column of the mixing 
matrix A by the same constant. For mathematical convenience, we define here that 
the ICs sj have unit variance. This makes the (non-Gaussian) ICs unique, up to 
their signs. Note the assumption of zero mean of the ICs is in fact no restriction, as 
this can always be accomplished by subtracting the mean from the random vector 
x. Note also that no order is defined between the ICs. 
In blind source separation (Jutten and Herault, 1991), the observed values of x 
correspond to a realization of an m-dimensional discrete-time signal x(t), t - 1, 2, .... 
Then the components sj (t) are called source signals. The source signals are usually 
original, uncorrupted signals or noise sources. Another application of ICA is feature 
extraction (Bell and Sejnowski, 1996; Hurri et al., 1996), where the columns of the 
mixing matrix A define features, and the sj signal the presence and the amplitude 
of a feature. A closely related problem is blind deconvolution, in which a convolved 
version x(t) of a scalar i.i.d. signal s(t) is observed. The goal is then to recover the 
original signal s(t) without knowing the convolution kernel (Donoho, 1981). This 
problem can be represented in a way similar to eq. (1), replacing the matrix A by 
a filter. 
The current neural algorithms for Independent Component Analysis, e.g. (Bell and 
Sejnowski, 1995; Cardoso and Laheld, 1996; Jutten and Herault, 1991; Karhunen 
et al., 1997; Oja, 1995) try to estimate simultaneously all the components. This is 
often not necessary, nor feasible, and it is often desired to estimate only a subset of 
the ICs. This is the starting point of our paper. We introduce learning rules for a 
single neuron, by which the neuron learns to estimate one of the ICs. A network of 
several such neurons can then estimate several (1 to n) ICs. Both learning rules for 
the 'raw' data (Section 3) and for whitened data (Section 4) are introduced. If the 
data is whitened, the convergence is speeded up, and some interesting simplifications 
and approximations are made possible. Feedback mechanisms (Section 5) are also 
mentioned. Finally, we introduce a novel approach for performing the computations 
needed in the ICA learning rules, which uses a very simple, yet highly efficient, fixed- 
point iteration scheme (Section 6). An important generalization of our learning rules 
is discussed in Section 7, and an illustrative experiment is shown in Section 8. 
2 Using Kurtosis for ICA Estimation 
We begin by introducing the basic mathematical framework of ICA. Most sug- 
gested solutions for ICA use the fourth-order cumulant or kurtosis of the signals, 
defined for a zero-mean random variable v as kurt(v) - E{v 4} - 3(E{v2}) 2. For a 
Gaussian random variable, kurtosis is zero. Therefore, random variables of positive 
kurtosis are sometimes called super-Gaussian, and variables of negative kurtosis 
sub-Gaussian. Note that for two independent random variables v and v2 and for a 
scalar c, it holds kurt(v +v2) = kurt(v) + kurt(v2) and kurt(cv) = c 4 kurt(v). 
482 A. Hyviirinen and E. Oja 
Let us search for a linear combination of the observations xi, say, wTx, such that 
it has maximal or minimal kurtosis. Obviously, this is meaningful only if w is 
somehow bounded; let us assume that the variance of the linear combination is 
constant: E{(wTx) 2} -- 1. Using the mixing matrix A in eq. (1), let us define 
,, = ATv. Then also I1-,112 = vTA ATv = vTE(xxT}v = E((vTx) = 
Using eq. (1) and the properties of the kurtosis, we have 
kurt(wTx) = kurt(wTAs): kurt(zTs)= 
Ezkurt(sj) (2) 
Under the constraint E{(wTx) a} = llzll 2: 1, the function in (2) has a number of 
local minima and maxima. To make the argument clearer, let us assume for the 
moment that the mixture contains at least one IC whose kurtosis is negative, and 
at least one whose kurtosis is positive. Then, as may be obvious, and was rigorously 
proven by Delfosse and Loubaton (1995), the extremal points of (2) are obtained 
when all the components zj' of z are zero except one component which equals +1. 
In particular, the function in (2) is maximized (resp. minimized) exactly when the 
linear combination wTx = zTs equals, up to the sign, one of the ICs sj of positive 
(resp. negative) kurtosis. Thus, finding the extrema of kurtosis of wTx enables 
estimation of the independent components. Equation (2) also shows that Gaussian 
components, or other components whose kurtosis is zero, cannot be estimated by 
this method. 
To actually minimize or maximize kurt(wTx), a neural algorithm based on gradient 
descent or ascent can be used. Then w is interpreted as the weight vector of a neuron 
with input vector x and linear output wTx. The objective function can be simplified 
because of the constraint E{(wTx) a} = 1: it holds kurt(wTx) = E{(wTx) 4} - 3. 
The constraint E{(wTx) 2} = 1 itself can be taken into account by a penalty term. 
The final objective function is then of the form 
J(w) = E{(wTx) + OF(E{(wTx)a}) 
(3) 
where a,/ > 0 are arbitrary scaling constants, and F is a suitable penalty function. 
Our basic ICA learning rules are stochastic gradient descents or ascents for an 
objective function of this form. In the next two sections, we present learning rules 
resulting from adequate choices of the penalty function F. Preprocessing of the 
data (whitening) is also used to simplify J in Section 4. An alternative method for 
finding the extrema of kurtosis is the fixed-point algorithm; see Section 6. 
3 Basic One-Unit ICA Learning Rules 
In this section, we introduce learning rules for a single neural unit. These basic 
learning rules require no preprocessing of the data, except that the data must be 
made zero-mean. Our learning rules are divided into two categories. As explained in 
Section 2, the learning rules either minimize the kurtosis of the output to separate 
ICs of negative kurtosis, or maximize it for components of positive kurtosis. 
Let us assume that we observe a sample sequence x(t) of a vector x that is a 
linear combination of independent components s,..., s according to eq. (1). For 
separating one of the ICs of negative kurtosis, we use the following learning rule for 
One-unit Learning Rules for Independent Component Analysis 483 
the weight vector w of a neuron: 
Aw(t) o(x(t)g-(w(t)Tx(t)) 
(4) 
Here, the non-linear learning function g- is a simple polynomial: g-(u) = au - bu a 
with arbitrary scaling constants a, b > 0. This learning rule is clearly a stochastic 
gradient descent for a function of the form (3), with F(u) = -u. To separate an IC 
of positive kurtosis, we use the following learning rule: 
5w(t) o< x(t)gw+c)(w(t)Tx(t)) 
(5) 
where the learning function + is defined as follows: gw+(U) - 
gw(t) -- 
--au(w(t)TCw(t)) 2 + bu 3 where C is the covariance matrix of x(t), i.e. C -- 
E{x(t)x(t)T}, and a, b > 0 are arbitrary constants. This learning rule is a stochas- 
tic gradient ascent for a function of the form (3), with F(u) = -u 2. Note that 
w(t)TCw(t) in g+ might also be replaced by (E{(w(t)Tx(t))2}) 2 or by IIw(t)ll 4 to 
enable a simpler implementation. 
It can be proven (Hyvlirinen and Oja, 1996b) that using the learning rules (4) and 
(5), the linear output converges to csj(t) where sj(t) is one of the ICs, and c is a 
scalar constant. This multiplication of the source signal by the constant c is in fact 
not a restriction, as the variance and the sign of the sources cannot be estimated. 
The only condition for convergence is that one of the ICs must be of negative (resp. 
positive) kurtosis, when learning rule (4) (resp. learning rule (5)) is used. Thus 
we can say that the neuron learns to separate (estimate) one of the independent 
components. It is also possible to combine these two learning rules into a single rule 
that separates an IC of any kurtosis; see (Hyvlirinen and Oja, 1996b). 
4 One-Unit ICA Learning Rules for Whitened Data 
Whitening, also called sphering, is a very useful preprocessing technique. It speeds 
up the convergence considerably, makes the learning more stable numerically, and 
allows some interesting modifications of the learning rules. Whitening means that 
the observed vector x is linearly transformed to a vector v = Ux such that its 
elements vi are mutually uncorrelated and all have unit variance (Comon, 1994). 
Thus the correlation matrix of v equals unity: E{vv T} = I. This transformation is 
always possible and can be accomplished by classical Principal Component Analysis. 
At the same time, the dimensionality of the data should be reduced so that the 
dimension of the transformed data vector v equals n, the number of independent 
components. This also has the effect of reducing noise. 
Let us thus suppose that the observed signal v(t) is whitened (sphered). Then, in 
order to separate one of the components of negative kurtosis, we can modify the 
learning rule (4) so as to get the following learning rule for the weight vector w: 
5w(t) o< v(t)g-(w(t)Tv(t)) - w(t) 
(6) 
Here, the function g- is the same polynomial as above: g-(u) = au - bu a with 
a > I and b > 0. This modification is valid because we now have Ev(wTv) = w 
and thus we can add +w(t) in the linear part of g- and subtract w(t) explicitly 
afterwards. The modification is useful because it allows us to approximate g- with 
484 A. Hyviirinen and E. Oja 
the 'tanh' function, as w(t)Tv(t) then stays in the range where this approximation 
is valid. Thus we get what is perhaps the simplest possible stable Hebbian learning 
rule for a nonlinear Perceptron. 
To separate one of the components of positive kurtosis, rule (5) simplifies to: 
Aw(t) 3 - allw(t)ll4w(t). 
(7) 
5 Multi-Unit ICA Learning Rules 
If estimation of several independent components is desired, it is possible to construct 
a neural network by combining N (1 _< N _< n) neurons that learn according to 
the learning rules given above, and adding a feedback term to each of those learning 
rules. A discussion of such networks can be found in (Hyvlirinen and Oja, 1996b). 
6 Fixed-Point Algorithm for ICA 
The advantage of neural on-line learning rules like those introduced above is that 
the inputs v(t) can be used in the algorithm at once, thus enabling faster adaptation 
in a non-stationary environment. A resulting trade-off, however, is that the conver- 
gence is slow, and depends on a good choice of the learning rate sequence, i.e. the 
step size at each iteration. A bad choice of the learning rate can, in practice, destroy 
convergence. Therefore, some ways to make the learning radically faster and more 
reliable may be needed. The fixed-point iteration algorithms are such an alterna- 
tive. Based on the learning rules introduced above, we introduce here a fixed-point 
algorithm, whose convergence is proven and analyzed in detail in (Hyvlirinen and 
Oja, 1997). For simplicity, we only consider the case of whitened data here. 
Consider the general neural learning rule trying to find the extrema of kurtosis. 
In a fixed point of such a learning rule, the sum of the gradient of kurtosis and 
the penalty term must equal zero: E{v(wTv) 3} -- 311wl12w q- f(llw]12)w -- 0. The 
solutions of this equation must satisfy 
w: scalar x (E{(WT) 3 } -- 3wllwll 2) 
(s) 
Actually, because the norm of w is irrelevant, it is the direction of the right hand 
side that is important. Therefore the scalar in eq. (8) is not significant and its effect 
can be replaced by explicit normalization. 
Assume now that we have collected a sample of the random vector v, which is a 
whitened (or sphered) version of the vector x in eq. (1). Using (8), we obtain the 
following fixed-point algorithm for ICA: 
1. Take a random initial vector w(0) of norm 1. Let k = 1. 
2. Let w(k) -- E{v(w(k - 1)Tv) 3} -- 3w(k - 1). The expectation can be 
estimated using a large sample of v vectors (say, 1,000 points). 
3. Divide w(k) by its norm. 
4. If Iw(k)Tw(k - 1)1 is not close enough to 1, let k = k + 1 and go back 
to step 2. Otherwise, output the vector w(k). 
One-unit Learning Rules for Independent Component Analysis 485 
The final vector w* = limk w(k) given by the algorithm separates one of the non- 
Gaussian ICs in the sense that w*Tv equals one of the ICs sj. No distinction 
between components of positive or negative kurtosis is needed here. A remarkable 
property of our algorithm is that a very small number of iterations, usually 5-10, 
seems to be enough to obtain the maximal accuracy allowed by the sample data. 
This is due to the fact that the convergence of the fixed point algorithm is in fact 
cubic, as shown in (Hyvlirinen and Oja, 1997). 
To estimate N ICs, we run this algorithm N times. To ensure that we estimate each 
time a different IC, we only need to add a simple projection inside the loop, which 
forces the solution vector w(k) to be orthogonal to the previously found solutions. 
This is possible because the desired weight vectors are orthonormal for whitened 
data (Hyvlirinen and Oja, 1996b; Karhunen et al., 1997). Symmetric methods of 
orthogonalization may also be used (Hyvlirinen, 1997). 
This fixed-point algorithm has several advantages when compared to other suggested 
ICA methods. First, the convergence of our algorithm is cubic. This means very fast 
convergence and is rather unique among the ICA algorithms. Second, contrary to 
gradient-based algorithms, there is no learning rate or other adjustable parameters 
in the algorithm, which makes it easy to use and more reliable. Third, components 
of both positive and negative kurtosis can be directly estimated by the same fixed- 
point algorithm. 
7 Generalizations of Kurtosis 
In the learning rules introduced above, we used kurtosis as an optimization criterion 
for ICA estimation. This approach can be generalized to a large class of such 
optimizaton criteria, called contrast functions. For the case of on-line learning 
rules, this approach is developed in (Hyvarinen and Oja, 1996a), in which it is 
shown that the function g in the learning rules in section 4 can be, in fact, replaced 
by practically any non-linear function (provided that w is normalized properly). 
Whether one must use Hebbian or anti-Hebbian learning is then determined by the 
sign of certain 'non-polynomial cumulants'. The utility of such a generalization is 
that one can then choose the non-linearity according to some statistical optimality 
criteria, such as robustness against outliers. 
The fixed-point algorithm may also be generalized for an arbitrary non-linearity, say 
g. Step 2 in the fixed-point algorithm then becomes (for whitened data) (Hyvarinen, 
1997): w(k) = E{vg(w(k- 1)rv)} -- E{g'(w(k - 1)Tv)}w(k -- 1). 
8 Experiments 
A visually appealing way of demonstrating how ICA algorithms work is to use 
them to separate images from their linear mixtures. On the left in Fig. 1, four 
superimposed mixtures of 4 unknown images are depicted. Defining the j-th IC 
sj to be the gray-level value of the j-th image in a given position, and scanning 
the 4 images simultaneously pixel by pixel, we can use the ICA model and recover 
the original images. For example, we ran the fixed-point algorithm four times, 
estimating the four images shown on the right in Fig. 1. The algorithm needed on 
the average 7 iterations for each IC. 
486 A. Hyvgirinen and E. Oja 
Figure 1: Three photographs of natural scenes and a noise image were linearly 
mixed to illustrate our algorithms. The mixtures are depicted on the left. On the 
right, the images recovered by the fixed-point algorithm are shown. 
References 
Bell, A. and Sejnowski, T. (1995). An information-maximization approach to blind 
separation and blind deconvolution. Neural Computation, 7:1129-1159. 
Bell, A. and Sejnowski, T. J. (1996). Edges are the independent components of natural 
scenes. In NIPS*96, Denver, Colorado. 
Cardoso, J.-F. and Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE 
Trans. on Signal Processing,, 44(12). 
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing, 
36:287-314. 
Delfosse, N. and Loubaton, P. (1995). Adaptive blind separation of independent sources: 
a deflation approach. Signal Processing, 45:59-83. 
Donoho, D. (1981). On minimum entropy deconvolution. In Applied Time Series Anal- 
ysis II. Academic Press. 
Hurri, J., Hyvarinen, A., Karhunen, J., and Oja, E. (1996). Image feature extraction 
using independent component analysis. In Proc. NORSIG'96, Espoo, Finland. 
Hyvarinen, A. (1997). A family of fixed-point algorithms for independent component 
analysis. In Proc. ICASSP'97, Munich, Germany. 
Hyvarinen, A. and Oja, E. (1996a). Independent component analysis by general non- 
linear hebbian-like learning rules. Technical Report A41, Helsinki University of Tech- 
nology, Laboratory of Computer and Information Science. 
Hyvarinen, A. and Oja, E. (1996b). Simple neuron models for independent component 
analysis. Technical Report A37, Helsinki University of Technology, Laboratory of 
Computer and Information Science. 
Hyvarinen, A. and Oja, E. (1997). A fast fixed-point algorithm for independent compo- 
nent analysis. Neural Computation. To appear. 
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: An adaptive 
algorithm based on neuromimetic architecture. Signal Processing, 24:1-10. 
Karhunen, J., Oja, E., Wang, L., Vigario, R., and Joutsensalo, J. (1997). A class of neu- 
ral networks for independent component analysis. IEEE Trans. on Neural Networks. 
To appear. 
Oja, E. (1995). The nonlinear PCA learning rule and signal separation - mathematical 
analysis. Technical Report A 26, Helsinki University of Technology, Laboratory of 
Computer and Information Science. Submitted to a journal. 
