Higher Order Statistical Decorrelation without 
Information Loss 
Gustavo Deco 
Siemens AG 
Central Research 
Otto-Hahn-Ring 6 
81739 Munich 
Germany 
Wilfried Brauer
Technische Universität München
Institut für Informatik
Arcisstr. 21 
80290 Munich 
Germany 
Abstract 
A neural network learning paradigm based on information theory is proposed as a
way to perform, in an unsupervised fashion, redundancy reduction among the
elements of the output layer without loss of information from the sensory input.
The model developed performs nonlinear decorrelation up to higher orders of the
cumulant tensors and results in probabilistically independent components of the
output layer. This means that we do not need to assume a Gaussian distribution
either at the input or at the output. The theory presented is related to the
unsupervised-learning theory of Barlow, which proposes redundancy reduction as
the goal of cognition. When nonlinear units are used, nonlinear principal
component analysis is obtained. In this case nonlinear manifolds can be reduced
to minimum dimension manifolds. If linear units are used, the network performs a
generalized principal component analysis in the sense that non-Gaussian
distributions can be linearly decorrelated and higher orders of the correlation
tensors are also taken into account. The basic structure of the architecture
involves a general transformation that is volume conserving and therefore
conserves the entropy, yielding a map without loss of information. Minimization
of the mutual information among the output neurons eliminates the redundancy
between the outputs and results in statistical decorrelation of the extracted
features. This is known as factorial learning.
1 INTRODUCTION 
One of the most important theories of feature extraction is the one proposed by
Barlow (1989). Barlow describes the process of cognition as a preprocessing of
the sensorial information performed by the nervous system in order to extract
the statistically relevant and independent features of the inputs without losing
information. This means that the brain should statistically decorrelate the
extracted information. As a learning strategy Barlow (1989) formulated the
principle of redundancy reduction. This kind of learning is called factorial
learning. Recently Atick and Redlich (1992) and Redlich (1993) concentrated on
the original idea of Barlow, yielding a very interesting formulation of early
visual processing and factorial learning. Redlich (1993) reduces redundancy at
the input by using a network structure which is a reversible cellular automaton
and therefore guarantees the conservation of information in the transformation
between input and output. Some nonlinear extensions of PCA for decorrelation of
sensorial input signals were recently introduced. These follow very closely
Barlow's original ideas of unsupervised learning. Redlich (1993) uses similar
information theoretic concepts and reversible cellular automata architectures in
order to define how nonlinear decorrelation can be performed. The aim of our
work is to formulate a neural network architecture and a novel learning paradigm
that performs Barlow's unsupervised learning in the most general fashion. The
basic idea is to define an architecture that assures perfect transmission
without loss of information. Consequently the nonlinear transformation defined
by the neural architecture is always bijective. The architecture performs a
volume-conserving transformation (the determinant of the Jacobian matrix is
equal to one). As a particular case we can derive the reversible cellular
automata architecture proposed by Redlich (1993). The learning paradigm is
defined so that the components of the output signal are statistically
decorrelated. Due to the fact that the output distribution is not necessarily
Gaussian, even if the input is Gaussian, we perform a cumulant expansion of the
output distribution and find the rules that should be satisfied by the higher
order correlation tensors in order to be decorrelated.
2 THEORETICAL FORMALISM 
Let us consider an input vector $\vec{x}$ of dimensionality d with components
distributed according to the probability distribution $P(\vec{x})$, which is not
factorial, i.e. the components of $\vec{x}$ are correlated. The goal of Barlow's
unsupervised learning rule is to find a transformation

$$\vec{y} = F(\vec{x}) \tag{2.1}$$

such that the components of the d-dimensional output vector $\vec{y}$ are
statistically decorrelated. This means that the probability distributions of the
components $y_i$ are independent and therefore

$$P(\vec{y}) = \prod_{i}^{d} P(y_i). \tag{2.2}$$
The objective of factorial learning is to find a neural network which performs
the transformation $F(\cdot)$ such that the joint probability distribution
$P(\vec{y})$ of the output signals is factorized as in eq. (2.2). In order to
implement factorial learning, the information contained in the input should be
transferred to the output neurons without loss, but the probability distribution
of the output neurons should be statistically decorrelated. Let us now define
these facts from the information theory perspective. The first aspect is to
assure that the entropy is conserved, i.e.

$$H(\vec{x}) = H(\vec{y}) \tag{2.3}$$

where the symbol $H(\vec{a})$ denotes the entropy of $\vec{a}$ and
$H(\vec{a}/\vec{b})$ the conditional entropy of $\vec{a}$ given $\vec{b}$. One
way to achieve this goal is to construct an architecture that, independently of
its synaptic parameters, always satisfies eq. (2.3). Thus the architecture
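will conserve information or entropy. The transmitted entropy satisfies

$$H(\vec{y}) \le H(\vec{x}) + \int P(\vec{x}) \ln\left|\det\frac{\partial F}{\partial \vec{x}}\right| d\vec{x} \tag{2.4}$$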
where equality holds only if F is bijective, i.e. reversible. Conservation of
information and bijectivity is assured if the neural transformation conserves
the volume, which mathematically can be expressed by the fact that the Jacobian
of the transformation should have determinant unity. In section 3 we formulate
an architecture that always conserves the entropy. Let us now concentrate on the
main aspect of factorial learning, namely the
decorrelation of the output components. Here the problem is to find a volume-conserving 
transformation that satisfies eq. (2.2). The major problem is that the distribution of the out- 
put signal will not necessarily be Gaussian. Therefore it is impossible to use the technique 
of minimizing the mutual information between the components of the output as done by 
Redlich (1993). The only way to decorrelate non-Gaussian distributions is to expand the 
distribution in higher orders of the correlation matrix and impose the independence condi- 
tion of eq. (2.2). In order to achieve this we propose to use a cumulant expansion of the 
output distribution. Let us define the Fourier transform of the output distribution, 
{()----' I d ei(') P(;{(Ki)--'ldYi ei(Ki'Yi) P(Yi) (2.5) 
The cumulant expansion of a distribution is (Papoulis, 1991) 
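$$\phi(\vec{K}) = \exp\Big(\sum_{n=1}^{\infty} \frac{\mathrm{i}^n}{n!} \sum_{i_1,\ldots,i_n} \kappa_{i_1 \ldots i_n} K_{i_1} \cdots K_{i_n}\Big); \qquad \phi(K_i) = \exp\Big(\sum_{n=1}^{\infty} \frac{\mathrm{i}^n}{n!} \kappa_i^{(n)} K_i^n\Big). \tag{2.6}$$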
In the Fourier space the independence condition is given by (Papoulis, 1991) 
b () -- 1-I qb (K) (2.7) 
t 
which is equivalent to 
$$\ln \phi(\vec{K}) = \sum_{i} \ln \phi(K_i). \tag{2.8}$$
Putting eq. (2.8) and the cumulant expansions of eq. (2.6) together, we obtain that in the 
case of independence the following equality is satisfied 
$$\sum_{n=1}^{\infty} \frac{\mathrm{i}^n}{n!} \sum_{i_1,\ldots,i_n} \kappa_{i_1 \ldots i_n} K_{i_1} \cdots K_{i_n} = \sum_{i=1}^{d} \sum_{n=1}^{\infty} \frac{\mathrm{i}^n}{n!} \kappa_i^{(n)} K_i^n. \tag{2.9}$$
In both expansions we will only consider the first four cumulants. After an
extra transformation

$$\vec{y}' = \vec{y} - \langle \vec{y} \rangle \tag{2.10}$$

to remove the bias $\langle \vec{y} \rangle$, we can rewrite eq. (2.9) using the
cumulant expressions derived in Papoulis (1991):
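$$-\frac{1}{2} \sum_{i,j} K_i K_j \{ C_{ij} - C_i^{(2)} \delta_{ij} \} - \frac{\mathrm{i}}{6} \sum_{i,j,k} K_i K_j K_k \{ C_{ijk} - C_i^{(3)} \delta_{ijk} \} + \frac{1}{24} \sum_{i,j,k,l} K_i K_j K_k K_l \{ (C_{ijkl} - 3 C_{ij} C_{kl}) - (C_i^{(4)} - 3 (C_i^{(2)})^2) \delta_{ijkl} \} = 0 \tag{2.11}$$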
Equation (2.11) should be satisfied for all values of $\vec{K}$. The
multidimensional correlation tensors $C_{i \ldots j}$ and the one-dimensional
higher order moments $C_i^{(n)}$ are given by

$$C_{i \ldots j} = \int d\vec{y}'\, P(\vec{y}')\, y'_i \cdots y'_j, \qquad C_i^{(n)} = \int dy'_i\, P(y'_i)\, (y'_i)^n. \tag{2.12}$$
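Here $\delta_{i \ldots j}$ denotes Kronecker's delta. Due to the fact that eq.
(2.11) should be satisfied for all $\vec{K}$, all coefficients in each summation
of eq. (2.11) must be zero. This means that

$$C_{ij} = 0, \quad \text{if } (i \neq j) \tag{2.13}$$

$$C_{ijk} = 0, \quad \text{if } (i \neq j \lor i \neq k) \tag{2.14}$$

$$C_{ijkl} = 0, \quad \text{if } (\{ i \neq j \lor i \neq k \lor i \neq l \} \land \lnot L) \tag{2.15}$$

$$C_{iijj} - C_{ii} C_{jj} = 0, \quad \text{if } (i \neq j). \tag{2.16}$$

In eq. (2.15) L is the logical expression

$$L = \{ (i = j \land k = l \land j \neq k) \lor (i = k \land j = l \land i \neq j) \lor (i = l \land j = k \land i \neq j) \}, \tag{2.17}$$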
which excludes the cases considered in eq. (2.16). The conditions of
independence given by eqs. (2.13-2.16) can be achieved by minimization of the
cost function

$$E = \alpha \sum_{i<j} C_{ij}^2 + \beta \sum_{i<j\le k} C_{ijk}^2 + \gamma \sum_{i<j\le k\le l} C_{ijkl}^2 + \delta \sum_{i<j} (C_{iijj} - C_{ii} C_{jj})^2 \tag{2.18}$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ are the inverse of the number of
elements in each summation, respectively.
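A sample-based estimate of the four summands of the cost function can be sketched
as follows. This is an illustration under stated assumptions, not the authors'
implementation: the function name `cost_terms` is hypothetical, the index sets
are taken as all sorted index tuples constrained by eqs. (2.13-2.16) (which may
differ in detail from the summation ranges of eq. (2.18)), and the test data use
a uniformly distributed first coordinate.

```python
import itertools
import numpy as np

def cost_terms(y):
    """Four summands of a cost in the spirit of eq. (2.18), from the rows of y (N x d)."""
    yc = y - y.mean(axis=0)                  # remove the bias, eq. (2.10)
    d = yc.shape[1]

    def C(*idx):                             # sample estimate of C_{i...j}, eq. (2.12)
        return np.mean(np.prod(yc[:, list(idx)], axis=1))

    def all_equal(idx):                      # marginal moment, not constrained
        return len(set(idx)) == 1

    def pair_pattern(idx):                   # sorted (i, i, j, j), i != j: eq. (2.16)
        return idx[0] == idx[1] and idx[2] == idx[3] and idx[0] != idx[2]

    cwr = itertools.combinations_with_replacement
    t2 = [C(i, j) ** 2 for i, j in itertools.combinations(range(d), 2)]
    t3 = [C(*ix) ** 2 for ix in cwr(range(d), 3) if not all_equal(ix)]
    t4a = [C(*ix) ** 2 for ix in cwr(range(d), 4)
           if not all_equal(ix) and not pair_pattern(ix)]
    t4b = [(C(i, i, j, j) - C(i, i) * C(j, j)) ** 2
           for i, j in itertools.combinations(range(d), 2)]
    # The mean implements the prefactor "inverse of the number of elements".
    return tuple(float(np.mean(t)) if t else 0.0 for t in (t2, t3, t4a, t4b))

rng = np.random.default_rng(1)
n = 100_000
# Independent non-Gaussian components (assumed distributions).
indep = np.column_stack([rng.uniform(0, 1, n), rng.beta(2, 5, n)])
# Strongly dependent components: a noisy parabola over a uniform first coordinate.
x1 = rng.uniform(0, 1, n)
curve = np.column_stack([x1, 4 * x1 * (1 - x1) + 0.01 * rng.normal(size=n)])

terms_indep = cost_terms(indep)
terms_curve = cost_terms(curve)
E_indep, E_curve = sum(terms_indep), sum(terms_curve)
print("independent:", terms_indep)
print("curve:      ", terms_curve)
```

For this symmetric parabola the second-order summand is close to zero even
though the components are strongly dependent; the dependence only shows up in
the third- and fourth-order summands, which is the reason for going beyond the
covariance matrix.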
In conclusion, by minimizing the cost given by eq. (2.18) with a
volume-conserving network, we achieve nonlinear decorrelation of non-Gaussian
distributions. It is very easy to verify that a factorized probability
distribution (eq. 2.2) satisfies eqs. (2.13-2.16). As a particular case, if only
second order terms are used in the cumulant expansion, the learning rule reduces
to eq. (2.13), which expresses nothing more than the diagonalization of the
second order covariance matrix. In this case, by anti-transforming the cumulant
expansion of the Fourier transform of the distribution, we obtain a Gaussian
distribution. Diagonalization of the covariance matrix decorrelates
statistically the components of the output only if we assume a Gaussian
distribution of the outputs. In general the distribution of the output is not
Gaussian and therefore higher orders of the cumulant expansion should be taken
into account, yielding the learning rule conditions of eqs. (2.13-2.16) (up to
fourth order; generalization to higher orders is straightforward). In the case
of a Gaussian distribution, minimization of the sum of the variances at each
output leads to statistical decorrelation. This fact has a nice information
theoretic background, namely the minimization of the mutual information between
the output components. Statistical independence as expressed in eq. (2.2) is
equivalent to (Atick and Redlich, 1992)
$$M_H = \sum_j H(y_j) - H(\vec{y}) = 0. \tag{2.19}$$
This means that in order to minimize the redundancy at the output we minimize
the mutual information between the different components of the output vector.
Due to the fact that the volume-conserving structure of the neural network
conserves the entropy, the minimization of $M_H$ reduces to the minimization of
$\sum_j H(y_j)$.
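For the Gaussian special case the argument can be checked in closed form, since
the entropy of a Gaussian depends only on its covariance. The sketch below is an
illustration under assumptions: the covariance matrix and the linear
unit-determinant shear `A` are chosen by hand, and the helper names are
hypothetical.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance `cov`."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def gaussian_redundancy(cov):
    """M_H of eq. (2.19) for a Gaussian: sum of marginal entropies minus joint entropy."""
    marginals = sum(gaussian_entropy(cov[j, j]) for j in range(cov.shape[0]))
    return marginals - gaussian_entropy(cov)

cov_x = np.array([[2.0, 1.2],
                  [1.2, 1.0]])             # correlated Gaussian input (assumed)

# Triangular map with unit diagonal, hence det(A) = 1: y_2 = x_2 - 0.6 * x_1.
A = np.array([[1.0, 0.0],
              [-0.6, 1.0]])
cov_y = A @ cov_x @ A.T

print("joint entropy  x, y:", gaussian_entropy(cov_x), gaussian_entropy(cov_y))
print("redundancy     x, y:", gaussian_redundancy(cov_x), gaussian_redundancy(cov_y))
print("sum of variances x, y:", np.trace(cov_x), np.trace(cov_y))
```

The joint entropy is unchanged by the unit-determinant map, so lowering the
marginal entropies (here, the output variances) is the only way the redundancy
can drop, and it reaches zero once the covariance is diagonal.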
3 VOLUME-CONSERVING ARCHITECTURE AND LEARNING RULE
In this section we define a neural network architecture that is volume-conserving and 
therefore can be used for the implementation of the learning rules described in the last sec- 
tion. Figure 1.a shows the basic architecture of one layer. The dimensionality of input and 
output layer is the same and equal to d. A similar architecture was proposed by Redlich 
(1993b) using the theory of reversible cellular automata. 
Figure 1: Volume-conserving Neural Architectures. (a) one layer; (b) multilayer;
(c) two-layer.
The analytical definition of the transformation defined by this architecture can
be written as

$$y_i = x_i + f_i(x_0, \ldots, x_j, \vec{\omega}_i), \quad \text{with } j < i \tag{3.1}$$

where $\vec{\omega}_i$ represents a set of parameters of the function $f_i$.
Note that independently of the functions $f_i$ the network is always
volume-conserving. In particular $f_i$ can be calculated
by another neural network, by a sigmoid neuron, by polynomials (higher order neurons), 
etc. Due to the asymmetric dependence on the input variables and the direct connections 
with weights equal to 1 between corresponding components of input and output neurons, 
the Jacobian matrix of the transformation defined in eq. (3.1) is an upper triangular matrix 
with diagonal elements all equal to one, yielding a determinant equal to one. A network 
with inverted asymmetry also can be defined as 
$$y_i = x_i + g_i(x_j, \ldots, x_d, \vec{\theta}_i), \quad \text{with } i < j \tag{3.2}$$

corresponding to a lower triangular Jacobian matrix with diagonal elements all
equal to one, being therefore volume-conserving. The vectors $\vec{\theta}_i$
represent the parameters of functions $g_i$. In order to yield a general
nonlinear transformation from inputs to outputs (without asymmetric
dependences) it is possible to build a multilayer architecture like the one
shown in Fig. 1.b, which involves mixed versions of networks described by eq.
(3.1) and eq. (3.2), respectively. Due to the fact that successive application
of volume-conserving transformations is also volume-conserving, the multilayer
architecture is also volume-conserving. In the two-layer case (Fig. 1.c) the
second layer can be interpreted as asymmetric lateral connections between the
neurons of the first layer. However, in our case the feed-forward connections
between input layer and first layer are also asymmetric. As demonstrated in the
last section, we minimize a cost function E to decorrelate nonlinearly
correlated non-Gaussian inputs. Let us analyze for simplicity a two-layer
architecture (Fig. 1.c) with the first layer given by eq. (3.1) and the second
layer by eq. (3.2). Let us denote the output of the hidden layer by $\vec{h}$
and use it as input of eq. (3.2) with output $\vec{y}$. The extension to
multilayer architectures is straightforward. The learning rule can be easily
expressed by the gradient descent method:
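$$\vec{\theta}_i \leftarrow \vec{\theta}_i - \eta \frac{\partial E}{\partial \vec{\theta}_i}, \qquad \vec{\omega}_i \leftarrow \vec{\omega}_i - \eta \frac{\partial E}{\partial \vec{\omega}_i} \tag{3.3}$$

where $\eta$ is the learning constant. In order to calculate the derivative of
the cost function we need

$$\frac{\partial C_{i \ldots j}}{\partial \theta} = \frac{1}{N} \sum_p \Big\{ \frac{\partial (y_i - \langle y_i \rangle)}{\partial \theta} \cdots (y_j - \langle y_j \rangle) + \ldots + (y_i - \langle y_i \rangle) \cdots \frac{\partial (y_j - \langle y_j \rangle)}{\partial \theta} \Big\}, \qquad \frac{\partial \langle y_i \rangle}{\partial \theta} = \frac{1}{N} \sum_p \frac{\partial y_i}{\partial \theta} \tag{3.4}$$

where $\theta$ represents the parameters $\vec{\theta}_i$ and $\vec{\omega}_i$.
The sums in both equations extend over the N training patterns. The gradients of
the different outputs are

$$\frac{\partial y_i}{\partial \vec{\theta}_i} = \frac{\partial g_i}{\partial \vec{\theta}_i}; \qquad \frac{\partial y_k}{\partial \vec{\omega}_i} = \Big(\frac{\partial g_k}{\partial h_i}\Big) \Big(\frac{\partial f_i}{\partial \vec{\omega}_i}\Big) \delta_{i>k} + \Big(\frac{\partial f_i}{\partial \vec{\omega}_i}\Big) \delta_{ik} \tag{3.5}$$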
where $\delta_{i>k}$ is equal to 1 if i > k and 0 otherwise. In this paper we
choose a polynomial form for the functions f and g. This model involves higher
order neurons. In this case each function $f_i$ or $g_i$ is a product of
polynomial functions of the inputs. The update equations are given by

$$h_i = x_i + \prod_{j=1}^{i-1} \Big(\sum_{r=0}^{R} \omega_{ijr} x_j^r\Big); \qquad y_i = h_i + \prod_{j=i+1}^{d} \Big(\sum_{r=0}^{R} \theta_{ijr} h_j^r\Big) \tag{3.6}$$
where R is the order of the polynomial used. In this case the two-layer
architecture is a higher order network with a general volume-conserving
structure. The derivatives involved in the learning rule are given by
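$$\frac{\partial y_k}{\partial \theta_{ijr}} = \delta_{ki}\, \delta_{j>i}\, \frac{(y_i - h_i)\, h_j^r}{\sum_{s=0}^{R} \theta_{ijs} h_j^s}; \qquad \frac{\partial y_k}{\partial \omega_{ijr}} = \Big( \delta_{ki} + \delta_{i>k}\, (y_k - h_k)\, \frac{\sum_{s=0}^{R} s\, \theta_{kis} h_i^{s-1}}{\sum_{s=0}^{R} \theta_{kis} h_i^s} \Big)\, \delta_{i>j}\, \frac{(h_i - x_i)\, x_j^r}{\sum_{s=0}^{R} \omega_{ijs} x_j^s} \tag{3.7}$$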
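The two-layer polynomial network of eq. (3.6) can be sketched in a few lines,
together with a numerical check that its Jacobian has unit determinant and that
the derivative with respect to one second-layer coefficient behaves as expected.
This is a minimal sketch under assumptions: indices are zero-based, a unit with
no coupled inputs is given f = 0 (or g = 0), parameters are drawn at random, and
the function names are hypothetical.

```python
import numpy as np

def poly_products(v, params, lower):
    """out[i] = prod_j sum_r params[i, j, r] * v[j]**r over the inputs j coupled to unit i."""
    d, _, n_coef = params.shape
    powers = v[:, None] ** np.arange(n_coef)      # powers[j, r] = v[j] ** r
    out = np.zeros(d)
    for i in range(d):
        js = range(i) if lower else range(i + 1, d)
        if len(js) > 0:                           # assumption: no inputs means zero coupling
            out[i] = np.prod([params[i, j] @ powers[j] for j in js])
    return out

def forward(x, omega, theta):
    h = x + poly_products(x, omega, lower=True)   # first layer: f_i uses x_j with j < i
    y = h + poly_products(h, theta, lower=False)  # second layer: g_i uses h_j with j > i
    return h, y

def numerical_jacobian(fn, x, eps=1e-5):
    """Central-difference Jacobian, J[i, j] = d fn_i / d x_j."""
    J = np.zeros((x.size, x.size))
    for j in range(x.size):
        step = np.zeros(x.size)
        step[j] = eps
        J[:, j] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return J

rng = np.random.default_rng(2)
d, R = 3, 2
omega = rng.uniform(-0.5, 0.5, size=(d, d, R + 1))
theta = rng.uniform(-0.5, 0.5, size=(d, d, R + 1))
x = rng.uniform(-1, 1, size=d)

# Volume conservation: the determinant is one for any parameter values.
det_J = float(np.linalg.det(numerical_jacobian(lambda v: forward(v, omega, theta)[1], x)))

# Derivative of y_0 with respect to theta[0, 2, 1]: h_2**1 times the remaining factor.
i, j, r = 0, 2, 1
h, y = forward(x, omega, theta)
analytic = h[j] ** r * (theta[i, 1] @ (h[1] ** np.arange(R + 1)))
bumped = theta.copy()
bumped[i, j, r] += 1e-6
numeric = (forward(x, omega, bumped)[1][i] - y[i]) / 1e-6
print(det_J, analytic, numeric)
```

The analytic derivative is written here as the product of the remaining
polynomial factors; it equals the quotient form (the full product divided by the
differentiated factor) whenever that factor is nonzero, and avoids the division.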
4 RESULTS AND SIMULATIONS 
We will present herein two different experiments using the architecture defined in this 
paper. The input space in all experiments is two-dimensional in order to show graphically 
the results and effects of the presented model. The experiments aim at learning noisy non- 
linear polynomial and rational curves. Figure 2.a and 2.b plot the input and output space of 
the second experiment after training is finished, respectively. In this case the noisy logistic 
map was used to generate the input: 
$$x_2 = 4 x_1 (1 - x_1) + \nu \tag{4.1}$$

where $\nu$ introduces 1% Gaussian noise. In this case a one-layer polynomial
network with R = 2 was used. The learning constant was $\eta$ = 0.01 and 20000
iterations of training
were performed. The result of Fig. 2.b is remarkable. The volume-conserving network 
decorrelated the output space extracting the strong nonlinear correlation that generated the 
curve in the input space. This means that after training only one coordinate is important to 
describe the curve. 
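That a one-layer network of order R = 2 is able to flatten this curve can be seen
from a hand-picked parameter setting. The sketch below does not reproduce the
trained network: the coefficients `omega` are chosen by inspection, the first
coordinate is assumed uniform on [0, 1], and the 1% noise is read as a standard
deviation of 0.01.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
x1 = rng.uniform(0.0, 1.0, n)              # assumed input distribution
nu = 0.01 * rng.normal(size=n)             # assumed noise level
x2 = 4.0 * x1 * (1.0 - x1) + nu            # noisy logistic map, eq. (4.1)

# One layer in the form of eq. (3.1): y_1 = x_1, y_2 = x_2 + f_2(x_1),
# with f_2 a polynomial of order R = 2 and coefficients (omega_0, omega_1, omega_2).
omega = np.array([0.0, -4.0, 4.0])         # hand-picked, not learned
y1 = x1
y2 = x2 + np.polyval(omega[::-1], x1)      # polyval expects the highest power first

def quadratic_dependence(a, b):
    """Correlation between the squared, centered a and b: a simple nonlinear probe."""
    return float(np.corrcoef((a - a.mean()) ** 2, b)[0, 1])

dep_in = quadratic_dependence(x1, x2)
dep_out = quadratic_dependence(y1, y2)
print(f"input:  dependence {dep_in:+.3f}, std of x2 {x2.std():.3f}")
print(f"output: dependence {dep_out:+.3f}, std of y2 {y2.std():.3f}")
```

With these coefficients the second output contains only the noise term, so the
curve is described by the first coordinate alone, and the map is still
volume-conserving because it has the triangular form with unit diagonal.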
Figure 2: Input and output space distribution after training with a one-layer
polynomial volume-conserving network of order R = 2 for the logistic map.
(a) input space; (b) output space.
The whole information was compressed into the first coordinate of the output. This is the 
generalization of data compression normally performed by using linear PCA (also called 
Karhunen-Loewe transformation). The next experiment is similar, but in this case a two- 
layer network of order R = 4 was used. The input space is given by the rational function 
$$x_2 = 0.2\, x_1 + \frac{x_1^3}{(1 + \ldots)} + \nu \tag{4.2}$$
where x 1 and ) are as in the last case. The results are shown in Fig. 4.a (input space) and 
Fig. 4.b (output space). Fig. 4.c shows the evolution of the four summands of eq (2.18) 
during lem'ning. It is important to remark that at the beginning the tensors of second and 
third order are equally important. During learning all summands are simultaneously mini- 
mized, resulting in a statistically decorrelated output. The training was performed during 
20000 iterations and the learning constant was rl = 0.005 
Figure 4: Input and output space distribution after training with a two-layer
polynomial volume-conserving network of order R = 4 for the noisy curve of eq.
(4.2). (a) input space; (b) output space; (c) development of the four summands
of the cost function (eq. 2.18) during learning: (cost2) first summand (second
order correlation tensor); (cost3) second summand (third order correlation
tensor); (cost4a) third summand (fourth order correlation tensor); (cost4b)
fourth summand (fourth order correlation tensor).
5 CONCLUSIONS 
We proposed an unsupervised neural paradigm which is based on information
theory. The algorithm performs redundancy reduction among the elements of the
output layer without losing information as the data is sent through the network.
The model developed performs a generalization of Barlow's unsupervised learning,
which consists in nonlinear decorrelation up to higher orders of the cumulant
tensors. After training, the components of the output layer are statistically
independent. Due to the use of a higher order cumulant expansion, arbitrary
non-Gaussian distributions can be rigorously handled. When nonlinear units are
used, nonlinear principal component analysis is obtained. In this case nonlinear
manifolds can be reduced to minimum dimension manifolds. When linear units are
used, the network performs a generalized principal component analysis in the
sense that non-Gaussian distributions can be linearly decorrelated. This paper
generalizes previous works on factorial learning in two ways: the architecture
performs a general nonlinear transformation without loss of information, and the
decorrelation is performed without assuming Gaussian distributions.
References: 
H. Barlow. (1989) Unsupervised Learning. Neural Computation, 1, 295-311.
A. Papoulis. (1991) Probability, Random Variables, and Stochastic Processes. 3rd Edition,
McGraw-Hill, New York.
A. N. Redlich. (1993) Supervised Factorial Learning. Neural Computation, 5, 750-766.
