Self-Organizing Rules for Robust 
Principal Component Analysis 
Lei Xu,2*and Alan Yuille  
1. Division of Applied Sciences, Harvard University, Cambridge, MA 02138 
2. Dept. of Mathematics, Peking University, Beijing, P.R.China 
Abstract 
In the presence of outliers, the existing self-organizing rules for 
Principal Component Analysis (PCA) perform poorly. Using sta- 
tistical physics techniques including the Gibbs distribution, binary 
decision fields and effective energies, we propose self-organizing 
PCA rules which are capable of resisting outliers while fulfilling 
various PCA-related tasks such as obtaining the first principal com- 
ponent vector, the first k principal component vectors, and directly 
finding the subspace spanned by the first k vector principal com- 
ponent vectors without solving for each vector individually. Com- 
parative experiments have shown that the proposed robust rules 
improve the performances of the existing PCA algorithms signifi- 
cantly when outliers are present. 
1 INTRODUCTION 
Principal Component Analysis (PCA) is an essential technique for data compression 
and feature extraction, and has been widely used in statistical data analysis, com- 
munication theory, pattern recognition and image processing. In the neural network 
literature, a lot of studies have been made on learning rules for implementing PCA 
or on networks closely related to PCA (see Xu & Yuille, 1993 for a detailed reference 
list which contains more than 30 papers related to these issues). The existing rules 
can fulfil various PCA-type tasks for a number of application purposes. 
*Present address: Dept. of Brain and Cognitive Sciences, E10-243, Massachusetts 
Institute of Technology, Cambridge, MA 02139. 
467 
468 Xu and Yuille 
However, almost all the previously mentioned PCA algorithms are based on the 
assumption that the data has not been spoiled by outliers (except Xu, Oja&Suen 
1992, where outliers can be resisted to some extent.). In practice, real data often 
contains some outliers and usually they are not easy to separate from the data set. 
As shown by the experiments described in this paper, these outliers will significantly 
worsen the performances of the existing PCA learning algorithms. Currently, little 
attention has been paid to this problem in the neural network literature, although 
the problem is very important for real applications. 
Recently, there have been some success in applying the statistical physics approach 
to a variety of computer vision problems (Yuille, 1990; Yuille, Yang&Geiger 1990; 
Yuille, Geiger&Bulthoff, 1991). In particular, it has also been shown that some 
techniques developed in robust statistics (e.g., redescending M-estimators, least- 
trimmed squares estimators) appear naturally within the Bayesian formulation by 
the use of the statistical physics approach. In this paper we adapt this approach 
to tackle the problem of robust PCA. Robust rules are proposed for various PCA- 
related tasks such as obtaining the first principal component vector, the first k 
principal component vectors, and principal subspaces. Comparative experiments 
have been made and the results show that our robust rules improve the performances 
of the existing PCA algorithms significantly when outliers are present. 
2 PCA LEARNING AND ENERGY MINIMIZATION 
There exist a number of self-organizing rules for finding the first principal compo- 
nent. Three of them are listed as follows (Oja 1982, 85; Xu, 1991, 93): 
(t + 1) = ffi(t) + aa(t)(Zy - r(t)y2), (1) 
_2, 
(t + 1) = r(t) + aa(t)(Yy - r(t)(t) y J' (2) 
r(t + 1) = (t) + rta(t)[y(- if) + (y - yt)x-q. (3) 
where y = (t) T, ff = yr(t), yt = ffi(t)T,7 and aa(t) _> 0 is the learning rate which 
decreases to zero as t  c while satisfying certain conditions, e.g., 
c, -, aa(t) q < c for some q > 1. 
Each of the three rules will converge to the principal component vector  almost 
surely under some mild conditions which are studied in detail in by Oja (1982&85) 
and Xu (1991&93). Regarding ffi as the weight vector of a linear neuron with output 
y -- r*, all the three rules can be considered as modifications of the well known 
Hebbian rule r(t + 1) = ff(t) + aa(t)y through introducing additional terms for 
preventing Ilr(t)ll from going to 
The performances of these rules deteriorate considerably when data contains out- 
liers. Although some outlier-resisting versions of eq.(1) and eq.(2) have also been 
recently proposed (Xu, Oja &Suen, 1992), they work well only for data which is not 
severely spoiled by outliers. In this paper, we adopt a totally different approach--we 
generalize eq.(1),eq.(2) and eq.(3) into more robust versions by using the statistical 
physics approach. 
To do so, first we need to connect these rules to energy functions. It follows from Xu 
(1991&93) and Xu & Yuille(1993) that the rules eq.(2) and eq.(3) are respectively 
Self-Organizing Rules for Robust Principal Component Analysis 469 
on-line gradient descent rules for minimizing Jl(r), J2(r) respectivelyl: 
N 
i:1 
It has also been provel that the rule given by eq.(1) satisfies (Xu, 1991, 93): 
(a) fzfza _> O,E(ft)T(fz) _> O, with fz = y- ry , fza = ay-y2; (b) 
E(fzl)TE(ft3) > 0, with ft3 = y(-g)+(y-y'); (c) Both J1 and Ja have only one 
local (also global) minimum tr(Z)- rZ(, and all the other critical points (i.e., 
the points satisfy 0s,(,a)-O,i= 1,2)are saddle points. Here Z = E{t}, and  
is the eigenvector of E corresponding to the largest eigenvalue. 
That is, the rule eq.(1) is a downhill algorithm for minimizing Ja in both the on 
line sense and the average sense, and for minimizing Ja in the average sense. 
3 GENERALIZED ENERGY AND ROBUST PCA 
We further regard Jx(), J2(r) as special cases of the following general energy: 
N 
1 
() =  z(i,), z(i,) > 0. (6) 
._. 
where z(i, ) is the portion of energy contributed by the sample i, and 
T, 
= ) 
,11 & (7) 
Following (Yuille, 1990 a& b), we now generalize energy eq.(6) into 
= + (8) 
where 1 = {,i = 1,...,N} is a binary field {} with each  being a random 
variable taking value either 0 or 1.  acts as a decision indicator for deciding 
whether i is an outlier or a sample. When  = 1, the portion of energy contributed 
by the sample i is taken into consideration; otherwise, it is equivalent to discarding 
  an outlier. Epo() is the a priori portion of energy contributed by the a 
priori distribution of {  }. A natural choice is 
N 
Epio() = .(1 - ) (9) 
=1 
This choice of priori has a natural interpretation: for fixed  it is energetically 
favourable to set  = 1 (i.e., not regarding  as an outlier) if z(, ) <  (i.e., 
We have J() > 0, since Ta?--  = i111 = sin  0,n >_ 0. 
470 Xu and Yuille 
the portion of energy contributed by Zi is smaller than a prespecified threshold) 
and to set it to 0 otherwise. 
Based on E(17, ), we define a Gibbs distribution (Parisi 1988): 
(10) 
where Z is the partition function which ensures y. 'ra P[I, if,] = 1. 
1 
compute 
Then we 
1 1 
= H Y e-{V'z(")+nO-v')}= e -z*--(r) (11) 
i v,={0,} Z 
Z = Ze " E,jj() = 1 1og{1 + e-{z(2")-"}. (12) 
i 
E,j is called the effective energy. Each term in the sum for E, is approximately 
z(i, ) for small values of z but becomes constant as z(i, )  . In this way 
outliens, which are more likely to yield large values of z(i, ), are treated differently 
from samples, and thus the estimation  obtained by minimizing Eejj () will be 
robust and able to resist outliens. 
EeL () is usually not a convex function and may have many local minima. The 
statistical physics framework suggests using deterministic annealing to minimize 
Eelf (nil). That is, by the following gradient descent rule eq.(13), to minimize 
Eefj(r) for small fi and then track the minimum as fi increases to infinity (the 
zero temperature limit): 
1 
rfi(t + 1) = r(t) - ab(t) y. 1 + e(z(,,'(t)) -" 
i 
04 
(13) 
More specifically, with z's chosen to correspond to the energies Jx and J respec- 
tively, we have the following batch-way learning rules for robust PCA: 
1 (t) 
r(t + l) = r(t) + ab(t) Y l + e(Z(.,.(O)_.) (iyi - r(t).r(t) y) , (14) 
i 
1  
(t + 1) = r(t) + a(t) E 1 + e(z(e,,m(t)) -0) [yi(i - gi) + (Yi - Yi)i]. (15) 
i 
For data that comes incrementally or in the ondine way, we correspondingly have 
the following adaptive or stochastic approximation versions 
1 r(t) 
r(t + 1) = ff(t) + a,(t) 1 + e((,,ra(t))-") (iYi - r(t)tr(t ) 
ff(t + 1) = r(t) + ra(t) 
yT), (16) 
1 
1 (17) 
Self-Organizing Rules for Robust Principal Component Analysis 471 
It can be observed that the difference between eq.(2) and eq.(16) or eq.(3) and 
eq.(17) is that the learning rate eta(t) has been modified by a multiplicative factor 
1 
am(t) = 1 + e(Z(,,'(0)-n) ' (18) 
which adaptively modifies the learning rate to suit the current input i. This 
modifying factor has a similar function as that used in Xu, Oja&Suen(1992) for 
robust line fitting. But the modifying factor eq.(18) is more sophisticated and 
performs better. 
Based on the connecticn between the rule eq.(1) and J or J2, given in see.2, we 
can also formally use t,ae modifying factor et,(t) to turn the rule eq.(1) into the 
following robust version: 
1 
+ = + 1 + - 
4 ROBUST RULES FOR k PRINCIPAL COMPONENTS 
In a similar way to SGA (Oja, 1992) and GHA (Sanger, 1989) we can generalize the 
robust rules eq.(19), eq.(16) and eq.(17) into the following general form of robust 
rules for finding the first k principal components: 
1 
rj(t -]- 1) = rj(t) -l- et.(t) l + e(4,(j),.%(O)_n Arj(xi(j), rj(t)), (20) 
j--1 
i(O) = i, gi(j + 1) = gi(j) - EY,(r)rfi(t), Yi(j) = (t)i(j), (21) 
where Arfij(i(3) fij(t)), z(xi(3), fftj(t)) have four possibilities (Xu & Yuille, 1993). 
As an example, one of them is given here 
Arfij('i(j), rfij(t)) = ('i(j)yi(j) -- rfij(t)yi(j)2), 
3, ' T Yi(J)  
z(i(J), rfij(t)) = '(3) 'i(j) - fij(t)Tfij(t ) . 
In this case, eq.(20) can be regarded as the generalization of GHA (Sanger, 1989). 
We can also develop an alternative set of rules for a type of nets with asymmetric 
lateral weights as used in (Rubner&Schulten, 1990). The rules can also get the first 
k principal components robustly in the presence of outliers (Xu & Yuille, 1993). 
5 ROBUST RULES FOR PRINCIPAL SUBSPACE 
Let M --[rfi,-..,rfit], --[,' ", t], if--[Y,'",Yt] T and if---- MT,it follows 
from Oja(1989) and Xu(1991) the rules eq.(1), eq.(3) can be generalized into eq.(22) 
and eq.(23) respectively: 
Ar(t -F 1)- Ar(t)-F etA(t)(7 a"- rM(t)) (22) 
472 Xu and Yuille 
A(t-4-1)--A(t)-4-CA(t)(y(--)T--(7--)T), ff----M7, '--M"ff (23) 
In the case without outliers, by both the rules, the weight matrix M(t) will converge 
to a matrix M * whose column vectors rn,j = 1,...,k span the k-dimensional 
principal subspace (Oja, 1989; Xu, 1991&93), although the vectors are, in general, 
not equal to the k principal component vectors j, j = 1,.--, k. 
Similar to the previously used procedure, we have the following results: 
(1). We can show that eq.(23) is an on-line or stochastic approximation rule which 
minimizes the energy J3 in the gradient descent way (Xu, 1991& 93)' 
N 
Ja( )=Ylli-ffill 2, ff=M7, '=MW*2. (24) 
i=1 
and that in the average sense the subspace rule eq.(22) is also an on-line "down-hill" 
rule for minimizing the energy function Ja. 
(2). We can also generalize the non-robust rules eq.(22) and eq.(23) into robust 
versions by using the statistical physics approach again: 
1 -, T 
_ _ _ yl)xi ] 
J(t -J- 1) = Jf(t) -J- CtA(t ) 1 + e(ll,-a,ll-") [(i   
1 
Jl(t + 1) = Jib(t) + aA(t) 1 + eP(ll,-a'11 -') [x-/-- y M(t)] (26) 
6 EXAMPLES OF EXPERIMENTAL RESULTS 
Let g from a population of 400 samples with zero mean. These samples are located 
on an elliptic ring centered at the origin of R a, with its largest elliptic axis being 
along the direction (-1, 1, 0), the plane of its other two axes intersecting the x - y 
plane with an acute angle (30). Among the 400 samples, 10 points (only 2.5%) are 
randomly chosen and replaced by outliers. The obtained data set is shown in Fig.1. 
Before the outliers were introduced, either the conventional simple-variance-matrix 
 'N=i ix) or the unrobust rules 
based approach (i.e., solving S  A, S = W 
eqs.(1)(2)(3) can find the correct 1st principal component vector of this data set. 
On the data set contaminated by outliers , shown in Fig.l, the result of the simple- 
variance-matrix based approach has an angular error of p by 71.04--a result 
definitely unacceptable. The results of using the proposed robust rules eq.(19), 
eq.(16) and eq.(17) are shown in Fig.2(a) in comparison with those of their unrobust 
counterparts-- the rules eq.(1), eq.(2) and eq.(3). We observe that all the unrobust 
rules get the solutions with errors of more than 21  from the correct direction of 
p. By contrast, the robust rules can still maintain a very good accuracy--the 
error is about 0.36 . Fig.2(b) gives the results of solving for the first two principal 
component vectors. Again, the unrobust rule produce large errors of around 23 , 
while the robust rules have an error of about 1.7 . Fig.3 shows the results of 
sol,ing for the 2-dimensional principal subspace, it is easy to see the significant 
improvements obtained by using the robust rules. 
Self-Organizing Rules for Robust Principal Component Analysis 473 
Figure 1: The projections of the data on the x - y, y - z and z - x planes, with 10 
outliers. 
Acknowledgement s 
We would like to thank DARPA and the Air Force for support with contracts 
AFOSR-89-0506 and F4969092-J-0466. 
We like to menta ion that some further issues about the proposed robust rules are studied 
in Xu & Yuille (1993), including the selection of parameters c, /3 and 7, the extension of 
the rules for robust Minor Component Analysis (MCA), the relations between the rules 
to the two man types of existing robust PCA aJgorithms in the literature of statistics, as 
well as to MaximM Likelihood (ML) estimation of finite mixture distributions. 
References 
E. Oja, J. Math. Bio. 16, 1982, 267-273. 
E. Oja & J. Karhunen, J. Math. Anal. Appl. 106, 1985, 69-84. 
E. Oja, Int. J. Neural Systems 1, 1989, 61-68. 
E. Oja, Neural Networks 5, 1992, 927-935. 
G. Parisi, Statistical Field Theory, Addison-Wesley, Reading, Mass., 1988. 
J. Rubner & K. Schulten, Biological Cybernetics, 62, 1990, 193-199. 
T.D. Sanger, Neural Networks, 2, 1989, 459-473. 
L. Xu, Proc. of IJCNN'91-Singapore, Nov., 1991, 2368-2373. 
L. Xu, Least mean square error reconstruction for self-organizing neural-nets, Neural 
Networks 6, 1993, in press. 
L. Xu, E. Oja & C.Y. Suen, Neural Networks 5, 1992, 441-457. 
L. Xu & A.L. Yuille, Robust principal component analysis by self-organizing rules 
based on statistical physics approach, IEEE Trans. Neural Networks, 1993, in press. 
A.L. Yuille, Neural computation 2, 1990, 1-24. 
A.L. Yuille, D. Geiger and H.H. Bulthoff, Networks 2, 1991. 423-442. 
474 Xu and Yuille 
le 
1o 
(a) (b) 
Figure 2: The learning curves obtained in the comparative experiments for principal 
component vectors. (a) for the first principal component vector, RA1, RA2, RA3 de- 
note the robust rules eq.(19), eq.(16) and eq.(17) respectively, and UA1, UA2, UA3 
denote the rules eq.(1), eq.(2) and eq.(3) respectively. The horizontal axis denotes 
the learning steps, and the vertical axis is 0r(0p  with 0,g denoting the acute 
angle between  and . (b) for the first two principal component vectors, by the 
robust rule eq.(20) and its unrobust counterpart GHA. UAkl, UAk2 denote the 
learning curves of angles 0(t)p and 0m(t)$ respectively, obtained by GHA . 
RAkl, RAk2 denote the learning curves of the angles obtained by using the robust 
rule eq.(20). In both (a) &: (b), pJ,J = 1,2 is the correct 1st and 2nd principal 
component vector respectively. 
0 lffi0  1  0 0 7ffi0 8 
0 
Figure 3: The learning curves obtained in the comparative experiments for for solv- 
ing the 2-dimensional principal subspace. Each learning curve expresses the change 
of the residual er(t) y-].= iirj(t)_ 2 
= r=l (rj(t) Tpr)prl[ 2 with learning steps. 
The smaller the residual, the closer the estimated principal subspace to the correct 
one. SUB1, SUB2 denote the untobust rules eq.(22) and eq.(23) respectively, and 
RSUB1, RSUB2 denote the robust rules eq.(26) and eq.(25) respectively. 
