The Efficiency and The Robustness of 
Natural Gradient Descent Learning Rule 
Howard Hua Yang 
Department of Computer Science 
Oregon Graduate Institute 
PO Box 91000, Portland, OR 97291, USA 
hyang@cse.ogi.edu 
Shun-ichi Amari 
Lab. for Information Synthesis 
RIKEN Brain Science Institute 
Wako-shi, Saitama 351-01, JAPAN 
amari@zoo.brain.riken.go.jp 
Abstract 
The inverse of the Fisher information matrix is used in the natu- 
ral gradient descent algorithm to train single-layer and multi-layer 
perceptrons. We have discovered a new scheme to represent the 
Fisher information matrix of a stochastic multi-layer perceptron. 
Based on this scheme, we have designed an algorithm to compute 
the natural gradient. When the input dimension n is much larger 
than the number of hidden neurons, the complexity of this algo- 
rithm is of order O(n). It is confirmed by simulations that the 
natural gradient descent learning rule is not only efficient but also 
robust. 
i INTRODUCTION 
The inverse of the Fisher information matrix is required to find the Cramer-Rao 
lower bound to analyze the performance of an unbiased estimator. It is also needed 
in the natural gradient learning framework (Amari, 1997) to design statistically 
efficient algorithms for estimating parameters in general and for training neural 
networks in particular. In this paper, we assume a stochastic model for multi- 
layer perceptrons. Considering a Riemannian parameter space in which the Fisher 
information matrix is a metric tensor, we apply the natural gradient learning rule to 
train single-layer and multi-layer perceptrons. The main difficulty encountered is to 
compute the inverse of the Fisher information matrix of large dimensions when the 
input dimension is high. By exploring the structure of the Fisher information matrix 
and its inverse, we design a fast algorithm with lower complexity to implement the 
natural gradient learning algorithm. 
386 H. H. Yang and S. Arnari 
2 A STOCHASTIC MULTI-LAYER PERCEPTRON 
Assume the following model of a stochastic multi-layer perceptron: 
m 
z -- E ai9(wiTac + bi) +  (1) 
i=l 
where (.)T denotes the transpose,  ~ N(0, a 2) is a Gaussian random variable, and 
9(x) is a differentiable output function for hidden neurons. Assume the multi-layer 
network has a n-dimensional input, m hidden neurons, a one dimensional output, 
and rn < n. Denote a = (a,..., a,) T the weight vector of the output neuron, wi = 
(wi,'" ,w,i) T the weight vector of the i-th hidden neuron, and b = (b,..., b,) T 
the vector of thresholds for the hidden neurons. Let W = [wx,...,w,] be a 
matrix formed by column weight vectors wi, then (1) can be rewritten as z = 
aT9(WT:r + b) + 6. Here, the scalar function 9 operates on each component of the 
vector wTx + b. 
The joint probability density function (pdf) of the input and the output is 
p(x, z; W,a,b) - p(zl:r; W,a,b)p(x). 
Define a loss function: 
L(x, z; 0) = - logp(x, z; 0) =/(zlx; 0) - logp(x) 
where 0 = (wT,  T bT)T 
 ., Win, a T, includes all the parameters to be estimated and 
1 
l(zl; o) -- - logp(zl:r; O ) -- a, e (z - aTp(WT:r + b)) e. 
Since OL _ Ol 
o- - o-' the Fisher information matrix is defined by 
OL OL T Ol Ol T 
(0) = E[() ]= E[() ] 
(2) 
The inverse of (7(0) is often used in the Cramer-Rao inequality: 
O'lle I o '] _> 
where 0 is an unbiased estimator of a true parameter 0'. 
For the on-line estimator Ot based on the independent examples {(xs,zs),s - 
1,..., t} drawn from the probability law p(x, z; 0'), the Cramer-Rao inequality 
for the on-line estimator is 
O'lle I o '] _> (:3) 
3 NATURAL GRADIENT LEARNING 
Consider a parameter space 0 = {0} in which the divergence between two points 
O1 and 02 is given by the Kullback-Leibler divergence 
D(O;, 02) = KL[p(x, z; 0;)liP(x, z; 02)]. 
When the two points are infinitesimally close, we have the quadratic form 
D(O,O + dO) = dOT(7(O)dO. (4) 
The Efficiency and the Robustness of Natural Gradient Descent Learning Rule 387 
This is regarded as the square of the length of dO. Since (7(0) depends on 0, the 
parameter space is regarded as a Riemannian space in which the local distance is 
defined by (4). Here, the Fisher information matrix G(0) plays the role of the 
Riemannian metric tensor. 
It is shown by Amari(1997) that the steepest descent direction of a loss function 
C(O) in the Riemannian space O is 
= (o)vc(o). 
The natural gradient descent method is to decrease the loss function by updating 
the parameter vector along this direction. By multiplying (7-x(0), the covariant 
gradient VC(0) is converted into its contravariant form (7-(O)VC(O) which is 
consistent with the contravariant differential form dC(O). 
Instead of using/(z[x; 0) we use the following loss function: 
1 
lx(zlx;O) = (z - arqo(Wrx + b)) 2. 
We have proved in [5] that G(O) = A(O) where A(O) does not depend on the 
unknown a. So G-x(0) o  = A -x (0) /. The on-line learning algorithms based on 
the gradient  and the natural gradient A-x(O)o are, respectively, 
Ot+ = Ot tt Olx 
t O0 (ztlt; Or), 
(5) 
(6) 
' l 
tt A_lfO  0 l, , 
Ot+l -- Or-- - [ 
where tt and/ are learning rates. 
When the negative log-likelihood function is chosen as the loss function, the natural 
gradient descent algorithm (6) gives a Fisher efficient on-line estimator (Amari, 
1997), i.e., the asymptotic variance of Ot driven by (6) satisfies 
E[(O - - 0') r I 0'1 (7) 
t 
which gives the mean square error 
(8) 
E[llo,- o*ll I o '1 
The main difficulty in implementing the natural gradient descent algorithm (6) is 
to compute the natural gradient on-line. To overcome this difficulty, we studied the 
structure of the matrix A(O) in [5] and proposed an efficient scheme to represent 
this matrix. Here, we briefly describe this scheme. 
Let A(O) = [Aij](m+2)x(m+2) be a partition of A(O) corresponding to the par- 
tition of 0 = (w[,. T bT)T 
 ',Wrn,a T, . Denote Ui -- Wi/[[Wi[[,i -- 1,...,m, 
U = [ux,'" ,um] and Ivy,.. ',vm] = Ux(UU) -. It has been proved in [5] that 
those blocks in A(O) are divided into three classes: x = {Aij,i,j = 1,...,m}, 
C2 {Ai,m+l, T T 
-- Am+l,i,Ai,m+2, Am+2,i,i - 1,... ,m} and Ca = {Am+i,m+j,i,j -- 
1, 2}. Each block in Cx is a linear combination of matrices uvt r, k, l = 1,..., m, 
and flo = I - Y']__ uv. Each block in C2 is a matrix whose column is a lin- 
ear combination of {v, k = 1,... ,m.}. The coefficients in these combinations are 
integrals with respect to the multivariate Gaussian distribution N(0, R) where 
388 H. H. Yang and $. Arnari 
R 1 -- UTU 1 is rn x rn. Each block in Cs is an rn x rn matrix whose entries are also 
integrals with respect to N(0, R). Detail expressions for these integrals are given 
in [5]. When qo(x) = erf(2) , using the techniques in (Saad and Solla, 1995), we 
can find the analytic expressions for most of these integrals. 
The dimension of A(8) is (nm q- 2rn) x (nm q- 2rn). When the input dimension n 
is much larger than the number of hidden neurons, by using the above scheme, the 
space for storing this large matrix is reduced from O(n:) to O(n). We also gave 
a fast algorithm in [5] to compute A -1(0) and the natural gradient with the time 
complexity O(n ) and O(n) respectively. The trick is to make use of the structure 
of the matrix A-X(8). 
4 SIMULATION 
In this section, we give some simulation results to demonstrate that the natural 
gradient descent algorithm is efficient and robust. 
4.1 Single-layer perceptron 
Assume 7-dimensional inputs xt ~ N(0, I) and qo(u) = -e-" For the single-layer 
l_}_e-   
perceptton, z = o(w'x), the on-line gradient descent (GD) and the natural GD 
algorithms are respectively 
w,+ = w, + Izo(t)(z, - p(wxt))p'(wx,)x, and (9) 
where 
1 1 1 ww T 
d(w) ) w=llwll, (11) 
d 1 (W) -"  oo 
1/: 
a(w) = 
2 
> 0, 
m2 
> o, 
and po(t) and /l(t) are two learning rate schedules defined by 
p(li,ci,ri;t),i = O, 1. Here, 
p(r/, c, r;t)= ,(1 + -c t)/(1 + c_t + t_2). 
fir fir r 
(12) 
(13) 
i(t) = 
(14) 
is the search-then-converge schedule proposed by (Darken and Moody, 1992). Note 
that t < r is a "search phase" and t > r is a "converge phase". When ri - 1, the 
learning rate function pi (t) has no search phase but a weaker converge phase when 
qi is small. When t is large, i(t) decreases as  
t' 
Randomly choose a 7-dimensional vector as w* for the teacher network: 
w* - [-1.1043, 0.4302, 1.1978, 1.5317, -2.2946, -0.7866, 0.4428] T. 
Choose q0 - 1.25, x = 0.05, co - 8.75, c - 1, and ro- rl = 1. These parameters 
are selected by trial and error to optimize the performance of the GD and the 
natural GD methods at the noise level er - 0.2. The training examples 
are generated by zt = (w*Txt) q- t where t ~ N(0, a 2) and a 2 is unknown to the 
algorithms. 
389 
The Efficiency and the RobUstness of Natural Gradient Descent Learning Rule 
Let wt and t be the weight vectors driven by the equations (9) and (10) respec- 
tivelY. I wt - w*// and /t - w*//axe eo functions for the GD and the natural 
, we obtain the Cramer-R Lower 
GD. +,e equation (11). :,+ vector: 
Denote RLB for the aev,a,--- dnx 
Bound 1 1 
 + * 
6 
o   l l  
00  1 150 itemOn 
kue [: domCe o[ the  d the utu 
1o' 
/ natural GD 
GD 
,. Ci:ll.,B 
' so mo ,   t, different 
 Fgue 
It is soWn 
 se levels yhie _ 
0.2. The 
390 H. H. Yang and S. Amari 
Figure 2: Performance of the GD and the natural GD when r/o - 
1.25, 1.75, 2.25, 2.75, rh = 0.05, 0.2, 0.4425, 0.443, and co - 8.75 and c] - I are 
fixed. 
the training examples is clearly shown by Figure 1. When the teacher signal is 
non-stationary, our simulations show that the natural GD algorithm also reaches 
the CRLB. 
Figure 2 shows that the natural GD algorithm is more robust than the GD algo- 
rithm against the change of the learning rate schedule. The performance of the GD 
algorithm deteriorates when the constant r/o in the learning rate schedule tt0(t) is 
different from that optimal one. On the contrary, the natural GD algorithm per- 
forms almost the same for all rh within a interval [0.05, 0.4425]. Figure 2 also shows 
that the natural GD algorithm breaks down when rh is larger than the critical num- 
ber 0.443. This means that the weak converge phase in the learning rate schedule 
is necessary. 
4.2 Multi-layer perceptron 
Let us consider the simple multi-layer perceptron with 2-dimensional input and 2- 
hidden neurons. The problem is to train the committee machine y = q(wrx) + 
9(w2rx) based on the examply, s: {(xt,zt),t  1,... ,T} generated by the stochastic 
committee machine zt = 9(wTxt) + 9(w2Txt) + it. Assume I]wl] = 1. We can 
reparameterize the weight vector to decrease the dimension of the parameter space 
from 4 to 2: 
wi-- sin(a/) ' wi = sin(a?) , i--1,2. 
10 0 
104 
10 4 
10' 
.... GD 
-- natural GD 
CRIB 
v/k ,9 t,',',.,)'  , .?',.,. , , , . 
100 200 300 400 500 600 
iteration 
Figure 3: The GD vs. the natural GD 
The parameter space is {8 = (a], a2)}. Assume that the true parameters are a = 0 
and c = . Due to the symmetry, both 0 - (0, ) and 0 -- (,0) are true 
parameters. Let 0t and 0t be computed by the GD algorithm and the natural GD 
The Efficiency and the Robustness of Natural Gradient Descent Learning Rule 391 
algorithm respectively. The errors are measured by 
et = min{[[Ot - 0[[, [[Ot - 0[[}, and 
In this simulation, using 0o = (0.1,0.2) as an initial estimate, we first start the 
GD algorithm and run it for 80 iterations. Then, we use the estimate obtained 
from the GD algorithm at the 80-th iteration as an initial estimate for the natural 
GD algorithm and run the latter algorithm for 420 iterations. The noise level is 
cr -- 0.05. N independent runs are conducted to obtain the errors et(j) and el(j), 
j = 1,..-, N. Define root mean square errors 
I and 
j=l 
 N 
i ' 
 
j----1 
Based on N = 10 independent runs, the errors t and e't are computed and com- 
pared with the CRLB in Figure 3. The search-then-converge learning schedule (14) 
is used in the GD algorithm while the learning rate for the natural GD algorithm 
is simply the annealing rate 
5 CONCLUSIONS 
The natural gradient descent learning rule is statistically efficient. It can be used 
to train any adaptive system. But the complexity of this learning rule depends on 
the architecture of the learning machine. The main difficulty in implementing this 
learning rule is to compute the inverse of the Fisher information matrix of large 
dimensions. For a multi-layer perceptron, we have shown an efficient scheme to 
represent the Fisher information matrix based on which the space for storing this 
large matrix is reduced from O(n 2) to O(n). We have also shown an algorithm 
to compute the natural gradient. Taking advantage of the structure of the inverse 
of the Fisher information matrix, we found that the complexity of computing the 
natural gradient is O(n) when the input dimension n is much larger than the number 
of hidden neurons. 
The simulation results have confirmed the fast convergence and statistical efficiency 
of the natural gradient descent learning rule. They have also verified that this 
learning rule is robust against the changes of the noise levels in the training examples 
and the parameters in the learning rate schedules. 
References 
[1] S. Amari. Natural gradient works efficiently in learning. Accepted by Neural 
Computation, 1997. 
[2] S. Amari. Neural learning in structured parameter spaces - natural Riemannian 
gradient. In Advances in Neural Information Processing Systems, 9, ed. M. C. 
Mozer, M. I. Jordan and T. Petsche, The MIT Press: Cambridge, MA., pages 
127-133, 1997. 
[3] C. Darken and J. Moody. Towards faster stochastic gradient search. In Ad- 
vances in Neural Information Processing Systems, 3, eds. Moody, Hanson, and 
Lippmann, Morgan Kaufmann, San Mateo, pages 1009-1016, 1992. 
[4] D. Saad and S. A. Solla. On-line learning in soft committee machines. Physical 
Review E, 52:4225-4243, 1995. 
[5] H. H. Yang and S. Amari. Natural gradient descent for training multi-layer 
perceptrons. Submitted to IEEE Tr. on Neural Networks, 1997. 
