Neural Learning in Structured 
Parameter Spaces 
Natural Riernannian Gradient 
Shun-ichi Amari 
RIKEN Frontier Research Program, RIKEN, 
Hirosawa 2-1, Wako-shi 351-01, Japan 
amari@zoo.riken.go.jp 
Abstract 
The parameter space of neural networks has a Riemannian met- 
ric structure. The natural Riemannian gradient should be used 
instead of the conventional gradient, since the former denotes the 
true steepest descent direction of a loss function in the Riemannian 
space. The behavior of the stochastic gradient learning algorithm 
is much more effective if the natural gradient is used. The present 
paper studies the information-geometrical structure of perceptrons 
and other networks, and prove that the on-line learning method 
based on the natural gradient is asymptotically as efficient as the 
optimal batch algorithm. Adaptive modification of the learning 
constant is proposed and analyzed in terms of the Riemannian mea- 
sure and is shown to be efficient. The natural gradient is finally 
applied to blind separation of mixtured independent signal sources. 
I Introduction 
Neural learning takes place in the parameter space of modifiable synaptic weights 
of a neural network. The role of each parameter is different in the neural network 
so that the parameter space is structured in this sense. The Riemannian structure 
which represents a local distance measure is introduced in the parameter space by 
information geometry (Amari, 1985). 
On-line learning is mostly based on the stochastic gradient descent method, where 
the current weight vector is modified in the gradient direction of a loss function. 
However, the ordinary gradient does not represent the steepest direction of a loss 
function in the Riemannian space. A geometrical modification is necessary, and it is 
called the natural Riemannian gradient. The present paper studies the remarkable 
effects of using the natural Riemannian gradient in neural learning. 
128 $. Amari 
We first studies the asymptotic behavior of on-line learning (Opper, NIPS'95 Work- 
shop). Batch learning uses all the examples at any time to obtain the optimal weight 
vector, whereas on-line learning uses an example once when it is observed. Hence, 
in general, the target weight vector is estimated more accurately in the case of batch 
learning. However, we prove that, when the Riemannian gradient is used, on-line 
learning is asymptotically as efficient as optimal batch learning. 
On-line learning is useful when the target vector fluctuates slowly (Am. ari, 1967). 
In this case, we need to modify a learning constant r h depending on how far the 
current weight vector is located from the target function. We show an algorithm 
adaptive changes in the learning constant based on the Riemannian criterion and 
prove that it gives asymptotically optimal behavior. This is a generalization of the 
idea of Sompolinsky et al. [1995]. 
We then answer the question what is the Riemannian structure to be introduced in 
the parameter space of synaptic weights. We answer this problem from the point of 
view of information geometry (Amari [1985, 1995], Amari et al [1992]). The explicit 
form of the Riemannian metric and its inverse matrix are given in the case of simple 
perceptrons. 
We finally show how the Riemannian gradient is applied to blind separation of mix- 
tured independent signal sources. Here, the mixing matrix is unknown so that the 
parameter space is the space of matrices. The Riemannian structure is introduced 
in it. The natural Riemannian gradient is computationary much simpler and more 
effective than the conventional gradient. 
2 Stochastic Gradient Descent and On-Line Learning 
Let us consider a neural network which is specified by a vector parameter w - 
(w,--. w,)  R '. The parameter w is composed of modifiable connection weights 
and thresholds. Let us denote by l(;, w) a loss when input signal ; is processed 
by a network having parameter w. In the case of multilayer percepttons, a desired 
output or teacher signal y is associated with ;, and a typical loss is given 
(1) 
where z = jf(a, w) is the output from the network. 
When input ;, or input-output training pair (;, y), is generated from a fixed prob- 
ability distribution, the expected loss L(w) of the network specified by w is 
(2) 
where E denotes the expectation. A neural network is trained by using training 
examples (a,y), (;r2,y2),"- to obtain the optimal network parameter w* that 
minimizes L(w). If L(w) is known, the gradient method'is described by 
tOt+l: tot -- rltVL(wt), t = 1,2,... 
where r/t is a learning constant depending on t and X7L = OL/e9w. Usually L(w) is 
unknown. The stochastic gradient learning method 
wt+ = wt - rltVl(at+, lt+l ; Wt) 
(3) 
was proposed by an old paper (Amari [1967]). This method has become popular 
since Rumelhart et al. [1986] rediscovered it. It is expected that, when r h converges 
to 0 in a certain manner, the above wt converges to w*. The dynamical behavior of 
Neural Learning and Natural Gradient Descent 129 
(3) was studied by Amari [1967], Heskes and Kappen [1992] and many others when 
r/t is a constant. 
It was also shown in Amari [1967] that 
Wt+l -- wt -- rhG-1Vl(;t+l,Yt+l,Wt) 
(4) 
works well for any positive-definite matrix, in particular for the metric G. Ge- 
ometrically speaking O1/Ow is a covariant vector while Awt = wt+l - wt is a 
contravariant vector. Therefore, it is natural to use a (contravariant) metric tensor 
G -1 to convert the covariant gradient into the contravariant form 
71= G - O1 ( .. O ) 
O- = Eg*3-wj(W) , (5) 
where G- 1 _m (gij) is the inverse matrix of G -- (gij). The present paper studies how 
the matrix tensor matrix G should be defined in neural learning and how effective 
is the new gradient learning rule 
wt+ = wt - qt'l(ah, Yt, wt). 
(6) 
3 Gradient in Riemannian spaces 
Let S = {w) be the parameter space and let l(vo) be a function defined on S. 
When S is a Euclidean space and w is an orthonormal coordinate system, the 
squared length of a small incremental vector dw connecting w and w + dw is given 
by 
r 
Id,12- - y](dw) 2. (7) 
i=1 
However, when the coordinate system is non-orthonormal or the space $ is Rieman- 
nian, the squared length is given by a quadratic form 
Idwl 2- gij(m)dwidwj = mtGm. (8) 
i,j 
Here, the matrix G = (gij) depends in general on w and is called the metric tensor. 
It reduces to the identity matrix I in the Euclidean orthonormal ce. It will be 
shown soon that the parameter space S of neural networks has Riemannian structure 
(see Amari et al. [1992], Amari [1995], etc.). 
The steepest descent direction of a function l(w) at w is defined by a vector dw 
that minimize l(w+dw) under the constraint Idw] 2 = ,2 (see eq.8) for a sumciently 
small constant s. 
Lemma 1. The steepest descent direction of l(w) in a Riemannian space is given 
by 
= -a 
We call 
7l(w) = G-l(w)Vl(w) 
the natural gradient of l(w) in the Riemannian space. It shows the steepest descent 
direction of l, and is nothing but the contravariant form of V'l in the tensor notation. 
When the space is Euclidean and the coordinate system is orthonormal, G is the 
unit matrix I so that rl = V1. 
130 $. Arnari 
4 Natural gradient gives efficient on-line learning 
Let us begin with the simplest case of noisy multilayer analog perceptrons. Given 
input , the network emits output z = jf(, w) + n, where jf is a differentiable 
deterministic function of the multilayer perceptron with parameter w and n is a 
noise subject to the normal distribution N(0, I). The probability density of an 
input-output pair (, z) is given by 
p(ae, z; w) = q(ae)p(z]ae; w), 
where q(a) is the probability distribution of input , and 
P(zla; TM)-- x/-- exp -- II II 2. 
The squared error loss function (1) can be written as 
l(, z, w) = - log p(, z; w) + log q() - log x/-. 
Hence, minimizing the loss is equivalent to maximizing the likelihood function 
Let DT = {(,zl),-'-,(T, ZT)} be T independent input-output examples gener- 
ated by the network having the parameter w*. Then, maximum likelihood estimator 
/VT minimizes the log loss l(, z; w) = -logp(, z; w) over the training data DT, 
that is, it minimizes the training error 
1 T 
Etrain(W ) =   l(at,zt;w). (9) 
t----1 
The maximum likelihood estimator is efficient (or Fisher-efficient), implying that it 
is the best consistent estimator satisfying the Cram4r-Rao bound asymptotically, 
lira TE[(/VT - w*)(,T -- w*)']- G -1 , (10) 
where G- 1 is the inverse of the Fisher information matrix G = (gij) defined by 
gij =El OlOgp('z;w) OlOgp('z;w)] 
OWl  (11) 
in the component form. Information geometry (Amari, 1985) proves that the isher 
information G is the only invariant metric to be introduced in the space 
of the parameters of probability distributions. 
Examples (, z), (, z )... are given one at a time in the case of on-line learning. 
Let t be the estimated value at time t. At the next time t + 1, the estimator t is 
modified to give a new estimator t+ based on the observation (t+, zt+). The 
old observations (, z),.-., (t, zt) cannot be reused to obtain t+, so that the 
learning rule is written  t+ = m(t+,zt+,t). The process {t} is hence 
MarkovJan. Whatever a learning rule m we choose, the behavior of the estimator 
t is never better than that of the optimal batch estimator t because of this 
restriction. The conventional on-line learning rule is given by the following gradient 
form t+l : t -- tVl(t+l, Zt+l;t)' When t satisfies a certain condition, say 
t = c/t, the stochtic approximation guarantees that t is a consistent estimator 
converging to w*. However, it is not efficient in general. 
There arises a question if there exists an on-line learning rule that gives an efficient 
estimator. If it exists, the ymptotic behavior of on-line learning is equivalent to 
Neural Learning and Natural Gradient Descent 131 
'that of the batch estimation method. The present paper answers the question by 
giving an efficient on-line learning rule 
1~ 
t+l -- t -- -Vl(rt+, zt+; vt). (12) 
t 
Theorem 1. The natural gradient on-line learning rule gives an Fisher-efficient 
estimator, that is, 
 = E[(vt - w')(vt- w')']  G-(w'). (13) 
Adaptive modification of learning constant 
We have proved that r h - 1It with the coefficient matrix G - is the asymptotically 
best choice for on-line learning. However, when the target parameter w* is not fixed 
but fluctuating or changes suddenly, this choice is not good, because the learning 
system cannot follow the change if r h is too small. It was proposed in Amari [1967] 
to choose r h adaptively such that r h becomes large.r when the current target w* 
is far from wt and becomes small when it is close to wt adaptively. However, no 
definite scheme was analyzed there. Sompolinsky et a']. [1995] proposed an excellent 
scheme of an adaptive choice of r h for a deterministic dichotomy neural networks. 
We extend their idea to be applicable to stochastic cases, where the Riemannian 
structure plays a role. 
We assume that l(a,z;w) is differentiable with respect to w. (The non- 
differentiable case is usually more difficult to analyze. Sompolinsky et al [1995] 
treated this case.) We moreover treat the realizable teacher so that L(w*) = O. 
We propose the following learning scheme: 
vt+ = bt - r/tZl(:et+,zt+;vt) (14) 
r/t+ = rh + otr/t31(a't+,zt+;vt)- r/t], (15) 
where a and/? are constants. We try to analyze the dynamical behavior of learning 
by using the continuous version of the algorithm, 
d 
vt = -r/t 7l(at, zt ; vt ), 
d 
(16) 
(17) 
In order to show the dynamical behavior of (bt, rh), we use the averaged version 
of the above equation with respect to the current input-output pair (aet,zt). We 
introduce the squared error variable 
1 
et = (wt - (wt - w*). (18) 
By using the average and continuous time version 
where 'denotes d/dt and ( ) the average over the current (, z), we have 
t = -2r/tet, (19) 
= - (20) 
132 $. Arnari 
The behavior of the above equation is interesting  The origin (0, 0) is its attractor. 
However, the basin of attraction has a fractal boundary. Anyway, starting from an 
adequate initial value, it has the solution of the form 
1 
) 1 (l) 
(22) 
This proves the 1It convergence rate of the generalization error, that is optimal in 
order for any estimator/or converging to w*. 
6 Riemannian structures of simple perceptrons 
We first study the parameter space $ of simple perceptrons to obtain an explicit 
form of the metric G and its inverse G-1 This suggests how to calculate the metric 
in the parameter space of multilayer percepttons. 
Let us consider a simple perceptton with input  and output z. Let w be its 
connection weight vector. For the analog stochastic perceptton, its input-output 
behavior is described by z = f(w) + n, where n denotes a random noise subject 
to the normal distribution N(0, er 2) and f is the hyperbolic tangent, 
l_e- 
f(u) = 1 + e -' 
In order to calculate the metric G explicitly, let ew be the unit column vector in 
the direction of w in the Euclidean space R ', 
w 
where [[ w ]1 is the Euclidean norm. We then have the following theorem. 
Theorem 2. The Fisher metric G and its inverse G-1 are given by 
G(,) = cl(w) + {c.(w) - c(w))c,cb, (23) 
G_i (w) -- 1 ( 1 1 ) (24) 
Cl(W  + c.(w) c(w--- '"'' 
where w -I,l (Euclidean norm) and c (w) and c.(w) are given by 
1 / {12} 
cl(w) = 4V-rer 2 {f2(we)- 1} 2 exp -e de, (25) 
I I 2 
C2(W) = 4x/.ff-4r2/{f2(we)--l}2e2exp{--e}de. (26) 
Theorem 3. The Jeffrey prior is given by 
v/io(.)l = 1 
v. v/c2( w){ c (w) )'-  ' 
(27) 
7 
The natural gradient for blind separation of mixtured 
signals 
Let s = (sl,.-., s) be n source signals which are n independent random variables. 
We assume that their n mixtures ae: (x,..., Xn), 
 = As (28) 
Neural Learning and Natural Gradient Descent 133 
are observed. Here, A is a matrix. When s is time serieses, we observe 
r(1),..-,r(t). The problem of blind separation is to estimate W - A - adap- 
tively from (t), t - 1,2,3, ... without knowing s(t) nor A. We can then recover 
original s by 
u = (29) 
when V - A -1. Let W E (l(n), that is a nonsingular n x n-matrix, and (W) 
be a scalar function. This is given by a measure of independence such as (W) -- 
KL[iS(y); p(y)], which is represented by the expectation of a loss function. We define 
the natural gradient of (W). 
Now we return to our manifold (l(n) of matrices. It has the Lie group structure: 
Any A  (l(n) maps (l(n) to (l(n) by W -- WA. We impose that the Riemannian 
structure should be invariant by this operation A. 
We can then prove that the natural gradient in this case is 
= vw'w. (30) 
The natural gradient works surprisingly well for adaptive blind signal separation 
Amari et al. [1995], Cardoso and Laheld [1996]. 
References 
[1] S. Amari. Theory of adaptive pattern classifiers, IEEE Trans., EC-16, No.3, 
299-307, 1967. 
[2] S. Amari. Differential-Geometrical Methods in Statistics, Lecture Notes in 
Statistics, vol.28, Springer, 1985. 
[3] S. Amari. Information geometry of the EM and em algorithms for neural net- 
works, Neural Networks, 8, No.9, 1379-1408, 1995. 
[4] S. Amari, A. Cichocki and H.H. Yang. A new learning algorithm for blind signal 
separation, in NIPS'95, vol. S, 1996, MIT Press, Cambridge, Mass. 
[5] S. Amari, K. Kurata, H. Nagaoka. Information geometry of Boltzmann ma- 
chines, IEEE Trans. on Neural Networks, 3, 260-271, 1992. 
[6] J. F. Cardoso and Beate Labeld. Equivariant adaptive source separation, to 
appear IEEE Trans. on Signal Processing, 1996. 
[7] T. M. Heskes and B. Kappen. Learning processes in neural networks, Physical 
Review A, 440, 2718-2726, 1991. 
[8] D. Rumelhart, G.E. Hinton and R. J. Williams. Learning internal representa- 
tion, in Parallel Distributed Processing: Ecplorations in the Microstructure of 
Cognition, 1, Foundations, MIT Press, Cambridge, MA, 1986. 
[9] H. Sompolinsky,N. Barkai and H. S. Seung. On-line learning of dichotomies: 
algorithms and learning curves, Neural Networks: The statistical Mechanics 
Perspective, Proceedings of the CTP-PBSRI Joint Workshop on Theoretical 
Physics, J.-H. Oh et al eds, 105-130, 1995. 
