Green's Function Method for Fast On-line Learning 
Algorithm of Recurrent Neural Networks 
Guo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee 
Institute for Advanced Computer Studies 
Laboratory for Plasma Research, 
University of Maryland 
College Park, MD 20742 
Abstract 
The two well known learning algorithms of recurrent neural networks are 
the back-propagation (Rumelhart & eL al., Werbos) and the forward propa- 
gation (Williams and Zipset). The main drawback of back-propagation is its 
off-line backward path in time for error cumulation. This violates the on-line 
requirement in many practical applications. Although the forward propaga- 
tion algorithm can be used in an on-line manner, the annoying drawback is 
the heavy computation load required to update the high dimensional sensitiv- 
ity matrix (O(N") operations for each time step). Therefore, to develop a fast 
forward algorithm is a challenging task. In this paper wl proposed a forward 
learning algorithm which is one order faster (only O(N ) operations for each 
time step) than the sensitivity matrix algorithm. The basic idea is that instead 
of integrating the high dimensional sensitivity dynamic equation we solve 
forward in time for its Green's function to avoid the redundant computations, 
and then update the weights whenever the error is to be corrected. 
A Numerical example for classifying state trajectories using a recurrent 
network is presented. It substantiated the faster speed of the proposed algo- 
rithm than the Williams and Zipser's algorithm. 
I. Introduction. 
In order to deal with sequential signals, recurrent neural networks are often put forward as a 
useful model. A particularly pressing issue concerning recurrent networks is the search for an 
efficient on-line training algorithm. Error back-propagation  (Rumeihart, Hinton, and 
Williams[l]) was originally proposed to handle feedforward networks. This method can be ap- 
plied to train recurrent networks if one unfolds the time sequence of mappings into a multilayer 
feed-forward net, each layer with identical weights. Due to the nature of backward path, it is 
basically an off-line method. Pineda [2] generalized it to recurrent networks with hidden neu- 
rons. However, he is mostly interested in time-independent fixed point type of behaviors. Pearl- 
mutter [3] proposed a scheme to learn temporal txajectories which involves equations to be 
solved backward in time. It is essentially a generalized version of error back-propagation to the 
problem of learning a target state txajectory. The viable on-line method to date is the RTRL 
(Real Time Recurrent Learning) algorithm (Williams and Zipsex [4]), which propagates a sen- 
333 
334 Sun, Chen, and Lee 
sitivity matrix forward in time. The main drawback of this algorithm is its high cost of compu- 
tation. It needs O(N 4) number of operations per time step. Therefore, a faster (less than O(N 4) 
operations) on-line algorithm appears to be desirable. 
Toomarian and Barhen [5] proposed an O(N 2) on-line algorithm. They derived the same 
equations as Pearlmutter's back-propagation using adjoint-operator approach. They then tried 
to convert the backward path into a forward path by adding a Delta function to its source term. 
But this is not correct. The problem is not merely because it "precludes straightforward numer- 
ical implementation" as they acknowledged later [6]. Even in theory, the result is not correct. 
The mistake is in their using a not well defined equity of the Delta function integration. Briefly 
I tl  (t- tf)f(t) dt = f(tf) is not right if the function fit) is discontin- 
speaking, the equity to 
uous at t = t? The value of the left-side integral depends on the distribution of function fit) and 
therefore is not uniquely defined. If we deal with the discontinuity carefully by splitting time 
interval from t o to {finto two segments: t o to e and tfe to 9and let  --) 0, we will find out 
that adding a Delta function to the source term does not affect the basic property of the adjoint 
equation. Namely, it still has to be solved backward in time. 
Recently, Toomarian and Barhen [6] modified their adjoint-operator approach and proposed 
an alternative O(N ') on-line training algorithm. Although, in nature, their result is very similar 
to what we presented in this paper, it will be seen that our approach is more straightforward and 
can be easily implemented numerically. 
Schmidhuber[7] proposed an O(N ') algorithm which is a combination of back propagation 
(within each data block of size N) and forward propagation (between blocks). It is therefore not 
truly an on-line algorithm. 
Sun, Chen and Lee [8] studied this problem, using a more general approach - variational ap- 
proach, in which a constrained optimization problem with Lagrangian multipliers was consid- 
ered. The dynamic equation of the Lagrangian multiplier was derived, which is exactly the 
same as adjoint equation[5]. By taking advantage of linearity of this equation an O(N ') on-line 
algorithm was derived. But, the numerical implementation of the algorithm, especially the nu- 
merical instabilities are not addressed in the paper. 
In this paper we will present a new approach to this problem - the Green's function method. 
The advantages of the this method are the simple mathematical formulation and easy numerical 
implementation. One numerical example of trajectory classification is presented to substantiate 
the faster speed of the proposed algorithm. The numerical results are benchmarked with Wil- 
liams and Zipser's algorithm. 
II. Green's Function Approach. 
(a) Definition of the Problem 
Consider a fully recurrent network with neural activity represented by an N-dimensional vec- 
tor x(t). The dynamic equations can be written in general as a set of first order differential equa- 
tions: 
.(t) = F(x(t),w,I(t)) (1) 
where w is a matrix representing the set of weights and all other adjustable parameters, I(t) is a 
vector representing the neuron units clamped by external input signals at time t. For a simple 
network connected by first order weights the nonlinear function F may look like 
F(x(t),w,I(t)) = -x(t) +g(w.x) +I(t) (2) 
where the scaler function g(u) could be, for instance, the Sigmoid function g(u) = 1/(l+e' u). 
Suppose that part of the state neurons {x i I i  M} are measurable and part of neurons {x i I i  
Green's Function Method for Fast On-line Learning Algorithm of Recurrent Neural Networks 335 
H} are hidden. For the measurable units we may have desired output  (t). In order to train 
the network, an objective functional (or an error measure functional) is often given to be 
tz 
E(x,) = Ie(x(t),(t) )dt (3) 
to 
where functional E depends on weights w implicitly through the measurable neurons {x i I i  
M}. A typical error function is 
2 (4) 
e(x(t),2(t)) = (x(t)-2(t)) 
The gradient descent learning is to modify the weights according to 
E_ ;?x i}x 
A wo w ' a-' dt. (5) 
t o 
In order o evaluate the integral in Eq. (5) one needs to know both e/rw and x/rw. The 
function 
first term can be easily obtained by taking derivative of the given error 
e (x (t), J (t)). For the second term one needs to solve the differential equation 
(6) 
which is easily derived by taking derivative of Eq.(1) with respect to w. The well known for- 
ward algorithm of recurrent networks [4] is to solve Equation (6) forward in time and make the 
weight correction at the end (t = ) of the input sequence. (This algorithm was developed inde- 
pendently by several researchers, but due to the page limitation we could not refer all related 
papers and now simply call it Williams and Zipset' s algorithm) The on-line learning is to make 
weight correction whenever an error Is to be corrected during the input sequence 
Aw(t) = -rl (ff---ex  -  (7) 
The proof of convergence of on-line learning algorithm will be addressed elsewhere. 
The main drawback of this forward algorithm is that it requires 0(3/4 ) operations per time 
step to update the matrix )x/Ow. Our goal of the Green's function approach is to find an on- 
line algorithm which requires less computation load. 
(b). Green's Function Solution 
First let us analyze the computational complexity when integrating Eq. (6) directly. Rewrite 
Eq. (6) as 
i)x F 
. -- - (8) 
i)w 
where the linear operator L is defined as L - d F 
dt x 
Two types of redundancy will be seen from Eq. (8). First, the operator L does not depend on w 
explicitly, which means that what we did in solving for )x/Ow is to repeatedly solve the iden- 
tical differential equation for each components ofw. This is redundant. It is especially wasteful 
when higher order connection weights are used. The second redundancy is in the special form 
of F/tOw for neural computations where the same activity function (say, Sigmoid function) is 
336 Sun, Chen, and Lee 
used for every neuron, so that 
F 
- g' (wkt.xt)8i x i (9) 
where /is the Kronecker delta function. It is seen from Eq. (9) that among N 3 components of 
this third order tensor most of them, N2(N- 1), are zero (when k  i) and need not to be computed 
repeatedly. In the original forward learning scheme, we did not pay attention to this redundan- 
cy. 
Our Green' s function approach is able to avoid the redundancy by solving for the low dimen- 
sional Green's function. And then we construct the solution of Eq. (8) by the dot product of F/ 
)w with the Green's function, which can in turn be reduced to a scaler product due to Eq. (9). 
The Green' s function of the operator L is defined as a dual time tensor function G(t-x) which 
satisfies the following equation 
d F 
G(t-x)-i} x .G(t-x) = (t-x) (10) 
It is well known that, if the solution of Eq. (10) is known, the solution of the original equation 
Eq. (6) (or (8)) can be constructed using the source term }F/tOw through the integral 
(t) = I (GO-x) .-(x))dx (11) 
t o 
To find the Green's function solution we first introduce a tensor function V(t) that satisfies 
the homogeneous form of Eq. (10) 
ltt i}F V ( t) =0 
V ( t) - i}x ' 
y (t o) = 1 
The solution of Eq. (10) or the Green's function can then be constructed as 
G(t-x) = V(t) . V 'l (x)H(t-x) 
where H(t-x) is the Heaviside function defined as 
1 t>_x 
H(t-x) = { 
0 t<x 
Using the well known equalities 
ClH(t-x) = (t-x) 
dt 
(12) 
(13) 
and f(t,x)(t-x) = f(t,t)(t-x) , 
one can easily verify that the constructed Green's function shown in Eq. (13) is correct, that is, 
it satisfies Eq. (10). Substituting G(t-x) from Eq. (13) into Eq. (11) we obtain the solution of 
Eq. (6) as, 
t 
i)w(t) = V(t) . l ( (V(x)) '! i)F(x))dx (14) 
t o 
Green's Function Method for Fast On-line Learning Algorithm of Recurrent Neural Networks 337 
We note that this formal solution not only satisfies Eq. (6) but also satisfies the required initial 
condition 
aw(tO) = O. (15) 
The "on-line" weight correction at time t is obtained easily from Eq. (5) 
15w=-rl .---rl -.V(t) ((V(x)) .i}w(X))dx (16) 
t o 
(c) Implementation 
To implement Eq. (16) numerically we will introduce two auxiliary memories. First, we de- 
fine U(t) to be the inverse of matrix V(t), i.e. U(t) = V -l(t). It is easy to see that the dynamic 
equation of U(t) is 
fdU(t) + U (t) F 
'i}x -0 (17) 
r (t0) = 1 
Secondly, we define a third order tensor I-[ij k that satisfies 
= U (t)  w 
, 1-l(t0) =0 
then the weight correction in Eq. (16) becomes 
15 w = -q (v (t)  1-I (t)) 
where the vector v(t) is the solution of the linear equation 
e 
v(t).U(t) = }x 
In discrete time, Eqs. (17) - (20) become: 
Uq(t) = Uij(t- 1) +At Uik(t- 1)xj 
Uo ( O) =15j 
(18) 
(19) 
(20) 
(21) 
Ylij  (t) = YIon (t - 1) + (At) Uij (t - 1) iwj 
nij n (0) = 0 
t. vi(t)' Uij(t)= 
(22) 
(23) 
338 Sun, Chen, and Lee 
Awij = -11 ( X vtc ( t) lltc 0 ( t) ) (24) 
k 
To summarize the procedure of the Green' s function method, we need to simultaneously in- 
tegrate Eq. (21) and Eq. (22) for U(t) and 1-I forward in time starting from Uij(O) = ij and 
I-Iijtc(0) = 0. Whenever error message are generated, we shall solve Eq. (23) for v(t) and update 
weights according to Eq. (24). 
The memory size required by this algorithm is simply N+N 2 for storing U(t) and I-[(t). 
The speed of the algorithm is analyzed as follows. From Eq. (21) and Eq. (22) we see that the 
update of U(t) and I'I both need N s operations per time step. To solve for v(t) and update w, 
we need also N s operations per time step. So, the on-line updating of weights needs totally 4N s 
operations per time step. This is one order of magnitude faster than the current forward learning 
scheme. 
IH Numerical Simulation 
We present in this section numerical examples to demonstrate the proposed learning algorithm 
and benchmark it against Williams&Zipser's algorithm. 
Class 1 
-' -0'.5 0.6. 
-0.6. 
Class 2 Class 3 
-0.75 -0.5 i 0. j25 0.5 0.75  
-0.25 
' - 0 '.. 5 -0225 
-0.7 
0.75 
0.25 o 3 
-o275 ' '  
-0.25 
-o. -o.5 
-o .75 
_!5,  -: -0',5 { 
Fig. 1 PHASE SPACE TRAJECTORIES 
Three different shapes of 2-D trajectory, each is shown in one column with three examples. 
Recurrent neural networks are trained to recognize the different shapes of trajectory. 
We consider the trajectory classification problem. The input data are the time series of two 
Green's Function Method for Fast On-line Learning Algorithm of Recurrent Neural Networks 339 
dimensional coordinate pairs {x(t), y(t)} sampled along three different types of trajectories in 
the phase space. The sampling is taken uniformly with At=-2rd60. The trajectory equations are 
x(t) = sin(t+l)lsin(t)l x(t) = sin(0.st+)sin(l.st) x(t) = sin(t+l)sin(2t) 
y(t)=cos(t+)lsin(t) I (y(t) cos(0.st+l)sin(l.st) (y(t)=cos(t+)sin(:t) 
where 13 is a uniformly distributed random parameter. When 13 is changed, these trajectories 
are distorted accordingly. Nine examples (three for each class) are shown in Fig. 1.The neural 
net used here is a fully recurrent first-order network with dynamics 
Si(t+l ) = Si(t ) +(Tanh Wij(Sl)j (25) 
- _-! 
where S and I are vectors of state and input neurons, the symbol  represents concatenation, 
and N is the number of state. Six input neurons are used to represent the normalized vector { 1, 
x(t), y(t), x(t) 2, y(t) 2, x(t)y(t)}. The neural network structure is shown in Fig. 2. 
't Check state neurons at 
ate ( t q- 1  the end p_f input sequence. 
et- b' 
 S1S 2 S N x(t ,)y(t), xZ(t), )(t), )c(t)y(t) 
 State (t) Input (t) 
Fig.2 Recurrent Neural Network for Trajectory Classification 
For recognition, each trajectory data sequence needs to be fed to the input neurons and the 
state neurons evolve according to the dynamics in Eq. (25). At the end of input series we check 
the last three state neurons and classify the input trajectory according to the "winner-take-all" 
rule. For training, we assign the desired final output for the three trajectory classes to (1,0,0), 
(0,1,0) and (0,0,1) respectively. Meanwhile, we need to simultaneou sly integrate Eq. (21) for 
U(t) and Eq. (22) for 1-I. At the end, we calculated the error from Eq. (4) and solve Eq. (23) 
for v(t) using LU decomposition algorithm. Finally, we update weights according to Eq. (24). 
Since the classification error is generated at the end of input sequence, this learning does not 
have to be on-line. We present this example only to compare the speeds of the proposed fast 
algorithm against the Williams and Zipser's. We run the two algorithms for the same number 
of iterations and compare the CPU time used. The results are shown in Table. 1, where in each 
one iteration we present 150 training patterns, 50 for each class. These patterns are chosen by 
randomly selecting 13 values. It is seen that the CPU time ratio is O(1/N), indicating the Green' s. 
function algorithm is one order faster in N. 
Another issue to be considered is the error convergent rate (or learning rate, as usually 
called). Although the two algorithms calculate the same weight correction as in Eq. (7), due to 
different numerical schemes the outcomes may be different. As the result, the error convergent 
rates are slightly different even if the same learning rate l is used. In all numerical simulations 
we have conducted the learning results are very good (in testing, the recognition is perfect, no 
single misclassification was found). But, during training the error convergence rates are differ- 
ent. The numerical experiments show that the proposed fast algorithm converges slower than 
340 Sun, Chen, and Lee 
the Williams and Zipser's for the small size neural nets but faster for the large size neural net. 
Simulatio Fast Algorithm Williams&Zipser's .ratio 
N = 4 1607.4 5020.8 1: 3 
(Number of Iterations = 200) 
N = 8 1981.7 10807.0 1: 5 
(Number of Iterations = 50) 
N=12 
(Number of Iterations = 50) 5947.6 45503.0 1: 8 
Table 1. The CPU time (in seconds) comparison, implemented in DEC3100 Workstation, 
for learning the trajectory classification example. 
IV. Conclusion 
The Green's function has been used to develop a faster on-line learning algorithm for recur- 
rent neural networks. This algorithm requires 0(/73 ) operations for each time step, which is one 
order faster than the Williams and Zipser's algorithm. The memory required is O(N$). 
One feature of this algorithm is its straightforward formula, which can be easily implemented 
numerically. A numerical example of trajectory classification has been used to demonstrate the 
speed of this fast algorithm compared to Williams and Zipser's algorithm. 
References 
[1] D.Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error 
propagation. In Parallel distributed processing: Vol. I MIT press 1986. P. Werbos, Beyond Re- 
gression: New tools for prediction and analysis in the behavior sciences. Ph.D. thesis, Harvard 
university, 1974. 
[2] F. Pineda, Generalization of back-propagation to recurrent neural networks. Phys. Rev. 
Letters, 19(59):2229, 1987. 
[3] B. Pearlmutter, Learning state space trajectories in recurrent neural networks. Neural 
Computation, 1(2):263, 1989. 
[4] R. Williams and D. Zipser, A learning algorithm for continually running fully recurrent 
neural networks. Tech. Report ICS Report 8805, UCSD, La Jolla, CA 92093, November 1988. 
[5] N. Toomarian, J. Barhen and S. Gulati, "Application of Adjoint Operators to Neural 
Learning", Appl. Math. Lett., 3(3), 13-18, 1990. 
[6] N. Toomarian and J. Barhen, "Adjoint-Functions and Temporal Learning Algorithms in 
Neural Networks", Advances in Neural Information Processing Systems 3, p. 113-120, Ed. by 
R. Lippmann, J. Moody and D. Touretzky, Morgan Kaufmann, 1991. 
[7] J. H. Schmidhuber, "An O(N $) Learning Algorithm for Fully Recurrent Networks", Tech 
Report FKI- 151-91, Institut fiir Informatik, Technische Universitfit Miinchen, May 1991. 
[8] Guo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee, "A Fast On-line Learning Algo- 
rithm for Recurrent Neural Networks", Proceedings of lnternational Joint Conference on Neu- 
ral Networks, Seattle, Washington, page 11-13, June 1991. 
