Adjoint-Functions and Temporal Learning 
Algorithms in Neural Networks 
N. Toomarian and J. Barhen 
Jet Propulsion Laboratory 
California Institute of Technology 
Pasadena, CA 91109 
Abstract 
The development of learning algorithms is generally based upon the min- 
imization of an energy function. It is a fundamental requirement to com- 
pute the gradient of this energy function with respect to the various pa- 
rameters of the neural architecture, e.g., synaptic weights, neural gain,etc. 
In principle, this requires solving a system of nonlinear equations for each 
parameter of the model, which is computationally very expensive. A new 
methodology for neural learning of time-dependent nonlinear mappings is 
presented. It exploits the concept of adjoint operators to enable a fast 
global computation of the network's response to perturbations in all the 
systems parameters. The importance of the time boundary conditions of 
the adjoint functions is discussed. An algorithm is presented in which 
the adjoint sensitivity equations are solved smultaneously (i.e., forward 
in time) along with the nonlinear dynamics of the neural networks. This 
methodology makes real-time applications and hardware implementation 
of temporal learning feasible. 
1 INTRODUCTION 
Early efforts in the area of training artificial neural networks have largely focused 
on the study of schemes for encoding nonlinear mapping characterized by time- 
independent inputs and outputs. The most widely used approach in this context 
has been the error backpropagation algorithm (Werbos, 1974), which involves either 
static i.e., "feedforward" (Rumelhart, 1986), or dynamic i.e., "recurrent" ( Pineda, 
1988) networks. In this context ( Barhen et al, 1989, 1990a, 1990b), have exploited 
113 
114 Toomarian and Barhen 
the concepts of adjoint operators and terminal attractors. These concepts provide 
a firm mathematical foundation for learning such mappings with dynamical neural 
networks, while achieving a considerable reduction in the overall computational 
costs (Barhen et al, 1991). 
Recently, there has been a wide interest in developing learning algorithms capable 
of modeling time-dependent phenomena ( Hirsh, 1989). In a more restricted appli- 
cation oriented, domain attention has focused on learning temporal sequences. The 
problem can be formulated as minimization, over an arbitrary but finite time inter- 
val, of an appropriate error functional. Thus, the gradients of the functional with 
respect to the various parameters of the neural architecture, e.g., synaptic weights, 
neural gains, etc. must be computed. 
A number of methods have been proposed for carrying out this task, a recent sur- 
vey of which can be found in (Pearlmutter, 1990). Here, we will briefly mention 
only those which are relevant to our work. Williams and Zipset(1989) discuss a 
scheme similar to the well known "Forward Sensitivity Equations" of sensitivity 
theory (Cacuci, 1981 and Toomarian et al, 1987), in which the same set of sensi- 
tivity equations has to be solved again and again for each network parameter of 
interest. Clearly, this is computationally very expensive, and scales poorly to large 
systems. Pearlmutter (1989), on the other hand, describes a variational approach 
which yields a set of equations which are similar to the "Adjoint Sensitivity Equa- 
tions" (Cacuci, 1981 and Toomarian et al, 1987). These equations must be solved 
backwards in time and involve storage of the state variables from the activation net- 
work dynamics, which is impractical. These authors ( Toomarian and Barhen, 1990 
) have suggested a new method which, in contradistinction to previous approaches, 
solves the adjoint system of equations forward in time, concomitantly with the neu- 
ral activation dynamics. A potential drawback of this method lies in the fact that 
these adjoint equations have to be treated in terms of distributions which precludes 
straight-forward numerical implementation. Finally, Pineda (1990), suggests com- 
bining the existence of disparate time scales with a heuristic gradient computation. 
However, the underlying adiabatic assumptions and highly "approximate" gradient 
evaluation technique place severe limits on the applicability of his approach. 
In this paper we introduce a rigorous derivation of two novel systems of adjoint equa- 
tions, which can be solved simultaneously (i.e., forward in time) with the network 
dynamics, and thereby enable the implementation of temporal learning algorithms 
in a computationally efficient manner. Numerical simulations and comparison with 
previously available results will be presented elsewhere( Toomarian and Barhen, 
1991). 
2 TEMPORAL LEARNING 
We formalize a neural network as an adaptive dynamical system whose temporal 
evolution is governed by the following set of coupled nonlinear differential equations: 
(1) 
Adjoint-Functions and Temporal Learning Algorithms in Neural Networks 115 
where u, represents the output of the nth neuron [u,(0) being the initial state], 
and T,m denotes the synaptic coupling from the m-th to the n-th neuron. The 
constant c characterizes the decay of neuron activity. The sigmoidal function g(.) 
modulates the neural response, with gain given by 7; typically, g(7x) = tanh(7x ). 
The time-dependent "source" term, I(t), encodes component-contribution of the 
target temporal pattern a(t) via the expression 
{a.(t) if n  Sx (2) 
I.(t) = 0 if n ( Sn U Sv 
The topographic input, output, and hidden network partitions $x, $Y and 
respectively, are architectural requirements related to the encoding of mapping-type 
problems. Details are given in Barhen et al (1989). 
To proceed formally with the development of a temporal learning algorithm, we 
consider an approach based upon the minimization of a "neuromorphic" energy 
functional E, given by the following expression 
(3) 
where 
r.(t) = { a.(t)- ..(t) if n  Sy (4) 
0 ifn SxUSi 
In our model the internal dynamical parameters of interest are the synaptic 
strengths T.. of the interconnection topology, the characteristic decay constants 
., and the gain parameters 7.. Therefore, the vector of system parameters ( 
Barhen et al, 1990b) should be 
In this paper, however, for illustration purposes and simplicity, we will limit our- 
selves in terms of parameters to the synaptic interconnections only. Hence, the 
vector of system parameters will have M - N 2 elements 
= { 
We will assume that elements of/5 are, in principle, independent. Furthermore, we 
will also assume that, for a specific choice of parameters and set of initial conditions, 
a unique solution of Eq. (1) exists. Hence, fi is an implicit function of/5. 
Lyapunov stability requires the energy functional to be monotonically decreasing 
during learning time, r. This translates into 
dE M dE dPt <0 (6) 
dr = E dp, dr 
/=1 
Thus, one can always choose, with r/> 0 
dp 
dr 
dE 
= -l dp. 
(7) 
116 Toomarian and Barhen 
Integrating the above dynamical system over the interval [v, v + At], one obtains, 
'+' dE dr (8) 
p.(r + = p.(r) - ap. 
T 
Equation (8) implies that, in order to update a system parameter p., one must 
evaluate the gradient of E with respect to p. in the intervM [r, r+r]. Furthermore, 
using Eq. (3) and observing that the time integral and derivative with respect to 
p., permute one can write; 
dp. dp. Opt' dt + Oa 0'. dt (9) 
Since F is known analytically [viz. Eq. (3)] computation of OF/Ou, and OF/Opt, 
is straightforward. 
OF 
=- F. (10a) 
oqun 
OF 
= 0 (lOb) 
Thus, the quantity that needs to be determined is the vector Oa/Op,. Differentiating 
the activation dynamics, Eq. (1), with respect to p,, we observe that the time 
derivative and partial derivative with respect to p, commute. Using the shorthand 
notation 0(...)/Opt, = (.-.),, we obtain a set of equations to be referred to as 
"Forward Sensitivity Equations-FSE": 
in which 
6,,,. + m A.,., u.,, = S,,,. t > 0 
u.,. = 0 t=0 
(12) 
A,.,m = n,., 5,.,,n - 7,., .O,., T,.,m (13) 
S.,, = 7- .O- E T.,n um 5r,,,T,,. (14) 
171 
where .On represents the derivative of gn with respect to un, and 5 denotes the Kro- 
necker symbol. Since the initial conditions of the activation dynamics, Eq.(1), are 
excluded from the system parameter vector i5, the initial conditions of the forward 
sensitivity equations will be taken as zero. Computation of the gradients, via Eq. 
(9), using the forward sensitivity scheme as proposed by William and Zipser (1989), 
would require solving Eq. (12), N 2 times, since the source term explicitly depends 
on p. The system of equations (12) has N equations, each of which requires 
summation over all N neurons. Hence, the amount of computation ( measured in 
multiply-accumulates, scales like N 4 per time step. We assume that the interval 
between to to t! is divided to L time steps. Therefore, the total number of multiply- 
accumulates scales like N4L. Clearly, the scaling properties of this approach are 
very poor and it can not be practically applied to very large networks. On the other 
hand, this method has also inherent advantages. The FSE are solved forward in 
time along with the nonlinear dynamics of the neural networks. Therefore, there is 
no need for or a large amount of memory. Since Un,, has N s components, that is 
all needed to be stored. 
Adjoint-Functions and Temporal Learning Algorithms in Neural Networks 117 
In order to reduce the computational costs, an alternative approach can be consid- 
ered. It is based upon the concept of adjoint operators, and eliminates the need for 
explicit appearance of , in Eq. (9). A vector of adjoint functions,  is obtained, 
which contain all the information required for computing all the "sensitivities", 
dE/dp,. The necessary and sufficient conditions for constructing adjoint equations 
are discussed elsewhere ( Toomarian et al, 1987 and references therein). 
It can be shown that an Adjoint System of Equations-ASE, pertaining to the forward 
system of equations (12), can be formally written as 
-(. + A.m v,. = S, t > 0 (15) 
In order to specify Eq. (15) in closed mathematical form , we must define the 
source term $ and time- boundary conditions for the system. Both should be 
independent of p, and its derivatives. 
By identifying S with OF/O u, and selecting the final time condition (t = 
t!) = 0, a system of equations is obtained, which is similar to those proposed by 
Pearlmutter. The method requires that the neural activation dynamics, i.e., Eq. 
(1), be solved first forward in time, as followed by the ASE, Eq. (15), integrated 
backwards in time. The computation requirement of this approach scales as N2L. 
However, a major drawback to date has resided with the necessity to store quantities 
such as ., * and ,, at each time step. Thus, the memory requirements for this 
method scale as N2L. 
By selecting 3' - or. -5(t-t!) and initial conditions (t = 0) = 0 these authors ( 
-- 0 ' 
Toomarian and Barhen 1990 ) have suggested a method which, in contradistinction 
to previous approaches, enables the ASE to be integrated forward in time, i.e., 
concomitantly with the neural activation dynamics. This approach saves a large 
amount of storage, which scales only as N 2. The computation complexity of this 
method, is similar to that of backward integration and scales as NeL. A potential 
drawback lies in the fact that Eq. (15) must then be treated in terms of distributions, 
which precludes straightforward numerical implementation. 
At this stage, we introduce a new paradigm which will enable us to evolve the 
adjoint dynamics, Eq. (15) forward in time, but without the difficulties associated 
with solutions in the sense of distributions. We multiply the FSE, Eq. (12), by  
and the ASE, Eq. (15), by fi,,, subtract the two resulting equations and integrate 
over the time interval (lo,t!). This procedure yields the bilinear form: 
(, 'a,/., ),, - ('5 'a,/., ),o = [(, 05,/., )- ( ',/., 05*)]dr (16) 
To proceed, we select 
05._ or. 
= 0. 
Thus, Eq. (16) can be rewritten as: 
i OF _ 
(17) 
(18) 
118 Toomarian and Barhen 
The first term in the RHS of Eq. (18) can be computed by using the values of  
obtained by solving the ASE, (Eqs. (15) and (17)), forward in time. The main 
difficulty resides in the evaluation of the second term in the RHS of Eq. (18), i.e., 
[9 fi,,]q. To compute it, we now introduce an auxiliary adjoint system: 
in which we select 
A,, z,: . t> 0 (19) 
m 
(tf) :o. 
(20) 
Note that, eventhough we selected (t,) - 0, we are also interested in solving this 
auxiliary adjoint system forward in time. Thus, the critical issue is how to select 
the initial condition (i.e. (to)), that would result in (t,) - 0. The bilinear form 
associated with the dynamical systems fi,, and  can be derived in a similar fashion 
to Eq. (16). Its expression is: 
jt ! - 
( fi,. ),. - ( fi,. ),o = [( ,, ) - ( fi,. S)]dt (21) 
Incorporating S, 2(t!) and the initial condition of Eq. (12) into Eq. (21), we obtain; 
(fi,, S)dt = [ a,,]t, : (2 q,,)dt (22) 
In order to provide a simple illustration on how the problem of selecting the initial 
conditions for the -dynamics can be addressed, we assume, for a moment, that the 
matrix A in Eq. (19) is time independent. Hence, the formal solution of Eq. (19) 
can be written as: 
= (23a) 
= -'o) - o(tf) (23b) 
Therefore, in principle, Eq. (22) can be expressed in terms of i(to), using Eq. (23a). 
At time tl, where (t]) is known from the solution of Eq. (15), one can calculate 
the vector 2(to), from Eq. (23b), with (t]) = 0. 
In the problem under consideration, however, the matrix A in Eq. (19) is time 
dependent (viz rq. (13)). Thus the auxiliary adjoint equations will be solved by 
means of finite differences. Usually, the same numerical scheme that is used for Eqs. 
(1) and (15) will be adopted. For illustrative purposes, we limit the discussion in 
the sequel to the first order approximation i.e.; 
(/+1 _ l) 
+ A21 = 0 0 < I < L (24) 
At 
Adjoint-Functions and Temporal Learning Algorithms in Neural Networks 119 
From this equation one can easily show that 
/+1 = B z B 1-1 ... B  B2(to) = Bt;2(to) (25) 
in which 
B l = I + AtA l (26) 
where I is the identity matrix. Thus, the RHS of Eq. (22) can be rewritten as: 
= at (27) 
i 
The initial conditions ;7(to) can easily be found at time tl, i.e., at iteration stop L, 
by solving the algebraic equation: 
= (tf) (28) 
In summary, the computation of the gradients i.e. Eq. (8) involves two stages, 
corresponding to the two terms in the RHS of Eq. (18). The first term is calcu- 
lated using the adjoint functions  obtained from Eq. (15). The computational 
complexity is N2L. The second term is calculated via Eq. (27), and involves two 
steps: a) kernel propagation, which requires multiplication of two matrices B t and 
B (-) at each time step; the computational complexity scales as N3L; b) numerical 
integration via Eq. (24) which requires a matrix vector multiplication at each time 
step; hence, it scales as N2L. Thus, the overall computational complexity of this 
approach is of the order N3L. Notice, however, that here the storage needed is 
minimal and equal to N 2. 
3 CONCLUSIONS 
A new methodology for neural learning of time-dependent nonlinear mappings is 
presented. It exploits the concept of adjoint operators. The resulting algorithm 
enables computation of the gradient of an energy function with respect to various 
parameters of the network architecture in a highly efficient manner. Specifically, 
it combines the advantage of dramatic reductions in computational complexity in- 
herent in adjoint methods with the ability to solve the equations forward in time. 
Not only is a large amount of computation and storage saved, but the handling 
of real-time applications becomes also possible. This methodology also makes the 
hardware implementation of temporal learning attractive. 
Acknowledgments 
This research was carried out at the Center for Space Microelectronics Technology, 
Jet Propulsion Laboratory, California Institute of Technology. Support for the 
work came from Agencies of the U.S. Department of Defense including the Naval 
Weapons Center (China Lake, CA), and from the Office of Basic Energy Sciences 
of the Department of Energy, through an agreement with the National Aeronautics 
and Space Administration. The authors acknowledge helpful discussions with J. 
Martin and D. Andes from Navel Weapons Center. 
120 Toomarian and Barhen 
References 
Barhen, J., Gulati, S., and Zak, M., 1989, "Neural Learning of Constrained Non- 
linear Transformations",IEEE Computer, 22(6), 67-76. 
Barhen, J., Toomarian, N., and Gulati, S., 1990a, "Adjoint Operator Algorithms 
for Faster Learning in Dynamical Neural Networks", Adv. Neut. Inf. Proc. Sys., 
2,498-508. 
Barhen, J., Toomarian, N., and Gulati, S., 1990b, "Application of Adjoint Operators 
to Neural Learning", Appl. Math. Lett., 3 (3), 13-18. 
Barhen, J., Toomarian, N., and Gulati, S., 1991, "Fast Neural Learning Algorithms 
Using Adjoint Operators", Submitted to IEEE Trans. of Neural Networks 
Cacuci, D. G., 1981, "Sensitivity Theory for Nonlinear Systems", J. Math. Phys., 
22 (12), 2794-2802. 
Hirsch, M. W., 1989, "Convergent Activation Dynamics in Continuous Time Net- 
works",Neural Networks, 2 (5), 331-349. 
Pearlmutter, B. A., 1989,"Learning State Space Trajectories in Recurrent Neural 
Networks", Neural Computation, 1 (2), 263-269. 
Pearlmutter, B. A., 1990, "Dynamic Recurrent Neural Networks", Technical Re- 
port CMU-CS-90-196, School of Computer Science, Carnegie Mellon University, 
Pittsburgh, Pa. 
Pineda, F., 1988, "Dynamics and Architecture in Neural Computation", J. of Com- 
plezity, , 216-245. 
Pineda, F., 1990, "Time Dependent Adaptive Neural Networks", Adv. Neur. Inf. 
Proc. $ys., 2, 710-718. 
Rumelhart, D. E., and McC.and, J. L., 1986, Parallel and Distributed Processing, 
MIT Press. 
Toomarian, N., Wacholder, E., and Kaizerman, S., 1987, "Sensitivity Analysis of 
Two-Phase Flow Problems", Nucl. $ci. Eng., 99 (1), 53-81. 
Toomarian, N. and Barhen, J., 1990, "Adjoint Operators and Non- Adiabatic Al- 
gorithms in Neural Networks", Appl. Math. ett., (in press). 
Toomarian, N. and Barhen, J., 1991," Learning a Trajectory Using Adjoint Func- 
tions", submitted to Neural Networks 
Werbos, P., 1974, "Beyond Regression: New Tools for Prediction and Analysis in 
The Behavioral Sciences", Ph.D. Thesis, Harvard Univ. 
Williams, R. J., and Zipser, D., 1989, "A Learning Algorithm for Continually Run- 
ning Fully Recurrent Neural Networks", Neural Computation, I (2), 270-280. 
Par III 
Oscillations 
