22 
LEARNING ON A GENERAL NETWORK 
Amir F. Atiya 
Department of Electrical Engineering 
California Institute of Technology 
Ca 91125 
Abstract 
This paper generalizes the backpropagation method to a general network containing feed- 
back connections. The network model considered consists of interconnected groups of neurons, 
where each group could be fully interconnected (it could have feedback connections, with pos- 
sibly asymmetric weights), but no loops between the groups are allowed. A stochastic descent 
algorithm is applied, under a certain inequality constraint on each intra-group weight matrix 
which ensures for the network to possess a unique equilibrium state for every input. 
Introduction 
It has been shown in the last few years that large networks of interconnected "neuron"-like 
elements re quite suitable for performing a variety of computational and pattern recognition 
tasks. One of the well-known neural network models is the backpropagation model [1]-[4]. It 
is an elegant way for teaching a layered feedforward network by a set of given input/output 
examples. Neural network models having feedback connections, on the other hand, have lso 
been devised (for example the Hopfield network [5]), and are shown to be quite successful in 
performing some computational tasks. It is important, though, to have a method for learning 
by examples for a feedback network, since this is a general way of design, and thus one can 
avoid using an ad hoc design method for each different computational task. The existence 
of feedback is expected to improve the computational abilities of a given network. This is 
because in feedback networks the state iterates until a stable state is reached. Thus processing 
is performed on several steps or recursions. This, in general allows more processing abilities 
than the "single step" feedforward case (note also the fact that a feedforward network is 
a special case of a feedback network). Therefore, in this work we consider the problem of 
developing a general learning algorithm for feedback networks. 
In developing a learning algorithm for feedback networks, one has to pay attention to the 
following (see Fig. I for an example of a configuration of a feedback network). The state of 
the network evolves in time until it goes to equilibrium, or possibly other types of behavior 
such as periodic or chaotic motion could occur. However, we are interested in having a steady 
and and fixed output for every input applied to the network. Therefore, we have the following 
two important requirements for the network. Beginning in any initial condition, the state 
should ultimately go to equilibrium. The other requirement is that we have to have a unique 
American Institute of Physics 1988 
23 
equilibrium state. It is in fact that equilibrium state that determines the final output. The 
objective of the learning algorithm is to adjust the parameters (weights) of the network in small 
steps, so as to move the unique equilibrium state in a way that will result finally in an output 
as close as possible to the required one (for each given input). The existence of more than one 
equilibrium state for a given input causes the following problems. In some iterations one might 
be updating the weights so as to move one of the equilibrium states in a sought direction, while 
in other iterations (especially with different input examples) a different equilibrium state is 
moved. Another important point is that when implementing the network (after the completion 
of learning), for a fixed input there can be more than one possible output. Independently, other 
work appeared recently on training a feedback network [6],[7],[8]. Learning algorithms were 
developed, but solving the problem of ensuring a unique equilibrium was not considered, This 
problem is addressed in this paper and an appropriate network and a learning algorithm are 
proposed. 
outputs 
Fig. 1 
A recurrent network 
The Feedback Network 
Consider a group of n neurons which could be fully inter-connected (see Fig. ! for an 
example). The weight matrix W can be asymmetric (as opposed to the Hopfield network). 
The inputs are also weighted before entering into the network (let V be the weight matrix). 
Let x and y be the input and output vectors respectively. In our model y is governed by the 
following set of differential equations, proposed by Hopfield [5]: 
du 
 d--[ = Wf(u) - u + Vx, y = f(u) (1) 
24 
where f(u) = (f(ux),...,f(ur)) T, W denotes the transpose operator, f is a bounded and 
differentiable function, and r is a positive constant. 
For a given input, we would like the network after a short transient period to give a steady 
and fixed output, no matter what the initial network state was. This means that beginning 
any initial condition, the state is to be attracted towards a unique equilibrium. This leads to 
looking for a condition on the matrix W. 
Theorem: A network (not necessarily symmetric) satisfying 
2 1/max(?)2, 
i j 
exhibits no other behavior except going to a unique equilibrium for a given input. 
Proof: Let u(t) and u2(t) be two solutions of (1). Let 
J(t) = Ilu(t) - u2(t)ll 2 
where [I II is the [wo-norm. Differentiating J with respect to time, one obtains 
dJ(t) _ 2(u(t)-u2(t))r(du(t) du2(t)  
dt dt  1' 
Using (1) , the expression becomes 
dJ(t) 2 2 
dt - r []ul(t) -u2(t))[12 + ;(u(t) -u2(t))"W[f(u(t)) - f(u2(t))]. 
Using Schwarz's Inequality, we obtain 
2 
dJ(t) < _211u(t ) _ u(t)ll  + _11u (t)_ u(t)11. iiW[f(u (t)) _ f(u2(t))] ii. 
dt - r r 
Again, by Schwarz's Inequality, 
w/ [f(u (t)) - f(u(t))] <_ Ilwgll - Ilf(u(t)) - f(u2(t))11 , i = 1,...,n 
where wi denotes the i th row of W. Using the mean value theorem, we get 
[If(u (t)) - f(u2(t)) II < (maxlf'l)llu (t) - u2(t)l I. (3) 
Using (2),(3), and the expression for J(t), we get 
dJ(t) <-aJ(t) (4) 
dt - 
where 
2 2 (maxl f,i)/- jwj.. 
T T i 
(2) 
25 
By hypothesis of the Theorem, a is strictly positive. Multiplying both sides of (4) by exp(at), 
the inequality 
d 0 
- 
results, from which we obtain 
J(t) _< J(0)e 
From that and from the fact that J is non-negative, it follows that J(t) goes to zero as t 
Therefore, any two solutions corresponding to any two initial conditions ultimately approach 
each other. To show that this asymptotic solution is in fact an equilibrium, one simply takes 
u2(t): u(t + T), where T is a constant, and applies the above argument (that J(t) - 0 as 
t --* c), and hence u (t + T) -- u (t) as t --* cw for any T, and this completes the proof. 
For example, if the function f is of the following widely used sigmoid-shaped form, 
1 
f (u) - 1 + e- 
then the sum of the square of the weights should be less than 16. Note that for any function 
f, scaling does not have an effect on the overall results. We have to work in our updating 
scheme subject to the constraint given in the Theorem. In many cases where a large network 
is necessary, this constraint might be too restrictive. Therefore we propose a general network, 
which is explained in the next Section. 
The General Network 
We propose the following network (for an example refer to Fig. 2). The neurons are 
partitioned into several groups. Within each group there are no restrictions on the connections 
and therefore the group could be fully interconnected (i.e. it could have feedback connections). 
The groups are connected to each other, but in a way that there are no loops. The inputs to 
the whole network can be connected to the inputs of any of the groups (each input can have 
several connections to several groups). The outputs of the whole network are taken to be the 
outputs (or part of the outputs) of a certain group, say group f. The constraint given in the 
Theorem is applied on each intra-group weight matrix separately. Let (qa, s,), a = 1, ..., N be 
the input/output vector pairs of the function to be implemented. We would like to minimize 
the sum of the square error, given by 
N 
a----1 
where 
M 
---- --$i) , 
i----1 
and f is the output vector of group f upon giving input q, and M is the dimension of vector 
s a. The learning process is performed by feeding the input examples q sequentially to the 
network, each time updatg the weights in an attempt to minimize the eor. 
26 
inputa outputs 
Fig. 2 
An example of a general network 
(each group represents a recurrent network) 
Now, consider a single group l. Let W t be the intra-group weight matrix of group l, V 'q 
be the matrix of weights between the outputs of group r and the inputs of group l, and yt be 
I rl 
the output vector of group l. Let the respective elements be wo. , vO. , and Yi- Furthermore, 
let nt be the number of neurons of group l. Assume that the time constant  is sufficiently 
small so s to allow the network to settle quickly to the equilibrium state, which is given by 
the solution of the equation 
yt= f(Wtyt + E VtYr)' (5) 
rAl 
where At is the set of the indices of the groups whose outputs are connected to the inputs of 
group l. We would like each iteration to update the weight matrices W t and V rt so as to move 
the equilibrium in a direction to decrease the error. We need therefore to know the change in 
the error produced by a small change in the weight matrices. Let Oe 
3-Wr, and ov-r, denote the 
Oe Oe Oea, 
matrices whose (i,j) th element are -,, and respectively. Let 3-fy be the column vector 
whose i th element is . We obtain the following relations: 
aea 
av,,t 
-1 3a t T 
 = [,_ (w') T] y,(y ) , 
-1 aCa r T 
- ['- (w')q y,(y ) , 
wheee A t is the diagonal matrix whose ita diagonal element is 1/f'( k t t ,q ,- 
for a derivation refer to Appends). The vector  sociaed with group I can be obtaed 
in erms of the vectors  jeBt where Bt is the se of he indices of the groups whose puts 
e connected to the outputs of group l..We get (refer to Appends} 
 = (v')[a -(wq] - 
., aye' () 
The matrix A t - (Wt? ' for any group l can never be singular, so we will not face any 
problem in the updating process. To prove that, let z be a vector satisfying 
[' - (w')],. = 0. 
27 
We can write 
,/maxl.f'l _< -?:,, i= ],..., , 
where z  the i  element of z. Usg Schwarz's Inequity, we obtain 
_ ..., 
Squing both sides and dg the inequalities for i = 1, ..., n, we get 
,,)- (7) 
k i k 
Since the condition 
w' 2 1/max(),)2) 
, 
i k 
is enforced, it follows that (7) cannot be satisfied unless z is the zero vector. Thus, the matrix 
Ai - (Wl) T cannot be singular. 
For each iteration we begin by updating the weights of group f (the group containing 
0e equals simply 2(y{ - s,...,y - sM 0, 0)T). Then 
the final outputs). For that group yy , ..., 
we move backwards to the groups connected to that group and obtain their corresponding 
0e, vectors using (6) update the weights, and proceed in the same manner until we complete 
0y ' 
updating all the groups. Updating the weights is performed using the following stochastic 
descent algorithm for each group, 
O ea 
c9 ea 
AV ----- --aa + adeaR , 
where 1 is a noise matrix whose elements axe characterized by independent zero-mean unity- 
variance Gaussian densities, and the a's are parameters. The purpose of adding noise is to 
allow escaping local minima if one gets stuck in any of them. Note that the control parameter 
is taken to be ea. Hence the variance of the added noise tends to decrease the more we 
approach the ideal zero-error solution. This makes sense because for a large error, i.e. for an 
unsatisfactory solution, it pays more to add noise to the weight matrices in order to escape 
local minima. On the other hand, if the error is small, then we are possibly near the global 
minimum or to an acceptable solution, and hence we do not want too much noise in order 
not to be thrown out of that basin. Note that once we reach the ideal zero-error solution the 
added noise as well as the gradient of e, become zero for all a and hence the increments of the 
weight matrices become zero. If a/ter a certain iteration W happens to violate the constraint 
i.w. _ constant < 1/rnax(f') , then its elements are scaled so as to project it back onto 
the surface of the hypershere. 
Implementation Example 
A pattern recognition example is considered. Fig. $ shows a set of two-dimensional 
training patterns from three classes. It is required to design a neural network recognizer with 
28 
three output neurons. Each of the neurons should be on if a sample of the corresponding class is 
presented, and off otherwise, i.e. we would like to design a "winner-take-all" network. A single- 
layer three neuron feedback network is implemented. We obtained 3.3% error. Performing the 
same experiment on a feedforward single-layer network with three neurons, we obtained 20% 
error. For satisfactory results, a feedforward network should be two-layer. With one neuron 
in the first layer and three in the second layer, we got 36.7% error. Finally, with two neurons 
in the first layer and three in the second layer, we got a match with the feedback case, with 
3.3% error. 
2 
3 33 
3 3 
2 
2 
2 
2 
2 2 1 
2 2 1 
2 
2 2 1 
2 
3 
3 
I 1 
1 
33333 3 
3 
3 
3 
1 
I I 
1 
1 
I I 
Fig. 3 
A pattern recognition example 
Conclusion 
A way to extend the backpropagation method to feedback networks has been proposed. 
A condition on the weight matrix is obtained, to insure having only one fixed point, so as 
to prevent having more than one possible output for a fixed input. A general structure for 
networks is presented, in which the network consists of a number of feedback groups connected 
to each other in a feedforward manner. A stochastic descent rule is used to update the weights. 
The method is applied to a pattern recognition example. With a single-layer feedback network 
i obtained good results. On the other hand, the feedforward backpropagation method achieved 
good resuls only for the case of more than one layer, hence also with a larger number of neurons 
than the feedback case. 
29 
Acknowledgement 
The author would like to gratefully acknowledge Dr. Y. Abu-Mostafa for the useful 
discussions. This work is supported by Air Force Office of Scientific Research under Grant 
AFOSR-86-0296. 
Appendix 
Differentiating (5), one obtains 
m tOjm OtOp q- ypSjk), 
Owl, - r(4.)( ' Oy, , 
k, p = 1, ..., nt 
where 
1 ifj=k 
8Y= 0 otherwise, 
and 
l 
Zj = 
I I rl r 
m reA rr 
We can write 
c9yt -- (A t- Wt)-b " (A- 1) 
where b kv is the nt-dimensional vector whose i tn component is given by 
t ifi=k 
b . p= Yp 
 0 otherwise. 
By the chain rule, 
which, upon substituting from (A 1), can be put in the form  
- Yp o--fy, where g is the ktn 
column of (A t - Wt) -. Finally, we obtain the required expression, which is 
Oea Oea I T 
aw, - ['-(w')]-' W' (Y) ' 
Regarding  it is obtained by differentiating (5) with respect to v t We get similarly 
OVr ' kp' 
Oy  _ (A t - W)-c, 
witere c kp is the nt-dimensional vector whose i tn component is given by 
cp { y[, ff i = k 
i = 0 otherwise. 
3O 
A derivation very similar to the case of  results in the following required expression: 
aea - 1 aea r T 
aW, - ) ' 
Now, finally consider oe o j ay 
-. Let -', jeBt be the matrix whose (k,p) ta element is b'y' The 
elements of  can be obtained by differentiating the equation for the fixed point for group 
j, as follows, 
ag .  
,,, )' 
Hence, 
ay' - (A'- W')-IV'" (A- 2) 
Using the chain rule, one can write 
ay, -  (---) ayy' 
We substitute from (A - 2) into the previous equation to complete the derivation by obtaining 
aea 
ae,, _ 
ay  
'B 
References 
[1] P. Werbos, "Beyond regression: New tools for prediction and analysis in behavioral sci- 
ences", Harvard University dissertation, 1974. 
[2] D. Parker, "Learning logic", MIT Tech Report TR-47, Center for Computational Research 
in Economics and Management Science, 1985. 
[3] Y. Le Cun, "A learning scheme for asymmetric threshold network" Proceedings of Cog- 
nitiva, Paris, June 1985. 
[4] D. Rumelhart, G.Hinton, and R. Wilhams, "earning internal representations by error 
propagation" in D. Rumelhart, J. McLelland and the PDP research group (Eds.), Parallel 
distributed processing: Explorations in the microstructure of co9nition, Vol. 1, MIT Press, 
Cambridge, MA, 1986. 
[5] J. Hopfield, "Neurons with graded response have collective computational properties hke 
those of two-state neurons", Proc. Natl. Acad. Sci. USA, May 1984. 
[6] L. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinato- 
rim environment" Proc. of the First Int. Annual Conf. on Neural Networks, San Diego, 
June 1987. 
[7] R. Rohwer, and B. Forrest, "Training time-dependence in neural networks", Proc. of the 
First Int. Annual Conf. on Neural Networks, San Diego, June 1987. 
[8] F. Pineda, "Generahzation of back-propagation to recurrent neural networks", Phys. Rev. 
Lett., vol. 59, no. 19, 9 Nov. 1987. 
