Recurrent Neural Networks Can Learn to 
Implement Symbol-Sensitive Counting 
IMul Rodriguez 
Department of Cognitive Science 
University of California, San Diego 
La Jolla, CA. 92093 
prodrigu @ cogsci.ucsd.edu 
Janet Wiles 
School of Information Technology and 
Department of Psychology 
University of Queensland 
Brisbane, Queensland 4072 Australia 
janetw@it.uq.edu.au 
Abstract 
Recently researchers have derived formal complexity analysis of analog 
computation in the setting of discrete-time dynamical systems. As an 
empirical consWast, training recurrent neural networks (RNNs) produces 
seff-organizeA systems that are realizations of analog mechanisms. Pre- 
vious work showed that a RNN can learn to process a simple context-free 
language (CFL) by counting. Herein, we extend that work to show that a 
RNN can learn a harder CFL, a simple palindrome, by organizing its re- 
sources into a symbol-sensitive counting solution, and we provide a dy- 
namical systems analysis which demonstrates how the network can not 
only count, but also copy and store counting information. 
1 INTRODUCTION 
Several researchers have recently derived results in analog computation theory in the set- 
ting of discrete-time dynamical systems(Siegelmann, 1994; Maass & Opren, 1997; Moore, 
1996; Casey, 1996). For example, a dynamical recognizer (DR) is a discrete-time continu- 
ous dynamical system with a given initial starting point and a finite set of Boolean output 
decision functions(Pollack, 1991; Moore, 1996; see also Siegelmann, 1993). The dynami- 
cal system is composed ofa space,R n , an alphabet A, a set of functions (1 per element of A) 
that each maps n _+ n and an accepting region H, in n. With enough precision and 
appropriate diffeaemtial equations, DRs can use real-valued variables to encode contents of 
a stack or counter (for details see Siegelmann, 1994; Moore, 1996). 
As an empirical contrast, training recurrent neural networks (RNNs) produces self- 
organized implementations of analog mechanisms. In previous work we showed that an 
RNN can learn to process a simple context-free language, a nb n, by organizing its resources 
into a counter which is similar to hand-coded dynamical recognizers but also exhibits some 
88 P. Rodriguez and J. Wiles 
novelties (Wiles & Elman, 1995). In particuiar, similar to hand-coded counters, the network 
developed proportional contracting and expanding rates and precision matters - but unex- 
pectedly the network distributed the contraction/expansion axis among hidden units, devel- 
oped a saddle point to transition between the first haft and second half of a string, and used 
oscillating dynamics as a way to visit regions of the phase space around the fixed points. 
In this work we show that an RNN can implement a solution for a harder CFL, a simple 
palindrome language(bed below), which requires a symbol-sensitive counting solu- 
tion. We provide a dynamical systems analysis which demonstrates how the network can 
not only count, but also copy and store counting information implicitly in space around a 
2 TRAINING an RNN TO PROCESS CFLs 
We use a discrete-time RNN that has 1 hidden layer with recmxet connections, and 1 output 
layer without recurrent connections so that the accepg regions are determined by the out- 
put units. The RNN processes output in Tune(n), where n is the length of the input, and it 
can recogni?e languages that are a proper subset of context-sensitive languages and a proper 
superset of regular languages(Moore, 1996). Consequently, the RNN we investigate can in 
principle embody the computational power needed to process self-recursion. 
Furthermore, many connectionlst models of language processing have used a prediction 
task(e.g. Elman, 1990). Hence, we trained an RNN to be a real-time transducer version 
of a dynamical recognizer that predicts the next input in a sequence. Although the network 
does not explicitly accept or reject strings, if our network makes all the right predictions 
possible then performing the prediction task subsumes the accept task, and in principle one 
could simply rejea;t unmatched predictions. We used a threxhhold criterion of .5 such that if 
an ouput node has a value greatex than .5 then the network is considered to be making that 
prediction. If the network makes all the right predictions possible for some input string, 
then it is correctly proceasing that string. Although a finite dimensional RNN cannot pro- 
cess CFLs robustly with a margin for error (e.g.Casey, 1996;Maass and Orponen, 1997), we 
will show that it can acquire the right kind of trajectory to proce the language in a way 
that genemliz_ to longer strings. 
2.1 A SIMPLE PALINDROME LANGUAGE 
A palindrome language (mirror language) consists of a set of strings, S, such that each 
string, s 6S, s = ww r, is a concatenation of a substring, w, and its reverse, w r. The rele- 
vant aspect of this language is that a mechanism cannot use a simple counter to process the 
string but must use the functional equivalent of a stack that enables it to match the symbols 
in second haft of the string with the first haft. 
We investigated a palindrome language that uses only two symbols for w, two other sym- 
bols for w r, such that the second haft of the string is fully predictable once the change in 
symbols occurs. The language we used is a simple version restricted such that one sym- 
bol is always present and precedes the other, for example: w = a" m, w r = B m A" e.g. 
aaaabbbBBBAAAA, (where n > 0, m >= 0). Note that the embedded subsequence 
bmB m is just the simple-CFL used in W'des & Elman (1995) as mentioned above, hence, 
one can reasonably expect that a solution to this task has an embedded counter for the sub- 
sequence b...B. 
2.2 LINEAR SYSTEM COUNTERS 
A basic counter in analog computation theory uses real-valued precision (e.g. Siegelman 
1994; Moore 1996). For example, a 1-dimensional up/down counter for two symbols { a, b } 
RNNs Can Learn Symbol-Sensitive Counting 89 
is the system f(z) = .Sz + .Sa, f(z) = 2z - .Sb where z is the state variable, a is the input 
variable to count up(push), and b is the variable to count down(pop). A sequence of input 
aaabbb has state values(sg at 0): .5,.75,875, .75,.5,0. 
Similarly, for our transducer version one can develop piecewise linear system equations in 
which counting takes place along different dimensions so that different predictions can be 
made at appropriate time steps I . The linear system serves as a hypothesis before running any 
simulations to understand the implementation issues for an RNN. For example, using the 
function f(z) = z for z  [0, 1], 0 for z < 0, 1 for z > 1, then for the simple palindrome 
task one can explicitly encode a mechanism to copy and store the count for a across the 
b...B subsequences. If we assign dimension-1 to a, dimension-2 to b, dimension-3 to A, 
dimension-4 to B, and dimension-5 to store the a value, we can build a system so that for 
a sequence aaabbBBAAA we get state variables values: initial, (0,0,0,0,0), (.5,0,0,0,0), 
(.75,0,0,0,0), (.875,0,0,0,0), (0,.5,0,0,.875), (0,.75,0,0,.875), (0,0,0,.5,.875), (0,0,0,0,.875), 
(0,0,.75,0,0), (0,0,.5,0,0), (0,0,0,0,0). The matrix equations for such a system could be: 
0 .5 0 0 0 0 .5 0 -5 
Xt = f( 0 2 0 2 * Xt-l + 0 0 --1 --5 *It) 
2 0 2 0 0 -5 0 -1 
1 0 0 0 1 -5 0 -5 0 
where t is time, Xt is the 5-dimensional state vector,/t is the 4-dimensional input vector 
using 1-hot encoding ofa = [1, O,O,O];& = [O,I,O,O];A = [O, O, I, O], B = [0,0,0,1]. 
The simple trick is to use the input weights to turn on or off the counting. For eample, 
the dimension-5 state variable is mined off when input is a or A, but then turned on when 
b is input, at which time it copies the last a value and holds on to it. It is then easy to add 
Boolean output decision functions that keep predictions linearly separable. 
However, other solutions are possible. Rather than store the a count one could keep count- 
ing up in dimension-1 for b input and then cancel it by counting down for B input. The 
questions that arise are: Can an RNN implement a solution that generalizes? What kind of 
store and copy mechanism does an RNN discover? 
2.3 TRAINING DATA & RESULTS 
The training set consists of 68 possible strings of total length <_ 25, which means a maxi- 
m}m of n + rn = 12, or 12 symbols in the first haft, 12 symbols in the second half, and 
1 end symbol 2. Th complete training set has more short strings so that the network does 
not disregard the transitions at the end of the string or at the end of the b...B subsequence. 
The network consists of 5 input, 5 hiddeAl, 5 output onits, with a bias node. The hidden and 
recurrent units are updated in the same time step as the input is presented. The recurrent 
layer activations are input on the next time step. The weight updates are performed using 
back-propagation thru time training with error injected at each time step backward for 24 
time steps for each input. 
We found that about haft our simulations learn to make lxedions for transitions, and most 
will have few genemlizons on longer strings not seen in the training set. However, no 
network learned the complete training set perfectly. The best network was trained for 250K 
sweeps (1 per character) with a learning parameter of .001, and 136K more sweeps with 
.0001, for a total of about 51K strings. The network made 28 total prediction errors on 28 
IThese can be expanded relatively easily to include more symbols, different symbol representa- 
tions, harder palindwme sequences, or different kind of decision planes. 
2We removed training strings to -- anb,for n > 1; it tums out that the network interpolates on 
the B-to-A transition for these. Also, we added an end symbol to help reset the system to a consistent 
starting value. 
90 P. Rodriguez and J. Wiles 
different strings in the test set of 68 possible strings seen in training. All of these errors were 
isolated to 3 situations: when the number of a input = 2or4 the error occurred at the B-to- 
A transition, when the number of a input = 1, for m > 2, the error occurred as an early 
A-to-end transition. 
Importantly, the network rnac!e correct predictions on many strings longer than seen in train- 
ing, e.g. strings that have total length > 25 (or n + m > 12). It counted longer strings 
of a..As with or without embedded b..Bs; such as: w = aa; w = aab2; w = anb ?, n = 
6, 7or8 (recall that w is the first half of the string). It also generalized to count longer subse- 
quences ofb..Bs with or without more a..As; such as w = ash ", where n = 8, 9, 10, 11, 12. 
The longest string it processed correctly was w = a9b 9, which is 12 more characters than 
seen during training. The network learned to store the count for a 9 for up to 9bs, even though 
the longest example it had seen in training had only 3bs - clearly it's doing something right. 
2.4 NETWORK EVALUATION 
Our evaluation will focus on how the best network counts, copies, and stores information. 
We use a mix of graphical analysis and linear system analysis, to piece together a global 
picture of how phase space trajectories hold informational states. The linear system analysis 
consists of investigating the local behaviour of the Jacobian at fixed points under each input 
condition separately. We refer to Fa as the autonomous system under a input condition and 
similarly for Fb, FA, and FB. 
The most salient aspect to the solution is that the network divides up the processing along 
different dimensions in space. By inspection we note that hidden unitl (HU1) takes on low 
values for the first half of the string and high values for the second half, which helps keep 
the processing linearly separable. Therefore in the graphical analysis of the RNN we can 
set HU1 to a constant. 
First, we can evaluate how the network counts the b..B subsequences. Again, by inspection 
the network uses dimensions HU3,HU4. The graphical analysis in Figure la and Figure lb 
plots the activity of HU3xHU4. It shows how the network counts the right number of Bs and 
then makes a transition to predict the first A. The dominant eigenvalues at the Ft, attracting 
point and FB saddle point are inversely proportional, which indicates that the contraction 
rate to and expansion rate away from the fixed points are inversely matched. The FB sys- 
tem expands out to a periodic-2 fixed point in HU3xHU4 subspace, and the unstable eigen- 
vector corresponding to the one unstable eigenvalue has components only in HU3,HU4. In 
Filgure 2 we plot the vector field that descri the flow in phase space for the composite 
F, which shows the direction where the system contracts along the stable manifold, and 
expands on the unstable manifold. One can see that the nature of the transition after the last 
b to the first B is to place the state vector close to saddle point for FB so that the nnmber of 
expansion steps matches the number of the Ft, contraction steps. In this way the b count is 
copied over to a different region of phase space. 
Now we evaluate how the network counts a...A, first without any b...B embedding. Since 
the output unit for the end symbol has very high weight values for HU2, and the Fa system 
has little activity in HU4, we note that a is processed in HU2xHU3xHU5. The trajectories 
in Figure 3 show a plot of aXaAXa that properly predicts all As as well as the transition at 
the end. Furthermore, the dominant eigenvalues for the F attracting point and the Ft sad- 
dle point are nearly inversely proportional and the FA system expands to a periodic-2 fixed 
point in 4-dimensions (HU1 is constant, whereas the other HU values are periodic). The 
F eigenvectors have strong-moderate components in dimensions HU2, HU3, HU5; and 
likewise in HU2, HU3, HU4, HU5 for FA. 
The much h_m'der question is: How does the network maintain the information about the 
count of as that were input while it is processing the b..B subsequence? Inspection shows 
RNNs Can Learn Symbol-Sensitive Counting 91 
that after processing a n the activation values are not directly copied over any HU values, 
nor do they latch any HU values that indicate how many as were processed. Instead, the 
last st_ate value after the last a affects the dynamics for b...B in such a way that clusters the 
last st_ate value after the last B, but only in HU3xHU4 space (since the other HU dimensions 
were unchanging throughout b...B processing). 
We show in Figure 4 the clusters for state variables in HU3xHU4 space after processing 
a" b m B m, where n = 2, 3, 4, 5or6; m = 1.. 10. The graph shows that the information about 
how many a's occtmed is "stored" in the HU3xHU4 region where points are clustered. Fig- 
ure 4 includes the dividing line from Figure lb for the predict A region. The network does 
not predict the B-to-A transition after a 4 or a 2 I:KN:alISe it ends up on the wrong side of the 
dividing line of Figure lb, but the network in these cases still predicts the A-to-end transi- 
tion. We see that if the network did not oscillate around the FB saddle point while exanding 
then the trajectory would end up correctly on one side of the decision plane. 
It is important to see that the dusters themeIves in Figure 4 are on a contracting trajec- 
tory toward a fixed point, which stores information about increasing number of as when 
matched by an expansion of the FA system. For example, the state values after aSAA and 
aSbmBmAA, m = 2..10 have a total hamming distance for all 5 dimensions that ranged 
from .070 to .079. Also, the fixed point for the Fa system, the estimated fixed point for the 
composite F o F n o Fg, and the saddle point of the FA system are colinear 3. in all the 
relevant counting dimensions: 2,3,4, and 5. In other words, the F system contracts the 
different coordinate points, one for a" and one for a" b'"B'", towards the saddle point to 
nearly the same location in phase space, treating those points as having the same informa- 
tion. Unfortunately, this is a contraction occuring through a 4 dimensional subspace which 
we cannot easily show graphically. 
3 CONCLUSION 
In conclusion, we have shown that an RNN can develop a symbol-sensitive counting so- 
lution for a simple palindrome. In fact, this solution is not a st__k but consists of non- 
independent counters that use dynamics to visit different regions at appropriate times. Fur- 
the'more, an RNN can implement counting solutions for a prediction task that are function- 
ally similar to that prescribed by analog computation theory, but the store and copy functions 
rely on distance in phase space to implicitly affect other trajectories. 
Acknowledgements 
This research was funded by the UCSD, Center for Research in Language Training Grant 
to Paul Rodriguez, and a grant from the Australian Research Council to Janet Wiles. 
References 
Casey, M. (1996) The Dynamics of Discrete-Tune Computation, With Application to Re- 
current Neural Networks and F'mite State Machine Extraction. Neural Computation, 8. 
Elman, LL. (1990) Finding Structure in Tune. Cognitive Science, 14, 179-211. 
Maass, W., Orponen, P. (1997) On the Effect of Analog Noise in Discrete-Tune Analog 
Computations. Proceedings Neural Information Processing Systems, 1996. 
Moore, C. (1996) Dynamical Recognizers: Real-Tune Language Recognition by Analog 
Computation. Santa Fe InstituteWorking Paper 96-05-023. 
Relative to the saddle point, the vector for one fixed point, multiplied by a constant had the same 
value(to within .05) in each of 4 dimensions as the vector for the other fixed point 
92 P. Rodriguez and J. Wiles 
HU4 
1 
0.8 
0.6 
0.4 
0.2 
predi=t b rion w J4 
0.8 
0.6 
0.4 
0.2 
0 
pmdia A region 
Figure 1: la)Trajectory of b x (after a s) in HU3xHU4. Each arrow repre,qents a trajectory 
step:the base is a state vector at time t, the head is a state at time t + 1. The first b trajectory 
step has a base near (.9,.05), which is the previous state from the last a. The output node 
b is > .5 above the dividing line. lb) Trajectory of B 1 (after a% 1) in HU3xHU4. The 
output node B is > .5 above the dashed dividing line, and the output node A is > .5 below 
the solid dividing line. The system crosses the line on the last B step, hence it predicts the 
B-to-A transition. 
Pollack, J.B. (1991) The Induction of Dynamical Recognizexs. Machine Learning, 7, 227- 
252. 
Siegelmann, H.(1993) Foundations of Recurrent Neural Networks. Ph.D. dissertation, un- 
published. New Brunswick Rutgers, The State University of New Jersey. 
Wiles, J., l!man, J. (1995) Counting Without a Counter:. A Case Study in Activation Dy- 
nantics. Procee4ings of the Seventeenth Annual Conference of the Cognitive Science So- 
ciety. Hillsdale, NJ.: Lawrence Erlba,m Associates. 
HU4 
0.8 
0.6 
0.4 
0.2 
o /' 
 '' .... 0.5 "' of, - -.u3 
0.2 0.4 
Figure 2: Vector field that describes the flow of F projected onto HU3xHU4. The graph 
shows a saddle point near (.5,.5)and a periodic-2 attracting point. 
RNNs Can Learn Symbol-Sensitive Counting 93 
1 
0.4 
0.2 
Oo 
0ol 
0.4 
0.2 
0 0 
0 0.2 0.4 06 08 ' ' ro2 0 0.2 0.4 0.6 0.8 mJ2 
Figure 3: 3a) Trajectory of a ls projected onto HU2xHU3. The output node a is > .5 below 
and right of dividing line. The projection for HU2xHU5 is very shnilar. 3b) Trajectory of 
AS(after a ls) projected onto HU2xHU3. The output node for the end symbol is > .5 on 
the 13th trajectory step left of the solid dividing line, and it is > .5 on the 11th step left 
of the dashed dividing line (the hypaplane projection must use values at the appropriate 
time steps), hence the system predi the A-to-end transition. The graph for HU2xHU5 
and HU2xHU4 is very similar. 
1.0 
0.9 
0.8 
0.7 
0.6 
0.5 
0.4 
0.3 
0.2 
0.1 
II JI $1 61 41 
 d, 
0.2 0.3 0.4 O.S 0.6 0.7 0.8 0.9 1.0 
Figure 4: Clusters of last state values a'bmB m , m > 1, projected onto HU3xHU4. Notice 
that for increasing n the system oscillates toward an attracting point of the system F o 
