Time Warping Invariant Neural Networks 
Guo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee 
Institute for Advanced Computer Studies 
and 
Laboratory for Plasma Research, 
University of Maryland 
College Park, MD 20742 
Abstract 
We propose a model of Time Warping Invariant Neural Networks (TWINN) to handle time-warped continuous signals. Although TWINN is a simple modification of the well-known recurrent neural network, analysis shows that TWINN completely removes time warping and is able to handle difficult classification problems. It is also shown that TWINN has certain advantages over the currently available sequential processing schemes: Dynamic Programming (DP)[1], Hidden Markov Models (HMM)[2], Time Delayed Neural Networks (TDNN)[3] and Neural Network Finite Automata (NNFA)[4].
We also analyze the time continuity employed in TWINN and point out that this kind of structure can memorize a longer input history than the Neural Network Finite Automata (NNFA). This may help explain the well-accepted fact that, when learning grammatical inference with an NNFA, one has to start with very short strings in the training set.
The numerical example we used is a trajectory classification problem. This problem, which features variable sampling rates, internal states, continuous dynamics, heavily time-warped data and deformed phase-space trajectories, is shown to be difficult for other schemes. With TWINN this problem was learned in 100 iterations. As a benchmark we also trained the exact same problem with TDNN, which failed completely, as expected.
I. INTRODUCTION 
In dealing with temporal pattern classification or recognition, time warping of input signals is one of the difficult problems we often encounter. Although a number of schemes are available to handle time warping, e.g. Dynamic Programming (DP) and Hidden Markov Models (HMM), these schemes have their own shortcomings in certain respects. More discouraging is that, as far as we know, there are no efficient neural network schemes to handle time warping. In this paper we propose a model of Time Warping Invariant Neural Networks (TWINN) as a solution. Although TWINN is only a simple modification of the well-known recurrent neural net structure, analysis shows that TWINN has the built-in ability to remove time warping completely.
The basic idea of TWINN is straightforward. If one plots the state trajectories of a continuous 
dynamical system in its phase space, these trajectory curves are independent of time warping 
because time warping can only change the time duration when traveling along these trajectories 
and does not affect their shapes and structures. Therefore, if we normalize the time dependence 
of the state variables with respect to any phase space variable, say the length of trajectory, the 
neural network dynamics becomes time warping invariant. 
To illustrate the power of TWINN we tested it with a numerical example of trajectory classification. This problem, chosen as a typical problem that TWINN could handle, has the
following properties: (1). The input signals obey a continuous time dynamics and are sampled 
with various sampling rates. (2). The dynamics of the de-warped signals has internal states. (3). 
The temporal patterns consist of severely time warped signals. 
To our knowledge there have not been any neural network schemes which can deal with this case effectively. We tested it with TDNN and it failed to learn.
In the next section we will introduce the TWINN and prove its time warping invariance. In 
Section III we analyze its features and identify the advantages over other schemes. The numer- 
ical example of the trajectory classification with TWINN is presented in Section IV.
II. TIME WARPING INVARIANT NEURAL NETWORKS (TWINN) 
To process temporal signals, we consider a fully recurrent network, which consists of two groups of neurons: the state neurons (or recurrent units) represented by the vector S(t), and the input neurons that are clamped to the external input signals {I(t), t = 0, 1, 2, ..., T-1}. The Time Warping Invariant Neural Network (TWINN) is simply defined as:
S(t+1) = S(t) + l(t) F(S(t), W, I(t))                                            (1)
where W is the weight matrix, l(t) is the distance between two consecutive input vectors defined by the norm
l(t) = ||I(t+1) - I(t)||                                                         (2)
and the mapping function F is a nonlinear function usually referred to as the neural activity function. For a first-order network, for example, it could take the form
F_i(S(t), W, I(t)) = Tanh( Σ_j W_ij (S(t) ⊕ I(t))_j )                            (3)
where Tanh(x) is the hyperbolic tangent function and the symbol ⊕ stands for vector concatenation.
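As a concrete illustration of Eqs. (1)-(3), the following is a minimal NumPy sketch of one TWINN update step. The function name twinn_step and the array shapes are our own choices, and the first-order Tanh form of Eq. (3) is assumed; treat this as a sketch, not the original implementation.

```python
import numpy as np

def twinn_step(S, I_t, I_next, W):
    """One TWINN update, Eq. (1): S(t+1) = S(t) + l(t) * F(S(t), W, I(t)).

    S      : state vector, shape (N,)
    I_t    : input vector at time t, shape (M,)
    I_next : input vector at time t+1 (needed only for the norm l(t))
    W      : weight matrix, shape (N, N + M)
    """
    l_t = np.linalg.norm(I_next - I_t)          # Eq. (2): distance between consecutive inputs
    x = np.concatenate([S, I_t])                # vector concatenation S(t) (+) I(t)
    F = np.tanh(W @ x)                          # Eq. (3): first-order activity function
    return S + l_t * F                          # Eq. (1)
```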
For the purpose of classification (or recognition), we assign a target final state S_k, (k = 1, 2, 3, ..., K), for each category of patterns. After we feed the whole sequence {I(0), I(1), I(2), ..., I(T-1)} into the TWINN, the state vector S(t) will reach the final state S(T). We then compare S(T) with the target final state S_k for each category k, (k = 1, 2, 3, ..., K), and calculate the error:
e_k = ||S(T) - S_k||^2                                                           (4)
The pattern is classified into the category with the minimal error. The ideal error is zero.
For the purpose of training, we are given a set of training examples for each category. We then minimize the error functions given by Eq. (4) using either the back-propagation[7] or the forward propagation algorithm[8]. The training process can be terminated when the total error reaches its minimum.
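A hedged sketch of this classification rule, reusing the twinn_step function above; the names classify and targets, and the sequence layout, are our own assumptions for illustration.

```python
def classify(inputs, S0, W, targets):
    """Run TWINN over a whole input sequence and pick the nearest target state.

    inputs  : array of shape (T, M), the sequence {I(0), ..., I(T-1)}
    S0      : initial state, shape (N,)
    W       : weight matrix, shape (N, N + M)
    targets : array of shape (K, N), one target final state S_k per category
    """
    S = S0
    for t in range(len(inputs) - 1):
        S = twinn_step(S, inputs[t], inputs[t + 1], W)
    errors = np.sum((targets - S) ** 2, axis=1)   # Eq. (4): e_k = ||S(T) - S_k||^2
    return int(np.argmin(errors)), errors
```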
The formula of TWINN as shown in Eq. (1) does not look new. The subtle difference from widely used models is the introduction of the normalization factor l(t) in Eq. (1). The main advantage of doing this lies in the built-in time warping invariance. This can be seen directly from its continuous version.
As Eq. (1) is a discrete implementation of continuous dynamics, we can easily convert it into a continuous version by replacing "t+1" by "t+Δt" and letting Δt → 0. By doing so, we get
lim_{Δt→0} [S(t+Δt) - S(t)] / ||I(t+Δt) - I(t)|| = dS/dL                         (5)
where L is the input trajectory length, which can be expressed as an integral
L(t) = ∫_0^t ||dI(τ)/dτ|| dτ                                                     (6)
or as a summation (as in the discrete version)
L(t) = Σ_{τ=0}^{t-1} ||I(τ+1) - I(τ)||                                           (7)
For deterministic dynamics, the distance L(t) is a single-valued function of t. Therefore, we can make a unique mapping from t to L, and any function of t can be transformed into a function of L in terms of this mapping. For instance, the input trajectory I(t) and the state trajectory S(t) can be transformed into I(L) and S(L). By doing so, the discrete dynamics of Eq. (1) becomes, in the continuous limit,
dS/dL = F(S(L), W, I(L))                                                         (8)
It is obvious that there is no explicit time dependence in Eq. (8) and therefore the dynamics rep- 
resented by Eq. (8) is time warping independent. 
To be more specific, if we draw the trajectory curves of I(t) and S(t) in their respective phase spaces, these two curves are not deformed if we only change the time duration of travel along the curves. Therefore, if we generate several input sequences {I(t)} using different time warping functions and feed them into the TWINN represented by Eq. (8) or Eq. (1), the induced state dynamics S(L) would be the same. Meanwhile, the final state is the sole criterion for classification. Therefore, any time warped signals would be classified by the TWINN as the same. This is what we mean by "time warping invariant".
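This invariance can be checked numerically: if the same underlying curve is sampled with two different warping functions, the final TWINN states produced by Eq. (1) should coincide up to discretization error. A minimal sketch, reusing the twinn_step function above with an arbitrary test curve and random weights of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 2
W = 0.5 * rng.standard_normal((N, N + M))       # arbitrary fixed weights (not trained)

def run_twinn(inputs, W, N):
    S = np.zeros(N)
    for t in range(len(inputs) - 1):
        S = twinn_step(S, inputs[t], inputs[t + 1], W)
    return S

curve = lambda t: np.stack([np.sin(t), np.cos(t) * np.sin(t)], axis=-1)  # test trajectory

t_uniform = np.linspace(0, 2 * np.pi, 400)                # evenly sampled time axis
t_warped  = 2 * np.pi * np.sort(rng.random(400))          # randomly warped time axis

S_uniform = run_twinn(curve(t_uniform), W, N)
S_warped  = run_twinn(curve(t_warped), W, N)
print(np.abs(S_uniform - S_warped).max())    # small: the two final states nearly coincide
```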
III. ANALYSIS OF TWINN VS. OTHER SCHEMES 
We emphasize two points in this section. First, we analyze the advantages of TWINN over other neural network structures, like TDNN, and over mature and well-known algorithms for time warping, such as HMM and Dynamic Programming. Second, we analyze the memory capacity for input history of both the continuous dynamical network of Eq. (1) and its discrete counterpart, the Neural Network Finite Automata used for grammatical inference by Liu [4], Sun [5] and Giles [6]. We will show by a rough mathematical estimate that the continuity employed in TWINN increases the power of memorizing history compared with the NNFA.
The Time Delayed Neural Network (TDNN)[3] has been a useful neural network structure for processing temporal signals and has achieved success in several applications, e.g. speech recognition. Traditional neural network structures are either feedforward or recurrent; the TDNN is something in between. The power of the TDNN is in its combination of spatial processing (as in a feedforward net) and sequential processing (as in a recurrent net with short-time memory). The TDNN can therefore detect the local features within each windowed frame, store their voting scores in the short-time memory neurons, and then make a final decision at the end of the input sequence. This technique is suitable for processing temporal patterns where the classification is decided by the integration of local features. But it cannot handle long-time correlation across time frames like a state machine, and it does not tolerate time warping effectively: each time-warped pattern will be treated as a new feature. Therefore, the TDNN is not able to handle the numerical example given in this paper, which has both severe time warping and internal states (long-time correlation). The benchmark test has been performed and it confirmed our prediction. Indeed, as will be seen later, in our examples all windowed frames contain similar local features no matter which category they belong to; a simple integration of local features does not contribute directly to the final classification, rather the whole signal history decides the classification.
As for Dynamic Programming, it is to date the most efficient way to cope with the time warping problem. The most impressive feature of dynamic programming is that it accomplishes a global search among all N^N possible paths using only ~O(N^2) operations, where N is the length of the input time series and, of course, one operation here represents all the calculations involved in evaluating the "score" of one path. But, on the other hand, this is still not ideal. If we could do the time warping using a recurrent network, the number of operations would be reduced to ~O(N). This is a dramatic saving. Another undesirable feature of the current dynamic warping scheme is that the recognition or classification result depends heavily on the pre-selected templates, and therefore one may need a large number of templates for a better classification rate. By adding one or two templates we actually double or triple the number of operations. Therefore, the search for a neural network time warping scheme is a pressing task.
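For reference, the kind of dynamic-programming search alluded to here is exemplified by the dynamic time warping recursion of [1]. The sketch below is our own simplified scalar version (not the formulation used in this paper); it makes the ~O(N^2) cost of the global path search visible as the filling of an (N+1) x (N+1) table.

```python
import numpy as np

def dtw_distance(a, b):
    """Simplified dynamic-time-warping distance between two 1-D sequences.

    Fills an (n+1) x (m+1) cost table, so the global search over all
    warping paths costs ~O(N^2) "operations", as discussed above.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```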
Another available technique for time warping is the Hidden Markov Model (HMM), which has been successfully applied in speech recognition. The way the HMM deals with time warping is through the statistical behavior of its hidden state transitions. Starting from one state q_i, the HMM allows a certain probability a_ij of moving forward to another state q_j. Therefore, for any given HMM one can generate various state sequences, say q1 q2 q2 q3 q4 q4 q5, q1 q2 q2 q2 q3 q3 q4 q4 q5, etc., each with a certain occurrence probability. But these state sequences are "hidden"; the observed part is a set of speech data or symbols, represented by {s_k} for example. The HMM also includes a set of observation probabilities B = {b_jk}, so that when it is in a certain state, say q_j, the HMM allows each symbol from the set {s_k} to occur with probability b_jk. Therefore, for any state sequence one can generate various series of symbols. As an example, let us consider one simple way to generate symbols: in state q_j we generate the symbol s_j (with probability b_jj). By doing so, the two state sequences mentioned above would correspond to two possible symbol sequences: s1 s2 s2 s3 s4 s4 s5 and s1 s2 s2 s2 s3 s3 s4 s4 s5. Examining the two strings closely, we find that the second one may be considered a time warped version of the first one, or vice versa. If we present these two strings to the HMM for testing, it will accept them with similar probabilities. This is the way the HMM tolerates time warping. These state transition probabilities of the HMM are learned from the statistics of the training set by using re-estimation formulas. In this sense, the HMM does not deal with time warping directly; instead, it learns the statistical distribution of a training set which contains time warped patterns. Consequently, if one presents a test pattern with time warped signals that is far from the statistical distribution of the training set, it is very unlikely that the HMM will recognize this pattern.
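To make this state-sequence picture concrete, here is a toy sketch of our own construction (not a trained HMM): a left-to-right Markov chain whose self-loops produce exactly this kind of time-stretched symbol string. States 0..4 stand in for q1..q5, and we assume the simple emission rule b_jj = 1 used in the example above.

```python
import numpy as np

def generate_symbols(A, start=0, final=4, rng=None):
    """Generate one symbol string from a toy left-to-right Markov chain.

    A[i, j] is the transition probability a_ij; for illustration we emit
    symbol s_j whenever the chain is in state q_j (i.e. b_jj = 1).
    Self-loops (a_jj > 0) stretch the string in time.
    """
    rng = rng or np.random.default_rng()
    state, symbols = start, [start]
    while state != final:
        state = rng.choice(len(A), p=A[state])
        symbols.append(state)
    return symbols

# 5-state left-to-right chain: stay in place with probability 0.5, else advance.
A = np.array([[0.5, 0.5, 0,   0,   0  ],
              [0,   0.5, 0.5, 0,   0  ],
              [0,   0,   0.5, 0.5, 0  ],
              [0,   0,   0,   0.5, 0.5],
              [0,   0,   0,   0,   1.0]])
print(generate_symbols(A))   # e.g. [0, 0, 1, 2, 2, 3, 4]: a time-warped path through q1...q5
```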
On the contrary, the model of TWINN we propose here has an intrinsic built-in time warping invariance. Although the TWINN itself has internal states, these internal states are not used for tolerating time warping. Instead, they are used to learn the more complex behavior of the "de-warped" trajectories. In this sense, TWINN could be more powerful than HMM.
Another feature of TWINN that needs to be mentioned is its explicit expression of a continuous mapping from S(t) to S(t+1), as shown in Eq. (1). In our earlier work [4,5,6], to train an NNFA (Neural Network Finite Automaton), we used a discrete mapping
S(t+1) = F(S(t), W, I(t))                                                        (9)
where F is a nonlinear function, say the sigmoid function g(x) = 1/(1+e^{-x}). This model has been successfully applied to grammatical inference. The reason we call Eq. (1) a continuous mapping but Eq. (9) a discrete one, even though both of them are implemented in discrete time steps, is that there is an explicit infinitesimal factor l(t) in Eq. (1). Due to this factor the continuity of the state dynamics is guaranteed, by which we mean that the state variation S(t+1) - S(t) approaches zero if the input variation I(t+1) - I(t) does so. But, in general, the state variation S(t+1) - S(t) generated by Eq. (9) is of order one, regardless of how small the input variation is. If one starts from random initial weights, Eq. (9) provides discrete jumps between different, randomly distributed states, which is far away from any continuous dynamics.
We did a numerical test using the NNFA of Eq. (9) to learn the classification problem of continuous trajectories described in Section IV. For simplicity we did not include time warping, but the NNFA still failed to learn. The reason is that when we tried to train an NNFA to learn the continuous dynamics, we were actually forcing the weights to generate an almost identity mapping F from S(t) to S(t+1). This is a very strong constraint on the weight parameters: it drives the diagonal terms to positive infinity and the off-diagonal terms to negative infinity (when the sigmoid function is used). When this happens, learning gets stuck due to the saturation effect.
The failure of the NNFA may also come from its short history memory capacity compared with the continuous mapping of Eq. (1). It has been shown by many numerical experiments on grammatical inference [4, 5, 6] that to train an NNFA as in Eq. (9) effectively, one has to start with short training patterns (usually, sentence length ≤ 4). Otherwise, learning fails or is very slow. This is exactly what happened when learning the trajectory classification using the NNFA, where the lengths of our training patterns are in general considerably longer (normally ~60). But TWINN learned it easily. To understand the NNFA's failure and TWINN's success, in the following we analyze how the history information enters the learning process.
Consider the example of learning grammatical inference. Before training, since we have no a priori knowledge about the target values of the weights, we normally start with random initial values. On the other hand, during training the credit assignment (the weight correction ΔW) can only be done at the end of each input sequence. Consequently, each ΔW should explicitly contain information about all the symbols contained in that string; otherwise the learning is meaningless. But in a numerical implementation every variable, including both ΔW and W, has a finite precision, and any information beyond the precision range will be lost. Therefore, to compare which model has the longer history memory we need to examine how the history information relates to the finite precisions of ΔW and W.
Let us illustrate this point with a simple second-order connected fully recurrent network, and write both Eq. (1) and Eq. (9) in a unified form
S(t+1) = G^{t+1}(S(t), W, I(t))                                                  (10)
such that Eq. (1) is represented by
G^{t+1} = S(t) + l(t) g(K(t))                                                    (11)
and Eq. (9) is just
G^{t+1} = g(K(t))                                                                (12)
where K(t) is the weighted sum of the concatenation of the vectors S(t) and I(t)
K_i(t) = Σ_j W_ij (S(t) ⊕ I(t))_j                                                (13)
For a grammatical inference problem the error is calculated from the final state S(T) as
E = (S(T) - S_target)^2                                                          (14)
Learning is to minimize this error function. According to the standard error back-propagation scheme, the recurrent net can be viewed as a multi-layered net with identical weights between neurons at adjacent time steps: w(t) = W, where w(t) is the "t-th layer" weight matrix connecting input S(t-1) to output S(t). The total weight correction is the summation of the weight corrections at each layer. Using the gradient descent scheme one immediately has
ΔW = Σ_{t=1}^{T} δw(t) = -η Σ_{t=1}^{T} ∂E/∂w(t) = -η Σ_{t=1}^{T} (∂E/∂S(t)) (∂S(t)/∂w(t))       (15)
If we define new symbols, the vector u(t), the second-order tensor A(t) and the third-order tensor B(t), as
u_i(t) ≡ ∂E/∂S_i(t),    A_ij(t) ≡ ∂G_i^{t+1}/∂S_j(t),    B_ijl(t) ≡ ∂G_i^{t+1}/∂W_jl             (16)
the weight correction can be simply written as
ΔW = -η Σ_{t=1}^{T} u(t)·B(t)                                                    (17)
and the "error rate" u(t) can be back-propagated using the derivative chain rule
u(t) = u(t+1)·A(t)                                                               (18)
so that one easily obtains
u(t) = u(T)·A(T-1)·A(T-2)· ... ·A(t) = u(T)·Π_{τ=t}^{T-1} A(τ),    t = 1, 2, ..., T-1            (19)
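A hedged sketch of the forward/backward computation in Eqs. (15)-(19), covering both the NNFA step of Eq. (12) and the TWINN step of Eq. (11). Function names, array shapes, and the use of a backward difference for l(t) (as in Eq. (24) below) are our own assumptions; this is an illustration, not the original code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def twinn_gradients(I_seq, W, S_target, continuous=True):
    """Back-propagation through time, Eqs. (15)-(19), for one input sequence.

    I_seq : array (T, M) of inputs I(0..T-1);  W : (N, N+M) weight matrix.
    continuous=True  -> TWINN step of Eq. (11), Jacobians as in Eq. (22);
    continuous=False -> NNFA step of Eq. (12), Jacobians as in Eq. (20).
    Returns the final state and dE/dW; the weight correction is
    Delta W = -eta * dE/dW as in Eq. (17).
    """
    T, M = I_seq.shape
    N = W.shape[0]
    S = np.zeros(N)
    cache = []
    for t in range(1, T):                           # forward pass, store what backprop needs
        l_t = np.linalg.norm(I_seq[t] - I_seq[t - 1])   # backward difference, as in Eq. (24)
        x = np.concatenate([S, I_seq[t - 1]])
        g = sigmoid(W @ x)
        S = S + l_t * g if continuous else g
        cache.append((l_t, x, g))

    u = 2.0 * (S - S_target)                        # u(T) = dE/dS(T) for E = (S(T) - S_target)^2
    dW = np.zeros_like(W)
    for l_t, x, g in reversed(cache):               # backward pass, Eqs. (17)-(18)
        gp = g * (1.0 - g)                          # g'(K) = g(1 - g)
        scale = l_t if continuous else 1.0
        dW += (u * scale * gp)[:, None] * x[None, :]             # the u(t).B(t) term of Eq. (17)
        A = (np.eye(N) if continuous else 0.0) + scale * gp[:, None] * W[:, :N]
        u = u @ A                                                 # u(t) = u(t+1).A(t), Eq. (18)
    return S, dW
```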
First, let us examine the model of the NNFA in Eq. (9). Using Eqs. (12), (13) and (16), A_ij(t) and B_ijl(t) can be written as
A_ij(t) = g'(K_i(t)) W_ij,    B_ijl(t) = δ_ij (S(t-1) ⊕ I(t-1))_l                (20)
where g'(x) ≡ dg/dx = g(1-g) is the derivative of the sigmoid function and δ_ij is the Kronecker delta. If we substitute B_ijl(t) into Eq. (17), ΔW becomes a weighted sum of all input symbols {I(0), I(1), I(2), ..., I(T-1)}, each with a different weighting factor u(t). Therefore, to guarantee that ΔW contains the information of all input symbols {I(0), I(1), I(2), ..., I(T-1)}, the ratio |u(t)|_max / |u(t)|_min should be within the precision range of ΔW. This is the main point.
An exact mathematical analysis has not been done, but from a rough estimate we can gain some good understanding. From Eq. (19), u(t) is a matrix product of the A_ij(t), and u(1), the coefficient of I(0), contains the highest-order product of the A_ij(t). The key point is that the coefficient ratio between adjacent symbols, |u(t)|/|u(t+1)|, is of the order of |A_ij(t)|, which is a small value; therefore the earlier symbol information can be lost from ΔW due to its finite precision. It can be shown that g'(x) = g(x)(1-g(x)) ≤ 0.25 for any real value of x. Then we roughly have |A_ij(t)| = |g' W_ij| = |g(1-g) W_ij| ≤ 0.25, if we assume the values of the weights W_ij to be of order 1. Thus, the ratio R = |u(t)|_max / |u(t)|_min is estimated as
R = |u(1)| / |u(T)| = Π_{t'} |A(t')| < 2^{-2(T-1)}                               (21)
From Eq. (21) we see that if the input pattern length is T = 10 we need at least 2(T-1) ≈ 18 bits of computer memory to store the weight variables (including u, W and ΔW). If T = 60, as in the trajectory classification problem, it requires at least 128-bit weight variables. This is why the NNFA of Eq. (9) could not work.
Similarly, for the dynamics of Eq. (1), we use Eqs. (11), (13) and (16), and obtain
A_ij(t) = δ_ij + l(t) g'(K_i(t)) W_ij;    B_ijk(t) = l(t) δ_ij (S(t-1) ⊕ I(t-1))_k               (22)
From Eq. (22) we see that no matter how small the factor l(t) may be, |A_ij(t)| remains of order one. Therefore the ratio R = |u(t)|_max / |u(t)|_min, which is estimated as a product of the |A_ij(t)|, remains of order one, in contrast with the discrete case result of Eq. (21). Therefore, the contributions from all of {I(0), I(1), I(2), ..., I(T-1)} to the weight correction ΔW are of the same order. This prevents information loss during learning.
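A quick numerical check of this estimate (our own toy computation, with per-step gains chosen to mimic order-1 weights and small arc-length increments): the product of Jacobian-scale factors decays exponentially for the discrete map of Eq. (9) but stays of order one for Eq. (1).

```python
import numpy as np

rng = np.random.default_rng(1)
T = 60                                        # sequence length, as in the trajectory problem
l_t = rng.uniform(0.05, 0.15, size=T)         # small per-step arc-length increments (assumed)

# Per-step gain |A(t)|: at most 0.25 for the sigmoid NNFA (Eq. 20),
# but 1 + O(l(t)) for the TWINN map (Eq. 22).
gain_nnfa  = np.full(T, 0.25)
gain_twinn = 1.0 + 0.25 * l_t

print("NNFA  ratio |u(1)|/|u(T)| ~", np.prod(gain_nnfa))    # ~0.25^T ≈ 1e-36: early symbols lost
print("TWINN ratio |u(1)|/|u(T)| ~", np.prod(gain_twinn))   # of order one
```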
IV. NUMERICAL SIMULATION
We demonstrate the power of TWINN with a trajectory classification problem. The three 2-D trajectory equations are artificially given by
x(t) = sin(t + β) |sin(t)|        x(t) = sin(0.5t + β) sin(1.5t)        x(t) = sin(t + β) sin(2t)
y(t) = cos(t + β) |sin(t)|        y(t) = cos(0.5t + β) sin(1.5t)        y(t) = cos(t + β) sin(2t)        (23)
where β is a uniformly distributed random parameter. When β is changed, these trajectories are distorted accordingly. Some examples (three for each class) are shown in Fig. 1.
[Fig. 1: Phase space trajectories. Three different shapes of 2-D trajectory; each class is shown in one column with three examples. Recurrent neural networks are trained to recognize the different shapes of trajectory.]
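For concreteness, a minimal sketch of how the curves of Eq. (23) could be generated; the function name trajectory and the class numbering 0-2 are our own conventions.

```python
import numpy as np

def trajectory(class_id, beta, t):
    """2-D trajectories of Eq. (23); class_id in {0, 1, 2}, beta the random deformation."""
    if class_id == 0:
        return np.sin(t + beta) * np.abs(np.sin(t)), np.cos(t + beta) * np.abs(np.sin(t))
    if class_id == 1:
        return np.sin(0.5 * t + beta) * np.sin(1.5 * t), np.cos(0.5 * t + beta) * np.sin(1.5 * t)
    return np.sin(t + beta) * np.sin(2 * t), np.cos(t + beta) * np.sin(2 * t)
```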
The trajectory data are time series of two-dimensional coordinate pairs {x(t), y(t)} sampled along three different types of curves in the phase space. The neural net dynamics of TWINN is
S_i(t+1) = S_i(t) + l(t) Tanh( Σ_j W_ij (S(t) ⊕ I(t))_j ),    l(t) = sqrt( Σ_{i=1}^{6} (I_i(t) - I_i(t-1))^2 )        (24)
where we used 6 input neurons I = {1, x(t), y(t), x^2(t), y^2(t), x(t)y(t)} (normalized to norm = 1.0) and 4 (N = 4) state neurons S = {S_1, S_2, S_3, S_4}. The neural network structure is shown in Fig. 2.
[Fig. 2: Time Warping Invariant Neural Network for trajectory classification.]
[Fig. 3: Time Delayed Neural Network for trajectory classification.]
For training, we assign the desired final outputs for the three trajectory classes to be (1,0,0), (0,1,0) and (0,0,1) respectively. For recognition, each trajectory data sequence is fed to the input neurons and the state neurons evolve according to the dynamics of Eq. (24). At the end of the input series we check the last three state neurons and classify the input trajectory according to the "winner-take-all" rule.
In each iteration of training we randomly picked 150 deformed trajectories, 50 for each of the three categories, by choosing different values of β within 0 ≤ β < 2π. To simulate time warping we randomly sampled the data by choosing random time steps Δt = 2πr/T along each trajectory, where r is a random number between 0 and 2, with the sampling rate T = 60 for training patterns and T = 20 to 200 for testing patterns. Therefore, each training pattern is a time warped trajectory of average length 60. Using the RTRL algorithm[8] to minimize the error function, after 100 iterations of training it converged to a mean square error of ≈ 0.03.
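A sketch of the time-warped sampling just described, reusing the trajectory function above. The random-step formula Δt = 2πr/T and the feature normalization follow our reading of the text, so treat the details as assumptions.

```python
import numpy as np

def warped_sample(class_id, beta, T=60, rng=None):
    """Sample one time-warped training pattern of average length ~T over one period."""
    rng = rng or np.random.default_rng()
    t, times = 0.0, []
    while t < 2 * np.pi:                            # one period of the trajectory
        times.append(t)
        t += 2 * np.pi * rng.uniform(0, 2) / T      # random step Delta t = 2*pi*r/T, r in [0, 2)
    x, y = trajectory(class_id, beta, np.array(times))
    I = np.stack([np.ones_like(x), x, y, x**2, y**2, x * y], axis=1)   # 6 input features
    return I / np.linalg.norm(I, axis=1, keepdims=True)                # normalized to norm 1
```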
We tested the trained network with hundreds of randomly picked input sequences with different sampling rates (T from 20 to 200) and different warping functions (non-uniform step lengths). All input trajectories were classified correctly. If the sampling rates are too large (> 200) or too small (< 20), some classification errors occur.
We tested the same example with TDNN; see Fig. 3 for its parameters. The top layer contains three output neurons for the three classes of trajectories. The classification rules, error function and training patterns are the same as those of TWINN. After three days of training on a DEC-3100 workstation the training error (MSE) approached 0.5, and the error rate in testing was 70%.
V. CONCLUSION
We have proposed a model of Time Warping Invariant Neural Networks to handle temporal pattern classification where severely time warped and deformed data may occur. This model is shown to have a built-in time warping invariance. We have analyzed the properties of TWINN and shown that for trajectory classification it has several advantages over other schemes: HMM, DP, TDNN and NNFA.
We also numerically implemented the TWINN and easily trained it on a trajectory classification problem which our analysis shows to be difficult for other schemes. The same problem was also trained with TDNN, which failed.
References
[1] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-26, pp. 43-49, Feb. 1978.
[2] L. R. Rabiner and B. H. Juang, "An Introduction to Hidden Markov Models", IEEE ASSP Magazine, Vol. 3, No. 1, pp. 4-16, 1986.
[3] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang, "Phoneme Recognition Using Time-Delay Neural Networks", IEEE Transactions on Acoustics, Speech and Signal Processing, March 1989.
[4] Y. D. Liu, G. Z. Sun, H. H. Chen, C. L. Giles and Y. C. Lee, "Grammatical Inference and Neural Network State Machine", Proceedings of the International Joint Conference on Neural Networks, pp. I-285, Washington D.C., 1990.
[5] G. Z. Sun, H. H. Chen, C. L. Giles, Y. C. Lee and D. Chen, "Connectionist Pushdown Automata that Learn Context-Free Grammars", Proceedings of the International Joint Conference on Neural Networks, pp. I-577, Washington D.C., 1990.
[6] C. L. Giles, G. Z. Sun, H. H. Chen, Y. C. Lee and D. Chen, "Higher Order Recurrent Networks & Grammatical Inference", in Advances in Neural Information Processing Systems 2, D. S. Touretzky (ed.), pp. 380-386, Morgan Kaufmann, San Mateo, CA, 1990.
[7] D. Rumelhart, G. Hinton and R. Williams, "Learning internal representations by error propagation", in Parallel Distributed Processing, Vol. I, MIT Press, 1986; P. Werbos, "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences", Ph.D. thesis, Harvard University, 1974.
[8] R. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks", Neural Computation, Vol. 1, pp. 270-280, 1989.
