Time-Warping Network:
A Hybrid Framework for Speech Recognition

Esther Levin, Roberto Pieraccini, Enrico Bocchieri
AT&T Bell Laboratories
Speech Research Department
Murray Hill, NJ 07974 USA
ABSTRACT 
Recently, much interest has been generated regarding speech 
recognition systems based on Hidden Markov Models (HMMs) and 
neural network (NN) hybrids. Such systems attempt to combine the 
best features of both models: the temporal structure of HMMs and 
the discriminative power of neural networks. In this work we define 
a time-warping (TW) neuron that extends the operation of the formal 
neuron of a back-propagation network by warping the input pattern to 
match it optimally to its weights. We show that a single-layer 
network of TW neurons is equivalent to a Gaussian density HMM-
based recognition system, and we propose to improve the
discriminative power of this system by using back-propagation 
discriminative training, and/or by generalizing the structure of the 
recognizer to a multi-layered net. The performance of the proposed 
network was evaluated on a highly confusable, isolated-word, multi-speaker
recognition task. The results indicate that not only does the
recognition performance improve, but the separation between classes 
is enhanced also, allowing us to set up a rejection criterion to 
improve the confidence of the system. 
I. INTRODUCTION 
Since their first application in speech recognition systems in the late seventies, hidden
Markov models have been established as a most useful tool, mainly due to their ability 
to handle the sequential dynamical nature of the speech signal. With the revival of 
connectionism in the mid-eighties, considerable interest arose in applying artificial
neural networks to speech recognition. This interest was based on the discriminative
power of NNs and their ability to deal with non-explicit knowledge. These two 
paradigms, namely HMM and NN, inspired by different philosophies, were seen at first 
as different and competing tools. Recently, links have been established between these 
two paradigms, aiming at a hybrid framework in which the advantages of the two 
models can be combined. For example, Bourlard and Wellekens [1] showed that neural
networks with proper architecture can be regarded as non-parametric models for 
computing "discriminant probabilities" related to HMM. Bridle [2] introduced 
"Alpha-nets", a recurrent neural architecture that implements the alpha computation of 
HMM, and found connections between back-propagation [3] training and discriminative 
HMM parameter estimation. Predictive neural nets were shown to have a statistical 
interpretation [4], generalizing the conventional hidden Markov model by assuming 
that the speech signal is generated by nonlinear dynamics contaminated by noise. 
In this work we establish one more link between the two paradigms by introducing the 
time-warping network (TWN) that is a generalization of both an HMM-based 
recognizer and a back-propagation net. The basic element of such a network, a time-
warping neuron, generalizes the function of a formal neuron by warping the input
signal in order to maximize its activation. For a special case of network parameter
values, a single-layered network of time-warping (TW) neurons is equivalent to a 
recognizer based on Gaussian HMMs. This equivalence of the HMM-based recognizer
and single-layer TWN suggests ways of using discriminative neural tools to enhance
the performance of the recognizer. For instance, a training algorithm, like back- 
propagation, that minimizes a quantity related to the recognition performance, can be 
used to train the recognizer instead of the standard non-discriminative maximum
likelihood training. Then, the architecture of the recognizer can be expanded to 
contain more than one layer of units, enabling the network to form discriminant feature 
detectors in the hidden layers. 
This paper is organized as follows: in the first part of Section 2 we describe a simple
HMM-based recognizer. Then we define the time-warping neuron and show that a 
single-layer network built with such neurons is equivalent to the HMM recognizer. In
Section 3 two methods are proposed to improve the discriminative power of the
recognizer, namely, adopting neural training algorithms and extending the structure of 
the recognizer to a multi-layer net. For special cases of such a multi-layer architecture,
the net can implement a conventional or weighted [5] HMM recognizer. Results of
experiments using a TW network for recognition of the English E-set are presented in 
Section 4. The results indicate that not only does the recognition performance 
improve, but the separation between classes is enhanced also, allowing us to set up a 
rejection criterion to improve the confidence of the system. A summary and discussion 
of this work are included in Section 5. 
II. THE MODEL 
In this section we first describe the basic HMM-based speech recognition system that
is used in many applications, including isolated and connected word recognition [6] 
and large vocabulary subword-based recognition [7]. Though in this paper we treat the 
case of isolated word recognition, generalization to connected speech can be made as
in [6,7]. In the second part of this section we define a single-layered time-warping
network and show that it is equivalent to the HMM-based recognizer when certain
conditions constraining the network parameter values apply. 
II.1 THE HIDDEN MARKOV MODEL-BASED RECOGNITION SYSTEM 
A HMM-based recognition system consists of K N-state HMMs, where K is the
vocabulary size (number of words or subword units in the defined task). The k-th
HMM, $\lambda^k$, is associated with the k-th word in the vocabulary and is characterized by a
matrix $A^k = \{a_{ij}^k\}$ of transition probabilities between states,

$$a_{ij}^k = \Pr(s_t = j \mid s_{t-1} = i), \quad 0 \le i \le N, \ 1 \le j \le N, \tag{1}$$

where $s_t$ denotes the active state at time t ($s_0 = 0$ is a dummy initial state), and by a set
of emission probabilities (one per state):

$$\Pr(X_t \mid s_t = i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i^k|^{1/2}} \exp\left[-\frac{1}{2}(X_t - \mu_i^k)^* (\Sigma_i^k)^{-1} (X_t - \mu_i^k)\right], \quad i = 1, \ldots, N, \tag{2}$$

where $X_t$ is the d-dimensional observation vector describing some parametric
representation of the t-th frame of the spoken token, and $(\cdot)^*$ denotes the transpose
operation.

For the case discussed here, we concentrate on strictly left-to-right HMMs, where
$a_{ij}^k \neq 0$ only if $j = i$ or $j = i+1$, and a simplified case of (2) where all $\Sigma_i^k = I_d$, the
$d \times d$ unit matrix.
The system recognizes a speech token of duration T, $X = \{X_1, X_2, \ldots, X_T\}$, by
classifying the token into the class $k_0$ with the highest likelihood $L^k(X)$,

$$k_0 = \arg\max_k L^k(X). \tag{3}$$

The likelihood $L^k(X)$ is computed for the k-th HMM as

$$L^k(X) = \max_{[i_1, \ldots, i_T]} \log\left[\Pr(X \mid \lambda^k, s_1 = i_1, \ldots, s_T = i_T)\right] \tag{4}$$

$$= \max_{[i_1, \ldots, i_T]} \sum_{t=1}^{T} \left[-\frac{1}{2}\|X_t - \mu_{i_t}^k\|^2 + \log a_{i_{t-1} i_t}^k - \frac{d}{2}\log 2\pi\right].$$
The state sequence that maximizes (4) is found by using the Viterbi [8] algorithm. 
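As a concrete illustration, the maximization in (4) can be carried out with the standard Viterbi recursion. The sketch below is ours, not the original implementation; the function and variable names are assumptions, the entry transition from the dummy state is folded into the initialization, and the class-independent constant $-(Td/2)\log 2\pi$ is dropped.

```python
import numpy as np

def viterbi_likelihood(X, mu, log_a):
    """Best-path log-likelihood (4) for one left-to-right Gaussian HMM.

    X     : (T, d) observation vectors
    mu    : (N, d) state means (unit covariances assumed)
    log_a : (N, 2) log transition probs; log_a[j, 0] is the self-loop
            a_jj, log_a[j, 1] the forward transition a_{j, j+1}
    Returns the score up to the constant -(T*d/2)*log(2*pi),
    which is common to all models.
    """
    T, d = X.shape
    N = mu.shape[0]
    # local log-emission scores: -1/2 ||X_t - mu_j||^2
    local = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (T, N)
    score = np.full(N, -np.inf)
    score[0] = local[0, 0]          # the path must start in the first state
    for t in range(1, T):
        new = np.full(N, -np.inf)
        for j in range(N):
            stay = score[j] + log_a[j, 0]
            enter = score[j - 1] + log_a[j - 1, 1] if j > 0 else -np.inf
            new[j] = max(stay, enter) + local[t, j]
        score = new
    return score[N - 1]             # the path must end in the last state
```

Since the dropped constant is shared by all K models, the classification rule (3) is unaffected.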
II.2 THE EQUIVALENT SINGLE-LAYER TIME-WARPING NETWORK
A single-layer TW network is composed of K TW neurons, one for each word in the 
vocabulary. The TW neuron is an extension of a formal neuron that can handle 
dynamic and temporally distorted patterns. The k-th TW neuron, associated with the
k-th vocabulary word, is characterized by a bias $w_0^k$ and a set of weights $\tilde{W}^k =
\{\tilde{W}_1^k, \tilde{W}_2^k, \ldots, \tilde{W}_N^k\}$, where $\tilde{W}_j^k$ is a column vector of dimensionality d+2. Given an
input speech token of duration T, $X = \{X_1, X_2, \ldots, X_T\}$, the output activation $y^k$ of the
k-th unit is computed as

$$y^k = g\left(\sum_{t=1}^{T} \tilde{X}_t^* \tilde{W}_{i_t}^k + w_0^k\right) = g\left(\sum_{j=1}^{N} \Big(\sum_{t:\, i_t = j} \tilde{X}_t^*\Big) \tilde{W}_j^k + w_0^k\right), \tag{5}$$

where $g(\cdot)$ is a sigmoidal, smooth, strictly increasing nonlinearity, and $\tilde{X}_t = [X_t, 1, 1]$
is a (d+2)-dimensional augmented input vector. The corresponding indices $i_t$,
$t = 1, \ldots, T$, are determined by the following condition:

$$\{i_1, \ldots, i_T\} = \arg\max \sum_{t=1}^{T} \tilde{X}_t^* \tilde{W}_{i_t}^k + w_0^k. \tag{6}$$
In other words, a TW neuron warps the input pattern to match it optimally to its 
weights (6) and computes its output using this warped version of the input (5). The 
time-warping process of (6) is a distinguishing feature of this neural model, enabling it 
to deal with the dynamic nature of a speech signal and to handle temporal distortions. 
All TW neurons in this single-layer net recognizer receive the same input speech token 
X. Recognition is performed by selecting the word class corresponding to the neuron 
with the maximal output activation. 
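The operations (5) and (6) can be sketched together: the optimal warp is found by dynamic programming over the per-frame scores $\tilde{X}_t^* \tilde{W}_j$, and the activation is the sigmoid of the best path score plus the bias. This is an illustrative sketch under our own naming, assuming a strictly left-to-right warp ($i_1 = 1$, $i_T = N$, $i_{t+1} \in \{i_t, i_t + 1\}$) as in the HMM case:

```python
import numpy as np

def tw_neuron(X, W, w0):
    """Activation (5) of one time-warping neuron.

    X  : (T, d) input frames; augmented internally to [X_t, 1, 1]
    W  : (N, d+2) weight subvectors W~_j
    w0 : scalar bias
    Returns (y, path): the sigmoid activation and the warp indices (6).
    """
    T, d = X.shape
    N = W.shape[0]
    Xa = np.hstack([X, np.ones((T, 2))])   # augmented (d+2)-dim frames
    s = Xa @ W.T                           # s[t, j] = X~_t^* W~_j
    # dynamic programming over monotone left-to-right warps
    D = np.full((T, N), -np.inf)
    D[0, 0] = s[0, 0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        for j in range(N):
            stay = D[t - 1, j]
            enter = D[t - 1, j - 1] if j > 0 else -np.inf
            back[t, j] = j if stay >= enter else j - 1
            D[t, j] = max(stay, enter) + s[t, j]
    # backtrack the optimal warp
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    y = 1.0 / (1.0 + np.exp(-(D[T - 1, N - 1] + w0)))
    return y, path
```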
It is easy to show that when

$$\tilde{W}_j^k = \left[\mu_j^k, \ -\tfrac{1}{2}\|\mu_j^k\|^2, \ \log a_{jj}^k\right]^*, \tag{7a}$$

and

$$w_0^k = \sum_{j=1}^{N} \left(\log a_{j-1,j}^k - \log a_{jj}^k\right), \tag{7b}$$

this network is equivalent to an HMM-based recognition system, with K N-state
HMMs, as described above.¹
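For illustration, the mapping (7a,b) from HMM parameters to TW weights can be written out directly. This sketch is ours; it assumes unit covariances and strictly left-to-right transitions, with our own convention for storing the transition probabilities:

```python
import numpy as np

def hmm_to_tw_weights(mu, log_a):
    """Build TW-neuron weights from Gaussian-HMM parameters via (7a,b).

    mu    : (N, d) state means
    log_a : (N+1, 2) log transitions; row j holds [log a_jj, log a_{j,j+1}],
            with row 0 describing the dummy initial state s_0 = 0
    Returns (W, w0): the (N, d+2) weight subvectors and the scalar bias.
    """
    N, d = mu.shape
    W = np.zeros((N, d + 2))
    W[:, :d] = mu                              # mu_j^k
    W[:, d] = -0.5 * (mu ** 2).sum(axis=1)     # -1/2 ||mu_j^k||^2
    W[:, d + 1] = log_a[1:, 0]                 # log a_jj^k
    # (7b): the bias corrects for the N entry transitions a_{j-1,j}
    w0 = log_a[:N, 1].sum() - log_a[1:, 0].sum()
    return W, w0
```

With this choice, the inner products in (5) reproduce the path score of (4) up to the per-frame terms $-\tfrac{1}{2}\|X_t\|^2$ and $-\tfrac{d}{2}\log 2\pi$, which are common to all classes and therefore irrelevant to the classification.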
This equivalent neural representation of an HMM-based system suggests ways of
improving the discriminative power of the recognizer, while preserving the temporal
structure of the HMM, thus allowing generalization to more complicated tasks (e.g., 
continuous speech, subword units, etc.). 
III. IMPROVING DISCRIMINATION
There are two important differences between the HMM-based system and a neural net 
approach to speech recognition that contribute to the improved discrimination power of 
the latter, namely, training and structure. 
III.1 DISCRIMINATIVE TRAINING
The HMM parameters are usually estimated by applying the maximum likelihood 
approach, using only the examples of the word represented by the model and 
disregarding the rival classes completely. This is a non-discriminative approach: the 
learning criterion is not directly connected to the improvement of recognition accuracy. 
Here we propose to enhance the discriminative power of the system by adopting a
neural training approach. 
NN training algorithms are based on minimizing an error function E, which is related
to the performance of the network on the training set of labeled examples $\{X^l, Z^l\}$,
$l = 1, \ldots, L$, where $Z^l = [z_1^l, \ldots, z_K^l]^*$ denotes the vector of target neural outputs for the
l-th input token. $Z^l$ has +1 only in the entry corresponding to the right word class, and
-1 elsewhere. Then,

$$E = \sum_{l=1}^{L} E^l(Z^l, Y^l), \tag{8}$$

where $Y^l = [y_1^l, \ldots, y_K^l]^*$ is a vector of neural output activations for the l-th input
token, and $E^l(Z^l, Y^l)$ measures the distortion between the two vectors. One choice for
$E^l(Z^l, Y^l)$ is a quadratic error measure, i.e., $E^l(Z^l, Y^l) = \|Z^l - Y^l\|^2$. Other choices
include the cross-entropy error [9] and the recently proposed discriminative error
functions, which measure the misclassification rate more directly [10].
Gradient-based training algorithms (such as back-propagation) modify the
parameters of the network after the presentation of each training token to minimize the
error (8). The change in the j-th weight subvector of the k-th model after presentation
of the l-th training token, $\Delta^l \tilde{W}_j^k$, is proportional to the negative derivative of the error
$E^l$ with respect to this weight subvector,

$$\Delta^l \tilde{W}_j^k = -\alpha\,\frac{\partial E^l}{\partial \tilde{W}_j^k}, \quad 1 \le j \le N, \ 1 \le k \le K. \tag{9}$$

To compute the terms $\partial E^l / \partial \tilde{W}_j^k$ we have to consider (5) and (6), which define the
operation of the neuron. Equation (6) expresses the dependence of the warping indices
$i_1, \ldots, i_T$ on $\tilde{W}_j^k$. In the proposed learning rule we compute the gradient for the
quadratic error criterion using only (5):

$$\Delta^l \tilde{W}_j^k = \alpha\,(z_k^l - y_k^l)\,g'(\cdot) \sum_{t:\, i_t = j} \tilde{X}_t, \tag{10}$$

where the values of $i_t$ fulfill condition (6). Although the weights do not change
according to the exact gradient descent rule (since (6) is not taken into account for
back-propagation), we found experimentally that the error made by the network always
decreases after the weight update. This fact can also be proved when certain
conditions restricting the step-size $\alpha$ hold, and we conjecture that it is always true for
$\alpha > 0$.

¹ With minor changes we can show equivalence to a general Gaussian HMM, where the covariance
matrices are not restricted to be the unit matrix.
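The update (10) can be sketched for a single neuron as follows. The code is illustrative, with our own names; it assumes the per-state sums $\sum_{t:\, i_t = j} \tilde{X}_t$ have already been accumulated under the warp (6), which is held fixed during the step, exactly as in the text:

```python
import numpy as np

def tw_update(Xa_sums, y, z, W, alpha=0.01):
    """One approximate-gradient step (10) for a single TW neuron.

    Xa_sums : (N, d+2) summed augmented frames per state,
              Xa_sums[j] = sum over {t : i_t = j} of X~_t, with the
              warp indices i_t fixed by condition (6)
    y, z    : neuron output and its +/-1 target for this token
    W       : (N, d+2) current weight subvectors
    Returns the updated copy of W; the warp is NOT differentiated,
    so this is the approximate rule of the text, not exact descent.
    """
    gprime = y * (1.0 - y)   # derivative of the logistic sigmoid at y
    return W + alpha * (z - y) * gprime * Xa_sums
```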
III.2 THE STRUCTURE OF THE RECOGNIZER
When the equivalent neural representation of the HMM-based recognizer is used, there
exists a natural way of adaptively increasing the complexity of the decision boundaries 
and developing discriminative feature detectors. This can be done by extending the 
structure of the recognizer to a multi-layered net. There are many possible 
architectures that result from such an extension by changing the number of hidden 
layers, as well as the number and the type (i.e., standard or TW ) of neurons in the 
hidden layers. Moreover, the role of the TW neurons in the first hidden layer is 
different now: they are no longer class representatives, as in a single-layered net, but 
just abstract computing elements with built-in time scale normalization. In this work 
we investigate only a simple special case of such multi-layered architecture. The
multi-layered network we use has a single hidden layer, with N×K TW neurons. Each
hidden neuron corresponds to one state of one of the original HMMs, and is
characterized by a weight vector $\tilde{W}_j^k$ and a bias $w_j^k$. The output activation $h_j^k$ of the
neuron is given as

$$h_j^k = g(u_j^k), \quad u_j^k = \sum_{t:\, i_t = j} \tilde{X}_t^* \tilde{W}_j^k + w_j^k, \tag{11}$$

where the warping indices are determined by

$$\{i_1, \ldots, i_T\} = \arg\max \sum_{j=1}^{N} u_j^k.$$
The output layer is composed of K standard neurons. The activation of the output
neurons $y^k$, $k = 1, \ldots, K$, is determined by the hidden-layer activations as

$$y^k = g(H^* V^k + v_k), \tag{12}$$

where $V^k$ is an N×K-dimensional weight vector, H is the vector of hidden neuron
activations, and $v_k$ is a bias term.
In a special case of parameter values, when the $\tilde{W}_j^k$ satisfy the conditions (7a,b) and

$$w_j^k = \log a_{j-1,j}^k - \log a_{jj}^k, \tag{13}$$

the activation $h_j^k$ corresponds to an accumulated j-th state likelihood of the k-th HMM,
and the network implements a weighted [5] HMM recognizer, where the connection
weight vectors $V^k$ determine the relative weights assigned to each state likelihood in
the final classification. Such a network can learn to adapt these weights to enhance
discrimination by giving large positive weights to states that contain information
important for discrimination, and ignoring (by forming zero or near-zero weights)
those states that do not contribute to discrimination. A back-propagation algorithm
can be used for training this net.
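For illustration, the hidden activations (11) and the output layer (12) can be sketched together. The names are ours, and we assume the accumulated per-state scores $u_j^k$ have already been computed by warping the input against each original model, as in the single-layer case:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilayer_forward(U, V, v):
    """Hidden layer (11) and output layer (12) of the multi-layer TWN.

    U : (K, N) accumulated per-state scores u_j^k, one row per
        original model, computed under each model's optimal warp
    V : (K, N*K) output weight vectors V^k
    v : (K,) output biases
    Returns (H, Y): the N*K hidden activations h_j^k = g(u_j^k)
    and the K output activations y^k = g(H^* V^k + v_k).
    """
    H = sigmoid(U).reshape(-1)   # flatten the N*K hidden activations
    Y = sigmoid(V @ H + v)
    return H, Y
```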
IV. EXPERIMENTAL RESULTS 
To evaluate the effectiveness of the proposed TWN, we conducted several experiments 
that involved recognition of the highly confusable English E-set (i.e., /b, c, d, e, g, p, t,
v, z/). The utterances were collected from 100 speakers, 50 males and 50 females,
each speaking every word in the E-set twice, once for training and once for testing. 
The signal was sampled at 6.67 kHz. We used 12 cepstral and 12 delta-cepstral LPC- 
derived [11] coefficients to represent each 45 msec frame of the sampled signal. 
We used a baseline conventional HMM-based recognizer to initialize the TW network, 
and to get a benchmark performance. Each strictly left-to-right HMM in this system 
has five states, and the observation densities are modeled by four Gaussian mixture 
components. The recognition rates of this system are 61.7% on the test data, and 
80.2% on the training data. 
Experiment with single-layer TWN: In this experiment the single-layer TW network 
was initialized according to (7), using the parameters of the baseline HMMs. The four 
mixture components of each state were treated as a fully connected set of four states, 
with transition probabilities that reflect the original transition probabilities and the 
relative weights of the mixtures. This corresponds to the case in which the local 
likelihood is computed using the dominant mixture component only. The network was 
trained using the suggested training algorithm (10), with quadratic error function. The 
recognition rate of the trained network increased to 69.4% on the test set and 93.6% on 
the training set. 
Experiment with multi-layer TWN: In this experiment we used the multi-layer 
network architecture described in the previous section. The recognition performance of 
this network after training was 74.4% on the test set and 91% on the training set. 
Figures 1, 2, and 3 show the recognition performance of a single-layer TWN, 
initialized by a baseline HMM, the trained single-layer TWN, and the trained multi- 
layer TWN, respectively. In these figures the activation of the unit representing the
correct class is plotted against the activation of the best wrong unit (i.e., the incorrect 
class with the highest score) for each input utterance. Therefore, the utterances that 
correspond to the marks above the diagonal line are correctly recognized, and those 
under it are misclassified. The most interesting observation that can be made from 
these plots is the striking difference between the multi-layer and the single-layer 
TWNs. The single-layer TWNs in Figures 1 and 2 (the basdine and the trained) 
exhibit the same typical behavior when the utterances are concentrated around the 
diagonal line. For the multi-layer net, the utterances that were recognized correctly tend 
to concentrate in the upper part of the graph, having the correct unit activation close to 
1.0. This property of a multi-layer net can be used for introducing error rejection 
criteria: utterances for which the difference between the highest activation and the
second highest activation is less than a prescribed threshold are rejected. In Figure 4 we
compare the test performance of the multi-layer net and the baseline system, both with 
such rejection mechanism, for different values of rejection threshold. As expected, the 
multi-layer net outperforms the baseline recognizer, showing a much smaller
misclassification rate for the same number of rejections.
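The rejection rule is simple to state in code; a minimal sketch with our own names:

```python
import numpy as np

def classify_with_rejection(y, threshold):
    """Reject an utterance when the top two activations are too close.

    y         : (K,) output activations for one utterance
    threshold : minimum margin between the best and second-best activation
    Returns the winning class index, or None if the utterance is rejected.
    """
    order = np.argsort(y)                 # ascending; last entry is the best
    best, second = y[order[-1]], y[order[-2]]
    return int(order[-1]) if best - second >= threshold else None
```

Sweeping the threshold trades rejections against misclassifications, which is the comparison plotted in Figure 4.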
V. SUMMARY AND DISCUSSION 
In this paper we established a hybrid framework for speech recognition, combining the 
characteristics of hidden Markov models and neural networks. We showed that a 
HMM-based recognizer has an equivalent representation as a single-layer network 
composed of time-warping neurons, and proposed to improve the discriminative power
of the recognizer by using back-propagation training and by generalizing the structure
of the recognizer to a multi-layer net. Several experiments were conducted for testing 
the performance of the proposed network on a highly confusable vocabulary (the 
English E-set). The recognition performance on the test set of a single-layer TW net 
improved from 61% (when initialized with a baseline HMM) to 69% after training.
Extending the structure of the recognizer by one more layer of neurons, we obtained
further improvement of recognition accuracy up to 74.4%. Scatter plots of the results 
indicate that in the multi-layer case, there is a qualitative change in the performance of 
the recognizer, allowing us to set up a rejection criterion to improve the confidence of 
the system. 
References 
1. H. Bourlard, C.J. Wellekens, "Links between Markov models and multilayer
perceptrons," Advances in Neural Information Processing Systems, pp. 502-510,
Morgan Kaufmann, 1989.
2. J.S. Bridle, "Alphanets: a recurrent 'neural' network architecture with a hidden 
Markov model interpretation," Speech Communication, April 1990. 
3. D.E. Rumelhart, G.E. Hinton and R.J. Williams, "Learning internal representation 
by error propagation," Parallel Distributed Processing: Exploration in the 
Microstructure of Cognition, MIT Press, 1986. 
4. E. Levin, "Word recognition using hidden control neural architecture," Proc. of 
ICASSP, Albuquerque, April 1990. 
5. K.-Y. Su, C.-H. Lee, "Speech Recognition Using Weighted HMM and Subspace
Projection Approaches," Proc. of ICASSP, Toronto, 1991.
6. L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in
speech recognition," Proc. of IEEE, vol. 77, No. 2, pp. 257-286, February 1989.
7. C.-H. Lee, L. R. Rabiner, R. Pieraccini, J. G. Wilpon, "Acoustic Modeling for Large
Vocabulary Speech Recognition," Computer Speech and Language, 1990, No. 4, pp.
127-165.
8. G.D. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268-278, Mar.
1973.
9. S.A. Solla, E. Levin, M. Fleisher, "Improved targets for multilayer perceptron
learning," Neural Networks Journal, 1988.
10. B.-H. Juang, S. Katagiri, "Discriminative Learning for Minimum Error 
Classification," IEEE Trans. on SP, to be published. 
11. B.S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for 
automatic speaker identification and verification," J. Acoust. Soc. Am., vol. 55, No. 6, 
pp. 1304-1312, June 1974. 
Figure 1: Scatter plot for baseline recognizer 
Figure 2: Scatter plot for trained single-layer TWN
Figure 3: Scatter plot for multi-layer TWN 
Figure 4: Rejection performance of baseline recognizer and the multi-layer TWN 
