Decoding Cursive Scripts 
Yoram Singer and Naftali Tishby 
Institute of Computer Science and 
Center for Neural Computation 
Hebrew University, Jerusalem 91904, Israel 
Abstract 
Online cursive handwriting recognition is currently one of the most 
intriguing challenges in pattern recognition. This study presents a 
novel approach to this problem which is composed of two comple- 
mentary phases. The first is dynamic encoding of the writing tra- 
jectory into a compact sequence of discrete motor control symbols. 
In this compact representation we largely remove the redundancy of 
the script, while preserving most of its intelligible components. In 
the second phase these control sequences are used to train adaptive 
probabilistic acyclic automata (PAA) for the important ingredients 
of the writing trajectories, e.g. letters. We present a new and effi- 
cient learning algorithm for such stochastic automata, and demon- 
strate its utility for spotting and segmentation of cursive scripts. 
Our experiments show that over 90% of the letters are correctly 
spotted and identified, prior to any higher level language model. 
Moreover, both the training and recognition algorithms are very 
efficient compared to other modeling methods, and the models are 
'on-line' adaptable to other writers and styles. 
1 Introduction 
While the emerging technology of pen-computing is already available on the world's 
markets, there is an ever-growing gap between the state of the hardware and the 
quality of the available online handwriting recognition algorithms. Clearly, the 
critical requirement for the success of this technology is the availability of reliable 
and robust cursive handwriting recognition methods. 
We have previously proposed a dynamic encoding scheme for cursive handwriting 
based on an oscillatory model of handwriting [8, 9] and demonstrated its power 
mainly through analysis by synthesis. Here we continue with this paradigm and use 
the dynamic encoding scheme as the front-end for a complete stochastic model of 
cursive script. 
The accumulated experience in temporal pattern recognition in the past 30 years 
has yielded some important lessons relevant to handwriting. The first is that one 
cannot predefine the basic 'units' of such temporal patterns due to the strong 
interaction, or 'coarticulation', between such units. Any reasonable model must allow for 
the large variability of the basic handwriting components in different contexts and 
by different writers. Thus true adaptability is a key ingredient of a good stochas- 
tic model of handwriting. Most, if not all, currently used models of handwriting 
and speech are hard to adapt and require vast amounts of training data for some 
robustness in performance. In this paper we propose a simpler stochastic modeling 
scheme, which we call Probabilistic Acyclic Automata (PAA), with the important 
feature of being adaptive. The training algorithm modifies the architecture and 
dimensionality of the model while optimizing its predictive power. This is achieved 
through the minimization of the "description length" of the model and training 
sequences, following the minimum description length (MDL) principle. Another 
interesting feature of our algorithm is that precisely the same procedure is used in 
both training and recognition phases, which enables continuous adaptation. 
The structure of the paper is as follows. In section 2 we review our dynamic en- 
coding method, used as the front-end to the stochastic modeling phase. We briefly 
describe the estimation and quantization process, and show how the discrete motor 
control sequences are estimated and used, in section 3. Section 4 deals with our 
stochastic modeling approach and the PAA learning algorithm. The algorithm is 
demonstrated by the modeling of handwritten letters. Sections 5 and 6 deal with 
preliminary applications of our approach to segmentation and recognition of cursive 
handwriting. 
2 Dynamic encoding of cursive handwriting 
Motivated by the oscillatory motion model of handwriting, as described e.g. by 
Hollerbach in 1981 [2], we developed a parameter estimation and regularization 
method which serves for the analysis, synthesis and coding of cursive handwriting. 
This regularization technique results in a compact and efficient discrete representa- 
tion of handwriting. 
Handwriting is generated by the human muscular motor system, which can be sim- 
plified as spring muscles near a mechanical equilibrium state. When the movements 
are small it is justified to assume that the spring muscles operate in the linear 
regime, so the basic movements are simple harmonic oscillations, superimposed by 
a simple linear drift. Movements are excited by selecting a pair of agonist-antagonist 
muscles that are modeled by the spring pair. In a restricted form this simple motion 
is described by the following two equations, 
V(t) : d:(t): a cos(o:.t + c)) + c Vy(t) : (t) -- bcos(o:yt) , (1) 
where Vx(t) and Vy(t) are the horizontal and vertical pen velocities respectively, wx 
and wy are the angular velocities, a, b are the velocity amplitudes,  is the relative 
phase lag, and c is the horizontal drift velocity. Assuming that these describe 
the true trajectory, the horizontal drift, $c$, is estimated as the average horizontal 
velocity, $\hat{c} = \frac{1}{N} \sum_{i=1}^{N} V_x(i)$. For fixed values of the parameters $a$, $b$, $\omega$ and $\phi$ these 
equations describe a cycloidal trajectory. 
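As an illustration of the oscillatory model, the following minimal sketch integrates the two velocity equations with a plain Euler step and estimates the drift as the mean horizontal velocity. The function names, the step size `dt`, and the Euler scheme are assumptions made for this example; they are not taken from the paper.

```python
import math

def cycloid_trajectory(a, b, omega_x, omega_y, phi, c, duration, dt=0.001):
    """Integrate the model velocities with a simple Euler scheme (illustrative).

    Returns the list of (x, y) pen positions, starting at the origin.
    """
    x, y, t = 0.0, 0.0, 0.0
    points = [(x, y)]
    for _ in range(int(round(duration / dt))):
        vx = a * math.cos(omega_x * t + phi) + c   # horizontal velocity with drift c
        vy = b * math.cos(omega_y * t)             # vertical velocity
        x += vx * dt
        y += vy * dt
        t += dt
        points.append((x, y))
    return points

def estimate_drift(vx_samples):
    """Estimate the horizontal drift as the average horizontal velocity."""
    return sum(vx_samples) / len(vx_samples)
```

Over a whole number of oscillation periods the cosine terms average out, so the net horizontal displacement is due to the drift alone, which is why the sample mean of the horizontal velocity is a natural estimate of $c$.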
Our main assumption is that the cycloidal trajectory is the natural (free) pen mo- 
tion, which is modified only at the velocity zero crossings. Thus changes in the 
dynamical parameters occur only at the zero crossings and preserve the continuity 
of the velocity field. This assumption implies that the angular velocities $\omega_x, \omega_y$ 
and amplitudes $a, b$ can be considered constant between consecutive zero crossings. 
Denoting by $t_i^x$ and $t_i^y$ the i'th zero crossing locations of the horizontal and vertical 
velocities, and by $L_i^x$ and $L_i^y$ the horizontal and vertical progression during the i'th 
interval, the estimated amplitudes are 
$$a = \frac{\pi L_i^x}{2(t_{i+1}^x - t_i^x)}, \qquad b = \frac{\pi L_i^y}{2(t_{i+1}^y - t_i^y)}.$$
Those amplitudes define the vertical and horizontal scales of the written letters. 
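The amplitude estimate can be sketched in a few lines. A half cycle of a pure cosine velocity with amplitude $a$ that lasts $\Delta t$ between two zero crossings covers a distance of $2a\Delta t/\pi$, so the amplitude follows from the progression and the interval length. The helper names and the linear interpolation of the crossing times below are assumptions for illustration only.

```python
import math

def zero_crossings(v, dt):
    """Times at which a uniformly sampled velocity changes sign.

    Crossing times are linearly interpolated between neighbouring samples.
    """
    times = []
    for k in range(len(v) - 1):
        if v[k] == 0.0:
            times.append(k * dt)
        elif v[k] * v[k + 1] < 0.0:
            frac = v[k] / (v[k] - v[k + 1])
            times.append((k + frac) * dt)
    return times

def half_cycle_amplitude(progression, t_start, t_end):
    """Velocity amplitude of a half cosine cycle from its net progression."""
    return math.pi * abs(progression) / (2.0 * (t_end - t_start))
```

For a sampled velocity `v`, one would call `zero_crossings` once per axis and then apply `half_cycle_amplitude` to each pair of consecutive crossings, using the pen displacement over that interval as `progression`.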
Examination of the vertical velocity dynamics reveals the following: (a) There is 
a virtual center of the vertical movement, and the velocity trajectory is approximately 
symmetric around this center. (b) The vertical velocity zero crossings occur while 
the pen is at almost fixed vertical levels which correspond to high, normal and low 
modulation values, yielding altogether 5 quantized levels. The actual pen levels 
achieved at the vertical velocity zero crossings vary around the quantized values, 
with approximately normal distribution. Let the indicator $I_t$ ($I_t \in \{1, \dots, 5\}$) 
be the most probable quantized level when the pen is at the position obtained at 
the t'th zero crossing. We need to estimate concurrently the 5 quantized levels 
$H_1, \dots, H_5$, their variance $\sigma$ (assumed the same for all levels), and the indicators 
$I_t$. In this model the observed data is the sequence of actual pen levels $L(t)$, while 
the complete data is the sequence of levels and indicators $\{I_t, L(t)\}$. The task of 
estimating the parameters $\{H_i, \sigma\}$ is performed via maximum likelihood estimation 
from incomplete data, commonly done by the EM algorithm [1] and described in [9]. 
The horizontal amplitude is similarly quantized to 3 levels. 
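The level estimation can be illustrated by a small EM loop for a Gaussian mixture with one shared variance. This is only a sketch under stated assumptions: equal prior weights for the levels, user-supplied initial values, and a fixed number of iterations are choices made here, and the actual procedure is the one described in [9].

```python
import math

def em_quantized_levels(observed, init_levels, init_sigma, n_iter=50):
    """EM for Gaussian levels with a shared variance and equal priors (assumed).

    observed    -- pen levels at the vertical velocity zero crossings
    init_levels -- initial guesses for the quantized levels
    init_sigma  -- initial guess for the shared standard deviation
    Returns (levels, sigma, indicators); indicators are 1-based level indices.
    """
    levels = list(init_levels)
    var = init_sigma ** 2
    n = len(observed)
    for _ in range(n_iter):
        # E-step: posterior probability of every level for every observation.
        resp = []
        for x in observed:
            log_w = [-(x - h) ** 2 / (2.0 * var) for h in levels]
            top = max(log_w)
            w = [math.exp(v - top) for v in log_w]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: re-estimate the levels, then the shared variance.
        for j in range(len(levels)):
            mass = sum(r[j] for r in resp)
            if mass > 1e-12:
                levels[j] = sum(r[j] * x for r, x in zip(resp, observed)) / mass
        var = sum(r[j] * (x - levels[j]) ** 2
                  for r, x in zip(resp, observed)
                  for j in range(len(levels))) / n
        var = max(var, 1e-9)  # guard against a degenerate zero variance
    # With equal priors and a shared variance, the most probable level
    # for an observation is simply the nearest one.
    indicators = [1 + min(range(len(levels)), key=lambda j: (x - levels[j]) ** 2)
                  for x in observed]
    return levels, math.sqrt(var), indicators
```

The indicators play the role of the hidden part of the complete data: the E-step computes their posterior, and the M-step re-estimates the levels and the shared spread from the resulting soft assignments.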
After performing slant equalization of the handwriting, namely, orthogonalizing the 
x and y motions, the velocities $V_x(t)$, $V_y(t)$ become approximately uncorrelated. 
When $\omega_x \approx \omega_y$, the two velocities are uncorrelated if there is a $\pm 90^\circ$ phase-lag 
between $V_x$ and $V_y$. There are also locations of total halt in both velocities (no pen 
movement) which we take as a zero phase lag. Considering the vertical oscillations 
as a 'master clock', the horizontal oscillations can be viewed as a 'slave clock' whose 
phase and amplitude vary around the 'master clock'. For English cursive writing, 
the frequency ratio between the two clocks is limited to the set $\{\frac{1}{2}, 1, 2\}$, thus $V_y$ 
induces a grid for the possible $V_x$ zero crossings. The phase-lag of the horizontal 
oscillation is therefore restricted to the values $0^\circ$, $\pm 90^\circ$ at the zero crossings of 
$V_y$. The most likely phase-lag trajectory is determined by dynamic programming 
over the entire grid. At the end of this process the horizontal oscillations are fully 
determined by the vertical oscillations and the pen trajectory's description greatly 
simplified. 
The variations in the vertical angular velocity for a given writer are small, except 
in short intervals where the writer hesitates or stops. The only information that 
should be preserved is the typical vertical angular velocity, denoted by $\omega$. The 
normalized discretized equations of motion now become, 
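$$\begin{cases} \dot{x} = a_i \sin(\omega t + \phi_j) + 1, & a_i \in \{A_1, A_2, A_3\}, \; \phi_j \in \{-90^\circ, 0^\circ, 90^\circ\} \\ \dot{y} = b_i \sin(\omega t), & b_i \in \{H_{l_2} - H_{l_1} \mid 1 \le l_1, l_2 \le 5\} \end{cases} \qquad (2)$$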
We used an analysis-by-synthesis technique in order to verify our assumptions and 
estimation scheme. The final result of the whole process is depicted in Fig. 1, 
where the original handwriting is plotted together with its reconstruction from the 
discrete representation. 
Figure 1: The original and the fully quantized cursive scripts. 
3 Discrete control sequences 
The process described in the previous section results in a many to one mapping 
from the continuous velocity field, $V_x(t)$, $V_y(t)$, to a discrete set of symbols. This 
set is composed of the cartesian product of the quantized vertical and horizontal 
amplitudes and the phase-lags between these velocities. We treat this discrete con- 
trol sequence as a cartesian product time series. Using the value '0' to indicate 
that the corresponding oscillation continues with the same dynamics, a change in 
the phase lag can be encoded by setting the code to zero for one dimension, while 
switching to a new value in the other dimension. A zero in both dimensions 
indicates no activity. In this way we can model 'pen-up' intervals and incorporate 
auxiliary symbols like 'dashes', 'dots', and 'crosses', that play an important role in 
resolving ambiguities between letters. These auxiliary symbols are modeled as a 
separate channel and are ordered according to their X coordinate. We encode the 
control levels by numbers from 1 to 5, for the 5 levels of vertical positions. The 
quantized horizontal amplitudes are coded by 5 values as well: 2 for positive 
amplitudes (small and large), 2 for negative amplitudes, and one for zero amplitude. 
Below is an example of our discrete representation for the handwriting depicted in 
Fig. 1. The upper and lower lines encode the vertical and horizontal oscillations 
respectively, and the auxiliary channel is omitted. In this example there is only one 
location where both symbols are '0', indicating a pen-up at the end of the word. 
24020420400 :t 005002040202204020402424204020500204020402400440240220 
:t 040340304 :t 042032040 :t 0$00:t 05024253050 :t 050204:t 032403050033 :t 0500 :tOO0 
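To make the two-channel encoding concrete, here is a small sketch that works on a pair of aligned symbol lists. The toy sequences are hypothetical and are not the paper's example, and reading a '0' as "hold the last non-zero code" is our interpretation of "continues with the same dynamics" (it is not meaningful inside a pen-up interval).

```python
def pen_up_positions(vertical, horizontal):
    """Indices where both channels are 0, i.e. no pen activity."""
    return [t for t, (v, h) in enumerate(zip(vertical, horizontal))
            if v == 0 and h == 0]

def expand_continuations(channel):
    """Replace each 0 ('same dynamics continue') by the last non-zero code."""
    expanded, last = [], 0
    for code in channel:
        if code != 0:
            last = code
        expanded.append(last)
    return expanded

# Hypothetical toy sequences (vertical levels 1-5, horizontal codes 1-5).
vertical = [2, 4, 0, 2, 0, 0]
horizontal = [0, 0, 3, 0, 4, 0]
```

In this toy input only the last position has a zero in both channels, so it is the only position reported as a pen-up.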
4 Stochastic modeling of the motor control sequences 
Existing stochastic modeling methods, such as Hidden Markov Models (HMM) [3], 
suffer from several serious drawbacks. They require fixing a priori the 
architecture of the model; they require large amounts of segmented training data; 
and they are very hard to adapt to new data. The stochastic model presented here 
is an on-line learning algorithm whose important property is its simple adaptability 
to new examples. We begin with a brief introduction to probabilistic automata, 
leaving the theoretical issues and some of the more technical details to another 
place. 
A probabilistic automaton is a 6-tuple $(Q, \Sigma, \tau, \gamma, q_s, q_e)$, where $Q$ is a finite set 
of $n$ states, $\Sigma$ is an alphabet of size $k$, $\tau : Q \times \Sigma \to Q$ is the state transition 
function, $\gamma : \Sigma \times Q \to [0, 1]$ is the transition (output) probability, where for every 
$q \in Q$, $\sum_{\sigma \in \Sigma} \gamma(\sigma|q) = 1$, $q_s \in Q$ is a start state, and $q_e \in Q$ is an end state. A 
probabilistic automaton is called acyclic if it contains no cycles. We denote such 
automata by PAA. This type of automaton is also known as a Markov process with 
a single source and a single absorbing state. The rest of the states are all transient 
states. Such automata induce non-zero probabilities on a finite set of strings. Given 
an input string $\bar{\sigma} = (\sigma_1, \dots, \sigma_N)$, if at the end of its 'run' the automaton entered the 
final state $q_e$, the probability of the string $\bar{\sigma}$ is defined to be $P(\bar{\sigma}) = \prod_{i=1}^{N} \gamma(\sigma_i|q_{i-1})$, 
where $q_0 = q_s$, $q_i = \tau(q_{i-1}, \sigma_i)$. On the other hand, if $q_N \neq q_e$ then $P(\bar{\sigma}) = 0$. 
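The definition of the string probability translates directly into code. The sketch below is an illustrative data structure, not the authors' implementation: the class name, the dictionary layout, and the toy automaton are assumptions made for this example.

```python
class PAA:
    """A tiny probabilistic acyclic automaton (illustrative sketch)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.tau = {}    # (state, symbol) -> next state
        self.gamma = {}  # (state, symbol) -> transition probability

    def add_transition(self, state, symbol, next_state, prob):
        self.tau[(state, symbol)] = next_state
        self.gamma[(state, symbol)] = prob

    def string_probability(self, symbols):
        """Product of transition probabilities along the run.

        Returns 0 if a transition is missing or the run does not
        finish in the end state.
        """
        state, prob = self.start, 1.0
        for s in symbols:
            if (state, s) not in self.tau:
                return 0.0
            prob *= self.gamma[(state, s)]
            state = self.tau[(state, s)]
        return prob if state == self.end else 0.0

# Toy automaton: state 0 is the start state, state 2 the end state.
toy = PAA(start=0, end=2)
toy.add_transition(0, "a", 1, 0.5)
toy.add_transition(0, "b", 2, 0.5)
toy.add_transition(1, "b", 2, 1.0)
```

The toy automaton assigns non-zero probability only to the strings "ab" and "b", a finite set, as expected for an acyclic automaton.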
The inference of the PAA structure from data can be viewed as a communication 
problem. Suppose that one wants to transmit an ensemble of strings, all created 
by the same PAA. If both sides know the structure and probabilities of the PAA 
then the transmitter can optimally encode the strings by using the PAA transition 
probabilities. If only the transmitter knows the structure and the receiver has 
to discover it while receiving new strings, each time a new transition occurs, the 
transmitter has to send the next state index as well. Since the automaton is acyclic, 
the possible next states are limited to those which do not form a cycle when the 
new edge is added to the automaton. Let $k_q^t$ be the number of legal next states 
from a state $q$ known to the receiver at time $t$. Then the encoding of the next 
state index requires at least $\log_2(k_q^t + 1)$ bits. The receiver also needs to estimate 
the state transition probability from the previously received strings. Let $n(\sigma|q)$ be 
the number of times the symbol $\sigma$ has been observed by the receiver while being in 
state $q$. Then the transition probability is estimated by Laplace's rule of succession, 
$$\hat{P}(\sigma|q) = \frac{n(\sigma|q) + 1}{\sum_{\sigma' \in \Sigma} n(\sigma'|q) + |\Sigma|}.$$
In sum, if $q$ is the current state and $k_q^t$ the number of 
possible next states known to the receiver, the number of bits required to encode the 
next symbol $\sigma$ (assuming optimal coding scheme) is given by: (a) if the transition 
$\tau(q, \sigma)$ has already been observed: $-\log_2(\hat{P}(\sigma|q))$; (b) if the transition $\tau(q, \sigma)$ has 
never occurred before: $-\log_2(\hat{P}(\sigma|q)) + \log_2(k_q^t + 1)$. 
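The per-symbol coding cost can be sketched as follows. The function names and the use of a plain dictionary of counts for a single state are assumptions for illustration; the formulas are Laplace's rule of succession and the two cases (a) and (b) above.

```python
import math

def laplace_estimate(counts, symbol, alphabet_size):
    """Laplace's rule of succession for one state.

    counts maps each symbol to the number of times it was seen in that state.
    """
    total = sum(counts.values())
    return (counts.get(symbol, 0) + 1) / (total + alphabet_size)

def symbol_cost_bits(counts, symbol, alphabet_size, seen_transition, n_legal_next):
    """Bits needed to encode the next symbol from the current state.

    If the transition has never been observed, the index of the next state
    must be sent as well, at a cost of log2(n_legal_next + 1) bits.
    """
    bits = -math.log2(laplace_estimate(counts, symbol, alphabet_size))
    if not seen_transition:
        bits += math.log2(n_legal_next + 1)
    return bits
```

For example, with a binary alphabet and a state in which '0' was seen three times, the estimate for '0' is 4/5 and the estimate for the unseen '1' is 1/5, so a first-time '1' transition is charged both for its low probability and for the next-state index.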
In training such a model from empirical observations it is necessary to infer the 
structure of the PAA as well as its parameters. We can thus use the above coding 
scheme to find a minimal description length (MDL) of the data, provided that our 
model assumption is correct. Since the true PAA is not known to us, we need to 
imitate the role of the receiver in order to find the optimal coding of a message. This 
can be done efficiently via dynamic programming for each individual string. After 
the optimal coding for a single string has been found, the new states are added, the 
transition probabilities $\hat{P}(\sigma|q)$ are updated and the number of legal next states $k_q^t$ 
is recalculated. An example of the learning algorithm is given in Fig. 2, with the 
estimated probabilities $\hat{P}$ written on the graph edges. 
Figure 2: Demonstration of the PAA learning algorithm. Figure (a) shows the 
original automaton from which the examples were created. Figures (b)-(d) are the 
intermediate automata built by the algorithm. Edges drawn with bold, dashed, and 
grey lines correspond to transitions with the symbols '0', '1', and the terminating 
symbol, respectively. 
5 Automatic segmentation of cursive scripts 
Since the learning algorithm of a PAA is an on-line scheme, only a small number 
of segmented examples is needed in order to build an initial model. For cursive 
handwriting we manually collected and segmented about 10 examples, for each 
lower case cursive letter, and built 26 initial models. At this stage the models are 
small and do not capture the full variability of the control sequences. Yet this set 
of initial automata was sufficient to gradually segment cursive scripts into letters 
and update the models from these segments. Segmented words with high likelihood 
are fed back into the learning algorithm and the models are further refined. The 
process is iterated until all the training data is segmented with high likelihood. 
The likelihood of new data might not be defined due to the incompleteness of the 
automata, hence the learning algorithm is again applied in order to induce 
probabilities. Let $P^{S}_{i,j}$ be the probability that a model $S$ (which represents a cursive 
letter) generates the control symbols $s_i, \dots, s_{j-1}$ ($j > i$). The log-likelihood of a 
proposed segmentation $(i_1, i_2, \dots, i_{N+1})$ of a word $S_1, S_2, \dots, S_N$ is, 
$$L((i_1, \dots, i_{N+1}) \mid (S_1, \dots, S_N), (s_1, \dots, s_L)) = \log\Big(\prod_{j=1}^{N} P^{S_j}_{i_j, i_{j+1}}\Big) = \sum_{j=1}^{N} \log P^{S_j}_{i_j, i_{j+1}}.$$
The segmentation is calculated efficiently by maintaining a layered graph and using 
dynamic programming to compute recursively the most likely segmentation. Formally, 
let $ML(n, k)$ be the highest likelihood segmentation of the word up to the 
n'th control symbol and the k'th letter in the word. Then, 
$$ML(n, k) = \max_{i_{k-1} < i < n} \left\{ ML(i, k-1) + \log P^{S_k}_{i,n} \right\}$$
The best segmentation is obtained by tracking the most likely path from $ML(L, N)$ 
back to $ML(1, 1)$. The result of such a segmentation is depicted in Fig. 3. 
Figure 3: Temporal segmentation of the word impossible. The segmentation is 
performed by applying the automata of the letters contained in the word, and 
finding the Maximum-Likelihood sequence of models via dynamic programming. 
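The dynamic programming step can be sketched as below. This is a minimal illustration, not the authors' code: indices are 0-based with exclusive segment ends, and `log_prob(letter, i, j)` is a hypothetical callback standing in for the log-probability that the letter's automaton generates symbols `i` to `j - 1`.

```python
import math

def segment_word(n_symbols, letters, log_prob):
    """Most likely segmentation of n_symbols control symbols into the given letters.

    Returns (best log-likelihood, segment boundaries), or (-inf, None)
    if no segmentation has finite likelihood.
    """
    n_letters = len(letters)
    neg_inf = -math.inf
    # best[k][n]: best score for the first n symbols explained by the first k letters.
    best = [[neg_inf] * (n_symbols + 1) for _ in range(n_letters + 1)]
    back = [[None] * (n_symbols + 1) for _ in range(n_letters + 1)]
    best[0][0] = 0.0
    for k in range(1, n_letters + 1):
        for n in range(k, n_symbols + 1):
            for i in range(k - 1, n):       # i = start of the k'th letter
                if best[k - 1][i] == neg_inf:
                    continue
                score = best[k - 1][i] + log_prob(letters[k - 1], i, n)
                if score > best[k][n]:
                    best[k][n] = score
                    back[k][n] = i
    if best[n_letters][n_symbols] == neg_inf:
        return neg_inf, None
    # Trace the most likely path back from the last cell.
    bounds, n = [n_symbols], n_symbols
    for k in range(n_letters, 0, -1):
        n = back[k][n]
        bounds.append(n)
    bounds.reverse()
    return best[n_letters][n_symbols], bounds
```

Each letter must cover at least one symbol, which is why the inner loop starts at `k - 1`. The running time is quadratic in the number of control symbols and linear in the number of letters.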
6 Inducing probabilities for unlabeled words 
Using this scheme we automatically segmented a database which contained about 
1200 frequent English words, by three different writers. After adding the segmented 
letters to the training set the resulting automata were general enough, yet very 
compact. Thus inducing probabilities and recognition of unlabeled data could be 
performed efficiently. The probability of locating letters in certain locations in new 
unlabeled words (i.e. words whose transcription is not given) can be evaluated by 
the automata. These probabilities are calculated by applying the various models 
on each sub-string of the control sequence, in parallel. Since the automata can 
accommodate different lengths of observations, the log-likelihood should be divided 
by the length of the sequence. This normalized log-likelihood is an approximation 
of the entropy induced by the models, and measures the uncertainty in determining 
the transcription of a word. The score which measures the uncertainty of the 
occurrence of a letter $S$ in place $n$ in a word is $\mathrm{Score}(n|S) = \max_l \frac{1}{l} \log(P^{S}_{n,n+l-1})$. 
The result of applying several automata to a new word is shown in Fig. 4. High 
probability of a given automaton indicates a beginning of a letter with the cor- 
responding model. The probabilities for the letters k, a, e, b are plotted top to 
bottom. The correspondence between high likelihood points and the relevant lo- 
cations in the words are shown with dashed lines. These locations occur near the 
'true' occurrence of the letter and indicate that these probabilities can be used for 
recognition and spotting of cursive handwriting. There are other locations where 
the automata obtain high scores. These correspond to words with high similarity to 
the model letter and can be resolved by higher level models, similar to techniques 
used in speech. 
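The length-normalized spotting score can be sketched as follows. As an assumption for this example, `log_prob(letter, i, j)` is a hypothetical callback returning the log-probability that the letter's automaton generates the `j - i` symbols starting at position `i` (0-based, exclusive end), and `max_len` bounds the substring lengths that are tried.

```python
import math

def spotting_score(log_prob, letter, n, max_len, n_symbols):
    """Best per-symbol log-likelihood of a letter starting at position n."""
    best = -math.inf
    for length in range(1, max_len + 1):
        if n + length > n_symbols:
            break
        best = max(best, log_prob(letter, n, n + length) / length)
    return best

def score_profile(log_prob, letter, max_len, n_symbols):
    """Score of one letter model at every start position of the sequence."""
    return [spotting_score(log_prob, letter, n, max_len, n_symbols)
            for n in range(n_symbols)]
```

Peaks in the profile returned by `score_profile` correspond to candidate letter positions; dividing by the length keeps long and short candidate substrings comparable.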
Figure 4: The normalized log-likelihood scores induced by the automata for the 
letters k, a, e, and b (top to bottom). Locations with high score are marked with 
dashed lines and indicate the relative positions of the letters in the word. 
7 Conclusions and future research 
In this paper we present a novel stochastic modeling approach for the analysis, 
spotting, and recognition of online cursive handwriting. Our scheme is based on a 
discrete dynamic representation of the handwriting trajectory, followed by training 
adaptive probabilistic automata for frequent writing sequences. These automata 
are easy to train and provide a simple adaptation mechanism with sufficient power 
to capture the high variability of cursively written words. Preliminary experiments 
show that over 90% of the single letters are correctly identified and located, without 
any additional higher level language model. Methods for higher level statistical 
language models are also being investigated [6], and will be incorporated into a 
complete recognition system. 
Acknowledgments 
We would like to thank Dana Ron for useful discussions and Lee Giles for providing us 
with the software for plotting finite state machines. Y.S. would like to thank the Clore 
foundation for its support. 
References 
[1] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood estimation from 
incomplete data via the EM algorithm. J. Roy. Statist. Soc., 39(B):1-38, 1977. 
[2] J.M. Hollerbach. An oscillation theory of handwriting. Bio. Cyb., 39, 1981. 
[3] L.R. Rabiner. A tutorial on hidden markov models and selected applications in 
speech recognition. Proc. IEEE, pages 257-286, Feb. 1989. 
[4] J. Rissanen. Modeling by shortest data description. Automatica, 14, 1978. 
[5] J. Rissanen. Stochastic complexity and modeling. Annals of Stat., 14(3), 1986. 
[6] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. In this volume. 
[7] D.E. Rumelhart. Theory to practice: a case study - recognizing cursive 
handwriting. In Proc. of 1992 NEC Conf. on Computation and Cognition. 
[8] Y. Singer and N. Tishby. Dynamical encoding of cursive handwriting. In IEEE 
Conference on Computer Vision and Pattern Recognition, 1993. 
[9] Y. Singer and N. Tishby. Dynamical encoding of cursive handwriting. Technical 
Report CS93-4, The Hebrew University of Jerusalem, 1993. 
