Planar Hidden Markov Modeling: 
from Speech to Optical Character Recognition 
Esther Levin and Roberto Pieraccini 
ATr Bell Laboratories 
600 Mountain Ave. 
Murray Hill, NJ 07974 
Abstract 
We propose in this paper a statistical model (planar hidden Markov model - 
PHMM) describing statistical properties of images. The model generalizes 
the single-dimensional HMM, used for speech processing, to the planar case. 
For this model to be useful an efficient segmentation algorithm, similar to the 
Viterbi algorithm for HMM, must exist. We present conditions in terms of 
the PHMM parameters that are sufficient to guarantee that the planar 
segmentation problem can be solved in polynomial time, and describe an 
algorithm for that. This algorithm aligns optimally the image with the model, 
and therefore is insensitive to elastic distortions of images. Using this 
algorithm a joint optimal segmentation and recognition of the image can be 
performed, thus overcoming the wealmess of traditional OCR systems where 
segmentation is performed independently before the recognition leading to 
unrecoverable recognition errors. 
The PHMM approach was evaluated using a set of isolated hand-written 
digits. An overall digit recognition accuracy of 95% was achieved. An 
analysis of the results showed that even in the simple case of recognition of 
isolated characters, the elimination of elastic distortions enhances the 
performance significantly. We expect that the advantage of this approach will 
be even more significant for tasks such as connected writing 
recognition/spotting, for which there is no known high accuracy method of 
recognition. 
1 Introduction 
The performance of traditional OCR systems deteriorate very quickly when documents 
are degraded by noise, blur, and other forms of distortion. The main reason for such 
deterioration is that in addition to the intra-class character variability caused by distortion, 
the segmentation of the text into words and characters becomes a nontrivial task. In most 
of the traditional systems, such segmentation is done before recognition, leading to many 
recognition errors, since recognition algorithms cannot usually recover from errors 
introduced in the segmentation phase. Moreover, in many cases the segmentation is ill- 
defined, since many plausible segmentations might exist, and only grammatical and 
linguistic analysis can find the "right" one. To address these problems, an algorithm is 
needed that can: 
 be tolerant to distortions leading to intra-class variability 
731 
732 Levin and Pieraccini 
 perform segmentation together with recognition, thus join fly optimizing both 
processes, while incorporating grammatical/linguistic constraints. 
In this paper we describe a planar segmentation algorithm that has the above properties. 
It results from a direct extension of the Viterbi (Forney, 1973) algorithm, widely used in 
automatic speech recognition, to two-dimensional signals. 
In the next section we describe the basic hidden Markov model and define the 
segmentation problem. In section 3 we introduce the planar HMM that extends the HMM 
concept to model images. The planar segmentation problem for PffMM is defined in 
section 4. It was recently shown (Keams and Levin, 1992) that the planar segmentation 
problem is NP-hard, and therefore, in order to obtain an effective planar segmentation 
algorithm, we propose to constrain the parameters of the PHMM. We show sufficient 
conditions in terms of PffMM parameters for such algorithm to exist and describe the 
algorithm. This approach differs from the one taken in references (Chellappa and 
Chatterjee, 1985) and (Derin and Elliot, 1987), where instead of restricting the problem, 
a suboptimal solution to the general problem was found. Since in (Kearns and Levin, 
1992) it was also shown that planar segmentation problem is hard to approximate, such 
suboptimal solution doesn't have any guaranteed bounds. The segmentation algorithm 
can now be used effectively not only for aligning isolated images, but also for joint 
recognition/segmentation, eliminating the need of independent segmentation that usually 
leads to unrecoverable errors in recognition. The same algorithm is used for estimation of 
the parameters of the model given a set of example images. In section 5, results of 
isolated hand-written digit recognition experiments are presented. The results indicate 
that even in the simple case of isolated characters, the elimination of planar distortions 
enhances the performance signilicamfly. Section 6 contains the summary of this work. 
2 Hidden Markov Model 
The HMM is a st__a_fi.qtical model that is used to describe temporal signals 
G= {g(t): l<t<T,g  G oR'} in speech processing applications (Rabiner, 1989; Lee 
et al., 1990; Wilpon et al., 1990; Pieraccini and Levin, 1991). The HMM is a composite 
statistical source comprising a set s= { 1,  ",TR} of TR sources called states. The i-th 
state, i  s, is characterized by its probability distribution Pi(g) over G. At each time t 
only one of the states is active, emitting the observable g(t). We denote by s(t), s(t)  s 
the random variable corresponding to the active state at time t. The joint probability 
distribution (for real-valued g) or discrete probability mass (for g being a discrete 
variable) P(s(t),g(t)) for t > 1 is characterized by the following property: 
P(s(t),g(t)ls(l:t-1),g(l:t-1))=P(s(t)] s(t-1))P(g(t)]s(t))--- (1) 
=P(s(t) ] s(t-1)) P,(t)(g(t) ), 
where s(l:t-1) stands for the sequence {s(1), ..- s(t-1) }, and 
g(l:t-1)={g(1) ..... g(t-1)}. We denote by a 0 the transition probability 
P(s(t)=jls(t-1)=i), and by r, the probability of state i being active at t=l, 
i=P(s(1)=i). The probability of the entire sequence of states S--s(I:T) and 
observations G =g( I :T) can be expressed as 
T 
P (G,S) = (1)P(1)( g (1)) rI a(t-os(t) P(t)(g(t) ). (2) 
t--2 
The interpretation of equations (1) and (2) is that the observable sequence G is generated 
in two stages: first, a sequence S of T states is chosen according to the Markovian 
dis,'ibution parametrized by {ao} and {hi}; then each one of the states s (t), l<r, in S 
generates an observable g(t) according to its own memoryless distribution P,(t), forming 
the observable sequence G. This model is called a hidden Markov model, because the 
state sequence S is not given, and only the observation sequence G is known. A 
particular case of this model, called a left-to-right HMM, where aij =0 for j <i, and 
Planar Hidden Markov Modeling: from Speech to Optical Character Recognition 733 
l = 1, is especially useful for speech recognition. In this case each state of the model 
represents an unspecified acoustic unit, and due to the "eft-to-right" structure, the whole 
word is modeled as a concatenation of such acoustic mits. The time spent in each of the 
states is not fixed, and therefore the model can take into account the duration variability 
between different utterances of the same word. 
The segm^entafion problem of HMM is that of estimating the most probable state 
sequence S, given the observation G, 
}=argmaxP(S l G)=argmaxP(G,S). (3) 
$ $ 
The problem of finding } through exhaustive search is of exponential complexity, since 
there exist T r possible state sequences, but it can be solved in polynomial time using a 
dynamic programming approach (i.e. Viterbi algorithm). The segmentation plays a 
central role in all HMM-based speech recognizers, since for connected speech it gives the 
segmentation into words or sub-word units, and performs a recognition simultaneously, in 
an optimal way. This is in contrast to sequential systems, in which the connected speech 
is first segmented into words/subwords according to some rules, and than the individual 
segments are recognized by computing the appropriate likelihoods, and where many 
recognition errors are mused by unrecoverable segmentation errors. Higher-level 
syntactical knowledge can be integrated into decoding process through transition 
probabilities between the models. The segmentation is also used for estimating the 
HMMs parameters using a corpus of a training dam. 
3 The Two-Dimensional Case: Planar HMM 
In this section we describe a statistical model for planar image 
G = {g(x,y):(x,y) 6 Lx, r , g  G}. We call this model "Planar HMM" (PHMM) and 
design it to extend the advantages of conventional HMM to the two-dimensional case. 
The PHMM is a composite source, comprising a set s= {(,), I<.<_i<_X R, I<<YR} of 
N=XeYt states. Each state in s is a stochastic source characterized by its probability 
density P;,9(g) over the space of observations g  G. It is convenient to think of the 
states of the model as being located on a rectangular lattice where each state corresponds 
to a pixel of the corresponding reference image. Similarly to the conventional HMM, 
only one state is active in the generation of the (x,y)-th image pixel g (x,y). We denote 
by s(x,y)e s the active state of the model that generates g (x,y). The joint distribution 
governing the choice of active states and image values has the following MarkovJan 
property: 
P(g(x,y), s(x,y) I g(l:X, l:y-1), g(l:x-l,y), s(l:X, l:y-1),s(l:x-l,y))= (4) 
=P(g(x,y) ] s(x,y)) P(s(x,y) l s(x-l,y),s(x,y-1))= 
=P3(x,y)(g(x,Y)) P(s(x,y) [ s(x-l,y),s(x,y-1))= 
where g(l:X,y-1) = {g(x,y):(x,y)  Rx,y- }, g(l:x-l,y) -= {g (1,y),    ,g(x-l,y)}, and 
s(l:X,l:y-1), s(l:x-l,y) are the active states involved in generating g(l'JC, y-1), 
g(l:x-l,y), respectively, and Rx,y_ is an axis parallel rectangle between the origin and 
the point (X,y-1). Similarly to the one-dimensional case, it is useful to define a left-to- 
right bottom-up PHMM where P(s(x,y)=(m,n) I s(x-l,y)=(i,j),s(x,y-1)=(k,l)):t:O only 
when i_Jn and l_n, that does not allow for "fold overs" in the state image. The 
Markovjan property (4) allows the left-to-right bottom-up PHMM to model elastic 
distortions among different realizations of the same image, similarly to the way the 
MarkovJan property in left-to-right HMM handles temporal alignment. We have chosen 
this definition (4) of MarkovJan property rather than others (see for example Derm and 
Kelly, 1989) since it leads to the formulation of a segmentation problem which is similar 
to the planar alignment defined in (Levin and Pieraccini, 1992). 
734 Levin and Pieraccini 
Using property (4), the joint likelihood of the image G=g(I:X,I:Y) and the state image 
S=s(I:X, I:Y) can be written as 
x Y 
P(G,S) = FI rl P3(x.y)(g(x,y) ) (5) 
x=l y=l 
X H Y V Y X 
H t) H H I'I A*(x-Ly),3(x,y-),s(x,y) , 
x=2 y=2 y=2 x=2 
where: 
A(i,j).(k.l),(m.n) --P( s(x,y)=(m,n) I s(x-l,y)=(i,j), s(x,y-1)=(k,l) ), 
H 
a(i.i),(m.,,) -=P(s(x, 1)=(re, n) [ s(x-l, 1)=(i,j) ), 
v 
a(k.l).(m.,,) ---P( s(1,y)=(m,n) [ s(1,y)=(k,l) ), 
and 
j--P(s(1,1)=(i,j)) 
denote the generalized transition probabilities of PHMM. Similarly to HMM, (5) 
suggests that an image G is generated by the PHMM in two successive stages: in the first 
stage the state matrix S irgenerd according to the Markovian probability distribution 
parametrized by {A}, {a 1, {a 1, and{l. In the second stage, the image value in the 
(x,y)-th pixel is produced independently from other pixels according to the distribution of 
the s(x,y)-th state P,(,y)(g ). As in HMM, the state matrix S in most of the applications 
is not known, only G is observed. 
4 Planar Segmentation Problem 
The segmentation problem of PHMM is that of finding the state matrix } that best 
explains the observable image G and defines an optimal alignment of the image to the 
model. Solving this problem eliminates the sensitivity to intra-class elastic distortions and 
allows for simult3neous segmentation/recognit$,on of images similarly to the one- 
dimensional case. S can be estimated as in (3) by S = argnvax P (G,S). If we approach this 
s 
maximization by exhaustive search, the computational complexity is exponential, since 
there are (XR YR )xy different state matrices. Since the segmentation problem is NP-hard 
(Kearns and Levin, 1992), we suggest to simplify the problem by constraining the 
parameters of the PHMM, so that efficient segmentation algorithm can be found. In this 
section we present conditions in terms of the generalized transitiop probabilities of 
PHMM that are sufficient to guarantee that the most likely state image S can be computed 
in polynomial time, and describe an algorithm for doing thaL 
For the problem of finding  to be solved in polynomial time, there should exist a 
groupinoof the set s of states of the model into No mutually exclusive  subsets of states 
,, s = U ,. The generalized transition probabilities should satisfy the two following 
p=l 
constraints with respect to such grouping: 
H 
A(i,j).(k,l),(m.n):/:O ; a(i.j).(,,,) 0 (6) 
only if there existsp, 1 5p <No, such that (i,j), (re, n)  ,. 
v v 
A (i,j).(i,l),(m,n) =A(i,j),(k,lO,(m,n) ; a(&l).(m,n) = a(t,l),(m,n) (7) 
 It is possible to drop the mutually exclusiveness constraints by duphcating states, but then we have to 
ensure that the number of subsets, N G, should be polynomial in the dimensions of the model X R, YR. 
Planar Hidden Markov Modeling: from Speech to Optical Character Recognition 735 
if there exists p, 1 <p <No. such that (k,l), (k l,l t )  t,  
Condition (6) means that the the left neighbor (i,j) of the state (re, n) in the state matrix S 
must be a member of the same subset , as (re, n). Condition (7) means that the value of 
transition probability A (i./).(t./).(m.n) does not depend explicitly on the identity (k,l) of the 
bottom neighboring state, but only on the subset , to which (k,/) belongs. 
Under (6) and (7) the most likely state matrix } can be found using an algorithm 
described in (Levin and Pieraccini, 1992). This algorithm makes use of the Viterbi 
procedure at two different levels. In the first stage optimal segmentation is computed for 
each subset , with each image raw using Viterbi. Then global segmentation is found, 
through Viterbi, by combining optimally the segmentations obtained in the previous 
stage. 
Although conditions (6),(7) are hard to check in practice since any possible grouping of 
the states h to be considered, they can be effectively used in constructive mode, i.e., 
chosing one particular grouping, and then imposing the constraints (6) and (7) on the 
generalized transition probabilities with respect to this grouping. For example, if we 
choose , = {(,) I I <-<-<-<-XR,  =p }, 1 < p < YR, then the constraints (6),(7) become: 
H 
A(i.i).(t.t).(m,n) sO, ao.i).(m.,) sO only for j =n, (8) 
and, 
v v 
A(id).(t.t),(r,.n) =A(id).(k.t).(m.n), a(t.t).(m.n) =a(k,.t).(,,.,) for 1 <k, k <XR. (9) 
Note that constraints (6), (7) break the symmetry between the roles of the two 
coordinates. Other sets of conditions can be obtained from (6) and (7) by coordinate 
transformation. For example, the roles of the vertical and the horizontal axes can be 
exchanged. A grouping and constraints set chosen for a particular application should 
reflect the geometric properties of the images. 
5 Experiment_a_! Results 
The PHMM approach was tested on a writer-independent isolated handwritten digit 
recognition application. The data we used in our experiments was collected from 12 
subjects (6 for training and 6 for test). Each subject was asked to write 10 samples of 
each digit. Samples were written in fixed-size boxes, therefore naturally size-normalized 
and centered. Each sample in the database was represented by a 16x16 binary image. 
Each character class (digi0 was represented by a single PHMM, satisfying (6) and (7). 
Each PHMM had a strictly left-to-right bottom-up structure, where the state matrix S was 
restricted to contain every state of the model, i.e., states could not be skipped. All models 
had the same number of states. Each state was represented by its own binary probability 
distribution, i.e., the probability of a pixel being 1 (black) or 0 (white). We estimated 
these probabilities from the training data with the following generalization of the Viterbi 
training algorithm (Jelinek, 1976). For the initialization we uniformly divided each 
training image into regions corresponding to the states of its model. The initial value of 
Pi(g=l) for the i-th state was obtained as a frequency count of the black pixels in the 
corresponding region over all the samples of the same digit. Each iteration of the 
algorithm consisted of two stages: first,^the samples were aligned with the corresponding 
model, by finding the best state matrix S. Then, a new frequency count for each state was 
used to update Pi(1), according to the obtained alignment. We noticed that the training 
procedure converged usually after 2-4 iterations, and in all the experiments the algorithm 
was stopped at the loth iteration. The recognition was performed by assigning the test 
sample to the class k for which the alignment likelihood was maximal. 
736 Levin and Pieraccini 
Table 1 shows the number of errors in the recognition of the training set and the test set 
for different sizes of the models. 
Numberofsmtes RecognitionErrors 
XR=YR Training Test 
6 78 82 
8 36 50 
9 35 48 
10 26 32 
11 21 38 
12 18 42 
16 36 64 
Table 1: Number of errors in the recognition of the training set and the test set for 
different size of the models (out of 600 trials in both cases) 
It is worth noting the following two points. First, the test error shows a minimum for 
X/ = Y/e = 10 of 5%. By increasing or decreasing the number of states this error increases. 
This phenomenon is due to the following: 
1. The typical under/over parametrization behavior. 
2. Increasing the number of states closer to the size of the modeled images reduces the 
flexibility of the alignment procedure, making this a trivial uniform alignment 
when X/e = Y/e = 16. 
Also, the training error decreases monotonically with increasing number of states up to 
X/ = Y/e = 16. This is again typical behavior for such systems, since by increasing the 
number of states, the number of model parameters grows, improving the fit to the training 
data. But when the number of states equals the dimensions of the sample images, 
X/e = Y/e = 16, there is a sudden significant increase in the training error. This behavior is 
consistent with point (2) above. 
Fig. 1 shows three sets of models with different numbers of states. The states of the 
models in this figure are represented by squares, where the grey level of the square 
encodes the probability P(g=l). The (6x6) state models have a very coarse 
representation of the digits, because the number of states is so small. The (10x10) state 
models appear much sharper than the (16x16) state models, due to their ability to align 
the training samples. 
This preliminary experiment shows that eliminating elastic distortions by the alignment 
procedure discussed above plays an important role in the task of isolated character 
recognition, improving the recognition accuracy significantly. Note that the simplicity of 
this task does not stress the full power of the PHMM representation, since the data was 
isolated, size-normalized, and centered. On this task, the achieved performance is 
comparable to that of many other OCR systems. We expect that in harder tasks, involving 
connected text, the advantage of the proposed method will enhance the performance. 
Recently, this approach is being successfully applied to the task of recognition of noisy 
degraded printed messages (Agazzi et al., 1993). 
6 Summary and Discussion 
In this paper we describe a planar hidden Markov model and develop a planar 
segmentation algorithm that generalizes the Viterbi procedure widely used in speech 
recognition. This algorithm can be used to perform joint optimal 
recognition/segmentation of images incorporating some grammatical constraints and 
tolerating intra-class elastic distortions. The PHMM approach was tested on an isolated, 
hand-written digit recognition application. An analysis of the results indicate that even in 
a simple case of isolated characters, the elimination of elastic distortions enhances 
Planar Hidden Markov Modeling: from Speech to Optical Character Recognition 737 
recognition performance significantly. We expect that the advantage of this approach will 
be even more valuable m harder tasks, such as cursive writing recognition/spotting, for 
which an effective solution using the current available techniques has not yet been found. 
1 
'::::5... ':$V :' 
Figure 1: Three sets of models with 6x6, 10x10, and 16x16 states. 
References 
O. E. Agazzi, S. S. Kuo, E. Levin, R. Pieraccini, "Connected and Degraded Text 
Recognition Using Planar Hidden Markov Models," 
Proc. Of lnt. COnference on Acoustics Speech and Signal Processing, April 1993. 
R. Chellappa, S. Chatterjee, "Classification of textures Using Gaussian Markov Random 
Fields," IEEE Transactions on ASSP, Vol. 33, No. 4, pp. 959-963, August 1985. 
738 Levin and Pieraccini 
H. Derin, H. Elliot, "Modeling and Segmentation of Noisy and Textured Images Using 
Gibbs Random Fields," IEEE Transactions on PAMI, Vol. 9, No. 1 laP. 39-55, January 
1987. 
H. Derin, P. A. Kelly, 'Discrete-Index Markov-Type Random Processes,' in IE 
Proceedings, vol 77, #10, pp.1485-1510, 1989 
G.D. Forhey, "The Viterbi algorithm," Proc. IEEE, Mar. 1973. 
F. Jelinek, "Continuous Speech Recognition by Statistical Methods," Proceedings of 
IEEE, vol. 64, pp. 532-556, April 1976. 
M. Keams, E. Levin, Unpublished, 1992. 
C.-H. Lee, L. R. Rabiner, R. Pieraccini, J. G. Wilpon, "Acoustic Modeling for Large 
Vocabulary Speech Recognition," Computer Speech and Language, 1990, No. 4, pp. 
127-165. 
E. Levin, R. Pieraccini, "Dynamic Planar Warping and Planar Hidden Markov Modeling: 
from Speech to Optical Character Recognition," submitted to IEEE Trans. on PAMI, 
1992. 
R. Pieraccini, E. Levin, "Stochastic Representation of Semantic Structure for Speech 
Understanding," Proceedings of EUROSPEECH 91, Vol.2, pp. 383-386, Genova, 
September 1991. 
L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in 
Speech Recognition," Proc. IEEE, Feb. 1989. 
J. G. Wilpon, L. R. Rabiner, C.-H. Lee, E. R. Goldman, "Automatic Recognition of 
Keywords in Unconstrained Speech Using Hidden Markov Models," IEEE Tram. on 
ASSP, Vol. 38, No. 11, pp 1870-1878, November 1990. 
