388 Smith and Miller 
Bayesian Inference of Regular Grammar 
and Markov Source Models 
Kurt R. Smith and Michael I. Miller 
Biomedical Computer Laboratory 
and 
Electronic Signals and Systems Research Laboratory 
Washington University, St. Louis, MO 63130 
ABSTRACT 
In this paper we develop a Bayes criterion, which includes the Rissanen 
complexity, for inferring regular grammar models. We develop two 
methods for regular grammar Bayesian inference. The first method is 
based on treating the regular grammar as a 1-dimensional Markov 
source, and the second is based on the combinatoric characteristics of 
the regular grammar itself. We apply the resulting Bayes criteria to a 
particular example in order to show the efficiency of each method. 
1 MOTIVATION 
We are interested in segmenting electron-microscope autoradiography (EMA) images by 
learning representational models for the textures found in the EMA image. In studying 
this problem, we have recognized that both structural and statistical features may be 
useful for characterizing textures. This has motivated us to study the source modeling 
problem for both structural sources and statistical sources. The statistical sources that 
we have examined are the class of one- and two-dimensional Markov sources (see [Smith 
1990] for a Bayesian treatment of Markov random field texture model inference), while 
the structural sources that we are primarily interested in here are the class of regular 
grammars, which are important due to the role that grammatical constraints may play in 
the development of structural features for texture representation. 
2 MARKOV SOURCE INFERENCE 
Our primary interest here is the development of a complete Bayesian framework for the 
process of inferring a regular grammar from a training sequence. However, we have 
shown previously that there exists a 1-D Markov source which generates the regular 
language defined via some regular grammar [Miller, 1988]. We can therefore develop a 
generalized Bayesian inference procedure over the class of 1-D Markov sources which 
enables us to learn the Markov source corresponding to the optimal regular grammar. 
We begin our analysis by developing the general structure for Bayesian source modeling. 
2.1 BAYESIAN APPROACH TO SOURCE MODELING 
We state the Bayesian approach to model learning: Given a set of source models 
$\{\theta_0, \theta_1, \ldots, \theta_{M-1}\}$ and the observation $x$, choose the source model $\theta_i$ which most accurately 
represents the unknown source that generated $x$. This decision is made by calculating 
Bayes risk over the possible models, which produces a general decision criterion for the 
model learning problem: 
$$\max_{\{\theta_0, \theta_1, \ldots, \theta_{M-1}\}} \; \log P(x \mid \theta_i) + \log P_{\theta_i}. \qquad (2.1)$$
Under the additional assumption that the a priori probabilities over the candidate models 
are equal, the decision criterion becomes 
$$\max_{\{\theta_0, \theta_1, \ldots, \theta_{M-1}\}} \; \log P(x \mid \theta_i), \qquad (2.2)$$
which is the quantity that we will use in measuring the accuracy of a model's 
representation. 
2.2 STOCHASTIC COMPLEXITY AND MODEL LEARNING 
It is well known that when given finite data, Bayesian procedures of this kind which do 
not have any prior on the models suffer from the fundamental limitation that they will 
predict models of greater and greater complexity. This has led others to introduce 
priors into the Bayes hypothesis testing procedure based on the complexity of the model 
being tested [Rissanen, 1986]. In particular, for the Markov case the complexity is 
directly proportional to the number of transition probabilities of the particular model 
being tested with the prior exponentially decreasing with the associated complexity. 
We now describe the inclusion of the complexity measure in greater detail. 
Following Rissanen, the basic idea is to uncover the model which assigns maximum 
probability to the observed data, while also being as simple as possible so as to require a 
small Kolmogorov description length. The complexity associated with a model having 
$k$ real parameters and a likelihood with $n$ independent samples is the now well-known 
$\frac{k}{2} \log n$, which allows us to express the generalization of the original Bayes procedure 
(2.2) as the quantity 
max log P(x-I) - ? log n 
(2.3) 
Note well that  is the ksrdimensional parameter parameterizing model , which must 
be estimated from the observed data x-. An alternative view of (2.3) is discovered by 
viewing the second term as the prior in the Bayes model (2.1) where the prior is defined 
koi log n 
Psi= e'- (2.4) 
2.3 1-D MARKOV SOURCE MODELING 
Consider that $x_n$ is a 1-D $n$-length string of symbols which is generated by an unknown 
finite-state Markov source. In examining (2.3), we recognize that for 1-D Markov 
sources $\log P(x_n \mid \hat{\theta}_i)$ may be written as $\log \prod_{j=1}^{n-1} P(S(x_{j+1}) \mid S(x_j))$, where $S(x_j)$ is a state 
function which evaluates to a state in the Markov source state set $S_{\theta_i}$. Using this 
notation, the Bayes hypothesis test for 1-D Markov sources may be expressed as: 
$$\max_{\{\theta_0, \theta_1, \ldots, \theta_{M-1}\}} \; \log \prod_{j=1}^{n-1} P(S(x_{j+1}) \mid S(x_j)). \qquad (2.5)$$
For the general Markov source inference problem, we know only that the string $x_n$ was 
generated by a 1-D Markov source, with the state set $S_{\theta_i}$ and the transition probabilities 
$P(S_k \mid S_l)$, $S_k, S_l \in S_{\theta_i}$, unknown. They must therefore be included in the inference procedure. 
To include the complexity term for this case, we note that the number of parameters to 
be estimated for model $\theta_i$ is simply the number of entries in the state-transition matrix 
$P_{\theta_i}$, i.e. $k_{\theta_i} = |S_{\theta_i}|^2$. Therefore for 1-D Markov sources, the generalized Bayes hypothesis 
test including complexity may be stated as 
$$\max_{\{\theta_0, \theta_1, \ldots, \theta_{M-1}\}} \; \frac{1}{n} \sum_{j=1}^{n-1} \log \hat{P}(S(x_{j+1}) \mid S(x_j)) - \frac{|S_{\theta_i}|^2}{2n} \log n, \qquad (2.6)$$
where we have divided the entire quantity by $n$ in order to express the criterion in terms 
of bits per symbol. Note that a candidate Markov source model $\theta_i$ is initially specified 
by its order and corresponding state set $S_{\theta_i}$. 
The procedure for inferring 1-D Markov source models can thus be stated as follows. 
Given a sequence $x_n$ from some unknown source, consider candidate Markov source 
models by computing the state function $S(x_j)$ (determined by the candidate model 
order) over the entire string $x_n$. Enumerating the state transitions which occur in $x_n$ 
provides an estimate of the state-transition matrix $\hat{P}_{\theta_i}$, which is then used to compute 
(2.6). The inferred Markov source is then the one maximizing (2.6). 
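The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `markov_criterion` is hypothetical, logarithms are taken base 2 (to match "bits per symbol"), the state function is assumed to be the last `order` symbols, and the state set size is assumed to be the alphabet size raised to the power `order`.

```python
import math
from collections import Counter


def markov_criterion(x, order):
    """Criterion (2.6) for one candidate Markov order (illustrative sketch)."""
    n = len(x)
    alphabet_size = len(set(x))
    # Assumed state function S(x_j): the `order` most recent symbols ending at j.
    states = [tuple(x[j - order + 1 : j + 1]) for j in range(order - 1, n)]
    # Count observed state transitions and visits to each source state.
    transitions = Counter(zip(states[:-1], states[1:]))
    visits = Counter(states[:-1])
    # Log-likelihood under the estimated transition probabilities.
    loglik = sum(c * math.log2(c / visits[s]) for (s, _), c in transitions.items())
    # Assumption: |S| = alphabet_size ** order, so k = |S|^2 parameters.
    num_states = alphabet_size ** order
    return loglik / n - num_states ** 2 * math.log2(n) / (2 * n)


if __name__ == "__main__":
    x = "01" * 100  # a toy, fully predictable sequence
    scores = {order: markov_criterion(x, order) for order in (1, 2, 3)}
    print(max(scores, key=scores.get), scores)
```

On the toy alternating sequence every transition is deterministic, so the likelihood term vanishes and the complexity term alone selects the lowest order.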
3 REGULAR GRAMMAR INFERENCE 
Although the Bayes criterion developed for 1-D Markov sources (2.6) is a sufficient 
model learning criterion for the class of regular grammars, we will now show that by 
taking advantage of the a priori knowledge that the source is a regular grammar, the 
inference procedure can be made much more efficient. This a priori knowledge brings a 
special structure to the regular grammar inference problem in that not all allowable 
sets of Markov probabilities correspond to regular grammars. In fact, as shown in 
[Miller, 1988], corresponding to each regular grammar is a unique set of candidate 
probabilities, implying that the Bayesian solution which takes this into account will be 
far more efficient. We now demonstrate this. 
3.1 BAYESIAN CRITERION USING GRAMMAR COMBINATORICS 
Our approach is to use the combinatoric properties of the regular grammar in order to 
develop the optimal Bayes hypothesis test. We begin by defining the regular grammar. 
Definition: A regular grammar $G$ is a quadruple $\{V_N, V_T, S_S, R\}$ where $V_N$, $V_T$ are finite 
sets of non-terminal symbols (or states) and terminal symbols respectively, $S_S$ is the 
sentence start state, and $R$ is a finite set of production rules consisting of the 
transformation of a non-terminal symbol to either a terminal followed by a non-terminal, 
or a terminal alone, i.e., 
$$S_i \rightarrow W_j S_k \quad \text{or} \quad S_i \rightarrow W_j, \quad \text{where } W_j \in V_T,\; S_i, S_k \in V_N.$$
In the class of regular grammars that we consider, we define the depth of the language 
as the maximum number of terminal symbols which make up a nonterminal symbol. 
Corresponding to each regular grammar is an associated incidence matrix $B$ with the $(i,k)$th 
entry $B_{i,k}$ equal to the number of times there is a production for some terminal $j$ and 
non-terminals $i,k$ of the form $S_i \rightarrow W_j S_k$. Also associated with each grammar $G_i$ is 
the set of all $n$-length strings produced by the grammar, denoted as the regular language 
$\mathcal{L}_n(G_i)$. 
Now we make the quite reasonable assumption that no string in the language $\mathcal{L}_n(G_i)$ is 
more or less probable a priori than any other string in that language. This indicates that 
all $n$-length strings that can be generated by $G_i$ are equiprobable with a probability 
dictated by the combinatorics of the language as 
$$P(x_n \mid G_i) = \frac{1}{|\mathcal{L}_n(G_i)|}, \qquad (3.1)$$
where $|\mathcal{L}_n(G_i)|$ denotes the number of $n$-length sequences in the language, which can be 
computed by considering the combinatorics of the language as follows: 
$$|\mathcal{L}_n(G_i)| = \lambda_{G_i}^{n},$$
with ,al corresponding to the largest eigenvalue of the state-transition matrix Bay 
This results from the combinatoric growth rate being determined by the sum of the 
entries in the n a power state-transition matrix B' 
o,, which grows as the largest 
eigenvalue ,a ofBo [Blahut, 1987]. We can now write (3.1) in these terms as 
P(x. IGi)= i", 
(3.2) 
which expresses the probability of the sequence x. in terms of the combinatorics of Gi. 
We now use this combinatoric interpretation of the probability to develop Bayes 
decision criterion over two candidate grammars. Assume that there exists a finite space 
of sequences $X$, all of which may be generated by one of the two possible grammars 
$\{G_0, G_1\}$. Now by dividing this observation space $X$ into two decision regions, $X_0$ (for 
$G_0$) and $X_1$ (for $G_1$), we can write Bayes risk $R$ in terms of the observation probabilities 
$P(x_n \mid G_0)$, $P(x_n \mid G_1)$: 
$$R = \frac{1}{2} \sum_{x_n \in X_1} P(x_n \mid G_0) + \frac{1}{2} \sum_{x_n \in X_0} P(x_n \mid G_1). \qquad (3.3)$$
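This implementation of Bayes risk assumes that sequences from each grammar occur 
equiprobably a priori and that the cost of choosing the incorrect grammar is equal to 1. 
Now incorporating the combinatoric counting probabilities (3.2), we can rewrite (3.3) as 
$$R = \frac{1}{2} \sum_{x_n \in X_1} \lambda_{G_0}^{-n} + \frac{1}{2} \sum_{x_n \in X_0} \lambda_{G_1}^{-n},$$
which can be rewritten 
$$R = \frac{1}{2} + \frac{1}{2} \sum_{x_n \in X_0} \left\{ \lambda_{G_1}^{-n} - \lambda_{G_0}^{-n} \right\}. \qquad (3.4)$$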
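The risk is therefore minimized by choosing $G_0$ if $\lambda_{G_1}^{-n} < \lambda_{G_0}^{-n}$ and $G_1$ if $\lambda_{G_1}^{-n} > \lambda_{G_0}^{-n}$. 
This establishes the likelihood ratio for the grammar inference problem: 
$$\frac{\lambda_{G_1}^{-n}}{\lambda_{G_0}^{-n}} \; \underset{G_0}{\overset{G_1}{\gtrless}} \; 1,$$
which can alternatively be expressed in terms of the log as 
$$-n \log \lambda_{G_1} \; \underset{G_0}{\overset{G_1}{\gtrless}} \; -n \log \lambda_{G_0}.$$
Recognizing this as the maximum likelihood decision, this decision criterion is easily 
generalized to $M$ hypotheses. Now by ignoring any complexity component, the 
generalized Bayes test for a regular grammar can be stated as 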
a - a g , (3.5) 
where 1 is the largest eigenvalue of the estimated incidence matrix Bal corresponding 
to grammar Gi where Bat is estimated fxom  
The complexity factor to be included in this Bayesian criterion differs from the 
complexity term in (2.3) due to the fact that the parameters to be estimated are now the 
entries in the $\hat{B}_{G_i}$ matrix, which are strictly binary. From a description length 
interpretation then, these parameters can be fully described using 1 bit per entry in $\hat{B}_{G_i}$. 
The complexity term is thus simply $|S_{G_i}|^2$, which now allows us to write the Bayes 
inference criterion for regular grammars as 
$$\max_{\{G_0, G_1, \ldots, G_{M-1}\}} \; -\log \hat{\lambda}_{G_i} - \frac{|S_{G_i}|^2}{n}, \qquad (3.6)$$
in terms of bits per symbol. We can now state the algorithm for inferring grammars. 
Regular Grammar Inference Algorithm 
1. Initialize the grammar depth to $d = 1$. 
2. Compute $|S_{G_d}| = |V_T|^d$. 
3. Using the state function $S_{G_d}(x_j)$ corresponding to the current depth, compute 
the state transitions at all sites $x_j$ in the observed sequence $x_n$ in order to 
estimate the incidence matrix $\hat{B}_{G_d}$ for the grammar currently being 
considered. 
4. Compute $\hat{\lambda}_{G_d}$ from $\hat{B}_{G_d}$ (recall that this is the largest eigenvalue of $\hat{B}_{G_d}$). 
5. Using $\hat{\lambda}_{G_d}$ and $|S_{G_d}|$, compute (3.6); denote this as $l_{G_d} = -\log \hat{\lambda}_{G_d} - |S_{G_d}|^2 / n$. 
6. Increase the grammar depth $d = d + 1$ and go to 2 (i.e. test another candidate 
grammar) until $l_{G_d}$ discontinues to increase. 
The regular grammar of minimum depth which maximizes $l_{G_d}$ (i.e. maximizes (3.6)) is 
then the optimal regular grammar source model for the given sequence $x_n$. 
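The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `grammar_criterion` and `infer_grammar_depth` are hypothetical, logarithms are base 2, the state at depth $d$ is taken to be the last $d$ terminals, and the incidence matrix is set to 1 wherever a transition is observed. The sketch also scans a fixed range of candidate depths and returns the smallest depth attaining the maximum, rather than stopping at the first non-increase; that is a choice of this sketch.

```python
import itertools
import math

import numpy as np


def grammar_criterion(x, depth, alphabet="01"):
    """Criterion (3.6) for one candidate depth (illustrative sketch)."""
    n = len(x)
    # Step 2: |S| = |V_T|^depth candidate states (all terminal strings of length depth).
    states = ["".join(p) for p in itertools.product(alphabet, repeat=depth)]
    index = {s: i for i, s in enumerate(states)}
    # Step 3: binary incidence matrix, 1 wherever a state transition is observed.
    B = np.zeros((len(states), len(states)))
    for j in range(depth, n):
        B[index[x[j - depth : j]], index[x[j - depth + 1 : j + 1]]] = 1.0
    # Step 4: largest eigenvalue (in magnitude) of the estimated incidence matrix.
    lam = float(np.max(np.abs(np.linalg.eigvals(B))))
    if lam < 0.5:
        # A 0/1 matrix without any cycle has spectral radius 0; treat as degenerate.
        return float("-inf")
    # Step 5: per-symbol log term minus the 1-bit-per-entry complexity term.
    return -math.log2(lam) - len(states) ** 2 / n


def infer_grammar_depth(x, max_depth=5, alphabet="01"):
    """Return the smallest depth in 1..max_depth that maximizes (3.6)."""
    scores = {d: grammar_criterion(x, d, alphabet) for d in range(1, max_depth + 1)}
    return max(sorted(scores), key=scores.get)
```

For example, `infer_grammar_depth(x)` applied to a binary string `x` compares depths 1 through 5; `max_depth` bounds the size of the largest matrix, which grows as the alphabet size raised to the depth.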
3.2 REGULAR GRAMMAR INFERENCE RESULTS 
To compare the efficiency of the two Bayes criteria (2.6) and (3.6), we will consider a 
regular grammar inference experiment. The regular grammar that we will attempt to 
learn, which we refer to as the 4-0,1s regular grammar, is a run-length constrained binary 
grammar which disallows 4 consecutive occurrences of a 0 or a 1. Referring to the 
regular grammar definition, we note that this regular grammar can be described by its 
incidence matrix 
$$B_{4\text{-}0,1} = \begin{bmatrix}
0 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}$$
where the states corresponding to row and column indices are 
$S_1 = 000$, $S_2 = 00$, $S_3 = 0$, $S_4 = 1$, $S_5 = 11$, $S_6 = 111$. 
Note that this regular grammar has a depth equal to 3 and thus the corresponding 
Markov source has an order equal to 3. 
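As a numerical check on this example, the largest eigenvalue of the incidence matrix can be compared with the growth of the entry sum of its powers. The sketch below uses NumPy; the variable names and the choice of powers 20 and 21 are illustrative. The largest eigenvalue comes out at approximately 1.839, the real root of $\lambda^3 = \lambda^2 + \lambda + 1$, so $\log_2 \lambda$ is about 0.879 bits per symbol.

```python
import numpy as np

# Incidence matrix of the 4-0,1s grammar; rows and columns follow the state
# order 000, 00, 0, 1, 11, 111 given in the text.
B = np.array([
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
])

# Largest eigenvalue (in magnitude) of B.
lam = float(np.max(np.abs(np.linalg.eigvals(B))))


def path_count(n):
    """Sum of the entries of B^n, i.e. the number of n-step state paths."""
    return int(np.linalg.matrix_power(B, n).sum())


# Ratio of successive counts, which should approach the largest eigenvalue.
growth = path_count(21) / path_count(20)
print(lam, growth, np.log2(lam))
```

The ratio of successive path counts settles near the same value as the eigenvalue, which is the growth-rate argument used to obtain (3.2).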
The inference experiment may be described as follows. Given a training set of length-16 
strings from the 4-0,1s language, we apply the Bayes criteria (2.6) and (3.6) in an attempt 
to infer the regular grammar in each case. We compute the criteria for five candidate 
models of order/depth 1 through 5 (recall that this defines the size of the state set for 
the Markov source and the regular grammar, respectively). 
Treating the unknown regular grammar as a Markov source, we estimate the 
corresponding state-transition matrix $P$ and then compute the Bayes criterion according 
to (2.6) for each of the five candidate models. We compute the criterion as a function of 
the number of training samples for each candidate model and plot the result in Figure 1a. 
Similarly, we estimate the incidence matrix $B$ and compute the Bayes criterion according 
to (3.6) for each of the five regular grammar candidate models, and plot the results as a 
function of the number of training samples in Figure 1b. 
We compare the two Bayesian criteria by examining Figures 1a and 1b. Note that 
criterion (3.6) discovers the correct regular grammar (depth = 3) after only 50 training 
samples (Figure 1b), while the equivalent Markov source (order = 3) is found only after 
almost 500 training samples have been used in computing (2.6) (Figure 1a). This points 
out that a much more efficient inference procedure exists for regular grammars by 
taking advantage of the a priori grammar information (i.e. only the depth and the binary 
incidence matrix $B$ must be estimated), whereas for 1-D Markov sources, both the order 
and the real-valued state-transition matrix $P$ must be estimated. 
4 CONCLUSION 
In conclusion, we stress the importance of casting the source modeling problem within a 
Bayesian framework which incorporates priors based on the model complexity and 
known model attributes. Using this approach, we have developed an efficient Bayesian 
[Figure 1 (plot not reproduced): two panels, a) and b), each plotting the criterion value against the number of training samples (axis ticks at 5, 50, 500, 5000, 50000), with one curve per grammar depth d / Markov order 1 through 5 and a level marked "Limit".] 
Figure 1: Results of computing Bayes criterion measures (2.6) and (3.6) 
vs. the number of training samples - a) Markov source criterion 
(2.6); b) Regular grammar combinatoric criterion (3.6). 
framework for inferring regular grammars. This type of Bayesian model is potentially 
quite useful for the texture analysis and image segmentation problem where a consistent 
framework is desired for considering both structural and statistical features in the 
textured image representation. 
Acknowledgements 
This research was supported by the NSF via a Presidential Young Investigator Award 
ECE-8552518 and by the NIH via a DRR Grant RR-1380. 
References 
Blahut, R. E. (1987), Principles and Practice of Information Theory, Addison-Wesley 
Publishing Co., Reading, MA. 
Miller, M. I., Roysam, B., Smith, K. R., and Udding, J. T. (1988), "Mapping Rule-Based 
Regular Grammars to Gibbs Distributions", AMS-IMS-SIAM Joint Conference on 
SPATIAL STATISTICS AND IMAGING, American Mathematical Society. 
Rissanen, J. (1986), "Stochastic Complexity and Modeling", Annals of Statistics, 14, 
no. 3, pp. 1080-1100. 
Smith, K. R., Miller, M. I. (1990), "A Bayesian Approach Incorporating Rissanen 
Complexity for Learning Markov Random Field Texture Models", Proceedings of 
Int. Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM. 
