Grammar Learning by a Self-Organizing 
Network 
Michiro Negishi 
Dept. of Cognitive and Neural Systems, Boston University 
111 Cummington Street 
Boston, MA 02215 email: negishi@cns.bu.edu 
Abstract 
This paper presents the design and simulation results of a self- 
organizing neural network which induces a grammar from exam- 
ple sentences. Input sentences are generated from a simple phrase 
structure grammar including number agreement, verb transitiv- 
ity, and recursive noun phrase construction rules. The network 
induces a grammar explicitly in the form of symbol categorization 
rules and phrase structure rules. 
1 Purpose and related works 
The purpose of this research is to show that a self-organizing network with a certain 
structure can acquire syntactic knowledge from only positive (i.e. grammatical) 
data, without rocluiring any initial knowledge or external teachers that correct 
errors. 
There has been research on supervised neural network models of language acquisi- 
tion tasks [Elman, 1991, Miikkulainen and Dyer, 1988, John and McClelland, 1988]. 
Unlike these supervised models, the current model self-organizes word and phrasal 
categories and phrase construction rules through mere exposure to input sentences, 
without any artificially defined task goals. There also have been self-organizing 
models of language acquisition tasks [Ritter and Kohonen, 1990, Scholtes, 1991]. 
Compared to these models, the current model acquires phrase structure rules in 
more explicit forms, and it learns wider and more structured contexts, as will be 
explained below. 
2 Network Structure and Algorithm 
The design of the current network is motivated by the observation that humans 
have the ability to handle a frequently occurring sequence of symbols (chunk) 
as an unit of information [Grossberg, 1978, Mannes, 1993]. The network consists 
of two parts: classification networks and production networks (Figure 1). The 
classification networks categorize words and phrases, and the production networks 
28 Michiro Negishi 
evaluate how it is likely for a pair of categories to form a phrase. A pair of combined 
categories is given its own symbol, and fed back to the classifiers. 
After weights are formed, the network parses a sentence as follows. Input words 
are incrementally added to the neural sequence memory called the Gradient Field 
[Grossberg, 1978] (GF hereafter). The top (i.e. most recent) two symbols and the 
lookahead token are classified by three classification networks. Here a symbol 
is either a word or a phrase, and the lookahead token is the word which will be 
read in next. Then the lookahead token and the top symbol in the GF are sent to 
the right production network, and the top and the second ones are sent to the left 
production network. If the latter pair is judged to be more likely to form a phrase, 
the symbol pair reduces to a phrase, and the phrase is fed back to the GF after 
removing the top two symbols. Otherwise, the lookahead token is added to the 
sequence memor causing a shift in the sequence memory. If the input sentence is 
grammatical, the repetition of this process reduces the whole sentence to a single 
"S" (sentence) symbol. The sequence of shifts and reductions (annoted with the 
resultant symbols) amounts to a parse of the sentence. 
During learning, the operations stated above are carried out as weights are grad- 
ually formed. In classification networks, the weights record a distribution pattern 
with respect to each symbol. That is, the weights record the co-occurrence of 
up to three adjacent symbols in the corpus. An symbol is classified in terms of 
this distribution in the classification networks. The production networks keep 
track of the categories of adjacent symbols. If the occurrence of one category reli- 
ably predicts the next or the previous one, the pair of categories forms a phrase, 
and is given the status of an symbol which is treated just like a word in the 
sentence. Because the symbols include phrases, the learned context is wider 
and more structured than the mere bigram, as well as the contexts utilized in 
[Ritter and Kohonen, 1990, Scholtes, 1991]. 
3 Simulation 
3.1 The Simulation Task 
The grammar used to generate input sentences (Table 3) is identical to that used 
in [Elman, 1991], except that it does not include optionally transitive verbs and 
proper nouns. Lengths of the input sentences are limited to 16 words. To deter- 
mine the completion of learning, after accepting 200 consecutive sentences with 
learning, learning is suppressed and other 200 sentences are processed to see if all 
are accepted. In addition, the network was tested for 44 ungrammatical sentences 
to see that they are correctly rejected. Ungrammatical sentences are derived by 
hand from randomly generated grammatical sentences. Parameters used in the 
simulation are: number of symbol nodes = 30 (words) + 250 (phrases), number 
of category nodes = 150, e = 10 -9, 7 '- 0.25, p -- 0.65, Ctl = 0.00005, fll -- 0.005, 
 = 0.2, Ct 3 -- 0.0001, f13 -- 0.001, and T = 4.0. 
Grammar Learning by a Self-Organizing Nem,ork 2 9 
3.2 Acquired Syntax Rules 
Learning was completed after learning 19800 grammatical sentences. Tables 1 and 
2 show the acquired syntax rules extracted from the connection weights. Note that 
category names such as Ns, VPp, are not given a priori, but assigned by the author 
for the exposition. Only rules that eventually may reach the "S"sentence) node 
are shown. There were a small number of uninterpretable rules, which are marked 
"". These rules might disturb normal parsing for some sentences, but they were 
not activated while testing for 200 sentences after learning. 
3.3 Discussion 
Recursive noun phrase structures should be learned by finding equivalences of 
distribution between noun phrases and nouns. However, nouns and noun phrases 
have the same contextual features only when they are in certain contexts. An 
examination of the acquired grammar reveals that the network finds equivalence 
of features not of "N" and "N RC"(where RC is a relative clause) but of "N V" and 
"N RC V" (when "N RC" is subjective), or "V N" and "V N RC"(when "N RC" is objective). As an example, let us examine the parsing of the sentence [19912] 
below. The rule used to reduce FEEDS CATS WHO LIVE ("V N RC") is P0, which 
is classified as category C4, which includes P121 ("V N") where V are the singular 
forms of transitive verbs, and also includes the "V"where V are singular forms of 
intransitive verbs. Thus, GIRL WHO FEEDS CATS WHO LIVE is reduced to GIRL 
WHO "VPsingle". 
+---141---+ 
I +---88 ...... + 
I I +---206 ...... + 
I I I + .... 0 .... + 
I I I I +-21-+ 
I t +-41-+ I +-36-+ I 
BOYS CHASE GIRL WHO FEEDS CATS WHO LIVE 
<<Accepted>> Top symbol was 77 
4 Conclusion and Future Direction 
In this paper, a self-organizing neural network model of grammar learning was 
presented. A basic principle of the network is that all words and phrases are 
categorized by the contexts in which they appear, and that familiar sequence of 
categories are chunked. 
As it stands, the scope of the grammar used in the simulation is extremely limited. 
Also, considering the poverty of the actual learning environment, the learning 
of syntax should also be guided by the cognitive competence to comprehend the 
utterance situations and conversational contexts. However, being a self-organizing 
network, the current model offers a plausible model of natural language acquisition 
through mere exposures to only grammatical sentences, not requiring any external 
teacher or an explicit goal. 
30 Michiro Negishi 
Table 1. Acquired categorization rules 
$ := C29/* Nrps VPs '/ 
C30/* ? '/ I 
CT//' NPp VPp '/ 
C4 := LIVES I WALKS I 
P0/" VTs Np RC '/ 
P74/' VTs Ns RC*/ 
P121/ VTs Ns '/ 
P157/' v'rs Np '/ 
C3 := GIRLI DOG I 
CAT I BOY 
C16 := CHASE t FEED 
C18 := WHO 
C20 := CHASES t FEEDS 
C26 := BOYSICATSI 
DOGS I GIRLS 
C29 := P93/'NsRCVPs'/ 
P138/' Ns VPs '/ 
C30 := P2/'VTpNPpVPp 
P94/'VTp N VT'/ I 
P137/' ? '/ 
C32 := WALKILIVE I 
PI/' VTp Np RC '/ 
P61/ VTp Np '/ 
P88/' VTp Ns RC 
P122/' VTp Ns'/ 
 -/' VPs '/ 
=/' Ns '/ 
=/' VTp '/ 
=/' R '/ 
=/' VTs '/ 
=/' Np '/ 
=/' NPs VPs '/ 
=/. ? "/ 
=/' VPp '/ 
C52 := P41/*NsR*/ 
C56 := P36/' Np R'/ 
C58 := P28/' Ns VTs'/ 
P34/'NpVTp'/  
P68/' Ns RC V'rs 
P147/* Np RC V'rp 
C69 := P206/' Ns R VPs 
P238/' Ns R N VT'/ 
C74 := P219/* Np R VPp 
P249/'Np R N VT'/ 
C77 := P141/'NpVPp 
P217/' Np RC VPp 
Cl19 := P148 
C122 := P243 
C139 := P10/' VTs hips 
P32/' VTs NPp 
where 
RCs = RVPs I RNVT 
RCp = R VPp I R N VT 
NPp = NpI Np RCp 
Nrps = Ns I Ns RCs 
=/*NVT*/ 
=/* Ns RCs */ 
=/* Np RCp */ 
=/* NPp VPp '/ 
=/* VTs NVT*/ 
=/* Ns R VTs N VT '/ 
=/* VPs' VPp/s ?*/ 
Table 2. Acquired production rules 
P0 := C20/* VTs */ C74/* Np RCp "/ 
P1 := C16/* VTp */ C74/* Np RCp */ 
P2 := C16/* VTp */ C77/* NPp VPp */ 
P10 := C20/* VTs */ C29/* NPs VPs */ 
P28 :=C13 /*Ns*/ C20/*VTs*/ 
P32 := C20/* VTs */ C77/* NPp VPp */ 
P34 := C26/* Np */ C16/* VTp */ 
P36 := C26/* Np */ C18/* R */ 
P41 := C13/* Ns */ C18/* R */ 
P61 := C16/* VTp */ C26/* Np */ 
P68 := C69/* Ns RCs */ C20/*VTs*/ 
P74 := C20/* VTs */ C69/* Ns RCs */ 
1:'88 := C16/*VTp*/ C69/*NsRCs*/ 
P93 := C69/* Ns RCs */ C4/* VPs */ 
P94 := C16/* VTp '/ C58/* N VT '/ 
P121 := C20/* VTs */ C13/* Ns */ 
P122 := C16/*.V.,p */ C13/* Ns */ 
P137 :=C122/ NsRVTsNVT*/ C32/*VPp*/ 
P138 := C13/* Ns */ C4/* VPs*/ 
P141 := C26/' Np '/ C32/' VPp '/ 
P147 := C74/' vsR.? '/ C161'VTp'/ 
P148 := C20/* C58/* N VT*/ 
P157 :-- C20/* VTs */ C26/* Np */ 
P206 := C52/* Ns R */ C4/* VPs*/ 
P217 := C/* Np RCs */ C32/* VPp*/ 
P219 := /* Np R */ C32/* VPp */ 
P238 := C52/* Ns R */ C58/* N VT */ 
P243 :=C52/*NsR*/ Cl19/*VTsNVT*/ 
P249 := C56/* Np R */ C58/* N VT */ 
=/* VTs Np RCp */ 
=/* VTp Np RCp */ 
/* VTp NPp VPp */ 
=/* VTs NPs VPs */ 
=/" Ns VTs '/ 
=/* VTs NPp VPp '/ 
=/* Np VTp */ 
=/* Np R */ 
=/* Ns R */ 
=/- VTp Np '/ 
=/* Ns RCs VTs */ 
--/* VTs Ns RCs */ 
=/* VTp Ns RCs */ 
=/* Ns RCs VPs */ 
= /*VTpNVT*/ 
=/* VTs Ns */ 
=/* VTp Ns */ 
=/* ? '/ 
=/* Ns VPs / 
=/* Np VPp */ 
=/* Np RCs xrrp */ 
=/* VTs N VT*/ 
=/* VTs Np */ 
/* Ns R VPs */ 
=/* Np RCs VPp */ 
=/* Np R VPp */ 
=/*NsR N VT*/ 
=/* (Ns R VTs N) VT */ 
=/*NpRNVT*/ 
Grammar Leaning by a Self-Organ&#tg Network 31 
Acknowledgements 
The author wishes to thank Prof. Dan Bullock, Prof. Cathy Harris, Prof. Mike 
Cohen, and Chris Myers of Boston University for valuable discussions. 
This work was supported in part by the Air Force Office of Scientific Research 
(AFOSR F49620-92-J-0225). 
References 
[Elman, 1991] Elman, J. (1991). Distributed representations, simple recurrent net- 
works, and grammatical structure. Machine Learning, 7. 
[Grossberg, 1978] Grossberg, S. (1978). A theory of human memory: Self- 
organization and performance of sensory-motor codes, maps, and plans. Progress 
in Theoretical Biology, 5. 
[John and McClelland, 1988] John, M. E S. and McClelland, J. L. (1988). Applying 
contextual constraints in sentence comprehension. In Touretzky, D. S., Hinton, 
G. E., and Sejnowsky, T. J., editors, Proceedings of the Second Connectionist Models 
Summer School 1988, Los Altos, CA. Morgan Kaufmann Publisher, Inc. 
[Mannes, 1993] Mannes, C. (1993). Self-organizing grammar induction using a 
neural network model. In Mitra, J., Cabestany, J., and Prieto, A., editors, New 
Trends in Neural Computation: Lecture Notes in Computer Science 686. Springer 
Verlag, New York. 
[Miikkulainen and Dyer, 1988] Miikkulainen, R. and Dyer, M. G. (1988). Encoding 
input/output representations in connectionist cognitive systems. In Touretzky, 
D. D., Hinton, G. E., and Senowsky, T. J., editors, Proceedings of the Second Connec- 
tionist Models Summer School 1988, Los Altos, CA. Morgan Kauffman Publisher, 
Inc. 
[Ritter and Kohonen, 1990] Ritter, H. and Kohonen, T. (1990). Learning semanto- 
topic maps from context. Proceedings of. IJCNN 90, Washington D.C., I. 
[Scholtes, 1991] Scholtes, J. C. (1991). Unsupervised context learning in natural 
language processing. Proceedings of IJCNN Seattle 1991. 
Appendix A. Activation and learning equations 
A.1 Classification Network Activities 
 Gradient Field XOi(t) = 0.5/0i(t - 1) + Ii(t) (1) 
where t is a discrete time, i is the symbol id. and Ii (t) is an input symbol. 
elnput Layer 
= Xl.(t) = Xlc(t) = 
Where the suffix A ,B, and C the most recent, the next to most recent, and the 
lookahead symbols, respectively.. Weights in networks A, B, and C are identical. 
1 ifx > 1 -2 -M 
0(x) = 0 otherwise 
32 Michiro Negishi 
Here M is the maximum number of symbols on the gradient field. 
 Feature Layer 
X2,i t = = X2,j ) 
J  J 
- 2/0  - 
where s is a suffix which is either A, B, or C and T is the steepness of the sigrnoid 
function and a is a small positive constant. Table 4 shows the meaning of above 
suffix i. 
eCategory Layer 
1 if i= min{j[X2W2j 5pp) ), or 
X3pi = if 4=min{j[yX2W2 & unrefi =y*" {unrefj} 
0 othei 
Whe p is the least match o uid and uref is an unferenced count. 
(2) 
A.2 Classification Learning 
Feature Weights AWI,ij = -a Wlq + 13 Xli (X2,j - Wl,i) 
where ct is the forgetting rate, and 3 is the learning rate. 
 Categorization Weights 
{ W2,i = fl2X3,i(X2,i - W2,ij) 
W2ij = X2i 
where  is the learning rate. 
if the node is selected by the first line of (2) 
if the node is selected by the second line of (2) 
A.3 Production Network Activities 
.Mutual predictiveness 
X4ij = X3AiW3ij, X5ji = X3BjW4ji, X6ij = X4ijX5ji 
X7ij = X3BiW3ij, X8ji = X3cjW4ji, X9ij = X7ijX8ji 
The phrase identification number for a category pair (i, j) is given algorithmically 
in the current version by a cash function cash(i, j). 
(i) Case in which 7 ij X6ij ) ij X9ij  
1 if i = cash(I, J) 
XlOi = 0 otherwise 
Reduce 
(x%) 
where X6xa 
XOi(t + 1) = 0.5  pop(pop(XOi(t))) + X10, pop(z) = 2(z - 0(z)) 
(ii) Case in which 7 ij X6i < id X9q : Shift 
The next input symbol is added on the gradient field, as was expressed in (1). 
Grammar Learning by a Self-Organizing Network 33 
A.4 Production Learning 
A W3q = -ot3W3ii +3X3Ai (X3B-W3q ), AW4 = --a3W41i+J3X3Bj (X3Ai-W4) 
where X34i and X3Bi are nodes that receive the next to the most recent symbol i 
and the most recent symbol j, respectively. 
Shift / Reduce | 
Controller 
Production J 
Networks A 
(predictiveness/ \ / \ 
evaluators) 
Classification ,,. ) 
Networks ",,-'-' 
' o0 
1o - OOOl=-- 
Neural Sequence Mer 
Lookahead 
Token 
Input words 
Figure 1. Block diagram of the network 
eedback 
'hrases 
S -. NP VP. 
NP -- NI NRC 
vP -- v [NP] 
RC -- whoNPVI whoVP 
N -- boy l girl I carl dog I 
boys I girls I catsl dogs 
V -- chase I feed I work I live I 
chases I feeds I works I lives 
Number agreement 
- Agreements between N and V within clause 
- Agreements between head N and 
subordinate V (where appropriate) 
Verb arguments 
- chase, feed -> require a direct object 
- walk, live -> preclude a direct object 
(Observed also for head/verb relations 
in relative clauses) 
(1) Left context, words. 
s=,1 <=i<=N,,, 
(2) Left c6ntext, phra ses. 
s= L,N < i <= N,+ N 
(3) Left context, categories. 
s = L,N,,, +N, < i <= N,,, +N, + Nc 
(4) Right context, words. 
s= R, 1 <=i <= N, 
(5) Right context, phrases. 
s= R, 3t, < i <= N, + 3t s 
(6) Right context, categories. 
s = R, N, + N, < i <= N, + N, +.No . 
(7) Right context, lookahead. 
s= R,N, + N + N, < i <=2N,+ N,+ N, 
N,, N,, and N, denotes number of words, phrases, and 
categories, respectively. 
Table 3. Grammar for generated sentences Table 4. Subfields in a feature layer 
34 Michiro Negishi 
CategonJ 
(The Most Recent) 
Category 
(The Next to 
Most Recent) 
To the 
Production 
Network 
x 
W2 
X3A 
1A 
X5 
i I / 
I 
I 
I 
W2 
Wl 
AB 
 Gradient Field 
X 3B 
X 2 L Category 
(Lookahead) 
X 2B 2B 
XIB 
I  Xlc 
C 
X0 
(X1) I [ From the Production Network 
Terminals(Inputs) 
...ABC... 
Classification Path 
Copy for Learning 
Nonterminals 
(Reduction 
Results) 
Lookahead 
Terminals 
C'---'---I terminals 
 .':'.'.'.....:::_._..'. n onter m in al s 
I category 
L_ j 
previous next next to next 
Figure 2. Classification Network 
X 3B 
X 10 
Non torminal 
_I_ 
X 3c 
o o / 
le X 3A 
The xt to 
Most Recent 
Category 
Shift/Reduce z4 7 
Controller 
/ O,d'10 14,01/7..///7 
x 
3B 
The Most Recent 
Category 
Figure 3. Production Network 
X8 
X9 
X7 
Lookahead 
Category 
