“Computation in Linguistics: A Case Book”
Some Studies in Syntax-Directed Parsing
0. INTRODUCTION
0.0 In recent years considerable interest has been manifest, both among linguists1 and among computer scientists,2 in developing methods for mechanical parsing of sentences in a language. In the early phases of this work, the parsing algorithms developed were strongly grammar dependent, so that the computer program for parsing one given language was completely different from that for parsing a different language. In other words, the program embodied the grammar envisaged for the language. Indeed, in some cases (and this is especially true in computer science) the only concept of the grammar which was discernible was its embodiment in the program.
0.1 More recently, considerable interest has been directed toward methods known as syntax-directed parsing.3 In these methods the computer program may be considered to be an interpreter which parses sentences according to the format supplied by the grammar, the grammar being an entity outside the interpreter, treated as a piece of data as it were. One may say that while the programs indicated in paragraph 0.0 answer the question, ‘What is the structure of this given sentence?’, the ones indicated in this paragraph answer the question, ‘What is the structure of this given sentence according to this given grammar?’
0.2 Syntax-directed parsing has the obvious advantage of enabling easy changes in grammar at the experimenter’s discretion without involving any change in the program. On the other hand, since it is designed to process any grammar within a given class, its structure cannot be tailored to a specific language so as to make it especially efficient for parsing sentences in that language. How efficiently a specific syntax-directed parsing algorithm will process sentences in a specific language depends on the grammar envisaged for that language. The range of efficiency over different grammars, however, may not be the same for different algorithms. A number of syntax-directed parsing algorithms are known at present, and some efforts have already been made to study their efficiencies with respect to given grammars.4 However, the problem of efficiency range will require deeper study.
0.3 In the present paper we shall study certain aspects of the efficiency of a specific parsing algorithm. Like most of the parsing algorithms at the time of writing of this paper, the algorithm envisages the grammars to be context-free phrase structure grammars. It may be worthwhile to point out here that syntax-directed parsing algorithms, although they do not envisage a specific grammar, have to envisage a specific class of grammars to assure uniformity of operation. This class need not be context free (a few efforts are already under way for syntax-directed parsing in transformational grammars5), but it has to be a well-defined class.
0.4 The algorithm we shall discuss here was suggested in its original form by Alec Glennie.6 The form in which we shall present it here differs from his, we believe, only in nonessential features. The method is also essentially similar to the one suggested by Conway.7
0.5 The essential point that we want to make about our results here is that although the efficiency of an algorithm is grammar dependent, this need not mean that the efficiency is equally strongly language dependent. If an algorithm is not efficient for a certain grammar, it may often be possible to modify the grammar suitably to an equivalent form (without changing the language) for which the same algorithm is more efficient. As a matter of fact, the Rhodes-Kuno-Oettinger predictive analysis method is essentially the result of modifying an arbitrary grammar to the so-called standard form.8 Such a change of grammar may not always be acceptable to the linguist or the machine translator, since the parsing (i.e. the structural description or P-marker9) is changed entirely, unless the originally envisaged parsing can be recovered unequivocally from the obtained one. When this recovery is possible one says that the new grammar is ‘strongly equivalent’ to the old one. Unfortunately, in what follows we shall not be able to demonstrate that the changes suggested by us for the grammars result in strongly equivalent grammars. We shall only assure ourselves of equivalence, that is, that the language will remain unchanged on introducing the suggested changes in the grammar. At the time of writing, very little is known about tests for strong equivalence.10
0.6 What has been said above essentially sets out the background of the present paper. In the rest of this section we shall summarize the content of the remaining sections for the benefit of readers who may want to read only parts of the paper.
0.7 In section 1 of this article we shall introduce the concept of a context-free grammar in a precise way. This is somewhat redundant for the mathematical linguist; however, for the other readers of this paper it may be worthwhile to restate it and to bring out the fact that we are dealing with an axiomatically defined mathematical structure, and very little else. To some, this structure is evidently adequate for describing certain aspects of natural languages. To others it is clearly inadequate for dealing with any aspect of language, since language is an entity from which no aspect can be abstracted. We shall flatly refuse to enter the controversy here, since we consider it to be outside our pale of competence. We shall then define what we mean by structural descriptions, derivations, sentences, languages, and parsing; and we shall exhibit the basic philosophy of the Glennie parsing algorithm. A recursive flowchart for carrying out the Glennie algorithm will then be exhibited, and certain computer tricks for the realization of recursion pointed out. The operation of the flowchart will be exemplified.
0.8 In sections 2 and 3 we shall point out two difficulties associated with the Glennie algorithm. One, which we shall call the ‘Loop’, occurs in certain grammars and renders the Glennie algorithm completely powerless to work with them. However, it will also be pointed out that such a grammar can always be modified to an equivalent grammar on which the Glennie algorithm can operate. The method of modification will be exemplified and a flowchart for the method exhibited.
0.9 The other difficulty, called by us ‘back-up’,11 renders the Glennie algorithm inefficient on certain grammars. We shall exemplify a technique whereby one can change such grammars to equivalent grammars where the inefficiency is prevented. A flowchart for the technique will be exhibited and the technique exemplified.
0.10 Unfortunately this technique, unlike the one discussed in paragraph 0.8 above, is not universally applicable, and in the case of grammars where the technique is not applicable, the flowchart leads to a nonterminating program. What is worse, no method is available whereby one can decide for a given grammar whether the back-up-preventing technique will terminate or not. This fact will be discussed in detail.
0.11 After a few concluding remarks in section 4, we shall include in the appendices some detailed examples of the mode of operation of the loop-preventing and back-up-preventing flowcharts, as well as a failure of the latter. We shall also include clear proofs of some of the statements made in the body of the paper — endeavoring to exhibit the proofs in such a manner as to be intelligible to the nonmathematician.
1. CONTEXT-FREE GRAMMARS AND THE GLENNIE ALGORITHM
1.1 A context-free grammar is defined by exhibiting three finite sets of objects. The objects in one set are called the terminal symbols or words. These are the symbols that occur in the sentences of the language being described by the grammar. One may think of these terminal symbols as the words12 of the language. The objects in the second set are called nonterminals or phrase names. These never occur in the sentences of the language. This set has a prespecified member, called the sentence symbol. Terminal symbols will generally be represented by strings of Latin letters or by single lower-case Latin letters, depending on the language being exemplified. The nonterminals will be represented either by single capital Latin letters or by strings of Latin letters enclosed in angular brackets. The sentence symbol will generally be represented as ‘S’.
1.2 The objects in the third set are called productions, each of which takes the form of a string of symbols, consisting of a phrase name (nonterminal), followed by an arrow pointing to the right (→), followed by a string of terminal symbols and nonterminals.
1.3 Each grammar has its own sets of terminal symbols, nonterminals, and productions. These describe a language by the following uniform definition, which is the same for all grammars. We introduce this definition through the following steps.
1.4 A derivation in a context-free grammar is a progression of strings, obtained as follows. The first element of the progression is S. At every step of the progression, one locates the left-most nonterminal in the string comprising the last element in the progression, chooses any production which has this nonterminal at the left of the arrow and replaces the nonterminal in the string by the string on the right-hand side of the arrow in this production. If at some step of the derivation the resulting string consists of all terminal symbols, then such a string of terminal symbols ending a derivation is called a ‘sentence of the language described by the grammar’. Any language which can be described by a context-free grammar is a context-free language.
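The definition translates readily into a few lines of Python. The sketch below is ours, not part of the original text, and the toy grammar (S → aSb, S → c) is hypothetical; the function carries out a leftmost derivation, the production to be used at each step being given by its definition number (counted from 0 here).

# A minimal sketch of a leftmost derivation.  Nonterminals are the keys
# of GRAMMAR; every other symbol is terminal.
GRAMMAR = {                      # hypothetical toy grammar: S -> aSb | c
    "S": [["a", "S", "b"], ["c"]],
}

def derive(choices):
    # Start at S; at each step replace the left-most nonterminal using
    # the production whose definition number is the next entry of choices.
    string = ["S"]
    for choice in choices:
        i = next(k for k, sym in enumerate(string) if sym in GRAMMAR)
        string[i:i + 1] = GRAMMAR[string[i]][choice]
    return string

print(" ".join(derive([0, 0, 1])))   # a a c b b -- a sentence of the language

When the string resulting at some step contains no nonterminal, the derivation has ended and the string is a sentence of the language described by the grammar.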
1.5 The following examples may clarify the definition and some of the definitions to follow. We shall take a grammar which will generate a set of strings, some of which will look like English sentences. It is by no means claimed that this grammar is even a part of a grammar of English. We shall later on have to exemplify languages which have no connection with natural languages at all. However, the example does bring out the intuitive content of our definitions and points out what sort of formal operations could parse English sentences if English were a context-free language.
1.6 In our example, the set of terminal symbols will contain the following strings or words:
old
gray
killed
worried
ate
dog
cat
mouse
malt
the
The set of nonterminals will contain the following phrase names:
<NP>
<VP>
N''
N'
S'
S
A
V
The productions will be the ones shown in Table 4.1. The productions actually comprise the three columns named jointly as productions. The columns ‘Phrase name’, ‘Position 1’, and ‘Position 2’ are of no importance at present. Nor is the column headed ‘Definition number’. The column headed ‘Production number’ is for identification purposes only.
Let us now look at a derivation.
At each step of the derivation, the left-most nonterminal is replaced. For instance, in step 6 the left-most nonterminal, <NP>, is replaced. It is replaced by means of production number 2, which has <NP> on the left-hand side of the arrow: the resulting next step 7 of the derivation has <NP> replaced by the string the N'' which occurs to the right of the arrow in production 2.
Step 24 of the derivation contains no phrase name and hence no further progression is possible. This string, then, is a sentence of the language described by the grammar.
Table 4.1
1.7 Let us now define parsing and structural description. One says that a certain string y is derived from another given string x if y is a step in a derivation starting at x. It is not hard to see that if x be broken up into two arbitrary parts x1 and x2, that is, if x1 and x2 are two strings and x is obtained by writing x1 and x2 side by side (a fact we shall often express in concise notation by writing x = x1x2), then y can be broken into two parts y1 and y2 such that y1 can be derived from x1 and y2 can be derived from x2. (A rigorous proof of this fact can be given.13 However, since the fact is so easily acceptable to common sense and is also true, we need not go into this.)
1.8 Let us now look at the derivation exemplified above. The entire sentence, of course, is derived from the second step, that is, from <NP><VP>; hence the sentence can be broken up into two parts, one derived from <NP> and the other from <VP>. A perusal of step 12 will indicate that the dog the old cat killed is derived from <NP> and hence worried the gray malt the mouse ate is derived from <VP>. However, the dog the old cat killed is derived also from the N' S' in step 4; dog is derived from N' and the old cat killed is derived from S'.
1.9 Proceeding in this manner we can identify certain terminal symbols and strings of terminal symbols as being derived from a nonterminal. This identification can be effected by enclosing each string derived from a nonterminal in a pair of brackets and attaching the phrase name (nonterminal) to the right-hand bracket. Since one can readily identify the left bracket corresponding to a right bracket, one can, from such a notation, readily identify the phrase name corresponding to the strings which are derived from phrase names. The sentence in the above example will have its substrings identified as follows:
1.10 Such an identification of what may be called the structure of a sentence is called parsing. To those used to thinking of a parsing as the exhibition of a tree-like structure we could say that the two methods are essentially identical. If one identifies each bracket pair as coming from a node and envisions that the inner pairs correspond to lower nodes than the outer pairs, one would obtain, for this above sentence, the tree found in Figure 4.1. We shall not indicate here a formal procedure for conversion of the bracketed string to the tree and vice versa; it is probably intuitively clear and the purpose of the present paper is to exhibit some other mechanical procedures.
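For readers who wish to experiment, the following sketch (ours; the sample parse is purely illustrative, and N1, N2, S1 stand for N', N'', S', primes being awkward in program identifiers) renders one and the same nested structure both in the bracket notation of paragraph 1.9 and as an indented tree in the manner of Figure 4.1, which may help in convincing oneself that the two notations are essentially identical.

# A parse is a nested tuple: (phrase_name, child, child, ...),
# where a child is either a word (a string) or another such tuple.
parse = ("S",
         ("NP", "the", ("N2", ("N1", "dog"))),
         ("VP", ("V", "ate"), ("NP", "the", ("N2", ("N1", "malt")))))

def bracketed(node):
    # Render a parse in the bracket notation of paragraph 1.9.
    if isinstance(node, str):
        return node
    name, children = node[0], node[1:]
    return "(" + " ".join(bracketed(c) for c in children) + ")" + name

def tree(node, depth=0):
    # Render the same parse as an indented tree (cf. Figure 4.1).
    if isinstance(node, str):
        print("  " * depth + node)
        return
    print("  " * depth + node[0])
    for child in node[1:]:
        tree(child, depth + 1)

print(bracketed(parse))   # ((the ((dog)N1)N2)NP ((ate)V (the ((malt)N1)N2)NP)VP)S
tree(parse)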
1.11 The mechanical procedure or program we want to discuss first in detail is one which, given a grammar and a string, will determine whether the string belongs to the language described by the grammar and, if it does, will indicate its structure by parsing it. The overall mode of operation of the program can best be exemplified in terms of the above example again. The program essentially scans the string from left to right. Starting at the extreme left it first assumes that the string to its right is derived from S. If this assumption is to be true, then, according to production number 1, the string must be divisible into two parts, the first part derived from <NP> and the latter part derived from <VP>. However, if the first part is derived from <NP> then it must consist of the followed by a string derived from N''. Since the left-hand word of the string is indeed the, the program shifts attention to the right of this word and assumes that the string to its right is derived from N''. To test for this, one has to test for three possibilities: that this string be divisible into strings derived from N' and S', into strings derived from A and N'', or that the string be simply derived from N'. These are checked as above, yielding the fact that dog the old cat killed is indeed derived from N'', and hence the dog the old cat killed is derived from <NP>. Thus the program shifts attention to the part worried the gray malt the mouse ate and, according to the direction of production number 1, assumes it is derived from <VP>. It is to be borne in mind that while the program was checking for <NP>, V, N'', etc., through the application of many of the productions, the rest of production number 1 still remained to be checked after the success of these checks, and this had to be kept ‘in the back of the program’s mind’, so to speak.
1.12 Instead of giving another rough sketch of how <VP> is checked for in the second part of the sentence (we shall consider the program in great detail in what follows), it may be worthwhile to point out another basic fact. It will be noticed that although the operation of the program is quite complicated, it consists basically of the repeated application of a single program: one which, on being given a phrase name and a string, determines whether the string can be derived from the phrase name. The successive questions are in this case: First, ‘Is the dog the old cat . . . ate derived from S?’; ‘Is the dog the old cat killed derived from <NP>?’; ‘Is the first word the?’. Since it is, the next questions are: ‘Is dog the old cat killed derived from N''?’; ‘Is dog derived from N'?’. Since it is, the next questions are: ‘Is the old cat killed derived from S'?’; ‘Is the old cat derived from <NP>?’, etc. At this last point the program is checking for <NP>, and an affirmative answer to this check would provide only a partial answer to another check for <NP> on another string, a check started earlier but so far incomplete; only if the old cat be derived from N'' and killed be derived from V will the dog the old cat killed be checked as derived from <NP>, and, after this, not until worried the gray malt the mouse ate is checked as derived from <VP> is the check for S complete. The questions, then, follow not in sequence, in the sense that the second question is asked after the answer to the first is received, but as parts of one another. The program, as it were, uses itself as its own subprogram.
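In a modern programming language this fact can be expressed directly, the language itself supplying the recursion which the flowchart of Figure 4.2 must manage with explicit stacks. The sketch below is ours and simplifies freely; the grammar is a hypothetical fragment in the spirit of Table 4.1 (with N1, N2, S1 again standing for N', N'', S'), and failure of a definition, signalled by None, plays the role of back-up.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "N2"]],
    "N2": [["N1", "S1"], ["A", "N2"], ["N1"]],
    "N1": [["dog"], ["cat"], ["mouse"], ["malt"]],
    "S1": [["NP", "V"]],
    "A":  [["old"], ["gray"]],
    "VP": [["V", "NP"]],
    "V":  [["killed"], ["worried"], ["ate"]],
}

def check(name, words, pos):
    # Can some initial segment of words[pos:] be derived from `name`?
    # Returns the position just past the derived segment, or None.
    for definition in GRAMMAR[name]:       # try the definitions in order
        p = pos
        for symbol in definition:          # test each position in turn
            if symbol in GRAMMAR:          # a phrase name: use ourselves
                p = check(symbol, words, p)
                if p is None:
                    break
            elif p < len(words) and words[p] == symbol:
                p += 1                     # a word matched: move the tape
            else:
                p = None                   # mismatch: this definition fails
                break
        if p is not None:
            return p                       # first successful definition wins
    return None                            # no definition fits: back up

sentence = "the dog the old cat killed worried the mouse".split()
print(check("S", sentence, 0) == len(sentence))   # True: recognized

Notice that, like the flowchart, the sketch commits itself to the first definition that succeeds and ignores any later ones (compare paragraph 1.23).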
1.13 Since the number of occurrences of the same program in the overall operation is rather large (as the example above no doubt bears out), it is not advantageous to have the same program repeat itself in the larger program. Also, the total number of occurrences of the program cannot be ascertained initially. Hence the only way one can realize an operation such as the above in an actual computer is through the use of subroutines. When a subroutine has to be used, the main program which uses the subroutine communicates to the subroutine (by storing it in some area available to the subroutine) where the operation of the main program is to be resumed after the subroutine has completed its operation. This imposes no difficulty as long as a subroutine is not used again until all of its previous uses have been completed. In the present case, however, it is precisely this condition which is not fulfilled, and on completion of the subroutine, control may have to be transferred to various places depending on which use of the subroutine has been completed.
1.14 Another problem involved in this situation deals with data. While the subroutine is working on one set of data, it may use itself as a subroutine with a different set of data. This makes it necessary to make data available to the subroutine according to which use of the subroutine is under way.
1.15 These two problems are solved in a computer by storing both the data and the transfer-of-control address, not in specific memory cells, but in what are known as push-down stacks.14 The exact procedure for the realization of a push-down stack in a computer is strongly dependent on the available technology and need not concern us here. For our purposes, it will suffice to think of these as being similar to a stack of cards on which data may be recorded. As a card bearing a new piece of data is placed on the top of the stack, the cards below are hidden by the data at the top. Any time the card at the top is destroyed, the data on the card immediately below is made available.
1.16 The mode of use of the push-down stack for programming purposes will become clear as we peruse the flowcharts that follow. In the first flowchart (Figure 4.2), embodying the Glennie parsing algorithm, the stack will be used only for the storage of data and not for specifying the point of return from the subroutine to the parent routine. In the exemplification of the procedure, we shall exhibit stacks by writing the successive entries on the stack in a vertical column. The bottom entry of a column will be considered to be the datum on the top card of the stack.
1.17 Before coming to the detailed discussion of the flowchart, it may be worthwhile to explain in detail the conventions used there and to discuss the five push-down stacks used in the program. It should also be pointed out that because of the difference in philosophy between the conventional computer memory structures and a structure involving stacks, our notational conventions may be slightly different from those in the rest of this volume.
1.18 The Glennie algorithm, when considered as an interpreter (in the original paper, as well as in the Conway algorithm, the analyzer embodies the grammar, i.e. acts as a compiled program), uses five push-down stacks which we shall call here A, B, C, D, and E. A contains the location of a marker which moves along the words (terminal symbols) of the sentence to be analyzed. As this marker moves along the sentence, its words, as well as the brackets needed for parsing the sentence (see below), are put down in an output string. The position of the output string at which the next symbol is to be written at any stage of the program is marked by another marker whose location is stored in push-down stack B. Stack C contains the phrase name (nonterminal) which is being tested for. Generally, when the subroutine starts with a phrase name in C and a specific position of the input marker in A, the test being conducted is for checking the assumption that some string starting at the position marker is derived from the specific phrase name. The program conducts this test by assuming that the derivation of the string from the phrase name starts with a specific production which has the phrase name on its left. In Table 4.1 of paragraph 1.6, the different productions having the same phrase name on the left are numbered consecutively. These numbers are called the definition numbers. When the program checks for a certain string being derived from a phrase name, it may find that the starting production needed is not the one having the specific definition number assumed initially, but one having a different definition number. The definition number being tried is stored in a push-down stack D. Again, as a specific definition is tried, it is assumed that the string being tested can be broken up into parts, each separately derived from the symbols in the string in the right-hand side of the production under consideration. The testing for these symbols has to be done consecutively and hence a record has to be kept at running time of the program about which symbol in the right-hand side of the definition is being tested for. This information, designated here ‘the position number in the phrase’, is recorded in stack E.
1.19 A few other explanations are needed before the working of the flowchart shown in Figure 4.2 can be clearly understood. One is that the parsed sentence, as envisaged in the output of this program, does not include the left parenthesis. Also, instead of the right parenthesis, it has the phrase name itself, which was placed as a subscript of the right parenthesis in our previous example in paragraph 1.9. This phrase name in the output string is followed by an integral number which indicates the position of the corresponding left parenthesis that was omitted. The method for restoring this parenthesis can best be brought out by first indicating what the parsing output would look like for the sentence shown in paragraph 1.6 and comparing it with the parsing shown in paragraph 1.9. The output appears as follows:
To restore the parenthesis, one goes from left to right until the first phrase name is encountered, which is replaced by the right parenthesis with the phrase name as a subscript. Then one counts as many words to the left as is indicated by the following integer and inserts a left parenthesis at the end of the count. The integer is removed. This process is repeated until all the phrase names and integers are removed. When during the counting to the left a right parenthesis is encountered, the entire string between the right parenthesis and the corresponding left parenthesis is counted as a single word. It can be seen that the parsing in paragraph 1.9 can be obtained from the string exemplified above by this procedure.
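Our reconstruction of this restoring procedure, under the counting rule just stated (a group already restored counts as a single word), may be sketched as follows; the sample output string is constructed for illustration and is not the one the program would produce for the sentence of paragraph 1.6.

# The output string is given as a list of tokens; a phrase-name token is
# a pair (name, count), count being the integer that follows the name.
def restore(tokens):
    items = []                         # words and completed bracket groups
    for tok in tokens:
        if isinstance(tok, tuple):
            name, count = tok
            group = items[-count:]     # count words/groups to the left
            items[-count:] = ["(" + " ".join(group) + ")" + name]
        else:
            items.append(tok)          # an ordinary word
    return " ".join(items)

out = ["the", "dog", ("N'", 1), ("N''", 1), ("NP", 2),
       "ate", ("V", 1), "the", "malt", ("N'", 1), ("N''", 1), ("NP", 2),
       ("VP", 2), ("S", 2)]
print(restore(out))
# ((the ((dog)N')N'')NP ((ate)V (the ((malt)N')N'')NP)VP)S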
1.20 A few words used in the flowchart need explanation also. When a stack is said to be pushed15 by the program, it means that a new card is placed on the top of the stack and the content of the previous top card is copied into it. Hence, if the stack before push looked as follows:
<S>
<VP>
<NP>
then after push it looks as follows:
<S>
<VP>
<NP>
<NP>
Again, when a stack is said to be popped, it means that the top card of the stack is removed without any change in the content of the other cards. The card immediately below the top card becomes the top card. When a symbol is put on a stack, it is written on the top card of the stack, destroying the previous content of the top card. When a symbol is to be placed in a stack without destroying the previous content, the stack must be pushed before putting the symbol. Each card on a stack will be called a level of the stack. Thus pushing increases the number of levels of a stack, and popping decreases it. When the subroutine is running with a certain number x of cards in the stack, we will say the program is ‘at level x’. All stacks start with one card in them.
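These conventions translate directly into a few lines of Python; the class below is nothing more than our transcription of the operations push, pop, and put just defined.

class PushDownStack:
    def __init__(self, initial=None):
        self.cards = [initial]        # every stack starts with one card

    def push(self):                   # the new top card copies the old top
        self.cards.append(self.cards[-1])

    def pop(self):                    # discard the top card
        self.cards.pop()

    def put(self, symbol):            # overwrite the top card
        self.cards[-1] = symbol

    @property
    def top(self):
        return self.cards[-1]

    @property
    def level(self):                  # number of cards on the stack
        return len(self.cards)

c = PushDownStack("S")
c.push(); c.put("VP")                 # preserve S, then work on VP
c.push(); c.put("NP")
c.pop()                               # back to VP
print(c.top, c.level)                 # VP 2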
1.21 One more convention needs to be established. When in the flowchart we use an expression like ‘phrase name A’ or ‘position C’, we shall mean the phrase name or position written on the top card of stack A or C. When we want to refer to a symbol by referring to itself rather than by the stack it is contained in, we shall enclose the symbol in quotes. Thus, ‘A’ will refer to the letter Aye, and A will refer to the content of the top card of stack A.
1.22 The reader will do well to verify that the operations indicated in the flowchart do parse sentences. To aid in this we are appending below a table (Table 4.2) indicating the successive situations of the stacks. Instead of indicating each situation by a separate table, we have included certain integers on the extreme left of the table which indicate the same thing. To find a specific situation, one fixes on an integer and considers only those rows which are numbered by integers smaller than or equal to this integer. All the rows above the row numbered by this integer are to be considered as existing on the stack. The rows below are to be considered as having been popped on return to the row numbered by the integer.
1.23 Before ending this section it ought to be pointed out that the Glennie algorithm was designed specifically for unambiguous languages, that is, languages wherein each sentence has only one possible parsing. It is not too difficult to modify the algorithm to exhibit all the parsings of a sentence in an ambiguous language. Indeed the Rhodes-Kuno-Oettinger algorithm, which is closely analogous to the one under discussion, is capable of exhibiting all the parsings. The present algorithm, when faced with a number of definitions, tries them in sequence until one of them yields an effective analysis. Any later definitions are ignored. One could make a modification wherein all the definitions would be tried, irrespective of earlier successes; however, only the successful definitions would give rise to output tapes. In this case, the output tapes would have to be put on stacks also. We shall make no attempt to do this here. As the flowchart stands at present, it will get only one parsing for any sentence, ignoring any other parsings that may exist.
2. PREVENTION OF LOOP
2.1 The grammar we discussed in Table 4.1 was free from a characteristic which mathematical linguists call ‘left-recursiveness’, that is, occurrences of productions where the first position (or ‘handle’16) of a production coincides with the phrase name. One could quote as an example a production which might occur in French, say, to replace definition 2 for N'' in Table 4.1. Since the adjective follows the noun, we would need a production like
N'' → N'' A
Table 4.2
2.2 A left-recursive grammar would be completely unsuitable for the application of the Glennie algorithm discussed in section 1 above. To see why this is so, let us see what would happen if we entered box 2 of the flowchart with the top level push-down stacks looking as follows:
7  8  N''  2  1
So, position 1 of definition 2 of N'' being N'', the next level of the stacks would have the appearance:
7  8  N''  2  1
which is identical to the previous level, and box 2 would be entered over and over again without any movement of the input and output tapes.
2.3 We propose to get around this difficulty, not by modifying the Glennie algorithm but by modifying the grammar itself to one describing the same language but free from left-recursiveness.
2.4 The method for doing this will be explained by introducing several examples. Intuitive arguments will be given to show that the method achieves what it is expected to achieve, without attempting to prove the fact except in the appendix, since the proofs are somewhat involved mathematically. Moreover, the intuitive arguments will seem sufficient to all but the mathematician. What will be unsatisfying to the nonmathematician in the following sections is the fact that our examples will be artificially constructed and the sentences described by the grammars will have nothing to do with natural languages. But this dissatisfaction has no real cause, since even the language discussed in the examples so far has had only a deceptively superficial similarity to natural languages.
2.5 As an initial example, let us take a language having the following grammar:
S → Sb
S → a
which is clearly left-recursive. It describes a language whose sentences consist of a single a followed by an indefinite number of b’s. Now the same language could be described by the grammar:
S → aX
S → a
X → bX
X → b
This grammar is not hard to construct from the previous one, as a little thought will show. We shall set out below a general method for effecting this change when more than two productions are involved.
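For the simple case just exemplified, in which the left-recursive definitions of a phrase name carry the name itself as an explicit handle, the change can be sketched mechanically as follows (the sketch is ours; it assumes that no definition consists of the recursive phrase name alone, and the full method, with the preliminary step of paragraph 2.7 and the auxiliary variables of paragraph 2.11, is what the flowcharts of Figures 4.3 and 4.4 carry out).

def remove_left_recursion(grammar, name, fresh):
    # grammar: dict mapping a phrase name to a list of right-hand sides,
    # each a list of symbols.  `fresh` is the new phrase name to introduce.
    recursive = [rhs[1:] for rhs in grammar[name] if rhs[:1] == [name]]
    others    = [rhs     for rhs in grammar[name] if rhs[:1] != [name]]
    if recursive:
        grammar[name]  = [b + [fresh] for b in others] + others
        grammar[fresh] = [a + [fresh] for a in recursive] + recursive
    return grammar

g = {"S": [["S", "b"], ["a"]]}        # the grammar of paragraph 2.5
print(remove_left_recursion(g, "S", "X"))
# {'S': [['a', 'X'], ['a']], 'X': [['b', 'X'], ['b']]}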
2.6 However, one other thing needs to be pointed out before going into details. It is the fact that looping can take place even when it is not evident, on looking at a grammar, that position 1 of a production effectively coincides with its phrase name; the coincidence may be hidden behind a chain of productions. Consider the following grammar:
S → Ab
A → SA
A → b
With this grammar, if one enters box 2 of the flowchart with the top of the stacks looking as follows:
1  1  S  1  1
the next following position would be:
1  1  A  1  1
and then once more:
1  1  S  1  1
the cycle repeating itself indefinitely.
2.7 Before any modification like the ones in paragraph 2.5 above can be applied, a preliminary modification must be made on the grammar so as to make all the hidden loops explicit. This also can be done easily, without modifying the language. The method can probably be explained best in terms of the example above. Let us consider the first production S → Ab. Here, though position 1 of the definition does not coincide with the phrase name of the definition, this position is a nonterminal, and there may be a definition of it which has S in position 1. In the case of this example this is indeed true, since we have the definition:
A → SA
To make this phenomenon ‘come out into the open’, we rewrite the grammar, replacing the occurrence of A in position 1 of the first production by the right-hand sides of the productions having A as the phrase name, obtaining the grammar:
S → SAb
S → bb
A → SA
A → b
which has no more hidden loops and describes the same language. That it describes the same language can be seen with a little introspection, or by trying to generate identical sentences from the two grammars. A proof of the equivalence of grammars obtained from one another by the above procedure (replacing a nonterminal on the right of a production by the right side of all definitions of the nonterminal) will be given in the appendix.
2.8 The flowchart shown in Figure 4.3 carries out the operation outlined in the example in paragraph 2.7 above. The only thing to be pointed out here is that this program also uses itself as its own subprogram, the reason being that one does not know how deep one may have to ‘chain’ on the nonterminal in position 1. It may be that A has B at position 1 of a definition, B has C at position 1 of a definition, C in its turn has D in position 1 and finally D has A in position 1 of a definition. In the grammar in our example, we were just lucky to find the loop merely by descending two levels.
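A single step of this chaining, the replacement itself, can be sketched thus (ours; Figure 4.3 applies it repeatedly, descending as many levels as necessary):

def expand_handle(grammar, a, b):
    # Rewrite each definition of `a` of the form  a -> b rest  as the set
    # a -> rhs rest  for every definition  b -> rhs, so that a hidden
    # occurrence of a phrase name in position 1 becomes visible.
    new_defs = []
    for rhs in grammar[a]:
        if rhs[:1] == [b]:
            new_defs += [exp + rhs[1:] for exp in grammar[b]]
        else:
            new_defs.append(rhs)
    grammar[a] = new_defs
    return grammar

g = {"S": [["A", "b"]], "A": [["S", "A"], ["b"]]}   # the grammar of paragraph 2.6
print(expand_handle(g, "S", "A"))
# {'S': [['S', 'A', 'b'], ['b', 'b']], 'A': [['S', 'A'], ['b']]}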
2.9 We have introduced an innovation in this flowchart over the previous one: two push-down stacks involving the exit address. This program has two possible exits, one when its operation is found unnecessary, and another when it exits after carrying out the operation. When a program (either itself or another program; see below) uses this program as a subroutine, the subsequent operation of the user program depends on the exit taken by the used program. In the present situation the only program which uses this program is itself, and so we could have found out from the number of levels left in the push-down stack whether an exit is a termination exit or an exit to the user program. However, since our job does not end with making the loops explicit but must also lead to a program which modifies the grammar, we felt it safer to put the exit addresses in push-down stacks. Moreover, since we shall soon be using this program as a subprogram of another program for preventing back-up, we would not be able to take recourse to the level-counting procedure of section 1 at all.
2.10 The subroutine of Figure 4.3 can only be entered after another program has placed in stack A a phrase name and in B a symbol (phrase name or otherwise — see below) whose occurrence (in position 1 of any definition of the phrase name in A) is to be made explicit. Initially, A and B will contain the same phrase name, but at higher levels this may not remain true. The user program also has to enter in stacks D and E the points of exit of this program for the ‘unnecessary’ and ‘completed’ cases. The program itself uses stack C for holding the current definition number at any level of recursion and stack F to keep track of which of the two exits to use at this level. The actual use of this program will occur in later flowcharts.
2.11 We shall make use of the flowchart of Figure 4.3 in the flowchart of Figure 4.4, which actually modifies a grammar to prevent loops. In the appendix we have included the first few steps of a table similar to Table 4.2 in section 1 to indicate the operation of the flowcharts. However, we want to include in the body of this section the example for which this table is given and indicate the results of the operations on this grammar. The grammar we start with is as follows:
S → Bc
S → SbB
B → Cd
B → Dx
C → m
C → n
D → Sy
D → l
The program first makes all loops explicit. It does so by starting to make occurrences of S in position 1 of the definitions of S explicit. In the definition S → SbB the occurrence is already explicit, but the production S → Bc may or may not lead to a loop. To check this, the program makes the occurrence of S in position 1 of the definitions of B explicit. To do this, it has to make the occurrence of S in position 1 of the definitions of C explicit. There being no such occurrence (all definitions of C lead to terminal symbols), it tries to make the occurrence of S in position 1 of the definitions of D explicit. When this is done, one uses these explicit definitions to modify the definitions of S. The grammar looks as follows at this stage:
S → Cdc
S → SbB
S → lxc
S → Syxc
B → Cd
B → Dx
C → m
C → n
D → Sy
D → l
No further modification is necessary to make the loops explicit. At this point control is returned from the program in Figure 4.3 to the program in Figure 4.4, which proceeds to modify the grammar to prevent loops. This process, though essentially similar in principle to the one exemplified in paragraph 2.5 above, may not always be as simple. In our present example, for instance, there are two definitions of S which lead to loops. Also, there are two definitions of S in addition to these. We can first transform the definitions of S by introducing two auxiliary variables, N and M, and rewriting the grammar thus:
S → SN
S → M
M → Cdc
M → lxc
N → yxc
N → bB
B → Cd
B → Syx
B → lx
C → m
C → n
D → Sy
D → l
and then modify it to:
S → MP
S → M
P → NP
P → N
M → Cdc
M → lxc
N → yxc
N → bB
B → Cd
B → Syx
B → lx
C → m
C → n
D → Sy
D → l
It may be remarked in passing that in the last two stages of modification the phrase name D does not appear in any string derived from S. The D, so to speak, has been ‘rendered useless’. This does not occur in every case of such modification, however; see, for instance, the example in paragraph 2.7 above. Hence, after these algorithms are used to modify a grammar, it may be worthwhile to remove the useless phrase names. An algorithm for doing this is already known.17
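One simple way of removing such useless phrase names (a sketch of ours; the algorithm referred to in note 17 treats the general case, which also removes phrase names from which no terminal string can be derived) is to retain just those names reachable from S through the productions:

def remove_unreachable(grammar, start="S"):
    reachable, frontier = set(), {start}
    while frontier:
        name = frontier.pop()
        reachable.add(name)
        for rhs in grammar.get(name, []):
            frontier |= {s for s in rhs if s in grammar} - reachable
    return {n: defs for n, defs in grammar.items() if n in reachable}

g = {"S": [["M", "P"], ["M"]],        # the final grammar of paragraph 2.11
     "P": [["N", "P"], ["N"]],
     "M": [["C", "d", "c"], ["l", "x", "c"]],
     "N": [["y", "x", "c"], ["b", "B"]],
     "B": [["C", "d"], ["S", "y", "x"], ["l", "x"]],
     "C": [["m"], ["n"]],
     "D": [["S", "y"], ["l"]]}
print(sorted(remove_unreachable(g)))  # ['B', 'C', 'M', 'N', 'P', 'S'] -- D is gone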
3. PREVENTION OF BACK-UP
3.1 In this section we shall take up the study of another phenomenon which renders the Glennie algorithm, if not powerless (as does a loop), at least somewhat inefficient. The phenomenon can best be understood by referring to Table 4.2 in paragraph 1.22 above, with special reference to serial numbers 13 to 22. We note in number 13 that the input tape is in position 5 (cat) and the program is trying to recognize an N'' phrase starting here. The first definition of N'' being N' S', the program goes down to the next level (serial number 14), testing for an N' phrase. The second definition of N' being cat, the tape moves one step to the right, and since no more words need be looked up to test for N', the program goes up to the previous level (serial number 15) with the input tape in position 6. It then checks for the next phrase name, S', in the definition of N'', going down one level (serial number 16) and, by the definition of S', goes down another level (serial number 17) to check for <NP>, with the input tape still in position 6. The word in position 6 (killed) does not agree with the; no other definition of <NP> being available, the program exits from this level (serial number 18) and, no more definitions of S' being available either, exits to the next previous level (serial number 19). Here, a new definition being available, the program ‘starts afresh’, backing up the tape to position 5 (see box 7 in Figure 4.2). Then, after descending one level and conducting a futile search for an A according to definition 2 of N'', it returns (serial numbers 20, 21) and starts on the third definition (serial number 22), looking at the same word (cat) and recognizing it to be an N' as the sole element of N'', this time correctly. This backing up of the tape should not be necessary, however, since even after back-up the program looks at the same word (cat), changing only its interpretation. The question one is tempted to ask is, ‘could this reinterpretation be done in retrospect, rather than by actual reexamination of the words?’ The question is especially important in view of the possibility that the length of the back-up need not always be restricted to one word but can be quite long. To convince oneself that this may be the case, one may consider the grammar:
S → Mα
S → Nβ
M → TM
M → x
N → LN
N → y
L → TM
T → m
and try to parse the sentence:
mmmxmmxyβ
3.2 Before considering this last grammar, let us once more go back to the grammar in Table 4.1 and ask ourselves what the effect would be if production numbers 3 and 5 were modified to appear thus:
N'' → N'X
X → S'
X →
A new phrase name, X, has been introduced, which takes on two possible definitions, one for each of the two definitions of N'' with an identical handle. The definitions have on their right-hand sides the strings of the original definitions of N'', with the common handle deleted. In this special case, one such string is nonexistent (nothing remains on the right-hand side of definition 3 of N'' after removing the common handle N'). This leads to a rather unusual kind of production, with an empty right-hand side, which we have not encountered before. However, it does not do the parsing algorithm any harm. In fact, box 2 in Figure 4.2 is for taking care of precisely this exigency.
3.3 The rule of modification for preventing back-up, then, is simple: if two or more definitions of a phrase name have the same handle, replace them by a single definition consisting of the handle followed by a new phrase name, the right-hand sides of whose definitions are obtained from those of the old definitions by removing the common handle. This works all right if the common handles can be seen explicitly. But here the same problem can arise that arose in connection with removing infinite loops: the common handles may first have to be made explicit. The grammar indicated in paragraph 3.1 is a case in point. A program for doing this is shown in Figures 4.5 and 4.6, which uses Figure 4.3 as a subroutine. The working of the entire program can be gleaned by watching its operation on the grammar above. A table has been given in the appendix to indicate the appearance of the stacks as the program runs. In addition we give below a rough description of its working.
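The merging rule itself, for handles that are already explicit, can be sketched as follows (ours; N2, N1, S1 once more stand for N'', N', S', the fresh phrase names X0, X1, ... are generated mechanically, and the empty list plays the part of the empty right-hand side encountered in paragraph 3.2).

def left_factor(grammar, name, fresh):
    groups = {}                            # handle -> list of tails
    for rhs in grammar[name]:              # assumes nonempty right-hand sides
        groups.setdefault(rhs[0], []).append(rhs[1:])
    new_defs, counter = [], 0
    for handle, tails in groups.items():
        if len(tails) == 1:
            new_defs.append([handle] + tails[0])
        else:                              # a shared handle: merge the tails
            x = fresh + str(counter); counter += 1
            new_defs.append([handle, x])
            grammar[x] = tails             # may include the empty tail
    grammar[name] = new_defs
    return grammar

g = {"N2": [["N1", "S1"], ["N1"]]}         # productions 3 and 5 of Table 4.1
print(left_factor(g, "N2", "X"))
# {'N2': [['N1', 'X0']], 'X0': [['S1'], []]}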
3.4 The program checks for common handles for all nonterminals, one at a time. Hence it tries to find any common handle of S; it finds M and N as handles of the two definitions of S. These are not identical; however, they are both nonterminals (M and N), so the program successively checks to see if N occurs as the handle of M or vice versa. For this purpose it uses the program in Figure 4.3. In this case M has T as its handle and T has a terminal symbol as its handle — so N never appears as a hidden handle of M. Similarly, N has L as a handle and L has T as a handle — so M never occurs as a handle of N either. At this point the program looks for a common handle for M and N, starting over at the beginning of the program (taking care of the push-down stacks, of course). Then it tests for the occurrence of T as a handle for L, using the program in Figure 4.3. Since T is a handle for L, and since M and N have a common handle (at the next higher level), the definitions of N involving L in the handle are changed yielding a grammar:
S → Mα
S → Nβ
M → TM
M → x
N → TMN
N → y
L → TM
T → m
This exit from the level for testing the common handle of M and N starts a redefinition of S, yielding a grammar:
S → TMα
S → xα
S → TMNβ
S → yβ
M → TM
M → x
N → TMN
N → y
L → TM
T → m
which can now be changed in several steps of application, to the grammar:
S → TX
S → xα
S → yβ
X → MY
Y → α
Y → Nβ
M → TM
M → x
N → TMN
N → y
L → TM
T → m
which is free of all back-up.
3.5 The reader must be warned at this point that, unlike the loop-prevention program, which is an effective procedure in all cases, the back-up-prevention program is not guaranteed to succeed. To see this, one may consider as an example the following grammar:
S → aSc
S → B
B → aBb
B → ab
If we apply the procedure of the above paragraphs to this grammar we shall initially get the modified grammar:
S → aX
B → aY
X → Sc
X → Bb
X → b
Y → Bb
Y → b
which, on reapplication of the method, yields:
S → aX
B → aY
X → aZ
X → b
Y → Bb
Y → b
Z → Sc
Z → Bb
It will be noticed that the two definitions of Z behave exactly like the two definitions of X in the grammar in which X appeared for the first time. The back-up-prevention routine, then, would come into play again, and another phrase name would be produced with identical definitions. This process would continue indefinitely.
3.6 Not all grammars, then, can be rendered free of back-up by this procedure. In the example cited in the previous paragraph, we have demonstrated one way in which this procedure may fail to terminate on some grammars. In this case and similar cases we can study the grammar beforehand and decide that it would be a waste of time to try to free it from back-up. Unfortunately, this may not always be possible. We have not been able to derive any essentially finite method which, applied to a grammar, can tell us whether it can be made free of back-up. There is a possibility that such a method may be impossible to construct, just as a method for testing the ambiguity of a grammar is impossible to construct.18 Unfortunately we have not been able to establish this impossibility formally either. While no grammar with an infinite number of ambiguous sentences can be made back-up-free, it is not true (as our example above shows) that a grammar which cannot be made back-up-free is necessarily ambiguous. Thus the impossibility of a test for ambiguity (as well as for certain similar phenomena such as infinite intersection19) does not necessarily imply the impossibility of testing for back-up. Meanwhile the procedure roughly described above, and made precise in the flowcharts of Figures 4.5 and 4.6, does provide what is called a semidecision procedure for back-up prevention. If the described method, applied to a grammar, terminates, then the modified grammar resulting from the application will be free of back-up. However, if the method has not terminated after some prespecified time, one has at present no way of knowing whether the grammar can be freed from back-up or not: a few more cycles might have completed the process, or they might not. It is unfortunate that the state of the art has to be left at this rather unsatisfactory stage.
4. CONCLUDING REMARKS
4.1 We have seen in recent years an increase of interest in syntax-directed parsing both among linguists and computer specialists. For reasons of efficiency as well as applicability of some generalized algorithm, it is often found necessary to use special ‘tricks’ in the algorithm, as has been suggested by Conway20 in his discussion of the syntax-directed parsing of Cobol. Very often, such tricks need not be incorporated separately in the parsing algorithm, but can be made to take the form of a special change of the grammar to an equivalent form. Since linguistic structures are presently somewhat better understood than the theory of computing algorithms, it is much easier to study effects of changes in grammatical forms than changes in algorithms.
4.2 Any change in a grammar with the aim of increasing the efficiency of parsing algorithms naturally has to keep the language unchanged. This invariance over language can generally be achieved without too much difficulty, although a lot still remains to be understood about language-preserving transformations on grammars. The few modifications of grammars that have been considered for their usefulness have been simple enough so that their language-preserving nature can be understood easily. However, techniques for investigating equivalence (albeit no uniformly applicable technique can be developed) are highly necessary. This is one reason why we have included in the appendix some proofs of the equivalence-preserving nature of the class of transformations on grammars that we have discussed here. These transformations are of such a simple nature that such proofs are almost unnecessary. However, the technique of proof may be of interest for future applications.
4.3 However, it must be remembered that mere maintenance of equivalence is not sufficient when transforming grammars. One has to maintain what Greibach refers to as strong equivalence21 so that the phrase marker of a sentence according to one grammar is derivable from the phrase marker of the same sentence according to the other grammar. Very little is known about strong equivalences between grammars22 at present.
4.4 The Glennie algorithm (which has been discussed here in the interpretive mode) is a representative of the class of algorithms termed ‘top-to-bottom’ by Petrick.23 The Rhodes-Kuno-Oettinger predictive analysis falls in this category. The bottom-to-top algorithms of Robinson24 form another class of algorithms for syntax-directed parsing. A comparison of the different classes of algorithms and the effectiveness of the variants within a class has been investigated only to a very small extent. Further efforts in this direction ought to prove extremely fruitful.
APPENDICES
1. The appearance of the push-down stacks at the first ten stages of the operation of the program in Figures 4.4 and 4.3 on the grammar shown in section 2.11 is as follows: (The format of the representation is identical with that of Table 4.2, section 1.22.)
The next few steps of the program result in no further modification of the grammar and are left as an exercise for the reader. The first modified grammar, obtained on return from serial number 5 to serial number 6, is as follows:
S → Bc
S → SbB
B → Cd
B → Syx
B → lx
C → m
C → n
D → Sy
D → l
The second and final forms of the grammar have been displayed in section 2.11.
2. The appearance of the push-down stacks at the first 19 stages of operation of the programs shown in Figures 4.5 and 4.6 (including Figure 4.3 as a subroutine) on the grammar exemplified in section 3.1 is as follows:
The rest of the operation, which culminates in the final form of the grammar shown at the end of section 3.4, produced by box 13 in Figure 4.6, is left to the reader. The modified grammars have already been shown in section 3.4.
3. Some proofs regarding transformations preserving weak equivalence: an introduction to the Ginsburg25 formalism. The reason this appendix is included is to point out that a certain approach to context-free grammars is useful in the study of equivalence. The reasons it is not included in the text are twofold: first, the kind of grammatical changes we have introduced in the paper are so simple that a rigorous proof of equivalence preservation in their case may seem out of place to many. Second, the concepts and techniques involved herein are so different from those used in the body of the text as to make it uninteresting to anyone but mathematicians and those linguists who really want to know about the formal aspects of the subject. However, we have endeavored for the benefit of the latter to make the exposition ab initio and to explain the concepts carefully.
The major innovation in the point of view we shall adopt here lies in thinking of the phrase names as representing sets of strings of terminal symbols. That is, we can say about a language that there is a set of strings called sentences, another set called noun phrases, another set called nominals, and so on. Each terminal symbol also represents a set of strings, i.e. the set consisting of that terminal symbol alone.
On such sets we define two operations, union and concatenation. If A and B are two sets of strings, we shall mean by A + B (read: ‘A union B’) those strings which appear in at least one of the two sets, A or B. By AB we will mean the set of all those strings which can be obtained by taking any string from A and adjoining or concatenating to it some string taken from B.
The grammar of a language expresses certain phrase types as derived from other phrase types by means of these operations. To see this, let us take as an example the grammar:
S → Bc
S → SbB
B → Cd
B → Dx
C → m
C → n
D → Sy
D → l
discussed in section 3 and rewrite it in a slightly different form:
S = Bc + SbB
B = Cd + Dx
C = m + n
D = Sy + l
which to the naked eye looks almost like a set of simultaneous algebraic equations. But it must be remembered that the interpretation is quite different. What the set of equations above says is: ‘There is a set C of strings whose only members are the string “m” and the string “n”. There is also a set D of strings which has the string “l” as one member and whose other members are obtained by taking any member of the set S (defined later) and adjoining the terminal symbol y to it. There is a set B whose members are obtained either by taking an element of C and adjoining the terminal symbol d to it or by taking an element of D and adjoining the terminal symbol x to it. The set S of strings is obtained either by adjoining the terminal symbol c to some member of B or by taking a member of S, adjoining the terminal symbol b to it, and adjoining to the resulting string any member of B.’
It is clear that the above set of sentences says exactly what is said (under interpretation) by the four equations above. There is, therefore, partial justification in the linguist’s contention that mathematics does not tell us anything more than what the linguist already knows. However, the use of mathematical notations is just a little bit more than a matter of ‘prestige’. For one thing, it is evident that once the interpretation of the symbolism is firmly ensconced in the reader’s mind, there is a certain advantage to being able to say in four lines what would otherwise take a full paragraph to say. The advantages do not stop here, however, as we shall show in what follows.
Given a set of equations of the form above one can, in a step-by-step manner, build up the sets of strings denoted by the phrase names. We shall exemplify this method in terms of the grammar above and then proceed to formulate the discussion in general terms.
We first define a sequence of sets of strings for each phrase name, calling them S^0, S^1, S^2, S^3, ...; B^0, B^1, B^2, B^3, ...; C^0, C^1, C^2, C^3, ...; D^0, D^1, D^2, D^3, ... . The sets are constructed as follows: S^0, B^0, C^0, and D^0 are empty sets. Then for any value of i,

S^(i+1) = B^i c + S^i b B^i
B^(i+1) = C^i d + D^i x
C^(i+1) = m + n
D^(i+1) = S^i y + l

To exemplify this process, let us put i = 0. Then S^1 = B^0 c + S^0 b B^0 = the empty set again (B^0 is empty, and anything concatenated with an empty set still yields an empty set; notice the difference between an empty string and the empty set of strings). Similarly,

B^1 = the empty set
C^1 = (m, n)
D^1 = (l)

Now putting i = 1 we obtain:

S^2 = the empty set
B^2 = (md, nd, lx)
C^2 = (m, n)
D^2 = (l)

Again:

S^3 = (mdc, ndc, lxc)
B^3 = (md, nd, lx)
C^3 = (m, n)
D^3 = (l)
and so on.
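The iteration just carried out by hand is easily mechanized. The sketch below (ours) computes the sets S^i, B^i, C^i and D^i for the grammar of this appendix, union being written | and concatenation spelled out.

def concat(xs, ys):
    # concatenation of two sets of strings: every x followed by every y
    return {x + y for x in xs for y in ys}

def step(s, b, c, d):
    return (concat(b, {"c"}) | concat(concat(s, {"b"}), b),   # S = Bc + SbB
            concat(c, {"d"}) | concat(d, {"x"}),               # B = Cd + Dx
            {"m", "n"},                                        # C = m + n
            concat(s, {"y"}) | {"l"})                          # D = Sy + l

s = b = c = d = set()                 # S^0, B^0, C^0, D^0 are empty
for i in range(3):
    s, b, c, d = step(s, b, c, d)
print(sorted(s), sorted(b))           # ['lxc', 'mdc', 'ndc'] ['lx', 'md', 'nd']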
It can be seen that for all i, S^i is contained in S^(i+1), that is, every element of S^i is an element of S^(i+1). It is not hard to believe (though the proof may be hard to read) that, if S, B, C, D are the smallest sets of strings that satisfy the original polynomial equations, then each element of S is an element of S^i for some i, and every element of S^i for every i is an element of S. We shall use this fact to prove our equivalence theorems by invoking the following trick over and over: if S and T are the sets of sentences of the two grammars, we shall show that every element of S^i is an element of T^j for some j, and vice versa. However, since we are interested in proving these theorems for context-free grammars in general, let us introduce a symbolism which avoids giving specific names to nonterminals and specific forms to the polynomial equations. To this end we shall denote the nonterminals by A_1, A_2, A_3, ..., A_n, and the corresponding sequences of sets by A_i^j in general, i.e. by

A_i^0, A_i^1, A_i^2, A_i^3, ...

We shall denote arbitrary polynomial forms by f_1(A_1, A_2, ..., A_n), f_2(A_1, A_2, ..., A_n), etc., f’s with different subscripts denoting different polynomials. A general grammar then would look as follows:

A_i = f_i(A_1, A_2, ..., A_n)    (i = 1, 2, ..., n)

and the equations for the A_i^j take the form:

A_i^0 = the empty set

and

A_i^(j+1) = f_i(A_1^j, A_2^j, ..., A_n^j).

We are now ready to tackle the various equivalence proofs.
a) The changes in the grammar we made to make either the loops or the common handles explicit depended on one (so far unproved) fact, namely that if an occurrence of a nonterminal (A_2, say) in the definition of another nonterminal (A_1, say) be replaced by the definitions of the former nonterminal, the language defined by the modified grammar is identical to that defined by the unmodified grammar. In symbols, if

A_1 = f_1(A_1, A_2, A_3, ..., A_n)
A_i = f_i(A_1, A_2, ..., A_n)    (i = 2, 3, ..., n)

and

B_1 = f_1(B_1, f_2(B_1, B_2, ..., B_n), B_3, ..., B_n)
B_i = f_i(B_1, B_2, ..., B_n)    (i = 2, 3, ..., n)

be two grammars, then for each i, A_i and B_i are identical.

To prove this we shall first set up the sets A_i^j and B_i^j by setting up

A_i^0 = B_i^0 = the empty set,

and for each j

A_i^(j+1) = f_i(A_1^j, A_2^j, ..., A_n^j),
B_1^(j+1) = f_1(B_1^j, f_2(B_1^j, ..., B_n^j), B_3^j, ..., B_n^j),

and for all i greater than 1 and all j

B_i^(j+1) = f_i(B_1^j, B_2^j, ..., B_n^j).

We shall now proceed to show that for all i and j, any string in A_i^j is also in B_i^j, that is, A_i^j is a subset of B_i^j, or in symbols:

A_i^j ⊆ B_i^j.

This statement is trivially true for j = 0, since both A_i^0 and B_i^0 are empty sets. We shall now show that if for any j, A_i^j ⊆ B_i^j for all i, then it follows that A_i^(j+1) ⊆ B_i^(j+1) for all i. This is seen as follows. We have for all i greater than 1

B_i^(j+1) = f_i(B_1^j, ..., B_n^j);

however, since for each i, A_i^j ⊆ B_i^j,

A_i^(j+1) = f_i(A_1^j, ..., A_n^j) ⊆ f_i(B_1^j, ..., B_n^j) = B_i^(j+1),

proving the assertion when i is greater than 1. For the one other remaining case, i = 1,

A_1^(j+1) = f_1(A_1^j, A_2^j, A_3^j, ..., A_n^j);

but since

A_2^j = f_2(A_1^(j-1), ..., A_n^(j-1)) ⊆ f_2(A_1^j, ..., A_n^j) ⊆ f_2(B_1^j, ..., B_n^j),

we obtain

A_1^(j+1) ⊆ f_1(B_1^j, f_2(B_1^j, ..., B_n^j), B_3^j, ..., B_n^j) = B_1^(j+1).

So, since A_i^0 ⊆ B_i^0 for all i, we have A_i^1 ⊆ B_i^1 for all i, and so A_i^2 ⊆ B_i^2 for all i, and so on. Hence, a member of A_i, being a member of some A_i^j, is a member of B_i^j and hence of B_i. This shows that the language described by the first grammar is a subset of the language described by the second grammar. But this is not enough to show that the grammars are equivalent. Hence we shall now show that for all i and j,

B_i^j ⊆ A_i^(2j).

Once more, this is clear for j = 0, since B_i^0 and A_i^0 are empty sets; it is also quite easily seen that for all i greater than 1, if B_i^j ⊆ A_i^(2j) for all i, then

B_i^(j+1) = f_i(B_1^j, ..., B_n^j) ⊆ f_i(A_1^(2j), ..., A_n^(2j)) = A_i^(2j+1) ⊆ A_i^(2j+2).

For i = 1, we have

B_1^(j+1) = f_1(B_1^j, f_2(B_1^j, ..., B_n^j), B_3^j, ..., B_n^j) ⊆ f_1(A_1^(2j), f_2(A_1^(2j), ..., A_n^(2j)), A_3^(2j), ..., A_n^(2j)) = f_1(A_1^(2j), A_2^(2j+1), A_3^(2j), ..., A_n^(2j)) ⊆ f_1(A_1^(2j+1), ..., A_n^(2j+1)) = A_1^(2j+2).

So B_i^1 ⊆ A_i^2 for all i, whence B_i^2 ⊆ A_i^4 for all i, and so on. Thus any member of B_i, being a member of some B_i^j, is a member of A_i^(2j) and hence a member of A_i. Hence, B_i is a subset of A_i; and since we have already seen that A_i is a subset of B_i, the two languages are identical.
b) The method by which an explicitly common handle was made to ‘merge’ into a single definition through the introduction of a new phrase name can be similarly shown to leave the language unchanged. The method effectively is based on the following proposition. If

A_1 = g h(A_1, ..., A_n) + g k(A_1, ..., A_n) + r(A_1, ..., A_n)
A_i = f_i(A_1, A_2, ..., A_n)    (i = 2, 3, ..., n)

and

B_1 = g C + r(B_1, ..., B_n)
C = h(B_1, ..., B_n) + k(B_1, ..., B_n)
B_i = f_i(B_1, B_2, ..., B_n)    (i = 2, 3, ..., n)

(g being the common handle) be two grammars, then for all i, A_i = B_i (note that nothing is said about C). The proposition rests on the fact that concatenation distributes over union, that is, g(X + Y) = gX + gY for any sets of strings X and Y.

We shall first show that for all i and j, A_i^j ⊆ B_i^(2j), writing A^j as an abbreviation for the list A_1^j, ..., A_n^j. This is evidently true for j = 0. Also, for all i greater than 1, it can easily be seen that if A_i^j ⊆ B_i^(2j) for all i, then A_i^(j+1) ⊆ B_i^(2j+2). For i = 1 we note:

A_1^(j+1) = g h(A^j) + g k(A^j) + r(A^j) ⊆ g (h(B^(2j)) + k(B^(2j))) + r(B^(2j+1)) = g C^(2j+1) + r(B^(2j+1)) = B_1^(2j+2).

This immediately leads, as in case (a) above, to the conclusion that A_i is a subset of B_i for all i. We now proceed to show that B_i^j ⊆ A_i^j. Again this is trivially true for j = 0, and it also follows easily that for all i greater than 1, if B_i^j ⊆ A_i^j for all i, then B_i^(j+1) ⊆ A_i^(j+1). For i = 1 we have:

B_1^(j+1) = g C^j + r(B^j) = g (h(B^(j-1)) + k(B^(j-1))) + r(B^j) = g h(B^(j-1)) + g k(B^(j-1)) + r(B^j) ⊆ g h(A^j) + g k(A^j) + r(A^j) = A_1^(j+1).

The rest of the argument is similar to case (a).
c) We are now left with showing that the method employed for removing loops leaves the language unchanged. One can express this statement in symbols as follows. If

A_1 = A_1 f(A_2, ..., A_n) + g(A_2, ..., A_n)
A_i = f_i(A_1, A_2, ..., A_n)    (i = 2, 3, ..., n)

and

B_1 = g(B_2, ..., B_n) C + g(B_2, ..., B_n)
C = f(B_2, ..., B_n) C + f(B_2, ..., B_n)
B_i = f_i(B_1, B_2, ..., B_n)    (i = 2, 3, ..., n)

be two grammars, then the languages described by the two are identical.

To show this, we shall first show that A_i^j ⊆ B_i^(2j) for all i. Once again, this is true for j = 0, and for all i greater than 1 the inductive step proceeds exactly as in the previous cases. For i = 1, one observes from the first equation that every string of A_1^(j+1) consists of a string of g(A^j) followed by at most j strings of f(A^j), while C^(2j) contains every concatenation of at least one and at most j strings of f(B^j); hence

A_1^(j+1) ⊆ g(B^(2j)) C^(2j) + g(B^(2j)) = B_1^(2j+1) ⊆ B_1^(2j+2),

showing that A_i is a subset of B_i for all i. To show that B_i is a subset of A_i, we proceed to show that B_i^j ⊆ A_i^(2j) in a similar manner.
NOTES
1. Susumu Kuno and Anthony G. Oettinger, ‘Multiple Path Syntactic Analyzer’, Mathematical Linguistics and Automatic Translation, Report NSF-8, Sec. I (Computation Laboratory of Harvard University, 1963). C. Douglas Johnson, ‘The Berkeley Parser’, 1964 Annual Meeting of the Association for Machine Translation and Computational Linguistics (Bloomington, Indiana). Jane J. Robinson, ‘An Automated Phrase Structure Grammar of English’, 1964 Annual Meeting of the Association for Machine Translation and Computational Linguistics (Bloomington, Indiana).
2. K. Samelson and F. L. Bauer, ‘Sequential Formula Translation’, Communications of the Association for Computing Machinery 3.76 (1960).
3. E. Irons, ‘A Syntax Directed Compiler for Algol 60’, Communications of the Association for Computing Machinery 4.51 (1961).
4. Stanley Petrick and Thomas Griffiths, ‘On Relative Efficiencies of Context Free Grammar Recognizers’ AFCRL-TM-64-2 (January, 1964).
5. Stanley Petrick, ‘A Recognition Procedure for Transformational Grammars’, Second Congress on the Information System Sciences (Hot Springs, Virginia, 1964).
6. Alec E. Glennie, ‘On the Syntax Machine and the Construction of a Universal Compiler’, Technical Report No. 2, Computation Center, Carnegie Institute of Technology (Pittsburgh, Pa., 1960).
7. Melvin Conway, ‘Design of a Separable Transition Diagram Compiler’, Communications of the Association for Computing Machinery 6.396 (1963).
8. Sheila Greibach, ‘Inverse of Phrase Structure Generators’, Mathematical Linguistics and Automatic Translation, Report NSF-11 (Computation Laboratory of Harvard University, 1963).
9. Noam Chomsky, ‘On the Notion, “Rule of Grammar”’, Structure of Language and its Mathematical Aspects: Proceedings of the 12th Symposium in Applied Mathematics, American Mathematical Society (1961).
10. See, however, William Lynch, ‘Ambiguity in Backus Normal Form Languages’, Ph.D. Thesis (University of Wisconsin, 1963). Also, Sheila Greibach, ‘A New Normal Form Theorem for Context-Free Phrase Structure Grammars’, Journal of the Association for Computing Machinery 12.42 (1965).
11. We shall use the words ‘Loop’ and ‘Back-up’ as technical terms in the future with the precise meaning defined in appropriate places. The reader is warned against interpreting them with any colloquial English meaning.
12. This use of the word word is nontechnical, as opposed to its use as a technical synonym for ‘terminal symbol’.
13. Yehoshua Bar-Hillel, M. Perles and E. Shamir, ‘On Formal Properties of Simple Phrase Structure Grammars’, Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 14.143 (1961).
14. The words ‘push-down stack’ form a technical term defined in what follows. It is not to be construed to be a colloquialism with a colloquial meaning.
15. The terms, ‘push’, ‘pop’, and ‘put’ are technical terms with meaning as defined in the text. They are not to be construed to be colloquialisms with meanings in colloquial English.
16. The word ‘handle’ will be used throughout this paper as a technical term following Greibach (see note 8). It is not to be construed to have a colloquial meaning.
17. See op. cit., note 13.
18. Noam Chomsky, ‘Formal Properties of Grammars’, Handbook of Mathematical Psychology 2.323 (New York, 1963).
19. See note 13.
20. See op. cit. in note 7.
21. See op. cit. in note 8.
22. See, however, Lynch, op. cit., in note 10.
23. See op. cit. in note 4.
24. See Robinson, note 1.
25. Seymour Ginsburg, ‘Two Families of Languages Related to Algol’, Journal of the Association for Computing Machinery 9.350 (1962).
Figure 4.1
Figure 4.2 The Glennie Algorithm as an Interpreter
Figure 4.3 Flowchart for making occurrence of B in position 1 of A explicit.
Figure 4.4 Flowchart for preventing loops
Figure 4.5 Flowchart for making common handle of A and B explicit
Figure 4.6 Flowchart for preventing backup