“Computation in Linguistics: A Case Book”
Descriptive Problems — Morphology
Towards Automatic Morphemicization of Verbal Forms in Telugu
1.0 THE MOTIVATION
For the past 25 years or so, many American linguists have been faced with the problem of teaching exotic languages for which no suitable teaching materials have existed. The problem of ‘what to teach’ has been handled — with increasing success over the years — by the application of descriptive techniques which are well known. But the problem of ‘what to teach first’ has received much less attention. It is to this latter problem that this paper is addressed.
What to teach first, and what to teach after that, is an important consideration in teaching any language, but particularly an exotic one, because the beginning students are adults with little time and many responsibilities, often people who need the language as a tool for research and field work; an efficient set of materials — designed to impart the maximum amount of language skill in the time available — is needed for teaching the language.
It is at least plausible if not obvious that one order of presentation of the material might be less efficient than some other order. From this we may assume that there must be a maximal efficiency of presentation for any given language, achievable through the adoption of one or more particular ways of ordering the material presented.
There are two types of data which the linguist qua linguist can bring to bear on this problem of optimal ordering: structural and statistical. The statistical is the motivation for this paper; the structural can be referred to only briefly in this present context. 1
2.0 THE PHILOSOPHY
Many agree that for maximal efficiency in language teaching the student should be exposed early to those features (both lexical and structural) of the language which have the highest text frequency in the language.
There are credible reasons for this. In the first place, such features, because of their naturally higher frequency, should impress themselves more readily on the student’s mind — he will learn them faster because he meets and needs to know them more often (this is called reinforcement). Secondly, once the student has learned these features he will already know a substantial portion of any new text (written or spoken), and thus be in a position to make better guesses about the unfamiliar portions, and to concentrate on learning the unfamiliar material, because there will be proportionally less of it.
In the seminar which produced this book this line of reasoning was objected to as illusory, since the most common words in a language will probably also be the most polysemous; consider for example the three nouns in
(1) Plants need light for photosynthesis.
The observation on polysemy may be true, but the objection is false: most of the words in any natural language are polysemous, and the sooner the student learns how this fact of life is manifested in language A the sooner he will be adequately equipped to use language A — a goal which is the very essence of efficient language instruction.
In order to take maximum advantage of the beneficial effects of high-frequency features, it is desirable to raise their frequency in texts used for teaching still higher by substituting them for features of low frequency (this is part of what is called editing). The result is still more reinforcement of the high-frequency features, still greater ease in their mastery, and correspondingly less interference with reading efficiency (roughly, amount of comprehension per unit of study time) from the low-frequency features.
In (1) above this would amount to replacing photosynthesis with something like making sugar from air and water, which is the answer the student would get if he did not know what photosynthesis was and had to ask. Clearly it would not serve the interests of efficient teaching to remove the polysemous word light and replace it by the unambiguous phrase radiation from the ultraviolet portion of the electromagnetic spectrum.
3.0 THE GOAL
So the question of what to teach first has given rise to the question of what features of the language have the highest text frequency.
Considerations of cost aside, the best way to count the occurrences of the features is with the aid of a computer, which is able to handle vast quantities of data in a relatively short time with low incidence of error. The yield of frequency information is of high statistical reliability and may be broken down in any way desired.
By features I mean items like morphemes, allomorphs, lexemes, idioms, constructions, parts of speech, and grammatical categories (person, number, case, tense, mood, aspect, and the like). What features will be counted will depend on what features the language has and on their relative complexity.
In Telugu, for example, most of the complexity seems to be concentrated in the verbal system, so that one’s attention to the grammatical features of the nouns can be limited to counting the frequency of the irregular plural forms.
It seems, then, that a realistic goal for an early frequency count for Telugu would be a computer program to take input text preparsed into three parts of speech — verbs, nouns, and neither — and to yield frequency data on all noun and verb bases and all affixes in the verbal system. 2
4.0 THE MEANS
A fairly gross flowchart of this program is given in Figures 7.1 and 7.2, showing the entire operation from card input to printed output frequencies. Striped blocks in the flowcharts are discussed in the indicated sections of this paper; the double-striped blocks, being the central topic of this paper, are flowcharted in detail later.
4.10 Preparation of input. The card input to this program is prepared from published Telugu text. In this process (1) the Telugu graphemes are transliterated into an I/O alphabet of code groups of Hollerith characters, (2) word boundaries are indicated by blank, with external sandhi neutralized, and (3) each word is marked by a one-digit parsing code.
4.11 Transliteration. The Telugu script in its contemporary form has 51 graphemes. Since there are 63 characters available for machine processing, it is not difficult in theory to transliterate the Telugu graphemes one-to-one into BCD, but the resultant BCD orthography is very difficult for humans to work with, having very little in the way of mnemonic properties.
For the benefit of those who must read the card input and printed output, the source Telugu text is transliterated into a highly mnemonic ‘alphabet’ using code groups of one to three Roman letters for the Telugu graphemes. Long vowels are written double, retroflex consonants are prefixed by ‘X’, aspiration is indicated by ‘*’, etc. See 4.20 for a description of how this orthography is altered by the program for internal use.
4.12 Word boundaries. In Telugu orthography every syllable boundary is potentially indicated by space; 3 furthermore, due to the action of external sandhi4 (optionally reflected in Telugu orthography), certain word boundaries do not coincide with syllable boundaries. The result is that not all word boundaries are marked by space in Telugu orthography, and not all spaces mark word boundaries. To program the computer to resolve these ambiguities would require going far beyond the scope of this paper toward the land of machine translation, where angels fear to tread.
Accordingly it is assumed that the preeditors who will prepare the input to this program will close up spurious spaces and eliminate external sandhi, so as to place word boundaries in one-to-one correspondence with spaces (blank).
4.13 Parsing code. The final step in preparing the input is marking each word with its part of speech. To simplify sightreading, digits are used, say 2 for nouns, 3 for verbs (both finite and nonfinite), and 4 for residue (everything else). The parsing code is written immediately after the word (not preceded by a blank), followed by a blank before the next word.
4.20 Decoding routine. The first action of the program is to retransliterate the input into BCD characters for internal use, one character per Telugu grapheme. The first three characters in the input are taken as argument for a look-up in a table of three-character transliteration codes, e.g. XT* (voiceless aspirated retroflex stop). If the argument is not in the table, the program then does a look-up on the first two characters, in a table of two-character codes, e.g. UU (long high back vowel); and finally, if necessary, on just the first character, in a table of single-character codes. Each code has as its associated function a single BCD character, such that the alphabetic ranking of each Telugu grapheme corresponds exactly to the position (collating number) of its corresponding single BCD character in the internal collating sequence (see sorting, 4.60).
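The longest-match-first look-up order just described can be sketched as follows. The sketch is in present-day Python, and the table contents are hypothetical one- and two-entry fragments rather than the full 51-grapheme tables; each function character is assumed to have been chosen so that its collating position matches Telugu alphabetical order.

```python
# Hypothetical fragments of the three transliteration tables.
THREE_CHAR = {"XT*": "Q"}                  # e.g. voiceless aspirated retroflex stop
TWO_CHAR = {"UU": "F", "AA": "B"}          # e.g. long vowels
ONE_CHAR = {"T": "P", "A": "A", "U": "E", "N": "M", "I": "G"}

def decode(word):
    """Retransliterate one input word, one internal character per
    Telugu grapheme, trying 3-, then 2-, then 1-character codes."""
    out = []
    i = 0
    while i < len(word):
        for width, table in ((3, THREE_CHAR), (2, TWO_CHAR), (1, ONE_CHAR)):
            code = word[i:i + width]
            if code in table:
                out.append(table[code])    # the code's function character
                i += width
                break
        else:
            # No table contains the prefix: an abnormal condition,
            # corresponding to a call of the fail-safe routine (4.40).
            raise ValueError("untransliterable input at position %d" % i)
    return "".join(out)
```

The real decoding routine would also tally the character and syllable counts here; that bookkeeping is omitted from the sketch.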
The decoding routine also counts the number of these BCD characters in each decoded word, and the number of syllables in each decoded word (see 4.30, 5.12).
4.30 Tape 1. As each input word is decoded, its successive BCD functions are written in a record on Tape 1 in a field of which portions will subsequently be referred to as ‘Arg’ (for ‘argument’); see Figure 7.3. The record also has fields for: ‘SC’ (syllable count, provided by the decoding routine — 1 for monosyllables, 2 for disyllables, 3 for words of more than two syllables); ‘CC’ (character count, also provided by the decoding routine); and ‘Word’, the original orthography of the input, which is carried along with Arg for ease of identification in case the fail-safe routine is called (see 4.40).
4.40 Fail-safe routine. At various points throughout the entire program there are tests for abnormal conditions such as invalid parsing codes, utter failure to analyze forms successfully, etc. If such a condition is detected, the fail-safe routine is called, which causes the contents of the tape-record currently in process to be printed out for visual inspection, along with the contents of all filled ‘Hypo’ locations (see 4.50). The program then proceeds to the next word.
4.50 Tape 2. The analysis of each verbal form (5.0) proceeds as a series of look-up operations on portions of it. The look-ups are directed by a series of hypotheses regarding the morphological structure of the form. Each such hypothesis is provisionally written in a hypo area in storage (Figure 7.4), and is subsequently erased if it does not ultimately lead to a successful look-up.
When the nth look-up is successful, the resulting function, which is the verb base, is written as ‘Hypo (n+1)’, all (n+1) Hypos are written on Tape 2, and the program proceeds to the next word.
Each record on Tape 2 has two fields, ‘Clear’ (for sorting — see 4.60) and ‘Code’ (for printout — see 4.70). For Hypo 1 through n, both fields contain identical information (suffix codes); for Hypo (n+1) the ‘Clear’ field is BCD and the ‘Code’ field is its equivalent in the I/O alphabet, both fields being supplied as the function of the successful look-up.
4.60 Sort Tape 2. At EOF on Tape 1, Tape 2 is sorted on the ‘Clear’ field. This brings together all occurrences of each word, each (noun or verb) base, and each suffix code, ready for counting. Because of the method of assigning the BCD characters to the Telugu graphemes, a normal sort results in the arrangement of the records in accordance with the Telugu alphabet. By using numeric initials for the suffix codes (i.e. using codes like 25, 3&), the suffixes can be kept separate from the lexical items in the printout. If desired, the noun and verb bases can be made to list separately by retaining on Tape 2 the ‘PC’ (parsing code) from Tape 1 and using it as the major sort field, with ‘Clear’ as the minor.
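The sort-then-count pass can be sketched as follows, with a hypothetical handful of records standing in for an actual Tape 2. Each record is a (‘Clear’, ‘Code’) pair as in 4.50; sorting brings identical ‘Clear’ keys together, so a single sequential pass yields the frequencies, and the numeric-initial suffix codes file apart from the alphabetic bases.

```python
from itertools import groupby

# Hypothetical Tape 2 records: ('Clear', 'Code') pairs. A suffix code
# carries the same value in both fields; for a base, 'Clear' is the
# internal BCD string and 'Code' its I/O-alphabet equivalent.
records = [("25", "25"), ("PGM", "tin"), ("25", "25"),
           ("PGM", "tin"), ("PGMM", "tinn")]

# Sort on 'Clear'; numeric initials keep suffix codes apart from bases.
records.sort(key=lambda r: r[0])

# One pass over the sorted records gives each 'Code' with its frequency,
# i.e. the content of the report printed from Tape 4 (4.70).
report = [(run[0][1], len(run))
          for _, g in groupby(records, key=lambda r: r[0])
          for run in [list(g)]]
```

Re-sorting `report` on the count would give the optional frequency-ordered listing mentioned in 4.70.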
4.70 Print Tape 4. This tape (Figure 7.5) contains all lexical items in alphabetical order and all suffixes in numerical order, each with its frequency in the corpus just analyzed. For the reports the ‘Code’ and ‘Counter’ fields are printed out, giving each lexical item in the I/O orthography. If desired, the records can be sorted on ‘Counter’ into ascending or descending order of frequency before printout.
5.0 Verbal analysis routine. Each verbal form is analyzed into a string of base + suffix(es) by a process of ‘cutting off’ from the right end of the form (right truncation) one or more portions suspected of being suffixes, and then doing a table look-up on what remains. Figure 7.6 gives an overall view of the method. 5
5.10 Each suffix in the Telugu verbal system has been assigned an arbitrary two-digit number. When a given substring is tentatively identified as a particular suffix, the number of that suffix is written in a hypo area and the substring is right-truncated from the rest of the form. Several hypos must be used in the analysis of a polymorphemic form; the serial number of the current hypo is given by an index P, which is increased one unit with each truncation, and decreased one unit as the contents of each hypo is written on tape following successful analysis of the form.
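The truncate-and-hypothesize cycle can be sketched as follows. The two-digit suffix codes and the base table below are hypothetical, and the sketch is greedy: the erase-and-retry handling of unsuccessful hypotheses (6.30) is omitted, so each form is assumed to parse on its first attempt.

```python
SUFFIXES = {"ina": "31", "nu": "11", "aa": "21"}   # hypothetical codes
BASES = {"tin", "tinn", "tim"}                      # hypothetical base table

def analyze(form):
    """Return the hypos (suffix codes, rightmost suffix first, then the
    base as Hypo (n+1)), or None to signal the fail-safe routine."""
    hypos = []
    while form:
        if form in BASES:              # successful look-up on the residue:
            hypos.append(form)         # the base is written as Hypo (n+1)
            return hypos
        for suffix, code in SUFFIXES.items():
            if form.endswith(suffix) and len(form) > len(suffix):
                hypos.append(code)             # write the hypothesis
                form = form[:-len(suffix)]     # right-truncate
                break
        else:
            return None                # utter failure: fail-safe (4.40)
    return None
```

For example, tin-ina analyzes as the suffix code for -ina followed by the base tin, and tinn-aa-nu as two suffix codes followed by the base variant tinn.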
5.11 The left-hand character of the form being analyzed is in location ‘a’ in storage, with successive characters in locations (a+1), (a+2), etc. The right-hand character of course changes with each truncation; the address of the current right-hand character is stored in location ‘c’. Truncation is thus accomplished by simply decreasing c.
Area ‘Arg’ comprises locations ‘a’ through ‘c’; it contains the argument for the current look-up.
The initial address of the right-hand character of a given full form is (a + CC - 1), where CC is the number of characters in the full form (4.30).
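The addressing scheme can be illustrated with a Python list standing in for storage; the suffix length b anticipates the truncation step of 6.10, and the three-character suffix is a hypothetical example.

```python
word = list("tinina")      # decoded form, one character per location
CC = len(word)             # character count from the decoding routine
a = 0                      # address of the left-hand character
c = a + CC - 1             # initial address of the right-hand character

def arg():
    """Area 'Arg': locations a through c, the current look-up argument."""
    return "".join(word[a:c + 1])

b = 3                      # length of the suspected suffix, e.g. -ina
c = c - b                  # truncation: simply decrease c
# arg() now yields the residue 'tin' for the next look-up.
```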
5.12 If the form to be analyzed is a monosyllable, it is looked up in Table A, which contains the few monosyllabic verbal forms that can occur, such as6 rā ‘come!’. If the form is disyllabic it is first looked up in Table B, which contains the highly frequent uḿdi ‘she is, it is, there is’ and a few other forms. If the monosyllable look-up fails, the fail-safe routine (4.40) is called; if the disyllable look-up fails, the polysyllable routine (Figure 7.7) is called.
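This dispatch on syllable count can be sketched as follows, with hypothetical one-entry stand-ins for Tables A and B; a None result stands for a call of the fail-safe routine.

```python
TABLE_A = {"raa"}          # monosyllabic verbal forms, e.g. raa 'come!'
TABLE_B = {"umdi"}         # frequent disyllables, e.g. 'she is, it is'

def polysyllable(form):
    """Stand-in for the polysyllable routine of Figure 7.7."""
    return None

def route(form, sc):
    """Route one verbal form by its syllable count SC (see 4.30)."""
    if sc == 1:
        return form if form in TABLE_A else None   # miss: fail-safe (4.40)
    if sc == 2:
        # A Table B miss falls through to the polysyllable routine.
        return form if form in TABLE_B else polysyllable(form)
    return polysyllable(form)                      # SC = 3
```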
6.0 Polysyllable routine. Figure 7.7 shows how the identity of the contents of c transfers command to the appropriate subroutine for truncation of the right-most suffix. The two right-hand columns in the decision table are for the benefit of the reader only.
6.10 Suppose the form under analysis is tin-ina ‘eaten’. In Figure 7.8 the three right-most characters of this form are truncated by first placing the number 3 in location ‘b’ and then changing c to (c - b). b is then used to compute the number of characters in the residue of the form, and that number is stored in location ‘r’; the residue tin is then sought in a table containing only arguments of r characters. This method of storing the verb bases in several different tables according to their length results in considerable savings in look-up time over the use of a single all-inclusive table.
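The length-binned base tables can be sketched as follows; the table contents are hypothetical.

```python
# Verb bases stored in separate tables by length: a residue of r
# characters is sought only among the arguments of r characters.
BASE_TABLES = {
    3: {"tin", "tim"},
    4: {"tinn", "wacc"},   # hypothetical 4-character bases
}

def look_up_base(residue):
    """True if the residue is a known base, searching only the table
    for the residue's own length."""
    return residue in BASE_TABLES.get(len(residue), set())
```

The saving comes from never comparing a residue against bases of the wrong length, which a single all-inclusive table would require.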
6.20 Similarly, if c contained /u/ (see table in Figure 7.7), Hypo P would become 15, 25, etc., according as location (c - 1) contained /n, w/, etc., and the number 2 would become b.
6.30 The analysis is more involved if c contains /ā/, because the possible suffixes are of (effectively) two different lengths. And in either case there is the further complication that there are either zero or more-than-zero suffixes between the right-most suffix and the verb base. Together these two facts give 2 × 2 = 4 as the maximum number of attempts that can be allowed for the analysis of an /ā/-final form.7
In Figure 7.9 this number 4 is stored in location ‘i’; i will be decreased by one each time the analysis is unsuccessful, and the fail-safe routine will be called if i reaches zero.
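The bounded retry can be sketched abstractly as follows, with the candidate analyses represented as hypothetical attempt functions.

```python
def analyze_with_retries(form, attempts):
    """Try each candidate analysis in turn; 'i' mirrors the counter of
    Figure 7.9 (4 for an /aa/-final form). None signals the fail-safe."""
    i = len(attempts)
    for attempt in attempts:
        result = attempt(form)
        if result is not None:
            return result          # successful analysis
        i -= 1                     # this attempt failed; try the next
    return None                    # i has reached zero: fail-safe (4.40)
```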
6.40 To enable the program to keep track of what suffix-position it is currently working on, location ‘s’ is provided as a logical switch. Zero becomes s at the top of Figure 7.8, and ‘i’ becomes s if command transfers to Figure 7.9. If the look-up operation in Figure 7.8 is unsuccessful, command transfers to Figure 7.10, where s is tested. If nonzero, s is the address of the location to be tested to direct further branching of the program; in this paper only one such location is necessary.
6.50 The following examples should help the reader to follow the workings of the program as thus far described. The analysis of the three forms tinn-ā-nu ‘I ate’, tiḿ-tā-nu ‘I will eat’, and tin-inā ‘even if one eats’ can be traced step by step through the flowcharts by referring to the rows of Table 7.1, whose columns show the contents of the various pertinent storage locations at the end of each step. Columns not germane to a particular step are left blank in the corresponding row. In the two right-hand columns of the table, the expression ‘MT’ is used to refer to the medial decision table of Figure 7.10.
Table 7.1 (see 6.50).
Table 7.2 (see 6.70): Analysis of tin-asag-uta-nu ‘I will continue to eat’
Incidentally, the three examples just given point up the problem of internal sandhi in Telugu verbal forms. Large numbers of verb bases occur in two or more different shapes, conditioned by the first suffix which follows them; the verb ‘to eat’ has three, tin, tiḿ, and tinn. It seems more efficient to include in the base tables all variants of each base than to program special rules to undo the workings of internal sandhi.
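The table-of-variants approach can be sketched as follows; the mapping of each sandhi-conditioned shape back to a single lexical entry is a hypothetical fragment.

```python
# All shapes of each base are listed in the tables, each mapped to one
# lexical entry, instead of programming rules to undo internal sandhi.
BASE_VARIANTS = {
    "tin": "tin",      # before -ina, medial -asag-, ...
    "tim": "tin",      # as in tiM-taa-nu 'I will eat'
    "tinn": "tin",     # as in tinn-aa-nu 'I ate'
}

def base_of(residue):
    """Return the lexical base for a residue, or None on a miss."""
    return BASE_VARIANTS.get(residue)
```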
6.60 Figure 7.12 shows the extra programming necessary to accommodate the additional medial suffix -sāg. Location (c - 3) is tested to see whether the allomorph -asāg is present; if so, four characters instead of three are truncated. A similar test was incorporated in Figure 7.11, to check for the presence of the allomorphs -atā and -utā of the future tense morpheme.
6.70 Table 7.2 traces the analysis of a form with two medial suffixes: tin-asag-uta-nu ‘I will continue to eat’.
7.0 This paper has discussed the need for a recognition routine for Telugu verbal forms and has described the form such a routine might take. In order to design a full-capacity routine we must first have complete statements of verb base graphotactics and of the privileges of occurrence of the verbal suffixes relative to one another. Until this information becomes available, a first-approximation routine such as that described here can still relieve the linguist of a good deal of time-consuming work.
NOTES
1. For example, the seemingly very complex system of internal sandhi in the Telugu verbal system will not overwhelm the student if it is understood sufficiently well by those who prepare the teaching materials. It consists of two sets of rules, one optional and one obligatory. The optional rules apply to all verbs; there is a different set of obligatory rules for each class of verbs, but almost all of these classes are defined by the phonological structure of their members. All of this suggests teaching the verbs one class at a time, starting with the class having the fewest obligatory sandhi rules — not unlike the practice of teaching Latin or French verbs one ‘conjugation’ at a time.
2. The noun analysis is not discussed in this paper, being considered of minor importance for two reasons: first, the verbal forms make up nearly 50 per cent of the average Telugu text; and secondly, the noun analysis is a much simpler job, which can be turned over to persons who are not linguists, while the verbal analysis is (in part — see 7.0) turned over to the computer.
3. The orthographic syllable can be rigorously defined for Telugu by a short list of ordered rules, but orthographic practice is inconsistent.
4. See Gerald Kelly, ‘Vowel phonemes and external vocalic sandhi in Telugu’, JAOS 83.67-73 (1963), and references given therein.
5. It must be admitted that the program described in this paper is sharply limited in scope, partly due to the customary considerations of space, but partly because not enough is yet known about the Telugu verbal system to make possible a program capable of handling any arbitrarily chosen form.
6. All Telugu citations in this paper are transliterations of the orthography (sometimes with morpheme boundaries indicated by hyphens), including the symbols between slashes; /ḿ/ represents the Telugu grapheme for the nasal homorganic to the stop which follows it.
7. This statement is true only for the purposes of this paper, which deals with only a very few of the actually possible suffixes and strings thereof.
Figure 7.1 Processing of input prior to frequency count.
Figure 7.2 Counting and printout of frequencies.
Figure 7.3 One record on Tape 1, with names of corresponding areas in storage and number of positions in each
Figure 7.4 One record on Tape 2 (Hypo 1, 2, ...) or on Tape 3 (Fn A, Fn B) (See Figs. 7.1, 7.2, respectively.)
Figure 7.5 One record on Tape 4
Figure 7.6 Verbal analysis routine
Figure 7.7 Beginning of polysyllable routine.
Figure 7.8 Truncation of suffixes with final /a/.
Figure 7.9 First step in identifying /a/-final suffix
Figure 7.10 Identification of medial suffix.
Figure 7.11 Third attempt to identify /a/-final suffix
Figure 7.12 Identification of medial suffix -sāg.