“Computation in Linguistics: A Case Book”
Part I: DATA—PROCESSING PROBLEMS
Computers and Anthropological Linguistics
1.0 INTRODUCTION
This paper discusses the role of computation in anthropological linguistics on the basis of experience gained with Navaho.
The fulcrum of the Navaho sentence is the verb. The thirteen prefix categories before the stem1 mark a large number of dependencies on other constituents of the sentence. A particularly interesting aspect of Navaho grammatical research is the determination of the structure of these dependencies.
One of the aims of such research is an automatic parsing program for Navaho sentences. This requires the solution of at least three interrelated problems: (1) the structural description of Navaho sentences; (2) the form of the dictionary entries which accompany the grammar; and (3) the grammar codes to be attached to the entries in the dictionary. The last is the problem of the nature of the nonterminal vocabulary and of the subcategorization of word classes, in other words, the problem of the effect of word categories on Navaho grammatical constructions.
A corollary of the parsing problem is a treatment of Navaho derivational prefixes. Yet their complexity, their limited distribution, and their present obscure lexical productivity make their treatment considerably more difficult than that of inflectional prefixes. The discussion of these prefixes is postponed.
The approach to the problem of this paper is transformational generative. There will be no attempt at defending this point of view. Both defense and criticism have been presented in numerous papers. This approach has been selected as a matter of personal preference. The emphasis on personal preference should put the choice of theoretical orientation beyond controversy — needless to say, this does not apply to the substance.
This theoretical bias puts one at a disadvantage in a discussion of automatic parsing of sentences in any language. Such a parsing program must be based on a minimally adequate grammar of the language to be parsed. In other words, a parsing program is subsequent to a grammar of the language, regardless of how preliminary or putative such a grammar may be. As shall be seen later, the definition of grammatical units in a transformational generative grammar is not independent of the aim of the grammar, which is the production of grammatical sentences. Hence, the entire problem of subcategorization cannot be treated independently of such a grammar.
The opinion has been voiced frequently (in private conversations more than in public discussions) that transformational grammar is not empirical. Whatever the reasons for holding this view may be, this author agrees with Longacre that ‘generative grammars are by no means uninterested in linguistic analysis’.2
It has been previously stated that the ultimate aim of this work is an automatic parsing program. It was also said that the prerequisite for such a program is a minimally adequate transformational generative grammar. Following this sequence of priorities, an attempt will be made to answer questions concerning the usefulness of a computer for the construction of a transformational generative grammar. Parts of the process will be illustrated by examples from Navaho. In particular, those linguistic computer programs which are easily available will be considered.
The author is in agreement with Longacre that techniques of discovery ‘are guess-and-check procedures’.3 Trial and error plays an important part among the techniques of any science. The linguist has in addition several aids at his disposal: his typological knowledge or his experience, be it formal or intuitive, with other languages of the world.4 The transformational grammarian is in addition aided by his basic assumption that the deep structure, the output of the constituent or phrase-structure component of the grammars of the languages of the world, is similar.5 That is, with the knowledge of the deep structure of English and other well-known languages, one can draw likely conjectures about the structure and processes in the phrase-structure part of the theory or grammar of any language of the world.
In the following sections the author will also discuss the field techniques of the anthropological linguist and the extent to which these are modified by the theoretical orientation toward transformational generative grammar. He hopes to support in detail the notion that there is close analogy and even an overlap between the techniques and procedures of analysis used in finding the units of a taxonomic grammar and the construction of transformational generative grammars. The reassuring closeness of the analogy should not be interpreted as a defense or justification of the taxonomic theoretical stance.
It was inevitable that the preoccupation of the taxonomic grammarians with more or less automatic discovery procedures should lead them to general and probably universal analytic techniques which are applicable, as laboratory techniques are, regardless of one’s theoretical bias. This in no way demeans the achievements of taxonomic grammarians. On the contrary, an attempt will be made to demonstrate that their successes in analyzing languages are very important to anthropological linguists dealing with little-known languages.
2.0 HEURISTICS
The following heuristic principle was proposed by Katz and Postal: ‘Given the sentence for which a syntactic derivation is needed, look for simple paraphrases of the sentence which are not paraphrases by virtue of synonymous expressions; on finding them, construct grammatical rules that relate the original sentence and its paraphrases in such a way that each of these sentences has the same sequence of underlying P- (phrase) markers. Of course, having constructed such rules, it is still necessary to find independent syntactic justification for them.’6 This principle, as it stands, complicates the picture by leaving unexplained a number of factors which would make derivation simpler. It assumes (see ‘the same sequence of underlying P-markers’) that most sentences encountered in the eliciting situation are complex, consisting of more than one underlying phrase marker. This is generally true of sentences in texts. However, the specific advantage of eliciting with a knowledgeable informant is the linguist’s awareness of the metatheoretical fact that complex sentences consist of simple sentences and that it is, therefore, possible and more efficient to start by eliciting ‘simple sentences’ even though some may later turn out to be complex.7 It is best to make some ad hoc assumptions about what simple sentences are, and to try for nonsynonymous paraphrases of these. This is analogous to eliciting for morphemic analysis as it was stated by the Voegelins.8 The objective is to ‘postulate what common underlying sentences occur in what constructions’. (‘Find what common morphemes occur in what sequences’ has here been replaced by ‘postulate what common underlying sentences. . .’.) Although quite apart in theoretical orientation, close analogies are seen in the techniques of analysis (see the following discussion).
The eliciting procedure is based on the fact that formal grammatical relationships exist among sentences of quite different superficial structures.9 It must be added that there exist quite similar superficial structures which have no formal relationships. Hale’s statement can be restated: ‘formal relationships often do not exist among sentences of quite similar superficial structure’. The objective of field technique is to keep these two cases apart. The former is a case of nonsynonymous paraphrase, the latter a case of structural ambiguity.
There is a close analogy between eliciting for a transformational grammar and eliciting for a taxonomic grammar. The notion of ‘emic’ and ‘etic’ may be extended to eliciting techniques for transformational generative grammars: The objective is a general statement of the ‘emic’ abstract underlying sentences of the constituent structure component of the grammar. Such a statement takes into account transformational varieties. The transformational varieties of sentences are derived from ‘emic’ underlying strings. The deep structure of a sentence is its ‘emic’ representation, whereas the transformations are responsible for the generation of ‘etic’ ‘allo-forms’ of the surface structure. This represents a relationship of invariance, which is preserved across transformation: it is the invariance of the semantic interpretation of the sentences. This is the discovery of Katz and Postal, that transformational generative grammars can be constructed in such a way that transformations do not change the semantic interpretation of the sentences. This strategy is motivated by considerations of simplicity and intuitive preference.10 The ‘emic’ deep structure is postulated on the basis of ‘independent syntactic justification’.
The striking feature of this analogy is that many statements which were made concerning the discovery of taxonomic units in taxonomic grammars are, with slight modification, applicable as eliciting techniques for transformational grammars. Garvin’s statement, ‘the input to the morphological analysis is thus a set of behavioral units, namely, informant responses elicited by a form-meaning technique in ordered sets under controlled conditions’,11 may be applied to syntactic analysis.
‘Controlled conditions’ I interpret as the requirement that ungrammatical sentences are not permitted, unless so marked. In other words, in the eliciting situation, there is sufficient time to permit postediting by the informant, i. e. there is a closer check on grammaticality than in ‘natural’ situations.12
‘Ordered sets’ are unordered sets of paraphrases of sentences. The ‘form-meaning relationship’ is interpretable as those sets of sentences which are nonsynonymous paraphrases of each other and which must preserve their semantic interpretation across all possible transformations. The sentences of the surface structure (plus phonetic interpretation) are ‘behavioral units’ considerably closer (because ‘precooked’) to behavior than the abstract characterization of the underlying strings. The behavioral units are modified and abstracted by postediting (see note 12) and transcription. Finally, the ‘input’ is not to morphemic but to transformational generative analysis.
Garvin goes on to say that two aspects of linguistic analysis are ‘often termed segmentation. . . and. . . distributional analysis’. Again with slight modification the gist of these terms is applicable here. The purpose of the analysis is the isolation of sentences or parts of sentences which are nonsynonymous paraphrases of each other. An important addition is, however, that the informant’s help is required in order to determine which sentences or parts of sentences conform to this requirement, or preserve the invariance of the semantic interpretation. The distributional analysis is analogous to the search for syntactically justified and economical ways of stating transformational relationships. Whereas in distributional analysis the ‘etic’ environment is the overriding factor, the central concerns here are the ‘emic’ conditions which trigger the desired transformational outputs. The treatment is distributional in the sense that certain transformations only apply in certain environments. For example, if a sentence of a given structure is embedded in a matrix sentence of a given structure most transformations may not apply.
One can further illustrate this analogy by applying Garvin’s five steps comprising an analytic cycle.13
1) ‘The formulation of the immediate analytic objective’. In the transformational generative view the analytic objective is given by metatheoretical considerations. We know that we are dealing with sentences, and we know that sentences are very often composed of other sentences. The analytic objective is to establish the nature of simple sentences, and the nature of transformations by which certain sets of simple sentences are related to each other. This involves relating the deep structure to the surface structure. It must be emphasized that appeal is made to the informant’s Sprachgefühl.
2) ‘Preparation of data base by elicitation, study of text or reorganization of existing data’. This problem will be discussed in detail in the next section. However, neither study of texts, nor elicited data, nor reorganization of existing data is a priori excluded.
3) ‘Impressionistic examination of data to observe pertinent units and relations’. All procedures require careful examination of the data, not only by the analyst but also by the informant who has to be trained for this task. Garvin’s ‘observe pertinent units’ must be restated as ‘postulate pertinent units’. The postulation of units and their structural relationships is an important part of transformational and generative grammar construction. It is essential to observe the difference: units are postulated, rather than observed. That the units of taxonomic grammar (e. g. morphemes) are postulated rather than observed is noted by Koutsoudas.14 The distinction is important because it affects the definition of these units. Although one may define linguistic units by diagnostic environments, the selection of ‘diagnostic environment’ is not automatic and can be applied only ‘if the answer is in a sense already known’.15 Transformational grammar defines its units by rules.16 ‘Rewrite N as’ followed by a number of lexical items on the right-hand side defines the members of this list as units called ‘noun’. Such a definition is necessary but not sufficient. How one has discovered this rule is relatively unimportant, except perhaps for the fact that taxonomic techniques can and should be utilized. Whether the definition is adequate becomes apparent only if sentences generated by the grammar do not meet the requirement of grammaticality. Such a definition is sufficient only if the grammar will generate grammatical and only grammatical sentences.
4) ‘Operational tests when necessary to verify impressionistic observation leading to attestation of relations and definitions of units.’ At this point in the analytic cycle of a transformational generative grammar, a first tentative formulation of the rules and categories of the grammar has been made. The operational test is the generation of sample sentences either randomly or by some other procedure. The attestation of relations and the definition of units follows: if the grammar is inadequate it will produce other than grammatical sequences, and the rules, or categories, or both, have to be adjusted accordingly.
5) ‘Collection and examination of additional data to crosscheck relations and definitions.’ The grammarian’s work is never done. There exists no complete grammar of any language. The examination of additional data will reveal parts of the grammar which either inadvertently or intentionally have been by-passed and left out. Although relations and definitions are crosschecked at this point, more is involved in formulation. General formulations are preferable and should replace particularistic ones. General rules are preferable to specific rules covering one or just a few cases. Extending a grammar to new sentence-types may lead to new relationships, and adjustments will be necessary. These should make the rules of the grammar more comprehensive.
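The operational test of step 4 can be sketched as a small program. The toy phrase-structure grammar and vocabulary below are invented for illustration (they are not the author’s Navaho rules); sample sentences are generated by random expansion of the rules and then inspected for grammaticality.

```python
import random

# A toy phrase-structure grammar (hypothetical English fragment).
# Each left-hand symbol rewrites as one of the listed alternatives;
# symbols absent from the table are terminal words.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V"], ["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["horse"], ["orange"], ["linguist"]],
    "V":   [["spoke"], ["saw"]],
}

def generate(symbol="S"):
    """Expand a symbol by randomly chosen rewrite rules until only
    terminal words remain."""
    if symbol not in GRAMMAR:          # terminal: a word of the language
        return [symbol]
    words = []
    for sym in random.choice(GRAMMAR[symbol]):
        words.extend(generate(sym))
    return words

# Operational test: generate sample sentences for inspection.
for _ in range(5):
    print(" ".join(generate()))
```

If the generator produces other than grammatical sequences (here, for instance, sentences with an inanimate subject of ‘spoke’), the rules or the categories are adjusted, exactly as described in step 4.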
It is useful to look upon the construction of grammar in terms of certain priorities. The formalism underlying the theory of grammar is well suited to partial solutions or subgrammars. These deal with subsets of rules and categories of some ultimate grammar. Such a subgrammar generates only a certain type of subsets of sentences.17 The choice of some sequence of priorities in the study of sentence-types may be based on linguistic universals. Other priorities may vary from language to language. Equational sentences, if present in a given language, are probably the simplest structures. In ascending order, intransitive sentences, transitive sentences, double transitive sentences and their derivatives are increasingly more complex. Rather than attacking a language frontally one may with less effort proceed from simpler structures to the more complex.
Subgrammars are limited not only by sentence-type. Sentences of a given type may contain (as they do in Navaho) future, imperfective, perfective, iterative, optative, or other subsentence types.
In the next section a Navaho morphophonemic class of verbs is considered in detail. These verbs of the imperfective paradigm have been selected because the syncretism or morpheme overlap is simpler than in the perfective. The relationship of the imperfective to the future or any of the other paradigms was not checked. A further simplification of the Navaho example is achieved by considering only the simplest of the imperfective morphophonemic class of verbs. These are verbs without derivational affixes, or the so-called disjunct imperfectives.18
Among the list of priorities is also the problem of subcategorization, which plays an important part in the assignment of grammar codes (grammatical markers)19 or labels to the nodes of the structural description. The subcategorization in a transformational generative grammar is finer than in traditional grammars, certainly more so than in most taxonomic grammars.
As Chomsky has shown, at least one aspect of grammaticality can be expressed in terms of subcategorization.20 The lexical entries of the grammar are subcategorized componentially. The use of grammatical components is an efficient way of dealing with intersecting classes. The componential solution is an important aspect of recent formulations.21
An arbitrary level of grammaticality for early formulations may be decided in advance. This will simplify the first stages of analysis. Subcategorization may be represented as a taxonomic rooted tree. It is the ‘depth’ of this taxonomy that can be arbitrarily fixed and decided upon in advance.
A zero level of subcategorization, in this view, contains only words. A first level of subcategorization separates words into such classes as nouns, verbs, particles, etc. A second level of subcategorization may separate nouns into count nouns and mass nouns, and so on. If the branchings are always binary, a level of subcategorization arbitrarily set at five gives 32 categories of word classes; at six, 64; at seven, 128. An example of some arbitrary limit of subcategorization is one which permits sentences of the type John spoke softly and The horse spoke softly, but excludes sentences of the form The orange spoke softly. It does not make the complete distinction among animate, human, and nonhuman nouns. The lack of this distinction may be due to an arbitrary restriction of the taxonomic depth to perhaps four or five.
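The arithmetic of binary subcategorization, and the selectional effect of cutting the taxonomy at an arbitrary depth, can be illustrated in a brief sketch; the lexicon and feature labels below are hypothetical.

```python
# With strictly binary branching, a subcategorization of depth d
# yields 2**d word classes.
for depth in (5, 6, 7):
    print(depth, 2 ** depth)

# A shallow taxonomy encoded as feature paths (invented entries).
# Stopping at the animate/inanimate split licenses both "John spoke
# softly" and "The horse spoke softly" while excluding "The orange
# spoke softly"; the human/nonhuman split would need one level more.
LEXICON = {
    "John":   ("noun", "count", "animate"),
    "horse":  ("noun", "count", "animate"),
    "orange": ("noun", "count", "inanimate"),
}

def may_speak(word):
    """'speak' selects an animate subject at this taxonomic depth."""
    return "animate" in LEXICON[word]

print(may_speak("John"), may_speak("orange"))
```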
3.0 INFORMANT VERSUS TEXT
There is no doubt that all human beings have an innate genetic predisposition that enables them to learn at least one human language. The information that goes into learning a language is vast. To speak is perhaps the most complicated task that a human being learns during his lifetime. ‘Since each speaker is a finite organism, this knowledge (of his native language) must be finite in character, i. e. learnable’.22 In fact we are forced to assume that such learning involves the breakdown of the finite external linguistic information into a set of elementary rules (or some hypothetical complex neural connections) which are capable of reproducing and/or interpreting an infinite set of sentences which a speaker of a language may produce or encounter. It follows that this innate ability to analyze language, however implicitly and nonselfconsciously, is a corollary of the innate ability to learn a language. This observation is important to the anthropological linguist because no matter how strange a human culture may be or how strange the language spoken in that culture is, native speakers can be taught to become efficient linguistic technicians or linguists. That is, they can be taught to make the notions of an implicit nonselfconscious linguistic analysis explicit and hence, verbally public.
Linguists who work on well-known languages usually analyze their own language. They combine in one person the native speaker and the analyst. The anthropological linguist is rarely that lucky. It is for this reason that he is eternally preoccupied with techniques of discovery. The best recourse of the anthropological linguist is close cooperation with a native speaker. That this cooperation assumes a special character should be evident from this discussion.23 It is inevitable and necessary for eliciting toward the ‘discovery’ of the grammatical structure of a language. The native speaker becomes considerably more central to the linguistic investigation than he has been in eliciting for taxonomic grammars.24
There are two reasons for the native speaker to be central to transformational generative analysis:
1) Because of his knowledge of the language he is capable of recognizing partial and total similarities between sentences and parts of sentences, i. e. he is capable of establishing quickly and efficiently that a given sentence and another are paraphrases of each other; he can state the nature of the paraphrase relationship (synonymous vs. nonsynonymous) or can recognize that they are not related at all. The native speaker is also capable of taking his entire knowledge of the language into account while making these judgments. It is on the basis of this total knowledge that he is capable of making decisions about the plausibility of a proposed source sentence. It should be noted that it is a requirement of descriptive adequacy25 that these skills of the informant be explicitly accounted for.
2) (Not entirely independent of 1)) A great asset of the native speaker is his capacity to recall large bodies of linguistic material on demand by virtue of his memory and fluent knowledge of his language.
The emphasis on the native speaker or the participant-consultant in a project of writing a grammar of a given language raises the question of the usefulness of texts. The place of texts in taxonomic grammatical analysis was discussed in detail by the Voegelins in Hopi Domains.26 The technique of ancillary eliciting is based on the availability of a text. ‘(It) starts with translation eliciting (i. e. texts read to the informant who then translates them into the language of the analyst), but once the English gloss of the word is obtained, “how do you say” questions are asked in an attempt to discover combinatorial possibilities, to find what common morphemes occur in what sequence’, and ‘. . . ancillary eliciting generates numerous texts. . . .’27 Ancillary eliciting according to this view was carried on until all the morphemes — particularly the minor morphemes or affixes of the language — had been identified. Ancillary eliciting in itself controverts the claim that a text can be analyzed without recourse to ancillary data. Anthropological linguists never subscribed to that extreme in practice, but in their theoretical orientation the corpus was considered closed when all morphemes were assumed to be listed, particularly the minor morphemes or affixes. There was no further reason for more data.
The technique of ancillary eliciting with minor modification and specific attention to sentence structure can be applied to the eliciting of transformational generative grammars. The basic assumption of the Voegelins, that the text is primary and subsequent eliciting is ancillary, however, must be reversed. Eliciting becomes primary and the use of text is ancillary. The Voegelins state that ancillary eliciting is time-consuming.28 So is general grammar construction. However, to the anthropological linguist the procedure suggests a general technique of analysis.
1) It is useful to start work in an unknown language with a text. The text may function as an ancillary data base. In addition it may perform a useful function in the training and education of informants (see below).
2) If the corpus is large enough it will be possible to find in it related sentences. These can be brought together and presented to the informant as evidence that related sentences do exist in his language.29 The informant can be taught what nonsynonymous paraphrases of sentences are and how they are to be used in the analysis. In order to explain the type of relationship that the linguist is interested in, it is most useful to bring together examples, and demonstrate with these the objectives of the linguist. In an entirely unknown language, and working with an unsophisticated informant, such illustrative examples may be the quickest and perhaps the only effective method of demonstrating the relatedness of sentences. Once this is accomplished, it becomes easy to transcend the text and to elicit related sentences directly from the informant. Hale has remarked that it is useful to teach the informant to respond to metatheoretical linguistic labels such as, ‘would you please now give me an equational sentence’, etc. This greatly facilitates the informant’s ability to ‘perform operations on given linguistic material’.30
3) As the eliciting of new varieties of sentences proceeds, it is useful to add the new sentences to the original text as expansions of the corpus.
4) A further advantage of the presence of a large text is that there are sentence-types which are infrequent in ordinary discourse and which may elude the memory of the informant. A file of sentences from elicited texts or elicitations based on elicited texts may perform the useful function of a memory base. Such a file becomes more and more important ‘as the questions become more sophisticated (and) the informant’s responses become more and more difficult to control, and his memory becomes less and less reliable’.31 Transformational generative grammar construction certainly meets this requirement of sophisticated responses which are increasingly more difficult to control. It is in the treatment and manipulation of such large bodies of data, which can be accumulated in amazingly short periods of time, that computer processing of linguistic information can be used as an aid in the construction of the rules, transformations, categorization and selection restrictions which hold in a transformational generative grammar of a given language.
4.0 THE USE OF COMPUTERS
According to the view taken in this paper ‘discovery procedures’ have no place in linguistic theory. There is no a priori way in which we can determine how discoveries are made. However, no one doubts the utility of various analytic techniques which can be used to manipulate the ‘raw’ data and by means of which one can gain understanding. Such understanding is the raw material of creative intuition. Some analytic techniques are more useful and more appropriate for the manipulation of data or for information processing. Among the various techniques and trial-and-error procedures which one can use to cull understanding from the data, some by their very nature promise greater reward than others. It is obvious that in a list of priorities of analytic usefulness, the ouija board, familiar spirits, or a solar eclipse will have a lower priority32 than the vast memory capacity and speed of modern electronic computers. This does not mean that linguistic analysis cannot be performed without a computer. It can be done and has been done by trained linguists who may or may not be native speakers of the language analyzed.
To the anthropological linguist who knows the language to be investigated poorly, if at all, the information-processing capacity of a modern computational device may be a tremendous asset. If there is a considerable body of data available, there is a second reason for computers. They may serve as data organizers and make previous works available for quick recall or look-up.
Computers and human beings have a common feature: they are both complicated information-processing devices. The speed and accuracy of the computer make it a device for the extension of human capabilities. Computers have been programmed as general problem solvers of highly structured simple games or of theorems of logic and geometry.33 However, at least at present a computer program which can automatically analyze a previously unknown language and furnish structural descriptions of its sentences and a lexicon with a grammar code is beyond reach.34
Computers may be used as a check on exhaustiveness. ‘In eliciting complex sentences’, according to Hale, ‘some sentence types are inevitably missed’. He proposes that by sensitizing the informant to questions of the kind, ‘What is the simple sentence which underlies sentence X?’ and by explaining to him what ‘underlie’ or ‘source’ means, some of the hiatus can be filled in.35 In spite of the effectiveness of this technique there is no procedure which will guarantee the exhaustiveness of a grammatical investigation. This is an inevitable by-product of the infinitely many sentences which constitute the repertory of a speaker of a natural language.
The greater the variety of source texts and of techniques of discovery the higher the degree of exhaustiveness. Without artificial stimulation of some kind or proper sociophysical contexts, no informant will be able to provide all possible sentence-types occurring in his language. Computer programs processing texts can function as artificial stimulants in this sense.
A grammarian, even if he does not know the language well, can use concordances and very simple search programs. More complex programs (see other articles in this volume) require a vast amount of sophisticated linguistic information before they become feasible or economical. Simple concordances and search programs with various extensions and adaptations may perform much more effectively because of their generality and independence from prior linguistic analysis.
In spite of sophisticated additions (see discussions below), concordances are inefficient information-retrieval devices. The contexts of most key words in long texts contain more sentences than are needed at any stage of the analysis. Fortunately, the number of context sentences is much smaller than the entire corpus. The separation of sentences with the relevant features can easily be done manually.
There are two types of concordance programs which are easily accessible to linguists: so-called unblocked concordances, and blocked concordances. The first gives an arbitrary amount of context for each key word.
Martin Kay’s (RAND) CRUDEN unblocked concordance provides one line of 130 characters (including spaces) for the key word and its context. The key word is centered on the printout page. I found this format useful with Trader Navaho texts. I had difficulty in establishing sentence boundaries, however; in most cases, the 130-character line included more than a single sentence of context. (See Fig. 1.1.)
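A minimal key-word-in-context concordance of this unblocked kind can be sketched as follows. The sample text and line width are invented, and the sketch makes no claim about CRUDEN’s actual implementation; it only mimics the fixed-width output format described above, with the key word centered on each line.

```python
def kwic(text, keyword, width=130):
    """Minimal unblocked (key-word-in-context) concordance: one
    fixed-width line per occurrence, key word centered, context cut
    at a character count without regard to sentence boundaries."""
    words = text.split()
    half = (width - len(keyword)) // 2
    lines = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[:i])[-half:].rjust(half)
            right = " ".join(words[i + 1:])[:half].ljust(half)
            lines.append(f"{left} {keyword} {right}")
    return lines

# Invented sample text, narrow width for display.
sample = "the trader spoke navaho and the trader kept a store"
for line in kwic(sample, "trader", width=40):
    print(line)
```

As in the CRUDEN output, the context window is cut at a character count, so a single line may span a sentence boundary; this is exactly the difficulty with sentence boundaries noted above.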
Blocked concordances have the advantage that the unit of context can be chosen in terms of its relevance to the linguistic objective. A grammarian intending to write a grammar of the sentences of a language will prefer the sentence as the extent of ‘the block’. The boundary of the block is marked by a boundary symbol, usually an identification number. There is no reason to limit context always to one sentence. It can easily be extended to include larger or smaller units.
Unblocked concordances are generally easiest to use with one line of text, which for the anthropological linguist will rarely suffice. More often than not, he needs English translation labels along with the original sentence. Although the CRUDEN concordance program allows for multiple lines (see Fig. 1.1), the count of items between spaces (i. e. the words) of the text and the English sentences must coincide. Such matching requires very careful and exacting proofreading, hence, extra time. With more than two lines (say, interlinear and free translation) this task becomes most cumbersome.
Blocked concordances have the advantage that precise matching of items on each interlinear line is unnecessary. The blocks are keypunched in a special format which automatically assures the proper sequence and proper placement of the lines.
Some blocked concordances have provisions for an unlimited number of parallel lines. Lines may be subgrouped into several classes. For example, the BIDAP (Bibliographical Data Processor) program developed by Professor James Aagaard at Northwestern University has four classes of lines. Each line can have up to 99 continuation cards (see Fig. 1.2).
Northwestern University’s TRIAL, programmed by William Tetzlaff, can handle eight classes of cards, each with up to 99 continuation cards. Multiple classes of lines can be utilized in various ways: for text; interlinear translation; free translation; grammar code; etc.; or as in the case of my Navaho dictionary work, for identification number; stem; prefixes; nouns, postpositions, and examples; and English translations (see Fig. 1.2).
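The organization just described can be sketched as follows. The block structure, the field names, and the two sample blocks are illustrative assumptions, not TRIAL or BIDAP code; the point is only that the concordance key comes from one class of lines while the whole block is returned as context, so no line-by-line matching is needed.

```python
# A sketch of a blocked concordance: each block (here, a sentence) carries
# several parallel classes of lines -- Navaho text, free translation, etc.
# The concordance is compiled on one chosen class; the unit of context is
# always the whole block.

from collections import defaultdict

blocks = [
    {"id": "001", "text": "ashkii yilwoł", "free": "the boy is running"},
    {"id": "002", "text": "ashkii naalnish", "free": "the boy is working"},
]

def blocked_concordance(blocks, on="text"):
    index = defaultdict(list)
    for block in blocks:
        for word in block[on].split():
            index[word].append(block)   # keep the whole block as context
    return dict(index)

conc = blocked_concordance(blocks)
for key in sorted(conc):
    for block in conc[key]:
        print(key, "->", block["id"], "|", block["free"])
```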
Figure 1.1 The Output of the CRUDEN Concordance Program
Note to Fig. 1.1 The output of the CRUDEN concordance program. Note the discrepancy in the spelling of ‘ADE7E72BIN and ‘ADE7E7ZBIN (due to a change to CDC equipment at Northwestern, 2 is printed as Z), as well as ‘AGHA7A7 and ‘AGHAA. Concordances are ideal for locating inconsistencies in the transcription (first example) or free variations (second example) of speech. (The Trader Navaho transcription follows the Indian Service alphabet of Young and Morgan (op. cit.) with the following changes: ts = C; i = LH; ń = N6; ʼn = N9; if V = vowel, then V7 = high vowel, VV = long vowel, V7V7 = long high vowel, and 8V7V7 = nasalized long high vowel. This sample of Trader Navaho is spoken by Sam Drolet of Carson’s trading post. Trader Navaho is a marginal language spoken by white traders, usually of pioneer stock, who are engaged in trading with the Navaho Indians of Arizona, New Mexico, and Utah. Because they are isolated from each other, every trader has his own version of Trader Navaho. For details see my dissertation, ‘A Typological Comparison of Four Trader Navaho Speakers’ (Indiana University, 1963).)
The two Northwestern programs are equipped with logical operators (AND, OR (exclusive), and NOT) for special search procedures. This feature permits searches for the cooccurrence or noncooccurrence of ‘words’ or parts of words.36 The investigator is not restricted to a simple concordance but can ask for a concordance of two or more items simultaneously occurring in the same sentence or block.
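The effect of these operators can be sketched as a filter over blocks. The operator names and the toy sentences below are illustrative assumptions; in particular, the exclusive OR is rendered here as an inequality test on membership.

```python
# A sketch of logical-operator searching over blocks: retrieve the blocks
# in which two items cooccur (AND), exactly one of them occurs (exclusive
# OR), or the first occurs without the second (AND NOT).

def search(blocks, a, b, op):
    hits = []
    for block in blocks:
        w = set(block.split())
        if op == "AND":
            keep = a in w and b in w
        elif op == "XOR":               # exclusive OR
            keep = (a in w) != (b in w)
        elif op == "AND NOT":
            keep = a in w and b not in w
        else:
            raise ValueError(op)
        if keep:
            hits.append(block)
    return hits

sentences = ["a b c", "a c", "b c", "c"]
print(search(sentences, "a", "b", "AND"))   # blocks where both cooccur
```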
In the following section some Navaho examples and problems are discussed, including the applicability of computer programs to their solution.
4.1 Examples of computer applications. Garvin calls the phonemic fusion of two morphs morpheme overlap. ‘The difficulty of separating morphs increases with the amount of morpheme overlap present in the language’.37 In Navaho the morphemes of the verb mode and the person markers tend to fuse, making analysis difficult.
The incorporated person markers represent an important feature of the verb: the person marked within the verb must be in agreement with the nominal subject and nominal object of the sentence. This dependency relationship between the nouns and verbs of the Navaho sentence is a clear-cut and natural place to begin analysis.
Two paradigms have been investigated in detail: the imperfective and the perfective. The complexities of the perfective are overwhelming and will not be dealt with further in this paper.
Figure 1.2 The BIDAP Input from the Young and Morgan Navaho Dictionary
Young and Morgan list five morphophonemic classes of imperfectives: one disjunct imperfective, with no derivational affixes, and four conjunct imperfectives, which cooccur with various derivational prefixes.38 It appeared on the basis of Young and Morgan’s work that it should be possible to write a relatively simple search program for an electronic computer which would recognize the five classes of Navaho imperfectives. In the process the program should also recognize the subject and object markers incorporated in the verb and assign the ‘person’ of the subject and object to the verb construction.39
Such a program would have a dual function. (1) It could be utilized to search the keypunched version of the Young and Morgan Navaho Dictionary40 in order to check the exhaustiveness of the five postulated classes of imperfectives. Residual forms, i. e. imperfectives which the program could not identify, could then be analyzed and the program refined accordingly. Experience gained in this operation could then be applied to the more complex paradigms of Navaho. (2) As soon as a corpus of Navaho sentences becomes available for computer processing (again the greatest bottleneck is, of course, keypunching), the program may be applied to such data. The assignment of the ‘person’ of the incorporated subject marker would be a small but important step in the assignment of a grammar code to the constituents of Navaho sentences.
Unfortunately, neither of these approaches is rewarding. In the first instance the programming time is considerable and the gain relatively small. The same objective can be achieved more exhaustively by a concordance program. In order to retrieve this kind of information about the structure of the prefixes of the Navaho verb, the Northwestern BIDAP program was modified in the following manner (see Fig. 1.2).
The BIDAP card format contains five classes of cards. Class Ø (marked by Ø in column 75) contains the identification number of the dictionary entry. The first three digits are the page number in the dictionary, the next two digits are the number of the entry on a given page, and YMA designates the first dictionary of Young and Morgan. The two digits in columns 73 and 74 are used as continuation numbers for each class, i.e., from 1 to 99 continuation cards.
On the lines of class 1 (marked by 1 in column 75) and of all other classes, only columns 1-60 are used for data. In class 1, the 60-character line is divided into fields of 20 characters. These fields are used for the stems of the various paradigms: columns 1-20 are reserved for the future stem, 21-40 for the imperfective, 41-60 for the perfective, etc.; the stems are continued on the next continuation card of class 1 if necessary. In addition to its position, each stem is marked by a letter following two spaces after the last symbol of the stem. The abbreviations are F: Future, I: Imperfective, P: Perfective, R: Repetitive, O: Optative, PRG: Progressive, U: Usitative, CI: Continuative Imperfective, and SP: Si-Perfective.
The lines of class 2 contain all the prefix combinations of the dictionary entries. The paradigms are separated by asterisks and each prefix complex is marked by a symbol for paradigm (same as above) and by 1, 2, 3, 4, for first, second, third, and fourth person41 subject, 21 and 22 for first and second person dual, and P1 and P2 for the two ‘passives’. In this class there are usually several continuation cards.
Class 3 cards contain three types of information: (1) postpositions if they precede the verb, (2) illustrative sentences, and (3) entries which are not verbs and consist of only one construction.
Class 4 cards contain all the English equivalents given in the Young and Morgan dictionary as well as translation of the sample sentences.
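The card layout described above can be sketched as a small parser. The column positions (data in columns 1-60, continuation number in 73-74, class digit in 75) come from the description; the sample card and the field names are illustrative assumptions, and the paper’s class Ø is written 0 here.

```python
# A sketch of reading the modified BIDAP card format: an 80-column card
# with data in columns 1-60, a two-digit continuation number in columns
# 73-74, and the class digit in column 75 (zero-based slice index 74).

def parse_card(card):
    card = card.ljust(80)
    return {
        "data": card[0:60].rstrip(),
        "continuation": card[72:74].strip(),
        "class": card[74],
    }

def parse_id_card(card):
    # Class-0 cards carry the entry's location in the source dictionary:
    # three digits of page number, two digits of entry number, and a
    # source code such as YMA (the first Young and Morgan dictionary).
    fields = parse_card(card)
    assert fields["class"] == "0"
    ident = fields["data"]
    return {"page": ident[0:3], "entry": ident[3:5], "source": ident[5:8]}

card = "12003YMA".ljust(72) + " 10"   # hypothetical card image
print(parse_id_card(card))
```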
A ‘key word from context’ concordance can be compiled on any one of the five line classes (see Fig. 1.3). Instead of printing out the full contexts the entries may be simply indexed according to their location in the Young and Morgan dictionary. (Fig. 1.4)
The alphabetic sorting performed on all the prefixes contained in the dictionary provides all the information contained in the above described search program. It provides considerably more detail because the sorting is performed on all Navaho paradigms contained in the Young and Morgan dictionary. All identical prefix forms marked for person and paradigm are brought together. Since in the case of the search program and the BIDAP program the entire dictionary must be keypunched to assure exhaustive exploitation of the source dictionary, nothing is to be gained by such a specially programmed search routine.
The failure of the search program in its application to texts is even more serious. Figure 1.5 indicates what such a search routine may look like in the form of a schematic search tree. The search proceeds from right to left through the Navaho word and follows the stem-recognition and enclitic-stripping routines (see note 39). Although the diagram is restricted to the disjunct imperfectives, it is a representative example of the paradigmatic information given in the source dictionary. The Young and Morgan Navaho dictionary is not a dictionary of lexical items in the usual sense of the word. It is apparent (see the sample page of the dictionary, Fig. 1.2) that it is a paradigmatic dictionary. Instead of giving only the ‘naming units’,42 each entry of a verb contains a partial paradigm. If the verb is intransitive, the entry is usually restricted to the 1, 2, 3, and 4 person singular and the 1 and 2 person dual. If the entries are transitive, the third or obviative form of the object is implicit, while the subject person markers are the same as in the intransitive. Whenever possible, passive forms are added.
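The preprocessing that the search tree presupposes (enclitic stripping followed by stem recognition; see note 39) can be sketched as below. The enclitic inventory, the vowel set, and the sample word are illustrative assumptions, not Navaho data; in particular, real stems may contain long (two-segment) vowels, which this sketch ignores.

```python
# A sketch of the right-to-left preprocessing assumed by the search tree:
# strip known enclitics off the end of the verb, then recognize the stem
# as the final CVC or CV syllable; whatever precedes it is the prefix
# complex handed to the search tree.

ENCLITICS = ["go", "ii"]    # hypothetical enclitic inventory
VOWELS = set("aeio")        # hypothetical vowel segments

def strip_enclitics(word):
    stripped = []
    changed = True
    while changed:
        changed = False
        for enc in ENCLITICS:
            if word.endswith(enc) and len(word) > len(enc):
                word = word[:-len(enc)]
                stripped.append(enc)
                changed = True
    return word, stripped[::-1]

def find_stem(word):
    is_v = lambda ch: ch in VOWELS
    if len(word) >= 3 and not is_v(word[-3]) and is_v(word[-2]) and not is_v(word[-1]):
        return word[:-3], word[-3:]    # prefixes, CVC stem
    if len(word) >= 2 and not is_v(word[-2]) and is_v(word[-1]):
        return word[:-2], word[-2:]    # prefixes, CV stem
    return word, ""                    # no stem recognized

word, enclitics = strip_enclitics("yitingo")   # hypothetical input
prefixes, stem = find_stem(word)
print(prefixes, stem, enclitics)
```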
Figure 1.3 BIDAP Output: Concordance on Line of Class 2 — Verb Prefixes
Figure 1.4. Sample of a Partial Index of Navaho Verbal Prefixes from the Young and Morgan dictionary by BIDAP
Figure 1.5 A Schematic Search Tree of the Navaho Disjunct Imperfective
This paradigmatic limitation of the entries makes the proposed search program unsuitable for texts. The object may assume any number of ‘persons’ which are not contained in the dictionary. The extension of the search to objects marked for other than third person is neither self-evident nor automatic and must be ascertained by work with an informant. The matrix of Fig. 1.6, presumably exhaustive (except for the reciprocal and reflexive incorporated pronominal forms), represents the entire set of marked subjects and objects in the disjunct imperfective. The situation is complicated by the fact that verb stems place selectional restrictions on the noun classes of subject or object. In most cases only part of the matrix (Fig. 1.6) is utilized.
One verb stem selection restriction not noted in the Navaho literature is the animate-inanimate distinction: e. g. yi-λóóh ‘third person animate is cold’43 versus yi-tin ‘third person inanimate is cold’. First, second, and fourth person forms given by Young and Morgan are ungrammatical.44 The pairing of animate vs. inanimate stems is a prevalent feature of the Navaho stem vocabulary.
The problem of selection restrictions between the nominals of the subject and object of the sentence and the verb stem raises the question of the dependencies and selection restrictions holding among the verb stem, the subject and object nominals of the sentence, and the subject and object markers incorporated in the verb construction. The following eleven sentences illustrate the scope of the problems.
Eleven Related Navaho Sentences
Figure 1.6. The Full set of Subject and Object Person Markers of the Navaho Disjunct Imperfective
The verb stem t’ood ‘(to) wipe (with a rag)’ requires a human subject, but is noncommittal and may apply to any object nominal. The ‘any’ is to be interpreted as ‘a possible finer subcategorization is at present not known’.
These eleven sentences are not synonymous paraphrases of each other since they differ in their lexical makeup. However, they have the same underlying structure. The rules for the generation of these eleven sentences and a great many like them, are as follows:
This rule gives a three-way division of the Navaho sentence. It is needed to account for the later transformational permutation of the two noun phrases. The two noun phrases are numbered. The numbering may be interpreted as 1 = subject and 2 = object; however, the main purpose of the numbers is to simplify the statement of certain later environmental restrictions. A component of emphasis (+) or no emphasis (-) must be attached to every noun phrase. This component will later determine the order of the noun phrases in the sentence, as well as the selection of the incorporated person markers within the verb. Any noun phrase with the components of emphasis and humanity (see rule 7) refers to a person psychologically close to the speaker. Ken Begishe explains this feature as: ‘if the person is well known’.
The VERBAL consists of the two incorporated person markers. They are numbered to simplify a later environmental statement. Again they mark the incorporated subject person marker (1) and the object person marker (2). The numbering corresponds to that of the two noun phrases.
The verb proper consists of a modal MOD, which in this class of disjunct imperfective verbs is marked by a morpheme yi (the so-called yi-imperfective). (Other classes are considerably more complex.) The classifier of t’ood is zero (0). With other stems in this class it may be ‘l’ or ‘ł’. The function of the classifiers is not well understood. This class has a membership of about twenty stems.
One type of noun phrase is a noun. There are other possible noun phrases symbolized here by the three dots ‘….’. Whatever constituent structure the noun phrase may have, the component of emphasis is carried across the rewrite rule and is attached to the componential structure of the right side of the rule. The exact nature of this rule is not important because complex noun phrases consisting of more than a noun are not considered.
The next three rules expand the componential structure of the Navaho nouns:
Nouns are divided into definite and indefinite nouns. All indefinite nouns must be emphasized. The definite nouns are divided into animate and inanimate nouns, but there is no way to emphasize inanimate nouns. All animate nouns may be human or nonhuman nouns.
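The three rules just summarized amount to a small set of well-formedness conditions on noun component bundles, which can be sketched as a single predicate. Representing the components as a set of feature labels is an illustrative assumption, not the paper’s notation.

```python
# A sketch of the componential restrictions on Navaho nouns described
# above: indefinite nouns must be emphasized; inanimate (definite) nouns
# can never be emphasized; animate nouns may be human or nonhuman, with
# or without emphasis.

def valid_noun(components):
    c = set(components)
    if "indefinite" in c:
        return "emphasis" in c           # all indefinites are emphasized
    if "inanimate" in c:                 # definite, inanimate
        return "emphasis" not in c       # never emphasizable
    if "animate" in c:                   # definite, animate (human or not)
        return True
    return False

print(valid_noun({"indefinite", "emphasis"}))   # True
print(valid_noun({"inanimate", "emphasis"}))    # False
```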
Two transformations are necessary to account for this set of sentences:
So far only nouns (or noun phrases) have components. This transformation attaches a duplicate of the components of the nouns to the corresponding incorporated person markers P. The environmental statement in this rule is simplified if the noun phrases are numbered. (See rule (1)).
If there is a noun phrase in noninitial position in the sentence which contains any set of components R and a component of emphasis, then it must obligatorily assume the initial position in the sentence.
The remaining rules are lexical:
The indefinite incorporated person marker P (2) requires the emphasis of the object of the sentence. The morpheme a can occur as P (1) only if it is preceded by an emphasized P (2). This restricts the occurrence of this a to the following sequences: ɂa + ɂa, bi + ɂa, and ha + ɂa (see rules (14) and (15)). Sentence (ix) is hence unambiguous and can signify only ‘the woman she wipes “indefinite” (with a rag)’. A sentence
has no interpretation: čidí ‘car’ cannot be the subject of the stem t’ood, which requires a human actor; it cannot be its object because all inanimate nouns are nonemphasizable.
The selection restrictions governing the indefinite cannot be ascertained by sentences of the form:
Such sentences cannot be elicited. There are no nouns or pronouns which can occupy the position of object or subject in an indefinite sentence. This can be demonstrated by the following sentences.
The indefinite nouns are place-holders. Their existence can only be inferred indirectly. They are deleted after the feature of indefiniteness is transferred to the appropriate P(α) by transformation (8).
An emphasized psychologically close human person marker P (1) is to be replaced by . The nonemphasized person marker P (1) is replaced by a zero morpheme. All emphasized nonhuman person markers P (1) may be replaced by a zero morpheme, but can never occur in the environment following an emphasized P (2). Only the occurrence of two emphasized human person markers (1) and (2) permits this kind of ambiguity as reflected by sentences (v) and (vi) (see next rule (14)).
The selection of ha requires that the subject of the sentence be human, i. e. the subject marker may be ø (rule (12)) if it is not emphasized, or if it is emphasized. In the latter case the sentence is ambiguous because according to transformation (10) either one of the two NP may be interpreted as either subject or object of sentences (v) and (vi).
Any human or animate object P (2) may be denoted by bi. It cannot occur in the environment preceding the emphasized, close human subject marker .
In all cases where the object is not emphasized the morpheme is yi.
Once the noun phrases are marked by the incorporated person markers they may be optionally deleted. All noun phrases with the component of indefiniteness must be deleted obligatorily. Sample derivations of sentences (ii), (iii), (v), (vi), (ix), and (xi) are given in the diagrams on the following pages.
According to the proposed solution, sentence
is two-way ambiguous, because of the two sources of zero morphemes (rules (12) and (13)). This ambiguity is caused by the asymmetry of bi which has no P (1) analog. There are two possible solutions:
A. A transformational rule which eliminates the component combination of rule (13), i.e., which changes
B. There is some reluctance on the part of Ken Begishe to extend P (2) bi to animals (animate objects). If bi is restricted to human beings, a component of closeness has to be introduced to distinguish bi from ha. However, if an environmental restriction is added to rule (7) to the effect that the component of nonhumanity can occur only with nonemphasis, the ambiguity of sentence (xvii) is eliminated.
which after the application of the phonological rule
* In this and the following phrase markers rules for the deletion of the zero morphemes are assumed but not explicitly stated.
Sentence (ii)
which after the application of the phonological rule
NP(2) was moved into the initial position by transformation (9)
Sentence (iii)
which gives after the application of transformation (9) and the phonological rule two sentences:
Sentences (v) and (vi)
which gives after the application of the phonological rule
with the first interpretation of this sentence
Sentence (xi)
which gives, after the application of the phonological rule:
and by a rule permitting the optional deletion of the subject
with the second interpretation of this sentence (see p. 25).
Sentences (ix), (xi)
The ambiguity of this sentence has no overt effect and is not serious, except perhaps for esthetic considerations. Because of the potential extension of bi to animals the present solution is retained. It remains to be seen if and how the feature of emphasis affects transformations which will be added in a more complete solution of Navaho.
The introduction of the ‘place-holder’ indefinite nouns is independently motivated by the possessive construction which is, in the form here given, already transformationally derived.
Apparent neologisms of Navaho such as šiɂ ableɂ ‘my indefinite possessor’s milk’ for store bought milk (versus šibeɂ ‘my milk (from my own breast)’) and ‘my indefinite possessor’s heart = my (car) battery)’ (versus
own heart’), become easily explainable:
becomes transformationally (possibly in two steps)
and, after the necessary deletions,
This solution is supported by constructions with definite nouns:
This extended example is intended to illustrate two points: first, that selection restrictions in Navaho are complicated; second, that the postulation of the abstract underlying grammatical structure of sentences is hardly aided directly by the use of computers.
5.0 CONCLUSIONS
The preceding discussion and the examples indicate that the computers currently in use are not ideally suited for linguistics. Since the texts must be viewed as ancillary to work with informants, the problem of constant updating of the text with newly elicited materials prevails.
The anthropological linguist, accustomed to small operations, is limited not only by the present cumbersome technique of keypunching elicited material on IBM cards, but also by the expense of this operation. The fact that the text is mere nonsense syllables to the English-speaking keypuncher, impairing his speed and accuracy, is another drawback. The anthropological linguist is not alone in his difficulties of reading linguistic data into the computer. The technology has not yet caught up with the problem.
The ultimate computers for linguistic analysis have not been built. One could easily imagine a machine which accepts spoken or at least written sentences in its memory banks without the tedious recourse to punched cards. Equally imaginable is a system where a native speaker-linguist or linguist-native speaker team would sit at the console of the computer with a large corpus of texts in its memory, carrying on a conversation with the machine — a conversation in the sense that the computer would, in short order, not only provide requested information but immediately store and make available new materials from the lips of the informant, and also try out proposed generative formulas.
My goal in this paper was to explore the utility of easily available existing programs for linguistic analysis. I focused particularly on the problems faced by the anthropological linguist. The utility of these programs lies in information retrieval of one kind or another rather than in what has been called automatic linguistic analysis.45
NOTES
1. H. Hoijer, IJAL 11. 193-203 (1945)
2. R. E. Longacre, Grammar Discovery Procedures 10 (The Hague, 1964). Although the author agrees with the spirit of this quote, it is more to the point if paraphrased as ‘generative grammarians are by no means uninterested in linguistic analysis’. This paper hopes to demonstrate that this is so.
3. Longacre, 11.
4. As this applies to phonology, see for example, C. F. Voegelin and Florence M. Voegelin, ‘Guide to the Transcribing of Unwritten Languages in Field Work’, AL 1:6. 1-28 (1959).
5. This is the interpretation given to R. B. Lees’ ‘shocking’ statement that ‘all languages are dialects of English’.
6. J. J. Katz and P. M. Postal, An Integrated Theory of Linguistic Description 157 (Cambridge, Mass., 1964).
7. This observation is from an unpublished paper by K. L. Hale, ‘On the Use of Informants in Field Work’ 3 (1964).
8. C. F. Voegelin and Florence M. Voegelin, Hopi Domains 3 (1957).
9. K. L. Hale, 1.
10. Katz and Postal.
11. P. L. Garvin, On Linguistic Method 66 (The Hague, 1964). Garvin’s ‘ordered set’ is not to be construed in the set-theoretical sense of the word.
12. It is interesting to note that such postediting of informant responses is in general unconscious, on the part of the informant as well as the investigator. That this is so can easily be ascertained by comparing informant responses with recorded spontaneous conversations or occasionally with spontaneous texts which do not require formal delivery in the culture whose language is under study. That this observation was missed in the past is probably due to the fact that tape recorders are relatively recent innovations of linguistic field technique. The author has noted on several occasions, for example, the bewildered amusement of Navahos when selected texts from E. Sapir and H. Hoijer, Navaho texts (LSA, 1942) were read to them. They claim that no Navaho speaks in such short, choppy sentences. The brevity of the sentences was most certainly due to the limitations of the unaided memory span of the transcribers. Among American Indian languages there are few examples of recorded live conversations.
13. P. L. Garvin, 64.
14. A. Koutsoudas, IJAL 29. 160-170 (1963), see also R. B. Lees, Lg. 33.375 - 408 (1957) and J. J. Katz, Lg. 40.124-37 (1964).
15. R. B. Lees, Lg. 36.210 (1960).
16. The fact that some of these rules are nonarrow rules will for the moment be disregarded.
17. Such subsets may also be infinite.
18. R. W. Young and W. Morgan, The Navaho Language 77 (U.S. Indian Service, 1943).
19. E.g., J. J. Katz and J. Fodor, Lg. 39. esp. 207-10.
20. N. Chomsky and G. A. Miller, ‘Finitary Models of Language Users’, Handbook of Mathematical Psychology esp. 2.443-49 (New York, 1963).
21. N. Chomsky ‘Topics in the Theory of Generative Grammar’ (1964, to appear).
22. Paul M. Postal, ‘Underlying and Superficial Linguistic Structure’, Harvard Educational Review 34. 246-66 (1964).
23. For a discussion of this point in semantic eliciting see Oswald Werner, ‘Semantics of Navajo Medical Terms: I’, IJAL 31. 1-16 (1965).
24. It is for this reason that the designation of participant-consultant or participant-informant is more appropriate. Oswald Werner, 8-9.
25. N. Chomsky, ‘Current issues in Linguistics’, The Structure of Language esp. 66-76 (Englewood Cliffs, N. J., 1964).
26. C. F. and Florence M. Voegelin, Hopi Domains (1957).
27. Voegelin and Voegelin, Hopi Domains 3.
28. Voegelins, ibid.
29. When informants are not confronted by examples and if the linguist fails to explain the tasks with sufficient clarity, the informant will sometimes say ‘You can’t do that in my language’. Examples from a text may alleviate this danger. Needless to say, there are times when the informant is right and one ‘can’t say it that way’ in his language.
30. Ken Hale, 3 (1964).
31. Garvin, 8 (1964).
32. The author is indebted to Bruce Dikson for these examples.
33. See for example E. A. Feigenbaum and J. Friedman, Computers and Thought (New York, 1963).
34. I cannot agree with Garvin that, ‘data processing equipment allows the processing of very large bodies of texts using the same program, with the program assuming the role of the linguistic analyst’. Garvin, 80 (1964) emphasis added. If we accept the fact that it takes a competent native speaker to perform the operations which lead to the postulation of the structure of a language, then a program which can allegedly replace an analyst would need to be superior to the capabilities of human beings in the following sense: It is implicitly assumed that such a program could perform linguistic analysis without the aid of the speaker’s knowledge of his native language. That is, it is claimed that the computer program is capable of performing complicated judgments comparable to the native speaker’s without mastery of the language under investigation.
35. K. Hale, 9.
36. This feature is at present restricted to ‘prefixes’ or ‘words’ minus suffixes, i.e., a search for Navaho stems is at present not possible.
37. Paul Garvin, 25.
38. Young and Morgan, 77-81.
39. In the following discussion it is assumed that two problems of Navaho automatic grammatical analysis have been solved: (1) Every Navaho verb may take from one to three enclitics following the last morpheme of the verb, which is the stem. The morpheme overlap between stem and enclitics is relatively simple. A sufficiently large number of enclitics are known (possibly all) so that an ‘enclitic-stripping’ program seems to present little difficulty; (2) Similarly, a ‘stem-recognition’ routine is also required. Since all stems are of the form CVC or CV (V representing one or two vowel segments), and since the morpheme overlap between the stem and the prefix immediately preceding it is relatively simple, such a stem-recognition routine should be easy to construct.
40. The Young and Morgan dictionary is being keypunched at the present time by Kenneth Begishe under a grant awarded to me by the American Philosophical Society.
41. The Navaho fourth person is a special third person restricted to human beings (see p. 30).
42. Madeleine Mathiot, ‘A Procedure for Investigating Language and Culture Relations’ 4(ms. 1963).
43. Apparently deleted from the Young and Morgan dictionary by a printer’s error on page 55.
44. Young and Morgan, 209, and Kenneth Begishe, personal communication.
45. Validation of the results of analysis is another very important application. That is, computer generation of sentences (for example by a COMIT program) or of phonological units. COMIT is a compiler program designed for programming generative grammars. For a brief introduction to COMIT see V. H. Yngve, ‘The COMIT system for Mechanical Translation’, Information Processing 183-87 (Paris, 1960) and the two COMIT manuals (Cambridge, Mass., 1963).