“Computation in Linguistics: A Case Book”
Automatic Verification of Phrase Structure Description
0.0 INTRODUCTION
0.1 A generation ago many linguists hoped, and some perhaps believed, that developments in acoustical engineering would provide total, unequivocal, and unique solutions to the problems outstanding in descriptive phonology. While automatic phonological analysis and the robot secretary are still far from reality today, the use of such instruments as the sound spectrograph has had some highly significant results in both practical and theoretical areas. Although they allow practically as much argument as ever, the phonologies proposed by linguists from impressionistic work have been sharpened and occasionally modified by new techniques and more particularly by a recognition of the differences between the kinds of questions the linguist may ask the native informant and those he may ask the machine. In effect, the main value of analytic instruments for the linguist is as a test of his impressionistic theory concerning the data elicited and of his empirical operating procedures.
More recently, with the development of computer technology, some very ambitious plans and claims have been made for the use of computers in grammatical analysis. As yet completely automatic grammatical analysis of the so-called neglected languages seems to be at least as distant a prospect as automatic phonological analysis. Nevertheless, even at the present stage of computer technology it is possible to automate some routine stages of linguistic analysis, and in the process perhaps to utilize the by-products to enrich linguistic theory, just as phonologists have used the information on some aspects of redundancy, transitional phenomena, discrete and continuous segments, etc., which emerged from their mechanized investigations.
0.2 The problem presented here is designed to give an extremely simple example of a system for testing an impressionistic classification frame of two tentative subclasses of prenominal modifiers in English.1 Since the problem in question and its proposed approach to solution are to be considered illustrative rather than definitive, the beginning stages of the descriptive process and the formulation of the framework to be tested will be discussed only briefly. Likewise, the revision of the tentative descriptive framework after the testing process is not given here in any detail. The computer system is presented only from the standpoint of the role of the linguist, for it is assumed that he will have the help of a competent programmer and will be interested only in the part of the process which involves his own interests and sphere of activity.
The type of grammar represented is basically one of item-and-arrangement. It is also what is usually called a ‘word grammar’, although in the early stages of planning it had a phonological component of stress as one of the criteria of classification.2 The stress component was of necessity scrapped, along with one of the subclasses based on it, for two practical reasons. First, while it is possible in theory to construct a computer code which marks stresses, such a code would be extremely complex for a problem which is in other respects a very small-scale operation. Second, the system presented will begin from a corpus of written English in normal orthography as input.3 As yet there is no corpus generally available which marks stresses in a consistent transcription and is of sufficient length to be representative. Perhaps some significant and useful work can be done on phonologically oriented grammars when such a corpus is made available.
While the system described in some ways resembles the automatic parsing operations detailed elsewhere in this casebook, it is much less sophisticated in that it does not handle every item in the phrase, but only selected items. After sufficient testing, revision, and additions, it might be rewritten into a full parsing operation, but in the system’s use here, as an interim tool for the collection of the specified data about the specified items, those refinements are replaced by a human preeditor.4
A collecting and sorting operation of the type presented is a very useful one in testing linguistic classifications made on a distributional basis, but it is done much less frequently than it should be. If done by hand, such work is extremely tedious and subject to human errors arising from boredom and unconscious revision by the investigator. Since computers are not subject to boredom and work very rapidly with total consistency, they are much more reliable for tasks of this nature than human investigators.5
In addition to its value as a testing device for a descriptive frame, a system of the type proposed might, after any necessary revisions, have some utility in other areas of linguistic investigation. A collection of such statistical data on modification patterns of contemporary American journalistic English might be tested against a comparable collection based on a corpus of Shakespeare, respelled Chaucer, or some other variant of English, with interesting results for the philologist. Similarly, the statistical data on closed-list modifications and their possible combinations would be extremely useful for anyone composing graded materials for teaching English as a foreign language. Comparable studies of French, German, or other less familiar languages might show that some of the pedagogical problems which have always been called idiomatic are in fact patterned. The patterning noted could then be incorporated into controllable drill materials exploiting the systems of both languages. More extensive and detailed studies of syntactic frequency, when they are made available, will be at least as useful for teaching grammar as the familiar word counts have been for the development of lexical fluency.
1.0 REVIEW OF PRELIMINARY PROCEDURES
As background for the framework to be tested, a short survey of the steps preliminary to the preparation of the classification system in use at the intermediate stage6 is necessary. The forms involved normally occur near the beginning of the noun phrase, are all of relatively high lexical frequency, are often a cause of difficulty to students of English as a foreign language, and seem to be members of a closed list; that is, these classes seem to be relatively free from borrowings in comparison with the class which includes cerise, antiphonal, ancient, etc. Many of the handbooks and grammars simply list them as aberrant forms or place them in similarly vague lists or categories in opposition to more regular adjectives. A core of these aberrant or leftover forms was abstracted from several handbooks and examined for any internal consistencies of any type that might be found. At this stage some forms were eliminated from immediate consideration or held for separate study at a later time. Some of these peripheral forms were the substantive possessives (mine, yours, etc.), own, enough, such, and a few others, such as the ordinal numbers. The condensed lists were then examined for any other similarities and differences, resulting in the tentative or intermediate classification framework presented in Figure 8.1. Up to this point the investigation had proceeded quite impressionistically, using the native-speaker investigator as informant. In terms of standard procedure, it was time to test the classification system against some external text as a check on native-speaker intuition. After the comparison of the text7 produced nothing that contradicted the previous work, a test on a larger scale involving other native speakers was begun. Using grid paper as a primitive generating device, sequences were produced of the items under consideration in all possible combinations such as what which, what whose, what one’s, what my, etc., throughout the chart.
Any sequences that had not appeared in the previous examination of text were submitted to other native speakers in the form of the question ‘Can you think of a sentence in which you might use /whose our/his five/other this fewest/ etc./?’. Some of the sequences considered possible were immediately discarded as segmentable into a substantival followed by a phrase which was not directly related to it. The informants, as might be expected, frequently did not agree in their ability to give examples or in their acceptance of examples given by other informants. In addition to differences in permissiveness in terms of rhetorical correctness (since some allowed five less men, while others refused it), the differences increased with the length of the sequences, allowing a maximum of three items in sequence, with the exception of sequences with other, which yielded a very hesitant maximum of four. The frame of sequences presented in the flowcharts shows the forms considered to be maximally allowable by even the most permissive informants. It will not be surprising, therefore, that many of the sequences leading to OK printout slots will not be used during the examination of extensive bodies of written text. It is a linguistic cliché that an informant does not always do what he says he does.
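The grid-paper step described above, producing the items under consideration in all possible combinations, can be sketched mechanically. The item list here is a small hypothetical sample, not the full inventory of Figure 8.1:

```python
from itertools import product

# A small hypothetical sample of the closed-list items; the actual
# inventory comes from the classification chart (Figure 8.1).
items = ["what", "which", "whose", "my", "this", "other", "five", "fewest"]

# Every ordered two-item sequence ("what which", "what whose", ...),
# excluding repetitions of the same item.
pairs = [" ".join(p) for p in product(items, repeat=2) if p[0] != p[1]]
```

Longer sequences follow by raising `repeat`, although, as noted above, informants rarely accept more than three items in sequence.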
It is important to call attention to the syntactic context and boundaries of the phrases which were considered. While the question of segmentation remains one of the most interesting and difficult in linguistic theory today, cuts must be made and made as consistently as possible, on whatever basis, if any analysis is to be done. The segment spans dealt with in this study are noun phrases, each containing one noun head and at least one of the items on the classification frame. In a study in greater depth it might be fruitful to consider nested phrases, sequences of prepositional phrases, etc., but for the purposes of this limited preliminary example nested phrases (including possessive nouns) are removed in preediting as if they were open-list modifiers, and the noun head is considered the end of the construction under consideration. Until preliminary segmentation can be accomplished reliably by mechanical techniques, the linguist will be forced to make such decisions on an empirical or intuitive basis, for he must have some beginning point, even an arbitrary one. Perhaps more detailed studies of order restrictions will form a basis, however circular, for more reliable automatic segmentation systems.
2.0 PREPARATION OF DATA FOR INPUT
The flowcharts illustrate a conventionalized format into which the linguist will organize the tentative patternings of the items he has observed, class by class, in preparation for the more technical work of the programmer, and in which he will specify the format of the output he expects. While the system proposed is said to be a very simple one to program, the detailed steps of an actual working program and coding system are not shown. It is assumed that the linguist interested in any similar project will not need or want to be his own programmer, but will rather work in cooperation with one who knows the resources of the facilities available to them. In addition to the preliminary data which he plans to test, the questions by which he will test his assumptions, and a suggested format for the output, the linguist is also responsible for the selection and preediting of the corpus to be used as input.
2.1 Figure 8.1 represents the tentative set of classes and subclasses under discussion. The labels are intended to be mnemonic or loosely descriptive and are only for the convenience of the linguist. The organization of the chart is binary in form, at least down to the ultimate subclasses which are usually unique forms. Some of the ultimate classes which could not be reduced readily into smaller subclasses include several members, all of which are assumed to have identical intraphrase distribution potential. For example, the personal pronouns of 1.1.2.1.2.2, the round numbers of 2.2.2.2.1.2.1, and the nonround numbers of 2.2.2.2.1.2.2 (see Note 8) are assumed to have the same intraphrase distributions respectively, while the nonparadigmatic a, an of 1.2 is considered one form with two spellings.
The binary class divisions are in part a tool of the investigation and in part a result of it. In setting up the tentative classes, a class frequently appeared to be stable and homogeneous at four or five members, but on further investigation proved to be divisible on the basis of intraphrase distribution of the members with other items under investigation. Further testing was continued until all the classes had been broken down as far as possible. It is quite probable that after statistical tests many of these distinctions can be ignored as having only marginal utility, but that others will remain valid. In another sense the binary distinctions made the formulation of the questions for the processing operation much easier and possibly more efficient than it would have been with fewer but larger classes.
2.2 For a system which is not equipped for completely automatic parsing, the problem of input can be solved by the use of a dual input of manually segmented and preedited sequences. While subsequent work may suggest more useful segmentation principles, it is reasonable to assume that by the intermediate stage a fairly valid set of principles for segmentation has emerged. The linguist then can isolate manually the sequences to be analyzed from the remainder of the text, which can be filed until it is needed in the revision stage. The sequences isolated, in the case of this problem only noun phrases, are then preedited manually. The preediting process removes everything from the sequence except the forms of the classes to be tested, substitutes for the noun the cover symbol N, and if the noun was marked for plurality adds the symbol S. The cover symbols are of course subject to the discretion of the programmer, who may wish to use some other coding convention. In the preediting process any spelled numbers will be coded with only a coding convention to distinguish round and nonround (see Note 8). Arabic and Roman numerals, although something of a problem, are best coded as nonround. It is always the first element of a long number which determines its roundness or nonroundness. The preedited sequence is the working input of the system, while the unedited sequence, the other member of the dual input pair, is stored for the final printout. Each pair of sequences is assigned a code number to correspond to its location in the unsegmented corpus in order to facilitate any later reference to the corpus that might be needed. An example of one set of a dual input sequence as prepared by the preeditor before coding would be:
205721  set all the seventy-three fat volumes on
        all the nonround NS
The working sequence in the pair is all the nonround NS, while the final printout sequence is set all the seventy-three fat volumes on.
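A minimal sketch of the preediting step follows. The word lists here are invented stand-ins; the real class membership is that of the classification frame in Figure 8.1:

```python
# Hypothetical stand-ins for the closed-list classes and the round and
# nonround number subclasses; actual membership comes from Figure 8.1.
CLOSED_LIST = {"all", "the", "a", "an", "this", "my", "other"}
NONROUND = {"seventy-three", "five", "two", "ten"}
ROUND = {"hundred", "thousand", "dozen"}

def preedit(tokens, plural):
    """Keep closed-list forms, code numbers as round/nonround, drop
    open-list modifiers, and close with the cover symbol N (or NS)."""
    coded = []
    for t in tokens:
        if t in CLOSED_LIST:
            coded.append(t)
        elif t in NONROUND:
            coded.append("nonround")
        elif t in ROUND:
            coded.append("round")
        # open-list modifiers such as 'fat' are simply removed
    coded.append("NS" if plural else "N")
    return " ".join(coded)

working = preedit(["all", "the", "seventy-three", "fat"], plural=True)
# working is the coded sequence "all the nonround NS"
```

In the actual system this step is done by the human preeditor, not by machine; the sketch only makes the coding convention explicit.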
2.3 The formulation and ordering of the questions for the processing operation is the linguist’s chief responsibility in designing the testing system, and it is here that the most tedious work is done, and with the greatest possibility of error and of discovery. The principle of the program is a very simple one, based on a series of questions which can be answered affirmatively or negatively. The problem lies in the fact that all the questions must be answerable from data the linguist has already supplied. The questions, in order to be reasonably successful, will start with the largest classes and work downward through the appropriate subclasses to the individual items. If there is a combination of items, the questions will start with the combinations which are shorter and more likely according to the native speaker judgments collected earlier, and will then work to those which are longer and were considered less likely.9 While theoretically the possibilities of combinations would seem to be extremely large and unwieldy, in actuality they are quite limited and highly restricted in order of combination. Even with an extremely permissive set of allowable sequences there will inevitably be a few combinations which were not provided for in the set of paths leading to OK outlet slots. In order to recover such examples, the RE slots were provided and numbered individually to facilitate revision of the system when it seems desirable to do so.
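The principle of the question cascade might be sketched as follows. The patterns and slot labels are invented stand-ins for the paths of the actual flowchart, ordered from the shorter and likelier sequences to the longer and less likely ones:

```python
# Invented stand-ins for the flowchart paths; each question asks whether
# the working sequence matches a provided-for pattern.
PATTERNS = [
    ("the N", "OK-1"),
    ("the NS", "OK-2"),
    ("all the N", "OK-3"),
    ("all the NS", "OK-4"),
    ("all the nonround NS", "OK-5"),
]

def classify(working):
    for pattern, slot in PATTERNS:
        if working == pattern:      # 'yes' branch: sequence provided for
            return slot
    return "RE-1"                   # unprovided-for; listed for revision
```

In the real flowchart the questions branch through classes and subclasses rather than matching whole patterns, and many individually numbered RE exits are provided, but the yes/no principle is the same.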
2.4 The output of the processing operation is fundamentally a sorting and listing, arranged by number, of examples from text of the sequences provided for by the tentative description (the OK slots) and those not provided for (the RE slots). The OK label, therefore, does not in any way mean that the sequence is grammatical or even meaningful. Similarly the RE label does not necessarily indicate that the sequence is ungrammatical or nonsensical, but only that it is for some reason not accounted for by the system. In practical terms the information from the RE slots will be more valuable for the linguist in his evaluation of the description being tested than that obtained from the OK slots. For this reason a great many more RE exits are provided than most systems designers would consider of maximum mechanical efficiency.
From the two basic types of output the programmer can arrange specialized listings for various purposes. If the information seems useful, the printout can tabulate the relative frequency of various types of constructions from the number of times the particular exits are used. On the other hand, a listing of unused OK slots would be very useful information, available with very little effort. Such refinements of the printout are limited only by the data available and the ingenuity of the investigators. While major syntax is of necessity not considered in a limited intermediate stage project of this nature, it is possible for the linguist to use the revised framework in the preliminaries to such an investigation. By just sorting the input manually during the preediting process into the major syntactical categories he wishes to investigate, he can get some rough information about the patterning of the items and sequences in larger spans than the intraphrase segments. For example, one run of input might be presorted into items functioning as the subject of a finite verb, object of an infinitive phrase, direct object of negative verb, etc., and the frequency of the items in the separate runs compared with their overall frequency.
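The two specialized listings just mentioned, exit frequencies and unused OK slots, amount to simple tabulations. The slot names and run results below are hypothetical:

```python
from collections import Counter

# Hypothetical run of exit slots produced by the processing operation.
results = ["OK-1", "OK-1", "OK-5", "RE-1", "OK-1"]
all_ok_slots = {"OK-1", "OK-2", "OK-3", "OK-4", "OK-5"}

# Relative frequency of the exits actually used.
freq = Counter(results)

# OK slots provided for by the description but never used in this run.
unused = sorted(all_ok_slots - set(results))
```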
3.0 ADVANTAGES OF MACHINE METHOD
In view of the amount of time and effort that the linguist must spend in setting up the system and in preparing the input, it might seem that the human operator could perform all the operations more rapidly and more economically than the machine. Up to a certain point this assumption is quite correct. In handling extremely small bodies of data in preliminary stages the tedious process of manual counting is much more efficient. It also provides the linguist with the opportunity to revise his system while he is working from it, as he cannot so readily do while the machine is operating. On the other hand, the blind obedience of the computer follows all instructions exactly as they are given with a consistency impossible for a normal human being. This very consistency reveals errors which the linguist or his assistant working manually might unconsciously overlook or repair by means of an unnoted change in the system, which might in turn be forgotten before it could be applied to the entire corpus.
In one sense the use of the machine may seem to eliminate some of the unexpected discoveries which make linguistic analysis interesting. In fact, however, the discoveries are simply shifted from the intermediate or raw counting stage of the process to the preliminary or planning stage or to the final revision. The process of preediting the corpus can also expose unexpected problems, which must be solved before the sequence can be coded as input. In the system described, the linguist must decide whether a sequence such as (Did they want) a little cheese(?) is to be edited into a N or a little N, or whether he wants to include both possibilities. The decisions must be made consistently, if arbitrarily, and with some control on the criteria by which they are made.
4.0 RESULTS AND USES OF SYSTEM
As was mentioned earlier, the by-products of an investigation can be as interesting and useful as the information which it was planned to uncover. In addition to collecting sequences of forms which were not accounted for in the preliminary framework, the output of the full, unedited sequences gives starting points for further investigation. For example, handbooks have often referred to a particular type of noun, or perhaps noun construction, as a mass-noun or uncountable noun, as illustrated by milk, information, or justice. Certain other nouns, such as fish and species, are what have been called nouns with unmarked plurals. By programming the output slots of the constructions which are expected to produce such nouns into a separate listing, or simply by noting the output numbers of such slots and checking the printouts manually, it is possible to collect extensive lists of nouns with the properties desired. The unedited printout may also be very useful in spotting the distribution of forms not immediately under consideration, but closely related to them. Some of the peripheral forms which were discarded at the beginning of the investigation, such as certain, (a) lot, lots, very, same, only, etc., reappear for reconsideration in the unedited printout. If a similar testing device were set up for verb phrases,10 information might be made available through the unedited printout about various types of intraphrase adverbial modifiers and their positions in relation to the specified phrase types.
In a very limited sense it is possible to use the processing operation section of the flowchart to generate sequences manually. It would, however, be best to do this after sufficient material had been tested to determine exactly which slots are actually used and which are not. The simulated generation is done by beginning at an OK slot (or a RE slot which has been found acceptable and suitably revised into an OK slot) and tracing backward through its path to the noun, which is then supplied by the investigator, subject to any plurality restrictions applicable at that point. Not surprisingly, however, not all such sequences are equally acceptable to the native-speaker judge. For example, the same slot can produce his every rabbit and her every thought. Thus any generation should be preceded by a collection of nouns or constructions which are semantically as well as formally allowable in that slot. Such a pool can easily be collected for each outlet actually in use, but the result would be possibilities for generation so limited that it would in effect give out as totally acceptable only what had been previously put in from the corpus. Even without a pool of congruent nouns, however, the generation and testing of semantic acceptability of modifier sequences with random nouns might produce interesting lexical information.
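The simulated generation described above, tracing backward from an outlet slot and supplying a noun subject to the applicable plurality restriction, could be sketched this way, with an invented slot table:

```python
# Invented stand-in for the paths leading to two OK slots: each entry
# records the modifier sequence on the path and whether the noun head
# must be marked for plurality.
SLOT_PATHS = {
    "OK-1": ("the", False),
    "OK-4": ("all the", True),
}

def generate(slot, singular_noun, plural_noun):
    """Trace backward from the slot and supply the noun, subject to the
    plurality restriction recorded for that path."""
    modifiers, needs_plural = SLOT_PATHS[slot]
    noun = plural_noun if needs_plural else singular_noun
    return f"{modifiers} {noun}"
```

As the text notes, formal allowability does not guarantee semantic acceptability; a pool of congruent nouns per slot would be needed to filter results such as his every rabbit.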
The most important product of the investigation remains the collection and ordering of statistics and examples, from a corpus, of patterns observed and projected by the linguist from his empirical work. If the preliminary investigation produced a framework which was only slightly weak, the intermediate-stage investigation will show where the defects are. If it was extremely weak, the computer’s work will be virtually useless. If it was adequate, the accumulated data will point to areas for further investigation along more sophisticated lines, using as reference points the items, classes, and arrangement patterns already isolated. In any case the linguist is responsible for the results and the final utility of the description and for its implications.
NOTES
1. The framework presented in Figure 8.1 was drawn up in rough outline in connection with an informal workshop on some problems in English syntax during the Linguistic Institute of 1960 in Austin, Texas. It was revised extensively in 1963-1964. The purpose of the workshop was to subdivide some of the problems, such as the analysis of verb phrases, noun phrases, pronouns and pronominals, into small units which might be studied formally, distributionally, and semantically, first within the subsections themselves and then ultimately within larger syntactic frames of reference. Figure 8.1 represents a tentative subsectioning of some of the modifications within the noun phrase. The terms Primary and Secondary as used refer only to the first two tentative divisions of the subsectioning. The names of the divisions of Figure 8.1 are merely for convenience, and may be replaced if other terms seem more suitable.
2. All the forms classed as Primaries on Figure 8.1 normally have minor stress (/ˋ/ or /˘/), while those classed as Secondaries typically have /ˆ/. On the basis of the stress distinction, a second nonparadigmatic subclass, consisting of minor-stressed some and any, was drawn up, which contrasted with the Secondary pair of the same spelling. The division, incorporated into drill material, has been pedagogically useful for students of English as a foreign language, especially those of Romance backgrounds.
3. An ideal corpus for work of this nature will be that in preparation at Brown University by W. Nelson Francis. It is to consist of 1,000,000 words from randomly chosen sources.
4. It would be possible to use a skipping device in the scanning process which would simply ignore all items but those on the chart. A great deal of waste would result from the process, for the device would also act on listed items found in nested phrases and other modifiers which the preeditor would have rejected.
5. A computer system is by no means infallible. It is subject to both mechanical breakdown and human errors resulting from faulty programming or coding. The explicit nature of mechanical operations, however, usually makes the errors easier to locate and correct than human errors in a manual operation.
6. The term ‘intermediate stage’ as used throughout the paper refers to the second of three stages of a procedure followed by many linguists in their investigations. The first, or preliminary, stage consists of the gathering of a body of data and formation of tentative hypotheses, classifications, etc., from it. The second, or intermediate, stage consists of the testing of the hypotheses, classifications, etc., against a larger corpus. The third, or revision, stage evaluates the failures of the tentative description to cover the data from the intermediate stage corpus and suggests refinements and changes in the description and directions for further work. In actual practice, of course, the intermediate and revision stages are frequently repeated, using the most recent revisions as the basis for the next intermediate stage investigation.
7. The first corpus was C. E. Ayres, ‘The Industrial Way of Life,’ Texas Quarterly II.2, 1-19 (1959). As a more informal balance, the left-most column of the front page of the Dallas Morning News was examined every day for two weeks.
8. The subsectioning of the cardinal numbers in the preliminary stage of the investigation is an example of distributional subsectioning of what at first appeared to be a homogeneous class. The numbers classed as nonround on Figure 8.1 may appear without a preceding Primary or Secondary in a noun phrase. Those classed as round, however, must be preceded by some such modification. The semantic relation to the decimal counting system is not complete, for ten (and its compounds, such as ten thousand, ten million, etc.) functions as a nonround number, while dozen (and marginally gross) functions as a round number.
9. The question of whether the scanning should be done from left to right or right to left brings up some practical and theoretical points which led to interesting discussions during the summer. For this system, the right-to-left scan was adopted because an earlier trial on the same material using a left-to-right scan had proved more cumbersome. The question that usually arises in connection with scanning is whether the computer does (or should) actually simulate the human process of perception, which is, at least in figurative terms, of the chronologically-based left-to-right type. Aside from the question of whether the computer simulates human speech perception to any degree or not, some linguists would not agree that at the level of the phrase language is necessarily understood in any chronological order. Thus, for such linguists, the whole question is purely rhetorical.
10. A beginning point for such a study is already available in W. F. Twaddell’s The English Verb Auxiliaries, 2nd ed. (Providence, R.I., 1963).
Figure 8.1
Figure 8.2 Summary Flowchart
Figure 8.3 Suggested Detail of a RE Terminal Routine
Figure 8.4 Suggested Detail of an OK Terminal Routine
Figure 8.5 Detail of Processing Operation
Figure 8.6
Figure 8.7
Figure 8.8
Figure 8.9
Figure 8.10
Figure 8.11