“Computation in Linguistics: A Case Book”
An Automatic Retrieval Program for the Linguistic Atlas of the United States and Canada
1.0 THE NEED FOR A RETRIEVAL SYSTEM
1.1 Accessibility of Atlas data. It is indisputable that the collections of The Linguistic Atlas of the United States and Canada, valuable and well organized as they are, present the researcher with many difficulties. To begin with, the files of each regional Atlas are usually located in only one place. The investigator must either work at that place or have the field records copied and sent to him. Secondly, the Atlas questionnaire is filed by pages; that is, all the page sevens of all the interviews in one state are filed together, as are all the page 22s, and so on. This is as efficient a system of its kind as one can develop (more effective, in fact, than the British system), but the fact remains that a great deal of page turning must accompany any extensive research in which the Atlas files are significant.
1.2 Problems of collecting Atlas data. Not only are the Atlas data difficult to get at, but they are also difficult to collect. Thus far the Director of the Atlas has insisted, and rightly so, that only highly skilled field workers may gather the data. The informant’s every response is recorded in a narrow phonetic transcription by a person thoroughly trained to hear perceptively and to record accurately. There is a general lack not only of qualified field workers but also of financial support for research and publication. Furthermore, the length and thoroughness of the questionnaire and the size of the country have necessarily limited the quantity of field records in the Atlas files.
1.3 The focus of Atlas research. Up to this time, moreover, the attention of linguistic geographers in the United States has focused on the relationship of dialect variables to education or culture level and to geography. Critics of American linguistic geography have strongly suggested that more emphasis should be placed on other bases of group identification such as occupation, age, sex, church affiliation, political party, and so on.1 Although dialectologists have not been blind to the possible effects of these forces on current dialects, they have not been successful in utilizing the vast body of information that has been gathered to answer such questions. In short, with a more efficient tool for getting at the data that dialectologists have collected and are still collecting, they could say a great deal more about the diversity of sociolinguistic features which characterize American speech.
1.4 The scope of this paper. With these problems in mind, I looked to electronic data processing for whatever solutions it might offer. Then I set to work on what is essentially a punched-card problem to be done on relatively simple tab equipment. The accessibility of the data has been foremost in my mind in this study, although it has also suggested some valuable insights on the collection of data.*
2.0 THE COMPARTMENTALIZED ATLAS RETRIEVAL SYSTEM
2.1 Complexity of data. It is obvious that a retrieval system able to handle a wide variety of possible responses to a 750-item questionnaire would be a complex and impressive tool. I have felt that the place to begin such a study was not the whole questionnaire. The complexity of symbolizing pronunciations for electronic data processing is, in itself, enormous. Add the problems posed by vocabulary and morphology-syntax variations and the formulation becomes even more imposing. If the entire Atlas evidence is not to be handled in a single retrieval program, compartmentalization into separate programs of pronunciation, vocabulary, and morphology-syntax seems to be the most obvious cut to make.
2.2 Previous Atlas data processing. Such compartmentalization seems particularly relevant in that electronic handling of Atlas vocabulary data has been begun by the late E. Bagby Atwood and by Gordon R. Wood.2 Since William Card and Virginia McDavid3 have outlined a similar program involving verb forms, it seemed wise to limit the present research to the programming of the remaining aspects of grammar, exclusive of verbs. The problems involved in programming Atlas pronunciation data will not be dealt with in this paper but must, of course, be faced in the near future.
2.3 The usefulness of this research. The nonverb grammatical items are a natural beginning point for accurate, contemporary statements about usage. The immediate usefulness of such an automated program would be appreciated by several organizations such as The National Council of Teachers of English and the American Dialect Society, both of which are vitally interested in American English usage.
3.0 SELECTING DATA FOR THE PROGRAM
3.1 Lexical vs. grammatical vs. phonological. Some Atlas responses appear to be useful both as vocabulary and as grammatical items. For example, the response sick (to/at/on/in/of) your stomach is at the same time lexical and grammatical. Likewise, all information gathered in the Atlas files has phonological significance. Since this program treats grammatical items only, it became my task to deal only with the grammatical status of selected responses. What is done with the phonological evidence is the focus of another problem. At a later time, an analyst will have to decide whether some or all of the grammatical items treated in this program are to be included in a phonological analysis.
To avoid possible overlap between lexical and grammatical items, I carefully chose items which were clearly structural rather than lexical. The following types of items were included in this study: prepositions, matters of agreement, noun plurals, pronouns, adjective suffixes, adverb suffixes, conjunctions, and articles. These categories were selected because the field records yield consistent information in these areas. Though the field workers recorded much other interesting grammatical information not asked for in the questionnaire, it is too sporadic to be susceptible to programming.
3.2 Background data. It was relatively easy to determine which background data about informants to include in this study. For the sake of scientific accuracy, nothing that appears systematically and that can be readily coded should be excluded. It is of great importance to avoid throwing away data of any kind. The Atlas field records include background material about the informants’ State, community, education, age, sex, ethnic association, occupation, and family history, as well as the field worker’s name and the year in which the field work was done. All of this information was coded on the data card for each informant.
3.3 Grammatical data. Fortunately, there is a limited range of response possibilities for nonverb grammatical items, a fact which makes the task of coding in this program relatively simple. The number of recurring variants for most items was quite limited. Occasionally an unusual response appeared. It was necessary, in this investigation, to determine whether to include all answers that have ever been given or to lump this small number of unusual responses into a category called ‘other’. Involved in this question are the definition of idiosyncrasy, an evaluation of the accuracy of evidence which includes only the most common answers, a definition of economy, and an understanding of the limited storage capacity of the equipment. In the end I grouped forms into categories of variants rather than into separate items, partly on the basis of known occurrence and inclusion in the Atlas field records, partly on the word-of-mouth agreement among field workers, and partly on intuition. This problem did not turn out to be very extensive in the study of nonverb grammatical items. Originality or idiosyncrasy of structure is undoubtedly more restricted than, say, that of lexicon or pronunciation. Consequently, the categories of variants were determined here with relative ease.
3.4 Items in this study. In all, 78 nonverb grammatical items were selected from The Linguistic Atlas of New England as a basis for this study. Since the New England field workers used a longer questionnaire than the one used in most other parts of the country, there will not be complete congruence of items in various Atlas collections. My decision to include items that are in the New England Atlas, whether or not they appear in other Atlas questionnaires, was made for two reasons. First, I wanted to retain as much data as possible. Second, it is probable that these (and other) data will be included in future collections. The number of items (not variants) included in this study is broken down as follows:
prepositions | 18 |
agreement items | 9 |
noun inflections | 16 |
pronouns | 22 |
adjective suffixes | 2 |
adverb suffixes | 5 |
conjunctions | 5 |
article | 1 |
total | 78 |
4.0 CLASSIFYING THE DATA FOR THE PROGRAM
4.1 Standardization. One of the advantages of a program for the automatic retrieval of nonverb grammatical items is its usefulness both to other investigators of the same data and, as we shall see, to researchers in other compartments of Atlas data such as pronunciation, verb forms, and vocabulary. It is very important, if the programs of various compartments are to be interchangeable, that we agree upon a standard format of data input. For this reason, I wrote a code book which is, in effect, a detailed table of contents for the arrangement of all the various kinds of data included in the nonverb grammatical program. Further standardization is being established for background data. For example, there should be agreement among all Atlas programs on the numbering system for States, communities, field worker identification, occupations, and so forth.
4.2 Atlas classification systems. In setting up the code book for nonverb grammatical items, I tried, as much as possible, to follow the order of previously published retrieval systems in American dialectology. E. Bagby Atwood, in his The Regional Vocabulary of Texas, set a precedent for future machine handling of Atlas materials. I have followed his ordering in the first few columns of the code book but have supplemented it with additional background material in keeping with my belief that no useful and codable information should be excluded.4 In the end it will be necessary to involve a larger body in the decisions, including particularly the directors of the various regional Atlases.
4.3 Classifying data in this program. Punched-card columns 1 through 35 in the code book deal with background information for each field as follows:
columns | 1-2 | State |
3-5 | County | |
6 | Type | |
7-8 | Informant by type | |
9-10 | Age | |
11 | Sex | |
12-13 | Ethnic | |
14-15 | Field worker | |
16-17 | Year of field work | |
18-19 | Occupation | |
20-21 | Father’s birthplace | |
22-23 | Mother’s birthplace | |
24-25 | Paternal grandfather’s birthplace | |
26-27 | Paternal grandmother’s birthplace | |
28-29 | Maternal grandfather’s birthplace | |
30-31 | Maternal grandmother’s birthplace | |
32-35 | Item identification |
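The fixed-column layout above lends itself to straightforward field extraction. The following is a minimal modern sketch (in Python, not the original 709-1401 program) of how the first few background fields might be sliced out of an 80-column card image; the card image and the dictionary field names are my own illustrative assumptions. Column numbers are 1-based in the code book, so the slices are offset by one.

```python
# Hypothetical sketch: slice fixed background fields from an
# 80-column card image. Column ranges follow the code book above;
# the field names are mine, not the program's.

BACKGROUND_FIELDS = {
    "state": (1, 2),
    "county": (3, 5),
    "type": (6, 6),
    "informant_by_type": (7, 8),
    "age": (9, 10),
    "sex": (11, 11),
}

def read_fields(card, layout):
    """Return each named field as the substring of the card image."""
    return {name: card[start - 1:end] for name, (start, end) in layout.items()}

# Invented card image: state 01, county 003, type 2, informant 01,
# age 67, sex 1, with the remaining columns padded out to 80.
card = "01003201671" + "0" * 69
assert read_fields(card, BACKGROUND_FIELDS)["age"] == "67"
```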
Beyond the matter of background information, there is no precedent for ordering the coded responses. Arbitrarily I separated the items by natural grammatical types. That is, prepositions are all in one place, items of agreement are grouped together, and so forth. The researcher who wishes to add other items to his study can easily put them at the end, for there are available programs that will bring the new item back to its category-mates at the printout stage.
One last matter concerning the classification of data involves card identification. Columns 73-80 of each card are used for identification in the following manner:
columns | 73-74 | program identification |
75-76 | State identification | |
77-79 | informant identification | |
80 | card number |
Thus, if columns 73-80 read 30010012, we may understand that in the Atlas program for nonverb grammatical items (30), we have a card for Maine (State 01), informant number one (001), card number two (the last number). The compartmentalized approach to the analysis of Atlas data will require a standardized numbering system of States and informants as well as an agreed-upon symbol of program identification. The number 30 was arbitrarily chosen for the program here, pending such standardization.
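The decoding of the identification field can be sketched the same way. This is a hypothetical illustration of the paper's own example 30010012, not the program's actual logic; the key names are mine.

```python
# Sketch: decode columns 73-80 of a card into its identification
# parts, following the example 30010012 in the text.

def decode_id(card):
    ident = card[72:80]           # columns 73-80, 1-based
    return {
        "program": ident[0:2],    # 30 = nonverb grammatical items
        "state": ident[2:4],      # 01 = Maine
        "informant": ident[4:7],  # 001 = informant number one
        "card_number": ident[7],  # 2 = card two
    }

card = " " * 72 + "30010012"
assert decode_id(card) == {
    "program": "30", "state": "01", "informant": "001", "card_number": "2"
}
```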
We have identified the columns reserved for background information of the informants and the columns reserved for card identification. The following is a description of the remainder of the classification by columns:
card 1, columns | 36-66 | prepositions |
67-72 | blank | |
card 2, columns | 1-7 | agreement items |
8-22 | noun inflections | |
23-42 | pronouns | |
43-44 | adjective suffixes | |
45-49 | adverb suffixes | |
50-57 | conjunctions | |
58 | article | |
59-72 | blank |
5.0 RESOLVING PROBLEMS OF CODING
5.1 Coding the data. The essential problem in utilizing electronic language data processing is the matter of coding the data. This, in turn, is related to the nature of the equipment to be used, the criterion of simplicity, the code’s usefulness to other people who work with the same data, and the convertibility of the raw data to punched-card input.
The linguist will know best what constitutes simplicity and how useful this program may be to other linguists. The computer specialists will know best which equipment and/or program suits the specific linguistic problem most adequately. One of my primary goals in this project was to obtain frequency distributions. After consultation with professional programmers at the Indiana University Research Computing Center, I decided to use the Center’s IBM 709-1401 Program for 101 Simulation. Its primary function is to tabulate the number of units falling into various categories according to classification variables.
The unit record for the simulator is one to eight IBM cards. The information is placed on magnetic tape via the 1401 computer and is then processed on the IBM 709.
5.2 Special problems. Getting the raw data coded involves special problems which will not be our major concern in this paper. Such a special problem is illustrated by the field work in Maine for The Linguistic Atlas of New England, the source of the data in this study, which was done by only one field worker, the late Guy Lowman. Other field workers in other States frequently recorded responses that were repeated at the field worker’s request, responses that were heard-of but not used by the informant, responses of auxiliary informants, suggested responses, responses considered archaic by the informant, and corrected responses. Very few facts of this type were recorded by Lowman in Maine, and in this study I arbitrarily excluded them. This does not mean that I consider such responses irrelevant or unimportant. I was simply not interested in complicating the coding system at this time by making provisions for these possibilities. It appears, in fact, that it might be easier to run two separate programs in such a study: one for the type of responses which concern us here and another for such modifications as repeated, suggested, or auxiliary responses. The two kinds of information could then be brought together for whatever significance they might show in juxtaposition.
5.3 Problems of identification. There are several problems of coding which must be mentioned. Identification of grammatical items and variants is one such problem. The 709-1401 Program for 101 Simulation permits identification of each item in eight alphanumeric columns and variants in six. It is impractical to try to identify the item by the field worker’s question. For example, the investigator might ask the informant to fill in the missing part of this sentence: When trouble hits, it comes ——. One of the sets of possible answers includes the response all at once and its variations such as all to once. But the informant might also offer an alternate type of answer to this clue. He might respond with in bunches or quickly. The clearest way to identify the item, therefore, is not by the field worker’s question but by the answer set, as long as the set is comprehensible under that identification. Therefore I have used ALLXONCE as the item identification for that set. Since the program will not permit dashes, X has been used to indicate the position in which the variant appears.
Identification of variants follows much the same procedure except that the program allows for only six alphanumeric characters for this purpose. There is no particular problem with readability until we are faced with the need for compressing three variants in one six-character slot. For example, the frame I wonder what he died — has 16 possible responses:
1. of
2. with
3. from
4. for
5. of, with
6. of, from
7. of, for
8. with, from
9. with, for
10. from, for
11. of, with, from
12. of, with, for
13. of, from, for
14. with, from, for
15. of, with, from, for
16. other
The coding of the responses with one and sometimes two variants is relatively clear and readable. But when the entire set is coded, the printout looks like this:
1. OF
2. WITH
3. FROM
4. FOR
5. OFWITH
6. OFFROM
7. OFFOR
8. WTHFRM
9. WTHFOR
10. FRMFOR
11. OFWHFM
12. OFWHFR
13. OFFMFR
14. WHFMFR
15. ALL
16. OTHER
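Because the six-character codes above were assigned by hand, the natural modern analogue is a lookup table keyed on the set of recorded responses. The following Python sketch is my own illustration (not the original program); the codes are copied from the printed list, and only a few entries are shown.

```python
# Hypothetical sketch: map a recorded response set for
# 'I wonder what he died --' to its six-character printout code.
# Codes are taken from the list above; the table is abbreviated.

DIED_CODES = {
    frozenset(["of"]): "OF",
    frozenset(["with"]): "WITH",
    frozenset(["of", "with"]): "OFWITH",
    frozenset(["with", "from"]): "WTHFRM",
    frozenset(["of", "with", "from"]): "OFWHFM",
    frozenset(["of", "with", "from", "for"]): "ALL",
}

def code_for(responses):
    """Look up the printout code; unlisted sets fall into OTHER."""
    return DIED_CODES.get(frozenset(responses), "OTHER")

assert code_for(["with", "of"]) == "OFWITH"  # order of recording is irrelevant
assert code_for(["quickly"]) == "OTHER"
```

Using a frozenset as the key makes the coding insensitive to the order in which the field worker happened to record the variants.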
5.4 The problem of complexity. The problem of identification leads to the second coding problem to be mentioned here, that of complexity. It is clear that if there are five possible variant single responses, there will be a total of 32 responses including single responses, combinations, and a slot for other responses. The coding of as many as four possible responses in a six-column space is possible but the outcome will hardly be readable. For example, the item elicited from the frame I won’t go — he does might be coded as follows:
response | code |
1. unless | UNLESS |
2. without | WTHOUT |
3. ‘lessen | LESSEN |
4. ‘thout(n) | THOUTN |
5. ‘douten | DOUTEN |
6. unless, without | UNLWHT |
7. unless, lessen | UNLLSN |
8. unless, thout(n) | UNLTHN |
9. unless, douten | UNLDTN |
10. without, lessen | WHTLSN |
11. without, thout(n) | WHTTHN |
12. without, douten | WHTDTN |
13. lessen, thout(n) | LSNTHN |
*** | |
18. unless, lessen, thout(n) | USLNTN |
*** | |
25. unless, without, lessen, thout(n) | UWOLTN |
***
32. other | OTHER |
To avoid repetition, the sequences indicated by the asterisks have been skipped.
The problem of readability is obvious. An item with four or fewer variants is somewhat easier to handle. For this reason, I have tried to limit each range of variants to four related variants. This sometimes results in an artificiality, as the following example will reveal, but it makes for a more readable printout.
The frame Quarter — four contains six significant known responses: of, to, till, before, until, and unto. If these six variants were to be listed with all combinatorial possibilities, the result would be a very complex and unmanageable code. In order to preserve my format and limit the coding to 16 possibilities, I broke the item into two parts as follows:
quarter — four
1. of
2. to
3. till
4. before
5. of, to
6. of, till
7. of, before
8. to, till
9. to, before
10. till, before
11. of, to, till
12. of, to, before
13. of, till, before
14. to, till, before
15. all four responses
16. other responses
quarter un — four
1. until
2. unto
3. until, unto
4. other responses
Economically, it would be better to split the six-response item into two three-item units, but this conflicts with the logic of the data. Hence the split between responses containing the un-prefix and other responses was chosen. The real danger of this technique is not how the item is split but that it is split at all. Special instruction must be given the analyst to avoid coding quarter unto four both as other responses (16) to item quarter — four and as unto (2) to item quarter un — four.
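The arithmetic behind these splitting decisions can be stated compactly. For n single variants there are 2**n - 1 non-empty combinations, plus one ‘other’ slot, for 2**n categories in all; this is why five variants yield 32 categories and why six would have been unmanageable. A one-line sketch (mine, for illustration) confirms the counts used in the text:

```python
# Sketch of the category arithmetic: n single variants give all
# non-empty subsets (2**n - 1) plus one 'other' slot = 2**n.

def categories(n_variants):
    """Total printout categories for n single-response variants."""
    return 2 ** n_variants

assert categories(5) == 32  # the unless/without item
assert categories(4) == 16  # quarter -- four, after the split
assert categories(2) == 4   # quarter un-- four
assert categories(6) == 64  # why the unsplit six-variant item was rejected
```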
6.0 DEFINING THE FIELDS
6.1 The Define Field Statement. All of the variant responses in a particular answer set constitute an individual field, a set of one to three contiguous columns of the punched card which, for the sake of clarity, may be given a name. This name will appear in the printout as the label for the classification variables in each distribution. After my data were coded and my cards keypunched, I proceeded to give names to my fields. This naming procedure, which also provides specifications for locating the cards of the data to be processed, is called a Define Field Statement. 5 A sample Define Field Statement follows:
That is, the field is the set of columns used for the response frame This weighs two —. This is column 91 (that is, it appears in column 11 of card number two). It has three possible responses (0-2) which are pound, pounds, and both pound and pounds.
6.2 The group mode. In cases where it seemed useful to combine responses within a given frame, I used the ‘group mode’. This makes it possible, in the Define Field Statement, to combine punches within a given field. As the following example demonstrates, instead of the exact age of my informants I can specify some grouping of age, in this case, decade grouping:
That is, the field of the informant’s age appears in columns 9 and 10. Although the exact ages of the informants were processed initially, I am only interested in retrieving decade groupings. Thus, I let the computer do the grouping for me. This way I still have the exact ages in the data in case I should desire to use them. To get at them, all I have to do is write a different Define Field Statement which specifies exact ages.
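The decade grouping that the group mode performs amounts to a simple mapping from an exact two-digit age to a decade category; the exact ages remain in the data and only the tabulation is coarsened. A minimal sketch, with invented function and label names:

```python
# Sketch of 'group mode' decade grouping: exact ages stay in the
# data; the distribution is tabulated over decade categories.

def decade(age):
    """Map an exact age to its decade label, e.g. 67 -> '60-69'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

assert decade(67) == "60-69"
assert decade(70) == "70-79"
```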
The Define Field Statement need not utilize all the input data. For example, in the program run on Maine nonverb grammatical items, there was no need to define the field of ‘field worker’ even though it appears in the data in columns 14 and 15. It was not necessary, in this particular program, because one man did all the field work in that State. If the program were to involve a comparison of several States (thus probably comparing the work of several field workers also), it would be necessary to define that field.
In all, 81 fields, including informant background data, were defined in this program. The responses of each informant were coded from the Atlas files to the list manuscript (which was simply a matrix of columns of responses and rows of informants).
7.0 DISTRIBUTING THE DATA
After the data were coded and the fields defined, it was necessary to decide which particular relationships I wanted to display. As mentioned earlier, one of the goals of this research was to get at significant sociolinguistic relationships which had been hitherto ignored or which had been discovered only through hours of laborious hand-sorting. The computer classifies and counts according to the combinations of categories of several fields simultaneously. In order to produce such frequency distributions, it is necessary to write a Distribute Statement as part of the program. This statement specifies a matrix in which the categories of one field are printed across the top of the page and those of another down the left side of the page. The counts of the responses in each category of the cross tabulation are listed. The Distribute Statement gives the number and dimension of the frequency distributions and ultimately produces a count of their variants. Some sample Distribute Statements follow:
DISTRIBUTE (HALFX2*SEX)
DISTRIBUTE (2POUNDX*AGE)
DISTRIBUTE (TROUGHX*OCCUPAT)
DISTRIBUTE (9FXHIGH*TYPE)
That is, I was able to determine quickly and accurately the relationship of an informant’s sex to his use of the variant responses to half — two (past, after, etc.). I was also able to see what effect age has on the use of the inflectional form of pound, what part occupation plays in the distribution of inflected and uninflected trough and what type of person (cultivated, semicultivated, or uncultivated) is more apt to inflect foot in response to the frame, nine — tall.
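What a Distribute Statement computes is, in modern terms, a cross tabulation: a count of informants in each cell of a (row category, column category) matrix. The sketch below is my own illustration with invented informant records, not output from the 709-1401 program.

```python
# Sketch of a Distribute Statement as a cross tabulation:
# count records in each (row category, column category) cell.
# The informant records are invented for illustration.

from collections import Counter

informants = [
    {"sex": "M", "halfx2": "past"},
    {"sex": "F", "halfx2": "after"},
    {"sex": "M", "halfx2": "past"},
    {"sex": "F", "halfx2": "past"},
]

def distribute(records, row_field, col_field):
    """Count records falling into each cell of the cross tabulation."""
    return Counter((r[row_field], r[col_field]) for r in records)

table = distribute(informants, "halfx2", "sex")
assert table[("past", "M")] == 2
assert table[("after", "F")] == 1
```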
8.0 RESULTS OF THE PROGRAM
8.1 Distributions. This program, of course, is only suggestive of what can be done with the Atlas materials once the data are submitted to automation. These materials will be more accessible and reproducible than ever before. More significant, the dialectologist will be able to broaden his investigation of the sociological implications of American speech through improved handling of data. As indicated previously, one of the benefits of our program is in the area of distributions by occupation, sex, age, and type. Sample matrixes are illustrated below (reject signifies that there was no response by an informant for that item):
It is not my purpose in this paper to comment on the sociological significance of the preceding distribution displays. Rather, they are presented as examples of the kinds of frequency distributions which may be obtained easily by using this program. Further distributions may be tabulated to discover whatever relationships the investigator is interested in. Matrixes 1 and 2 display the grammatical items in relationship to background data of the informants. Matrix 3 displays the relationship of the responses of two grammatical items to each other and clearly demonstrates how easily this interesting phase of dialect study may be handled by computer processing. Moreover, once Atlas programs become standardized, it should be relatively easy to crosstabulate grammatical and vocabulary items and find distributions across lexical and structural boundaries in which the number of speakers, for example, who use nine foot high can be correlated with those who say bucket instead of pail.
8.2 Cartography. Cartography, including the preparation of isoglosses, is another aspect of dialectology in which automation can be useful. The machine can be directed to print out the numbers of all communities in which a particular response appears. Charting responses onto maps can be greatly simplified by this procedure.
8.3 Methods of data collection. One last use of an automatic retrieval system for the Atlas materials involves the method of data collection. It is undoubtedly true that the well-trained field workers are the best persons to gather data about language. But it is also true that few well-trained field workers are available for the collection of data and only limited financial support is accessible to them. The Atlas interview has been criticized for being both too long and too incomplete. If there is some way to shorten the interview and yet maintain completeness while, at the same time, increasing the number of field workers, the progress of the Atlas will be insured.
Perhaps the compartmentalized analysis of Atlas materials suggested here is a step toward the solution of the problem. The fact that we have not dealt with phonetic transcription in this program shows that this compartment can be left out. Since phonetic transcription is not required, one major hurdle to competent data gathering may be overcome. It seems probable that a short questionnaire dealing with nonverb grammatical items might be administered by a college student in an advanced course in the English language. If he were given useful suggestions for eliciting responses naturally and effectively, his nonphonetic field work might be used to broaden the coverage of the Atlas nonverb grammatical items just as various multiple choice checklists of vocabulary items have broadened Atlas coverage of lexical materials. The data could be coded as the field work is being done — possibly on a machine-graded answer sheet similar to the multiple choice answer sheets used in some large college courses. If so, the step of transforming raw data to coded data will be eliminated at the field work stage.
8.4 Summary. The advantages of such an approach are numerous. The data already gathered by field workers for the Atlas can be utilized for, as I have already indicated, this material serves as a basis for future collecting. Secondly, the amount of data will multiply tremendously, for it will be limited only by our own energy. Our files can be updated yearly and, as long as students and informants are honest, we can get a more accurate picture of current American grammatical practice.
NOTES
1. See for example Glenna Ruth Pickford, ‘American Linguistic Geography: A Sociological Appraisal’, Word 12. 211-33 (1956).
2. See E. Bagby Atwood, The Regional Vocabulary of Texas (Austin, 1962), and Gordon R. Wood, ‘Dialect Contours in the Southern States’, American Speech (December, 1963), in which electronic data processing was utilized.
3. William Card and Virginia McDavid, ‘Paper for private circulation’ (Chicago Teachers College South, November, 1963).
4. Since this was written the matter of informant identification has been considered at a conference involving Raven I. McDavid, Jr., editor of The Linguistic Atlas of the Middle and South Atlantic States; Virginia McDavid, associate editor; William Card, member of the Local Policy Committee of the Atlas; and Frederic Cassidy, director of a five-year data-gathering project for a dictionary of American Regional English. It was agreed that the county was the appropriate geographical subunit. Mr. McDavid suggested a numbering system incorporating a feature devised by Hans Kurath, as follows:
Columns | Field |
1-2 | State |
3-5 | County |
6 | Type |
7-8 | Informant by type |
9-10 | Age |
11 | Sex |
12-13 | Ethnic |
It was agreed that this system was compatible both with computerizing DARE materials and with the projected publication in list manuscript form of the Atlas records of MSAS. McDavid and Cassidy agreed to a division of the labor necessary to produce standard county numbers compatible with the filing order of all existing Atlas projects.
5. For further details, see A 709-1401 Program for 101 Simulation 12-19 (Bloomington, 1963).
__________
* I am greatly indebted to William Card for his general help throughout the preparation of this paper. His suggestions concerning style, content, and application were particularly useful. It should be noted, however, that his generous concern makes him in no way responsible for all aspects of the paper.