“Computation in Linguistics: A Case Book”
A Syntactic Concordance for Middle High German
1. INTRODUCTION
This paper discusses a method of preediting Middle High German texts as a first step in preparing a concordance which can provide statistics about Middle High German syntactic patterns. The concordance will consist of a computer input which contains text plus a coded grammatical analysis, but which requires reasonably little time spent by scholars in preparation of the input. The paper begins with some discussion of the need for a syntactic concordance as an aid to research in historical linguistics. The chief variables which control the selection of a suitable preediting procedure are discussed in terms of a specific language, Middle High German. A tentative coding system for Middle High German is presented. Finally, two programs are flowcharted to demonstrate how the syntactic concordance might be used to answer specific questions. While much of the discussion deals with Middle High German, it is assumed that similar considerations apply to all languages which are extant only in written texts.
2.0 DESCRIPTION OF DEAD LANGUAGES
Most work in computational linguistics has derived either directly or indirectly from the widespread interest in machine translation, which is a practical goal only in the case of those languages where a large and growing body of material is available for translation. There is, however, a large class of languages which consist of written texts but which have no living native speakers. Historical linguists are attempting to provide adequate descriptions of some of these languages. This paper explores at least one way in which computational linguistics can aid the production of adequate descriptions of these so-called dead languages by providing distributional data for syntactic patterns.
2.1 Limited goals. Automatic analysis programs which start with unedited corpora are not likely to be practical at the present stage of research. Historical linguists who study a particular ‘dead’ language are generally a small group by comparison with the number of linguists working on a living language such as Modern English or Russian. Furthermore, there is no particular reason why a government agency or private foundation should want to encourage an increase in the number of researchers. Thus if computation is to play a role in historical linguistics, these limitations must be taken into account, and goals somewhat short of completely automatic analysis of natural language must be adopted.
2.2 Written texts. A second point, which is obvious but nevertheless often overlooked, is that methods for the description of languages preserved only in written texts are necessarily somewhat different from methods used in describing languages with living native speakers. Native-speaker intuition is not available to historical linguists. On the other hand, the fact that a language exists only as a set of written texts is an invitation for computational linguistics, especially since one of the most obvious limitations of computational linguistics is that its application is as yet restricted mainly to the written language. Exclusive reliance on the written language becomes a necessity rather than an inadequacy when working with languages which are attested only in written form.
2.3 Available information. Thirdly, while one might deplore the fact that much of the work in historical linguistics is out of date or in some sense methodologically inadequate, it is also true that a large body of knowledge has been accumulated. Any researcher who does not attempt to include a maximum amount of the available knowledge in any proposal for further research runs the risk of rediscovering the already known or overlooking previously gained insights. Available information should be put to use in furnishing better descriptions than those currently provided.
2.4 Syntactic research. Finally we are faced with the curious situation that work in historical linguistics has been concentrated more on lexicography and phonology than on syntax. Yet it is obvious that in a language which is preserved only in written form syntax is intrinsically more accessible to study than sound systems or meaning. We have no direct access to the way in which the native speakers pronounced their language nor do we know exactly what they ‘meant’ when they used it. On the other hand we do have a more or less exact record of how they manipulated the signs of their language, or at least some portion of it. One practical limitation on progress in syntactic studies has been the amount of data which must be grasped and worked through by the researcher when no native speaker is available to answer specific questions. Large-scale data manipulation is one of the most obvious abilities of the digital computer, so it is not surprising that computational linguistics has a potential application in simplifying the problems which are encountered by historical linguists in conducting syntactic research.
3.0 MIDDLE HIGH GERMAN
Middle High German is an example of a language which has been intensively studied using methods which are now regarded as outdated.1 A large body of knowledge has been accumulated in more than a century and a half of research. At the lowest level of understanding there is general agreement about certain grammatical categories. For example, in the sentence dâ lâgen zwei kreftigiu her (Parzival 16.28) ‘two powerful armies were camped there’, dâ is an adverb, lâgen is a verb in the past tense, her is a neuter plural noun in the nominative case and is the subject of the sentence, and zwei and kreftigiu modify her.
There may be widespread disagreement on the appropriateness of some of the terminology which is used in the traditional approach to Middle High German grammar, and there may even be doubts about the descriptive validity of many of the categories. Certainly rigorous criterial tests will fail in many instances. At the same time it is also true that if a small set of traditional categories is furnished and scholars are asked to use them in classifying forms which occur in a text, there will be general agreement, as in the example above. This information is too valuable to be wasted and offers the basis for preediting texts in order to construct an input for a digital computer. The input will consist of the text plus a code which will include information about grammatical categories and dependency relationships within the sentence. The grammatical analysis which will be built into the input will be a first approximation toward a more complete description of the grammar of the language. The grammatical information can be used to acquire statistics about Middle High German syntactic patterns as an aid in furnishing better descriptions of Middle High German.
Nearly every university in the United States that offers a Ph.D. degree in German has a specialist in Middle High German, but the majority of these scholars are interested primarily in Middle High German literature; they study and teach the language in preparation for literary research. In German-speaking areas of the world, Middle High German literature is regarded as an important part of the cultural tradition, and even students in the Gymnasium receive some exposure to it. University students who specialize in German ordinarily take several courses in Middle High German and acquire a basic understanding of the language. One can estimate that the number of students who intensively study Middle High German is substantially larger than a thousand annually. Linguistic research, however, is carried on by only a handful of scholars.
One typical Middle High German ‘beginner’s’ grammar contains the following divisions: ‘Lautlehre’, pp. 7-56; ‘Flexionslehre’, 57-119; ‘Zur Satzlehre’, 120-126.2 More thorough grammars contain long sections on syntax (in Paul-Mitzka ‘Syntax’ is the longest section in the grammar), but the information is of little help to anyone who wishes to know which syntactic patterns are productive, which are nonproductive but in widespread use in certain constructions, and which are isolated occurrences. It is clear that even for pedagogical purposes statistical information about patterns which occur would be most helpful. Such information can also contribute to the study of style, which is of interest to the literary scholar as well as the linguist.
4.0 LIMITATIONS OF THE CODING SYSTEM
In order for preediting to be feasible for a large corpus, it is obvious that the amount of time required to analyze the sentences of the corpus must be kept to a minimum. Therefore, while it is desirable to include as much information as possible, it is also necessary to keep the categories broad enough to reduce decision-making to a minimum. For this reason, rigor will often have to be sacrificed for speed, and some information will have to be arbitrarily excluded as potentially less interesting than other information. Thus, gender is not marked in the system of coding proposed in this paper, on the grounds that little additional information is likely to be derived from further analysis based on gender. Categories need to be specified as carefully as possible, but there will be cases where classification will have to be done with recourse to ‘general agreement among students of Middle High German’ or a similar appeal to tradition or authority. If general agreement proves to be illusory, then the analysis will have to be changed. The method of classification which is offered here represents a first step and will need modification before it can be actually employed. In order to be a useful procedure, the system of analysis will have to be easy to teach to scholars and students who are familiar with Middle High German, for it is clear that many thousands of lines of Middle High German will have to be analyzed before meaningful results are obtained. Attempts to teach the coding system will presumably indicate what kinds of changes are necessary to make the code usable.
4.1 Coding system. The preedited computer input could be stored on magnetic tape or a similar form of permanent storage, which will contain the text and grammatical information. Each word of the text will be entered, followed by a number of slots which will be filled as described in Table 2.2. The information which is built into the input tape can be clarified by a phrase-structure grammar of the type described in Table 2.1. Such a phrase-structure grammar does not attempt to describe the word order of Middle High German sentences, but information about word order is already present in the input tape since the text is entered from left to right in the order in which it occurs in texts.
The main reason for including Table 2.1 in its present form is to make the code used in Table 2.2 more readily comprehensible. The rules given in Table 2.1 are meant to describe the dependency relationships which are indicated in the coding system.
The rules of Table 2.1 cannot be freely used to generate Middle High German strings because the rules as they now stand would permit far too many nonoccurring sequences. Restrictions on application of the rules can be determined by what actually occurs. In fact, one of the potential uses of the input tape might be to determine what cooccurrence restrictions will need to be written into the rules in order to convert them into a real generative grammar of Middle High German. The first approximation to a grammar which is provided here might conceivably be used as the basis for a discovery procedure for refining the grammar.
4.2 Defining word classes. It is important to note that the traditional system of analysis which is used as the basis for the coding system contains word classes which are defined in rather different ways. The categories noun, adjective, verb, pronoun, and determiner (see Table 2.2, Column 12) can be defined largely by inflectional criteria. At the present stage of research these inflectionally determined categories can be regarded as rather clearly defined. There are residual problems and borderline cases, but for the most part one can furnish defining criteria. The categories adverb, conjunction, sentence-word, and preposition contain forms which are not subject to inflection. The definitions of the noninflected categories depend entirely on syntactic information. Thus a given form might be classified as a conjunction or an adverb depending on whether or not it was the first item in a subordinate clause. The class adverb in particular has long been a catch-all category for items which do not fit into any other category. It is used in the present coding system in just this fashion. One of the uses of the syntactic concordance may well be to reclassify some of the syntactically defined categories, or at least to provide more rigorous definitions for them.
4.3 Sentences of unusual complexity. It is clear that the rules in their present form will not account for all the data of Middle High German.3 It will probably prove most feasible to analyze as many sentences as possible with the machinery provided by the code, and to append to each text a supplementary list of sentences which cannot be readily analyzed. When a program is run to answer a specific question using a coded corpus, the researcher can examine the unanalyzed sentences by hand for any additional data which they might contain. A procedure of this sort is surely more practical than an attempt to make the code complex enough to handle all data from the very start. If some simplifying generalization should later be discovered, it would always be possible to modify the code to take account of the new information.
4.4 Practical problems. The main concern of this paper is to develop a code which is economically feasible and yet adequate for providing useful data. My experience with the code as it is now formulated indicates that a maximum speed of about 50 lines of hand-coding per hour can be expected if one is familiar with the text and does not dwell too long on constructions which do not fit readily into the coding system. (‘Line’ means the equivalent of an average line from a courtly epic, about six or seven words.) At best then it would require 2,000 hours of skilled analysis plus additional clerical costs, keypunching, programming, and computer time to get a corpus of 100,000 lines, which would seem to be a minimum corpus for answering questions of a statistical nature. The initial investment in a program of this size would clearly be a substantial one, but once the coding job was completed, the corpus would be useful for an indefinite time and would render unnecessary a large number of data-gathering tasks which were formerly done by hand if at all.
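The estimate of coding effort is simple arithmetic, and a short sketch makes it easy to vary the assumptions. The figures of 50 lines per hour and 100,000 lines are taken from the text; the function name and the word-count bounds (six or seven words per line) are illustrative only.

```python
def coding_hours(corpus_lines: int, lines_per_hour: float) -> float:
    """Hours of hand-coding needed at a constant coding speed."""
    if lines_per_hour <= 0:
        raise ValueError("lines_per_hour must be positive")
    return corpus_lines / lines_per_hour


CORPUS_LINES = 100_000   # minimum corpus size suggested in the text
LINES_PER_HOUR = 50      # maximum hand-coding speed reported in the text

hours = coding_hours(CORPUS_LINES, LINES_PER_HOUR)

# An average line is taken to be six or seven words.
words_low, words_high = CORPUS_LINES * 6, CORPUS_LINES * 7

print(f"{hours:.0f} hours for roughly {words_low:,} to {words_high:,} words")
```

Since 50 lines per hour is described as a maximum, the result is a lower bound on the analysis time; a slower assumed speed raises the estimate proportionally.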
5.0 SAMPLE FLOWCHARTS
It remains to give some examples of the kind of information which can be provided using a code of the type described above. Figures 2.1, 2.2, and 2.3 contain flowcharts showing how two questions might be answered using the data available in the coded corpus. Both questions can be answered in a straightforward manner, but this does not preclude the possibility of more complicated kinds of programs. The two flowcharts are intended to give only an idea of the possibilities.
5.1 Impersonal clauses. Figure 2.1 contains a flowchart for counting the number of clauses which do not have subjects and the number of those which do. One-word clauses, clauses with imperative verbs, and clauses with no verbs are not counted. Clauses without subjects are presumed to be impersonal clauses similar to Modern German Mir ist kalt. We now read, e.g. in Paul-Mitzka 186: ‘Unpersönlich werden im Mhd. im allgemeinen die gleichen Verben gebraucht wie im Nhd. Ausserdem noch manche andere. . .’ With the information provided by a program such as that proposed in Figure 2.1 it would be possible to say something like: ‘x per cent of the multiword clauses in the test corpus which have nonimperative verbs contain verbs with no subject’.
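The counting procedure described for Figure 2.1 might be sketched as follows. This is not a transcription of the flowchart: the `Token` record, its `is_subject` and `is_imperative` flags, and the assumption that clauses are not nested are all hypothetical simplifications. Only the position markers I, M, F, W and the labels SUB and NSUB are taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Token:
    form: str
    clause_pos: str              # "I", "M", "F", or "W" (position within the clause)
    word_class: str              # e.g. "noun", "verb", "adverb"
    is_subject: bool = False     # hypothetical flag, not a column of the paper's code
    is_imperative: bool = False  # hypothetical flag, not a column of the paper's code


def split_clauses(tokens: Iterable[Token]) -> List[List[Token]]:
    """Group a flat token stream into clauses, assuming no nested clauses."""
    clauses: List[List[Token]] = []
    current: List[Token] = []
    for tok in tokens:
        if tok.clause_pos == "W":        # one-item clause
            clauses.append([tok])
        elif tok.clause_pos == "I":      # a new clause begins
            current = [tok]
        else:
            current.append(tok)
            if tok.clause_pos == "F":    # the clause is complete
                clauses.append(current)
                current = []
    return clauses


def count_subjects(clauses: Iterable[List[Token]]) -> Dict[str, int]:
    """Count clauses with (SUB) and without (NSUB) a subject."""
    counts = {"SUB": 0, "NSUB": 0}
    for clause in clauses:
        verbs = [t for t in clause if t.word_class == "verb"]
        # Skip one-word clauses, verbless clauses, and imperative clauses.
        if len(clause) < 2 or not verbs or any(t.is_imperative for t in verbs):
            continue
        key = "SUB" if any(t.is_subject for t in clause) else "NSUB"
        counts[key] += 1
    return counts


# Illustrative data only: the Parzival sentence cited earlier, the Modern
# German impersonal clause mentioned above, and a one-word clause.
sample = [
    Token("dâ", "I", "adverb"),
    Token("lâgen", "M", "verb"),
    Token("zwei", "M", "adjective"),
    Token("kreftigiu", "M", "adjective"),
    Token("her", "F", "noun", is_subject=True),
    Token("mir", "I", "pronoun"),
    Token("ist", "M", "verb"),
    Token("kalt", "F", "adjective"),
    Token("jâ", "W", "sentence-word"),
]

counts = count_subjects(split_clauses(sample))
percent_nsub = 100 * counts["NSUB"] / (counts["SUB"] + counts["NSUB"])
print(counts, f"{percent_nsub:.1f} per cent without subject")
```

The final percentage corresponds to the ‘x per cent’ statement quoted above. A real program would also have to deal with clauses embedded within other clauses, which this sketch does not attempt.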
5.2 Position of finite verbs. Figures 2.2 and 2.3 give a flowchart for calculating the position of the finite verb in Middle High German sentences in terms of four positions: initial, second, final, other. The statistics are further broken down according to clause-type. This information is relevant to statements in the existing grammars which are notably vague on the location of the finite verb in various clause-types. Phrases such as ‘in general’ and ‘frequently’ abound. A program similar to the one outlined in Figures 2.2 and 2.3 would permit exact statements to be made.
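A minimal sketch of such a tabulation is given here, under several assumptions. Each clause is represented as a clause-type letter plus an ordered list of top-level constituent labels; the label `finite_verb` and the other constituent names are hypothetical. The order of the tests (initial, then second, then final, then other) is also an assumption, since a verb in a two-constituent clause is both second and final. The clause-type letters and the position symbols 1, 2, F, 3 follow the labels given in the notes to the figures.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple

Clause = Tuple[str, List[str]]   # (clause type, ordered constituent labels)


def verb_position(constituents: List[str],
                  finite: str = "finite_verb") -> Optional[str]:
    """Return "1", "2", "F", or "3" for the finite verb, or None if absent."""
    if finite not in constituents:
        return None
    idx = constituents.index(finite)
    if idx == 0:
        return "1"                       # initial position
    if idx == 1:
        return "2"                       # second position
    if idx == len(constituents) - 1:
        return "F"                       # final position
    return "3"                           # any other position


def tabulate(clauses: Iterable[Clause]) -> Counter:
    """Count clauses by combined label, e.g. "M2" or "AF"."""
    table: Counter = Counter()
    for clause_type, constituents in clauses:
        pos = verb_position(constituents)
        if pos is not None:
            table[clause_type + pos] += 1
    return table


# Illustrative input: M = main clause, A = adverbial clause.
corpus = [
    ("M", ["adverb", "finite_verb", "subject"]),
    ("A", ["conjunction", "subject", "object", "finite_verb"]),
    ("M", ["finite_verb", "subject"]),
]

table = tabulate(corpus)
for label in sorted(table):
    print(label, table[label])
```

Dividing each count by the total for its clause type would give the proportions needed to replace phrases such as ‘in general’ and ‘frequently’ with exact figures.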
5.3 Usefulness of analyses. Both of the programs outlined in the preceding paragraphs could give information which would allow us to speak more precisely about Middle High German than we now can. Neither question is of enough importance in itself to merit a great deal of scholarly effort in trying to answer it. If, however, a large number of such questions can be answered using the same corpus, then a sizable contribution can be made to the study of Middle High German without overburdening the few scholars engaged in its study.
6.0 ADDITIONAL USES
Several further uses of the corpus can be foreseen in addition to the main purpose of providing distributional data about syntactic patterns. Since the text is included along with the grammatical information, it would be possible to write a program which would produce an alphabetized and parsed word list of the entire corpus or individual parts of it.4
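A parsed word list of this kind might be produced along the following lines. The sketch assumes that each token is available as a pair of form and grammatical label; the sample labels are illustrative, and ignoring diacritics for alphabetization is an assumed convention rather than one stated in the paper.

```python
import unicodedata
from collections import Counter
from typing import Iterable, List, Tuple


def sort_key(form: str) -> str:
    """Alphabetize without regard to case or diacritics (dâ sorts as da)."""
    decomposed = unicodedata.normalize("NFD", form.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_list(tokens: Iterable[Tuple[str, str]]) -> List[Tuple[str, str, int]]:
    """Return (form, parse, frequency) triples in alphabetical order."""
    counts = Counter(tokens)
    rows = [(form, parse, n) for (form, parse), n in counts.items()]
    return sorted(rows, key=lambda row: (sort_key(row[0]), row[1]))


# Illustrative tokens based on the Parzival sentence cited earlier.
tokens = [
    ("dâ", "adverb"),
    ("lâgen", "verb"),
    ("zwei", "adjective"),
    ("kreftigiu", "adjective"),
    ("her", "noun"),
    ("dâ", "adverb"),
]

for form, parse, n in word_list(tokens):
    print(f"{form:<12}{parse:<12}{n}")
```

Restricting the input to the tokens of a single text would yield a list for an individual part of the corpus.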
The text component could also be used, without the grammatical information, for performing the kinds of analysis which are ordinarily done on unanalyzed text.5 The syntactic concordance would then serve both as a corpus of written language and as a source of statistics about syntactic patterns.
NOTES
1. Further information on the chronological and spatial delimitation of Middle High German can be found in Hermann Paul, Mittelhochdeutsche Grammatik, 18th edition by Walther Mitzka 16-31 (Tübingen, 1959). This work is henceforth cited as ‘Paul-Mitzka’.
2. K. Weinhold, G. Ehrismann, and H. Moser, Kleine mittelhochdeutsche Grammatik, 13th edition (Vienna and Stuttgart, 1963).
3. To give just one example, it is assumed when dealing with discontinuous constituents that an intervening constituent is enclosed within the larger constituent. If both constituents are to be coded in the same column, one might get a sequence I M I M F F (see Tables 2.2 and 2.3). The code can be looked at as equivalent to a right-facing parenthesis before I and a left-facing parenthesis after F. The sequence would then be bracketed as (I M (I M F) F) and the parentheses would be resolved starting with the innermost ones. Such a situation is typical for relative clauses which are included within main clauses. There are, however, also constructions in which two clauses share a common element, such as the following example: do spranc von dem gesidele her Hagene also sprach (Kudrun 538, as cited in Paul-Mitzka 279) ‘then jumped from his seat Lord Hagen, (Lord Hagen) spoke as follows’. At the clause level the following position-markers would be assigned (Column 2, Table 2.3): I M M M M I F M F. According to our convention for bracketing this would represent (I M M M M (I F) M F), but the wrong interpretation would be derived with unlabeled parentheses. The correct bracketing would have to be represented by e.g. (I M M M M [I F) M F]. On the other hand constructions where unlabeled parentheses give wrong analyses are extremely rare, and the added complication of providing for constructions such as the one discussed above seems impractical if one wants to keep the code manageable.
4. See R. Wisbey, ‘The analysis of Middle High German texts by computer: some lexicographical aspects’, Transactions of the Philological Society 28-48 (1963), for a discussion of word lists and concordances for Middle High German. Wisbey concludes that it is wasteful to include grammatical information with running text (p. 33). Since his interests are primarily lexicographical, this is probably true. The kind of program which he describes would, however, contain no information about word order. At the same time it should be emphasized that the concordances which Wisbey proposes are likely to be in use far sooner than anything similar to the plan outlined in this paper. Time is a factor which is worth considering as pointed out by John C. Wells, ‘A word-index and glossary to the Old High German glosses’, IBM literary data processing conference proceedings 148-59 (1964), to justify his work with the Old High German glosses.
5. The text component could serve the same function as for example the million-word corpus of English prepared by W. N. Francis. See the Manual of information to accompany A standard sample of present-day edited American English, for use with digital computers (Providence, 1964) for a description of this corpus.
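The bracketing convention of Note 3 can be illustrated with a short stack-based sketch. The function names are hypothetical, and the treatment of W as a self-contained one-item unit is an assumption. The routine opens a parenthesis before each I and closes one after each F, which is exactly the unlabeled-parenthesis interpretation discussed in the note.

```python
from typing import List, Union

Node = Union[str, List["Node"]]


def bracket(markers: List[str]) -> List[Node]:
    """Nest a sequence of I/M/F/W markers: open before I, close after F."""
    root: List[Node] = []
    stack: List[List[Node]] = [root]
    for m in markers:
        if m == "I":
            new: List[Node] = [m]
            stack[-1].append(new)
            stack.append(new)
        elif m in ("M", "F"):
            if len(stack) == 1:
                raise ValueError(f"{m} outside any open constituent")
            stack[-1].append(m)
            if m == "F":
                stack.pop()              # innermost open constituent is closed
        elif m == "W":
            stack[-1].append([m])        # one-item constituent (assumed handling)
        else:
            raise ValueError(f"unknown marker {m!r}")
    if len(stack) != 1:
        raise ValueError("constituent opened by I was never closed by F")
    return root


def render(node: Node) -> str:
    """Write a nested structure back out with parentheses."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(render(child) for child in node) + ")"


def show(sequence: str) -> str:
    return " ".join(render(n) for n in bracket(sequence.split()))


print(show("I M I M F F"))            # the relative-clause case
print(show("I M M M M I F M F"))      # the Kudrun example
```

For the Kudrun sequence the routine returns the enclosing analysis, which Note 3 identifies as the wrong interpretation. Representing the shared element correctly would require labeled brackets, which this sketch deliberately omits.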
Table 2.1¹
Rules To Explain the Grammar Code of Tables 2.2 and 2.3
Notes to Table 2.1
1. The following conventions are used in the presentation of the rules. Braces { } are used to indicate that any of the paths to the right of the left brace are possible. The right side is closed only if there are further items to be added. Parentheses () are used to enclose items which may be optionally selected. When all items to the right of an arrow are in parentheses, it is assumed that at least one item must be chosen. Square brackets [ ] are used to refer to the proper rules for expanding nonterminal symbols unless they are dealt with by the next rule in regular sequence. Nonterminal symbols are indicated by labels with upper-case first letters; terminal symbols by lower-case first letters. Underlining indicates the graphemic representation of a Middle High German form. For purposes of simplification, case and number markers are not included. In a grammar of Middle High German case and number as well as gender would have to be marked, perhaps by a system of subscripts plus transformational rules. See Emmon Bach, ‘The order of elements in German’, Lg. 38.264-5 (1962).
2. Relative Clause, Nominal Clause, and Adverbial Clause reenter the rules at Rule 2 and expand like Main Clause.
3. Rule 15 is an optional, recursive rule which allows for coordination or series. Most items can be expanded in the same way, but there are many constraints which need not be discussed here. In an actual grammar of Middle High German such rules would prove cumbersome and inadequate. They would probably have to be replaced by rules of transformational complexity.
Table 2.2
Slot Fillers for the Grammar Code
1. See Note 2 to Table 2.1. The parts of the entries which are underlined are the designations which will be used for hand-coding as in Table 2.3.
2. The entries in Column 2 refer to the position within the clause of the item being coded. Initial means leftmost in the clause, Medial means neither rightmost nor leftmost, Final means rightmost, and Word means that the clause contains only one item.
3. Columns 3-8 have the same entries as Column 2 with similar reference to position within constituents smaller than the clause. In addition, Columns 3-8 are not always applicable, so the category Blank must be added. In Table 2.3 Blank is marked by leaving the column blank, but in a computer code it would of course require a designation just like any other entry.
4. Column 8 is based on the branching provided in Rule 2, Table 2.1. An infinitive phrase derived by Rule 14 will be included in Column 8, but one derived through Rule 11 will not be included in Column 8.
5. See Rule 14 in Table 2.1. Column 9 can also have Blank as an entry for obvious reasons.
6. See Note 1 to Table 2.1. The entries DG and Ind are provided to deal with occurrences of case syncretism where it is difficult to arrive at a decision. DG reflects a special category for the frequent syncretism of the dative and genitive singular of feminine nominals. These categories are used only as a last resort. Even where there is formal syncretism a decision will be made if it is plausible.
7. Column 12 always has an entry. All entries except Verb are terminal symbols. Verbal nouns (see Rule 11, Table 2.1) are indicated by the entry noun in Column 12 plus the appropriate designation in Column 9.
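The position entries defined in Note 2 follow mechanically from the number of items in a clause or constituent, as the following sketch shows. The function name is hypothetical, and the single-letter designations I, M, F, W are assumed from their use elsewhere in the notes.

```python
from typing import List


def position_markers(n_items: int) -> List[str]:
    """Assign Initial, Medial, Final, or Word to each of n_items positions."""
    if n_items < 1:
        raise ValueError("a clause or constituent needs at least one item")
    if n_items == 1:
        return ["W"]                                  # only one item
    return ["I"] + ["M"] * (n_items - 2) + ["F"]      # leftmost, medial, rightmost


print(position_markers(1))
print(position_markers(2))
print(position_markers(5))
```

The same function would serve for Columns 3-8, applied to the items of each smaller constituent, with Blank supplied separately where a column does not apply.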
Table 2.3¹
1. The text consists of lines 1-6 of Gregorius by Hartmann von Aue, edited by Hermann Paul in the Altdeutsche Textbibliothek series, No. 2. The ninth edition prepared by Ludwig Wolff (Tübingen, 1959) was used. A free rendering of the passage is: ‘My heart has often forced my tongue to say a great deal which is aimed towards earthly reward. My lack of experience gave it that advice. Now I well know that to be true.’ See Table 2.2 for identification of the columns. The sequence of derivation is partially indicated by the system of coding. Coding nach der werlde lôn (line 4) in Column 4 as constituent followed by der werlde in Column 5 means that der werlde can be derived only after a second pass through Rule 11.
Figure 2.1
Figure 2.2
Figure 2.3
Notes to Figures 2.1, 2.2, and 2.3
1. The mechanics of the form in which the text and the grammar code would be put into the computer are not discussed. The code would be some sort of linear representation of Table 2.3. The first two words of the sample text might be represented by something like: 1 Min M I I O O O O O O N S det herze M M F O O O O O O N S noun. Appropriate boundary markers would also have to be included.
2. The labels SUB and NSUB indicate ‘subject’ and ‘no subject’.
3. The labels are formed from the following components: M = main clause, R = relative clause, A = adverbial clause, N = nominal clause, 1 = finite verb in initial position, 2 = verb in second position, F = verb in final position, and 3 = verb in other position.
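The linear representation suggested in Note 1 could be read back into structured records roughly as follows. The sketch assumes that each record begins with a line number, that every word is followed by exactly twelve slot values corresponding to the twelve columns of the code, and that fields are separated by blanks. The class and function names are hypothetical, and boundary markers are not handled.

```python
from typing import List, NamedTuple, Tuple

SLOTS = 12   # one value per column of the grammar code


class Entry(NamedTuple):
    line: int
    form: str
    slots: Tuple[str, ...]

    @property
    def clause_position(self) -> str:
        return self.slots[1]      # Column 2: position within the clause

    @property
    def word_class(self) -> str:
        return self.slots[11]     # Column 12: word class


def parse_record(record: str) -> List[Entry]:
    """Split 'line form slot*12 form slot*12 ...' into entries."""
    fields = record.split()
    line_no, rest = int(fields[0]), fields[1:]
    if len(rest) % (SLOTS + 1) != 0:
        raise ValueError("each word must be followed by exactly 12 slots")
    return [
        Entry(line_no, rest[i], tuple(rest[i + 1:i + 1 + SLOTS]))
        for i in range(0, len(rest), SLOTS + 1)
    ]


record = ("1 Min M I I O O O O O O N S det "
          "herze M M F O O O O O O N S noun")

for entry in parse_record(record):
    print(entry.form, entry.clause_position, entry.word_class)
```

A fixed number of slots per word keeps the format trivially parseable, at the cost of writing an explicit designation for every blank column, as Note 3 to Table 2.2 anticipates.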