“Computation in Linguistics: A Case Book”
A Program For the Determination of Lexical Similarity between Dialects
1.0 LINGUISTIC BACKGROUND
1.1 Introduction. Since linguists first noticed variation among speech communities in Fiji, they have used the term ‘dialects’ to mark this variation. Its extent, however, has remained loosely labeled, largely because of a lack of precise interpretation of data. As missionaries, ethnographers, and linguists moved westward through the Fiji Group, their estimates of dialect diversity ranged from ‘very similar’ to ‘great enough to form two distinct groups’. For either of these extremes, minimum evidence was cited.
The phrase ‘Fijian dialects’ suggests greater linguistic homogeneity within the group than actually exists. The term in fact has more political and geographical than linguistic significance, since the diversity among speech patterns on Fiji is greater than that among some sets in the Malayo-Polynesian Group traditionally treated as separate languages. But it is customary to speak of a single Fijian language. In order to make more precise statements about Fijian dialect diversity, it is necessary to compare not total systems but more abstractable features. Some of these features that have been described and discussed elsewhere2 are: the phonemic systems, sets of markers, types of morphological constructions, pronoun systems, and a limited number of syntactic constructions. Some general statements about lexical relationships have been made, based on the patterns derived from geographical mapping and plotting of isoglosses. However, there is a need for a set of more specific statements about the lexical relationship between the members of each different pair of villages.
1.2 Collection of data. Bauan, the official dialect, was used as the eliciting medium. The basis for the lexical sample was the Swadesh 200-word list, with the following omissions: geographically unsuitable items like freeze, ice, and snow; items that for various reasons were difficult to elicit such as warm; and multimorpheme translations of single English morphemes, such as ulu-ni-vanua ‘head-of-land’ for ‘mountain’.
The word lists, along with other linguistic data, were tape-recorded in the field. Some transcription was done at the time, but most of it was done later with the help of a Bauan speaker. A phonemic analysis was done for each village, and the regular consonant correspondences were tabulated as the linguist became more familiar with the material. An awareness of the occasional sound correspondences came much later, when the responses for each item were collated and their distribution shown on a dialect map.
1.3 Cognate vs. phonetically similar. Even though the lexical systems of various communities in Fiji are not as divergent as some early selected word lists have indicated, they are significantly different. A sample figure of lexical diversity, as measured by the conventional method of cognate inspection, is about 60 per cent for villages on the extreme ends of the main island of Viti Levu, 150 miles apart by coastal road. The figure is likely to be even lower between the outer island groups to the west and to the east. But because of a history of interaction among the villages, there are no dialects that have not been influenced by some others. For example, Bauan, spoken throughout the group, is the official language for all government, religious, and educational transactions. As a consequence, it is used as the medium of communication for speakers of dialects with a low degree of mutual intelligibility. It is difficult to measure the effect that the official dialect has had on the vocabularies of the other dialects, since there were only a few glossaries compiled before the time at which Bauan was adopted. Interest in the non-Bauan dialects (largely from missionary sources) dropped sharply when there was no longer a need to use them as teaching media. But as early as 1876, the influence of Bauan on the other dialects was noted. In his introduction to a collection of texts from different dialects, A. S. Gatschet remarked that ‘probably an excessive Bau element has crept into some of the specimens, owing to the circumstance that they were obtained from young men, whose speech approximates more to Bau than that of the older men.’3 Today, comparisons between younger and older speakers show that this trend has continued.4 Consequently, there are likely to be Bauan borrowings in the word lists from any of the other dialects, and, with a few exceptions (comparable to the English borrowings maakete ‘market’ and tii ‘tea’ in an area in which a /t/ borrowing would normally occur as /?/), they cannot be distinguished from ‘true’ cognates. Since any count of purported ‘cognates’ will in fact yield a figure for word pairs that are phonetically similar, forms displaying phonetic similarity have been used to avoid any unwarranted historical conclusions about the data.
1.4 Utility of computer in phonetic comparison. Among various advantages of electronic data processing — accuracy, consistency, speed, ability to handle large amounts of data — the first of these would be least applicable if we desired a measure of cognation. Because we have no standards for accuracy in cognate-counting other than the linguist’s decisions, any deviations from his results are, by definition, inaccurate. Although a program can be designed to manipulate the given data in a number of ways, it has no way of drawing on the exterior data that a linguist sometimes uses for his decisions. But no exterior data are required to measure phonetic similarity, and in this measurement, the computer can combine the other advantages listed with as high a degree of accuracy as hand inspection would produce.
The most important use of the computer in this project is the reduction of a large (and generally unmanageable) body of comparative data to more succinct statements. Comparative methodology can be quite cumbersome, particularly when it involves n(n-1)/2 comparisons for either the 23 villages used in the trial runs (253 pairs) or the 100 in the total collection of data (4,950 pairs).
1.5 Utilization of results. Although, for the reasons given above, it would be difficult to draw historical conclusions from the project results, they could serve as part of an index to the similarity among dialects at the present time. It may be possible to use the results to tell what kinds of differences — types of regular correspondences, syllable ordering, types of reduplication, etc. — make greater obstacles to intelligibility between dialect pairs.
In addition, in examining the procedures involved in machine-counting and hand-counting, we may be better able to understand what kinds of exterior data or experience the linguist draws on to decide, in less clear-cut cases, the relationship between two forms.
1.6 Methodology
1.610 Types of similarity between forms. Within this group of related dialects, most of the cognates show identity5 or regular correspondences. For example, in a hand comparison of two lists of 175 items, 40 items were identical, 29 showed regular correspondences, 34 had some degree of similarity, and 72 were not similar. Programming for recognition of identical items or those with regular correspondences seems relatively simple, since the symbols will be either identical or matched from a short list of correspondences. Recognizing those pairs fitting into neither of these categories, but still showing some similarity of form, is more complicated. These remaining forms that are similar have the following characteristics: 1) occasional correspondences, such as /β/ to /ð/; 2) phoneme-to-zero correspondences, particularly for /k/, /β/, and /ð/; 3) metathesis; and 4) an extra sequence, either similar or dissimilar to the rest of the form.
1.620 Possible units for comparison. Although the investigator doing a hand comparison of word lists probably scans whole forms in his first comparison, it seemed more practical to use some smaller unit in the program. Several possibilities follow.
1.621 Phoneme. Using the phoneme as a basic unit, Flowchart 5.1 shows a broad outline for comparison. Boxes 1 and 2 rewrite /?/ for certain villages to avoid equating /?/ corresponding to Bauan /t/ with /?/ corresponding to Bauan /k/.6 Following the No direction, Box 3 checks for identity of symbols. If the symbols are identical, this is recorded, and a check is made to see that neither of the words has ended. If the symbols are not identical, they are checked to see that two consonants or two vowels are being compared. If a consonant is being matched with a vowel, Box 9 allows certain consonant-to-zero correspondences. Box 7 lists the allowable consonant correspondences, both regular and occasional. If none of these operations checks, a negative answer is recorded, and the next pair of phonemes is checked. The end-of-word subroutine, Flowchart 5.2, divides the words into syllables and reorders the syllables of one form to check for metathesis and reduplication. To note the latter, the extra sequences in forms of unequal length are matched with the rest of the form. For each different reordering and matching, the percentage figures for phoneme correspondences are stored, and the highest used for tabulation.
1.622 Syllable. An alternate to the preceding plan is the use of the syllable (definable, for the purposes of this study, as any CV sequence, or V when preceded by a space or another V) as the basic unit for comparison. It is possible to list all allowable syllable correspondences — for example, /ndi/ — /ti/, etc. Flowchart 5.3 is a broad outline of the initial stages of a syllable-to-syllable comparison routine, using these correspondences.
1.623 Distinctive feature. Another linguistic unit that could possibly be used for comparison is the distinctive feature, which for purposes of the present study is viewed as an articulatory component. Considering the consonant phonemes of the overall pattern to be bundles of articulatory components, it is possible to account for all the contrasts by listing components of three types: position, manner, and cofeature (nasalization and labialization). Thus any consonant can be described by checking the appropriate columns in Table 5.1. Two examples are given.
Table 5.1
/p/, as shown above, is a combination of labial position, stop manner, and absence of nasalization and labialization as cofeatures. /ŋgw/ is a combination of velar position, stop manner, and the cofeatures of nasalization and labialization.
If we examine the consonant correspondences (Table 5.2) in terms of articulatory components, we find that the relationships between the corresponding pairs can be described in terms of phonetic distance — there are correspondences between certain points of articulation, certain manners of articulation, and presence or absence of cofeatures. In Table 5.1, the arrangement of the columns and the addition of the logically unnecessary categories of nonnasalized and nonlabialized are graphic means of illustrating this pattern: ANY TWO CONSONANTS MATCH OR CORRESPOND WHEN THEIR COMPONENTS FALL IN THE SAME COLUMN OR ADJACENT COLUMNS. Blank columns appear on the table to prevent two unrelated columns from being adjacent. For example, under position, correspondences of /t/ to /?/, /k/ to /?/, or /s/ to /h/ are allowed, but not /k/ to /t/. Under manner, a stop-to-spirant correspondence such as /p/ to /β/, or /k/ to /x/ is allowed. An additional set of instructions is needed to account for the limited number of consonant-to-zero correspondences.
Table 5.2
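The adjacency rule stated above can be illustrated with a minimal Python sketch. The column numbers are hypothetical stand-ins for the layout of Table 5.1, chosen only so that the examples named in the text come out as described (/t/ to /?/, /k/ to /?/, and /p/ to /β/ allowed; /k/ to /t/ disallowed). The names `Consonant` and `corresponds` are likewise illustrative.

```python
from typing import NamedTuple

# Hypothetical column numbers, not the actual layout of Table 5.1.
# A gap in the numbering stands for a blank column that keeps two
# unrelated categories from being adjacent.
POSITION = {"labial": 0, "dental": 2, "glottal": 3, "velar": 4}
MANNER = {"stop": 0, "spirant": 1}
ABSENT, PRESENT = 0, 1  # presence and absence of a cofeature are adjacent


class Consonant(NamedTuple):
    position: int
    manner: int
    nasalized: int
    labialized: int


def corresponds(a: Consonant, b: Consonant) -> bool:
    """True when every pair of components is in the same or an adjacent column."""
    return all(abs(x - y) <= 1 for x, y in zip(a, b))


T = Consonant(POSITION["dental"], MANNER["stop"], ABSENT, ABSENT)
K = Consonant(POSITION["velar"], MANNER["stop"], ABSENT, ABSENT)
GLOTTAL = Consonant(POSITION["glottal"], MANNER["stop"], ABSENT, ABSENT)
P = Consonant(POSITION["labial"], MANNER["stop"], ABSENT, ABSENT)
BETA = Consonant(POSITION["labial"], MANNER["spirant"], ABSENT, ABSENT)
```

With this assumed layout, `corresponds(T, GLOTTAL)` and `corresponds(K, GLOTTAL)` hold while `corresponds(K, T)` does not, because dental and velar are two columns apart.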
1.63 Choice of unit for comparison. Because of its functioning in the language, the syllable was used as the unit for comparison in the present program. The structural reason for its use is that it operates as a unit for the following processes: 1) Reduplication. Because of the syllable structure, there is no reordering of elements smaller than a syllable. The pattern C1V1C2V2 cannot become V1C1C2V2. 2) Metathesis. For a similar reason, there is no metathesis of the patterns C1VC2→C2VC1 or V1CV2→V2CV1. 3) Other morphological constructions. Although the most productive morphological construction — transitive suffixation — has been eliminated from the data, there are examples that seem to reflect older constructions. These elements, although unidentifiable, are always at least a syllable in length. The practical reasons for this choice are: 1) the simple structure makes identification of syllable boundaries easy; 2) the syllable structure also limits the number of different syllables that can occur7; 3) the stability of the vowels throughout the dialects limits the number of corresponding syllables in the overall pattern; and 4) some operations with syllables were already necessary in the preceding plan.
With the addition of one more feature, vowel quality, the distinctive feature analysis was used as a means for syllable comparison. This eliminated the need for separate correspondence tables. This approach may produce some undesirable results. For example, the program is likely to be more complicated, and some correspondences may be posited that have little validity. But it is also likely that this procedure will show some relationships that were too obscure or whose occurrences were too infrequent for the linguist to observe. Were there not a series of irregular sound correspondences scattered throughout the data, it would have been simpler to check each syllable pair against a list of regular correspondences. The phonetic-distance method was used to discover either more examples for, or exceptions to, the theory of the relationship of Fijian sound correspondences to distinctive features.
The resultant index of phonetic similarity, having taken into account the patterning of distinctive features that seems to exist for both regular and occasional correspondences, should show a high correlation to counts of related forms.
2.0 THE PROGRAM
2.1 Introduction. A computer program has been devised that can automatically compare lexicons from Fijian dialects, using only the lists of dialect vocabulary entries as input data. Owing to limitations of time, the full program, called the ‘Fiji Lexical-Comparison Program’, has not been completed. However, enough work has been done to indicate the feasibility and practicality of the computational technique used. This section describes the data, the technique, and the results achieved, and suggests plans for future research in computer-aided dialect analysis.
2.2 The input data. The input data for the program consist of the word lists for each dialect. In each list the entries are numbered and blanks are used for missing entries so that, for example, entry five on the first list is to be compared with entry five on every other list that contains a word at this location.
For the computer this input data must be keypunched onto IBM cards. Each word list is given in one deck of cards, with each card containing four consecutive dialect entries. The first 72 columns of each card contain these four words, with the first word starting in column 1, the second word starting in column 19, the third word starting in column 37, and the fourth word starting in column 55. Thus each word is limited to a maximum of 18 characters and if this vocabulary is missing a word, all 18 characters at the position on the card of this word are blank. Columns 75-77 contain the number of the last word on the card. Thus a card containing words five through eight of the dialect would contain 008 in columns 75-77. Finally, all dialects which are to be compared are assigned unique arbitrary numbers and columns 78-80 contain the number of the dialect which uses the words given in columns 1-72. All cards containing data on a single dialect must be kept together for the computer.
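The card layout just described can be read with a few fixed-width slices. The following Python sketch is an illustration only: the function name `parse_card`, the use of `None` for a missing word, and the sample words are assumptions, while the column positions are those given in the text.

```python
def parse_card(card: str):
    """Split one 80-column card into (dialect number, {entry number: word})."""
    card = card.ljust(80)  # tolerate trailing blanks trimmed from the card image
    # Words start in columns 1, 19, 37, and 55 (1-based), 18 characters each.
    words = [card[start:start + 18].strip() or None for start in (0, 18, 36, 54)]
    last_entry = int(card[74:77])   # columns 75-77: number of the last word
    dialect = int(card[77:80])      # columns 78-80: dialect number
    first_entry = last_entry - 3    # four consecutive entries per card
    return dialect, {first_entry + k: w for k, w in enumerate(words)}


# A card holding entries five through eight of dialect 12, entry six missing.
sample = "WAI".ljust(18) + " " * 18 + "VATU".ljust(18) + "KAU".ljust(18) + "  008012"
```

Here `parse_card(sample)` gives dialect 12 with entries 5, 7, and 8 filled and entry 6 marked as missing.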
The words of the various Fiji dialects are keypunched using the letters A, E, I, O, and U for the vowels and the following symbols for the consonants: B for /mb/, C for /ð/, D for /nd/, G for /ŋ/, H for /h/, K for /k/, L for /l/, M for /m/, N for /n/, P for /p/, Q for /ŋg/, R for /ř/, S for /s/, T for /t/, V for /β/, W for /w/, X for /x/, Y for /y/, DR for /nr/, GW for /ŋw/, KW for /kw/, QW for /ŋgw/, XW for /xw/, and ’ for /?/. All other characters are assumed to be typing mistakes.
The syllables of each word consist of # V or CV, where # represents a blank, V represents any vowel given on the above vowel list and C represents any consonant given on the above consonant list. Thus V always represents a single character while C represents either one or two characters. Limiting syllable forms greatly simplifies the computations of the program.
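Because every syllable is V, CV, or CCV in the keypunched form, a word can be divided by trying the longest consonant first. The sketch below is one possible reading of that procedure in Python, not the original routine; the name `syllabify`, the use of an ASCII apostrophe for the glottal stop, and the choice to raise an error on unexpected characters are assumptions.

```python
VOWELS = set("AEIOU")
CONSONANTS = {
    "B", "C", "D", "G", "H", "K", "L", "M", "N", "P", "Q", "R",
    "S", "T", "V", "W", "X", "Y", "'", "DR", "GW", "KW", "QW", "XW",
}


def syllabify(word: str):
    """Return a list of (consonant, vowel) pairs; consonant is '' for a bare V."""
    syllables = []
    i = 0
    while i < len(word):
        # Try a two-character consonant, then a one-character consonant, then none.
        for clen in (2, 1, 0):
            onset = word[i:i + clen]
            nucleus = word[i + clen:i + clen + 1]
            if (clen == 0 or onset in CONSONANTS) and nucleus in VOWELS:
                syllables.append((onset, nucleus))
                i += clen + 1
                break
        else:
            # Anything else is treated as a keypunching mistake.
            raise ValueError(f"cannot syllabify {word!r} at position {i}")
    return syllables
```

For instance, `syllabify("QWAI")` yields `[("QW", "A"), ("", "I")]`, and a form such as `"VAT"` is rejected because the final consonant has no vowel.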
2.3 The Fiji Lexical-Comparison Program. The Fiji Lexical-Comparison Program must prepare all dialect pairs for comparison and then calculate the phonetic relationship between the members of each pair. For simplification, the discussion of the program will be divided into two segments. The first segment, called the control program, includes all operations required to prepare each dialect pair for comparison. The second segment, called the comparison program, uses the data pertaining to each of the two given dialects as input data and produces the numbers indicating the relationship of these two dialects as output.
The control program uses the dialect vocabularies, prepared according to the card format described above, as input data. As each dialect vocabulary is read, it is converted into the numerical format, described below, required by the comparison program and assigned an input sequence number.
The basic sequence of operations used by the control program is outlined in this and the following three paragraphs. The first dialect is read, converted into the required format, and stored as one of the two dialects to be compared. The second dialect is read, converted into the required format, written on magnetic tape (thus saving the converted data for future reference),8 and stored as the second of the two dialects to be compared. Then the first two dialects are compared by use of the comparison program. The results of the comparison are then printed as the output of this first dialect-pair comparison.
The third dialect is then read, converted into the required format, written on magnetic tape, and stored as the second of the two dialects to be compared. Then the first and third dialects are compared and the results of the comparison are printed. This process of reading, converting, writing, storing, and comparing dialect pairs is continued until every remaining dialect has been converted, written on magnetic tape, compared with the first dialect read, and all of the results have been printed. At this point the first dialect has been compared with all remaining dialects so it can be set aside, and all remaining dialects have been converted and stored on magnetic tape ready for comparison.
The magnetic tape containing the converted dialects is rewound and the second dialect is read and stored as the first of the two dialects to be compared. The third dialect is then read, stored as the second of the two dialects to be compared, and compared with the second dialect. This process of reading, storing, and comparing dialect pairs is continued until every remaining dialect has been compared with the second dialect read, and all of the results have been printed. At this point the second dialect has been compared with all remaining dialects so it too can be set aside.
The magnetic tape containing the converted dialects is repositioned and the third dialect is read and stored as the first of the two dialects to be compared. The remaining dialects are then read, stored, and compared with the third dialect. This process of reading and storing one dialect followed by reading, storing, and comparing every remaining dialect with this dialect is continued until each dialect has been compared with every remaining dialect. When this has been completed, every dialect has been compared once, and only once, with every other dialect and the Fiji Lexical-Comparison Program has completed the processing of all of the dialects.
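The sequence of passes described in the preceding paragraphs can be summarized in a short Python sketch. It is a simplified model rather than the original control program: a list stands in for the magnetic tape, and `convert` and `compare` are caller-supplied placeholders for the conversion and comparison routines.

```python
def control_program(vocabularies, convert, compare):
    """Compare every dialect once with every other, in the order described."""
    results = []
    tape = []  # stands in for the magnetic tape of converted dialects

    # First pass: convert each dialect once; compare all others with the first.
    first = convert(vocabularies[0])
    for number, vocabulary in enumerate(vocabularies[1:], start=2):
        second = convert(vocabulary)
        tape.append(second)  # saved for the later passes
        results.append(((1, number), compare(first, second)))

    # Later passes: reread the tape, taking each dialect in turn as the first
    # of the pair and comparing it with every dialect that follows it.
    for a in range(len(tape)):
        for b in range(a + 1, len(tape)):
            results.append(((a + 2, b + 2), compare(tape[a], tape[b])))
    return results
```

For four dialects this produces the six pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), the same order that `itertools.combinations` would generate, and each vocabulary is converted only once.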
The preceding four paragraphs have outlined the major parts of the control program and have given sufficient information on all parts of this control program except for the description of the conversion of the input dialect vocabularies into the numerical values required by the comparison program.9 The conversion program requires that each dialect word be subdivided into its syllables and that each syllable be replaced by a number identifying the distinctive features of that syllable. As mentioned previously, the syllables of Fiji dialects have the form # V or CV which, as keypunched, appear as v, cv, or ccv. Thus the program has no problems in subdividing the dialect word into syllables. The distinctive feature values for each vowel and consonant are stored in tables and, by reference to these tables, the characters given in the dialect word are replaced by the desired numbers. These numbers are stored in consecutive machine-word locations and, after the word has been completed, additional zeroes are stored to complete a block of nine10 machine words for each dialect word. Thus the complete conversion program consists of subdividing each dialect word into its syllables, replacing each syllable by the numeric value of its distinctive features, and adding additional zero numbers to complete a block of nine machine words for each dialect word.
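The final padding step of the conversion can be sketched as follows. This Python fragment assumes, for illustration, that each syllable has already been replaced by a single nonzero number; the actual feature codes are not given here, and the names `to_block` and `is_missing` are hypothetical.

```python
BLOCK_SIZE = 9  # machine words reserved for each dialect word


def to_block(syllable_codes):
    """Pad a word's syllable codes with zeroes to a block of nine numbers."""
    if len(syllable_codes) > BLOCK_SIZE:
        raise ValueError("word has more syllables than the block can hold")
    return list(syllable_codes) + [0] * (BLOCK_SIZE - len(syllable_codes))


def is_missing(block):
    """A block consisting only of zeroes marks a missing dialect word."""
    return not any(block)
```

A two-syllable word with the hypothetical codes 5 and 12 becomes `[5, 12, 0, 0, 0, 0, 0, 0, 0]`, and a missing entry becomes nine zeroes.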
The segment of the Fiji Lexical-Comparison Program remaining to be discussed is the comparison program. This program compares the two dialects that have been selected by the control program and converted into numbers by the conversion part of the control program. The input to the comparison program consists of two lists of numbers, one list from each dialect to be compared. Each list consists of 200 blocks of nine numbers, since the dialect vocabulary is defined to be 200 words long11 and each dialect word is set to be nine machine words long.
The comparison program uses the first block of nine numbers in the first list and the first block of nine numbers in the second list to compare the first word on the first list with the first word on the second list. The result of this word-pair comparison, to be discussed later, is a number called the word-pair-comparison value, which indicates the phonetic relationship of these two words. The value of this number, to be discussed later, ranges from zero, indicating no relationship, to $1 - 2^{-35}$, indicating identity. (In the following discussion, ‘identity’ will refer to identity of symbols.) The second block of nine numbers from the first list is then compared with the second block of nine numbers from the second list. This process is continued until each word on the first list has been compared with the equivalent word on the second list. If a dialect word is missing from one or both lists, one or both blocks will consist of zeroes, no word-pair comparison is made, and a zero is used to indicate that no comparison is possible.
When all word-pair comparisons have been made, the computer is ready to combine the results of these word-pair comparisons and obtain the numbers used to indicate the relationships of the two dialects. The number of words present in both vocabularies, and hence the number of comparisons possible, is calculated. The number of word pairs deemed to be related is counted and the percentage is calculated by dividing this number of word pairs by the total number of word pairs compared. The average of the word-pair-comparison values for all word pairs, and for all word pairs deemed related is then calculated. These two averages, the percentage of word pairs deemed to be related, and the number of possible word-pair comparisons are the final results of the comparison program.
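The four figures reported for each dialect pair can be computed as in the sketch below. Two points are assumptions made for illustration: missing comparisons are represented by `None` so that they are not confused with a genuine value of zero, and the cutoff above which a word pair is ‘deemed related’ is passed in as `related_threshold`, since no cutoff is stated here.

```python
def summarize(values, related_threshold):
    """Combine word-pair-comparison values into the dialect-pair results."""
    compared = [v for v in values if v is not None]
    if not compared:
        return None  # the two vocabularies share no entries
    related = [v for v in compared if v >= related_threshold]
    return {
        "comparisons": len(compared),
        "percent_related": 100 * len(related) / len(compared),
        "mean_all": sum(compared) / len(compared),
        "mean_related": sum(related) / len(related) if related else 0.0,
    }
```

With the values `[1.0, 0.0, None, 0.5, 0.0]` and a hypothetical threshold of 0.5, four comparisons are possible, half of them are counted as related, the average over all pairs is 0.375, and the average over related pairs is 0.75.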
The most important part of the program is the calculation of the word-pair-comparison value, since the results of the dialect-pair comparison depend on how accurately the word-pair-comparison routine calculates this value. Based on the results obtained with this preliminary program, it can be said that the current program is reasonably accurate, and with some revisions can be expected to be more so. However, before discussing the weaknesses of the program and how technically feasible revisions can improve the results, a description of the technique used for word-pair comparisons is necessary.
The word-pair-comparison technique is divided into three parts. The first calculates the relationships between syllables, the second selects syllable pairs for analysis, and the third uses these selected syllable pairs to obtain the word-pair-comparison value.
The first part of the comparison of the two words is to compare each syllable of the first word with each syllable of the second word. This comparison is done by use of a syllable-pair-comparison program which computes a syllable-pair-comparison value for each syllable pair. This syllable-pair-comparison value is a number which indicates the degree of similarity of the two syllables such that the number is small if the two syllables are identical and large if they are not.12 The distinctive features of the syllables are divided into four groups — one group consists of the distinctive features of vowel quality and the other three groups subdivide the distinctive features of the consonant into ‘position’, ‘manner’, and ‘cofeature’. The number of differences in distinctive features is counted for each group and the sum of: 1) the number of differences in ‘vowel quality’ distinctive features multiplied by 23; 2) the number of differences in ‘position’ distinctive features multiplied by 22; 3) the number of differences in ‘manner’ distinctive features multiplied by 21; and 4) the number of differences in ‘cofeatures’ multiplied by 4 is then formed. To this value 8192 is added if the two syllables are not identical.
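The weighted sum just described might be sketched in Python as follows. The weights and the added constant are taken as printed above and are exposed as parameters so that they can be adjusted. Representing each group as a set of feature labels, and counting differences as features present in one syllable but not the other, are assumptions of this sketch, as are the feature labels in the example.

```python
# Weights per feature group, as given in the text.
WEIGHTS = {"vowel": 23, "position": 22, "manner": 21, "cofeature": 4}
NOT_IDENTICAL = 8192  # added whenever the two syllables are not identical


def syllable_pair_value(a, b, weights=WEIGHTS, penalty=NOT_IDENTICAL):
    """Small for identical syllables, large for differing ones.

    Each syllable is a dict mapping a group name to a frozenset of features.
    """
    total = sum(w * len(a[group] ^ b[group]) for group, w in weights.items())
    if a != b:
        total += penalty
    return total


# Illustrative syllables differing only in manner (stop vs. spirant).
TA = {"vowel": frozenset({"a"}), "position": frozenset({"dental"}),
      "manner": frozenset({"stop"}), "cofeature": frozenset()}
SA = {"vowel": frozenset({"a"}), "position": frozenset({"dental"}),
      "manner": frozenset({"spirant"}), "cofeature": frozenset()}
```

Under these assumptions an identical pair scores zero, and the pair `TA`, `SA` scores 8192 plus two manner differences weighted by 21.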
These syllable-pair-comparison values are then used in the following manner to select the syllable pairs which indicate the maximum amount of similarity.
The syllable in the first word which shows the maximum similarity to a syllable in the second word is identified by selecting the smallest syllable-pair-comparison value.13 This syllable-pair-comparison value is stored at the head of a list, the number supplying the location of the syllable selected from the first word is stored as the second number on the list, and the number supplying the location of the syllable selected from the second word is stored as the third number on the list. Then all syllable-pair-comparison values using either of the two syllables selected are set aside. This last step has the effect of deleting one syllable from each word with the two syllables ‘deleted’ forming the syllable pair with the maximum degree of similarity.
The next calculation consists of using the remaining syllables in each word to select the syllable in the shortened first word which shows the maximum degree of similarity to a syllable in the shortened second word. This second syllable pair is identified by selecting the smallest number among the remaining syllable-pair-comparison values.14 This new syllable-pair-comparison value is stored on the same list as the syllable-pair-comparison value first selected and this new value is followed by the number supplying the location of the syllable selected from the first word and the number supplying the location of the syllable selected from the second word.15 Then all syllable-pair-comparison values using either of these two syllables selected are set aside. This has the effect of deleting a second syllable from each word with the two syllables ‘deleted’ having a degree of similarity second only to that of the first syllable pair deleted.
Again the remaining syllables of each word are used to find the syllable in the twice-shortened first word which shows the maximum degree of similarity to a syllable in the twice-shortened second word. The syllable-pair-comparison value found and the numbers of the syllables selected are added to the list. This process is continued until every syllable of the shorter word has been paired off with some syllable of the longer word.
If the two words have the same number of syllables, this second part of the word-pair-comparison program is completed. If not, each remaining syllable of the longer word is compared with every syllable of the shorter word and the syllable pair showing the maximum degree of similarity is added to the list.16 Thus each excess syllable of the longer word is paired with the syllable of the shorter word which it most nearly resembles in value and position.
When the above calculations have been completed, the results consist of a list of numbers with three numbers for each syllable of the longer word. The first of these three numbers is the number indicating the relationship between one specific syllable of the first word and one specific syllable of the second word, the second of these three numbers is the number supplying the location of that specific syllable in the first word, and the third of these three numbers is the number supplying the location of that specific syllable in the second word. Thus each syllable of the longer word is paired with one syllable of the shorter word and each syllable of the shorter word is paired with at least one syllable of the longer word. This list of syllable numbers and syllable-pair-comparison values is the input data to the third and final section of the word-pair-comparison technique.
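The selection procedure amounts to a greedy pairing followed by an assignment of leftover syllables. The Python sketch below follows that outline. The input `values[i][j]` is the syllable-pair-comparison value for syllable i of the first word and syllable j of the second; ties are broken by taking the lowest syllable locations, which is an assumption of this sketch rather than a rule stated in the text.

```python
def pair_syllables(values):
    """Return a list of (value, location in word 1, location in word 2)."""
    n1, n2 = len(values), len(values[0])
    free1, free2 = set(range(n1)), set(range(n2))
    triples = []

    # Repeatedly pick the most similar remaining pair and set both
    # syllables aside, until the shorter word is used up.
    while free1 and free2:
        value, i, j = min((values[i][j], i, j) for i in free1 for j in free2)
        triples.append((value, i, j))
        free1.discard(i)
        free2.discard(j)

    # Each excess syllable of the longer word is paired with the syllable
    # of the shorter word that it most nearly resembles.
    for i in sorted(free1):
        value, j = min((values[i][j], j) for j in range(n2))
        triples.append((value, i, j))
    for j in sorted(free2):
        value, i = min((values[i][j], i) for i in range(n1))
        triples.append((value, i, j))
    return triples
```

For a three-syllable word whose third syllable repeats its first, compared with a two-syllable word, the result contains three triples, with the repeated syllable paired a second time with the first syllable of the shorter word.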
Because the current syllable-pair-comparison program cannot produce a range of values usable for accurately identifying the degree of similarity of syllable pairs which are not identical, the third segment of the word-pair-comparison technique is limited to identifying syllable pairs as being identical or not identical. Thus this word-pair-evaluation routine is limited to the use of the number of syllables in the shorter of the two words (called SYLMIN), the number of syllables in the longer of the two words (called SYLMAX), the number of syllables in the longer of the two words which are identical with syllables in the shorter of the two words (called MATCH), and the relative positions of the syllables in each syllable pair. At present the ordering of syllables in each word pair is ignored because the syllable-pair-comparison values are not accurate enough to warrant the use of the syllable ordering. Thus only SYLMIN, SYLMAX, and MATCH are used in obtaining the word-pair-comparison value.
The word-pair-evaluation routine, which produces the word-pair-comparison value, uses a series of calculations depending upon the values of SYLMIN, SYLMAX, and MATCH to calculate the desired value. Thus if MATCH is zero, indicating that there are no identical syllables, the word-pair-comparison value is set to zero. If MATCH is not zero, but the longer of the two words has at least two more than twice as many syllables as the shorter of the two words, then the word-pair-comparison value is arbitrarily set to .000002, since it is assumed that the difference in word length precludes any possibility of the two words being cognates.
Next, if the number of identical syllable pairs (MATCH) is less than the number of syllables in the longer of the two words (SYLMAX), the word-pair-comparison value is set to be [expression missing]. If the number of identical syllable pairs (MATCH) is equal to the number of syllables in the longer of the two words (SYLMAX) and both words have the same number of syllables (SYLMAX equals SYLMIN), then the two words are identical (except for a possible reordering of the syllables), and the word-pair-comparison value is set to be $1 - 2^{-35}$.
Finally, if the number of identical syllables (MATCH) is equal to the number of syllables in the longer of the two words (SYLMAX), but the two words do not have the same number of syllables, then the longer of the two words contains all of the syllables of the shorter word and at least one syllable of the shorter word appears twice in the longer word. When this occurs, the word-pair-comparison value is set to be .5 + [expression missing], which is an average of the two previous equations for calculating the word-pair-comparison value.
The net result of this word-pair-evaluation routine is to produce a number indicating the relationship of the two words, with zero indicating no relationship and $1 - 2^{-35}$ indicating that the two words are identical. The word-pair-comparison values produced by this routine are then used to calculate the relationship of the vocabulary pair as described above.
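The sequence of tests in the word-pair-evaluation routine can be sketched in Python as follows. This is an illustration only. The formula for the case where MATCH is less than SYLMAX is not recoverable from the text as it stands, so the sketch takes it as a caller-supplied function, here called `partial` (a hypothetical name). The last case is implemented as the plain average of that function's value and the identical-word value, following the description of it as an average of the two previous equations.

```python
from typing import Callable

IDENTICAL = 1 - 2 ** -35     # value the text assigns to identical words
LENGTH_MISMATCH = 0.000002   # arbitrary value the text assigns when lengths differ too much

# Assumption: the partial-match formula is supplied by the caller,
# because the source text does not preserve it.
PartialFormula = Callable[[int, int, int], float]


def word_pair_value(sylmin: int, sylmax: int, match: int,
                    partial: PartialFormula) -> float:
    """Return the word-pair-comparison value from SYLMIN, SYLMAX, and MATCH."""
    if match == 0:
        return 0.0                         # no identical syllables
    if sylmax >= 2 * sylmin + 2:
        return LENGTH_MISMATCH             # longer word is too long to be cognate
    if match < sylmax:
        return partial(sylmin, sylmax, match)
    if sylmax == sylmin:
        return IDENTICAL                   # same syllables, possibly reordered
    # MATCH equals SYLMAX but the word lengths differ: average of the two formulas.
    return (partial(sylmin, sylmax, match) + IDENTICAL) / 2
```

For experimentation one might pass a stand-in such as `lambda sylmin, sylmax, match: match / sylmax`; that ratio is a placeholder chosen for the example and is not claimed to be the formula used by the program.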
2.4 Current limitations. While the program discussed above is capable of recognizing items with either a very high or a very low degree of phonetic similarity, it does not identify all phonetically similar (hence, related17) items. The basic problem is, as mentioned above, the limitations in the syllable-pair-comparison program.
Since the syllables of the Fiji dialects have the form #V or CV, it would be easy to have the computer first identify the syllable pair as being one of the following eight combinations: 1) CV-CV; 2) CV-CV₁; 3) CV-C₁V; 4) CV-C₁V₁; 5) CV-#V; 6) CV-#V₁; 7) #V-#V; or 8) #V-#V₁. Combinations 1) and 7) can be eliminated at once, since these are the cases where the syllable pairs are identical. For the remaining combinations, a fairly simple computation can be devised from the following: the syllable-pair combination; the number of distinctive features which appear in both syllables; the number of distinctive features which appear in only one of the syllables; shifts of distinctive features as described in section 2.3 above; and the types (‘position’, ‘manner’, ‘cofeature’, and ‘vowel quality’) of the distinctive features which are in both syllables or in only one of the syllables. This computation need produce only a syllable-pair-comparison value capable of differentiating syllable pairs into three groups: 1) identical; 2) related; and 3) unrelated.
Using these three groupings, it will be relatively simple to design a word-pair evaluation routine which can accurately assess the relationship of the two words. This in turn will automatically improve the accuracy of the numbers used to evaluate the relationship of each dialect pair.
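The proposed first step, sorting a syllable pair into one of the eight combinations, might look like the Python sketch below. The representation of a syllable as a (consonant, vowel) pair, with `None` standing for the absent consonant of a #V syllable, and the function name `combination` are assumptions made for the example. The feature-based computation that would follow this step is not attempted here.

```python
from typing import Optional, Tuple

# Assumption: a syllable is (consonant, vowel); consonant is None for a #V syllable.
Syllable = Tuple[Optional[str], str]


def combination(a: Syllable, b: Syllable) -> int:
    """Return the number (1 to 8) of the syllable-pair combination listed in the text."""
    (cons_a, vowel_a), (cons_b, vowel_b) = a, b
    same_vowel = vowel_a == vowel_b
    if cons_a is not None and cons_b is not None:
        if cons_a == cons_b:
            return 1 if same_vowel else 2   # CV-CV or CV-CV1
        return 3 if same_vowel else 4       # CV-C1V or CV-C1V1
    if cons_a is None and cons_b is None:
        return 7 if same_vowel else 8       # #V-#V or #V-#V1
    return 5 if same_vowel else 6           # CV-#V or CV-#V1


def is_identical(a: Syllable, b: Syllable) -> bool:
    """Combinations 1 and 7 are the identical pairs that need no further computation."""
    return combination(a, b) in (1, 7)
```

Only the pairs for which `is_identical` returns false would then be passed on to the distinctive-feature computation that separates related from unrelated syllables.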
3.0 EVALUATION
3.1 Comparison with existing programs. Although the computer is a valuable tool for lexicostatistical research, this portion of the field of language data processing has received less attention than it deserves. One use of the computer in lexicostatistics so far is to do the numerical computations after the linguist has assigned the cognate identifications by hand.18 To use this program, the linguist must take the equivalent words in each language, identify the words which appear in only one language and, for the remaining words, assign identification numbers to each group of cognates. Once the linguist has made the cognate classifications for every word in every language, then the cards containing these data are prepared and read into the computer. The computer takes each pair of languages and counts the number of words available for comparison and the number of these word pairs which the linguist has said are cognate.
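The counting step performed by such a program is simple enough to sketch in a few lines of Python. In this sketch each language is given as a list with one entry per meaning on the word list: either the cognate-class number assigned by the linguist, or `None` where the language has no usable word. The data layout, the function name, and the language labels in the usage note are assumptions for illustration and do not describe the actual card format.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def cognate_counts(
    classes: Dict[str, List[Optional[int]]]
) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """For each language pair, return (words available for comparison, cognate pairs)."""
    results = {}
    for lang_a, lang_b in combinations(sorted(classes), 2):
        available = 0
        cognate = 0
        for class_a, class_b in zip(classes[lang_a], classes[lang_b]):
            if class_a is None or class_b is None:
                continue                  # word missing in one language: not comparable
            available += 1
            if class_a == class_b:
                cognate += 1              # linguist placed both words in the same class
        results[(lang_a, lang_b)] = (available, cognate)
    return results
```

For example, with `{"A": [1, 1, None, 2], "B": [1, 2, 1, 2]}` the pair ("A", "B") has three comparable items, two of them marked cognate. All of the linguistic judgment lies in the class numbers; the computer only tallies them.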
Programs that make more of the decisions include those for phonostatistical analysis,19 comparative reconstruction,20 and cognate and correspondence recognition.21 Undoubtedly, a more thorough search of the literature would reveal additional related projects.
In terms of the number of decisions required of the linguist, the present program seems to stand midway between the program described above and some of those that search for cognates and correspondences. The Fiji Lexical-Comparison Program requires the linguist to have done phonemic analyses of all the input dialects and to have discovered the sound correspondences, since they are at present an integral part of the program.
It is expected that when the proposed alterations of the program are completed, it will have a fairly high degree of accuracy. Aside from those items that will be affected by the improved syllable-comparison routine, 9 out of 182 were found in one printout to register low relation, while the linguist, on the basis of general comparative information, would possibly consider them related. But even for hand-counting, most of these examples were of the coin-tossing variety.
3.2 Future plans. The immediate plans for the Fiji Lexical-Comparison Program are, as indicated previously, first to complete the segments of the program necessary for handling all dialect-pair comparisons, and second to improve the syllable-pair-comparison program and the word-pair-evaluation routine. Once these alterations have been made, the program will be complete, and the problem of analyzing Fiji dialect lexicons will be reduced to that of adding to the input data and analyzing the computed dialect-pair relationships.
However, it may be possible to input lexical data from other related languages by alterations of the program that converts each syllable into distinctive features. For instance, some of the consonant correspondences shown by Tongan, one of the closest languages to Fijian, fit well into the phonetic distance scheme: F. /mb/–T. /p/, F. /s/–T. /h/, F. /gg/–T. /k/, F. /k/–T. /’/. Other correspondences such as F. /β/–T. /f/ would require minor changes; still others such as F. /#/–T. /?/, F. /y/–T. /#/, and F. /r/–T. /#/ would require major changes, since they involve multiple correspondences and several phoneme-to-zero correspondences. Vowel correspondences are even more complex, involving some assimilation in the Tongan forms. Another related language, Rotuman, has /f/ corresponding to Fijian /t/, illustrating the nonuniversality of the phonetic distance theory.
Using the present program for other languages or dialects with similar syllable structure will, of course, give an indication of phonetic similarity. But unless the syllable-comparison routine and the program that converts syllables into distinctive features are altered to fit each group, the results will be of little linguistic interest.
NOTES
1. The first-named author is primarily responsible for part 1 of this paper, the second-named author for parts 2 and 3.
2. Albert J. Schütz, ‘A Phonemic Typology of Fijian Dialects’, Oceanic Linguistics 2.62-79 (1963); Albert J. Schütz, A Dialect Survey of Viti Levu, Fiji (in preparation).
3. A. S. Gatschet, ‘Specimens of Fijian Dialects’, ed. from manuscript of the Rev. Lorimer Fison, Int. Z. allg. SprW. 2.194 (1885).
4. Albert J. Schütz, ‘Lexical Differences between Generations in Fiji’, Te Reo 6.28-29 (1963).
5. Since identity between phonemes of different systems is theoretically impossible, ‘identity’ here refers to identity of symbols. Since the phonological systems are generally similar, each symbol represents approximately the same distinctive features in all dialects for which it is used.
6. This is an inconsistency in the present design, a departure from phonetic similarity to an existing phonemic correspondence. A program for measuring only phonetic similarity would have to equate all glottal stops, whether they corresponded to /t/ or /k/.
7. In the overall pattern (all ‘nonidentical’ syllables from all dialects) there are 102 different syllables: 23 consonants combining with five vowels, plus individual occurrences of the vowels, minus certain sequences that do not occur.
8. Due to limitations in time, the sections of the program that write, read, and process this intermediate magnetic tape have not been completely coded. The only instructions which are missing are the specific input and output commands. Because this section of the program has been extensively checked by hand simulation, no major computational problems are anticipated when this section is completed.
9. Any further discussion of the control program — except for the conversion program — would consist of computational conventions and details which are not pertinent to the linguistic discussion of the program.
10. Thus, at the present time the program is limited to words of no more than nine syllables. However, the program has been designed so that this number can be changed by altering only one card in the symbolic deck of the Fiji Lexical-Comparison Program.
11. Because of the omissions listed in Section 1, the total of 200 is never used in the present program. The number may be changed by altering only one card in the symbolic deck of the Fiji Lexical-Comparison Program.
12. The major weakness of the current program is, as will be discussed later, the syllable-pair-comparison program. The present program does not yet account for the ‘adjacent’ and ‘non-adjacent’ columns described in part 1.
13. If two syllable pairs show the same degree of similarity by having the same syllable-pair-comparison value, the syllable pair with the greater similarity in syllable position is chosen. Thus, for example, in comparing two words with syllables WXYZ and VZYU, the syllable pairs Y-Y and Z-Z will have the same small syllable-pair-comparison value, but since Y is the third syllable in each word and Z is the second syllable in one word and the fourth syllable in the other, the syllable pair Y-Y will be chosen. Similarly, in comparing QRST and RTP the syllable pair R-R (first syllable in one word and second syllable in the other) would be chosen in place of the syllable pair T-T (second syllable in one word and fourth syllable in the other). If two syllable pairs have the same syllable-pair-comparison value and the same degree of similarity in syllable position, the first syllable pair found is chosen since it will be closer to the beginning of the words. Thus, for example, in comparing two words with syllables KLMN and OKNJ, the syllable pair K-K (first syllable in one word and second syllable in the other word) will be chosen before the syllable pair N-N (third syllable in one word and fourth syllable in the other word).
14. The calculation used to select the second and subsequent syllable pairs is exactly the same as the calculation used to select the first syllable pair. If consideration of the position of the syllables is required by this calculation, the syllable position in the full word is used and not the position in the shortened word.
15. At this point in the computation the list which is being built contains six numbers. For the purpose of summarizing this much of the calculation, the values of these six numbers are: 1) the number supplying the location of the syllable in the first word which shows the maximum degree of similarity to a syllable in the second word; 2) the number supplying the location of that syllable in the second word; and 3) the value of the comparison of these two syllables, followed by: 4) the number supplying the location of the syllable in the first word (other than the syllable given in 1) above) which shows the maximum degree of similarity to a syllable in the second word (other than the syllable given in 2) above); 5) the number supplying the location of this newly chosen syllable in the second word; and 6) the value of the comparison of these two syllables. On the list itself these numbers appear in the sequence 3, 1, 2, 6, 4, and 5.
16. The syllable-pair-comparison values set aside during the first part of this computation are used for this calculation. If the syllable in the longer word has the same degree of similarity with two syllables in the shorter word, the syllable positions are used to select which syllable is to be chosen.
17. The current computation essentially considers that two words are related if: 1) more than half of the syllables of the longer word are identical with syllables in the shorter word; and 2) the longer of the two words has fewer syllables than two plus twice the number of syllables in the shorter of the two words. A more restrictive criterion for relation (a definition of cognate for purposes of glottochronology) is that regular phonemic correspondence is required, at least to the extent of 75 per cent of the phonemes in each member of the pair. Fred W. Householder, Jr., ‘Validity of Glottochronology’, Current Anthropology 5.326 (1964).
18. See John B. Carroll and Isidore Dyen, ‘High-speed computation of lexico-statistical indices’, Lg. 38.274-8 (1962) for a discussion of the computational technique used, and Isidore Dyen, ‘The Lexicostatistical classification of the Malayopolynesian languages’, Lg. 38.38-46 (1962) for a discussion of the results obtained by this program.
19. Howard P. McKaughan, ‘A Study of Divergence in Four New Guinea Languages’, American Anthropologist 66.98-120 (1964).
20. Martin Kay, The Logic of Cognate Recognition in Historical Linguistics, Rand Memorandum RM-4224-PR (Sept., 1964). C. M. Naim, ‘A Program for Partial Automation of Comparative Reconstruction’, Anthropological Linguistics 4:9.1-10 (Dec. 1962).
21. H. A. Gleason, Jr., ‘Genetic Relationship among Languages’, Structure of Language and its Mathematical Aspects 179-89 (1961).
Flowchart 5.1
Flowchart 5.2
Flowchart 5.3