“Computation in Linguistics: A Case Book”
DESCRIPTIVE PROBLEMS — PHONOLOGY
Automatic Spelling-to-Sound Conversion
1.0 INTRODUCTION
The analysis of English spelling-to-sound relationships has recently become an interest of several disciplines. Educators, after years of debate over the relative merits of phonics, the whole-word method, and the various combinations of these two extremes, are attempting to make a thorough analysis of spelling-to-sound correspondences as a prelude to the improvement of reading instruction.1 Engineers, with the advent of marketable (but limited) optical scanners and speech synthesis devices, have turned their attention in the design of a reading machine for the blind toward the network which would translate between the spelling detected by the optical scanner2 and the sound stream produced by the synthesis system.3 Linguists, viewing the widespread use of digital computers in other areas of language analysis, are attempting to convert lengthy printed texts to phonetic form for various linguistic studies.4
1.1 Problem. While the ultimate goals of these three groups are quite different, one basic problem is the same — to ascertain the factors which control the conversion of spelling into sound. The purpose of this paper is to examine the practicality of automatic spelling-to-sound conversion for English, taking into account the factors which must be considered in predicting sound from spelling and the methods of implementing these factors on a general purpose digital computer. While much of the information used in this study was derived from studies for the improvement of reading instruction,5 only the problems of automatic conversion will be considered here.
1.2 Scope. To avoid lengthy definitions and extensive qualifying verbiage, I will limit this discussion to the consideration of a program for a general purpose digital computer that receives distinct input signals for each of the 26 letters of the Roman alphabet and for the blank space, and produces distinct output signals in turn for members of a set of phonemes. Modifications for hyphenated words, apostrophes, and other punctuation marks would be minor and have little bearing upon the basic conversion problem. Attention will be placed upon spelling-to-sound relationships in isolated words, without considering interword alternations and intonation contours. While the phonetically pleasing reading machine must handle these problems, the basic problems are met in dealing with isolated words and must be attended to first. The conversion from phonemic to phonetic output poses no theoretical problems. Assuming that we have a complete description of the allophones of some widely spoken English dialect, then the selection of the proper phonetic form for any phonemic word could be done on the basis of relatively simple decision rules. This is not to say that the design of the synthesis unit is simple — it obviously is not — but such considerations are beyond the scope of this paper.
2.0 PROPOSED SOLUTIONS
Three conversion schemes will be considered: straight dictionary look-up, dictionary look-up with preliminary segmentation, and the algorithm system. To aid in the comparison of these schemes, I will set an acceptability criterion of 96 per cent recognition of the graphemic words in common texts.6 This means that each scheme must be accurate enough to correctly pronounce at least 96 per cent of the graphemic words it encounters. According to the Thorndike-Lorge word counts,7 the most common 20,000 English words (words in the common sense — see note 8) would account for this level. Statistics are not available on the number of different words this would be, since the Thorndike counts include common derivatives under the same base form.8 An estimate of 50,000-60,000 different graphemic words, based upon an examination of the Thorndike-Lorge list, will be assumed for this report.9
In considering the three schemes, the most important questions to be answered are the following:
1) Can the scheme be used to reach the acceptability level?
2) Is the scheme practical in terms of equipment and programming costs?
3) How easily can the scheme handle special vocabularies?
If the answer to question 1 is not affirmative, then the other two questions need not be asked — the scheme is totally unacceptable. Question 2 concerns practicality: a theoretically sound idea that is too expensive to implement has little use. Practicality, for the most part, will be defined relatively, by comparing the equipment needed and programming effort required to implement each scheme. Whether or not the most practical of the schemes considered here will be practical enough to attract an investor is not discussed here.
The third question is a corollary of the second. Given that we have a working, practical scheme, how easily can it be adapted to the vocabulary requirements of law, or metallurgy, or embryology, or of any other discipline? Dendrite, acetylcholine, and nucleolus are not members of the select 20,000, but an automatic converter used for reading neurological literature would bump into them with alarming frequency. A conversion system that could not be altered to handle special vocabularies would have a restricted use.
3.0 STRAIGHT DICTIONARY LOOK-UP
The simplest, surest scheme is straight dictionary look-up. Input words are matched against dictionary entries until a match is found. The code for the corresponding phonemic form, stored in the dictionary in some location relative to the graphemic form, is then transmitted to the synthesis section. If a match is not found, a null signal indicating ‘no match’ is transmitted.10 There is no doubt that such a scheme will work. All we need is a standard digital computer with an auxiliary random access storage device (drum, disk, auxiliary core storage, etc.).11 Through trial and error we could establish what words to include in the dictionary to reach the criterion level.
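In present-day programming terms the look-up logic reduces to a few lines. The sketch below is merely illustrative: a small in-memory table stands in for the auxiliary storage unit, and the entries and phonemic codes are invented for the example.

    # A minimal sketch of straight dictionary look-up. The table and its
    # phonemic codes are invented; a real system would hold 50,000-60,000
    # entries on an auxiliary random access storage device.
    PHONEMIC_DICTIONARY = {
        "love":  "lʌv",
        "loved": "lʌvd",
        "lover": "lʌvɚ",
    }

    def convert(word):
        # Return the stored phonemic code, or None as the 'no match' signal.
        return PHONEMIC_DICTIONARY.get(word)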
Programming such a system would be relatively simple, but the hardware cost seems excessive when we consider the items in the auxiliary dictionary storage unit. Besides containing a word like love and its corresponding phonemic code, the dictionary must also contain loved, lover, lovers, loving, lovingly, lovelorn and their corresponding phonemic codes (to say nothing of unloved, loveless, and lovable). The question now becomes, is it possible to reduce the size of the dictionary by segmenting the input words before the look-up process? This possibility will be considered in section 5.
As a final note, this scheme is easy to modify; a dictionary update routine would add the new entries in appropriate places in the dictionary and make necessary address modifications in the look-up program. While this may involve several hours of running time for merging the new entries with the dictionary, computing new locations, and modifying the look-up program, no reprogramming would be necessary.
4.0 ALGORITHM SYSTEM
4.1 General evaluation. The opposite extreme from the straight look-up system is the algorithm system, wherein phonemic output is computed from the graphemic input. Such a system would presumably scan the input words12 and decide, on the basis of programmed decision criteria, what output to generate for each grapheme or grapheme cluster it encountered. Rules like ‘final ue after g, except in argue, is silent’ and ‘gh in word initial position becomes the phoneme g’ would be built into the program along with rules for segmenting polymorphemic input forms into smaller units. From an overhead standpoint, such a system would be preferable to the dictionary look-up system since an auxiliary random access storage would not be needed, assuming that the conversion program did not exceed the limits of a standard storage unit. Modifications would be less necessary than under the look-up schemes, but where necessary would require reprogramming. (Results of recent research indicate that rules which would predict the pronunciations of the most common 20,000 words would equally well predict the pronunciations of the next 20,000 most common words.13)
The drawback to this scheme is, however, that it cannot reach our criterion level and still be practical. The factors which figure into spelling-to-sound relationships cannot be programmed efficiently. In many cases we would need to know stress placement, form class, functor-contentive membership, and even etymology. In addition, thousands of exceptions to regular patterns would have to be listed as whole words along with at least part of their pronunciations. These factors are explained in the examples that follow. I will forego a discussion of the segmentation process for the present since that topic will be covered in the discussion of the next scheme (section 5). In the examples below I am assuming that some type of segmentation has already taken place.
4.2 Stress placement. Stress placement is important not only for predicting vowel quality, but also for predicting voicing or non-voicing of particular consonants and for determining whether or not the clusters [tj], [dj], [sj], and [zj] are palatalized to [č], [ǰ], [š], and [ž] before a vowel. Stress also plays a part in the backing of [n] to [ŋ] before velar stops, but this problem can be handled by segmenting.14
Vowel neutralization in English has been discussed elsewhere and needs little comment here.15 Some examples borrowed from C. K. Thomas16 and elsewhere to illustrate this problem are shown below.
Stressed vowel | Unstressed vowel
today [tədé] | Monday [mʌ́ndɪ]
man [mæn] | woman [wúmən]
ask [æsk] | askance [əskǽns]
coral [kórəl] | corral [kərǽl]
matinee [mætiné] | coffee [kófɪ]
surface [sɝ́fɪs] | face [fés]
advice [ædváis] | crevice [krέvɪs]
Knowing the stress pattern does not guarantee correct conversion of vowel graphemes to phonemes, since various other factors enter into this relationship, but without considering stress, little regularity can be found in the vowel grapheme-to-phoneme relationships.
The grapheme <x> has two basic pronunciations: viz. [ks] as in exit, exercise and [gz] as in exert, examine.17 Whether the voiced or unvoiced cluster occurs is generally determined by stress placement. If the primary stress occurs on the vowel preceding <x>, the corresponding phoneme cluster is unvoiced; otherwise it is voiced. (Preconsonantal and final <x> are always unvoiced. Initial <x>, pronounced [z], can be handled like intervocalic <x>, giving first the cluster [gz]. Then, by the same phonotactical rules which level initial [kn] (knee, know) and [gn] (gnat, gnaw) to [n], [gz] would be leveled to [z].)
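The voicing rule itself is trivial to state as a procedure; the difficulty, as the rest of this section shows, lies in obtaining the stress information it presupposes. A sketch, assuming stress placement is somehow supplied from outside:

    def x_cluster(stress_on_preceding_vowel, preconsonantal_or_final=False):
        # Unvoiced [ks] when the primary stress falls on the vowel before <x>
        # (exit, exercise) or when <x> is preconsonantal or final;
        # voiced [gz] otherwise (exert, examine).
        if preconsonantal_or_final or stress_on_preceding_vowel:
            return "ks"
        return "gz"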
Evidence from the early Modern English period indicates that the clusters [tj], [dj], [sj], and [zj] before an unstressed vowel became [č], [ǰ], [š], and [ž] as in fortune, arduous, cynosure, and treasure.18 Since the spellings of the earlier forms have survived, we must, in order to predict the correct pronunciation, duplicate, in a sense, the palatalization process. Thus, in fortune and importune we would convert first to [fortjun] and [importjun] and then, knowing that fortune is stressed on the first syllable and importune on the last, convert to the forms [fórčən] and [ɪmpɚtjún]. Without knowing the stress patterns we could not produce the correct phonemic forms in these cases.
As further examples of the palatalized-unpalatalized cases, compare the stress placements in the following lists:
Palatalized | Unpalatalized
closure | assume
leisure | gratuity
capitulate | institute
congratulate | assiduity
creature | seduce
credulous | credulity
Stress prediction based upon the graphemic shape of a word alone cannot be accomplished with any high degree of accuracy. Consider, as an example, one of the best known and supposedly exceptionless rules for stress prediction — the requirement of penult stress before the adjectival ending <-ic>. First, there is a short list of exceptions: Arabic, catholic, choleric, lunatic, and politic. Secondly, we must distinguish between nouns and adjectives with <-ic> endings, and this is impossible on the graphemic level. Therefore, forms like arsenic, arithmetic, heretic and rhetoric would also have to be included in the dictionary and marked as exceptions to the <-ic> stress rule.
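The rule itself is easy to program; it is the exception list and the unresolved noun-adjective ambiguity that defeat it. A sketch of how the rule and its exceptions might be encoded (word lists abbreviated and illustrative):

    # Exceptions to penult stress before <-ic>, including noun forms that
    # cannot be distinguished from adjectives on the graphemic level.
    IC_STRESS_EXCEPTIONS = {"Arabic", "catholic", "choleric", "lunatic", "politic",
                            "arsenic", "arithmetic", "heretic", "rhetoric"}

    def penult_stress_before_ic(word):
        # True when the penult-stress rule for <-ic> may be applied.
        return word.endswith("ic") and word not in IC_STRESS_EXCEPTIONS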
The following passage from C. K. Thomas is offered as a final discouraging note on automatic stress prediction:
What determines syllable stress? No completely satisfactory answer can be given, but two historical tendencies may be noted. In native English words, such as the series love, lovely, lovable, loveliness, lovableness, the stress remains on the root syllable. But in words of Greek or Latin origin, such as the series photograph, photography, photographic, or the series equal, equality, equalization, equalitarian, the stress shifts from one syllable to another as the word is lengthened. Words like love illustrate the so called fixed or recessive stress of the native English and Germanic tradition. Words like equal and photograph illustrate the so called free or variable stress of the Greco-Latin tradition.19
4.3 Other conditioning factors. The two lists shown below illustrate another conditioning factor.
[ð] | [θ]
the | thaw |
them | theft |
then | theme |
there | thermal |
these | thick |
they | thin |
thither | thistle |
thou | thong |
though | thug |
thy | thigh |
In words beginning with the cluster <th> (there is a total of 170 such words in the Thorndike-Lorge 20,000 word list), the corresponding pronunciations [ð] and [θ] are obtained by knowing whether the word is a functor or contentive.20 Thus, functors have the voiced interdental spirant and contentives have the unvoiced. Without including such information in the algorithms, the correct pronunciation for these words could not be obtained. But the only way to build such information into a system is to list all words beginning with <th>, and code each one with its functor-contentive class membership. If these were the only words that required this process, we might tolerate the listing, but they are not. Compare the pronunciation of the last <a> in these words:
[ej] | [ɪ]
abate | celibate |
placate | affricate |
vacate | frigate |
deprecate | intermediate |
dedicate | collegiate |
duplicate | duplicate |
communicate | novitiate |
fabricate | prelate |
To determine the correct pronunciation for these forms we must know their form classes. Verbs have the [ej] pronunciation and nouns and adjectives have [ɪ]. Once again, we must include a dictionary in the system — only now we are concerned with over 500 words that have this ending.
The correct pronunciation of <-ng-> also depends upon form class. Compare the following:
[ŋ] | [ŋg]
longer (n) | longer (adj.) |
singer | stronger |
swinger (one who swings) | younger |
The <g> in <ng> followed by the agentive <-er> is silent, but before the comparative <-er> it becomes [g].
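What all of these cases share is that the choice of pronunciation hinges on information absent from the graphemic level. Encoded as a procedure, the required exception list would look something like the sketch below (entries and phonemic codes are invented for illustration); the point is that form_class must be supplied from outside, which is precisely what a spelling-only converter cannot do.

    # Sketch of a form-class-keyed exception list; codes are illustrative.
    FORM_CLASS_EXCEPTIONS = {
        "duplicate": {"verb": "djúplɪkèt", "noun": "djúplɪkɪt"},
        "longer":    {"noun": "lɔ́ŋɚ",     "adj":  "lɔ́ŋgɚ"},
        "the":       {"functor": "ðə"},
    }

    def pronounce(word, form_class):
        entry = FORM_CLASS_EXCEPTIONS.get(word)
        return entry.get(form_class) if entry else None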
4.4 Further drawbacks. All of these problems could probably be tolerated if algorithms existed for predicting pronunciations in all other cases, but unfortunately this is not the situation. In numerous cases, rules cannot be formulated for predicting the various pronunciations of the same grapheme or grapheme cluster. <ch>, for example, has three different pronunciations, depending mostly upon the etymology of the word. In words derived from Old English, Old French, and early Middle French, the pronunciation is [č] as in church < OE circe and chief < OF chef. Words borrowed from French after the early Middle French period are pronounced with [š] as in chef < F. chef, and those borrowed ultimately from Greek generally have [k] as in chord. Since there are few reliable clues to etymology in the graphemic patterns,21 most words containing <ch> would have to be contained in a dictionary along with the proper pronunciation for that cluster.22 This includes about 500 words.
Words ending in <-ine> have no regular pronunciation either, as shown below, and would also have to be included in a dictionary (approximately 150 words):
[aɪ] | [i] | [ɪ]
alkaline | vaseline | crystalline |
valentine | libertine | destine |
divine | gasoline | doctrine |
genuine | nectarine | engine |
cosine | vaccine | famine |
turpentine | routine | urine |
Similarly, words containing <gh> (30 words), words containing <ou> and <ow> not pronounced [au] (approx. 370 words) and words containing irregular pronunciations for <o> (approx. 300) would have to be included in this list, to name just a few. It appears that from 4,000 to 6,000 words containing irregular grapheme-phoneme correspondences have to be included in a dictionary, assuming that the segmentation and stress prediction routines had no exceptions.
With the large number of words whose pronunciations cannot be predicted from their graphemic shapes alone and the added problems of stress prediction and of segmentation, which will be discussed shortly, the algorithm method holds little promise as a practical spelling-to-sound conversion scheme.
5.0 DICTIONARY LOOK-UP WITH PRELIMINARY SEGMENTATION
Since the algorithm scheme is not practical for automated spelling-to-sound conversion, attention should turn to improving the only workable scheme, dictionary look-up. Of the 50 or 60 thousand words which must be stored to meet the 96 per cent recognition acceptability criterion, only about 20,000 are basic words. The remainder are regular inflected and derived forms of these words, formed by adding <-ed>, <-ing>, <-ly>, and so on to the basic words. Since all of these derivatives are regular in some sense,23 it would appear that a segmentation scheme would be easy to specify and would, at a minimum, reduce the dictionary storage requirements by 50 per cent.
A segmentation program would scan each input word, looking for designated initial and final grapheme strings like <un->, <in->, <-ment>, <-ness>, and <-able>. These strings would then be stripped from the word and the remainder matched against the dictionary entries. If a match were found, the output would be the phonemic code found in the dictionary plus a phonemic code for each segment stripped from the word. These latter phonemic codes would be stored in a table in the segmentation program. Thus, an input word like workable would first be split into work + able. The section <work> would be matched against dictionary entries until a match were found and the phonemic code for [wɝk] retrieved. To this would be added the code for [-əbəl] and the full form [wɝ́kəbəl] passed to the synthesis unit. In this way the pronunciation for <-able> and other segmentable strings would be stored only once and only one dictionary entry would be necessary for each paradigm (or, occasionally, set of paradigms; for example, the paradigms for love (v) and love (n) would be accounted for by a single entry in the dictionary, assuming that all affixes could be segmented).
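As a sketch of the process (with a one-word dictionary and invented codes), the segmentation step amounts to trying the whole word first and then retrying after stripping each designated final string:

    BASE_FORMS   = {"work": "wɝk"}                  # one entry per paradigm
    SUFFIX_CODES = {"able": "əbəl", "ing": "ɪŋ"}    # stored once, in the program

    def segment_and_convert(word):
        # Whole-word match first, then strip one designated final string.
        if word in BASE_FORMS:
            return BASE_FORMS[word]
        for suffix, code in SUFFIX_CODES.items():
            if word.endswith(suffix):
                base = BASE_FORMS.get(word[:-len(suffix)])
                if base is not None:
                    return base + code
        return None                                 # the 'no match' signal

    # segment_and_convert("workable") yields "wɝkəbəl"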
Not all segmentation is so simple, however. With words like blamable, we are left with <blam-> after removing the suffix <-able>, but the base form is blame, which does not match this form. One solution to this problem is to remove final <e> from dictionary entries and from input forms, but this would require some means for recording that a final <e> was once present to avoid merging forms like bite: bit, paste: past, here: her, spare: spar. To do this, the dictionary must have not only a phonemic code for each entry, but also an inflectional code that records whether or not final <e> must be present in any of its forms. The allocation of the storage necessary to hold the inflectional codes is not so drastic a cost if, by incurring it, the 200,000 to 300,000 characters used to store regular derivatives can be eliminated.
Suffixes like <-ly>, <-ic>, and <-ity> require graphemic and phonemic changes in the base form and probably cannot be segmented economically. <-ly>, for example, after polysyllables ending in <C-y>, requires the base to change <y> to <i>. Thus, angry: angrily. The <y> to <i> alternation could be handled by adding an additional bit to the inflectional code, but some notion of the functional load of this suffix has to be computed before the decrease in storage requirements can be weighed against the increase in segmenting time.
<-ic> offers more serious segmentation problems, as do <-ity> and numerous others. Consider the forms symmetry: symmetric and sane: sanity; aside from the graphemic alternations <y> → <i> and <e> → <ø>, there are also phonemic alternations and a stress shift, viz. [sɪ́mɪtrɪ]: [sɪmέtrɪk] and [sén]: [sǽnɪtɪ].
In these cases, to store one form for symmetry/symmetric and one form for sane/sanity would require not only an inflectional code to indicate the stress shift, but also codes for the vowel alternations. Simple one-bit codes will not be sufficient for the phonemic alternations; in fact, the only recourse is to store all of the forms of all phonemes involved. This requires that a code word be used to indicate which phonemes (and stresses) belong with which affixes — and where to find the alternants for any form. Not only is there a storage problem here, but also a programming problem, in that considerable time must be consumed in locating the phoneme and stress alternation code, and in substituting the proper units for the affix concerned. At this point segmentation has obviously passed beyond the bounds of practicability.
In many cases a graphemic string identical to a common prefix or suffix may be part of the base form, e.g. <un> in union and uncle. In these cases we could store ion and cle in the dictionary and reserve one bit to indicate when the form is not affix plus base. The full pronunciation of union and uncle would be stored also, and the segmenting routine, once it found from the affix code that <un> were not a prefix, would output the phonemic form retrieved from the dictionary without prefixing the phonemic code for /ən-/ to it. This process depends, of course, upon not having other words beginning with common prefixes and ending in <cle> or <ion>. Such cases, however, are rare and can be safely ignored.
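The one-bit flag might be realized as in the following sketch (entries and codes invented): each stored remainder carries its full pronunciation and a note of which pseudo-prefix it absorbs, so the prefix code is prepended only for genuine affix-plus-base forms.

    DICTIONARY = {
        "ion": {"code": "júnjən", "pseudo_prefix": "un"},   # union, stored whole
        "cle": {"code": "ʌ́ŋkəl",  "pseudo_prefix": "un"},   # uncle, stored whole
        "tie": {"code": "taɪ",    "pseudo_prefix": None},   # untie = un + tie
    }
    PREFIX_CODES = {"un": "ən"}

    def convert_prefixed(prefix, remainder):
        entry = DICTIONARY.get(remainder)
        if entry is None:
            return None
        if entry["pseudo_prefix"] == prefix:    # <un> was part of the base
            return entry["code"]
        return PREFIX_CODES[prefix] + entry["code"]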
The full scope of segmentation will have to be determined by trial and error. Theoretically, the determining process is simple. Select a common affix, find all of the base forms to which it can be affixed, and tabulate the graphemic and phonemic alternations between the base forms and the affixed forms.
If there are no alternations, or if there are alternations which can be handled easily, then the affix is segmentable. What ‘handled easily’ means is difficult to make explicit. Somehow the system designer must balance storage space and total look-up time (including segmentation) to meet prescribed specifications. The cost of segmentation consists of increases in program size, average look-up time, and in overall system complexity — a factor which may affect both development cost and adaptability.
6.0 CONCLUSIONS
The most practical spelling-to-sound conversion scheme is the look-up scheme with limited preliminary segmentation. Except for the segmentation, this approach is quite inelegant from both a linguistic and a programming standpoint — in fact, it smacks of brute force. But the ravages of sound change, promiscuous borrowing, and scribal dilettantism leave us with little choice.
Lack of elegance nevertheless should not prevent us from making the most of this conversion scheme. It will, in spite of its inadequacies, produce the desired results. Furthermore, additional research may show how to reduce significantly the size of the stored dictionary. To this end, the following areas should be considered.24
6.1 Storage by word length. Since an optical scanner could count, with little additional circuitry, the number of graphemes in each word it recognized, word length could be used to shorten search time. Dictionary entries, whatever they turn out to be, could be separated first into groups according to length and then organized within each group according to some other feature. This procedure would also save considerable storage space, since the same number of characters would not have to be allotted for every word; rather, a different character length could be assigned to each length group.
A table in the look-up program would contain the first dictionary-address for each length group while the character-count used to reach the next item location during searching would be the length of the search item itself (assuming character addressing). This process should reduce the dictionary storage requirements by over 50 per cent.
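A sketch of the arrangement, with a toy packed character store and invented addresses, shows how the first-address table and the search stride interact:

    # Entries packed by length group in one character store; contents invented.
    STORE = "andaskman" + "facelovework"

    FIRST_ADDRESS = {3: 0, 4: 9}     # first address of each length group
    GROUP_END     = {3: 9, 4: 21}    # address just past each group

    def find(word):
        n = len(word)
        addr = FIRST_ADDRESS.get(n)
        if addr is None:
            return None
        while addr < GROUP_END[n]:
            if STORE[addr:addr + n] == word:
                return addr          # phonemic code stored relative to this address
            addr += n                # stride is the search item's own length
        return None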
6.2 Type/token considerations. The 1,000 most common words in the Thorndike-Lorge list account for over 75 per cent of all textual words. If, therefore, the dictionary were organized by length groups, then within each length group the next feature of organization should probably be relative frequency of occurrence. It may, however, be better to divide each length group into two groups — one containing words in the first 1,000 and the other all the remaining — and then alphabetize each of the two groups. The most frequently occurring group would be searched first, using a binary search. If the word were not found there, then the second group would be searched in the same way. Once a complete dictionary list is obtained, the relative search times for these techniques can be computed and compared.
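A sketch of the two-tier search (group contents invented): each tier is alphabetized and probed with a binary search, the high-frequency tier first.

    from bisect import bisect_left

    def contains(sorted_words, word):
        # Standard binary search within one alphabetized group.
        i = bisect_left(sorted_words, word)
        return i < len(sorted_words) and sorted_words[i] == word

    # Hypothetical split of one length group: words from the first 1,000
    # in one tier, all the remaining words in the other.
    frequent = ["have", "that", "this", "with"]
    rest     = ["daft", "dint", "whet"]

    def find(word):
        return contains(frequent, word) or contains(rest, word)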
6.3 Recoding. Only 27 different characters are utilized in storing the graphemic words. Assuming a six-bit character, this leaves 37 unused characters which could be used to recode frequently occurring initial and final grapheme sequences and thus further reduce dictionary storage requirements. Leading candidates for recoding are nonsegmentable affixes, and clusters like <ch>, <gh>, <ph>, <sh>, <th>, and <qu>.
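The effect of recoding can be sketched as follows; the particular code assignments are of course invented:

    # Spare six-bit codes (64 - 27 = 37 of them) stand for common clusters;
    # each recoded cluster shrinks a stored word by one character.
    CLUSTER_CODES = {"ch": "\x1b", "th": "\x1c", "sh": "\x1d", "qu": "\x1e"}

    def recode(word):
        for cluster, code in CLUSTER_CODES.items():
            word = word.replace(cluster, code)
        return word

    # recode("church") occupies 4 character locations instead of 6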
If we record for each length group the address and length of each subgroup beginning with a different grapheme, then the initial grapheme could be omitted from every graphemic word in the dictionary — a saving of from 20,000 to 40,000 character locations. The first letter tables would require, in turn, only about 1,000 character locations in central storage.
These considerations may not eliminate the need for an auxiliary storage unit, but every reduction should help make the conversion scheme more practical. Now that we know automatic spelling-to-sound conversion is possible, practicality is the factor which will eventually determine whether or not such conversion is ever implemented successfully.
NOTES
1. For a summary of current research, see John B. Carroll, ‘The analysis of reading instruction: perspectives from psychology and linguistics’, Yearbook of the National Society for the Study of Education 336-53 (1964).
2. A technical survey of optical scanners, including characteristics of commercial devices and an extensive bibliography, can be found in Howard Falk, ‘Optical character recognition systems’, Electro-Technology 74. 42-52, 160 (1964). J. H. Davis, ‘Print recognition apparatus for blind readers’, J. British Inst. Radio Engrs. 24:2. 103-10 (Aug., 1962), contains an adequate discussion of the requirements of optical recognition systems for reading machines. F. S. Cooper, ‘Speech from stored data’, IEEE Int. Convention Rec. 7:7. 137-49 (1963), describes current notions in speech synthesis. Included are some rudimentary comments on spelling-to-sound conversion.
While neither scanners nor speech synthesizers are developed sufficiently for the needs of a marketable reading machine for the blind, the state of the art is such that we could now throw together a working, experimental device with existing hardware, assuming that funds were available.
3. The earliest reading machine on record, demonstrated by its inventor E. E. Fournier d’Albe in 1913, did not produce speechlike signals, but rather converted each distinct letter into a different musical chord. (Cf. E. E. Fournier d’Albe, ‘On a direct-reading octophone’, Proc. Royal Soc. A. 90. 373 (1914).) Tactile output devices have been suggested by numerous writers (cf. J. H. Davis, note 2) and one was even patented in 1932 in England. For a discussion of more current work on reading machines, see Chapters 17-25 in Human Factors in Technology, edited by E. Bennett, J. Degan, and J. Spiegel (New York, 1963).
4. For a general survey of computer applications in linguistics, see Paul L. Garvin (ed.), Natural language and the computer (New York, 1963). For more detailed discussions of specific problems, consult the ‘Humanities Applications’ section of Computing Reviews, published six times each year by the Association for Computing Machinery.
5. My main source of evidence for spelling-to-sound correspondences has been research done by Professor R. H. Weir and me at Stanford University under United States Office of Education contract Nos. OE-4-10-213 and OE-4-10-206. Our data consists of spelling-to-sound tabulations for the most common 20,000 English words, taken from E. L. Thorndike and I. Lorge, The teachers’ word book of 30,000 words (New York, 1941). These tabulations were obtained through a computer program written for the CDC 1604. Cf. R. L. Venezky, ‘A computer program for deriving spelling-to-sound correlations’, MA thesis (Cornell University, 1962), published in part in H. Levin, et al., A basic research program on reading (Ithaca, New York, 1963).
6. By ‘common texts’ is meant daily newspapers, popular (nontechnical) magazines, and most fictional prose. Since the largest unit to be considered is the word, homograph problems like read /rid/ : /rέd/ cannot be resolved completely and therefore are not considered in determining the acceptability of any conversion scheme. One of the variant pronunciations for such forms will be arbitrarily selected as correct and the others will be counted as incorrect. The number of such cases will be extremely small, however.
7. Cf. especially F. W. Harwood and A. M. Wright, ‘A statistical study of English word formation’, Lg. 32. 260-73 (1956). Type-token relations for the Thorndike-Lorge list are given on page 261 of Harwood and Wright.
8. ‘Regular plurals, comparatives and superlatives, verb forms in s, d, ed, and ing, past participles formed by adding n, . . . are ordinarily counted under the main word’. Thorndike-Lorge p. ix.
9. F. S. Cooper (see note 2 above) estimates that a 7,000 word dictionary will account for 95 per cent of all text and that 20,000 words will account for at least 99 per cent. He does not, however, cite any evidence for his figures, which contrast quite radically with the Thorndike-Lorge counts, nor is he clear on whether or not his 7,000 and 20,000 word figures include derivatives formed with common suffixes.
10. As an alternative to the ‘no match’ signal, a phonetic form of the spelling could be output. For example, if the input word Peoria were not found in the dictionary, the system would output /pi+i+o+ar+aɪ+e/ with, possibly, a high frequency tone before the word to signal a capital letter, assuming that upper-lower case distinctions were retained in the signal from the optical scanner.
11. Assuming that we need to store approximately 50,000 English words with their phonemic codes, with an average of eight characters per word (graphemic or phonemic), we would require an 800,000 character storage. This is far in excess of the standard storage units on present-day digital computers (200,000-260,000 characters: e.g. CDC 3600; IBM 360/50; Univac 1107/8). 800,000 characters is a minimum figure, however, since for an efficient look-up system we may have to allocate the same number of characters for each dictionary entry, regardless of whether we assume a machine with character addressing, word addressing, or both. This may raise our storage requirements to over 2 million characters. Means for reducing this storage requirement are discussed in the conclusion of this paper.
12. Because of (1) the predominant occurrence of regressive assimilation over progressive assimilation in English, (2) stress (and vowel) conditioning by various suffixes, and (3) terminal graphemic markers like <-e>, scanning from right to left would be the most economical scanning procedure.
13. See note 5 above.
14. Cf. C. K. Thomas, Phonetics of American English² 82 (New York, 1958). The transcriptions used in this report follow those of John S. Kenyon and Thomas A. Knott, A pronouncing dictionary of American English (Springfield, Mass., 1941).
15. Cf. M. D. Berger, ‘Neutralization in American English vowels’, WORD 5. 255-7 (1949) and John S. Kenyon, American pronunciation 101-10 (Ann Arbor, 1943).
16. Op. cit., 153.
17. Through palatalization, [ks] and [gz] become [kš] and [gž] as in luxury and luxurious. Note also that the [ks]-[gz] distinction is not a case of stress conditioning alone. Only clusters spelled <x> behave in this fashion, cf. accede, accept. See E. J. Dobson, English Pronunciation 1500-1700 2. 935 (Oxford, 1957).
18. Cf. Otto Jespersen, A Modern English grammar on historical principles 1. 341-9 (London, 1961), and H. C. Wyld, A history of modern colloquial English 293-4 (London, 1920).
19. Op. cit., 148. For some interesting work on stress prediction, see Roger Kingdon, The groundwork of English stress (London, 1958).
20. For definitions of functor and contentive, see C. F. Hockett, A course in modern linguistics (New York, 1958).
21. J. Vachek has attempted to utilize the concept of synchronically foreign characteristics to explain certain English sound changes. Cf. J. Vachek, ‘On the interplay of external and internal factors in the development of language’, Lingua 11. 433-88 (1962). While certain graphemic patterns are obviously foreign, e.g. <mn->, <hypo->, the spelling-to-sound correspondences in such forms have been irregularly assimilated to English patterns, so that few consistent rules can be based upon such a notion. (Cf., for example, the discussion of the pronunciation of the Greek form hypo- in the New English Dictionary, 5. 505.)
22. Some, but not very many, <ch> spellings have predictable pronunciations. Initial <ch> before <l> or <r> is [k] — chlorine, chrome, etc., and final <tch> is always [č], match, latch, etc. These account for less than 10 per cent of all <ch> spellings.
23. Cf. Stanley S. Newman, ‘English suffixation: a descriptive approach’, WORD 4. 24-36 (1948) for an excellent introduction to the linguistic aspects of English suffixation, and E. L. Thorndike, The teaching of English suffixes (New York, 1941), for type-token statistics on common English suffixes.
24. Much of the discussion which follows concerns dictionary look-up schemes. I have, at present, no data to support the selection of any one dictionary look-up scheme over another. This choice must be based upon required look-up speed, maximum allotted storage space, and certain characteristics of the entries which are to be stored in the dictionary. Look-up schemes used by machine-translation research groups are described in various research reports from these projects. For a bibliography, consult Charles F. Balz, Literature on information retrieval and machine translation (White Plains, New York: IBM Corp., 1962). See also C. E. Price, ‘Table look-up techniques for computer programming’, Report Number K-DP-515 (Oak Ridge, Tennessee: Union Carbide Corporation, 1965).