2. THREE APPROACHES TO LANGUAGE BEHAVIOR
By way of orientation to the theoretical analyses and research proposals which form the body of the report, this section provides a brief summary of each of the three major approaches to language study under investigation: linguistics, learning theory, and information theory. These introductions are aimed, as it were, at those in other disciplines—the linguistics summary is written for non-linguists, the learning theory summary for non-psychologists, and so forth. This means that the specialist in each area may find much to take exception to, much that he would present differently, and this is to be expected. These summaries are also intended to be non-controversial, but our need for conceptions adequate to handle psycholinguistic problems has undoubtedly influenced the emphases given, particularly in the learning theory and linguistics sections. For the reader who wishes to go further into the details of each field, a limited annotated bibliography accompanies each summary.
2.1. The Linguistic Approach
As distinct from psychology, which is concerned with verbal behavior in the context of events occurring within the organism, and from the other social sciences, which analyze the contents of verbal behavior insofar as it consists of shared cultural beliefs and actions (e.g., religion, philosophy, economic and political norms), linguistic science has as its traditional subject matter the signal system as such. Its orientation tends to be social rather than individual, since the use of speech in communication presupposes a group of intercommunicating people, a speech community. In general, therefore, it has dealt with the speech of individuals merely as representative of the speech of a community. The interest in an individual’s speech as such, his idiolect, in relation to his personality structure constitutes a relatively new, marginal, and little explored area. The distinction between language as a system and its actual employment has been variously phrased as langue vs. parole (de Saussure), syntactic vs. pragmatic (Morris) or code vs. message (information theory). However stated, it marks in general the boundary between what has traditionally been considered the province of linguistic science and what lies outside it.
2.1.1. The Field of Linguistics
The primary subject matter of the linguist is spoken language. Writing, and other systems partly or wholly isomorphic with speech, are viewed by most linguists as secondary systems. Speech has both ontogenetic and phylogenetic priority. There are even now peoples with spoken but not written languages (so-called primitives), but the reverse situation has never obtained. Moreover, written systems are relatively stable while spoken language, by and large, changes more rapidly. It is always the written language which must make the readaptation, when it is made, by way of a new orthography. The effect of, say, alphabetic writing on speech, in the form of spelling pronunciations, is a real but quite minor factor in the change of spoken language. The linguist views writing, then, as a derivative system whose symbols stand for units of the spoken language.
Linguistic science is divided into two main branches, the descriptive and the historical. Historical interests presided at the inception of modern linguistic science (ca. 1800) and have predominated until fairly recently. Within the last few decades the focal point of linguistics has shifted to problems of description. These two chief areas of study complement each other. The degree of success of historical inquiry is largely dependent on the adequacy of descriptive data. On the other hand any particular stage of a language, while it can be completely described without reference to its past, can be more fully understood if the time axis is also taken into account. A cardinal and generally accepted methodological principle, however, is the clear distinction between synchronic and diachronic investigations. In particular, descriptive grammars were, and sometimes are, so replete with historical interpretations, that the locus in time of individual linguistic facts is obscured and observed phenomena are not distinguished from inferences, so that no clear picture of the structure of the language at any one time emerges.
The aim of a scientific language description is to state as accurately, exhaustively, concisely, and elegantly as possible, the facts concerning a particular language at a particular time. It is assumed that the changes which are inevitably proceeding during the period in which the linguistic informant’s speech is being studied are negligible and can be safely disregarded. It is also assumed that the speech of the informant is an adequate sample of some speech community. This concept is applied rather vaguely to any group within which linguistic communication takes place regularly. Minor cleavages within a group of mutually intelligible speech forms are called dialects. The maximal mutually intelligible group is a language community, as defined by scientific linguistics, but the term is often loosely applied on a political basis. Thus Norwegian is usually called a language although it is mutually intelligible with Danish, while Low German is considered a form of German, although objectively the difference between Low and High German is greater than that between Danish and Norwegian. The phrase ‘mutually intelligible’ is itself vague.
The speech of an informant is normally characteristic of that of a dialect community along with some idiosyncrasies. Language is so standardized an aspect of culture, particularly in regard to those structural aspects which are of chief concern to the linguist, that a very small number of informants usually proves to be adequate. If necessary, the linguist will even be satisfied with a single informant in the belief that systematic divergences from the shared habits of the community as a whole are likely to be of minimal significance. However, the sampling problem must eventually be faced in a less makeshift manner. The systematic mapping of speech differences on a geographic basis, through sampling at selected points, is known as linguistic geography and is a well-established sub-discipline of linguistics. Far more remains to be done with non-geographic factors of cleavage within the language community, on sex, occupational and class lines. Such study is a prerequisite for adequate sampling.
2.1.2. Units of Linguistic Analysis
Linguistic description is carried out in terms of certain fundamental units which can be isolated by analytic procedures. The two key units are the phoneme and the morpheme, of which the phoneme has a somewhat more assured status. The phoneme is the unit of description of the sound system (phonology) of a language. Many widely differing definitions have been offered, some of which are objects of doctrinal differences between various linguistic ‘schools.’ Fortunately, the actual results in practice of the applications of these divergent approaches are surprisingly similar.
The phoneme was foreshadowed by the pre-scientific invention of alphabetic writing. An adequate orthography of this kind disregards differences in sound which have no potential for the discrimination of meaning. Moreover, unlike syllabic writing, alphabetic writing selects the minimal unit capable of such differential contrast. The naive speaker is generally unaware of sound variations which do not carry this function of distinguishing different forms. For example, speakers of English have usually never noticed that the sound spelled t in ‘stop’ is unaspirated as contrasted with the aspirated t of ‘top.’ Yet this difference is sufficient to differentiate forms in Chinese, Hindustani, and many other languages. Phonemic theory is necessary because if we approach other languages naively we will only respond to those cues as different which are significant in our own language. On the other hand, we will attribute significance, and consider as indicative of separate elements, those differences which have a function in our own language, although they may not have such a function in the language we are describing.
For example, in Algonquian languages distinctions of voicing are not significant. A naive observer with an English linguistic background will carefully mark all p’s as different from b’s. The reaction of an Algonquian would be similar to that of an English speaker if he were presented with an orthography devised by a Hindu in which the t of ‘top’ was represented by a different symbol from the t of ‘stop.’ The arbitrariness of such a procedure comes out when we realize that an untrained Frenchman would describe the sound system of a particular language in different terms than a naive Englishman or German. As a matter of fact, this has often occurred. Equally unsatisfactory results are obtained by a phonetically trained observer, unaware of the phonemic principle, who indicates all kinds of nonessential variants because his training permits him to distinguish them. Here also there is a certain arbitrariness based on the particular phonetic training of the observer. The logical outcome of such a phonetic approach would be to carry discriminations even further by instrumental means, and the result would be that every utterance of a language would be completely unique, for no two utterances of the ‘same’ sequence of phonemes are ever acoustically identical with any other.
The procedure of the descriptive linguist, then, is a process of discovering the basic contrasts which are significant in a language. Since he cannot know a priori which particular features of an utterance will prove to be significant, he must be prepared to indicate them all at the beginning by a phonetic transcription. Instrumental aids, though useful, are not essential to the preliminary research. The linguist gradually eliminates those sound differences from his transcription which prove to be non-significant so that the phonetic transcription becomes a phonemic one. In doing this, he makes use of the two principles of conditioned and non-conditioned variation. If the occurrence of one or another of a set of sounds may be predicted in terms of other sounds in the environment, this variation is said to be conditioned. If either of two sounds may be used for the other and still produce a meaningful utterance, the variation is called free, or non-conditioned. Such variant sounds grouped within the same phoneme are called allophones. In English a front velar [k̟] is found before i, ɪ, e, ɛ and other front vowels (e.g., the initial sound of ‘key’). A sound different enough to be a separate phoneme in many languages, the back velar [k̠], is found before u, ʊ, o, ɔ and other back vowels (e.g., the initial sound of ‘coat’). Since the particular variant can be predicted by reference to the following vowel sound, [k̟] and [k̠] are in conditioned allophonic variation and are members of the same English /k/ phoneme.
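Purely as an illustration, and not as part of any linguistic description, the conditioned variation just described can be stated as a simple prediction rule; the symbols, vowel inventories, and the Python formulation below are our own assumptions.

```python
# A minimal sketch of conditioned allophonic variation: the allophone of
# English /k/ is predictable from the following vowel. Symbol sets are
# illustrative only, not a complete inventory.

FRONT_VOWELS = {"i", "ɪ", "e", "ɛ"}   # illustrative front vowels
BACK_VOWELS = {"u", "ʊ", "o", "ɔ"}    # illustrative back vowels

def k_allophone(following_vowel: str) -> str:
    """Predict the phonetic variant of /k/ from its environment."""
    if following_vowel in FRONT_VOWELS:
        return "front velar [k̟]"   # as in 'key'
    if following_vowel in BACK_VOWELS:
        return "back velar [k̠]"    # as in 'coat'
    return "plain velar [k]"

print(k_allophone("i"))  # front velar [k̟]
print(k_allophone("o"))  # back velar [k̠]
```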
The number of potential phones (sounds) in a language approaches infinity. The great virtue of the phonemic principle is that it enables the linguist to effect a powerful reduction from this complexity to a limited number of signals that constitute the code, and this represents a great economy in description. For languages so far investigated, the number of phonemes runs about 25 to 30 (the English system tending toward the higher figure). It is possible to effect a still greater economy in description. This is achieved by the analysis of phonemes into concurrent sets of distinctive features. Since the features which distinguish certain pairs of phonemes are found to be identical with the features which distinguish certain other pairs, the number of entities necessary to describe the significant aspects of the sound matter is thus further reduced. For example, in English /p/ is distinguished from /b/, /t/ from /d/, /k/ from /g/, and /s/ from /z/ on the basis of the same feature, the former being unvoiced and the latter voiced. Other distinctive features, such as tongue position or nasalization, produce other sets of contrasts. By contrasting every phoneme in the language with every other phoneme, each phoneme comes to be uniquely identified in terms of the set of contrasts into which it enters, this ‘bundle of distinctive features’ being the definition of that phoneme. The distinctive oppositions that occur in languages studied so far run about 6 to 8. These are perhaps the minimal discriminanda in language codes.
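The reduction to distinctive features can be pictured, again only as an illustrative sketch of our own, by treating each phoneme as a small table of feature values and asking which features separate any two of them; the feature names and values below are simplified assumptions, not a full analysis of English.

```python
# A hedged sketch of 'bundles of distinctive features': each phoneme is a set
# of feature values, and pairs of phonemes are distinguished by the features on
# which they differ. Feature labels are illustrative simplifications.

FEATURES = {
    "p": {"voiced": False, "nasal": False, "place": "labial",   "stop": True},
    "b": {"voiced": True,  "nasal": False, "place": "labial",   "stop": True},
    "t": {"voiced": False, "nasal": False, "place": "alveolar", "stop": True},
    "d": {"voiced": True,  "nasal": False, "place": "alveolar", "stop": True},
    "k": {"voiced": False, "nasal": False, "place": "velar",    "stop": True},
    "g": {"voiced": True,  "nasal": False, "place": "velar",    "stop": True},
}

def contrast(a: str, b: str) -> dict:
    """Return the features on which two phonemes differ."""
    return {f: (FEATURES[a][f], FEATURES[b][f])
            for f in FEATURES[a] if FEATURES[a][f] != FEATURES[b][f]}

# /p/-/b/, /t/-/d/, /k/-/g/ all reduce to the single voicing opposition:
print(contrast("p", "b"))  # {'voiced': (False, True)}
print(contrast("t", "d"))  # {'voiced': (False, True)}
```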
Analysis into distinctive features is a development within the past two decades, associated with the Prague School but not universally accepted. Jakobson and his associates (cf. 9, 11) go one step further still, by imposing upon the entire phonemic material binary opposition as a consistent patterning principle, but this needs much further exploration. Whereas American linguists usually say that sounds must be phonetically similar to be classed as members of the same phoneme, members of the Prague School state that members of the same phoneme class must share the same set of distinctive features. These criteria will generally lead to the same classificatory structure.
For example, the front and back varieties of k would be said by members of the Prague School to share the following features in common: velar articulation, non-nasality and lack of voicing. These would be the relevant features shared by all varieties of the /k/ phoneme while, in this instance, back or forward articulation is irrelevant. The /g/ phoneme shares velarity and non-nasality with /k/ but not lack of voicing. The /ŋ/ phoneme (as in ‘sing’) shares velar articulation but not non-nasality or lack of voicing. The /t/ phoneme shares non-nasality and lack of voicing with /k/ but not velar articulation. Thus /k/ is uniquely determined by these three relevant features. Certain recent American analyses employ a methodology nearly identical with that just described.
Phonemes are sometimes distinguished as being either segmental or prosodic. The former proceed in one dimensional time succession without gap. The latter are intermittent and necessarily simultaneous with segmental phonemes or successions of segmental phonemes. Examples of prosodic phonemes are phonemes of tone (sometimes called tonemes), stress, etc. In principle, we should sharply distinguish prosodic phonemes simultaneous with a single segmental phoneme from those which are distributed over a grammatically defined unit such as a phrase or sentence. The former can always be dispensed with in analysis, though they often prove convenient. For example, in a language with three vowel phonemes /a, i, u/ and two tone levels, high /´/ and low /`/, we might analyze /á/, /à/, /í/, /ì/, /ú/ and /ù/ as six separate segmental phonemes, or we might make /a/, /i/ and /u/ segmental and /´/ and /`/ prosodic. This particular analysis has no doubt been largely determined by our traditional orthography which uses separate marks for pitch. The carrying through of this procedure to its logical conclusion is called componential analysis and results in the resolution of each phoneme into a set of simultaneous elements equivalent to the distinctive features mentioned above. The other type of prosodic element is illustrated by question or statement intonation in English. Unlike the elements just discussed, it cannot be dispensed with.
Still another type of phoneme is the juncture or significant boundary, whose status is much disputed in contemporary linguistics. The conditioning factor for phonemic variation is sometimes found to be the initial or final position in some grammatical unit such as a word, rather than a neighboring sound. For example, unreleased stops p, t, k are found in English in final morpheme or word position. Unless we indicate the boundary in some fashion we must nearly double the number of phonemes in English. Spaces, hyphens and other devices are employed to indicate the presence of these modifications. For example, the n of ‘syntax’ is shorter than the n in ‘sin-tax.’ Either we posit two different n phonemes or we describe the longer as n plus juncture, transcribing /sintaks/ and /sin-taks/ respectively (or we deny the existence of the phenomenon altogether).3 The agreement as to the boundaries of grammatical elements is almost never perfect, and some linguists assume that if such boundary modifications exist in some cases they must exist in all, even though they have not actually been observed to occur.
In addition to the enumeration of phonemes and their allophonic variants, the phonological section of a description usually contains a set of statements regarding permitted and non-permitted sequences of phonemes, frequently in terms of the structure of the syllable. In this as in other aspects of linguistic description it is not usual to give text or lexicon frequencies. Statements are limited to those of simple occurrence or non-occurrence. Only such quantifiers as some, none and all occur in most linguistic description.
Corresponding to the minimal unit of phonology, the phoneme, we have a unit of somewhat less certain status, the morpheme, which is basic for grammatical description. Bloomfield (2) states as the fundamental assumption of linguistic science that in a given speech community some utterances show partial formal-semantic similarity. For example, in the English-speaking community the utterances ‘the dog is eating meat’ and ‘the dog is eating biscuits’ are partially similar in their sequence of phonemes and refer to partially similar situations. The linguist, through the analysis of these partial similarities, arrives at the division of utterances into meaningful parts. The analytical procedure as applied to individual utterances must eventually reach a point beyond which analysis becomes arbitrary and futile. The minimum sequence of phonemes thus isolated, which has a meaning, is called a morpheme. The morpheme is a smaller unit than the word. Some words are monomorphemic, e.g., ‘house.’ Others are multi-morphemic, e.g., ‘un-child-like.’ There is some uncertainty as to the point up to which such divisions are justified and the rules of procedure may be stated in several alternate ways. Thus all would concur in analyzing ‘singing’ as having two morphemes ‘sing-’ and ‘-ing’ and there would likewise be general agreement that to analyze ‘chair’ as containing two morphemes, say ‘ch-’ meaning ‘wooden object’ and ‘-air’ meaning ‘something to sit on’ is not acceptable. But there is an intermediate area in which opinions differ. For example, ‘deceive’ contains two morphemes ‘de’ and ‘ceive’ according to some but not according to others. In such borderline cases it becomes impossible to specify the meaning of each morpheme without some arbitrariness.
2.1.3. Morphology and Syntax
The work of the descriptive linguist in this area is not exhausted by the analytic task just described. Having arrived at his units he must describe the rules according to which they are synthesized into words, phrases, and sentences. In somewhat parallel fashion to the situation in phonology, having isolated minimal units, he must describe their variation and their rules of combination.
In regard to the first of these problems, it is not sufficient to consider each sequence of phonemes which differs either in form or meaning as a different unit from every other. For example, the sequence ‘leaf’ /lijf/ is different in form from ‘leav-’ of the plural ‘leaves’ /lijv-z/ but we cannot consider them as units without relation to each other. We call /lijf/ and /lijv-/ morphs rather than morphemes and consider them allomorphs of the same morpheme because: (1) they are in complementary distribution, /lijv-/ occurring only with /-z/ of the plural and /lijf/ under all other conditions; (2) they have the same meaning; (3) there are other sequences which do not vary in form and which have the same type of distribution, e.g., ‘cliff’ for which we have /klif/ and /klif-s/.4 Such variation in the phonemic composition of allomorphs of the same morpheme is called morphophonemic alternation, and systematic statements of such alternations comprise the portion of grammar known as morphophonemics. Some alternations occur in all instances in a language regardless of the particular morphemes in which the phonemes occur. Such alternations are called automatic. There are others which are unique. These are called irregular. Others are intermediate in that they apply to classes of morphemes of various sizes. In English, morphemes which have s, z and əz as variants exhibit automatic alternation, əz occurring after sibilants (and affricates), s after unvoiced non-sibilants and z after voiced non-sibilants. Thus the same rule applies both for the third person singular present of the verb and the nominative plural. On the other hand, the variation between /čajld/ ‘child’ and /čildr-/ of the plural ‘childr-en’ is a unique irregularity. Psychologically, there would seem to be a real difference between these extremes.
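Purely as an illustrative sketch, the automatic alternation just cited can be written as a rule selecting an allomorph from the final sound of the stem; the symbol inventories and the Python formulation are our own simplifications, not a statement from any grammar.

```python
# A minimal sketch of the automatic English alternation described above: the
# plural / third-singular suffix appears as əz after sibilants (and affricates),
# s after voiceless non-sibilants, and z after voiced non-sibilants.
# Symbol sets below are illustrative and deliberately incomplete.

SIBILANTS = {"s", "z", "š", "ž", "č", "ǰ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def suffix_allomorph(final_sound: str) -> str:
    """Select the allomorph of the suffix morpheme from the stem-final sound."""
    if final_sound in SIBILANTS:
        return "əz"   # e.g., 'glasses', 'judges'
    if final_sound in VOICELESS:
        return "s"    # e.g., 'cliffs', 'cats'
    return "z"        # e.g., 'dogs', 'leaves'

print(suffix_allomorph("f"))  # s
print(suffix_allomorph("g"))  # z
print(suffix_allomorph("s"))  # əz
```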
Having distinguished morphemic units, there remains the basic task of grammatical description—the setting up of rules of permitted combinations of morphemes to form sentences. Generality of statement is here obviously a prime requirement. Languages vary widely in number of morphemes, from some hundreds to many thousands. Their possible sequences in constructions can only be stated in practice by the setting up of classes whose members have the same privilege of occurrence. In setting up such classes, modern linguistics characteristically uses a formal, rather than semantic approach. Classes of morphemes or classes of sequences of morphemes (word classes, phrase types, etc.) are defined in terms of mutual substitutability in a given frame. Any utterance and the morpheme or morpheme sequence within it, for which substitutions are made, defines a class. Thus, in English, among other criteria, substitution of single words for house in the frame ‘I see the house’ determines the class of nouns. This contrasts with the traditional a priori semantic approach according to which all languages have the same basic grammatical categories (actually based on Latin grammar) and a noun, for example, is defined as the name of a person, place, or thing. Actually, formal criteria have always been used in grammars, although often tacitly. ‘Lightning’ is a noun in traditional English grammar also, although it names an event, because it functions in the same constructions as other nouns.
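The substitution-in-frame procedure lends itself to a schematic illustration. In the sketch below, which is ours and not an account of a linguist's actual field method, an informant's acceptability judgments are simulated by a stand-in function.

```python
# A hedged sketch of substitution in a frame: words that can replace 'house'
# in 'I see the ___' and still yield an acceptable utterance are assigned to
# the same form class. The 'informant' below is a stand-in, purely illustrative.

FRAME = "I see the {}"

def informant_accepts(utterance: str) -> bool:
    """Stand-in for an informant's acceptability judgment (illustrative only)."""
    return utterance.split()[-1] in {"house", "dog", "lightning"}

def form_class(candidates, acceptable):
    """Group together the candidates the informant accepts in the frame."""
    return {w for w in candidates if acceptable(FRAME.format(w))}

nouns = form_class(["house", "dog", "lightning", "quickly", "of"],
                   informant_accepts)
print(sorted(nouns))  # ['dog', 'house', 'lightning']
```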
It is customary to regard sentences as the largest normalized units,5 and these are successively decomposed into clauses, phrases, words, and morphemes. These units constitute a hierarchy which is also reflected in the speech event by configurational features, which, like the distinctive features of phonemic analysis, are assumed to operate on a strictly binary, ‘yes-no’ basis. Configurational features include such distinctions as those of pitch, stress, rhythm, and juncture, and provide appropriate signals as to construction. The sentence is so complex a unit that it cannot be described directly in terms of morpheme constructions. Rather, the description is built up in layers. On any particular level, the combinations are practically always accounted for in terms of immediate constituents. In the sentence ‘unlikely events may actually occur,’ the morpheme un- and the morpheme sequence -likely are the two immediate constituents which make up the word unlikely. In turn, likely has as immediate binary constituents the morphemes ‘like-’ and ‘-ly.’ On a higher level unlikely enters as a whole in a construction with events while events itself has event- and -s as immediate constituents.
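As an illustration of our own, the layered immediate-constituent view of the example sentence can be written as nested groupings down to the morpheme level; the particular bracketing chosen is one plausible reading, not the only possible analysis.

```python
# An illustrative sketch of immediate-constituent analysis: constituents are
# nested groupings, down to individual morphemes, for the sentence
# 'unlikely events may actually occur.'

sentence = (
    (("un", ("like", "ly")), ("event", "s")),   # 'unlikely events'
    ("may", (("actual", "ly"), "occur")),       # 'may actually occur'
)

def morphemes(node):
    """Recursively list the morphemes of a constituent structure."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node:
        result.extend(morphemes(child))
    return result

print(morphemes(sentence))
# ['un', 'like', 'ly', 'event', 's', 'may', 'actual', 'ly', 'occur']
```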
It is usual to distinguish as primary divisions of grammar all constructions of morphemes to form words as morphology and all constructions using words as units to form phrases, clauses, and sentences as syntax. Although no generally accepted definition of the word-unit exists, in fact very nearly every grammar written makes use of the word as a fundamental unit and describes morphological and syntactic constructions separately.6 In spite of traditional differences of terminology in morphology and syntax, it is generally agreed that the same fundamental principles of analysis apply.
2.1.4. Problem of Meaning in Linguistics
Besides specifying meaningful units and their constructions, a complete linguistic description must state the meanings of these units and of the constructions into which they enter. The status of meaning has been a crucial point in contemporary linguistic theory. The statements of Bloomfield concerning meaning in his influential book (2) have sometimes been interpreted both by followers and opponents as indicating that the field of linguistic science only includes a logical syntax of language without reference to meanings. The definition of meanings, on this view, rests with other sciences which deal with the subject matters which speakers talk about. Thus, the definition of ‘moon’ is the business of the astronomer, not the linguist. The actual practice of linguists both here and in Europe, however, indicates that semantic problems are in fact dealt with and cannot well be excluded from scientific linguistics.
Without entering into the exegetical problem of what Bloomfield meant, which is irrelevant to the present purpose, it may be pointed out that Bloomfield coined the technical terms ‘sememe’ for the meaning of a morpheme and ‘episememe’ for the meaning of a construction, both of which are current in American linguistics. Moreover, problems of historical meaning change are discussed at length in his book. This would imply that scientific linguistics does not exclude semantics. It is evident that historical linguistics draws conclusions regarding relationships by comparisons of cognates, that is, forms with both formal and semantic resemblances, so that in this branch, at least, meanings must be dealt with. It is likewise clear that the compiling of dictionaries has traditionally fallen within the linguist’s province and continues to do so. No linguist has ever written a grammar in which the forms cited were not accompanied by translations.
The linguist deals with meaning by the bilingual method of translation or the unilingual method of paraphrase, that is, by the apparatus of traditional lexicography. In keeping with the general orientation of linguistics as a social science, the linguist defines the socially shared denotative meanings. Avoiding as far as possible controversial issues in the domain of epistemology, it may perhaps be ventured that a distinction may be, and in practice is, drawn between definitions which embody our scientific knowledge about a thing and nominal definitions which are observed rules of use in a given speech community. The linguist practices the latter type of definition. His methods up to now have been the more or less rough and ready methods of lexicography based on the traditional logical concepts of definition. The difficulties involved in the vagueness of actual usage of all linguistic terms in a speech community (if we exclude some scientific discourse in a few societies) are in practice circumvented by the not altogether happy devices of translation and paraphrase, which, involving as they do language in its everyday use, are as vague as the terms which are to be defined. Ambiguity is dealt with by multiple listings of separate meanings based primarily on common-sense analysis. The boundary between the same form with synonymous meanings and separate homonymous forms has never been clearly determined, since it has not been possible to specify how different meanings must be in order to justify treatment as homonyms. Nor, in this instance, does an approach in terms of purely formal differences in distribution prove more successful.
2.1.5. Historical Linguistics
Thus far all our consideration of linguistic topics has omitted the basic dimension of change in time. This is the field of historical and of comparative linguistics, which form a single sub-discipline. The investigation of the history of a specific language may be considered as a comparative study of its sequential synchronic states, while one result of comparing related, contemporaneous languages is a reconstruction of their history. History and comparison are thus, for the most part, inseparable in practice, though a much less frequently employed non-historical comparative approach, the so-called ‘typological,’ will be considered below.
It was the recognition of certain facts about language change that ushered in the modern scientific period in linguistics. The most fundamental of these were (a) the universality of language change, (b) the fact that changes in the same linguistic structure when they occur independently, as through geographical isolation, always lead to different total end results, and finally (c) that certain of those changes, particularly in the area of phonology, show a high degree of regularity. The acceptance of these three principles—universality, differential character, and regularity of language change—adds up to a historical and evolutionary interpretation of language similarities and differences which contrasts with the older notion based on the Babel-legend that, as with organic species, languages were types fixed from the time of creation and only subject to haphazard, degenerative changes.
The second and third of these principles, those concerning the differential nature of independent changes and their regularity, in combination, lead to the concept of genetic relationship among languages. Whenever a language continues to be spoken over a long period of time, weaknesses in communication through migration, geographical and political barriers and other factors, result in a pattern of dialect cleavage as linguistic innovations starting in one part of the speech community fail habitually at certain points to diffuse to the remainder. As this continues, the dialects drift farther and farther apart until they become mutually unintelligible languages. However, they continue to show evidence of their common origin for a very long period. In fact, a number of successive series of cleavages may occur within a period short enough for the whole set of events to be inferred. For example, the Proto-Indo-European speech community was differentiated into a number of separate speech communities, one of which was the Proto-Italic. The Proto-Italic in turn split into the Latin, Venetic, Oscan, Umbrian and other separate language-communities in ancient Italy. One of these, Latin, survived, but it in turn developed into the various Romance languages, French, Italian, Spanish, etc. Sometimes, as in the case of Latin, the original speech from which the descendant forms branched off is attested from written records. In other cases we legitimately assume that such a language must once have existed although no direct evidence is available. Such an inferred language is called a proto-language (‘Ursprache’).
Because of the regular nature of much linguistic change, it is possible under favorable circumstances to reconstruct much of the actual content of such extinct languages. In particular, the reconstruction of the ancestral language of the Indo-European family has been a highly successful enterprise which has occupied a major proportion of the interest of linguists up to the present day. Thus far, linguistic relationships are well-established only in certain portions of the world and reconstruction has been carried out for only a limited number of linguistic families, particularly Indo-European, Uralic, Semitic, Bantu, Malayo-Polynesian, and Algonquian. Reconstruction is most successful, probably, in phonology, somewhat less so in grammar, and least of all in semantics. Forms which resemble each other in related languages because of common origin from a single ancestral form are called cognates, e.g., English foot and German Fuss. The history of such a particular cognate is called its etymology and it has both a phonological and semantic aspect.
The difficulties of semantic reconstruction may be appreciated from the following artificial example which illustrates, however, the real difficulties often encountered. If in three related languages, a cognate form means ‘day’ in A, ‘sun’ in B, and ‘light’ in C, here are some of the possibilities among which it is impossible to make a rational choice. (1) The original meaning was ‘day’ which remained in A, shifted to ‘sun’ in B and to ‘light’ in C. (2) The original meaning was ‘sun’ which shifted to ‘day’ in A, remained in B and shifted to ‘light’ in C. (3) The original meaning included both ‘sun’ and ‘day.’ It narrowed to ‘day’ in A, to ‘sun’ in B, while in C it narrowed to ‘sun’ and then shifted to ‘light.’ These and others are all possible, and in the present stage of our knowledge, about equally plausible. On the other hand, various Indo-European languages do have cognates all of which mean approximately ‘horse,’ which can therefore be safely reconstructed for the parent language.
The changes undergone by languages whether documented or inferred can be classified under various universally applicable processes such as sound change, borrowing and analogy. Such processes show a high degree of specific similarity. To cite an example from phonology, au has become o in many different languages independently. Similar highly specific parallel changes occur in grammar and semantics. In spite of this, our second postulate of differential change shows that there are always a number of possible changes from a given state and our knowledge is not yet sufficient to predict which one will ensue or indeed whether the system will change or remain stable in some particular aspect. Parallel changes within related languages, called ‘drift’ by Sapir, are probably especially frequent and presumably strongly conditioned by internal linguistic factors.7
2.1.6. Typological Comparison
The ascertaining of historic relationships and the reconstruction of processes of change is not the only possible motive for the comparison of languages. We can examine the languages of the world, comparing both related and unrelated ones, in order to discover language universals, the greater than chance occurrence of certain traits, and the significant tendencies of traits to cluster in the same languages. The isolation of such clusters leads to the setting up of criteria for classifying language types. The classical nineteenth century typologies rested primarily on considerations of the morphological structure of the word. Because of the relatively unadvanced state of descriptive theory, this approach suffered from a lack of precise definitions for the units employed and was, moreover, tied to an ethnocentric, outmoded type of evolutionism. Recently text ratios of more rigidly defined units have been employed in order to construct a more refined typology.
The problems of typology are of intimate concern to psycholinguistics. The universal or more than chance occurrence of certain traits is in need of correlation with our psychological knowledge. More data on languages in many parts of the world and some effort at cross-linguistic cataloguing are probably necessary prerequisites for any considerable advance in this area. One paper growing out of The Symposium on American Indian Linguistics at the 1951 Linguistic Institute (Voegelin’s 17a) and two papers published subsequent to the Conference on Archiving at the 1953 Linguistic Institute (Allen’s and Wells’ 1a, 18) are concerned primarily with typology.
2.1.7. Bibliography
1a. Allen, W. S., 1954, Statements for the conference on archiving. International Journal of American Linguistics 20. 83-84.
1b. Bloch, B. and G. L. Trager, 1942, Outline of linguistic analysis, Linguistic Society of America, Baltimore.
A concise summary of linguistic techniques for the general reader. An excellent, if somewhat outmoded, introduction.
2. Bloomfield, L., 1933, Language, New York.
The most important technical introduction to linguistic science as a whole; a classic of the field.
3. Carroll, J. B., 1953, The study of language, Cambridge.
A survey of linguistics and related disciplines, including chapters on linguistics, psychology, communication engineering, and the study of speech.
4. Fries, C. C., 1952, The structure of English, New York.
A formal analysis of English syntax presented in readable form. Useful presentation dealing with a language familiar to the reader.
5. Greenberg, J. H., 1953, Historical linguistics and unwritten languages, in Anthropology today, 265-86, Chicago.
Deals with the establishment of linguistic relationships and historical reconstruction.
6. Hall, R. A., Jr., 1950, Leave your language alone!, Ithaca.
A good popularization of the fundamentals of linguistic science; recommended as a starting point for the novice in this field.
7. Harris, Z. S., 1951, Methods in structural linguistics, Chicago.
The procedures of descriptive linguistics are presented from an extremely formalistic point of view.
8. Hoijer, Harry, and others, 1946, Linguistic structures of native America, Viking Fund Publications in Anthropology 6. New York.
Descriptive sketches of diverse American Indian languages illustrate a variety of linguistic techniques and structures.
9. Jakobson, R., C. G. Fant, and M. Halle, 1952, Preliminaries to speech analysis: the distinctive features and their correlates, Acoustic Laboratory, Massachusetts Institute of Technology, Technical report No. 13, Cambridge.
The binary principle is employed to present phonemic solutions of phonetic data.
10. Joos, M., 1948, Acoustic phonetics, Language Monograph, No. 23, Baltimore.
This contribution indicates results and possibilities of instruments in linguistics.
11. Lotz, J., 1950, Speech and language, Journal of the Acoustical Society of America 22. 712-17.
An outline of the principal theoretical problems of modern linguistics.
12. Lounsbury, F. G., 1953, Field method and techniques of linguistics, in Anthropology today, 401-16, Chicago.
Review of the various ways in which linguists gather their data.
13. Martinet, A., 1953, Structural linguistics, in Anthropology today, 574-86, Chicago.
The three principal schools of structural linguistics are compared and evaluated.
14. Nida, E. A., 1949, Morphology, 2nd ed., University of Michigan Publications in Linguistics, Ann Arbor.
A presentation of methods for identification of morphemes with useful sections of problems to be worked on by reader.
15. Pike, K. L., 1947, Phonemics, University of Michigan Publications in Linguistics, Ann Arbor.
Techniques for phonemic analysis with examples from widely different languages.
16. Sebeok, T. A., and Frances J. Ingemann, Structural analysis and content analysis in folklore research, in Studies in Cheremis, Vol. 2: The Supernatural, Part Two. Viking Fund Publications in Anthropology, forthcoming.
Some techniques of psycholinguistics are applied to collections of texts of folkloristic character.
17a. Voegelin, C. F., 1954, Inductively arrived at models for cross-genetic comparisons of American Indian languages, University of California Publications in Linguistics 10. 27-45.
17b. Voegelin, C. F. and Z. S. Harris, 1947, The scope of linguistics, American Anthropologist 49. 588-600.
The data and techniques of linguistics and cultural anthropology are compared and contrasted, and trends in linguistics sketched.
18. Wells, R. S., 1954, Archiving and language typology, International Journal of American Linguistics 20. 101-107.
2.2 The Learning Theory Approach8
Language is perhaps the most complex behavior displayed by the human organism, and, in the main, it is learned behavior. It is understandable, therefore, that linguists should find learning theory of special interest. Although linguists have for many years refrained from ‘psychologizing’ within their science, it now appears that more interaction between psychologists and linguists would be fruitful. Even while Bloomfield was espousing the separation of the fields, he felt it desirable from time to time to deal with linguistic matters in the framework of early psychological behaviorism (chiefly as structured by A. P. Weiss). Fortunately, this has been an aspect of psychology which has seen tremendous development in the last 20 years. At the present time, probably more experimental work is being done in learning than in any other psychological field. This section of the report attempts to do two things: first, to present some of the major phenomena of learning and, second, to discuss briefly some of the major theories of learning or ways of organizing and explaining the phenomena.
[Figure 5: diagram of the classical conditioning paradigm]
2.2.1. Phenomena of Learning
In order to discuss the phenomena of learning most meaningfully and fruitfully, two paradigms will be presented and discussed to reveal the major variables which affect the learning process. These models are phenotypes, and it may be argued that they are in some ultimate sense different kinds of learning or it may be argued that they are explicable under one system. This is not our concern here. It is sufficient for our purposes that they act as convenient vehicles for illustration and discussion.
2.2.1.1. Classical conditioning. The first model is taken directly from the famous work of the Russian physiologist Pavlov. It is diagrammatically represented in Figure 5. In its simplest form this learning proceeds as follows: A given stimulus (the unconditioned stimulus or US) is found to be followed by a characteristic response (R1); another stimulus (the conditioned stimulus or CS) is inadequate with respect to eliciting R1 but may be followed by some other response (R2) irrelevant to the experiment; a long series of trials is given in which the US is always preceded by the CS; finally, it is noted that the CS alone elicits some of the response characteristics which normally would occur only after the US; at this point we say that learning has occurred—an initially neutral stimulus now has acquired the ability to elicit a response which originally occurred only in the presence of another stimulus. Suppose we have a dog in our laboratory. We know that he salivates when we place meat powder in his mouth. (This is the US [meat powder]—> R1 [salivation] connection.) We decide to condition the response of salivation to the stimulus of ringing a bell. We note before experimentation that ringing the bell (CS) results in extraneous responses (R2) (moving the head, pricking up ears, etc.), but not salivation. In the training series we ring the bell and, while it is still ringing, place meat powder in the dog’s mouth, eliciting salivation. After, say, 100 trials, we ring the bell and note that the dog salivates without the stimulus of the meat powder. The bell (CS) now elicits the response (R1) (or part of the response) elicited by the meat powder. The conditioning is completed.
To illustrate the manner in which different factors influence this learning process, we may now begin to alter the situation by changing first one and then another variable.
The first thing we might notice is that the number of trials pairing the bell and the meat powder is important. If we only pair the stimuli once, we may not detect any effect. If we have more trials, we see a slight effect. With a great number of trials we get maximum response (most like the original response). Our first important variable, then, is frequency. The number of times the experimental situation occurs is important.
Secondly, we might experiment with the temporal relation between the presentation of the bell and the meat powder. If the meat powder precedes the bell-ringing, we discover (perhaps to our surprise) that little or no learning takes place even after a long series of trials. There is practically no backward conditioning. Further experimentation shows that simultaneous presentation of the CS and US is “learnable” but not optimal. Maximum conditioning occurs when the bell begins to ring about half a second before the meat powder is presented. We find that the onset of the bell can be moved further and further ahead of the presentation of the meat powder. For example, we might have the bell ring for 30 seconds before the US; with enough training we will find that the dog salivates just 30 seconds after the onset of the bell. Such learning is called delayed conditioning. We may even let the bell ring for a few seconds, stop it, wait for 20 seconds, and then present the US—conditioning requires still more trials, but is attainable. This is called trace conditioning. If we were very persevering, we might even set up temporal conditioning, in which the dog is fed periodically on a short time cycle and the CS is the time lapse itself. Our conditioned dog would salivate periodically like a short term alarm clock.
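The temporal relations just described can be summarized schematically. The sketch below, which is ours, merely restates the qualitative findings of this paragraph and is not a quantitative law.

```python
# A rough qualitative summary of CS-US timing effects in classical conditioning,
# following the statements in the text (backward pairing yields little or no
# learning, simultaneous pairing is learnable but not optimal, a short forward
# lead of about half a second is optimal, and long leads require more trials).

def pairing_effectiveness(cs_lead_seconds: float) -> str:
    """Qualitative label for a CS-onset lead time (negative = US precedes CS)."""
    if cs_lead_seconds < 0:
        return "backward: little or no conditioning"
    if cs_lead_seconds == 0:
        return "simultaneous: learnable but not optimal"
    if cs_lead_seconds <= 1:
        return "short forward lead (~0.5 s): optimal"
    return "long lead: delayed or trace conditioning, attainable with more trials"

for lead in (-2, 0, 0.5, 30):
    print(lead, "->", pairing_effectiveness(lead))
```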
A third discovery might occur accidentally: when a buzzer was inadvertently pressed, or a glass fell off a shelf or some other noise intruded in the experimental situation, the conditioned animal started salivating. Exploring this systematically, we would find whole classes of stimuli which could be substituted with greater or less success for the original CS, the bell. This is called stimulus generalization. In the main, the more alike two stimuli are, the greater is the likelihood that they will function for each other. If we originally condition to a tone, say middle C, we find that, as we move away from C up or down the scale, the conditioned response decreases. In our example we would expect the most salivation to C itself, the next most to B and D, less to A and E, etc.
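The generalization gradient can be given a toy numerical form; the linear decay, the particular scale, and the numbers in the sketch below are invented purely for illustration and stand in for gradients that would have to be measured.

```python
# A hedged numerical sketch of a generalization gradient: response strength
# falls off with distance from the trained tone (C). Parameters are invented.

trained = "C"
scale = ["F", "G", "A", "B", "C", "D", "E", "F'", "G'"]

def response_strength(tone: str, peak: float = 10.0, decay: float = 2.5) -> float:
    """Assumed linear fall-off of response strength with scale distance from C."""
    distance = abs(scale.index(tone) - scale.index(trained))
    return max(peak - decay * distance, 0.0)

for tone in ("C", "B", "D", "A", "E"):
    print(tone, response_strength(tone))
# C 10.0, B 7.5, D 7.5, A 5.0, E 5.0 -- strongest at C, weaker farther away
```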
If we proceed with the generalization experiments, but always pair the US with tone C and never with tone A, we can discover a related phenomenon. Soon tone A loses its power to elicit the conditioned response and it is said that stimulus discrimination has taken place. In effect, we have systematically cut down the gradient of stimulus generalization. By this technique we can discover the limits of discrimination of which the organism is capable.
This ‘damping out’ of a response suggests still another question. What happens in general when the CS is no longer followed by the US? If we tried this in our example, ringing the bell repeatedly but never following it with the meat powder, we would note that the magnitude of the response decreased over successive trials and finally disappeared altogether. The response is said to have been extinguished. At this point we might naively assume that our experiment was at an end and the dog was now unconditioned, but if we happened on some subsequent day to bring the dog into the laboratory and again rang the bell, we would find that the conditioned response was still observable, reduced in magnitude but still there. If we extinguished the response again (in fewer trials this time), let a day elapse and again brought the dog into the laboratory, we would still find some residual of the conditioned response. This apparently mysterious revitalization of an extinguished response is called spontaneous recovery and indicates a need to postulate something other than ‘forgetting’ to account for the decline in responses which we observed in the extinction trials. Most psychologists prefer to treat this as a case of ‘learning-not-to-respond’ or inhibition.
In this model we may measure learning in a variety of ways. We may take the occurrence of a response or the frequency of occurrence as an index of learning. Alternatively, we may measure the amplitude or magnitude of the response, or the resistance of a response to extinction as other indices of strength. Depending on the response in question, one measure may be more appropriate than another. It should be noted, however, that in many special cases the indices may not be in perfect agreement and it may be important to specify exactly what aspect of behavior is being considered.
[Figure 6: diagram of the instrumental learning paradigm]
2.2.1.2. Instrumental learning. The second model considered here is markedly different (at least phenotypically) from the first. It has been called variously ‘trial and error’ learning, ‘operant’ and ‘instrumental’ conditioning and descends most directly perhaps from the work of Thorndike. While instrumental conditioning does not readily lend itself to neat diagramming, it can perhaps be roughly portrayed as in Figure 6. In this learning model the organism is placed in a situation in which a variety of responses can be made. The organism is usually assumed to be motivated, that is, some state of need (hunger, thirst, etc.) is presumed to exist on the basis of prior knowledge (hours of food deprivation, water deprivation, etc.). A ‘correct’ response is followed by reward which is appropriate to the need state of the organism. The probability of the re-occurrence of the rewarded response increases with each rewarded trial up to some limit. The response is said to be learned when it occurs with high probability.
When contrasted with classical conditioning, several different features stand out. The first difference is that the response is emitted, not elicited by an unconditioned stimulus. This is not to say that responses take place without reference to the stimuli present but rather that we cannot specify the configurations of stimuli which lead to the various complex responses. A second, and presumably important, difference is that the response is instrumental in obtaining reward. If the correct response does not occur, the organism is not rewarded. This model seems to entail more ‘active’ learning than classical conditioning. It goes without saying, of course, that the correct response must be in the behavior repertory of the organism prior to conditioning. Finally, motivation (‘drive’) and reward (‘reinforcement’) are much more prominent in this model than in the first one. If the organism is well-fed and comfortable, he is more likely to go to sleep than to learn to solve the experimental problem.
A simple example of this model of learning is one made famous by Skinner. Let us assume that we have a simple box with a small lever in one end of it. The apparatus may be so arranged as to cause a pellet of food to drop into the box when the lever is pressed. If we now put a rat which has not been fed for 24 hours into the box we can observe marked changes in his behavior. At first the rat runs around the box, sniffs in the corners, washes his face, rears on his hind legs, scratches at the walls, etc. Sooner or later the rat ‘accidentally’ pushes the lever and a pellet is discharged into the box. After an interval the rat discovers and eats the pellet. Sometime thereafter he may blunder onto the lever again. If we chart the rat’s behavior we discover that after scattered lever presses the time between presses gets much shorter. In an hour or so we may find the rat industriously pressing the lever, eating, pressing the lever, eating, etc., with great speed and regularity. We say that the instrumental response of lever pressing has been learned.
If we manipulate this situation as we did the first one, we find many of the same variables controlling the modification of behavior. Again we might notice that frequency is important. Here, however, it is not the frequency of pairing of CS and US but rather the frequency of response and reward occurring. Similarly, we would find contiguity or time relations to be important, but it is response-reward contiguity. We would discover again that order is important, the response must precede the reward, and that the longer the interval between the response and the reward, the less learning takes place. Thus, the responses which occur immediately prior to the reward (whether they are relevant or irrelevant) will be strengthened more than those which were considerably in advance of the reward. (To take a different case, this explains why a maze is learned from the back to the front, errors being reduced in the vicinity of the goal box before being reduced in the middle of the maze, etc.) In this model, too, we may demonstrate extinction (if we cease giving pellets for lever pressing) and spontaneous recovery. If we alter our situation to include a new stimulus (say a small light over the lever), we can train the rat to respond only when the light is on and thus introduce the phenomena of discrimination and generalization discussed above.
Other variables not noted in the first model are more clearly revealed in the instrumental situation. It becomes apparent, for example, that the reward functions optimally when it is relevant to the need. Giving water to a hungry rat or food to a thirsty rat does not result in rapid learning if any learning at all. The amount of reinforcement likewise plays a role. All other things being equal, the speed of learning tends to increase as the amount of reward is increased. One of the most important phenomena which may be disclosed and studied most clearly and easily in this second model is secondary reinforcement. This is the name given to the reinforcing power which a neutral stimulus may acquire by virtue of having been associated with primary reinforcement. If we put a rat into the experimental box without a lever present, we may train him to approach the food box and eat every time the food mechanism clicks. Then we may introduce a lever which will produce the click when pressed, but empty the mechanism so that it provides no food. In this case in spite of the fact that lever pressing is never rewarded by food, the rat will learn to press the lever and will make a good many responses before extinguishing. We must conclude here that the click of the mechanism itself has acquired reinforcing power. In more dramatic and publicized experiments it has been demonstrated that chimpanzees will work for, collect, and hoard poker chips after experience with the chimp-o-mat—a slot machine arrangement in which the poker chips may be traded for food. It seems likely that most human learning is obtained under conditions of secondary reinforcement (money, praise, smiles, approval, etc.), and it seems especially likely that secondary reinforcement plays an important role in language learning.
In this model, as before, we may measure the extent of learning by frequency of response, amplitude, latency, or resistance to extinction. In certain cases we may also be interested in error scores. A word of caution is necessary here, however. Because these situations differ from each other and from the situation in the first model, measures having the same names may require different interpretations. For example, in classical conditioning amplitude might be a positive function of learning (e.g., drops of saliva) while in instrumental conditioning it might be a negative function (e.g., as lever pressing becomes more skilled, it may be executed with less force).
In a rather great oversimplification we might generalize that the first model presents a picture of the conditioning to an arbitrary stimulus of a highly specific, elicitable response, while the second model describes the differentiation and discrimination of a response out of a mass of behavior emitted in response to a complex stimulus field. The first model stresses time and stimulus controls and the second model stresses the role of motivation and reward. It should be remembered, however, that the models are not independent and the phenomena observed in one are observable (with more or less effort) in the other.
2.2.1.3. Some additional descriptive statements. While the models given serve excellently as pedagogical devices, they do not, of course, do justice to the wide areas in which learning studies have been carried on. The development of many complex human skills (typewriting, sending and receiving codes, memorizing verbal material—both meaningful and nonsense, etc.) has been carefully studied and described under a staggering variety of conditions. A few of the many findings which may be of relevance to us can be briefly described. They are explicable in terms of the phenomena described above, but may be of special interest as molar phenomena themselves.
A good deal of attention has been devoted by psychologists to learning curves (more properly, performance curves). In general, such curves are negatively accelerated, that is, large gains are made initially, then smaller and smaller gains until no appreciable improvement is noted. It is likely that most tasks, however, are approached with considerable residues of skill and experience. For a few tasks ‘S’ curves may be noted, positively accelerated then negatively accelerated. Such tasks appear to be those in which subjects have had little experience (e.g., tight rope walking, juggling, etc.). Perhaps the ‘S’ curves are the ‘true’ performance curves and the ‘typical’ negatively accelerated curves are only those portions which we see because our subjects already have considerable response repertories available to them.
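The two curve shapes can be illustrated with simple functional forms; the exponential and logistic expressions below are common textbook idealizations chosen by us for illustration, not curves reported here, and the parameters are arbitrary.

```python
# A hedged sketch of the two curve shapes mentioned above: a negatively
# accelerated curve (large early gains) and an S-shaped (logistic) curve.

import math

def negatively_accelerated(trial: int, limit: float = 100.0, rate: float = 0.3) -> float:
    """Performance approaches a limit with diminishing gains per trial."""
    return limit * (1 - math.exp(-rate * trial))

def s_shaped(trial: int, limit: float = 100.0, rate: float = 0.5, midpoint: int = 10) -> float:
    """Slow start, rapid middle gains, then leveling off (logistic form)."""
    return limit / (1 + math.exp(-rate * (trial - midpoint)))

for t in (1, 5, 10, 20):
    print(t, round(negatively_accelerated(t), 1), round(s_shaped(t), 1))
```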
Figure 7
Work with skill sequences has revealed both in the lower animals and in man an extraordinary capacity for eliminating waste motion and executing a highly polished and tremendously rapid series of responses. A brief consideration of the movements made by a skilled typist or piano player is sufficient to demonstrate the high degree of integration of elements into a smooth series which may be attained. It may be further shown that these are actual integrations, not merely a rapid sequencing of discrete responses. While the beginner types ‘t-h-e,’ the skilled typist writes ‘the’ or larger units such as ‘the next meeting’ with such speed that she could not be aware (by virtue of the time lag in the nervous system) that the ‘t’ had printed before the ‘e’ had been struck. This kind of short-circuiting, grouping and executing of serial responses plays an important role in all frequently executed response chains, including language.
Finally, a considerable body of research has been devoted to questions concerning the effects of prior training on subsequent training and vice versa—the problems of facilitation and interference. In general, we may consider here three cases as illustrated in Figure 7.
In the first case we observe a divergent structure. To one stimulus, two (or more) responses must be made. If the responses are highly similar, there will be little interference in the second learning and in the test, but if they are quite different (antagonistic) maximum interference will result in both places. In the second case a convergent structure is found. Here the response will be facilitated in both the second learning and the test situation, and the amount of facilitation will be a function of the similarity of the stimuli. In the third case, we can expect little interference beyond that contributed by any interposed activity and little facilitation beyond adaptation to the experimental setting. In general, what is being said here is: making different responses to the same stimulus is more difficult than making the same response to different stimuli. The first situation gives rise to conflicting response tendencies and demands more information about the occasion, while the second situation broadens the occasion for the use of a single response and, hence, requires less information.
2.2.2. Learning Theories
Since ‘theory’ is a somewhat ambiguous word, it seems advisable to outline briefly the conceptual framework which the seminar utilized in its discussion of learning theory.
2.2.2.1. General nature of psychological theories. A fully developed scientific theory contains three distinguishable levels. Level I contains the relatively raw ‘immediately apprehended’ sense data (e.g., the speech sounds, the observations of a dial reading, the perceived movements of a rat). All sciences contain this level, but they differ in their selection of events. Level II contains the concepts which are the special concerns of the science (e.g., the stimulus, the phoneme, energy) and laws which are summaries of their observed relations or hypotheses predicting relations not yet observed. Such concepts are meaningful only if they are unambiguously related directly or indirectly to Level I events. Such a relation is equivalent to an operational definition. Concepts which are not operationally defined, and systems containing many such concepts, are called meaningless. The criterion of meaningfulness is related to that of testability since only meaningful concepts can be used in stating testable hypotheses and laws. Level III contains a formal mathematical or logical system. All concepts on this hypothetical level are purely formal or logical in contrast to those of Level II, which are ‘descriptive’ of Level I events. Level III ordinarily consists of statements defining the elements in the hypothetical system, statements defining operations and relations in terms of the elements, and statements of rules of inference to be used in deriving the theorems of the system. The theorems may be regarded as the logical results of the assumed relations in the postulate set. The interpretation of this formal system consists of placing its entities and relations into correspondence with the concepts of Level II. Thus, a theorem on Level III leads directly to an hypothesis on Level II by means of translation of terms indicated by the interpretation. In turn, the laws or hypotheses of Level II are summaries of observed or predicted Level I events. Because of these relations between the levels of a theory, the formal system of Level III is said to explain the laws of Level II which, in turn, explain the events of Level I.
A scientist is free to select or develop any mathematical or logical system which he desires to use. The utility of his choice is then determined by examining the correspondence between his model or system and the concepts or empirical data which he observes. Ordinarily, it is desired that the model be testable (that it generate meaningful predictions), reliable (that it generate consistent predictions), coherent (not in conflict with itself), comprehensive (that it explain a wide variety of phenomena) and simple. Obviously, both comprehensiveness and simplicity are subjective and debatable, but the other requirements are relatively clear.
Theory-building in psychology has not, of course, proceeded self-consciously to develop level by level as our description above might imply. Psychology developed as a branch of philosophy, as did the other sciences, but the weaning was longer than for most. As late as 1900 most psychology departments were subdivisions within philosophy departments; some still are. Along with the mentalistic tradition of the 19th century, psychologists and pseudo-psychologists were prone to ‘theorize’ by sticking into the organism whatever faculties, aptitudes, instincts, etc., seemed to serve their immediate purpose. There were practically as many intervening ‘explanatory’ constructs as there were things to be explained. This has been aptly entitled ‘junk shop’ psychology.
In the early part of the present century there was a general revulsion against this kind of theorizing, typified by the writings of such men as Watson, Kantor, Weiss, and more recently, Skinner. This stress on objectivity paralleled a similar revolution taking place in linguistics through the same period. These men went to the other extreme, the ‘empty organism’ position. This view held that the psychologist should concentrate on exploring the many functional relations between objectively verifiable S (stimulus) events and objectively verifiable R (response) events, avoiding intervening variables which involve ‘going into’ the organism. Thus, Skinner is content to study the behavior of the rat in the lever box under various stimulus conditions where the only observations are tracings on a recording drum—the actual movements of the animal itself not even being observed. If all variations in R were in fact predictable from knowledge of the current stimulus field, then this model would be sufficient. It is quite apparent, however, that with S conditions constant, the characteristics of R will still vary as functions of factors like past history (previous learning), individual differences in aptitudes, motivation, personality, and so forth. Facts of this order led Woodworth in the middle 30’s, for example, to insert an O in the formula, i.e., S-O-R, where the O refers rather vaguely to gross classes of intervening ‘organismic’ variables.
Most contemporary learning theorists utilize models which introduce certain terms between the S and R. These terms may be thought of as falling roughly into two classes: first, terms which imply nothing about the internal mechanics of the organism but act as convenient summary terms, for example, ‘drive’ defined only in terms of hours since last feeding, ‘habit’ defined in terms of response probabilities or histories, etc.; and second, terms which are intended to describe internal states or activities, such as ‘drive’ defined in terms of blood chemistry, neural and muscular activity, ‘habit’ defined in terms of neural connections and strengths, etc. Most systems use both types of concepts and attempt to avoid the ‘junk shop’ kind of psychology by introducing such terms only when they are unavoidable and by anchoring these variables firmly to antecedent (S) and subsequent (R) observable conditions.
At present, the models of learning theory are sets of Level II concepts and laws—some of which are little better than plausible hypotheses. There have been no acceptable attempts to develop or apply formal Level III systems except on a very limited basis.
2.2.2.2. Four current theories of learning. While it is obviously impossible to develop in detail even one theory of learning in the space available here, an attempt will be made to outline, and present the contrasts between, four current theories which have great influence at the present time. These are the theories of Guthrie, Tolman, Skinner, and Hull. The interested reader is referred to the more adequate accounts of these and other systems given in the list of references following this section.
Guthrie’s Association Theory
Of the theories to be considered here, perhaps the system which is simplest in appearance is that of E. R. Guthrie. This system, which is one of the early offshoots of Watsonian Behaviorism, reduces all learning to a simple associative rule: any combination or totality of stimuli which has accompanied a movement will be followed by that movement when the combination occurs again. Complete learning thus occurs on the first occasion on which a stimulus is paired with a response.
At first glance this simple association rule may seem to be in direct disagreement with the phenomena discussed previously, but this is not at all the case. Guthrie is concerned with stimuli and responses at a ‘molecular’ level. Viewed in minute detail, no total stimulus pattern is ever exactly like another. Even if all external stimuli were rigidly controlled, changes are taking place within the organism (it is getting hungrier, thirstier, older, weaker, etc.; it is tense, relaxed, asleep, etc.; ad infinitum). Similarly, no two movements are ever exactly alike. They differ in the precise musculature involved, the state of the musculature, etc. The consequence of this infinite shading and change in both stimuli and responses is that learning appears to increase gradually through practice and time as more and more of the total possible stimuli and patterns of stimuli become associated with more and more of the relevant responses or muscle actions.
Generalization is taken care of by thinking of similar gross stimuli as actually consisting of overlapping pools of minute stimuli. As the stimuli become more dissimilar, the pools of stimuli overlap less and less until finally there are too few common elements to mediate the appropriate response. In order to handle temporal delays and sequences, Guthrie makes extensive use of movement-produced-stimulation as the actual stimulus field to which the responses are associated. Motivation and reward have no primary status in Guthrie’s system. They operate only as they affect the central principle. Motivation is important in that it determines and intensifies sets of movements which then are available for associative connections. It supplies members to both the stimulus and response pools. Reward is important in that it terminates a class of movements and changes the stimulus situation—removing the organism, so-to-speak, from the situation before other movements can be associated with the stimuli. Thus, reward acts to prevent associations being formed with incompatible responses; it has no ‘positive’ function. Extinction occurs, according to Guthrie, when the ‘correct’ response no longer terminates the situation. Other movements follow and in turn are associated with the stimuli. In this manner on successive trials more and more stimuli are related to other movements and responses until finally the ‘correct’ response disappears. With ever changing stimulus pools, competing responses which are close to the same strength will occur in various alternations, depending on the exact number of stimulus-movement associations present, until one of them gains a clear superiority.
It may be seen even in this brief presentation that Guthrie’s theory deals with inferred elements of external and internal stimuli and inferred elements of responses. If everything is exactly the same, the organism will do exactly as it did the previous time. If it does not, then it may be argued that things really were not all the same. This amounts to saying that critical tests of the theory are difficult, if not impossible, to devise. The theory is facile in explanation but weak in prediction; it can be used to explain almost any (even directly opposite) outcomes. Its generality and simplicity are advantages, but it leaves much to be desired in the way of precision, reliability, and testability.
Tolman’s Sign-Gestalt Theory
A sharp contrast to Guthrie’s theory both as to sources and complexity is the sign-gestalt theory of E. C. Tolman. Drawing from virtually all psychology, from Watson’s behaviorism on one side to Lewin’s gestalt psychology on the other, Tolman builds a purposive, behavioristic theory of learning. The theory breaks sharply with the association of elemental stimuli and elemental movements and attempts to deal with goal-directed, whole acts of the organism. The level of description employed is molar, showing the relation of the organism to the goal. Most significantly perhaps, Tolman insists that what is learned is not movements or responses but ‘sign-significate expectations.’ The organism learns meanings and ‘what-leads-to-what’ relationships. The relationship between a sign and its significate (an early stimulus and a later stimulus) is established in accord with the usual contiguity rule of association, and this relation is the ‘expectation.’ Thus, to Tolman, classical conditioning may be interpreted by saying that the buzzer comes to mean food-in-the-mouth and the salivation is a consequence of this meaning or expectancy.
This system stresses contiguity of stimuli in building up expectations. The closer in time two stimuli occur the greater the likelihood that an expectation will be set up. Practice plays a role in confirming and strengthening expectations. The more often S2 follows S1 the higher is the expectancy. It may be seen that expectancy can be viewed as a cognition of the probability that a given event will follow another. What increases, then, is not response potentials or habits but cognitions, which may be acted on in a variety of ways depending on the cumulative past experiences of the organism with objects and situations in its environment. This gives Tolman’s system flexibility and allows him to predict the striking changes which are sometimes observed in the behavior of organisms when the learning situation is radically changed (such as providing alternative routes to a goal, changing the goal object, etc.).
Generalization is regarded as the result of stimulus sign-equivalence. Alterations in stimuli only affect performance by changing the expectancies of the organism. Reward and motivation have no direct effect upon learning in this system but affect performance, which is regarded as clearly different from learning. Thus, a rat may ‘know’ how to run a maze (i.e., he may know the sign-significate relationships of all of the pathways) but not demonstrate this in performance until he is rewarded for it, at which time his ‘knowledge’ should suddenly become evident. Reward does, of course, enter in as a stimulus significate whose presence or absence confirms or weakens an expectation. Motivation enters in as a sensitizer or emphasizer of certain significates or sign-significate relations which have been associated with it. Extinction is treated as a progressive disconfirmation of expectancies which cumulatively couples with the pattern of preceding situations to eliminate the learned performance. Spontaneous recovery takes place because the pattern of preceding situations is changed and the expectancy is still at some strength. When alternative responses are available, the pattern of behavior will ensue which is in accord with the strongest expectancy, and when that expectancy is disconfirmed the next strongest will control behavior and so on.
Tolman also points out that individual differences in organisms (heredity, age, training, special physical conditions) act to define particular behaviors on any occasion. He is thus one of the few learning theorists to comment on capacity laws, but even he has done little with them. In general, Tolman’s position is a very broad one. He recognizes levels of learning and lately has come to suggest that there may be several kinds of learning. His system has stimulated much research, especially in the area of cognition. He has been criticized for vagueness and non-quantitativeness, but in part this is true of all of these theories.
Skinner’s Descriptive Account
B. F. Skinner himself would deny that his approach is a theory or that psychology needs theories. He prefers, as indicated above, to collect and classify phenomena on Level II, using the most rigorous and simple descriptive categories he can develop, toward the end of systematizing knowledge about the basic phenomena of learning. The first major difference between Skinner and the other theorists discussed here is that he regards the two paradigms, conditioning and instrumental learning, as representing different kinds of learning.
Pavlovian conditioning is regarded as a highly specialized form of learning which plays little part in most human behavior. Skinner terms it respondent conditioning, emphasizing that it utilizes a response which can be elicited by a specific stimulus. The laws of respondent conditioning state (1) that contiguity of stimulation is the condition for increasing the strength of the CS-R relation and (2) that the exercise of the CS-R without the US results in decreased strength. These laws are summary descriptive statements with little elaboration, and in general Skinner has little concern with this kind of learning.
In instrumental conditioning stimulus conditions sufficient to elicit the behavior cannot be specified and are in fact irrelevant to the understanding of this behavior. The important aspect in this model is that responses are emitted and that they generate consequences. Skinner calls this operant behavior, stressing the role of the response. He is most concerned with the laws of this model and is convinced that most human behavior (including specifically language behavior) is dependent on this kind of learning. The basic laws of operant conditioning state that (1) if an operant is followed by the presentation of a reinforcing stimulus, its strength is increased and (2) if an operant is not followed by a reinforcing stimulus, its strength is decreased. In most situations an operant does become related to the stimulus field. It may come to occur, for example, only in the presence or absence of given stimuli. It is then termed a discriminated operant, but it is still not elicited. The stimulus conditions merely furnish the occasion for the appearance of the operant.
Skinner’s system is somewhat like Tolman’s in that it tends to deal with acts (not specific muscle movements) but unlike it in that it stresses the role of reinforcement. The all-important contiguity is that of the response and the reward, and one of the major determinants of the strength of an operant is the number of times the response-reward pairing occurs. These pairings summate in a non-linear but increasing fashion to increase the probability of occurrence or rate of occurrence of the operant.
Skinner introduces the concept of a reflex reserve9 which may be defined loosely as the amount of ‘available activity’ of a given sort which the organism is capable of emitting. Rewarding an operant increases the size of the reserve and non-reward decreases it. The rate of responding at any given moment is the function of the size of the reserve. Responses are emitted as some proportion of the total reserve remaining.
The size of the reserve is not a simple function of the number of reinforcements, however. Skinner has found that periodic reinforcement (one rewarded response every few minutes), aperiodic reinforcement (rewards on a random time schedule), fixed ratio reinforcement (reward every nth response), etc., generate very great reserves. His theory lays considerable stress on the important role played by secondary reinforcement (the discriminatory stimuli), and he finds this quite useful in discussing language behavior.
The proportionality which exists between the reserve and the rate of responding may be altered by differing ‘states’ of the organism. ‘States’ are carefully defined intervening variables such as drive, emotion, etc. The hypothetical term ‘state’ is introduced when it can be shown that several operations affect several reflexes in a similar fashion. States are defined by the operations and their effects and imply no physiological correlates. (It is this aspect of the system which has led to its being labeled as an ‘empty organism’ approach.) Certain states increase the proportionality; others decrease it, but none of them are said to change the size of the reserve. As an example, in a state of high drive a rapid rate of responding would be established and, if the operant were not rewarded, rapid extinction would take place; in a state of low drive the rate of response would be low and extinction slow. Presumably, the same number of responses would be made in either case.
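The relation just described, a rate of responding that is a proportion of the remaining reserve, with drive altering the proportionality but not the reserve, can be illustrated by the following minimal sketch. The depletion rule and all numerical values are hypothetical assumptions introduced only to show why roughly the same total number of responses would be emitted under high and low drive.

```python
# A minimal sketch, on hypothetical numbers, of the reserve/rate relation described
# above: responses emitted per session are a proportion of the remaining reserve,
# drive changes the proportionality (not the reserve), and each unrewarded response
# depletes the reserve by one response.
def extinction(reserve, proportionality):
    """Return (sessions to exhaustion, total responses emitted) during extinction."""
    sessions, total = 0, 0.0
    while reserve >= 1.0:
        emitted = proportionality * reserve   # rate is proportional to the remaining reserve
        reserve -= emitted                    # unrewarded responses deplete the reserve
        total += emitted
        sessions += 1
    return sessions, round(total)

print("high drive:", extinction(reserve=200.0, proportionality=0.5))
print("low drive: ", extinction(reserve=200.0, proportionality=0.1))
# High drive: rapid responding and rapid extinction; low drive: slow responding and
# slow extinction; the total number of responses emitted is about the same in both.
```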
Skinner has studied stimulus discrimination and response differentiation extensively. When reward is made experimentally dependent on stimulus conditions, discrimination takes place. When it is dependent on response characteristics, differentiation of response takes place. Skinner’s view is roughly one of mass behavior in a context of generalized stimuli, both becoming progressively more defined as the situation demands it. The problem, as he sees it, is not explaining generalization but rather the lack of it and, similarly, not response variability but lack of it. This aspect of his approach has some great advantages in dealing with progressively changing behaviors. In situations in which alternative responses are available, the response of highest strength has the greatest probability of occurrence. Alternative responses become available as earlier responses are weakened.
Skinner’s system has been criticized on the grounds of its narrowness, its concern with only the lever box situation as an experimental base, and its use of the reflex reserve concept. It is, however, basically an empirical, descriptive approach and in the main there can be little argument with its basic laws. It has been valuable and stimulating in its somewhat different analysis of the learning process and in the attention it has directed towards special facets of learning phenomena.
Hull’s Deductive System
The most ambitious attempt to develop a rigorous, formal learning theory is unquestionably that of C. L. Hull. This system consists of a basic set of postulates from which, it is hoped, the laws of learning may be deduced in clear and quantitative form. The system stems most directly from Watsonian behaviorism and Thorndike’s connectionism.
At the center of the Hullian system are two notions: habit strength and drive reduction. Habit is a tendency for a given stimulus discharge in the nervous system to evoke a given response. It is what is learned. Drive reduction is the diminution of the neural state accompanying a need. It is the condition which effects learning; it is reinforcement. It is apparent that Hull, unlike some of the other theorists discussed here, does not hesitate to ‘get inside’ the organism and to make positive claims about the nature of physiological events. It should be kept in mind, however, that he anchors these variables (in their role as constructs) to observable events.
Step by step Hull’s postulates describe the process of learning as follows:
Stimuli impinge on the organism and generate neural activity which persists for some time before disappearing (P-1).10 Complex stimuli interact in the nervous system to produce modified stimulus patterns (P-2). Organisms have innate general responses which are set in action by needs. These are not random responses but are selectively sensitized responses which have relatively high probabilities of terminating the specific need (e.g., withdrawal from pain, general movement and locomotion when hungry, etc.) (P-3). When a stimulus trace and a response occur in close contiguity and, at the same time, need is reduced, an increment is added to the habit strength of the particular stimulus-response pair (P-4). This constitutes learning.
Stimuli which are similar evoke the same responses, and the amount of generalization is a function of the difference between the stimuli in terms of ‘just noticeable differences’ (a commonly used form of measurement in the psychology of sensation) (P-5). Drives have stimulus properties and the intensity of a drive stimulus increases with intensity of the drive (P-6). Reaction potential is a product of habit strength and drive (P-7), but does not in itself lead directly to responding. To be effective, reaction potential must be greater than the resistances to response: reactive inhibition (similar to fatigue), conditioned inhibition (learned nonresponding), and the oscillating inhibitory potential associated with the reaction potential (P-8, 9, 10). If the momentary effective reaction potential is above the reaction threshold and stronger than competing responses, the response will occur (P-11). The remaining postulates discuss response measurement and incompatible responses.
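A minimal sketch of the quantitative skeleton of these postulates follows: reaction potential as the product of habit strength and drive, inhibition and an oscillating factor subtracted from it, and a threshold test for whether the response occurs. The scales, parameter values, and the Gaussian form assumed for the oscillation are illustrative assumptions only.

```python
import random

def response_occurs(habit, drive, reactive_inhib, conditioned_inhib,
                    threshold=0.3, oscillation_sd=0.1):
    reaction_potential = habit * drive                     # P-7: habit strength x drive
    inhibition = reactive_inhib + conditioned_inhib        # P-8, P-9: resistances to response
    oscillation = abs(random.gauss(0.0, oscillation_sd))   # P-10: oscillating inhibitory potential (form assumed)
    momentary = reaction_potential - inhibition - oscillation
    return momentary > threshold                           # P-11: must exceed the reaction threshold

random.seed(0)
trials = [response_occurs(habit=0.8, drive=0.9, reactive_inhib=0.1, conditioned_inhib=0.2)
          for _ in range(1000)]
print("proportion of trials on which the response occurs:", sum(trials) / len(trials))
```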
Since Hull’s system embraces both of the paradigms given above and is at the same time a reinforcement theory, his concern with contiguity is two-fold. He is concerned with the contiguity of the stimulus and the response and the response and reward. Learning is a function of the time lapse between the stimulus and the response according to a rather steep gradient and also a function of the time lapse between the response and the reward according to a gradient of reinforcement. This gradient of reinforcement is believed to be quite short. The gradient of reinforcement, however, can in effect be lengthened into a goal gradient. Stimuli within the range of the gradient of reinforcement acquire secondary reinforcing power and develop reinforcement gradients of their own. These small overlapping gradients summate to produce a major gradient extending considerable distances in space and time from the primary reinforcement itself. This complex treatment of contiguity proves to be a very useful tool in discussing many learning situations.
Habits are formed and increase in strength as a function of the number of reinforcements and the amount of need reduction. Since Hull specifies his position on generalization, it is easy to see how the summation can take place even though exact conditions are not reproduced. One interesting facet of Hull’s theory is that habits are never ‘unlearned.’ Habit strength can only increase or remain the same since it is a function only of rewarded trials. Unrewarded trials do decrease responses, however, because they lead to increased reactive and conditioned inhibition. A response which had been ‘completely learned’ and ‘completely extinguished’ would be represented here by maximum habit strength and maximum conditioned inhibition with the net result that it would not appear. Drive and drive reduction are also obviously important in this system. Drive has an activational role through its multiplicative relation with habit strength and in addition exercises a stimulus role. Emphasis of this stimulus role permits Hull to explain experimental results which would otherwise be attributed to cognitions and has led to interesting work on generalization gradients and discriminative characteristics of drives.
At every step Hull attempts to state at least tentatively the nature of the mathematical functions relating his constructs to each other and to the observable antecedent and subsequent conditions. He also derives corollaries or secondary principles which amplify the basic principles. Hull’s position is that all behavior should be deducible from the system. He urges that such attempts be made and, when confirmation is not obtained, the postulate set be appropriately revised. He (as indeed all other theorists) regards the system as a self-correcting one, continuously predicting, verifying, and altering until it is complete.
Hull’s system has been criticized for a variety of reasons—its excursions into the nervous system, its insistence upon reinforcement as a necessary condition for all learning, its lack of adequate definition of response, its peculiar mixture of levels in postulates, etc. In the main, however, it has proved to be quite durable. The system has been tremendously successful in stimulating research and providing a frame of reference for new material. It has been, and is being, widely extended to applications in social psychology, personality, and language behavior.
All of the theories outlined above are part of the behavioristic and functionalist tradition. As such they are primarily concerned with the phenomena of learning as manifested in overt behavior and, with the partial exception of Tolman, they all use ‘mechanistic’ terms in describing their concepts. Some critics believe that these theorists have not sufficiently considered physiological knowledge in developing their concepts. A seemingly larger group of critics object to the apparently ‘mechanistic’ and ‘atomistic’ nature of these concepts and feel that such concepts cannot do justice to the full range of human and animal behavior. While such criticism has been expressed in many diverse ways, much of the basis for such objections may be found in the work of the gestalt psychologists, whose name comes from the emphasis they have placed on ‘wholes’ and ‘organizing principles.’ Because the primary concern of this group has been the study of perception and problem solving rather than learning, a summary of their theorizing has not been included in this section. (A brief discussion of some of the gestalt phenomena in perception may be found in section 3.1. of this report.)
2.2.3. Bibliography
1. Guthrie, E. R., 1935, The psychology of learning, New York.
A presentation of Guthrie’s theory along with applications to many everyday behavioral situations. Highly readable.
2. Hilgard, E. R., 1948, Theories of learning, New York.
A fairly recent and comprehensive survey and evaluation of various contemporary conceptions of learning, including gestalt and behavioristic theories.
3. Hilgard, E. R. and D. G. Marquis, 1940, Conditioning and learning, New York.
A more advanced and tightly argued analysis of the fundamental concepts and phenomena of learning. One of the most important critical books in the field.
4. Hull, C. L., 1943, Principles of behavior, New York.
A systematic presentation of Hull’s learning postulates and their application to phenomena. Although more recent revisions have appeared, this is probably the best comprehensive statement.
5. Koffka, K., 1935, Principles of gestalt psychology, New York.
Probably the most detailed and extensive presentation of the gestalt point of view, including materials on perception and learning.
6. Lewin, K., 1936, Principles of topological psychology, New York.
Behavior of organisms treated in terms of field theory.
7. McGeoch, J. A. and A. L. Irion, 1952, The psychology of human learning, New York.
An extensive coverage of the facts and theories of learning as applied particularly to human behavior.
8. Miller, N. E., 1951, Learnable drives and rewards, in Handbook of experimental psychology (Ed., Stevens), New York.
A penetrating analysis of the nature and development of secondary motivation and reward mechanisms, with presentation of experimental findings.
9. Miller, N. E. and J. Dollard, 1941, Social learning and imitation, New Haven.
An application of Hull-type learning theory to a variety of social problems. Provides a very readable introduction.
10. Mowrer, O. H., 1950, Learning theory and personality dynamics, New York.
A collection of Mowrer’s best papers relating learning theory to personality dynamics. Also presents Mowrer’s two-factor theory of learning.
11. National Society for the Study of Education, 1942, The psychology of learning, 41st Yearbook, Part II.
Presents in careful, systematic form several important summaries of contemporary theories as written by their sponsors.
12. Osgood, C. E., 1953, Method and theory in experimental psychology, New York.
A graduate text in experimental psychology, including sections on sensory processes, perception, learning, and symbolic processes. Includes critical analyses of contemporary learning theories and a presentation of the author’s mediation hypothesis.
13. Skinner, B. F., 1938, The behavior of organisms, New York.
The most complete presentation of Skinner’s conception of learning and data derived from his type of instrumental situation. Important for its methodological contributions as well as for its viewpoint.
14. Spence, K. W., 1951, Theoretical interpretations of learning, in Handbook of experimental psychology (Ed., Stevens), New York.
A critical comparative evaluation of contemporary theories of learning by one of the most active theorists and investigators on the scene today.
15. Tolman, E. C., 1932, Purposive behavior in animals and men, New York.
A systematic presentation of Tolman’s early views along with results of numerous experiments. A classic in the field, and a book which has had great influence upon contemporary psychology.
16. Wertheimer, M., 1945, Productive thinking, New York.
A delightful little book, published posthumously, by one of the founders of the gestalt movement. It presents Wertheimer’s insightful analysis of the process of insight and problem-solving in humans.
2.3. The Information Theory Approach11
Strictly speaking, the term information theory is a misnomer. As the following discussion will indicate, it is not a theory of ‘information,’ per se, but of information transmission, and then only in situations where a message input may be said to contain ‘information’ in something like the usual sense of the word. It is concerned with characterizing the entropy or uncertainty of sequences of events. It was with such considerations in mind that it was suggested12 that the label information theory be replaced by theory of signal transmission. At any rate, information theory is essentially an extension of the general mathematical theory of probability which has provided some useful descriptive measures in several areas of scientific research. Therefore, it is necessary to be acquainted with some of the fundamentals of the concept of probability and probability theory in order to properly understand information theory.
2.3.1. Probability Theory
Despite the seeming simplicity of the concept of probability, there has been much controversy among mathematicians and logicians over its definition. The reader interested in the details of this controversy, as well as the development of a mathematical theory of probability from a postulational system, may be referred to works by Nagel (14) and Feller (4), listed in the references appended hereto. There is general agreement, however, that for most practical and theoretical purposes the probability of an event may be defined as the limit of the relative frequency of its occurrence. In mathematical symbolism,
$$p'(i) = \lim_{n \to \infty} \frac{f(i)}{n}$$
where p’(i) is the ‘true’ probability of event i, lim symbolizes the limit of the expression f(i)/n as n becomes indefinitely large, n is the number of events, and f(i) is the frequency of occurrence of event i. In other words, if we have n events, and a particular class of event, symbolized i, occurs f(i) times, the true probability of event i is the value towards which the ratio of f(i) to n tends as we allow n to become indefinitely large.
In practice, of course, we cannot determine probability in this fashion since we can never have an indefinitely large (i.e., infinite) number of events. Therefore, we simply compute
$$p(i) = \frac{f(i)}{n}$$
for some reasonably large but finite value of n. In this case p(i) may be called an empirical estimate of the true probability. If p(i) tends to become nearer and nearer to p’(i) as n increases, then we say that the process generating our sequences of n events is a stochastic process. For example, if we spin a fair roulette wheel we should find that the probability of any number tends to become closer to 1/38 as n, the number of spins or trials, increases. This would be a stochastic process. Now suppose that there is a magnet under the wheel that tends to attract the ball to the 0 or 00 position and which is turned off and on randomly. In this case, we would find that the probabilities of 0 and 00 would tend to decrease toward 1/38 when the magnet is off and increase away from 1/38 when the magnet is on. This would not be a stochastic process since the estimates of probabilities would not converge toward any particular value. However, if the magnet were left on constantly we would most likely find that our probability estimates would converge to some values greater than 1/38 for 0 and 00 and less than 1/38 for the remaining events and we would again have a stochastic process. Since some such severe fluctuation in the condition of sampling is required to make a process non-stochastic, we may generally assume that we are dealing with stochastic processes.
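The convergence of the empirical estimate p(i) = f(i)/n toward the true probability can be illustrated with a short simulation of the fair roulette wheel mentioned above; this is only a sketch, and the simulated sample sizes are arbitrary.

```python
import random

# Relative-frequency estimate of the probability of one number on a fair 38-slot
# roulette wheel; the estimate should drift toward the true value 1/38 as n grows.
random.seed(1)

def estimate(n_spins, target=0):
    hits = sum(1 for _ in range(n_spins) if random.randrange(38) == target)
    return hits / n_spins

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:6d}   p(0) estimate = {estimate(n):.5f}   (true value 1/38 = {1/38:.5f})")
```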
Figure 8
In scientific work we choose certain features of our empirical events as a basis for classification and disregard others. In this manner we create sets or classes of events (e.g., a ‘response’ which may include all the minute variants of a bar-pressing response, or a ‘phoneme,’ which is a class of allophones). Such sets are sub-sets of the class of all events. We will here consider only events which fall into a finite number of classes or sets, since such discrete classes are nearly always used in psychology and in linguistics. We will use the symbol Ω to refer to the class of all events, which is divided into a finite number of sub-classes or sub-sets as in Figure 8. The sub-classes in Figure 8 are mutually exclusive; that is, no event may fall in more than one sub-class, and therefore no two sub-classes have any events in common. We will use S to refer to the class of all such sub-sets. We will let i refer to any of the sets in S and p(i) to the probability of an event being in set i. When Ω and S are defined, and p(i) determined (or estimated) for each set in S, we have defined a probability space. Before we employ probability or information theory in dealing with empirical data, we should make sure that we have a completely defined probability space.
We can now introduce the notions of joint probability, conditional probability, and independence which are so crucial to information theory. If we have a set of simultaneous or successive events, each of which may fall into one of the classes in the probability space, it is often useful to consider the probability of a joint occurrence of events. The joint probability, p(i, j), of events in classes i and j is the probability of their joint occurrence, that is, the relative frequency of a joint event involving both i and j in a large number (or an indefinitely large number) of joint events. The conditional probability, pi(j), of class j is the probability of j when it is given that an event in class i has also occurred; thus, it is the relative frequency of j in the class of all joint events involving i. In mathematical symbolism,
$$p_i(j) = \frac{p(i, j)}{p(i)}$$
and therefore p(i, j) = p(i)·pi(j). We say that classes i and j are independent if and only if the probability of an event being in class j is unaffected by its being in class i, i.e., if p(j) = pi(j), then i and j are independent. If i and j are independent, then p(i, j) = p(i)·p(j), and this formula, or an extension of it, p(i, j, …, s) = p(i)·p(j) ··· p(s), may be used to compute the joint probability of a set of independent events.
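The following minimal sketch estimates joint and conditional probabilities from a small, entirely hypothetical set of paired observations and applies the independence test just stated; the class labels and counts are invented for illustration.

```python
from collections import Counter

# Hypothetical paired observations (i, j); estimate p(i, j), p_i(j) = p(i, j)/p(i),
# and compare p_i(j) with p(j) to test independence.
pairs = [("A", "C")] * 30 + [("A", "D")] * 10 + [("B", "C")] * 15 + [("B", "D")] * 45
n = len(pairs)

joint = {pair: count / n for pair, count in Counter(pairs).items()}        # p(i, j)
p_i = {i: count / n for i, count in Counter(i for i, _ in pairs).items()}  # p(i)
p_j = {j: count / n for j, count in Counter(j for _, j in pairs).items()}  # p(j)

for (i, j), p_ij in sorted(joint.items()):
    cond = p_ij / p_i[i]                                                   # p_i(j)
    print(f"p({i},{j}) = {p_ij:.2f}   p_{i}({j}) = {cond:.2f}   p({j}) = {p_j[j]:.2f}")
# Here p_A(C) = 0.75 while p(C) = 0.45, so classes A and C are not independent.
```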
Languages appear to be structured so that in a sequence of language events, subsequent events are rarely independent of antecedent events, and not all the possible sequences occur. The so-called Markov process is appropriate as a model for representing such sequences. Suppose, for example, that A, B, C, D, and E represent all possible events in a set of sequences. Suppose also that the conditional probabilities of these events are as represented in Figure 9. O represents the state of the system while at its starting point and is merely a convention which indicates the point at which sequential dependency begins and ends; in this example, sequential dependency begins when A or B occurs and ends whenever C, D, or E occurs. The arrows indicate the sequential order of the alternative events and the figures within the arrows indicate the conditional probabilities of the subsequent events. The complete set of sequences which may be generated in this example and their probabilities are as follows:
The probability of each sequence is the product of the conditional probabilities of each event in the sequence, and may also be said to be the joint probability of the events in the sequence. The sum of these probabilities equals unity. In general, a Markov process may be used to represent any set of sequences of events such that the probabilities of subsequent events are dependent on particular antecedent events.
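Since Figure 9 is not reproduced here, the following sketch uses invented stand-in states and conditional probabilities of the same general shape (O as the starting convention, A or B first, then a terminating C, D, or E) to show how the complete set of sequences and their joint probabilities can be enumerated, and how these probabilities sum to unity.

```python
# Hypothetical Markov process: the states and conditional probabilities below are
# stand-ins, since Figure 9 is not reproduced here.
transitions = {
    "O": {"A": 0.6, "B": 0.4},   # sequential dependency begins when A or B occurs
    "A": {"C": 0.5, "D": 0.5},
    "B": {"D": 0.3, "E": 0.7},   # dependency ends whenever C, D, or E occurs
}
terminal = {"C", "D", "E"}

def sequences(state="O", prob=1.0, path=()):
    """Yield every complete sequence together with its joint probability."""
    if state in terminal:
        yield path, prob
        return
    for nxt, p in transitions[state].items():
        yield from sequences(nxt, prob * p, path + (nxt,))

total = 0.0
for path, prob in sequences():
    total += prob
    print("-".join(path), f"p = {prob:.2f}")
print("sum of sequence probabilities:", round(total, 10))   # equals 1, as stated in the text
```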
2.3.2. Basic Concepts of Information Theory
In the subsequent portions of this presentation, we will be concerned with probability spaces for which Ω (the class of all events) is a restricted class of physical events—e.g., sounds of speech, energy changes which serve as stimulus inputs. S (the class of all sub-sets of events) for these probability spaces is some ‘convenient’ or ‘suitable’ ordering of these events into a finite number of subclasses. Henceforth, we shall refer to the entire set of events in Ω as a system and each of the sub-classes of events in S will be referred to as a state of the system.
Comparable systems may differ in ‘randomness’ due to differences in the probabilities of their states or in the degree to which their states are dependent on prior states. The measures of information theory are extensions of the entropy measures of thermodynamics and measure the degree of entropy—i.e., ‘randomness’—of a system’s states. A system possesses maximum entropy when its states are equiprobable and independent of previous states—uncertainty is maximal and predictions can be no better than chance. For example, consider a system consisting of a tossed coin which has two states, H (heads) and T (tails): if the coin is ‘fair,’ and p(H) = p(T) = .5, the system has maximal entropy and we are maximally uncertain of the outcome of each toss. The entropy of a system is decreased when the probability of some of its states is greater than others—we are less uncertain about what states will occur and predictions can be better than chance. If our coin is ‘biased,’ and p(H) = .75 and p(T) = .25, the system possesses less than maximal entropy and we are less uncertain about the outcome. Entropy can also be reduced by making subsequent events dependent upon antecedent events, which will be discussed in greater detail at a later point.
The term information in information theory is identified with the concept of entropy and so has a meaning that differs somewhat from ordinary usage. The term is not entirely unjustified since a system with little entropy has highly predictable states and the occurrence of any particular state is therefore not very ‘informative.’ The use of the term, information, may also be justified in another manner. The unit of entropy measure, the bit, may be defined as the amount of information needed to specify one of two classes of equally probable events. In the case of our ‘fair’ coin, H and T constitute two equiprobable classes of events so that we would need only one bit of information to determine if H or T has occurred. Now let us consider a system consisting of two such coins whose states are independent. Here, we would require one bit of information to specify the state of each coin so that two bits of information would be required to specify the state of the total system (i.e., the particular combination of positions, HH, HT, TH, TT, assumed by the coins).
If our system consists of m fair and independent coins, then we would require m bits of information to specify the state of the total system. Since such a system can assume k = 2^m states, the amount of information needed to specify the state of the system is log2 k.13 Since more information is required to specify the state of a system as the number of states of the system increases, assuming that the states of the system are equiprobable and the states of sub-systems are independent, it is apparent that this amount of information grows with our uncertainty of predicting states of the system. Hence, amount of information may be regarded as equal to the entropy of the system. If a system has k states, its maximal entropy, Hmax, is given by the equation: Hmax = log2 k. If the states of a system can be divided into m pairs of subclasses, then it is apparent that the system has maximal entropy when these subclasses are independent and equiprobable and the amount of this entropy may be determined by the same reasoning as was applied above to the system consisting of m coins. If k is not an integral power of 2, then the applicability of the argument above is not obvious but it suffices to say that Hmax would then be the average amount of information needed to specify the state of the system if these states were equiprobable and the subsystems were independent. More rigorous mathematical treatments of such considerations may be found in Fano (3) and Shannon (18).
Let i be any event, or state of a system, in the set of events, I, and let p(i) be its probability. If p(i) equals a particular value, say 1/a, we can regard this statement as equivalent to stating that i falls into one of a equiprobable classes. Hence, there will be log2 a bits of information needed to specify i. Let h(i) be this amount of information, so h(i) = log2 a = log2 [1/p(i)] = −log2 p(i). We may express the average, X̄, of a sample of numbers as
$$\bar{X} = \sum_{x} x \cdot \frac{f(x)}{n}$$
where x is the numerical value of any class of members in the sample, f(x) is the frequency of that class, and n is the sample size. In other words, the average (or arithmetical mean) is the sum (Σ) of the values found in the sample, each weighted by its proportional frequency. Now, we have previously defined our probability estimate as
$$p(x) = \frac{f(x)}{n}$$
so that
$$\bar{X} = \sum_{x} x \cdot p(x).$$
We may define the entropy of a system H(I), with a set of states, I, as the average amount of entropy associated with its states. Thus,
$$H(I) = \sum_{i} p(i)\,h(i) = -\sum_{i} p(i)\log_2 p(i).$$
The second form of this equation is the expression usually given as the measure of the entropy of a system.
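A minimal computational sketch of this measure follows, checked against the coin examples given earlier; the particular distributions are merely illustrative.

```python
import math

def entropy(probs):
    """H(I) = -sum of p(i) * log2 p(i); states with p(i) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin: 1.0 bit (maximal entropy for two states)
print(entropy([0.75, 0.25]))   # biased coin: about 0.81 bit, less than maximal
print(entropy([1.0, 0.0]))     # one state always occurs: 0 bits
print(entropy([0.25] * 4))     # four equiprobable states: log2 4 = 2.0 bits
```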
The measure, H(I), has the following characteristics, all of which are in keeping with our intuitive feelings about the notion of entropy or uncertainty:
(a) If one p(i) = 1 and all others are zero, then H(I) = 0. In other words, if one state always occurs, then the behavior of our system is completely predictable and its entropy is zero.
(b) If a system consists of k independent sub-systems, each with entropy H(I), then the entropy of the total system is k H(I). This theorem follows from the same sort of argument that was applied to the system consisting of k coins. This characteristic is one of the justifications for using a logarithmic measure.
(c) If a system has m equiprobable states, then H(I) = Hmax = log2 m. It is apparent that H(I) will approach Hmax as the set of p(i) approaches equiprobability.
(d) If a system has m equiprobable states, then H(I) increases when m increases. Because of this last characteristic, it is desirable to have a measure which may be used to compare systems with different numbers of states. This measure, Hrel(I), relative entropy, is
$$H_{rel}(I) = \frac{H(I)}{H_{max}}.$$
Hrel(I) is zero when H(I) is zero and equals one when H(I) equals Hmax .
A more complex situation is that in which we deal with associated pairs of events. In applications of information theory these pairs of events fall into two main classes: (a) Pairs of events in antecedent (input) and subsequent (output) systems, e.g., the stimulus and the response of behavioristic psychology, the speech sound and the hearer’s interpretation; (b) pairs of antecedent and subsequent events in the same system, e.g., sequences of responses, sequences of phonemes or morphemes. The main value of information theory to linguistic and psycholinguistic problems lies in the application of entropy measures suitable for such situations. In general, these measures indicate how much effect the pattern of antecedent events has on the occurrence of subsequent events, and hence they indicate the degree to which sequences of such events are structured (i.e., non-random).
Let I be a set of antecedent or input states and let J be a set of subsequent or output states and let i and j, respectively, be any member of these sets. We may compute H(I) and H(J) for these sets as shown above. Let I, J be a set of associated pairs of states, let i, j be any member of this set, and let p(i, j) be its probability. For every antecedent or input state, i, there will be a conditional distribution of associated j’s. We may apply the measure of entropy developed above to these distributions and obtain
$$H_i(J) = -\sum_{j} p_i(j)\log_2 p_i(j)$$
where pi(j) is the probability of the subsequent or output state, j, when the associated i has occurred. We may define the conditional entropy, HI(J), of the set I, J as the average amount of entropy associated with these conditional distributions:
$$H_I(J) = \sum_{i} p(i)\,H_i(J) = -\sum_{i}\sum_{j} p(i)\,p_i(j)\log_2 p_i(j).$$
In effect, this measure weights the entropy of the conditional distribution of each i by the value of p(i).
HI(J) has the following characteristics:
(a) If one and only one j occurs with every i, i.e., if for every i, one pi(j) = 1 and all others are zero, then HI(J) = 0. We have already found that H(I) = 0 if but one state of the system occurs. Similarly, all Hi(J) = 0 if but one j occurs with every i, and their average, HI(J), will also be zero.
(b) If the set J is independent of the set I, i.e., if all pi(j) = p(j) and p(i, j) = p(i)p(j), then HI(J) = H(J). This theorem follows directly from the definition of independence, i.e., that all pi(j) = p(j), and from the definition of HI(J).
Because of these two characteristics, conditional entropy, HI(J), is used to measure the amount of random error or ‘noise’ in a communication channel, where we are concerned with pairs of input and output events. If output events are independent of input events, and the distribution of output events is the same regardless of the input events, then ‘noise’ is maximal and HI(J) is equal to H(J). If a particular input event always produces a particular output event, then the system is ‘noiseless’ and HI(J) equals 0. There is no requirement here that output events correspond in any way, or be related to, input events; for example, a ‘scrambler’ such as is used in trans-oceanic telegraph communication, which reliably changes the input sounds into some arbitrary but predictable pattern of electrical energy, is a ‘noiseless’ channel. Thus it can be seen that HI(J) is a measure of random error and not systematic error. It is possible to devise measures of systematic error, e.g., measures of the validity of signal transmission rather than reliability of the transmission,14 but they do not derive readily from entropy estimates. The amount of information transmitted, It, is defined as the amount of information in the output or subsequent system minus the ‘information’ contributed by noise: It = H(J) − HI(J).
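The following sketch computes H(J), the ‘noise’ HI(J), and the amount of information transmitted It = H(J) − HI(J) from a small table of joint probabilities p(i, j); the input and output labels and the probability values are hypothetical.

```python
import math
from collections import defaultdict

def h(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities p(i, j) for a two-state input and two-state output.
p_ij = {("i1", "j1"): 0.40, ("i1", "j2"): 0.10,
        ("i2", "j1"): 0.05, ("i2", "j2"): 0.45}

p_i, p_j = defaultdict(float), defaultdict(float)
for (i, j), p in p_ij.items():      # marginal probabilities
    p_i[i] += p
    p_j[j] += p

H_J = h(p_j.values())
# HI(J): the entropy of each conditional distribution p_i(j) = p(i, j)/p(i),
# weighted by p(i); written compactly as -sum of p(i, j) * log2 p_i(j).
H_I_J = -sum(p * math.log2(p / p_i[i]) for (i, j), p in p_ij.items() if p > 0)

print(f"H(J)  = {H_J:.3f} bits")
print(f"HI(J) = {H_I_J:.3f} bits ('noise')")
print(f"It    = {H_J - H_I_J:.3f} bits transmitted")
```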
Conditional entropy, HI(J), is also of value in measuring redundancy in sequences of states of the same system. If a particular antecedent state always occurs prior to a particular subsequent state, the sequence is completely redundant, and HI(J) equals 0. If the antecedent states are independent of the subsequent states, then there is no redundancy and HI(J) = H(J). There is no reason to suppose that a subsequent state should be dependent on the single antecedent state only. Rather, it is quite conceivable that this dependency could extend for sequences of several states. In determining the extent of this dependency, we redefine I as the class of all possible sequences of r antecedent states, where r = 1,2,3,4,5, . . . , and J is the class of all subsequent states as before. Under these conditions, HI(J) has an additional characteristic.
(c) If all sequences of length s are independent of all such prior sequences in a longer sequence of states, then HI(J) will approach its minimum value as r approaches s and will remain at that value for larger values of r. This characteristic is actually a generalization of characteristic (b) and is based on essentially the same line of reasoning.
Characteristic (c) permits us to use entropy measures to determine the size of the ‘structured’ (i.e., non-random) sequences in any longer sequence of message states: e.g., a linguistic text transcribed phonetically, phonemically, or morphemically. For example, if HI(J) were computed for the example of a Markov process given above, it should reach its minimum for r = 3. However, it is not always practical to use any but small values of r (usually less than 10) because of the difficulty of tabulation and the large sample required to obtain adequate estimates of the probabilities of such a large number of sequences. For sequences consisting of m different states and r units long the number of possible sequences is m^r; e.g., for m = 2 and r = 10, m^r = 2^10 = 1024. An alternative approach has been to determine the conditional entropy of pairs of states r units apart in a sequence (15).
In order to compare two systems with differing numbers of states, it is useful to have a measure of relative conditional entropy, HIrel(J):
$$H_{I\,rel}(J) = \frac{H_I(J)}{H(J)}.$$
This measure will vary from a minimum of zero when HI(J) = 0 and each antecedent state is associated with only one subsequent state to a maximum of 1.00 when HI(J) = H(J) and the subsequent states are independent of the antecedent states. A useful measure of redundancy, R, may be obtained by subtracting HIrel(J) from 1; R = 1 − HIrel(J). This measure will vary from a maximum of 1.00 when HI(J) = 0 to a minimum of zero when HI(J) = H(J).
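As an illustration, the sketch below estimates first-order redundancy, R = 1 − HI(J)/H(J), from the bigram statistics of a short symbol string; the string itself is an arbitrary, hypothetical example, with antecedent states of length r = 1.

```python
import math
from collections import Counter

def h(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical symbol sequence; antecedent states are single symbols (r = 1) and
# subsequent states are the immediately following symbols.
text = "abababbaabababbaababab"
pairs = list(zip(text, text[1:]))
n = len(pairs)

p_j = [c / n for c in Counter(j for _, j in pairs).values()]          # p(j)
p_ij = {pair: c / n for pair, c in Counter(pairs).items()}            # p(i, j)
p_i = {i: c / n for i, c in Counter(i for i, _ in pairs).items()}     # p(i)

H_J = h(p_j)
H_I_J = -sum(p * math.log2(p / p_i[i]) for (i, j), p in p_ij.items())
redundancy = 1 - H_I_J / H_J

print(f"H(J) = {H_J:.3f}   HI(J) = {H_I_J:.3f}   redundancy R = {redundancy:.3f}")
```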
The measure of joint entropy, H(I, J), is closely related to the entropy measures discussed above. Just as we have the measure H(I) defined for a class of single states, I, with
$$H(I) = -\sum_{i} p(i)\log_2 p(i),$$
we have the measure H(I, J) defined for a class of pairs of states, I, J, with
$$H(I, J) = -\sum_{i,j} p(i, j)\log_2 p(i, j).$$
It will then be true of H(I, J) that H(I, J) = H(I) + HI(J). The proof of this theorem is based on the analogous relation, p(i, j) = p(i)pi(j), as derived earlier.
In computing the entropy measures described above it is often convenient to prepare a table of p(i, j)’s of the form illustrated in Figure 10. The values of p(i) and p(j) in the margins of the table may be obtained by simply adding across appropriate rows or columns. H(I) and H(J) may easily be obtained from these marginal figures by simply adding appropriate values of −p log2 p, which may be found in Newman’s (15) or Dolansky’s (2) tables. H(I, J) may be computed by carrying out the same operation on the values of p(i, j) in the main body of the table. HI(J) and HJ(I) may then be obtained directly from the relation given above, i.e., HI(J) = H(I, J) − H(I) and HJ(I) = H(I, J) − H(J).
Figure 10
These computational procedures are best used when entropy measures are being applied to pairs of input and output events or to sequences of no more than two events. The computational method described by Newman (15) is more suitable for longer sequences.
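A minimal sketch of this computational routine follows, using a small hypothetical p(i, j) table in place of Figure 10: marginals are obtained by adding across rows and columns, the three entropies are computed by summing −p log2 p, and the conditional entropies are obtained by subtraction.

```python
import math

def h(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical p(i, j) table of the kind shown in Figure 10: rows are input states,
# columns are output states, and the cell values are invented for illustration.
rows, cols = ["i1", "i2"], ["j1", "j2", "j3"]
table = [[0.20, 0.10, 0.10],
         [0.05, 0.25, 0.30]]

p_i = [sum(row) for row in table]                                              # row marginals p(i)
p_j = [sum(table[r][c] for r in range(len(rows))) for c in range(len(cols))]   # column marginals p(j)

H_I, H_J = h(p_i), h(p_j)
H_IJ = h(p for row in table for p in row)                                      # joint entropy over all cells

print(f"H(I) = {H_I:.3f}   H(J) = {H_J:.3f}   H(I,J) = {H_IJ:.3f}")
print(f"HI(J) = H(I,J) - H(I) = {H_IJ - H_I:.3f}")
print(f"HJ(I) = H(I,J) - H(J) = {H_IJ - H_J:.3f}")
```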
Since most of information theory has been developed to deal with problems of electronic communications systems, only those aspects which seem most applicable to language research have been included above. All discussion of entropy measures for continuous data has been omitted, for example. However, the concept of channel capacity seems to be of potential value to language research. Shannon (19) defines channel capacity, C, essentially, as the maximum rate (in bits per unit time) at which a communications channel can transmit information. His fundamental theorem for a channel with noise states, in effect, that for rates of transmission of less than C it is possible to make HI(J) (i.e., ‘noise’) as small as desired by coding information in some ‘optimal’ fashion but that for rates of transmission greater than C, we can never decrease HI(J) below the amount by which the rate of transmission exceeds C. In other words, error can be reduced as much as desired for transmission rates less than C but increases linearly for rates greater than C. If we regard the human organism as a communications channel and responses as output states, this theorem seems of great theoretical and practical interest to students of human communication.15
2.3.3. Some Applications of Information Theory
Two main classes of application of entropy measures were indicated briefly in section 2.3.2. At this point new symbols for sets of states will be substituted for the general I and J symbols in order to more sharply distinguish the measures used in these two types of situation.
(a) Conditional Relations between Systems. In many cases we are interested in describing the degree to which events in one antecedent system influence events in another, subsequent system to which it is directly or indirectly coupled. We may symbolize the class of events in the antecedent system as I (input) and the class of events in the subsequent system as O (output). Here HI(O) may measure the degree of ‘noise’ in the communication channel between the two systems, or the randomness introduced by the channel. If I were a class of stimuli and O were a class of responses, HI(O) would indicate the average randomness of response tendencies to these stimuli. Conversely, H(O) − HI(O) would index the dependency of output upon input events, i.e., the lack of randomness in the channel.
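A brief sketch, using an invented table of stimulus-response frequencies, shows how H(O), HI(O), and their difference might be estimated from such data.

from math import log2

counts = [[45, 5],       # rows: stimuli (I); columns: responses (O); invented frequencies
          [10, 40]]

n = sum(sum(row) for row in counts)
p = [[c / n for c in row] for row in counts]          # joint probabilities p(i, o)
p_i = [sum(row) for row in p]                         # marginal p(i)
p_o = [sum(col) for col in zip(*p)]                   # marginal p(o)

def entropy(probs):
    return -sum(x * log2(x) for x in probs if x > 0)

h_o = entropy(p_o)                                    # H(O)
h_i_o = entropy(x for row in p for x in row) - entropy(p_i)   # HI(O) = H(I, O) - H(I)
transmitted = h_o - h_i_o                             # H(O) - HI(O): dependency of output on input
print(round(h_i_o, 3), round(transmitted, 3))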
Nearly all studies involving measurement of information transmission have closely followed the familiar communication channel model developed by Shannon (19). The states of the input are usually simple stimuli such as alternative light patterns or spoken commands. The experimental subjects are treated as communication channels and their responses to the alternative stimuli are treated as output states. The conditional entropy of responses to stimuli is the usual measure. Garner and Hake (5) describe the basic methodology of such studies and have followed it in a later series of experimental reports. Since, on the content side, these studies are mainly of interest to those concerned with perception and the design of control system displays, they will not be discussed any further here. However, it should be noted that such methods can be used in psycholinguistic studies where the subject makes an immediate overt response to the message stimuli—i.e., in situations where the message stimuli serve as ‘signals’ rather than ‘symbols.’
In situations where the message stimuli serve as ‘symbols,’ their effect is generally to change response tendencies in some later situation. For reasons discussed in another portion of this report (section 7), the conventional methods of measuring information transmission do not seem suitable. Rather, the reduction of the conditional entropy of responses in some extra-message situation seems to be a more appropriate measure. Bendig’s interesting experimental study (1) is the only one to date which has used this type of measurement.
There is one important caution which should be observed in the application of entropy measures to measurement of information transmission. The meaning of the entropy measures is obviously confounded if the probability of the events we are considering changes during the period of measurement—i.e., if these events cannot be regarded as a stochastic process. It is apparent that such changes will occur if any learning occurs and affects response tendencies during the period of measurement. The contaminating effects of learning may be avoided in any of the three following ways: (I) using groups of similar subjects for short periods of measurement rather than single or small groups of subjects for extended periods of measurement; (II) making several measurements over relatively short periods during the learning process or before and after learning; and (III) using responses which are relatively well learned and which can safely be assumed to be unaffected by learning during the period of measurement. Bendig (1) has used a combination of methods (I) and (II) while Garner and Hake (5) have used method (III) in their experimental studies.
(b) Transitional relations within systems. In other cases we are interested in describing the extent to which antecedent events in a system influence subsequent events in the same system. We may symbolize the class of antecedent events as A and the class of subsequent events as S. Here HA(S) indicates the degree to which on the average particular subsequent events are independent of particular antecedent events, i.e., the degree of randomness in the sequencing. Conversely, H(S) − HA(S) indexes the degree to which antecedent events predict or lead to subsequent events, i.e., the redundancy in the sequencing. If A is a class of antecedent phonemes and S is a class of subsequent phonemes, H(S) − HA(S) indicates the degree to which sequences of these phonemes are structured or organized.
Miller and Frick (11) made an early application of a measure of relative redundancy to measure response stereotypy in learning situations. Newman (16) has analysed the entropy of vowels and consonants in sequences of orthography in a number of languages. Shannon (18) and Newman and Gerstman (17) have examined entropy relations in sequences of English orthography. Shannon (19) and Miller (9) discuss the interesting concept of the order of approximation to the statistical structure of English orthography: a zero-order approximation consists of sequences generated from the assumption that letters have equal probability of occurrence; a first-order approximation is a sequence generated from the assumption that letters occur with the same probability as in English; a second-order approximation is generated from the assumption that sequences of two letters occur with the same probability as in English; an nth-order approximation has the same characteristic for all sequences of n letters. The same techniques have been applied to sequences of words, e.g., by Miller and Selfridge (13). The latter investigators have found that the retention of such sequences after rote learning is directly related to the order of approximation, sequences of higher order approximation being more easily retained. It is perhaps unfortunate that so much attention has been devoted to orthography and so little to spoken language in these studies; it is difficult to relate the results of these researches to linguistic theory. The studies cited above demonstrate the potentialities of these relatively new techniques of measurement, and proposals for further study of entropy relations with special reference to linguistic structure are given elsewhere in this report (particularly section 5.1).
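The sketch below suggests how a second-order approximation to a body of text might be generated mechanically; the sample text is invented, and the procedure is only illustrative of the idea rather than the method actually used in the cited studies.

import random
from collections import Counter, defaultdict

source = "the cat sat on the mat and the rat ran to the barn "   # invented sample text

# Tabulate how often each letter follows each preceding letter (digram frequencies).
followers = defaultdict(Counter)
for a, b in zip(source, source[1:]):
    followers[a][b] += 1

def second_order_approximation(length, seed="t"):
    """Generate text in which each letter is chosen with the observed probability
    of following the immediately preceding letter."""
    out = [seed]
    for _ in range(length - 1):
        options = followers[out[-1]]
        letters, weights = zip(*options.items())
        out.append(random.choices(letters, weights=weights)[0])
    return "".join(out)

print(second_order_approximation(40))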
2.3.4. Some Limiting Considerations
Information theory concepts and measures are particularly liable to misinterpretation and misapplication. For one thing, the term ‘amount of information’ has been used to mean amount of entropy in both situations (a) and (b) above. It is necessary to draw a sharp distinction between information in this sense and its common referential sense—we commonly regard a message as ‘informative’ only if it has some dependable relation to events outside of the message, i.e., an ‘informative’ message is so because it is indicative of some other state of affairs. Thus, we regard language messages with external referential meaning as ‘informative’ but arbitrarily selected sequences, such as random numbers or nonsense syllables, as ‘uninformative.’ When considering sequences of message events we can only measure the degree of randomness, not relations to external events. In this case, entropy measures only indicate how many binary decisions we need to make in order to predict subsequent message states, not how much ‘information’ in the referential sense the message contains. On the other hand, when we are considering pairs of input and output events in the channel connecting different systems, we can determine the relation of the events in the message output to the external events of the input. Also, the entropy produced by the channel, HI(J) or ‘noise,’ is distinguishable from that attributable to the informational content of the input. Only in this case is it possible to measure the amount of information, in the referential sense, in a message by the use of entropy measures, and thus allow the term ‘information’ to retain something like its conventional meaning.
The distinction made above may be clarified by considering the following example. Suppose that a ‘mechanical oracle’ has been constructed which answers any inquiry with a set of randomly selected words such that the choice of words is independent of the inquiry and any previously selected words. The entropy within the sequences of the answer is high due to the absence of redundancy, and the conditional entropy of the answers given the inquiries is also high due to the independence of the answers and the inquiries. Thus, we may have high sequential entropy and low information transmission (due to the high conditional entropy) at the same time. This example demonstrates that the two types of measures are sensitive to different aspects of messages and that we should not indiscriminately equate the amount of entropy with the amount of information in a message if the term information is to retain anything like its conventional meaning. To date, nearly all of the applications of information theory to psycholinguistic problems have been concerned with the measurement of entropy of single message events or sequences of events rather than with the measurement of referential information in messages. This is probably due to the greater ease of direct application of entropy measures in the former situations.
Another frequent misconception probably stems from the term ‘information theory.’ As the previous discussion has indicated, the chief contribution made by this ‘theory’ to the study of language is a set of descriptive measures and a unit, the bit, which are applicable far beyond language processes themselves. It serves chiefly, therefore, as a quantitative tool for describing language processes. It is not a theory of information in the usual sense, nor does it offer a theoretical model which can generate hypotheses about or explain the phenomena of human language communication.
Two limitations of a statistical nature must also be mentioned. In the first place, these entropy measures are as yet of little value in hypothesis testing and statistical inference. This is because so little is known about their sampling distributions. However, a recent paper by Miller and Madow (12) provides a valuable initial step in the derivation of these distributions. It is also possible to test hypotheses concerning the probabilities on which the entropy measures are based by using the well-known Chi-square test. Secondly, entropy measures take no account of similarity among states of a system. Suppose we are studying communication via facial expressions and use HI(O) as a measure of the degree to which observer judgments (O) are dependent upon actor intentions (I). Within the limited number of judgmental categories provided, conditional entropy measures the degree of uncertainty of judgments made in response to facial poses, but it does not reflect similarity or clustering among judgments since each alternative state is treated as unique. It is possible to reclassify output states on some similarity basis, but procedures for doing this do not involve entropy estimates.
2.3.5. Bibliography
1. Bendig, A., 1953, Twenty questions: an information analysis, Journal of Experimental Psychology 46. 345-348.
An experimental study of the reduction of response entropy (i.e., amount of information transmitted) by each of a series of statements in a modified game of Twenty Questions.
2. Dolansky, L. and M. P. Dolansky, 1952, Table of log2 1/p, p log2 1/p, and p log2 1/p + (1 − p) log2 1/(1 − p), Research Laboratory of Electronics, Massachusetts Institute of Technology, Technical Report no. 277, Cambridge.
Use of this table reduces computation of entropy measures to the addition of a set of numbers.
3. Fano, R. M., 1949-50, The transmission of information: I and II. Research Laboratory of Electronics, Massachusetts Institute of Technology, Technical reports Nos. 65 and 149, Cambridge.
The first of these papers covers much the same ground as Shannon and Weaver, but the basic theorems are developed in a different way. The second paper contains a valuable discussion of the problem of coding information for transmission through a noisy channel.
4. Feller, W., 1950, An introduction to probability theory and its applications, New York.
This is one of the best available treatments of probability theory, and it contains an excellent discussion of Markov processes.
5. Garner, W. R. and H. W. Hake, 1951, The amount of information in absolute judgments, Psychological Review 58. 446-59.
A valuable description of application of measures of joint and conditional entropy to a situation where I is a set of stimuli and J is a set of responses. The computational illustrations are also of value.
6. Hockett, C. F., 1953, Review of Shannon and Weaver: The mathematical theory of communication, Language 29. 69-92.
This paper is actually much more than a review. It is rather an introduction to Information Theory for linguists and contains an excellent discussion of potential applications to linguistic problems.
7. Jakobson, R., M. Halle and E. C. Cherry, 1953, Towards the logical description of languages in their phonemic aspect, Language 29. 34-46.
The entropy of the set of Russian phonemes is estimated and the problem of optimal coding in terms of a set of binary distinctive features is discussed.
8. Miller, G. A., 1950, Language engineering, Journal of the Acoustical Society of America 22. 720-25.
The problem of designing a language for special purposes is discussed, as is the relation between information theory variables and results obtained in intelligibility experiments.
9. Miller, G. A., 1951, Language and communication, New York.
The chapter titled “The Statistical Approach” contains some interesting material on frequency counts of English phonemes, letters and words, and some examples of various degrees of ‘statistical approximation’ to English.
10. Miller, G. A., 1953, What is information measurement?, American Psychologist 8. 3-11.
A good elementary, non-technical introduction; it contains an excellent annotated bibliography.
11. Miller, G. A., and F. C. Frick, 1949, Statistical behavioristics and sequences of responses, Psychological Review 56. 311-24.
Contains an elementary development of the theory of measuring sequences of responses.
12. Miller, G. A., and W. G. Madow, On the limiting distribution and asymptotic moments of the maximum likelihood estimate of the Shannon-Wiener measure of amount of information (Unpublished dittoed manuscript).
This is an attempt to derive the sampling distributions of entropy measures and it requires a good deal of mathematical sophistication in order to follow the derivations.
13. Miller, G. A., and J. A. Selfridge, 1950, Verbal context and the recall of meaningful material, American Journal of Psychology 63. 176-85.
An experimental study of the rote learning of written material with various orders of statistical approximation to English. It was found that the amount recalled was directly related to the order of statistical approximation.
14. Nagel, Ernest, 1939, Principles of the theory of probability, International Encyclopedia of Unified Sciences, 1. No. 6, Chicago.
This work is an excellent review of the development and current theoretical status of the concept of probability. The treatment is mainly logical rather than mathematically formal and includes a discussion of some of the more basic theorems of probability theory.
15. Newman, E. B., 1951, Computational methods useful in analyzing series of binary data, American Journal of Psychology 64. 252-62.
Contains a table of −p log2 p that is not as complete as Dolansky’s but may be more readily available. Also, detailed computational illustrations are given for H(I, J) for sequences of various lengths, and a valuable recursion relationship for obtaining HI(J) for these sequences is included. There is also a design for an apparatus to be used in tabulating sequences of binary data.
16. Newman, E. B., 1951, The pattern of vowels and consonants in various languages, American Journal of Psychology 64. 369-79.
The redundancy of consonants and vowels in the orthography of eleven languages is given.
17. Newman, E. B. and L. S. Gerstman, 1952, A new method for analyzing printed English, Journal of Experimental Psychology 44. 114-25.
This is an extension of Shannon’s discussion of redundancy in English orthography. (See 18 below.)
18. Shannon, C. E., 1951, Prediction and entropy of printed English, Bell System Technical Journal, 30. 50-64.
The conditional entropy of the rth letter, given the preceding r − 1 letters, in sequences of printed English is computed for various values of r.
19. Shannon, C. E. and W. Weaver, 1949, The mathematical theory of communication, Urbana.
The section by Shannon contains the derivation of the entropy measures which marked the beginning of information theory as a separate area. This is still the most basic reference in the field, although the reader who is not mathematically sophisticated would be well advised to begin elsewhere. Weaver’s section is essentially a restatement of Shannon’s section with a few additional comments and without the mathematical proofs.
20. Wiener, N., 1948, Cybernetics, New York.
This book contains much of the basis for Shannon’s later development, and the chapter on “Time Series, Information and Communication” contains the original identification of information with entropy. However, both the prose and the mathematics are difficult to follow, and there is little apparent continuity in the whole work. An expository chapter titled “Information, Language and Society” is provocative.
2 Joseph H. Greenberg.
3 Actually there is also a louder stress on the second syllable of ‘sin-tax’ and some would maintain that it is merely the stress difference which is phonemic. Even if this is true for English, the question arises in other languages.
4 This is too simple a formulation. Many problems arise at this point which cannot be discussed here.
5 However, discourse analysis, being currently developed by Zellig S. Harris, carries linguistic techniques beyond the boundary of the sentence, and Thomas A. Sebeok has attempted to study the construction of sets of whole texts of folkloristic character in this manner (16).
6 For a discussion of the word as a unit see section 3.3. of this report.
7 The various processes of linguistic change are discussed in detail in section 6.
8 James J. Jenkins.
9 This concept was used in Behavior of organisms but has been dropped in later work.
10 This is the postulate number, here Postulate 1. The postulates themselves are quite lengthy and detailed. The sentences here are crude approximations.
11 Kellogg Wilson.
12 By Y. Bar-Hillel, in a talk at Massachusetts Institute of Technology, 1952.
13 A logarithm (abbreviated ‘log’) is most simply defined as an exponent. In mathematical symbolism: if x^y = z, then y is the logarithm of z to the base x, i.e., logx z = y, by definition. A base of 2 is most widely used in information theory. The following examples, using logs to the base 2, may make the concept of a logarithm clearer: log2 2 = 1 (since 2^1 = 2), log2 4 = 2 (since 2^2 = 4), log2 8 = 3 (since 2^3 = 8), and log2 16 = 4 (since 2^4 = 16).
Logs of numbers which are not integral powers of 2 can be readily obtained from a table of base 2 logs such as that of Dolansky (2). Since logs to any base are proportional to logs to any other base, we may convert base 10 logs to base 2 logs by the formula log2 x = (1/log10 2) log10 x = 3.3219 log10 x.
Logs to any base have the properties indicated in the three equations below. The base is not indicated but is assumed to be the same in all cases: log (xy) = log x + log y; log (x/y) = log x − log y; log x^n = n log x.
14 Measures termed ‘fidelity’ and ‘communication,’ based on the proportion of total transmission which involves corresponding states of antecedent and subsequent systems, have been described by Osgood and Wilson in a mimeographed paper.
15 Research proposals relating to the empirical determination of channel capacities in human language behavior are suggested in section 5.5.