5. SEQUENTIAL PSYCHOLINGUISTICS
Study of the sequential or transitional structure of language behavior provides a meeting ground for linguists, information theorists, and learning theorists. The linguist, applying his own methods of analysis, discovers hierarchies of more and more inclusive units; the information theorist, usually starting with lower-level units such as letters or words, finds evidence for rather regular oscillations in transitional uncertainty in message sequences, the points of highest uncertainty often corresponding to unit boundaries as linguistically determined; and the learning theorist, working with notions like the habit-family hierarchy, finds it possible to make predictions about sequential psycholinguistic phenomena that can be tested with information theory techniques. Here we come back once again to the problem of units in encoding and decoding, the general notion being that at any given level of selection by speaker or hearer both the transitional probabilities and the correlated indices of habit strength will be higher within units appropriate to that level than between such units. And we again find it necessary to think in terms of interactions between hierarchical levels in the processes of encoding and decoding, a sort of ‘super-Markov process’ in which selection of higher-order, more inclusive units results in a reloading of the transitional probabilities obtaining among lower-order units.
5.1 Transitional Probability, Linguistic Structure, and Systems of Habit-family Hierarchies47
This section offers a general picture of how our three approaches intersect and facilitate one another in understanding sequential mechanisms. It also provides, by way of concrete illustration, a discussion of hesitation phenomena in ordinary conversation and lecturing and some hypotheses about such phenomena which are capable of empirical testing.
5.1.1. Statistical Structure of Messages
Transitional structural analysis assumes units of a given order (phonemes, morphemes, words) and seeks to ascertain their transitional probabilities: “Given an occurrence of the unit x, what is the probability that y will be the next unit to follow?” “That z will be?” Etc. for first-order probability. Or: “Given the sequence of units xy, what is the probability that z will be the next?” “That w will be?” Etc. for second-order probability. Similarly for higher-order probabilities. Or, stated in information-theory terms: “In the case of the occurrence of a sequence xy, what is the ‘amount of information’ in the occurrence of y, given the previous occurrence of x?” (first order). “In the case of the occurrence of the sequence xyw, what is the ‘amount of information’ in the occurrence of w, given the previous occurrence of xy?” (second order).
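By way of concrete illustration, the sketch below estimates such transitional probabilities by simple counting over a sequence of units; it is a minimal modern rendering of the computation just described, and the function name and toy corpus are our own.

```python
from collections import Counter

def transitional_probabilities(units, order=1):
    """Estimate transitional probabilities of a given order from a
    sequence of units (letters, phonemes, morphemes, or words).
    order=1 gives p(y | x); order=2 gives p(z | x, y); and so on."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(len(units) - order):
        context = tuple(units[i:i + order])
        following = units[i + order]
        context_counts[context] += 1
        joint_counts[(context, following)] += 1
    return {(ctx, nxt): n / context_counts[ctx]
            for (ctx, nxt), n in joint_counts.items()}

# Toy example with word units:
words = "the little red schoolhouse sat on a hill near the little red barn".split()
p = transitional_probabilities(words, order=1)
print(p[(("little",), "red")])   # 1.0: 'little' is always followed by 'red' here
```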
The units with which this type of analysis may operate are various. The most readily available units are letters of conventional orthography. This choice may be purposeful, as when the investigator is concerned with the statistical properties of telegraph messages (e.g., Shannon), or it may represent some linguistic naiveté (e.g., Newman’s Samoan Bible). Where interest is in speech behavior, the units chosen should be phonemes rather than orthographic symbols. Morphemes are also possible units for this type of analysis, as are words. Zipf did some simple counting with these orders of units, but he did not obtain transitional probabilities. To the best of our knowledge no one has yet carried out any systematic transitional analysis involving morphemes or words.
Transitional probabilities are determined generally from natural data, i.e., from records of the normal flow of speech. When the units of analysis are words, however, recourse can be had to the experimental device of the word-association test and other short-cut procedures (see sections 5.3, 5.4, 5.5). The transitional probabilities determined in this manner appear to correspond well with those which might be determined from the analysis of a necessarily large amount of natural data, though there is not yet conclusive evidence for this. When the units of analysis are anything less than minimal free forms of a language, no such experimental short-cut appears to be possible.
5.1.2. Statistical vs. Linguistic Structure
We must distinguish between ‘transitional structure’ and ‘linguistic structure.’ The former is a product of statistical analysis; the latter, of linguistic analysis. They reflect important differences in statistical and linguistic procedures. In the procedure of contemporary structural linguistic analysis, frequency of occurrence (of a given unit in a given context, or of a given contrast) is not a relevant criterion. Only the possibility of occurrence—as represented by some one instance or by many instances of it—is relevant. The answers which are sought from data are of a simple yes-or-no type rather than of a how-much type. In statistical analysis, on the other hand, frequencies are the immediate goal of analysis, e.g., the probability of occurrence. Statistical procedure usually ignores, however, a matter which is basic to linguistics—the distinguishing of levels of structure. Linguistic analysis is directed toward the discovery of these levels and their combinatory and hierarchical arrangements. The structure of particular utterances is stated in terms of these, and the structural pattern, or grammar, of a whole language consists of generalized summary statements of the same. Discovery of the hierarchical structure in a language is by means of ‘immediate-constituent’ analysis. The boundaries between constituents on the same level of structure are established. A sentence—any one on this page for example—is not to be broken down simply into all of the units of a given order, such as words or morphemes. Rather, the process of immediate-constituent analysis is carried out, proceeding from level to level of structure, so that constructions on one level are established as constituting the units of structure on the next higher level. The criterion for establishing the boundaries between two different units on the same level is generally that of maximum substitutability of possible replacement parts (Wells, Nida, Harris).
Statistical analysis ignores the differing hierarchical values of these boundaries. All boundaries, for purposes of statistical procedure, are taken as equivalent. Thus, in the sentence just preceding this one, the boundary between are and taken, that between taken and as, and that between as and equivalent are not accorded the different statuses which linguistic analysis would ascribe to them. They are lumped together as cases of the same sort of thing. The differences between them which are initially ignored do, however, get reflected in a certain fashion in the statistical results. These different ‘transitions’ will often be found to have different probabilities of occurrence. The different transitional probabilities are in a way indexical of the different linguistic statuses of the boundaries between the words of each pair. The correspondence is only rough, however, and many other factors besides the linguistic hierarchical statuses of the boundaries affect the transitional probabilities. The former cannot be derived from the latter, nor vice versa.
‘Statistical structure,’ then, is to be understood as denoting the system of transitional-probability relationships between the units of a given order in a language. (Care should be taken not to confuse the terms ‘order’ and ‘level.’ Words, morphemes, and phonemes are different orders of units. Between units of the same order strung along in sequence there are boundaries, or ‘transitions,’ belonging to different levels of construction.) ‘Linguistic structure,’ on the other hand, may be understood as the system of hierarchical combinatory possibilities between the units of a given order.
5.1.3. Behavioral Levels in Encoding and Decoding
There are behavioral data of various kinds which support the inference of at least three psychological levels of organization of linguistic responses (see section 6.1). Osgood has distinguished a ‘representational level,’ an ‘integrational level,’ and a ‘skill level.’ The triggering of linguistic responses appears to be accomplished by a complex of internal stimuli deriving from each of these levels. (It should be noted that the use of the word ‘level’ in the present context is independent of its use in the different context of the preceding paragraphs.) Stimuli from the representational level derive from the meanings or significances of incoming stimuli and have been labeled with the roughly characteristic term, ‘intentions.’ (Meanings and significances of incoming stimuli, of course, derive not only from the external sources, but also from the internal emotive and evaluative systems which in turn derive from past experience and learning.) Intentions are probably more synthetic than their relatively analytic expression in speech. The process of selecting the larger semantic units of language for the expression of intentions has been called ‘semantic encoding.’
Much of the triggering of linguistic responses, however, is accomplished at a lower organizational level of greater automaticity and less conscious awareness. Ordering of semantic units, concrete-relational classification of these, concordal agreement and certain other relational phenomena appear to belong to the ‘integrational level.’ This process has been called ‘grammatical encoding.’ The final triggering of the motor acts which produce the sounds of speech appears to be accomplished on the still lower level of motor skill organization. The sequenced triggering of the individual motor acts in speech is accomplished at a rate of speed which Lashley showed to be, like the individual motor acts in piano playing, too great for each such act to be under specific cortical control via feedback mechanisms. This process may be called ‘motor encoding.’ Speech pathology, particularly aphasia, shows examples of disturbances in each of the above described systems.
5.1.4. Habit-family Hierarchies and Transitional Probabilities
Whenever a variety of stimuli terminate in a common response, we have a convergent habit-family hierarchy; whenever a given stimulus is associated with a variety of responses, we have a divergent habit-family hierarchy. In other words, a habit-family hierarchy is a cluster of associations in which one of the members, S or R, is common. Associations (habits) vary in strength, and variations in habit strength are known to correlate with probability of occurrence of responses (as well as with other indices, such as latency and amplitude). Habit strength, in turn, is known to depend upon variations in both the frequency and contiguity of S-R associations. Information theory measurements deal with the probability of occurrence of one event among the class of possible events of the same order. If we conceive of an antecedent message event (of any order or size of unit) as constituting or indexing a stimulus situation and the subsequent message event (of the same order or size of unit) as constituting a response, then the transitional probability measurements of information theory can be viewed as reflections of the systems of encoding or decoding habit strengths.
Since the linguistic structure of the language and the ‘semantic structure’ of the culture are such that certain message events co-occur more often than others (frequency of S-R) and certain message events appear closer together in the temporal sequence than others (contiguity of S-R), it must follow that at each level of organization hierarchies of habits of varying strength will be developed, and these will correspond to sets of transitional probabilities. Assuming a constant and limited number of alternative events of a given order (phonemes, morphemes, words, constructions, etc.), transitions characterized by convergent hierarchies should correspond to points of relatively low transitional entropy or uncertainty (e.g., where a wide variety of stem morphemes converge upon a limited number of suffixes) and transitions characterized by divergent hierarchies should correspond to points of relatively high transitional entropy or uncertainty (e.g., initial phonemes of words following junctures).
Beyond these general determinants, the habit strengths of associations within hierarchies will vary with frequency and contiguity factors, and so therefore will vary the entropy characteristics of sequential sets of message events. If frequency and contiguity factors are such that all of the alternatives following a given event are of about equal habit strength, uncertainty will be maximal for that number of alternatives; if frequency and contiguity factors are such that one event is highly associated with another and other events only remotely, a relatively low degree of uncertainty will exist. A number of observations of behavioral stereotypy, including the masses of highly regular data about languages assembled by Zipf, lead one to the hypothesis that habit-family hierarchies tend toward a structure such that habit strengths of the member associations decrease according to a logarithmic function of their rank in strength. Further discussion of transitional entropy measurements and entropy profiles will be found in section 5.3.
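A hypothetical numerical sketch of this point: if habit strengths fall off with rank so that response probabilities follow a Zipf-type 1/rank law, the transitional entropy at a choice point depends chiefly on the number of alternatives in the hierarchy, convergent (few-alternative) hierarchies yielding low uncertainty and divergent (many-alternative) ones high uncertainty. The functions and figures below are illustrative assumptions, not data.

```python
import math

def zipf_hierarchy(k):
    """Response probabilities for a hierarchy of k alternatives whose
    strengths fall off as 1/rank (a Zipf-type distribution)."""
    weights = [1.0 / rank for rank in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def transitional_entropy(probabilities):
    """Shannon uncertainty (in bits) of the choice among alternatives."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(transitional_entropy(zipf_hierarchy(3)))   # convergent: ~1.4 bits
print(transitional_entropy(zipf_hierarchy(30)))  # divergent: ~4.1 bits (max log2 30 = ~4.9)
```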
As was pointed out above, it seems necessary to view language behavior as organized simultaneously on at least three levels, a semantic (representational) level, a grammatical (integrational) level, and a receptive-expressive (sensory-motor skill) level. Each of these levels is assumed to deal with units of decreasing size, hierarchically arranged such that the units at a higher level include units of the next lower level. We assume also that habit-family hierarchies of the sort we have been discussing operate at each of these levels. A given antecedent event at the semantic level (e.g., the meaning of a stimulus word in free association tests) will tend to elicit a hierarchy of subsequent semantic events (e.g., meanings of associates of variable strength), as indexed by the hierarchical frequencies of overt responses—arranged, interestingly enough, according to a Zipf-type function. Similarly, reception or production of an antecedent syntactical unit (e.g., a nominal phrase, such as the little red schoolhouse ...) will set up readinesses, based on past redundancies, for a variety of subsequent syntactical units (such as ... sat on a hill, or ... I love is still there, or ... and barn were painted), and these alternative constructions constitute syntactical hierarchies which, although there is no evidence available, probably have a Zipf-type distribution. Similar hierarchical arrangements have been demonstrated for phoneme sequences.
Finally, mention should be made of the conditioning or restricting effect of context upon selection within hierarchies and hence upon transitional probabilities. Given only knowledge of the immediately antecedent event at any one of these levels, uncertainty as to the subsequent event is maximal (within limits imposed by the structure of the hierarchy). As we increase our knowledge by taking into account more and more of the sequence of antecedent events—as well as subsequent events, in the case of decoding—uncertainty as to the subsequent event decreases. Psychologically, this is due to stimulus patterning, i.e., the stimulus, including traces from past events, becomes more specific and hence more precisely associated with a given response than with the others. The association of a subject to the single stimulus word, BLUE, is less predictable than to the sequence, I’M ALL BLACK AND BLUE. The way in which events at superordinate levels reshuffle transitional probabilities at subordinate levels (see particularly section 5.3) can also be understood in terms of the effects of contextual stimuli upon modulating the ‘average’ structure of hierarchies. Thus the cue effects of a given semantic decision persist through some period of time and serve to modify the actual eliciting stimulus pattern at each of a series of hierarchical choice points at some lower level of encoding or decoding. This conception helps explain an apparent paradox—the fact that a speaker’s sequencing is almost perfectly dependable, i.e., he ‘says what he meant to say and it always makes sense,’ despite the uncertainty present from the point of view of the observer with his entropy measurement. The point is that, from the speaker’s point of view, selection at each hierarchy is a simultaneous function of all of the preceding sequence and of regulating inputs from all levels of organization, whereas, from the entropy estimator’s point of view, selection at each hierarchy is being predicted from only first and second order segmental probabilities (usually) and takes only a single level of organization into account (usually).
5.1.5. Pausal, Juncture, and Hesitation Phenomena
Encoding and decoding processes being as complex as they are, it is always difficult to discover easy checks on the type of model described above. The fact that habit strength is inversely correlated with the latency between S and R seems to offer one avenue of approach, however. At any level of the model just described, the stronger the transitional habits, and hence the lower the transitional entropy or uncertainty, the shorter should be the pausal durations separating sequential events. This means that within syllables, within familiar morphemes, and even within familiar words and phrases the durations of pauses (latencies) separating successive events should be minimal, if measurable at all. On the other hand, pauses should be somewhat longer, and hence measurable, at boundaries between units where transitional habits are weak, the number of alternatives large, and hence the transitional probabilities low. The boundaries of constructive units might be of this sort. As will be discussed below, what we are calling ‘hesitation phenomena’ seem to reflect transitions of low probability at the semantic level, and these do not seem to correspond in any simple fashion to standard linguistic boundaries.
Hesitations which interrupt the continuous flow of speech are anything from very brief pauses to extended periods of halting, often filled with ‘hemming and hawing.’ The phenomena we are speaking of are not to be identified with linguistic ‘junctures.’ A variety of phonetic phenomena, including such things as brief pauses, ritardando effect, slight articulatory shifts, and even morphophonemic alternations have at one time or another, or by one writer or another, been set up as ‘juncture phonemes.’ But we are not referring to these. Even if ‘junctures’ sometimes consist of short pauses, the pauses under consideration here are not the same. For one thing, there is a difference in duration. Juncture pauses which we have seen in spectrographic analysis of speech were in the order of a hundredth of a second or less in length. The pauses referred to here, however, are appreciably longer. We are not sure of the lower limit in duration of these pauses, for measurements have not been made, but in general, certainly, they are longer. They often, of course, may last as much as several seconds. Another and more important difference is that they do not characteristically fall at the points in a sentence where junctures are presumed to fall.
This last point may be made clearer by means of an illustration. Consider the speech of a man lecturing or speaking on a difficult and not too familiar subject and, as we say, ‘thinking on his feet.’ There are pauses and perhaps quite a bit of hemming and hawing as he ‘organizes his thoughts’ or ‘gropes for the right expression.’ Compare his output under these conditions with his output if he is reading a prepared and rehearsed typescript on a familiar subject, or if he is delivering it after having committed it to memory. In the latter case the pauses which we note are those which fall at the boundaries of syntactic units, the so-called syntactic junctures. They may be fleetingly brief and few in number, or they may be exaggerated, longer, and more frequent for emphasis and stylistic effect, but in any case they are distributed systematically in some sort of conformity with the linguistic structure of the sentence as revealed by immediate-constituent analysis. This is not so in the first case where the man was thinking on his feet. To be sure, the syntactic junctures appear also here. But in addition there are frequent hesitation pauses. These would vary considerably in length, some would be dead-ends from which the speaker retreats to start over, some might be filled with hem-and-haw to mark time, etc. But the significant thing about these is that the majority of them do not fall at syntactic juncture points. Instead of occurring at the boundaries of major syntactic units, they typically fall at minor structural boundaries and within, rather than at the ends or beginnings, of larger syntactic constructions. Whereas juncture pauses are an aid to the hearer and help to put across the structure of a sentence, these hesitation pauses are often an annoyance to the hearer and interfere with rather than aid in grasping the sentence as a whole. Reading the material after sufficient rehearsal, or speaking it after memorization, would eliminate most of these hesitation pauses. Even the practised and fluent lecturer, however, apparently cannot entirely eliminate these in unrehearsed discourse. He may reduce them to such a point that neither he nor his hearers are aware of them, but a listener who is concentrating upon these rather than on the content of the lecture will find them very marked though brief.
Hesitation pauses have figured very little in linguistic analysis. Probably one reason for this is the way in which the informant technique has been made use of in the past. Whether the informant be an American Indian speaking a strange language to an inquiring linguist, or whether he be a linguist speaking his own language to himself, the time-consuming task of committing the observations to paper has necessitated a great many repetitions of stretches of speech a sentence or less in length. The repetitions demanded of the informant amount to rehearsal and result in his memorization of the phrase or sentence, and thus the hesitation pauses are weeded out. Only nowadays, with the advent of easy-to-use recording machines, are records of speech possible which preserve these little ‘defects’ for the investigator. One group of linguists has recently given particular attention to these pauses, preserving them in their transcriptions. But they have not been clearly enough distinguished from junctures. In some cases they have in fact been regarded as junctures.
Hesitation pauses in speech need much more study. We have hunches as to some of the results which such a study might show. These may be put in the form of hypotheses to be tested. The hypotheses have to do with the suspected relationships between hesitation pauses, transitional structure, and units of encoding. More conjectural are some which have to do with linguistic structure and units of decoding.
Hypothesis 1: Hesitation pauses correspond to the points of highest statistical uncertainty in the sequencing of units of any given order. (High statistical uncertainty = high transitional entropy.) The observations which lead us to formulate this hypothesis have been focused on the sequencing of words. We are relatively hopeful for the substantiation of the hypothesis when the units are of this order. Whether the same may hold true for some sort of hesitation or tempo phenomenon when the units are morphemes or phonemes, or perhaps some higher-order phrase units is a completely open question.
Testing of this hypothesis will require accumulation of two sorts of data: measurements of hesitation pauses, and transitional probabilities. It should be done for a single speaker, since the values of these would vary considerably with the speaker and his familiarity with various possible subjects of discourse. Our observations suggest to us that magnetic recordings of the class performances of a good lecturer would make excellent material for the identification and measurement of hesitation pauses. The measurement of transitional probabilities, on the other hand, would be more laborious. There are two theoretically possible methods. One might make use of a large amount of natural data, e.g., a semester’s lectures in a particular course. The calculation of all transitional probabilities for every pertinent word in its various contexts, or for every pertinent context and the various words which may follow, would be an impossible task. A limited sampling could be done, however. A more practical short-cut in establishing transitional probabilities would be to administer word-association tests to the speaker of the recorded material. An interesting experiment could be worked out combining these two methods of getting at transitional probabilities. The ‘Cloze’ procedure being developed by Wilson Taylor should also be useful here, in this case given to the speaker himself.
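If both kinds of data were in hand, the test itself would be statistically simple; the fragment below (with wholly invented placeholder numbers) shows the form a test of Hypothesis 1 might take, a strong positive correlation between pause duration and transitional uncertainty counting as support.

```python
from statistics import correlation  # Python 3.10+

# Invented placeholder data: one value per word transition in a recorded passage.
pause_ms    = [20, 850, 15, 30, 1200, 10, 40]       # measured hesitation pauses
uncertainty = [0.5, 3.8, 0.3, 0.9, 4.2, 0.2, 1.0]   # transitional entropy (bits)

r = correlation(pause_ms, uncertainty)
print(f"r = {r:.2f}")  # Hypothesis 1 predicts a strong positive correlation
```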
Hypothesis 2: Hesitation pauses and points of high statistical uncertainty correspond to the beginning of units of encoding. Evidence on this point will be of an indirect sort, since the encoding process is not open to observation. The psychological theory would have a unit of encoding begin with semantic encoding in a higher mediational system and set off a train of more automatic responses in the lower dispositional and motor skill systems. Automaticity of response is a product of frequent repetitions. A response which originally is consciously directed is transformed with sufficient repetition into an automatic unconscious response. (To understand the point, one need only think of a person learning to drive an automobile or to type or to execute immediately and ‘without thinking’ any prescribed response to a given stimulus.) If it should be shown that the stretch of speech from one hesitation pause to the next is a convergent one, i.e., one characterized by decreasing statistical uncertainty (increasing transitional probabilities), then we would have strong support for claiming this as a unit of encoding.
Hypothesis 3: Hesitation pauses and points of high statistical uncertainty frequently do not fall at the points where immediate-constituent analysis would establish boundaries between higher-order linguistic units or where syntactic junctures or ‘facultative pauses’ would occur. Evidence on this question would be relatively easy to assemble, given the recordings and analysis of data proposed under Hyp. 1 above. It would be necessary only to add linguistic analysis of the same material.
Hypothesis 4: The units given by immediate-constituent analysis, and especially those bounded by facultative pause points, do correspond to units of decoding, however. (These do not necessarily coincide with units of encoding: see Hyp. 5.) A definition of ‘unit of decoding’ would have to be given in terms of speech comprehension. It is a common English-class dogma that carefully phrased speech, with pauses, etc. ‘for expression,’ is more comprehensible than either ‘slurred’ or ‘chopped-up’ speech. The phrasing pauses here referred to are characteristically inserted at points which immediate-constituent analysis establishes as the boundaries between larger units. Conceivably an experiment might be designed to test the facilitation or hindrance of comprehension with different distributions of pauses in speech material. Among the experimental distributions would be included the two which we have discussed and which we suspect correspond to units of encoding and to units of decoding, respectively.
Hypothesis 5: Units of encoding for easy, oft-repeated combinations approach coincidence with those of decoding. In such material (e.g., the favorite oft-repeated assertions of a professor in his university lectures) hesitation pauses will tend to be eliminated. The frequent repetitions increase very highly the transitional probabilities between the units of which it is composed and reduce the statistical uncertainty at all points within it. The pauses which remain in the delivery of such material are those which fall at major syntactic juncture points and which may be magnified for stylistic effect or diminished for speed and economy of effort, depending on the content of utterance. A test of this hypothesis could be achieved fairly simply by choosing from a large collection of recordings a number of the most frequently repeated sentences, series of sentences, or parts of sentences, and examining these and comparing them with other less frequent sentences.
5.2 Certain Characteristics of Phoneme Sequences48
A few applications of entropy measures have already been made on the level of phonemic analysis. The probabilities of English phonemes and of all possible sequences of two such phonemes have been estimated from a text of 20,000 phonemes, and appropriate entropy measures have been computed.49 Similar analyses will probably be carried out in the near future on other languages. Such studies would be of great value in describing and comparing the structures of various languages. However, since the factors governing the choice of phonemes extend over long sequences of phonemes, and even morphemes, these entropy measures can at best be regarded as averages over a large set of conditions and so only partial descriptions. Jakobson and his co-workers50 go at this descriptive problem in a different fashion. Here the phoneme is treated as a class of sounds defined by a set of distinctive features. This sort of analysis permits one to estimate the degree to which all of the potential combinations of distinctive features are used. Both of these approaches are utilized in the following analysis.
Whereas the descriptive linguist has usually limited his interest to those combinations which can occur in a language, it appears that analysis of relative frequencies of combinations may reveal data which can be more meaningfully interpreted and lead to more fruitful hypotheses. One such hypothesis is based on the assumption that a message will tend to be produced in such a way as to take into consideration the effort of both the speaker and the hearer (cf., Zipf). For example, in any cluster of consonant phonemes, the minimum effort for the encoder would be the one in which any two successive phonemes would be most similar; this, however, would cause a maximum effort on the part of the listener, who would be forced to make a series of very fine distinctions. For the decoder, the simplest situation is the one in which two succeeding phonemes differ as much as possible, thus making the distinction easy to make. If speech does reflect both factors, then we would expect low frequencies of both extremely similar clusters and extremely different clusters, i.e., a normal distribution curve, where frequency of a cluster is a function of the difference between the two phonemes in the cluster.
Roman Jakobson’s distinctive feature analysis offers a meaningful measure of differences between phonemes. If we compare the English phonemes /p/, /t/, and /θ/, we can establish that /t/ and /θ/ have the same distinctive features, except for their contrast as to continuant/interrupted. This − vs. + contrast is here counted as being a difference of 2 units. On the other hand, while /t/ and /p/ contrast as to grave/acute, which is two units of difference, they also differ in that the feature of strident/mellow is irrelevant in /p/ but is − in /t/. This kind of a difference is here counted as one unit of difference, so that /p/ and /t/ differ by a total of 3 units. In this way, the units of difference between any given phoneme and all other phonemes may be established. The series of 20,000 phonemes analysed by Carroll, showing the frequency with which each phoneme is followed by every other phoneme, provides data for a test of our hypothesis. We would predict that lowest frequencies of clustering would be between phonemes maximally similar or maximally different. The results for 845 consonant clusters are as follows:
[Table of cluster frequencies by units of difference not reproduced.]
It is apparent, then, that clustering does tend to follow a normal curve, except for the disproportionately low occurrence of clusters differing by 6 units. This is based on an analysis whereby /č/ and /ǰ/ are considered unit phonemes. If one accepts a phonemic analysis whereby these affricates are considered to be clusters of /tš/ and /dž/ respectively, each occurrence of /č/ and /ǰ/ becomes a cluster. The resulting analysis into distinctive features indicates that these clusters differ by a total of 6 units difference. The average frequency of clusters differing by 6 units would then be 3.6, which is quite in keeping with the normal curve. It seems justified, then, to assume that at least in consonant clusters, maximum efforts for either encoder or decoder are avoided in favor of those situations where the effort is more or less equally divided. If our hypothesis is correct, it should apply to all languages. Phonemic transcriptions for diverse languages must be analyzed in terms of transitional frequencies, as well as in terms of distinctive features, to determine whether or not this is a general principle.
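The unit-counting rule used in this analysis can be stated exactly. In the sketch below, each phoneme is represented as a mapping from distinctive features to '+' or '-', with irrelevant features simply absent; the feature values shown for /t/ and /p/ are abbreviated for illustration and are not a complete Jakobsonian analysis.

```python
def difference_units(ph1, ph2):
    """Count units of difference between two phonemes: a +/- contrast on
    a shared feature counts 2 units; a feature relevant in one phoneme
    but irrelevant in the other counts 1 unit."""
    units = 0
    for feature in set(ph1) | set(ph2):
        a, b = ph1.get(feature), ph2.get(feature)
        if a is None or b is None:
            units += 1   # relevant vs. irrelevant
        elif a != b:
            units += 2   # + vs. -
    return units

# Abbreviated, illustrative feature values (not a full analysis):
t = {"grave": "-", "continuant": "-", "strident": "-"}
p = {"grave": "+", "continuant": "-"}           # strident/mellow irrelevant in /p/
print(difference_units(t, p))  # 3 units, as in the text: 2 (grave) + 1 (strident)
```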
The above is merely one particular analysis. The investigation of factors which may determine transitional frequencies, however, can be extended to cover all types of data. A further examination of the same material used above indicates other possible fields for investigation. For example, a significantly higher percentage of voiceless stops than of the corresponding voiced stops is found to occur before juncture. The fact that this is due to a well-known historical change in Germanic is of no relevance here. Precisely the same psychological factor which might tend to cause this on the synchronic level would also operate in affecting change. Exactly what this factor is, of course, is difficult to determine. One possible explanation is the seemingly reasonable assumption that less information need be given in final position of word units. This would then assume that a voiceless stop gives less information than the corresponding voiced one. It has been suggested by Zipf that the voiceless stop is easier to produce than the corresponding voiced one, so that if the information value is not a factor, the system might tend to choose the unit requiring the least effort. This is, of course, merely a hypothesis which would have to be tested in various languages.
Leopold has suggested that in child language there seems to be a tendency for a stop to be replaced by the corresponding fricative in word final position but not in word initial position. This immediately suggests that a correlation might be found in adult speech. The available corpus of transitional frequencies for English does indicate a tendency for this to be true, but it is not significant with this amount of data. In any case, it seems that this method may have many fruitful applications. The immediate need is for similar data in diverse languages, so that general principles, if any, may be disclosed.
5.3 Applications of Entropy Measures to Problems of Sequential Structure51
Because of the frequently observed effects of antecedent events on the choice of subsequent events in language, the Markov Process has been regarded as an ideal conceptual tool in the study of linguistic structure. The Whorf and Harris models of the English monosyllable could be readily used in such an analysis and lack only the conditional probabilities of passing from one state to another to be complete Markov processes. Likewise, knowledge of syntactical structure can guide us in applying entropy measures to morphemes or words and setting up appropriate Markov processes. However, it seems that existing knowledge cannot do more than provide us with guides for describing such relatively simple phenomena, leaving the more complex and less well understood aspects of linguistic structure untouched.
The most obvious way to approach these more complex aspects would be to apply entropy measures to extended sequences of phonemes taken from a large sample of texts. By increasing the length, r, of the sequences of phonemes in A, the class of all possible sequences of antecedent phonemes, we should be able to find a minimum sequence length, say n, for which H_A(S) ceases to decrease significantly, S being the class of subsequent events. The set of joint and conditional probabilities obtained for all sequences of length n or less should enable us to set up a Markov Process which completely represents linguistic structure within the limits of sampling error. While such a procedure is feasible in theory, it is hardly practical because of the enormous effort needed to sample and tabulate the very large number of sequences in A.52 Moreover, the results of such an analysis would be of such a bewildering complexity that they would be practically unusable.
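In outline, the search for such a minimum length n could proceed as sketched below: estimate the conditional entropy H_A(S) from a sample for increasing context lengths r and stop when it ceases to fall appreciably. The toy 'text' here stands in for the large phoneme sample the procedure would actually require, and sampling error is ignored.

```python
import math
from collections import Counter

def conditional_entropy(units, r):
    """H_A(S): average uncertainty (bits) about the next unit given the
    r preceding units, estimated from a sample sequence."""
    ctx_counts, joint_counts = Counter(), Counter()
    for i in range(len(units) - r):
        ctx = tuple(units[i:i + r])
        ctx_counts[ctx] += 1
        joint_counts[(ctx, units[i + r])] += 1
    total = sum(joint_counts.values())
    return -sum((n / total) * math.log2(n / ctx_counts[ctx])
                for (ctx, _), n in joint_counts.items())

phonemes = list("maniminimanimommaniminimanimom")  # toy stand-in for a real text
for r in range(1, 5):
    print(r, round(conditional_entropy(phonemes, r), 3))  # should fall, then level off
```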
5.3.1. Higher-order Markov Processes
A more practical conceptual scheme can be devised using the concept of a higher-order Markov Process—a Markov Process such that each of its states is itself a Markov Process. Such a scheme can allow incorporation of the existing units of linguistic analysis. For example, the states of a higher-order Markov Process could be morpheme classes, each of which is represented by a Markov Process whose states are phonemes. Such a representation has additional advantages in clarifying the entropy analysis of phonemes. It is easily demonstrated that the probability of a phoneme may be expressed as a sum, over all the morpheme classes, of the probability of a morpheme class times the probability of the phoneme in that morpheme class.53 In other words, the probability of a phoneme is a weighted average of its probability within each of the morpheme classes. Thus, a highly probable phoneme could be highly probable in just a few morpheme classes or moderately probable in nearly all morpheme classes. In English, the phoneme /ð/ is highly probable only in words including a definite article morpheme (e.g., ‘the,’ ‘these,’ ‘those’) while the high probability of the vowel phonemes is most likely due to their moderate probability in a large number of morpheme classes. A simple count of phonemes over a large sample of texts would be incapable of indicating these phenomena whereas the analysis suggested above should. Thus, the more complex analysis proposed here is potentially more capable of indicating the details of linguistic structure.
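In symbols, the decomposition just cited (footnote 53) is the law of total probability taken over morpheme classes; writing φ for a phoneme and M_1 ... M_k for the morpheme classes:

```latex
% Probability of phoneme \varphi as a weighted average over morpheme
% classes M_1, \dots, M_k (law of total probability):
p(\varphi) \;=\; \sum_{i=1}^{k} p(M_i)\, p(\varphi \mid M_i)
```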
Eventually, it will probably be necessary to establish a hierarchy of Markov Processes where each level contains some of the processes of the next lower level as states. For the time being, however, it would probably be best to confine the setting up of such a hierarchy to linguistic units which are relatively well understood, such as the phoneme and some classes of morphemes. It should be realized that the choice of levels in the hierarchy of Markov Processes is largely a matter of convenience in conceptualizing linguistic structure and that there can be no serious objections to using any particular set of categories so long as the categories on each level provide a probability space with mutually exclusive divisions. In fact, the use of categories established by various schools of linguists may well indicate a hitherto unrealized agreement in the nature of linguistic structure.
5.3.2. Entropy Analysis of a Small Scale Artificial Language54
The analysis of a small scale artificial language given here is meant as a demonstration of a potentially valuable technique. The demonstration hinges on the fact that, given the rules of its construction and the number of its interchangeable states, the total number of messages that can be transmitted is known and finite. This is, of course, not true of natural languages. However, artificial languages can be so designed as to incorporate any particular aspects of language in which the investigator is interested without contamination with the many complexities of natural languages. Moreover, such languages permit the study of phenomena which may not be found in any natural language. While it is correct to point out that we can ‘only get out what we put in’ such an artificial language, this technique allows us to explore the implications of certain aspects of linguistic structure which we may not have been aware of previously.
The entropy analysis of the language described above seems to justify the following conclusions:
(1) The amount of entropy of any message in the language is constant regardless of what type of units are being analysed. However, the amount of redundancy depends upon the characteristics of the symbols in which the message is coded.
(2) The amount of entropy of a message with specified structural boundaries is a function of the ensemble of all possible messages within these boundaries. This ensemble is determinable from the grammar of the language and the inventory of its form classes. It would be possible to test the validity of this conclusion for English sentences of a limited structural type—say of the form Noun-Verb-Noun. This also could be done for similarly limited classes of words, but in both cases we would have to account for the differential probabilities of the units within the language.
(3) The entropy of a given symbol at a given point in a message is a function of the extent to which its transmission narrows the range of possible messages. The average amount of entropy per symbol is an average of such measures.
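A worked instance of conclusion (3), under the simplifying assumption that all messages consistent with the symbols transmitted so far are equiprobable, is the following.

```latex
% If M equiprobable messages are possible before a symbol is transmitted,
% and its transmission narrows the set to M', the entropy of that symbol
% is the reduction in uncertainty:
H \;=\; \log_2 M - \log_2 M'
% e.g., a symbol narrowing 64 possible messages to 8 carries
% 6 - 3 = 3 bits.
```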
5.3.3. Entropy Profiles
The analysis of the amount of entropy reduction for every unit in the model language above seems to be closely related to the entropy profile analysis to be described here. However, there are two important differences which should be noted:
(a) An entropy reduction analysis presupposes that the number of possible messages is finite and that the probabilities of each of the messages are known. An entropy profile analysis involves no assumption concerning the number of possible messages or their probabilities, but requires only that the various component units which can occur in the environment of the message and their probabilities be known. Thus, it appears that entropy reduction analysis could be applied only to limited classes of natural language messages since the number of messages in nearly all languages is indefinitely large.
(b) An entropy reduction analysis of the type above presupposes that the structural units of the language are known. Such an analysis indicates the contribution of these units to entropy reduction. An entropy profile analysis involves no such assumption about the higher order units of a language, but serves in the selection of the most appropriate higher order units.
Let 1, 2, 3, 4 ... n represent a set of sequentially ordered phonemes in a text of n units length. Let x and y be any pair of antecedent and subsequent phonemes in this sequence. Let A be the class of antecedent phonemes which may be selected before any y and let S be the class of subsequent phonemes which may be selected after any x. Let a be any member of A and s be any member of S. Finally, let p_x(s) and p_y(a) represent the conditional probabilities of the s’s and a’s for particular x’s and y’s.
It is possible to measure the entropy of class S after any x and the entropy of the class A before any y by means of the equations

$H_x(S) = -\sum_{s \in S} p_x(s) \log_2 p_x(s)$ and $H_y(A) = -\sum_{a \in A} p_y(a) \log_2 p_y(a)$.55

The total amount of entropy between any x and y, E(x, y), is given by E(x, y) = H_x(S) + H_y(A).
Let us examine the behavior of E(x, y) in four extreme cases.
(1) Only one phoneme follows x and only one phoneme precedes y. In other words, x and y always occur together.
Obviously, H_x(S) = H_y(A) = 0, so E(x, y) = 0.
(2) (a) Only one phoneme, y, follows x, but k different phonemes can precede y equiprobably.
Obviously, H_x(S) = 0 and H_y(A) = log_2 k, so E(x, y) = log_2 k.
(b) l different phonemes follow x equiprobably, but only one phoneme, x, can precede y.
Obviously, H_x(S) = log_2 l and H_y(A) = 0, so E(x, y) = log_2 l.
(3) l different phonemes can follow x equiprobably and k different phonemes can precede y equiprobably.
Obviously, H_x(S) = log_2 l and H_y(A) = log_2 k, so that E(x, y) = log_2 l + log_2 k.56
Once E(x, y) has been computed for all pairs (x, y) for the sequentially ordered phonemes 1, 2, 3 ... n, an entropy profile may be plotted from these values. We would expect this profile to be near zero for instances of high redundancy like case 1, to be moderately high for instances of partial redundancy like cases 2 (a) and (b), and to be maximally high for instances of minimal redundancy like case 3. We may further distinguish between the two types of case 2 instances by plotting H_x(S) on the same graph as E(x, y) since H_x(S) will be low for instances like case 2 (a) and high for instances like 2 (b). The resulting profile will appear as in Fig. 11. The two data points between phonemes 1 and 2 represent E(1, 2) and H_1(S); the two points between phonemes 2 and 3 represent E(2, 3) and H_2(S), etc. The underlined numbers beneath the points on the graph indicate the class type most nearly represented by the entropy relations.
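As a computational sketch (our own, not part of the original proposal), the profile just described can be obtained from bigram counts as follows; a real application would estimate the conditional probabilities from a large text sample rather than from the short toy string used here.

```python
import math
from collections import Counter

def entropy_profile(units):
    """E(x, y) = H_x(S) + H_y(A) at each transition in the sequence,
    with conditional probabilities estimated from the sequence itself."""
    fwd, bwd = Counter(), Counter()
    for x, y in zip(units, units[1:]):
        fwd[(x, y)] += 1   # y, a member of S, following x
        bwd[(y, x)] += 1   # x, a member of A, preceding y

    def h(counts, key):
        pool = [n for (k, _), n in counts.items() if k == key]
        total = sum(pool)
        return -sum((n / total) * math.log2(n / total) for n in pool)

    return [h(fwd, x) + h(bwd, y) for x, y in zip(units, units[1:])]

profile = entropy_profile(list("banananabandana"))
print([round(e, 2) for e in profile])  # relative peaks suggest unit boundaries
```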
So far, we have not mentioned the most tedious part of the computation of entropy profiles—the estimation of the p_x(s)’s and the p_y(a)’s. This estimation can be carried out in at least two distinctly different ways which will naturally lead to somewhat different interpretations:
(1) Estimation from sample texts. Such estimation would require a large sample of texts in a given language with a variety of semantic contents. Since all the phonemes of a language will be included in any sizeable sample, one such estimation should suffice for the computation of entropy profiles for any text in the sampled language. However, this conclusion would not necessarily apply to morphemes or any larger linguistic units. An entropy profile based on such an estimate should indicate only the effect of the formal structure of the language and should be relatively independent of the semantic content of the text.
(2) Estimation from subjects’ anticipations. This technique would require a group of homogeneous subjects—all speakers of the language of the text—to anticipate the phonemes of the text. In obtaining the p_x(s)’s, the text would be given in a forward direction and the subjects would be asked to anticipate what the next phoneme would be. In obtaining the p_y(a)’s, the text would be given in a reverse direction and the subjects would be asked to anticipate the preceding phoneme. In both cases, it would probably be necessary to repeat the portion of the message already given to control for differential memory effects. Also, sufficient instruction concerning the semantic content of the message should be given before the first anticipations are made to insure that the effect of semantic content is relatively constant throughout. This method is rather cumbersome if the units are phonemes, since it could be used only for short texts in one experimental session and because of the difficulty in recording the responses of linguistically naive subjects. Regardless of the units used, it would be necessary to make a new estimation for every new text analysed. Nevertheless, this method of estimation should reflect both the effect of the structure of the language (assuming that the subjects respond in terms of this structure) and the semantic content of the message. While this sort of analysis is of little importance to linguistics, per se, it is of great potential value to the determination of the psycholinguistic units of decoding.
Figure 11. Entropy profile: E(x, y) and H_x(S) plotted at each transition between successive phonemes (graphic not reproduced).
Once entropy profiles have been computed for a variety of texts, it would be of interest to determine the degree to which the points of high entropy in the texts coincide with the morpheme boundaries. If such a correspondence is found, it would be possible to define morphemes objectively in terms of entropy relations and perhaps it would be possible to distinguish various types of morphemes in terms of these entropy relations. If such a correspondence does not occur, we can only hope that the points of high entropy have enough common characteristics to permit the identification of new linguistic units. The discussion of entropy profiles has been mainly concerned with the isolation of morphemic or morpheme-like units as states of some higher order Markov Process. Once these units have been determined, we could transcribe our texts in terms of these known units and repeat the analysis, with the aim of determining yet higher order units.
5.3.4. The ‘Cloze’ Procedure57
While not strictly an application of entropy measurements, a new method of measuring ‘comprehensibility’ of relatively large scale texts being developed by Wilson Taylor is certainly relevant here and could be translated into information theory statistics. The underlying logic of the method is as follows: In the process of encoding, transitional dependencies among semantic events, among grammatical and syntactical regularities, and also (although less importantly here) within skill sequences are simultaneously contributing to a rather precise selection among hierarchies of alternatives at each choice point. If the encoder producing a message and the decoder receiving it happen to have highly similar semantic and grammatical habit systems, the decoder ought to be able to predict or anticipate what the encoder will produce at each moment with considerable accuracy. In other words, if both members of the communication act share common associations and common constructive tendencies, they should be able to anticipate each other’s verbalization.
The term ‘cloze’ is derived from the gestalt notion of closure, e.g., the tendency to fill in a missing gap in a well-structured whole. Given the sequence ‘Chickens cackle and ——— quack,’ almost anyone would immediately supply the missing ‘ducks.’ Similarly, given ‘The old man ——— along the dusty road,’ almost everyone will supply some verb form (grammatical disposition) and most will be affected by the ‘old’ element semantically and choose an appropriate verb, such as ‘hobbled,’ ‘crept,’ or ‘limped.’ As the actual procedure has been worked out, the experimenter deletes every nth word in a text (it has been shown that this automatic procedure works as well as or better than either deleting specific categories of words or deleting words at random, provided one is using a text of sufficient size), leaving equal-sized blanks in their places, and decoding subjects read through the passage filling in the missing words. The more closely the totality of sequential cues in the passage elicits at each test point the same word selection as the original author’s, the higher will be the decoder’s ‘cloze’ score (only absolutely correct fill-ins are counted, since judging synonyms proved too difficult and did not materially affect results).
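The mechanics of deletion and scoring are simple enough to state exactly; the sketch below follows the every-nth-word rule and the exact-match scoring described above, though the function names and passage are our own.

```python
def make_cloze(text, n=5):
    """Delete every nth word, leaving equal-sized blanks."""
    words = text.split()
    answers = {i: w for i, w in enumerate(words) if (i + 1) % n == 0}
    blanked = ["_____" if i in answers else w for i, w in enumerate(words)]
    return " ".join(blanked), answers

def cloze_score(answers, fill_ins):
    """Count only absolutely correct fill-ins, as in Taylor's scoring."""
    return sum(fill_ins.get(i, "").lower() == w.lower()
               for i, w in answers.items())

passage = "Chickens cackle and ducks quack while the old man hobbled along the road"
blanked, answers = make_cloze(passage, n=5)
print(blanked)                                         # 'quack' and 'hobbled' blanked
print(cloze_score(answers, {4: "quack", 9: "crept"}))  # 1 of 2 exactly correct
```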
Taylor has demonstrated the feasibility of this technique as an index of ‘readability’—in fact, it behaves much more satisfactorily than either the Flesch or Dale-Chall formulas. Not only does it order the same materials used as demonstrations by the authors of these standard formulas in the same way, but on some special test materials it alone yields sensible results. For example, both Flesch and Dale-Chall indicate a passage from Gertrude Stein as being very ‘easy’! Taylor’s ‘Cloze’ score shows Stein, more appropriately, as very difficult. In other words, this method takes into account the highly unpredictable semantic and grammatical sequencing characteristic of Stein. Taylor has also tested the assumption that his method is essentially a measure of degree of ‘comprehension.’ In a very carefully designed experiment using Air Force training materials for which comprehension tests were already available, he showed that ‘Cloze’ scores correlated very highly with initial comprehension scores (pre-message) and also predicted terminal comprehension (post-message).
There are many possible applications of this technique to psycholinguistic problems. For example, it is possible to construct alternatively coded messages on the same topic and use the ‘Cloze’ method to determine which form produces the most information transfer (cf., section 7.3 for discussion of an entropy measure of information transfer which could be combined with the Taylor procedure). Along similar lines, one may construct messages which vary in the transitional dependency of either their semantic or grammatical sequencing, or both, and use the ‘Cloze’ procedure to measure the effects produced on decoders (cf., section 5.4 for discussion of a method for constructing such messages). Using the same message and deleting every, say, fifth word, but using five equated groups to cover the entire message (e.g., group I having words 1, 6, 11, etc. deleted, group II having words 2, 7, 12, etc. deleted, and so forth) it should be possible to use this method to construct an entropy profile at the word unit level. The significant advantage of Taylor’s ‘Cloze’ procedure is that it taps simultaneously all of the complex determinants affecting word choice, both at various levels of organization and through long stretches of sequencing; it is applicable to comparing encoders (e.g., readability), messages (comprehensibility), and decoders (e.g., individual differences in reading skills, second language mastery, information about topic, etc.).
5.4 Transitional Organization: Association Techniques58
In any empirical analysis of verbal behavior as it occurs ‘naturally’ (in a conversation, an interview, a letter, a book, an oration or what have you) the investigator is likely to feel overwhelmed with problems of multiple causation affecting the production of utterances. In an effort to simplify the analysis, one might (following Skinner) divide the ‘causes’ into four major groups: (1) States of the speaker. Here one might study such variables as drives or needs, attitudes, beliefs, fatigue, etc. (2) Audience variables. The language or sublanguage spoken or understood by the audience, the stimuli from the audience indicating approval or disapproval, the ease with which the audience can hear the speaker, etc., are important considerations within this category. (3) Verbal and non-verbal referential stimuli. Under this heading one might investigate the effects of presence or absence of things being talked about, past experiences in the presence of given stimuli, discriminative reinforcement histories, etc. (4) Intraverbal connections. In this category one might study the tendencies of a speaker’s responses to influence his future responses; i.e., the tendency of the choice of one word to lead to the choice of a related word later, the choice of one form of utterance to lead to the choice of a particular subsequent one, etc. Since we are concerned here with transitional organization of language behavior, the fourth category is the one to which we may turn our attention.
The general assumption being made here is that emission of any antecedent response increases the probability of occurrence of a hierarchy of interrelated subsequent responses. It is also assumed, of course, that these intraverbal connections arise in the same manner in which any skill sequence arises, through repetition, contiguity, differential reinforcement. It should be recognized that this analysis does not lead immediately to a tool for breaking down contextual effects. Any utterance (especially a single word) may be thought of as belonging to a large number of response hierarchies, sound classes, form classes, sequence classes, frequency classes, etc. The analysis does suggest, however, experimental techniques for dealing with fragments of context in simple situations in which their specific influences may be more precisely studied.
5.4.1. The Word Association Technique
A first approach to the examination of interrelationships between word units may be found in the classic word association test. In this kind of test the subject is instructed to respond to a stimulus word with the first word (other than the stimulus word) that occurs to him. Substantial amounts of data have been collected on the responses of groups of people to small sets of stimulus words. For stimulus words occurring with high frequency in a given culture, hierarchies of response words have been observed. For a given stimulus word a large number of subjects (sometimes as high as 80 per cent) may give the same response word; a much smaller number of subjects gives a second response word; a slightly smaller number gives a third word, and so on down to responses which are made only by individual subjects. Thus, for a stimulus word the probabilities of given responses may be specified for cultural groups. While there is some evidence that the response hierarchies obtained from a group of subjects are related to individual hierarchies of response, this has not been clearly established.
It is possible, then, with this technique to ascertain the transitional probability between stimulus words and response words for a given group under these restricted conditions. This amounts to specifying a divergent hierarchy of responses to each given stimulus word.
In addition, it has recently been shown that these probabilities are directly related to the transitional probabilities between the same words when they are both produced by the subject himself in a restricted recall situation. If S-R words from the association test are scrambled in a list and read to subjects who are asked to recall them, it can be observed that in recalling the list, the subjects tend markedly to recall the words of the pairs together. It appears that recalling one word of a pair acts as a stimulus for the recall of the second word of the pair. As the strength of the word pairs on the association test norms is increased, the amount of pairing in recall increases. To a considerable extent the order (apparently freely determined by the subject) is predictable from a knowledge of the cultural S-R pairs. Our information is thus extended from a knowledge of responses made to outside stimuli to a prediction of responses made to previous responses. This is an important step, since it suggests that we may use the word association test data in constructing experiments to examine the effect of high and low transitional probabilities on the performance of a variety of verbal tasks.
Past experiments in free association have also demonstrated the importance of the instructions given the subject in determining the response words made to the stimuli. If, for example, the suggestion is made that opposites may be given, or, even more directly, the subjects are requested to respond with opposites, the variety of response words decreases markedly and the frequencies of a few responses rise correspondingly. In this situation also the responses are in general more rapid. It is as if a major portion of the response hierarchy were removed and only the specific subportion designated by dual class membership (relation to the stimulus word and oppositeness) were available. The existence of this phenomenon illustrates the possibility of determining transitional probabilities under special limiting conditions.
It has also been shown that speed of response to a stimulus word in free association is an index to the rarity of the response word (although it may also indicate emotional involvement or competition of response words). In like manner, speed of response is also a function of the familiarity or rarity of the stimulus word. This is additional evidence supporting the notion that the free association test measures transitional probabilities in a manner which should be useful in experimentation which moves closer to ‘real life’ situations and the problems of context.
5.4.2. Word Association in the Study of Language Structure
The above characteristics of the word association test suggested a major experiment designed to evaluate the effect of varied transitional probabilities measured in this manner. In brief, the experiment would consist of three stages: (1) building up networks of high and low transitional probabilities by word association techniques, (2) using these networks to construct stories or essays of very high and very low average transitions, and (3) testing these stories against each other for differences in comprehension, reading or speaking ease, ability to withstand mutilation (‘Cloze’ procedure), etc. This would constitute a full-scale test of the efficacy of this approach to transitional probabilities.
Stage one could be accomplished by capitalizing on the control of set and the measurement of association strength which have been pointed out above. A group of subjects could be asked to respond with the first verb they think of when a particular noun is given; the first noun they think of for a given adjective, the first adverb for a given verb, etc. The most popular and most rapidly given words would be paired with the stimulus words to construct high probability sentences. The very infrequent and delayed responses would be used for the low probability sentences. While the stories or essays resulting from the manipulation of these rather sizeable amounts of data might not be great literature, it seems likely that fairly parallel texts (in content) could be developed. In stage two these texts would be assembled and tested to exclude extraneous variables, such as differences in the basic frequency of occurrence of words in the culture, the sentence constructions, order of presentation of material, etc. In stage three the texts would be presented to new groups of subjects in controlled reading situations. It would be predicted that the high transitional probability text would be read faster (both silently and aloud), would require fewer eye fixations, would be more completely understood (as determined by a comprehension test), would be more accurately recalled after a lapse of time, and would be more easily read after mutilation (i.e., when every fifth or tenth word is deleted). All of these predictions can be readily appraised.
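By way of illustration, stage one might be mechanized roughly as follows. The norms dictionary, the frequency and latency figures, and the function name below are all invented for the sketch; real values would come from the group testing described above.

```python
# Sketch of stage one: pairing stimulus words with their strongest (frequent,
# fast) associates for the high-probability text, or their weakest (rare,
# slow) associates for the low-probability text. Norms data are hypothetical.

# Hypothetical norms: stimulus -> list of (response, group frequency,
# mean response latency in seconds)
norms = {
    "dog": [("bark", 410, 1.2), ("bone", 160, 1.5), ("lamppost", 2, 4.8)],
    "sit": [("chair", 380, 1.1), ("down", 250, 1.3), ("ottoman", 3, 5.1)],
}

def pick_associate(stimulus, high=True):
    """Most popular and most rapidly given associate when high=True;
    the very infrequent and delayed associate when high=False."""
    key = lambda r: (r[1], -r[2])  # prefer high frequency, then low latency
    responses = norms[stimulus]
    return (max if high else min)(responses, key=key)[0]

print(pick_associate("sit", high=True))   # chair
print(pick_associate("sit", high=False))  # ottoman
```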
It might be of further interest, and lend verification to this study, if another group of subjects were simply given the lists of words and asked to construct stories. It would be predicted that these subjects would use the combinations found to be of high transitional probability and avoid the low transitional probability combinations. This would constitute further evidence for the similarity of the stimulus-response and response-response conditions and would increase our information concerning these phenomena. If these predictions are borne out, the word association test would appear to be an instrument par excellence for the study and examination of transitional probabilities as they affect context.
Other suggestions appear relevant both for the understanding of context and for the understanding of the phenomenon of word association itself. The linguist looks at free association data or serial associations and notes that stimuli and responses often fall in the same form classes. The data presently available on free associations are not sufficient to determine whether this is generally the case, since for the most part the stimuli have been nouns and adjectives with a very few verbs. It is proposed that a large body of associative data be built up, systematically sampling grammatical classes, grammatical ‘tags’ and various lexical units. Pronouns, verbs, adverbs, prepositions, conjunctions, relative pronouns, etc. must be studied. Various changes in the stimuli (from singular to plural, present to past tense, etc.) need to be explored. This simple kind of experimentation may contribute markedly to our understanding of the language habits which are essential to mature language behavior. Cross-linguistic studies would also be of interest and may yield suggestions for second language learning.
A straightforward linguistic analysis of free association may also contribute to a clarification of ‘normal’ response categories. In spite of the long history of use of free association tests, a satisfactory method of classification has not been found. The most common attempts have been an unsystematic mixture of semantic, psychological and linguistic criteria. The inadequacy of these measures is indicated by the following example: in one system the response ‘length’ to the stimulus ‘long’ is classified as an example of ‘compounding;’ the pair ‘height-high,’ however, is classified as an example of ‘phonetic similarity’ and thus falls in the same category as the response ‘able’ to ‘table.’ Perhaps purely linguistic criteria may be found which will classify possible responses. While it is unlikely that any system will ‘explain’ all of the responses, a suggestion for classification is presented here. We intend to apply it to the broad collection of associations proposed above. It seems probable that refinements may be included as the work progresses.
Word associations may be interpreted as a result of the relative distribution of the stimuli and responses. The similarity between any two words can be conceived linguistically as the degree of similarity in distribution. However, it seems apparent that this similarity may be profitably divided into two classes, paradigmatic and syntagmatic. Two words are considered paradigmatically similar to the extent that they are substitutable in the identical frame (this corresponds rather closely to Zellig Harris’ use of the term ‘selection’) and syntagmatically similar to the extent that they follow one another in utterances.
For example, if we were to measure the paradigmatic similarity between ‘table’ and its most common response in word association tests, i.e., ‘chair,’ we would investigate to what extent they occur in the same frames. ‘Table’ and ‘chair,’ as well as almost any other member of the noun class, such as man, woman, dog, cat, etc., occur, for example, in the frame ‘I saw a ———.’ If we then consider the frame ‘I bought a ———,’ we have eliminated ‘man’ and ‘woman,’ but our class still includes ‘table,’ ‘chair,’ ‘cat,’ ‘dog,’ etc. At the other extreme is the frame ‘My favorite piece of furniture is ———,’ in which ‘table’ and ‘chair’ occur, but not the others. Furthermore, ‘chair’ and not ‘table’ occurs in the frame ‘I like to sit in an easy ———.’ We would then hypothesize that one factor in any word association test is the relative paradigmatic similarity of the hierarchy of responses, so that frequency of responses would be a function of paradigmatic similarity.
Another factor in forming word associations is the relative frequency with which words follow one another in a sequence. The frame ‘I saw a ———’ may obviously be completed by a larger number of possibilities than ‘I bought a ———’ or ‘I sat on a ———.’ Consequently, the associative strength of ‘sat-chair’ will be greater than that of ‘saw-chair.’ We can then define syntagmatic similarity as the probability with which any one word will be followed immediately by the second. It seems reasonable to exclude from this analysis the grammatical morphemes, or function words, such as ‘a,’ ‘and,’ etc.
As presented up to this point, paradigmatic similarity is restricted almost exclusively to words of the same form class. Syntagmatic similarity, however, can be extended to include words of the same as well as of different form classes. The example above is one of association between verb and noun. If we include the frame ‘I bought a table and ———,’ we can establish a syntagmatic similarity between words of the same form class.
If both paradigmatic and syntagmatic similarity may be factors in strength of association, it follows that the highest associative strength will be between words of the same form class, insofar as only these words can be similar both paradigmatically and syntagmatically. It is not surprising, then, to find that the most frequent types of responses among adults are ‘coordination’ (e.g., table-chair) and ‘contrast’ (e.g., black-white).59 Certain related hypotheses present themselves regarding what might be expected in word association tests. For example, if, as seems likely, the sequence ‘black and white’ is more frequent than ‘white and black’ in our culture, this difference should manifest itself in word association tests in that ‘black’ would tend to elicit ‘white’ significantly more often than the reverse. In this light, the Kent-Rosanoff tests60 were re-examined and yielded the following cases (out of 1000 responses): table-chair, 844 vs. chair-table, 494; black-white, 706 vs. white-black, 605; hand-foot, 156 vs. foot-hand, 198; long-short, 758 vs. short-long, 336. The last is an example of a word, in this case ‘short,’ with two competing responses, namely, ‘tall’ and ‘long;’ ‘long,’ however, has just one main response. A series of words of this kind might be used in experiments to test the validity of this hypothesis.
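The directional hypothesis can be checked mechanically once the norms are tabulated. The sketch below simply recasts the frequencies quoted above (per 1000 responses) as forward/backward ratios; the representation is ours, not part of the original analysis.

```python
# Forward/backward asymmetry of association, using the re-examined
# Kent-Rosanoff figures quoted in the text (per 1000 responses).

pairs = {  # (A -> B frequency, B -> A frequency)
    ("table", "chair"): (844, 494),
    ("black", "white"): (706, 605),
    ("hand", "foot"):   (156, 198),
    ("long", "short"):  (758, 336),
}

for (a, b), (fwd, bwd) in pairs.items():
    print(f"{a}->{b}: {fwd}   {b}->{a}: {bwd}   ratio {fwd / bwd:.2f}")
```

A ratio well above 1.00 (as for long-short) is consistent with the sequential-frequency account, while a ratio near or below 1.00 (as for hand-foot) points to competing responses of the kind noted above.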
It seems likely that a few special categories may have to be invoked for a complete analysis of associations. For example, the ‘phonetic similarity’ class (e.g., table-able) may be indispensable, although it accounts for a very small percentage of associations. Even here it seems likely that one might get a higher percentage of this type of response by selecting words which are either paradigmatically or syntagmatically similar as well as phonetically similar.
A basic problem for this analysis, of course, is the determination of objective measures of paradigmatic and syntagmatic similarity. As a beginning, paradigmatic similarity might be defined simply as common form class membership, but a stronger measure of similarity might be developed through judges’ ratings or sentence completion techniques. Syntagmatic similarity is even more difficult to measure. Ideally, an extensive count of spoken and written English might provide it, but the task is too great for practicality. Again, perhaps judges’ ratings or sentence completions will be required. Further work here is sorely needed.
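As a crude starting point, both measures can nevertheless be operationalized on a small corpus. In the sketch below, paradigmatic similarity is approximated by overlap of (preceding word, following word) frames and syntagmatic similarity by the immediate-successor probability; the toy corpus, the Jaccard overlap, and the function names are our illustrative choices, not part of the proposal.

```python
# Rough operationalization of the two similarity notions on a toy corpus.

from collections import Counter

corpus = ("i saw a table . i saw a chair . i bought a table . "
          "i bought a chair . i like to sit in an easy chair .").split()

def frames(word):
    # the set of (previous word, next word) frames in which `word` occurs
    return {(corpus[i - 1], corpus[i + 1])
            for i in range(1, len(corpus) - 1) if corpus[i] == word}

def paradigmatic(w1, w2):
    # degree of substitutability in identical frames (Jaccard overlap)
    f1, f2 = frames(w1), frames(w2)
    return len(f1 & f2) / len(f1 | f2) if (f1 | f2) else 0.0

def syntagmatic(w1, w2):
    # probability that w1 is followed immediately by w2
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(paradigmatic("table", "chair"))  # 0.5: shared frame ('a', '.')
print(syntagmatic("easy", "chair"))    # 1.0 in this toy corpus
```

An extensive count of spoken and written English would, of course, replace the toy corpus; the definitions themselves carry over unchanged.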
5.4.3. Context and Association
Two experiments which have brought the word association work closer to context problems are given here to illustrate the uses of the technique as an experimental tool.
Howes and Osgood61 have made a careful study of word associations in the determination of responses to a complex of four stimulus words. Subjects were told that they would be given four words and that they were to respond to the last word by writing the first other word it made them think of. Control sets used nonsense words or numbers preceding the last word. The experimental sets were devised to study the influence of adjacent words on the responses to the last word. Variables studied were the distance (in time) of an experimental word from the last word, the density of the experimental words (whether they were all calculated to influence the stimulus word in the same way, whether two of them were, or whether only one of them was), and the cultural frequency (by Thorndike-Lorge count) of the experimental words used. As an example, consider the stimulus word ‘man.’ Used alone it evokes ‘woman,’ ‘boy,’ ‘child,’ etc. If the word ‘yellow’ is inserted before it, does the response ‘Chinese,’ ‘Japanese,’ ‘Jap,’ etc. appear? If so, does this response decrease as the word ‘yellow’ is moved one or two words away from the stimulus word ‘man’? If the word ‘alien’ or ‘eastern’ is added in addition to ‘yellow,’ in place of the otherwise neutral words, does this increase the number of ‘Chinese’ responses? If in place of ‘yellow’ some rare synonym were used, would it achieve the same result?
This experiment clearly demonstrated that all of these were significant variables. The subjects did respond to the compound stimuli. The influence of an experimental word decreased as it moved away from the last word. Increasing density increased the number of influenced responses. Words of high cultural frequency exercised more influence than words of low cultural frequency. This experiment is an excellent example of the use of a simple tool to attack this complicated problem. Its implications are immediately obvious.
A second experiment is one undertaken by MacCorquodale and others as part of the Minnesota Studies in Verbal Behavior. While the research is not as yet complete, the influence of associative bonds in context appears to have been demonstrated. In an attempt to reveal ‘thematic strengthening,’ alternate sets of sentences were constructed to have the ‘same meaning.’ In one pair of such sentences a word was changed to a substitutable word, but one which it was felt strengthened a different response hierarchy. The sentences were left incomplete and given to different groups of subjects for completion. For example, one sentence (in its control form) read, ‘The children noticed that the snow was beginning to hide the ground as they got out of ——.’ In its experimental form this sentence read, ‘The children noticed that the snow was beginning to blanket the ground as they got out of ——.’ The difference between the sentences, then, lies in the response hierarchies evoked by the two words ‘hide’ and ‘blanket.’ The sentence exercises only the control that the children must be getting out of something or somewhere. Any difference in the determination of what or where must be the result of the associations strengthened by the changed words. In this example, the control sentence elicited many references to ‘school,’ ‘the bus,’ ‘the house,’ etc., and the experimental sentence in contrast elicited, as hypothesized, a large number of references to ‘bed’ which were almost totally lacking in the control group. Further experimentation with this technique should reveal in actual context the operation of the significant variables dealt with by Howes and Osgood.
In summary it appears that many important questions regarding ‘language in action’ may be attacked with one of the oldest tools in the psychological repertoire, the free association test and its derivatives, and that many challenging hypotheses are available for research.
5.5 Channel Capacity in Semantic Decoding62
In Shannon’s development of information theory, channel capacity is defined as the maximum rate (expressed in bits per second) at which a communication channel can transmit messages with a minimal amount of error. When the rate at which messages are presented to a channel (i.e., the rate of input) exceeds its capacity, the amount of random error in transmission increases with the amount of excess information. Comparable phenomena seem to occur in ordinary language communication. When a radio announcer spins through a series of baseball scores—Yankees 2, Browns 4; Red Sox 12, Senators 5; Indians 3, Tigers 7; White Sox 3, Athletics 2—what is a simple task for the encoder may be an impossible task for the decoder (who wants to know who played whom and with what result). If the decoder does hold onto one particular game and its result, he loses both what went before and what followed. In this case, the channel capacity of the decoder has been exceeded.
Experimentally, it is necessary to deal with the human communicator as a single system intervening between manipulatable states of some physical input system and recordable states of some physical output system, e.g., as intervening between observable stimuli and responses. However, this total communicating unit is comprised of many sub-systems whose limits, or capacities, may vary one from the other. One experimental problem is immediately apparent: in order to study the characteristics of any system in a communication chain, it is necessary to devise conditions under which the capacities of the other systems are not limiting factors.
5.5.1. Theoretical Analysis
For present purposes we shall eliminate, by our choice of stimulus materials and response categories, grammatical aspects of decoding. As shown in Figure 12, the semantic system is here conceived as a set of mediating processes (r1 → s1, . . . , rn → sn) which, as implicit reactions, are dependent to variable degrees upon states of the input system (S1, . . . , Sn). We shall assume that, as with any system, mechanical or organic, the semantic system is limited in the number of different states which it can assume during any finite time, i.e., its rate of shifting from state to state is finite. The capacity of a system under optimal conditions will be called its maximum capacity. We shall also assume that the output system in human language behavior has a lower maximum capacity than the semantic system, i.e., that sequences of ‘ideas’ can proceed at a faster rate than the sequences of vocalizations with which they are associated. What we shall call the functional capacity of a system—that which it displays under a given set of conditions—is always equal to or less than its maximum capacity.
Figure 12
What are some of the conditions affecting functional capacity? (1) The greater the conditional dependencies of states in a given system upon states in its antecedent system, the greater its functional capacity. Conditional dependency in this context is assumed to be equivalent to habit strength. Since latency of reaction is an inverse function of habit strength, the stronger the decoding habits the more rapidly each mediating reaction will follow presentation of the appropriate sign. One of Baseball’s Faithful, for whom the signs ‘Athletics’ and ‘Red Sox’ are strongly associated with differential mediators, would have less difficulty keeping up with the announcer’s stream of scores than a casual follower of the sport.
Referring back to Figure 12, in decoding the conditional dependencies are those between members of the set Sn and members of the set rn. In ordinary language communication, of course, these conditional dependencies are very high (e.g., heard or seen words as physical stimuli are strongly associated with particular significances and not with others); under these conditions, therefore, functional capacities should tend to approach their maxima.
The question of the effect of transitional dependencies within a given system, e.g., the tendency for certain states of the system to follow others with non-chance probabilities, raises a number of complicated problems. Transitional dependency (or redundancy) is assumed to be equivalent to ‘associations’ when dealing with sequences in the semantic system (e.g., predictability of subsequent members of the set rn given knowledge of the occurrence of antecedent members of this set). Certainly, on a practical level, it seems obvious that the rate of decoding will be faster when the sequence required of the semantic system is one for which it is already ‘set’ on the basis of past experience—the Lord’s Prayer is presumably easier to decode than a series of baseball scores.
In the first place, it should be noted that the example given in the preceding paragraph involves high conditional dependencies as well—the sequence ‘impressed’ on the test system by its input happens to correspond to already established transitional probabilities within the test system. If the conditional dependencies between input and semantic systems were near zero (for example, with a series of low association value nonsense syllables as stimuli), only random output could result, and the transitional organization of the semantic system would seem to be irrelevant. On the other hand, with high degrees of conditional dependency between input and semantic systems the amount of transitional dependency operative can be varied independently by manipulating the sequences of input signals (e.g., from a random sequence of signs like AGAINST WOULD IT THE FOLLOWED SET to one of high transitional predictability like A ROLLING STONE GATHERS NO MOSS). This was essentially the variable investigated by Miller and Selfridge in studying the ease of learning verbal materials chosen with varying approximations to English syntactical structure.
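A rough analogue of the Miller and Selfridge materials can be generated mechanically. In the sketch below a bigram table built from a small sample stands in for second-order approximation, and equiprobable sampling from the vocabulary stands in for zero-order approximation; the sample text and function names are ours, whereas Miller and Selfridge elicited their higher-order strings from human subjects.

```python
# Generating word strings at varying orders of approximation to English,
# in the spirit of the Miller-Selfridge materials. The sample text is
# illustrative only.

import random
from collections import defaultdict

sample = ("a rolling stone gathers no moss and a stone wall gathers "
          "moss in the shade").split()
vocab = sorted(set(sample))

def zero_order(length):
    # zero-order approximation: words equiprobable, all dependencies ignored
    return " ".join(random.choice(vocab) for _ in range(length))

def second_order(length, seed="a"):
    # second-order approximation: each word drawn from the observed
    # successors of the preceding word
    followers = defaultdict(list)
    for w1, w2 in zip(sample, sample[1:]):
        followers[w1].append(w2)
    out = [seed]
    for _ in range(length - 1):
        nxt = followers.get(out[-1])
        out.append(random.choice(nxt) if nxt else random.choice(vocab))
    return " ".join(out)

print(zero_order(6))    # e.g. 'moss the gathers a wall no'
print(second_order(6))  # e.g. 'a rolling stone gathers no moss'
```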
Taking into account conditional dependency, we may then state that (2) the more the sequences of states impressed on a system by the input correspond to existing transitional dependencies within the system itself, the greater will be its functional capacity. Assuming equally strong decoding habits, the rate at which A ROLLING STONE GATHERS NO MOSS can be handled by the semantic system would be faster than the rate of handling messages like STONE A NO MOSS ROLLING GATHERS.
So far we have assumed that the number of units in messages, as determined objectively from the input or output, necessarily corresponds to the number of states assumed by the semantic system in any sequence, e.g., A ROLLING STONE GATHERS NO MOSS contains six units because there are six ‘words’ separated by white spaces. But let us suppose for the sake of argument that in English ROLLING is always followed by STONE and STONE is always preceded by ROLLING, i.e., maximum transitional dependency or redundancy—would ROLLING STONE represent one state or two sequential states in the semantic system? Conversely, does everything within the brackets defined by either white spaces (orthography) or pauses of a certain length (speech) constitute a single unit semantically? Since GATHERS is divisible linguistically into GATHER and S in morphemic analysis, aren’t there at least two semantic units here?
We do not as yet have any satisfactory ways for identifying semantic units and correlating them with message units (cf., discussion of psycholinguistic units, section 3). It seems likely, however, that high orders of transitional dependency within systems will be equivalent to reduction in numbers of units or states. The ‘short circuiting process’ envisaged here is presumably more feasible in the semantic system than in the motor skill output system. If r1 is highly predictive of rn, the required indexing responses can be initiated by S1 rather than waiting for sn; but just because R1 (e.g., saying A ROLLING STONE . . . ) is highly predictive of Rn (saying . . . MOSS) does not mean that the encoder will skip the intervening vocalization. Such a ‘short circuiting process’ may, of course, underlie the empirical law associated with Zipf to the effect that frequently used forms tend to become reduced in length.
So far nothing has been said about the number of alternative states among which a system must choose, the variable most often dealt with in information theory studies. The usual observation is that performance, as indexed by latency, errors, or some other measure, decreases as number of alternatives is increased. In such experiments, however, conditional dependencies have been low. In intelligibility studies, for example (cf., Miller’s Language and Communication), it is necessary to work with a signal/noise ratio near discrimination threshold for number of alternatives to have its maximum effect. Obviously, if the signal were clear, it would make little difference how many alternatives were allowed. In ordinary communication, the number of alternative semantic states or meanings is extremely large, but we have no trouble in decoding as long as the peripheral signals are clear.
In effect, such studies have used the number of alternatives as a means of manipulating the conditional and transitional dependencies in their decoding and encoding. Such manipulation is feasible if conditional dependencies are low, so that all of the possible states of a system follow a given state of the antecedent system with nearly equal probabilities. In such a situation, where habit strengths for all responses are nearly equal, additional alternatives have the effect of increasing response randomness or entropy. However, if conditional dependencies are generally high, so that given states of the antecedent system reliably lead to particular states of the subsequent system, then one habit strength is so much larger than the others that additional alternatives can have little effect on response entropy.
Similarly, number of alternatives should become a less important determiner of channel capacity as transitional dependencies within the system increase—if alternative b is highly dependent upon occurrence of alternative a and alternative d is highly dependent upon occurrence of alternative c, we have effectively reduced the alternatives from four to two. Assuming these arguments to be valid, (3) the greater the number of alternative states required of a system, the lower will be its functional capacity; the effectiveness of this factor varies inversely with both the conditional and transitional dependencies involved, having no effect when either is maximal. In other words, channel capacity becomes independent of number of alternatives when either conditional dependency (predictability of states of the subsequent system from those of the antecedent system) or transitional dependency (prediction of subsequent states of the test system from antecedent states of the same system) becomes maximal.
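The argument of hypothesis (3) can be illustrated numerically. In the sketch below the ‘flat’ distribution stands for nearly equal habit strengths (low conditional dependency) and the ‘peaked’ distribution for one dominant habit; the particular probabilities are invented for the illustration.

```python
# Response entropy as a function of number of alternatives under equal
# versus strongly unequal habit strengths.

import math

def entropy(probs):
    # Shannon entropy in bits
    return -sum(p * math.log2(p) for p in probs if p > 0)

for n in (2, 4, 8, 16):
    flat = [1 / n] * n                          # nearly equal habit strengths
    peaked = [0.9] + [0.1 / (n - 1)] * (n - 1)  # one dominant habit
    print(f"n={n:2d}  flat H={entropy(flat):.2f} bits  "
          f"peaked H={entropy(peaked):.2f} bits")
```

With equal habit strengths entropy grows as log2 of the number of alternatives (1, 2, 3, 4 bits), whereas with one dominant habit it creeps up only slightly, which is the sense in which added alternatives lose their effect as dependencies become high.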
A final general variable to be considered is the nature of the alternatives, that is, the dimensions along which they vary. It should be easier, for example, for a subject to choose among four objects differing only in color than to choose among four objects differing simultaneously in shape and color, e.g., among red circle, green circle, yellow circle and blue circle as compared with red circle, green circle, red square, and green square. Generalizing, (4) if total number of alternatives is held constant, the slope of channel capacity as a function of number of alternatives should be steeper as the dimensionality of the alternatives is increased. All of the hypotheses described above are susceptible to experimental test, as are a number of secondary hypotheses to be described in due course.
5.5.2. Experimental Requirements
In most psychological experiments dealing with intact human subjects we manipulate input (stimuli) and observe output (responses). Therefore we are necessarily dealing with the complete decoding-encoding sequence and all of the systems intervening between S and R. In the information theory sense, we are necessarily treating the individual as a channel connecting input and output systems. If we are interested in the capacity of any particular system, it is necessary that the contribution of other systems be minimal and roughly constant. If we are studying decoding time, we want to be able to segregate encoding time. There seems to be no direct way to index decoding time as a separable portion of total time (e.g., time between presentation of stimulus and occurrence of reaction). On the other hand, this does seem to be possible in the case of encoding time. Therefore we would start with decoding time as the variable and encoding time as the constant.
The general nature of the research proposal is as follows: (1) We provide the subject with an extremely simple and overly practiced encoding response (e.g., reaching out and touching an object). (2) Using optimally coded input, we give him practice at the encoding alternatives until conditional dependency under this condition is maximal (e.g., the ‘locations’ of the objects to be touched perfectly established). Optimal coding in this case might be flashing pictures of the objects to be touched on the screen one at a time (e.g., shown a picture of the round, red, tall object, he must touch it as soon as possible). Encoding time under these conditions should quickly reach a stable minimum. (3) The subject is now presented with serial verbal information, either spoken or written, such as ROUND ... RED ... TALL, and must react by touching the correct object as soon as possible after hearing the last signal. (4) The rate of presentation of this serial information is gradually increased. Measurement is made of both total time (from onset of first signal to termination of encoding reaction) and encoding time (from end of last signal to termination of encoding reaction).
The general nature of the results to be expected is shown diagrammatically in Figure 13. Up to a certain critical rate of input (range a), total time will be a decreasing function of increasing rate of presentation and encoding time will be a constant. Encoding time is constant through this range because it depends solely upon decoding of the last signal plus the constant encoding time—prior signals are completely decoded before the next appears. Total time decreases through this range because the rate of presentation is becoming faster. For a certain range beyond this critical point (range b), total time will remain at some constant value while encoding time gradually increases. Encoding time increases because it now includes increasing time spent in decoding prior signals (e.g., the subject is still decoding ROUND when RED appears and is still decoding RED when TALL appears, starting the measurement of encoding time). Total time remains constant through this range because the increase in encoding time is compensated for by the more rapid rate of input. At some further point, total time should become variable, and this should be accompanied by appearance of frequent errors (range c).
At the first critical point—that at which total time becomes constant and encoding time begins to increase—the difference between total time and encoding time should provide a measure of decoding time under these particular conditions, i.e., the amount of time required for decoding the n-1 prior input signals. The projection of this critical point onto the base line (shown by the dashed arrow in Figure 13) should indicate the decoding channel capacity under these conditions, i.e., the rate of input events in units per second which can be handled by the system.
Figure 13
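Reading the first critical point off empirical curves like those of Figure 13 is mechanical once the two time series are in hand. The data points and tolerance below are fabricated purely to mimic ranges a, b, and c; only the logic, that total time plateaus while encoding time begins to rise, comes from the analysis above.

```python
# Locating the boundary between ranges a and b from measured total and
# encoding times. All numbers are fabricated for illustration.

rates = [1, 2, 3, 4, 5, 6, 7, 8]                   # input events per second
total = [6.0, 3.5, 2.6, 2.2, 2.2, 2.2, 2.9, 4.1]   # seconds
encoding = [0.4, 0.4, 0.4, 0.4, 0.7, 1.1, 1.6, 2.3]

def critical_rate(rates, total, encoding, tol=0.05):
    """First rate at which total time stops decreasing (within tol) while
    encoding time starts to rise: the first critical point of Figure 13."""
    for i in range(1, len(rates)):
        plateau = abs(total[i] - total[i - 1]) < tol
        rising = encoding[i] > encoding[i - 1] + tol
        if plateau and rising:
            return rates[i]
    return None

r = critical_rate(rates, total, encoding)
if r is not None:
    i = rates.index(r)
    print(f"critical rate: {r} events/s; decoding time = "
          f"{total[i] - encoding[i]:.2f} s for the prior signals")
```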
Design 1. Materials might consist of a set of objects variable in four ways through two dimensions (SHAPE: circular, square, triangular, oval; COLOR: red, yellow, green, blue) and in two ways through two other dimensions (CROSS-SECTIONAL SIZE: wide, narrow; HEIGHT: tall, short). At any one time, a maximum of 16 alternatives may be set in the panel, either each of 4 shapes in each of 4 colors (16 alternatives, but only two dimensions) or each combination of 2 shapes, in 2 colors, of 2 sizes, and having 2 heights (16 alternatives involving 4 binary choices). In this way one may investigate the effects of varying the number of alternatives when dimensionality is either held constant or varied with number of alternatives. These objects would probably be displayed in a panel against electrical contact switches, so arranged that a mere touch against any one will stop the timers for total and encoding time.
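The two ways of arriving at 16 alternatives in Design 1 can be enumerated directly; the sketch below, with our own labels, makes explicit that dimensionality varies while the number of alternatives stays fixed.

```python
# Enumerating the two stimulus sets of Design 1: 4 x 4 on two dimensions
# versus 2 x 2 x 2 x 2 on four dimensions, both yielding 16 alternatives.

from itertools import product

two_dim = list(product(["circular", "square", "triangular", "oval"],
                       ["red", "yellow", "green", "blue"]))
four_dim = list(product(["circular", "square"], ["red", "green"],
                        ["wide", "narrow"], ["tall", "short"]))

assert len(two_dim) == len(four_dim) == 16
print(two_dim[0])   # ('circular', 'red')                 -- 2 dimensions
print(four_dim[0])  # ('circular', 'red', 'wide', 'tall') -- 4 binary choices
```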
Design 2. A closer approach to typical linguistic materials could be obtained with a panel of ‘nominal’ objects arranged in 1 to 4 rows or columns, these objects being BALL, WHEEL, HAND, FACE, for example, and being set on levers. Each row could be in a different color or some other ‘adjectival’ variable. Each object could be capable of movement in 4 ‘verbal’ ways, e.g., PUSH, RAISE, TURN, SHAKE, and in 4 ‘adverbial’ modes, e.g., QUICKLY, SLOWLY, SMOOTHLY, ROUGHLY. Again, the input information could be recorded on tape and could be varied in both rate and complexity with respect to the response board. At the simplest level would be commands involving only two variables, e.g., PUSH A WHEEL, SHAKE A BALL; this could be extended to RAISE THE RED FACE and TURN THE BLUE WHEEL, etc.; and extended further to four alternative linguistic dimensions, e.g., SHAKE THE YELLOW HAND SLOWLY or TURN THE GREEN BALL SMOOTHLY.
Design 3. This general type of method could probably be extended to the decoding of pictorial materials. We might first give the subject a statement, e.g., THE CIRCLE IS RED, then flash on the screen a simple picture to which the subject is to respond either true or false as quickly as possible. The picture shown could vary from the simplest case of being a red (or not red) circle, to a binary situation showing a circle and a square, one red and the other not red; similarly, the statement to be tested against the picture could be varied from THE CIRCLE IS RED to THE LARGE CIRCLE IS RED (with appropriate samples of objects shown), and so forth. Even more complex linguistic combinations could be used with appropriate pictures, such as THE LARGE BALL UNDER THE ROUND TABLE IS GREEN. Extrinsic variables would be such things as frequency of usage of the labels used (e.g., decoding habit strengths), amount of relevant and irrelevant information, and transitional predictability of the sequences used (e.g., THE MAN IS SMOKING A PIPE should be decoded correctly more quickly than THE WOMAN IS SMOKING A PIPE).
5.5.3. Test of Predictions
(1) The rate of presentation at which encoding time begins to increase will always be equal to that at which total time becomes constant. This applies to all situations and is important because it makes possible the specification of empirical units of decoding channel capacity. If this prediction does not hold, it means that our theoretical analysis of this general situation has been wrong. If it does hold, then we are in a position to explore the effects of many other variables upon decoding time, using this critical rate as an index.
(2) Channel capacity will be an increasing function of the strength of the decoding habits involved—e.g., of conditional dependencies. In the sample material given above, conditional dependencies were maximal—the decoding significances of GREEN, ROUND and so forth are maximal—and this condition therefore serves as a control. If nonsense syllables were substituted for these words with other groups of subjects, and different groups were given varying amounts of pre-training in decoding (seeing particular nonsense items with appropriate objects), it would be possible to test this prediction. The greater the amount of pre-training (and hence, theoretically, the greater the conditional dependency), the greater should become the decoding channel capacity. The function derived would presumably be typical of other learning phenomena, e.g., a negatively accelerated growth curve. Another way of testing this prediction would be to use meaningful materials varying in familiarity or frequency of usage. If VERMILLION, MAUVE, TURQUOISE, and so on were substituted for familiar color labels, one would expect decoding channel capacity to decrease.
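The shape of the predicted pre-training function can be made explicit with a simple parametric form. The asymptote and rate constant below are invented; the only substantive claim, taken from the prediction above, is negative acceleration toward a maximum.

```python
# A negatively accelerated growth curve for decoding channel capacity as a
# function of pre-training, with illustrative parameters (c_max, k invented).

import math

def capacity(trials, c_max=6.0, k=0.05):
    # capacity approaches c_max as conditional dependencies approach maximum
    return c_max * (1 - math.exp(-k * trials))

for t in (0, 10, 20, 40, 80, 160):
    print(f"{t:4d} pre-training trials -> {capacity(t):.2f} events/s")
```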
(3) Channel capacity will be an increasing function of the strengths of associations between sequential semantic states, e.g., transitional dependency. Here one could manipulate external redundancy (for example, man vs. woman smoking pipe as discussed above or “turn wheel” vs. “turn face” in another design) or pre-experimental training (for example, by giving training in which certain sequences of nonsense syllables were highly probable and others unlikely). The manipulation of pre-experimental training would permit the greatest control and hence presumably yield the most stable functions.
(4) Channel capacity will decrease with number of alternatives. (5) Channel capacity will be a steeper function of number of alternatives when number of dimensions of variation also increases than when dimensionality is constant. The materials described under design 1 provide a means of testing these predictions. The total number of alternatives can be increased from 4 to 16 with dimensionality either constant (from 2 shapes in 2 colors to 4 shapes in 4 colors) or increasing (from 2 shapes in 2 colors to 2 shapes in 2 colors of 2 sizes and of 2 heights). If this source of variation is combined with degree of pre-training on nonsense substitutes for meaningful words, then the additional prediction—that number of alternatives as a variable has decreasing effect as conditional dependencies increase—can be tested.
One additional possibility in this line of study may be mentioned, and that is its relation to the problem of psycholinguistic units. If stability of the decoding-time index for a given condition, and abrupt changes in its value for varying conditions, can be demonstrated, it should then be possible to determine what sorts of linguistic variations involve the addition or subtraction of semantic decoding units. If changing numbers of phonemes or syllables or even grammatical morphemes do not change the decoding channel capacity, it would be apparent that these are not relevant psycholinguistic units as far as semantic decoding is concerned. On the other hand, if adding or subtracting lexical morphemes did regularly produce correlated shifts in decoding time, this would be evidence for the lexical morpheme as a semantic decoding unit. It is realized, of course, that what has been suggested in these few pages on channel capacity represents close to a lifetime of research for the person who undertakes to investigate this problem fully. On the other hand, it should not take long to determine whether or not the basic experimental notion—that total time can be separated experimentally into measurable decoding and encoding times in the manner indicated—is itself valid.
47 Floyd G. Lounsbury.
48 Sol Saporta.
49 This tabulation was carried out under the direction of John B. Carroll at the Summer Seminar on Psychology and Linguistics held at Cornell University in 1951. Only a privately distributed mimeographed summary of the results is available so far.
50 Jakobson, Fant, and Halle, Preliminaries to speech analysis (Cambridge, 1952). Cherry, Halle, and Jakobson, Toward a logical description of languages in their phonemic aspect, Language 29. 34-46 (1953).
51 Kellogg Wilson and John B. Carroll.
52 This number is relatively small for very small values of r, say 1 or 2, but increases very rapidly as r increases. For example, if our phonemic transcription used 34 phonemes, the number of sequences in A for r = 4 would be 34^4 = 1,336,336. This state of affairs would be much worse if a full phonetic transcription were used.
53 Using mathematical symbolism, if p(a) is the probability of phoneme a, p(B) the probability of a morpheme class B which is any of a set of morpheme classes, and pB(a) the probability of a in B, it is easily shown that p(a) = ΣB p(B)·pB(a), the summation being taken over all morpheme classes B.
54 This analysis has been described elsewhere by John B. Carroll and is included here because of the remarkably clear way in which it illustrates many of the problems with which this section is concerned.
55 It should be noted that Hx(S) and Hy(A) are measures of entropy in particular conditions and so correspond to the measure Hi(J) discussed in section 2.3. They are not measures of conditional entropy, which average measures corresponding to Hi(J) over a number of conditions.
56 Naturally, we can expect to find such extreme cases only rarely in actual texts. Nevertheless, these cases are pure examples of the four general kinds of relationships we can expect to find among sequentially ordered message events and so we can expect that they will be approximated by empirical data.
57 Cf. W. Taylor, Cloze procedure: A new tool for measuring readability, Journalism Quarterly 30. 415-33 (1953).
58 James J. Jenkins.
59 The terms are those used by Miller, Language and communication, based on work done by Woodrow and Lowell, Psychological monograph 22. 97 (1916).
60 W. A. Russell and J. J. Jenkins, Kent-Rosanoff norms for Minnesota college students (in press).
61 D. H. Howes and C. E. Osgood, The effect of linguistic context on associative word probabilities, American Journal of Psychology (forthcoming).
62 Charles E. Osgood.