Since Charles Darwin's On the Origin of Species (1859), the study of hominid evolution has been based on two lines of complementary research: the comparative study of living species that we believe manifest certain aspects of the behavior of earlier hominids; and the inferences that we can make from the artifacts that have been found in association with extinct fossil hominids, as well as the direct examination of these fossils. The anatomy of extinct fossils is relevant to the study of their behavior because it is evident that certain aspects of behavior are predicated on the presence of particular anatomical specializations. Upright bipedal locomotion, for example, is not possible without certain specialized anatomical features that are present in modern Homo sapiens.
The study of the behavior of living nonhuman species is essential to our understanding of the function of "human" anatomical specializations. Electromyographic studies of chimpanzees and gorillas, for example, show that these animals, who lack the specialized pelvic and limb anatomy of Homo sapiens, can stand upright only for short periods of time and with the expenditure of great muscular effort. In contrast, humans can stand upright or walk at a moderate pace with very little muscular effort (Basmajian and Tuttle, in press). Comparative studies of living nonhuman primates can serve as "experiments" that enable us to assess the function of particular anatomical specializations in Homo sapiens. All animals are, in effect, living "experimental preparations." Appropriate experiments can relate particular aspects of the morphology of different species to the behavior of these species.
We can project the relevance of the anatomical specializations that relate to upright posture to the study of human evolution when we note that the skeletal remains of particular extinct hominids also appear to be functionally adapted for upright posture and bipedal locomotion. The fossil remains of Australopithecus africanus, for example, show that these early hominids, who lived from one to four million years ago, also possessed the behavioral attribute of bipedal locomotion and upright posture (Campbell, 1966; Pilbeam, 1972). We can thus infer that these anatomical specializations were retained by the Darwinian process of natural selection because upright posture and bipedal locomotion were selective advantages to these early hominids.
The anatomical specializations that are necessary conditions for upright posture have a biologic "cost." The change in the pelvic area vis-à-vis animals who lack upright posture makes childbirth more difficult. The advantages of upright posture and bipedal locomotion must have outweighed the increased mortality in childbirth for these specializations to have been retained and elaborated over a period of millions of years. The presence of stone artifacts like the rounded stones that have been found in association with Australopithecine fossils (Leakey, 1971) thus takes on a new significance when we consider the probable presence of upright posture and bipedal locomotion in Australopithecine culture. The stone "balls" may have served as projectiles in hunting small animals. The selective advantage of upright posture would have been enhanced if Australopithecines used projectiles. The use of projectiles in itself would not appear to be very significant if we did not again refer to the results of comparative studies on living nonhuman hominoids. Although living apes do, in fact, hurl stones and branches, they usually cannot hit particular targets. Instead they ineffectually hurl things about in some general direction (Beck, in press; van Lawick-Goodall, 1973). All humans can learn to hit targets regularly with projectiles. Small children learn to do so without any special instruction. It is a human attribute.
The ability to hit things with projectiles is not very interesting until we compare human behavior with that of living related species. The human "quality" of being able to hit targets must involve the presence of certain innately determined neural mechanisms and pathways that enable us to acquire this ability readily. The stone projectiles that have been found in association with Australopithecine fossil remains indicate that these early hominids probably had this ability. It would have made upright posture an asset by freeing the hands for throwing. The entire behavioral and physiologic complex (upright posture, stone throwing, hunting, pelvic and limb anatomy, neural mechanisms coordinating the motions of the arms and hands with vision, etc.) thus can be viewed as an interdependent evolutionary pattern. It started with hominids like Ramapithecus (Pilbeam, 1972), who may have lacked upright posture and who might have hurled projectiles in some general direction, but it ultimately resulted in the evolution of hominids like the Australopithecines. The initial conditions for this evolutionary process can be seen in the physiology and behavior of present-day apes, who have the arms, hands, and acute vision that are necessary though not sufficient conditions for accurately throwing projectiles. There is, as Darwin (1859) claimed, a continuity and gradualness in the process of evolution.
Human language is at present a unique quality of Homo sapiens. However, like the unique human quality of stone throwing, it too must be the result of a gradual evolutionary process. We can profitably apply the same comparative techniques toward an understanding of its evolution. In fact, there already are a number of studies that bear either directly or indirectly on the evolution of language in general and of human language in particular. The claim is often made that human language has absolutely nothing in common with the communications of animals (Lenneberg, 1967). Human language is supposedly disjoint with the communications of animals. It is supposedly "referential," while animal communications are "emotive." In fact, there are no studies that demonstrate that the communications of animals are simply emotive. At best, there are tenuous arguments that claim that the neural embodiment of human language involves cortical-to-cortical pathways, which are present only in the human brain, whereas the communicative signals of animals supposedly involve cortical-subcortical pathways. Cortical-cortical pathways are supposedly involved in referential thought, while cortical-subcortical pathways are supposedly involved in the expression of emotions. These claims cannot be supported by the findings of modern neuroanatomy. The "referential" activity of the human brain probably involves many cortical-subcortical pathways. Nor is there any substantive basis for assigning the expression of emotion exclusively to cortical-subcortical pathways.
The supposed uniqueness of human language seems to me to be an echo of the traditional Cartesian view. Many ethological, behavioral, and linguistic studies of the communications of animals that are otherwise faultless are limited by their unquestioning acceptance of this Cartesian premise. The tendency is to draw negative conclusions concerning the linguistic ability of nonhuman animals, even though these conclusions are not supported by the data. Green (1973), for example, in an acoustic analysis of some of the vocalizations of nonhuman primates, uses spectrographic data that show that a number of these vocalizations form a "graded" series along what appear to be the acoustical dimensions of amplitude and fundamental frequency. Green concludes that the vocalizations are "graded" rather than "discrete." Nonhuman primate vocalizations, supposedly, are thus not "linguistic" signals, like the human sounds transcribed by the symbols /b/, /p/, /a/, etc.
However, the method that Green uses, if applied to human speech, would demonstrate that the discrete human phonetic elements /b/ and /p/ were also graded, nonlinguistic signals. The acoustic basis of the distinction between the English sounds /b/ and /p/ rests in the delay between the sound generated when the speaker's lips are opened and the start of phonatory vocal cord activity. If phonation starts within 20 msec after the speaker's lips open, the sound is perceived as a /b/. If the delay exceeds 20 msec, the sound is perceived as a /p/ (Lisker and Abramson, 1964). Human speakers, particularly young human speakers (Preston and Port, 1972), produce many /b/ sounds with phonation delays ranging over the 0-20 msec interval that defines these sounds. They also produce many /p/ sounds in which phonation delays vary from 20 to over 130 msec. If these sounds were examined using the same criteria that have been applied to the analysis of chimpanzee vocalizations, the discrete, categorical responses of human listeners to these sounds would not be apparent.
The basis of the discrete, categorical nature of these speech sounds appears to be the presence of an innately determined auditory mechanism—a neural "feature detector," which is even manifested in the behavior of four-week-old human infants (Eimas and Corbitt, 1973). The sounds /b/ and /p/, in other words, are discrete signals because human perception makes them discrete. Acoustic analysis, in itself, cannot reveal the presence of the neural mechanism that is the basis of the discrete nature of these speech sounds. We cannot assume that the primate vocalizations discussed by Green or in other accounts of primate communication are not discrete signaling units until we perform the appropriate perceptual experiments.
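The contrast between graded acoustics and categorical perception can be illustrated with a minimal sketch. It assumes only the 20 msec boundary cited above from Lisker and Abramson (1964); the function name and the sample delays are hypothetical illustrations rather than measured data, and the hard threshold is a deliberate simplification of a listener's response.

```python
# Illustrative sketch only: the names and the sample delays are assumptions.
VOT_BOUNDARY_MS = 20.0  # boundary cited in the text for English /b/ versus /p/


def perceived_category(phonation_delay_ms: float) -> str:
    """Map a phonation delay (msec after the lips open) to the perceived stop."""
    return "b" if phonation_delay_ms <= VOT_BOUNDARY_MS else "p"


# A made-up, graded continuum of phonation delays (msec).
delays_ms = [0, 5, 10, 15, 20, 25, 40, 80, 130]
labels = [perceived_category(d) for d in delays_ms]

for delay, label in zip(delays_ms, labels):
    print(f"{delay:>4} msec -> /{label}/")
```

The delays form a graded series, yet the labels fall into just two classes. That is the pattern that acoustic measurement alone would not reveal.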
We really know very little about the communications and the possible "language" of various animals. Linguists have been rather anthropocentric when they attempt to limit the term "language" to human language. It would make as much sense to limit the term "swimming" to human swimming. The details of the way a human swims are different from the way a dog swims, but the end result is similar. Both animals move through water. The communication system of a dog is probably simpler than that of a human, but there may be common elements, and we can perhaps gain some insights into the nature of human language by studying the simpler, canid system.
We might want to restrict the term "language" to a special class of communication systems, but we really cannot limit the term to the language of present-day Homo sapiens without making the classificative function of the term unprofitably restrictive. The phonetic abilities of some fossil hominids like the classic Neanderthals, who inhabited parts of Europe and Asia about 70,000 years ago, would have precluded their speaking any of the languages of modern Homo sapiens. However, the stone artifacts that have been found in association with these fossils and the evidence of their cultural tradition make it evident that some form of language must have been a feature of Neanderthal society (Lieberman, 1975). The language of these advanced hominids would not have been a human language if we restrict the term "human" to modern Homo sapiens. However, there is no reason to believe that Neanderthal hominids could not have transmitted new, previously unanticipated information among themselves. Thus, the operational definition of language (Lieberman, 1973) is a communication system that permits the transmission of new information. This definition obviously would not fit the limited codes that simple animals like frogs appear to use. However, it would admit many possible languages that might differ substantially from the language of present-day Homo sapiens.
Defining language in terms of the properties of human language is fruitless because we do not know what they really are. Even if we knew the complete inventory of properties that characterize human language we probably would not want to limit the term "language" to communication systems that had all of these properties. For example, it would be unreasonable to state that a language that had all of the attributes of human languages except relative clauses really was not a language. The operational definition of language is functional rather than taxonomic. It is a productive definition insofar as it encourages questions about what animals can do with their communication systems and the relation of these particular systems to human language and to the intermediate levels of language that probably were associated with early hominids.
Neural Feature Detectors
As I noted earlier, the perception of human speech appears to involve the presence of neural mechanisms that are sensitive to particular acoustic events. In recent years a number of electrophysiological and behavioral studies have demonstrated that various animals have auditory detectors that are "tuned" to signals that are of interest to them. Wollberg and Newman (1972), for example, recorded the electrical activity of single cells in the auditory cortex of awake squirrel monkeys (Saimiri sciureus) during the presentation of recorded monkey vocalizations and other acoustic signals. They presented eleven calls, representing the major classes of this species' vocal repertoire, as well as a variety of acoustic signals designed to explore the total auditory range of these animals. Extracellular unit discharges were recorded from 213 neurons in the superior temporal gyrus of the monkeys. More than 80 percent of the neurons responded to the tape-recorded vocalizations. Some cells responded to many of the calls that had complex acoustic properties. Other cells, however, responded to only a few calls. One cell responded with a high probability only to one specific signal, the "isolation peep" call of the monkey.
The experimental techniques that are necessary in these electrophysiological studies demand great care and great patience. Micro-electrodes that can isolate the electrical signal from a single neuron must be prepared and accurately positioned, the electrical signals must be amplified and recorded, and, most important, the experimenters must present the animal with a set of acoustic signals that explore the range of sounds it would encounter in its natural state. Demonstrating the presence of neural mechanisms matched to the constraints of the sound-producing systems of particular animals is therefore a difficult undertaking. The sound-producing possibilities and behavioral responses of most "higher" animals make comprehensive statements on the relationship between perception and production difficult. We can only explore part of the total system of signaling and behavior. "Simpler" animals, however, are useful in this respect since we can see the whole pattern of the animal's behavior.
The behavioral experiments of Capranica (1965) and the electrophysiological experiments of Frishkopf and Goldstein (1963), for example, demonstrate that the auditory system of the bullfrog (Rana catesbeiana) has single units that are matched to the formant frequencies of the species-specific mating call. Bullfrogs are among the simplest living animals that produce sound by means of a laryngeal source and a supralaryngeal vocal tract. The latter consists of a mouth, a pharynx, and a vocal sac that opens into the floor of the mouth in the male. Vocalizations are produced in the same manner as in primates. The vocal cords of the larynx open and close rapidly, emitting puffs of air into the supralaryngeal vocal tract, which acts as an acoustic filter. Frogs can make a number of different sounds (Bogert, 1960), including mating calls, release calls, territorial calls that serve as warnings to intruding frogs, rain calls, distress calls, and warning calls. The different calls have distinct acoustic properties.
The mating call of the bullfrog consists of a series of croaks. The duration of each croak varies from 0.6 to 1.5 sec, and the interval between each croak varies from 0.5 to 1.0 sec. The fundamental frequency of the bullfrog croak is about 100 Hz. The formant frequencies of the croak are about 200 Hz and 1,400 Hz. Capranica generated synthetic frog croaks by means of a fixed, POVO speech synthesizer (Stevens et al., 1955) that was designed to produce human vowels but that serves equally well for the synthesis of bullfrog croaks. In a behavioral experiment, Capranica showed that bullfrogs responded to synthesized croaks so long as the croaks had energy concentrations at either or both of the formant frequencies. The presence of acoustic energy at other frequencies inhibited the bullfrogs' responses (which consisted of joining in a croak chorus).
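The synthesis step can be sketched in a few lines. The sketch is not a model of the POVO synthesizer itself; it substitutes a generic digital source-filter arrangement, a pulse train passed through two cascaded two-pole resonators. The fundamental frequency, the formant frequencies, and the croak duration are taken from the text. The sample rate, the 60 Hz bandwidth, and all function names are assumptions made for illustration.

```python
import math

SAMPLE_RATE = 8000          # Hz (assumed)
F0 = 100.0                  # Hz, fundamental frequency given in the text
FORMANTS = (200.0, 1400.0)  # Hz, formant frequencies given in the text
BANDWIDTH = 60.0            # Hz (assumed; the text gives no bandwidths)


def pulse_train(f0, duration_s, rate=SAMPLE_RATE):
    """Laryngeal source: one unit pulse per fundamental period."""
    period = int(round(rate / f0))
    n = int(round(duration_s * rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq, bandwidth, rate=SAMPLE_RATE):
    """Two-pole digital resonator with unity gain at 0 Hz."""
    t = 1.0 / rate
    c = -math.exp(-2.0 * math.pi * bandwidth * t)
    b = 2.0 * math.exp(-math.pi * bandwidth * t) * math.cos(2.0 * math.pi * freq * t)
    a = 1.0 - b - c
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synth_croak(duration_s=0.8):
    """Filter the source through both formant resonators in cascade."""
    signal = pulse_train(F0, duration_s)
    for formant in FORMANTS:
        signal = resonator(signal, formant, BANDWIDTH)
    return signal


def amplitude_at(signal, freq, rate=SAMPLE_RATE):
    """Magnitude of a single DFT component at freq, scaled by signal length."""
    re = sum(s * math.cos(2.0 * math.pi * freq * i / rate) for i, s in enumerate(signal))
    im = sum(s * math.sin(2.0 * math.pi * freq * i / rate) for i, s in enumerate(signal))
    return math.hypot(re, im) / len(signal)


croak = synth_croak()
```

Comparing `amplitude_at(croak, 200.0)` and `amplitude_at(croak, 1400.0)` with `amplitude_at(croak, 700.0)` shows energy concentrated at the two formants, with less at the intermediate harmonic.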
Frishkopf and Goldstein (1963), in their electrophysiologic study of the bullfrog's auditory system, found two types of auditory units. They found units in the eighth cranial nerve of the anesthetized bullfrog that had maximum sensitivity to frequencies between 1,000 and 2,000 Hz. They found other units that had maximum sensitivity for frequencies between 200 Hz and 700 Hz. The units that responded to the lower frequency range, however, were inhibited by appropriate acoustic signals. Maximum response occurred when the two units responded to pulse trains at rates of 50 and 100 pulses per sec, with energy concentrations at or near the formant frequencies of bullfrog mating calls. Adding acoustic energy between the two formant frequencies, at 500 Hz, inhibited the responses of the low-frequency single units.
The electrophysiologic, behavioral, and acoustic data all complement each other. Bullfrogs have auditory mechanisms that are structured specifically to respond to the bullfrog mating call. They do not simply respond to any sort of acoustic signal as though it were a mating call, but respond to particular calls that can be made only by male bullfrogs; and they have neural mechanisms structured in terms of the species-specific constraints of the bullfrog sound-producing mechanism.
Plasticity and the Evolution of Human Speech
Frogs are rather simple animals but they have nonetheless evolved different species-specific calls. Of the thirty-four species whose mating calls failed to elicit responses from Rana catesbeiana, some were closely related, others more distantly related. It is obvious that natural selection has produced changes in the mating calls of Anuran species. The neural mechanisms for the perception of frog calls are at the periphery of the auditory system. They apparently are not very plastic, since Capranica was not able to modify the bullfrogs' responses over the course of an eighteen-month interval. Despite this lack of plasticity, frogs have evolved different calls in the course of their evolutionary development.
Primates have more flexible and plastic neural mechanisms for the perception of their vocalizations. Recent electrophysiological data (Miller et al., 1972) show that primates like the rhesus monkey (Macaca mulatta) will develop neural detectors that identify signals important to the animal. Receptors in the auditory cortex responsive to a 200 Hz sine wave were discovered after the animals were trained by the classic methods of conditioning to respond behaviorally to this acoustic signal. These neural detectors could not be found in the auditory cortex of untrained animals. The auditory system of these primates thus appears to be plastic. Receptive neural devices can be formed to respond to acoustic signals that the animal finds useful. These results are in accord with behavioral experiments involving human subjects in which "categorical" responses to arbitrary acoustic signals can be produced by means of operant conditioning techniques (Lane, 1965). They are also in accord with the results of classic conditioning experiments like those reported by Pavlov, whose dogs learned to identify and respond decisively to the sound of a bell, which is an unnatural sound for a dog. The dogs obviously had to learn to identify the bell.
The first hominid "languages" probably evolved from communication systems that resembled those of present-day apes. The social interactions of chimpanzees are marked by exchanges of facial and bodily gestures as well as vocalizations (van Lawick-Goodall, 1973). The recent successful efforts establishing "linguistic" communications between humans and chimpanzees by means of either visual symbols or sign language (Gardner and Gardner, 1969; Fouts, 1973; Premack, 1972) show that apes have the cognitive foundations for analytic thought. They also use tools, make tools, and engage in cooperative behavior (for example, hunting). All these activities have been identified as factors that may have placed a selective advantage on the evolution of enhanced linguistic ability (Washburn, 1968; Hill, 1972).
It is obviously impossible to determine directly what types of feature detectors may have existed in the brains of extinct hominids. We can, however, get some insights into the general evolutionary process that developed the phonetic level of hominid language by taking note of the evolution of the speech-producing anatomy. The "match" that exists between the constraints of speech production and speech perception in modern humans (Lieberman, 1970) as well as comparative data on other living species make this procedure reasonable. An extinct hominid could obviously not make use of a phonetic contrast that could not be produced by the species. The reader may still wonder precisely what insights we may gain on the general question of the evolution of language even if we can determine some of the phonetic constraints on the languages of earlier, now extinct hominids. The answer is that certain sounds that occur in the languages of present-day humans have important functional attributes. The presence or absence of these sounds can tell us something about the general level of linguistic ability in an extinct hominid species. I shall return to this topic, which involves the physiology of speech, after discussing some of the data that are available at present.
The Phonetic Ability of Neanderthal Hominids
Neanderthal hominids like those represented by the "classic" La Chapelle-aux-Saints and La Ferrassie fossils lived until comparatively recent times. They form a class of fossils that, by quantitative statistical methods (Howells, 1970, in press), differ significantly from other fossil hominid populations and from modern Homo sapiens. Neanderthal hominids were not primitive in the sense that they lacked culture. They produced complex stone tools and had a cultural tradition that has left traces of burial rituals and care for the infirm and aged. The data that form the basis of our inferences regarding Neanderthal culture consist of stone and bone tools, traces of fire sites, burial sites, and skeletal material that have survived between 40,000 and perhaps 100,000 years.
However, nothing remains of the soft tissue of the supralaryngeal vocal tract or the larynx. How can we then make any inferences about phonetic ability? Fortunately we can reconstruct the supralaryngeal vocal tract that is typical of these extinct fossils, making use of the methods first proposed by Darwin concerning the "affinities of extinct species to each other, and to living forms" (1859:329) and his observations with regard to embryology (1859:439-49). The basis for the reconstruction of the supralaryngeal vocal tract of Neanderthal hominids (Lieberman and Crelin, 1971) is the similarity between the skulls of the fossils and of newborn Homo sapiens, i.e., newborn modern man. At first this might seem implausible. How can an adult fossil skull that has massive supraorbital brow ridges, a huge massive mandible, and a generally prognathous aspect be compared with that of a newborn human? The answer is that certain aspects of the skeletal morphology of newborn human and Neanderthal skulls are similar, even though other aspects are not. The claim is not that newborn humans are little Neanderthalers, but that the two share certain skeletal features. Vlcek (1970), in his comparative study of the development of skeletal morphology in Neanderthal infants and children, independently arrived at similar conclusions.
There are a great many Neanderthal fossil skulls including those of infants and children, two to fourteen years old at the time of death. Vlcek was therefore able to study the ontogeny of Neanderthal skull development in relation to that of modern man. He concludes:
Certain primitive traits that are present in the skeletons of Neanderthal forms occur again in different periods of the foetal life of contemporary man with different degrees of intensity. Thus we can observe the development and the presence of many morphological characteristics typical of the Neanderthal skeleton in the skeleton of contemporary man in the course of his ontogenetic development. [1970:150]
In Fig. 1 sketches of the skulls of adult and newborn Homo sapiens and the La Chapelle-aux-Saints fossil are presented. The La Chapelle-aux-Saints fossil is probably somewhere between 45,000 and 100,000 years old. The exact dating is not important since Neanderthalian fossils, for example, all those discussed in connection with Vlcek's study, persisted throughout this period. (The skulls in Fig. 1 have all been drawn to the same approximate size.) Note the basic similarity between the newborn human skull and the Neanderthal skull. The newborn human and the young Neanderthal skulls of Vlcek's study are very similar. The older skulls in Vlcek's study and the adult La Chapelle skull are even more similar in important features to the newborn human skull than the newborn human is to the adult human skull. The newborn human and Neanderthal skulls are relatively more elongated from front to back and relatively more flattened from top to bottom than that of adult Homo sapiens. The squamous part of the temporal bone is similar in newborn human and all Neanderthal skulls. A long list of similar anatomical features could be presented, but we are really only concerned with skeletal features that are directly relevant to the reconstruction of the supralaryngeal vocal tract of Neanderthal man.
We have to think in terms of the functional anatomy of the vocal tract. If we were to ignore the functional aspects of skeletal morphology we could be led astray, for example, by the fact that the mastoid process is absent in newborn humans and relatively small in the La Chapelle fossil, adding to their similarity to the skull of the adult Homo sapiens in Fig. 1. The mastoid process, however, plays no role in the reconstruction of the supralaryngeal vocal tract.
Most of the unsuccessful attempts at deducing the presence or absence of speech from skeletal structures were based on comparative studies that did not properly assess the functional roles of particular features. Vallois (1961) reviews many of these attempts, which were hampered by the absence of both a quantitative acoustic theory of speech production and suitable anatomical comparisons with living primates that lack the physical basis for human speech. The absence of prominent genial tubercles in certain fossil mandibles, for example, was taken as an absolute sign of the absence of speech, but genial tubercles are sometimes absent in normal adult humans who speak normally. They play a part in attaching the geniohyoid muscle to the mandible, but they are not in themselves crucial features. Indeed the notion of looking for crucial, isolated morphological features is not particularly useful. It is necessary to explore the complete relationship of the skeletal structure of the skull and mandible to the supralaryngeal vocal tract.
Fig. 2 shows lateral views of the skull, vertebral column, and larynx of newborn and adult Homo sapiens and the reconstructed La Chapelle-aux-Saints fossil. The Neanderthal skull is placed on top of an erect cervical vertebral column instead of on one sloping forward, as depicted by Boule (1911-13). This is in agreement with Straus and Cave (1957), who determined that the La Chapelle-aux-Saints fossil had suffered from arthritis, but this condition could not have affected his supralaryngeal vocal tract. Severe arthritis at advanced ages has virtually no effect on speech in modern man. (The La Chapelle-aux-Saints fossil was probably about forty years old at the time of his death.)
Since the second, third, and fourth cervical vertebrae were missing, they were reconstructed to conform with those of adult Homo sapiens. In addition, the spinous processes of the lower cervical vertebrae shown for the adult human in Fig. 2 are curved slightly upwards. They come from a normal vertebral column and were purposely chosen to show that the La Chapelle-aux-Saints vertebrae were not necessarily pongid in form, as Boule (1911-13) claimed. Crelin's reconstruction (Lieberman and Crelin, 1971) is, in fact, purposely weighted toward making the La Chapelle-aux-Saints fossil more like modern man than like an ape. In all cases of doubt, the La Chapelle-aux-Saints supralaryngeal vocal tract reconstruction was modeled on that of the modern human vocal tract. Thus any conclusions that we will draw concerning limits on Neanderthal phonetic ability are conservative.
Note that the geniohyoid muscle in adult Homo sapiens runs down and back from the hyoid symphysis of the mandible. This is necessarily the case because the hyoid bone is positioned well below the mandible in adult Homo sapiens. The two anterior portions of the digastric muscle, which are not shown in Fig. 2, also run down and back from the mandible for the same reason. When the facets into which these muscles are inserted at the symphysis of the mandible are examined, it is evident that the facets are likewise inclined to minimize the shear forces for these muscles. Shear forces always pose a greater problem than do tensile forces in all mechanical systems since the shear strength of most materials is substantially smaller than the tensile strength. A stick of blackboard chalk, for example, has great tensile strength. It cannot be pulled apart easily if you pull on it lengthwise. However, it has an exceedingly small shear strength, and you can snap it apart with two fingers. The human chin appears to be a consequence of the inclination of the facets of the muscles that run down and back to the hyoid. The outward inclination of the chin in some human populations reflects the inclination of the inferior (inside) plane of the mandible at the symphysis. Muscles are essentially "glued" to their facets. In this light, tubercles and fossae may be simply regarded as adaptions that increase the strength of the muscle-to-bone bond by increasing the "glued" surface area. Their presence or absence is not very critical (DuBrul and Reed, 1960) since the inclination and form of the digastric and geniohyoid facets is the primary element in increasing the functional strength of the muscle-to-bone bond by minimizing shear forces. As Bernard Campbell (1966:2) succinctly notes, "Muscles leave marks where they are attached to bones, and from such marks we assess the form and size of the muscles."
You can easily feel the inclination of the inferior surface of the symphysis of your mandible. Whereas the chin is more prominent in some adult humans than in others, the inferior surface of the mandibular symphysis is always arranged to accommodate muscles that run down and back to a low hyoid position. As DuBrul (1958:42) correctly notes, the human mandible is unique: "The whole lower border of the jaw has swung from a pose leaning inward to one flaring outward." An examination of the collection of skulls at the Musée de l'Homme in Paris indicated that this is true regardless of race and sex for all normal adult humans. When the corresponding features are examined in newborn Homo sapiens, it is evident that the nearly horizontal inclination of the facets of the geniohyoid and digastric muscles is a concomitant feature of the high position of the hyoid bone (Negus, 1929; Crelin, 1969; Wind, 1970). These muscles are nearly horizontal with respect to the symphysis of the mandible in newborn Homo sapiens, and the facets therefore are nearly horizontal to minimize shear forces. Newborn Homo sapiens thus lacks a chin because the inferior surface of the symphysis of the mandible is not inclined to accommodate muscles that run down and back. When the mandible of the La Chapelle-aux-Saints fossil is examined, it is evident that the facets of these muscles resemble those of newborn Homo sapiens. The inclination of the styloid process away from the vertical plane is also a concomitant and coordinated aspect of the skeletal complex that supports a high hyoid position in newborn Homo sapiens and in the La Chapelle-aux-Saints fossil. Enough of the base of the La Chapelle-aux-Saints styloid process remains to determine its original approximate size and location.
The skeletal features that support the muscles of the supralaryngeal vocal tract and mandible are all similar in the Neanderthal fossil and newborn Homo sapiens. When the bases of the skulls of newborn and adult Homo sapiens and the La Chapelle-aux-Saints fossil are examined, it is again apparent that the newborn Homo sapiens and the fossil forms have many common features that differ from adult Homo sapiens. These differences are all consistent with the morphology of the supralaryngeal airways of newborn Homo sapiens, in which the pharynx and the pharyngeal constrictor muscles lie behind the opening of the larynx. In Fig. 3 casts of the supralaryngeal vocal tracts of newborn Homo sapiens, the Neanderthal reconstruction, and adult Homo sapiens are shown. The details of the reconstruction as well as the general motivating constraints are discussed in Lieberman (1975) and in less detail in Lieberman and Crelin (1971) and Lieberman (1973).
What are the phonetic consequences with respect to human speech of the reconstructed Neanderthal supralaryngeal vocal tract? Understanding the anatomical basis of human speech requires that we briefly review the source-filter theory of speech production (Chiba and Kajiyama, 1958; Fant, 1960). Human speech is the result of a source or sources of acoustic energy filtered by the supralaryngeal vocal tract. For voiced sounds, that is, sounds like the English vowels, the source of energy is the periodic puffs of air that pass through the larynx as the vocal cords rapidly open and shut. The rate at which the vocal cords open and close determines the fundamental frequency of phonation. Acoustic energy is present at the fundamental frequency and at higher harmonics. The fundamental frequency of phonation can vary from about 80 Hz for adult males to about 500 Hz for children and some adult females. Significant acoustic energy is present in the harmonics of the fundamental frequency to at least 3,000 Hz. The fundamental frequency of phonation is, within wide limits, under the control of the speaker, who can produce controlled variations by changing either the pulmonary air pressure or the tension of the laryngeal muscles (Lieberman, 1967). Linguistically significant information can be transmitted by means of these variations in fundamental frequency, as, for example, in Chinese, where they are used to differentiate words.
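The relation between the fundamental frequency and its harmonics can be illustrated with a small calculation. The sketch below is only an illustration: the function name and the 3,000 Hz cutoff parameter are choices made here, and the two fundamentals are the approximate extremes quoted in the text.

```python
def harmonics(f0_hz, max_hz=3000.0):
    """Return the harmonics of f0_hz (including f0 itself) up to max_hz."""
    if f0_hz <= 0:
        raise ValueError("fundamental frequency must be positive")
    count = int(max_hz // f0_hz)
    return [n * f0_hz for n in range(1, count + 1)]


# Approximate extremes of fundamental frequency mentioned in the text.
for label, f0 in [("low adult male voice", 80.0), ("high child voice", 500.0)]:
    hs = harmonics(f0)
    print(f"{label}: F0 = {f0:.0f} Hz, {len(hs)} harmonics at or below 3000 Hz")
```

A low fundamental places many closely spaced harmonics below 3,000 Hz, whereas a high fundamental places only a few, so the laryngeal source "samples" the filtering of the supralaryngeal vocal tract more sparsely at high fundamental frequencies.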
The main source of phonetic differentiation in human language, however, arises from the dynamic properties of the supralaryngeal vocal tract, which acts as an acoustic filter. The length and shape of the supralaryngeal vocal tract determine the frequencies at which maximum energy will be transmitted from the laryngeal source to the air adjacent to the speaker's lips. These frequencies are known as formant frequencies. A speaker can vary the formant frequencies by changing the length and shape of his supralaryngeal vocal tract. He can, for example, drastically alter the shape of the airway formed by the posterior margin of his tongue body in his pharynx. He can raise or lower the upper boundary of his tongue in his oral cavity. He can raise or lower his larynx and retract or extend his lips. He can open or close his nasal cavity to the rest of the supralaryngeal vocal tract by lowering or raising his velum. The speaker can, in short, continually vary the formant frequencies generated by his supralaryngeal vocal tract. The acoustic properties that differentiate the vowels [a] and [i], for example, are determined solely by the differences in shape and length that the supralaryngeal vocal tract assumes when these vowels are articulated. The situation is analogous to a pipe organ, where the length and type (open or closed end) of pipe determine the musical quality of each note. The damped resonances of the human supralaryngeal vocal tract are, in effect, the formant frequencies. The length and shape (more precisely the cross-sectional area as a function of distance from the laryngeal source) determine the formant frequencies.
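The pipe-organ analogy can be made concrete for the simplest possible case. The sketch below treats the vocal tract as a uniform tube closed at the larynx and open at the lips, whose resonances fall at odd multiples of the quarter-wavelength frequency. The tube length, the speed of sound, and the function name are assumptions chosen for illustration; real vowels require a nonuniform area function.

```python
def uniform_tube_formants(length_cm, count=3, speed_of_sound_cm_s=35000.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    F_n = (2n - 1) * c / (4 * L). The speed of sound is an assumed round value.
    """
    if length_cm <= 0:
        raise ValueError("tube length must be positive")
    return [(2 * n - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
            for n in range(1, count + 1)]


# A 17.5 cm tube is an assumed, roughly adult-sized vocal tract length.
print(uniform_tube_formants(17.5))
```

Under these assumptions the three lowest resonances fall at 500, 1,500, and 2,500 Hz, which approximates a neutral, schwa-like vowel. Departures from the uniform shape move the formants away from this even spacing.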
The situation is similar for unvoiced sounds. Here the vocal cords do not open and close at a rapid rate to release quasiperiodic puffs of air, but the source of acoustic energy is the turbulence generated by air rushing through a constriction in the vocal tract. The vocal tract still acts as an acoustic filter but the acoustic source may not be at the level of the larynx; for example, in the sound [s] the source is the turbulence generated near the speaker's teeth.
The supralaryngeal vocal tract's filtering properties are completely specified by its shape and size, i.e., its cross-sectional area function. We can therefore determine the constraints that a particular vocal tract will impose on the phonetic repertoire independently of the possible limitations of such things as muscular ability or the properties of the larynx. We could, for example, make models of possible vocal tract shapes by pounding and forming brass tubes with cutters and brazing torches. We could record the actual formant frequencies that corresponded to particular shapes by exciting the tubes with an artificial larynx or a reed. We could thus determine the constraints that the supralaryngeal vocal tract of a Neanderthal fossil placed on the phonetic repertoire independently of the possible further limitations imposed by the extinct hominid's control or lack of control. The only difficulty would lie in making sure that we had explored the full range of vocal-tract shapes. The acoustic properties of the brass models would closely approximate the filtering properties of the vocal tract shapes they represented. This modeling technique was actually once the principal means of phonetic analysis. The technology of the late eighteenth and early nineteenth centuries was adequate for the fabrication of brass tubes with complex shapes and for making mechanical models. The speech synthesizers devised by Kratzenstein (1780) and von Kempelen (1791) (whose famous talking machine was one of the wonders of the time) generated acoustic signals by exciting tubes by means of mechanical reeds. The method employed by Lieberman and Crelin (1971) and Lieberman et al. (1972) simply makes use of the technology of the third quarter of the twentieth century.
We could, if we wished, continue to use mechanical models to assess the constraints of the supralaryngeal vocal tract on an animal's phonetic repertoire. We could determine the range of possible supralaryngeal vocal tract shapes by dissecting living animals that had similar vocal tracts, making casts of the air passages, and taking note of the musculature, soft tissue, and effects of the contraction of particular muscles. We could enhance our knowledge by making cineradiographs of the animal during episodes of phonation, respiration, and swallowing. It would then be possible, though somewhat tedious, to make models of possible supralaryngeal vocal tract configurations. The models could even be made of plastic materials that approximated the acoustic properties of flesh. If they were excited by means of a rapid, quasiperiodic series of puffs of air (i.e., an artificial larynx), we would be able to hear the actual sounds that a particular vocal tract configuration produced. If we systematically made models that covered the range of possible vocal-tract configurations we could determine the constraints imposed by supralaryngeal vocal-tract morphology on phonetic repertoire independently of the further possible constraints of the extinct hominids' muscular or neural control, dialect, or habits. We would, of course, be restricted to continuant sounds, i.e., those that were not transient or interrupted, since we could not rapidly change the shape of our vocal tract model. We could, however, generalize our results to consonant-vowel syllables, like the sounds [bI] and [dæ], since we could model the articulatory configurations that occur at specified intervals of time when these sounds are produced.
In Fig. 4, area functions that could be generated by a Neanderthal vocal tract are plotted. These area functions were entered into a computer program (Henke, 1966) that essentially represents the supralaryngeal vocal tract by a series of contiguous cylindrical sections, each of fixed area. Each section can be described by a characteristic impedance and a complex propagation constant, both of which are known quantities for cylindrical tubes (Beranek, 1954). Junctions between sections satisfy the constraints of continuity of air pressure and conservation of air flow. In other words, the air pressure must be a continuous function, and air particles can neither disappear nor be created at section boundaries. The computer program calculated the three lowest formant frequencies for any area function specified. This arrangement made it possible to enter many area functions in a comparatively short time.
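The kind of calculation described above can be sketched in a few lines. The code below is not a reconstruction of the Henke (1966) program; it is a minimal lossless approximation that assumes an ideal open termination at the lips (zero pressure) and a closed glottis, and it ignores the propagation losses that a complex propagation constant would capture. The physical constants, the function names, and the example area functions are all illustrative assumptions, not the Neanderthal area functions of Fig. 4.

```python
import math

SPEED_OF_SOUND = 35000.0   # cm/s; assumed value
AIR_DENSITY = 0.00114      # g/cm^3; assumed, cancels out of the formant condition


def glottis_to_lips_d(areas_cm2, section_length_cm, freq_hz):
    """Return the D element of the chain (ABCD) matrix for the whole tract.

    Each lossless section has the matrix [[cos, j*Z*sin], [j*sin/Z, cos]],
    with characteristic impedance Z = rho*c/area. The product keeps this
    real/imaginary pattern, so only four real coefficients are tracked.
    Multiplying the matrices enforces continuity of pressure and of volume
    flow at every junction.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    cs = math.cos(k * section_length_cm)
    sn = math.sin(k * section_length_cm)
    a, b, c, d = 1.0, 0.0, 0.0, 1.0           # identity matrix
    for area in areas_cm2:                     # ordered from glottis to lips
        z = AIR_DENSITY * SPEED_OF_SOUND / area
        a2, b2, c2, d2 = cs, z * sn, sn / z, cs
        a, b, c, d = (a * a2 - b * c2, a * b2 + b * d2,
                      c * a2 + d * c2, d * d2 - c * b2)
    return d


def formants(areas_cm2, section_length_cm, count=3, f_max=6000.0, step=5.0):
    """Find the lowest `count` frequencies (Hz) at which D crosses zero.

    With zero pressure at the lips, lip flow = glottal flow / D, so the
    zeros of D are the resonances of the tube.
    """
    found = []
    f_lo = step
    d_lo = glottis_to_lips_d(areas_cm2, section_length_cm, f_lo)
    while f_lo < f_max and len(found) < count:
        f_hi = f_lo + step
        d_hi = glottis_to_lips_d(areas_cm2, section_length_cm, f_hi)
        if d_lo != 0.0 and d_lo * d_hi <= 0.0:
            lo, hi, d_at_lo = f_lo, f_hi, d_lo
            for _ in range(40):                # bisection refinement
                mid = 0.5 * (lo + hi)
                d_mid = glottis_to_lips_d(areas_cm2, section_length_cm, mid)
                if d_at_lo * d_mid <= 0.0:
                    hi = mid
                else:
                    lo, d_at_lo = mid, d_mid
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found


# Hypothetical area functions (cm^2), eight 2 cm sections from glottis to lips.
uniform = [3.0] * 8
a_like = [1.0] * 4 + [8.0] * 4    # narrow pharynx, wide oral cavity
i_like = [8.0] * 4 + [1.0] * 4    # wide pharynx, narrow oral cavity
for name, areas in [("uniform", uniform), ("[a]-like", a_like), ("[i]-like", i_like)]:
    print(name, [round(f) for f in formants(areas, 2.0)])
```

In this idealized model the abrupt two-tube shapes move the first formant well above or well below the value for the uniform tube, which is the kind of contrast that the abrupt area-function discontinuities of [a] and [i] exploit.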
A number of area functions were sketched into the computer, using its light pen and oscilloscope input system. The area functions plotted in Fig. 4 were directed toward producing the "best" Neanderthal approximations to the human vowels [i], [u], and [a]. The frequency scales and labeled loops are taken from the Peterson and Barney (1952) study of vowel production by adult men, adult women, and older children speaking American English. It is apparent that the reconstructed Neanderthal vocal tract cannot produce the vowels [i], [u], or [a]. Consonants like the dental and bilabial [d], [t], [s], [b], and [p] and other vowels would be phonetic possibilities for the Neanderthal vocal tract, but velar consonants like [g] and [k] would not (Lieberman and Crelin, 1971).
Reconstruction and modeling of the supralaryngeal vocal tract of Australopithecus africanus show similar phonetic restrictions (Lieberman, 1973, 1975). In contrast, the reconstructed vocal tract of the Es Skhul V fossil is essentially modern in character and would not restrict the phonetic ability of this hominid. These reconstructions are all the work of Edmund S. Crelin. His results are in accord with the independent univariate and multivariate analyses of Howells (1970, and in press), which demonstrate that Es Skhul V falls within the same class as modern humans, whereas certain measurements of the La Chapelle-aux-Saints fossil vary four to five standard deviations from those of modern skulls.
The Physiology of Human Speech
Since the time of Johannes Müller (1848), who initiated the modern study of the physiology of speech, it has been apparent that some sounds have a more central status than others. The vowels [i], [u], and [a] appear to occur in all human languages. Troubetzkoy (1969) notes that a language may have other vowels but that it always has one or more of these. Recent functional, i.e., physiologic, analyses of human speech show that these vowels really are more useful speech signals than other vowels. Stevens (1972), for example, shows that these vowels are acoustically stable. A speaker can make comparatively large articulatory errors in the production of these sounds without changing their acoustic character. The constricted part of the supralaryngeal vocal tract can, for example, vary over a two-centimeter range without perceptibly changing these vowels' formant frequencies. The formant frequency patterns that define these vowels are, moreover, maximally distinct from all other vowels. The formant frequencies are centered for [a], maximally high for [i], and maximally low for [u]. In effect, these three vowels are the "best" possible vowel sounds for vocal communication.
Natural selection would act to retain mutations that allowed these signals to be produced only if enhanced vocal communication were an advantage. The specialized anatomy that allows hominids like modern Homo sapiens and Es Skhul V (who is functionally modern) to produce these sounds is less suited for breathing, swallowing, and chewing than the vocal tract anatomy of Neanderthal and Australopithecine hominids and nonhuman primates. The human supralaryngeal airway, in which the pharynx is part of the direct path from the lungs, allows the tongue body to be shifted up, down, and back to form the abrupt discontinuities in the supralaryngeal area function that are necessary to produce vowels like [a], [i], and [u]. Other primates, who lack this anatomical complex, essentially have a "single" tube vocal tract, formed simply by the oral cavity, and the larynx and pharynx open independently into the oral cavity. This arrangement offers less resistance to air flow and allows respiration to go on when the oral cavity is full of fluid; the epiglottis can seal the oral cavity from the nasal-laryngeal pathway. In modern Homo sapiens the pharynx serves both as part of the air pathway and as part of the pathway for the ingestion of food. The adult human epiglottis cannot seal the oral cavity, and food lodged in the pharynx can block the air flow to the lungs. The only function for which the human supralaryngeal vocal tract is better adapted is speech production (Lieberman and Crelin, 1971; Lieberman et al., 1972; Negus, 1929; Kirchner, 1970; Lieberman, 1973, 1975).
Speech communication must have existed in late hominid forms like Neanderthal man. The mutations that yield enhanced phonetic ability in modern Homo sapiens would not have been retained unless vocal communication was already an established phonetic mode of language. However, there is an additional physiologic factor, which is related to the encoded nature of human speech, that would have resulted in strong selectional pressures for the retention of the mutations that allowed the vowels [a], [i], and [u] to be produced.
Speech Encoding and Decoding
Modern human speech communication achieves a high rate of speed by a process of speech encoding and a complementary process of speech decoding. Phonetic distinctions that differentiate meaningful words, e.g., the sounds symbolized by [b], [æ], and [t] in the word bat, are transmitted, identified, and stored at a rate of 20-30 segments per sec. It is obvious that human listeners cannot transmit and identify these sound distinctions as separate entities. The fastest rate at which sounds can be identified is about 7-9 segments per sec (Liberman, 1970). Sounds transmitted at a rate of 20 per sec merge into an undifferentiated "tone." That is why high-fidelity amplifiers and loudspeakers generally have an advertised lower frequency limit of 20 Hz. The human auditory system simply cannot temporally resolve auditory events that occur at a rate of 20 per sec. (The human visual system, incidentally, cannot work any faster either. A motion picture projector presents individual still frames at rates in excess of 16 frames per sec.) The linguist's traditional conception of phonetic elements as a set of "beads on a string" clearly is not correct at the acoustic level. How, then, is speech transmitted and perceived?
The answer to this question comes from work that was originally directed at making a reading machine for the blind. The machine was to identify alphabetic characters in printed texts and convert them into sounds that a blind person could listen to. It was not too difficult to devise a print-reading device, although that was not really necessary if the machine's use was to be initially restricted to the "reading" of new books and publications. At some stage in the preparation of a publication a machine with a keyboard is used. The talking machine could be connected to the keyboard so that it produced a different sound, or combination of sounds, for each typewriter or linotype key. The sequence of sounds could then be tape-recorded, and blind people could then listen to the recordings after the tapes were perhaps slowed down and edited to eliminate pauses and errors. A number of different systems were developed, but all of them were useless because the tapes had to be slowed down to rates about one-tenth that of normal human speech. The blind "readers" would forget what a sentence was about before they heard its end. It did not matter what sorts of sounds were connected to the typewriter keys. They all were equally bad. The basic rate of transmission and the inherent difficulty of these systems were about the same as listening to the traditional dots and dashes of the telegrapher's Morse Code. The systems would work, but they were very, very slow, and the listeners had to expend most of their attention simply keeping track of the message.
The obvious solution to this problem seemed to rest in making machines that would "glue" the phonetic elements of speech together to make words. There seemed to be no inherent problem if the linguists' traditional beads on a string were isolated, collected, and then appropriately strung together. The medium of tape recording seemed to be the solution. Carefully pronounced test words could be recorded, and the phonetic elements of these words could then be isolated by cutting up the magnetic tape (preferably by segmenting the tape with the electronic equivalent of a pair of scissors). The speaker, for example, would record a list of words that included pet, bat, cat, hat, etc. The experimenters would then theoretically be able to isolate the sounds [p], [b], [h], [k], [e], [æ], which would then be stored in a machine that could put them together in different patterns to form new words, for example, get and pat. The list of possible permutations would, of course, increase as the vocabulary of isolated stored phonetic elements increased. Systems of this sort were implemented at great expense and with enormous efforts (Peterson et al., 1958). They surprisingly produced speech that was scarcely intelligible. Despite many attempts to improve the technique by changing the methods used in isolating the phonetic elements, the system proved to be completely useless.
Though these studies failed to produce a useful reading machine, they demonstrated that phonetic elements could not be regarded as beads on a string. It was, in fact, impossible to isolate a consonant like [b] or [t] without also hearing the vowels that either preceded or followed it. It is in fact impossible to produce a stop consonant like [p] or [b] without pronouncing a vowel. The smallest segment of speech that can be pronounced is the syllable. If you try to say the sound [b] you will discover that it is impossible. You can say [bi], [bu], [bU], [ba], [bI], [bæ], [bIt], [bId], etc., but you cannot produce an isolated [b]. The results of the past twenty years of research on the perception of speech by humans demonstrate that individual sounds like [b], [I], and [d] are encoded, that is, "squashed together" into a single unit when we produce the syllable-sized unit [bIt] (the phonetic transcription of the English word bit). A human speaker in producing this syllable starts with his supralaryngeal vocal tract, i.e., his tongue, lips, velum, etc., in the positions characteristic of [b]. However, he does not maintain this articulatory configuration but instead moves his articulators toward the positions that would be attained if he were instructed to maintain an isolated, steady [I]. He never reaches these positions, however, because he starts toward the articulatory configuration characteristic of [t] before he ever reaches the "steady state" (isolated and sustained) vowel [I]. The articulatory gestures that would be characteristic of each isolated sound are never attained. Instead, the articulatory gestures are melded together into a composite, characteristic of the syllable.
The sound pattern that results from this encoding process is an indivisible composite. Just as there is no way of separating with absolute certainty the [b] articulatory gestures from the [I] gestures (you cannot tell exactly when the [b] ends and the [I] begins), there is no way of separating the acoustic cues that are generated by these articulatory maneuvers. The isolated sounds have a psychological status as motor control or "programming" instructions for the speech-production apparatus. The sound pattern that results is a composite, and the acoustic cues for the initial and final consonants are largely transmitted as modulations imposed on the vowel. The process is, in effect, a time-compressing system. The acoustic cues that characterize the initial and final consonants are transmitted in the time slot that would have been necessary to transmit a single, isolated [I] vowel.
The human brain decodes, that is, "unscrambles" the acoustic signal in terms of the articulatory maneuvers that were put together to generate the syllable. The individual consonants [b] and [t], though they have no independent acoustic status, are perceived as discrete entities. The process of human speech perception inherently requires knowledge of the acoustic consequences of the possible range of human supralaryngeal vocal tract speech articulation and the size of the supralaryngeal vocal tract that produced the speech signal that is being decoded. A number of independent studies (Ladefoged and Broadbent, 1957; Rand, 1971; Nearey, 1975) have demonstrated that a human listener will interpret an identical acoustic stimulus as a different speech sound, e.g., the same acoustic signal may be "heard" as a token of the vowel [I], [æ], or [e]. The listener will perceive the sound as an [I] if he thinks the vocal tract he is listening to is large. If he thinks the vocal tract is small, he might perceive the same acoustic signal as a token of the vowel [æ].
Listeners can arrive at an estimate of the size of the vocal tract in several ways. They can listen to a stretch of speech and take note of the average range of formant frequencies. Larger and longer vocal tracts will tend to produce lower formant frequencies. A listener, however, can estimate the size of a vocal tract almost instantly if he knows what sound the speaker intended to make. The vowels [i], [u], and [a] have special acoustic properties that make them especially suited for this "vocal tract size calibrating" function (Lieberman, 1973, 1975). The formant frequency patterns that define vowels like [i] are "determinate," and listeners (or computer programs) can make use of these vowels to calibrate vowel perception (Nearey, 1975; Gerstman, 1967).
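The calibrating role of a known vowel can be illustrated with a deliberately simple scheme. The sketch below assumes that a change in vocal tract size scales all formant frequencies by one common factor, so that a single token known to be [i] suffices to estimate that factor. This uniform-scaling rule, the reference values, and the function names are assumptions made for illustration; they are not the procedures used in the studies cited above.

```python
def scale_factor(observed_i_hz, reference_i_hz):
    """Estimate a speaker's formant scale from one token known to be [i].

    Returns the mean ratio of observed to reference formant frequencies.
    """
    if len(observed_i_hz) != len(reference_i_hz) or not observed_i_hz:
        raise ValueError("need matching, non-empty formant lists")
    ratios = [obs / ref for obs, ref in zip(observed_i_hz, reference_i_hz)]
    return sum(ratios) / len(ratios)


def normalize(formants_hz, factor):
    """Map a speaker's formants onto the reference speaker's scale."""
    return [f / factor for f in formants_hz]


# Illustrative reference formants (Hz) for [i]; hypothetical values.
REFERENCE_I = [270.0, 2290.0, 3010.0]

# A hypothetical speaker with a shorter vocal tract: every formant 20% higher.
speaker_i = [324.0, 2748.0, 3612.0]
factor = scale_factor(speaker_i, REFERENCE_I)

# An otherwise ambiguous token from the same speaker, placed on the reference scale.
print(round(factor, 3), [round(f) for f in normalize([600.0, 2200.0], factor)])
```

Once the factor is known, the same acoustic token can be assigned to different vowel categories for different speakers, which is the effect described above for listeners who believe they are hearing a large or a small vocal tract.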
The absence of vowels like [i], [u], and [a] in the phonetic repertoire of Australopithecine and Neanderthal hominids is thus significant. Sounds that are inherently optimum signals for vocal communication and that facilitate fully encoded, rapid speech are absent. However, this deficiency cannot be taken as an indication either of the absence of vocal communication or of the total absence of encoding in the speech of these earlier hominids. The vocal-tract anatomy of present-day apes, for example, though it presents limits on the total phonetic repertoires of these animals, can produce many of the sound contrasts that convey meaningful information in human speech, i.e., the "phonetic features" that occur in human languages (Lieberman, 1973, 1975). Our knowledge of the vocal communication of living apes is rudimentary. We know virtually nothing about the perceptual factors that may structure their vocal communication, nor do we really have sufficient data that relate particular behavioral situations with the total vocal and gestural communicative output of living apes. The speech-producing anatomy of apes can be viewed as a factor that inherently sets an upper limit on the phonetic repertoires of these animals. It, however, would allow the production of at least the following phonetic features.
I shall start by discussing phonetic features that involve the laryngeal source. As Negus (1929) observed, as we ascend the phylogenetic scale in terrestrial animals, there is a continual elaboration of the larynx, which reflects, in part, adaptions for phonation. Studies like that of Kelemen (1948), which have attempted to show that chimpanzees cannot talk because of laryngeal deficiencies, are not correct. Kelemen shows that the chimpanzee's larynx is different from the larynx of a normal adult human male and will not produce the same range of fundamental frequencies; and, moreover, the spectrum of its glottal source will be different from that of a normal adult human male. The chimpanzee's voice thus would sound "harsh"—to a human listener! However, human listeners do not really count with regard to chimpanzees! Chimpanzees and other hominoids and New and Old World monkeys probably could produce the following phonetic features by making use of laryngeal and subglottal articulatory maneuvers.
VOICED VERSUS UNVOICED
The supralaryngeal vocal tract could be excited either by the quasi-periodic excitation of the larynx or by means of noise generated by air turbulence. Air turbulence will occur whenever the flow of air exceeds a critical value at any point in the vocal tract. During phonation the vocal cords are adducted, i.e., moved together and closed or nearly closed, and the flow of air through the larynx is relatively low. In humans, turbulent noise generally does not occur during the production of voiced vowels, i.e., vowels produced with normal phonation. In the production of a sound like [s], the vocal cords are in a more open position. The resulting air flow is much higher (Klatt et al., 1968), and noise is generated at the "dental" constriction orifice. It is clear that nonhuman primates can produce sounds that are either voiced or unvoiced (Lieberman, 1975).
HIGH FUNDAMENTAL VERSUS NORMAL FUNDAMENTAL FREQUENCY
Several studies (Van den Berg, 1960; Atkinson, 1973) have shown that the human larynx can be adjusted so that phonation occurs in the falsetto register, in which the fundamental frequency of phonation is higher than in the normal register. The mode of operation of the larynx is actually somewhat different in these two registers (Van den Berg, 1960; Lieberman, 1967). The spectrum of the glottal source also changes in falsetto, and comparatively little energy occurs at higher frequency harmonics of the fundamental. The larynges of nonhuman primates inherently should be capable of producing this distinction (Negus, 1929; Wind, 1970).
LOW FUNDAMENTAL VERSUS NORMAL FUNDAMENTAL FREQUENCY
The human larynx likewise may be adjusted to phonate at a low fundamental frequency. This lower register, termed "fry," produces very low fundamental frequencies (Hollien et al., 1966), which are very irregular (Lieberman, 1963).
DYNAMIC FUNDAMENTAL FREQUENCY VARIATIONS
Virtually all human languages make use of dynamic variations in the temporal pattern of fundamental frequency (Lieberman, 1967). In languages like Chinese, dynamic tone patterns, i.e., rapid changes in fundamental frequency, differentiate words. The spectrograms of chimpanzee and gorilla vocalizations often show fundamental frequency variations that could serve as the basis of phonetic features based on dynamic fundamental frequency variations (Lieberman, 1968). Vocalizations could be differentiated by means of rising or falling patterns or combinations of rising and falling contours with high or low fundamental frequencies.
STRIDENT LARYNGEAL OUTPUT
The high fundamental frequency cries mixed with breathy excitation, i.e., noise excitation, that can be observed in the spectrograms and oscillograms of the vocalizations of nonhuman primates and newborn humans constitute a phonetic feature (Lieberman, 1975). Speakers of American English sometimes make use of this phonetic feature to convey emotional qualities. It does not have a strictly "linguistic" role in American English, since it is not used to convey different words, but that is not a crucial objection to our noting the possible use of this sound contrast as a phonetic feature. Many sound contrasts that serve as phonetic features in other languages are not used in English.
The anatomical basis of the phonetic feature of Phonation Onset rests in the independent nature of the laryngeal source and the supralaryngeal vocal tract. All primates thus can, in principle, make use of this sound contrast, which differentiates sounds like [b] and [p], as a phonetic feature.
Phonation Onset obviously involves articulatory maneuvers in the supralaryngeal vocal tract, which must be occluded to produce sounds like the English "stops" [p] and [b]. All primates inherently can do so to produce the phonetic feature Stop.
CONSONANTAL "PLACE OF ARTICULATION"
The point at which the supralaryngeal vocal tract can be occluded can vary. All primates can close their vocal tracts by moving their lips together. Thus a bilabial point of articulation is a possibility for all primates. A dental point of occlusion or constriction is effected in adult Homo sapiens by moving the tongue up toward the hard palate, and cineradiographic studies of the swallowing movements of newborn Homo sapiens (Truby, Bosma, and Lind, 1965) indicate that a dental point of articulation is a possibility for all primates. In all primates the supralaryngeal vocal tract can also be occluded at the level of the glottis, i.e., at the level of the vocal cords. This follows from one of the surviving, basic, vegetative functions of the larynx, which can close to protect the lungs from the intrusion of foreign material. A glottal point of articulation is thus a possibility for all primates.
A chimpanzee, therefore, has the speech-producing anatomy that would, with the proper muscular controls, be sufficient to allow the production of the English sounds [b], [p], [t], [d], the glottal stop [ʔ], as well as prevoiced dental and bilabial stops like those that occur in Spanish. Glottal stops normally are not used to differentiate words in English, though they occur in many English dialects; they are used more extensively in many other languages, e.g., Danish. It is important to note that the phonetic feature of consonantal point of articulation is a multivalued feature and that we are simply discussing the upper bounds set by the gross anatomy of primates. An animal would have to possess the neural and muscular control necessary to position the tongue against the palate during speech at a precise moment if the dental point of articulation were to be realized.
CONTINUANT VERSUS INTERRUPTED
Sounds may be differentiated either by being prolonged without interruptions or by being interrupted. This phonetic feature can be effected either by direct modulation control of the laryngeal muscles to start and stop phonation or by occluding the supralaryngeal vocal tract.
FORMANT FREQUENCY RAISING
All primates can shorten the length of their supralaryngeal vocal tracts. They can shorten it at its "front" end by flaring and/or pulling their lips back, and adult Homo sapiens can shorten it at its "back" end by pulling the larynx upward as much as 20 mm during the course of a single word (Perkell, 1969). The mobility of the larynx is comparatively restricted in newborn Homo sapiens (Truby, Bosma, and Lind, 1965; Negus, 1929) and in nonhuman primates (Negus, 1949). The reduction in laryngeal mobility follows both from the position of the larynx with respect to the supralaryngeal vocal tract and from the fact that the hyoid bone is very close to the thyroid cartilage (Negus, 1929). The reduction in laryngeal mobility in forms other than adult humans can be observed in radiographic pictures of both speech and swallowing (Negus, 1929; Truby, Bosma, and Lind, 1965). During swallowing, for example, the larynx moves upward and forward in adult humans, whereas it only moves forward in newborn humans.
The acoustic consequence of shortening the supralaryngeal vocal tract, irrespective of the articulatory maneuvers that effect the shortening, is a rising formant frequency pattern.
FORMANT FREQUENCY LOWERING
All primates can also lengthen their supralaryngeal vocal tracts by protruding their lips or by moving their larynges downward or backward. Adult Homo sapiens again has more freedom in this regard since the human larynx has greater mobility. Closing the lips to produce a smaller orifice at the mouth has the same acoustic effect as increasing the length of the supralaryngeal vocal tract (Stevens and House, 1955; Fant, 1960). All these articulatory maneuvers generate a falling formant frequency pattern. In human speech formant transitions are the normal case. They may be rarer in the acoustic signals of nonhuman primates.
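The direction and rough size of these formant shifts can be estimated to a first approximation. The sketch below assumes that the vocal tract keeps the same relative shape while its length changes, in which case every resonance varies inversely with length. The resting length and the amount of lip protrusion are hypothetical, the function name is a choice made here, and the 20 mm figure is the laryngeal excursion cited from Perkell (1969). The effect of narrowing the lip orifice is not modeled.

```python
def formant_scale(old_length_cm, new_length_cm):
    """Factor by which all formants change when tract length changes.

    Assumes the relative shape is unchanged, so frequency scales as 1/length.
    """
    if old_length_cm <= 0 or new_length_cm <= 0:
        raise ValueError("lengths must be positive")
    return old_length_cm / new_length_cm


REST_LENGTH_CM = 17.0                                        # hypothetical adult tract
raised = formant_scale(REST_LENGTH_CM, REST_LENGTH_CM - 2.0)     # larynx raised 20 mm
protruded = formant_scale(REST_LENGTH_CM, REST_LENGTH_CM + 1.0)  # lips protruded 10 mm

print(f"larynx raised 20 mm: formants x {raised:.3f}")
print(f"lips protruded 10 mm: formants x {protruded:.3f}")
```

A factor greater than one corresponds to the rising pattern described under formant frequency raising, and a factor less than one to the falling pattern described here.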
ORAL VERSUS NONORAL
Nonhuman primates can produce cries in which their oral cavities are closed by the epiglottis while the nose remains open.
AIR SAC VARIATIONS
Some nonhuman primates, e.g., the howling monkey, have large air sacs above their vocal cords that can act as variable acoustic filters as their volume changes. The vocalizations of primates with these air sacs have not yet been subjected to quantitative acoustic analysis. Their calls, however, appear to be differentiated by modulations introduced by the air sacs.
The Uniqueness of Speech Encoding
Although the speech of modern Homo sapiens is fully encoded, the vocal communications of any animal that can produce formant transitions could be partially encoded. Since even early fossil hominids like Australopithecus africanus had supralaryngeal vocal tracts that were equivalent to those of modern apes, partial speech encoding may have existed at a very early period of hominid evolution. It is even possible that the communications of living nonhuman primates are partially encoded. The acoustic basis of speech encoding rests in the fact that the pattern of formant frequency variation of the supralaryngeal vocal tract must inherently involve transitions. The shape of the supralaryngeal vocal tract cannot change instantaneously. If a speaker utters a syllable that starts with the consonant [b] and ends with the vowel [æ], his vocal tract must first produce the shape necessary for [b] and then gradually move toward the [æ] shape. Formant transitions that reflect the initial [b] configuration thus have to occur in the [æ] segment. The transitions would be quite different if the initial consonant were a [d].
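The way a gradual change of vocal tract shape spreads information about the consonant into the vowel can be sketched with a toy formant trajectory. The code below assumes, purely for illustration, that a formant moves exponentially from an onset frequency set by the consonant toward a target set by the vowel. The function name, the time constant, and all frequency values are hypothetical placeholders rather than measurements of [b], [d], or [æ].

```python
import math


def formant_track(onset_hz, target_hz, duration_ms=100, step_ms=10, tau_ms=25.0):
    """Toy formant trajectory: exponential approach from onset to target.

    The tract cannot jump to the vowel shape, so the early part of the
    vowel still reflects the consonant's onset frequency.
    """
    times = range(0, duration_ms + 1, step_ms)
    return [target_hz + (onset_hz - target_hz) * math.exp(-t / tau_ms)
            for t in times]


VOWEL_TARGET_HZ = 1700.0  # hypothetical second-formant target for the vowel

# Two hypothetical consonant onsets, one below and one above the vowel target.
track_low_onset = formant_track(1100.0, VOWEL_TARGET_HZ)
track_high_onset = formant_track(2000.0, VOWEL_TARGET_HZ)

# Print the two trajectories side by side, one row per 10 ms step.
for low, high in zip(track_low_onset, track_high_onset):
    print(f"{low:7.1f}  {high:7.1f}")
```

Both tracks converge on the same vowel target, but their early values differ, so in this toy model the vowel segment itself indicates which onset preceded it.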
The nonhuman supralaryngeal vocal tract can, in fact, produce consonants like [b] and [d]. Simple encoding could be established using only bilabial and dental consonant contrasts. The formant transitions would either all be rising in frequency in the case of [dæ] or falling in frequency for [bæ]. It probably would be quite difficult, if not impossible, to sort the various intermediate vowel contrasts that are possible with the nonhuman vocal tract, but a simple encoding system could be built up using rising and falling formant transitions imposed on a general, unspecified vowel [V]. The resulting language would have only one vowel (a claim that has sometimes been made for the supposed ancestral language of Homo sapiens: Kuipers, 1960). The process of speech encoding and decoding and the elaboration of the vowel repertoire could build on vocal-tract normalization schemes that made use of sounds like [s] and could provide a listener or a digital computer program with information about the size of the speaker's vocal tract. Vocal-tract normalizing information could also be derived by listening to a fairly long stretch of speech and then computing the average formant frequency range. The process would be slower than simply hearing a token of [i] or [u], but it would be possible. There might have been a gradual path toward more and more encoding for all hominid populations as social structure and technology became more complex. If this were true, the preadaptation of the bent pharyngeal-oral supralaryngeal vocal tract in some hominid populations would have provided an enormous selective advantage.
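A minimal sketch of two ideas in this paragraph follows: a two-symbol code read from the direction of a formant transition, and a vocal-tract normalization factor computed from the average formant frequency over a long stretch of speech. The function names, the symbol labels, the 50 Hz threshold, the reference mean, and the sample frequencies are all hypothetical. The mapping of a rising transition to a [d]-like consonant and a falling transition to a [b]-like consonant simply follows the description above, and the averaging scheme is one simple reading of it rather than an established algorithm.

```python
from statistics import mean


def decode_transition(track_hz, threshold_hz=50.0):
    """Label a formant track by the direction of its overall movement."""
    change = track_hz[-1] - track_hz[0]
    if change > threshold_hz:
        return "d"   # rising transition, following the text's [d] case
    if change < -threshold_hz:
        return "b"   # falling transition, following the text's [b] case
    return None      # too flat to carry a consonant contrast


def normalization_factor(observed_hz, reference_mean_hz):
    """Ratio of a reference mean formant frequency to a speaker's mean.

    A shorter vocal tract gives higher formants, so its factor is below 1.
    """
    if not observed_hz:
        raise ValueError("observed_hz must not be empty")
    return reference_mean_hz / mean(observed_hz)


# Hypothetical formant samples (Hz) from a long stretch of one speaker's speech.
speaker_samples = [1800.0, 2200.0, 2000.0, 2400.0, 1600.0]
factor = normalization_factor(speaker_samples, reference_mean_hz=1500.0)

# Rescale a hypothetical transition to the reference scale, then decode it.
raw_track = [1600.0, 1750.0, 1900.0, 2000.0]
normalized_track = [f * factor for f in raw_track]
print(decode_transition(normalized_track), round(factor, 3))
```

Scaling by a single factor leaves the direction of a transition unchanged, so in this sketch normalization matters mainly for comparing tracks from different speakers against a common threshold.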
The differences between human speech and human language and the communication systems of other animals may not be qualitative. It is difficult to think of any aspect of human behavior that is really unique. Although language is seemingly a unique aspect of human behavior, qualitatively different from the communication system of any other living animal, the difference may be only a quantitative phenomenon. Qualitative behavioral differences can be the result of quantitative structural differences. Both an electronic desk calculator and a large general-purpose digital computer may be constructed using similar circuits and memory devices. However, the distinctions between the problems that can be solved using one device or the other will be qualitative as well as quantitative.
The differences between human and animal communication are more obvious because the intermediate stages of hominid evolution are no longer alive. It is possible that there are some qualitative differences, insofar as no other living species can presently make use of encoded acoustic signals. However, we have not examined the possible "speech" system of any other living animal sufficiently to demonstrate the absence of encoding. Quantitative acoustic analysis is still in its infancy, and we have still to develop a physiologic, i.e., functional, theory that explains the nature of human speech and human language. The study of the communications of species other than modern Homo sapiens is just as important for the insights we may gain into the nature of human language and its evolution as for our understanding of specific systems of animal communication.
References
Atkinson, J. R., 1973. Aspects of intonation in speech: implications from an experimental study of fundamental frequency. Ph.D. diss., University of Connecticut.
Basmajian, B. J., and Tuttle, R., in press. In: Proceedings of the IXth International Congress of Anthropological and Ethnological Science, Chicago. The Hague: Mouton.
Beck, B. B., in press. Primate tool behavior. In: Proceedings of the IXth International Congress of Anthropological and Ethnological Science, Chicago. The Hague: Mouton.
Beranek, L. L., 1954. Acoustics. New York: McGraw-Hill.
Bogert, C. M., 1960. The influence of sound on the behavior of amphibians and reptiles. In: Animal Sounds and Communication, W. E. Lanyon and W. N. Tavolga, eds. Arlington, Va.: American Institute of Biological Sciences.
Boule, M., 1911-1913. L'homme fossile de la Chapelle-aux-Saints. Ann. Paleontol., 6:109; 7:21, 85; 8:1.
Campbell, B., 1966. Human Evolution: An Introduction to Man's Adaptions. Chicago: Aldine.
Capranica, R. R., 1965. The Evoked Vocal Response of the Bullfrog. Cambridge: M.I.T. Press.
Chiba, T., and Kajiyama, M., 1958. The Vowel: Its Nature and Structure. Tokyo: Phonetic Society of Japan.
Crelin, E. S., 1969. Anatomy of the Newborn: An Atlas. Philadelphia: Lea and Febiger.
Darwin, C., 1859. On the Origin of Species, facsimile ed. New York: Atheneum.
DuBrul, E. L., 1958. Evolution of the Speech Apparatus. Springfield, Ill.: Charles C. Thomas.
DuBrul, E. L., and Reed, C. A., 1960. Skeletal evidence of speech? Amer. J. Phys. Anthropol., 18:153-56.
Eimas, P. D., and Corbitt, J. D., 1973. Selective adaptation of linguistic feature detectors. Cog. Psychol., 4:99-109.
Fant, G., 1960. Acoustic Theory of Speech Production. The Hague: Mouton.
Ferrein, C. J., 1741. Mem. Acad. Paris, 409-32 (Nov. 15).
Fouts, R. S., 1973. Acquisition and testing of gestural signs in four young chimpanzees. Science, 180:978-80.
Frishkopf, L. S., and Goldstein, M. H., Jr., 1963. Responses to acoustic stimuli from single units in the eighth nerve of the bullfrog. J. Acoust. Soc. Amer., 35:1219-28.
Gardner, R. A., and Gardner, B. T., 1969. Teaching sign language to a chimpanzee. Science, 165:664-72.
Gerstman, L., 1967. Classification of self-normalized vowels. In: Proceedings of IEEE Conference on Speech Communication and Processing. New York: IEEE, pp. 97-100.
Green, S., 1973. Physiological control of vocalizations in the Japanese monkey: inferences from a field study. J. Acoust. Soc. Amer., 53:310 (abstract).
Henke, W. L., 1966. Dynamic articulatory model of speech production using computer simulation. Ph.D. diss., Massachusetts Institute of Technology, Appendix B.
Hill, J. H., 1972. On the evolutionary foundations of language. Amer. Anthropologist, 74:308-17.
Hollien, H.; Moore, P.; Wendahl, R. W.; and Michel, J. F.; 1966. On the nature of vocal fry. J. Speech Hearing Res., 9:245-47.
Howells, W. W., 1970. Mount Carmel man: morphological relationships. In: Proceedings of the VIIIth International Congress of Anthropological and Ethnological Science, Tokyo. Tokyo: Science Council of Japan. Vol. 1: Anthropology, pp. 269-72.
Howells, W. W., in press. Neanderthal man: facts and figures. In: Proceedings of the IXth International Congress of Anthropological and Ethnological Science, Chicago. The Hague: Mouton.
Kelemen, G., 1948. The anatomical basis of phonation in the chimpanzee. J. Morphol., 82:229-56.
Kempelen, W. R. von, 1791. Mechanismum der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine. J. B. Degen.
Kirchner, J. A., 1970. Pressman and Kelemen's Physiology of the Larynx, rev. ed. Washington, D.C.: American Academy of Ophthalmology and Otolaryngology.
Klatt, D. H.; Stevens, K. N.; and Mead, J.; 1968. Studies of articulatory activity and airflow during speech. Ann. N.Y. Acad. Sci., 155:42-54.
Kratzenstein, C. G., 1780. Sur la naissance de la formation des voyelles. J. Phys. Chim. Hist. Nat. Arts, 21 (1782):358-81 (translated from Acta Acad. Petrograd, 1780).
Kuipers, A. H., 1960. Phoneme and Morpheme in Kabardian. The Hague: Mouton.
Ladefoged, P., and Broadbent, D. E., 1957. Information conveyed by vowels. J. Acoust. Soc. Amer., 29:98-104.
Lane, H., 1965. Motor theory of speech perception: a critical review. Psychol. Rev., 72:175-309.
Leakey, M. D., 1971. Olduvai Gorge, Vol. III. Cambridge: Cambridge University Press.
Lenneberg, E. H., 1967. Biological Foundations of Language. New York: Wiley.
Liberman, A. M., 1970. The grammars of speech and language. Cog. Psychol., 1:301-23.
Lieberman, P., 1963. Some acoustic measures of the periodicity of normal and pathologic larynges. J. Acoust. Soc. Amer., 35:334-53.
Lieberman, P., 1967. Intonation, Perception, and Language. Cambridge: M.I.T. Press.
Lieberman, P., 1968. Primate vocalizations and human linguistic ability. J. Acoust. Soc. Amer., 44:1574-84.
Lieberman, P., 1970. Towards a unified phonetic theory. Linguist. Inquiry, 1:307-22.
Lieberman, P., 1973. On the evolution of human language: a unified view. Cognition, 2:59-94.
Lieberman, P., 1975. On the Origin of Language. New York: Macmillan.
Lieberman, P., and Crelin, E. S., 1971. On the speech of Neanderthal man. Linguist. Inquiry, 2:203-22.
Lieberman, P.; Crelin, E. S.; and Klatt, D. H.; 1972. Phonetic ability and related anatomy of the newborn, adult human, Neanderthal man, and the chimpanzee. Amer. Anthropologist, 74:287-307.
Lisker, L., and Abramson, A. S., 1964. A cross-language study of voicing in initial stops: acoustical measurements. Word, 20:384-422.
Miller, J. M.; Sutton, D.; Pfingst, B.; Ryan, A.; and Beaton, R.; 1972. Single cell activity in the auditory cortex of rhesus monkeys: behavioral dependency. Science, 177:449-51.
Müller, J., 1848. The Physiology of the Senses, Voice and Muscular Motion with the Mental Faculties, W. Baly, trans. London: Walton and Maberly.
Nearey, T., 1975. Phonetic features for vowels. Ph.D. diss., University of Connecticut.
Negus, V. E., 1929. The Mechanism of the Larynx. New York: Heinemann.
Perkell, J. S., 1969. Physiology of Speech Production: Results and Implications of a Quantitative Cineradiography Study. Cambridge: M.I.T. Press.
Peterson, G. E., and Barney, H. L., 1952. Control methods used in a study of the vowels. J. Acoust. Soc. Amer., 24:175-84.
Peterson, G. E.; Wang, W. S.-Y.; and Sivertsen, E.; 1958. Segmentation techniques in speech synthesis. J. Acoust. Soc. Amer., 30:739-42.
Pilbeam, D., 1972. The Ascent of Man: An Introduction to Human Evolution. New York: Macmillan.
Premack, D., 1972. Language in chimpanzee? Science, 172:808-22.
Preston, M., and Port, D., 1972. Early apical stop production. A voice onset time analysis. Haskins Laboratories Status Reports (New Haven, Ct.), SR 29/30:125-49.
Rand, T. C., 1971. Vocal tract normalization in the perception of stop consonants. Haskins Laboratories Status Reports (New Haven, Ct.), SR 25/26:141-46.
Stevens, K. N., 1972. Quantal nature of speech. In: Human Communication: A Unified View, E. E. David and P. B. Denes, eds. New York: McGraw-Hill.
Stevens, K. N.; Bastide, R. P.; and Smith, C. P.; 1955. Electrical synthesizer of continuous speech. J. Acoust. Soc. Amer., 27:207.
Stevens, K. N., and House, A. S., 1955. Development of a quantitative description of vowel articulation. J. Acoust. Soc. Amer., 27:484-93.
Straus, W. L., Jr., and Cave, A. J. E., 1957. Pathology and posture of Neanderthal man. Quart. Rev. Biol., 32:348-63.
Troubetzkoy, N. S., 1969. Principles of Phonology, C. Baltaxe, trans. Berkeley: University of California Press.
Truby, H. M.; Bosma, J. F.; and Lind, J.; 1965. Newborn Infant Cry. Stockholm: Almqvist and Wiksell.
Vallois, H. V., 1961. The evidence of skeletons. In: Social Life of Early Man, S. L. Washburn, ed. Chicago: Aldine.
Van den Berg, J. W., 1960. Vocal ligaments versus registers. Curr. Probl. Phoniat. Logoped., 1:19-34.
van Lawick-Goodall, J., 1973. Cultural elements in a chimpanzee community. In: Symposia of the Fourth International Congress of Primatology, vol. 1. White Plains, N.Y.: Karger.
Vlcek, E., 1970. Etude comparative ontophylogenetique de l'enfant du Pech-de-l'Aze par rapport a d'autres enfants neandertaliens. In: L'enfant du Pech-de-l'Aze, D. Feremback et al., eds. Paris: Masson, pp. 149-86.
Washburn, S. L., 1968. The Study of Human Evolution. Portland: Oregon State System of Higher Education.
Wind, J., 1970. On the Phylogeny and Ontogeny of the Human Larynx. Groningen: Wolters-Noordhoff.
Wollberg, Z., and Newman, J. D., 1972. Auditory cortex of squirrel monkey: response patterns of single cells to species-specific vocalizations. Science, 175:212-14.