Modulation and demodulation in production, perception, and imitation of speech and bodily gestures

Hartmut Traunmüller

Published in: Proceedings, FONETIK 98 (1998), Dept. of Linguistics, Stockholm University, 40-43


The perception and imitation of speech is similar to that of bodily postures and gestures. Both can be said to involve a modulation of a carrier and to require a demodulation by the perceiver. Often, the perceiver's expectations play an important part in this process of separating the different types of quality: linguistic, expressive, organic, and perspectival.

Each spoken utterance has a certain linguistic phonetic quality that reflects not only the message but also the speaker's idiolect and speech style. At least in principle, this can be reproduced in an accurate phonetic transcription, but not in phonemic or orthographic notation.

In addition to this linguistic quality, acoustic as well as optic speech signals contain several other types of information that phoneticians do not usually transcribe. Among this additional information and variation, it is often convenient to distinguish the organic variation between speakers from the expressive variation within speakers, as listed in Table 1. The organic quality varies with the speaker's age and sometimes on a time scale of a few days, as when he has a cold. Expressive variation occurs on a shorter time scale given by variations in the psychological state of the speaker. Its scope can be as short as a clause in speech, and this may consist of a single word. The typical time scale of linguistic phonetic variations is still shorter. It corresponds to a single phonetic speech segment.

Table 1. Types of information and variation in speech, as reflected in the auditory and/or visual representation of a message.

Type of quality               Information conveyed              Phenomena involved

Linguistic phonetic quality   The message;                      Different words,
Social, conventional          speaker's dialect, sociolect,     speech sounds,
                              speech style, accent, etc.        prosodic patterns, etc.

Expressive quality            Speaker's emotions,               Type of phonation,
within speaker variation      adaptation to environment, etc.   vocal effort, speech rate,
                                                                liveliness, etc.

Organic quality               Speaker's age, sex,               Size of the larynx,
Physiological, anatomical,                                      length of the supraglottal
between speaker variation                                       vocal tract, etc.

Perspectival quality          Where the speaker is              Projection angles,
Physical, spatial             in relation to the listener       acoustic signal attenuation, etc.
                              (and how he is oriented).

Listeners are capable of evaluating the different types of information without much cross-interference. However, most of the properties usually studied by phoneticians (signal levels, F0, formant frequencies, segment durations, etc.) are affected by organic, expressive, and linguistic factors to a similar extent. The acoustic attributes that convey the linguistic quality are evidently not independent of those conveying the organic and expressive qualities. From the perceiver's point of view, speech signals are also affected by perspectival variation in the same way as most other acoustic and optic signals. This creates mainly variation in signal levels and in the projection angles that are relevant when we consider lip-reading.

In automatic speech recognition, a method of adaptive normalization is often used. This makes it possible to neutralize slowly changing perspectival and organic variation together with some of the idiolectal variation. However, this can hardly be considered a good model of human speech perception. It does not work instantaneously, has trouble with expressive variation, and aims at a phonological or orthographic representation of the utterance, confusing and ignoring all other types of information, including some of the linguistic information.
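As a rough illustration of why such normalization cannot work instantaneously, the sketch below subtracts an exponentially weighted running mean from a stream of feature values. The function name `running_mean_normalize`, the smoothing factor `alpha`, and the toy numbers are assumptions made for this example only; it is not meant to describe any particular recognizer.

```python
def running_mean_normalize(values, alpha=0.1):
    """Subtract an exponentially weighted running mean from each value.

    alpha is a hypothetical smoothing factor; small values adapt slowly.
    """
    mean = values[0]  # initialise the estimate with the first observation
    normalized = []
    for v in values:
        # Slowly track the offset contributed by speaker and channel.
        mean = (1 - alpha) * mean + alpha * v
        normalized.append(v - mean)
    return normalized


# Toy feature stream: a constant offset of 5.0 jumps to 8.0 (e.g. a new speaker).
stream = [5.0] * 20 + [8.0] * 20
out = running_mean_normalize(stream)

print(round(out[19], 3))  # last frame before the change
print(round(out[20], 3))  # first frame after the change: offset mostly uncorrected
print(round(out[39], 3))  # twenty frames later: offset only partly removed
```

The residual offset shrinks by a fixed factor per frame, so the normalization needs many frames to catch up after an abrupt change, whereas a listener copes with a new voice at once.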

The way in which listeners can separate the linguistic phonetic quality from the other types of information listed in Table 1 is described by the Modulation Theory of Speech (Traunmüller, 1994). This theory is founded on an analysis of how the different kinds of information are fused in speech production. A speech signal is regarded as the result of a process in which a carrier signal or "voice", whose properties are given by organic and expressive factors, has been modulated with conventional linguistic speech gestures.

A linguistically neutral, unmodulated carrier signal can be thought of as a 'colorless' vowel, a primitive human vocalization such as a hesitation sound. Its properties are given by the size and proportions of the speaker's organ of speech and by its paralinguistic expressive "settings". For the latter notion, see Laver (1980). In contrast to the glottal source signal of the Acoustic Theory of Speech Production (Fant, 1960), some of its properties, mainly the neutral formant frequencies, are given by the supraglottal cavities (vocal tract length etc.).
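The dependence of the neutral formant frequencies on vocal tract length can be illustrated with a common textbook idealization, which is an assumption of this sketch and not part of the text above: the neutral vocal tract is treated as a uniform tube closed at the glottis and open at the lips, whose resonances lie at F_n = (2n - 1) c / (4 L). The function name `neutral_formants`, the rounded speed of sound, and the two tract lengths are illustrative choices.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # rounded speed of sound in warm, moist air (assumption)


def neutral_formants(tract_length_cm, count=3):
    """Resonances (Hz) of a uniform tube closed at one end: (2n - 1) * c / (4 * L)."""
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * tract_length_cm)
        for n in range(1, count + 1)
    ]


# Two illustrative tract lengths: a longer and a shorter vocal tract.
print(neutral_formants(17.5))  # [500.0, 1500.0, 2500.0]
print(neutral_formants(14.0))  # [625.0, 1875.0, 3125.0]
```

In this idealization, a shorter vocal tract raises all neutral formants by the same factor, so the carrier alone already carries organic information.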

The acoustic properties of speech signals deviate from those of a neutral carrier signal in a way that is specific to each speech sound in a given context. Thus, the linguistic phonetic quality is associated with these deviations and not immediately with the absolute properties of the speech signal. For the perception of the different types of information in speech, this implies that a demodulation is necessary in order to be able to separate them.

The listener has to discover how the carrier signal has been modulated in order to be able to recognize the conventional linguistic information, while this modulation should not affect his judgment of the organic and expressive qualities that are reflected in the carrier signal. Listeners must separate modulation and carrier and judge each on its own.
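The separation of modulation and carrier can be made concrete with a toy sketch. It assumes, purely for illustration, that one speaker's formants are a uniformly scaled copy of another's and that the modulation can be expressed as the log ratio of each observed formant to the corresponding neutral (carrier) formant. The function `demodulate`, the scale factor, and all frequency values are hypothetical; the Modulation Theory itself is not committed to this particular formula.

```python
import math


def demodulate(observed_hz, carrier_hz):
    """Return the deviation of each observed formant from the carrier, in log units."""
    return [math.log(f / c) for f, c in zip(observed_hz, carrier_hz)]


# Hypothetical neutral (carrier) formants for two speakers; B has the shorter tract.
carrier_a = [500.0, 1500.0, 2500.0]
scale = 1.25  # assumed uniform scaling between the two speakers
carrier_b = [scale * f for f in carrier_a]

# The same hypothetical vowel gesture applied to both carriers.
vowel_a = [300.0, 2200.0, 2900.0]
vowel_b = [scale * f for f in vowel_a]

dev_a = demodulate(vowel_a, carrier_a)
dev_b = demodulate(vowel_b, carrier_b)

# The absolute frequencies differ, but the deviations from each carrier agree.
print([round(d, 3) for d in dev_a])
print([round(d, 3) for d in dev_b])
```

Under these assumptions the deviations carry the linguistic quality, while the carriers themselves remain available for judging the organic and expressive qualities.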

When an infant says its first word, it demonstrates that it has acquired at least a rudimentary control over these processes. When a child imitates something an older person has said, which is what happens here, it must first have recognized how the speaker had modulated his carrier, and thereafter it must have modulated its own carrier in the same way.

This kind of procedure is not specific to spoken human language; the imitation of any bodily posture or gesture follows an analogous procedure. In all such cases of imitation, there is a carrier (body, hand, face, mouth, etc.) that provides a system of reference and standards of comparison used in transposing the posture or gesture into another system of reference with different standards. A capability of imitating postures and gestures is present among all primates. They can ape, but only humans can both ape and parrot.

In humans, a capability of imitating oral and manual gestures is present already in neonates (Meltzoff and Moore, 1977). This tells us that there is an innate link between visual perception and motor control. There is at least a rudimentary capability of "demodulating" visually perceived gestures and translating them into the motor commands that are required in order to "modulate" one's own body in the same way. It is interesting, though, that babies do not imitate faces expressive of emotions as readily as arbitrary facial gestures (Kaitz et al., 1988). We are certainly predisposed to express genuine emotions, but not to fake them. A link between visual and auditory perception of speech sounds has been shown to be present at an age of 20 weeks (Kuhl and Meltzoff, 1984). This, and the phenomenon of lip-reading, has been interpreted to suggest an intermodal or amodal (Studdert-Kennedy, 1983) mental representation of speech. There is also a link between speech motor commands and auditory perception, and babbling appears to serve the purpose of fine-tuning this link.

In order to recognize the linguistic quality of speech, listeners can be said to evaluate the deviations of the current properties of the speech signal (F0, formant frequencies, lip shape, etc.) from those they expect of a linguistically neutral sound with the same voice quality. In this process, the expectations of listeners are not only governed by extrinsic properties known from previous experience, e.g., when they know the speaker, or when they have heard him speak for a while, but primarily by intrinsic properties such as the frequency positions of the higher formants (F3 and above), which are not affected as much as F1 and F2 by a variation in linguistic quality. F0 also plays an important part in this process. Listeners appear to analyze its recent course and to take an estimate of its base value as a reference.

Listeners evaluate the instantaneous positions of the spectral peaks shaped by the formants in relation to each other and to the F0 reference. Experiments have shown that listeners do this above all with spectral peaks that are fairly close to each other. In this way, the linguistic information encoded in the formant frequencies can be discovered without depending on prior recognition of the organic and expressive quality. When the acoustic signal is deficient in information, e.g., in whispering, where F0 is missing, or in the presence of disturbing noise, listeners' expectations become more important, even when they are based on less reliable evidence.

Consider the vowel and speaker sex recognition results obtained for whispered vowels by Eklund and Traunmüller (1996), shown in Table 2. When the speaker's sex had been misperceived, the percentage of vowel confusions was more than doubled due to cases like mistaking a male [e] for a female [ĝ], which has approximately the same formant frequencies. Among the normally phonated vowels, there were also confusions, but not a single case, among 2700 responses, in which both speaker sex and vowel had been confused.

Table 2. An analysis of misperceptions of vowel quality and speaker sex in normally phonated and in whispered vowels presented in random order.

Perceived          Number       Misperceived vowel (%)
             Phonated Whispered   Phonated Whispered
Right sex      2662     2456         4.8      11.6
Wrong sex        38      244         0.0      25.0
Total          2700     2700         4.7      12.9

Perceived          Number        Misperceived sex (%)
             Phonated Whispered   Phonated Whispered
Right vowel    2573     2353         1.5       7.8
Wrong vowel     127      347         0.0      17.6
Total          2700     2700         1.4       9.0
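The two halves of Table 2 can be cross-checked against each other: the "Total" percentages in one half should equal the "wrong" counts of the other half divided by the 2700 responses. The short sketch below does this arithmetic with the numbers copied from the table; the helper name `percent` is an illustrative choice.

```python
TOTAL = 2700  # responses per phonation type, from Table 2

# "Wrong vowel" and "wrong sex" response counts, copied from Table 2.
wrong_vowel = {"phonated": 127, "whispered": 347}
wrong_sex = {"phonated": 38, "whispered": 244}


def percent(count, total=TOTAL):
    """Share of responses in percent, rounded to one decimal as in the table."""
    return round(100.0 * count / total, 1)


for mode in ("phonated", "whispered"):
    print(
        mode,
        "misperceived vowel %:", percent(wrong_vowel[mode]),
        "misperceived sex %:", percent(wrong_sex[mode]),
    )

# Whispered vowels: vowel error rate with wrong sex relative to right sex.
ratio = 25.0 / 11.6
print(round(ratio, 2))
```

The computed shares reproduce the "Total" rows of the table (4.7 and 12.9 for vowels, 1.4 and 9.0 for sex), and the ratio of about 2.2 corresponds to the statement that vowel confusions more than doubled when the sex was misperceived.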

A classical investigation that illustrates the role of the listeners' expectations induced by context is that of Ladefoged and Broadbent (1957), who showed that the perceived vowel quality in a synthetic [bVt] syllable was affected when the formant frequencies of a synthetic introductory phrase were modified. When, e.g., F1 was increased, the perceived degree of openness of the test vowel decreased. In a more recent investigation, Ohala and Feder (1994) showed that context effects could be induced by mere expectations, which calls for a mentalistic theory like the Modulation Theory.

In the case of visually perceived postures and gestures, there is usually a structure that is visible whenever it is sufficiently light and that can be considered the "carrier", while the posture or gesture can be considered a "modulation" of that carrier. However, when speaking we have to produce a carrier signal ourselves by phonating. Another difference consists in the more complex possibilities of perspectival variation that have to be handled in the visual case due to the variation of the proximal stimulus with the spatial orientation of the carrier. The perceiver is capable of handling projective geometry subconsciously. In the case of acoustic signals, the information about orientation is mostly blurred.

As for the notion of a neutral reference and its role in perception, we can see an analogy between the perception of speech and the perception of human faces. The perception of faces appears to involve a comparison with a neutral reference, and we attach particular significance to deviations from the neutral shape. This becomes evident in caricatures, whose essence it is to have these deviations exaggerated.

The analogies between speech acquisition and the imitation of bodily gestures, which are captured by the Modulation Theory, suggest that the imitative behavior that is necessary for speech development may have had its phylogenetic origin in a preexisting disposition for aping, but this alone would rather have disposed our ancestors for sign language. Therefore, it appears reasonable to assume that prior to the development of communication by speech, there was already a preexisting disposition for parroting, which is reflected in onomatopoeia.


Eklund I. and Traunmüller H. 1996. Comparative study of male and female whispered and phonated versions of the long vowels of Swedish. Phonetica 54, 1-21.

Fant G. 1960. Acoustic Theory of Speech Production. The Hague: Mouton.

Kaitz M., Meschulach-Sarfaty O., Auerbach J., and Eidelman A. 1988. A reexamination of newborns' ability to imitate facial expressions. Developmental Psychology 24, 3-7.

Kuhl P.K. and Meltzoff A.N. 1984. The intermodal representation of speech in infants. Infant Behavior and Development 7, 361-381.

Ladefoged P. and Broadbent D.E. 1957. Information conveyed by vowels. Journal of the Acoustical Society of America 29, 98-104.

Laver J. 1980. The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.

Meltzoff A.N. and Moore K. 1977. Imitation of facial and manual gestures by human neonates. Science 198, 75-78.

Ohala J.J. and Feder D. 1994. Listeners' normalization of vowel quality is influenced by restored consonantal context. Phonetica 51, 111-118.

Studdert-Kennedy M. 1983. On learning to speak. Human Neurobiology 2, 191-195.

Traunmüller H. 1994. Conventional, biological, and environmental factors in speech communication: A Modulation Theory. Phonetica 51, 170-183.


Posted on 1998-03-30