The Modulation Theory of Speech

As we have seen [background (in Swedish)], speech signals contain linguistic, expressive, organic and perspectival information. Listeners can distinguish these types of information from each other, but each of the acoustic properties usually measured by phoneticians is affected by several of these factors. Previous theories of speech perception have not treated this adequately. The Modulation Theory of Speech is intended to show how these different kinds of information can be separated again, based on an analysis of how they are fused when speech is produced.

Within the framework of the Modulation Theory, the human faculty of communicating by means of speech is seen as a biological innovation. It is founded on a much older faculty of expressive communication, which still plays an important part in human communication.

In accordance with this, speech signals are regarded as the result of a process in which a carrier signal, whose properties are given by organic and expressive factors, has been modulated with conventional linguistic speech gestures.

A linguistically neutral carrier signal can be thought of as a 'colorless' vowel, a primitive human vocalization that occurs, e.g., as a hesitation sound. Its properties are given by the size of the speaker's organ of speech (vocal fold mass and length, vocal tract length, etc.) and by its paralinguistic "settings".

The acoustic properties of speech signals deviate from those of a neutral carrier signal in a way that is specific to each speech sound.

Thus, the linguistic phonetic quality is associated with these deviations and not immediately with the absolute properties of the speech signal.

For the perception of the different types of information in speech, this implies that a demodulation is necessary in order to be able to separate them.

The listener has to discover how the carrier signal has been modulated in order to be able to recognize the conventional linguistic information. On the other hand, the modulation must not affect his judgment of the organic and expressive qualities, which are reflected in the carrier signal. Thus, the listener has to separate the modulation from the carrier signal and to judge each on its own.

When an infant says its first word, it demonstrates that it has acquired at least a rudimentary control over the processes which are described by the Modulation Theory of Speech. When a child imitates something an older person has said, which is what happens here, it must first have recognized how the older person has modulated his carrier, and thereafter it must have modulated its own carrier in the same way. The imitation of any bodily posture or gesture follows an analogous procedure. There is a carrier (body, hand, face, vocal tract) that provides a system of reference and standards of comparison used in transposing the posture or gesture into a different system of reference with different standards of comparison.
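The imitation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the formant values are hypothetical, the carrier is reduced to three formant frequencies, and the modulation is assumed to be an additive deviation on a Bark scale (computed with a widely used analytical approximation, z = 26.81 f / (1960 + f) - 0.53, without end corrections).

```python
def hz_to_bark(f_hz):
    """Approximate critical-band rate in Bark (no end corrections)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def demodulate(signal_hz, carrier_hz):
    """Deviation of each formant from the speaker's neutral carrier, in Bark."""
    return [hz_to_bark(s) - hz_to_bark(c) for s, c in zip(signal_hz, carrier_hz)]

def modulate(carrier_hz, deviation_bark):
    """Apply a Bark-scale deviation pattern to a (different) carrier."""
    return [bark_to_hz(hz_to_bark(c) + d) for c, d in zip(carrier_hz, deviation_bark)]

# Hypothetical values (assumptions for illustration only): F1-F3 in Hz.
adult_carrier = [500.0, 1500.0, 2500.0]   # neutral vowel of an adult
adult_vowel = [300.0, 2200.0, 2900.0]     # the adult's modulated vowel
child_carrier = [700.0, 2100.0, 3500.0]   # neutral vowel of a child

# Step 1: the child recognizes how the adult has modulated his carrier.
gesture = demodulate(adult_vowel, adult_carrier)
# Step 2: the child modulates its own carrier in the same way.
child_vowel = modulate(child_carrier, gesture)

print([round(d, 2) for d in gesture])
print([round(f) for f in child_vowel])
```

The child's output differs from the adult's in absolute frequencies, but demodulating it against the child's own carrier yields the same deviation pattern, which is the sense in which the two utterances count as the same.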

In describing speech perception, it is important to measure each type of deviation with the right kind of ruler. On these rulers, equal intervals have to be equivalent for the listener. Thus, it would be wrong to measure pitch and its deviations from its base value in Hz, which is the physical unit of frequency. Pitch is more correctly represented in semitones or some other measure that is proportional to the logarithm of frequency. For formant frequencies, a tonotopic (bark) scale appears to be the correct choice, but certain power functions of frequency can also be used. For intensity differences, a dB-scale appears to be close to ideal.
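The three "rulers" mentioned above can be written down as a minimal sketch. The function names are hypothetical, and the Bark conversion uses one common analytical approximation (z = 26.81 f / (1960 + f) - 0.53, without end corrections); the text itself does not commit to a particular formula.

```python
import math

def hz_to_semitones(f_hz, ref_hz):
    """Pitch interval between f_hz and a reference frequency, in semitones."""
    return 12.0 * math.log2(f_hz / ref_hz)

def hz_to_bark(f_hz):
    """Approximate tonotopic (critical-band) rate in Bark, no end corrections."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def intensity_ratio_to_db(intensity, ref_intensity):
    """Level difference in dB between two intensities (power-like quantities)."""
    return 10.0 * math.log10(intensity / ref_intensity)

# The same 100 Hz step is a very different pitch interval at different heights,
# whereas an octave is always 12 semitones.
print(hz_to_semitones(200.0, 100.0))   # 12.0
print(hz_to_semitones(400.0, 200.0))   # 12.0
print(round(hz_to_semitones(300.0, 200.0), 2))
print(round(hz_to_bark(1000.0), 2))
print(intensity_ratio_to_db(100.0, 1.0))
```

Measured in Hz, the steps 100 to 200 and 200 to 300 look equal; in semitones the second is clearly smaller, which is closer to what a listener hears.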

In order to recognize the linguistic quality of speech sounds, listeners can be said to evaluate the deviations of the instantaneous properties of the speech signal (F0, formant frequencies, etc.) from those they expect of a linguistically neutral sound with the same organic and expressive quality. In this process, listeners' expectations are governed by extrinsic properties known from previous experience, e.g., when they know the speaker or have heard him speak for a while, and by intrinsic properties such as the frequency positions of the higher formants (F3 and above), which are affected less than F1 and F2 by variation in linguistic quality. As we have seen on the previous page, F0 plays an important part in this connection. Listeners appear to obtain (unconsciously) an estimate of its base value by analyzing the course of the F0 contour in the recent past.
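How a base value of F0 might be derived from the recent past, and how the instantaneous F0 can then be expressed as a deviation from it, is sketched below. The estimator is purely an assumption made for illustration: it takes a low quantile of the recent voiced F0 values, on the supposition that the base value lies near the lower end of the recent range. The text does not specify how listeners actually arrive at their estimate, and the function names and sample values are hypothetical.

```python
import math

def estimate_f0_base(recent_f0_hz, fraction=0.1):
    """Hypothetical base-value estimate: a low quantile of recent voiced F0.

    Frames with F0 <= 0 are treated as unvoiced and ignored.
    """
    voiced = sorted(f for f in recent_f0_hz if f > 0)
    if not voiced:
        raise ValueError("no voiced frames in the recent past")
    index = int(fraction * (len(voiced) - 1))
    return voiced[index]

def f0_deviation_semitones(f0_hz, base_hz):
    """Deviation of the instantaneous F0 from the base value, in semitones."""
    return 12.0 * math.log2(f0_hz / base_hz)

# Hypothetical F0 track of the recent past in Hz (0.0 marks unvoiced frames).
recent = [0.0, 110.0, 120.0, 135.0, 150.0, 0.0, 125.0, 115.0, 105.0, 100.0, 140.0]
base = estimate_f0_base(recent)
print(base)                                  # 100.0
print(f0_deviation_semitones(200.0, base))   # 12.0
```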

Listeners evaluate the instantaneous positions of the spectral peaks shaped by the formants in relation to each other and to the base value of F0. Experiments have shown that listeners do this above all with spectral peaks that are fairly close to each other. In this way, it is often possible to discover the linguistic information encoded in the formant frequencies without depending on prior recognition of the organic and expressive quality. When the acoustic signal is deficient in information, e.g., in whispering, when F0 is missing, or in the presence of any disturbing noise, listeners have to rely more on their expectations.

In the presence of disturbing noise, it becomes very clear that the recognition of the linguistic quality of speech signals is, in other ways as well, driven by expectations and not only by the speech signal. Listeners are even capable of hearing what cannot be heard objectively. This phenomenon is known as "perceptual restoration". In the process of listening, listeners continuously test how compatible the properties of the signal are with different alternative interpretations, and a speech signal remains compatible with an interpretation even if it is partially masked by a disturbing noise. Phenomena of this kind are illustrated on the next page (in Swedish).
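The idea that a partially masked signal remains compatible with an interpretation can be illustrated with a toy compatibility test. Everything here is a hypothetical simplification: the "interpretations" are short templates of one feature per frame (for instance, a formant position in Bark), the tolerance is arbitrary, and masked frames are simply treated as unable to contradict any candidate.

```python
def compatibility(observed, template, masked, tolerance=1.0):
    """Fraction of audible frames in which the observation fits the template.

    Masked frames are skipped: they provide no evidence against a candidate.
    """
    audible = [(o, t) for o, t, m in zip(observed, template, masked) if not m]
    if not audible:
        return 1.0  # nothing audible, so nothing contradicts the candidate
    hits = sum(1 for o, t in audible if abs(o - t) <= tolerance)
    return hits / len(audible)

# Hypothetical candidate interpretations (one feature value per frame).
templates = {
    "A": [13.0, 12.4, 11.8, 11.0, 10.2],
    "B": [13.0, 12.5, 13.0, 13.5, 14.0],
}

# Observation in which frames 2 and 3 are covered by noise (values unusable).
observed = [13.0, 12.5, 0.0, 0.0, 10.0]
masked = [False, False, True, True, False]

scores = {name: compatibility(observed, t, masked) for name, t in templates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Candidate "A" is fully compatible with what can be heard, although two of its five frames were never audible; in this toy model, that is the condition under which the missing part could be "restored" from expectation.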

The observations concerning the role of F0 in vowel perception are compatible with this theory, which can also be used in order to manipulate the organic and expressive quality of speech.

References:

Hartmut Traunmüller (1994) "Conventional, biological, and environmental factors in speech communication: A modulation theory" Phonetica 51: 170-183. (Also in PERILUS XVIII: 92-102.)
Note: The terms "expressive" and "organic" (quality, information, properties) are much more appropriate and should be substituted for "affective" and "personal" as used in that paper.

Hartmut Traunmüller (1998) "Modulation and demodulation in production, perception, and imitation of speech and bodily gestures" Proceedings, FONETIK 98: 40-43 (Dept. of Linguistics, Stockholm University).

Hartmut Traunmüller (2000) "Evidence for demodulation in speech perception" Proceedings of the 6th ICSLP, vol III: 790-793. (Also contributed to a workshop on "The Nature of Speech Perception".)

Hartmut Traunmüller (2005) "Paralinguale Phänomene" (Paralinguistic phenomena), chapter 76 in: SOCIOLINGUISTICS An International Handbook of the Science of Language and Society / SOZIOLINGUISTIK Ein internationales Handbuch zur Wissenschaft von Sprache und Gesellschaft, 2nd ed., Ulrich Ammon, Norbert Dittmar, Klaus Mattheier, Peter Trudgill (eds.), Vol. 1, pp. 653-665. Walter de Gruyter, Berlin/New York. Offprint available on request.

Hartmut Traunmüller (2005) "Speech considered as modulated voice" (Manuscript, 42 p.)

Hartmut Traunmüller (2007) "Demodulation, mirror neurons and audiovisual perception nullify the motor theory" Contr. to Fonetik 2007, TMH-QPSR 50: 17-20. Dept. of Speech, Music and Hearing, Royal Inst. of Technology, Stockholm.

Hartmut Traunmüller | Phonetics Lab | Dept. of Linguistics | Stockholm University
Text last modified in 1998. More recent references added later.