[Published in Actes du XIIème Congrès International des Sciences Phonétiques, 19-24 août 1991 - Aix-en-Provence, France, Vol 5, p. 62-65. Some typos corrected in this version.]
The tonotopic distance hypothesis was first proposed to explain the results of perceptual experiments with synthetic vowels. Its quantitative validity has been questioned on the basis of results obtained in another perceptual experiment, in which the influence of F0 turned out to be smaller. The discrepancy can be explained if it is assumed that listeners relate F1 to the prosodic baseline rather than to an instantaneous or average value of F0. Such a baseline is obtained by interpolation between successive minima in the F0-contour of the breath-group in question.
Data on F0 in different styles of speech show that an invariant minimal value of F0 is characteristic of each speaker. That value of F0 is normally reached close to the end of statements. It appears to be stable in various types of paralinguistic variations, such as the degree of involvement, and in different styles of speech [2, 3], at least as long as these do not involve an overall change in vocal effort. More precisely, the invariant value of F0 is slightly above its minimum, and it might represent an average of the baseline.
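The baseline construction described above can be illustrated with a minimal sketch. It assumes a voiced, evenly sampled F0-contour, treats every sample that is not higher than its neighbours as a minimum, and interpolates linearly; the function name prosodic_baseline, the treatment of the end points, and the example values are illustrative assumptions, not details taken from the experiment.

```python
import numpy as np

def prosodic_baseline(times, f0):
    """Interpolate linearly between successive local minima of an F0-contour.

    Assumption: `times` is strictly increasing and `f0` has no unvoiced gaps.
    """
    times = np.asarray(times, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    n = len(f0)
    if n < 2:
        return f0.copy()
    minima = []
    for i in range(n):
        # An end point counts as a minimum if it is not above its only neighbour.
        left_ok = i == 0 or f0[i] <= f0[i - 1]
        right_ok = i == n - 1 or f0[i] <= f0[i + 1]
        if left_ok and right_ok:
            minima.append(i)
    # np.interp holds the first and last minimum constant outside their range.
    return np.interp(times, times[minima], f0[minima])

# Hypothetical contour (seconds, Hz), for illustration only.
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
f0 = [120.0, 150.0, 110.0, 160.0, 140.0, 100.0, 130.0]
baseline = prosodic_baseline(times, f0)
```

In this example the minima lie at 0.0 s, 0.2 s, and 0.5 s, so the baseline declines from 120 Hz to 100 Hz and stays at 100 Hz after the last minimum.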
If this is to be reflected in speech perception, listeners should, in effect, relate F1 to an estimate of the speaker's prosodic baseline in judging vowel openness. According to slightly different hypotheses, the minimum F0 in the whole breath-group or in a smaller unit of speech might be relevant instead. In order to test the various hypotheses, an experiment was performed with synthetic vowels and diphthongs in which either F1 or F0 varied or both varied in unison.
The stimuli had a duration of 470 ms. Prospective diphthongs were obtained by frequency modulation of F1 and/or F0 with part of a sinusoid with a period of 360 ms, phased such that the nominal target values of F1 and F0 were reached 30 ms after the beginning and 80 ms before the end of the stimuli. The asymmetry was motivated by a final decrease in excitation amplitude.
The nominal F0-targets for the diphthongs were 250 Hz and 453 Hz (stimulus series 3a and 3b), 161 Hz and 569 Hz (4a and 4b), and 250 Hz and 347 Hz (5a and 5b). The targets of F1 were in each series 1 Bark above those of F0.
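The relation between the F0- and F1-targets can be checked numerically. The sketch below assumes the Hz-to-Bark approximation z = 26.81 f / (1960 + f) - 0.53 (Traunmüller 1990), without its end corrections; the text does not state which conversion was used, and the function names are illustrative. Under this assumption, the target values listed above fall close to half-integer Bark values.

```python
def hz_to_bark(f_hz):
    """Assumed conversion from frequency in Hz to critical-band rate in Bark."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z_bark):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z_bark + 0.53) / (26.28 - z_bark)

def f1_above_f0(f0_hz, distance_bark=1.0):
    """F1 placed a given tonotopic distance (in Bark) above F0."""
    return bark_to_hz(hz_to_bark(f0_hz) + distance_bark)

# F0-targets quoted in the text, in Hz.
targets_hz = [161.0, 250.0, 347.0, 453.0, 569.0]
targets_bark = [hz_to_bark(f) for f in targets_hz]
f1_for_250 = f1_above_f0(250.0)
```

With this conversion, an F0 of 250 Hz corresponds to about 2.5 Bark, and an F1 placed 1 Bark higher comes out at roughly 347 Hz.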
Fig. 1 shows the average perceived degree of openness in the vowel series with successively rising F1 with and without rising F0 (series 1 and 2). The last one of the four values assigned to each response was ignored.
The vowels with the same F0 and the same higher formants, but with successively rising F1, were unanimously perceived as successively more open, from [i] to [æ] (upper line). The spread in perceived openness was small.
As for the stimuli in which both F1 and F0 increased (lower line), ten subjects perceived essentially no change in openness, hearing all as [i] (lower dashed line), while the other ten were less uniform in behavior. For them, only about 40 to 70 % of a shift in F1 was compensated by an F0-shift equal in Bark (upper dashed line). The subjects behaved in a unanimous fashion only up to F0 = 250 Hz.
Fig. 1. Perceived degree of openness in vowels, as explained in text.
Fig. 2 shows the effect of variations in F1 on the perceived degree of openness as a function of time from the beginning to the end of each stimulus in the series 3 to 5. The two non-terminal openness values have been averaged for these figures. The figure shows the results pooled over all subjects and over all three choices of extreme values for F1 and F0. There was no noticeable difference between the two orders of presentation. Fig. 2a includes the four cases in which F0 was low, while F1 was low (l), rising (r), high (h), and falling (f). In Fig. 2b, F0 is falling, while F1 is either high or falling. In Fig. 2c, F0 is rising, while F1 is either high or rising.
Fig. 2. Effects of F1 on the perceived degree of openness in vowels, as explained in text.
Fig. 3 is analogous to Fig. 2, but it shows the effect of variations in F0 when F1 is the same. Fig. 3a includes the four cases in which F1 was high, while F0 was low (l), rising (r), high (h), or falling (f). In Figs. 3b and 3c, F1 is rising and falling, respectively, while F0 is either low or rising and falling with F1.
Fig. 3. Effects of F0 on the perceived degree of openness in vowels, as explained in text.
The stimuli in which F1 and F0 were "stationary" were often heard as finally diphthongized. This tendency is exaggerated in the results, since even a slight degree of closing diphthongization in open vowels was often transcribed as [Vi] or [Vj].
Fig. 3 demonstrates clearly that the instantaneous F0 (or a short time average) is of some importance. If the subjects were only sensitive to F0 averaged over the whole stimulus, the contours in each panel would run in parallel, with a vertical displacement. If they were only sensitive to the F0-minimum within each stimulus, the contours in each panel, except h in 3a, would coincide. If they were only sensitive to the baseline, all contours would coincide within each panel. The data show a combination of baseline and instantaneous effects, the relative weight of the latter increasing from 0.36 to 0.68 during the course of the stimuli, but this does not hold for each subject.
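One way to picture such a combination is a weighted mix of instantaneous F0 and baseline, both in Bark, against which F1 is judged. The sketch below is an assumption made for illustration: the linear mixing rule, the linear growth of the weight between the two reported end values, and the contour values are not taken from the experiment.

```python
import numpy as np

def reference_f0(f0_bark, baseline_bark, w_inst):
    """Weighted mix of instantaneous F0 and prosodic baseline (both in Bark).

    w_inst = 0 corresponds to a pure baseline listener,
    w_inst = 1 to a purely instantaneous one.
    """
    f0_bark = np.asarray(f0_bark, dtype=float)
    w = np.asarray(w_inst, dtype=float)
    return w * f0_bark + (1.0 - w) * baseline_bark

n = 5                                # hypothetical number of time points
f0 = np.linspace(2.5, 4.5, n)        # rising F0, in Bark
f1 = np.full(n, 5.5)                 # stationary high F1, in Bark
baseline = 2.5                       # baseline at the low F0-target
w = np.linspace(0.36, 0.68, n)       # weight of instantaneous F0 (assumed linear growth)

# Distance between F1 and the assumed reference, as a stand-in for openness.
distance = f1 - reference_f0(f0, baseline, w)
```

In this sketch the distance shrinks from 3.0 Bark at the start to 1.64 Bark at the end, whereas a pure baseline listener (weight 0) would keep it at 3.0 Bark throughout.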
The responses of individual subjects to the stimuli of Figs. 2 and 3 are not generally predictable from their responses to those of Fig. 1. This is shown in Fig. 4, in which the F0-sensitivity (in % compensation) in the two types of context is shown for each subject. The comparison includes only the stimuli with "stationary" F0. The correlation between the two sets of data is low (0.46). There were some subjects who relied entirely on the instantaneous F0, while others relied on the baseline. The former appear along the diagonal (y = x), the latter at y = 0. Thus, between-listener differences turned out to be very prominent.
Fig. 4. Context sensitivity of individual subjects, as explained in text.
In a previous experiment, in which the perception of F2' was in focus, it was also observed that some subjects behaved consistently in agreement with the tonotopic distance hypothesis, while others showed a reduced influence of F0 and often a less consistent behavior. The proportion of the latter was lower among speakers of Swedish than among speakers of Turkish. Apparently, it had been still lower in speakers of Austrian German. This might then be correlated with functional load: the minimum number of openness distinctions which are necessary to describe the phonological distinctions in the vowel systems is two for Turkish, three for Swedish, and four for Austrian German. As for the balance between instantaneous F0 and its baseline, the functional load of tone might be of importance, but there are no data to substantiate such a hypothesis.