Listening to different speakers: On the time-course of perceptual compensation for vocal-tract characteristics
Highlights
- We examine compensation processes in speech perception with event-related potentials.
- Vowels were presented in different speaker contexts, inducing normalization.
- Normalization effects were found in the N1 time window.
- Normalization processes influence representations at an early processing level.
Introduction
In everyday life, we listen to the speech of many individuals. The current paper investigates a perceptual compensation process that helps listeners to understand speech sounds spoken by different talkers. Individuals have different vocal-tract characteristics, caused by influences such as talker sex, talker size (or vocal-tract length), speaking style, and dialect. This variance appears to challenge speech comprehension because vocal tracts can differ on the same acoustic dimensions that allow listeners to discriminate between different speech-sound categories. We ask here how early in the speech perception process listeners' representations of speech sounds are changed in order to compensate for talkers' vocal-tract characteristics.
Vowels are discriminated mainly on the basis of acoustic properties that are referred to as formants. Formants are bands of increased intensity in the spectral makeup of speech sounds. For example, in English the main difference between the words “bit” and “bet” (phonemically transcribed as /bɪt/ versus /bɛt/) lies in the frequency of the first formant (F1). For vowels recorded from female American English speakers, the average F1 value for /ɛ/ is around 731 Hz while the average F1 of /ɪ/ lies around 483 Hz (F2 of /ɛ/: 2058 Hz; F2 of /ɪ/: 2365 Hz). For male speakers the average F1 value for /ɛ/ is around 580 Hz while the average F1 of /ɪ/ lies around 427 Hz (F2 of /ɛ/: 1799 Hz; F2 of /ɪ/: 2034 Hz) (Hillenbrand, Getty, Clark, & Wheeler, 1995). However, these averages do not tell the complete story. There is a large degree of overlap among different vowel categories (Hillenbrand et al., 1995; Joos, 1948). Single instances of two different vowels, spoken by two different speakers, can have very similar absolute formant values. This is not restricted to English; Dutch, the target language of the current paper, shows similar overlap in vowel categories (Adank et al., 2004b; Van Nierop et al., 1973), especially when comparing speakers of different sex or age. Such variance therefore causes multiple signal-to-category mappings for a single spoken speech sound. In other words, a single sound can often be interpreted as either of two different phonemes, so that listeners may be confused about whether the intended word was “bit” or “bet”.
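The overlap can be made concrete with a toy nearest-mean classifier over the Hillenbrand et al. (1995) averages quoted above. The classifier itself is our own illustration (no such model is proposed in the text); only the formant means come from the cited data.

```python
# Mean (F1, F2) values in Hz for /ɪ/ (as in "bit") and /ɛ/ (as in "bet"),
# from Hillenbrand et al. (1995) as quoted in the text.
FEMALE_MEANS = {"ɪ": (483.0, 2365.0), "ɛ": (731.0, 2058.0)}
MALE_MEANS = {"ɪ": (427.0, 2034.0), "ɛ": (580.0, 1799.0)}

def classify(f1, f2, means):
    """Assign the category whose mean formant values are nearest (Euclidean, Hz)."""
    return min(means, key=lambda v: (f1 - means[v][0]) ** 2 + (f2 - means[v][1]) ** 2)

# One and the same token (F1 = 560 Hz, F2 = 2050 Hz) lands in different
# categories depending on whose averages it is compared against:
print(classify(560, 2050, FEMALE_MEANS), classify(560, 2050, MALE_MEANS))  # ɛ ɪ
```

A token near the middle of one speaker's vowel space can thus sit squarely inside a different category of another speaker's space, which is exactly the multiple signal-to-category mapping problem described above.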
It has been argued that listeners compensate for vocal-tract characteristics in a number of different ways (Johnson, 2005, Nearey, 1989). An important contribution may be made by a mechanism that compensates for speaker characteristics by taking into account the vocal-tract characteristics of the speaker as revealed in a preceding context (Ladefoged & Broadbent, 1957). Ladefoged and Broadbent (1957) found that listeners interpret a vowel that is acoustically halfway between an [i] and an [ɛ] more often as /i/ (which has a low F1) when it is preceded by a sentence with a relatively high F1, while the same sound is more often interpreted as /ɛ/ (which has a high F1) when preceded by a precursor sentence with a relatively low F1. This contrastive process therefore effectively normalizes perception for the F1 range of a speaker and reduces potential overlap of vowel categories across speakers. Although F1 is not the only cue to differences in vocal-tract characteristics, Ladefoged and Broadbent (1957) have shown that listeners can use F1 characteristics to map speech sounds onto the correct phonemes.
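The contrastive logic of this context effect can be sketched with a toy decision rule. The specific F1 values and the single-cue rule below are our own simplifications for illustration, not the mechanism proposed by Ladefoged and Broadbent (1957).

```python
def categorize_relative(target_f1, precursor_mean_f1):
    """Contrastive categorization: interpret the target F1 relative to the
    precursor's mean F1. A high-F1 precursor makes the same target sound
    relatively low in F1, and thus more like /i/ (the low-F1 vowel)."""
    return "i" if target_f1 - precursor_mean_f1 < 0 else "ɛ"

ambiguous_f1 = 500  # Hz; a token halfway between [i] and [ɛ] (illustrative value)
print(categorize_relative(ambiguous_f1, 600))  # high-F1 precursor -> i
print(categorize_relative(ambiguous_f1, 400))  # low-F1 precursor -> ɛ
```

The sketch shows the contrast direction: the same acoustic token flips category when only the reference frame supplied by the precursor changes.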
Watkins (1991) and Watkins and Makin (1994, 1996) have argued that the bulk of this effect can be explained by a mechanism that compensates for the average spectral makeup of a precursor, whether the shape of this spectral makeup is due to vocal-tract characteristics or something else (such as room acoustics). The suggestion is that these context effects are the result of a mechanism that focuses on how a given stimulus differs from the preceding context (“sensitivity to change”, cf. Kluender, Coady, & Kiefte, 2003). This mechanism is assumed to be a general perceptual mechanism that is not specific to speech perception. In line with this claim, contrast effects similar to vowel normalization also occur for musical timbre perception (Stilp, Alexander, Kiefte, & Kluender, 2010). Additional evidence stems from the finding that a precursor spoken by a female talker can influence the perception of a subsequent target sound that was produced by a male talker (Watkins, 1991), and that speech sound categorization can also be influenced by non-speech precursors (Holt, 2006). This kind of mechanism could therefore function as a means of enhancing contrast (Kluender et al., 2003; Kluender & Kiefte, 2006) and displays a clear analogy to contrast effects with visual stimuli: a surface with a certain brightness will be perceived as darker when surrounded by a light surface, but as lighter when surrounded by a dark surface. Furthermore, effects of preceding context on speech sound categorization have also been observed with Japanese quail (Lotto, Kluender, & Holt, 1997). This again suggests that influences of context are in part the result of a relatively general perceptual mechanism (see Holt, 2006, for an overview of context-dependent effects on categorization), but it is not clear that a single general-purpose mechanism is sufficient to explain all the results on vowel normalization.
Sjerps, Mitterer, and McQueen (2011) argue, for example, that vowel normalization may primarily reflect a compensation mechanism that is based on the Long-Term Average Spectrum (LTAS) of the auditory input, but one that only operates if the input has spectro-temporal characteristics that are similar to those of speech. The primary compensation mechanism for vowel normalization thus appears to be general-purpose (based on contrast), but one which operates under some spectro-temporal constraints.
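The LTAS-based idea can be sketched abstractly by treating a spectrum as a short list of frequency-band energies. The band layout and numbers below are invented purely for illustration; this is not the computational model of Sjerps et al. (2011), only the contrast-from-average intuition.

```python
def ltas(frames):
    """Long-Term Average Spectrum: per-band mean energy over context frames."""
    n = len(frames)
    return [sum(f[b] for f in frames) / n for b in range(len(frames[0]))]

def compensate(target_spectrum, context_frames):
    """Code the target relative to the context LTAS (contrastive compensation)."""
    avg = ltas(context_frames)
    return [t - a for t, a in zip(target_spectrum, avg)]

# A context with high energy in band 0 (a high-F1-like context, illustrative):
context = [[10.0, 2.0], [12.0, 2.0]]
print(compensate([8.0, 3.0], context))  # [-3.0, 1.0]: band 0 now coded as low
```

A target band that is energetic in absolute terms can thus be coded as weak after a context that was even more energetic in that band, mirroring the contrastive shift in vowel perception.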
Little is known, however, about when in the processing stream this compensation mechanism has its influence. Does normalization influence low-level representations or does it influence higher-level cognitive processes? Clearly, the assumption of a general perceptual mechanism that focuses on change is more compatible with the assumption of an early locus. The present study therefore examines the temporal locus of compensation for speaker vocal-tract characteristics by tracking its neurophysiological correlates during the perception of vowels. This is a novel approach in the investigation of the extrinsic normalization of vowels.
In order to establish whether normalization influences representations early or late in the stream of processing, four different time windows were investigated that can be considered to reflect subsequent stages in the processing stream. These were the P1, the N1, the N2 and the P3 time windows. Previous neurophysiological investigations with speech stimuli have suggested different functional interpretations for the processes/representations underlying these different waveform components. The earliest long-latency brain waves (P1 and N1, or their magnetic counterparts P1m and N1m), peaking at about 50 and 100 ms after stimulus onset respectively, seem to reflect early cortical information processing (Diesch et al., 1996; Makela et al., 2003; Obleser et al., 2004). While P1 has been argued to reflect basic auditory feature extraction, N1 seems to reflect a subsequent level closer to a more abstract phonological representational stage (Tavabi, Obleser, Dobel, & Pantev, 2007). Roberts, Flagg, and Gage (2004) argue that the N1 response represents some form of abstract processing. Cortical responses that were recorded when participants listened to sounds on a vowel continuum from [u] to [a] reflected clustering of N1 peak latencies around the regions of the continuum identified as either [u] or [a] (and less clustering around the ambiguous region). Roberts et al. (2004) also found, however, that when acoustic aspects of a single stimulus are held constant while the percept is changed through a response bias induced by preceding trials, the dominant N1 latency effect is related to the physical properties of the stimulus, and not to the eventual decision. This indicates that higher-level processes like response biases do not influence N1. Furthermore, while the N1 might reflect some abstract properties, Näätänen and Winkler (1999) argue that it does not reflect a completely abstract level of processing, as it does not directly reflect the consciously perceived event.
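The component windows can be tied to concrete epoch samples with a small helper. The sampling rate and exact window edges below are our own illustrative choices, not the analysis parameters used in the study.

```python
SRATE = 500  # Hz; illustrative sampling rate
# Approximate component windows (ms after stimulus onset), following the
# latencies discussed in the text; edges are illustrative.
WINDOWS_MS = {"P1": (30, 70), "N1": (70, 160), "N2": (200, 300), "P3": (300, 600)}

def window_mean(epoch, window_ms, srate=SRATE):
    """Mean voltage over [start, end) ms after onset; the epoch is a list of
    samples starting at 0 ms."""
    start, end = (int(t * srate / 1000) for t in window_ms)
    return sum(epoch[start:end]) / (end - start)

epoch = [0.5] * 300  # a flat 600 ms epoch of illustrative data
n1_mean = window_mean(epoch, WINDOWS_MS["N1"])  # 0.5
```

Per-window mean amplitude of this kind is the dependent measure that the later analyses compare across conditions.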
Bien, Lagemann, Dobel, and Zwitserlood (2009) have also shown that there is a difference between the N1 response and conscious decisions about the signal. Furthermore, in contrast to Roberts et al. (2004), Toscano, McMurray, Dennhardt, and Luck (2010) have shown that, for the perception of the voiced versus voiceless stop consonant distinction (as in beach versus peach), there is no relation between the N1 amplitude and the categorical status of a phoneme. Instead, they found a linear relation between the N1 amplitude and the step on the voiced–voiceless continuum. Thus, while it is not yet clear whether the N1 can reflect abstract aspects of speech sounds, previous results do show that the N1 reflects processes that are not influenced by response bias or by the consciously perceived qualities of the stimulus. This shows that these processes take place early in the processing of speech information. Roberts et al. (2004) thus argue that the ultimate perception of speech sounds depends on the coding of stimulus properties that takes place during the N1 time window.
Later time windows such as the N2/MMN time window (200–300 ms after stimulus onset) do seem to reflect abstract levels of processing (Näätänen et al., 1997; Winkler et al., 1999). MMN responses, for example, are larger to deviants that are linguistically relevant for the listener (Näätänen et al., 1997; Winkler et al., 1999). Moreover, in a study measuring both N1 and MMN, Sharma and Dorman (2000) found a dissociation between the two measures. Both Hindi and English listeners showed similar, direct dependencies of N1 latency on a Voice Onset Time (VOT) continuum (−90 to 0 ms) that is contrastive only for listeners of Hindi. However, only the Hindi listeners showed an MMN effect with these stimuli. This shows that N1 and MMN reflect subsequent stages in the processing hierarchy and that only the MMN response is dependent on linguistic exposure. Finally, the P3 response in response-active oddball designs (300–600 ms after stimulus onset) has been associated with the evaluation of deviant events in relation to subsequent behavioural action (Friedman, Cycowicz, & Gaeta, 2001). The P3 is thus likely to also reflect higher-level cognitive processes, although it is not necessarily insensitive to gradedness within speech categories (Toscano et al., 2010).
Our aim here was to investigate whether compensation for speaker vocal-tract characteristics is a process that influences representations of speech sounds at a relatively early stage of processing (i.e., during the P1 and/or N1 time windows) or at a relatively late stage of processing (i.e., during the N2 and/or P3 time windows). We investigated the influence of vocal-tract characteristics on vowel perception by presenting participants with target vowels in contexts that simulate speakers with different vocal-tract characteristics. Previous findings have shown that manipulated context sentences can change the perception of subsequently presented vowels, indicated by a shift in the categorization functions for these vowels (Kiefte & Kluender, 2008; Ladefoged & Broadbent, 1957; Mitterer, 2006; Sjerps et al., 2011; Watkins, 1991; Watkins & Makin, 1994, 1996).
Fig. 1 displays two mock-up categorization functions for sounds from a vowel continuum ranging from [i] to [ɛ]; these represent the sort of shift in categorization function that has been found. The dotted line represents the categorization of vowels presented after a precursor with a generally low F1, while the solid line represents categorization of the same vowel tokens presented after a precursor sentence with a high F1. Sounds on the continuum are more often categorized as /i/ (which itself has a low F1) in the context of a precursor with a high F1, and more often as /ɛ/ (which has a high F1) in the context of a low F1 precursor. The current research attempts to investigate at what level of processing the representation of speech sounds is influenced by the mechanism that leads to this shift in perception.
In the present experiment, target non-words were presented in a response-active mismatch detection design, such that listeners heard a repeating (standard) non-word that was replaced by two different (deviant) non-words on 20% of trials. The standard consisted of a non-word in which the initial vowel was manipulated to sound halfway between [i] and [ɛ], from now on indicated by [iɛ] (this transcription of the ambiguous sound should make clear that it does not represent an actual Dutch phoneme category). The deviant non-words started with a vowel that was an unambiguous instance of /i/ or /ɛ/. The following two syllables in each non-word (/papu/) were manipulated to have a high F1 or a low F1 so as to induce normalization effects in different experimental blocks. The bisyllable /papu/ contains two point vowels that provide the range of a speaker's F1. The induced change in perception through normalization should make it harder for participants to detect a change from the ambiguous vowel [iɛ] to [i] than to [ɛ] in the high F1 context, whereas listeners should find it harder to detect a change from [iɛ] to [ɛ] than to [i] in the low F1 context.
Listeners thus heard the nonsense words [iɛpapu] (as the standard stimulus), and [ipapu] and [ɛpapu] (as the deviant stimuli). In this setup the 2nd and 3rd syllables of stimulus x provided the preceding context for the next stimulus, x+1. This approach was chosen to create an interstimulus interval (ISI) of 750 ms between the context ([papu]) and the subsequent target vowels (i.e., [iɛpapu]–750 ms–[ɛpapu]–750 ms–[iɛpapu], etc.). The contextual influence of the [papu] syllables might extend beyond the immediately following trial (i.e., the perception of a target vowel is influenced not only by the just-preceding context), but because of the blocked presentation this influence was in the same direction within a block. The large ISI between a target vowel and its immediately preceding context is important because a small ISI could lead to contextual influences that are the result of peripheral auditory processes such as the negative auditory after-image (Summerfield et al., 1984; Watkins, 1991; Wilson, 1970). Such peripheral influences can cause a compensation effect in the same direction as the more central compensation effect under investigation here and could thus obscure its effects. The contexts did, however, follow directly after the target vowels. This is not a problem for the interpretation of the EEG waveform, as the following context started 250 ms after the onset of the vowel, which leaves enough time for any early cortical signatures in response to the critical vowels to appear before any effect of following context (at least those in the P1, N1, and N2 time windows). The early components induced by the following [papu] coincide with the P3 effect induced by the target vowel. This is not a problem either, as the P3 response is larger in amplitude than the earlier cortical responses that could be induced by the following context. It should be noted that the strength of normalization effects might decrease over repetitions (Broadbent & Ladefoged, 1960). This decrease, however, has been argued to be stronger when different context conditions are presented in a mixed fashion rather than in the blocked approach taken here (Sjerps et al., 2011).
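The trial-sequence logic of this blocked oddball design can be sketched as follows. The function name, block length, and the use of a plain shuffle are our own illustrative choices; the study's actual trial counts and randomization constraints may differ.

```python
import random

def make_block(n_trials, seed=None):
    """Build one blocked oddball sequence: 80% standards and 20% deviants,
    with the deviants split evenly between the two deviant types. Within a
    block the context F1 (high or low) is fixed, so it is not encoded per trial."""
    rng = random.Random(seed)
    n_dev = n_trials // 5  # 20% deviant trials
    trials = (["standard"] * (n_trials - n_dev)
              + ["deviant_i"] * (n_dev // 2)
              + ["deviant_eps"] * (n_dev - n_dev // 2))
    rng.shuffle(trials)
    return trials

block = make_block(100, seed=1)
print(block.count("standard"))  # 80
```

Because the context manipulation lives at the block level rather than the trial level, any carry-over of context across trials pushes perception in the same direction throughout a block, as noted above.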
An additional control condition was run that had the vowel [ɔ] as the initial vowel on the standard items (i.e., [ɔpapu]), but had the same deviants as in the experimental condition ([ipapu] and [ɛpapu]). In this control condition the [papu] part had a neutral F1 contour that was halfway between that of the high and low F1 conditions. This control condition was used to test whether our design was capable of producing a clear standard-deviant mismatch effect in the cortical signatures, and when and where on the cortical topography these mismatch effects would express themselves.
The control data were analyzed by comparing the size and distribution of the effect of deviant ([ipapu] and [ɛpapu]) versus standard ([ɔpapu]) in the four time windows. For the experimental (i.e., non-control) stimuli, an initial analysis compared ERPs for the standard stimuli ([iɛpapu] in both the high F1 and the low F1 condition) versus the deviants ([ipapu] and [ɛpapu] in both the high F1 and the low F1 condition). This comparison was made to see whether and when the small auditory differences that we used in the experimental condition were able to elicit different cortical responses to deviants (note that in both sets of data the deviant vowels were the same; only the standards differed). In the final and critical analysis we tested at what stage of the cortical processing of speech the influence of the contexts’ F1 properties on the detectability of a vowel change was reflected. This effect was tested by looking for an interaction between the F1 condition and the identity of the deviant vowel, with the size of the difference response (in voltage) as the dependent variable. Note that the analysis of this critical interaction focuses on the processing of the deviants and not on the processing of the standard, even though our design hinges on the perception of the standard changing across blocks. Normalization processes change the perceived quality of the standard and thus also the mental traces of the standard. For the critical analysis, we measured the relative strength of the cortical signature of the mental comparison of a deviant vowel to those traces of the standard. Traditionally, designs with this oddball paradigm focus on the difference wave between standard and deviant (cf. Näätänen & Winkler, 1999). As we were interested primarily in the interaction between deviant identity and the contexts’ F1 properties, a comparison of the deviants themselves suffices here.
In the present study, an early influence should thus be reflected in early time windows (i.e., within the first 160 ms after vowel onset) such as those related to the P1 and/or the N1, whereas a later influence should only be able to affect cortical signatures later than about 200 ms, a time window which is related to the N2/MMN or the P3. To exemplify the expected results, imagine the analysis in the P3 time window. We expected that easy detectability of deviants would lead to a stronger positivity. The [i] deviant should be easier to detect (and thus result in a larger positivity) in the low F1 condition than in the high F1 condition. The difference wave for “[i] in a low F1 context”–“[i] in a high F1 context” should thus be positive. For [ɛ], this pattern should be reversed, and the difference wave for “[ɛ] in a low F1 context”–“[ɛ] in a high F1 context” should be negative. This mirror-image pattern of results should not necessarily arise only in the P3 time window; in fact, it should be observed from the point in time where the normalization processes start to take place. The question we asked was when that would be.
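The predicted mirror-image pattern can be made concrete with a toy computation. The amplitude values below are invented purely for illustration; only the sign pattern matters.

```python
def difference_wave(amp_low_f1_context, amp_high_f1_context):
    """Per-deviant difference response: mean amplitude after the low-F1 context
    minus mean amplitude after the high-F1 context (arbitrary units)."""
    return amp_low_f1_context - amp_high_f1_context

# [i] is predicted to be easier to detect (larger positivity) after a low-F1
# context, [ɛ] after a high-F1 context, so the two differences should have
# opposite signs:
d_i = difference_wave(2.0, 1.0)    # positive
d_eps = difference_wave(0.5, 1.5)  # negative
print(d_i, d_eps)  # 1.0 -1.0
```

The interaction test described above amounts to asking, in each time window, whether these two differences reliably point in opposite directions.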
Participants
Twenty-four native speakers of Dutch from the Max Planck Institute for Psycholinguistics participant pool were tested. They received a monetary reward for their participation. None of the participants reported a hearing disorder, language impairment, or uncorrected visual impairment and all participants were right-handed.
Materials
All recordings were made by a female native speaker of Dutch. Acoustic processing of the stimuli was carried out using PRAAT software (Boersma & Weenink, 2005). The materials
Behavioural data
In the control condition participants detected 98.4% of the deviants. In the experimental conditions participants detected on average 52.5% of the deviants. The behavioural data were analyzed using linear mixed-effects models in R (version 2.6.2, R development core team, 2008, with the lmer function from the lme4 package of Bates & Sarkar, 2007). Detection responses were modelled using the logit-linking function (Dixon, 2008). Hits were coded as 1 and misses as 0. Different models were tested
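The logit link mentioned above maps detection probabilities onto log-odds. The following stdlib sketch shows only the link function and the 1/0 coding of responses, not the full mixed-effects model (which the study fits with lmer/lme4 in R); the response vector is invented for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p: the logit link used for binary detection data."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse link: map log-odds back onto a probability."""
    return 1 / (1 + math.exp(-x))

hits = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # illustrative hit (1) / miss (0) coding
p_hat = sum(hits) / len(hits)          # observed detection rate: 0.7
log_odds = logit(p_hat)                # modelled on this log-odds scale
```

Modelling on the log-odds scale keeps predicted probabilities between 0 and 1, which is why the logit link is the standard choice for hit/miss data (Dixon, 2008).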
General discussion
The current paper investigated the level of processing at which compensation for vocal-tract characteristics in speech perception has its influence. In a two-deviant active oddball design, listeners were asked to detect the vowel deviants [ɛ] and [i], embedded in a stream of standards that consisted of an ambiguous sound [iɛ] (a sound that was acoustically halfway between [ɛ] and [i]). These standard and deviant vowels were prepended to a non-word context consisting of two syllables (/papu/)
Acknowledgments
We would like to thank Marcel Bastiaansen for suggestions while designing the study, Katja Poelmann and Ellie van Setten for assistance with electrode application, and two anonymous reviewers for constructive commentary.
References
- Analyzing ‘visual world’ eyetracking data using multilevel logistic regression. Journal of Memory and Language (2008).
- Quantifying signal-to-noise ratio of mismatch negativity in humans. Neuroscience Letters (2003).
- The neurotopography of vowels as mirrored by evoked magnetic field measurements. Brain and Language (1996).
- Models of accuracy in repeated-measures designs. Journal of Memory and Language (2008).
- The novelty P3: An event-related brain potential (ERP) sign of the brain's evaluation of novelty. Neuroscience and Biobehavioral Reviews (2001).
- Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics (1999).
- Sensitivity to change in perception of speech. Speech Communication (2003).
- Speech perception within a biologically realistic information-theoretic framework.
- The auditory N1m reveals the left-hemispheric representation of vowel identity in humans. Neuroscience Letters (2003).
- Cortical representation of vowels reflects acoustic dissimilarity determined by formant frequencies. Cognitive Brain Research (2003).
- Processing of vowels in supratemporal auditory cortex. Neuroscience Letters.
- Pre-attentive detection of vowel contrasts utilizes both phonetic and auditory memory representations. Cognitive Brain Research.
- A comparison of vowel normalization procedures for language variation research. Journal of the Acoustical Society of America.
- An acoustic description of the vowels of Northern and Southern Standard Dutch. The Journal of the Acoustical Society of America.
- lme4: Linear mixed-effects models using S4 classes (version 0.999375-27) [software application].
- Implicit and explicit categorization of speech sounds – Dissociating behavioural and neurophysiological data. European Journal of Neuroscience.
- Praat: Doing phonetics by computer.
- Vowel judgements and adaptation level. Proceedings of the Royal Society of London Series B: Biological Sciences.
- ERP correlates of phoneme perception in speech and sound contexts. Neuroreport.
- Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America.
- Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychological Science.
- The mean matters: Effects of statistically defined nonspeech spectral distributions on speech categorization. Journal of the Acoustical Society of America.
- Speaker normalization in speech perception.
- Acoustic phonetics. Language.
- Absorption of reliable spectral characteristics in auditory perception. Journal of the Acoustical Society of America.
- Protection from acoustic trauma is not a primary function of the medial olivocochlear efferent system. JARO: Journal of the Association for Research in Otolaryngology.