Listening to different speakers: On the time-course of perceptual compensation for vocal-tract characteristics
Highlights
- We examine compensation processes in speech perception with event-related potentials.
- Vowels were presented in different speaker contexts, inducing normalization.
- Normalization effects were found in the N1 time window.
- Normalization processes influence representations at an early processing level.
Introduction
In everyday life, we listen to the speech of many individuals. The current paper investigates a perceptual compensation process that helps listeners to understand speech sounds spoken by different talkers. Individuals have different vocal-tract characteristics, caused by influences such as talker sex, talker size (or vocal-tract length), speaking style, and dialect. This variance appears to challenge speech comprehension because vocal tracts can differ on the same acoustic dimensions that allow listeners to discriminate between different speech-sound categories. We ask here how early in the speech perception process listeners' representations of speech sounds are changed in order to compensate for talkers' vocal-tract characteristics.
Vowels are discriminated mainly on the basis of acoustic properties that are referred to as formants. Formants are bands of increased intensity in the spectral makeup of speech sounds. For example, in English the main difference between the words “bit” and “bet” (phonemically transcribed as /bɪt/ versus /bɛt/) lies in the frequency of the first formant (F1). For vowels recorded from female American English speakers, the average F1 value for /ɛ/ is around 731 Hz while the average F1 of /ɪ/ lies around 483 Hz (F2 of /ɛ/: 2058 Hz; F2 of /ɪ/: 2365 Hz). For male speakers the average F1 value for /ɛ/ is around 580 Hz while the average F1 of /ɪ/ lies around 427 Hz (F2 of /ɛ/: 1799 Hz; F2 of /ɪ/: 2034 Hz) (Hillenbrand, Getty, Clark, & Wheeler, 1995). However, these averages do not tell the complete story. There is a large degree of overlap among different vowel categories (Hillenbrand et al., 1995; Joos, 1948). Single instances of two different vowels, spoken by two different speakers, can have very similar absolute formant values. This is not restricted to English; Dutch, the target language of the current paper, shows similar overlap in vowel categories (Adank et al., 2004b; Van Nierop et al., 1973), especially when comparing speakers of different sex or age. Such variance therefore causes multiple signal-to-category mappings for a single spoken speech sound. In other words, a single sound can often be interpreted as either of two different phonemes, so that listeners may be confused about whether the intended word was “bit” or “bet”.
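The overlap can be made concrete with a toy nearest-mean classifier over the Hillenbrand et al. (1995) averages quoted above. The classifier itself is our own illustration (no such model is proposed in the text); only the formant means come from the cited data.

```python
# Mean (F1, F2) values in Hz for /ɪ/ (as in "bit") and /ɛ/ (as in "bet"),
# from Hillenbrand et al. (1995) as quoted in the text.
FEMALE_MEANS = {"ɪ": (483.0, 2365.0), "ɛ": (731.0, 2058.0)}
MALE_MEANS = {"ɪ": (427.0, 2034.0), "ɛ": (580.0, 1799.0)}

def classify(f1, f2, means):
    """Assign the category whose mean formant values are nearest (Euclidean, Hz)."""
    return min(means, key=lambda v: (f1 - means[v][0]) ** 2 + (f2 - means[v][1]) ** 2)

# One and the same token (F1 = 560 Hz, F2 = 2050 Hz) lands in different
# categories depending on whose averages it is compared against:
print(classify(560, 2050, FEMALE_MEANS), classify(560, 2050, MALE_MEANS))  # ɛ ɪ
```

A token near the middle of one speaker's vowel space can thus sit squarely inside a different category of another speaker's space, which is exactly the multiple signal-to-category mapping problem described above.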
It has been argued that listeners compensate for vocal-tract characteristics in a number of different ways (Johnson, 2005, Nearey, 1989). An important contribution may be made by a mechanism that compensates for speaker characteristics by taking into account the vocal-tract characteristics of the speaker as revealed in a preceding context (Ladefoged & Broadbent, 1957). Ladefoged and Broadbent (1957) found that listeners interpret a vowel that is acoustically halfway between an [i] and an [ɛ] more often as /i/ (which has a low F1) when it is preceded by a sentence with a relatively high F1, while the same sound is more often interpreted as /ɛ/ (which has a high F1) when preceded by a precursor sentence with a relatively low F1. This contrastive process therefore effectively normalizes perception for the F1 range of a speaker and reduces potential overlap of vowel categories across speakers. Although F1 is not the only cue to differences in vocal-tract characteristics, Ladefoged and Broadbent (1957) have shown that listeners can use F1 characteristics to map speech sounds onto the correct phonemes.
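The contrastive logic of this context effect can be sketched with a toy decision rule. The specific F1 values and the single-cue rule below are our own simplifications for illustration, not the mechanism proposed by Ladefoged and Broadbent (1957).

```python
def categorize_relative(target_f1, precursor_mean_f1):
    """Contrastive categorization: interpret the target F1 relative to the
    precursor's mean F1. A high-F1 precursor makes the same target sound
    relatively low in F1, and thus more like /i/ (the low-F1 vowel)."""
    return "i" if target_f1 - precursor_mean_f1 < 0 else "ɛ"

ambiguous_f1 = 500  # Hz; a token halfway between [i] and [ɛ] (illustrative value)
print(categorize_relative(ambiguous_f1, 600))  # high-F1 precursor -> i
print(categorize_relative(ambiguous_f1, 400))  # low-F1 precursor -> ɛ
```

The sketch shows the contrast direction: the same acoustic token flips category when only the reference frame supplied by the precursor changes.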
Watkins (1991) and Watkins and Makin (1994, 1996) have argued that the bulk of this effect can be explained by a mechanism that compensates for the average spectral makeup of a precursor, whether the shape of this spectral makeup is due to vocal-tract characteristics or something else (such as room acoustics). The suggestion is that these context effects are the result of a mechanism that focuses on how a given stimulus differs from the preceding context (“sensitivity to change”, cf. Kluender, Coady, & Kiefte, 2003). This mechanism is assumed to be a general perceptual mechanism that is not specific to speech perception. In line with this claim, contrast effects similar to vowel normalization also occur for musical timbre perception (Stilp, Alexander, Kiefte, & Kluender, 2010). Additional evidence stems from the finding that a precursor spoken by a female talker can influence the perception of a subsequent target sound that was produced by a male talker (Watkins, 1991), and that speech sound categorization can also be influenced by non-speech precursors (Holt, 2006). This kind of mechanism could therefore function as a means of enhancing contrast (Kluender et al., 2003; Kluender & Kiefte, 2006) and displays a clear analogy to contrast effects with visual stimuli: a surface with a certain brightness will be perceived as darker when surrounded by a light surface, but as lighter when surrounded by a dark surface. Furthermore, effects of preceding context on speech sound categorization have also been observed with Japanese quail (Lotto, Kluender, & Holt, 1997). This again suggests that influences of context are in part the result of a relatively general perceptual mechanism (see Holt, 2006, for an overview of context-dependent effects on categorization), but it is not clear that a single general-purpose mechanism is sufficient to explain all the results on vowel normalization.
Sjerps, Mitterer, and McQueen (2011) argue, for example, that vowel normalization may primarily reflect a compensation mechanism that is based on the Long-Term Average Spectrum (LTAS) of the auditory input, but one that only operates if the input has spectro-temporal characteristics that are similar to those of speech. The primary compensation mechanism for vowel normalization thus appears to be general-purpose (based on contrast), but one which operates under some spectro-temporal constraints.
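The LTAS-based idea can be sketched abstractly by treating a spectrum as a short list of frequency-band energies. The band layout and numbers below are invented purely for illustration; this is not the computational model of Sjerps et al. (2011), only the contrast-from-average intuition.

```python
def ltas(frames):
    """Long-Term Average Spectrum: per-band mean energy over context frames."""
    n = len(frames)
    return [sum(f[b] for f in frames) / n for b in range(len(frames[0]))]

def compensate(target_spectrum, context_frames):
    """Code the target relative to the context LTAS (contrastive compensation)."""
    avg = ltas(context_frames)
    return [t - a for t, a in zip(target_spectrum, avg)]

# A context with high energy in band 0 (a high-F1-like context, illustrative):
context = [[10.0, 2.0], [12.0, 2.0]]
print(compensate([8.0, 3.0], context))  # [-3.0, 1.0]: band 0 now coded as low
```

A target band that is energetic in absolute terms can thus be coded as weak after a context that was even more energetic in that band, mirroring the contrastive shift in vowel perception.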
Little is known, however, about when in the processing stream this compensation mechanism has its influence. Does normalization influence low-level representations or does it influence higher-level cognitive processes? Clearly, the assumption of a general perceptual mechanism that focuses on change is more compatible with the assumption of an early locus. The present study therefore examines the temporal locus of compensation for speaker vocal-tract characteristics by tracking its neurophysiological correlates during the perception of vowels. This is a novel approach in the investigation of the extrinsic normalization of vowels.
In order to establish whether normalization influences representations early or late in the stream of processing, four different time windows were investigated that can be considered to reflect subsequent stages in the processing stream. These were the P1, the N1, the N2 and the P3 time windows. Previous neurophysiological investigations with speech stimuli have suggested different functional interpretations for the processes/representations underlying these different waveform components. The earliest long-latency brain waves (P1 and N1, or their magnetic counterparts P1m and N1m), peaking at about 50 and 100 ms after stimulus onset respectively, seem to reflect early cortical information processing (Diesch et al., 1996; Makela et al., 2003; Obleser et al., 2004). While P1 has been argued to reflect basic auditory feature extraction, N1 seems to reflect a subsequent level closer to a more abstract phonological representational stage (Tavabi, Obleser, Dobel, & Pantev, 2007). Roberts, Flagg, and Gage (2004) argue that the N1 response represents some form of abstract processing. Cortical responses that were recorded when participants listened to sounds on a vowel continuum from [u] to [a] reflected clustering of N1 peak latencies around the regions of the continuum identified as either [u] or [a] (and less clustering around the ambiguous region). Roberts et al. (2004) also found, however, that when acoustic aspects of a single stimulus are held constant while the percept is changed through a response bias induced by preceding trials, the dominant N1 latency effect is related to the physical properties of the stimulus, and not to the eventual decision. This indicates that higher-level processes like response biases do not influence N1. Furthermore, while the N1 might reflect some abstract properties, Näätänen and Winkler (1999) argue that it does not reflect a completely abstract level of processing, as it does not directly reflect the consciously perceived event.
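The component windows can be tied to concrete epoch samples with a small helper. The sampling rate and exact window edges below are our own illustrative choices, not the analysis parameters used in the study.

```python
SRATE = 500  # Hz; illustrative sampling rate
# Approximate component windows (ms after stimulus onset), following the
# latencies discussed in the text; edges are illustrative.
WINDOWS_MS = {"P1": (30, 70), "N1": (70, 160), "N2": (200, 300), "P3": (300, 600)}

def window_mean(epoch, window_ms, srate=SRATE):
    """Mean voltage over [start, end) ms after onset; the epoch is a list of
    samples starting at 0 ms."""
    start, end = (int(t * srate / 1000) for t in window_ms)
    return sum(epoch[start:end]) / (end - start)

epoch = [0.5] * 300  # a flat 600 ms epoch of illustrative data
n1_mean = window_mean(epoch, WINDOWS_MS["N1"])  # 0.5
```

Per-window mean amplitude of this kind is the dependent measure that the later analyses compare across conditions.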
Bien, Lagemann, Dobel, and Zwitserlood (2009) have also shown that there is a difference between the N1 response and conscious decisions about the signal. Furthermore, in contrast to Roberts et al. (2004), Toscano, McMurray, Dennhardt, and Luck (2010) have shown that, for the perception of the voiced versus voiceless stop consonant distinction (as in beach versus peach), there is no relation between the N1 amplitude and the categorical status of a phoneme. Instead, they found a linear relation between the N1 amplitude and the step on the voiced–voiceless continuum. Thus, while it is not yet clear whether the N1 can reflect abstract aspects of speech sounds, previous results do show that the N1 reflects processes that are not influenced by response bias or by the consciously perceived qualities of the stimulus. This shows that these processes take place early in the processing of speech information. Roberts et al. (2004) thus argue that the ultimate perception of speech sounds depends on the coding of stimulus properties that takes place during the N1 time window.
Later time windows such as the N2/MMN time window (200–300 ms after stimulus onset) do seem to reflect abstract levels of processing (Näätänen et al., 1997; Winkler et al., 1999). MMN responses, for example, are larger to deviants that are linguistically relevant for the listener (Näätänen et al., 1997; Winkler et al., 1999). Moreover, in a study measuring both N1 and MMN, Sharma and Dorman (2000) found a dissociation between the two measures. Both Hindi and English listeners showed similar, direct dependencies of N1 latency on a Voice Onset Time (VOT) continuum (−90 to 0 ms) that is contrastive only for listeners of Hindi. However, only the Hindi listeners showed an MMN effect with these stimuli. This shows that N1 and MMN reflect subsequent stages in the processing hierarchy and that only the MMN response is dependent on linguistic exposure. Finally, the P3 response in response-active oddball designs (300–600 ms after stimulus onset) has been associated with the evaluation of deviant events in relation to subsequent behavioural action (Friedman, Cycowicz, & Gaeta, 2001). The P3 is thus likely to also reflect higher-level cognitive processes, although it is not necessarily insensitive to gradedness within speech categories (Toscano et al., 2010).
Our aim here was to investigate whether compensation for speaker vocal-tract characteristics is a process that influences representations of speech sounds at a relatively early stage of processing (i.e., during the P1 and/or N1 time windows) or at a relatively late stage of processing (i.e., during the N2 and/or P3 time windows). We investigated the influence of vocal-tract characteristics on vowel perception by presenting participants with target vowels in contexts that simulate speakers with different vocal-tract characteristics. Previous findings have shown that manipulated context sentences can change the perception of subsequently presented vowels, indicated by a shift in the categorization functions for these vowels (Kiefte & Kluender, 2008; Ladefoged & Broadbent, 1957; Mitterer, 2006; Sjerps et al., 2011; Watkins, 1991; Watkins & Makin, 1994, 1996).
Fig. 1 displays two mock-up categorization functions for sounds from a vowel continuum ranging from [i] to [ɛ]; these represent the sort of shift in categorization function that has been found. The dotted line represents the categorization of vowels presented after a precursor with a generally low F1, while the solid line represents categorization of the same vowel tokens presented after a precursor sentence with a high F1. Sounds on the continuum are more often categorized as /i/ (which itself has a low F1) in the context of a precursor with a high F1, and more often as /ɛ/ (which has a high F1) in the context of a low F1 precursor. The current research attempts to investigate at what level of processing the representation of speech sounds is influenced by the mechanism that leads to this shift in perception.
In the present experiment, target non-words were presented in a response-active mismatch detection design, such that listeners heard a repeating (standard) non-word that was replaced by two different (deviant) non-words on 20% of trials. The standard consisted of a non-word in which the initial vowel was manipulated to sound halfway between [i] and [ɛ], from now on indicated by [iɛ] (this transcription of the ambiguous sound should make clear that it does not represent an actual Dutch phoneme category). The deviant non-words started with a vowel that was an unambiguous instance of /i/ or /ɛ/. The following two syllables in each non-word (/papu/) were manipulated to have a high F1 or a low F1 so as to induce normalization effects in different experimental blocks. The bisyllable /papu/ contains two point vowels that provide the range of a speaker's F1. The induced change in perception through normalization should make it harder for participants to detect a change from the ambiguous vowel [iɛ] to [i] than to [ɛ] in the high F1 context, whereas listeners should find it harder to detect a change from [iɛ] to [ɛ] than to [i] in the low F1 context.
Listeners thus heard the nonsense words [iɛpapu] (as the standard stimulus), and [ipapu] and [ɛpapu] (as the deviant stimuli). In this setup the 2nd and 3rd syllables of stimulus x provided the preceding context for the next stimulus, x+1. This approach was chosen to create an interstimulus interval (ISI) of 750 ms between the context ([papu]) and the subsequent target vowels (i.e., [iɛpapu]–750 ms–[ɛpapu]–750 ms–[iɛpapu], etc.). The contextual influence of the [papu] syllables might extend beyond the immediately following trial (i.e., the perception of a target vowel is influenced not only by the just-preceding context), but because of the blocked presentation this influence was in the same direction within a block. The large ISI between a target vowel and its immediately preceding context is important because a small ISI could lead to contextual influences that are the result of peripheral auditory processes such as the negative auditory after-image (Summerfield et al., 1984; Watkins, 1991; Wilson, 1970). Such peripheral influences can cause a compensation effect in the same direction as the more central compensation effect under investigation here and could thus obscure its effects. The contexts did, however, follow directly after the target vowels. This is not a problem for the interpretation of the EEG waveform, as the following context started 250 ms after the onset of the vowel, which leaves enough time for any early cortical signatures in response to the critical vowels to appear before any effect of following context (at least those in the P1, N1, and N2 time windows). The early components induced by the following [papu] coincide with the P3 effect induced by the target vowel. This is not a problem either, as the P3 response is larger in amplitude than the earlier cortical responses that could be induced by the following context. It should be noted that the strength of normalization effects might decrease over repetitions (Broadbent & Ladefoged, 1960). This decrease, however, has been argued to be stronger when different context conditions are presented in a mixed fashion rather than in the blocked approach taken here (Sjerps et al., 2011).
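The trial-sequence logic of this blocked oddball design can be sketched as follows. The function name, block length, and the use of a plain shuffle are our own illustrative choices; the study's actual trial counts and randomization constraints may differ.

```python
import random

def make_block(n_trials, seed=None):
    """Build one blocked oddball sequence: 80% standards and 20% deviants,
    with the deviants split evenly between the two deviant types. Within a
    block the context F1 (high or low) is fixed, so it is not encoded per trial."""
    rng = random.Random(seed)
    n_dev = n_trials // 5  # 20% deviant trials
    trials = (["standard"] * (n_trials - n_dev)
              + ["deviant_i"] * (n_dev // 2)
              + ["deviant_eps"] * (n_dev - n_dev // 2))
    rng.shuffle(trials)
    return trials

block = make_block(100, seed=1)
print(block.count("standard"))  # 80
```

Because the context manipulation lives at the block level rather than the trial level, any carry-over of context across trials pushes perception in the same direction throughout a block, as noted above.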
An additional control condition was run that had the vowel [ɔ] as the initial vowel on the standard items (i.e., [ɔpapu]), but had the same deviants as in the experimental condition ([ipapu] and [ɛpapu]). In this control condition the [papu] part had a neutral F1 contour that was halfway between that of the high and low F1 conditions. This control condition was used to test whether our design was capable of producing a clear standard-deviant mismatch effect in the cortical signatures, and when and where on the cortical topography these mismatch effects would express themselves.
The control data were analyzed by comparing the size and distribution of the effect of deviant ([ipapu] and [ɛpapu]) versus standard ([ɔpapu]) in the four time windows. For the experimental (i.e., non-control) stimuli, an initial analysis compared ERPs for the standard stimuli ([iɛpapu] in both the high F1 and the low F1 condition) versus the deviants ([ipapu] and [ɛpapu] in both the high F1 and the low F1 condition). This comparison was made to see whether and when the small auditory differences that we used in the experimental condition were able to elicit different cortical responses to deviants (note that in both sets of data the deviant vowels were the same; only the standards differed). In the final and critical analysis we tested at what stage of the cortical processing of speech the influence of the contexts’ F1 properties on the detectability of a vowel change was reflected. This effect was tested by looking for an interaction between the F1 condition and the identity of the deviant vowel, with the size of the difference response (in voltage) as the dependent variable. Note that the analysis of this critical interaction focuses on the processing of the deviants and not on the processing of the standard, even though our design hinges on the perception of the standard changing across blocks. Normalization processes change the perceived quality of the standard and thus also the mental traces of the standard. For the critical analysis, we measured the relative strength of the cortical signature of the mental comparison of a deviant vowel to those traces of the standard. Traditionally, designs with this oddball paradigm focus on the difference wave between standard and deviant (cf. Näätänen & Winkler, 1999). As we were interested primarily in the interaction between deviant identity and the contexts’ F1 properties, a comparison of the deviants themselves suffices here.
In the present study, an early influence should thus be reflected in early time windows (i.e., within the first 160 ms after vowel onset) such as those related to the P1 and/or the N1, whereas a later influence should only be able to affect cortical signatures later than about 200 ms, a time window which is related to the N2/MMN or the P3. To exemplify the expected results, imagine the analysis in the P3 time window. We expected that easy detectability of deviants would lead to a stronger positivity. The [i] deviant should be easier to detect (and thus result in a larger positivity) in the low F1 condition than in the high F1 condition. The difference wave for “[i] in a low F1 context”–“[i] in a high F1 context” should thus be positive. For [ɛ], this pattern should be reversed, and the difference wave for “[ɛ] in a low F1 context”–“[ɛ] in a high F1 context” should be negative. This mirror-image pattern of results should not necessarily arise only in the P3 time window; in fact, it should be observed from the point in time where the normalization processes start to take place. The question we asked was when that would be.
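The predicted mirror-image pattern can be made concrete with a toy computation. The amplitude values below are invented purely for illustration; only the sign pattern matters.

```python
def difference_wave(amp_low_f1_context, amp_high_f1_context):
    """Per-deviant difference response: mean amplitude after the low-F1 context
    minus mean amplitude after the high-F1 context (arbitrary units)."""
    return amp_low_f1_context - amp_high_f1_context

# [i] is predicted to be easier to detect (larger positivity) after a low-F1
# context, [ɛ] after a high-F1 context, so the two differences should have
# opposite signs:
d_i = difference_wave(2.0, 1.0)    # positive
d_eps = difference_wave(0.5, 1.5)  # negative
print(d_i, d_eps)  # 1.0 -1.0
```

The interaction test described above amounts to asking, in each time window, whether these two differences reliably point in opposite directions.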
Participants
Twenty-four native speakers of Dutch from the Max Planck Institute for Psycholinguistics participant pool were tested. They received a monetary reward for their participation. None of the participants reported a hearing disorder, language impairment, or uncorrected visual impairment and all participants were right-handed.
Materials
All recordings were made by a female native speaker of Dutch. Acoustic processing of the stimuli was carried out using PRAAT software (Boersma & Weenink, 2005). The materials
Behavioural data
In the control condition participants detected 98.4% of the deviants. In the experimental conditions participants detected on average 52.5% of the deviants. The behavioural data were analyzed using linear mixed-effects models in R (version 2.6.2, R development core team, 2008, with the lmer function from the lme4 package of Bates & Sarkar, 2007). Detection responses were modelled using the logit-linking function (Dixon, 2008). Hits were coded as 1 and misses as 0. Different models were tested
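The logit link mentioned above maps detection probabilities onto log-odds. The following stdlib sketch shows only the link function and the 1/0 coding of responses, not the full mixed-effects model (which the study fits with lmer/lme4 in R); the response vector is invented for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p: the logit link used for binary detection data."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse link: map log-odds back onto a probability."""
    return 1 / (1 + math.exp(-x))

hits = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # illustrative hit (1) / miss (0) coding
p_hat = sum(hits) / len(hits)          # observed detection rate: 0.7
log_odds = logit(p_hat)                # modelled on this log-odds scale
```

Modelling on the log-odds scale keeps predicted probabilities between 0 and 1, which is why the logit link is the standard choice for hit/miss data (Dixon, 2008).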
General discussion
The current paper investigated the level of processing at which compensation for vocal-tract characteristics in speech perception has its influence. In a two-deviant active oddball design, listeners were asked to detect the vowel deviants [ɛ] and [i], embedded in a stream of standards that consisted of an ambiguous sound [iɛ] (a sound that was acoustically halfway between [ɛ] and [i]). These standard and deviant vowels were prepended to a non-word context consisting of two syllables (/papu/)
Acknowledgments
We would like to thank Marcel Bastiaansen for suggestions while designing the study, Katja Poelmann and Ellie van Setten for assistance with electrode application, and two anonymous reviewers for constructive commentary.
References
- Analyzing ‘visual world’ eyetracking data using multilevel logistic regression. Journal of Memory and Language (2008).
- Quantifying signal-to-noise ratio of mismatch negativity in humans. Neuroscience Letters (2003).
- The neurotopography of vowels as mirrored by evoked magnetic field measurements. Brain and Language (1996).
- Models of accuracy in repeated-measures designs. Journal of Memory and Language (2008).
- The novelty P3: An event-related brain potential (ERP) sign of the brain's evaluation of novelty. Neuroscience and Biobehavioral Reviews (2001).
- Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics (1999).
- Sensitivity to change in perception of speech. Speech Communication (2003).
- Speech perception within a biologically realistic information-theoretic framework.
- The auditory N1m reveals the left-hemispheric representation of vowel identity in humans. Neuroscience Letters (2003).
- Cortical representation of vowels reflects acoustic dissimilarity determined by formant frequencies. Cognitive Brain Research (2003).
- Processing of vowels in supratemporal auditory cortex. Neuroscience Letters.
- Pre-attentive detection of vowel contrasts utilizes both phonetic and auditory memory representations. Cognitive Brain Research.
- A comparison of vowel normalization procedures for language variation research. Journal of the Acoustical Society of America.
- An acoustic description of the vowels of Northern and Southern Standard Dutch. The Journal of the Acoustical Society of America.
- lme4: Linear mixed-effects models using S4 classes (version 0.999375-27) [software application].
- Implicit and explicit categorization of speech sounds – Dissociating behavioural and neurophysiological data. European Journal of Neuroscience.
- Praat: Doing phonetics by computer.
- Vowel judgements and adaptation level. Proceedings of the Royal Society of London Series B: Biological Sciences.
- ERP correlates of phoneme perception in speech and sound contexts. Neuroreport.
- Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America.
- Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychological Science.
- The mean matters: Effects of statistically defined nonspeech spectral distributions on speech categorization. Journal of the Acoustical Society of America.
- Speaker normalization in speech perception.
- Acoustic phonetics. Language.
- Absorption of reliable spectral characteristics in auditory perception. Journal of the Acoustical Society of America.
- Protection from acoustic trauma is not a primary function of the medial olivocochlear efferent system. JARO: Journal of the Association for Research in Otolaryngology.