In natural situations, listeners are surrounded by a multitude of sounds that compete for attention. To comprehend a talker’s speech in the presence of competing distractors, both voice characteristics, such as the talker’s pitch, timbre, and articulatory style (reviewed by Diehl, Lotto, & Holt, 2004; Mathias & von Kriegstein, 2014), and listening goals, such as selective attention to and working memory of the talker’s speech (reviewed by Fritz, Elhilali, David, & Shamma, 2007; Shinn-Cunningham, 2008), guide the listener to focus on relevant sounds and to ignore irrelevant distractors.

One beneficial factor for speech comprehension under such conditions is familiarity with the talker’s voice. Several studies have shown that listeners are better at comprehending target speech when it is produced by a familiar rather than an unfamiliar talker (Holmes, Domingo, & Johnsrude, 2018; Johnsrude et al., 2013; Kreitewolf, Mathias, & von Kriegstein, 2017; Levi, Winters, & Pisoni, 2011; Newman & Evers, 2007; Nygaard & Pisoni, 1998; Souza, Gehani, Wright, & McCloy, 2013). In the following, we refer to this phenomenon as the familiarity benefit (e.g., Johnsrude et al., 2013; Kreitewolf et al., 2017).

The familiarity benefit likely relies on listeners’ previous experience with the talker’s vocal characteristics, which can be exploited to direct selective attention to target sounds in the auditory scene (Bressler, Masud, Bharadwaj, & Shinn-Cunningham, 2014; Kreitewolf, Mathias, Trapeau, Obleser, & Schönwiesner, 2018). A conceivable implication of the familiarity benefit is that listeners benefit from talker familiarity not only when their goal is to attend to the familiar talker’s speech but also when they want to ignore it. In other words, if the familiarity benefit is based on previous experience with the talker’s voice, then this experience might also help listeners to filter out distracting speech produced by this talker.

Only a few studies have investigated whether talker familiarity helps listeners ignore distracting, task-irrelevant speech. Johnsrude et al. (2013) presented listeners with two concurrent spoken sentences and asked them to report keywords from the target sentence. Critically, the authors manipulated talker familiarity in these sentences such that either the attended (target) sentence, the unattended (masker) sentence, or neither of the two was spoken by a highly familiar talker (i.e., the listener’s spouse). Listeners correctly reported more keywords from the target sentence when either the target or the masker sentence was spoken by a familiar as compared to an unfamiliar talker, suggesting that talker familiarity facilitates both attending to target speech and ignoring distracting speech. In an earlier study, Newman and Evers (2007) asked listeners to attend to and immediately repeat (i.e., to shadow) speech from a target talker while a distracting talker was presented in the background. Listeners differed in their familiarity with one of the two concurrent talkers and in whether or not they were told that they would hear a familiar voice (i.e., explicit vs. implicit knowledge). In this study, talker familiarity was ensured by presenting speech produced by the listeners’ university professor. The results showed that listeners with explicit knowledge about talker familiarity made fewer shadowing errors than listeners who only had implicit knowledge or listeners who were not familiar with the talker at all. Yet, this benefit was limited to familiarity with the target talker. Unlike in the study by Johnsrude et al. (2013), listeners did not benefit from familiarity with the distracting background talker. Therefore, these two studies produced somewhat incongruent results with regard to the question of whether talker familiarity helps listeners to ignore distracting, task-irrelevant speech.

Here, we investigated the effect of talker familiarity on the distraction induced by task-irrelevant speech using a different, yet well-established experimental paradigm: the irrelevant-speech task (e.g., Colle & Welsh, 1976; Salamé & Baddeley, 1982). The irrelevant-speech task requires listeners to keep the serial order of to-be-attended target stimuli in working memory while task-irrelevant, to-be-ignored speech is presented during memory retention. The number of incorrectly recalled targets is thought to increase proportionally to the distraction by the task-irrelevant speech, making the irrelevant-speech task an effective paradigm to study memory disruption by distracting speech. To modulate talker familiarity, we used an adaptation of the irrelevant-speech task in which a task-irrelevant, to-be-ignored sentence was spoken by either a familiar or an unfamiliar talker (Fig. 1).

Fig. 1

(a) Experimental design. Talker training in the student group was accomplished via classroom teaching. Students attended an introductory statistics course taught by one of two talkers. After the semester, the students as well as a group of family members and friends of the two talkers completed an online experiment. The familiar and unfamiliar talkers were presented in alternating blocks of trials. For half of the listeners, the experiment started with a familiar-talker block. (b) Task and experimental manipulation. In each trial of the online experiment, listeners heard the spoken digits 1 to 9 in random order, followed by a task-irrelevant sentence. The onset of the sentence was jittered. In about 17% of the trials, the sentence was omitted. The listeners’ task was to select the digits in the order of presentation on a visually presented number pad. The same task-irrelevant sentences were spoken by both talkers. The digits were spoken by an unfamiliar female talker

The major objective of this study was to test whether familiarity with a task-irrelevant talker affects the serial recall of attended target speech. One possibility is that talker familiarity helps listeners filter out irrelevant speech. This should manifest in fewer recall errors when the task-irrelevant talker is familiar versus unfamiliar. Such a finding would be not only in line with Johnsrude et al. (2013), but also consistent with the idea of proactive filtering: When a distractor is known or anticipated, the filter can be applied even before the distractor appears (e.g., Noonan et al., 2016; Ruff & Driver, 2006). Interestingly, proactive filtering is beneficial when a distractor is present, but behaviorally costly when an expected distractor is omitted (Marini, Chelazzi, & Maravita, 2013).

To test the possibility that listeners proactively filter out irrelevant speech produced by the familiar talker, we blocked presentation of the familiar and unfamiliar talkers and omitted the task-irrelevant sentence on rare occasions. If listeners proactively filtered irrelevant speech from a familiar talker, we would observe fewer recall errors on trials in which the task-irrelevant sentence was spoken by a familiar talker than on trials with an unfamiliar talker (familiarity benefit), and more recall errors on trials in which a familiar rather than an unfamiliar distracting talker was anticipated but the irrelevant sentence was omitted (familiarity deficit).

An alternative possibility is that task-irrelevant speech produced by a familiar talker captures attention and draws it away from items in memory. Previous work on the irrelevant-speech effect has shown that distractors of high familiarity, such as the listener’s own name (Röer, Bell, & Buchner, 2013) and the listener’s native language (Ellermeier, Kattner, Ueda, Doumoto, & Nakajima, 2015), enhance memory disruption. Possibly, familiar distractors constitute salient stimuli that involuntarily draw attention resources away from the serial memory of target items. If this is also the case for familiar voices, we would observe more recall errors when the task-irrelevant sentence is spoken by a familiar versus unfamiliar talker; however, we would expect no effect of talker familiarity on trials in which the task-irrelevant sentence is omitted.

A third possibility is that the effect of talker familiarity is not modulated by the actual presentation of a task-irrelevant sentence. That is, talker familiarity affects working memory irrespective of whether the familiar talker’s distracting speech is heard or merely expected. Such a finding would be difficult to explain by (proactive) filtering or attentional capture of the familiar talker’s speech; instead, it would point to more general differences in how familiar and unfamiliar task-irrelevant talkers affect working memory of target speech.

Based on the collective results from previous studies, one might draw the conclusion that the degree of familiarity plays an important role for the distraction by irrelevant speech and that listeners only benefit from high (Johnsrude et al., 2013) but not moderate familiarity (Newman & Evers, 2007) with a distracting talker. Yet, these studies are difficult to compare since they differ markedly in their experimental procedures, including stimuli and task instructions.

Here, we tested the effect of the degree of familiarity by comparing two groups of listeners that performed the exact same irrelevant-speech task. Importantly, these two groups differed in their degree of familiarity with similar magnitudes of familiarity as in previous studies. Specifically, we compared a group of students who heard irrelevant speech produced by one of their course instructors (i.e., moderate familiarity, similar to Newman & Evers, 2007) with a group of the course instructors’ family members and friends (i.e., high familiarity, similar to Johnsrude et al., 2013).

Method

Participants

Two groups of listeners participated in this study. The first group were N = 66 undergraduate psychology students (59 females, seven males; mean age 23.11 years, age range 17–48 years; see Table 1 for details) who had received classroom instructions by one of two talkers. The classroom teaching comprised a total of fourteen 90-min sessions (of which all participants attended at least nine; see Table 1). The second group of listeners were N = 20 family members and close friends of either one of the two talkers (eight females, 12 males; mean age 39 years, age range 30–65 years; see Table 1 for details), who did not receive classroom instructions. All listeners were native German speakers. Students gained course credit for their participation; family members and friends were paid €10. The experimental procedures were approved by the ethics committee of the University of Lübeck.

Table 1 Details of the listener groups

Stimuli

The to-be-attended speech stimuli were recordings of the German digits 1 to 9, which we had used in previous studies (Obleser, Wöstmann, Hellbernd, Wilsch, & Maess, 2012; Wöstmann, Lim, & Obleser, 2017; Wöstmann & Obleser, 2016). All digits were spoken by a native German female talker (mean fundamental frequency, f0, of 192 Hz). None of the listeners was familiar with the talker’s voice. Digits had an average duration of 0.6 s (ranging from 0.5 to 0.7 s) and were concatenated with an onset-to-onset delay of 0.75 s. The resulting digit streams had an average duration of 6.6 s.
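The reported stream duration follows directly from the timing parameters above. As a brief sketch (the function and variable names are ours and purely illustrative), the last of nine digits starts eight onset-to-onset intervals after the first, and its own average duration is added on top:

```python
# Timing values are taken from the text; the names are illustrative only.
N_DIGITS = 9
ONSET_TO_ONSET_S = 0.75   # delay between successive digit onsets
MEAN_DIGIT_DUR_S = 0.6    # average duration of a single digit


def stream_duration(n_items, onset_to_onset, last_item_duration):
    """Time from the first onset to the offset of the last item, in seconds."""
    return (n_items - 1) * onset_to_onset + last_item_duration


duration = stream_duration(N_DIGITS, ONSET_TO_ONSET_S, MEAN_DIGIT_DUR_S)
print(f"Average digit-stream duration: {duration:.1f} s")
```

This reproduces the average of 6.6 s; individual streams would deviate slightly depending on the duration of the final digit.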

For the task-irrelevant speech, we used a German version of the speech-in-noise sentences (Erb, Henry, Eisner, & Obleser, 2012) adopted from Kalikow, Stevens, and Elliott (1977). The same 50 sentences were recorded from two male talkers who were the authors J.K. and M.P. (both native German speakers). The mean f0, averaged across all sentences, was 93 Hz for talker J.K. and 85 Hz for talker M.P. The sentences produced by J.K. had an average duration of 2.17 s (ranging from 1.83 to 2.58 s); the sentences produced by M.P. had an average duration of 2.19 s (ranging from 1.91 to 2.72 s). All sentences and digit streams were normalized to the same root mean square (RMS) decibel full-scale amplitude using MATLAB (version 8.6, MathWorks, United States). On a given trial, the onset of the task-irrelevant sentence was delayed by 1,409 ms (± 400 ms) so that, on average, the sentences were centered in the middle of a 5-s memory retention phase (Fig. 1B).
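The centering of the sentences can be checked with a short calculation. The sketch below rests on two assumptions that the text does not spell out: that the onset delay is measured from the start of the retention phase, and that ± 400 ms denotes the full range of the jitter. All names are ours.

```python
# Values from the text; interpretation of the jitter is an assumption.
RETENTION_S = 5.0
MEAN_ONSET_DELAY_S = 1.409
JITTER_S = 0.4                            # assumed: maximum deviation from the mean delay
MEAN_SENTENCE_DUR_S = (2.17 + 2.19) / 2   # mean of the two talkers' average durations
LONGEST_SENTENCE_S = 2.72


def sentence_midpoint(onset_delay, sentence_duration):
    """Temporal midpoint of a sentence relative to the start of the retention phase."""
    return onset_delay + sentence_duration / 2


midpoint = sentence_midpoint(MEAN_ONSET_DELAY_S, MEAN_SENTENCE_DUR_S)
latest_offset = MEAN_ONSET_DELAY_S + JITTER_S + LONGEST_SENTENCE_S

print(f"Average sentence midpoint: {midpoint:.3f} s")
print(f"Latest possible sentence offset: {latest_offset:.3f} s")
```

Under these assumptions, the average midpoint falls within a few milliseconds of 2.5 s, and even the longest sentence at the latest onset ends before the number pad appears.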

Procedure

The listeners performed an online experiment implemented in Labvanced (Scicovery GmbH, Osnabrück, Germany) that used an adaptation of the irrelevant-speech paradigm (e.g., Colle & Welsh, 1976; Jones & Morris, 1992). The online experiment was executed in the browser in full-screen mode. Online experiments allow for the rapid collection of large datasets (e.g., Buhrmester, Kwang, & Gosling, 2011) and have been shown to produce reliable data in several areas of behavioral research, including linguistics (e.g., Saunders, Bex, & Woods, 2013) and psychoacoustics (e.g., McPherson & McDermott, 2018). Here, the online experiment had the additional advantage of preventing direct contact between the listeners and one of the two task-irrelevant talkers immediately before the start of the experiment, which would otherwise have contaminated our manipulation of talker familiarity.

All listeners completed the experiment within 1 h. On each trial, listeners heard the nine spoken digits in random order followed by a task-irrelevant sentence (Fig. 1B), while a fixation cross was presented on the computer screen. In about 17% of trials (i.e., 20 out of 120 trials), silence was presented instead of task-irrelevant speech. A total of 5 s after the offset of the digit stream (i.e., at the end of the memory retention phase), a number pad consisting of the digits 1 to 9 was visually presented. Listeners were asked to select the digits in the order of their presentation. Each visually presented digit disappeared directly after it had been selected, which prevented the same digit from being selected more than once per trial. After the selection of the ninth digit, the next trial started with a delay of 500 ms. No feedback was given.

Listeners were asked to perform the online experiment in a quiet setting, to use a computer (no tablets, smart phones, etc.), and to listen to the sounds using headphones. Prior to the start of the experiment, listeners were instructed to silently rehearse the digit stream during the memory retention phase, to keep their eyes open and not to speak the digits aloud during a trial. Listeners could adjust the loudness of the sounds to a comfortable level. They were asked not to change the sound level during the experiment.

The experiment comprised four blocks (Fig. 1A, “Test”). Each block consisted of 30 trials: 25 trials with a distracting, task-irrelevant sentence (distractor trials), and five trials with silence in the memory retention phase (no-distractor trials). In each block, the no-distractor trials were pseudo-randomly interspersed with the distractor trials, with the restrictions that the first no-distractor trial within a block could not occur within the first five trials and that two no-distractor trials could not occur in succession. The task-irrelevant familiar and unfamiliar talkers were presented in alternating blocks of trials. The talker familiarity was not made explicit (i.e., listeners were not told that they would hear a familiar talker’s voice). Half of the listeners started with a familiar-talker block; the other half started with an unfamiliar-talker block (Fig. 1A, “Test”). Note that the no-distractor trials were acoustically identical for the task-irrelevant familiar and unfamiliar talkers. However, the blocked presentation of the familiar and unfamiliar talkers allowed us to test whether the infrequent presentation of silence in the memory retention phase would affect performance differently when a familiar versus an unfamiliar talker’s voice was expected.
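The two ordering restrictions can be illustrated with a short sketch. The randomization procedure actually used in Labvanced is not described here, so the rejection-sampling approach and all names below are our own assumptions.

```python
import random


def make_block(n_trials=30, n_no_distractor=5, n_protected=5, rng=None):
    """Return a list of booleans for one block; True marks a no-distractor trial.

    Restrictions from the text: no no-distractor trial within the first
    `n_protected` trials, and no two no-distractor trials in succession.
    Rejection sampling is an assumed implementation, not the study's.
    """
    rng = rng or random.Random()
    while True:
        # Draw positions from the trials after the protected initial ones.
        positions = sorted(rng.sample(range(n_protected, n_trials), n_no_distractor))
        # Accept only if no two drawn positions are adjacent.
        if all(later - earlier > 1 for earlier, later in zip(positions, positions[1:])):
            break
    chosen = set(positions)
    return [index in chosen for index in range(n_trials)]


block = make_block(rng=random.Random(0))
print("".join("-" if is_silent else "D" for is_silent in block))
```

Four such blocks yield 20 no-distractor trials out of 120, that is, the roughly 17% stated above.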

In the first and second halves of the experiment (each consisting of one familiar- and one unfamiliar-talker block), the same combinations of digit streams and task-irrelevant sentences were presented (but each task-irrelevant sentence was spoken by one talker in the first half and by the other talker in the second half of the experiment). This was done to ensure that differences in performance between the blocks were due to familiarity with the task-irrelevant talker and not to differences in the memorability of the digit streams or the distractibility of the task-irrelevant sentences. To reduce item-specific learning, we ensured that the trial order was always different in familiar- and unfamiliar-talker blocks.

Statistical analyses

To assess listeners’ memory of the serial order of digits, we considered digits recalled at their respective position of presentation as correct and all other responses as incorrect. As a measure of distraction by the task-irrelevant speech, we counted the number of errors per trial (0–9). All statistical analyses were carried out in R (R Core Team, 2017) using RStudio (version 1.1.383).
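The scoring rule amounts to a position-wise comparison of the presented and recalled sequences. A minimal sketch follows; the function name and the example sequences are illustrative and not taken from the data.

```python
def count_errors(presented, recalled):
    """Number of serial positions at which the recalled digit differs from the presented one."""
    if len(presented) != len(recalled):
        raise ValueError("presented and recalled sequences must have the same length")
    return sum(p != r for p, r in zip(presented, recalled))


# Illustrative trial: the digits at positions 4 and 5 were swapped at recall.
presented = [3, 1, 4, 9, 5, 2, 6, 8, 7]
recalled = [3, 1, 4, 5, 9, 2, 6, 8, 7]
errors = count_errors(presented, recalled)
print(errors)
```

Note that a single transposition counts as two errors under this strict serial-position criterion.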

To overcome problems related to the unequal numbers of trials with and without a distractor in the memory retention phase, we fitted generalized linear mixed-effect models as implemented in the lme4 package (Bates, Mächler, Bolker, & Walker, 2015) to the number of errors per trial using Poisson regression (log link function; treating the number of errors as count data).
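For orientation, a Poisson mixed model with a log link and a random intercept for subjects can be written as follows. The notation is ours and is meant as a generic sketch, not as the exact specification of the fitted models: $y_{ij}$ is the number of errors made by listener $i$ on trial $j$, $\mathbf{x}_{ij}$ collects the fixed-effect predictors, and $u_i$ is the listener-specific random intercept.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i, \qquad
u_i \sim \mathcal{N}\!\left(0, \sigma_u^{2}\right)
```

Because the model is fitted to individual trials rather than to per-condition averages, conditions with few trials contribute less information instead of being weighted equally.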

We followed an iterative model-fitting procedure: Starting with a minimal model that only included the random intercepts for subjects, we first added fixed- and then random-effect terms in a stepwise fashion. The fixed-effect terms were added in the order of their conceptual importance (i.e., talker familiarity, distractor, listener group, and interactions between these factors; see below). The random-effect terms included random intercepts for sentences and the subject- and sentence-specific random slopes for all significant main factors and interactions. After each step, we fitted the model using maximum-likelihood estimation and assessed the change in model fit using likelihood-ratio tests. Model terms that significantly improved the model fit were kept in the model, and nonsignificant terms were dropped (unless they were involved in higher-order interactions), resulting in a best-fitting model.
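Each step of this procedure rests on a likelihood-ratio test between two nested models. The sketch below shows the computation in Python for illustration only (the analyses themselves were run in R); the log-likelihood values are hypothetical, and the chi-square tail probability is implemented only for 1 or an even number of degrees of freedom, where a closed form exists.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (df = 1 or even df only)."""
    if x < 0:
        raise ValueError("the test statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df > 0 and df % 2 == 0:
        half = x / 2
        term, total = 1.0, 1.0
        # Finite sum: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    raise NotImplementedError("only df = 1 or even df are supported in this sketch")


def likelihood_ratio_test(loglik_simple, loglik_complex, df_diff):
    """Return (chi-square statistic, p value) for two nested models."""
    statistic = 2 * (loglik_complex - loglik_simple)
    return statistic, chi2_sf(statistic, df_diff)


# Hypothetical log-likelihoods; one additional parameter in the complex model.
statistic, p_value = likelihood_ratio_test(-1000.0, -995.125, df_diff=1)
print(f"chi-square(1) = {statistic:.2f}, p = {p_value:.4f}")
```

A term would be retained when the p value falls below the chosen significance level; the degrees of freedom equal the number of parameters added by the term.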

To investigate the potential effects of talker familiarity (familiar vs. unfamiliar task-irrelevant talker), distractor (distractor vs. no-distractor trial), and listener group (students vs. family and friends) on the number of errors, we modeled these predictors as fixed effects using deviation coding. To explore significant interaction terms, we performed post-hoc comparisons using Tukey’s range tests, as implemented in the lsmeans package (Lenth, 2016). We report unstandardized coefficients b in order to provide an estimate of effect size for fixed effects. Note that Poisson regression operates on a log transform of the dependent measure. The coefficients are therefore given in log-scale units. For significant random-effect terms, we report the likelihood-ratio test comparing the more complex model that includes the random-effect term with the simpler model excluding that term.
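Because the coefficients are on the log scale, exponentiating them yields a multiplicative change in the expected number of errors. The sketch below assumes a two-level predictor; the distance between the two coded levels (1 for ±0.5 coding, 2 for ±1 coding) is not stated in the text and is therefore exposed as a parameter. The coefficient value is illustrative, not a reported estimate.

```python
import math


def rate_ratio(b, level_distance=1.0):
    """Multiplicative change in the expected count between the two levels of a predictor.

    `level_distance` is the numeric distance between the two coded levels
    (assumed here; 1.0 for -0.5/+0.5 coding, 2.0 for -1/+1 coding).
    """
    return math.exp(b * level_distance)


b_example = -0.05  # illustrative log-scale coefficient
print(f"{rate_ratio(b_example):.4f}")       # under -0.5/+0.5 coding
print(f"{rate_ratio(b_example, 2.0):.4f}")  # under -1/+1 coding
```

Which level corresponds to the higher error count depends on which level was assigned the positive code, so the sign of a coefficient should be read together with the direction described in the text.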

To enhance the interpretability of nonsignificant effects in particular, we calculated Bayes factors (BFs) using the brms package (Bürkner, 2016). When comparing two statistical models, the BF indicates how many times more likely the observed data are under the more complex model (including a particular model term of interest) than under the simpler model (excluding the model term of interest). In accordance with Jeffreys (1961), a BF of 0.33 or smaller is interpreted as providing evidence in favor of the null hypothesis, whereas a BF of 3 or larger is interpreted as evidence against it.
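The decision rule can be stated compactly. In the following sketch, the thresholds are those given above; the function name and the label "inconclusive" for intermediate values are our own.

```python
def interpret_bf(bf, lower=0.33, upper=3.0):
    """Classify a Bayes factor (complex vs. simple model) using the thresholds above."""
    if bf <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf <= lower:
        return "evidence for the null hypothesis"
    if bf >= upper:
        return "evidence against the null hypothesis"
    return "inconclusive"


for value in (0.04, 1.11, 7.6):
    print(value, "->", interpret_bf(value))
```

Values between the two thresholds are treated as providing no clear evidence in either direction.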

Results

Figure 2A shows the average proportions of errors as a function of digit position for distractor and no-distractor trials. Several basic observations can be made on the basis of these data. Descriptively, listeners made fewer errors for digits presented at initial and final positions, which is likely due to primacy and recency effects (e.g., Jones & Macken, 1993; Röer et al., 2013; Salamé & Baddeley, 1982; Schlittmeier, Weisz, & Bertrand, 2011; Wöstmann & Obleser, 2016). Interestingly, differences between distractor and no-distractor trials were most pronounced for digits presented in the second half of the digit stream (i.e., from the fifth to the eighth digit), and digits presented at these positions were generally more difficult to recall. Another observation is that, averaged across digit positions, listeners made more errors in distractor than in no-distractor trials (Fig. 2B).

Fig. 2

Distraction by task-irrelevant speech. (a) Average proportions of errors in sentence-distractor (black) and no-distractor (gray) trials, as a function of digit position. (b) Left: Numbers of errors per trial, averaged across digit positions for sentence-distractor (left dot; black) and no-distractor (right dot; gray) trials. Individual listeners’ data are shown as gray lines. Across-listener averages are shown by dots connected by the thick black line. Right: Numbers of errors per trial predicted by the best-fitting model (i.e., controlling for the effects of all other model terms; see Table 2 for a summary of the model terms). A significant effect of distractor is denoted by the asterisks: **p < .01

Distraction by task-irrelevant speech

To test the effect of distractor, among other factors, on the number of errors per trial, we fitted generalized linear mixed-effect models (Poisson regression; see Statistical analyses). In sum, the best-fitting model included the three main factors, distractor, talker familiarity, and listener group, as well as the interaction between the factors talker familiarity and listener group, as fixed-effect terms (see Table 2 for a summary of fixed-effect terms). The random-effect terms included the random intercepts for subjects and sentences, as well as the subject-specific random slopes for talker familiarity.

Table 2 Fixed-effect terms included in the best-fitting model

The inclusion of distractor in the best-fitting model (Z = –2.81, p = .005, b = –0.047, BF = 7.6) demonstrated the irrelevant-speech effect: Presentation of a task-irrelevant sentence within the memory retention phase was indeed more distracting than a silence period. The sentence-specific random intercepts were also included in the best-fitting model (χ²(1) = 9.75, p = .002, BF = 4.09), suggesting that the task-irrelevant sentences differed in distractibility.

Moderate but not high talker familiarity disrupts memory of target speech

The main aim of the present study was to investigate the effect of familiarity with a task-irrelevant talker on the serial recall of target digits. Figure 3 shows how the average proportion of errors evolved over digit positions in familiar- and unfamiliar-talker blocks (Fig. 3A) and the numbers of errors per trial averaged across digit positions for task-irrelevant familiar and unfamiliar talkers (Fig. 3B).

Fig. 3

Effect of talker familiarity in all listeners (a, b), as well as in students (c) and in family members and friends (d). (a) Average proportions of errors in familiar- and unfamiliar-talker blocks as a function of digit position. (b, c, d) Left: Numbers of errors per trial, averaged across digit positions for familiar- (left dot) and unfamiliar-talker (right dot) blocks. Individual listeners’ data are shown as gray lines. The across-listener averages are shown by colored dots connected by the thick black line. Right: Numbers of errors per trial predicted by the best-fitting model. Gray lines show the pairs between the random intercepts for subjects and the subject-specific random slopes for talker familiarity. Significant effects are denoted by asterisks: *p < .05, ***p < .001

The best-fitting model included the factor talker familiarity (Z = –1.98, p = .048, b = –0.0472, BF = 1.58) and the interaction between the factors talker familiarity and listener group (Z = 2.01, p = .045, BF = 0.58). Post-hoc comparisons revealed a significant effect of talker familiarity in the student group (Z = –4.07, p < .001, b = –0.0947, BF = 13.71), but not in the group of family members and friends (Z = 0.01, p = .994, b = 0.0003, BF = 0.04). This means that students (Fig. 3C) but not family members and friends (Fig. 3D) made more errors when they were familiar with the task-irrelevant talker than when they were not.

Notably, the sample size was much smaller in the group of family and friends (N = 20) than in the student group (N = 66). However, the small BF of 0.04 provides strong evidence for the absence of a familiarity effect for family and friends. It is therefore unlikely that the comparatively small sample was responsible for the lack of a familiarity effect in this group of listeners.

The best-fitting model also included the subject-specific random slopes for talker familiarity (χ²(2) = 64.38, p < .001, BF = 17.72), suggesting that the effect of talker familiarity differed across listeners. Interestingly, however, the best-fitting model did not include the interaction between the factors talker familiarity and distractor: Compared to the simpler model, the inclusion of the fixed-effect term for the talker familiarity-by-distractor interaction did not significantly improve the model fit (χ²(1) = 0.61, p = .434, BF < 0.001). Thus, the effect of talker familiarity (higher distraction in the familiar- than in the unfamiliar-talker blocks) was not modulated by the presentation of a task-irrelevant sentence within the memory retention phase. This finding suggests that the mere expectation of a (moderately) familiar talker in the memory retention phase suffices to disrupt listeners’ working memory of target speech.

Familiarity effects are talker-specific

By design, it was possible that any familiarity effect would be driven by only one of our two task-irrelevant talkers. To investigate this possibility, we carried out a control analysis in which we added the factor identity of familiar talker (J.K. vs. M.P.) and, critically, the interaction between identity of familiar talker and talker familiarity as fixed-effect terms to the best-fitting model. The inclusion of identity of familiar talker did not significantly improve the model fit, relative to the best-fitting model (χ²(1) = 2.2, p = .138, BF = 0.32), but the inclusion of the interaction term between identity of familiar talker and talker familiarity did (χ²(2) = 11.76, p = .003, BF = 5.33). Post-hoc comparisons revealed that listeners who were familiar with talker M.P. showed a detrimental effect of talker familiarity on their serial recall of target digits (Z = –3.67, p < .001, b = –0.1057, BF = 13.61), but this was not the case for listeners who were familiar with talker J.K. (Z = 0.44, p = .663, b = 0.013, BF = 0.03).

Although these results suggest that memory disruption by a familiar talker might be talker-specific, they cannot explain our main finding of stronger memory disruption by moderate (but not high) familiarity with the task-irrelevant talker. First, the factor identity of familiar talker did not modulate the talker familiarity-by-listener group interaction: The inclusion of the three-way interaction between identity of familiar talker, talker familiarity, and listener group did not significantly improve the model fit, as compared to the simpler model (χ²(2) = 1.43, p = .49, BF = 0.04). Second, the interaction between talker familiarity and listener group remained a significant predictor for the number of errors per trial even after inclusion of the identity of familiar talker-by-talker familiarity interaction (Z = 2.07, p = .038), with higher memory disruption by the familiar talker in the student group (Z = –4.20, p < .001, b = –0.0926, BF = 27.61), but not in the group of family members and friends (Z = –0.001, p = .999, b = –4.49e–5, BF = 0.04). Third, and most importantly, the identity of the familiar talker was balanced across listeners; that is, similar numbers of listeners were familiar with either of the two talkers (see Table 1 for details). Our experimental design therefore inherently controlled for any talker-specific effects on the main effect of talker familiarity, as well as on the talker familiarity-by-listener group interaction.

No effect of listener group and block order

Overall, students did not differ from family members and friends in their serial recall of digits. As compared to the simpler model, the inclusion of the factor listener group did not significantly improve the model fit (χ²(1) = 0.02, p = .888, BF = 1.11). Note, however, that the BF shows no evidence for either the absence or the presence of an effect of listener group. It is thus possible that the nonsignificant difference between the two listener groups was due to the rather small sample of family members and friends.

To investigate potential effects of the presentation order of familiar and unfamiliar talkers (Fig. 1A, “Test”), we carried out a second control analysis in which we added the factor block order (familiar first vs. unfamiliar first) as a fixed-effect term to the best-fitting model. As compared to the respective simpler model, neither the main effect of block order (χ²(1) = 0.04, p = .843, BF = 0.06) nor the interaction term between the factors block order and talker familiarity (χ²(4) = 5.69, p = .223, BF = 0.05) significantly improved the model fit. Thus, neither the overall performance nor the effect of talker familiarity depended on whether listeners started the experiment with a familiar- or an unfamiliar-talker block.

Discussion

In the present study, we used a variant of the irrelevant-speech task to investigate the effect of familiarity with a distracting, task-irrelevant talker on the serial recall of target speech. The main finding was that listeners made more recall errors in blocks of trials with a familiar compared to an unfamiliar distracting talker. Critically, this effect depends on the degree to which listeners are familiar with this task-irrelevant voice: Only listeners with moderate (i.e., students) but not high (i.e., family and friends) familiarity showed stronger working-memory disruption by talker familiarity. Interestingly, however, the effect of talker familiarity does not depend on the presence of task-irrelevant speech during memory retention: students experienced stronger working-memory disruption irrespective of whether they heard a task-irrelevant sentence produced by the familiar talker (in most of the trials) or merely expected it (in a small subset of trials).

Familiarity with a distracting talker improves comprehension but disrupts working memory of target speech

Two previous studies had investigated the distraction by familiar talkers’ speech. One study (Newman & Evers, 2007) showed no benefit from moderate familiarity with a distracting background talker (i.e., the listeners’ university professor) when listeners had to shadow target speech. Another study (Johnsrude et al., 2013), however, showed that listeners’ comprehension of target speech does benefit from high familiarity with a distracting talker (i.e., the listener’s spouse)—a finding that has recently been extended to familiarity with accented speech (Senior & Babel, 2018). Together, these studies suggest that familiarity with a distracting talker can be beneficial, but that a high degree of familiarity with the talker is necessary for this benefit to occur. The results of the present study, by contrast, suggest that familiarity with a distracting talker is not beneficial but rather harmful, and that moderate instead of high talker familiarity is necessary for this familiarity effect to occur.

Notably, the present study used an irrelevant-speech task to investigate the distraction by familiar and unfamiliar task-irrelevant talkers. To succeed in the irrelevant-speech task, selective attention to items in working memory is needed (for a review on the interaction of attention and working memory, see Awh, Vogel, & Oh, 2006). The previous studies, by contrast, had used concurrent-speech tasks. For example, Johnsrude et al. (2013) presented listeners with two concurrent sentences (from the coordinate response measure corpus; Bolia, Nelson, Ericson, & Simpson, 2000) of the form “Ready [call sign], go to [color] [number] now” (e.g., “Ready Baron, go to green six now”) and asked them to report the color and number word from the target sentence. This task creates much lower working-memory load than the irrelevant-speech task used in the present study in which listeners had to keep the serial order of nine digits in memory while ignoring a task-irrelevant sentence. Our results therefore suggest that, although talker familiarity reduces the distraction from irrelevant concurrent speech, it increases the disruption of working memory for target speech.

Perceptual filtering versus attentional capture of familiar distractors

In the literature, two opposing effects of familiarity with a distracting stimulus have been described. One line of research suggests that advance knowledge about the distractor can enable listeners to form an efficient perceptual filter (e.g., Röer, Bell, & Buchner, 2015). To suppress distraction by familiar stimuli, the filter can be applied even before the distractor appears (Noonan et al., 2016; Ruff & Driver, 2006). However, such proactive suppression of the distractor has been found to produce behavioral costs when the distractor is expected but not presented (Marini et al., 2013). Our results clearly speak against proactive filtering in the case of familiar voices because we observed stronger working-memory disruption by a (moderately) familiar versus unfamiliar distracting talker and no modulation of this effect by whether or not a distracting sentence was presented.

Another line of research suggests that familiarity with a distracting stimulus is not beneficial, but rather costly for working memory (e.g., Ellermeier et al., 2015; Röer et al., 2013). This is because familiar distractors can automatically capture attentional resources and draw them away from items in working memory (Cowan, 1998). Although the attentional capture theory predicts stronger working-memory disruption by familiar than by unfamiliar talkers, it cannot explain why we only observed a familiarity effect in the student group. If anything, the familiar talker should have been more salient for family members and friends, who should therefore have been more susceptible to attentional capture by the familiar talker’s task-irrelevant speech (but see Gaspelin & Luck, 2018, for a recent review on the suppression of salient stimuli). Furthermore, the attentional capture theory is difficult to reconcile with our finding that the mere expectation of a moderately familiar distractor enhances memory disruption.

Uncertainty about vocal identity causes working-memory disruption

Our results, rather, speak to more general differences in how familiar and unfamiliar talkers are processed (for a recent review, see Maguinness, Roswandowitz, & von Kriegstein, 2018). We argue that the disparity of findings both within and across studies can be explained by a model that takes these differences into account, in particular with regard to how familiarity shapes the representation of vocal identity.

Figure 4 summarizes and illustrates these differences in the representation of familiar and unfamiliar voices and how these differences may relate to working-memory disruption and distraction by concurrent speech. Consistent with recent advances in voice-identity research (Lavan, Burton, Scott, & McGettigan, 2019), we argue that a high degree of talker familiarity is needed in order to arrive at a stable representation of vocal identity. Moderate talker familiarity, however, creates uncertainty about vocal identity that, in turn, causes disruption of working memory and distraction by concurrent speech.

Fig. 4

A model of the relationship between talker familiarity and voice-identity representation. Critically, the model assumes insufficient voice-identity representation for moderately familiar talkers, causing uncertainty about their vocal identity. This model is capable of explaining the seemingly disparate effects that the degree of familiarity with a to-be-ignored talker has on disruption of working memory (present study) and on distraction by concurrent speech (Johnsrude et al., 2013; Newman & Evers, 2007). See the text for details.

Both the present and previous findings (Johnsrude et al., 2013; Newman & Evers, 2007) have shown that the effect of talker familiarity does not follow a simple dichotomy of familiar versus unfamiliar voices, but rather depends on the degree of familiarity. We argue that students experienced stronger working-memory disruption by the familiar talker not only because of their limited amount of exposure to the talker’s voice but, critically, because that voice exposure was limited to a very specific context (i.e., classroom teaching). Family members and friends, by contrast, have heard the familiar talker’s voice in a much wider range of contexts. Recent work (Lavan, Burston, & Garrido, 2018; Lavan et al., 2019) suggests that it is this variance in previous voice encounters that enables listeners to form a stable percept of person identity.

Of note, talker familiarity was not made explicit in the present study. That is, listeners were not told that they would hear a familiar voice. It is reasonable to assume that family members and friends nevertheless recognized the familiar talker, whereas the students remained uncertain about the familiar talker’s identity. Resolving uncertainty about a distractor has been shown to be cognitively effortful (e.g., Geyer, Müller, & Krummenacher, 2006; Kerzel & Barras, 2016). In the case of unfamiliar distracting talkers, uncertainty was likely minimal, since listeners had little to no expectation about their vocal identity. The difference in vocal uncertainty can thus explain why students experienced stronger working-memory disruption by a moderately familiar than by an unfamiliar distracting talker. Critically, this explanation also holds for our finding that the mere expectation of a moderately familiar distractor can cause working-memory disruption.

In the case of concurrent speech, listeners are similarly distracted by an unfamiliar and by a moderately familiar talker (Newman & Evers, 2007). Possibly, this is because listeners need more, and more variable, experience with a talker’s voice to arrive at a representation of vocal identity that is sufficiently reliable to alleviate distraction by irrelevant concurrent speech. This explanation is consistent with a recent extension (Maguinness et al., 2018) of the prototype model of voice-identity processing (Lavner, Rosenhouse, & Gath, 2001). While listeners can recognize familiar talkers on the basis of stored reference patterns of their vocal identities, such reference patterns have yet to be established for unfamiliar talkers and may not suffice for robust identity recognition of moderately familiar talkers. Hence, for both unfamiliar and moderately familiar talkers, additional voice exposure is required.

Several studies have shown a link between voice-identity recognition and speech comprehension (Levi et al., 2011; Magnuson, Yamada, & Nusbaum, 1995; Nygaard & Pisoni, 1998; Nygaard, Sommers, & Pisoni, 1994; but see Holmes et al., 2018): Listeners are better at comprehending target speech when they have previously learned to identify the target talker by voice. It is likely that voice-identity information also helps listeners attenuate distraction by irrelevant concurrent speech. The findings by Newman and Evers (2007), however, suggest that it is not sufficient to make the identity of the distracting talker explicit. Rather, it seems that extensive experience with a talker’s voice is needed to overcome immature representation of vocal identity and to benefit from familiarity with a distracting talker.

Conclusions

Here we demonstrated that moderate, but not high, familiarity with a distracting talker disrupts working memory for target speech. We propose a model that can explain both the present and previous findings on distraction by familiar talkers by taking into account how voice familiarity shapes the representation of vocal identity.