Speech-sound representations are dynamic entities. Recent adult training studies of foreign-accented speech, foreign speech contrasts, and artificially degraded speech suggest that newly learned acoustic-phonetic representations undergo qualitative changes even within the first 24 hours of exposure (Earle & Myers, 2015a; Fenn et al., 2003; Qin & Zhang, 2019; Xie et al., 2018). The nature of these changes is not uniform, however, and appears subject to differences in the perceptual goals of the task (see Earle & Myers, 2014, for review). For example, perceptual shifts made to accommodate talker-specific variability stabilize immediately (Eisner & McQueen, 2006), whereas evidence of generalization of phonetic knowledge emerges following posttraining sleep (Earle & Myers, 2015a; Xie et al., 2018). In short, what underlies these short-term changes, and how these processes inform the nature of long-term representations, is unclear.

A potentially useful framework for understanding these changes is to consider the role of general learning and memory mechanisms in the representation of speech sounds. The idea that speech and language learning is supported by domain-general learning and memory networks is growing in popularity (Chandrasekaran et al., 2016; Ullman et al., 2020). One such model, the dual-systems framework, argues that explicit and implicit learning systems engage in a division of labor in building speech-sound category representations (Chandrasekaran et al., 2016). In this context, explicit memory refers to learning of associations that can be verbally described (“reflective”), and primarily recruits medial temporal lobe structures (Nomura et al., 2007). Implicit memory refers to slow, incremental learning that occurs unconsciously (“reflexive”), and is associated with the striatum (Nomura et al., 2007). Empirical support for the dual-systems framework is growing. Quam et al. (2018) observed that the learning of novel speech-sound categories, which requires the integration of acoustic-phonetic cues from both a nonnative pitch dimension and a native vowel dimension, relies initially on reflective systems but shifts to reflexive strategies at later stages of learning. Thus, the building of speech-sound representations appears to recruit different memory systems at different stages of learning.

Elaborating on this idea, under the dual-systems framework the retrieval of speech-sound knowledge would be subject to the time course(s) of memory processing that apply to explicit and implicit information after perceptual training has taken place. Specifically, memory consolidation, a broad term for the different stages of memory processing that occur following a learning event (see Dudai, 2012, for review), may affect explicit versus implicit speech-sound knowledge differently. To illustrate, the qualitative changes observed in the acoustic-phonetic memory trace, that is, the abstraction and generalization of acoustic-phonetic information following a period of posttraining sleep (Earle & Myers, 2015a; Fenn et al., 2003; Xie et al., 2018), are consistent with offline, “systems” consolidation of explicit memory (e.g., the complementary learning systems account; McClelland et al., 1995; see Diekelmann & Born, 2010, for review). By comparison, the changes to performance that reflect online perceptual adjustments to accommodate talker idiosyncrasy have been observed to stabilize as a function of time, with or without a period of sleep (Eisner & McQueen, 2006). This observation is consistent with localized, “synaptic” consolidation of implicit memory (see Janacsek & Nemeth, 2012). In other words, acoustic-phonetic information encoded through implicit and explicit memory systems may undergo memory consolidation over different time scales. By extension, the influence of implicitly versus explicitly encoded information on speech-perceptual performance may differ according to when posttraining tasks are performed.

A potential avenue for disentangling the relative contributions of implicit and explicit information to perceptual performance across time is to obtain performance measures of reflexive/implicit and reflective/explicit learning over time that are independent of the target speech information. For example, statistical learning has been observed to recruit the same neural circuitry proposed for reflexive/implicit learning (Chandrasekaran et al., 2016; Turk-Browne et al., 2009). In contrast, recognition memory tasks have been observed to recruit neural structures implicated in reflective/explicit learning (Brown & Aggleton, 2001; Chandrasekaran et al., 2016). Thus, a relationship between performance on these tasks and speech-sound learning may point to overlap in memory abilities shared across domains.

Examining the memory processes that support the learning and retention of speech-sound information in adulthood necessitates a means of tracking acoustic-phonetic knowledge that can be separated from speech-sound exposure through the habitual use of language. A popular means of achieving this is to train learners to attend to a set of acoustic-phonetic features that do not occur in their linguistic environment. For example, the Hindi contrast between dental (/d̪/) and retroflex (/ɖ/) stops provides a perceptually difficult speech-sound contrast for monolingual English speakers, as both are perceived as allophonic variants of the alveolar (/d/) stop in English (Best, 1994; see Strange, 1995, for review). The use of a nonnative consonant contrast to track speech-sound learning carries the caveat that this type of learning differs from the acquisition of native phonological categories, both in terms of the maturation of the learner and the amount of preexisting phonological knowledge in place (Flege, 1993; Francis & Nusbaum, 2002; Iverson & Kuhl, 1995). Thus, in discussing the potential roles of implicit and explicit memory in speech-sound learning, it is important to acknowledge that the strategies for performing nonnative versus native speech-perception tasks may differ. However, from a learning and memory perspective, the neural systems involved in encoding and representing phonetic features may not be wholly dissimilar across native and nonnative sounds (Chandrasekaran et al., 2014a, b; Earle & Myers, 2014; Ullman et al., 2020). There is some support for the suggestion that similar neural mechanisms underlie the building of both native and nonnative speech representations (Díaz et al., 2008; Qi, Han, et al., 2019b; cf. Fuhrmeister & Myers, 2017); however, that evidence is limited.

Thus, this investigation had two aims. First, it aimed to track how the retrieval of implicitly versus explicitly learned information related to individuals’ ability to perceive trained nonnative speech sounds, shortly after learning versus after a 12-hour (overnight) posttraining interval. This objective was addressed by assessing posttraining performance at two time points following a nonnative speech-sound training task, a statistical (implicit) learning task, and a recognition memory (explicit learning) task. Importantly, the two nonspeech learning tasks were implemented in the visual domain, in order to capture the variance in speech explained by domain-general learning mechanisms. We hypothesized that implicit learning abilities would be predictive of posttraining speech-perceptual performance shortly after learning, consistent with prior observations that speech sounds are better learned under a reflexive strategy (Chandrasekaran et al., 2014a, b). However, we reasoned that the retrieval of the acoustic-phonetic trace following sleep (Earle & Myers, 2015b) would be heavily influenced by the retrieval of consolidated explicit memory, and we therefore hypothesized that the association between perception and memory would shift to explicit memory performance on Day 2.

Our second aim was to determine if nonnative speech-sound learning was related to the perception of native phonological categories. To address this objective, we included an experimental measure of categorical perception on a native vowel (/a/–/e/) continuum in our study protocol. We hypothesized that if performance on perceptual tasks is driven by static similarities in acoustic-phonetic processing, there would be an association between performance on native and nonnative speech-perception tasks regardless of when we assessed performance on the nonnative contrast. On the other hand, if a relationship between native and nonnative speech perception were to emerge only following a posttraining, overnight delay, this may indicate that native and nonnative speech share similar retrieval mechanisms for consolidated acoustic-phonetic information in the service of perceptual tasks.

Methods

Participants

Twenty-one adults (18 females, 3 males; mean age = 21.51 years, SD = 2.05) were recruited for the present study from those already participating in a larger speech-sound learning study at the University of Delaware (UD). Participants were monolingual speakers of American English, with no history of hearing, neurological, socio-emotional, or attentional disorders. All participants provided informed consent according to UD Institutional Review Board guidelines, and were compensated at a rate of $10/hour in gift cards for their time.

Study overview

All participants completed an initial 1:1 session with a trained experimenter to assess their language abilities, and were administered a pure-tone hearing screening (500 Hz, 1 kHz, 2 kHz, 4 kHz, 6 kHz) at 25 dB to confirm hearing within normal limits. During this session, we administered the Digit Span Forward, Digit Span Backward, and Digit Span Sequencing subtests of the Wechsler Adult Intelligence Scale (WAIS-IV; Wechsler, 2008) to obtain measures of verbal working memory. Because working memory has been associated with the integration of speech-perceptual features (Quam et al., 2018), we used this metric as a covariate in our analyses. During this test session, we also administered the native categorical perception task (see description below, under Experimental Task Procedures). Raw score sheets were scored by two trained research students for accuracy, and discrepancies in scoring were flagged and resolved by the first author. The other measures obtained during this test session are not relevant to the present study and are therefore not described in this manuscript.

Participants then completed four additional sessions, organized as two two-session sequences. Within each sequence, sessions were separated by 12 hours, scheduled in the evening (8 p.m.) and again the next morning (8 a.m.). Participants completed the first of these two-session sequences in our laboratory. The evening session included the recognition memory encoding and assessment tasks, followed by the nonnative speech contrast learning and assessment tasks. In the morning session, participants completed reassessments of the recognition memory and nonnative speech contrast training tasks. The second two-session sequence was completed at home, on the participants’ personal computers. During that evening session, participants first completed the visual statistical learning and assessment tasks. In the morning, participants completed the reassessment of the visual statistical learning task. The scheduling of learning tasks at 8 p.m. and reassessment at 8 a.m. served to heighten the likelihood of offline consolidation taking place between sessions, and to limit the potential for exposure to interfering information between the two sessions (see Earle & Myers, 2015b). This schedule also allowed us to control for potential circadian variability in performance across tasks.

All stimulus presentation and response recording conducted in the laboratory were controlled by E-Prime 3.0 software (Psychology Software Tools, Inc.) on PC laptops, and auditory stimuli were presented through ATH-M50x circum-aural headphones (Audio-Technica, Inc.) at 70 dB. The visual statistical learning task completed at home was web-based and programmed using jsPsych (de Leeuw, 2015; Qi et al., 2020; Qi, Araujo, et al., 2019a).

Experimental task procedures

Native categorical perception task

The quality of native (English) speech sound representations was indexed by task performance on a categorical perception experiment. Because the disambiguation of our nonnative contrast requires an analysis of spectral (rather than temporal [e.g., voice onset time]) features, we assessed category function for a native speech-sound contrast that is likewise disambiguated through frequency information. The stimuli used were a continuum (/e/–/a/) of seven tokens (180-ms duration) synthesized using Praat software (Boersma, 2002). The first two formants were centered at 650 Hz and 1950 Hz in the first token, and graded in equal steps to 900 Hz and 1750 Hz, respectively.
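For concreteness, the token-by-token formant values implied by these endpoints are simply the arithmetic consequence of equal spacing, and can be reproduced as follows (a minimal sketch in R; the stimuli themselves were synthesized in Praat):

```r
# Seven equally spaced tokens: F1 rises from 650 to 900 Hz,
# F2 falls from 1950 to 1750 Hz.
f1 <- seq(650, 900, length.out = 7)    # steps of ~41.7 Hz
f2 <- seq(1950, 1750, length.out = 7)  # steps of ~-33.3 Hz
round(cbind(token = 1:7, F1 = f1, F2 = f2), 1)
```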

Participants completed an identification task and a discrimination task. In the identification task, participants were instructed to indicate the sound that they heard by clicking left or right on a computer mouse. “/a/” and “/e/” remained visible on the sides of the screen throughout the task as a reminder of the response options and corresponding buttons. Participants completed 56 trials (eight trials/token, presented randomly). In the discrimination task, participants were instructed to indicate whether the sounds that they heard were the same or different. The prompts “same” and “different” remained on either side of the screen for the duration of the task. At the start of each trial, two tokens were presented in succession, 800 ms apart. Participants completed 76 trials (28 “same” [4 trials/token], 48 “different” [8 trials/each of five two-step token pairs (1–3, 2–4, 3–5, 4–6, 5–7), and an endpoint pair (1–7) as a check for task compliance]).

Given that perceptual overlap between vowel categories is subject to listener age, regional differences, and diachronic shifts (see Hillenbrand et al.’s, 1995, replication of Peterson & Barney, 1952), we reasoned that a hypothesized “ideal” category function may not be necessarily representative of the language experience of our sample. Therefore, we opted to derive the “expected” category functions based on the consensus reached by our sample: We calculated a mean proportion of /e/ responses at each continuum step for the identification task, and mean accuracy at each continuum step for the discrimination task. The degree of divergence from the sample mean was calculated for each participant at each continuum step, and these differences were summed to obtain a “goodness of fit” index of individual category functions by task. Thus, according to this index, greater scores indicated a greater divergence from the expected category function, with scores approximating zero indicating conventional perceptual performance. We visually inspected each participant’s category function to ensure that no participant’s deviation score resulted from having a category function that was more categorical than the sample means.
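The deviation index can be illustrated with a short sketch in R. The matrix names are hypothetical, and we assume absolute deviations from the sample mean (signed deviations would cancel; squared deviations would follow the same logic):

```r
# id_resp:  participants x steps matrix of proportion-/e/ responses (identification)
# disc_acc: participants x steps matrix of accuracy (discrimination)
deviation_index <- function(scores) {
  expected <- colMeans(scores)              # sample-consensus ("expected") function
  rowSums(abs(sweep(scores, 2, expected)))  # summed divergence per participant
}
id_dev   <- deviation_index(id_resp)   # larger = farther from the sample consensus
disc_dev <- deviation_index(disc_acc)
```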

Nonnative speech-sound learning

In order to track the learning and retention of speech sounds separable from habitual linguistic exposure, participants were trained to perceive a nonnative speech-sound contrast (the /d̪/–/ɖ/ contrast in Hindi), and were assessed for changes in their ability to identify and discriminate these sounds over time. The stimuli were a closed token set of five productions each of /d̪ug/ and /ɖug/, produced by a native speaker of Hindi and digitally recorded in a sound-treated audiology booth. Tokens were rescaled to a mean amplitude of 70 dB and cut to the onset of the burst (thus, prevoicing information was removed), so that the timing of the stop burst relative to trial onset was consistent across trials.

There were two tasks in the experiment: identification and discrimination. During the identification task, participants were asked to match the sound that they heard with a novel visual object (“Fribbles”; Tarr, n.d.). The discrimination task followed an AX design, wherein two tokens were presented at an interstimulus interval of 800 ms, and participants were instructed to determine whether the two sounds they heard were the same or different.

On Day 1 (first session in the second phase of the study), participants were assessed in their baseline discrimination ability (64 trials), completed 200 trials of identification training with feedback after every trial, and then completed posttraining identification (50 trials without feedback) and discrimination tests (64 trials). On Day 2 (second session in the second phase of the study), participants completed the posttraining identification and discrimination tests again. Because the posttraining identification and discrimination scores were highly collinear, we averaged across the scores to arrive at a single nonnative speech learning score per day per participant for our main analyses.

Recognition memory task assessment of explicit learning

A recognition memory task was used to measure explicit memory (Hedenius et al., 2013). Visual stimuli consisted of 128 black-and-white drawings: 64 of real objects and 64 of made-up objects. The drawings were developed from images obtained through free websites, purchased collections, and prior publications (Cornelissen et al., 2004; Eals & Silverman, 1994; Snodgrass & Vanderwart, 1980; Williams & Tarr, 1997). Real items across stimulus sets were matched for word frequency and for number of syllables and phonemes.

In this experiment, participants completed the Encoding and Recognition tasks on Day 1 and the Recognition task again on Day 2. During the Encoding task, participants were instructed to place an index finger from each hand on marked keys on either side of the keyboard (“s” and “l,” marked with stickers). They were asked to determine whether the images presented on the screen were of real or made-up objects, by pressing the key to the left or right as quickly and as accurately as possible. Participants completed three practice trials, followed by 64 trials (32 real/32 made-up) of images presented in pseudorandom order. During each trial, participants were presented with the image for exactly 500 ms regardless of when they indicated a response, in order to control the duration of exposure to the visual stimuli. The Recognition task was administered approximately 10 minutes after the Encoding task (Day 1), and again 12 hours later (Day 2). During the Recognition task, participants were again shown a series of images, and were asked to indicate if they had seen the image before during the Encoding task. Following six practice trials, participants completed 128 trials of this task (64 objects presented during encoding [seen before], 64 new) presented in pseudorandom order. Participants’ d′ scores in the Recognition task were computed for each day as indices of object recognition ability.
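For reference, d′ is the difference between the z-transformed hit and false-alarm rates (Macmillan & Creelman, 2004). A minimal sketch in R follows; the smoothing used to avoid infinite z scores at rates of 0 or 1 is an assumption, as the correction actually applied is not reported here:

```r
# d' = z(hit rate) - z(false-alarm rate)
dprime <- function(hits, misses, fas, crs) {
  h <- (hits + 0.5) / (hits + misses + 1)  # assumed smoothing for extreme rates
  f <- (fas  + 0.5) / (fas  + crs    + 1)
  qnorm(h) - qnorm(f)
}
dprime(hits = 50, misses = 14, fas = 10, crs = 54)  # e.g., 64 old and 64 new trials
```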

Visual statistical learning assessment of implicit learning

The materials and the procedure of the visual statistical learning task were identical to those in Qi, Han, et al. (2019b). The visual stimuli consisted of twelve unique cartoon images organized into four triplets. Each image was presented at the center of the screen for 800 ms with a 200-ms interval before the onset of the next image. During the training phase, a total of 288 images, forming a concatenation of 24 repetitions of each triplet, were presented one at a time continuously. Sixteen pairs of target and foil triplets were constructed for the two-alternative forced-choice task of the test phase. Each target triplet seen during the training phase was paired with one of four foil triplets, in which neighboring images never co-occurred in a triplet during training but the relative position of each image within a triplet was preserved (e.g., if ABC, DEF, GHI, and JKL are the target triplets, then AEI, DHL, GKC, and JBF would be the four foils). Each pair was presented twice, so that the target and the foil each appeared once as the first option in the pair, resulting in 32 test trials.
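The foil construction can be made concrete with a short sketch in R, using the example triplets from the text. Each foil takes its first, second, and third images from three different target triplets, so neighboring images never co-occurred in training while each image keeps its within-triplet position:

```r
targets <- list(c("A","B","C"), c("D","E","F"), c("G","H","I"), c("J","K","L"))
# Foil i takes position 1 from target i, position 2 from target i+1,
# and position 3 from target i+2 (wrapping around the four targets).
foils <- lapply(seq_along(targets), function(i) {
  idx <- ((i - 1 + 0:2) %% length(targets)) + 1
  mapply(function(t, pos) targets[[t]][pos], idx, 1:3)
})
# foils: AEI, DHL, GKC, JBF -- matching the example above
```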

On Day 1, during the training phase, participants were told to track a particular cartoon character in the continuous stream of visual presentation by pressing the space bar, so that they remained unaware of the embedded structure of the sequence and of the learning nature of the task. During the surprise test phase immediately after training, participants were asked to choose which of two groups of characters was more likely to go together. On Day 2, participants completed the same test phase, approximately 12 hours after the first test phase, but with a different trial order. Participants’ accuracy during the two test phases was computed as the index of implicit visual learning and retention.

Results

We present data in this manuscript from all participants who were enrolled and participated in at least one session of all three learning tasks. There was some data loss inherent to the five-session nature of the study (one data point missing on Day 2 nonnative speech learning/object recognition due to cancellation/inclement weather; two data points missing on the categorical perception task due to equipment failure; four data points missing on the statistical learning task due to user error). Missing cases were estimated during analysis using Multivariate Imputation by Chained Equations (package mice; van Buuren et al., 2015) in R (R Development Core Team; Version 3.4.1, 2017). For a closer inspection of our data set, raw performance scores are available through a GitHub repository (https://github.com/fsearle/SLNCT.git).
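A minimal sketch of the imputation step in R follows, assuming a data frame dat with one row per participant and one column per task score (column names hypothetical; the number of imputations and imputation method shown are package defaults, not values reported here):

```r
library(mice)
set.seed(1)                                   # for reproducibility
imp <- mice(dat, m = 5, method = "pmm", printFlag = FALSE)  # chained equations
dat_complete <- complete(imp, 1)              # one completed data set for analysis
```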

Descriptive analyses and results by task

Verbal working memory task

The three Digit Span subtests (WAIS-IV; Wechsler, 2008) were combined into a composite verbal working memory score (M = 29.42, SD = 3.26). This measure was entered into our omnibus statistical models (see below) to control for the potential variance shared between speech perception and memory tasks attributable to working memory demands.

Native categorical perception task

In order to validate our version of the native speech categorical perception task, we conducted a repeated-measures analysis of variance (ANOVA) on the percentage correct on the “different” trials of the discrimination task, with continuum step as the within-subjects factor. Continuum step significantly accounted for within-subjects variance in performance, F(4, 68) = 27.26, p < .001, η2 = .62, driven by larger proportions of correct responses across Steps 3 to 5 on the 7-step continuum (see Fig. 1a). We conducted a second repeated-measures ANOVA on identification performance, with proportions of trials identified as /e/ as the outcome variable and continuum step as the within-subjects factor. Continuum step significantly accounted for within-subjects variance, F(6, 102) = 104.9, p < .001, η2 = .86, driven by larger proportions of /e/ responses at earlier steps in the 7-step continuum (see Fig. 1b). These findings replicate the classical categorical perception experiment (Liberman, 1970), confirming that our measure of native speech perception is consistent with the prior literature. We found performance on the identification and discrimination tasks to be highly collinear (Pearson’s R = .86), and we therefore averaged across the scores to arrive at a single native categorical perception score per participant for our main analyses.
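For transparency, the form of these repeated-measures ANOVAs can be sketched in base R, assuming long-format data with one row per participant per continuum step (object and column names hypothetical):

```r
# disc_long: columns subject (factor), step (factor), acc (percent correct)
fit <- aov(acc ~ step + Error(subject/step), data = disc_long)
summary(fit)  # within-subjects F test for the effect of continuum step
```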

Fig. 1
figure 1

Discrimination and Identification functions replicate classical speech perception experiments (e.g., Liberman, 1970). Error bars denote standard errors of the mean

Speech-sound training

Proportions of trials answered correctly in the posttraining identification and discrimination tasks were transformed to d′ scores (Macmillan & Creelman, 2004) separately. To determine if performance changed over time, we conducted a repeated-measures ANOVA on the discrimination d′ scores, with three levels of Time (baseline, Day 1 posttraining, Day 2 posttraining) as the within-subjects factor. Time significantly accounted for within-subjects variance in perceptual performance, F(2, 38) = 25.29, p < .001, η2 = .57, driven by gains made from baseline on Day 1, and again on Day 2 (see Fig. 2a). To determine if there were overnight changes to performance on the identification task, a two-tailed, paired-samples t test was performed on the identification d′ scores. Performance improved between Day 1 and Day 2, t(19) = −4.40, p < .001, Cohen’s d = .962. These results replicate earlier findings (Earle et al., 2018; Earle & Myers, 2015a, 2015b) that a posttraining, overnight between-session interval results in improved perception of the nonnative contrast in the absence of further training (see Fig. 2b). We found the posttraining perceptual scores across tasks to be highly collinear (Day 1: Pearson’s R = .75; Day 2: Pearson’s R = .84), and we therefore averaged across the scores to arrive at a single nonnative speech-perception score per day per participant for our main analyses.

Fig. 2
figure 2

On average, participants improve in the perception of a trained nonnative contrast in the absence of further training between Days 1 and 2, in a replication of prior work by Earle and Myers (2015b). Error bars denote standard errors of the mean

Visual statistical learning

Participants performed significantly above chance on both days (Day 1: M = 74.67%, SD = 21.83%; Day 2: M = 75.39%, SD = 22.01%; ps < .001). A two-tailed, paired-samples t test conducted on the percent accuracy on the visual statistical learning task found no significant difference in performance across days, t(14) = 1.06, p = .306, Cohen’s d = .274. This indicates that the overnight between-session interval did not enhance performance on the statistical learning task, consistent with prior reports of relative stability in implicit learning (Kalra et al., 2019; see Fig. 3a).

Fig. 3
figure 3

On average, participants do not change their performance in an implicit, visual statistical learning task between Days 1 and 2. On average, participants improve in object recognition in the absence of further exposure to items between Days 1 and 2 in the object recognition task. Error bars denote standard errors of the mean

Recognition memory

Proportions of correct responses in the recognition memory task were transformed to d′ scores (Macmillan & Creelman, 2004) for real and made-up objects separately for each day. To determine if there were changes in performance on this task across days, we conducted a repeated-measures ANOVA on recognition performance, with two levels of Time (Day 1 recognition, Day 2 recognition) and two levels of object type (real or made-up) as within-subjects factors. This revealed a significant main effect of time, F(1, 19) = 17.55, p < .001, η2 = .48, driven by better performance on Day 2 than on Day 1, and a significant main effect of object type, F(1, 19) = 91.26, p < .001, η2 = .83, driven by better performance on real than on made-up objects, but no interaction between time and object type (see Fig. 3b). This suggested to us that while recognition memory performance was better overall for real objects, the pattern of change in recognition performance over time was similar between real and made-up objects. Moreover, high collinearity between real and made-up objects (Day 1: Pearson’s R = .75; Day 2: Pearson’s R = .84) supported the decision to average d′ scores across real and made-up items to arrive at a single recognition memory score per day per participant.

To summarize our descriptive findings, our categorical perception task replicated the classic pattern, lending support to the interpretation of individual differences in performance on this task as a measure of the representational quality of native speech sounds. In the learning and memory tasks, participants made overnight gains in both the speech-sound learning and the recognition memory tasks in the absence of further training. Performance on the visual statistical learning task did not change overnight, suggesting that performance on this task did not benefit from an offline consolidation period. These findings are consistent with prior work suggesting that implicit memory does not benefit from offline gains (Janacsek & Nemeth, 2012), whereas sleep is thought to enhance information encoded by explicit memory (Diekelmann & Born, 2010).

Relationships between nonnative speech-sound learning and immediate memory performance

The first aim of this study was to determine how domain-general learning abilities contribute to speech-sound learning. First, the recognition memory scores, baseline discrimination scores, and verbal working memory scores were rescaled according to the proximity-to-maximum scaling method (Little, 2013; see Moeller, 2015, regarding its application to repeated-measures data), to be on a scale commensurate with the statistical learning scores. In order to test the associations between nonnative speech-sound learning and memory tasks across days, we employed a linear mixed-effects modeling approach (VanLeeuwen et al., 1996). Models were fitted using the lme4 package (Bates et al., 2015) in R (Version 3.3.1; R Development Core Team, 2016). Marginal and conditional R2 values for mixed-effects models were calculated using the MuMIn package (Burnham & Anderson, 2002), and effect sizes for individual predictors were calculated using the r2glmm package (Jaeger, 2017), in R.
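A minimal sketch of the proximity-to-maximum rescaling in R follows. We assume the observed minimum and maximum as the scale endpoints (variable names hypothetical; theoretical endpoints could be substituted where a scale's possible range is known):

```r
# Re-express each score as its proportion of the scale's range (0-1).
rescale_pom <- function(x, lo = min(x, na.rm = TRUE), hi = max(x, na.rm = TRUE)) {
  (x - lo) / (hi - lo)
}
dat$explicit_s <- rescale_pom(dat$recognition_dprime)
dat$baseline_s <- rescale_pom(dat$baseline_disc)
dat$vwm_s      <- rescale_pom(dat$digit_span_composite)
```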

In order to determine whether memory abilities assessed immediately after learning predicted speech-perceptual performance on the nonnative contrast, a mixed-effects model was fitted with the speech-perceptual scores as the outcome measure; a three-way interaction between Day 1 implicit memory, Day 1 explicit memory, and Time (Day 1 vs. Day 2), along with their subsuming two-way interactions and main effects, entered as factors; and by-participant intercepts entered as a random factor. One-zero coding was used for Time, referenced to Day 1. To test for effects of memory while controlling for differences in pretraining perceptual acuity and verbal working memory, baseline discrimination scores and verbal working memory scores were entered as covariates.
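In lme4 syntax, this omnibus model takes roughly the following form (a sketch with hypothetical variable names; time is 0/1 coded with Day 1 as the reference, and the model is fitted by maximum likelihood so that nested models can be compared by likelihood-ratio test during backward fitting):

```r
library(lme4)
library(lmtest)
m_full <- lmer(
  speech ~ implicit_d1 * explicit_d1 * time  # 3-way, subsumed 2-ways, main effects
         + baseline_disc + vwm               # covariates
         + (1 | participant),                # by-participant random intercepts
  data = dat_long, REML = FALSE
)
# Backward fitting: drop a term and compare fits, e.g., for the 3-way interaction.
m_reduced <- update(m_full, . ~ . - implicit_d1:explicit_d1:time)
lrtest(m_reduced, m_full)
```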

The model of best fit was determined using a likelihood ratio test in a backward-fitting procedure (lmtest package; Zeileis & Hothorn, 2002). The accepted model (AIC = 71.7, BIC = 90.8, LogLikelihood = −24.8, r2m = .697, r2c = .932) contained a significant three-way interaction between time, explicit, and implicit memory (β = 24.99, SE = 8.21, p = .006, partial R2 = .052), a two-way interaction between time and implicit memory (β = −16.25, SE = 5.26, p = .006, partial R2 = .054), a two-way interaction between time and explicit memory (β = −17.90, SE = 6.50, p = .012, partial R2 = .044), a main effect of Time (β = 12.07, SE = 4.13, p = .008, partial R2 = .048), and baseline discrimination as a significant covariate (β = 9.22, SE = 1.90, p < .001, partial R2 = .523). The two-way interaction between implicit memory and explicit memory, and main effects of explicit and implicit memory, were not statistically significant.

In order to examine the source of the three-way and two-way interactions involving time, we conducted follow-up regression analyses on the speech-perceptual performance on the nonnative contrast for each day separately, with a two-way interaction between implicit and explicit memory as a factor, and baseline discrimination as a covariate. We chose the model of best fit using the same backward-fitting procedure described above. For Day 1, the accepted model, F(2, 18) = 18.98, p < .001, R2adj = .64, retained Day 1 implicit memory as the only significant predictor (β = 1.99, SE = .64, p = .004, partial R2 = .378), with baseline discrimination ability as a significant covariate (β = .86, SE = .21, p = .001, partial R2 = .492). The accepted model for Day 2, F(2, 18) = 15.54, p < .001, R2adj = .59, likewise retained only Day 1 implicit memory as a predictor and baseline discrimination as a covariate; however, the effect of Day 1 implicit memory on Day 2 speech performance was marginal (β = 1.21, SE = .76, p = .061, partial R2 = .181; baseline discrimination: β = 10.88, SE = 2.41, p < .001, partial R2 = .531).

To summarize, despite the initial interactions observed in the omnibus mixed-effects model between time, implicit memory, and explicit memory, and between time and explicit memory, closer inspection of the data revealed no interaction between Day 1 implicit and explicit memory abilities on either day, nor an effect of Day 1 explicit memory on either day. The interaction observed between time and Day 1 implicit memory appears to reflect a strong effect of Day 1 implicit memory on speech-perceptual performance assessed on Day 1, combined with a weaker, marginal effect on speech-perceptual performance assessed on Day 2. Baseline discrimination ability, entered as a statistical control for our memory effects, was a strong predictor of speech-perceptual performance assessed across days, consistent with prior observations that baseline perceptual acuity is a strong predictor of speech-perceptual learning (Fuhrmeister & Myers, 2017).

Relationships between the retention of memory and the retention of speech-sound learning

In order to test the association between speech-sound retention and the retention of implicit or explicit memory on Day 2, we conducted a linear regression on the nonnative speech perception scores on Day 2. To note, we conducted this analysis as a linear regression (rather than as a linear mixed-effects model) because the outcome measure in this case does not support a random-effects structure. In order to control for variance explained by speech-perceptual performance on Day 1, the omnibus model contained a three-way interaction between Day 1 speech perception scores, Day 2 implicit memory, and Day 2 explicit memory scores, along with their subsuming interactions and main effects as factors, and verbal working memory scores as a covariate. Following the same backward-fitting procedure as described above, we arrived at an accepted model, F(2, 18) = 55.33, p < .001, R2adj = .84, that contained the terms for Day 1 speech perception scores (β = .97, SE = .11, p < .001, partial R2 = .815) and Day 2 explicit memory (β = 3.21, SE = 1.28, p = .022, partial R2 = .260).

To summarize, it appears that speech perception performance on Day 2 is associated with both Day 1 performance and the retention of explicit memory on Day 2. We note that, in the follow-up regressions above, we found Day 1 implicit memory to be marginally associated with Day 2 speech perception performance as well. Thus, while we cannot altogether rule out a subtle influence of implicit memory on Day 2 speech-perceptual performance, the retention of explicit memory does appear to emerge as a strong influence on Day 2. See Fig. 4 for a graphical depiction of the pattern of effects of implicit and explicit memory on speech learning across days.

Fig. 4
figure 4

Perceptual ability is plotted in d' (y-axis), and memory ability (x-axis) is plotted in values scaled by the proximity-to-maximum scaling method (Moeller, 2015). Confidence bands displayed around regression lines indicate 95% intervals

Relationships between nonnative speech-sound learning and perception of native phonological categories

The second aim of this study was to determine if the perception of native speech-sound categories is associated with nonnative speech-sound learning in adulthood. We tested the associations between perception of native categories and performance on the nonnative speech-perception tasks for each day separately. In order to determine if speech-sound learning assessed immediately after learning predicted speech-perceptual performance on a native phonological contrast, we regressed the categorical perception “goodness of fit” scores on the Day 1 nonnative speech perception scores. There was a marginal association, F(1, 19) = 4.08, p = .058, R2adj = .13. In order to determine if the retention of learned speech sounds predicted native speech-sound perception, we regressed the native categorical perception scores on the Day 2 nonnative speech perception scores. This association was significant, F(1, 19) = 6.92, p = .017, R2adj = .23. In order to determine if there were differences in the strengths of associations between identification of native speech-sound categories and nonnative speech perception across days, we conducted a likelihood ratio test on the two regression models. The test indicated a significant difference (χ2 = 4.45, df = 3, p < .001), driven by a greater effect of Day 2 over Day 1 nonnative speech perception performance.

Discussion

This study was a preliminary investigation into the relationships among speech-perceptual learning of a nonnative contrast measured before and after a 12-hour, overnight delay; implicit and explicit learning; and the perception of a native speech-sound contrast. Below, we summarize our main findings and their implications in the context of the prior literature.

First, we found speech-perceptual performance and explicit memory performance to change over time, such that on average, scores improved following the overnight delay. In contrast, implicit memory performance remained relatively stable over time. We note that no learning task is likely to be a “pure” measure of function. Learners may access explicit knowledge about target sequences or statistical regularities in an implicit learning task (Bertels et al., 2012). In addition, object familiarity, which may assist performance on a recognition memory task, has been argued to relate to implicit processes (Wang & Yonelinas, 2012). Despite these limitations, our interpretation of task performance is supported by the knowledge that implicit memory is not typically subject to offline gains (Janacsek & Nemeth, 2012; Simor et al., 2019), whereas offline gains are expected, supported by a robust literature, for information encoded by explicit memory (Diekelmann & Born, 2010; Hu et al., 2006; Rasch et al., 2007). Because our tasks patterned similarly to these expectations, we were reasonably reassured that performance on these two tasks relied on the intended memory systems. Furthermore, our nonnative speech-sound learning task replicated earlier findings (Earle et al., 2017; Earle & Myers, 2015a, 2015b), promoting confidence that we are examining robust behavioral phenomena.

In exploring the relationships between nonnative speech-sound learning and retention and implicit and explicit memory, we report two main findings. First, the results suggest that implicit memory performance (which, as noted above, is relatively stable across days) is associated with perceptual performance assessed immediately after learning, and marginally associated with perceptual performance assessed after an overnight delay. This suggests that the influence of acoustic-phonetic information acquired via the implicit memory system on speech-perceptual performance may be relatively stable, but perhaps subtly weaken, over time. Our second finding was that an association between speech-perceptual performance and explicit memory emerged on Day 2. Taken together, these findings suggest that implicit and explicit memory systems encode speech information in parallel; however, speech information encoded by the different memory systems may influence perceptual performance at different points in time following initial exposure. Specifically, while perceptual performance assessed immediately after learning may rely on information encoded by implicit memory, speech-perceptual performance following a period of posttraining sleep may rely on the retrieval of consolidated information encoded by explicit memory. While we did not directly examine qualitative changes to speech-sound representations in the current study, this interpretation may account for prior observations that speech-sound representations become more abstract and generalizable following a period of sleep (Earle & Myers, 2015a; Xie et al., 2018).

There are surface dissimilarities between the current findings and those of Quam et al. (2018), who found that the successful integration of acoustic-phonetic features was associated with explicit learning initially, but that this association shifted to implicit memory when training continued into a second day. It may be worthwhile to note that this previous investigation focused on the contributions of memory processes while speech-perceptual learning was continuing to take place. By contrast, the current investigation focused on the influences of information encoded by different types of memory on speech-perceptual performance following the conclusion of training. While there are additional methodological differences between studies (e.g., differences in memory tasks, controlling for time of day in the current study) that may have contributed to these outcomes, we consider the current findings an extension of this earlier narrative surrounding the process of building new speech-sound representations. Specifically, the current findings highlight the sources of acoustic-phonetic information that are recruited in the service of perceptual tasks at different stages of memory processing, which may differ from the division of labor between learning systems while speech-perceptual learning is taking place.

Our second research objective was to determine to what extent the perception of a trained nonnative speech-sound contrast related to speech-perception abilities in one’s native language in adulthood. We found that categorical perception performance on an English vowel continuum was associated with speech-perceptual performance on the nonnative contrast, albeit marginally (p = .058) on Day 1. This association was significantly greater with perceptual performance on the nonnative contrast assessed after an overnight delay. This may suggest that performance on speech perception tasks, across both native and nonnative speech sounds, relies on similar mechanisms of retrieval of acoustic-phonetic information. That this relationship was stronger on Day 2, taken together with our findings concerning the influence of implicit and explicit memory above, may suggest that performance on native speech-sound categorization is informed both by implicit memory and by the retrieval of explicit memory following a period of overnight consolidation.

Previously, Fuhrmeister and Myers (2017) reported that discrimination of a native /da/–/ta/ continuum did not predict one’s ability to learn the dental-retroflex contrast assessed immediately after training. Our finding of a marginal association between Day 1 nonnative speech perception scores and native categorical perception scores may relate to our use of a perceptual index that combined both identification and discrimination abilities. Moreover, the stronger association between these values on Day 2 may suggest that a reliable relationship between perception of native and nonnative speech emerges only after an overnight delay.

There are several important limitations that need to be acknowledged. First, our sample size is small for a study on individual differences. Moreover, implicit and explicit learning were each indexed by a single task in the current study. A replication of the current data with a larger sample, and with additional or varied implicit and explicit memory tasks, is an important future direction. Second, the focus of the current study was not on whether changes to behavior are specifically attributable to sleep, and therefore we did not employ a wake-state control. Thus, while we may suspect, based on the previous literature, that enhanced performance following the 12-hour interval in the speech perception and recognition memory tasks is due to sleep-mediated effects, we are unable to make that claim definitively with this data set.

Despite these limitations, there are several intriguing implications of the current findings. The current paper illustrates, through a dual-systems learning and memory framework, the kinds of changes that speech-sound representations may undergo in the first 12 hours after training. This has important methodological implications for research. For example, within-individual changes in perceptual performance over time may reflect not just quantitative differences in representational strength, but also changes in the sources of acoustic-phonetic information. In addition, the different associations of implicit and explicit memory with perceptual performance over time suggest that those with poor offline consolidation may not demonstrate deficits in speech-perceptual learning until hours after the learning task (see Earle et al., 2018, for this behavioral pattern demonstrated by adults with developmental language disorder). Finally, the relationship observed here between the process of building nonnative speech-sound representations and the quality of native speech representations offers avenues for theoretical insights, as well as methodological advances, in examining potential memory breakdowns in clinical populations. This is particularly important for conditions in which atypical speech-sound representations are a hallmark linguistic symptom (e.g., developmental language disorder, Earle et al., 2018; developmental dyslexia, Gabay & Holt, 2015). These are important directions to pursue in future research.

The associations between native and nonnative speech perception in adulthood also raise interesting implications for foreign language learning in adulthood. It is a well-known phenomenon that children at earlier stages of phonological development are better able to perceive nonnative speech sounds than after committing to a native phonological inventory (Werker & Tees, 1984; Zhang et al., 2005). The current study suggests that in adulthood, those with robust native speech-sound representations are better able to learn nonnative speech contrasts. This is consistent with prior observations that stronger neural responses to native speech-sound contrasts were associated with better foreign speech-sound learning (Díaz et al., 2008; Qi, Han, et al., 2019b). While these associations may be classically interpreted as attributable to the stability of perceptual anchors facilitating perceptual learning (Kuhl, 1992), the current findings raise the intriguing possibility that perceptual performance may share similar memory retrieval mechanisms across native and nonnative speech sounds. These are questions to pursue in depth in the future.