NeuroImage

Volume 124, Part A, 1 January 2016, Pages 641–653

When sentences live up to your expectations

https://doi.org/10.1016/j.neuroimage.2015.09.004

Highlights

  • Top-down predictions (priming) render noise-like fine structure speech intelligible.

  • Unintelligible fine structure speech processing is confined to auditory cortex.

  • Top-down predictions allow activity to propagate into posterior temporal areas.

  • The brain decodes speech by integrating bottom-up inputs with top-down predictions.

Abstract

Speech recognition is rapid, automatic and amazingly robust. How the brain is able to decode speech from noisy acoustic inputs is unknown. We show that the brain recognizes speech by integrating bottom-up acoustic signals with top-down predictions.

Subjects listened to intelligible normal and unintelligible fine structure speech that lacked the predictability of the temporal envelope and did not enable access to higher linguistic representations. Their top-down predictions were manipulated using priming. Activation for unintelligible fine structure speech was confined to primary auditory cortices, but propagated into posterior middle temporal areas when fine structure speech was made intelligible by top-down predictions. By contrast, normal speech engaged posterior middle temporal areas irrespective of subjects' predictions. Critically, when speech violated subjects' expectations, activation increases in anterior temporal gyri/sulci signalled a prediction error and the need for new semantic integration.

In line with predictive coding, our findings demonstrate that top-down predictions determine whether and how the brain translates bottom-up acoustic inputs into intelligible speech.

Introduction

Speech recognition is a seemingly effortless process despite background noise, inter-speaker variability or co-articulation patterns that preclude a simple one-to-one mapping between auditory signal and speech percept. To infer the most likely interpretation of the complex time-varying acoustic signal, the brain is challenged to integrate multiple probabilistic cues with prior expectations. Predictive coding models posit that speech recognition as perceptual inference emerges in the cortical hierarchy by iterative adjustment of top-down predictions against bottom-up sensory evidence (Davis and Johnsrude, 2007, Friston, 2005, Friston, 2010). Specifically, backwards connections provide predictions from higher to subordinate cortical levels. Conversely, forwards connections furnish the prediction error that is computed at each cortical level as the difference between top-down predictions and bottom-up inputs. The cortical architecture may thus recapitulate the hierarchical structure of speech that generates the acoustic inputs. While activations in low level auditory areas reflect prediction errors at the ‘acoustic level’, activations in higher order auditory areas reflect prediction errors at higher representational (e.g. phonological, semantic) levels.
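To make this scheme concrete, consider a minimal two-level linear sketch (illustrative assumptions only; the linear generative model, weights and learning rate below are for exposition and are not the model used in the paper): the higher level holds an estimate of the causes of the sensory input, backward connections convey its prediction, and forward connections return the prediction error that updates the estimate.

```python
import numpy as np

# Minimal two-level predictive coding sketch (illustrative assumptions only).
# Level 2 holds an estimate x2 of the hidden causes; backward connections
# convey the prediction W @ x2 to level 1; forward connections return the
# prediction error e1 = u - W @ x2, which drives the update of x2.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))                   # backward (generative) weights
x2_true = np.array([1.0, -0.5, 2.0])          # causes that generated the input
u = W @ x2_true + 0.01 * rng.normal(size=8)   # noisy bottom-up input

x2 = np.zeros(3)                              # initial higher-level estimate
for _ in range(200):
    e1 = u - W @ x2                           # prediction error at level 1
    x2 += 0.05 * W.T @ e1                     # forward error updates the estimate

print("residual prediction error:", np.linalg.norm(u - W @ x2))
```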

Speech processing is thought to rely on a left-biased frontotemporal system encompassing a dorsal and a ventral stream (Hickok and Poeppel, 2007, Rauschecker and Scott, 2009). The dorsal stream projects to the frontoparietal cortices and is involved in auditory–motor integration to translate the incoming acoustic inputs into articulatory patterns. The ventral stream along the superior temporal sulcus maps the acoustic signals onto semantic representations. Indeed, intelligible normal speech increased activations in the superior/middle temporal gyri and sulci relative to a range of speech-like, yet unintelligible, control stimuli such as rotated, noise-vocoded or temporally reversed speech (Binder et al., 2000, Crinion et al., 2003, Leff et al., 2008, Peelle et al., 2013, Scott et al., 2000). While the responses in lower level regions adjacent to primary auditory cortex depended on the particular form of speech degradation, they became progressively invariant to the specific stimulus manipulations in higher order areas and reflected primarily the signal's intelligibility (Davis and Johnsrude, 2003). However, since these studies compared intelligible speech with various forms of unintelligible degraded speech, they could not unambiguously dissociate effects of spectrotemporal structure and speech intelligibility.

Only a few studies have investigated how speech intelligibility emerges from bottom-up inputs and top-down prior knowledge or expectations. Most of these studies employed noise or noise-vocoding to render speech partially intelligible, thereby allowing prior knowledge to enhance speech comprehension (e.g. Obleser and Kotz, 2010, Sohoglu et al., 2012). Yet, these experimental procedures enabled only a small increase in speech intelligibility (e.g. about 20% in Obleser and Kotz, 2010). Moreover, as participants were already able to understand ‘degraded speech’ at least to some extent, they may have engaged greater attentional resources for comprehension of degraded speech (see Wild et al., 2012b). Another study used a written word to render noise-vocoded auditory speech intelligible via crossmodal priming (Wild et al., 2012a). To our knowledge, only one very early study (Giraud et al., 2004) manipulated speech intelligibility for sentence stimuli more extensively, presenting participants with broad-band speech envelope noises that were initially unintelligible and rendered intelligible only after extensive practice. However, this experimental design introduced temporal and training confounds, rendering the intelligibility effect more difficult to interpret. Other studies have transformed syllables (Dehaene-Lambertz et al., 2005), single words (Möttönen et al., 2006) or sentences (Lee and Noppeney, 2011b, Lee and Noppeney, 2014) into sine wave speech stimuli that were processed as speech or non-speech depending on participants' prior experience. Collectively, these studies emphasized the role of anterior or posterior portions of the superior temporal sulci in processing sine wave speech stimuli as speech relative to non-speech.

To further investigate the role of prior expectations in speech processing, the current study independently manipulated (i) bottom-up inputs by comparing normal and fine structure speech and (ii) subjects' top-down predictions using priming. Fine structure speech preserves the rapidly varying fine structure of the original speech, but lacks the temporal cues of the acoustic envelope. Critically, after additional bandpass filtering, fine structure speech signals are generally perceived as noise. Yet, they can be rendered intelligible by prior top-down predictions via immediate priming, i.e. presenting the identical normal sentence directly before the fine structure stimulus (cf. Audio file A.1 in the appendix).
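The excerpt does not specify the signal processing used to construct the fine structure stimuli. As a hypothetical sketch of the generic technique (not necessarily the authors' exact pipeline), the Hilbert transform separates each frequency band into a slowly varying envelope (the magnitude of the analytic signal) and a rapidly varying fine structure (the cosine of its phase); discarding the former and summing the latter across bands yields a fine-structure-only signal:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def fine_structure(speech, fs, n_bands=16, f_lo=100.0, f_hi=8000.0):
    """Generic envelope/fine-structure split (assumes fs > 2 * f_hi).

    In each log-spaced band the Hilbert envelope (magnitude of the
    analytic signal) is discarded and only cos(instantaneous phase),
    i.e. the temporal fine structure, is retained; bands are then summed.
    """
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    out = np.zeros(len(speech))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, speech)
        out += np.cos(np.angle(hilbert(band)))   # unit-envelope carrier
    return out / n_bands
```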

To our knowledge, this is the first neuroimaging study that generates ‘unintelligible speech-like stimuli’ by removing the envelope information. This approach allows us to compare normal speech with speech-like stimuli that preserve the fine structure information. Moreover, through additional filtering we were able to fine-tune the fine structure speech stimuli, such that they were only 5% intelligible throughout the entire experiment in the absence of prior knowledge (i.e. unprimed), yet 97% intelligible after priming. This novel auditory pop-out phenomenon enabled us to compare physically identical fine structure stimuli that were or were not intelligible depending on top-down prior predictions.

Based on the notion of predictive coding, we expected prediction errors at multiple levels of the speech processing hierarchy. The regional expression of the prediction error should depend on the prediction that is violated (Lee and Noppeney, 2011a). Further, predictions can be formed at multiple timescales, ranging from milliseconds (e.g. online prediction of the next auditory spectrotemporal input) to seconds (e.g. prediction of the next sentence based on prior semantic context). First, we expected that fine structure speech relative to normal speech would increase activations in low level auditory areas, signalling the brain's failure to anticipate the auditory input in the absence of the temporal envelope. Similar to spatial grouping in the visual modality, we would expect the temporal envelope of speech to enable temporal grouping of auditory signals (i.e. physical effect; see Murray et al., 2002, for a related argument in the visual modality). Thus, the temporal envelope enables moment-to-moment predictions of the incoming auditory signal. Second, prior top-down predictions (i.e. priming) render fine structure speech intelligible and enable speech recognition processes that are not engaged by unprimed fine structure speech. Similar to priming studies in the visual domain (Dolan et al., 1997, George et al., 1999, Henson et al., 2000, Henson, 2003), we would therefore expect enhanced activations for primed intelligible relative to unprimed unintelligible fine structure speech in higher order auditory areas, reflecting the formation of novel linguistic representations (i.e. perceptual effect; cf. Davis and Johnsrude, 2003). From the perspective of predictive coding, these activation increases in higher order areas can be interpreted as prediction error signals that are elicited by the newly formed linguistic representations at a higher hierarchical level. Third, novel (i.e. unprimed) speech that violates subjects' prior semantic (or phonological, syntactic) expectations should elicit a greater response in the anterior superior temporal sulci, signalling a prediction error at a higher representational level. This prediction error at the sentential level emerges at a slower timescale, acting from sentence to sentence. It predominantly indicates the need for new semantic integration at the highest cortical level (i.e. novelty effect; cf. Dehaene-Lambertz et al., 2006). Prediction errors as indexed by physical, perceptual and novelty effects can thus emerge at multiple temporal scales, representational and cortical levels (Werner and Noppeney, 2011).
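These three hypotheses map naturally onto contrasts over the four cells of the design (fine structure vs. normal speech, crossed with primed vs. unprimed). The weights below are an illustrative reading of the hypotheses' logic, not the paper's actual statistical model:

```python
import numpy as np

# Hypothetical contrast weights over the four design cells, ordered
# [FS_unprimed, FS_primed, N_unprimed, N_primed]
# (FS = fine structure speech, N = normal speech). Illustrative only;
# the contrasts reported in the paper may differ.
contrasts = {
    "physical (FS > N)":                    np.array([ 1,  1, -1, -1]),
    "perceptual (primed FS > unprimed FS)": np.array([-1,  1,  0,  0]),
    "novelty (unprimed N > primed N)":      np.array([ 0,  0,  1, -1]),
}
for name, c in contrasts.items():
    assert c.sum() == 0                     # balanced contrast
    print(f"{name}: {c}")
```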

Finally, we employed effective connectivity analyses (i.e. Dynamic Causal Modelling) to investigate how perceptual and novelty effects emerged from interactions amongst brain regions.
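At its core, DCM models each region's hidden neural state with the bilinear equation dz/dt = (A + sum_j u_j B_j) z + C u, where A holds the fixed connectivity, B_j the modulation of connections by experimental input j, and C the driving inputs. The toy network below (hypothetical region names and parameter values, not the paper's fitted model) merely illustrates how a priming input could gate a forward connection:

```python
import numpy as np

# Toy bilinear DCM neural model: dz/dt = (A + sum_j u_j * B_j) z + C u.
# A: fixed (endogenous) connectivity; B_j: modulation of connections by
# input j; C: driving inputs. All names and values are illustrative.
regions = ["HG", "pMTG", "aSTS"]
A = np.array([[-1.0, 0.0, 0.0],            # self-decay and fixed coupling (Hz)
              [ 0.4, -1.0, 0.0],           # HG -> pMTG
              [ 0.3,  0.2, -1.0]])         # HG -> aSTS, pMTG -> aSTS
B = np.zeros((2, 3, 3))
B[1, 1, 0] = 0.5                           # input 1 (priming) boosts HG -> pMTG
C = np.zeros((3, 2))
C[0, 0] = 1.0                              # input 0 (speech) drives HG

def simulate(u, dt=0.01, n_steps=1000):
    z = np.zeros(3)
    traj = []
    for k in range(n_steps):
        ut = u(k * dt)                                   # inputs at time k*dt
        dz = (A + np.tensordot(ut, B, axes=1)) @ z + C @ ut
        z = z + dt * dz                                  # Euler integration
        traj.append(z.copy())
    return np.array(traj)

# speech on throughout; priming switched on in the second half
traj = simulate(lambda t: np.array([1.0, float(t > 5.0)]))
print(dict(zip(regions, traj[-1])))        # late activity per region
```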

In summary, manipulating top-down predictions and bottom-up physical inputs enabled us to elicit prediction errors at multiple hierarchical levels and temporal scales thereby providing insights into how the brain generates semantic representations at the sentential level from acoustic inputs.

Section snippets

Subjects

20 healthy right-handed German native speakers (10 females; 10 males; median age: 24.05) participated in the fMRI study. Seven of those participants also took part in an additional psychophysics study inside the scanner. Eight right-handed German native speakers participated in the additional psychophysics studies performed outside the scanner (Study 1: 5 male, 3 female, median age 33 years; Study 2: 6 male, 2 female, median age 29.5 years). All participants gave informed consent to participate in …

Results

Subjects were presented with normal and fine structure sentences that – unknown to the subjects – were arranged in pairs. In the main factorial design (shown in Fig. 1a and b), the 1st sentence was always a normal sentence, while the 2nd sentence could be either a normal or a fine structure sentence. Further, both the fine structure and the normal 2nd sentence could correspond to (= primed) or be unrelated to (= unprimed) the 1st normal sentence.
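A hypothetical sketch of this trial structure (placeholder sentence labels; only the pairing logic follows the description above):

```python
import random

# Hypothetical construction of the 2 x 2 factorial trial list described
# above. The 1st sentence of each pair is always normal speech; the 2nd
# varies in form (normal vs. fine structure) and in its relation to the
# 1st (primed = identical content, unprimed = unrelated).
SENTENCES = ["sentence_%02d" % i for i in range(1, 9)]   # placeholder corpus

def make_trial(form, priming, rng):
    s1 = rng.choice(SENTENCES)                           # 1st: normal sentence
    if priming == "primed":
        s2 = s1                                          # identical -> primed
    else:
        s2 = rng.choice([s for s in SENTENCES if s != s1])  # unrelated
    return {"sentence1": ("normal", s1), "sentence2": (form, s2)}

rng = random.Random(1)
trials = [make_trial(form, priming, rng)
          for form in ("normal", "fine_structure")
          for priming in ("primed", "unprimed")]
for t in trials:
    print(t)
```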

In short, this factorial design …

Discussion

The neural mechanisms that enable robust speech recognition are poorly understood. Our results support models of predictive coding in which speech is decoded by integrating bottom-up inputs (or prediction error signals) and top-down prior predictions. In the following, we discuss the activation results in relation to the three predictive coding hypotheses posed in the Introduction (Davis and Johnsrude, 2007, Friston, 2005, Friston, 2010).

Implications for a neuroanatomical model of speech recognition

Generally, our results highlight the importance of top-down predictions in speech recognition. They support models of predictive coding where speech recognition emerges in the cortical hierarchy via iterative adjustment of top-down predictions against bottom-up acoustic signals.

In brief, activations for unintelligible fine structure speech were confined to primary auditory cortices; yet when top-down predictions made fine structure speech intelligible, they propagated into posterior middle temporal areas …

Funding

This work was supported by the Max Planck Society.

Conflict of Interest

There is no conflict of interest.

Acknowledgments

We thank Mario Kleiner for help with recording the stimuli.

References (66)

  • J.R. Binder et al. Human brain language areas identified by functional magnetic resonance imaging. J. Neurosci. (1997)
  • J.R. Binder et al. Human temporal lobe activation by speech and nonspeech sounds. Cereb. Cortex (2000)
  • J.T. Crinion et al. Temporal lobe regions engaged during normal speech comprehension. Brain (2003)
  • M.H. Davis et al. Hierarchical processing in spoken language comprehension. J. Neurosci. (2003)
  • G. Dehaene-Lambertz et al. Functional segregation of cortical language areas by sentence repetition. Hum. Brain Mapp. (2006)
  • I. DeWitt et al. Phoneme and word recognition in the auditory ventral stream. Proc. Natl. Acad. Sci. U. S. A. (2012)
  • R.J. Dolan et al. How the brain learns to see objects and faces in an impoverished context. Nature (1997)
  • R. Drullman. Temporal envelope and fine structure cues for speech intelligibility. J. Acoust. Soc. Am. (1995)
  • S. Evans et al. The pathways for intelligible speech: multivariate and univariate perspectives. Cereb. Cortex (2014)
  • H. Feldman et al. Attention, uncertainty, and free-energy. Front. Hum. Neurosci. (2010)
  • D. Kersten et al. Bayesian models of object perception. Curr. Opin. Neurobiol. (2003)
  • P. Kok et al. Shape perception simultaneously up- and downregulates neural activity in the primary visual cortex. Curr. Biol. (2014)
  • S.A. Kotz et al. Modulation of the lexical-semantic network by auditory semantic priming: an event-related functional MRI study. NeuroImage (2002)
  • H. Lee et al. Temporal prediction errors in visual and auditory cortices. Curr. Biol. (2014)
  • P. Morosan et al. Human primary auditory cortex: cytoarchitectonic subdivisions and mapping into a spatial reference system. NeuroImage (2001)
  • R. Möttönen et al. Perceiving identical sounds as speech or non-speech modulates activity in the left posterior superior temporal sulcus. NeuroImage (2006)
  • S.O. Murray et al. Perceptual grouping and the interactions between visual cortical areas. Neural Netw. (2004)
  • W.D. Penny et al. Comparing dynamic causal models. NeuroImage (2004)
  • C.J. Price. A review and synthesis of the first 20 years of PET and fMRI studies of heard speech, spoken language and reading. NeuroImage (2012)
  • K.E. Stephan et al. Bayesian model selection for group studies. NeuroImage (2009)
  • A. Trebuchon et al. Ventral and dorsal pathways of speech perception: an intracerebral ERP study. Brain Lang. (2013)
  • C.L. Wiggs et al. Properties and mechanisms of perceptual priming. Curr. Opin. Neurobiol. (1998)
  • C.J. Wild et al. Human auditory cortex is sensitive to the perceived clarity of speech. NeuroImage (2012)