Hearing Research

Volume 426, December 2022, 108553

Predicting speech intelligibility in hearing-impaired listeners using a physiologically inspired auditory model

https://doi.org/10.1016/j.heares.2022.108553

Highlights

  • Across-frequency fluctuation profiles predict speech intelligibility in noise

  • Model predicts effects of level and different noise types, incl. masking release

  • Model predicts effects of individual hearing loss

  • Plausible effects of inner- and outer-hair-cell impairment

Abstract

This study presents a major update and full evaluation of a speech intelligibility (SI) prediction model previously introduced by Scheidiger, Carney, Dau, and Zaar [(2018), Acta Acust. United Ac. 104, 914-917]. The model predicts SI in speech-in-noise conditions via comparison of the noisy speech and the noise-alone reference. The two signals are processed through a physiologically inspired nonlinear model of the auditory periphery, for a range of characteristic frequencies (CFs), followed by a modulation analysis in the range of the fundamental frequency of speech. The decision metric of the model is the mean of a series of short-term, across-CF correlations between population responses to noisy speech and noise alone, with a sensitivity-limitation process imposed. The decision metric is assumed to be inversely related to SI and is converted to a percent-correct score using a single data-based fitting function. The model performance was evaluated in conditions of stationary, fluctuating, and speech-like interferers using sentence-based speech-reception thresholds (SRTs) previously obtained in 5 normal-hearing (NH) and 13 hearing-impaired (HI) listeners. For the NH listener group, the model accurately predicted SRTs across the different acoustic conditions (apart from a slight overestimation of the masking release observed for fluctuating maskers), as well as plausible effects in response to changes in presentation level. For HI listeners, the model was adjusted to account for the individual audiograms using standard assumptions concerning the amount of HI attributed to inner-hair-cell (IHC) and outer-hair-cell (OHC) impairment. HI model results accounted remarkably well for elevated individual SRTs and reduced masking release. Furthermore, plausible predictions of worsened SI were obtained when the relative contribution of IHC impairment to HI was increased. 
Overall, the present model provides a useful tool to accurately predict speech-in-noise outcomes in NH and HI listeners, and may yield important insights into auditory processes that are crucial for speech understanding.

Introduction

A large number of speech intelligibility (SI) prediction models have been proposed over the past decades, usually with the aim of providing tools for assessing transmission channels (e.g., in telecommunications and room acoustics) and signal-enhancement algorithms and/or of better understanding how the healthy human auditory system processes speech in various conditions. Most of these models rely on a simplistic linear representation of the peripheral stages of the human auditory system, employing a linear filterbank to simulate the frequency selectivity of the auditory system (e.g., ANSI, 1997; Rhebergen et al., 2006; Taal et al., 2011). Some of the most powerful and versatile linear models combine the initial filterbank stage with a subband envelope extraction followed by another filterbank that analyses the slower level fluctuations in the subband signals (e.g., Houtgast et al., 1980; Elhilali et al., 2003; Jørgensen and Dau, 2011; Jørgensen et al., 2013; Relaño-Iborra et al., 2016; review: Relaño-Iborra and Dau, 2022, this issue). The model predictions are obtained by comparing the noisy or processed speech signal with a reference signal, usually the clean speech or the noise alone, and deriving either a type of signal-to-noise ratio (SNR, e.g., ANSI, 1997; Rhebergen et al., 2006; Houtgast et al., 1980; Jørgensen and Dau, 2011; Jørgensen et al., 2013) or a correlation-type metric (Taal et al., 2011; Relaño-Iborra et al., 2016). The decision metric can then be transformed to SI in percent correct based on data and predictions from a given fitting condition, e.g., speech in stationary speech-shaped noise (SSN).
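The metric-to-intelligibility conversion described above can be sketched as follows. The specific logistic form, the parameter values, and the fitting data below are hypothetical illustrations, not the paper's actual conversion function; the only assumption taken from the text is that a single function is fitted once on an SSN reference condition and then reused for all other conditions.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(d, a, b):
    """Map a decision metric d to percent correct via a two-parameter logistic.
    The metric is assumed to be inversely related to intelligibility."""
    return 100.0 / (1.0 + np.exp(a * (d - b)))

# Hypothetical SSN fitting data: decision-metric values and measured scores.
d_ssn = np.array([0.2, 0.35, 0.5, 0.65, 0.8])
pc_ssn = np.array([95.0, 80.0, 50.0, 20.0, 5.0])

# Fit once on the SSN condition; reuse the same mapping for all other conditions.
(a, b), _ = curve_fit(logistic, d_ssn, pc_ssn, p0=[10.0, 0.5])

# Predict percent correct for a new metric value from another condition.
pc_new = logistic(0.4, a, b)
```

Fitting the conversion on only one condition is what makes predictions in the remaining conditions genuine tests of the model rather than re-fits.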

The SI models mentioned above have been created using simplistic linear pre-processing with a focus on accounting for as many acoustic conditions as possible in a population of normal-hearing (NH) listeners. However, the healthy auditory system is strongly nonlinear due to cochlear amplification attributed to outer hair cells (OHCs) and the saturating nonlinear nature of the inner hair cells (IHCs). Hearing impairment, on the other hand, which is far from being fully understood in all its complexity, typically induces a partial linearization of the system due to an OHC-loss-induced reduction in cochlear amplification (review: Heinz, 2010). Therefore, linear models provide a suboptimal starting point for accounting for effects of hearing impairment, as they may already be considered “impaired” in a sense and thus are functionally limited in accounting for (supra-threshold) effects of hearing impairment beyond audibility limitations. Only a few researchers have attempted to incorporate more sophisticated nonlinear models of the auditory periphery in an SI-prediction framework. Relaño-Iborra et al. (2019) adapted the computational auditory signal processing and perception model (CASP; Jepsen et al., 2011), an auditory model with nonlinear OHC (but not IHC) behavior that contains an envelope-frequency analysis stage, for predicting SI using a processed-speech vs. clean-speech correlation approach. The resulting speech-based CASP (sCASP) model showed accurate predictions across many acoustic conditions and plausible trends across different presentation levels for NH listeners, while its predictive power regarding effects of hearing impairment on SI has not been fully explored yet (Relaño-Iborra and Dau, 2022, this issue). 
Another nonlinear model of the auditory periphery is the auditory-nerve model (ANM), which has been developed to describe the temporal properties of auditory-nerve rate functions and spike trains in cats and other species (Carney, 1993; Zilany et al., 2009; Zilany et al., 2014). The ANM simulates the nonlinear behavior of both OHCs and IHCs and can thus functionally account for level effects as well as for OHC and IHC impairment. Zilany and Bruce (2007) incorporated the ANM in the framework of Elhilali et al. (2003) and showed promising predictions for word recognition across different presentation levels in NH and hearing-impaired (HI) listeners. Hines and Harte (2012) used the ANM in combination with their image-processing based Neurogram Similarity Index Measure (NSIM) to predict phoneme identification scores across different presentation levels in NH listeners. Bruce et al. (2013) used an ANM-based model framework to predict effects of masking release on consonant identification in NH and HI listeners. Hossain et al. (2016) conceived a reference-free model using a bispectrum analysis of the ANM-based neurogram to predict phoneme identification scores for groups of NH and HI listeners.

Carney et al. (2015) proposed a model of vowel coding in the midbrain, which was shown to be robust over a wide range of sound levels as well as background noise. The model is heavily based on the interaction of sound level, basilar membrane nonlinearities controlled by the OHCs, and the saturating nonlinearity of the IHCs in the ANM, which yields very flat responses at characteristic frequencies (CFs) close to vowel formant frequencies, whereas responses that fluctuate strongly at the fundamental frequency (F0) are found at CFs in between vowel formants. As also argued in Carney (2018), these fluctuation profiles can be revealed by a band-pass (or band-enhanced) filter centered around F0, which is a simplistic representation of responses of inferior-colliculus (IC) neurons, many of which exhibit band-pass tuning to amplitude modulations. Inspired by these observations, Scheidiger et al. (2018) proposed a modeling framework that mimics the above process using the ANM followed by a bandpass modulation filter. The processing was applied to the noisy speech and the noise-alone signals, and the decision metric was based on the across-CF correlation between the internal representations of the two signals. Scheidiger et al. (2018) showed accurate predictions for NH listeners across different noise types and promising predictions for some, but not all, of the considered HI listeners.
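The two back-end ingredients described above — a modulation band-pass around F0 and an across-CF correlation — can be sketched as below. The sampling rate, filter order, and modulation band edges are assumptions for illustration, not the paper's parameter values, and the correlation is shown here as a single normalized correlation of per-CF fluctuation profiles rather than the model's full series of short-term correlations.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 10_000          # sampling rate of the simulated rate functions in Hz (assumed)
F0_BAND = (64, 200)  # modulation band around a typical speech F0 in Hz (assumed)

def modulation_bandpass(rates, fs=FS, band=F0_BAND):
    """Band-pass each CF channel's rate function around F0 (rows = CFs)."""
    sos = butter(2, band, btype='bandpass', fs=fs, output='sos')
    return sosfiltfilt(sos, rates, axis=1)

def across_cf_correlation(profile_sn, profile_n):
    """Normalized correlation of across-CF fluctuation profiles, where each
    input is a (n_cf,) vector, e.g. RMS fluctuation strength per CF, for the
    noisy speech (SN) and the noise alone (N)."""
    a = profile_sn - profile_sn.mean()
    b = profile_n - profile_n.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

A high correlation means the noisy speech adds little fluctuation structure beyond the noise alone, consistent with the metric being inversely related to intelligibility.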

The current study proposes a matured version of the modeling approach conceived by Scheidiger et al. (2018), removing some of the unnecessary complexity of the original model and adding a crucial component that limits the sensitivity of the across-CF correlation metric used for predicting SI. The study furthermore systematically investigated the predictive power of the updated SI model, simulating SRTs measured in NH listeners (Jørgensen et al., 2013) and in HI listeners (Christiansen and Dau, 2012) using sentences in SSN, 8-Hz sinusoidally amplitude-modulated noise (SAM), and the speech-like international speech test signal (ISTS; Holube et al., 2010). This evaluation was conducted by comparing measured and predicted NH and HI group SRTs, as well as by comparing individual HI listener SRTs with the corresponding model predictions. Similarly, the measured and predicted masking release (MR), i.e., the SI benefit induced by fluctuating interferers (SAM, ISTS) as compared to a stationary interferer (SSN), was explicitly compared. In addition to predicting the measured data, the model's reaction to a number of presentation levels (in NH configuration) and the effect of interpreting the hearing losses with different proportions of IHC and OHC impairment were also analyzed.

Section snippets

Model description

A flowchart of the proposed model is shown in Fig. 1. The inputs to the model are the noisy speech stimulus (SN) and the noise alone (N), which serves as a reference signal. The two signals are processed through the ANM (Zilany et al., 2014), which represents the auditory periphery in terms of peripheral frequency tuning and various non-linear aspects of the cochlear mechanics and the hair cell responses. The ANM also receives a spline-interpolated version of the audiogram at the considered
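The spline interpolation of the audiogram onto the model's CF grid can be sketched as follows. The audiometric frequencies, hearing levels, CF range, and the choice of interpolating on a log-frequency axis are all illustrative assumptions, not necessarily the exact procedure used with the ANM.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Standard audiometric frequencies (Hz) and a hypothetical audiogram (dB HL).
audio_f = np.array([125, 250, 500, 1000, 2000, 4000, 8000])
audio_hl = np.array([10.0, 10.0, 15.0, 20.0, 40.0, 55.0, 60.0])

# Interpolate on a log2-frequency axis onto an assumed CF grid of the model.
cfs = np.logspace(np.log2(125), np.log2(8000), num=30, base=2)
spline = CubicSpline(np.log2(audio_f), audio_hl)
hl_at_cf = spline(np.log2(cfs))
```

The resulting per-CF threshold shift can then be attributed to OHC vs. IHC impairment in some assumed proportion, which is the lever used later in the study to explore different interpretations of the same audiogram.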

Model predictions for the NH and HI groups

This section compares the measured data with the model predictions for the two groups of NH and HI listeners. The comparison is conducted in terms of (i) SRTs as a function of the SSN, SAM, and ISTS conditions and (ii) MR, defined as MR_SAM = SRT_SSN - SRT_SAM and MR_ISTS = SRT_SSN - SRT_ISTS. The left panel of Fig. 3 demonstrates that the NH group (open squares) performed better and thus reached lower SRTs than the HI group (open diamonds). For both groups, the highest SRTs were observed for
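Since masking release is simply the SRT difference between the stationary and the fluctuating masker, a hypothetical numeric example (the SRT values below are made up, not the measured data):

```python
# Hypothetical SRTs in dB SNR; lower (more negative) means better performance.
srt = {'SSN': -3.0, 'SAM': -12.0, 'ISTS': -15.0}

# Masking release: SI benefit of a fluctuating masker relative to SSN.
mr_sam = srt['SSN'] - srt['SAM']    # -3 - (-12) = 9 dB
mr_ists = srt['SSN'] - srt['ISTS']  # -3 - (-15) = 12 dB
```

A positive MR means the listener (or model) exploits the masker's temporal dips; reduced MR in HI listeners shows up as smaller differences between these SRTs.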

Model performance and behavior

The proposed SI model, which is based on Scheidiger et al. (2018) and Carney et al. (2015), was evaluated based on NH data (Jørgensen et al., 2013) and HI data (Christiansen and Dau, 2012) in three different noise conditions. Based only on a single conversion function (“fitting”), obtained using NH listener data collected in SSN, and incorporating the HI listeners’ audiograms in the front-end processing, the model accounted very well for (i) NH group SRTs across SAM and ISTS interferers and

Conclusions

The present study presented a matured version of a speech-intelligibility prediction model previously proposed by Scheidiger et al. (2018). The model is based on a sophisticated nonlinear auditory model that allows incorporation of hearing loss, combined with a back end that quantifies the similarity between across-frequency fluctuation profiles of noisy speech and noise alone. The model showed accurate predictions of speech reception thresholds (SRTs) measured in normal-hearing listeners

CRediT authorship contribution statement

Johannes Zaar: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Writing – original draft, Writing – review & editing, Visualization. Laurel H. Carney: Conceptualization, Methodology, Investigation, Resources, Writing – review & editing.

Acknowledgments

We would like to acknowledge Christoph Scheidiger and Torsten Dau for their contributions to the initial work on this modeling approach, as well as Helia Relaño-Iborra for providing helpful comments on an earlier version of this manuscript. Johannes Zaar is supported by Swedish Research Council grant 2017-06092. Laurel Carney is supported by NIH-R01-DC-001641.

References

  • J.M. Festen

    Contributions of comodulation masking release and temporal resolution to the speech-reception threshold masked by an interfering voice

    J. Acoust. Soc. Am.

    (1993)
  • N.R. French et al.

    Factors governing the intelligibility of speech sounds

    J. Acoust. Soc. Am.

    (1947)
  • Heinz, M. G. (2010): “Computational Modeling of Sensorineural Hearing Loss,” In: Meddis R., Lopez-Poveda E., Fay R.,...
  • M.E. Hossain et al.

    Reference-Free Assessment of Speech Intelligibility Using Bispectrum of an Auditory Neurogram

    PLOS ONE

    (2016)
  • I. Holube et al.

Development and analysis of an International Speech Test Signal (ISTS)

    Int. J. Audiol.

    (2010)
  • T. Houtgast et al.

    Predicting speech intelligibility in rooms from the modulation transfer function. I. General room acoustics

    Acustica

    (1980)
  • R.A. Ibrahim et al.

    Effects of Peripheral Tuning on the Auditory Nerve's Representation of Speech Envelope and Temporal Fine Structure Cues

  • M.L. Jepsen et al.

    Characterizing auditory processing and perception in individual listeners with sensorineural hearing loss

    J. Acoust. Soc. Am.

    (2011)
  • S. Jørgensen et al.

    Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing

    J. Acoust. Soc. Am.

    (2011)
  • P.T. Johannesen et al.

    Across-frequency behavioral estimates of the contribution of inner and outer hair cell dysfunction to individualized audiometric loss

    Frontiers in Neuroscience

    (2014)
  • S. Jørgensen et al.

    A multi-resolution envelope-power based model for speech intelligibility

    J. Acoust. Soc. Am.

    (2013)