Predicting speech intelligibility in hearing-impaired listeners using a physiologically inspired auditory model
Introduction
A large number of speech intelligibility (SI) prediction models have been proposed over the past decades, usually with the aim of providing tools for assessing transmission channels (e.g., in telecommunications and room acoustics) and signal-enhancement algorithms and/or of better understanding the healthy human auditory system in terms of speech processing in various conditions. Most of these models rely on a simplistic linear representation of the peripheral stages of the human auditory system, employing a linear filterbank to simulate the frequency selectivity of the auditory system (e.g., ANSI, 1997; Rhebergen et al., 2006; Taal et al., 2011). Some of the most powerful and versatile linear models combine the initial filterbank stage with a subband envelope extraction followed by another filterbank that analyses the slower level fluctuations in the subband signals (e.g., Houtgast et al., 1980; Elhilali et al., 2003; Jørgensen and Dau, 2011; Jørgensen et al., 2013; Relaño-Iborra et al., 2016; review: Relaño-Iborra and Dau, 2022, this issue). The model predictions are obtained by comparing the noisy or processed speech signal with a reference signal, usually the clean speech or the noise alone, and computing either a type of signal-to-noise ratio (SNR, e.g., ANSI, 1997; Rhebergen et al., 2006; Houtgast et al., 1980; Jørgensen and Dau, 2011; Jørgensen et al., 2013) or a correlation-type metric (Taal et al., 2011; Relaño-Iborra et al., 2016). The decision metric can then be transformed to SI in percent correct based on data and predictions from a given fitting condition, e.g., speech in stationary speech-shaped noise (SSN).
The SI models mentioned above have been created using simplistic linear pre-processing with a focus on accounting for as many acoustic conditions as possible in a population of normal-hearing (NH) listeners. However, the healthy auditory system is strongly nonlinear due to cochlear amplification attributed to outer hair cells (OHCs) and the saturating nonlinear nature of the inner hair cells (IHCs). Hearing impairment, on the other hand, which is far from being fully understood in all its complexity, typically induces a partial linearization of the system due to an OHC-loss-induced reduction in cochlear amplification (review: Heinz, 2010). Therefore, linear models provide a suboptimal starting point for accounting for effects of hearing impairment, as they may already be considered “impaired” in a sense and thus are functionally limited in accounting for (supra-threshold) effects of hearing impairment beyond audibility limitations. Only a few researchers have attempted to incorporate more sophisticated nonlinear models of the auditory periphery in an SI-prediction framework. Relaño-Iborra et al. (2019) adapted the computational auditory signal processing and perception model (CASP; Jepsen et al., 2011), an auditory model with nonlinear OHC (but not IHC) behavior that contains an envelope-frequency analysis stage, for predicting SI using a processed-speech vs. clean-speech correlation approach. The resulting speech-based CASP (sCASP) model showed accurate predictions across many acoustic conditions and plausible trends across different presentation levels for NH listeners, while its predictive power regarding effects of hearing impairment on SI has not been fully explored yet (Relaño-Iborra and Dau, 2022, this issue). 
Another nonlinear model of the auditory periphery is the auditory-nerve model (ANM), which has been developed to describe the temporal properties of auditory-nerve rate functions and spike trains in cats and other species (Carney, 1993; Zilany et al., 2009; Zilany et al., 2014). The ANM simulates the nonlinear behavior of both OHCs and IHCs and can thus functionally account for level effects as well as for OHC and IHC impairment. Zilany and Bruce (2007) incorporated the ANM in the framework of Elhilali et al. (2003) and showed promising predictions for word recognition across different presentation levels in NH and hearing-impaired (HI) listeners. Hines and Harte (2012) used the ANM in combination with their image-processing based Neurogram Similarity Index Measure (NSIM) to predict phoneme identification scores across different presentation levels in NH listeners. Bruce et al. (2013) used an ANM-based model framework to predict effects of masking release on consonant identification in NH and HI listeners. Hossain et al. (2016) conceived a reference-free model using a bispectrum analysis of the ANM-based neurogram to predict phoneme identification scores for groups of NH and HI listeners.
Carney et al. (2015) proposed a model of vowel coding in the midbrain, which was shown to be robust over a wide range of sound levels as well as background noise. The model is heavily based on the interaction of sound level, basilar membrane nonlinearities controlled by the OHCs, and the saturating nonlinearity of the IHCs in the ANM, which yield very flat responses at characteristic frequencies (CFs) close to vowel formant frequencies, whereas responses that fluctuate strongly at the fundamental frequency (F0) are found at CFs in between vowel formants. As also argued in Carney (2018), these fluctuation profiles can be revealed by a band-pass (or band-enhanced) filter centered around F0, which is a simplistic representation of responses of inferior-colliculus (IC) neurons, many of which exhibit band-pass tuning to amplitude modulations. Inspired by these observations, Scheidiger et al. (2018) proposed a modeling framework that mimics the above process using the ANM followed by a bandpass modulation filter. The processing was applied to the noisy speech and the noise-alone signals, and the decision metric was based on the across-CF correlation between the internal representations of the two signals. Scheidiger et al. (2018) showed accurate predictions for NH listeners across different noise types and promising predictions for some, but not all, of the considered HI listeners.
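The core of the Scheidiger et al. (2018) back end described above can be sketched in two steps: band-pass filter each channel's rate function around F0 (a stand-in for IC modulation tuning) to obtain a fluctuation profile, then correlate the noisy-speech and noise-alone profiles across CF. The Butterworth filter, bandwidth, and function names below are illustrative assumptions, not the published implementation:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def f0_fluctuation_profile(rates, fs_env, f0=120.0, bw=60.0):
    """Band-pass filter each channel (rows of `rates`) around F0 using a
    simple Butterworth stand-in for IC band-pass modulation tuning, and
    return the RMS fluctuation per channel (the 'fluctuation profile')."""
    sos = butter(2, [f0 - bw, f0 + bw], btype="bandpass",
                 fs=fs_env, output="sos")
    fluct = sosfiltfilt(sos, rates, axis=-1)
    return np.sqrt(np.mean(fluct**2, axis=-1))

def across_cf_correlation(profile_sn, profile_n):
    """Pearson correlation across CF channels between the noisy-speech (SN)
    and noise-alone (N) fluctuation profiles."""
    return float(np.corrcoef(profile_sn, profile_n)[0, 1])
```

In this picture, channels near formants stay flat (low RMS after the F0 filter) while channels between formants fluctuate strongly, so speech imprints a CF-dependent structure on the profile that the noise-alone reference lacks.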
The current study proposes a matured version of the modeling approach conceived by Scheidiger et al. (2018), removing some of the unnecessary complexity of the original model and adding a crucial component that limits the sensitivity of the across-CF correlation metric used for predicting SI. The study furthermore systematically investigated the predictive power of the updated SI model, simulating SRTs measured in NH listeners (Jørgensen et al., 2013) and in HI listeners (Christiansen and Dau, 2012) using sentences in SSN, 8-Hz sinusoidally amplitude-modulated noise (SAM), and the speech-like international speech test signal (ISTS; Holube et al., 2010). This evaluation was conducted by comparing measured and predicted NH and HI group SRTs, as well as by comparing individual HI listeners' SRTs with the corresponding model predictions. Similarly, the measured and predicted masking release (MR), i.e., the SI benefit induced by fluctuating interferers (SAM, ISTS) as compared to a stationary interferer (SSN), was explicitly compared. In addition to predicting the measured data, the model's behavior across a range of presentation levels (in its NH configuration) and the effect of attributing the hearing losses to different proportions of IHC and OHC impairment were also analyzed.
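The MR computation itself is simple arithmetic on the SRTs. As a sketch with hypothetical SRT values (not the measured data):

```python
# Illustrative SRTs in dB SNR (hypothetical numbers, not the measured data)
srt = {"SSN": -3.0, "SAM": -10.0, "ISTS": -8.0}

def masking_release(srt_stationary, srt_fluctuating):
    """MR in dB: how much lower (better) the SRT is for a fluctuating
    interferer than for the stationary reference interferer."""
    return srt_stationary - srt_fluctuating

mr_sam = masking_release(srt["SSN"], srt["SAM"])  # 7.0 dB of release
```

A positive MR thus indicates that the listener benefits from the dips in the fluctuating interferer; HI listeners typically show reduced MR.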
Section snippets
Model description
A flowchart of the proposed model is shown in Fig. 1. The inputs to the model are the noisy speech stimulus (SN) and the noise alone (N), which serves as a reference signal. The two signals are processed through the ANM (Zilany et al., 2014), which represents the auditory periphery in terms of peripheral frequency tuning and various non-linear aspects of the cochlear mechanics and the hair cell responses. The ANM also receives a spline-interpolated version of the audiogram at the considered
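The resampling of the audiogram to the model's characteristic frequencies can be sketched as follows; the use of a cubic spline on a log-frequency axis and the example hearing-loss values are assumptions for illustration, not necessarily the exact procedure used in the model:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Standard audiometric frequencies (Hz) and an illustrative sloping loss (dB HL)
audio_f = np.array([125, 250, 500, 1000, 2000, 4000, 8000], dtype=float)
audio_hl = np.array([10, 10, 15, 20, 40, 55, 65], dtype=float)

def audiogram_at_cfs(cfs_hz):
    """Spline-interpolate the audiogram to arbitrary characteristic
    frequencies; interpolating on log2(frequency) is an assumption about
    how the resampling is done."""
    spline = CubicSpline(np.log2(audio_f), audio_hl)
    return spline(np.log2(np.asarray(cfs_hz, dtype=float)))
```

The interpolated dB HL values at each CF would then set the OHC/IHC impairment parameters of the ANM front end.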
Model predictions for the NH and HI groups
This section compares the measured data with the model predictions for the two groups of NH and HI listeners. The comparison is conducted in terms of (i) SRTs as a function of the SSN, SAM, and ISTS conditions and (ii) MR, defined as MR_SAM = SRT_SSN − SRT_SAM and MR_ISTS = SRT_SSN − SRT_ISTS. The left panel of Fig. 3 demonstrates that the NH group (open squares) performed better and thus reached lower SRTs than the HI group (open diamonds). For both groups, the highest SRTs were observed for
Model performance and behavior
The proposed SI model, which is based on Scheidiger et al. (2018) and Carney et al. (2015), was evaluated based on NH data (Jørgensen et al., 2013) and HI data (Christiansen and Dau, 2012) in three different noise conditions. Based only on a single conversion function (“fitting”), obtained using NH listener data collected in SSN, and incorporating the HI listeners’ audiograms in the front-end processing, the model accounted very well for (i) NH group SRTs across SAM and ISTS interferers and
Conclusions
The present study presented a matured version of a speech-intelligibility prediction model previously proposed by Scheidiger et al. (2018). The model is based on a sophisticated nonlinear auditory model that allows incorporation of hearing loss, combined with a back end that quantifies the similarity between across-frequency fluctuation profiles of noisy speech and noise alone. The model showed accurate predictions of speech reception thresholds (SRTs) measured in normal-hearing listeners
CRediT authorship contribution statement
Johannes Zaar: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Writing – original draft, Writing – review & editing, Visualization. Laurel H. Carney: Conceptualization, Methodology, Investigation, Resources, Writing – review & editing.
Acknowledgments
We would like to acknowledge Christoph Scheidiger and Torsten Dau for their contributions to the initial work on this modeling approach, as well as Helia Relaño-Iborra for providing helpful comments on an earlier version of this manuscript. Johannes Zaar is supported by Swedish Research Council grant 2017-06092. Laurel Carney is supported by NIH-R01-DC-001641.
References (42)
- et al. (2019). The search for noise-induced cochlear synaptopathy in humans: Mission impossible? Hearing Research.
- et al. (2003). A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Communication.
- (2022). Animal models of hidden hearing loss: Does auditory-nerve-fiber loss cause real-world listening difficulties? Molecular and Cellular Neuroscience.
- et al. (2012). Speech intelligibility prediction using a Neurogram Similarity Index Measure. Speech Communication.
- (1997). S3.5, Methods for the Calculation of the Speech Intelligibility Index.
- et al. (2013). Physiological prediction of masking release for normal-hearing and hearing-impaired listeners. Proc. of Meetings on Acoustics.
- (1993). A model for the responses of low-frequency auditory-nerve fibers in cat. J. Acoust. Soc. Am.
- et al. (2015). Speech coding in the brain: representation of vowel formants by midbrain neurons tuned to sound fluctuations. eNeuro.
- (2018). Supra-threshold hearing and fluctuation profiles: implications for sensorineural and hidden hearing loss. J. Assoc. Res. Otolaryngol.
- et al. (2012). Relationship between masking release in fluctuating maskers and speech reception thresholds in stationary noise. J. Acoust. Soc. Am.
- Contributions of comodulation masking release and temporal resolution to the speech-reception threshold masked by an interfering voice. J. Acoust. Soc. Am.
- Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am.
- Reference-free assessment of speech intelligibility using bispectrum of an auditory neurogram. PLOS ONE.
- Development and analysis of an International Speech Test Signal (ISTS). Int. J. Audiol.
- Predicting speech intelligibility in rooms from the modulation transfer function. I. General room acoustics. Acustica.
- Effects of peripheral tuning on the auditory nerve's representation of speech envelope and temporal fine structure cues.
- Characterizing auditory processing and perception in individual listeners with sensorineural hearing loss. J. Acoust. Soc. Am.
- Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing. J. Acoust. Soc. Am.
- Across-frequency behavioral estimates of the contribution of inner and outer hair cell dysfunction to individualized audiometric loss. Frontiers in Neuroscience.
- A multi-resolution envelope-power based model for speech intelligibility. J. Acoust. Soc. Am.