Reducing over- and under-estimation of the a priori SNR in speech enhancement techniques
Introduction
The problem of enhancing speech degraded by uncorrelated additive noise has been widely studied in the past and is still an active field of research. Although backgrounds and applications differ widely, a specific class of these methods is essentially based on an SNR estimate that is used with a gain function to correct the corrupted speech signal. In practice, deficiencies in speech acquisition and transmission systems result in a speech signal that is commonly corrupted by noise. This contamination not only degrades the audio quality of the speech but also reduces the performance of high-level speech processing applications, such as voice communication and automatic speech recognition, where efficient noise reduction techniques are required. Depending on the application, the goal of speech enhancement techniques is to reduce the noise so as to make the speech intelligible, to decrease annoyance, or to improve the overall sound quality. The most challenging case is single-channel speech enhancement, where only a single noisy speech recording is available for recovering the clean speech. Many algorithms have been proposed to solve this problem, such as spectral subtraction (SS) [1], minimum mean-square error (MMSE) estimators [2], [3], and Wiener filter based algorithms [4], [5]. Such noise reduction techniques rely mainly on the estimation of a short-time spectral gain which is a function of the a priori SNR and/or the a posteriori SNR computed for each frequency component. In [6], the authors highlighted the benefit of estimating the a priori SNR through the “decision-directed” (DD) approach initially proposed in [2]. The DD approach has since received a lot of attention due to its low computational complexity and good performance in various noise reduction applications. Unfortunately, the DD approach suffers from inherent errors (or bias) in estimating the spectral SNRs.
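The decision-directed recursion of [2] can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the smoothing factor alpha = 0.98, and the SNR floor xi_min are conventional but illustrative choices.

```python
import numpy as np

def decision_directed_snr(noisy_power, noise_power, prev_clean_power,
                          alpha=0.98, xi_min=10 ** (-25 / 10)):
    """One frame of the decision-directed a priori SNR estimate.

    noisy_power      : |Y(k)|^2, noisy spectrum power of the current frame
    noise_power      : estimated noise PSD for each frequency bin
    prev_clean_power : |S_hat(k)|^2 of the previous enhanced frame
    """
    # a posteriori SNR: gamma = |Y|^2 / sigma_d^2
    gamma = noisy_power / noise_power
    # DD estimate: weighted sum of the previous frame's enhanced SNR
    # and the current maximum-likelihood estimate max(gamma - 1, 0)
    xi = alpha * prev_clean_power / noise_power \
        + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
    # floor the estimate to limit musical noise
    return np.maximum(xi, xi_min), gamma
```

The heavy weighting of the previous frame (alpha close to 1) is what smooths out musical noise, and also what introduces the one-frame delay discussed below.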
Such errors directly affect the estimation of the short-time spectral gain, and inaccuracies in the gain function often lead to spectral attenuation (i.e., enhanced spectral components smaller in magnitude than the corresponding clean spectral components) and/or spectral amplification (i.e., enhanced spectral components larger in magnitude than the corresponding clean spectral components). Consequently, the intelligibility of the output speech may be impaired by the presence of these two kinds of distortion.
A first source of bias in the a priori SNR estimate is the one-frame delay introduced by the decision-directed estimator. This behavior has been analyzed in [5], where the authors demonstrated that, in regions of high a priori SNR, the a priori SNR estimate follows the shape of the a posteriori SNR with a one-frame delay. Consequently, since the spectral gain depends on the a priori SNR, it does not match the current frame, and the performance of the noise suppression system is degraded, especially during transient periods (speech to non-speech or non-speech to speech). To reduce this effect, a Two-Step Noise Reduction (TS-NR) algorithm that refines the estimation of the a priori SNR has been proposed in [7]. By suppressing the frame-delay bias, the TS-NR algorithm removes some of the drawbacks of the DD approach while maintaining its main advantage, i.e., a highly reduced musical noise level. This algorithm was selected by ITU-T in 2008 as the optional post-filter of the standardized G.711.1 codec (a multi-rate wideband extension of the well-known ITU-T G.711) to reduce the lower-band quantization noise at the decoder [8].
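The second-step idea of [7] can be sketched as follows: the gain computed from the DD estimate is applied to the *current* frame to form a delay-free clean-power estimate, from which the a priori SNR is re-estimated. This is a hedged sketch assuming a Wiener suppression rule; function names are illustrative.

```python
import numpy as np

def wiener_gain(xi):
    # Wiener suppression rule as a function of the a priori SNR
    return xi / (1.0 + xi)

def two_step_snr(noisy_power, noise_power, xi_dd):
    """Second step of a TS-NR-style refinement: re-estimate the a priori SNR
    from the first-step gain applied to the current frame, which removes the
    one-frame delay of the decision-directed estimate."""
    g1 = wiener_gain(xi_dd)                     # first-step gain from the DD SNR
    clean_power_est = (g1 ** 2) * noisy_power   # delay-free clean-power estimate
    return clean_power_est / noise_power        # refined a priori SNR
```

Because the refinement uses the current frame's observation, the resulting SNR trajectory tracks speech onsets and offsets without the lag of the DD recursion.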
Recently, a second source of bias in the a priori SNR estimate was identified in [9]. To assess the impact of SNR and gain over- and under-estimation on speech intelligibility, the authors of [9] conducted listening tests with differently biased spectral gain functions. More precisely, assuming perfect a priori knowledge of the short-time SNR, a controlled bias was introduced into the a priori SNR estimate. The listening tests indicated that SNR and gain-function over-estimation errors in frequency bins with negative SNR are particularly harmful to speech intelligibility. The authors therefore suggested that better methods are needed to estimate the spectral SNR from noisy observations, particularly at low input SNR levels, and that such methods hold promise for improving speech intelligibility.
In this work, we focus on the problem of over- and under-estimation of the spectral SNR from noisy observations. The proposed correction technique is applied to DD- and TS-NR-based Wiener filters, but it could also be applied to any recent MMSE-based technique relying on more than the a priori and a posteriori SNR estimates [10], [11], [12], [13]. Starting from the same experiments as those presented in [9], we consider a baseline noise reduction filter implemented as a short-time Wiener filter combined with a DD approach for the estimation of the a priori SNR. Assuming perfect a priori knowledge of the evolution of the short-time SNR, we analyze the over- and under-estimation errors of the a priori SNR. However, as opposed to [9], we do not introduce a known bias into the a priori SNR estimate. Instead, we use a well-known estimator, the DD approach, and analyze the SNR bias by comparing its estimate with the true evolution of the short-time SNR. By processing several noisy speech sentences from a training database and analyzing the SNR bias for all frequency bins within all short-time frames, we obtain an empirical estimate of the two-dimensional distribution of the SNR bias as a function of two main parameters: the a priori and the a posteriori SNRs. Once this 2-D bias distribution is available, we apply a k-means clustering method (Lloyd's algorithm) to segment the 2-D plane (a priori SNR and a posteriori SNR parameters) into m clusters, in which each observation (i.e., the bias) belongs to the cluster with the nearest mean. This results in a partitioning of the a priori/a posteriori SNR space into m Voronoi cells. The mean estimation error in each cluster is then used as a bias correction term for the a priori SNR of all observations belonging to that cluster in the 2-D space.
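The clustering-and-correction scheme described above can be sketched as follows. This is a simplified, self-contained illustration: the deterministic centroid initialization, the function names, and the toy table lookup are our assumptions, not the paper's implementation (the paper uses Lloyd's algorithm on training data from a speech database).

```python
import numpy as np

def lloyd_kmeans(features, m, iters=50):
    """Minimal Lloyd's algorithm over (a priori SNR, a posteriori SNR) pairs.
    Deterministic init: m points evenly spaced through the data set."""
    idx = np.linspace(0, len(features) - 1, m).astype(int)
    centroids = features[idx].astype(float)
    for _ in range(iters):
        # assign each observation to its nearest centroid (Voronoi cell)
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(m):
            if np.any(labels == j):
                centroids[j] = features[labels == j].mean(axis=0)
    return centroids, labels

def train_bias_table(xi_db, gamma_db, bias_db, m=8):
    """Cluster the (a priori, a posteriori) SNR plane (in dB) and store the
    mean SNR-estimation bias observed in each Voronoi cell."""
    feats = np.column_stack([xi_db, gamma_db])
    centroids, labels = lloyd_kmeans(feats, m)
    bias_table = np.array([bias_db[labels == j].mean() for j in range(m)])
    return centroids, bias_table

def correct_snr(xi_db, gamma_db, centroids, bias_table):
    """Subtract the cluster's mean bias from each bin's a priori SNR (dB)."""
    feats = np.column_stack([np.atleast_1d(xi_db), np.atleast_1d(gamma_db)])
    d = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=2)
    return np.atleast_1d(xi_db) - bias_table[d.argmin(axis=1)]
```

At run time, each time-frequency bin is mapped to its nearest centroid and its a priori SNR is corrected by the mean bias learned for that cell, which is the per-bin correction described in the next paragraph.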
Note that the proposed method corrects each frequency bin of the estimated SNR independently, according to its estimated bias.
The remainder of this paper is organized as follows. Section 2 presents a brief description of speech enhancement techniques that operate in the frequency domain. We then focus our analysis on the considered approach, which combines the DD approach with the two-step noise reduction (TS-NR) algorithm. The bias in the DD a priori SNR estimator is studied in Section 3, where the probability density function of this bias is analyzed in a 2-D space driven by the a posteriori and a priori SNR parameters. Section 3 also deals with the partitioning of the 2-D plane into m clusters in order to retain the main properties of the bias information. In Section 4, we propose a refinement of the noise reduction technique in which the DD and TS-NR a priori SNR estimates are modified according to the mean value (i.e., bias) of the error observed in the corresponding cluster. Simulations are carried out in Section 5 to assess the performance of the proposed technique. Objective evaluations in various environmental conditions show that the proposed modification is advantageous, particularly for low input SNRs. Excellent noise reduction can be achieved even in the most adverse noise conditions, while avoiding musical residual noise and the attenuation of weak speech components.
Frequency domain speech enhancement
In the additive noise model, the clean speech s(n) is corrupted by an independent zero-mean noise signal d(n). The resulting noisy speech is given in the time domain by y(n) = s(n) + d(n), where n is the discrete time index. To recover the clean speech from the noisy one, conventional noise reduction techniques use a noise reduction filter whose impulse response is designed to produce an enhanced speech corresponding to a trade-off between noise reduction and speech distortion.
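In the frequency domain, such a filter reduces to a real-valued gain applied bin by bin to the noisy short-time spectrum. As a minimal example, assuming the Wiener suppression rule G = xi / (1 + xi) driven by the a priori SNR xi (function names are illustrative):

```python
import numpy as np

def wiener_gain(xi):
    """Frequency-domain Wiener suppression rule G = xi / (1 + xi),
    where xi is the a priori SNR of each frequency bin."""
    return xi / (1.0 + xi)

def enhance_frame(noisy_spectrum, xi):
    # apply the spectral gain bin by bin to the noisy STFT frame
    return wiener_gain(xi) * noisy_spectrum
```

The gain tends to 1 for bins dominated by speech (large xi) and to 0 for bins dominated by noise, which is exactly why an accurate a priori SNR estimate is critical.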
Properties of SNR estimators
In this section, we concentrate our analysis on several important properties of the a priori SNR estimate as this parameter plays a central role in spectral noise reduction techniques. As described previously, estimating the a priori SNR through the “decision-directed” approach is widely used (instead of the a posteriori SNR) because it reduces the musical noise to an acceptable level. However, as we shall see, this a priori SNR estimate often leads to an under- or over-estimation of the SNR on
Bias compensation of the a priori SNR estimator
In this section, we address the problem of improving the SNR estimation. Several attempts have been made in the past to modify noise reduction techniques. Some of them have tried to modify the short-time spectral gain in a data-driven approach [18], [19], or to propose alternative analytical suppression rules [20], [21]. Other studies propose new approaches for the a priori SNR estimate such as robust estimators [22], [23], or noncausal estimators [24]. However, as seen previously, the a priori
Experimental results
In this section, the performance of the two proposed approaches is compared to that of the ‘conventional’ DD as well as TS-NR approaches. The bias-compensated approach is applied in two configurations, namely the ‘conventional’ DD and the TS-NR a priori SNR estimators previously described in Section 2. These two new techniques will be hereafter called Propositions I and II corresponding to the combination of the bias-compensated approach together with respectively (10) or (12). For testing, the
Conclusion
In this paper we have proposed an original method to analyze and to compensate for the over- and under-estimation of the a priori signal-to-noise-ratio in noise reduction systems. Starting with the conventional decision-directed approach coupled with the two step noise reduction algorithm, we have shown that a large bias can be observed in the estimated a priori SNR which causes serious speech distortion when the weight factor approaches one in speech activity periods when the spectral gain is
References (31)
- et al., Impact of SNR and gain-function over- and under-estimation on speech intelligibility, Speech Commun. (2012)
- et al., The importance of phase in speech enhancement, Speech Commun. (2011)
- et al., Speech enhancement for non-stationary noise environments, Signal Process. (2001)
- et al., A data-driven approach to optimizing spectral speech enhancement methods for various error criteria, Speech Commun. (2007)
- et al., SNR loss: a new objective measure for predicting the intelligibility of noise-suppressed speech, Speech Commun. (2011)
- Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust. Speech Signal Process. (Apr. 1979)
- et al., Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. (Dec. 1984)
- et al., Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. (Apr. 1985)
- et al., Speech enhancement based on a priori signal to noise estimation
- et al., Improved signal-to-noise ratio estimation for speech enhancement, IEEE Trans. Audio Speech Lang. Process. (2006)
- Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor, IEEE Trans. Speech Audio Process.
- A two-step noise reduction technique
- G.711.1: a wideband extension to ITU-T G.711
- Stochastic-deterministic MMSE STFT speech enhancement with general a priori information, IEEE Trans. Acoust. Speech Signal Process.
- An MMSE estimator for speech enhancement under a combined stochastic-deterministic speech model, IEEE Trans. Audio Speech Lang. Process.
Cited by (23)
A new efficient two-channel fast transversal adaptive filtering algorithm for blind speech enhancement and acoustic noise reduction
2019, Computers and Electrical Engineering
Citation excerpt: Hence, the adaptation gain in the proposed TCSFTF algorithm is only based on the forward prediction. The proposed algorithm performs well with non-stationary signals, unlike the classical TCNLMS algorithm [21–22]. Similarly, it does not suffer from computational complexity like the TCRLS algorithm [23–25].
An efficient inverse short-time Fourier transform algorithm for improved signal reconstruction by time-frequency synthesis: Optimality and computational issues
2017, Digital Signal Processing: A Review Journal
Authentication and recovery algorithm for speech signal based on digital watermarking
2016, Signal Processing
Citation excerpt: At the same time, after the attacked signals being detected, the recovery of the attacked content can minimize the user's loss. As to digital speech signal, there are a lot of achievements in the field of the research on speech enhancement [1–5] and speaker recognition [6–9]. As to content authentication, the method based on digital watermark provides a solution to verify the authenticity of speech signal.
Improved subband-forward algorithm for acoustic noise reduction and speech quality enhancement
2016, Applied Soft Computing Journal
Citation excerpt: In [5–8], several two and multi-sensors techniques have been proposed to separate the sources, which are mixed by a convolutive model [22]. Recently, several works have been conducted to increase convergence rates and reducing the steady-state estimation error (low misadjustment) of these different adaptive filtering algorithms in acoustic noise reduction and speech enhancement applications [23–38,1,39]. Furthermore, in the same direction, many variable step-size adaptive filtering algorithms for speech enhancement application have been also proposed [23–38,1,39].
Improving listeners' experience for movie playback through enhancing dialogue clarity in soundtracks
2016, Digital Signal Processing: A Review Journal
Citation excerpt: Some of the existing methods for reduction of musical noise employ median filtering [20], or the Ephraim–Malah algorithm [21,22]. Very efficient methods for reducing musical noise were also proposed by Djendi and Scalart [23], by improving the SNR estimation originally proposed by Ephraim and Malah. In this work we employ smoothing in the time domain (by exponential averaging) to suppress the residual noise in the reconstructed signal, mostly due to its low complexity.
Statistically Guided Near-End Speech Intelligibility Improvement Through Voice Transformation and Transfer Learning
2024, IEEE/ACM Transactions on Audio Speech and Language Processing
Mohamed Djendi received his first Ph.D. degree in Electronics-Signal and Telecommunications from the High National Polytechnic School (ENSP) of Algiers, Algeria, in 2006. He received, in January 2010, his second Ph.D. degree in Signal Processing and Telecommunications from the University of Rennes—IRISA/ENSSAT, France. He held a Postdoctoral position at the High National School of Technologies and Sciences (ENSSAT, France), engaging in research on Signal Processing and Communication Systems. During this postdoctoral period, he was supervised by Professor Pascal Scalart. From 2001 to 2011 and from 2012 to the present, he has been a full Professor and researcher at Blida University and in the LATSI research laboratory. His current research activities include, but are not limited to, adaptive filtering algorithms, noise reduction and speech enhancement algorithms, blind source separation, and digital communication.
Pascal Scalart received the M.S. (1989) and Ph.D. (1992) degrees in Signal Processing from the University of Rennes (France). In 1993, he held a postdoctoral position (Laval University, Canada) engaging in research on digital communications. From 1994 to 2001, he was with the R&D center of France Telecom working on speech signal processing techniques (noise reduction, acoustic echo cancellation, acoustic antennas, post-filters, ...). He has authored over 100 papers and holds several patents in key audio technologies such as the ITU-T G.711.1 speech coder and noise reduction for W-CDMA terminals. In 2001, he joined the engineering school ENSSAT, University of Rennes, where he is currently a Professor and head of the Electronic and Computer Engineering Department.