Frame-wise model re-estimation method based on Gaussian pruning with weight normalization for noise robust voice activity detection

doi:10.1016/j.specom.2011.08.005

Speech Communication

Volume 54, Issue 2, February 2012, Pages 229-244

Abstract

This paper proposes a robust voice activity detection (VAD) method that operates in the presence of noise. For noise robust VAD, we have already proposed statistical models and a switching Kalman filter (SKF)-based technique. In this paper, we focus on a model re-estimation method using Gaussian pruning with weight normalization. The statistical model for SKF-based VAD is constructed using Gaussian mixture models (GMMs), and consists of pre-trained silence and clean speech GMMs and a sequentially estimated noise GMM. However, the composed model is not optimal in that it does not fully reflect the characteristics of the observed signal. Thus, to ensure the optimality of the composed model, we investigate a method for its re-estimation that reflects the characteristics of the observed signal sequence. Since our VAD method uses frame-wise sequential processing, processing with the smallest possible latency is very important. In this case, there are insufficient re-training data for a re-estimation of all the Gaussian parameters. To solve this problem, we propose a model re-estimation method that involves the extraction of reliable characteristics using Gaussian pruning with weight normalization. Namely, the proposed method re-estimates the model by pruning the non-dominant Gaussian distributions, keeping only those that express the local characteristics of each frame, and by normalizing the Gaussian weights of the remaining distributions. In an experiment using a speech corpus for VAD evaluation, CENSREC-1-C, the proposed method significantly improved the VAD performance compared with that of the original SKF-based VAD. This result confirmed that the proposed Gaussian pruning contributes to an improvement in VAD accuracy.

Highlights

► We propose a model re-estimation method for statistical model-based VAD.
► The method extracts reliable features by Gaussian pruning with weight normalization.
► Gaussian pruning reduces non-dominant Gaussian distributions at each frame.
► The remaining distributions are enhanced by normalizing the Gaussian weights.
► The accurate VAD decision is obtained by likelihoods of the remaining distributions.

Introduction

Voice activity detection (VAD), which automatically detects the periods containing a target speech signal in a continuously observed signal, plays a crucial role in the development of speech processing technology. VAD can be employed in various speech-oriented technology fields, e.g., speech enhancement, speech coding for cellular or IP phones, and the front-end processing of automatic speech recognition. Thus, many VAD methods have been proposed and their performance has been discussed.

VAD usually consists of two parts: a part for extracting a distinctive feature parameter and a part for detecting the absence of speech or speech activity. The feature extraction part extracts acoustic features that distinguish between speech absence and speech activity. Traditionally, the zero-crossing rate and the energy difference between non-speech and speech (Rabiner and Sambur, 1975; ETSI EN 301 708 v.7.1.1, 1999) have been widely used as distinctive feature parameters. However, these traditional parameters are not robust in the presence of interference noises, and so various noise robust features and their combinations have been proposed as follows:

• Line spectra, all-band energy, low-band energy, and zero crossing ratio (ITU-T Recommendation G.729 Annex B, 1996).
• All-band spectra, sub-band spectra of output signal of Wiener filter, and variance of spectra (ETSI ES 202 050 v.1.1.4, 2006).
• Higher order statistics (Nemer et al., 2001; Li et al., 2005; Cournapeau and Kawahara, 2007).
• Long-term spectral divergence (Ramírez et al., 2004).
• Speech periodicity (Kristjansson et al., 2005) and speech periodic component to aperiodic component ratio (Ishizuka et al., 2010).
• Volatility (covariance variance) obtained from generalized autoregressive conditional heteroskedasticity (GARCH) filtering (Tahmasbi and Razaei, 2007; Kato et al., 2008).

These parameters can improve VAD accuracy; however, employing robust feature parameters alone is insufficient for noise robust VAD. In particular, in a low signal-to-noise ratio (SNR) environment, the discriminative characteristics of the feature parameter inevitably degrade due to the strong noise energy, even if noise robust feature parameters are used. This problem suggests the importance of a decision mechanism for noise robust VAD. If a robust decision mechanism is introduced into VAD, the VAD accuracy will improve, even if the discriminative characteristics of the feature parameter become unreliable owing to severe interference noise.

Statistical model-based VAD techniques have been proposed as some of the most robust decision mechanisms (Sohn et al., 1999; Ramírez et al., 2007; Fujimoto and Ishizuka, 2007; Weiss and Kristjansson, 2008; Ramírez et al., 2005; Chang et al., 2006; Fujimoto et al., 2008). Most of these methods define a non-speech/speech state transition model, and calculate the likelihood ratio of the speech state to the non-speech state. Whether speech is absent or active is decided by a likelihood ratio test (LRT) with a threshold.

The method proposed by Sohn et al. (1999) is a fundamental technique for statistical model-based VAD. However, assumptions regarding stationary noise environments and a priori knowledge are indispensable to Sohn’s method, and its applicable noise environments are restricted to specific ones. In most cases, a noise observed in the real world has non-stationary characteristics and is unknown in advance. Thus, robustness in the presence of non-stationary noise without a priori knowledge of the noise is the most important factor for robust and useful VAD in the real world. For VAD based on such assumptions, we have already proposed a technique that employs a switching Kalman filter (SKF), which is robust in non-stationary noise environments (Fujimoto and Ishizuka, 2007).

SKF-based VAD consists of the following three steps:

(1) Noise estimation step: the parameters of a noise Gaussian mixture model (GMM) are estimated based on SKF by using the mean and the variance parameters of silence and clean speech GMMs, which are estimated in advance by using clean speech corpora.
(2) Composition step: two internal states of non-speech (silence + noise) and speech (clean speech + noise) in the noisy speech (observed signal) model are constructed by composing silence and clean speech GMMs with the estimated noise GMM as shown in Fig. 1.
(3) Discrimination step: speech absence or activity in the observed signal is discriminated by using the likelihood ratio of the observed signal between non-speech and speech GMMs.
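The discrimination step can be illustrated with a minimal sketch, assuming diagonal-covariance GMMs and a fixed threshold. The function names, the toy one-dimensional models, and the threshold value are hypothetical; the sketch omits the SKF noise estimation, the model composition, and the state transition model, and is not the authors' implementation.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Per-component log N(x; mean_k, diag(var_k)).
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over components for numerical stability.
    peak = np.max(log_comp)
    return float(peak + np.log(np.sum(np.exp(log_comp - peak))))

def is_speech(x, nonspeech_gmm, speech_gmm, threshold=0.0):
    """Likelihood ratio test on one frame (threshold value is an assumption)."""
    llr = gmm_log_likelihood(x, *speech_gmm) - gmm_log_likelihood(x, *nonspeech_gmm)
    return bool(llr > threshold), llr

# Toy one-dimensional models: (weights, means, variances). Values are made up.
nonspeech_gmm = ([0.5, 0.5], [[0.0], [1.0]], [[1.0], [1.0]])
speech_gmm = ([0.5, 0.5], [[5.0], [6.0]], [[1.0], [1.0]])

decision_high, llr_high = is_speech([5.5], nonspeech_gmm, speech_gmm)
decision_low, llr_low = is_speech([0.5], nonspeech_gmm, speech_gmm)
```

In this toy setting a frame near the speech means yields a positive log-likelihood ratio and a frame near the non-speech means yields a negative one; in practice the threshold trades off false acceptance against false rejection.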

In this framework, the characteristics of the observed signal are not directly reflected in the composition step, unlike in the noise estimation step. Namely, the composition step lacks optimality in that it does not take the observed signal into account. Thus, one of the most important issues for our proposed approach is finding a way to ensure the optimality of the noisy speech model by incorporating the characteristics of the observed signal. A standard way of achieving this is to employ a Gaussian parameter re-estimation method, e.g., maximum likelihood (ML) estimation or maximum a posteriori (MAP) estimation, in the noisy speech model by using the observed signal at the composition step. However, since VAD requires the smallest possible latency in many applications, a re-estimation method that works with a single frame sample is desired in the composition step. In this paper, the frame length and the frame shift length are set at 20 and 10 ms, respectively. Thus, the frame interval (algorithm latency) is 10 ms. In this situation, it is almost impossible to re-estimate all of the Gaussian parameters by ML or MAP estimation due to the well-known over-fitting problem, and frame-by-frame processing-based VAD using statistical approaches suffers inherently from this problem.
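The over-fitting problem with a single frame can be seen in a small sketch: the ML estimate of a Gaussian from one sample places the mean on that sample and collapses the variance to zero. The function name and the feature values below are hypothetical and serve only as an illustration.

```python
import numpy as np

def ml_gaussian_estimate(frames):
    """ML mean and (biased) variance of a diagonal Gaussian from the given frames."""
    frames = np.atleast_2d(np.asarray(frames, dtype=float))
    mean = frames.mean(axis=0)
    variance = ((frames - mean) ** 2).mean(axis=0)
    return mean, variance

# One frame of a made-up three-dimensional feature vector.
one_frame = [[1.2, -0.7, 3.1]]
mean, variance = ml_gaussian_estimate(one_frame)
# The mean equals the sample and every variance is exactly zero,
# so the re-estimated Gaussian is degenerate.
```

A zero variance makes the Gaussian density undefined, which is one way to see why re-estimating every parameter from one frame is not viable.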

To satisfy the latency requirement and to avoid this over-fitting problem, we propose a model re-estimation method from a different standpoint. The method proposed here is a Gaussian pruning method that selects the dominant Gaussian distributions of the non-speech and speech GMMs depending on the observed signal of the current frame. Thus, this method extracts the Gaussian distributions that best represent the observed signal in the current frame. Fig. 2 shows the processing flow of our VAD method, and the proposed Gaussian pruning method is employed after the composition step.

The composed non-speech and speech GMMs each include several Gaussian distributions, and each Gaussian distribution represents a different characteristic of a noisy speech signal. Since the observed signal may have various characteristics, many Gaussian distributions are usually required to represent it satisfactorily and to achieve an exact likelihood calculation. However, this requirement applies only when the likelihood is calculated over all the frames of an observed signal. In sequential processing, only partial aspects of the whole speech characteristics are assumed to appear in each frame. Therefore, a small number of Gaussian distributions would dominate the others in expressing the local characteristics of the observed signal in each frame. On the other hand, non-dominant Gaussian distributions may be not only unimportant but also harmful because of the serious mismatch between the observed signal and those distributions.

Under these assumptions, we investigate a way of improving the optimality of the noisy speech model by using Gaussian pruning-based model re-estimation. Then, we also apply Gaussian weight normalization to the remaining dominant Gaussian distributions in each frame. This normalization method can enhance the likelihood given by each remaining Gaussian distribution, and improves the effectiveness of the proposed Gaussian pruning.
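One way to write pruning with weight normalization is sketched below. The notation is an assumption made for illustration (the excerpt does not give the paper's own equations): o_t is the feature vector of frame t, j indexes the non-speech or speech GMM, k indexes its K Gaussians, and S_{j,t} is the set of dominant Gaussians kept at frame t. Any time dependence of the composed parameters is omitted for brevity.

```latex
% Posterior probability of Gaussian k of model j given the current frame.
P(k \mid o_t, j) =
  \frac{w_{j,k}\,\mathcal{N}(o_t; \mu_{j,k}, \Sigma_{j,k})}
       {\sum_{k'=1}^{K} w_{j,k'}\,\mathcal{N}(o_t; \mu_{j,k'}, \Sigma_{j,k'})}

% Weights of the remaining Gaussians, renormalized to sum to one (assumed form).
\hat{w}_{j,k,t} = \frac{w_{j,k}}{\sum_{k' \in S_{j,t}} w_{j,k'}}, \qquad k \in S_{j,t}

% Frame likelihood computed from the remaining Gaussians only.
\hat{p}(o_t \mid j) = \sum_{k \in S_{j,t}} \hat{w}_{j,k,t}\,\mathcal{N}(o_t; \mu_{j,k}, \Sigma_{j,k})
```

Under this reading, each renormalized weight is at least as large as the original weight, so the contribution of every remaining Gaussian to the likelihood is enhanced.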

Usually, the characteristics of the ambient noise and speech do not change suddenly between neighboring frames if these sound sources are in a stable state. However, if the sound sources are in a transient state, i.e., speech onset, speech offset, or the appearance of burst noise, the acoustical characteristics of an observed signal will vary greatly within a short time. In addition, this sudden change is very difficult to predict. Thus, continual frame-wise optimization is a reasonable way to cope with unpredictable sudden changes in the sound sources included in the observed signal.

We employ the posterior probability as the Gaussian pruning criterion. Namely, the Gaussian distributions with high posterior probabilities are extracted as dominant Gaussian distributions for likelihood calculation. We investigate and evaluate three techniques as the posterior probability-based pruning method, i.e., N-best selection, threshold processing for each posterior probability, and threshold processing for posterior probability accumulation. Furthermore, we investigate a method for determining the pruning parameters of each Gaussian pruning technique, e.g., the sequential determination of the pruning threshold or the number of selected distributions, and prove that our pruning parameter determination method satisfies the aim of the Gaussian pruning in terms of sequential model re-estimation.
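The three posterior-based pruning rules can be sketched as follows. This is an illustrative reading rather than the paper's implementation: the function names are hypothetical, the assignment of the parameters n, beta, and z to the three rules is assumed, the component likelihoods are made-up numbers, and the weight renormalization divides by the sum of the kept weights.

```python
import numpy as np

def posteriors(weights, comp_likelihoods):
    """Posterior probability of each Gaussian given the current frame."""
    joint = np.asarray(weights, dtype=float) * np.asarray(comp_likelihoods, dtype=float)
    return joint / joint.sum()

def prune_n_best(post, n):
    """Keep the n Gaussians with the highest posterior probability."""
    keep = np.zeros(post.shape, dtype=bool)
    keep[np.argsort(post)[::-1][:n]] = True
    return keep

def prune_by_threshold(post, beta):
    """Keep every Gaussian whose posterior is at least beta."""
    return post >= beta

def prune_by_accumulation(post, z):
    """Keep Gaussians in descending posterior order until their sum reaches z."""
    order = np.argsort(post)[::-1]
    accumulated = np.cumsum(post[order])
    count = min(int(np.searchsorted(accumulated, z)) + 1, post.size)
    keep = np.zeros(post.shape, dtype=bool)
    keep[order[:count]] = True
    return keep

def pruned_likelihood(weights, comp_likelihoods, keep):
    """Likelihood from the kept Gaussians with renormalized weights.

    Assumes at least one Gaussian is kept.
    """
    kept_weights = np.asarray(weights, dtype=float)[keep]
    kept_weights = kept_weights / kept_weights.sum()
    return float(np.sum(kept_weights * np.asarray(comp_likelihoods, dtype=float)[keep]))

# Made-up mixture weights and per-Gaussian likelihoods for one frame.
weights = [0.4, 0.3, 0.2, 0.1]
comp_likelihoods = [0.5, 0.05, 0.01, 0.4]
post = posteriors(weights, comp_likelihoods)

keep_n = prune_n_best(post, n=2)
keep_beta = prune_by_threshold(post, beta=0.1)
keep_z = prune_by_accumulation(post, z=0.9)
full = float(np.dot(weights, comp_likelihoods))
pruned = pruned_likelihood(weights, comp_likelihoods, keep_n)
```

With these toy numbers all three rules keep the first and fourth Gaussians, and the likelihood computed from the kept Gaussians with renormalized weights is larger than the likelihood of the full mixture.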

The Gaussian pruning technique is usually used for reducing computational complexity without degrading speech recognition accuracy (Ogawa and Takahashi, 2008; Shinoda and Iso, 2002; Fischer and Rob, 1999). Most Gaussian pruning techniques reduce the number of Gaussian distributions by merging similar distributions using certain clustering methods. These methods involve the prior processing of speech recognition, and the characteristics of the observed signal are not usually reflected in the pruning result. On the other hand, our proposed method involves the posterior processing of VAD, and improves the optimality of the noisy speech model by using the characteristics of the observed signal. The proposed method does not aim to reduce computational complexity, which makes it different from conventional pruning techniques.

The proposed method was evaluated with the CENSREC-1-C database (corpora and environments for noisy speech recognition-1 concatenated) (Kitaoka et al., 2007), which is Japanese noisy speech data for VAD evaluation. The evaluation results revealed that the proposed Gaussian pruning method significantly improves VAD accuracy compared with the conventional methods. In particular, we confirmed that Gaussian pruning contributes greatly to the improvement of VAD accuracy.

In this paper, Section 2 reviews statistical model-based VAD and SKF-based VAD, Section 3 describes Gaussian pruning in detail, Section 4 describes a frame-wise method for determining the pruning parameter, Section 5 reports evaluation results obtained with the CENSREC-1-C framework, and Section 6 summarizes the paper and describes future directions.

Section snippets

Reviews of statistical model-based VAD and SKF-based VAD

This section reviews statistical model-based VAD (Sohn et al., 1999) and our previous work, SKF-based VAD (Fujimoto and Ishizuka, 2007).

Problem of frame-wise model re-estimation

SKF-based VAD discriminates between the speech absence and speech activity of an observed signal by using the likelihood ratio of non-speech and speech GMMs. The non-speech and speech GMMs, namely, noisy speech GMMs, consist of the parameters derived using Eqs. (7), (8). With this method, the noise parameters are optimally estimated by the SKF, because the estimation scheme of the Kalman filter ensures an optimal estimation result by reflecting the information of the observed signal. However,

Frame-wise pruning parameter determination based on maximum likelihood criterion

With the Gaussian pruning methods described in Section 3.3, we have to set appropriate pruning parameters depending on variations in the speakers or in the noise environments. In addition, each pruning parameter is independent of the frame index t and the model index j. Thus, these methods represent a stationary determination property for pruning parameters N, β, and Z, because each pruning parameter is manually adjusted to a constant value. This stationary determination is contrary to the aim

Evaluation database

The proposed method was evaluated by using CENSREC-1-C (Kitaoka et al., 2007). CENSREC-1-C was designed as an evaluation framework for VAD in noisy environments and has two types of evaluation data sets, i.e., artificial data and real data. In this paper, we chose the real data set for the evaluation.

The real data were recorded in two real noisy environments (a restaurant and a street) with two different sound pressure levels (high SNR and low SNR). The low SNR recordings were made in a crowded

Conclusion

This paper presented Gaussian pruning techniques for statistical model-based VAD. The aim of Gaussian pruning is to realize an exact likelihood calculation that reflects the local characteristics of the observed signal. We investigated three different Gaussian pruning techniques based on posterior probability and corresponding pruning parameter determination techniques. The evaluation results show that our proposed method significantly improves VAD accuracy compared with our previous work,

Acknowledgements

The present study was conducted using a CENSREC-1 database and a CENSREC-1-C database developed by the IPSJ-SIG SLP Noisy Speech Recognition Evaluation Working Group.

References (28)

Ishizuka, K., et al., 2010. Noise robust voice activity detection based on periodic to aperiodic component ratio. Speech Commun.
Ramírez, J., et al., 2004. Efficient voice activity detection algorithm using long-term speech information. Speech Commun.
Chang, J.H., et al., 2006. Voice activity detection based on multiple statistical models. IEEE Trans. Signal Process.
Cournapeau, D., Kawahara, T., 2007. Evaluation of real-time voice activity detection based on high order statistics....
ETSI EN 301 708 v.7.1.1, 1999. Voice activity detector (VAD) for adaptive multi-rate (AMR) speech traffic...
ETSI ES 202 050 v.1.1.4, 2006. Speech processing, transmission and quality aspects (STQ), advanced distributed speech...
Fischer, V., Rob, T., 1999. Reduced Gaussian mixture models in a large vocabulary continuous speech recognizer. In:...
Fujimoto, M., Ishizuka, K., 2007. Noise robust voice activity detection based on switching Kalman filter. In: Proc. of...
Fujimoto, M., Ishizuka, K., Nakatani, T., 2008. A voice activity detection based on the adaptive integration of...
Fujimoto, M., Ishizuka, K., Nakatani, T., 2009. A study of mutual front-end processing method based on statistical...
Hirsch, H.G., Pearce, D., 2000. The AURORA experimental framework for the performance evaluations of speech recognition...
Hori, T., et al., 2007. Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition. IEEE Trans. Audio Speech Lang. Process.
ITU-T Recommendation G.729 Annex B, 1996. A silence compression scheme for G.729 optimized for terminals conforming to...
Kato, H., et al., 2008. A voice activity detection based on an adjustable linear prediction and GARCH models. Speech Commun.
