Abstract
Traditionally, most voice activity detection (VAD) methods are based on speech features such as spectrum, temporal energy, and periodicity. The robustness of these features plays a critical role in VAD performance. However, because these features are generated directly from the observed signal, their robustness degrades significantly in non-stationary noise environments, especially at low signal-to-noise ratio (SNR). This paper proposes a robust feature for VAD based on sparse representation with an optimized learned dictionary. A speech dictionary and a noise dictionary are first learned from a speech corpus and a noise corpus, respectively. An optimization algorithm is then designed to reduce the mutual coherence between the two learned dictionaries. The proposed feature is generated from the sparse representation over the optimized dictionary, and a VAD method is derived from this feature. The proposed method is evaluated over seven types of noise and four SNR levels. Experimental results show that the optimized dictionary is important for enhancing the robustness of the proposed method, and that the proposed method performs well under non-stationary noise, especially at low SNR.
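The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the dictionary learning procedure, the coherence-reduction step, or the exact feature. The sketch assumes dictionaries stored as NumPy arrays with one atom per column, uses plain orthogonal matching pursuit as a stand-in sparse coder, and uses the share of coefficient energy on speech atoms as a stand-in feature. All function names (`normalize_columns`, `cross_coherence`, `omp`, `speech_energy_ratio`) are hypothetical.

```python
import numpy as np

def normalize_columns(D):
    """Scale every atom (column) to unit Euclidean norm."""
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def cross_coherence(D_speech, D_noise):
    """Largest absolute inner product between a speech atom and a noise atom.

    This is the mutual coherence restricted to cross-dictionary pairs,
    the quantity an optimization step would try to reduce.
    """
    gram = normalize_columns(D_speech).T @ normalize_columns(D_noise)
    return float(np.max(np.abs(gram)))

def omp(D, x, n_nonzero):
    """Plain orthogonal matching pursuit (assumed stand-in sparse coder).

    D must have unit-norm columns. Returns a coefficient vector with at
    most n_nonzero non-zero entries.
    """
    residual = x.astype(float).copy()
    support = []
    coef = np.zeros(D.shape[1])
    sol = np.zeros(0)
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:  # no new atom improves the fit
            break
        support.append(k)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef

def speech_energy_ratio(D_speech, D_noise, x, n_nonzero=5):
    """Assumed illustrative feature: share of coefficient energy on speech atoms."""
    D = normalize_columns(np.hstack([D_speech, D_noise]))
    coef = omp(D, x, n_nonzero)
    n_speech = D_speech.shape[1]
    total = float(np.sum(coef ** 2))
    if total == 0.0:
        return 0.0
    return float(np.sum(coef[:n_speech] ** 2)) / total
```

A frame would then be labeled as speech when `speech_energy_ratio` exceeds a threshold. The lower the cross-dictionary coherence, the less a noise frame can be explained by speech atoms, which is why an incoherent dictionary pair makes such a feature more discriminative.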
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant Nos. 91120303, 91220301, and 61170243, and by the Ph.D. Programs Foundation of the Ministry of Education of China (No. 20112302110042).
About this article
Cite this article
You, D., Han, J., Zheng, G. et al. Sparse Representation with Optimized Learned Dictionary for Robust Voice Activity Detection. Circuits Syst Signal Process 33, 2267–2291 (2014). https://doi.org/10.1007/s00034-014-9748-y
DOI: https://doi.org/10.1007/s00034-014-9748-y