Clean speech/speech with background music classification using HNGD spectrum

Khonglah, Banriskhem K.; Prasanna, S. R. Mahadeva

doi:10.1007/s10772-017-9464-7

Published: 16 October 2017

Volume 20, pages 1023–1036, (2017)

International Journal of Speech Technology

Abstract

This work examines the spectral characteristics of the vocal tract system to derive features that discriminate clean speech from speech with background music. The vocal tract spectrum is represented by the Hilbert envelope of the numerator of the group delay (HNGD) spectrum. This representation complements existing methods of computing spectral characteristics in terms of temporal resolution. The HNGD spectrum has additive and high-resolution properties, which give a better representation of the formants, especially the higher ones. The feature extracted from the HNGD spectrum is the spectral contrast across sub-bands, which represents the relative spectral characteristics of the vocal tract system. The vocal tract system is also represented approximately by mel frequency cepstral coefficients (MFCCs), which capture the average spectral characteristics. The MFCCs and the sum of the spectral contrast on the HNGD spectrum therefore represent the average and relative spectral characteristics of the vocal tract system, respectively. These features complement each other and can be combined in a multidimensional framework to discriminate well between clean speech and speech with background music segments. The spectral contrast on the HNGD spectrum is compared with the spectral contrast on the discrete Fourier transform (DFT) spectrum, which also represents the relative spectral characteristics of the vocal tract system; better performance is observed on the HNGD spectrum than on the DFT spectrum. The features are classified using classifiers such as Gaussian mixture models and support vector machines.
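
As a rough illustration of the sub-band spectral contrast idea, the sketch below computes a peak-minus-valley log contrast per sub-band and sums it over the bands for a single frame. It is not the authors' implementation: it operates on the DFT magnitude spectrum (the baseline named in the abstract) rather than the HNGD spectrum, and the function names, the equal-width band split, the number of bands `n_bands`, and the neighbourhood fraction `alpha` are all assumptions chosen for illustration.

```python
import numpy as np


def subband_spectral_contrast(frame, n_bands=6, alpha=0.2, eps=1e-10):
    """Log peak-minus-valley contrast in each sub-band of one frame.

    Assumptions (not taken from the paper): Hamming window, DFT magnitude
    spectrum, equal-width sub-bands, and peaks/valleys estimated as the mean
    of the top/bottom `alpha` fraction of bins in each band.
    """
    windowed = frame * np.hamming(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))

    # Split the one-sided spectrum into n_bands roughly equal sub-bands.
    bands = np.array_split(magnitude, n_bands)

    contrast = np.empty(n_bands)
    for i, band in enumerate(bands):
        ordered = np.sort(band)
        k = max(1, int(round(alpha * len(ordered))))
        valley = np.log(np.mean(ordered[:k]) + eps)   # weakest bins
        peak = np.log(np.mean(ordered[-k:]) + eps)    # strongest bins
        contrast[i] = peak - valley
    return contrast


def summed_contrast(frame, **kwargs):
    """Sum of the sub-band contrasts: a single scalar per frame."""
    return float(np.sum(subband_spectral_contrast(frame, **kwargs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(512)  # stand-in for one speech frame
    print(subband_spectral_contrast(frame))
    print(summed_contrast(frame))
```

Because each band's contrast is a difference of log magnitudes, it is insensitive to the overall level of the frame, which is one way to read the abstract's description of the feature as "relative" rather than "average" spectral information. Substituting an HNGD spectrum for `magnitude` would be the step that corresponds to the proposed feature.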

Acknowledgements

This work is part of the project titled “Multi-modal Broadcast Analytics: Structured Evidence Visualization for Events of Security Concern” funded by the e-Security division of the Department of Electronics & Information Technology (DeitY), Govt. of India.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
Banriskhem K. Khonglah & S. R. Mahadeva Prasanna

Corresponding author

Correspondence to Banriskhem K. Khonglah.

About this article

Cite this article

Khonglah, B.K., Prasanna, S.R.M. Clean speech/speech with background music classification using HNGD spectrum. Int J Speech Technol 20, 1023–1036 (2017). https://doi.org/10.1007/s10772-017-9464-7

Received: 07 September 2017
Accepted: 21 September 2017
Published: 16 October 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10772-017-9464-7
