Speech and music classification using spectrogram based statistical descriptors and extreme learning machine

Birajdar, Gajanan K.; Patil, Mukesh D.

doi:10.1007/s11042-018-6899-z

Published: 28 November 2018

Volume 78, pages 15141–15168, (2019)

Multimedia Tools and Applications

Abstract

This article proposes a novel feature extraction approach for speech/music classification based on generalized Gaussian distribution descriptors extracted from an infinite impulse response constant-Q transform (IIR-CQT) spectrogram representation. The IIR-CQT spectrogram provides superior temporal resolution at high frequencies and better spectral resolution at low frequencies compared with conventional short-time Fourier transform analysis, which provides uniform frequency resolution. Multi-level decomposition of the spectrogram image is then performed using the Nonsubsampled Contourlet Transform (NSCT), which is a fully shift-invariant, multi-scale, and multi-direction expansion that can preserve the edges of the textural patterns of speech and music. The generalized Gaussian distribution (GGD) parameters are estimated from the NSCT subbands using maximum likelihood estimation (MLE) to create the image feature descriptor. A chaotic crow search algorithm is employed to select the most relevant feature subset and discard redundant features, and finally an extreme learning machine classifier categorizes each input audio segment as speech or music. The experimental results show that the proposed feature descriptor is effective and outperforms existing approaches to speech/music classification. In addition, results for mismatched training and testing conditions are presented.
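
To make the GGD descriptor step concrete, the following is a minimal sketch of maximum likelihood estimation of the GGD scale (alpha) and shape (beta) for one subband. It is not taken from the article. It assumes zero-mean coefficients and the density p(x) = beta / (2 alpha Gamma(1/beta)) exp(-(|x|/alpha)^beta). For a fixed beta the ML scale has the closed form alpha^beta = (beta/N) sum |x_i|^beta, so the sketch maximises the resulting profile likelihood over beta with a golden-section search. The function names, the search bounds, and the idea of concatenating one (alpha, beta) pair per subband are illustrative assumptions, and the subbands are stand-in random data rather than real NSCT output.

```python
import math
import random


def profile_loglik(abs_x, beta):
    """Mean GGD log-likelihood at shape beta, with the scale alpha set to its
    closed-form ML value for that beta. Returns (loglik, alpha)."""
    n = len(abs_x)
    s = sum(v ** beta for v in abs_x)
    if s <= 0.0:
        raise ValueError("all coefficients are zero; GGD fit is undefined")
    log_alpha = math.log(beta * s / n) / beta
    # At the ML scale, the mean of (|x|/alpha)^beta equals 1/beta.
    loglik = (math.log(beta) - math.log(2.0) - log_alpha
              - math.lgamma(1.0 / beta) - 1.0 / beta)
    return loglik, math.exp(log_alpha)


def fit_ggd(coeffs, lo=0.2, hi=5.0, tol=1e-4):
    """Return (alpha, beta) maximising the GGD likelihood of zero-mean coeffs.

    The shape is located by golden-section search on [lo, hi]; the bounds and
    tolerance are illustrative choices, not values from the article.
    """
    abs_x = [abs(c) for c in coeffs]

    def objective(shape):
        return profile_loglik(abs_x, shape)[0]

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = objective(c)
        else:        # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = objective(d)
    beta = 0.5 * (a + b)
    return profile_loglik(abs_x, beta)[1], beta


def ggd_descriptor(subbands):
    """Concatenate (alpha, beta) over an iterable of coefficient sequences
    (hypothetical helper; each sequence stands in for one NSCT subband)."""
    features = []
    for band in subbands:
        features.extend(fit_ggd(band))
    return features


if __name__ == "__main__":
    random.seed(0)
    # Stand-in "subbands": Laplacian (beta = 1) and Gaussian (beta = 2) samples.
    laplace = [random.choice((-1, 1)) * random.expovariate(1 / 2.0)
               for _ in range(20000)]
    gauss = [random.gauss(0.0, 1.5) for _ in range(20000)]
    print(ggd_descriptor([laplace, gauss]))
```

With this parameterisation a Laplacian with scale b corresponds to beta = 1 and alpha = b, and a Gaussian with standard deviation sigma corresponds to beta = 2 and alpha = sigma * sqrt(2), which gives a quick sanity check on synthetic data.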

Acknowledgments

The authors would like to thank Professor Dan Ellis for providing the Scheirer & Slaney database.

Author information

Authors and Affiliations

Department of Electronics Engineering, Ramrao Adik Institute of Technology, Nerul, Navi Mumbai, Maharashtra, 400706, India
Gajanan K. Birajdar
Department of Electronics & Telecommunication Engineering, Ramrao Adik Institute of Technology, Nerul, Navi Mumbai, Maharashtra, 400706, India
Mukesh D. Patil

Corresponding author

Correspondence to Gajanan K. Birajdar.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Birajdar, G.K., Patil, M.D. Speech and music classification using spectrogram based statistical descriptors and extreme learning machine. Multimed Tools Appl 78, 15141–15168 (2019). https://doi.org/10.1007/s11042-018-6899-z

Received: 04 May 2018
Revised: 02 November 2018
Accepted: 13 November 2018
Published: 28 November 2018
Issue Date: 15 June 2019
DOI: https://doi.org/10.1007/s11042-018-6899-z
