Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection

Li, Yanxiong; Zhang, Xue; Jin, Hai; Li, Xianku; Wang, Qin; He, Qianhua; Huang, Qian

doi:10.1007/s11042-016-4332-z

Published: 06 January 2017

Volume 77, pages 897–916, (2018)

Multimedia Tools and Applications

Yanxiong Li¹ (ORCID: orcid.org/0000-0003-4362-1125), Xue Zhang¹, Hai Jin¹, Xianku Li¹, Qin Wang¹, Qianhua He¹ & Qian Huang¹

Abstract

The extraction of effective audio features from acoustic events strongly influences the performance of an Acoustic Event Detection (AED) system, especially in adverse audio conditions. In this study, we propose a framework for extracting a Deep Audio Feature (DAF) using a multi-stream hierarchical Deep Neural Network (DNN). The DAF output by the proposed framework fuses the potentially complementary information of multiple input feature streams and can therefore be more discriminative for AED than the input features themselves. As an example, we show the extraction of the DAF using two input feature streams and a hierarchical DNN with two stages. The effectiveness of different audio features for AED is evaluated on two audio corpora, the BBC (British Broadcasting Corporation) audio dataset and a TV audio dataset, at different signal-to-noise ratios. Experimental results show that the DAF outperforms the other features for AED under several experimental conditions.
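
The example configuration named in the abstract (two input feature streams, a hierarchical DNN with two stages) can be pictured as a simple data flow. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the stream types, layer sizes, activation function, fusion rule, or training procedure. The feature dimensions, the sigmoid layers, the fusion by concatenation, and the names `init_mlp`, `forward`, and `extract_daf` are all hypothetical, and the weights are random rather than trained, so only the shapes are meaningful.

```python
import numpy as np


def init_mlp(rng, sizes):
    """Create random (untrained) weights for a fully connected network."""
    return [(rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(x, layers):
    """Pass x through every layer, applying a sigmoid after each one."""
    for weights, bias in layers:
        x = 1.0 / (1.0 + np.exp(-(x @ weights + bias)))
    return x


def extract_daf(stream_a, stream_b, net_a, net_b, net_fuse):
    """Stage 1: one DNN per stream. Stage 2: one DNN on the fused outputs."""
    h_a = forward(stream_a, net_a)              # stage-1 output, stream A
    h_b = forward(stream_b, net_b)              # stage-1 output, stream B
    fused = np.concatenate([h_a, h_b], axis=1)  # assumed fusion: concatenation
    return forward(fused, net_fuse)             # stage-2 output used as the DAF


rng = np.random.default_rng(0)
n_frames = 5

# Hypothetical per-frame feature dimensions for the two input streams.
stream_a = rng.standard_normal((n_frames, 39))
stream_b = rng.standard_normal((n_frames, 60))

# Hypothetical layer sizes; the stage-2 input width is 32 + 32 = 64.
net_a = init_mlp(rng, [39, 128, 32])
net_b = init_mlp(rng, [60, 128, 32])
net_fuse = init_mlp(rng, [64, 128, 40])

daf = extract_daf(stream_a, stream_b, net_a, net_b, net_fuse)
print(daf.shape)  # (5, 40): one 40-dimensional vector per frame
```

In this sketch each frame yields one fused vector that a downstream classifier could consume in place of either input stream. In a real system the networks would be trained on labelled acoustic events before the stage-2 activations are used as features.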

Notes

http://speech.fit.vutbr.cz/software/neural-network-trainer-tnet

Acknowledgments

The work was supported by the National Natural Science Foundation of China (61101160, 61271314, 61571192), the Fundamental Research Funds for the Central Universities, South China University of Technology, China (2015ZZ102), Project of the Pearl River Young Talents of Science and Technology in Guangzhou, China (2013J2200070), Science and Technology Planning Project of Guangdong Province (2014A050503022, 2015A010103003) and the Foundation of China Scholarship Council (201208440078).

Author information

Authors and Affiliations

School of Electronic and Information Engineering, South China University of Technology, 381 Wushan Road, Guangzhou, China
Yanxiong Li, Xue Zhang, Hai Jin, Xianku Li, Qin Wang, Qianhua He & Qian Huang

Corresponding author

Correspondence to Yanxiong Li.

About this article

Cite this article

Li, Y., Zhang, X., Jin, H. et al. Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection. Multimed Tools Appl 77, 897–916 (2018). https://doi.org/10.1007/s11042-016-4332-z

Received: 07 July 2016
Revised: 29 December 2016
Accepted: 29 December 2016
Published: 06 January 2017
Issue Date: January 2018
DOI: https://doi.org/10.1007/s11042-016-4332-z
