Skip to main content
Log in

Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection

  • Published:
Multimedia Tools and Applications Aims and scope Submit manuscript

Abstract

Extraction of effective audio features from acoustic events definitely influences the performance of Acoustic Event Detection (AED) system, especially in adverse audio situations. In this study, we propose a framework for extracting Deep Audio Feature (DAF) using multi-stream hierarchical Deep Neural Network (DNN). The DAF outputted from the proposed framework fuses the potential complementary information of multiple input feature streams and thus could be more discriminative than those input features for AED. We take two input feature streams and the hierarchical DNNs with two stages as an example for showing the extraction of DAF. The effectiveness of different audio features for AED is evaluated on two audio corpora, i.e. BBC (British Broadcasting Corporation) audio dataset and TV audio dataset with different signal-to-noise ratios. Experimental results show that DAF outperforms other features for AED under several experimental conditions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

Notes

  1. http://speech.fit.vutbr.cz/software/neural-network-trainer-tnet

References

  1. Atrey PK, Maddage M, Kankanhalli MS (2006) Audio based event detection for multimedia surveillance. In: Proc. of IEEE ICASSP, pp 813–816. IEEE

  2. British Broadcasting Corporation (BBC), “BBC Sound Effects Library,” http://www.sound-ideas.com/bbc.html, Accessed May 2015

  3. Bugalho M, Portelo J, Trancoso I, Pellegrini T, Abad A (2009) Detecting audio events for semantic video search. In: Proc. of INTERSPEECH, pp 1151–1154. ISCA

  4. Cakir E, Heittola T, Huttunen H, Virtanen T (2015) Polyphonic sound event detection using multi label deep neural networks. In: Proc. of International Joint Conference on Neural Networks, pp 1–7. IEEE

  5. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. In: ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. ACM

  6. Childers DG, Skinner DP, Kemerait RC (1977) The cepstrum: a guide to processing. In: Proceeding of IEEE, 65(10):1428–1443. IEEE

  7. Diment A, Heittola T, Virtanen T (2013) Sound event detection for office live and office synthetic AASP challenge. In: Proc. of IEEE AASP challenge on detection and classification of acoustic scenes and events. IEEE

  8. Fawcett T (2011) An introduction to ROC analysis. In: Pattern Recognition Letters, 27(8):861–874. Elsevier

  9. Gabor D (1946) Theory of communication. In: Institute Electronica, no. 93, pp 429–457

  10. Gencoglu O, Virtanen T, Huttunen H (2014) Recognition of acoustic events using deep neural networks. In: Proc. of the 22nd European Signal Processing Conference, pp 506–510. ISCA

  11. Giannoulis D, Stowell D, Benetos E, Rossignol M, Lagrange M, Plumbley MD (2013) A database and challenge for acoustic scene classification and event detection. In: Proc. of EUSIPCO, pp 1–5. ISCA

  12. Grezl F, Karafiat M, Kontar S, Cernocky J (2007) Probabilistic and bottle-neck features for LVCSR of meetings. In: Proc. of IEEE ICASSP, pp 757–760. IEEE

  13. Heittola T, Klapuri A (2008) TUT acoustic event detection system 2007. In: multimodal technologies for perception of humans, vol. 4625 of the series Lecture Notes in Computer Science, pp 364–370. Springer

  14. Heittola T, Mesaros A, Virtanen T, Gabbouj M (2013) Supervised model training for overlapping sound events based on unsupervised source separation. In: Proc. of IEEE ICASSP, Vancouver, Canada, pp 8677–8681. IEEE

  15. Hinton GE, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. In: IEEE Signal Processing Magazine, 29(6):82–97. IEEE

  16. Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18:1527–1554, MIT Press

    Article  MathSciNet  MATH  Google Scholar 

  17. Jin F, Sattar F, Krishnan S (2012) Log-frequency spectrogram for respiratory sound monitoring. In: Proc. of IEEE ICASSP, pp 597–600. IEEE

  18. Lin KZ, Pwint M (2010) Structuring sport video through audio event classification. In: PCM 2010, Part I, LNCS 6297, pp 481–492. Springer

  19. Loren DE, Robert KO (1968) Programming and analysis for digital time series data, United Stated Department of Defense, first edition, Shock and Vibration Information Center

  20. Lu L, Hanjalic A (2009) audio keywords discovery for text-like audio content analysis and retrieval. In: IEEE Trans. on Multimedia 10(1):74–85. IEEE

  21. Ma L, Milner B, Smith D (2006) Acoustic environment classification. In: ACM Trans. On Speech Language Processing, 3(2):1–22. ACM

  22. McLoughlin I, Zhang HM, Xie ZP, Song Y, Xiao W (2015) Robust sound event classification using deep neural networks. In: IEEE Trans. on Audio, Speech, and Language Processing, 23(3):540–552. IEEE

  23. Moritz N, Anemüller J, Kollmeier B (2011) Amplitude modulation spectrogram based features for robust speech recognition in noisy and reverberant environments. In: Proc. of IEEE ICASSP, pp 5492–5495. IEEE

  24. Niessen ME, Van Kasteren TLM, Merentitis A (2013) Hierarchical modeling using automated sub-clustering for sound event recognition. In: Proc. of IEEE workshop on applications of signal processing to audio and acoustics, pp 1–4. IEEE

  25. Nogueira W, Roma G, Herrera P (2013) Automatic event classification using front end single channel noise reduction, MFCC features and a support vector machine classifier. In: IEEE AASP challenge: detection and classification of acoustic scenes and events. IEEE

  26. Okuyucu C, Sert M, Yazlcl A (2013) Audio feature and classifier analysis for efficient recognition of environmental sounds. In: Proc. of IEEE International Symposium on Multimedia, pp 125–132. IEEE

  27. Phan H, Maaß M, Mazur R, Mertins A (2015) Random regression forests for acoustic event detection and classification. In: IEEE Trans. on Audio Speech & Language Processing, 23(1):20–31. IEEE

  28. Qiu A, Schreiner C, Escabi M (2003) Gabor analysis of auditory midbrain receptive fields: spectro-temporal and binaural composition. J Neurophysiol 90(1):456–476, American Physiological Society

    Article  Google Scholar 

  29. Schadler MR, Kollmeier B (2012) Normalization of spectro-temporal Gabor filter bank features for improved robust automatic speech recognition systems. In: Proc. of INTERSPEECH, pp 1–4. ISCA

  30. Schädler MR, Meyer BT, Kollmeier B (2012) Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. J Acoust Soc Am 131(5):4134–4151, Acoustical Society of America

    Article  Google Scholar 

  31. Schröder J, Cauchi B, Schädler MR, Moritz N, Adiloglu K, Anemüller J, Doclo S, Kollmeier B, Goetze S (2013) Acoustic event detection using signal enhancement and spectro-temporal feature extraction. IEEE AASP challenge: detection and classification of acoustic scenes and events. IEEE

  32. Schröder J, Goetze S, Anemüller J (2015) Spectro-temporal gabor filterbank features for acoustic event detection. In: IEEE/ACM Trans. on Audio, Speech, and Language Processing, 23(12):2198–2208. IEEE/ACM

  33. Schröder J, Moritz N, Schädler MR, Cauchi B, Adiloglu K, Anemüller J, Doclo S, Kollmeier B, Goetze S (2013) On the use of spectro-temporal features for the IEEE AASP challenge detection and classification of acoustic scenes and events. In: Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp 1–4. IEEE

  34. Temko A, Malkin R, Zieger C, Macho D, Nadeu C, Omologo M (2007) Clear evaluation of acoustic event detection and classification systems. Lecture notes in computing science, 4122:311–322. Springer

  35. Temko A, Nadeu C (2009) Acoustic event detection in meeting-room environments. In: Pattern recognition letter, 30(14):1281–1288. Elsevier

  36. Temko A, Nadeu C, Macho D, Malkin R, Zieger C, Omologo M (2009) Acoustic event detection and classification. In: Computers in the human interaction loop, pp 61–73. Springer

  37. Varga A, Steeneken HJM (1993) Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. In: Speech Communication, 12(3):247–251. ISCA

  38. Veselý K, Lukáš B, František (2010) Parallel training of neural networks for speech recognition. In: Proc. of INTERSPEECH, pp 439–446. ISCA

  39. Wang S, Yang X, Zhang Y, Phillips P, Yang J, Yuan T (2015) Identification of green, Oolong and black teas in China via wavelet packet entropy and fuzzy support vector machine. In: Entropy, 17(10):6663–6682. MDPI

  40. Young SJ, Evermann G, Gales MJF, Hain T, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland PC (2006) The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge

    Google Scholar 

  41. Yu D, Seltzer ML (2011) Improved bottleneck features using pretrained deep neural networks. In: Proc. of INTERSPEECH, pp 237–240. ISCA

  42. Zhang Y, Chen S, Wang S, Yang J, Phillips P (2015) Magnetic resonance brain image classification based on weighted-type fractional Fourier transform and nonparallel support vector machine. Int J Imaging Syst Technol 25(4):317–327, Wiley

    Article  Google Scholar 

  43. Zhang X, He Q, Feng X (2015) Acoustic feature extraction by tensor-based sparse representation for sound effects classification. In: Proc. of IEEE ICASSP, pp 166–170. IEEE

  44. Zhang Y, Wu L (2012) Classification of fruits using computer vision and a multiclass support vector machine. In: Sensors, 12(9):12489–12505. MDPI

Download references

Acknowledgments

The work was supported by the National Natural Science Foundation of China (61101160, 61271314, 61571192), the Fundamental Research Funds for the Central Universities, South China University of Technology, China (2015ZZ102), Project of the Pearl River Young Talents of Science and Technology in Guangzhou, China (2013J2200070), Science and Technology Planning Project of Guangdong Province (2014A050503022, 2015A010103003) and the Foundation of China Scholarship Council (201208440078).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yanxiong Li.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, Y., Zhang, X., Jin, H. et al. Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection. Multimed Tools Appl 77, 897–916 (2018). https://doi.org/10.1007/s11042-016-4332-z

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-016-4332-z

Keywords

Navigation