
Deep Learning Frameworks Applied For Audio-Visual Scene Classification

  • Conference paper
Data Science – Analytics and Applications

Abstract

In this paper, we present deep learning frameworks for audio-visual scene classification (SC) and show how individual audio features, visual features, and their combination affect SC performance. Our extensive experiments are conducted on the DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B Development and Evaluation datasets. On the Development dataset, we achieve the best classification accuracies of 82.2%, 91.1%, and 93.9% with audio input only, visual input only, and combined audio-visual input, respectively. The highest accuracy of 93.9%, obtained from an ensemble of the audio-based and visual-based frameworks, is an improvement of 16.5% over the DCASE 2021 baseline. Our best result on the Evaluation dataset is 91.5%, outperforming the DCASE baseline of 77.1%.
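The best Development-set result above comes from ensembling the audio-based and visual-based frameworks. As a rough illustration only, the sketch below shows one common way to realize such an ensemble: late fusion by weighted averaging of the per-modality class probabilities. The class list reflects the ten scene classes of DCASE 2021 Task 1B; the helper name, the random stand-in probabilities, and the equal weighting are assumptions for illustration, not the authors' exact method.

```python
# Minimal sketch of late fusion between an audio-based and a visual-based
# scene classifier. The probability vectors stand in for the softmax outputs
# of the two trained branches; weighting and names are illustrative only.
import numpy as np

SCENE_CLASSES = [
    "airport", "bus", "metro", "metro_station", "park",
    "public_square", "shopping_mall", "street_pedestrian",
    "street_traffic", "tram",
]  # ten scene classes used in DCASE 2021 Task 1B

def fuse_predictions(p_audio, p_visual, audio_weight=0.5):
    """Weighted average of per-modality class probabilities (late fusion)."""
    p_fused = (audio_weight * np.asarray(p_audio)
               + (1.0 - audio_weight) * np.asarray(p_visual))
    return SCENE_CLASSES[int(np.argmax(p_fused))], p_fused

# Stand-in softmax outputs from the audio and visual branches for one clip.
rng = np.random.default_rng(0)
p_audio = rng.dirichlet(np.ones(len(SCENE_CLASSES)))
p_visual = rng.dirichlet(np.ones(len(SCENE_CLASSES)))
label, probs = fuse_predictions(p_audio, p_visual)
print(label, probs.round(3))
```

In practice the fusion weight would be tuned on held-out data; equal weighting is used here purely to keep the example self-contained.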




Author information

Correspondence to Lam Pham.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature

About this paper


Cite this paper

Pham, L., Schindler, A., Schutz, M., Lampert, J., Schlarb, S., King, R. (2022). Deep Learning Frameworks Applied For Audio-Visual Scene Classification. In: Haber, P., Lampoltshammer, T.J., Leopold, H., Mayr, M. (eds) Data Science – Analytics and Applications. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-36295-9_6


  • DOI: https://doi.org/10.1007/978-3-658-36295-9_6


  • Publisher Name: Springer Vieweg, Wiesbaden

  • Print ISBN: 978-3-658-36294-2

  • Online ISBN: 978-3-658-36295-9

  • eBook Packages: Computer Science, Computer Science (R0)
