
Deep Learning Frameworks Applied For Audio-Visual Scene Classification

  • Conference paper
Data Science – Analytics and Applications

Abstract

In this paper, we present deep learning frameworks for audio-visual scene classification (SC) and show how individual audio features, visual features, and their combination affect SC performance. Our extensive experiments are conducted on the DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B Development and Evaluation datasets. On the Development dataset, we achieve the best classification accuracies of 82.2%, 91.1%, and 93.9% with audio input only, visual input only, and combined audio-visual input, respectively. The highest accuracy of 93.9%, obtained from an ensemble of the audio-based and visual-based frameworks, is an improvement of 16.5% over the DCASE 2021 baseline. Our best result on the Evaluation dataset is 91.5%, outperforming the DCASE baseline of 77.1%.
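The best Development-set result above comes from ensembling the audio-based and visual-based frameworks. As a rough illustration only, the sketch below shows one common way to realize such an ensemble: late fusion by weighted averaging of the per-modality class probabilities. The class list reflects the ten scene classes of DCASE 2021 Task 1B; the helper name, the random stand-in probabilities, and the equal weighting are assumptions for illustration, not the authors' exact method.

```python
# Minimal sketch of late fusion between an audio-based and a visual-based
# scene classifier. The probability vectors stand in for the softmax outputs
# of the two trained branches; weighting and names are illustrative only.
import numpy as np

SCENE_CLASSES = [
    "airport", "bus", "metro", "metro_station", "park",
    "public_square", "shopping_mall", "street_pedestrian",
    "street_traffic", "tram",
]  # ten scene classes used in DCASE 2021 Task 1B

def fuse_predictions(p_audio, p_visual, audio_weight=0.5):
    """Weighted average of per-modality class probabilities (late fusion)."""
    p_fused = (audio_weight * np.asarray(p_audio)
               + (1.0 - audio_weight) * np.asarray(p_visual))
    return SCENE_CLASSES[int(np.argmax(p_fused))], p_fused

# Stand-in softmax outputs from the audio and visual branches for one clip.
rng = np.random.default_rng(0)
p_audio = rng.dirichlet(np.ones(len(SCENE_CLASSES)))
p_visual = rng.dirichlet(np.ones(len(SCENE_CLASSES)))
label, probs = fuse_predictions(p_audio, p_visual)
print(label, probs.round(3))
```

In practice the fusion weight would be tuned on held-out data; equal weighting is used here purely to keep the example self-contained.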




Author information

Correspondence to Lam Pham.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature

About this paper


Cite this paper

Pham, L., Schindler, A., Schutz, M., Lampert, J., Schlarb, S., King, R. (2022). Deep Learning Frameworks Applied For Audio-Visual Scene Classification. In: Haber, P., Lampoltshammer, T.J., Leopold, H., Mayr, M. (eds) Data Science – Analytics and Applications. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-36295-9_6


  • DOI: https://doi.org/10.1007/978-3-658-36295-9_6


  • Publisher Name: Springer Vieweg, Wiesbaden

  • Print ISBN: 978-3-658-36294-2

  • Online ISBN: 978-3-658-36295-9

  • eBook Packages: Computer Science, Computer Science (R0)
