Abstract
In this paper, we present deep learning frameworks for audio-visual scene classification (SC) and show how individual audio features, visual features, and their combination affect SC performance. Our extensive experiments are conducted on the DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B Development and Evaluation datasets. On the Development dataset, we achieve best classification accuracies of 82.2%, 91.1%, and 93.9% with audio input only, visual input only, and combined audio-visual input, respectively. The highest accuracy of 93.9%, obtained from an ensemble of audio-based and visual-based frameworks, is an improvement of 16.5% over the DCASE 2021 baseline. Our best result on the Evaluation dataset is 91.5%, outperforming the DCASE baseline of 77.1%.
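The abstract does not specify how the audio-based and visual-based frameworks are ensembled; a common choice for this kind of late fusion is a weighted average of the per-class probabilities produced by each model. The sketch below illustrates that idea under this assumption; the function name, the equal default weighting, and the toy probability values are all hypothetical, not taken from the paper.

```python
import numpy as np

def late_fusion(audio_probs: np.ndarray, visual_probs: np.ndarray,
                weight: float = 0.5) -> np.ndarray:
    """Weighted average of per-class probabilities from two models.

    audio_probs, visual_probs: arrays of shape (n_samples, n_classes)
    whose rows each sum to 1; `weight` is the audio model's contribution.
    """
    return weight * audio_probs + (1.0 - weight) * visual_probs

# Toy example: 2 clips, 3 scene classes.
audio = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
visual = np.array([[0.5, 0.4, 0.1],
                   [0.2, 0.2, 0.6]])

probs = late_fusion(audio, visual)
predictions = probs.argmax(axis=1)  # predicted class index per clip
```

Because each row of the fused output is a convex combination of valid probability rows, it still sums to 1, so the ensemble can be evaluated exactly like a single classifier.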
© 2022 The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature
Cite this paper
Pham, L., Schindler, A., Schutz, M., Lampert, J., Schlarb, S., King, R. (2022). Deep Learning Frameworks Applied For Audio-Visual Scene Classification. In: Haber, P., Lampoltshammer, T.J., Leopold, H., Mayr, M. (eds) Data Science – Analytics and Applications. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-36295-9_6
Print ISBN: 978-3-658-36294-2
Online ISBN: 978-3-658-36295-9