ABSTRACT
Human beings draw on both auditory and visual sensory streams to build a rich understanding of a scene. While audio and visual signals each provide relevant information on their own, combining the two modalities yields more accurate and complete information. In this paper, we address the problem of audio-visual event detection: the goal is to identify events that are both visible and audible. To this end, we propose an audio-visual network that models intra- and inter-modality interactions with Multi-Head Attention layers. Furthermore, the proposed model captures the temporal correlation between the two modalities with multimodal LSTMs. Our method achieves state-of-the-art performance on the AVE dataset.
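The abstract only outlines the architecture, so below is a minimal, hypothetical PyTorch-style sketch of such a design: self-attention within each modality (intra-modality interactions), cross-attention between modalities (inter-modality interactions), and per-modality LSTMs over the segment sequence. The feature dimensions, head counts, residual fusion, and classifier are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed configuration, not the authors' exact model).
import torch
import torch.nn as nn

class AudioVisualEventNet(nn.Module):
    def __init__(self, dim=512, heads=8, hidden=256, num_classes=28):
        super().__init__()
        # Intra-modality self-attention (audio attends to audio, visual to visual).
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality cross-attention (each modality attends to the other).
        self.audio_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Per-modality LSTMs capture temporal correlation across video segments.
        self.audio_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.visual_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        # Per-segment classifier over the concatenated audio-visual states.
        self.classifier = nn.Linear(4 * hidden, num_classes)

    def forward(self, audio, visual):
        # audio, visual: (batch, segments, dim) segment-level features,
        # e.g. pretrained audio and visual CNN embeddings.
        a, _ = self.audio_self(audio, audio, audio)
        v, _ = self.visual_self(visual, visual, visual)
        a = a + self.audio_to_visual(a, v, v)[0]   # audio queries visual context
        v = v + self.visual_to_audio(v, a, a)[0]   # visual queries audio context
        a, _ = self.audio_lstm(a)
        v, _ = self.visual_lstm(v)
        return self.classifier(torch.cat([a, v], dim=-1))  # (batch, segments, classes)

# Usage: per-segment event scores for a batch of two 10-segment videos.
logits = AudioVisualEventNet()(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
```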