ABSTRACT
Human beings draw on both auditory and visual sensory streams to build a rich understanding of a scene. While audio and visual signals each provide relevant information on their own, combining the two modalities yields more accurate and complete information. In this paper, we address the problem of audio-visual event detection: the goal is to identify events that are both visible and audible. To this end, we propose an audio-visual network that models intra- and inter-modality interactions with Multi-Head Attention layers. Furthermore, the proposed model captures the temporal correlation between the two modalities with multimodal LSTMs. Our method achieves state-of-the-art performance on the AVE dataset.
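The abstract only outlines the architecture, so below is a minimal, hypothetical PyTorch-style sketch of such a design: self-attention within each modality (intra-modality interactions), cross-attention between modalities (inter-modality interactions), and per-modality LSTMs over the segment sequence. The feature dimensions, head counts, residual fusion, and classifier are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (assumed configuration, not the authors' exact model).
import torch
import torch.nn as nn

class AudioVisualEventNet(nn.Module):
    def __init__(self, dim=512, heads=8, hidden=256, num_classes=28):
        super().__init__()
        # Intra-modality self-attention (audio attends to audio, visual to visual).
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality cross-attention (each modality attends to the other).
        self.audio_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Per-modality LSTMs capture temporal correlation across video segments.
        self.audio_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.visual_lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        # Per-segment classifier over the concatenated audio-visual states.
        self.classifier = nn.Linear(4 * hidden, num_classes)

    def forward(self, audio, visual):
        # audio, visual: (batch, segments, dim) segment-level features,
        # e.g. pretrained audio and visual CNN embeddings.
        a, _ = self.audio_self(audio, audio, audio)
        v, _ = self.visual_self(visual, visual, visual)
        a = a + self.audio_to_visual(a, v, v)[0]   # audio queries visual context
        v = v + self.visual_to_audio(v, a, a)[0]   # visual queries audio context
        a, _ = self.audio_lstm(a)
        v, _ = self.visual_lstm(v)
        return self.classifier(torch.cat([a, v], dim=-1))  # (batch, segments, classes)

# Usage: per-segment event scores for a batch of two 10-segment videos.
logits = AudioVisualEventNet()(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
```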