DOI: 10.1145/3422852.3423486

Intra and Inter-modality Interactions for Audio-visual Event Detection

Published: 12 October 2020

ABSTRACT

The presence of auditory and visual sensory streams enables human beings to obtain a profound understanding of a scene. While audio and visual signals each provide relevant information on their own, combining the two modalities yields more accurate and precise information. In this paper, we address the problem of audio-visual event detection, where the goal is to identify events that are both visible and audible. To this end, we propose an audio-visual network that models intra- and inter-modality interactions with Multi-Head Attention layers. The proposed model further captures the temporal correlation between the two modalities with multimodal LSTMs. Our method achieves state-of-the-art performance on the AVE dataset.
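To make the described architecture concrete, the following is a minimal sketch in PyTorch, not the authors' implementation: the feature dimension, number of attention heads, and the 29-way label space (assuming 28 AVE event categories plus a background class) are illustrative assumptions, and the exact ordering of the attention and LSTM blocks in the paper may differ.

```python
import torch
import torch.nn as nn

class AudioVisualEventNet(nn.Module):
    """Hypothetical sketch: intra-modality self-attention, inter-modality
    cross-attention, and a multimodal LSTM over the fused temporal features."""

    def __init__(self, d_model=256, n_heads=4, n_classes=29):
        super().__init__()
        # Intra-modality interactions: each stream attends to itself.
        self.audio_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inter-modality interactions: each stream attends to the other.
        self.a2v_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Multimodal LSTM over the concatenated streams to model temporal correlation.
        self.lstm = nn.LSTM(2 * d_model, d_model, batch_first=True, bidirectional=True)
        # Per-segment classifier (event class or background).
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio, video):
        # audio, video: (batch, time, d_model) segment-level features.
        a_intra, _ = self.audio_self_attn(audio, audio, audio)
        v_intra, _ = self.video_self_attn(video, video, video)
        a_inter, _ = self.a2v_attn(a_intra, v_intra, v_intra)  # audio queries video
        v_inter, _ = self.v2a_attn(v_intra, a_intra, a_intra)  # video queries audio
        fused, _ = self.lstm(torch.cat([a_inter, v_inter], dim=-1))
        return self.classifier(fused)  # (batch, time, n_classes) segment-level logits

# Example usage with dummy segment-level features (10 one-second segments).
audio = torch.randn(8, 10, 256)
video = torch.randn(8, 10, 256)
logits = AudioVisualEventNet()(audio, video)  # shape: (8, 10, 29)
```

In this sketch, an event is detected for a segment only when both modalities support it, which is what the cross-attention followed by joint temporal modeling is meant to encourage; the actual fusion and training details are those of the paper.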


Published in:
HuMA'20: Proceedings of the 1st International Workshop on Human-centric Multimedia Analysis, October 2020, 108 pages.
ISBN: 978-1-4503-8151-2
DOI: 10.1145/3422852
Copyright © 2020 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
