research-article

Video Relation Detection with Spatio-Temporal Graph

Authors:
Xufeng Qian, Zhejiang University, Zhejiang Hangzhou, China
Yueting Zhuang, Zhejiang University, Zhejiang Hangzhou, China
Yimeng Li, Zhejiang University, Zhejiang Hangzhou, China
Shaoning Xiao, Zhejiang University, Zhejiang Hangzhou, China
Shiliang Pu, Hikvision Research Institute, Zhejiang Hangzhou, China
Jun Xiao, Zhejiang University, Zhejiang Hangzhou, China

MM '19: Proceedings of the 27th ACM International Conference on Multimedia, October 2019, Pages 84–93
https://doi.org/10.1145/3343031.3351058

Published: 15 October 2019

ABSTRACT

What we perceive from visual content is not only a collection of objects but also the interactions between them. Visual relations, denoted by the triplet <subject, predicate, object>, can convey a wealth of information for visual understanding. Unlike relations in static images, dynamic relations in videos are often correlated in both the spatial and temporal dimensions because of the additional temporal channel, which makes relation detection in videos a more complex and challenging task. In this paper, we abstract videos into fully-connected spatial-temporal graphs. We pass messages and conduct reasoning in these 3D graphs with a novel VidVRD model based on a graph convolutional network. Our model can take advantage of spatial-temporal contextual cues to make better predictions about objects as well as their dynamic relationships. Furthermore, we propose an online association method with a siamese network for accurate association of relation instances. By combining our model (VRD-GCN) with the proposed association method, our framework for video relation detection achieves the best performance on the latest benchmarks. We validate our approach on the benchmark ImageNet-VidVRD dataset. The experimental results show that our framework outperforms the state of the art by a large margin, and a series of ablation studies demonstrates our method's effectiveness.
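
To make the idea of message passing over a fully-connected spatial-temporal graph concrete, the following is a minimal sketch only. It is not the authors' VRD-GCN: the function names, the sizes, the random features, and the uniform edge weights are all assumptions chosen for illustration. Each node stands for one object proposal in one frame, and every node is connected to every other node across both space and time.

```python
import numpy as np

def fully_connected_adjacency(num_nodes):
    """Row-normalized adjacency of a fully connected graph with self-loops."""
    adj = np.ones((num_nodes, num_nodes), dtype=np.float64)
    return adj / adj.sum(axis=1, keepdims=True)

def gcn_layer(features, adj_norm, weight):
    """One graph-convolution step: aggregate neighbors, project, apply ReLU."""
    return np.maximum(adj_norm @ features @ weight, 0.0)

# Hypothetical sizes: 3 frames, 2 object proposals per frame, 4-d features.
num_frames, objects_per_frame, feat_dim, out_dim = 3, 2, 4, 5
num_nodes = num_frames * objects_per_frame

rng = np.random.default_rng(0)
node_features = rng.normal(size=(num_nodes, feat_dim))  # stand-in features
weight = rng.normal(size=(feat_dim, out_dim))           # stand-in parameters

adj_norm = fully_connected_adjacency(num_nodes)
updated = gcn_layer(node_features, adj_norm, weight)
print(updated.shape)  # (6, 5)
```

With uniform edge weights every node receives the same averaged message, so all rows of `updated` come out identical. A practical model would presumably use learned or feature-dependent edge weights so that nodes keep distinct representations; that choice is outside what the abstract specifies.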

Index Terms

Computing methodologies
  Artificial intelligence
    Computer vision
      Computer vision tasks
        Activity recognition and understanding

Published in
MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
General Chairs: Laurent Amsaleg (CNRS-IRISA, France), Benoit Huet (EURECOM, France), Martha Larson (Radboud University and TU Delft, Netherlands)
Program Chairs: Guillaume Gravier (CNRS-IRISA, France), Hayley Hung (Delft University of Technology, Netherlands), Chong-Wah Ngo (City University of Hong Kong, Hong Kong), Wei Tsang Ooi (National University of Singapore, Singapore)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
siamese association network
spatio-temporal graph convolutional network
video relation detection
visual relation detection
Acceptance Rates
MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 995 of 4,171 submissions, 24%