ABSTRACT
What we perceive from visual content are not only collections of objects but the interactions between them. Visual relations, denoted by the triplet <subject, predicate, object>, could convey a wealth of information for visual understanding. Different from static images and because of the additional temporal channel, dynamic relations in videos are often correlated in both spatial and temporal dimensions, which make the relation detection in videos a more complex and challenging task. In this paper, we abstract videos into fully-connected spatial-temporal graphs. We pass message and conduct reasoning in these 3D graphs with a novel VidVRD model using graph convolution network. Our model can take advantage of spatial-temporal contextual cues to make better predictions on objects as well as their dynamic relationships. Furthermore, an online association method with a siamese network is proposed for accurate relation instances association. By combining our model (VRD-GCN) and the proposed association method, our framework for video relation detection achieves the best performance in the latest benchmarks. We validate our approach on benchmark ImageNet-VidVRD dataset. The experimental results show that our framework outperforms the state-of-the-art by a large margin and a series of ablation studies demonstrate our method's effectiveness.
- Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2015. Deep Compositional Question Answering with Neural Module Networks. CoRR , Vol. abs/1511.02799 (2015).Google Scholar
- Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR , Vol. abs/1607.06450 (2016).Google Scholar
- Luca Bertinetto, Jack Valmadre, Jo a o F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. 2016. Fully-Convolutional Siamese Networks for Object Tracking. In Computer Vision - ECCV 2016 Workshops - Amsterdam, The Netherlands, October 8--10 and 15--16, 2016, Proceedings, Part II. 850--865.Google Scholar
- Alex Bewley, ZongYuan Ge, Lionel Ott, Fabio Tozeto Ramos, and Ben Upcroft. 2016. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25--28, 2016 . 3464--3468.Google ScholarCross Ref
- David S. Bolme, J. Ross Beveridge, Bruce A. Draper, and Yui Man Lui. 2010. Visual object tracking using adaptive correlation filters. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13--18 June 2010. 2544--2550.Google ScholarCross Ref
- Long Chen, Hanwang Zhang, Jun Xiao, Xiangnan He, Shiliang Pu, and Shih-Fu Chang. 2019. Counterfactual Critic Multi-Agent Training for Scene Graph Generation. In ICCV .Google Scholar
- Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. 2018. High-Order Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting. CoRR , Vol. abs/1802.07007 (2018).Google Scholar
- Martin Danelljan, Gustav H"a ger, Fahad Shahbaz Khan, and Michael Felsberg. 2014. Accurate Scale Estimation for Robust Visual Tracking. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1--5, 2014 .Google Scholar
- Michaë l Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5--10, 2016, Barcelona, Spain . 3837--3845.Google Scholar
- Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. 2017. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 3165--3174.Google Scholar
- William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4--9 December 2017, Long Beach, CA, USA. 1025--1035.Google Scholar
- Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. 2018. A Twofold Siamese Network for Real-Time Object Tracking. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. 4834--4843.Google ScholarCross Ref
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016 . 770--778.Google Scholar
- Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep Convolutional Networks on Graph-Structured Data. CoRR , Vol. abs/1506.05163 (2015).Google Scholar
- Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24--26, 2017, Conference Track Proceedings .Google Scholar
- Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. 2017. Scene Graph Generation from Objects, Phrases and Region Captions. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017. 1270--1279.Google Scholar
- Cewu Lu, Ranjay Krishna, Michael S. Bernstein, and Fei-Fei Li. 2016. Visual Relationship Detection with Language Priors. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part I. 852--869.Google Scholar
- Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. 2017. The More You Know: Using Knowledge Graphs for Image Classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 20--28.Google Scholar
- Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. 2017. Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 5425--5434.Google Scholar
- Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. 2017. Weakly-Supervised Learning of Visual Relations. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017 . 5189--5198.Google Scholar
- Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. 2018. Semi-supervised User Geolocation via Graph Convolutional Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15--20, 2018, Volume 1: Long Papers . 2009--2019.Google ScholarCross Ref
- Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7--12, 2015, Montreal, Quebec, Canada. 91--99.Google ScholarDigital Library
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision , Vol. 115, 3 (2015), 211--252.Google ScholarDigital Library
- Mohammad Amin Sadeghi and Ali Farhadi. 2011. Recognition using visual phrases. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20--25 June 2011. 1745--1752.Google ScholarDigital Library
- Victor Garcia Satorras and Joan Bruna Estrach. 2018. Few-Shot Learning with Graph Neural Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings .Google Scholar
- Xindi Shang, Tongwei Ren, Jingfan Guo, Hanwang Zhang, and Tat-Seng Chua. 2017. Video Visual Relation Detection. In Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23--27, 2017 . 1300--1308.Google ScholarDigital Library
- Abhinav Shrivastava, Abhinav Gupta, and Ross B. Girshick. 2016. Training Region-Based Object Detectors with Online Hard Example Mining. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. 761--769.Google Scholar
- Damien Teney, Lingqiao Liu, and Anton van den Hengel. 2017. Graph-Structured Representations for Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 3233--3241.Google Scholar
- Jack Valmadre, Luca Bertinetto, Jo a o F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. 2017. End-to-End Representation Learning for Correlation Filter Based Tracking. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 5000--5008.Google Scholar
- Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. CoRR , Vol. abs/1706.02263 (2017).Google Scholar
- Heng Wang and Cordelia Schmid. 2013. Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1--8, 2013 . 3551--3558.Google Scholar
- Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-Local Neural Networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. 7794--7803.Google Scholar
- Xiaolong Wang and Abhinav Gupta. 2018. Videos as Space-Time Region Graphs. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part V. 413--431.Google Scholar
- Nicolai Wojke, Alex Bewley, and Dietrich Paulus. 2017. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing, ICIP 2017, Beijing, China, September 17--20, 2017. 3645--3649.Google ScholarCross Ref
- Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. 2017. Scene Graph Generation by Iterative Message Passing. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 3097--3106.Google ScholarCross Ref
- Ning Xu, An-An Liu, Yongkang Wong, Yongdong Zhang, Weizhi Nie, Yuting Su, and Mohan Kankanhalli. 2018. Dual-stream recurrent neural network for video captioning. IEEE Transactions on Circuits and Systems for Video Technology (2018).Google Scholar
- Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. 2018. Graph R-CNN for Scene Graph Generation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part I. 690--706.Google Scholar
- Guojun Yin, Lu Sheng, Bin Liu, Nenghai Yu, Xiaogang Wang, Jing Shao, and Chen Change Loy. 2018. Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part III. 330--347.Google Scholar
- Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19--23, 2018 . 974--983.Google ScholarDigital Library
- Ruichi Yu, Ang Li, Vlad I. Morariu, and Larry S. Davis. 2017. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017 . 1068--1076.Google Scholar
- Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural Motifs: Scene Graph Parsing With Global Context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. 5831--5840.Google Scholar
- Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. 2017. Visual Translation Embedding Network for Visual Relation Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 3107--3115.Google Scholar
- Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. 2018. Temporal Relational Reasoning in Videos. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part I. 831--846.Google Scholar
- Bohan Zhuang, Lingqiao Liu, Chunhua Shen, and Ian D. Reid. 2017. Towards Context-Aware Interaction Recognition for Visual Relationship Detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017. 589--598.Google Scholar
Index Terms
- Video Relation Detection with Spatio-Temporal Graph
Recommendations
Video Visual Relation Detection
MM '17: Proceedings of the 25th ACM international conference on MultimediaAs a bridge to connect vision and language, visual relations between objects in the form of relation triplet $łangle subject,predicate,object\rangle$, such as "person-touch-dog'' and "cat-above-sofa'', provide a more comprehensive visual content ...
Visual Relation Detection with Multi-Level Attention
MM '19: Proceedings of the 27th ACM International Conference on MultimediaVisual relations, which describe various types of interactions between two objects in the image, can provide critical information for comprehensive semantic understanding of the image. Multiple cues related to the objects can contribute to visual ...
Video Relation Detection with Trajectory-aware Multi-modal Features
MM '20: Proceedings of the 28th ACM International Conference on MultimediaVideo relation detection problem refers to the detection of the relationship between different objects in videos, such as spatial relationship and action relationship. In this paper, we present video relation detection with trajectory-aware multi-modal ...
Comments