DOI: 10.1145/3123266.3123349
Research article

Learning Object-Centric Transformation for Video Prediction

Published: 23 October 2017

ABSTRACT

Future frame prediction for video sequences is a challenging and worthwhile problem in computer vision. Existing methods often learn motion information over the entire image to predict the next frame. However, different objects in the same scene typically move and deform in different ways. The human visual system likewise attends to the key objects that carry crucial motion signals, rather than compressing an entire image into a static representation. Motivated by this property of human perception, we develop a novel object-centric video prediction model that dynamically learns local motion transformations for key object regions selected by visual attention. The next frame is produced by iteratively applying these transformations to the original input frame. Specifically, we design an attention module with replaceable strategies that attends to objects in video frames automatically, and our method requires no annotated data during training. To produce sharp predictions, we adopt adversarial training. We evaluate our model on the Moving MNIST and UCF101 datasets and report results competitive with prior methods. The generated frames demonstrate that our model can characterize the motion of different objects and produce plausible future frames.
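The abstract describes the pipeline only at a high level. The following PyTorch sketch shows one plausible way to wire the pieces together: a spatial-transformer-style attention module that selects object regions, a small network that predicts each attended patch's transformed appearance, and a naive compositing step that pastes the result back into the frame. Every module shape, the class name ObjectCentricPredictor, the choice of PyTorch, and the compositing scheme are our assumptions for illustration, not the authors' actual architecture; the adversarial loss mentioned in the abstract is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectCentricPredictor(nn.Module):
    """Hypothetical sketch of an object-centric next-frame predictor."""

    def __init__(self, n_objects=2, crop=16):
        super().__init__()
        self.n_objects = n_objects
        self.crop = crop
        # Attention: regress one 2x3 affine glimpse per object from the
        # input frame (spatial-transformer style).
        self.attend = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 6 * n_objects),
        )
        # Local transformation: predict the moved/deformed appearance of
        # an attended object patch.
        self.transform = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, frame):
        b, _, h, w = frame.shape
        theta = self.attend(frame).view(b, self.n_objects, 2, 3)
        out = frame
        for k in range(self.n_objects):
            # Crop the k-th attended object region.
            grid = F.affine_grid(theta[:, k], (b, 1, self.crop, self.crop),
                                 align_corners=False)
            patch = F.grid_sample(frame, grid, align_corners=False)
            # Predict its transformed appearance.
            moved = self.transform(patch)
            # Naive composite: resize the patch to frame size and add it.
            out = out + F.interpolate(moved, size=(h, w), mode='bilinear',
                                      align_corners=False)
        return out.clamp(0.0, 1.0)

# Usage on Moving-MNIST-sized grayscale frames:
model = ObjectCentricPredictor()
x = torch.rand(4, 1, 64, 64)   # batch of current frames
next_frame = model(x)          # predicted next frames, same shape as x

Per the abstract, the full method sharpens such predictions with adversarial training: this generator would be paired with a discriminator that scores predicted versus real next frames, with the GAN loss added to a reconstruction term. That part is omitted from the sketch.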


Published in
MM '17: Proceedings of the 25th ACM International Conference on Multimedia
October 2017, 2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher
Association for Computing Machinery, New York, NY, United States


Acceptance Rates
MM '17 paper acceptance rate: 189 of 684 submissions (28%). Overall MM acceptance rate: 995 of 4,171 submissions (24%).

