DOI: 10.1145/3123266.3123349
Research article

Learning Object-Centric Transformation for Video Prediction

Published: 23 October 2017

ABSTRACT

Future frame prediction for video sequences is a challenging and worthwhile problem in computer vision. Existing methods often learn motion information over the entire image to predict the next frame. However, different objects in the same scene typically move and deform in different ways. The human visual system likewise attends to the key objects that carry crucial motion signals, rather than compressing an entire image into a static representation. Motivated by this property of human perception, we develop a novel object-centric video prediction model that dynamically learns local motion transformations for key object regions selected by visual attention. The next frame is produced by iteratively applying these transformations to the original input frame. Specifically, we design an attention module with replaceable strategies that attends to objects in video frames automatically, and our method requires no annotated data during training. To produce sharp predictions, we adopt adversarial training. We evaluate our model on the Moving MNIST and UCF101 datasets and report results competitive with prior methods. The generated frames demonstrate that our model can characterize the motion of different objects and produce plausible future frames.
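The abstract describes the pipeline only at a high level. The following PyTorch sketch shows one plausible way to wire the pieces together: a spatial-transformer-style attention module that selects object regions, a small network that predicts each attended patch's transformed appearance, and a naive compositing step that pastes the result back into the frame. Every module shape, the class name ObjectCentricPredictor, the choice of PyTorch, and the compositing scheme are our assumptions for illustration, not the authors' actual architecture; the adversarial loss mentioned in the abstract is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectCentricPredictor(nn.Module):
    """Hypothetical sketch of an object-centric next-frame predictor."""

    def __init__(self, n_objects=2, crop=16):
        super().__init__()
        self.n_objects = n_objects
        self.crop = crop
        # Attention: regress one 2x3 affine glimpse per object from the
        # input frame (spatial-transformer style).
        self.attend = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, 6 * n_objects),
        )
        # Local transformation: predict the moved/deformed appearance of
        # an attended object patch.
        self.transform = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, frame):
        b, _, h, w = frame.shape
        theta = self.attend(frame).view(b, self.n_objects, 2, 3)
        out = frame
        for k in range(self.n_objects):
            # Crop the k-th attended object region.
            grid = F.affine_grid(theta[:, k], (b, 1, self.crop, self.crop),
                                 align_corners=False)
            patch = F.grid_sample(frame, grid, align_corners=False)
            # Predict its transformed appearance.
            moved = self.transform(patch)
            # Naive composite: resize the patch to frame size and add it.
            out = out + F.interpolate(moved, size=(h, w), mode='bilinear',
                                      align_corners=False)
        return out.clamp(0.0, 1.0)

# Usage on Moving-MNIST-sized grayscale frames:
model = ObjectCentricPredictor()
x = torch.rand(4, 1, 64, 64)   # batch of current frames
next_frame = model(x)          # predicted next frames, same shape as x

Per the abstract, the full method sharpens such predictions with adversarial training: this generator would be paired with a discriminator that scores predicted versus real next frames, with the GAN loss added to a reconstruction term. That part is omitted from the sketch.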


Published in
MM '17: Proceedings of the 25th ACM International Conference on Multimedia
October 2017, 2028 pages
ISBN: 9781450349062
DOI: 10.1145/3123266
Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher
Association for Computing Machinery, New York, NY, United States


Acceptance Rates
MM '17 paper acceptance rate: 189 of 684 submissions (28%). Overall MM acceptance rate: 995 of 4,171 submissions (24%).

