
Train Sparsely, Generate Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN

International Journal of Computer Vision

Abstract

Training a generative adversarial network (GAN) on a video dataset is challenging because of the sheer size of the dataset and the complexity of each observation. In general, the computational cost of training a GAN scales exponentially with the resolution. In this study, we present a novel memory-efficient method for the unsupervised learning of high-resolution video datasets whose computational cost scales only linearly with the resolution. We achieve this by designing the generator as a stack of small sub-generators and by training the model in a specific way. We train each sub-generator with its own specific discriminator. At training time, we introduce between each pair of consecutive sub-generators an auxiliary subsampling layer that reduces the frame rate by a certain ratio. This procedure allows each sub-generator to learn the distribution of the video at a different level of resolution. As a result, we need only a few GPUs to train a highly complex generator that far outperforms its predecessor in terms of inception score.
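To make the training scheme described above concrete, the sketch below is a minimal PyTorch-style rendering of the idea. It is not the authors' Chainer implementation: every name and hyperparameter here (SubGenerator, StackedGenerator, frame_subsample, the channel counts, and the 3D convolution blocks) is an illustrative assumption. Each level of the stacked generator increases the spatial resolution; during training, an auxiliary frame-subsampling step between consecutive levels drops frames so that high-resolution levels only process short, sparse clips, keeping memory usage roughly linear in the resolution. At generation time the subsampling is disabled, so the same stack produces dense, high-resolution videos, and each level's output would be rendered to RGB and scored by that level's own discriminator (not shown).

```python
# Minimal sketch of "train sparsely, generate densely" in PyTorch.
# NOT the authors' Chainer implementation: all names and hyperparameters
# below are illustrative assumptions.
import torch
import torch.nn as nn


def frame_subsample(video: torch.Tensor, ratio: int) -> torch.Tensor:
    """Keep every `ratio`-th frame of a (N, C, T, H, W) batch.

    The random offset (an assumption) lets every frame position reach
    the higher-resolution levels over the course of training.
    """
    offset = torch.randint(0, ratio, (1,)).item()
    return video[:, :, offset::ratio]


class SubGenerator(nn.Module):
    """One level of the stack: doubles the spatial resolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=(1, 2, 2)),  # upsample H and W only
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (N, C, T, H, W)
        return self.block(x)


class StackedGenerator(nn.Module):
    """Stack of sub-generators with auxiliary subsampling between levels."""

    def __init__(self, channels: int = 64, num_levels: int = 4, ratio: int = 2):
        super().__init__()
        self.levels = nn.ModuleList(
            [SubGenerator(channels) for _ in range(num_levels)]
        )
        self.ratio = ratio

    def forward(self, x: torch.Tensor, train: bool = True):
        """`x` is the low-resolution feature video produced by a first
        temporal generator (not shown). Returns one output per level; in
        training, each output would go to that level's own discriminator."""
        outputs = []
        for i, level in enumerate(self.levels):
            if train and i > 0:
                # Sparse training: drop frames before the next level so that
                # high-resolution levels only see short, low-frame-rate clips.
                x = frame_subsample(x, self.ratio)
            x = level(x)
            outputs.append(x)
        return outputs
```

At generation time one would call the stack with `train=False` and keep only the last output, which is dense in both time and space; during training, the per-level outputs remain small because every subsampling step shortens the clip before it reaches the next, higher-resolution level.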


Notes

  1. The pre-trained model itself can be downloaded from http://vlg.cs.dartmouth.edu/c3d/.

  2. Note that this refers to videos longer than 16 frames; up to 16 frames, our model generates stable videos regardless of the dataset.

References

  • Acharya, D., Huang, Z., Paudel, D. P., & Gool, L. V. (2018). Towards high resolution video generation with progressive growing of sliced Wasserstein GANs. arXiv preprint arXiv:1810.02419.

  • Akiba, T., Fukuda, K., & Suzuki, S. (2017). ChainerMN: Scalable distributed deep learning framework. In Proceedings of workshop on ML systems in NIPS.

  • Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., & Levine, S. (2018). Stochastic variational video prediction. In ICLR.

  • Bansal, A., Ma, S., Ramanan, D., & Sheikh, Y. (2018). Recycle-GAN: Unsupervised video retargeting. In ECCV.

  • Borji, A. (2018). Pros and cons of GAN evaluation measures. arXiv preprint arXiv:1802.03446.

  • Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.

  • Byeon, W., Wang, Q., Srivastava, R. K., & Koumoutsakos, P. (2018). ContextVP: Fully context-aware video prediction. In ECCV.

  • Cai, H., Bai, C., Tai, Y. W., & Tang, C. K. (2018). Deep video generation, prediction and completion of human action sequences. In ECCV.

  • Denton, E., Chintala, S., Szlam, A., & Fergus, R. (2015). Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS.

  • Denton, E., & Fergus, R. (2018). Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687.

  • Ebert, F., Finn, C., Lee, A. X., & Levine, S. (2017). Self-supervised visual planning with temporal skip connections. In Conference on robot learning (CoRL).

  • Finn, C., Goodfellow, I., & Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In NIPS.

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS.

  • Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial nets. In NIPS.

  • Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In NIPS.

  • Hao, Z., Huang, X., & Belongie, S. (2018). Controllable video generation with sparse trajectories. In CVPR.

  • Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS.

  • Huang, X., Liu, M. Y., Belongie, S., & Kautz, J. (2018). Multimodal unsupervised image-to-image translation. In ECCV.

  • Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In CVPR.

  • Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., et al. (2016). Video pixel networks. arXiv preprint arXiv:1610.00527.

  • Karpathy, A., Shetty, S., Toderici, G., Sukthankar, R., Leung, T., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR.

  • Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). Progressive growing of GANs for improved quality, stability, and variation. In ICLR.

  • Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.

  • Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., & Levine, S. (2018). Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523.

  • Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., & Yang, M. H. (2018). Flow-grounded spatial-temporal video prediction from still images. In ECCV.

  • Liang, X., Lee, L., Dai, W., & Xing, E. P. (2017). Dual motion GAN for future-flow embedded video prediction. In ICCV.

  • Liu, M. Y., Breuel, T., & Kautz, J. (2017). Unsupervised image-to-image translation networks. In NIPS.

  • Liu, Z., Yeh, R. A., Tang, X., Liu, Y., & Agarwala, A. (2017). Video frame synthesis using deep voxel flow. In ICCV.

  • Lotter, W., Kreiman, G., & Cox, D. (2017). Deep predictive coding networks for video prediction and unsupervised learning. In ICLR.

  • Mathieu, M., Couprie, C., & LeCun, Y. (2016). Deep multi-scale video prediction beyond mean square error. In ICLR.

  • Mescheder, L., Nowozin, S., & Geiger, A. (2018). Which training methods for GANs do actually converge? In ICML.

  • Miyato, T., Kataoka, T., Koyama, M., & Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. In ICLR.

  • Miyato, T., & Koyama, M. (2018). cGANs with projection discriminator. In ICLR.

  • Oh, J., Guo, X., Lee, H., Lewis, R., & Singh, S. (2015). Action-conditional video prediction using deep networks in Atari games. In NIPS.

  • Ohnishi, K., Yamamoto, S., Ushiku, Y., & Harada, T. (2018). Hierarchical video generation from orthogonal information: Optical flow and texture. In AAAI.

  • Oliphant, T. E. (2015). Guide to NumPy (2nd ed.). USA: CreateSpace Independent Publishing Platform.

  • Radford, A., Metz, L., & Chintala, S. (2016). Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.

  • Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., & Chopra, S. (2014). Video (language) modeling: A baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604.

  • Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., & Nießner, M. (2018). FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179.

  • Saito, M., Matsumoto, E., & Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In ICCV.

  • Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. In NIPS.

  • Shi, X., Chen, Z., Wang, H., Yeung, D. Y., Wong, W. K., & Woo, W. C. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS.

  • Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

  • Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In ICML.

  • Tokui, S., Oono, K., Hido, S., & Clayton, J. (2015). Chainer: A next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems in NIPS.

  • Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In ICCV.

  • Tulyakov, S., Liu, M. Y., Yang, X., & Kautz, J. (2018). MoCoGAN: Decomposing motion and content for video generation. In CVPR.

  • Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., & Gelly, S. (2018). Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717.

  • Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Generating videos with scene dynamics. In NIPS.

  • Wang, T. C., Liu, M. Y., Zhu, J. Y., Liu, G., Tao, A., Kautz, J., et al. (2018a). Video-to-video synthesis. arXiv preprint arXiv:1808.06601.

  • Wang, T. C., Liu, M. Y., Zhu, J. Y., Tao, A., Kautz, J., & Catanzaro, B. (2018b). High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR.

  • Wang, X., Girshick, R., Gupta, A., & He, K. (2018c). Non-local neural networks. In CVPR.

  • Yang, C., Wang, Z., Zhu, X., Huang, C., Shi, J., & Lin, D. (2018). Pose guided human video generation. In ECCV.

  • Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2018). Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.

  • Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., et al. (2017a). StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916.

  • Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., et al. (2017b). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV.

  • Zhang, Z., Xie, Y., & Yang, L. (2018). Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR.

  • Zhao, L., Peng, X., Tian, Y., Kapadia, M., & Metaxas, D. (2018). Learning to forecast and refine residual motion for image-to-video generation. In ECCV.

  • Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.


Acknowledgements

We would like to thank Takeru Miyato and Shoichiro Yamaguchi for helpful discussions, and Daichi Suzuo for providing a tool to calculate the computational cost and the amount of memory consumed. We also would like to thank the developers of Chainer (Tokui et al. 2015; Akiba et al. 2017).

Author information

Corresponding author

Correspondence to Masaki Saito.

Additional information

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Cite this article

Saito, M., Saito, S., Koyama, M. et al. Train Sparsely, Generate Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN. Int J Comput Vis 128, 2586–2606 (2020). https://doi.org/10.1007/s11263-020-01333-y

