Video Enhancement with Task-Oriented Flow

Xue, Tianfan; Chen, Baian; Wu, Jiajun; Wei, Donglai; Freeman, William T.

doi:10.1007/s11263-018-01144-2

Published: 12 February 2019

Volume 127, pages 1106–1125, (2019)

International Journal of Computer Vision

Tianfan Xue¹,
Baian Chen²,
Jiajun Wu ORCID: orcid.org/0000-0002-4176-343X²,
Donglai Wei³ &
…
William T. Freeman²,⁴

Abstract

Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is, however, intractable, and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as on our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.

Notes

Note that both Fixed Flow and TOFlow use only a 4-level SpyNet structure for memory efficiency, whereas the original SpyNet network has 5 levels.
We did not evaluate AdaConv on the DVF dataset, as neither the implementation of AdaConv nor the DVF dataset is publicly available.
The EPE of Fixed Flow on the Sintel dataset differs from the EPE of SpyNet (Ranjan and Black 2017) reported on the Sintel website, because it is trained differently from SpyNet, as mentioned before.

References

Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675.
Ahn, N., Kang, B., & Sohn, K. A. (2018). Fast, accurate, and lightweight super-resolution with cascading residual network. In European conference on computer vision.
Aittala, M., & Durand, F. (2018). Burst image deblurring using permutation invariant convolutional neural networks. In European conference on computer vision (pp. 731–747).
Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., & Szeliski, R. (2011). A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1), 1–31.
Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision.
Brox, T., Bregler, C., & Malik, J. (2009). Large displacement optical flow. In IEEE conference on computer vision and pattern recognition.
Bulat, A., Yang, J., & Tzimiropoulos, G. (2018). To learn image super-resolution, use a gan to learn how to do image degradation first. In European conference on computer vision.
Butler, D. J., Wulff, J., Stanley, G. B., & Black, M. J. (2012). A naturalistic open source movie for optical flow evaluation. In European conference on computer vision (pp. 611–625).
Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J., Wang, Z., & Shi, W. (2017). Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE conference on computer vision and pattern recognition.
Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., van der Smagt, P., Cremers, D., & Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. In IEEE international conference on computer vision.
Ganin, Y., Kononenko, D., Sungatullina, D., & Lempitsky, V. (2016). Deepwarp: Photorealistic image resynthesis for gaze manipulation. In European conference on computer vision.
Ghoniem, M., Chahir, Y., & Elmoataz, A. (2010). Nonlocal video denoising, simplification and inpainting using discrete regularization on graphs. Signal Processing, 90(8), 2445–2455.
Godard, C., Matzen, K., & Uyttendaele, M. (2017). Deep burst denoising. In European conference on computer vision.
Horn, B. K., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17(1–3), 185–203.
Huang, Y., Wang, W., & Wang, L. (2015). Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in neural information processing systems.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in neural information processing systems.
Jiang, H., Sun, D., Jampani, V., Yang, M. H., Learned-Miller, E., & Kautz, J. (2017). Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In IEEE conference on computer vision and pattern recognition.
Jiang, X., Le Pendu, M., & Guillemot, C. (2018). Depth estimation with occlusion handling from a sparse set of light field views. In IEEE international conference on image processing.
Jo, Y., Oh, S. W., Kang, J., & Kim, S. J. (2018). Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE conference on computer vision and pattern recognition (pp. 3224–3232).
Kappeler, A., Yoo, S., Dai, Q., & Katsaggelos, A. K. (2016). Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2), 109–122.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International conference on learning representations.
Li, M., Xie, Q., Zhao, Q., Wei, W., Gu, S., Tao, J., & Meng, D. (2018). Video rain streak removal by multiscale convolutional sparse coding. In IEEE conference on computer vision and pattern recognition (pp. 6644–6653).
Liao, R., Tao, X., Li, R., Ma, Z., & Jia, J. (2015). Video super-resolution via deep draft-ensemble learning. In IEEE conference on computer vision and pattern recognition.
Liu, C., & Freeman, W. (2010). A high-quality video denoising algorithm based on reliable motion estimation. In European conference on computer vision.
Liu, C., & Sun, D. (2011). A bayesian approach to adaptive video super resolution. In IEEE conference on computer vision and pattern recognition.
Liu, C., & Sun, D. (2014). On bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2), 346–360.
Liu, Z., Yeh, R., Tang, X., Liu, Y., & Agarwala, A. (2017). Video frame synthesis using deep voxel flow. In IEEE international conference on computer vision.
Lu, G., Ouyang, W., Xu, D., Zhang, X., Gao, Z., & Sun, M. T. (2018). Deep Kalman filtering network for video compression artifact reduction. In European conference on computer vision (pp. 568–584).
Ma, Z., Liao, R., Tao, X., Xu, L., Jia, J., & Wu, E. (2015). Handling motion blur in multi-frame super-resolution. In IEEE conference on computer vision and pattern recognition.
Maggioni, M., Boracchi, G., Foi, A., & Egiazarian, K. (2012). Video denoising, deblocking, and enhancement through separable 4-d nonlocal spatiotemporal transforms. IEEE Transactions on Image Processing, 21(9), 3952–3966.
Makansi, O., Ilg, E., & Brox, T. (2017). End-to-end learning of video super-resolution with motion compensation. In German conference on pattern recognition.
Mathieu, M., Couprie, C., & LeCun, Y. (2016). Deep multi-scale video prediction beyond mean square error. In International conference on learning representations.
Mémin, E., & Pérez, P. (1998). Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Transactions on Image Processing, 7(5), 703–719.
Mildenhall, B., Barron, J. T., Chen, J., Sharlet, D., Ng, R., & Carroll, R. (2018). Burst denoising with kernel prediction networks. In IEEE conference on computer vision and pattern recognition.
Nasrollahi, K., & Moeslund, T. B. (2014). Super-resolution: A comprehensive survey. Machine Vision and Applications, 25(6), 1423–1468.
Niklaus, S., & Liu, F. (2018). Context-aware synthesis for video frame interpolation. In IEEE conference on computer vision and pattern recognition.
Niklaus, S., Mai, L., & Liu, F. (2017a). Video frame interpolation via adaptive convolution. In IEEE conference on computer vision and pattern recognition.
Niklaus, S., Mai, L., & Liu, F. (2017b). Video frame interpolation via adaptive separable convolution. In IEEE international conference on computer vision.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Ranjan, A., & Black, M. J. (2017). Optical flow estimation using a spatial pyramid network. In IEEE conference on computer vision and pattern recognition.
Revaud, J., Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). Epicflow: Edge-preserving interpolation of correspondences for optical flow. In IEEE conference on computer vision and pattern recognition.
Sajjadi, M. S., Vemulapalli, R., & Brown, M. (2018). Frame-recurrent video super-resolution. In IEEE conference on computer vision and pattern recognition (pp. 6626–6634).
Tao, X., Gao, H., Liao, R., Wang, J., & Jia, J. (2017). Detail-revealing deep video super-resolution. In IEEE international conference on computer vision.
Varghese, G., & Wang, Z. (2010). Video denoising based on a spatiotemporal gaussian scale mixture model. IEEE Transactions on Circuits and Systems for Video Technology, 20(7), 1032–1040.
Wang, T. C., Zhu, J. Y., Kalantari, N. K., Efros, A. A., & Ramamoorthi, R. (2017). Light field video capture using a learning-based hybrid imaging system. In SIGGRAPH.
Wedel, A., Cremers, D., Pock, T., & Bischof, H. (2009). Structure- and motion-adaptive regularization for high accuracy optic flow. In IEEE conference on computer vision and pattern recognition.
Wen, B., Li, Y., Pfister, L., & Bresler, Y. (2017). Joint adaptive sparsity and low-rankness on the fly: An online tensor reconstruction scheme for video denoising. In IEEE international conference on computer vision.
Werlberger, M., Pock, T., Unger, M., & Bischof, H. (2011). Optical flow guided tv-l1 video interpolation and restoration. In International conference on energy minimization methods in computer vision and pattern recognition.
Xu, J., Ranftl, R., & Koltun, V. (2017). Accurate optical flow via direct cost volume processing. In IEEE conference on computer vision and pattern recognition.
Xu, S., Zhang, F., He, X., Shen, X., & Zhang, X. (2015). Pm-pm: Patchmatch with potts model for object segmentation and stereo matching. IEEE Transactions on Image Processing, 24(7), 2182–2196.
Yang, R., Xu, M., Wang, Z., & Li, T. (2018). Multi-frame quality enhancement for compressed video. In IEEE conference on computer vision and pattern recognition (pp. 6664–6673).
Yu, J. J., Harley, A. W., & Derpanis, K. G. (2016). Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In European conference on computer vision workshops.
Yu, Z., Li, H., Wang, Z., Hu, Z., & Chen, C. W. (2013). Multi-level video frame interpolation: Exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology, 23(7), 1235–1248.
Zhou, T., Tulsiani, S., Sun, W., Malik, J., & Efros, A. A. (2016). View synthesis by appearance flow. In European conference on computer vision.
Zhu, X., Wang, Y., Dai, J., Yuan, L., & Wei, Y. (2017). Flow-guided feature aggregation for video object detection. In IEEE international conference on computer vision.
Zitnick, C. L., Kang, S. B., Uyttendaele, M., Winder, S., & Szeliski, R. (2004). High-quality video view interpolation using a layered representation. ACM Transactions on Graphics, 23(3), 600–608.

Acknowledgements

This work is supported by NSF RI #1212849, NSF BIGDATA #1447476, Facebook, and Shell Research. This work was done when Tianfan Xue and Donglai Wei were graduate students at MIT CSAIL.

Author information

Authors and Affiliations

Google Research, Mountain View, CA, USA
Tianfan Xue
Massachusetts Institute of Technology, Cambridge, MA, USA
Baian Chen, Jiajun Wu & William T. Freeman
Harvard University, Cambridge, MA, USA
Donglai Wei
Google Research, Cambridge, MA, USA
William T. Freeman

Corresponding author

Correspondence to Jiajun Wu.

Additional information

Communicated by Ming-Hsuan Yang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Additional Qualitative Results. We show additional results on the following benchmarks: the Vimeo interpolation benchmark (Fig. 14), the Vimeo denoising benchmark (Fig. 15 for RGB videos and Fig. 16 for grayscale videos), the Vimeo deblocking benchmark (Fig. 17), and the Vimeo super-resolution benchmark (Fig. 18). Test images are randomly selected from the test sets. Differences between algorithms are clearer when zoomed in.

Flow Estimation Module. We use SpyNet (Ranjan and Black 2017) as our flow estimation module. It consists of four sub-networks with the same structure, each with an independent set of parameters. Each sub-network consists of five sets of \(7\times 7\) convolutional (with zero padding), batch normalization, and ReLU layers. The numbers of channels after the five convolutional layers are 32, 64, 32, 16, and 2. The input motion to the first sub-network is a zero motion field.
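As a rough sanity check on the size of one such sub-network, the sketch below counts convolution weights and the receptive field of five stacked \(7\times 7\) layers. The function name `conv_stack_summary` is hypothetical. Two things are assumptions rather than statements from the text: the 8 input channels (two RGB frames plus a 2-channel flow field, following the original SpyNet design) and stride-1 convolutions. Batch normalization parameters are not counted.

```python
def conv_stack_summary(in_channels, out_channels_per_layer, kernel=7):
    """Return (parameter count, receptive field) of a stack of square convolutions.

    Assumes stride-1 convolutions with biases; batch normalization
    parameters are deliberately left out of the count.
    """
    params = 0
    receptive_field = 1
    c_in = in_channels
    for c_out in out_channels_per_layer:
        # kernel*kernel weights per (input, output) channel pair, plus one bias per output.
        params += c_in * c_out * kernel * kernel + c_out
        # Each stride-1 layer widens the receptive field by (kernel - 1) pixels.
        receptive_field += kernel - 1
        c_in = c_out
    return params, receptive_field


# Assumption: 8 input channels (two RGB frames + a 2-channel flow field).
params, receptive_field = conv_stack_summary(8, [32, 64, 32, 16, 2])
print(params, receptive_field)
```

Under these assumptions each sub-network has roughly a quarter of a million convolution parameters, and each output flow vector sees a \(31\times 31\) window at its pyramid level.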

Image Processing Module. We use slightly different structures in the image processing module for different tasks. For temporal frame interpolation, both with and without masks, we build a residual network that consists of an averaging network and a residual network. The averaging network simply averages the two transformed frames (from frames 1 and 3). The residual network also takes the two transformed frames as input, but estimates the difference between the actual second frame and the average of the two transformed frames using a convolutional network consisting of three convolutional layers, each followed by a ReLU layer. The kernel sizes of the three layers are \(9\times 9\), \(1\times 1\), and \(1\times 1\) (with zero padding), and the numbers of output channels are 64, 64, and 3. The final output is the sum of the outputs of the averaging network and the residual network.
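The way the averaging and residual branches are combined can be sketched in a few lines. This is only an illustration: frames are single-channel nested lists, and `zero_residual` is a hypothetical stand-in for the trained three-layer convolutional network, which is not reproduced here.

```python
def interpolate_middle_frame(warped_1, warped_3, residual_net):
    """Combine the averaging branch with a residual correction.

    warped_1, warped_3: frames 1 and 3 after warping, as 2-D lists of floats.
    residual_net: callable returning a correction with the same shape.
    """
    residual = residual_net(warped_1, warped_3)
    output = []
    for row_1, row_3, row_r in zip(warped_1, warped_3, residual):
        # Averaging branch (mean of the two warped frames) plus residual branch.
        output.append([0.5 * (a + b) + r for a, b, r in zip(row_1, row_3, row_r)])
    return output


def zero_residual(warped_1, warped_3):
    # Hypothetical placeholder: a trained network would predict the
    # difference between the true middle frame and the average.
    return [[0.0] * len(row) for row in warped_1]


frame = interpolate_middle_frame([[0.0, 2.0], [4.0, 6.0]],
                                 [[2.0, 4.0], [6.0, 8.0]],
                                 zero_residual)
print(frame)
```

With a zero residual the output reduces to the plain average, which is the baseline the residual branch is meant to correct.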

For video denoising/deblocking, the image processing module uses the same six-layer convolutional structure (three convolutional layers and three ReLU layers) as for interpolation, but without the residual structure. We also tried the residual structure for denoising/deblocking, but it gave no significant improvement.

For video super-resolution, the image processing module consists of four pairs of convolutional layers and ReLU layers. The kernel sizes of these four layers are \(9\times 9\), \(9\times 9\), \(1\times 1\), and \(1\times 1\) (with zero padding), and the numbers of output channels are 64, 64, 64, and 3.

Mask Network. Similar to our flow estimation module, our mask estimation network is a four-level convolutional neural network pyramid, as in Fig. 4. Each level uses the same sub-network structure, with five sets of \(7\times 7\) convolutional (with zero padding), batch normalization, and ReLU layers, but has an independent set of parameters (output channels are 32, 64, 32, 16, and 2). For the first level, the input to the network is a concatenation of two estimated optical flow fields (four channels after concatenation), and the output is a concatenation of two estimated masks (one channel per mask). From the second level on, the input to the network switches to a concatenation of, first, the two estimated optical flow fields at that resolution and, second, the bilinearly upsampled masks from the previous level (the resolution is twice that of the previous level). In this way, the first-level mask network estimates a rough mask, and the remaining levels refine its high-frequency details.
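The channel and resolution bookkeeping of this pyramid can be written out as a small table-building function. The name `mask_pyramid_plan` is hypothetical, and the sketch assumes that the last level runs at full input resolution, with each earlier level at half the resolution of the next; the text states only the factor of two between levels.

```python
def mask_pyramid_plan(height, width, levels=4):
    """List the input/output tensor shapes of each mask sub-network.

    Assumption: the last level operates at full resolution and every
    earlier level at half the resolution of the level after it.
    """
    plan = []
    for level in range(levels):
        scale = 2 ** (levels - 1 - level)
        # Level 1 sees two 2-channel flow fields; later levels also see
        # the two upsampled 1-channel masks from the previous level.
        in_channels = 4 if level == 0 else 4 + 2
        plan.append({
            "level": level + 1,
            "height": height // scale,
            "width": width // scale,
            "in_channels": in_channels,
            "out_channels": 2,  # one channel per estimated mask
        })
    return plan


for entry in mask_pyramid_plan(256, 448):
    print(entry)
```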

We use cycle consistency to obtain the ground-truth occlusion mask for pre-training the mask network. For two consecutive frames \(I_1\) and \(I_2\), we calculate the forward flow \(v_{12}\) and the backward flow \(v_{21}\) using the pre-trained flow network. Then, for each pixel \(p\) in image \(I_1\), we first map it to \(I_2\) using \(v_{12}\) and then map it back to \(I_1\) using \(v_{21}\). If it maps back to a point other than \(p\) (beyond an error tolerance of two pixels), then this pixel is considered occluded.
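A minimal sketch of this forward-backward check is given below, using plain nested lists so that it runs without extra libraries. The function name `occlusion_mask` is hypothetical. Two details are assumptions not specified in the text: the backward flow is sampled at the nearest pixel rather than interpolated, and pixels whose forward mapping leaves the frame are flagged as occluded.

```python
import math


def occlusion_mask(v12, v21, threshold=2.0):
    """Flag pixels of I_1 that fail the forward-backward consistency check.

    v12[y][x] and v21[y][x] hold (dx, dy) displacements in pixels.
    Returns a 2-D list of booleans, True where the pixel is flagged occluded.
    """
    height, width = len(v12), len(v12[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = v12[y][x]
            # Map p = (x, y) into I_2 with the forward flow.
            qx, qy = x + dx, y + dy
            # Assumption: sample the backward flow at the nearest pixel.
            ix, iy = int(round(qx)), int(round(qy))
            if not (0 <= ix < width and 0 <= iy < height):
                # Assumption: points mapped outside the frame count as occluded.
                mask[y][x] = True
                continue
            bx, by = v21[iy][ix]
            # Map back into I_1 and measure the round-trip error.
            error = math.hypot(qx + bx - x, qy + by - y)
            mask[y][x] = error > threshold
    return mask


# A uniform one-pixel shift to the right with a matching backward flow:
# only the last column, which leaves the frame, is flagged.
forward = [[(1.0, 0.0)] * 4 for _ in range(3)]
backward = [[(-1.0, 0.0)] * 4 for _ in range(3)]
print(occlusion_mask(forward, backward)[0])
```

In practice the flows would come from the pre-trained flow network and the check would be vectorized; the loop form is kept here only for readability.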

About this article

Cite this article

Xue, T., Chen, B., Wu, J. et al. Video Enhancement with Task-Oriented Flow. Int J Comput Vis 127, 1106–1125 (2019). https://doi.org/10.1007/s11263-018-01144-2

Received: 23 May 2018
Accepted: 20 December 2018
Published: 12 February 2019
Issue Date: 01 August 2019
DOI: https://doi.org/10.1007/s11263-018-01144-2
