
Video Enhancement with Task-Oriented Flow

Published in: International Journal of Computer Vision

Abstract

Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is, however, intractable, and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.


Notes

  1. Note that Fixed Flow and TOFlow use only a 4-level SpyNet structure for memory efficiency, while the original SpyNet network has 5 levels.

  2. We did not evaluate AdaConv on the DVF dataset, as neither the implementation of AdaConv nor the DVF dataset is publicly available.

  3. The EPE of Fixed Flow on the Sintel dataset differs from the EPE of SpyNet (Ranjan and Black 2017) reported on the Sintel website because, as mentioned above, Fixed Flow is trained differently from SpyNet.

References

  • Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). YouTube-8M: A large-scale video classification benchmark. arXiv:1609.08675.

  • Ahn, N., Kang, B., & Sohn, K. A. (2018). Fast, accurate, and lightweight super-resolution with cascading residual network. In European conference on computer vision.

  • Aittala, M., & Durand, F. (2018). Burst image deblurring using permutation invariant convolutional neural networks. In European conference on computer vision (pp. 731–747).

  • Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., & Szeliski, R. (2011). A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1), 1–31.


  • Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision.

  • Brox, T., Bregler, C., & Malik, J. (2009). Large displacement optical flow. In IEEE conference on computer vision and pattern recognition.

  • Bulat, A., Yang, J., & Tzimiropoulos, G. (2018). To learn image super-resolution, use a gan to learn how to do image degradation first. In European conference on computer vision.

  • Butler, D. J., Wulff, J., Stanley, G. B., & Black, M. J. (2012). A naturalistic open source movie for optical flow evaluation. In European conference on computer vision (pp. 611–625).

  • Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J., Wang, Z., & Shi, W. (2017). Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE conference on computer vision and pattern recognition.

  • Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., van der Smagt, P., Cremers, D., & Brox, T. (2015). FlowNet: Learning optical flow with convolutional networks. In IEEE international conference on computer vision.

  • Ganin, Y., Kononenko, D., Sungatullina, D., & Lempitsky, V. (2016). Deepwarp: Photorealistic image resynthesis for gaze manipulation. In European conference on computer vision.

  • Ghoniem, M., Chahir, Y., & Elmoataz, A. (2010). Nonlocal video denoising, simplification and inpainting using discrete regularization on graphs. Signal Processing, 90(8), 2445–2455.


  • Godard, C., Matzen, K., & Uyttendaele, M. (2017). Deep burst denoising. In European conference on computer vision.

  • Horn, B. K., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17(1–3), 185–203.


  • Huang, Y., Wang, W., & Wang, L. (2015). Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in neural information processing systems.

  • Jaderberg, M., Simonyan, K., & Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in neural information processing systems.

  • Jiang, H., Sun, D., Jampani, V., Yang, M. H., Learned-Miller, E., & Kautz, J. (2017). Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In IEEE conference on computer vision and pattern recognition.

  • Jiang, X., Le Pendu, M., & Guillemot, C. (2018). Depth estimation with occlusion handling from a sparse set of light field views. In IEEE international conference on image processing.

  • Jo, Y., Oh, S. W., Kang, J., & Kim, S. J. (2018). Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE conference on computer vision and pattern recognition (pp. 3224–3232).

  • Kappeler, A., Yoo, S., Dai, Q., & Katsaggelos, A. K. (2016). Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2), 109–122.


  • Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International conference on learning representations.

  • Li, M., Xie, Q., Zhao, Q., Wei, W., Gu, S., Tao, J., & Meng, D. (2018). Video rain streak removal by multiscale convolutional sparse coding. In IEEE conference on computer vision and pattern recognition (pp. 6644–6653).

  • Liao, R., Tao, X., Li, R., Ma, Z., & Jia, J. (2015). Video super-resolution via deep draft-ensemble learning. In IEEE conference on computer vision and pattern recognition.

  • Liu, C., & Freeman, W. (2010). A high-quality video denoising algorithm based on reliable motion estimation. In European conference on computer vision.

  • Liu, C., & Sun, D. (2011). A Bayesian approach to adaptive video super resolution. In IEEE conference on computer vision and pattern recognition.

  • Liu, C., & Sun, D. (2014). On Bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2), 346–360.


  • Liu, Z., Yeh, R., Tang, X., Liu, Y., & Agarwala, A. (2017). Video frame synthesis using deep voxel flow. In IEEE international conference on computer vision.

  • Lu, G., Ouyang, W., Xu, D., Zhang, X., Gao, Z., & Sun, M. T. (2018). Deep Kalman filtering network for video compression artifact reduction. In European conference on computer vision (pp. 568–584).

  • Ma, Z., Liao, R., Tao, X., Xu, L., Jia, J., & Wu, E. (2015). Handling motion blur in multi-frame super-resolution. In IEEE conference on computer vision and pattern recognition.

  • Maggioni, M., Boracchi, G., Foi, A., & Egiazarian, K. (2012). Video denoising, deblocking, and enhancement through separable 4-d nonlocal spatiotemporal transforms. IEEE Transactions on Image Processing, 21(9), 3952–3966.


  • Makansi, O., Ilg, E., & Brox, T. (2017). End-to-end learning of video super-resolution with motion compensation. In German conference on pattern recognition.

  • Mathieu, M., Couprie, C., & LeCun, Y. (2016). Deep multi-scale video prediction beyond mean square error. In International conference on learning representations.

  • Mémin, E., & Pérez, P. (1998). Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Transactions on Image Processing, 7(5), 703–719.


  • Mildenhall, B., Barron, J. T., Chen, J., Sharlet, D., Ng, R., & Carroll, R. (2018). Burst denoising with kernel prediction networks. In IEEE conference on computer vision and pattern recognition.

  • Nasrollahi, K., & Moeslund, T. B. (2014). Super-resolution: A comprehensive survey. Machine Vision and Applications, 25(6), 1423–1468.


  • Niklaus, S., & Liu, F. (2018). Context-aware synthesis for video frame interpolation. In IEEE conference on computer vision and pattern recognition.

  • Niklaus, S., Mai, L., & Liu, F. (2017a). Video frame interpolation via adaptive convolution. In IEEE conference on computer vision and pattern recognition.

  • Niklaus, S., Mai, L., & Liu, F. (2017b). Video frame interpolation via adaptive separable convolution. In IEEE international conference on computer vision.

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.


  • Ranjan, A., & Black, M. J. (2017). Optical flow estimation using a spatial pyramid network. In IEEE conference on computer vision and pattern recognition.

  • Revaud, J., Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In IEEE conference on computer vision and pattern recognition.

  • Sajjadi, M. S., Vemulapalli, R., & Brown, M. (2018). Frame-recurrent video super-resolution. In IEEE conference on computer vision and pattern recognition (pp. 6626–6634).

  • Tao, X., Gao, H., Liao, R., Wang, J., & Jia, J. (2017). Detail-revealing deep video super-resolution. In IEEE international conference on computer vision.

  • Varghese, G., & Wang, Z. (2010). Video denoising based on a spatiotemporal gaussian scale mixture model. IEEE Transactions on Circuits and Systems for Video Technology, 20(7), 1032–1040.


  • Wang, T. C., Zhu, J. Y., Kalantari, N. K., Efros, A. A., & Ramamoorthi, R. (2017). Light field video capture using a learning-based hybrid imaging system. In SIGGRAPH.

  • Wedel, A., Cremers, D., Pock, T., & Bischof, H. (2009). Structure-and motion-adaptive regularization for high accuracy optic flow. In IEEE conference on computer vision and pattern recognition.

  • Wen, B., Li, Y., Pfister, L., & Bresler, Y. (2017). Joint adaptive sparsity and low-rankness on the fly: An online tensor reconstruction scheme for video denoising. In IEEE international conference on computer vision.

  • Werlberger, M., Pock, T., Unger, M., & Bischof, H. (2011). Optical flow guided TV-L1 video interpolation and restoration. In International conference on energy minimization methods in computer vision and pattern recognition.

  • Xu, J., Ranftl, R., & Koltun, V. (2017). Accurate optical flow via direct cost volume processing. In IEEE conference on computer vision and pattern recognition.

  • Xu, S., Zhang, F., He, X., Shen, X., & Zhang, X. (2015). Pm-pm: Patchmatch with potts model for object segmentation and stereo matching. IEEE Transactions on Image Processing, 24(7), 2182–2196.


  • Yang, R., Xu, M., Wang, Z., & Li, T. (2018). Multi-frame quality enhancement for compressed video. In IEEE conference on computer vision and pattern recognition (pp. 6664–6673).

  • Yu, J. J., Harley, A. W., & Derpanis, K. G. (2016). Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In European conference on computer vision workshops.

  • Yu, Z., Li, H., Wang, Z., Hu, Z., & Chen, C. W. (2013). Multi-level video frame interpolation: Exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology, 23(7), 1235–1248.


  • Zhou, T., Tulsiani, S., Sun, W., Malik, J., & Efros, A. A. (2016). View synthesis by appearance flow. In European conference on computer vision.

  • Zhu, X., Wang, Y., Dai, J., Yuan, L., & Wei, Y. (2017). Flow-guided feature aggregation for video object detection. In IEEE international conference on computer vision.

  • Zitnick, C. L., Kang, S. B., Uyttendaele, M., Winder, S., & Szeliski, R. (2004). High-quality video view interpolation using a layered representation. ACM Transactions on Graphics, 23(3), 600–608.



Acknowledgements

This work is supported by NSF RI #1212849, NSF BIGDATA #1447476, Facebook, and Shell Research. This work was done when Tianfan Xue and Donglai Wei were graduate students at MIT CSAIL.

Author information

Corresponding author: Jiajun Wu.

Additional information

Communicated by Ming-Hsuan Yang.


Appendix


Additional Qualitative Results We show additional results on the following benchmarks: the Vimeo interpolation benchmark (Fig. 14), the Vimeo denoising benchmark (Fig. 15 for RGB videos and Fig. 16 for grayscale videos), the Vimeo deblocking benchmark (Fig. 17), and the Vimeo super-resolution benchmark (Fig. 18). Test samples are randomly selected from the test sets. Differences between algorithms are clearer when zoomed in.

Flow Estimation Module We use SpyNet (Ranjan and Black 2017) as our flow estimation module. It consists of four sub-networks with the same structure but independent sets of parameters. Each sub-network consists of five sets of 7\(\times \)7 convolutional (with zero padding), batch normalization, and ReLU layers; the numbers of output channels after the convolutional layers are 32, 64, 32, 16, and 2. The input motion to the first sub-network is a zero motion field.
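The following is a minimal PyTorch sketch of one such pyramid-level sub-network, included only to make the layer description concrete. The 8-channel input (the first frame, the second frame warped by the coarse flow, and the upsampled coarse flow) and the omission of batch normalization and ReLU after the final 2-channel layer are our assumptions; they are not stated above and are not taken from the released code.

```python
import torch
import torch.nn as nn

class FlowSubnet(nn.Module):
    """One pyramid level of the flow estimator: 7x7 convolutions (zero padding)
    with batch normalization and ReLU; output channels are 32, 64, 32, 16, 2."""
    def __init__(self):
        super().__init__()
        # Assumed input: frame 1, frame 2 warped by the coarse flow, and the
        # upsampled coarse flow itself (3 + 3 + 2 = 8 channels).
        channels = [8, 32, 64, 32, 16, 2]
        layers = []
        for i in range(5):
            layers.append(nn.Conv2d(channels[i], channels[i + 1],
                                    kernel_size=7, padding=3))
            if i < 4:
                # Assumption: no BN/ReLU after the final layer, so the
                # predicted flow residual can take negative values.
                layers.append(nn.BatchNorm2d(channels[i + 1]))
                layers.append(nn.ReLU(inplace=True))
        self.net = nn.Sequential(*layers)

    def forward(self, frame1, warped_frame2, coarse_flow):
        # Predict a residual on top of the coarse flow; at the coarsest level
        # the coarse flow is a zero motion field.
        x = torch.cat([frame1, warped_frame2, coarse_flow], dim=1)
        return coarse_flow + self.net(x)
```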

Image Processing Module We use slightly different structures in the image processing module for different tasks. For temporal frame interpolation, both with and without masks, we build a network that consists of an averaging branch and a residual branch. The averaging branch simply averages the two transformed frames (from frames 1 and 3). The residual branch also takes the two transformed frames as input, but estimates the difference between the actual second frame and the average of the two transformed frames through a convolutional network consisting of three convolutional layers, each followed by a ReLU layer. The kernel sizes of the three layers are 9\(\times \)9, 1\(\times \)1, and 1\(\times \)1 (with zero padding), and the numbers of output channels are 64, 64, and 3. The final output is the sum of the outputs of the averaging branch and the residual branch.
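Below is a minimal PyTorch sketch of this interpolation module, under our assumptions that the residual branch takes the two warped RGB frames concatenated along the channel dimension (6 channels) and that the final ReLU is omitted so the residual can be negative; neither detail is specified above.

```python
import torch
import torch.nn as nn

class InterpolationModule(nn.Module):
    """Frame interpolation head: average of the two warped frames plus a
    learned residual from a small three-layer convolutional network."""
    def __init__(self):
        super().__init__()
        # Residual branch: 9x9, 1x1, 1x1 convolutions with 64, 64, and 3
        # output channels. The final ReLU is omitted (an assumption) so the
        # predicted residual can take negative values.
        self.residual = nn.Sequential(
            nn.Conv2d(6, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, kernel_size=1),
        )

    def forward(self, warped1, warped3):
        # Averaging branch: mean of frames 1 and 3 after warping toward frame 2.
        avg = 0.5 * (warped1 + warped3)
        return avg + self.residual(torch.cat([warped1, warped3], dim=1))
```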

Fig. 14 Qualitative results on video interpolation. Samples are randomly selected from the Vimeo interpolation benchmark. The differences between algorithms are clear only when zoomed in

Fig. 15 Qualitative results on RGB video denoising. Samples are randomly selected from the Vimeo denoising benchmark. The differences between algorithms are clear only when zoomed in

Fig. 16 Qualitative results on grayscale video denoising. Samples are randomly selected from the Vimeo denoising benchmark. The differences between algorithms are clear only when zoomed in

Fig. 17 Qualitative results on video deblocking. Samples are randomly selected from the Vimeo deblocking benchmark. The differences between algorithms are clear only when zoomed in

Fig. 18 Qualitative results on video super-resolution. Samples are randomly selected from the Vimeo super-resolution benchmark. The differences between algorithms are clear only when zoomed in. DeepSR was originally trained on 30–50 images but is evaluated on 7 frames in this experiment, so some artifacts appear

For video denoising/deblocking, the image processing module uses the same six-layer convolutional structure (three convolutional layers, each followed by a ReLU layer) as in interpolation, but without the residual branch. We also tried the residual structure for denoising/deblocking, but it brought no significant improvement.

For video super-resolution, the image processing module consists of four pairs of convolutional layers and ReLU layers. The kernel sizes for these four layers are 9\(\times \)9, 9\(\times \)9, 1\(\times \)1, and 1\(\times \)1 (with zero padding), and the numbers of output channels are 64, 64, 64, and 3.
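Since the denoising/deblocking and super-resolution heads differ only in kernel sizes and channel counts, both can be written with one small builder, sketched below in PyTorch. The 21-channel input (seven registered RGB frames stacked along the channel dimension) is our assumption and not stated above.

```python
import torch.nn as nn

def make_processing_head(kernel_sizes, out_channels, in_channels):
    """Builds a plain stack of convolution/ReLU pairs with 'same' padding."""
    layers, c_in = [], in_channels
    for k, c_out in zip(kernel_sizes, out_channels):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
                   nn.ReLU(inplace=True)]
        c_in = c_out
    return nn.Sequential(*layers)

# Denoising/deblocking: three conv/ReLU pairs (9x9, 1x1, 1x1; 64, 64, 3 channels).
denoise_head = make_processing_head([9, 1, 1], [64, 64, 3], in_channels=21)

# Super-resolution: four conv/ReLU pairs (9x9, 9x9, 1x1, 1x1; 64, 64, 64, 3 channels).
sr_head = make_processing_head([9, 9, 1, 1], [64, 64, 64, 3], in_channels=21)
```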

Mask Network Similar to the flow estimation module, our mask estimation network is a four-level convolutional neural network pyramid, as shown in Fig. 4. Each level uses the same sub-network structure, five sets of 7\(\times \)7 convolutional (with zero padding), batch normalization, and ReLU layers, but has an independent set of parameters (the numbers of output channels are 32, 64, 32, 16, and 2). At the first level, the input to the sub-network is the concatenation of the two estimated optical flow fields (four channels in total), and the output is the concatenation of the two estimated masks (one channel per mask). From the second level on, the input switches to the concatenation of the two estimated optical flow fields at that resolution and the bilinearly upsampled masks from the previous level (whose resolution is twice that of the previous level). In this way, the first level estimates a rough mask, and the remaining levels refine its high-frequency details.
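A PyTorch sketch of this coarse-to-fine mask pyramid follows. The per-level input channel counts (4 at the coarsest level, 4 + 2 at finer levels) follow directly from the description above; the class and variable names are ours and purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_subnet(in_channels):
    """One pyramid level: five 7x7 conv/BN/ReLU blocks with 32, 64, 32, 16,
    and 2 output channels."""
    channels = [in_channels, 32, 64, 32, 16, 2]
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=7, padding=3),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class MaskPyramid(nn.Module):
    """Coarse-to-fine estimation of the two occlusion masks from the two flows."""
    def __init__(self, levels=4):
        super().__init__()
        # Level 0 sees only the two flow fields (4 channels); finer levels also
        # see the bilinearly upsampled masks from the coarser level (4 + 2).
        self.levels = nn.ModuleList(
            [mask_subnet(4)] + [mask_subnet(6) for _ in range(levels - 1)])

    def forward(self, flows_per_level):
        # flows_per_level[k]: both flow fields at level k (coarsest first),
        # concatenated along the channel dimension, shape (N, 4, H_k, W_k).
        masks = self.levels[0](flows_per_level[0])
        for flows, subnet in zip(flows_per_level[1:], self.levels[1:]):
            masks = F.interpolate(masks, scale_factor=2, mode='bilinear',
                                  align_corners=False)
            masks = subnet(torch.cat([flows, masks], dim=1))
        return masks  # (N, 2, H, W): one mask per warped frame
```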

We use a cycle consistency check to obtain the ground-truth occlusion masks for pre-training the mask network. For two consecutive frames \(I_1\) and \(I_2\), we calculate the forward flow \(v_{12}\) and the backward flow \(v_{21}\) using the pre-trained flow network. Then, for each pixel p in image \(I_1\), we first map it to \(I_2\) using \(v_{12}\) and then map it back to \(I_1\) using \(v_{21}\). If it maps back to a point other than p (beyond an error threshold of two pixels), the pixel is considered occluded.
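For dense flow fields this check can be written compactly. The sketch below assumes the flows are given as (N, 2, H, W) tensors in pixel units and uses bilinear sampling to read the backward flow at the forward-warped positions; the two-pixel threshold follows the text.

```python
import torch
import torch.nn.functional as F

def occlusion_mask(flow12, flow21, threshold=2.0):
    """Marks a pixel of frame 1 as occluded if mapping it to frame 2 with
    flow12 and back with flow21 lands more than `threshold` pixels away."""
    n, _, h, w = flow12.shape
    # Base pixel grid (x, y) of frame 1.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)

    # Positions in frame 2 after applying the forward flow.
    target = grid + flow12
    # grid_sample expects (x, y) coordinates normalized to [-1, 1].
    norm = torch.stack([2 * target[:, 0] / (w - 1) - 1,
                        2 * target[:, 1] / (h - 1) - 1], dim=-1)
    flow21_at_target = F.grid_sample(flow21, norm, align_corners=True)

    # Round-trip displacement; a large value indicates occlusion.
    roundtrip = flow12 + flow21_at_target
    return (roundtrip.norm(dim=1) > threshold).float()  # (N, H, W), 1 = occluded
```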


Cite this article

Xue, T., Chen, B., Wu, J. et al. Video Enhancement with Task-Oriented Flow. Int J Comput Vis 127, 1106–1125 (2019). https://doi.org/10.1007/s11263-018-01144-2

