Abstract
Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is, however, intractable, and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as on our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.
Notes
Note that Fixed Flow and TOFlow use only a four-level SpyNet structure for memory efficiency, whereas the original SpyNet network has five levels.
We did not evaluate AdaConv on the DVF dataset, as neither the implementation of AdaConv nor the DVF dataset is publicly available.
The EPE of Fixed Flow on the Sintel dataset differs from the EPE of SpyNet (Ranjan and Black 2017) reported on the Sintel website, because Fixed Flow is trained differently from SpyNet, as mentioned before.
References
Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675.
Ahn, N., Kang, B., & Sohn, K. A. (2018). Fast, accurate, and lightweight super-resolution with cascading residual network. In European conference on computer vision.
Aittala, M., & Durand, F. (2018). Burst image deblurring using permutation invariant convolutional neural networks. In European conference on computer vision (pp. 731–747).
Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. J., & Szeliski, R. (2011). A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1), 1–31.
Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In European conference on computer vision.
Brox, T., Bregler, C., & Malik, J. (2009). Large displacement optical flow. In IEEE conference on computer vision and pattern recognition.
Bulat, A., Yang, J., & Tzimiropoulos, G. (2018). To learn image super-resolution, use a gan to learn how to do image degradation first. In European conference on computer vision.
Butler, D. J., Wulff, J., Stanley, G. B., & Black, M. J. (2012). A naturalistic open source movie for optical flow evaluation. In European conference on computer vision (pp. 611–625).
Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J., Wang, Z., & Shi, W. (2017). Real-time video super-resolution with spatio-temporal networks and motion compensation. In IEEE conference on computer vision and pattern recognition.
Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., van der Smagt, P., Cremers, D., & Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. In IEEE international conference on computer vision.
Ganin, Y., Kononenko, D., Sungatullina, D., & Lempitsky, V. (2016). Deepwarp: Photorealistic image resynthesis for gaze manipulation. In European conference on computer vision.
Ghoniem, M., Chahir, Y., & Elmoataz, A. (2010). Nonlocal video denoising, simplification and inpainting using discrete regularization on graphs. Signal Processing, 90(8), 2445–2455.
Godard, C., Matzen, K., & Uyttendaele, M. (2017). Deep burst denoising. In European conference on computer vision.
Horn, B. K., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17(1–3), 185–203.
Huang, Y., Wang, W., & Wang, L. (2015). Bidirectional recurrent convolutional networks for multi-frame super-resolution. In Advances in neural information processing systems.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in neural information processing systems.
Jiang, H., Sun, D., Jampani, V., Yang, M. H., Learned-Miller, E., & Kautz, J. (2017). Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In IEEE conference on computer vision and pattern recognition.
Jiang, X., Le Pendu, M., & Guillemot, C. (2018). Depth estimation with occlusion handling from a sparse set of light field views. In IEEE international conference on image processing.
Jo, Y., Oh, S. W., Kang, J., & Kim, S. J. (2018). Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In IEEE conference on computer vision and pattern recognition (pp. 3224–3232).
Kappeler, A., Yoo, S., Dai, Q., & Katsaggelos, A. K. (2016). Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2), 109–122.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International conference on learning representations.
Li, M., Xie, Q., Zhao, Q., Wei, W., Gu, S., Tao, J., & Meng, D. (2018). Video rain streak removal by multiscale convolutional sparse coding. In IEEE conference on computer vision and pattern recognition (pp. 6644–6653).
Liao, R., Tao, X., Li, R., Ma, Z., & Jia, J. (2015). Video super-resolution via deep draft-ensemble learning. In IEEE conference on computer vision and pattern recognition.
Liu, C., & Freeman, W. (2010). A high-quality video denoising algorithm based on reliable motion estimation. In European conference on computer vision.
Liu, C., & Sun, D. (2011). A bayesian approach to adaptive video super resolution. In IEEE conference on computer vision and pattern recognition.
Liu, C., & Sun, D. (2014). On bayesian adaptive video super resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2), 346–360.
Liu, Z., Yeh, R., Tang, X., Liu, Y., & Agarwala, A. (2017). Video frame synthesis using deep voxel flow. In IEEE international conference on computer vision.
Lu, G., Ouyang, W., Xu, D., Zhang, X., Gao, Z., & Sun, M. T. (2018). Deep Kalman filtering network for video compression artifact reduction. In European conference on computer vision (pp. 568–584).
Ma, Z., Liao, R., Tao, X., Xu, L., Jia, J., & Wu, E. (2015). Handling motion blur in multi-frame super-resolution. In IEEE conference on computer vision and pattern recognition.
Maggioni, M., Boracchi, G., Foi, A., & Egiazarian, K. (2012). Video denoising, deblocking, and enhancement through separable 4-d nonlocal spatiotemporal transforms. IEEE Transactions on Image Processing, 21(9), 3952–3966.
Makansi, O., Ilg, E., & Brox, T. (2017). End-to-end learning of video super-resolution with motion compensation. In German conference on pattern recognition.
Mathieu, M., Couprie, C., & LeCun, Y. (2016). Deep multi-scale video prediction beyond mean square error. In International conference on learning representations.
Mémin, E., & Pérez, P. (1998). Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Transactions on Image Processing, 7(5), 703–719.
Mildenhall, B., Barron, J. T., Chen, J., Sharlet, D., Ng, R., & Carroll, R. (2018). Burst denoising with kernel prediction networks. In IEEE conference on computer vision and pattern recognition.
Nasrollahi, K., & Moeslund, T. B. (2014). Super-resolution: A comprehensive survey. Machine Vision and Applications, 25(6), 1423–1468.
Niklaus, S., & Liu, F. (2018). Context-aware synthesis for video frame interpolation. In IEEE conference on computer vision and pattern recognition.
Niklaus, S., Mai, L., & Liu, F. (2017a). Video frame interpolation via adaptive convolution. In IEEE conference on computer vision and pattern recognition.
Niklaus, S., Mai, L., & Liu, F. (2017b). Video frame interpolation via adaptive separable convolution. In IEEE international conference on computer vision.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Ranjan, A., & Black, M. J. (2017). Optical flow estimation using a spatial pyramid network. In IEEE conference on computer vision and pattern recognition.
Revaud, J., Weinzaepfel, P., Harchaoui, Z., & Schmid, C. (2015). Epicflow: Edge-preserving interpolation of correspondences for optical flow. In IEEE conference on computer vision and pattern recognition.
Sajjadi, M. S., Vemulapalli, R., & Brown, M. (2018). Frame-recurrent video super-resolution. In IEEE conference on computer vision and pattern recognition (pp. 6626–6634).
Tao, X., Gao, H., Liao, R., Wang, J., & Jia, J. (2017). Detail-revealing deep video super-resolution. In IEEE international conference on computer vision.
Varghese, G., & Wang, Z. (2010). Video denoising based on a spatiotemporal gaussian scale mixture model. IEEE Transactions on Circuits and Systems for Video Technology, 20(7), 1032–1040.
Wang, T. C., Zhu, J. Y., Kalantari, N. K., Efros, A. A., & Ramamoorthi, R. (2017). Light field video capture using a learning-based hybrid imaging system. In SIGGRAPH.
Wedel, A., Cremers, D., Pock, T., & Bischof, H. (2009). Structure-and motion-adaptive regularization for high accuracy optic flow. In IEEE conference on computer vision and pattern recognition.
Wen, B., Li, Y., Pfister, L., & Bresler, Y. (2017). Joint adaptive sparsity and low-rankness on the fly: An online tensor reconstruction scheme for video denoising. In IEEE international conference on computer vision.
Werlberger, M., Pock, T., Unger, M., & Bischof, H. (2011). Optical flow guided tv-l1 video interpolation and restoration. In International conference on energy minimization methods in computer vision and pattern recognition.
Xu, J., Ranftl, R., & Koltun, V. (2017). Accurate optical flow via direct cost volume processing. In IEEE conference on computer vision and pattern recognition.
Xu, S., Zhang, F., He, X., Shen, X., & Zhang, X. (2015). Pm-pm: Patchmatch with potts model for object segmentation and stereo matching. IEEE Transactions on Image Processing, 24(7), 2182–2196.
Yang, R., Xu, M., Wang, Z., & Li, T. (2018). Multi-frame quality enhancement for compressed video. In IEEE conference on computer vision and pattern recognition (pp. 6664–6673).
Yu, J. J., Harley, A. W., & Derpanis, K. G. (2016). Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In European conference on computer vision workshops.
Yu, Z., Li, H., Wang, Z., Hu, Z., & Chen, C. W. (2013). Multi-level video frame interpolation: Exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology, 23(7), 1235–1248.
Zhou, T., Tulsiani, S., Sun, W., Malik, J., & Efros, A. A. (2016). View synthesis by appearance flow. In European conference on computer vision.
Zhu, X., Wang, Y., Dai, J., Yuan, L., & Wei, Y. (2017). Flow-guided feature aggregation for video object detection. In IEEE international conference on computer vision.
Zitnick, C. L., Kang, S. B., Uyttendaele, M., Winder, S., & Szeliski, R. (2004). High-quality video view interpolation using a layered representation. ACM Transactions on Graphics, 23(3), 600–608.
Acknowledgements
This work is supported by NSF RI #1212849, NSF BIGDATA #1447476, Facebook, and Shell Research. This work was done when Tianfan Xue and Donglai Wei were graduate students at MIT CSAIL.
Additional information
Communicated by Ming-Hsuan Yang.
Appendix
Additional Qualitative Results. We show additional results on the following benchmarks: the Vimeo interpolation benchmark (Fig. 14), the Vimeo denoising benchmark (Fig. 15 for RGB videos and Fig. 16 for grayscale videos), the Vimeo deblocking benchmark (Fig. 17), and the Vimeo super-resolution benchmark (Fig. 18). The test images are randomly selected from the test sets. Differences between the algorithms are clearer when zoomed in.
Flow Estimation Module. We use SpyNet (Ranjan and Black 2017) as our flow estimation module. It consists of four sub-networks that share the same structure but have independent sets of parameters. Each sub-network consists of five sets of \(7\times 7\) convolutional (with zero padding), batch normalization, and ReLU layers. The numbers of output channels of the five convolutional layers are 32, 64, 32, 16, and 2. The input motion to the first sub-network is a zero motion field.
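As a rough sanity check on the size of one sub-network, the sketch below tabulates the five convolutional layers described above and counts their weights. The input channel count of 8 (two RGB frames plus a two-channel upsampled flow field) is an assumption taken from the usual SpyNet setup, not something stated in this appendix; batch normalization parameters are left out, and all function names are hypothetical.

```python
# Sketch only: layer bookkeeping for one flow sub-network.
KERNEL = 7
IN_CHANNELS = 8  # assumption: 2 RGB frames (6 channels) + 2-channel flow
OUT_CHANNELS = [32, 64, 32, 16, 2]  # from the text above


def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of one 2-D convolutional layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def subnetwork_summary(in_channels=IN_CHANNELS,
                       out_channels=OUT_CHANNELS,
                       kernel=KERNEL):
    """Return per-layer specs and the receptive field of the stack."""
    layers = []
    receptive_field = 1
    prev = in_channels
    for out_ch in out_channels:
        # Each stride-1 convolution widens the receptive field by kernel - 1.
        receptive_field += kernel - 1
        layers.append({
            "in": prev,
            "out": out_ch,
            # Zero padding that keeps the spatial size unchanged.
            "padding": (kernel - 1) // 2,
            "params": conv_params(prev, out_ch, kernel),
        })
        prev = out_ch
    return layers, receptive_field


layers, receptive_field = subnetwork_summary()
total_params = sum(layer["params"] for layer in layers)
print(total_params, receptive_field)
```

Under these assumptions each level has a few hundred thousand convolutional parameters, so the four-level pyramid stays small compared with typical flow networks.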
Image Processing Module. We use slightly different structures in the image processing module for different tasks. For temporal frame interpolation, both with and without masks, we build a residual network that consists of an averaging network and a residual network. The averaging network simply averages the two transformed frames (from frames 1 and 3). The residual network also takes the two transformed frames as input, but estimates the difference between the actual second frame and the average of the two transformed frames using a convolutional network consisting of three convolutional layers, each of which is followed by a ReLU layer. The kernel sizes of the three layers are \(9\times 9\), \(1\times 1\), and \(1\times 1\) (with zero padding), and the numbers of output channels are 64, 64, and 3. The final output is the sum of the outputs of the averaging network and the residual network.
For video denoising/deblocking, the image processing module uses the same six-layer convolutional structure (three convolutional layers and three ReLU layers) as interpolation, but without the residual structure. We also tried the residual structure for denoising/deblocking, but it brought no significant improvement.
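The way the two branches combine for interpolation can be sketched as follows. This is a minimal illustration on single-channel frames stored as nested lists; the residual branch is passed in as a callable (`residual_net`, a hypothetical stand-in for the three-layer convolutional network), and `zero_residual` is a placeholder used only to show the data flow.

```python
def average_frames(warped_1, warped_3):
    """Averaging branch: per-pixel mean of the two warped frames."""
    return [[(a + b) / 2.0 for a, b in zip(row_1, row_3)]
            for row_1, row_3 in zip(warped_1, warped_3)]


def interpolate(warped_1, warped_3, residual_net):
    """Output = averaging branch + residual branch, pixel by pixel."""
    base = average_frames(warped_1, warped_3)
    residual = residual_net(warped_1, warped_3)
    return [[b + r for b, r in zip(base_row, res_row)]
            for base_row, res_row in zip(base, residual)]


def zero_residual(warped_1, warped_3):
    """Placeholder residual branch that predicts no correction."""
    return [[0.0 for _ in row] for row in warped_1]


frame = interpolate([[0.0, 2.0]], [[2.0, 4.0]], zero_residual)
print(frame)
```

With a zero residual the result reduces to the plain average, so the learned branch only has to model what averaging gets wrong.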
For video super-resolution, the image processing module consists of four pairs of convolutional and ReLU layers. The kernel sizes of these four layers are \(9\times 9\), \(9\times 9\), \(1\times 1\), and \(1\times 1\) (with zero padding), and the numbers of output channels are 64, 64, 64, and 3.
Mask Network. Similar to our flow estimation module, our mask estimation network is a four-level convolutional neural network pyramid, as in Fig. 4. Each level has the same sub-network structure, with five sets of \(7\times 7\) convolutional (with zero padding), batch normalization, and ReLU layers, but an independent set of parameters (the numbers of output channels are 32, 64, 32, 16, and 2). At the first level, the input to the network is a concatenation of two estimated optical flow fields (four channels after concatenation), and the output is a concatenation of two estimated masks (one channel per mask). From the second level onward, the input to the network switches to a concatenation of, first, the two estimated optical flow fields at that resolution and, second, the bilinearly upsampled masks from the previous level (the resolution is twice that of the previous level). In this way, the first-level mask network estimates a rough mask, and the remaining levels refine its high-frequency details.
We use cycle consistency to obtain the ground truth occlusion mask for pre-training the mask network. For two consecutive frames \(I_1\) and \(I_2\), we calculate the forward flow \(v_{12}\) and the backward flow \(v_{21}\) using the pre-trained flow network. Then, for each pixel p in image \(I_1\), we first map it to \(I_2\) using \(v_{12}\) and then map it back to \(I_1\) using \(v_{21}\). If it maps to a point other than p (up to an error threshold of two pixels), then this pixel is considered to be occluded.
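The per-level inputs and outputs of the mask pyramid can be tabulated with a short sketch. It assumes that the last level runs at full resolution and that each earlier level halves the resolution, so the image size must be divisible by 8 for four levels; the function name and the example size are hypothetical.

```python
def mask_pyramid_shapes(height, width, levels=4):
    """List (height, width, in_channels, out_channels) for each level."""
    scale = 2 ** (levels - 1)
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by %d" % scale)
    shapes = []
    for level in range(levels):
        factor = 2 ** (levels - 1 - level)  # coarsest level comes first
        # Level 0 sees two 2-channel flow fields; later levels also see
        # the two upsampled 1-channel masks from the previous level.
        in_channels = 4 if level == 0 else 4 + 2
        shapes.append((height // factor, width // factor, in_channels, 2))
    return shapes


for shape in mask_pyramid_shapes(256, 448):
    print(shape)
```

Only the first level differs in input width, which is why it can be seen as producing the rough mask that the later levels refine.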
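The forward-backward check described above can be sketched in a few lines. This is an illustrative version, not the authors' implementation: flows are nested lists of `(dx, dy)` tuples, the backward flow is read at the nearest pixel rather than interpolated, and pixels that leave the image are marked occluded; all three choices are assumptions made for brevity.

```python
import math


def occlusion_mask(flow_12, flow_21, threshold=2.0):
    """Mark pixels of I_1 whose round trip I_1 -> I_2 -> I_1 misses p.

    flow_12[y][x] and flow_21[y][x] are (dx, dy) displacements in pixels.
    Returns a nested list of booleans (True = occluded).
    """
    height, width = len(flow_12), len(flow_12[0])
    mask = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow_12[y][x]
            qx, qy = x + dx, y + dy  # position of p in I_2
            ix, iy = int(round(qx)), int(round(qy))
            if not (0 <= ix < width and 0 <= iy < height):
                mask[y][x] = True  # assumption: out of frame = occluded
                continue
            bx, by = flow_21[iy][ix]  # nearest-pixel backward flow
            # Distance between the round-trip position and the start p.
            error = math.hypot(qx + bx - x, qy + by - y)
            mask[y][x] = error > threshold
    return mask


forward = [[(1.0, 0.0)] * 6]
backward = [[(-1.0, 0.0)] * 6]
print(occlusion_mask(forward, backward))
```

In this toy example the two flows cancel exactly, so only the last pixel, which moves outside the frame, is flagged.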
Xue, T., Chen, B., Wu, J. et al. Video Enhancement with Task-Oriented Flow. Int J Comput Vis 127, 1106–1125 (2019). https://doi.org/10.1007/s11263-018-01144-2
DOI: https://doi.org/10.1007/s11263-018-01144-2