RED-Net: A Recurrent Encoder–Decoder Network for Video-Based Face Alignment

Published in: International Journal of Computer Vision

Abstract

We propose a novel method for real-time face alignment in videos, based on a recurrent encoder–decoder network model. Our model predicts 2D facial point heat maps regularized by both detection and regression losses, while uniquely exploiting recurrent learning along both the spatial and temporal dimensions. At the spatial level, we add a feedback loop that connects the combined output response map back to the input, enabling iterative coarse-to-fine face alignment with a single network model instead of the traditional cascaded model ensembles. At the temporal level, we first decouple the features in the bottleneck of the network into temporal-variant factors, such as pose and expression, and temporal-invariant factors, such as identity. Temporal recurrent learning is then applied to the decoupled temporal-variant features. We show that this feature disentangling yields better generalization and significantly more accurate results at test time. We perform a comprehensive experimental analysis, showing the importance of each component of our model, as well as results superior to the state of the art and to several variations of our method on standard datasets.
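The spatial recurrence described above amounts to a simple control loop: the previous combined response map is stacked with the input frame and fed back through the same network for a fixed number of refinement steps. The sketch below is a minimal NumPy illustration of that loop only, not the paper's architecture; `encoder_decoder`, the landmark count, and the map-combination step are hypothetical placeholders.

```python
import numpy as np

def encoder_decoder(x):
    # Stand-in for the trained encoder-decoder network: it maps the
    # stacked input (image channels + previous response map) to one
    # heat-map channel per landmark. Hypothetical placeholder only.
    n_landmarks = 7
    h, w = x.shape[1], x.shape[2]
    rng = np.random.default_rng(0)
    return rng.random((n_landmarks, h, w))

def spatial_recurrent_alignment(image, n_steps=3):
    """Iterative coarse-to-fine alignment via a spatial feedback loop:
    the combined response map from the previous pass is concatenated
    with the input and fed back into the *same* network, rather than
    running a cascade of separate models."""
    heatmap = np.zeros((1, image.shape[1], image.shape[2]))  # initial guess
    for _ in range(n_steps):
        stacked = np.concatenate([image, heatmap], axis=0)
        response = encoder_decoder(stacked)                   # per-landmark maps
        heatmap = response.max(axis=0, keepdims=True)         # combined map fed back
    return response

image = np.zeros((3, 64, 64))  # a dummy RGB frame
maps = spatial_recurrent_alignment(image)
print(maps.shape)  # → (7, 64, 64)
```

In the actual model, each refinement pass is additionally supervised by the detection and regression losses, and the temporal recurrence operates separately on the decoupled temporal-variant bottleneck features across frames.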



Notes

  1. https://sites.google.com/site/xipengcshomepage/eccv2016


Author information

Correspondence to Xi Peng.

Additional information

Communicated by Xiaoou Tang.


About this article


Cite this article

Peng, X., Feris, R.S., Wang, X. et al. RED-Net: A Recurrent Encoder–Decoder Network for Video-Based Face Alignment. Int J Comput Vis 126, 1103–1119 (2018). https://doi.org/10.1007/s11263-018-1095-1
