RED-Net: A Recurrent Encoder–Decoder Network for Video-Based Face Alignment

Published in: International Journal of Computer Vision

Abstract

We propose a novel method for real-time face alignment in videos, based on a recurrent encoder–decoder network model. Our model predicts 2D facial point heat maps regularized by both detection and regression losses, while uniquely exploiting recurrent learning along both the spatial and temporal dimensions. At the spatial level, we add a feedback loop that connects the combined output response map back to the input, enabling iterative coarse-to-fine face alignment with a single network model instead of the traditional cascaded model ensembles. At the temporal level, we first decouple the features in the bottleneck of the network into temporal-variant factors, such as pose and expression, and temporal-invariant factors, such as identity. Temporal recurrent learning is then applied to the decoupled temporal-variant features. We show that this feature disentangling yields better generalization and significantly more accurate results at test time. We perform a comprehensive experimental analysis, showing the importance of each component of our model, as well as results superior to the state of the art and to several variations of our method on standard datasets.
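The spatial recurrence described above amounts to a simple control loop: the previous combined response map is stacked with the input frame and fed back through the same network for a fixed number of refinement steps. The sketch below is a minimal NumPy illustration of that loop only, not the paper's architecture; `encoder_decoder`, the landmark count, and the map-combination step are hypothetical placeholders.

```python
import numpy as np

def encoder_decoder(x):
    # Stand-in for the trained encoder-decoder network: it maps the
    # stacked input (image channels + previous response map) to one
    # heat-map channel per landmark. Hypothetical placeholder only.
    n_landmarks = 7
    h, w = x.shape[1], x.shape[2]
    rng = np.random.default_rng(0)
    return rng.random((n_landmarks, h, w))

def spatial_recurrent_alignment(image, n_steps=3):
    """Iterative coarse-to-fine alignment via a spatial feedback loop:
    the combined response map from the previous pass is concatenated
    with the input and fed back into the *same* network, rather than
    running a cascade of separate models."""
    heatmap = np.zeros((1, image.shape[1], image.shape[2]))  # initial guess
    for _ in range(n_steps):
        stacked = np.concatenate([image, heatmap], axis=0)
        response = encoder_decoder(stacked)                   # per-landmark maps
        heatmap = response.max(axis=0, keepdims=True)         # combined map fed back
    return response

image = np.zeros((3, 64, 64))  # a dummy RGB frame
maps = spatial_recurrent_alignment(image)
print(maps.shape)  # → (7, 64, 64)
```

In the actual model, each refinement pass is additionally supervised by the detection and regression losses, and the temporal recurrence operates separately on the decoupled temporal-variant bottleneck features across frames.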



Notes

  1. https://sites.google.com/site/xipengcshomepage/eccv2016


Author information

Correspondence to Xi Peng.

Additional information

Communicated by Xiaoou Tang.


About this article


Cite this article

Peng, X., Feris, R.S., Wang, X. et al. RED-Net: A Recurrent Encoder–Decoder Network for Video-Based Face Alignment. Int J Comput Vis 126, 1103–1119 (2018). https://doi.org/10.1007/s11263-018-1095-1
