DOI: 10.1145/3394171.3413993
research-article

Spatial-Temporal Knowledge Integration: Robust Self-Supervised Facial Landmark Tracking

Published: 12 October 2020

ABSTRACT

The diversity of training data significantly affects a model's tracking robustness in unconstrained environments. However, existing labeled datasets for facial landmark tracking tend to be large but not diverse, and manually annotating massive numbers of new, diverse video clips is extremely expensive. To address these problems, we propose a Spatial-Temporal Knowledge Integration (STKI) approach. Unlike most existing methods, which rely heavily on labeled data, STKI exploits supervision from unlabeled data. Specifically, STKI integrates spatial-temporal knowledge from massive unlabeled videos, which are several orders of magnitude more diverse than existing labeled video data, for robust tracking. Our framework includes a self-supervised tracker and an image-based detector for tracking initialization. To avoid distortion of the facial shape, the tracker leverages adversarial learning to introduce a facial structure prior and temporal knowledge into cycle-consistency tracking. Meanwhile, we design a graph-based knowledge distillation method, which distills knowledge from the tracking and detection results, to improve the generalization of the detector. The fine-tuned detector can then provide the tracker with high-quality initialization on unconstrained videos. Extensive experimental results show that the proposed method achieves state-of-the-art performance on comprehensive evaluation datasets.
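
The abstract does not spell out the cycle-consistency objective, so the following is only a minimal sketch of the general idea, not the STKI loss itself: landmarks are tracked forward through a clip and then backward to the first frame, and the drift between the starting and returned positions serves as a label-free error signal. The names `cycle_consistency_error` and `make_toy_tracker` are hypothetical, and the toy tracker (a fixed horizontal shift) merely stands in for a learned model; the adversarial shape prior and the temporal term mentioned above are not modeled here.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Shape = List[Point]  # one (x, y) pair per facial landmark
# A tracker maps (shape, source frame index, destination frame index) to a new shape.
Tracker = Callable[[Shape, int, int], Shape]


def cycle_consistency_error(tracker: Tracker, start: Shape, num_frames: int) -> float:
    """Track forward from frame 0 to the last frame, then backward to frame 0.

    Returns the mean Euclidean distance between the starting landmarks and
    the landmarks obtained after the full forward-backward cycle.
    """
    shape = start
    for t in range(num_frames - 1):          # forward pass: 0 -> 1 -> ... -> T-1
        shape = tracker(shape, t, t + 1)
    for t in range(num_frames - 1, 0, -1):   # backward pass: T-1 -> ... -> 0
        shape = tracker(shape, t, t - 1)
    dists = [((x0 - x1) ** 2 + (y0 - y1) ** 2) ** 0.5
             for (x0, y0), (x1, y1) in zip(start, shape)]
    return sum(dists) / len(dists)


def make_toy_tracker(forward_dx: float, backward_dx: float) -> Tracker:
    """Hypothetical stand-in tracker: shifts every landmark horizontally."""
    def tracker(shape: Shape, src: int, dst: int) -> Shape:
        dx = forward_dx if dst > src else -backward_dx
        return [(x + dx, y) for x, y in shape]
    return tracker


start_shape = [(10.0, 20.0), (30.0, 20.0), (20.0, 35.0)]

# A tracker whose backward step exactly undoes its forward step has no drift.
consistent = make_toy_tracker(forward_dx=1.0, backward_dx=1.0)
# A tracker whose backward step undershoots accumulates drift over the cycle.
drifting = make_toy_tracker(forward_dx=1.0, backward_dx=0.5)

err_consistent = cycle_consistency_error(consistent, start_shape, num_frames=5)
err_drifting = cycle_consistency_error(drifting, start_shape, num_frames=5)
print(err_consistent, err_drifting)
```

In a self-supervised setting, an error of this kind would be minimized with respect to the tracker's parameters. On its own it does not stop the predicted shape from collapsing or distorting, which is presumably why the abstract pairs it with an adversarially learned facial structure prior.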

Supplemental Material

3394171.3413993.mp4 (mp4, 56.3 MB)

    • Published in

      MM '20: Proceedings of the 28th ACM International Conference on Multimedia
      October 2020
      4889 pages
      ISBN: 9781450379885
      DOI: 10.1145/3394171

      Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate: 995 of 4,171 submissions, 24%
