research-article

Spatial-Temporal Knowledge Integration: Robust Self-Supervised Facial Landmark Tracking

Authors:
Congcong Zhu, Shanghai University, Shanghai, China
Xiaoqiang Li, Shanghai University, Shanghai, China
Jide Li, Shanghai University, Shanghai, China
Guangtai Ding, Shanghai University, Shanghai, China
Weiqin Tong, Shanghai University, Shanghai, China

MM '20: Proceedings of the 28th ACM International Conference on Multimedia, October 2020, Pages 4135–4143
https://doi.org/10.1145/3394171.3413993

Published: 12 October 2020

ABSTRACT

The diversity of training data significantly affects a model's tracking robustness in unconstrained environments. However, existing labeled datasets for facial landmark tracking tend to be large but not diverse, and manually annotating massive numbers of clips from new, diverse videos is extremely expensive. To address these problems, we propose a Spatial-Temporal Knowledge Integration (STKI) approach. Unlike most existing methods, which rely heavily on labeled data, STKI exploits supervision from unlabeled data. Specifically, STKI integrates spatial-temporal knowledge from massive unlabeled videos, which are several orders of magnitude more diverse than existing labeled video data, for robust tracking. Our framework includes a self-supervised tracker and an image-based detector for tracking initialization. To avoid distorting the facial shape, the tracker leverages adversarial learning to introduce a facial structure prior and temporal knowledge into cycle-consistency tracking. Meanwhile, we design a graph-based knowledge distillation method, which distills knowledge from the tracking and detection results, to improve the generalization of the detector. The fine-tuned detector can then provide the tracker with high-quality initialization on unconstrained videos. Extensive experimental results show that the proposed method achieves state-of-the-art performance on comprehensive evaluation datasets.
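
The abstract does not spell out the cycle-consistency objective, so the following is only a minimal sketch of the general idea, not the authors' formulation: landmarks are tracked forward through a clip and then backward, and the distance between the starting shape and the returned shape serves as a label-free error signal. The names `cycle_consistency_error` and `toy_tracker`, the scalar "frames", and the constant-drift model are all hypothetical and chosen purely for illustration. The adversarial prior and graph-based distillation mentioned above are not modeled here.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Shape = List[Point]  # one (x, y) pair per landmark
Tracker = Callable[[Shape, float, float], Shape]  # (shape, src_frame, dst_frame) -> shape


def cycle_consistency_error(track: Tracker, frames: Sequence[float], init: Shape) -> float:
    """Track forward through `frames`, then backward to the first frame, and
    return the mean Euclidean distance between `init` and the returned shape."""
    shape = init
    # Forward pass: first frame -> last frame.
    for src, dst in zip(frames[:-1], frames[1:]):
        shape = track(shape, src, dst)
    # Backward pass: last frame -> first frame.
    rev = frames[::-1]
    for src, dst in zip(rev[:-1], rev[1:]):
        shape = track(shape, src, dst)
    dists = [((x0 - x1) ** 2 + (y0 - y1) ** 2) ** 0.5
             for (x0, y0), (x1, y1) in zip(init, shape)]
    return sum(dists) / len(dists)


def toy_tracker(drift: float) -> Tracker:
    """Hypothetical tracker for illustration only: each 'frame' is a scalar
    horizontal offset of the face, and the tracker follows the offset change
    plus a constant per-step drift that stands in for tracking error."""
    def track(shape: Shape, src: float, dst: float) -> Shape:
        dx = (dst - src) + drift
        return [(x + dx, y) for x, y in shape]
    return track


frames = [0.0, 1.0, 3.0]              # toy clip: face offsets per frame
init = [(10.0, 5.0), (20.0, 5.0)]     # two landmarks in the first frame

perfect = cycle_consistency_error(toy_tracker(0.0), frames, init)
drifting = cycle_consistency_error(toy_tracker(0.5), frames, init)
print(perfect, drifting)
```

In this toy setting a drift-free tracker returns exactly to its starting shape, while a drifting one accumulates error over the four forward and backward steps, which is why such an error can act as a training signal on unlabeled video.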

Supplemental Material

3394171.3413993.mp4 (MP4, 56.3 MB)


Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision problems → Tracking


Published in
MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
General Chairs: Chang Wen Chen (Chinese University of Hong Kong, Shenzhen, China), Rita Cucchiara (UNIMORE, Italy), Xian-Sheng Hua (Alibaba Group, China)
Program Chairs: Guo-Jun Qi (Futurewei Technologies, USA), Elisa Ricci (UNITN & Fondazione Bruno Kessler, Italy), Zhengyou Zhang (Tencent, China), Roger Zimmermann (National University of Singapore, Singapore)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery, New York, NY, United States
Author Tags
computer vision
deep learning
image processing
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%
