ABSTRACT
The diversity of training data significantly affects the tracking robustness of a model in unconstrained environments. However, existing labeled datasets for facial landmark tracking tend to be large but not diverse, and manually annotating massive numbers of new, diverse video clips is extremely expensive. To address these problems, we propose a Spatial-Temporal Knowledge Integration (STKI) approach. Unlike most existing methods, which rely heavily on labeled data, STKI exploits supervision from unlabeled data. Specifically, STKI integrates spatial-temporal knowledge from massive unlabeled videos, which are several orders of magnitude more diverse than existing labeled video data, for robust tracking. Our framework includes a self-supervised tracker and an image-based detector for tracking initialization. To avoid distorting the facial shape, the tracker leverages adversarial learning to introduce a facial structure prior and temporal knowledge into cycle-consistency tracking. Meanwhile, we design a graph-based knowledge distillation method, which distills knowledge from the tracking and detection results, to improve the generalization of the detector. The fine-tuned detector can then provide the tracker with high-quality tracking initialization on unconstrained videos. Extensive experimental results show that the proposed method achieves state-of-the-art performance on comprehensive evaluation datasets.
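The cycle-consistency idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the STKI tracker: the function names, the additive per-frame offsets standing in for a learned tracker, and the toy landmark values are all assumptions made for illustration. The sketch only shows the general principle that landmarks tracked forward through a clip and then backward should return to where they started, so the residual distance can serve as a label-free training signal.

```python
import numpy as np


def track_step(landmarks, offsets):
    """Hypothetical one-step tracker: move each landmark by a per-frame offset.

    A learned tracker would predict these offsets from image features;
    here they are supplied directly to keep the sketch self-contained.
    """
    return landmarks + offsets


def cycle_consistency_error(init_landmarks, forward_offsets, backward_offsets):
    """Track forward through the clip, then backward, and return the mean
    Euclidean distance between the starting landmarks and the end positions."""
    pts = init_landmarks
    for offsets in forward_offsets:
        pts = track_step(pts, offsets)
    for offsets in backward_offsets:
        pts = track_step(pts, offsets)
    return float(np.mean(np.linalg.norm(pts - init_landmarks, axis=1)))


# Toy data (assumed): three 2D landmarks and a two-step forward pass.
init = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
forward = [np.full((3, 2), 1.0), np.full((3, 2), 2.0)]

# A backward pass that exactly undoes the forward pass closes the cycle.
consistent_backward = [-forward[1], -forward[0]]
# A backward pass that drifts by 0.5 px per coordinate does not.
drifting_backward = [-forward[1], -forward[0] + 0.5]

perfect_error = cycle_consistency_error(init, forward, consistent_backward)
drift_error = cycle_consistency_error(init, forward, drifting_backward)
```

In a training setup, an error of this kind would be minimized over unlabeled clips. The abstract indicates that STKI additionally constrains the tracker with an adversarially learned facial structure prior and temporal knowledge, which this sketch does not model.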
Index Terms
- Spatial-Temporal Knowledge Integration: Robust Self-Supervised Facial Landmark Tracking