Facial landmark points detection using knowledge distillation-based neural networks
Introduction
Facial image alignment based on landmark points is a crucial step in many facial image analysis applications including face recognition (Lu and Tang, 2015, Soltanpour et al., 2017), face verification (Sun et al., 2014, Sun et al., 2013), face frontalization (Hassner et al., 2015), pose estimation (Vicente et al., 2015), and facial expression recognition (Sun et al., 2014, Zhao et al., 2003). The goal is to detect and localize the coordinates of predefined landmark points on human faces and use them for face alignment. In the past two decades, great progress has been made toward improving facial landmark detection algorithms’ accuracy. However, most of the previous research does not focus on designing and/or training lightweight networks that can run on mobile devices with limited computational power.
While facial landmark points detection is still considered a challenging task for faces with large pose variations and occlusion (Dong et al., 2018a, Wu et al., 2018), recent methods have designed heavy models with a large number of parameters, making them unsuitable for real-time applications. Moreover, with the growth of the Internet-of-Things (IoT), robotics, and mobile devices, it is vital to balance accuracy against model efficiency (i.e., computational cost). Deep learning-based methods have recently attracted considerable attention for tackling this problem as well. Among many lightweight neural network models, MobileNetV2 (Sandler et al., 2018) offers a good trade-off between accuracy and speed. However, because of its small number of parameters, the accuracy of MobileNetV2 (Sandler et al., 2018) on the face alignment task might not be sufficient, especially when applied to faces with extreme poses or occlusions.
Tan and Le (2019) recently proposed EfficientNet, a family of eight networks designed to trade off accuracy against model size. The designers of EfficientNet found a strong connection between a network's accuracy and its depth, width, and input resolution. Consequently, the EfficientNet family is designed to be efficient: the networks achieve good accuracy while remaining relatively small, meaning they have fewer network parameters, and fast, meaning they require fewer floating-point operations (FLOPs).
Recently, knowledge distillation (KD) has been utilized in image classification (Hinton et al., 2015, Romero et al., 2014), object detection (Li et al., 2017), and semantic segmentation (Xie et al., 2018). Initially, the idea was to train a lightweight network with acceptable accuracy by transferring features and knowledge generated by an ensemble network into a single smaller network (Buciluǎ et al., 2006). Later, Hinton et al. (2015) introduced the term knowledge distillation for the technique of creating a small model, called the Student network, trained to reproduce the outputs of a more cumbersome model, called the Teacher network.
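To make the distillation idea concrete, the following minimal sketch shows the classic classification form of the KD objective from Hinton et al. (2015): a weighted sum of the usual hard-label cross-entropy and a cross-entropy against the Teacher's temperature-softened outputs. The function names and the `T`/`alpha` values are illustrative; our method adapts the concept to coordinate regression rather than classification.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; a higher T yields a softer distribution.
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.7):
    # Hard term: cross-entropy between the student's prediction and the true label.
    hard = -np.log(softmax(student_logits)[true_label] + 1e-12)
    # Soft term: cross-entropy against the teacher's softened outputs,
    # scaled by T^2 so its gradient magnitude stays comparable across temperatures.
    q_teacher = softmax(teacher_logits, T)
    q_student = softmax(student_logits, T)
    soft = -np.sum(q_teacher * np.log(q_student + 1e-12)) * T * T
    return alpha * soft + (1 - alpha) * hard
```

With `alpha = 0` the loss reduces to ordinary cross-entropy; increasing `alpha` shifts the supervision toward the Teacher's soft targets.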
Inspired by the concept of KD, we propose a novel loss function called KD-Loss to improve face alignment accuracy. Specifically, we propose a KD-based architecture that uses two different Teacher networks, both based on EfficientNet-B3 (Tan and Le, 2019), to guide the lightweight Student network, MobileNetV2 (Sandler et al., 2018), to better cope with the facial landmark detection task. Using the facial landmarks predicted by each Teacher network, we introduce two assistive loss (ALoss) functions. Built on ALoss, the KD-Loss considers the geometrical relation between the facial landmarks predicted by the Student and the two Teacher networks to improve the accuracy of MobileNetV2 (Sandler et al., 2018). In other words, we propose using two independent sets of facial landmark points, predicted by our Teacher networks, to guide the lightweight Student network toward better localization of the landmark points.
We train our method in two phases. In the first phase, we create Soft-landmarks inspired by ASM (Cootes et al., 1998). Soft-landmarks are more similar to the Mean-landmark than the Hard-landmarks, which are the original ground-truth facial landmarks. Hence, as a rule of thumb, it is easier for a lightweight model to predict the distribution of these Soft-landmarks than the original ground truth. We use this attribute to create a Teacher–Student architecture that improves the accuracy of the Student network. Specifically, in the first phase, we train one Teacher network using the Hard-landmarks, called the Tough-Teacher, and another Teacher network using the Soft-landmarks as ground truth, called the Tolerant-Teacher. Then, in the second phase, we use our proposed KD-Loss to transfer the information gathered by both Teacher networks into the Student model during training. Fig. 4 and Fig. 5 depict the general architecture of our proposed training scheme. We tested our proposed method on the challenging 300W (Sagonas et al., 2013), WFLW (Wu et al., 2018), and COFW (Burgos-Artizzu et al., 2013) datasets. The results of our experiments show that MobileNetV2 trained with our KD-Loss approach detects facial landmark points more accurately than the original MobileNetV2 (Sandler et al., 2018). The results are also comparable with state-of-the-art methods, while the network size is significantly smaller than most of the previously proposed networks.
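The Soft-landmark idea of the first phase can be sketched as shrinking each ground-truth shape toward the Mean-landmark. The linear interpolation and the `beta` parameter below are an illustrative stand-in of our own; the paper derives its Soft-landmarks from an ASM shape model, as detailed in Section 3.

```python
import numpy as np

def make_soft_landmarks(hard, mean, beta=0.8):
    """Illustrative Soft-landmark construction: pull each Hard
    (ground-truth) landmark toward the Mean-landmark.
    hard, mean: (N, 2) arrays of landmark coordinates.
    beta < 1 controls how close the result stays to the Hard-landmarks;
    smaller beta produces shapes nearer the mean, which are easier
    for a lightweight Student network to regress."""
    return mean + beta * (hard - mean)
```

By construction, the Soft-landmarks lie strictly between the Hard-landmarks and the Mean-landmark, so their distribution has lower variance than the original ground truth.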
The contributions of our approach are summarized as follows. First, to the best of our knowledge, this is the first time the concept of KD has been applied to a coordinate-regression facial landmark detection model. Second, we propose two different Teacher networks for guiding the Student network toward the ground-truth landmark points. Third, unlike popular loss functions, the magnitude of our proposed ALoss can be either positive or negative. Finally, using ALoss, we propose KD-Loss, which uses the geometrical relation between the Student and Teacher networks to improve accuracy in facial landmark detection.
The remainder of this paper is organized as follows. Section 2 reviews related work in facial landmark points detection. Section 3 explains the details of the proposed method and the training process. The evaluation of the method and the experimental results are provided in Section 4. Finally, Section 5 concludes with a discussion of the proposed method and future research directions.
Section snippets
Related work
The facial landmark detection task dates back over twenty years, to when classical (i.e., template-based) methods were introduced. The Active Shape Model (ASM) (Cootes et al., 2000) and Active Appearance Model (AAM) (Cootes et al., 1998, Martins et al., 2013) are among the first methods for facial landmark detection. In these methods, Principal Component Analysis (PCA) is applied to simplify the problem and learn parametric features of faces that model facial landmark variations.
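The PCA shape modeling underlying ASM and AAM can be sketched as follows. The helper names `fit_shape_model` and `project_shape` are our own for illustration, and a real ASM additionally aligns the training shapes (e.g., via Procrustes analysis) before applying PCA.

```python
import numpy as np

def fit_shape_model(shapes, n_components=4):
    """Minimal PCA shape model in the spirit of ASM:
    shapes is (M, 2N) -- M training faces, each with N landmarks
    flattened into a vector. Returns the mean shape and the top
    principal components (rows of the matrix)."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # SVD of the centered data gives the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project_shape(shape, mean, components):
    """Express a shape in the ASM parameterization x = mean + P @ b,
    where b are the shape parameters along the principal components."""
    b = components @ (shape - mean)
    return mean + components.T @ b, b
```

Constraining the parameters `b` to plausible ranges is what lets ASM reject implausible landmark configurations during fitting.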
Proposed model
In this section, we first explain the process of creating the Soft-landmarks, inspired by ASM. The Soft-landmarks are utilized for training the Tolerant-Teacher network. Then, we illustrate our proposed Student–Teacher architecture. After that, we explain our proposed KD-Loss function using the proposed assistive loss functions.
Experimental results
In this section, we first explain the training phase and the datasets that we used in evaluating our proposed model. We then describe the test phase as well as the implementation details and the evaluation metrics. Lastly, we present the results of facial landmark points detection using our proposed KD-Loss method.
Conclusion and future work
This paper proposed a novel architecture inspired by the KD concept for the facial landmark detection task. Using the Active Shape Model (Cootes et al., 2000), we defined the terms Mean-landmark, Soft-landmarks, and Hard-landmarks, and used them to train our proposed Tolerant and Tough Teacher networks, both EfficientNet-B3 (Tan and Le, 2019). In addition, we used MobileNetV2 (Sandler et al., 2018) as our Student network. The main novelty and idea of our paper are to design
CRediT authorship contribution statement
Ali Pourramezan Fard: Generating the idea of the paper, Implementation of the framework, Testing and Validation of the model, Writing – original draft. Mohammad H. Mahoor: Supervision, Validation of the idea, Paper editing, Paper revision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
We wish to show our appreciation to Dr. Julia Madsen for her great help on editing this paper.
References (70)
- et al., 2021. Facial landmarks localization using cascaded neural networks. Comput. Vis. Image Underst.
- et al., 2013. Generative face alignment through 2.5D active appearance models. Comput. Vis. Image Underst.
- et al., 2017. A survey of local feature methods for 3D face recognition. Pattern Recognit.
- et al., 2019. Face alignment using a 3D deeply-initialized ensemble of regression trees. Comput. Vis. Image Underst.
- et al., 2020. Fine-grained facial landmark detection exploiting intermediate feature representations. Comput. Vis. Image Underst.
- et al., 2018. Exemplar-based cascaded stacked auto-encoder networks for robust face alignment. Comput. Vis. Image Underst.
- Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M., 2013. Robust discriminative response map fitting with constrained...
- et al. 3D constrained local model for rigid and non-rigid facial tracking.
- et al., 2013. Localizing parts of faces using a consensus of exemplars. IEEE Trans. Pattern Anal. Mach. Intell.
- et al. Model compression.