Facial landmark points detection using knowledge distillation-based neural networks

https://doi.org/10.1016/j.cviu.2021.103316

Highlights

  • Applying knowledge distillation for facial landmark points detection.

  • Training lightweight convolutional neural networks for efficient face alignment.

  • Using two Teacher networks (a Tough-Teacher and a Tolerant-Teacher) to train a Student network for face alignment.

Abstract

Facial landmark detection is a vital step for numerous facial image analysis applications. Although some deep learning-based methods have achieved good performance in this task, they are often not suitable for running on mobile devices. Such methods rely on networks with many parameters, which makes training and inference time-consuming. Training lightweight neural networks such as MobileNets is often challenging, and the resulting models may have low accuracy. Inspired by knowledge distillation (KD), this paper presents a novel loss function to train a lightweight Student network (e.g., MobileNetV2) for facial landmark detection. We use two Teacher networks, a Tolerant-Teacher and a Tough-Teacher, in conjunction with the Student network. The Tolerant-Teacher is trained using Soft-landmarks created by active shape models, while the Tough-Teacher is trained using the ground-truth landmark points (aka Hard-landmarks). To utilize the facial landmark points predicted by the Teacher networks, we define an Assistive Loss (ALoss) for each Teacher network. Moreover, we define a loss function called KD-Loss that utilizes the facial landmark points predicted by the two pre-trained Teacher networks (EfficientNet-B3) to guide the lightweight Student network towards predicting the Hard-landmarks. Our experimental results on three challenging facial datasets show that the proposed architecture yields a better-trained Student network that can extract facial landmark points with high accuracy.

Introduction

Facial image alignment based on landmark points is a crucial step in many facial image analysis applications including face recognition (Lu and Tang, 2015, Soltanpour et al., 2017), face verification (Sun et al., 2014, Sun et al., 2013), face frontalization (Hassner et al., 2015), pose estimation (Vicente et al., 2015), and facial expression recognition (Sun et al., 2014, Zhao et al., 2003). The goal is to detect and localize the coordinates of predefined landmark points on human faces and use them for face alignment. In the past two decades, great progress has been made toward improving facial landmark detection algorithms’ accuracy. However, most of the previous research does not focus on designing and/or training lightweight networks that can run on mobile devices with limited computational power.

While facial landmark points detection is still considered a challenging task for faces with large pose variations and occlusion (Dong et al., 2018a, Wu et al., 2018), recent methods rely on heavy models with a large number of parameters, which makes them unsuitable for real-time applications. Moreover, with the growth of the Internet-of-Things (IoT), robotics, and mobile devices, it is vital to balance accuracy and model efficiency (i.e., computational time). Recently, deep learning-based methods have also attracted attention for tackling this problem. Among many lightweight neural network models, MobileNetV2 (Sandler et al., 2018) has proven to offer a good trade-off between accuracy and speed. However, because of its small number of network parameters, the accuracy of MobileNetV2 (Sandler et al., 2018) on the face alignment task might not be sufficient, especially when applied to faces with extreme poses or occlusions.

Tan and Le (2019) have recently proposed EfficientNet, a family of eight different networks designed to balance accuracy and model size. The designers of EfficientNet found a strong connection between the accuracy of a network and its depth, width, and resolution. Consequently, the EfficientNet family is designed to be efficient: it achieves good accuracy while remaining relatively small, meaning it has fewer network parameters, and fast, meaning it requires fewer floating-point operations (FLOPs).
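
For reference, Tan and Le (2019) formalize this connection through a compound scaling rule that ties depth, width, and resolution to a single coefficient; a summary of that rule, as given in their paper, is shown below.

```latex
% Compound scaling rule of EfficientNet (Tan and Le, 2019):
% a single coefficient \phi scales depth, width, and input resolution jointly.
\begin{aligned}
\text{depth: } d = \alpha^{\phi}, \qquad
\text{width: } w = \beta^{\phi}, \qquad
\text{resolution: } r = \gamma^{\phi}, \\
\text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad
\alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1,
\end{aligned}
```

where α, β, and γ are constants found by a small grid search on the base network, and larger members of the family (e.g., EfficientNet-B3) correspond to larger values of φ.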

Recently, knowledge distillation (KD) has been utilized in image classification (Hinton et al., 2015, Romero et al., 2014), object detection (Li et al., 2017), and semantic segmentation (Xie et al., 2018). Initially, the idea was to train a lightweight network with acceptable accuracy by transferring the features and knowledge generated by an ensemble of networks into a single smaller network (Buciluǎ et al., 2006). Later, Hinton et al. (2015) introduced the term knowledge distillation as a technique for creating a small model, called the Student network, that learns to reproduce the outputs of a more cumbersome model, called the Teacher network.
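
To make the classification-style KD idea concrete, the sketch below shows the classic distillation loss of Hinton et al. (2015), in which the Student matches the Teacher's temperature-softened output distribution in addition to the hard labels. This is background only; our KD-Loss for landmark regression is defined in Section 3, and the PyTorch-style names and default values here are illustrative.

```python
# Minimal sketch of the classification KD loss of Hinton et al. (2015).
# T (temperature) and alpha (hard/soft mixing weight) are illustrative defaults.
import torch
import torch.nn.functional as F

def hinton_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy with the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale as suggested by Hinton et al. (2015)
    return alpha * hard + (1.0 - alpha) * soft
```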

Inspired by the concept of KD, we propose a novel loss function called KD-Loss to improve face alignment accuracy. Specifically, we propose a KD-based architecture using two different Teacher networks – EfficientNet-B3 (Tan and Le, 2019) – to guide the lightweight Student network, MobileNetV2 (Sandler et al., 2018), to better cope with the facial landmark detection task. Using the facial landmarks predicted by each of the Teacher networks, we introduce two ALoss functions. Assisted by the ALoss terms, the KD-Loss considers the geometrical relation between the facial landmarks predicted by the Student and the two Teacher networks to improve the accuracy of MobileNetV2 (Sandler et al., 2018). In other words, we propose to use two independent sets of facial landmark points, predicted by our two Teacher networks, to guide the lightweight Student network towards better localization of the landmark points.
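
Since the exact ALoss and KD-Loss formulations are given in Section 3, the following is only a hypothetical sketch of the overall data flow: the Student's predicted landmarks are penalized against the Hard-landmarks and, through assistive terms, against the landmarks predicted by the two frozen Teachers. The weights and the plain MSE form used here are placeholders, not the paper's definitions.

```python
# Hypothetical sketch only: the actual ALoss/KD-Loss definitions appear in
# Section 3. This illustrates the data flow between the Student's prediction,
# the Hard-landmarks, and the two Teachers' predictions.
import torch

def kd_style_landmark_loss(student_pts, tough_pts, tolerant_pts, hard_pts,
                           w_main=1.0, w_tough=0.5, w_tolerant=0.5):
    # Main term: distance between the Student's prediction and the ground truth.
    main = torch.mean((student_pts - hard_pts) ** 2)
    # Assistive terms (placeholders): distances to each Teacher's prediction.
    # In the paper these terms are geometric and may be signed; plain MSE is
    # used here purely for illustration.
    a_tough = torch.mean((student_pts - tough_pts) ** 2)
    a_tolerant = torch.mean((student_pts - tolerant_pts) ** 2)
    return w_main * main + w_tough * a_tough + w_tolerant * a_tolerant
```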

We train our method in two phases. In the first phase, we create Soft-landmarks inspired by ASM (Cootes et al., 1998). Soft-landmarks are more similar to the Mean-landmark than the Hard-landmarks, which are the original facial landmarks. Hence, as a rule of thumb, it is easier for a lightweight model to predict the distribution of these Soft-landmarks than that of the original ground truth. We use this property to create a Teacher–Student architecture that improves the accuracy of the Student network. More specifically, in the first phase, we train one Teacher network using the Hard-landmarks and call it the Tough-Teacher, and another Teacher network using the Soft-landmarks as the ground-truth landmark points and call it the Tolerant-Teacher. Then, in the second phase, we use our proposed KD-Loss to transfer the knowledge gathered by both Teacher networks into the Student model during training. Fig. 4 and Fig. 5 depict the general architecture of our proposed training scheme. We tested our proposed method on the challenging 300W (Sagonas et al., 2013), WFLW (Wu et al., 2018), and COFW (Burgos-Artizzu et al., 2013) datasets. The results of our experiments show that MobileNetV2 trained with our KD-Loss approach detects facial landmark points more accurately than the original MobileNetV2 (Sandler et al., 2018). The results are also comparable with state-of-the-art methods, while the network size is significantly smaller than most of the previously proposed networks.
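
The two-phase procedure can be summarized schematically as follows; the sketch assumes PyTorch-style models, optimizers, and batches, and the loop structure and hyper-parameters are placeholders rather than the settings reported in Section 4.

```python
# Schematic outline of the two-phase training described above (placeholder code).
def train_two_phase(train_loader, tough_teacher, tolerant_teacher, student,
                    kd_loss_fn, optimizer_fn, epochs=10):
    # Phase 1: train the two Teacher networks (EfficientNet-B3) independently.
    #   - Tough-Teacher on the Hard-landmarks (original ground truth).
    #   - Tolerant-Teacher on the Soft-landmarks generated with ASM.
    # `targets` is assumed to be a dict holding both landmark sets per batch.
    for teacher, key in [(tough_teacher, "hard"), (tolerant_teacher, "soft")]:
        opt = optimizer_fn(teacher.parameters())
        for _ in range(epochs):
            for images, targets in train_loader:
                pred = teacher(images)
                loss = ((pred - targets[key]) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()

    # Phase 2: freeze both Teachers and train the Student (MobileNetV2) with KD-Loss.
    for p in list(tough_teacher.parameters()) + list(tolerant_teacher.parameters()):
        p.requires_grad_(False)
    opt = optimizer_fn(student.parameters())
    for _ in range(epochs):
        for images, targets in train_loader:
            s = student(images)
            t_tough = tough_teacher(images)
            t_tolerant = tolerant_teacher(images)
            loss = kd_loss_fn(s, t_tough, t_tolerant, targets["hard"])
            opt.zero_grad(); loss.backward(); opt.step()
```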

The contributions of our approach are summarized as follows. First, to the best of our knowledge, this is the first time the concept of KD has been applied to a coordinate-based regression model for facial landmark detection. Second, we propose two different Teacher networks for guiding the Student network toward the ground-truth landmark points. Third, unlike popular loss functions, the value of our proposed ALoss can be either positive or negative. Finally, using ALoss, we propose KD-Loss, which uses the geometrical relation between the Student and Teacher networks to improve facial landmark detection accuracy.

The remainder of this paper is organized as follows. Section 2 reviews the related work in facial landmark points detection. Section 3 explains the details of the proposed method and the training process. Then, the evaluation of the method as well as the experimental results are provided in Section 4. Finally, Section 5 concludes with a discussion of the proposed method and future research directions.

Section snippets

Related work

The facial landmark detection task dates back to over twenty years ago, when classical methods (aka template-based methods) were introduced. The Active Shape Model (ASM) (Cootes et al., 2000) and the Active Appearance Model (AAM) (Cootes et al., 1998, Martins et al., 2013) are among the first methods for facial landmark detection. In these methods, Principal Component Analysis (PCA) is applied to simplify the problem and learn parametric features of faces that model facial landmark variations.

Proposed model

In this section, we first explain the process of creating the Soft-landmarks, inspired by ASM. The Soft-landmarks are utilized for training the Tolerant-Teacher network. Then, we describe our proposed Student–Teacher architecture. After that, we explain our proposed KD-Loss function, built on the proposed assistive loss functions.
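
As one illustration of how an ASM-style shape model can yield Soft-landmarks that are pulled toward the Mean-landmark, the sketch below reconstructs each training shape from a truncated PCA basis; this is a hypothetical example of the general idea, not the exact procedure detailed in this section.

```python
# Hypothetical Soft-landmark generation via a PCA shape model (ASM-style).
# `shapes` is assumed to be an (N, 2K) array of K aligned landmark coordinates
# per face; the number of retained components is a placeholder.
import numpy as np

def make_soft_landmarks(shapes, n_components=10):
    mean_shape = shapes.mean(axis=0)                # the Mean-landmark
    centered = shapes - mean_shape
    # Principal components of the landmark distribution (ASM shape basis).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]                       # top eigen-shapes
    # Project each shape onto the truncated basis and reconstruct it.
    coeffs = centered @ basis.T
    soft = mean_shape + coeffs @ basis
    # Dropping the minor components pulls every shape toward the Mean-landmark,
    # which is what makes Soft-landmarks easier for a lightweight model to fit.
    return soft
```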

Experimental results

In this section, we first explain the training phase and the datasets that we used in evaluating our proposed model. We then describe the test phase as well as the implementation details and the evaluation metrics. Lastly, we present the results of facial landmark points detection using our proposed KD-Loss method.
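
Although this snippet does not list the metrics, face-alignment work on 300W, WFLW, and COFW commonly reports the Normalized Mean Error (NME). A minimal sketch is given below, assuming inter-ocular normalization; the eye-corner indices depend on each dataset's annotation scheme and are left as parameters.

```python
# Minimal NME sketch with inter-ocular normalization (assumed convention).
import numpy as np

def nme(pred, gt, left_eye_idx, right_eye_idx):
    # pred, gt: (K, 2) arrays of predicted and ground-truth landmark coordinates.
    per_point_err = np.linalg.norm(pred - gt, axis=1).mean()
    inter_ocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return per_point_err / inter_ocular
```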

Conclusion and future work

This paper proposed a novel architecture inspired by the KD concept for the facial landmark detection task. Using the Active Shape Model (Cootes et al., 2000), we defined the terms Mean-landmark, Soft-landmarks, and Hard-landmarks, and used them to train our proposed Tolerant and Tough Teacher networks, both of which are EfficientNet-B3 (Tan and Le, 2019). In addition, we used MobileNetV2 (Sandler et al., 2018) as our Student network. The main novelty and idea of our paper are to design

CRediT authorship contribution statement

Ali Pourramezan Fard: Generating the idea of the paper, Implementation of the framework, Testing and Validation of the model, Writing – original draft. Mohammad H. Mahoor: Supervision, Validation of the idea, Paper editing, Paper revision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

We wish to show our appreciation to Dr. Julia Madsen for her great help on editing this paper.

References (70)

  • Burgos-Artizzu, X.P., Perona, P., Dollár, P., 2013. Robust face landmark estimation under occlusion. In: Proceedings of...
  • Cootes, T., et al., 2000. An introduction to active shape models. Image Process. Anal.
  • Cootes, T.F., et al. Active appearance models.
  • Cristinacce, D., et al. Feature detection and tracking with constrained local models.
  • Deng, J., et al., 2019. Joint multi-view face alignment in the wild. IEEE Trans. Image Process.
  • Dong, X., Yan, Y., Ouyang, W., Yang, Y., 2018. Style aggregated network for facial landmark detection. In: Proceedings...
  • Dong, X., Yu, S.I., Weng, X., Wei, S.E., Yang, Y., Sheikh, Y., 2018. Supervision-by-registration: An unsupervised...
  • Fard, A.P., Abdollahi, H., Mahoor, M. ASMNet: A lightweight deep neural network for face alignment and pose...
  • Feng, Z.H., Kittler, J., Awais, M., Huber, P., Wu, X.J., 2018. Wing loss for robust facial landmark localisation with...
  • Feng, Z.-H., et al., 2020. Rectified wing loss for efficient and robust facial landmark localisation with convolutional neural networks. Int. J. Comput. Vis.
  • Feng, Z.H., Kittler, J., Christmas, W., Huber, P., Wu, X.J., 2017. Dynamic attention-controlled cascaded shape...
  • Guo, X., et al., 2019. PFLD: A practical facial landmark detector.
  • Hassner, T., Harel, S., Paz, E., Enbar, R., 2015. Effective face frontalization in unconstrained images. In:...
  • Hinton, G., et al., 2015. Distilling the knowledge in a neural network.
  • Honari, S., Yosinski, J., Vincent, P., Pal, C., 2016. Recombinator networks: Learning coarse-to-fine feature...
  • Iranmanesh, S.M., Dabouei, A., Soleymani, S., Kazemi, H., Nasrabadi, N., 2020. Robust facial landmark detection via...
  • Kazemi, V., Sullivan, J., 2014. One millisecond face alignment with an ensemble of regression trees. In: Proceedings of...
  • Kingma, D.P., et al., 2014. Adam: A method for stochastic optimization.
  • Kowalski, M., Naruniec, J., Trzcinski, T., 2017. Deep alignment network: A convolutional neural network for robust face...
  • Kumar, A., Chellappa, R., 2018. Disentangling 3d pose in a dendritic cnn for unconstrained 2d face alignment. In:...
  • Le, V., et al. Interactive facial feature localization.
  • Li, Q., Jin, S., Yan, J., 2017. Mimicking very efficient network for object detection. In: Proceedings of the IEEE...
  • Lu, C., Tang, X., 2015. Surpassing human-level face verification performance on LFW with GaussianFace. In: Twenty-Ninth...
  • Lv, J., Shao, X., Xing, J., Cheng, C., Zhou, X., 2017. A deep regression architecture with two-stage re-initialization...
  • Miao, X., Zhen, X., Liu, X., Deng, C., Athitsos, V., Huang, H., 2018. Direct shape regression networks for end-to-end...