Abstract

To address the low accuracy of large-pose face alignment, a cascade network based on a truncated AlexNet is designed and implemented in this paper. Parallel convolution and pooling layers are added to the original deep convolutional neural network and their results are concatenated, which improves the accuracy of the output. The intermediate parameters produced at each iteration are fed back into the CNN and iteratively refined to optimize the pose parameters and obtain more accurate face alignment results. To verify the effectiveness of this method, this paper runs tests on the AFLW and AFLW2000-3D datasets. The experiments show that the normalized mean error of this method is 5.00% and 5.27%, respectively. Compared with 3DDFA, a currently popular algorithm, the accuracy is improved by 0.60% and 0.15%, respectively.

1. Introduction

As an important research topic in the fields of artificial intelligence and face recognition, face alignment has received wide attention from both academia and industry. Its core task is to use computing devices to extract the semantics of pixels in face images, which has great theoretical research significance and practical application value. In recent years, the successful application of deep learning has greatly improved the accuracy of face alignment. However, there are still many challenges and bottlenecks in recognition under the unrestricted conditions of real scenes, among which pose variation is a factor that cannot be ignored and greatly affects the accuracy of face alignment.

At present, mainstream face alignment methods can be divided into two categories: 2D face alignment and 3D face alignment. Among the widely used 2D face alignment methods, Zhang et al. [1] proposed facial landmark detection based on deep multitask learning in 2014, and Lee et al. [2] improved it with a Gaussian-guided regression network in 2019. The coarse-to-fine shape searching method was proposed by Zhu et al. [3] in 2015. These works laid the foundation for face alignment in small and medium poses, where the yaw angle is less than 45° and all the landmarks are visible. The steps of 2D face alignment can be roughly divided into face preprocessing, shape initialization, shape prediction, and output.

Compared with traditional 2D face alignment, 3D face alignment mainly uses a subspace to model the 3D face and realizes fitting by minimizing the difference between the image and the model appearance, which makes the model more robust and accurate in unconstrained scenes. Of course, 3D face alignment methods have several inherent defects: the alignment results are similar to the average model and lack personalized features. In order to solve this problem, Yin et al. [4] proposed a 3D deformable model for face recognition; however, processing each image takes about one minute, which is too slow. Liu and Jourabloo [5] fitted a 3D deformable model to the 2D image: with the aid of a sparse 3D point distribution model, the model parameters and projection matrix are estimated by cascaded linear or nonlinear regressors, which realizes alignment of human faces in arbitrary poses. However, the recovery of facial detail features is still unsatisfactory. Then, Liu and Jourabloo [6] used 3D face modeling to improve the localization of landmarks in large-pose faces, but the accuracy of the alignment results is still limited by the linear parameterized 3D model, and large-pose alignment methods still need to be improved. Zhu et al. [7] improved face alignment performance across large poses and addressed all three challenges of traditional models: they require visible landmarks, which is not applicable to profile views; large poses cause significant appearance changes of the face from frontal to profile; and invisible landmarks must be located in large poses. The first challenge has been properly solved by the 3D dense face model [8], whereas the others still depend on the accuracy of the model rather than only on the method. Therefore, a more accurate and reliable model is needed. As a solution, we propose a cascaded convolutional neural network- (CNN-) based regression method.
CNNs have proved to have excellent capability for extracting useful information from images with large variations, in tasks such as object detection and image classification. On this basis, we design a new cascade network structure based on a truncated AlexNet to improve accuracy.

2. The Training of the Model

2.1. Feature Selection

Good features can make training efficient and improve the accuracy of the model. In order to obtain better features, we design a new cascade network structure based on a truncated AlexNet.

2.1.1. AlexNet

AlexNet deepens the network structure on the basis of LeNet [9]. The structure of LeNet is shown in Figure 1.

The structure of AlexNet is shown in Figure 2. The network contains five convolution layers and three fully connected layers. Compared with LeNet, AlexNet has a deeper network structure and uses several parallel convolution and pooling layers to extract image features. It also uses dropout and data augmentation to suppress overfitting.

2.1.2. Cascade Network Structure Based on Truncated AlexNet

Based on the structure of AlexNet, this paper constructs a new truncated AlexNet, whose structure is shown in Figure 3. An additional parallel convolution-pooling layer is added to the original structure to form the truncated AlexNet cascade network. The input image is stacked with the iterated PNCC and then fed through the parallel convolution branches of the network. The parallel results are concatenated and passed to a fully connected layer.

2.1.3. Network Structure

The purpose of 3D face alignment is to estimate the target parameters from a single face image. Different from existing networks, based on the cascaded network structure of 3DDFA, we add a parallel pooling layer and a concatenation step before the fully connected layer. In general, at iteration k (k = 0, 1, …, K), given an initial parameter p^k, we construct the specially designed feature PNCC with p^k and train a convolutional neural network Net^k to predict the parameter update Δp^k:

Δp^k = Net^k(I, PNCC(p^k)),  p^(k+1) = p^k + Δp^k.
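The iterative update above can be sketched as a simple fitting loop. The function and argument names here (`render_pncc`, `networks`) are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def cascade_fit(image, p0, networks, render_pncc):
    """Iteratively refine the 3DMM parameter vector p.

    networks: list of per-iteration regressors Net^k; each maps the
    stacked (image, PNCC) input to a parameter update delta_p.
    render_pncc: renders the PNCC feature for the current parameters.
    """
    p = p0
    for net in networks:
        pncc = render_pncc(p)                       # feedback: PNCC depends on p
        x = np.concatenate([image, pncc], axis=-1)  # stack along the channel axis
        delta_p = net(x)                            # predicted update delta p^k
        p = p + delta_p                             # p^(k+1) = p^k + delta p^k
    return p
```

Because PNCC is re-rendered from the updated parameters at every step, each network in the cascade sees a feature that reflects the current fitting accuracy.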

Afterwards, the improved intermediate parameter p^(k+1) becomes the input of the next network Net^(k+1), which has the same structure as Net^k. The input is the 100 × 100 × 3 color image stacked with PNCC. The network contains eight convolution layers, seven pooling layers, and two fully connected layers. The first two convolution layers share weights to extract low-level features. The last three convolution layers do not share weights, in order to extract location-sensitive features, which are further regressed to a 256-dimensional feature vector. The output is a 234-dimensional parameter update, including 6-dimensional pose parameters (f, pitch, yaw, roll, t2dx, and t2dy), 199-dimensional shape parameters αid, and 29-dimensional expression parameters αexp.
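The layout of the 234-dimensional output vector described above (6 + 199 + 29) can be made explicit with a small helper; the name `split_params` and the slice order are assumptions for illustration:

```python
import numpy as np

def split_params(p):
    """Split the 234-dimensional parameter update into its components."""
    assert p.shape == (234,)
    pose = p[:6]           # f, pitch, yaw, roll, t2dx, t2dy
    alpha_id = p[6:205]    # 199 shape parameters
    alpha_exp = p[205:]    # 29 expression parameters
    return pose, alpha_id, alpha_exp
```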

2.1.4. PNCC

The special structure of the cascaded CNN places three requirements on its input feature. First, the feedback property requires that the input feature should depend on the CNN output to enable the cascade manner. Second, the convergence property requires that the input feature should reflect the fitting accuracy, so that the cascade converges after some iterations. Finally, the convolvable property requires that the convolution on the input feature should make sense. Based on these three properties, we design our feature as follows: first, the 3D mean face is normalized to 0-1 along the x, y, and z axes, as given in the following equation:

NCC_d = (S̄_d − min(S̄_d)) / (max(S̄_d) − min(S̄_d)),  d ∈ {x, y, z},

where S̄ is the mean shape of the 3DMM in equation 4. The unique 3D coordinate of each vertex is called its normalized coordinate code (NCC). Since NCC has three channels like RGB, we can also show the mean face with NCC as its texture. Second, with a model parameter p, we adopt the Z-buffer to render the projected 3D face colored by NCC, as in the following equation:

PNCC = Z-Buffer(V3d(p), NCC),

where Z-Buffer(ν, t) renders an image from the 3D mesh ν colored by t, and V3d(p) is the current 3D face. Afterwards, PNCC is stacked with the input image and transferred to the CNN. The projected normalized coordinate code (PNCC) is shown in Figure 4.
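The per-axis normalization that produces NCC is a straightforward min-max rescaling; a minimal sketch, assuming the mean shape is given as an N × 3 vertex array:

```python
import numpy as np

def ncc(mean_shape):
    """Normalize a 3D mean face (N x 3 vertices) to [0, 1] per axis,
    giving each vertex a unique Normalized Coordinate Code (NCC)."""
    mn = mean_shape.min(axis=0)   # per-axis minimum over all vertices
    mx = mean_shape.max(axis=0)   # per-axis maximum over all vertices
    return (mean_shape - mn) / (mx - mn)
```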

2.2. 3DMM

Blanz and Basso [10] proposed the 3D morphable model (3DMM), which describes the 3D face space with PCA, and it is widely used in the face alignment field [11-13]. 3DMM is shown in the following equation:

S = S̄ + A_id α_id + A_exp α_exp,

where S is a 3D face, S̄ is the mean shape, A_id is the principal axes trained on 3D face scans with neutral expression and α_id is the shape parameter, and A_exp is the principal axes trained on the offsets between expression scans and neutral scans and α_exp is the expression parameter. In this work, A_id and A_exp come from the Basel Face Model (BFM) and FaceWarehouse [14], respectively. The 3D face is then projected onto the image plane with weak perspective projection:

V(p) = f · Pr · R · (S̄ + A_id α_id + A_exp α_exp) + t_2d,

where V(p) is the model construction and projection function, leading to the 2D positions of model vertices, f is the scale factor, Pr is the orthographic projection matrix (1 0 0; 0 1 0), R is the rotation matrix constructed from the rotation angles pitch, yaw, and roll, and t_2d is the translation vector. The collection of all the model parameters is P = [f, pitch, yaw, roll, t_2d, α_id, α_exp]^T.
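The 3DMM construction and the weak perspective projection can be written out directly; a minimal sketch assuming the shape is stored as a flattened 3N-vector and the basis matrices have matching column counts:

```python
import numpy as np

def build_shape(s_mean, A_id, alpha_id, A_exp, alpha_exp):
    """S = S_mean + A_id @ alpha_id + A_exp @ alpha_exp (3N-vector)."""
    return s_mean + A_id @ alpha_id + A_exp @ alpha_exp

def project(S, f, R, t2d):
    """Weak perspective projection V(p) = f * Pr * R * S + t2d."""
    Pr = np.array([[1., 0., 0.],
                   [0., 1., 0.]])       # orthographic projection matrix
    pts3d = S.reshape(-1, 3).T          # 3 x N matrix of vertices
    return f * (Pr @ R @ pts3d) + t2d.reshape(2, 1)
```

With R the identity and f = 1, the projection simply drops the z coordinate, which matches the orthographic interpretation of Pr.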

2.3. Loss Function

In this paper, the loss function is shown in the following equation:

J = (1/N) Σ_{i=1}^{N} L(f(x_i), y_i),

where L(f(x_i), y_i) measures the error between the predicted value of the model for the ith sample and the real label y_i. As mentioned above, this value should be minimized as far as possible to improve the fit between the model and the training set. However, the fit to the training set is not the final evaluation index; the test error is. Therefore, a regularization function Ω(ω) of the parameters ω is introduced to constrain the model and avoid overfitting, as shown in the following equation:

J = (1/N) Σ_{i=1}^{N} L(f(x_i), y_i) + λΩ(ω).
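A minimal sketch of the regularized objective, assuming a squared-error data term and an L2 penalty as Ω(ω); the weight λ and the concrete forms are illustrative choices, not taken from the paper:

```python
import numpy as np

def regularized_loss(preds, labels, w, lam=1e-3):
    """Mean squared parameter error plus an L2 penalty on the weights w."""
    data_term = np.mean(np.sum((preds - labels) ** 2, axis=1))  # fit to training set
    reg_term = lam * np.sum(w ** 2)                             # Omega(w), constrains the model
    return data_term + reg_term
```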

The initial learning rate was 10−4, and the batch size was 8. After 15 complete epochs, the learning rate was reduced to 10−5; after another 15 epochs, it was reduced to 10−6. In total, 40 epochs were carried out for the whole training.
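The step schedule above amounts to a simple piecewise-constant function of the epoch index:

```python
def learning_rate(epoch):
    """Step schedule: 1e-4 for the first 15 epochs, 1e-5 for the
    next 15, then 1e-6 for the remaining epochs (40 in total)."""
    if epoch < 15:
        return 1e-4
    elif epoch < 30:
        return 1e-5
    return 1e-6
```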

3. Discussion and Results

3.1. Evaluation Index

In this paper, the normalized mean error (NME) [15] is applied to measure the accuracy of face alignment, rather than the Euclidean distance normalized by the inter-eye distance; the reason is that the inter-eye distance becomes small for profile faces, which makes that normalization inaccurate. NME is shown in the following equation:

NME = (1/N) Σ_{k=1}^{N} ‖x_k − y_k‖₂ / d,

where x denotes the ground truth landmarks for a given face, y is the corresponding prediction, and d is the square root of the ground truth bounding box area, computed as d = √(w_bbox × h_bbox).
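The metric can be computed directly from the landmark arrays; a minimal sketch, assuming N × 2 arrays of predicted and ground-truth landmarks:

```python
import numpy as np

def nme(pred, gt, bbox_w, bbox_h):
    """Normalized mean error: mean point-to-point Euclidean distance,
    normalized by the square root of the ground-truth bounding-box area."""
    d = np.sqrt(bbox_w * bbox_h)
    errs = np.linalg.norm(pred - gt, axis=1)  # per-landmark distance
    return errs.mean() / d
```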

3.2. Experimental Analysis

The input is a single picture, and the outputs are the face detection image, the PNCC, and the pose estimation results. The experiments were conducted on a 2.30 GHz CPU and a GTX 1060 GPU. Table 1 shows the most popular image datasets and their main features.

In order to verify the effectiveness of the proposed large-pose face alignment method, the experiments are based on Annotated Facial Landmarks in the Wild (AFLW). The AFLW face database consists of face pictures collected in various natural situations, with accurately annotated landmarks. The database is suitable for research on face recognition, face detection, face alignment, and related topics. Table 2 and Figure 5 show the comparison with mainstream algorithms. Among them, ESR [16] (explicit shape regression), SDM [17] (supervised descent method), LBF [18] (local binary features), CFSS [3] (coarse-to-fine shape searching), RCPR [19] (robust cascaded pose regression), RMFA [20] (restrictive mean field approximation), and 3DDFA [21] are popular methods based on cascade regression.

The experimental results in Table 2 and Figures 5 and 6 show the accuracy of the method. Compared with the 3DDFA algorithm as the main reference, the NME on AFLW and AFLW2000-3D is reduced to 5.00% and 5.27%, respectively, which is better than several popular face alignment algorithms and demonstrates the effectiveness and accuracy of this method. The output results are shown in Figures 7-9. Among them, Figures 7(a), 8(a), and 9(a) are the landmark labeling results; Figures 7(b), 8(b), and 9(b) are the PNCC; and the cubes in Figures 7(c), 8(c), and 9(c) are the pose estimations of the current face. The results show that the proposed algorithm achieves good alignment in each pose.

4. Conclusion

In this paper, a face alignment method using a cascaded network structure is proposed for large-pose face alignment. By iterating a deep convolutional neural network repeatedly and using the iterative results to regress the facial landmarks, face alignment in large-pose environments is realized, and the normalized mean error is used to evaluate the alignment accuracy. The experimental results show that this method has obvious advantages in accuracy over existing face alignment methods. However, the efficiency of the algorithm still needs to be improved, and it remains difficult to achieve accurate face alignment in the presence of external occlusion. These problems need further study and discussion, and they will be the focus of subsequent research work.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by General Project of Shanghai Normal University.