Abstract

To address the low accuracy of large-pose face alignment, a cascade network based on a truncated AlexNet is designed and implemented in this paper. Parallel convolution and pooling layers are added to the original deep convolutional neural network and their results are concatenated, which improves the accuracy of the output. The intermediate parameters produced at each iteration are fed back into the CNN and iteratively refined to optimize the pose parameters and obtain more accurate face alignment results. To verify the effectiveness of this method, this paper runs tests on the AFLW and AFLW2000-3D datasets. The experiments show that the normalized mean error of this method is 5.00% and 5.27%, respectively. Compared with 3DDFA, a currently popular algorithm, the accuracy is improved by 0.60% and 0.15%, respectively.

1. Introduction

As an important research topic in the fields of artificial intelligence and face recognition, face alignment has received wide attention from both academia and industry. Its core task is to use computing devices to extract the semantics of pixels in face images, which has great theoretical research significance and practical application value. In recent years, the successful application of deep learning has greatly improved the accuracy of face alignment. However, there are still many challenges and bottlenecks in recognition under the unrestricted conditions of real scenes, among which pose variation is a factor that cannot be ignored and greatly affects the accuracy of face alignment.

At present, mainstream face alignment methods can be divided into two categories: 2D face alignment and 3D face alignment. Among the widely used 2D face alignment methods, Zhang et al. [1] proposed facial landmark detection based on deep multitask learning in 2014, and Lee et al. [2] improved it with a Gaussian-guided regression network in 2019. The coarse-to-fine shape searching method was proposed by Zhu et al. [3] in 2015. These works laid the foundation for face alignment in small and medium poses, where the yaw angle is less than 45° and all the landmarks are visible. The steps of 2D face alignment can be roughly divided into face preprocessing, shape initialization, shape prediction, and output.

Compared with traditional 2D face alignment, 3D face alignment mainly uses a subspace to model the 3D face and realizes fitting by minimizing the difference between the image and the model appearance, which makes the model more robust and accurate in unconstrained scenes. Of course, 3D face alignment methods have several inherent defects: the alignment results are similar to the average model and lack personalized features. In order to solve this problem, Yin et al. [4] proposed a 3D deformable model for face recognition; however, processing each image takes about one minute, which is too slow. Liu and Jourabloo [5] fitted a 3D deformable model to the 2D image: with the aid of a sparse 3D point distribution model, the model parameters and projection matrix are estimated by cascaded linear or nonlinear regressors, which realizes alignment of human faces in arbitrary poses. However, the recovery of facial detail features is still unsatisfactory. Then, Liu and Jourabloo [6] used 3D face modeling to improve the localization of landmarks in large-pose faces, but the accuracy of the alignment results is still limited by the linear parameterized 3D model, and large-pose alignment methods still need to be improved. Zhu et al. [7] improved face alignment performance across large poses and addressed all three challenges of traditional models: they require visible landmarks, which is not applicable to profile views; large poses cause significant appearance changes of the face from frontal to profile; and invisible landmarks must be located in large poses. The first challenge has been properly solved by the 3D dense face model [8], whereas the others still depend on the accuracy of the model rather than only on the method. Therefore, a more accurate and reliable model is needed. As a solution, we propose a cascaded convolutional neural network- (CNN-) based regression method.
CNNs have proved to have excellent capability for extracting useful information from images with large variations, in tasks such as object detection and image classification. On this basis, we design a new cascade network structure based on a truncated AlexNet to improve accuracy.

2. The Training of the Model

2.1. Feature Selection

Good features can make training efficient and improve the accuracy of the model. In order to obtain better features, we design a new cascade network structure based on a truncated AlexNet.

2.1.1. AlexNet

AlexNet deepens the network structure on the basis of LeNet [9]. The structure of LeNet is shown in Figure 1.

The structure of AlexNet is shown in Figure 2. The network contains five convolution layers and three fully connected layers. Compared with LeNet, AlexNet has a deeper network structure and uses several parallel convolution and pooling layers to extract image features. It also uses dropout and data augmentation to suppress overfitting.

2.1.2. Cascade Network Structure Based on Truncated AlexNet

Based on the structure of AlexNet, this paper constructs a new truncated AlexNet, whose structure is shown in Figure 3. An additional parallel convolution-pooling layer is added to the original structure to form the truncated AlexNet cascade network. The input image is stacked with the iterated PNCC and then fed through the parallel convolution branches of the network. The parallel results are concatenated and passed to a fully connected layer.

2.1.3. Network Structure

The purpose of 3D face alignment is to estimate the target parameters from a single face image. Different from existing networks, based on the cascaded network structure of 3DDFA, we add a parallel pooling layer and a concatenation step before the fully connected layer. In general, at iteration k (k = 0, 1, …, K), given an initial parameter p^k, we construct the specially designed feature PNCC with p^k and train a convolutional neural network Net^k to predict the parameter update Δp^k:

Δp^k = Net^k(I, PNCC(p^k)),  p^(k+1) = p^k + Δp^k.
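The iterative update above can be sketched as a simple fitting loop. The function and argument names here (`render_pncc`, `networks`) are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def cascade_fit(image, p0, networks, render_pncc):
    """Iteratively refine the 3DMM parameter vector p.

    networks: list of per-iteration regressors Net^k; each maps the
    stacked (image, PNCC) input to a parameter update delta_p.
    render_pncc: renders the PNCC feature for the current parameters.
    """
    p = p0
    for net in networks:
        pncc = render_pncc(p)                       # feedback: PNCC depends on p
        x = np.concatenate([image, pncc], axis=-1)  # stack along the channel axis
        delta_p = net(x)                            # predicted update delta p^k
        p = p + delta_p                             # p^(k+1) = p^k + delta p^k
    return p
```

Because PNCC is re-rendered from the updated parameters at every step, each network in the cascade sees a feature that reflects the current fitting accuracy.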

Afterwards, the improved intermediate parameter p^(k+1) becomes the input of the next network Net^(k+1), which has the same structure as Net^k. The input is the 100 × 100 × 3 color image stacked with PNCC. The network contains eight convolution layers, seven pooling layers, and two fully connected layers. The first two convolution layers share weights to extract low-level features. The last three convolution layers do not share weights, in order to extract location-sensitive features, which are further regressed to a 256-dimensional feature vector. The output is a 234-dimensional parameter update, including 6-dimensional pose parameters (f, pitch, yaw, roll, t2dx, and t2dy), 199-dimensional shape parameters αid, and 29-dimensional expression parameters αexp.
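The layout of the 234-dimensional output vector described above (6 + 199 + 29) can be made explicit with a small helper; the name `split_params` and the slice order are assumptions for illustration:

```python
import numpy as np

def split_params(p):
    """Split the 234-dimensional parameter update into its components."""
    assert p.shape == (234,)
    pose = p[:6]           # f, pitch, yaw, roll, t2dx, t2dy
    alpha_id = p[6:205]    # 199 shape parameters
    alpha_exp = p[205:]    # 29 expression parameters
    return pose, alpha_id, alpha_exp
```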

2.1.4. PNCC

The special structure of the cascaded CNN places three requirements on its input feature. First, the feedback property requires that the input feature should depend on the CNN output to enable the cascade manner. Second, the convergence property requires that the input feature should reflect the fitting accuracy, so that the cascade converges after some iterations. Finally, the convolvable property requires that the convolution on the input feature should make sense. Based on these three properties, we design our feature as follows: first, the 3D mean face is normalized to 0-1 along the x, y, and z axes, as given in the following equation:

NCC_d = (S̄_d − min(S̄_d)) / (max(S̄_d) − min(S̄_d)),  d ∈ {x, y, z},

where S̄ is the mean shape of the 3DMM in equation 4. The unique 3D coordinate of each vertex is called its normalized coordinate code (NCC). Since NCC has three channels like RGB, we can also show the mean face with NCC as its texture. Second, with a model parameter p, we adopt the Z-buffer to render the projected 3D face colored by NCC, as in the following equation:

PNCC = Z-Buffer(V3d(p), NCC),

where Z-Buffer(ν, t) renders an image from the 3D mesh ν colored by t, and V3d(p) is the current 3D face. Afterwards, PNCC is stacked with the input image and transferred to the CNN. The projected normalized coordinate code (PNCC) is shown in Figure 4.
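The per-axis normalization that produces NCC is a straightforward min-max rescaling; a minimal sketch, assuming the mean shape is given as an N × 3 vertex array:

```python
import numpy as np

def ncc(mean_shape):
    """Normalize a 3D mean face (N x 3 vertices) to [0, 1] per axis,
    giving each vertex a unique Normalized Coordinate Code (NCC)."""
    mn = mean_shape.min(axis=0)   # per-axis minimum over all vertices
    mx = mean_shape.max(axis=0)   # per-axis maximum over all vertices
    return (mean_shape - mn) / (mx - mn)
```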

2.2. 3DMM

Blanz and Basso [10] proposed the 3D morphable model (3DMM), which describes the 3D face space with PCA, and it is widely used in the face alignment field [11-13]. 3DMM is shown in the following equation:

S = S̄ + A_id α_id + A_exp α_exp,

where S is a 3D face, S̄ is the mean shape, A_id is the principal axes trained on 3D face scans with neutral expression and α_id is the shape parameter, and A_exp is the principal axes trained on the offsets between expression scans and neutral scans and α_exp is the expression parameter. In this work, A_id and A_exp come from the Basel Face Model (BFM) and FaceWarehouse [14], respectively. The 3D face is then projected onto the image plane with weak perspective projection:

V(p) = f · Pr · R · (S̄ + A_id α_id + A_exp α_exp) + t_2d,

where V(p) is the model construction and projection function, leading to the 2D positions of model vertices, f is the scale factor, Pr is the orthographic projection matrix (1 0 0; 0 1 0), R is the rotation matrix constructed from the rotation angles pitch, yaw, and roll, and t_2d is the translation vector. The collection of all the model parameters is P = [f, pitch, yaw, roll, t_2d, α_id, α_exp]^T.
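The 3DMM construction and the weak perspective projection can be written out directly; a minimal sketch assuming the shape is stored as a flattened 3N-vector and the basis matrices have matching column counts:

```python
import numpy as np

def build_shape(s_mean, A_id, alpha_id, A_exp, alpha_exp):
    """S = S_mean + A_id @ alpha_id + A_exp @ alpha_exp (3N-vector)."""
    return s_mean + A_id @ alpha_id + A_exp @ alpha_exp

def project(S, f, R, t2d):
    """Weak perspective projection V(p) = f * Pr * R * S + t2d."""
    Pr = np.array([[1., 0., 0.],
                   [0., 1., 0.]])       # orthographic projection matrix
    pts3d = S.reshape(-1, 3).T          # 3 x N matrix of vertices
    return f * (Pr @ R @ pts3d) + t2d.reshape(2, 1)
```

With R the identity and f = 1, the projection simply drops the z coordinate, which matches the orthographic interpretation of Pr.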

2.3. Loss Function

In this paper, the loss function is shown in the following equation:

J = (1/N) Σ_{i=1}^{N} L(f(x_i), y_i),

where L(f(x_i), y_i) measures the error between the predicted value of the model for the ith sample and the real label y_i. As mentioned above, this value should be minimized as far as possible to improve the fit between the model and the training set. However, the fit to the training set is not the final evaluation index; the test error is. Therefore, a regularization function Ω(ω) of the parameters ω is introduced to constrain the model and avoid overfitting, as shown in the following equation:

J = (1/N) Σ_{i=1}^{N} L(f(x_i), y_i) + λΩ(ω).
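A minimal sketch of the regularized objective, assuming a squared-error data term and an L2 penalty as Ω(ω); the weight λ and the concrete forms are illustrative choices, not taken from the paper:

```python
import numpy as np

def regularized_loss(preds, labels, w, lam=1e-3):
    """Mean squared parameter error plus an L2 penalty on the weights w."""
    data_term = np.mean(np.sum((preds - labels) ** 2, axis=1))  # fit to training set
    reg_term = lam * np.sum(w ** 2)                             # Omega(w), constrains the model
    return data_term + reg_term
```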

The initial learning rate was 10−4, and the batch size was 8. After 15 complete epochs, the learning rate was reduced to 10−5; after another 15 epochs, it was reduced to 10−6. In total, 40 epochs were carried out for the whole training.
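The step schedule above amounts to a simple piecewise-constant function of the epoch index:

```python
def learning_rate(epoch):
    """Step schedule: 1e-4 for the first 15 epochs, 1e-5 for the
    next 15, then 1e-6 for the remaining epochs (40 in total)."""
    if epoch < 15:
        return 1e-4
    elif epoch < 30:
        return 1e-5
    return 1e-6
```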

3. Discussion and Results

3.1. Evaluation Index

In this paper, the normalized mean error (NME) [15] is applied to measure the accuracy of face alignment, rather than the Euclidean distance normalized by the inter-eye distance; the reason is that the inter-eye distance becomes small for profile faces, which makes that normalization inaccurate. NME is shown in the following equation:

NME = (1/N) Σ_{k=1}^{N} ‖x_k − y_k‖₂ / d,

where x denotes the ground truth landmarks for a given face, y is the corresponding prediction, and d is the square root of the ground truth bounding box area, computed as d = √(w_bbox × h_bbox).
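The metric can be computed directly from the landmark arrays; a minimal sketch, assuming N × 2 arrays of predicted and ground-truth landmarks:

```python
import numpy as np

def nme(pred, gt, bbox_w, bbox_h):
    """Normalized mean error: mean point-to-point Euclidean distance,
    normalized by the square root of the ground-truth bounding-box area."""
    d = np.sqrt(bbox_w * bbox_h)
    errs = np.linalg.norm(pred - gt, axis=1)  # per-landmark distance
    return errs.mean() / d
```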

3.2. Experimental Analysis

The input is a single picture, and the outputs are the face detection image, the PNCC, and the pose estimation results. The experiments were conducted on a 2.30 GHz CPU and a GTX 1060 GPU. Table 1 shows the most popular image datasets and their main features.

In order to verify the effectiveness of the proposed large-pose face alignment method, the experiments are based on Annotated Facial Landmarks in the Wild (AFLW). The AFLW face database consists of face pictures collected in various natural situations, with accurately annotated landmarks. The database is suitable for research on face recognition, face detection, face alignment, and related topics. Table 2 and Figure 5 show the comparison with mainstream algorithms. Among them, ESR [16] (explicit shape regression), SDM [17] (supervised descent method), LBF [18] (local binary features), CFSS [3] (coarse-to-fine shape searching), RCPR [19] (robust cascaded pose regression), RMFA [20] (restrictive mean field approximation), and 3DDFA [21] are popular methods based on cascade regression.

The experimental results in Table 2 and Figures 5 and 6 show the accuracy of the method. Compared with the 3DDFA algorithm as the main reference, the NME on AFLW and AFLW2000-3D is reduced to 5.00% and 5.27%, respectively, which is better than several popular face alignment algorithms and demonstrates the effectiveness and accuracy of this method. The output results are shown in Figures 7-9. Among them, Figures 7(a), 8(a), and 9(a) are the landmark labeling results; Figures 7(b), 8(b), and 9(b) are the PNCC; and the cubes in Figures 7(c), 8(c), and 9(c) are the pose estimations of the current face. The results show that the proposed algorithm achieves good alignment in each pose.

4. Conclusion

In this paper, a face alignment method using a cascaded network structure is proposed for large-pose face alignment. By iterating a deep convolutional neural network repeatedly and using the iterative results to regress the facial landmarks, face alignment in large-pose environments is realized, and the normalized mean error is used to evaluate the alignment accuracy. The experimental results show that this method has obvious advantages in accuracy over existing face alignment methods. However, the efficiency of the algorithm still needs to be improved, and it remains difficult to achieve accurate face alignment in the presence of external occlusion. These problems need further study and discussion, and they will be the focus of subsequent research work.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by General Project of Shanghai Normal University.