Semi-supervised facial landmark annotation☆
Highlights
► A semi-supervised least-squares-based congealing algorithm is proposed.
► A statistical shape model learned online is integrated into congealing.
► We develop a system to automatically estimate the landmark set for an image ensemble.
Introduction
Image annotation for training data is an essential step in many learning-based computer vision tasks. In general, image annotation represents at least two types of prior knowledge. One is semantic knowledge, such as a person’s ID for face recognition, or an object’s name for content-based image retrieval. The other is geometric/landmark knowledge. In learning-based object detection [1], [2], for example, the position and size of the object (face/pedestrian/car) need to be annotated for all training images. For supervised face alignment [3], [4], each training image must be annotated with a set of landmarks, which describe the 2D locations of the key facial features.
This paper focuses on geometric/landmark knowledge annotation, which is typically carried out manually. Practical applications, such as object detection, often require thousands of annotated images to achieve sufficient generalization capability. Hence, manual annotation becomes labor-intensive and time-consuming for these applications. Furthermore, image annotation is also an error-prone process due to annotator error, imperfect description of the objectives, and inconsistencies among different annotators.
To alleviate these problems, this paper presents an approach to automatically provide landmark annotation for a large set of images in a semi-supervised fashion. That is, using manually annotated landmark locations for a small number of images, our approach can automatically estimate the landmark locations for the entire set of images (see Fig. 1). In one example, we will demonstrate that 15 manually annotated images may be used to automatically annotate a complete set of 1176 images with the help of a face detector. The core of our algorithm, named Semi-supervised Least-Squares Congealing (SLSC), is the minimization of an objective function defined as the summation of the pairwise L2 distances between warped images. Two types of distances are used: the distance between the annotated and unannotated images, and the distance between the unannotated images. The objective function is iteratively minimized via the well-known and efficient inverse warping technique [5]. During the optimization process, we also constrain the estimated landmark locations by utilizing shape statistics that are learned in an online manner, which is shown to result in better convergence of landmark position estimates and hence improved robustness of the annotation.
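The objective described above can be illustrated with a minimal sketch. It assumes the images have already been warped and flattened into vectors, and it introduces a hypothetical weight `alpha` to balance the two distance types; the function name and the weighting scheme are illustrative assumptions rather than the exact formulation used in SLSC.

```python
import numpy as np

def slsc_objective(unannotated, annotated, alpha=0.5):
    """Sum of squared L2 distances over a stack of already-warped images.

    unannotated: (K, D) array, one warped unannotated image vector per row.
    annotated:   (K_tilde, D) array of warped annotated image vectors.
    alpha:       hypothetical weight balancing the two distance types.
    """
    U = np.asarray(unannotated, dtype=float)
    A = np.asarray(annotated, dtype=float)
    # Distances between each unannotated image and every annotated image.
    d_ua = ((U[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    # Distances between pairs of unannotated images (the diagonal is zero).
    d_uu = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    # d_uu.sum() counts every ordered pair (i, j) with i != j.
    return alpha * d_ua.sum() + (1.0 - alpha) * d_uu.sum()
```

In an iterative scheme, this value would be re-evaluated after each update of the warping parameters; the warping and the parameter update themselves are omitted from the sketch.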
Several prior works on joint alignment for an image ensemble [6], [7], [8] estimate global affine parameters for each image. However, most real-world objects exhibit non-rigid deformation that is not well modeled by an affine transformation. Estimating more realistic deformations using a large set of landmarks is an important step towards accurately characterizing the shape variation within an object class. Motivated by this, we propose a hierarchical patch-based approach, together with a greedy patch selection algorithm, for estimating landmark positions. Starting from the whole face region, we iteratively select the patch with the greatest potential to minimize the objective function and conduct congealing for this patch simultaneously with its neighboring patch. These operations are applied successively to patches of gradually reduced size. With this strategy, the landmark annotation from a larger patch can be propagated to smaller patches, which enhances the robustness of the annotation. Furthermore, congealing on small patches allows the locations of landmarks to be ultimately determined by local appearance information, which improves the precision of annotation. In addition, a joint congealing on two neighboring patches is proposed to enforce the geometrical consistency between them. Our applications on facial images show that even when only a few images of the ensemble are manually annotated, the landmarks of the remaining images can be estimated accurately. An overview of the system is illustrated in Fig. 2.
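The coarse-to-fine, greedy ordering of patches can be sketched as follows. This is only an illustration under stated assumptions: the residual inside a patch (deviation from the ensemble mean) is used as a stand-in for "potential to minimize the objective function", patches are split in half along their longer side, and the congealing step that would run on each selected patch is omitted. The names `patch_residual`, `greedy_patch_order`, and `min_size` are hypothetical.

```python
import heapq
import numpy as np

def patch_residual(images, patch):
    """Mean squared deviation from the ensemble mean inside a patch (proxy score)."""
    top, left, h, w = patch
    crop = images[:, top:top + h, left:left + w]
    return float(((crop - crop.mean(axis=0)) ** 2).mean())

def greedy_patch_order(images, min_size=8):
    """Return patches in the order a greedy coarse-to-fine pass would visit them.

    images: (N, H, W) array of roughly aligned images.
    """
    _, H, W = images.shape
    root = (0, 0, H, W)  # start from the whole region
    heap = [(-patch_residual(images, root), root)]
    order = []
    while heap:
        # Pop the patch with the largest residual (scores are negated for the min-heap).
        _, patch = heapq.heappop(heap)
        order.append(patch)  # a real system would congeal this patch here
        top, left, h, w = patch
        # Stop once halving the longer side would go below min_size.
        if max(h, w) // 2 < min_size:
            continue
        if h >= w:
            halves = [(top, left, h // 2, w), (top + h // 2, left, h - h // 2, w)]
        else:
            halves = [(top, left, h, w // 2), (top, left + w // 2, h, w - w // 2)]
        for child in halves:
            heapq.heappush(heap, (-patch_residual(images, child), child))
    return order
```

Because congealing is skipped, the residuals here never change between selections; in a full system they would be recomputed after each patch is aligned.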
Our proposed automatic image annotation framework has four primary contributions:
- (1) A core algorithm is proposed for semi-supervised least-squares-based congealing of an image ensemble. We describe its efficient implementation using the inverse warping technique [5] and provide computational analysis.
- (2) A statistical shape model learned online is integrated into the congealing process to reduce outliers of landmark estimation among the ensemble.
- (3) A coarse-to-fine patch-based scheme together with a greedy patch selection strategy is proposed to improve the accuracy of landmark estimation. Furthermore, geometrical constraints are employed in the cost function to enforce the geometrical consistency between two neighboring patches, and thus improve the reliability of landmark annotation.
- (4) An end-to-end system is developed for automatic estimation of a set of landmarks in an ensemble of facial images with very few manually annotated images. Extensive experiments that qualitatively and quantitatively evaluate the performance and capabilities of the system and comparisons with the state-of-the-art techniques [9], [8], [10] have been conducted and are reported here.
The rest of the paper is organized as follows: After a brief description of related work in Section 2, this paper presents the semi-supervised least-squares-based congealing (SLSC) algorithm in Section 3. We then describe the shape constrained semi-supervised least-squares-based congealing (SSLSC) in Section 4, and the greedy patch selection scheme (i.e., patch-based SSLSC) in Section 5. Section 6 describes our extensive experimental results. The paper concludes in Section 7.
Section snippets
Prior work
In some notable and early work on unsupervised joint alignment, Learned-Miller [6], [7] denotes the process as “congealing”. The underlying idea is to minimize an entropy-based cost function by estimating the warping parameters of an ensemble. More recently, Cox et al. [8] propose a least-squares-based congealing (LSC) algorithm, which uses L2 constraints to estimate the warping parameter of each image. An inverse compositional parameter updating strategy further improves the congealing
Semi-supervised least-squares congealing
In this section we will describe the objective function, detailed derivation, and computational analysis of our core algorithm, semi-supervised least-squares congealing (SLSC).
Similar to conventional congealing algorithms, SLSC takes an ensemble of images as input, among which we assume that there are $K$ unannotated images $I = \{I_i\}_{i \in [1, K]}$, each of which is associated with an $m$-dimensional warping parameter vector $p_i = [p_{i1}, p_{i2}, \ldots, p_{im}]^T$. $I_i(\cdot)$ denotes a 1D vector containing the image intensity values
Shape-constrained SLSC
In this section, we will introduce a shape-constrained SLSC, which improves the robustness of the congealing process by reducing outliers. Given the warping parameters for all images and their corresponding landmark locations, where $s = [x_1, y_1, x_2, y_2, \ldots, x_V, y_V]^T$ is a concatenated vector of 2D landmarks for $V$ landmarks, there are two ways of mapping between each pair of the warping parameter vector and its corresponding landmark
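One common way to use shape statistics to suppress outliers is to fit a PCA shape model to the current landmark estimates and project each estimate back onto it. The sketch below illustrates that generic idea only; it is an assumption that the constraint takes this form, and the names `constrain_shapes`, `n_modes`, and `max_sigma` are hypothetical.

```python
import numpy as np

def constrain_shapes(shapes, n_modes=2, max_sigma=3.0):
    """Project landmark vectors onto a PCA shape model fit to the same ensemble.

    shapes: (N, 2V) array, each row laid out as [x1, y1, ..., xV, yV].
    """
    S = np.asarray(shapes, dtype=float)
    mean = S.mean(axis=0)
    centered = S - mean
    # Principal shape modes from the SVD of the centered landmark matrix.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_modes]                                   # (n_modes, 2V)
    sigma = sing[:n_modes] / np.sqrt(max(len(S) - 1, 1))   # per-mode std. dev.
    coeffs = centered @ basis.T                            # (N, n_modes)
    # Clamp each coefficient to +/- max_sigma standard deviations.
    limit = max_sigma * sigma
    coeffs = np.clip(coeffs, -limit, limit)
    return mean + coeffs @ basis
```

Re-fitting the model from the current estimates at every iteration is one way to read "learned online"; shapes that deviate strongly from the ensemble are pulled back towards the retained modes.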
Iterative patch-based landmark annotation
Having dealt with the outliers during the congealing process, we next aim to improve the accuracy of the landmark annotation, which is crucial for practical applications. Since the shape deformation of a real-world object is often non-rigid due to inter-subject variations, object motions, and camera views, estimating the global and rigid transformation of the object is not sufficient to characterize the object. However, jointly estimating the non-rigid transformation is difficult to solve since the
Experiments
To demonstrate the effectiveness of the proposed algorithm, we have performed extensive validation studies for the application of annotating facial landmarks. It is desirable to find results from a previous state-of-the-art approach to compare with ours. A fair comparison with supervised methods such as AAM [9] would be to train AAM using the same set of annotated images provided to our approach. In contrast, we aim to automatically annotate a set of specific landmarks around the facial
Conclusions
Shape deformation of images of a real-world object is often non-rigid due to inter-instance variability, object motion, and changing camera view point. Automatically estimating non-rigid deformations for an object class is a critical step in characterizing the object and learning statistical shape models. Our proposed approach facilitates such a task by automatically producing annotated data sets with only a small number of manually annotated examples. Extensive experiments demonstrate that our
Acknowledgments
This project was supported by awards #2007-DE-BX-K191 and #2007-MU-CX-K001 awarded by the National Institute of Justice, Office of Justice Programs, US Department of Justice. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the Department of Justice.
References (36)
- et al., Active shape models—their training and application, Comput. Vision Image Understand. (1995)
- et al., Generic vs. person specific active appearance models, J. Image Vision Comput. (2005)
- et al., Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories, Comput. Vision Image Understand. (2007)
- et al., Robust real-time face detection, Int. J. Comput. Vision (2004)
- N. Dalal, W. Triggs, Histograms of oriented gradients for human detection, in: Proc. IEEE Conf. on Computer Vision and...
- et al., Active appearance models, IEEE Trans. Pattern Anal. Mach. Intell. (2001)
- et al., Lucas–Kanade 20 years on: a unifying framework, Int. J. Comput. Vision (2004)
- E. Learned-Miller, N. Matsakis, P. Viola, Learning from one example through shared densities on transforms, in: Proc....
- Data driven image models through continuous joint alignment, IEEE Trans. Pattern Anal. Mach. Intell. (2006)
- M. Cox, S. Sridharan, S. Lucey, J. Cohn, Least squares congealing for unsupervised alignment of images, in: Proc. IEEE...
- Active appearance models revisited, Int. J. Comput. Vision
- Morphable surface models, Int. J. Comput. Vision
☆ This paper has been recommended for acceptance by K.W. Bowyer.