Semi-supervised facial landmark annotation

https://doi.org/10.1016/j.cviu.2012.03.008

Abstract

Landmark annotation for training images is essential for many learning tasks in computer vision, such as object detection, tracking, and alignment. Image annotation is typically conducted manually, which is both labor-intensive and error-prone. To improve this process, this paper proposes a new approach to estimating the locations of a set of landmarks for a large image ensemble using manually annotated landmarks for only a small number of images in the ensemble. Our approach, named semi-supervised least-squares congealing, aims to minimize an objective function defined on both annotated and unannotated images. A shape model is learned online to constrain the landmark configuration. We employ an iterative coarse-to-fine patch-based scheme together with a greedy patch selection strategy for landmark location estimation. Extensive experiments on facial images show that our approach can reliably and accurately annotate landmarks for a large image ensemble starting with a small number of manually annotated images, under several challenging scenarios.

Highlights

► A semi-supervised least-squares-based congealing algorithm is proposed.
► A statistical shape model learned online is integrated into congealing.
► We develop a system to automatically estimate the landmark set for an image ensemble.

Introduction

Image annotation for training data is an essential step in many learning-based computer vision tasks. In general, there are at least two types of prior knowledge represented by image annotation. One is semantic knowledge, such as a person’s ID for face recognition, or an object’s name for content-based image retrieval. The other is geometric/landmark knowledge. In learning-based object detection [1], [2], for example, the position and size of the object (face/pedestrian/car) need to be annotated for all training images. For supervised face alignment [3], [4], each training image must be annotated with a set of landmarks, which describe the 2D locations of the key facial features.

This paper focuses on geometric/landmark knowledge annotation, which is typically carried out manually. Practical applications, such as object detection, often require thousands of annotated images to achieve sufficient generalization capability. Hence, manual annotation becomes labor-intensive and time-consuming for these applications. Furthermore, image annotation is also an error-prone process due to annotator error, imperfect description of the objectives, and inconsistencies among different annotators.

To alleviate these problems, this paper presents an approach to automatically provide landmark annotation for a large set of images in a semi-supervised fashion. That is, using manually annotated landmark locations for a small number of images, our approach can automatically estimate the landmark locations for the entire set of images (see Fig. 1). In one example, we will demonstrate that 15 manually annotated images may be used to automatically annotate a complete set of 1176 images with the help of a face detector. The core of our algorithm, named Semi-supervised Least-Squares Congealing (SLSC), is the minimization of an objective function defined as the summation of the pairwise L2 distances between warped images. Two types of distances are used: the distance between the annotated and unannotated images, and the distance between the unannotated images. The objective function is iteratively minimized via the well-known and efficient inverse warping technique [5]. During the optimization process, we also constrain the estimated landmark locations by utilizing shape statistics that are learned in an online manner, which is shown to result in better convergence of landmark position estimates and hence improved robustness of the annotation.
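The objective described above can be illustrated with a minimal sketch. It assumes the images have already been warped and flattened into intensity vectors, and it introduces a hypothetical weight `alpha` to balance the two distance types; the text here does not specify how the two terms are weighted or normalized, so the averaging below is an assumption, not the paper's exact formulation.

```python
import numpy as np

def slsc_cost(unannotated, annotated, alpha=0.5):
    """Illustrative SLSC-style cost on already-warped images.

    unannotated: (K, N) array, one flattened warped image per row, K >= 2.
    annotated:   (K_tilde, N) array of warped annotated images, K_tilde >= 1.
    alpha:       hypothetical weight between the two distance types.
    """
    U = np.asarray(unannotated, dtype=float)
    A = np.asarray(annotated, dtype=float)
    K = U.shape[0]
    total = 0.0
    for i in range(K):
        # Squared L2 distance from image i to every unannotated image
        # (the distance to itself is zero, so it does not affect the sum).
        d_unannotated = np.sum((U - U[i]) ** 2, axis=1)
        # Squared L2 distance from image i to every annotated image.
        d_annotated = np.sum((A - U[i]) ** 2, axis=1)
        total += (1.0 - alpha) * d_unannotated.sum() / (K - 1)
        total += alpha * d_annotated.mean()
    return total
```

In this sketch the cost is zero only when every unannotated image matches both its peers and the annotated images after warping, which is the alignment that congealing tries to reach.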

Several prior works on joint alignment for an image ensemble [6], [7], [8] estimate global affine parameters for each image. However, most real-world objects exhibit non-rigid deformation that is not well modeled by an affine transformation. Estimating more realistic deformations using a large set of landmarks is an important step towards accurately characterizing the shape variation within an object class. Motivated by this, we propose a hierarchical patch-based approach together with a greedy patch selection algorithm for estimating landmark positions. Starting from the whole face region, we iteratively select the patch with the greatest potential to minimize the objective function and conduct congealing for this patch simultaneously with its neighboring patch. These operations are consecutively applied to patches with gradually reduced size. In this strategy, the landmark annotation from the larger patch can be propagated to smaller patches, which enhances the robustness of the annotation. Furthermore, congealing on small patches allows the locations of landmarks to be ultimately determined by local appearance information, which improves the precision of annotation. In addition, a joint congealing on two neighboring patches is proposed to enforce the geometrical consistency between them. Our applications on facial images show that even when manually annotating only a few images of the ensemble, the landmarks of the remaining images can be estimated accurately. An overview of the system is illustrated in Fig. 2.
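The coarse-to-fine greedy selection can be sketched as a priority queue over patches. Everything named below is hypothetical: the `potential` callback stands in for whatever measure of expected cost reduction is used, `congeal` stands in for the per-patch congealing step, and `split_patch` is an assumed rule that halves a patch along its longer side. The joint congealing with a neighboring patch is omitted for brevity.

```python
import heapq

def split_patch(patch):
    """Hypothetical split rule: halve the patch along its longer side."""
    x, y, w, h = patch
    if w >= h:
        half = w // 2
        return [(x, y, half, h), (x + half, y, w - half, h)]
    half = h // 2
    return [(x, y, w, half), (x, y + half, w, h - half)]

def greedy_coarse_to_fine(root, potential, congeal, min_size, max_steps):
    """Repeatedly congeal the patch with the highest potential, then refine it.

    root:      (x, y, width, height) of the initial region, e.g. the whole face.
    potential: callable scoring a patch; higher means processed earlier.
    congeal:   callable run on each selected patch (side effects only).
    Returns the patches in the order they were processed.
    """
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [(-potential(root), root)]
    order = []
    while heap and len(order) < max_steps:
        _, patch = heapq.heappop(heap)
        congeal(patch)
        order.append(patch)
        for child in split_patch(patch):
            # Stop refining once a patch would fall below the minimum size.
            if min(child[2], child[3]) >= min_size:
                heapq.heappush(heap, (-potential(child), child))
    return order
```

Because children are only queued after their parent has been congealed, estimates from a larger patch are always available before any of its sub-patches is refined.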

Our proposed automatic image annotation framework has four primary contributions:

  • (1) A core algorithm is proposed for semi-supervised least-squares-based congealing of an image ensemble. We describe its efficient implementation using the inverse warping technique [5] and provide computational analysis.

  • (2) A statistical shape model learned online is integrated into the congealing process to reduce outliers of landmark estimation among the ensemble.

  • (3) A coarse-to-fine patch-based scheme together with a greedy patch selection strategy is proposed to improve the accuracy of landmark estimation. Furthermore, geometrical constraints are employed in the cost function to enforce the geometrical consistency between two neighboring patches, and thus improve the reliability of landmark annotation.

  • (4) An end-to-end system is developed for automatic estimation of a set of landmarks in an ensemble of facial images with very few manually annotated images. Extensive experiments that qualitatively and quantitatively evaluate the performance and capabilities of the system and comparisons with the state-of-the-art techniques [9], [8], [10] have been conducted and are reported here.

The rest of the paper is organized as follows: After a brief description of related work in Section 2, this paper presents the semi-supervised least-squares-based congealing (SLSC) algorithm in Section 3. We then describe the shape constrained semi-supervised least-squares-based congealing (SSLSC) in Section 4, and the greedy patch selection scheme (i.e., patch-based SSLSC) in Section 5. Section 6 describes our extensive experimental results. The paper concludes in Section 7.

Section snippets

Prior work

In some notable and early work on unsupervised joint alignment, Learned-Miller [6], [7] denotes the process as “congealing”. The underlying idea is to minimize an entropy-based cost function by estimating the warping parameters of an ensemble. More recently, Cox et al. [8] propose a least-squares-based congealing (LSC) algorithm, which uses L2 constraints to estimate the warping parameter of each image. An inverse compositional parameter updating strategy further improves the congealing

Semi-supervised least-squares congealing

In this section we will describe the objective function, detailed derivation, and computational analysis of our core algorithm, semi-supervised least-squares congealing (SLSC).

Similar to conventional congealing algorithms, SLSC takes an ensemble of images as input, among which we assume that there are $K$ unannotated images $\mathbf{I} = \{I_i\}_{i \in [1, K]}$, each of which is associated with an $m$-dimensional warping parameter vector $\mathbf{p}_i = [p_{i1}, p_{i2}, \ldots, p_{im}]^T$. $I_i(\cdot)$ denotes a 1D vector containing the image intensity values

Shape-constrained SLSC

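In this section, we will introduce a shape-constrained SLSC, which improves the robustness of the congealing process by reducing outliers. Given the warping parameters for all images $\{\mathbf{P}, \tilde{\mathbf{P}}\} = [\mathbf{p}_1, \ldots, \mathbf{p}_K, \tilde{\mathbf{p}}_1, \ldots, \tilde{\mathbf{p}}_{\tilde{K}}]$, and their corresponding landmark locations $\{\mathbf{S}, \tilde{\mathbf{S}}\} = [\mathbf{s}_1, \ldots, \mathbf{s}_K, \tilde{\mathbf{s}}_1, \ldots, \tilde{\mathbf{s}}_{\tilde{K}}]$, where $\mathbf{s}$ is a concatenated vector of 2D landmarks $\mathbf{s} = [x_1, y_1, x_2, y_2, \ldots, x_V, y_V]^T$ for $V$ landmarks, there are two ways of mapping between each pair of the warping parameter vector $\mathbf{p}_k \in \{\mathbf{P}, \tilde{\mathbf{P}}\}$ and its corresponding landmark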
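How a statistical shape model can suppress outlier landmark estimates may be easier to see in a small sketch. It assumes a PCA-based point distribution model with coefficients clamped to a few standard deviations, which is a common choice but is not confirmed by the text here; the function names, `n_modes`, and `n_std` are all hypothetical. In an online setting, `fit_shape_model` would be re-run as the landmark estimates are updated.

```python
import numpy as np

def fit_shape_model(shapes, n_modes=2):
    """Fit a PCA shape model to rows of [x1, y1, ..., xV, yV] landmark vectors."""
    S = np.asarray(shapes, dtype=float)
    mean = S.mean(axis=0)
    # SVD of the centered shapes: rows of Vt are the principal directions.
    _, sing, Vt = np.linalg.svd(S - mean, full_matrices=False)
    basis = Vt[:n_modes]
    # Variance of the training shapes along each retained direction.
    var = sing[:n_modes] ** 2 / max(len(S) - 1, 1)
    return mean, basis, var

def constrain_shape(s, mean, basis, var, n_std=3.0):
    """Project a shape onto the model and clamp it to a plausible range."""
    b = basis @ (np.asarray(s, dtype=float) - mean)
    limit = n_std * np.sqrt(var)
    b = np.clip(b, -limit, limit)
    # Rebuild the landmark vector from the clamped coefficients.
    return mean + basis.T @ b
```

A landmark estimate that drifts far from the shapes seen so far is pulled back toward the learned subspace, while shapes already consistent with the model pass through unchanged.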

Iterative patch-based landmark annotation

Having dealt with the outliers during the congealing process, we next improve the accuracy of the landmark annotation, which is crucial for practical applications. Since the shape deformation of a real-world object is often non-rigid due to inter-subject variations, object motions, and camera views, estimating the global and rigid transformation of the object is not sufficient to characterize the object. However, jointly estimating the non-rigid transformation is difficult to solve since the

Experiments

In order to demonstrate the effectiveness of the proposed algorithm, we have performed extensive validation studies for the application of annotating facial landmarks. It is desirable to find results from a previous state-of-the-art approach to compare with ours. A fair comparison with supervised methods such as AAM [9] would be to train AAM using the same set of annotated images provided to our approach. In contrast, we aim to automatically annotate a set of specific landmarks around the facial

Conclusions

Shape deformation of images of a real-world object is often non-rigid due to inter-instance variability, object motion, and changing camera viewpoint. Automatically estimating non-rigid deformations for an object class is a critical step in characterizing the object and learning statistical shape models. Our proposed approach facilitates such a task by automatically producing annotated data sets with only a small number of manually annotated examples. Extensive experiments demonstrate that our

Acknowledgments

This project was supported by awards #2007-DE-BX-K191 and #2007-MU-CX-K001 from the National Institute of Justice, Office of Justice Programs, US Department of Justice. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the Department of Justice.

References (36)

  • I. Matthews et al., Active appearance models revisited, Int. J. Comput. Vision (2004)
  • Y. Tong, X. Liu, F.W. Wheeler, P. Tu, Automatic facial landmark labeling with minimal supervision, in: Proc. IEEE Conf....
  • M. Cox, S. Sridharan, S. Lucey, J. Cohn, Least-squares congealing for large numbers of images, in: Proc. of the Intl....
  • A. Vedaldi, G. Guidi, S. Soatto, Joint data alignment up to lossy transformations, in: Proc. IEEE Conf. on Computer...
  • M. Storer, M. Urschler, H. Bischof, Intensity-based congealing for unsupervised joint image alignment, in: Proc. of the...
  • C.R. Shelton, Morphable surface models, Int. J. Comput. Vision (2000)
  • S. Balci, P. Golland, M. Shenton, W. Wells, Free-form b-spline deformation model for groupwise registration, in: Proc....
  • T. Vetter, M.J. Jones, T. Poggio, A bootstrapping algorithm for learning linear models of object classes, in: Proc....

    This paper has been recommended for acceptance by K.W. Bowyer.
