doi:10.1016/j.imavis.2005.08.001
Copyright © 2005 Elsevier B.V. All rights reserved.
Active appearance models with occlusion
The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Received 3 December 2004;
accepted 23 August 2005.
Available online 9 November 2005.
Abstract
Active Appearance Models (AAMs) are generative parametric models that have been successfully used in the past to track faces in video. A variety of video applications are possible, including dynamic head pose and gaze estimation for real-time user interfaces, lip-reading, and expression recognition. To construct an AAM, a number of training images of faces with a mesh of canonical feature points (usually hand-marked) are needed. All feature points have to be visible in all training images. However, in many scenarios parts of the face may be occluded. Perhaps the most common cause of occlusion is 3D pose variation, which can cause self-occlusion of the face. Furthermore, tracking using standard AAM fitting algorithms often fails in the presence of even small occlusions. In this paper we propose algorithms to construct AAMs from occluded training images and to track faces efficiently in videos containing occlusion. We evaluate our algorithms both quantitatively and qualitatively and show successful real-time face tracking on a number of image sequences containing varying degrees and types of occlusions.
Keywords: Model-based face analysis; Robust model fitting; Fitting with occlusion
Fig. 1. Artificially occluded data. (a) Original images with all mesh vertices s visible. (b) Images with 10% of the face region occluded. (c) Images with 50% of the face region occluded. Only non-occluded vertices are used in the AAM construction.
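As a rough illustration of what "only non-occluded vertices are used" can mean for the mean shape, the sketch below averages each vertex over the training images in which it is visible. It is a simplified stand-in, not the construction used in the paper: it assumes the shapes are already aligned, and the function name `masked_mean_shape` and the array layout are hypothetical.

```python
import numpy as np

def masked_mean_shape(shapes, visible):
    """Average each mesh vertex over the images in which it is visible.

    shapes  : array of shape (N, V, 2), N training meshes with V vertices (x, y).
    visible : boolean array of shape (N, V), True where a vertex is not occluded.
    Assumption: the meshes are already aligned to a common frame.
    """
    shapes = np.asarray(shapes, dtype=float)
    visible = np.asarray(visible, dtype=bool)

    # Number of images in which each vertex is visible.
    counts = visible.sum(axis=0)
    if np.any(counts == 0):
        raise ValueError("every vertex must be visible in at least one image")

    # Zero out occluded entries (they may hold arbitrary values or NaN).
    kept = np.where(visible[:, :, None], shapes, 0.0)
    return kept.sum(axis=0) / counts[:, None]
```

Occluded entries are ignored entirely, so placeholder coordinates stored for hidden vertices do not bias the result.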
Fig. 2. Base mesh distance. The graph shows the average pixel distance between base meshes s0 computed from unoccluded and occluded training data. Although the distance increases with the level of occlusion, it stays below 0.5 pixels even at the maximal occlusion of 50%.
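A base mesh distance of this kind can be sketched as the mean Euclidean distance between corresponding vertices of two meshes. The exact metric used for the figure is not stated here, so this is an assumed definition, and `mean_vertex_distance` is a hypothetical helper name.

```python
import numpy as np

def mean_vertex_distance(mesh_a, mesh_b):
    """Mean Euclidean distance, in pixels, between corresponding vertices.

    mesh_a, mesh_b : arrays of shape (V, 2) holding (x, y) vertex positions
    of two base meshes with the same vertex ordering.
    """
    a = np.asarray(mesh_a, dtype=float)
    b = np.asarray(mesh_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 2:
        raise ValueError("meshes must both have shape (V, 2)")

    # Per-vertex distance, then the average over all vertices.
    return float(np.linalg.norm(a - b, axis=1).mean())
```

For example, two meshes that differ by a (3, 4) pixel offset at one of two vertices have per-vertex distances of 5 and 0, giving a mean of 2.5 pixels.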
Fig. 3. Mean shape s0 and shape variations s1–s3 overlaid on the base mesh. (a) Shape images computed from unoccluded data. (b) Shape images computed from data with 50% occlusion. The resulting shape modes are very similar.
Fig. 4. Comparison of the AAM shape variation and appearance variation components computed from unoccluded and occluded data. (a) Shape energy overlap SE. (b) Appearance energy overlap AE. For both components a high degree of similarity is evident. At around 50% occlusion, however, the performance drops off rapidly.
Fig. 5. Mean appearance A0 and appearance variations A1–A3. (a) Appearance images computed from unoccluded data. (b) Appearance images computed from data with 50% occlusion.
Fig. 6. Training images with and without occlusion. We show 6 of the 120 hand-marked images used in the training of the AAM for the tracking task of Fig. 7.
Fig. 7. Example frames of a test sequence showing accurate tracking with an AAM constructed with occlusion. See the accompanying movie fit.mpg for the full sequence of 457 frames.
Fig. 8. The project-out inverse compositional algorithm [15].
Fig. 9. The efficient robust normalization inverse compositional image alignment algorithm.
Fig. 10. The robust simultaneous algorithm. Because the steepest descent images depend on the appearance parameters, Steps (I3) and (I4) must be performed in every iteration.
Fig. 11. Average frequency of convergence for the project-out (PO) and normalization (N) algorithms for 10 appearance images Ai. The two algorithms perform identically, showing empirically that they are equivalent. For more results see [2].
Fig. 12. Average frequency of convergence for the robust fitting algorithms for different levels of occlusion. The robust project-out (RPO) and the robust normalization (RN) algorithms again perform identically. The efficient robust normalization algorithm (ERN) performs only slightly worse than the non-efficient variants. The robust project-out algorithm with Hager-Belhumeur approximation (RPO-HB) performs far worse than any of the other algorithms, especially for higher levels of occlusion. Across all four conditions the robust simultaneous algorithm (RSIC) performs best.
Fig. 13. Comparison of using the (non-robust) project-out (top row) [15] and the efficient robust normalization algorithm (bottom row) on an image sequence with occlusion by a black box. The project-out algorithm fails to track once the face is covered by the box (top center and top right) and is unable to recover (see box.mpg). The efficient robust normalization algorithm accurately tracks the face (bottom row).
Fig. 14. Comparison of using the (non-robust) project-out (top row) [15] and the efficient robust normalization algorithm (bottom row) on an image sequence with occlusion by a hand. The chin is covered by the hand while the face rotates. The project-out algorithm fails to track once the face starts to rotate (top center) and again is unable to recover (see hand.mpg). The efficient robust normalization algorithm accurately tracks the face throughout the sequence (bottom row).
Fig. 15. Comparison of using the (non-robust) project-out (top row) and the efficient robust normalization algorithm (bottom row) on an image sequence with self-occlusion. The face rotates from frontal to full left profile and back to frontal again. The project-out algorithm fails to track once the face nears the profile location (top center). Again, the efficient robust normalization algorithm accurately tracks the face throughout the entire sequence (see rotate.mpg).
Fig. 16. Overview of the algorithms discussed in this paper. The numbers in parentheses refer to the sections in which the respective algorithms are described. The project-out algorithm was introduced in [15]. The Hager-Belhumeur approximation to the robust project-out algorithm was proposed in [13]. All other algorithms were introduced in [2].
Table 1.
Fitting speed comparison on a 3 GHz Pentium 4, in milliseconds

We measure the average fitting speed per frame of the project-out (PO), robust normalization (RN) and efficient robust normalization (ERN) algorithms over an image sequence of 457 frames. These results are for an AAM with 11 shape parameters, 20 appearance parameters, and 9981 color pixels.