Introduction

One of the main aims of social signal processing is to address the social ignorance of today’s computers (Vinciarelli et al. 2009). To address this social ignorance, it is important that computers are capable of interpreting social signals. These social signals may encompass verbal communication, prompting the development of speech recognition systems, but they typically also entail behavioral cues. Ekman and Friesen (1969) distinguish five main behavioral cues, viz. (1) affective/attitudinal/cognitive states, (2) emblems, (3) manipulators, (4) illustrators, and (5) regulators. Although this taxonomy is useful to describe communicative intentions, it is not well suited as a taxonomy that describes the technologies required to allow computers to interpret these behavioral cues. Behavioral cues are primarily contained in facial expressions (cues 1, 2, 3, 4, and 5), gestures (cues 2, 3, and 4), body pose (cues 1, 2, 4, and 5), and interactions (cues 4 and 5). The interpretation of these cues requires different technologies: facial expression analysis, gesture recognition, pose detection, and gaze detection, respectively. An overview of which computer technologies are required to detect which behavioral and social cues is presented by Vinciarelli et al. (2009).

In this study, we focus on one of the most important technologies required to interpret behavioral cues, viz. facial expression analysis. In particular, we develop a system that automatically annotates faces depicted in videos according to the facial action coding system [FACS; Ekman and Friesen (1978)]. FACS allows for the systematic description of facial expressions by describing them in terms of 46 action units (AUs) that correspond to the facial muscles. Action units can be given intensity scores: the simplest scoring is present or not present. Two alternative intensity scorings are (1) neutral, onset, apex, and offset or (2) trace, slight, pronounced, extreme, and maximum. In our experiments, we use the simple present/non-present scoring; however, when provided with appropriate data, our system can be used with other scorings as well.

Action units can be used to recognize facial expressions and/or behavioral cues (Giudice and Colle 2007; Lucey et al. 2010). For instance, anger typically involves action unit 23 or 24, disgust involves action unit 9 or 10, and happiness involves action unit 12 (see Table 2 for an overview of action units). An action unit recognition system like the one we develop in this paper can thus be used as a basis for recognizing higher-level facial expressions and/or behavioral cues.

Our system for action unit recognition combines computer vision techniques that extract informative features from the depicted faces with a machine learning model for the classification of time series (as facial expressions in a video change over time). The computer vision techniques extract features from the face images that are indicative of the presence of action units in the face, using a deformable template model called the active appearance model [AAM; Cootes et al. (1998)]. The machine learning model, a linear-chain conditional random field [CRF; Lafferty et al. (2001)], recognizes action units in each image in the sequence of face images. The key property of the linear-chain CRF is that it not only employs the face features measured in the current image to recognize action units, but also employs knowledge of the likelihood of an action unit changing from one intensity score to another (for instance, it can make use of the fact that an action unit does not typically change from neutral into offset). We investigate the performance of our action unit recognition system on a large data set of videos in which subjects are recorded while making posed and natural facial expressions. The videos in the data set were annotated by human FACS-certified annotators, facilitating the training and evaluation of our system.

The main aim of the paper is to illustrate what today’s most sophisticated computer vision and machine learning techniques are capable of and to provide some ideas on how these techniques may be used to facilitate social psychology research. We do not explicitly compare the performance of our system with that of other systems presented in the literature; we merely use our system to illustrate the potential of computer vision and machine learning to facial expression analysis. The results of our experiments reveal that it is likely that our system would pass certification tests for human FACS labelers.

The outline of the remainder of this paper is as follows. In "Related work", we give an overview of related work on automatic action unit recognition. "Active appearance models" provides an overview of the construction and fitting of the active appearance model we use as a basis for our system. In "Feature extraction", we introduce three types of features that are extracted from the face images and that are indicative of action unit presence in the depicted face. Subsequently, "Conditional random fields" describes the conditional random field model we use to assign FACS labels to each frame in a face image sequence (based on the extracted features). In "Experiments", we describe the setup and results of experiments in which we evaluate our approach on a large data set of facial expression movies that were annotated by human FACS annotators. The potential impact and applications of our system to social psychology are discussed in "Discussion". "Concluding remarks" presents our conclusions, as well as directions for future work.

Related work

In the computer vision field, there is a large body of work on face analysis. Traditionally, much of this work has focused on face detection [i.e., determining where in an image a face is located; Viola and Jones (2001)], face recognition [i.e., determining who is depicted in a face image; Phillips et al. (2005)], and emotion recognition [i.e., recognizing the six basic emotions anger, fear, disgust, joy, sadness, and surprise; Fasel and Luettin (2003)]. Although the automatic recognition of action units in face images has not received nearly as much attention, there are still quite a few studies that investigate automatic action unit recognition; see Table 1. Similar to the system we present in this paper, most systems presented in earlier work consist of two main components: (1) a component that extracts features from the face images that are indicative of the presence of action units and (2) a component that learns to recognize action units based on these input features, i.e., a classifier. An overview of how other systems for action unit recognition implement these two components is given in Table 1.

Table 1 Overview of the two main components of systems for action unit recognition
Table 2 Overview of the action units considered in our study

From the overview presented in the table, we observe that (1) features obtained from Gabor filters and (2) the locations of facial feature points are the most popular features. Gabor filters are local, high-frequency, oriented filters that resemble the filters implemented in the primate primary visual cortex V1 (Daugman 1985; Jones and Palmer 1987). Because a Gabor filter basis is highly overcomplete, a subset of filters has to be selected to obtain a feature representation with a manageable dimensionality. This subset of Gabor filters is typically selected by means of a technique called boosting (Freund and Schapire 1995). In our system, we do not use Gabor features (or boosting), but like many other studies listed in Table 1, we opt to use the location of tracked facial feature points as features for the action unit recognition. Such facial feature points contain information on the location of important parts of the face (eye corners, nostrils, mouth corners, etc.); the pixel values around the feature points contain information on the appearance of these face parts. Similar to Lucey et al. (2007), we use active appearance models to track facial feature points, but we extend their approach by extracting more sophisticated appearance features around the facial feature points identified by the tracker.

As for the classifiers that are used to perform predictions based on the extracted features, we observe that support vector machines (SVMs) are the most popular classifiers. Like the perceptron, SVMs separate the two classes (i.e., action unit present or not present) by a (hyper)plane, but they are less prone to overfitting than the perceptron (Vapnik 1995). Using the so-called “kernel trick” (Shawe-Taylor and Christianini 2004), SVMs can also be used to learn non-linear classifiers. A major shortcoming of standard SVMs, and of many of the other classifiers used in the previous work, is that they fail to incorporate the temporal structure of facial expressions. For instance, if we observe the intensity onset for a particular action unit in the current frame, we can be fairly confident that in a few frames this action unit reaches the state apex, even if the visual evidence for this state is limited (e.g., because part of the face is occluded). Previously proposed approaches fail to incorporate such knowledge. In our system, we do incorporate temporal information using linear-chain conditional random fields (see "Conditional random fields").

Active appearance models

Active appearance models simultaneously describe the shape and texture variation of faces (Cootes et al. 1998; Matthews and Baker 2004). Herein, shape refers to the relative positions of feature points (such as eye corners, mouth corners, nose tip, etc.) in the face, whereas texture refers to the shape-normalized visual appearance of the face (for instance, eye color, skin color, moles, etc.). Active appearance models thus consist of two submodels: (1) a shape model that models the location of facial feature points and (2) a texture model that models the shape-normalized facial texture. We discuss the two models separately below in "Shape model" and "Texture model". "Combining the models" describes how the shape and the texture models are combined to construct the active appearance model. In "Fitting", we discuss how the active appearance model is fitted to a new face image.

Shape model

To train the shape model of an active appearance model, a data set of face images is required in which facial feature points—for instance, mouth corners, eye corners, and nose tip—are manually annotated. The feature points are required to be relatively dense, in such a way that a triangulation constructed on the feature points approximately captures the geometry of the face, i.e., in such a way that the imaginary triangles between the feature points correspond to roughly planar surfaces of the face. Three examples of annotated faces are shown in Fig. 1. The manual annotation of a collection of face images is time-consuming, but it only needs to be done once for a fixed collection of faces. If we later encounter a new face image, we can automatically determine the facial feature point locations by fitting the active appearance model on the new face image using the procedure described in "Fitting".

Fig. 1

Three examples of faces with manually annotated facial feature points shown as red crosses

To model the variation in facial feature point locations (due to differences in the shape of faces), we perform Principal Components Analysis (PCA) on normalized facial feature point coordinates. PCA learns a model of the data that identifies (1) which facial feature points have the largest location variation and (2) how the variations in the locations of the facial feature points are correlated. In particular, PCA learns: (1) a base shape \(\varvec{\nu}\) that is formed by the mean of the normalized feature point coordinates averaged over the entire data set and (2) a linear basis S that contains the directions in which the facial feature points vary most. Together, the base shape \(\varvec{\nu}\) and the linear basis S allow us to model each plausible facial feature point configuration well (in the squared error sense) using a small number of shape parameters p. Given the vector of shape parameters p, the facial feature point configuration can be computed as \({\bf p}^{\rm T}{\bf S} + \varvec{\nu}\). The shape parameters p thus form a compact representation for the deviation of the face shape from the base shape.
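As an illustration of how such a shape model can be constructed and used, the following sketch (in Python with NumPy; the function and variable names are our own and do not refer to any particular AAM implementation) fits a PCA basis to normalized feature point coordinates and reconstructs a shape from shape parameters p:

    import numpy as np

    def fit_shape_model(shapes, n_components=4):
        """Fit a PCA shape model; `shapes` is an (n_faces, 2 * n_points) array in
        which each row stacks the normalized (x, y) feature point coordinates of
        one annotated face."""
        nu = shapes.mean(axis=0)                      # base shape (mean shape)
        centered = shapes - nu
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        S = vt[:n_components]                         # rows are shape components
        return nu, S

    def generate_shape(p, S, nu):
        """Reconstruct feature point coordinates from shape parameters p."""
        return p @ S + nu                             # p^T S + nu

    # Projecting a new normalized shape onto the model and reconstructing it:
    # p = (new_shape - nu) @ S.T      (valid because the rows of S are orthonormal)
    # approx = generate_shape(p, S, nu)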

An example of a shape model with four shape components is shown in Fig. 2. In the figure, red crosses indicate the location of the facial feature points in the base shape \(\varvec{\nu}\), and blue arrows indicate the direction and magnitude of the variation in locations of each of the feature points (longer arrows represent a larger variance). For instance, the first component of the shape model represents the location of the mouth and jaw, the second component describes a forward rotation of the face, the third describes small out-of-plane rotations, and so on. (We note here that not all shape components are necessarily easily interpretable.)

Fig. 2

A shape model with four components. The red wire frame indicates the base shape. The blue arrows indicate the movement directions of the feature points (each component corresponds to one column of S)

Texture model

To model the facial texture (i.e., the shape-normalized appearance of faces), we use the feature point annotations to construct a data set of face images in which all feature points have exactly the same location. This is achieved by warping each face image onto the base shape \(\varvec{\nu}\) using the feature point annotations as control points. We can use the resulting shape-normalized face images to construct a texture model that describes features such as eye color, skin color, lip color, moles, etc.

Like the shape model, the texture model is also constructed using PCA. To construct the texture model, PCA is applied on the shape-normalized images. In other words, a low-dimensional texture representation is constructed in such a way that as much of the pixel variance as possible is preserved. The texture model contains (1) a mean texture image \(\varvec{\mu}\) that is computed by averaging all shape-normalized face images and (2) a linear basis A that captures the main deviations from the mean texture image. The texture model allows us to model each plausible facial texture with low error (in the squared error sense) using only a small number of texture parameters \(\varvec{\lambda}\). Given a texture parameter vector \(\varvec{\lambda}\), a facial texture image can be constructed by evaluating \(\varvec{\lambda}^{\rm T}{\bf A} + \varvec{\mu}\).

An example of a texture model with four components is depicted in Fig. 3. In the figure, bright areas in the images correspond to pixels in the (shape-normalized) facial texture data set with high variance. For instance, the second component models the closing of the eyes (blinking), whereas the fourth component models the opening of the mouth. The third and fourth components are also used to model the presence of glasses. (The first texture component models variations in the overall brightness of the face images.)

Fig. 3

A texture model with four components. The leftmost image represents the mean texture. The other images indicate deviations from the mean texture (each component corresponds to a single column of A)

Combining the models

To model the appearance of a face, the active appearance model combines the shape and texture models. The combination is performed by warping the texture image generated by the texture model onto the face shape generated by the shape model. Given the shape and texture parameters, the corresponding face is thus generated using a three-stage process. First, the shape model is used to generate a face shape, i.e., to lay out the facial feature points. Second, the texture model is used to generate a facial texture image. Recall that this texture image is defined in the coordinate frame of the base shape \(\varvec{\nu}\). Third, the texture image is warped onto the face shape using the constructed feature points as control points to construct the final face image. This process is illustrated in Fig. 4.

Fig. 4

Generating a face from an active appearance model. The face shape is constructed by adding a linear combination of the shape components S to the base shape \(\varvec{\nu}\). The facial texture is constructed by adding a linear combination of the texture components A to the mean texture \(\varvec{\mu}\). The final face image is formed by warping the resulting facial texture onto the face shape
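To make the generation process concrete, the sketch below (our own illustration; it assumes scikit-image for the piecewise affine warp and that feature points are stored in (x, y) order) generates a face image from shape parameters p and texture parameters \(\varvec{\lambda}\):

    import numpy as np
    from skimage.transform import PiecewiseAffineTransform, warp

    def generate_face(p, lam, S, nu, A, mu, texture_shape):
        """Generate a face image from shape parameters p and texture parameters lam.
        S, nu: shape basis and base shape; A, mu: texture basis and mean texture
        (flattened); texture_shape: (height, width) of the texture frame."""
        # 1. Lay out the facial feature points.
        shape = (p @ S + nu).reshape(-1, 2)
        # 2. Generate the shape-normalized facial texture.
        texture = (lam @ A + mu).reshape(texture_shape)
        # 3. Warp the texture from the base-shape frame onto the generated shape.
        base = nu.reshape(-1, 2)
        tform = PiecewiseAffineTransform()
        tform.estimate(shape, base)    # maps output coordinates back to the base frame
        return warp(texture, tform, output_shape=texture_shape)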

Fitting

When presented with a new face image, fitting aims to find a configuration of the shape parameters p and the texture parameters \(\varvec{\lambda}\) that minimizes the squared error between the face image and the face generated by the active appearance model. In the literature, several fitting algorithms have been proposed, e.g., by Matthews and Baker (2004); Gross et al. (2005); Papandreou and Maragos (2008). In our study, we use a fitting algorithm based on the project-out inverse compositional algorithm (Matthews and Baker 2004). This fitting algorithm performs the squared error minimization with respect to the shape parameters first; the shape parameters are set in such a way that the squared error between the shape-transformed mean texture and the observed face is minimized. Given the shape parameters, the corresponding texture parameters can be computed by solving a linear least-squares problem. The mathematical details of the project-out inverse compositional algorithm fall outside the scope of this paper but are described in detail by Matthews and Baker (2004). Our fitting procedure is initialized using a standard face detector (Viola and Jones 2001).
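As an illustration of the final step, the texture parameters can be recovered from the shape-normalized observed face by a least-squares fit (a minimal sketch under the same notation as above; the function name is ours):

    import numpy as np

    def texture_parameters(observed_texture, A, mu):
        """Solve lambda^T A + mu ~= observed_texture in the least-squares sense.
        `observed_texture` is the face warped onto the base shape and flattened;
        A is (k, d) with one texture component per row, mu is the mean texture."""
        lam, *_ = np.linalg.lstsq(A.T, observed_texture - mu, rcond=None)
        return lam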

Feature extraction

Active appearance models identify facial feature points, and they provide a low-dimensional approximation of the facial texture, but they do not produce features that are indicative of the presence of action units in the face, i.e., they do not provide direct information about characteristics of the face that are of relevance to its facial expression. We investigate three types of features, all of which use the feature points identified by the active appearance models. The three features are discussed separately in the next three subsections.

Normalized shape variations

Changes in the location of facial feature points identified by the active appearance model are indicative of the presence of certain action units. For instance, large variations in the locations of feature points around the mouth may indicate the presence of action unit 27 (mouth stretch). Hence, we can use the differences between the locations of the feature points in the current frame and the locations of the feature points in the first frame of each movie as features for action unit recognition. These differences are computed using a two-stage process. First, we normalize all frames in a movie with respect to the base shape to remove rigid transformations such as translations, rotations, and rescalings. Second, we subtract the resulting normalized shape coordinates of the first frame from the resulting normalized shape coordinates of the other frames to measure the changes in feature point locations. This process produces features that we refer to as normalized shape variation (NSV) features.
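A minimal sketch of this two-stage process is given below (our own illustration; it assumes SciPy's Procrustes analysis as one way of removing translations, rotations, and rescalings):

    import numpy as np
    from scipy.spatial import procrustes

    def nsv_features(frames_points, base_shape):
        """Normalized shape variation features for one movie.
        frames_points: list of (n_points, 2) arrays of tracked feature points.
        base_shape:    (n_points, 2) array with the base shape of the AAM."""
        # 1. Align every frame to the base shape (removes rigid transformations).
        normalized = [procrustes(base_shape, pts)[1] for pts in frames_points]
        # 2. Subtract the normalized shape of the first frame from every frame.
        first = normalized[0]
        return np.stack([(shape - first).ravel() for shape in normalized])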

Shape-normalized texture variations

Normalized shape variation features do not possess information on the facial texture, such as the presence of wrinkles in the face. As a result, it may be hard to predict the presence of, e.g., action unit 24 (lip pressor), based on normalized shape variations. By contrast, the lip pressor is clearly visible in the texture of the face (as it makes most of the lip texture disappear). Using the feature point locations identified by the active appearance model allows us to extract shape-normalized texture features that capture such texture information. The features are extracted by warping the face image onto the base shape \(\varvec{\nu}\), using the feature point locations as control points. This leads to texture images in which all feature points are in exactly the same location. Three examples of such shape-normalized textures are shown in Fig. 5. Due to the shape normalization, the differences between the texture images provide insight into the presence of wrinkles and other textural features (Ashraf et al. 2007). Similar to the normalized shape variations, we compute shape-normalized texture variation (SNTV) features by subtracting the shape-normalized texture of the first frame from the shape-normalized textures of the other frames.

Fig. 5

Three examples of shape-normalized facial textures. Note how the facial feature points (eye corners, mouth corners, nose tip, etc.) are in exactly the same location in the images
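The extraction of the shape-normalized texture variation features can be sketched as follows (again an illustration of our own; it assumes scikit-image for the piecewise affine warp, (x, y) feature point coordinates, and an arbitrary 96 × 96 texture resolution):

    import numpy as np
    from skimage.transform import PiecewiseAffineTransform, warp

    def sntv_features(frames, frames_points, base_shape, out_shape=(96, 96)):
        """Shape-normalized texture variation features for one movie.
        frames: list of grayscale images; frames_points: list of (n_points, 2)
        arrays of feature point locations; base_shape: (n_points, 2) base-shape
        coordinates scaled to lie inside out_shape."""
        textures = []
        for image, points in zip(frames, frames_points):
            tform = PiecewiseAffineTransform()
            tform.estimate(base_shape, points)   # base-frame coords -> image coords
            textures.append(warp(image, tform, output_shape=out_shape).ravel())
        textures = np.stack(textures)
        return textures - textures[0]            # subtract the first-frame texture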

Scale-invariant feature transform

Pixel-based image representations such as shape-normalized texture variations are well known to have limitations for recognition tasks, because they are highly variable under, among others, changes in lighting (as illustrated by the first texture component in Fig. 3). To address this problem, image representations based on image gradients are often more successful (Lowe 2004; Ke and Sukthankar 2004; Dalal and Triggs 2005; Bay et al. 2008). A second problem of shape-normalized appearance features is that they may contain a lot of variables (i.e., regions in the face image) that are hardly indicative of the presence of action units, because their texture does not vary much under different expressions. Hence, local gradient-based image features may be more appropriate for action unit detection.

Scale-invariant feature transform (SIFT) features are local gradient-based image features introduced by Lowe (2004) that address both of these problems. SIFT features have been successfully used in a wide range of computer vision tasks [e.g., Sivic and Zisserman (2003); Brown and Lowe (2003); Quattoni et al. (2010)]. They construct a histogram of the magnitude and orientation of the image gradient in a small image patch around a facial feature point. The histogram consists of 16 orientation subhistograms, each of which has 8 bins, leading to a 128-dimensional feature (per feature point). The construction of the SIFT feature consists of three main steps: (1) the gradient magnitude and orientation at each pixel in the image patch are computed, (2) the gradient magnitudes are weighted using a Gaussian window that is centered onto the image patch, and (3) the weighted gradient magnitudes are accumulated into orientation histograms measured over subregions of size 4 × 4 pixels.

We compute SIFT features around all facial feature points around the eyebrows, eyes, nose, and mouth and concatenate the resulting feature vectors to construct a facial texture representation that contains information on the texture around these facial feature points.
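A sketch of this computation using OpenCV is shown below (the function name and the patch size are our own choices for illustration):

    import cv2
    import numpy as np

    def sift_at_points(gray_image, points, patch_size=32):
        """SIFT descriptors at fixed facial feature point locations.
        gray_image: 8-bit grayscale face image; points: (n_points, 2) array of
        (x, y) feature point locations around the eyebrows, eyes, nose, and mouth."""
        sift = cv2.SIFT_create()
        keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                     for x, y in points]
        # `compute` evaluates the descriptor at the supplied keypoints instead of
        # detecting its own interest points.
        _, descriptors = sift.compute(gray_image, keypoints)
        return descriptors.reshape(-1)           # concatenated 128-D descriptors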

Conditional random fields

Linear-chain conditional random fields are discriminative probabilistic models that are used for labeling sequential data (LeCun et al. 1998; Lafferty et al. 2001). Conditional random fields may be best understood by starting from the framework of linear logistic regression. A linear logistic regressor is a generalized linear model (GLM) for multinomial regression. It models the probability of a variable y having one of K states given the data x using a logistic (or soft-max) function

$$ p(y=i|{\bf x})=\frac{\exp({\bf w}_i^{\rm T}{\bf x})}{\sum_{k=1}^K \exp({\bf w}_k^{\rm T}{\bf x})}. $$
(1)

In the context of this paper, the variable y may correspond to a certain action unit having one of K = 2 states: present or non-present. When more detailed intensity scores are available for the action units, the number of possible states increases. The data x corresponds to the features extracted from a face image. The regression weights w i are learned based on N labeled training data points \(\{(y_1, {\bf x}_1), (y_2,{\bf x}_2), \ldots (y_N,{\bf x}_N)\}\), i.e., based on pairs of images and their action unit labels. The learning is performed using a technique called maximum conditional likelihood, which maximizes the function \(\mathcal{L}=\sum_{n=1}^N \log p(y_n | {\bf x}_n)\). The function \(\mathcal{L}\) has a single global maximum, which makes learning the regression weights relatively straightforward.
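For concreteness, Eq. 1 can be evaluated as follows (a direct transcription in Python/NumPy; any bias term is assumed to be absorbed by appending a constant feature to x):

    import numpy as np

    def logistic_probabilities(x, W):
        """Class probabilities under a linear logistic regressor (Eq. 1).
        x: feature vector of one face image; W: (K, d) array whose rows are the
        regression weights w_k, e.g., K = 2 for present / not present."""
        scores = W @ x
        scores -= scores.max()                   # for numerical stability
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()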

In our setting, we do not recognize action units in a collection of independent images, but in collections of consecutive frames of movies. Consecutive frames in a movie have strong dependencies, as the differences between two consecutive frames are typically small. Moreover, we often have prior knowledge on how action units are likely to behave; for instance, we could exploit our knowledge that it is unlikely that an apex is followed by an onset (if we were recognizing these intensity scores). Hence, recognizing action units in each frame independently using a linear logistic regressor, without taking into account the temporal structure of facial expressions, would be very naive. It is exactly this naivety that conditional random fields aim to resolve.

Conditional random fields incorporate a temporal model between the label \(y_t\) at time step t and the label \(y_{t+1}\) at time step t + 1. In particular, they learn a set of transition log probabilities \(\{{\bf v}_{1}, {\bf v}_{2}, \ldots, {\bf v}_{K}\}\) that measure how likely it is that – in the next time step – one moves from one state to another. The log probability of moving from state \(y_t\) to state \(y_{t+1}\) is given by \(v_{y_t, y_{t+1}}\). The conditional random field thus has two sets of parameters: transition log probabilities v and regression weights w. Together, these parameters determine the probability of a label sequence \(y_1, y_2, \ldots, y_T\) given a data sequence \({\bf x}_1, {\bf x}_2, \ldots, {\bf x}_T\). Like in logistic regression, the training is performed by maximizing the conditional log likelihood, which is now given by \(\mathcal{L} = \sum_{n=1}^N \log p(y_{n1}, y_{n2}, \ldots, y_{nT} | {\bf x}_{n1}, {\bf x}_{n2}, \ldots, {\bf x}_{nT})\) with respect to the transition log probabilities and the regression weights. In other words, conditional random fields aim to maximize the likelihood of a label sequence. The function \(\mathcal{L}\) still has a single global maximum, which makes training relatively straightforward. In our experiments, we used a stochastic gradient descent algorithm (Robbins and Monro 1951; Bottou 2004) to learn the parameters of the conditional random fields.

The prediction of frame labels on an unseen test sequence using conditional random fields is straightforward; it amounts to evaluating the posterior distribution over the label sequence \(p(y_{1}, y_{2}, \ldots, y_{T} | {\bf x}_{1}, {\bf x}_{2}, \ldots, {\bf x}_{T})\) (which is exactly the distribution that is modeled by the conditional random field). Evaluating this distribution can be performed efficiently (Viterbi 1967), and it gives a value for each frame that indicates the probability that an action unit is present in that particular frame. Subsequently, we can apply a threshold on these probabilities to construct the final annotation; for instance, we can choose to label an action unit as present in a frame if its probability of being present is larger than 0.5.
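To illustrate this inference step, the sketch below computes per-frame marginal presence probabilities for a linear-chain model with given regression weights W and transition scores V using the forward-backward recursions (a deliberately simple illustration of our own; practical implementations work in log space and use dedicated CRF libraries):

    import numpy as np

    def crf_marginals(X, W, V):
        """Per-frame marginal state probabilities under a linear-chain CRF.
        X: (T, d) per-frame features; W: (K, d) regression weights; V: (K, K)
        transition scores, V[i, j] scoring a move from state i to state j."""
        T, K = X.shape[0], W.shape[0]
        emission = X @ W.T                       # per-frame state scores
        trans = np.exp(V)
        alpha = np.zeros((T, K))                 # forward messages
        beta = np.zeros((T, K))                  # backward messages
        alpha[0] = np.exp(emission[0])
        alpha[0] /= alpha[0].sum()
        for t in range(1, T):
            alpha[t] = np.exp(emission[t]) * (alpha[t - 1] @ trans)
            alpha[t] /= alpha[t].sum()           # rescale to avoid underflow
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = trans @ (np.exp(emission[t + 1]) * beta[t + 1])
            beta[t] /= beta[t].sum()
        marginals = alpha * beta
        return marginals / marginals.sum(axis=1, keepdims=True)

    # Thresholding the "present" state probability at 0.5 gives the final labels:
    # labels = crf_marginals(X, W, V)[:, 1] > 0.5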

Experiments

This section describes our experiments with the action unit recognition system described above. The data set we used as the basis for our experiments is described in "Data set". We discuss the setup of the experiments in "Experimental setup" and the results of our experiments in "Results".

Data set

We performed our automatic action unit recognition experiments on version 2 of the Cohn–Kanade data set [also referred to as the CK+ data set; Lucey et al. (2010)]. The data set contains 593 short movies of 123 subjects producing posed expressions. Together, the movies contain 10,734 frames (i.e., images); the average length of a movie is 18.1 frames. All movies were annotated for the presence and non-presence of action units by two human FACS labelers. In our experiments, we only considered action units that are present relatively often. The action units we focused on are listed in Table 2.

The face images range in size between 640 × 490 and 720 × 480 pixels; the size of the face area in the images ranges between 250 × 250 and 300 × 300 pixels. A small number of movies is in full color, but the majority of the movies are in grayscale. For convenience, we converted all movies to grayscale in our experiments.

Experimental setup

For the shape model of the active appearance model, we determine the number of shape components by preserving 90% of the variance in the facial feature point locations. The number of texture components in the texture model is also determined by preserving 90% of the variance in the shape-normalized appearance (i.e., texture) of the faces. The conditional random fields are trained on the features that are extracted from the faces.
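This retention criterion can be implemented directly, for instance with scikit-learn (a sketch with placeholder data; the numbers of feature points and texture pixels are arbitrary):

    import numpy as np
    from sklearn.decomposition import PCA

    # Placeholder data standing in for the real annotations: rows are faces,
    # columns are stacked (x, y) coordinates or shape-normalized texture pixels.
    normalized_shapes = np.random.randn(593, 2 * 68)
    shape_normalized_textures = np.random.randn(593, 96 * 96)

    # Keep as many components as are needed to preserve 90% of the variance.
    shape_pca = PCA(n_components=0.90).fit(normalized_shapes)
    texture_pca = PCA(n_components=0.90).fit(shape_normalized_textures)
    print(shape_pca.n_components_, texture_pca.n_components_)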

The evaluation of the performance of our system is performed using an approach called leave-one-subject-out cross-validation. This means that we perform a separate experiment for each of the 123 subjects: We leave out all movies containing that subject from the data and train the conditional random fields on the remaining data. Subsequently, we evaluate the performance of the conditional random fields on the movies that contain the held-out subject. This allows us to investigate how well our system works on new, unseen human subjects. The performance of the conditional random fields is averaged over all 123 runs.
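In terms of code, this corresponds to grouping the movies by subject and holding out one group at a time, for instance with scikit-learn (a sketch; the arrays are placeholders):

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    movie_indices = np.arange(593)                    # one entry per movie
    subjects = np.random.randint(0, 123, size=593)    # placeholder subject ids

    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(movie_indices, groups=subjects):
        # Train the conditional random fields on the movies in train_idx and
        # evaluate them on the movies of the single held-out subject in test_idx.
        pass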

We measure the performance of our automatic FACS annotation system by measuring the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive rate against the false positive rate for various thresholds. The area under the ROC curve (AUC) summarizes the quality of the annotator in a single value: It measures the probability that the classifier assigns a higher score to a randomly selected positive example than to a randomly selected negative example (Bradley 1997). For a completely random annotator, this probability (i.e., the AUC) is 0.5, whereas the AUC is 1 for a perfect annotator. To give an indication of the uncertainty in the AUC values, we also present an upper bound on this uncertainty proposed by Cortes and Mohri (2005). In particular, the uncertainty measure is given by \(u = \frac{\sqrt{AUC (1-AUC)}}{\max(N_p, N_n)}\), where \(N_p\) represents the number of positive examples and \(N_n\) represents the number of negative examples.
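Both quantities can be computed per action unit as follows (a sketch using scikit-learn; the function name is ours):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_with_uncertainty(y_true, scores):
        """AUC and the uncertainty bound u = sqrt(AUC (1 - AUC)) / max(N_p, N_n).
        y_true: binary frame labels (1 = action unit present); scores: per-frame
        presence probabilities produced by the conditional random field."""
        auc = roc_auc_score(y_true, scores)
        n_pos = int(np.sum(y_true == 1))
        n_neg = int(np.sum(y_true == 0))
        return auc, np.sqrt(auc * (1.0 - auc)) / max(n_pos, n_neg)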

Results

In Table 3, we present the results of training and testing conditional random fields on the three types of features presented in "Feature extraction". We also present results obtained using a feature representation that was constructed by performing PCA on the combination of the three feature types. The results are average AUC values obtained using leave-one-subject-out cross-validation, together with upper bounds on the AUC uncertainties.

Table 3 Averaged areas under the curve (AUCs) obtained by training conditional random fields on the feature sets

From the results presented in the table, we observe that the performance of our system is strong; the best feature set has an average AUC of 0.901, with AUCs for individual action units ranging between 0.867 (on action unit 4; brow lowerer) and 0.946 (on action unit 27; mouth stretch). To put these results in perspective, a true positive rate of 0.70 is sufficient to pass FACS certification tests (Ekman and Rosenberg 2005). It thus seems likely that our system would pass such tests.

From the results, we also observe that appearance-based features typically outperform feature point locations; the normalized shape variation features only outperform appearance-based features on action units that lead to large feature point variations, such as the inner and outer brow raisers. Furthermore, we observe that the combined features perform disappointingly; presumably, a better approach to combining the features is to train classifiers on each of the feature types and to linearly combine the predictions of these classifiers (Bell and Koren 2007).

Discussion

Although the results presented in "Experiments" are promising, some important issues remain. An important issue of our action unit recognition system, and of most similar systems, is that its performance is typically not very robust under out-of-plane rotations or partial occlusions of the face. We did not evaluate the performance of our system in such situations because, to the best of our knowledge, there are no publicly available databases that contain FACS-coded videos with out-of-plane rotations and/or occlusions. As a result of the lack of such data, today’s systems are still trailing behind human observers, in particular because the human visual system is remarkably robust to variations such as rotations and occlusions. Nonetheless, it is likely that our system would pass FACS certification tests (these tests require a true positive rate of at least 70%).

Another issue of the action unit recognition system we described (and of other recently developed systems) is that it heavily relies on the availability of facial expression videos that are labeled by human FACS annotators. In practice, the availability of such FACS-labeled data is often limited because of the high costs that are associated to manual FACS labeling. To address this issue, it may be helpful to employ approaches for semi-supervised learning (i.e., using unlabeled data to improve the action unit detectors) and/or active learning (i.e., learning which instances should be manually labeled).

The potential of systems such as the one presented in this paper extends far beyond automatically recognizing action units in data gathered by, e.g., social psychologists. In particular, automatic action unit recognition may provide a good basis for the recognition of higher-level cognitive states like interest and puzzlement (Cunningham et al. 2004) or (dis)agreement (Bousmalis et al. 2009) and for the recognition of psychological problems such as suicidal depressions (Ekman and Rosenberg 2005), pain (Williams 2003), or schizophrenia (Wang et al. 2008). Other potential applications of our system include understanding social behaviors such as accord and rapport (Ambady and Rosenthal 1992; Cunningham et al. 2004), identifying social signals such as status or trustworthiness (Ambady and Rosenthal 1992; Ekman and Friesen 1969; Ekman et al. 2002), predicting the success of marriage counseling (Gottman et al. 2001), and identifying personality traits such as extraversion and temperament (Ekman and Rosenberg 2005). An extensive overview of applications of automatic facial expression measurement is given by Bartlett and Whitehill (2010). Applications of our system may also exploit the generative capabilities of the active appearance model. For instance, the model may be used to investigate the effect of small changes in facial appearance on human perception (Boker et al. 2007) or for experiments with expression cloning (Theobald et al. 2007).

Concluding remarks

We developed a system for automatic action unit recognition. Our system uses conditional random fields to predict action unit states from features extracted using active appearance models. The performance of our system is promising, and it can be used in real time.

In future work, we aim to extend the conditional random fields to exploit correlations between action units (Sutton et al. 2007; Tong et al. 2010): For instance, if we detect the presence of action unit AU12, the probability that AU13 or AU14 is also present increases; our models should exploit this information. In addition, we intend to employ semi-supervised and active learning to obtain good performance at low labeling costs.

We also intend to use our system to detect basic emotions as well as higher-level social signals by learning mappings from action unit labels to these emotions/signals. In particular, we intend to use our system for the recognition of agreement/disagreement (Bousmalis et al. 2009; Poggi et al. 2010). We note that a system that recognizes agreement/disagreement should take more features into account than AU presence alone; in particular, successfully recognizing agreement/disagreement requires the detection of nods and shakes (Kapoor and Picard 2001; Tan and Rong 2003; Kang et al. 2006).