1 Introduction

Facial expressions convey important information about the emotional and mental states of a person. Facial expression understanding has broad applications and has been studied most extensively in the form of emotion recognition. The classic problem is to recognise six basic emotions [39] (Fig. 1, top) based on facial expressions [21]. This problem is now considered solved for posed, exaggerated expressions in lab conditions (Fig. 1, top); but it is still an open question for realistic, subtly exhibited expressions where, for example, even a slight tightening of the lips can be a sign that someone is angry. Subtle expressions (Fig. 1, Rows 2 & 3) are important for mental activity analysis and deception detection [55].

Fig. 1. Conventional emotion recognition (Row 1) vs our LSEMSW (Rows 2 and 3). Our dataset contains mental states that are richer – including a variety of cognitive states – conveyed by subtle expressions and exhibited in the wild.

In this paper we go significantly beyond existing work on emotion recognition in two ways: recognition of subtly expressed rather than exaggerated emotions, and recognition of a wider range of mental states beyond the basic six emotions, including for the first time cognitive states. To address these goals we work from two directions, providing (1) improved deep learning algorithms, and (2) a big dataset covering subtle emotions and cognitive states. Specifically, first, we introduce a new deep learning approach to subtle expression recognition, based on a novel multi-task learning architecture that exploits a side task: landmark detection. Second, in order to benchmark the newly proposed task, and to train our deep learning model, we also contribute a new large-scale dataset: Large-scale Subtle Emotions and Mental States in the Wild (LSEMSW). The expressions in this database are much more realistically subtle than those in existing posed and exaggerated benchmarks (Fig. 1). The dataset is also much richer than most existing benchmarks – containing 13 emotions and cognitive states (Fig. 1, Rows 2 & 3) defined by two psychologists.

Deep MTL Algorithms. We build on deep CNNs, which have achieved great success in many vision tasks. In line with classic studies that use landmark locations [54] and distances [50] for basic emotion recognition, we observe that when an emotional/mental state is conveyed by subtle expressions, a salient cue is often a slight movement of facial landmarks (e.g., eye widening). To provide this prior knowledge as an inductive bias in our deep network design, we aim to detect landmarks and mental states simultaneously via multi-task learning (MTL).

Classic MTL methods focused on improving performance by cross-task knowledge transfer through shared task representations in linear models [2] or kernel-based nonlinear models [10]. More recently, interest in MTL for neural networks has grown again [5], in order to combine the power of deep networks with knowledge sharing [24, 27, 61]. Conventional deep MTL pre-defines the first few CNN layers as shared by multiple tasks, and then forks into different layers for different tasks with different losses. However, this approach is heuristic and, without theoretical guidance about how to choose the sharing structure, it is often left to ‘grad student descent’. This is particularly tricky with increasingly deep CNN architectures. For example, ResNet [17] has 156 layers, leading to 156 possible architectures assuming exactly one fork, or \({(B_T)}^{156}\) architectures (where T is the number of tasks and \(B_{\cdot }\) is the Bell number, i.e., the number of partitions of a set) more generally. To address this, we develop a new tensor trace norm approach that automatically determines how much every layer should be shared, without assuming a single fork point.

Furthermore, we address the issue that MTL typically requires all tasks to be annotated on a single dataset to be effective. If the tasks are associated with different datasets, MTL can still be applied, but it becomes ineffective because the negative effect of cross-dataset distribution shift outweighs the benefit of MTL-based knowledge sharing. By integrating a distribution alignment strategy [13], we can use disjoint training sets (tasks defined on different datasets), making MTL much more flexible and widely applicable. In the context of emotion recognition, this allows us to leverage existing datasets to provide auxiliary tasks such as facial landmark localisation on the 300-W dataset [45].

Subtle Expression Database. Most existing expression databases [8, 11, 20, 37] only contain images with strong expressions of exaggerated emotions, and subtle expression analysis is rarely investigated. To address this gap, we contribute LSEMSW, the first big database for subtle expression analysis. LSEMSW only contains images with realistically subtle expressions. In addition, existing databases have several limitations: they either contain only emotions without other mental states [8, 11], are noisy due to automated annotation [11], or are too small for deep learning [8]. In contrast to existing datasets focusing on the six basic emotions [8], our LSEMSW also contains other (non-emotional) cognitive mental states. LSEMSW contains 176K images, making it multiple orders of magnitude larger than some alternatives (e.g., 1500 images in AFEW [8]), and all images are manually labelled rather than automatically annotated by algorithms [11]. Finally, we contrast our task with micro-expression recognition, which aims to recognise an emotion that a person is trying to conceal [25]. This is related in addressing subtle cues, but differs in that it is typically performed on video rather than images.

Contributions. In summary, our contributions are: (i) Unlike standard heuristically designed deep MTL, we propose an end-to-end soft sharing strategy that flexibly learns where, what and how much to share by optimising the trace norm regularised parameters. We further embed a distribution alignment method in order to maintain good performance when the per-task training sets are disjoint. (ii) We contribute our LSEMSW dataset consisting of 176K images manually annotated with 13 emotions and cognitive states. This is the first database for subtle expression analysis, the first database for recognising cognitive states from facial expressions, and it is big enough for deep CNN training. We will release this database to advance mental state recognition in the deep learning era. In addition, the source code and trained models will be made publicly available. (iii) We show that LSEMSW can benefit traditional (non-subtle) expression recognition (TNER): using transfer learning, we achieve very competitive TNER performance on the Oulu-Casia NIR&Vis [62] and CK+ [29] databases.

2 Methodology

2.1 Preliminaries

Matrix-Based Multi-task Learning. Matrix-based MTL is usually built on linear models, i.e., each task is parameterised by a D-dimensional weight vector \(\mathbf w \), and the model prediction is \(\hat{y} = \mathbf x ^T \mathbf w \), where \(\mathbf x \) is a D-dimensional feature vector representing an instance. The objective function for matrix-based MTL can be written as \(\sum _{i=1}^{T}\sum _{j=1}^{N^{(i)}} \ell ({y}^{(i)}_j, \mathbf x ^{(i)}_j \cdot \mathbf w ^{(i)}) + \lambda \varOmega (W)\). Here \(\ell ({y}, \hat{{y}})\) is a loss function of the true label \(y\) and predicted label \(\hat{y}\). T is the number of tasks, and for the i-th task there are \(N^{(i)}\) training instances. Assuming the feature dimensionality is the same for every task, the models \(\mathbf w ^{(i)}\) are of the same size. The collection of \(\mathbf w ^{(i)}\)s then forms a \(D\times T\) matrix W whose i-th column is the linear model for the i-th task. A regulariser \(\varOmega (W)\) is used to encourage W to be a low-rank matrix; common choices include the \(\ell _{2,1}\) norm [2] and the trace norm [19].
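To make the formulation concrete, the following is a minimal NumPy sketch of this objective for T linear tasks, using a squared loss as an example of \(\ell \) and the nuclear (trace) norm as \(\varOmega \); the function and variable names are illustrative only.

```python
import numpy as np

def matrix_mtl_objective(W, tasks, lam):
    """Matrix-based MTL objective: sum of per-task losses + lam * trace norm of W.

    W     : (D, T) matrix whose i-th column is the linear model of task i
    tasks : list of T pairs (X_i, y_i), X_i of shape (N_i, D), y_i of shape (N_i,)
    lam   : regularisation strength lambda
    """
    loss = 0.0
    for i, (X, y) in enumerate(tasks):
        y_hat = X @ W[:, i]                    # predictions for task i
        loss += np.sum((y_hat - y) ** 2)       # squared loss as an example l(y, y_hat)
    trace_norm = np.linalg.norm(W, ord='nuc')  # nuclear norm: sum of singular values of W
    return loss + lam * trace_norm
```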

Tensor-Based Multi-task Learning. In standard MTL, each task is indexed by a single factor. But in some real-world problems, tasks are indexed by multiple factors. The collection of linear models for all tasks is then a 3-way tensor \(\mathcal {W}\) of size \(D\times T_1 \times T_2\), where \(T_1\) and \(T_2\) are the two task indices. In this case, tensor norm regularisers \(\varOmega (\mathcal {W})\) have been used [51], for example the sum of the trace norms of all matricisations [44] and the scaled latent trace norm [56]. However, such prior tensor norm-based regularisers have been limited to shallow models. We develop methods that allow tensor norms to be applied end-to-end in deep networks.

Deep Multi-task Learning. With the success of deep learning, many studies have investigated deep MTL [28, 36, 41, 58, 61], e.g., using a CNN to find facial landmarks as well as recognise facial attributes [41, 61]. The standard approach [28, 41, 61] is to share the bottom layers of a deep network and use task-specific parameters for the top layers. We call this type of ‘predefined’ sharing strategy ‘hard’ sharing. This ‘hard’ sharing based architecture can be traced back to the 2000s [3]. However, it is impossible to try every hard sharing possibility in modern CNN architectures with many layers. The limited very recent work on automating deep MTL [36, 58] suffers from the need to specify discrete ranks at every layer. This introduces an additional sharing-strength hyper-parameter per layer, and crucially prevents knowledge sharing when working with only two tasks, as it increases rather than decreases the number of parameters. Our approach learns soft sharing at all layers, controlled by a single sharing-strength hyper-parameter.

2.2 Trace Norm-Based Knowledge Sharing for Deep MTL

In this work, we focus on deep MTL, in particular CNN-based MTL. A CNN contains multiple convolutional layers, each consisting of a number of convolutional kernels. A convolutional layer is parameterised by a 4-way tensor of size \(H\times W\times C \times M\), where H, W, C, M are the height, width, number of channels, and number of filters, respectively. Since convolutional layers are structured as tensors, we use tensor-based theory, in particular the tensor trace norm, to achieve knowledge sharing. Unlike the ‘hard’ sharing strategy, we propose a flexible ‘soft’ parameter sharing strategy that automatically learns where, what and how much to share by optimising the tensor trace norm regularised parameters.

Knowledge Sharing. To learn a parameter sharing strategy, we propose the following framework: for T tasks, each is modelled by a neural network of the same architecture. The T networks are stacked horizontally in a layer-wise fashion, i.e., since the architectures of the different tasks’ networks are the same, we can collect the parameters at the same level (layer) and stack them to form a one-order-higher tensor, e.g., \(4D\rightarrow 5D\) for a convolutional layer. This process is repeated for every layer. With the parameters stacked into higher-order tensors, we can apply a tensor trace norm regulariser to each in order to achieve knowledge sharing. A schematic example with 2-task learning is illustrated in Fig. 2. Learning the CNN with tensor trace norm regularisation means that the ranks of these tensors are minimised where possible, and thus knowledge is shared where possible. Since the trace norm is applied to the stacked parameters of all layers, we can control the parameter sharing of all layers with a single regularisation-strength hyperparameter.
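As an illustrative sketch of this stacking step, assuming TensorFlow (which our implementation uses) and per-task networks with identical layer shapes; the helper name is ours:

```python
import tensorflow as tf

def stack_shared_layers(task_kernels_per_layer):
    """Stack the same-level convolution kernels of T task networks into higher-order tensors.

    task_kernels_per_layer : list over layers; each entry is a list of T kernel variables,
                             every kernel having the same shape (H, W, C, M).
    returns : one tensor of shape (H, W, C, M, T) per shareable layer (4D -> 5D),
              ready to be passed to a tensor trace norm regulariser.
    """
    return [tf.stack(kernels, axis=-1) for kernels in task_kernels_per_layer]
```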

Tensor Norms. Since the tensor trace norm is the core of our approach, we briefly review this topic. The matrix trace norm is the sum of a matrix’s singular values, \(||X||_* = \sum _{i=1} \sigma _i\). It is the tightest convex relaxation of matrix rank [42]. Thus, when directly restricting the rank of a matrix is challenging, the trace norm serves as a good proxy. The trace norm of a tensor can be formulated as a sum of matrix trace norms. However, unlike for matrices, the trace norm of a tensor is not uniquely defined, because tensors can be factorised in many ways, e.g., Tucker [53] and Tensor-Train [38] decompositions. We propose three tensor trace norms here, corresponding to three variants of the proposed method.

For an N-way tensor \(\mathcal {W}\) of size \(D_1\times D_2 \times \dots \times D_N\), we define

$$\begin{aligned} \mathbf{Last \,Axis\, Flattening \,(LAF) } \quad ||\mathcal {W}||_* = \gamma ||\mathcal {W}_{(N)}||_* \end{aligned}$$
(1)

where \(\mathcal {W}_{(i)} := {\text {reshape}}({\text {permute}}(\mathcal {W}, [i, 1, \dots , i-1, i+1 \dots , N]), [D_i, \prod _{j\ne i} D_j])\) is the mode-i tensor flattening. This is the simplest definition. Given that in our framework the last axis of the tensor indexes the tasks, i.e., \(D_N = T\), it is the most straightforward way to adapt matrix-based MTL – i.e., by reshaping the \(D_1\times D_2 \times \dots \times T\) tensor to a \(D_1 D_2 \cdots D_{N-1} \times T\) matrix.
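As a sketch (NumPy, with our own helper names), the mode-i flattening and the LAF regulariser of Eq. (1) can be written as:

```python
import numpy as np

def mode_i_flatten(W, i):
    """Mode-i flattening W_(i): bring axis i to the front, then reshape to (D_i, prod_{j != i} D_j)."""
    return np.moveaxis(W, i, 0).reshape(W.shape[i], -1)

def laf_trace_norm(W, gamma=1.0):
    """Last Axis Flattening (LAF, Eq. 1): trace norm of the mode-N flattening,
    where the last axis of W indexes the T tasks."""
    return gamma * np.linalg.norm(mode_i_flatten(W, W.ndim - 1), ord='nuc')
```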

Going further, we define two kinds of tensor trace norm that are closely connected with the Tucker-rank (obtained by Tucker decomposition) and the TT-rank (obtained by Tensor-Train decomposition).

$$\begin{aligned} \mathbf{Tucker } \,\,\, ||\mathcal {W}||_* = \sum _{i=1}^{N} \gamma _i ||\mathcal {W}_{(i)}||_* \end{aligned}$$
(2)
$$\begin{aligned} \mathbf{TT } \,\,\, ||\mathcal {W}||_* = \sum _{i=1}^{N-1} \gamma _i ||\mathcal {W}_{[i]}||_* \end{aligned}$$
(3)

Here \(\mathcal {W}_{[i]}\) is yet another way to unfold the tensor, which is obtained by \(\mathcal {W}_{[i]} = {\text {reshape}}(\mathcal {W}, [D_1 D_2 \dots D_i, D_{i+1} D_{i+2} \dots D_N])\). Note that unlike LAF, Tucker and TT also encourage within-task parameter sharing, e.g., sharing across filters in a neural network context.
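A self-contained sketch of the Tucker and TT variants (Eqs. 2–3), which differ from LAF only in how the tensor is unfolded; again the names are illustrative:

```python
import numpy as np

def tucker_trace_norm(W, gammas):
    """Eq. (2): sum over all modes i of gamma_i times the trace norm of the mode-i flattening W_(i)."""
    return sum(g * np.linalg.norm(np.moveaxis(W, i, 0).reshape(W.shape[i], -1), ord='nuc')
               for i, g in enumerate(gammas))            # gammas has N entries

def tt_trace_norm(W, gammas):
    """Eq. (3): sum over i = 1..N-1 of gamma_i times the trace norm of
    W_[i] = reshape(W, [D_1*...*D_i, D_{i+1}*...*D_N])."""
    total = 0.0
    for i, g in enumerate(gammas, start=1):              # gammas has N-1 entries
        total += g * np.linalg.norm(W.reshape(int(np.prod(W.shape[:i])), -1), ord='nuc')
    return total
```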

Optimisation. For the regularisers defined in Eqs. (1)–(3), the tensor trace norm is formulated as a sum of matrix trace norms. Gradient-based methods are not commonly used to optimise the matrix trace norm. However, in order to apply trace norm-based regularisation end-to-end in CNNs, we wish to optimise the trace norm together with the standard CNN losses using a single gradient-based solver, e.g., within TensorFlow [1]. We therefore derive a (sub-)gradient descent method for trace norm minimisation.

We start from a definition of the trace norm equivalent to the sum of singular values: \(\Vert W \Vert _* = {\text {Trace}}((W^T W)^{\frac{1}{2}}) = {\text {Trace}}((W W^T)^{\frac{1}{2}})\), where \((\cdot )^{\frac{1}{2}}\) is the matrix square root. By the property of the differential of the trace function, \(\partial {\text {Trace}}(f(A)) = f'(A^T):\partial A\), where the colon : denotes the double-dot (a.k.a. Frobenius) product, i.e., \(A:B = {\text {Trace}}(AB^T)\). In our case, \(A=W^T W\) and \(f(\cdot ) = (\cdot )^{\frac{1}{2}}\), thus \(f'(\cdot ) = \frac{1}{2}(\cdot )^{-\frac{1}{2}}\), so we have,

$$\begin{aligned} \partial {\text {Trace}}((W^TW)^{\frac{1}{2}})&=\frac{1}{2}(W^TW)^{-\frac{1}{2}}:\partial (W^TW) =W(W^TW)^{-\frac{1}{2}}: \partial W \end{aligned}$$

Therefore we have \(\frac{\partial \Vert W\Vert _*}{\partial W} = W(W^TW)^{-\frac{1}{2}}\). In the case that \(W^TW\) is not invertible, we can derive that \(\frac{\partial \Vert W\Vert _*}{\partial W}= (W W^T)^{-\frac{1}{2}} W\) similarly. To avoid the check on whether \(W^TW\) is invertible, and more importantly, to avoid the explicit computation of the matrix square root, which is usually not numerically safe, we use the following procedure.

First, we assume W is an \(N\times P\) matrix (\(N>P\)) and let the (full) SVD of W be \(W=U\varSigma V^T\). \(\varSigma \) is an \(N\times P\) matrix in the form of \(\varSigma = [\varSigma _*;\mathbf {0}_{(N-P)\times P}]\). Then we have

$$\begin{aligned} W(W^T W)^{-\frac{1}{2}}&= U\varSigma V^T (V \varSigma _*^2 V^T)^{-\frac{1}{2}} = U\varSigma V^T V \varSigma _*^{-1} V^T\\&= U\varSigma \varSigma _*^{-1} V^T = U [I_P;\mathbf {0}_{(N-P)\times P}] V^T \end{aligned}$$

This indicates that we only need to compute the truncated SVD, i.e., \(W = U_* \varSigma _* V_*^T\), and \(W(W^T W)^{-\frac{1}{2}} = U_* V_*^T\). For the case \(N<P\), we obtain the same result:

$$\begin{aligned} (W W^T)^{-\frac{1}{2}} W&=(U \varSigma _*^2 U^T)^{-\frac{1}{2}} U\varSigma V^T = U \varSigma _*^{-1} U^T U\varSigma V^T \\&= U \varSigma _*^{-1} \varSigma V^T = U [I_N,\mathbf {0}_{N\times (P-N)}] V^T \end{aligned}$$

Now we have a single formulation, \(\frac{\partial \Vert W\Vert _*}{\partial W} = U_* V_*^T\), that we can use for gradient descent in both cases. Though the exact SVD is expensive, we find that a fast randomized SVD [16] works well in practice.
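A sketch of the resulting (sub-)gradient computation, using an exact thin SVD for clarity; a randomized SVD (e.g., sklearn.utils.extmath.randomized_svd) is one possible drop-in replacement for speed. In a deep learning framework this term would typically be registered as a custom gradient for the trace norm so that a standard optimiser can be used end-to-end.

```python
import numpy as np

def trace_norm_subgradient(W):
    """(Sub-)gradient of ||W||_* w.r.t. W: U_* V_*^T, where W = U_* Sigma_* V_*^T is the thin SVD.

    A fast randomized SVD can be substituted for large matrices, as noted in the text.
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```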

2.3 Adversarial Domain Alignment (ADA)

In our application, the main task’s dataset (LSEMSW) is disjoint from the auxiliary task’s (300-W [45]). This leads to a distribution shift across the two tasks, reducing the performance of MTL. Inspired by [14] and [13], we propose to confuse the dataset identity in order to deal with this problem.

We use ADA to solve this problem: a classifier aims to distinguish which distribution (dataset) the features of each task come from. If the features are distinguishable, the domain shift is clearly greater than if they are indistinguishable; ADA trains them to be indistinguishable. We assume \(T\ge 2\) tasks (indexed by t), each with its own dataset \(\{X_t, y_t\}\). Task t is modelled by a CNN parametrised by \(\varTheta _t=\{\theta _t^{(1)}, \theta _t^{(2)}, \dots , \theta _t^{(L)}\}\), where L is the number of layers, and we split \(\varTheta _t\) into two sets at the l-th layer. Conventionally we choose \(l=L-1\), i.e., the penultimate layer, so we have \(\varTheta _t = \varTheta _t^* \cup \{\theta _t^{(L)}\}\) where \(\varTheta _t^* = \{\theta _t^{(1)}, \theta _t^{(2)}, \dots , \theta _t^{(L-1)}\}\). We then build a multi-class classification problem that uses a neural network parametrised by \(\varPhi \) to predict the database identity from \(f_{\varTheta _t^*}(X_t)\), the penultimate layer representation. Letting Z be the stacked features for all tasks, i.e., \(Z = [f_{\varTheta _1^*}(X_1) \dots f_{\varTheta _T^*}(X_T)]\), we optimise

$$\begin{aligned} \underset{\varTheta _1^* \dots \varTheta _T^*}{{\text {max}}} \underset{\varPhi }{{\text {min}}} \ell (g_\varPhi {([f_{\varTheta _1^*}(X_1) \dots f_{\varTheta _T^*}(X_T)])},y) \end{aligned}$$
(4)

where y is the one-hot label indicating which distribution a feature comes from, \(g_\varPhi \) is a classifier (e.g., softmax), and \(\ell \) is the cross-entropy loss.

For our application we have 2 tasks in total, so this reduces to a binary classification problem. For the task-identity prediction network, we use a 2-hidden-layer MLP (multilayer perceptron) with a 512 (input feature)-128 (hidden layer)-64 (hidden layer)-2 (classifier) architecture.
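A minimal TF2/Keras sketch of the discriminator and the objective of Eq. (4), following the 512-128-64-2 architecture stated above; the alternating max/min updates (or a gradient reversal layer) and all names here are implementation choices not fixed by the text.

```python
import tensorflow as tf

# Discriminator g_Phi: predicts which dataset a 512-d penultimate feature comes from.
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(512,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2),                                   # 2 datasets -> 2 logits
])
xent = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def ada_losses(feats_task1, feats_task2):
    """Eq. (4): the discriminator minimises the dataset-identity cross-entropy,
    while the feature extractors maximise it."""
    z = tf.concat([feats_task1, feats_task2], axis=0)           # stacked features Z
    y = tf.concat([tf.zeros(tf.shape(feats_task1)[0], tf.int32),
                   tf.ones(tf.shape(feats_task2)[0], tf.int32)], axis=0)
    d_loss = xent(y, discriminator(z))                          # minimised w.r.t. Phi
    return d_loss, -d_loss                                      # extractors minimise -d_loss, i.e. maximise d_loss
```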

2.4 CNN Architecture for Deep MTL

In this study, we implement our deep MTL based on the well-known Residual Network (ResNet) architecture [17]. We use the compact 34-layer ResNet, with 33 convolutional layers and 1 fully connected layer, detailed in [17]. We apply trace norm regularisation to the weights of all 33 shareable convolutional layers of the stacked networks. In addition, the original 34-layer ResNet has a \(7\times 7\) global average pooling before the loss layer, matching its \(224\times 224\) input. To suit our \(96\times 96\) input, we use \(3\times 3\) average pooling instead. Adversarial Domain Alignment is performed on the activations (feature map) of this average pooling. The classification loss for mental state recognition is the softmax cross-entropy loss, while the loss for landmark detection is an \(l_1\) regression loss. The architecture is shown in Fig. 2.
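A sketch of the two task-specific heads on top of the 512-d pooled feature, with the losses named above (softmax cross-entropy for the 13 mental states, \(l_1\) regression for the 68 landmark coordinates); everything beyond those stated choices is illustrative.

```python
import tensorflow as tf

emotion_head  = tf.keras.layers.Dense(13)       # 13 mental-state classes (logits)
landmark_head = tf.keras.layers.Dense(68 * 2)   # 68 (x, y) landmark coordinates

def task_losses(emotion_feat, emotion_labels, landmark_feat, landmark_targets):
    """emotion_feat / landmark_feat: 512-d features after the 3x3 average pooling."""
    cls_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=emotion_labels, logits=emotion_head(emotion_feat)))
    reg_loss = tf.reduce_mean(tf.abs(landmark_head(landmark_feat) - landmark_targets))  # l1 loss
    return cls_loss, reg_loss
```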

Fig. 2. Our deep MTL framework. For simplicity, layers such as pooling, ReLU, etc. are not visualised. ‘activation’ \(\in R^{512}\) denotes the feature map after global average pooling.

3 Large-Scale Subtle Emotions and Mental States in the Wild (LSEMSW) Database

Motivation. Subtle expressions occur when a person’s emotional response to surroundings is of low intensity. People usually exhibit subtle expressions when they start to feel an emotion. Subtle expression recognition has many applications such as mental activity analysis and deception detection [55]. However, subtle expressions are rarely investigated and the existing facial expression analysis techniques mainly focus on strong or exaggerated expressions. To advance research on subtle expression analysis, we collect the new LSEMSW database.

Collection and Annotation. LSEMSW was collected from more than 200 movies and TV serials such as Big Bang Theory, Harry Potter, Game of Thrones, etc. For each video/clip, we kept the first of every 5 frames. Face detection was then performed on the selected frames using MTCNN [60]. The images that contain faces were manually annotated via Amazon Mechanical Turk over nine months. To achieve accurate annotation, we provided detailed instructions to annotators and used the Amazon Mechanical Turk Masters service to select reliable, well-performing annotators based on their historical performance. Each image is assigned to 3 workers for annotation. During annotation, the subtitle (if available) on the frame is shown to help the workers make their decision. An annotation is accepted only if more than two workers agree on it. Images with strong expressions were manually filtered out. More details of our database are shown in Tables 1 and 2 and in the supplementary material.

Table 1. Attribute distribution.
Table 2. Expression distribution.

Comparison with Existing Databases. We compare LSEMSW with existing well-known expression/emotion databases in Table 3. Our LSEMSW is the only one with subtle rather than strong expressions. Although this research focuses on subtle expression recognition, the knowledge learned from LSEMSW can be transferred to standard strong expression recognition, as verified in Sect. 4.2. In terms of size, LSEMSW is smaller than EmotioNet [11] and AffectNet [37]. However, while EmotioNet [11] contains 1 million images, only 50K are manually annotated and the labels of the remaining images are noisily predicted by an algorithm [4]. Therefore, our database is the second largest with manual expression annotations, and it is the only database with cognitive state annotations.

Table 3. Comparison of manually annotated facial expression databases.

4 Experiments

4.1 Databases and Settings

Expressions. We explore two types of expression recognition: (1) subtle expression recognition and (2) traditional (non-subtle) expression recognition (TNER). For (1), our LSEMSW database is used for evaluation. Specifically, the database is divided into training, validation and test sets with an 80%/10%/10% split, and the rank-1 recognition rate on the test set is reported. For (2), we explore transferring the representation learned from LSEMSW to TNER. Specifically, we train TNER networks by finetuning from the subtle expression network trained with LSEMSW. We use two well-known TNER databases for this evaluation: the Oulu-Casia NIR&Vis (OCNV) facial expression database [62] and the Extended Cohn-Kanade database (CK+) [29]. OCNV contains 480 sequences taken under 3 lighting conditions: dark, strong and weak. Following [9], we use the VIS videos with strong lighting (80 identities and 6 expressions). Each image sequence varies from neutral to the peak formation of one particular expression, and the last three frames (strongest expression) are used. 10-fold cross validation is conducted as in [9]. CK+ includes 593 video sequences of 123 subjects; subjects displayed 7 basic (non-subtle) expressions in different sequences, and we use only the last (strongest) frame of each sequence. Following [23], 5-fold cross validation is conducted. During training, data augmentation (flip, crop, rotation), which we find is very important, is performed. We finetune the LSEMSW-pretrained network using the augmented training images of OCNV and CK+ and evaluate performance on the test images of these 2 databases. Evaluations are reported on task (1) except where explicitly specified.

Facial Landmarks. We use 68-point annotations [46] for landmark detection. Our training set consists of the training images of the 300 Faces In-the-Wild Challenge (300-W) [46] and the Menpo benchmark [59]. Face detection using MTCNN [60] is performed on the original training images, and the detected bounding boxes are extended with a scale ratio of 0.2 to cover the whole face area. Due to the limited number of training images, data augmentation is important: the detected faces are flipped, rotated (−30\(^{\circ }\), 30\(^{\circ }\)), and perturbed via translation and scaling (0.8, 1.2). During training, the landmark coordinates are normalised to (0, 1). Following [18, 52], the test set contains 3 parts: the common subset (554 images), the challenging subset (135 images) and the full set (689 images). Where not explicitly specified, we report results on the full set. Following [18, 52], we use the normalised mean error (the distance between estimated landmarks and the ground truth, normalised by the inter-pupil distance) to evaluate the results.
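A small sketch of the normalised mean error as described; since the 68-point scheme annotates eye contours rather than pupils, the pupil centres are approximated here by eye-landmark means, which is our assumption rather than a detail given in the text.

```python
import numpy as np

def normalised_mean_error(pred, gt, left_eye_idx, right_eye_idx):
    """Mean point-to-point error between predicted and ground-truth landmarks,
    normalised by the inter-pupil distance.

    pred, gt : (68, 2) arrays of landmark coordinates.
    left_eye_idx, right_eye_idx : indices of eye landmarks whose means serve as pupil proxies.
    """
    point_err = np.linalg.norm(pred - gt, axis=1).mean()
    inter_pupil = np.linalg.norm(gt[left_eye_idx].mean(axis=0) - gt[right_eye_idx].mean(axis=0))
    return point_err / inter_pupil
```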

Implementation Details. Our end-to-end deep MTL framework is implemented in TensorFlow [1]. The training images for mental state recognition are aligned and cropped to \(96\times 96\). Similarly, the images for landmark detection are resized to \(96\times 96\) and the landmark coordinates are whitened following [43]. The landmark detection data is augmented by horizontal flip, rotation, scale, shift, and additive Gaussian noise following [12]; only horizontal flip is used for emotion recognition. The learning rates for both networks are set to 0.01, and the batch sizes are both 256.
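A minimal sketch of the stated training configuration (96x96 inputs, learning rate 0.01, batch size 256, horizontal flip only for emotion images); the optimiser type is our assumption, since it is not specified here.

```python
import tensorflow as tf

IMG_SIZE, LEARNING_RATE, BATCH_SIZE = 96, 0.01, 256    # values stated in the text

def preprocess_emotion_image(image):
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    return tf.image.random_flip_left_right(image)      # only horizontal flip for emotion images

# The optimiser choice (plain SGD) is an assumption; only the learning rate is given.
optimizer = tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE)
```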

Fig. 3. Trace norm and ADA analysis: trace norm changes during (a) and after (b) network optimisation; feature distributions with and without ADA (c).

4.2 Results

Analysis of Discovered Sharing. To analyse the learned sharing strategy, we define the trace norm decrease rate (TNDR) as \(\frac{\text {Norm of Optimised Para}}{\text {Norm of Initialised Para}}\). The smaller TNDR is, the more knowledge a convolutional layer shares. We take ResNet+LAF as an example to investigate the properties of trace norm optimisation. Figure 3a shows how TNDR decreases over network optimisation epochs. We choose the LAF trace norms of the 1st, 10th, 20th and 30th layers for analysis. Clearly, the 1st-layer LAF decreases more dramatically than the others, implying that more knowledge is shared in the 1st layer. This is consistent with the common intuition that lower layers capture more broadly useful low-level features. Figure 3b shows the TNDR of all layers after learning. We observe that: (i) Overall, TNDR is smaller (information sharing is greater) at earlier layers, as expected. However, this trend is continuous rather than discontinuous, supporting the value of continuously varying soft sharing rather than a discrete all-or-nothing fork. Surprisingly, (ii) within each residual block TNDR decreases (sharing is greater) at higher layers. By learning the parameter sharing, our method has discovered a surprising strategy – related to the ResNet block architecture – that a human engineer is unlikely to have tried.
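The TNDR statistic can be computed per layer as in this small sketch (LAF flattening of the stacked layer tensor; names are illustrative):

```python
import numpy as np

def tndr(initial_stacked_kernel, optimised_stacked_kernel):
    """Trace norm decrease rate: ||optimised||_* / ||initialised||_*, computed on the
    LAF flattening of a stacked layer tensor of shape (H, W, C, M, T)."""
    def laf(W):
        return np.linalg.norm(W.reshape(-1, W.shape[-1]), ord='nuc')
    return laf(optimised_stacked_kernel) / laf(initial_stacked_kernel)
```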

Comparison with Other Deep MTL Methods. Traditional deep MTL methods use hand-designed ‘hard’ parameter sharing, which manually defines which layers are shared and which are not. To contrast with the manual approach, we compare 4 predefined architectures: 34-layer ResNets with the first {6, 14, 26, 32} convolutional layers shared and the rest task-specific. These increments are chosen to correspond to the 4 residual units/blocks in [17]. From Table 4, we see that our automatic soft sharing (without ADA) works much better than ‘hard’ sharing on both tasks. Among the ‘hard’ methods, ResNet (6), with the first 6 layers shared, is best. The fact that such limited sharing works best implies that the cross-dataset domain shift between the two tasks is strong, further motivating our solution for domain-invariant feature learning. We also implement the recent deep MTL method ‘cross-stitch MTL’ [36] using the same ResNet. From Table 4, we can see that our MTL strategy outperforms ‘cross-stitch MTL’. This is because our trace norm-based strategy provides more fine-grained control of information sharing compared to its discrete rank setting.

Table 4. Accuracy (%) of Mental State Recognition on LSEMSW using 34-layer ResNet. RN(#) indicates the number of shared layers in standard MTL baseline.

Trace Norm Comparison. A key contribution of this work is multi-task parameter sharing through the trace norm. Here we compare the three trace norms (LAF, Tucker, TT) introduced in Sect. 2.2 without ADA. The baseline single-task method is the 34-layer ResNet without any parameter sharing. From the results in Table 4, we can see that our MTL methods (LAF, Tucker, TT) perform significantly better than single-task learning. Specifically, for mental state recognition, LAF, Tucker and TT achieve recognition accuracy around 33.4%, compared with 28.39% for single-task learning. For landmark detection, LAF, Tucker and TT reduce the mean error rates versus single-task by around 7%. The three trace norms achieve very similar performance, meaning that our strategy is not very sensitive to the type of norm/factorisation. The similar performance of TT and Tucker to LAF also means that there is not much gain from compressing across filters rather than tasks – suggesting that ResNet is not overly ‘wide’ for mental state recognition. Thus we choose the simplest, LAF, for subsequent comparisons.

Adversarial Domain Alignment. We proposed ADA to reduce the domain shift across the tasks’ training sets. As shown in Table 4, our method ResNet+LAF+ADA achieves 36.72% mental state recognition accuracy and a 4.64% mean error rate for landmark detection, compared with ResNet+LAF (33.43%, 4.67%), showing the effectiveness of ADA. To further investigate the effect of ADA, we visualise the data distributions using the t-SNE [33] technique. In Fig. 3c, we compare the feature distributions of the two test sets (mental state and landmark) with ADA (ResNet+LAF+ADA) and without it (ResNet+LAF). Clearly, ADA effectively reduces the domain shift.

Subtle Expression Recognition (SoA). Finally, we compare to prior state-of-the-art (SoA) methods. The historical lack of big training data means that most prior approaches to expression/emotion recognition use handcrafted features such as LPQ [6], LBP [47], and EOH [35]. A very recent study [40] empirically showed deep learning methods (AlexNet, VGGNet, ResNet) to be effective. Therefore, we compare the proposed method with all these networks. As subtle expression recognition is very challenging, handcrafted features (LPQ, LBP and EOH) do not achieve promising performance. From Table 5, we see that EOH [35] is the best handcrafted feature, because EOH captures both spatial and texture information while LBP and LPQ only capture texture information. Nevertheless, deep learning methods work better than handcrafted features because the deep features are trained end-to-end to capture subtle facial variations. Our proposed ResNet+LAF+ADA approach performs best overall. The superiority of ResNet+LAF+ADA over ResNet shows the effectiveness of our MTL strategy (LAF) and domain alignment strategy (ADA).

Table 5. Comparison of SoA methods on LSEMSW
Table 6. Error Rate (%) of Landmark detection on 300-W database.

Landmark Detection (SoA). Facial landmark detection is primarily performed to provide an auxiliary task for our main subtle expression recognition task; we nevertheless also evaluate landmark detection here. Some qualitative results are shown in Fig. 4. The images exhibit strong variations of expression, illumination, occlusion, and pose, and our method (ResNet+LAF+ADA) is very robust to these variations. Some failure cases are also shown in Fig. 4; these are mainly caused by combinations of different strong variations, e.g. expression+pose (row 2, col 5 & 6) and expression+pose+illumination (row 2, col 7). We also perform a quantitative comparison to SoA methods in Table 6. From the results we can see that our method (RN+LAF+ADA) achieves very promising landmark detection performance. Specifically, we achieve the 2nd best performance on the common subset and the full set, and the best on the challenging subset, showing the robustness of our method in various challenging scenarios such as strong pose, illumination and expression variations – as illustrated in Fig. 4. The promising performance results from (1) the strong nonlinear modelling (regression) capacity of ResNet and (2) the effectiveness of LAF and ADA; both (1) and (2) are also supported by Table 4. We also compare the different loss functions used for landmark detection: from Table 6, we can see that the \(l_1\) loss achieves better performance than the \(l_2\) loss.

Fig. 4. Samples of landmark detection: faces with expressions (row 1, col 1–2), illuminations (row 1, col 3–4), occlusions (row 1, col 5–6), poses (row 2, col 1–3) and failure cases (row 2, col 4–6).

Traditional (non-subtle) Expression Recognition (TNER). It is interesting to investigate transferring knowledge learned from LSEMSW to TNER. We finetune the LSEMSW-pretrained network using the augmented training images of the Oulu-Casia NIR&Vis (OCNV) [62] and CK+ [29] facial expression databases, together with 300-W for multi-task learning. From the results in Table 7, we can draw the following conclusions: (i) Finetuning from LSEMSW works significantly better than training from scratch (87.1% vs 76.0% on OCNV and 96.4% vs 86.3% on CK+), confirming its benefit as a source of data for representation learning, even when the final goal is TNER. (ii) Our MTL based on LAF and ADA is also beneficial for this TNER task (RN+LAF+ADA (FT) vs RN (FT) scores 87.1% vs 82.9% on OCNV and 96.4% vs 93.2% on CK+), as well as for subtle expression recognition. (iii) In terms of comparison to the prior state of the art, we achieve very competitive TNER performance via our soft MTL method and fine-tuning from LSEMSW (although it exclusively contains subtle expressions). Our RN+LAF+ADA (FT) achieves state-of-the-art performance on CK+ and second best on OCNV.

Table 7. Comparison against state of the art on traditional non-subtle expression recognition. (FT) indicates fine-tuning from LSEMSW. (S) means training from scratch.

5 Conclusion

In summary, we have contributed a large new database to advance subtle expression recognition in the deep learning era. A trace norm-based MTL method is proposed, and ADA is used for domain alignment. Extensive experiments have verified the effectiveness of the proposed methods.