Abstract
In this paper, we present the method of our submission to the Emotion Recognition in the Wild (EmotiW) challenge, whose task is to automatically classify the emotions acted by human subjects in video clips recorded in real-world conditions. In our method, each video clip is represented by three types of image-set models, i.e. linear subspace, covariance matrix, and Gaussian distribution, all of which can be viewed as points residing on certain Riemannian manifolds. Corresponding Riemannian kernels are then employed on these set models to measure similarity/distance. For classification, three types of classifiers, i.e. kernel SVM, logistic regression, and partial least squares, are investigated and compared. Finally, an optimal fusion of the classifiers learned from different kernels and different modalities (video and audio) is conducted at the decision level to further boost performance. We perform extensive evaluations on the EmotiW 2014 challenge data (including the validation set and the blind test set) and assess the effects of the different components of our pipeline. Our method achieves the best performance reported so far on this benchmark. To further evaluate its generalization ability, we also conduct experiments on the EmotiW 2013 data and on two well-known lab-controlled databases, CK+ and MMI. The results show that the proposed framework significantly outperforms state-of-the-art methods.
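To make the covariance-matrix branch of the pipeline concrete, the sketch below represents each video clip (a set of per-frame feature vectors) by a regularized covariance matrix, a point on the manifold of symmetric positive-definite (SPD) matrices, and compares two clips with an RBF kernel under the Log-Euclidean metric. This is a minimal illustration, not the authors' implementation: the feature dimension, regularization constant, kernel bandwidth, and function names are all assumptions, and real features (e.g. HOG or deep features per frame) would replace the random data.

```python
import numpy as np

def covariance_model(frames, eps=1e-3):
    """Represent an image set (frames: n_frames x d feature matrix)
    by its covariance matrix, regularized to stay strictly SPD."""
    C = np.cov(frames, rowvar=False)
    return C + eps * np.eye(C.shape[0])

def log_euclidean_kernel(C1, C2, gamma=0.1):
    """RBF kernel on the SPD manifold under the Log-Euclidean metric:
    d(C1, C2) = ||logm(C1) - logm(C2)||_F."""
    def logm(C):
        # Matrix logarithm via eigendecomposition (valid for SPD matrices).
        w, V = np.linalg.eigh(C)
        return V @ np.diag(np.log(w)) @ V.T
    d = np.linalg.norm(logm(C1) - logm(C2), 'fro')
    return np.exp(-gamma * d**2)

# Two hypothetical clips with different frame counts but same feature dim.
rng = np.random.default_rng(0)
clip_a = rng.normal(size=(40, 5))   # 40 frames, 5-dim features
clip_b = rng.normal(size=(60, 5))

Ca, Cb = covariance_model(clip_a), covariance_model(clip_b)
k_ab = log_euclidean_kernel(Ca, Cb)  # kernel value in (0, 1]
k_aa = log_euclidean_kernel(Ca, Ca)  # self-similarity is exactly 1
```

A Gram matrix of such kernel values over training clips could then be fed to a kernel classifier (e.g. an SVM with a precomputed kernel), mirroring the kernel SVM / logistic regression / PLS comparison described in the abstract.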
Acknowledgments
This work is partially supported by the 973 Program under contract No. 2015CB351802, the Natural Science Foundation of China under contracts Nos. 61390511, 61222211, and 61379083, and the Youth Innovation Promotion Association CAS under No. 2015085.
Cite this article
Liu, M., Wang, R., Li, S. et al. Video modeling and learning on Riemannian manifold for emotion recognition in the wild. J Multimodal User Interfaces 10, 113–124 (2016). https://doi.org/10.1007/s12193-015-0204-5