
Video modeling and learning on Riemannian manifold for emotion recognition in the wild

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

In this paper, we present the method behind our submission to the Emotion Recognition in the Wild challenge (EmotiW). The challenge is to automatically classify the emotions acted by human subjects in video clips recorded in real-world environments. In our method, each video clip is represented by three types of image set models (i.e. linear subspace, covariance matrix, and Gaussian distribution), each of which can be viewed as a point on some Riemannian manifold. Riemannian kernels matched to each set model are then employed for similarity/distance measurement. For classification, three types of classifiers, i.e. kernel SVM, logistic regression, and partial least squares, are investigated for comparison. Finally, an optimal fusion of classifiers learned from different kernels and different modalities (video and audio) is conducted at the decision level to further boost performance. We perform extensive evaluations on the EmotiW 2014 challenge data (including the validation set and the blind test set) and analyse the effect of each component in our pipeline; our method achieves the best performance reported so far. To further evaluate generalization ability, we also run experiments on the EmotiW 2013 data and two well-known lab-controlled databases, CK+ and MMI. The results show that the proposed framework significantly outperforms the state-of-the-art methods.
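To make the pipeline concrete, the sketch below illustrates one plausible instantiation of its covariance branch; it is not the authors' released code. Each clip's frame features are modelled as a regularised covariance matrix (a point on the SPD manifold), flattened via the log-Euclidean matrix logarithm, compared with an RBF kernel, and classified with a kernel SVM. The toy features, the regularisation strength `eps`, and the bandwidth `gamma` are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def covariance_model(frames, eps=1e-3):
    """Set model: regularised covariance of an (n_frames x d) feature
    matrix, a point on the SPD manifold (eps is an assumed value)."""
    d = frames.shape[1]
    c = np.cov(frames, rowvar=False)
    return c + eps * np.trace(c) / d * np.eye(d)

def log_euclidean_embed(spd):
    """Matrix logarithm via eigendecomposition: maps an SPD matrix into
    the flat vector space of the log-Euclidean metric."""
    w, v = np.linalg.eigh(spd)
    log_spd = (v * np.log(w)) @ v.T
    return log_spd[np.triu_indices_from(log_spd)]  # upper triangle as vector

def rbf_kernel(a, b, gamma=1e-2):
    """Gaussian kernel between rows of the log-mapped vectors."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Toy stand-in for per-frame visual features: 20 clips, 30 frames, 16 dims.
rng = np.random.default_rng(0)
clips = [rng.normal(size=(30, 16)) for _ in range(20)]
labels = rng.integers(0, 3, size=20)  # 3 pretend emotion classes

embeds = np.stack([log_euclidean_embed(covariance_model(c)) for c in clips])
svm = SVC(kernel="precomputed").fit(rbf_kernel(embeds, embeds), labels)
print(svm.predict(rbf_kernel(embeds, embeds)))  # training-set predictions
```

In the full system described above, the linear-subspace and Gaussian models would be paired with their own manifold kernels, and the per-kernel classifier scores would be fused with the audio modality at the decision level.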



Acknowledgments

This work is partially supported by the 973 Program under Contract No. 2015CB351802, the Natural Science Foundation of China under Contract Nos. 61390511, 61222211, and 61379083, and the Youth Innovation Promotion Association CAS under Grant No. 2015085.

Author information

Corresponding author

Correspondence to Shiguang Shan.

About this article

Cite this article

Liu, M., Wang, R., Li, S. et al. Video modeling and learning on Riemannian manifold for emotion recognition in the wild. J Multimodal User Interfaces 10, 113–124 (2016). https://doi.org/10.1007/s12193-015-0204-5
