ABSTRACT
Recently, facial expression recognition (FER) in the wild has attracted considerable attention from researchers, since it is a key step in moving FER techniques from the laboratory to real-world applications. In this paper, we focus on this challenging but interesting topic and make contributions from three aspects. First, we present a new large-scale 'in-the-wild' dynamic facial expression database, DFEW (Dynamic Facial Expression in the Wild), consisting of over 16,000 video clips from thousands of movies. These video clips contain various challenging interferences found in practical scenarios, such as extreme illumination, occlusion, and unpredictable pose changes. Second, we propose a novel method, the Expression-Clustered Spatiotemporal Feature Learning (EC-STFL) framework, to deal with dynamic FER in the wild. Third, we conduct extensive benchmark experiments on DFEW using a variety of spatiotemporal deep feature learning methods as well as the proposed EC-STFL. Experimental results show that DFEW is a well-designed and challenging database, and that EC-STFL consistently improves the performance of existing spatiotemporal deep neural networks on dynamic FER in the wild. Our DFEW database is publicly available and can be freely downloaded from https://dfew-dataset.github.io/.
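The abstract names EC-STFL but does not spell out its training objective. As an illustrative sketch only (the center-based form of the clustering term, the weight `lam`, and all function names below are assumptions, not the authors' published formulation), one common way to encourage expression-clustered spatiotemporal features is to combine standard cross-entropy with a center-loss-style penalty that pulls each clip's pooled feature toward the center of its expression class:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Numerically stable log-softmax followed by negative log-likelihood.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def expression_cluster_loss(feats, labels, centers):
    # Mean squared distance of each pooled spatiotemporal feature
    # vector to the center of its own expression class.
    diffs = feats - centers[labels]
    return float((diffs ** 2).sum(axis=1).mean())

def ec_stfl_style_loss(logits, feats, labels, centers, lam=0.1):
    # Joint objective: classification term plus lam-weighted clustering
    # term; lam is a hypothetical hyperparameter, not from the paper.
    return (softmax_cross_entropy(logits, labels)
            + lam * expression_cluster_loss(feats, labels, centers))
```

In practice the class centers would be learnable parameters updated jointly with the backbone; the sketch treats them as given arrays to keep the objective itself in focus.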