Published April 29, 2024 | Version v2
Journal article Open

Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing Classification

  • 1. Carnegie Mellon University
  • 2. AMA University
  • 3. University of Washington
  • 4. Microsoft
  • 5. Northern Arizona University

Description

With mask wearing becoming a new cultural norm, facial expression recognition (FER) that takes masks into account has become a significant challenge. In this paper, we propose a unified multi-branch vision transformer for the facial expression recognition and mask wearing classification tasks. Our approach extracts shared features for both tasks using a dual-branch architecture that obtains multi-scale feature representations. Furthermore, we propose a cross-task fusion phase that processes the tokens for each task in separate branches while exchanging information through a cross-attention module. Thanks to this simple yet effective cross-task fusion phase, the proposed framework has lower overall complexity than using a separate network for each task. Extensive experiments demonstrate that the proposed model performs better than or on par with state-of-the-art methods on both the facial expression recognition and the mask wearing classification tasks.
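The token exchange described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that each task branch carries a class token at index 0, that this token queries the other branch's tokens with single-head scaled dot-product attention, and that the result is added back as a residual. Learned projections, multiple heads, and normalization are omitted, and the names `cross_attend` and `cross_task_exchange` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, context):
    """Attend one query token over a list of context tokens.

    Uses scaled dot-product attention with identity projections
    (an assumption made to keep the sketch short).
    Returns the attended vector and the attention weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, tok)) / math.sqrt(d)
        for tok in context
    ]
    weights = softmax(scores)
    attended = [
        sum(w * tok[i] for w, tok in zip(weights, context))
        for i in range(d)
    ]
    return attended, weights


def cross_task_exchange(expr_tokens, mask_tokens):
    """Exchange information between two task branches.

    The class token (index 0) of each branch queries the tokens of the
    other branch; the attended vector is added to it as a residual.
    All remaining tokens pass through unchanged.
    """
    expr_ctx, _ = cross_attend(expr_tokens[0], mask_tokens)
    mask_ctx, _ = cross_attend(mask_tokens[0], expr_tokens)
    new_expr_cls = [a + b for a, b in zip(expr_tokens[0], expr_ctx)]
    new_mask_cls = [a + b for a, b in zip(mask_tokens[0], mask_ctx)]
    return [new_expr_cls] + expr_tokens[1:], [new_mask_cls] + mask_tokens[1:]


if __name__ == "__main__":
    # Toy tokens: 3 tokens in the expression branch, 2 in the mask branch.
    expr = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    mask = [[0.0, 1.0], [1.0, 0.0]]
    fused_expr, fused_mask = cross_task_exchange(expr, mask)
    print(fused_expr[0], fused_mask[0])
```

Because only the class tokens act as queries in this sketch, the cost of the exchange grows linearly with the number of tokens in the other branch, which is one plausible reading of how a shared fusion phase could be cheaper than running two full networks.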

Files

v1n1a07.pdf (819.6 kB)
md5:a00dc5889f601a0bebaf1ed1eb208fa1
