DOI: 10.1145/3536220.3558806

Automatic facial expressions, gaze direction and head movements generation of a virtual agent

Published: 07 November 2022

ABSTRACT

In this article, we present two models to jointly and automatically generate the head, facial and gaze movements of a virtual agent from acoustic speech features. Two architectures are explored: a Generative Adversarial Network and an Adversarial Encoder-Decoder. Head movements and gaze orientation are generated as 3D coordinates, while facial expressions are generated as action units based on the Facial Action Coding System. A large corpus of almost 4 hours of video, involving 89 different speakers, is used to train our models. We extract the speech and visual features automatically from these videos using existing tools. The models are evaluated objectively, with measures such as density evaluation and a visualisation based on PCA reduction, as well as subjectively, through a perceptual user study. Our proposed methodology shows that, on 15-second sequences, the encoder-decoder architecture drastically improves the perception of the generated behaviours on two criteria: coordination with speech and naturalness. Our code can be found at: https://github.com/aldelb/non-verbal-behaviours-generation.
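To make the described setup concrete, below is a minimal, illustrative sketch in PyTorch of a speech-conditioned generator and discriminator of the kind the abstract mentions. It is not the authors' implementation (their code is at the repository above); all feature dimensions, layer choices and names (N_ACOUSTIC, N_BEHAVIOUR, the frame rate) are assumptions made for illustration.

```python
# Illustrative sketch only: a speech-to-behaviour generator/discriminator pair.
# Dimensions and architecture choices below are assumptions, not the authors' code.
import torch
import torch.nn as nn

N_ACOUSTIC = 128    # assumed size of the per-frame acoustic feature vector
N_BEHAVIOUR = 28    # assumed: action units + head pose + gaze coordinates, concatenated per frame
T = 375             # assumed frame count for a 15 s sequence at 25 fps

class Generator(nn.Module):
    """Encoder-decoder mapping a sequence of acoustic features to behaviour features."""
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(N_ACOUSTIC, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, N_BEHAVIOUR)

    def forward(self, speech):                    # speech: (batch, T, N_ACOUSTIC)
        enc, _ = self.encoder(speech)             # (batch, T, 2*hidden)
        dec, _ = self.decoder(enc)                # (batch, T, hidden)
        return self.head(dec)                     # (batch, T, N_BEHAVIOUR)

class Discriminator(nn.Module):
    """Scores whether a (speech, behaviour) pair looks like recorded data."""
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(N_ACOUSTIC + N_BEHAVIOUR, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, speech, behaviour):
        feats, _ = self.rnn(torch.cat([speech, behaviour], dim=-1))
        return self.score(feats[:, -1])           # one real/fake logit per sequence

# Smoke test on random tensors (no adversarial training loop shown).
gen, disc = Generator(), Discriminator()
speech = torch.randn(2, T, N_ACOUSTIC)
fake = gen(speech)
logit = disc(speech, fake)
print(fake.shape, logit.shape)                    # torch.Size([2, 375, 28]) torch.Size([2, 1])
```

In practice the generator would be trained with an adversarial loss from the discriminator, optionally combined with a reconstruction loss against the ground-truth behaviour features; those details are left out of this sketch.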


Published in

ICMI '22 Companion: Companion Publication of the 2022 International Conference on Multimodal Interaction
November 2022, 225 pages
ISBN: 9781450393898
DOI: 10.1145/3536220

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 7 November 2022


        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
