ABSTRACT
In this article, we present two models that jointly and automatically generate the head, facial and gaze movements of a virtual agent from acoustic speech features. Two architectures are explored: a Generative Adversarial Network and an Adversarial Encoder-Decoder. Head movements and gaze orientation are generated as 3D coordinates, while facial expressions are generated as action units based on the Facial Action Coding System. A large corpus of almost 4 hours of video, involving 89 different speakers, is used to train our models. Speech and visual features are extracted automatically from these videos using existing tools. The models are evaluated objectively, with measures such as density evaluation and visualisation after PCA reduction, and subjectively through a user perceptual study. On 15-second sequences, the encoder-decoder architecture drastically improves the perception of the generated behaviours on two criteria: coordination with speech and naturalness. Our code is available at: https://github.com/aldelb/non-verbal-behaviours-generation.
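The input/output interface described in the abstract — acoustic speech features conditioning the generation of 3D head coordinates, 3D gaze orientation and FACS action-unit intensities — can be sketched as a toy conditional generator. All dimensions below (26 acoustic features, 17 action units, 8 noise dimensions) and the single linear layer are illustrative assumptions, not the authors' actual architecture or feature sets.

```python
import numpy as np

# Hypothetical dimensions (assumptions, not the paper's actual values).
N_ACOUSTIC = 26   # acoustic speech features per frame
N_NOISE = 8       # latent noise, making generation non-deterministic
N_HEAD = 3        # head movement as 3D coordinates
N_GAZE = 3        # gaze orientation as 3D coordinates
N_AU = 17         # facial action-unit intensities (FACS-based)
N_OUT = N_HEAD + N_GAZE + N_AU

rng = np.random.default_rng(0)

# A one-layer stand-in for the generator: speech features + noise -> one
# behaviour frame. A real model would be a trained deep network.
W = rng.standard_normal((N_ACOUSTIC + N_NOISE, N_OUT)) * 0.1

def generate_frame(speech_feats: np.ndarray) -> dict:
    """Map one frame of acoustic features to head/gaze/AU outputs."""
    z = rng.standard_normal(N_NOISE)                      # sampled noise
    out = np.tanh(np.concatenate([speech_feats, z]) @ W)  # bounded outputs
    return {
        "head": out[:N_HEAD],
        "gaze": out[N_HEAD:N_HEAD + N_GAZE],
        "aus": out[N_HEAD + N_GAZE:],
    }

frame = generate_frame(rng.standard_normal(N_ACOUSTIC))
print(frame["head"].shape, frame["gaze"].shape, frame["aus"].shape)
```

The point of the sketch is the joint output: one conditioning signal (speech) drives all three behaviour streams at once, rather than a separate model per modality; the noise input is what lets an adversarial model produce varied, non-deterministic behaviour for the same speech.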
Index Terms
- Automatic generation of facial expressions, gaze direction and head movements for a virtual agent