Abstract
Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects (e.g., two hands or humans interacting with rigid environments). Modelling dense non-rigid object deformations in this setting (e.g., when hands are interacting with a face) has remained largely unaddressed so far, although such effects can improve the realism of downstream applications such as AR/VR, 3D virtual avatar communications, and character animations. This is due to the severe ill-posedness of the monocular view setting and the associated challenges (e.g., in acquiring a dataset for training and evaluation or in obtaining a reasonable non-uniform stiffness of the deformable object). While it is possible to naïvely track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates, such as depth ambiguity, unnatural intra-object collisions, and missing or implausible deformations.
Hence, this paper introduces the first method that addresses the fundamental challenges described above and that enables tracking of human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions, and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible than those of several baselines applicable in our setting, both quantitatively and qualitatively. https://vcai.mpi-inf.mpg.de/projects/Decaf
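To make the role of position-based dynamics with non-uniform stiffness more concrete, the following is a minimal sketch of a generic distance-constraint projection in the style of Müller et al. (2007). It is not the authors' implementation: the function name, the per-constraint stiffness values, and the toy two-particle setup are all assumptions chosen for illustration, and the stiffness is applied directly per iteration without the iteration-count correction used in the original formulation.

```python
import math


def project_distance_constraints(positions, inv_masses, constraints, iterations=1):
    """Generic PBD projection of distance constraints (illustrative sketch only).

    positions   : list of 3D points, one per particle
    inv_masses  : inverse mass per particle (0.0 pins a particle in place)
    constraints : list of (i, j, rest_length, stiffness), stiffness in [0, 1];
                  a different stiffness per constraint models non-uniform material
    """
    pos = [list(p) for p in positions]
    for _ in range(iterations):
        for i, j, rest, stiffness in constraints:
            w_i, w_j = inv_masses[i], inv_masses[j]
            w_sum = w_i + w_j
            if w_sum == 0.0:
                continue  # both particles are pinned
            delta = [a - b for a, b in zip(pos[i], pos[j])]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue  # direction undefined for coincident particles
            # Constraint C = dist - rest; move both particles along delta,
            # weighted by inverse mass and scaled by the constraint stiffness.
            scale = stiffness * (dist - rest) / (dist * w_sum)
            for axis in range(3):
                pos[i][axis] -= w_i * scale * delta[axis]
                pos[j][axis] += w_j * scale * delta[axis]
    return pos


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy example: two particles stretched to twice their rest length.
start = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
masses = [1.0, 1.0]

stiff = project_distance_constraints(start, masses, [(0, 1, 1.0, 1.0)])
soft = project_distance_constraints(start, masses, [(0, 1, 1.0, 0.5)])

stiff_gap = distance(stiff[0], stiff[1])
soft_gap = distance(soft[0], soft[1])
print(stiff_gap, soft_gap)
```

After a single iteration the stiff constraint restores the rest length of 1.0, while the softer one only closes half of the gap and leaves the particles 1.5 apart. Assigning such stiffness values per region is one simple way a simulation can let some tissue areas deform more than others.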
Index Terms
- Decaf: Monocular Deformation Capture for Face and Hand Interactions