ABSTRACT
Reconstructing 3D human shape and pose from a monocular image remains challenging despite the promising results achieved by recent learning-based methods. The misalignment that commonly occurs stems from two facts: the mapping from image space to model space is highly non-linear, and the rotation-based pose representation of the body model is prone to drift in joint positions. In this work, we present the Decompose-and-aggregate Network (DaNet) to address these issues. DaNet includes three new designs, namely UVI guided learning, decomposition for fine-grained perception, and aggregation for robust prediction. First, we adopt UVI maps, which densely build a bridge between 2D pixels and 3D vertices, as an intermediate representation to facilitate learning of the image-to-model mapping. Second, we decompose the prediction task into one global stream and multiple local streams, so that the network provides global perception for camera and shape prediction as well as detailed perception for part pose prediction. Lastly, we aggregate the messages from the local streams to enhance the robustness of part pose prediction, where a position-aided rotation feature refinement strategy is proposed to exploit the spatial relationships between body parts. Such a refinement strategy is more efficient since the correlations between position features are stronger than those in the original rotation feature space. The effectiveness of our method is validated on the Human3.6M and UP-3D datasets. Experimental results show that the proposed method significantly improves reconstruction performance in comparison with previous state-of-the-art methods. Our code is publicly available at https://github.com/HongwenZhang/DaNet-3DHumanReconstrution .
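The claim that a rotation-based pose representation is prone to drift in joint positions can be illustrated with a toy example. The sketch below is not part of DaNet; it assumes a planar three-bone chain with hypothetical bone lengths and a uniform 5 degree error on every relative rotation, and it only shows how such errors compound along a kinematic chain when positions are recovered by forward kinematics.

```python
import math


def forward_kinematics(bone_lengths, relative_angles):
    """Return 2D joint positions of a planar chain rooted at the origin.

    Each angle is relative to the parent bone, mirroring how rotation-based
    body models express pose along a kinematic chain.
    """
    x, y, heading = 0.0, 0.0, 0.0
    positions = [(x, y)]
    for length, angle in zip(bone_lengths, relative_angles):
        heading += angle  # child orientation composes with all ancestors
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        positions.append((x, y))
    return positions


def joint_errors(bone_lengths, true_angles, predicted_angles):
    """Per-joint Euclidean distance between true and predicted positions."""
    true_pos = forward_kinematics(bone_lengths, true_angles)
    pred_pos = forward_kinematics(bone_lengths, predicted_angles)
    return [math.dist(p, q) for p, q in zip(true_pos, pred_pos)]


# Hypothetical values chosen only for illustration.
bones = [0.3, 0.3, 0.3]               # bone lengths in metres
true_angles = [0.0, 0.0, 0.0]         # straight chain
noisy_angles = [math.radians(5)] * 3  # 5 degree error on each relative rotation

errors = joint_errors(bones, true_angles, noisy_angles)
for index, error in enumerate(errors):
    print(f"joint {index}: position error {error:.4f} m")
```

In this toy setting the position error grows with distance from the root even though every rotation carries the same angular error, which is one way to read the motivation for refining rotation features with position information.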
Index Terms
- DaNet: Decompose-and-aggregate Network for 3D Human Shape and Pose Estimation