
DaNet: Decompose-and-aggregate Network for 3D Human Shape and Pose Estimation

Authors:
Hongwen Zhang, Institute of Automation, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China
Jie Cao, Institute of Automation, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China
Guo Lu, Shanghai Jiao Tong University, Shanghai, China
Wanli Ouyang, The University of Sydney, Sydney, Australia
Zhenan Sun, Institute of Automation, Chinese Academy of Sciences & University of Chinese Academy of Sciences, Beijing, China

MM '19: Proceedings of the 27th ACM International Conference on Multimedia, October 2019, Pages 935–944
https://doi.org/10.1145/3343031.3351057

Published: 15 October 2019


ABSTRACT

Reconstructing 3D human shape and pose from a monocular image remains challenging despite the promising results achieved by recent learning-based methods. The misalignment that commonly occurs stems from two facts: the mapping from image space to model space is highly non-linear, and the rotation-based pose representation of the body model is prone to drift in joint positions. In this work, we present the Decompose-and-aggregate Network (DaNet) to address these issues. DaNet includes three new designs: UVI-guided learning, decomposition for fine-grained perception, and aggregation for robust prediction. First, we adopt UVI maps, which densely build a bridge between 2D pixels and 3D vertices, as an intermediate representation to facilitate learning the image-to-model mapping. Second, we decompose the prediction task into one global stream and multiple local streams, so that the network provides global perception for camera and shape prediction as well as detailed perception for part pose prediction. Lastly, we aggregate the messages from the local streams to enhance the robustness of part pose prediction, where a position-aided rotation feature refinement strategy is proposed to exploit the spatial relationships between body parts. This refinement strategy is more efficient because the correlations between position features are stronger than those in the original rotation feature space. The effectiveness of our method is validated on the Human3.6M and UP-3D datasets. Experimental results show that the proposed method significantly improves reconstruction performance compared with previous state-of-the-art methods. Our code is publicly available at https://github.com/HongwenZhang/DaNet-3DHumanReconstrution .
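The drift issue mentioned in the abstract can be illustrated with a toy example. The sketch below is not the DaNet implementation and does not use the SMPL body model; it assumes a hypothetical planar chain of three bones, and the function names, bone lengths, and the 5-degree error are chosen purely for illustration. Because each joint stores a rotation relative to its parent, a small rotation error at the root moves every joint further down the chain, and the displacement grows with distance from the root.

```python
import math


def forward_kinematics(bone_lengths, relative_angles):
    """Return 2D joint positions of a planar chain rooted at the origin.

    Each angle is relative to the parent bone, so the absolute heading of a
    bone is the running sum of all angles from the root down to it.
    """
    x, y, heading = 0.0, 0.0, 0.0
    positions = [(x, y)]
    for length, angle in zip(bone_lengths, relative_angles):
        heading += angle
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        positions.append((x, y))
    return positions


def joint_drift(bone_lengths, true_angles, predicted_angles):
    """Per-joint Euclidean distance between true and predicted positions."""
    true_pos = forward_kinematics(bone_lengths, true_angles)
    pred_pos = forward_kinematics(bone_lengths, predicted_angles)
    return [math.dist(p, q) for p, q in zip(true_pos, pred_pos)]


# Hypothetical chain (lengths in metres), fully extended along the x-axis.
bones = [0.3, 0.3, 0.2]
true_angles = [0.0, 0.0, 0.0]
# Only the first joint's rotation is mispredicted, by 5 degrees.
predicted_angles = [math.radians(5.0), 0.0, 0.0]

drift = joint_drift(bones, true_angles, predicted_angles)
for index, value in enumerate(drift):
    print(f"joint {index}: drift = {value:.4f} m")
```

In this toy setting a single rotation error yields a position error that increases monotonically toward the end of the chain, which is one way to read the abstract's motivation for refining rotation features with the help of position features.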


Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Reconstruction
        Shape inference
2. Human-centered computing


Published in
MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
General Chairs:
Laurent Amsaleg, CNRS-IRISA, France
Benoit Huet, EURECOM, France
Martha Larson, Radboud University and TU Delft, Netherlands
Program Chairs:
Guillaume Gravier, CNRS-IRISA, France
Hayley Hung, Delft University of Technology, Netherlands
Chong-Wah Ngo, City University of Hong Kong, Hong Kong
Wei Tsang Ooi, National University of Singapore, Singapore
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
3d human shape and pose estimation
decompose-and-aggregate network
position-aided rotation feature refinement
Qualifiers
- research-article
Acceptance Rates
MM '19 paper acceptance rate: 252 of 936 submissions, 27%
Overall acceptance rate: 995 of 4,171 submissions, 24%
