DOI: 10.1145/3343031.3351057
research-article

DaNet: Decompose-and-aggregate Network for 3D Human Shape and Pose Estimation

Published: 15 October 2019

ABSTRACT

Reconstructing 3D human shape and pose from a monocular image remains challenging despite the promising results achieved by recent learning-based methods. The commonly occurring misalignment stems from two facts: the mapping from image space to model space is highly non-linear, and the rotation-based pose representation of the body model is prone to drift in joint positions. In this work, we present the Decompose-and-aggregate Network (DaNet) to address these issues. DaNet includes three new designs, namely UVI guided learning, decomposition for fine-grained perception, and aggregation for robust prediction. First, we adopt UVI maps, which densely build a bridge between 2D pixels and 3D vertices, as an intermediate representation to facilitate the learning of the image-to-model mapping. Second, we decompose the prediction task into one global stream and multiple local streams so that the network not only provides global perception for camera and shape prediction, but also has detailed perception for part pose prediction. Lastly, we aggregate the messages from the local streams to enhance the robustness of part pose prediction, where a position-aided rotation feature refinement strategy is proposed to exploit the spatial relationships between body parts. Such a refinement strategy is more efficient because the correlations between position features are stronger than those in the original rotation feature space. The effectiveness of our method is validated on the Human3.6M and UP-3D datasets. Experimental results show that the proposed method significantly improves the reconstruction performance in comparison with previous state-of-the-art methods. Our code is publicly available at https://github.com/HongwenZhang/DaNet-3DHumanReconstrution .
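
The drift of joint positions mentioned in the abstract can be illustrated with a toy example. The sketch below is not taken from DaNet or from the SMPL body model; it assumes a hypothetical planar three-bone chain (the bone lengths and the 5-degree error are made-up values) in which each joint stores a rotation relative to its parent. Because absolute orientations are accumulated along the chain, an angular error at one joint displaces every joint further down the chain, and the displacement grows with distance from the erroneous joint.

```python
import math


def forward_kinematics(bone_lengths, relative_angles):
    """Return 2D joint positions of a planar chain rooted at the origin.

    Each joint stores a rotation relative to its parent, so the absolute
    orientation of a bone is the sum of all angles above it in the chain.
    """
    x, y, absolute_angle = 0.0, 0.0, 0.0
    positions = []
    for length, angle in zip(bone_lengths, relative_angles):
        absolute_angle += angle
        x += length * math.cos(absolute_angle)
        y += length * math.sin(absolute_angle)
        positions.append((x, y))
    return positions


def joint_errors(bone_lengths, true_angles, predicted_angles):
    """Euclidean distance between true and predicted position of each joint."""
    true_pos = forward_kinematics(bone_lengths, true_angles)
    pred_pos = forward_kinematics(bone_lengths, predicted_angles)
    return [math.dist(p, q) for p, q in zip(true_pos, pred_pos)]


# Hypothetical limb (illustrative values only): bone lengths in metres.
bones = [0.30, 0.25, 0.10]
true_angles = [0.0, 0.0, 0.0]
# A 5-degree rotation error on the first joint only; the others are exact.
noisy_angles = [math.radians(5.0), 0.0, 0.0]

errors = joint_errors(bones, true_angles, noisy_angles)
for index, error in enumerate(errors, start=1):
    print(f"joint {index}: position error = {error * 100:.2f} cm")
```

In this toy setting the position error increases monotonically along the chain even though only one rotation is wrong, which is one way to read the abstract's motivation for refining rotation features with the help of position features.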


          Published in: MM '19: Proceedings of the 27th ACM International Conference on Multimedia, October 2019, 2794 pages
          ISBN: 9781450368896
          DOI: 10.1145/3343031

          Copyright © 2019 ACM


          Publisher: Association for Computing Machinery, New York, NY, United States


          Acceptance rates: MM '19 paper acceptance rate: 252 of 936 submissions (27%). Overall acceptance rate: 995 of 4,171 submissions (24%).
