Abstract

This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms most of the published works in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for real-world images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images. Compared to data generated from more classical rendering engines, our synthetic images do not require any domain adaptation or fine-tuning stage.
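The per-joint selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-joint skeleton, the orthographic projection, and the local distance (offsets to kinematic neighbours after centring on the joint) are all assumptions made for illustration, and the names `NEIGHBOURS`, `project`, `local_distance`, and `select_images` are hypothetical.

```python
import math

# Hypothetical toy skeleton: joint index -> indices of its kinematic neighbours.
NEIGHBOURS = {0: [1], 1: [0, 2], 2: [1]}


def project(pose_3d):
    """Orthographic projection that drops depth (an assumption, not the paper's camera model)."""
    return [(x, y) for x, y, _z in pose_3d]


def local_distance(pose_a, pose_b, joint):
    """Compare two 2D poses around `joint` after centring both on that joint.

    Only the offsets to the joint's kinematic neighbours are compared, so the
    match is local and invariant to image translation.
    """
    ax, ay = pose_a[joint]
    bx, by = pose_b[joint]
    total = 0.0
    for n in NEIGHBOURS[joint]:
        offset_a = (pose_a[n][0] - ax, pose_a[n][1] - ay)
        offset_b = (pose_b[n][0] - bx, pose_b[n][1] - by)
        total += math.hypot(offset_a[0] - offset_b[0], offset_a[1] - offset_b[1])
    return total


def select_images(pose_3d, annotated_2d_poses):
    """For each joint, return the index of the image whose 2D pose matches best locally."""
    target = project(pose_3d)
    selection = []
    for joint in range(len(target)):
        best = min(
            range(len(annotated_2d_poses)),
            key=lambda i: local_distance(target, annotated_2d_poses[i], joint),
        )
        selection.append(best)
    return selection


# Toy data: one candidate 3D pose and two real images with 2D pose annotations.
candidate_3d = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (1.0, 1.0, 5.0)]
image_poses = [
    [(10.0, 10.0), (11.0, 10.0), (13.0, 14.0)],  # image 0
    [(0.0, 0.0), (5.0, 5.0), (5.0, 6.0)],        # image 1
]
selection = select_images(candidate_3d, image_poses)
print(selection)  # one image index per joint
```

In this toy example different joints may be served by different images, which is what makes the later patch-stitching step necessary.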

Notes

  1. http://mocap.cs.cmu.edu.

  2. Ground-truth classes are obtained by assigning the ground-truth 2D pose to the closest cluster.
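The assignment mentioned in note 2 can be sketched as a nearest-centre lookup. This is an illustrative sketch only: the Euclidean distance over joint coordinates, the two-joint toy poses, and the names `pose_distance` and `assign_class` are assumptions, not details taken from the paper.

```python
import math


def pose_distance(pose_a, pose_b):
    """Euclidean distance between two 2D poses with the same joint order (assumed metric)."""
    return math.sqrt(
        sum((ax - bx) ** 2 + (ay - by) ** 2
            for (ax, ay), (bx, by) in zip(pose_a, pose_b))
    )


def assign_class(pose_2d, cluster_centres):
    """Return the index of the cluster centre closest to `pose_2d`."""
    return min(
        range(len(cluster_centres)),
        key=lambda k: pose_distance(pose_2d, cluster_centres[k]),
    )


# Hypothetical cluster centres (K = 2) and one ground-truth 2D pose.
centres = [
    [(0.0, 0.0), (0.0, 1.0)],  # class 0
    [(0.0, 0.0), (1.0, 0.0)],  # class 1
]
ground_truth = [(0.1, 0.0), (0.9, 0.1)]
label = assign_class(ground_truth, centres)
print(label)
```

The resulting index serves as the class label for the K-way classification described in the abstract.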

Acknowledgements

This work was supported by the European Commission under FP7 Marie Curie IOF Grant (PIOF-GA-2012-328288) and partially supported by the ERC advanced Grant ALLEGRO and an Amazon Academic Research Award (AARA). We acknowledge the support of NVIDIA with the donation of the GPUs used for this research. We thank Dr. Philippe Weinzaepfel for his help. We also thank the anonymous reviewers for their comments and suggestions that helped improve the paper.

Author information

Corresponding author

Correspondence to Grégory Rogez.

Additional information

Communicated by Adrien Gaidon, Florent Perronnin and Antonio Lopez.

About this article

Cite this article

Rogez, G., Schmid, C. Image-Based Synthesis for Deep 3D Human Pose Estimation. Int J Comput Vis 126, 993–1008 (2018). https://doi.org/10.1007/s11263-018-1071-9

  • DOI: https://doi.org/10.1007/s11263-018-1071-9
