Pictorial Human Spaces: A Computational Study on the Human Perception of 3D Articulated Poses

International Journal of Computer Vision

Abstract

Human motion analysis in images and video, with its deeply inter-related 2D and 3D inference components, is a central computer vision problem. Yet there are no studies that reveal how humans perceive other people in images and how accurate they are. In this paper we aim to unveil some of the processing, as well as the levels of accuracy, involved in the 3D perception of people from images by assessing human performance. Moreover, we reveal the quantitative and qualitative differences between human and computer performance when presented with the same visual stimuli, and show that metrics incorporating human perception can produce more meaningful results when integrated into automatic pose prediction algorithms. Our contributions are: (1) the construction of an experimental apparatus that relates perception and measurement, in particular the visual and kinematic performance with respect to 3D ground truth when the human subject is presented with an image of a person in a given pose; (2) the creation of a dataset containing images, articulated 2D and 3D pose ground truth, as well as synchronized eye movement recordings of human subjects who were shown a variety of human body configurations, both easy and difficult, together with their 're-enacted' 3D poses; (3) quantitative analysis revealing human performance in 3D pose re-enactment tasks, the degree of stability in the visual fixation patterns of human subjects, and the way it correlates with different poses; (4) extensive analysis of the differences between human re-enactments and poses produced by an automatic system when presented with the same visual stimuli; (5) an approach to learning perceptual metrics that, when integrated into visual sensing systems, produces more stable and meaningful results.
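As a rough illustration of what measuring a re-enacted pose against 3D ground truth can look like, the sketch below computes the mean per-joint position error, a distance commonly used in 3D pose estimation. The abstract does not say which error measure the study uses, so the metric, the function name `mean_per_joint_error`, and the three-joint poses are assumptions made for illustration only.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]  # (x, y, z) position of one joint


def mean_per_joint_error(pred: Sequence[Joint], truth: Sequence[Joint]) -> float:
    """Average Euclidean distance between corresponding 3D joints."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("poses must be non-empty and have the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


# Hypothetical 3-joint poses in millimetres (made-up values, not from the dataset).
ground_truth = [(0.0, 0.0, 0.0), (0.0, 500.0, 0.0), (300.0, 500.0, 0.0)]
reenacted = [(0.0, 0.0, 0.0), (0.0, 500.0, 40.0), (330.0, 500.0, 40.0)]

# Per-joint distances are 0, 40 and 50 mm, so the mean is 30 mm.
error_mm = mean_per_joint_error(reenacted, ground_truth)
print(f"mean per-joint error: {error_mm:.1f} mm")
```

A purely geometric distance like this treats all joints and directions equally, which is the kind of measure a learned perceptual metric would be compared against.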

Notes

  1. The term is unrelated to pictorial structures (Sapp et al. 2010; Fischler and Elschlager 1973), which are 2D tree-structured models for object detection.

  2. Asking people to re-enact a certain interaction between a person and the surrounding environment could provide further insight into how people learn and reproduce certain activities or manipulate objects, but this is beyond the scope of the current study.

References

  • Agarwal, A., & Triggs, B. (2006). Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 44–58.

  • Akhter, I., & Black, M. J. (2015). Pose-conditioned joint angle limits for 3d human pose reconstruction. In IEEE international conference on computer vision and pattern recognition.

  • Andriluka, M., Roth, S., & Schiele, B. (2010). Monocular 3D pose estimation and tracking by detection. In IEEE international conference on computer vision and pattern recognition.

  • Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In International conference on machine learning.

  • Bo, L., & Sminchisescu, C. (2009). Structured output-associative regression. In IEEE international conference on computer vision and pattern recognition.

  • Bo, L., & Sminchisescu, C. (2010). Twin gaussian processes for structured prediction. International Journal of Computer Vision, 87, 28–52.

  • Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In European conference on computer vision. http://www.eecs.berkeley.edu/~lbourdev/poselets.

  • Chen, C., Zhuang, Y., Xiao, J., & Liang, Z. (2009). Perceptual 3D pose distance estimation by boosting relational geometric features. Computer Animation and Virtual Worlds, 20, 267–277.

  • Chen, X., & Yuille, A. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in neural information processing systems (NIPS).

  • Cortes, C., Mohri, M., & Weston, J. (2005). A general regression technique for learning transductions. In International conference on machine learning (pp. 153–160).

  • Deutscher, J., Blake, A., & Reid, I. (2000). Articulated body motion capture by annealed particle filtering. In IEEE international conference on computer vision and pattern recognition.

  • Dickinson, S., & Metaxas, D. (1994). Integrating qualitative and quantitative shape recovery. In International journal of computer vision.

  • Ehinger, K. A., Hidalgo-Sotelo, B., Torralba, A., & Oliva, A. (2009). Modelling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition, 17, 945–978.

  • Fan, X., Zheng, K., Lin, Y., & Wang, S. (2015). Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In IEEE conference on computer vision and pattern recognition (CVPR 2015), Boston, MA, June 7–12.

  • Ferrari, V., Marin, M., & Zisserman, A. (2009). Pose search: retrieving people using their pose. In IEEE international conference on computer vision and pattern recognition.

  • Fischler, M. A., & Elschlager, R. A. (1973). The representation and matching of pictorial structures. IEEE Transactions on Computers, 22(1), 67–92. doi:10.1109/T-C.1973.223602.

  • Gall, J., Rosenhahn, B., Brox, T., & Seidel, H. (2010). Optimization and filtering for human motion capture: A multi-layer framework. International Journal of Computer Vision, 87, 75–92.

  • Harada, T., Taoka, S., Mori, T., & Sato, T. (2004). Quantitative evaluation method for pose and motion similarity based on human perception. International journal of humanoid robotics.

  • Huang, C. H., Boyer, E., & Ilic, S. (2013). Robust human body shape and pose tracking. In 3DV, International Conference on 3D Vision, 2013 (pp. 287–294). Seattle, United States. doi:10.1109/3DV.2013.45, https://hal.inria.fr/hal-00922934.

  • Ionescu, C., Carreira, J., & Sminchisescu, C. (2014a). Iterated second-order label sensitive pooling for 3D human pose estimation. In IEEE conference on computer vision and pattern recognition.

  • Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2014b). Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence.

  • Jain, A., Tompson, J., Andriluka, M., Taylor, G. W., & Bregler, C. (2014). Learning human pose estimation features with convolutional networks.

  • Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. In Perception and psychophysics.

  • Kanaujia, A., Sminchisescu, C., & Metaxas, D. (2007). Semi-supervised hierarchical models for 3d human pose reconstruction. In IEEE international conference on computer vision and pattern recognition.

  • Koenderink, J. (1998). Pictorial relief. Royal Society of London A: Mathematical, Physical and Engineering Sciences, 356, 1071–1086.

  • Lee, H. J., & Chen, Z. (1985). Determination of 3D human body postures from a single view. Computer Vision, Graphics and Image Processing, 30, 148–168.

  • Li, F., Lebanon, G., & Sminchisescu, C. (2012). Chebyshev approximations to the histogram \(\chi^2\) kernel. In IEEE international conference on computer vision and pattern recognition.

  • Li, S., & Chan, A. B. (2014). 3d human pose estimation from monocular images with deep convolutional neural network. In Computer Vision—ACCV 2014–12th Asian Conference on Computer Vision, Singapore, Singapore, November 1–5, Revised Selected Papers, Part II.

  • López-Méndez, A., Gall, J., Casas, J., & van Gool, L. (2012). Metric learning from poses for temporal clustering of human motion. In R. Bowden, J. Collomosse, K. Mikolajczyk (Eds.), British machine vision conference (BMVC) (pp. 49.1–49.12). BMVA Press.

  • Marinoiu, E., Papava, D., & Sminchisescu, C. (2013). Pictorial human spaces. How well do humans perceive a 3D articulated pose? In International conference on computer vision.

  • Mathe, S., & Sminchisescu, C. (2013). Action from still image dataset and inverse optimal control to learn task specific visual scanpaths. In Advances in neural information processing systems.

  • Müller, M., Röder, T., & Clausen, M. (2005). Efficient content-based retrieval of motion capture data. ACM Transactions on Graphics, 24, 677–685.

  • Pons-Moll, G., Fleet, D. J., & Rosenhahn, B. (2014). Posebits for monocular human pose estimation. In IEEE international conference on computer vision and pattern recognition.

  • Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, S. T. Roweis (Eds.), Advances in neural information processing systems, MIT Press. http://dblp.uni-trier.de/rec/bibtex/conf/nips/RahimiR07.

  • Rehg, J., Morris, D. D., & Kanade, T. (2003). Ambiguities in visual tracking of articulated objects using two- and three-dimensional models. International Journal of Robotics Research, 22(6), 393–418.

  • Sapp, B., Toshev, A., & Taskar, B. (2010). Cascaded models for articulated pose estimation. In European conference on computer vision.

  • Sekunova, A., Black, M., Parkinson, L., & Barton, J. (2013). Viewpoint and pose in body-form adaptation. Perception, 42(2), 176–186.

  • Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In IEEE international conference on computer vision and pattern recognition.

  • Sidenbladh, H., Black, M., & Fleet, D. (2000). Stochastic tracking of 3D human figures using 2D image motion. In European conference on computer vision.

  • Sigal, L., & Black, M. (2006). Predicting 3d people from 2d pictures. In AMDO.

  • Sigal, L., Balan, A., & Black, M. J. (2007). Combined discriminative and generative articulated pose and non-rigid shape estimation. In Advances in neural information processing systems.

  • Sigal, L., Fleet, D. J., Troje, N. F., & Livne, M. (2010a). Human attributes from 3D pose tracking. In European conference on computer vision.

  • Sigal, L., Memisevic, R., & Fleet, D. (2010b). Shared kernel information embedding for discriminative inference. In IEEE international conference on computer vision and pattern recognition.

  • Sminchisescu, C., & Jepson, A. (2004). Variational mixture smoothing for non-linear dynamical systems. In IEEE international conference on computer vision and pattern recognition (Vol 2), Washington, D.C.

  • Sminchisescu, C., & Triggs, B. (2003). Kinematic jump processes for monocular 3D human tracking. In IEEE international conference on computer vision and pattern recognition.

  • Sminchisescu, C., & Triggs, B. (2005). Mapping minima and transitions in visual models. International Journal of Computer Vision, 61(1), 227.

  • Sminchisescu, C., Kanaujia, A., & Metaxas, D. (2006). Learning joint top-down and bottom-up processes for 3D visual inference. In IEEE international conference on computer vision and pattern recognition.

  • Sun, M., Kohli, P., & Shotton, J. (2012). Conditional regression forests for human pose estimation. In IEEE international conference on computer vision and pattern recognition.

  • Tang, J. K. T., Leung, H., Komura, T., & Shum, H. P. H. (2008). Emulating human perception of motion similarity. Computer Animation and Virtual Worlds, 19(3–4), 211–221. doi:10.1002/cav.v19:3/4.

  • Tompson, J. J., Jain, A., LeCun, Y., & Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems 27: Annual conference on neural information processing systems 2014 (pp. 8–13). Montreal.

  • Toshev, A., & Szegedy, C. (2014). Deeppose: Human pose estimation via deep neural networks. In 2014 IEEE conference on computer vision and pattern recognition (CVPR 2014), Columbus, OH, USA, June 23–28.

  • Urtasun, R., Fleet, D., Hertzmann, A., & Fua, P. (2005). Priors for people tracking in small training sets. In IEEE international conference on computer vision.

  • Wolpert, D. M., Diedrichsen, J., & Flanagan, J. R. (2011). Principles of sensorimotor learning. Nature Reviews Neuroscience, 12(12), 739–751.

  • Yang, Y., & Ramanan, D. (2011). Articulated pose estimation using flexible mixture of parts. In IEEE international conference on computer vision and pattern recognition.

Acknowledgments

This work was supported in part by CNCS-UEFISCDI under PCE-2011-3-0438, and JRP-RO-FR-2014-16.

Corresponding author

Correspondence to Cristian Sminchisescu.

Additional information

Communicated by Deva Ramanan.

About this article

Cite this article

Marinoiu, E., Papava, D. & Sminchisescu, C. Pictorial Human Spaces: A Computational Study on the Human Perception of 3D Articulated Poses. Int J Comput Vis 119, 194–215 (2016). https://doi.org/10.1007/s11263-016-0888-3

  • DOI: https://doi.org/10.1007/s11263-016-0888-3
