Abstract
While research on articulated human motion and pose estimation has progressed rapidly in the last few years, there has been no systematic quantitative evaluation of competing methods to establish the current state of the art. We present data obtained using a hardware system that is able to capture synchronized video and ground-truth 3D motion. The resulting HumanEva datasets contain multiple subjects performing a set of predefined actions with a number of repetitions. On the order of 40,000 frames of synchronized motion capture and multi-view video (resulting in over one quarter million image frames in total) were collected at 60 Hz with an additional 37,000 time instants of pure motion capture data. A standard set of error measures is defined for evaluating both 2D and 3D pose estimation and tracking algorithms. We also describe a baseline algorithm for 3D articulated tracking that uses a relatively standard Bayesian framework with optimization in the form of Sequential Importance Resampling and Annealed Particle Filtering. In the context of this baseline algorithm we explore a variety of likelihood functions, prior models of human motion and the effects of algorithm parameters. Our experiments suggest that image observation models and motion priors play important roles in performance, and that in a multi-view laboratory environment, where initialization is available, Bayesian filtering tends to perform well. The datasets and the software are made available to the research community. This infrastructure will support the development of new articulated motion and pose estimation algorithms, will provide a baseline for the evaluation and comparison of new methods, and will help establish the current state of the art in human pose estimation and tracking.
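As a rough illustration of two ideas named in the abstract, the sketch below shows one Sequential Importance Resampling step and a simple 3D error measure. It is a minimal toy under stated assumptions, not the baseline algorithm or the error measures defined in the article: the random-walk motion prior, the isotropic Gaussian likelihood, the function names `sir_step` and `mean_joint_error`, and the parameters `motion_std` and `obs_std` are all hypothetical choices made for this example. The annealing stages of Annealed Particle Filtering are not shown.

```python
import numpy as np


def mean_joint_error(estimated, ground_truth):
    """Average Euclidean distance between corresponding 3D points.

    Assumed form of an error measure; the result has the units of the inputs.
    """
    estimated = np.asarray(estimated, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.linalg.norm(estimated - ground_truth, axis=-1).mean())


def sir_step(particles, observation, rng, motion_std=0.1, obs_std=0.5):
    """One Sequential Importance Resampling step for a toy Gaussian model."""
    n = particles.shape[0]
    # Predict: diffuse particles with a random-walk motion prior (assumption).
    predicted = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Weight: isotropic Gaussian likelihood of the observation (assumption).
    sq_dist = np.sum((predicted - observation) ** 2, axis=1)
    log_w = -0.5 * sq_dist / obs_std**2
    weights = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    weights /= weights.sum()
    # Resample: draw particle indices in proportion to their weights.
    idx = rng.choice(n, size=n, p=weights)
    return predicted[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = np.array([1.0, 1.0, 1.0])       # hypothetical 3D state
    particles = rng.normal(0.0, 1.0, size=(200, 3))
    for _ in range(20):
        particles = sir_step(particles, truth, rng)
    print(mean_joint_error([particles.mean(axis=0)], [truth]))
```

In this toy setting the particle mean drifts toward the observed state over repeated steps. A real tracker would replace the state with a vector of joint angles and the likelihood with an image-based observation model.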
Additional information
This project was supported in part by gifts from Honda Research Institute and Intel Corporation. Funding for portions of this work was also provided by NSF grants IIS-0534858 and IIS-0535075. We would like to thank Ming-Hsuan Yang, Rui Li, Payman Yadollahpour and Stefan Roth for help in data collection and post-processing. We also would like to thank Stan Sclaroff for making the color video capture equipment available for this effort.
The first two authors contributed equally to this work.
The work of L. Sigal was conducted at Brown University.
Cite this article
Sigal, L., Balan, A.O. & Black, M.J. HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion. Int J Comput Vis 87, 4–27 (2010). https://doi.org/10.1007/s11263-009-0273-6
DOI: https://doi.org/10.1007/s11263-009-0273-6