A very simple framework for 3D human pose estimation using a single 2D image: Comparison of geometric moment descriptors
Introduction
One of the main objectives of smart environments is to enhance the inhabitants' quality of life. For this purpose, monitoring systems have to understand the needs and intentions of a human in order to adapt the environment, for example in terms of heating or lighting. Moreover, by monitoring the movements of a user, these systems could also alert the user or call for help in case of danger, or if a movement could lead to an injury such as a fall. Human action recognition systems therefore have many possible applications in surveillance, pedestrian tracking and human-machine interaction. Human pose estimation is a key step toward action recognition.
A human action is often represented as a succession of human poses [1]. As these poses can be 2D or 3D, estimating them has attracted a lot of attention. A 2D pose is usually represented by a set of joint locations [2], whose estimation remains challenging because of human body shape variability, viewpoint changes, etc. A 3D pose is usually represented by a skeleton model parameterized by joint locations [3] or by rotation angles [4]. Such a representation has the advantage of being viewpoint-invariant; however, estimating 3D poses from a single image remains a difficult problem, for several reasons. First, multiple 3D poses may share the same 2D reprojection, even if tracking approaches can resolve this ambiguity. Second, the 3D pose is inferred from detected 2D joint locations, so the reliability of the 2D pose is essential because it greatly affects skeleton estimation performance. In camera networks used in a video-surveillance context, image quality is often poor, making 2D joint detection a difficult task; moreover, camera parameters are unknown, making the 2D/3D correspondence difficult.
In this work, we propose a new framework for extracting 3D skeleton pose hypotheses from a single 2D image provided by a low-cost webcam. Our approach relies solely on silhouette shape recognition. A silhouette database is built from a 3D human pose and action simulator and is used to match the nearest silhouette and, as a result, the possible 3D human pose. Section 2 presents the state of the art in human pose estimation. Section 3 explains the methodology we apply to estimate the human pose from a single silhouette, as well as the 3D simulator used to build our training database. Section 4 provides the mathematical description of the geometric moments (and their parameters) used and compared for this application. Finally, Section 5 presents the results obtained on both our simulated and real databases.
Section snippets
Related works
Many methods in the state of the art deal with human pose estimation and action recognition. Nevertheless, these tasks remain challenging for the computer vision community. Human activity analysis started with O’Rourke and Badler [5] and Hogg [6] in the eighties. Over the last decades, scientists have proposed many approaches, which can be grouped into two main categories.
Most of the approaches use a 3D model or 3D detection for estimating the pose of a subject and for
Methodology
The proposed approach for 3D pose estimation is based on shape analysis of the human silhouette. The method can be decomposed into four parts: (1) simulated silhouette and skeleton database, (2) human detection and 2D silhouette extraction, (3) silhouette shape matching, and (4) skeleton scaling and validation.
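The four parts above can be sketched end to end as follows. This is a hypothetical, minimal illustration, not the authors' implementation: `extract_silhouette` stands in for a real background-subtraction stage, `describe` for whichever moment descriptor is chosen, and the database arrays for the simulator output built offline in stage (1).

```python
import numpy as np

def extract_silhouette(frame, background, thresh=25):
    """Stage 2 (placeholder): crude background subtraction yielding a
    binary silhouette from a grayscale frame."""
    return (np.abs(frame.astype(float) - background.astype(float)) > thresh).astype(np.uint8)

def estimate_pose(frame, background, db_feats, db_skeletons, describe):
    """Stages 2-3: extract the silhouette, describe its shape, and match it
    against the simulated database (stage 1, built offline). Stage 4
    (skeleton scaling and validation) would post-process the result."""
    sil = extract_silhouette(frame, background)
    feat = describe(sil)                                    # shape descriptor
    dists = np.linalg.norm(db_feats - feat, axis=1)         # match to database
    return db_skeletons[int(np.argmin(dists))]
```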
Shape descriptors
In order to describe the silhouettes, we processed and compared four well-known shape descriptors based on invariant and orthogonal moments. Such moments have proved to be good region-based descriptors in a multitude of machine learning applications and for content-based image retrieval [25], [26]. As we assume that the 3D pose is directly linked to the 2D shape of the silhouette, the main objective is to represent the shape accurately. An ideal descriptor for our pose recovery problem
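As an illustration of such region-based moment descriptors, here is a minimal NumPy sketch of Hu's seven invariant moments, one classical choice of invariant moments; this is an example of the family of descriptors discussed, not necessarily one of the four retained by the authors.

```python
import numpy as np

def hu_moments(silhouette):
    """Seven Hu invariant moments of a binary silhouette (2D array of 0/1).
    The result is invariant to translation, scale and rotation of the shape."""
    ys, xs = np.nonzero(silhouette)
    m00 = len(xs)                                   # shape area
    x, y = xs - xs.mean(), ys - ys.mean()           # centre on the centroid

    def eta(p, q):                                  # scale-normalised central moment
        return np.sum(x**p * y**q) / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11**2,
        (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2,
        (n30 + n12) ** 2 + (n21 + n03) ** 2,
        (n30 - 3 * n12) * (n30 + n12) * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
        + (3 * n21 - n03) * (n21 + n03) * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2),
        (n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
        + 4 * n11 * (n30 + n12) * (n21 + n03),
        (3 * n21 - n03) * (n30 + n12) * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
        - (n30 - 3 * n12) * (n21 + n03) * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2),
    ])
```

Because the coordinates are centred on the centroid and normalised by the area, translating or rescaling the silhouette leaves the descriptor unchanged, which is exactly the property needed when matching silhouettes of subjects at different positions and distances from the camera.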
Experimental studies
In Section 3.2 we showed that for each 2D silhouette image in the database, we store both the feature vector and the associated 3D skeleton composed of 19 joints. Then, for each test image with its extracted silhouette, the similarity is computed between the processed feature vector and the feature vectors stored in the database. For similarity computation, we compared different metrics (MSE, MAE, cosine) and finally chose the Euclidean distance, which had the best performance. Note
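The retrieval step described above amounts to a nearest-neighbour search under the Euclidean distance. A minimal sketch, assuming the database is held as a feature matrix and a parallel array of 19-joint skeletons (array names and shapes are illustrative):

```python
import numpy as np

def match_pose(query_feat, db_feats, db_skeletons):
    """Return the stored 3D skeleton whose silhouette descriptor is the
    Euclidean nearest neighbour of the query descriptor.

    db_feats:     (N, D) matrix of moment descriptors, one per silhouette.
    db_skeletons: (N, 19, 3) array of the associated 19-joint 3D skeletons.
    """
    dists = np.linalg.norm(db_feats - query_feat, axis=1)   # Euclidean distance
    best = int(np.argmin(dists))
    return db_skeletons[best], dists[best]
```

For a large simulated database, the same search could be accelerated with a k-d tree or similar index, since only the single nearest descriptor is needed.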
Conclusions
In this paper, we presented a very simple framework for 3D human pose estimation from a single image. In particular, we considered a scenario where the environment is equipped with a simple low-cost passive camera, without the need for any depth information or field-of-view intersection. The main novelties of the approach are the use of open-source software such as Blender and MakeHuman to easily generate the learning database, and the proof that orthogonal moments are able to encode the
Dieudonné Fabrice Atrevi received a Master of Science degree in computer vision from the International Institute of Francophonie (IFI) based in Hanoï, Vietnam, in 2015. He is currently a PhD student at Orleans University. His research interests include image processing and machine learning for scene understanding.
References (34)
Model-based vision: a program to see a walking person, Image Vis. Comput. (1983)
A survey of advances in vision-based human motion capture and analysis, Comput. Vision Image Understanding (2006)
An approach to pose-based action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
Articulated pose estimation with flexible mixtures-of-parts, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
Reconstruction of articulated objects from point correspondences in a single uncalibrated image, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2000)
Human pose tracking in monocular sequence using multilevel structured models, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
Model-based image analysis of human motion using constraint propagation, IEEE Trans. Pattern Anal. Mach. Intell. (1980)
Poselets: body part detectors trained using 3d human pose annotations, IEEE 12th International Conference on Computer Vision (2009)
Modeling 3d human poses from uncalibrated monocular images, IEEE 12th International Conference on Computer Vision (2009)
Deterministic 3d human pose estimation using rigid structure, Computer Vision – ECCV 2010 (2010)
Real-time physical modelling of character movements with microsoft kinect, Symposium on Virtual Reality Software and Technology (VRST 12)
Human activity recognition process using 3-d posture data, IEEE Trans. Hum. Mach. Syst.
Dense depth maps-based human pose tracking and recognition in dynamic scenes using ridge data, International Conference on Advanced Video and Signal Based Surveillance (AVSS)
Gaussian mixture based hmm for human daily activity recognition using 3d skeleton features, Conference on Industrial Electronics and Applications (ICIEA)
Performance animation from low-dimensional control signals, ACM Trans. Graph.
Improving posture classification accuracy for depth sensor-based human activity monitoring in smart environments, Comput. Vision Image Understanding
People watching: human actions as a cue for single-view geometry, Proceedings of the 12th European Conference on Computer Vision (2012)
Damien Vivet received his PhD in 2011 from Blaise Pascal University on the subject "Radar-based localization and mapping in a dynamic environment". He is currently a researcher at ISAE-Supaero in the DEOS/SCAN department in Toulouse. His main research field deals with multimodal perception and scene understanding for navigation tasks.
Florent Duculty received his PhD in 2003 from Blaise Pascal University in computer vision applied to automatic guidance. He is currently an assistant professor at Orleans University in the PRISME laboratory, Automatism team, working on automatic control.
Bruno Emile received his PhD in 1996 from Nice University. He obtained his HDR habilitation in 2010 at Orleans University on the subject "Contribution to scene interpretation: application to the development of operational systems". He is an associate professor in the Image and Vision team of the PRISME Laboratory. His research interests include image processing and machine learning for human behavior understanding.