
Pattern Recognition

Volume 71, November 2017, Pages 389-401

A very simple framework for 3D human poses estimation using a single 2D image: Comparison of geometric moments descriptors

https://doi.org/10.1016/j.patcog.2017.06.024

Highlights

  • 3D human pose estimation from a single image is an important problem.

  • We use geometric moments to analyse the human silhouette extracted from a single image and compare different geometric moments (Krawtchouk, Hahn, Zernike and Hu).

  • We show that a very simple framework is able to extract the 3D posture of a human from a single 2D image in real time.

  • We generate the learning dataset with Blender, a publicly available open-source software package, using motion capture data.

Abstract

In this paper, we propose a framework to automatically extract the 3D pose of an individual from a single silhouette image obtained with a classical low-cost camera, without any depth information. By pose, we mean the configuration of the human bones needed to reconstruct a 3D skeleton representing the 3D posture of the detected human. Our approach relies on prior learned correspondences between silhouettes and skeletons extracted from simulated 3D human models publicly available on the internet. The main advantages of such an approach are that silhouettes can be easily extracted from video, and that 3D human models can be animated with motion capture data in order to quickly build training data for any movement. To match detected silhouettes with simulated silhouettes, we compare geometric invariant moments. Our results show that the proposed method is very promising while requiring very low processing time.

Introduction

One of the main objectives of smart environments is to enhance the quality of life of their inhabitants. For this purpose, monitoring systems have to understand the needs and intentions of a human in order to adapt the environment, for example in terms of heating or lighting. Moreover, by monitoring the movements of a user, these systems could also alert the user or ask for help in case of danger, or if a movement could lead to an injury such as a fall. Human action recognition systems therefore have many possible applications in surveillance, pedestrian tracking and human-machine interaction. Human pose estimation is a key step towards action recognition.

A human action is often represented as a succession of human poses [1]. These poses can be 2D or 3D, and estimating them has attracted a lot of attention. A 2D pose is usually represented by a set of joint locations [2], whose estimation remains challenging because of human body shape variability, viewpoint changes, etc. A 3D pose is usually represented by a skeleton model parameterized by joint locations [3] or by rotation angles [4]. Such a representation has the advantage of being viewpoint-invariant; however, estimating 3D poses from a single image remains a difficult problem, for several reasons. First, multiple 3D poses may share the same 2D pose reprojection, even if tracking approaches can resolve this ambiguity. Second, the 3D pose is inferred from detected 2D joint locations, so 2D pose reliability is essential because it greatly affects skeleton estimation performance. In camera networks used in a video-surveillance context, image quality is often poor, making 2D joint detection difficult; moreover, camera parameters are unknown, making the 2D/3D correspondence difficult.
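To make these two parameterizations concrete, the sketch below shows, in Python, how a 3D pose could be stored either as absolute joint locations or as per-bone rotation angles. The 19-joint count echoes the skeletons used later in this paper; the field names and array shapes are illustrative assumptions, not the authors' data format.

    # Minimal sketch of the two common 3D pose parameterizations (assumed layout).
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class JointSkeleton:
        """3D pose as absolute joint locations."""
        joints: np.ndarray          # shape (19, 3): one (x, y, z) per joint

    @dataclass
    class AngleSkeleton:
        """3D pose as rotation angles over a fixed bone hierarchy."""
        root_position: np.ndarray   # shape (3,): global translation of the root joint
        bone_rotations: np.ndarray  # shape (n_bones, 3): e.g. Euler angles in radians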

In this work, we propose a new framework for extracting 3D skeleton pose hypotheses from a single 2D image provided by a low-cost webcam. Our approach focuses solely on silhouette shape recognition. A silhouette database is constructed with a 3D human pose and action simulator and is used to match the nearest silhouette and, as a result, the possible 3D human pose. Section 2 presents the state of the art in human pose estimation. Section 3 explains the methodology we apply to estimate the human pose from a single silhouette, as well as the 3D simulator used to build our training database. Section 4 provides the mathematical description of the geometric moments (and their parameters) used and compared for this application. Finally, Section 5 presents the results obtained on both our simulated and real databases.

Section snippets

Related works

Many methods in the state of the art deal with human pose estimation and action recognition. Nevertheless, these tasks are still challenging for the computer vision community. Human activity analysis started with O’Rourke and Badler [5] and Hogg [6] in the eighties. Over the last decades, scientists have proposed many approaches, which can be grouped into two main categories.

Most of the approaches use a 3D model or 3D detection for estimating the pose of a subject and for

Methodology

The proposed approach for 3D pose estimation is based on shape analysis of the human silhouette. The method can be decomposed into four parts: (1) simulated silhouette and skeleton database, (2) human detection and 2D silhouette extraction, (3) silhouette shape matching, (4) skeleton scaling and validation.
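The sketch below outlines these four parts as plain Python functions. The function names, the background-subtraction stand-in for human detection, and the database layout are illustrative assumptions rather than the authors' implementation.

    # Minimal sketch of the four-step pipeline (assumed function names and data layout).
    import numpy as np

    def build_database(simulated_silhouettes, skeletons, describe):
        """(1) Store one (feature vector, 3D skeleton) pair per simulated pose."""
        return [(describe(s), k) for s, k in zip(simulated_silhouettes, skeletons)]

    def extract_silhouette(frame, background, threshold=30):
        """(2) Human detection / 2D silhouette extraction; a simple background
        subtraction stands in here for the actual detector."""
        diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
        return (diff.sum(axis=-1) > threshold).astype(np.uint8)

    def match_pose(silhouette, database, describe):
        """(3) Silhouette shape matching: return the skeleton of the nearest feature vector."""
        q = describe(silhouette)
        dists = [np.linalg.norm(q - feat) for feat, _ in database]
        return database[int(np.argmin(dists))][1]

    def scale_skeleton(skeleton, silhouette_height_px, reference_height_px):
        """(4) Skeleton scaling against the observed silhouette size (validation omitted)."""
        return skeleton * (silhouette_height_px / reference_height_px)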

Shape descriptors

In order to describe the silhouettes, we processed and compared four well-known shape descriptors based on invariant and orthogonal moments. Such moments have proved to be good region-based descriptors in a multitude of machine learning applications and for content-based image retrieval [25], [26]. As we assume that the 3D pose is directly linked to the 2D shape of the silhouette, the main objective is to represent the shape accurately. An ideal descriptor for our pose recovery problem
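As an illustration of one of the four compared descriptors, the sketch below computes Hu's seven invariant moments from a binary silhouette with OpenCV; the Zernike, Krawtchouk and Hahn descriptors would replace this function with their own expansions. The log-scaling step is a common normalization practice and is an assumption, not necessarily the normalization used in the paper.

    # Minimal sketch: Hu-moment descriptor of a binary silhouette (OpenCV).
    import cv2
    import numpy as np

    def hu_descriptor(silhouette):
        """Return a 7-D feature vector of Hu invariant moments for a binary mask."""
        m = cv2.moments(silhouette, binaryImage=True)
        hu = cv2.HuMoments(m).flatten()
        # Log-scale to compress the dynamic range while preserving the sign.
        return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)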

Experimental studies

In Section 3.2 we showed that, for each 2D silhouette image in the database, we store both the feature vector and the associated 3D skeleton composed of 19 joints. Then, for each test image with its extracted silhouette, the similarity is computed between the processed feature vector and the stored feature vectors in the database. For similarity computation, we compared different metrics (MSE, MAE, cosine) and finally chose the Euclidean distance, which had the best performance. Note
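The nearest-neighbour retrieval described above can be sketched as follows; the array names and shapes (N stored poses, D-dimensional features, 19 joints) are assumptions for illustration.

    # Minimal sketch: retrieve the stored skeleton whose feature vector is closest
    # to the query, using the Euclidean distance selected in the paper.
    import numpy as np

    def retrieve_skeleton(query_feat, db_feats, db_skeletons):
        """query_feat: (D,); db_feats: (N, D); db_skeletons: (N, 19, 3)."""
        dists = np.linalg.norm(db_feats - query_feat, axis=1)  # Euclidean distances
        best = int(np.argmin(dists))
        return db_skeletons[best], dists[best]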

Conclusions

In this paper, we presented a very simple framework for 3D human pose estimation from a single image. In particular, we considered a scenario where the environment is equipped with a simple low-cost passive camera, without the need for any depth information or field-of-view intersection. The main novelties of the approach are the use of open-source software such as Blender and MakeHuman in order to easily generate the learning database, and the proof that orthogonal moments are able to encode the


References (34)

  • D. Hogg, Model-based vision: a program to see a walking person, Image Vis. Comput. (1983)
  • T.B. Moeslund et al., A survey of advances in vision-based human motion capture and analysis, Comput. Vision Image Understanding (2006)
  • C. Wang et al., An approach to pose-based action recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
  • Y. Yang et al., Articulated pose estimation with flexible mixtures-of-parts, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
  • C. Taylor, Reconstruction of articulated objects from point correspondences in a single uncalibrated image, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2000)
  • M.W. Lee et al., Human pose tracking in monocular sequence using multilevel structured models, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
  • J. O’Rourke et al., Model-based image analysis of human motion using constraint propagation, IEEE Trans. Pattern Anal. Mach. Intell. (1980)
  • L. Bourdev et al., Poselets: body part detectors trained using 3D human pose annotations, IEEE 12th International Conference on Computer Vision (2009)
  • X.K. Wei et al., Modeling 3D human poses from uncalibrated monocular images, IEEE 12th International Conference on Computer Vision (2009)
  • J. Valmadre et al., Deterministic 3D human pose estimation using rigid structure, Computer Vision – ECCV 2010 (2010)
  • E.H.H. Shum, Real-time physical modelling of character movements with Microsoft Kinect, Symposium on Virtual Reality Software and Technology (VRST 12) (2012)
  • S. Gaglio et al., Human activity recognition process using 3-D posture data, IEEE Trans. Hum. Mach. Syst. (2015)
  • Y.K.A. Jalal, Dense depth maps-based human pose tracking and recognition in dynamic scenes using ridge data, International Conference on Advanced Video and Signal Based Surveillance (AVSS) (2014)
  • S.K.L. Piyathilaka, Gaussian mixture based HMM for human daily activity recognition using 3D skeleton features, Conference on Industrial Electronics and Applications (ICIEA) (2013)
  • J.H.J. Chai, Performance animation from low-dimensional control signals, ACM Trans. Graph. (2005)
  • E.S. Ho et al., Improving posture classification accuracy for depth sensor-based human activity monitoring in smart environments, Comput. Vision Image Understanding (2016)
  • D.F. Fouhey et al., People watching: human actions as a cue for single-view geometry, Proceedings of the 12th European Conference on Computer Vision (2012)

Dieudonné Fabrice Atrevi received a Master of Science degree in computer vision from the International Institute of Francophonie (IFI) in Hanoï, Vietnam, in 2015. He is currently a PhD student at Orleans University. His research interests include image processing and machine learning for scene understanding.

Damien Vivet received his PhD in 2011 from Blaise Pascal University on the subject "Radar-based localization and mapping in a dynamic environment". He is currently a researcher at ISAE-Supaero in the DEOS/SCAN department in Toulouse. His main research field is multimodal perception and scene understanding for navigation tasks.

Florent Duculty received his PhD in 2003 from Blaise Pascal University in computer vision applied to automatic guidance. He is currently an assistant professor at Orleans University in the PRISME laboratory, Automatic Control team, working on automation and control.

Bruno Emile received his PhD in 1996 from Nice University. He obtained his HDR habilitation in 2010 at Orleans University on the subject "Contribution to scene interpretation: application to the development of operational systems". He is an associate professor in the Image and Vision team of the PRISME Laboratory. His research interests include image processing and machine learning for human behavior understanding.
