Space-time representation of people based on 3D skeletal data: A review

https://doi.org/10.1016/j.cviu.2017.01.011

Highlights

  • First survey dedicated to human representations based on 3D skeleton data.

  • Our survey is comprehensive and covers the most recent and advanced approaches.

  • An insightful categorization and analysis of the 3D skeleton-based representations is provided.

Abstract

Spatiotemporal human representation based on 3D visual perception data is a rapidly growing research area. Representations can be broadly categorized into two groups, depending on whether they use RGB-D information or 3D skeleton data. Recently, skeleton-based human representations have been intensively studied and continue to attract increasing attention, owing to their robustness to variations in viewpoint, human body scale, and motion speed, as well as their real-time, online performance. This paper presents a comprehensive survey of existing space-time representations of people based on 3D skeletal data, and provides an informative categorization and analysis of these methods from multiple perspectives, including information modality, representation encoding, structure and transition, and feature engineering. We also provide a brief overview of skeleton acquisition devices and construction methods, list a number of benchmark datasets with skeleton data, and discuss potential future research directions.

Introduction

Human representation in spatiotemporal space is a fundamental research problem extensively investigated in computer vision and machine intelligence over the past few decades. The objective of building human representations is to extract compact, descriptive information (i.e., features) to encode and characterize a human’s attributes from perception data (e.g., human shape, pose, and motion), when developing recognition or other human-centered reasoning systems. As an integral component of reasoning systems, approaches to construct human representations have been widely used in a variety of real-world applications, including video analysis (Ge et al., 2012), surveillance (Jun et al., 2013), robotics (Demircan et al., 2015), human-machine interaction (Han et al., 2017a), augmented and virtual reality (Green et al., 2008), assistive living (Okada et al., 2005), smart homes (Brdiczka et al., 2009), education (Mondada et al., 2009), and many others (Broadbent, Stafford, MacDonald, 2009, Ding, Fan, 2016, Fujita, 2000, Kang, Freedman, Matarić, Cunningham, Lopez, 2005).

During recent years, human representations based on 3D perception data have been attracting an increasing amount of attention (Kviatkovsky, Rivlin, Shimshoni, 2014, Li, Yu, Wu, Su, Ji, 2015, Siddharth, Barbu, Siskind, 2014, Vieira, Nascimento, Oliveira, Liu, Campos, 2012). Compared with 2D visual data, the additional depth information provides several advantages. Depth images provide geometric information about pixels that encodes the external surface of the scene in 3D space. Features extracted from depth images and 3D point clouds are robust to variations of illumination, scale, and rotation (Aggarwal, Xia, 2014, Han, Shao, Xu, Shotton, 2013). Thanks to the emergence of affordable structured-light color-depth sensing technology, such as the Microsoft Kinect (2012) and ASUS Xtion PRO LIVE (2011) RGB-D cameras, it is much easier and cheaper to obtain depth data. In addition, structured-light cameras enable us to retrieve 3D human skeletal information in real time (Shotton et al., 2011a), which used to be possible only with expensive and complex vision systems (e.g., motion capture systems (Tobon, 2010)), thereby significantly popularizing skeleton-based human representations. Moreover, the vast increase in computational power allows researchers to develop advanced computational algorithms (e.g., deep learning (Du et al., 2015)) to process visual data at an acceptable speed. These advancements have contributed to the boom of utilizing 3D perception data to construct reasoning systems in the computer vision and machine learning communities.

Since the performance of machine learning and reasoning methods heavily relies on the design of data representation (Bengio et al., 2013), human representations are intensively investigated to address human-centered research problems (e.g., human detection, tracking, pose estimation, and action recognition). Among the large number of human representation approaches (Bălan, Sigal, Black, Davis, Haussecker, 2007, Belagiannis, Amin, Andriluka, Schiele, Navab, Ilic, 2014, Burenius, Sullivan, Carlsson, 2013, Ganapathi, Plagemann, Koller, Thrun, 2010, Rahmani, Mahmood, Huynh, Mian, 2014a, Wang, Liu, Chorowski, Chen, Wu, 2012a), most existing 3D-based methods can be broadly grouped into two categories: representations based on local features (Le, Zou, Yeung, Ng, 2011, Zhang, Parker, 2011) and skeleton-based representations (Han, Yang, Reardon, Zhang, Zhang, 2017b, Sun, Wei, Liang, Tang, Sun, 2015, Tang, Chang, Tejani, Kim, 2014, Xu, Cheng, 2013). Methods based on local features detect points of interest in the space-time dimensions, describe the patches centered at those points as features, and encode them (e.g., using bag-of-words models) into representations; they can locate salient regions and are relatively robust to partial occlusion. However, methods based on local features ignore spatial relationships among the features. Because these approaches often cannot determine which features belong to which person, they are generally incapable of representing multiple individuals in the same scene. These methods are also computationally expensive because of the complexity of keypoint detection, feature description, dictionary construction, and related procedures.

On the other hand, human representations based on 3D skeleton information provide a very promising alternative. The concept of skeleton-based representation can be traced back to the early seminal research of Johansson (1973), which demonstrated that a small number of joint positions can effectively represent human behaviors. 3D skeleton-based representations have demonstrated promising performance in real-world applications, including Kinect-based gaming, as well as in computer vision research (Du, Wang, Wang, 2015, Yao, Gall, Fanelli, Gool, 2011). 3D skeleton-based representations are able to model the relationships of human joints and encode the whole-body configuration. They are also robust to scale and illumination changes, and can be invariant to camera view as well as human body rotation and motion speed. In addition, many skeleton-based representations can be computed at a high frame rate, which can significantly facilitate online, real-time applications. Given the advantages and previous success of 3D skeleton-based representations, we have witnessed a significant increase in new techniques to construct such representations in recent years, as demonstrated in Fig. 1, which underscores the need for a survey dedicated to reviewing 3D skeleton-based human representations.

Several survey papers have been published in related research areas such as motion and activity recognition. For example, Han et al. (2013) described the Kinect sensor and its general application in computer vision and machine intelligence. Aggarwal and Xia (2014) recently published a review paper on human activity recognition from 3D visual data, which summarized five categories of representations based on 3D silhouettes, skeletal joints or body part locations, local spatio-temporal features, scene flow features, and local occupancy features. Several earlier surveys were also published to review methods to recognize human poses, motions, gestures, and activities (Aggarwal, Ryoo, 2011, Borges, Conci, Cavallaro, 2013, Chen, Wei, Ferryman, 2013, Ji, Liu, 2010, Ke, Thuc, Lee, Hwang, Yoo, Choi, 2013, LaViola, 2013, Lun, Zhao, 2015, Moeslund, Granum, 2001, Moeslund, Hilton, Krüger, 2006, Poppe, 2010, Ruffieux, Lalanne, Mugellini, Khaled, 2014, Ye, Zhang, Wang, Zhu, Yang, Gall, 2013), as well as their applications (Chaaraoui, Climent-Pérez, Flórez-Revuelta, 2012, Zhou, Hu, 2008). However, none of these survey papers specifically focused on 3D human representation based on skeletal data, which has been the subject of numerous research papers and continues to gain popularity in recent years.

The objective of this survey is to provide a comprehensive overview of 3D skeleton-based human representations, mainly published in the computer vision and machine intelligence communities, which are built upon 3D human skeleton data treated as raw measurements obtained directly from sensing hardware. We categorize and compare the reviewed approaches from multiple perspectives, including information modality, representation encoding, structure and transition, and feature engineering methodology, and analyze the pros and cons of each category. Compared with the existing surveys, the main contributions of this review include:

  • To the best of our knowledge, this is the first survey dedicated to human representations based on 3D skeleton data, which fills the current void in the literature.

  • The survey is comprehensive and covers the most recent and advanced approaches. We review 171 3D skeleton-based human representations, including 150 papers published in the past five years, thereby providing readers with a complete view of the state-of-the-art methods.

  • This paper provides an insightful categorization and analysis of the 3D skeleton-based representation construction approaches from multiple perspectives, and summarizes and compares attributes of all reviewed representations.

In addition, we provide a complete list of available benchmark datasets. Although we also provide a brief overview of human modeling methods to generate skeleton data through pose recognition and joint estimation (Akhter, Black, 2015, Lehrmann, Gehler, Nowozin, 2013, Pons-Moll, Fleet, Rosenhahn, 2014, Zhou, De la Torre, 2014), the purpose is to provide related background information. Skeleton construction, which is widely studied in fields such as computer vision, computer graphics, human-computer interaction, and animation, is not the focus of this paper. The main application domains of interest in this survey are human gesture, action, and activity recognition, as most of the reviewed papers focus on these applications. Although several skeleton-based representations are also used for human re-identification (Giachetti, Fornasa, Parezzan, Saletti, Zambaldo, Zanini, Achilles, Ichim, Tombari, Navab, et al., 2016, Munaro, Fossati, Basso, Menegatti, Van Gool, 2014), skeleton-based features in that application are usually used along with other shape- or texture-based features (e.g., from 3D point clouds), because skeleton-based features are generally incapable of representing human appearance, which is critical for re-identification.

The remainder of this review is structured as follows. Background information including 3D skeleton acquisition and construction as well as public benchmark datasets is presented in Section 2. Sections 3–6 discuss the categorization of 3D skeleton-based human representations from four perspectives, including information modality in Section 3, encoding in Section 4, hierarchy and transition in Section 5, and feature construction methodology in Section 6. After discussing the advantages of skeleton-based representations and pointing out future research directions in Section 7, the review paper is concluded in Section 8.

Background

The objective of building 3D skeleton-based human representations is to extract compact, discriminative descriptions to characterize a human’s attributes from 3D human skeletal information. The 3D skeleton data encodes the human body as an articulated system of rigid segments connected by joints. This section discusses how 3D skeletal data can be acquired, including devices that directly provide the skeletal data and computational methods to construct the skeleton. Available benchmark datasets containing skeleton data are also summarized.
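As a concrete illustration of the articulated-system view, a skeleton frame can be stored as an array of 3D joint positions together with a parent index per joint, from which the rigid segments (bones) follow directly. The joint list and parent array below are hypothetical (a simplified, Kinect-style layout), not taken from any particular sensor SDK.

```python
import numpy as np

# Illustrative 9-joint layout; real skeletons (e.g., Kinect v1's 20 joints)
# differ, but the articulated structure is the same idea.
JOINTS = ["hip_center", "spine", "head", "shoulder_l", "elbow_l",
          "hand_l", "shoulder_r", "elbow_r", "hand_r"]
PARENT = [-1, 0, 1, 1, 3, 4, 1, 6, 7]  # each joint's parent index (-1 = root)

def bone_vectors(frame):
    """Rigid-segment (bone) vectors from a (J, 3) array of joint positions."""
    frame = np.asarray(frame, dtype=float)
    return np.array([frame[j] - frame[PARENT[j]]
                     for j in range(len(PARENT)) if PARENT[j] >= 0])

frame = np.random.rand(len(JOINTS), 3)   # stand-in for one frame of sensor output
bones = bone_vectors(frame)              # shape (J-1, 3): one vector per segment
```

Representations built on such data then operate on the joint positions, the bone vectors, or quantities derived from them.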

Information modality

Skeleton-based human representations are constructed from various features computed from raw 3D skeletal data, which may be acquired from various sensing technologies. We define each type of skeleton-based feature extracted from an individual sensing technique as a modality. From the perspective of information modality, 3D skeleton-based human representations can be classified into four categories, based on joint displacement, orientation, raw position, and combined information.
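As a sketch of the displacement-based category, features can be computed by expressing each joint relative to a reference joint within a frame, or by differencing consecutive frames as a simple motion cue. The array shapes and the choice of joint 0 (assumed to be the root/hip) as the reference are illustrative assumptions, not a prescription from any specific reviewed paper.

```python
import numpy as np

def displacement_features(seq, ref_joint=0):
    """Spatial displacement: every joint expressed relative to a reference
    joint (here joint 0, assumed root/hip) in each frame of shape (T, J, 3)."""
    seq = np.asarray(seq, dtype=float)
    return seq - seq[:, ref_joint:ref_joint + 1, :]   # broadcasts over joints

def temporal_displacement(seq):
    """Temporal displacement: frame-to-frame motion of every joint."""
    seq = np.asarray(seq, dtype=float)
    return seq[1:] - seq[:-1]                          # shape (T-1, J, 3)

seq = np.random.rand(30, 15, 3)          # 30 frames, 15 joints (illustrative)
spatial = displacement_features(seq)     # (30, 15, 3); root joint maps to origin
motion = temporal_displacement(seq)      # (29, 15, 3)
```

Orientation-based features would instead compute angles or rotations between the resulting bone vectors, and raw-position methods feed the (normalized) coordinates in directly.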

Representation encoding

Feature encoding is a necessary and important component of representation construction (Huang et al., 2014c), which aims to integrate all extracted features into a final feature vector that can be used as the input to classifiers or other reasoning systems. In the scenario of 3D skeleton-based representation construction, the encoding methods can be broadly grouped into three classes: concatenation-based encoding, statistics-based encoding, and bag-of-words encoding.
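The first two classes can be sketched as follows; the particular statistics chosen below are illustrative, not a canonical recipe. Concatenation-based encoding preserves temporal order by stacking per-frame feature vectors, whereas statistics-based encoding discards order and summarizes each feature dimension into a fixed-length vector.

```python
import numpy as np

def concatenation_encoding(per_frame_features):
    """Stack per-frame feature vectors into one long vector (order-preserving;
    sequences must share a common length or be resampled beforehand)."""
    return np.concatenate([np.ravel(f) for f in per_frame_features])

def statistics_encoding(per_frame_features):
    """Summarize the sequence with per-dimension statistics (order-free;
    the output length is fixed regardless of sequence length)."""
    f = np.stack([np.ravel(x) for x in per_frame_features])    # (T, D)
    return np.concatenate([f.mean(0), f.std(0), f.min(0), f.max(0)])

frames = [np.random.rand(15, 3) for _ in range(30)]  # 30 frames, 45-D each
v1 = concatenation_encoding(frames)                  # length 30 * 45
v2 = statistics_encoding(frames)                     # length 4 * 45
```

Bag-of-words encoding, the third class, would instead quantize the per-frame features against a learned codebook (e.g., k-means centroids) and histogram the assignments.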

Structure and topological transition

While most skeleton-based 3D human representations are based on pure low-level features extracted from the skeleton data in 3D Euclidean space, several works have studied mid-level features or feature transition to other topological spaces. This section categorizes the reviewed approaches from the structure and transition perspective into three groups: representations using low-level features in Euclidean space, representations using mid-level features based on human body parts, and manifold-based representations.
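A minimal example of moving from low-level joint features to mid-level body-part features is to pool joints by part; the part assignment below over a 15-joint layout is hypothetical, whereas actual representations use the groupings defined by their skeleton model.

```python
import numpy as np

# Hypothetical grouping of joint indices into five body parts.
PARTS = {"torso": [0, 1, 2], "left_arm": [3, 4, 5], "right_arm": [6, 7, 8],
         "left_leg": [9, 10, 11], "right_leg": [12, 13, 14]}

def part_centroids(frame):
    """Mid-level descriptor: the centroid of each body part's joints,
    computed from a (15, 3) array of joint positions."""
    frame = np.asarray(frame, dtype=float)
    return {name: frame[idx].mean(axis=0) for name, idx in PARTS.items()}

centroids = part_centroids(np.random.rand(15, 3))  # five 3-D part descriptors
```

Manifold-based approaches go further still, mapping such features out of Euclidean space entirely (e.g., onto Riemannian manifolds of shape trajectories, as in Devanne et al., 2015).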

Feature engineering

Feature engineering is one of the most fundamental research problems in computer vision and machine learning research. Early feature engineering techniques for human representation construction were manual; features were hand-crafted and their importance was manually decided. In recent years, we have been witnessing a clear transition from manual feature engineering to automated feature learning and extraction. In this section, we categorize and analyze human representations based on 3D skeleton

Performance analysis of the current state of the art

In this section, we compare the accuracy and efficiency of different approaches using several of the most widely used datasets, including MSR Action3D, CAD-60, MSRC-12, and HDM05, which cover both structured-light sensors (Kinect v1) and motion capture systems. Performance is evaluated using the precision metric, since almost all existing approaches report precision results. The detailed comparison of different approaches is presented in Table 7.

From Table 7, it is observed that there is

Conclusion

This paper presents a unique and comprehensive survey of the state-of-the-art space-time human representations based on 3D skeleton data, which is now widely available. We provide a brief overview of existing 3D skeleton acquisition and construction methods, as well as a detailed categorization of the 3D skeleton-based representations from four key perspectives, including information modality, representation encoding, structure and topological transition, and feature engineering. We also compare the

References (234)

  • I. Akhter et al.

    Pose-conditioned joint angle limits for 3D human pose reconstruction

    IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • B. Amor et al.

    Action recognition using rate-invariant analysis of skeletal shape trajectories

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • M. Andriluka et al.

    Pictorial structures revisited: People detection and articulated pose estimation

    IEEE Conference on Computer Vision and Pattern Recognition

    (2009)
  • R. Anirudh et al.

    Elastic functional coding of human actions: From vector-fields to latent variables

    IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • S. Azary et al.

    Grassmannian sparse representations and motion depth surfaces for 3D action recognition

    IEEE Conference on Computer Vision and Pattern Recognition Workshops

    (2013)
  • A. Baak et al.

    A data-driven approach for real-time full body pose reconstruction from a depth camera

    IEEE International Conference on Computer Vision

    (2011)
  • A.O. Bălan et al.

    Detailed human shape and pose from images

    IEEE Conference on Computer Vision and Pattern Recognition

    (2007)
  • I.B. Barbosa et al.

    Re-identification with RGB-D sensors

    International Workshop on Re-Identification

    (2012)
  • V. Belagiannis et al.

    3D pictorial structures for multiple human pose estimation

    IEEE Conference on Computer Vision and Pattern Recognition

    (2014)
  • Y. Bengio et al.

    Representation learning: a review and new perspectives

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • P. Besl et al.

    A method for registration of 3-D shapes

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1992)
  • V. Bloom et al.

    Dynamic feature selection for online action recognition

    Human Behavior Understanding

    (2013)
  • V. Bloom et al.

    G3Di: A gaming interaction dataset with a real time detection and evaluation framework

    Workshops on Computer Vision on European Conference on Computer Vision

    (2014)
  • V. Bloom et al.

    G3D: a gaming action dataset and real time action recognition evaluation framework

    Workshops on IEEE Conference on Computer Vision and Pattern Recognition

    (2012)
  • P.V.K. Borges et al.

    Video-based human behavior understanding: a survey

    IEEE Trans. Circuits Syst. Video Technol.

    (2013)
  • S. Boubou et al.

    Classifying actions based on histogram of oriented velocity vectors

    J. Intell. Inf. Syst.

    (2015)
  • L. Bourdev et al.

    Poselets: Body part detectors trained using 3D human pose annotations

    IEEE International Conference on Computer Vision

    (2009)
  • O. Brdiczka et al.

    Detecting human behavior models from multimodal observation in a smart home

    IEEE Trans. Autom. Sci. Eng.

    (2009)
  • E. Broadbent et al.

    Acceptance of healthcare robots for the older population: review and future directions

    Int. J. Soc. Robot.

    (2009)
  • A.L. Brooks et al.

    Markerless motion tracking: MS Kinect & Organic Motion OpenStage®

    International Conference on Disability, Virtual Reality and Associated Technologies

    (2012)
  • M. Burenius et al.

    3D pictorial structures for multiple view articulated pose estimation

    IEEE Conference on Computer Vision and Pattern Recognition

    (2013)
  • L. Campbell et al.

    Recognition of human body motion using phase space constraints

    IEEE International Conference on Computer Vision

    (1995)
  • J. Cavazza et al.

    Kernelized covariance for action recognition

    International Conference on Pattern Recognition

    (2016)
  • J. Charles et al.

    Learning shape models for monocular human pose estimation from the Microsoft Xbox Kinect

    IEEE International Conference on Computer Vision

    (2011)
  • R. Chaudhry et al.

    Bio-inspired dynamic 3D discriminative skeletal features for human action recognition

    IEEE Conference on Computer Vision and Pattern Recognition

    (2013)
  • C. Chen et al.

    UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor

    IEEE International Conference on Image Processing

    (2015)
  • G. Chen et al.

    Action recognition using ensemble weighted multi-instance learning

    IEEE International Conference on Robotics and Automation

    (2014)
  • X. Chen et al.

    Online RGB-D gesture recognition with extreme learning machines

    ACM International Conference on Multimodal Interaction

    (2013)
  • E. Cippitelli et al.

    Comparison of RGB-D mapping solutions for application to food intake monitoring

    Ambient Assisted Living

    (2015)
  • E. Cippitelli et al.

    A human activity recognition system using skeleton data from RGBD sensors

    Comput. Intell. Neurosci.

    (2016)
  • E. Cippitelli et al.

    Time synchronization and data fusion for RGB-depth cameras and inertial sensors in AAL applications

    Workshop on IEEE International Conference on Communication

    (2015)
  • E. Demircan et al.

    Human movement understanding

    IEEE Robot. Autom. Mag.

    (2015)
  • M. Devanne et al.

    Space-time pose representation for 3D human action recognition

    New Trends in Image Analysis and Processing

    (2013)
  • M. Devanne et al.

    3-D human action recognition by shape analysis of motion trajectories on Riemannian manifold

    IEEE Trans. Cybern.

    (2015)
  • M. Devanne et al.

    Combined shape analysis of human poses and motion units for action segmentation and recognition

    IEEE International Conference and Workshops on Automatic Face and Gesture Recognition

    (2015)
  • M. Ding et al.

    Articulated and generalized Gaussian kernel correlation for human pose estimation

    IEEE Trans. Image Process.

    (2016)
  • J. Dong et al.

    Towards unified human parsing and pose estimation

    IEEE Conference on Computer Vision and Pattern Recognition

    (2014)
  • Y. Du et al.

    Hierarchical recurrent neural network for skeleton based action recognition

    IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • A. Elhayek et al.

    Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras

    IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • C. Ellis et al.

    Exploring the trade-off between accuracy and observational latency in action recognition

    Int. J. Comput. Vis.

    (2013)
    ¹ These authors contributed equally to this work.