Space-time representation of people based on 3D skeletal data: A review
Introduction
Human representation in spatiotemporal space is a fundamental research problem extensively investigated in computer vision and machine intelligence over the past few decades. The objective of building human representations is to extract compact, descriptive information (i.e., features) to encode and characterize a human’s attributes from perception data (e.g., human shape, pose, and motion), when developing recognition or other human-centered reasoning systems. As an integral component of reasoning systems, approaches to construct human representations have been widely used in a variety of real-world applications, including video analysis (Ge et al., 2012), surveillance (Jun et al., 2013), robotics (Demircan et al., 2015), human-machine interaction (Han et al., 2017a), augmented and virtual reality (Green et al., 2008), assistive living (Okada et al., 2005), smart homes (Brdiczka et al., 2009), education (Mondada et al., 2009), and many others (Broadbent, Stafford, MacDonald, 2009, Ding, Fan, 2016, Fujita, 2000, Kang, Freedman, Matarić, Cunningham, Lopez, 2005).
During recent years, human representations based on 3D perception data have been attracting an increasing amount of attention (Kviatkovsky, Rivlin, Shimshoni, 2014, Li, Yu, Wu, Su, Ji, 2015, Siddharth, Barbu, Siskind, 2014, Vieira, Nascimento, Oliveira, Liu, Campos, 2012). Compared with 2D visual data, the additional depth information provides several advantages. Depth images provide geometric information for each pixel that encodes the external surface of the scene in 3D space. Features extracted from depth images and 3D point clouds are robust to variations of illumination, scale, and rotation (Aggarwal, Xia, 2014, Han, Shao, Xu, Shotton, 2013). Thanks to the emergence of affordable structured-light color-depth sensing technology, such as the Microsoft Kinect (2012) and ASUS Xtion PRO LIVE (2011) RGB-D cameras, it is much easier and cheaper to obtain depth data. In addition, structured-light cameras enable us to retrieve 3D human skeletal information in real time (Shotton et al., 2011a), which used to be possible only with expensive and complex vision systems (e.g., motion capture systems (Tobon, 2010)), thereby significantly popularizing skeleton-based human representations. Moreover, the vast increase in computational power allows researchers to develop advanced computational algorithms (e.g., deep learning (Du et al., 2015)) to process visual data at an acceptable speed. These advancements have contributed to the boom in utilizing 3D perception data to construct reasoning systems in the computer vision and machine learning communities.
Since the performance of machine learning and reasoning methods heavily relies on the design of data representation (Bengio et al., 2013), human representations are intensively investigated to address human-centered research problems (e.g., human detection, tracking, pose estimation, and action recognition). Among the large number of human representation approaches (Bălan, Sigal, Black, Davis, Haussecker, 2007, Belagiannis, Amin, Andriluka, Schiele, Navab, Ilic, 2014, Burenius, Sullivan, Carlsson, 2013, Ganapathi, Plagemann, Koller, Thrun, 2010, Rahmani, Mahmood, Huynh, Mian, 2014a, Wang, Liu, Chorowski, Chen, Wu, 2012a), most of the existing 3D-based methods can be broadly grouped into two categories: representations based on local features (Le, Zou, Yeung, Ng, 2011, Zhang, Parker, 2011) and skeleton-based representations (Han, Yang, Reardon, Zhang, Zhang, 2017b, Sun, Wei, Liang, Tang, Sun, 2015, Tang, Chang, Tejani, Kim, 2014, Xu, Cheng, 2013). Methods based on local features detect points of interest in the space-time dimensions, describe the patches centered at those points as features, and encode them (e.g., using bag-of-words models) into representations; such methods can locate salient regions and are relatively robust to partial occlusion. However, they ignore the spatial relationships among the features. Because these approaches are often incapable of identifying which features belong to which person, they generally cannot represent multiple individuals in the same scene. These methods are also computationally expensive because of the complexity of the required procedures, including keypoint detection, feature description, and dictionary construction.
On the other hand, human representations based on 3D skeleton information provide a very promising alternative. The concept of skeleton-based representation can be traced back to the early seminal research of Johansson (1973), which demonstrated that a small number of joint positions can effectively represent human behaviors. 3D skeleton-based representations have demonstrated promising performance in real-world applications including Kinect-based gaming, as well as in computer vision research (Du, Wang, Wang, 2015, Yao, Gall, Fanelli, Gool, 2011). They are able to model the relationship of human joints and encode the whole body configuration. They are also robust to scale and illumination changes, and can be invariant to camera view as well as human body rotation and motion speed. In addition, many skeleton-based representations can be computed at a high frame rate, which significantly facilitates online, real-time applications. Given these advantages and the previous success of 3D skeleton-based representations, we have witnessed a significant increase of new techniques to construct such representations in recent years, as demonstrated in Fig. 1, which underscores the need for this survey paper focused on reviewing 3D skeleton-based human representations.
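Invariance to translation and scale of the kind described above is often obtained by simple preprocessing of the raw joint positions. The following is a minimal sketch under assumed conventions (a hypothetical joint layout, root joint at index 0), not the method of any particular reviewed paper:

```python
import numpy as np

def normalize_skeleton(joints, root=0):
    """Center the skeleton at a root joint (translation invariance) and
    rescale so the mean joint-to-root distance is 1 (scale invariance).
    Full view invariance would additionally require rotating the skeleton
    to a canonical body orientation, which this sketch omits."""
    centered = joints - joints[root]
    scale = np.linalg.norm(centered, axis=1).mean()
    return centered / scale

# Hypothetical 3-joint skeleton (not taken from any reviewed dataset).
joints = np.array([[1.0, 2.0, 3.0],
                   [1.0, 3.0, 3.0],
                   [1.0, 4.0, 3.0]])
print(normalize_skeleton(joints))
```

Because the output depends only on joint positions relative to the root and to the body scale, the same pose captured at a different distance from the camera maps to the same normalized skeleton.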
Several survey papers have been published in related research areas such as motion and activity recognition. For example, Han et al. (2013) described the Kinect sensor and its general applications in computer vision and machine intelligence. Aggarwal and Xia (2014) recently published a review paper on human activity recognition from 3D visual data, which summarized five categories of representations based on 3D silhouettes, skeletal joints or body part locations, local spatio-temporal features, scene flow features, and local occupancy features. Several earlier surveys were also published that review methods to recognize human poses, motions, gestures, and activities (Aggarwal, Ryoo, 2011, Borges, Conci, Cavallaro, 2013, Chen, Wei, Ferryman, 2013, Ji, Liu, 2010, Ke, Thuc, Lee, Hwang, Yoo, Choi, 2013, LaViola, 2013, Lun, Zhao, 2015, Moeslund, Granum, 2001, Moeslund, Hilton, Krüger, 2006, Poppe, 2010, Ruffieux, Lalanne, Mugellini, Khaled, 2014, Ye, Zhang, Wang, Zhu, Yang, Gall, 2013), as well as their applications (Chaaraoui, Climent-Pérez, Flórez-Revuelta, 2012, Zhou, Hu, 2008). However, none of these survey papers specifically focused on 3D human representation based on skeletal data, which has been the subject of numerous research papers and continues to gain popularity.
The objective of this survey is to provide a comprehensive overview of 3D skeleton-based human representations published mainly in the computer vision and machine intelligence communities; these representations are built upon 3D human skeleton data that is assumed to be the raw measurement obtained directly from sensing hardware. We categorize and compare the reviewed approaches from multiple perspectives, including information modality, representation encoding, structure and transition, and feature engineering methodology, and analyze the pros and cons of each category. Compared with the existing surveys, the main contributions of this review include:
- To the best of our knowledge, this is the first survey dedicated to human representations based on 3D skeleton data, which fills the current void in the literature.
- The survey is comprehensive and covers the most recent and advanced approaches. We review 171 3D skeleton-based human representations, including 150 papers published in the past five years, thereby providing readers with a complete view of the state of the art.
- This paper provides an insightful categorization and analysis of 3D skeleton-based representation construction approaches from multiple perspectives, and summarizes and compares the attributes of all reviewed representations.
In addition, we provide a complete list of available benchmark datasets. Although we also provide a brief overview of human modeling methods that generate skeleton data through pose recognition and joint estimation (Akhter, Black, 2015, Lehrmann, Gehler, Nowozin, 2013, Pons-Moll, Fleet, Rosenhahn, 2014, Zhou, De la Torre, 2014), the purpose is to provide related background information; skeleton construction, which is widely studied in several research fields (such as computer vision, computer graphics, human-computer interaction, and animation), is not the focus of this paper. In addition, the main application domains of interest in this survey are human gesture, action, and activity recognition, as most of the reviewed papers focus on these applications. Although several skeleton-based representations have also been used for human re-identification (Giachetti, Fornasa, Parezzan, Saletti, Zambaldo, Zanini, Achilles, Ichim, Tombari, Navab, et al., 2016, Munaro, Fossati, Basso, Menegatti, Van Gool, 2014), skeleton-based features are usually used along with other shape- or texture-based features (e.g., from 3D point clouds) in that application, as skeleton-based features are generally incapable of representing the human appearance that is critical for re-identification.
The remainder of this review is structured as follows. Background information including 3D skeleton acquisition and construction as well as public benchmark datasets is presented in Section 2. Sections 3–6 discuss the categorization of 3D skeleton-based human representations from four perspectives, including information modality in Section 3, encoding in Section 4, hierarchy and transition in Section 5, and feature construction methodology in Section 6. After discussing the advantages of skeleton-based representations and pointing out future research directions in Section 7, the review paper is concluded in Section 8.
Section snippets
Background
The objective of building 3D skeleton-based human representations is to extract compact, discriminative descriptions that characterize a human’s attributes from 3D human skeletal information. The 3D skeleton data encodes the human body as an articulated system of rigid segments connected by joints. This section discusses how 3D skeletal data can be acquired, including devices that directly provide the skeletal data and computational methods to construct the skeleton. Available benchmark datasets
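Concretely, such an articulated system is commonly stored as a per-joint 3D position array plus a parent index for each joint. The sketch below uses a hypothetical four-joint chain (not the layout of any particular sensor or dataset) and recovers the lengths of the rigid segments from that structure:

```python
import numpy as np

# Hypothetical 4-joint kinematic chain: each row is a joint's 3D position,
# and parents[i] gives the index of joint i's parent (-1 for the root).
joints = np.array([
    [0.0, 0.0, 0.0],   # root (pelvis)
    [0.0, 0.5, 0.0],   # spine
    [0.0, 1.0, 0.0],   # neck
    [0.0, 1.2, 0.0],   # head
])
parents = np.array([-1, 0, 1, 2])

def segment_lengths(joints, parents):
    """Length of each rigid segment connecting a joint to its parent."""
    child = np.flatnonzero(parents >= 0)
    return np.linalg.norm(joints[child] - joints[parents[child]], axis=1)

print(segment_lengths(joints, parents))
```

In real skeletal data the segment lengths are approximately constant over time (the segments are rigid), which is one reason joint angles and orientations, rather than raw positions, are often preferred as features.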
Information modality
Skeleton-based human representations are constructed from various features computed from raw 3D skeletal data, which can be acquired through a variety of sensing technologies. We define each type of skeleton-based feature extracted from an individual sensing technique as a modality. From the perspective of information modality, 3D skeleton-based human representations can be classified into four categories, based on joint displacement, orientation, raw position, and combined information.
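To make the distinction between modalities concrete, the following hedged sketch computes two of them from the same raw joint positions: pairwise joint displacements and per-segment orientation (unit direction) vectors. The three-joint layout and the specific feature definitions are illustrative assumptions, not a prescribed design:

```python
import numpy as np

def displacement_features(joints):
    """Pairwise joint displacements: the vector from joint j to joint i
    for every unordered pair (i, j), flattened into one feature vector."""
    n = len(joints)
    idx_i, idx_j = np.triu_indices(n, k=1)
    return (joints[idx_i] - joints[idx_j]).ravel()

def orientation_features(joints, parents):
    """Unit direction of each rigid segment, pointing from parent to child."""
    child = np.flatnonzero(parents >= 0)
    seg = joints[child] - joints[parents[child]]
    return (seg / np.linalg.norm(seg, axis=1, keepdims=True)).ravel()

# Hypothetical 3-joint skeleton.
joints = np.array([[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.3, 0.9, 0.0]])
parents = np.array([-1, 0, 1])
print(displacement_features(joints).shape)         # 3 joint pairs x 3 dims
print(orientation_features(joints, parents).shape) # 2 segments x 3 dims
```

Raw-position features would simply be the (possibly normalized) `joints` array itself, and combined-information representations concatenate or fuse several of these modalities.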
Representation encoding
Feature encoding, which aims to integrate all extracted features into a single final feature vector that can be used as the input to classifiers or other reasoning systems, is a necessary and important component of representation construction (Huang et al., 2014c). In the scenario of 3D skeleton-based representation construction, the encoding methods can be broadly grouped into three classes: concatenation-based encoding, statistics-based encoding, and bag-of-words encoding. The encoding
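The three encoding classes can be sketched on a toy per-frame feature sequence. Everything here is a hypothetical illustration (random frames, a random codebook that in practice would be learned, e.g., by k-means), intended only to show how each class turns a variable-length sequence into one vector:

```python
import numpy as np

# Hypothetical sequence: 5 frames, each with a 4-dimensional skeleton feature.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 4))

# 1) Concatenation-based encoding: stack per-frame features into one vector,
#    so the code length grows with the number of frames.
concat = frames.ravel()                      # shape (5 * 4,) = (20,)

# 2) Statistics-based encoding: summarize the sequence with moments,
#    making the code length independent of the number of frames.
stats = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])  # (8,)

# 3) Bag-of-words encoding: assign each frame to its nearest codeword and
#    represent the sequence as a histogram of codeword counts.
codebook = rng.normal(size=(3, 4))           # 3 hypothetical codewords
dists = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
words = dists.argmin(axis=1)
bow = np.bincount(words, minlength=len(codebook))  # (3,), counts sum to 5

print(concat.shape, stats.shape, bow.shape)
```

The trade-off visible even in this sketch mirrors the survey's categorization: concatenation preserves temporal order but fixes the sequence length, while statistics- and bag-of-words-based codes discard order to gain length invariance.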
Structure and topological transition
While most skeleton-based 3D human representations are based on pure low-level features extracted from the skeleton data in 3D Euclidean space, several works have studied mid-level features or the transition of features to other topological spaces. This section categorizes the reviewed approaches from the structure and transition perspective into three groups: representations using low-level features in Euclidean space, representations using mid-level features based on human body parts, and manifold-based
Feature engineering
Feature engineering is one of the most fundamental research problems in computer vision and machine learning research. Early feature engineering techniques for human representation construction are manual; features are hand-crafted and their importance is manually decided. In recent years, we have been witnessing a clear transition from manual feature engineering to automated feature learning and extraction. In this section, we categorize and analyze human representations based on 3D skeleton
Performance analysis of the current state of the art
In this section, we compare the accuracy and efficiency of different approaches using several of the most widely used datasets, including MSR Action3D, CAD-60, MSRC-12, and HDM05, which cover both structured-light sensors (Kinect v1) and motion capture systems. The performance is evaluated using the precision metric, since almost all the existing approaches report precision results. The detailed comparison of different approaches is presented in Table 7.
From Table 7, it is observed that there is
Conclusion
This paper presents a unique and comprehensive survey of state-of-the-art space-time human representations based on 3D skeleton data, which is now widely available. We provide a brief overview of existing 3D skeleton acquisition and construction methods, as well as a detailed categorization of 3D skeleton-based representations from four key perspectives, including information modality, representation encoding, structure and topological transition, and feature engineering. We also compare the
References (234)
- et al. Human activity recognition from 3D data: a review. Pattern Recognit. Lett. (2014)
- et al. Ongoing human action recognition with motion capture. Pattern Recognit. (2014)
- et al. A review on vision techniques applied to human behaviour analysis for ambient-assisted living. Expert Syst. Appl. (2012)
- et al. Evolutionary joint selection to improve human action recognition with RGB-D devices. Expert Syst. Appl. (2014)
- et al. A survey of human motion analysis using depth imagery. Pattern Recognit. Lett. (2013)
- et al. Discriminative human action recognition in the learned hierarchical manifold space. Image Vis. Comput. (2010)
- et al. Hierarchical implicit surface joint limits for human body tracking. Comput. Vision Image Understanding (2005)
- et al. Semantic parametric body shape estimation from noisy depth sequences. Rob. Auton. Syst. (2016)
- et al. Action recognition on motion capture data using a dynemes and forward differences representation. J. Vis. Commun. Image Represent. (2014)
- et al. Human activity analysis: a review. ACM Comput. Surv. (2011)
- Pose-conditioned joint angle limits for 3D human pose reconstruction. IEEE Conference on Computer Vision and Pattern Recognition
- Action recognition using rate-invariant analysis of skeletal shape trajectories. IEEE Trans. Pattern Anal. Mach. Intell.
- Pictorial structures revisited: People detection and articulated pose estimation. IEEE Conference on Computer Vision and Pattern Recognition
- Elastic functional coding of human actions: From vector-fields to latent variables. IEEE Conference on Computer Vision and Pattern Recognition
- Grassmannian sparse representations and motion depth surfaces for 3D action recognition. IEEE Conference on Computer Vision and Pattern Recognition Workshops
- A data-driven approach for real-time full body pose reconstruction from a depth camera. IEEE International Conference on Computer Vision
- Detailed human shape and pose from images. IEEE Conference on Computer Vision and Pattern Recognition
- Re-identification with RGB-D sensors. International Workshop on Re-Identification
- 3D pictorial structures for multiple human pose estimation. IEEE Conference on Computer Vision and Pattern Recognition
- Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell.
- A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell.
- Dynamic feature selection for online action recognition. Human Behavior Understanding
- G3Di: A gaming interaction dataset with a real time detection and evaluation framework. European Conference on Computer Vision Workshops
- G3D: A gaming action dataset and real time action recognition evaluation framework. IEEE Conference on Computer Vision and Pattern Recognition Workshops
- Video-based human behavior understanding: a survey. IEEE Trans. Circuits Syst. Video Technol.
- Classifying actions based on histogram of oriented velocity vectors. J. Intell. Inf. Syst.
- Poselets: Body part detectors trained using 3D human pose annotations. IEEE International Conference on Computer Vision
- Detecting human behavior models from multimodal observation in a smart home. IEEE Trans. Autom. Sci. Eng.
- Acceptance of healthcare robots for the older population: review and future directions. Int. J. Soc. Robot.
- Markerless motion tracking: MS Kinect & Organic Motion OpenStage®. International Conference on Disability, Virtual Reality and Associated Technologies
- 3D pictorial structures for multiple view articulated pose estimation. IEEE Conference on Computer Vision and Pattern Recognition
- Recognition of human body motion using phase space constraints. IEEE International Conference on Computer Vision
- Kernelized covariance for action recognition. International Conference on Pattern Recognition
- Learning shape models for monocular human pose estimation from the Microsoft Xbox Kinect. IEEE International Conference on Computer Vision
- Bio-inspired dynamic 3D discriminative skeletal features for human action recognition. IEEE Conference on Computer Vision and Pattern Recognition
- UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. IEEE International Conference on Image Processing
- Action recognition using ensemble weighted multi-instance learning. IEEE International Conference on Robotics and Automation
- Online RGB-D gesture recognition with extreme learning machines. ACM International Conference on Multimodal Interaction
- Comparison of RGB-D mapping solutions for application to food intake monitoring. Ambient Assisted Living
- A human activity recognition system using skeleton data from RGBD sensors. Comput. Intell. Neurosci.
- Time synchronization and data fusion for RGB-depth cameras and inertial sensors in AAL applications. IEEE International Conference on Communication Workshops
- Human movement understanding. IEEE Robot. Autom. Mag.
- Space-time pose representation for 3D human action recognition. New Trends in Image Analysis and Processing
- 3-D human action recognition by shape analysis of motion trajectories on Riemannian manifold. IEEE Trans. Cybern.
- Combined shape analysis of human poses and motion units for action segmentation and recognition. IEEE International Conference and Workshops on Automatic Face and Gesture Recognition
- Articulated and generalized Gaussian kernel correlation for human pose estimation. IEEE Trans. Image Process.
- Towards unified human parsing and pose estimation. IEEE Conference on Computer Vision and Pattern Recognition
- Hierarchical recurrent neural network for skeleton based action recognition. IEEE Conference on Computer Vision and Pattern Recognition
- Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras. IEEE Conference on Computer Vision and Pattern Recognition
- Exploring the trade-off between accuracy and observational latency in action recognition. Int. J. Comput. Vis.
1. These authors contributed equally to this work.