In this paper, we investigate temporal features extracted by a multi-channel convolutional neural network (CNN) for depth map-based human action recognition. First, we calculate handcrafted features for the non-zero pixels representing the person's shape in each depth map. On the multivariate time-series of these handcrafted features we train a multi-class, multi-channel CNN to model temporal features, and we also extract statistical features of the time-series. The concatenated features are stored in a common feature vector. Afterwards, for each class we train a separate one-against-all CNN to extract class-specific features of the depth maps. For each class-specific multivariate time-series we calculate statistical features. Finally, each class-specific feature vector is concatenated with the common feature vector, resulting in an action feature vector. For each action class represented by such action feature vectors we train a multi-class classifier with one-hot encoding of the output labels. The action is recognized by a voting-based ensemble operating on these one-hot encodings. We demonstrate experimentally that the proposed algorithm outperforms state-of-the-art depth-based algorithms on the UTD-MHAD dataset and attains promising results on the MSR-Action3D dataset.
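The feature concatenation and voting steps can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names are hypothetical, the choice of statistics (mean, standard deviation, minimum, maximum per channel) is an assumption, and ties in the vote are resolved in favour of the lowest class index, which is also an assumption. The CNN-based temporal features are omitted.

```python
from statistics import mean, pstdev


def statistical_features(series):
    """Per-channel statistics of a multivariate time-series.

    series: list of T frames, each a list of C feature values.
    Returns a flat list [mean, std, min, max] for each channel.
    The set of statistics is an assumption made for illustration.
    """
    channels = list(zip(*series))  # transpose: C channels of length T
    feats = []
    for ch in channels:
        feats.extend([mean(ch), pstdev(ch), min(ch), max(ch)])
    return feats


def action_feature_vector(common, class_specific):
    """Concatenate the common and one class-specific feature vector."""
    return list(common) + list(class_specific)


def vote(one_hot_predictions):
    """Majority vote over one-hot outputs of the per-class classifiers.

    one_hot_predictions: list of one-hot lists, one per classifier.
    Returns the index of the class with the most votes
    (lowest index wins a tie; this tie rule is an assumption).
    """
    n_classes = len(one_hot_predictions[0])
    votes = [sum(p[c] for p in one_hot_predictions) for c in range(n_classes)]
    return max(range(n_classes), key=votes.__getitem__)
```

For example, with three classifiers whose one-hot outputs are `[0, 1, 0]`, `[0, 1, 0]` and `[1, 0, 0]`, `vote` selects class index 1.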