Three-stream fusion network for first-person interaction recognition
Introduction
Despite growing research in computer vision, understanding human activity in videos remains challenging. Recent approaches based on deep learning techniques have achieved significant progress in third-person activity recognition [1], [2], [3], [4], [5]. In third-person videos, the camera is fixed and at a large distance from people and objects. Many researchers have also studied human activity recognition in first-person video [6], [7], [8], [9], [10]. First-person video is captured by a camera mounted on a person or object. Such video has a unique characteristic called camera ego-motion, which is not usually seen in third-person video. When the video includes camera ego-motion, the appearance of the target subjects is easily distorted and the motion vectors are disordered. Therefore, recognizing activities in first-person video requires an approach tailored to these particular characteristics. In this paper, we analyze first-person video frames and focus especially on the interaction between a camera wearer and a human subject.
In Fig. 1, the optical flow extracted from two consecutive RGB frames shows the motion vectors in a third-person video and in a first-person video. As shown in Fig. 1(a), motion vectors in the third-person video appear around the region where people are positioned because the camera is fixed and stationary. In contrast, Fig. 1(b) shows complex motion vectors produced by both the target and the camera wearer. The motion vectors appear not only close to the region where the target is positioned but also across the entire optical flow field. Camera ego-motion makes analyzing the appearance and motion characteristics of the target in first-person video more challenging for three reasons: (1) it is difficult to build discriminative motion features from the complex motion vectors in first-person video; (2) target appearance, which is an important feature for analyzing the target’s behavior, is severely distorted by the camera wearer’s movement; (3) camera ego-motion can occur whenever the camera wearer moves, regardless of the target’s actions, so it can act as noise when first-person interactions are analyzed.
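As a rough illustration of why ego-motion contaminates the flow field, the following sketch approximates the camera's contribution as a single global translation (the per-axis median of all flow vectors) and subtracts it to expose the residual target motion. This is an assumption made purely for illustration, not the method proposed in this paper: real ego-motion also includes rotation and scale change, which a single median vector cannot capture. The function name and the toy flow values are hypothetical.

```python
from statistics import median


def split_ego_and_target_flow(flow):
    """Split a flow field into a global (ego) vector and per-cell residuals.

    flow: list of (dx, dy) vectors, one per pixel or grid cell.
    Assumes the camera motion is a pure translation shared by most cells.
    """
    ego_dx = median(dx for dx, _ in flow)
    ego_dy = median(dy for _, dy in flow)
    residual = [(dx - ego_dx, dy - ego_dy) for dx, dy in flow]
    return (ego_dx, ego_dy), residual


# Toy field: the camera pans by (2, 0) everywhere; two cells also contain
# motion of the target, so their observed flow is (5, 1).
flow = [(2.0, 0.0)] * 7 + [(5.0, 1.0), (5.0, 1.0)]
ego, residual = split_ego_and_target_flow(flow)
```

In this toy case the median recovers the pan, background cells end up with zero residual, and only the two target cells keep a non-zero vector. Once the target covers most of the frame, or the camera rotates, the median no longer isolates the camera motion.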
This paper proposes a three-stream fusion network to recognize interactions in first-person video where large amounts of camera ego-motion occur. The proposed method is composed of two main parts: a three-stream architecture and three-stream correlation fusion (TSCF). The three-stream architecture consists of a target appearance stream, a target motion stream, and an ego-motion stream. The target appearance stream and the target motion stream capture the appearance features and motion features of the target, respectively, and the ego-motion stream captures features of the camera wearer’s movement. The camera ego-motion is an important clue to the camera wearer’s movement because the wearer’s pose in first-person video is unknown. To generate features that are robust to camera ego-motion, the TSCF considers two types of correlations. First, it considers the correlation between the target appearance and motion to utilize their spatiotemporal relationship. This relationship complements the target’s appearance and motion features, which are distorted by the camera wearer’s movement. Second, to address the problem of the camera wearer’s movement being noisy, the TSCF uses the correlation between the target and camera ego-motion to determine whether the camera wearer’s movement is caused by the target’s action. In other words, we can determine whether the camera ego-motion is an important clue for analyzing first-person interactions between the target and the camera wearer.
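The two correlations described above can be made concrete with a small sketch. The exact TSCF formulation is not given in this excerpt, so everything below is an assumption chosen for illustration: the appearance-motion correlation is modelled as an element-wise product, and the target-ego correlation as a cosine similarity that gates the ego-motion feature. The names `cosine` and `fuse_three_streams` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fuse_three_streams(appearance, motion, ego):
    """Illustrative fusion of three per-frame feature vectors of equal length.

    Assumption 1: the appearance-motion correlation is an element-wise product.
    Assumption 2: the target-ego correlation is a cosine similarity clipped
    to [0, 1], used as a gate on the ego-motion feature.
    """
    target = [a * m for a, m in zip(appearance, motion)]
    gate = max(0.0, cosine(target, ego))
    fused = target + [gate * e for e in ego]
    return fused, gate


appearance, motion = [1.0, 2.0], [2.0, 0.5]
# Ego-motion aligned with the target feature: the gate stays open.
fused_related, gate_related = fuse_three_streams(appearance, motion, [4.0, 2.0])
# Ego-motion orthogonal to the target feature: the gate closes.
fused_unrelated, gate_unrelated = fuse_three_streams(appearance, motion, [-1.0, 2.0])
```

The gate plays the role the text assigns to the second correlation: ego-motion that co-varies with the target is kept, while unrelated ego-motion is suppressed before it reaches the classifier.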
The main contributions of this paper are summarized as follows: (1) we propose a novel deep learning framework called the three-stream fusion network. The proposed network is specialized in extracting discriminative features and treats camera ego-motion as an important feature for analyzing the camera wearer’s movement; (2) we introduce a fusion method called TSCF, which considers the correlations between the target’s appearance, the target’s motion, and the camera ego-motion. The proposed fusion method creates robust features that mitigate the effects of the camera ego-motion; (3) we show that our proposed method outperforms state-of-the-art activity recognition methods on the JPL First-Person Interaction dataset and the UTKinect-FirstPerson dataset.
Related works
There has been a great deal of progress in human activity recognition in video captured from a third-person viewpoint. Early work relied on hand-crafted features as the feature representation for activity recognition [11], [12], [13], [14], [15], [16], [17], [18]. Some studies suggested various methods, such as support vector machines (SVM) [19], [20], unsupervised learning [21], and multi-label learning [22], to improve recognition performance. In more recent research, a significant performance
Three-stream fusion network
An overview of our three-stream fusion network is shown in Fig. 2. Our proposed network consists of the three-stream architecture, TSCF, and an LSTM model. Our three-stream architecture captures features for the target’s appearance, motion, and the camera ego-motion. The output feature maps from the three streams are combined by the proposed TSCF. The LSTM model classifies the video considering the temporal dynamics of TSCF features.
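The data flow in this overview (per-frame fused features, then an LSTM over time, then a class decision) can be sketched as follows. This is a minimal, untrained stand-in written under stated assumptions: the fused TSCF feature for each frame is represented by a plain list of floats, the LSTM cell omits bias terms, and the weights are random. The class `TinyLSTM` and the function `classify_clip` are hypothetical names, not components of the proposed network.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single-layer LSTM cell without biases, with small random weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        n = input_size + hidden_size
        # One weight matrix per gate: input, forget, candidate, output.
        self.weights = {
            name: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_size)]
            for name in ("input", "forget", "candidate", "output")
        }
        self.hidden_size = hidden_size

    def _linear(self, name, z):
        return [sum(w * v for w, v in zip(row, z)) for row in self.weights[name]]

    def step(self, x, h, c):
        z = x + h  # concatenate current input and previous hidden state
        i = [sigmoid(p) for p in self._linear("input", z)]
        f = [sigmoid(p) for p in self._linear("forget", z)]
        g = [math.tanh(p) for p in self._linear("candidate", z)]
        o = [sigmoid(p) for p in self._linear("output", z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

    def run(self, sequence):
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            h, c = self.step(x, h, c)
        return h  # final hidden state summarizes the clip


def classify_clip(fused_frames, lstm, class_weights):
    """Return the index of the highest-scoring class for one clip."""
    h = lstm.run(fused_frames)
    scores = [sum(w * v for w, v in zip(row, h)) for row in class_weights]
    return scores.index(max(scores))


# Five frames of a 4-dimensional placeholder for the fused TSCF feature.
fused_frames = [[0.1, 0.2, 0.3, 0.4]] * 5
lstm = TinyLSTM(input_size=4, hidden_size=3)
final_state = lstm.run(fused_frames)
class_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two placeholder classes
predicted = classify_clip(fused_frames, lstm, class_weights)
```

Because the weights are random and untrained, the predicted label carries no meaning; the sketch only shows how a sequence of fused frame features is reduced to a single clip-level decision.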
Datasets
The proposed three-stream fusion network and TSCF were evaluated on two public first-person video datasets: the UTKinect-FirstPerson dataset [46] and the JPL First-Person Interaction dataset [9]. Both datasets were proposed for first-person activity recognition. Recognition performance was evaluated on each dataset.
UTKinect-FirstPerson Dataset (humanoid). In this dataset, eight subjects interact with a humanoid robot on which a Kinect sensor is mounted. The human performs
Conclusion
We proposed a three-stream fusion network for interaction recognition in first-person videos where camera ego-motion occurs. The TSCF was introduced to consider the correlations of target appearance, target motion, and camera ego-motion features. The proposed three-stream fusion network with the TSCF successfully classified the first-person interaction video clips by means of robust video feature vectors that mitigate the effects of the camera’s movement. The proposed method showed a
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by the Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2019-0-00079, Department of Artificial Intelligence, Korea University] and [No. 2014-0-00059, Development of Predictive Visual Intelligence Technology].
References (55)
- et al. First person action recognition via two-stream convnet with long-term fusion pooling. Pattern Recognit. Lett. (2018)
- et al. Qualitative estimation of camera motion parameters from the linear composition of optical flow. Pattern Recognit. (2004)
- et al. View-independent human action recognition with volume motion template on single stereo camera. Pattern Recognit. Lett. (2010)
- et al. Trajectory aligned features for first person action recognition. Pattern Recognit. (2017)
- et al. Spatio-temporal vector of locally max pooled features for action recognition in videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
- et al. Spatiotemporal residual networks for video action recognition. Advances in Neural Information Processing Systems (2016)
- et al. Convolutional two-stream network fusion for video action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- et al. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems (2014)
- et al. Spatiotemporal pyramid network for video action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
- et al. Fast unsupervised ego-action learning for first-person sports videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2011)
- Going deeper into first-person activity recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- First-person activity recognition: what are they doing to me? Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Modeling sub-event dynamics in first-person action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Histograms of oriented gradients for human detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Learning realistic human actions from movies. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Human action recognition using factorized spatio-temporal convolutional networks. Proceedings of the IEEE International Conference on Computer Vision
- Recognizing hand gestures using dynamic bayesian network. Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition
- Multiple people tracking using an appearance model based on temporal color. Proceedings of the International Workshop on Biologically Motivated Computer Vision
- A network of dynamic probabilistic models for human interaction analysis. IEEE Trans. Circuits Syst. Video Technol.
- Recognizing human actions: a local SVM approach. Proceedings of the IEEE International Conference on Pattern Recognition
- Facial component extraction and face recognition with support vector machines. Proceedings of the IEEE International Conference on Automatic Face Gesture Recognition
- Stacked convolutional denoising auto-encoders for feature representation. IEEE Trans. Cybern.
- Robust and discriminative labeling for multi-label active learning based on maximum correntropy criterion. IEEE Trans. Image Process.
- Action recognition with coarse-to-fine deep feature integration and asynchronous fusion. AAAI
- Spectral–spatial unified networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens.
- Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- RPAN: an end-to-end recurrent pose-attention network for action recognition in videos. Proceedings of the IEEE International Conference on Computer Vision
Ye-Ji Kim received the B.S. degree in software engineering from Kwangwoon University, Seoul, Korea, in 2016, and her M.S. degree in computer science and Engineering from Korea University, Seoul, Korea in 2019. Her research interests include computer vision, artificial intelligence and machine learning.
Dong-Gyu Lee received the B.S. degree in computer engineering from Kwangwoon University, Seoul, Korea, in 2011, and his Ph.D. degree in computer science and engineering from Korea University, Seoul, Korea in 2019. His research interests include computer vision, pattern recognition, artificial intelligence, and computational models of vision.
Seong-Whan Lee received his B.S. degree in computer science and statistics from Seoul National University, Seoul, in 1984, and his M.S. and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1986 and 1989, respectively. Currently, he is the Hyundai-Kia Motor Chair Professor and the head of the Department of Artificial Intelligence and the Department of Brain and Cognitive Engineering at Korea University. He is a fellow of the IEEE, IAPR, and the Korea Academy of Science and Technology. His research interests include pattern recognition, artificial intelligence and brain engineering.
1 Equally contributed.