Three-stream fusion network for first-person interaction recognition

https://doi.org/10.1016/j.patcog.2020.107279

Highlights

  • We propose a novel three-stream framework for first-person interaction recognition.

  • The three-stream correlation fusion can consider correlations between a target and a camera wearer.

  • The proposed three-stream architecture with the fusion method achieves superior recognition performance.

Abstract

First-person interaction recognition is a challenging task because of the unstable video conditions resulting from the camera wearer’s movement. For human interaction recognition from a first-person viewpoint, this paper proposes a three-stream fusion network with two main parts: a three-stream architecture and three-stream correlation fusion. The three-stream architecture captures the characteristics of the target appearance, target motion, and camera ego-motion. Meanwhile, the three-stream correlation fusion combines the feature maps of the three streams to consider the correlations among the target appearance, target motion, and camera ego-motion. The fused feature vector is robust to camera movement and compensates for the noise of the camera ego-motion. Short-term intervals are modeled using the fused feature vector, and a long short-term memory (LSTM) model considers the temporal dynamics of the video. We evaluated the proposed method on two public benchmark datasets to validate the effectiveness of our approach. The experimental results show that the proposed fusion method successfully generated a discriminative feature vector, and our network outperformed all competing activity recognition methods on first-person videos in which considerable camera ego-motion occurs.

Introduction

Despite increasing research in computer vision, understanding human activity in videos remains a challenging task. Recent approaches based on deep learning have achieved significant progress in third-person activity recognition [1], [2], [3], [4], [5]. In third-person videos, the camera is fixed and placed at a large distance from people and objects. Many researchers have also studied human activity recognition in first-person video [6], [7], [8], [9], [10]. First-person video is captured by a camera mounted on a person or object. Such video has a unique characteristic called camera ego-motion, which is not usually seen in third-person video. When the video includes camera ego-motion, the appearance of the target subjects is easily distorted and the motion vectors become disordered. Therefore, recognizing activities in first-person video requires an approach tailored to these particular characteristics. In this paper, we analyze first-person video frames and focus especially on the interaction between a camera wearer and a human subject.

In Fig. 1, the optical flow extracted from two consecutive RGB frames shows different motion vectors in the third-person and first-person videos. As shown in Fig. 1(a), motion vectors in the third-person video appear around the region where the people are positioned because the camera is fixed and stationary. In contrast, Fig. 1(b) shows complex motion vectors arising from both the target and the camera wearer: motion vectors appear not only near the region where the target is positioned but also across the entire optical flow field. Camera ego-motion makes it more challenging to analyze the appearance and motion characteristics of the target in first-person video for three reasons: (1) it is difficult to build discriminative motion features from the complex motion vectors in first-person video; (2) the target appearance, which is an important feature for analyzing the target’s behavior, is severely distorted by the camera wearer’s movement; (3) camera ego-motion can occur whenever the camera wearer moves, regardless of the target’s actions, so it can be a noisy signal when first-person interactions are analyzed.
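As a concrete illustration of the motion vectors discussed above, the following is a minimal sketch of extracting dense optical flow from two consecutive RGB frames. The Farneback method and the parameter values are choices made here purely for illustration; the paper does not specify which flow algorithm underlies Fig. 1.

```python
# Minimal sketch: dense optical flow between two consecutive RGB frames.
# The algorithm and parameters are illustrative assumptions, not the paper's setup.
import cv2
import numpy as np

def dense_flow(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    """Return an (H, W, 2) array of per-pixel motion vectors (dx, dy)."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow

# In a first-person clip, large flow magnitudes spread over the whole frame
# suggest camera ego-motion, while localized flow suggests target motion.
```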

This paper proposes a three-stream fusion network to recognize interactions in first-person video in which large amounts of camera ego-motion occur. The proposed method is composed of two main parts: the three-stream architecture and three-stream correlation fusion (TSCF). The three-stream architecture consists of a target appearance stream, a target motion stream, and an ego-motion stream. The target appearance stream and the target motion stream capture the appearance and motion features of the target, respectively, and the ego-motion stream captures features of the camera wearer’s movement. The camera ego-motion is an important clue to the camera wearer’s movement because the wearer’s pose is unknown in first-person video. To generate features that are robust to the camera ego-motion, the TSCF considers two types of correlations. First, it considers the correlation between the target appearance and motion to exploit their spatiotemporal relationship; this relationship complements the target’s appearance and motion features, which are distorted by the camera wearer’s movement. Second, to address the noise introduced by the camera wearer’s movement, the TSCF uses the correlation between the target and the camera ego-motion to determine whether the camera wearer’s movement is caused by the target’s action. In other words, we can determine whether the camera ego-motion is an important clue for analyzing the first-person interaction between the target and the camera wearer.
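To make the role of the correlation fusion more concrete, the following is a minimal PyTorch sketch of combining three per-stream feature maps through pairwise correlation terms. The exact fusion operator of the TSCF is not reproduced on this page; the element-wise products between pooled stream features, the module name ToyTSCF, and the feature dimensions are assumptions made purely for illustration.

```python
# Minimal sketch, assuming each stream outputs a (batch, channels, H, W) feature map.
# The correlation terms below are assumed forms, not the paper's exact TSCF operator.
import torch
import torch.nn as nn

class ToyTSCF(nn.Module):
    def __init__(self, channels: int, out_dim: int):
        super().__init__()
        # 5 components: 3 stream features + 2 correlation terms.
        self.project = nn.Linear(5 * channels, out_dim)

    def forward(self, f_app, f_mot, f_ego):
        pool = lambda x: x.mean(dim=(2, 3))           # global average pooling
        a, m, e = pool(f_app), pool(f_mot), pool(f_ego)
        corr_am = a * m        # appearance-motion correlation (assumed form)
        corr_te = (a + m) * e  # target vs. camera ego-motion correlation (assumed form)
        fused = torch.cat([a, m, e, corr_am, corr_te], dim=1)
        return self.project(fused)                    # fused feature vector
```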

The main contributions of this paper are summarized as follows: (1) we propose a novel deep learning framework called the three-stream fusion network. The proposed network is specialized for extracting discriminative features and treats camera ego-motion as an important feature for analyzing the camera wearer’s movement; (2) we introduce a fusion method called TSCF, which considers the correlations among the target’s appearance, the target’s motion, and the camera ego-motion. The proposed fusion method creates robust features that mitigate the effects of camera ego-motion; (3) we show that our proposed method outperforms state-of-the-art activity recognition methods on the JPL First-Person Interaction dataset and the UTKinect-FirstPerson dataset.

Section snippets

Related works

There has been a great deal of progress in human activity recognition in video captured from a third-person viewpoint. Early work contributed hand-crafted features for feature representation in activity recognition [11], [12], [13], [14], [15], [16], [17], [18]. Some studies suggested various methods, such as support vector machines (SVM) [19], [20], unsupervised learning [21], and multi-label learning [22], to improve recognition performance. In more recent research, a significant performance

Three-stream fusion network

An overview of our three-stream fusion network is shown in Fig. 2. Our proposed network consists of the three-stream architecture, the TSCF, and an LSTM model. The three-stream architecture captures features for the target’s appearance, the target’s motion, and the camera ego-motion. The output feature maps from the three streams are combined by the proposed TSCF. The LSTM model classifies the video by considering the temporal dynamics of the TSCF features.
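The rough sketch below shows how the pieces in Fig. 2 could fit together: three convolutional streams, correlation fusion per short-term interval, and an LSTM over the fused sequence. The toy backbones, feature sizes, number of classes, and the ToyTSCF module (from the earlier sketch) are placeholders under stated assumptions, not the paper’s actual configuration.

```python
# Rough pipeline sketch: three CNN streams -> TSCF fusion per time step -> LSTM classifier.
# ToyTSCF is the illustrative fusion module defined in the earlier sketch.
import torch
import torch.nn as nn

class ThreeStreamFusionNet(nn.Module):
    def __init__(self, channels=256, fused_dim=512, hidden=512, num_classes=7):
        super().__init__()
        def backbone(in_ch):  # stand-in for a CNN stream (e.g., a pretrained network)
            return nn.Sequential(nn.Conv2d(in_ch, channels, 3, stride=2, padding=1), nn.ReLU())
        self.app_stream = backbone(3)   # RGB frames
        self.mot_stream = backbone(2)   # target motion flow (2-channel)
        self.ego_stream = backbone(2)   # camera ego-motion flow (2-channel)
        self.tscf = ToyTSCF(channels, fused_dim)
        self.lstm = nn.LSTM(fused_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, rgb, target_flow, ego_flow):
        # Inputs: (batch, T, C, H, W) short-term interval inputs for each stream.
        B, T = rgb.shape[:2]
        fused = []
        for t in range(T):
            f = self.tscf(self.app_stream(rgb[:, t]),
                          self.mot_stream(target_flow[:, t]),
                          self.ego_stream(ego_flow[:, t]))
            fused.append(f)
        seq = torch.stack(fused, dim=1)       # (batch, T, fused_dim)
        out, _ = self.lstm(seq)               # temporal dynamics over fused vectors
        return self.classifier(out[:, -1])    # class scores for the clip
```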

Datasets

The proposed three-stream fusion network and TSCF were evaluated on two public first-person video datasets: the UTKinect-FirstPerson dataset [46] and the JPL First-Person Interaction dataset [9]. Both datasets were created for first-person activity recognition, and recognition performance was evaluated on each of them.

UTKinect-FirstPerson Dataset (humanoid). In this dataset, eight subjects interact with a humanoid robot on which a Kinect sensor is mounted. The human performs

Conclusion

We proposed a three-stream fusion network for interaction recognition in first-person videos where camera ego-motion occurs. The TSCF was introduced to consider the correlations of target appearance, target motion, and camera ego-motion features. The proposed three-stream fusion network with the TSCF successfully classified the first-person interaction video clips by means of robust video feature vectors that mitigate the effects of the camera’s movement. The proposed method showed a

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2019-0-00079, Department of Artificial Intelligence, Korea University] and [No. 2014-0-00059, Development of Predictive Visual Intelligence Technology].

References (55)

  • M. Ma et al., Going deeper into first-person activity recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • M.S. Ryoo et al., First-person activity recognition: what are they doing to me?, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • H.F. Zaki et al., Modeling sub-event dynamics in first-person action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • N. Dalal et al., Histograms of oriented gradients for human detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2005)
  • I. Laptev et al., Learning realistic human actions from movies, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2008)
  • L. Sun et al., Human action recognition using factorized spatio-temporal convolutional networks, Proceedings of the IEEE International Conference on Computer Vision (2015)
  • H.-I. Suk et al., Recognizing hand gestures using dynamic Bayesian network, Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (2008)
  • H.-K. Roh et al., Multiple people tracking using an appearance model based on temporal color, Proceedings of the International Workshop on Biologically Motivated Computer Vision (2000)
  • H.-I. Suk et al., A network of dynamic probabilistic models for human interaction analysis, IEEE Trans. Circuits Syst. Video Technol. (2011)
  • C. Schuldt et al., Recognizing human actions: a local SVM approach, Proceedings of the IEEE International Conference on Pattern Recognition (2004)
  • D. Xi et al., Facial component extraction and face recognition with support vector machines, Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (2002)
  • B. Du et al., Stacked convolutional denoising auto-encoders for feature representation, IEEE Trans. Cybern. (2016)
  • B. Du et al., Robust and discriminative labeling for multi-label active learning based on maximum correntropy criterion, IEEE Trans. Image Process. (2017)
  • W. Lin et al., Action recognition with coarse-to-fine deep feature integration and asynchronous fusion, AAAI (2018)
  • Y. Xu et al., Spectral–spatial unified networks for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. (2018)
  • J. Donahue et al., Long-term recurrent convolutional networks for visual recognition and description, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • W. Du et al., RPAN: an end-to-end recurrent pose-attention network for action recognition in videos, Proceedings of the IEEE International Conference on Computer Vision (2017)

Ye-Ji Kim received the B.S. degree in software engineering from Kwangwoon University, Seoul, Korea, in 2016, and her M.S. degree in computer science and engineering from Korea University, Seoul, Korea, in 2019. Her research interests include computer vision, artificial intelligence, and machine learning.

Dong-Gyu Lee received the B.S. degree in computer engineering from Kwangwoon University, Seoul, Korea, in 2011, and his Ph.D. degree in computer science and engineering from Korea University, Seoul, Korea, in 2019. His research interests include computer vision, pattern recognition, artificial intelligence, and computational models of vision.

Seong-Whan Lee received his B.S. degree in computer science and statistics from Seoul National University, Seoul, in 1984, and his M.S. and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1986 and 1989, respectively. Currently, he is the Hyundai-Kia Motor Chair Professor and the head of the Department of Artificial Intelligence and the Department of Brain and Cognitive Engineering at Korea University. He is a fellow of the IEEE, IAPR, and the Korea Academy of Science and Technology. His research interests include pattern recognition, artificial intelligence, and brain engineering.

1 Equally contributed.
