Three-stream fusion network for first-person interaction recognition

doi:10.1016/j.patcog.2020.107279

Pattern Recognition

Volume 103, July 2020, 107279

https://doi.org/10.1016/j.patcog.2020.107279

Highlights

• We propose a novel three-stream framework for first-person interaction recognition.
• The three-stream correlation fusion can consider correlations between a target and a camera wearer.
• The proposed three-stream architecture, combined with the fusion method, achieves superior performance.

Abstract

First-person interaction recognition is a challenging task because of unstable video conditions resulting from the camera wearer’s movement. For human interaction recognition from a first-person viewpoint, this paper proposes a three-stream fusion network with two main parts: three-stream architecture and three-stream correlation fusion. The three-stream architecture captures the characteristics of the target appearance, target motion, and camera ego-motion. Meanwhile, the three-stream correlation fusion combines the feature map of each of the three streams to consider the correlations among the target appearance, target motion, and camera ego-motion. The fused feature vector is robust to the camera movement and compensates for the noise of the camera ego-motion. Short-term intervals are modeled using the fused feature vector, and a long short-term memory (LSTM) model considers the temporal dynamics of the video. We evaluated the proposed method on two public benchmark datasets to validate the effectiveness of our approach. The experimental results show that the proposed fusion method successfully generated a discriminative feature vector, and our network outperformed all competing activity recognition methods in first-person videos where considerable camera ego-motion occurs.

Introduction

Despite the increasing research on computer vision, understanding human activity in videos remains a challenging task. Recent approaches based on deep learning techniques have achieved significant progress in third-person activity recognition [1], [2], [3], [4], [5]. In third-person videos, the camera is fixed and at a large distance from people and objects. Many researchers have also studied human activity recognition in first-person video [6], [7], [8], [9], [10]. First-person video is captured by a camera mounted on a person or object. Such video has a unique characteristic called camera ego-motion, which is not usually seen in third-person video. When the video includes camera ego-motion, the appearance of the target subjects is easily distorted and the motion vectors are disordered. Therefore, the recognition of activities in first-person video requires an approach tailored to these particular characteristics. In this paper, we analyze first-person video frames and focus especially on the interaction between a camera wearer and a human subject.

In Fig. 1, the optical flow extracted from two consecutive RGB frames shows the motion vectors in a third-person video and in a first-person video. As shown in Fig. 1(a), motion vectors in the third-person video appear only around the region where people are positioned because the camera is fixed and stationary. However, Fig. 1(b) shows complex motion vectors for both the target and the camera wearer. The motion vectors appear not only close to the region where the target is positioned but also across the entire optical flow field. Camera ego-motion makes analyzing the appearance and motion characteristics of the target in first-person video more challenging for three reasons: (1) it is difficult to build discriminative motion features from the complex motion vectors in first-person video; (2) target appearance, which is an important feature for analyzing the target’s behavior, is severely distorted because of the camera wearer’s movement; (3) camera ego-motion can occur whenever the camera wearer moves, regardless of the target’s actions, so it can act as noise when first-person interactions are analyzed.
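The contrast between the two flow fields can be made concrete with a small synthetic sketch. It is not taken from the paper: the frame size, the target box, the motion values, and the helper names `make_flow` and `moving_fraction` are all hypothetical, and camera ego-motion is crudely modeled as a single global shift added to every pixel.

```python
import math

def make_flow(height, width, target_box, target_motion, ego_motion=(0.0, 0.0)):
    """Build a synthetic dense flow field as rows of (dx, dy) vectors.

    Pixels inside target_box move with target_motion; every pixel also
    receives the global ego_motion shift (an assumed, simplified camera model).
    """
    top, left, bottom, right = target_box
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = ego_motion
            if top <= y < bottom and left <= x < right:
                dx += target_motion[0]
                dy += target_motion[1]
            row.append((dx, dy))
        field.append(row)
    return field

def moving_fraction(field, threshold=0.5):
    """Fraction of pixels whose flow magnitude exceeds threshold."""
    total = sum(len(row) for row in field)
    moving = sum(
        1 for row in field for dx, dy in row if math.hypot(dx, dy) > threshold
    )
    return moving / total

box = (2, 2, 6, 6)  # hypothetical 4x4 target region in a 10x10 frame
third_person = make_flow(10, 10, box, target_motion=(2.0, 0.0))
first_person = make_flow(10, 10, box, target_motion=(2.0, 0.0),
                         ego_motion=(1.5, -1.0))

print(moving_fraction(third_person))  # 0.16: motion only around the target
print(moving_fraction(first_person))  # 1.0: motion across the whole frame
```

With a fixed camera only the target region carries motion, whereas the added global shift makes every pixel appear to move, which is why target motion is harder to isolate in first-person video.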

This paper proposes a three-stream fusion network to recognize interactions in first-person video where large amounts of camera ego-motion occur. The proposed method is composed of two main parts: the three-stream architecture and three-stream correlation fusion (TSCF). The three-stream architecture consists of a target appearance stream, a target motion stream, and an ego-motion stream. The target appearance stream and the target motion stream capture the appearance features and motion features of the target, respectively, and the ego-motion stream captures features of the camera wearer’s movement. The camera ego-motion is an important clue to the camera wearer’s movement because the wearer’s pose in first-person video is unknown. To generate features that are robust to the camera ego-motion, the TSCF considers two types of correlations. First, it considers the correlation between the target appearance and motion to utilize the spatiotemporal relationship. This relationship complements the target’s appearance and motion features, which are distorted by the camera wearer’s movement. Second, to address the problem of the camera wearer’s movement being noisy, the TSCF uses the correlation between the target and camera ego-motion to determine whether the camera wearer’s movement is caused by the target’s action. In other words, we can determine whether the camera ego-motion is an important clue for analyzing first-person interactions between the target and the camera wearer.
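The two correlation types can be illustrated with a toy sketch. The exact TSCF formulation is not given in this excerpt, so everything below is an assumption made for illustration only: features are flat lists rather than feature maps, the appearance-motion correlation is an element-wise product, the target-ego correlation is a cosine similarity used as a gate, and the name `toy_fusion` is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def toy_fusion(appearance, motion, ego):
    """Toy stand-in for correlation fusion (NOT the paper's TSCF).

    Step 1 (assumed): correlate target appearance and motion element-wise.
    Step 2 (assumed): weight the ego-motion features by how strongly they
    agree with the target features, so unrelated ego-motion is suppressed.
    """
    target = [a * m for a, m in zip(appearance, motion)]
    relevance = max(0.0, cosine(target, ego))
    gated_ego = [relevance * e for e in ego]
    return target + gated_ego, relevance

appearance = [1.0, 2.0, 0.0]
motion = [2.0, 1.0, 0.0]

# Ego-motion aligned with the target features keeps its full weight.
fused, relevance = toy_fusion(appearance, motion, ego=[1.0, 1.0, 0.0])

# Ego-motion orthogonal to the target features is gated to zero.
fused_noise, relevance_noise = toy_fusion(appearance, motion, ego=[0.0, 0.0, 3.0])
```

In this toy setting the gate plays the role described above: ego-motion that correlates with the target's features is treated as a clue, while unrelated ego-motion contributes nothing to the fused vector.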

The main contributions of this paper are summarized as follows: (1) we propose a novel deep learning framework called the three-stream fusion network. The proposed network is specialized in extracting discriminative features and considers camera ego-motion an important feature for analyzing the camera wearer’s movement; (2) we also introduce a fusion method called TSCF, which considers the correlations among the target’s appearance, the target’s motion, and the camera ego-motion. The proposed fusion method creates robust features that mitigate the effects of the camera ego-motion; (3) we show that our proposed method outperforms state-of-the-art activity recognition methods on the JPL First-Person Interaction dataset and the UTKinect-FirstPerson dataset.

Section snippets

Related works

There has been a great deal of progress in human activity recognition in video captured from a third-person viewpoint. Early work contributed hand-craft features to feature representation for activity recognition [11], [12], [13], [14], [15], [16], [17], [18]. Some studies suggested various methods, such as support vector machine (SVM) [19], [20], unsupervised learning [21], and multi-label learning [22] to improve recognition performance. In more recent research, a significant performance

Three-stream fusion network

An overview of our three-stream fusion network is shown in Fig. 2. Our proposed network consists of the three-stream architecture, TSCF, and an LSTM model. Our three-stream architecture captures features for the target’s appearance, motion, and the camera ego-motion. The output feature maps from the three streams are combined by the proposed TSCF. The LSTM model classifies the video considering the temporal dynamics of TSCF features.
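The final stage, an LSTM running over the sequence of fused features, can be sketched with the standard LSTM cell equations. This is a minimal illustration and not the authors' model: it uses a single hidden unit, plain Python lists, and hypothetical hand-set weights in `params`; a practical implementation would use a deep learning framework with learned weights and a classification layer on top.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, params):
    """One step of a single-unit LSTM cell with vector input x.

    params maps each gate name ("i", "f", "o", "g") to a tuple
    (input_weights, recurrent_weight, bias).
    """
    def pre_activation(gate):
        w, u, b = params[gate]
        return sum(wi * xi for wi, xi in zip(w, x)) + u * h + b

    i = sigmoid(pre_activation("i"))    # input gate
    f = sigmoid(pre_activation("f"))    # forget gate
    o = sigmoid(pre_activation("o"))    # output gate
    g = math.tanh(pre_activation("g"))  # candidate cell value
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

def run_sequence(features, params):
    """Feed a sequence of fused feature vectors through the cell."""
    h, c = 0.0, 0.0
    for x in features:
        h, c = lstm_step(x, h, c, params)
    return h

# Hypothetical weights: every gate sums its two inputs, no recurrence or bias.
params = {gate: ([1.0, 1.0], 0.0, 0.0) for gate in ("i", "f", "o", "g")}

# Each inner list stands for the fused feature of one short-term interval.
positive_clip = [[1.0, 1.0], [0.5, 1.0], [1.0, 0.5]]
negative_clip = [[-1.0, -1.0], [-0.5, -1.0], [-1.0, -0.5]]

h_pos = run_sequence(positive_clip, params)
h_neg = run_sequence(negative_clip, params)
```

The final hidden state summarizes the whole sequence of intervals; in a full model it would be passed to a classifier over the interaction classes.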

Datasets

The proposed three-stream fusion network and TSCF were evaluated on two public first-person video datasets: the UTKinect-FirstPerson dataset [46] and the JPL First-Person Interaction dataset [9]. Both datasets were proposed for first-person activity recognition. Recognition performance was evaluated on each dataset separately.

UTKinect-FirstPerson Dataset (humanoid). In this dataset, eight subjects interact with a humanoid robot on which a Kinect sensor is mounted. The human performs

Conclusion

We proposed a three-stream fusion network for interaction recognition in first-person videos where camera ego-motion occurs. The TSCF was introduced to consider the correlations of target appearance, target motion, and camera ego-motion features. The proposed three-stream fusion network with the TSCF successfully classified the first-person interaction video clips by means of robust video feature vectors that mitigate the effects of the camera’s movement. The proposed method showed a

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2019-0-00079, Department of Artificial Intelligence, Korea University] and [No. 2014-0-00059, Development of Predictive Visual Intelligence Technology].

Ye-Ji Kim received the B.S. degree in software engineering from Kwangwoon University, Seoul, Korea, in 2016, and her M.S. degree in computer science and engineering from Korea University, Seoul, Korea in 2019. Her research interests include computer vision, artificial intelligence and machine learning.

References (55)

H. Kwon et al., First person action recognition via two-stream convnet with long-term fusion pooling, Pattern Recognit. Lett. (2018)
S.-C. Park et al., Qualitative estimation of camera motion parameters from the linear composition of optical flow, Pattern Recognit. (2004)
M.-C. Roh et al., View-independent human action recognition with volume motion template on single stereo camera, Pattern Recognit. Lett. (2010)
S. Singh et al., Trajectory aligned features for first person action recognition, Pattern Recognit. (2017)
I.C. Duta et al., Spatio-temporal vector of locally max pooled features for action recognition in videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
C. Feichtenhofer et al., Spatiotemporal residual networks for video action recognition, Advances in Neural Information Processing Systems (2016)
C. Feichtenhofer et al., Convolutional two-stream network fusion for video action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
K. Simonyan et al., Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems (2014)
Y. Wang et al., Spatiotemporal pyramid network for video action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
K.M. Kitani et al., Fast unsupervised ego-action learning for first-person sports videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2011)
M. Ma et al., Going deeper into first-person activity recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
M.S. Ryoo et al., First-person activity recognition: what are they doing to me?, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
H.F. Zaki et al., Modeling sub-event dynamics in first-person action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
N. Dalal et al., Histograms of oriented gradients for human detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2005)
I. Laptev et al., Learning realistic human actions from movies, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2008)
L. Sun et al., Human action recognition using factorized spatio-temporal convolutional networks, Proceedings of the IEEE International Conference on Computer Vision (2015)
H.-I. Suk et al., Recognizing hand gestures using dynamic Bayesian network, Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (2008)
H.-K. Roh et al., Multiple people tracking using an appearance model based on temporal color, Proceedings of the International Workshop on Biologically Motivated Computer Vision (2000)
H.-I. Suk et al., A network of dynamic probabilistic models for human interaction analysis, IEEE Trans. Circuits Syst. Video Technol. (2011)
C. Schuldt et al., Recognizing human actions: a local SVM approach, Proceedings of the IEEE International Conference on Pattern Recognition (2004)
D. Xi et al., Facial component extraction and face recognition with support vector machines, Proceedings of the IEEE International Conference on Automatic Face Gesture Recognition (2002)
B. Du et al., Stacked convolutional denoising auto-encoders for feature representation, IEEE Trans. Cybern. (2016)
B. Du et al., Robust and discriminative labeling for multi-label active learning based on maximum correntropy criterion, IEEE Trans. Image Process. (2017)
W. Lin et al., Action recognition with coarse-to-fine deep feature integration and asynchronous fusion, AAAI (2018)
Y. Xu et al., Spectral–spatial unified networks for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. (2018)
J. Donahue et al., Long-term recurrent convolutional networks for visual recognition and description, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
W. Du et al., RPAN: an end-to-end recurrent pose-attention network for action recognition in videos, Proceedings of the IEEE International Conference on Computer Vision (2017)

Dong-Gyu Lee received the B.S. degree in computer engineering from Kwangwoon University, Seoul, Korea, in 2011, and his Ph.D. degree in computer science and engineering from Korea University, Seoul, Korea in 2019. His research interests include computer vision, pattern recognition, artificial intelligence, and computational models of vision.

Seong-Whan Lee received his B.S. degree in computer science and statistics from Seoul National University, Seoul, in 1984, and his M.S. and Ph.D. degrees in computer science from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1986 and 1989, respectively. Currently, he is the Hyundai-Kia Motor Chair Professor and the head of the Department of Artificial Intelligence and the Department of Brain and Cognitive Engineering at Korea University. He is a fellow of the IEEE, IAPR, and the Korea Academy of Science and Technology. His research interests include pattern recognition, artificial intelligence and brain engineering.

¹: Equally contributed.
