Abstract
Most existing action recognition models are large convolutional neural networks that take only raw RGB frames as input. Practical applications, however, demand lightweight models that operate directly on compressed videos. In this work we develop, for the first time, such a model: one lightweight enough to run in real time on embedded AI devices without sacrificing recognition accuracy. We formulate a new Aligned Temporal Trilinear Pooling (ATTP) module to fuse the three modalities in a compressed video. Because motion vectors represent dynamic content more weakly than optical flow computed from raw RGB streams, we compensate by introducing a temporal fusion method that explicitly induces temporal context, together with knowledge distillation, via feature alignment, from a model trained on optical flow. Compared to existing compressed video action recognition models, ours is far more compact and faster thanks to its lightweight CNN backbone.
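For intuition, the sketch below shows one way the two core ideas named above could be realized in PyTorch: a trilinear fusion of the three compressed-video modality features via a low-rank Hadamard product (generalizing low-rank bilinear pooling to three inputs), and a feature-alignment distillation loss against an optical-flow teacher. Everything here — module names, dimensions, and the low-rank formulation — is an illustrative assumption, not the paper's actual ATTP implementation.

```python
# Minimal sketch, assuming a low-rank Hadamard-product formulation of
# trilinear pooling and MSE-based feature alignment for distillation.
# These are our own assumptions for illustration, not the published module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrilinearPooling(nn.Module):
    """Fuse three per-modality feature vectors (e.g., I-frame, motion
    vector, and residual features) by projecting each into a shared
    rank-dimensional space and multiplying elementwise."""

    def __init__(self, dims, rank=256, out_dim=512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, rank) for d in dims)
        self.out = nn.Linear(rank, out_dim)

    def forward(self, feats):
        # Elementwise (Hadamard) fusion across the projected modalities.
        z = self.proj[0](feats[0])
        for p, f in zip(self.proj[1:], feats[1:]):
            z = z * p(f)
        # L2-normalize before the output projection for stability.
        return self.out(F.normalize(z, dim=-1))


def feature_alignment_loss(student_feat, teacher_feat):
    """Distillation by feature alignment: push the student's motion-vector
    features toward those of a frozen teacher trained on optical flow."""
    return F.mse_loss(student_feat, teacher_feat.detach())


if __name__ == "__main__":
    # Hypothetical feature dimensions for the three modalities.
    attp = TrilinearPooling(dims=[1024, 256, 256])
    i_feat, mv_feat, res_feat = (torch.randn(8, d) for d in (1024, 256, 256))
    fused = attp([i_feat, mv_feat, res_feat])  # shape: (8, 512)
```

The elementwise product keeps the fusion cheap (no outer-product blow-up), which is consistent with the paper's stated goal of a compact, real-time model; the alignment loss would be added to the classification loss during training.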
Acknowledgements
This work was supported by Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098), National Natural Science Foundation of China (61976220 and 61832017), and the Outstanding Innovative Talents Cultivation Funded Programs 2018 of Renmin University of China.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Huo, Y. et al. (2020). Lightweight Action Recognition in Compressed Videos. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol. 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_24
DOI: https://doi.org/10.1007/978-3-030-66096-3_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66095-6
Online ISBN: 978-3-030-66096-3
eBook Packages: Computer Science; Computer Science (R0)