3D Convolutional Networks for Action Recognition: Application to Sport Gesture Recognition

  • Chapter in: Multi-faceted Deep Learning

Abstract

3D convolutional networks are an effective means of performing tasks such as segmenting video into coherent spatio-temporal chunks and classifying those chunks with respect to a target taxonomy. In this chapter we are interested in the classification of continuous video takes containing repeatable actions, such as table tennis strokes. Filmed in a markerless, unconstrained ecological environment, these videos are challenging from both the segmentation and the classification points of view. 3D convnets are an efficient tool for solving these problems with window-based approaches.
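The window-based approach mentioned above can be sketched as follows: a continuous take is cut into fixed-length, overlapping spatio-temporal chunks, each of which a 3D convnet would then classify independently. The window length, stride, and tensor layout below are illustrative assumptions for this sketch, not the settings used in the chapter.

```python
import numpy as np

def sliding_windows(video, window=16, stride=8):
    """Cut a continuous video of shape (T, H, W, C) into overlapping
    spatio-temporal chunks of `window` frames, hopping by `stride`.
    Returns an array of shape (n_windows, window, H, W, C); each chunk
    is the input a 3D convnet would classify independently."""
    n_frames = video.shape[0]
    starts = range(0, n_frames - window + 1, stride)
    return np.stack([video[s:s + window] for s in starts])

# Toy example: 100 frames of 32x32 RGB video.
video = np.zeros((100, 32, 32, 3), dtype=np.float32)
chunks = sliding_windows(video)
print(chunks.shape)  # (11, 16, 32, 32, 3)
```

With a stride smaller than the window, consecutive chunks overlap, so a single stroke falling across a chunk boundary is still seen whole in at least one window; per-chunk class scores can then be aggregated over time to segment and label the full take.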

Notes

  1. https://www.mturk.com/
  2. https://www.csc.kth.se/cvap/actions/
  3. http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html
  4. https://www.cvssp.org/acasva/Downloads.html
  5. https://sdolivia.github.io/FineGym/
  6. https://www.di.ens.fr/~laptev/actions/hollywood2/
  7. www.crcv.ucf.edu/data/UCF_YouTube_Action.php
  8. www.crcv.ucf.edu/data/UCF101.php
  9. www.thumos.info
  10. https://deepmind.com/research/open-source/kinetics
  11. https://research.google.com/ava/
  12. https://deepmind.com/research/open-source/kinetics


Author information

Correspondence to Pierre-Etienne Martin.

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Martin, PE., Benois-Pineau, J., Péteri, R., Zemmari, A., Morlier, J. (2021). 3D Convolutional Networks for Action Recognition: Application to Sport Gesture Recognition. In: Benois-Pineau, J., Zemmari, A. (eds) Multi-faceted Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-74478-6_9

  • DOI: https://doi.org/10.1007/978-3-030-74478-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-74477-9

  • Online ISBN: 978-3-030-74478-6

  • eBook Packages: Computer Science (R0)
