3D Convolutional Networks for Action Recognition: Application to Sport Gesture Recognition

  • Chapter in: Multi-faceted Deep Learning

Abstract

3D convolutional networks are an effective means of performing tasks such as segmenting video into coherent spatio-temporal chunks and classifying those chunks with respect to a target taxonomy. In this chapter we are interested in the classification of continuous video takes containing repeatable actions, such as table tennis strokes. Filmed in a markerless, unconstrained ecological environment, these videos are challenging from both the segmentation and the classification points of view. 3D convnets are an efficient tool for solving these problems with window-based approaches.
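The window-based approach mentioned above can be sketched as follows: a continuous take is cut into fixed-length, overlapping spatio-temporal chunks, each of which a 3D convnet would then classify independently. The window length, stride, and tensor layout below are illustrative assumptions for this sketch, not the settings used in the chapter.

```python
import numpy as np

def sliding_windows(video, window=16, stride=8):
    """Cut a continuous video of shape (T, H, W, C) into overlapping
    spatio-temporal chunks of `window` frames, hopping by `stride`.
    Returns an array of shape (n_windows, window, H, W, C); each chunk
    is the input a 3D convnet would classify independently."""
    n_frames = video.shape[0]
    starts = range(0, n_frames - window + 1, stride)
    return np.stack([video[s:s + window] for s in starts])

# Toy example: 100 frames of 32x32 RGB video.
video = np.zeros((100, 32, 32, 3), dtype=np.float32)
chunks = sliding_windows(video)
print(chunks.shape)  # (11, 16, 32, 32, 3)
```

With a stride smaller than the window, consecutive chunks overlap, so a single stroke falling across a chunk boundary is still seen whole in at least one window; per-chunk class scores can then be aggregated over time to segment and label the full take.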

Notes

  1. https://www.mturk.com/
  2. https://www.csc.kth.se/cvap/actions/
  3. http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html
  4. https://www.cvssp.org/acasva/Downloads.html
  5. https://sdolivia.github.io/FineGym/
  6. https://www.di.ens.fr/~laptev/actions/hollywood2/
  7. www.crcv.ucf.edu/data/UCF_YouTube_Action.php
  8. www.crcv.ucf.edu/data/UCF101.php
  9. www.thumos.info
  10. https://deepmind.com/research/open-source/kinetics
  11. https://research.google.com/ava/
  12. https://deepmind.com/research/open-source/kinetics


Author information

Correspondence to Pierre-Etienne Martin.

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Martin, PE., Benois-Pineau, J., Péteri, R., Zemmari, A., Morlier, J. (2021). 3D Convolutional Networks for Action Recognition: Application to Sport Gesture Recognition. In: Benois-Pineau, J., Zemmari, A. (eds) Multi-faceted Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-74478-6_9

  • DOI: https://doi.org/10.1007/978-3-030-74478-6_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-74477-9

  • Online ISBN: 978-3-030-74478-6

  • eBook Packages: Computer Science (R0)
