DOI: 10.1145/3460426.3463591
short-paper

Spatio-Temporal Activity Detection and Recognition in Untrimmed Surveillance Videos

Published: 01 September 2021

ABSTRACT

This work presents a spatio-temporal activity detection and recognition framework for untrimmed surveillance videos, built as a three-step pipeline: object detection, tracking, and activity recognition. The framework relies on the YOLO v4 architecture for object detection and on Euclidean distance for tracking, while the activity recognizer uses a 3D convolutional deep learning architecture that employs spatio-temporal boundaries and treats recognition as multi-label classification. Evaluation experiments on the VIRAT dataset show accurate detection of temporal boundaries and recognition of activities in untrimmed videos, with the multi-label formulation outperforming multi-class activity recognition.
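
To make two of the ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "Euclidean distance for tracking" can be approximated by greedily linking each detected box to the nearest existing track centroid, and that the difference between multi-class and multi-label recognition lies in the decision rule applied to the recognizer's per-activity scores. All names (`match_detections`, `max_dist`, `multi_label`, `threshold`) and the default values are hypothetical and chosen only for illustration.

```python
import math


def centroid(box):
    """Centre point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def match_detections(tracks, boxes, max_dist=50.0):
    """Greedy nearest-centroid association for one frame.

    tracks: dict mapping track id -> last known centroid.
    boxes:  list of (x1, y1, x2, y2) detections in the current frame.
    Returns (updated_tracks, ids), where ids[i] is the track id given to
    boxes[i]. Tracks with no match in this frame are dropped, which is a
    simplification made for this sketch.
    """
    next_id = max(tracks, default=-1) + 1
    # Every (distance, track id, box index) pair, closest first.
    candidates = sorted(
        (math.dist(c, centroid(b)), tid, i)
        for tid, c in tracks.items()
        for i, b in enumerate(boxes)
    )
    assigned, used_tracks = {}, set()
    for dist, tid, i in candidates:
        if dist > max_dist:  # assumed gating threshold, in pixels
            break
        if tid in used_tracks or i in assigned:
            continue
        assigned[i] = tid
        used_tracks.add(tid)

    updated, ids = {}, []
    for i, b in enumerate(boxes):
        tid = assigned.get(i)
        if tid is None:  # unmatched detection starts a new track
            tid = next_id
            next_id += 1
        updated[tid] = centroid(b)
        ids.append(tid)
    return updated, ids


def multi_class(logits):
    """Multi-class rule: exactly one activity, the highest-scoring one."""
    return max(range(len(logits)), key=lambda k: logits[k])


def multi_label(logits, threshold=0.5):
    """Multi-label rule: every activity whose sigmoid score passes the threshold."""
    return [k for k, z in enumerate(logits)
            if 1.0 / (1.0 + math.exp(-z)) >= threshold]


if __name__ == "__main__":
    tracks, ids = match_detections({}, [(0, 0, 10, 10), (100, 100, 120, 120)])
    tracks, ids = match_detections(tracks, [(102, 101, 122, 121), (2, 1, 12, 11)])
    print(ids)
    print(multi_class([2.0, -1.0, 0.5]), multi_label([2.0, -1.0, 0.5]))
```

In this sketch, a tracked object that performs two activities at once can receive both labels under the multi-label rule, whereas the multi-class rule must pick a single one.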

    • Published in

      ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
      August 2021
      715 pages
      ISBN: 9781450384636
      DOI: 10.1145/3460426

      Copyright © 2021 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 254 of 830 submissions, 31%
