A comprehensive solution for detecting events in complex surveillance videos

Multimedia Tools and Applications

Abstract

Event detection has long been a fundamental problem in the computer vision community. Various datasets for recognizing human events and activities, such as UCF101 and HMDB51, have been proposed to help develop better models and methods. These datasets share the property that either predefined scripts are provided or the images are largely actor-oriented with little background noise. These properties, however, differ markedly from those of surveillance event detection, so solutions that are effective on these datasets do not transfer well to it. Event detection in complex surveillance video is a much more difficult task with several challenges: heavy occlusion between pedestrians, low image resolution, and uncontrolled scene conditions. The TRECVID-SED evaluation, which aims at detecting events in a highly crowded airport, is well known for its difficulty. To deal with event detection in realistic scenes such as TRECVID-SED, we introduce a comprehensive solution framework based on pedestrian detection, deep key-pose detection, and trajectory analysis. Specifically, instead of detecting the whole body of a person, we detect the head-shoulder region of each pedestrian, addressing the issue of heavy occlusion among pedestrians in complex scenes. We also propose a trajectory-based event detection method to better focus on the key actors of events. For events with discriminative poses, we model event detection as key-pose detection by taking advantage of Faster R-CNN. The presented framework achieves the best result in the TRECVID-SED 2016 evaluation.
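To make the trajectory-based idea concrete, the following is a minimal sketch of how per-frame head-shoulder boxes might be linked into trajectories and then screened by a simple motion statistic. The abstract does not specify the association algorithm or the event rules, so everything here is an assumption for illustration: the greedy IoU linking, the names `iou`, `Track`, and `link_detections`, the `min_iou` parameter, and the speed threshold are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    """One trajectory: the sequence of boxes assigned to a single person."""
    boxes: List[Box] = field(default_factory=list)

    def mean_speed(self) -> float:
        """Average displacement of the box center between consecutive frames."""
        centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in self.boxes]
        if len(centers) < 2:
            return 0.0
        steps = [
            ((bx - ax) ** 2 + (by - ay) ** 2) ** 0.5
            for (ax, ay), (bx, by) in zip(centers, centers[1:])
        ]
        return sum(steps) / len(steps)


def link_detections(frames: List[List[Box]], min_iou: float = 0.3) -> List[Track]:
    """Greedily extend each active track with its best-overlapping box.

    This is an assumed, simplified association step; a track ends as soon
    as no box in the next frame overlaps it by at least `min_iou`.
    """
    tracks: List[Track] = []
    active: List[Track] = []
    for boxes in frames:
        unmatched = list(boxes)
        next_active: List[Track] = []
        for track in active:
            if not unmatched:
                break  # no detections left; remaining tracks end here
            best = max(unmatched, key=lambda b: iou(track.boxes[-1], b))
            if iou(track.boxes[-1], best) >= min_iou:
                track.boxes.append(best)
                unmatched.remove(best)
                next_active.append(track)
        for box in unmatched:  # every leftover detection starts a new track
            new_track = Track([box])
            tracks.append(new_track)
            next_active.append(new_track)
        active = next_active
    return tracks


# Two people over three frames: one moves 1 px per frame, the other 4 px.
frames = [
    [(0, 0, 10, 10), (100, 0, 110, 10)],
    [(1, 0, 11, 10), (104, 0, 114, 10)],
    [(2, 0, 12, 10), (108, 0, 118, 10)],
]
tracks = link_detections(frames)
# Hypothetical rule: flag trajectories faster than 3 px per frame.
fast_tracks = [t for t in tracks if t.mean_speed() > 3.0]
```

In this toy input only the second person is flagged. A real system would need a more robust association step and event-specific trajectory features; the sketch only shows why working on trajectories lets later stages concentrate on a few key actors instead of every pixel in a crowded frame.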

Author information

Corresponding author

Correspondence to Yandong Zhu.

Additional information

This work is supported by the Key Laboratory of Forensic Marks, Ministry of Public Security, Beijing, China, and the Chinese National Natural Science Foundation (61532018, 61372169, 61471049).

About this article

Cite this article

Zhu, Y., Zhou, K., Wang, M. et al. A comprehensive solution for detecting events in complex surveillance videos. Multimed Tools Appl 78, 817–838 (2019). https://doi.org/10.1007/s11042-018-6163-6

  • DOI: https://doi.org/10.1007/s11042-018-6163-6
