
LSTM-based multi-label video event detection

Published in: Multimedia Tools and Applications

Abstract

Since large-scale surveillance videos often contain complex visual events, generating video descriptions effectively and efficiently without human supervision has become essential. To address this problem, we propose a novel architecture, motivated by the sequence-to-sequence network, for jointly recognizing multiple events in a given surveillance video. The proposed architecture predicts what happens in a video directly, without the preprocessing steps of object detection and tracking. We evaluate several variants of the proposed architecture with different visual features on a novel dataset prepared by our group. Moreover, we compute a wide range of quantitative metrics to evaluate this architecture, and we further compare it to the popular Support Vector Machine-based visual event detection method. The comparison results suggest that the proposed method can outperform traditional computer vision pipelines for visual event detection.
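The pipeline the abstract describes (per-frame visual features fed through an LSTM, followed by a multi-label prediction over event classes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every class name, dimension, and threshold here is an assumption, and the forward pass implements only a plain single-layer LSTM with a sigmoid output head rather than the paper's full encoder-decoder.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMEventDetector:
    """Hypothetical sketch: an LSTM over per-frame features with a
    sigmoid multi-label head (one independent score per event class)."""

    def __init__(self, feat_dim, hidden_dim, num_events, seed=0):
        rng = np.random.default_rng(seed)
        z_dim = feat_dim + hidden_dim  # input is [frame feature ; previous hidden]
        # One weight matrix and bias per gate: input (i), forget (f),
        # output (o), and candidate cell state (g).
        self.W = {g: rng.normal(0.0, 0.1, (hidden_dim, z_dim)) for g in "ifog"}
        self.b = {g: np.zeros(hidden_dim) for g in "ifog"}
        self.W_out = rng.normal(0.0, 0.1, (num_events, hidden_dim))
        self.b_out = np.zeros(num_events)
        self.hidden_dim = hidden_dim

    def forward(self, frame_feats):
        """frame_feats: array of shape (num_frames, feat_dim).
        Returns one score in (0, 1) per event class."""
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        for x in frame_feats:
            z = np.concatenate([x, h])
            i = sigmoid(self.W["i"] @ z + self.b["i"])  # input gate
            f = sigmoid(self.W["f"] @ z + self.b["f"])  # forget gate
            o = sigmoid(self.W["o"] @ z + self.b["o"])  # output gate
            g = np.tanh(self.W["g"] @ z + self.b["g"])  # candidate cell state
            c = f * c + i * g
            h = o * np.tanh(c)
        # Sigmoid (not softmax) so multiple events can fire at once.
        return sigmoid(self.W_out @ h + self.b_out)

if __name__ == "__main__":
    det = LSTMEventDetector(feat_dim=8, hidden_dim=4, num_events=3)
    frames = np.random.default_rng(1).normal(size=(5, 8))
    scores = det.forward(frames)
    detected = np.where(scores > 0.5)[0]  # events whose score exceeds an assumed 0.5 threshold
```

The key design point, which does follow from the abstract, is the independent per-class sigmoid at the output: unlike a softmax classifier, it lets the model assert several co-occurring events for one video.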





Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (61772359, 61472275, 61572356), the Tianjin Research Program of Application Foundation and Advanced Technology (15JCYBJC16200), and the National Research Foundation, Prime Minister's Office, Singapore, under its International Research Centre in Singapore Funding Initiative.

Author information


Corresponding author

Correspondence to An-An Liu.


Cite this article

Liu, AA., Shao, Z., Wong, Y. et al. LSTM-based multi-label video event detection. Multimed Tools Appl 78, 677–695 (2019). https://doi.org/10.1007/s11042-017-5532-x

