Human Interaction Recognition by Mining Discriminative Patches on Key Frames

Shan, Dingyi; Qing, Laiyun; Miao, Jun

doi:10.1007/978-3-319-54184-6_22

Dingyi Shan, Laiyun Qing & Jun Miao

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10112)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

In this paper, we propose a novel model for recognizing human interactions in videos via discriminative patches. Each frame is represented as a set of mid-level discriminative patches, which are extracted automatically by association rule mining on convolutional neural network (CNN) activations. We further refine these patches based on the observation that discriminative patches usually occur in the climax period of an interaction. Here, the climax of an interaction is defined as the consecutive frames in which more patches fire. The patches are then purified by a reward-punishment rule, which ensures that discriminative patches emerge frequently in the climax period, or key frames, and seldom occur in non-key frames. Finally, the label of an interaction video clip is determined by the votes of the patches detected in it. Experimental results on UT-Interaction Set #1, Set #2, and the BIT-Interaction dataset show that the proposed discriminative patches achieve encouraging performance.
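
The abstract names three steps without giving formulas: locating the climax (key frames), scoring patches with a reward-punishment rule, and voting on the clip label. The sketch below is one possible reading of those steps, not the authors' implementation. The mean-count threshold, the linear score with a `penalty` weight, the plain majority vote, and all function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence, Tuple


def climax_span(fire_counts: Sequence[int]) -> Tuple[int, int]:
    """Return (start, end) of the longest run of consecutive frames whose
    patch-firing count is strictly above the clip mean (end is exclusive).

    Assumption: "more firing patches" is read as "above the mean count".
    Returns (0, 0) if no frame exceeds the mean.
    """
    if not fire_counts:
        return (0, 0)
    threshold = sum(fire_counts) / len(fire_counts)
    best = (0, 0)
    start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, count in enumerate(list(fire_counts) + [float("-inf")]):
        if count > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def patch_score(fire_frames: Sequence[int], span: Tuple[int, int],
                penalty: float = 1.0) -> float:
    """Reward firings inside the key-frame span and punish firings outside.

    Assumption: a linear rule, score = inside - penalty * outside.
    """
    start, end = span
    inside = sum(1 for f in fire_frames if start <= f < end)
    outside = len(fire_frames) - inside
    return inside - penalty * outside


def vote_label(detected_patch_labels: Sequence[str]) -> str:
    """Pick the class that receives the most votes from detected patches."""
    if not detected_patch_labels:
        raise ValueError("no patches detected in the clip")
    return Counter(detected_patch_labels).most_common(1)[0][0]


if __name__ == "__main__":
    counts = [0, 1, 4, 5, 6, 1, 0, 3]      # patches firing per frame (toy data)
    span = climax_span(counts)
    print(span)                             # key-frame span for the toy clip
    print(patch_score([2, 3, 4, 7], span))  # patch firing mostly in key frames
    print(patch_score([0, 1, 3], span))     # patch firing mostly elsewhere
    print(vote_label(["handshake", "hug", "handshake"]))
```

Under this reading, a patch would be kept only if its score is positive, so a patch that fires mostly outside the key frames is discarded before voting.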

Notes

1.
http://caffe.berkeleyvision.org/.

Acknowledgments

This research is partially sponsored by the Natural Science Foundation of China (Nos. 61472387, 61272320, and 61572004) and the Beijing Natural Science Foundation (Nos. 4152005 and 4162058).

Author information

Authors and Affiliations

School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, 100049, China
Dingyi Shan & Laiyun Qing
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
Dingyi Shan, Laiyun Qing & Jun Miao

Corresponding author

Correspondence to Laiyun Qing.

Editor information

Editors and Affiliations

National Tsing Hua University, Hsinchu, Taiwan
Shang-Hong Lai
Graz University of Technology, Graz, Austria
Vincent Lepetit
Drexel University, Philadelphia, Pennsylvania, USA
Ko Nishino
The University of Tokyo, Tokyo, Japan
Yoichi Sato

About this paper

Cite this paper

Shan, D., Qing, L., Miao, J. (2017). Human Interaction Recognition by Mining Discriminative Patches on Key Frames. In: Lai, SH., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. ACCV 2016. Lecture Notes in Computer Science, vol 10112. Springer, Cham. https://doi.org/10.1007/978-3-319-54184-6_22

DOI: https://doi.org/10.1007/978-3-319-54184-6_22
Published: 10 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54183-9
Online ISBN: 978-3-319-54184-6
eBook Packages: Computer Science, Computer Science (R0)
