Abstract
In this paper, we propose a novel model for recognizing human interactions in videos via discriminative patches. Each frame is represented as a set of mid-level discriminative patches, which are extracted automatically by association rule mining on convolutional neural network (CNN) activations. We further refine these patches based on the observation that discriminative patches usually occur in the climax period of an interaction. In this paper, the climax of an interaction is defined as the consecutive frames that have more firing patches. The patches are then purified by a reward-punishment rule, which ensures that discriminative patches emerge frequently in the climax period or key frames and seldom occur in non-key frames. Finally, the label of an interaction video clip is determined by the votes of the patches detected in it. Experimental results on UT-Interaction Set #1 and Set #2 and on the BIT-Interaction dataset show that the proposed discriminative patches achieve encouraging performance.
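As an illustration only, the two steps that the abstract describes most concretely (locating the climax as the consecutive frames with the most firing patches, and labeling a clip by patch votes) might be sketched as follows. The function names, the fixed `window` length, and the plain majority vote are assumptions made for this sketch; the abstract does not specify how the climax length is chosen or how votes are weighted, and this is not the authors' implementation.

```python
from collections import Counter


def find_climax(firing_counts, window):
    """Return (start, end) of the contiguous run of `window` frames whose
    total number of firing patches is largest; `end` is exclusive.

    `firing_counts[i]` is the number of patches firing in frame i.
    The fixed window length is an assumption of this sketch.
    """
    if not 0 < window <= len(firing_counts):
        raise ValueError("window must be between 1 and the number of frames")
    current = sum(firing_counts[:window])
    best_sum, best_start = current, 0
    # Slide the window one frame at a time, updating the running sum.
    for start in range(1, len(firing_counts) - window + 1):
        current += firing_counts[start + window - 1] - firing_counts[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_start + window


def vote_label(detected_patch_labels):
    """Label a clip by an unweighted majority vote over the class labels
    of the patches detected in it (the weighting is an assumption)."""
    if not detected_patch_labels:
        raise ValueError("no patches detected in the clip")
    return Counter(detected_patch_labels).most_common(1)[0][0]


if __name__ == "__main__":
    counts = [0, 1, 4, 5, 3, 1, 0]  # hypothetical per-frame firing counts
    print(find_climax(counts, window=3))
    print(vote_label(["handshake", "hug", "handshake"]))
```

A sliding-window sum keeps the climax search linear in the number of frames; a variable-length climax would need a different criterion than the one assumed here.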
Acknowledgments
This research is partially sponsored by the Natural Science Foundation of China (Nos. 61472387, 61272320, and 61572004) and the Beijing Natural Science Foundation (Nos. 4152005 and 4162058).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Shan, D., Qing, L., Miao, J. (2017). Human Interaction Recognition by Mining Discriminative Patches on Key Frames. In: Lai, SH., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. ACCV 2016. Lecture Notes in Computer Science, vol 10112. Springer, Cham. https://doi.org/10.1007/978-3-319-54184-6_22
DOI: https://doi.org/10.1007/978-3-319-54184-6_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54183-9
Online ISBN: 978-3-319-54184-6
eBook Packages: Computer Science, Computer Science (R0)