A Robust and Efficient Video Representation for Action Recognition

Abstract

This paper introduces a state-of-the-art video representation and applies it to efficient action recognition and detection. We first propose to improve the popular dense trajectory features by explicit camera motion estimation. More specifically, we extract feature point matches between frames using SURF descriptors and dense optical flow. The matches are used to estimate a homography with RANSAC. To improve the robustness of homography estimation, a human detector is employed to remove matches on the human body, since human motion is independent of the camera motion. Trajectories consistent with the homography are considered to be due to camera motion, and are thus removed. We also use the homography to cancel out camera motion from the optical flow, which yields a significant improvement in the motion-based HOF and MBH descriptors. We further explore the more recent Fisher vector as an alternative feature encoding to the standard bag-of-words (BOW) histogram, and consider different ways to include spatial layout information in these encodings. We present a large and varied set of evaluations, considering (i) classification of short basic actions on six datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that our improved trajectory features significantly outperform previous dense trajectories, and that Fisher vectors are superior to BOW encodings for video recognition tasks. On all three tasks, we show substantial improvements over the state of the art.
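
To make the camera-motion compensation concrete, here is a minimal sketch in Python against OpenCV (SURF requires the opencv-contrib package). The function names, thresholds, and the optional human-detector mask interface are illustrative assumptions, not the paper's exact implementation; in particular, the full pipeline also adds dense optical-flow correspondences to the SURF matches before fitting the homography, which is omitted here for brevity.

```python
import cv2
import numpy as np

def estimate_homography(prev_gray, curr_gray, human_mask=None):
    """Fit a frame-to-frame homography from SURF matches with RANSAC.
    human_mask is an optional binary map from a human detector; matches
    on the human body are discarded, since human motion is independent
    of the camera motion. (Illustrative sketch, not the paper's code.)"""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    if human_mask is not None:
        keep = [i for i, (x, y) in enumerate(pts1)
                if human_mask[int(y), int(x)] == 0]
        pts1, pts2 = pts1[keep], pts2[keep]
    H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 1.0)
    return H

def compensated_flow(prev_gray, curr_gray, H):
    """Warp the current frame so the background aligns with the previous
    frame, then recompute dense Farneback flow: the residual flow is
    (approximately) foreground motion only."""
    h, w = prev_gray.shape
    warped = cv2.warpPerspective(curr_gray, np.linalg.inv(H), (w, h))
    return cv2.calcOpticalFlowFarneback(prev_gray, warped, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```

Trajectories whose frame-to-frame displacement matches the homography can then be flagged as camera motion and dropped, and the compensated flow feeds the HOF and MBH descriptors. The Fisher vector encoding compared against BOW can likewise be sketched under a diagonal-covariance GMM; the gradient formulas follow Sánchez et al. (2013), with power and L2 normalization applied at the end. Fitting the vocabulary with scikit-learn is an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode local descriptors (T x D) as gradients of a diagonal-covariance
    GMM w.r.t. its means and variances, then power- and L2-normalize."""
    X = np.atleast_2d(descriptors)
    T = X.shape[0]
    q = gmm.predict_proba(X)                                 # (T, K) soft assignments
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    diff = (X[:, None, :] - mu[None]) / np.sqrt(var)[None]   # (T, K, D)
    g_mu = (q[..., None] * diff).sum(0) / (T * np.sqrt(pi)[:, None])
    g_var = (q[..., None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * pi)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalization

# Usage sketch: fit the vocabulary on a sample of training descriptors.
# gmm = GaussianMixture(n_components=256, covariance_type='diag').fit(sample)
# fv = fisher_vector(video_descriptors, gmm)
```

A BOW histogram over the same vocabulary would simply count hard assignments (q.argmax(1)); the Fisher vector additionally retains first- and second-order statistics, which helps explain its advantage over BOW in the experiments.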

Notes

  1. http://lear.inrialpes.fr/~wang/improved_trajectories.

  2. http://lear.inrialpes.fr/~wang/improved_trajectories.

  3. The number of videos in each subset differs slightly from the figures reported by Tang et al. (2012), because there are multiple releases of the data. For our experiments, we used the labels from the LDC2011E42 release.

References

  • Arandjelovic, R., & Zisserman, A. (2012). Three things everyone should know to improve object retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2911–2918).

  • Ballas, N., Yang, Y., Lan, Z. Z., Delezoide, B., Prêteux, F., & Hauptmann, A. (2013). Space-time robust video representation for action recognition. In IEEE International Conference on Computer Vision.

  • Bay, H., Tuytelaars, T., & Gool, L. V. (2006). SURF: Speeded up robust features. In European Conference on Computer Vision.

  • Cao, L., Mu, Y., Natsev, A., Chang, S. F., Hua, G., & Smith, J. (2012). Scene aligned pooling for complex video recognition. In European Conference on Computer Vision.

  • Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: An evaluation of recent feature encoding methods. In British Machine Vision Conference.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Dalal, N., Triggs, B., & Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision.

  • Dollár, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In IEEE Workshop Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

  • Duchenne, O., Laptev, I., Sivic, J., Bach, F., & Ponce, J. (2009). Automatic annotation of human actions in video. In IEEE International Conference on Computer Vision (pp. 1491–1498).

  • Farnebäck, G. (2003). Two-frame motion estimation based on polynomial expansion. In Proceedings of the Scandinavian Conference on Image Analysis.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.

  • Gaidon, A., Harchaoui, Z., & Schmid, C. (2011). Actom sequence models for efficient action detection. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Gaidon, A., Harchaoui, Z., & Schmid, C. (2013). Activity representation with motion hierarchies. International Journal of Computer Vision, 3, 1–20.

  • Gauglitz, S., Höllerer, T., & Turk, M. (2011). Evaluation of interest point detectors and feature descriptors for visual tracking. International Journal of Computer Vision, 94(3), 335–360.

  • van Gemert, J., Veenman, C., Smeulders, A., & Geusebroek, J. M. (2010). Visual word ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7), 1271–1283.

  • Gorelick, L., Blank, M., Shechtman, E., Irani, M., & Basri, R. (2007). Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12), 2247–2253.

  • Gupta, A., Kembhavi, A., & Davis, L. (2009). Observing human-object interactions: using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 1775–1789.

  • Ikizler-Cinbis, N., & Sclaroff, S. (2010). Object, scene and actions: Combining multiple features for human action recognition. In European Conference on Computer Vision.

  • Izadinia, H., & Shah, M. (2012). Recognizing complex events using large margin joint low-level event model. In European Conference on Computer Vision.

  • Jain, M., Jégou, H., & Bouthemy, P. (2013). Better exploiting motion for better action recognition. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2011). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence.

  • Jiang, Y. G., Dai, Q., Xue, X., Liu, W., & Ngo, C. W. (2012). Trajectory-based modeling of human actions with motion reference points. In European Conference on Computer Vision (pp. 425–438).

  • Jiang, Y. G., Liu, J., Roshan Zamir, A., Laptev, I., Piccardi, M., Shah, M., & Sukthankar, R. (2013). THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/ICCV13-Action-Workshop/

  • Karaman, S., Seidenari, L., Bagdanov, A. D., & Del Bimbo, A. (2013). L1-regularized logistic regression stacking and transductive CRF smoothing for action recognition in video. In ICCV Workshop on Action Recognition with a Large Number of Classes.

  • Kim, I., Oh, S., Vahdat, A., Cannons, K., Perera, A., & Mori, G. (2013). Segmental multi-way local pooling for video recognition. In ACM Conference on Multimedia (pp. 637–640).

  • Kläser, A., Marszałek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3D-gradients. In British Machine Vision Conference.

  • Kläser, A., Marszałek, M., Schmid, C., & Zisserman, A. (2010). Human focused action localization in video. In ECCV Workshop on Sign, Gesture, and Activity.

  • Krapac, J., Verbeek, J., & Jurie, F. (2011). Modeling spatial layout with Fisher vectors for image categorization. In IEEE International Conference on Computer Vision.

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In IEEE International Conference on Computer Vision (pp. 2556–2563).

  • Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2–3), 107–123.

  • Laptev, I., & Pérez, P. (2007). Retrieving actions in movies. In IEEE International Conference on Computer Vision.

  • Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Li, K., Oh, S., Perera, A. A., & Fu, Y. (2012). A videography analysis framework for video retrieval and summarization. In British Machine Vision Conference (pp. 1–12).

  • Li, W., Yu, Q., Divakaran, A., & Vasconcelos, N. (2013). Dynamic pooling for complex event recognition. In IEEE International Conference on Computer Vision.

  • Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos in the wild. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Ma, S., Zhang, J., Ikizler-Cinbis, N., & Sclaroff, S. (2013). Action recognition and localization by hierarchical space-time segments. In IEEE International Conference on Computer Vision.

  • Marszałek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Mathe, S., & Sminchisescu, C. (2012). Dynamic eye movement datasets and learnt saliency models for visual action recognition. In European Conference on Computer Vision (pp. 842–856).

  • Matikainen, P., Hebert, M., & Sukthankar, R. (2009). Trajectons: Action recognition through the motion analysis of tracked features. In ICCV Workshops on Video-Oriented Object and Event Classification.

  • Matikainen, P., Hebert, M., & Sukthankar, R. (2010). Representing pairwise spatial and temporal relations for action recognition. In European Conference on Computer Vision.

  • McCann, S., & Lowe, D. G. (2013). Spatially local coding for object recognition. In Asian Conference on Computer Vision (pp. 204–217). New York: Springer.

  • Messing, R., Pal, C., & Kautz, H. (2009). Activity recognition using the velocity histories of tracked keypoints. In IEEE International Conference on Computer Vision.

  • Murthy, O. R., & Goecke, R. (2013a). Combined ordered and improved trajectories for large scale human action recognition. In ICCV Workshop on Action Recognition with a Large Number of Classes.

  • Murthy, O. R., & Goecke, R. (2013b). Ordered trajectories for large scale human action recognition. In ICCV Workshops.

  • Natarajan, P., Wu, S., Vitaladevuni, S., Zhuang, X., Tsakalidis, S., Park, U., Prasad, R., & Natarajan, P. (2012). Multimodal feature fusion for robust event detection in web videos. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision.

  • Oneata, D., Verbeek, J., & Schmid, C. (2013). Action and event recognition with Fisher vectors on a compact feature set. In IEEE International Conference on Computer Vision.

  • Over, P., Awad, G., Michel, M., Fiscus, J., Sanders, G., Shaw, B., Kraaij, W., Smeaton, A. F., & Quenot, G. (2012). TRECVID 2012: An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID.

  • Park, D., Zitnick, C. L., Ramanan, D., & Dollár, P. (2013). Exploring weak stabilization for motion feature extraction. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Patron-Perez, A., Marszalek, M., Zisserman, A., & Reid, I. (2010). High Five: Recognising human interactions in TV shows. In British Machine Vision Conference.

  • Peng, X., Wang, L., Cai, Z., Qiao, Y., & Peng, Q. (2013). Hybrid super vector with improved dense trajectories for action recognition. In ICCV Workshops.

  • Prest, A., Schmid, C., & Ferrari, V. (2012). Weakly supervised learning of interactions between humans and objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 601–614.

  • Prest, A., Ferrari, V., & Schmid, C. (2013). Explicit modeling of human-object interactions in realistic videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 835–848.

  • Reddy, K., & Shah, M. (2012). Recognizing 50 human action categories of web videos. Machine Vision and Applications (pp. 1–11).

  • Sánchez, J., Perronnin, F., & de Campos, T. (2012). Modeling the spatial layout of images beyond spatial pyramids. Pattern Recognition Letters, 33(16), 2216–2223.

  • Sánchez, J., Perronnin, F., Mensink, T., & Verbeek, J. (2013). Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3), 222–245.

  • Sapienza, M., Cuzzolin, F., & Torr, P. (2012). Learning discriminative space-time actions from weakly labelled videos. In British Machine Vision Conference.

  • Schüldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In International Conference on Pattern Recognition.

  • Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In ACM Conference on Multimedia.

  • Shi, F., Petriu, E., & Laganiere, R. (2013). Sampling strategies for real-time action recognition. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Shi, J., & Tomasi, C. (1994). Good features to track. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01.

  • Sun, C., & Nevatia, R. (2013). Large-scale web video event classification by use of Fisher vectors. In IEEE Winter Conference on Applications of Computer Vision.

  • Sun, J., Wu, X., Yan, S., Cheong, L. F., Chua, T. S., & Li, J. (2009). Hierarchical spatio-temporal context modeling for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition.

  • Szeliski, R. (2006). Image alignment and stitching: A tutorial. Foundations and Trends in Computer Graphics and Vision, 2(1), 1–104.

  • Tang, K., Fei-Fei, L., & Koller, D. (2012). Learning latent temporal structure for complex event detection. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1250–1257).

  • Tang, K., Yao, B., Fei-Fei, L., & Koller, D. (2013). Combining the right features for complex event recognition. In IEEE International Conference on Computer Vision (pp. 2696–2703).

  • Uemura, H., Ishikawa, S., & Mikolajczyk, K. (2008). Feature tracking and motion compensation for action recognition. In British Machine Vision Conference.

  • Vahdat, A., & Mori, G. (2013). Handling uncertain tags in visual recognition. In IEEE International Conference on Computer Vision.

  • Vahdat, A., Cannons, K., Mori, G., Oh, S., & Kim, I. (2013). Compositional models for video event detection: A multiple kernel learning latent variable approach. In IEEE International Conference on Computer Vision.

  • Vig, E., Dorr, M., & Cox, D. (2012). Space-variant descriptor sampling for action recognition based on saliency and eye movements. In European Conference on Computer Vision (pp. 84–97).

  • Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In IEEE International Conference on Computer Vision.

  • Wang, H., Ullah, M. M., Kläser, A., Laptev, I., & Schmid, C. (2009). Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference.

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013a). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.

  • Wang, L., Qiao, Y., Tang, X., et al. (2013b). Mining motion atoms and phrases for complex action recognition. In IEEE International Conference on Computer Vision (pp. 2680–2687).

  • Wang, X., Wang, L., & Qiao, Y. (2012). A comparative study of encoding, pooling and normalization methods for action recognition. In Asian Conference on Computer Vision.

  • Willems, G., Tuytelaars, T., & Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. In European Conference on Computer Vision.

  • Wu, S., Oreifej, O., & Shah, M. (2011). Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In IEEE International Conference on Computer Vision.

  • Yang, Y., & Shah, M. (2012). Complex events detection using data-driven concepts. In European Conference on Computer Vision.

  • Yeffet, L., & Wolf, L. (2009). Local trinary patterns for human action recognition. In IEEE International Conference on Computer Vision.

  • Yu, G., Yuan, J., & Liu, Z. (2012). Propagative Hough voting for human activity recognition. In European Conference on Computer Vision (pp. 693–706).

  • Zhu, J., Wang, B., Yang, X., Zhang, W., & Tu, Z. (2013). Action recognition with actons. In IEEE International Conference on Computer Vision.

Acknowledgments

This work was supported by Quaero (funded by OSEO, the French state agency for innovation), the European integrated project AXES, the MSR/INRIA joint project, and the ERC advanced grant ALLEGRO.

Author information

Correspondence to Dan Oneata.

Additional information

Communicated by Ivan Laptev, Josef Sivic, and Deva Ramanan.

H. Wang is currently with Amazon Research Seattle; he carried out the work described here while affiliated with INRIA.

About this article

Cite this article

Wang, H., Oneata, D., Verbeek, J. et al. A Robust and Efficient Video Representation for Action Recognition. Int J Comput Vis 119, 219–238 (2016). https://doi.org/10.1007/s11263-015-0846-5

