Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

International Journal of Computer Vision

Abstract

We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Because it relies on these probabilistic models, our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
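The pipeline summarized in the abstract (videos as bags of spatial-temporal words, with latent topics standing in for action categories) can be illustrated with a small pLSA sketch. This is not the authors' implementation. It is a minimal pure-Python version under stated assumptions: the codebook is taken as given, the descriptors and word counts are made-up toy data, and the names `word_histogram`, `plsa`, and `dominant_topic` are hypothetical helpers introduced only for this illustration.

```python
import random

def word_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and count the assignments."""
    hist = [0] * len(codebook)
    for desc in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(desc, word)) for word in codebook]
        hist[dists.index(min(dists))] += 1
    return hist

def _normalize(row):
    total = sum(row)
    return [v / total for v in row]

def plsa(counts, n_topics, n_iters=200, seed=0):
    """Fit pLSA by EM on a video-by-word count matrix (list of lists).

    Returns P(w|z) as n_topics rows and P(z|d) as one row per video.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random (assumed) initialisation of both distributions.
    p_w_z = [_normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [_normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iters):
        # Tiny constant keeps every row normalizable.
        acc_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        acc_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) proportional to P(w|z) * P(z|d).
                post = _normalize([p_w_z[z][w] * p_z_d[d][z]
                                   for z in range(n_topics)])
                # M-step accumulators, weighted by the observed count.
                for z in range(n_topics):
                    acc_w_z[z][w] += counts[d][w] * post[z]
                    acc_z_d[d][z] += counts[d][w] * post[z]
        p_w_z = [_normalize(row) for row in acc_w_z]
        p_z_d = [_normalize(row) for row in acc_z_d]
    return p_w_z, p_z_d

def dominant_topic(p_z_d):
    """Label each video with its most probable topic."""
    return [row.index(max(row)) for row in p_z_d]

# Toy data (assumed): six "videos" over a six-word codebook, two obvious groups.
toy_counts = [
    [8, 6, 7, 0, 0, 0],
    [7, 8, 6, 0, 0, 0],
    [6, 7, 8, 0, 0, 0],
    [0, 0, 0, 8, 6, 7],
    [0, 0, 0, 7, 8, 6],
    [0, 0, 0, 6, 7, 8],
]
topics, mixtures = plsa(toy_counts, n_topics=2)
print(dominant_topic(mixtures))
```

No labels are supplied at any point: the grouping of videos comes only from which words co-occur, which is the sense in which the approach is unsupervised. Topic indices are arbitrary, so only the grouping is meaningful, not which group receives index 0.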

Author information

Corresponding author

Correspondence to Juan Carlos Niebles.

About this article

Cite this article

Niebles, J.C., Wang, H. & Fei-Fei, L. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. Int J Comput Vis 79, 299–318 (2008). https://doi.org/10.1007/s11263-007-0122-4

  • DOI: https://doi.org/10.1007/s11263-007-0122-4
