Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

Niebles, Juan Carlos; Wang, Hongcheng; Fei-Fei, Li

doi:10.1007/s11263-007-0122-4

Published: 04 March 2008

International Journal of Computer Vision, Volume 79, pages 299–318 (2008)

Juan Carlos Niebles¹,², Hongcheng Wang³ & Li Fei-Fei⁴

Abstract

We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Because it relies on probabilistic models, our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
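
The topic-model step named in the abstract can be illustrated with a minimal pLSA sketch. This is not the authors' implementation: it assumes each video has already been reduced to a histogram of spatial-temporal word counts, and the function name `plsa`, the toy `counts` matrix, and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=200, seed=0):
    """Fit pLSA by EM on a (n_videos, n_words) count matrix.

    Returns P(w|z) with shape (n_topics, n_words) and
    P(z|d) with shape (n_videos, n_topics).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random, row-normalized initialization of both distributions.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z|d,w), shape (n_docs, n_topics, n_words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: re-estimate from expected counts n(d,w) * P(z|d,w).
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_w_z, p_z_d

# Toy data (made up): 4 videos, 4 codewords; videos 0-1 and 2-3
# use disjoint sets of words, mimicking two action categories.
counts = np.array([[5, 4, 0, 0],
                   [6, 3, 0, 0],
                   [0, 0, 4, 5],
                   [0, 0, 3, 6]], dtype=float)

p_w_z, p_z_d = plsa(counts, n_topics=2)
# Assign each video to its most probable latent topic.
labels = p_z_d.argmax(axis=1)
print(labels)
```

Because no labels are used, the topic indices are arbitrary; only the grouping of videos is meaningful. A new video could be categorized in the same spirit by holding `p_w_z` fixed and running the EM updates for its `P(z|d)` alone.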

Author information

Authors and Affiliations

Department of Electrical Engineering, Princeton University, Engineering Quadrangle, Olden Street, Princeton, NJ, 08544, USA
Juan Carlos Niebles
Robotics and Intelligent Systems Group, Universidad del Norte, Km 5 Vía Puerto Colombia, Barranquilla, Colombia
Juan Carlos Niebles
United Technologies Research Center (UTRC), 411 Silver Lane, East Hartford, CT, 06108, USA
Hongcheng Wang
Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ, 08540, USA
Li Fei-Fei

Corresponding author

Correspondence to Juan Carlos Niebles.

About this article

Cite this article

Niebles, J.C., Wang, H. & Fei-Fei, L. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. Int J Comput Vis 79, 299–318 (2008). https://doi.org/10.1007/s11263-007-0122-4

Received: 16 March 2007
Accepted: 26 December 2007
Published: 04 March 2008
Issue Date: September 2008
DOI: https://doi.org/10.1007/s11263-007-0122-4
