ABSTRACT
A key problem in the automatic detection of semantic concepts (like 'interview' or 'soccer') in video streams is the manual acquisition of adequate training sets. Recently, we have proposed to use online videos downloaded from portals like youtube.com for this purpose, with tags provided by users during video upload serving as ground-truth annotations.
The problem with such training data is that it is weakly labeled: annotations are provided only at the video level, and many shots of a video may be "non-relevant", i.e., not visually related to a tag. In this paper, we present a probabilistic framework for learning from such weakly annotated training videos in the presence of irrelevant content. The relevance of each keyframe is modeled as a latent random variable that is estimated during training.
In quantitative experiments on real-world online videos and TV news data, we demonstrate that the proposed model significantly increases robustness to irrelevant content and improves the generalization of the resulting concept detectors.
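The latent-relevance formulation above can be sketched as a small EM procedure: each keyframe is assumed to be generated either by a concept model (relevant) or by a background model (irrelevant), and the E-step posterior over the latent relevance variable is re-estimated jointly with the model parameters. The following is a minimal one-dimensional illustration under hypothetical Gaussian concept/background models over a scalar frame feature — the paper's actual framework operates on richer visual features, so treat every name and modeling choice here as an assumption:

```python
import numpy as np

def em_keyframe_relevance(frames, n_iter=50, sigma=1.0):
    """Toy EM sketch: estimate the posterior relevance of each keyframe.

    frames : 1-D array of scalar frame features (hypothetical stand-in
             for real visual descriptors).
    Returns r, the posterior probability that each frame is relevant.
    """
    # Crude initialization: assume relevant frames score high, background low.
    mu_c, mu_b = frames.max(), frames.min()
    pi = 0.5  # prior probability that a frame is relevant

    for _ in range(n_iter):
        # E-step: posterior over the latent relevance variable,
        # using fixed-width Gaussian likelihoods for both components.
        pc = pi * np.exp(-0.5 * ((frames - mu_c) / sigma) ** 2)
        pb = (1 - pi) * np.exp(-0.5 * ((frames - mu_b) / sigma) ** 2)
        r = pc / (pc + pb + 1e-12)

        # M-step: relevance-weighted updates of the two component means
        # and of the relevance prior.
        mu_c = (r * frames).sum() / (r.sum() + 1e-12)
        mu_b = ((1 - r) * frames).sum() / ((1 - r).sum() + 1e-12)
        pi = r.mean()

    return r
```

Given a video whose keyframes mix on-topic and off-topic content, the estimated posteriors `r` can be used to down-weight irrelevant frames when training a concept detector, rather than treating every frame of a tagged video as a positive example.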