Video-based descriptors for object recognition☆
Highlights
- We analyze and derive representations of objects from video.
- We integrate multi-scale detection and tracking.
- We derive a video-based feature descriptor.
- We describe a visual recognition system for a hand-held device.
Introduction
We tackle the problem of recognizing objects and scenes from images, given example views. The difficulty of this problem lies in the large nuisance variability that the data can exhibit, depending on the vantage point, visibility conditions (occlusions), illumination, etc. under which the object is seen, even if the object itself exhibits no intrinsic variability. The analysis in [1] suggests that the nuisances induce almost all the variability in the data, and what remains (the dependency of the data on the object) is supported on a thin set. The most common approach to this problem is to eliminate some of the nuisances by pre-processing the data (to obtain “distinctive” and yet “insensitive” features), and to “learn away” the residual nuisance variability, often using a training set of manually labeled images. Both practices are poorly grounded in principle: pre-processing does not, in general, improve performance in a classification task (cf. the data processing inequality [2]), and training a classifier on unrelated images (aiming to approximate independent samples from the class-conditional distribution) disregards the fact that there is a scene out there, and limits the classifier to learning generic regularities in images. It can be shown that, when a collection of passively gathered independent snapshots is used as the training set, not only is the worst-case error in a visual recognition problem at chance level (i.e., the risk is the same as that offered by the prior), but so is the average case [3]. This is not the case, however, when the training data consists of images captured purposefully during an active exploration phase [4].
In this paper we propose a different approach to recognition, grounded in the ideas of Active Vision [5], [6] and Actionable Information [4], whereby the training set consists not of isolated snapshots, such as photo collections harvested from the web, but of temporally coherent sequences of images where the user is free to move around an object or manipulate it. Even if the objects are static, the use of video results in quantifiably superior recognition performance in a single (test) image. More importantly, the issue of representation is well grounded in the presence of multiple images of the same scene, and temporal continuity provides the crucial “bit” that the images in the training set are of the same scene, and therefore all the variability in the data is ascribed to the nuisances.
Contrary to common perception, building representations of objects from video for the purpose of recognition is not only a sounder process, it is also more computationally efficient. In fact, the descriptor we propose is far more efficient to compute than common descriptors computed from single images, and it has better discriminative and invariance properties. To show this, we both derive our representation from first principles – demanding that our descriptor be the “best” among a chosen class tied to the classifier – and empirically measure the performance of the resulting recognition scheme, comparing it with popular baseline algorithms. We have also made our implementation available for others to try on their mobile phones, so our results can be independently validated.
We describe our implementation in Section 3 and the analysis that motivates its design choices in Section 2. We start from the most general visual decision task (detection, localization, recognition, categorization, etc.) and abstract it to a binary hypothesis testing problem, to highlight the crucial issue of representation (Section 2). While standard statistical decision theory trivializes representational issues, our goal is to design sufficient statistics that reduce the complexity of the decision at run-time as much as possible while having the least impact on the optimality of the decision. Thus, we start from the notion of complete invariant statistics and show how a classifier can be constructed from them. We characterize the dependency of these statistics on nuisance factors via the notion of bounded-input-bounded-output (BIBO) stability, as well as structural stability. We show that these concepts allow reducing the marginalization process – in which nuisances such as viewpoint, illumination, and partial occlusions would have to be integrated out at decision time – to a combinatorial decision that can be performed in real time even on severely resource-constrained hand-held platforms.
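The contrast between marginalizing nuisances at decision time and replacing the integral with a combinatorial (best-match) test can be illustrated with a toy sketch. The function names, the Gaussian likelihood, and the representation of nuisance hypotheses as a finite set of templates are our own illustrative assumptions, not the paper's actual classifier:

```python
import numpy as np

def marginalized_score(test_feat, templates, prior=None):
    """Likelihood obtained by averaging (marginalizing) over nuisance
    hypotheses, each represented here by one template vector."""
    d = np.linalg.norm(templates - test_feat, axis=1)
    lik = np.exp(-0.5 * d ** 2)  # per-hypothesis Gaussian likelihood
    w = prior if prior is not None else np.full(len(templates), 1.0 / len(templates))
    return float(w @ lik)

def max_out_score(test_feat, templates):
    """Combinatorial surrogate: evaluate only the best-matching nuisance
    hypothesis, avoiding the run-time integral over nuisances."""
    d = np.linalg.norm(templates - test_feat, axis=1)
    return float(np.exp(-0.5 * d.min() ** 2))
```

The max-out score upper-bounds the marginalized score (for uniform priors) and costs a single nearest-neighbor search, which is what makes a real-time test on a hand-held platform plausible.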
Our approach also highlights the importance of time in recognition problems. Time is currently underplayed in favor of hand-labeled training data, but it can effectively act as a “weak supervisor” in visual recognition, and we attempt to tap into that role.
Our effort relates to a wealth of recent work on visual recognition, localization and categorization, represented, for instance, in the PASCAL challenge (see [7] for references). Our effort to run in real time relates to [8], [9], although the constraints of a hand-held device limit the class of usable methods to much simpler classifiers, such as nearest neighbors and TF-IDF [10]. Rather than tinkering with the classifier, we focus on representation as the core issue. Modules of our system relate to multi-scale feature selection, tracking, local descriptors, and bag-of-features classification, building on baseline algorithms [[11], [12], [13], [14]]. We propose a method to integrate multi-scale detection and tracking that does not involve joint location-scale optimization [15], but explicitly accounts for topological changes across scales. This approach (dubbed “tracking on the selection tree”, TST) respects the semi-group structure of scaling/quantization and is motivated by the “structural stability” of the selection process. This improves accuracy and robustness while making tracking more efficient. We also replace traditional single-view descriptors [13], [16], [17] with a template that is designed to be optimal in the mean-square sense, under conditions described in Section 2, dubbed the “best template” descriptor (BTD). Unlike approaches that simulate nuisance variability in the training set from a single image [9], [18], we exploit the real nuisance distribution by tracking features across frames during learning. Our contributions in this manuscript comprise the tracker, TST (Section 2.6), the descriptor, BTD (Section 2.8), the analysis that motivates them (Section 2), and the implementation on a mobile device (Section 3).
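To make the mean-square optimality concrete: if the “best” template is taken to be the minimizer of the summed squared residuals to all tracked, aligned views of a patch, then it is simply the pixel-wise mean over those views. The sketch below illustrates only that property; the function name, data layout, and the assumption that patches are already aligned by the tracker are ours, not the paper's full BTD construction:

```python
import numpy as np

def best_template(patches):
    """Mean-square-optimal template for a set of co-located, tracked,
    aligned patches: the pixel-wise mean, which minimizes the sum of
    squared residuals to all observed views of the patch."""
    stack = np.stack([np.asarray(p, dtype=np.float64) for p in patches])
    return stack.mean(axis=0)
```

Because the views come from real video, the average is taken over the actual nuisance distribution (viewpoint, illumination) experienced during exploration, rather than over synthetically warped copies of a single image.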
Representation
This section motivates our algorithm design choices via analysis of an abstraction of the recognition problem. The reader interested in just the algorithmic aspect of the system can skip ahead to Section 3.
Implementation and experiments
We implemented the recognition system described above and tested its performance in terms of accuracy and computational efficiency. The integration of tracking on the selection tree (TST) and the best-template descriptor (BTD) enables the system to run in real time on a mobile device such as an iPhone, while providing recognition accuracy comparable to or better than traditional algorithms.
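The introduction names nearest neighbors and TF-IDF [10] as the simple classifiers a hand-held device can afford. A minimal sketch of TF-IDF scoring over bag-of-features histograms is given below; the variable names, the cosine-similarity choice, and the particular idf formula are our own assumptions, not the implementation in the paper:

```python
import numpy as np

def tfidf_scores(query_hist, db_hists):
    """Score database objects against a query bag-of-features histogram
    using TF-IDF weighting and cosine similarity; returns one score per
    database object (higher is a better match)."""
    db = np.asarray(db_hists, dtype=np.float64)
    # idf: down-weight visual words that occur in many database objects
    df = (db > 0).sum(axis=0)
    idf = np.log(len(db) / np.maximum(df, 1))

    def embed(h):
        v = h * idf
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    q = embed(np.asarray(query_hist, dtype=np.float64))
    return np.array([embed(h) @ q for h in db])
```

Recognition then reduces to returning the database object with the highest score (a nearest-neighbor decision in the weighted histogram space), which is cheap enough for a resource-constrained platform.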
We have implemented the recognition system described above on an iPhone 3GS with a 600 MHz ARM chip CPU. The
Discussion
We have described a recognition system with integrated feature tracking and object recognition. We have presented an analysis that motivates the design choices in light of attempting to make the run-time cost of the algorithm as small as possible. Our analysis allows us to reach a number of conclusions that are relevant for the design choices that a resource-constrained platform imposes. The need to integrate correspondence, or tracking, into recognition forces us to implement an efficient
Acknowledgments
This project was supported in part by ARO 56765-CI, ONR N00014-08-1-0414, AFOSR FA9550-09-1-0427. A video demonstration of the system can be seen at http://www.youtube.com/watch?v=cMv-McHw660.
References (45)
- Region matching with missing parts, Image and Vision Computing (2006)
- On the set of images modulo viewpoint and contrast changes
- The Bayesian Choice (2001)
- Controlled recognition bounds for scaling and occlusion channels
- Actionable information in vision
- Active vision, International Journal of Computer Vision (1988)
- Active perception, Proceedings of the IEEE (1988)
- The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results
- Semantic texton forests for image categorization and segmentation
- Keypoint recognition using randomized trees, IEEE Transactions on Pattern Analysis and Machine Intelligence (2006)
- Introduction to Modern Information Retrieval
- An iterative image registration technique with an application to stereo vision
- Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision
- Histograms of oriented gradients for human detection
- Principles for Automatic Scale Selection, Tech. Rep., KTH, Computational Vision and Active Perception Laboratory
- Geometric blur for template matching
- A fast local descriptor for dense matching
- Multiple target localisation at over 100 fps
- Differential Topology
- Morse Theory, Annals of Mathematics Studies No. 51
- Stochastic models for generic images, Quarterly of Applied Mathematics
☆ This paper has been recommended for acceptance by Jan-Michael Frahm. Editor's Choice articles are invited and handled by a select rotating 12-member Editorial Board committee.