Image and Vision Computing

Volume 29, Issue 10, September 2011, Pages 639-652

Video-based descriptors for object recognition

https://doi.org/10.1016/j.imavis.2011.08.003

Abstract

We describe a visual recognition system operating on a hand-held device, built around a video-based feature descriptor, and characterize its invariance and discriminative properties. Feature selection and tracking are performed in real-time, and used to train a template-based classifier during a capture phase prompted by the user. During normal operation, the system recognizes objects in the field of view based on their ranking. Severe resource constraints have prompted a re-evaluation of existing algorithms, improving their performance (accuracy and robustness) as well as their computational efficiency. We motivate the design choices in the implementation with a characterization of the stability properties of local invariant detectors, and of the conditions under which a template-based descriptor is optimal. The analysis also highlights the role of time as a “weak supervisor” during training, which we exploit in our implementation.

Highlights

► We analyze and derive representations of objects from video.
► We integrate multi-scale detection and tracking.
► We derive a video-based feature descriptor.
► We describe a visual recognition system for a hand-held device.

Introduction

We tackle the problem of recognizing objects and scenes from images, given example views. The difficulty of this problem lies in the large nuisance variability that the data can exhibit, depending on the vantage point, visibility conditions (occlusions), illumination, etc., under which the object is seen, even if it does not exhibit intrinsic variability. The analysis in [1] suggests that the nuisances induce almost all the variability in the data, and what remains (the dependency of the data on the object) is supported on a thin set. The most common approach to this problem is to eliminate some of the nuisances by pre-processing the data (to obtain “distinctive” and yet “insensitive” features), and to “learn away” the residual nuisance variability, often using a training set of manually labeled images. Both practices are poorly grounded in principle: pre-processing does not, in general, improve performance in a classification task (cf. the data processing inequality [2]); training a classifier using unrelated images (aiming to approximate independent samples from the class-conditional distribution) disregards the fact that there is a scene out there, and limits the classifier to learning generic regularities in images. It can be shown that, when a collection of passively gathered independent snapshots is used as the training set, not only is the worst-case error in a visual recognition problem at chance level (i.e., the risk is the same as that offered by the prior), but so is the average case [3]. This is not the case, however, when the training data consists of images purposefully captured during an active exploration phase [4].

In this paper we propose a different approach to recognition, grounded in the ideas of Active Vision [5], [6] and Actionable Information [4], whereby the training set consists not of isolated snapshots, such as photo collections harvested from the web, but of temporally coherent sequences of images where the user is free to move around an object or manipulate it. Even if the objects are static, the use of video results in quantifiably superior recognition performance in a single (test) image. More importantly, the issue of representation is well grounded in the presence of multiple images of the same scene: temporal continuity provides the crucial “bit” of information that the training images portray one and the same scene, so that all the variability in the data can be ascribed to the nuisances.

Contrary to common perception, building representations of objects from video for the purpose of recognition is not only sounder in principle, but also more computationally efficient. In fact, the descriptor we propose is far more efficient to compute than common descriptors computed from single images, and has better discriminative and invariance properties. To show this, we both derive our representation from first principles – demanding that our descriptor be the “best” among a chosen class tied to the classifier – and empirically evaluate the performance of the resulting recognition scheme, comparing it with popular baseline algorithms. We have also made our implementation available for others to try on their mobile phones, so our results can be independently validated.

We describe our implementation in Section 3, and the analysis that motivates the design choices in Section 2. We start from the most general visual decision task (detection, localization, recognition, categorization, etc.) and abstract it to a binary hypothesis testing problem, to highlight the crucial issue of representation (Section 2). While standard statistical decision theory trivializes representational issues, our goal is to design sufficient statistics that reduce the complexity of the decision at run-time as much as possible, while having the least impact on the optimality of the decision. Thus, we start from the notion of complete invariant statistics, and show how a classifier can be constructed based on them. We characterize the dependency of these statistics on nuisance factors via the notion of bounded-input-bounded-output (BIBO) stability, as well as structural stability. We show that these concepts enable reducing the marginalization process – where nuisances such as viewpoint, illumination, and partial occlusions would have to be integrated out at decision time – to a combinatorial decision that can be performed in real time even on severely resource-constrained hand-held platforms.
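To make the reduction concrete, the following toy sketch contrasts the two ways of handling a nuisance in the binary test above, for the simple case where the nuisance is an unknown translation of a template under an assumed Gaussian noise model. This is an illustration of the general idea, not the decision rule of Section 2; the names log_lik and detect are ours. In both branches the integral over the nuisance collapses to operations over a finite set of candidate shifts, which is what makes the decision combinatorial.

    import numpy as np

    def log_lik(patch, template, sigma=0.1):
        # Gaussian log-likelihood (up to an additive constant) of a patch given the template
        r = patch.astype(float).ravel() - template.astype(float).ravel()
        return -0.5 * float(r @ r) / sigma ** 2

    def detect(image, template, sigma=0.1):
        # Toy detector: is `template` present somewhere in `image`?
        # The unknown shift is the nuisance. We compare (a) marginalizing it out
        # under a uniform prior with (b) "maxing" it out, i.e. reducing the
        # integral to a search over the finite set of candidate locations.
        th, tw = template.shape
        H, W = image.shape
        shifts = [(i, j) for i in range(H - th + 1) for j in range(W - tw + 1)]
        logls = np.array([log_lik(image[i:i + th, j:j + tw], template, sigma)
                          for i, j in shifts])
        m = logls.max()
        marginal = m + np.log(np.mean(np.exp(logls - m)))  # log E_shift[p(data | shift)]
        maxed = m                                          # log max_shift p(data | shift)
        return marginal, maxed

Either score can then be compared against the corresponding statistic under the null hypothesis (template absent); the combinatorial (max-out) branch is the kind of computation that remains tractable on a hand-held platform.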

Our approach also highlights the importance of time in recognition problems. Its role is currently under-played in favor of hand-labeled training data, but time can effectively act as a “weak supervisor” in visual recognition, a role we attempt to exploit in our implementation.

Our effort relates to a wealth of recent work on visual recognition, localization and categorization represented, for instance, in the PASCAL challenge (see [7] for references). Our effort to run in real-time relates to [8], [9], although the constraints of a hand-held device limit the class of methods to much simpler classifiers, such as nearest neighbors and TF-IDF [10]. Rather than tinkering with the classifier, we focus on representation as the core issue. Modules of our system relate to multi-scale feature selection, tracking, local descriptors, and bag-of-features classification, specifically to the baseline algorithms [11], [12], [13], [14]. We propose a method to integrate multi-scale detection and tracking that does not involve joint location-scale optimization [15], but explicitly accounts for topological changes across scales. This approach (dubbed “tracking on the selection tree”, TST) respects the semi-group structure of scaling/quantization, and is motivated by the “structural stability” of the selection process. It improves accuracy and robustness while making tracking more efficient. We also replace traditional single-view descriptors [13], [16], [17] with a template that is designed to be optimal in the mean-square sense, under conditions described in Section 2, dubbed the “best template” descriptor (BTD). Unlike approaches that simulate nuisance variability in the training set from a single image [9], [18], we exploit the real nuisance distribution by tracking features across video frames during learning. Our contributions in this manuscript involve the tracker, TST (Section 2.6), the descriptor, BTD (Section 2.8), the analysis that motivates them (Section 2), and the implementation on a mobile device (Section 3).
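Since the device constraints limit us to simple ranking schemes such as nearest neighbors and TF-IDF [10], a minimal sketch of the kind of bag-of-features ranking involved is given below. It assumes descriptors have already been quantized into visual words; the function and variable names are ours, and the weighting and normalization details are illustrative rather than a description of our exact implementation.

    import numpy as np

    def tfidf_rank(query_words, database, vocab_size):
        # Rank database objects against a query by TF-IDF weighted cosine similarity.
        # query_words: visual-word ids observed in the test image.
        # database: dict mapping object name -> list of visual-word ids
        #           accumulated from that object's training video.
        names = list(database)
        tf = np.stack([np.bincount(database[n], minlength=vocab_size)
                       for n in names]).astype(float)         # term frequencies
        q = np.bincount(query_words, minlength=vocab_size).astype(float)
        df = (tf > 0).sum(axis=0)                              # document frequencies
        idf = np.log((len(names) + 1.0) / (df + 1.0))
        tf_idf, q_idf = tf * idf, q * idf
        sims = tf_idf @ q_idf / (np.linalg.norm(tf_idf, axis=1)
                                 * np.linalg.norm(q_idf) + 1e-12)
        order = np.argsort(-sims)
        return [(names[i], float(sims[i])) for i in order]

At test time the query histogram is built from the descriptors in the current frame and the top-ranked object is reported, which is the sense in which recognition is “based on ranking” in the abstract.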

Section snippets

Representation

This section motivates our algorithm design choices via analysis of an abstraction of the recognition problem. The reader interested in just the algorithmic aspect of the system can skip ahead to Section 3.

Implementation and experiments

We implemented the recognition system described above and tested its performance in terms of accuracy and computational efficiency. The integration of tracking on the selection tree (TST) with the best-template descriptors (BTD) enables the system to run in real time on a mobile device such as an iPhone, while providing recognition accuracy comparable to or better than that of traditional algorithms.
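To give a rough sense of why a template can be optimal in the mean-square sense, suppose (as a simplification of the conditions in Section 2) that the patches a tracked feature produces over the capture video, once warped to a canonical frame by the tracker, differ from a common template only by zero-mean noise; the template minimizing the summed squared error to all of them is then their pixel-wise mean. The sketch below follows that reading; the normalization details and function names are our assumptions, not the exact BTD computation.

    import numpy as np

    def best_template(tracked_patches):
        # MSE-optimal template for one feature track.
        # tracked_patches: array of shape (T, h, w), the patches a tracked feature
        # yields over T frames, already warped to a canonical frame by the tracker.
        p = np.asarray(tracked_patches, dtype=float)
        p -= p.mean(axis=(1, 2), keepdims=True)            # remove mean intensity
        p /= np.linalg.norm(p.reshape(len(p), -1), axis=1)[:, None, None] + 1e-12
        return p.mean(axis=0)                              # pixel-wise mean minimizes MSE

    def nearest_template(test_patch, templates):
        # Nearest-neighbor match of a normalized test patch against stored templates.
        t = np.asarray(test_patch, dtype=float)
        t -= t.mean()
        t /= np.linalg.norm(t) + 1e-12
        d = [np.linalg.norm(t - tpl) for tpl in templates]
        return int(np.argmin(d)), float(min(d))

Averaging over the actual tracked patches, rather than over synthetically warped versions of a single view, is also where the “real nuisance distribution” mentioned in the Introduction enters the descriptor.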

We have implemented the recognition system described above on an iPhone 3GS with a 600 MHz ARM CPU. The

Discussion

We have described a recognition system with integrated feature tracking and object recognition. We have presented an analysis that motivates the design choices in light of attempting to make the run-time cost of the algorithm as small as possible. Our analysis allows us to reach a number of conclusions that are relevant for the design choices that a resource-constrained platform imposes. The need to integrate correspondence, or tracking, into recognition forces us to implement an efficient

Acknowledgments

This project was supported in part by ARO 56765-CI, ONR N00014-08-1-0414, AFOSR FA9550-09-1-0427. A video demonstration of the system can be seen at http://www.youtube.com/watch?v=cMv-McHw660.

References (45)

  • A. Duci et al.

    Region matching with missing parts

    Image and Vision Computing

    (2006)
  • G. Sundaramoorthi et al.

    On the set of images modulo viewpoint and contrast changes

  • C.P. Robert

    The Bayesian Choice

    (2001)
  • S. Soatto et al.

    Controlled recognition bounds for scaling and occlusion channels

  • S. Soatto

    Actionable information in vision

  • J. Aloimonos et al.

    Active vision

    International Journal of Computer Vision

    (1988)
  • R. Bajcsy

    Active perception

    Proceedings of the IEEE

    (1988)
  • M. Everingham et al.

    The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results

  • J. Shotton et al.

    Semantic texton forests for image categorization and segmentation

  • V. Lepetit et al.

    Keypoint recognition using randomized trees

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2006)
  • G. Salton et al.

    Introduction to Modern Information Retrieval

    (1983)
  • B. Lucas et al.

    An iterative image registration technique with an application to stereo vision

  • S. Baker et al.
  • D.G. Lowe

    Distinctive image features from scale-invariant keypoints

International Journal of Computer Vision

    (2004)
  • N. Dalal et al.

    Histograms of oriented gradients for human detection

  • T. Lindeberg

    Principles for Automatic Scale Selection, Tech. rep., KTH, Computational Vision and Active Perception laboratory

    (1998)
  • A. Berg et al.

    Geometric blur for template matching

  • E. Tola et al.

    A fast local descriptor for dense matching

  • S. Taylor et al.

    Multiple target localisation at over 100 fps

  • V. Guillemin et al.

    Differential Topology

    (1974)
  • J. Milnor

    Morse Theory, Annals of Mathematics Studies No. 51

    (1969)
  • D. Mumford et al.

    Stochastic models for generic images

    Quarterly of Applied Mathematics

    (2001)

This paper has been recommended for acceptance by Jan-Michael Frahm. Editor's Choice Articles are invited and handled by a select rotating 12-member Editorial Board committee.
