Pattern Recognition

Volume 74, February 2018, Pages 459-473

On-line fusion of trackers for single-object tracking

https://doi.org/10.1016/j.patcog.2017.09.026

Highlights

  • The work focuses on the design of good strategies for the on-line fusion of trackers.

  • Fusion can operate at two levels: tracker output selection and model correction.

  • We show experimentally that the ability to predict drift is essential for fusion.

  • We propose a tracker complementarity measure to choose the best tracker combination.

Abstract

Visual object tracking is a fundamental computer vision function that has been the subject of numerous studies. The diversity of the proposed approaches leads to the idea of fusing them to take advantage of their individual strengths while controlling the noise they may introduce in some circumstances. The work presented here describes a generic framework for combining and/or selecting on-line the different components of the processing chain of a set of trackers, and examines the impact of various fusion strategies. The results are assessed using a repertoire of 9 state-of-the-art trackers evaluated over 46 fusion strategies on the VOT 2013, VOT 2015 and OTB-100 datasets. A complementarity measure able to predict the overall performance of a given set of trackers is also proposed.

Introduction

Single-object tracking is a computer vision function with a long research history. Indeed, mastering the capacity to pursue a given target reliably and efficiently, while remaining robust to various nuisance phenomena, is key to many off-line or on-line applications exploiting video data (security, video surveillance, road traffic control, production control, human-machine interaction, multimedia indexing). This research topic produces a huge number of studies each year, sometimes accompanied by new evaluation benchmarks and metrics, e.g., the ALOV++ dataset [1] or the On-line Tracking Benchmark [2]. One of the recent outstanding benchmark actions has been the yearly VOT Challenges, which have emphasized the evaluation of single-object model-free (no pre-learned model) short-term (no re-detection function) tracking through two criteria: robustness to drift and localization accuracy. The main conclusions were that “None of the trackers consistently outperformed the others by all measures at all sequence attributes” and that “trackers tend to specialize either for robustness or accuracy” [3].

This article builds on this empirical fact and studies how an existing repertoire of available trackers can be combined generically and efficiently. Tracker combination is based on the traditional fusion concepts of redundancy – tracker outputs are combined – and complementarity – the repertoire of trackers samples different features and functional structures. A key component of a fusion scheme is a self-diagnosis capacity: its role is to prevent the propagation of errors during fusion, to detect and discard trackers with noisy behaviors, and possibly to correct them. It will be shown that possessing such a capacity greatly improves fusion performance.

The emphasis of the present work is on analyzing the drifting behavior of trackers. From an operational point of view, losing the tracked target is certainly the most damaging type of error, since it also implies losing awareness of the target. Target re-acquisition is possible but is generally less reliable and more costly. The VOT challenge has been one of the first benchmarks to address this question specifically, and our evaluation follows its methodology.

Fusion approaches can be divided into two families [4]: passive and active. Passive approaches only combine tracker outputs, with no interaction between the trackers, whereas active ones integrate the data provided by each tracker with the objective of correcting the trackers' inner models when necessary. We show that active fusion generally leads to better performance, but requires control over tracker components and update mechanisms.

Determining the most efficient combinations of trackers, i.e., those enabling a high level of robustness at low cost, has a practical impact. We introduce a complementarity measure between trackers, based on individual drift measures, that predicts the fusion performance of a combined set of trackers so that the best set can be selected. The main contributions of this paper are the following:

  • (1) We describe a generic and parametric framework for the combination of trackers and identify several levels and modes at which fusion can operate. The formalism results in 46 different configurations.

  • (2) We show experimentally that the ability to predict drift is essential for good fusion performance, and propose several on-line schemes to compute such a prediction for use in active fusion approaches.

  • (3) We evaluate and rank the fusion configurations on 4 databases using a repertoire of 9 state-of-the-art trackers.

  • (4) We propose a quantitative way to predict the complementarity of trackers, in order to determine the best combinations for fusion.

The paper is organized as follows: Section 2 introduces the different notions of fusion applied to the on-line combination of trackers. Section 3 presents the corresponding related work. Section 4 evaluates a series of trackers and analyses their behavior with the idea of fusing them. Several ways to detect on-line abnormal tracker behavior are presented in Section 5. The fusion framework is described in Section 6. Section 7 gives material and implementation details for the experiments used to identify the key fusion strategies in Section 8. Section 9 concludes with final recommendations.

Section snippets

Exploiting multiple trackers

The fusion of trackers is based on a simple principle: each single tracker has its own domain of expertise, which can be enlarged by fusion at different levels. To understand how fusion can be realized, a better understanding of the structure and functioning of a single tracker is first needed.

Related work

The idea of fusion in tracking has been addressed in the literature either as a control and selection of models exploited in the processing chain, or as a dynamic combination of the inputs and outputs of several modules. The information fusion domain sometimes speaks of centralized vs. decentralized functional architectures [6].

Most of the tracking approaches proposed in the literature can be considered as falling into the centralized architecture category: they describe different ways to

Off-line tracker evaluation

A customary step before proposing a fusion approach is to assess the quality of the components that will be combined. We present in this section how individual tracker performance is classically evaluated (Section 4.1). Additionally, we describe a local analysis that evaluates the potential complementarity of trackers and is better suited to fusion issues (Section 4.2).
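
The exact form of the complementarity measure is not reproduced in this snippet, but a natural drift-based form, given here purely as an illustrative assumption, is the fraction of frames on which at least one tracker of the set has not drifted:

```python
def complementarity(drift_flags):
    """Illustrative drift-based complementarity score (an assumption,
    not necessarily the paper's exact measure). drift_flags[i][t] is
    True when tracker T_i has drifted off the target at frame t.
    Returns the fraction of frames on which at least one tracker of
    the set is still on target; a higher score means the trackers fail
    at different times and are therefore good candidates for fusion."""
    num_frames = len(drift_flags[0])
    rescued = sum(
        1 for t in range(num_frames)
        if any(not flags[t] for flags in drift_flags)
    )
    return rescued / num_frames
```

Under this definition, two trackers that each drift on half of the frames but never simultaneously would score 1.0, whereas two trackers drifting on exactly the same frames would score only 0.5.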

On-line tracker failure prediction

Many existing trackers can be used in our fusion framework: it is only required that they output a bounding box and an associated confidence value at each frame. Here, we describe an important stage of the fusion of multiple tracker outputs, which is to select the correct outputs. Our approach is to predict tracking failures from a set of M parallel trackers T = {T1, T2, …, TM}, either individually or collectively. At each time t, for each tracker Ti, i ∈ [1, M], a state s_t^i ∈ {0, 1} is estimated, 1
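
As a concrete illustration of this selection stage, the sketch below estimates the binary state s_t^i by thresholding each tracker's self-reported confidence. The `TrackerOutput` container and the threshold value are assumptions made for illustration; the paper proposes several richer individual and collective predictors.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrackerOutput:
    # Hypothetical per-frame output of a tracker: a bounding box
    # (x, y, w, h) and an associated scalar confidence.
    box: Tuple[float, float, float, float]
    confidence: float

def predict_states(outputs: List[TrackerOutput],
                   threshold: float = 0.5) -> List[int]:
    """Estimate the state s_t^i of each tracker T_i at the current frame:
    1 = output considered correct, 0 = predicted failure (drift).
    Thresholding the confidence is the simplest individual predictor;
    collective schemes would instead compare the M outputs with each other."""
    return [1 if out.confidence >= threshold else 0 for out in outputs]
```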

Proposed fusion approach

Our system is composed of a set of M trackers T = {T1, T2, …, TM} running in parallel. At the starting frame, all trackers receive a bounding box B0 as input, which is the known localization of the target in the first image I0. At each image It, each tracker Ti, i ∈ [1, M], outputs an estimated bounding box B̂_t^i. Our fusion approach consists of fusing their outputs B̂_t = (B̂_t^1, B̂_t^2, …, B̂_t^M) with a dynamic selection of the good ones, to give the output of the system B̂_t^fusion. This selection goes
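
A minimal sketch of this selection-then-combination loop, reusing `predict_states` from the previous sketch, might look as follows. Averaging the selected boxes and re-initializing failed trackers with the fused box are illustrative assumptions standing in for the paper's 46 configurations, and the `update`/`reinit` tracker interface is hypothetical.

```python
def fuse_outputs(outputs, states):
    """Combine the boxes of the trackers whose estimated state is 1 by
    coordinate-wise averaging (one simple combination rule among many).
    Falls back to the most confident tracker if all are predicted to fail."""
    selected = [o.box for o, s in zip(outputs, states) if s == 1]
    if not selected:
        return max(outputs, key=lambda o: o.confidence).box
    return tuple(sum(b[i] for b in selected) / len(selected) for i in range(4))

def track_frame(trackers, frame, active=True):
    """One step of the fused system producing B̂_t^fusion from M parallel
    trackers. Passive mode only combines the outputs; active mode also
    corrects trackers predicted to have drifted."""
    outputs = [t.update(frame) for t in trackers]   # each returns a TrackerOutput
    states = predict_states(outputs)                # 0/1 failure prediction
    fused_box = fuse_outputs(outputs, states)
    if active:
        for t, s in zip(trackers, states):
            if s == 0:
                t.reinit(frame, fused_box)          # model correction step
    return fused_box
```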

Material and implementation

Databases. We evaluate our fusion approach on 3 databases with varied objects and scenes in challenging conditions (camera motion and zoom, brightness changes, occlusions, deformable objects, fast appearance changes and object movements, etc.):

  • (a) VOT2013+ contains 12 videos from VOT2013 [33], supplemented with 1 video from the KITTI Vision Benchmark Suite [45] and 5 other videos we captured with a GoPro camera mounted on a vehicle, available on request. VOT2013+ contains a total of 25 objects

Experimental results

We conducted thorough experiments with our available data and software to answer two questions: the impact of the selection and correction steps in our fusion approach, and which fusion configuration is best (Section 8.1); and which combination of trackers achieves high robustness at low cost (Section 8.2).

Conclusion

The work described in this paper focused on the design of good strategies for the on-line fusion of trackers. The emphasis was on controlling the overall robustness of tracking, measured as the number of drifting events, i.e., the number of times the target is lost on a given database. Trackers deal with critical situations (illumination, occlusion, appearance changes, camera motion) differently; the idea was to exploit their complementarity through various fusion strategies.

Fusion can

References (48)

  • X. Li et al.

    A survey of appearance models in visual object tracking

    ACM Trans. Intell. Syst. Technol.

    (2013)
  • Y. Bar-Shalom et al.

    Tracking and Data Fusion: A Handbook of Algorithms.

    (2011)
  • M.H. Khan et al.

    A generalized search method for multiple competing hypotheses in visual tracking

    Proceedings of the International Conference on Pattern Recognition

    (2014)
  • J. Kwon et al.

    Tracking by sampling trackers

    Proceedings of the International Conference on Computer Vision

    (2011)
  • O. Khalid et al.

    Multi-tracker partition fusion

    IEEE Trans. Circuits Syst. Video Technol.

    (2017)
  • J. Zhang et al.

    MEEM: robust tracking via multiple experts using entropy minimization

    Proceedings of the European Conference on Computer Vision

    (2014)
  • J.H. Yoon et al.

    Visual tracking via adaptive tracker selection with multiple features

    Proceedings of the European Conference on Computer Vision

    (2012)
  • H. Nam et al.

    Learning multi-domain convolutional neural networks for visual tracking

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Y. Zhou et al.

    Similarity fusion for visual tracking

    Int. J. Comput. Vis.

    (2016)
  • A. Yilmaz et al.

    Object tracking: a survey

    ACM Comput. Surv. (CSUR)

    (2006)
  • N.T. Siebel et al.

    Fusion of multiple tracking algorithms for robust people tracking

    Proceedings of the European Conference on Computer Vision

    (2002)
  • B. Stenger et al.

    Learning to track with multiple observers

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2009)
  • J. Santner et al.

    PROST: parallel robust online simple tracking

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2010)
  • S. Moujtahid et al.

    Coherent selection of independent trackers for real-time object tracking

    Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP)

    (2015)

Isabelle Leang graduated from the Ecole Nationale Supérieure de l’Electronique et de ses Applications (France) and received a Master’s degree in Computer Science from the Université de Cergy-Pontoise (France) in 2012. She is currently preparing a Ph.D. degree in the Information Processing and Modeling Department at ONERA (France).

Stéphane Herbin received an engineering degree from the Ecole Supérieure d’Electricité (Supélec), the M.Sc. degree in Electrical Engineering from the University of Illinois at Urbana-Champaign, and the Ph.D. degree in applied mathematics from the Ecole Normale Supérieure de Cachan. Employed by ONERA since 2000, he works in the Information Processing and Modeling Department. His main research interests are stochastic modeling and analysis for object recognition and scene interpretation in images and videos.

Benoît Girard received a Ph.D. degree in Computer Science (2003) from the Université Pierre et Marie Curie (Paris, France). He currently works as a Research Director at the Centre National de la Recherche Scientifique. His main research interests are action selection, reinforcement learning and decision making in animals and robots.

Jacques Droulez received mathematical training at the Ecole Polytechnique (Paris, France) and an MD (1982) from the University Paris 6. He is currently a Research Director at the Centre National de la Recherche Scientifique. His main research interests are motion and object perception, sensorimotor control and Bayesian modeling of biological systems.
