Pattern Recognition

Volume 74, February 2018, Pages 459-473

On-line fusion of trackers for single-object tracking

https://doi.org/10.1016/j.patcog.2017.09.026

Highlights

  • The work focuses on the design of good strategies for the on-line fusion of trackers.

  • Fusion can operate at two levels: tracker output selection and model correction.

  • We show experimentally that the ability to predict drift is essential for fusion.

  • We propose a tracker complementarity measure to choose the best tracker combination.

Abstract

Visual object tracking is a fundamental computer vision function that has been the subject of numerous studies. The diversity of the proposed approaches leads to the idea of fusing them to take advantage of their individual strengths while controlling the noise they may introduce in some circumstances. The work presented here describes a generic framework for combining and/or selecting on-line the different components of the processing chain of a set of trackers, and examines the impact of various fusion strategies. The results are assessed using a repertoire of 9 state-of-the-art trackers evaluated over 46 fusion strategies on the VOT 2013, VOT 2015 and OTB-100 datasets. A complementarity measure able to predict the overall performance of a given set of trackers is also proposed.

Introduction

Single-object tracking is a computer vision function with a long research history. Indeed, mastering the capacity to pursue a given target reliably and efficiently, while remaining robust to various nuisance phenomena, is key to many off-line or on-line applications exploiting video data (security, video surveillance, road traffic control, production control, human-machine interaction, multimedia indexing). This research topic produces a huge number of studies each year, sometimes accompanied by new evaluation benchmarks and metrics, e.g., the ALOV++ dataset [1] or the On-line Tracking Benchmark [2]. One of the recent outstanding benchmark actions has been the yearly VOT Challenges, which have emphasized the evaluation of single-object model-free (no pre-learned model) short-term (no re-detection function) tracking through two criteria: robustness to drift and localization accuracy. The main conclusions were that “None of the trackers consistently outperformed the others by all measures at all sequence attributes” and that “trackers tend to specialize either for robustness or accuracy” [3].

This article builds on this empirical fact and studies how an existing repertoire of available trackers can be combined generically and efficiently. Tracker combination is based on the traditional fusion concepts of redundancy – tracker outputs are combined – and complementarity – the repertoire of trackers samples different features and functional structures. A key component of a fusion scheme is a self-diagnosis capacity: its role is to prevent the propagation of errors during fusion, to detect and discard trackers with noisy behaviors, and possibly to correct them. It will be shown that possessing such a capacity greatly improves fusion performance.

The emphasis of the present work is on analyzing the drifting behavior of trackers. From an operational point of view, losing the tracked target is certainly the most damaging type of error, since it also implies losing awareness of the target. Target re-acquisition is possible but is generally less reliable and more costly. The VOT challenge has been one of the first benchmarks to address this question specifically, and our evaluation follows its methodology.

Fusion approaches can be divided into two families [4]: passive and active. Passive approaches only combine tracker outputs, with no interaction between the trackers, whereas active ones integrate the data provided by each tracker with the objective of correcting the trackers' inner models when necessary. We show that active fusion generally leads to better performance, but requires control over tracker components and update mechanisms.

Determining the most efficient combinations of trackers, i.e., those enabling a high level of robustness at low cost, has a practical impact. We introduce a complementarity measure between trackers, based on individual drift measures, that predicts the fusion performance of a combined set of trackers so that the best set can be selected. The main contributions of this paper are the following:

  • (1) We describe a generic and parametric framework for the combination of trackers and identify several levels and modes at which fusion can operate. The formalism results in 46 different configurations.

  • (2) We show experimentally that the ability to predict drift is essential for good fusion performance, and propose several on-line schemes to compute such a prediction for use in active fusion approaches.

  • (3) We evaluate and rank the fusion configurations on 4 databases using a repertoire of 9 state-of-the-art trackers.

  • (4) We propose a quantitative way to predict the complementarity of trackers, in order to determine the best combinations for fusion.

The paper is organized as follows: Section 2 introduces the different notions of fusion applied to the on-line combination of trackers. Section 3 presents the corresponding related work. Section 4 evaluates a series of trackers and analyses their behavior with the idea of fusing them. Several ways to detect on-line abnormal tracker behavior are presented in Section 5. The fusion framework is described in Section 6. Section 7 gives material and implementation details for the experiments used to identify the key fusion strategies in Section 8. Section 9 concludes with final recommendations.

Section snippets

Exploiting multiple trackers

The fusion of trackers is based on a simple principle: each single tracker has its own domain of expertise, which can be enlarged by fusion at different levels. To understand how fusion can be realized, a better understanding of the structure and functioning of a single tracker is first needed.

Related work

The idea of fusion in tracking has been addressed in the literature either as a control and selection of models exploited in the processing chain, or as a dynamic combination of the inputs and outputs of several modules. The information fusion domain sometimes speaks of centralized vs. decentralized functional architectures [6].

Most of the tracking approaches proposed in the literature can be considered as falling into the centralized architecture category: they describe different ways to

Off-line tracker evaluation

A customary step before proposing a fusion approach is to assess the quality of the components that will be combined. We present in this section how individual tracker performance is classically evaluated (Section 4.1). Additionally, we describe a local analysis that evaluates the potential complementarity of trackers and is better suited to fusion issues (Section 4.2).
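
The exact form of the complementarity measure is not reproduced in this snippet, but a natural drift-based form, given here purely as an illustrative assumption, is the fraction of frames on which at least one tracker of the set has not drifted:

```python
def complementarity(drift_flags):
    """Illustrative drift-based complementarity score (an assumption,
    not necessarily the paper's exact measure). drift_flags[i][t] is
    True when tracker T_i has drifted off the target at frame t.
    Returns the fraction of frames on which at least one tracker of
    the set is still on target; a higher score means the trackers fail
    at different times and are therefore good candidates for fusion."""
    num_frames = len(drift_flags[0])
    rescued = sum(
        1 for t in range(num_frames)
        if any(not flags[t] for flags in drift_flags)
    )
    return rescued / num_frames
```

Under this definition, two trackers that each drift on half of the frames but never simultaneously would score 1.0, whereas two trackers drifting on exactly the same frames would score only 0.5.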

On-line tracker failure prediction

Many existing trackers can be used in our fusion framework: it is only required that they output a bounding box and an associated confidence value at each frame. Here, we describe an important stage of the fusion of multiple tracker outputs, which is to select the correct outputs. Our approach is to predict tracking failures from a set of M parallel trackers T = {T1, T2, …, TM}, either individually or collectively. At each time t, for each tracker Ti, i ∈ [1, M], a state s_t^i ∈ {0, 1} is estimated, 1
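
As a concrete illustration of this selection stage, the sketch below estimates the binary state s_t^i by thresholding each tracker's self-reported confidence. The `TrackerOutput` container and the threshold value are assumptions made for illustration; the paper proposes several richer individual and collective predictors.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TrackerOutput:
    # Hypothetical per-frame output of a tracker: a bounding box
    # (x, y, w, h) and an associated scalar confidence.
    box: Tuple[float, float, float, float]
    confidence: float

def predict_states(outputs: List[TrackerOutput],
                   threshold: float = 0.5) -> List[int]:
    """Estimate the state s_t^i of each tracker T_i at the current frame:
    1 = output considered correct, 0 = predicted failure (drift).
    Thresholding the confidence is the simplest individual predictor;
    collective schemes would instead compare the M outputs with each other."""
    return [1 if out.confidence >= threshold else 0 for out in outputs]
```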

Proposed fusion approach

Our system is composed of a set of M trackers T = {T1, T2, …, TM} running in parallel. At the starting frame, all trackers receive a bounding box B0 as input, which is the known localization of the target in the first image I0. At each image It, each tracker Ti, i ∈ [1, M], outputs an estimated bounding box B̂_t^i. Our fusion approach consists of fusing their outputs B̂_t = (B̂_t^1, B̂_t^2, …, B̂_t^M) with a dynamic selection of the good ones, to give the output of the system B̂_t^fusion. This selection goes
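
A minimal sketch of this selection-then-combination loop, reusing `predict_states` from the previous sketch, might look as follows. Averaging the selected boxes and re-initializing failed trackers with the fused box are illustrative assumptions standing in for the paper's 46 configurations, and the `update`/`reinit` tracker interface is hypothetical.

```python
def fuse_outputs(outputs, states):
    """Combine the boxes of the trackers whose estimated state is 1 by
    coordinate-wise averaging (one simple combination rule among many).
    Falls back to the most confident tracker if all are predicted to fail."""
    selected = [o.box for o, s in zip(outputs, states) if s == 1]
    if not selected:
        return max(outputs, key=lambda o: o.confidence).box
    return tuple(sum(b[i] for b in selected) / len(selected) for i in range(4))

def track_frame(trackers, frame, active=True):
    """One step of the fused system producing B̂_t^fusion from M parallel
    trackers. Passive mode only combines the outputs; active mode also
    corrects trackers predicted to have drifted."""
    outputs = [t.update(frame) for t in trackers]   # each returns a TrackerOutput
    states = predict_states(outputs)                # 0/1 failure prediction
    fused_box = fuse_outputs(outputs, states)
    if active:
        for t, s in zip(trackers, states):
            if s == 0:
                t.reinit(frame, fused_box)          # model correction step
    return fused_box
```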

Material and implementation

Databases. We evaluate our fusion approach on 3 databases with varied objects and scenes in challenging conditions (camera motion and zoom, brightness changes, occlusions, deformable objects, fast appearance changes and object movements, etc.):

  • (a) VOT2013+ contains 12 videos from VOT2013 [33], supplemented with 1 video from the KITTI Vision Benchmark Suite [45] and 5 other videos we captured with a GoPro camera mounted on a vehicle, available on request. VOT2013+ contains a total of 25 objects

Experimental results

We conducted thorough experiments with our available data and software to answer two questions: the impact of the selection and correction steps in our fusion approach, and which fusion configuration is best (Section 8.1); and which combination of trackers achieves high robustness at low cost (Section 8.2).

Conclusion

The work described in this paper focused on the design of good strategies for the on-line fusion of trackers. The emphasis was on controlling the overall robustness of tracking, measured as the number of drifting events, i.e., the number of times the target is lost on a given database. Trackers deal with critical situations (illumination, occlusion, appearance changes, camera motion) differently; the idea was to exploit their complementarity through various fusion strategies.

Fusion can

References (48)

  • X. Li et al.

    A survey of appearance models in visual object tracking

    ACM Trans. Intell. Syst. Technol.

    (2013)
  • Y. Bar-Shalom et al.

    Tracking and Data Fusion: A Handbook of Algorithms.

    (2011)
  • M.H. Khan et al.

    A generalized search method for multiple competing hypotheses in visual tracking

    Proceedings of the International Conference on Pattern Recognition

    (2014)
  • J. Kwon et al.

    Tracking by sampling trackers

    Proceedings of the International Conference on Computer Vision

    (2011)
  • O. Khalid et al.

    Multi-tracker partition fusion

    IEEE Trans. Circuits Syst. Video Technol.

    (2017)
  • J. Zhang et al.

    MEEM: robust tracking via multiple experts using entropy minimization

    Proceedings of the European Conference on Computer Vision

    (2014)
  • J.H. Yoon et al.

    Visual tracking via adaptive tracker selection with multiple features

    Proceedings of the European Conference on Computer Vision

    (2012)
  • H. Nam et al.

    Learning multi-domain convolutional neural networks for visual tracking

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • Y. Zhou et al.

    Similarity fusion for visual tracking

    Int. J. Comput. Vis.

    (2016)
  • A. Yilmaz et al.

    Object tracking: a survey

    ACM Comput. Surv. (CSUR)

    (2006)
  • N.T. Siebel et al.

    Fusion of multiple tracking algorithms for robust people tracking

    Proceedings of the European Conference on Computer Vision

    (2002)
  • B. Stenger et al.

    Learning to track with multiple observers

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2009)
  • J. Santner et al.

    PROST: parallel robust online simple tracking

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2010)
  • S. Moujtahid et al.

    Coherent selection of independent trackers for real-time object tracking

    Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP)

    (2015)

Isabelle Leang graduated from the Ecole Nationale Supérieure de l’Electronique et de ses Applications (France) and received a Master’s degree in Computer Science from the Université de Cergy-Pontoise (France) in 2012. She is currently preparing a Ph.D. degree in the Information Processing and Modeling Department at ONERA (France).

Stéphane Herbin received an engineering degree from the Ecole Supérieure d’Electricité (Supélec), the M.Sc. degree in Electrical Engineering from the University of Illinois at Urbana-Champaign, and the Ph.D. degree in applied mathematics from the Ecole Normale Supérieure de Cachan. Employed by ONERA since 2000, he works in the Information Processing and Modeling Department. His main research interests are stochastic modeling and analysis for object recognition and scene interpretation in images and videos.

Benoît Girard received a Ph.D. degree in Computer Science (2003) from the Université Pierre et Marie Curie (Paris, France). He currently works as a Research Director at the Centre National de la Recherche Scientifique. His main research interests are action selection, reinforcement learning and decision making in animals and robots.

Jacques Droulez received mathematical training at the Ecole Polytechnique (Paris, France) and an MD (1982) from the University Paris 6. He is currently a Research Director at the Centre National de la Recherche Scientifique. His main research interests are motion and object perception, sensorimotor control and Bayesian modeling of biological systems.
