On-line fusion of trackers for single-object tracking
Introduction
Single-object tracking is a computer vision task with a long research history. Indeed, the ability to track a given target reliably and efficiently, while remaining robust to various nuisance phenomena, is key to many off-line or on-line applications exploiting video data (security, video surveillance, road traffic control, production control, human-machine interaction, multimedia indexing). This research topic produces a large number of studies each year, sometimes accompanied by new evaluation benchmarks and metrics, e.g., the ALOV++ dataset [1], or the On-line Tracking Benchmark [2]. One of the recent outstanding benchmark initiatives has been the yearly VOT Challenges, which have emphasized the evaluation of single-object model-free (no pre-learned model) short-term (no re-detection function) tracking through two criteria: robustness to drift and localization accuracy. The main conclusions were that “None of the trackers consistently outperformed the others by all measures at all sequence attributes” and that “trackers tend to specialize either for robustness or accuracy” [3].
What is proposed in this article is to build on this empirical fact and study how an existing repertoire of available trackers can be generically combined in an efficient way. Tracker combination will be based on the traditional fusion concepts of redundancy – tracker outputs are combined – and complementarity – the repertoire of trackers samples different features and functional structures. A key component of a fusion scheme is the availability of a self-diagnosis capacity: its role is to prevent the propagation of errors during fusion, detect and discard trackers with noisy behaviors, and, when possible, correct them. It will be shown that possessing such a capacity greatly improves fusion performance.
The emphasis of the present work is on analyzing the drifting behavior of trackers. From an operational point of view, losing the tracked target is certainly the most damaging type of error, since the system is then no longer even aware of the target. Target re-acquisition is possible but is generally less reliable and more costly. The VOT challenge has been one of the first benchmarks to address this question specifically, and our evaluation will follow its methodology.
Fusion approaches can be divided into two families [4]: passive and active. Passive approaches only combine tracker outputs, with no interaction between the trackers, whereas active ones feed data back to each tracker with the objective of correcting its inner model when necessary. We show that active fusion generally leads to better performance, but requires control over tracker components and update mechanisms.
Determining the most efficient combinations of trackers, i.e., those enabling high robustness at low cost, has practical impact. We introduce a complementarity measure between trackers, based on individual drift measures, to predict the fusion performance of a combination and thus guide its selection. The main contributions of this paper are the following:
- (1)
We describe a generic and parametric framework for the combination of trackers and identify several levels and modes where fusion can operate. The formalism results in 46 different configurations.
- (2)
We show experimentally that the ability to predict drift is essential for good fusion performance and propose several on-line schemes to compute such a prediction that can be used in active fusion approaches.
- (3)
We evaluate and rank the fusion configurations on 4 databases using a repertoire of 9 state-of-the-art trackers.
- (4)
We propose a quantitative way to predict the complementarity of trackers to determine the best combinations for fusion.
The paper is organized as follows: Section 2 introduces the different notions of fusion applied to the on-line combination of trackers. Section 3 presents the corresponding related work. Section 4 evaluates a series of trackers and analyzes their behavior with a view to fusing them. Several ways to detect abnormal tracker behavior on-line are presented in Section 5. The fusion framework is described in Section 6. Section 7 gives material and implementation details for the experiments used to identify the key fusion strategies in Section 8. Section 9 concludes with final recommendations.
Section snippets
Exploiting multiple trackers
The fusion of trackers is based on a simple principle: each single tracker has its own domain of expertise, which can be enlarged by fusion at different levels. To understand how fusion can be realized, a better understanding of the structure and functioning of a single tracker is first needed.
Related work
The idea of fusion in tracking has been addressed in the literature either as a control and selection of models exploited in the processing chain, or as a dynamic combination of the inputs and outputs of several modules. The information fusion domain sometimes speaks of centralized vs. decentralized functional architectures [6].
Most of the tracking approaches proposed in the literature can be considered as falling into the centralized architecture category: they describe different ways to
Off-line tracker evaluation
A standard step before proposing a fusion approach is to assess the quality of the components that will be combined. We present in this section how individual tracker performance is classically evaluated (Section 4.1). Additionally, we describe a local analysis evaluating the potential complementarity of trackers, better suited to fusion issues (Section 4.2).
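Following the VOT-style methodology cited above, individual tracker quality is typically summarized by two numbers: accuracy (mean overlap with the ground truth while the target is tracked) and robustness (number of drift/failure events). A minimal sketch of such an evaluation, assuming axis-aligned boxes in (x, y, w, h) format and counting a failure whenever the overlap falls to zero (a simplification of the exact challenge protocol):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def vot_style_scores(predicted, ground_truth):
    """Accuracy = mean overlap over tracked frames; robustness = number of
    frames where the overlap drops to zero (counted as a drift event)."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures
```

In the actual VOT protocol the tracker is re-initialized a few frames after each failure; the sketch above omits that step and only illustrates the two criteria.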
On-line tracker failure prediction
Many existing trackers can be used in our fusion framework: it is only required that they output a bounding box and an associated confidence value at each frame. Here, we describe an important stage of the fusion of multiple tracker outputs, which is to select the correct outputs. Our approach is to predict tracking failures from a set of M parallel trackers, either individually or collectively. At each time t, for each tracker Ti, i ∈ [1, M], a state is estimated.
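One simple individual scheme of this kind (a hypothetical illustration, not necessarily one of the predictors evaluated in the paper) flags a tracker as failed when its per-frame confidence drops well below its own recent running level; the window size and drop ratio are illustrative parameters:

```python
from collections import deque

class ConfidenceFailureDetector:
    """Flags a likely tracking failure when a tracker's confidence drops
    well below its own recent average. One detector per tracker Ti."""

    def __init__(self, window=30, drop_ratio=0.5):
        self.history = deque(maxlen=window)  # sliding window of confidences
        self.drop_ratio = drop_ratio

    def update(self, confidence):
        """Return True if the current frame looks like a failure."""
        if self.history:
            baseline = sum(self.history) / len(self.history)
            failed = confidence < self.drop_ratio * baseline
        else:
            failed = False  # no baseline yet on the first frame
        self.history.append(confidence)
        return failed
```

Comparing each tracker to its own baseline rather than to a fixed threshold makes the test usable across trackers whose confidence values are not calibrated against each other.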
Proposed fusion approach
Our system is composed of a set of M trackers running in parallel. At the starting frame, all trackers receive a bounding box B0 as input, which is the known localization of the target in the first image I0. At each image It, each tracker Ti, i ∈ [1, M], outputs an estimated bounding box Bi,t. Our fusion approach consists of fusing these outputs, with a dynamic selection of the good ones, to produce the output of the system Bt. This selection goes
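In its simplest passive form, such a select-then-fuse step can be sketched as follows (a minimal illustration, not the paper's exact scheme): discard the trackers flagged as failed, then take the coordinate-wise median of the surviving boxes.

```python
from statistics import median

def fuse_outputs(boxes, failed):
    """Fuse M tracker outputs (x, y, w, h): drop trackers flagged as
    failed, then take the coordinate-wise median of the survivors."""
    kept = [b for b, f in zip(boxes, failed) if not f]
    if not kept:        # every tracker flagged: fall back to all outputs
        kept = boxes
    return tuple(median(c) for c in zip(*kept))
```

The median makes the fused box robust to a minority of drifting trackers that slipped past the selection step, at no extra cost.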
Material and implementation
Databases. We evaluate our fusion approach on 3 databases with varied objects and scenes in challenging conditions (camera motion and zoom, brightness changes, occlusions, deformable objects, fast appearance changes and object movements, etc.):
- (a)
VOT2013+ contains 12 videos from VOT2013 [33], supplemented with 1 video from the KITTI Vision Benchmark Suite [45] and 5 other videos we recorded with a GoPro camera mounted on a vehicle, available on request. VOT2013+ contains a total of 25 objects
Experimental results
We conducted thorough experiments with our available data and software to answer two questions: the impact of the selection and correction steps in our fusion approach, and the best fusion configuration, in Section 8.1; the best combination of trackers achieving high robustness at low cost, in Section 8.2.
Conclusion
The work described in this paper focused on the design of good strategies for the on-line fusion of trackers. The emphasis was on controlling the overall robustness of tracking, measured as the number of drifting events, i.e., the number of times the target is lost on a given database. Trackers deal with critical situations (illumination, occlusion, appearance changes, camera motion) differently; the idea was to exploit their complementarity through various fusion strategies.
Fusion can
References (48)
- et al., Multi-sensor management for information fusion: issues and approaches, Inf. Fus. (2002)
- et al., Markov chain Monte Carlo modular ensemble tracking, Image Vis. Comput. (2013)
- et al., Visual tracking by fusing multiple cues with context-sensitive reliabilities, Pattern Recognit. (2012)
- et al., Visual tracking via weakly supervised learning from multiple imperfect oracles, Pattern Recognit. (2014)
- et al., Online adaptive hidden Markov model for multi-tracker fusion, Comput. Vis. Image Underst. (2016)
- et al., Robust scale-adaptive mean-shift for tracking, Pattern Recognit. Lett. (2014)
- et al., Visual tracking: an experimental survey, IEEE Trans. Pattern Anal. Mach. Intell. (2014)
- et al., Online object tracking: a benchmark, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
- et al., A novel performance evaluation methodology for single-target trackers, IEEE Trans. Pattern Anal. Mach. Intell. (2016)
- et al., A superior tracking approach: building a strong tracker through fusion, Proceedings of the European Conference on Computer Vision (2014)
- A survey of appearance models in visual object tracking, ACM Trans. Intell. Syst. Technol.
- Tracking and Data Fusion: A Handbook of Algorithms
- A generalized search method for multiple competing hypotheses in visual tracking, Proceedings of the International Conference on Pattern Recognition
- Tracking by sampling trackers, Proceedings of the International Conference on Computer Vision
- Multi-tracker partition fusion, IEEE Trans. Circuits Syst. Video Technol.
- MEEM: robust tracking via multiple experts using entropy minimization, Proceedings of the European Conference on Computer Vision
- Visual tracking via adaptive tracker selection with multiple features, Proceedings of the European Conference on Computer Vision
- Learning multi-domain convolutional neural networks for visual tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Similarity fusion for visual tracking, Int. J. Comput. Vis.
- Object tracking: a survey, ACM Comput. Surv.
- Fusion of multiple tracking algorithms for robust people tracking, Proceedings of the European Conference on Computer Vision
- Learning to track with multiple observers, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- PROST: parallel robust online simple tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Coherent selection of independent trackers for real-time object tracking, Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP)
Isabelle Leang graduated from the Ecole Nationale Supérieure de l’Electronique et de ses Applications (France) and received a Master’s degree in Computer Science from the Université de Cergy-Pontoise (France) in 2012. She is currently preparing a Ph.D. degree in the Information Processing and Modeling Department at ONERA (France).
Stéphane Herbin received an engineering degree from the Ecole Supérieure d’Electricité (Supélec), the M.Sc. degree in Electrical Engineering from the University of Illinois at Urbana-Champaign, and the Ph.D. degree in applied mathematics from the Ecole Normale Supérieure de Cachan. Employed by ONERA since 2000, he works in the Information Processing and Modeling Department. His main research interests are stochastic modeling and analysis for object recognition and scene interpretation in images and videos.
Benoît Girard received a Ph.D. degree in Computer Science (2003) from the Université Pierre et Marie Curie (Paris, France). He currently works as a Research Director at the Centre National de la Recherche Scientifique. His main research interests are action selection, reinforcement learning and decision making in animals and robots.
Jacques Droulez received mathematical training at the Ecole Polytechnique (Paris, France) and an MD (1982) from the University Paris 6. He is currently a Research Director at the Centre National de la Recherche Scientifique. His main research interests are motion and object perception, sensorimotor control and Bayesian modeling of biological systems.