Information Fusion

Volume 12, Issue 4, October 2011, Pages 275-283

Dynamical information fusion of heterogeneous sensors for 3D tracking using particle swarm optimization

https://doi.org/10.1016/j.inffus.2010.06.005

Abstract

This paper presents a new method for three dimensional object tracking by fusing information from stereo vision and stereo audio. From the audio data, directional information about an object is extracted by the Generalized Cross Correlation (GCC), and the object's position in the video data is detected using the Continuously Adaptive Mean Shift (CAMshift) method. The obtained localization estimates, combined with confidence measurements, are then fused to track an object utilizing Particle Swarm Optimization (PSO). In our approach the particles move in the 3D space and iteratively evaluate their current position with regard to the localization estimates of the audio and video modules and their confidences, which facilitates the direct determination of the object's three dimensional position. This technique has low computational complexity, and its tracking performance is independent of any kind of model, statistics, or assumptions, in contrast to classical methods. The introduction of confidence measurements further increases the robustness and reliability of the entire tracking system and allows an adaptive and dynamical information fusion of heterogeneous sensor information.

Introduction

Today, object tracking is a growing research topic, due to increasing security requirements. Applications such as video-conferencing, surveillance, smart automobiles, and automatic scene analysis are a few examples in the field of autonomous systems that rely heavily on tracking with heterogeneous sensors. A variety of single-sensor techniques based solely on sound or vision already exists for this purpose. As single or homogeneous sensor techniques have specific weaknesses when deployed as stand-alone systems, it is advantageous to combine the information obtained by two or more heterogeneous sensors. In tasks like pedestrian tracking, a thermal camera is often fused with an optical sensor to enlarge the utilizable spectrum as well as the system's robustness [1]. Combinations of video and audio signals can, among other things, enhance speech event detection by incorporating the video data at hand. Audiovisual information fusion has been successfully applied to biometric person authentication [2], as well as to man–machine communication, like the smart kiosk, a terminal which enables automatic interaction between multiple speakers and a vending machine [3].

Different approaches to audiovisual fusion based object tracking have been established. The most widely recognized are techniques based on Kalman filters. In [4], a decentralized Kalman fusion technique is applied, which recursively combines audio and video state estimates into a more reliable global position estimate. While it is one of the fastest methods due to its low complexity, the Kalman fusion approach is limited by its assumption of linear dynamics and a unimodal Gaussian posterior density. In real-world scenarios, however, especially in cluttered or noisy scenes, measurements tend to have non-Gaussian, multi-modal distributions. Furthermore, its linear state transition model, which hypothesizes a deterministic movement and speed of the tracked object, tends to fail under sudden and abrupt movements.
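To make the limitation concrete, the following minimal sketch shows a generic 1D constant-velocity Kalman filter; the noise matrices are illustrative assumptions, not taken from the decentralized architecture of [4]. The posterior is always a single Gaussian, and the linear transition model makes the estimate lag behind an abrupt jump in the measurements.

```python
# Minimal sketch of the linear-Gaussian assumption behind Kalman fusion.
# Generic 1D constant-velocity filter; all matrices are illustrative.
import numpy as np

dt = 1.0 / 15.0                          # assumed frame interval (15 fps)
F = np.array([[1.0, dt], [0.0, 1.0]])    # linear state transition (pos, vel)
H = np.array([[1.0, 0.0]])               # only the position is measured
Q = 1e-3 * np.eye(2)                     # process noise covariance (assumed)
R = np.array([[1e-2]])                   # measurement noise covariance (assumed)

def kalman_step(x, P, z):
    """One predict/update cycle; the posterior stays a single Gaussian."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.zeros((2, 1)), np.eye(2)
for z in [0.0, 0.05, 0.1, 0.9]:                # abrupt jump at the last step
    x, P = kalman_step(x, P, np.array([[z]]))
print(x.ravel())                               # the estimate lags the jump
```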

A more sophisticated extension of the Kalman filter is particle filtering, utilized for audiovisual fusion in [5], [6], [7]. In contrast to the Kalman filter's Gaussian hypothesis, particle filters model a stochastic process with an arbitrary probability density function by approximating it numerically with a cloud of particles. The obtained estimate becomes an adequate approximation of the true posterior probability density function when the number of particles is large. Consequently, a growing number of particles improves the tracking results, but also increases the computational cost and therefore degrades the system's speed.
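The following minimal bootstrap particle filter sketch, in one dimension, is included only to illustrate the propagate/reweight/resample cycle; the particle count and noise levels are assumptions, not values from [5], [6], [7].

```python
# Minimal bootstrap particle filter in 1D: propagate, reweight by the
# measurement likelihood, resample. Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                  # more particles -> better posterior
particles = rng.normal(0.0, 1.0, N)       # approximation, but higher cost
weights = np.full(N, 1.0 / N)

def pf_step(particles, weights, z, proc_std=0.1, meas_std=0.3):
    # Propagate through a (possibly nonlinear) motion model.
    particles = particles + rng.normal(0.0, proc_std, particles.size)
    # Reweight by the Gaussian measurement likelihood.
    weights = weights * np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
    weights = weights / weights.sum()
    # Resample to avoid weight degeneracy.
    idx = rng.choice(particles.size, particles.size, p=weights)
    return particles[idx], np.full(particles.size, 1.0 / particles.size)

for z in [0.2, 0.3, 0.8]:
    particles, weights = pf_step(particles, weights, z)
print(particles.mean())                   # point estimate from the cloud
```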

Similar methods for audiovisual object tracking are probabilistic graphical models, which exploit the mutual relationship between two modalities, such as audio and video data representing the same object, in order to achieve optimal performance. The system presented in [8] uses probabilistic generative models to describe the observed data from multimedia sequences. The models are called generative because they describe the observed data in terms of the process that generated them, using additional variables that are not observable. They are probabilistic because they estimate probability distributions of the signals rather than describing the signals themselves.

One of the most popular graphical-model techniques is the Bayesian network approach [6], [9], which is considered a powerful tool for information fusion. Bayesian networks are a way of modeling a joint probability distribution over multiple random variables. In [10], a Bayesian network was used to detect the time and position of speech events by analyzing audio and video data. The gained information was then utilized to robustly recognize and separate speech signals in noisy and reverberant environments; the system operated successfully in a conference room. A similar implementation can be found in [9]. Although the Bayesian approach is simple and powerful in principle, its central drawback in practice is that it often requires an intractably large amount of computation, mainly for integrations over a very high dimensional space of random variables.
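As a toy illustration of this idea, the sketch below fuses two binary cues over a tiny discrete network; all probabilities are invented and do not come from [9] or [10]. The same enumeration, taken over high dimensional continuous variables, is precisely what becomes intractable in practice.

```python
# Toy discrete Bayesian network: a speech event S with conditionally
# independent audio (A) and video (V) cues. All probabilities are invented.
p_S = {True: 0.3, False: 0.7}             # prior on a speech event
p_A = {True: 0.8, False: 0.1}             # P(audio cue present | S)
p_V = {True: 0.7, False: 0.2}             # P(lip motion present | S)

def posterior_speech(a_obs, v_obs):
    """P(S = true | A, V) by enumerating the joint distribution."""
    joint = {}
    for s in (True, False):
        pa = p_A[s] if a_obs else 1.0 - p_A[s]
        pv = p_V[s] if v_obs else 1.0 - p_V[s]
        joint[s] = p_S[s] * pa * pv
    return joint[True] / (joint[True] + joint[False])

print(posterior_speech(True, True))       # both cues agree -> high posterior
```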

A further related approach is the use of Hidden Markov Models (HMMs) [11].

Besides their computational complexity, the above-mentioned fusion methods have the drawback of relying on model assumptions. As the methods become more complex, the limitations of a distribution model may be partly overcome, making them more generally applicable. However, this comes at the expense of exponentially increasing complexity and processing time, while the methods still depend on assumptions, models, training sets, statistics, and transition probabilities that have to be postulated.

Many object tracking implementations aim merely at detecting and tracking objects on the screen, without determining their positions in the real-world reference frame. The calculation of the three dimensional coordinates then represents an additional task on top of conventional tracking, e.g. disparity calculation or triangulation.

Our aim in this study is to establish a minimal-complexity 3D object tracking algorithm that keeps the hardware implementation as simple as possible, in order to achieve real-time behavior while preserving robust tracking performance.

The sound localization method adopted in this work deploys just two microphones, whereas current subspace methods, e.g. the MUltiple SIgnal Classification (MUSIC) algorithm [12], demand microphone arrays equipped with eight, sixteen, or more elements.

The visual detection algorithm used here deploys two cameras. In contrast to other approaches [13], in which laser sensors were additionally needed, we use only cameras and microphones to achieve three dimensional tracking. The introduced tracking algorithm is based solely on color distributions to identify and track moving objects in a video sequence. It is a robust technique, more flexible than the background subtraction method [10], and well-suited for abrupt changes in the camera position as well as for alterations in the environment [14].

The tracking information delivered by the audio module is fused with that of the video system in a novel manner using the Particle Swarm Optimization (PSO) algorithm. In our design, a particle is regarded as a possible object position in the solution space, which is simply the three dimensional space in which the object moves.

The implemented tracker is composed of an audio module, a stereo camera component, and a fusion module, as shown in Fig. 1. The camera component acquires two video frames of the same scene at a time instant t. After a matching operation, it delivers a pair of correspondence points MPl and MPr, which describe the coordinates of the same part of the object projected onto the left and right frames, respectively. Furthermore, the visual system provides a confidence value Confvis, which evaluates the reliability of the visual system's result. In parallel to the image processing, the audio block estimates the azimuth φ of the moving object at the same time instant t, as well as a confidence measure Confaud assessing the plausibility of the audio result. The information of these two blocks is then propagated to the PSO tracking module, which delivers a 3D position estimate M(X, Y, Z) of the moving object at time instant t.
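The per-frame interface between these modules could be captured, purely as a hypothetical sketch of Fig. 1 using the paper's notation (MPl, MPr, Confvis, φ, Confaud); the container types and the fuse() signature are illustrative choices of our own, not the paper's implementation.

```python
# Hypothetical per-frame data flow mirroring Fig. 1; types are our own.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class VisionOutput:
    mp_left: Tuple[float, float]     # MPl: matched point in the left frame
    mp_right: Tuple[float, float]    # MPr: matched point in the right frame
    conf: float                      # Conf_vis, reliability of the match

@dataclass
class AudioOutput:
    azimuth: float                   # phi, azimuth of the source (rad)
    conf: float                      # Conf_aud, reliability of the estimate

def fuse(vision: VisionOutput, audio: AudioOutput) -> Tuple[float, float, float]:
    """Placeholder for the PSO fusion module returning M = (X, Y, Z)."""
    raise NotImplementedError        # see the fitness sketch in Section 5
```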

Parts of this work have been published in [15]. In this paper we extend the basic approach by considering confidence measurements when evaluating the position estimates of the audio and video modules. This allows an adaptive and dynamical information fusion of heterogeneous sensor information, which further increases the robustness and reliability of the entire tracking system. A detailed description of the tracking system is presented in the following sections. The paper is organized as follows. First, the PSO concept is introduced in Section 2. Sections 3 and 4 describe the audio and video tracking systems independently. In Section 5 we present a novel method for audiovisual fusion based on PSO. Experimental results and a comparison with current tracking techniques are discussed in Section 6. Section 7 concludes the study and introduces avenues for future work.

Section snippets

Particle swarm optimization

In recent years, researchers have shown that Particle Swarm Optimization is a highly efficient optimization method and a search algorithm with high performance capabilities and outstanding flexibility, making it suitable for a huge variety of signal processing tasks. The PSO algorithm, originally proposed in [16], is a simulation of a simplified social model. It draws its roots from artificial life and was inspired by bird flocking, fish schooling, and swarming theory in particular.

The social
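The canonical update rule proposed in [16] moves each particle under inertia, a pull towards its personal best, and a pull towards the swarm's global best. The sketch below is a minimal, self-contained implementation with a toy fitness function; it is not the tracking objective of Section 5, and the parameter values are common defaults rather than the paper's settings.

```python
# Canonical PSO update as proposed in [16]: inertia plus attraction to the
# personal and global bests; toy fitness and assumed parameter values.
import numpy as np

rng = np.random.default_rng(1)
n_particles, dim = 30, 3
w, c1, c2 = 0.7, 1.5, 1.5                 # inertia and acceleration weights
target = np.array([1.0, 2.0, 3.0])

def fitness(p):                           # toy objective: distance to target
    return np.linalg.norm(p - target, axis=1)

pos = rng.uniform(-5.0, 5.0, (n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), fitness(pos)
gbest = pbest[np.argmin(pbest_val)]

for _ in range(100):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    val = fitness(pos)
    better = val < pbest_val
    pbest[better], pbest_val[better] = pos[better], val[better]
    gbest = pbest[np.argmin(pbest_val)]

print(gbest)                              # converges near (1, 2, 3)
```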

Sound source localization

Our sound localization setup consists of two microphones placed horizontally 47 cm apart and uses the Time Delay of Arrival (TDOA) method for localization. This system delivers an azimuth angle φ, which represents the relative angle between the origin of the system and the object being tracked. The audio system furthermore delivers the value Confaud, which describes the confidence of the obtained audio localization, i.e. the confidence of the audio system.
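A common way to realize such a TDOA estimator is the Generalized Cross Correlation with PHAT weighting; the sketch below assumes this weighting and a far-field geometry, both standard choices that the paper does not necessarily specify. A plausible Confaud could be derived from the sharpness of the correlation peak, though the paper's exact confidence definition is not reproduced here.

```python
# TDOA-based azimuth estimation for two microphones d = 0.47 m apart.
# PHAT weighting and the far-field model are assumed common choices.
import numpy as np

FS = 44100                # sampling rate (Hz), as in the experiments
D, C = 0.47, 343.0        # mic spacing (m) and speed of sound (m/s)

def gcc_phat_azimuth(left, right):
    """Azimuth (rad) of the dominant source under a far-field model."""
    n = len(left) + len(right)
    cross = np.fft.rfft(left, n) * np.conj(np.fft.rfft(right, n))
    cross /= np.abs(cross) + 1e-12             # PHAT weighting
    cc = np.fft.irfft(cross, n)
    max_lag = int(np.ceil(D / C * FS))         # physically possible lags
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    tau = (np.argmax(np.abs(cc)) - max_lag) / FS
    return np.arcsin(np.clip(tau * C / D, -1.0, 1.0))

# Synthetic check: delaying one channel by 20 samples yields roughly
# -19 degrees under this sign convention.
rng = np.random.default_rng(2)
sig, delay = rng.standard_normal(4096), 20
print(np.degrees(gcc_phat_azimuth(np.r_[sig, np.zeros(delay)],
                                  np.r_[np.zeros(delay), sig])))
```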

Visual object localization

Our vision system consists of two cameras, which deliver two correspondence or matching points, MPl = (xMPl, yMPl) from the left frame and MPr = (xMPr, yMPr) from the right frame. These points represent the 2D-projection of the object to be tracked. Additionally, the vision system delivers the value Confvis, which describes the confidence of the obtained matching point pair, i.e. the confidence of the vision system.
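For a rectified, parallel stereo rig, such a matching pair determines depth via the disparity; the sketch below shows this standard triangulation. The focal length, baseline, and principal point are illustrative assumptions, not the paper's calibration values.

```python
# Standard triangulation for a rectified, parallel stereo rig: the matched
# pair (MPl, MPr) gives a disparity and hence depth. Parameters assumed.
F_PX = 700.0              # focal length in pixels (assumed)
BASELINE = 0.12           # camera baseline in metres (assumed)
CX, CY = 320.0, 240.0     # principal point for 640 x 480 frames

def triangulate(mp_left, mp_right):
    """(X, Y, Z) in the left-camera frame from one correspondence pair."""
    disparity = mp_left[0] - mp_right[0]
    if disparity <= 0.0:
        raise ValueError("point must lie in front of the cameras")
    Z = F_PX * BASELINE / disparity
    X = (mp_left[0] - CX) * Z / F_PX
    Y = (mp_left[1] - CY) * Z / F_PX
    return X, Y, Z

print(triangulate((350.0, 250.0), (330.0, 250.0)))   # about 4.2 m ahead
```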

Audiovisual information fusion and tracking

The task of the fusion module is to combine the information conveyed by the audio and vision algorithms in order to deliver an estimate of the current 3D position of the tracked object, relative to the system origin. In this section we first briefly explain a Kalman-based fusion technique, which serves as a reference system for comparing the accuracy, performance, and execution time of our fusion algorithm. Afterwards we present our PSO fusion technique.
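To fix ideas, a confidence-weighted fitness that a PSO swarm could minimize over candidate positions M = (X, Y, Z) might look like the following sketch: reprojection error against the matched points plus azimuth error against the audio estimate, each scaled by the corresponding module's confidence. The error terms, their weighting, and the projection model are our simplification of the idea, not the paper's exact fitness function.

```python
# Hedged sketch of a confidence-weighted PSO fitness; project() assumes the
# same rectified rig (and assumed parameters) as the triangulation sketch.
import numpy as np

def project(M, f_px=700.0, baseline=0.12, cx=320.0, cy=240.0):
    """Project a 3D point into the left and right image planes."""
    X, Y, Z = M
    left = (f_px * X / Z + cx, f_px * Y / Z + cy)
    right = (f_px * (X - baseline) / Z + cx, f_px * Y / Z + cy)
    return left, right

def fitness(M, mp_l, mp_r, conf_vis, phi, conf_aud):
    """Lower is better; confidences adapt the weighting dynamically."""
    pl, pr = project(M)
    e_vis = np.hypot(*np.subtract(pl, mp_l)) + np.hypot(*np.subtract(pr, mp_r))
    e_aud = abs(np.arctan2(M[0], M[2]) - phi)      # azimuth mismatch (rad)
    return conf_vis * e_vis + conf_aud * e_aud     # pixel/radian mix kept simple

# Toy check: the true position scores (near) zero.
M_true = (0.18, 0.06, 4.2)
mpl, mpr = project(M_true)
print(fitness(M_true, mpl, mpr, 0.9, np.arctan2(M_true[0], M_true[2]), 0.8))
```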

Results and comparison

To test and evaluate our tracker, audio and video data of a person moving and talking in an area facing the stereo camera and stereo microphone system were taken. The hardware deployed consisted of two firewire cameras and two AKG omni-directional microphones. In a first implementation, the videos were taken with 15 frames per second with a resolution of 640 × 480 pixels. The audio material was recorded using a sampling frequency of 44,100 Hz. For a single audio calculation step, we captured and

Conclusion and future work

We presented a novel 3D object tracking system based on dynamically fusing audiovisual information with regard to the reliability of the individual modules. Our PSO-based fusion approach does not need any models, statistics, or learning phase. It overcomes the problems of classic audiovisual fusion methods, which are based on assumptions regarding the distributions of variables, or which tend to become complex when reducing these limitations. Speed performance was shown to be slightly faster than the

References (31)

  • A. Leykin et al., Pedestrian tracking by fusion of thermal-visible surveillance videos, Machine Vision and Applications (2008).
  • N. Poh et al., Hybrid biometric person authentication using face and voice features, Lecture Notes in Computer Science (2001).
  • A. Christian et al., Digital smart kiosk project.
  • M. Brandstein et al., Microphone Arrays: Signal Processing Techniques and Applications (2001).
  • N. Checka, K. Wilson, Person tracking using audio-video sensor fusion, in: Sow Proceedings, Massachusetts Institute of...
  • H. Asoh et al., An application of a particle filter to Bayesian multiple sound source tracking with audio and video...
  • D.N. Zotkin et al., Joint audio-visual tracking using particle filters, EURASIP Journal on Applied Signal Processing (2002).
  • M.J. Beal et al., A graphical model for audiovisual object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003).
  • X. Zou, B. Bhanu, Tracking humans using multi-modal fusion, in: Proceedings of the 2005 IEEE Computer Society...
  • F. Asano, Detection and separation of speech event using audio and video information fusion and its application to robust speech interface, EURASIP Journal on Applied Signal Processing (2004).
  • A. Noulas, B. Kröse, Probabilistic audio visual sensor fusion for speaker detection, Tech. Rep., University of...
  • R.O. Schmidt, Multiple emitter location and signal parameter estimation, IEEE Transactions on Antennas and Propagation (1986).
  • J. Fritsch et al., Audiovisual person tracking with a mobile robot.
  • G.R. Bradski, Computer vision face tracking for use in a perceptual user interface, Intel Technology Journal (1998).
  • F. Keyrouz, U. Kirchmaier, K. Diepold, Three dimensional object tracking based on audiovisual fusion using particle...

    This work was supported by the German Science Foundation (DFG) under the collaborative research center SFB TR-28 ‘Kognitive Automobile’ (www.kognimobil.org).
