1 Introduction

An important component of tracking is the filtering problem, in which estimates of an object's state are computed while observations are progressively received. The estimation process is in general modeled using a Bayesian formulation [2]. For many filters, e.g. the well-known Kalman filter [1] or nonparametric methods such as particle filters [3], the posterior probability can be recursively updated by applying a perception model and a motion model. Under real-world conditions, the object motion can change over time, and it is impossible to define a single motion model that captures all the different motions the object can execute. An elegant way of dealing with motion uncertainties and capturing the complex dynamics of objects is the Interacting Multiple Model (IMM) filter [4]. It has been successfully employed in several applications [5, 6]. The IMM approach fuses several models by treating each model from a given set as a possible candidate and weighting it accordingly. Each model contributes to the final distribution depending on its current weight. In most cases, the motion is modeled by a bank of standard Kalman filters per object, and the dynamics are described in 3D space. However, there exist several scenarios where objects are tracked solely on directly observed image space information, for example person tracking without available external calibration. The goal of this paper is to reveal a fallacy that arises when applying an IMM filter restricted to such information. We show how a separation of the state space vector improves the overall system accuracy. After some basic concepts of an IMM filter are described in Sect. 2, we show how a basic IMM setup with three standard motion models should be modified for better image space object tracking. The results achieved on the publicly available VOT 2014 dataset are presented in Sect. 3, and Sect. 4 contains a conclusion.

2 Interacting Multiple Model Filter

In this section, the basic concepts of the IMM filter and a reference IMM configuration for the evaluation are described. For a more detailed description, see for example Hartikainen and Särkkä [7] or Bar-Shalom et al. [8]. As mentioned, it is reasonable to assume that the dynamic model of an object can change from time to time. As a solution, a system is considered to be composed of multiple independent models, where the currently active model is one from a discrete set of n candidate models (\(M = \{M^1,\ldots ,M^n\}\)). The IMM filter is a popular choice for such systems in practical applications. Some prior probability \(\mu ^j_{0}\) for each model \(M^j\) and the state transition probability between time index \(k-1\) and k from model i to model j (denoted by \(p_{ij} = P(M^j_{k}|M^i_{k-1})\)) are assumed to be known. The transition probability matrix \(p_{ij}\) can be interpreted as a first order Markov chain characterizing the mode transitions. Hence, systems of this type are commonly referred to as Markovian switching systems (Bar-Shalom et al. [8]). The discrete-time system underlying the state estimation problem of an IMM filter can be written as follows:

$$\begin{aligned} x_{k}= & {} F^j_{k} x_{k-1}+w^j_{k} \end{aligned}$$
(1)
$$\begin{aligned} y_{k}= & {} H^j_{k}x_{k} + r^j_{k} \end{aligned}$$
(2)

Here, \(x_{k}\) is the state of the object, and the model in effect at time step \({k-1}\) is denoted by j. \(F_{k}\) is the state transition matrix, which applies the effect of each system state parameter at time \(k-1\) to the system state at time k. \(H_{k}\) is the measurement model matrix that maps the state parameters into the measurement domain. \(w_{k} \sim N(0,Q_{k})\) is the process noise and \(r_{k} \sim N(0,R_{k})\) is the measurement noise. Since our goal is to rely only on the directly observed information \(y_{k}\), we use the image space coordinates and the scale of the object as the measurement. This information can be obtained from every object detector following the sliding window paradigm. Although the detectors differ in many aspects, the output of such a sliding-window-based detector is a rectangular bounding box centered at the object location. Here, \((x,y)\) is the center position in image space and s the scale. To describe the overall state of our reference IMM configuration, the corresponding velocities \((\dot{x},\dot{y},\dot{s})\) and accelerations \((\ddot{x},\ddot{y},\ddot{s})\) are used as well. The most common linear motion models are the constant position model (CP), the constant velocity model (CV), and the constant acceleration model (CA). In our experiments, we choose an IMM filter configuration that consists of these three basic models. When the object remains at the same position, the velocity and acceleration are reduced to zero since the object is not moving. Thus, the transition matrix for a state vector including the nine mentioned states \((x_{k} = (x,y,s,\dot{x},\dot{y},\dot{s},\ddot{x},\ddot{y},\ddot{s}))\) for the constant position motion model is defined as

$$\begin{aligned} F^{CP}_{k}=\left[ \begin{array}{cc} I_{3 \times 3} &{} 0_{3\times 6} \\ 0_{6\times 3} &{} 0_{6\times 6} \\ \end{array} \right] . \end{aligned}$$
(3)

The constant velocity model is used in most tracking approaches and can be defined as

$$\begin{aligned} F^{CV}_{k}=\left[ \begin{array}{ccc} I_{3 \times 3} &{} I_{3 \times 3}T &{} 0_{3 \times 3} \\ 0_{3\times 3} &{} I_{3 \times 3}&{} 0_{3\times 3} \\ 0_{3\times 3} &{} 0_{3\times 3} &{} 0_{3\times 3} \\ \end{array} \right] . \end{aligned}$$
(4)

Here, T is the number of discrete time steps between two consecutive states. In the literature, several assumptions on how to model the acceleration process of an object have been proposed (see Li and Jilkov [9]). Here, the CA model is defined as

$$\begin{aligned} F^{CA}_{k} =\left[ \begin{array}{ccc} I_{3 \times 3} &{} I_{3 \times 3}T&{} \frac{1}{2}I_{3 \times 3}T^2 \\ 0_{3\times 3} &{} I_{3 \times 3}&{} I_{3\times 3}T \\ 0_{3\times 3} &{} 0_{3\times 3} &{} I_{3\times 3} \\ \end{array} \right] . \end{aligned}$$
(5)
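As a minimal sketch, the three transition matrices can be assembled from per-axis 3 × 3 templates and expanded to the nine-state ordering \((x,y,s,\dot{x},\dot{y},\dot{s},\ddot{x},\ddot{y},\ddot{s})\) with a Kronecker product. The helper names (`kron`, `axis_models`, `full_models`, `matvec`) are illustrative assumptions and not part of the original work. The CA template keeps the acceleration state unchanged (a 1 in its last diagonal entry), as is standard for a constant acceleration model.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def axis_models(T=1.0):
    """Per-axis transition templates for a state (p, p_dot, p_ddot)."""
    cp = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    cv = [[1.0, T, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
    ca = [[1.0, T, 0.5 * T * T], [0.0, 1.0, T], [0.0, 0.0, 1.0]]
    return {"CP": cp, "CV": cv, "CA": ca}


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def full_models(T=1.0):
    """9 x 9 matrices: every 3 x 3 block is a scaled identity."""
    return {name: kron(f, I3) for name, f in axis_models(T).items()}


def matvec(F, x):
    """Matrix-vector product F x for nested-list matrices."""
    return [sum(f_ij * x_j for f_ij, x_j in zip(row, x)) for row in F]


# Hypothetical state: center (10, 20), scale 50, velocity (1, -2, 0),
# and a small acceleration that the CV model discards.
state = [10.0, 20.0, 50.0, 1.0, -2.0, 0.0, 0.5, 0.5, 0.5]
predicted_cv = matvec(full_models(T=1.0)["CV"], state)
```

With these hypothetical numbers, the CV prediction moves the center by one time step of velocity and sets the acceleration entries to zero, while the CP prediction would keep only the first three entries.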

The IMM filter basically consists of three major steps: interaction (mixing), filtering, and combination. In the interaction stage, under the assumption that a particular model is the right model at the current time step, the initial conditions for this model are obtained by mixing the state estimates produced by all filters. In detail, the mixing probabilities \(\mu ^{i|j}_{k}\) for each pair of models \(M^{i}\) and \(M^{j}\) are calculated as \(\mu ^{i|j}_{k} = \frac{1}{\bar{c}_{j}} p_{ij} \mu ^i_{k-1}\) with \(\bar{c}_{j} = \sum ^{n}_{i=1} p_{ij} \mu ^i_{k-1}\). Here, \(\mu ^i_{k-1}\) is the probability of model \(M^{i}\) at time step \(k-1\) and \(\bar{c}_{j}\) is a normalization factor. For each filter, the mixed mean and covariance are computed as follows:

$$\begin{aligned} m^{0j}_{k-1} = \sum ^{n}_{i=1} \mu ^{i|j}_{k} m^i_{k-1} \end{aligned}$$
(6)
$$\begin{aligned} P^{0j}_{k-1} = \sum ^{n}_{i=1} \mu ^{i|j}_{k} \left( P^{i}_{k-1} +(m^{i}_{k-1}-m^{0j}_{k-1})(m^{i}_{k-1}-m^{0j}_{k-1})^T \right) \end{aligned}$$
(7)

Here, \(m^{i}_{k-1}\) and \(P^{i}_{k-1}\) are the updated mean and covariance for model i at time step \(k-1\).
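The interaction stage can be sketched as follows, assuming states and covariances are stored as plain nested lists. The function names are illustrative assumptions. The mixed mean is the \(\mu ^{i|j}_{k}\)-weighted sum of the previous means, since the mixing probabilities are already normalized by \(\bar{c}_{j}\).

```python
def mixing_probabilities(p, mu_prev):
    """Return (mu_mix, c_bar) with mu_mix[i][j] = p[i][j] * mu_prev[i] / c_bar[j]."""
    n = len(mu_prev)
    c_bar = [sum(p[i][j] * mu_prev[i] for i in range(n)) for j in range(n)]
    mu_mix = [
        [p[i][j] * mu_prev[i] / c_bar[j] for j in range(n)] for i in range(n)
    ]
    return mu_mix, c_bar


def mix(means, covs, mu_mix, j):
    """Mixed mean and covariance used as input for filter j."""
    n, d = len(means), len(means[0])
    m0 = [sum(mu_mix[i][j] * means[i][a] for i in range(n)) for a in range(d)]
    P0 = [[0.0] * d for _ in range(d)]
    for i in range(n):
        diff = [means[i][a] - m0[a] for a in range(d)]
        for a in range(d):
            for b in range(d):
                # Covariance of filter i plus the spread of its mean around m0.
                P0[a][b] += mu_mix[i][j] * (covs[i][a][b] + diff[a] * diff[b])
    return m0, P0


# Hypothetical two-model example with a one-dimensional state.
p_example = [[0.5, 0.5], [0.5, 0.5]]
mu_mix_example, c_bar_example = mixing_probabilities(p_example, [0.5, 0.5])
m0_example, P0_example = mix(
    [[0.0], [1.0]], [[[1.0]], [[1.0]]], mu_mix_example, j=0
)
```

In the hypothetical example both models are equally likely, so the mixed mean lies halfway between the two means and the mixed covariance grows by the squared distance of each mean from that midpoint.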

Then, in the filtering stage, standard Kalman filtering (KF) is performed for each individual model, conditioned on the currently active mode. Correspondingly, a prediction step \(\left[ m^{-,i}_{k},P^{-,i}_{k} \right] = KF_{p}(m^{0i}_{k-1}, P^{0i}_{k-1}, F^{i}_{k}, Q_{k})\) and an update step \( \left[ m^{i}_{k},P^{i}_{k} \right] = KF_{u}(m^{-,i}_{k}, P^{-,i}_{k}, y_{k}, H^{i}_{k}, R^{i}_{k})\) are applied. The mixed inputs \(m^{0i}_{k-1}\) and \(P^{0i}_{k-1}\) of filter i are obtained from the previous estimates \(m^{i}_{k-1}\) and \(P^{i}_{k-1}\) of all filters according to Eqs. (6) and (7). Then the model probabilities \(\mu ^{i}_{k} = \frac{1}{c} \varLambda ^{i}_{k} \bar{c}_{i}\) are adapted according to the likelihood \(\varLambda ^{i}_{k}\) of the measurement for each filter, where \(c = \sum ^{n}_{i=1} \varLambda ^{i}_{k} \bar{c}_{i}\) is a normalizing factor.
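A minimal sketch of the filtering stage for one elemental filter is given below. It assumes a scalar measurement of the first state component only (i.e. a measurement matrix of the form \(H = [1, 0, \ldots , 0]\)) and a Gaussian measurement likelihood; both are simplifying assumptions for illustration, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def kf_predict(m, P, F, Q):
    """Prediction: m- = F m, P- = F P F^T + Q."""
    d = len(m)
    m_pred = [sum(F[i][k] * m[k] for k in range(d)) for i in range(d)]
    FPFt = matmul(matmul(F, P), transpose(F))
    P_pred = [[FPFt[i][j] + Q[i][j] for j in range(d)] for i in range(d)]
    return m_pred, P_pred


def kf_update_scalar(m, P, y, r_var):
    """Update with a scalar measurement y of the first state component."""
    d = len(m)
    v = y - m[0]                      # innovation
    S = P[0][0] + r_var               # innovation variance
    K = [P[i][0] / S for i in range(d)]
    m_new = [m[i] + K[i] * v for i in range(d)]
    # (I - K H) P with H = [1, 0, ..., 0]
    P_new = [[P[i][j] - K[i] * P[0][j] for j in range(d)] for i in range(d)]
    likelihood = math.exp(-0.5 * v * v / S) / math.sqrt(2.0 * math.pi * S)
    return m_new, P_new, likelihood


def update_model_probabilities(likelihoods, c_bar):
    """mu_i = likelihood_i * c_bar_i / c, with c the sum over all models."""
    c = sum(l * cb for l, cb in zip(likelihoods, c_bar))
    return [l * cb / c for l, cb in zip(likelihoods, c_bar)]
```

Running `kf_predict` and `kf_update_scalar` once per model and passing the resulting likelihoods to `update_model_probabilities` yields the model probabilities used in the combination step.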

The final step of the IMM filter is combination. There, the combined estimate for the state mean and covariance is computed as follows:

$$\begin{aligned} m_{k} = \sum ^{n}_{i=1} \mu ^{i}_{k} m^i_{k} \end{aligned}$$
(8)
$$\begin{aligned} P_{k} = \sum ^{n}_{i=1} \mu ^{i}_{k} \left( P^{i}_{k} +(m^{i}_{k}-m_{k})(m^{i}_{k}-m_{k})^T \right) \end{aligned}$$
(9)
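The combination step of Eqs. (8) and (9) can be sketched in the same nested-list style. The function name `combine` and the example values are illustrative assumptions.

```python
def combine(means, covs, mu):
    """Probability-weighted state mean and covariance over all models."""
    n, d = len(means), len(means[0])
    m = [sum(mu[i] * means[i][a] for i in range(n)) for a in range(d)]
    P = [[0.0] * d for _ in range(d)]
    for i in range(n):
        diff = [means[i][a] - m[a] for a in range(d)]
        for a in range(d):
            for b in range(d):
                # Model covariance plus the spread of the model mean around m.
                P[a][b] += mu[i] * (covs[i][a][b] + diff[a] * diff[b])
    return m, P


# Hypothetical example: two equally likely models that disagree only in
# the first state component.
eye2 = [[1.0, 0.0], [0.0, 1.0]]
m_comb, P_comb = combine([[0.0, 0.0], [2.0, 0.0]], [eye2, eye2], [0.5, 0.5])
```

The disagreement between the two model means inflates the combined variance of the first component, whereas the second component, on which both models agree, keeps its original variance.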

3 IMM Configuration and Evaluation

In this section, we evaluate the effectiveness of different IMM filter configurations in terms of state separation for the case of tracking the object only with directly observed image space information. The desired states of the IMM filter for tracking were determined in Sect. 2. Besides the center position in image space \((x,y)\) and the scale (s) of the object, the IMM filter uses the corresponding velocities \((\dot{x},\dot{y},\dot{s})\) and accelerations \((\ddot{x},\ddot{y},\ddot{s})\). The discrete set of motion models consists of three basic models, namely CP, CV, and CA. Intuitively, one would simply set the state vector to

$$\begin{aligned} x_{k,\textit{IMM }1} = (x,y,s,\dot{x},\dot{y},\dot{s},\ddot{x},\ddot{y},\ddot{s} ) \text {.} \end{aligned}$$
(10)

Thus, only one IMM filter is required for monitoring all desired states. For a standard Kalman filter, separating the states and introducing additional filters is redundant. Due to the characteristics of an IMM filter, however, not only a poor choice of a single motion model but also a careless extension of the state vector can lead to non-optimal performance. Optimal filtering with a multiple model system requires an optimal filter for every possible model sequence. Hence, some kind of approximation is needed in practical applications. For an IMM filter, this is done by conditioning all filters on the currently active model, and the final state estimate is obtained by merging the results of all elemental filters. Hence, a poor estimate of the active model affects the weighting of the mixed inputs. Combining the image coordinates and the scale in one state vector can thereby result in errors in the calculation of the model probabilities, especially when combining the scale with the image position. For example, the scale of an object can remain constant while the object is moving. Thus, the best fitting model for describing the scale is CP, whereas this model is a poor fit for the image position. Therefore, we propose to use two IMM filters instead of one. Hence, the scale and the corresponding velocity and acceleration are estimated independently of the position states and their derivatives. This leads to the following IMM configuration:

$$\begin{aligned} x_{k,\textit{IMM }2} = (x,y,\dot{x},\dot{y},\ddot{x},\ddot{y}) ; (s,\dot{s},\ddot{s} ) \text {.} \end{aligned}$$
(11)
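The effect of a shared model probability can be illustrated with a small numerical sketch. The residuals below are hypothetical values chosen for illustration, not results from the evaluation. The sketch further assumes equal prior model probabilities, unit innovation variances, and that the joint likelihood factorizes into the per-quantity likelihoods (diagonal innovation covariance).

```python
import math


def gaussian_likelihood(residual, variance=1.0):
    """Density of a zero-mean Gaussian evaluated at the residual."""
    return math.exp(-0.5 * residual ** 2 / variance) / math.sqrt(
        2.0 * math.pi * variance
    )


def normalize(weights):
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical innovation residuals: the object moves (CV fits the position)
# while its scale stays constant (CP fits the scale).
residuals = {
    "position": {"CP": 4.0, "CV": 0.5},
    "scale": {"CP": 0.1, "CV": 1.0},
}

# Separate filters: each quantity gets its own model probabilities.
per_quantity = {
    quantity: normalize({m: gaussian_likelihood(r) for m, r in res.items()})
    for quantity, res in residuals.items()
}

# Joint filter: one set of model probabilities from the product likelihood.
joint = normalize(
    {
        m: gaussian_likelihood(residuals["position"][m])
        * gaussian_likelihood(residuals["scale"][m])
        for m in ("CP", "CV")
    }
)
```

Under these assumptions, the joint filter assigns almost all probability to CV because the position residual dominates, even though CP is the better fit for the scale when it is evaluated on its own.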

Separating the scale with an additional filter seems obvious, but when tracking with directly observed image space data, a split into independent image coordinates may at first appear unnecessary. To show the benefit of such an IMM setup, we recommend the following IMM configuration:

$$\begin{aligned} x_{k,\textit{IMM }3} = (x,\dot{x},\ddot{x}) ; (y,\dot{y},\ddot{y}) ; (s,\dot{s},\ddot{s} )\text {.} \end{aligned}$$
(12)

Here, three IMM filters are used to describe the x position, the y position, the scale, and the corresponding derivatives. Hence, the motion along each image axis is captured with a separate filter.
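As a small sketch of the bookkeeping behind this configuration, the nine-state vector can be split into three per-quantity vectors, each of which would be handed to its own IMM filter, and merged again afterwards. The function names and the dictionary keys are illustrative assumptions.

```python
def split_state(joint):
    """Split (x, y, s, dx, dy, ds, ddx, ddy, dds) into per-quantity triples."""
    return {
        "x": [joint[0], joint[3], joint[6]],
        "y": [joint[1], joint[4], joint[7]],
        "s": [joint[2], joint[5], joint[8]],
    }


def merge_state(parts):
    """Inverse of split_state: restore the nine-state ordering."""
    return [parts[name][k] for k in range(3) for name in ("x", "y", "s")]


# Hypothetical joint state used to check the round trip.
joint_state = [10.0, 20.0, 50.0, 1.0, -2.0, 0.0, 0.1, 0.2, 0.3]
parts = split_state(joint_state)
```

Each entry of `parts` contains one quantity together with its first and second derivative, which corresponds to one of the three state vectors in Eq. (12).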

Evaluation is done on the VOT 2014 dataset [10]. This dataset is a selection of 25 widely used object tracking sequences. Although the dataset was originally designed to compare different appearance-based or visual trackers, it includes a variety of different object motions. Figure 1 shows the first frame of exemplary sequences where the unified bounding box of the object is highlighted in green.

Fig. 1. Example tracking sequences for evaluation from the VOT 2014 dataset. The first frame with the unified bounding box of the object is shown for the sequences “bicycle”, “jogging”, “surfing”, “woman”. (Color figure online)

The main feature of the IMM filter is the ability to estimate the state of a dynamic system with several behavior modes that can switch from one to another. Besides that, the IMM filter is a good compromise between performance and complexity [11]. The overall performance depends on a number of design parameters. The most critical design parameters are the model set structure, the process and measurement noises, the initial state, and the jump structure with its transition probabilities. Although the basic IMM setup described above with three standard motion models is suboptimal for some scenarios from the VOT 2014 dataset [10], we keep the combination of one constant position, one constant velocity, and one constant acceleration model fixed. In practice, the transition probability matrix is often assumed to be known and is chosen a priori. As stated in Bar-Shalom et al. [8], an ad hoc approach is to fill the diagonal with values close to one. We set the diagonal entries to 0.99 and the other transition values to 0.005. Because the CV model is the most commonly used model in tracking approaches, we set the initial model probability \(\mu ^i_{0}\) in favor of this model to 0.98 and to 0.01 for the other models. The measurement and process noise are modeled as additive white noise. In the experiments, the standard deviation of both noises was varied between 1, 2, 5, and 10. Here, only the diagonals of the process noise covariance matrix Q and the measurement noise covariance matrix R contain non-zero values.
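The stated parameter choices can be written down as a short sketch. The helper names are illustrative assumptions, and the diagonal covariance helper follows the wording of the text in treating the varied quantity as a standard deviation.

```python
MODELS = ("CP", "CV", "CA")


def transition_matrix(n=3, stay=0.99):
    """Markov transition matrix with `stay` on the diagonal; rows sum to one."""
    off = (1.0 - stay) / (n - 1)
    return [[stay if i == j else off for j in range(n)] for i in range(n)]


def diagonal_covariance(std, dim):
    """Covariance with std**2 on the diagonal and zeros elsewhere."""
    return [[std * std if i == j else 0.0 for j in range(dim)] for i in range(dim)]


p = transition_matrix()                       # 0.99 diagonal, 0.005 elsewhere
mu0 = {"CP": 0.01, "CV": 0.98, "CA": 0.01}    # initial model probabilities
Q = diagonal_covariance(std=2.0, dim=9)       # one of the evaluated settings
```

Each row of `p` sums to one, and the initial probabilities in `mu0` form a valid distribution over the three models.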

For every image sequence, the first 10 frames are excluded from the evaluation and used for initializing the filters. The update interval \(t_{update}\) for obtaining a new measurement for the filter was varied between every frame, every third frame, and every fifth frame. Since the standard output of an object detector is a rectangular bounding box centered at the object location, we use the ground truth bounding boxes from the VOT 2014 dataset to simulate the output of an object detector and to evaluate the prediction accuracy. Performance measures aim at summarizing the extent to which the tracker's prediction agrees with the ground truth annotation. In Cehovin et al. [12], a general definition of an object state description in a sequence of length N is established based on the center of the object and the region of the object at time k. When tracking an object in image space, the region is usually described by a bounding box. From the IMM filter, we use the predicted center location x, y and scale s to calculate a unified bounding box \(A^{O}_{k}\). With this predicted object region from the tracker and the ground truth region, an overlap can be calculated as \(\frac{A^{O}_{k} \cap A^{GT}_{k}}{A^{O}_{k} \cup A^{GT}_{k}} \). For the ground truth area \(A^{GT}_{k}\), a unified bounding box is considered as well. In general, the width of the enclosing bounding box is more strongly influenced by the body pose of the objects. Hence, a unified bounding box with a width of \(\frac{1}{3}\) of the scale is used. A property of region overlap measures is that they account for both the position and the size of the predicted and ground truth bounding boxes simultaneously, and there is no normalization problem. The overlap measure is summarized over an entire sequence by the average overlap. In addition to the average overlap, the number of frames in which the overlap is below a threshold of 0.5 is recorded and used as a second comparative score.
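The two scores can be sketched as follows. The sketch assumes that the scale s corresponds to the height of the unified bounding box, so that the width is s/3; this interpretation and the function names are assumptions made for illustration.

```python
def unified_box(cx, cy, s):
    """(left, top, right, bottom) of a box with height s and width s / 3."""
    w, h = s / 3.0, s
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


def overlap(a, b):
    """Intersection area divided by union area of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def summarize(predicted, ground_truth, threshold=0.5):
    """Average overlap and number of frames with overlap below the threshold."""
    overlaps = [
        overlap(unified_box(*p), unified_box(*g))
        for p, g in zip(predicted, ground_truth)
    ]
    average = sum(overlaps) / len(overlaps)
    below = sum(1 for o in overlaps if o < threshold)
    return average, below


# Hypothetical two-frame sequence given as (x, y, s) triples.
avg_overlap, frames_below = summarize(
    [(0.0, 0.0, 30.0), (0.0, 0.0, 30.0)],
    [(0.0, 0.0, 30.0), (5.0, 0.0, 30.0)],
)
```

In the hypothetical sequence, the first frame matches exactly, while the second prediction is shifted by half a box width, which reduces its overlap to one third and counts it as a frame below the threshold.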

Table 1. Performance summary for the different IMM filter configurations.

The overall results for the three different IMM configurations are summarized in Table 1 for the example setting \(\sigma ^{2}_{w}=2\), \(\sigma ^{2}_{r}=5\), \(t_{update}=3\). The results for other parameter settings differ slightly but are essentially the same. This means that the achieved overlap varies and that for some specific sequences the ranking of the IMM configurations changes, but overall it can clearly be seen that the IMM configuration that uses separated image space coordinates and scale outperforms the other configurations. Because the motion of the objects in some particular sequences is highly non-linear, the chosen combination of motion models is not optimal. This can also result in a changed ranking, but the trend towards the third configuration achieving superior results is clearly visible for all evaluated parameter settings.

When tracking an object without a mapping between the measurement domain and the states, the motion in one direction is independent of the motion in the other direction. Because the elemental filters are conditioned on the best fitting model, the final estimate is negatively influenced by a naive extension of the state vector. For combining the scale and its derivatives with the actual motion states, this seems obvious. But the presented results show how crucial this is also for mixing between image coordinates. For the majority of the evaluated sequences, the average overlap achieved with the separated IMM states is larger than with the other configurations. An improvement can also be observed when a combination of dynamics and scale is avoided. Thus, the second configuration (IMM 2) outperforms the naive state extension of the first configuration (IMM 1). This state splitting is also recommended when the actual motion is described in 3D. In summary, when relying on directly observed measurements, which is common for 2D tracking, a naive extension of the state vector should be avoided when tracking with an IMM filter instead of a single Bayes filter. However, the fact that independent states are affected by mixing the inputs from the elemental filters, which is a result of the approximation required for optimal filtering without keeping every possible model sequence, is easily forgotten when applying an IMM filter for direct tracking in 2D. With this simple reminder, better IMM filtering can be achieved. While the overall performance can be further improved by selecting alternative motion models that better fit the dynamics of the object in the scene, awareness of not naively extending the state is also crucial. All states of an IMM state vector should depend on each other, and hence each additional independent state and its derivatives should be handled by an additional IMM filter. In this way, the conditioning on the current best fitting model cannot negatively affect the overall performance. The motion of an object in image space is a very good example where the dynamics along the image axes should be considered independently with an IMM filter.

4 Conclusion

In this paper, we showed how a naive extension of the state space can negatively affect the performance of an IMM filter. The required approximation, which merges the outputs of the elemental filters based on the current best fitting filter, affects states that are independent of each other. Especially when tracking an object based only on direct image space measurements, a combination of both image coordinates in one state vector should be avoided. This simple reminder of a common fallacy helps to improve the effectiveness of an IMM filter in handling varying system characteristics. The benefit of this favored design scheme for an IMM filter configuration was shown on the VOT 2014 dataset.