
1 Introduction

Tracking multiple targets in crowded and unconstrained environments has many applications in video surveillance and security systems. It is a challenging problem due to the high amount of noise in the measured data, occlusions among targets, and interactions of targets with each other or with other objects. Currently, tracking-by-detection is considered the most successful solution to this problem [4, 26, 27, 29, 33]. However, tracking detection boxes is not sufficient for many real applications such as human activity recognition and analysis.

In this work, we propose an approach that tracks segmented targets instead of their corresponding detection boxes in multi-view video sequences. We extract superpixels from the detection boxes in all images and associate them across different views and time steps. Associating several superpixels within a detection box yields a segmentation; associating several segmentations from different views yields a 3D reconstruction; and associating segmentations or reconstructions over time (i.e., temporal association) yields tracks of segmented targets through the video sequence. In other words, we address the problem of segmentation, reconstruction, and tracking of multiple targets in multi-view video sequences.

In contrast to previous works, we aim to assign a unique target ID not only to each individual detection, but to every superpixel in the entire multi-view video sequence. In common with some other approaches [14, 18], the problem is first formulated as a maximum a-posteriori (MAP) problem and then mapped onto a constrained flow graph, which can be solved efficiently by off-the-shelf binary linear programming solvers. This work extends [18], which incorporates reconstruction into tracking, and is inspired by [25], which addresses joint video segmentation and tracking. Our main contributions are (1) combined segmentation, reconstruction, and tracking of an unknown number of targets in multi-view video sequences; and (2) a new constrained flow graph that takes multi-view couplings and low-level superpixel information into account. Experimental results on standard, publicly available datasets show that the method outperforms many other methods in tracking performance while achieving comparable segmentation performance.

2 Related Work

Tracking-by-detection is the most successful strategy and has been explored intensively by many researchers [1, 20, 23, 31, 33]. Here, a set of detections is first obtained by applying object detection algorithms to all images and is then fed into a data association algorithm that tracks the targets (i.e., recovers their identities) across the sequence of frames such that the resulting trajectories are smooth. The main challenge is the data association problem, where the number of possible associations of targets over the time frames grows exponentially with the number of targets. To address this problem, modern approaches cast it in different ways, such as a graph optimization whose solution can be obtained using Integer Linear Programming [5, 6], a network flow [31, 34], continuous or discrete-continuous energy minimization [3, 26], and generalized clique graphs [12, 33]. In order to make the problem tractable, some researchers apply restrictions, such as reducing the targets' states to the observations in the optimization problem [6, 21, 34] or restricting the measurements [10, 21, 22, 34]. However, these techniques only track a set of bounding boxes containing the objects; evidently, in many applications a finer tracking of the targets is highly desirable.

In order to achieve a finer tracking of objects, video segmentation techniques [8, 9, 16] are used to assign semantic labels to the pixels in a sequence of frames such that pixels belonging to the same target preserve their label throughout the entire video sequence. For instance, the authors of [19, 28] use video segmentation for pedestrian tracking. Fragkiadaki and Shi [14] cast multi-target tracking as a clustering of low-level trajectories in order to enhance the tracking results in cluttered situations. Milan et al. [25] track superpixels over the frame sequence by casting superpixel tracking as a multi-label optimization problem; they define several types of cost functions in their graphical model (a Conditional Random Field (CRF)), and the solution of the optimization yields a joint segmentation and tracking of targets. However, their approach is based on a single view.

Here, we use integer linear programming for joint segmentation, reconstruction, and tracking of multiple targets observed by multiple cameras. The proposed approach performs data association among the superpixels extracted in each view as well as across views, so that targets are simultaneously segmented, reconstructed, and tracked in a multi-view setup.

3 Approach

A 2D image detection is defined by a tuple \(\mathcal {D}_i= (x_i,s_i,c_i,t_i)\), where \(x_i\) is the position, \(s_i\) the size of the detection, \(c_i\) the camera and \(t_i\) the time. A superpixel \(sp_j\) is a group of image pixels in one frame that have similar color, and each detection \(\mathcal {D}_i\) can be split into several superpixels. A segmentation \( \mathcal {S}_i=\{sp_j\} \) of a detection \(\mathcal {D}_i\) is then a set of at least one superpixel which represents a target (e.g., a human body):

$$\begin{aligned} \mathcal {S}_i \subseteq \{sp|\forall sp_j,sp_k \in \mathcal {S}_i, j\ne k:c_{sp_j}= c_{sp_k} \wedge t_{sp_j}=t_{sp_k}\}. \end{aligned}$$
(1)

where \(c_{sp_j}\) is the camera and \(t_{sp_j}\) is the time of superpixel \(sp_j\). The set of all feasible segmentations is \(\mathcal {S}= \{\mathcal {S}_i\}\). A 3D reconstruction \(\mathcal {R}_k\) is a set of segmentations from different views. We denote the number of segmentations within one reconstruction as its cardinality \(|\mathcal {R}_k|\). Every reconstruction must have at least two coupled segmentations, i.e., \(|\mathcal {R}_k|\ge 2\). The set of all reconstructions is \(\mathcal {R}\). Ideally, one reconstruction corresponds to one real world target. Thus, from each camera at most one segmentation can be included in one \(\mathcal {R}_k\):

$$\begin{aligned} \mathcal {R}_k \subseteq \{\mathcal {S}|\forall \mathcal {S}_i,\mathcal {S}_j \in \mathcal {R}_k, i\ne j:c_i\ne c_j \wedge t_i=t_j\}. \end{aligned}$$
(2)

where \(c_i\) is the camera and \(t_i\) the time of segmentation \(\mathcal {S}_i\). The set of the union of segmentations and reconstructions is denoted as \(V=\mathcal {S} \cup \mathcal {R}\). A trajectory hypothesis is defined as \(\mathcal {T}_u=\{V_{u_1},V_{u_2},...,V_{u_n}\}\), which means each node in a trajectory can be either a segmentation or a reconstruction. The complete association hypothesis \(\mathcal {T}\) is a set of trajectory hypotheses, i.e., \( \mathcal {T} = \{\mathcal {T}_u\} \). Given a set of segmentations \(\mathcal {S}\) and a set of reconstructions \(\mathcal {R}\), the tracking task is achieved by finding an optimal set of tracks \(\mathcal {T}^*\), which has the Maximum a-Posteriori (MAP) probability:

$$\begin{aligned} \mathcal {T}^*&= \mathop {\mathrm{argmax}}\limits _{\mathcal {T}} P(\mathcal {T}|\mathcal {R},\mathcal {S})\nonumber \\&= \mathop {\mathrm{argmax}}\limits _{\mathcal {T}} P(\mathcal {R},\mathcal {S}|\mathcal {T})P(\mathcal {T})\nonumber \\&= \mathop {\mathrm{argmax}}\limits _{\mathcal {T}}P(\mathcal {R}|\mathcal {T})P(\mathcal {S}|\mathcal {R},\mathcal {T})P(\mathcal {T})\nonumber \\&= \mathop {\mathrm{argmax}}\limits _{\mathcal {T}}P(\mathcal {S}|\mathcal {T})P(\mathcal {R}|\mathcal {T})P(\mathcal {T}) \end{aligned}$$
(3)

Using Bayes' rule, the posterior probability can be written as the product of the likelihood and the prior. The likelihood term can be further decomposed since the segmentation \( \mathcal {S} \) and the reconstruction \( \mathcal {R} \) are conditionally independent. Next we introduce two constraints: two different segmentations \( \mathcal {S}_i \) and \( \mathcal {S}_j \) cannot share the same superpixel, and two different reconstructions \(\mathcal {R}_k\) and \(\mathcal {R}_l\) cannot share the same segmentation:

$$\begin{aligned} \mathcal {S}_i \cap \mathcal {S}_j&= \emptyset , \ \forall i \ne j,\ \forall \mathcal {S}_i,\mathcal {S}_j \in \mathcal {T}\nonumber \\ \mathcal {R}_k \cap \mathcal {R}_l&= \emptyset , \ \forall k \ne l,\ \forall \mathcal {R}_k,\mathcal {R}_l \in \mathcal {T} \end{aligned}$$
(4)

We further assume non-overlapping trajectories, i.e., one segmentation \(\mathcal {S}_i\) or one reconstruction \(\mathcal {R}_k\) can only be part of at most one trajectory:

$$\begin{aligned} \mathcal {T}_u \cap \mathcal {T}_v = \emptyset , \ \forall u \ne v \end{aligned}$$
(5)

With these non-overlapping assumptions, the individual segmentation likelihood probabilities \(P(\mathcal {S}_i|\mathcal {T})\) are conditionally independent and so are the individual reconstruction likelihood probabilities \(P(\mathcal {R}_k|\mathcal {T})\). Moreover, the individual prior probabilities \(P(\mathcal {T}_u)\) are independent as well. So the MAP formulation can be factorized as:

$$\begin{aligned} \mathcal {T}^*&= \mathop {\mathrm{argmax}}\limits _{\mathcal {T}} \prod \limits _{\mathcal {S}_i \in \mathcal {S}} P(\mathcal {S}_i|\mathcal {T}) \prod \limits _{\mathcal {R}_k \in \mathcal {R}} P(\mathcal {R}_k|\mathcal {T}) \prod \limits _{\mathcal {T}_u\in \mathcal {T}} P(\mathcal {T}_u)\nonumber \\ s.t.&\mathcal {S}_i \cap \mathcal {S}_j = \emptyset , \ \forall i \ne j,\ \forall \mathcal {S}_i,\mathcal {S}_j \in \mathcal {T}\nonumber \\&\mathcal {R}_k \cap \mathcal {R}_l = \emptyset , \ \forall k \ne l,\ \forall \mathcal {R}_k,\mathcal {R}_l \in \mathcal {T}\nonumber \\&\mathcal {T}_u \cap \mathcal {T}_v = \emptyset , \ \forall u \ne v \end{aligned}$$
(6)

Here, the segmentation likelihood \( P(\mathcal {S}_i|\mathcal {T}) \) and the reconstruction likelihood \( P(\mathcal {R}_k|\mathcal {T}) \) measure the quality of the segmentation and reconstruction, respectively. They are precisely defined in Sect. 5.
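To make these definitions and constraints concrete, the following sketch (illustrative only; the class and function names are ours, not the original implementation) encodes superpixels and segmentations as plain data structures and checks the feasibility conditions of Eqs. (1) and (2):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Superpixel:
    pixels: Tuple[Tuple[int, int], ...]   # (row, col) image coordinates
    camera: int                           # c_{sp_j}
    time: int                             # t_{sp_j}

@dataclass(frozen=True)
class Segmentation:
    superpixels: Tuple[Superpixel, ...]   # assumed non-empty and feasible

def feasible_segmentation(sps: List[Superpixel]) -> bool:
    """Eq. (1): at least one superpixel, all from one camera and one time."""
    return len(sps) >= 1 and all(
        sp.camera == sps[0].camera and sp.time == sps[0].time for sp in sps)

def feasible_reconstruction(segs: List[Segmentation]) -> bool:
    """Eq. (2): at least two segmentations, pairwise different cameras,
    identical time steps."""
    cams = [s.superpixels[0].camera for s in segs]
    times = [s.superpixels[0].time for s in segs]
    return len(segs) >= 2 and len(set(cams)) == len(cams) and len(set(times)) == 1
```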

The nodes (which are either segmentations or reconstructions) in a trajectory hypothesis \(\mathcal {T}_u\) form a Markov chain. Hence, the a priori probability of a single trajectory \(P(\mathcal {T}_u)\) is defined as:

$$\begin{aligned} P(\mathcal {T}_u)&= P(\{V_{u_1},V_{u_2},...,V_{u_n}\})\nonumber \\&=P_{en}(V_{u_1})P_{link}(V_{u_2}|V_{u_1})...P_{ex}(V_{u_n}) \end{aligned}$$
(7)

where \(P_{en}(V_{u_i})\) and \(P_{ex}(V_{u_i})\) are the probabilities of a trajectory starting or terminating at node \(V_{u_i}\), respectively, and \(P_{link}(V_{u_j}|V_{u_i})\) is the transition probability from \(V_{u_i}\) to \(V_{u_j}\). By solving the optimization problem of Eq. (3), which corresponds to a hypergraph like the one in Fig. 1, we are able to track, segment, and reconstruct all objects simultaneously. The final MAP formulation can be solved by K-shortest paths [6], Dynamic Programming [34], or Binary Integer Programming (BIP) [17, 18]. In this work, we reformulate it as a constrained cost-flow graph and solve it using Binary Integer Programming.
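As a concrete reading of Eq. (7), the negative log prior of a single trajectory decomposes into one entry cost, a chain of transition costs, and one exit cost; these are exactly the costs that reappear in Eq. (11) below. A minimal sketch (argument names are illustrative):

```python
import math

def trajectory_neg_log_prior(nodes, p_en, p_link, p_ex):
    """-log P(T_u) of Eq. (7): one entry term for the first node, Markov
    transition terms between consecutive nodes, one exit term for the last.
    p_en and p_ex map a node to a probability; p_link maps (prev, next)."""
    cost = -math.log(p_en(nodes[0]))
    for prev, nxt in zip(nodes[:-1], nodes[1:]):
        cost += -math.log(p_link(prev, nxt))
    cost += -math.log(p_ex(nodes[-1]))
    return cost
```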

Fig. 1.

An example of the proposed flow graph for three time steps (\(t_0-t_2\)). One or several superpixels can be associated to build a segmentation node, and several segmentation nodes can be associated to build a reconstruction node. The association of all nodes over time leads to target tracking. E denotes the source node and T the sink node.

4 Min-Cost Flow Graph

A hypergraph G is defined by its sets of vertices and edges, \(G=\{\mathcal {V}, \mathcal {E}\}\). The set of all vertices is the union of segmentations and reconstructions, together with additional source and sink nodes E (start of a trajectory) and T (end of a trajectory), i.e., \(\mathcal {V}=\mathcal {S}\cup \mathcal {R}\cup \{E,T\}\). The edges \(\mathcal {E}\) consist of the hyperedges \(\mathcal {S}_i\) and \(\mathcal {R}_k\), corresponding to 2D segmentations and 3D reconstructions, respectively, as well as the temporal edges \(\mathcal {E}_{k,l}=(\mathcal {V}_k,\mathcal {V}_l)\), which connect two nodes between frames. In addition, we connect each segmentation and reconstruction node to the source node E via a source edge and to the sink node T via a sink edge. The source and sink nodes enable the graph formulation to solve the problem even when the number of trajectories (targets) is not known a priori.

As a flow graph, each edge can carry a certain amount of flow f with a unit cost c per flow. More specifically, the segmentation hyperedge \(\mathcal {S}_i\) has a flow \(f_{\mathcal {S}_i}\) with cost \(C_{\mathcal {S}_i}\), the reconstruction hyperedge \(\mathcal {R}_k\) has a flow \(f_{\mathcal {R}_k}\) with cost \(C_{\mathcal {R}_k}\), and a temporal edge \(\mathcal {E}_{k,l}\) has a flow \(f_{k,l}\) with cost \(C_{k,l}\). Analogously, the source and sink edges have flows \(f_{en,k}\) and \(f_{ex,k}\) with costs \(C_{en,k}\) and \(C_{ex,k}\), respectively.

Due to the trajectory non-overlap constraint in Eq. (5), each flow path through the graph corresponds to one trajectory; thus the total number of trajectories corresponds exactly to the total flow from E to T. The maximum allowed flow through an edge is set to 1, as one object can belong to only one trajectory and vice versa. A flow of \(f=0\) means the corresponding edge is not chosen, and a flow of \(f=1\) means the edge is part of a trajectory.

The constraint defined in Eq. (5) is reformulated as the flow conservation constraint, i.e., the sum of the outgoing flows of one node equals the sum of the incoming flows:

$$\begin{aligned} f_{en,i} + \sum \limits _j f_{j,i}&= f_{\mathcal {S}_i} = \sum \limits _j f_{i,j} +f_{ex,i}\nonumber \\ f_{en,k} + \sum \limits _l f_{l,k}&= f_{\mathcal {R}_k} = \sum \limits _l f_{k,l} +f_{ex,k} \end{aligned}$$
(8)

In Eq. (4) we also introduced the non-overlap constraints for segmentations and reconstructions. These coupling constraints are expressed in the flow graph representation as:

$$\begin{aligned} \sum \limits _{\mathcal {S}_i \in Q_j} f_{\mathcal {S}_i}&\le 1, \quad \forall j, Q_j=\{i|sp_j \in \mathcal {S}_i\}\nonumber \\ \sum \limits _{\mathcal {R}_k \in O_l} f_{\mathcal {R}_k}&\le 1, \quad \forall l, O_l=\{k|\mathcal {S}_l \in \mathcal {R}_k\} \end{aligned}$$
(9)

These constraints mean that for all segmentations \(\mathcal {S}_i\) and \(\mathcal {S}_j\) which share a superpixel, i.e., \(\mathcal {S}_i \cap \mathcal {S}_j \ne \emptyset \), the sum of the corresponding flows \(f_{\mathcal {S}_i}\) and \(f_{\mathcal {S}_j}\) must be less than or equal to 1; analogously, for all reconstructions \(\mathcal {R}_k\) and \(\mathcal {R}_l\) which share the same segmentation, the sum of the corresponding flows \(f_{\mathcal {R}_k}\) and \(f_{\mathcal {R}_l}\) must be either 0 or 1.

Next, we reformulate the problem of maximizing the probabilities of the assignment hypothesis \(\mathcal {T}\) as the problem of finding the best flow through the whole graph with minimum overall cost. The flow costs can be related to the probabilities after taking the negative logarithm on the MAP formulation:

$$\begin{aligned} \mathcal {T}^*&=\mathop {\mathrm{argmin}}\limits _{\mathcal {T}}\sum \limits _{\mathcal {S}_i\in \mathcal {T}_u}-\log (P(\mathcal {S}_i|\mathcal {T}))+\sum \limits _{\mathcal {R}_k\in \mathcal {T}_u}-\log (P(\mathcal {R}_k|\mathcal {T})) +\sum \limits _{\mathcal {T}_u\in \mathcal {T}}-\log (P(\mathcal {T}_u))\nonumber \\&=\mathop {\mathrm{argmin}}\limits _{\mathcal {T}}\sum \limits _{\mathcal {S}_i\in \mathcal {T}_u}-\log (P(\mathcal {S}_i|\mathcal {T}))+\sum \limits _{\mathcal {R}_k\in \mathcal {T}_u}-\log (P(\mathcal {R}_k|\mathcal {T}))\nonumber \\&\qquad \qquad +\sum \limits _{\mathcal {T}_u\in \mathcal {T}}(-\log P_{en}(V_{u_1})+\sum \limits _j-\log P_{link}(V_{u_{j+1}}|V_{u_j})-\log P_{ex}(V_{u_n}))\nonumber \\&=\mathop {\mathrm{argmin}}\limits _{\mathcal {T}}\sum \limits _i C_{\mathcal {S}_i}f_{\mathcal {S}_i}+\sum \limits _k C_{\mathcal {R}_k}f_{\mathcal {R}_k}\nonumber \\&\qquad \qquad +\sum \limits _k C_{en,k}f_{en,k}+\sum \limits _{k,l} C_{k,l}f_{k,l}+\sum \limits _k C_{ex,k}f_{ex,k} \end{aligned}$$
(10)

The costs C thus follow directly from the MAP formulation:

$$\begin{aligned} C_{\mathcal {S}_i}&= -\log P(\mathcal {S}_i|\mathcal {T})\nonumber \\ C_{\mathcal {R}_k}&= -\log P(\mathcal {R}_k|\mathcal {T})\nonumber \\ C_{en,k}&= -\log P_{en}(V_k)\nonumber \\ C_{k,l}&= - \log P_{link}(V_l|V_k)\nonumber \\ C_{ex,k}&= -\log P_{ex}(V_k) \end{aligned}$$
(11)

In this way, the tracking, segmentation, and reconstruction problem is modeled as a BIP problem, which is solvable using cutting-plane, branch-and-cut, or branch-and-price methods. In our implementation we use the CPLEX integer programming solver.
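CPLEX is accessed through its standard modeling API; purely as an illustration of the BIP structure, the following sketch builds a tiny instance of the same constrained min-cost flow with the open-source PuLP modeler. The toy graph and its costs are invented for the example; note that node costs must be able to become negative (e.g., in log-likelihood-ratio form) for any non-empty flow to beat the empty solution:

```python
import pulp

# Illustrative toy graph: two segmentation hypotheses sharing a superpixel
# (coupling group, Eq. (9)), one reconstruction, one temporal edge.
node_cost = {"S1": -1.5, "S2": 0.7, "R1": -0.8}      # C_{S_i}, C_{R_k}
edge_cost = {("S1", "R1"): 0.1}                      # C_{k,l}
couplings = [["S1", "S2"]]                           # mutually exclusive
C_en, C_ex = 0.2, 0.2                                # entry / exit costs

prob = pulp.LpProblem("constrained_min_cost_flow", pulp.LpMinimize)
f_n = {v: pulp.LpVariable(f"f_{v}", cat="Binary") for v in node_cost}
f_e = {e: pulp.LpVariable(f"f_{e[0]}_{e[1]}", cat="Binary") for e in edge_cost}
f_en = {v: pulp.LpVariable(f"fen_{v}", cat="Binary") for v in node_cost}
f_ex = {v: pulp.LpVariable(f"fex_{v}", cat="Binary") for v in node_cost}

# Objective, Eq. (10): node, transition, entry and exit costs.
prob += (pulp.lpSum(c * f_n[v] for v, c in node_cost.items())
         + pulp.lpSum(c * f_e[e] for e, c in edge_cost.items())
         + pulp.lpSum(C_en * f_en[v] + C_ex * f_ex[v] for v in node_cost))

# Flow conservation, Eq. (8): entry + incoming = node flow = outgoing + exit.
for v in node_cost:
    inc = pulp.lpSum(f_e[e] for e in edge_cost if e[1] == v)
    out = pulp.lpSum(f_e[e] for e in edge_cost if e[0] == v)
    prob += f_en[v] + inc == f_n[v]
    prob += f_n[v] == out + f_ex[v]

# Coupling, Eq. (9): overlapping hypotheses share at most one unit of flow.
for group in couplings:
    prob += pulp.lpSum(f_n[v] for v in group) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
active = [v for v in node_cost if f_n[v].value() == 1]
```

For this toy data the solver selects S1 and R1, routing one unit of flow E → S1 → R1 → T with total cost \(-1.8\).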

5 Definitions of Probabilities

In order to formulate the problem, we must define all the probabilities presented in the model. They are precisely defined in this section.

5.1 Segmentation Likelihood \( P(\mathcal {S}_i|\mathcal {T}) \)

The segmentation likelihood \(P(\mathcal {S}_i|\mathcal {T})\) defines the quality of a segmentation given a set of trajectories \(\mathcal {T}\). It depends on the position and appearance of the superpixels inside the detection box, as well as the configuration of the overall segmentation (union of the superpixels).

$$\begin{aligned} P(\mathcal {S}_i|\mathcal {T})=P_{pos}(\mathcal {S}_i|\mathcal {T})P_{app}(\mathcal {S}_i|\mathcal {T})P_{conf}(\mathcal {S}_i|\mathcal {T}) \end{aligned}$$
(12)

The position likelihood term \(P_{pos}(\mathcal {S}_i|\mathcal {T})\) is defined as:

$$\begin{aligned} P_{pos}(\mathcal {S}_i|\mathcal {T}) = {\left\{ \begin{array}{ll} P_{pos}(\mathcal {S}_i), &{}\mathcal {S}_i\in \mathcal {T}\\ 1-P_{pos}(\mathcal {S}_i), &{}\text {otherwise} \end{array}\right. } \end{aligned}$$
(13)

where \(P_{pos}(\mathcal {S}_i)\) is the a priori position probability of the segmentation. Analogously, \(P_{app}(\mathcal {S}_i|\mathcal {T})\) and \(P_{conf}(\mathcal {S}_i|\mathcal {T})\) are defined via the a priori probabilities \(P_{app}(\mathcal {S}_i)\) and \(P_{conf}(\mathcal {S}_i)\), respectively. Next, we define these a priori probabilities precisely.

Each superpixel in the detection box has a prior probability \( I_{\mathcal {M}}(\varvec{x}(sp_j)) \) of being part of the segmentation, where \(I_{\mathcal {M}}(\cdot )\) is the intensity of an object shape mask \(\mathcal {M}\) [25] and \(\varvec{x}(sp_j)\) is the center point of the superpixel. The probability \(P_{pos}(\mathcal {S}_i)\) measures how likely it is that all superpixels within \(\mathcal {S}_i\) form a feasible segmentation. The a priori position probability is then modeled as:

$$\begin{aligned} p(sp_j)&= I_{\mathcal {M}}(\varvec{x}(sp_j))\nonumber \\ P_{pos}(\mathcal {S}_i)&= P_{pos}\Big (\bigcap _{sp_j \in \mathcal {S}_i} sp_j\Big )\nonumber \\&= \prod \limits _{sp_j \in \mathcal {S}_i}p(sp_j) \end{aligned}$$
(14)

where we assume that the events of individual superpixels being part of a segmentation are independent of each other; therefore, the probability of the intersection of these events equals the product of the individual probabilities.
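A minimal sketch of Eq. (14), under the assumption that the shape mask is given as a 2D array with values in [0, 1] stretched over the detection box (data layout and names are ours):

```python
import numpy as np

def position_prior(mask: np.ndarray, centers_xy, box_xywh):
    """Eq. (14): P_pos(S_i) = prod_j I_M(x(sp_j)). The shape mask M holds
    values in [0, 1] and is stretched over the detection box."""
    x0, y0, w, h = box_xywh
    mh, mw = mask.shape
    prior = 1.0
    for x, y in centers_xy:                # superpixel centers (image coords)
        u = min(max(int((x - x0) / w * mw), 0), mw - 1)  # column in the mask
        v = min(max(int((y - y0) / h * mh), 0), mh - 1)  # row in the mask
        prior *= float(mask[v, u])
    return prior
```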

The a priori appearance probability takes the similarity of superpixels within one segmentation into account. It depends on the mean color difference of the superpixels:

$$\begin{aligned} P_{app}(\mathcal {S}_i)=\frac{1}{\text {Pair}(|\mathcal {S}_i|)}\sum \limits _{\begin{array}{c} sp_j,sp_k \in \mathcal {S}_i\\ j \ne k \end{array}}\mathcal {F}(1+||{Lab(sp_j)}-{Lab(sp_k)}||,0,\varepsilon _{Lab}) \end{aligned}$$
(15)

where \(\text {Pair}(|\mathcal {S}_i|)\) is the number of pairs of superpixels within segmentation \(\mathcal {S}_i\), \(Lab(sp_j)\) is the average color of superpixel \(sp_j\) in LAB color space, the cardinality \(|\mathcal {S}_i|\) is the number of superpixels within the segmentation, and \(\varepsilon _{Lab}\) is the maximum allowed color deviation in LAB color space. Large color deviations are penalized since, in our datasets, we observed that each object contains only 1–3 colors. To this end, we introduce the decreasing function \(\mathcal {F}\) and use it to map an error to a probability:

$$\begin{aligned} \mathcal {F}(d,d_{min},d_{max}) = \frac{1}{2}\text {erfc}(4\frac{d-d_{min}}{d_{max}-d_{min}}-2) \end{aligned}$$
(16)
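Eq. (16) translates directly into code; the sketch below also evaluates the pairwise appearance term of Eq. (15) on mean LAB colors (our assumption: a single-superpixel segmentation, which has no pairs, receives probability 1):

```python
import math
from itertools import combinations

def F(d, d_min, d_max):
    """Eq. (16): smooth decreasing map from an error d to a probability,
    with F(d_min) ~ 0.998 and F(d_max) ~ 0.002."""
    return 0.5 * math.erfc(4.0 * (d - d_min) / (d_max - d_min) - 2.0)

def appearance_prior(lab_means, eps_lab=50.0):
    """Eq. (15): mean pairwise color agreement of a segmentation's
    superpixels; lab_means is a list of mean LAB colors (3-tuples)."""
    pairs = list(combinations(lab_means, 2))
    if not pairs:                      # single superpixel: nothing to compare
        return 1.0
    return sum(F(1.0 + math.dist(a, b), 0.0, eps_lab) for a, b in pairs) / len(pairs)
```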

The a priori configuration probability models the quality of the overall shape of the segmentation. A segmentation is expected to cover a fraction \(\alpha \) of the detection box area, to be connected, and to have a certain width/height ratio, as it represents the shape of a human. This term is defined as:

$$\begin{aligned} P_{conf}(\mathcal {S}_i)=\frac{1}{\text {CN}(\mathcal {S}_i)} \exp (-||\rho -\alpha ||-\Big |\Big |\frac{w(\mathcal {S}_i)}{h(\mathcal {S}_i)}-\frac{w_i}{h_i}\Big |\Big |) \end{aligned}$$
(17)

where \(\rho \) is the fraction of the detection box covered by the segmentation, \(\alpha \) is the expected coverage, \(w(\mathcal {S}_i)\) and \(h(\mathcal {S}_i)\) are the width and height of segmentation \(\mathcal {S}_i\), respectively, and \(w_i\) and \(h_i\) are the width and height of the corresponding detection \(\mathcal {D}_i\). \(\text {CN}(\mathcal {S}_i)\) is the number of connected components of the segmentation \(\mathcal {S}_i\); to compute it, we label all superpixels, assign the same label to connected superpixels, and count the unique labels. Note that a segmentation with too few superpixels (\(\rho \ll \alpha \)) or an unexpected shape (the segmentation width/height ratio differs from the detection ratio) is penalized.
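A sketch of Eq. (17); the connected-component count CN is obtained here by flood-filling a superpixel adjacency graph, which is one possible implementation of the labeling described above (the adjacency input is our assumption):

```python
import math

def configuration_prior(seg_area, seg_w, seg_h, box_w, box_h,
                        adjacency, alpha=0.8):
    """Eq. (17): penalizes sparse coverage, unexpected aspect ratio and
    fragmentation. adjacency maps a superpixel id to the set of ids it
    touches; flood fill over this graph yields CN(S_i)."""
    rho = seg_area / float(box_w * box_h)     # covered fraction of the box
    seen, cn = set(), 0
    for start in adjacency:                   # count connected components
        if start in seen:
            continue
        cn += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adjacency[node] - seen)
    ratio_err = abs(seg_w / seg_h - box_w / box_h)
    return math.exp(-abs(rho - alpha) - ratio_err) / max(cn, 1)
```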

5.2 Reconstruction Likelihood \( P(\mathcal {R}_k|\mathcal {T}) \)

The probability \( P(\mathcal {R}_k|\mathcal {T}) \) measures the likelihood of a 3D reconstruction \(\mathcal {R}_k\) given a set of trajectories \(\mathcal {T}\). This probability depends on (1) the geometric positions, (2) the appearance similarity, and (3) the qualities of the 2D segmentations \( \mathcal {S}_j \) within the reconstruction \( \mathcal {R}_k \).

$$\begin{aligned} P(\mathcal {R}_k|\mathcal {T}) = P_d(\mathcal {R}_k|\mathcal {T})P_c(\mathcal {R}_k|\mathcal {T})P_s(\mathcal {R}_k|\mathcal {T}) \end{aligned}$$
(18)

Similar to Eq. (13), the geometric and appearance likelihoods in Eq. (18) are defined via the corresponding a priori probabilities \(P_d(\mathcal {R}_k)\) and \(P_c(\mathcal {R}_k)\). The quality likelihood term can be further decomposed. We define these terms precisely in the following.

Geometric Probability. Ideally, all segmentations in a reconstruction should project to the same position in the world coordinate system. However, due to calibration errors and imprecise segmentations, the world positions of the 2D segmentations are unlikely to match up exactly; high deviations are therefore penalized.

The geometric reconstruction error \( \varepsilon _k \) is defined as the root mean square deviation of the segmentations in \( \mathcal {R}_k \) from their mean position in world coordinate system:

$$\begin{aligned} \varepsilon _k&= \sqrt{\frac{1}{|\mathcal {R}_k|}\sum \limits _{\mathcal {S}_i \in \mathcal {R}_k}|\phi ^c(\mathcal {S}_i)-\bar{x}_k|^2}\nonumber \\ \bar{x}_k&= \frac{1}{|\mathcal {R}_k|}\sum \limits _{\mathcal {S}_i \in \mathcal {R}_k}\phi ^c(\mathcal {S}_i) \end{aligned}$$
(19)

here, \(|\mathcal {R}_k|\) is the number of segmentations within the reconstruction and \(\phi ^c(\mathcal {S}_i)\) transforms the position of segmentation \(\mathcal {S}_i\) into the world coordinate system for camera c.

Using the decreasing function \(\mathcal {F}\), the geometric reconstruction probability \( P_d(\mathcal {R}_k) \) is defined as:

$$\begin{aligned} P_d(\mathcal {R}_k) =\mathcal {F}(\varepsilon _k,0,\varepsilon _{max}(\mathcal {R}_k)) \end{aligned}$$
(20)

where \(\varepsilon _{max}(\mathcal {R}_k)\) is the maximum allowed reconstruction error. This value depends on the detector inaccuracies (error in the 2D image) as well as on the camera calibration error (error on the 3D ground plane) and varies with position. Thus, \(\varepsilon _{max}(\mathcal {R}_k)\) is defined as:

$$\begin{aligned} \varepsilon _{max}(\mathcal {R}_k)=\varepsilon _{det}\sum \limits _{\mathcal {S}_j \in \mathcal {R}_k}||\varTheta ^c(\mathcal {S}_j)||+\varepsilon _{cal} \end{aligned}$$
(21)

here, \(\varepsilon _{cal}\) is the calibration error of the cameras and is set to a constant. \(\varepsilon _{det}\) is the detection error resulting from inaccuracies of the detection bounding boxes. Due to the distortion of the camera projection, the influence of the detection error \(\varepsilon _{det}\) strongly depends on the object location. To model this, we first compute the sensitivity of the world coordinate projection function \(\phi ^c(\varvec{x})\) at each image position. This sensitivity depends on the Jacobian matrix of \(\phi ^c(\varvec{x})\) [18]:

$$\begin{aligned} \frac{\partial }{\partial \varvec{x}}\phi ^c(\varvec{x})=J^c(\varvec{x})=\left( \begin{array}{cc} \frac{\partial \phi ^c_{x_{\omega }}}{\partial x} &{} \frac{\partial \phi ^c_{x_{\omega }}}{\partial y}\\ \frac{\partial \phi ^c_{y_{\omega }}}{\partial x}&{}\frac{\partial \phi ^c_{y_{\omega }}}{\partial y} \end{array}\right) \end{aligned}$$
(22)

The sensitivity of the projection function is then defined as the lengths of the gradients of both \(x_{\omega }\) and \(y_{\omega }\) components:

$$\begin{aligned} \varTheta ^c(\varvec{x}) = \left( \begin{array}{c} ||\nabla _{x_{\omega }}\phi ^c(\varvec{x})||\\ ||\nabla _{y_{\omega }}\phi ^c(\varvec{x})|| \end{array}\right) \end{aligned}$$
(23)

In Eq. (21), the detection error \(\varepsilon _{det}\) acts as a weight factor of the sum of projection sensitivities of all cameras that are involved in reconstruction \(\mathcal {R}_k\).
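The geometric term of Eqs. (19)-(21) can be sketched as follows, approximating the Jacobian of Eq. (22) by central finite differences; `phi_c` is assumed to be a callable image-to-ground projection, and `F` is the mapping from the Eq. (16) sketch:

```python
import numpy as np

def sensitivity(phi_c, x, eps=1.0):
    """Eqs. (22)-(23): central-difference Jacobian of the image-to-ground
    projection phi_c at image point x, reduced to the gradient length of
    each world coordinate (Theta^c(x))."""
    x = np.asarray(x, dtype=float)
    d_dx = (phi_c(x + [eps, 0.0]) - phi_c(x - [eps, 0.0])) / (2.0 * eps)
    d_dy = (phi_c(x + [0.0, eps]) - phi_c(x - [0.0, eps])) / (2.0 * eps)
    J = np.column_stack([d_dx, d_dy])    # 2x2 Jacobian of Eq. (22)
    return np.linalg.norm(J, axis=1)     # (||grad x_w||, ||grad y_w||)

def geometric_probability(world_pts, theta_norms, eps_det=4.0, eps_cal=0.5):
    """Eqs. (19)-(21): RMS deviation of the per-view projected positions
    from their mean, mapped to a probability with the position-dependent
    bound eps_max. F is the mapping from the Eq. (16) sketch above."""
    pts = np.asarray(world_pts, dtype=float)
    rms = np.sqrt(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))
    eps_max = eps_det * sum(np.linalg.norm(t) for t in theta_norms) + eps_cal
    return F(rms, 0.0, eps_max)
```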

Appearance Similarity. The segmentations within one reconstruction represent the same target viewed by different cameras; thus, they should have a similar appearance. The appearance similarity \(P_c(\mathcal {R}_k)\) depends on the average colors of the segmentations from the different views:

$$\begin{aligned} P_c(\mathcal {R}_k) = \frac{1}{\text {Pair}(|\mathcal {R}_k|)}\sum \limits _{\begin{array}{c} \mathcal {S}_i,\mathcal {S}_j \in \mathcal {R}_k\\ i \ne j \end{array}}\mathcal {F}(1+||{Lab(\mathcal {S}_i)}-{Lab(\mathcal {S}_j)}||,0,\varepsilon _{Lab}) \end{aligned}$$
(24)

where \(\text {Pair}(|\mathcal {R}_k|)\) is the number of pairs of segmentations within reconstruction \(\mathcal {R}_k\).

Quality of the Involved Segmentations. \( P_s(\mathcal {R}_k|\mathcal {T}) \) defines the overall quality of the segmentations within the reconstruction \(\mathcal {R}_k\): a reconstruction that contains good segmentations has a higher probability and vice versa. It is defined as the product of the segmentation likelihoods of all segmentations within \( \mathcal {R}_k \):

$$\begin{aligned} P_{s}(\mathcal {R}_k|\mathcal {T}) = \prod \limits _{\mathcal {S}_i \in R_k}P(\mathcal {S}_i|\mathcal {T}) \end{aligned}$$
(25)

where the segmentation likelihood \(P(\mathcal {S}_i|\mathcal {T})\) is defined as in Eq. (12).

5.3 Transition Probability \(P_{link}\)

For two nodes \(V_k,V_l\), the transition probability \( P_{link}(V_l|V_k) \) defines the probability that the two nodes belong to the same trajectory. Here, \(V_l\) and \(V_k\) can be two segmentations, two reconstructions, or one segmentation and one reconstruction. This probability consists of three terms: (1) the spatial and (2) the temporal distance probability between the two nodes, as well as (3) the superpixel linking probability:

$$\begin{aligned} P_{link}(V_l|V_k) = P(\bar{x}_l|\bar{x}_k,\varDelta \tau )P(\varDelta \tau )P_t(V_l|V_k) \end{aligned}$$
(26)

Spatial Distance Probability. The spatial term is defined by the distance probability function:

$$\begin{aligned} P(\bar{x}_l|\bar{x}_k,\varDelta \tau )&= \mathcal {F}(||\bar{x}_l-\bar{x}_k||,0,\frac{v_{max}}{f}\varDelta \tau )\nonumber \\ \varDelta \tau&= \tau _l-\tau _k \end{aligned}$$
(27)

where \( v_{max} \) is the maximum velocity of the target, f is the frame rate and \(\varDelta \tau \) is the frame gap.

Temporal Distance Probability. As in [21], the temporal term is defined with an exponential model:

$$\begin{aligned} P(\varDelta \tau ) = \gamma ^{n(\varDelta \tau -1)} \end{aligned}$$
(28)

where \(\gamma \) is the false negative rate of the detector and n is the number of cameras that should generate observations of the targets in the frame gap.

Superpixel Linking Probability. In order to generate the temporal links between superpixels, we first run the temporal superpixel (TSP) algorithm of Chang et al. [11]. The superpixel linking probability then depends on the number of superpixels that have the same ID in the two nodes:

$$\begin{aligned} P_t(\mathcal {V}_l|\mathcal {V}_k) = \frac{1}{\min (N_l,N_k)}\sum \limits _{\begin{array}{c} sp_i \in \mathcal {V}_l\\ sp_j \in \mathcal {V}_k \end{array}}[TSP(sp_i)=TSP(sp_j)] \end{aligned}$$
(29)

where \([\cdot ]\) is the indicator function and \(N_k\) is the number of superpixels within node \(V_k\) (which can be either a segmentation or a reconstruction). This term favors superpixels keeping the same ID throughout the whole trajectory.
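The three factors of Eq. (26) combine as in the sketch below; the node attributes (mean world position, frame index, TSP ids) are illustrative, and counting same-ID pairs via a set intersection assumes the ids are unique within a node. `F` is again the mapping from the Eq. (16) sketch:

```python
import math

def transition_probability(node_k, node_l, v_max=5.0, fps=7.0, gamma=0.3,
                           n_cams=1):
    """Eq. (26) as the product of Eqs. (27)-(29). A node carries its mean
    world position .x, its frame index .t and the TSP ids of its
    superpixels .tsp_ids; transitions always span d_tau >= 1 frames."""
    d_tau = node_l.t - node_k.t
    p_spatial = F(math.dist(node_k.x, node_l.x), 0.0, v_max / fps * d_tau)
    p_temporal = gamma ** (n_cams * (d_tau - 1))
    shared = len(set(node_k.tsp_ids) & set(node_l.tsp_ids))  # same-ID pairs
    p_link = shared / min(len(node_k.tsp_ids), len(node_l.tsp_ids))
    return p_spatial * p_temporal * p_link
```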

5.4 Entrance and Exit Probabilities \(P_{en}\) and \(P_{ex}\)

The entrance and exit probabilities of a node (segmentation or reconstruction) measure the probability of a trajectory to start or terminate at this node. Similar to [18], we assume these probabilities are positive when (1) the position of the node is close enough to the image border, (2) the size of the node is about the minimum detection size of the object detector, or (3) the node lies in the first frame (high entrance probability) or in the last frame (high exit probability). Otherwise, the entrance and exit probabilities are 0.
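A heavily hedged sketch of this heuristic, using the boundary and maximum-probability parameters reported in Sect. 6 and omitting the minimum-detection-size condition:

```python
def entrance_probability(pos, scene_w, scene_h, frame, first_frame,
                         d_b_max=1.0, p_en_max=0.1):
    """Border heuristic for P_en (the exit probability P_ex is analogous,
    with the last frame instead of the first). The minimum-detection-size
    condition of the text is omitted here for brevity."""
    if frame == first_frame:
        return p_en_max
    d_border = min(pos[0], pos[1], scene_w - pos[0], scene_h - pos[1])
    return p_en_max if d_border <= d_b_max else 0.0
```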

6 Experiments

6.1 Tracking

Dataset. For the evaluation of tracking performance, we test our algorithm on the widely used PETS 2010 Benchmark [13], comprising the sequences S2L1, S2L2, S2L3, and S1L2. The sequences in the benchmark show varying pedestrian densities and dynamic behaviors. Many inter-object and long-term occlusions exist in the dataset, which makes it very challenging for detection and multi-target tracking. Moreover, the frame rate of PETS 2010 is only 7 frames per second, which means large object displacements between consecutive frames, making precise tracking even more challenging.

Evaluation Metrics. We use the same detection results as in [18] and the publicly available ground truth of [2, 18]. As in [26, 32], the Hungarian algorithm is employed to assign the tracking output to the ground truth. For tracking precision and recall, we report the widely used CLEAR MOT metrics [7], including Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP). We also report three popular metrics proposed in [24]: mostly tracked (MT, \(>80\,\%\) overlap), mostly lost (ML, \(<20\,\%\) overlap), and partly tracked (PT). These metrics show the temporal coverage of the ground truth trajectories by the algorithm.

Table 1. Quantitative tracking results on PETS 2010 dataset. Evaluations are compared to cl2 [15], DP [31], DCO [27], SegTrack [25].

Settings. Although our algorithm supports an arbitrary number of views, we report results using one to three views for every video sequence. We also employ a sliding window, the commonly used methodology for processing video sequences of arbitrary length. In our experiments, we use a sliding window of 50 frames with an overlap of 9 frames. The result for the whole sequence is obtained with the Hungarian algorithm [30], which matches trajectories between adjacent windows.
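As an illustration of this stitching step, the following sketch uses SciPy's linear_sum_assignment as the Hungarian matcher [30] with a mean-distance cost over the overlapping frames (the cost design and data layout are our assumptions):

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment

def stitch_windows(tracks_a, tracks_b, overlap_frames, max_cost=1.0):
    """Match the trajectories of two adjacent sliding windows by their mean
    distance over the overlapping frames; pairs at max_cost stay unmatched.
    tracks_a / tracks_b: lists of dicts mapping frame -> (x, y) position."""
    cost = np.full((len(tracks_a), len(tracks_b)), max_cost)
    for i, ta in enumerate(tracks_a):
        for j, tb in enumerate(tracks_b):
            common = [f for f in overlap_frames if f in ta and f in tb]
            if common:
                cost[i, j] = np.mean([math.dist(ta[f], tb[f]) for f in common])
    rows, cols = linear_sum_assignment(cost)     # Hungarian assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_cost]
```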

Next, we set the parameters required by the model for all scenarios. These parameters are intuitive, and we report the parameter set that performed best in our experiments.

The average detection box inaccuracy \(\varepsilon _{det}\) is set to 4 px and the expected calibration error \(\varepsilon _{cal}\) to 0.5 m. We choose a false negative rate of \(\gamma =0.3\) to cope with the different densities of the scenarios. The maximum color deviation in LAB color space is set to \(\varepsilon _{Lab}=50\). The expected segmentation ratio \(\alpha \) is 0.8.

The maximum walking speed of a person is limited to \(v_{max}=5\,\text {m/s}\), which makes it possible to track a running person. The maximum frame gap is set to \(\varDelta \tau _{max}=9\): a higher frame gap results in more transition connections in the graph and higher computational cost, while a lower frame gap reduces performance.

The boundary of the scene is set to \(d_{b,max}=1\). For all entrance and exit probabilities, we set the maximum value to \(P_{en,max}=P_{ex,max}=0.1\).

6.2 Segmentation

We evaluate the segmentation performance on the PETS S2L2 sequence. This scenario has a medium person density and contains many target occlusions, which makes it very challenging for object segmentation. We report the following error metrics: the clustering error (percentage of misclassified pixels; cl. err.), the per-region error (average ratio of falsely labeled pixels per ground truth mask; per-reg. err.), the number of segments that cover each mask (over-seg.), and the number of extracted objects (of which at least \(90\,\%\) of the area is correctly segmented; extr. obj.).

Table 2. Quantitative segmentation results on PETS S2L2 scenario compared to TSP [11], Greedy [25] and SegTrack [25].
Table 3. Runtime (in seconds) of the CPLEX solver on an i5 2.6 GHz, 4 GB RAM.

6.3 Results and Discussion

The qualitative results are shown in Fig. 2, and the quantitative tracking results compared to other methods are shown in Table 1. The proposed algorithm outperforms all previous methods in MOTA (accuracy) and MOTP (precision). In most cases, the introduction of additional cameras improves the overall tracking performance. For example, using two cameras on the S2L3 dataset, where many long-term occlusions exist, the tracking accuracy (MOTA) is significantly improved. However, the accuracy drops again when the number of cameras is increased further, due to the accumulation of camera calibration errors as well as color differences between the cameras. This effect is stronger in crowded scenarios and leads to assignment errors.

Fig. 2.

Qualitative results on PETS 2010 with two cameras. The first and second rows show the results of the first and second views of the S2L2 dataset; the third and fourth rows show the results of the first and second views of the S2L3 dataset. Solid boxes represent detected and segmented targets; dotted boxes represent occluded but reconstructed targets.

The quantitative segmentation results are summarized in Table 2. Combined video segmentation, reconstruction, and tracking in a multi-view environment is very challenging. Compared to the three baselines, which only segment in a single view, we obtain very close performance, although our task is more complex: we segment, track, and reconstruct with multiple views.

The runtime of the algorithm for the different scenarios and numbers of views is shown in Table 3. The computational cost rises with the density of the scenario. On the simple scenarios S2L1 and S1L2, the algorithm runs in less time than the scenario length; for the other scenarios, the runtime with two views is close to the scenario length.

7 Conclusion

In this paper, we have proposed a novel algorithm that, for the first time, addresses the problem of joint segmentation, reconstruction, and tracking of multiple targets in multi-view video sequences. In contrast to previous works that accomplish either segmentation-tracking or reconstruction-tracking, we accomplish these tasks simultaneously in a global optimization framework. We cast the problem into a cost-flow graph whose vertices are built from the superpixels extracted from target detection boxes and whose edges represent the costs of possible associations. We use binary integer programming to find the minimum-cost flow from the source to the sink of the graph. The obtained optimal superpixel association simultaneously provides segmentation, reconstruction, and tracking of the targets. The experimental results confirm that our algorithm outperforms other algorithms in tracking while its segmentation results are comparable. Future work should incorporate other visual cues, such as optical flow, into the optimization framework to improve the segmentation results by introducing new constraints between superpixels. Furthermore, we currently extract superpixels only from detection boxes; this could be extended to whole images.