1 Introduction

Recent years have witnessed great developments of CCTV systems in the domain of automated inland waterway surveillance [1–3].

Although numerous tracking algorithms [4–6] have been proposed in the literature over the past couple of decades, ship tracking remains challenging due to factors such as pose variation, illumination change, occlusion and motion blur [7]. A typical tracking system contains an accurate motion model, which relates the target over time, and an appearance model, which evaluates the likelihood of an observed image patch [8]. Recently, sparse coding [9–13] has been successfully applied to visual tracking with promising results. Zhang et al. [14] present a comprehensive review of state-of-the-art tracking algorithms based on sparse coding. In their seminal work, these methods are divided into three main categories: appearance modeling based on sparse coding (AMSC), target searching based on sparse representation (TSSR) and their combination. Reliable experimental results demonstrate that AMSC methods significantly outperform TSSR methods. A typical AMSC method is proposed in [15], which learns an appearance model based on the response distribution of basis functions extracted by independent component analysis (ICA). This milestone work is robust to appearance change due to its feature selection strategy. However, the learned basis functions are too general to encode the difference between the target and the background, so the resulting model is not discriminative enough. Building on this work, dozens of methods have been proposed to improve its performance. Zhang et al. [16] propose an effective appearance model based on sparse coding, in which the responses of general basis functions extracted by ICA from a large set of natural image patches are used as features, and the appearance of the target is modeled as the probability distribution of these features. This method is robust to appearance change caused by pose variation, illumination change and partial occlusion. However, it does not directly address the scale issue.

Gai et al. [17] present a feature extraction method based on the quaternion wavelet transform (QWT) to improve the performance of banknote classification. In [18], a multi-scale texture classifier that uses features extracted from the sub-bands of the reduced quaternion wavelet transform (RQWT) decomposition is proposed in the transform domain. The experimental results demonstrate that it achieves a high texture classification accuracy.

Zhang et al. [19] propose a compressive tracking (CT) method. Their appearance model is built on features extracted from a multi-scale image feature space with a data-independent basis. They employ a very sparse measurement matrix to extract the features, and the features in the compressed domain are then classified via a naive Bayes classifier. This method achieves high efficiency and accuracy in inland waterway ship tracking. To ensure that the tracker based on subspace representation maintains the structure of the appearance information, they choose a very sparse measurement matrix that satisfies the restricted isometry property (RIP), applied to Haar-like features. However, the resulting random projection matrix lacks randomness, and the memory load for the integral image is somewhat high. Moreover, multi-scale tracking of the object of interest is not considered.

Motivated by the above discussion, in this paper we propose an effective and efficient ship tracking algorithm. Firstly, the fern-based subspace appearance model is robust to pose change, illumination variation and occlusion. Secondly, to reduce the computational complexity and increase the randomness of the random projection matrix, we choose rapid fern features instead of Haar-like features. Thirdly, we employ a feedback mechanism to reduce tracking drift and to perform multi-scale ship tracking.

The remainder of this paper is organized as follows: Sect. 2 describes the proposed method in detail. Section 3 reports experiments on challenging CCTV video sequences that illustrate the effectiveness of the proposed method. Finally, Sect. 4 presents a brief summary.

2 The proposed algorithm

2.1 Fern-compressed appearance model

An image is a signal that can be compressed. Recent results in information theory show that a random matrix satisfying the Johnson–Lindenstrauss lemma also satisfies the RIP with high probability [20]. This means that we can employ a random matrix of this kind to extract features from the object of interest: the extracted features maintain the structure of the object and, with high probability, preserve the distances between points when they are projected onto a low-dimensional vector space [21]. Motivated by this theoretical support, we adopt the same sparse random matrix as in [19, 20] to efficiently extract fern features for the appearance model. The matrix is defined as follows:

$$\begin{aligned} r_{i,j} =\sqrt{s}\times \left\{ {\begin{array}{ll} 1 &{} \hbox {with probability}\ \frac{1}{2s} \\ 0 &{} \hbox {with probability}\ 1-\frac{1}{s} \\ -1 &{} \hbox {with probability}\ \frac{1}{2s} \\ \end{array} } \right. \end{aligned}$$
(1)

Such a random measurement matrix with \(s=2\) or 3 has been proved to satisfy the Johnson–Lindenstrauss lemma [20]. Meanwhile, this matrix is very sparse and performs more accurately than the standard Gaussian matrix \({{\varvec{R}}}\in {\mathbb {R}}^{n\times m}\) with \(r_{ij} \sim N(0,1)\), which typically satisfies the RIP constraint [21–23]. A typical feature extraction procedure is:

$$\begin{aligned} \mathbf{y}={\varvec{R}}\mathbf{x} \end{aligned}$$
(2)

where \(\mathbf{x}\in {\mathbb {R}}^{m}\) corresponds to a high-dimensional image space, \({{\varvec{R}}}\in {\mathbb {R}}^{n\times m}\) represents the sparse random matrix and \(\mathbf{y}\in {\mathbb {R}}^{n}\) corresponds to a low-dimensional space. Under the RIP condition, the theoretical bound [20] on the dimension of the random matrix \({{\varvec{R}}}\) is:

$$\begin{aligned} n\ge \frac{4+2\beta }{\epsilon ^{2}/2-\epsilon ^{3}/3}\hbox {ln}(d) \end{aligned}$$
(3)

where \(\beta \) controls the probability of success of the projection, \(\epsilon \) controls the desired accuracy in distance preservation and \(d\) is the number of input points in \({\mathbb {R}}^{m}\). In practice, for the inland waterway ship tracking application, we find that \(n\ge 60\) is sufficient to obtain good tracking results.
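
For concreteness, Eqs. (1)–(3) can be sketched in a few lines of Python (our own dense NumPy illustration; the function names, the patch dimension and the random test vector are assumptions, not part of the method itself):

```python
import numpy as np

def sparse_random_matrix(n, m, s=3, rng=None):
    """Very sparse measurement matrix of Eq. (1): each entry equals
    +sqrt(s) w.p. 1/(2s), 0 w.p. 1 - 1/s, and -sqrt(s) w.p. 1/(2s)."""
    rng = np.random.default_rng() if rng is None else rng
    signs = rng.choice([1.0, 0.0, -1.0], size=(n, m),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s) * signs

def jl_dimension_bound(d, eps, beta):
    """Theoretical lower bound of Eq. (3) on the compressed dimension n."""
    return (4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * np.log(d)

# Eq. (2): project a vectorized image patch into the compressed domain.
m, n = 1024, 60                  # n >= 60 suffices in our application
R = sparse_random_matrix(n, m, s=3)
x = np.random.rand(m)            # stand-in for a high-dimensional patch vector
y = R @ x                        # low-dimensional feature vector
```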

To maintain the structure of the object of interest and avoid using holistic templates for sparse representation, Zhang et al. [19] extract a linear combination of generalized Haar-like features. However, Haar-like features still encode holistic information; that is, the resulting random projection matrix lacks randomness. At the same time, the memory load for computing and storing the integral image is relatively high. Therefore, in our tracking system, we build the ship appearance model on fern-compressed features. A typical fern feature is defined as:

$$\begin{aligned} f_i =\left\{ {{\begin{array}{l} 0\quad \hbox {if}\, I\left( {d_{i,1} } \right) <I\left( {d_{i,2} } \right) \\ 1\quad \hbox {otherwise} \\ \end{array} }} \right. \end{aligned}$$
(4)

The construction of a fern feature is illustrated in Fig. 1.

Fig. 1

Fern feature for each group

For a given patch, we first randomly generate \(N\) fern features, i.e., \(N\) pixel-pair comparisons, and split them into \(M\) groups. For each group, a random number generator draws a number \(k\) between 2 and \(s\), so that the group contains \(k\) fern features. Next, the fern feature values of the group are concatenated to produce a group feature value in \([0,2^{k}-1]\). The corresponding weights of the fern features form the very sparse random matrix, whose elements are defined by Eq. (1). The array multiplication between the features and the weights implicitly yields the sparse representation of the object of interest. Compressive sensing theory ensures that the extracted features preserve almost all the salient information of the original image.
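
A minimal sketch of the fern-group construction just described (our own illustration; the patch shape, group count and function names are assumptions):

```python
import numpy as np

def make_fern_groups(n_groups, k_min=2, k_max=4, patch_shape=(32, 32), rng=None):
    """Randomly generate the pixel-pair tests of each fern group;
    each group holds k tests, with k drawn uniformly from [k_min, k_max]."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = patch_shape
    groups = []
    for _ in range(n_groups):
        k = int(rng.integers(k_min, k_max + 1))
        # k tests, each comparing the intensities at two random pixel locations
        groups.append(rng.integers(0, [h, w], size=(k, 2, 2)))
    return groups

def fern_group_values(patch, groups):
    """Eq. (4) per test: bit = 0 if I(d1) < I(d2), else 1; the k bits of a
    group are concatenated into an integer feature in [0, 2^k - 1]."""
    values = []
    for pairs in groups:
        v = 0
        for (r1, c1), (r2, c2) in pairs:
            bit = 0 if patch[r1, c1] < patch[r2, c2] else 1
            v = (v << 1) | bit
        values.append(v)
    return np.array(values)

# Example: 50 group features over a 32x32 grayscale patch
patch = np.random.rand(32, 32)
features = fern_group_values(patch, make_fern_groups(n_groups=50))
```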

2.2 Naive Bayes classifier and parameters update

Having obtained the low-dimensional representation \(\mathbf{y}=(y_1 ,y_2 ,\ldots ,y_n )^{T}\) of each sample \(\mathbf{x}=(x_1 ,x_2 ,\ldots ,x_m )^{T}\) \((m~\gg ~n)\), we assume that all elements of \(\mathbf{y}\) are independently distributed. We then utilize the same naive Bayes classifier and parameter update equations as in [19], listed sequentially in Eqs. (5) to (9):

$$\begin{aligned} H\!\!\left( \mathbf{y} \right) =\mathop \sum \limits _{i=1}^n \hbox {log}\left( \frac{p(y_i |Y=1)}{p(y_i |Y=0)}\right) \end{aligned}$$
(5)

\(Y\in \{0,~1\}\) represents the binary sample label. The conditional distributions of the numerator and the denominator in Eq. (5) are assumed to be Gaussian, where

$$\begin{aligned} p(y_i |Y=1)\sim N(\mu _i^1 ,\sigma _i^1 )\end{aligned}$$
(6)
$$\begin{aligned} p(y_i |Y=0)\sim N(\mu _i^0 ,\sigma _i^0 ) \end{aligned}$$
(7)

The parameters update equations are listed as follows:

$$\begin{aligned}&\displaystyle \mu _i^1 \leftarrow \gamma \mu _i^1 +(1-\gamma )\mu ^{1} \end{aligned}$$
(8)
$$\begin{aligned}&\displaystyle \sigma _i^1 \leftarrow \sqrt{\gamma (\sigma _i^1 )^{2}+(1-\gamma )(\sigma ^{1})^{2}+\gamma (1-\gamma )(\mu _i^1 -\mu ^{1})^{2}} \end{aligned}$$
(9)

\(\gamma \!>\!0\) is the learning rate, \(\sigma ^{1}\!=\sqrt{\frac{1}{n}\mathop \sum \nolimits _{k=0|y=1}^{n-1} (v_i\!\left( k \right) -\mu ^{1})^{2}}\) and \(\mu ^{1}=\frac{1}{n}\mathop \sum \nolimits _{k=0|y=1}^{n-1} v_i (k)\). Finally, the candidate with the maximal classifier response \(H(\mathbf{y})\) is taken as the tracking location in the current frame.
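
The classifier response and the online update can be written compactly (a vectorized NumPy sketch under our assumptions; the numerical-stability constant and the learning-rate value are ours):

```python
import numpy as np

def nb_response(y, mu1, sig1, mu0, sig0, eps=1e-30):
    """Eq. (5): sum over features of the log ratio of the Gaussian
    likelihoods of Eqs. (6)-(7)."""
    def gauss(v, mu, sig):
        return np.exp(-(v - mu) ** 2 / (2 * sig ** 2)) / (np.sqrt(2 * np.pi) * sig)
    return np.sum(np.log((gauss(y, mu1, sig1) + eps) /
                         (gauss(y, mu0, sig0) + eps)))

def update_params(mu_i, sig_i, samples, gamma=0.85):
    """Eqs. (8)-(9): blend the stored per-feature Gaussian parameters with
    the mean/std estimated from this frame's samples (features x samples)."""
    mu = samples.mean(axis=1)
    sig = samples.std(axis=1)
    mu_new = gamma * mu_i + (1 - gamma) * mu
    sig_new = np.sqrt(gamma * sig_i ** 2 + (1 - gamma) * sig ** 2
                      + gamma * (1 - gamma) * (mu_i - mu) ** 2)
    return mu_new, sig_new
```

At test time, `nb_response` is evaluated for every candidate patch and the maximum is taken as the new location.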

Fig. 2

Feedback mechanism via point tracking from \({{\varvec{I}}}_t \) to \({\varvec{I}}_{t+1} \)

2.3 Multi-scale tracking

Through the steps above, in frame \(t+1\) we obtain the most likely candidate ship location \(D_{t+1}\), with the same height and width as the region manually selected in the first frame. However, we can take full advantage of the fern features to tackle the scale problem (Fig. 2). Taking the fern feature locations as seed points, we first perform point tracking from \({\varvec{I}}_t \) to \({\varvec{I}}_{t+1} \) (red and blue lines) and record the corresponding positions as \(P_t\) and \(P_{t+1}\), respectively. Next, we filter out the points that fall outside the most likely candidate bounding box and put the remainder into a container \(C\). We then track the points in \(C\) backward from \({\varvec{I}}_{t+1} \) to \({\varvec{I}}_t \) (green line) and record the resulting positions as \(P_{t^{{\prime }}} \). For each point we calculate the distance \(D_{i}\) between \(P_{ti}\) and \(P_{t^{{\prime }}i} \) and discard the points in \(C\) whose distance exceeds the median of the distance set. Finally, we compute the pairwise distances \(D_{(t+1)ij}\) between the remaining points in \(C\) and the corresponding distances \(D_{(t)ij}\) in frame \(t\) to obtain the scale factor \(S\):

$$\begin{aligned} S=\hbox {med}\left( \frac{D_{(t+1)ij} }{D_{(t)ij} }\right) \end{aligned}$$
(10)

Eventually, the final object location can be recalculated as:

$$\begin{aligned} D_{(t+1)\mathrm{final}} =D_{t+1} \times S \end{aligned}$$
(11)
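
Assuming a point tracker (e.g., pyramidal Lucas–Kanade) supplies the forward and backward trajectories, the scale estimate of Eqs. (10)–(11) reduces to a median of pairwise-distance ratios; a sketch under those assumptions (array shapes and the epsilon guard are ours):

```python
import numpy as np

def estimate_scale(pts_t, pts_t1, pts_back):
    """Feedback scale factor S of Eq. (10). pts_t: seed points in frame t;
    pts_t1: their forward-tracked positions in frame t+1; pts_back: the
    backward-tracked positions of pts_t1 in frame t. All (N, 2) arrays."""
    # forward-backward error D_i; keep points at or below the median error
    fb_err = np.linalg.norm(pts_t - pts_back, axis=1)
    keep = fb_err <= np.median(fb_err)
    a, b = pts_t[keep], pts_t1[keep]
    # pairwise distances D_(t)ij and D_(t+1)ij among the surviving points
    d_t = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)
    d_t1 = np.linalg.norm(b[:, None, :] - b[None, :, :], axis=2)
    iu = np.triu_indices(len(a), k=1)
    return np.median(d_t1[iu] / np.maximum(d_t[iu], 1e-12))

# The candidate box D_{t+1} is then rescaled by S to give the final
# location of Eq. (11).
```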

The main steps of our algorithm are summarized in Table 1:

Table 1 Main steps of our algorithm

3 Implementation and experimental results

To demonstrate the favorable performance of the proposed algorithm, we compare our tracker with three recent state-of-the-art trackers on 10 challenging CCTV videos: the multiple instance learning (MIL) tracker [4], the tracking-learning-detection (TLD) tracker [6] and the CT tracker [19]. The 10 CCTV videos contain very challenging scenes with cluttered backgrounds, illumination variation, partial or even full occlusion and extreme scale change. For a fair comparison, all the trackers are implemented in MATLAB. We present the qualitative evaluation results in Sect. 3.1 and the quantitative evaluation results in Sect. 3.2.

3.1 Qualitative evaluations

See Fig. 3.

Fig. 3

Screenshots of sampled tracking results (red: our tracker, blue: CT tracker, green: TLD tracker, black: MIL tracker). a CCTV 1, b CCTV 2, c CCTV 3, d CCTV 4, e CCTV 5, f CCTV 6, g CCTV 7, h CCTV 8, i CCTV 9, j CCTV 10 (color figure online)

3.2 Quantitative evaluations

We use two metrics to evaluate the proposed tracker and the reference trackers: center location error (CLE, in pixels) and average frames per second (FPS). The experimental results are shown in Table 2 and illustrated in Fig. 4. In Table 2, red fonts indicate the best performance and blue fonts the second best. Our tracker achieves the best or second-best performance on all sequences, in terms of both CLE and FPS.

Fig. 4

Center location error for each sequence

Table 2 Quantitative evaluation results of ten challenging CCTV videos

3.2.1 Low video quality, motion blur and cluttered background

CCTV 1 and CCTV 2 are long sequences of low video quality. The blurry images caused by fast motion make it difficult to track the ship. As shown in frame 1000 of CCTV 1, when the ship undergoes out-of-plane rotation, both the TLD tracker and the MIL tracker drift away from the ship because of the drastic appearance change caused by the abrupt motion. To make matters worse, in frame 280 of CCTV 2, when the ship moves into a region along the river bank, both the TLD tracker and the MIL tracker drift to the background because its texture is similar to that of the ship. Both our tracker and the CT tracker perform well on these two sequences because the subspace-representation-based feature extraction enables the classifier to better differentiate the target from the cluttered background.

3.2.2 Illumination change and pose variation

The CCTV 3, CCTV 4, CCTV 5 and CCTV 6 sequences contain illumination change and pose variation, which makes them very challenging. The MIL tracker performs poorly on these four sequences because its greedy feature selection method is prone to overfitting and tends to select less discriminative features, leading to error accumulation over time. Thus, the MIL tracker eventually drifts, as shown in frame 420 of CCTV 5 and frame 300 of CCTV 6. The TLD method consists of a short-term median flow tracker and a cascade of object detectors. The median flow tracker is sensitive to appearance change caused by illumination variation (see frame 470 of CCTV 3 and frame 300 of CCTV 4). Both the proposed tracker and the CT tracker are robust to illumination change and pose variation because the ship appearance can be well modeled by random projection based on the Johnson–Lindenstrauss lemma. Meanwhile, discriminative models with local features have been demonstrated to handle pose variation well.

3.2.3 Partial and full occlusion

The ship undergoes partial and even full occlusion in CCTV 7 and CCTV 8. The MIL tracker represents training data in the form of bags and does not consider the importance of individual samples in its learning procedure; it may therefore select positive samples that are less informative. The TLD tracker works better than the MIL tracker due to its online semi-supervised P-N learning procedure, but yields unstable results when two ships occlude each other severely, as in frame 800 of CCTV 7 and frame 630 of CCTV 8. Our tracker and the CT tracker achieve better results on these sequences due to the power of subspace representation: the discriminative model based on compressed features better separates the object of interest from its surrounding background.

3.2.4 Extreme scale change

CCTV 9 and CCTV 10 are very long sequences with extreme scale change. One of the main goals of our work is adaptiveness to scale variation. Neither the CT tracker nor the MIL tracker considers this issue, while our tracker and the TLD tracker handle it in different ways. The TLD tracker adopts a predefined scale set and performs a greedy detection procedure at all scales over the input image using a sliding-window strategy. The predefined scale set may introduce significant error (frame 813 of CCTV 9 and frame 305 of CCTV 10). Our tracker performs better than the TLD tracker on these sequences. On the one hand, it inherits the merits of subspace representation, which is robust to scale change. On the other hand, our fern-based appearance model is discriminatively learned from the target and the background with a data-independent measurement, and the fern features are compressively sensed with a very sparse measurement matrix; compressive sensing theory ensures that the extracted features preserve almost all the information of the input image. Moreover, our feedback mechanism better estimates the scale variation between consecutive frames, thereby facilitating separation of the foreground from the background.

4 Conclusion

This paper presents a multi-scale ship tracking algorithm based on a fern-compressed appearance model. In this work, a very sparse measurement matrix is adopted to efficiently extract the compressed fern features from the foreground and background. We then update a naive Bayes classifier in the compressed domain and tackle the multi-scale tracking problem through a feedback mechanism. Numerous experiments on challenging video clips demonstrate that the proposed algorithm achieves favorable performance in terms of both efficiency and accuracy.