Abstract
Ship tracking plays a key role in inland waterway closed-circuit television (CCTV) video surveillance. Although much success has been demonstrated in the construction of effective appearance models, numerous issues remain to be addressed due to factors such as pose and illumination change, partial or full occlusion, abrupt scale variation and motion blur. In this paper, we first inherit the intrinsic merits of subspace representation, which demonstrates robustness to partial or full occlusion and to pose and illumination variation. A very sparse measurement matrix is adopted to extract the features for the appearance model, and a naive Bayes classifier with online update is employed to determine whether an image patch belongs to the foreground or background. Second, in order to increase the randomness of the random projection matrix and further reduce the memory load, we develop our ship appearance model based on fern features in the compressed domain. Third, we track the scale by enhancing the tracker with a feedback mechanism. Finally, both qualitative and quantitative evaluations on numerous challenging CCTV videos demonstrate that the proposed algorithm achieves favorable performance in terms of efficiency and accuracy.
1 Introduction
Recent years have witnessed great development of CCTV systems in the domain of inland waterway automated surveillance [1–3].
Although numerous tracking algorithms [4–6] have been proposed in the literature over the past couple of decades, ship tracking remains challenging due to factors such as pose variation, illumination change, occlusion and motion blur [7]. A typical tracking system contains a motion model, which relates the target over time, and an appearance model, which evaluates the likelihood of an observed image patch [8]. Recently, sparse coding [9–13] has been successfully applied to visual tracking with promising results. Zhang et al. [14] present a comprehensive review of state-of-the-art tracking algorithms based on sparse coding. In that work, these methods are divided into three main categories: appearance modeling based on sparse coding (AMSC), target searching based on sparse representation (TSSR) and their combination. The experimental results demonstrate that AMSC methods significantly outperform TSSR methods. A typical AMSC method is proposed in [15], which learns an appearance model based on the response distribution of basis functions obtained by independent component analysis (ICA). This milestone work is robust to appearance change due to its feature selection strategy. However, the learned basis functions are too general to encode the difference between the target and the background, resulting in a model that is not discriminative enough. Based on this work, dozens of methods have been proposed to improve its performance. Zhang et al. [16] propose a novel and effective appearance model based on sparse coding. In [16], the responses of general basis functions extracted by ICA from a large set of natural image patches are considered as features, and the appearance of the target is modeled as the probability distribution of these features. This method is robust to appearance change caused by pose variation, illumination change and partial occlusion. However, it does not directly consider the scale issue. Gai et al.
[17] present a feature extraction method based on the quaternion wavelet transform (QWT) to improve banknote classification performance. In [18], a multi-scale texture classifier that uses features extracted from the sub-bands of the reduced quaternion wavelet transform (RQWT) decomposition is proposed in the transform domain. The experimental results demonstrate that this method achieves a high texture classification accuracy.
Zhang et al. [19] propose a compressive tracking (CT) method. Their appearance model is constructed from features extracted in a multi-scale image feature space with a data-independent basis. A very sparse measurement matrix is employed to extract the features, which are then classified in the compressed domain via a naive Bayes classifier. This method achieves high efficiency and accuracy in inland waterway ship tracking. To ensure that the tracker based on subspace representation maintains the structure of the appearance information, a very sparse measurement matrix that satisfies the restricted isometry property (RIP) is chosen, based on Haar-like features. However, this choice limits the randomness of the random projection matrix, and the memory load for the integral image is relatively high. Meanwhile, multi-scale tracking of the object of interest is not considered.
Motivated by the above discussion, in this paper we propose an effective and efficient ship tracking algorithm. First, the subspace fern-based appearance model demonstrates robustness to pose change, illumination variation and occlusion. Second, in order to reduce computational complexity and increase the randomness of the random projection matrix, we choose rapid fern features instead of Haar-like features. Third, we employ a feedback mechanism to reduce tracking drift and perform multi-scale ship tracking.
The remainder of this paper is organized as follows: Sect. 2 describes the proposed method in detail. Section 3 conducts numerous experiments on challenging CCTV video sequences to illustrate the effectiveness of the proposed method. Finally, Sect. 4 presents a brief summary.
2 The proposed algorithm
2.1 Fern-compressed appearance model
An image is a signal that can be compressed. Recent information theory strongly suggests that a random matrix satisfying the Johnson–Lindenstrauss lemma also satisfies the RIP [20]. This means that we can employ a random matrix of this kind to extract features from the object of interest. The extracted features maintain the structure of the object and, with high probability, preserve the distances between points when projecting them onto a low-dimensional vector space [21]. Motivated by this theoretical support, we adopt the same sparse random matrix as in [19, 20] to efficiently extract fern features for the appearance model. The matrix \({{\varvec{R}}}=(r_{ij})\) is defined as below:

$$\begin{aligned} r_{ij}=\sqrt{s}\times {\left\{ \begin{array}{ll} 1&{}\quad \hbox {with probability } \frac{1}{2s}\\ 0&{}\quad \hbox {with probability } 1-\frac{1}{s}\\ -1&{}\quad \hbox {with probability } \frac{1}{2s} \end{array}\right. } \end{aligned}$$
(1)
Such a random measurement matrix with \(s=2\) or 3 is proven to satisfy the Johnson–Lindenstrauss lemma [20]. Meanwhile, the random matrix is very sparse and computationally more efficient than the standard Gaussian matrix \({{\varvec{R}}}\in {\mathbb {R}}^{n\times m}\) with \(r_{ij} \sim N(0,1)\), which satisfies the RIP constraint [21–23]. A typical feature extraction procedure is:

$$\begin{aligned} \mathbf{y}={{\varvec{R}}}\mathbf{x} \end{aligned}$$
(2)
where \(\mathbf{x}\in {\mathbb {R}}^{m}\) corresponds to a high-dimensional image space, \({{\varvec{R}}}\in {\mathbb {R}}^{n\times m}\) represents the sparse random matrix and \(\mathbf{y}\in {\mathbb {R}}^{n}\) corresponds to a low-dimensional space. Under the RIP condition, the theoretical dimensional bound [20] of the random matrix \({{\varvec{R}}}\) is:

$$\begin{aligned} n\ge \frac{4+2\beta }{\epsilon ^{2}/2-\epsilon ^{3}/3}\ln d \end{aligned}$$
(3)
where \(\beta \) controls the probability of a successful projection, \(\epsilon \) controls the desired accuracy in distance preservation and \(d\) represents the number of input points in \({\mathbb {R}}^{m}\). In practice, for the inland waterway ship tracking application, we find that \(n\ge 60\) is sufficient to obtain good tracking results.
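As an illustration, the very sparse measurement matrix of Eq. (1) and the projection \(\mathbf{y}={{\varvec{R}}}\mathbf{x}\) can be sketched in a few lines (a minimal Python/NumPy sketch; the function name and parameter defaults are ours, not from the paper):

```python
import numpy as np

def sparse_random_matrix(n, m, s=3, rng=None):
    """Very sparse measurement matrix of Eq. (1): entries take the value
    +sqrt(s) with probability 1/(2s), 0 with probability 1 - 1/s, and
    -sqrt(s) with probability 1/(2s)."""
    rng = np.random.default_rng(rng)
    values = np.array([np.sqrt(s), 0.0, -np.sqrt(s)])
    idx = rng.choice(3, size=(n, m), p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return values[idx]

# Project a high-dimensional patch descriptor x into n dimensions: y = R x.
R = sparse_random_matrix(n=60, m=1024, s=3, rng=0)
x = np.random.default_rng(1).random(1024)   # stand-in for an image-patch vector
y = R @ x                                   # 60-dimensional compressed features
```

With \(s=3\) only about one-third of the entries are nonzero, so the projection touches far fewer pixels than a dense Gaussian projection would.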
In order to maintain the structure of the object of interest and avoid using holistic templates for sparse representation, Zhang et al. [19] extract a linear combination of generalized Haar-like features. However, Haar-like features still represent holistic information, which limits the randomness of the random projection matrix. At the same time, the memory load for computing and storing the integral image is relatively high. Therefore, in our tracking system, we develop the ship appearance model based on fern-compressed features. A typical fern feature is a binary pixel-pair comparison, defined as:

$$\begin{aligned} f_{i}={\left\{ \begin{array}{ll} 1&{}\quad \hbox {if } I(p_{i,1})>I(p_{i,2})\\ 0&{}\quad \hbox {otherwise} \end{array}\right. } \end{aligned}$$
(4)

where \(I(p)\) denotes the image intensity at pixel location \(p\) and \((p_{i,1},p_{i,2})\) is a randomly chosen pixel pair.
It is illustrated in Fig. 1 in detail.
For a given patch, we first randomly generate \(N\) fern features, i.e., \(N\) pixel-pair comparisons, and split them into \(M\) groups. For each group, a random number generator is employed to generate a number \(k\) from 2 to \(s\); that is, there are \(k\) fern features in this group. In the next step, the fern feature values are concatenated to produce a group feature value in \([0,2^{k}-1]\). The corresponding weights for each fern feature form the very sparse random matrix, whose elements are defined by Eq. (1). The element-wise multiplication between the features and weights implicitly denotes the sparse representation of the object of interest. Compressive sensing theory ensures that the extracted features preserve almost all the salient information of the original image.
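The grouping-and-concatenation step above can be sketched as follows (an illustrative Python/NumPy sketch; the data layout — a list of pixel-pair groups plus one sparse-matrix weight per group — is our assumption about how Eq. (1) and the fern values combine):

```python
import numpy as np

def random_ferns(h, w, n_groups, s, rng=None):
    """Randomly generate n_groups fern groups for an h-by-w patch; each
    group holds k pixel-pair comparisons with k drawn from {2, ..., s}."""
    rng = np.random.default_rng(rng)
    groups = []
    for _ in range(n_groups):
        k = int(rng.integers(2, s + 1))
        groups.append([(tuple(rng.integers(0, [h, w])),
                        tuple(rng.integers(0, [h, w]))) for _ in range(k)])
    return groups

def fern_compressed_features(patch, groups, weights):
    """Concatenate the binary pixel comparisons of each group into a value
    in [0, 2^k - 1], then scale by that group's sparse-matrix weight."""
    feats = []
    for group, w in zip(groups, weights):
        value = 0
        for (r1, c1), (r2, c2) in group:
            value = (value << 1) | (1 if patch[r1, c1] > patch[r2, c2] else 0)
        feats.append(w * value)
    return np.asarray(feats, dtype=float)
```

Unlike Haar-like features, no integral image is needed: each feature reads only its own few pixel pairs.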
2.2 Naive Bayes classifier and parameters update
Having obtained the low-dimensional representation \(\mathbf{y}=(y_1 ,y_2 ,\ldots ,y_n )^{T}\) for each sample \(\mathbf{x}=(x_1 ,x_2 ,\ldots ,x_m )^{T}\) \((m\gg n)\), we make the assumption that all elements in \(\mathbf{y}\) are independently distributed. We then utilize the same naive Bayes classifier and parameter update equations as in [19], sequentially listed from Eqs. (5) to (9):

$$\begin{aligned} H(\mathbf{y})=\sum _{i=1}^{n}\log \left( \frac{p(y_{i}\mid Y=1)p(Y=1)}{p(y_{i}\mid Y=0)p(Y=0)}\right) \end{aligned}$$
(5)
\(Y\in \{0,~1\}\) represents the binary sample label, and we assume uniform priors \(p(Y=1)=p(Y=0)\). The conditional distributions in both the numerator and denominator of Eq. (5) are assumed to be Gaussian:

$$\begin{aligned} p(y_{i}\mid Y=1)\sim N(\mu _{i}^{1},\sigma _{i}^{1}) \end{aligned}$$
(6)

$$\begin{aligned} p(y_{i}\mid Y=0)\sim N(\mu _{i}^{0},\sigma _{i}^{0}) \end{aligned}$$
(7)

where \(\mu _{i}^{1}\) \((\mu _{i}^{0})\) and \(\sigma _{i}^{1}\) \((\sigma _{i}^{0})\) are the mean and standard deviation of the positive (negative) class.
The parameter update equations are as follows:

$$\begin{aligned} \mu _{i}^{1}\leftarrow \gamma \mu _{i}^{1}+(1-\gamma )\mu ^{1} \end{aligned}$$
(8)

$$\begin{aligned} \sigma _{i}^{1}\leftarrow \sqrt{\gamma (\sigma _{i}^{1})^{2}+(1-\gamma )(\sigma ^{1})^{2}+\gamma (1-\gamma )(\mu _{i}^{1}-\mu ^{1})^{2}} \end{aligned}$$
(9)
where \(\gamma >0\) is the learning parameter, \(\sigma ^{1}=\sqrt{\frac{1}{n}\mathop \sum \nolimits _{k=0|y=1}^{n-1} (y_i\left( k \right) -\mu ^{1})^{2}}\) and \(\mu ^{1}=\frac{1}{n}\mathop \sum \nolimits _{k=0|y=1}^{n-1} y_i (k)\), computed over the \(n\) positive samples in the current frame (the negative-class parameters are updated analogously). Finally, the candidate with the maximal response of the classifier H is taken as the tracking location in the current frame.
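Equations (5)–(9) can be sketched as a small online classifier (a minimal Python/NumPy sketch with equal class priors; the class name and the default value of \(\gamma \) are ours, not from the paper):

```python
import numpy as np

class CompressedNaiveBayes:
    """Naive Bayes over compressed features: per-feature Gaussian
    conditionals (Eqs. (6)-(7)) with the online update of Eqs. (8)-(9)."""

    def __init__(self, n_features, gamma=0.85):
        self.gamma = gamma                     # learning parameter
        self.mu = np.zeros((2, n_features))    # row 1: target, row 0: background
        self.sigma = np.ones((2, n_features))

    def update(self, feats, label):
        """Online update from a batch of one class's samples.
        feats: (n_samples, n_features) compressed feature vectors."""
        mu_new, sigma_new = feats.mean(axis=0), feats.std(axis=0)
        g, mu_old = self.gamma, self.mu[label].copy()
        self.mu[label] = g * mu_old + (1 - g) * mu_new
        self.sigma[label] = np.sqrt(g * self.sigma[label] ** 2
                                    + (1 - g) * sigma_new ** 2
                                    + g * (1 - g) * (mu_old - mu_new) ** 2)

    def score(self, feats):
        """Sum of per-feature log-likelihood ratios (Eq. (5), equal priors)."""
        eps = 1e-30
        ll = [-np.log(self.sigma[c] + eps)
              - (feats - self.mu[c]) ** 2 / (2 * self.sigma[c] ** 2 + eps)
              for c in (0, 1)]
        return (ll[1] - ll[0]).sum(axis=1)
```

At each frame, `update` would be called with positive samples drawn near the tracked location and negative samples drawn farther away, and the candidate with the largest `score` is kept.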
2.3 Multi-scale tracking
By all of the steps above, in frame \(t+1\) we have obtained the most likely candidate ship location \(D_{t+1}\), with the same height and width as the box manually chosen in the first frame. However, we can take full advantage of the fern features to subtly tackle the scale problem (Fig. 2). Regarding the fern feature locations as seed points, we first perform point tracking from \({\varvec{I}}_t \) to \({\varvec{I}}_{t+1} \) (lines in red and blue) and record the corresponding positions as \(P_t\) and \(P_{t+1}\), respectively. Next, we filter out those points outside the most likely candidate bounding box and put the remainder into a container \(C\). We then reversely track the points in \(C\) from \({\varvec{I}}_{t+1} \) to \({\varvec{I}}_t \) (lines in green) and record the tracked positions as \(P_{t^{{\prime }}} \). Then, we calculate the distances \(D_{i}\) between \(P_{ti}\) and \(P_{t^{{\prime }}i} \) and discard those points in \(C\) whose distance is larger than the median \(D_{i\mathrm{med}}\) of the distance set. Finally, we compute the pairwise distances \(D_{(t+1)ij}\) between the remaining points in \(C\) and compare them with the corresponding distances \(D_{(t)ij}\) in frame \(t\) to obtain the final scale factor \(S\):

$$\begin{aligned} S=\mathop {\mathrm{med}}\limits _{i\ne j}\left( \frac{D_{(t+1)ij}}{D_{(t)ij}}\right) \end{aligned}$$
(10)
Eventually, the final object location can be recalculated by rescaling the width \(w_t\) and height \(h_t\) of the bounding box around the tracked center:

$$\begin{aligned} w_{t+1}=S\cdot w_{t},\quad h_{t+1}=S\cdot h_{t} \end{aligned}$$
(11)
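The forward-backward filtering and the scale factor \(S\) can be sketched as follows (a Python/NumPy sketch; the paper does not specify the underlying point tracker, so only the geometric part is shown, and the function names are ours):

```python
import numpy as np

def forward_backward_keep(p_t, p_t1, p_t_back):
    """Keep the points whose forward-backward error is at most the median
    error (the consistency check described above).  p_t, p_t1, p_t_back:
    (n, 2) arrays of point positions in frame t, frame t+1, and after
    tracking back from t+1 to t."""
    err = np.linalg.norm(p_t - p_t_back, axis=1)
    keep = err <= np.median(err)
    return p_t[keep], p_t1[keep]

def scale_factor(p_t, p_t1):
    """Median ratio of pairwise point distances between frames t and t+1."""
    ratios = []
    for i in range(len(p_t)):
        for j in range(i + 1, len(p_t)):
            d_t = np.linalg.norm(p_t[i] - p_t[j])
            if d_t > 0:
                ratios.append(np.linalg.norm(p_t1[i] - p_t1[j]) / d_t)
    return float(np.median(ratios))
```

Taking the median of the distance ratios, rather than the mean, keeps a few badly tracked seed points from corrupting the scale estimate.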
The main steps of our algorithm are summarized in Table 1.
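The per-frame search step of the workflow can be outlined in code (a simplified sketch: an exhaustive window scan with pluggable `extract` and `clf` callables; the multi-scale feedback and classifier update steps are omitted for brevity, and all names are illustrative, not from the paper):

```python
import numpy as np

def track_step(frame, prev_box, extract, clf, radius=8):
    """One tracking step: scan windows within `radius` pixels of the
    previous box, score each with the classifier, keep the argmax.
    `extract` maps an image patch to its compressed feature vector;
    `clf` maps a feature vector to a scalar confidence."""
    x, y, w, h = prev_box
    H, W = frame.shape
    best_score, best_box = -np.inf, prev_box
    for dy in range(-radius, radius + 1, 2):
        for dx in range(-radius, radius + 1, 2):
            nx, ny = x + dx, y + dy
            if 0 <= nx and 0 <= ny and nx + w <= W and ny + h <= H:
                s = clf(extract(frame[ny:ny + h, nx:nx + w]))
                if s > best_score:
                    best_score, best_box = s, (nx, ny, w, h)
    return best_box
```

In the full algorithm this scan would use the fern-compressed features and the naive Bayes score, followed by the scale feedback and the online parameter update.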
3 Implementation and experimental results
To demonstrate the favorable performance of the proposed algorithm, we compare our tracker with three recent state-of-the-art trackers on 10 challenging CCTV videos: the multiple instance learning (MIL) tracker [4], the tracking-learning-detection (TLD) tracker [6] and the CT tracker [19]. The 10 CCTV videos include very challenging scenes that suffer from cluttered background, illumination variation, partial or even full occlusion and extreme scale change. To make a fair comparison, all the trackers mentioned above are implemented in MATLAB. We show the qualitative evaluation results in Sect. 3.1 and the quantitative evaluation results in Sect. 3.2.
3.1 Qualitative evaluations
See Fig. 3.
3.2 Quantitative evaluations
We use two metrics to evaluate the proposed tracker and the reference trackers: center location error (CLE, in pixels) and average frames per second (FPS). The experimental results are shown in Table 2 and illustrated in Fig. 4. The red fonts in Table 2 indicate the best performance, and the blue fonts indicate the second best. Our tracker achieves the best or second-best performance on all sequences, in terms of both CLE and FPS.
3.2.1 Low video quality, motion blur and cluttered background
CCTV 1 and CCTV 2 are long sequences of low video quality. The blurry images due to fast motion make it difficult to track the ship. As shown in frame 1000 of CCTV 1, when the ship undergoes out-of-plane rotation, both the TLD tracker and the MIL tracker drift away from the ship because of the drastic appearance change caused by abrupt motion. To make the situation worse, in frame 280 of CCTV 2, when the ship moves into a region along the river bank, both the TLD tracker and the MIL tracker drift to the background because the texture of the background is similar to that of the ship. Both our tracker and the CT tracker perform well on these two sequences because the feature extraction procedure based on subspace representation enables the classifier to better differentiate the target from the cluttered background.
3.2.2 Illumination change and pose variation
The CCTV 3, CCTV 4, CCTV 5 and CCTV 6 sequences comprise illumination change and pose variation, which makes them very challenging. The MIL tracker performs poorly on these four sequences because its greedy feature selection method has a potential overfitting problem and tends to select less discriminative features. This leads to error accumulation over time, so the MIL tracker eventually drifts, as shown in frame 420 of CCTV 5 and frame 300 of CCTV 6. The TLD method consists of a short-term median flow tracker and a cascade of object detectors. The median flow tracker is sensitive to appearance change caused by illumination variation (see frame 470 of CCTV 3 and frame 300 of CCTV 4). Both the proposed tracker and the CT tracker are robust to illumination change and pose variation because the ship appearance can be well modeled by random projection based on the Johnson–Lindenstrauss lemma. Meanwhile, the discriminative model with local features has been demonstrated to handle pose variation well.
3.2.3 Partial and full occlusion
The ship undergoes partial and even full occlusion in CCTV 7 and CCTV 8. The MIL tracker uses an MIL-based appearance model to represent training data in the form of bags and does not consider the importance of individual samples in its learning procedure. Therefore, the MIL tracker may select positive samples that are less informative. The TLD tracker works better than the MIL tracker due to its online semi-supervised P–N learning procedure, but yields unstable results when two ships occlude each other severely, as in frame 800 of CCTV 7 and frame 630 of CCTV 8. Our proposed tracker and the CT tracker achieve better results on these sequences due to the power of subspace representation: the discriminative model based on compressed features better separates the object of interest from its surrounding background.
3.2.4 Extreme scale change
CCTV 9 and CCTV 10 are very long sequences that suffer from extreme scale change. One of the main goals of our work is to demonstrate adaptiveness to scale variation. Neither the CT tracker nor the MIL tracker considers this issue, while our tracker and the TLD tracker handle it in different ways. The TLD tracker adopts a predefined scale set and performs a greedy detection procedure over all scales for a given input image based on a sliding-window strategy. The predefined scale set may introduce significant error (frame 813 of CCTV 9 and frame 305 of CCTV 10). Our tracker performs better than the TLD tracker on these sequences. On one hand, our tracker inherits the merits of subspace representation, which demonstrates robustness to scale change. On the other hand, our fern-based appearance model is discriminatively learned from the target and background with a data-independent measurement: the fern features are compressively sensed with a very sparse measurement matrix, and compressive sensing theory ensures that the extracted features preserve almost all the information of the input image. Moreover, our feedback mechanism better estimates the scale variation between consecutive frames, thereby facilitating separation of the foreground from the background.
4 Conclusion
This paper presents a multi-scale ship tracking algorithm via a fern-based appearance model. In this work, a very sparse measurement matrix is adopted to efficiently extract the compressed fern features from the foreground and background, and the naive Bayes classifier is updated online in the compressed domain. We tackle the multi-scale tracking problem with a feedback mechanism. Numerous experiments on challenging video clips demonstrate that the proposed algorithm achieves favorable performance in terms of efficiency and accuracy.
References
Coudert, F.: Towards a new generation of CCTV networks: erosion of data protection safeguards? Comput. Law Secur. Rev. 25(2), 145–154 (2009)
Dadashi, N., Stedmon, A.W., Pridmore, T.P.: Semi-automated CCTV surveillance: the effects of system confidence, system accuracy and task complexity on operator vigilance, reliance and workload. Appl. Ergon. 44(5), 730–738 (2013)
Davies, A.C., Velastin, S.A.: A progress review of intelligent CCTV surveillance systems. In: IDAACS, Institute of Electrical and Electronics Engineers Inc., Sofia, Bulgaria, pp. 417–423 (2005)
Babenko, B., Yang, M.H., Belongie, S.: Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1619–1632 (2011)
Teng, F., Liu, Q., Gao, X.Y., Zhu, L.: Real-time ship tracking via enhanced MIL tracker. In: IETET, Conference Publishing System, Kurukshetra, India, pp. 399–404 (2013)
Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409–1422 (2012)
Koohzadi, M., Keyvanpour, M.: OTWC: an efficient object-tracking method. Signal Image Video Process 1–13 (2013). doi:10.1007/s11760-013-0557-8
Zhang, S., Qi, Z., Zhang, D.: Ship tracking using background subtraction and inter-frame correlation. In: CISP (2009). doi:10.1109/CISP.2009.5302115
Shan, D.J., Zhang, C.: Visual tracking using IPCA and sparse representation. Signal Image Video Process 1–9 (2013). doi:10.1007/s11760-013-0525-3
Sheng, G., Yang, W., Yu, L., Sun, H.: Cluster structured sparse representation for high resolution satellite image classification. In: ICSP, Institute of Electrical and Electronics Engineers Inc., Beijing, China, pp. 693–696 (2012)
Julazadeh, A., Marsousi, M., Alirezaie, J.: Classification based on sparse representation and Euclidian distance. In: VCIP, IEEE Computer Society, San Diego, CA, United states, pp. 1–5 (2012)
Mahoor, M.H., Mu, Z., Veon, K.L., Mavadati, S.M., Cohn, J.F.: Facial action unit recognition with sparse representation. In: FG, IEEE Computer Society, Santa Barbara, CA, United States, pp. 336–342 (2011)
Wang, Z., Huang, M., Ying, Z.: The performance study of facial expression recognition via sparse representation. In: ICMLC, IEEE Computer Society, Qingdao, China, pp. 824–827 (2010)
Zhang, S., Yao, H., Sun, X., Lu, X.: Sparse coding based visual tracking: review and experimental comparison. Pattern Recognit. 46(7), 1772–1788 (2013)
Liu, B., Huang, J., Yang, L., Kulikowsk, C.: Robust tracking using local sparse appearance model and K-selection. In: CVPR, IEEE Computer Society, Colorado Springs, CO, United States, pp. 1313–1320 (2011)
Zhang, S., Yao, H., Lu, X.: Robust visual tracking using feature-based visual attention. In: ICASSP, Institute of Electrical and Electronics Engineers Inc., Dallas, TX, United States, pp. 1150–1153 (2010)
Gai, S., Yang, G., Wan, M.: Employing quaternion wavelet transform for banknote classification. Neurocomputing 118, 171–178 (2013)
Gai, S., Yang, G., Zhang, S.: Multiscale texture classification using reduced quaternion wavelet transform. AEU Int. J. Electron. Commun. 67(3), 233–241 (2013)
Zhang, K., Zhang, L., Yang, M.H.: Real-time compressive tracking. In: ECCV, Springer Verlag, Florence, Italy, pp. 864–877 (2012)
Achlioptas, D.: Database-friendly random projections: Johnson–Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66(4), 671–687 (2003)
Chang, L., Wu, J.: Achievable angles between two compressed sparse vectors under norm/distance constraints imposed by the restricted isometry property: a plane geometry approach. IEEE Trans. Inf. Theory 59(4), 2059–2081 (2013)
Li, H., Shen, C., Shi, Q.: Real-time visual tracking using compressive sensing. In: CVPR, IEEE Computer Society, Colorado Springs, CO, United States, pp. 1305–1312 (2011)
Liu, L., Fieguth, P.: Texture classification from random features. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 574–586 (2012)
Acknowledgments
The authors would like to thank the editor and reviewers for their valuable comments and suggestions that led to an improved manuscript. This work is supported by the National Science Foundation of China (NSFC 51279152).
Teng, F., Liu, Q. Multi-scale ship tracking via random projections. SIViP 8, 1069–1076 (2014). https://doi.org/10.1007/s11760-014-0629-4