Article

Deep Learning-Based Target Tracking and Classification for Low Quality Videos Using Coded Aperture Cameras

1 Applied Research LLC, Rockville, MD 20850, USA
2 Google, Inc., Mountain View, CA 94043, USA
3 Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
4 Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02138, USA
* Author to whom correspondence should be addressed.
Sensors 2019, 19(17), 3702; https://doi.org/10.3390/s19173702
Submission received: 16 July 2019 / Revised: 15 August 2019 / Accepted: 22 August 2019 / Published: 26 August 2019
(This article belongs to the Section Physical Sensors)

Abstract

Compressive sensing has seen many applications in recent years. One type of compressive sensing device is the Pixel-wise Code Exposure (PCE) camera, which has low power consumption and allows individual control of the exposure time of each pixel. To use PCE cameras in practical applications, a time-consuming and lossy reconstruction process is normally needed to recover the original frames. In this paper, we present a deep learning approach that performs target tracking and classification directly in the compressive measurement domain without any frame reconstruction. In particular, we apply You Only Look Once (YOLO) to detect and track targets in the frames, and Residual Network (ResNet) to classify them. Extensive simulations using low quality optical and mid-wave infrared (MWIR) videos in the SENSIAC database demonstrated the efficacy of our proposed approach.

1. Introduction

Compressive measurements [1] can save data storage and transmission costs. The measurements are normally collected by multiplying the original vectorized image with a Gaussian random matrix. Each measurement is a scalar, and the measurement process is repeated many times. The saving is achieved because the number of measurements is much smaller than the number of pixels in the original frame. Conventionally, the image scene must first be reconstructed before a target can be tracked from compressive measurements.
However, it is difficult, if not impossible, to carry out target tracking and classification directly on compressive measurements generated by a Gaussian random matrix, because the target location, size, and shape information in an image frame is destroyed by the random projection.
Recently, a new compressive sensing device known as the Pixel-wise Code Exposure (PCE) camera was proposed [2]. In [2], the original frames were reconstructed using L1 [3] or L0 [4,5,6] sparsity-based algorithms. It is well known that reconstructing the original frames is computationally intensive, so real-time applications may be infeasible. Moreover, information may be lost in the reconstruction process [7]. For real-time applications, it is therefore important to carry out target tracking and classification directly on the compressive measurements. Although there are some tracking papers [8] in the literature that appear to use compressive measurements, they actually still use the original video frames for tracking.
In this paper, we propose a target tracking and classification approach in the compressive measurement domain for long range, low quality optical and MWIR videos. First, YOLO [9] is used for target tracking. Training YOLO requires only image frames with known target locations, which are easy to prepare. YOLO does have a built-in classifier, but its performance is not good based on our past experience [10,11,12,13,14]. As a result, ResNet [15] is used for classification, because customized training can be done via data augmentation of the limited video frames. Although other deep learning based classifiers could be used, we chose ResNet simply because of its ability to avoid performance saturation. Our proposed approach was demonstrated using low quality videos (long range, low spatial resolution, and poor illumination) in the SENSIAC database. The tracking and classification results are reasonable up to certain ranges, and a significant improvement over conventional trackers [16,17] was observed. Moreover, conventional trackers do not work well for multiple targets [10].
Although the proposed approach was applied to shortwave infrared (SWIR) videos in an earlier paper [10], its application to SENSIAC videos is completely new. Most importantly, the video quality, in terms of spatial resolution and illumination, is much worse in the SENSIAC videos than in the SWIR videos of [10]. The SENSIAC database contains both optical and MWIR videos collected at ranges from 1000 m up to 5000 m. In some videos the camera moves, and there is air turbulence caused by desert heat; dust raised by the moving vehicles can also be seen in some optical videos. There are seven types of vehicles, which are hard to distinguish at long ranges. For MWIR, there are both daytime and nighttime videos. We demonstrate that the proposed deep learning approach is general and applicable to low quality optical and MWIR videos. Our studies also show that optical videos yield better tracking and classification performance than MWIR daytime videos, and that MWIR videos are more appropriate for nighttime operations.
It is worth briefly reviewing some state-of-the-art algorithms that perform action inference or object classification directly on compressive measurements, and highlighting the differences between our approach and theirs.
Paper [18] presents a reconstruction-free approach to action inference. The key idea is to build smashed filters using training samples that are affine transformed to a canonical viewpoint. The approach works very well even at 100:1 compression. However, the approach is for action inference (e.g., recognizing that a car is moving), not for target detection, tracking, and classification (e.g., recognizing that the moving car is a Ram, not a Jeep) in the compressed measurement domain. Moreover, the smashed filter may assume that the camera is stationary and the viewing angle is fixed. Extending the approach to target tracking and classification with moving cameras may be non-trivial.
In [19], a CNN approach was presented to perform image classification directly in compressed measurement domain. The input image is assumed to be cropped and centered, and there is only one target in each image. This is totally different from our paper in which the target can be anywhere in the image frames.
Papers [20,21] are similar in spirit to [19]. Both discuss direct object classification using compressed measurements, but both assume that the targets/objects are already centered. Moreover, they are classification studies only, without target detection and tracking, and are therefore comparable only to the ResNet portion of our approach. Again, the problem and scenarios in these papers are different from ours, because in our paper the target can be anywhere in the video frames.
Strictly speaking, the approach in [22] is not reconstruction free. The integral image is one type of reconstructed image. After the integral image is obtained, other tracking filters are then applied. There was also no discussion of object classification. Our paper does not require any image reconstruction.
Reference [23] is interesting in that a random mask is applied to conceal the actual contents of the original video; the authors call the masked video a coded aperture video. On close inspection, the coded aperture idea in [23] is very different from the PCE idea in our paper. In addition, the key idea in [23] is action recognition (similar to [18]), not object tracking and classification. Extending the idea in [23] to object tracking and classification may not be an easy task.
Reference [24] presents an object detection approach using correlation filters and sparse representation. No reconstruction of the compressive measurements is needed, and the results are quite good, but there is no object classification. One potential limitation of [24] is that the sparsity approach may be very time consuming when the dictionary size is large and hence may not be suitable for near real-time applications. Different from [24], our paper addresses object detection, tracking, and classification, and once trained, our approach can run in a near real-time fashion.
In [25], the authors present an approach that extracts features from the compressed measurements and then uses the features to create a proxy image, which is in turn used for action recognition. If our interpretation is correct, this approach may not be considered reconstruction-free, because a proxy image is constructed. Similar to [19,20,21], the approach appears to be suitable for stationary cameras with objects already centered in the images. In our approach, the camera can be non-stationary and targets can be anywhere in the image.
Paper [26] presents an online reconstruction-free approach to object classification using compressed measurements. Similar to [19,20,21,25], the approach assumes the object is already at the center of the image; it is not clear how it could be applied to an image frame in which the target location is unknown. We faced the same problem two years ago when we investigated a sparsity-based approach [7] that directly classifies objects using compressive measurements, and we still could not solve the classification problem in which the target occupies a small and random location in an image frame. The methods in [19,20,21,25,26] also do not address this issue.
This paper is organized as follows: Section 2 describes some background materials, including the PCE camera, YOLO, ResNet, the SENSIAC videos, and the performance metrics. Section 3 presents some tracking results using a conventional tracker, which clearly performs poorly when using compressive measurements directly. Section 4, Section 5 and Section 6 then present the deep learning results: Section 4 summarizes the tracking and classification results using optical videos, and Section 5 and Section 6 summarize the results for MWIR daytime and nighttime videos, respectively. Finally, we conclude the paper with some remarks for future research. To make the paper easier to read, we have moved some tracking and classification results to the Appendices.

2. Materials and Methods

2.1. PCE Imaging and Coded Aperture

Here, we briefly review the PCE or Coded Aperture (CA) camera [2]. The differences between a conventional video sensing scheme and PCE are shown in Figure 1. First, a conventional camera captures frames at a fixed rate, for example 30 or 5 frames per second. A PCE camera, in contrast, captures a compressed frame, called a motion coded image, over a fixed period of time Tv. For instance, 20 original frames can be compressed into a single motion coded frame, which is a very significant compression ratio. Second, the PCE camera allows different exposure times at different pixel locations, so a high dynamic range can be achieved. Moreover, power can be saved via a low sampling rate. One notable disadvantage of PCE is that, as shown on the right-hand side of Figure 1, an over-complete dictionary is needed to reconstruct the original frames, and this process may be very computationally intensive and may prohibit real-time applications.
The coded aperture image Y ∈ R^(M×N) is obtained by
Y(m, n) = Σ_{t=1}^{T} S(m, n, t) · X(m, n, t),
where X ∈ R^(M×N×T) contains a video scene with image size M × N and T frames, and S ∈ R^(M×N×T) is the sensing data cube that encodes the exposure time of each pixel. The value of S(m, n, t) is 1 for frames t ∈ [tstart, tend] and 0 otherwise, where [tstart, tend] denotes the start and end frame numbers of the exposure for the pixel at (m, n).
The video scene X ∈ R^(M×N×T) can be reconstructed via sparsity methods (L1 or L0). Details can be found in [2]. However, the reconstruction process is time consuming and hence not suitable for real-time applications.
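To make the measurement model concrete, the following minimal sketch (an illustration under our own assumptions, not the camera firmware or the exact simulation code used in our experiments) sums a short block of frames under a per-pixel exposure window to form a single coded frame.

```python
import numpy as np

def simulate_pce_frame(frames, t_start, exposure_len):
    """Simulate the coded aperture model: sum each pixel over its exposure window.

    frames:       array of shape (T, M, N), a block of original frames X
    t_start:      array of shape (M, N), per-pixel exposure start frame
    exposure_len: number of frames each pixel stays exposed
    Returns the coded aperture image Y of shape (M, N).
    """
    T, M, N = frames.shape
    t_idx = np.arange(T).reshape(T, 1, 1)
    # S(m, n, t) = 1 for t in [t_start, t_start + exposure_len), 0 otherwise
    S = (t_idx >= t_start) & (t_idx < t_start + exposure_len)
    return (S * frames).sum(axis=0)

# Toy usage: compress T = 5 frames of size 64 x 64 into one coded frame.
X = np.random.rand(5, 64, 64)
starts = np.random.randint(0, 5, size=(64, 64))   # random exposure start per pixel
Y = simulate_pce_frame(X, starts, exposure_len=2)
print(Y.shape)  # (64, 64)
```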
Instead of performing sparse reconstruction on PCE images, our scheme works directly on them. Utilizing raw PCE measurements has several challenges. First, moving targets may be smeared if the exposure times are long. Second, there are missing pixels in the raw measurements, because not all pixels are activated during data collection. Third, there are far fewer frames in the raw video, because a number of original frames are compressed into a single coded frame, which means that the training data will be limited.
In this paper, we focus on simulating PCE measurements and then demonstrate that detecting, tracking, and classifying moving objects is feasible. We carried out multiple experiments with three sensing models: PCE/CA Full, PCE/CA 50%, and PCE/CA 25%.
The PCE Full model (PCE Full or CA Full) is quite similar to a conventional video sensor: every pixel in the spatial scene is exposed for exactly the same duration of one second. Even this simple model produces a compression ratio of 30:1, where "30" is a design parameter. Based on our sponsor's requirements, we used 5 frames in our experiments, which already achieves a 5:1 compression.
Next, in the sensing model labeled PCE 50% or CA 50%, roughly 1.85% of the pixels are activated in each frame, with an exposure time of Te = 133.3 ms. Since 30 frames are summed into a single coded frame, summing 30 frames at 1.85% corresponds to 55.5% of all pixels having exposure in the coded frame. Because the pixels are randomly selected in each frame, some selections overlap, so activating 1.85% in each frame is roughly equivalent to 50% of the pixels being activated in the coded frame. Similarly, for the PCE 25% case, the percentage of activated pixels in each frame is halved from 1.85% to 0.92%, with the exposure duration still set to the same conventional four-frame duration. Table 1 below summarizes the comparison of the three sensing models in terms of data and power saving ratios. Details can be found in [10].
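As an illustration of how such sensing masks can be generated, the sketch below (assuming, for simplicity, independent random activations in each frame; the actual sampling scheme is a design choice) builds a random sensing cube and reports the fraction of pixels that receive any exposure in the coded frame. Because random selections overlap, this fraction is smaller than the naive sum of the per-frame activation rates.

```python
import numpy as np

def random_sensing_cube(T, M, N, activation_rate, exposure_len, seed=0):
    """Build a random sensing cube S (one plausible construction, assumed here).

    In each frame, a fraction `activation_rate` of pixels starts an exposure
    that lasts `exposure_len` frames; S(m, n, t) = 1 while the pixel is exposed.
    """
    rng = np.random.default_rng(seed)
    starts = rng.random((T, M, N)) < activation_rate  # new activations per frame
    S = np.zeros((T, M, N), dtype=bool)
    for t in range(T):
        S[t:t + exposure_len] |= starts[t]            # hold the exposure open
    return S

# Fraction of pixels that receive any exposure in the coded frame.
S = random_sensing_cube(T=30, M=256, N=256, activation_rate=0.0185, exposure_len=4)
coverage = S.any(axis=0).mean()
print(f"coded-frame coverage: {coverage:.1%}")
```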

2.2. YOLO Tracker

YOLO [9] is fast and similar to Faster R-CNN [27]. We picked YOLO rather than Faster R-CNN simply because of easier installation and compatibility with our hardware. The training of YOLO is quite simple, as only images with ground truth target locations are needed.
YOLO mainly performs object detection; tracking is achieved by detection, that is, the detected object locations from all frames are connected to form object tracks. Conventional trackers usually require a human operator to manually place a bounding box on the target in the first frame, which is not only inconvenient but may also be impractical, especially for long term tracking where tracking may need to be re-started after some frames. Compared with conventional trackers [16,17], YOLO does not require any information about the initial bounding boxes, and it can handle multiple targets simultaneously.
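As a concrete illustration of tracking by detection, the following sketch (a simplified association rule of our own, not the exact linking logic used in our experiments) chains detector outputs across consecutive coded frames by nearest bounding-box centroid.

```python
from math import hypot

def link_detections(detections_per_frame, max_jump=30.0):
    """Greedy tracking-by-detection: chain detections across frames.

    detections_per_frame: list over frames; each entry is a list of
        bounding boxes (x, y, w, h) reported by the detector (e.g., YOLO).
    max_jump: largest allowed centroid displacement (pixels) between frames.
    Returns a list of tracks, each a list of (frame_index, box) pairs.
    """
    tracks = []
    for f, boxes in enumerate(detections_per_frame):
        for box in boxes:
            cx, cy = box[0] + box[2] / 2, box[1] + box[3] / 2
            best, best_d = None, max_jump
            for track in tracks:
                last_f, last_box = track[-1]
                if last_f != f - 1:          # only extend tracks seen in the previous frame
                    continue
                lcx, lcy = last_box[0] + last_box[2] / 2, last_box[1] + last_box[3] / 2
                d = hypot(cx - lcx, cy - lcy)
                if d < best_d:
                    best, best_d = track, d
            if best is not None:
                best.append((f, box))
            else:
                tracks.append([(f, box)])    # start a new track
    return tracks
```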
YOLO also comes with a classification module. However, based on our evaluations, the classification accuracy of YOLO is not good, as can be seen in [10,11,12,13,14]. For completeness, we include a block diagram of YOLO version 1 [9] in Figure 2; the input image is resized to 448 × 448 and there are 24 convolutional layers. YOLO version 2 was used in our experiments.

2.3. ResNet Classifier

A common problem in deep CNNs is performance saturation. The ResNet-18 model is an 18-layer convolutional neural network (CNN) that avoids performance saturation when training deeper networks. The key idea of ResNet-18 is the identity shortcut connection, which skips one or more layers. Figure 3 shows the architecture of the 18-layer ResNet.
Training of ResNet requires target patches. The targets are cropped from the training videos at a particular range in the SENSIAC database, and mirror images are then created. We then perform data augmentation using scaling (larger and smaller), rotation (every 45 degrees), and illumination changes (brighter and dimmer) to create more training data. For each cropped target, 64 additional synthetic images are generated.
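The augmentation recipe can be sketched as follows (a minimal illustration; the specific scale factors and brightness gains shown here are assumptions for the example, not necessarily the exact values used in our experiments).

```python
import numpy as np
from PIL import Image

def augment_target_chip(chip):
    """Generate augmented copies of one cropped target chip (a PIL image).

    Follows the recipe in the text: mirror, rescale (larger/smaller),
    rotate in 45-degree steps, and brighten/dim the chip.
    """
    variants = []
    for img in (chip, chip.transpose(Image.FLIP_LEFT_RIGHT)):      # original + mirror
        for scale in (0.8, 1.0, 1.2):                               # smaller / same / larger
            w, h = img.size
            scaled = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
            for angle in range(0, 360, 45):                         # every 45 degrees
                rotated = scaled.rotate(angle, expand=True)
                for gain in (0.7, 1.0, 1.3):                        # dimmer / same / brighter
                    arr = np.clip(np.asarray(rotated, dtype=np.float32) * gain, 0, 255)
                    variants.append(Image.fromarray(arr.astype(np.uint8)))
    return variants
```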
The relationship between YOLO and ResNet is that YOLO determines where the targets are and places bounding boxes around them; the pixels inside the bounding boxes are then fed into ResNet-18 for classification.

2.4. Data

To fulfill our sponsor's requirements, our research objective is to perform tracking and classification of seven vehicles using the SENSIAC videos. The database contains optical and mid-wave infrared (MWIR) videos collected at distances ranging from 1000 to 5000 m in 500 m increments. The seven types of vehicles are shown in Figure 4. These videos are challenging for several reasons. First, the target sizes are small due to the long distances; this is quite different from benchmark datasets such as the MOT Challenge [28], where the range is short and the targets are large. Second, the target orientations change drastically. Third, the illumination differs from video to video. Fourth, the camera moves in some videos. Fifth, both optical and MWIR videos are present. Sixth, environmental factors such as air turbulence due to desert heat are present in some optical videos.
Although there are other benchmark videos such as the MOT Challenge database, it does not meet our sponsor's requirements: our sponsor is interested in long range, small targets (vehicles), and grayscale videos, whereas most videos in the MOT Challenge dataset contain human subjects at close distance and are in color. Moreover, our limited project funding allowed us to focus only on the most relevant dataset, so we did not explore other videos such as the MOT Challenge.
Having said the above, we would like to mention that our experiments use a total of 378 videos, covering seven vehicles, six long ranges (1000 to 3500 m in 500 m increments), three imaging modalities (optical, MWIR daytime, and MWIR nighttime), and three coded aperture modes. In short, our experiments are very comprehensive; to our knowledge, no comparable tracking and classification study of the SENSIAC dataset has been carried out in the compressed measurement domain. In this regard, our paper makes a reasonable contribution to the research community.
Here, we briefly highlight the background of the optical and MWIR videos. Figure 5 shows a few example optical and MWIR images. The two modalities have very different characteristics. Optical imagers operate at wavelengths between 0.4 and 0.8 microns, whereas MWIR imagers operate between 3 and 5 microns. Optical cameras require external illumination, whereas MWIR cameras do not, because they are sensitive to heat radiated by objects. Consequently, target shadows can affect the target detection performance in optical videos, while there are no shadows in MWIR videos. Moreover, atmospheric obscurants cause much less scattering in the MWIR band than in the optical band, so MWIR cameras are more tolerant of heat turbulence, smoke, dust, and fog.

2.5. Performance Metrics

In our earlier papers [10,11,12,13,14], we included some tracking results obtained with conventional trackers such as GMM [17] and STAPLE [16]. The tracking performance was poor when there were missing data.
Although other metrics could be used, many of them convey similar information. Hence, we believe that the following popular and commonly used metrics are sufficient for evaluating tracker performance:
  • Center Location Error (CLE): the error between the center of the detected bounding box and the center of the ground-truth bounding box.
  • Distance Precision (DP): It is the percentage of frames where the centroids of detected bounding boxes are within 20 pixels of the centroid of ground-truth bounding boxes.
  • EinGT: It is the percentage of the frames where the centroids of the detected bounding boxes are inside the ground-truth bounding boxes.
  • Number of frames with detection: This is the total number of frames that have detection.
For classification, we used confusion matrix and classification accuracy as performance metrics.
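For clarity, the tracking metrics above can be computed from the detected and ground-truth bounding boxes roughly as follows (a minimal sketch using the standard definitions and a 20-pixel threshold for DP; not the exact evaluation script used in our experiments).

```python
import numpy as np

def tracking_metrics(detections, ground_truth, dp_threshold=20.0):
    """Compute CLE, DP, and EinGT from per-frame boxes (x, y, w, h).

    detections:   dict frame_index -> detected box (frames with no detection omitted)
    ground_truth: dict frame_index -> ground-truth box
    """
    errors, in_gt = [], []
    for f, gt in ground_truth.items():
        if f not in detections:
            continue
        det = detections[f]
        dcx, dcy = det[0] + det[2] / 2, det[1] + det[3] / 2
        gcx, gcy = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
        errors.append(np.hypot(dcx - gcx, dcy - gcy))                 # center location error
        in_gt.append(gt[0] <= dcx <= gt[0] + gt[2] and
                     gt[1] <= dcy <= gt[1] + gt[3])                   # centroid inside GT box
    errors = np.array(errors)
    return {
        "CLE": errors.mean() if errors.size else float("nan"),
        "DP@20": (errors < dp_threshold).mean() if errors.size else 0.0,
        "EinGT": float(np.mean(in_gt)) if in_gt else 0.0,
        "frames_with_detection": len(errors),
    }
```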

3. Conventional Tracking Results

We first present some tracking results for optical videos at a range of 1000 m using a conventional tracker known as STAPLE [16]. The compressive measurements were generated using the PCE principle, with every five frames compressed into one coded frame. STAPLE requires the target location to be known in the first frame; after that, it learns the target model online and tracks the target. However, in two of the three cases (PCE 50% and PCE 25%), as shown in Figure 6, Figure 7 and Figure 8, STAPLE was not able to track any targets in subsequent frames. This shows the difficulty of target tracking using PCE cameras. Moreover, in our earlier study on SWIR videos [10], we already compared conventional trackers with deep learning based trackers and observed that conventional trackers do not work well in the compressive measurement domain. We would like to mention that this comparison is somewhat unfair to the authors of [16], because STAPLE was not designed to handle videos in the compressed measurement domain. Based on the above observations, the studies in Section 4, Section 5 and Section 6 therefore focus only on the deep learning results.

4. Tracking and Classification Results Using SENSIAC Optical Videos

This study focuses on tracking and classification using a combination of YOLO and ResNet for coded aperture cameras. The compressive measurements were simulated using the PCE camera principle. There are three cases: PCE full refers to compressing 5 frames into 1 with no missing pixels; PCE 50 also compresses 5 frames into 1, but only 50% of the pixels are activated, each for 4/30 s; PCE 25 is similar to PCE 50, except that only 25% of the pixels are activated for 4/30 s.

4.1. Tracking

We used the 1500 and 3000 m videos to train two separate YOLO models. The 1500 m model was used for the 1000 to 2000 m ranges, and the 3000 m model for the 2500 to 3500 m ranges. Longer range videos (4000 to 5000 m) were not used because the targets are too small.
Table 2 and the two tables in Appendix A show the tracking results for PCE full, PCE 50, and PCE 25, respectively. The trend is that as the compression increases, the performance drops accordingly. Table 2 summarizes the PCE full case, where the tracking performance is good up to 3000 m. For the PCE 50 case (first table in Appendix A), the tracking is only good up to 2000 m, and we observe some poor tracking results for individual vehicles (e.g., BRDM2 at 2000 m). For the PCE 25 case (second table in Appendix A), the tracking is only reasonable up to 1500 m, and there are some poor detection results even at the 1000 and 1500 m ranges. These observations are corroborated by the snapshots in Figure 9 and the two figures in Appendix A, where some targets do not have bounding boxes around them in the high compression cases. Dust raised by the moving vehicles is also visible and can seriously affect the tracking and classification performance. In Figure 9 (PCE full case), most of the sampled frames in the 2500 and 3500 m videos do not have any detections. We did not include 1500 and 3000 m snapshots because those videos were used for training. In the first figure (PCE 50) in Appendix A, the detection performance deteriorates further, as most of the sampled frames have no detections. The tracking results in the second figure (PCE 25) in Appendix A are not good even at the 1000 m range; the selected video contains the SUV, which has only an 11% detection rate at 1000 m.
This study alone makes clear how difficult it is to perform target tracking directly on compressive measurements of the SENSIAC videos. Challenges mean opportunities, and we hope researchers will continue along this path.

4.2. Classification Results

Here, we applied ResNet for classification. Two models were trained: one used the 1500 m videos for training, with the 1000 and 2000 m videos for testing; the other used the 3000 m videos for training, with the 2500 and 3500 m videos for testing. It should be noted that classification is performed only when there are good detection results from the YOLO tracker; for frames in the PCE 50 and PCE 25 cases without any positive detections, we do not generate classification results.
Table 3 and the two tables in Appendix B show the classification results using ResNet for the PCE full, PCE 50, and PCE 25 cases. In each table, the left side contains the confusion matrix and the last column the classification accuracy. From Table 3 (PCE full), the accuracy is reasonably good up to the 1500 m range; at 2000 m, the accuracy fluctuates considerably among the different vehicles, and beyond 2500 m it is low. From the first table (PCE 50) in Appendix B, the accuracy is only good at 1500 m, the range used for training; the other ranges are not good. Similarly, the results in the second table (PCE 25) in Appendix B are all poor. This study clearly shows that it is difficult to obtain good classification results for the SENSIAC optical videos, in which the targets are small. More research is needed.
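The accuracy column in each classification table corresponds to the fraction of detected frames of a given vehicle that received the correct label, i.e., the diagonal entry of the confusion matrix divided by its row sum. A small sketch of that standard computation (with a hypothetical three-class confusion matrix for illustration):

```python
import numpy as np

def per_class_accuracy(confusion):
    """Per-class accuracy from a confusion matrix (rows = true class, cols = predicted)."""
    confusion = np.asarray(confusion, dtype=float)
    row_sums = confusion.sum(axis=1)
    # Avoid division by zero for classes with no classified samples.
    return np.where(row_sums > 0, np.diag(confusion) / np.maximum(row_sums, 1), 0.0)

# Hypothetical 3-class example.
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 3, 7]]
print(per_class_accuracy(cm))  # -> [0.8, 0.6, 0.7]
```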

4.3. Summary (Optical)

We collected statistics from Table 2, Table 3, and the tables in Appendix A and Appendix B, and summarize the averages in Table 4. For optical videos, the tracking and classification performance is good up to 2000 m in the PCE full case. For PCE 50, the tracking is still reasonable, but the classification is not good. For PCE 25, even the tracking is not very good at the 1000 m range, and the classification is worse still. More research is needed to obtain better performance.

5. Tracking and Classification Using MWIR Daytime Videos

The SENSIAC database contains MWIR daytime and nighttime videos. Here, we focus on daytime videos.

5.1. Tracking

Similar to the optical case, we trained two models. One used 1500 m videos and the other used 3000 m videos. For the 1500 m model, videos from 1000 and 2000 m videos were used for testing; for the 3000 m model, videos from 2500 and 3500 m were used for testing. Table 5 and two additional tables in Appendix C show the tracking results for PCE full, PCE 50, and PCE 25, respectively.
From Table 5 (PCE full), the tracking results for 1000 to 2500 m are reasonable, with some vehicles scoring better than others. From the PCE 50 table in Appendix C, the performance deteriorates drastically; even at the 1500 and 3000 m ranges, the results are not good. From the PCE 25 table in Appendix C, the performance is even worse. This is confirmed by the snapshots in Figure 10 and the two additional figures in Appendix C, where some targets do not have bounding boxes around them in the high compression cases. Overall, the tracking performance on MWIR daytime videos is generally worse than on optical videos.

5.2. Classification (MWIR Daytime)

Similar to the optical case, we trained two ResNet classifiers: one for the 1500 m range and another for the 3000 m range. For the 1500 and 3000 m models, videos from 1000 and 2000 m, and from 2500 and 3500 m, were used for testing, respectively. Classification is performed only when there is a detection in a frame. The results are summarized in Table 6 and two additional tables in Appendix D. In each table, the left side contains the confusion matrix and the last column the classification accuracy. From Table 6 (PCE full), the accuracy is decent but not great. For the PCE 50 and PCE 25 cases, the performance drops quite significantly, as can be seen from the tables in Appendix D.
Comparing the optical results in Section 4 with the results here, one can observe that the optical results are better than the MWIR daytime results.

5.3. Summary (MWIR Daytime)

It is important to emphasize that we are tackling a challenging problem: target tracking and classification in long range, low quality videos. The SENSIAC videos are difficult to track and classify even in the uncompressed case. Table 7 condenses the results from Table 5 and Table 6 and the additional tables in Appendix C and Appendix D. For daytime videos from the MWIR imager, the tracking performance is only good for PCE full and up to 2000 m. For classification, the results are poor in general, even in the PCE full case. A simple comparison with the optical results in Table 4 shows that MWIR is not recommended for daytime tracking and classification.

6. MWIR Nighttime Videos

This section focuses on MWIR nighttime videos.

6.1. Tracking

We built two models using videos from 1500 m and 3000 m. For the 1500 m model, videos from 1000 and 2000 m were used for testing; for the 3000 m model, videos from 2500 and 3500 m were used for testing. Table 8 and two additional tables in Appendix E show the tracking results for PCE full, PCE 50, and PCE 25, respectively. For the PCE full case, the results in Table 8 show that the tracking is quite good. For the PCE 50 and PCE 25 cases, the results in the tables in Appendix E drop quite significantly. The trend is again that as the compression ratio increases, the performance drops accordingly. In the long range cases (Table 8 and the tables in Appendix E), one can observe some 0% detection and no-detection (ND) entries. This is understandable because MWIR imagers rely on radiation from the target; at long ranges, the signal-to-noise ratio (SNR) is very low and the target signals are very weak. This is confirmed by the snapshots in Figure 11 and the two additional figures in Appendix E, where some targets do not have bounding boxes around them in the high compression cases.

6.2. Classification

Classification is performed only when there is a detection in a frame. Two classifiers were built: one for 1500 m and one for 3000 m. For the PCE full case (Table 9), the classification performance is good for ranges up to 2000 m; for longer ranges, the performance drops. For the PCE 50 and PCE 25 results shown in the tables in Appendix F, the longer ranges (≥2500 m) are very poor. As mentioned earlier, MWIR imagers rely on radiation from the targets, which becomes very weak at long ranges. Consequently, the overall tracking and classification results at those ranges are not good.

6.3. Summary (MWIR Nighttime)

Table 10 summarizes the averaged classification accuracy for the various cases presented earlier. It can be seen that if one is interested in highly accurate classification, the range has to be less than 2000 m and the PCE full mode needs to be adopted. Moreover, comparing the MWIR daytime and nighttime results, the nighttime results are better. Hence, MWIR is recommended for nighttime tracking and classification.

7. Conclusions

In this paper, we presented a deep learning based approach to target tracking and classification directly using PCE measurements. No time consuming reconstruction step is needed, so real-time target tracking and classification becomes possible for practical applications. The proposed approach combines two deep learning schemes: YOLO for tracking and ResNet for classification. Compared with state-of-the-art methods, which either assume the objects are cropped and centered or are applicable only to action inference rather than object classification, our approach is suitable for target tracking and classification applications where limited training data are available. Extensive experiments using 378 optical and MWIR (daytime and nighttime) videos in the SENSIAC database, with different ranges, illumination, and environmental conditions, clearly demonstrated the performance of the proposed approach. Moreover, it was observed that optical imagers are more suitable for daytime operations and MWIR imagers are more appropriate for nighttime operations.
It should be emphasized that the SENSIAC database is very challenging for target tracking and classification, even when using the original measurements. There are some videos collected beyond 3500 m that we have not even touched in our paper. More research is needed for the research community to address such challenging scenarios.

Author Contributions

Conceptualization, C.K., A.R., T.T., J.Z., and R.E.-C.; software, B.C. and J.Y.; data curation, C.K.; writing—original draft preparation, C.K.; funding acquisition, C.K. and T.T.

Funding

This research was funded by the US Air Force under contract FA8651-17-C-0017. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Tracking Results for Optical Case: PCE 50 and PCE 25

Table A1. Tracking metrics for PCE 50 (optical video). ND means "no detection".
1000 m (left) / 2500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 39.25 | 0% | 75% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 19.36 | 66% | 74% || 0.00 | ND | 0% | 0%
BTR70 | 1.00 | 29.46 | 0% | 83% || 0.00 | ND | 0% | 0%
SUV | 1.00 | 24.95 | 6% | 78% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 61.78 | 0% | 90% || 0.00 | ND | 0% | 0%
Truck | 1.00 | 24.95 | 11% | 79% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 33.57 | 0% | 75% || 0.00 | ND | 0% | 0%
1500 m (left) / 3000 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 27.77 | 0% | 100% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 15.00 | 97% | 96% || 1.00 | 5.93 | 100% | 2%
BTR70 | 1.00 | 21.95 | 21% | 100% || 1.00 | 7.27 | 100% | 6%
SUV | 1.00 | 18.83 | 71% | 100% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 46.71 | 0% | 100% || 1.00 | 14.24 | 100% | 1%
Truck | 1.00 | 19.82 | 50% | 100% || 1.00 | 5.03 | 100% | 1%
ZSU23-4 | 1.00 | 24.89 | 1% | 100% || 1.00 | 7.84 | 100% | 4%
2000 m (left) / 3500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 19.85 | 54% | 95% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 10.92 | 100% | 5% || 0.00 | ND | 0% | 0%
BTR70 | 1.00 | 16.31 | 98% | 93% || 0.00 | ND | 0% | 0%
SUV | 1.00 | 13.13 | 100% | 63% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 31.84 | 0% | 93% || 0.00 | ND | 0% | 0%
Truck | 1.00 | 14.38 | 100% | 61% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 19.00 | 71% | 90% || 0.00 | ND | 0% | 0%
Figure A1. Tracking results for frames 1, 63, 126, 189, 252, and 315 for the PCE 50 (optical video) case. The vehicle is SUV. Most of the captured frames do not have detection. Dusts can be seen in some frames and have serious impacts on the tracking and classification performance. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.
Table A2. Tracking metrics for PCE 25 (optical video). ND means "no detection".
1000 m (left) / 2500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 0.99 | 42.03 | 0% | 75% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 20.74 | 46% | 32% || 0.00 | 60.80 | 0% | 2%
BTR70 | 1.00 | 30.86 | 0% | 95% || 0.00 | 70.78 | 0% | 1%
SUV | 1.00 | 23.91 | 17% | 11% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 62.76 | 0% | 89% || 0.00 | ND | 0% | 0%
Truck | 0.99 | 27.12 | 5% | 39% || 0.00 | ND | 0% | 0%
ZSU23-4 | 0.99 | 36.10 | 0% | 73% || 0.00 | 80.06 | 0% | 1%
1500 m (left) / 3000 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 28.66 | 0% | 75% || 0.67 | 23.34 | 67% | 1%
BRDM2 | 1.00 | 16.18 | 100% | 32% || 0.33 | 48.26 | 33% | 1%
BTR70 | 1.00 | 24.27 | 7% | 95% || 0.97 | 9.98 | 98% | 28%
SUV | 0.00 | ND | 0% | 0% || 0.00 | 79.13 | 0% | 0%
T72 | 1.00 | 50.04 | 0% | 89% || 0.88 | 21.18 | 75% | 2%
Truck | 0.00 | ND | 0% | 0% || 0.87 | 15.27 | 87% | 8%
ZSU23-4 | 1.00 | 26.68 | 6% | 73% || 0.74 | 16.27 | 84% | 5%
2000 m (left) / 3500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 0.00 | ND | 0% | 36% || 0.00 | ND | 0% | 0%
BRDM2 | 0.00 | ND | 0% | 1% || 0.00 | 49.80 | 19% | 4%
BTR70 | 0.00 | ND | 0% | 38% || 0.00 | 37.16 | 25% | 1%
SUV | 0.00 | ND | 0% | 0% || 0.00 | 64.68 | 0% | 2%
T72 | 0.00 | ND | 0% | 68% || 0.00 | 28.92 | 33% | 1%
Truck | 0.00 | ND | 0% | 0% || 0.00 | 64.58 | 0% | 2%
ZSU23-4 | 1.00 | 19.70 | 100% | 47% || 0.00 | ND | 0% | 0%
Figure A2. Tracking results for frames 1, 63, 126, 189, 252, and 315 for the PCE 25 (optical video) case. The vehicle is SUV. No detections are observed in the sampled frames. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.

Appendix B. Classification Results for Optical Case: PCE 50 and PCE 25

Table A3. Classification results for PCE 50 (optical video) case. Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2230713010513408%BMP200000000%
BRDM20391905142014%BRDM200000000%
BTR7000244262120184%BTR7000000000%
SUV0001200173041%SUV00000000%
T720073925135375%T7200000000%
Truck0385213220175%Truck00000000%
ZSU23-400042162120372%ZSU23-400000000%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2520994715751414%BMP200001000%
BRDM2027368225173168%BRDM200000900%
BTR70003441280192%BTR70000020000%
SUV0001261246034%SUV00000000%
T72003663265087%T720000200100%
Truck04255452841076%Truck00004000%
ZSU23-400261125721557%ZSU23-4000015000%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP260105127486902%BMP200000000%
BRDM2003121300%BRDM200000000%
BTR700029312836084%BTR7000000000%
SUV005237200010%SUV00000000%
T72001213079127023%T7200000000%
Truck1123542156068%Truck00000000%
ZSU23-40087471069331%ZSU23-400000000%
Table A4. Classification results for PCE 25 (optical video) case. Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP280471518111173%BMP200000000%
BRDM20013087810%BRDM200009000%
BTR7001179191469149%BTR7000011000%
SUV00020219049%SUV00000000%
T7200121627329282%T7200000000%
Truck012321991262%Truck00001000%
ZSU23-4041231130217427%ZSU23-400003000%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP22046978011%BMP200003000%
BRDM200003100%BRDM200012000%
BTR70001121280079%BTR7001100911210%
SUV00000000%SUV00001000%
T72001692262089%T720000800100%
Truck00000000%Truck102028000%
ZSU23-40015310015632%ZSU23-4000118000%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP200000000%BMP200000000%
BRDM200000000%BRDM2000016000%
BTR7000000000%BTR70001020125%
SUV00000000%SUV00009000%
T7200000000%T720000300100%
Truck00000000%Truck00205000%
ZSU23-400010000%ZSU23-400001000%

Appendix C. Tracking Results for MWIR Daytime Case: PCE 50 and PCE 25

Table A5. Tracking metrics for PCE 50 (MWIR daytime) case. 1500 m and 3000 m were used for training. ND means "no detection".
1000 m (left) / 2500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 0.00 | ND | 0% | 0% || 1.00 | 10.30 | 100% | 40%
BRDM2 | 1.00 | 26.45 | 1% | 22% || 0.00 | 57.55 | 0% | 20%
BTR70 | 1.00 | 26.29 | 9% | 13% || 0.00 | 64.52 | 0% | 12%
SUV | 1.00 | 17.08 | 78% | 6% || 0.15 | 52.81 | 15% | 4%
T72 | 1.00 | 34.35 | 0% | 44% || 0.00 | 90.20 | 0% | 0%
Truck | 1.00 | 26.79 | 33% | 1% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 27.80 | 2% | 37% || 0.67 | 32.38 | 67% | 2%
1500 m (left) / 3000 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 22.96 | 13% | 35% || 0.97 | 9.11 | 99% | 28%
BRDM2 | 1.00 | 19.59 | 55% | 84% || 0.92 | 9.02 | 95% | 55%
BTR70 | 1.00 | 18.08 | 79% | 71% || 0.95 | 8.38 | 96% | 74%
SUV | 0.95 | 13.30 | 97% | 18% || 0.77 | 14.09 | 87% | 20%
T72 | 1.00 | 25.45 | 2% | 92% || 0.99 | 8.46 | 100% | 75%
Truck | 1.00 | 16.80 | 81% | 28% || 0.85 | 13.79 | 87% | 36%
ZSU23-4 | 1.00 | 29.09 | 39% | 94% || 0.93 | 9.39 | 95% | 75%
2000 m (left) / 3500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 14.96 | 97% | 10% || 0.00 | 41.93 | 0% | 1%
BRDM2 | 1.00 | 13.34 | 92% | 7% || 0.00 | 65.25 | 6% | 5%
BTR70 | 0.00 | ND | 0% | 0% || 0.00 | 47.56 | 8% | 11%
SUV | 0.00 | ND | 0% | 0% || 0.00 | 40.59 | 12% | 9%
T72 | 1.00 | 18.18 | 84% | 16% || 0.00 | 6.23 | 100% | 19%
Truck | 0.00 | ND | 0% | 0% || 0.00 | 45.27 | 38% | 4%
ZSU23-4 | 1.00 | 13.09 | 100% | 8% || 0.00 | 56.06 | 13% | 26%
Figure A3. Tracking results for frames 1, 60, 119, 178, 237, and 296 for the PCE 50 (MWIR daytime) case. The vehicle is SUV. No detection is observed in the sampled frames. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.
Table A6. Tracking metrics for PCE 25 (MWIR daytime) case. 1500 m and 3000 m were used for training. ND means "no detection".
1000 m (left) / 2500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 0.94 | 33.26 | 0% | 4% || 0.00 | 65.38 | 0% | 3%
BRDM2 | 0.99 | 29.04 | 5% | 33% || 0.03 | 79.44 | 5% | 11%
BTR70 | 0.99 | 25.98 | 37% | 23% || 0.02 | 76.91 | 6% | 34%
SUV | 0.97 | 18.41 | 71% | 16% || 0.01 | 57.69 | 7% | 41%
T72 | 1.00 | 35.50 | 0% | 63% || 0.17 | 70.40 | 18% | 58%
Truck | 0.93 | 40.05 | 27% | 4% || 0.16 | 56.61 | 25% | 18%
ZSU23-4 | 0.99 | 30.54 | 1% | 50% || 0.24 | 49.33 | 25% | 36%
1500 m (left) / 3000 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 0.99 | 23.28 | 25% | 51% || 1.00 | 9.26 | 100% | 1%
BRDM2 | 1.00 | 20.59 | 47% | 80% || 0.37 | 37.58 | 53% | 14%
BTR70 | 0.98 | 21.89 | 61% | 69% || 0.52 | 34.13 | 61% | 17%
SUV | 0.90 | 17.83 | 87% | 18% || 0.23 | 39.47 | 45% | 9%
T72 | 0.98 | 28.99 | 1% | 92% || 0.95 | 9.62 | 100% | 16%
Truck | 0.96 | 22.78 | 59% | 23% || 0.19 | 34.05 | 50% | 10%
ZSU23-4 | 1.00 | 21.65 | 34% | 96% || 0.55 | 26.26 | 65% | 18%
2000 m (left) / 3500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 0.95 | 17.22 | 88% | 11% || 0.00 | 38.44 | 33% | 1%
BRDM2 | 1.00 | 16.89 | 85% | 4% || 0.00 | 61.67 | 17% | 10%
BTR70 | 0.25 | 40.64 | 25% | 1% || 0.00 | 53.36 | 14% | 14%
SUV | 0.00 | ND | 0% | 0% || 0.00 | 57.48 | 7% | 11%
T72 | 0.91 | 31.45 | 64% | 6% || 0.00 | 96.71 | 0% | 1%
Truck | 1.00 | 13.08 | 100% | 1% || 0.00 | 53.72 | 3% | 10%
ZSU23-4 | 0.94 | 19.68 | 94% | 4% || 0.00 | 50.44 | 19% | 22%
Figure A4. Tracking results for frames 1, 60, 119, 178, 237, and 296 for the PCE 25 (MWIR daytime) case. The vehicle is SUV. No detections can be seen in the sampled frames. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.

Appendix D. Classification Results for MWIR Daytime Case: PCE 50 and PCE 25

Table A7. Classification results for PCE 50 (MWIR daytime) case. Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP200000000%BMP200001000%
BRDM22663205088%BRDM20654200092%
BTR705823911049%BTR700336220014%
SUV1001921083%SUV07013118%
T726152182510016%T72021011020%
Truck0000030100%Truck00000000%
ZSU23-4238182641814%ZSU23-404002000%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP24345139713%BMP2421110121211142%
BRDM2026302134088%BRDM24113112508858%
BTR701122654757025%BTR70513225134710329%
SUV011036313057%SUV89017238624%
T724190249930030%T72193313616114759%
Truck0803089089%Truck32021860151012%
ZSU23-401760741173210%ZSU23-4216681429213111%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2100472223%BMP201000010%
BRDM20700611128%BRDM220158100%
BTR7000000000%BTR70511030033%
SUV00000000%SUV112124053%
T720414434077%T720410600388%
Truck00000000%Truck200013010%
ZSU23-40400118615%ZSU23-413293737155%
Table A8. Classification results for PCE 25 (MWIR daytime) case. Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP203234400%BMP211114129%
BRDM268917116074%BRDM20804195221%
BTR70618391533046%BTR706511010301158%
SUV1704146069%SUV5637183513712%
T7258479215140122%T72752141310312750%
Truck1110012080%Truck6181825426%
ZSU23-4247215210374%ZSU23-417456251175%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2221662511841%BMP2100100050%
BRDM211334146130046%BRDM28916162718%
BTR7017025187125310%BTR706163424455%
SUV020011328117%SUV581211316%
T7261357207983125%T722400510189%
Truck0801369283%Truck542218236%
ZSU23-41115415919482%ZSU23-44172925446%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2010182910%BMP202010000%
BRDM2020056015%BRDM2433320029%
BTR7000020200%BTR70310241110%
SUV00000000%SUV641319087%
T720202135059%T7200020000%
Truck000041020%Truck741416216%
ZSU23-401024900%ZSU23-4101931028734%

Appendix E. Tracking Results for MWIR Nighttime Case: PCE 50 and PCE 25

Table A9. Tracking metrics for PCE 50 (MWIR nighttime). ND means no detection.
1000 m (left) / 2500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 29.11 | 4% | 45% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 24.23 | 27% | 87% || 0.00 | ND | 0% | 0%
BTR70 | 1.00 | 15.42 | 85% | 86% || 0.00 | ND | 0% | 0%
SUV | 1.00 | 13.93 | 100% | 96% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 31.04 | 0% | 82% || 0.00 | ND | 0% | 0%
Truck | 1.00 | 26.22 | 0% | 56% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 25.39 | 6% | 86% || 0.00 | ND | 0% | 0%
1500 m (left) / 3000 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 19.67 | 53% | 95% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 19.89 | 52% | 85% || 0.00 | ND | 0% | 0%
BTR70 | 1.00 | 12.17 | 100% | 94% || 0.00 | ND | 0% | 0%
SUV | 1.00 | 10.18 | 100% | 93% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 23.98 | 3% | 96% || 0.00 | ND | 0% | 0%
Truck | 1.00 | 18.95 | 73% | 95% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 20.24 | 53% | 95% || 0.00 | ND | 0% | 0%
2000 m (left) / 3500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 15.77 | 91% | 3% || 0.00 | ND | 0% | 0%
BRDM2 | 0.99 | 14.38 | 97% | 75% || 0.00 | ND | 0% | 0%
BTR70 | 1.00 | 7.76 | 100% | 48% || 0.00 | ND | 0% | 0%
SUV | 1.00 | 6.56 | 100% | 86% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 16.62 | 97% | 78% || 0.00 | ND | 0% | 0%
Truck | 1.00 | 14.40 | 95% | 22% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 13.67 | 100% | 63% || 0.00 | ND | 0% | 0%
Figure A5. Tracking results for frames 1, 60, 119, 178, 237, and 296 for the PCE 50 (MWIR nighttime) case. The vehicle is SUV. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.
Table A10. Tracking metrics for PCE 25 (MWIR nighttime). ND means no detection.
1000 m (left) / 2500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 27.93 | 0% | 72% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 25.71 | 4% | 100% || 0.00 | ND | 0% | 0%
BTR70 | 1.00 | 15.87 | 83% | 100% || 0.00 | ND | 0% | 0%
SUV | 1.00 | 13.58 | 100% | 100% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 31.81 | 0% | 99% || 0.00 | ND | 0% | 0%
Truck | 1.00 | 26.32 | 0% | 100% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 25.95 | 0% | 100% || 0.00 | ND | 0% | 0%
1500 m (left) / 3000 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 19.26 | 60% | 100% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 18.83 | 73% | 100% || 0.00 | ND | 0% | 0%
BTR70 | 1.00 | 11.41 | 100% | 100% || 0.00 | ND | 0% | 0%
SUV | 1.00 | 10.08 | 100% | 100% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 23.10 | 10% | 100% || 0.00 | ND | 0% | 0%
Truck | 1.00 | 19.05 | 69% | 99% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 18.97 | 75% | 100% || 0.00 | ND | 0% | 0%
2000 m (left) / 3500 m (right)
Vehicle | EinGT | CLE | DP@20 px | % Det. || EinGT | CLE | DP@20 px | % Det.
BMP2 | 1.00 | 13.77 | 99% | 23% || 0.00 | ND | 0% | 0%
BRDM2 | 1.00 | 13.65 | 99% | 96% || 0.00 | ND | 0% | 0%
BTR70 | 1.00 | 7.75 | 100% | 68% || 0.00 | ND | 0% | 0%
SUV | 1.00 | 7.22 | 100% | 58% || 0.00 | ND | 0% | 0%
T72 | 1.00 | 16.67 | 97% | 95% || 0.00 | ND | 0% | 0%
Truck | 1.00 | 15.11 | 94% | 23% || 0.00 | ND | 0% | 0%
ZSU23-4 | 1.00 | 13.88 | 99% | 79% || 0.00 | ND | 0% | 0%
Figure A6. Tracking results for frames 1, 60, 119, 178, 237, and 296 for the PCE 25 (MWIR nighttime) case. The vehicle is SUV. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.

Appendix F. Classification Results for MWIR Nighttime Case: PCE 50 and PCE 25

Table A11. Classification results for PCE 50 (MWIR nighttime) case. Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP21102901190169%BMP200000000%
BRDM202560121103380%BRDM200000000%
BTR701611512803611441%BTR7000000000%
SUV07902411015070%SUV00000000%
T723461222821278%T7200000000%
Truck051008141071%Truck00000000%
ZSU23-43582254019061%ZSU23-400000000%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP213685018752640%BMP200000000%
BRDM203010030099%BRDM200000000%
BTR7001162040180060%BTR7000000000%
SUV0660187969256%SUV00000000%
T72216003243194%T7200000000%
Truck21004162171150%Truck00000000%
ZSU23-40330069723368%ZSU23-400000000%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP207004000%BMP200000000%
BRDM213197013022574%BRDM200000000%
BTR70064550464432%BTR7000000000%
SUV05625598482119%SUV00000000%
T72222025313791%T7200000000%
Truck019902029336%Truck00000000%
ZSU23-40383365358337%ZSU23-400000000%
Table A12. Classification results for PCE 25 (MWIR nighttime) case. Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP21981400450276%BMP200000000%
BRDM203570020099%BRDM200000000%
BTR7013196950530226%BTR7000000000%
SUV112601353661038%SUV00000000%
T72257002886481%T7200000000%
Truck1891060207158%Truck00000000%
ZSU23-41661038225170%ZSU23-400000000%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2591703099151216%BMP200000000%
BRDM2034400141096%BRDM200000000%
BTR7001811170561433%BTR7000000000%
SUV08827084112319%SUV00000000%
T720160032913192%T7200000000%
Truck110571125115132%Truck00000000%
ZSU23-404640124817649%ZSU23-400000000%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2134700286016%BMP200000000%
BRDM2027921592081%BRDM200000000%
BTR700100490885120%BTR7000000000%
SUV0411012984906%SUV00000000%
T722160228038382%T7200000000%
Truck013015020024%Truck00000000%
ZSU23-4063113114326021%ZSU23-400000000%

References

  1. Candes, E.J.; Wakin, M.B. An introduction to compressive sampling. IEEE Signal Process. Mag. 2008, 25, 21–30.
  2. Zhang, J.; Xiong, T.; Tran, T.; Chin, S.; Etienne-Cummings, R. Compact all-CMOS spatio-temporal compressive sensing video camera with pixel-wise coded exposure. Opt. Express 2016, 24, 9013–9024.
  3. Yang, J.; Zhang, Y. Alternating direction algorithms for l1-problems in compressive sensing. SIAM J. Sci. Comput. 2011, 33, 250–278.
  4. Tropp, J.A. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inf. Theory 2004, 50, 2231–2242.
  5. Dao, M.; Kwan, C.; Koperski, K.; Marchisio, G. A joint sparsity approach to tunnel activity monitoring using high resolution satellite images. In Proceedings of the IEEE Ubiquitous Computing, Electronics & Mobile Communication Conference, New York, NY, USA, 19–21 October 2017; pp. 322–328.
  6. Zhou, J.; Ayhan, B.; Kwan, C.; Tran, T. ATR performance improvement using images with corrupted or missing pixels. In Pattern Recognition and Tracking XXIX; SPIE: Bellingham, WA, USA, 2018; Volume 106490, p. 106490E.
  7. Applied Research LLC. Phase 1 Final Report; Applied Research LLC: Rockville, MD, USA, 2016.
  8. Yang, M.H.; Zhang, K.; Zhang, L. Real-time compressive tracking. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2012.
  9. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. Available online: https://arxiv.org/abs/1804.02767 (accessed on 8 April 2018).
  10. Kwan, C.; Chou, B.; Yang, J.; Rangamani, A.; Tran, T.; Zhang, J.; Etienne-Cummings, R. Target tracking and classification directly using compressive sensing camera for SWIR videos. J. Signal Image Video Process. 2019, 6, 1–9.
  11. Kwan, C.; Chou, B.; Yang, J.; Rangamani, A.; Tran, T.; Zhang, J.; Etienne-Cummings, R. Target tracking and classification using compressive measurements of MWIR and LWIR coded aperture cameras. J. Signal Inf. Process. 2019, 10, 73–95.
  12. Kwan, C.; Chou, B.; Yang, J.; Tran, T. Compressive object tracking and classification using deep learning for infrared videos. In Pattern Recognition and Tracking XXX (Conference SI120); International Society for Optics and Photonics: Lansdale, PA, USA, 2019.
  13. Kwan, C.; Chou, B.; Yang, J.; Tran, T. Target tracking and classification directly in compressive measurement domain for low quality videos. In Pattern Recognition and Tracking XXX (Conference SI120); International Society for Optics and Photonics: Lansdale, PA, USA, 2019.
  14. Kwan, C.; Chou, B.; Echavarren, A.; Budavari, B.; Li, J.; Tran, T. Compressive vehicle tracking using deep learning. In Proceedings of the IEEE Ubiquitous Computing, Electronics & Mobile Communication Conference, New York, NY, USA, 8–10 November 2018; pp. 51–56.
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  16. Bertinetto, L.; Valmadre, J.; Golodetz, S.; Miksik, O.; Torr, P.H. Staple: Complementary learners for real-time tracking. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1401–1409.
  17. Stauffer, C.; Grimson, W.E.L. Adaptive background mixture models for real-time tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1999; Volume 2, pp. 2246–2252.
  18. Kulkarni, K.; Turaga, P.K. Reconstruction-free action inference from compressive imagers. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 772–784.
  19. Lohit, S.; Kulkarni, K.; Turaga, P.K. Direct inference on compressive measurements using convolutional neural networks. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 1913–1917.
  20. Adler, A.; Elad, M.; Zibulevsky, M. Compressed Learning: A Deep Neural Network Approach. arXiv 2016, arXiv:1610.09615.
  21. Xu, Y.; Kelly, K.F. Compressed Domain Image Classification Using a Multi-Rate Neural Network. arXiv 2019, arXiv:1901.09983.
  22. Kulkarni, K.; Turaga, P.K. Fast Integral Image Estimation at 1% Measurement Rate. arXiv 2016, arXiv:1601.07258.
  23. Wang, Z.W.; Vineet, V.; Pittaluga, F.; Sinha, S.N.; Cossairt, O.; Kang, S.B. Privacy-preserving action recognition using coded aperture videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 16–20 June 2019.
  24. Vargas, H.; Fonseca, Y.; Arguello, H. Object detection on compressive measurements using correlation filters and sparse representation. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 1960–1964.
  25. Değerli, A.; Aslan, S.; Yamac, M.; Sankur, B.; Gabbouj, M. Compressively sensed image recognition. In Proceedings of the 7th European Workshop on Visual Information Processing (EUVIP), Tampere, Finland, 26–28 November 2018; pp. 1–6.
  26. Latorre-Carmona, P.; Traver, V.J.; Sánchez, J.S.; Tajahuerce, E. Online reconstruction-free single-pixel image classification. Image Vis. Comput. 2019, 86, 28–37.
  27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2015.
  28. MOT Challenge. Available online: https://motchallenge.net/ (accessed on 23 August 2019).
Figure 1. Conventional camera vs. Pixel-wise Coded Exposure (PCE) Compressed Image/Video Sensor [2].
Figure 2. 24 convolutional layers followed by 2 fully connected layers for YOLO version 1 [9].
Figure 3. Architecture of ResNet-18. Figure from [15].
Figure 4. Seven targets in SENSIAC: (a) Truck; (b) SUV; (c) BTR70; (d) BRDM2; (e) BMP2; (f) T72; and (g) ZSU23-4.
Figure 5. Frames from optical and MWIR videos. Although the videos were collected from roughly the same range, the vehicle sizes and characteristics are somewhat different, making the tracking and classification very difficult. Three scenarios are shown: (a) Optical at 1000 m; (b) MWIR daytime at 1000 m; and (c) MWIR nighttime at 1000 m.
Figure 6. STAPLE tracking results for the PCE full case. Frames 10, 30, 50, 70, 90, and 110 are shown here.
Figure 7. STAPLE tracking results for the PCE 50% case. Frames 10, 30, 50, 70, 90, and 110 are shown here. The green boxes are not on targets.
Figure 8. STAPLE tracking results for the PCE 25% case. Frames 10, 30, 50, 70, 90, and 110 are shown here. Many frames do not have detections. The bounding boxes completely miss the targets.
Figure 9. Tracking results for frames 1, 63, 126, 189, 252, and 315 in the PCE full (optical videos) case. The vehicle is SUV. Coded aperture compresses every five frames into one. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.
Figure 10. Tracking results for frames 1, 60, 119, 178, 237, and 296 for the PCE full (MWIR daytime) case. The vehicle is SUV. Only some frames have detections. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.
Figure 11. Tracking results for frames 1, 60, 119, 178, 237, and 296 for the PCE full (MWIR nighttime) case. The vehicle is SUV. (a) 1000 m; (b) 2000 m; (c) 2500 m; and (d) 3500 m.
Table 1. Comparison in Data Compression Ratio and Power Saving Ratio between Three Sensing Models. Here, 30 frames are condensed to 1 coded frame.
Savings | PCE Full/CA Full | PCE 50%/CA 50% | PCE 25%/CA 25%
Data Saving Ratio | 30:1 | 60:1 | 120:1
Power Saving Ratio | 1:1 | 15:1 | 30:1
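To make the data saving figures in Table 1 concrete, the short sketch below (our own illustration, not code from the paper) computes the ratio as the number of raw frames condensed into one coded frame divided by the fraction of pixels that are active; with 30 frames per coded frame this reproduces the 30:1, 60:1, and 120:1 entries. The power saving ratios depend on the exposure-coding details of the sensor and are not derived here.

```python
# Minimal sketch (not the authors' code): data saving ratio of a PCE / coded
# aperture sensor that condenses `frames_per_coded_frame` raw frames into one
# coded frame while only `active_pixel_fraction` of the pixels are coded.
def data_saving_ratio(frames_per_coded_frame: int, active_pixel_fraction: float) -> float:
    # Conventional camera: frames_per_coded_frame full frames of pixels.
    # PCE camera: one coded frame keeping only the active fraction of pixels.
    return frames_per_coded_frame / active_pixel_fraction


for label, fraction in [("PCE Full", 1.0), ("PCE 50%", 0.5), ("PCE 25%", 0.25)]:
    print(f"{label}: {data_saving_ratio(30, fraction):.0f}:1")
# Prints 30:1, 60:1, and 120:1, matching the Data Saving Ratio row of Table 1.
```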
Table 2. Tracking metrics for PCE full (optical videos).
1000 m2500 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP2139.960%100BMP2114.27100%1%
BRDM2123.5414%98BRDM217.96100%44%
BTR70131.060%100BTR70111.32100%40%
SUV127.250%100SUV19.58100%46%
T72163.860%100T72122.462%44%
Truck126.361%99Truck19.92100%8%
ZSU23-4137.290%99 ZSU23-4113.07100%82%
1500 m3000 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP2128.080%100%BMP218.39100%77%
BRDM2115.79100%100%BRDM214.81100%100%
BTR70121.9411%100%BTR7017.04100%100%
SUV120.1647%100%SUV16.05100%73%
T72146.960%100%T72115.25100%86%
Truck120.5936%100%Truck16.3100%100%
ZSU23-4126.930%100%ZSU23-417.98100%100%
2000 m3500 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP2118.8685%100%BMP2000%0%
BRDM219.59100%100%BRDM20.417.96100%23%
BTR70115.23100%100%BTR7014.74100%20%
SUV112.69100%100%SUV0.582.51100%11%
T72131.410%100%T7218.76100%5%
Truck113.03100%100%Truck0.873.98100%14%
ZSU23-4119.1571%100%ZSU23-40.964.4100%30%
Table 3. Classification results for PCE full (optical) case. Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP236610210498%BMP200003000%
BRDM20210041152057%BRDM2056000107134%
BTR70003730010100%BTR70240861603258%
SUV0001890185051%SUV0023784818446%
T7201503831011083%T7200001603098%
Truck1060380315085%Truck1040195017%
ZSU23-4000100035997%ZSU23-40000100020567%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2363001100097%BMP2001391149000%
BRDM20234076064063%BRDM21411313172209630%
BTR70003740000100%BTR704702600580970%
SUV0002010173054%SUV1026418000%
T7204103690099%T720084721119066%
Truck310000361097%Truck380470153131535%
ZSU23-4000000374100%ZSU23-45027826806618%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2355001702095%BMP200000000%
BRDM204091700155011%BRDM220428162805%
BTR70203551502095%BTR700070060092%
SUV102970258827%SUV402712603%
T7242611948962124%T72002099045%
Truck1756421285178%Truck20180231058%
ZSU23-401411422828777%ZSU23-40050102400%
Table 4. Averaged tracking and classification performances for the various optical video cases. 1500 m and 3000 m videos were used for training.
Range | PCE Full (Avg. % Frames with Detections / Avg. Accuracy) | PCE 50 (Avg. % Frames with Detections / Avg. Accuracy) | PCE 25 (Avg. % Frames with Detections / Avg. Accuracy)
1000 m | 99% / 82% | 79% / 52% | 59% / 39%
1500 m | 100% / 87% | 99% / 53% | 59% / 39%
2000 m | 99% / 58% | 71% / 27% | 27% / 29%
2500 m | 38% / 46% | 0% / 0% | 1% / 0%
3000 m | 91% / 31% | 2% / 16% | 6% / 16%
3500 m | 15% / 29% | 0% / 0% | 2% / 18%
Table 5. Tracking metrics for PCE full (MWIR daytime) case. 1500 m and 3000 m were used for training.
1000 m2500 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP21.0029.460%52%BMP21.0010.01100%82%
BRDM21.0024.864%94%BRDM21.009.73100%75%
BTR701.0024.6513%69%BTR700.8035.1980%35%
SUV1.0018.5469%81%SUV0.9910.4799%22%
T721.0034.300%53%T720.9913.0499%60%
Truck1.0023.615%58%Truck0.9910.3899%31%
ZSU23-41.0029.410%42%ZSU23-41.0010.51100%65%
1500 m3000 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP21.0022.3210%99%BMP21.007.20100%100%
BRDM21.0018.8782%99%BRDM21.006.35100%100%
BTR701.0017.9495%99%BTR701.005.93100%100%
SUV1.0013.89100%90%SUV1.004.60100%100%
T721.0024.860%97%T721.007.88100%100%
Truck1.0017.4290%95%Truck1.005.48100%100%
ZSU23-41.0020.7732%99%ZSU23-41.006.63100%100%
2000 m 3500 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP21.0015.32100%77%BMP21.004.41100%33%
BRDM21.0012.63100%64%BRDM20.172.25100%52%
BTR701.0011.12100%86%BTR700.974.5199%31%
SUV1.009.53100%31%SUV0.971.83100%33%
T721.0016.8899%64%T720.864.91100%70%
Truck1.0011.87100%30%Truck1.003.48100%11%
ZSU23-41.0013.45100%93%ZSU23-41.003.36100%36%
Table 6. Classification results for PCE Full (MWIR daytime) case. Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP212315270152366%BMP228326120096%
BRDM2932247438066%BRDM21194085031272%
BTR70726158406064%BTR70336561655445%
SUV2910244314084%SUV01951319111216%
T72106901723038%T7287019116831832%
Truck540010154074%Truck48034328247%
ZSU23-43640950283322%ZSU23-404150122195%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP28226772361623%BMP234410115796%
BRDM2533020020092%BRDM21268057234675%
BTR704602151785161%BTR700110104393876129%
SUV140247168276%SUV1824421942242861%
T72112350617918051%T7211906721649760%
Truck31401292921086%Truck20591011660751921%
ZSU23-4576001513013037%ZSU23-4179132532290%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP287031020149931%BMP210105363185%
BRDM239810022761335%BRDM220761511731741%
BTR7056511702509738%BTR70216127102531%
SUV20140866127%SUV270642751255%
T7241900177131777%T7224275516717766%
Truck00003078172%Truck6502115113%
ZSU23-400006032998%ZSU23-4119471338464%
Table 7. Average detection and classification performance of different MWIR daytime cases. 1500 m and 3000 m were used for training.
Range | PCE Full (Avg. % Frames with Detections / Avg. Accuracy) | PCE 50 (Avg. % Frames with Detections / Avg. Accuracy) | PCE 25 (Avg. % Frames with Detections / Avg. Accuracy)
1000 m | 64% / 59% | 18% / 49% | 28% / 42%
1500 m | 97% / 61% | 60% / 43% | 61% / 42%
2000 m | 64% / 51% | 6% / 18% | 4% / 26%
2500 m | 53% / 52% | 6% / 19% | 29% / 16%
3000 m | 100% / 62% | 52% / 31% | 12% / 26%
3500 m | 38% / 46% | 11% / 14% | 10% / 4%
Table 8. Tracking metrics for PCE full (MWIR nighttime).
1000 m2500 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP21.0029.101%52%BMP20.990.9912.0999%96%
BRDM21.0025.718%77%BRDM21.001.009.41100%100%
BTR701.0017.8074%90%BTR701.001.005.43100%100%
SUV1.0014.31100%99%SUV1.001.005.00100%100%
T721.0034.430%65%T721.001.0010.77100%100%
Truck1.0026.192%79%Truck1.001.009.37100%90%
ZSU23-41.0027.960%80%ZSU23-41.001.009.93100%90%
1500 m3000 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP21.0019.9250%100%BMP21.006.81100%100%
BRDM21.0019.6363%94%BRDM21.006.62100%99%
BTR701.0011.86100%97%BTR701.003.86100%100%
SUV1.0010.52100%97%SUV1.004.10100%91%
T721.0023.911%100%T721.007.43100%100%
Truck1.0019.3271%97%Truck1.007.33100%99%
ZSU23-41.0019.5762%87%ZSU23-41.006.91100%100%
2000 m3500 m
VehiclesEinGTCLEDP@20 Pixels% DetectionsVehiclesEinGTCLEDP@20 Pixels% Detections
BMP21.0013.85100%66%BMP21.003.19100%11%
BRDM21.0014.00100%65%BRDM20.983.4199%54%
BTR701.007.91100%55%BTR700.942.29100%60%
SUV1.006.47100%84%SUV0.962.19100%51%
T721.0016.7098%93%T720.934.91100%86%
Truck1.0012.38100%25%Truck0.7412.9693%64%
ZSU23-41.0013.42100%78%ZSU23-40.994.40100%96%
Table 9. Classification results for PCE full case (MWIR nighttime). Left shows the confusion matrix and the last column shows the classification accuracy.
1000 m2500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP218400010099%BMP216235008166047%
BRDM202740003099%BRDM25288012539180%
BTR7024212340639072%BTR700401305281391636%
SUV00034709097%SUV2762632750473%
T7200112245296%T721210430531585%
Truck30017271096%Truck46109728221368%
ZSU23-401020128399%ZSU23-42214211927485%
1500 m3000 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy BMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2341102033095%BMP22025410931856%
BRDM2032900100097%BRDM29271272633876%
BTR70013470000100%BTR700312339717865%
SUV010277559580%SUV30031660097%
T7230003523098%T7240178424934769%
Truck314126312190%Truck21201425302285%
ZSU23-403113230197%ZSU23-4710017233292%
2000 m3500 m
VehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4AccuracyVehiclesBMP2BRDM2BTR70SUVT72TruckZSU23-4Accuracy
BMP2131000205815%BMP21780049243%
BRDM2020700234088%BRDM21146212415575%
BTR700121760100089%BTR7021279101377636%
SUV1062081372069%SUV19011191942065%
T72280002969189%T723557525142202646%
Truck02101569178%Truck72714438114049%
ZSU23-43574653116559%ZSU23-4255334129686%
Table 10. Averaged classification results at PCE full, PCE 50, and PCE 25 for the MWIR nighttime videos. 1500 m and 3000 m were used for training.
Range | PCE Full (Avg. % Frames with Detections / Avg. Accuracy) | PCE 50 (Avg. % Frames with Detections / Avg. Accuracy) | PCE 25 (Avg. % Frames with Detections / Avg. Accuracy)
1000 m | 77% / 94% | 77% / 67% | 96% / 64%
1500 m | 96% / 94% | 93% / 67% | 100% / 64%
2000 m | 66% / 68% | 54% / 41% | 63% / 48%
2500 m | 96% / 68% | 0% / 0% | 0% / 0%
3000 m | 98% / 77% | 0% / 0% | 0% / 0%
3500 m | 64% / 49% | 0% / 0% | 0% / 0%
