Lowlight object recognition by deep learning with passive three-dimensional integral imaging in visible and long wave infrared wavelengths

Pranav Wani; Kashif Usmani; Gokul Krishnan; Timothy O’Connor; Bahram Javidi

doi:10.1364/OE.443657

1. Introduction

Object detection using conventional imaging and deep neural networks has been widely researched [1–5]. In conventional imaging, a high performance object detection can be achieved using deep neural networks. However, adverse conditions such as low photon count or occlusion deteriorate the performance of these systems. Photons are plentiful in long wave infrared (LWIR) spectrum, thus these domains have been used in low illumination conditions. Thus, object classification in low light conditions has been investigated by imaging in the LWIR spectrum [6–8]. However, LWIR cameras are typically costlier, have much lower spatial resolution, and require bulkier optics compared to their visible range counterparts. Imaging in the visible spectrum can be an effective technique for object detection, however, the preeminence of conventional camera noise in photon starved environments degrades the classification performance. In any imaging domain, adverse conditions such as occlusion degrade the object classification capabilities.

Three-dimensional (3D) integral imaging (InIm) is a prominent 3D technique that works by capturing both angular and intensity information of the 3D scene [8–11]. We refer here to the passive 3D InIm, as opposed to the active 3D imaging which falls outside the scope of this paper. 3D imaging is also contrasted with the 2D imaging which refers to the conventional 2D imaging obtained by camera and does not provide depth information. In low light, the 3D InIm reconstructed images have a higher signal-to-noise ratio (SNR) compared to conventional 2D imaging, since the 3D InIm is optimal in the maximum likelihood sense for read-noise dominant images [9,12]. 3D InIm [13–23] is performed by recording multiple 2D elemental images of a scene from diversity of perspectives. These can be captured either through a single image sensor with lenslet array, camera array, or by a single camera mounted on a translation stage. The 2D elemental images are then integrated to obtain 3D information of the scene either through optical or computational reconstruction methods. The computational reconstruction is accomplished by backpropagation of optical rays through a virtual pinhole. The reconstruction depth can be set to any value within the depth of field of the captured elemental images.

In this paper, our aim is to compare the object classification capabilities between visible and LWIR imaging systems in adverse environmental conditions such as occlusion and low illumination. This comparison considers both conventional 2D and 3D InIm in both the visible and LWIR spectra. In our experiments, both cold and hot objects are considered in the scene. A CMOS image sensor is used for implementing the passive 3D InIm system in the visible spectrum. We have used the ‘You Look Only Once version 2 (YOLOv2)’ neural network [24] for object classification. We compare the performance of the neural network for these systems using miss-rate, F1 score, and precision score. Various denoising models are considered and compared.

Our experiments in low illumination and occlusion conditions indicate that for the experiments performed the performance of 3D InIm detection model, on average, is better compared to the traditional 2D imaging systems in both visible and LWIR spectrum. Most importantly, visible domain 3D integral imaging may outperform LWIR based 2D and 3D imaging systems. The LWIR based detectors performed very poorly when used to detect cold objects whereas 3D InIm in visible spectrum performed well. CMOS image sensors are ubiquitous with their performance rapidly increasing and their cost rapidly decreasing. The reported experimental results may indicate an alternative approach to LWIR systems for imaging in low light conditions.

2. Methodology

2.1 Integral imaging

Integral imaging (InIm) is a passive 3D imaging approach which integrates the diverse perspectives of captured 2D elemental images to obtain information of the light field generated by the scene. This can be practically implemented either through a single image sensor with a lenslet array, a camera array, or by a single camera mounted on a moving platform [13–23]. The computational reconstruction can then be accomplished through backpropagation of optical rays through a virtual pinhole. The reconstruction depth can be set to any value within the depth of field of the captured elemental images. InIm uses parallax to record both angular and intensity information, which helps mitigate the effects of partial occlusion during 3D reconstruction of the scene. The 3D reconstructed images also have better signal-to-noise ratio compared to a single 2D image of the scene due to 3D InIm being optimal in the maximum likelihood sense for read-noise dominant images [9,12].

In our experiments, a synthetic aperture integral imaging [25] system consisting of a single camera mounted on a translation stage is used for the 3D InIm. The pickup stage of synthetic aperture InIm is illustrated in Fig. 1(a). Once the 2D elemental images of a scene are captured with varying perspectives, a 3D scene can be computationally reconstructed as illustrated in Fig. 1(b). The computational reconstruction is accomplished by back propagating the captured elemental images through a virtual pinhole array to the desired depth. 3D scene can be reconstructed at any depth that falls within the depth of field of the captured elemental images.

Fig. 1. (a) Synthetic aperture integral imaging setup pickup stage. (b) Reconstruction phase of synthetic aperture integral imaging.

Download Full Size | PDF

In the 3D reconstruction process, reconstructed 3D image intensity I_z(x, y) at the desired depth is computed as [25]:

(1)$${I_z} = \frac{1}{{O(x,y)}}\sum\limits_{m = 0}^{M - 1} {} \sum\limits_{n = 0}^{N - 1} {\left[ {{I_{mn}}\left( {x - \frac{{m \times {L_x} \times {p_x}}}{{{c_x} \times \frac{z}{f}}},y - \frac{{n \times {L_y} \times {p_y}}}{{{c_y} \times \frac{z}{f}}}} \right) + \varepsilon } \right]}$$

Where (x, y) are the pixel indices, O(x, y) is the number of overlapping pixels on (x, y). I_mn is a 2D elemental image, with (m, n) representing its index, and (M, N) indicating the total number of elemental images in the horizontal and vertical directions, respectively. Resolution of the camera is represented by (L_x, L_y). (c_x, c_y), and (p_x, p_y) represent the sensor size and the pitch between adjacent camera positions, respectively, in both horizontal and vertical directions. $z/f$ is the magnification of the camera, with f representing its focal length and z the reconstruction depth. ε is the additive camera noise. When reconstructed at the true depth of the object, the variation of the rays is minimum assuming that the rays coming from the object have approximately similar intensities. [26].

2.2 Classification model

For object detection, we use the ‘You Look Only Once Version 2 (YOLOv2)’ neural network model [24,27]. This model simultaneously locates and classifies the objects of interest within the scene. Its architecture is inspired from the GoogleNet [28] and has 24 convolutional layers with 2 fully connected layers. The input image in this architecture is divided into a grid of cells (of size 7 in our case) and each grid cell predicts several bounding boxes (2 for our case). Each of these bounding boxes have their corresponding confidence scores and class probabilities affixed to them. Each bounding box has the following prediction parameters associated with it: [x,y,w,h,c]. (x, y), and (w, h) represents the central location of the object within the box and the corresponding width and height of the bounding box, respectively. Here, c measures the confidence score for the object.

YOLOv2 is an improvement over the original ‘You Look Only Once version 1’ (YOLOv1) with notable changes such as batch normalization, high resolution classification, etc. [24]. YOLOv2 also utilizes the concept of anchor boxes which enables it to detect multiple objects centered at one grid cell.

2.3 Low illumination image enhancement

In low light illumination conditions, the number of photons reflected from the object scene are very low for the visible spectrum. Thus, images captured in the visible spectrum in low illumination conditions are dominated by camera read noise [11]. We define the visible images to be read noise dominant when the contribution of read noise is greater than the contribution of photon noise [29]. Other sources of noise can be neglected. Following the equation N² = R² + kS [29], this occurs when the detected photons/pixel equals the square of the sensor read noise [30]. Here N is the total noise, R is the read noise, S is the signal, and k is a proportionality constant. N, R, and S are measured in raw levels or ADUs (analog-to-digital units). For our specific visible spectrum sCMOS sensor, which has a read noise specification of 1.6-2 electrons rms per pixel [31], this cutoff occurs at approximately 4-6 photons per pixel. Since all low light visible spectrum data was collected between 1.5 to 3.5 photons/pixel, all low light data in our experiments can be considered as read noise dominant. Uniform values of photons per pixel are maintained by varying the exposure time of the sensor in accordance with the extant environmental conditions.

This necessitates the use of mathematical model for image restoration to reduce the noise and enhance the quality of image in low light conditions. First we subtract the camera offset from the captured images, wherein the camera offset can be computed by averaging a large number of single camera bias reference frames [11]. The camera offset corresponds to the sensor electrons and is useful to prevent the clipping of small signals into a zero/negative digitized intensity, due to noise [11]. In photon starved conditions, the electron counts due to the scene are much lower compared to the camera offset [11]. This gray scale image can be converted to electrons using a conversion factor CF (0.46 for sCMOS visible sensor used in our experiments). The number of photons for each pixel is then calculated by dividing the number of converted electrons with the sensor’s quantum efficiency: NoP = I * CF/QE, where NoP is the estimated number of photons, and QE = 0.7 is the quantum efficiency provided in the camera’s manual. I is the input image without camera offset. This metric can now be used to quantify the illumination levels of the recorded low light visible domain images.

Inverted images in photon starved environments have been shown to have characteristics akin to that of a hazy image [32]. Therefore, in low illumination conditions, image enhancement can be achieved by inverting the image, applying haze removal, then reverting the image. In our experiments, noise reduction is accomplished using the dark channel prior based methods [33] on the inverted images. These models have been widely used for image restoration and enhancement [33], especially in low illumination image restoration tasks [32,34]. The dark channel prior-based haze removal was derived using statistics of natural scenic images. It was observed that in most sky-free regions, most pixels have a very low intensity in at least some of the color channels (R/G/B). These minimum intensity pixels serve as priors and aid in determining an estimate of the transmission function caused by the haze. Captured photon-starved visual spectrum images in our experiments, after the subtraction of camera offset, are inverted and then dehazed using a monochromatic version of the aforementioned method. Instead of performing the optimization over all the channels, the minimization of the global cost function is performed over a single channel for the monochromatic images used in our experiments. The hazy image can be modelled as [33]:

(2)$$\textbf{R}(\xi ) = \textbf{J}(\xi )t(\xi ) + \textbf{A}(1 - t(\xi))$$

where $\xi $ is the vector representation of the pixel location (x, y) and R(ξ) is the noisy image, which in our case, is the inverted low illumination image. J(ξ) is the inverted ideal image to be recovered, A is the atmospheric light, and t(ξ) is the portion of light reaching the camera without scattering. A and t(ξ) can be estimated using the monochromatic version of the dark channel prior. This allows us to estimate J(ξ) and after inversion we get the recovered image.

The enhanced images at this stage are still contaminated with noise. In photon starved conditions, the captured images are read noise dominated which is Gaussian distributed [11]. These images are thus further enhanced using independent component analysis (ICA) based denoising algorithm [35–37]. ICA based denoising has been successfully used for imaging purposes in the literature [38]. To demonstrate the effectiveness of the pipeline containing dark channel modelling followed by ICA based denoising, we consider the signal-to-noise ratio (SNR) of a sample 2D photon-starved visible spectrum image. The SNR is defined as $SNR = ({{\mu_s} - {\mu_b}} )/\sqrt {\sigma _s^2 - \sigma _b^2} $, where the µ_s and σ_s are the mean and standard deviation of the signal, and µ_b and σ_b are the mean and standard deviation of the background [10]. For simplicity, in our case, we defined signal as a 30×30 window of the image with the highest mean, and the background is defined as a 30×30 window of the image with the lowest mean. For fair comparison, both the signal and the background windows are held constant for a particular image as it is enhanced using various pre-processing methods. Figure 2 shows three low illumination visible spectrum images along with their corresponding enhanced counterparts. The enhancement is accomplished using dark channel prior dehazing, ICA denoising, and both dark channel prior based dehazing and ICA denoising combined.

Fig. 2. 2D Imaging results. (a) 2D Visible spectrum low-light image with average photons per pixel of 23.29. (b) low light image after dark channel based dehazing. (c) low light image after ICA based denoising, and (d) low light image after both dark channel based dehazing and ICA based denoising. (e) 2D Visible spectrum low-light image with average photons per pixel of 3.52. (f) low light image after dark channel based dehazing. (g) low light image after ICA based denoising, and (h) low light image after both dark channel based dehazing and ICA based denoising. (i) 2D Visible spectrum low-light image with average photons per pixel of 2.40. (j) low light image after dark channel based dehazing. (k) low light image after ICA based denoising, and (l) low light image after both dark channel based dehazing and ICA based denoising.

Download Full Size | PDF

Table 1 enumerates the SNR values for low-light visible spectrum images using dark channel prior based dehazing, ICA based denoising, and both these methods in combination. As can be seen from Table 1, the dark channel prior with ICA denoising approach helps us to improve the SNR as compared with the original low illumination image. All photon starved visible spectrum images are thus enhanced using dark channel prior-based dehazing with the ICA-based denoising algorithm.

Table 1. SNR^a comparison of visible spectrum low-light 2D images.

View Table | View all tables in this article

3. Experimental methodology

3.1 Experimental setup

A large database of 3D images could be helpful to generalize our experimental results. However, to the best of our knowledge 3D integral imaging databases are not available publicly in the visible or LWIR spectrum. As such, a laboratory experimental setup has been used to generate the required training and testing images. Each scene was recorded using 25 (5 Vertical×5 Horizontal) elemental images. The horizontal and vertical camera pitch were both set to 40 mm. The scenes were created approximately 5 m from the SAII setup and were recorded using a visible sCMOS sensor (Hamamatsu C11440-42U) and a long wave infrared (LWIR) camera (Tamarisk 320 60 Hz) for visible and LWIR sensing, respectively. Table 2 summarizes the different camera parameters used in our experiments.

Table 2. Integral imaging and camera variables.

View Table | View all tables in this article

The dataset for the neural network training was recorded in high illumination conditions. Each scene was simultaneously recorded using both the visible and LWIR cameras. This ensures a reduced bias during the training procedure. In total, 65 scenes were recorded as the training data set.

3.2 Training data augmentation

To make the neural network more robust towards various degradations, augmentation has been done on the 2D training data set. Augmented 2D images were created by horizontally flipping the original 2D images and by artificially adding Gaussian noise to the high illumination 2D images. Artificially added Gaussian noise helps to approximately replicate the low illumination real-world conditions for visible imaging, and reduces the image contrast in LWIR images providing a similar effect to reducing the temperature of the thermal object(s).

To approximately simulate lowlight conditions for our training data, we establish a method for quantifying low illumination conditions. In photon-starved environments, the captured images in the visible spectrum are read noise dominated, which follows a Gaussian distribution [11]. Assuming the high illumination natural scene to be non-Gaussian, the level of additive Gaussian noise may be quantified using a measure of non-Gaussianity. This in turn helps us to quantify the illumination levels of the photon starved scenes. Several techniques exist to measure the non-Gaussianity, the most common among which are kurtosis, negentropy, and mutual information [39]. We use kurtosis to quantify the additive Gaussian noise, where a kurtosis of 3 signifies an ideal Gaussian noise or alternatively a zero illumination read noise dominant image. For our experimental setup, we found low illumination images had a kurtosis value ranging from 2.5 to 2.8. Therefore, we use this range to determine the appropriate amount of additive noise needed to augment the high illumination training data set to approximately simulate real low light data. For each 2D image in the training set, a value is randomly chosen from this range and zero-mean Gaussian noise is added to the image wherein the variance of the Gaussian noise is chosen such that the kurtosis will equal the chosen value. This methodology provides us the final set of augmented training data set for the visible spectrum.

In contrast to the visible spectrum, the recorded low illumination 2D images in the LWIR spectrum are not photon-starved. Augmentation of the LWIR training data set is accomplished by adding Gaussian noise to vary the contrast of the hot objects with their surroundings. Contrast can be defined as the ratio of the mean of hot objects to the mean of the surroundings.

Each visible and LWIR spectrum 2D image is mirror imaged and then both the original and flipped versions of the image are used to create two additional images at different simulated noise levels. After augmentation, total 390 images were generated from the original 65 images in both the visible and the LWIR spectrum and are used as training data set. Sample augmented images in both visible and LWIR spectrum are shown in Fig. 3 below.

Fig. 3. 2D Imaging samples from experimental set up. (a) Sample high illumination visible spectrum training image. (b) and (c) are sample augmented images derived from (a). (d) Sample LWIR spectrum training image. (e) and (f) are sample augmented images derived from (d).

Download Full Size | PDF

3.3 Object detector

The YOLOv2 neural network used for the object detection task was trained using Adam optimizer with mini batch size of 16 and initial learning rate of 0.001. The total number of epochs was set to 40, and the learning rate was reduced by a factor of 10 after every 10 epochs. This detector accepts images of the size 224×224×3. Since LWIR camera had a wider field of view, the visible images were zero padded such that the target objects were represented by approximately the same number of pixels in both visible and LWIR images while being fed into the neural network.

In our experiments, we choose to incorporate five different types of objects: a thermal dummy, a short coffee pot, a tall coffee pot, lab beaker, and a clothing iron. First and last object had the capacity to self-warm, while the remaining three objects could be heated by pouring hot water into them. Three detectors were trained using the training data set. The first detector was trained on the visible spectrum images where all the five types of objects were clearly visible. The second detector was trained in the LWIR spectrum with the hot objects labeled for training. The final detector was also trained in the LWIR spectrum, however, with both hot and cold variants of the objects labeled for training. This allowed us to study whether the LWIR based object detection could be improved by labelling both hot and cold objects simultaneously. Testing was conducted on both hot and cold variants of the objects. Sample high illumination training images are shown in the Fig. 4 below.

Fig. 4. (a) Sample high illumination training image in the visible spectrum. (b) Corresponding high illumination training image in the LWIR spectrum.

Download Full Size | PDF

3.4 Test data

Test data was collected in low-illumination conditions. Seventy different scenes were recorded simultaneously in the visible and LWIR spectrum. These scenes were recorded using the same experimental setup as described in section 3.1. Each scene was recorded using 5×5 2D elemental images. The average photons per pixel for the visible spectrum images ranges from 1.5 to 3.5. LWIR images are not classified as photon-starved owing to an abundance of thermal photons in the LWIR spectrum. Each photon-starved visible spectrum 2D elemental image was enhanced using the dark channel based dehazing and ICA based denoising as described in previous sections. The central perspective elemental images were used for comparing to conventional 2D imaging. Thus, 70 images were recorded to test object detection performance in the 2D domain. For the 3D experiment, each scene was reconstructed at various depths pertaining to the objects present in the scene. After reconstructing each image at the correct focus plane for all objects in the scene, 243 images were available for object detection performance evaluation in the 3D domain. Although we selected the appropriate focus planes manually, several methods based on InIm exist to compute the correct depth of the objects [40].

Sample test images have been shown in the Fig. 5 below. Visible spectrum images shown in Fig. 5 have been enhanced using the dehazing and ICA pre-processing pipeline as described previously. Figure 6 shows a flow chart summarizing the steps involved in our experimental methodology.

Fig. 5. (a) Sample low illumination 2D visible test image (3.3 Photons per pixel). (b) 2D image after enhancement using the dehazing and ICA pre-processing pipeline. (c) Sample low illumination 2D LWIR test image corresponding to the scene in (a). (d) Sample low illumination 3D visible test image corresponding to the scene in (a) reconstructed at the plane of the tall coffee pot (3.68 m). 2D elemental images used for its reconstruction were enhanced using the pre-processing pipeline as described in previous sections. (e) Sample low illumination 3D LWIR test image corresponding to the scene in (a). (f), (g), and (h) are additional sample 2D images after denoising enhancement in the visible domain.

Download Full Size | PDF

Fig. 6. Flow chart summarizing the steps involved in the experimental methodology.

Download Full Size | PDF

4. Results and discussion

As three separate detectors are considered in this work, we denote the detector trained on the visible spectrum images as ‘Detector-Visible’, the LWIR detector trained only on hot objects is denoted as ‘Detector-LWIR-type I’, and the LWIR detector trained with both hot and cold versions of the objects is denoted as ‘Detector-LWIR-type II’. Network detection threshold was set to 0.5 and the intersection over union ratio (IOU) threshold for object detection was also set to 0.5.

The prediction is considered correct if the intersection over union ratio (IOU) for the predicted bounding box and the ground truth exceeds a predetermined threshold, set to 0.5 for our experiments. IOU is used to measure the overlap ratio between the detected and the ground truth bounding box. Its value varies from 0 (no overlap) to 1 (total overlap). We quantify the object detection performance using metrics such as precision, recall, F1 score, and miss rate. Precision and recall are defined as Precision = TP / (TP + FP), and Recall = TP / (TP + FN). Here, TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The miss rate quantifies the rate of true objects that skip detection. It is defined as Miss Rate = FN / (TP + FN). The F1 score is a higher order metric that combines precision and recall values to provide a more intelligible denotation to the performance of the network. It is defined as F1 = 2 * Precision * Recall / (Precision + Recall). The value of $F1$ score ranges from 0 to 1 with larger values being construed as that corresponding to a better performing network.

The average results for the five objects have been tabulated in Table 3 below. The scenes contained both hot and cold versions of all the objects. For each object, the ratio of number of occurrences of its hot and cold versions in the scenes was approximately 55:45.

Table 3. Average performance scores for object detection on all objects using YOLOv2 network.

View Table | View all tables in this article

It must be noted that the results for the LWIR spectrum are different for hot and cold objects. As such, the average results for LWIR spectrum mentioned in Table 3 are dependent on the ratio of number of occurrences of hot and cold versions of the objects present in the scenes. Table 4 and 5 below provides the results of the LWIR detectors separately for hot and cold objects. Table 4 considers only the hot objects in the LWIR spectrum, and Table 5 considers only the cold objects in the LWIR spectrum.

Table 4. Average performance scores for object detection on hot objects using YOLOv2 network.

View Table | View all tables in this article

Table 5. Average performance scores for object detection on cold objects using YOLOv2 network.^a

View Table | View all tables in this article

The above tables show that the LWIR based detectors perform very poorly when used to detect cold objects. This should be expected since cold objects provide little to no information in the LWIR domain. Their utility is most pronounced in the detection of hot objects. Varying the detector threshold allows us to plot the average miss rate with the number of false positives per image. The average plot for all the objects has been shown in the Fig. 7(a) below. For more clarity, Fig. 7(b) shows the plot for the best LWIR detector (type II) compared between all, only cold, and only hot objects.

Fig. 7. (a) Average miss rate versus the number of false positives per image for all detection modalities (visible and LWIR) in both 2D and 3D. The plot is displayed on a logarithmic scale. This plot represents the average over all the objects in the scenes. (b) Average miss rate vs. false positives per image plot for the best LWIR detector (type II) compared between all, only cold, and only hot objects in both 2D and 3D.

Download Full Size | PDF

From Table 3, the average F1 scores for 2D imaging system for all detectors is less than that of their corresponding 3D counterparts. It can thus be observed that, on average, 3D InIm system perform better than traditional 2D imaging system. This can also be observed by considering the average miss rate for all the detectors, which is lower for 3D InIm systems compared to that of their 2D counterparts. For type I and II LWIR detectors, the precision for 3D InIm drops below that of the 2D imaging system. This is in contrast to the visible detector, where the precision for 3D InIm significantly exceeds that of the 2D imaging system. We speculate that this may be attributed to the spatial features in both domains. Objects typically have numerous features in the visible domain, while maintaining a smooth profile in the LWIR domain. In visible domain 3D reconstruction brings forth the numerous features concealed behind the occlusion, thus boosting the detector precision. In contrast, 3D reconstruction in the LWIR domain offers minimal boost to the number of available features, while introducing the blurring effects of the occluding objects. Its combined effect may be responsible for the slight drop in the precision of the LWIR detectors. However, it should be noted that for LWIR 3D InIm the recall scores (0.2494, and 0.5784) are much higher than that of their 2D counterparts (0.1434, and 0.2113), thus suggesting that many more objects are getting detected albeit with low average precision. Thus although the precision scores for LWIR 3D are smaller compared to that of LWIR 2D, the F1 scores are higher and miss rates are significantly lower for 3D InIm system.

The average F1 score across 2D and 3D for type II detector (0.3322) is higher than that of type I detector (0.2454). Similarly, the average miss rate across 2D and 3D for type II detector (0.6051) is significantly lower compared to type I detector (0.8036). This demonstrates that training the LWIR detector on objects with large variation in their temperatures across scenes creates a more robust detector. As type II LWIR detector outperforms the type I LWIR detector, it would be selected to represent the best values for object detection in the LWIR spectrum while comparing with the visible spectrum.

Comparing the performance of the visible imaging system with the LWIR based imaging system (with type II detector) provides different results for 2D and 3D object detection. For 2D imaging system, the LWIR object detector outperforms the visible object detector. Average F1 score for 2D LWIR (0.2843) is higher than that of 2D visible (0.2400), and the average miss rate of LWIR (0.7887) is lower compared with the visible (0.8525). For traditional 2D imaging in low illumination and adverse environmental conditions, the LWIR based imaging systems offer better performance compared to the visible spectrum imaging systems. However, incorporating the 3D InIm for imaging significantly improves the performance of the visible imaging systems under the degradations considered. In LWIR spectrum, the objects of interest are already well segmented from the background, and thus 3D InIm does not give significant boost to LWIR compared that to the visible spectrum images. With 3D InIm, the average precision value for visible spectrum object detection (0.9610) is significantly higher than that for LWIR object detection (0.2830). The same is observed with average F1 score, with visible systems having an average score of 0.8862 compared to the LWIR systems with average score of 0.3801. The miss rate for 3D visible (0.1778) is significantly smaller compared to the LWIR system (0.4216). With 3D InIm, it can be observed that, visible spectrum imaging system is able to detect objects with a higher F1 score, lower miss rate, and a significantly larger precision score compared to the LWIR system.

These experimental results demonstrate that for the experiments performed for low illumination and occlusion conditions, the performance of detection model on 3D InIm, on average, is better compared to the traditional 2D imaging systems in both visible and LWIR spectrum. It is also evident that for the traditional 2D imaging systems, object detection in the LWIR spectrum outperforms that in the visible spectrum. This explains the ubiquity of the LWIR cameras for imaging in low illumination and challenging environmental conditions. However, our experiments demonstrate that visible domain 3D integral imaging may outperform LWIR based 2D and 3D imaging systems in low light conditions plus occlusion. Compared to the traditional LWIR imaging, 3D visible domain integral imaging systems can provide object detection with a higher F1 score and a significantly higher precision score.

5. Conclusions

We have experimentally investigated object classification performance using 3D integral imaging with a deep neural network in adverse environmental conditions of low light illumination and partial occlusions. The experimental investigations were performed in visible spectral band with a CMOS camera, and in long wave IR domain with a LWIR camera, for both cold and hot objects in the scene. LWIR imaging has been a common approach to detect objects in low light levels. The ability to perform reliable object classification in low light and degraded environment in the visible spectral band with ubiquitous CMOS image sensors has many advantages including providing more than an order of magnitude improved spatial resolution over LWIR camera. The YOLO v2 neural network was used for object detection and classification in the visible and LWIR domains. 3D integral imaging mitigated the effects of adverse environmental conditions providing improved performance. Experimental results show that for the experiments we performed, 3D integral imaging in the visible spectrum gives higher F1 score, a lower miss rate, and a significantly higher precision score than 1) 2D visible spectrum object detection, as well as, 2) both 2D and 3D object detections in LWIR spectrum. Our results demonstrate that for certain applications, such as in low illumination and adverse environmental conditions, passive 3D imaging systems in the visible domain may offer a viable alternative to the costlier, bulkier LWIR imaging systems. While these experiments are not comprehensive, they demonstrate the potential of 3D InIm in the visible spectrum for low light applications.

Funding

Air Force Office of Scientific Research (FA9550-18-1-0338, FA9550-21-1-0333); Office of Naval Research (N000141712405, N00014-17-1-2561, N00014-20-1-2690).

Disclosures

The authors declare no conflict of interest.

Data availability

Data underlying the results are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” IEEE Conference on Computer Vision and Pattern Recognition, 2155–2162 (2014).

2. Y. Tian, P. Luo, X. Wang, and X. Tang, “Deep learning strong parts for pedestrian detection,” IEEE International Conference on Computer Vision, 1904–1912 (2015).

3. P. F. Felzenszwalb, R. B. Girshick, D. Mcallester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010). [CrossRef]

4. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” IEEE International Conference on Computer Vision, 770–778 (2015).

5. V. Bevilacqua, A. Brunetti, G. D. Cascarano, A. Guerriero, F. Pesce, M. Moschetta, and L. Gesualdo, “A comparison between two semantic deep learning frameworks for the autosomal dominant polycystic kidney disease segmentation based on magnetic resonance images,” BMC Med. Inf. Decis. Making 19(S9), 244 (2019). [CrossRef]

6. J. L. Pezzaniti, D. Chenault, K. Gurton, and M. Felton, “Detection of obscured targets with IR polarimetric imaging,” Proc. SPIE 9072, 90721D (2014). [CrossRef]

7. J. S. Tyo, B. M. Ratliff, J. K. Boger, W. T. Black, D. L. Bowers, and M. P. Fetrow, “The effects of thermal equilibrium and contrast in LWIR polarimetric images,” Opt. Express 15(23), 15161–15167 (2007). [CrossRef]

8. S. Komatsu, A. Markman, A. Mahalanobis, K. Chen, and B. Javidi, “Three-dimensional integral imaging and object detection using long-wave infrared imaging,” Appl. Opt. 56(9), D120–D126 (2017). [CrossRef]

9. A. Markman and B. Javidi, “Learning in the dark: 3D integral imaging object recognition in very low illumination conditions using convolutional neural networks,” OSA Conti. 1(2), 373–383 (2018). [CrossRef]

10. D. Aloni, A. Stern, and B. Javidi, “Three-dimensional photon counting integral imaging reconstruction using penalized maximum likelihood expectation maximization,” Opt. Express 19(20), 19681–19687 (2011). [CrossRef]

11. X. Shen, A. Carnicer, and B. Javidi, “Three-dimensional polarimetric integral imaging under low illumination conditions,” Opt. Lett. 44(13), 3230–3233 (2019). [CrossRef]

12. B. Tavakoli, B. Javidi, and E. Watson, “Three dimensional visualization by photon counting computational integral imaging,” Opt. Express 16(7), 4426–4436 (2008). [CrossRef]

13. G. Lippmann, “Epreuves reversibles donnant la sensation du relief,” J. Phys. 7(1), 821–825 (1908). [CrossRef]

14. N. Davies, M. McCormick, and L. Yang, “Three-dimensional imaging systems: a new development,” Appl. Opt. 27(21), 4520–4528 (1988). [CrossRef]

15. H. Arimoto and B. Javidi, “Integral Three-dimensional Imaging with digital reconstruction,” Opt. Lett. 26(3), 157–159 (2001). [CrossRef]

16. F. Okano, H. Hoshino, J. Arai, and I. Yuyama, “Real-time pickup method for a three-dimensional image based on integral photography,” Appl. Opt. 36(7), 1598–1603 (1997). [CrossRef]

17. M. Martinez-Corral, A. Dorado, J. C. Barreiro, G. Saavedra, and B. Javidi, “Recent advances in the capture and display of macroscopic and microscopic 3D scenes by integral imaging,” Proc. IEEE 105(5), 825–836 (2017). [CrossRef]

18. A. Stern and B. Javidi, “Three-dimensional image sensing and reconstruction with time-division multiplexed computational integral imaging,” Appl. Opt. 42(35), 7036–7042 (2003). [CrossRef]

19. E. H. Adelson and J. R. Bergen, “The plenoptic function and the elements of early vision,” Computational Models of Visual Processing 1, 3–20 (1991).

20. J. Liu, D. Claus, T. Xu, T. Keßner, A. Herkommer, and W. Osten, “Light field endoscopy and its parametric description,” Opt. Lett. 42(9), 1804–1807 (2017). [CrossRef]

21. G. Scrofani, J. Sola-Pikabea, A. Llavador, E. Sanchez-Ortiga, J. C. Barreiro, G. Saavedra, J. Garcia-Sucerquia, and M. Martinez-Corral, “FIMic: design for ultimate 3D-integral microscopy of in-vivo biological samples,” Biomed. Opt. Express 9(1), 335–346 (2018). [CrossRef]

22. J. Arai, E. Nakasu, T. Yamashita, H. Hiura, M. Miura, T. Nakamura, and R. Funatsu, “Progress overview of capturing method for integral 3-D imaging displays,” Proc. IEEE 105(5), 837–849 (2017). [CrossRef]

23. M. Yamaguchi, “Full-parallax holographic light-field 3-D displays and interactive 3-D touch,” Proc. IEEE 105(5), 947–959 (2017). [CrossRef]

24. J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 6517–6525 (2017).

25. J. S. Jang and B. Javidi, “Three-dimensional synthetic aperture integral imaging,” Opt. Lett. 27(13), 1144–1146 (2002). [CrossRef]

26. M. Daneshpanah and B. Javidi, “Profilometry and optical slicing by passive three dimensional imaging,” Opt. Lett. 34(7), 1105–1107 (2009). [CrossRef]

27. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 779–788 (2016).

28. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 1–9 (2015).

29. “Noise, Dynamic Range, and Bit Depth in Digital SLRs,” https://homes.psd.uchicago.edu/~ejmartin/pix/20d/tests/noise/index.html.

30. “CCD Signal-To-Noise Ratio,” https://www.microscopyu.com/tutorials/ccd-signal-to-noise-ratio.

31. “ORCA Fusion Hamamatsu,” https://www.hamamatsu.com/resources/pdf/sys/SCAS0138E_C14440-20UP_tec.pdf.

32. X. Dong, G. Wang, Y. Pang, W. Li, J. Wen, W. Meng, and Y. Lu, “Fast efficient algorithm for enhancement of low lighting video,” Proc. IEEE Int. Conf. Multimedia and Expo (ICME), 1–6 (2011).

33. K. He, J. Sun, and X. Tang, “Single Image Haze Removal Using Dark Channel Prior,” IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2341–2353 (2011). [CrossRef]

34. C. Tang, Y. Wang, H. Feng, Z. Xu, Q. Li, and Y. Chen, “Low-light image enhancement with strong light weakening and bright halo suppressing,” IET Image Process. 13(3), 537–542 (2019). [CrossRef]

35. H. Maghrebi and E. Prouff, “On the Use of Independent Component Analysis to Denoise Side-Channel Measurements,” Int. Workshop on Constructive Side-Channel Anal. and Secure Des. (COSADE)61–81 (2018).

36. A. Hyvärinen and E. Oja, “A Fast Fixed-Point Algorithm for Independent Component Analysis,” Neural Comput. 9(7), 1483–1492 (1997). [CrossRef]

37. K. Liang and J. Ye, “ICA-based image denoising: A comparative analysis of four classical algorithms,” Proc. IEEE 2nd Int. Conf. on Big Data Analysis (ICBDA), 709–713 (2017).

38. A. Baghaie, R. M. D’Souza, and Z. Yu, “Application of Independent Component Analysis Techniques in Speckle Noise Reduction of Retinal OCT Images,” Optik 127(15), 5783–5791 (2016). [CrossRef]

39. H. Maghrebi and E. Prouff, “On the use of independent component analysis to denoise side-channel measurements,” Springer COSADE 10815, (2018).

40. A. Martinez-Uso, P. Latorre-Carmona, J. Sotoca, F. Pla, and B. Javidi, “Depth estimation in integral imaging based on maximum voting strategy,” J. Display Technol. 12(12), 1715–1723 (2016). [CrossRef]

Average photons per pixel	SNR with no pre-processing	SNR with dark channel prior based dehazing	SNR with ICA^b based denoising	SNR with dark channel based dehazing and ICA denoising
23.29 (Fig. 2(a))	3.33	4.48	5.17	8.89
3.52 (Fig. 2(e))	2.29	3.05	3.81	7.02
2.40 (Fig. 2(i))	1.62	2.29	2.53	4.65

	Visible	LWIR
Focal length (f)	50mm	11mm
Diameter of lens (D)	40mm	9.2mm
Sensor size	2048(H) 2048(V)	320(H) 240(V)
Pixel size	6.5 × 6.5 $μ$ m	17 × 17 $μ$ m
Reconstruction depth (z)	∼5m	∼5m
Camera pitch (H,V)	40 mm, 40mm	40 mm, 40mm
Elemental images	25 (5H × 5 V)	25 (5H × 5 V)
F-number (N = f/D)	1.25	1.2

	Average Precision	Average Recall	Average F1	Average Miss Rate
Detector-Visible 2D	0.6429	0.1475	0.2400	0.8525
Detector-Visible 3D	0.9610	0.8222	0.8862	0.1778
Detector-LWIR-type I 2D	0.4130	0.1434	0.2129	0.8566
Detector-LWIR-type I 3D	0.3139	0.2494	0.2779	0.7506
Detector-LWIR-type II 2D	0.4341	0.2113	0.2843	0.7887
Detector-LWIR-type II 3D	0.2830	0.5784	0.3801	0.4216

	Average Precision	Average Recall	Average F1	Average Miss Rate
Detector-LWIR-type I 2D	0.4222	0.2275	0.2957	0.7725
Detector-LWIR-type I 3D	0.3139	0.3477	0.3299	0.6523
Detector-LWIR-type II 2D	0.4623	0.2934	0.3590	0.7066
Detector-LWIR-type II 3D	0.3183	0.6595	0.4294	0.3405

	Average Precision	Average Recall	Average F1	Average Miss Rate
Detector-LWIR-type I 2D	NaN	0	NaN	1.0
Detector-LWIR-type I 3D	NaN	0	NaN	1.0
Detector-LWIR-type II 2D	0.3043	0.0714	0.1157	0.9286
Detector-LWIR-type II 3D	0.1889	0.3727	0.2508	0.627

Lowlight object recognition by deep learning with passive three-dimensional integral imaging in visible and long wave infrared wavelengths

Abstract

1. Introduction

2. Methodology

2.1 Integral imaging

2.2 Classification model

2.3 Low illumination image enhancement

3. Experimental methodology

3.1 Experimental setup

3.2 Training data augmentation

3.3 Object detector

3.4 Test data

4. Results and discussion

5. Conclusions

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (7)

Tables (5)

Equations (2)

Optics Express