1 Introduction

Hyperspectral imaging encodes the reflectance of a scene across hundreds or thousands of narrow wavelength bands (e.g., 10 nm wide) into a hyperspectral image. Unlike conventional images, each pixel in a hyperspectral image contains a continuous spectrum, thus allowing the acquisition of abundant spectral information. Such information has proven quite useful for distinguishing different materials. Therefore, hyperspectral images have been widely exploited to facilitate various applications in the computer vision community, such as visual tracking [20], image segmentation [18], face recognition [14], scene classification [5], and anomaly detection [10].

The acquisition of spectral information, however, comes at the cost of reduced spatial resolution. This is because fewer photons are captured by each detector due to the narrower width of the spectral bands. To maintain a reasonable signal-to-noise ratio (SNR), the instantaneous field of view (IFOV) must be increased, which makes it difficult to produce hyperspectral images with high spatial resolution. To address this problem, many efforts have been devoted to hyperspectral image super-resolution.

Most existing methods focus on enhancing the spatial resolution of an observed hyperspectral image. According to their input, they can be divided into two categories: (1) fusion based methods, where a high-resolution conventional image (e.g., an RGB image) and a low-resolution hyperspectral image are fused to produce a high-resolution hyperspectral image [11, 22]; (2) single image super-resolution, which directly increases the spatial resolution of a hyperspectral image [12, 24, 25, 27]. Although these methods have shown effective performance, acquiring the input hyperspectral image often requires specialized hyperspectral sensors as well as high imaging cost. To mitigate this problem, some recent literature [2, 4, 7, 13] investigates a different hyperspectral imagery super-resolution scheme, termed spectral super-resolution, which aims at improving the spectral resolution of a given RGB image. Since the input image can be easily captured by conventional RGB sensors, the imaging cost is greatly reduced.

However, it is challenging to accurately reconstruct a hyperspectral image from a single RGB observation, since mapping three discrete intensity values to a continuous spectrum is a highly ill-posed linear inverse problem. To address this problem, we propose to learn a complex non-linear mapping function for spectral super-resolution with deep convolutional neural networks (CNNs). The 3-dimensional color vector of a specific pixel can be viewed as a downsampled observation of the corresponding spectrum. Moreover, for a candidate pixel, there often exist abundant locally and non-locally similar pixels (i.e., pixels exhibiting similar spectra) in the spatial domain. As a result, the color vectors of those similar pixels can be viewed as a group of downsampled observations of the latent spectrum of the candidate pixel. Therefore, accurate spectral reconstruction requires explicitly considering both the local and non-local information in the input RGB image. To this end, we develop a novel multi-scale CNN. Our method jointly encodes the local and non-local image information by symmetrically downsampling and upsampling the intermediate feature maps in a cascading paradigm, thus enhancing the spectral reconstruction accuracy. We experimentally show that the proposed method can be easily trained end-to-end and outperforms several state-of-the-art methods on a large hyperspectral image dataset with respect to various evaluation metrics.

Our contributions are twofold:

  • We design a novel CNN architecture that is able to encode both local and non-local information for spectral reconstruction.

  • We perform extensive experiments on a large hyperspectral dataset and obtain the state-of-the-art performance.

2 Related Work

This section gives a brief review of the existing spectral super-resolution methods, which can be divided into the following two categories.

Statistic Based Methods. This line of research focuses on exploiting the inherent statistical distribution of the latent hyperspectral image as a prior to guide super-resolution [21, 26]. Most of these methods involve building overcomplete dictionaries and learning sparse coding coefficients to linearly combine the dictionary atoms. For example, in [4], Arad et al. leveraged image priors to build a dictionary using K-SVD [3]. At test time, orthogonal matching pursuit [15] was used to compute a sparse representation of the input RGB image. [2] proposed a new method inspired by A+ [19], where sparse coefficients are computed by explicitly solving a sparse least squares problem. These methods exploit the whole image to build the prior, ignoring local and non-local structure information. Moreover, since the image prior is often handcrafted or heuristically designed with a shallow structure, these methods fail to generalize well in practice.

Learning Based Methods. These methods directly learn a mapping function from the RGB image to a corresponding hyperspectral image. For example, [13] proposed a training based method using a radial basis function network. The input data were pre-processed with a white balancing function to alleviate the influence of different illuminations; the overall reconstruction accuracy thus depends on the performance of this pre-processing stage. Recently, witnessing the great success of deep learning in other ill-posed inverse problems such as image denoising [23] and single image super-resolution [6], it is natural to consider deep networks (especially convolutional neural networks) for spectral super-resolution. In [7], Galliani et al. exploited a variant of fully convolutional DenseNets (FC-DenseNets [9]) for spectral super-resolution. However, this method is sensitive to hyper-parameters and its performance can still be further improved.

3 Proposed Method

In this section, we introduce the proposed multi-scale convolutional neural network in detail. First, we describe the building blocks used in our network. Then, we illustrate the overall network architecture.

Table 1. Basic building blocks of our network

3.1 Building Blocks

There are three basic building blocks in our network. Their structures are shown in Table 1.

Double convolution (Double Conv) block consists of two \(3\times 3\) convolutions, each followed by batch normalization, leaky ReLU, and dropout. We exploit batch normalization and dropout to mitigate overfitting.

Downsample block contains a regular max-pooling layer. It reduces the spatial size of the feature map and enlarges the receptive field of the network.

Upsample block is utilized to upsample the feature map in the spatial domain. Much of the previous literature adopts transposed convolution for this purpose, which however is prone to producing checkerboard artifacts. To address this problem, we use the pixel shuffle operation [17], which has been shown to alleviate checkerboard artifacts. In addition, since it introduces no learnable parameters, pixel shuffle also improves robustness against overfitting.
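As a concrete reference, below is a minimal PyTorch sketch of the three building blocks. The dropout rate and leaky ReLU slope default to the values reported in Sect. 4.2; the channel counts are left as parameters and are an illustrative assumption.

```python
# A minimal sketch of the three building blocks (not the exact implementation).
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by BN, leaky ReLU, and dropout."""
    def __init__(self, in_ch, out_ch, p_drop=0.2, neg_slope=0.2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(neg_slope, inplace=True),
            nn.Dropout2d(p_drop),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(neg_slope, inplace=True),
            nn.Dropout2d(p_drop),
        )

    def forward(self, x):
        return self.block(x)

# Downsample block: a regular max pooling halves the spatial size.
downsample = nn.MaxPool2d(kernel_size=2)

# Upsample block: parameter-free pixel shuffle rearranges C*4 channels into a
# 2x larger feature map, avoiding the checkerboard artifacts of transposed
# convolution.
upsample = nn.PixelShuffle(2)
```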

Fig. 1. Diagram of the proposed method. “Conv m” denotes a convolutional layer with m output feature maps. We use \(3 \times 3\) convolutions in green blocks and a \(1 \times 1\) convolution in the red block. Gray arrows represent feature concatenation (Color figure online).

3.2 Network Architecture

Our method is inspired by the well-known U-Net architecture for image segmentation [16]. The overall architecture of the proposed multi-scale convolutional neural network is depicted in Fig. 1. The network follows an encoder-decoder pattern. In the encoder, each step consists of a “Double Conv” block followed by a downsample block; the spatial size is progressively reduced while the number of feature maps is doubled at each step. The decoder is symmetric to the encoder: every step consists of an upsample block followed by a “Double Conv” block, so the spatial size of the features is recovered while the number of feature maps is halved at each step. Finally, a \(1 \times 1\) convolution maps the output features to the reconstructed 31-channel hyperspectral image. In addition to the feedforward path, skip connections concatenate the corresponding feature maps of the encoder and decoder.

Our method naturally fits the task of spectral reconstruction. The encoder can be interpreted as extracting features from the RGB image. Through cascaded downsampling, the receptive field of the network is constantly enlarged, which allows the network to “see” more pixels in an increasingly larger field of view. By doing so, both local and non-local information can be encoded to better represent the latent spectra. The symmetric decoder then reconstructs the latent hyperspectral image from these deep and compact features. The skip connections with concatenation are essential for introducing multi-scale information and yielding better estimates of the spectra. A sketch of this wiring is given below.
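The following sketch assembles the encoder-decoder from the building blocks above. The depth (4 scales) and base width (64 feature maps) are assumptions for illustration; the exact channel counts are those annotated in Fig. 1.

```python
# Illustrative assembly of the encoder-decoder. Assumes DoubleConv from the
# previous sketch is in scope and the input spatial size is divisible by
# 2**depth.
import torch
import torch.nn as nn

class MultiScaleNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=31, base=64, depth=4):
        super().__init__()
        self.inc = DoubleConv(in_ch, base)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.PixelShuffle(2)                 # divides channels by 4
        self.enc, self.dec = nn.ModuleList(), nn.ModuleList()
        ch = base
        for _ in range(depth):                       # features double per step
            self.enc.append(DoubleConv(ch, ch * 2))
            ch *= 2
        for _ in range(depth):                       # features halve per step
            # after pixel shuffle (ch // 4 maps) we concatenate the skip (ch // 2)
            self.dec.append(DoubleConv(ch // 4 + ch // 2, ch // 2))
            ch //= 2
        self.head = nn.Conv2d(ch, out_ch, kernel_size=1)  # 1x1 conv to 31 bands

    def forward(self, x):
        skips = []
        x = self.inc(x)
        for enc in self.enc:                         # encoder path
            skips.append(x)
            x = enc(self.pool(x))
        for dec in self.dec:                         # decoder path with skips
            x = dec(torch.cat([self.up(x), skips.pop()], dim=1))
        return self.head(x)
```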

4 Experiments

4.1 Datasets

In this study, all experiments are performed on the NTIRE2018 dataset [1], which extends the ICVL dataset [4]. The ICVL dataset includes 203 images captured with a Specim PS Kappa DX4 hyperspectral camera. Each image has a spatial resolution of \(1392\times 1300\) and contains 519 spectral bands in the range of 400–1000 nm. In our experiments, 31 successive bands ranging from 400–700 nm with a 10 nm interval are extracted from each image for evaluation. The NTIRE2018 challenge further extends this dataset with 53 extra images of the same spatial and spectral resolution, yielding 256 high-resolution hyperspectral images as training data. In addition, another 5 hyperspectral images are introduced as the test set. The NTIRE2018 dataset also provides the corresponding RGB rendition of each image. In the following, we employ these RGB-hyperspectral image pairs to evaluate the proposed method.

Table 2. Quantitative results on each test image.

Fig. 2. Sample results of spectral reconstruction by our method. Top row: RGB renditions. Bottom row: ground-truth (solid) and reconstructed (dashed) spectral responses of four pixels identified by the dots in the RGB images.

4.2 Comparison Methods and Implementation Details

To demonstrate the effectiveness of the proposed method, we compare it with four spectral super-resolution methods: spline interpolation, the sparse recovery method in [4] (Arad et al.), A+ [2], and the deep learning method in [7] (Galliani et al.). [2, 4] are run with the code released by the authors. Since no code is released for [7], we reimplement it in this study. In the following, we give the implementation details of each method.

Spline Interpolation. The interpolation algorithm serves as the most primitive baseline in this study. Specifically, for each RGB pixel \(\varvec{p}_{l} = \big ( r,g,b \big )\), we use spline interpolation to upsample it to a 31-dimensional spectrum \(\varvec{p}_{h}\). According to the visible spectrum, the r, g, b values of an RGB pixel are assigned to 700 nm, 550 nm, and 450 nm, respectively.
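A minimal sketch of this baseline is shown below, assuming the 400–700 nm target grid with a 10 nm step; since a cubic spline needs at least four anchors, the sketch uses a quadratic spline through the three color anchors.

```python
# Spline-interpolation baseline: upsample one RGB pixel to a 31-band spectrum.
# Anchor wavelengths follow the assignment above (b: 450, g: 550, r: 700 nm);
# k=2 because a cubic spline would need at least four anchors.
import numpy as np
from scipy.interpolate import make_interp_spline

def spline_upsample(r, g, b):
    anchors = np.array([450.0, 550.0, 700.0])     # ascending wavelengths
    values = np.array([b, g, r])                  # matching channel order
    target = np.arange(400.0, 701.0, 10.0)        # 31 bands, 400-700 nm
    return make_interp_spline(anchors, values, k=2)(target)
```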

Arad et al. and A+. The low spectral resolution image is assumed to be a directly downsampled version of the corresponding hyperspectral image under a specific linear projection matrix. In [2, 4] this matrix is required to be perfectly known. In our experiments, we fit the projection matrix to the training data with conventional linear regression.
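Concretely, with training spectra stacked row-wise in X (n × 31) and the corresponding RGB values in Y (n × 3), the fit is an ordinary least-squares problem; this is a sketch of our setup, not the formulation in [2, 4].

```python
# Fit the linear projection P (RGB ≈ P @ spectrum) by ordinary least squares.
import numpy as np

def fit_projection_matrix(X, Y):
    """X: (n, 31) hyperspectral pixels; Y: (n, 3) RGB pixels. Returns (3, 31)."""
    P_T, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (31, 3) solution
    return P_T.T
```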

Fig. 3. Training and test curves.

Galliani et al. and Our Method. We experimentally find the optimal set of hyper-parameters for both methods. A \(50\%\) dropout rate is applied to Galliani et al., while our method uses a \(20\%\) dropout rate. All leaky ReLU activations use a negative slope of 0.2. We train the networks for 100 epochs using the Adam optimizer with \(10^{-6}\) weight regularization. Weight initialization and learning rate vary between the two methods. For Galliani et al., the weights are initialized via HeUniform [8], and the learning rate is set to \(2 \times 10^{-3}\) for the first 50 epochs and decayed to \(2 \times 10^{-4}\) for the next 50 epochs. For our method, we use HeNormal initialization [8]; the initial learning rate is \(5 \times 10^{-5}\) and is multiplied by 0.93 every 10 epochs. We perform data augmentation by extracting patches of size \(64 \times 64\) with a stride of 40 pixels from the training data, yielding over 267,000 training samples in total. At test time, we feed the whole image to the network and obtain the estimated hyperspectral image in a single forward pass.
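For reference, the following PyTorch snippet mirrors the reported settings for our method. The patch data loader is assumed to exist elsewhere, and the \(10^{-6}\) regularization is interpreted here as Adam weight decay.

```python
# Training configuration sketch for our method. `train_loader` (yielding
# 64x64 RGB/hyperspectral patch pairs) is an assumed helper.
import torch.nn as nn
import torch.nn.init as init
import torch.optim as optim

model = MultiScaleNet()
for m in model.modules():                          # HeNormal initialization [8]
    if isinstance(m, nn.Conv2d):
        init.kaiming_normal_(m.weight, a=0.2)      # a = leaky ReLU slope

optimizer = optim.Adam(model.parameters(), lr=5e-5, weight_decay=1e-6)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.93)
criterion = nn.MSELoss()

for epoch in range(100):
    for rgb, hsi in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(rgb), hsi)
        loss.backward()
        optimizer.step()
    scheduler.step()                               # lr *= 0.93 every 10 epochs
```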

4.3 Evaluation Metrics

To quantitatively evaluate the performance of the proposed method, we adopt the following two categories of evaluation metrics.

Pixel-Level Reconstruction Error. We follow [2] and use absolute and relative root-mean-square errors (RMSE and rRMSE) as quantitative measures of reconstruction accuracy. Let \(I_{h}^{(i)}\) and \(I_{e}^{(i)}\) denote the ith elements of the real and estimated hyperspectral images, \(\bar{I}_{h}\) the average of \(I_{h}\), and n the total number of elements in one hyperspectral image. We use two variants each of RMSE and rRMSE:

$$\begin{aligned} RMSE_{1}&= \frac{1}{n}\sum _{i=1}^{n}\left| I_{h}^{(i)}-I_{e}^{(i)}\right| \qquad&RMSE_{2}&= \sqrt{\frac{1}{n}\sum _{i=1}^{n}\left( I_{h}^{(i)}-I_{e}^{(i)}\right) ^{2}} \\ rRMSE_{1}&= \frac{1}{n}\sum _{i=1}^{n}\frac{\left| I_{h}^{(i)}-I_{e}^{(i)}\right| }{I_{h}^{(i)}}\qquad&rRMSE_{2}&= \sqrt{\frac{1}{n}\sum _{i=1}^{n}\frac{\left( I_{h}^{(i)}-I_{e}^{(i)}\right) ^{2}}{\bar{I}_{h}^{2}}} \end{aligned}$$
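These four measures translate directly into NumPy; the sketch below assumes I_h and I_e are arrays of equal shape holding the real and estimated images.

```python
# NumPy transcription of the four error measures above. Note that
# sqrt((.)^2) reduces to the absolute value in RMSE_1 and rRMSE_1.
import numpy as np

def rmse_metrics(I_h, I_e):
    diff = I_h - I_e
    rmse1 = np.mean(np.abs(diff))
    rmse2 = np.sqrt(np.mean(diff ** 2))
    rrmse1 = np.mean(np.abs(diff) / I_h)
    rrmse2 = np.sqrt(np.mean(diff ** 2) / np.mean(I_h) ** 2)
    return rmse1, rmse2, rrmse1, rrmse2
```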

Spectral Similarity. Since the key to spectral super-resolution is reconstructing the spectra, we also use the spectral angle mapper (SAM) to evaluate the different methods. SAM calculates the average spectral angle between the spectra of the real and estimated hyperspectral images. Let \(\varvec{p}_{h}^{(j)}, \varvec{p}_{e}^{(j)} \in \mathbb {R}^{C}\) denote the spectra of the jth pixel in the real and estimated hyperspectral images (C is the number of bands), and let m be the total number of pixels within an image. The SAM value is computed as follows.

$$\begin{aligned} SAM = \frac{1}{m}\sum _{j=1}^{m}\cos ^{-1}\left( \frac{(\varvec{p}_{h}^{(j)})^{T} \varvec{p}_{e}^{(j)}}{\left\| \varvec{p}_{h}^{(j)} \right\| _{2} \left\| \varvec{p}_{e}^{(j)} \right\| _{2}} \right) \end{aligned}$$
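A direct NumPy rendering of this measure, assuming the spectra are stacked as (m, C) arrays, is:

```python
# SAM: mean spectral angle (in radians) between real and estimated spectra.
# P_h, P_e: arrays of shape (m, C), one spectrum per row.
import numpy as np

def sam(P_h, P_e, eps=1e-12):
    dots = np.sum(P_h * P_e, axis=1)
    norms = np.linalg.norm(P_h, axis=1) * np.linalg.norm(P_e, axis=1)
    cosines = np.clip(dots / (norms + eps), -1.0, 1.0)  # guard numerical drift
    return np.mean(np.arccos(cosines))
```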

4.4 Experimental Results

Convergence Analysis. We plot the MSE loss on the training set and the five evaluation metrics on the test set in Fig. 3. It can be seen that both the training loss and the metric values gradually decrease and ultimately converge as training proceeds, demonstrating that the proposed multi-scale convolutional neural network converges well.

Quantitative Results. Table 2 provides the quantitative results of our method and all baselines. Our model outperforms all competitors with regard to \(RMSE_{1}\) and \(rRMSE_{1}\), and produces results comparable to Galliani et al. on \(RMSE_{2}\) and \(rRMSE_{2}\). More importantly, our method surpasses all the others with respect to the spectral angle mapper, which shows that our model reconstructs spectra more accurately than the competitors. It is worth pointing out that the reconstruction error (absolute and relative RMSE) is not necessarily positively correlated with the spectral angle mapper (SAM); for example, when the pixels of an image are shuffled, RMSE and rRMSE remain the same, while SAM can change completely. According to the results in Table 2, our network enhances spectral super-resolution in both respects, yielding better results on both root-mean-square error and spectral angle similarity.

Fig. 4. Visualization of the absolute reconstruction error. From left to right: RGB rendition, A+, Galliani et al., and our method.

Visual Results. To further illustrate the superiority of our method in reconstruction accuracy, we show the absolute reconstruction error on test images in Fig. 4. The error is summed over all bands of the hyperspectral image. Since A+ outperforms Arad et al. on every evaluation metric, we use A+ to represent the sparse coding methods. Our method yields smoother reconstructed images as well as lower reconstruction error than the other competitors.

In addition, we randomly choose three test images and plot the real and reconstructed spectra of four pixels in Fig. 2 to further demonstrate the effectiveness of the proposed method in spectrum reconstruction. Only slight differences exist between the reconstructed spectra and the ground truth.

From the results above, we conclude that the proposed method is effective for spectral super-resolution and outperforms several state-of-the-art competitors.

5 Conclusion

In this study, we show that leveraging both the local and non-local information of the input image is essential for accurate spectral reconstruction. Following this idea, we design a novel multi-scale convolutional neural network, which employs a symmetrically cascaded downsampling-upsampling architecture to jointly encode the local and non-local image information for spectral reconstruction. Extensive experiments on a large hyperspectral image dataset show that the proposed method clearly outperforms several state-of-the-art methods in terms of both reconstruction accuracy and spectral similarity.