1 Introduction

Image enhancement is a fundamental task in many areas, especially public security, biomedicine, health services, and marine information, where significant achievements have been made [1,2,3]. Researchers have proposed a wide variety of parallel image processing algorithms, such as CUDA-based image enhancement algorithms [4] and image processing algorithms based on multicore DSPs [5]. However, although single-scale Retinex (SSR) is one of the most important image processing techniques, its serial implementation is still too slow to complete image enhancement tasks within an acceptable time.

To solve this problem, this paper proposes a parallel SSR image enhancement algorithm based on OpenMP. The algorithm is implemented on the Tianhe-2 supercomputer using the OpenMP programming model and, in our evaluation, achieves an average speedup of 12. The experimental results show that the proposed parallel algorithm can meet the needs of real-time processing in the image enhancement field.

2 Parallel Design and Implementation

2.1 Parallelism Analyses

The SSR algorithm enhances an image by applying a sequence of sub-algorithms: Gaussian template generation, Gaussian convolution, and exponential transformation. Because the data processed within each of these sub-algorithms are mutually independent, the SSR algorithm has good parallelism. As illustrated in Fig. 1, the parallelism of the serial SSR algorithm is analyzed from the following three aspects.

Fig. 1. The parallelism of the single-scale Retinex algorithm

Parallelism 1: the blurred image, which estimates the incident illumination component, is generated by Gaussian convolution. During this operation, the result for each pixel does not depend on the results for other pixels, so the image can be divided into sub-blocks for parallel computing.

Parallelism 2: the size of the Gaussian template is determined by the input parameters. When the Gaussian weights are normalized, each weight is divided by the sum of the weights, and these divisions can be carried out in parallel.

Parallelism 3: the exponential transformation can also be executed in parallel because there is no direct data dependence between its operations.
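For reference, the three stages analyzed above correspond to the commonly used formulation of single-scale Retinex, sketched below. The symbols are assumptions introduced here for illustration and are not taken from the paper: I is the input image, r is the log-domain output, * denotes convolution, c is the Gaussian scale parameter, and K is the normalization constant produced by the template-generation step. The paper's exact parameterization may differ.

```latex
\begin{align}
r(x, y) &= \log I(x, y) - \log\bigl[ G(x, y) * I(x, y) \bigr], \\
G(x, y) &= K \exp\!\left( -\frac{x^{2} + y^{2}}{c^{2}} \right),
\qquad \iint G(x, y)\, dx\, dy = 1 .
\end{align}
```

Because the exponent is a sum of an x-term and a y-term, G(x, y) factors into a product of two one-dimensional Gaussians, which is the property exploited by the parallel convolution design in Sect. 2.2.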

2.2 Parallel Design and Implementation of SSR Algorithm

In this section, the parallel design of these sub-algorithms is first illustrated in Fig. 2, and then their parallel implementations are presented.

Fig. 2. Parallel design of the single-scale Retinex algorithm

Parallel Design and Implementation of Gaussian Convolution.

In the serial Gaussian convolution algorithm, once the image has been divided into data blocks, the subsequent operations on each block are independent. This matches the parallel model of OpenMP, because there is no dependence between the data of pixels that are not directly adjacent in the image. Moreover, a two-dimensional Gaussian function G(x, y) can be written as the product of two one-dimensional Gaussian functions G(x) and G(y), which means that convolution with G(x, y) can be computed as two successive convolutions, one with G(x)δ(y) and one with G(y)δ(x). Each of these one-dimensional convolutions can itself be executed in parallel. The two-dimensional Gaussian convolution is therefore obtained from two one-dimensional Gaussian convolutions in the X and Y directions that run one after the other, each parallelized internally. For example, consider a 7 × 7 image and a 3 × 3 convolution kernel. The convolution requires 9 multiplications for each element of the image, so the sequential algorithm executes 7 × 7 × 9 = 441 multiplications in total. In contrast, parallel execution requires only 2 × 7 × 9 = 126 operations when sufficient threads are available. The execution time is thus reduced compared with the sequential algorithm.
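The two-pass scheme described above can be sketched in C with OpenMP as follows. This is a minimal illustration rather than the implementation evaluated in this paper: the function names, the row-major `double` image layout, the replicate-border handling, and the choice to parallelize over image rows are all assumptions. Counted over all threads, the two one-dimensional passes with a 3-tap kernel perform 2 × 3 = 6 multiplications per pixel instead of 9; how that work is shared depends on the number of threads available.

```c
#include <assert.h>
#include <stdlib.h>

/* Clamp an index into [0, n-1]: replicate-border handling (an assumption). */
static int clamp_index(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* One 1-D pass along rows (horizontal != 0) or along columns (horizontal == 0).
 * The kernel is assumed symmetric, so convolution and correlation coincide. */
static void convolve_1d(const double *src, double *dst, int width, int height,
                        const double *kernel, int radius, int horizontal) {
    /* Every output pixel reads only from src, so rows are independent
     * and can be distributed across OpenMP threads. */
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            double acc = 0.0;
            for (int k = -radius; k <= radius; ++k) {
                int sx = horizontal ? clamp_index(x + k, width) : x;
                int sy = horizontal ? y : clamp_index(y + k, height);
                acc += kernel[k + radius] * src[sy * width + sx];
            }
            dst[y * width + x] = acc;
        }
    }
}

/* Separable blur: the X pass and the Y pass run one after the other,
 * and each pass is parallel internally.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int gaussian_blur_separable(const double *src, double *dst, int width, int height,
                            const double *kernel, int radius) {
    assert(width > 0 && height > 0 && radius >= 0);
    double *tmp = malloc((size_t)width * (size_t)height * sizeof *tmp);
    if (tmp == NULL) return -1;
    convolve_1d(src, tmp, width, height, kernel, radius, 1); /* X direction */
    convolve_1d(tmp, dst, width, height, kernel, radius, 0); /* Y direction */
    free(tmp);
    return 0;
}
```

The implicit barrier at the end of each `parallel for` region is what enforces the serial ordering between the X pass and the Y pass. Compiled without OpenMP support, the pragmas are ignored and the same code runs serially.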

Parallel Design and Implementation of Gaussian Template.

Gaussian template generation is divided into two main steps. In the first step, the sum of the weights is calculated serially; in the second step, the normalized Gaussian template is generated in parallel. For example, generating a 3 × 3 normalized Gaussian template serially with one thread requires 9 sequential normalization operations, whereas with 9 threads running in parallel it requires only a single parallel step.
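The two steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the function name `normalize_template` and the in-place, row-major `double` template are hypothetical, and the unnormalized Gaussian weights are assumed to have been computed already.

```c
#include <assert.h>

/* Normalize an n-by-n template in place so that its weights sum to 1.
 * Returns the weight sum that was used as the divisor. */
double normalize_template(double *weights, int n) {
    assert(n > 0);
    const int count = n * n;

    /* Step 1: accumulate the weight sum serially. */
    double sum = 0.0;
    for (int i = 0; i < count; ++i) {
        sum += weights[i];
    }
    assert(sum != 0.0);

    /* Step 2: each division is independent of the others,
     * so the loop iterations can be shared among threads. */
    #pragma omp parallel for
    for (int i = 0; i < count; ++i) {
        weights[i] /= sum;
    }
    return sum;
}
```

For a template as small as 3 × 3, the cost of starting a parallel region is likely to outweigh the 9 divisions, so the benefit of this step should be expected mainly for large templates.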

Parallel Design and Implementation of Exponential Transformation.

The original image and the Gaussian-blurred image are transformed into the logarithmic domain to obtain a logarithmic image. The exponential transformation then stretches the high gray levels of the image and compresses the low gray levels. The most critical step in the exponential transformation is the linear mapping of each value. For an image of size 1000 × 1000, performing this linear mapping serially takes a long time. If the linear mapping is performed in parallel using 24 threads, each thread needs to perform only about 41,667 operations (1,000,000 / 24) rather than the 1,000,000 performed by the serial algorithm. It is therefore very profitable to perform the linear mapping of each value in parallel.
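A minimal sketch of the parallel linear mapping is given below. It assumes, for illustration only, that the log-domain values are stored in a flat `double` array and are stretched to the 8-bit range [0, 255] using their minimum and maximum; the function names are hypothetical, and the paper's actual mapping may use different bounds. The helper `per_thread_share` shows the largest number of elements one thread handles under an even split, which for 1,000,000 values and 24 threads is 41,667.

```c
#include <assert.h>

/* Largest number of elements any one thread handles under an even split. */
long per_thread_share(long n, int threads) {
    assert(n >= 0 && threads > 0);
    return (n + threads - 1) / threads;
}

/* Linearly map values[0..n-1] from [min, max] onto [0, 255]. */
void linear_map_to_8bit(const double *values, unsigned char *out, long n) {
    assert(n > 0);

    /* Find the value range serially; this is a single cheap pass. */
    double lo = values[0], hi = values[0];
    for (long i = 1; i < n; ++i) {
        if (values[i] < lo) lo = values[i];
        if (values[i] > hi) hi = values[i];
    }
    /* A constant image maps to 0 everywhere. */
    const double scale = (hi > lo) ? 255.0 / (hi - lo) : 0.0;

    /* Each mapped value depends only on its own input,
     * so the iterations can be divided among threads. */
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        out[i] = (unsigned char)((values[i] - lo) * scale + 0.5);
    }
}
```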

3 Experimental Results and Performance Analysis

3.1 Experimental Environment and Test Set

The experiment is performed on the Tianhe-2 supercomputer, which is equipped with 16,000 nodes; each node has three coprocessors, two Xeon E5-2692 processors, 24 cores, and 64 GB of memory. The experimental environment is shown in Table 1. Ten pictures of different sizes are used to demonstrate the speedup of the parallel algorithm; the smallest is 1730 × 883, the largest is 4000 × 3000, and all are in JPG format. The pictures are all agricultural images, including apples, pears, kiwis, farmland, and mountain forests, and were taken by a Dajiang UAV. The image test set is shown in Table 2.

Table 1. Experimental environment
Table 2. Image test set

3.2 Speedup Comparison

Table 3 lists the running times of the serial and parallel algorithms, where Th denotes the number of threads. Within a certain range, the parallel SSR algorithm shortens the image processing time as the number of OpenMP threads increases, and the average speedup reaches about 12. As the number of threads increases from 2 to 24, the speedup improves markedly, and the parallel SSR algorithm achieves near-linear acceleration. The speedup curve is shown in Fig. 3. The experiment was carried out on a single node of the Tianhe-2 supercomputer. Because each node has 24 cores, the speedup peaks at 24 threads, making full use of the multi-core performance. The speedup starts to decline at 32 threads because the number of threads then exceeds the number of CPU cores, although the processing time is still better than that of the serial algorithm. The experimental results show that the proposed parallel algorithm significantly improves the speedup and can satisfy the needs of real-time processing in the image enhancement field.
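For readers who want to reproduce the speedup figures from measured running times, the two standard metrics can be computed as sketched below. The function names are hypothetical, and the numbers in the usage note are made-up illustrations, not values from Table 3.

```c
#include <assert.h>

/* Speedup: serial running time divided by parallel running time. */
double speedup(double t_serial, double t_parallel) {
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of threads used. */
double efficiency(double t_serial, double t_parallel, int threads) {
    assert(threads > 0);
    return speedup(t_serial, t_parallel) / threads;
}
```

With hypothetical timings of 24.0 s for the serial run and 2.0 s for a 24-thread run, `speedup` returns 12.0 and `efficiency` returns 0.5. Reporting efficiency alongside speedup makes it easier to judge how close a given result is to linear scaling.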

Table 3. Comparison of running time (s)
Fig. 3. Speedup comparison

4 Conclusion

This paper proposes a parallel SSR algorithm based on OpenMP. Compared with the serial Retinex algorithm, the proposed parallel algorithm achieves an average speedup of 12, which represents a significant reduction in execution time. The experimental results show that the proposed parallel algorithm better meets the real-time processing requirements of image enhancement in the image processing field.