doi:10.1016/S0743-7315(02)00054-0
Copyright © 2003 Elsevier Science (USA). All rights reserved.
A multi-threaded fast convolver for dynamically parallel image filtering*1
Jeremy Kepner
MIT Lincoln Laboratory, 244 Wood St., Lexington, MA 02420, USA
Received 24 April 2002;
revised 8 September 2002.
Available online 23 March 2003.
References and further reading may be available for this article. To view references and further reading you must
purchase this article.
Abstract
2D convolution is a staple of digital image processing. The advent of large format imagers makes it possible to literally “pave” with silicon the focal plane of an optical sensor, which results in very large images that can require a significant amount computation to process. Filtering of large images via 2D convolutions is often complicated by a variety of effects (e.g., non-uniformities found in wide field of view instruments) which must be compensated for in the filtering process by changing the filter across the image. This paper describes a fast (FFT based) method for convolving images with slowly varying filters. A parallel version of the method is implemented using a multi-threaded approach, which allows more efficient load balancing and a simpler software architecture. The method has been implemented within a high level interpreted language (IDL), while also exploiting open standards vector libraries (VSIPL) and open standards parallel directives (OpenMP). The parallel approach and software architecture are generally applicable to a variety of algorithms and has the advantage of enabling users to obtain the convenience of an easy operating environment while also delivering high performance using a fully portable code.
Author Keywords: Image processing; Parallel algorithms; Multi-threaded; Open standards; High level languages
Fig. 1. Generic image processing pipeline. Most pipelines consists of steps similar to the above. In this process 2D filtering is the most compute intensive step. Later steps are more complex (e.g. source analysis) but operate on a reduced amount of data. (Cartwheel Galaxy image courtesy of Kirk Borne (ST ScI), and NASA.)
Fig. 2. Basic 2D filtering. FFT implementation of 2D filtering which performs the mathematical operation:
F(
x,
y)=∫∫
K(
x′,
y′)
I(
x−
x′,
y−
y′) d
x′ d
y′.
Fig. 3. Wide field filtering. FFT implementation of 2D filtering for wide field imaging with multiple point response functions. Each portion of image is filtered separately and then recombined using the appropriate weights. The equivalent mathematical operation is:
Fij(
x,
y)=
Wij(
x,
y)∫∫
Kij(
x′,
y′)
Iij(
x−
x′,
y−
y′) d
x′ d
y′.
Fig. 4. Load imbalance. Pre-detection the work in a pipeline is usually a simple function of the number of pixels. Post-detection the work is usually a simple function of the number of detections. In a statically parallel system, statistical fluctuations in the number of detections will lead to a load imbalance in the post detection processing.
Fig. 5. Layered software architecture. The user interacts with the top layer which provides high level abstractions for high productivity. Lower layers provide performance via parallel processing and high performance kernels.
Fig. 6. Multi-threaded program. OpenMP uses a fork/join model in which a master thread forks off multiple parallel threads, which rejoin to communicate or synchronize (figure adapted from [
4]).
Fig. 7. Parallel speedup. Measured speedups of wide field 2D filtering application on an shared memory parallel system (SGI Origin2000).
Fig. 8. Parallel efficiency. Measured parallel efficiency of wide field 2D filtering application on an shared memory parallel system (SGI Origin2000).
Fig. 9. FFTW performance. Performance of FFTW on FFTs ranging in size from 16 to 1024. The blue line indicates the performance achieved using the optimally padded FFT. The red line shows the performance obtained using powers of two.
Fig. 10. FFTW optimal paddings. Optimal padding (as a percentage of FFT size). The ability to get good performance with non-powers of two significantly reduces the amount of padding required. The blue line indicates the average padding require across all the different cases. The red line shows the average padding required if only powers of two are used.
Fig. 11. FFTW performance benefit. The percentage performance improvement obtained using optimal padding compared to only powers of two.
Fig. 12. Detailed algorithm flow. Precise computations, data structures and data movements necessary to execute 2D convolution algorithm with multiple kernels. The inputs consist of the image to be filtered and a grid with a kernel at each node. The algorithm proceeds by looping (in parallel) over each kernel in the grid
Kij and executing the following steps:
- • Determine boundaries of input sub-image Iij.
- • Compute corresponding weights Wij.
- • Determine boundaries of filtered sub-image Fij.
- • Determine boundaries of sub-image padded by kernel Iij+.
- • Determine size of and create padded sub-image IijFFT.
- • Create padded kernel KijFFT.
- • Copy Iij+ into IijFFT.
- • Copy Kij to center of KijFFT.
- • In place 2D FFT IijFFT.
- • In place 2D FFT KijFFT.
- • Multiply IijFFT by KijFFT and return to IijFFT.
- • In place 2D IFFT KijFFT.
- • In place 2D circular Shift IijFFT.
- • Multiply IijFFT sub-region corresponding to Iij by Wij and add to Fij.
These steps are described in greater detail in
Appendix B.
Fig. 13. Static speedup. Speedup obtainable for
M=1000 tasks (“balls”) and given failure rates using a static assignment scheme. Random fluctuations significantly bound speedup.
Fig. 14. Dynamic speedup. Speedup obtainable for
M=1000 tasks and
f=0 using a dynamic work assignment scheme. This scheme provides good speedups even in the worst case.
Table 1. Parallel overheads

Computation, communication, load balancing and coding overheads for exploiting different levels of parallelism for doing a 2D convolution.
Table 2. Parallel speedups

Measured parallel speedup and parallel efficiency obtained for different image sizes and different numbers of processors. In each case the kernel size was N=100.