doi:10.1016/j.micpro.2006.07.004
Copyright © 2006 Elsevier B.V. All rights reserved.
Multiplier-less VLSI architecture for real-time computation of multi-dimensional convolution
Computational Intelligence and Machine Vision Laboratory, Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529, USA
Available online 28 August 2006.
Abstract
A VLSI-efficient, multiplier-less architecture for real-time computation of multi-dimensional convolution is presented in this paper. The new architecture performs computations in the logarithmic domain by utilizing novel multiplier-less log2 and inverse-log2 modules, which can convert fractional numbers, a capability not currently available in the literature. An effective data handling strategy is developed in conjunction with the logarithmic modules to eliminate the need for multipliers in the architecture. The proposed approach significantly reduces hardware resources compared to other approaches while maintaining a high degree of accuracy. The architecture is developed as a combined systolic-pipelined design that produces an output in every clock cycle after an initial latency of 93.19 µs. The architecture is capable of operating at a clock frequency of 99 MHz on Xilinx’s Virtex II 2v2000ff896-4 FPGA, and the observed throughput of the system is 99 MOPS (million outputs per second). Error analysis performed with the FPGA-based system on image processing examples of edge detection and noise filtering shows that the proposed architecture produces outputs similar to those obtained by software simulation using Matlab.
Keywords: Multi-dimensional convolution; Multiplier-less architecture; Logarithmic domain computation; Systolic-pipelined architecture; FPGA-based implementation
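The log-domain idea summarized in the abstract can be illustrated with a short software sketch: each product in the convolution sum is replaced by an addition of approximate base-2 logarithms followed by an approximate inverse logarithm. The sketch below is only an illustration under stated assumptions. It uses the classic Mitchell piecewise-linear approximation as a stand-in, which is not necessarily the approximation used by the paper's log2 and inverse-log2 modules; it handles integer operands only and does not model the paper's fraction handling; and all function names (`approx_log2`, `approx_exp2`, `log_domain_multiply`, `convolve_1d_log`) are hypothetical.

```python
import math


def approx_log2(n: int) -> float:
    """Mitchell-style estimate (an assumption, not the paper's module):
    log2(n) ~= k + f, where k is the leading-one position and f is the
    remaining bits read as a fraction in [0, 1)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1                # position of the leading one bit
    f = (n - (1 << k)) / (1 << k)         # in hardware this is a shift, not a divide
    return k + f


def approx_exp2(x: float) -> float:
    """Inverse of the estimate above: 2**x ~= (1 + f) * 2**k."""
    k = math.floor(x)
    f = x - k
    return (1.0 + f) * 2.0 ** k           # the power of two models a shift


def log_domain_multiply(a: int, b: int) -> float:
    """Approximate a * b with one addition in the log domain."""
    if a == 0 or b == 0:
        return 0.0                        # log2(0) is undefined, handle separately
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * approx_exp2(approx_log2(abs(a)) + approx_log2(abs(b)))


def convolve_1d_log(signal, kernel):
    """'Valid'-mode 1-D convolution using log-domain products."""
    m = len(kernel)
    flipped = kernel[::-1]                # convolution flips the kernel
    return [
        sum(log_domain_multiply(signal[i + j], flipped[j]) for j in range(m))
        for i in range(len(signal) - m + 1)
    ]
```

With power-of-two operands the estimate is exact, so `convolve_1d_log([1, 2, 4, 8], [2, 1])` matches an ordinary convolution. For other operands the product is underestimated: `log_domain_multiply(3, 3)` yields 8.0 rather than 9, which is the kind of approximation error that the error analysis described in the abstract quantifies for the actual hardware.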
Fig. 1. Overview of 1-D convolution in log-domain.
Fig. 2. An example to demonstrate the concept of log2 approximation.
Fig. 3. (a) Actual curve and estimated curve of log2(N) obtained from approximation technique and (b) percentage error of the difference between actual and estimated values.
Fig. 4. Block diagram of 2-D convolution architectures: (a) architecture with explicit pipelining of the adder tree and (b) architecture with the adder tree inherently pipelined into the leftmost column of PEs.
Fig. 5. Address generator for DPRAMs implementing line buffers.
Fig. 6. Architecture of processing elements.
Fig. 7. (a) Architecture of log2 and (b) mapping of multiplexers in MBS.
Fig. 8. (a) Architecture of inverse-log2 and (b) mapping of multiplexers in RMBS.
Fig. 9. Window sliding architecture is used for padding of the borders of the image.
Fig. 10. Edge detection with Laplacian kernel: (a) normalized Laplacian kernel, (b) grayscale input image, (c) 2-D convolution result by Matlab function, (d) 2-D convolution result by hardware simulation and (e) hardware simulation result scaled by 2.
Fig. 11. Plot of error from approximation: (a) difference error between the results of the Matlab double precision function and hardware implementation with Laplacian kernel (average error of 0.17 intensity) and (b) histogram of error with x-axis normalized by peak error at 3.5 intensity.
Fig. 12. Smoothening of image corrupted by Gaussian white noise: (a) grayscale input image corrupted by Gaussian white noise with m = 0 and σ² = 05, (b) image filtered by Matlab function and (c) image filtered by the proposed hardware.
Fig. 13. Plot of error from approximation: (a) difference error between the results of the Matlab double precision function and hardware implementation with Gaussian smoothening kernel (average error of 0.58 intensity) and (b) histogram of error with x-axis normalized by peak error at 2.59 intensity.
Fig. 14. Combined edge detection and smoothening of images using the proposed hardware with kernel coefficients: (a) scaled by 1 and (b) scaled by 2.
Fig. 15. Timing diagram of log2 and inverse-log2 modules.
Table 1.
Performance and hardware utilization of log2 architecture with various resolutions

Table 2.
Performance and hardware utilization of inverse-log2 architecture with various resolutions

Table 3.
Performance and hardware utilization of PE architecture with various resolutions

Table 4.
Hardware utilization for 8-bit architecture

Table 5.
Comparison of hardware resources and performance with other 2-D convolution implementations
a Number of logic gates is obtained based on fully optimized architecture with a 9-bit signed integer and an 8-bit fraction resolution.