Suitability of recent hardware accelerators (DSPs, FPGAs, and GPUs) for computer vision and image processing algorithms
Introduction
Computer vision and image processing algorithms are used in a variety of applications in experimental mechanics [1], medical technologies [2], and human action recognition [3]. Many of the algorithms that have been used in these applications are computationally demanding, and in practical applications it is necessary to rapidly analyse the data. One of the main techniques for decreasing computation time is to use hardware with high computational power. Although the processing power of the central processing units (CPUs) in personal computers (PCs) is increasing, it remains insufficient for many applications. In addition, PCs cannot be used for computer vision tasks in mobile or portable devices. Hardware accelerators (e.g. digital signal processors (DSPs), field programmable gate arrays (FPGAs), and graphics processing units (GPUs)) are designed to address the increasing need for performing fast calculations in complicated algorithms. Furthermore, some hardware accelerators can be used in portable systems where it is not feasible to use PC-based systems.
Although DSPs, FPGAs, and GPUs have markedly different chip architectures, requiring different software development techniques, each can be used as a hardware accelerator to speed up computations. Microarchitecture and fabrication technologies are rapidly evolving, and commercial competition has motivated major hardware accelerator vendors to update and increase the capabilities of their products using the latest technological advances. However, different hardware accelerators are designed in ways that make them efficient for some algorithms but not others. Furthermore, the choice of a hardware accelerator is typically a trade-off between computational power, speed, development time, power consumption, and price. Identifying a suitable hardware accelerator for a specific algorithm or application can thus be very challenging.
Previously published reviews have investigated different aspects of using hardware accelerators in computer vision and image processing tasks. These review papers can be divided into four main groups, which are discussed here.
In the first group of review papers, a specific algorithm or application is chosen and various hardware accelerators for that task are compared. An example is stereo vision algorithms for real-time systems, as in [4]. These review papers may help with the choice of a suitable hardware accelerator for specific applications. However, the system requirements can vary considerably for other applications or algorithms. For example, in some applications real-time execution is important (see [4]), while for other applications it may be adequate to simply increase the processing speed. The choice of a suitable hardware accelerator depends significantly on the application and the algorithm.
In the second group of reviews, specific hardware accelerators are chosen to test the performance of algorithms and their implementation. For instance, algorithm implementations for a single FPGA and a single GPU for sliding-window applications are discussed in [5]. These hardware-oriented reviews do not consider that newer technologies offer many advantages over their predecessors, which makes it harder for developers to identify suitable modern hardware accelerators for their own applications. Furthermore, a specific FPGA or a specific GPU does not necessarily represent the capability of that type of hardware accelerator in general. Therefore, these review papers may not help researchers to obtain an accurate comparison between hardware accelerators, unless they decide to choose a hardware accelerator specifically from those that have been reviewed.
In the third group of reviews, a broader application is chosen and different hardware accelerators are discussed for that purpose. Some examples are: parallel computing with multicore CPUs, FPGAs, and GPUs in experimental mechanics [6]; medical image processing on GPUs [[7], [8]]; and medical image registration on GPUs [9] or multicore CPUs and GPUs [10]. There are also some technical details about the chip architectures in these papers. Even though these papers can provide useful information, some of them (such as [[7], [8], [9], [10]]) only discuss GPUs and do not cover FPGAs or DSPs. In addition, the hardware details are usually limited to specific devices and are of limited use for comparing different hardware accelerators.
In the fourth group of reviews, the chip architecture and software tools of hardware accelerators are discussed in detail. An example is heterogeneous computing (i.e. the combination of CPUs with FPGAs or GPUs) for general applications [11]. Even though such reviews provide useful information, there is a need to update and simplify the technical details to provide practical advice for researchers on the choice of suitable hardware accelerators for computer vision and image processing applications.
This review combines the approach of the third and fourth groups of review papers described above. Our goal was to provide sufficient information and practical examples to enable researchers to choose the most suitable hardware accelerator for computer vision and image processing applications. To this end, DSPs, FPGAs, and GPUs are discussed in separate sections, followed by examples that demonstrate the performance of the various devices in different computer vision and image processing applications.
One of the main challenges in reviewing different hardware accelerators is to provide a fair comparison. Since the model names of DSPs, FPGAs, and GPUs are not indicative of their performance, a ‘speed normalisation’ factor was proposed [4] in an effort to improve the accuracy of comparison within the same chip architecture family. However, hardware accelerators are too complex for the performance comparison to be limited to processing speed alone, which cannot capture the advantage of one hardware accelerator over another, especially when the devices do not belong to the same family. Moreover, the processing speed of an algorithm depends not only on the hardware accelerator, but also on the programmer’s skill. In order to provide a practical comparison between hardware accelerators in this review, the most important features of DSPs, FPGAs, and GPUs for computer vision and image processing algorithms are introduced and discussed. Then, based on the technical specifications, hardware accelerators are divided into groups with similar levels of performance.
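To make the idea of speed normalisation concrete, the sketch below scales a reported runtime to a common reference clock within one chip-architecture family. This is only an illustrative first-order model in the spirit of the factor proposed in [4], not its actual definition; the function name, clock rates, and runtimes are hypothetical, and the model deliberately ignores memory bandwidth, core count, and programmer skill, which is precisely why such a factor cannot capture the full picture.

```python
def normalised_runtime(runtime_s, device_clock_mhz, reference_clock_mhz):
    """Scale a reported runtime to a common reference clock within the
    same chip-architecture family (first-order model: clock rate only)."""
    return runtime_s * (device_clock_mhz / reference_clock_mhz)

# Hypothetical example: two devices from the same family running the
# same stereo-matching kernel at different clock rates.
t_a = normalised_runtime(0.040, device_clock_mhz=200, reference_clock_mhz=100)
t_b = normalised_runtime(0.025, device_clock_mhz=100, reference_clock_mhz=100)
# After normalising out the clock advantage, device B is the faster design.
```

Even within one family, such a factor only adjusts for clock rate; it says nothing about architectural differences between families, which is why this review groups devices by their broader technical specifications instead.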
Another limitation of some review papers (such as [6]) is the discussion of outdated hardware technologies, which offer little help in assessing the performance and capabilities of modern hardware accelerators. This review addresses this issue by reporting on the latest improvements, and covers recent papers (published since 2009) with a focus on the latest hardware technologies.
This review is organised as follows. DSPs, FPGAs, and GPUs are discussed in Sections 2, 3, and 4, respectively. In each section, and for each hardware accelerator, different families, available development tools and utilities, development time, and the advantages and disadvantages of using the type of hardware accelerator are discussed. Each section concludes with a separate literature review and summary, and each literature review section presents separate tables with a summary of the application, algorithms being implemented, hardware type used, and performance (or data throughput) of the algorithm. In addition, the papers being reviewed are sorted chronologically and the year of introduction of FPGAs and GPUs (as an indicator of their hardware technology level) is reported. Since FPGAs and GPUs have both been widely used in computer vision and image processing tasks, Section 5 is devoted to the comparison of GPUs and FPGAs. Finally, Section 6 summarises this review.
Section snippets
Digital signal processors (DSPs)
DSPs are microprocessors with an architecture that is specifically designed for performing signal processing tasks. Texas Instruments (TI) and Analog Devices (AD) are the two major companies in the DSP production market. TI-DSPs are more common in the computer vision and image processing research community than AD-DSPs, so this review focuses on TI-DSPs.
TI has designed various DSPs with different processing power ranges and capabilities for different purposes. TI-DSPs can be divided into 4
Field-programmable gate arrays (FPGAs)
The FPGA chip incorporates arrays of reprogrammable logic gates. As opposed to CPUs, DSPs, and GPUs, FPGA fabrics do not have a pre-structured chip architecture or a central processing unit. Thus, prior to programming the reconfigurable FPGAs, the programmer should design a hardware architecture for their specific application using the logic gates inside the FPGA.
The FPGA hardware architecture is configured by interconnecting FPGA logic gates to perform a specific task, and requires
Graphics processing units (GPUs)
The first graphics accelerators were built for professional graphics workstations, such as the Infinite Reality for the Onyx series [105]. GPUs consist of many processing cores, and are accelerators that are optimised for performing fast matrix calculations in parallel (images are in the form of 2D matrices). These devices are typically very affordable, since their development is motivated by the gaming industry. GPUs are thus cost-effective hardware accelerators for massively parallel
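The data-parallel pattern that GPUs accelerate can be seen in almost any per-pixel image operation: the same arithmetic is applied independently to every element of a 2D matrix. The pure-Python sketch below shows the per-pixel logic of a simple contrast stretch as an illustration (the function and parameter names are our own); on a GPU, each pixel's computation would typically be mapped to a separate thread.

```python
def contrast_stretch(image, lo, hi):
    """Linearly map intensities in [lo, hi] to [0, 255], clamping the rest.
    Each output pixel depends only on its own input pixel, so every
    iteration of this loop could run in parallel on a GPU."""
    scale = 255.0 / (hi - lo)
    return [[min(255, max(0, round((p - lo) * scale))) for p in row]
            for row in image]

img = [[10, 50], [90, 130]]
out = contrast_stretch(img, lo=10, hi=130)
```

Because there are no data dependencies between pixels, the work divides evenly across thousands of GPU cores, which is why such operations map so naturally onto this class of hardware.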
Portability of software over different hardware
It is sometimes necessary to transfer code from one hardware accelerator to another of the same type, for example when upgrading to new-generation hardware, or when testing the code on another device. The transfer process may be challenging if the available code is crafted to take advantage of the specific architecture of the original hardware. In this section, we discuss how code can be transferred, and the potential challenges, for DSPs, FPGAs, and GPUs.
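One general way to reduce this porting burden, sketched below, is to keep the algorithm behind a thin, device-neutral interface so that only the backend must be rewritten when the hardware changes. This is an illustrative design sketch, not a technique prescribed by the review; the class and method names are hypothetical, and the CPU backend stands in for whatever device-specific implementation would exist in practice.

```python
class Backend:
    """Device-neutral interface: each accelerator supplies its own backend."""
    def convolve2d(self, image, kernel):
        raise NotImplementedError

class CpuBackend(Backend):
    """Reference implementation; a DSP, FPGA, or GPU port would replace
    only this class, leaving the algorithm code untouched."""
    def convolve2d(self, image, kernel):
        kh, kw = len(kernel), len(kernel[0])
        h, w = len(image) - kh + 1, len(image[0]) - kw + 1
        return [[sum(image[y + i][x + j] * kernel[i][j]
                     for i in range(kh) for j in range(kw))
                 for x in range(w)] for y in range(h)]

def run_pipeline(backend, image, kernel):
    # Algorithm-level code never touches device-specific details.
    return backend.convolve2d(image, kernel)

out = run_pipeline(CpuBackend(), [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   [[1, 0], [0, 1]])
# → [[6, 8], [12, 14]]
```

The cost of this separation is that highly tuned, architecture-specific optimisations tend to leak through any such interface, which is exactly the tension the following discussion of DSPs, FPGAs, and GPUs examines.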
Heterogeneous hardware accelerators
Heterogeneous hardware accelerators are designed to use the advantages of a hardware accelerator while offsetting its disadvantages by fusing its functionality with another hardware accelerator. For instance, as discussed in Section 4.5, one of the disadvantages of GPUs is the data transfer time between the host PC and the GPU. This time is decreased in heterogeneous CPU–GPU computing architectures, such as accelerated processing units (APUs) designed by AMD (formerly known as Fusion). An AMD
Comparison of FPGAs and GPUs for implementing image processing and computer vision algorithms
NVidia GPUs have been used more than FPGAs for high performance applications in recent years. For instance, the second fastest supercomputer in the world, named Titan, includes 18,688 NVidia Tesla GPUs, and has a processing power of more than 2 × 10^16 calculations per second [175].
Among computer vision and image processing algorithms, stereo vision algorithms are the most common application implemented in hardware accelerators. Tippetts et al. [4] reviewed the implementation of various stereo
Hardware accelerators designed for machine learning
In recent years, the application of machine learning techniques has been growing very rapidly. In particular, deep neural networks (i.e. deep learning) and convolutional neural networks have been used extensively in various applications. Image processing and computer vision applications have also taken advantage of machine learning [181] and deep learning [[182], [184]] techniques. GPUs are naturally suited to the implementation of neural networks because of the similarity between the
Summary and conclusions
In this review, we have provided practical information for selecting suitable hardware accelerators for computer vision and image processing algorithms. We discussed the hardware architectures of the most recent DSPs, FPGAs, and GPUs, and the important features of these hardware accelerators for computer vision and image processing algorithms. For each hardware accelerator, available tools and utilities, development time, advantages, and disadvantages were discussed in an attempt to help
References (191)
- et al., A review of 3D/2D registration methods for image-guided interventions, Med. Image Anal. (2012)
- A survey on vision-based human action recognition, Image Vis. Comput. (2010)
- et al., Parallel computing in experimental mechanics and optical measurement: A review, Opt. Lasers Eng. (2012)
- et al., Medical image processing on the GPU – Past, present and future, Med. Image Anal. (2013)
- et al., A fast stereo matching algorithm suitable for embedded real-time systems, Comput. Vis. Image Underst. (2010)
- et al., Evaluation of stereo correspondence algorithms and their implementation on FPGA, J. Syst. Archit. (2014)
- et al., A real-time global stereo-matching on FPGA, Microprocess. Microsyst. (2016)
- et al., Accelerating image boundary detection by hardware parallelism, Microprocess. Microsyst. (2014)
- et al., Run-time self-reconfigurable 2D convolver for adaptive image processing, Microelectronics J. (2011)
- et al., A real-time versatile roadway path extraction and tracking on an FPGA platform, Comput. Vis. Image Underst. (2010)
- FPGA based disparity map computation with vergence control, Microprocess. Microsyst.
- Computer vision-based, noncontacting deformation measurements in mechanics: A generational transformation, Appl. Mech. Rev.
- Review of stereo vision algorithms and their suitability for resource-limited systems, J. Real-Time Image Process.
- A survey of GPU-based medical image computing techniques, Quant. Imaging Med. Surg.
- A survey of medical image registration on graphics hardware, Comput. Methods Programs Biomed.
- A survey of medical image registration on multicore and the GPU, IEEE Signal Process. Mag.
- State-of-the-art in heterogeneous computing, Sci. Program.
- Stereo vision system for moving object detecting and locating based on CMOS image sensor and DSP chip, Pattern Anal. Appl.
- Robust motion estimation on a low-power multi-core DSP, EURASIP J. Adv. Signal Process.
- Highly efficient image registration for embedded systems using a distributed multicore DSP architecture, J. Real-Time Image Process.
- Trends in multicore DSP platforms, IEEE Signal Process. Mag.
- Domain-specific language for HW/SW Co-design for FPGAs
- High-level synthesis revised: Generation of FPGA accelerators from a domain-specific language using the polyhedron model, Parallel Comput. Accel. Comput. Sci. Eng.