Fast description and synthesis of control-dominant circuits☆
Introduction
A famous debate between Gene Amdahl and Daniel Slotnick on the feasibility of parallel computing dates back as far as 1967 [1], [2]. Nowadays, as the performance of single-threaded processors has stopped following Moore’s Law, multi-core processors have become a commodity, and the spectrum of high-performance parallel computing devices has never been so colourful. Initially aimed at delivering jaw-dropping 3D graphics, the Graphics Processing Units (GPUs) present in state-of-the-art video cards are increasingly used in the most advanced scientific applications, offering performance in the order of teraflops [3]. Field-Programmable Gate Arrays (FPGAs) can also leverage the billions of transistors made available by state-of-the-art integrated circuit (IC) fabrication processes, and are still riding on – what is left of – Moore’s Law. Having evolved from glue logic to processing devices in their own right, modern high-end FPGAs include hundreds of hard DSP and RAM memory blocks, tens of thousands of registers, and sometimes hard processors, and have thus gradually narrowed the gap between their performance and that of ASICs [4]. Unlike an ASIC, however, an FPGA can be reconfigured in a matter of minutes to implement any hardware design (fitting the device), possibly including configurable processors and their programs.
Recently, the Basic Linear Algebra Subroutines (BLAS) library was implemented on three different platforms [5]: an Intel Core 2 Duo E8500 processor, an Nvidia Tesla C1060 GPU, and a Virtex-5 FPGA (inside a BEE3 platform). Results show that the FPGA implementation provides the best energy efficiency in terms of GFLOPS per watt, while the GPU is the most power-hungry. In the field of matrix multiplication, high-performance GPU implementations deliver as much as 393 GFLOPS [6], while the FPGA implementations proposed in [7], [8], [9], [10], [11], [12], [13] deliver up to 29.8 GFLOPS. As the world goes more mobile and becomes more conscious of its impact on the environment, energy efficiency can be a strong driving force for the adoption of FPGAs as processing and co-processing units [14]. Nevertheless, while concurrent programming is becoming the de facto approach to fully exploit the potential of state-of-the-art programmable ICs, programming and debugging hardware remains inherently harder than software.
Despite important advances in the field of high-level C synthesis, most hardware designs are still described with Register-Transfer-Level (RTL) languages such as VHDL and Verilog. However, this abstraction level is simply too low to handle the millions and billions of transistors available in modern ICs. In this work, we consider a beyond-RTL abstraction level that treats assignments as transfers between data-synchronized sources and sinks, instead of simple register transfers. Using implicit predefined synchronization interfaces similar to Xilinx’s AXI4-Stream and Altera’s Avalon Streaming interfaces, this Synchronized-Transfer-Level (STL) abstraction frees the programmer from the tedious task of handling and specifying low-level synchronization control logic. A data transfer occurs between a connected source and sink when both are ready to send/receive.
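The transfer semantics above can be illustrated with a small behavioural model. The following Python sketch is our own illustration (the class and method names are hypothetical, not from the paper): as in AXI4-Stream or Avalon-ST style handshaking, a transfer fires on a clock cycle only when the source has data and the sink is ready to accept it.

```python
# Hypothetical software model of an STL synchronized channel: a transfer
# is authorized in a given cycle only when the source asserts "valid"
# and the sink asserts "ready" (back-pressure otherwise stalls it).

class Source:
    def __init__(self, data):
        self.queue = list(data)
    @property
    def valid(self):
        return bool(self.queue)      # data token available to send
    def pop(self):
        return self.queue.pop(0)

class Sink:
    def __init__(self, capacity):
        self.capacity = capacity
        self.received = []
    @property
    def ready(self):
        return len(self.received) < self.capacity   # room to accept
    def push(self, token):
        self.received.append(token)

def clock_cycle(src, snk):
    """One cycle: the transfer fires only when both ends agree."""
    if src.valid and snk.ready:
        snk.push(src.pop())
        return True    # transfer occurred this cycle
    return False       # back-pressure or no data: nothing happens

src, snk = Source([1, 2, 3]), Sink(capacity=2)
fired = [clock_cycle(src, snk) for _ in range(4)]
print(fired)           # [True, True, False, False]: sink fills up
print(snk.received)    # [1, 2]
```

The point of the abstraction is that none of this control logic appears in the designer's description: the valid/ready handshake is implicit in every connection.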
Our hardware description language (HDL) builds on the CASM language [15], in which Finite State Machines (FSMs) handle dynamic connections between data-token sources and sinks. The proposed STL abstraction adds a constraint-programming paradigm that allows the specification of logical rules constraining the authorization of data transfers over different connections [16]. Thanks to these authorization rules, behaviours such as synchronization, arbitration and constrained scheduling between concurrent connections can be described concisely. Nevertheless, the dynamic connection of interfaces supporting back-pressure (ready-to-receive signals) faces the pitfall of induced combinational loops, which result in unpredictable behaviour. The proposed methodology addresses this issue as well.
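To make the idea of authorization rules concrete, here is a minimal sketch of per-cycle transfer arbitration. It is our own simplification, not the paper's compiler: each connection is eligible when its source is valid and its sink is ready, and designer-supplied rules (here, mutual-exclusion pairs with a fixed priority order) constrain which eligible transfers may fire in the same cycle.

```python
# Illustrative per-cycle authorization: grant eligible transfers in
# priority order, skipping any connection that is mutually exclusive
# with one already granted (e.g. two connections sharing a sink port).

def authorize(eligible, exclusive_pairs, priority):
    """Return the set of connections allowed to fire this cycle."""
    granted = set()
    for conn in sorted(eligible, key=priority.index):
        if all((conn, g) not in exclusive_pairs and
               (g, conn) not in exclusive_pairs for g in granted):
            granted.add(conn)
    return granted

# Connections 'a' and 'b' share a port and must not fire together.
eligible = {"a", "b", "c"}
rules = {("a", "b")}
grants = authorize(eligible, rules, priority=["a", "b", "c"])
print(sorted(grants))   # ['a', 'c']: 'b' loses arbitration to 'a'
```

In the actual methodology such rules are declarative; the compiler derives the arbitration logic and checks that it introduces no combinational loop.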
This paper illustrates how our STL description and synthesis methodology can help a designer quickly implement the matrix multiplication application, which involves the design of a state-of-the-art floating-point accumulator relying on the delayed-buffering (DB) method [17]. At the featured abstraction level, such a design requires only a few state machines handling constrained connections between data-driven operators. Most of the hardware complexity is implicit and automatically handled by our compiler. The number of transfers authorized at each clock cycle is also optimized, taking into account the dependencies between the transfers and the constraints specified by the designer. The compiler is particularly helpful at optimizing cyclic dependencies, preventing the generation of combinational loops. The full description of our pipelined matrix multiplication circuit is simple enough to be implemented in hours instead of days, without any advanced knowledge of hardware design. Nevertheless, the performance obtained is similar to that reported by experts for the dedicated implementations already cited.
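The difficulty the DB accumulator addresses is that a pipelined floating-point adder has a multi-cycle latency, so a stream of operands cannot simply be folded into one register. The following behavioural model is our own rough simplification of that idea, not the circuit of [17]: incoming operands and in-flight partial sums are buffered, and two buffered values are issued to the adder pipeline whenever available, until a single total remains.

```python
from collections import deque

def accumulate(stream, latency=3):
    """Reduce a stream of values through a pipelined adder with the
    given latency, buffering operands and partial sums in between."""
    pending = deque(stream)   # values waiting to be added
    pipe = deque()            # (cycles_left, partial_sum) in the adder
    while len(pending) > 1 or pipe:
        # Advance the adder pipeline and retire completed sums.
        pipe = deque((c - 1, v) for c, v in pipe)
        while pipe and pipe[0][0] <= 0:
            pending.append(pipe.popleft()[1])
        # Issue a new addition when two operands are buffered.
        if len(pending) >= 2:
            a, b = pending.popleft(), pending.popleft()
            pipe.append((latency, a + b))
    return pending[0] if pending else 0.0

print(accumulate([0.5, 1.25, 2.0, 3.25]))   # 7.0
```

The real circuit must additionally keep accumulations belonging to different result elements separate; this sketch only shows why buffering is needed to hide the adder latency.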
The paper is organized as follows: Section 2 discusses related work in the field of high-level hardware synthesis. Section 3 presents the main features of the STL design methodology, such as predefined synchronization protocols and interfaces, transfer authorization rules, and the handling of combinational loops. Section 4 presents the application of the featured methodology to the design of a floating-point matrix multiplication hardware accelerator, supporting matrix sizes up to 1024 × 1024, and of a high-performance floating-point accumulator. Section 5 presents and discusses the results obtained for the FPGA implementations targeting both Virtex-5 and Stratix-III devices. Section 6 concludes this work.
Related work
Proposing useful abstractions for beyond-RTL hardware description languages remains an open challenge. The design of modern digital hardware applications involves the interconnection of hundreds of components or more, ranging from simple registers to complex multi-core devices. Despite decades of research and development, C/C++/SystemC high-level synthesis, while most beneficial for system-level design, verification and fast prototyping [18], [19], [20], has yet to deliver efficient…
STL design methodology
This section presents our STL design methodology, especially the synchronized-channel connections, our hardware description language (HDL), the handling of cyclic combinational dependencies and our optimizing compiler.
Application to matrix-multiplication
In this section, the presented STL design methodology is applied to the high-level description of a single-precision floating-point matrix multiplication hardware accelerator, exploiting the spatial and temporal parallelism available in modern FPGA devices. At the top-level, the proposed architecture manages the connections between multiple instances of a multiply-accumulate (MAC) processing element (PE), which is itself designed using our STL methodology as well.
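As a software reference for the architecture described above, the following sketch (our own illustration, not the paper's circuit) computes C = A × B by assigning each output element round-robin to one of a pool of multiply-accumulate (MAC) processing elements. In hardware the PEs run concurrently; here their work is simply serialized.

```python
# Illustrative MAC-based matrix multiplication reference model.
# Each output element C[i][j] requires k MAC steps; elements are
# dealt round-robin to n_pe processing elements.

def mac_matmul(A, B, n_pe=4):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    macs_per_pe = [0] * n_pe          # MAC operations issued to each PE
    for idx in range(n * m):
        i, j = divmod(idx, m)         # output element handled this turn
        pe = idx % n_pe               # round-robin PE assignment
        for p in range(k):
            C[i][j] += A[i][p] * B[p][j]   # one MAC step on PE `pe`
            macs_per_pe[pe] += 1
    return C, macs_per_pe

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C, loads = mac_matmul(A, B)
print(C)       # [[19.0, 22.0], [43.0, 50.0]]
print(loads)   # [2, 2, 2, 2]: work spread evenly over four PEs
```

In the STL description, the top-level FSM's role reduces to managing which operand streams are connected to which PE, with the handshaking and scheduling handled implicitly.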
Experimental results
This section presents and discusses the synthesis results obtained for the floating-point pipelined accumulator and the matrix multiplication hardware accelerator designed with the presented STL methodology. Synthesis results are obtained using Altera’s Quartus II IDE targeting a Stratix-III EP3SE260 FPGA, and Xilinx’s ISE with the XST synthesizer targeting a Virtex-5 XC5VSX240 FPGA. The implementations rely on configurable floating-point operators (adder and multiplier) available in the altera_mf and…
Conclusion
The millions of reconfigurable gates available in modern FPGAs make them increasingly difficult to program at the RTL level. This abstraction level is error-prone, time-consuming, and mostly reserved for experts in the field of hardware design. Many approaches have been proposed for the automated mapping of C/C++/SystemC code to FPGAs. Nevertheless, control-dominant applications cannot benefit from such tools, which deliver their best performance for predictable dataflow graphs.
As a
Acknowledgement
The authors are grateful to the Natural Sciences and Engineering Research Council of Canada (NSERC) for its financial support.
Marc-André Daigneault received the degree in electrical engineering from Ecole Polytechnique de Montreal, Canada, in 2007, where he is currently a Ph.D. student. His research interests include reconfigurable and parallel computing, high-level synthesis and digital system design.
References (28)
- Unconventional systems
- Validity of the single processor approach to achieving large scale computing capabilities
- et al. Comparing hardware accelerators in scientific applications: a case study. IEEE Trans Parallel Distrib Syst (2011)
- et al. Measuring the gap between FPGAs and ASICs. IEEE Trans Comput-Aided Des Integr Circ Syst (2007)
- Kestur S, Davis J, Williams O. BLAS comparison on FPGA, CPU and GPU. In: 2010 IEEE Computer Society Annual Symposium on…
- Cui X, Chen Y, Mei H. Improving performance of matrix multiplication and FFT on GPU. In: 2009 15th International…
- Dave N, Fleming K, King M, Pellauer M, Vijayaraghavan M. Hardware acceleration of matrix multiplication on a Xilinx…
- et al. Energy- and time-efficient matrix multiplication on FPGAs. IEEE Trans Very Large Scale Integr (VLSI) Syst (2005)
- et al. Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems. IEEE Trans Parallel Distrib Syst (2007)
- Govindu G, Zhuo L, Choi S, Prasanna V. Analysis of high-performance floating-point arithmetic on FPGAs. In:…
- Beyond traditional microprocessors for geoscience high-performance computing applications. IEEE Micro
Cited by (3)
- Design of FPGA-Based Mealy FSMs with Counters. 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), 2019
- Automated synthesis of streaming transfer level hardware designs. ACM Transactions on Reconfigurable Technology and Systems, 2018
- Intermediate-Level Synthesis of a Gauss-Jordan Elimination Linear Solver. Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2015
Jean Pierre David received the degree in electrical engineering from the Universite de Liege, Belgium, in 1995 and the Ph.D degree from the Universite Catholique de Louvain, Belgium, in 2002. He is now an Associate Professor with the Ecole Polytechnique de Montreal, Canada. His research interests include reconfigurable computing, high-level synthesis, digital system design and their applications.
☆ Reviews processed and recommended for publication to Editor-in-Chief by Associate Editor Dr. Rene Cumplido.