Fast description and synthesis of control-dominant circuits

https://doi.org/10.1016/j.compeleceng.2014.02.011

Abstract

General-purpose processors, graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) compete and collaborate to offer ever-increasing performance. Nevertheless, despite fruitful decades of research, FPGAs remain much harder to exploit than processor-based approaches. It is now possible to automatically map C/C++/SystemC algorithms into circuits. However, exploiting fine-grain parallelism for control-dominant applications is still reserved to highly specialized hardware designers. This paper presents the application of our synchronized-transfer-level hardware design methodology to the implementation of pipelined floating-point operators. The methodology builds on a hardware description language in which the designer manages dynamic connections between data token sources and sinks. A compiler automates the generation and optimization of the synchronization logic, whose low-level complexity is thus hidden from the designer. Applied to the design of a floating-point matrix multiplication hardware accelerator, the proposed methodology achieves computing performance similar to that of the dedicated designs reported in the literature, but with shorter design times (hours instead of days), simpler source code and no need for advanced hardware design skills.

Introduction

A famous debate between Gene Amdahl and Daniel Slotnick on the feasibility of parallel computing dates back as far as 1967 [1], [2]. Nowadays, as the performance of single-threaded processors has stopped following Moore’s Law, multi-core processors have become a commodity, and the spectrum of high-performance parallel computing devices has never been so colourful. Initially aimed at delivering jaw-dropping 3D graphics, the Graphics Processing Units (GPUs) present in state-of-the-art video cards are increasingly used in the most advanced scientific applications, offering performance in the order of teraflops [3]. Field-Programmable Gate Arrays (FPGAs) can also leverage the billions of transistors made available by state-of-the-art integrated circuit (IC) fabrication processes, and are still riding on – what’s left of – Moore’s Law. Having evolved from glue logic to processing devices in their own right, modern high-end FPGAs include hundreds of hard DSP and RAM memory blocks, tens of thousands of registers, and sometimes hard processors, and have thus gradually narrowed the performance gap with ASICs [4]. Unlike an ASIC, however, an FPGA can be reconfigured in a matter of minutes to implement any hardware design (fitting the device), possibly including configurable processors and their programs.

Recently, the Basic Linear Algebra Subroutines (BLAS) library was implemented on three different platforms [5]: an Intel Core 2 Duo E8500 processor, an Nvidia Tesla C1060 GPU, and a Virtex-5 FPGA (inside a BEE3 platform). Results show that the FPGA implementation provides the best energy efficiency in terms of GFLOPS per watt, while the GPU is the most power hungry. For matrix multiplication, high-performance GPU implementations deliver as much as 393 GFLOPS [6]. The FPGA implementations of matrix multiplication proposed in [7], [8], [9], [10], [11], [12], [13] deliver up to 29.8 GFLOPS. As the world goes more mobile and becomes more conscious of its impact on the environment, energy efficiency can be a strong driving force behind the adoption of FPGAs as processing and co-processing units [14]. Nevertheless, while concurrent programming is becoming the de facto approach to fully exploit the potential of state-of-the-art programmable ICs, programming and debugging hardware remains inherently harder than software.

Despite important advances in the field of high-level C synthesis, most hardware designs are still described with Register-Transfer-Level (RTL) languages such as VHDL and Verilog. However, this abstraction level is simply too low to handle the millions and billions of transistors available in modern ICs. In this work, we consider a beyond-RTL abstraction level in which assignments are transfers between data-synchronized sources and sinks, instead of simple register transfers. Using implicit predefined synchronization interfaces similar to Xilinx’s AXI4-Stream and Altera’s Avalon Streaming interfaces, this Synchronized-Transfer-Level (STL) abstraction frees the programmer from the tedious task of handling and specifying low-level synchronization control logic. A data transfer occurs between a connected source and sink when both are ready to send and receive, respectively.
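To make the transfer semantics concrete, the following minimal C++ sketch models a single synchronized channel at the cycle level: a transfer fires only in cycles where the source asserts valid and the sink asserts ready, as in AXI4-Stream-style handshakes. The names and structure are ours for illustration; they are not the paper's actual STL syntax.

```cpp
// Minimal cycle-level sketch of a ready/valid (AXI4-Stream-like) channel.
// Names and structure are illustrative, not the paper's STL syntax.
#include <cstdint>
#include <iostream>
#include <queue>

struct Channel {
    bool     valid = false;  // source asserts: data is available
    bool     ready = false;  // sink asserts: able to accept data
    uint32_t data  = 0;

    // A transfer happens only in cycles where both sides agree.
    bool fires() const { return valid && ready; }
};

int main() {
    std::queue<uint32_t> producer({1, 2, 3});
    Channel ch;
    for (int cycle = 0; cycle < 6; ++cycle) {
        ch.valid = !producer.empty();             // source side
        if (ch.valid) ch.data = producer.front();
        ch.ready = (cycle % 2 == 0);              // sink stalls every other cycle
        if (ch.fires()) {                         // synchronized transfer
            std::cout << "cycle " << cycle << ": got " << ch.data << '\n';
            producer.pop();
        }
    }
}
```

Note how neither side needs to know why the other stalls: back-pressure propagates through the ready signal alone, which is what lets the compiler insert the synchronization logic automatically.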

Our hardware description language (HDL) builds on the CASM language [15], in which Finite State Machines (FSMs) handle dynamic connections between data token sources and sinks. The proposed STL abstraction adds a constraint programming paradigm that allows the specification of logical rules constraining the authorization of data transfers over different connections [16]. Thanks to these authorization rules, behaviors such as synchronization, arbitration and constrained scheduling between concurrent connections can be described concisely. Nevertheless, dynamically connecting interfaces that support back-pressure (ready-to-receive signals) faces the pitfall of induced combinational loops, which result in unpredictable behavior. The proposed methodology addresses this issue as well.
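The following C++ sketch illustrates the kind of decision such authorization rules express, using two hypothetical rules over three connections: a joint-firing synchronization rule and a mutual-exclusion arbitration rule. It models the decision logic in software only; the actual STL rule language and the synchronization hardware its compiler generates are not shown in this excerpt.

```cpp
// Illustrative model of transfer-authorization rules over concurrent
// connections (our naming; the paper's STL language expresses these
// as logical rules compiled into synchronization hardware).
#include <array>
#include <iostream>

struct Conn { bool valid; bool ready; };
inline bool can_fire(const Conn& c) { return c.valid && c.ready; }

// One evaluation step: decide which of three connections A, B, C are
// authorized this cycle under two example rules:
//   rule 1 (synchronization): A and B may only fire together;
//   rule 2 (arbitration):     A/B and C are mutually exclusive, A/B wins.
std::array<bool, 3> authorize(const std::array<Conn, 3>& c) {
    bool ab = can_fire(c[0]) && can_fire(c[1]);  // joint firing of A and B
    bool cc = can_fire(c[2]) && !ab;             // C only if A/B does not fire
    return {ab, ab, cc};
}

int main() {
    std::array<Conn, 3> conns = {{ {true, true}, {true, true}, {true, true} }};
    auto grant = authorize(conns);
    std::cout << "A:" << grant[0] << " B:" << grant[1] << " C:" << grant[2] << '\n';
    // prints A:1 B:1 C:0 -- A and B fire together, C is arbitrated out
}
```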

This paper illustrates how our STL description and synthesis methodology helps a designer quickly implement the matrix multiplication application, which involves the design of a state-of-the-art floating-point accumulator relying on the delayed-buffering (DB) method [17]. At the featured abstraction level, such a design requires only a few state machines handling constrained connections between data-driven operators. Most of the hardware complexity is implicit and handled automatically by our compiler. The number of transfers authorized at each clock cycle is also optimized, taking into account the dependencies between transfers and the constraints specified by the designer. The compiler is particularly helpful at optimizing cyclic dependencies, preventing the generation of combinational loops. The full description of our pipelined implementation of the matrix multiplication circuit is simple enough to be written in hours instead of days, without any advanced knowledge of hardware design. Nevertheless, the performance obtained is similar to that reported by experts for the dedicated implementations already cited.
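For intuition, the sketch below models the hazard that any pipelined floating-point accumulator must resolve: with an adder of latency L, a new addend can arrive every cycle, but the running sum is not available again for L cycles. The workaround shown, interleaving L independent partial sums, is a generic software illustration of the problem; it is explicitly not the delayed-buffering method of [17], whose details are beyond this excerpt.

```cpp
// The hazard a pipelined floating-point accumulator must solve: with an
// adder of latency L, a new addend arrives every cycle but the previous
// sum is not ready for L cycles. Shown here is a generic interleaved
// partial-sums workaround as a software model; this is NOT the
// delayed-buffering method of [17].
#include <iostream>
#include <vector>

constexpr int L = 8;  // assumed adder pipeline latency

float accumulate(const std::vector<float>& xs) {
    float partial[L] = {0.0f};
    for (size_t i = 0; i < xs.size(); ++i)
        partial[i % L] += xs[i];   // each bank is touched once every L cycles,
                                   // so the adder pipeline never stalls
    float sum = 0.0f;              // final reduction of the L banks
    for (float p : partial) sum += p;
    return sum;
}

int main() {
    std::vector<float> xs(100, 1.0f);
    std::cout << accumulate(xs) << '\n';  // 100
}
```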

The paper is organized as follows: Section 2 discusses related work in the field of high-level hardware synthesis. Section 3 presents the main features of the STL design methodology, such as predefined synchronization protocols and interfaces, transfer authorization rules, and the handling of combinational loops. Section 4 presents the application of the featured methodology to the design of a floating-point matrix multiplication hardware accelerator supporting matrix sizes up to 1024 × 1024, and of a high-performance floating-point accumulator. Section 5 presents and discusses the results obtained for the FPGA implementation targeting both Virtex-5 and Stratix-III devices. Section 6 concludes this work.

Section snippets

Related work

Proposing useful abstractions for beyond-RTL hardware description languages remains an open challenge. The design of modern digital hardware applications involves the interconnection of hundreds of components or more, ranging from simple registers to complex multi-core devices. Despite decades of research and development, C/C++/SystemC high-level synthesis, while most beneficial to system-level design and verification and to fast prototyping [18], [19], [20], has yet to deliver efficient

STL design methodology

This section presents our STL design methodology, especially the synchronized-channel connections, our hardware description language (HDL), the handling of cyclic combinational dependencies and our optimizing compiler.

Application to matrix-multiplication

In this section, the presented STL design methodology is applied to the high-level description of a single-precision floating-point matrix multiplication hardware accelerator, exploiting the spatial and temporal parallelism available in modern FPGA devices. At the top level, the proposed architecture manages the connections between multiple instances of a multiply-accumulate (MAC) processing element (PE), which is itself designed using the STL methodology.
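As a functional reference, the following C++ model shows the arithmetic each MAC PE performs, namely the accumulation of one row-by-column dot product. This models the computation only; the STL description additionally manages the dynamic, data-synchronized connections that feed the PEs, which a software model does not capture.

```cpp
// Functional reference for the accelerator's arithmetic: each MAC PE
// accumulates one dot product (a row of A against a column of B). The
// STL design additionally handles the streaming connections to the PEs.
#include <iostream>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// One multiply-accumulate processing element computing a single C entry.
float mac_pe(const Matrix& A, const Matrix& B, size_t row, size_t col) {
    float acc = 0.0f;                      // in hardware: pipelined FP accumulator
    for (size_t k = 0; k < B.size(); ++k)
        acc += A[row][k] * B[k][col];      // one MAC per incoming token pair
    return acc;
}

Matrix matmul(const Matrix& A, const Matrix& B) {
    Matrix C(A.size(), std::vector<float>(B[0].size()));
    for (size_t i = 0; i < C.size(); ++i)         // in hardware, rows and columns
        for (size_t j = 0; j < C[0].size(); ++j)  // are streamed to parallel PEs
            C[i][j] = mac_pe(A, B, i, j);
    return C;
}

int main() {
    Matrix A = {{1, 2}, {3, 4}}, B = {{5, 6}, {7, 8}};
    auto C = matmul(A, B);
    std::cout << C[0][0] << ' ' << C[1][1] << '\n';  // 19 50
}
```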

Experimental results

This section presents and discusses the synthesis results obtained for the pipelined floating-point accumulator and the matrix multiplication hardware accelerator designed with the presented STL methodology. Synthesis results are obtained using Altera’s Quartus II IDE targeting a Stratix-III EP3SE260 FPGA, and Xilinx’s ISE with the XST synthesizer targeting a Virtex-5 XC5VSX240 FPGA. The implementations rely on configurable floating-point operators (adder and multiplier) available in the altera_mf and

Conclusion

The millions of reconfigurable gates available in modern FPGAs make them increasingly difficult to program at the RTL level. This abstraction level is error-prone, time consuming, and mostly reserved to experts in the field of hardware design. Many approaches have been proposed for the automated mapping of C/C++/SystemC code to FPGAs. Nevertheless, control-dominant applications cannot benefit from such tools, which deliver their best performance for predictable dataflow graphs.

As a

Acknowledgement

The authors are grateful to the Natural Sciences and Engineering Research Council of Canada (NSERC) for its financial support.

References (28)

  • D.L. Slotnick. Unconventional systems.
  • G.M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities.
  • R. Weber et al. Comparing hardware accelerators in scientific applications: a case study. IEEE Trans Parallel Distrib Syst (2011).
  • I. Kuon et al. Measuring the gap between FPGAs and ASICs. IEEE Trans Comput-Aided Des Integr Circ Syst (2007).
  • Kestur S, Davis J, Williams O. BLAS comparison on FPGA, CPU and GPU. In: 2010 IEEE computer society annual symposium on...
  • Cui X, Chen Y, Mei H. Improving performance of matrix multiplication and FFT on GPU. In: 2009 15th International...
  • Dave N, Fleming K, King M, Pellauer M, Vijayaraghavan M. Hardware acceleration of matrix multiplication on a Xilinx...
  • J.W. Jang et al. Energy- and time-efficient matrix multiplication on FPGAs. IEEE Trans Very Large Scale Integr (VLSI) Syst (2005).
  • L. Zhuo et al. Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems. IEEE Trans Parallel Distrib Syst (2007).
  • Govindu G, Zhuo L, Choi S, Prasanna V. Analysis of high-performance floating-point arithmetic on FPGAs. In:...
  • Kumar V, Joshi S, Patkar S, Narayanan H. FPGA based high performance double-precision matrix multiplication. In: 2009...
  • Jiang J, Mirian V, Tang KP, Chow P, Xing Z. Matrix multiplication based on scalable macro-pipelined FPGA accelerator...
  • Holanda B, Pimentel R, Barbosa J, Camarotti R, Silva-Filho A, Joao L, et al. An FPGA-based accelerator to speed-up...
  • O. Lindtjorn et al. Beyond traditional microprocessors for geoscience high-performance computing applications. IEEE Micro (2011).

Marc-André Daigneault received the degree in electrical engineering from Ecole Polytechnique de Montreal, Canada, in 2007, where he is currently a Ph.D. student. His research interests include reconfigurable and parallel computing, high-level synthesis and digital system design.

Jean Pierre David received the degree in electrical engineering from the Universite de Liege, Belgium, in 1995 and the Ph.D. degree from the Universite Catholique de Louvain, Belgium, in 2002. He is now an Associate Professor with the Ecole Polytechnique de Montreal, Canada. His research interests include reconfigurable computing, high-level synthesis, digital system design and their applications.

    Reviews processed and recommended for publication to Editor-in-Chief by Associate Editor Dr. Rene Cumplido.
