Performance effects of pipeline architecture on an FPGA-based binary32 floating point multiplier

https://doi.org/10.1016/j.micpro.2013.08.007

Abstract

High pipeline depth architectures, with more than five pipeline stages, are rarely adopted in existing multipliers for real-world applications. In this paper, a field programmable gate array (FPGA) based binary32 floating point multiplier (FPM) is presented that supports a variety of pipeline depths, and the effects of the pipeline architecture are investigated. The pipeline architecture is formulated based on the radix-4 Booth recoding approach, an improved Wallace tree, and partial product accumulation. A detailed, quantitative investigation of the proposed architecture on both cutting-edge Xilinx and Altera devices shows that pipeline depth affects maximum running frequency much more than power consumption, and that the pipeline depth should be limited to obtain the maximum running frequency for a binary32 FPM on both target devices, which is consistent with previous studies. Meanwhile, the study demonstrates that the pipeline depth needed to reach peak performance is lower than that required years ago on FPGA devices with 4-input LUTs.

Introduction

Compared to fixed point numbers, floating point numbers have been adopted more frequently in scientific computation because of the higher accuracy afforded by their wider dynamic range. A revised IEEE standard defines the binary32, binary64, and binary128 binary formats [1], whose efficient hardware implementation forms a critical part of many processors, especially floating point arithmetic processors, in order to achieve higher performance, smaller area, and/or lower power consumption. Among all arithmetic operations, multiplication is crucial because of its higher complexity and longer computation time compared with other operations such as addition. Therefore, efficient implementations of floating point multipliers, particularly hardware implementations, are essential to scientific computing.

Over the past decades, much effort has been dedicated to specific hardware designs that improve the performance of floating point computation at both the algorithmic and implementation levels [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. Some of these works focused on implementations on FPGA platforms [2], [3], [4], [5], [8], [9], [11], [12], [13], which face bottlenecks in speed, energy consumption, etc. due to the high computational complexity of floating point multiplication. In particular, pipelining is a popular approach to enhancing circuit performance, especially computing throughput. However, high pipeline depth architectures are not very popular in existing designs for real-world applications, even for FPGA-based lower speed implementations, because considerable effort is needed to optimize each pipeline stage and because there is a pipeline depth limit beyond which performance, especially speed, degrades. Meanwhile, the performance effects on such implementations remain an open problem that lacks sufficient quantitative study.

Earlier works [14], [15], [16] provided a good starting point for understanding the performance effects of pipeline architecture. However, about 10 years have passed, and both FPGA architecture and process technology have changed considerably, so the performance effects need to be revisited.

In this paper, we investigate the performance effects of pipeline architecture on a binary32 floating point multiplier (FPM). An FPGA-based multiplier is presented that supports a variety of pipeline architectures. The pipeline architecture is formulated based on the radix-4 Booth recoding algorithm, an optimal Wallace tree, and partial product accumulation. We carry out two types of experiments to demonstrate the effects of pipeline architecture quantitatively. Our investigation clarifies the pipeline depth limit beyond which performance degrades on cutting-edge FPGA devices for a binary32 FPM.
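To make the role of radix-4 Booth recoding concrete, the following is a minimal software sketch, not the hardware design of this paper. It recodes an unsigned multiplier into digits from {-2, -1, 0, 1, 2} and forms one partial product per digit. The function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and a plain sum stands in for the Wallace tree reduction.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode an unsigned `width`-bit integer into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}; digit k carries weight 4**k.
    """
    if not 0 <= multiplier < (1 << width):
        raise ValueError("multiplier does not fit in the given width")
    # Zero-extend by at least one bit so the unsigned value reads as a
    # non-negative two's-complement number, then round up to an even length.
    n = width + 1
    if n % 2:
        n += 1
    x = multiplier << 1  # append the implicit bit b[-1] = 0
    digits = []
    for i in range(0, n, 2):
        b_prev = (x >> i) & 1        # b[i-1]
        b_cur = (x >> (i + 1)) & 1   # b[i]
        b_next = (x >> (i + 2)) & 1  # b[i+1]
        digits.append(-2 * b_next + b_cur + b_prev)
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int = 24) -> int:
    """Multiply by summing one shifted partial product per Booth digit."""
    partial_products = [
        d * (multiplicand << (2 * k))
        for k, d in enumerate(booth_radix4_digits(multiplier, width))
    ]
    # Hardware would compress these rows with a Wallace tree of carry-save
    # adders; a plain sum stands in for that reduction here (an assumption
    # of this sketch, not the paper's circuit).
    return sum(partial_products)
```

For a 24-bit significand this recoding yields 13 partial products, roughly half of the 24 rows that a plain shift-and-add scheme would need to accumulate.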

Section snippets

Binary32 numbers

Floating point numbers were originally defined by the IEEE 754-1985 standard and then revised in IEEE 754-2008 [1], which specifies three binary number formats: 32-bit, 64-bit, and 128-bit. For simplicity, only the binary32 format is adopted in this paper for the investigation of an FPM design. The binary32 format is defined as shown in Fig. 1.

For a binary32 number N, an extra bit is added to the fraction to form a significand. Its normalized number can be
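As an illustration of the format, the sketch below splits a binary32 value into its 1-bit sign, 8-bit biased exponent, and 23-bit fraction, and then prepends the extra (hidden) bit to form the 24-bit significand. It is a software illustration only; the helper names `decode_binary32` and `significand` are hypothetical, and only normal numbers (biased exponent between 1 and 254) are handled with the hidden bit.

```python
import struct


def decode_binary32(x: float) -> tuple[int, int, int]:
    """Return (sign, biased_exponent, fraction) of x rounded to binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, bias 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction


def significand(exponent: int, fraction: int) -> int:
    """Form the 24-bit significand of a normal number (hidden bit = 1)."""
    if not 0 < exponent < 255:
        raise ValueError("only normal numbers are handled in this sketch")
    return (1 << 23) | fraction


sign, exponent, fraction = decode_binary32(-2.5)
sig = significand(exponent, fraction)
# For a normal number: value = (-1)**sign * (sig / 2**23) * 2**(exponent - 127)
value = (-1) ** sign * (sig / 2**23) * 2.0 ** (exponent - 127)
```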

Architecture of multiplication on binary32 numbers

The proposed binary32 multiplier uses units similar to those in [8], in which the significand multiplication unit is responsible both for multiplying the unsigned significands and for placing the binary point in the product. Because it performs these two functions, the efficiency of this unit dominates the performance of binary32 multiplication. Therefore, it deserves most of our design effort. Its detailed design is given as follows. On
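The two functions of a significand multiplication unit can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the unit proposed here: the function name `multiply_significands` is hypothetical, inputs are 24-bit normalized significands with the hidden bit set, and the result is truncated to 24 bits instead of being rounded as IEEE 754 requires.

```python
def multiply_significands(sig_a: int, sig_b: int) -> tuple[int, int]:
    """Multiply two 24-bit normalized significands (hidden bit set).

    Returns (significand, exponent_increment). The result is truncated to
    24 bits; IEEE 754 rounding is deliberately left out of this sketch.
    """
    for s in (sig_a, sig_b):
        if not (1 << 23) <= s < (1 << 24):
            raise ValueError("expected a normalized 24-bit significand")
    product = sig_a * sig_b  # up to 48 bits, representing a value in [1, 4)
    if product >> 47:
        # Value in [2, 4): shift the binary point one place to the left
        # and let the exponent path add one.
        return product >> 24, 1
    # Value in [1, 2): the product is already normalized.
    return product >> 23, 0
```

For example, the significands of 1.5 and 1.5 (both `0xC00000`) multiply to 2.25, so the sketch returns `0x900000` (1.125) together with an exponent increment of 1.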

Pipeline design

Conventional floating point multipliers generally use only a few pipeline stages; that is, the pipeline depth is normally less than five in real-world applications. There are two reasons for this. First, a deeper pipeline architecture takes more chip resources. Second, once the throughput reaches its peak, additional pipeline stages do not increase the maximum running frequency. For instance, some designs do not apply any pipeline features [3], [4], [6], [7], [9], [10], [11], [12], [13], while

Implementation platform

In the following experiments, our design targets the Xilinx Virtex-5 XC5VLX50 (speed grade -3) with the FF676 package and the Altera Cyclone II EP2C35F672C6 (FBGA package, speed grade 6, 33216 logic elements), respectively. The implementation on the Xilinx device is developed using ISE Design Suite 10.1, and power is estimated by XPower Analyzer 10.1. The implementation on the Altera device is developed using Altera Quartus II 7.2, and power is estimated by the PowerPlay Early Power Estimator version 1.0.

Verification is carried out by

Conclusion

What are the performance effects of pipeline architecture, especially the number of pipeline stages, on a binary32 FPM based on FPGA devices? Experience shows that deeply pipelined architectures should be considered carefully. However, what concrete evidence is there, especially now that architecture and technology have changed considerably? Our investigation provides a novel view on this question, in particular through a quantitative approach on both Xilinx and Altera devices. The most important observation on a binary32

Acknowledgements

This work was supported in part by the National Foundation of Science of China under Grant No. 61072135, and the Wuhan Science and Technology Bureau under Grant No. 201110921295.

References (19)

  • IEEE-SA Standards Board, IEEE standard for Floating Point Arithmetic (2008)
  • G. Marcus, P. Hinojosa, A. Avila, J. Nolazco-Flores, A fully synthesizable single-precision, floating-point...
  • M. Taher, M. Aboulwafa, A. Abdelwahab, E. M. Saad, High-speed, area-efficient FPGA-based floating-point arithmetic...
  • S.V. Siddamal, R.M. Banakar, B.C. Jinaga, Design of high-speed floating point multiplier, in: Proc. of 4th IEEE Int....
  • P. Karlstrom et al., High-performance, low-latency field-programmable gate array-based floating-point adder and multiplier units in a Virtex 4, IET Computers and Digital Techniques (July 2008)
  • E. Quinnell et al., Bridge floating-point fused multiply-add design, IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2008)
  • D. Tan et al., Low-power multiple-precision iterative floating-point multiplier with SIMD support, IEEE Transactions on Computers (Feb. 2009)
  • R.X. Gong, S.J. Zhang, H.N. Zhang, X.B. Meng, W.Y. Gong, L.L. Xie, Y. Huang, Hardware implementation of a high speed...
  • L.S.A. Hamid, K. Shehata, H. El-Ghitani, M. ElSaid, Design of generic floating point multiplier and adder/subtractor...
There are more references available in the full text version of this article.

Xianyang Jiang received the B.S. degree in safety engineering from Shenyang Institute of Aeronautic Engineering, Shenyang, China, in 1995, and the M.S. degree in physical electronics and optoelectronics and the Ph.D. degree in pattern recognition and intelligent systems from Huazhong University of Science and Technology, China, in 1998 and 2003, respectively. He has worked at the Institute of Computing Technology, CAS. From 2004 to 2005, he was a postdoctoral fellow at INRIA, France. He is currently an associate professor at the Institute of Microelectronics and Information Technology, Wuhan University, China. His research interests include computer architecture, VLSI design, reconfigurable computing, and bioinformatics.

Peng Xiao received the B.S. degree from Wuhan University in 2010. He is now a master's student in the School of Physics and Technology, Wuhan University, China.

Meikang Qiu (SM’07) received the BE and ME degrees from Shanghai Jiao Tong University, China. He received the MS and PhD degrees in computer science from the University of Texas at Dallas, in 2003 and 2007, respectively. He has worked at Chinese Helicopter R&D Institute and IBM. He is currently an assistant professor of ECE at the University of Kentucky. He has published more than 100 peer reviewed papers, including 35 journal papers. He has served as a chair and TPC member for many international conferences. He served as the Program Chair of IEEE EmbeddCom ’09 and EM-Com ’09. He received the Air Force Summer Faculty Award in 2009. He won three best paper awards (IEEE Embedded and Ubiquitous Computing (EUC ’09), IEEE/ACM Green-Com ’10, and IEEE CSE ’10) and one best paper nomination. His research interests include embedded systems, computer security, and wireless sensor networks. He is a senior member of the IEEE.

Gaofeng Wang (S’93–M’95–SM’01) received the Ph.D. degree in electrical engineering from the University of Wisconsin–Milwaukee, in 1993, and the Ph.D. degree in scientific computing from Stanford University, Stanford, CA, in 2001. From 1988 to 1990, he was with the Department of Space Physics, Wuhan University, Hubei, China. From 1990 to 1993, he was a Research Assistant with the Department of Electrical Engineering, University of Wisconsin–Milwaukee. From 1993 to 1996, he was a Scientist with Tanner Research, Inc., Pasadena, CA. From 1996 to 2001, he was a Principal Engineer with Synopsys Inc., Mountain View, CA. In Summer 1999, he was a consultant to Bell Laboratories, Murray Hill, NJ. From 2001 to 2003, he was the Chief Technology Officer (CTO) with Intpax Inc., San Jose, CA. Since 2004, he has been the Chief Technology Officer (CTO) with Siargo Inc., San Jose, CA. He is also currently a Professor and the Head of the CJ Huang Information Technical Research Institute, Wuhan University. He has authored or coauthored over 120 publications. He holds six U.S. patents. His research and development interests include integrated circuit (IC) and microelectromechanical systems (MEMS) design and simulation, computational electromagnetics, electronic design automation, and wavelet applications in engineering.
