A Study of Pointer-Chasing Performance on Shared-Memory Processor-FPGA Systems

Authors:
Gabriel Weisz (Carnegie Mellon University, Pittsburgh, PA, USA)
Joseph Melber (Carnegie Mellon University, Pittsburgh, PA, USA)
Yu Wang (Carnegie Mellon University, Pittsburgh, PA, USA)
Kermin Fleming (Intel Corporation, Hudson, MA, USA)
Eriko Nurvitadhi (Intel Corporation, Hillsboro, OR, USA)
James C. Hoe (Carnegie Mellon University, Pittsburgh, PA, USA)

FPGA '16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 2016, Pages 264–273
https://doi.org/10.1145/2847263.2847269

Published: 21 February 2016

ABSTRACT

The advent of FPGA acceleration platforms with direct coherent access to processor memory creates an opportunity for accelerating applications with irregular parallelism governed by large in-memory pointer-based data structures. This paper uses the simple reference behavior of a linked-list traversal as a proxy to study the performance potential of accelerating these applications on shared-memory processor-FPGA systems. The linked-list traversal is parameterized by node layout in memory, per-node data payload size, payload dependence, and traversal concurrency to capture the main performance effects of different pointer-based data structures and algorithms. The paper explores the trade-offs over a wide range of implementation options available on shared-memory processor-FPGA architectures, including using tightly-coupled processor assistance. We observe the key effects on currently available systems, including the Xilinx Zynq, the Intel QuickAssist QPI FPGA Platform, and the Convey HC-2. The key results show: (1) the FPGA fabric is least efficient when traversing a single list with non-sequential node layout and a small payload size; (2) processor assistance can help alleviate this shortcoming; and (3) when appropriate, a fabric-only approach that interleaves multiple linked-list traversals is an effective way to maximize traversal performance.
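
To make two of the traversal parameters concrete, the following is a minimal software sketch, not the authors' benchmark or hardware design. It models node layout (sequential versus shuffled placement in a flat array) and traversal concurrency (round-robin interleaving of independent lists). The function names `build_list`, `traverse`, and `interleaved_rounds` are hypothetical, payload size and payload dependence are left out, and a "round" is an illustrative unit (one pointer load per active list), not a measured latency.

```python
import random


def build_list(num_nodes, sequential=True, seed=0):
    """Build a singly linked list in a flat array (assumes num_nodes >= 1).

    Returns (head, next_index), where next_index[i] is the slot that
    follows slot i, or -1 at the tail.
    """
    order = list(range(num_nodes))
    if not sequential:
        # Non-sequential layout: successive nodes sit in unrelated slots.
        random.Random(seed).shuffle(order)
    next_index = [-1] * num_nodes
    for cur, nxt in zip(order, order[1:]):
        next_index[cur] = nxt
    return order[0], next_index


def traverse(head, next_index):
    """Chase pointers from head; every hop needs the result of the previous load."""
    visited = []
    node = head
    while node != -1:
        visited.append(node)
        node = next_index[node]
    return visited


def interleaved_rounds(heads, next_index_lists):
    """Advance several independent traversals in lock step.

    Each round issues one pointer load per still-active list, so the
    loads within a round do not depend on one another.
    Returns the number of rounds until every traversal reaches its tail.
    """
    cursors = list(heads)
    rounds = 0
    while any(c != -1 for c in cursors):
        cursors = [
            nxt[c] if c != -1 else -1
            for c, nxt in zip(cursors, next_index_lists)
        ]
        rounds += 1
    return rounds
```

In this model a single 8-node list takes 8 dependent hops, and four such lists traversed one after another take 32, whereas interleaving the four lists finishes in 8 rounds because each round carries four independent loads. That is the intuition behind interleaving multiple traversals, stated here only as a counting argument.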

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
      2. Reconfigurable computing

Published in
FPGA '16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2016, 298 pages
ISBN: 9781450338561
DOI: 10.1145/2847263
General Chair: Deming Chen, University of Illinois at Urbana-Champaign, USA
Program Chair: Jonathan Greene, Microsemi, USA
Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache coherence
fpga
heterogeneous systems
pointer chasing
shared memory
Qualifiers
- research-article
Acceptance Rates
FPGA '16 paper acceptance rate: 20 of 111 submissions, 18%
Overall acceptance rate: 125 of 627 submissions, 20%