poster

Revisiting sorting for GPGPU stream architectures

Authors:
Duane G. Merrill
University of Virginia, Charlottesville, VA, USA

Andrew S. Grimshaw
University of Virginia, Charlottesville, VA, USA

PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
September 2010
Pages 545–546
https://doi.org/10.1145/1854273.1854344

Published: 11 September 2010

ABSTRACT

This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state of the art, our radix sorting methods exhibit a speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully programmable microarchitecture.

These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
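One way to picture the two generalizations is sketched below in sequential Python. This is an interpretation under stated assumptions, not the interface described by the authors: `multi_scan`, `produce`, `consume`, and `split_by_digit` are hypothetical names. The `produce` callback stands in for producer-side visiting logic (it generates the scan inputs from each item), the `consume` callback stands in for consumer-side visiting logic (it uses the scan results as they become available), and the scan carries several running sums ("lanes") at once instead of one.

```python
def multi_scan(items, produce, num_lanes, consume):
    """Run num_lanes exclusive prefix sums in a single pass over items.

    produce(item) returns a tuple of num_lanes integers (the scan inputs).
    consume(item, prefixes) receives the exclusive prefix of every lane.
    Returns the per-lane totals. All names here are hypothetical.
    """
    running = [0] * num_lanes
    for item in items:
        contribution = produce(item)
        consume(item, tuple(running))
        for lane in range(num_lanes):
            running[lane] += contribution[lane]
    return running


def split_by_digit(keys, shift, radix_bits=2):
    """One stable radix pass: one scan lane per possible digit value."""
    num_lanes = 1 << radix_bits
    mask = num_lanes - 1

    def digit(k):
        return (k >> shift) & mask

    def one_hot(k):
        # Producer logic: a flag vector marking which digit this key carries.
        return tuple(1 if digit(k) == d else 0 for d in range(num_lanes))

    # First pass: only the per-lane totals (bucket sizes) are needed.
    totals = multi_scan(keys, one_hot, num_lanes, lambda k, prefixes: None)

    # Base offset of each bucket is the exclusive scan of the totals.
    base = []
    running = 0
    for t in totals:
        base.append(running)
        running += t

    out = [None] * len(keys)

    def scatter(k, prefixes):
        # Consumer logic: a key's rank within its bucket is its lane's prefix.
        d = digit(k)
        out[base[d] + prefixes[d]] = k

    multi_scan(keys, one_hot, num_lanes, scatter)
    return out
```

For example, `split_by_digit([3, 0, 2, 1, 3, 0], shift=0)` returns `[0, 0, 1, 2, 3, 3]`. Folding the flag generation and the scatter into the scan itself avoids writing intermediate flag vectors and prefix arrays to memory, which is the kind of resource saving the abstract attributes to these interfaces.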

Index Terms

Revisiting sorting for GPGPU stream architectures
1. Mathematics of computing
  1. Mathematical software
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Sorting and searching

Published in
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
September 2010
596 pages
ISBN: 9781450301787
DOI: 10.1145/1854273
General Chair: Valentina Salapura (IBM TJ Watson Research Center)
Program Chairs: Michael Gschwind (IBM Systems & Technology Group), Jens Knoop (Technische Universität Wien)
Copyright © 2010 Copyright is held by the author/owner(s)
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
GPU
kernel fusion
prefix scan
radix sorting
sorting
Acceptance Rates
Overall acceptance rate: 121 of 471 submissions (26%)