
An Optimal Parallel Prefix-Sums Algorithm on the Memory Machine Models for GPUs

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7439)

Abstract

The main contribution of this paper is to show optimal algorithms for computing the sum and the prefix-sums on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs, respectively. These models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. We first show that the sum of n numbers can be computed in \(O({n\over w}+{nl\over p}+l\log n)\) time units on the DMM and the UMM. We then show that \(\Omega({n\over w}+{nl\over p}+l\log n)\) time units are necessary to compute the sum. Finally, we present an optimal parallel algorithm that computes the prefix-sums of n numbers in \(O({n\over w}+{nl\over p}+l\log n)\) time units on the DMM and the UMM.
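The \(l\log n\) term in the bound reflects the logarithmic depth of a tree-structured computation. As an illustrative sketch only (this is the classic work-efficient up-sweep/down-sweep scan of Blelloch, not the paper's DMM/UMM algorithm, and it is written sequentially), the tree structure underlying such prefix-sums computations can be expressed as:

```python
def exclusive_scan(a):
    """Work-efficient exclusive prefix-sums over a power-of-two-length list.

    Both phases sweep a binary tree of height log n; on a parallel machine
    each level could run concurrently, giving O(log n) depth and O(n) work.
    """
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    x = list(a)
    # Up-sweep (reduce): each tree node accumulates the sum of its subtree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] += x[i - d]
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    x[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            t = x[i - d]
            x[i - d] = x[i]   # left child receives the incoming prefix
            x[i] += t         # right child adds the left subtree's sum
        d //= 2
    return x

# exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]) -> [0, 1, 3, 6, 10, 15, 21, 28]
```

The inclusive prefix-sums considered in the paper follow by adding each input element to the corresponding exclusive result; on real GPUs, the cost of each tree level also depends on how the memory accesses map onto memory banks (DMM) or coalesced words (UMM), which is exactly what the paper's models make precise.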





Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nakano, K. (2012). An Optimal Parallel Prefix-Sums Algorithm on the Memory Machine Models for GPUs. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2012. Lecture Notes in Computer Science, vol 7439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33078-0_8


  • DOI: https://doi.org/10.1007/978-3-642-33078-0_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33077-3

  • Online ISBN: 978-3-642-33078-0

  • eBook Packages: Computer Science, Computer Science (R0)
