DOI: 10.1145/2716282.2716289
research-article

Stochastic gradient descent on GPUs

Published: 07 February 2015

ABSTRACT

Irregular algorithms such as Stochastic Gradient Descent (SGD) can benefit from the massive parallelism available on GPUs. However, unlike in data-parallel algorithms, synchronization patterns in SGD are quite complex. Furthermore, scheduling for scale-free graphs is challenging. This work examines several synchronization strategies for SGD, ranging from simple locking to conflict-free scheduling. We observe that static schedules do not yield better performance despite eliminating the need to perform conflict detection and resolution at runtime. We identify the source of the performance degradation to be the structure of certain parts of the graph (dense vs sparse). This classification can be used to devise hybrid scheduling strategies which exploit different schedules for different regions of the graph to obtain better performance. We found that the best schedule for some problems can be up to two orders of magnitude faster than the worst one. To evaluate the performance of our GPU implementation, we also compare against a CPU implementation of SGD. Dynamic schedules perform comparably to a 14-thread CPU implementation, while a static schedule performs comparably to a 6-thread CPU implementation.
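To make the synchronization problem concrete, the following is a minimal CUDA sketch of an "edge-locked" SGD update for matrix completion on a bipartite user-item ratings graph, in the spirit of the Edge-Locked schedule named in the paper. All identifiers, the Edge layout, the constants (K, LR, REG), and the per-node spin-lock scheme are illustrative assumptions, not the authors' implementation.

    #define K   32       // latent-factor dimension (assumed)
    #define LR  0.05f    // learning rate (assumed)
    #define REG 0.01f    // regularization constant (assumed)

    // One rating edge in the bipartite user-item graph (hypothetical layout).
    struct Edge { int user; int item; float rating; };

    // Simple per-node spin locks built on atomicCAS/atomicExch.
    // NOTE: naive spin locks can livelock within a warp on pre-Volta GPUs;
    // a production implementation needs a more careful scheme.
    __device__ void lockNode(int *lock)   { while (atomicCAS(lock, 0, 1) != 0) { } }
    __device__ void unlockNode(int *lock) { atomicExch(lock, 0); }

    // Each thread processes one edge and locks both endpoints before updating,
    // i.e., conflict detection and resolution happen at run time.
    __global__ void sgdEdgeLocked(const Edge *edges, int numEdges,
                                  float *userFeat, float *itemFeat,
                                  int *userLock, int *itemLock)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= numEdges) return;

        Edge ed = edges[e];
        float *u = &userFeat[ed.user * K];
        float *v = &itemFeat[ed.item * K];

        // Always acquire the user lock before the item lock so the two lock
        // classes are consistently ordered and threads cannot deadlock.
        lockNode(&userLock[ed.user]);
        lockNode(&itemLock[ed.item]);

        // Prediction error for this rating: r - <u, v>.
        float err = ed.rating;
        for (int k = 0; k < K; ++k) err -= u[k] * v[k];

        // Standard SGD step on both latent-factor vectors.
        for (int k = 0; k < K; ++k) {
            float uk = u[k], vk = v[k];
            u[k] += LR * (err * vk - REG * uk);
            v[k] += LR * (err * uk - REG * vk);
        }

        __threadfence();  // make the updates visible before releasing the locks
        unlockNode(&itemLock[ed.item]);
        unlockNode(&userLock[ed.user]);
    }

Launching this kernel over a shuffled edge array approximates one SGD epoch. A static, matching-based schedule of the kind the abstract describes would instead group edges so that no two edges in the same batch share an endpoint, making the locks unnecessary at the cost of computing the schedule up front.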



Published in

GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs
February 2015, 120 pages
ISBN: 9781450334075
DOI: 10.1145/2716282

Copyright © 2015 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall Acceptance Rate: 57 of 129 submissions, 44%
