ABSTRACT
Irregular algorithms such as Stochastic Gradient Descent (SGD) can benefit from the massive parallelism available on GPUs. However, unlike in data-parallel algorithms, synchronization patterns in SGD are quite complex, and scheduling on scale-free graphs is particularly challenging. This work examines several synchronization strategies for SGD, ranging from simple locking to conflict-free scheduling. We observe that static schedules do not yield better performance despite eliminating the need for conflict detection and resolution at runtime, and we identify the source of this performance degradation as the structure of certain parts of the graph (dense versus sparse). This classification can be used to devise hybrid scheduling strategies that apply different schedules to different regions of the graph to obtain better performance. We find that the best schedule for some problems can be up to two orders of magnitude faster than the worst one. To evaluate our GPU implementation, we also compare against a CPU implementation of SGD: dynamic schedules perform comparably to a 14-thread CPU implementation, while a static schedule performs comparably to a 6-thread CPU implementation.
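To make the locking-based (dynamic) scheduling idea concrete, below is a minimal CUDA sketch of one SGD update kernel for matrix-factorization-style SGD, where each observed rating is an edge between a user node and an item node and both endpoints must be protected before the update. The kernel structure, the try-lock-and-skip policy, and the parameters K, LR, and REG are assumptions for illustration only; they are not the paper's actual EL/NL/HL implementations, and only the device kernel is shown.

```cuda
// Hypothetical sketch of per-node locking for GPU SGD on a bipartite rating graph.
// Each thread handles one edge (u, i, r) and updates latent factors P[u] and Q[i].
#include <cuda_runtime.h>

#define K   32      // latent factor dimension (assumed)
#define LR  0.01f   // learning rate (assumed)
#define REG 0.05f   // regularization weight (assumed)

__global__ void sgd_node_locked(const int   *users,     // user endpoint of each rating edge
                                const int   *items,     // item endpoint of each rating edge
                                const float *ratings,   // observed rating values
                                int          num_edges,
                                float       *P,         // num_users x K user factors
                                float       *Q,         // num_items x K item factors
                                int         *user_lock, // one int lock word per user
                                int         *item_lock) // one int lock word per item
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_edges) return;

    int u = users[e], i = items[e];

    // Try to lock both endpoints; on conflict, skip this edge for now.
    // Skipping (rather than spinning) avoids intra-warp lock deadlock;
    // a real scheduler would re-queue or retry the skipped edge later.
    if (atomicCAS(&user_lock[u], 0, 1) != 0) return;
    if (atomicCAS(&item_lock[i], 0, 1) != 0) {
        atomicExch(&user_lock[u], 0);
        return;
    }

    // Plain SGD step on the rating error: err = r - <P[u], Q[i]>.
    float *p = P + (size_t)u * K;
    float *q = Q + (size_t)i * K;
    float err = ratings[e];
    for (int k = 0; k < K; ++k) err -= p[k] * q[k];
    for (int k = 0; k < K; ++k) {
        float pk = p[k], qk = q[k];
        p[k] = pk + LR * (err * qk - REG * pk);
        q[k] = qk + LR * (err * pk - REG * qk);
    }

    // Make the factor writes visible to other threads before releasing the locks.
    __threadfence();
    atomicExch(&item_lock[i], 0);
    atomicExch(&user_lock[u], 0);
}
```

The skip-on-conflict policy trades a small amount of lost work for freedom from busy-waiting, which matters on GPUs where threads in a warp execute in lockstep; a static, conflict-free schedule removes the locks entirely but, as the abstract notes, does not automatically run faster.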