
Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis

Published: 30 August 2019

Abstract

Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge, and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

  272. W. Zhang, S. Gupta, X. Lian, and J. Liu. 2016. Staleness-aware async-SGD for distributed deep learning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16). 2350--2356. Google ScholarGoogle ScholarDigital LibraryDigital Library
  273. X. Zhang, M. McKenna, J. P. Mesirov, and D. L. Waltz. 1990. An efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Advances in Neural Information Processing Systems 2. MIT Press, 801--809. Google ScholarGoogle ScholarDigital LibraryDigital Library
  274. H. Zhao and J. Canny. 2014. Kylix: A sparse allreduce for commodity clusters. In Proceedings of the 43rd International Conference on Parallel Processing. 273--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  275. Z. Zhong, J. Yan, and C.-L. Liu. 2017. Practical network blocks design with Q-Learning. arxiv:1708.05552Google ScholarGoogle Scholar
  276. S. Zhou et al. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arxiv:1606.06160Google ScholarGoogle Scholar
  277. M. A. Zinkevich, M. Weimer, A. Smola, and L. Li. 2010. Parallelized stochastic gradient descent. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, vol. 2. 2595--2603. Google ScholarGoogle ScholarDigital LibraryDigital Library
  278. A. Zlateski, K. Lee, and H. S. Seung. 2016. ZNNi: Maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 854--865. Google ScholarGoogle ScholarDigital LibraryDigital Library
  279. B. Zoph and Q. V. Le. 2017. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR’17).Google ScholarGoogle Scholar
  280. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. 2017. Learning transferable architectures for scalable image recognition. arxiv:1707.07012Google ScholarGoogle Scholar
