
Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis

Published: 30 August 2019

Abstract

Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge, and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

  272. W. Zhang, S. Gupta, X. Lian, and J. Liu. 2016. Staleness-aware async-SGD for distributed deep learning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI’16). 2350--2356. Google ScholarGoogle ScholarDigital LibraryDigital Library
  273. X. Zhang, M. McKenna, J. P. Mesirov, and D. L. Waltz. 1990. An efficient implementation of the back-propagation algorithm on the connection machine CM-2. In Advances in Neural Information Processing Systems 2. MIT Press, 801--809. Google ScholarGoogle ScholarDigital LibraryDigital Library
  274. H. Zhao and J. Canny. 2014. Kylix: A sparse allreduce for commodity clusters. In Proceedings of the 43rd International Conference on Parallel Processing. 273--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  275. Z. Zhong, J. Yan, and C.-L. Liu. 2017. Practical network blocks design with Q-Learning. arxiv:1708.05552Google ScholarGoogle Scholar
  276. S. Zhou et al. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arxiv:1606.06160Google ScholarGoogle Scholar
  277. M. A. Zinkevich, M. Weimer, A. Smola, and L. Li. 2010. Parallelized stochastic gradient descent. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, vol. 2. 2595--2603. Google ScholarGoogle ScholarDigital LibraryDigital Library
  278. A. Zlateski, K. Lee, and H. S. Seung. 2016. ZNNi: Maximizing the inference throughput of 3D convolutional networks on CPUs and GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 854--865. Google ScholarGoogle ScholarDigital LibraryDigital Library
  279. B. Zoph and Q. V. Le. 2017. Neural architecture search with reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR’17).Google ScholarGoogle Scholar
  280. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. 2017. Learning transferable architectures for scalable image recognition. arxiv:1707.07012Google ScholarGoogle Scholar
