Research Article

Characterizing the Performance of Accelerated Jetson Edge Devices for Training Deep Learning Models

Published: 08 December 2022

Abstract

Deep Neural Networks (DNNs) have had a significant impact on domains like autonomous vehicles and smart cities through low-latency inferencing on edge computing devices close to the data source. However, DNN training on the edge is poorly explored. Techniques like federated learning and the growing capacity of GPU-accelerated edge devices like NVIDIA Jetson motivate the need for a holistic characterization of DNN training on the edge. Training DNNs is resource-intensive and can stress an edge device's GPU, CPU, memory and storage capacities. Edge devices also have different resources compared to workstations and servers, such as slower shared memory and diverse storage media. Here, we perform a principled study of DNN training on individual devices of three contemporary Jetson device types: AGX Xavier, Xavier NX and Nano, for three diverse DNN model–dataset combinations. We vary device and training parameters such as I/O pipelining and parallelism, storage media, mini-batch sizes and power modes, and examine their effect on CPU and GPU utilization, fetch stalls, training time, energy usage, and variability. Our analysis exposes several resource inter-dependencies and counter-intuitive insights, while also helping quantify known wisdom. Our rigorous study can help tune training performance on the edge, trade off time and energy usage on constrained devices, and even select an ideal edge hardware for a DNN workload, and, in the future, extend to federated learning too. As an illustration, we use these results to build a simple model to predict the training time and energy per epoch for any given DNN across different power modes, with minimal additional profiling.
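To make the training knobs above concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' actual measurement harness, that times one training epoch while sweeping the mini-batch size and the number of DataLoader workers (the I/O-parallelism knob). The ResNet-18/CIFAR-10 pairing and all parameter values are illustrative assumptions.

import time

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Hypothetical sketch (not the paper's measurement harness): time one training
# epoch of ResNet-18 on CIFAR-10 for a given mini-batch size and DataLoader
# worker count, two of the knobs the study varies. Assumes a CUDA-capable
# device (e.g., a Jetson) and that torchvision can download CIFAR-10.
def time_epoch(batch_size, num_workers, device="cuda"):
    dataset = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=num_workers,  # parallel fetch/pre-processing workers
                        pin_memory=True)          # pinned buffers for host-to-GPU copies
    model = models.resnet18(num_classes=10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    start = time.time()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()  # flush queued GPU work before stopping the clock
    return time.time() - start

if __name__ == "__main__":
    # Illustrative sweep over mini-batch size and number of fetch workers.
    for bs in (16, 32, 64):
        for workers in (0, 2, 4):
            print(f"batch={bs} workers={workers}: {time_epoch(bs, workers):.1f} s")

On a Jetson, one could additionally repeat such a loop under different device power modes to relate epoch time to energy use, in the spirit of the study described above.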




Published in

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 6, Issue 3
December 2022
534 pages
EISSN: 2476-1249
DOI: 10.1145/3576048

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 8 December 2022 in POMACS Volume 6, Issue 3
