Dynamic hard pruning of Neural Networks at the edge of the internet

https://doi.org/10.1016/j.jnca.2021.103330

Abstract

Neural Networks (NN), although successfully applied to several Artificial Intelligence tasks, are often unnecessarily over-parametrized. In edge/fog computing, this might make their training prohibitive on resource-constrained devices, contrasting with the current trend of decentralizing intelligence from remote data centres to local constrained devices. Therefore, we investigate the problem of training effective NN models on constrained devices having a fixed, potentially small, memory budget. We target techniques that are both resource-efficient and performance-effective while enabling significant network compression. Our Dynamic Hard Pruning (DynHP) technique incrementally prunes the network during training, identifying neurons that marginally contribute to the model accuracy. DynHP enables a tunable size reduction of the final neural network and reduces the NN memory occupancy during training. Freed memory is reused by a dynamic batch sizing approach to counterbalance the accuracy degradation caused by the hard pruning strategy, improving its convergence and effectiveness. We assess the performance of DynHP through reproducible experiments on three public datasets, comparing it against reference competitors. Results show that DynHP compresses a NN up to 10 times without significant performance drops (up to 3.5% additional error w.r.t. the competitors), while reducing the training memory occupancy by up to 80%.

Introduction

In recent years, AI solutions have been successfully adopted in a variety of different tasks. Neural networks (NN) are among the most successful technologies, achieving state-of-the-art performance in several application fields, including image recognition, computer vision, natural language processing, and speech recognition. The main ingredients of NNs’ success are the increased availability of huge training datasets and the possibility of scaling their models to millions of parameters while allowing for a tractable optimization with mini-batch stochastic gradient descent (SGD), graphical processing units (GPUs) and parallel computing. Nevertheless, NNs are characterized by several drawbacks, and many research challenges are still open. Recently, it has been shown that NNs may suffer from over-parametrization (Han et al., 2015, Ullrich et al., 2017, Molchanov et al., 2017a), so that they can be pruned significantly without any loss of accuracy. Moreover, they can easily over-fit and even memorize random patterns in the data (Zhang et al., 2016) if not properly regularized.

NN solutions are typically designed with large data centres in mind, with plenty of storage, computation and energy resources, where data are collected for training and input data are also gathered at inference time. This scenario might not fit emerging application areas enabled by the widespread diffusion of IoT devices, commonly referred to as fog computing environments. Typical application areas are smart cities, autonomous vehicular networks, and Industry 4.0, to name a few. IoT devices in fog environments generate huge amounts of data that, for several reasons, might be impossible or impractical to move to remote data centres, both for training and for inference. Typically, real-time or privacy/ownership constraints on data make such an approach unfeasible. Therefore, knowledge extraction needs to leverage distributed data collection and computing paradigms, whereby NNs are “used” in locations closer to where data are generated, such as fog gateways or even individual devices such as tablets or Raspberry PIs. Unfortunately, with respect to data centres, these devices have much more limited computational power, memory, network, and energy capabilities. In these contexts, the exploitation of models trained “offline” requires a significant amount of memory – from hundreds of MBytes to GBytes – and their use, i.e., inference, requires GFLOPs of computation. As an example, the inference step using the AlexNet network (Krizhevsky et al., 2012) costs 729 MFLOPs (FLoating-point OPerations), and the model requires 250 MBytes of memory to be stored. Furthermore, resource-constrained devices must deal with limited energy availability (e.g., some devices might be battery powered). Interestingly, the energy consumption of such devices is dominated by memory access: one DRAM access costs two orders of magnitude more than one SRAM access, and three orders of magnitude more than one CMOS add operation (Modarressi and Sarbazi-Azad, 2018). These limitations jeopardize the exploitation of large models trained “offline” on the resource-constrained devices characterizing the edge/fog computing paradigms. Training DNNs on such devices is even more challenging, as the training phase is typically much more resource-hungry than the inference phase.
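As a rough sanity check of the figures above, the 250 MBytes quoted for AlexNet is consistent with storing its roughly 61 million parameters as 32-bit floats; the snippet below is an illustrative back-of-the-envelope computation (not taken from the paper) that makes the arithmetic explicit.

```python
# Back-of-the-envelope check of the AlexNet storage figure quoted above.
# Assumption: ~61 million parameters, each stored as a 32-bit (4-byte) float.
n_params = 61_000_000
bytes_per_param = 4
model_mbytes = n_params * bytes_per_param / 10**6
print(f"~{model_mbytes:.0f} MBytes of weights")  # ~244 MBytes, in line with the ~250 MBytes above
```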

Nowadays, several researchers are investigating the use of neural networks on resource-constrained devices. The effort is focused on enabling their use at both training and inference time on this kind of device by limiting – or possibly totally avoiding – the loss in performance introduced by the reduced memory and computational capacity of the targeted devices. These approaches can be broadly classified into two research lines. On the one hand, we have methods that, given an already trained neural network, try to reduce its size or distribute it on several devices collaborating during the inference phase. Three main directions have been investigated under this approach: pruning, quantization, and knowledge distillation (Guo et al., 2016, Han et al., 2017, Lin et al., 2016, Hinton et al., 2015). On the other hand, there are a few proposals that work at training time. These methods employ neural compression techniques (Srinivas et al., 2017, Louizos et al., 2017) or neural architecture search (Elsken et al., 2019) to identify effective configurations that actively reduce the size of the model. It is important to note that techniques following the latter approach typically measure the compression efficacy as the average number of neurons simultaneously active during the training. Neural compressors working at training time switch off, during an epoch, some neurons according to a given criterion. Among them, soft pruning techniques allow neurons to be switched on again after they have been switched off. Conversely, this does not happen when hard pruning techniques are adopted: neurons that are switched off are lost. While soft pruning guarantees more flexibility during training, it does not reduce the memory occupation of the model during the training process, as information about switched-off neurons must be kept from epoch to epoch.
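The soft/hard distinction can be made concrete with a small sketch. The PyTorch snippet below is purely illustrative (it is not the implementation of any of the cited methods): soft pruning keeps the full weight matrix and merely masks neurons, so no memory is released, while hard pruning physically removes the corresponding rows and frees memory for good.

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)
x = torch.randn(8, 256)

# Soft pruning: "switch off" 64 output neurons with a binary mask. The full
# (128 x 256) weight matrix stays in memory, so the neurons can be reactivated.
mask = torch.ones(128)
mask[torch.randperm(128)[:64]] = 0.0
soft_out = layer(x) * mask

# Hard pruning: drop the rows of the pruned neurons. The layer now stores only
# a (64 x 256) matrix, but the removed neurons are lost for the rest of training.
keep = mask.bool()
pruned = nn.Linear(256, int(keep.sum()))
pruned.weight.data = layer.weight.data[keep].clone()
pruned.bias.data = layer.bias.data[keep].clone()
hard_out = pruned(x)
```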

This paper aims to design new hard pruning techniques for learning compressed neural networks on the resource-constrained devices typical of edge/fog environments. We claim that learning compressed neural networks directly on edge/fog devices is of paramount importance to achieve pervasive – both effective and efficient – neural network-based AI solutions in these environments. For this reason, we target techniques that are resource-efficient and achieve levels of performance comparable to conventional neural network training algorithms, while enabling significant levels of compression of the network. We pursue this goal by assuming that the training of a neural network on edge/fog devices relies on a fixed – and often small – budget of memory that can be used to perform the process. Our assumption comes from observing that the edge/fog paradigm is characterized by moving the computation close to the data source (Lopez et al., 2015, Conti et al., 2017). In this scenario, the devices employed can perform many other operations in parallel w.r.t. training a neural network, e.g., operations related to data gathering/indexing/storing (Barbalace et al., 2020). The edge/fog scenario is different from a standard cloud environment, where we can assume servers are fully available to train the neural network. We thus propose a new technique based on an effective compression of the network during the training process. Our novel technique, called Dynamic Hard Pruning (DynHP), incrementally but permanently prunes the network by identifying neurons that contribute only marginally to the model accuracy. By doing so, DynHP enables a significant and controllable reduction of the size of the final neural network learned. Moreover, by hard pruning the neural network, we progressively reduce the memory occupied by the model as the training process progresses. Finally, our solution is designed to train and prune the NN under a fixed memory budget, which might be an important feature if we consider a scenario where an edge/fog device, e.g., an Nvidia Xavier or Jetson Nano, has to run multiple, e.g., containerized, training processes of different NNs on local data.
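To give an idea of what a hard pruning step looks like in practice, the sketch below permanently removes the hidden neurons of a fully-connected layer with the smallest incoming-weight norm. The magnitude criterion and the function name are illustrative assumptions for exposition only; they are not necessarily the exact criterion adopted by DynHP (described in Section 4.1).

```python
import torch
import torch.nn as nn

def hard_prune_hidden(layer: nn.Linear, next_layer: nn.Linear, frac: float):
    """Permanently remove a fraction `frac` of the hidden neurons of `layer`.
    Importance is approximated here by the L2 norm of each neuron's incoming
    weights (an illustrative criterion, not necessarily DynHP's)."""
    importance = layer.weight.data.norm(p=2, dim=1)           # one score per neuron
    n_keep = max(1, int(layer.out_features * (1.0 - frac)))
    keep = importance.topk(n_keep).indices.sort().values      # surviving neuron ids

    new_layer = nn.Linear(layer.in_features, n_keep)
    new_layer.weight.data = layer.weight.data[keep].clone()
    new_layer.bias.data = layer.bias.data[keep].clone()

    # The following layer loses the matching input columns as well.
    new_next = nn.Linear(n_keep, next_layer.out_features)
    new_next.weight.data = next_layer.weight.data[:, keep].clone()
    new_next.bias.data = next_layer.bias.data.clone()
    return new_layer, new_next

# Example: prune 50% of the 128 hidden neurons of a two-layer perceptron.
hidden, out = hard_prune_hidden(nn.Linear(784, 128), nn.Linear(128, 10), frac=0.5)
```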

However, hard pruning neurons also brings a side effect related to the convergence of the training process to accurate solutions. Precisely, it slows down convergence and makes stochastic gradient descent more susceptible to getting stuck in poor local minima. To avoid this limitation, we capitalize on the increasing amount of memory saved during training to introduce a dynamic sizing of the mini-batches used to train the network. Our dynamic batch sizing technique enables direct control of the convergence and the final effectiveness of the network by varying the amount of data seen in a batch. DynHP tunes the size of the batch dynamically, epoch by epoch, by computing its optimal size as a function of the variance of the gradients observed during the training. Moreover, we dynamically adjust the amount of data seen by considering the constraint on the total memory available for the training process. Our proposal thus effectively reuses the memory progressively saved with hard pruning to increase the size of the batches used to estimate the gradient, improving convergence speed and quality.
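A minimal sketch of such a rule is given below, under stated assumptions: the batch grows when the gradient estimate is noisy (its variance is large relative to the squared gradient norm) and is capped by the memory that the fixed budget leaves free once the current, progressively shrinking, model is accounted for. The specific constants and the per-sample memory model are illustrative, not the exact formula used by DynHP (Section 4.2).

```python
def next_batch_size(grad_variance: float, grad_sq_norm: float, current_bs: int,
                    model_bytes: int, budget_bytes: int, bytes_per_sample: int) -> int:
    """Hypothetical epoch-by-epoch batch-sizing rule in the spirit described
    above; all constants and the memory model are illustrative assumptions."""
    # Noise-driven target: the noisier the gradient estimate, the larger the batch.
    noise_scale = grad_variance / max(grad_sq_norm, 1e-12)
    target = int(current_bs * min(2.0, max(0.5, noise_scale)))

    # Memory freed by hard pruning is reused for data: never exceed the budget.
    free_bytes = budget_bytes - model_bytes
    cap = max(1, free_bytes // bytes_per_sample)
    return max(1, min(target, cap))
```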

The result achieved by DynHP is three-fold. First, the training of compressed neural networks can be done directly on resource-constrained edge/fog devices. Second, our technique minimizes the performance loss due to network compression by reusing the memory saved for the network to increase the batch size and improve convergence and accuracy. Third, we explicitly target the training of neural networks under hard memory constraints by dynamically optimizing the learning process on the basis of the amount of memory available.

We assess our DynHP on three public datasets, MNIST, Fashion-MNIST and CIFAR-10, against state-of-the-art competitors. Our reproducible experiments show that DynHP can effectively compress a DNN from 2.5 to 10 fold without any significant effectiveness degradation, i.e., at most 3.5% of additional misclassification error w.r.t. competitors. Moreover, beyond obtaining highly accurate compressed models, our solution dramatically reduces the overall memory occupation during the entire training process, i.e., we use from 50% to 80% less memory to complete the training.

The paper is structured as follows: Section 2 discusses related work, while Section 3 presents the background of our work. Section 4.1 introduces the hard pruning of neural networks, while Section 4.2 presents the dynamic hard pruning technique. Section 5 presents and discusses the experimental results on three public datasets. Finally, Section 6 concludes the article and outlines future work.


Related work

It is a known fact that neural network models are often over-parametrized, i.e., the number of the model’s parameters that must be trained, and the resulting complexity of the model, exceed those needed by the problem at hand. For this reason, several methods have been proposed to reduce the size of a neural network model (i.e., lower the number of parameters). The challenge of all such methods is finding a way to obtain a final network model with only the necessary number of parameters

Background

In this section, we first introduce the notation used in the rest of the paper, then we discuss the state-of-the-art techniques for soft pruning neural networks (during training) that inspired our solution. For the sake of clarity, we report only the details that are necessary to make this paper self-contained.

Dynamic Hard Pruning of neural networks

In this section, we introduce and discuss our Dynamic Hard Pruning technique for training and compressing a neural network at a fixed memory budget. For the sake of clarity, we split the presentation of our solution into two parts. In the following, we first present the hard pruning technique that we use to ablate parts of the neural network during the training process. We then introduce our mechanism of dynamic batch-sizing that we use to drive the overall training process to achieve a

Experiments

The experiments conducted aim to comprehensively evaluate our proposal and compare it with the state-of-the-art SP competitor (Louizos et al., 2017). More precisely, we are interested in answering the following research questions:

  • RQ1:

    To what extent does our HP technique compress the network and reduce over-parametrization? How much does HP impact the quality of the learned model?

  • RQ2:

    Can we reduce the possible loss of accuracy introduced by HP by dynamically adjusting the size of mini-batches during

Conclusion

We investigated the problem of learning compressed NNs with a fixed – and potentially small – memory budget on edge/fog devices. We proposed DynHP, a new resource-efficient NN learning technique that achieves performance comparable to conventional neural network training algorithms while enabling significant levels of network compression. DynHP incrementally but permanently prunes the network as the training process progresses by identifying neurons that contribute only marginally to the

CRediT authorship contribution statement

Lorenzo Valerio: Conceptualization, Methodology, Software, Validation, Writing – original draft. Franco Maria Nardini: Conceptualization, Methodology, Software, Validation, Writing – original draft. Andrea Passarella: Conceptualization, Writing – review & editing. Raffaele Perego: Conceptualization, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is partially supported by the following projects: HumanE AI Network, Italy (EU H2020 HumanAI-Net, GA #952026), SoBigData++, Italy (EU H2020 SoBigData++, GA #871042), OK-INSAID, Italy (MIUR PON ARS01_00917), H2020 MARVEL, Italy (GA #957337), SAI: Social Explainable AI, Italy (EC CHIST-ERA-19-XAI-010), and TEACHING, Italy (H2020 GA #871385).

References (36)

  • Conti, M., et al., 2017. The internet of people (IoP): A new wave in pervasive mobile computing. Pervasive Mob. Comput.
  • Modarressi, M., et al., 2018. Chapter six - Topology specialization for networks-on-chip in the dark silicon era.
  • Ba, L.J., et al., 2013. Do deep nets really need to be deep?
  • Balles, L., Romero, J., Hennig, P., 2017. Coupling adaptive batch sizes with learning rates. In: Proceedings of the...
  • Barbalace, A., et al., 2020. Edge computing: The case for heterogeneous-ISA container migration.
  • Bellec, G., et al., 2017. Deep rewiring: Training very sparse deep networks.
  • Courbariaux, M., et al., 2015. BinaryConnect: Training deep neural networks with binary weights during propagations.
  • Elsken, T., et al., 2019. Neural architecture search: A survey. J. Mach. Learn. Res.
  • Frankle, J., et al. The lottery ticket hypothesis: Finding sparse, trainable neural networks.
  • Gal, Y., et al. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.
  • Guo, Y., et al., 2016. Dynamic network surgery for efficient DNNs.
  • Han, S., et al., 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.
  • Han, S., et al., 2017. DSD: Dense-sparse-dense training for deep neural networks.
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE...
  • Hinton, G., et al., 2015. Distilling the knowledge in a neural network.
  • Hubara, I., et al., 2016. Binarized neural networks.
  • Jin, X., et al., 2016. Training skinny deep neural networks with iterative hard thresholding methods.
  • Kingma, D.P., et al. Variational dropout and the local reparameterization trick.

Dr. Lorenzo Valerio (corr. auth.) is a tenured Technologist at IIT-CNR. He received the Ph.D. in Mathematics and Statistics for Computational Sciences from the University of Milan in 2012. His main research activity focuses on Machine Learning (ML) for resource-constrained environments, Distributed/Decentralized Machine Learning, Next Generation Internet (NGI), and opportunistic networking. He has published more than 30 papers in journals and conference proceedings. He has served as Workshop co-chair for IEEE AOC’15 and has been guest editor for Elsevier Computer Communications. He is the recipient of a Best Paper Award at IEEE WoWMoM 2013 and a Best Paper Nomination at IEEE SMARTCOMP 2016. He is currently a member of the editorial board of the Elsevier Computer Communications journal.

Dr. Franco Maria Nardini is a tenured Researcher at ISTI-CNR. He received the Ph.D. in Information Engineering from the University of Pisa in 2011. His research interests are focused on Web Information Retrieval (IR), Machine Learning (ML), and Data Mining (DM). He authored more than 70 papers in peer-reviewed international journals, conferences and other venues. He is co-recipient of the ACM SIGIR 2015 Best Paper Award and of the ECIR 2014 Best Demo Paper Award. He is a member of the program committee of several top-level conferences in IR, ML and DM, such as ACM SIGIR, ECIR, ACM SIGKDD, ACM CIKM, ACM WSDM, IJCAI, and ECML-PKDD.

Dr. Andrea Passarella (Ph.D. 2005) is a Research Director at the Institute for Informatics and Telematics (IIT) of the National Research Council of Italy (CNR). Prior to joining IIT, he was with the Computer Laboratory of the University of Cambridge, UK. He has published 170+ papers on online and mobile social networks, decentralized AI, Next Generation Internet, and opportunistic, ad hoc and sensor networks, receiving the best paper award at IFIP Networking 2011 and IEEE WoWMoM 2013. He currently serves as General Chair for IEEE PerCom 2022. He is the founding Associate Editor-in-Chief of Elsevier Online Social Networks. He is co-author of the book “Online Social Networks: Human Cognitive Constraints in Facebook and Twitter Personal Graphs” (Elsevier, 2015), and was Guest Co-Editor of several special sections in ACM and Elsevier journals.

Dr. Raffaele Perego (http://hpc.isti.cnr.it/~raffaele/) is a Research Director at the Institute of Information Science and Technologies “Alessandro Faedo” (ISTI) of the National Research Council of Italy (CNR), where he leads the High Performance Computing Lab (http://hpc.isti.cnr.it). His main research interests include large-scale information systems, information retrieval, and machine learning. He co-authored more than 180 papers on these topics published in journals and proceedings of international conferences. He served as general chair of ACM SIGIR 2016 and ECIR 2021, and received the ACM SIGIR Best Paper Award in 2015 and the Yahoo FREP award in 2016.
