DOI: 10.1145/3307650.3322237
Research Article · Open Access

FloatPIM: in-memory acceleration of deep neural network training with high precision

Published: 22 June 2019

ABSTRACT

Processing In-Memory (PIM) has shown great potential for accelerating the inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support high-precision computation, e.g., floating-point arithmetic, which is essential for training accurate CNN models. In addition, most existing PIM approaches require analog/mixed-signal circuits, which do not scale well and rely on insufficiently reliable multi-bit Non-Volatile Memory (NVM). In this paper, we propose FloatPIM, a fully digital and scalable PIM architecture that accelerates CNNs in both the training and testing phases. FloatPIM natively supports floating-point representation, enabling accurate CNN training. FloatPIM also enables fast communication between neighboring memory blocks to reduce internal data movement in the PIM architecture. We evaluate the efficiency of FloatPIM on the ImageNet dataset using popular large-scale neural networks. Our evaluation shows that FloatPIM, with floating-point precision, achieves up to 5.1% higher classification accuracy than existing PIM architectures limited to fixed-point precision. For training, FloatPIM is on average 303.2× faster and 48.6× more energy efficient than a GTX 1080 GPU, and 4.3× faster and 15.8× more energy efficient than the PipeLayer [1] PIM accelerator. For testing, FloatPIM provides 324.8× speedup and 297.9× energy savings over the GPU, and 6.3× speedup and 21.6× energy savings over the ISAAC [2] PIM accelerator.
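The abstract's central claim is that floating-point arithmetic can be built entirely from digital in-memory operations rather than analog current summation. As a rough illustration of what that entails, the sketch below decomposes an IEEE-754 single-precision multiply into the integer primitives (exponent addition, fixed-point mantissa multiplication, shifts) that a digital PIM fabric could realize with in-memory bitwise logic. This is a minimal model for intuition only, not FloatPIM's actual datapath: it assumes normalized, nonzero inputs and truncates instead of rounding.

```python
import struct

def decompose(x):
    """Split a float32 into (sign, biased exponent, 24-bit mantissa with hidden 1)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    mant = (bits & 0x7FFFFF) | (1 << 23)  # restore the implicit leading 1
    return sign, exp, mant

def fp_multiply(a, b):
    """Float multiply using only integer add/multiply/shift -- the primitive mix
    a fully digital PIM can execute in-memory. Illustrative sketch only:
    assumes normalized, nonzero inputs and truncates rather than rounds."""
    sa, ea, ma = decompose(a)
    sb, eb, mb = decompose(b)
    sign = sa ^ sb               # sign: XOR of the sign bits
    exp = ea + eb - 127          # exponents add; remove one copy of the bias
    prod = ma * mb               # 24x24-bit fixed-point multiply, in [2^46, 2^48)
    if prod >> 47:               # mantissa product >= 2.0: renormalize, bump exponent
        prod >>= 24
        exp += 1
    else:
        prod >>= 23
    bits = (sign << 31) | ((exp & 0xFF) << 23) | (prod & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(fp_multiply(3.5, -2.25))  # -7.875, matching 3.5 * -2.25 exactly
```

The point of the sketch is that every step is an integer operation on bit vectors, which is why a digital PIM design can support floating point without requiring the multi-bit analog cell precision that the abstract identifies as unreliable.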

References

  1. L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, IEEE, 2017.
  2. A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 14--26, IEEE Press, 2016.
  3. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
  4. J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85--117, 2015.
  5. C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision, pp. 184--199, Springer, 2014.
  6. L. Deng, D. Yu, et al., "Deep learning: Methods and applications," Foundations and Trends® in Signal Processing, vol. 7, no. 3--4, pp. 197--387, 2014.
  7. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
  8. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
  9. M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision, pp. 525--542, Springer, 2016.
  10. M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 1--13, IEEE, 2016.
  11. P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 27--39, IEEE Press, 2016.
  12. S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
  13. M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024, 2014.
  14. C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling, "Relaxed quantization for discretized neural networks," arXiv preprint arXiv:1810.01875, 2018.
  15. "Bfloat16 floating-point format." https://en.wikipedia.org/wiki/Bfloat16_floating-point_format.
  16. "Intel Xeon processors and Intel FPGAs." https://venturebeat.com/2018/05/23/intel-unveils-nervana-neural-net-l-1000-for-accelerated-ai-training/.
  17. "Intel Xeon and FPGA lines." https://www.top500.org/news/intel-lays-out-new-roadmap-for-ai-portfolio/.
  18. "NNP-L1000." https://www.tomshardware.com/news/intel-neural-network-processor-lake-crest,37105.html.
  19. "Google Cloud." https://cloud.google.com/tpu/docs/tensorflow-ops.
  20. "TPU repository with TensorFlow 1.7.0." https://blog.riseml.com/comparing-google-tpuv2-against-nvidia-v100-on-resnet-50-c2bbb6a51e5e?gi=51a90720b9dd.
  21. J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous, "TensorFlow distributions," arXiv preprint arXiv:1711.10604, 2017.
  22. Google, "In many models this is a drop-in replacement for float-32," May 2018. https://www.youtube.com/watch?v=vm67WcLzfvc&t=2555.
  23. B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, "Enabling scientific computing on memristive accelerators," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 367--382, IEEE, 2018.
  24. "Intel and Micron produce breakthrough memory technology." http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology.
  25. M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, and H. Yang, "TIME: A training-in-memory architecture for memristor-based deep neural networks," in Proceedings of the 54th Annual Design Automation Conference, p. 26, ACM, 2017.
  26. Y. Cai, T. Tang, L. Xia, M. Cheng, Z. Zhu, Y. Wang, and H. Yang, "Training low bitwidth convolutional neural network on RRAM," in Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pp. 117--122, IEEE Press, 2018.
  27. Y. Cai, Y. Lin, L. Xia, X. Chen, S. Han, Y. Wang, and H. Yang, "Long live TIME: Improving lifetime for training-in-memory engines by structured gradient sparsification," in Proceedings of the 55th Annual Design Automation Conference, p. 107, ACM, 2018.
  28. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097--1105, 2012.
  29. L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993--1001, 1990.
  30. S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, "MAGIC - Memristor-aided logic," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 11, pp. 895--899, 2014.
  31. S. Gupta, M. Imani, and T. Rosing, "FELIX: Fast and energy-efficient logic in memory," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1--7, IEEE, 2018.
  32. A. Siemon, S. Menzel, R. Waser, and E. Linn, "A complementary resistive switch-based crossbar array adder," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 5, no. 1, pp. 64--74, 2015.
  33. S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, "Memristor-based material implication (IMPLY) logic: Design principles and methodologies," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 10, pp. 2054--2066, 2014.
  34. J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, "Memristive switches enable stateful logic operations via material implication," Nature, vol. 464, no. 7290, pp. 873--876, 2010.
  35. B. C. Jang, Y. Nam, B. J. Koo, J. Choi, S. G. Im, S.-H. K. Park, and S.-Y. Choi, "Memristive logic-in-memory integrated circuits for energy-efficient flexible electronics," Advanced Functional Materials, vol. 28, no. 2, 2018.
  36. S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, "VTEAM: A general model for voltage-controlled memristors," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8, pp. 786--790, 2015.
  37. N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, "Logic design within memristive memories using memristor-aided logic (MAGIC)," IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635--650, 2016.
  38. M. Imani, S. Gupta, and T. Rosing, "Ultra-efficient processing in-memory for data intensive applications," in Proceedings of the 54th Annual Design Automation Conference, p. 6, ACM, 2017.
  39. A. Haj-Ali et al., "Efficient algorithms for in-memory fixed point multiplication using MAGIC," in IEEE ISCAS, IEEE, 2018.
  40. M. Imani, D. Peroni, Y. Kim, A. Rahimi, and T. Rosing, "Efficient neural network acceleration on GPGPU using content addressable memory," in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1026--1031, IEEE, 2017.
  41. C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 476--488, IEEE, 2015.
  42. A. Nag, R. Balasubramonian, V. Srikumar, R. Walker, A. Shafiee, J. P. Strachan, and N. Muralimanohar, "Newton: Gravitating towards the physical limits of crossbar acceleration," IEEE Micro, vol. 38, no. 5, pp. 41--49, 2018.
  43. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770--778, 2016.
  44. A. Ghofrani, A. Rahimi, M. A. Lastras-Montaño, L. Benini, R. K. Gupta, and K.-T. Cheng, "Associative memristive memory for approximate computing in GPUs," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 6, no. 2, pp. 222--234, 2016.
  45. F. Chollet, "Keras." https://github.com/fchollet/keras, 2015.
  46. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
  47. X. Dong, C. Xu, N. Jouppi, and Y. Xie, "NVSim: A circuit-level performance, energy, and area model for emerging non-volatile memory," in Emerging Memory Technologies, pp. 15--50, Springer, 2014.
  48. Synopsys, Inc., "Design Compiler User Guide," see http://www.synopsys.com, 2000.
  49. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1--9, 2015.
  50. F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
  51. P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaev, G. Venkatesh, et al., "Mixed precision training," arXiv preprint arXiv:1710.03740, 2017.
  52. M. Drumond, T. Lin, M. Jaggi, and B. Falsafi, "End-to-end DNN training with block floating point arithmetic," arXiv preprint arXiv:1804.01526, 2018.
  53. D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, et al., "Mixed precision training of convolutional neural networks using integer operations," arXiv preprint arXiv:1802.00930, 2018.
  54. C. De Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, and C. Ré, "High-accuracy low-precision training," arXiv preprint arXiv:1803.03383, 2018.
  55. H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 17, IEEE Press, 2016.
  56. J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O'Leary, R. Genov, and A. Moshovos, "Bit-pragmatic deep neural network computing," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 382--394, ACM, 2017.
  57. T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM SIGPLAN Notices, vol. 49, pp. 269--284, ACM, 2014.
  58. V. Aklaghi, A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and R. Gupta, "SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks," in ISCA, 2018.
  59. K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. W. Fletcher, "UCNN: Exploiting computational reuse in deep neural networks via weight repetition," arXiv preprint arXiv:1804.06508, 2018.
  60. C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, et al., "CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395--408, ACM, 2017.
  61. E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, et al., "Can FPGAs beat GPUs in accelerating next-generation deep neural networks?," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 5--14, ACM, 2017.
  62. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243--254, IEEE, 2016.
  63. Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609--622, IEEE Computer Society, 2014.
  64. C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, "Neural cache: Bit-serial in-cache acceleration of deep neural networks," arXiv preprint arXiv:1805.03718, 2018.
  65. S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "DRISA: A DRAM-based reconfigurable in-situ accelerator," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 288--301, ACM, 2017.
  66. M. N. Bojnordi and E. Ipek, "The memristive Boltzmann machines," IEEE Micro, vol. 37, no. 3, pp. 22--29, 2017.
  67. M. Imani, M. Samragh, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, "RAPIDNN: In-memory deep neural network acceleration framework," arXiv preprint arXiv:1806.05794, 2018.
  68. S. Gupta, M. Imani, H. Kaur, and T. S. Rosing, "NNPIM: A processing in-memory architecture for neural network acceleration," IEEE Transactions on Computers, 2019.
  69. M. Imani, S. Gupta, and T. Rosing, "GenPIM: Generalized processing in-memory to accelerate data intensive applications," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1155--1158, IEEE, 2018.
  70. S. Salamat, M. Imani, S. Gupta, and T. Rosing, "RNSnet: In-memory neural network acceleration using residue number system," in 2018 IEEE International Conference on Rebooting Computing (ICRC), pp. 1--12, IEEE, 2018.
  71. M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, "Exploring hyperdimensional associative memory," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 445--456, IEEE, 2017.
  72. Y. Kim, M. Imani, and T. Rosing, "ORCHARD: Visual object recognition accelerator based on approximate in-memory processing," in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 25--32, IEEE, 2017.
  73. M. Zhou, M. Imani, S. Gupta, and T. Rosing, "GAS: A heterogeneous memory architecture for graph processing," in Proceedings of the International Symposium on Low Power Electronics and Design, p. 27, ACM, 2018.
  74. M. Zhou, M. Imani, S. Gupta, Y. Kim, and T. Rosing, "GRAM: Graph processing in a ReRAM-based computational memory," in Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 591--596, ACM, 2019.
  75. M. Imani, S. Gupta, S. Sharma, and T. Rosing, "NVQuery: Efficient query processing in non-volatile memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.

Published in

ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
June 2019, 849 pages
ISBN: 9781450366694
DOI: 10.1145/3307650

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

ISCA '19 paper acceptance rate: 62 of 365 submissions (17%). Overall acceptance rate: 543 of 3,203 submissions (17%).
