ABSTRACT
Processing In-Memory (PIM) has shown great potential for accelerating the inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support high-precision computation, e.g., floating-point arithmetic, which is essential for training accurate CNN models. In addition, most existing PIM approaches require analog/mixed-signal circuits, which do not scale well and which rely on insufficiently reliable multi-bit Non-Volatile Memory (NVM). In this paper, we propose FloatPIM, a fully digital and scalable PIM architecture that accelerates CNNs in both the training and testing phases. FloatPIM natively supports floating-point representation, thus enabling accurate CNN training. FloatPIM also enables fast communication between neighboring memory blocks to reduce internal data movement within the PIM architecture. We evaluate the efficiency of FloatPIM on the ImageNet dataset using popular large-scale neural networks. Our evaluation shows that FloatPIM, with its floating-point support, achieves up to 5.1% higher classification accuracy than existing PIM architectures limited to fixed-point precision. For training, FloatPIM is on average 303.2× faster and 48.6× more energy efficient than an NVIDIA GTX 1080 GPU, and 4.3× faster and 15.8× more energy efficient than the PipeLayer [1] PIM accelerator. For testing, FloatPIM provides 324.8× speedup and 297.9× energy savings over the GPU, and 6.3× speedup and 21.6× energy savings over the ISAAC [2] PIM accelerator.
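As context for the floating-point claim above: a digital PIM design in the FloatPIM vein can implement a floating-point multiply as an XOR of the sign bits, an integer addition of the exponents, and a fixed-point multiplication of the mantissas, so that every step reduces to bitwise operations executable inside the memory array (e.g., via memristor-aided logic such as MAGIC). The sketch below is a minimal Python illustration of that decomposition, assuming a simplified bfloat16-like format (1 sign bit, 8 exponent bits, 7 mantissa bits); the field widths and function names are illustrative assumptions, not the paper's actual circuits.

```python
# Minimal software sketch of the exponent/mantissa decomposition that
# digital floating-point PIM builds on: a float multiply becomes an
# integer exponent addition plus a fixed-point mantissa multiplication.
# The bfloat16-like format and all names here are illustrative
# assumptions, not FloatPIM's actual implementation.

BIAS = 127          # exponent bias for an 8-bit exponent field
MANT_BITS = 7       # explicit mantissa bits (hidden leading 1 implied)

def decompose(sign, exponent, mantissa):
    """Pack illustrative bfloat16-like fields, restoring the hidden 1."""
    return (sign, exponent, mantissa | (1 << MANT_BITS))

def fp_mul(a, b):
    """Multiply two decomposed floats the way a digital PIM pipeline
    would: XOR the signs, add the exponents, multiply the mantissas."""
    sign = a[0] ^ b[0]                  # 1-bit XOR
    exp = a[1] + b[1] - BIAS            # integer addition
    mant = (a[2] * b[2]) >> MANT_BITS   # fixed-point multiply + shift
    if mant >> (MANT_BITS + 1):         # normalize if product >= 2.0
        mant >>= 1
        exp += 1
    # Return packed fields (hidden 1 stripped from the mantissa).
    return (sign, exp, mant & ((1 << MANT_BITS) - 1))

# Example: 1.5 * 2.5 = 3.75
a = decompose(0, BIAS, 0b1000000)       # 1.5 = 1.100... x 2^0
b = decompose(0, BIAS + 1, 0b0100000)   # 2.5 = 1.010... x 2^1
s, e, m = fp_mul(a, b)
value = (-1) ** s * (1 + m / 2 ** MANT_BITS) * 2 ** (e - BIAS)
print(value)                            # 3.75
```

The point of the decomposition is that none of these steps needs an analog multiply-accumulate: each is a bitwise or integer operation, which is what lets a fully digital PIM design avoid the analog/mixed-signal circuitry and multi-bit NVM cells criticized above.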
REFERENCES
[1] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined ReRAM-based accelerator for deep learning," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2017.
[2] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 14--26, IEEE Press, 2016.
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[4] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85--117, 2015.
[5] C. Dong, C. C. Loy, K. He, and X. Tang, "Learning a deep convolutional network for image super-resolution," in European Conference on Computer Vision, pp. 184--199, Springer, 2014.
[6] L. Deng, D. Yu, et al., "Deep learning: Methods and applications," Foundations and Trends® in Signal Processing, vol. 7, no. 3--4, pp. 197--387, 2014.
[7] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[8] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[9] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision, pp. 525--542, Springer, 2016.
[10] M. N. Bojnordi and E. Ipek, "Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1--13, IEEE, 2016.
[11] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 27--39, IEEE Press, 2016.
[12] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[13] M. Courbariaux, Y. Bengio, and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024, 2014.
[14] C. Louizos, M. Reisser, T. Blankevoort, E. Gavves, and M. Welling, "Relaxed quantization for discretized neural networks," arXiv preprint arXiv:1810.01875, 2018.
[15] "Bfloat16 floating-point format." https://en.wikipedia.org/wiki/Bfloat16_floating-point_format.
[16] "Intel Xeon processors and Intel FPGAs." https://venturebeat.com/2018/05/23/intel-unveils-nervana-neural-net-l-1000-for-accelerated-ai-training/.
[17] "Intel Xeon and FPGA lines." https://www.top500.org/news/intel-lays-out-new-roadmap-for-ai-portfolio/.
[18] "NNP-L1000." https://www.tomshardware.com/news/intel-neural-network-processor-lake-crest,37105.html.
[19] "Google Cloud." https://cloud.google.com/tpu/docs/tensorflow-ops.
[20] "TPU repository with TensorFlow 1.7.0." https://blog.riseml.com/comparing-google-tpuv2-against-nvidia-v100-on-resnet-50-c2bbb6a51e5e?gi=51a90720b9dd.
[21] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous, "TensorFlow distributions," arXiv preprint arXiv:1711.10604, 2017.
[22] Google, "In many models this is a drop-in replacement for float-32," May 8, 2018 (retrieved May 23, 2018). https://www.youtube.com/watch?v=vm67WcLzfvc&t=2555.
[23] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, "Enabling scientific computing on memristive accelerators," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 367--382, IEEE, 2018.
[24] "Intel and Micron produce breakthrough memory technology." http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology.
[25] M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, and H. Yang, "TIME: A training-in-memory architecture for memristor-based deep neural networks," in Proceedings of the 54th Annual Design Automation Conference, p. 26, ACM, 2017.
[26] Y. Cai, T. Tang, L. Xia, M. Cheng, Z. Zhu, Y. Wang, and H. Yang, "Training low bitwidth convolutional neural network on RRAM," in Proceedings of the 23rd Asia and South Pacific Design Automation Conference, pp. 117--122, IEEE Press, 2018.
[27] Y. Cai, Y. Lin, L. Xia, X. Chen, S. Han, Y. Wang, and H. Yang, "Long live TIME: Improving lifetime for training-in-memory engines by structured gradient sparsification," in Proceedings of the 55th Annual Design Automation Conference, p. 107, ACM, 2018.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097--1105, 2012.
[29] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993--1001, 1990.
[30] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, "MAGIC: Memristor-aided logic," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 11, pp. 895--899, 2014.
[31] S. Gupta, M. Imani, and T. Rosing, "FELIX: Fast and energy-efficient logic in memory," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1--7, IEEE, 2018.
[32] A. Siemon, S. Menzel, R. Waser, and E. Linn, "A complementary resistive switch-based crossbar array adder," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 5, no. 1, pp. 64--74, 2015.
[33] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser, "Memristor-based material implication (IMPLY) logic: Design principles and methodologies," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 10, pp. 2054--2066, 2014.
[34] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, "Memristive switches enable stateful logic operations via material implication," Nature, vol. 464, no. 7290, pp. 873--876, 2010.
[35] B. C. Jang, Y. Nam, B. J. Koo, J. Choi, S. G. Im, S.-H. K. Park, and S.-Y. Choi, "Memristive logic-in-memory integrated circuits for energy-efficient flexible electronics," Advanced Functional Materials, vol. 28, no. 2, 2018.
[36] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, "VTEAM: A general model for voltage-controlled memristors," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 8, pp. 786--790, 2015.
[37] N. Talati, S. Gupta, P. Mane, and S. Kvatinsky, "Logic design within memristive memories using memristor-aided logic (MAGIC)," IEEE Transactions on Nanotechnology, vol. 15, no. 4, pp. 635--650, 2016.
[38] M. Imani, S. Gupta, and T. Rosing, "Ultra-efficient processing in-memory for data intensive applications," in Proceedings of the 54th Annual Design Automation Conference, p. 6, ACM, 2017.
[39] A. Haj-Ali et al., "Efficient algorithms for in-memory fixed point multiplication using MAGIC," in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2018.
[40] M. Imani, D. Peroni, Y. Kim, A. Rahimi, and T. Rosing, "Efficient neural network acceleration on GPGPU using content addressable memory," in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1026--1031, IEEE, 2017.
[41] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pp. 476--488, IEEE, 2015.
[42] A. Nag, R. Balasubramonian, V. Srikumar, R. Walker, A. Shafiee, J. P. Strachan, and N. Muralimanohar, "Newton: Gravitating towards the physical limits of crossbar acceleration," IEEE Micro, vol. 38, no. 5, pp. 41--49, 2018.
[43] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770--778, 2016.
[44] A. Ghofrani, A. Rahimi, M. A. Lastras-Montaño, L. Benini, R. K. Gupta, and K.-T. Cheng, "Associative memristive memory for approximate computing in GPUs," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 6, no. 2, pp. 222--234, 2016.
[45] F. Chollet, "Keras." https://github.com/fchollet/keras, 2015.
[46] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[47] X. Dong, C. Xu, N. Jouppi, and Y. Xie, "NVSim: A circuit-level performance, energy, and area model for emerging non-volatile memory," in Emerging Memory Technologies, pp. 15--50, Springer, 2014.
[48] Synopsys, Inc., "Design Compiler Reference Manual and User Guide." http://www.synopsys.com, 2000.
[49] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1--9, 2015.
[50] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.
[51] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al., "Mixed precision training," arXiv preprint arXiv:1710.03740, 2017.
[52] M. Drumond, T. Lin, M. Jaggi, and B. Falsafi, "End-to-end DNN training with block floating point arithmetic," arXiv preprint arXiv:1804.01526, 2018.
[53] D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas, et al., "Mixed precision training of convolutional neural networks using integer operations," arXiv preprint arXiv:1802.00930, 2018.
[54] C. De Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, and C. Ré, "High-accuracy low-precision training," arXiv preprint arXiv:1803.03383, 2018.
[55] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 17, IEEE Press, 2016.
[56] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O'Leary, R. Genov, and A. Moshovos, "Bit-pragmatic deep neural network computing," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 382--394, ACM, 2017.
[57] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM SIGPLAN Notices, vol. 49, pp. 269--284, ACM, 2014.
[58] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, H. Esmaeilzadeh, and R. Gupta, "SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks," in ISCA, 2018.
[59] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. W. Fletcher, "UCNN: Exploiting computational reuse in deep neural networks via weight repetition," arXiv preprint arXiv:1804.06508, 2018.
[60] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, et al., "CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395--408, ACM, 2017.
[61] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, et al., "Can FPGAs beat GPUs in accelerating next-generation deep neural networks?," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 5--14, ACM, 2017.
[62] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243--254, IEEE, 2016.
[63] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al., "DaDianNao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609--622, IEEE Computer Society, 2014.
[64] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, "Neural cache: Bit-serial in-cache acceleration of deep neural networks," arXiv preprint arXiv:1805.03718, 2018.
[65] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, "DRISA: A DRAM-based reconfigurable in-situ accelerator," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 288--301, ACM, 2017.
[66] M. N. Bojnordi and E. Ipek, "The memristive Boltzmann machines," IEEE Micro, vol. 37, no. 3, pp. 22--29, 2017.
[67] M. Imani, M. Samragh, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, "RAPIDNN: In-memory deep neural network acceleration framework," arXiv preprint arXiv:1806.05794, 2018.
[68] S. Gupta, M. Imani, H. Kaur, and T. S. Rosing, "NNPIM: A processing in-memory architecture for neural network acceleration," IEEE Transactions on Computers, 2019.
[69] M. Imani, S. Gupta, and T. Rosing, "GenPIM: Generalized processing in-memory to accelerate data intensive applications," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1155--1158, IEEE, 2018.
[70] S. Salamat, M. Imani, S. Gupta, and T. Rosing, "RNSnet: In-memory neural network acceleration using residue number system," in 2018 IEEE International Conference on Rebooting Computing (ICRC), pp. 1--12, IEEE, 2018.
[71] M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, "Exploring hyperdimensional associative memory," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 445--456, IEEE, 2017.
[72] Y. Kim, M. Imani, and T. Rosing, "ORCHARD: Visual object recognition accelerator based on approximate in-memory processing," in 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 25--32, IEEE, 2017.
[73] M. Zhou, M. Imani, S. Gupta, and T. Rosing, "GAS: A heterogeneous memory architecture for graph processing," in Proceedings of the International Symposium on Low Power Electronics and Design, p. 27, ACM, 2018.
[74] M. Zhou, M. Imani, S. Gupta, Y. Kim, and T. Rosing, "GRAM: Graph processing in a ReRAM-based computational memory," in Proceedings of the 24th Asia and South Pacific Design Automation Conference, pp. 591--596, ACM, 2019.
[75] M. Imani, S. Gupta, S. Sharma, and T. Rosing, "NVQuery: Efficient query processing in non-volatile memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.