DOI: 10.1145/3410463.3414627 · PACT Conference Proceedings

Transmuter: Bridging the Efficiency Gap using Memory and Dataflow Reconfiguration

Published: 30 September 2020

ABSTRACT

With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build hardware for emerging applications that meets power and performance targets while remaining flexible and programmable for end users. This is particularly true for domains with frequently changing algorithms and for applications involving mixed sparse/dense data structures, such as those in machine learning and graph analytics. To overcome this, we present a flexible accelerator called Transmuter, in a novel effort to bridge the gap between General-Purpose Processors (GPPs) and Application-Specific Integrated Circuits (ASICs). Transmuter adapts to changing kernel characteristics, such as data reuse and control divergence, by reconfiguring the on-chip memory type, resource sharing, and dataflow at run-time within a short latency. This is facilitated by a fabric of light-weight cores connected to a network of reconfigurable caches and crossbars. Transmuter addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiencies for traditional dense applications. Finally, to support programmability and ease of adoption, we prototype a software stack composed of low-level runtime routines and a high-level language library called TransPy, which cater to expert programmers and end users, respectively.
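To make the idea of per-kernel reconfiguration concrete, the following is a minimal toy sketch in Python. It is not the Transmuter runtime or the TransPy API: every name here (`KernelProfile`, `FabricConfig`, `choose_config`), the two normalized metrics, and the threshold rules are hypothetical and chosen only for illustration. It shows how a runtime might map the kernel characteristics named in the abstract (data reuse, control divergence) onto two of the reconfigurable dimensions (on-chip memory type and resource sharing).

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    CACHE = "cache"            # hardware-managed
    SCRATCHPAD = "scratchpad"  # software-managed


class Sharing(Enum):
    PRIVATE = "private"
    SHARED = "shared"


@dataclass(frozen=True)
class KernelProfile:
    # Hypothetical normalized metrics, not defined in the abstract.
    data_reuse: float          # 0.0 (streaming) to 1.0 (heavy reuse)
    control_divergence: float  # 0.0 (regular) to 1.0 (highly divergent)


@dataclass(frozen=True)
class FabricConfig:
    memory: MemoryType
    sharing: Sharing


def choose_config(profile, reuse_threshold=0.5, divergence_threshold=0.5):
    """Pick a configuration from a kernel profile.

    The rules below are illustrative assumptions, not the paper's policy:
    regular kernels get a scratchpad, divergent ones get a cache, and
    high-reuse kernels share on-chip memory across cores.
    """
    if profile.control_divergence < divergence_threshold:
        memory = MemoryType.SCRATCHPAD
    else:
        memory = MemoryType.CACHE
    if profile.data_reuse >= reuse_threshold:
        sharing = Sharing.SHARED
    else:
        sharing = Sharing.PRIVATE
    return FabricConfig(memory, sharing)


# A regular, high-reuse kernel versus an irregular, low-reuse kernel.
dense_like = choose_config(KernelProfile(data_reuse=0.9, control_divergence=0.1))
sparse_like = choose_config(KernelProfile(data_reuse=0.2, control_divergence=0.8))
print(dense_like)
print(sparse_like)
```

In this sketch the decision is made once per kernel. A real system would also have to account for the cost of switching configurations, which the abstract describes only as "a short latency".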

Our evaluations with Transmuter demonstrate average throughput (energy-efficiency) improvements of 5.0× (18.4×) and 4.2× (4.0×) over a high-end CPU and GPU, respectively, across a diverse set of kernels predominant in graph analytics, scientific computing and machine learning. Transmuter achieves energy-efficiency gains averaging 3.4× and 2.0× over prior FPGA and CGRA implementations of the same kernels, while remaining on average within 9.3× of state-of-the-art ASICs.
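One rough way to read the paired figures above is to ask what they imply about relative power draw. The sketch below assumes that energy-efficiency means throughput per unit power, which the abstract does not state explicitly, and it treats the reported averages as if they described a single workload. Since a ratio of averages is not the same as an average of ratios, the results are only a back-of-the-envelope reading and not a figure from the paper. The function name `implied_power_ratio` is made up for this example.

```python
def implied_power_ratio(throughput_gain, efficiency_gain):
    """Relative power of the new design versus the baseline.

    Assumes efficiency = throughput / power, so that
    efficiency_gain = throughput_gain / power_ratio.
    """
    return throughput_gain / efficiency_gain


# Average gains quoted in the abstract: throughput, then energy-efficiency.
vs_cpu = implied_power_ratio(5.0, 18.4)
vs_gpu = implied_power_ratio(4.2, 4.0)

print(f"Implied power relative to CPU: {vs_cpu:.2f}x")
print(f"Implied power relative to GPU: {vs_gpu:.2f}x")
```

Under these assumptions the implied power is roughly 0.27× that of the CPU and 1.05× that of the GPU. This suggests the gain over the CPU comes from both higher throughput and lower power, while the gain over the GPU comes almost entirely from throughput.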

  104. John E Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in science & engineering, Vol. 12, 3 (2010), 66--73.Google ScholarGoogle Scholar
  105. Cheng Tan, Manupa Karunaratne, Tulika Mitra, and Li-Shiuan Peh. 2018. Stitch: Fusible heterogeneous accelerators enmeshed with many-core architecture for wearables. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 575--587.Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Masakazu Tanomoto, Shinya Takamaeda-Yamazaki, Jun Yao, and Yasuhiko Nakashima. 2015. A cgra-based approach for accelerating convolutional neural networks. In 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip. IEEE, 73--80.Google ScholarGoogle ScholarDigital LibraryDigital Library
  107. Michael Bedford Taylor, Jason Sungtae Kim, Jason E. Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul R. Johnson, Jae W. Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matthew I. Frank, Saman P. Amarasinghe, and Anant Agarwal. 2002. The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs. IEEE Micro, Vol. 22, 2 (2002), 25--35.Google ScholarGoogle ScholarDigital LibraryDigital Library
  108. Vaishali Tehre, Pankaj Agrawal, and RV Kshrisagar. [n.d.]. Implementation of Fast Fourier Transform Accelerator on Coarse Grain Reconfigurable Architecture. ( [n.,d.]).Google ScholarGoogle Scholar
  109. Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, and Anand Raghunathan. 2017. Scaledeep: A scalable compute architecture for learning and evaluating deep networks. ACM SIGARCH Computer Architecture News, Vol. 45, 2 (2017), 13--26.Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. Manish Verma, Lars Wehmeyer, Peter Marwedel, and Peter Marwedel. 2004. Cache-Aware Scratchpad Allocation Algorithm. In Proceedings of the Conference on Design, Automation and Test in Europe - Volume 2 (DATE '04). IEEE Computer Society, 21264--.Google ScholarGoogle ScholarCross RefCross Ref
  111. Kizheppatt Vipin and Suhaib A Fahmy. 2018. FPGA dynamic and partial reconfiguration: A survey of architectures, methods, and applications. ACM Computing Surveys (CSUR), Vol. 51, 4 (2018), 1--39.Google ScholarGoogle ScholarDigital LibraryDigital Library
  112. Donglin Wang, Xueliang Du, Leizu Yin, Chen Lin, Hong Ma, Weili Ren, Huijuan Wang, Xingang Wang, Shaolin Xie, Lei Wang, Zijun Liu, Tao Wang, Zhonghua Pu, Guangxin Ding, Mengchen Zhu, Lipeng Yang, Ruoshan Guo, Zhiwei Zhang, Xiao Lin, Jie Hao, Yongyong Yang, Wenqin Sun, Fabiao Zhou, NuoZhou Xiao, Qian Cui, and Xiaoqin Wang. 2016. MaPU: A novel mathematical computing architecture. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2016), 457--468.Google ScholarGoogle ScholarCross RefCross Ref
  113. Jagath Weerasinghe, Francois Abel, Christoph Hagleitner, and Andreas Herkersdorf. 2015. Enabling FPGAs in Hyperscale Data Centers. In 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom). IEEE, 1078--1086.Google ScholarGoogle Scholar
  114. Mark Wijtvliet, Luc Waeijen, and Henk Corporaal. 2016. Coarse Grained Reconfigurable Architectures in the Past 25 years: Overview and Classification. In 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS). IEEE, 235--244.Google ScholarGoogle ScholarCross RefCross Ref
  115. Xilinx [n.d.]. Partial Reconfiguration User Guide UG702 (v13.3). Xilinx. https: //www.xilinx.com/support/documentation/sw_manuals/xilinx13_3/ug702.pdfGoogle ScholarGoogle Scholar
  116. Xilinx [n.d.]. Partial Reconfiguration User Guide UG909 (v2018.1). Xilinx. https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_ 1/ug909-vivado-partial-reconfiguration.pdfGoogle ScholarGoogle Scholar
  117. Ichitaro Yamazaki and Xiaoye S Li. 2010. On techniques to improve robustness and scalability of a parallel hybrid linear solver. In International Conference on High Performance Computing for Computational Science. Springer, 421--434.Google ScholarGoogle Scholar
  118. Fanghua Ye, Chuan Chen, and Zibin Zheng. 2018. Deep Autoencoder-like Nonnegative Matrix Factorization for Community Detection. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1393--1402.Google ScholarGoogle ScholarDigital LibraryDigital Library
  119. Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing fpga-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 161--170.Google ScholarGoogle ScholarDigital LibraryDigital Library
  120. Qiuling Zhu, Tobias Graf, H. Ekin Sumbul, Lawrence T. Pileggi, and Franz Franchetti. 2013. Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware. 2013 IEEE High Performance Extreme Computing Conference (HPEC) (2013), 1--6.Google ScholarGoogle ScholarCross RefCross Ref

        • Published in

          PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
          September 2020
          505 pages
          ISBN: 9781450380751
          DOI: 10.1145/3410463
          General Chair: Vivek Sarkar
          Program Chair: Hyesoon Kim

          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 30 September 2020


          Acceptance Rates

          Overall Acceptance Rate: 121 of 471 submissions, 26%
