DOI: 10.1145/2749469.2750416
research-article
Open Access

Architecting to achieve a billion requests per second throughput on a single key-value store server platform

Published: 13 June 2015

ABSTRACT

Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data serving layer in modern Internet-oriented datacenter infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of datacenters. Traditionally, these systems have had significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused on improving key-value performance. Hardware-centric research has started to explore specialized platforms, including FPGAs, for KVSs; results demonstrated an order of magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts also showed orders-of-magnitude improvements over stock memcached.

We aim to architect high-performance, efficient KVS platforms, and start with a rigorous architectural characterization across system stacks over a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware/software ingredients for high-performance KVS systems, but also leads to guided optimizations atop a recent design to achieve a record-setting throughput of 120 million requests per second (MRPS) on a single commodity server. Our implementation delivers 9.2X the performance (RPS) and 2.8X the system energy efficiency (RPS/watt) of the best-published FPGA-based claims. We craft a set of design principles for future platform architectures, and via detailed simulations demonstrate the capability of achieving a billion RPS with a single server constructed following our principles.
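
The ratios quoted above can be unpacked with a little arithmetic. The sketch below is only a back-of-the-envelope reading of the abstract's rounded figures: the function name `implied_figures` and the constants are illustrative assumptions, and the derived values (implied FPGA throughput, relative power draw, remaining gap to a billion RPS) are inferences, not numbers reported by the authors.

```python
# Figures quoted in the abstract. Everything derived below is an inference
# from these rounded numbers, not a value reported by the authors.
MEASURED_MRPS = 120.0       # throughput on a single commodity server (MRPS)
SPEEDUP_VS_FPGA = 9.2       # throughput ratio over the best-published FPGA claim
EFFICIENCY_VS_FPGA = 2.8    # RPS/watt ratio over the same FPGA claim
TARGET_MRPS = 1000.0        # one billion requests per second, in MRPS


def implied_figures():
    """Return quantities implied by the abstract's ratios (hypothetical helper)."""
    # Throughput the FPGA baseline would need for the 9.2X claim to hold.
    fpga_mrps = MEASURED_MRPS / SPEEDUP_VS_FPGA
    # Efficiency is throughput / power, so the power ratio is the
    # throughput ratio divided by the efficiency ratio.
    power_ratio = SPEEDUP_VS_FPGA / EFFICIENCY_VS_FPGA
    # Factor still separating the measured result from a billion RPS.
    gap_to_target = TARGET_MRPS / MEASURED_MRPS
    return fpga_mrps, power_ratio, gap_to_target


if __name__ == "__main__":
    fpga_mrps, power_ratio, gap = implied_figures()
    print(f"Implied FPGA baseline throughput: {fpga_mrps:.1f} MRPS")
    print(f"Implied system power relative to FPGA: {power_ratio:.2f}x")
    print(f"Remaining factor to reach 1 billion RPS: {gap:.2f}x")
```

Read this way, the commodity server draws more power than the FPGA platform but delivers proportionally more throughput, and the billion-RPS target shown in simulation is somewhat less than an order of magnitude beyond the measured result.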

Published in

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469

Copyright © 2015 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 543 of 3,203 submissions, 17%
