ABSTRACT
Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data serving layer in modern Internet-oriented datacenter infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of datacenters. Traditionally, these systems have suffered significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused on improving key-value performance. Hardware-centric research has begun to explore specialized platforms, including FPGAs, for KVSs; its results demonstrated an order-of-magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research has revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts also showed orders-of-magnitude improvements over stock memcached.
We aim to architect high-performance, efficient KVS platforms, and begin with a rigorous architectural characterization, across the system stack, of a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware and software ingredients of a high-performance KVS, but also guides optimizations atop a recent design that achieve a record-setting throughput of 120 million requests per second (MRPS) on a single commodity server. Our implementation delivers 9.2X the performance (RPS) and 2.8X the system energy efficiency (RPS/watt) of the best published FPGA-based claims. We craft a set of design principles for future platform architectures, and show through detailed simulation that a single server built according to these principles can achieve a billion RPS.
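The headline figures above imply a few ratios that are easy to check. The following is a minimal sketch using only the numbers quoted in the abstract; the variable names are illustrative. The derived power ratio assumes that the 9.2X and 2.8X comparisons refer to the same FPGA baseline, which the abstract does not state.

```python
# Figures quoted in the abstract.
measured_mrps = 120.0      # throughput on a single commodity server (MRPS)
perf_ratio = 9.2           # throughput relative to the best published FPGA claim
efficiency_ratio = 2.8     # RPS/watt relative to the best published FPGA claim
target_mrps = 1000.0       # one billion requests per second

# Throughput of the FPGA baseline implied by the 9.2X figure.
implied_fpga_mrps = measured_mrps / perf_ratio

# Relative system power, ASSUMING both ratios share one FPGA baseline:
# power = RPS / (RPS/watt), so the ratio is perf_ratio / efficiency_ratio.
implied_power_ratio = perf_ratio / efficiency_ratio

# Remaining throughput gap between the measured server and the simulated target.
gap_to_target = target_mrps / measured_mrps

print(f"Implied FPGA baseline: {implied_fpga_mrps:.1f} MRPS")
print(f"Implied power ratio (assumption): {implied_power_ratio:.1f}x")
print(f"Gap to a billion RPS: {gap_to_target:.1f}x")
```

Under these assumptions, the measured commodity server would draw roughly three times the power of the FPGA baseline while serving about nine times the requests, and the simulated billion-RPS platform would need a little over eight times the measured throughput.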
Index Terms
- Architecting to achieve a billion requests per second throughput on a single key-value store server platform