ABSTRACT
Distributed in-memory key-value stores, such as memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully designed benchmarks that exercise the range of memcached behavior. We discover that, regardless of CPU microarchitecture, memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find that performance is typically limited by the per-packet processing overheads in the NIC and OS kernel: long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks.
Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternative architecture, Thin Servers with Smart Pipes (TSSP), for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X to 16X power-performance improvement over conventional server baselines.
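The division of labor described above can be illustrated with a toy software model. The sketch below is not the authors' hardware design; the class `SmartPipe`, its methods, and its counters are hypothetical names chosen for illustration. It assumes requests arrive in the standard memcached text protocol, and that a single-key GET is answered on a "fast path" (standing in for the accelerator) while every other command is handed to a "core" routine (standing in for the embedded processor).

```python
class SmartPipe:
    """Toy model of a GET fast path in front of a general-purpose core.

    All names here are hypothetical; this only mimics the request split
    in software and says nothing about the real accelerator's internals.
    """

    def __init__(self):
        self.table = {}      # key -> (flags, data); shared by both paths
        self.fast_path = 0   # requests answered without involving the core
        self.forwarded = 0   # requests handed to the core

    def handle(self, request: bytes) -> bytes:
        # Split the command line from any data block that follows it.
        header, _, rest = request.partition(b"\r\n")
        fields = header.split(b" ")
        if fields[0] == b"get" and len(fields) == 2:
            self.fast_path += 1
            return self._get(fields[1])
        self.forwarded += 1
        return self._core_handle(fields, rest)

    def _get(self, key: bytes) -> bytes:
        # Fast path: a single table lookup, then a formatted reply.
        entry = self.table.get(key)
        if entry is None:
            return b"END\r\n"  # a miss in the text protocol
        flags, data = entry
        return b"VALUE %b %d %d\r\n%b\r\nEND\r\n" % (key, flags, len(data), data)

    def _core_handle(self, fields, rest: bytes) -> bytes:
        # Slow path: only "set" is modeled; anything else is rejected.
        if fields[0] == b"set" and len(fields) == 5:
            key, flags, nbytes = fields[1], int(fields[2]), int(fields[4])
            self.table[key] = (flags, rest[:nbytes])
            return b"STORED\r\n"
        return b"ERROR\r\n"


pipe = SmartPipe()
pipe.handle(b"set k 0 0 5\r\nhello\r\n")  # goes to the core
reply = pipe.handle(b"get k\r\n")         # answered on the fast path
```

In this model the fast path touches only the lookup table and the reply formatter, which is the property that makes a fixed-function implementation plausible; updates and all remaining commands stay on the general-purpose side.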