research-article

Thin servers with smart pipes: designing SoC accelerators for memcached

Authors:
Kevin Lim (HP Labs)
David Meisner (Facebook)
Ali G. Saidi (ARM R&D)
Parthasarathy Ranganathan (HP Labs)
Thomas F. Wenisch (EECS, Univ. of Michigan)

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture, June 2013, Pages 36–47
https://doi.org/10.1145/2485922.2485926

Published: 23 June 2013


ABSTRACT

Distributed in-memory key-value stores, such as memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully designed benchmarks that exercise the range of memcached behavior. We discover that, regardless of CPU microarchitecture, memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find performance is typically limited by the per-packet processing overheads in the NIC and OS kernel: long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks.
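To make the bottleneck argument concrete, the following minimal sketch compares the request rate a network link could carry with the rate a CPU could sustain when every request costs a fixed number of cycles. All numbers (link speed, bytes per request, clock rate, cycles per request, core count) are hypothetical values chosen for illustration; they are not measurements from the paper.

```python
def request_rate_limits(link_gbps, bytes_per_request, core_hz,
                        cycles_per_request, cores):
    """Return (link_limit, cpu_limit) in requests per second."""
    # Requests per second the link can carry at full utilization.
    link_limit = link_gbps * 1e9 / 8 / bytes_per_request
    # Requests per second the cores can process given per-request overhead.
    cpu_limit = cores * core_hz / cycles_per_request
    return link_limit, cpu_limit


def bottleneck(link_limit, cpu_limit):
    """Name the resource that caps throughput."""
    return "cpu" if cpu_limit < link_limit else "network"


# Hypothetical inputs, for illustration only (not from the paper).
link_limit, cpu_limit = request_rate_limits(
    link_gbps=10,             # assumed 10 Gb/s link
    bytes_per_request=125,    # assumed small request/response size
    core_hz=2e9,              # assumed 2 GHz cores
    cycles_per_request=20000, # assumed NIC + kernel + application path length
    cores=4,
)
limiting_resource = bottleneck(link_limit, cpu_limit)
```

Under these assumed inputs the CPU limit falls well below the link limit, which is the shape of the result the abstract describes: per-packet processing cost, rather than link or memory bandwidth, sets the ceiling.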

Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternate architecture, Thin Servers with Smart Pipes (TSSP), for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X to 16X power-performance improvement over conventional server baselines.
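The division of labor described above can be sketched as a small software model: GET requests are answered on a fast path, and every other command is forwarded to the general-purpose core. This is only an illustrative sketch, not the authors' hardware design. The class names `GetAccelerator` and `SoftwareCore`, the shared dictionary standing in for memory, and the choice to forward all non-GET commands are assumptions made for the example.

```python
class SoftwareCore:
    """Hypothetical stand-in for the embedded core's software path."""

    def __init__(self, store):
        self.store = store

    def handle(self, command, key, value=None):
        if command == "GET":
            return self.store.get(key)
        if command == "SET":
            self.store[key] = value
            return "STORED"
        if command == "DELETE":
            removed = self.store.pop(key, None)
            return "DELETED" if removed is not None else "NOT_FOUND"
        raise ValueError(f"unsupported command: {command}")


class GetAccelerator:
    """Hypothetical model of a GET fast path placed in front of the core."""

    def __init__(self, store, core):
        self.store = store      # shared key-value memory (assumption)
        self.core = core
        self.gets_served = 0    # GETs answered without involving the core
        self.forwarded = 0      # commands passed on to the core

    def handle(self, command, key, value=None):
        if command == "GET":
            # Fast path: look up the key directly, bypassing the core.
            self.gets_served += 1
            return self.store.get(key)
        # Everything else takes the slower software path.
        self.forwarded += 1
        return self.core.handle(command, key, value)


store = {}
accel = GetAccelerator(store, SoftwareCore(store))
results = [
    accel.handle("SET", "user:1", "alice"),
    accel.handle("GET", "user:1"),
    accel.handle("GET", "missing"),
    accel.handle("DELETE", "user:1"),
]
```

In this model the core only sees the SET and DELETE, while both GETs (a hit and a miss) are served by the fast path. In a read-dominated workload, that split would leave the low-power core mostly idle.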


Published in
ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
June 2013, 686 pages
ISBN: 9781450320795
DOI: 10.1145/2485922
General Chair: Avi Mendelson (Technion)

ACM SIGARCH Computer Architecture News, Volume 41, Issue 3 (ISCA '13)
June 2013, 666 pages
ISSN: 0163-5964
DOI: 10.1145/2508148
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
ISCA '13 Paper Acceptance Rate: 56 of 288 submissions, 19%
Overall Acceptance Rate: 543 of 3,203 submissions, 17%