ABSTRACT
Rack-scale computers, comprising a large number of micro-servers connected by a direct-connect topology, are expected to replace servers as the building block in data centers. We focus on the problem of routing and congestion control across the rack's network, and find that high path diversity in rack topologies, in combination with workload diversity across it, means that traditional solutions are inadequate. We introduce R2C2, a network stack for rack-scale computers that provides flexible and efficient routing and congestion control. R2C2 leverages the fact that the scale of rack topologies allows for low-overhead broadcasting to ensure that all nodes in the rack are aware of all network flows. We thus achieve rate-based congestion control without any probing; each node independently determines the sending rate for its flows while respecting the provider's allocation policies. For routing, nodes dynamically choose the routing protocol for each flow in order to maximize overall utility. Through a prototype deployed across a rack emulation platform and a packet-level simulator, we show that R2C2 achieves very low queuing and high throughput for diverse and bursty workloads, and that routing flexibility can provide significant throughput gains.
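To illustrate how a node could derive sending rates without probing once it knows every flow in the rack, the sketch below computes a max-min fair allocation by progressive filling over a globally known flow set. The abstract does not name the allocation algorithm, so max-min fairness, the function name `max_min_rates`, and the link and flow identifiers are all assumptions made for illustration; this is not R2C2's actual implementation.

```python
from collections import defaultdict


def max_min_rates(flows, capacity):
    """Progressive filling over a globally known flow set (illustrative only).

    flows:    dict flow_id -> non-empty list of link ids on the flow's path
    capacity: dict link_id -> link capacity (any consistent unit)
    Returns   dict flow_id -> allocated rate
    """
    remaining = dict(capacity)   # capacity not yet handed out, per link
    active = set(flows)          # flows whose rate is not yet fixed
    rates = {}
    while active:
        # Count the unfrozen flows crossing each link.
        load = defaultdict(int)
        for f in active:
            for link in flows[f]:
                load[link] += 1
        # The bottleneck is the link offering the smallest equal share.
        bottleneck = min(load, key=lambda l: remaining[l] / load[l])
        share = remaining[bottleneck] / load[bottleneck]
        # Freeze every flow crossing the bottleneck at that share.
        frozen = {f for f in active if bottleneck in flows[f]}
        for f in frozen:
            rates[f] = share
            for link in flows[f]:
                remaining[link] -= share
        active -= frozen
    return rates


# Hypothetical view of the rack that every node would hold after broadcast.
view_flows = {"f1": ["a-b", "b-c"], "f2": ["a-b"], "f3": ["b-c"]}
view_capacity = {"a-b": 10, "b-c": 4}
print(sorted(max_min_rates(view_flows, view_capacity).items()))
# [('f1', 2.0), ('f2', 8.0), ('f3', 2.0)]
```

Because the computation is deterministic, any two nodes holding the same view of flows and capacities arrive at the same rates, which is what lets each node act independently in this sketch.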
R2C2: A Network Stack for Rack-scale Computers
SIGCOMM '15