Hardware-Software Co-Design of an RPC Processor

Pourhabibi Zarandi, Arash

doi:10.5075/epfl-thesis-7217

Pourhabibi Zarandi, Arash

2021

Download

Formats

Format
BibTeX
MARCXML
TextMARC
MARC
DublinCore
EndNote
NLM
RefWorks
RIS

Files

Abstract

The booming popularity of online services has led to a major evolution in the way these services are built and deployed. To cope with such online data-intensive services, service providers deploy several massive-scale datacenters, also referred to as warehouse-scale computers, each populated with up to hundreds of thousands of servers. The services also follow the paradigm of microservices, which decomposes online services into fine-grained software modules frequently communicating over the datacenter network using Remote Procedure Calls (RPCs). Microservices simplify and accelerate software development and allow independent development and performance debugging of each microservice using the most suitable programming language and tools. Furthermore, microservices simplify software deployment and enable scaling and updating individual microservices independently. However, because services are deployed in a distributed fashion, frequent communication is needed to complete a request, putting pressure on the networking infrastructure of the datacenter. As a result, networking technology has been evolving rapidly both in software and hardware to address this extra communication overhead, also referred to as the "RPC tax" in datacenters. High-performance network fabrics and new network protocols have been developed to address the performance and scalability issues associated with the increasing volume of communication between software components. Although the tax on inter-microservice communication includes both the RPC layer and the underlying network stack, ongoing advancements have mainly targeted the network stack, leading to a drastic reduction of the networking latency and exposing the RPC layer itself as a bottleneck. While modern fabrics continue improving network bandwidth, silicon's efficiency and density scaling met an abrupt slowdown with the end of Dennard scaling and the slowdown of Moore's law, putting more pressure on the RPC layer running on the general-purpose CPUs. Overall, the RPC layer accounts for a significant fraction of both a single request's latency and the datacenter's total compute capacity; thus, optimizing the hardware-software stack for RPCs is of critical importance. In this thesis, we break down the underlying modules that comprise production RPC layers and show that CPUs can only expect limited improvements for such tasks, mandating a shift to hardware to remove the RPC layer as a limiter of microservice performance. Motivated by the growing RPC tax in datacenters, we advocate for hardware-software co-design to evade the RPC tax. We present design principles guiding the architecture of an RPC processor and show that conclusively removing the RPC layer bottleneck requires all of the RPC layer's modules to be executed by a NIC-attached hardware accelerator. We propose a NIC-integrated RPC processor that runs production RPC layers and acts as an intermediary stage between the NIC and the microservice running on the CPU. Because such an RPC processor can peek into the request's data, it opens up further opportunities such as intelligent load balancing and request dispatch. We make the case that such an RPC processor is an ideal candidate for inclusion in future server chips to better support and run microservices as they decompose into even finer granularity.