Abstract

Even though Dennard scaling came to an end fifteen years ago, Moore's law has kept fueling exponential growth in compute performance through increased parallelization. The performance of memory, however, and of Dynamic Random Access Memory (DRAM) in particular, has been improving at a slower pace for decades, making memory system optimization increasingly crucial. Conventional solutions mitigate the issue by shifting as many memory accesses as possible from off-chip DRAM to on-chip Static RAM (SRAM), which has higher performance but lower capacity. This is achieved by relying on spatial and temporal locality or on precise compile-time information about the access pattern. When the access pattern is irregular and data-dependent, however, these solutions are ineffective, and the processing-memory gap grows even wider because DRAMs themselves are optimized for sequential accesses.

In this thesis, we present a novel memory system for throughput-oriented compute engines that perform irregular read accesses to DRAM. When accesses are irregular, obtaining a reasonable benefit from on-chip memory may be unrealistic; we therefore focus on minimizing stalls and on reusing each memory response to serve as many misses as possible, without relying on long-term data storage. This is the same insight behind nonblocking caches, but on a vastly larger scale in terms of outstanding misses, which greatly increases the opportunities for data reuse when accelerators emit a large number of outstanding reads. Because we optimize miss handling rather than increasing the hit rate, we call our architecture a miss-optimized memory system (MOMS).

We first focus on the microarchitectural level and show how a MOMS can support three orders of magnitude more outstanding misses than a traditional nonblocking cache, in a way that can be efficiently implemented on Field-Programmable Gate Arrays (FPGAs). Having maximized the reuse of each individual word returned by the DRAM, we introduce two techniques to increase DRAM throughput. When the DRAM controller is optimized for burst requests, we group incoming requests over multiple words that are requested as a burst. Conversely, when the DRAM controller handles single requests efficiently, our MOMS reorders requests by DRAM bank and row on a much larger scale than general-purpose DRAM controllers. We then discuss techniques to use the vast amount of resources provided by multi-die FPGAs efficiently and introduce two-level architectures that balance reuse maximization against contention for shared hardware. Finally, we develop a graph processing accelerator backed by a MOMS. On three algorithms running on graphs with billions of edges and up to a hundred million nodes, our accelerator outperforms the state of the art on FPGAs and achieves higher performance per watt and per unit of bandwidth than the state of the art on CPUs and GPUs.

Memory system designers face increasing pressure to keep up with the performance of compute engines. Our results suggest that miss-optimized memory systems can help reduce the processing-memory gap where it is largest, that is, when accesses to memory are irregular and difficult to serve from local buffers.
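The miss-reuse idea at the heart of the MOMS lends itself to a compact software model. The sketch below is illustrative only and is not the thesis's hardware design: all names and parameters (MissTracker, kLineBytes, the trace) are invented for this example. It tracks in-flight line reads in a map, issues a DRAM read only for the first miss to a line, and lets every later miss to the same line wait on that single in-flight read, so one DRAM response can retire many misses.

```cpp
// Minimal software model of miss reuse: every in-flight DRAM read is
// tracked once, and later misses to the same line wait on that single
// request instead of issuing their own.
#include <cstdint>
#include <iostream>
#include <queue>
#include <unordered_map>
#include <vector>

using Addr = uint64_t;
constexpr Addr kLineBytes = 64;  // illustrative line/burst size

struct MissTracker {
    // line address -> requester ids waiting for that line
    std::unordered_map<Addr, std::vector<int>> pending;
    std::queue<Addr> to_dram;    // reads actually sent to DRAM
    uint64_t dram_reads = 0, misses = 0;

    void request(int requester, Addr byte_addr) {
        ++misses;
        Addr line = byte_addr / kLineBytes;
        auto [it, is_new] = pending.try_emplace(line);
        it->second.push_back(requester);
        if (is_new) {            // first miss to this line: go to DRAM
            to_dram.push(line);
            ++dram_reads;
        }                        // otherwise: reuse the in-flight read
    }

    // One DRAM response retires every miss queued behind it.
    void respond() {
        Addr line = to_dram.front();
        to_dram.pop();
        pending.erase(line);     // all waiters for `line` are served
    }
};

int main() {
    MissTracker mt;
    // Irregular accesses with heavy short-term reuse of a few lines.
    Addr trace[] = {0x40, 0x48, 0x1000, 0x50, 0x1008, 0x40};
    for (int i = 0; i < 6; ++i) mt.request(i, trace[i]);
    while (!mt.to_dram.empty()) mt.respond();
    std::cout << mt.misses << " misses served by "
              << mt.dram_reads << " DRAM reads\n";  // 6 misses, 2 reads
}
```

The burst grouping mentioned in the abstract follows the same pattern at a coarser granularity: the tracker's key becomes a burst address covering several consecutive words, so one burst read can serve word requests that landed anywhere inside it.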
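The bank/row reordering can be sketched just as briefly: pending reads are bucketed by (bank, row) and drained one bucket at a time, so each bank's row buffer is opened once per bucket instead of thrashing between rows. The address-bit split below is a made-up example mapping, not taken from the thesis.

```cpp
// Illustrative reordering of pending reads by (bank, row) to maximize
// DRAM row-buffer hits.
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

using Addr = uint64_t;

// Hypothetical DRAM address split, for illustration only:
// bits [12:0] column, [14:13] bank, [63:15] row.
static int  bank_of(Addr a) { return (a >> 13) & 0x3; }
static Addr row_of(Addr a)  { return a >> 15; }

// A new activation is needed whenever the row open in the addressed
// bank changes.
static uint64_t activations(const std::vector<Addr>& seq) {
    std::map<int, Addr> open_row;  // bank -> currently open row
    uint64_t acts = 0;
    for (Addr a : seq) {
        auto it = open_row.find(bank_of(a));
        if (it == open_row.end() || it->second != row_of(a)) {
            open_row[bank_of(a)] = row_of(a);
            ++acts;
        }
    }
    return acts;
}

int main() {
    // Irregular pattern bouncing between two rows of the same bank.
    std::vector<Addr> pend = {0x00000, 0x10000, 0x00040,
                              0x10040, 0x00080, 0x100c0};
    // Reorder: bucket by (bank, row), then drain one bucket at a time.
    std::map<std::pair<int, Addr>, std::vector<Addr>> buckets;
    for (Addr a : pend) buckets[{bank_of(a), row_of(a)}].push_back(a);
    std::vector<Addr> reordered;
    for (auto& [key, addrs] : buckets)
        reordered.insert(reordered.end(), addrs.begin(), addrs.end());

    std::cout << "activations before: " << activations(pend)
              << ", after: " << activations(reordered) << "\n";  // 6 vs 2
}
```

A general-purpose DRAM controller performs this kind of reordering over a window of a few dozen requests; the point made in the abstract is that a MOMS, with its far larger pool of outstanding misses, can apply it at a much larger scale.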
