Design Options for Small-Scale Shared Memory Multiprocessors

by

Luiz André Barroso

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

December, 1996
Copyright 1996 Luiz André Barroso
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by
Luiz André Barroso
under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of
DOCTOR OF PHILOSOPHY

Dean of Graduate Studies
December 17, 1996
DISSERTATION COMMITTEE
Chairperson
to Jacqueline Chame
Acknowledgments

During my stay at USC I have had the privilege of interacting with a number of people who have, in many different and significant ways, helped me along the way. Out of this large group I would like to mention just a few names. Koray Oner, Krishnan Ramamurthy, Weihua Mao, Barton Sano and Fong Pong have been friends and colleagues in the everyday grind. Koray and Jaeheon Jeong and I spent way too many sleepless nights together building and debugging the RPM multiprocessor. Thanks to their work ethic, talent and self-motivation we were able to get it done. I am also thankful for the support of my thesis committee throughout the years.

Although separated from them by thousands of miles, my family has been very much present all along, and I cannot thank them enough for their love and support. The Nobrega Chame family has been no less loving and supportive. My friends, PC and Beto, have also been in my heart and thoughts despite the distance.

I am indebted to the people at Digital Equipment Western Research Laboratory for offering me a job in a very special place. Thanks in particular to Joel Bartlett, Kourosh Gharachorloo and Marco Annaratone for reminding me that I had a thesis to finish when I was immersed in a lot of other fun stuff.

Jacqueline Chame is the main reason why I have survived it.
Table of Contents

CHAPTER 1: INTRODUCTION
1.1 Motivations
1.2 Summary of Research Contributions
1.3 Prior Related Work and Background
1.3.1 Multiprocessor Interconnect Architectures
1.3.1.1 Uniform vs. Non-Uniform Memory Access Architectures
1.3.1.2 Limits on Bus Performance
1.3.1.3 Point-to-Point Links
1.3.1.4 Ring Networks
1.3.1.5 Crossbar Networks
1.3.1.6 Other Networks
1.3.1.7 Cluster-based Architectures
1.3.2 Cache Coherence Protocols
1.3.2.1 Snooping
1.3.2.2 Centralized Directories
1.3.2.3 Distributed Directories
1.3.3 Reducing and Tolerating Memory Latencies
1.3.3.1 Prefetching
1.3.3.2 Relaxed Consistency Models
1.3.3.3 Multithreading
1.3.3.4 Hardware Support for Synchronization
1.3.4 Performance Evaluation Methodologies

CHAPTER 2: CACHE COHERENCE IN RING-BASED MULTIPROCESSORS
2.1 Ring Architectures
2.1.1 Token-Passing Ring
2.1.2 Register Insertion Ring
2.1.3 Slotted Ring
2.1.4 Packaging and Electrical Considerations
2.2 Dividing the Ring into Message Slots
2.3 Cache Coherence Protocols for a Slotted Ring Multiprocessor
2.3.1 Centralized Directory Protocols
2.3.2 Distributed Directory Protocols
2.3.3 Snooping Protocols
2.4 Summary

CHAPTER 3: PERFORMANCE EVALUATION METHODOLOGY
3.1 Trace-driven Simulations
3.2 A Hybrid Analytical Methodology
3.2.1 Analytic Models for Ring-based Protocols
3.3 Program-driven Simulations
3.4 Benchmarks

CHAPTER 4: PERFORMANCE OF UNIDIRECTIONAL RING MULTIPROCESSORS
4.1 Snooping vs. Centralized Directory Protocols
4.2 Distributed Directory Protocols
4.3 Effect of Cache Block Size

CHAPTER 5: PERFORMANCE OF BIDIRECTIONAL RING MULTIPROCESSORS
5.1 Bidirectional Rings and Evaluation Assumptions
5.2 Simulation of Unidirectional and Bidirectional Rings
5.3 Discussion
5.4 Summary

CHAPTER 6: PERFORMANCE OF NUMA BUS MULTIPROCESSORS
6.1 A High-Performance NUMA Bus Architecture
6.2 A NUMA Bus Snooping Protocol
6.3 Packet- vs. Circuit-Switched Buses
6.4 Performance Evaluation of a Packet-Switched NUMA Bus
6.5 Potential of Software Prefetching
6.6 Summary

CHAPTER 7: PERFORMANCE OF CROSSBAR MULTIPROCESSORS
7.1 A NUMA Crossbar-based Multiprocessor Architecture
7.1.1 Cache Coherence Protocols for Crossbar-connected Multiprocessors
7.1.2 Simulation Results for Ring, Bus and Crossbar-based Systems
7.2 Summary

CHAPTER 8: HARDWARE SUPPORT FOR LOCKING OPERATIONS
8.1 Atomic Operations
8.2 Test&Set Primitives in Write-Invalidate Protocols
8.3 Queue On Lock Bit (QOLB)
8.4 Hardware Support for Locking on Snooping Slotted Rings
8.5 Performance Impact of Hardware Locking Mechanisms
8.6 Summary

CHAPTER 9: THE IMPACT OF RELAXED MEMORY CONSISTENCY MODELS
9.1 Introduction
9.2 A Send-Delayed Consistency Implementation
9.3 A Send-and-Receive Delayed Consistency Implementation
9.4 Performance of Relaxed Consistency Models
9.5 Summary

CHAPTER 10: CONCLUSIONS
10.1 Summary
10.2 Performance of Bus-based Systems
10.3 Design Options for Ring-based Systems
10.4 Performance Comparison of Ring- and Crossbar-based Systems
10.5 Future Work

CHAPTER 11: BIBLIOGRAPHY
List of Tables

Table 2.1. Snooping rate (nanoseconds)
Table 3.1. Snooping protocol parameters from trace-driven simulations of the program
Table 3.2. Directory protocol parameters from trace-driven simulations of the program
Table 4.1. Basic trace characteristics
Table 4.2. Fraction of remote misses that require more than one ring traversal in the distributed directory protocol (%)
Table 5.1. Basic application characteristics. Reference counts are in millions
Table 6.1. Percentage of covered shared data misses
List of Figures

Figure 1.1. Bus multiprocessor release dates vs. maximum number of processors
Figure 1.2. UMA (a) and NUMA (b) configurations
Figure 2.1. Unidirectional Ring
Figure 2.2. Register insertion ring interface diagram
Figure 2.3. Illustration of a Ring Backplane
Figure 2.4. Processing node architecture for a centralized directory protocol
Figure 2.5. Centralized directory protocol: read miss on a dirty block
Figure 2.6. A linked list directory protocol
Figure 2.7. An SCI sharing list with five inversions
Figure 2.8. Read miss on a dirty block: (a) requester removes miss reply message; (b) home removes miss reply message
Figure 2.9. Grouping message slots into frames
Figure 3.1. Structure of a trace-driven simulator
Figure 4.1. Breakdown of misses to shared data for the directory protocol
Figure 4.2. MP3D: processor and ring utilization of snooping and directory
Figure 4.3. WATER: processor and ring utilization of snooping and directory
Figure 4.4. CHOLESKY: processor and ring utilization of snooping and directory
Figure 4.5. PTHOR: processor and ring utilization of snooping and directory
Figure 4.6. Average miss latencies for SPLASH applications on snooping and directory
Figure 4.7. FFT, SIMPLE and WEATHER: processor and ring utilization
Figure 4.8. Probe traffic for 16 processor systems
Figure 4.9. MP3D: Normalized execution times
Figure 4.10. WATER: Normalized execution times
Figure 4.11. CHOLESKY: Normalized execution times
Figure 4.12. PTHOR: Normalized execution times
Figure 4.13. Effect of block size
Figure 5.1. A bidirectional ring interconnect
Figure 5.2. Execution time for SPLASH applications; 200MHz processors
Figure 5.3. Execution time for SPLASH-2 applications; 200MHz processors
Figure 5.4. Execution time for SPLASH applications; 500MHz processors
Figure 5.5. Execution time for SPLASH-2 applications; 500MHz processors
Figure 5.6. Minimum latency comparison of unidirectional and bidirectional rings
Figure 5.7. Average time to send a probe for unidirectional and bidirectional rings
Figure 5.8. Average miss latency for unidirectional and bidirectional rings
Figure 6.1. 32-bit slotted ring vs. 64-bit split-transaction NUMA bus (P=8)
Figure 6.2. 32-bit slotted ring vs. 64-bit split-transaction NUMA bus (P=16)
Figure 6.3. 32-bit slotted ring vs. 64-bit split-transaction NUMA bus (P=32)
Figure 6.4. Bus utilization values; 64-bit split-transaction buses, 100MHz and 50MHz
Figure 6.5. Prefetching performance: MP3D; 500MHz ring vs. 100MHz bus
Figure 6.6. Prefetching performance: WATER; 500MHz ring vs. 100MHz bus
Figure 6.7. Prefetching performance: CHOLESKY; 500MHz ring vs. 100MHz bus
Figure 6.8. Prefetching performance: PTHOR; 500MHz ring vs. 100MHz bus
Figure 7.1. Diagram of a Symmetric Crossbar for a NUMA system
Figure 7.2. Execution time for SPLASH applications; 200MHz processors
Figure 7.3. Execution time for SPLASH-2 applications; 200MHz processors
Figure 7.4. Execution time for SPLASH applications; 500MHz processors
Figure 7.5. Execution time for SPLASH-2 applications; 500MHz processors
Figure 8.1. High-contention locks with Test&Test&Set (a possible scenario)
Figure 8.2. Execution time improvement with hardware support for locking on SPLASH applications; 200MHz processors
Figure 8.3. Execution time improvement with hardware support for locking on SPLASH-2 applications; 200MHz processors
Figure 8.4. Execution time improvement with hardware support for locking on SPLASH applications; 500MHz processors
Figure 8.5. Execution time improvement with hardware support for locking on SPLASH-2 applications; 500MHz processors
Figure 9.1. MP3D: Impact of relaxed consistency (500MHz processors)
Figure 9.2. WATER: Impact of relaxed consistency (500MHz processors)
Figure 9.3. CHOLESKY: Impact of relaxed consistency (500MHz processors)
Figure 9.4. PTHOR: Impact of relaxed consistency (500MHz processors)
Figure 9.5. BARNES: Impact of relaxed consistency (500MHz processors)
Figure 9.6. VOLREND: Impact of relaxed consistency (500MHz processors)
Figure 9.7. OCEAN: Impact of relaxed consistency models (500MHz processors)
Figure 9.8. LU: Impact of relaxed consistency models (500MHz processors)
Figure 9.9. Percentage ring slot utilization for snooping
Figure 9.10. Release and delayed consistency improvements for 128B block systems; P=16; 500MHz processors
Abstract
Shared memory multiprocessors are quickly becoming the preferred platform
for parallel processing in scientific and commercial computing. Uniform Memory Access (UMA) architectures in the form of bus-based systems are by far the most popular
implementation of shared memory multiprocessing today. Unfortunately, the future of both bus systems and the UMA model is not very promising, since buses will not be able to provide the bandwidth required by the next generation of microprocessors, even for
systems with a relatively small number of processors. In this thesis we extensively analyze
a variety of architectural options for shared memory multiprocessors with up to 32
processors. We pay particular attention to the potential of ring-connected multiprocessors
in this arena. A novel design of a slotted ring and an associated snooping cache protocol
are shown to be an attractive alternative to bus- and crossbar-connected systems.
Bus, ring and crossbar systems are analyzed under various cache protocols and latency tolerance techniques. The potential gains of adding hardware support for synchronization operations are also studied. A framework of analytical models, trace-driven
simulations and program-driven simulations is used to evaluate the performance of the
many configurations under study, using a representative set of scientific and numerical
benchmark programs.
Chapter 1
INTRODUCTION
1.1 Motivations
The mid to late eighties saw the introduction of high-performance microprocessor-
based workstations which quickly secured a significant fraction of the numerical and
scientific computing market. The key to this success was an extremely favorable price-
performance ratio that was largely due to continuing leaps in the performance of relatively
inexpensive microprocessors. The idea of using those same microprocessors in
multiprocessor configurations appealed to many computer manufacturers, and several such systems have since been released, with varying degrees of success. Some of the most
successful systems were those that extended the existing memory buses to support
multiple processor modules [52,1]. Others opted for connecting processor-memory pairs
through I/O channels [62,3]. Extending the memory bus allows all processors direct access
to the same memory modules, creating what is called a shared-memory paradigm. In such
a scheme, processors communicate and synchronize through a globally accessible
memory space, resulting in very low-overhead, fine-grained communication.
Communication between processors that are connected through I/O channels, on the other hand, requires explicit I/O operations that are typically available to the program as message send/receive primitives; multiprocessors that use this scheme are therefore referred to as message-passing machines. Such primitives incur higher overhead and impose a
programming paradigm in which communication and data partitioning have to be handled
explicitly, making it more difficult to write parallel programs as well as to port existing
sequential ones.
Several researchers and some computer vendors have addressed the problem of how
to scale up the number of processors in multiprocessor configurations to the hundreds or
thousands, proposing massively parallel processing (MPP) systems. To date, these
approaches have fallen short of succeeding as commercial products. MPPs, due to their
inherently more complex architecture, end up taking too long to design and cost too much
for the larger market segments to afford. Economy of scale factors further increase the
price of MPPs with respect to cheaper uniprocessors or smaller multiprocessors. The longer design lead time is particularly harmful considering the pace at which microprocessor technology is improving. It is typical for a multiprocessor system to be at least one
generation behind uniprocessors with respect to the microprocessor used. In addition to
that, massively parallel systems require a significant software effort to deliver scalable
performance. Many existing programs and algorithms do not scale up well, and will always favor a uniprocessor or a small-scale multiprocessor over a massively parallel machine.
There is, however, a scalability problem of a different nature that is of greater concern, and that we refer to as technological scalability. We loosely define the technological scalability of a system component as a measure of how the performance of the component
scales up as the underlying circuit technology improves. The idea is that some
components, due to architectural and physical characteristics, will better translate
improvements in circuit, process or packaging technology into better subsystem
performance while others may see only marginal performance increase. Although
microprocessor technology keeps improving at a very fast rate, memory and interconnect
technology are improving at a much slower pace, creating a widening performance gap
between the building blocks of parallel systems in particular. In shared-memory
multiprocessors cache memories are used to bridge this gap by maintaining copies of
recently used memory blocks in a fast SRAM bank located next to the processor. In such a
scheme, all processor accesses to memory regions that reside in the cache (i.e., cache hits)
are served at SRAM speeds. Caches also have the beneficial effect of decreasing the rate at which memory requests are issued, saving valuable network and memory bandwidth. The fact that multiple copies of a memory location may potentially exist in different cache
memories makes it necessary to introduce a hardware scheme that keeps them coherent,
called a cache-coherence protocol. Although caching is instrumental in improving shared-
memory multiprocessor performance, the performance gap is still significant since all
accesses that miss in the cache may experience very long latencies. Moreover, the cache
coherence protocol itself introduces an additional overhead.
The technological scalability problem is particularly noticeable in the architecture of
modern small-scale multiprocessors (in the context of this thesis, small-scale multiprocessors are systems with no more than 32 processing elements). The vast majority of such systems are based on a shared bus interconnect which, as we will address later, has very serious technological scalability constraints that prevent it from delivering increasingly higher bandwidths.
Consequently, the maximum number of processors that can be used in bus-based
configurations keeps decreasing every year, as more powerful processors are introduced.
Figure 1.1 illustrates this trend by plotting the approximate release dates of bus-based
multiprocessors against the maximum number of processors supported. Not shown in the
plot is the fact that the Alliant FX-80 [1] used a 33MHz CISC processor while the AlphaServer 2100 [31] uses a 300MHz superscalar RISC processor, a performance
difference of well over one order of magnitude.
[Figure 1.1. Bus multiprocessor release dates vs. maximum number of processors. Systems plotted: Alliant FX-80, Sequent Symmetry, SGI Challenge, HP890, SparcCenter 2000, AlphaServer, P6; x-axis: year of release (1988 to 1996); y-axis: maximum number of processors, up to about 30.]
The intrinsic limited technological scalability of buses presents a challenge that
motivates the exploration of alternative interconnect technologies even for small-scale
multiprocessors. The departure from a bus-based architecture also motivates the study of
different ways to provide a shared-memory programming model, and widens the design
space for cache-coherence protocols. Bus-based systems are well suited to snooping protocols, which require that all caches in the system observe all global memory transactions. Multiprocessors that are not bus-based are generally not suited to snooping protocols, and there is currently no consensus approach to handling cache coherence in
such systems.
1.2 Summary of Research Contributions
If it is true that bus interconnections will not prevail as the fabric of choice for small-scale multiprocessors, what technology will replace them? In this thesis we focus on ring-
based interconnections as a possible answer to that question. We propose a ring
architecture based on fixed-size message slots that can be implemented in a backplane
and, due to the simplicity of the media access mechanism, allows very high clocking
speeds. We then describe how existing directory protocols are implemented in the ring and
we propose a novel snooping protocol for this interconnect. Several features of the
protocol and cache designs are discussed and evaluated in the context of a unidirectional
slotted ring. Evaluations are conducted using analytical models, trace-driven simulations
and program-driven simulations, all based on real parallel applications. We also address
the performance of bidirectional rings, since some protocols could potentially benefit from
bidirectionality of communication.
We find that our snooping protocol shows the best overall performance for a unidirectional slotted ring multiprocessor. A unidirectional snooping ring also performs
better than all bidirectional ring configurations analyzed.
We also compare the performance of the slotted ring multiprocessor with that of high-performance bus-based systems and crossbar-based systems. Our experiments
demonstrate that a snooping slotted ring performs better than bus-based systems,
particularly as the processor speed increases. The snooping slotted ring also compares
well to a crossbar-based system, even though crossbar switches are more complex and can
sustain higher aggregate communication bandwidths.
One of the findings from the experiments outlined above is that the performance of
ring- and crossbar-based systems is mostly constrained by remote access and protocol
latencies, and not by the aggregate bandwidth. There is, therefore, an opportunity for
performance improvement by utilizing latency tolerance techniques, such as relaxed
consistency models. We then revisit most of the systems previously analyzed in the
context of relaxed consistency models and a few other architectural variations. The results
show that both ring- and crossbar-based systems benefit significantly from latency
tolerance techniques, while bus-based systems do not.
The fact that a snooping protocol can be efficiently implemented in a system that is
not bus-based is the most important contribution of this thesis. It contradicts the general perception that snooping is only suited to bus-based systems, and it signals that there
are opportunities to trade a higher utilization of interconnection resources for a lower
average latency of transactions.
Our experiments also identify synchronization latency as an important factor in the
execution time of the applications in our benchmark suite. We therefore study the potential
benefits of adding hardware support for locking operations on the various systems under
study. In addition, we propose a new hardware locking mechanism for a snooping slotted
ring that leverages the existing snooping hardware and the inherent ordering of nodes
in the unidirectional ring.
Finally, an important contribution of this thesis is the definition of the requirements
for snooping protocols in Non-Uniform Memory Access (NUMA) bus architectures, and their performance analysis. Partitioning the shared memory space into physically distributed
memory banks, one next to each processing element, significantly decreases bus bandwidth consumption, as accesses that can be satisfied by the local memory bank do not incur bus transactions.
1.3 Prior Related Work and Background
1.3.1 Multiprocessor Interconnect Architectures
In this section we discuss briefly the principal types of interconnection schemes and
their applicability to small-scale multiprocessors.
1.3.1.1 Uniform vs. Non-Uniform Memory Access Architectures
As mentioned before, the early efforts in connecting multiple processors evolved out
of traditional uniprocessor architectures by extending the memory bus to accommodate
multiple processors. Bus architectures are well suited to the implementation of a shared-
memory paradigm with a very low overhead, particularly in Uniform Memory Access
(UMA) configurations. In these systems (see Figure 1.2a) multiple processor-cache
modules and memory (DRAM) modules are connected through the global interconnect. Such systems are called UMA configurations because the access time to any memory location from any given processor is always the same, provided that the interconnect has the same diameter for all processor-memory pairs (examples: bus, crossbar, MIN). One advantage of these systems is that it is easy to expand either the
number of processors or the amount of memory independently. Moreover, a programmer
does not need to worry about data placement or partitioning. However, the fact that all
memory accesses (or misses in a cache coherent system) have to go through the system
interconnect increases latency as well as the communication load. An alternative is to have
some DRAM at each of the processor boards, so that accesses that are local to a processing
element do not have to use the system interconnect (see Figure 1.2b). The physical
addressing space is still unique and shared among all processors. Such a configuration is
called Non-Uniform Memory Access (NUMA) due to the fact that processor accesses that
fall into the local memory bank will be satisfied faster than accesses to a remote memory
bank.
[Figure 1.2. UMA (a) and NUMA (b) configurations. In (a), processor-cache modules and DRAM modules all attach to a single global interconnect; in (b), each processor-cache module has its own local DRAM bank, and the modules attach to the interconnect.]
Still today, most multiprocessors use bus interconnects in UMA configurations, due to their simplicity of implementation and packaging. In this thesis we explore NUMA
configurations for bus and non-bus based systems, since we believe that those have better
potential for scalable performance at a reasonable increase in complexity.
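As a concrete illustration of the UMA/NUMA distinction, the sketch below shows one simple way a shared physical address space can be statically partitioned across per-node memory banks; the node count, bank size and all names are hypothetical, not taken from the thesis.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical NUMA address map: the shared physical address space is
 * divided into equal contiguous segments, one per node. The node
 * count, per-node size and names are illustrative only. */
enum { NUM_NODES = 32 };
#define MEM_PER_NODE (64ULL << 20)          /* 64 MB per node */

static unsigned home_node(uint64_t paddr)   /* which bank holds paddr */
{
    return (unsigned)(paddr / MEM_PER_NODE);
}

/* Local accesses take the fast DRAM path; remote ones must cross the
 * system interconnect, which is the source of the non-uniformity. */
static bool is_local(unsigned node, uint64_t paddr)
{
    return home_node(paddr) == node;
}

int main(void)
{
    uint64_t addr = 5 * MEM_PER_NODE + 0x100;   /* lives on node 5 */
    printf("home=%u, local to node 5? %d, to node 0? %d\n",
           home_node(addr), is_local(5, addr), is_local(0, addr));
    return 0;
}
```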
1.3.1.2 Limits on Bus Performance
In the past few years it has become evident that bus interconnection technology will
not be able to keep up with the improvements in microprocessor technology. When the
Sequent Symmetry [52] was released (1988) it used 20MHz 16-bit CISC processors and a 64-bit bus also clocked at 20MHz. In 1995, the AlphaServer 2100 uses 300 MHz 64-bit 4-issue superscalar RISC processors and a 128-bit bus clocked at 75MHz. While
microprocessor memory bandwidth requirements increased by a factor of roughly 240, bus
bandwidth increased by less than a factor of 32. Consequently, fewer and fewer processors can
be plugged into a shared bus as new generations of processors become available.
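These two factors can be recovered from the figures just quoted, treating clock rate, word width and issue width as direct multipliers of bandwidth demand (a back-of-the-envelope estimate, not a calculation from the thesis):

$$\frac{300\,\mathrm{MHz}}{20\,\mathrm{MHz}} \times \frac{64\,\mathrm{bits}}{16\,\mathrm{bits}} \times \frac{4\ \mathrm{issue}}{1\ \mathrm{issue}} = 15 \times 4 \times 4 = 240,$$

while the raw bus transfer rate grew by only

$$\frac{128\,\mathrm{bits} \times 75\,\mathrm{MHz}}{64\,\mathrm{bits} \times 20\,\mathrm{MHz}} = 7.5,$$

comfortably below the factor of 32 cited above.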
There are topological and electrical factors that contribute to the relatively modest
improvements in bus bandwidths observed lately. The topological factor is a consequence
of the shared medium nature of buses. It dictates that only one bus transfer can be
performed at any bus cycle, and that all bus agents have to arbitrate for the bus prior to
being able to start a transfer. Arbitration protocols frequently involve multiple bus cycles (particularly with distributed arbiters). Modern buses attempt to alleviate this limitation by
providing separate arbitration, address, and data lines, so that the arbitration for the bus
can be overlapped with the address phase of a previous transaction, which in turn can
(partially) overlap with a data transfer for yet another bus transaction. These are
sometimes called pipelined, or split-transaction buses. Unfortunately, the amount of
overlap available in bus transactions stops at this level.
An alternative to traditional arbitration protocols is to use collision detection
schemes, such as CSMA/CD (carrier sense multiple access with collision detection), which is used in bus-based local area networks. In this scheme, a bus agent with a packet to transmit senses the medium to determine if a transmission is going on. If not, it immediately starts the transmission while at the same time sensing the electrical levels in the bus. If two or more agents start to transmit at the same time, a collision occurs in the bus. An agent senses the collision and aborts the transmission. At that point the colliding agents can either wait a random amount of time and retry, or they can enter an actual arbitration phase. Collision-based methods such as this have not been used in multiprocessor buses to date, for several reasons. First, they perform poorly under medium to heavy traffic, where collisions become much more frequent. Second, they require the ability to sense the medium to determine that a collision has occurred, which is not easy to accomplish in a parallel bus. Finally, collisions on a parallel bus would cause large current surges that would be difficult to handle.
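A minimal sketch of the CSMA/CD send loop just described, with the analog sensing and driving hardware reduced to trivial stand-in functions; the binary exponential backoff policy is an assumption (the text only requires a random wait), as are all names below.

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins for the analog sensing/driving hardware described in the
 * text, trivially simulated so the sketch is self-contained. */
static int medium_busy(void) { return 0; }           /* idle medium   */
static int transmit_and_watch(void)                  /* ~25% collide  */
{ return (rand() % 4) == 0; }
static void wait_cycles(unsigned n) { (void)n; }     /* no-op here    */

/* CSMA/CD send loop: sense the medium, transmit while listening, and
 * on a collision abort and back off a random number of cycles before
 * retrying (binary exponential backoff, one common choice). */
static unsigned csma_cd_send(void)
{
    unsigned attempt = 0;
    for (;;) {
        while (medium_busy())
            ;                           /* defer until the bus is idle */
        if (!transmit_and_watch())
            return attempt;             /* no collision: frame is out  */
        attempt++;                      /* collision: back off, retry  */
        wait_cycles(rand() % (1u << (attempt < 10 ? attempt : 10)));
    }
}

int main(void)
{
    printf("sent after %u collisions\n", csma_cd_send());
    return 0;
}
```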
The electrical factors are, however, more serious, as they present physical limitations to increasing bus bandwidth. Wires in a bus interconnect have multiple taps, each tap being able to drive and sense the voltage level in the wire. At very high speeds each tap introduces stray impedances that cause reflection and signal attenuation, resulting in longer settling times. Moreover, the length of the wires on a backplane bus increases somewhat linearly with the number of taps, because of the physical spacing necessary to plug in printed circuit boards. Longer wires also translate into longer settling times, as the signal has to travel the length of the bus. Since a transmitter has to wait until the signal has safely settled before driving new data, these effects directly bound the minimum bus clock period. Attempts to improve bus clock frequency typically involve increasing the current levels and/or reducing the voltage swing, so as to improve signal rise time. Both approaches have limitations. Increasing currents will worsen switching effects, such as
ground bounce and crosstalk interference. Reducing the voltage swing makes the bus less
noise immune.
Another electrical problem is caused by the bidirectionality of bus communications.
Since the same wires are used to transmit and receive data, the bus interface has to switch
between sensing and driving modes. Before a bus interface can start driving it has to make
sure that the previous signal has been removed and any reflections have settled to avoid
electrical conflicts. This delay is again influenced by the number of taps, their stray
impedances, and the bus length.
For a given bus clock cycle one can attempt to increase bandwidth by increasing the width of the bus, thereby transferring more data at a time. Pin limitations and crosstalk interference are also limiting factors in this case. The maximum skew also increases rapidly with the number of parallel wires, and it adds to the minimum clock period.
1.3.1.3 Point-to-Point Links
The alternatives to a bus interconnect are topologies that rely on a non-shared
physical communication medium. All such topologies therefore use point-to-point links as
building blocks. Point-to-point links have several attractive features and have been
growing increasingly popular in the last few years. By having only a single driver and a
single receiver at each end of the wire, point-to-point links are much simpler electrically
than buses. Point-to-point links are easy to terminate properly because there is only one
termination point. Better termination and lower characteristic impedances allow fast signal
rise time and propagation speed with lower driving currents, which makes it easier to use
low voltage signaling while still maintaining reasonable noise immunity.
Unlike in buses, transmission rates are not directly dependent on the length of the wire,
since a transmitter does not have to wait until the receiver has sensed the data before
driving another value, and in effect several bits can be “in flight” between the source and
destination, depending on the length of the wire and the link clock frequency. This is
called signal pipelining. This is especially useful in local area network environments. For
multiprocessor networks, in which the processing elements are tightly coupled, the length
of point-to-point links can be kept very small so that in most cases the time of flight is not
significant.
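As a back-of-the-envelope illustration of the effect, the sketch below computes the number of bits in flight on a single wire. The 1 m length and the propagation speed (roughly two thirds of the speed of light in copper) are assumptions; the 500 MHz clock matches the SCI links mentioned in the next paragraph.

```c
#include <stdio.h>

/* Bits in flight on a point-to-point link:
 *   bits_in_flight = (length / propagation_speed) * clock_frequency */
int main(void)
{
    double length_m = 1.0;     /* link length (assumed)                */
    double prop_mps = 2.0e8;   /* ~2/3 c signal speed (assumed)        */
    double clock_hz = 500.0e6; /* 500 MHz link clock (SCI figure)      */

    double flight_s = length_m / prop_mps;   /* time of flight         */
    double bits     = flight_s * clock_hz;   /* bits in flight per wire */

    printf("time of flight: %.2f ns, bits in flight per wire: %.2f\n",
           flight_s * 1e9, bits);
    return 0;
}
```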
Overall, point-to-point connections are more technologically scalable than bus
connections, and their delivered bandwidth is expected to benefit continuously from
improvements in circuit technology. The potential of point-to-point connections is
currently demonstrated by the IEEE Scalable Coherent Interface (SCI) [43] set of
standards. Current SCI-based systems use 500MHz 16-bit point-to-point links. Wider (up
to 128-bit parallel) and faster (up to 1GHz) links are expected by the end of 1996 [26].
1.3.1.4 Ring Networks
Point-to-point links are not networks per se, but instead they are the building blocks
for a myriad of network topologies. The simplest among those is a unidirectional ring
network. Unidirectional rings have the smallest number of links per node (smallest
degree), and they do not need intermediate switching elements (as in multistage
interconnection networks or crossbars). Consequently, ring networks are likely to be the
least expensive point-to-point based interconnects for both multiprocessors and local area
networks.
Because of its simple topology, the unidirectional ring requires the simplest routing
mechanism possible, with the only routing decision being to either remove a message
from the ring or to leave it in the ring path so that it is directly forwarded to the next node.
Therefore, complex buffer management is not necessary, and cut-through/wormhole
routing is avoided completely, drastically reducing the amount of expensive high-speed
memory in the network interface and simplifying the network controller data path. A
unidirectional ring is also deadlock free.
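A minimal sketch of that single forwarding decision, assuming a hypothetical slot format (the field and function names are ours, not an interface from the thesis):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical slot/message header; field names are illustrative. */
struct ring_msg {
    bool    valid;
    uint8_t dest;                 /* destination node id */
};

static void deliver(struct ring_msg m)
{
    printf("node %u consumed a message\n", m.dest);
}

/* The only routing decision on a unidirectional ring, per the text:
 * either remove an incoming message addressed to this node, or
 * forward it unchanged to the downstream neighbor. No routing
 * tables, no adaptive choices, no complex buffer management. */
static struct ring_msg ring_step(struct ring_msg in, uint8_t my_id)
{
    if (in.valid && in.dest == my_id) {
        deliver(in);              /* remove the message from the ring */
        in.valid = false;         /* the slot travels on, now empty   */
    }
    return in;                    /* otherwise forward as-is          */
}

int main(void)
{
    struct ring_msg m = { true, 2 };
    /* the message passes nodes 0 and 1 untouched, is removed at node 2 */
    for (uint8_t node = 0; node < 4; node++)
        m = ring_step(m, node);
    return 0;
}
```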
The current speeds attainable by point-to-point links present a formidable challenge
to the designer of network interface logic and switches. In this context, the lower
complexity of a unidirectional ring interface makes it possible to take advantage of the raw
link bandwidth and deliver it to the system.
Similarly to buses, rings can be efficiently implemented in active or passive
backplanes, facilitating wiring and packaging. Unlike buses, rings can also be
implemented in more loosely coupled configurations, using flat copper cables or optical
fiber ribbon cables. For backplane implementations, single-ended terminated traces will
suffice. Pseudo ECL (PECL) parallel signals can be used for cables of up to a few meters.
Low-Voltage Differential Signaling (LVDS) [43] allows very low error rates and high
speeds for distances of up to 100 meters. Parallel optical fiber ribbon cable technology
[39] can be used for even longer distances (on the order of 1 km), as well as for short distances.
Finally, bidirectional ring networks can be implemented by using two unidirectional
rings, each transmitting in a different direction. Although the network interface logic is
somewhat more complex than in a unidirectional ring, a bidirectional ring network is still
quite simple when compared with more general switched topologies.
Examples of shared memory systems that have used ring interconnects recently include the Convex Exemplar [66], the Kendall Square KSR1 [46], and the upcoming
Sequent NUMA-Q [53].
1.3.1.5 Crossbar Networks
Point-to-point links can also be used to connect nodes to switching elements, such as
crossbars. Ideal crossbar switches are an example of a conflict-free network in the sense
that it is possible for all nodes to communicate through the crossbar simultaneously, as
long as every sender chooses a different destination. In other words, there can only be
output conflicts.
Monolithic crossbar implementations are known not to scale well since the number
of internal connections increases with the square of the number of ports times the width of
a port. Due to the connectivity required, efficient crossbar implementations are only
possible when the entire crossbar fits into a single integrated circuit. In other words, for a
given technology, as the number of ports increases the width of a port tends to decrease.
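To make the quadratic scaling concrete (the port counts and widths below are assumed for illustration, not figures from the thesis): the crosspoint count of a $P \times P$ crossbar of port width $W$ grows as

$$\text{internal connections} \propto P^2 \times W,$$

so a single-chip crossbar with $P = 8$ ports of $W = 64$ bits needs on the order of $8^2 \times 64 = 4096$ internal connections, and doubling to $P = 16$ ports within the same silicon budget forces the port width down to $4096 / 16^2 = 16$ bits.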
Crossbar switches with a larger number of wide ports are typically implemented as
multi-stage networks (MINs). Such implementations lack the conflict-free feature since
messages to different destinations may select the same output port of an internal switch. In addition, multi-stage networks are subject to tree saturation [57], in which an internal conflict or an output conflict backs up the traffic in the upstream switches, consequently delaying even the messages that are not directed to the “hot” path. Modern
crossbar switches for MINs [67] virtually eliminate tree saturation by using a large multi-ported central buffer pool that is shared by all output ports. This central buffer is used to store messages directed to a busy output port. Since entries in the buffer pool can be dynamically allocated to different output ports, an active output port will have a larger
amount of buffering at its disposal than if buffers were statically assigned to input or
output ports. A very active output also means that the remaining outputs are relatively
inactive, therefore requiring little or no buffer space. The drawback of this scheme is that it
has a relatively higher delay for messages that conflict because those have to be stored and
later retrieved from the buffer pool.
1.3.1.6 Other Networks
Buses, crossbars and rings are clearly not the only interconnection options for
shared-memory multiprocessor systems. However, we argue that in the context of small-scale multiprocessors, these are the ones that make the most sense. Large MINs, meshes,
or fat trees scale well to large numbers of processors, but are not as effective for small
configurations.
1.3.1.7 Cluster-based Architectures
One interesting way to build large scale multiprocessors is to use small symmetric multiprocessors (or SMPs) [51,74,66], such as bus-based systems, as nodes of a second level network, therefore creating a larger system. The advantages of such an approach are
manifold. The first level interconnect (intra-SMP) can provide very low latencies and high
bandwidth for local communication, while the second level interconnect can be designed
for high aggregate bandwidth. The first level interconnect can take advantage of physical
proximity and easier packaging to provide a very cost-effective solution. By using SMPs
as nodes, the number of ports in the second level network can be reduced for a given total
number of processors in the system. The bandwidth per port in this scheme has to be
higher, since each port will serve a larger number of processing elements.
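A small worked example (with assumed numbers, purely to illustrate the port-count trade-off): grouping $32$ processors into $4$-processor SMP nodes reduces the second level network to $32/4 = 8$ ports, but each of those ports must now carry roughly the traffic of $4$ processors instead of $1$.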
Cluster-based architectures are particularly effective when a significant fraction of
the application parallelism can be captured by a single SMP node, or when the application
can be mapped so that there is communication locality within an SMP node. Therefore,
SMP nodes with a larger number of processors are preferred. Since bus-based systems are likely to connect at most four processors in the near future, alternative SMP
interconnections such as rings or crossbars could be favored as the building blocks for
larger scale systems as well.
Large scale multiprocessors based on larger SMP nodes also have a favorable cost structure. An entry-level configuration consisting of a single SMP node (fully populated or not) is likely to be cheaper than a configuration that requires a customer to buy the second
level interconnect up front. Even for larger configurations with more than one SMP node,
the cost of the second level interconnect is amortized by a larger number of processors.
1.3.2 Cache Coherence Protocols
Processor caches are widely used today in both uniprocessor and multiprocessor
systems. They are so critical to performance that virtually all modern microprocessors
include one or two levels of cache memory in the processor chip itself. Caches are
instrumental in bridging the gap between very fast processors and slower (but large)
dynamic memory banks.
In NUMA multiprocessors, caches are particularly important since the latencies to
access data located in a remote memory bank can be extremely high. An effective caching
strategy is one that hides the NUMA performance penalties, so that the programmer or the
compiler does not have to worry about placement of data in the various memory banks.
While this is difficult to fully achieve in practice, strategies that allow the caching of
remote memory locations can approximate this behavior by allowing subsequent accesses
to a remote memory location to be satisfied locally. In such a scheme, however, multiple
copies of a memory location can exist in the system, and it is necessary to coordinate any
write operations so that all processors in the system have a coherent view of the memory
space. Such coherence is enforced by means of a cache coherence protocol, which is typically implemented in hardware. Cache coherence protocols coordinate writes to
memory locations by either killing all other cache copies of the memory block that
contains that location or by updating them. These two approaches are called write-
invalidate [56] and write-update [69] protocols. Write-update protocols keep the other cached copies in the system “alive”, therefore yielding higher cache hit rates, but placing a high load on the system interconnect, since successive updates have to be propagated to all caches with a copy of the block, even if the processors associated with those caches are not touching the data being updated. Write-invalidate protocols represent a lazier approach, in which other cache copies of a block are invalidated when a processor attempts a write operation. In this scheme, further writes by the invalidating processor can proceed locally with no risk to overall coherence, since that is the only copy of the block in the system. When another processor tries to access an invalidated cache block, it has to reload it from the processor that contains the most recent copy. Write-invalidate protocols exhibit somewhat lower cache hit rates with respect to write-update protocols, but require much lower interconnect bandwidth and are better at tracking the sharing pattern of a given memory block. Overall, the performance of write-invalidate protocols is typically much better than that of write-update protocols, although some types of sharing patterns would clearly benefit from a write-update scheme. Producer-consumer sharing is one such example.
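A toy message count makes the producer-consumer case concrete. The sketch below assumes one producer that writes a word and c consumers that then read it, for n rounds, on a broadcast medium with uniform message cost; the numbers and the cost model are assumptions for illustration only.

```c
#include <stdio.h>

int main(void)
{
    unsigned n = 100; /* producer-write / consumer-read rounds (assumed) */
    unsigned c = 8;   /* consumer caches holding a copy (assumed)        */

    /* write-invalidate: each round costs one invalidation broadcast,
     * after which every consumer misses and reloads the whole block */
    unsigned invalidate_msgs = n * (1 + c);

    /* write-update: each round costs a single broadcast update that
     * refreshes all copies in place; the consumers keep hitting */
    unsigned update_msgs = n;

    printf("invalidate: %u messages, update: %u messages\n",
           invalidate_msgs, update_msgs);
    return 0;
}
```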
Recently, some studies have advocated hybrid update/invalidate schemes in attempts
to combine the strengths of both. Competitive update protocols [44] use a write-update
protocol by default, but allow copies to be invalidated when it is determined that an
updated cache copy is not being accessed by the associated processor. Other approaches
advocate selecting an update or invalidate protocol depending on the anticipated sharing
behavior of a given data structure. Both approaches are beyond the scope of this
dissertation, and therefore we focus on write-invalidate protocols only.
Cache coherence protocols are also classified with respect to how the information
about which caches have copies of a memory block is kept. The following subsections
describe the main options, all of which will be further analyzed in this dissertation.
1.3.2.1 Snooping
Most if not all of the bus-based multiprocessors to date have been UMA machines,
with a number of processing nodes with local caches connected to one or more memory
banks, as depicted in Figure 1.2a. In this configuration, all memory accesses by a
processing node are visible by all other nodes in the system. Snooping protocols take
advantage of this feature to build a simple and elegant solution to the cache coherence
problem [33,56,69]. Basically, each processing element constantly monitors all bus
transactions, trying to match the address of a transaction with the addresses contained in
its local cache. When there is a match, the logic in the processing element takes the
appropriate action to ensure that (a) its local copy will not become stale and that (b) the remote requester will receive an up-to-date copy of the block it is missing.
To illustrate the operation of a snooping protocol on a UMA bus, let's take a simple write-invalidate protocol in which a memory block in a cache (i.e., a cache block) can be in one of three states: Invalid (not present), Read-Only (valid only for reads), and Read-Write (valid for both reads and writes). In this protocol, whenever a processor accesses a block that is Invalid in its cache, it puts the block address on the bus while
signaling a read operation. All other caches snoop on the read operation and if either the
cache block is not present in any other cache or it is present but Read-Only in one or more
caches, no coherence action is required, and the corresponding memory bank replies to the
read operation. The state of the block in the node that requested it is Read-Only. If some
other cache had that block in the Read-Write state, it is assumed that cache has modified
the block (i.e., the block is dirty), and therefore no other cache in the system may have a
copy. Also, the memory bank corresponding to the block has stale information. In this
case, the cache with the dirty copy intervenes in the bus read operation and replies with the updated copy of the block; the node that had the dirty copy then changes its cache state to Read-Only. For a processor to be able to write to a cache block, it has to
have it cached in Read-Write state. If a write operation is attempted with the block in
Invalid or Read-Only state, a bus operation has to be issued so that all the other caches in
the system are informed that the block is about to be modified. In the write-invalidate
scheme that we are assuming, that means that all other copies of the block have to be
invalidated.
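To make these transitions concrete, the following C sketch shows the action a snooper might take when it observes a bus transaction matching a block in its local cache. This is a minimal illustration of the three-state protocol just described, not a description of any particular machine; all names are our own.

    #include <stdbool.h>

    /* Cache block states of the simple write-invalidate snooping
     * protocol described above. */
    typedef enum { INVALID, READ_ONLY, READ_WRITE } cache_state_t;

    /* Action of a snooper on an observed bus transaction that matches
     * an address in its local cache.  Returns true if this cache must
     * intervene and supply the up-to-date copy of the block. */
    bool snoop(cache_state_t *state, bool bus_write)
    {
        switch (*state) {
        case READ_WRITE:
            /* We hold the only (dirty) copy: supply the block, then
             * downgrade to READ_ONLY on a read, INVALID on a write. */
            *state = bus_write ? INVALID : READ_ONLY;
            return true;
        case READ_ONLY:
            /* A remote write invalidates our clean copy; a remote read
             * needs no action here, since memory will reply. */
            if (bus_write)
                *state = INVALID;
            return false;
        default: /* INVALID: nothing to do */
            return false;
        }
    }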
In the snooping scheme that we have just described, the memory banks contain no
state information regarding which blocks are currently being cached by which processors.
The processing nodes themselves are responsible for keeping cached copies consistent in a distributed fashion by snooping on each other's memory transactions. Snooping protocols require the addition of bus watcher (or snooper) logic to the processing elements, which has
to contain a copy of the processor cache directory, in order to quickly determine matches
between bus transactions and cached addresses without using precious processor cache
bandwidth.
Snooping protocols for NUMA buses are slightly different from those for UMA buses as described above. We will focus on these differences in Chapter 6. We will also
describe how snooping protocols can be implemented on a ring based multiprocessor.
1.3.2.2 Centralized Directories
Snooping protocols are feasible in bus based systems due to the fact that a bus is a
shared medium in which every transaction is a broadcast. For system organizations that are
not based on buses, it is generally believed that snooping is not feasible, and other
schemes must be used. In 1978 Censier and Feautrier [13] proposed a scheme in which the
memory banks themselves would keep a directory entry associated with each block of data in main memory, so that each bank would know which processing elements have cached copies of which blocks, and in what states. The model is still similar to the snooping protocol
described above, in the sense that it allows multiple Read-Only copies but only one Read-
Write copy in the system. This class of protocols is generally called centralized directory
protocols or full-map directory protocols, since all the information about the system state
of a memory block is centralized in the memory bank that the block address maps to. The
directory entry can be implemented with a bit vector, in which a bit set in position n would indicate that processing element n has a copy of the block in its cache. This
structure is therefore called a presence bit vector. When there is only one presence bit set
in the vector, it is necessary to indicate whether that processing element has the block
cached Read-Only or Read-Write. This is accomplished by adding one more bit to the
structure, called a dirty bit. More state information may be necessary, depending on the
specific implementation.
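As a sketch, a full-map directory entry for a hypothetical 32-processor system could be laid out as follows; the names and field widths are illustrative, not drawn from any particular design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NPROCS 32
    _Static_assert(NPROCS <= 32, "presence vector is one 32-bit word");

    /* Full-map directory entry: one presence bit per cache in the
     * system plus a dirty bit, as described above. */
    typedef struct {
        uint32_t presence;  /* bit n set => cache n holds a copy      */
        bool     dirty;     /* the single cached copy is Read-Write   */
    } dir_entry_t;

    static inline void set_presence(dir_entry_t *e, int node)
    {
        e->presence |= (uint32_t)1u << node;
    }

    static inline bool has_copy(const dir_entry_t *e, int node)
    {
        return (e->presence >> node) & 1u;
    }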
In such directory protocols, no particular topology is assumed for the system
interconnect. Whenever a processing element's access misses in the local cache, it sends a message to the memory bank that the block address maps to. The memory bank now has fairly complex logic (a memory directory controller) that fetches the directory entry for
that block and determines what actions should be taken to satisfy the request in a way that
maintains overall coherence. Typical behavior of a directory controller includes sending
point-to-point invalidation messages to all processing elements with Read-Only copies of
the block when a write is attempted, waiting for the acknowledgments and then replying to
the requester with a copy of the block (if necessary) and permission to change its state to
Read-Write. If the block was cached dirty at the time a write was attempted, the directory
controller sends a message to the processing element with the dirty copy asking it to write
it back and invalidate its copy, after which it forwards the copy of the block to the
requester². When a read is attempted and there is no dirty copy in the system, the directory
controller immediately replies with the block. If a dirty copy exists, a message is sent to
the processing element that has the dirty copy, specifying that it should change its cache
state to Read-Only and send a copy of the block back to the memory bank, which in turn
forwards it to the original requester. In all cases above, the directory controller is
responsible for setting and clearing the appropriate presence bits and dirty bit, to reflect all
changes in the sharing pattern of a memory block.
The main problem with centralized directory protocols is that the presence bit
structure does not scale well, since there has to be one bit in every directory entry for each
cache in the system. A multiprocessor with 256 processors and 32B cache blocks will
require more memory for the directory entries than for the data itself. Several schemes
have been proposed to address this problem [65], most of them based on the assumption
that in the frequent case a cache block will only be present in a few caches. Chaiken et al. [14] present a taxonomy for these so-called limited directory protocols. In these
protocols, a directory entry contains a limited number of hardware pointers that can be
used to store the ID of a processing element with a cached copy. If a read miss request is
received by the directory controller and all of the hardware pointers are already allocated, there are two options: invalidate one of the current processing elements to make room for
2. Several optimizations are possible, in which the processing element with a dirty copy directly forwards it to the
requester. We will discuss some of these in a later chapter.
the new one or change to a mode in which it is assumed that all caches in the system may
have a copy of the block, and therefore a subsequent write will require a broadcast
invalidation.
Although limited directory protocols are very important for scalability, a full-map strategy is the most effective for the range of system sizes that we are focusing on. A
system with 32 processors and 32B cache blocks has a directory entry overhead for
presence bit vectors of 12.5%. A limited directory entry for the same system, with similar
directory entry overhead, would be able to accommodate only six hardware pointers. From
now on, whenever we mention centralized directory protocol, or simply directory
protocol, we are referring to this full-map strategy.
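The arithmetic behind the figures quoted above can be made explicit. For P processors and B-byte cache blocks, the presence-bit overhead per block, and the number of limited-directory pointers that fit in the same bit budget, are (in LaTeX notation):

    \[
    \text{overhead} = \frac{P}{8B} = \frac{32}{8 \times 32} = 12.5\%,
    \qquad
    \text{pointers} = \left\lfloor \frac{P}{\lceil \log_2 P \rceil} \right\rfloor
                    = \left\lfloor \frac{32}{5} \right\rfloor = 6 .
    \]

For P = 256 and B = 32 the overhead reaches 100%, consistent with the observation above that the directory then rivals the data in size.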
1.3.2.3 Distributed Directories
Distributed directories is a generic term for a method that has been proposed by the
researchers involved with the Scalable Coherent Interface (SCI) standard [43], to attempt
to reduce the memory overhead of centralized directory protocols, while still maintaining
complete information about the sharing of a cache block. The idea is to associate one or two hardware pointers with each block frame in the processing element caches, and one hardware pointer with each block frame in the memory banks. The pointer in the memory
bank stores the ID of a processing element that caches the block. The cache block frame in
this processing element in turn points to the next processing element that caches the block,
and so on. Therefore, a distributed linked list is created for each block frame in the system.
The list may be singly or doubly linked, depending on the particular implementation.
Distributed directory protocols are more scalable than centralized directory
protocols with respect to the memory overhead for directory information. In such schemes
the overhead scales up with log₂ of the number of processing elements in the system, instead of linearly as in the case of full-map protocols.
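A minimal sketch of the per-block state under such a scheme, assuming a doubly linked list and 8-bit node IDs (both illustrative choices):

    #include <stdint.h>

    /* Per-block sharing-list links held in each cache (SCI-style,
     * doubly linked).  With N nodes each pointer needs only
     * ceil(log2 N) bits, which is why the overhead grows
     * logarithmically rather than linearly. */
    typedef struct {
        uint8_t forward;    /* ID of the next node in the sharing list */
        uint8_t backward;   /* ID of the previous node (or the home)   */
    } cache_line_links_t;

    /* The home memory bank keeps only a pointer to the head node. */
    typedef struct {
        uint8_t head;       /* ID of the head of the sharing list */
        uint8_t head_valid; /* nonzero if a sharing list exists    */
    } mem_dir_entry_t;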
These linked list protocols are typically much more complex than centralized directory or snooping protocols, and will also incur higher delays for some cache transitions that involve traversing the list. Another problem with linked list protocols is that the order in which processing elements appear on the list is determined solely by the
timings of cache misses, and it is completely oblivious to the underlying interconnect
topology. As a result, the messaging sequence required to traverse the list may be quite
suboptimal with respect to the system topology. Unfortunately it seems very difficult to
optimize the list for a given topology, since it would require that the list be rearranged
while it is being formed, i.e., during a cache miss. Any extra overhead included in a cache
miss will have a first order impact on system performance.
1.3.3 Reducing and Tolerating Memory Latencies
Regardless of the particular cache protocol being used, the latencies involved whenever main memory has to be accessed continue to increase with respect to the processor
cycle time. Therefore, architectural or algorithmic enhancements that either reduce the
effective cache miss latencies as perceived by an executing thread, or that allow the
execution of a thread not to block on a cache miss (or other relevant coherence events) are
increasingly important, particularly in multiprocessor systems. Virtually all latency
tolerance techniques have the side effect of increasing the communication load by
allowing greater overlaps between communication and computation. It is therefore
important to determine how the different interconnect architectures and protocols react to
the increased load. In this thesis we also study the potential for performance improvement
of small scale multiprocessors when latency tolerance techniques are employed.
1.3.3.1 Prefetching
The idea behind all prefetching schemes is to anticipate the need for a piece of
information and trigger the fetching of that information enough in advance in the
execution stream so that when it is actually needed it is already present in some kind of
local buffering. Prefetching of instruction streams is an extremely successful technique
that has been applied even before caches became popular. The regularity and locality of
instructions make them perfect candidates for prefetching, and it has been implemented in
hardware on almost all microprocessors for many years. Prefetching of data is not as easy, since data access patterns are harder to predict either statically (by the compiler) or
dynamically (by the processor hardware).
Data prefetching schemes are classified with respect to whether the prefetched data
is bound to a processor register directly (binding prefetch) or whether it just brings a piece
of data up in the memory hierarchy (non-binding prefetch) but not into a register in the
execution core. Non-binding prefetch typically brings data into one of the possibly various
levels of cache memory in a processing element. Both prefetching schemes require that the
caches be implemented in such a way that they do not block when an access misses, so
that they can issue a prefetch and continue to serve processor accesses. Such
implementations are called lockup-free caches [23].
Binding prefetching schemes basically attempt to move register loads up in the code
stream as much as possible so that when the instruction that uses the value of that register
is dispatched there is a good chance that the load has already performed. The moving of
loads can be done by an optimizing compiler during dependency analysis. In addition, in a processor that uses dynamic scheduling/speculative execution, even if the instruction that accesses the loaded register is dispatched before the load completes, the processor may not stall, since it can continue to dispatch other instructions and reorder the execution
stream at the retire stage [42].
The most common type of non-binding prefetching is called software prefetching,
since it requires a special prefetch instruction to be inserted in the code stream by the
programmer or the compiler. The prefetch instruction acts as a load or a store (shared or
exclusive prefetch) to the memory system, but it does not load any data into the processor
core. The effect in the memory system is as if a normal load or store had been issued, with
misses or invalidation transactions being sent accordingly. Most modern instruction sets include definitions of prefetch instructions, even though they are not always implemented
in actual systems. Non-binding prefetches triggered by hardware instead of software have
also been proposed [4][15]. Such schemes are referred to as hardware prefetching, and
typically use some heuristic to fetch a few consecutive (possibly strided) cache blocks following a cache block that was the target of a cache miss.
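As an illustration of software (non-binding) prefetching, the following C fragment uses GCC's __builtin_prefetch intrinsic as a stand-in for the kind of prefetch instruction discussed above; the prefetch distance AHEAD is a tuning parameter that should roughly cover the expected miss latency.

    #define AHEAD 8   /* prefetch distance, tuned to the miss latency */

    double sum(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            /* hint the memory system to start fetching a[i+AHEAD];
             * the 0 means a read (shared) prefetch */
            if (i + AHEAD < n)
                __builtin_prefetch(&a[i + AHEAD], 0, 1);
            s += a[i];
        }
        return s;
    }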
Non-binding prefetches are “safer” than binding ones with respect to the expected
memory ordering and data coherence since they act only as hints to the memory system. If
a non-binding prefetch is issued too early and a write operation is performed before the
prefetched data is actually consumed, the cache protocol itself makes sure that the new
value of the data is seen by invalidating the prefetched cache block. Since binding
prefetching fetches into a processor register, it is not subject to the cache protocol, and
extra care has to be taken to ensure correct operation. Modern processors such as the Intel
Pentium Pro [42] monitor the system bus so that they can flush the execution pipeline if a
write is seen to a value that was speculatively loaded (prefetched) before the instruction
that uses that value has retired. The effect is the same as that of a mispredicted branch.
1.3.3.2 Relaxed Consistency Models
While prefetching reduces the effective latency as seen by the program, relaxed
consistency models hide those latencies by allowing execution to continue past a store
operation that has not been propagated to the rest of the system. A consistency model
defines the assumptions that a programmer makes with respect to the order in which loads,
stores and synchronization operations are performed in the memory system. The strictest
consistency model was defined by Lamport [49] as requiring that all the memory
operations of a processor appear to execute in the order specified by the program, and that
the result of the execution is the same as if all global operations were performed in some
sequential order. Implementations of sequential consistency in multiprocessors basically
require that the processor block on any global memory operation and not issue another operation until the previous one has committed, a strategy called strong ordering of memory references. This is a very severe restriction, and it prevents several hardware and
compiler optimizations.
Scheurich and Dubois [22] pioneered the work on relaxed consistency models by
introducing the idea of weak ordering. Weak ordering models assume a properly labeled program in which accesses to shared data have to be controlled by accesses to synchronization variables. Under weak ordering, accesses to synchronization variables are
strongly ordered, but there is no restriction in the order between other loads and stores
other than the following:
(a) before an access to a synchronization variable can be issued, all previous global
accesses have to be globally performed (i.e., performed with respect to all processors).
(b) no access to global data can be issued before a previous access to a
synchronization variable is globally performed.
(c) multiple stores to the same address have to be issued in program order.
In this model the execution almost never has to block on a store, even if it misses in the cache or finds the cache block in Read-Only state. The miss or invalidation request can proceed in the background while the value of the write is buffered between the
processor and the cache. An improvement over the weak ordering model was later
proposed by Gharachorloo et al. [30], in which a distinction is made between the actions
taken on different synchronization primitives. In release consistency, as originally
proposed, no loads or stores can be issued before a previous lock operation (or acquire)
has been performed, and an unlock operation (or release) can only be issued after all
previous loads and stores have been performed. The differences between weak ordering
and release consistency are subtle, as are the performance differences observed in most
simulation experiments [79]. Release consistency basically allows a lock to bypass
previous loads and stores, and an unlock to be bypassed by subsequent loads and stores.
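In modern notation (C11 atomics, used here purely as an illustration, since they postdate this work), the acquire/release discipline can be sketched as follows; the ordinary store may be buffered and overlapped with execution, while the release guarantees it has performed before the lock is freed.

    #include <stdatomic.h>

    atomic_flag lock = ATOMIC_FLAG_INIT;
    int shared_data;

    void update(void)
    {
        /* acquire: no following shared access may be issued before
         * this synchronization access performs */
        while (atomic_flag_test_and_set_explicit(&lock,
                                                 memory_order_acquire))
            ;  /* spin */

        shared_data = 42;  /* ordinary store: may be buffered */

        /* release: completes only after all previous accesses have
         * performed with respect to all processors */
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }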
A large variety of implementations that take advantage of relaxations in the memory ordering constraints are possible. Dubois et al. [24] presented a class of protocols that can
delay the sending of invalidations until a lock is released, and can also delay the
invalidation of cache blocks for which invalidation messages have been received until a
lock acquire is reached. These delayed consistency protocols implement release
consistency in such a way that the effect of false sharing misses is drastically reduced.
False sharing misses are loosely defined as misses caused by a situation in which two (or
more) processors are sharing a cache block but they actually share no data. Such a situation is caused by poor or unfortunate mapping of data structures into cache blocks, which cannot
always be avoided by smart placement techniques. False sharing effects are more
important for systems with large cache blocks (128B and beyond). Forcing a data
alignment that prevents false sharing is not always feasible for large cache blocks because
it may cause very high levels of memory fragmentation. Another side effect of forcing
data alignment is that the processor will have to touch a larger number of blocks than
otherwise, which can potentially reduce the cache hit ratios.
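The following C sketch illustrates the trade-off, assuming a 128B block for illustration: two per-processor counters that share no data still share a cache block unless each is padded out to the block size, which removes the false sharing at the cost of the fragmentation mentioned above.

    #define BLOCK_SIZE 128   /* assumed cache block size in bytes */

    /* Both counters land in the same cache block: every update by one
     * processor invalidates the other's cached copy (false sharing). */
    struct counters_bad { long count[2]; };

    /* Padding each counter to a full block removes the false sharing,
     * at the cost of memory fragmentation. */
    struct counters_good {
        struct {
            long count;
            char pad[BLOCK_SIZE - sizeof(long)];
        } per_cpu[2];
    };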
1.3.3.3 Multithreading
A different way to tackle the problem of increasing memory latencies in
uniprocessors and shared memory multiprocessors is to use microprocessors that can
switch between different contexts very quickly such that whenever a long-delay cache
miss occurs, control is passed to a different thread. When the miss response arrives, the
original thread is signalled and regains control. The basic idea is the same as in a
multiprogrammed operating system that swaps out a process that is waiting for disk I/O
and schedules another from the ready queue. The hardware support necessary however is
quite different and much more complex. Context switching has to be very efficient, since the delays involved are on the order of a microsecond (i.e., the order of magnitude of a remote miss delay in a large multiprocessor), whereas in multiprogramming the delays are typically those of I/O devices, which are on the order of several tens of milliseconds.
Proposed multithreaded microarchitectures [48,2] include multiple identical register
sets, so that the context of various threads can be present at the same time in the processor.
Such architectures are quite complex: the state of the instruction pipeline also has to be saved, interrupts have to be handled precisely, and a returning cache miss has to be consumed very quickly, before an intervening invalidation or replacement causes the block to be flushed from the cache.
Multithreading also depends on the compiler exposing a significant number of
parallel threads of control in an application so that there is always a thread ready to be
switched in when the running thread experiences, say, a remote miss. Otherwise, multithreading may improve the throughput of the system but will not speed up individual applications.
1.3.3.4 Hardware Support for Synchronization
Naive implementations of synchronization primitives in shared memory multiprocessors typically exhibit very poor performance, as they do not always interact well
with the underlying cache coherence protocol, particularly with write-invalidate protocols.
This poor performance can sometimes be misinterpreted as inherently poor scalability of the problem or algorithm.
Some multiprocessor architectures address the above stated problem by including
separate synchronization networks that completely bypass the cache protocol [71]. A
different approach is to use the existing network and cache protocol but augmented with
special transactions and state information to better handle synchronization primitives. In
this thesis we take the latter approach. We analyze a variety of architectures and protocols
with and without hardware support for synchronization. We also propose and analyze a
new technique to efficiently support high contention locks in slotted rings under a
snooping protocol.
1.3.4 Performance Evaluation Methodologies
Since no hardware was built for the purpose of this thesis, we had to rely on other
ways to evaluate the performance of the various systems under study. A range of methods
was used as the work evolved, from approximate analytical models to highly detailed
program-driven simulations. The early investigations used trace-driven simulations and a
hybrid analytical methodology that used parameters derived from the trace-driven
simulations as inputs. After that we built a more detailed program-driven simulation
environment that was used to verify the accuracy of the early results as well as to obtain
results for more complex mechanisms that could not be captured accurately by the
analytical models. Chapter 3 explains the simulators and models used, and describes the set of benchmarks that drove our experiments.
Chapter 2
CACHE COHERENCE IN RING BASED
MULTIPROCESSORS
2.1 Ring Architectures
The unidirectional ring is the simplest form of point-to-point interconnection, which means a minimum number of links per node and simple interface hardware. In
particular, the unidirectional ring requires the simplest routing mechanism possible: the
only routing decision is whether to remove a message from the ring or to forward it to the
next node. Consequently store-and-forward is avoided, communication delays are shorter
and the raw bandwidth provided by point-to-point links is better utilized. Today’s point-to-
point connections are so fast that the board logic can eventually become the performance
bottleneck, and therefore simple and fast routing mechanisms will be critical.
Figure 2.1. Unidirectional Ring: (a) a 16-node unidirectional ring; (b) node structure (cache, shared memory partition, and ring interface with input/output latches).
The general architecture of the unidirectional ring is shown in Figure 2.1, and
consists of a set of processing elements containing a CPU, local cache memory, a fraction
of the shared memory space, and a ring interface. The data path on the ring interface
consists of one input link, a set of latches, and one output link. At each ring clock cycle the
contents of a latch are copied to the following latch, within a ring interface and across the
links, so that the interconnection behaves as a circular pipeline. The main function of the
latches is to hold an incoming message for a few clock cycles in order to determine
whether to forward it or not. The number of latches in each interface should be kept as small as possible so as to reduce the latency of messages. The sum of the capacities of all the latches in the ring interfaces, plus the number of bits in transit on the wires, is defined as the bit capacity of the ring. It is a function of the design of the ring
interface, the width of the links and latches, the length of the wires in between nodes and
the ring clock frequency. The ring bit capacity is an upper bound on the amount of data
that can be in transit at any given time.
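Under this definition, the bit capacity can be estimated as in the C sketch below; the 5 ns/m trace propagation figure and the sample parameter values are our own illustrative assumptions, not measurements.

    /* Bit capacity = latch bits held in the interfaces plus bits in
     * flight on the wires.  E.g., 16 nodes, 4 latch stages, 16-bit
     * links, 5 cm traces at 500 MHz give roughly 1056 bits. */
    long ring_bit_capacity(int nodes, int latches_per_node, int width_bits,
                           double link_len_m, double clock_hz)
    {
        /* ~5 ns/m propagation => symbols in transit per link */
        double symbols_on_wire = link_len_m * 5e-9 * clock_hz;
        return (long)(nodes * (latches_per_node + symbols_on_wire)
                      * width_bits);
    }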
In the type of ring interconnect schematically shown in Figure 2.1 there is no global
arbitration to determine when a node is allowed to send a message, as is the case with bus
systems. Therefore the decision of whether or not to transmit a message is taken locally at
each node, following an access control mechanism. This decision is complicated by the
fact that messages can be larger than the width of the data path, and may span multiple
pipeline stages. Furthermore, messages can be of different sizes. An interconnection for a
cache-coherent system has to deal with at least two types of messages, which we call
probe messages (or probes) and block messages (or blocks). Probes are short messages
carrying coherence requests (i.e., a miss or an invalidate request), consisting typically of a
cache block address field and other control/routing information. Block messages are made
up of a header, which is similar to a probe, and carry cache blocks for misses and write-backs. The ring access control mechanism has to be able to handle the offered traffic in a
way that optimizes the utilization of the communication resources while ensuring fairness
and avoiding node starvation.
There are two ways to deal with variable message sizes. The first one consists of splitting messages into equal-sized packets and sending the packets in non-consecutive pipeline cycles (i.e., fragmentation); the packets are reassembled at the destination. The second one consists of making sure that a message can be transmitted in consecutive pipeline
cycles regardless of the size. Message fragmentation is common practice in local and wide
area networks since the overhead of doing so is small with respect to the typical
transmission latencies, and most of the work is done by the system software. Moreover, a
general communication network has to have the functionality to deal with a highly
heterogeneous message traffic, in which application-level messages can be arbitrarily long.
It is unlikely that the overhead of fragmenting messages and reassembling them will be
justified in the context of a tightly coupled system in which transmission latencies are
small, and these mechanisms would have to be implemented in hardware. Assuming no
fragmentation, there are basically three well known ring access control mechanisms which
are briefly described in what follows.
2.1.1 Token-Passing Ring
A popular strategy in ring-connected local area networks is to identify a special
message as being a transmission token. Whoever holds the token is allowed to transmit
one or more messages, depending on the details of the protocol, before passing the token
to the next node downstream. The token has to be bit-encoded in such a way that it cannot be mistaken for an ordinary message.
The simplicity of token-passing is its greatest advantage. It is oblivious to the size of
the ring and it imposes no limitation on message sizes. However, if the bit capacity of the
ring is larger than the average message size, some of the available ring bandwidth is
wasted as the remaining “bits” on the ring cannot be utilized for transfers. Another
drawback is that a node has to wait for a token even when there are no other active nodes
in the system, an average delay of just under half of the ring round-trip message delay.
2.1.2 Register Insertion Ring
An alternative to token-passing, proposed initially by Hafner [38] in 1974, allows a node to transmit a message without having to wait for a token. In his approach,
each ring interface has a bypass FIFO buffer that can be inserted into the ring path to hold
off upstream messages and allow the node to transmit (see Figure 2.2). At the end of
transmission, if any message was actually inserted into the bypass buffer, its output is
redirected to the output link of the interface, allowing that message to proceed. The
interface remains unable to transmit until its bypass buffer is emptied. The bypass buffer is
emptied whenever idle symbols¹ are received.
Figure 2.2. Register insertion ring interface diagram (input link and latch, bypass FIFO buffer, send and receive queues, output latch and link).
The register insertion mechanism eliminates the need to wait for tokens while it also
permits multiple messages to be in transit in the ring at the same time. It requires however
that enough buffering be present at each interface to hold the largest message size that the
interface can issue. For very fast parallel rings this can become quite expensive, even infeasible when large messages have to be handled, since it would require a lot of very
fast registers. Fortunately in a cache coherent multiprocessor, the largest messages are
slightly larger than a cache block, i.e., typically less than 100 bytes. The IEEE Scalable Coherent Interface [43] set of standards has adopted the register insertion access control mechanism in its link-layer specification of ring-based multiprocessors.
Unlike the token-passing mechanism, the register insertion approach is susceptible
to unfair communication patterns and even node starvation. The problem arises when a
node has just finished transmitting a message and has some data in its bypass buffer. If one
or more upstream nodes are very active, it may happen that a node can never empty its
bypass buffer, and therefore can never transmit another message. The solution to this
problem requires an additional fairness of access policy. In the protocols proposed by the
SCI standard, idle symbols with go/no-go bits are used to provide feedback to active
upstream nodes, causing them to reduce their message injection rate. Scott et al. [61] report
1. An idle symbol is a ring data atom that is empty, i.e., carries no actual information.
that turning on the starvation prevention mechanism significantly impacts the effective
ring bandwidth.
2.1.3 Slotted Ring
The slotted ring access control mechanism can be seen as a generalization of token-
passing in which there are potentially multiple tokens. The idea is to divide the bit
capacity of a ring into marked message slots of fixed size. In its simplest form, all slots
have the same size, and can carry the largest possible message in the system. The message
slots circulate continuously through the ring, whether they are carrying messages or not. A
bit in the area reserved for the header in a slot indicates if the slot is busy or empty. The
access control mechanism is analogous to token passing, with an empty message slot
representing the permission to transmit. A message is removed from the ring by marking the corresponding slot as empty.
If the bit capacity of the ring is too small with respect to the maximum message size supported, the slotted ring degenerates into a token-passing ring. If the bit capacity is larger than the message slots, then multiple slots can be used and the effective
communication bandwidth is increased by allowing more than one ongoing transmission
at a time.
We believe that the slotted ring approach has clear advantages over register insertion.
The absence of a bypass buffer FIFO makes a slotted ring interface simpler and less costly
to implement than a register insertion ring interface. Moreover, a slotted ring interface
never has to buffer an entire message (as is the case with register insertion), but only the
header in order to determine whether to let it flow through or to remove it from the ring.
Dealing with fairness and node starvation is also simpler in the slotted ring, since in most
cases it suffices to ensure that a node that receives a message does not immediately reuse
the same slot, but lets it pass empty to the next node. In our experiments, this strategy has
virtually no impact on communication performance.
Comparing the overall performance of slotted and register insertion rings is a
difficult task since technological parameters and implementation considerations would
suggest that a slotted ring interface could be cheaper and clocked faster than a register
insertion ring interface. Ignoring this factor, analytical studies by Bhuyan and others [9]
conclude that a register insertion ring performs marginally better than a slotted ring for
light loads, but is outperformed by a slotted ring under medium and heavy loads. These
results reinforce our intuition since under light loads, most of the bypass buffer FIFOs are
removed from the ring path, reducing the latency of a round-trip message in the register
insertion ring to that of the slotted ring. Also, under light loads it is likely that whenever a
node has a message to transmit, its bypass buffer FIFO will be empty, allowing it to send
the message with no delay. In the slotted ring, even when all slots are empty a node still
has to wait for the beginning of a slot, incurring a delay that is proportional to the size of
the message slots. Under medium and heavy loads, however, some of the bypass buffer
FIFOs in the register insertion ring will be fully or partially in the ring path, increasing
message latency. Furthermore, as the load increases, the mechanisms to enforce fairness of
access come into play in the register insertion ring, effectively reducing the available
communication bandwidth [61].
In this thesis we concentrate on a slotted ring network, as opposed to token-passing
or register insertion rings. The rationale behind this choice includes the cost and
performance issues outlined above, which indicate that a slotted ring is a promising
alternative in the design space of tightly coupled multiprocessor interconnects. Our desire
to study all the major classes of cache coherence protocols also drove this study towards
slotted rings, since snooping, centralized directory and distributed directory protocols can
be efficiently implemented on top of it. Register insertion rings are not natural candidates
for snooping protocols, as will become clear in the subsections that follow.
2.1.4 Packaging and Electrical Considerations
For a ring network to be competitive with a bus, it has to allow for simple packaging
and wiring between nodes that facilitates high-speed signaling. The types of ring
architectures that we propose and analyze in this thesis are better suited for a passive
backplane (or centerplane) implementation. Although the length of point-to-point links
does not have a first order impact on their maximum clock rate, shorter links are preferred
for lower skew between the parallel traces. Figure 2.3 shows an example of how a
backplane ring can be implemented in a way that minimizes link length. The “corner”
traces can be made exactly the same length as the others through careful layout, if
necessary. With this backplane layout all traces can be made shorter than two inches,
resulting in wire propagation delays that are well under one nanosecond. There is no
crossover between traces which simplifies routing and eliminates the need for vias and
signal plane crossings that disturb signal propagation. All boards on the figure can be
identical.
For further reduction in link trace lengths one can use a centerplane approach (i.e.,
with boards plugging in from both sides of the cabinet). In this case, boards 0-3 on one
side of the centerplane can be aligned with boards 4-7 on the opposite side.
Figure 2.3. Illustration of a Ring Backplane (ring node PCBs with driver and receiver connectors plugged into a passive backplane that carries the link traces).
Keeping all the ring interface clocks synchronized can be accomplished in various
ways. One option is to send the clock information with the data and use a phase-lock loop
(PLL) in each interface to re-synchronize the local clocks. On a backplane implementation
however, it is simpler to generate the clock on the backplane and to distribute it to all
boards with controlled skew traces. Special dummy cards can be used to shorten the ring
at any point so that any number of nodes is allowed.
2.2 Dividing the Ring into Message Slots
While dividing the ring bit capacity into equal-sized message slots may be a simple strategy, if a significant number of messages are much smaller than the largest
message size, the message slots (and consequently the communication bandwidth) will often be underutilized. In this case, the solution is to allow a mix of different-sized message slots in an attempt to match the expected message traffic.
Matching a static allocation of message slots to the dynamic traffic patterns in a
generic communication network is extremely hard since the communication can be highly
heterogeneous. A poor slot allocation will unfairly favor a particular message class while
leaving others starving for bandwidth. Fortunately, for a shared memory multiprocessor
network that will basically carry cache coherence protocol messages, the traffic patterns
are relatively predictable and the problem of static allocation of slots becomes one of
determining the mix of probe and block message slots which maximizes performance. A
good hint comes from the observation that the common case in most protocols involves
one block message being issued as the reply to one probe message. If that is true, the right
mix of message slots is likely to be close to 1:1. Of course, there are cases in which this
ratio does not hold. In some protocols, write-back messages due to replacement are not
acknowledged, counting as a block message without a corresponding probe. Invalidation
requests caused by an attempt by a node to write on a clean (read-only cache state) block
may involve the sending of only probe messages, possibly many of them, without a
corresponding block message. In our studies, the best mix of slots was determined
experimentally for each of the cache protocols under consideration².
Another issue in partitioning a slotted ring into message slots is that in most cases
there is a remainder of ring pipeline stages in which no message slot fits. If the remainder
is very large, it may be beneficial to artificially insert additional pipeline stages on the ring
so that a message slot can fit. The decision to do so or not involves a trade-off between
bandwidth and latency, and on the relative size of the remainder with respect to a useful message slot. “Patching up” the ring with extra pipeline stages to fit another message slot
increases the total number of slots and therefore the communication concurrency, but at
the same time it increases the latency of all messages since the ring becomes a deeper
pipe. Again, the decision to patch up the ring for the different configurations in this thesis
was based upon experimentation.
2. A 1:1 mix of probe and block message slots is by no means an equal split of the byte bandwidth in the interconnect, since
probe messages are much smaller than block messages.
Implementing the insertion of extra pipeline stages in the ring is not technologically
challenging. It involves a simplified version of a bypass buffer FIFO that is commonly
known as e-store (or elastic store) buffer. The number of pipeline stages to be inserted can
be determined at initialization, and will depend on several factors, including the number of
processors actually plugged into the ring backplane.
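The slot allocation computation can be sketched in C as follows, assuming for illustration the 2:1 probe:block slot mix that our experiments in Section 2.3.1 found best; slot sizes are expressed in pipeline stages and all values are illustrative.

    #include <stdio.h>

    /* Allocate message slots over the ring's pipeline stages using a
     * 2:1 probe:block slot mix.  The leftover stages are candidates
     * for "patching up" with e-store buffering. */
    void partition_ring(int ring_stages, int probe_stages, int block_stages)
    {
        int group = 2 * probe_stages + block_stages; /* 2 probes + 1 block */
        int groups = ring_stages / group;
        int leftover = ring_stages % group;
        printf("%d probe slots, %d block slots, %d leftover stages\n",
               2 * groups, groups, leftover);
    }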
2.3 Cache Coherence Protocols for a Slotted Ring Multiprocessor
In the following subsections we show how the different classes of cache coherence
protocols can be implemented on a slotted ring multiprocessor. The centralized and
distributed directory protocols described below are basically adaptations of existing ones.
The snooping protocol however is an original contribution of this thesis. All the protocols
described are NUMA write-invalidate, write-allocate protocols, and assume strong
ordering of accesses. A processor blocks on all read and write misses, as well as on writes
to clean blocks. In later chapters we will incorporate relaxed consistency models as well as
other variations to the baseline protocols.
2.3.1 Centralized Directory Protocols
Directory protocols are generally considered the prescribed solution for cache
coherence on non-bus based systems, having been originally proposed by Censier and
Feautrier [13]. More recently centralized directory protocols have been implemented in
the DASH multiprocessor [51] project at Stanford University, and in the RPM
Multiprocessor Emulator at the University of Southern California [8].
Our centralized protocol design assumes a slotted ring with a certain mix of probe
and block message slots. The processing node architecture depicted in Figure 2.4 consists
of a local snooping bus that connects a cache, a memory bank with its associated directory
controller, and the ring interface. The processor may have an additional on-chip cache.
The coherence protocol is implemented by the cache controller and the memory directory
controller which operate independently, as opposed to a centralized node controller such
as the DASH remote access cache (RAC) or the Alewife CMMU [2]. A decentralized
implementation, as in the RPM emulator, permits concurrency in coherence handling and
therefore better hardware resource utilization. Miss requests and coherence requests by the
local processor are issued by the cache controller to the home memory controller of the
block, which can be either the local memory bank or a remote memory bank. The local/
remote determination is by physical address range. Both the memory controller and the
ring interface are aware of the processor identification number (ID) as well as the range of
physical addresses that map to the node. A request to a remote location or to a remote
cache is picked up by the ring interface from the local bus and routed to the slotted ring. Messages arriving at a node from the slotted ring can be directed either to the cache (as
with an invalidation request) or to the memory (as with a miss request from a remote
node).
Figure 2.4. Processing node architecture for a centralized directory protocol: a processor with its cache and cache controller, a memory bank with its memory controller and memory directory, and the slotted ring interface, all attached to a local bus.
Both cache and memory directories are stored in static RAM for faster access. Our
baseline centralized directory protocol has three permanent cache states (Invalid, Read-
Only, Read-Write) and three transient cache states (Pending-Read, Pending-Write,
Pending-Write-on-Clean). A load access (with a tag match) will hit in the cache if the state
is Read-Only or Read-Write. If the cache state is Invalid, a read miss message is sent and
the cache state changes to Pending-Read. The arrival of the read miss reply message fills
the cache block frame and changes its state to Read-Only. A store access hits in the cache
only if the cache state is Read-Write. If the cache state is Invalid a write miss message is
sent and the cache state changes to Pending-Write³. If the cache state is Read-Only, a
write-on-clean message is sent and the cache state goes to Pending-Write-on-Clean. The
arrival of either the write miss reply or the write-on-clean reply message changes the
cache block frame state to Read-Write.
An access to a cache block in which there is no tag match requires the current cache
block to be replaced. A cache block can be replaced immediately if the state is Invalid or
Read-Only, however a Read-Write state requires the block to be written back to the
corresponding memory bank (i.e., the home node).
The memory state of a block is encoded in its directory entry which consists of a
presence bit vector, a dirty bit, a lock bit, a lock type field and a requester ID field. The
presence bit vector has one bit for each cache in the system. A set presence bit indicates
that the corresponding cache has a cached copy of the block. A set dirty bit indicates that
there is one Read-Write (or dirty) cached copy of the block in the system. In this case only
one presence bit can be set, enforcing the single-writer/multiple-reader semantic. In fact,
since the replacement of a Read-Only block does not require a message to the memory
directory controller, a set presence bit (when the dirty bit is reset) does not guarantee that
the corresponding cache still has a copy of the block.
The lock bit, lock type field and requester ID field are used when a coherence
request cannot be satisfied immediately by the memory directory controller, but instead
involves communication with other system caches. This is the case when a read or write
miss request is received and the directory entry indicates that the block is dirty on another
cache, or when a write-on-clean request is received and there are other Read-Only caches
in the system. In those cases, the memory directory controller has to send messages to the
caches involved in the transaction and wait for the replies. During this time, other requests
to that memory block have to be rejected. This is accomplished by locking the
corresponding directory entry (i.e., setting the lock bit), and storing the type of the
outstanding transaction in the lock type field, as well as the ID of the original requester.
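Putting these fields together, the directory entry and the busy-entry check described above might be sketched as follows; field widths are illustrative, sized for a 32-node system.

    #include <stdint.h>

    /* Directory entry with the locking fields described above. */
    typedef struct {
        uint32_t presence;       /* one presence bit per cache          */
        unsigned dirty     : 1;
        unsigned locked    : 1;  /* lock bit: transaction outstanding   */
        unsigned lock_type : 3;  /* type of the outstanding transaction */
        unsigned requester : 5;  /* ID of the original requester        */
    } full_dir_entry_t;

    /* While an entry is locked, later requests for the block are
     * rejected and must be retried by their issuers. */
    int must_reject(const full_dir_entry_t *e) { return e->locked; }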
The latency of cache misses and other coherence transactions on the slotted ring
under a centralized directory protocol varies depending on the type of access, the relative
position of the home node with respect to the requester, and on how the particular block is
3. Note that due to our earlier assumption of strong ordering, a processor access will never find a cache block in a transient state, since the processor would have blocked on the earlier access that led to the transient state.
being shared when the miss occurs. Instruction fetches and accesses to private data are
resolved locally at the node, since code and private data segments are placed in the local
memory bank. Accesses to shared data in which the home happens to be in the local
memory bank may also be satisfied locally if the memory state of the block is such that no
other caches have to be involved in the transaction. That is the case with read misses to
non-dirty (clean) blocks, write-on-clean requests when the requester is the only node with
a cached copy, and write misses when there is no other cached copy of the block.
Accesses to shared data in which the home is in a remote node will always require at
least a full ring traversal for the request-reply pair since the ring is unidirectional. A single
traversal of the ring is required in the following situations:
• the block is not cached by any other node in the system.
• the access is a read miss and the block is only cached Read-Only by other nodes.
• the access is a read or a write miss and the block is currently dirty in the cache at the
home node.
• the access is a write-on-clean and only the requester and possibly the cache at the home
node have Read-Only copies of the block.
The remaining scenarios are the ones in which the home node has to send further
ring messages to other caches before a reply can be sent to the requester. Those are when:
1. the access is a read or a write miss and the block is cached Read-Write in another
node’s cache (i.e., not the home node). The dirty node⁴ becomes Read-Only (in a read
miss) or Invalid (in a write miss), after replying with the updated copy of the block.
2. the access is a write miss or a write-on-clean request and the block is cached Read-Only
in at least one other node’s cache (i.e., not the requester or the home nodes). All Read-
Only copies are invalidated and the corresponding nodes reply with invalidation
acknowledgments⁵.
4. Throughout this thesis we use “dirty” and “Read-Write” interchangeably. “Dirty node” refers to the node with a cached copy of a block in the Read-Write state.
5. Notice that we do not assume that the directory ring has the capability to send a multicast invalidation message. Multicast support (in the absence of snooping hardware) is fairly complex and does not benefit performance significantly.
The simplest way to deal with these cases is to have the home node send the
appropriate messages to the caches involved, wait for the responses and then reply to the
requester. This scheme is sometimes called a four-hop directory protocol, since two
request-reply sets (at least 4 messages) are needed to complete the coherence transaction.
In this case, the latency of messages will always include two full ring traversals.
A more efficient, albeit significantly more complex, alternative is the three-hop scheme, used in some cases in the Stanford DASH Multiprocessor [51]. The idea is to have the dirty node
or the invalidated Read-Only caches reply directly to the original requester, which in turn
communicates with the home after the transaction is completed. Although again at least
four messages will be exchanged, the original requester is allowed to proceed earlier,⁶ since the final communication with the home node can occur in the background.
Figure 2.5. Centralized directory protocol: read miss on a dirty block (the requester's probe travels to the home node, which forwards it to the dirty node; the miss reply carrying the block then returns to the requester).
Unlike with other generic topologies, a three-hop scheme will not always
significantly reduce latencies in a unidirectional ring, since the three-hop scheme will still
cause the ring to be traversed twice in some situations, as in the one depicted in Figure 2.5. If the dirty node (in the case of a miss) or any of the Read-Only nodes (in the case of a write miss or write-on-clean request) happens to be in the ring path between the requester
6. For misses to blocks cached dirty, as soon as the up-to-date copy of the block arrives from the former dirty node, the requester can proceed, forwarding the copy of the block to the home node in the background in the case of a read miss. For write misses and write-on-clean requests, the requester waits until all invalidation acknowledgments have been received before proceeding. Additionally, the home node sends a copy of the block to the requester on a write miss.
and the home node, the three hops will in fact take two complete ring traversals. Our
evaluation experiments assume a three-hop protocol.
Although only experimentation can determine the best mix of probe and block
message slots for a protocol, it is possible to narrow down the possible scenarios based
upon the protocol definition. The most frequent ring transactions are expected to involve a
single ring traversal, with a probe request message being followed by a block reply
message, each traversing half the ring on the average. These transactions require the same
probe and block slot bandwidth. A write-on-clean transaction however is likely to require
no block message slots at all, with a request probe followed by a reply probe and possibly
other probe messages to invalidate other Read-Only blocks. All the miss requests in which
the home node has to send additional messages before the coherence transaction is
completed will tend to increase the relative number of probe messages. Write-back
messages due to replacement are block messages with no corresponding probe
acknowledgment, but those are not expected to be significant for reasonably large cache
sizes. Therefore, we expect the total number of probe messages to be between Ix and 2x
the total number of block messages, with any message traversing roughly half the ring on
the average. Our simulation experiments⁷ with a variety of benchmarks, block sizes, cache
sizes, and system sizes confirm this. We measure the offered probe traffic as being the
total number of probe messages sent during execution multiplied by the average fraction
of the ring traversed by a probe. Block message traffic is measured in the same way. In all
simulations, the probe traffic varies between 1.24x and 1.61x the block message traffic. We simulated all benchmarks with probe:block slot ratios of 1:1, 1.5:1 and 2:1. The 2:1
mix consistently outperformed the others across all our experiments. The reason for the
better performance of the 2:1 mix with respect to the 1.5:1 mix lies in the fact that a probe
slot is much smaller than a block slot (block sizes used varied from 32B to 128B),
therefore the inclusion of an extra probe slot subtracted only a small amount of bandwidth
from the block message traffic, but decreased noticeably the average utilization of probe
slots, therefore reducing contention delays on virtually all types of coherence transactions.
7. We postpone the description of the simulation experiments until a later chapter. These results are presented here only
for the sake of justifying an architectural choice.
2.3.2 Distributed Directory Protocols
Distributed directory protocols have been proposed as a way to avoid the scalability
problem with full-map directory protocols without resorting to schemes that store only
partial information regarding the sharing of a block, as with the limited directory protocols
outlined earlier. For the scope of this thesis however we have already stated that the lack of
scalability of the full-map directory protocol is not an issue since this study is restricted to
relatively small systems. The main reason why we also study distributed directory
protocols is the significant interest that the SCI standard has generated among some
computer manufacturers [66,53]. SCI adopts a distributed directory protocol and uses a unidirectional ring as the primary interconnect structure. The similarities between our
work and the developments in the SCI front make it important for us to look at their design
space as well.
Distributed directory protocols or linked list protocols as they are also called, store
the information about the sharing of a cache block in a distributed fashion, instead of
centralizing it in the home node. In a linked list protocol, each block frame in a cache has
one or more pointer fields linking all nodes with cached copies of a block in a sharing list.
The home node keeps a pointer to the node at the head of the sharing list (the head node),
which is responsible for maintaining the coherence of the block (see Figure 2.6).
Figure 2.6. A linked list directory protocol
[Figure: the home memory keeps a pointer to the head of the sharing list; cached copies of the block in the caches of Proc X, Proc Y and Proc Z are linked by pointer fields from the Head node to the Tail node.]
By using this structure, the memory requirement to store directory information now
grows with log₂ of the number of nodes in the system, instead of linearly as is the case with
the full-map protocol. Our discussion of distributed directory protocols from this point on
assumes the baseline version that is adopted by the SCI standard.
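As a concrete (and much simplified) illustration of this pointer structure, the sketch below models a linked-list directory entry; the class and field names are our own invention, not taken from the SCI standard, and only the insertion of a new head on a miss is shown. Because each pointer needs only log₂ of the node count in bits, per-block directory state grows logarithmically rather than linearly.

    # Simplified model of a linked-list (SCI-style) directory entry
    # (illustrative only; names are ours, not from the SCI standard).

    class CacheLine:
        def __init__(self, node_id):
            self.node_id = node_id
            self.forward = None    # next sharer, toward the tail
            self.backward = None   # previous sharer, toward the head

    class HomeEntry:
        def __init__(self):
            self.head = None       # the home stores only this one pointer

        def miss(self, requester_id):
            """A new requester always becomes the head of the sharing list."""
            new_head = CacheLine(requester_id)
            if self.head is not None:
                new_head.forward = self.head    # old head becomes Middle or Tail
                self.head.backward = new_head
            self.head = new_head
            return new_head

    entry = HomeEntry()
    for node in (3, 5, 1):
        entry.miss(node)
    print(entry.head.node_id)  # prints 1: the most recent requester heads the list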
The permanent cache states in the SCI protocol are the following: Head, Head-Only,
Middle, Tail and Invalid. A load access (with a tag match) is a hit when the cache block
frame is in any state other than Invalid. A load miss causes a message to be sent to the
home node. If there is no current head, the home replies with a copy of the block and the
requester becomes Head-Only. If there is a head, the miss request is forwarded to it and the
home node points to the requester as the new head. The (old) head when receiving the
forwarded request sends a message to the requester containing a copy of the block
and its own ID. Upon receiving the copy of the block from the old head, the requester
becomes Head and the old head becomes either Middle or Tail, depending on whether
there are other nodes in the sharing list or not.
Removal from the sharing list involves sending messages to the adjacent forward and
backward nodes informing them to link to each other. A Head-Only node has the
additional responsibility of writing back the block to the memory since it is assumed that a
block in a sharing list may have been modified.
Only a Head-Only node has write permission to a cache block. If a node is Invalid,
Middle, or Tail it has to first become the Head then send an invalidation message to all the
other nodes in the list and wait for the acknowledgment. At that point it has become a
Head-Only node, and can proceed with the store. Middle and Tail nodes have to remove
themselves from the sharing list and re-append themselves at the head. In general, writes
issued when there are other nodes in the sharing list are very costly operations, particularly
if the issuing node is already on the sharing list as a Middle or Tail. Also, since the sharing
list can be arbitrarily long (as long as there are nodes in the system) and has to be traversed
sequentially, the invalidation delay itself is typically large. We avoid describing the
operation of the SCI cache coherence protocol in further detail due to its complexity. The
reader is referred to the SCI standard documents [59] for a complete description.
A key distinction in the definition of the SCI protocol with respect to a typical full-
map protocol is the way coherence enforcement responsibilities are removed from the
home (memory) node, and transferred to the head (cache) node. There are some clear
advantages in doing this. First, it reduces the load on the memory banks which could
alleviate hot spot contention. It also avoids the need to lock a directory entry in the home
node, which increases the maximum throughput of coherence requests to a given cache
line. Finally, a head node is capable of determining locally whether it has the only cache
copy in the system by checking whether its forward pointer is null; it is therefore able to write to
the block without having to synchronize with the home. In essence this implements a Read-
Only-Exclusive state. The main disadvantage of having the head as the coherence enforcer
is that all misses and other coherence requests to a block that has a non-empty sharing list
(i.e., has a head node) will be at least three-hop transactions.
The fact that the sharing list is formed on a demand basis further affects the delay of
invalidations since the order in which nodes appear on the list is completely oblivious to
the topology of the underlying network. For a unidirectional ring this can be especially
harmful, since the ring is the network with the largest diameter for a given number of
nodes. The example in Figure 2.7 shows one possible scenario. If node P2 suffers a write
miss it has to insert itself at the head of the list and proceed to invalidate the rest of it. The message
sequence will require the ring to be traversed six times before P2 is allowed to proceed. In
general, the additional number of times that the ring has to be traversed in a transaction
that involves more than two nodes equals the number of times that the order in
which the nodes appear on the sharing list is inverted with respect to the ring order, i.e., the
number of inversions. In Figure 2.7 there are five inversions. If P2 was already a non-head
member of the list, at least one more traversal would be needed.
Figure 2.7. An SCI sharing list with five inversions
[Figure: a ring of nodes with the requester P2 and the head P7; the sharing-list order is oblivious to the nodes' ring positions, producing five inversions.]
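The traversal count can be checked mechanically. The function below (our own illustration, with an arbitrary example list) counts the inversions of a sharing list against the ring order, i.e., the extra ring traversals a sequential invalidation would incur.

    # Counting sharing-list inversions with respect to ring order (illustration).
    # On a unidirectional ring, whenever a list successor sits at or behind its
    # predecessor's ring position, reaching it costs one extra ring traversal.

    def inversions(sharing_list, ring_position):
        """sharing_list: node names from head to tail.
        ring_position: node name -> position along the ring direction."""
        count = 0
        for prev, nxt in zip(sharing_list, sharing_list[1:]):
            if ring_position[nxt] <= ring_position[prev]:
                count += 1   # must wrap around the ring to reach nxt
        return count

    ring_position = {f"P{i}": i for i in range(8)}   # hypothetical 8-node ring
    print(inversions(["P7", "P6", "P5", "P4", "P3", "P2"], ring_position))  # 5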
The ratio of probe messages to block messages exchanged in a slotted ring under the
SCI protocol is higher than with the centralized directory protocol. That is because in the
SCI protocol it is the head that enforces coherent access to a block, but only the home
knows who the head is. Therefore, whenever there is a head, two probe messages are
necessary in order to reach it from a requesting node. The measured probe traffic with SCI
in our experiments was between 1.6x and 1.8x the block message traffic. Consequently,
we also choose the mix of two probe slots for every block slot for the SCI protocol
experiments.
2.3.3 Snooping Protocols
It is generally believed that snooping protocols are only suitable for bus-based
systems, and therefore protocols based on directories are favored for point-to-point
connected systems, such as the slotted ring. This intuition is based on the observation that
snooping relies heavily on the broadcast of coherence requests, which comes for free in
bus systems but can be very expensive in general point-to-point interconnects. We contend
however that snooping is an attractive strategy for the unidirectional ring due to its low
cost of broadcast with respect to unicast. The bandwidth used by a broadcast in a
unidirectional ring is roughly twice the bandwidth used for an average unicast. Moreover,
only probes have to be broadcast; block messages, which are longer and therefore more
bandwidth critical, do not require broadcasting. Being able to efficiently broadcast
requests is an enabling feature but other issues have to be addressed before a snooping
implementation is considered feasible.
The fundamental idea behind implementing snooping in a slotted ring is that a ring
interface can snoop on a passing probe without having to remove that probe from the ring.
The probe is only removed from the ring by the sender, after all nodes have had the chance to
snoop. The main difference between ring and bus snooping is that the snooping is not done
simultaneously by all nodes in the ring. Additionally, a snooper in a bus can activate
bussed signals as a response to a probe and those signals are seen almost immediately by
all the other nodes in the system. No such feature is present in the slotted ring, therefore
any acknowledgment signals will have to be carried out as ring messages, or piggybacked
in subsequent messages.
Our baseline snooping protocol [5] is an ownership-based write-back, write-
invalidate protocol with an allocate-on-write policy and three permanent cache states
(Invalid, Read-Only, and Read-Write), similarly to the previously defined centralized
directory protocol. However, instead of a full-map directory entry, only a single bit of state
information is kept with every block frame of physical memory at the home node. This bit,
called the dirty bit as in the centralized directory protocol, indicates whether the home
has the current version of the block. The home node has the current version of a
block whenever there is no Read-Write copy of the block in the system; in this case the
home node owns the block and the dirty bit is reset. When a node attempts a store it starts
a cache transaction that will eventually bring its local cache block state to Read-Write. At
that point, that node has the only valid copy of the block in the system and therefore it
owns the cache block. A set dirty bit indicates to the home that it no longer owns the
block.
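A minimal sketch of this bookkeeping at the home node is given below (our own illustration, with invented names; one bit per memory block frame).

    # Home-node dirty bit for the snooping protocol (illustrative sketch).
    # dirty == False: the home owns the block and can service misses itself.
    # dirty == True : some cache holds the block Read-Write and must respond.

    class HomeBlock:
        def __init__(self, data):
            self.data = data
            self.dirty = False        # reset: the home has the current version

        def write_miss(self):
            """Write miss probe observed at the home for this block."""
            if not self.dirty:
                self.dirty = True     # ownership moves to the requester
                return self.data      # the home supplies the block copy
            return None               # the dirty node will reply instead

        def write_back(self, data):
            """A Read-Write copy relinquishes ownership on replacement."""
            self.data = data
            self.dirty = False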
Figure 2.8. Read miss on a dirty block: (a) requester removes miss reply message; (b) home removes miss reply message.
[Figure: in both cases the requester broadcasts (1) a read miss probe and the dirty node sends (2) a miss reply block message; whichever of the home and the requester lies further downstream from the dirty node removes the reply.]
In the absence of conflicting accesses to the same memory block, the behavior of the
snooping protocol is quite simple. When a miss probe is broadcast, it triggers a response
from the node that currently owns the block, causing it to insert a block message in the
ring with an up-to-date copy of the missed cache block. If the probe is a read miss, all
snoopers ignore it with the possible exception of the snooper in the dirty node. If there is a
dirty node, the miss reply has to update the copy of the block at the home node as well;
therefore it is only removed from the ring by whichever of the home and the requester is
furthest downstream from the dirty node (see Figure 2.8).
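On a unidirectional ring, "furthest downstream" can be decided from ring positions alone; the sketch below (our own illustration, with arbitrary example positions) picks the node that strips the miss reply.

    # Deciding which node removes a miss reply from a unidirectional ring
    # (illustration; positions increase in the direction of message flow).

    def hops(src, dst, ring_size):
        """Hops traveled downstream from src to dst."""
        return (dst - src) % ring_size

    def reply_remover(dirty, home, requester, ring_size):
        """Home and requester both consume the reply sent by the dirty node;
        whichever node the reply reaches last removes it from the ring."""
        if hops(dirty, home, ring_size) > hops(dirty, requester, ring_size):
            return "home"
        return "requester"

    print(reply_remover(dirty=2, home=4, requester=6, ring_size=8))  # requester
    print(reply_remover(dirty=2, home=6, requester=4, ring_size=8))  # home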
If the probe is a write miss, all nodes with Read-Only copies of the block are
invalidated, and if there is a dirty node it also goes to the Invalid state after replying to the
miss. If the write miss probe finds the dirty bit reset at the home, the home replies with the
copy of the block and sets the dirty bit for that block. A write miss reply from a dirty node
does not update the home.
A write-on-clean request is treated the same way as a write miss request, but it does
not require a block message reply since the requester already has a valid (Read-Only) copy
of the block.
A cache block frame in the Read-Write state has to write back the block to the home
node before it can replace it, i.e., it has to relinquish ownership of the block back to the
home node. The write-back arrival at the home resets the corresponding dirty bit.
It is important to observe that all the coherence transactions in the snooping protocol
are completed in such a way that the latency seen by the requester only includes the
equivalent of one full ring traversal. Although a probe travels the entire ring, as soon as the
owner sees the probe it fetches the cache block and replies directly to the requester, with
no need to further synchronize with other possible nodes, as is the case with the
centralized directory protocol. The snooping mechanism in all the other nodes ensures that
coherence will be maintained.
In the description above there is no reference to acknowledgments for probe
messages. Those are clearly necessary for both fault detection and conflict resolution. A
conflict occurs when probes are issued for a block for which there is a previous
outstanding coherence transaction. For the time being let us assume that a positive
acknowledgment mechanism exists in the form of a bit that is piggybacked in a subsequent
probe slot. Later we discuss how this is implemented. The main idea is that the current
owner of the block serves as a serialization point and arbitrates (whenever necessary)
between conflicting requests by acknowledging only one of them. When a requester sees
its probe returning without an acknowledgment it assumes that it has been rejected and it
re-issues the request⁸.
The dirty bit is important to the performance of snooping on the slotted ring since it
ensures that at most one node will respond to any given coherence transaction. The
existence of a single responder (e.g., the owner) significantly simplifies the protocol and
allows for important performance optimizations. On a UMA bus, a dirty bit is not
necessary for correctness since a Read-Write node can intervene on the memory bank’s
response to a cache miss and reply to the requester instead. Intervening in this sense is not
possible on the ring. However, on a NUMA bus or ring system, a dirty bit serves an
additional purpose of allowing local read accesses to clean blocks to proceed without
having to send a ring message. In the absence of a dirty bit, these accesses would have to
issue probes on the ring since there is the chance that some other cache has a Read-Write
copy of the block.
An additional optimization for NUMA buses and rings would be to include a
cached-remote bit with every memory block frame in the home node. The function of this
bit is to indicate when a node that is not the home has a Read-Only copy of the block. A
reset cached-remote bit could cut down on the number of useless invalidation probe
messages on the ring, since it would ensure that there are no remotely cached copies of the
block to invalidate.
The relative ratio of probe to block messages on the snooping protocol is lower than
on both directory protocols presented earlier. That is because a single broadcast probe is
used in most cases, and the request probe already takes care of invalidating cached copies
when necessary. This would suggest that a 1:1 mix of probe and block slots might be
preferred. However, a probe always traverses the entire ring while a block message only
traverses half the ring on the average. Consequently, we also use a 2:1 mix of probe to
block slots in snooping. Our simulation experiments confirm this mix as being the
optimum for a snooping ring.
8. A write-on-clean request has to be re-issued as a read miss request, since the requester can no longer assume that it
has an up-to-date copy of the block.
Figure 2.9. Grouping message slots into frames
[Figure: each frame contains an odd probe slot, an even probe slot, and a block slot; frames repeat around the ring (Frame 0, Frame 1, Frame 2, ...).]
Snooping implementations have harder real-time constraints than non-snooping
implementations, since the snooper has to take actions on all memory operations issued by
the system. Therefore the snooper hardware has to be able to respond at the maximum rate
at which coherence requests arrive from the interconnection. Considering today’s point-to-
point connection speeds, it may become very hard to meet such a requirement. Using the
slotted ring access control is one way to overcome this problem, since slots for coherence
requests can be separated by a minimum number of clock cycles by interleaving them with
other types of slots, forming what we call frames. A frame contains two probe slots and
one block slot, maintaining the 2:1 mix that is desired (see Figure 2.9). Furthermore, to
alleviate the problem of having to snoop on two consecutive probes, we separate the probe
slots into one even probe and one odd probe slot. Even and odd refer to the parity of the
block address that is being accessed. By doing that, we can interleave the dual directory in
the snooper into two (even/odd) banks, and we guarantee that two snooping accesses to the
same bank will be separated by a frame.
Dealing with such real-time constraints of snooping protocols is feasible in the
slotted ring because of the existence of fixed size message slots. We do not believe that the
same applies to a register-insertion ring, in which there is no way to guarantee the spacing
between probes that is required by the snooper. Table 2.1 below shows the snooping rate
required for a ring clocked at 500MHz for different cache block sizes and ring widths.
Table 2.1. Snooping rate (nanoseconds)

                      ring data width (bits)
block size          16        32        64
16 bytes            40        20        10
32 bytes            56        28        14
64 bytes            88        44        22
128 bytes          152        76        38
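The entries in Table 2.1 follow directly from the frame geometry. The sketch below reproduces them under our reading of that geometry (a 64-bit probe slot, and a block slot that adds one probe-sized header to the data); both are assumptions made for illustration.

    # Reproducing Table 2.1 from the frame geometry (assumptions: a probe
    # slot carries 64 bits; a block slot adds one probe-sized header).

    RING_CLOCK_NS = 2   # 500 MHz ring clock, one pipeline stage per cycle

    def frame_time_ns(block_bytes, width_bits):
        probe_stages = 64 // width_bits                    # fixed-size probe slot
        block_stages = block_bytes * 8 // width_bits + probe_stages
        return (2 * probe_stages + block_stages) * RING_CLOCK_NS

    for block in (16, 32, 64, 128):
        print(block, [frame_time_ns(block, w) for w in (16, 32, 64)])
    # 16 [40, 20, 10]; 32 [56, 28, 14]; 64 [88, 44, 22]; 128 [152, 76, 38]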
Organizing probe and block slots into frames makes it possible to implement the
piggyback acknowledgment mechanism for probes that is required by the snooping
protocol. Basically, the acknowledgment bit for a probe slot resides in the header of the
respective probe slot in a subsequent frame, allowing a node enough time to respond.
2.4 Summary
In this chapter we described the types of ring architectures that could be used in a
shared memory multiprocessor. Token-passing, register-insertion and slotted ring access
control schemes were discussed and we chose to pursue the investigation of a slotted ring
design. The rationale behind choosing the slotted ring includes its simplicity of
implementation, lower high-speed buffering requirements, simple starvation avoidance
policies, and the possibility of supporting all the major cache coherence protocol classes.
The implementation of snooping protocols requires the system to guarantee a minimum
inter-arrival delay between consecutive cache coherence requests, which is only possible in
the slotted ring.
We also described a baseline centralized directory protocol and a distributed
directory protocol based on linked lists which was proposed by the IEEE SCI standard
committee. Implementation issues for both protocols on a slotted ring were discussed as
well.
Finally we presented the design of a snooping cache protocol for a slotted ring
multiprocessor that is the first proposed snooping protocol for a non-bus system. We
showed how a ring snooper interface can be implemented and how conflicting requests are
resolved. Contrary to the directory protocols, the snooping protocol guarantees that all
coherence transactions can complete in only one ring traversal. Directory protocols dictate
that requests have to be sent to the home first. Whenever the home cannot reply directly, as
when the owner is a remote cache, a fraction of cache transactions may require the ring to
be traversed multiple times. In the following chapters we examine the performance of
these systems in detail.
Chapter 3
PERFORMANCE EVALUATION METHODOLOGY
Since no hardware was built for the purpose of this thesis, we had to rely on other
ways to evaluate the performance of the various systems under study. A range of methods
was used as the work evolved, from approximate analytical models to highly detailed
program-driven simulations. All performance numbers presented in this thesis were
generated using one of three methods: trace-driven simulations, analytical models
parameterized by trace-driven simulation results, and program-driven simulations. We
present the three methods in this order in this chapter. The trace-driven simulations and
analytical models were used in our early investigations and allowed us to sweep a very
wide design space relatively efficiently. The program-driven simulation environment was
later developed to verify the early results as well as to more accurately evaluate more
complex systems and more subtle system configurations that were not well captured by
the analytical models.
3.1 Trace-driven Simulations
Trace-driven simulation is a widely used methodology for system performance
evaluation and debugging. First, a trace of the execution of a program has to be
obtained. The trace is an ordered list of records, each record being a log of a relevant
processor operation. In a trace derived for multiprocessor performance evaluation, a
trace record typically contains a log of a memory operation, including the ID of the issuing
processor, the address, and the type of operation (load, store, instruction fetch, etc.).
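For illustration, such a trace record can be modeled as follows (a sketch with invented field names; actual trace formats vary with the tracing tool).

    # A minimal model of a multiprocessor memory-reference trace (illustration).
    from collections import namedtuple

    TraceRecord = namedtuple("TraceRecord", ["cpu_id", "op", "address"])

    def read_trace(lines):
        """Parse lines of the form '<cpu> <load|store|ifetch> <hex address>'."""
        for line in lines:
            cpu, op, addr = line.split()
            yield TraceRecord(int(cpu), op, int(addr, 16))

    for rec in read_trace(["0 load 0x1f40", "3 store 0x1f40"]):
        print(rec)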
The methods used to derive traces of parallel execution include in-line tracing [25],
simulated execution [17], and hardware monitoring [70]. In-line tracing is a form of
software monitoring that uses compiler techniques to insert extra instructions in the
executable code of the program to create a “road map” of the actual execution of the
program (which branches were taken, etc.). Post-processing of the “road map” recreates
the execution trace. In simulated execution, the parallel program is executed on top of
a multiprocessor instruction set simulator, instead of on a real machine. This method is
very flexible and is also used for program-driven simulations. Hardware monitoring
consists of adding hardware devices to the processor boards that snoop on all memory
cycles visible at the board level. This method is not very popular since it requires hardware
that is not typically present in current computers. Moreover, events that are not visible at
the board level, such as on-chip cache activity, cannot be accounted for. We have derived
traces of parallel applications by modifying the CacheMire test bench [11] from Lund
University, Sweden. CacheMire is itself a program-driven simulator.
Figure 3.1 Structure of a trace-driven simulator
[Figure: one input trace feeds each processor model, each processor model drives a cache model, and all cache models connect to common interconnection network and memory models.]
We have developed a set of trace-driven simulators of bus and ring architectures
using CSIM [63], a library of C functions tailored for process-oriented simulation. CSIM
functions basically implement an event calendar and a process scheduler, so that all
processes in the simulation execute within a single Unix process. Our simulator (see
Figure 3.1) is composed of a set of simulation processes that share a number of common
facilities and synchronize through the use of event variables. There is one simulation
process representing each processor. All potentially mutually exclusive system resources
are modeled as facilities, including cache memories, buffers, interconnect resources and
main memory. Once a process is activated, it reads the input trace for the next reference
related to the thread assigned to it and simulates it. A reference that hits in the cache only
accounts for one processor cycle. Misses can generate coherence messages through the
network and experience variable delays depending on conflicts in the network and on the
specific coherence transaction. The interconnections are simulated with cycle-by-cycle
precision, as are the cache coherence overhead and network interference.
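The per-processor simulation process can be pictured in miniature as below (our own sketch in plain Python rather than CSIM facilities and event variables; it reuses read_trace from the earlier sketch, and the fixed miss delay stands in for the variable, contention-dependent delay of the real simulator).

    # One processor process of the trace-driven simulator, in miniature.

    class ToyCache:
        def __init__(self, block_bytes=16):
            self.block_bytes = block_bytes
            self.blocks = set()

        def lookup(self, address):
            return address // self.block_bytes in self.blocks

        def fill(self, address):
            self.blocks.add(address // self.block_bytes)

    def run_processor(trace, cache, miss_latency=70):
        """Hits cost one processor cycle; misses cost a coherence transaction."""
        cycles = 0
        for rec in trace:
            if cache.lookup(rec.address):
                cycles += 1
            else:
                cycles += miss_latency
                cache.fill(rec.address)
        return cycles

    refs = read_trace(["0 load 0x100", "0 load 0x104"])
    print(run_processor(refs, ToyCache()))  # 71: one miss, then a hit in the block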
In this type of asynchronous trace-driven simulations, the relative timing of different
program threads can shift with respect to the order in which events actually happened
when the trace was derived. This is one of the main validity problems related to the trace-
driven approach, as pointed out by Koldinger et al. [47] and Bittar [10]. However, by
enforcing that accesses to critical sections are respected in the simulated execution, and by
implementing barrier synchronizations in the simulator, the essential behavior of the
original execution is preserved, and the results obtained are considered relatively accurate
[32].
Although the current version of the trace-driven simulator is tailored for
performance evaluation, it also verifies the correctness of the protocols to some extent,
checking for deadlocks and livelocks, as well as violations of memory coherence.
The main performance parameters of interest that can be extracted from the
asynchronous trace-driven simulations of cache coherent multiprocessors are:
• various coherence statistics: total miss rates, miss rates on shared data, miss rates on
private data, number of invalidations and invalidation patterns, among others.
• processor utilization: average percentage of the time in which the processor executes
instructions instead of waiting for misses or for synchronizations. This is the most
relevant measure of system efficiency.
• network utilization: average percentage of the time in which the network is busy.
• latency of messages: time taken by miss and invalidation requests.
• network access delay: average time from when a message is ready to transmit until the
network can accept the message.
3.2 A Hybrid Analytical Methodology
A very efficient and elegant way to analyze the performance of a computer system is
to rely on mathematical models that concentrate on the essential features of the system
that impact performance. Such models use either stochastic or operational arguments to
derive average system behavior based on a statistical description of the workload. The
advantages of analytical models are small execution time, flexibility and inherent insight
into the fundamental performance characteristics of the system being modeled.
For the study of memory systems in parallel computers, however, it is hard to
formulate accurate analytical models. That is because the performance of memory systems
is highly dependent on factors that are difficult to model, such as spatial and temporal
locality, access ordering, synchronization actions, and dynamic sharing behavior.
Consequently most if not all the recent studies on multiprocessor memory system
performance rely on simulation methods for quantitative evaluations, especially trace-
driven simulations and program-driven simulations.
Our analytical methodology attempts to use the best of both analytical models and
trace-driven simulations to build a hybrid compromise solution in terms of accuracy and
performance of the evaluation. The idea is to run a trace-driven simulation to derive the set
of parameters that describe the access pattern and cache coherence behavior of the
program. These parameters are for the most part timing independent, in the sense that they
vary very little when say the network speed goes up by an order of magnitude*. We then
use an analytical model to study different timing relationships between processor and
network speeds. Below the formulation of the models for the snooping slotted ring and the
directory-based slotted ring are derived. We could not find an accurate model for an SCI
slotted ring, therefore our results for those are derived with trace- and program-driven
simulations only.
1. Insensitivity to timing is also an application characteristic. Applications with static data and task allocation tend to
follow the same flow of execution regardless of timing issues. Applications with dynamic data-dependent behavior (such
as task queue based programs) are more affected by the relative timing among threads. In general none of the applications
that we study are significantly sensitive to variations on thread interleaving.
3.2.1 Analytic Models for Ring-based Protocols
We first describe the overall program execution time as a weighted sum of all types
of events and their respective latencies. We assume an SPMD model in which we are only
analyzing the parallel section of the program. Therefore the execution of the various
threads is considered homogeneous and statistically identical. We follow the modeling
methodology proposed by Menascé and Barroso [54], in which an estimate of the program
execution time and a count of shared memory operations is used to derive the average
arrival rate of messages in the network. Using the average message arrival rate we derive
the network utilization and subsequently the average network latency. The average
network latency is in turn used to derive a new estimate of the program execution time,
and so on. This fixed point iteration was proved to converge whenever the network latency
is a monotonically non-decreasing function of the message arrival rate, which is the case
for all relevant network models.
For a snooping slotted ring, we write the program execution time (PET) as:

PET = N_{cyc} \cdot P_{cyc} + N_{lmiss} \cdot L_{lmiss} + N_{shmiss} \cdot L_{shmiss} + N_{inv} \cdot L_{inv} + N_{wback} \cdot L_{wback}    (EQ 1)
where the parameters derived from the trace-driven simulation are the event counts
(N_event). The latencies of the associated events (L_event) have a fixed component that is
based on the hardware timings and a variable component that arises from contention for
the interconnect. The event count parameters are listed in Table 3.1 below:
Table 3.1 Snooping protocol parameters from trace-driven simulations of the program

Parameter    Definition
N_cyc        Total number of instructions executed in a given processor
N_lmiss      Total number of misses to local memory by a processor
N_shmiss     Total number of misses to shared memory by a processor
N_inv        Total number of invalidations sent by a processor
N_wback      Total number of write-backs by a processor
The latency of a local miss (L_lmiss) is considered constant (i.e., we assume no
contention for memory banks). The latencies of shared misses, invalidations and write-backs
are expressed as follows:

L_{shmiss} = L_{lmiss} + R_{clock} \cdot pipesize + W_{probes} + W_{blocks}    (EQ 2)

L_{inv} = R_{clock} \cdot pipesize + W_{probes}    (EQ 3)

L_{wback} = W_{blocks}    (EQ 4)
where W_probes and W_blocks are respectively the average waiting times for the beginning of
an empty probe and block slot, so that message transmission can begin. R_clock is the
period of the ring clock and pipesize is the number of pipeline stages in the entire ring. The
only unknowns in Equations 2-4 are the waiting times for probe and block slots. If we
assume that the message arrival process is Poisson distributed, the residual time to find the
beginning of a slot can be considered to be uniformly distributed between 0 and T_frame,
where T_frame is the time interval between two slots of the same type. Therefore we can
write the average waiting time to find an empty probe slot as

W_{probes} = T_{frame} \cdot \sum_{i=1}^{\infty} (i - 1/2) \cdot U_{probes}^{i-1} (1 - U_{probes})    (EQ 5)

which reduces to

W_{probes} = T_{frame} \cdot \left( 1/2 + \frac{U_{probes}}{1 - U_{probes}} \right)    (EQ 6)
where U_probes is the average utilization of a probe slot. The expression for the waiting time
to find an empty block slot is identical to Equation 6. The utilization of probe and block
slots can be expressed as the ratio between the average arrival rate and the average service
rate (i.e., the inverse of the average time that a slot is kept busy by a probe or block message)
of probe and block messages (see Equation 7 below).

U_{probes} = \frac{\lambda_{probes}}{\mu_{probes}}    (EQ 7)

The service rates for probe and block slots are given below, assuming that a probe travels the
entire ring while a block only travels half the ring on the average.

\mu_{probes} = \frac{P_{slots}}{pipesize \times R_{clock}}    (EQ 8)

\mu_{blocks} = \frac{B_{slots} \times 2}{pipesize \times R_{clock}}    (EQ 9)
where P_slots and B_slots are the numbers of probe and block slots in the ring. The arrival
rates of probe and block messages are expressed as a function of the program execution time
(PET) and the counts of events that issue probe and block messages, and are given below:

\lambda_{probes} = \frac{N_{shmiss} + N_{inv}}{PET} \times N_{proc}    (EQ 10)

\lambda_{blocks} = \frac{N_{shmiss} + N_{wback}}{PET} \times N_{proc}    (EQ 11)
Therefore, the model is basically a fixed point iteration that starts with an estimate for
the probe and block utilizations and iterates until convergence. Our convergence criterion was a percentage
difference smaller than 0.001% between successive iterations. Convergence was typically
very quick, hardly requiring more than 15 iterations.
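The iteration over Equations 1-11 is compact enough to state in full. The sketch below is our restatement of it; all parameter values in the example are illustrative stand-ins, not measured inputs.

    # Fixed-point solution of the snooping-ring model, Equations 1-11
    # (our restatement; the example inputs below are hypothetical).

    def solve_snooping_model(counts, L_lmiss, P_cyc, R_clock, pipesize,
                             P_slots, B_slots, T_frame, n_proc,
                             tol=1e-12, max_iter=100):
        W_p = W_b = 0.0                                   # initial waiting times
        for _ in range(max_iter):
            L_shmiss = L_lmiss + R_clock * pipesize + W_p + W_b        # EQ 2
            L_inv = R_clock * pipesize + W_p                           # EQ 3
            L_wback = W_b                                              # EQ 4
            PET = (counts["cyc"] * P_cyc + counts["lmiss"] * L_lmiss   # EQ 1
                   + counts["shmiss"] * L_shmiss
                   + counts["inv"] * L_inv + counts["wback"] * L_wback)
            lam_p = (counts["shmiss"] + counts["inv"]) / PET * n_proc     # EQ 10
            lam_b = (counts["shmiss"] + counts["wback"]) / PET * n_proc   # EQ 11
            U_p = lam_p / (P_slots / (pipesize * R_clock))             # EQ 7, 8
            U_b = lam_b / (B_slots * 2 / (pipesize * R_clock))         # EQ 7, 9
            new_W_p = T_frame * (0.5 + U_p / (1 - U_p))                # EQ 6
            new_W_b = T_frame * (0.5 + U_b / (1 - U_b))
            if abs(new_W_p - W_p) < tol and abs(new_W_b - W_b) < tol:
                break
            W_p, W_b = new_W_p, new_W_b
        return PET, U_p, U_b

    counts = dict(cyc=10000000, lmiss=40000, shmiss=80000,
                  inv=20000, wback=30000)          # hypothetical event counts
    print(solve_snooping_model(counts, L_lmiss=140e-9, P_cyc=5e-9,
                               R_clock=2e-9, pipesize=50, P_slots=10,
                               B_slots=5, T_frame=20e-9, n_proc=16))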
The model for a directory ring is very similar to the above. The main difference is
that we had to discriminate between shared misses and invalidations that required two ring
traversals and those that required only one. The event counters derived from the
simulations for the directory model are listed in Table 3.2 below.
Table 3.2. Directory protocol parameters from trace-driven simulations of the program

Parameter     Definition
N_cyc         Total number of instructions executed in a given processor
N_lmiss       Total number of misses to local memory by a processor
N_1shmiss     Total number of 1-cycle misses to shared memory by a processor
N_2shmiss     Total number of 2-cycle misses to shared memory by a processor
N_1inv        Total number of 1-cycle invalidations sent by a processor
N_2inv        Total number of 2-cycle invalidations sent by a processor
N_wback       Total number of write-backs by a processor
The models used for bus-based systems are a simplification of this, since the bus can
be considered a single slot interconnect to which all nodes have simultaneous access.
3.3 Program-driven Simulations
To allow the timings of the system being simulated to affect the execution of a
program, it is necessary that the application being used to drive the simulation execute at
the same time as the simulator itself. For instance, when a load instruction is issued by the
program, it is not known whether it will hit on a cache or not until the simulator of the
cache is called. If the load is a miss, the execution of the issuing thread may have to be
suspended until the miss is satisfied, possibly delaying the execution of the thread with
respect to the rest of the program and allowing another thread to arrive first at a lock
acquire operation. Had the cache been larger, the load operation could have hit and the
order of lock acquisition could have been different.
In program- and execution-driven simulations [11,17,73], the system processors are
implemented as simulation processes, similarly to caches, interconnects, buffers and
memory modules. Every time a processor is scheduled to execute it simulates the
execution of one (or a few) instruction(s). The execution of an instruction may activate the
simulation of caches, interconnects or memory modules as needed. Timing relations
between different events are kept by a simulated clock, and an event list that ensures that
earlier events execute first.
Program-driven simulations are even slower than trace-driven simulations since
processor execution has to proceed concurrently with system simulation, but they are
generally considered the most accurate simulation methodology. How accurate a program-
driven simulator actually is depends on the level of detail in which the various system
components are simulated, as well as on the time granularity at which the processes
are scheduled.
After using trace-driven simulations and analytical models for our initial studies, we
developed a full-featured program-driven simulation environment that is capable of
efficiently simulating a variety of bus, ring and crossbar systems at an arbitrary level of
detail. Our simulator used the core instruction interpreter module from the CacheMire test
bench as part of a much larger package that similarly to the trace-driven simulators also
uses the CSIM library. Our most complex simulation models were developed in this
environment, including simulation of multi-level cache structures, hardware support for
synchronization operations and relaxed consistency models.
The simulator runs entirely within a Unix process, and uses the CSIM process-
oriented scheduler to switch between simulation threads. We have structured the simulator
in such a way that we can vary the simulation granularity. When maximum accuracy is
desired, the simulator allows re-scheduling at every instruction of every application
process. Under this mode, the simulator is the most faithful to the actual ordering of events
in the target system being simulated, but at the same time it slows down the simulation
dramatically. When maximum performance is desired, the simulator only allows
re-scheduling at global events, such as misses or invalidations to shared memory, or
synchronization operations. In this mode the simulator may execute instructions from a
given application process for relatively long runs before giving back control to the
scheduler, which greatly reduces the context switch overhead and therefore improves the
simulation speed. However, runs of instructions will appear to execute atomically in the
simulator, whereas in the target system they could have been affected by other global
events, such as incoming invalidations. In general, the maximum performance (i.e.,
coarser scheduling granularity) mode showed extremely good accuracy when compared to
the slower, more accurate mode. As a result we used the maximum performance mode for the
majority of the results presented, running the more accurate mode once for every batch of
simulations for validation purposes.
3.4 Benchmarks
Regardless of the accuracy of the evaluation methodology used, the quality of a
study can only be appreciated if the benchmarks used to drive the simulators are realistic
workloads, representative of an important class of applications. In this study we use a total
of 11 different programs that are representative of Single-Program Multiple-Data (SPMD)
scientific and numerical workloads. These programs are divided into three groups of
benchmarks. The first group was obtained already in the form of traces from Anant
Agarwal's group at MIT: FFT, WEATHER and SIMPLE. FFT is a radix-2 fast Fourier
transform program. SIMPLE solves equations for hydrodynamics behavior using finite
difference methods. WEATHER also uses finite difference methods to model the
atmosphere around the globe. Although these traces were useful in early evaluations and
debugging of our simulators, they were not adequate for our study for three main reasons:
(1) they were 64 processor traces, and the systems of interest for us were in the 8-32
processor range; (2) they assumed a CISC-type instruction set which is not representative
of most modern processors; (3) we did not have access to the source codes, therefore it
was difficult to relate simulation results back to the program structure. Nonetheless, we
present results based on these traces in this thesis to illustrate hypothetical system
behaviors for very large processor configurations.
The majority of the results shown use two groups of benchmarks from the SPLASH
[64] and the SPLASH-2 [76] suites, developed at Stanford University. We used the
SPLASH applications in both trace- and program-driven simulations, and to drive the
analytical models, while the SPLASH-2 programs were only used in program-driven
simulations. All SPLASH and SPLASH-2 programs were used to simulate 8, 16 and 32
processor configurations.
MP3D, WATER, PTHOR and CHOLESKY were applications taken from the
SPLASH suite. MP3D is a rarefied fluid flow simulation program used to study the forces
applied to objects flying in the upper atmosphere at hypersonic speeds, and it is based on
Monte Carlo methods. WATER evaluates the interactions in a system of water molecules
in liquid state and consists of solving a set of motion equations for molecules confined in a
cubic box for a number of time steps. CHOLESKY performs a parallel Cholesky
factorization of a sparse matrix, and it uses supernodal elimination techniques. PTHOR is
a digital circuit simulator that uses a variant of the Chandy-Misra distributed time algorithm
with deadlock resolution.
BARNES, VOLREND, OCEAN and LU were taken from the SPLASH-2 suite.
BARNES simulates the interaction of a system of bodies in three dimensions over a
variable number of time steps, using the Barnes-Hut hierarchical N-body method.
VOLREND renders a three-dimensional image using a ray casting technique. OCEAN
studies large-scale ocean movements based on eddy and boundary currents. LU is an
implementation of a dense LU factorization kernel. It factors a matrix into the product of a
lower triangular and an upper triangular matrix.
Altogether, these applications make up a comprehensive set of inputs to the
architectural simulations and analytical models used in this thesis. Moreover, they are not
simple algorithms, but full scale applications, representative of typical numerical and
scientific parallel programs. Despite the focus on scientific programs in our experiments,
we believe that the results of our research are fundamental in nature, and therefore should
apply also to other application domains.
Chapter 4
PERFORMANCE OF UNIDIRECTIONAL RING
MULTIPROCESSORS
In this chapter we make use of trace-driven simulation and the hybrid modeling
technique described in chapter 3 to evaluate the relative performance of snooping,
centralized directory and distributed directory protocols on a unidirectional slotted ring
with up to 64 nodes. All simulations and models in this chapter share a common set of
assumptions that are listed below:
• Sequential consistency as enforced by strong ordering of references. In other words, the
processor execution blocks at all read and write misses, as well as at all writes to
read-only blocks.
• The processor is a single-issue RISC-type architecture. All instructions take one
processor cycle to complete.
• Single level data cache with a load/store latency of one processor cycle. The cache is
direct-mapped with 16B cache blocks and a 128 KB size.
• Instruction caches are not simulated. A 100% hit ratio is assumed for all instruction
fetch references¹.
• Each processing node inserts three pipeline stages in the slotted ring. The ring
point-to-point links and the ring latches are 32 bits wide, and clocked at 500 MHz.
1. Instruction cache hit ratios are typically very high for scientific programs. By choosing not to simulate them we reduce
the execution time of trace-driven simulations by a factor of 4, on the average.
• Memory latency to fetch a block and deliver the first word to the processor is 140ns.
With a 32-bit wide ring, a probe slot uses two ring stages and a block slot uses six
stages. Therefore a ring frame (composed of two probe slots and a block slot) occupies ten
consecutive ring stages. The minimum size of a ring is given by the product of the number
of nodes (or processors) and the number of pipeline stages per node, which is set to three
in these experiments. The actual ring size is typically larger than that in order to
accommodate an integer number of frames. As a result, an 8 processor ring has 30 stages,
a 16 processor ring has 50 stages, and a 32 processor ring has 100 stages. With a ring clock
cycle of 2 nsec, the ring round-trip latency is then 60 nsec, 100 nsec, and 200 nsec
respectively for 8, 16 and 32 processor systems.
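These ring sizes are simply the minimum stage count rounded up to a whole number of 10-stage frames; a quick check (our own, matching the figures quoted above):

    # Ring size as the smallest whole number of 10-stage frames that holds
    # three pipeline stages per node (checks the 30/50/100-stage figures).
    import math

    FRAME_STAGES = 10       # two 2-stage probe slots + one 6-stage block slot
    STAGES_PER_NODE = 3
    RING_CLOCK_NS = 2       # 500 MHz

    for nodes in (8, 16, 32):
        stages = math.ceil(nodes * STAGES_PER_NODE / FRAME_STAGES) * FRAME_STAGES
        print(nodes, stages, stages * RING_CLOCK_NS)
    # prints: 8 30 60, 16 50 100, 32 100 200 (nodes, stages, round-trip in nsec)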
The programs used and their basic characteristics are given in Table 4.1. The same
input data set sizes are used when varying the number of processors for a given program.
We simulated MP3D for 10 iterations with 8000 molecules. For WATER we used 64
molecules for 2 time steps. CHOLESKY used an input matrix of 1291x1291 elements.
PTHOR was simulated for 1000 ticks.
Table 4.1 Basic Trace Characteristics

benchmark   proc.  data refs  instr. refs  % private  % shared  % shared  % total data  % shared data
                   (x10^6)    (x10^6)      data refs  reads     writes    miss rate     miss rate
MP3D          8      4.10      10.90        28.3       44.6      26.9        7.??          10.01
MP3D         16      4.25      11.52        27.4       46.3      26.0        7.85          10.61
MP3D         32      4.74      13.60        24.7       50.9      23.3       16.89          22.21
WATER         8      5.18       9.72        78.6       19.1       2.15       0.42           1.84
WATER        16      5.31      10.22        76.8       20.8       2.11       0.65           2.50
WATER        32      5.44      10.76        74.6       22.5       2.07       1.39           5.18
CHOLESKY      8      2.36       7.02        36.6       51.0       9.93       8.55          12.01
CHOLESKY     16      3.17       9.92        35.6       53.8       7.54      16.38          23.29
CHOLESKY     32      5.19      17.5         32.1       59.1       4.67      35.73          50.16
PTHOR         8     15.8       51.4         25.9       67.6       5.18       5.17           6.84
PTHOR        16     22.6       74.8         21.7       73.2       3.91       4.65           5.89
PTHOR        32     39.5      131.3         18.6       77.7       2.55       5.32           6.50
FFT          64      4.31       3.12        76.0       12.0      11.9        6.85          26.12
WEATHER      64     15.63      13.64        83.9       13.0       3.09       5.25          30.78
SIMPLE       64     14.02      11.59        70.9       25.9       3.17      15.97          54.16
4.1 Snooping vs. Centralized Directory Protocols
We compare snooping and centralized directory protocols by mainly looking at three
performance metrics: the average processor utilization, the average latency of a cache
miss, and the average utilization of the ring interconnect. The average processor utilization
indicates the fraction of the time in which a processor is not stalled due to a memory
operation (a pending cache coherence protocol transaction or a synchronization
event). It is therefore an indication of the speed of execution of a particular program on the
modeled architecture.
Figure 4.1 shows a breakdown of the shared data misses for directory into local,
remote clean, 1-cycle dirty and 2-cycle misses. Local misses are shared data misses that
can be satisfied within the requesting node, so that no messages have to be sent on the
ring. Clean misses are misses to non-dirty blocks mapping to a remote home, taking only
one ring traversal and involving one probe message and one block message; 1-cycle dirty
misses are misses to dirty blocks that also require only one ring traversal because of the
fortunate relative position of the dirty node with respect to the requester and the home
node, but take longer than clean misses because they require 3 hops instead of 2; 2-cycle
misses are the remaining shared misses, which take two ring traversals.
Figure 4.1. Breakdown of misses to shared data for the directory protocol
[Bar chart: for each benchmark and system size (mp3d, water, cholesky and pthor at 8, 16 and 32 processors; fft, weather and simple at 64), the percentage of shared data misses that are local, remote clean, 1-cycle dirty, and 2-cycle.]
We observe from Figure 4.1 that the fraction of remote clean misses tends to increase
with the system size for each of the SPLASH benchmarks. The increases in the fraction of
remote clean misses seem to follow the behavior of the data miss ratio. In other words,
whenever the miss ratio increases, most of the added misses are to clean or uncached
blocks.
Processor utilization and average ring slot utilization are displayed for systems with
8, 16 and 32 processors for the SPLASH benchmarks (Figures 4.2-4.5), and for systems
with 64 processors for the remaining benchmarks (Figure 4.7). The differences in the
latency of misses between snooping and directory are shown in Figure 4.6, using all the
SPLASH benchmarks. The processor cycle time varies from 1 to 20 nsec. A processor
cycle time of 10 nsec means a peak instruction rate of 100 MIPS.
Figure 4.2. MP3D: processor and ring utilization of snooping and directory
[Two panels: processor utilization (%) and ring utilization (%) versus processor cycle (nsec), for 8, 16 and 32 processor snooping and directory systems.]
The snooping protocol outperforms the directory protocol for all system sizes for
MP3D because the fraction of 1-cycle dirty and 2-cycle misses is significant in all cases.
The performance gap between the two schemes is not as wide for the 32 processor system,
in which the fraction of remote clean misses is much larger.
The ring utilization levels are always higher for snooping, as expected. However, as
shown in Figure 4.6, the difference between the latencies of the two protocols only narrows
for the 32-processor MP3D. Two factors contribute to this: the increase in traffic starts to
affect the latencies of snooping as the processor cycle decreases (the ring utilization of
snooping is over 60% for processor speeds over 100 MIPS) and the larger fraction of
remote clean misses in the directory protocol for the 32-processor case reduces the
average miss latency.
Figure 4.3. WATER: processor and ring utilization of snooping and directory
[Two panels: processor utilization (%) and ring utilization (%) versus processor cycle (nsec), for 8, 16 and 32 processor snooping and directory systems.]
Figure 4.4. CHOLESKY: processor and ring utilization of snooping and directory
[Two panels: processor utilization (%) and ring utilization (%) versus processor cycle (nsec), for 8, 16 and 32 processor snooping and directory systems.]
Figure 4.5. PTHOR: processor and ring utilization of snooping and directory
[Two panels: processor utilization (%) and ring utilization (%) versus processor cycle (nsec), for 8, 16 and 32 processor snooping and directory systems.]
For WATER, the extremely high hit ratio hides most of the differences between the
snooping and directory protocols in terms of processor and ring utilization levels. The
miss latency values however indicate the impact of the longer latency of 1-cycle dirty and
2-cycle misses. For the 8 and 32 processor cases, snooping starts to show a significantly
better performance as the processor cycle decreases. CHOLESKY has a smaller fraction
of 1 and 2-cycle misses for each system size than WATER and MP3D, and the difference
between the latencies of misses for the two protocols is not as wide. For the 32-processor
CHOLESKY, the miss latencies in the snooping systems are affected by contention delays
and the processor utilization of the two schemes becomes roughly the same as the
processor cycle decreases.
In PTHOR, even the 8 processor system has a relatively small fraction of longer
latency misses. However, there is still a notable performance advantage for snooping in all
cases in terms of processor utilization. Again, when the load in the interconnection starts
to increase, the snooping protocol shows the effects of contention delays earlier than the
directory protocol.
Figure 4.6. Average miss latencies for SPLASH applications on snooping and directory
[Four panels (MP3D, WATER, CHOLESKY, PTHOR): average miss latency (nsec) versus processor cycle (nsec), for 8, 16 and 32 processor snooping and directory systems.]
For FFT, SIMPLE and WEATHER, which are 64 processor traces, the processor
utilization values drop considerably as a result of longer latencies. Again, the correlation
between the mix of remote misses and the differences in performance between the two
protocols is noteworthy. Among the three benchmarks, FFT is the only one that shows a
significant number of 2-cycle misses and 1-cycle dirty misses. Consequently the snooping
protocol shows a better average miss latency than the directory protocol for this trace
when ring utilization values are relatively low. However, for SIMPLE, in which there is a
very small fraction of higher latency misses, the difference in average latency figures is
negligible. Once more, as the processor cycle decreases, the latencies of snooping surpass
those of the directory protocol, due to contention delays.
Figure 4.7. FFT, SIMPLE and WEATHER: processor and ring utilization
[Three panels (FFT, WEATHER, SIMPLE): processor utilization and ring utilization (%) versus processor cycle (nsec), for the snooping and directory protocols on 64 processors.]
The general trend in the above evaluation is that whenever the ring utilization levels
fall below 60%, the miss latencies are all but unaffected by contention. When the ring
traffic increases, the contention delays affect snooping earlier than directory and the
latency curves start to converge. Our simulation experiments with a 64-bit parallel slotted
ring (not shown here) seem to agree with this assessment. With 64-bit parallel rings,
utilization levels never surpass 50% and consequently, snooping performs far better than
directory in all cases.
We have also observed that although the ring utilization values for snooping are
always higher than for directory, it is not true that the snooping scheme always generates
more traffic. We have measured the message traffic in our trace-driven simulations as
the summation of all messages generated in one run, weighted by the fraction of the ring
traversed by each message. The block message traffic is roughly the same for both
schemes in all benchmarks. However, the probe traffic for the directory protocol is
sensitive to the mix of remote misses, i.e., it tends to grow with the fraction of 1-cycle
dirty and 2-cycle misses. In Figure 4.8 we show the probe traffic for 16 processor systems
and a block size of 16 bytes. We can see that for MP3D and WATER the probe traffic of
snooping is actually lower than that of directory. This effect cannot be seen in the ring
utilization curves since the average ring utilization is measured over the execution time of
the program, and the execution times for snooping are shorter. In fact, the main cause of
the lower ring utilization values for the directory scheme is not lower traffic, but longer
latencies and consequently longer execution times.
Figure 4.8. Probe traffic for 16 processor systems
[Bar chart: probe traffic for MP3D, WATER, CHOLESKY and PTHOR under the snooping and directory protocols.]
4.2 Distributed Directory Protocols
Due to its much greater complexity, the distributed directory protocol did not allow
the formulation of reasonably accurate models of performance. In this section we present a
summary of our results for distributed directory protocols derived directly from trace-
driven simulations. Figures 4.9-4.12 below show the execution time of both centralized
and distributed directory protocols normalized by the execution time of snooping for the
SPLASH applications. The charts were derived for 500MHz 32-bit wide rings and 200
MHz processors. The distributed directory protocol used is a version of the basic SCI
coherence protocol for a slotted ring. The logical behavior of the protocol is unchanged.
Figure 4.9. MP3D: Normalized execution times
[Bar chart: execution times of the centralized and distributed directory protocols, normalized to snooping, for P=8, P=16 and P=32.]
We observe that, for all the SPLASH benchmarks used in this study, the distributed
directory cache coherence protocol shows consistently worse performance than a
centralized directory protocol implementation. The reason for this behavior is clear from
Table 4.2, which displays the percentage of misses that require two ring cycles (2 cyc.) and
three or more ring cycles (3+ cyc.) to complete. As we can see, the fraction of misses that
require two ring cycles is between 30% and 60% for the distributed directory protocol, as
opposed to between 10% and 30% (see Figure 4.1) for the centralized directory protocol.
For some programs, as is the case with PTHOR, there is even a significant fraction of
remote misses and invalidations that require three or more ring traversals to complete.
Figure 4.10. WATER: Normalized execution times
[Bar chart of execution times normalized to snooping, for P=8, P=16, and P=32.]
Figure 4.11. CHOLESKY: Normalized execution times
[Bar chart of execution times normalized to snooping, for P=8, P=16, and P=32.]
Figure 4.12. PTHOR: Normalized execution times
[Bar chart of execution times normalized to snooping, for P=8, P=16, and P=32.]
WATER is the only benchmark in which the distributed directory protocol exhibits
good performance with respect to the other two alternatives. Again, this is caused by the
relatively small amount of communication that is required in this application.
Table 4.2. Fraction of remote misses that require more than one ring traversal in the
distributed directory protocol (%)

                  P=8              P=16             P=32
Benchmark    2 cyc.  3+ cyc.  2 cyc.  3+ cyc.  2 cyc.  3+ cyc.
MP3D          32.1     0.5     41.3     1.3     48.1     2.2
WATER         42.7     0.4     54.3     0.5     60.7     1.4
CHOLESKY      31.3     0.3     40.3     0.8     45.5     1.7
PTHOR         35.3     3.6     42.7     5.0     48.0     4.4
As the number of processors increases, the performance of the distributed directory
protocol tends to deteriorate relative to the other approaches. There are two main
reasons for this. For a given data set size, the average size of the sharing list tends to
increase for most applications when we increase the number of processors, and that results
in an increase in the fraction of coherence transactions that require multiple ring traversals
to commit, particularly when barrier synchronization is implemented on top of
cached locks. In addition, in larger ring systems, the latency of the interconnect has a
larger relative impact on the execution time, and the system therefore becomes more sensitive
to transactions with multiple ring traversals.
Another factor that contributes to the poor performance of distributed directory
protocols, and that is not evident from the numbers in Table 4.2, is the fact that writes to
cached blocks in which the writer is not the head will always require a cache fill. That is
because a non-head member of the sharing list has to remove itself from the list before it can
become the head. After it removes itself from the list, a node cannot assume that it has a
valid copy of the block, and as a result it has to get a fresh copy from the current head (or
memory). In a snooping or a centralized directory protocol, such a situation will almost
always complete without requiring a cache fill.
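The sequence of events is sketched below (an illustration of the behavior described above,
not the SCI specification itself; the helper routines are hypothetical stand-ins for the
corresponding protocol actions):

    /* Hypothetical stand-ins for the corresponding protocol actions. */
    void detach_from_sharing_list(void);          /* roll out of the list   */
    void fetch_fresh_copy_and_become_head(void);  /* cache fill + insertion */
    void invalidate_other_sharers(void);          /* head purges the list   */

    enum list_pos { LIST_HEAD, LIST_MIDDLE, LIST_TAIL };

    void write_shared_block(enum list_pos pos)
    {
        if (pos != LIST_HEAD) {
            detach_from_sharing_list();
            /* once detached, the local copy may no longer be assumed valid,
               so a fresh copy must come from the current head (or memory) */
            fetch_fresh_copy_and_become_head();
        }
        invalidate_other_sharers();   /* the block becomes Read-Write here */
    }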
An additional side effect of the typically longer cache transactions on the distributed
directory protocol is that it fails to utilize effectively the available bandwidth of the
interconnect. Since the processors in our simulations block on all coherence transactions,
longer latencies effectively decrease the rate at which a processor is able to inject
messages into the interconnect, thereby decreasing the utilization of network resources.
4.3 Effect of Cache Block Size
To investigate the effects of varying the block size we have again used trace-driven
simulations. We show the processor utilization results for snooping only, since the results
for directory are quite similar. In Figure 4.13, the vertical bars indicate the processor
utilization for systems with 8, 16 and 32 processors, with block sizes varying from 16 to
64 bytes. The corresponding miss ratios are shown as solid lines, and their values are
shown on the vertical axis on the right hand side of the chart. The data cache size is fixed
at 128 KB.
The results for execution time are consistent with other studies that used the
programs in the SPLASH benchmark suite. Such programs have been tuned for finer
granularity sharing, so it is no surprise that cache block sizes between 16B and 32B
generally show better performance.
Figure 4.13. Effect of block size
[Bar charts for MP3D and CHOLESKY: processor utilization (bars, left axis) and data
miss rate (lines, right axis) for P=8, 16, and 32, with block sizes B=16, 32, and 64 bytes.]
If we use the product of the cache block size and the data miss rate as a rough
approximation of the traffic, we can say that whenever the miss rate does not drop by a
factor of two when we double the block size, the traffic in the ring will increase. The
processor utilization is primarily influenced by the miss rate, but it is also affected by
the ring utilization as it translates into longer latencies due to contention.
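A minimal sketch of this rule of thumb (values in the comment are illustrative):

    /* Rough approximation from the text: traffic ~ miss rate x block size. */
    double traffic_estimate(double data_miss_rate, int block_size_bytes)
    {
        return data_miss_rate * block_size_bytes;
    }
    /* Example: if doubling the block from 16B to 32B only lowers the data
       miss rate from 8% to 5%, the estimate rises from 0.08*16 = 1.28 to
       0.05*32 = 1.60, i.e., ring traffic grows because the miss rate did
       not drop by a factor of two. */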
A secondary effect of increasing the block size is that it decreases the number of
message slots in the ring, therefore decreasing the parallelism in the interconnect. In
general, whenever the increase in block size causes a significant decrease in the miss ratio
and the ring utilization values are still low, the performance increases. This is the case for
MP3D with P=8 and P=16 as the block size increases from 16 to 32 bytes, and also for
CHOLESKY with P=8 as the block size increases from 16 to 32 bytes. When the larger
block size does not lower the miss ratio enough (CHOLESKY, P=16), or when the traffic
in the system is already high (MP3D and CHOLESKY with P=32), the performance
drops as the block size increases.
Finally, changing the block size affects the performance of the main memory. In our
simulations we interleave the distributed memory in such a way that consecutive cache
block addresses map to different home nodes. This is done to approximate a random
memory allocation with the intent of distributing the load evenly across all nodes. When
the cache block size is doubled, the interleaving effectively becomes coarser grained, which
increases the likelihood of hot spots. Increasing the block size from 16B to 64B typically
increases the variance in memory bank utilization by 5% to 9%. Such increases however
do not have a sizable impact on the final performance metrics for the programs that we
have simulated.
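A minimal sketch of this mapping (names are ours):

    /* Block-grained interleaving: consecutive cache block addresses
       rotate through the home nodes. */
    unsigned home_node(unsigned long addr, unsigned block_bytes, unsigned nodes)
    {
        return (unsigned)((addr / block_bytes) % nodes);
    }
    /* With 16B blocks and 8 nodes, addresses 0, 16, 32, ... map to nodes
       0, 1, 2, ...; with 64B blocks, addresses 0 through 63 all map to
       node 0, which is the coarser interleaving referred to above. */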
Chapter 5
PERFORMANCE OF BIDIRECTIONAL RING
MULTIPROCESSORS
5.1 Bidirectional Rings and Evaluation Assumptions
In Chapter 4 we presented several results from performance analysis and simulations
of snooping, centralized directory and distributed directory protocols. Those results
indicated that both centralized and directory protocol perform worse than snooping for a
range of system parameters and sizes, and for all the benchmarks used. The main reason
for the lower performance of the directory schemes was that they included transactions
that required the ring to be traversed more than once. In most of these cases, multiple ring
traversals were caused by the relative positions of the nodes involved in the transaction
and the ring order.
The snooping protocol on the other hand is oblivious to ordering issues in the ring
interconnect. Snooping only allows a coherence transaction to commit after all nodes in
the system have had a chance to “see” the data being referenced. As a result, the theoretical
minimum latency of any snooping transaction is the latency to communicate with the
node diametrically opposite the requester, which is already achieved by the
unidirectional snooping protocol.
Intuitively, one way to overcome the multiple traversal problem of directory based
schemes is to allow the ring to transmit messages in both directions. A bidirectional ring
can be implemented by superimposing two unidirectional rings, each flowing in a different
direction (see Figure 5.1). Using half-duplex signaling on the ring wires is not an attractive
alternative, since it would introduce a switching problem comparable in complexity with
bus signaling.
Bidirectionality has the potential to reduce protocol transaction latencies for
directory schemes since all messages are point-to-point, and therefore it is possible to take
advantage of topological proximity of nodes on the ring. Bidirectionality does not reduce
the latency of communicating with the most distant node, and therefore it cannot affect
latencies on the snooping protocol.
Figure 5.1. A Bidirectional ring interconnect
[Diagram: two superimposed unidirectional rings flowing in opposite directions,
connecting the processor nodes.]
Bidirectional ring interfaces are more complex than unidirectional ones, since they
have to support multiple input/output queues, and multiplex the reception of messages
from both ring directions into the memory or the caches. In addition, an arbitration
mechanism has to be used to determine which ring to send a message to. Here we assume
a simple mechanism that selects the ring which provides the shortest path to the
destination, and randomly selects one of the rings in case of ties. Although this is a simple
mechanism, it still requires a table lookup logic that has to be implemented at very high
speeds.
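A minimal sketch of this selection mechanism (the rand() tie-break stands in for whatever
high-speed hardware mechanism an implementation would use):

    #include <stdlib.h>

    /* Returns 0 for the clockwise half-ring, 1 for the counter-clockwise
       one; ties are broken randomly. */
    int pick_ring(int src, int dst, int nodes)
    {
        int cw  = (dst - src + nodes) % nodes;  /* clockwise distance         */
        int ccw = nodes - cw;                   /* counter-clockwise distance */
        if (cw < ccw) return 0;
        if (ccw < cw) return 1;
        return rand() & 1;                      /* equidistant: pick randomly */
    }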
To allow for a fair comparison between unidirectional and bidirectional rings, the
width of each ring in the bidirectional case has to be half the width of a corresponding
unidirectional ring. We also ignore whatever impact the higher complexity of the
bidirectional ring interface could have on the ring clock cycle, and consider that both rings
are clocked at the same speed. Each ring in the bidirectional case keeps the same frame
structure of two probe slots for each block slot.
5.2 Simulation of Unidirectional and Bidirectional Rings
In this chapter we use the four SPLASH benchmarks used in Chapter 4, but we also
show results for four of the benchmarks from the SPLASH-2 suite. In addition, we
increase the data set sizes used in Chapter 4 to more realistic values. Optimizations to the
simulator code allowed us to use a larger number of benchmarks and larger data sets while
still keeping reasonable simulation times. The SPLASH benchmarks were re-compiled
using optimization flags that reduced instruction counts and private data accesses. Table
5.1 displays the benchmarks used in this chapter and their main characteristics.
Besides using more realistic problem sizes, the results shown in this chapter use a
more sophisticated model of the multiprocessor nodes. Each node now has two levels of
cache, a first level cache (FLC) with 16KB, and a second level cache (SLC) with 128KB.
There is no write-buffer in this configuration since the strong ordering model that we use
cannot take advantage of it. The cache block size used is 32B, which is the one for which
the SPLASH and SPLASH-2 benchmarks have been tuned [64,76]. Both caches are
direct-mapped. The first-level cache has a 1-cycle hit latency. A miss on the first level
cache that hits on the second level cache takes 4 cycles, using a read-through scheme (i.e.,
the word being touched is forwarded to the processor first, with the rest of the block being
filled in the background). The first level cache uses a write-through, allocate-on-read
policy. Contention for both caches and the local interconnect is modeled, and inclusion
between the two caches is maintained. As with previous experiments, data allocation is a
pseudo-random scheme in which cache blocks with consecutive addresses reside in
different home nodes.
The cache coherence protocols used here are the ones described in Chapter 2, with
the bidirectional systems using exactly the same protocols as their unidirectional
counterparts. The only enhancement is the addition of Read-Exclusive states in both
snooping and centralized directory protocols. A Read-Exclusive state is reached when a
read miss is issued for a block that is determined to be uncached elsewhere in the system.
A local write to a block in the Read-Exclusive state changes it to Read-Write, without
need for communicating with the rest of the nodes. The system treats a Read-Exclusive
node as if it were in the Read-Write state.
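The upgrade path can be sketched as follows (state names follow the text; the transition
function itself is illustrative):

    enum cstate { INVALID, READ_ONLY, READ_EXCLUSIVE, READ_WRITE };

    /* A local write to a Read-Exclusive block upgrades silently; any
       other miss or upgrade needs an ownership transaction on the ring. */
    enum cstate on_local_write(enum cstate s, int *needs_ring_txn)
    {
        *needs_ring_txn = (s != READ_EXCLUSIVE && s != READ_WRITE);
        return READ_WRITE;
    }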
Table 5.1. Basic application characteristics. Reference counts are in millions.

                                      instr.  total   shared  shared  shared  total
                             no.      fetch   data    read    write   miss    miss
Application  dataset         procs    refs.   refs.   refs.   refs.   rate(%) rate(%)
MP3D         20K mols,          8      30.7    10.2     4.7     2.7    6.56    4.88
             10 iterations     16      30.7    10.2     4.7     2.7    6.69    4.99
                               32      30.7    10.2     4.7     2.7    7.20    5.36
WATER        216 mols,          8     100.1    76.9     8.0     1.1    1.27    0.16
             2 steps           16     100.1    76.9     8.0     1.1    1.42    0.19
                               32     100.1    76.9     8.0     1.1    1.62    0.21
CHOLESKY     bcsstk14           8      44.2    22.4    15.3     2.4    1.38    1.14
                               16      60.3    27.7    18.2     2.4    1.42    1.11
                               32      99.7    40.4    25.7     2.59   1.31    0.97
PTHOR        risc,              8      36.0    12.0     6.8     0.8    8.44    5.53
             1K ticks,         16      43.0    15.0     8.6     0.9    8.87    5.80
             10 cyc.           32      73.7    27.7    16.9     1.0    7.05    4.64
BARNES       4K particles       8     585.2   351.3    40.1     1.0    2.42    0.31
                               16     585.7   351.4    40.2     1.0    2.54    0.32
                               32     586.2   351.5    40.3     1.0    3.33    0.41
VOLREND      head scale-        8     405.6   107.3     7.0     0.2    5.08    0.36
             down              16     405.6   107.3     7.1     2.0    5.36    0.39
                               32     406.5   107.5     7.2     2.0    5.88    0.43
OCEAN        130x130            8     146.7    97.2    59.6    13.6    3.03    2.32
             grid              16     152.8    99.3    60.9    13.9    2.97    2.26
                               32     159.7   101.9    62.0    14.5    1.59    1.23
LU           256x256            8      65.5    37.9    23.2    11.1    0.66    0.60
             matrix,           16      67.4    38.0    23.2    11.1    0.81    0.75
             16x16 block       32      70.9    38.1    23.2    11.1    0.60    0.55
Figures 5.2 to 5.5 show normalized average execution times for snooping (Sring),
centralized directory (Dring), distributed directory (Sci), bidirectional centralized
directory (BiDring) and bidirectional distributed directory (BiSci). The execution time of
each group of stacked bars is normalized to the execution time of snooping, and broken
down into the contributions of processor, read, write, replacement, lock acquire and lock
release latencies. Acquire operations use a test&test&set primitive, with different locks
mapping into different cache blocks. Each figure shows results for 8, 16 and 32
processors. Figures 5.2 and 5.3 assume scalar 200MHz processors, while the results in
Figures 5.4 and 5.5 are for 500MHz processors. As in Chapter 4, the slotted rings are
500MHz, 32-bit wide, with the bidirectional rings being 16-bit wide each.
The lock acquire time reflects the time for a processor to obtain a semaphore for
mutual exclusion, but it generally also accounts for the time waiting on a barrier release.
Some of the barrier time is charged as busy time, but that is not significant for these
applications. CHOLESKY and PTHOR use task queue synchronization, and exhibit
dynamic behavior so that changes in architectural parameters of the simulator may change
the execution path, sometimes significantly. This dynamic behavior is more significant
when there is high contention for task queue locks, i.e., for larger processor configurations
and for faster processors.
5.3 Discussion
The results from Figures 5.2-5.5 are somewhat surprising in that bidirectionality
rarely helps the performance of centralized directory protocols. In fact, in a significant
number of cases the bidirectional ring actually performs worse than the unidirectional
ring. For the distributed directory protocol, bidirectionality appears to show improvements
across most of the applications, but even in this case, the improvements are not very
significant. In all cases, the snooping protocol still outperforms all other directory
strategies, centralized or distributed.
The biggest potential gains for bidirectional rings happen when the requester and
home nodes are immediate neighbors or separated by very few intermediate nodes.
Bidirectionality should also help reduce the latencies of three-hop transactions,
particularly the ones that would otherwise involve multiple unidirectional ring traversals.
There is a variety of factors that contribute to offsetting the potential gains of
bidirectionality. The first one is that each half-ring in the bidirectional case has half the
bandwidth and twice the latency of a single unidirectional ring. This is a fundamental
assumption since we need to compare the two strategies under similar hardware
requirements.
Figure 5.2. Execution time for SPLASH applications; 200MHz processors.
[Stacked-bar charts: execution time of Sring, Dring, Sci, BiDring, and BiSci, normalized
to snooping, for MP3D, WATER, CHOLESKY, and PTHOR at P=8, 16, and 32; bars are
broken into busy, read, write, invalidate, write-back, acquire, and release components.
500MHz 32-bit rings, 32B blocks, 16KB FLC, 128KB SLC, 200MHz processors.]
Figure 5.3. Execution time for SPLASH-2 applications; 200MHz processors.
[Same format as Figure 5.2, for BARNES, VOLREND, OCEAN, and LU at P=8, 16, and 32.]
Figure 5.4. Execution time for SPLASH applications; 500MHz processors
[Same format as Figure 5.2, for the SPLASH applications with 500MHz processors.]
Figure 5.5. Execution time for SPLASH-2 applications; 500MHz processors
[Same format as Figure 5.3, for the SPLASH-2 applications with 500MHz processors.]
The second factor is that the frames in the bidirectional ring are longer, as a result of
the narrower data path, and therefore the average waiting time to find the beginning of a
particular slot doubles. With a narrower ring, the number of ring latches that have to be
introduced by a ring interface increases, since it is necessary to latch at least an entire
probe message in the node in order to make routing decisions. If all probes fit into 64 bits,
the minimum number of stages on a 32-bit ring is four (including an input and an output
stage), as opposed to six for a 16-bit half-ring in the bidirectional case. Finally, having two
half-rings introduces the possibility of imbalance in the utilization of the communication
resources since one half-ring may receive a larger share of the load in a given phase of the
computation. It is possible that a data distribution scheme that minimizes the ring
distances between a process and the data it accesses the most could improve significantly
the performance of the bidirectional rings. However, such strategies are frequently not
feasible for shared memory programs with dynamic data behavior.
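This arithmetic can be written out as follows (assuming the probe width is a multiple of
the ring width):

    /* A node must latch an entire probe before it can route it, plus one
       input and one output stage. */
    int ring_interface_stages(int probe_bits, int ring_width_bits)
    {
        return probe_bits / ring_width_bits + 2;
    }
    /* ring_interface_stages(64, 32) == 4 and ring_interface_stages(64, 16)
       == 6, matching the stage counts quoted above. */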
It is also noticeable how the lock acquire time becomes dominant for many
applications as we increase the number of processors in the system. Two factors contribute
to this. Since we are not scaling up the data set sizes when we increase the system size,
locking and barriers become relatively more frequent and the contention for locks also
increases. In addition, our test&test&set implementation of locks interacts very
inefficiently with the write-invalidate protocols used here. In a later chapter we will
examine this problem more closely.
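For reference, a minimal test&test&set acquire of the kind assumed here is sketched
below (written with C11 atomics for illustration). The inner loop spins on the locally
cached copy, while every failed exchange after a release triggers an ownership
(invalidation) transaction under write-invalidate, which is the source of the inefficiency
noted above:

    #include <stdatomic.h>

    void acquire(atomic_int *lock)
    {
        for (;;) {
            while (atomic_load(lock))       /* test: spin on the cached copy  */
                ;
            if (!atomic_exchange(lock, 1))  /* test&set: try to grab the lock */
                return;                     /* exchange returned 0: we own it */
        }
    }

    void release(atomic_int *lock)
    {
        atomic_store(lock, 0);  /* the store invalidates every spinning cache */
    }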
Figure 5.6 shows the minimum message latency (i.e., excluding memory/cache
delays) in ring clock cycles for a read miss request in which the home node owns the
block, therefore the coherence transaction involves only a request-response pair between
the requester and the home node. The Figure assumes 32-bit unidirectional rings and 16-
bit bidirectional half-rings, and no contention for the interconnect. It does take into
account the average number of ring clock cycles spent waiting for the beginning of a slot,
which is assumed to be uniformly distributed between zero and the interval of time
between two consecutive slots of the same type.
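The slot-wait term of this model can be written out as follows (a sketch under the stated
uniform-distribution assumption):

    /* Expected wait for the start of the next slot of a given type, when
       the wait is Uniform(0, gap) and gap is the interval between two
       consecutive slots of that type, in ring cycles. */
    double avg_slot_wait(double gap_cycles)
    {
        return gap_cycles / 2.0;
    }
    /* Halving the ring width doubles every frame, hence doubles gap_cycles
       and this average wait -- one of the built-in costs of the 16-bit
       half-rings. */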
Figure 5.6. Minimum latency comparison of unidirectional and bidirectional rings.
[Line plot: minimum message latency (ring clock cycles) vs. distance between requester
and home nodes, for 8-, 16-, and 32-processor unidirectional rings and for the
bidirectional ring.]
As we can see, the latency of bidirectional ring transactions will be smaller only if
the communicating nodes are relatively close to each other on the ring. For an 8-processor
system, the minimum latency figures of the bidirectional ring are smaller than those of the
unidirectional ring only if a node is communicating with its immediate neighbors
(distance = one). For a 16-processor system, the minimum bidirectional ring latency is
smaller for 8-out-of-15 remote nodes, and 18-out-of-31 for a 32-processor system. Figure
5.6 suggests that bidirectional rings would have a tendency to do better on average
latencies as the ring size increases.
Figure 5.7. Average time to send a probe for unidirectional and bidirectional rings
[Bar charts: average time to send a probe for unidirectional and bidirectional rings, for
the SPLASH and SPLASH-2 applications at P=8, 16, and 32. 500MHz 32-bit rings, 32B
blocks, 16KB FLC, 128KB SLC, 200MHz processors.]
Figure 5.7 displays the average time to send a probe for unidirectional and
bidirectional rings. That is the time between when a node is ready to send a probe message
and the time that a corresponding free slot arrives. It does not include the time to actually
insert the entire probe in the ring pipe. This metric is a function of the communication load
(e.g., ring slot utilizations), and of the size of the frames. It is noteworthy that the time to
send a probe in the bidirectional ring is always larger than on the unidirectional ring, for a
given protocol, although we have observed negligible differences in average slot
utilization between the unidirectional and bidirectional rings. This is therefore a direct
effect of the larger frame sizes in the narrower half-rings on the bidirectional
interconnects.
Figure 5.8. Average miss latency for unidirectional and bidirectional rings
[Bar charts: average miss latency for unidirectional and bidirectional directory protocols,
for the SPLASH and SPLASH-2 applications at P=8, 16, and 32. 500MHz 32-bit rings,
32B blocks, 16KB FLC, 128KB SLC, 200MHz processors.]
The actual average miss latency values from the execution-driven simulations of
unidirectional and bidirectional directory protocols are shown in Figure 5.8. These
measurements include all types of misses for both centralized and distributed directory
protocols, but do not include invalidation (write-on-clean) messages.
The average miss latencies on Figure 5.8 confirm our expectations that
bidirectionality helps the larger (32 processor) systems better than it helps the smaller (8
and 16 processor) systems. Particularly the distributed directory protocol (Sci) seems to
benefit the most from bidirectionality, due to its frequent use of multi-hop transactions in
which a bidirectional ring has a tendency to help by avoiding multiple ring traversals.
5.4 Summary
In this chapter we explored the potential advantages of bidirectionality for the
centralized and distributed directory protocols. The motivation was the fact that a
bidirectional ring could reduce the latency of multiple ring traversal transactions that were
found to be significantly frequent on directory protocols.
We found that, for a pseudo-random data allocation policy, bidirectionality
seldom improves the overall performance of both centralized and distributed directory
protocols. Only the 32 processor distributed directory configuration seems to benefit
somewhat consistently from bidirectionality.
Since our simulation experiments assume the same bisection bandwidth for
bidirectional and unidirectional systems, bidirectionality of communication implies lower
bandwidth per channel, longer latencies for the same number of hops, and longer average
waiting times for a free message slot. These factors end up offsetting the potential gains of
bidirectional communication in most cases where it could potentially be helpful.
In this chapter we have also shown for the first time our experiments using program-
driven simulation and a much more detailed model of the processing nodes. We have also
introduced four programs from the SPLASH-2 benchmark suite into our application suite.
Overall, the use of bidirectionality does not change the performance landscape from the
experiments in Chapter 4. Snooping (unidirectional) continues to show the overall best
performance for both faster (500MHz) and slower (200MHz) processors. Centralized
directory protocols still perform better overall than distributed directory protocols.
After having looked carefully into the performance of distributed directory protocols
we have determined that they are not competitive with either snooping or centralized
directory schemes. Consequently for the remainder of this thesis we will only consider
centralized directory protocols when evaluating the potential improvements to directory
protocols on slotted ring interconnects.
Chapter 6
PERFORMANCE OF NUMA BUS
MULTIPROCESSORS
6.1 A High-Performance NUMA Bus Architecture
Bus-based multiprocessor architectures have dominated the shared memory
multiprocessor market. However, all of the bus-based systems to date have been UMA
machines, with all the system memory connected directly to the system bus. In other
words, the processor elements always have to arbitrate for the bus in order to access any of
the memory banks, which are therefore equidistant to all processors in the system. The
reasons for the longevity of the UMA model in bus based systems are its simplicity of
implementation and upgradability. UMA buses are simpler to implement than NUMA
buses because they do not require any logic on the processor element to differentiate
between local and remote accesses. Ease of upgradability comes from the fact that a
customer can make decisions with respect to computing power and memory capacity
independently.¹
In order to fully understand the limitations of the bus interconnection with respect to
other options for small-scale multiprocessors, we propose a more aggressive NUMA bus
design and use it in our performance evaluations. A NUMA bus is built with processor-
memory elements, in such a way that the system memory is partitioned into banks
associated with each processor element in the system, similarly to the ring architectures
proposed and analyzed in chapters 3 and 4. Introducing the NUMA model in buses is an
important enhancement given the limited bandwidth of this class of interconnect. With a
1. When all bus slots are occupied, a user may have to trade CPU modules for memory modules.
NUMA model, local memory operations such as instruction fetches, accesses to private
variables and accesses to shared data that is placed in the local memory bank could be
completed without using any bus bandwidth at all. However, it is necessary to modify the
baseline snooping mechanism in order to take advantage of this locality.
6.2 A NUMA Bus Snooping Protocol
The basic snooping mechanism as it is used in UMA bus multiprocessors is based
upon the principle that all memory accesses in the system are visible to all caches and
memory modules. Enforcing this principle makes it impossible to exploit the locality
of NUMA architectures, since it means broadcasting on the bus even those
accesses that could be “safely” satisfied by a local memory module. For instance, if a
processor issues a miss for an address that resides in the local memory bank, it is not safe
to satisfy this miss locally since there is no information on whether the block is currently
owned by another cache in the system, in which case the memory copy is stale.
Consequently, even though the access is local, it is necessary to arbitrate for the shared bus
and issue the miss on the bus in order to allow a possible dirty node in the system to
intervene and provide the most recent copy of a block.
We propose to enhance the basic bus snooping protocol so that it can take advantage
of local memory references by adding minimal state information to the memory banks.
This strategy is similar to the one used in the snooping ring protocol and consists of
adding a dirty bit per block frame in main memory. As in the snooping ring protocol, a set
dirty bit indicates that some cache in the system currently owns the cache block and that it
may have it in modified state. With the addition of the dirty bit, all read misses that map to
the local memory and find the bit reset can proceed without broadcasting the read miss
request on the bus. However a local write miss or invalidate would still require a bus
broadcast, since in order to acquire ownership it has to invalidate all other caches. A
second bit can be added to indicate whether any remote cache (e.g. remote with respect to
the memory bank in question) may have a copy of the block. This remote bit is set
whenever a remote node misses for the block, and it is only reset on a write back
(replacement of a dirty copy) or when the local cache obtains ownership of the block (by
issuing a write miss or an invalidate). With the remote bit it is possible to avoid
unnecessary broadcasts of write misses and invalidations from the local cache, since a
reset remote bit guarantees that there is no remote cache to invalidate.¹
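The resulting filter on local misses can be sketched as follows (structure and names are
illustrative, not taken from an actual implementation):

    struct mem_state { int dirty; int remote; };  /* per block frame in memory */

    /* A local read miss may be satisfied silently only if no cache in the
       system can own the block. */
    int read_miss_needs_bus(const struct mem_state *m, int is_local_bank)
    {
        return !(is_local_bank && !m->dirty);
    }

    /* A local write miss or invalidate must still broadcast unless the
       remote bit guarantees that no remote cache holds a copy. */
    int write_needs_bus(const struct mem_state *m, int is_local_bank)
    {
        return !(is_local_bank && !m->remote);
    }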
In our simulations we found that using NUMA buses with a dirty bit increased
performance of our set of benchmarks between 20% and 45% with respect to a UMA bus
with interleaved memory banks. On the other hand, the addition of the remote bit had a
negligible impact on overall performance (less than 3% in all cases), and no impact
whatsoever when relaxed consistency models were used. As a result, we chose not to
incorporate it in our NUMA snooping protocol.²
6.3 Packet- vs. Circuit-Switched Buses
At the time that we started examining the performance of bus based systems (1992),
most commercial multiprocessors used circuit-switched buses, in which the bus is held by
the requesting node until the responder (memory or cache) replies with the data. An
alternative is to split the bus transactions in separate request and response sub-transactions
so that intervening accesses can proceed while a responder is fetching the data. This
alternative bus scheme is called packet-switching or split-transaction. A circuit-switched
bus simplifies the design of memory banks, since they are only required to act as bus
slaves and do not need to arbitrate for the bus or keep any state for outstanding
transactions. The main disadvantage of circuit-switched buses is that they reduce the
effective utilization of the bus, particularly when the start-up time to fetch a memory block
is large with respect to the bus clock cycle.
Packet-switched buses, albeit more complex than circuit-switched ones, have been the
architecture of choice of most bus-based multiprocessors introduced since 1994. That is a
result of the need to optimize the use of already bandwidth limited buses in the presence of
high start-up times to fetch data from memory banks. Using packet-switched buses
increases the complexity and therefore the delay due to bus arbitration logic, but since
1. The remote bit may be set while there are no remote cached copies, since the replacement of a
read-only block does not notify the home. The remote bit is reset on a write-back or when the local
cache acquires ownership of the block by writing to it.
2. The snooping ring protocol evaluated here also does not take advantage of a remote bit.
most modern buses are able to overlap bus arbitration with data movement, this effect can
be at least partially hidden.
In our simulations we use a packet-switched bus with overlapped arbitration and
separate address lines so that a probe (request) can proceed in parallel with another block
(reply) access. We also assume that arbitration optimizations such as bus parking and idle
bus arbitration are used.
6.4 Performance Evaluation of a Packet-Switched NUMA Bus
We now evaluate the performance of a packet-switched NUMA bus-based
multiprocessor and compare it with that of the snooping slotted ring. With the addition of
a dirty bit in the NUMA bus architecture, the resulting snooping bus protocol is logically
identical to the snooping ring protocol that was described in Chapter 4.
We attempt to compare the bus and ring systems taking into consideration the most
relevant technological parameters. Today's fastest backplane buses are clocked between
75MHz and 90MHz; therefore we show results for 50MHz and 100MHz buses.
It is not straightforward to find a reference value for ring clock speed in current
systems since virtually all existing ring-based interconnects use flat cables or optical fiber
ribbon cables, as opposed to the more tightly-coupled backplane model assumed here.
Cable-based point-to-point links are currently clocked between 500MHz and 1.25GHz.
We will take the conservative approach of using 500MHz as the ring clock cycle for all our
remaining evaluation experiments.
When comparing backplane buses and rings, a reasonable assumption would be to
use the same data width for both systems. However, the driver and receiver circuits in the
bus are typically integrated in such a way that they share the same set of backplane pins,
while in the ring they have to use different sets of pins. Since pin count is a very important
constraint in backplane interconnects and systems packaging as a whole, we compare a
64-bit wide bus with a 32-bit wide slotted ring. Again, this is a conservative assumption
since (a) pin count is not the only constraint in backplane packaging and (b) a 64-bit bus
has to have in fact about twice as many lines since it requires a separate address bus (we
use a 32-bit address bus), several arbitration lines, command code lines, and other open-
drain wired-and lines to implement the snooping protocol (e.g., shared signal, intervene
signal, locked signal, etc.).
Figures 6.1-6.4 use the hybrid analytical methodology described in Chapter 3 to
compare a 500MHz snooping slotted ring (32-bit wide) with a 64-bit packet-switched
NUMA snooping bus at 50MHz and 100MHz clock cycles. These results use the same
processing element assumptions as in Chapter 4: scalar processors with strong ordering
and one level of 128KB direct-mapped cache with a 16B cache block. Performance results
are presented in the form of percentage processor utilization (Figures 6.1-6.3) and
percentage bus utilization (Figure 6.4), and are plotted against the processor cycle time in
nanoseconds.
Figure 6.1. 32-bit slotted ring vs. 64-bit split-transaction NUMA bus (P=8)
[Plots: processor utilization vs. processor cycle time (nsec.) for MP3D, WATER,
CHOLESKY, and PTHOR at P=8; curves for the 500 MHz 32-bit ring, the 100 MHz
64-bit bus, and the 50 MHz 64-bit bus.]
The bus clock cycle remains constant across system sizes, which is somewhat
optimistic because of the electrical characteristics of buses mentioned previously. As a
result, the pure latency to satisfy a remote miss is fixed for the bus case (assuming no
contention), while it increases linearly with the number of nodes for the ring case. Using a
16-byte cache block, the minimum number of bus cycles to satisfy a remote miss is 8,
excluding arbitration delays and the time to fetch the block in a remote node’s memory or
cache.
Figure 6.2. 32-bit slotted ring vs. 64-bit split-transaction NUMA bus (P=16)
[Same format as Figure 6.1, for P=16.]
The limited bandwidth of the bus makes the actual miss latency values quite
sensitive to variations in the processor speed, whereas the latency values for the ring
remain nearly constant. Note that processor speed is only one of the factors affecting the
load in the interconnect. The average miss ratio for shared data and the fraction of shared
data references are also indicators of how loaded the interconnect is, for a given system
size. Figure 6.4 displays the average bus utilization levels for the four SPLASH
benchmarks.
Figure 6.3. 32-bit slotted ring vs. 64-bit split-transaction NUMA bus (P=32)
[Same format as Figure 6.1, for P=32.]
MP3D has a relatively high miss ratio for shared data and also has a significant
fraction of shared data accesses. In the 8 processor MP3D the performance of the 100
MHz bus is comparable to that of the 500 MHz ring for slower processors (< 50 MIPS), but it
falls behind for increasingly faster processors due to bus conflicts. For the 16 processor
MP3D, the performance gap (in processor utilization) between ring and bus configurations
increases as the buses enter saturation. In the ring configurations the network utilization is
still under 50% even for 500 MIPS processors. In the 32 processor MP3D both buses are
completely saturated, whereas the ring utilization stays under 80%. The behavior of
CHOLESKY is very similar to MP3D.
The evaluations using WATER show a different behavior. In this case the miss rate
values are extremely low, as is the fraction of references to shared data. The load on the
interconnect is much lower than in MP3D. For P=8 and P=16, the bus starts to saturate for
processor speeds higher than 200 MIPS. Even for 32 processors, the bus systems still
show a very good performance level with 100 MIPS processors. For the 16 and 32
processor configurations, the pure latency of the 100 MHz bus is smaller than that of the
500 MHz ring. Therefore, for slower processors the bus configurations could outperform
the slotted rings in the case of WATER, even if only by a narrow margin. However, in all
cases, the slotted ring is less affected by contention delays which is a result of its higher
bandwidth. Eventually, as the buses reach saturation, the ring configurations have far
better performance.
In the case of PTHOR, the 100 MHz bus shows approximately the same processor
utilization figures as the 500 MHz ring for systems with processing elements slower than
50 MIPS, and P < 16. As with the other programs, as the processor cycle decreases the
slotted ring outperforms the 100 MHz bus by up to a factor of three. For P=32, the
performance gap between the slotted ring and the split-transaction bus increases even
further, as the slotted ring is able to maintain reasonable processor utilization levels, but
the buses enter saturation.
Figure 6.4. Bus utilization values; 64-bit split-transaction buses, 100 MHz and 50 MHz
[Plots: bus utilization vs. processor cycle time (nsec.) for MP3D, WATER, CHOLESKY,
and PTHOR; curves for P=8, 16, and 32 at 100 MHz and 50 MHz.]
The evaluation results shown here also indicate that the slotted ring could benefit
from latency tolerance techniques, such as lockup-free caches, weak ordering schemes and
prefetching, because the large latencies observed for the slotted ring are, in most cases, not
caused by heavy contention but by pure delays. In other words, there is latency to be
tolerated despite the fact that the network is often underutilized. Since most latency
tolerance techniques have the collateral effect of increasing the load on the interconnect
because of the overlap of communication and computation, they can be self-defeating in
an interconnect working close to saturation levels.
6.5 Potential of Software Prefetching
Software prefetching [55] allows overlapping of miss resolution and computation by
issuing prefetch instructions far enough ahead in the code that there is a good chance that
the instruction that uses the value will find it in the cache. It has therefore the potential to
eliminate virtually all miss latencies from parallel programs.
In practice prefetching is hampered by several implementation issues. First there is
an overhead to calculate the prefetch address and issue the prefetch instruction that is
added to the program execution time. Indiscriminate aggressive prefetching may cause the
displacement of a processor's working set from its cache to make room for the prefetched
data, effectively increasing the miss ratio. In a multiprocessor, invalidation traffic may kill
prefetched cache lines before they are touched, rendering the prefetch useless. Finally,
prefetching increases interconnect load, which in turn increases the average remote miss
latencies because of contention for memory and interconnection resources.
Here we study the potential benefits of prefetching using a technique that is an
enhancement of the one used by Tullsen and Eggers [72] in their analysis of prefetching
performance in bus-based multiprocessors. This technique mimics the behavior of a near-
optimal compiler-directed prefetching algorithm by feeding a memory trace to a program
that simulates the caches and the coherence protocol and generates an augmented trace
with prefetch references inserted P instructions before a miss to the location is due (P is
called the prefetch distance). This oracle program can insert prefetches for shared data
read and write misses, with exclusive prefetches (e.g., prefetch a block in Read-Write
mode) only being inserted when the block in question is not touched by any other
processor in the system within a time window that contains the prefetch distance interval.
An exclusive prefetch window that contains accesses by other processes is likely to be
useless since ownership of the block will be stolen away before the actual write operation
is reached.
Similarly, shared prefetches (e.g., prefetch a block in Read-Only mode) are only
inserted when no writes to the block by other processors occur within the same time
window, since there is a high probability that the prefetch will be killed by a subsequent
invalidation before the data is consumed. In our experiments we set the time window to be
10% wider than the prefetch distance to account for some variation in the interleaving of
accesses seen by the oracle. The processing element configuration and cache coherence
protocols used are the same ones described earlier in this chapter.
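The insertion test applied by the oracle for a single candidate miss can be sketched as
follows (all names are ours; only the window and distance values follow the text):

    enum pf_kind { PF_SHARED, PF_EXCLUSIVE };

    #define PF_DISTANCE 200                               /* instructions     */
    #define PF_WINDOW   (PF_DISTANCE + PF_DISTANCE / 10)  /* 10% wider window */

    /* The flags summarize what the oracle sees other processors doing to
       the block within the PF_WINDOW instructions around the insertion. */
    int oracle_may_insert(enum pf_kind kind,
                          int other_reads_in_window,
                          int other_writes_in_window)
    {
        if (kind == PF_EXCLUSIVE)   /* any foreign touch steals ownership */
            return !other_reads_in_window && !other_writes_in_window;
        return !other_writes_in_window;  /* shared: only foreign writes kill it */
    }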
Figure 6.5. Prefetching performance: MP3D; 500MHz ring vs. 100MHz bus
[Plots: processor utilization vs. processor cycle time (ns) for P=8, 16, and 32; curves for
the snooping ring and the bus, each with and without prefetching.]
Figure 6.6. Prefetching performance: WATER; 500MHz ring vs. 100MHz bus
[Same format as Figure 6.5.]
Figure 6.7. Prefetching performance: CHOLESKY; 500MHz ring vs. 100MHz bus
[Same format as Figure 6.5.]
Figure 6.8. Prefetching performance: PTHOR; 500MHz ring vs. 100MHz bus
[Same format as Figure 6.5.]
We set the prefetch distance to 200 instructions for shared data and 20 instructions
for private data. The overhead of issuing prefetches is two instruction cycles, one to
compute the prefetch address and one to issue the prefetch itself.¹ Prefetches do not
victimize blocks in the cache until data is returned. Figures 6.5-6.8 show trace-driven
simulation results for snooping rings and buses with and without prefetching.
The effectiveness of the prefetch oracle in covering shared data misses for the
various applications is shown in Table 6.1. The coverage of private data misses was nearly
100% for all applications.
Table 6.1. Percentage of covered shared data misses

Program      P=8   P=16   P=32
MP3D          81     75     64
WATER         78     76     72
CHOLESKY      85     77     71
PTHOR         91     82     75
The prefetch oracle coverage factor decreases when we increase the number of
processors in the system. That is because, with the same input data set sizes, there is an
increase in the relative significance of read/write sharing and consequently there is a larger
fraction of misses that happen too near accesses by other processors to the same block.
The actual miss rate seen by the processor in the prefetching simulations is quite close to
the coverage factor times the original program miss ratio, which indicates good statistical
correlation between the interleaving of accesses seen by the oracle and by the final
simulation. There are also some extra misses in the prefetching simulation that are due to
the prefetch displacing a cache block that is touched by the processor before the
prefetched data is actually used. Fortunately in our case this scenario is not a frequent one.
The simulation results confirm those of Tullsen and Eggers in that they show that a
bus-based multiprocessor can only take limited advantage of software prefetching due to
shortage of interconnect bandwidth. In all the SPLASH applications, although most
misses were being covered by prefetching, the bus system with prefetching saw gains of
under 5% in processor utilization. Furthermore, as the processor speed increased the gains
1. In practice the overhead of adding prefetches may be higher, depending on how complex it is to
compute the address.
of prefetching for the bus system decreased, as opposed to the ring system.
As the processor speed increases, the cost of issuing prefetches (two processor
cycles) becomes less significant, and the miss latencies tend to increase. A prefetch
distance of 200 instructions translates into at least 200 processor cycles. If there are any
misses between the issuing of the prefetch and the use, the effective distance seen at
execution time increases accordingly. For each ring system size, there is a value for the
processor cycle in which the remote miss latency surpasses the prefetching distance set by
the oracle. For 8- and 16-processor systems, that value is less than 2 nanoseconds. For 32-
processor systems it falls near 4 nanoseconds. Therefore, in all applications the ring
system benefits increasingly from prefetching as the processor gets faster, with the
exception of the 32-processor applications in which prefetching benefits cease to increase
when the processor cycle time drops below 4 nanoseconds.
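The crossover can be sketched as follows (the miss latency is a measured input; the
example value in the comment is inferred from the figures quoted above, not measured):

    /* With scalar processors, a 200-instruction prefetch distance covers
       roughly 200 processor cycles; prefetching stops hiding the whole
       miss once the (fixed, in nanoseconds) remote miss latency exceeds
       that many cycles. */
    double crossover_cycle_ns(double miss_latency_ns, int prefetch_distance)
    {
        return miss_latency_ns / prefetch_distance;
    }
    /* For example, a remote miss latency around 800 ns would put the
       crossover near 800/200 = 4 ns, consistent with the 32-processor
       figure quoted above. */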
6.6 Summary
In this chapter we presented an aggressive design for a CC-NUMA bus
multiprocessor, and compared its performance with that of a snooping slotted ring. The
hybrid analytical methodology described in Chapter 3 was used in the performance
evaluation experiments.
Bus systems were shown to be competitive with rings only up to 8-processors, or for
applications with negligible miss and invalidation traffic. The limited bandwidth of the bus
is exposed by applications such as MP3D and CHOLESKY which impose a heavy load on
the memory system.
Also in this chapter we have evaluated the potential benefits of software prefetching
in bus and ring systems using an off-line oracle algorithm to process the traces and insert
prefetches approximately 200 instructions above the use of a reference that is likely to
miss in the cache. The oracle technique can be seen as a best-case scenario for the
potential of compiler prefetch algorithms. Ring systems benefit substantially from
prefetching, while bus systems show only minor improvements. Other latency tolerance
techniques are evaluated later in this thesis.
Chapter 7
PERFORMANCE OF CROSSBAR
MULTIPROCESSORS
7.1 A NUMA Crossbar-based Multiprocessor Architecture
Crossbars have been considered as an interconnection for multiprocessors since the
early days of parallel computing. The C.mmp [77] experimental machine at Carnegie-
Mellon was one of the first systems to utilize them. The main advantage of a crossbar
interconnect is that it removes all conflict from the network subsystem. In other words,
traffic will only suffer from contention when the endpoints of the communication overlap.
Crossbars have never been widely used however, since their high connectivity comes with
a high complexity cost and a low scalability. Recently, designers have been forced to
revisit crossbar interconnects as a result of the increasing speed gap between processor and
bus cycle times. The Convex SPP [66] and the Sun Universal Port Architecture (UPA) [68]
are modern examples of the use of crossbar interconnects in shared-memory
multiprocessor systems.
Early crossbar designs for multiprocessors used an asymmetric topology in which
there was no direct path between processor elements, but only between processor and
memory modules. This scheme works for non-cache coherent UMA shared-memory
systems in which all communication is done through memory and only processor elements
can initiate the communication. In cache-coherent high-performance systems it is
necessary for processor elements to communicate directly and perform cache-to-cache
transfers in order to reduce miss latencies. In particular, for NUMA systems in which
processor and memory are packaged as a single node, all-to-all connectivity is required.
Figure 7.1 depicts a diagram for a symmetric crossbar which is similar in architecture to
the Convex SPP hypernode crossbar switch.
Figure 7.1. Diagram of a Symmetric Crossbar for a NUMA system
[Diagram: four processor-memory nodes (PM0-PM3), each connected to the crossbar by
a unidirectional input port and a unidirectional output port.]
In the diagram above, each node in the system is connected to the crossbar by
unidirectional input and output ports. It is possible to simplify the packaging by
multiplexing input and output ports into the same physical wires. This simplification
comes with no performance penalty if the hardware in each processor-memory node is
incapable of sending and receiving data at the same time. Arbitration in this crossbar
switch architecture is done on a per-output port basis.
The scalability of crossbar switches is quite poor, since the number of connections
required scales with the square of the number of nodes in the system. In general there is an
engineering trade-off between the number of ports that can be accommodated in a crossbar
switch and the width of each port. As a result, it is not feasible to build large crossbar
switches with wide ports. Since wide data ports are necessary to fulfill the bandwidth
requirements of microprocessors, most crossbar implementations today only scale up to 4-
8 ports. In order to build larger systems it is necessary to cascade several crossbar switches
in multi-stage configurations. Although a multi-stage network can maintain the peak
bandwidth of its crossbar building blocks, it introduces internal network conflicts that are
not present in crossbars, which reduce the effective network bandwidth, particularly in the
presence of unbalanced traffic.
In this thesis we study crossbar-based systems with up to 32 processors. Although it
is clear that crossbar networks with more than 8 processors are likely to be multi-staged,
we optimistically assume that even a 16-port crossbar is built in a single stage. A 32-
processor crossbar is built as a two-stage configuration, using eight 8x8 switches.
In our experiments, each processor-memory node has separate input and output ports
into and out of the crossbar switch, in order to increase communication concurrency.
Consequently, the width of each crossbar port is half the width of a corresponding bus,
since we use the number of interconnect pins per port as the main interconnect
packaging constraint. We use a crossbar clock cycle value that falls in between that of a
bus and that of a ring interconnect of the same technology. A crossbar switch is clocked slower
than a ring link since its routing is more complex and it has a higher fan-in/fan-out than a
ring interface logic. It can however be faster than a bus since it uses unidirectional ports.
We set the crossbar clock cycle to 200MHz, as compared to 100MHz for buses and
500MHz for rings. The bisection bandwidth of a crossbar system is therefore much
greater than that of a bus or a ring system, and it increases linearly with the system size, as
opposed to these other interconnects.
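As a back-of-the-envelope illustration (our arithmetic, assuming the 64-bit bus and 32-bit ring/crossbar data paths above, one word transferred per port per cycle, and a bisection cut that crosses two ring links), the peak bisection bandwidths work out roughly as follows:

    bus:      8 bytes x 100MHz             = 0.8 GB/s (shared, independent of N)
    ring:     4 bytes x 500MHz x 2 links   = 4.0 GB/s (independent of N)
    crossbar: 4 bytes x 200MHz x N/2 ports = 6.4 GB/s for N=16, 12.8 GB/s for N=32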
7.1.1 Cache Coherence Protocols for Crossbar-connected Multiprocessors
As with ring-connected systems, there are different ways to implement cache
coherency in a crossbar-connected system. With a variation of the crossbar switch shown
in Figure 7.1, the Sun UPA implements a type of snooping on a crossbar system. The idea
is to have a centralized crossbar/coherency controller that keeps copies of the tags of all
processor caches in the system. This controller manages all memory and coherence
requests, performs the dual-tag lookups as in a snooping scheme, and determines whether
the response will come from a memory port or a processor port. The data transfer is
performed by a true data crossbar switch, under the command of the central controller.
Architectures such as the Sun UPA are likely to be popular for very small-scale systems,
since they maintain strong similarities with bus snooping controllers and enforce a simple
global ordering of events as in buses. However, they are not scalable beyond 2- or 4-
processor systems due to their centralized nature. Therefore we do not consider this
architecture in our evaluations.
Implementing crossbar-based snooping in a distributed fashion, as in bus-based
systems, is also impractical since it would require very frequent crossbar broadcasts which
are difficult to implement and wasteful of bandwidth.
Directory-based protocols such as the one used in the ring architecture are directly
applicable to a point-to-point interconnect such as a crossbar. Both centralized and
distributed-directory protocols are feasible alternatives. We concentrate on the centralized
directory protocol for the evaluation of crossbar-based systems since it is clearly the one
with the best performance potential, as discussed in previous chapters.
7.1.2 Simulation Results for Ring, Bus and Crossbar-based Systems
We now use the execution-driven simulators and the SPLASH/SPLASH-2
applications described in Table 5.1 to evaluate the performance of crossbar-based systems,
comparing it with snooping (Sring) and centralized directory unidirectional rings (Dring)
and with the NUMA bus architecture (Bus) as described earlier in this chapter. Each node
in the system has a scalar CPU, a fraction of the system memory, a 16KB first level cache,
a 128KB second level cache. Both caches are direct-mapped with a block size of 32B.
Rings and crossbars use 32-bit ports, while the bus is 64-bit wide (data). Rings are clocked
at 500MHz, buses at lOOMHz and crossbars at 200MHz. Figures 1.2-1.5 show the
breakdown of the execution time of the various systems normalized by the execution time
of the snooping ring. The cache coherence protocol for the crossbar system is identical to
the one described in Chapter 5 (Section 5.2) for centralized directory slotted rings.
Using these parameters and accounting for arbitration delays, the interconnect delay
in the absence of contention is 120 nanoseconds for the bus and for crossbar systems with up
to 16 processors; 240 nanoseconds for a 32-processor crossbar system; and 78, 142, and 270
nanoseconds for 8-, 16- and 32-processor rings. These delays are for 2-hop cache
transactions that involve only the requester and the home. Transactions involving 3 and 4
hops in the directory protocols will take significantly longer.
Figure 7.2. Execution time for SPLASH applications; 200 MHz processors.
[Figure: stacked bars of normalized execution time (busy, read, write, inval., wr. back, acquire, release) for MP3D, WATER, CHOLESKY and PTHOR on 8-, 16- and 32-processor systems.]
Figure 7.3. Execution time for SPLASH-2 applications; 200 MHz processors.
[Figure: stacked bars of normalized execution time (busy, read, write, inval., wr. back, acquire, release) for BARNES, VOLREND, OCEAN and LU on 8-, 16- and 32-processor systems.]
Figure 7.4. Execution time for SPLASH applications; 500 MHz processors.
[Figure: stacked bars of normalized execution time (busy, read, write, inval., wr. back, acquire, release) for MP3D, WATER, CHOLESKY and PTHOR on 8-, 16- and 32-processor systems.]
Figure 7.5. Execution time for SPLASH-2 applications; 500MHz processors.
[Figure: stacked bars of normalized execution time (busy, read, write, inval., wr. back, acquire, release) for BARNES, VOLREND, OCEAN and LU on 8-, 16- and 32-processor systems.]
For the 8-processor systems, snooping ring has the best performance across all
applications, although for WATER, BARNES, VOLREND and LU, with 200MHz
processors, the differences in execution time are sometimes negligible. These applications
are the ones with the lowest total miss ratio (see Table 5.1), and therefore are the least
impacted by the interconnect architecture and the behavior of the cache coherence
protocol. For 500MHz processors, the bus system starts to experience non-negligible
interconnect contention and as a result, starts to show performance degradations for
BARNES and LU as well. As expected, bus performance gets increasingly worse for each
application as the number or speed of processors increases, since increasing either exposes
the bus's lower bandwidth capacity.
The crossbar system performs better than the directory ring across virtually all
applications, system sizes and processor speeds. Although this result is expected for 16-
and 32-processor systems, in which misses on the crossbar experience lower latency than
on the unidirectional ring, it is somewhat of a surprise that it also happens for the 8-
processor systems in a few cases. The reason is that, although the pure latency of a 2-hop
miss on the crossbar is higher than on the ring, 3- and 4-hop misses are 33%-39% faster on
the crossbar.
Another surprising result is that the crossbar is outperformed by the snooping ring
for 16- and 32-processor systems. In those configurations, the read miss latency of the
crossbar is lower than that of the snooping ring, and its aggregate bandwidth is more than
twice that of the unidirectional ring. Our expectations were that snooping ring and
crossbar would have relatively even performance for 16-processor systems since the
difference in 2-hop miss latencies is relatively small and both systems would have lightly
loaded interconnects (ring is under 20% and crossbar is under 12% utilized). For 32-
processor systems we expected the crossbar to outperform the snooping ring given the
slightly lower 2-hop miss latency and the higher load on the interconnect (the snooping
slotted ring utilization is typically over 35% for 32-processor systems). The lower-than-
expected performance of the crossbar system is a result of the increasing impact of
synchronization operations as the system size increases in our experiments, and of the
very high overhead of handling locks and barriers through the normal write-invalidate
protocol mechanisms. This effect is evident in the large fraction of the execution time
spent in acquire operations in the 32-processor systems.
We use a fixed problem size for each application. As the system size increases there
is less work between barrier synchronizations, which are present in the majority of the
applications under study. Moreover, the overhead at each barrier increases since a larger
number of processors has to decrement the barrier counter and check for barrier
completion. For five of the applications, there is also a significant increase in contention
for ordinary locks as the system size increases.
7.2 Summary
The snooping ring system still performed best overall, which is surprising given that
the crossbar system has higher network bandwidth and lower latency for both 16 and 32
processor systems. The reasons for this were the poor performance of the directory based
protocol under high-contention locking and the fact that the snooping ring still requires
only two hops for transactions in which the directory protocol needs three or more hops to
complete.
High-contention locks and barrier synchronizations among larger numbers of
processors incur significant overhead for write-invalidate protocols, but they are
particularly harmful to the directory protocols presented here, since they have higher
latencies for invalidating multiple cache copies and for read misses on dirty blocks, which
are frequent in synchronization operations. In the following chapter, we examine this
problem in more detail and evaluate potential hardware solutions for it.
Chapter 8
HARDWARE SUPPORT FOR LOCKING
OPERATIONS
8.1 Atomic Operations
In a shared-memory multiprocessor the building block of all synchronization
primitives is a mechanism that allows a processor to read and subsequently modify a
memory position in such a way that no intervening access from another processor takes
place in between the read and the write. Different instruction set architectures implement
such atomicity mechanisms in one of two ways: read-modify-write operations and load-
locked/store-conditional operations.
Read-modify-write operations require hardware support for reading the old value of
a memory position and storing a new value while keeping the memory position
inaccessible to other processors in the system. Test&Set is a common implementation of
read-modify-write that reads the old value of a location while storing a known flag value
in it. If the value read is equal to the flag value it means that another processor has set the
flag first, and typically indicates that an attempt to acquire a lock has failed. An ordinary
write can be used to clear the locked position.
Load-locked/store-conditional (LLSC) takes an optimistic approach to locking. A
load-locked operation returns a value but also marks the position with the ID of the last
processor that accessed it¹. A subsequent store-conditional operation will only store the
new value after it has checked that no other processor has accessed the position since the
corresponding load-locked operation. If any intervening access has occurred, the store-
conditional fails and the sequence has to be restarted from the load-locked operation.

1. This is an abstract description of the mechanism. Actual implementations on top of cache-based systems do not
require an actual ID to be stored.
Test&Set and LLSC operations provide the same functionality, and both can have
harmful interactions with the underlying cache coherence protocol. Here we focus on
Test&Set operations since they are more common in modern instruction set architectures.
8.2 Test&Set Primitives in Write-invalidate Protocols
Test&Set instructions are present in many processor architectures, and are typically
used to implement simple locks, above which a variety of more complex synchronization
operations can be built. The algorithm of a simple lock is given below:
LABEL: i <= Test&Set(lock_address);
if (i = FLAG) goto LABEL
The main disadvantage of this algorithm is that since the Test&Set operation
includes a store to the lock_address, whenever more than one processor is spinning on the
lock there will be continuous (and useless) traffic on the interconnect. A more effective
algorithm for multiprocessors uses ordinary reads to spin on a locked position once a
Test&Set has failed, and only attempts another Test&Set once the lock is cleared. This
algorithm, usually referred to as Test&Test&Set, is presented below:

LABEL1:  i <= Test&Set(lock_address);
         if (i = FLAG) goto LABEL2;
         else goto SUCCESS;
LABEL2:  i <= Read(lock_address);
         if (i = FLAG) goto LABEL2;
         goto LABEL1;
SUCCESS:
By spinning with ordinary reads and caching the lock position the spin-wait is made
local and communication only takes place when the lock is released. All the simulations
presented so far use the Test&Test&Set algorithm described above to implement locks.
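For concreteness, a minimal software rendering of this Test&Test&Set loop is sketched below using C11-style atomics; the type and function names are ours, and the atomic exchange simply stands in for the Test&Set primitive assumed above.

    #include <stdatomic.h>

    typedef atomic_int spinlock_t;            /* 0 = free, 1 = FLAG (taken) */

    static void spin_acquire(spinlock_t *lock) {
        for (;;) {
            /* Test&Set: atomically store FLAG and return the old value. */
            if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
                return;                       /* old value was 0: lock acquired */
            /* Spin with ordinary reads on the locally cached copy, generating
               no interconnect traffic until the holder's release invalidates it. */
            while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
                ;
        }
    }

    static void spin_release(spinlock_t *lock) {
        /* An ordinary write clears the locked position. */
        atomic_store_explicit(lock, 0, memory_order_release);
    }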
Barrier synchronizations are built on top of locks by using a monitor structure. Each
processor that reaches the barrier acquires a lock that protects access to the barrier counter
and increments it. If not all processors have reached the barrier yet, it releases the barrier
counter lock and spins on a different lock that will only be released when all processors
have reached the barrier.
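A minimal sketch of this monitor-style barrier, reusing the hypothetical spinlock from the previous example (here the "different lock" is modeled as a simple gate flag, and the barrier is single-use for simplicity; a reusable barrier would also reset the counter and reverse a sense flag between episodes):

    typedef struct {
        spinlock_t counter_lock;   /* protects the barrier counter */
        int        count;          /* number of processors that have arrived */
        int        nproc;          /* total number of processors */
        atomic_int gate;           /* stands in for the second lock: 0 = closed */
    } barrier_t;

    void barrier_wait(barrier_t *b) {
        spin_acquire(&b->counter_lock);
        b->count++;
        if (b->count == b->nproc) {             /* last arrival opens the gate */
            atomic_store(&b->gate, 1);
            spin_release(&b->counter_lock);
        } else {
            spin_release(&b->counter_lock);
            while (atomic_load(&b->gate) == 0)  /* spin until everyone arrives */
                ;
        }
    }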
Barriers will typically cause high contention for locks at both the barrier entry and
exit points. Contention at the barrier entry point will happen when the various threads
reach the barrier within a small time window. Contention at the barrier exit will always
occur since all processes will try to exit the barrier at the same time.
Write-invalidate protocols require a large number of mostly useless transactions
whenever a lock that is contended for by more than one processor is released. Let us look
at the situation in which multiple processors try to acquire a lock. Initially P0 has the lock
and, since no other processor has tried to acquire the lock yet, it has the lock cached read-
write. When P1 tries to acquire the lock it issues a write miss transaction that invalidates
the copy in P0 and gives ownership to P1. At this point, P1 read-spins on its local copy of
the lock. When P2 tries to acquire the lock it also issues a write miss transaction that
invalidates P1's copy and transfers the ownership of the block to P2. Since P1 is read-
spinning it immediately read misses on the block and re-acquires a read-only copy from
P2. At this point both P1 and P2 are read-spinning on their local copies. Each new
processor that tries to acquire the lock at this point will cause a write miss (with remote
shared copies to invalidate) followed by as many read misses as there were processors
spinning. When P0 releases the lock it issues a write that invalidates all spinning processors'
copies and acquires ownership of the block. At this point all spinning processors issue
read misses. The first read miss to succeed will cause a write-back from P0 to the home
node. All subsequent read misses will find the block clean at the home and be satisfied
immediately². As the read misses are satisfied, the waiting processors see that the lock is
available and issue Test&Set (store) operations. The first processor to get to the home node
will succeed, force the invalidation of all read-only copies and obtain ownership (read-
write) of the block. All other invalidation requests will fail and be re-issued as write
misses, which will each in turn obtain ownership of the block only to pass it to the next
writer. As each losing processor sees that the lock has already been taken again, it will
issue a read miss and spin on its local copy. Figure 8.1 summarizes the actions just
described.

2. Depending on the dynamics of the protocol, it is possible that a processor may acquire the lock before all read misses
are satisfied. In this case, a processor may never observe that the lock has been passed.
Figure 8.1. High-contention locks with Test&Test&Set (a possible scenario).

P0 has the lock, P1 to PN are spinning on read-only copies, PN+1 attempts to acquire the lock:
1. 1 write miss with N copies invalidated
2. 1 read miss on a dirty block
3. N-1 read misses on a clean block

P1 to PN are spinning on read-only copies, P0 releases the lock:
1. 1 write miss with N copies invalidated
2. 1 read miss on a dirty block
3. N-1 read misses on a clean block
4. 1 write-on-clean with N-1 copies invalidated
5. N-1 write misses on a dirty block
6. 1 read miss on a dirty block
7. N-2 read misses on clean blocks
The number of messages exchanged in the two scenarios described in Figure 8.1 will
vary depending on the specifics of the protocol implementation and on the exact timing of
the actions. If the behavior of the system is exactly as described above, a centralized
directory protocol would require a minimum of 3N+2 probe messages and N+2 block
messages to add one processor to the set of N processors waiting for a held lock. It would
further require a minimum of 9N-1 probe messages and 3N+1 block messages to release a
lock when N processors are waiting on it. This analysis certainly underestimates the
message traffic since it does not count the many probe requests that will fail and be re-
issued due to contention for the home directory.
A snooping protocol is more efficient than a directory protocol in high-contention
lock operations, although it still incurs significant overhead. To add a processor to the
set of N processors waiting for the lock takes a minimum of N+1 probe messages and N+1
block messages. To release a lock that N processors are waiting on takes a minimum of 3N
probe messages and 3N-1 block messages.
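To make these minimum counts concrete: with N = 8 waiting processors, a release costs the centralized directory protocol at least 9(8)-1 = 71 probe and 3(8)+1 = 25 block messages, versus 3(8) = 24 probe and 3(8)-1 = 23 block messages for the snooping protocol, roughly a threefold difference in probe traffic.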
This overhead is responsible for the large fraction of execution time spent on acquire
operations on 16- and 32-processor configurations. It also explains why the snooping
protocols (bus and ring) suffer less from high-contention locking overheads than the
directory protocols (ring and crossbar).
Graunke and Thakkar [35] study the performance of software algorithms based on
Test&Set for high-contention locks on a snooping bus multiprocessor. They conclude that
normal Test&Test&Set locks are inadequate for more than a "modest number of
processors" (under 8 processors from their analysis). Their prescribed solution for larger
numbers of processors is a queue-based locking scheme that uses a different lock position
for each waiting processor, in such a way that the passing of the lock involves only
the current lock holder and the first processor on the waiting queue. Queueing locks as
well as other proposed software locking schemes partially attenuate the overheads of high-
contention locks, but they do so by typically increasing the overhead of a non-contended
lock.
We believe that locking synchronization is a fundamental and frequent operation in a
shared memory multiprocessor, and therefore it should be efficiently supported in
hardware. In the remainder of this chapter we briefly describe an existing hardware
solution for locking that is applicable to directory-based and snooping protocols. We then
present a new mechanism that supports fast locking operations on a slotted ring under the
snooping protocol. This mechanism adds very little complexity to the existing snooping
ring protocol.
8.3 Queue On Lock Bit (QOLB)
Goodman, Vernon and Woest [34] proposed a hardware locking mechanism
originally called Queue on SyncBit (QOSB) which was later renamed to Queue on Lock
Bit (QOLB). In the following we briefly describe the behavior of this mechanism. For a
complete description please see the original paper.
QOLB builds on top of an existing write-invalidate cache coherence protocol by
creating new cache states and transactions that allow the formation of a hardware FIFO of
processors waiting for a lock to be released. Every waiting processor creates a shadow
copy of the cache block that contains the lock, with no valid data and a special lock bit³
(incorporated in the encoding of the cache state) set, and it spins on the shadow copy until
it finds the lock bit reset. Since the shadow copy contains no valid data, it is used to store
the ID of the next processor in the queue of waiters. The only processor with the valid
copy is the one at the head of the queue, which currently holds the lock and has its lock bit
reset. Therefore, the ID of the first waiting processor and the ID of the processor at the tail
of the waiting queue are stored at the home node copy of the block.

3. This is not the "lock bit" that is associated with a directory entry at the home node for centralized directory protocols,
as described in Section 2.3.1.
A processor trying to acquire a lock will issue a special QOLB transaction to the
home node. If the lock is not taken, it gets an exclusive copy of the block containing the
lock and the cache and memory states change to locked. A second processor that attempts
to acquire the lock finds the memory state locked and enqueues itself as the first waiter. In
this case it creates a shadow copy with the lock bit reset and spins locally on it. The home
memory stores its ID as the first waiter as well as the tail of the list. Subsequent processors
that join the waiting list cause the tail pointer at the memory to be updated. The home node
also forwards the ID of the requester to the old tail processor.
An acquire operation in the QOLB scheme requires a maximum of three probe
messages if the lock is taken, and a maximum of one probe and two block messages to
pass the lock to the first waiting processor. Moreover it does not introduce any extra
overhead when there is no contention for a lock.
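For comparison with the Test&Test&Set analysis above: with N = 8 waiters, passing a released lock under QOLB costs at most one probe and two block messages, independent of N, versus a minimum of 71 probe and 25 block messages for the centralized directory protocol with Test&Test&Set.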
8.4 Hardware Support for Locking on Snooping Slotted Rings
The QOLB mechanism described above can be applied to both snooping and
directory-based protocols. However, the particular structure of the slotted ring, combined
with the properties of the snooping protocol, makes it possible to implement queue-based
locking without having to explicitly maintain a waiting queue, resulting in a much simpler
and more efficient locking protocol. We name this mechanism Token Locking, since it
transfers the lock to the next waiting processor in a fashion that resembles a token-passing
ring access protocol.
Token locking creates two new cache states (locked and lock_wait) and two new
cache protocol request types (acquire and release). If a processor tries to acquire a lock
that is cached read-write locally, it simply changes its state to locked. If the lock position
is cached read-only or invalid, it changes to lock_wait and issues an acquire probe on the
ring.
The acquire probe invalidates all read-only copies of the block but has no effect on
lock_wait or locked cache copies. It is acknowledged by the home node or by the current
owner by setting the ack bit in the (piggyback) response field of the acquire probe. A set
ack bit indicates to the requester that it has acquired the lock, and therefore it changes the
cache state to locked. The home node sets the ack bit in response to an acquire only if it
owns the block (e.g., the block is uncached or cached read-only), after which it sets the
dirty bit to indicate that it no longer owns the block. A node with a read-write copy of
the block does not hold the lock; therefore it invalidates its copy and acknowledges the
acquire probe. If the ack bit returns reset in the probe reply area, the requester
changes to lock_wait and the local processor is allowed to spin on the shadow copy of the
block⁴.

4. It is important to allow "live" spinning instead of just freezing the processor because the program may decide to use
preemptive locking techniques that allow the scheduling of another thread/processor if a lock is determined to be taken.
The release probe is issued by the node with the locked cached copy at an unlock
point. All nodes with the corresponding cache copy in lock_wait state will read the value
of the ack bit and set it. The node that sees zero as the previous value of the ack bit is the
new lock holder, therefore it changes its cache state to locked. If there are no waiting
nodes, the probe ack bit returns reset, and the requester changes its state from locked to
read-write.
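To make the probe handling concrete, the following illustrative C fragment sketches how one node's snooper might react to acquire and release probes passing on the ring; the state and function names are ours, and the actions of the probe's issuer and of the home node are omitted. This is a sketch of the rules above under those assumptions, not a hardware description from the thesis.

    typedef enum { INVALID, READ_ONLY, READ_WRITE, LOCK_WAIT, LOCKED } lock_state;

    /* Snooping response of one node as a token-locking probe passes by.
       `ack` is the piggybacked ack bit carried by the probe; the function
       returns the (possibly updated) value that travels on to the next node. */
    int snoop_lock_probe(lock_state *line, int is_release, int ack) {
        if (is_release) {
            if (*line == LOCK_WAIT) {
                if (ack == 0)
                    *line = LOCKED;   /* saw the ack bit reset: new lock holder */
                return 1;             /* every waiter sets the ack bit */
            }
            return ack;
        }
        /* acquire probe */
        switch (*line) {
        case READ_ONLY:
            *line = INVALID;          /* read-only copies are invalidated */
            return ack;
        case READ_WRITE:
            *line = INVALID;          /* an owner does not hold the lock:  */
            return 1;                 /* invalidate and acknowledge        */
        default:
            return ack;               /* locked/lock_wait copies unaffected */
        }
    }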
As can be seen from the description above, the value of the data in the block used
by the token locking mechanism is undefined. All locking state is kept in the directories of
the caches and in the home node. Unlike QOLB, token locking does not necessarily
maintain FIFO order between the arrival at the lock_wait state and the granting of the lock
since the lock is passed along in ring order. However, it is guaranteed that a processor will
get the lock after at most P-1 lock releases. Token locking is more efficient than QOLB for
a slotted ring since there is no need to communicate processor IDs in order to keep the
queue for the lock. It is simpler to implement than QOLB because it requires no additional
hardware other than the implementation of two new cache states and the encoding of two
new protocol messages. The snooping hardware that is already in place has all the
functionality that is required.
8.5 Performance Impact of Hardware Locking Mechanisms
The impact of hardware assists for efficient locking is analyzed in this section using
program-driven simulation of SPLASH and SPLASH-2 applications. We chose to evaluate
this impact by repeating the evaluation experiments of Chapter 7 but using QOLB to
implement locking for the directory ring, the snooping bus and the crossbar systems, and
token locking for the snooping ring system. The objective here is not to compare token
locking with QOLB, but to see how much of a factor a poor locking strategy can be in the
performance of shared-memory multiprocessors with various cache and interconnect
architectures.
These experiments will also yield a fairer comparison among the various
ring, bus and crossbar architectures, since we have determined that the directory based
systems were being more adversely affected by the simple Test&Test&Set locking scheme
used in all our previous experiments.
Figures 8.2-8.5 show the breakdown of the normalized execution time for snooping
ring, directory ring, bus and crossbar systems. The number on top of each bar shows the
percentage improvement over the same system without hardware support for locking (the
values in parentheses are the actual normalized execution times in the few cases
where they go beyond the scale of the chart).
Figure 8.2. Execution time improvement with hardware support for locking on SPLASH applications; 200MHz processors.
[Figure: normalized execution time breakdowns (busy, read, write, inval., wr. back, acquire, release) for MP3D, WATER, CHOLESKY and PTHOR at 8, 16 and 32 processors; the number atop each bar is the percentage improvement over the same system without hardware support for locking.]
Figure 8.3. Execution time improvement with hardware support for locking on SPLASH-2 applications; 200MHz processors.
[Figure: normalized execution time breakdowns for BARNES, VOLREND, OCEAN and LU at 8, 16 and 32 processors; the number atop each bar is the percentage improvement over the same system without hardware support for locking.]
Figure 8.4. Execution time improvement with hardware support for locking on SPLASH applications; 500MHz processors.
[Figure: normalized execution time breakdowns for MP3D, WATER, CHOLESKY and PTHOR at 8, 16 and 32 processors; the number atop each bar is the percentage improvement over the same system without hardware support for locking.]
Figure 8.5. Execution time improvement with hardware support for locking on SPLASH-2 applications; 500MHz processors.
[Figure: normalized execution time breakdowns for BARNES, VOLREND, OCEAN and LU at 8, 16 and 32 processors; the number atop each bar is the percentage improvement over the same system without hardware support for locking.]
Overall there is little, if any, improvement for the 8-processor systems. In fact, in
some cases the QOLB locking mechanism seems to slightly hurt total performance. Even
though these performance degradations are typically under 2% and could be attributed to
the slightly different execution paths between the runs, it is important to notice that there
are cases in which QOLB incurs an extra cost. For locks that are acquired twice or more
by the same processor with no intervening acquires by other processors, the schemes with
no hardware support are able to re-acquire the lock (or release it) without communicating
with the rest of the system, provided the block that contains the lock is not displaced from
the cache. In QOLB, since the waiting list is maintained in the home node, a release
operation has to issue messages in order to pass the lock to a possible waiting processor.
When there are no processors waiting, the home node gains ownership of the block,
which will cause the previous lock holder to communicate with the home again when it
needs to re-acquire the lock.
Hardware support for locking starts paying off for some of the 16-processor
applications, such as MP3D, CHOLESKY, PTHOR, and OCEAN, while showing
marginal gains at best for the remaining programs. Overall, the directory based systems
(e.g., Dring and Xbar) are the ones that benefit the most from hardware support for
locking. This was expected given the particularly bad performance of directory protocols
under high-contention locks, as explained earlier. Hardware support for locking is least
effective in the bus system. The explanation for this is two-fold. The relative fraction of
the execution time spent on locking (e.g., acquire/release) operations is smaller in the bus
system, since it also suffers from long read and write latencies due to bus contention. In
addition, the bus snooping protocol with no hardware assists for locking performs better
than the directory schemes and the snooping ring protocol, since the bus snooper is able to
snarf blocks that are being read by other processors when there is heavy read contention
for the block, which occurs when there are multiple waiters and the lock is cleared.
For the 32-processor systems, all applications benefit significantly from hardware
locking schemes, with the exception of WATER. In WATER, there is significant locking
activity for mutual exclusion but there is no significant contention for the lock. Typically
there is one processor waiting for a lock that is released. In this case, the problem size is
such that the application does not scale to 32 processors.
As with the 16-processor systems, the directory based schemes are the ones that
show the largest gains in 32-processor configurations, but the snooping bus and ring
systems also show improvements.
The addition of hardware support for locking makes it possible for the crossbar
systems with 16 processors to reach the same level of performance of snooping rings. For
the 32-processor systems, the crossbar configuration outperforms the snooping ring by an
average of 7% across all applications.
8.6 Summary
In this chapter we have explored existing and new hardware mechanisms for aiding
high contention locking operations. Although the 8-processor systems did not show
relevant improvements, 16- and 32-processor systems have benefited significantly from
these mechanisms. The 32-processor system in particular showed extraordinary
improvements by using hardware locking mechanisms, reaching over 20% for most
applications.
Among the various configurations analyzed, hardware locking was especially
beneficial to the directory-based systems. Improvements in the crossbar system
performance allowed it to match the snooping ring for 16-processor systems and to
outperform it by up to 12% in 32-processor systems.
A new locking mechanism for the snooping slotted ring was proposed, called token
locking. We have shown how this mechanism can be implemented in the slotted ring while
requiring no added functionality on top of the existing snooper hardware. Token locking
improved application performance by an average of 8% for 16-processor systems and by
24% for 32-processor systems.
Chapter 9
THE IMPACT OF RELAXED MEMORY
CONSISTENCY MODELS
9.1 Introduction
All the evaluations performed so far have assumed processor modules that enforce
strong ordering of memory references as a mechanism to ensure a sequentially consistent
view of the memory system. Strong ordering dictates that the next processor access will
only be issued after the previous access (in program order) is satisfied. This policy leads to
a processor frequently blocking unnecessarily and therefore prevents any type of
concurrency between computation and memory accesses.
Exploring overlap between computation and the satisfaction of load misses is
difficult to accomplish since it is common that an instruction that uses the value returned
by the load follows the load closely in the program order. Compilers can improve the
distance between the load and the use by moving the load instruction up in the code as
much as possible or by issuing prefetching instructions far in advance. However with the
exception of well behaved loop nest computations, it is difficult to move a load up or to
issue the prefetch far enough in advance to tolerate the ever increasing miss latencies in
multiprocessors. Dynamically scheduled processors with speculative execution provide an
additional cushion for tolerating load misses by attempting to execute past the load as much
as possible and rolling back if the speculated path of execution fails. Current state-of-the-art
speculative execution, as exemplified by the Intel Pentium Pro processor [42], is able to
tolerate a maximum of approximately 20 processor cycles effectively, which falls
significantly short of the miss latencies in current high-performance multiprocessors.
Exploring overlap between computation and the propagation of stores, even stores
that miss in the processor cache(s), is easier to accomplish than with loads since there
are no true data dependencies involved. It does however change significantly the
programmer’s view of the memory system since now different processors may see
different orders between the same pair of accesses. The weak ordering memory model as
pioneered by Dubois and Scheurich [22] and described in Section 1.3.3.2 defines one such
view, in which strong ordering is relaxed to allow the processor to continue to execute past
a store operation until it reaches a synchronization access. Release consistency [30] is an
optimization of weak ordering that further relaxes the access order by distinguishing
between types of synchronization operations (e.g., acquires and releases).
In this chapter we analyze the potential performance benefits of relaxed consistency
models for bus, ring and crossbar CC-NUMA multiprocessors. Two
schemes are used: send-delayed consistency and send-and-receive delayed consistency.
Both schemes were introduced by Dubois et al. [24], and represent the most aggressive
relaxed consistency models that we are aware of. We do not analyze schemes to tolerate
load latencies in this thesis since those are highly dependent on compiler optimizations
(prefetching/code motion algorithms) and processor micro-architecture (speculative
execution), both of which are outside the scope of our work. Our particular
implementation of the relaxed models is described in the following sections.
9.2 A Send-Delayed Consistency Implementation
The send-delayed consistency implementation used here is layered on top of the
cache protocols and interconnect architectures described previously. We still assume a
single issue (scalar) processor with a one-cycle execution latency per instruction (we do
not model the processor pipeline). The processor module, as before, contains the CPU, a
first-level write-through cache (no-write-allocate) and a second-level write-back cache
(write-allocate). In addition we include an (unbounded) write-buffer (32 bits wide) between
the first- and the second-level caches, and a write-cache [16] in parallel with the second-
level cache. The write-cache is a small fully associative cache with one valid/dirty bit per
word, and the same block size as the second-level cache. Entries in the write-cache are
allocated at a write miss or a write-on-clean, and the particular words written have their
valid/dirty bit set, so that correct merging of modifications can occur. An entry in the
write-cache only needs to be removed when it runs out of space or when the program
reaches a release point. In the latter case, according to release consistency, all buffered
modifications have to be committed before the release operation can commit. As opposed
to the scheme presented by Dahlgren and Stenstrom [16], there is no second-level write-
buffer to hold write requests that have been issued to the system. Here, the state of the
write-cache itself indicates whether it has an outstanding write/write-on-clean request or
not.
In our simulations the write-cache has eight entries. Whenever the write-cache
allocates the fifth entry, it issues the appropriate write/write-on-clean request to the system
for the two least recently written write-cache entries, in order to prevent the write-cache
from filling up. If the write-cache does fill up, the next write/write-on-clean operation
issued by the second-level cache blocks the second-level cache until some entry is freed¹.
We found that this policy virtually eliminates stalls due to write-cache fill-up, while at the
same time allowing writes to coalesce in the write-cache. This policy has the effect of
implementing a send-buffer, as in Dubois' send-delayed protocols [24]. Because the size
of the write-cache is kept small, the overhead of flushing it at release points is reduced.

1. Notice that the processor does not necessarily block when the second-level cache blocks. It may continue executing and
issuing stores until the write-buffer fills up as well.
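As a rough illustration of this drain policy (the data structure and helper names are our own, and the issue of the actual write/write-on-clean request is reduced to simply freeing the entry), the allocation path might be sketched as follows:

    typedef struct {
        int           valid;
        unsigned long tag;
        unsigned      age;          /* smaller = written longer ago */
    } wc_entry;

    typedef struct {
        wc_entry entry[8];          /* eight entries, as in the simulations */
        unsigned clock;             /* logical time for recency ordering */
    } write_cache;

    enum { WC_HIGH_WATER = 5, WC_DRAIN = 2 };

    /* Stand-in for issuing the write/write-on-clean request to the system. */
    static void wc_issue(write_cache *wc, int i) { wc->entry[i].valid = 0; }

    static int wc_occupancy(const write_cache *wc) {
        int n = 0;
        for (int i = 0; i < 8; i++) n += wc->entry[i].valid;
        return n;
    }

    static int wc_oldest(const write_cache *wc) {
        int best = -1;
        for (int i = 0; i < 8; i++)
            if (wc->entry[i].valid &&
                (best < 0 || wc->entry[i].age < wc->entry[best].age))
                best = i;
        return best;
    }

    /* Allocate an entry at a write miss or write-on-clean; once the fifth
       entry is taken, eagerly drain the two least recently written entries
       so the write-cache (almost) never fills and stalls the L2 cache. */
    void wc_allocate(write_cache *wc, unsigned long tag) {
        for (int i = 0; i < 8; i++) {
            if (!wc->entry[i].valid) {
                wc->entry[i] = (wc_entry){1, tag, wc->clock++};
                break;
            }
        }
        if (wc_occupancy(wc) >= WC_HIGH_WATER)
            for (int k = 0; k < WC_DRAIN; k++)
                wc_issue(wc, wc_oldest(wc));
    }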
Aside from the delaying of sending invalidations, our implementation follows the
RCpc model as described in [30], for a scalar, statically scheduled processor. In this model
loads and stores can bypass each other provided dependencies are observed. No new
operations are issued until a previous (in program order) acquire succeeds. A release can
only be issued when all previous stores have completed, but loads and stores after the
release do not have to wait for the release to be issued.
9.3 A Send-and-Receive Delayed Consistency Implementation
Our implementation of send-and-receive delayed consistency [24] is an extension of
the send-delayed consistency model described in the previous section in which a stale
state is added to the second-level cache entries. Upon receiving an invalidation request, a
cache line state is changed to stale instead of invalid. The presence bit in the home node
(for directory protocols) is cleared, so that the system level state of a stale block is in fact
invalid. However, a stale cache copy can continue to be accessed for loads until the
corresponding processor issues an acquire operation, at which point all stale copies of the
block are invalidated. Such a protocol is said to be receive-delayed with respect to
invalidations since the effect of a received invalidation request is delayed.
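A minimal sketch of this stale-state handling, with illustrative state names of our own (the thesis implements the equivalent transitions in the second-level cache hardware):

    typedef enum { INVALID, CLEAN, DIRTY, STALE } line_state;

    /* Receive-delayed handling of an incoming invalidation: keep the old
       copy readable instead of dropping it. The home node's presence bit is
       cleared separately, so at the system level the block is already invalid. */
    void on_invalidation(line_state *line) {
        if (*line == CLEAN || *line == DIRTY)
            *line = STALE;
    }

    /* Loads may still hit on stale copies... */
    int load_may_hit(line_state line) {
        return line != INVALID;
    }

    /* ...until the local processor performs an acquire, which flushes them. */
    void on_acquire(line_state cache[], int nlines) {
        for (int i = 0; i < nlines; i++)
            if (cache[i] == STALE)
                cache[i] = INVALID;
    }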
The rationale behind receive-delayed consistency protocols is that, in a correct
parallel program, all accesses to writable shared data have to be protected by
synchronization accesses (e.g., the program has to be properly labeled) so as to avoid race
conditions. If an invalidation is received for a block that is cached locally, it is permissible
to keep accessing the old copy of the block since if it were necessary for the local
processor to see the new write there would have been a synchronization handshake
between the local processor and the writing processor to indicate that a new value was
available.
By allowing stale copies to remain alive for reads, receive-delayed protocols reduce
the potentially heavy coherence activity that occurs when two or more processors are
accessing the same cache block but touching different data while at least one processor is
writing to the block. Such activity, called false-sharing, can significantly increase the
number of misses and other coherence actions, particularly when the block size is large.
9.4 Performance of Relaxed Consistency Models
Using the program-driven simulation models, we have performed extensive analysis
of both the send-delayed consistency and the send-and-receive delayed consistency
implementations in slotted rings, buses and crossbar systems. Figures 9.1-9.8 show the
normalized execution times for all SPLASH and SPLASH-2 applications. In the figures
and in the remainder of this chapter, SD denotes the send-delayed consistency model as
described in Section 9.2, and RD denotes the send-and-receive delayed protocol described
in Section 9.3. Moreover, the hardware locking mechanisms described in the previous
chapter are used in all configurations.
Figure 9.1. MP3D: Impact of relaxed consistency models (500MHz processors)
[Figure: normalized execution time breakdowns for MP3D at 8, 16 and 32 processors, with baseline, +SD and +RD bars for each of SRING, DRING, BUS and XBAR.]
Figure 9.2. WATER: Impact of relaxed consistency models (500MHz processors)
[Figure: normalized execution time breakdowns for WATER at 8, 16 and 32 processors, with baseline, +SD and +RD bars for each of SRING, DRING, BUS and XBAR.]
Figure 9.3. CHOLESKY: Impact of relaxed consistency models (500MHz processors)
[Figure: normalized execution time breakdowns for CHOLESKY at 8, 16 and 32 processors, with baseline, +SD and +RD bars for each of SRING, DRING, BUS and XBAR.]
Figure 9.4. PTHOR: Impact of relaxed consistency models (500MHz processors)
[Figure: normalized execution time breakdowns for PTHOR at 8, 16 and 32 processors, with baseline, +SD and +RD bars for each of SRING, DRING, BUS and XBAR.]
Figure 9.5. BARNES: Impact of relaxed consistency models (500MHz processors)
[Figure: normalized execution time breakdowns for BARNES at 8, 16 and 32 processors, with baseline, +SD and +RD bars for each of SRING, DRING, BUS and XBAR.]
Figure 9.6. VOLREND: Impact of relaxed consistency models (500MHz processors)
[Figure: normalized execution time breakdowns for VOLREND at 8, 16 and 32 processors, with baseline, +SD and +RD bars for each of SRING, DRING, BUS and XBAR.]
Figure 9.7. OCEAN: Impact of relaxed consistency models (500MHz processors)
[Figure: normalized execution time breakdowns for OCEAN at 8, 16 and 32 processors, with baseline, +SD and +RD bars for each of SRING, DRING, BUS and XBAR.]
Figure 9.8. LU: Impact of relaxed consistency models (500MHz processors)
[Figure: normalized execution time breakdowns for LU at 8, 16 and 32 processors, with baseline, +SD and +RD bars for each of SRING, DRING, BUS and XBAR.]
For each application chart in Figures 9.1-9.8 there are four groups of bars,
corresponding to snooping slotted ring (SRING), centralized directory slotted ring
(DRING), packet switched snooping bus (BUS) and centralized directory crossbar
(XBAR). The architecture of each of these systems corresponds to those described in
Chapters 4, 6 and 7 but enhanced with the support for hardware locking introduced in
Chapter 8. The italicized numbers on the top of the SD and RD bars correspond to the
percentage improvement observed with respect to the associated baseline sequentially
consistent configuration. The cache block size is 32B for all experiments.
The effect of SD is to virtually eliminate the contributions of write misses and
invalidation messages (write-on-clean messages) from the execution time in almost all
cases. Such an effect tends to benefit most the configurations with larger write
miss and invalidation latencies. In general that is what we observe when comparing the
directory based systems (DRING and XBAR) with the snooping ring system (SRING).
The directory based protocols have a significant number of higher latency transactions that
are due to write misses and invalidations, therefore they benefit the most from relaxed
consistency models.
The most important result from these experiments is that relaxed consistency
models are effective in reducing the execution time of all applications for ring and
crossbar systems. The magnitude of the reduction depends on many factors, but mostly on
the fraction of time that a processor blocks due to write misses or invalidations. Average
improvements from SD are 16% for SRING, 21% for DRING and 20% for XBAR. A
slight increase in read and acquire latency is noticeable in these systems when going from
sequential consistency to RD, due to increased contention for cache and interconnect
resources. For BUS, however, the limited available bandwidth is quickly consumed by
relaxing the consistency model, resulting in net gains that are marginal at best (5% on
average). In fact, OCEAN, BARNES, CHOLESKY and MP3D mostly show no gain or loss of
performance from going to SD on the bus systems.
SRING, DRING and XBAR showed net gains for SD mainly because they had
enough spare bandwidth to accommodate the increased interconnection load that results
from overlapping write accesses with computation. The percentage utilization of crossbar
output ports is typically under 15% even for 32-processor systems. Figure 9.9 shows the
effect of relaxing the memory consistency model on the utilization of slots for the
snooping slotted ring.
Figure 9.9. Percentage ring slot utilization for snooping
[Figure: percentage ring slot utilization for MP3D, WATER, CHOLESKY, PTHOR, BARNES, VOLREND, OCEAN and LU under SRING, +SD and +RD, plotted for P=8, P=16 and P=32.]
As Figure 9.9 shows, there is typically a sharp increase in ring slot utilization
when going from the baseline SRING to SD. However, overall ring utilization remains low
for most of the applications, even for applications with low processor utilization (e.g., low
fraction of busy time in the execution time breakdowns). This is caused in part by the
relatively long latencies of blocking read and synchronization accesses which prevent the
program from issuing ring accesses at higher rates.
RD is observed to only marginally improve on the performance of SD in these experiments.
Average gains for RD are 18% for SRING, 22% for DRING and 22% for XBAR. In the
cases where it shows the larger gains, RD is observed to significantly reduce the
contribution of read accesses to the execution time. Such reductions usually come not from
lower read miss latency, but from a smaller number of read (and write) misses incurred by the
program through the reduction of false-sharing. The modest gains of RD with respect to
SD are expected given the use of a small cache block size (32B).
The network utilization of RD with respect to SD depends on the balance of two
opposing effects. By increasing the lifetime of invalidated cache blocks, RD tends to
increase the load on the network by allowing the processor to execute faster. On the other
hand, by reducing the ping-ponging of cache blocks that are falsely shared, RD reduces
the number of cache transactions that are issued in the system.
RD also slightly increases acquire latency since at each acquire point all stale blocks
in the cache have to be invalidated before proceeding. In our simulations we assumed that
the time to invalidate stale blocks is 4 processor cycles (four times the access time of the
second-level cache) when there are no first-level cache blocks to be invalidated. Such
timing is realistic considering clearable SRAM chip technology available today [24]. An
extra processor cycle is wasted for each first-level cache invalidation that is required.
Figure 9.10 shows the percentage improvements in (normalized) execution time for
SRING, BUS and XBAR, when the cache block size is increased to 128B (keeping the
cache sizes constant), in the 16 processor systems. With the larger block size RD shows an
average performance improvement of 6% with respect to SD, for SRING across all
applications. Not considering the applications that do not suffer from false-sharing, the
average improvement of RD with respect to SD is 10%.
Figure 9.10. Release and delayed consistency improvements for 128B block systems; P=16; 500MHz processors.
[Figure: normalized execution time breakdowns for the SPLASH and SPLASH-2 applications with 128B cache blocks, comparing SD and RD on the SRING, BUS and XBAR systems.]
9.5 Summary
One of the prescribed methods to increase performance in modern shared memory
multiprocessor systems is to relax the ordering rules for issuing and completing accesses.
Delayed consistency is one of the most aggressive consistency models that can be
implemented in hardware.
In this chapter we have quantified the potential performance gains from using delayed
consistency protocols in small-scale shared memory multiprocessors. We have shown that,
while slotted ring and crossbar systems can significantly benefit from both models, bus
system performance is only slightly affected by them. This is a result of the limited
bandwidth available on the bus systems, which responds negatively to the increased load
caused by the relaxation of the memory consistency model.
Overall, send-delayed consistency showed performance increases over 20% across
all applications for the ring and crossbar systems, with send-and-receive delayed
consistency accounting for an additional 3%-6% improvement. In applications that exhibit
false-sharing behavior, send-and-receive delayed consistency improved ring and crossbar
performance by about 10%-12% with respect to send-delayed consistency alone.
The additional complexity of supporting delayed consistency in a system that
already implements release consistency is small: it is restricted to modifying the policy for
flushing entries in the write-cache (for send-delayed) and to implementing a stale bit
in the cache state that can be cleared efficiently at all acquire operations (for receive-
delayed). For systems with larger block sizes (128B or greater), the potential
performance improvements appear to justify this added complexity.
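As a rough illustration of the send-delayed side, the C sketch below models a write cache whose pending entries are held back until a release, at which point they are flushed to the coherence protocol in a batch. The entry format and the transaction hook are hypothetical, not a description of any particular implementation.

    #include <stdio.h>

    #define WC_ENTRIES 16

    struct wc_entry { unsigned long addr; int valid; };
    struct write_cache { struct wc_entry e[WC_ENTRIES]; };

    /* Stand-in for handing a delayed write to the coherence protocol
     * (e.g., sending the invalidation on the ring). */
    static void send_coherence_transaction(unsigned long addr) {
        printf("flush block 0x%lx\n", addr);
    }

    /* Send-delayed consistency: stores between synchronization points
     * accumulate in the write cache; only a release forces the pending
     * coherence transactions out, making the writes visible to other
     * processors. */
    static void release_flush(struct write_cache *wc) {
        for (int i = 0; i < WC_ENTRIES; i++) {
            if (wc->e[i].valid) {
                send_coherence_transaction(wc->e[i].addr);
                wc->e[i].valid = 0;
            }
        }
    }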
Chapter 10
CONCLUSIONS
10.1 Summary
This thesis explores the design space of Non-Uniform Memory Access (NUMA)
shared memory multiprocessors with up to 32 CPUs, for a variety of interconnect
topologies, cache protocols and consistency models. The fundamental motivating factor is
the realization that shared buses have electrical and topological limitations that prevent
them from keeping up with advances in processor performance. Under this scenario, it is
necessary to look for alternative ways of interconnecting small scale multiprocessors that
overcome the limitations of buses and can therefore scale up the offered bandwidth, as
technology improves, at rates similar to those of microprocessors.
The main contributions of this thesis are the proposed design of a ring interconnect, a
slotted ring media access control mechanism that is suited to high-speed cache
coherence traffic, and a snooping cache coherence protocol that takes advantage of the
broadcasting capabilities of the slotted ring. Other contributions include the description
and evaluation of an aggressive NUMA snooping bus protocol (the first that we are aware
of), a new hardware locking mechanism for snooping rings (token locking), and extensive
comparisons and performance evaluations of interconnect options for small scale
multiprocessors (unidirectional rings, bidirectional rings, buses and crossbars), under
various cache coherence protocols (snooping, centralized directory and
distributed directory), consistency models (sequential consistency and delayed
consistency), and with and without hardware support for locking operations (QOLB locking
and token locking). We have also evaluated the potential benefits of software prefetching
in ring and bus systems.
10.2 Performance of Bus-based Systems
Our experiments make quite evident why bus architectures are bound to be
replaced by more technologically scalable interconnects. We show that, while
buses with up to eight processors can perform reasonably well when the processor speed is
low or the application miss ratio is very low, their bandwidth limitations become a major
undermining factor for larger systems, faster processors or more aggressive latency
tolerance mechanisms. Bus-based systems show at best marginal gains from architectural
optimizations that have large potential gains in other systems; that is the case
for software prefetching and relaxed consistency models.
10.3 Design Options for Ring-based Systems
A significant portion of our efforts was directed toward exploring the design space of
ring-based shared memory multiprocessor architectures. The attractiveness of rings lies in
their simplicity and their similarity to buses. Simplicity comes from the fact that a ring
requires no central switching, arbitration or routing policies, and this can be translated directly
into faster clocking of the point-to-point links. Rings are similar to buses in that the overhead
of a broadcast is not much greater than that of a point-to-point message. The
slotted ring access control mechanism appears to have several advantages over alternatives
such as the register insertion mechanism adopted by SCI, in the context of a cache
coherent multiprocessor system. First, it allows for a simpler ring interface design.
Second, it is less susceptible to unfair allocation of communication bandwidth or to starvation.
Third, it is easy to predict how a cache protocol utilizes the slot bandwidth, and it
is therefore possible to partition the slots in a way that matches the communication traffic quite well.
Finally, the existence of slots makes it easy to implement the fast acknowledgment schemes
that are necessary to resolve conflicts in the protocol and guarantee forward progress of
the applications. An added bonus is that a slotted ring allows a designer to guarantee a
minimum inter-arrival time of cache coherence requests into a node, which simplifies the
overall design and enables snooping implementations.
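A minimal software model of this access discipline, under assumptions of our choosing (slot and node counts, one pending request per node), is sketched below in C: fixed-size slots circulate past the nodes, a node transmits only by claiming an empty slot as it passes, and every full slot is snooped by each node it traverses, so broadcast comes essentially for free. Slot reclamation by the sender is omitted for brevity.

    #include <stdio.h>

    #define SLOTS 8
    #define NODES 4

    struct slot { int full; int src; unsigned long addr; };

    static struct slot ring[SLOTS];       /* slots circulating on the ring */
    static unsigned long pending[NODES];  /* one queued request per node  */

    /* One ring cycle: node n examines the slot currently passing by it.
     * A full slot is snooped; an empty slot may be claimed to launch a
     * pending coherence request. */
    static void ring_cycle(int t) {
        for (int n = 0; n < NODES; n++) {
            struct slot *s = &ring[(t + n) % SLOTS];
            if (s->full) {
                if (s->src != n)          /* every node sees the request */
                    printf("node %d snoops 0x%lx from node %d\n",
                           n, s->addr, s->src);
            } else if (pending[n]) {      /* grab the empty slot */
                s->full = 1; s->src = n; s->addr = pending[n];
                pending[n] = 0;
            }
        }
    }

    int main(void) {
        pending[0] = 0x1000; pending[2] = 0x2000;
        for (int t = 0; t < SLOTS; t++) ring_cycle(t);
        return 0;
    }

Because at most one slot passes a node per slot time, the model also exhibits the bounded inter-arrival property of coherence requests mentioned above.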
On top of the slotted ring we evaluate three classes of protocols: centralized (full-
map) directory protocols, distributed directory protocols based on linked lists, and a new
snooping protocol. A centralized directory protocol is the prescribed solution for a non-
bus system, while a distributed directory protocol is being strongly pursued by the SCI
standards group and some industrial partners. We show that neither of these protocols
is the most effective one for slotted rings, and that our proposed snooping protocol
outperforms them, sometimes quite significantly, on our suite of benchmark programs. Our
snooping ring protocol trades bandwidth for lower latency by always broadcasting
request transactions, thereby avoiding the multiple hops that could cause the ring
to be traversed more than once.
Snooping unidirectional slotted rings perform better than centralized and distributed
directory protocols, even when the directory schemes run on a bidirectional ring
configuration of the same bisection bandwidth. In fact, we show that bidirectionality buys
very little (if any) performance for centralized directory protocols, and yields only
modest gains for distributed directory protocols.
10.4 Performance Comparison of Ring- and Crossbar-based Systems
In order to compare ring-based systems with alternatives that can offer even greater
interconnect bandwidth, we modeled a NUMA crossbar system running a centralized
directory protocol essentially identical to the one used in the slotted ring studies.
As with the buses, we used very aggressive parameters in the crossbar model
in order to provide an honest comparison with the ring systems. While the crossbar
generally performed better than the centralized directory ring, it still performed worse than
the snooping ring, even for 32-processor systems. This result was somewhat contrary to
our expectations, since a crossbar with 16 or more processors has both lower latency and
higher communication bandwidth than an equivalent ring. The reason was the
particularly poor performance of the centralized (invalidation-based) directory
protocols on those of our applications that make heavy use of high-contention locks to
implement barriers. While the snooping bus and ring systems were also affected by this
phenomenon, its effect was lessened by the more effective way in which snooping resolves
coherence transactions under intensive read-write sharing.
To address the poor performance of all the protocols when high-contention locks and
barriers are implemented with write-invalidate coherence and test&set operations, we
studied all systems again, this time giving each of them hardware support for
high-contention locks. We added Queue On Lock Bit (QOLB) [75] functionality to the
bus, directory ring and crossbar systems to support efficient passing of locks. For the
slotted ring, however, we proposed a new mechanism, named token locking, that achieves
the same goal as QOLB but requires far fewer hardware resources by leveraging
the topology of the ring and its snooping functionality. With hardware support for
locking, 16-processor crossbar systems could match the performance of slotted rings,
and 32-processor crossbars in fact performed about 7% better than snooping rings.
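The contrast that motivates this hardware support can be sketched in software terms. The hypothetical C fragment below (an illustration, not the hardware mechanisms themselves) shows why: under a plain test&set lock, every release triggers a burst of read-modify-writes on the lock's cache block by all waiters, whereas a ticket lock grants the lock to exactly one successor in FIFO order with a single write; QOLB and token locking go further by passing the lock directly to a single waiting cache in hardware.

    #include <stdatomic.h>

    /* Plain test&set lock: on each release, every waiter re-issues a
     * read-modify-write on the same cache block, so coherence traffic
     * grows with the number of contenders.
     * (Initialize the flag with ATOMIC_FLAG_INIT before use.) */
    typedef struct { atomic_flag f; } ts_lock;

    void ts_acquire(ts_lock *l) {
        while (atomic_flag_test_and_set(&l->f))
            ;                          /* spin, invalidating the block */
    }
    void ts_release(ts_lock *l) { atomic_flag_clear(&l->f); }

    /* Ticket lock: waiters spin reading 'serving'; a release performs
     * one write and hands the lock to exactly one successor, in FIFO
     * order, avoiding the read-modify-write storm.
     * (Zero-initialize 'next' and 'serving' before use.) */
    typedef struct { atomic_uint next, serving; } ticket_lock;

    void tk_acquire(ticket_lock *l) {
        unsigned me = atomic_fetch_add(&l->next, 1);
        while (atomic_load(&l->serving) != me)
            ;                          /* spin on a read-only copy */
    }
    void tk_release(ticket_lock *l) { atomic_fetch_add(&l->serving, 1); }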
The combination of release consistency and delayed consistency protocols could
further increase the performance of ring and crossbar systems by over 25% on average.
The interesting result here was that, even in 32-processor systems, the snooping ring had
sufficient bandwidth capacity to absorb the increased load caused by relaxing the
consistency model, and therefore showed substantial improvements in execution time.
Delayed consistency showed only marginal gains beyond release consistency for the smaller
cache block size (32B). Simulations with 128B blocks showed more promising gains;
in particular, performance improved by over 10% for the applications that suffer from
false-sharing behavior.
Overall, slotted ring multiprocessors were shown to be a very promising way of
building small shared memory multiprocessors. The results in this thesis indicate that, for
systems with up to 16 processors, rings are more effective than aggressive crossbar
implementations and should therefore be considered as a choice for systems in this range.
Although we did not study clustering in this thesis, we believe that rings can also be
attractive in multi-level configurations in which nodes consisting of ring-connected
processors are linked by a high-bandwidth switching network (such as a crossbar or a
multistage network).
10.5 Future Work
In the process of carefully analyzing a significant number of options in the design
space of small scale shared memory multiprocessors, we have identified several areas that
deserve further investigation.
In our studies we have used a scalar, in-order processor model. While we believe that
our results are consistent with statically scheduled superscalar processors, it is difficult to
predict the impact of dynamically scheduled processors that can speculate beyond
branches. Such processors would not only be more tolerant of load misses, but they
would also change the mix of accesses seen by the memory system: speculative
loads and instruction fetches are issued to the memory system, while only committed stores
are seen outside the processor core, so the access mix would include a larger fraction of
loads and fetches and a smaller fraction of stores. It would be interesting to investigate
whether this change in access patterns favors other types of cache protocols or cache
organizations.
We have concentrated on write-invalidate protocols throughout this thesis, since
previous studies have determined that write-update protocols generate too much traffic in
the interconnect. However, given current technology trends, it appears easier to
build interconnects with very high port bandwidth than with very low latency.
Interconnects with such characteristics would be good candidates for write-update or
hybrid update/invalidate protocols, since their bandwidth requirements could
potentially be accommodated and the resulting memory system could have significantly
lower miss ratios.
Finally, we have assumed that each node in the system contains a single processor.
Advances in packaging and circuit integration seem to make it inevitable that future nodes
will have multiple processors. Ring-based interconnects could be advantageous in such
clustered configurations, since the entire network bisection bandwidth is available to every
node in the system. It would be interesting to compare rings and crossbars as second-level
interconnects of systems with multiprocessor nodes.
Bibliography
[1] Alliant Computer Systems Corporation, “The Alliant FX/2800 Multiprocessor”,
Littleton MA, 1991.
[2] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B-H. Lim, K. Mackenzie and D. Yeung, “The MIT Alewife Machine: Architecture
and Performance”, in proceedings of the 22nd Annual International Symposium
on Computer Architecture, pp. 2-13, Santa Margherita Ligure, Italy, June 1995.
[3] A. Arlauskas, “iPSC/2 System: A Second Generation Hypercube”, in Geoffrey
Fox, editor, ACM Third Conference on Hypercube Concurrent Computers and
Applications, pp. 38-42, New York, 1988.
[4] J-L. Baer and T-F. Chen, “An Effective On-Chip Preloading Scheme to Reduce
Data Access Penalty”, in proceedings of Supercomputing'91, pp. 176-186,
Albuquerque NM, November 1991.
[5] L. Barroso and M. Dubois, “Cache Coherence on a Slotted Ring”, Proceedings of
the 1991 International Conference on Parallel Processing, Vol. I, pp. 230-237,
St. Charles, IL, August 1991.
[6] L. Barroso and M. Dubois, “The Performance of Cache-Coherent Ring-based
Multiprocessors”, Proceedings of the 20th International Symposium on Computer
Architecture, pp. 268-277, San Diego, CA, May 1993.
[7] L. Barroso and M. Dubois, “The Performance of Cache-Coherent Ring-based
Multiprocessors”, IEEE Transactions on Computers, Vol. 44, No. 7, pp. 878-890,
July 1995.
[8] L. Barroso et al, “RPM: A Rapid Prototyping Engine for Multiprocessor Systems”,
IEEE Computer, Vol. 28, No. 2, February 1995.
[9] L. Bhuyan, D. Ghosal, and Q. Yang, “Approximate Analysis of Single and
Multiple Ring Networks”, IEEE Transactions on Computers, Vol. 38, No. 7, pp.
1027-1040, July 1989.
[10] P. Bitar, “A Critique of Trace-Driven Simulation for Shared-Memory
Multiprocessors”, in M. Dubois and S. Thakkar, Editors, Cache and Interconnect
Architectures in Multiprocessors, pp. 37-52, Kluwer Academic Publishers, 1990.
[11] M. Brorsson, F. Dahlgren, H. Nilsson and P. Stenstrom, “The CacheMire Test
Bench - A Flexible and Efficient Approach for Simulation of Multiprocessors”,
Proceedings of the 26th Annual Simulation Symposium, March 1993.
[12] M. Carlton and A. Despain, “Multiple-Bus Shared Memory System”, IEEE
Computer, Vol. 23, No. 6, June 1990, pp. 80-83.
[13] L. Censier, and P. Feautrier, “A New Solution to Coherence Problems in
Multicache Systems”, IEEE Transactions on Computers, C-27(12), pp. 1112-1118,
December 1978.
[14] D. Chaiken, C. Fields, K. Kurihara and A. Agarwal, “Directory-Based Cache
Coherence in Large Scale Multiprocessors”, IEEE Computer, Vol. 23, No. 6, pp.
49-59, June 1990.
[15] F. Dahlgren and P. Stenstrom, “Effectiveness of Hardware-Based Stride and
Sequential Prefetching in Shared-Memory Multiprocessors”, in proceedings of the
1st International Symposium on High-Performance Computer Architecture, Raleigh NC,
January 1995.
[16] F. Dahlgren and P. Stenstrom, “Using Write Caches to Improve Performance of
Cache Coherence Protocols in Shared-Memory Multiprocessors”, in Journal of
Parallel and Distributed Computing, Vol. 26, No. 2, pp. 193-210, April 1995.
[17] H. Davis, S. Goldshmidt and J. Hennessy, “Tango: A Multiprocessor Simulation
and Tracing System”, in proceedings of the 1991 International Conference on
Parallel Processing, pp. II:99-107, St. Charles IL, August 1991.
[18] D. Del Corso, M. Kirrman, and J. Nicoud, Microcomputer Buses and Links,
Academic Press, 1986.
[19] G. Delp, D. Farber, R. Minnich, J. Smith and M-C. Tam, “Memory as a Network
Abstraction”, IEEE Network Magazine, pp. 34-41, July 1991.
[20] Digital Equipment Corp., “Alpha Architecture Handbook”, DEC, Massachusetts,
February 1992.
[21] M. Dubois and J-C. Wang, “Shared Data Contention in a Cache Coherence
Protocol”, proceedings of the 1988 International Conference on Parallel
Processing, St. Charles IL, pp. 146-155, August 1988.
[22] M. Dubois and C. Scheurich, “Memory Access Dependencies in Shared Memory
Multiprocessors”, IEEE Trans. on Software Engineering, 16(6), pp. 660-674, June
1990.
[23] M. Dubois and C. Scheurich, “Lockup-Free Caches in High-Performance
Multiprocessors”, The Journal of Parallel and Distributed Computing, January
1991, pp. 25-36.
[24] M. Dubois, J-C. Wang, L. Barroso, K. Lee and Y-S. Chen, “Delayed Consistency
and its Effects on the Miss Rate of Parallel Programs”, Proceedings of
Supercomputing’91, Albuquerque NM, November 1991.
[25] S. Eggers et al., “Techniques for Efficient Inline Tracing on a Shared-Memory
Multiprocessor”, Proceedings of Performance 1990 and ACM Sigmetrics, pp. 37-
47, May 1990.
[26] D. Engebretsen, D. Kuchta, R. Boot, J. Crow and W. Nation, “Parallel Fiber-Optic
SCI Links”, IEEE Micro, Vol. 16, No. 1, February 1996.
[27] D. Farber and K. Larson, “The System Architecture of the Distributed Computer
System - the Communication System”, Symp. on Computer Networks,
Polytechnic Institute of Brooklyn, April 1972.
[28] K. Farkas, Z. Vranesic and M. Stumm, “Cache Consistency in Hierarchical Ring-
Based Multiprocessors”, Proceedings of Supercomputing’92, November 1992.
[29] M. Ferrante, “CYBERPLUS and MAP V Interprocessor Communications for
Parallel and Array Processor Systems”, Multiprocessors and Array Processors, W.
J. Karplus editor. The Society for Computer Simulations, 1987, pp. 45-54.
[30] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta and J. Hennessy,
“Memory Consistency and Event Ordering in Scalable Shared-Memory
Multiprocessors”, in proceedings of the ACM 17th Annual International
Symposium on Computer Architecture, pp. 15-26, Seattle WA, May 1990.
[31] N. Godiwala and B. Maskas, “The Second-generation Processor Module for
AlphaServer 2100 Systems”, Digital Technical Journal, Vol. 7, No. 1, pp. 12-27,
July 1995.
[32] S. Goldschmidt and J. Hennessy, “The Accuracy of Trace-Driven Simulations of
Multiprocessors”, in proceedings of the ACM SIGMETRICS Conference on
Measurement and Modeling of Computer Systems, pp. 146-157, Santa Clara CA,
May 1993.
[33] J. Goodman, “Using Cache Memory to Reduce Processor/Memory Traffic”, Proc.
of the 10th Int. Symp. on Computer Architecture, June 1983, pp. 124-131.
[34] J. Goodman, M. Vernon and P. Woest, “Efficient Synchronization Primitives for
Large-Scale Cache-Coherent Multiprocessors”, in proceedings of the Third
International Conference on Architectural Support for Programming Languages
and Operating Systems, pp. 64-73, Boston MA, April 1989.
[35] G. Graunke and S. Thakkar, “Synchronization Algorithms for Shared-Memory
Multiprocessors”, IEEE Computer, Vol. 23, No. 6, pp. 60-69, July 1990.
[36] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry and W.D. Weber,
“Comparative Evaluation of Latency Reducing and Tolerating Techniques”,
Proceedings of the 18th International Symposium on Computer Architecture, pp.
254-263, Toronto, Canada, May 1991.
[37] D. Gustavson, “The Scalable Coherent Interface and Related Standards Projects”,
IEEE Micro, Vol. 12, No. 1, pp. 10-22, February 1992.
[38] E. Hafner et al., “A Digital Loop Communication System”, IEEE Transactions on
Communications, Vol. 22, No. 6, pp. 877-881, June 1974.
[39] K. Hahn, “POLO - Parallel Optical Links for Gigabyte Data Communications”,
unpublished technical report, Hewlett-Packard Laboratories, Palo Alto, CA, 1996.
[40] R. Halstead Jr. et al., “Concert: Design of a Multiprocessor Development System”,
Proc. of the 13th Int. Symp. on Computer Architecture, June 1986, pp. 40-48.
[41] A. Hopper and R. Needham, “The Cambridge Fast Ring Networking System”, IEEE
Trans. on Computers, Vol. 37, No. 10, October 1988, pp. 1214-1224.
[42] Intel Corp., “The Pentium Pro Processor at 150MHz”, Santa Clara CA, October
1995.
[43] D. James, “SCI (Scalable Coherent Interface) Cache Coherence”, Cache and
Interconnect Architectures In Multiprocessors, M. Dubois and S. Thakkar editors,
Kluwer Academic Publishers, Massachusetts, 1990, pp. 189-208.
[44] A. Karlin, M. Manasse, L. Rudolph and D. Sleator, “Competitive Snoopy
Caching”, in proceedings of the 27th Annual Symposium on Foundations of
Computer Science, pp. 244-254, 1986.
[45] R. Katz et al., “Implementing a Cache Consistency Protocol”, Proc. of the 12th Int.
Symp. on Computer Architecture, June 1985, pp. 276-283.
[46] Kendall Square Research, “Technical Summary”, Waltham, Massachusetts, 1992.
[47] E. Koldinger, S. Eggers and H. Levy, “On the Validity of Trace-Driven Simulation
for Multiprocessors”, in proceedings of the 18th Annual International Symposium
on Computer Architecture, pp. 244-253, Toronto Canada, May 1991.
[48] J. Kowalik, editor, “Parallel MIMD Computation: HEP Supercomputer and Its
Applications”, MIT Press, 1985.
[49] L. Lamport, “How to Make a Multiprocessor Computer that Correctly Executes
Multiprocess Programs”, IEEE Transactions on Computers, Vol. C-28, No. 9, pp.
690-691, September 1979.
[50] D. Lenoski et al., “The Directory-Based Cache Coherence Protocol for the DASH
Multiprocessor”, Proc. of the 17th Int. Symp. on Computer Architecture, June
1990, pp. 148-160.
[51] D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta and J. Hennessy,
“The DASH Prototype: Implementation and Performance”, in proceedings of the
ACM International Symposium on Computer Architecture, pp. 92-103, Gold
Coast, Australia, May 1992.
[52] T. Lovett and S. Thakkar, “The Symmetry Multiprocessor System”, in Proceedings
of the 1988 International Conference on Parallel Processing, pp. I:303-310,
St. Charles IL, August 1988.
[53] T. Lovett and R. Clapp, “STiNG: A CC-NUMA Computer System for the
Commercial Marketplace”, in proceedings of the ACM 23rd International
Symposium on Computer Architecture, Philadelphia PA, May 1996.
[54] D. Menasce, and L. Barroso, “A Methodology for Performance Evaluation of
Parallel Applications in Multiprocessors”, Journal of Parallel and Distributed
Computing, Vol 14, No. 1, pp. 1-14, January 1992.
[55] T. Mowry and A. Gupta, “Tolerating Latency through Software-controlled
Prefetching in Shared-Memory Multiprocessors”, Journal of Parallel and
Distributed Computing, Vol. 12, No 2., pp. 87-106, June 1991.
[56] M. Papamarcos and J. Patel, “A Low Overhead Coherence Solution for
Multiprocessors with Private Cache Memories”, Proc. of the 11th Int. Symp. on
Computer Architecture, June 1984, pp. 414-423.
[57] G. Pfister and V. Norton, “Hot Spot Contention and Combining in Multistage
Interconnection Networks”, IEEE Transactions on Computers, Vol. C-34, No. 10,
pp. 943-948, October 1985.
[58] J. Pierce, “How Far Can Data Loops Go?”, IEEE Trans. on Communications, Vol.
COM-20, June 1972, pp. 527-530.
[59] SCI (Scalable Coherent Interface): An Overview, IEEE P1596: Part I, doc171-i,
Draft 0.59, February 1990.
[60] R. Saavedra-Barrera, D. Culler and T. von Eicken, “Analysis of Multithreaded
Architecture for Parallel Computing”, 2nd Annual ACM Symposium on Parallel
Algorithms and Architectures, pp. 169-178, Greece, July 1990.
[61] S. Scott, J. Goodman and M. Vernon, “Performance of the SCI Ring”, Proceedings
of the 19th International Symposium on Computer Architecture, pp. 403-414,
Gold Coast, Australia, May 1992.
[62] M. Schmidtvoigt, “Efficient Parallel Communication with the nCUBE 2S
Processor”, Parallel Computing, Vol. 20, No. 4, pp. 509-530, April 1994.
[63] H. Schwetman, “CSIM: A C-Based, Process-Oriented Simulation Language”,
Proceedings of the 1986 Winter Simulation Conference, pp. 387-396, 1986.
[64] J. Singh, W-D. Weber and A. Gupta, “SPLASH: Stanford Parallel Applications for
Shared Memory”, SIGArch Computer Architecture News, Vol. 20, No. 1, pp. 5-43,
March 1992.
[65] P. Stenstrom, “A Survey of Cache Coherence Schemes for Multiprocessors”, IEEE
Computer Vol. 23, No. 6, June 1990, pp. 12-25.
[66] T. Sterling, D. Savarese, P. MacNeice, K. Olson, C. Mobarry, B. Fryxell and P.
Merkey, “A Performance Evaluation of the Convex SPP-1000 Scalable Shared
Memory Parallel Computer”, in proceedings of Supercomputing’95, pp. 1-17, San
Diego CA, December 1995.
[67] C. Stunkel, D. Shea, B. Abali, M. Atkins, C. Bender, D. Grice, P. Hochschild, D.
Joseph, B. Nathanson, R. Swetz, R. Stucke, M. Tsao and P. Varker, “The SP2
High-Performance Switch”, IBM Systems Journal, Vol. 34, No. 2, February 1995.
[68] Sun Microelectronics, “Universal Port Architecture: The New-Media system
Architecture”, electronic white-paper, http://www.sun.com/sparc/whitepapers/
wp95-023.html, 1995.
[69] C. Thacker, L. Stewart and E. Satterthwaite, “Firefly: A Multiprocessor
Workstation”, IEEE Transactions on Computers, Vol 37, No. 8, August 1988.
[70] S. Thakkar, “Performance of the Symmetry Multiprocessor System”, in M. Dubois
and S. Thakkar, editors, Scalable Shared Memory Multiprocessors, Kluwer
Academic Publishers, 1991.
[71] Thinking Machines Corp., “CM-5 Technical Summary”, Cambridge MA, 1991.
[72] D. Tullsen and S. Eggers, “Effective Cache Prefetching on Bus-Based
Multiprocessors”, ACM Transactions on Computer Systems, pp. 57-88, February
1995.
[73] J. Veenstra and R. Fowler, “MINT Tutorial and User Manual”, University of
Rochester Technical Report 452, June 1993.
[74] Z. Vranesic, M. Stumm, D. Lewis and R. White, “Hector: A Hierarchically
Structured Shared Memory Multiprocessor”, IEEE Computer, Vol. 24, No. 1, pp.
72-78, January 1991.
[75] P. Woest and J. Goodman, “An Analysis of Synchronization Mechanisms in
Shared-Memory Multiprocessors”, in proceedings of the International Symposium
on Shared Memory Multiprocessing, pp. 21-34, Tokyo, Japan, April 1991.
[76] S. Woo, M. Ohara, E. Torrie, J-P. Singh and A. Gupta, “The SPLASH-2 Programs:
Characterization and Methodological Considerations”, in proceedings of the ACM
22nd International Symposium on Computer Architecture, pp. 24-36, Santa
Margherita Ligure, Italy, June 1995.
[77] W. Wulf, R. Levin, S. Harbison, “HYDRA/C.mmp: An Experimental Computer
System”, McGraw Hill, 1981.
[78] Q. Yang, L.N. Bhuyan and B. C. Liu, “Analysis and Comparison of Cache
Coherence Protocols for a Packet-Switched Multiprocessor”, IEEE Transactions
on Computers, Vol 38, No. 8, pp. 1143-1153, August 1989.
[79] R. Zucker and J-L. Baer, “A Performance Study of Memory Consistency Models”,
in proceedings of the 19th Annual International Symposium on Computer
Architecture, pp. 2-12, Gold Coast, Australia, May 1992.