A new-generation parallel computer and its performance evaluation

https://doi.org/10.1016/S0167-739X(00)00082-0

Abstract

An innovative design is proposed for an MIMD distributed shared-memory (DSM) parallel computer capable of achieving gracious (i.e., sustained) performance with technology expected to become feasible/viable in less than a decade. This New Millennium Computing Point Design was chosen by NSF, DARPA, and NASA as having the potential to deliver 100 TeraFLOPS and 1 PetaFLOPS performance by the years 2005 and 2007, respectively. Its scalability guarantees a lifetime extending well into the next century. Our design takes advantage of free-space optical technologies, combined with simple guided-wave concepts, to produce a 1D building block (BB) that efficiently implements a large, fully connected system of processors. The ability to design large, fully connected systems of electronic processors could be one of the most beneficial impacts of optics on massively parallel processing. A 2D structure is proposed for the complete system, in which the aforementioned 1D BB is extended into two dimensions. This architecture behaves like a 2D generalized hypercube, a topology characterized by outstanding performance but also by extremely high wiring complexity that prohibits its electronics-only implementation. With readily available technology, a mesh of clear plastic/glass bars in our design facilitates point-to-point bit-parallel transmissions that utilize wavelength-division multiplexing (WDM) and follow dedicated optical paths. Processors are mounted on cards; each card contains eight processors interconnected locally via an electronic crossbar. Taking advantage of higher-speed optical technologies, all eight processors share the same communications interface to the optical medium using time-division multiplexing (TDM). A case study for 100 TeraFLOPS performance by the year 2005 is investigated in detail; the characteristics of the hardware components chosen for the case study conform to SIA (Semiconductor Industry Association) projections. An impressive property of our system is that its bisection bandwidth matches, within an order of magnitude, the performance of its computation engine. Performance results based on the implementation of various important algorithmic kernels show that our design could have a tremendous, positive impact on massively parallel computing. 2D and 3D implementations of our design could achieve gracious PetaFLOPS performance before the end of the next decade.

Introduction

The demand for ever greater performance by many computational problems has been the driving force behind the development of computers with thousands of processors. Two important aspects are expected to dominate the massively parallel processing field: high-level parallel languages supporting a shared address space (for DSM computers), and point-to-point interconnection networks with workstation-like nodes. Near-PetaFLOPS (i.e., 10^15 floating-point operations per second) performance and more is required by many applications, such as weather modeling, simulation of physical phenomena, aerodynamics, simulation of neural networks, simulation of chips, structural analysis, real-time image processing and robotics, artificial intelligence, seismology, animation, real-time processing of large databases, etc. Dongarra pointed out in 1995 that the world’s top 10 technical computing sites had a peak capacity of only about 850 GigaFLOPS, with each site containing hundreds of computers. The goal of 1 TeraFLOPS (i.e., 10^12 floating-point operations per second) peak performance was reached in late 1996 with the installation of an Intel supercomputer at Sandia Laboratories.

The PetaFLOPS performance objective seems to be a distant dream primarily because of what is currently viewed as the insurmountable difficulty of developing low-complexity, high-bisection-bandwidth, and low-latency interconnection networks to connect thousands of processors (and remote memories in DSM systems). To quote Dally, “wires are a limiting factor because of power and delay as well as density” [6]. Several interconnection networks have been proposed for the design of massively parallel computers, including, among others, regular meshes and tori [5], enhanced meshes, fat trees, (direct binary) hypercubes [17], and hypercube variations [15], [24], [26], [27]. The hypercube dominated the high-performance computing field in the 1980s because it has good topological properties and rather rich interconnectivity that permits efficient emulation of many topologies frequently employed in the development of algorithms [17], [29]. Nevertheless, these properties come at the cost of often prohibitively high VLSI (primarily wiring) complexity, due to a dramatic increase in the number of communication channels with any increase in the number of PEs (processing elements). This high VLSI complexity is undoubtedly the hypercube’s dominant drawback; it limits scalability [25] and does not permit the construction of powerful, massively parallel systems. The versatility of the hypercube in efficiently emulating other important topologies constitutes an incentive for the introduction of hypercube-like interconnection networks of lower complexity that, nevertheless, preserve to a large extent the former’s topological properties [26], [27].

To support scalability, current approaches to massively parallel processing use bounded-degree networks, such as meshes or k-ary n-cubes (i.e., tori), with low node degree (e.g., FLASH [11], Cray Research MPP, Intel Paragon, and Tera). However, low-degree networks result in large diameter, large average internode distance, and small bisection bandwidth. Relevant approaches that employ reconfiguration to enhance the capabilities of the basic mesh architecture (e.g., reconfigurable mesh, mesh with multiple broadcasting, and mesh with separable broadcast buses) will not become feasible for massively parallel processing in the foreseeable future because of the requirements for long clock cycles and precharged switches to facilitate the transmission of messages over long distances [32].

The high VLSI complexity problem is unbearable for generalized hypercubes. Contrary to nearest-neighbor k-ary n-cubes, which form rings of k nodes in each dimension, generalized hypercubes implement fully connected systems of k nodes in each dimension [2]. The nD (symmetric) generalized hypercube GH(n,k) contains k^n nodes. The address of a node is x_{n−1} x_{n−2} ⋯ x_1 x_0, where each x_i is a radix-k digit with 0 ≤ x_i ≤ k−1. This node is a neighbor of the nodes with addresses x_{n−1} x_{n−2} ⋯ x_i′ ⋯ x_1 x_0 for all 0 ≤ i ≤ n−1 and x_i′ ≠ x_i. Therefore, two nodes are neighbors if and only if their n-digit addresses differ in a single digit. For the sake of simplicity, we restrict our discussion to symmetric generalized hypercubes, where the nodes have the same number of neighbors in all dimensions. Therefore, each node has k−1 neighbors in each dimension, for a total of n(k−1) neighbors per node. The nD GH(n,k) has diameter equal to only n. Fig. 1 shows the GH(2,7), with two dimensions (i.e., n=2) and k=7. For n=2 and k an even number, the diameter of the generalized hypercube is only 2 and its bisection width is an immense k^3/4 channels. In exchange for increased VLSI/wiring cost, generalized hypercubes offer outstanding performance that permits optimal emulation of hypercubes and k-ary n-cubes, as well as efficient implementation of complex communication patterns.
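To make these definitions concrete, the following Python sketch (ours, for illustration only; the parameter values are arbitrary) enumerates the neighbors of a GH(n,k) node and reports the node count, node degree, diameter, and, for n=2 with even k, the bisection width quoted above.

```python
# Minimal sketch (illustrative only): topological figures of the symmetric
# generalized hypercube GH(n, k), following the definitions in the text.

def gh_neighbors(addr, k):
    """Neighbors of a node in GH(n, k); addr is a tuple of n radix-k digits.
    Two nodes are neighbors iff their addresses differ in exactly one digit."""
    n = len(addr)
    for i in range(n):                      # dimension being varied
        for digit in range(k):              # every other value of that digit
            if digit != addr[i]:
                yield addr[:i] + (digit,) + addr[i + 1:]

def gh_figures(n, k):
    """Node count, neighbors per node, diameter, and (for n = 2, k even)
    the bisection width k^3/4 quoted in the text."""
    nodes = k ** n
    degree = n * (k - 1)
    diameter = n
    bisection = k ** 3 // 4 if (n == 2 and k % 2 == 0) else None
    return nodes, degree, diameter, bisection

if __name__ == "__main__":
    # Example: the GH(2,7) of Fig. 1 -- each node has 2*(7-1) = 12 neighbors.
    print(sorted(gh_neighbors((3, 5), k=7)))
    # Example: GH(2,128) with 16 384 nodes, degree 254, diameter 2,
    # and bisection width 128^3/4 = 524 288 channels.
    print(gh_figures(n=2, k=128))
```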

Table 1 compares the numbers of channels in the binary hypercube (i.e., the m-cube), the k-ary n-cube, and the generalized hypercube GH(n,k), all with the same number N of processors. We assume bidirectional data channels for full-duplex communication, and that N = k^n = 2^m (therefore, k = N^(1/n) = 2^(m/n)). For example, for m=14 (i.e., for systems with N=16 384 processors) and 64-bit channels, the numbers of wires for data transfers are as follows (see also the sketch after this list):

  • 14 680 064 wires for the 14-cube;

  • 4 194 304 wires for the 128-ary 2-cube; and

  • 266 338 304 wires for the GH(2,128).
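The following Python sketch (illustrative only) reproduces these counts under the stated assumptions, i.e., 64-bit channels and two unidirectional wire sets per bidirectional link.

```python
# Minimal sketch (illustrative only): wire counts for 64-bit full-duplex data
# channels, assuming each bidirectional link is realized as two unidirectional
# 64-bit paths, as in the comparison above.

WIDTH = 64          # bits per channel
DUPLEX = 2          # two directions per bidirectional link

def wires(links):
    return links * DUPLEX * WIDTH

N = 2 ** 14         # 16 384 processors

# 14-cube: every node has m = 14 neighbors.
hypercube_links = N * 14 // 2
# 128-ary 2-cube (torus): every node has 2n = 4 neighbors.
torus_links = N * 4 // 2
# GH(2,128): every node has n(k-1) = 2 * 127 = 254 neighbors.
gh_links = N * 254 // 2

print(wires(hypercube_links))   # 14 680 064
print(wires(torus_links))       #  4 194 304
print(wires(gh_links))          # 266 338 304
```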

In order to reduce the number of communication channels in systems similar to the generalized hypercube, the spanning bus hypercube uses a shared bus for the implementation of each fully connected subsystem in a given dimension [24]. However, shared buses result in significant performance degradation because of the overhead imposed by the arbitration protocol that determines bus ownership for each transfer. Similarly, hypergraph architectures implement all possible permutations of their nodes in each dimension by employing crossbar switches [21]. Reconfigurable generalized hypercubes interconnect all nodes in each dimension dynamically via a scalable mesh of very simple, low-cost programmable switches [28]. However, all of these proposed reductions in hardware complexity may not be sufficient for very high-performance computing, such as PetaFLOPS-related computing. To quote Patterson, “Currently the most expensive scheme is a crossbar switch, which provides an explicit path between every communicating device. This becomes prohibitively expensive when connecting thousands of processors” [14].

To summarize, low-dimensional massively parallel computers with full connectivity among the nodes in each dimension, such as generalized hypercubes, are very desirable because of their outstanding topological properties (e.g., extremely small diameter and average internode distance, and immense bisection width), but their electronic implementation is a Herculean task because of packaging (and primarily wiring) constraints. Therefore, the introduction of pioneering technologies for the implementation of such systems could give life, for the first time, to scalable and feasible computing platforms capable of very high performance. This is our main objective. We have chosen a combination of electrical and free-space optical technologies to satisfy this objective. Our free-space optics approach results in a dramatic reduction in the number of wires. Pure electrical implementations of systems with thousands of processors also suffer from another major drawback: processor speeds increase much faster than memory and interconnection network speeds, and therefore very advanced memory-latency hiding mechanisms, namely prefetching, cache coherence, multithreading, and relaxed memory consistency, must be developed. However, mitigation of the memory-latency problem is possible if free-space optical technologies are used for the implementation of large, almost fully connected, point-to-point interconnection networks.

Optical technologies have been enlisted before in the design of parallel computers [1], [8], [10], [21]. However, past efforts were often plagued by large power consumption (often due to redundant broadcasts), inefficient reconfiguration schemes with mechanical components that did not match electronic speeds, unreliability, strict alignment requirements, large complexity prohibiting scalability, etc. Also, interconnects with wavelength selectivity for channel allocation have been under extensive study recently [1], [8], [16].

An optical crossbar-like multichannel switch is employed in [21] to support full connectivity in 1D subsystems for the implementation of a “hypermesh”. A single fiber is used with WDM techniques to implement permutations among all nodes in the subsystem. The maximum size of the WDM optical crossbar is limited to only 16 nodes for the foreseeable future due to constraints on wavelength tunability with a single fiber. To build a massively parallel hypermesh machine, one must use many smaller WDM optical switches arranged into a chosen network architecture. Such a machine, however, may be slow in data transfers, routing may become cumbersome, and the cost and packaging complexity of the interconnection network may be prohibitively high. In contrast, our design is scalable, has very low packaging complexity, uses fast point-to-point interconnection technology, and is characterized by low power consumption. Our design can also implement not only permutations among the nodes but also more powerful communication operations, such as multicasting, broadcasting, and all-to-all personalized communication. Optical hypermeshes can implement a class of systems that form a subset of our optics-based generalized hypercube architecture.

Other proven architectural features, in addition to the chosen interconnection network technologies, are required for the success of any proposed system. DSM systems already dominate the massively parallel processing field [4], [11] because the simultaneous incorporation of the message-passing and shared-memory communication paradigms introduces versatility in programming [4], [12]. Around the year 2005, 8- to 16-way multithreaded microprocessors are expected to be common [9], [13]. A DSM system with thousands of processors may then be handling hundreds of thousands to a million threads simultaneously, making the problems of cache coherence, debugging, scheduling, and performance monitoring extremely difficult [9]. Hardware/software codesign will be needed to develop relevant solutions for such systems.
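As a rough illustration of the thread-count claim, the short sketch below multiplies assumed processor counts (chosen by us for illustration, not taken from our case study) by the 8- to 16-way multithreading degree mentioned above.

```python
# Back-of-the-envelope sketch (illustrative processor counts only): total
# hardware threads in a DSM system built from 8- to 16-way multithreaded CPUs.

for processors in (16_384, 65_536):   # illustrative "thousands of processors"
    for ways in (8, 16):              # 8- to 16-way multithreading
        print(f"{processors} processors x {ways} threads/processor = "
              f"{processors * ways:,} concurrent threads")
# Output ranges from 131,072 up to 1,048,576 threads, i.e., hundreds of
# thousands to about a million simultaneously active threads.
```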

To summarize, we strongly believe that free-space optical technologies will have a very significant influence on massively parallel processing because of reduced packaging complexity that facilitates the construction of powerful systems with increased connectivity. Optical technologies employing simultaneously TDM and WDM techniques eliminate the need for wires in the implementation of communications channels, and could be used to implement densely populated, fully connected BBs with large numbers of processors and small packaging complexity. Ours is a meticulous effort towards formulating a relevant, attainable objective and presenting a viable solution to fulfill this objective. This effort covers innovative architectures, feasibility analysis for corresponding designs, applications development, and performance evaluation [30], [31].

Our paper is organized as follows. Section 2 presents the basic architecture of our NSF/DARPA/NASA-funded New Millennium Computing Point Design. Section 3 contains a detailed description of our design for a system capable of 100 TeraFLOPS by the year 2005. Overall performance characteristics of this system and a feasibility analysis are also included. Section 4 presents an analysis for the optical component of the interconnection network. Section 5 contains performance results for some important computation- and/or communication-intensive problems. Finally, conclusions are presented in Section 6.

Section snippets

Basic architecture

Our architecture encompasses a 2D interconnection network that employs electrical and optical technologies. Section 2.1 presents the structure of the 1D building block (BB) and issues related to its implementation. The 2D complete system is constructed by repeating this 1D BB in two dimensions, and also incorporating additional glue logic. Section 2.2 describes the 2D complete structure.
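As a rough illustration (not the exact scheme of Section 2), the sketch below adopts one plausible addressing convention for such a 2D organization, assuming eight processors per card, an intra-card electronic crossbar, and full connectivity of cards along each row and column; the address fields and the routing rule are our own simplifying assumptions.

```python
# Illustrative sketch only: one plausible addressing/routing view of the 2D
# structure described above. Assumptions: 8 processors share each card via an
# electronic crossbar and a TDM optical interface; cards are fully connected
# along their row and along their column (generalized-hypercube style).

from typing import NamedTuple

PROCS_PER_CARD = 8   # from the abstract; shared TDM optical interface

class Addr(NamedTuple):
    row: int    # card row in the 2D arrangement
    col: int    # card column in the 2D arrangement
    local: int  # processor index within the card (0..7)

def route(src: Addr, dst: Addr) -> str:
    """Which resource a message uses under the assumed scheme."""
    if (src.row, src.col) == (dst.row, dst.col):
        return "intra-card electronic crossbar"
    if src.row == dst.row or src.col == dst.col:
        return "one optical hop along the shared row/column path"
    # Addresses differ in both dimensions: one row hop plus one column hop.
    return "two optical hops (row, then column)"

if __name__ == "__main__":
    a = Addr(row=3, col=5, local=2)
    print(route(a, Addr(3, 5, 7)))   # same card
    print(route(a, Addr(3, 9, 0)))   # same row, different card
    print(route(a, Addr(8, 1, 4)))   # different row and column
```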

Case study: 100 Tera(FL)OPS performance

Our design is the result of a project funded jointly by NSF, DARPA, and NASA, under their New Millennium Computing Point Design program, which has funded eight projects nationwide. The objective of this federal program is the design and feasibility analysis of massively parallel computers that could deliver 100 TeraFLOPS performance by the year 2005. The sponsoring agencies believe that the success of this aggressive program will lead by the year 2007 to the development of computers capable of PetaFLOPS (i.e., 10^15…

Analysis of the optical interconnection network in the BB

Here we present results of analysis, simulation, and feasibility study for the optical interconnection network of our design.

Performance evaluation

Performance evaluation results are presented here for our case-study system capable of 100 TeraOPS, focusing on its communication capabilities and on its implementation of frequently used computation- and/or communication-intensive algorithmic kernels.

Conclusions

We have proven in this paper the suitability of our “point design” for very high-performance computing. The complete system is characterized by immense bisection bandwidth and other outstanding properties. Not only can our proposed system graciously achieve its performance objective, but its dramatically low interconnect complexity also renders it viable. Such a dramatic reduction in the system interconnect complexity is not possible with any other existing or expected technology. Performance…

References (32)

  • E. Frietman, et al., Parallel optical interconnects: implementation of optoelectronics in multiprocessors architecture,...
  • Findings and Recommendations, PetaFLOPS Software Summer Study (PetaSoft’96), June...
  • M. Kajita, K. Kasahara, T.J. Kim, I. Ogura, I. Redmond, E. Schenfeld, Free-space wavelength division multiplexing...
  • D. Lenoski, et al., The Stanford FLASH multiprocessor, IEEE Comput. 3 (1992)...
  • X. Li, et al., Parallel DSP algorithms on turboNet: an experimental hybrid message-passing/shared-memory architecture, Conc.: Pract. Exp. (1996)
  • The National Technology Roadmap for Semiconductors, Semiconductor Industry Association (SIA),...

Dr. Sotirios G. Ziavras received the Diploma in EE from the National Technical University of Athens, Greece, in 1984, the MSc in ECE from Ohio University in 1985, and the DSc in Computer Science from George Washington University in 1990, where he was a Distinguished Graduate Teaching Assistant and a Graduate Research Assistant. From 1988 to 1989, he was also with the Center for Automation Research at the University of Maryland, College Park. He was a Visiting Assistant Professor in the ECE Department at George Mason University in Spring 1990. He is currently an Associate Professor in the ECE and CIS Departments at NJIT. His research interests are unconventional processor and computer systems designs, and parallel computing systems and algorithms.

Dr. Haim Grebel is a Professor of Electrical and Computer Engineering at NJIT. His research interests are in the area of linear and non-linear optics as applicable to optoelectronic networks and devices.

Dr. Anthony T. Chronopoulos received his PhD in Computer Science from the University of Illinois at Urbana-Champaign in 1987. He is an IEEE senior member. He has performed research under 10 different research grants, and has given over 50 conference and invited research lectures. He has advised over 10 graduate students. He has co-authored 50 refereed journal and conference proceedings papers in the areas of high-performance computing, distributed systems and applications, and computational science. He has been awarded four NSF grants.

Mr. Florent Marcelli was an exchange student at NJIT in 1997.

The work presented in this paper was supported in part by NSF and DARPA, and co-sponsored by NASA, under the New Millennium Computing Point Design Grant ASC-9634775.
