Networks


Fast interprocessor networks are, together with fast processors, the decisive factors for both good integrated parallel systems and clusters. In the early days of clusters, interprocessor communication, and hence the scalability of applications, was hampered by the high latency and the lack of bandwidth of the network (mostly Ethernet) that was used. This situation has changed considerably, and to give a balanced view of the possibilities opened up by the improved networks, a discussion of some of these networks is in order; the more so as some of them are, or have been, employed also in "integrated" parallel systems.

Of course Gigabit Ethernet (GbE) is now amply available and, with a maximum theoretical bandwidth of 125 MB/s, would be able to fulfill a useful role for some applications that are not latency-bound in any way. Furthermore, 10 Gigabit Ethernet (10 GigE) is increasingly offered. The adoption of Ethernet is hampered by the latencies that are incurred when the TCP/IP protocol is used for the message transmission. In fact, the transmission latencies without this protocol are much lower: about 5 µs for GbE and 0.5 µs for 10 GigE. Using the TCP/IP protocol, however, gives rise to latencies of somewhat less than 40 µs and in-switch latencies of 30–40 µs for GbE, and roughly 4–10 µs for 10 GigE. As such it is not quite on par with the ubiquitous Infiniband interconnects with regard to latency and bandwidth. However, the costs are lower and may compensate for a somewhat lower performance in many cases. Various vendors, like Myricom and SCS, have circumvented the TCP/IP problem by implementing their own protocol, using standard 10 GigE equipment but with their own network interface cards (NICs) to handle the proprietary protocol. In this way latencies of 2–4 µs can be achieved: well within the range of other network solutions. Very recently Mellanox came out with 40 GbE on an InfiniBand fabric. It is too early, however, to give characteristics of this new medium.
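
To get a feeling for what these latency differences mean in practice, one can use the usual first-order model for the transfer time of an n-byte message, t(n) = latency + n/bandwidth. The small C sketch below evaluates this model for the raw and TCP/IP latencies quoted above; the 64 kB message size and the exact figures used are illustrative assumptions, not measurements.

    /* First-order transfer-time model t(n) = latency + n/bandwidth,
     * filled in with the approximate GbE/10GigE figures quoted above.
     * The 64 kB message size is an arbitrary, illustrative choice. */
    #include <stdio.h>

    int main(void)
    {
        const double n = 64e3;                       /* message size in bytes */
        const struct { const char *name; double lat_us, bw_MBs; } nets[] = {
            { "GbE, raw",        5.0,  125.0 },
            { "GbE, TCP/IP",    40.0,  125.0 },
            { "10GigE, raw",     0.5, 1250.0 },
            { "10GigE, TCP/IP", 10.0, 1250.0 },
        };

        for (int i = 0; i < 4; i++) {
            /* convert the bandwidth term to microseconds and add the latency */
            double t_us = nets[i].lat_us + n / (nets[i].bw_MBs * 1e6) * 1e6;
            printf("%-16s %8.1f us\n", nets[i].name, t_us);
        }
        return 0;
    }

For a 64 kB message the bandwidth term dominates and the protocol overhead is hardly visible, whereas for messages of a few kB or less the TCP/IP latency becomes the dominant cost, which is exactly why latency-bound applications suffer on plain Ethernet.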

We restrict ourselves here to networks that are marketed independently, as the proprietary networks for systems like those of Cray and SGI are discussed together with the systems in which they are incorporated. We do not pretend to be complete, because in this field players enter and leave the scene at a high rate. Rather, we present the main developments that one is likely to meet when scanning the high-performance computing arena. Unfortunately, the spectrum of network types has been narrowed by the demise of Quadrics. Quadrics' QsNetII was rather expensive but had excellent characteristics. The next generation, QsNetIII, was on the brink of deployment when the Italian mother company, Alenia, terminated Quadrics, much to the regret of HPC users and vendors.

A complication with the fast networks offered for clusters is the connection to the nodes. Whereas in integrated parallel machines the access to the nodes is customised and can be made such that the bandwidth of the network matches the internal bandwidth of a node, in clusters one has to make do with the PCI bus connection that comes with the PC-based node. The type of PCI bus, which ranges from 32-bit wide at 33 MHz to 64-bit wide at 66 MHz, determines how fast data from the network can be shipped in and out of the node, and therefore the maximum bandwidth that can be attained in internode communication. In practice the available bandwidths are in the range of 110–480 MB/s. PCI-X has been available since 1999, initially at 1 GB/s and, with PCI-X 2.0, also at 2 and 4 GB/s. Coupling via PCI-X is now common in PC nodes that are meant to be part of a cluster. More recently PCI Express has become available. It provides a bandwidth of 200 MB/s per data lane, where 1×, 2×, 4×, 8×, 12×, 16×, and 32× multiple data lanes are supported: this makes it amply sufficient for the host bus adapters of any communication network vendor so far. Consequently, for the networks discussed below often different bandwidths are quoted, depending on the PCI bus type and the supporting chip set. Therefore, when speeds are quoted, it is always with the proviso that the PCI bus of the host node is sufficiently wide/fast.
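
As a back-of-the-envelope illustration of the lane arithmetic above, the sketch below multiplies the 200 MB/s per-lane figure by the supported lane counts and checks whether the resulting slot bandwidth can feed a given host bus adapter. The 1.2 GB/s adapter bandwidth used here is an assumption, chosen only to resemble the adapters in Table 2.3.

    /* Rough check whether a PCI Express slot can feed a network adapter.
     * Uses the 200 MB/s per-lane figure quoted above; the 1.2 GB/s
     * adapter bandwidth is an illustrative assumption. */
    #include <stdio.h>

    int main(void)
    {
        const double lane_MBs = 200.0;                  /* per PCIe data lane */
        const int lanes[] = { 1, 2, 4, 8, 12, 16, 32 }; /* supported widths   */
        const double nic_MBs = 1200.0;                  /* assumed adapter    */

        for (int i = 0; i < 7; i++) {
            double slot_MBs = lanes[i] * lane_MBs;
            printf("x%-2d slot: %6.0f MB/s  %s\n", lanes[i], slot_MBs,
                   slot_MBs >= nic_MBs ? "sufficient" : "too slow");
        }
        return 0;
    }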

Lately, PCIe 2, commonly known as PCIe Gen2, has emerged with twice the bandwidth. Currently PCIe Gen2 is mostly used within servers to connect to high-end graphics cards (including GPUs used as computational accelerators) at speeds of 4–8 GB/s, but evidently it could also be used to connect to other computational accelerators or to network interface cards that are designed to work at these speeds.

An impression of network bandwidths and latencies for some networks, both proprietary and vendor-independent, is given in Table 2.3. Warning: the entries are only approximate, because they also depend on the exact switch and host bus adapter characteristics as well as on the internal bus speeds of the systems. The circumstances under which these values were obtained were very diverse, so there is no guarantee that these are the optimum attainable results.

Table 2.3: Some bandwidths and latencies for various networks as measured with an MPI Ping-Pong test.

Network                        Bandwidth (GB/s)   Latency (µs)
Arista 10GbE (stated)                1.2               4.0
BLADE 10GbE (measured)               1.0               4.0
Cray SeaStar2+ (measured)            6.0               4.5
Cray Gemini (measured)               6.1               1.0
IBM (Infiniband) (measured)          1.2               4.5
SGI NumaLink 5 (measured)            5.9               0.4
Infiniband (measured)                1.3               4.0
Infinipath (measured)                0.9               1.5
Myrinet 10-G (measured)              1.2               2.1
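
The figures in Table 2.3 were obtained with ping-pong measurements; the skeleton below shows in outline how such a test looks in MPI. It is only a minimal sketch, not the actual benchmark code behind the table: the 1 MiB message size and the repetition count are arbitrary choices, and real measurements scan a whole range of message sizes to separate latency from bandwidth.

    /* Minimal MPI ping-pong sketch (an illustration, not the benchmark
     * behind Table 2.3).  Rank 0 sends a message to rank 1 and waits for
     * the echo; half the averaged round-trip time approximates the one-way
     * transfer time, from which a bandwidth can be derived. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_SIZE (1 << 20)   /* 1 MiB payload, an arbitrary choice */
    #define NREP     100         /* repetitions to average out noise   */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = malloc(MSG_SIZE);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < NREP; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / NREP;   /* average round-trip time */

        if (rank == 0)
            printf("one-way time %.2f us, bandwidth %.2f GB/s\n",
                   0.5 * t * 1e6, MSG_SIZE / (0.5 * t) / 1e9);

        free(buf);
        MPI_Finalize();
        return 0;
    }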