HPC Architecture
  1. Shared-memory SIMD machines
  2. Distributed-memory SIMD machines
  3. Shared-memory MIMD machines
  4. Distributed-memory MIMD machines
  5. ccNUMA machines
  6. Clusters
  7. Processors
    1. AMD Opteron
    2. IBM POWER7
    3. IBM BlueGene/Q processor
    4. Intel Xeon
    5. The SPARC processors
  8. Accelerators
    1. GPU accelerators
      1. ATI/AMD
      2. nVIDIA
    2. General computational accelerators
      1. Intel Xeon Phi
    3. FPGA accelerators
      1. Convey
      2. Kuberre
      3. SRC
  9. Interconnects
    1. Infiniband
Available systems
  • The Bull bullx system
  • The Cray XC30
  • The Cray XE6
  • The Cray XK7
  • The Eurotech Aurora
  • The Fujitsu FX10
  • The Hitachi SR16000
  • The IBM BlueGene/Q
  • The IBM eServer p775
  • The NEC SX-9
  • The SGI Altix UV series
  • Systems disappeared from the list
  • Systems under development

    Fast interprocessor networks are, together with fast processors, the decisive factors for both good integrated parallel systems and clusters. In the early days of clusters, interprocessor communication, and hence the scalability of applications, was hampered by the high latency and low bandwidth of the network (mostly Ethernet) that was used. This situation has changed considerably, and to give a balanced view of the possibilities opened up by the improved networks, a discussion of some of them is in order, the more so as some of these networks are, or have been, employed in "integrated" parallel systems as well.

    Of course Gigabit Ethernet (GbE) is now amply available and, with a maximum theoretical bandwidth of 125 MB/s, can fulfill a useful role for applications that are not latency-bound. Furthermore, 10 Gigabit Ethernet (10GbE) is increasingly offered. The adoption of Ethernet is hampered by the latencies incurred when the TCP/IP protocol is used for message transmission. In fact, the transmission latencies without this protocol are much lower: about 5 µs for GbE and 0.5 µs for 10GbE. Using the TCP/IP protocol, however, gives rise to latencies of somewhat less than 40 µs and in-switch latencies of 30–40 µs for GbE, and roughly 4–10 µs for 10GbE. As such, Ethernet is not quite on par with the ubiquitous Infiniband interconnects with regard to latency and bandwidth. However, its costs are lower and may compensate for the somewhat lower performance in many cases. Various vendors, like Myrinet and SCS, have circumvented the TCP/IP problem by implementing their own protocol, using standard 10GbE equipment but with their own network interface cards (NICs) to handle the proprietary protocol. In this way latencies of 2–4 µs can be achieved, well within the range of other network solutions. Mellanox now provides 40 GbE with an application latency of 4 µs.

    We restrict ourselves here to networks that are marketed independently, as the proprietary networks of systems like those of Cray and SGI are discussed together with the systems in which they are incorporated. We do not pretend to be complete, because in this field players enter and leave the scene at a high rate. Rather, we present the main developments one is likely to meet when scanning the high-performance computing arena.

    A complication with the fast networks offered for clusters is the connection to the nodes. Whereas in integrated parallel machines the access to the nodes is customised and can be made such that the bandwidth of the network matches the internal bandwidth of a node, in clusters one has to make do with the PCI bus connection that comes with the PC-based node. The type of PCI bus, which ranges from 32-bit wide at 33 MHz to 64-bit wide at 66 MHz, determines how fast data from the network can be shipped into and out of the node, and therefore the maximum bandwidth that can be attained in internode communication. In practice the available bandwidths are in the range of 110–1024 MB/s. In 1999 the coupling started with PCI-X 1.0: 250 MB/s per lane, 64-bit wide at 66 MHz. It was replaced by PCI-X 2.0 at double the speed, and subsequently by PCI Express (PCIe). Currently the second generation, PCIe Gen2, is the most common form at 500 MB/s for a single lane, but often up to 16 lanes are employed, giving up to 8 GB/s. This PCIe Gen2 ×16 is also often used to connect accelerators to the host system. Since September 2011 the first PCIe Gen3 products have become available, again doubling the speed to 1 GB/s per lane. As 1×, 2×, 4×, 8×, 12×, 16×, and 32× multiple data lanes are supported, this is fast enough for the host bus adapters of any communication network vendor so far. Presently, PCIe Gen3 is only used by Cray (see the Cray XC30) and in Mellanox's Infiniband (see Infiniband).

    An idea of the network bandwidths and latencies for some networks, both proprietary and vendor-independent, is given in Table 2.3. Warning: the entries are only approximate because they also depend on the exact switch and host bus adapter characteristics, as well as on the internal bus speeds of the systems. The circumstances under which these values were obtained were very diverse, so there is no guarantee that these are the optimal attainable results. Obviously, we cannot give maximum latencies as these depend on the system size and the interconnect topology.

    Table 2.3: Some bandwidths and latencies for various networks as measured with an MPI Ping-Pong test.

    Network                        Bandwidth (GB/s)   Latency (µs)
    Arista 10GbE (stated)                1.2              4.0
    BLADE 10GbE (measured)               1.0              4.0
    Cray SeaStar2+ (measured)            6.0              4.5
    Cray Gemini (measured)               6.1              1.0
    Cray Aries (measured)                9.5              1.2
    SGI NumaLink 5 (measured)            5.9              0.4
    Infiniband, FDR14 (measured)         5.8              ≤ 1