Infiniband

Infiniband has rapidly become a widely accepted medium for internode networks. The specification was finished in June 2001, and from 2002 on a number of vendors have offered products based on the Infiniband standard. A very complete description (1200 pages) can be found in [31]. Infiniband is employed to connect various system components within a system. Via Host Channel Adapters (HCAs) the Infiniband fabric can be used for interprocessor networks, for attaching I/O subsystems, or for connecting to multi-protocol switches such as Gbit Ethernet switches. Because of this versatility the market is not limited to the interprocessor network segment alone, and Infiniband is therefore expected to become relatively inexpensive because a higher sales volume can be realised. The characteristics of Infiniband are attractive: there are product definitions for both copper and glass fiber connections, switch and router properties are defined, and for high bandwidth multiple links can be employed. The way messages are broken up into packets and reassembled, as well as routing, prioritising, and error handling, are all described in the standard. This makes Infiniband independent of a particular technology and, because of its completeness, a good basis on which to implement a communication library (like MPI).

Conceptually, Infiniband knows two types of connectors to the system components: the Host Channel Adapters (HCAs), already mentioned, and Target Channel Adapters (TCAs). The latter are typically used to connect to I/O subsystems, while the HCAs concern us more, as these are the connectors used in interprocessor communication.
In Table 2.4 we list the theoretical bandwidths for 1-wide and 4-wide links at the data rates that are presently of interest, as mostly 4-wide Infiniband is used in interconnects.

Table 2.4: Theoretical bandwidth for some Infiniband data rates for 1 and 4 links.

              SDR    DDR    QDR    FDR10   FDR14   EDR
              GB/s   GB/s   GB/s   GB/s    GB/s    GB/s
    1 link    0.25   0.5    1.0    1.25    1.71    3.13
    4 links   1.0    2.0    4.0    5.0     6.82   12.5

Of the data rates listed in the table (for the full names of the abbreviated data rates see the glossary), DDR, QDR and FDR (10 or 14) are presently in use; at this moment mostly QDR and FDR are offered. The bandwidth difference between FDR10 and FDR (also called FDR14) lies in the different encoding schemes: from SDR up to FDR10 an 8/10-bit scheme is used, meaning that of every 10 bits transmitted 8 carry data while 2 bits are overhead. In FDR(14) and EDR a different scheme is used in which of every 66 bits 64 represent data and 2 bits are overhead.
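
As a check on the figures in Table 2.4, the sketch below recomputes the theoretical bandwidths from the per-lane signalling rate multiplied by the encoding efficiency described above. The signalling rates used (2.5, 5, 10, 10.3125, 14.0625 and 25.78125 Gbit/s for SDR through EDR) are not given in the text and are assumed here as the commonly published nominal values.

  /* Recompute the theoretical Infiniband bandwidths of Table 2.4 from
   * nominal per-lane signalling rates and the encoding efficiency.
   * The signalling rates are assumed commonly published values, not
   * taken from the text above. Compile: cc ib_bw.c -o ib_bw */
  #include <stdio.h>

  struct rate {
      const char *name;
      double gbit_per_lane;   /* raw signalling rate in Gbit/s per lane   */
      double efficiency;      /* 8/10 for SDR..FDR10, 64/66 for FDR14/EDR */
  };

  int main(void)
  {
      const struct rate rates[] = {
          { "SDR",    2.5,      8.0 / 10.0 },
          { "DDR",    5.0,      8.0 / 10.0 },
          { "QDR",   10.0,      8.0 / 10.0 },
          { "FDR10", 10.3125,  64.0 / 66.0 },
          { "FDR14", 14.0625,  64.0 / 66.0 },
          { "EDR",   25.78125, 64.0 / 66.0 },
      };
      printf("%-6s %12s %12s\n", "rate", "1 link", "4 links");
      for (size_t i = 0; i < sizeof rates / sizeof rates[0]; i++) {
          /* divide by 8 to convert Gbit/s to GB/s */
          double one_link = rates[i].gbit_per_lane * rates[i].efficiency / 8.0;
          printf("%-6s %7.2f GB/s %7.2f GB/s\n",
                 rates[i].name, one_link, 4.0 * one_link);
      }
      return 0;
  }

Small deviations from the table entries (e.g. 1.70 vs. 1.71 GB/s for a single FDR14 link) are due to rounding.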

Messages can be sent on the basis of Remote Direct Memory Access (RDMA) from one HCA/TCA to another: an HCA/TCA is permitted to read/write the memory of another HCA/TCA. This enables very fast transfers once permission and a read/write location have been given. A port, together with its HCA/TCA, provides a message with a 128-bit header which is IPv6 compliant and which is used to direct the message to its destination via cut-through wormhole routing: in each switching stage the route to the next stage is decoded and the message is sent on. Short messages of 32 B can be embedded in control messages, which cuts down on the negotiation time for control messages.
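
The RDMA model just described is what MPI libraries build their one-sided operations on. The sketch below does not use the Infiniband verbs interface itself; it is a minimal illustration, at the MPI level, of the same idea of writing directly into another process's memory: rank 1 exposes a window of its memory and rank 0 writes into it with MPI_Put.

  /* Illustration of the RDMA-style one-sided model at the MPI level:
   * rank 1 exposes a memory window, rank 0 writes into it with MPI_Put.
   * This is a sketch of the programming model, not of the Infiniband
   * verbs interface an MPI library would use underneath.
   * Build/run (example): mpicc rma.c -o rma && mpirun -np 2 ./rma */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value = 0;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Every rank exposes one int; only rank 1's window is written to. */
      MPI_Win_create(&value, sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);              /* open an access epoch */
      if (rank == 0) {
          int payload = 42;
          /* Write directly into rank 1's exposed memory (RDMA-style). */
          MPI_Put(&payload, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
      }
      MPI_Win_fence(0, win);              /* close the epoch; data is visible */

      if (rank == 1)
          printf("rank 1 received %d via MPI_Put\n", value);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }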

To take advantage of the speed of QDR, at least PCIe Gen2 ×8 must be present at the nodes to which the HCAs are connected, and for FDR PCIe Gen3 is required. The switches can be configured in any desired topology, but in practice a fat tree topology is mostly preferred (see Figure 7b). How much of the raw speed can be realised depends, of course, on the quality of the MPI implementation put on top of the Infiniband layer. FDR-based interconnects are now becoming routinely deployed, and bandwidths of 5–6 GB/s and an MPI latency of < 1 µs for small messages are quoted by Mellanox, one of the large Infiniband vendors. The in-switch latency is typically about 200 ns. Until early 2012 QDR Infiniband products were available from Mellanox and Qlogic; however, Qlogic was absorbed by Intel during 2012, presumably with the intent of producing its own interconnect technology in the near future. For the moment we are in the rather uncomfortable situation that only Mellanox is left as an Infiniband product vendor.
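
Latency and bandwidth figures such as those quoted above are usually obtained with a simple ping-pong microbenchmark. The sketch below, assumed to be run with two MPI ranks on different nodes, times round trips for a fixed message size; half the round-trip time is the usual one-way latency figure, and the message size divided by the one-way time gives the achieved bandwidth.

  /* Minimal ping-pong sketch for measuring one-way time and bandwidth
   * between two MPI ranks, in the spirit of the vendor figures quoted
   * in the text. Run with exactly 2 ranks: mpirun -np 2 ./pingpong */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NITER 1000

  int main(int argc, char **argv)
  {
      int rank, size;
      const int msg_bytes = 1 << 20;          /* 1 MiB message, adjustable */
      char *buf = malloc(msg_bytes);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < NITER; i++) {
          if (rank == 0) {
              MPI_Send(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
          } else {
              MPI_Recv(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              MPI_Send(buf, msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
      }
      double one_way = (MPI_Wtime() - t0) / (2.0 * NITER);  /* seconds */

      if (rank == 0)
          printf("%d bytes: one-way time %.2f us, bandwidth %.2f GB/s\n",
                 msg_bytes, one_way * 1e6, msg_bytes / one_way / 1e9);

      free(buf);
      MPI_Finalize();
      return 0;
  }

For small messages the same loop with msg_bytes set to a few bytes yields the latency figure; for large messages the bandwidth approaches the limits of Table 2.4, less the protocol and PCIe overheads.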