The Cray Inc. XE6/XE6m

    Machine type: Distributed-memory multi-processor
    Models: XE6/XE6m
    Operating system: CNL, Cray's microkernel Unix (for the compute nodes)
    Connection structure: XE6: 3-D torus; XE6m: 2-D torus
    Compilers: Fortran 95, C, C++, UPC, Co-Array Fortran
    Vendor's information Web page: www.cray.com/Products/XE/Systems/XE6.aspx, www.cray.com/Products/XE/Systems/XE6m.aspx
    Year of introduction: XE6: 2010; XE6m: 2011

    System parameters:

    Model                        Cray XE6m            Cray XE6
    Clock cycle                  2.1–2.8 GHz          2.1–2.8 GHz
    Theor. peak performance
      Per Processor              105.2 Gflop/s        105.2 Gflop/s
      Per Cabinet                12.2–20.2 Tflop/s    12.2–20.2 Tflop/s
      Max. Configuration         121.2 Tflop/s        not given
    Memory
      Per Cabinet                ≤ 30.1 TB            ≤ 30.1 TB
      Max. Configuration         ≤ 184 TB             not given
    No. of processors
      Per Cabinet                192                  192
      Max. Configuration         1152                 not given
    Communication bandwidth
      Point-to-point             ≤ 8.3 GB/s           ≤ 8.3 GB/s
      Bisectional/cabinet        2.39 TB/s            2.39 TB/s
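
The peak figures in the table can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only numbers quoted on this page (the six-cabinet limit is the one stated for the XE6m in the remarks); the function and constant names are illustrative and not taken from any Cray documentation.

```python
# Figures quoted on this page; names are illustrative only.
PEAK_PER_PROCESSOR_GFLOPS = 105.2   # upper end of the quoted range
PROCESSORS_PER_CABINET = 192
MAX_CABINETS_XE6M = 6               # XE6m limit stated in the remarks


def cabinet_peak_tflops(per_processor_gflops, processors_per_cabinet):
    """Theoretical peak of one cabinet in Tflop/s."""
    return per_processor_gflops * processors_per_cabinet / 1000.0


def system_peak_tflops(cabinets, per_processor_gflops, processors_per_cabinet):
    """Theoretical peak of a system with the given number of cabinets."""
    return cabinets * cabinet_peak_tflops(per_processor_gflops,
                                          processors_per_cabinet)


cabinet_peak = cabinet_peak_tflops(PEAK_PER_PROCESSOR_GFLOPS,
                                   PROCESSORS_PER_CABINET)
xe6m_peak = system_peak_tflops(MAX_CABINETS_XE6M,
                               PEAK_PER_PROCESSOR_GFLOPS,
                               PROCESSORS_PER_CABINET)
xe6m_processors = MAX_CABINETS_XE6M * PROCESSORS_PER_CABINET

print(f"Per cabinet: {cabinet_peak:.1f} Tflop/s")
print(f"XE6m maximum: {xe6m_peak:.1f} Tflop/s on {xe6m_processors} processors")
```

Rounded to one decimal, the results agree with the 20.2 Tflop/s per cabinet, 121.2 Tflop/s and 1152 processors listed for the largest XE6m.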

    Remarks:

    Until the introduction of the XC30 the structure of the Cray machines was very stable over the years: a 3-D torus that connects the processor nodes. The XE6 is the last in this line. The nodes as well as the routers have gone through quite a development, however: from the earliest XT systems with a single AMD core to the XE6 with two 16-core Interlagos processors per node. The interconnect routers likewise evolved from the first SeaStar router to the new Gemini, perhaps the most distinguishing feature of the system. The Gemini is based on the 48-port YARC chip, which has an internal aggregate bandwidth of 160 GB/s. Since the Gemini Network Interface Card (NIC) operates at 650 MHz and is able to transfer 64 B every 5 cycles, the peak bandwidth per direction is 8.3 GB/s, while the latency varies from 0.7 to 1.4 µs depending on the type of transfer [1]. In practice, bandwidths of over 6 GB/s per direction were measured, compatible with the claim in Cray's brochure of an injection bandwidth of over 20 GB/s per node. A nice feature of the Gemini router is that it supports adaptive routing, even on a packet-by-packet basis. As the 3-D torus topology is vulnerable to link failures, this makes the network much more robust.
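
The 8.3 GB/s figure follows directly from the NIC clock and transfer size given above. The following is a minimal sketch of that arithmetic; the function name is illustrative, and the 6 GB/s value is the lower bound of the measured bandwidth mentioned in the text.

```python
def nic_bandwidth_gb_per_s(clock_hz, bytes_per_transfer, cycles_per_transfer):
    """Peak bandwidth per direction in GB/s (1 GB = 1e9 bytes)."""
    transfers_per_second = clock_hz / cycles_per_transfer
    return transfers_per_second * bytes_per_transfer / 1e9


# Gemini NIC figures from the text: 650 MHz, 64 B every 5 cycles.
peak_bw = nic_bandwidth_gb_per_s(650e6, 64, 5)

# Fraction of the peak reached by the measured "over 6 GB/s".
measured_fraction = 6.0 / peak_bw

print(f"Peak per direction: {peak_bw:.2f} GB/s")
print(f"Measured 6 GB/s is {measured_fraction:.0%} of peak")
```

So the measured bandwidth corresponds to a little over 70% of the theoretical per-direction peak.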

    Besides the compute nodes there are I/O nodes that can be configured as interactive nodes or as nodes that connect to background storage. The I/O nodes contain only one Opteron processor but, in contrast to the compute nodes, they run a full Linux operating system. The compute nodes run a special Linux variant, called Extreme Scalability Mode, that greatly reduces the variability of the runtimes of similar tasks. This ensures very predictable execution times, as no interference from system tasks occurs. Such interference, the so-called OS jitter, can be quite detrimental to overall performance, especially for very large machine configurations. In the IBM BlueGene systems (see the BlueGene systems) a similar separation between compute and service nodes is employed.

    Cray offers the usual compilers and AMD's ACML numerical library, but also its own scientific library and compilers for the PGAS languages UPC and Co-Array Fortran (CAF). Besides Cray's MPI implementation, its shmem library for one-sided communication is also available.

    In 2011 the XE6m model became available, where "m" stands for midrange. The XE6m has at most 6 cabinets with a peak speed of just over 120 Tflop/s. A further rationalisation is that a 2-D torus rather than a 3-D torus is employed as the interconnection network. For the XE6 model itself no maximum configuration is given; the Cray documentation suggests that more than a million cores would be possible.
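
To give a feel for what the choice between a 2-D and a 3-D torus means, the sketch below computes the minimal hop count between nodes and the network diameter for two tori of equal size. The dimensions 8×8 and 4×4×4 are hypothetical and chosen only for illustration; they are not the actual XE6 or XE6m configurations, and the function names are not from any Cray software.

```python
from itertools import product


def torus_hops(a, b, dims):
    """Minimal number of hops between nodes a and b on a torus."""
    hops = 0
    for ai, bi, d in zip(a, b, dims):
        delta = abs(ai - bi)
        # The wrap-around link may give a shorter path in this dimension.
        hops += min(delta, d - delta)
    return hops


def torus_diameter(dims):
    """Largest minimal hop count between any two nodes.

    A torus looks the same from every node, so measuring from the
    origin is sufficient.
    """
    origin = (0,) * len(dims)
    nodes = product(*(range(d) for d in dims))
    return max(torus_hops(origin, node, dims) for node in nodes)


# Hypothetical 64-node examples, not real machine configurations.
diameter_2d = torus_diameter((8, 8))
diameter_3d = torus_diameter((4, 4, 4))

print(f"2-D torus 8x8:   diameter {diameter_2d} hops")
print(f"3-D torus 4x4x4: diameter {diameter_3d} hops")
```

For the same number of nodes the 3-D torus has the shorter worst-case path, which matters less for a system limited to a few cabinets than for a very large one.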

    Measured Performances:

    In [39] a speed of 1.11 Pflop/s was reported for a 142272-core XE6 based on 2.4 GHz Istanbul processors, for the solution of a linear system of unspecified size. The efficiency was 81.3%.
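
The quoted efficiency can be reproduced from the core count and clock frequency. The sketch below assumes 4 double-precision floating-point operations per core per cycle; that value is an assumption made here, not a figure stated on this page, and all names are illustrative.

```python
CORES = 142272
CLOCK_HZ = 2.4e9
FLOPS_PER_CYCLE_PER_CORE = 4     # ASSUMPTION, not stated in the text
MEASURED_FLOPS = 1.11e15         # 1.11 Pflop/s as reported in [39]


def theoretical_peak(cores, clock_hz, flops_per_cycle):
    """Theoretical peak performance in flop/s."""
    return cores * clock_hz * flops_per_cycle


peak_flops = theoretical_peak(CORES, CLOCK_HZ, FLOPS_PER_CYCLE_PER_CORE)
efficiency = MEASURED_FLOPS / peak_flops

print(f"Theoretical peak: {peak_flops / 1e15:.3f} Pflop/s")
print(f"Efficiency: {efficiency:.1%}")
```

Under that assumption the theoretical peak is about 1.366 Pflop/s, and the ratio of measured to peak performance rounds to the 81.3% given above.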