The Cray Inc. XC30


    Machine type:               Distributed-memory multi-processor
    Models:                     XC30
    Operating system:           CNL, Cray's microkernel Unix (for the compute nodes)
    Connection structure:       Dragonfly network
    Compilers:                  Fortran 95, C, C++, UPC, Co-Array Fortran, Chapel
    Vendor's information Web page: www.cray.com/Products/XC/resources.aspx
    Year of introduction:       2012

    System parameters:

    Model:                      Cray XC30
    Clock cycle:                ≤ 2.7 GHz
    Theor. peak performance:
      Per processor:            12 × 21.6 Gflop/s
      Per cabinet:              92 Tflop/s (CPU-only)
      Max. configuration:       not specified
    Memory:
      Per cabinet:              ≤ 24.6 TB
    No. of processors:
      Per cabinet:              384
      Max. configuration:       > 50,000
    Communication bandwidth:
      Point-to-point:           ≤ 10 GB/s
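
    The per-processor figure above follows directly from the clock frequency and the width of the floating-point units: an Ivy Bridge core can complete 8 double-precision floating-point operations per cycle, so at 2.7 GHz one core delivers 21.6 Gflop/s and a 12-core processor 259.2 Gflop/s. The short sketch below redoes this arithmetic; the 8 flops/cycle figure is an assumption about the processor, the other inputs come from the table. Note that at the maximum clock a 384-processor cabinet would reach about 99.5 Tflop/s, so the 92 Tflop/s quoted above evidently corresponds to a somewhat lower clock.

        #include <stdio.h>

        /* Back-of-the-envelope peak-performance arithmetic for the XC30
         * figures above.  Assumption not taken from the table: an Ivy
         * Bridge core retires 8 double-precision flops per cycle. */
        int main(void) {
            const double clock_ghz       = 2.7;  /* max. clock (table)    */
            const int    flops_per_cycle = 8;    /* assumed AVX width, DP */
            const int    cores_per_proc  = 12;   /* from the table        */
            const int    procs_per_cab   = 384;  /* from the table        */

            double gflops_core = clock_ghz * flops_per_cycle;          /* 21.6  */
            double gflops_proc = gflops_core * cores_per_proc;         /* 259.2 */
            double tflops_cab  = gflops_proc * procs_per_cab / 1000.0; /* ~99.5 */

            printf("per core     : %6.1f Gflop/s\n", gflops_core);
            printf("per processor: %6.1f Gflop/s\n", gflops_proc);
            printf("per cabinet  : %6.1f Tflop/s at %.1f GHz\n",
                   tflops_cab, clock_ghz);
            return 0;
        }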

    Remarks:

    The XC30 is the commercial version of the Cascade system that Cray developed for the DARPA HPCS program. It is the new top-of-the-line system in Cray's product portfolio and as such replaces the Cray XE6 (which is still marketed, see below). The structure of the machine is quite different from that of the XE6 in that it uses both different processors and a different interconnect. The only part that is largely the same is the software stack.

    To start with the processor: the XC30 uses Intel Ivy Bridge processors instead of AMD Opterons. Apart from generally performing somewhat better than the AMD Interlagos processors, the decisive factor is the presence of a PCIe Gen3 ×16 interface on the Intel processors, which enables a direct connection to the new Aries interconnect chip at a speed of 16 GB/s.

    The second and more important difference is the Aries interconnect chip itself. The Aries is based on a 48-port YARC chip with an internal aggregate bandwidth of 500 GB/s; the Gemini interconnect chip in the XE6 was the first generation of the YARC chip, with a considerably lower total bandwidth (160 GB/s). More importantly, where the Gemini implements a 3-D torus, the Aries chip implements a so-called dragonfly topology (see [22]). A dragonfly topology is hierarchical in nature and is made up of three levels: router, group, and system. An important condition is that the connectivity within and between routers is high, so that messages can be routed in a small number of hops (the number of incoming/outgoing ports of a router chip is called the radix of that router; hence the 48-port YARC chip can be regarded as a high-radix router). In the Aries router chip this condition is fulfilled by dedicating 15 links to a backplane, a so-called Rank-1 network that interconnects the blades in a cabinet without cables, and another 15 links to a neighbouring cabinet via a Rank-2 network using copper cables. In this way two adjacent cabinets form a group. A further 10 links, using optical fibre cables, connect the groups in a system-wide Rank-3 network. A message between any two nodes within a group therefore requires a maximum of two hops. The number of hops between groups depends on how the groups are connected; for the Cray XC30 this is an all-to-all connection, which means that only one inter-group hop is added. The lowest bandwidth is the one between groups, at nominally 10 GB/s; in practical experiments with MPI an actual bandwidth of 9.5 GB/s was observed.
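
    The hop-count argument can be made concrete with a small sketch. The code below is illustrative only, not Cray's routing logic, and encodes one reading of the description above: at most two hops within the source group, one optical Rank-3 hop, and at most two hops within the destination group. The (group, router) coordinates and the function names are assumptions.

        #include <stdio.h>

        /* Idealised dragonfly hop-count bound: at most 2 hops inside a
         * group (one Rank-1 plus one Rank-2 link) and, with all-to-all
         * group connectivity, a single Rank-3 hop between groups.  The
         * 2+1+2 worst case is an interpretation of the text above. */
        struct node { int group; int router; };

        static int max_hops(struct node a, struct node b) {
            if (a.group == b.group)
                return (a.router == b.router) ? 0 : 2;
            /* Reach a router owning the right optical link, cross it,
             * then reach the destination router inside its group. */
            return 2 + 1 + 2;
        }

        int main(void) {
            struct node src  = { .group = 0, .router = 3 };
            struct node same = { .group = 0, .router = 40 };
            struct node far  = { .group = 7, .router = 42 };
            printf("same group : at most %d hops\n", max_hops(src, same));
            printf("other group: at most %d hops\n", max_hops(src, far));
            return 0;
        }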

    As mentioned, the Aries chip has 4 PCIe Gen3 ×16 links, which means it can connect to any PCIe Gen3-enabled device. In the case of the XC30 it is therefore possible to exchange a standard Ivy Bridge processor for an accelerator: either an NVIDIA K20X GPU or an Intel Xeon Phi.

    Besides the compute nodes there are I/O nodes that can be configured as interactive nodes or as nodes that connect to background storage. The I/O nodes contain only one Intel processor and run a full Linux operating system. The compute nodes run a special Linux variant, called Extreme Scalability Mode, that greatly reduces the variability in the run times of similar tasks. This so-called OS jitter can be quite detrimental to overall performance, especially for very large machine configurations. Cray claims scalability of over a million cores for this streamlined OS. The IBM BlueGene systems employ a similar separation between compute and service nodes.
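
    To make the notion of OS jitter concrete, the generic sketch below (an illustration, not a Cray utility) times an identical, fixed amount of work many times on a single core. Interrupts and daemon activity show up as a spread between the fastest and the slowest sample; in a tightly synchronised parallel run every process ends up waiting for the slowest one.

        #define _POSIX_C_SOURCE 200809L
        #include <stdio.h>
        #include <time.h>

        /* Generic OS-jitter probe: repeat a fixed chunk of work and report
         * the spread between the fastest and the slowest repetition. */
        #define SAMPLES 1000
        #define WORK    2000000

        static double now_sec(void) {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return ts.tv_sec + ts.tv_nsec * 1e-9;
        }

        int main(void) {
            volatile double x = 1.0;
            double min = 1e30, max = 0.0;

            for (int s = 0; s < SAMPLES; s++) {
                double t0 = now_sec();
                for (int i = 0; i < WORK; i++)
                    x = x * 1.0000001 + 1e-9;   /* fixed amount of work */
                double dt = now_sec() - t0;
                if (dt < min) min = dt;
                if (dt > max) max = dt;
            }
            printf("fastest %.6f s, slowest %.6f s, spread %.1f%%\n",
                   min, max, 100.0 * (max - min) / min);
            return 0;
        }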

    Cray offers the usual Intel compiler set and Intel's MKL numerical library, but also its own scientific library and compilers supporting the PGAS languages UPC, Co-Array Fortran (CAF), and Chapel, the PGAS language that Cray developed for DARPA's HPCS program. Besides Cray's MPI implementation, its shmem library for one-sided communication is also available.
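
    As an illustration of the one-sided model that the shmem library supports, the minimal sketch below uses the portable OpenSHMEM C interface; the calls shown are standard OpenSHMEM, and whether a given Cray programming environment exposes exactly these names is an assumption here, not a statement from the source.

        #include <stdio.h>
        #include <shmem.h>

        /* One-sided put with OpenSHMEM: PE 0 writes directly into the
         * memory of the last PE without any receive call on that side. */
        int main(void) {
            static long src[8], dst[8];   /* static arrays are symmetric */

            shmem_init();
            int me   = shmem_my_pe();
            int npes = shmem_n_pes();

            for (int i = 0; i < 8; i++) src[i] = 100 * me + i;

            if (me == 0)
                shmem_long_put(dst, src, 8, npes - 1);  /* remote write */

            shmem_barrier_all();          /* complete and order the put */

            if (me == npes - 1)
                printf("PE %d: dst[0] = %ld\n", me, dst[0]);

            shmem_finalize();
            return 0;
        }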

    A one-cabinet or one-group midsize XC30 is planned to come to market sometime in 2013, but as yet Cray has not given details of such a system. For the present XC30 no maximum configuration is given; it is only stated to scale to more than 50,000 nodes.

    Measured Performances:

    In the TOP500 list of June 2013 a speed of 627 Tflop/s out of a theoretical peak of 745.5 Tflop/s was reported for the Swiss Piz Daint system, solving a linear system of order 35,480 at an efficiency of 84%.