The Cray Inc. XK7

    Machine type: Distributed-memory multi-processor.
    Models: XK7.
    Operating system: CNL, Cray's microkernel Unix (for the compute nodes).
    Connection structure: 3-D torus.
    Compilers: Fortran 95, C, C++, UPC, Co-Array Fortran, CUDA, OpenCL.
    Vendor's information Web page: www.cray.com/Products/XK/XK7.aspx
    Year of introduction: 2012.

    System parameters:

    Model                         Cray XK7
    Clock cycle                   2.1–2.8 GHz
    Theor. peak performance
      Per cabinet                 >100 Tflop/s
      Max. configuration
    Memory
      Per cabinet                 ≤ 15.4 TB
      Max. configuration
    No. of processors
      Per cabinet                 96 CPUs; 96 GPUs
      Max. configuration
    Communication bandwidth
      Point-to-point              ≤ 8.3 GB/s
      Bisectional/cabinet         2.39 TB/s

    Remarks:

    The XK7 machine has the structure of the Cray XE6 (see above), but in a node two of the Opteron processors have been replaced by NVIDIA GPUs. For appropriate applications this will boost the performance more than 5-fold. Because the attainable speed depends so strongly on the application and on the amount of data to be shipped back and forth between the GPU memory and the system memory, no sensible speed estimate can be given, except that the performance of a cabinet may well exceed 100 Tflop/s when the application is right.
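
Why the data traffic matters so much can be illustrated with a rough back-of-the-envelope model. The sketch below is not a description of the XK7 itself: the function name, the assumption that transfers are not overlapped with computation, and all numbers in the usage lines (kernel speedup, data volumes, link bandwidth) are hypothetical and chosen only to show the trend.

```python
def effective_speedup(cpu_time_s, kernel_speedup, bytes_moved, link_bytes_per_s):
    """Overall speedup of an offloaded computation, including data transfer.

    Simplifying assumption: transfers between system memory and GPU memory
    are not overlapped with computation, so their time simply adds up.
    """
    if cpu_time_s <= 0 or kernel_speedup <= 0 or link_bytes_per_s <= 0:
        raise ValueError("times, speedups and bandwidths must be positive")
    if bytes_moved < 0:
        raise ValueError("bytes_moved must be non-negative")
    gpu_compute_s = cpu_time_s / kernel_speedup   # time spent in GPU kernels
    transfer_s = bytes_moved / link_bytes_per_s   # time spent moving data
    return cpu_time_s / (gpu_compute_s + transfer_s)


# Hypothetical numbers, chosen only to illustrate the trend.
little_data = effective_speedup(10.0, 8.0, 1e9, 8e9)    # 1 GB moved
much_data = effective_speedup(10.0, 8.0, 40e9, 8e9)     # 40 GB moved
print(f"{little_data:.2f}x vs. {much_data:.2f}x")        # prints "7.27x vs. 1.60x"
```

With the same kernel speedup, moving forty times more data reduces the overall gain from about 7-fold to less than 2-fold in this model, which is why a single speed figure for the machine would be misleading.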

    Apart from the usual software stack for Cray products, CUDA and OpenCL are of course supported for the GPUs, as is OpenACC, the OpenMP-like directive/pragma-based library and runtime that should make it easier for the general programmer to take advantage of the GPUs.

    Measured Performances:

    In [39] a speed of 17.59 Pflop/s was reported for the 560640-core XK7 Titan machine at ORNL for the solution of a linear system of unspecified size. The efficiency was 64.9%, which is surprisingly high for a GPU-based system.
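
The two reported figures together imply a theoretical peak, since efficiency is the measured speed divided by the peak speed. The small sketch below only carries out that arithmetic; the function name is an illustrative choice, and the peak value is derived here from the numbers above rather than quoted from [39].

```python
def implied_peak(measured_pflops, efficiency):
    """Theoretical peak implied by a measured speed and an efficiency (0..1]."""
    if measured_pflops <= 0:
        raise ValueError("measured speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must lie in (0, 1]")
    return measured_pflops / efficiency


# Figures as reported above: 17.59 Pflop/s at 64.9% efficiency.
peak = implied_peak(17.59, 0.649)
print(f"Implied peak: {peak:.1f} Pflop/s")  # prints "Implied peak: 27.1 Pflop/s"
```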