The Cray Inc. XK7

Machine type: Distributed-memory multi-processor.
Models: XK7.
Operating system: CNL, Cray's microkernel Unix (for the compute nodes).
Connection structure: 3-D torus.
Compilers: Fortran 95, C, C++, UPC, Co-Array Fortran, CUDA, OpenCL.
Vendor's information Web page: www.cray.com/Products/XK/XK7.aspx
Year of introduction: 2012.

System parameters:

Model: Cray XK7
Clock cycle: 2.1–2.8 GHz
Theor. peak performance:
    Per Cabinet: >100 Tflop/s
    Max. Configuration: —
Memory:
    Per Cabinet: ≤ 15.4 TB
    Max. Configuration: —
No. of processors:
    Per Cabinet: 96 CPUs; 96 GPUs
    Max. Configuration: —
Communication bandwidth:
    Point-to-point: ≤ 8.3 GB/s
    Bisectional/cabinet: 2.39 TB/s

Remarks:
The XK7 machine has the structure of the Cray XE6 (see above), but in a node two of the Opteron processors have been replaced by NVIDIA GPUs. For appropriate applications this will boost the performance more than 5-fold. Because the achievable speed depends so strongly on the application and on the amount of data that must be shipped back and forth between the GPU memory and the system memory, no sensible speed estimate can be given, except that the performance of a cabinet may well exceed 100 Tflop/s when the application is right.
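The dependence on data movement can be illustrated with a rough first-order model. The sketch below is not taken from Cray documentation: the function name effective_speedup, its parameters, and the assumption that transfers do not overlap with computation are all hypothetical simplifications.

```c
/*
 * Hypothetical first-order model (an assumption, not a Cray formula):
 * the accelerated run time is the CPU-only time divided by the kernel
 * speedup, plus the time needed to move data between system memory and
 * GPU memory. Transfers are assumed not to overlap with computation.
 */
double effective_speedup(double t_cpu_s, double kernel_speedup,
                         double bytes_moved, double link_bytes_per_s)
{
    double t_kernel   = t_cpu_s / kernel_speedup;       /* time on the GPU */
    double t_transfer = bytes_moved / link_bytes_per_s; /* host <-> device */
    return t_cpu_s / (t_kernel + t_transfer);
}
```

With made-up inputs, a 10 s CPU run with a 5-fold kernel speedup keeps its full 5-fold gain when no data is moved, but drops to a 2.5-fold gain when 16 GB must cross an 8 GB/s link. These values are purely illustrative and are not XK7 measurements.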

Apart from the usual software stack for Cray products, CUDA and OpenCL are of course supported for the GPUs, as is OpenACC, the OpenMP-like directive/pragma-based programming standard and runtime that should make it easier for the general programmer to take advantage of the GPUs.
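To give an impression of the directive-based style, the following is a minimal sketch of a vector update annotated with an OpenACC pragma. The function name saxpy_acc and the chosen data clauses are illustrative assumptions, not code from Cray. A compiler without OpenACC support simply ignores the pragma and runs the loop on the CPU.

```c
/*
 * y[i] = a * x[i] + y[i] for i in [0, n).
 * With an OpenACC compiler the loop is offloaded to the accelerator:
 * x is copied to the device, y is copied to the device and back.
 * Without OpenACC support the pragma is ignored and the loop runs on the host.
 */
void saxpy_acc(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The copyin and copy clauses make the transfers between system memory and GPU memory explicit, which is exactly the traffic that limits the achievable speedup for many applications.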

Measured Performances:
In [39] a speed of 17.59 Pflop/s was reported on the 560,640-core XK7 Titan machine at ORNL for the solution of a linear system of unspecified size. The efficiency was 64.9%, which is surprisingly high for a GPU-based system.
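The efficiency quoted here is the measured speed divided by the theoretical peak speed. The theoretical peak of the machine is not stated in this section, but it can be recovered from the two reported numbers. The helper names below are hypothetical and serve only to make the arithmetic explicit.

```c
/* Efficiency: measured speed as a fraction of the theoretical peak speed. */
double efficiency(double measured_pflops, double peak_pflops)
{
    return measured_pflops / peak_pflops;
}

/* Invert the definition to obtain the peak implied by a reported efficiency. */
double implied_peak(double measured_pflops, double eff)
{
    return measured_pflops / eff;
}
```

Calling implied_peak(17.59, 0.649) gives roughly 27.1 Pflop/s, so the reported figures imply a theoretical peak of about that size for the measured configuration.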