Cray has come a long way in the realisation of true hybrid systems, in which
processors of different types work seamlessly together within one computer
infrastructure, and it is consistently working towards a hardware infrastructure
that allows for this ideal situation. The current product, named the Cray
XT5h, can harbour scalar Opteron-based XT5 nodes, X2 vector
nodes, and XR1 nodes that house a Xilinx FPGA. The XT5h is not yet
able to let all of them work together on one computational task, but the SIO
(Service and I/O) nodes can log on to the system and deal with any of the different
types of nodes within it. The common infrastructure is formed by the 3-D torus
network implemented by the SeaStar2+ communication routers. Each of the 6 ports of
the SeaStar2+ provides a bandwidth of 9.6 GB/s and connects to the X2,
XT5, and XR1 blades. The bandwidth to the X2 blades is 4.8 GB/s, for reasons not
disclosed in Cray's documentation.
The Cray Inc. X2
System parameters:
Remarks: The X2 vector processor blade fits in Cray's XT5h infrastructure, where the "h" stands for "hybrid". The processor is not very different from its predecessor, the X1E, but the clock frequency has gone up from 1.125 GHz to 1.6 GHz. A vector pipe set, containing an add, a multiply, and a miscellaneous-functions pipe, can generate 2 floating-point results/cycle. As there are 8 pipe sets in a processor, this yields a peak performance of 25.6 Gflop/s. Four CPUs are housed in one node with 32 or 64 GB of memory and operate in SMP mode. One X2 blade in turn houses two nodes, for a total peak performance of 204.8 Gflop/s.
As in its predecessors, the bandwidth from/to the memory of 28.5 GB/s is not sufficient to support the operation of the processors at full speed. The scalar processor in a CPU therefore has 2-way set-associative L1 instruction and data caches, while the unified 512 KB, 16-way set-associative L2 cache and the 8 MB L3 cache are shared by all data. Part of the bandwidth discrepancy has been met by decoupling the vector functional units from each other. For instance, the load/store unit is able to issue store instructions before the associated results are present. In this way the functional units incur minimal waiting times because they do not have to synchronise unnecessarily.
To integrate the X2 blade fully into the XT5h infrastructure it is connected to a SeaStar2+ router that links it to the other types of nodes in the system, albeit at half the normal link speed, i.e., at 4.8 GB/s. The X2 blades also have their own very high-bandwidth interconnect, implemented as a dense fat tree built from Cray's proprietary so-called YARC router chips. The density lies in the fact that the radix, i.e., the number of ports per router, is high: 64 ports/YARC chip. Four of these chips constitute one rank-1 router that connects 32 CPUs at level 1. A rank-2 router can in turn connect 128 rank-1 routers, etc., up to the maximum of 32K processors.
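The peak-performance figures quoted above follow directly from the clock frequency and the pipe configuration; a small sketch of the arithmetic, using only numbers from the text:

```python
# X2 peak-performance arithmetic, using the figures quoted in the text.
clock_ghz = 1.6          # clock frequency of the X2 processor
pipe_sets = 8            # vector pipe sets per processor
results_per_cycle = 2    # floating-point results per pipe set per cycle

cpu_peak = clock_ghz * pipe_sets * results_per_cycle  # Gflop/s per CPU
node_peak = 4 * cpu_peak                              # 4 CPUs per SMP node
blade_peak = 2 * node_peak                            # 2 nodes per X2 blade

print(cpu_peak)    # 25.6 Gflop/s per processor
print(blade_peak)  # 204.8 Gflop/s per blade
```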
A unique feature of the network is that it allows side links that connect rank-1 routers in order to couple subtrees statically. One may have reasons to configure the network in such a way, especially when one has a number of X2 boards that naturally matches such a configuration. The network topology can become rather complicated in this way, however. The worst-case distance in the maximally configured network is Ω = 7, which means that only 7 hops are needed to connect the two most distant processors in a 32K-processor configuration. The point-to-point bandwidth between processors is quite high: 15 GB/s, or almost 60% of the local memory bandwidth within a CPU. An extensive description of this interesting interconnect can be found in [40]. Within an SMP node OpenMP can be employed. When accessing other CPU boards one can use Cray's shmem library for one-sided communication, MPI, Co-Array Fortran, etc. Measured Performances: Results for a subset of the synthetic HPCC benchmark, some application kernels, and 7 application codes, compared against an Intel Woodcrest processor (HPCC subset) and a Cray XT4, are discussed in [41].
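The router hierarchy described above can be checked with a little arithmetic (all figures are taken from the text; the grouping into ranks follows the description):

```python
# Capacity of the X2 fat-tree interconnect, using the figures in the text.
ports_per_yarc = 64     # radix of one YARC router chip
cpus_per_rank1 = 32     # CPUs connected by one rank-1 router (4 YARC chips)
rank1_per_rank2 = 128   # rank-1 routers one rank-2 router can connect

# CPUs reachable under a single rank-2 router:
rank2_cpus = rank1_per_rank2 * cpus_per_rank1
print(rank2_cpus)  # 4096 CPUs; further levels extend this to the 32K maximum
```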
Cray Inc. XT5
System parameters:
Remarks: The quality of the layout of Cray's online product information has improved significantly over the years. Unfortunately, the information content, at least for the XT5, has been decreasing at the same rate. So, no more information is given about the processor on the XT5 blade than that AMD 2000 processors are used, which may be either dual- or quad-core. No per-processor performance is given either, only that a 192-processor, quad-core cabinet will have a peak performance of 7 Tflop/s. This amounts to a peak speed per core of 9.11 Gflop/s. As the fastest 2.5 GHz quad-core Phenom processor is capable of 5.0 Gflop/s/core with its regular floating-point units, it turns out that Cray has conveniently included the SSE capability of the processors, a practice that has not been exercised before, nor should it be recommended. Also, no information is given about the maximum configuration, as was done for its predecessors, the XT3 and XT4.
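The per-core figure above is easily recovered from the only number Cray does publish, the 7 Tflop/s cabinet peak:

```python
# Back-calculating the XT5 per-core peak from the quoted cabinet figure.
cabinet_peak_gflops = 7000.0  # 7 Tflop/s cabinet peak performance
processors = 192              # quad-core processors per cabinet
cores_per_processor = 4

per_core_gflops = cabinet_peak_gflops / (processors * cores_per_processor)
print(round(per_core_gflops, 2))  # 9.11 Gflop/s per core
```

As noted in the text, this exceeds the 5.0 Gflop/s/core of the regular floating-point units, so the SSE units must have been counted in.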
There is information about the interconnection network: it is based on
the SeaStar2+ router. The structure is the same as that of the SeaStar2 router,
but about 25% faster: the bi-directional link speed has gone up from 7.6
GB/s to 9.6 GB/s. The bandwidth from the SeaStar2+ to the processors is still
the same: 6.4 GB/s, the generic speed of AMD's HyperTransport 1.1. Measured Performances: As yet, no performance results are known for XT5 systems.
Cray Inc. XR1
System parameters:
Remarks: An XR1 blade contains two Xilinx Virtex-4 LX400 FPGAs, each packaged as one of DRC's Reconfigurable Processing Units (RPUs). This means that DRC has already preconfigured the I/O connection between the AMD Opteron on the board and the two RPUs. The connection is made directly via AMD's HyperTransport 1.0 protocol, making the bandwidth on and off the FPGAs quite high in comparison to most accelerator platforms. The AMD processor is used for the connection to the SeaStar2+ router in the XT5h infrastructure and as an intermediate processor from which the RPUs are activated. As the algorithms to be configured and run on an XR1 board can be so different, no sensible performance figures can be given, even as an upper bound. As already discussed in the section on DRC, various programming interfaces can be used, including Handel-C, Mitrion-C, etc., for the development of routines to be run on the XR1 boards. Measured Performances: As said, because of the enormous difference in speed between one application and another, no meaningful performance number can be defined. Only in terms of speedup against a CPU-based version can one reasonably assess the benefits of an XR1-accelerated system.
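Such a speedup-based assessment is straightforward; the sketch below uses made-up example timings (both values are hypothetical, purely to illustrate the measure):

```python
# Hypothetical illustration of assessing an FPGA-accelerated system
# by its speedup over a CPU-only baseline. The timings are invented
# example values, not measurements of any real XR1 workload.
cpu_time_s = 120.0   # assumed run time of the CPU-only version
fpga_time_s = 8.0    # assumed run time of the XR1-accelerated version

speedup = cpu_time_s / fpga_time_s
print(speedup)  # 15.0x for this invented example
```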