The Cray Inc. XMT


    Machine type Distributed-memory multi-processor
    Models XMT
    Operating system UNICOS/lc, Cray's microkernel Unix
    Connection structure 3-D Torus
    Compilers C, C++.
    Vendor's information Web page www.cray.com/products/XMT.aspx
    Year of introduction 2007

    System parameters:

    Model Cray XMT
    Clock cycle 500 MHz
    Theor. peak performance    
    Per Processor 1.5 Gflop/s
    Per Cabinet 144 Gflop/s
    Max. Configuration 12 Tflop/s
    Memory  
    Per Cabinet ≤ 768 GB
    Max. Configuration ≤ 64 TB
    No. of processors  
    Per Cabinet 96
    Max. Configuration 8024
    Communication bandwidth  
    Point-to-point ≤ 8.3 GB/s
    Bisectional/cabinet 2.39 TB/s
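
    The peak-performance entries in the table follow directly from the clock rate and the processor counts; a quick arithmetic check (in Python, purely illustrative):

```python
# Back-of-the-envelope check of the XMT performance figures,
# using the values from the parameter table above.

CLOCK_HZ = 500e6          # 500 MHz clock cycle
FLOPS_PER_CYCLE = 3       # 3 functional units, 3 flops per clock
PROCS_PER_CABINET = 96
PROCS_MAX = 8024          # maximum configuration

per_proc = CLOCK_HZ * FLOPS_PER_CYCLE        # 1.5 Gflop/s
per_cabinet = per_proc * PROCS_PER_CABINET   # 144 Gflop/s
max_config = per_proc * PROCS_MAX            # ~12 Tflop/s

print(f"Per processor: {per_proc / 1e9:.1f} Gflop/s")
print(f"Per cabinet:   {per_cabinet / 1e9:.0f} Gflop/s")
print(f"Max. config:   {max_config / 1e12:.1f} Tflop/s")
```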

    Remarks:

    The macro architecture of the Cray XMT is very much like that of the Cray XT6 (similar to the Cray XE6 but with the older SeaStar2+ router instead of the Gemini router). The processors, however, are completely different: these so-called Threadstorm processors are made for massive multithreading and resemble those of the late Cray MTA-2 (see Systems disappeared from the list and [32]).

    Let us look at the architectural features: although the memory in the XMT is physically distributed, the system is emphatically presented as a shared-memory machine (with non-uniform access time). The latency incurred in memory references is hidden by multi-threading, i.e., many concurrent program threads (instruction streams) may be active at any time. When, for instance, a load instruction cannot be satisfied because of memory latency, the thread issuing the operation is stalled and another thread that does have work ready is switched into execution. Switching between program threads takes only 1 cycle. As there may be up to 128 instruction streams per processor and each stream can have 8 memory references outstanding, a latency of 1024 cycles can be tolerated. Stalled references are retried from a retry pool. A similar construction was found in the late Stern Computing Systems SSP machines (see Systems disappeared from the list).
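
    The tolerance figure of 1024 cycles can be illustrated with a small scheduling model (a deliberately simplified sketch; only the constants come from the text, the code is not Cray's):

```python
import collections

STREAMS = 128        # instruction streams per processor
OUTSTANDING = 8      # memory references in flight per stream
MEM_LATENCY = 1000   # assumed memory latency in cycles (<= 1024)
CYCLES = 100_000

# Model each (stream, reference slot) pair as an issue context: the
# processor issues one memory reference per cycle as long as some
# context is ready, and a context stays busy for MEM_LATENCY cycles.
ready = collections.deque(range(STREAMS * OUTSTANDING))
pending = collections.deque()   # (completion cycle, context), FIFO order
busy = 0

for cycle in range(CYCLES):
    # wake contexts whose memory reference has completed
    while pending and pending[0][0] <= cycle:
        ready.append(pending.popleft()[1])
    if ready:
        ctx = ready.popleft()   # 1-cycle switch to the next ready context
        pending.append((cycle + MEM_LATENCY, ctx))
        busy += 1

print(f"utilisation: {busy / CYCLES:.0%}")   # 100% while latency <= 1024
```

    Raising MEM_LATENCY above 1024 in this model makes the ready queue run dry and utilisation drops accordingly, which is the point of the 128 x 8 design.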

    An XMT processor has 3 functional units that together can deliver 3 flops per clock cycle, for a theoretical peak performance of 1.5 Gflop/s. There is only one level of cache, for data and instructions, because more cache levels would be virtually useless for the applications at which the machine is aimed. The high degree of latency hiding through massive multi-threading is the mechanism of choice here to combat memory latency.

    Unlike for the earlier MTA-2, there is no Fortran compiler anymore for the XMT. Furthermore, the 3-D torus network, the same as in the Cray XT6, and the faster clock cycle of 500 MHz make the machine highly interesting for applications with very unstructured but massively parallel work, as for instance in sorting, data mining, combinatorial optimisation, and other complex pattern-matching applications. Algorithms like sparse matrix-vector multiplication might also perform well.
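
    To see why sparse matrix-vector multiplication fits this design, consider the CSR-format product below: the loads x[col[j]] are data-dependent and irregular, exactly the access pattern that massive multithreading hides (a minimal illustration, not XMT code):

```python
# Sparse matrix-vector product y = A*x in CSR (compressed sparse row)
# format. The indirect accesses x[col[j]] are the irregular,
# latency-bound loads that the XMT's multithreading is meant to hide.

def spmv_csr(row_ptr, col, val, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):                 # rows are independent -> parallel
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[j] * x[col[j]]      # irregular, data-dependent load
    return y

# 3x3 example:  [[2, 0, 1],
#                [0, 3, 0],
#                [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col = [0, 2, 1, 0, 2]
val = [2.0, 1.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col, val, [1.0, 1.0, 1.0]))   # [3.0, 3.0, 9.0]
```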

    Measured Performances:

    As yet no independent performance results are available to prove the value of this interesting architecture.