The Cray Inc. XMT


    Machine type Distributed-memory multi-processor
    Models XMT
    Operating system UNICOS/lc, Cray's microkernel Unix
    Connection structure 3-D Torus
    Compilers C, C++.
    Vendor's information Web page www.cray.com/products/XMT.aspx
    Year of introduction 2007

    System parameters:

    Model Cray XMT
    Clock cycle 500 MHz
    Theor. peak performance    
    Per Processor 1.5 Gflop/s
    Per Cabinet 144 Gflop/s
    Max. Configuration 12 Tflop/s
    Memory  
    Per Cabinet ≤ 768 GB
    Max. Configuration ≤ 64 TB
    No. of processors  
    Per Cabinet 96
    Max. Configuration 8024
    Communication bandwidth  
    Point-to-point ≤ 8.3 GB/s
    Bisectional/cabinet 2.39 TB/s
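
    The peak-performance entries in the table follow directly from the clock rate and the processor counts; a quick arithmetic check (in Python, purely illustrative):

```python
# Back-of-the-envelope check of the XMT performance figures,
# using the values from the parameter table above.

CLOCK_HZ = 500e6          # 500 MHz clock cycle
FLOPS_PER_CYCLE = 3       # 3 functional units, 3 flops per clock
PROCS_PER_CABINET = 96
PROCS_MAX = 8024          # maximum configuration

per_proc = CLOCK_HZ * FLOPS_PER_CYCLE        # 1.5 Gflop/s
per_cabinet = per_proc * PROCS_PER_CABINET   # 144 Gflop/s
max_config = per_proc * PROCS_MAX            # ~12 Tflop/s

print(f"Per processor: {per_proc / 1e9:.1f} Gflop/s")
print(f"Per cabinet:   {per_cabinet / 1e9:.0f} Gflop/s")
print(f"Max. config:   {max_config / 1e12:.1f} Tflop/s")
```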

    Remarks:

    The macro architecture of the Cray XMT is very much like that of the Cray XT6 (similar to the Cray XE6 but with the older SeaStar2+ router instead of the Gemini router). The processors, however, are completely different: these so-called Threadstorm processors are made for massive multithreading and resemble those of the late Cray MTA-2 (see Systems disappeared from the list and [32]).

    Let us look at the architectural features: although the memory in the XMT is physically distributed, the system is emphatically presented as a shared-memory machine (with non-uniform access time). The latency incurred in memory references is hidden by multi-threading, i.e., many concurrent program threads (instruction streams) may be active at any time. When, for instance, a load instruction cannot be satisfied because of memory latency, the thread issuing the operation is stalled and another thread that does have work ready is switched into execution. Switching between program threads takes only 1 cycle. As there may be up to 128 instruction streams per processor and each stream can have 8 memory references outstanding, a latency of 1024 cycles can be tolerated. Stalled references are retried from a retry pool. A similar construction was found in the late Stern Computing Systems SSP machines (see Systems disappeared from the list).
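
    The tolerance figure of 1024 cycles can be illustrated with a small scheduling model (a deliberately simplified sketch; only the constants come from the text, the code is not Cray's):

```python
import collections

STREAMS = 128        # instruction streams per processor
OUTSTANDING = 8      # memory references in flight per stream
MEM_LATENCY = 1000   # assumed memory latency in cycles (<= 1024)
CYCLES = 100_000

# Model each (stream, reference slot) pair as an issue context: the
# processor issues one memory reference per cycle as long as some
# context is ready, and a context stays busy for MEM_LATENCY cycles.
ready = collections.deque(range(STREAMS * OUTSTANDING))
pending = collections.deque()   # (completion cycle, context), FIFO order
busy = 0

for cycle in range(CYCLES):
    # wake contexts whose memory reference has completed
    while pending and pending[0][0] <= cycle:
        ready.append(pending.popleft()[1])
    if ready:
        ctx = ready.popleft()   # 1-cycle switch to the next ready context
        pending.append((cycle + MEM_LATENCY, ctx))
        busy += 1

print(f"utilisation: {busy / CYCLES:.0%}")   # 100% while latency <= 1024
```

    Raising MEM_LATENCY above 1024 in this model makes the ready queue run dry and utilisation drops accordingly, which is the point of the 128 x 8 design.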

    An XMT processor has 3 functional units that together can deliver 3 flops per clock cycle, for a theoretical peak performance of 1.5 Gflop/s. There is only one level of cache, for data and instructions, because more cache levels would be virtually useless for the applications at which the machine is aimed. The high degree of latency hiding through massive multi-threading is the mechanism of choice here to combat memory latency.

    Unlike for the earlier MTA-2, there is no Fortran compiler anymore for the XMT. Furthermore, the 3-D torus network, the same as in the Cray XT6, and the faster clock cycle of 500 MHz make the machine highly interesting for applications with very unstructured but massively parallel work, as for instance in sorting, data mining, combinatorial optimisation, and other complex pattern-matching applications. Algorithms like sparse matrix-vector multiplication might also perform well.
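
    To see why sparse matrix-vector multiplication fits this design, consider the CSR-format product below: the loads x[col[j]] are data-dependent and irregular, exactly the access pattern that massive multithreading hides (a minimal illustration, not XMT code):

```python
# Sparse matrix-vector product y = A*x in CSR (compressed sparse row)
# format. The indirect accesses x[col[j]] are the irregular,
# latency-bound loads that the XMT's multithreading is meant to hide.

def spmv_csr(row_ptr, col, val, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):                 # rows are independent -> parallel
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[j] * x[col[j]]      # irregular, data-dependent load
    return y

# 3x3 example:  [[2, 0, 1],
#                [0, 3, 0],
#                [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col = [0, 2, 1, 0, 2]
val = [2.0, 1.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col, val, [1.0, 1.0, 1.0]))   # [3.0, 3.0, 9.0]
```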

    Measured Performances:

    As yet no independent performance results are available to prove the value of this interesting architecture.