The Cray Inc. XT3.

Machine type: Distributed-memory multi-processor.
Models: XT3.
Operating system: UNICOS/lc, Cray's microkernel Unix.
Connection structure: 3-D torus.
Compilers: Fortran 95, C, C++.
Vendor's information Web page: www.cray.com/products/xt3/
Year of introduction: 2004.

System parameters:

Model                        Cray XT3
Clock cycle                  2.4 GHz
Theor. peak performance
  Per processor              4.8 Gflop/s
  Per cabinet                460.8 Gflop/s
  Max. configuration         147 Tflop/s
Memory
  Per cabinet                ≤ 768 GB
  Max. configuration         196 TB
No. of processors
  Per cabinet                96
  Max. configuration         30,508
Communication bandwidth
  Bisectional/cabinet        333 GB/s
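
The per-processor and per-cabinet entries in the table can be reproduced with simple arithmetic. The sketch below is illustrative only. It assumes 2 floating-point results per clock cycle, which is inferred from 4.8 Gflop/s at 2.4 GHz and is not stated in the text; the 8 GB per PE figure comes from the Remarks section. All names are hypothetical.

```python
# Figures taken from the system parameters table.
CLOCK_GHZ = 2.4
PES_PER_CABINET = 96
MAX_PES = 30_508
MAX_MEMORY_PER_PE_GB = 8  # upper limit per PE quoted in the Remarks

# Assumption: 2 floating-point results per cycle (4.8 Gflop/s / 2.4 GHz).
FLOPS_PER_CYCLE = 2


def peak_gflops(n_pes: int) -> float:
    """Theoretical peak performance in Gflop/s for n_pes processors."""
    return n_pes * CLOCK_GHZ * FLOPS_PER_CYCLE


per_processor = peak_gflops(1)
per_cabinet = peak_gflops(PES_PER_CABINET)
max_config_tflops = peak_gflops(MAX_PES) / 1000
memory_per_cabinet_gb = PES_PER_CABINET * MAX_MEMORY_PER_PE_GB

print(f"Per processor:      {per_processor:.1f} Gflop/s")
print(f"Per cabinet:        {per_cabinet:.1f} Gflop/s")
print(f"Max. configuration: {max_config_tflops:.1f} Tflop/s")
print(f"Memory per cabinet: {memory_per_cabinet_gb} GB")
```

Under these assumptions the maximal configuration comes out slightly above 146 Tflop/s, close to the 147 Tflop/s listed in the table.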

Remarks:

The Cray XT3 is the commercial spinoff of the 10,000+ processor Red Storm machine, built by Cray for Sandia National Laboratories. The structure is similar, except that no provisions are made to have a "classified" and an "unclassified" part in the machine. The basic processor in a node, called a PE (Processing Element) in Cray jargon, is the AMD Opteron 100 series at 2.4 GHz. Cray chose this uniprocessor version of the chip because of its lower memory latency (about 60 ns), in contrast to the SMP-enabled versions, whose memory latency can be up to 2 times higher. Up to 8 GB of memory can be configured per PE, connected to the processor by a 6.4 GB/s HyperTransport link. For connection to the outside world a PE harbours 2 PCI-X busses, a dual-ported Fibre Channel Host Bus Adaptor for connecting to disk, and a 10 Gb/s Ethernet card.

The Opteron was also chosen because of its high bandwidth and the relative ease of connecting the processor to the network processor, Cray's SeaStar chip. For the physical connection another HyperTransport channel at 6.4 GB/s is used. The SeaStar has 6 ports with a bandwidth of 7.6 GB/s each (3.8 GB/s incoming and 3.8 GB/s outgoing). With 6 ports, the natural interconnection mode is a 3-D torus.
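
To illustrate why 6 ports map naturally onto a 3-D torus, the sketch below lists the neighbours of a node: one in the plus and one in the minus direction along each of the three axes, with wrap-around at the edges. The torus dimensions and all names are hypothetical and chosen only for illustration; the 7.6 GB/s per-port figure is taken from the text.

```python
PORT_BANDWIDTH_GBS = 7.6  # per SeaStar port, both directions combined


def torus_neighbors(coord, dims):
    """Return the six neighbours of coord in a 3-D torus of size dims."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo provides the wrap-around that closes each ring.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


dims = (4, 4, 4)  # hypothetical torus size, not an actual XT3 layout
corner = torus_neighbors((0, 0, 0), dims)
aggregate_gbs = len(corner) * PORT_BANDWIDTH_GBS

for n in corner:
    print(n)
print(f"Aggregate port bandwidth per node: {aggregate_gbs:.1f} GB/s")
```

Even a node at a "corner" such as (0, 0, 0) has six distinct neighbours, because each axis wraps around; this is what distinguishes a torus from a plain 3-D mesh.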

As with the earlier Cray T3E (see Systems disappeared from the list), Cray has chosen a microkernel approach for the compute PEs. These are dedicated to computation and communication and are not disturbed by other OS tasks that can seriously affect scalability (see [32]). For tasks like communicating with users, networking, and I/O, special PEs are added that run versions of the OS that can handle these tasks.

The XT3 is obviously designed for a distributed-memory parallel model, supporting Cray's MPI 2.0 and its one-sided communication library shmem, which dates back to the Cray T3D/T3E systems but is still popular because of its simplicity and efficiency. The system comes in cabinets of 96 PEs, including service PEs. For larger configurations the ratio of service PEs to compute PEs can generally be lowered. So, a hypothetical maximal configuration of 30,508 PEs would need only 106 service PEs.

Measured Performances:
The Red Storm machine at Sandia National Laboratories, USA, can be regarded as a prototype for the XT3 systems, albeit at a clock frequency of 2.0 GHz instead of 2.4 GHz. In [45] a speed of 36,190 out of 43,520 Gflop/s is reported: an efficiency of 83%. ORNL in the USA reports in [45] a performance of 20,527 Gflop/s on a 5200-processor regular XT3 for a linear system of unknown size. The efficiency is 82.2% in this case.
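
The quoted efficiencies can be checked as measured speed divided by theoretical peak. The sketch below is a rough check, not part of the cited results. It takes the Red Storm measurement as 36,190 Gflop/s, the value consistent with the stated 83%, and assumes 2 floating-point results per cycle per processor (inferred from 4.8 Gflop/s at 2.4 GHz). Function and variable names are hypothetical.

```python
FLOPS_PER_CYCLE = 2  # assumption, inferred from 4.8 Gflop/s at 2.4 GHz


def peak_gflops(n_procs: int, clock_ghz: float) -> float:
    """Theoretical peak in Gflop/s for n_procs processors."""
    return n_procs * clock_ghz * FLOPS_PER_CYCLE


# Red Storm: measured speed and peak are both quoted in the text.
red_storm_eff = 36_190 / 43_520
# Processor count implied by the quoted peak at 2.0 GHz.
red_storm_procs = 43_520 / (2.0 * FLOPS_PER_CYCLE)

# ORNL XT3: the peak is derived from 5200 processors at 2.4 GHz.
ornl_peak = peak_gflops(5200, 2.4)
ornl_eff = 20_527 / ornl_peak

print(f"Red Storm efficiency: {red_storm_eff:.1%}")
print(f"Red Storm implied processors: {red_storm_procs:.0f}")
print(f"ORNL peak: {ornl_peak:.0f} Gflop/s")
print(f"ORNL efficiency: {ornl_eff:.1%}")
```

Under these assumptions the quoted Red Storm peak corresponds to somewhat more than 10,000 processors, in line with the "10,000+ processor" description in the Remarks.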