The HP Integrity SuperDome

Machine type: RISC-based ccNUMA system.
Models: HP Integrity SuperDome.
Operating system: HP-UX (HP's usual Unix flavour), Linux.
Connection structure: Crossbar.
Compilers: Fortran 77, Fortran 90, HPF, C, C++.
Vendor's information Web page: http://h20341.www2.hp.com/integrity/cache/342370-0-0-0-121.html
Year of introduction: 2004.

System parameters:

Model: Integrity SuperDome
Clock cycle: 1.6 GHz
Theor. peak performance
  Per proc. core (64-bits): 6.4 Gflop/s
  Maximal (64-bits): 409.6 Gflop/s
Main memory
  Memory/maximal: 1 TB
No. of processors: ≤ 64
Communication bandwidth
  Aggregate (global): 64 GB/s
  Cell to backplane: 8 GB/s
  Within cell (see below): 16 GB/s
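
The peak figures above follow from the clock rate. The following is a minimal sketch of that arithmetic, assuming 4 floating-point operations per clock cycle (the value implied by 6.4 Gflop/s at 1.6 GHz, not stated explicitly in this section). The constant and function names are illustrative only.

```python
# Illustrative arithmetic; names are hypothetical, values are taken from
# the system parameters listed above.
CLOCK_GHZ = 1.6        # clock cycle
FLOPS_PER_CYCLE = 4    # assumption: implied by 6.4 Gflop/s / 1.6 GHz
MAX_PROCESSORS = 64    # maximum number of processors


def peak_gflops(clock_ghz: float, flops_per_cycle: int, n_procs: int = 1) -> float:
    """Theoretical peak performance in Gflop/s for n_procs processors."""
    return clock_ghz * flops_per_cycle * n_procs


per_core = peak_gflops(CLOCK_GHZ, FLOPS_PER_CYCLE)
maximal = peak_gflops(CLOCK_GHZ, FLOPS_PER_CYCLE, MAX_PROCESSORS)
print(f"per core: {per_core:.1f} Gflop/s, maximal: {maximal:.1f} Gflop/s")
```

Changing `MAX_PROCESSORS` shows how the maximal figure scales linearly with the configuration size.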

Remarks:

The Integrity Superdome is HP's investment in the future for high-end servers. Within a few years it should replace the PA-RISC-based HP 9000 Superdome. HP has anticipated this by giving it exactly the same macro structure: cells are connected to a backplane crossbar that enables communication between cells. For the backplane it is immaterial whether a cell contains PA-RISC or Itanium processors. The Superdome has a 2-level crossbar: one level within a 4-processor cell and another level connecting the cells through the crossbar backplane. Every cell connects to the backplane at a speed of 8 GB/s, and the global aggregate bandwidth for a fully configured system is 64 GB/s.

As noted, the basic building block of the Superdome is the 4-processor cell. All data traffic within a cell is controlled by the Cell Controller, a 10-port ASIC. It connects to the four local memory subsystems at 16 GB/s, to the backplane crossbar at 8 GB/s, and to two ports that each serve two processors at 6.4 GB/s per port. As each processor houses two CPU cores, the available bandwidth per CPU core is 1.6 GB/s. As in the SGI Altix systems (see the section on the SGI Altix 4000), cache coherency in the Superdome is maintained using directory memory. The NUMA factor for a full 64-processor system is, by HP's account, very modest: only 1.8.
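
The per-core bandwidth and the meaning of the NUMA factor can be sketched as follows. This is illustrative arithmetic only: the names are hypothetical, and the NUMA factor is taken here in its usual sense as the ratio of remote to local memory access time, with the local access time set to one arbitrary unit rather than a measured latency.

```python
# Illustrative arithmetic; names are hypothetical, values are taken from
# the paragraph above.
PORT_BANDWIDTH_GBS = 6.4   # each processor port of the Cell Controller
PROCESSORS_PER_PORT = 2    # two processors share one port
CORES_PER_PROCESSOR = 2    # two CPU cores per processor
NUMA_FACTOR = 1.8          # HP's figure for a full 64-processor system


def bandwidth_per_core(port_gbs: float, procs_per_port: int, cores_per_proc: int) -> float:
    """Bandwidth share of one CPU core when a port is divided evenly."""
    return port_gbs / (procs_per_port * cores_per_proc)


def remote_access_time(local_time: float, numa_factor: float = NUMA_FACTOR) -> float:
    """Remote access time, assuming NUMA factor = remote time / local time."""
    return local_time * numa_factor


core_bw = bandwidth_per_core(PORT_BANDWIDTH_GBS, PROCESSORS_PER_PORT, CORES_PER_PROCESSOR)
relative_remote = remote_access_time(1.0)  # local access time = 1 arbitrary unit
print(f"bandwidth per core: {core_bw:.1f} GB/s")
print(f"remote access costs {relative_remote:.1f}x a local access")
```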

The Integrity Superdome, like its predecessor, is a ccNUMA machine. It therefore supports OpenMP over its maximum of 64 processors. As the Integrity Superdome is based on the Itanium 2, for which much Linux development has been done in the past few years, the system can also be run with the Linux OS. In fact, because the machine can be partitioned, it is possible to run both Linux and HP-UX in different partitions of the same machine. One can even mix the old PA-RISC processors with Itanium processors within one system: cells with different processor types can coexist, making the system a hybrid Integrity and HP 9000 Superdome.

At the time of writing, HP still does not offer Integrity Superdome machines based on the dual-core Montecito processors, as Bull, Fujitsu, Hitachi, and SGI do. This is somewhat surprising, as HP was one of the original developers of the EPIC processors of which the Itanium and Montecito are representatives.

Measured Performances:
There are no performance results for the newest version of the system. However, in [45] a speed of 1716.5 Gflop/s is reported for solving a full linear system of unspecified size. This result was achieved on a system with a total of 384 1.5 GHz processors. As the Theoretical Peak Performance of such a cluster is 2304 Gflop/s, the efficiency is about 75%.
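
The efficiency quoted above is the ratio of measured to theoretical peak performance. The following is a minimal sketch of the calculation, assuming 4 floating-point operations per cycle per processor (the value implied by a 2304 Gflop/s peak for 384 processors at 1.5 GHz). The variable names are illustrative.

```python
# Illustrative arithmetic; names are hypothetical, values are taken from
# the measured result cited above.
MEASURED_GFLOPS = 1716.5   # reported speed for the linear system solve
N_PROCESSORS = 384         # processors in the measured system
CLOCK_GHZ = 1.5            # clock rate of those processors
FLOPS_PER_CYCLE = 4        # assumption: implied by 2304 / (384 * 1.5)

# Theoretical peak performance of the measured configuration in Gflop/s.
peak_gflops = N_PROCESSORS * CLOCK_GHZ * FLOPS_PER_CYCLE

# Efficiency: fraction of the theoretical peak actually achieved.
efficiency = MEASURED_GFLOPS / peak_gflops

print(f"peak: {peak_gflops:.0f} Gflop/s, efficiency: {efficiency:.1%}")
```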