Introduction
HPC Architecture
  1. Shared-memory SIMD machines
  2. Distributed-memory SIMD machines
  3. Shared-memory MIMD machines
  4. Distributed-memory MIMD machines
  5. ccNUMA machines
  6. Clusters
  7. Processors
    1. AMD Magny-Cours
    2. IBM POWER6
    3. IBM POWER7
    4. IBM PowerPC 970MP
    5. IBM BlueGene processors
    6. Intel Xeon
    7. The SPARC processors
  8. Accelerators
    1. GPU accelerators
      1. ATI/AMD
      2. nVIDIA
    2. General accelerators
      1. The IBM/Sony/Toshiba Cell processor
      2. ClearSpeed/Petapath
    3. FPGA accelerators
      1. Convey
      2. Kuberre
      3. SRC
  9. Networks
    1. InfiniBand
    2. InfiniPath
    3. Myrinet
Available systems
  • The Bull bullx system
  • The Cray XE6
  • The Cray XMT
  • The Cray XT5h
  • The Fujitsu FX1
  • The Hitachi SR16000
  • The IBM BlueGene/L&P
  • The IBM eServer p575
  • The IBM System Cluster 1350
  • The NEC SX-9
  • The SGI Altix UV series
Systems disappeared from the list
Systems under development
Glossary
Acknowledgments
References

The SGI Altix UV series

Machine type: x86-based ccNUMA shared-memory system
Models: Altix UV 100, 1000
Operating system: Linux (SuSE SLES9/10, RedHat EL4/5) + extensions
Connection structure: 2-D torus (UV 100), paired 2-D torus (UV 1000)
Compilers: Fortran 95, C, C++
Vendor's information Web page: www.sgi.com/products/servers/altix/uv/
Year of introduction: 2010

System parameters:

Model                        Altix UV 100    Altix UV 1000
Clock cycle                  2.25 GHz        2.25 GHz
Theor. peak performance
  Per core (64-bit)          9.0 Gflop/s     9.0 Gflop/s
  Maximum (64-bit)           6.9 Tflop/s     18.5 Tflop/s
Main memory
  Memory/blade               ≤ 128 GB        ≤ 128 GB
  Memory/maximal             ≤ 6 TB          ≤ 16 TB
Communication bandwidth
  Point-to-point             7.5 GB/s        7.5 GB/s
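
The peak figures in the table can be verified with a little arithmetic: assuming a Nehalem-EX core performs 4 double-precision floating-point operations per cycle, a 2.25 GHz clock gives the 9.0 Gflop/s per core, and multiplying by the maximum core counts gives the system peaks. The small C sketch below redoes this calculation; note that the flops-per-cycle value and the core counts of 768 (UV 100) and 2,048 (UV 1000) are our own inferences rather than vendor figures, and the results agree with the table up to rounding.

    /* A small check of the peak-performance arithmetic in the table above.
     * The flops-per-cycle value and the core counts are assumptions:
     * a Nehalem-EX core can do 4 double-precision flops per cycle (one SSE
     * add plus one SSE multiply), and 768/2048 cores reproduce the peaks. */
    #include <stdio.h>

    int main(void)
    {
        const double clock_ghz       = 2.25;  /* from the table           */
        const double flops_per_cycle = 4.0;   /* assumed: SSE add + mul   */
        const int    cores_uv100     = 768;   /* assumed maximum UV 100   */
        const int    cores_uv1000    = 2048;  /* assumed maximum UV 1000  */

        double per_core = clock_ghz * flops_per_cycle;            /* Gflop/s */
        printf("Per core: %.1f Gflop/s\n", per_core);
        printf("UV 100  : %.1f Tflop/s\n", per_core * cores_uv100  / 1000.0);
        printf("UV 1000 : %.1f Tflop/s\n", per_core * cores_uv1000 / 1000.0);
        return 0;
    }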

Remarks:

The Altix UV is the latest (fifth) generation of ccNUMA shared-memory systems made by SGI. Unlike the two earlier generations, the processor used is not from the Intel Itanium line but from the Xeon family: the Xeon X7500, or Nehalem EX. We only present the UV 100 and UV 1000 models here, as the UV 10 falls below our performance criterion. The UV 100 is in almost all respects just a smaller version of the UV 1000; only the packaging and the interconnect topology are presumably different, but the information about the topology of the interconnect is somewhat confusing. SGI's fact sheet about the UV systems contains the information stated above, but a white paper from 2009 gives a detailed picture of a fat-tree interconnect at the 8-blade chassis level and for a 2,048-core system. Only above 2,048 cores (the size of the current UV 1000) is a 2-D torus described, for systems of up to 262,144 cores. For the moment we assume that the information in the fact sheet is the most probable.
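
Whatever the exact physical topology, an application normally sees it only through the layout of its MPI processes. As a generic illustration (not an SGI-specific interface), the C fragment below asks MPI for a periodic 2-D Cartesian communicator, i.e. a logical 2-D torus, and leaves it to the library to map the ranks onto the physical network.

    /* Illustrative only: map MPI ranks onto a logical 2-D torus with
     * MPI_Cart_create; how well this matches the physical UV topology
     * is left to the MPI library.                                     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int dims[2]    = {0, 0};     /* let MPI choose the factorisation  */
        int periods[2] = {1, 1};     /* wrap-around: a torus, not a mesh  */
        MPI_Dims_create(nprocs, 2, dims);

        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &torus);

        int trank, coords[2];
        MPI_Comm_rank(torus, &trank);
        MPI_Cart_coords(torus, trank, 2, coords);
        printf("rank %d -> (%d,%d) in a %dx%d torus\n",
               trank, coords[0], coords[1], dims[0], dims[1]);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }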

A UV blade contains two X7500 processors, connected to each other by two QPI links, while each processor also connects to the Northbridge chipset for I/O, etc. Lastly, both processors are connected via a QPI link to the UV hub, which takes care of the communication with the rest of the system. The bandwidth from the hub to the processors is 25.6 GB/s, while each of the four ports for outside communication provides approximately 10 GB/s.

The hub does much more than act as a simple router. It ensures cache coherency in the distributed shared memory. There is an Active Memory Unit that supports atomic memory operations and takes care of thread synchronisation. The Global Register Unit (GRU) within the hub also extends the x86 addressing range (44-bit physical, 48-bit virtual) to 53 and 60 bits, respectively, to accommodate the potentially very large global address space of the system. In addition, it houses an external TLB cache that enables large-memory-page support. Furthermore, it can perform asynchronous block-copy operations akin to those of the block transfer engine in Cray's Gemini router. The GRU also accommodates scatter/gather operations, which can greatly speed up cache-unfriendly sparse algorithms. Lastly, MPI operations can be off-loaded from the CPU: barriers and the synchronisation for reduction operations are taken care of by the MPI Offload Engine (MOE).
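
From the programmer's point of view the MOE is invisible: the operations it can off-load are ordinary MPI collectives. A minimal C example of the kind of operations mentioned above, a barrier and a global reduction, is shown below; whether they are actually off-loaded to the hub depends on the MPI library used (see the next paragraph).

    /* Ordinary MPI collectives; on the UV the barrier and the reduction
     * are the kind of operations that, according to the description above,
     * can be off-loaded to the hub's MPI Offload Engine.                  */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);                 /* global synchronisation */

        double local = (double)rank, global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);      /* global reduction */

        if (rank == 0)
            printf("sum of ranks = %.0f\n", global);

        MPI_Finalize();
        return 0;
    }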

The UV systems come with the usual Intel stack of compilers and tools. To take full advantage of the facilities of the hub it is advisable to use SGI's MPI implementation from its Message Passing Toolkit, although independent implementations, like Open MPI, will also work.
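
Because the UV presents itself as a single ccNUMA shared-memory machine, the Intel compilers' OpenMP support can be used directly as well, provided data placement respects the NUMA character of the memory. The sketch below is a generic example, not SGI-specific: it initialises an array inside a parallel region so that the Linux first-touch policy places each page close to the thread that will later use it.

    /* Generic ccNUMA-aware OpenMP sketch (not SGI-specific): touch the
     * data in parallel so that first-touch page placement keeps each
     * chunk of the array near the threads that work on it.             */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        if (a == NULL) return 1;

        /* parallel first touch: pages land near the threads that use them */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f (threads: %d)\n", sum, omp_get_max_threads());
        free(a);
        return 0;
    }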

Measured Performances:

Synthetic benchmark results from the EuroBen benchmark suite are available for the Altix UV at LRZ, Garching, Germany; they can be found at [10]. The tests show excellent scalability for up to 64 cores. Note, however, that this system runs at a clock frequency of 2.0 GHz instead of 2.25 GHz.