Introduction
HPC Architecture
  1. Shared-memory SIMD machines
  2. Distributed-memory SIMD machines
  3. Shared-memory MIMD machines
  4. Distributed-memory MIMD machines
  5. ccNUMA machines
  6. Clusters
  7. Processors
    1. AMD Magny-Cours
    2. IBM POWER6
    3. IBM POWER7
    4. IBM PowerPC 970MP
    5. IBM BlueGene processors
    6. Intel Xeon
    7. The SPARC processors
  8. Accelerators
    1. GPU accelerators
      1. ATI/AMD
      2. nVIDIA
    2. General accelerators
      1. The IBM/Sony/Toshiba Cell processor
      2. ClearSpeed/Petapath
    3. FPGA accelerators
      1. Convey
      2. Kuberre
      3. SRC
  9. Networks
    1. InfiniBand
    2. InfiniPath
    3. Myrinet
Available systems
  • The Bull bullx system
  • The Cray XE6
  • The Cray XMT
  • The Cray XT5h
  • The Fujitsu FX1
  • The Hitachi SR16000
  • The IBM BlueGene/L&P
  • The IBM eServer p575
  • The IBM System Cluster 1350
  • The NEC SX-9
  • The SGI Altix UV series
Systems disappeared from the list
Systems under development
Glossary
Acknowledgments
References

The SGI Altix UV series

Machine type: x86-based ccNUMA shared-memory system
Models: Altix UV 100, 1000
Operating system: Linux (SuSE SLES9/10, RedHat EL4/5) + extensions
Connection structure: 2-D torus (UV 100), paired 2-D torus (UV 1000)
Compilers: Fortran 95, C, C++
Vendor's information Web page: www.sgi.com/products/servers/altix/uv/
Year of introduction: 2010

System parameters:

Model                        Altix UV 100    Altix UV 1000
Clock cycle                  2.25 GHz        2.25 GHz
Theor. peak performance
  Per core (64-bit)          9.0 Gflop/s     9.0 Gflop/s
  Maximum (64-bit)           6.9 Tflop/s     18.5 Tflop/s
Main memory
  Memory/blade               ≤ 128 GB        ≤ 128 GB
  Memory/maximal             ≤ 6 TB          ≤ 16 TB
Communication bandwidth
  Point-to-point             7.5 GB/s        7.5 GB/s
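
The peak figures in the table can be verified with a little arithmetic: assuming a Nehalem-EX core performs 4 double-precision floating-point operations per cycle, a 2.25 GHz clock gives the 9.0 Gflop/s per core, and multiplying by the maximum core counts gives the system peaks. The small C sketch below redoes this calculation; note that the flops-per-cycle value and the core counts of 768 (UV 100) and 2,048 (UV 1000) are our own inferences rather than vendor figures, and the results agree with the table up to rounding.

    /* A small check of the peak-performance arithmetic in the table above.
     * The flops-per-cycle value and the core counts are assumptions:
     * a Nehalem-EX core can do 4 double-precision flops per cycle (one SSE
     * add plus one SSE multiply), and 768/2048 cores reproduce the peaks. */
    #include <stdio.h>

    int main(void)
    {
        const double clock_ghz       = 2.25;  /* from the table           */
        const double flops_per_cycle = 4.0;   /* assumed: SSE add + mul   */
        const int    cores_uv100     = 768;   /* assumed maximum UV 100   */
        const int    cores_uv1000    = 2048;  /* assumed maximum UV 1000  */

        double per_core = clock_ghz * flops_per_cycle;            /* Gflop/s */
        printf("Per core: %.1f Gflop/s\n", per_core);
        printf("UV 100  : %.1f Tflop/s\n", per_core * cores_uv100  / 1000.0);
        printf("UV 1000 : %.1f Tflop/s\n", per_core * cores_uv1000 / 1000.0);
        return 0;
    }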

Remarks:

The Altix UV is the latest (fifth) generation of ccNUMA shared-memory systems made by SGI. Unlike the two earlier generations, the processor used is not from the Intel Itanium line but from the Xeon family: the Xeon X7500, or Nehalem EX. We only present the UV 100 and UV 1000 models here, as the UV 10 falls below our performance criterion. The UV 100 is in almost all respects just a smaller version of the UV 1000; only the packaging and the interconnect topology are presumably different, but the information about the topology of the interconnect is somewhat confusing. SGI's fact sheet about the UV systems contains the information stated above, but a white paper from 2009 gives a detailed picture of a fat-tree interconnect at the 8-blade chassis level and for a 2,048-core system. Only above 2,048 cores (the size of the current UV 1000) is a 2-D torus described, for systems of up to 262,144 cores. For the moment we assume that the information in the fact sheet is the most probable.
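
Whatever the exact physical topology, an application normally sees it only through the layout of its MPI processes. As a generic illustration (not an SGI-specific interface), the C fragment below asks MPI for a periodic 2-D Cartesian communicator, i.e. a logical 2-D torus, and leaves it to the library to map the ranks onto the physical network.

    /* Illustrative only: map MPI ranks onto a logical 2-D torus with
     * MPI_Cart_create; how well this matches the physical UV topology
     * is left to the MPI library.                                     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int dims[2]    = {0, 0};     /* let MPI choose the factorisation  */
        int periods[2] = {1, 1};     /* wrap-around: a torus, not a mesh  */
        MPI_Dims_create(nprocs, 2, dims);

        MPI_Comm torus;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &torus);

        int trank, coords[2];
        MPI_Comm_rank(torus, &trank);
        MPI_Cart_coords(torus, trank, 2, coords);
        printf("rank %d -> (%d,%d) in a %dx%d torus\n",
               trank, coords[0], coords[1], dims[0], dims[1]);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }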

A UV blade contains two X7500 processors, connected to each other by two QPI links, while each processor also connects to the Northbridge chipset for I/O, etc. Lastly, both processors are connected via a QPI link to the UV hub, which takes care of the communication with the rest of the system. The bandwidth from the hub to the processors is 25.6 GB/s, while each of the four ports for outside communication provides approximately 10 GB/s.

The hub does much more than act as a simple router. It ensures cache coherency in the distributed shared memory. There is an Active Memory Unit that supports atomic memory operations and takes care of thread synchronisation. The Global Register Unit (GRU) within the hub also extends the x86 addressing range (44-bit physical, 48-bit virtual) to 53 and 60 bits, respectively, to accommodate the potentially very large global address space of the system. In addition, it houses an external TLB cache that enables large-memory-page support. Furthermore, it can perform asynchronous block-copy operations akin to those of the block transfer engine in Cray's Gemini router. The GRU also accommodates scatter/gather operations, which can greatly speed up cache-unfriendly sparse algorithms. Lastly, MPI operations can be off-loaded from the CPU: barriers and the synchronisation for reduction operations are taken care of by the MPI Offload Engine (MOE).
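
From the programmer's point of view the MOE is invisible: the operations it can off-load are ordinary MPI collectives. A minimal C example of the kind of operations mentioned above, a barrier and a global reduction, is shown below; whether they are actually off-loaded to the hub depends on the MPI library used (see the next paragraph).

    /* Ordinary MPI collectives; on the UV the barrier and the reduction
     * are the kind of operations that, according to the description above,
     * can be off-loaded to the hub's MPI Offload Engine.                  */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);                 /* global synchronisation */

        double local = (double)rank, global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);      /* global reduction */

        if (rank == 0)
            printf("sum of ranks = %.0f\n", global);

        MPI_Finalize();
        return 0;
    }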

The UV systems come with the usual Intel stack of compilers and tools. To take full advantage of the facilities of the hub it is advisable to use SGI's MPI implementation from its Message Passing Toolkit, although independent implementations, like Open MPI, will also work.
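
Because the UV presents itself as a single ccNUMA shared-memory machine, the Intel compilers' OpenMP support can be used directly as well, provided data placement respects the NUMA character of the memory. The sketch below is a generic example, not SGI-specific: it initialises an array inside a parallel region so that the Linux first-touch policy places each page close to the thread that will later use it.

    /* Generic ccNUMA-aware OpenMP sketch (not SGI-specific): touch the
     * data in parallel so that first-touch page placement keeps each
     * chunk of the array near the threads that work on it.             */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        if (a == NULL) return 1;

        /* parallel first touch: pages land near the threads that use them */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for schedule(static) reduction(+:sum)
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f (threads: %d)\n", sum, omp_get_max_threads());
        free(a);
        return 0;
    }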

Measured Performances:

Synthetic benchmark results from the EuroBen benchmark suite are available for the Altix UV at LRZ, Garching, Germany; they can be found at [10]. The tests show excellent scalability for up to 64 cores. Note, however, that this system runs at a clock frequency of 2.0 GHz instead of 2.25 GHz.