Introduction
HPC Architecture
  1. Shared-memory SIMD machines
  2. Distributed-memory SIMD machines
  3. Shared-memory MIMD machines
  4. Distributed-memory MIMD machines
  5. ccNUMA machines
  6. Clusters
  7. Processors
    1. AMD Opteron
    2. IBM POWER7
    3. IBM BlueGene/Q processor
    4. Intel Xeon
    5. The SPARC processors
  8. Accelerators
    1. GPU accelerators
      1. ATI/AMD
      2. nVIDIA
    2. General computational accelerators
      1. Intel Xeon Phi
    3. FPGA accelerators
      1. Convey
      2. Kuberre
      3. SRC
  9. Interconnects
    1. Infiniband
Available systems
  • The Bull bullx system
  • The Cray XC30
  • The Cray XE6
  • The Cray XK7
  • The Eurotech Aurora
  • The Fujitsu FX10
  • The Hitachi SR16000
  • The IBM BlueGene/Q
  • The IBM eServer p775
  • The NEC SX-9
  • The SGI Altix UV series
  • Systems disappeared from the list
Systems under development
Glossary
Acknowledgments
References

    Machine type                   RISC-based ccNUMA system
    Models                         Altix UV 2000
    Operating system               Linux (SuSE SLES9/10, RedHat EL4/5) + extensions
    Connection structure           Unspecified
    Compilers                      Fortran 95, C, C++, CUDA, OpenCL
    Vendor's information Web page  www.sgi.com/products/servers/uv/
    Year of introduction           2012

    System parameters:

    Model                          Altix UV 2000
    Clock cycle                    2.0–2.9 GHz
    Theor. peak performance
      Per core (64-bits)           16–23.2 Gflop/s
      Maximum (64-bits)            47.5 Tflop/s (CPU only)
    No. of cores                   4096
    Main memory
      Memory/blade                 ≤ 128 GB
      Memory/maximal               8.2 TB
    Communication bandwidth
      Point-to-point               6.7 GB/s/direction
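
    The per-core peak figures in the table follow directly from the clock frequency and the number of floating-point results per cycle. A minimal sketch of the arithmetic, assuming the 8 double-precision flops per cycle that the AVX units of these cores can deliver (a 4-wide add plus a 4-wide multiply per cycle):

        /* Sketch: per-core theoretical peak of the Altix UV 2000.
           Assumption: 8 DP flops/cycle (4-wide AVX add + 4-wide AVX multiply). */
        #include <stdio.h>

        int main(void)
        {
            const double flops_per_cycle = 8.0;         /* assumed AVX throughput     */
            const double clocks_ghz[]    = {2.0, 2.9};  /* clock range from the table */

            for (int i = 0; i < 2; i++)
                printf("%.1f GHz -> %4.1f Gflop/s per core\n",
                       clocks_ghz[i], clocks_ghz[i] * flops_per_cycle);
            return 0;   /* prints 16.0 and 23.2 Gflop/s, matching the table */
        }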

    Remarks:

    The Altix UV 2 series is the latest (6th) generation of ccNUMA shared-memory systems made by SGI and the second in the Altix UV series. Apart from the UV 2000 there is also a 4-socket UV 20, but this is too small to be discussed here. The processor used is the Intel Ivy Bridge. The distinguishing factor of the UV systems is their distributed shared memory, which can be as large as 8.2 TB. Every blade can carry up to 128 GB of memory that is shared in a ccNUMA fashion through hubs and NumaLink6, the 6th generation of SGI's proprietary interconnect: a very high-speed network with a point-to-point bandwidth of 6.7 GB/s per direction, double that of the former NumaLink5.
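
    Because the shared memory is physically distributed over the blades, data placement matters for performance: on a ccNUMA system a memory page is normally allocated on the node of the thread that first touches it. A minimal OpenMP sketch, assuming such a first-touch policy and a hypothetical array size, that initialises the data with the same thread layout that later uses it:

        /* Sketch: first-touch initialisation on a ccNUMA system such as the UV 2000.
           Assumption: the OS places each page on the NUMA node of the first thread
           that writes it, so parallel initialisation keeps data close to its users.
           Compile with an OpenMP-enabled compiler (e.g. -fopenmp). */
        #include <stdlib.h>

        #define N (1L << 24)   /* hypothetical array size */

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);

            /* First touch in parallel: pages end up near the threads that use them. */
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; }

            /* The compute loop uses the same static schedule, so accesses stay local. */
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < N; i++) a[i] += 2.0 * b[i];

            free(a); free(b);
            return 0;
        }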

    Like Bull, Cray, and Eurotech, SGI offers the possibility to replace two CPUs on a blade with either nVIDIA Tesla K20X GPUs or Xeon Phi accelerators.

    A UV blade contains four 12-core Ivy Bridge processors, connected to each other by two QPI links, while each processor also connects to the Northbridge chipset for I/O, etc. Lastly, all processors are connected via QPI links to the UV hub that takes care of the communication with the rest of the system. The bandwidth from the hub to the processors is 25.6 GB/s, while the 4 ports for outside communication provide approximately 13.5 GB/s each.
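
    Whether a memory reference stays within a blade or has to travel through the hub and NumaLink6 therefore depends on which NUMA node the data lives on. A hedged sketch using the standard Linux libnuma interface (an assumption about the software environment, not an SGI-specific API) to allocate a buffer explicitly on a chosen node:

        /* Sketch: explicit NUMA placement with libnuma (link with -lnuma).
           Assumption: a NUMA-aware Linux kernel, as used on the UV 2000. */
        #include <stdio.h>
        #include <numa.h>

        int main(void)
        {
            if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not available on this system\n");
                return 1;
            }
            printf("NUMA nodes: 0..%d\n", numa_max_node());

            /* Allocate 1 GB directly on node 0; references from other nodes
               travel through the hub and the NumaLink fabric. */
            size_t size = 1UL << 30;
            void *buf = numa_alloc_onnode(size, 0);
            if (buf == NULL)
                return 1;

            /* ... work on buf ... */

            numa_free(buf, size);
            return 0;
        }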

    The hub does much more than act as a simple router. It ensures cache coherency in the distributed shared memory. There is an Active Memory Unit that supports atomic memory operations and takes care of thread synchronisation. The Global Register Unit (GRU) within the hub also extends the x86 addressing range (44-bit physical, 48-bit virtual) to 53 and 60 bits, respectively, to accommodate the potentially very large global address space of the system. In addition, it houses an external TLB cache that enables large memory page support. Furthermore, it can perform asynchronous block copy operations akin to the block transfer engine in Cray's Gemini and Aries routers. The GRU also accommodates scatter/gather operations, which can greatly speed up cache-unfriendly sparse algorithms. Lastly, MPI operations can be off-loaded from the CPU: barriers and synchronisation for reduction operations are taken care of by the MPI Offload Engine (MOE).
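
    The scatter/gather support in the GRU targets exactly the indexed memory accesses that dominate sparse computations. A small sketch of that access pattern, a CSR sparse matrix-vector product in plain C (how far the hardware accelerates it is up to the system software, not the source code):

        #include <stdio.h>

        /* Sketch: CSR sparse matrix-vector product, y = A*x.
           The indirect load x[col[j]] is the cache-unfriendly gather pattern
           that scatter/gather hardware such as the GRU is meant to speed up. */
        static void spmv_csr(int nrows, const int *rowptr, const int *col,
                             const double *val, const double *x, double *y)
        {
            for (int i = 0; i < nrows; i++) {
                double sum = 0.0;
                for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                    sum += val[j] * x[col[j]];   /* gather from scattered indices */
                y[i] = sum;
            }
        }

        int main(void)
        {
            /* Tiny 3x3 example: A = [[4,0,1],[0,3,0],[2,0,5]] in CSR form. */
            int    rowptr[] = {0, 2, 3, 5};
            int    col[]    = {0, 2, 1, 0, 2};
            double val[]    = {4, 1, 3, 2, 5};
            double x[]      = {1, 1, 1}, y[3];

            spmv_csr(3, rowptr, col, val, x, y);
            printf("y = %.0f %.0f %.0f\n", y[0], y[1], y[2]);   /* 5 3 7 */
            return 0;
        }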

    The UV systems come with the usual Intel stack of compilers and tools. To take full advantage of the facilities of the hub it is advised to use SGI's MPI implementation, based on its Message Passing Toolkit, although independent implementations, like Open MPI, will also work.
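
    The operations the MOE can off-load are ordinary MPI collectives, so no source changes are needed to benefit from it. A minimal sketch of such collectives in standard MPI (the exact compile and launch commands depend on whether SGI's MPT or Open MPI is used and are not shown here):

        /* Sketch: a reduction and a barrier, the kind of collectives the MOE off-loads. */
        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            double local = (double)rank, total = 0.0;
            MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

            MPI_Barrier(MPI_COMM_WORLD);
            if (rank == 0)
                printf("sum over %d ranks = %.1f\n", size, total);

            MPI_Finalize();
            return 0;
        }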

    Measured Performances:
    In contrast to the SGI ICE X cluster systems, there are as yet no performance results available for this new system.