The Hitachi SR16000

    Machine type                  RISC-based distributed-memory multi-processor
    Models                        SR16000 XM1, L2, and VL1
    Operating system              AIX (IBM's Unix variant)
    Connection structure          Multi-dimensional crossbar (see remarks)
    Compilers                     Fortran 77, Fortran 95, Parallel Fortran, C, C++
    Vendor's information Web page www.hitachi.co.jp/Prod/comp/hpc/SR_series/sr16000/index.html
    Year of introduction          2009 (L2), 2010 (VL1, XM1)

    System parameters:

    Model                     SR16000 XM1                 SR16000 L2                  SR16000 VL1
    Clock cycle               3.3 GHz                     4.7 GHz                     5.0 GHz
    Theor. peak performance
      Per proc. (64-bits)     844.8 Gflop/s               601.6 Gflop/s               1280 Gflop/s
      Maximal                 433 Tflop/s                 308 Tflop/s                 655 Tflop/s
    Main memory
      Memory/node             ≤ 256 GB                    ≤ 256 GB                    ≤ 4 TB
      Memory/maximal          131 TB                      131 TB                      2 PB
    No. of processors         4–512                       4–512                       4–512
    Communication bandwidth
      Point-to-point          ≤ 16 GB/s (bidirectional)   ≤ 16 GB/s (bidirectional)   ≤ 16 GB/s (bidirectional)

    Remarks:

    The SR16000 is the fourth generation of Hitachi's distributed-memory parallel systems. It replaces its predecessor, the SR11000 (see Systems Disappeared from the List). We discuss here the latest models, the SR16000 XM1, L2 and VL1. All three systems are water cooled. The processors used in the SR16000 L2 and VL1 are IBM's POWER6, but the packaging is somewhat different from what is used in IBM's p575 systems (see the IBM p575 page).

    Unlike in their predecessor, the SR11000, the processors in all three models are fit for Hitachi's Pseudo Vector Processing, a technique that enables the processing of very long vectors without the detrimental effects that normally occur when out-of-cache data access is required.

    The peak performance per basic processor, or IP, is attained with 2 simultaneous multiply/add instructions per cycle, giving 18.8 Gflop/s in the L2 and 20 Gflop/s in the VL1. However, 32 basic processors in the L2 and 64 in the VL1 are coupled to form one processing node, all addressing a common part of the memory. For the user this node is the basic computing entity, with a peak speed of 601.6 and 1280 Gflop/s, respectively. Hitachi refers to this node configuration as COMPAS, Co-operative Micro-Processors in single Address Space. In fact this is a kind of SMP clustering as discussed in the sections on the main architectural classes and ccNUMA machines. In contrast to the earlier SR8000, the SR16000 no longer contains an SP, a system processor that performed system tasks and managed communication with other nodes and a range of I/O devices. These tasks are now performed by the processors in the SMP nodes themselves. The structure of the XM1 model is identical to that of the L2 model, except that POWER7 processors are employed at a clock frequency of 3.3 GHz. This gives the XM1 model a peak performance advantage of over 40% while at the same time using considerably less energy than a similar L2 configuration.
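
    The peak figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. It assumes that each multiply/add counts as 2 floating-point operations, so that 2 simultaneous multiply/adds give 4 flops per cycle; the function and variable names are illustrative only, and the XM1 node peak is taken directly from the parameter table rather than derived.

```python
# Sketch of the peak-performance arithmetic for the POWER6-based models.
# Assumption: one multiply/add = 2 flops, so 2 per cycle = 4 flops/cycle.
FLOPS_PER_CYCLE = 2 * 2


def ip_peak_gflops(clock_ghz):
    """Peak speed of one basic processor (IP) in Gflop/s."""
    return clock_ghz * FLOPS_PER_CYCLE


def node_peak_gflops(clock_ghz, ips_per_node):
    """Peak speed of one node: all IPs in the node running at peak."""
    return ip_peak_gflops(clock_ghz) * ips_per_node


l2_node = node_peak_gflops(4.7, 32)   # SR16000 L2: 32 IPs at 4.7 GHz
vl1_node = node_peak_gflops(5.0, 64)  # SR16000 VL1: 64 IPs at 5.0 GHz
xm1_node = 844.8                      # SR16000 XM1, value from the table

# Relative peak advantage of an XM1 node over an L2 node, in percent.
xm1_advantage_pct = (xm1_node / l2_node - 1.0) * 100.0

print(f"L2 node peak:  {l2_node:.1f} Gflop/s")
print(f"VL1 node peak: {vl1_node:.1f} Gflop/s")
print(f"XM1 advantage over L2: {xm1_advantage_pct:.1f}%")
```

    Multiplying a node peak by the maximum of 512 nodes gives the maximal system figures listed in the parameter table, up to rounding.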

    The SR16000 has a multi-dimensional crossbar with a single-directional link speed of 4–16 GB/s. For this, QDR InfiniBand is used in a torus topology. For 4–8 nodes the cross-section of the network is 1 hop, for configurations of 16–64 nodes it is 2 hops, and from 128-node systems onwards it is 3 hops.
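
    The hop counts above can be written as a small lookup, sketched here in Python. The function name is hypothetical, and the upper limit of 512 nodes is taken from the parameter table. Node counts that fall between the quoted ranges (for example 12 or 96) are not covered by the text, so the sketch returns None for them rather than guessing.

```python
# Lookup of the hop counts quoted in the text, by number of nodes.
def hops_for_nodes(n_nodes):
    """Return the quoted hop count, or None where the text gives no figure."""
    if 4 <= n_nodes <= 8:
        return 1
    if 16 <= n_nodes <= 64:
        return 2
    if 128 <= n_nodes <= 512:  # 512 is the maximum configuration in the table
        return 3
    return None  # not stated in the source


for n in (4, 8, 16, 64, 128, 512):
    print(n, "nodes:", hops_for_nodes(n), "hop(s)")
```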

    As in some other systems, such as the Cray XE6 and the late AlphaServer SC and NEC Cenju-4, one is able to directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation, this should allow for writing distributed programs with very low parallelisation overhead.

    The usual communication libraries like PVM and MPI are provided. When MPI is used, it is possible to access individual IPs within the nodes. Furthermore, within one node it is possible to use OpenMP on individual IPs. Mostly this is less efficient than the automatic parallelisation done by Hitachi's compiler, but a performance gain can be attained when coarser-grained task parallelism is expressed via OpenMP. Hitachi provides its own numerical libraries to solve dense and sparse linear systems, FFTs, etc. As yet it is not known whether third-party numerical libraries like NAG are available.

    Note: Large HPC configurations of the SR16000 are not sold in Europe, as Hitachi judges this to be of insufficient economic interest.

    Measured Performances:
    As yet no performance figures are known for the SR16000 XM1 and VL1, but in late 2009 a speed of 56.65 out of 77 Tflop/s was registered in [35] for the solution of a linear system of size N = 1,110,000 on the 4096-core SR16000 model L2 of the National Institute for Fusion Science in Japan. This amounts to an efficiency of 73.6%.
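
    As a check on the figures above, the following Python sketch recomputes the theoretical peak of the 4096-core L2 system and the resulting efficiency. It assumes the 18.8 Gflop/s per-core peak quoted in the remarks; the variable names are illustrative only.

```python
# Recompute peak and efficiency for the 4096-core SR16000 L2 result.
CORES = 4096
PEAK_PER_CORE_GFLOPS = 18.8   # per-IP peak of the L2, from the remarks
MEASURED_TFLOPS = 56.65       # measured speed reported in [35]

# Theoretical peak of the whole system in Tflop/s.
peak_tflops = CORES * PEAK_PER_CORE_GFLOPS / 1000.0

# Efficiency = measured speed as a percentage of theoretical peak.
efficiency_pct = MEASURED_TFLOPS / peak_tflops * 100.0

print(f"Theoretical peak: {peak_tflops:.1f} Tflop/s")
print(f"Efficiency: {efficiency_pct:.1f}%")
```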