The Hitachi SR16000

Machine type: RISC-based distributed-memory multi-processor.
Models: SR16000 XM1, M1, and VM1.
Operating system: AIX (IBM's Unix variant).
Connection structure: Multi-dimensional crossbar (see remarks).
Compilers: Fortran 77, Fortran 95, Parallel Fortran, C, C++.
Vendor's information Web page: www.hitachi.co.jp/Prod/comp/hpc/SR_series/sr16000/index.html (only in Japanese).
Year of introduction: 2010 (XM1), 2011 (M1, VM1).

System parameters:

Model                      SR16000 XM1        SR16000 M1         SR16000 VM1
Clock cycle                4.76 GHz           3.70 GHz           5.0 GHz
Theor. peak performance
  Per proc. (64-bits)      1049.6 Gflop/s     818.4 Gflop/s
  Maximal                  537 Tflop/s        502 Tflop/s        8192 Gflop/s
Main memory
  Memory/node              ≤ 256 GB           ≤ 256 GB           ≤ 8 TB
  Memory/maximal           131 TB             131 TB
No. of processors          1–512              32–512             1
Communication bandwidth
  Point-to-point           ≤ 16 GB/s          ≤ 16 GB/s          ≤ 16 GB/s
                           (bidirectional)    (bidirectional)    (bidirectional)

Remarks:

The SR16000 is the fourth generation of Hitachi's distributed-memory parallel systems. It replaces its predecessor, the SR11000 (see Systems Disappeared from the List). We discuss here the latest models, the SR16000 XM1, M1, and VM1. All three systems are water cooled. The processors used in these models are IBM's POWER7, but the packaging is somewhat different from that used in IBM's p775 systems (see the IBM p775 page).

Unlike those in their predecessor, the SR11000, the processors in all three models support Hitachi's Pseudo Vector Processing (PVP), a technique that enables the processing of very long vectors without the detrimental effects that normally occur when out-of-cache data access is required.
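
To give an impression of the access pattern PVP is aimed at, the sketch below shows a long-vector loop in C. It is a generic illustration only: the GCC/Clang builtin __builtin_prefetch stands in for the preload instructions that Hitachi's compiler would generate, and the prefetch distance PF_DIST is an assumed tuning parameter, not a documented SR16000 value.

    /* Generic illustration of the access pattern Pseudo Vector Processing
     * targets: a long vector traversal that would otherwise stall on
     * out-of-cache loads.  __builtin_prefetch is used here only as a
     * stand-in for compiler-generated preload instructions; PF_DIST is an
     * assumed prefetch distance, not a documented SR16000 value. */
    #include <stddef.h>

    #define PF_DIST 64   /* elements to prefetch ahead (assumption) */

    void daxpy_long(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n) {
                __builtin_prefetch(&x[i + PF_DIST], 0, 0);  /* read, low temporal locality */
                __builtin_prefetch(&y[i + PF_DIST], 1, 0);  /* will be written             */
            }
            y[i] = a * x[i] + y[i];
        }
    }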

The peak performance per basic processor, or IP, can be attained with 2 simultaneous multiply/add instructions, i.e., 4 floating-point operations per clock cycle, resulting in a speed of 14.8 Gflop/s for the M1 (and 20 Gflop/s for the VM1). However, 32 basic processors in the M1 and 64 processors in the VM1 are coupled to form one processing node, all addressing a common part of the memory. For the user this node is the basic computing entity, with a peak speed of 1049.6 Gflop/s, 818.4 Gflop/s, and 8192 Gflop/s for the XM1, M1, and VM1, respectively. Hitachi refers to this node configuration as COMPAS, Co-operative Micro-Processors in single Address Space. In fact this is a kind of SMP clustering as discussed in the sections on the main architectural classes and ccNUMA machines. In contrast to the preceding SR8000, the SR16000 does not contain an SP anymore, a system processor that performed system tasks, managed communication with other nodes, and handled a range of I/O devices. These tasks are now performed by the processors in the SMP nodes themselves. The structure of the XM1 model is identical to that of the M1 model, except that the POWER7 processors run at a clock frequency of 4.76 GHz instead of 3.70 GHz.
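
The per-IP figures quoted above follow directly from the clock frequency and the number of floating-point operations per clock cycle. A minimal check of this arithmetic is given below, assuming the 4 flops per cycle implied by two simultaneous multiply/add instructions; the product of the XM1's maximum processor count and its per-processor peak is added as a cross-check against the maximal figure in the system parameters table.

    /* Worked check of the per-IP peak figures quoted above, assuming
     * 2 simultaneous multiply/add instructions = 4 flops per clock cycle.
     * The last line multiplies the XM1's maximum processor count by its
     * per-proc. peak from the table (512 x 1049.6 Gflop/s ~ 537 Tflop/s). */
    #include <stdio.h>

    int main(void)
    {
        const double flops_per_cycle = 4.0;            /* 2 multiply/adds per cycle */

        double m1_ip   = 3.70 * flops_per_cycle;       /* Gflop/s per IP, M1  */
        double vm1_ip  = 5.00 * flops_per_cycle;       /* Gflop/s per IP, VM1 */
        double xm1_max = 512 * 1049.6 / 1000.0;        /* Tflop/s: 512 x per-proc. peak */

        printf("M1  per IP : %.1f Gflop/s\n", m1_ip);  /* 14.8  */
        printf("VM1 per IP : %.1f Gflop/s\n", vm1_ip); /* 20.0  */
        printf("XM1 maximal: %.1f Tflop/s\n", xm1_max);/* 537.4 */
        return 0;
    }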

The SR16000 has a multi-dimensional crossbar with a single-directional link speed of 4–16 GB/s, implemented with QDR InfiniBand in a torus topology. For systems of 4–8 nodes the cross-section of the network is 1 hop; for configurations of 16–64 nodes it is 2 hops, and from 128-node systems onward it is 3 hops.
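
To give a feel for what these figures mean in practice, the small sketch below encodes the hop counts quoted above and a bandwidth-bound lower limit on a point-to-point transfer over a 16 GB/s link. Per-hop latency is not specified and is therefore left out; the 8 MB message size is merely an example.

    /* Two small helpers grounded in the figures above: the hop count for a
     * given system size, and a bandwidth-bound lower limit on a
     * point-to-point transfer over a 16 GB/s link (per-hop latency is not
     * specified in the text and is ignored here). */
    #include <stdio.h>

    #define LINK_BW_GB_S 16.0                 /* GB/s per link, from the table */

    static int hops(int nodes)                /* hop counts as stated above */
    {
        if (nodes <= 8)  return 1;            /* 4-8 nodes    */
        if (nodes <= 64) return 2;            /* 16-64 nodes  */
        return 3;                             /* 128+ nodes   */
    }

    int main(void)
    {
        double msg_gb = 0.008;                            /* example: 8 MB message */
        double t_ms   = msg_gb / LINK_BW_GB_S * 1000.0;   /* bandwidth term only   */
        printf("256-node system: %d hop(s), >= %.2f ms for an 8 MB message\n",
               hops(256), t_ms);
        return 0;
    }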

As in some other systems, such as the Cray XE6 and the late AlphaServer SC and NEC Cenju-4, one is able to directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation, this should allow for writing distributed programs with very low parallelisation overhead.
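
In MPI terms this style of programming corresponds to one-sided communication. The fragment below is a minimal, generic MPI-3 RMA sketch rather than Hitachi-specific code: each rank exposes a window and writes one value directly into the memory of its right-hand neighbour, with fence synchronisation standing in for the hardware barrier mentioned above.

    /* Minimal one-sided (RMA) example: each rank puts its rank number into
     * the window of its right-hand neighbour.  Generic MPI-3 code, shown
     * only to illustrate the "direct access to remote memories" style. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *buf;
        MPI_Win win;
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf, &win);
        *buf = -1;

        int right = (rank + 1) % size;

        MPI_Win_fence(0, win);                      /* open access epoch  */
        MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);                      /* complete all puts  */

        printf("rank %d received %d from its left neighbour\n", rank, *buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }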

The usual communication libraries like PVM and MPI are provided. When MPI is used it is possible to access individual IPs within the nodes. Furthermore, within a node it is possible to use OpenMP on individual IPs. Mostly this is less efficient than the automatic parallelisation performed by Hitachi's compiler, but when coarser-grained task parallelism is expressed via OpenMP a performance gain can be attained. Hitachi provides its own numerical libraries for solving dense and sparse linear systems, FFTs, etc. As yet it is not known whether third-party numerical libraries like NAG are available.
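
A common way to combine the two levels of parallelism is the usual hybrid scheme, MPI between nodes and OpenMP across the IPs within a node. The sketch below shows only the general pattern; it is not tuned for or specific to the SR16000, and the placement of ranks and threads is left to the runtime.

    /* Generic hybrid MPI + OpenMP sketch: the intended mapping is one MPI
     * rank per node with OpenMP threads across the IPs inside the node.
     * Each rank sums its slice of a vector with an OpenMP reduction;
     * MPI_Allreduce then combines the per-node partial sums. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N_LOCAL 1000000          /* elements handled per rank (example size) */

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        static double x[N_LOCAL];
        for (int i = 0; i < N_LOCAL; i++)
            x[i] = 1.0;                           /* trivial data, sum is known */

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)  /* intra-node parallelism */
        for (int i = 0; i < N_LOCAL; i++)
            local += x[i];

        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %.0f (threads per rank: %d)\n",
                   global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }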

Note: large HPC configurations of the SR16000 are not sold in Europe, as Hitachi judges this to be of insufficient economic interest.

Measured Performances:
In [39] an SR16000 XM1 system is listed: a 10340-core configuration attained a speed of 253 Tflop/s on the LINPACK benchmark, an efficiency of 80.0%. The size of the benchmark problem was not specified.
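
For reference, the quoted efficiency relates the measured speed to the theoretical peak of the installed configuration via efficiency = Rmax/Rpeak, so 253 Tflop/s at 80.0% implies an Rpeak of roughly 316 Tflop/s. A trivial check of that arithmetic:

    /* Rpeak implied by the quoted LINPACK result: efficiency = Rmax / Rpeak. */
    #include <stdio.h>

    int main(void)
    {
        double rmax_tflops  = 253.0;               /* measured LINPACK speed */
        double efficiency   = 0.800;               /* quoted efficiency      */
        double rpeak_tflops = rmax_tflops / efficiency;
        printf("implied Rpeak: %.1f Tflop/s\n", rpeak_tflops);   /* ~316.3 */
        return 0;
    }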