The Hitachi SR16000

Machine type: RISC-based distributed-memory multi-processor
Models: SR16000 XM1, L2, and VL1
Operating system: AIX (IBM's Unix variant)
Connection structure: Multi-dimensional crossbar (see remarks)
Compilers: Fortran 77, Fortran 95, Parallel Fortran, C, C++
Vendor's information Web page: www.hitachi.co.jp/Prod/comp/hpc/SR_series/sr16000/index.html
Year of introduction: 2009 (L2), 2010 (VL1, XM1)

System parameters:

Model                     SR16000 XM1      SR16000 L2       SR16000 VL1
Clock cycle               3.3 GHz          4.7 GHz          5.0 GHz
Theor. peak performance
  Per proc. (64-bits)     844.8 Gflop/s    601.6 Gflop/s    1280 Gflop/s
  Maximal                 433 Tflop/s      308 Tflop/s      655 Tflop/s
Main memory
  Memory/node             ≤ 256 GB         ≤ 256 GB         ≤ 4 TB
  Memory/maximal          131 TB           131 TB           2 PB
No. of processors         4–512            4–512            4–512
Communication bandwidth (bidirectional)
  Point-to-point          ≤ 16 GB/s        ≤ 16 GB/s        ≤ 16 GB/s
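
The maximal figures listed above are consistent with multiplying the per-node values by the largest configuration of 512 nodes. The following is a minimal sketch of that check. It assumes that the "Per proc." row refers to a full processing node and that decimal units are meant (1 TB = 1000 GB, with 4 TB taken as 4000 GB); the names used are illustrative only.

```python
# Consistency check of the "Maximal" rows in the system parameters.
# Assumptions: the per-proc. peak is a per-node figure, the largest system
# has 512 nodes, and decimal units are used (1 TB = 1000 GB).

MAX_NODES = 512

NODE_PEAK_GFLOPS = {"XM1": 844.8, "L2": 601.6, "VL1": 1280.0}
NODE_MEMORY_GB = {"XM1": 256, "L2": 256, "VL1": 4000}  # 4 TB taken as 4000 GB


def max_peak_tflops(model, nodes=MAX_NODES):
    """Theoretical peak of a full system in Tflop/s."""
    return NODE_PEAK_GFLOPS[model] * nodes / 1000.0


def max_memory_tb(model, nodes=MAX_NODES):
    """Total main memory of a full system in TB."""
    return NODE_MEMORY_GB[model] * nodes / 1000.0


if __name__ == "__main__":
    for model in NODE_PEAK_GFLOPS:
        print(f"{model}: {max_peak_tflops(model):.1f} Tflop/s, "
              f"{max_memory_tb(model):.1f} TB")
```

After rounding, the results agree with the tabulated maxima, which supports reading the number of "processors" (4–512) as a number of nodes.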

Remarks:

The SR16000 is the fourth generation of distributed-memory parallel systems of Hitachi. It replaces its predecessor, the SR11000 (see Systems Disappeared from the List). We discuss here the latest models, the SR16000 XM1, L2 and VL1. All three systems are water cooled. The processors used in the SR16000 L2 and VL1 are IBM's POWER6 but the packaging is somewhat different from what is used in IBM's p575 systems (see the IBM p575 page).

Unlike in their predecessor, the SR11000, the processors in all three models support Hitachi's Pseudo Vector Processing, a technique that enables the processing of very long vectors without the detrimental effects that normally occur when out-of-cache data access is required.

The peak performance per basic processor, or IP, can be attained with 2 simultaneous multiply/add instructions, resulting in a speed of 18.8 Gflop/s in the L2 (and 20 Gflop/s in the VL1). However, 32 basic processors in the L2 and 64 in the VL1 are coupled to form one processing node, all addressing a common part of the memory. For the user this node is the basic computing entity, with a peak speed of 601.6 and 1280 Gflop/s, respectively. Hitachi refers to this node configuration as COMPAS, Co-operative Micro-Processors in single Address Space. In fact this is a kind of SMP clustering as discussed in the sections on the main architectural classes and ccNUMA machines. In contrast to the earlier SR8000, the SR16000 no longer contains an SP, a system processor that performed system tasks and managed communication with other nodes and a range of I/O devices. These tasks are now performed by the processors in the SMP nodes themselves. The structure of the XM1 model is identical to that of the L2 model, except that POWER7 processors are employed at a clock frequency of 3.3 GHz. This gives the XM1 model a peak performance advantage of over 40% while at the same time using considerably less energy than a similar L2 configuration.
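
The peak figures quoted above follow from simple arithmetic, which the sketch below reproduces. It takes 4 floating-point operations per cycle for the POWER6 IPs (2 multiply/adds, as stated above). For the POWER7 IPs of the XM1 it assumes 8 operations per cycle and 32 IPs per node; the first value is not given in the text and is chosen because it matches the tabulated node peak of 844.8 Gflop/s. All names are illustrative.

```python
# Peak-performance arithmetic for the SR16000 models.
# flops per cycle = 4 for POWER6 follows from "2 simultaneous multiply/add
# instructions". The value 8 for the XM1 is an ASSUMPTION, picked to be
# consistent with the tabulated node peak of 844.8 Gflop/s.

MODELS = {
    # name: (clock in GHz, flops per cycle per IP, IPs per node)
    "L2": (4.7, 4, 32),
    "VL1": (5.0, 4, 64),
    "XM1": (3.3, 8, 32),  # assumed flops/cycle, see note above
}


def ip_peak_gflops(model):
    """Peak speed of one basic processor (IP) in Gflop/s."""
    clock_ghz, flops_per_cycle, _ = MODELS[model]
    return clock_ghz * flops_per_cycle


def node_peak_gflops(model):
    """Peak speed of one processing node in Gflop/s."""
    ips_per_node = MODELS[model][2]
    return ip_peak_gflops(model) * ips_per_node


if __name__ == "__main__":
    for name in MODELS:
        print(f"{name}: {ip_peak_gflops(name):.1f} Gflop/s per IP, "
              f"{node_peak_gflops(name):.1f} Gflop/s per node")
    ratio = node_peak_gflops("XM1") / node_peak_gflops("L2")
    print(f"XM1/L2 node peak ratio: {ratio:.3f}")
```

The ratio of the XM1 and L2 node peaks is what underlies the "over 40%" advantage mentioned for the XM1.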

The SR16000 has a multi-dimensional crossbar with a single-directional link speed of 4–16 GB/s. For this, QDR InfiniBand is used in a torus topology. For 4–8 nodes the cross-section of the network is 1 hop; for configurations of 16–64 nodes it is 2 hops, and from 128-node systems onwards it is 3 hops.
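
The relation between system size and hop count can be written down as a small lookup. The sketch below is only an illustration of the figures given above: the function name is hypothetical, and the treatment of node counts that fall between the quoted ranges (for example 9–15) is an assumption, namely that such a system behaves like the next larger class.

```python
# Hop count across the SR16000 network as a function of the node count.
# The thresholds come from the text; mapping in-between sizes (e.g. 9-15
# nodes) to the next larger class is an ASSUMPTION of this sketch.


def network_hops(nodes):
    """Return the number of hops for a system with the given node count."""
    if not 4 <= nodes <= 512:
        raise ValueError("SR16000 configurations range from 4 to 512 nodes")
    if nodes <= 8:
        return 1
    if nodes <= 64:
        return 2
    return 3


if __name__ == "__main__":
    for n in (4, 8, 16, 64, 128, 512):
        print(f"{n:3d} nodes: {network_hops(n)} hop(s)")
```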

As in some other systems, such as the Cray XE6 and the late AlphaServer SC and NEC Cenju-4, one is able to directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation this should allow for writing distributed programs with very low parallelisation overhead.

The usual communication libraries like PVM and MPI are provided. When MPI is used, it is possible to address individual IPs within the nodes. Furthermore, within one node it is possible to use OpenMP on individual IPs. Mostly this is less efficient than the automatic parallelisation done by Hitachi's compiler, but when coarser-grained task parallelism is expressed via OpenMP a performance gain can be attained. Hitachi provides its own numerical libraries for solving dense and sparse linear systems, FFTs, etc. As yet it is not known whether third-party numerical libraries like NAG are available.

Note: Large HPC configurations of the SR16000 are not sold in Europe, as Hitachi judges this to be of insufficient economic interest.

Measured Performances:
As yet no performance figures are known for the SR16000 XM1 and VL1, but in late 2009 a speed of 56.65 out of 77 Tflop/s was registered in [35] for the solution of a linear system of size N = 1,110,000 on the 4096-core SR16000 model L2 of the National Institute for Fusion Science in Japan. This amounts to an efficiency of 73.6%.
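
The efficiency quoted above is simply the measured speed divided by the theoretical peak. The sketch below redoes that calculation and also derives the peak from the core count, assuming the L2 per-IP peak of 18.8 Gflop/s applies to each of the 4096 cores. Function names are illustrative.

```python
# Efficiency of the measured linear-system solve on the 4096-core L2.
# Assumption: every core is an L2 IP with a peak of 18.8 Gflop/s.

L2_IP_PEAK_GFLOPS = 18.8


def peak_tflops(cores, ip_peak_gflops=L2_IP_PEAK_GFLOPS):
    """Theoretical peak in Tflop/s for the given number of cores."""
    return cores * ip_peak_gflops / 1000.0


def efficiency(measured_tflops, peak):
    """Fraction of the theoretical peak that was actually attained."""
    return measured_tflops / peak


if __name__ == "__main__":
    peak = peak_tflops(4096)
    print(f"Peak from core count: {peak:.2f} Tflop/s")
    print(f"Efficiency vs. quoted 77 Tflop/s: {efficiency(56.65, 77.0):.1%}")
```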