The SR16000 is the fourth generation of Hitachi's distributed-memory parallel systems. It replaces its predecessor, the SR11000 (see Systems Disappeared from the List). We discuss here the latest models, the SR16000 XM1, M1, and VM1. All three systems are water cooled. The processors used in the SR16000 are IBM's POWER7, but the packaging is somewhat different from that used in IBM's p775 systems (see the IBM p775 page).
Unlike those in their predecessor, the SR11000, the processors in all three models are fit for Hitachi's Pseudo Vector Processing, a technique that enables the processing of very long vectors without the detrimental effects that normally occur when out-of-cache data access is required.
The peak performance per basic processor, or IP, can be attained with 2 simultaneous multiply/add instructions, resulting in a speed of 17.8 Gflop/s on the SR16000 M1 (and 20 Gflop/s in the VM1). However, 32 basic processors in the M1 and 64 processors in the VM1 are coupled to form one processing node, all addressing a common part of the memory. For the user this node is the basic computing entity, with a peak speed of 569.6 Gflop/s in the M1 and 1280 Gflop/s in the VM1. Hitachi refers to this node configuration as COMPAS, Co-operative Micro-Processors in single Address Space. In fact this is a kind of SMP clustering as discussed in the sections on the main architectural classes and ccNUMA machines. In contrast to the preceding SR8000, the SR16000 does not contain an SP anymore, a system processor that performed system tasks, managed communication with other nodes, and drove a range of I/O devices. These tasks are now performed by the processors in the SMP nodes themselves. The structure of the XM1 model is identical to that of the M1 model, except that its POWER7 processors are clocked at 4.76 GHz instead of 3.70 GHz.
The SR16000 has a multi-dimensional crossbar with a single-directional link speed of 4–16 GB/s, implemented with QDR InfiniBand in a torus topology. For configurations of 4–8 nodes the cross-section of the network is 1 hop; for 16–64 nodes it is 2 hops, and from 128-node systems on it is 3 hops.
As in some other systems, such as the Cray XE6 and the late AlphaServer SC and NEC Cenju-4, one is able to directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation, this should allow for writing distributed programs with very low parallelisation overhead.
The usual communication libraries like PVM and MPI are provided.
When MPI is used it is possible to address individual IPs within the nodes. Furthermore, within one node it is possible to use OpenMP on individual IPs. Mostly this is less efficient than the automatic parallelisation done by Hitachi's compiler, but when coarser-grained task parallelism is offered via OpenMP a performance gain can be attained. Hitachi provides
its own numerical libraries to solve dense and sparse linear systems, FFTs, etc.
As yet it is not known whether third-party numerical libraries like those from NAG will be available.
Note: Large HPC configurations of the SR16000 are not sold in Europe, as Hitachi judges this market to be of insufficient economic interest.