|
System parameters:
Remarks: The SR16000 is the fourth generation of Hitachi's distributed-memory parallel systems. It replaces its predecessor, the SR11000 (see Systems Disappeared from the List). We discuss here the latest models: the SR16000 XM1, L2, and VL1. All three systems are water cooled. The processors used in the SR16000 L2 and VL1 are IBM's POWER6, but the packaging differs somewhat from that used in IBM's p575 systems (see the IBM p575 page). Unlike in their predecessor, the SR11000, the processors in all three models are suited to Hitachi's Pseudo Vector Processing, a technique that enables the processing of very long vectors without the detrimental effects that normally occur when out-of-cache data access is required. The peak performance per basic processor, or IP, is attained with 2 simultaneous multiply/add instructions, resulting in a speed of 18.8 Gflop/s in the L2 and 20 Gflop/s in the VL1. However, 32 basic processors in the L2 and 64 in the VL1 are coupled to form one processing node, all addressing a common part of the memory. For the user this node is the basic computing entity, with a peak speed of 601.6 and 1280 Gflop/s, respectively. Hitachi refers to this node configuration as COMPAS, Co-operative Micro-Processors in single Address Space. In fact this is a kind of SMP clustering as discussed in the sections on the main architectural classes and ccNUMA machines. In contrast to the earlier SR8000, the SR16000 no longer contains an SP, a system processor that performed system tasks and managed communication with other nodes and a range of I/O devices. These tasks are now performed by the processors in the SMP nodes themselves. The structure of the XM1 model is identical to that of the L2 model, except that POWER7 processors are employed at a clock frequency of 3.3 GHz. This gives the XM1 model a performance advantage of over 40% while at the same time using considerably less energy than a similar L2 configuration.
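The node peak figures quoted above follow from the per-IP peak multiplied by the number of IPs per node. A minimal Python sketch of that arithmetic is given here. The function and dictionary names are purely illustrative, and the clock frequencies it prints are not stated in the text: they are inferred under the assumption that the 2 simultaneous multiply/add instructions are issued every cycle and each counts as 2 floating-point operations.

```python
# Assumption: 2 simultaneous multiply/add instructions per cycle,
# each counted as 2 floating-point operations (one multiply, one add).
FLOPS_PER_CYCLE = 2 * 2

# Figures quoted in the text: per-IP peak (Gflop/s) and IPs per node.
MODELS = {
    "L2": {"ip_peak_gflops": 18.8, "ips_per_node": 32},
    "VL1": {"ip_peak_gflops": 20.0, "ips_per_node": 64},
}


def node_peak_gflops(ip_peak_gflops, ips_per_node):
    """Peak speed of one node: per-IP peak times the number of IPs."""
    return ip_peak_gflops * ips_per_node


def implied_clock_ghz(ip_peak_gflops):
    """Clock frequency implied by the per-IP peak (an inference, not a quoted value)."""
    return ip_peak_gflops / FLOPS_PER_CYCLE


for name, model in MODELS.items():
    peak = node_peak_gflops(model["ip_peak_gflops"], model["ips_per_node"])
    clock = implied_clock_ghz(model["ip_peak_gflops"])
    print(f"{name}: {peak:.1f} Gflop/s per node, implied clock {clock:.2f} GHz")
```

The node peaks computed this way agree with the 601.6 and 1280 Gflop/s given for the L2 and VL1 nodes.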
The SR16000 has a multi-dimensional crossbar with a single-directional link speed of 4–16 GB/s. For this, QDR InfiniBand is used in a torus topology. For 4–8 nodes the cross-section of the network is 1 hop. For configurations of 16–64 nodes it is 2 hops, and from 128-node systems on it is 3 hops. As in some other systems, such as the Cray XE6 and the late AlphaServer SC and NEC Cenju-4, one can directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation, this should allow for writing distributed programs with very low parallelisation overhead.
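The relation between configuration size and hop count stated above can be written as a small lookup. The following Python sketch is only an illustration. The function name is hypothetical, and because the text gives no figures for node counts outside the quoted ranges (fewer than 4, 9–15, or 65–127), the sketch rejects those rather than guessing.

```python
def network_hops(nodes):
    """Return the hop count quoted in the text for a given number of nodes.

    Only the ranges named in the text are covered; any other node count
    raises ValueError because no figure is given for it.
    """
    if 4 <= nodes <= 8:
        return 1
    if 16 <= nodes <= 64:
        return 2
    if nodes >= 128:
        return 3
    raise ValueError(f"no hop count quoted for {nodes} nodes")


for n in (4, 8, 16, 64, 128, 512):
    print(f"{n} nodes: {network_hops(n)} hop(s)")
```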
The usual communication libraries like PVM and MPI are provided.
When one uses MPI it is possible to access individual IPs within the nodes.
Furthermore, within one node it is possible to use OpenMP on individual IPs.
Mostly this is less efficient than using the automatic parallelisation
done by Hitachi's compiler, but when one offers coarser-grained task
parallelism via OpenMP a performance gain can be attained. Hitachi provides
its own numerical libraries for dense and sparse linear systems, FFTs, etc.
As yet it is not known whether third-party numerical libraries like NAG are
available. Note: Large HPC configurations of the SR16000 are not sold in Europe, as Hitachi judges this to be of insufficient economic interest.
Measured Performances: |