The Hitachi SR8000.

Machine type	RISC-based distributed memory multi-processor
Models	SR8000, SR8000 E1, SR8000 F1, SR8000 G1.
Operating system	HI-UX/MPP (Micro kernel Mach 3.0)
Connection structure	Multi-dimensional crossbar (see remarks)
Compilers	Fortran 77, Fortran 90, Parallel Fortran, HPF, C, C++
Vendors information Web page	www.hitachi.co.jp/Prod/comp/hpc/eng/sr81e.html
Year of introduction	1998, E1 and F1: 1999, G1: 2000.

System parameters:

Model	SR8000	SR8000 E1	SR8000 F1	SR8000 G1
Clock cycle	250 MHz	300 MHz	375 MHz	450 MHz
Theor. peak performance
Per proc. (64-bits)	8 Gflop/s	9.6 Gflop/s	12 Gflop/s	14.4 Gflop/s
Maximal	1 Tflop/s	4.9 Tflop/s	6.1 Tflop/s	7.3 Tflop/s
Main memory
Memory/node	<= 8 GB	<= 16 GB	<= 16 GB	<= 16 GB
Memory/maximal	<= 1 TB	<= 8 TB	<= 8 TB	<= 8 TB
No. of processors	4--128	4--512	4--512	4--512
Communication bandwidth	1 GB/s	1.2 GB/s	1 GB/s	1.6 GB/s

Remarks:

The SR8000 is the third generation of Hitachi's distributed-memory parallel systems. It is intended to replace both its direct predecessor, the SR2201, and the late top vector processor, the S-3800 (see Systems Disappeared from the List).

The basic node processor is a PowerPC processor with a clock cycle of 2.22--4 ns and major enhancements made by Hitachi. For example, hardware barrier synchronisation has been added, as have the additions required for "Pseudo Vector Processing" (PVP). PVP means that operations on long vectors do not incur the detrimental effects of cache misses that often ruin the performance of RISC processors unless code is carefully blocked and unrolled. This facility was already available on the SR2201, and experiments have shown that the idea seems to work well (see [13]).

The peak performance per basic processor, or IP, can be attained with 2 simultaneous multiply/add instructions, resulting in a speed of 1 Gflop/s on the SR8000. However, eight basic processors are coupled to form one processing node, all addressing a common part of the memory. For the user this node is the basic computing entity, with a peak speed of 8 Gflop/s. Hitachi refers to this node configuration as COMPAS, Co-operative Micro-Processors in single Address Space. In fact this is a kind of SMP clustering as discussed in the sections on the main architectural classes and ccNUMA machines. A difference with most of these systems is that the individual processors in a cluster node are not accessible to the user. Every node also contains an SP, a system processor that performs system tasks and manages communication with other nodes and a range of I/O devices.
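
The peak figures above follow from simple arithmetic, sketched below in Python. The function names are illustrative only; the clock frequencies are those listed under System parameters, and the sketch assumes that one multiply/add counts as two floating-point operations, which is what the quoted 1 Gflop/s at 250 MHz implies.

```python
# Peak-rate arithmetic for the SR8000 family, using only figures quoted in the text.
FLOPS_PER_MULTIPLY_ADD = 2   # assumption: one multiply/add = 2 floating-point operations
MULTIPLY_ADDS_PER_CYCLE = 2  # two simultaneous multiply/add instructions per IP
IPS_PER_NODE = 8             # eight IPs are coupled into one COMPAS node

# Clock frequencies in MHz, taken from the system parameters table.
CLOCK_MHZ = {"SR8000": 250, "SR8000 E1": 300, "SR8000 F1": 375, "SR8000 G1": 450}


def ip_peak_gflops(clock_mhz):
    """Theoretical peak of a single IP in Gflop/s."""
    return clock_mhz * MULTIPLY_ADDS_PER_CYCLE * FLOPS_PER_MULTIPLY_ADD / 1000.0


def node_peak_gflops(clock_mhz):
    """Theoretical peak of one eight-IP node in Gflop/s."""
    return IPS_PER_NODE * ip_peak_gflops(clock_mhz)


if __name__ == "__main__":
    for model, mhz in CLOCK_MHZ.items():
        print(f"{model}: {ip_peak_gflops(mhz):.1f} Gflop/s per IP, "
              f"{node_peak_gflops(mhz):.1f} Gflop/s per node")
```

For the basic SR8000 this reproduces the 1 Gflop/s per IP and 8 Gflop/s per node quoted above.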

The SR8000 has a multi-dimensional crossbar with a bi-directional link speed of 1 GB/s. For configurations of 4--8 nodes the cross-section of the network is 1 hop, for 16--64 nodes it is 2 hops, and from 128-node systems onwards it is 3 hops.
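
The hop counts above can be written as a small lookup. This is only a sketch: the name crossbar_hops is hypothetical, and the treatment of node counts that fall between the quoted ranges (for instance 9--15 or 65--127) is an assumption, since the text gives no figure for them.

```python
def crossbar_hops(nodes):
    """Return the network cross-section in hops for a given node count.

    Quoted in the text: 4--8 nodes -> 1 hop, 16--64 nodes -> 2 hops,
    128 nodes and up -> 3 hops.
    ASSUMPTION: sizes between the quoted ranges are assigned to the next
    larger range; the source does not state this.
    """
    if nodes < 4:
        raise ValueError("the smallest configuration listed has 4 nodes")
    if nodes <= 8:
        return 1
    if nodes <= 64:
        return 2
    return 3


if __name__ == "__main__":
    for n in (4, 8, 16, 64, 128, 512):
        print(f"{n} nodes: {crossbar_hops(n)} hop(s)")
```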

The E1 and F1 models are in almost every respect equal to the basic SR8000 model; however, the clock cycles for these models are 3.33 and 2.67 ns, respectively. Furthermore, the E1, F1, and G1 models can house twice the amount of memory per node, and the maximum configurations can be extended to 512 processors, making them, at the time of writing this report, the most powerful commercially available systems, at least in theory. The Hitachi documentation quotes a bandwidth of 1.2 GB/s for the network in the E1 model, while it is 1 GB/s for the basic SR8000 and the F1. By contrast, the G1 model has a bandwidth of 1.6 GB/s.
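
As a cross-check of the clock cycles and maximal peak figures, the following Python sketch converts the clock frequencies from the system parameters table into cycle times and scales the per-node peak to the largest configuration. The function names cycle_time_ns and max_peak_tflops are illustrative, not part of any Hitachi software.

```python
def cycle_time_ns(clock_mhz):
    """Clock cycle in nanoseconds for a clock frequency given in MHz."""
    return 1000.0 / clock_mhz


def max_peak_tflops(per_node_gflops, nodes):
    """Theoretical peak in Tflop/s of a configuration with the given node count."""
    return per_node_gflops * nodes / 1000.0


# (clock in MHz, per-node peak in Gflop/s, maximum node count) from the table.
MODELS = {
    "SR8000":    (250, 8.0, 128),
    "SR8000 E1": (300, 9.6, 512),
    "SR8000 F1": (375, 12.0, 512),
    "SR8000 G1": (450, 14.4, 512),
}

if __name__ == "__main__":
    for model, (mhz, node_gflops, nodes) in MODELS.items():
        print(f"{model}: cycle {cycle_time_ns(mhz):.2f} ns, "
              f"maximal peak {max_peak_tflops(node_gflops, nodes):.2f} Tflop/s")
```

The cycle times span the 2.22--4 ns range mentioned for the node processor, and the maximal peaks agree with the table to within the rounding used there.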

As in some other systems, such as the Cray T3E, the AlphaServer SC, and the late NEC Cenju-4, one is able to directly access the memories of remote processors. Together with the very fast hardware-based barrier synchronisation, this should allow for writing distributed programs with very low parallelisation overhead.

The following software products will be supported in addition to those already mentioned above: PVM, MPI, PARMACS, Linda, and FORGE90. In addition, numerical libraries like NAG and IMSL are offered.

Measured Performances:
Results for all of the SR8000 types are available from [6], of which we quote the most significant ones. On a 144-node G1 (450 MHz) configuration a speed of 1709 Gflop/s out of 2074 was observed, an efficiency of 82%, for the solution of a full linear system of order 141,000. On a 112-node 375 MHz F1 model 1035 out of 1344 Gflop/s could be achieved, an efficiency of 77%. On a single node of this model speeds of over 6.2 and 4.1 Gflop/s were measured in solving a full linear system and a full symmetric eigenvalue problem of order 5000, respectively (see [7] for the last two results). Furthermore, 2 SR8000 G1 frames have been coupled and a speed of 1709 Gflop/s out of 2074 has been attained on 1152 processors for solving a 141,000-order linear system. The efficiency in this case is 82%, quite high for externally coupled systems.
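
The efficiency of a run is the measured speed divided by the theoretical peak. The Python sketch below recomputes it from the measured and peak speeds quoted above; the function names are illustrative, and the per-node peaks of 14.4 and 12 Gflop/s are taken from the system parameters table.

```python
def peak_gflops(nodes, per_node_gflops):
    """Theoretical peak of a configuration in Gflop/s."""
    return nodes * per_node_gflops


def efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak that was actually attained."""
    return measured_gflops / peak


# (label, node count, per-node peak in Gflop/s, measured speed in Gflop/s)
RUNS = [
    ("144-node G1", 144, 14.4, 1709.0),
    ("112-node F1", 112, 12.0, 1035.0),
]

if __name__ == "__main__":
    for label, nodes, node_gflops, measured in RUNS:
        peak = peak_gflops(nodes, node_gflops)
        print(f"{label}: {measured:.0f} of {peak:.0f} Gflop/s, "
              f"efficiency {100 * efficiency(measured, peak):.0f}%")
```

With eight IPs per node, the 144-node G1 configuration corresponds to 144 x 8 = 1152 basic processors.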

Aad van der Steen
Mon Jul 29 16:17:52 MDT 2002