The NEC SX-9

Machine type: Distributed-memory multi-vector processor
Models: SX-9B, SX-9A, SX-9xMy
Operating system: Super-UX (Unix variant based on BSD V.4.3 Unix)
Connection structure: Multi-stage crossbar (see Remarks)
Compilers: Fortran 90, HPF, ANSI C, C++
Vendor's information Web page: http://www.hpce.nec.com/hardware/index.html
Year of introduction: 2007

System parameters:

Model                      SX-9B           SX-9A           SX-9xMy
Clock cycle                3.2 GHz         3.2 GHz         3.2 GHz
Theor. peak performance
  Per proc. (64 bits)      102.4 Gflop/s   102.4 Gflop/s   102.4 Gflop/s
  Maximal, single frame    819.2 Gflop/s   1.6 Tflop/s     -
  Maximal, multi frame     -               -               838.9 Tflop/s
Main memory, DDR2-SDRAM    256–512 GB      512–1024 GB     ≤ 512 TB
Main memory, FCRAM         128–256 GB      256–512 GB      ≤ 256 TB
No. of processors          4–8             8–16            32–8192

Remarks:

The NEC SX-9 is a technology-shrunk version of its predecessor, the SX-8 (see Systems disappeared from the list). As a result the clock frequency has increased from 2.0 to 3.2 GHz, the density of processors per frame has doubled, and the power consumption has almost halved. The structure of the CPUs, however, has stayed the same. The SX-9 series is offered in three models, as displayed in the table above. All models are based on the same processor, an 8-way replicated vector processor in which each set of vector pipes contains a logical, mask, add/shift, multiply, and division pipe (see the section Shared-memory SIMD machines for an explanation of these components). As multiplication and addition can be chained (but not division) and two of each are present, a pipe set delivers 4 floating-point results per cycle, so its peak performance at 3.2 GHz is 12.8 Gflop/s. Because of the 8-way replication a single CPU can deliver a peak performance of 102.4 Gflop/s.

The official NEC documentation quotes higher peak performances because the peak performance of the scalar processor is added to that of the vector processor to which it belongs. The scalar processor is 2-way superscalar and at 3.2 GHz has a theoretical peak of 6.4 Gflop/s. We do not follow this practice, as full utilisation of the scalar processor alongside the vector processor will in reality be next to non-existent. The peak bandwidth per CPU is 160 B/cycle. This is sufficient to ship 20 8-byte operands back or forth per cycle, enough to feed 5 operands every 2 cycles to each of the replicated pipe sets.
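The per-CPU figures quoted above follow from simple arithmetic. The sketch below reproduces it in Python; the constant and function names are illustrative only, and the GB/s figure assumes that the 160 B/cycle is sustained at the full 3.2 GHz clock.

```python
# Values quoted in the remarks above; all names are illustrative.
CLOCK_GHZ = 3.2         # clock frequency
ADD_PIPES = 2           # add/shift pipes per pipe set
MUL_PIPES = 2           # multiply pipes per pipe set
PIPE_SETS = 8           # 8-way replication per CPU
BYTES_PER_CYCLE = 160   # peak bandwidth per CPU
OPERAND_BYTES = 8       # one 64-bit operand


def pipe_set_peak_gflops(clock_ghz=CLOCK_GHZ):
    """Each chained add and multiply pipe delivers one result per cycle."""
    return (ADD_PIPES + MUL_PIPES) * clock_ghz


def cpu_peak_gflops(clock_ghz=CLOCK_GHZ):
    """Peak of the vector unit only; the scalar unit is not counted."""
    return PIPE_SETS * pipe_set_peak_gflops(clock_ghz)


def operands_per_cycle_per_pipe_set():
    """8-byte operands that the bandwidth can feed to one pipe set per cycle."""
    return BYTES_PER_CYCLE / OPERAND_BYTES / PIPE_SETS


def cpu_bandwidth_gbs(clock_ghz=CLOCK_GHZ):
    """Bandwidth in GB/s, assuming 160 B are moved every clock cycle."""
    return BYTES_PER_CYCLE * clock_ghz


if __name__ == "__main__":
    print(f"pipe set peak: {pipe_set_peak_gflops():.1f} Gflop/s")
    print(f"CPU peak:      {cpu_peak_gflops():.1f} Gflop/s")
    print(f"operands/cycle per pipe set: {operands_per_cycle_per_pipe_set()}")
    print(f"CPU bandwidth: {cpu_bandwidth_gbs():.1f} GB/s")
```

The 2.5 operands per cycle per pipe set is the "5 operands every 2 cycles" mentioned in the text.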

Contrary to what one would expect from the naming, the SX-9B is the simpler configuration of the two single-frame systems: it can be had with 4–8 processors but is in virtually all other respects equal to the larger SX-9A, which can house 8–16 processors. There is one difference connected to the maximal amount of memory per frame: NEC now offers the interesting choice between the usual DDR2-SDRAM and FCRAM (Fast Cycle Memory). The latter type of memory can be a factor of 2–3 faster than the former. However, because of the more complex structure of the memory, its density is about two times lower. Hence, in the system parameters table, the entries for FCRAM are about two times lower than those for SDRAM. The lower bound for SDRAM in the SX-9A and SX-9B systems is the same: 32 GB. For the very memory-hungry applications that are usually run on vector-type systems, the availability of FCRAM can be quite beneficial.

A single frame of the SX-9A model holds up to 16 CPUs. Internally the CPUs in the frame are connected by a 1-stage crossbar with the same bandwidth as that of a single-CPU system: 512 GB/s/port. The fully configured frame can therefore attain a peak speed of 1.6 Tflop/s.
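As a quick check, the single-frame maxima in the system parameters table are the per-CPU peak multiplied by the largest CPU count of each model. The sketch below assumes nothing beyond the table values; the names are illustrative.

```python
CPU_PEAK_GFLOPS = 102.4                    # per-processor peak from the table
MAX_CPUS_PER_FRAME = {"SX-9B": 8, "SX-9A": 16}


def frame_peak_gflops(n_cpus):
    """Theoretical peak of one frame holding n_cpus vector CPUs."""
    return n_cpus * CPU_PEAK_GFLOPS


if __name__ == "__main__":
    for model, n_cpus in MAX_CPUS_PER_FRAME.items():
        print(f"{model}: {frame_peak_gflops(n_cpus):.1f} Gflop/s")
```

For the SX-9A this gives 1638.4 Gflop/s, which the table rounds to 1.6 Tflop/s.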

In addition, there are multi-frame models (SX-9xMy), where x = 8,...,8192 is the total number of CPUs and y = 2,...,512 is the number of frames, which couple the single-frame systems into a larger system. To couple the SX-9 frames in a multi-frame configuration, NEC provides a full crossbar, the so-called IXS crossbar, which connects the frames at a speed of 128 GB/s for point-to-point unidirectional out-of-frame communication. When connected by the IXS crossbar, the total multi-frame system is globally addressable, turning the system into a NUMA system. However, for performance reasons it is advised to use the system in distributed-memory mode with MPI.
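The multi-frame maximum in the system parameters table can be reproduced in the same way. The sketch below assumes fully populated 16-CPU frames and decimal units (1 Tflop/s = 1000 Gflop/s); the helper names are illustrative and not part of any NEC tooling.

```python
import math

CPU_PEAK_GFLOPS = 102.4   # per-processor peak from the table
CPUS_PER_FRAME = 16       # largest single frame (SX-9A)
MAX_CPUS = 8192           # upper limit of x in SX-9xMy


def min_frames(n_cpus):
    """Smallest number of 16-CPU frames that can hold n_cpus processors."""
    return math.ceil(n_cpus / CPUS_PER_FRAME)


def multi_frame_peak_tflops(n_cpus):
    """Theoretical peak in Tflop/s of a multi-frame system with n_cpus CPUs."""
    if not 1 <= n_cpus <= MAX_CPUS:
        raise ValueError(f"n_cpus must be between 1 and {MAX_CPUS}")
    return n_cpus * CPU_PEAK_GFLOPS / 1000.0


if __name__ == "__main__":
    print(f"frames needed: {min_frames(MAX_CPUS)}")
    print(f"peak: {multi_frame_peak_tflops(MAX_CPUS):.1f} Tflop/s")
```

With 8192 CPUs this gives 512 frames and about 838.9 Tflop/s, in agreement with the upper limits quoted for y and for the multi-frame peak.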

For distributed computing there is an HPF compiler, and for message passing an optimised MPI (MPI/SX) is available. In addition, OpenMP is available for shared-memory parallelism.

Measured Performances:
Presently no independent performance results for the SX-9 are known.