The IBM BlueGene/L&P

Introduction
HPC Architecture
  1. Shared-memory SIMD machines
  2. Distributed-memory SIMD machines
  3. Shared-memory MIMD machines
  4. Distributed-memory MIMD machines
  5. ccNUMA machines
  6. Clusters
  7. Processors
    1. AMD Magny-Cours
    2. IBM POWER6
    3. IBM POWER7
    4. IBM PowerPC 970MP
    5. IBM BlueGene processors
    6. Intel Xeon
    7. The SPARC processors
  8. Accelerators
    1. GPU accelerators
      1. ATI/AMD
      2. nVIDIA
    2. General accelerators
      1. The IBM/Sony/Toshiba Cell processor
      2. ClearSpeed/Petapath
    3. FPGA accelerators
      1. Convey
      2. Kuberre
      3. SRC
  9. Networks
    1. Infiniband
    2. InfiniPath
    3. Myrinet
Available systems
  • The Bull bullx system
  • The Cray XE6
  • The Cray XMT
  • The Cray XT5h
  • The Fujitsu FX1
  • The Hitachi SR16000
  • The IBM BlueGene/L&P
  • The IBM eServer p575
  • The IBM System Cluster 1350
  • The NEC SX-9
  • The SGI Altix UV series
  • Systems disappeared from the list
  • Systems under development
  • Glossary
  • Acknowledgments
  • References

    Machine type RISC-based distributed-memory multi-processor
    Models IBM BlueGene/L&P.
    Operating system Linux
    Connection structure 3-D Torus, Tree network
    Compilers XL Fortran 90, XL C, C++
    Vendors information Web page www-1.ibm.com/servers/deepcomputing/bluegene
    Year of introduction 2004 for BlueGene/L, 2007 for BlueGene/P

    System parameters:

    Model                            BlueGene/L          BlueGene/P
    Clock cycle                      700 MHz             850 MHz
    Theor. peak performance
      Per proc. (64-bit)             2.8 Gflop/s         3.4 Gflop/s
      Maximal                        367/183.5 Tflop/s   1.5/3 Pflop/s
    Main memory
      Memory/card                    ≤ 512 MB            ≤ 2 GB
      Memory/maximal                 ≤ 16 TB             ≤ 442 TB
    No. of processors                ≤ 2×65,536          ≤ 4×221,184
    Communication bandwidth
      Point-to-point (3-D torus)     175 MB/s            350 MB/s
      Point-to-point (tree network)  350 MB/s            700 MB/s

    Remarks:

    The BlueGene/L is the first in a new generation of systems made by IBM for very massively parallel computing. The individual speed of the processor has therefore been traded in favour of very dense packaging and a low power consumption per processor. The basic processor in the system is a modified PowerPC 440 at 700 MHz. Two of these processors reside on a chip together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors. Each processor has two load ports and one store port from/to the L2 cache at 8 bytes/cycle. This is half of the bandwidth required to keep the two floating-point units (FPUs) fed, and as such quite high. Each CPU has 32 KB of instruction cache and 32 KB of data cache on board. In favourable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s because each of the two FPUs can perform a fused multiply-add operation every cycle. Note that the L2 cache is smaller than the L1 cache, which is quite unusual but allows it to be fast.
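The per-CPU peak figure follows directly from the clock rate and the FPU configuration; a small Python sketch of the arithmetic:

```python
# Peak speed of one BlueGene/L CPU: two FPUs, each completing one
# fused multiply-add (counted as 2 flops) per cycle at 700 MHz.
clock_hz = 700e6       # 700 MHz clock
fpus = 2               # two floating-point units per CPU
flops_per_fma = 2      # a fused multiply-add counts as 2 flops

peak_flops = clock_hz * fpus * flops_per_fma
print(peak_flops / 1e9)  # 2.8 (Gflop/s)
```

The same formula with the BlueGene/P numbers (850 MHz, same 4 flops/cycle) gives the 3.4 Gflop/s quoted later for the P model.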

    The packaging in the system is as follows: two chips fit on a compute card with 512 MB of memory. Sixteen of these compute cards are placed on a node board, of which in turn 32 go into one cabinet. So, one cabinet contains 1,024 chips, i.e., 2,048 CPUs. For a maximal configuration 64 cabinets are coupled to form one system with 65,536 chips/131,072 CPUs. In normal operation mode one of the CPUs on a chip is used for computation while the other takes care of communication tasks. In this mode the Theoretical Peak Performance of the system is 183.5 Tflop/s. It is however possible, when the communication requirements are very low, to use both CPUs for computation, doubling the peak speed; hence the double entries in the System Parameters table above. The number of 367 Tflop/s is also the speed that IBM is using in its marketing material.
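The packaging hierarchy and the resulting system peak can be checked in a few lines of Python:

```python
# BlueGene/L packaging hierarchy as described above.
chips_per_card = 2
cards_per_board = 16
boards_per_cabinet = 32
cabinets = 64

chips_per_cabinet = chips_per_card * cards_per_board * boards_per_cabinet
total_chips = chips_per_cabinet * cabinets   # 65,536 chips
total_cpus = total_chips * 2                 # 131,072 CPUs

peak_per_cpu = 2.8e9                         # both FPUs doing FMAs
system_peak = total_cpus * peak_per_cpu
print(system_peak / 1e12)  # ≈ 367 Tflop/s with both CPUs of each chip computing
```

Halving this (one CPU per chip reserved for communication) gives the 183.5 Tflop/s of normal operation mode.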

    The BlueGene/L possesses no less than 5 networks, 2 of which are of interest for inter-processor communication: a 3-D torus network and a tree network. The torus network is used for most general communication patterns. The tree network is used for often occurring collective communication patterns like broadcasting, reduction operations, etc. The hardware bandwidth of the tree network is twice that of the torus: 350 MB/s against 175 MB/s per link.

    BlueGene/P
    In the second half of 2007 the second-generation BlueGene system, the BlueGene/P, was realised and several systems have been installed. The macro-architecture of the BlueGene/P is very similar to that of the L model, except that just about everything in the system is faster and bigger. The chip is a variant of the PowerPC 450 family and runs at 850 MHz. As in the BlueGene/L processor, 4 floating-point operations can be performed per cycle, so the theoretical peak performance per core is 3.4 Gflop/s. Four processor cores reside on a chip (as opposed to 2 in the L model). The L3 cache grew from 4 to 8 MB and the memory per chip increased four-fold to 2 GB. In addition, the memory bandwidth has doubled to 13.6 GB/s. Unlike the dual-core BlueGene/L chip, the quad-core model P chip can work in true SMP mode, making it amenable to the use of OpenMP.

    One board in the system carries 32 quad-core chips, while again 32 boards can be fitted in one rack, for 4,096 cores per rack. A rack therefore has a Theoretical Peak Performance of 13.9 Tflop/s. The IBM press release sets the maximum number of cores in a system at 884,736 in 216 racks, with a Theoretical Peak Performance of 3 Pflop/s. The bandwidth of the main communication networks (torus and tree) also goes up by a factor of about 2, while the latency is halved.
    Like the BlueGene/L, the P model is very energy-efficient: a 1,024-chip (4,096-core) rack draws only 40 kW.
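The rack and full-system peaks for the P model follow from the same kind of counting as for the L model; a quick check:

```python
# BlueGene/P rack and full-system peak performance.
cores_per_chip = 4
chips_per_board = 32
boards_per_rack = 32
cores_per_rack = cores_per_chip * chips_per_board * boards_per_rack  # 4,096

peak_per_core = 3.4e9                      # 850 MHz x 4 flops/cycle
rack_peak = cores_per_rack * peak_per_core
print(rack_peak / 1e12)                    # ≈ 13.9 Tflop/s per rack

max_cores = 884_736                        # maximum system size per IBM
racks = max_cores // cores_per_rack        # 216 racks
system_peak = max_cores * peak_per_core
print(system_peak / 1e15)                  # ≈ 3.0 Pflop/s
```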

    In both the BlueGene/L and /P the compute nodes run a reduced-kernel type of Linux to reduce the OS jitter that normally occurs when very many nodes are involved in a computation. Interface nodes for interaction with the users and for providing I/O services run a full version of the operating system.

    Measured Performances:
    In [35] a speed of 478.2 Tflop/s on the HPC Linpack benchmark is reported for a BlueGene/L, solving a linear system of size N = 2,456,063 on 212,992 processor cores, amounting to an efficiency of 80.1%.
    In the same report a speed of 180 Tflop/s out of a maximum of 222.82 Tflop/s was published for a 65,536-core BlueGene/P, again with an efficiency of 80.1%, but on a smaller linear system of size N = 1,766,399.
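The Linpack efficiency quoted above is simply the measured speed divided by the theoretical peak of the cores used; a sketch for the BlueGene/L run (the result lands at roughly the 80% reported, with small differences due to rounding of the quoted figures):

```python
# Linpack efficiency = Rmax (measured) / Rpeak (theoretical).
# Figures for the BlueGene/L run from the text.
cores = 212_992
peak_per_core = 2.8e9            # Gflop/s per core, both FPUs active
r_peak = cores * peak_per_core   # ≈ 596.4 Tflop/s
r_max = 478.2e12                 # measured HPC Linpack speed

efficiency = r_max / r_peak
print(round(100 * efficiency, 1))  # roughly 80%
```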