The IBM BlueGene/P&Q

Machine type: RISC-based distributed-memory multi-processor
Models: IBM BlueGene/P&Q
Operating system: Linux
Connection structure: Model P: 3-D Torus, Tree network; Model Q: 5-D Torus
Compilers: XL Fortran 90, XL C, C++
Vendor's information Web page: www-1.ibm.com/servers/deepcomputing/bluegene
Year of introduction: 2007 for BlueGene/P, 2012 for BlueGene/Q

System parameters:

Model                             BlueGene/P      BlueGene/Q
Clock cycle                       850 MHz         1.6 GHz
Theor. peak performance
  Per Proc. (64-bits)             3.4 Gflop/s     204.8 Gflop/s
  Maximal                         1.5/3 Pflop/s   —
Main memory
  Memory/card                     2 GB            16 GB
  Memory/maximal                  ≤ 16 TB         ≤ 442 TB
No. of processors                 ≤ 4×221,184     —
Communication bandwidth
  Point-to-point (Torus)          350 MB/s        2 GB/s
  Point-to-point (Tree network)   700 MB/s

Remarks:

The BlueGene/P

In the second half of 2007 the second-generation BlueGene system, the BlueGene/P, was realised and several systems have been installed. The macro-architecture of the BlueGene/P is very similar to that of the older L model, except that just about everything in the system is faster and bigger. The chip is a variant of the PowerPC 450 family and runs at 850 MHz. As in the BlueGene/L processor, 4 floating-point operations can be performed per cycle, so the theoretical peak performance is 3.4 Gflop/s per core. Four processor cores reside on a chip (as opposed to 2 in the L model). The L3 cache grew from 4 to 8 MB and the memory per chip increased four-fold to 2 GB. In addition, the memory bandwidth in B/cycle has doubled, to 13.6 GB/s. Unlike the dual-core BlueGene/L chip, the quad-core model P chip can work in true SMP mode, making it amenable to the use of OpenMP.

One board in the system carries 32 quad-core chips, while again 32 boards can be fitted in one rack, giving 4,096 cores per rack. A rack therefore has a theoretical peak performance of 13.9 Tflop/s. The IBM press release sets the maximum number of cores in a system at 884,736 in 216 racks, with a theoretical peak performance of 3 Pflop/s. The bandwidth of the main communication networks (torus and tree) also goes up by a factor of about 2, while the latency is halved.

Like the BlueGene/L, the P model is very energy-efficient: a 1024-processor (4096-core) rack draws only 40 kW.
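
The peak figures quoted for the BlueGene/P follow from simple multiplication, which the short sketch below retraces. It is only an illustration: the numbers are the ones given in the text, and the variable names are chosen here for readability and do not come from any IBM tool.

```python
# Retrace the BlueGene/P peak-performance arithmetic from the text.
# All names are illustrative; the values are those quoted above.
CLOCK_HZ = 850e6         # core clock frequency
FLOPS_PER_CYCLE = 4      # floating-point operations per core per cycle
CORES_PER_CHIP = 4
CHIPS_PER_BOARD = 32
BOARDS_PER_RACK = 32
MAX_RACKS = 216          # maximum configuration per the IBM press release

core_peak = CLOCK_HZ * FLOPS_PER_CYCLE                       # flop/s per core
cores_per_rack = CORES_PER_CHIP * CHIPS_PER_BOARD * BOARDS_PER_RACK
rack_peak = core_peak * cores_per_rack                       # flop/s per rack
max_cores = cores_per_rack * MAX_RACKS
system_peak = core_peak * max_cores                          # flop/s, full system

print(f"peak per core  : {core_peak / 1e9:.1f} Gflop/s")
print(f"cores per rack : {cores_per_rack}")
print(f"peak per rack  : {rack_peak / 1e12:.1f} Tflop/s")
print(f"maximum cores  : {max_cores}")
print(f"system peak    : {system_peak / 1e15:.1f} Pflop/s")
```

The same pattern (clock frequency times operations per cycle times core count) applies to any of the configurations mentioned in this section.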


Measured Performances

In [35] a speed of 478.2 Tflop/s on the HPC Linpack benchmark is reported for a BlueGene/L, solving a linear system of size N = 2,456,063 on 212,992 processor cores, amounting to an efficiency of 80.1%.
In the same report a speed of 180 Tflop/s out of a maximum of 222.82 Tflop/s for a 65,536-core BlueGene/P was published, again with an efficiency of 80.1% but on a smaller linear system of size N = 1,766,399.

The BlueGene/Q

The BlueGene/Q is the latest generation so far in the BlueGene family. As can be seen from the table above, the performance per processor has increased hugely. This is due to several factors: the clock frequency almost doubled, and the floating-point output per core also doubled because each core has 4 floating-point units that together turn out 4 fused multiply-add results per cycle. Furthermore, there are 16 instead of 4 cores per processor. Note, however, that the amount of memory per core has not kept pace with this growth: where in the P model 2 GB on a card feeds 8 cores (there are two processors on a card), in the Q model 32 cores draw on 16 GB of memory (again with 2 processors on a card).
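
As a rough check of the per-processor figure in the table, the sketch below multiplies out the factors named in this paragraph and also relates the card memory to the card's peak performance. It assumes that one fused multiply-add counts as 2 floating-point operations and takes the card figures exactly as quoted; all names are illustrative.

```python
# BlueGene/Q per-processor peak, from the factors named in the text.
Q_CLOCK_HZ = 1.6e9
Q_FMA_PER_CYCLE = 4      # fused multiply-add results per core per cycle
FLOPS_PER_FMA = 2        # assumption: one FMA = one multiply + one add
Q_CORES_PER_PROC = 16

q_core_peak = Q_CLOCK_HZ * Q_FMA_PER_CYCLE * FLOPS_PER_FMA   # flop/s per core
q_proc_peak = q_core_peak * Q_CORES_PER_PROC                 # flop/s per processor

# BlueGene/P per-core peak for comparison (850 MHz, 4 flops per cycle).
p_core_peak = 850e6 * 4

# Memory per unit of peak performance, using the card figures quoted
# in the text (P: 2 GB for 8 cores, Q: 16 GB for 32 cores).
GB = 1e9
p_bytes_per_flop = (2 * GB) / (8 * p_core_peak)
q_bytes_per_flop = (16 * GB) / (32 * q_core_peak)

print(f"Q peak per core      : {q_core_peak / 1e9:.1f} Gflop/s")
print(f"Q peak per processor : {q_proc_peak / 1e9:.1f} Gflop/s")
print(f"P memory per flop/s  : {p_bytes_per_flop:.3f} B")
print(f"Q memory per flop/s  : {q_bytes_per_flop:.3f} B")
```

Under these assumptions the memory available per unit of peak performance is lower on the Q model than on the P model, which is the point the paragraph is making about memory not growing along with compute.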

Another deviation from the earlier models is the interconnect. It is now a 5-D torus with a link speed of 2 GB/s, while the tree network present in the former L model and in the P model has disappeared. The two extra dimensions compensate for this loss and at the same time increase the resiliency of the network: a 3-D torus is rather vulnerable to link failures. A processor has 11 links, of which 10 are needed for the 5-D torus directions, plus one spare link that can be used for other purposes or in case of failure of another link. This is all the more critical for the very large systems that are envisioned to be built from these components. Although no official maximum size is given for BlueGene/Q systems, the 20 Pflop/s Sequoia system was commissioned for Lawrence Livermore National Laboratory and a 10 Pflop/s system for Argonne National Laboratory. As with the earlier models, this can be achieved because of the high density. A BlueGene/Q node card houses 32 2-processor compute cards, while 16 node cards are fitted onto a midplane. A rack contains two of these populated midplanes and therefore delivers almost 420 Tflop/s. Consequently, tens of racks are needed to build systems of such sizes, and reliability features become extremely important.
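
To make the link count concrete, the following sketch enumerates the nearest neighbours of a node in a torus of arbitrary dimension and compares the worst-case hop count of a 3-D and a 5-D torus with the same number of nodes. The torus shapes used here are hypothetical and chosen only for illustration; they are not actual BlueGene partition sizes.

```python
def torus_neighbours(coord, shape):
    """Return the nearest neighbours of `coord` in a torus of the given shape.

    Every dimension contributes one neighbour in the -1 direction and one
    in the +1 direction, with wrap-around at the edges.
    """
    neighbours = []
    for dim, size in enumerate(shape):
        for step in (-1, 1):
            other = list(coord)
            other[dim] = (other[dim] + step) % size
            neighbours.append(tuple(other))
    return neighbours


def torus_diameter(shape):
    """Worst-case number of hops between two nodes of a torus."""
    # With wrap-around, at most size // 2 hops are needed per dimension.
    return sum(size // 2 for size in shape)


# Hypothetical 5-D torus; each node needs 2 links per dimension.
shape_5d = (4, 4, 4, 4, 4)
links = torus_neighbours((0, 0, 0, 0, 0), shape_5d)
print(f"torus links per node in 5-D: {len(links)}")

# Two hypothetical layouts of 4,096 nodes each.
print(f"diameter of 16x16x16 torus  : {torus_diameter((16, 16, 16))} hops")
print(f"diameter of 4x4x4x8x8 torus : {torus_diameter((4, 4, 4, 8, 8))} hops")
```

Two links per dimension give the 10 torus links mentioned above, and the shorter worst-case path in the 5-D layout is one way to see how the extra dimensions can offset the loss of the tree network.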

In both the BlueGene/P and /Q the compute nodes run a reduced-kernel type of Linux to reduce the OS jitter that normally occurs when very many nodes are involved in a computation. Interface nodes for interaction with the users and for providing I/O services run a full version of the operating system. In the BlueGene/Q, jitter reduction is also achieved by the 17th core, which is dedicated to OS tasks (see the BlueGene processors).

Measured Performances

In the TOP500 list a speed of 1.003 Pflop/s out of a maximum of 1.003 Pflop/s for a 294,912-core BlueGene/P was published, with an efficiency of 80.2% but on a linear system of unspecified size.
In the same list a BlueGene/Q (the Sequoia system) used 1,572,864 cores to solve a dense linear system at a speed of 16.3 Pflop/s, with an efficiency of 81.1%.
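
The efficiency quoted for such runs is the measured Linpack speed divided by the theoretical peak of the cores used. The sketch below applies this to the Sequoia figures; it assumes a per-core peak of 12.8 Gflop/s (1.6 GHz times 4 fused multiply-adds of 2 operations each, as described for the BlueGene/Q above) and uses the measured speed as rounded in the text, so the result should only be expected to land close to the quoted percentage.

```python
def linpack_efficiency(measured_flops, cores, peak_per_core):
    """Ratio of measured Linpack speed to theoretical peak performance."""
    return measured_flops / (cores * peak_per_core)


# Assumption: per-core peak derived from the BlueGene/Q description.
Q_CORE_PEAK = 1.6e9 * 4 * 2        # flop/s per core

sequoia_cores = 1_572_864
sequoia_measured = 16.3e15         # flop/s, as rounded in the text

sequoia_peak = sequoia_cores * Q_CORE_PEAK
efficiency = linpack_efficiency(sequoia_measured, sequoia_cores, Q_CORE_PEAK)

print(f"theoretical peak : {sequoia_peak / 1e15:.2f} Pflop/s")
print(f"efficiency       : {efficiency:.1%}")
```

The last digit of the computed efficiency depends on how the measured speed was rounded, so a small deviation from the published value is to be expected.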