The BlueGene/Q is the latest generation so far in the BlueGene family. The performance per processor has increased hugely with respect to its predecessor, the BlueGene/P: from 13.6 to 204.8 Gflop/s, i.e., from 3.4 to 12.8 Gflop/s per core. This is due to several factors: the clock frequency almost doubled, the floating-point output per core doubled because of the 4 floating-point units per core that are capable of turning out 4 fused multiply-add results per cycle, and there are 16 instead of 4 cores per processor. Note, however, that the amount of memory per core has not increased: where in the P model 2 GB on a card feeds 8 cores (there are two processors on a card), in the Q model 32 cores draw on 16 GB of memory (again with 2 processors on a card).

Another deviation from the earlier models is the interconnect. It is now a 5-D torus with a link speed of 2 GB/s, while the tree network present in the former L model and in the P model has disappeared. The two extra dimensions compensate for this loss while also increasing the resiliency of the network: a 3-D torus is rather vulnerable to link failures. A processor has 11 links, of which 10 are needed for the 5-D torus directions; the one spare link can be used for other purposes or in case of failure of another link. This resiliency is all the more critical for the very large systems that are envisioned to be built from these components.

Although no official maximum size is given for BlueGene/Q systems, the 20 Pflop/s Sequoia system was commissioned for Lawrence Livermore National Laboratory and a 10 Pflop/s system for Argonne National Laboratory. As with the earlier models, this scale can be achieved because of the high density: a BlueGene/Q node card houses 32 two-processor compute cards, 16 node cards are fitted onto a midplane, and a rack contains two of these populated midplanes, therefore delivering almost 420 Tflop/s. Consequently, tens of racks are needed to build systems of such sizes, and reliability features become extremely important.
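The peak figures above follow from simple arithmetic. A quick sketch, using the clock rate (1.6 GHz) and flops per cycle implied by the factors in the text:

```python
# Peak-performance arithmetic for the BlueGene/Q figures quoted above.
# Assumed inputs: 1.6 GHz clock, 4 FP units/core each completing a fused
# multiply-add (2 flops) per cycle, and 16 cores per processor.
clock_ghz = 1.6
flops_per_cycle = 4 * 2                # 4 FMA results/cycle = 8 flops/cycle
cores_per_processor = 16

gflops_per_core = clock_ghz * flops_per_cycle                   # 12.8 Gflop/s
gflops_per_processor = gflops_per_core * cores_per_processor    # 204.8 Gflop/s

# Density, as stated in the text: 2 processors per compute card,
# 32 compute cards per node card, 16 node cards per midplane,
# 2 midplanes per rack.
processors_per_rack = 2 * 32 * 16 * 2                           # 2048
tflops_per_rack = gflops_per_processor * processors_per_rack / 1000

print(gflops_per_core, gflops_per_processor, tflops_per_rack)
# 12.8 Gflop/s per core, 204.8 Gflop/s per processor, ~419.4 Tflop/s per rack
```

This reproduces the "almost 420 Tflop/s" per rack quoted above.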
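To make the 5-D torus link count concrete, the following sketch enumerates the torus neighbors of a node; each node has one neighbor in the plus and minus direction of each of the 5 dimensions, accounting for the 10 torus links mentioned above. The partition shape used here is illustrative, not an actual BlueGene/Q configuration:

```python
# Neighbors of a node in a d-dimensional torus with wrap-around links.
def torus_neighbors(coord, dims):
    """Return the coordinates of all torus neighbors of `coord`."""
    neighbors = []
    for d in range(len(dims)):
        for step in (-1, 1):
            n = list(coord)
            n[d] = (n[d] + step) % dims[d]  # wrap-around makes it a torus
            neighbors.append(tuple(n))
    return neighbors

dims = (4, 4, 4, 4, 4)                 # hypothetical 5-D partition shape
nbrs = torus_neighbors((0, 0, 0, 0, 0), dims)
print(len(nbrs))                       # 10: the torus uses 10 of the 11 links
```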
In the BlueGene/Q the compute nodes run a reduced-kernel type of
Linux to reduce the OS jitter that normally occurs when very many nodes are
involved in a computation. Interface nodes for interaction with the users and
for providing I/O services run a full version of the operating system. The
jitter reduction is also achieved by the 17th core on each processor chip,
which is dedicated to operating-system tasks so that the 16 compute cores
are not interrupted.
## Measured Performances

In the TOP500 list a speed of 17.17 Pflop/s out of a theoretical peak of 20.13 Pflop/s was measured for the BlueGene/Q Sequoia system, using 1,572,864 cores to solve a dense linear system with an efficiency of 85.3%.
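The quoted efficiency is simply the ratio of the measured speed to the theoretical peak. A quick check, deriving the peak from the core count and the 12.8 Gflop/s per core implied by the figures in the text:

```python
# Linpack efficiency for Sequoia: measured Rmax over theoretical Rpeak.
cores = 1_572_864
gflops_per_core = 12.8                 # 1.6 GHz x 8 flops/cycle
rpeak_pflops = cores * gflops_per_core / 1e6   # theoretical peak in Pflop/s
rmax_pflops = 17.17                    # measured TOP500 speed

efficiency = rmax_pflops / rpeak_pflops
print(round(rpeak_pflops, 2), round(100 * efficiency, 1))
# -> 20.13 Pflop/s peak, 85.3% efficiency
```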