The IBM BlueGene/Q

    Machine type            RISC-based distributed-memory multi-processor
    Models                  IBM BlueGene/Q
    Operating system        Linux
    Connection structure    5-D torus
    Compilers               XL Fortran 90, XL C, C++
    Vendors information     Web page: www-1.ibm.com/servers/deepcomputing/bluegene
    Year of introduction    2012

    System parameters:

    Model                           BlueGene/Q
    Clock cycle                     1.6 GHz
    Theor. peak performance
      Per proc. (64-bit)            204.8 Gflop/s
      Maximal                       1.5/3 Pflop/s
    Main memory
      Memory/card                   16 GB
      Memory maximal                --
    No. of processors               --
    Communication bandwidth
      Point-to-point (5-D torus)    2 GB/s

    Remarks:

    The BlueGene/Q is the latest generation so far in the BlueGene family. The performance per processor has increased enormously with respect to its predecessor, the BlueGene/P: from 13.6 to 204.8 Gflop/s. This is due to several factors: the clock frequency almost doubled, from 850 MHz to 1.6 GHz, and the floating-point output per core per cycle also doubled because each core now has 4 floating-point units, together turning out 4 fused multiply-add results per cycle. Furthermore, there are 16 instead of 4 cores per processor. Note, however, that the amount of memory per core has by no means kept pace: where in the P model a compute card carries 2 GB for the 4 cores of its processor (0.5 GB per core), in the Q model the 16 cores of a compute card draw on 16 GB of memory (1 GB per core).
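
    The combination of these factors can be checked with a back-of-the-envelope calculation. The sketch below, in C, uses the clock rates, core counts, and fused multiply-add widths quoted above (the BlueGene/P figures are assumptions taken from this text, not vendor-verified) and reproduces the 13.6 and 204.8 Gflop/s numbers.

        #include <stdio.h>

        /* Back-of-the-envelope peak calculation with the figures quoted in
           the text; the BlueGene/P numbers are assumed for comparison. */
        int main(void)
        {
            /* BlueGene/P: 0.85 GHz, 4 cores, 2 FMA units/core, 2 flops per FMA */
            double bgp = 0.85 * 4 * 2 * 2;   /*  13.6 Gflop/s per processor */

            /* BlueGene/Q: 1.6 GHz, 16 cores, 4 FMA units/core, 2 flops per FMA */
            double bgq = 1.6 * 16 * 4 * 2;   /* 204.8 Gflop/s per processor */

            printf("BlueGene/P peak: %6.1f Gflop/s per processor\n", bgp);
            printf("BlueGene/Q peak: %6.1f Gflop/s per processor\n", bgq);
            printf("Increase factor: %.1fx\n", bgq / bgp);
            return 0;
        }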

    Another deviation from the earlier models is the interconnect. It is now a 5-D torus with a link speed of 2 GB/s, while the tree network present in the former L and P models has disappeared. The two extra dimensions compensate for this loss, and the resiliency of the network is increased: a 3-D torus is rather vulnerable to link failures. A processor has 11 links, of which 10 are needed for the 5-D torus directions and one is a spare that can be used for other purposes or in case of failure of another link. This is all the more important for the very large systems that are built from these components. Although no official maximum size is given for BlueGene/Q systems, the 20 Pflop/s Sequoia system was commissioned for Lawrence Livermore National Laboratory and a 10 Pflop/s system for Argonne National Laboratory. As with the earlier models, this is achievable because of the high packaging density: a BlueGene/Q node card houses 32 single-processor compute cards, 16 node cards fit onto a midplane, and a rack contains two of these populated midplanes, which therefore delivers about 210 Tflop/s. Consequently, tens of racks are needed to build systems of this size and reliability features become extremely important.
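
    To see how this density adds up, the sketch below walks through the packaging hierarchy described above (compute card, node card, midplane, rack). The 96-rack system size is an assumption chosen to match the Sequoia figures quoted in this section.

        #include <stdio.h>

        /* Packaging hierarchy of a BlueGene/Q installation as described in the
           text; the rack count of 96 is an assumption for a Sequoia-sized system. */
        int main(void)
        {
            const double gflops_per_proc     = 204.8;  /* one 16-core processor       */
            const int cards_per_nodecard     = 32;     /* compute cards per node card */
            const int nodecards_per_midplane = 16;
            const int midplanes_per_rack     = 2;
            const int cores_per_proc         = 16;

            int procs_per_rack = cards_per_nodecard * nodecards_per_midplane
                                 * midplanes_per_rack;                 /* 1024 */
            double tflops_per_rack = procs_per_rack * gflops_per_proc / 1000.0;

            int racks = 96;                            /* assumed system size */
            printf("Processors per rack: %d\n", procs_per_rack);
            printf("Peak per rack      : %.1f Tflop/s\n", tflops_per_rack);
            printf("Peak for %d racks  : %.2f Pflop/s\n",
                   racks, racks * tflops_per_rack / 1000.0);
            printf("Cores in %d racks  : %d\n",
                   racks, racks * procs_per_rack * cores_per_proc);
            return 0;
        }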

    In the BlueGene/Q the compute nodes run a reduced-kernel type of Linux to lessen the OS jitter that normally occurs when very many nodes take part in a computation. Interface nodes, which handle interaction with the users and provide the I/O services, run a full version of the operating system. Jitter is further reduced by the 17th core of the processor, which is dedicated to OS tasks (see the description of the BlueGene/Q processor).
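
    As a rough illustration of what OS jitter means for bulk-synchronous codes, the generic probe below times many identical fixed-work quanta and reports the spread between the fastest and slowest one; on a node with a quiet, reduced kernel the spread should stay small. This is a sketch for illustration only, not an IBM-supplied tool.

        #include <stdio.h>
        #include <time.h>

        /* Generic OS-jitter probe: time identical fixed-work quanta and report
           the min/max spread; interruptions by the OS show up as outliers. */
        #define REPS 10000
        #define WORK 100000

        int main(void)
        {
            double tmin = 1e30, tmax = 0.0;
            volatile double x = 1.0;

            for (int r = 0; r < REPS; r++) {
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (int i = 0; i < WORK; i++)         /* fixed amount of work */
                    x = x * 1.0000001 + 0.0000001;
                clock_gettime(CLOCK_MONOTONIC, &t1);

                double dt = (t1.tv_sec - t0.tv_sec)
                          + (t1.tv_nsec - t0.tv_nsec) * 1.0e-9;
                if (dt < tmin) tmin = dt;
                if (dt > tmax) tmax = dt;
            }
            printf("min %.6f s, max %.6f s, spread %.2fx\n",
                   tmin, tmax, tmax / tmin);
            return 0;
        }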

    Measured Performances

    In the TOP500 list of November 2012 a speed of 17.17 Pflop/s against a theoretical peak of 20.13 Pflop/s was measured for the BlueGene/Q Sequoia system, using 1,572,864 cores to solve a dense linear system with an efficiency of 85.3%.
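
    The quoted efficiency is simply the ratio of the measured Linpack result to the theoretical peak; the small sketch below reproduces it from the figures in this section.

        #include <stdio.h>

        /* HPL efficiency = Rmax / Rpeak, with the Sequoia figures quoted above. */
        int main(void)
        {
            double rmax  = 17.17;   /* measured Linpack performance, Pflop/s */
            double rpeak = 20.13;   /* theoretical peak, Pflop/s             */
            printf("Efficiency: %.1f%%\n", 100.0 * rmax / rpeak);
            return 0;
        }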