The Bull bullx systems

Machine type: Hybrid distributed-memory system.
Models: bullx blade system B51x, B70x; bullx R42x E2/F2.
Operating system: Linux, Windows Server 2008.
Connection structure: Variable.
Compilers: Intel's Fortran 95, C(++).
Vendor's information Web page: http://www.bull.com/bullx/
Year of introduction: 2012.

     

System parameters:
Model: Blade B51x and B70x / R42x E2/F2
Clock cycle: up to 2.7 GHz
Theor. peak performance: 3.53 Tflop/s/blade chassis; 820 Gflop/s/R424 server unit
Accelerators: NVIDIA Kepler K20X, Intel Xeon Phi
Main memory: ≤ 256 GB/blade or server unit
No. of processors: Variable
Comm. bandwidth: 4–6.8 GB/s (QDR or FDR Infiniband)
Aggregate peak: Variable

Remarks:

As already stated before, it becomes more and more difficult to distinguish between clusters and what used to be called "integrated" parallel systems, as the latter increasingly employ standard components that can also be found in any cluster. For the new bullx systems, available from Bull since spring 2009, this is certainly the case. There are, however, a number of distinguishing features of the bullx systems that made us decide to discuss them in this overview.

The systems come in two variants. The first is a blade system with 18 blades in a 7U chassis; the two blade series are similar except in the cooling: the B510 series is air-cooled, while the equivalent B710 series has direct liquid cooling, i.e., water is run through a copper heat sink fitted onto the board that holds the compute components, such as the CPUs (and/or accelerators, see below) and the memory. The other variant is based on 1U units that pack 2 boards together, each containing 2 processors. The processor employed in both models is Intel's 12-core Ivy Bridge processor discussed on the Xeon page. The density of both types of systems is equal, so the choice between them is up to the preference of the customer.

The blade B510 system can be made hybrid, i.e., GPU accelerators are integrated by putting B515 blades in the system. The B515s have a double-blade form factor and contain 2 Ivy Bridge processors and two nVIDIA Kepler K20X cards (see the nVIDIA page). Likewise, Intel's Xeon Phi accelerators are offered in the same form factor; both accelerator types have a peak speed of over 1 Tflop/s.
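
As a rough check on the "over 1 Tflop/s" claim, the sketch below recomputes the double-precision peaks from the published accelerator specifications. The Xeon Phi 5110P is taken here as a representative SKU, since the text does not name the exact model offered by Bull, so the figures are estimates rather than vendor data.

# Back-of-the-envelope double-precision peaks for the two accelerator options.
# The Xeon Phi SKU is an assumption; the text does not specify one.

def peak_gflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Theoretical peak in Gflop/s: units x flops/unit/cycle x clock (GHz)."""
    return units * flops_per_unit_per_cycle * clock_ghz

# NVIDIA Kepler K20X: 14 SMX units x 64 FP64 ALUs, FMA counts as 2 flops, 732 MHz.
k20x = peak_gflops(units=14 * 64, flops_per_unit_per_cycle=2, clock_ghz=0.732)

# Intel Xeon Phi 5110P: 60 cores, 512-bit vectors (8 DP lanes) with FMA = 16 flops/cycle, 1.053 GHz.
phi = peak_gflops(units=60, flops_per_unit_per_cycle=16, clock_ghz=1.053)

print(f"Kepler K20X   : {k20x:7.1f} Gflop/s")  # ~1311.7 Gflop/s
print(f"Xeon Phi 5110P: {phi:7.1f} Gflop/s")   # ~1010.9 Gflop/s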

For the R42x E2-based systems there is a 1U enclosure containing a K20X GPU, in which case the unit is called an R423. The R424 packs four processors in 2U, so the density is the same as for the R422, but it has more reliability features built in. The same goes for the R425, which contains 4 Ivy Bridge processors and 2 K20X GPUs. The F2 model is identical to the E2 model, except that it allows for extended storage with SAS disks and RAID disks. In all cases Intel Xeon Phis are also available as accelerators. For both the blade and the rack systems, SSD storage is supported instead of spinning disks.

For the rack systems QDR Infiniband (see the Infiniband page) is available as the interconnection medium. For the blade-based models a 36-port QDR or FDR module is integrated in the 7U chassis holding the blades. Of course the topology between chassis or rack units is up to the customer and therefore variable with respect to global bandwidth and point-to-point latencies.
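
The 4–6.8 GB/s communication bandwidth range quoted in the system parameters follows directly from the Infiniband signalling rates and line encodings for a 4x link. A minimal sketch of that back-of-the-envelope calculation (not vendor data):

# Usable per-direction bandwidth of a 4x Infiniband link, derived from the
# signalling rate per lane and the line encoding (8b/10b for QDR, 64b/66b for FDR).

def ib_bandwidth_gbytes(lanes, signal_gbit_per_lane, encoding_efficiency):
    """Effective one-directional bandwidth in GB/s."""
    return lanes * signal_gbit_per_lane * encoding_efficiency / 8  # bits -> bytes

qdr = ib_bandwidth_gbytes(lanes=4, signal_gbit_per_lane=10.0,    encoding_efficiency=8 / 10)
fdr = ib_bandwidth_gbytes(lanes=4, signal_gbit_per_lane=14.0625, encoding_efficiency=64 / 66)

print(f"4x QDR: {qdr:.1f} GB/s")  # 4.0 GB/s
print(f"4x FDR: {fdr:.1f} GB/s")  # ~6.8 GB/s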

Measured Performances:
For a 77,184-core B510-based system with 2.7 GHz processors at CEA, France, a performance of 1.359 Pflop/s was measured in solving a dense linear system of unknown size. See the TOP500 list.
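
As a plausibility check on this figure, the sketch below estimates the theoretical peak of such a system from the core count and clock frequency given above and derives the implied HPL efficiency. The value of 8 double-precision flops per core per cycle is an assumption (AVX-capable cores with separate add and multiply pipes), not a number stated in the text, so the results are estimates only.

# Estimate the theoretical peak and HPL efficiency of the CEA system quoted above.
# Assumption: 8 double-precision flops per core per cycle (256-bit AVX, add + multiply).

cores       = 77_184
clock_ghz   = 2.7
flops_cycle = 8           # assumed, see lead-in
rmax_pflops = 1.359       # measured HPL performance quoted above

rpeak_pflops = cores * clock_ghz * flops_cycle / 1e6   # Gflop/s -> Pflop/s
print(f"Estimated Rpeak: {rpeak_pflops:.3f} Pflop/s")        # ~1.667 Pflop/s
print(f"HPL efficiency : {rmax_pflops / rpeak_pflops:.1%}")  # ~81.5%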

     

The Bull bullx S6010/S6030 systems

Machine type: Hybrid distributed-memory system.
Models: bullx S6010/S6030.
Operating system: Linux, Windows Server 2008.
Connection structure: Variable.
Compilers: Intel's Fortran 95, C(++).
Vendor's information Web page: http://www.bull.com/bullx/bullxS.html
Year of introduction: 2010/2011.

     

System parameters:
Model: S6010, S6030
Clock cycle: 2.26, 2.4, or 2.66 GHz
Theor. peak performance: 9.0, 11.6, 12.8 Gflop/s/core; 145–515 Gflop/s/3U drawer
Accelerator: NVIDIA Tesla S2050
Main memory: ≤ 512 GB/node
No. of processors: Variable
Comm. bandwidth: 2.5 GB/s (QDR Infiniband)
Aggregate peak: Variable

Remarks:
Bull calls the S6010/S6030 building blocks of this system "Supernodes" because of the amount of memory that can be accommodated. A Bull-proprietary switch enables 4 four-processor nodes to work as one ccNUMA system with up to 160 cores (depending on which processor is chosen). In that respect there is much choice, because one can choose from the 4- to 8-core Nehalem EX at 2.26 GHz to the 6- to 10-core Westmere EX processor running at either 2.4 or 2.66 GHz. This is also the reason for the wide range of performances for a 3U drawer unit: the contents can range from 4 four-core Nehalem EX processors to 4 ten-core Westmere EX processors.
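
A quick sketch of where these numbers come from: Nehalem EX and Westmere EX cores both retire 4 double-precision flops per cycle, which reproduces the 9.0 Gflop/s/core figure at 2.26 GHz, the 145 Gflop/s lower bound for a drawer with 4 four-core processors, and the 160-core maximum for a 4-node ccNUMA partition. The flops-per-cycle value is standard for these processor generations but is an inference, not stated in the text.

# Reconstructing the Supernode figures quoted above.
# Assumption: Nehalem EX / Westmere EX retire 4 double-precision flops per core per cycle.

FLOPS_PER_CYCLE = 4

def core_peak_gflops(clock_ghz):
    """Per-core theoretical peak in Gflop/s."""
    return clock_ghz * FLOPS_PER_CYCLE

# Largest ccNUMA partition: 4 nodes x 4 sockets x 10-core Westmere EX.
max_cores = 4 * 4 * 10
print(f"Max ccNUMA cores        : {max_cores}")                     # 160

# Smallest 3U drawer: 4 sockets x 4-core Nehalem EX at 2.26 GHz.
per_core = core_peak_gflops(2.26)
print(f"Per-core peak @ 2.26 GHz: {per_core:.1f} Gflop/s")          # ~9.0
print(f"Drawer peak (low end)   : {4 * 4 * per_core:.0f} Gflop/s")  # ~145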

The packaging of the S6010 is rather odd: the node is L-shaped, and by flipping it over one can fit it on top of another S6010 node such that the pair fits in a 3U rack space. The S6030 has a height of 3U and contains the same components as two S6010s, but in addition has more PCIe slots: 2 PCIe Gen2 ×16 and 4 PCIe ×8, against 1 PCIe ×16 per S6010. Furthermore, it can house much more disk storage: 6 SATA disks against 1 in the S6010, and up to 8 SAS disks or SATA SSD units. Clearly, the S6010 is more targeted at computational tasks, while the S6030 is also well-equipped for server tasks.
As in the blade systems discussed above, the nodes are connected via QDR Infiniband at 2.5 GB/s in a topology that can be chosen by the customer.

Measured Performances:
The TERA 100 system of CEA in France is an S60x0-based machine containing 8-core Nehalem EX processors. In the TOP500 list a performance of 1.05 Pflop/s out of a peak of 1.254 Pflop/s was reported for this 138,368-core system on a linear system of size 4,926,336, amounting to an efficiency of 83.7%.
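
The quoted efficiency is simply the ratio of measured to peak performance, and the per-core peak implied by the TOP500 figures matches the 9.0 Gflop/s/core listed in the system parameters for the 2.26 GHz Nehalem EX. A minimal check:

# Efficiency and implied per-core peak for the TERA 100 result quoted above.

rmax_pflops  = 1.05      # measured HPL performance
rpeak_pflops = 1.254     # theoretical peak
cores        = 138_368

print(f"HPL efficiency        : {rmax_pflops / rpeak_pflops:.1%}")           # ~83.7%
print(f"Implied per-core peak : {rpeak_pflops * 1e6 / cores:.2f} Gflop/s")   # ~9.06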