GPU Accelerators

    Graphics processing is characterised by doing the same (floating-point) operation on massive amounts of data. To accommodate this way of processing, Graphical Processing Units (GPUs) consist of a large number of relatively simple processors, fast but limited local memory, and fast internal buses to transport the operands and results. Until recently all calculations, and hence the results, were in 32-bit precision. This is hardly of consequence for graphics processing, as the colour of a pixel in a scene may be a shade off without anyone noticing. HPC users often have computational demands similar to those in the graphical world: the same operation on very many data items. So it was natural to look into GPUs with their many integrated parallel processors and fast memory. The first adopters of GPUs from the HPC community therefore disguised their numerical program fragments as graphical code (e.g., by using the graphical language OpenGL) to get fast results, often with remarkable speedups. Another advantage is that GPUs are relatively cheap because of the enormous numbers that are sold for graphical use in virtually every PC.

    A drawback was the 32-bit precision of the usual GPU and, in some cases more importantly, the absence of error correction. However, by carefully considering which computations really need 64-bit precision and which do not, and by adjusting algorithms accordingly, the use of a GPU can be entirely satisfactory. GPU vendors have been quick to focus on the HPC community. They tended to rename their graphics cards to GPGPU, general-purpose GPU, although the product was largely identical to the graphics cards sold in every shop. But there have also been real improvements to attract HPC users: 64-bit GPUs have come onto the market. In addition, it is no longer necessary to reformulate a computational problem into a piece of graphics code.

    Both ATI/AMD and NVIDIA claim compatibility with IEEE 754 (the floating-point computation standard), but neither of them supports it in full (see the section on NVIDIA). There are C-like languages and runtime environments available that make the life of a GPU developer much easier: for NVIDIA this is CUDA, which has become quite popular with users of these systems. AMD/ATI is concentrating on the OpenCL standard (see below). It is somewhat more cumbersome but still provides a much better alternative to emulating graphics code.
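
    The remark above about deciding which computations really need 64-bit precision can be illustrated with a small host-side sketch in plain C. It is not GPU code and not taken from any vendor library: the function names `sum_f32` and `sum_f32_acc64` are hypothetical, and the only assumption is that the data stay in 32-bit precision while the accumulator alone is promoted to 64-bit.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n 32-bit values with a 32-bit accumulator: all arithmetic is in float. */
float sum_f32(const float *x, size_t n)
{
    assert(n == 0 || x != NULL);
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];            /* rounding error grows as acc becomes large */
    return acc;
}

/* Same 32-bit input data, but only the accumulator is held in 64-bit.
 * Hypothetical helper: it illustrates promoting just the sensitive part
 * of a computation rather than the whole data set. */
double sum_f32_acc64(const float *x, size_t n)
{
    assert(n == 0 || x != NULL);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i];    /* each float converts to double exactly */
    return acc;
}
```

    Summing a long array of identical values such as 0.1f with both functions shows the idea: the 32-bit accumulator drifts away from the true sum, while the mixed version stays close to it although the input still occupies only 32 bits per element.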

    When one develops a code for a particular GPU platform, it cannot be transferred to another platform without considerable effort in rewriting the code. This drawback has been taken up by the GPU vendors (and not only by them). OpenCL should yield code that is in principle platform independent, thus protecting the development effort put into the acceleration of a program. Presently, Apple, ATI/AMD, Intel, and NVIDIA are members of the consortium that are willing to provide an OpenCL language interface. First experiences with OpenCL version 1.0 as provided by the Khronos Group showed generally low performance, but one might expect this to improve with OpenCL 2.0, the specification of which was released in provisional form in July 2013.

    Still, many HPC users/developers do not want to go through the trouble of extensively transforming their programs. To meet this large audience, Cray, NVIDIA, CAPS, and the Portland Group have defined a set of comment directives/pragmas under the name of OpenACC, the use of which should result in offloading portions of the code to an attached GPU. Although it may not exploit the GPU to the full, it is much easier to use and leaves the original code intact.
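
    A minimal sketch of what such a directive looks like in C is given below. The function name `saxpy` and its argument list are hypothetical; the directive uses the OpenACC `parallel loop` construct with `copyin` and `copy` data clauses. A compiler without OpenACC support ignores the pragma and executes the loop serially on the host, which is what is meant by leaving the original code intact.

```c
#include <assert.h>
#include <stddef.h>

/* Compute y[i] = a * x[i] + y[i] for 0 <= i < n.
 * With an OpenACC-enabled compiler the loop may be offloaded to an
 * attached accelerator; otherwise the pragma is ignored and the loop
 * runs unchanged on the CPU. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n <= 0 || (x != NULL && y != NULL));

    /* copyin: x is only read on the device.
     * copy:   y is sent to the device and copied back afterwards. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

    The data clauses are the part that usually needs the most thought: they state which arrays must travel to and from the accelerator, and unnecessary transfers can easily cancel out the gain from offloading.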

    Another way to be (relatively) independent of the platform is to employ a language transformer. For instance, CAPS provides transformation tools that can target different types of accelerators or multi-core CPUs. With CAPS' product HMPP the transformation is brought about by inserting pragmas in C code or comment directives in Fortran code. With this ability HMPP is presently the only product that can accelerate Fortran code on general GPU accelerators. The Portland Group sells a CUDA/Fortran compiler, but it targets only NVIDIA GPUs.

    In the sections on ATI/AMD and NVIDIA we describe some high-end GPUs that are more or less targeted at the HPC community.