Computational Accelerators

In the last few years computational accelerators have emerged and have now taken a firm foothold. They come in various forms, of which we will discuss some general characteristics. Accelerators are not a new phenomenon: in the 1980s, for instance, Floating Point Systems sold attached processors like the AP120-B with a peak performance of 12 Mflop/s, easily 10 times faster than the general-purpose systems they were connected to. The processor array machines described in the DM-SIMD section could also be regarded as accelerators for matrix-oriented computations in their time. A similar phenomenon is upon us at the moment. HPC users are never content with the performance of the machines at their disposal and are continuously looking for ways to speed up their calculations, or parts of them. Accelerator vendors are complying with this wish, and there is presently a fair number of products that, when properly deployed, can deliver significant performance gains.
The scene is roughly divided into three unequal parts:

Graphics cards or Graphics Processing Units (GPUs, as opposed to general-purpose CPUs).
General floating-point accelerators.
Field Programmable Gate Arrays (FPGAs).

The appearance of accelerators is believed to set a trend in HPC, namely that processing units should be diversified according to their abilities, not unlike the occurrence of different functional units within a CPU core. (In principle it is entirely possible to perform floating-point computations with integer functional units, but the costs are so high that no one will attempt it.) In a few years this will lead to hybrid systems that incorporate different processors for different computational tasks. Of course, processor vendors can choose to (attempt to) integrate such special-purpose processing units within their main processor line, but as of now it is not certain if or how this will happen.

When speaking of special-purpose processors, in this case computational accelerators, one should realise that they are indeed good at some specialised computations while totally unable to perform others. So not all applications can benefit from them, and those that can do not all benefit to the same degree. Furthermore, using accelerators effectively is not at all trivial. Although the Software Development Kits (SDKs) for accelerators have improved enormously lately, for many applications it is still a challenge to obtain a significant speedup. An important factor in this is that data must be shipped in and out of the accelerator, and the bandwidth of the connecting bus is in most cases a severe bottleneck. One generally tries to overcome this by overlapping data transport to and from the accelerator with processing. Tuning the computation and data transport tasks can be cumbersome. This hurdle has been recognised by at least three software companies: Acceleware, CAPS, and Rapidmind. They offer products that automatically transform standard C/C++ programs into a form that integrates the functionality of GPUs, multi-core CPUs (which are often also not used optimally), and, in the case of Rapidmind, Cell processors.
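
The effect of the bus bottleneck, and of overlapping transfers with processing, can be illustrated with a toy timing model. The sketch below is not taken from any vendor SDK: the function name offload_times, its parameters, and the numbers in the usage example are illustrative assumptions. It further assumes that the work can be split into equal, independent chunks and that data transport and computation behave as a simple two-stage pipeline.

```python
def offload_times(cpu_time, accel_speedup, bytes_moved, bus_bandwidth, chunks=1):
    """Toy model of offloading a kernel to an accelerator (all names hypothetical).

    cpu_time       -- seconds the kernel takes on the host CPU
    accel_speedup  -- how many times faster the accelerator computes the kernel
    bytes_moved    -- total bytes shipped in and out of the accelerator
    bus_bandwidth  -- sustained bus bandwidth in bytes per second
    chunks         -- number of equal pieces the work is split into

    Returns (serial_time, overlapped_time) in seconds.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")

    compute = cpu_time / accel_speedup      # time spent computing on the accelerator
    transfer = bytes_moved / bus_bandwidth  # time spent moving data over the bus

    # No overlap: all data is moved, then all computation is done.
    serial_time = transfer + compute

    # Overlap: a two-stage pipeline over equal chunks. The first chunk pays
    # for both stages; every later chunk costs only the slower stage.
    t_chunk = transfer / chunks
    c_chunk = compute / chunks
    overlapped_time = t_chunk + c_chunk + (chunks - 1) * max(t_chunk, c_chunk)

    return serial_time, overlapped_time


if __name__ == "__main__":
    # Hypothetical numbers: a 10 s kernel, an accelerator 20x faster at it,
    # 4 GB of traffic over a 4 GB/s bus, split into 10 chunks.
    cpu_time = 10.0
    serial, overlapped = offload_times(cpu_time, 20.0, 4e9, 4e9, chunks=10)
    print(f"speedup without overlap: {cpu_time / serial:.2f}")
    print(f"speedup with overlap:    {cpu_time / overlapped:.2f}")
```

In this model the accelerator is 20 times faster at the kernel itself, yet the achievable speedup stays well below that because the transfer time dominates. Overlapping helps, but it can never push the total time below that of the slower stage, which is why the bus bandwidth remains the limiting factor.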

There is one other important consideration that makes accelerators popular: in comparison to general-purpose CPUs they are all very power-efficient, sometimes by orders of magnitude when expressed in flop/s per Watt. Of course they will do only part of the work in a complete system, but the power savings can still be considerable, which is very attractive these days.

We will now proceed to discuss the three classes of accelerators mentioned above. It must be realised, though, that developments in this field are extremely rapid; the information given here will therefore become obsolete quickly and may be only approximate.