GPU Accelerators

ATI/AMD
The latest product from ATI (now wholly owned by AMD) is the ATI Firestream 9170 card. Not enough information is available to draw a block diagram, but we list the most important features of the processor:

Table 2.1: Some specifications for the ATI/AMD Firestream 9170 GPU.

Feature size                 55 nm
Number of processors         320
Memory (GDDR3)               2 GB
Clock frequency              775 MHz
Peak performance             497 Gflop/s
Power requirement            ≤ 100 W
Interconnect (PCIe Gen2)     16×, 8 GB/s
Floating-point support       Partial (32/64-bit)

Its successor, the Firestream 9250, is expected to come out in the third quarter of 2008. This card will have roughly double the performance of the Firestream 9170 and presumably will use ≤ 150 W. The specifications given indicate that each core can generate 2 floating-point results per cycle, presumably the result of an add and a multiply operation. Whether these results can be produced independently or result from linked operations is not known because of the lack of information.
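The figure of 2 results per core per cycle follows directly from the table entries. The short sketch below redoes that arithmetic; the function and variable names are our own, and only the three numbers are taken from Table 2.1.

```python
# Values copied from Table 2.1 (ATI/AMD Firestream 9170).
PEAK_GFLOPS = 497.0      # peak performance in Gflop/s
NUM_PROCESSORS = 320     # number of processors (cores)
CLOCK_GHZ = 0.775        # clock frequency, 775 MHz


def results_per_core_per_cycle(peak_gflops, num_processors, clock_ghz):
    """Floating-point results each core must deliver per cycle to reach peak."""
    return peak_gflops / (num_processors * clock_ghz)


ratio = results_per_core_per_cycle(PEAK_GFLOPS, NUM_PROCESSORS, CLOCK_GHZ)
print(f"{ratio:.2f} results per core per cycle")
```

The quotient is very close to 2, which is consistent with one add and one multiply per cycle, but the arithmetic alone cannot tell whether the two operations are independent or linked.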

Like its direct competitor, NVIDIA, ATI offers a C-like language, BROOK+, and the accompanying run time system to ease the use of the card. The SDK containing these products is free and can be installed in both Linux (RedHat and SuSE) and Windows environments. Objects that have to be handled by the card are declared in a special syntax, and there are library functions to put the data onto the card and to retrieve results from it. Functions that should be performed on the card are called "kernels". They typically operate on the stream objects defined as such in the BROOK+ program. Although this looks simple, to get optimum performance one should carefully balance the amount of computation against the data transport: however fast the PCIe bus that transports the data to and from the GPU may be, shipping the data on and off the card still takes a significant amount of time. BROOK+ is as yet rather restricted in its functionality. To help out in situations that are not covered by BROOK+, an assembly language, CAL, can be used. This is, however, far from easy.
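To give a feeling for the balance between computation and data transport, the sketch below compares the two for a dense single-precision matrix product C = A * B of order n. It is an idealized model of our own, not a measurement: it assumes that the PCIe bus sustains the nominal 8 GB/s of Table 2.1, that the card runs at its 497 Gflop/s peak, that A and B are shipped to the card and C is shipped back, and that latencies are negligible. The function name `offload_times` is hypothetical.

```python
PCIE_BANDWIDTH_GBS = 8.0   # PCIe Gen2 16x, from Table 2.1 (GB/s)
PEAK_GFLOPS = 497.0        # peak performance, from Table 2.1 (Gflop/s)
BYTES_PER_ELEMENT = 4      # assumption: 32-bit floating-point data


def offload_times(n):
    """Return (transfer_s, compute_s) for an n x n matrix product C = A * B.

    Assumes A and B are shipped to the card, C is shipped back, the bus
    runs at its nominal bandwidth and the card runs at peak speed.
    """
    bytes_moved = 3 * n * n * BYTES_PER_ELEMENT   # A and B in, C out
    flops = 2 * n ** 3                            # n^3 multiplies + n^3 adds
    transfer_s = bytes_moved / (PCIE_BANDWIDTH_GBS * 1e9)
    compute_s = flops / (PEAK_GFLOPS * 1e9)
    return transfer_s, compute_s


for n in (256, 1024, 4096):
    transfer_s, compute_s = offload_times(n)
    print(f"n={n:5d}  transfer {transfer_s * 1e3:8.3f} ms  "
          f"compute {compute_s * 1e3:8.3f} ms  "
          f"compute/transfer {compute_s / transfer_s:6.2f}")
```

Under these assumptions the transfer time exceeds the compute time for matrix orders below roughly 370, so small problems gain little from being shipped to the card. In practice neither the bus nor the card reaches its nominal rate, so the real break-even point will differ.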

NVIDIA
NVIDIA is the other big player in the GPU field with regard to HPC. Its latest product is the C1060, available as an individual card, but it is also possible to have 4 of these cards in a 1U rack enclosure, with four times the peak performance. Such rack-mounted systems are primarily made with the HPC community in mind. Again, we do not have enough information to provide a reliable block diagram, but the most important details are given below:

Table 2.2: Some specifications for the NVIDIA C1060 GPU.

Number of processors         240
Memory (GDDR3)               4 GB
Clock frequency              1.3 GHz
Peak performance             936 Gflop/s
Power requirement            225 W peak; 160 W typical
Interconnect (PCIe Gen2)     8×, 4 GB/s; 16×, 8 GB/s
Floating-point support       Partial (32/64-bit)

From these specifications it can be derived that 3 floating-point results per core per cycle can be delivered. Because of the scant information on the core structure it is not clear how this comes about. The typical power requirement given may not be entirely appropriate for HPC workloads. A large proportion of the work done will come from the BLAS library that is provided by NVIDIA, more specifically from the dense matrix-matrix multiplication in it. This operation occupies every computational core to the full, and one may expect a somewhat higher power consumption than what is considered typical for other kinds of work.
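The arithmetic behind these remarks can be written out as follows. The sketch uses only the entries of Table 2.2; the variable names are ours, and the Gflop/s-per-watt figures are simply peak performance divided by the two quoted power levels, not measured efficiencies.

```python
# Values copied from Table 2.2 (NVIDIA C1060).
PEAK_GFLOPS = 936.0       # peak performance in Gflop/s
NUM_PROCESSORS = 240      # number of processors (cores)
CLOCK_GHZ = 1.3           # clock frequency in GHz
POWER_PEAK_W = 225.0      # peak power requirement in W
POWER_TYPICAL_W = 160.0   # typical power requirement in W

# Results each core must deliver per cycle to reach the quoted peak.
results_per_cycle = PEAK_GFLOPS / (NUM_PROCESSORS * CLOCK_GHZ)

# Peak performance per watt at the two quoted power levels.
gflops_per_watt_peak = PEAK_GFLOPS / POWER_PEAK_W
gflops_per_watt_typical = PEAK_GFLOPS / POWER_TYPICAL_W

print(f"results per core per cycle: {results_per_cycle:.2f}")
print(f"Gflop/s per W at peak power:    {gflops_per_watt_peak:.2f}")
print(f"Gflop/s per W at typical power: {gflops_per_watt_typical:.2f}")
```

If compute-bound work such as dense matrix-matrix multiplication draws closer to the peak than to the typical power, the lower of the two per-watt figures is the more realistic one to plan with.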

Like ATI, NVIDIA provides an SDK that comprises a compiler named CUDA, libraries that include BLAS and FFT routines, and a runtime system that accommodates both Linux (RedHat and SuSE) and Windows. CUDA is a C/C++-like language with extensions and primitives that cause operations to be executed on the card instead of on the CPU core that initiates the operations. Transport to and from the card is done via library routines, and many threads can be initiated and placed in appropriate positions in the card memory so as not to cause memory congestion on the card. This means that for good performance one needs knowledge of the memory structure on the card in order to exploit it accordingly. This is not unique to the C1060 GPU; it pertains to the ATI Firestream GPU and other accelerators as well.