GPU Accelerators

ATI/AMD

The latest product from ATI (now wholly owned by AMD) is the ATI Firestream 9170 card. There is not enough information available for a block diagram, but the most important features of the processor are listed below:

 

Table 2.1: Some specifications for the ATI/AMD Firestream 9170 GPU.

Feature size                55 nm
Number of processors        320
Memory (GDDR3)              2 GB
Clock frequency             775 MHz
Peak performance            497 Gflop/s
Power requirement           ≤ 100 W
Interconnect (PCIe Gen2)    16×, 8 GB/s
Floating-point support      Partial (32/64-bit)

Its successor, the Firestream 9250, is expected in the third quarter of 2008. This card will have roughly double the performance of the Firestream 9170 and will presumably use ≤ 150 W. The specifications given indicate that each core can generate 2 floating-point results per cycle, presumably one from an add and one from a multiply operation. Whether these results can be produced independently or only as the result of linked operations is not known, because of the lack of information.
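The figure of 2 results per core per cycle follows from the entries in Table 2.1. A minimal sketch of that arithmetic is given below; the function and variable names are chosen here for illustration and do not come from any vendor tool.

```python
def results_per_core_per_cycle(peak_gflops, n_cores, clock_mhz):
    """Floating-point results each core must deliver per cycle to reach the quoted peak."""
    # Total number of core-cycles available per second across the whole card.
    core_cycles_per_second = n_cores * clock_mhz * 1e6
    return peak_gflops * 1e9 / core_cycles_per_second


# Values taken from Table 2.1 (Firestream 9170).
firestream_9170 = results_per_core_per_cycle(peak_gflops=497, n_cores=320, clock_mhz=775)
print(f"{firestream_9170:.2f} results per core per cycle")
```

The result is very close to 2, which is consistent with one add and one multiply per cycle, although the calculation by itself cannot tell how the two results are produced.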

Like its direct competitor, NVIDIA, ATI offers a C-like language, BROOK+, and an accompanying runtime system to ease the use of the card. The SDK containing these products is free and can be installed in both Linux (RedHat and SuSE) and Windows environments. Objects that have to be handled by the card are declared in a special syntax, and there are library functions to put the data onto the card and to retrieve the results from it. Functions that should be performed on the card are called "kernels". They typically operate on the stream objects defined as such in the BROOK+ program. Although this looks simple, for optimum performance one must carefully balance the amount of computation against the data transport: however fast the PCIe bus that carries the data to and from the GPU may be, shipping data on and off the card still takes a significant amount of time. BROOK+ is as yet rather restricted in its functionality. In situations that BROOK+ does not cover, an assembly language, CAL, can be used. This is, however, far from easy.
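The balance between computation and data transport can be made concrete with a rough model. The sketch below is an illustration under strong assumptions, not a measurement: it takes the PCIe Gen2 16× rate as 8 GB/s and the peak rate of 497 Gflop/s from Table 2.1, assumes both are fully attained (real codes reach neither), and uses a single-precision n × n matrix-matrix multiplication as a hypothetical workload. All function names are invented for this example.

```python
def transfer_seconds(n_bytes, bus_gb_per_s=8.0):
    """Time to move n_bytes over the bus at the assumed sustained rate."""
    return n_bytes / (bus_gb_per_s * 1e9)


def compute_seconds(n_flops, peak_gflops=497.0):
    """Time to execute n_flops at the assumed (peak) floating-point rate."""
    return n_flops / (peak_gflops * 1e9)


def matmul_costs(n, bytes_per_word=4):
    """Return (bus time, compute time) for C = A * B with n x n matrices."""
    n_bytes = 3 * n * n * bytes_per_word  # A and B onto the card, C back
    n_flops = 2 * n ** 3                  # n^3 multiplies and n^3 adds
    return transfer_seconds(n_bytes), compute_seconds(n_flops)


for n in (64, 1024, 4096):
    t_bus, t_gpu = matmul_costs(n)
    dominant = "transport" if t_bus > t_gpu else "computation"
    print(f"n = {n:5d}: bus {t_bus:.2e} s, compute {t_gpu:.2e} s -> {dominant} dominates")
```

In this model the transport time grows as n² while the work grows as n³, so small problems are dominated by the bus and only sufficiently large ones keep the card busy. Operations with less computation per transported word fare worse.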

 

NVIDIA

NVIDIA is the other big player in the GPU field with regard to HPC. Its latest product, the C1060, is available as an individual card, but it is also possible to have 4 of these cards in a 1U rack enclosure, with four times the aggregate peak performance. Such rack-mounted systems are primarily made with the HPC community in mind. Again, we do not have enough information to provide a reliable block diagram, but the most important details are given below:

 

Table 2.2: Some specifications for the NVIDIA C1060 GPU.

Number of processors        240
Memory (GDDR3)              4 GB
Clock frequency             1.3 GHz
Peak performance            936 Gflop/s
Power requirement           225 W peak; 160 W typical
Interconnect (PCIe Gen2)    8×, 4 GB/s; 16×, 8 GB/s
Floating-point support      Partial (32/64-bit)

From these specifications it can be derived that 3 floating-point results per core per cycle can be delivered. Because of the scant information on the core structure it is not clear how this comes about. The power requirement given may not be entirely appropriate for HPC workloads. A large proportion of the work done will come from the BLAS library provided by NVIDIA, more specifically from its dense matrix-matrix multiplication. This operation keeps every computational core fully occupied, so one may expect a somewhat higher power consumption than what is considered typical for other kinds of work.
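The numbers behind these two remarks can be checked from Table 2.2. The sketch below first recovers the 3 results per core per cycle and then shows how the choice between the typical and the peak power figure changes the Gflop/s per watt. It assumes, purely for illustration, that the peak floating-point rate is sustained; the names are chosen here and are not NVIDIA's.

```python
# Values taken from Table 2.2 (NVIDIA C1060).
PEAK_GFLOPS = 936.0
N_CORES = 240
CLOCK_GHZ = 1.3
TYPICAL_WATTS = 160.0
PEAK_WATTS = 225.0

# Results per core per cycle implied by the quoted peak performance.
results_per_cycle = PEAK_GFLOPS / (N_CORES * CLOCK_GHZ)


def gflops_per_watt(gflops, watts):
    """Floating-point rate delivered per watt of power drawn."""
    return gflops / watts


at_typical_power = gflops_per_watt(PEAK_GFLOPS, TYPICAL_WATTS)
at_peak_power = gflops_per_watt(PEAK_GFLOPS, PEAK_WATTS)

print(f"results per core per cycle: {results_per_cycle:.2f}")
print(f"Gflop/s per watt at typical power: {at_typical_power:.2f}")
print(f"Gflop/s per watt at peak power:    {at_peak_power:.2f}")
```

If a compute-bound workload such as dense matrix-matrix multiplication draws closer to the peak than to the typical power, the efficiency figure that applies is the lower of the two.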

Like ATI, NVIDIA provides an SDK comprising a compiler named CUDA, libraries that include BLAS and FFT routines, and a runtime system that accommodates both Linux (RedHat and SuSE) and Windows. CUDA is a C/C++-like language with extensions and primitives that cause operations to be executed on the card instead of on the CPU core that initiates them. Transport to and from the card is done via library routines, and many threads can be initiated and placed in appropriate positions in the card memory so as not to cause memory congestion on the card. This means that for good performance one needs knowledge of the memory structure of the card in order to exploit it. This is not unique to the C1060 GPU; it pertains to the ATI Firestream GPU and other accelerators as well.
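Why the placement of data matters can be illustrated with a toy model. The sketch below is not a description of the actual memory system of the C1060: it simply assumes that memory is served in fixed-size segments and counts how many distinct segments a group of threads touches when thread i reads word i × stride. The group size of 16 threads and the segment size of 64 bytes are assumed parameters, and the function name is hypothetical.

```python
def segments_touched(n_threads, stride_words, word_bytes=4, segment_bytes=64):
    """Count distinct memory segments hit when thread i reads word i * stride_words.

    Toy model only: segment_bytes and the thread group size are assumptions.
    """
    addresses = (i * stride_words * word_bytes for i in range(n_threads))
    return len({address // segment_bytes for address in addresses})


contiguous = segments_touched(n_threads=16, stride_words=1)
strided = segments_touched(n_threads=16, stride_words=16)

print(f"contiguous access: {contiguous} segment(s)")
print(f"stride-16 access:  {strided} segment(s)")
```

Under these assumptions the contiguous pattern is served from a single segment, while the strided pattern needs one segment per thread, which is the kind of congestion that a suitable data layout is meant to avoid.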