ATI/AMD

The latest product from ATI (now wholly owned by AMD) is the ATI Firestream 9170 card. There is not enough information available for a block diagram, but we list the most important features of the processor:
Table 2.1: Some specifications for the ATI/AMD Firestream 9170 GPU.
It is expected that its successor, the Firestream 9250, will come out in the third quarter of 2008. This card will have roughly double the performance of the Firestream 9170 and will presumably use at most 150 W. The specifications given indicate that each core can generate 2 floating-point results per cycle, presumably the result of an add and a multiply operation. Whether these results can be produced independently or result from linked operations is not known because of the lack of information.

Like its direct competitor, NVIDIA, ATI offers a C-like language, BROOK+, and the accompanying runtime system to ease the use of the card. The SDK containing these products is free and can be installed in both Linux (RedHat and SuSE) and Windows environments. Objects that have to be handled by the card are declared in a special syntax, and there are library functions to put the data onto the card and retrieve results from it. Functions that should be performed on the card are called "kernels". They typically operate on the stream objects defined as such in the BROOK+ program. Although this looks simple, to get optimum performance one should carefully balance the amount of computation against the data transport: however fast the PCIe bus that transports the data to and from the GPU might be, shipping the data on and off the card still takes a significant amount of time.

BROOK+ is as yet rather restricted in its functionality. To help out in situations that are not covered by BROOK+, an assembly language, CAL, can be used. This is, however, far from easy.
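The balance between computation and data transport described above can be illustrated with a simple timing model. The sketch below is not part of BROOK+ or any vendor SDK. The function names and all numerical values (bus bandwidth, sustained compute rate, matrix order) are hypothetical and chosen only to show how the transfer time can dominate when little work is done per byte shipped.

```python
def offload_time(bytes_moved, flops, bus_bytes_per_s, gpu_flops_per_s):
    """Return (transfer_seconds, compute_seconds) for one offloaded kernel.

    Simplifying assumption: transfer and computation do not overlap.
    """
    transfer = bytes_moved / bus_bytes_per_s
    compute = flops / gpu_flops_per_s
    return transfer, compute


def transfer_fraction(bytes_moved, flops, bus_bytes_per_s, gpu_flops_per_s):
    """Fraction of the total offload time that is spent on the bus."""
    transfer, compute = offload_time(
        bytes_moved, flops, bus_bytes_per_s, gpu_flops_per_s
    )
    return transfer / (transfer + compute)


# Hypothetical figures, for illustration only (not taken from any data sheet).
BUS = 4.0e9    # assumed sustained bus bandwidth, bytes/s
GPU = 200.0e9  # assumed sustained compute rate on the card, flop/s
N = 2048       # assumed matrix order; elements are 4-byte single precision

# Both kernels ship two N x N inputs to the card and one N x N result back.
bytes_moved = 3 * N * N * 4

# Element-wise matrix addition: one flop per output element.
add_fraction = transfer_fraction(bytes_moved, N * N, BUS, GPU)

# Dense matrix-matrix multiplication: about 2*N^3 flops.
mm_fraction = transfer_fraction(bytes_moved, 2 * N**3, BUS, GPU)

print(f"addition:       {add_fraction:.1%} of the time spent on transfer")
print(f"multiplication: {mm_fraction:.1%} of the time spent on transfer")
```

With these assumed numbers the element-wise addition spends nearly all of its time on the bus, whereas the matrix multiplication is dominated by computation. This is why kernels that perform many operations per transferred byte benefit most from the card.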
NVIDIA

NVIDIA is the other big player in the GPU field with regard to HPC. Its latest product is the C1060, available as an individual card, but it is also possible to have 4 of these cards in a 1U rack enclosure, obviously with four times the performance. Such rack-mounted systems are primarily made with the HPC community in mind. Again, we do not have enough information to provide a reliable block diagram, but the most important details are given below:
Table 2.2: Some specifications for the NVIDIA C1060 GPU.
From these specifications it can be derived that 3 floating-point results per core per cycle can be delivered. Because of the scant information on the core structure it is not clear how this comes about. The power requirement given may not be entirely appropriate for HPC workloads. A large proportion of the work being done will come from the BLAS library that is provided by NVIDIA, more specifically from the dense matrix-matrix multiplication in it. This operation occupies any computational core to the full, and one may expect a somewhat higher power consumption than what is considered typical for other kinds of work.

Like ATI, NVIDIA provides an SDK comprising a compiler named CUDA, libraries that include BLAS and FFT routines, and a runtime system that accommodates both Linux (RedHat and SuSE) and Windows. CUDA is a C/C++-like language with extensions and primitives that cause operations to be executed on the card instead of on the CPU core that initiates the operations. Transport to and from the card is done via library routines, and many threads can be initiated and placed in appropriate positions in the card memory so as not to cause memory congestion on the card. This means that for good performance one needs knowledge of the memory structure on the card to exploit it accordingly. This is not unique to the C1060 GPU; it pertains to the ATI Firestream GPU and other accelerators as well.
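The per-core figures quoted in this section (2 results per cycle for the Firestream 9170, 3 for the C1060) follow from dividing the peak performance by the product of the number of cores and the clock frequency. A minimal sketch of that arithmetic is given below. The function name is hypothetical, and the input values are round placeholders rather than the entries of Tables 2.1 and 2.2, which should be substituted to reproduce the actual figures.

```python
def results_per_core_per_cycle(peak_flops_per_s, n_cores, clock_hz):
    """Floating-point results delivered by one core in one clock cycle."""
    return peak_flops_per_s / (n_cores * clock_hz)


# Placeholder inputs (assumed for illustration, NOT taken from the tables):
# a card with 200 cores running at 1 GHz.
N_CORES = 200
CLOCK_HZ = 1.0e9

# A peak of 400 Gflop/s on such a card corresponds to 2 results per cycle,
# for example one add and one multiply.
two_per_cycle = results_per_core_per_cycle(400.0e9, N_CORES, CLOCK_HZ)

# A peak of 600 Gflop/s on the same card corresponds to 3 results per cycle.
three_per_cycle = results_per_core_per_cycle(600.0e9, N_CORES, CLOCK_HZ)

print(two_per_cycle, three_per_cycle)
```

The same relation can be turned around: given the number of results per cycle, the core count, and the clock frequency, it yields the theoretical peak performance of the card.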