HPC Architecture
  1. Shared-memory SIMD machines
  2. Distributed-memory SIMD machines
  3. Shared-memory MIMD machines
  4. Distributed-memory MIMD machines
  5. ccNUMA machines
  6. Clusters
  7. Processors
    1. AMD Opteron
    2. IBM POWER7
    3. IBM BlueGene/Q processor
    4. Intel Xeon
    5. The SPARC processors
  8. Accelerators
    1. GPU accelerators
      1. ATI/AMD
      2. nVIDIA
    2. General computational accelerators
      1. Intel Xeon Phi
    3. FPGA accelerators
      1. Convey
      2. Kuberre
      3. SRC
  9. Interconnects
    1. Infiniband
Available systems
  • The Bull bullx system
  • The Cray XC30
  • The Cray XE6
  • The Cray XK7
  • The Eurotech Aurora
  • The Fujitsu FX10
  • The Hitachi SR16000
  • The IBM BlueGene/Q
  • The IBM eServer p775
  • The NEC SX-9
  • The SGI Altix UV series
  • Systems disappeared from the list
  • Systems under development

    NVIDIA is the other big player in the GPU field with regard to HPC. Its latest product is the Tesla K series, code-named Kepler, which came out at the end of 2012; a successor may be expected in 2014. Of the K20 series we only discuss the fastest one, the K20X. A simplified block diagram is shown in Figure 20.



    Figure 20: Simplified block diagram of the NVIDIA Tesla K20X GPU.

    The GigaThread Engine is able to schedule different tasks on the Streaming Multiprocessors (SMXs) in parallel. This greatly improves the occupation rate of the SMXs and thus the throughput. As shown in Figure 20, 15 SMXs are present on the die, of which 14 are active in the K20X.
    Each SMX in turn harbours 192 cores that used to be named Streaming Processors (SPs) but are now called CUDA cores by NVIDIA. A diagram of an SMX with some internals is given in Figure 21. Via the instruction cache and 4 warp schedulers (a warp is a bundle of 32 threads) the program threads are pushed onto the cores. In addition, each SMX has 32 Special Function Units (SFUs) that take care of the evaluation of functions, such as trigonometric functions, that are more complicated than can profitably be computed by the simple floating-point units in the cores.



    Figure 21: Diagram of a Streaming Processor of the Tesla K20X.
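
    The SMX and core counts above can be cross-checked with a little arithmetic. The following sketch assumes the K20X configuration of 14 active SMXs (the GK110 die carries 15) together with the 192 cores per SMX and the warp size of 32 given in the text:

    ```python
    # Back-of-the-envelope check of the K20X core count. The SMX count of
    # 14 (active out of 15 on the die) is an assumption about the K20X
    # configuration; cores per SMX and warp size are as given in the text.
    SMX_COUNT = 14
    CORES_PER_SMX = 192
    WARP_SIZE = 32

    total_cores = SMX_COUNT * CORES_PER_SMX
    warps_per_smx = CORES_PER_SMX // WARP_SIZE  # warps the cores can hold at once

    print(total_cores)      # 2688, the core count quoted for the K20X
    print(warps_per_smx)    # 6
    ```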

    Before we discuss some new features of the K20X that cannot be expressed in the diagrams, we list some properties of the Tesla K20X in Table 2.2.

    Table 2.2: Some specifications for the NVIDIA Tesla K20X
    Number of cores             2688
    Memory (GDDR5)              6 GB
    Internal bandwidth          250 GB/s
    Clock cycle                 732 MHz
    Peak performance (32-bit)   3.94 Tflop/s
    Peak performance (64-bit)   1.31 Tflop/s
    Power requirement (peak)    ≤ 235 W
    Interconnect (PCIe Gen2)    16×, 8 GB/s
    Error correction            Yes
    Floating-point support      Full (32/64-bit)

    As can be seen from the table, the 64-bit performance is one-third of the 32-bit performance, in accordance with the fact that there is one DP unit for every three CUDA cores. Another notable item in the table is that the interconnection with the host is still based on PCIe Gen2, where one would expect it to be Gen3 as was originally planned. Apparently NVIDIA was not able to make it work with the PCIe Gen3 port on Intel's latest chips and has therefore fallen back to Gen2. The peak power requirement given will probably be an appropriate measure for HPC workloads: a large proportion of the work being done will come from the BLAS library provided by NVIDIA, more specifically the dense matrix-matrix multiplication in it. This operation occupies every computational core to the full and will therefore consume close to the maximum of the power.
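
    The peak-performance figures follow from simple arithmetic, assuming each core (or DP unit) completes one fused multiply-add, i.e. two flops, per cycle; the figures of 64 DP units per SMX and 14 active SMXs are assumptions about the GK110 configuration in the K20X:

    ```python
    # Peak-performance arithmetic for the K20X (a sketch; 64 DP units per
    # SMX and 14 active SMXs are assumptions about the GK110 configuration).
    CLOCK_HZ = 732e6
    CORES = 2688
    DP_UNITS = 14 * 64                     # 896 double-precision units

    sp_peak = CORES * CLOCK_HZ * 2         # one FMA = 2 flops per cycle
    dp_peak = DP_UNITS * CLOCK_HZ * 2

    print(round(sp_peak / 1e12, 2))        # 3.94 (Tflop/s, 32-bit)
    print(round(dp_peak / 1e12, 2))        # 1.31 (Tflop/s, 64-bit)
    print(dp_peak / sp_peak)               # one-third: one DP unit per 3 cores
    ```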
    The K20X supports some significant improvements over its predecessors that are especially of interest for HPC. One of these is what NVIDIA calls Hyper-Q, which allows 32 MPI tasks to run simultaneously on the GPU instead of just one. Apart from effectively de-serialising MPI tasks in this way, it also allows for a better utilisation of the GPU. Another MPI-related feature is GPU Direct, which enables MPI data exchange between GPUs without involving the host CPU. This not only decreases the overhead of the CPU acknowledgment, it also omits the extra copies to the CPUs that host the GPUs, which leads to a significant acceleration of the data exchange.
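
    The utilisation gain from Hyper-Q can be illustrated with a toy model. The occupancy figure below, each MPI task's kernel filling only two SMXs, is purely hypothetical:

    ```python
    # Toy model of Hyper-Q (hypothetical numbers): if each MPI task's kernel
    # only fills 2 of the 14 SMXs, running tasks one at a time leaves the
    # GPU mostly idle, while up to 32 concurrent tasks can fill it.
    SMX = 14
    SMX_PER_TASK = 2               # hypothetical footprint of one task's kernel
    MAX_CONCURRENT_TASKS = 32      # Hyper-Q limit

    serial_util = SMX_PER_TASK / SMX
    hyperq_util = min(MAX_CONCURRENT_TASKS * SMX_PER_TASK, SMX) / SMX

    print(f"{serial_util:.2f}")    # 0.14: one task at a time
    print(f"{hyperq_util:.2f}")    # 1.00: concurrent tasks fill the GPU
    ```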
    Perhaps the most interesting enhancement is the support of dynamic parallelism. This means that the GPU is able to initiate compute kernels independently from the host CPU. Where formerly each kernel had to be started by the host, together with the corresponding data transfer associated with this kernel, with the dynamic parallelism feature the kernels initiated within the GPU already have their data available on the GPU. This cuts back on the data traffic between the GPU and the host, the most severe bottleneck in CPU-GPU computation. Like ATI, NVIDIA provides an SDK comprising a compiler for CUDA, libraries that include BLAS and FFT routines, and a runtime system that accommodates both Linux (RedHat and SuSE) and Windows. CUDA is a C/C++-like language with extensions and primitives that cause operations to be executed on the card instead of on the CPU core that initiates them. Transport to and from the card is done via library routines, and many threads can be initiated and placed in appropriate positions in the card memory so as not to cause memory congestion on the card. This means that for good performance one needs knowledge of the memory structure on the card in order to exploit it. This is not unique to the K20X GPU; it pertains to the ATI FireStream GPU and other accelerators as well.
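
    How severe this bottleneck is can be seen from a rough model (a sketch with nominal numbers, not a benchmark): compare the time to move the operands of an n×n double-precision matrix multiplication over an 8 GB/s PCIe Gen2 link with the time to compute it at roughly 1.3 Tflop/s:

    ```python
    # Rough model of the host-GPU bottleneck (a sketch, not a benchmark):
    # transfer time over PCIe versus compute time for C = A * B with
    # n-by-n double-precision matrices.
    PCIE_BW = 8e9        # bytes/s, 16-lane PCIe Gen2 link
    DP_PEAK = 1.31e12    # flop/s, approximate 64-bit peak of the K20X

    def transfer_over_compute(n):
        """Ratio of PCIe transfer time to compute time for one matrix multiply."""
        transfer = 3 * 8 * n * n / PCIE_BW  # A, B in and C out, 8 bytes per word
        compute = 2 * n ** 3 / DP_PEAK      # 2 n^3 flops for a matrix multiply
        return transfer / compute

    print(transfer_over_compute(1000) > 1)    # True: transfer dominates
    print(transfer_over_compute(10000) < 1)   # True: big matrices amortise it
    ```

    Only for rather large matrices does the computation outweigh the transfer, which is why keeping data resident on the card, as dynamic parallelism makes easier, matters so much.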
    NVIDIA also supports OpenCL, though CUDA is at present much more popular among developers. For Windows users NVIDIA Parallel Nsight for Visual Studio is available, which should ease the optimisation of the program parts that run on the cards.