NVIDIA


NVIDIA is the other big player in the GPU field with regard to HPC. Its latest product is the Tesla K series, code name Kepler, which came out at the end of 2012; a successor may be expected in 2014. Of the K20 series we only discuss the fastest one, the K20X. A simplified block diagram is shown in Figure 20.

Figure 20: Simplified block diagram of the NVIDIA Tesla K20X GPU.

The GigaThread Engine is able to schedule different tasks on the Streaming Multiprocessors (SMXs) in parallel. This greatly improves the occupation rate of the SMXs and thus the throughput. As shown in Figure 20, 15 SMXs are present on the chip; in the K20X, 14 of these are enabled.
Each SMX in turn harbours 192 cores that used to be named Streaming Processors (SPs) but are now called CUDA cores by NVIDIA. A diagram of an SMX with some internals is given in Figure 21. Via the instruction cache and four warp schedulers (a warp is a bundle of 32 threads), the program threads are pushed onto the cores. In addition, each SMX has 32 Special Function Units (SFUs) that take care of the evaluation of functions, like trigonometric functions, that are too complicated to be computed profitably by the simple floating-point units in the cores.

Figure 21: Diagram of a Streaming Multiprocessor (SMX) of the Tesla K20X.
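
To make this execution model concrete, the following is a minimal CUDA sketch of our own (not taken from NVIDIA material): each block of 256 threads is split into warps of 32 that the warp schedulers issue to the CUDA cores, while the GigaThread Engine distributes the blocks over the SMXs.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each thread handles one element; the 256 threads of a block form
    // 8 warps of 32 threads that execute in lockstep on the CUDA cores.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

        float *dx, *dy;                     // buffers in the card's GDDR5 memory
        cudaMalloc(&dx, bytes);
        cudaMalloc(&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        // 4096 blocks of 256 threads; the GigaThread Engine schedules
        // the blocks over the available SMXs.
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

        printf("y[0] = %f\n", hy[0]);       // expect 4.0
        cudaFree(dx); cudaFree(dy);
        free(hx); free(hy);
        return 0;
    }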

Before we discuss some new features of the K20X that cannot be expressed in the diagrams, we list some properties of the Tesla K20X in Table 2.2.

Table 2.2: Some specifications for the NVIDIA Tesla K20X

Number of cores:            2688
Memory (GDDR5):             6 GB
Internal bandwidth:         250 GB/s
Clock cycle:                732 MHz
Peak performance (32-bit):  3.95 Tflop/s
Peak performance (64-bit):  1.31 Tflop/s
Power requirement (peak):   ≤ 235 W
Interconnect (PCIe Gen2):   16×, 8 GB/s
Error correction:           Yes
Floating-point support:     Full (32/64-bit)

As can be seen from the table, the 64-bit performance is one-third of the 32-bit performance, in accordance with the fact that there is one DP unit for every three CUDA cores: 2688 cores × 2 flops/cycle × 732 MHz ≈ 3.95 Tflop/s in single precision, and 2688/3 DP units × 2 flops/cycle × 732 MHz ≈ 1.31 Tflop/s in double precision. Another notable item in the table is that the interconnection with the host is still based on PCIe Gen2, where one would expect it to be Gen3, as was originally planned. Apparently NVIDIA was not able to make it work with the PCIe Gen3 port on Intel's latest chips and has therefore fallen back to Gen2. The peak power requirement given will probably be an appropriate measure for HPC workloads: a large proportion of the work will come from the BLAS library provided by NVIDIA, more specifically from the dense matrix-matrix multiplication in it. This operation occupies every computational core to the full and will therefore consume close to the maximum of the power.
The K20X supports some significant improvements over its predecessors that are of special interest for HPC. One of these is what NVIDIA calls Hyper-Q, which allows 32 MPI tasks to run simultaneously on the GPU instead of just one. Apart from effectively de-serialising MPI tasks in this way, it also allows for a better utilisation of the GPU. Another MPI-related feature is GPU Direct, which enables MPI data exchange between GPUs without involving the host CPU. Not only does this decrease the overhead on the host CPUs, it also omits the extra copies through the memory of the CPUs that host the GPUs, which leads to a significant acceleration of the data exchange.
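
To make GPU Direct concrete, the following is a minimal sketch of our own, assuming a CUDA-aware MPI library (e.g., MVAPICH2 or Open MPI built with GPU Direct support) so that device pointers can be passed to the MPI calls directly:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double *d_buf;                          // buffer in GPU memory
        cudaMalloc(&d_buf, n * sizeof(double));

        // A CUDA-aware MPI accepts the device pointer as-is; with GPU
        // Direct the data moves between the GPUs without being staged
        // through the memory of the hosting CPUs.
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }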
Perhaps the most interesting enhancement is the support of dynamic parallelism. This means that the GPU is able to initiate compute kernels independently from the host CPU. Where formerly each kernel had to be started by the host, together with the data transfer associated with that kernel, with dynamic parallelism the kernels initiated within the GPU already have their data available on the GPU. This cuts back on the data traffic between the GPU and the host, the most severe bottleneck in CPU-GPU computation.

Like ATI, NVIDIA provides an SDK that comprises a compiler for its CUDA language, libraries that include BLAS and FFT routines, and a runtime system that accommodates both Linux (RedHat and SuSE) and Windows. CUDA is a C/C++-like language with extensions and primitives that cause operations to be executed on the card instead of on the CPU core that initiates them. Transport to and from the card is done via library routines, and many threads can be initiated and placed in appropriate positions in the card memory so as not to cause memory congestion on the card. This means that for good performance one needs knowledge of the memory structure of the card in order to exploit it accordingly. This is not unique to the K20X GPU; it pertains to the ATI FireStream GPUs and other accelerators as well.
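
The following is a minimal sketch of dynamic parallelism, again our own example rather than NVIDIA's; it requires a device of compute capability 3.5, like the K20X, and compilation with nvcc -arch=sm_35 -rdc=true (linking against the device runtime library):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void child(int parent_block)
    {
        printf("child kernel launched from block %d\n", parent_block);
    }

    // With dynamic parallelism a kernel can launch further kernels by
    // itself; the data a child works on is already resident on the GPU,
    // so no host round-trip is needed.
    __global__ void parent()
    {
        if (threadIdx.x == 0)
            child<<<1, 4>>>(blockIdx.x);
    }

    int main()
    {
        parent<<<2, 32>>>();       // only the first launch comes from the host
        cudaDeviceSynchronize();
        return 0;
    }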
NVIDIA also supports OpenCL, though CUDA is at present much more popular among developers. For Windows users NVIDIA Parallel Nsight for Visual Studio is available, which should ease the optimisation of the program parts that run on the cards.