The Xeon Phi

In November 2012 Intel presented its first official many-core product, the Xeon Phi. Depending on the model it contains 60 or 61 cores. We here discuss the fastest variant, the 7110P model, which runs at a clock frequency of 1.1 GHz. The Xeon Phi can be regarded as Intel's answer primarily to the GPU-based accelerators. A distinct advantage for users is that no different programming model is required when the host processor is of the x86 type. A disadvantage, shared with all other accelerators, is that the Xeon Phi has its own (GDDR5) memory, and data to be used or produced by the accelerator have to be transported to/from the accelerator via PCIe Gen3 16$\times$, i.e., at 16 GB/s.
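To put the 16 GB/s into perspective (a simple estimate that uses only the figure above and ignores PCIe latency and protocol overhead): transferring 1 GB of data to or from the card takes at least $1\ \mathrm{GB} / 16\ \mathrm{GB/s} \approx 62.5\ \mathrm{ms}$, so offloaded kernels that do little work per transferred byte are easily dominated by the transfer time.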
In Figure 22 a rough diagram of the Xeon Phi is given, as it proved very difficult to obtain sufficient details: for one thing, all Intel documentation refers to a "high-speed" ring to transport data and instructions between the cores, but nowhere is a bandwidth stated.

     

Figure 22: Block diagram of an Intel Xeon Phi processor.

One can get an idea of this bandwidth knowing that the two counter-rotating rings (as in the Xeon Ivy Bridge) are 512 bits wide in each direction and that the latency between any two neighbouring connection points on the ring is 1 clock cycle. With a clock frequency of 1.1 GHz the bandwidth in each direction should therefore be of the order of 70 GB/s. This matches nicely with the peak memory bandwidth of 352 GB/s to/from the 8 GB of GDDR5 memory. Apart from the rings that transport data, there are rings for address fetching and for cache coherency of the L2 caches that belong to the cores. The tag directories associated with the cores hold the addresses and their validity state. The addresses are spread evenly over the tag directories to avoid bottlenecks in fetching them. Also the memory controllers that give access to the memory are interspersed with the cores to provide smooth access to the data.
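As a check on the 70 GB/s figure quoted above (a back-of-the-envelope estimate using only the ring width and the clock frequency):

$$\frac{512\ \mathrm{bit}}{8\ \mathrm{bit/B}} \times 1.1\ \mathrm{GHz} = 64\ \mathrm{B/cycle} \times 1.1\times 10^{9}\ \mathrm{cycle/s} \approx 70.4\ \mathrm{GB/s}$$

per direction.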
The cores in the Xeon Phi are derived from the Intel Pentium P54C, but with many enhancements that should give the processor the peak performance of over 1 Tflop/s stated by Intel.

Figure 23: Block diagram of an Intel Xeon Phi processor core.

As can be seen from Figure 23, each core contains a 512-bit wide vector unit capable of yielding 8 64-bit or 16 32-bit floating-point results per cycle. As the vector unit supports fused multiply-add operations, effectively 16 64-bit or 32 32-bit floating-point operations per cycle are possible. The 512-bit vector instruction set executed in the Vector Processing Unit (VPU) includes mask operations, which help in executing loops that contain conditionals. In addition, there are scatter/gather operations to deal with loop strides larger than 1, and an extended math unit that can provide transcendental function results.
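Combining this per-cycle throughput with the core count and clock frequency quoted earlier (assuming the 61-core variant and fused multiply-add throughput on every core) reproduces the peak performance claimed by Intel:

$$61\ \mathrm{cores} \times 16\ \mathrm{flop/cycle} \times 1.1\ \mathrm{GHz} \approx 1074\ \mathrm{Gflop/s} \approx 1.07\ \mathrm{Tflop/s}$$

in 64-bit precision.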
Instructions in the core are executed in order, which greatly simplifies the core logic. However, 4-way multi-threading is supported (as suggested in the upper-left corner of Figure 23). This helps in sustaining a good number of instructions per cycle even without out-of-order execution. Instruction pipe 0 drives the VPU as well as the scalar floating-point part of the code, while pipe 1 only drives the scalar integer processing in ALU0 and ALU1. According to Intel, the x86-specific logic and the associated L2 area occupy less than 2% of the die area, while the core is able to execute the complete range of x86 instructions (albeit not always very efficiently).
As remarked before, one can use the same SIMD pragmas/directives as are used for AVX instructions on x86 CPUs, but there are also explicit offload pragmas/directives that cause a part of the code to be executed on the Phi processor after transporting the associated data.
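A minimal sketch of such an explicit offload, using the offload pragma of the Intel C/C++ compiler; the array names, sizes, and the simple scaling kernel are purely illustrative:

    #include <stdio.h>

    #define N 1000000

    /* Statically sized arrays: for these the offload pragma can copy
       the whole array without explicit length() clauses. */
    static float a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = (float)i;

        /* Run the enclosed block on the first Xeon Phi card: a is copied
           to the card before execution, b is copied back afterwards. */
        #pragma offload target(mic:0) in(a) out(b)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                b[i] = 2.0f * a[i] + 1.0f;
        }

        printf("b[10] = %f\n", b[10]);
        return 0;
    }

Note that the loop itself is an ordinary OpenMP loop; only the offload pragma and its data clauses are Phi-specific, which illustrates the point that the programming model on the card is essentially the same as on the x86 host.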