The Xeon Phi

In November 2012 Intel presented its first official many-core product, the Xeon Phi. Depending on the model it contains 60 or 61 cores. Here we discuss the fastest variant, the 7110P model, which runs at a clock frequency of 1.1 GHz. The Xeon Phi can be regarded primarily as Intel's answer to the GPU-based accelerators. A distinct advantage for users is that no different programming model is required when the host processor is of the x86 type. A disadvantage, shared with all other accelerators, is that the Xeon Phi has its own (GDDR5) memory, so data to be used or produced by the accelerator have to be transported to/from the accelerator via PCIe Gen3 $\times$16, i.e., at 16 GB/s.
Figure 22 gives a diagram of the Xeon Phi. It is only a rough one because it proved very difficult to get sufficient details: for one thing, all Intel documentation refers to a "high-speed" ring to transport data and instructions between the cores, but nowhere is a bandwidth stated.

Figure 22: Block diagram of an Intel Xeon Phi processor.

One can get an idea of this bandwidth from the fact that the two counter-rotating rings (as in the Xeon Sandy Bridge, see \ref{s:sandybridge}) are 512 bits wide in each direction and that the latency between any two neighbouring connection points on the ring is 1 clock cycle. With a clock frequency of 1.1 GHz the bandwidth in each direction should therefore be on the order of 70 GB/s. This matches nicely with the peak memory bandwidth of 352 GB/s to/from the 8 GB of GDDR5 memory. Apart from the rings that transport data there are rings for address fetching and for cache coherency of the L2 caches that belong to the cores. The tag directories associated with the cores hold the addresses and their validity state. The addresses are spread evenly over the tag directories to avoid bottlenecks in fetching them. In addition, the memory controllers that give access to the memory are interspersed with the cores to give smooth access to the data.
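The ring bandwidth estimate above can be written out explicitly. This is only a sketch: it assumes (Intel documentation does not confirm this) that each ring direction moves one full 512-bit word per clock cycle.

$$
B_{\mathrm{ring}} \approx \frac{512\ \mathrm{bit}}{8\ \mathrm{bit/byte}} \times 1.1 \times 10^{9}\ \mathrm{s}^{-1}
= 64\ \mathrm{byte} \times 1.1 \times 10^{9}\ \mathrm{s}^{-1}
= 70.4\ \mathrm{GB/s}\ \text{per direction}
$$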
The cores in the Xeon Phi are derived from the Intel Pentium P54C but with many enhancements that should give it the peak performance of over 1 Tflop/s that is stated by Intel.

Figure 23: Block diagram of an Intel Xeon Phi processor core.

As can be seen from Figure 23 the core contains a 512-bit wide vector unit capable of yielding 8 64-bit or 16 32-bit floating-point results per cycle. As the vector unit supports fused multiply-add operations, 16 64-bit operations or 32 32-bit operations per cycle may actually be possible. The 512-bit vector instruction set executed in the Vector Processing Unit (VPU), which resembles AVX but is not binary compatible with it, includes mask operations that help in executing loops with conditionals in them. In addition there are scatter-gather operations to deal with loop strides larger than 1 and an extended math unit that can provide transcendental function results.
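The peak performance of over 1 Tflop/s stated by Intel can be made plausible from these per-cycle figures. As a rough sketch, assume that every core completes one fused multiply-add vector instruction per cycle at 1.1 GHz, and write $N_c$ for the number of cores (60 or 61, depending on the model; the symbol is introduced here only for illustration):

$$
P_{64} = N_c \times 1.1\ \mathrm{GHz} \times (8 \times 2)\ \mathrm{flop/cycle} = 17.6\,N_c\ \mathrm{Gflop/s},
\qquad
P_{32} = N_c \times 1.1\ \mathrm{GHz} \times (16 \times 2)\ \mathrm{flop/cycle} = 35.2\,N_c\ \mathrm{Gflop/s}
$$

For $N_c = 60$ this gives 1056 Gflop/s in 64-bit and 2112 Gflop/s in 32-bit precision; for $N_c = 61$ the values are 1073.6 Gflop/s and 2147.2 Gflop/s, respectively.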
Instructions in the core are executed in-order, which greatly simplifies the core logic. However, 4-way multi-threading is supported (as suggested in the upper-left corner of Figure 23). This helps to sustain a good number of instructions per cycle even without out-of-order execution. Instruction pipe 0 drives the VPU as well as the scalar floating-point part of the code, while pipe 1 only drives the scalar integer processing in ALU0 and ALU1. According to Intel the x86-specific logic and the associated L2 area occupy less than 2% of the die area, while the core is able to execute the complete range of x86 instructions (albeit not always very efficiently).
As remarked before, one can use the same SIMD pragmas/directives as are used for AVX instructions on x86 CPUs, but there are also some explicit offload pragmas/directives that cause a part of the code to be executed on the Phi processor after the associated data have been transported to it.
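To illustrate how the two kinds of directives might be combined, the following minimal sketch places a SIMD loop with a conditional (a candidate for the mask operations of the VPU) inside an offload region. The function `masked_axpy` and its arguments are hypothetical names chosen for this example. The offload directive follows the syntax of Intel's compiler, and `#pragma omp simd` is the OpenMP 4.0 form; it is an assumption that these are the directives meant here. Compilers that do not recognize a directive ignore it, in which case the loop simply runs on the host.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i], but only where x[i] > 0. */
void masked_axpy(float a, const float *x, float *y, size_t n)
{
    /* Intel-compiler offload directive (assumed syntax of the Intel
       language extensions for offload): copy x to the coprocessor,
       copy y in both directions, and run the block on the device.
       Other compilers ignore this pragma and run the block on the host. */
    #pragma offload target(mic) in(x : length(n)) inout(y : length(n))
    {
        /* Ask the compiler to vectorize the loop; the conditional can be
           handled with masked vector instructions instead of branches. */
        #pragma omp simd
        for (size_t i = 0; i < n; ++i) {
            if (x[i] > 0.0f) {
                y[i] = a * x[i] + y[i];
            }
        }
    }
}
```

The `in` and `inout` clauses make the data transport over PCIe explicit, so offloading only pays off when the work done per transferred byte is large enough.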