Figure 23: Block diagram of an Intel Xeon Phi processor core.
As can be seen from Figure 23, the core contains a
512-bit wide vector unit capable of yielding 8 64-bit or 16 32-bit
floating-point results per cycle. Because the vector unit supports fused
multiply-add operations, each of which counts as two floating-point operations,
up to 16 64-bit or 32 32-bit operations per cycle are possible. The AVX
instruction set executed in the Vector Processing Unit (VPU) includes mask
operations, which help in executing loops that contain conditionals. In
addition, there are scatter-gather operations to deal with loop strides larger
than 1 and an extended math unit that can provide transcendental function
results.
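
To make the figures and loop shapes above concrete, the following is a minimal sketch in C. The function names (lanes, peak_flops_per_cycle, scale_positive, strided_sum) are hypothetical and chosen only for illustration. The sketch assumes a compiler that honours the standard OpenMP "simd" directive; without OpenMP support the pragmas are ignored and the loops run as ordinary scalar code. Whether mask or gather instructions are actually generated depends on the compiler and the target.

```c
#include <stddef.h>

/* Number of vector lanes: vector width divided by element width (in bits). */
unsigned lanes(unsigned vector_bits, unsigned element_bits)
{
    return vector_bits / element_bits;
}

/* Peak floating-point operations per cycle.
   A fused multiply-add counts as two operations per lane. */
unsigned peak_flops_per_cycle(unsigned vector_bits, unsigned element_bits,
                              int has_fma)
{
    return lanes(vector_bits, element_bits) * (has_fma ? 2u : 1u);
}

/* Loop with a conditional: a vectorizing compiler may turn the 'if'
   into a per-lane mask instead of a branch. */
void scale_positive(double *x, size_t n, double factor)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        if (x[i] > 0.0)
            x[i] *= factor;
    }
}

/* Loop with a stride larger than 1: the non-contiguous loads are
   candidates for gather instructions. */
double strided_sum(const double *x, size_t n, size_t stride)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i += stride)
        sum += x[i];
    return sum;
}
```

With a 512-bit vector and 64-bit elements, lanes returns 8 and peak_flops_per_cycle with FMA enabled returns 16, matching the numbers quoted in the text.
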
Instructions in the core are executed in order, which greatly simplifies the
core logic. However, 4-way multi-threading is supported (as suggested in the
upper-left corner of Figure 23). This helps sustain a
good number of instructions per cycle even without out-of-order execution.
Instruction pipe 0 drives the VPU as well as the scalar floating-point part of
the code, while pipe 1 drives only the scalar integer processing in ALU0 and
ALU1. According to Intel, the x86-specific logic and the associated L2 area
occupy less than 2% of the die area, while the core is able to execute the
complete range of x86 instructions (albeit not always very efficiently).
As remarked before, one can use the same SIMD pragmas/directives as are used for
AVX instructions on x86 CPUs, but there are also explicit offload
pragmas/directives that cause a part of the code to be executed on the Phi
processor after the associated data have been transferred to it.
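
The combination of a SIMD directive and an offload directive can be sketched as follows. This is an illustration only: it uses the standard OpenMP "target" and "simd" constructs as a portable stand-in, not the Intel-specific offload pragmas, and the function name saxpy_offload is hypothetical. If no offload device is configured, or OpenMP is disabled, the loop simply runs on the host.

```c
#include <stddef.h>

/* y = a*x + y. The loop body is a natural candidate for
   fused multiply-add instructions in the vector unit. */
void saxpy_offload(float a, const float *x, float *y, size_t n)
{
    /* The map clauses describe the data transferred to the device (x, y)
       and copied back afterwards (y). */
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The caller invokes saxpy_offload like any other function; the directives, not the call site, determine where the loop executes and which arrays are moved.
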