Intel Xeon

The Sandy Bridge processor

The current Xeon for HPC servers is the Sandy Bridge processor. A block diagram of the core is shown in Figure 16.
Figure 16: Block diagram of the Sandy Bridge core.
Notable differences with respect to its predecessor, the Westmere processor, are the doubling of the bandwidth between the L1 data cache and the load buffer, the 1.5 k μ-op cache, and the increase of the scheduler capacity from 32 to 54 operations. Furthermore, the vector units can execute the AVX instruction set on 256-bit wide registers. Unlike the SSE instructions, which are still supported, the AVX instructions do not overwrite their source operands, so these can be reused in subsequent operations. As in the Westmere processor, instructions for the Advanced Encryption Standard (AES) are included that should speed up encryption and decryption tasks. Also, as in the Westmere, the use of 1 GB pages is supported.
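The difference between the overwriting SSE form and the non-overwriting AVX form can be illustrated with a toy register model. This is only a sketch: the register names `r0`, `r1`, `r2` and the helper functions are hypothetical, and plain Python lists of 4 values stand in for vector registers (a 256-bit register holds 4 double-precision values).

```python
# Toy model of vector registers as Python lists; names are illustrative only.

def sse_style_add(regs, dst_src, src):
    """Two-operand form: the first source also receives the result."""
    regs[dst_src] = [a + b for a, b in zip(regs[dst_src], regs[src])]

def avx_style_add(regs, dst, src1, src2):
    """Three-operand form: the result goes to a separate destination."""
    regs[dst] = [a + b for a, b in zip(regs[src1], regs[src2])]

regs = {
    "r0": [1.0, 2.0, 3.0, 4.0],
    "r1": [10.0, 20.0, 30.0, 40.0],
}

# AVX-style: r0 and r1 survive and can be reused afterwards.
avx_style_add(regs, "r2", "r0", "r1")
r0_after_avx = list(regs["r0"])

# SSE-style: r0 is overwritten with the sum.
sse_style_add(regs, "r0", "r1")
r0_after_sse = list(regs["r0"])

print(r0_after_avx, r0_after_sse)
```

With the two-operand form a compiler has to insert an extra register copy whenever the original source value is needed again; the three-operand form makes that copy unnecessary.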
To stay backward compatible with the x86 (IA-32) Instruction Set Architecture, which comprises a CISC instruction set, Intel developed a mode in which these instructions are split into so-called micro-operations (μ-ops) of fixed length that can be handled the way RISC processors handle their instructions. In fact, the μ-ops constitute a RISC operation set. The price to be paid for this much more efficient instruction set is an extra decoding stage. Branch prediction has been improved and a second-level TLB cache has been added.
As in the earlier Core architecture, 4 μ-ops/cycle can be handled, and some macro-instructions as well as some μ-ops can be fused. This results in less instruction handling, easier scheduling, and better instruction throughput, because the fused operations can be executed in a single cycle. As can be seen in Figure 16, the processor cores have an execution trace cache which holds partly decoded instructions of former execution traces that can be drawn upon, thus forgoing the instruction decode phase that might produce holes in the instruction pipeline. The allocator dispatches the decoded instructions, the μ-ops, to the unified reservation station, which can issue up to 6 μ-ops/cycle to the execution units, collectively called the Execution Engine. Up to 128 μ-ops can be in flight at any time. Figure 16 shows that ports 0 and 1 drive two integer ALUs as well as (vector) floating-point instructions. Port 5 only operates on floating-point instructions, while ports 2–4 are dedicated to load/store operations.
The two integer Arithmetic/Logical Units at ports 0 and 1 are kept simple so that they can be run at twice the clock speed. In addition, there is an ALU at port 1 for complex integer operations that cannot be executed within one cycle. The operands for these units are 128 bits long.
A feature that cannot be shown in the figures is that the Sandy Bridge supports multi-threading much in the style of IBM's simultaneous multithreading. Intel calls it Hyperthreading. Hyperthreading was introduced earlier in the Pentium 4 but disappeared in later Intel processors because the performance gain was very low. Now, with a much higher bandwidth and larger caches, speedups of more than 30% have been observed for some codes with Hyperthreading. Another feature that cannot be shown is the so-called Turbo Mode already mentioned above. It means that the clock frequency can be raised from its nominal value (2.9 GHz for the E5-2690) in steps of 100 MHz up to 3.8 GHz, as long as the thermal envelope of the chip is not exceeded. So, when some cores are relatively idle, other cores can take advantage by operating at a higher clock speed.
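The Turbo Mode frequency ladder can be sketched as simple arithmetic. The function below is a hypothetical helper, not an Intel interface. The step size is a parameter, and the call assumes 100 MHz steps, with which the range from 2.9 GHz to 3.8 GHz quoted for the E5-2690 is a whole number of steps. The thermal conditions that decide which step is actually reached are not modelled.

```python
def turbo_frequencies(nominal_mhz, max_mhz, step_mhz):
    """Return the clock frequencies reachable from nominal in whole steps."""
    freqs = []
    f = nominal_mhz
    while f <= max_mhz:
        freqs.append(f)
        f += step_mhz
    return freqs

# Assumed values: nominal 2900 MHz, maximum 3800 MHz, 100 MHz per step.
bins = turbo_frequencies(2900, 3800, 100)
steps_above_nominal = len(bins) - 1

print(bins)
print(steps_above_nominal)
```

Integer MHz values are used on purpose so that the comparison against the maximum frequency is not affected by floating-point rounding.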
The L3 cache is inclusive, which means that it contains all data that are in the L1 and L2 caches. The consequence is that when a data item cannot be found in the L3 cache, it is also not present in any of the caches of the other cores, so these need not be searched.
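The shortcut that an inclusive L3 cache allows can be sketched with sets of cache-line addresses. This is a toy model under stated assumptions: the function names, the addresses, and the representation of each core's private L1/L2 contents as one Python set are all hypothetical and do not reflect how the hardware tracks lines.

```python
def is_inclusive(l3, private_caches):
    """True if every line held in a private cache is also present in L3."""
    return all(lines <= l3 for lines in private_caches.values())

def find_line(address, l3, private_caches):
    """Return (l3_hit, cores_holding_line, private_caches_probed)."""
    if address not in l3:
        # Inclusive L3: a miss here implies a miss in every private cache,
        # so no private cache has to be probed at all.
        return False, [], 0
    holders = sorted(core for core, lines in private_caches.items()
                     if address in lines)
    return True, holders, len(private_caches)

# Hypothetical contents: two cores, three lines in the shared L3.
l3 = {0x100, 0x140, 0x180}
private_caches = {0: {0x100}, 1: {0x100, 0x140}}

inclusive = is_inclusive(l3, private_caches)
miss = find_line(0x200, l3, private_caches)
hit = find_line(0x100, l3, private_caches)

print(inclusive, miss, hit)
```

The saving shows up in the miss case: the third element of the result is 0, meaning that no private cache was consulted.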
In the Sandy Bridge processor the cores are no longer fully interconnected with each other, but are instead connected by two counter-rotating rings, as shown in Figure 17.
Figure 17: Layout of an E5-2600 Sandy Bridge processor.

The red squares in the diagram represent the stations within the rings that inject data into them or draw data from them. The latency between stations is 1 clock cycle. Because data can travel in either direction, the maximum latency for updating an entry in the L3 cache is (8+2)/2 = 5 cycles. Although the parts of the L3 cache are drawn separately in the figure, all cores of course have access to all parts of the cache. The data only have to be transported via the rings to the core(s) that need them.
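The worst-case figure of 5 cycles follows from taking the shorter way around a ring. The sketch below assumes that the (8+2) in the expression above stands for 10 stations on the ring and that each hop costs 1 cycle. The function names are hypothetical.

```python
def ring_hops(src, dst, stations):
    """Hops between two stations on a bidirectional ring, shorter way round."""
    clockwise = (dst - src) % stations
    return min(clockwise, stations - clockwise)

def max_ring_hops(stations):
    """Largest hop count from station 0 to any other station."""
    return max(ring_hops(0, dst, stations) for dst in range(stations))

STATIONS = 8 + 2          # assumed reading of the (8+2) term in the text
CYCLES_PER_HOP = 1        # latency between adjacent stations

worst_case_cycles = max_ring_hops(STATIONS) * CYCLES_PER_HOP
print(worst_case_cycles)
```

With a single one-directional ring the worst case would be 9 hops for the same 10 stations, which is why two counter-rotating rings roughly halve the maximum latency.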
As can be seen in the diagram, there are 2 QPI links that connect to the other CPU on the board at a bandwidth of 32 GB/s per link. The QPI links maintain cache coherency between the caches of the CPUs. The aggregate memory bandwidth is 51.2 GB/s over 4 channels to 1600 MHz DDR3 memory. The I/O unit is integrated on the chip and, with one 8 GB/s and two 16 GB/s PCIe Gen3 ports, has an aggregate bandwidth of 40 GB/s.
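The aggregate bandwidth figures can be checked with a few lines of arithmetic. The sketch assumes that each DDR3 channel is 64 bits (8 bytes) wide and delivers 1600 million transfers per second. The function and variable names are hypothetical.

```python
def memory_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0

# 4 channels of DDR3 at 1600 MT/s, 8 bytes per transfer (assumed width).
mem_bw = memory_bandwidth_gb_s(channels=4, mega_transfers_per_s=1600)

# One 8 GB/s port plus two 16 GB/s ports.
io_bw = 8 + 2 * 16

# Two QPI links at 32 GB/s each.
qpi_bw = 2 * 32

print(mem_bw, io_bw, qpi_bw)
```

These are peak values. The bandwidth actually achieved by an application depends on its access patterns and is normally lower.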

The Ivy Bridge processor
In early 2013 the server version of Sandy Bridge's successor, Ivy Bridge, will become available. It is a technology shrink of the Sandy Bridge processor from 32 nm to 22 nm. This allows for the placement of 12 cores on a chip, while the size of the L3 cache grows to 30 MB, so that 2.5 MB per core is still available.