Intel Xeon

Although Intel Xeon processors are not used in integrated parallel systems these days, they play a major role in the cluster community, as the majority of compute nodes in Beowulf clusters are of this type. We therefore also briefly discuss this type of processor. We concentrate on the Xeon, the server version of the IA-32 processor family, as this is the type found in clusters, mostly in 2-processor nodes.

In 2006 Intel introduced an enhanced microarchitecture for the IA-32 instruction set architecture, called the Core architecture. The server version, code-named Woodcrest, is a first implementation of this new microarchitecture. Like all current high-end processors, the Woodcrest processor has two processor cores. In addition, many improvements have been made to increase performance while at the same time decreasing power requirements.

Figure 11 shows a block diagram of the processor, with one of the cores in some detail. Note that the two cores share one second-level cache, while the L1 caches and TLBs are local to each core.

Figure 11: Block diagram of the Intel Xeon processor.

To stay backwards compatible with the x86 (IA-32) Instruction Set Architecture, which comprises a CISC instruction set, Intel developed a mode in which these instructions are split into so-called micro-operations (µ-ops) of fixed length that can be handled the way RISC processors handle their instructions. In fact the µ-ops constitute a RISC operation set. The price to be paid for this much more efficient instruction set is an extra decoding stage.

Many of the improvements of the Core architecture are not evident from the block diagram. For instance, in the Core architecture 4 µ-ops/cycle can be scheduled instead of 3 as in the former microarchitecture. Furthermore, some macro-instructions as well as some µ-ops can be fused, resulting in less instruction handling, easier scheduling, and better instruction throughput, because these fused operations can be executed in a single cycle.

As can be seen in Figure 11, the processor cores have an execution trace cache that holds partly decoded instructions from earlier execution traces. These can be drawn upon, thus forgoing the instruction decode phase, which might otherwise produce holes in the instruction pipeline. The allocator dispatches the decoded instructions, the µ-ops, to the appropriate µ-op queue: one for memory operations, another for integer and floating-point operations.

Figure 12: Intel Xeon floating-point unit.

Two integer Arithmetic/Logical Units (ALUs) are kept simple in order to be able to run them at twice the clock speed. In addition there is an ALU for complex integer operations that cannot be executed within one cycle. The floating-point units, depicted in Figure 12, also contain additional units that execute the Streaming SIMD Extensions 2 and 3 (SSE2/3) repertoire of instructions, a 144-member instruction set that is especially meant for vector-oriented operations as found in multimedia and 3-D visualisation applications, but which is also of advantage for regular vector operations as they occur in dense linear algebra. The length of the operands for these units is 128 bits. The throughput of these SIMD units has been increased by a factor of 2 in the Core architecture, which greatly increases the performance of the appropriate instructions. The Intel compilers have the ability to address the SSE2/3 units. This makes it possible, in principle, to achieve a 2 to 3 times higher floating-point performance.
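
To make the 128-bit operand length concrete: one SSE2 register holds two 64-bit double-precision values, so a single instruction operates on two array elements at once. The following is a minimal sketch, not taken from the report, of a dense linear algebra kernel (y = a*x + y) written by hand with the SSE2 intrinsics from emmintrin.h. The function name daxpy and the use of unaligned loads are illustrative choices; a vectorising compiler would normally generate comparable code from the plain scalar loop. On targets without SSE2 the sketch falls back to the scalar loop.

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Illustrative kernel: y[i] = a * x[i] + y[i] for i in [0, n). */
void daxpy(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Broadcast a into both 64-bit lanes of a 128-bit register. */
    const __m128d va = _mm_set1_pd(a);
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);  /* load 2 doubles, no alignment assumed */
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(_mm_mul_pd(va, vx), vy);  /* 2 multiplies + 2 adds */
        _mm_storeu_pd(y + i, vy);
    }
#endif
    /* Scalar remainder (or the whole loop when SSE2 is unavailable). */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Calling daxpy(5, 2.0, x, y) processes elements 0 to 3 in two SIMD iterations and the last element in the scalar loop.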

The Xeons boast so-called Hyperthreading: under some circumstances two threads can run concurrently on the processor. In this the Xeon is no longer unique, as all main processor makers now provide some form of multi-threading. It may for instance be used for speculative execution of if branches. Experiments have shown that performance improvements of up to 30% can be attained for a variety of codes. In practice, however, the performance gain is about 3 to 5%.

The secondary cache has a size of 4 MB in the Woodcrest implementation of the Core processor. The two cores share a frontside bus with a bandwidth of 10.6 GB/s.
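
The quoted frontside bus bandwidth can be reproduced with simple arithmetic. The sketch below assumes a bus that performs 1333 million transfers per second over a 64-bit wide data path; neither figure is stated in the text above, so both should be read as assumptions about the Woodcrest bus. The function name bus_bandwidth_gbs is hypothetical.

```c
/* Peak bandwidth in GB/s (10^9 bytes/s) of a bus performing
   mega_transfers_per_s million transfers per second, each
   bus_width_bits wide. Both inputs are assumptions of the caller. */
double bus_bandwidth_gbs(double mega_transfers_per_s, unsigned bus_width_bits)
{
    double bytes_per_transfer = bus_width_bits / 8.0;
    return mega_transfers_per_s * 1.0e6 * bytes_per_transfer / 1.0e9;
}
```

Under these assumptions bus_bandwidth_gbs(1333.0, 64) comes to roughly 10.66 GB/s, in line with the 10.6 GB/s quoted. Note that this is a peak figure that both cores have to share.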

Since its predecessor, the Nocona processor, the Intel processors have had the ability to run (and address) 64-bit codes, thereby following AMD and in fact copying the approach used in the AMD Opteron and Athlon processors. Intel calls the technique Extended Memory 64 Technology (EM64T). In principle it uses "unused bits" in the instruction words of the x86 instruction set to signal whether a 64-bit version of an instruction should be executed. Of course some additional devices are needed for operating in 64-bit mode. These include 8 new general purpose registers (GPRs), 8 new registers for SSE2/3 support, and 64-bit wide GPRs and instruction pointers.
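
In the 64-bit instruction encoding this signalling is done by an optional prefix byte, called the REX prefix, with bit layout 0100WRXB. The W bit selects a 64-bit operand size, and the R, X and B bits each supply a fourth bit for a register field, which is how the 8 additional registers are reached. The sketch below decodes such a byte; the type rex_prefix and the functions decode_rex and reg_number are hypothetical names chosen for illustration, not part of any Intel or AMD interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of a REX prefix byte, bit layout 0100WRXB. */
typedef struct {
    bool w;  /* 1 = 64-bit operand size */
    bool r;  /* extends the ModRM reg field */
    bool x;  /* extends the SIB index field */
    bool b;  /* extends the ModRM r/m, SIB base, or opcode register field */
} rex_prefix;

/* In 64-bit mode the bytes 0x40..0x4F act as REX prefixes.
   Returns true and fills *out if byte is one of them, false otherwise. */
bool decode_rex(uint8_t byte, rex_prefix *out)
{
    if ((byte & 0xF0) != 0x40)
        return false;
    out->w = (byte >> 3) & 1;
    out->r = (byte >> 2) & 1;
    out->x = (byte >> 1) & 1;
    out->b = byte & 1;
    return true;
}

/* Combine a 3-bit register field with its REX extension bit
   into a 4-bit register number in the range 0..15. */
unsigned reg_number(unsigned field3, bool ext)
{
    return ((unsigned)ext << 3) | (field3 & 7u);
}
```

As an example, the byte sequence 01 D8 encodes a 32-bit add of ebx to eax, while 48 01 D8, the same instruction preceded by a REX prefix with only W set, encodes the 64-bit add of rbx to rax.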

As with the dual-core Montecito (see Itanium 2), we do not show the dual-core configuration in a separate figure, as it is very similar to that of the Montecito. Whether all the facilities present in the dual-core processor can be exploited will depend heavily on the quality of the compilers.