|
The AMD Athlon, Opteron, and Phenom processors are clones with respect to Intel's x86 Instruction Set Architecture, and the dual-core Opteron in particular has been (and still is) quite popular for use in clusters. In addition, the Opteron and Phenom are used in the Cray and Liquid Computer systems that are presented later. The Phenom is a quad-core chip (also known under the code name ``Barcelona'') that came onto the market at the end of 2007. We discuss this new processor below. The first version of the processor contained an error in the L2 TLB that could be patched only at the expense of performance. As of April 2008 there is a new version that does not have the TLB problem and works without the performance penalty. In contrast to the Opteron, a third level of cache has been added that is shared by the four cores, as can be seen in Figure 7. Its size is 2 MB.
Figure 7: Block diagram of an AMD Phenom (Barcelona) processor.
As already mentioned, the AMD processors have many features that are also present in modern RISC processors: they support out-of-order execution, have multiple floating-point units, and can issue up to 9 instructions simultaneously. A block diagram of the Phenom processor core is shown in Figure 8. The four cores are connected by an on-chip crossbar (see next section) that also connects to the memory controller and to other processors on the board (if present).
Figure 8: Block diagram of an AMD Phenom processor core.
The figure shows that the processor has three pairs of Integer Execution Units and Address Generation Units that, via a 32-entry Integer Scheduler, take care of the integer computations and of the address calculations. Both the Integer Future File and the Floating-Point Scheduler are fed by the 72-entry Reorder Buffer that receives the decoded instructions from the instruction decoders. The decoding in the Phenom core has become more efficient than in the earlier processors: SSE instructions now decode into 1 micro-operation (μop), as do most integer and floating-point instructions. In addition, a new piece of hardware called the sideband stack optimiser has been added (not shown in the figure). It takes care of the stack manipulations in the instruction stream, which makes instruction reordering more efficient and thereby increases the effective number of instructions per cycle. The floating-point units allow out-of-order execution of instructions via the FPU Stack Map & Rename unit. It receives the floating-point instructions from the Instruction Control Unit and reorders them if necessary before handing them over to the FPU Scheduler. The Floating-Point Register File is 120 elements deep, on par with the number of registers available in RISC processors. (For the x86 instructions 16 registers in a flat register file are present instead of the register stack that is usual for Intel architectures.) The floating-point part of the processor contains three units: Floating Add and Multiply units that can work in superscalar mode, resulting in two floating-point results per clock cycle, and a unit handling ``miscellaneous'' operations like division and square root. Because of the compatibility with Intel's Pentium 4 processors, the floating-point units are also able to execute Intel SSE2/3 instructions and AMD's own 3DNow! instructions. 
However, there is the general problem that such instructions are not directly accessible from higher-level languages like Fortran 90 or C(++). Both instruction sets were originally meant for massive processing of visualisation data but are increasingly also used for standard dense linear algebra operations. Due to the shrinkage of technology to 65 nm, each core can harbour a secondary cache of 512 KB. This, together with a significantly enhanced memory bus, can deliver up to 6.4 GB/s of bandwidth to/from the memory. This memory bus, called HyperTransport by AMD, is derived from licensed Compaq technology and is similar to that employed in HP/Compaq's former EV7 processors. It allows for ``glueless'' connection of several processors to form multi-processor systems with very low memory latencies. In the Phenom the third generation, HyperTransport 3.0, is used, which in principle can transfer 10.4 GB/s per link in each direction. However, because the present sockets do not yet support this speed, the throughput is still that of the earlier HyperTransport 1.1 link, 4 GB/s/link/direction. The clock frequency is in the range of 2.2–2.5 GHz, which makes the Phenom an interesting alternative to the few RISC processors that are still available at this moment. The HyperTransport interconnection possibilities in particular make it highly attractive for building SMP-type clusters. |
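
The figures quoted above can be combined into a rough estimate of the theoretical peak performance and of the memory balance of the chip. The following is a minimal sketch and not a vendor specification. It counts only the two floating-point results per clock cycle mentioned in the text, and it assumes that the quoted 6.4 GB/s is the peak memory bandwidth of one chip shared by all four cores. The function and constant names are illustrative.

```python
# Figures quoted in the text; everything else is derived arithmetic.
CORES_PER_CHIP = 4
FP_RESULTS_PER_CYCLE = 2       # add and multiply units working in superscalar mode
CLOCK_GHZ = (2.2, 2.5)         # quoted clock frequency range
MEMORY_BANDWIDTH_GB_S = 6.4    # assumption: peak bandwidth per chip, shared by all cores


def peak_gflops(clock_ghz: float, cores: int = CORES_PER_CHIP) -> float:
    """Theoretical peak in Gflop/s: results per cycle x clock (GHz) x cores."""
    return FP_RESULTS_PER_CYCLE * clock_ghz * cores


def bytes_per_flop(clock_ghz: float) -> float:
    """Machine balance: peak memory bandwidth divided by peak flop rate of the chip."""
    return MEMORY_BANDWIDTH_GB_S / peak_gflops(clock_ghz)


if __name__ == "__main__":
    for ghz in CLOCK_GHZ:
        print(
            f"{ghz:.1f} GHz: core {peak_gflops(ghz, cores=1):.1f} Gflop/s, "
            f"chip {peak_gflops(ghz):.1f} Gflop/s, "
            f"balance {bytes_per_flop(ghz):.2f} B/flop"
        )
```

Under these assumptions a 2.5 GHz chip would reach 5 Gflop/s per core and 20 Gflop/s per chip, with about 0.32 byte of memory bandwidth available per floating-point result. These are upper bounds; sustained performance depends on how well an application reuses data in the caches.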