AMD Opteron

The AMD Athlon, Opteron, and Phenom processors are clones with respect to Intel's x86 Instruction Set Architecture, and the dual-core Opteron in particular has been (and still is) quite popular for use in clusters. In addition, the Opteron and Phenom are used in the Cray and Liquid Computing systems that are presented later. The Phenom is a quad-core chip (also known under the code name "Barcelona") that came onto the market at the end of 2007. We discuss this new processor below. The first version of the processor contained an error in the L2 TLB that could be patched, but only at the expense of performance. As of April 2008 there is a new version that does not have the TLB problem and works without the performance penalty. In contrast to the Opteron, the Phenom has a third level of cache, 2 MB in size, that is shared by the four cores, as can be seen in Figure 7.

Figure 7: Block diagram of an AMD Phenom (Barcelona) processor.

As already mentioned, the AMD processors have many features that are also present in modern RISC processors: they support out-of-order execution, have multiple floating-point units, and can issue up to 9 instructions simultaneously. A block diagram of the Phenom processor core is shown in Figure 8. The four cores are connected by an on-chip crossbar (see next section) that also connects to the memory controller and to other processors on the board (if present).

Figure 8: Block diagram of an AMD Phenom processor core.

The figure shows that the processor has three pairs of Integer Execution Units and Address Generation Units that, via a 32-entry Integer Scheduler, take care of the integer computations and the address calculations. Both the Integer Future File and the Floating-Point Scheduler are fed by the 72-entry Reorder Buffer, which receives the decoded instructions from the instruction decoders. Decoding in the Phenom core has become more efficient than in the earlier processors: SSE instructions now decode into 1 micro-operation (μop), as do most integer and floating-point instructions. In addition, a new piece of hardware called the sideband stack optimiser has been added (not shown in the figure). It takes care of the stack manipulations in the instruction stream, which makes instruction reordering more efficient and thereby increases the effective number of instructions per cycle.

The floating-point units allow out-of-order execution of instructions via the FPU Stack Map & Rename unit. It receives the floating-point instructions from the Instruction Control Unit and reorders them if necessary before handing them over to the FPU Scheduler. The Floating-Point Register File is 120 elements deep, on par with the number of registers available in RISC processors. (For the x86 instructions, 16 registers in a flat register file are present instead of the register stack that is usual for Intel architectures.)

The floating-point part of the processor contains three units: a Floating Add unit and a Floating Multiply unit that can work in superscalar mode, yielding two floating-point results per clock cycle, and a unit that handles "miscellaneous" operations such as division and square root. Because of the compatibility with Intel's Pentium 4 processors, the floating-point units are also able to execute Intel SSE2/3 instructions and AMD's own 3DNow! instructions. However, there is the general problem that such instructions are not directly accessible from higher-level languages like Fortran 90 or C(++). Both instruction sets were originally meant for massive processing of visualisation data but are increasingly also used for standard dense linear algebra operations.

Due to the shrinkage of technology to 45 nm, each core can harbour a secondary cache of 512 KB. This, together with a significantly enhanced memory bus, can deliver up to 6.4 GB/s of bandwidth to and from the memory. This memory bus, called HyperTransport by AMD, is derived from licensed Compaq technology and is similar to that employed in HP/Compaq's former EV7 processors. It allows for "glueless" connection of several processors to form multi-processor systems with very low memory latencies. The Phenom uses the third generation, HyperTransport 3.0, which in principle can transfer 10.4 GB/s per link in each direction. However, because the present sockets do not yet support this speed, the throughput is still that of the earlier HyperTransport 1.1 link: 4 GB/s per link per direction.
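To put the two HyperTransport figures side by side, the following minimal sketch compares the bandwidth-only transfer time of a message at the nominal HyperTransport 3.0 rate and at the HyperTransport 1.1 rate that present sockets sustain. The 1 MB message size, the function name, and the convention 1 GB = 10^9 bytes are assumptions made for illustration. Link latency is ignored, so the numbers are lower bounds rather than measurements.

```python
# Link rates quoted in the text, in GB/s per link per direction.
HT3_GBPS = 10.4  # HyperTransport 3.0, nominal rate
HT1_GBPS = 4.0   # HyperTransport 1.1, rate supported by present sockets


def transfer_time_us(n_bytes, gbps):
    """Bandwidth-only transfer time in microseconds (latency ignored).

    Assumes 1 GB = 1e9 bytes; this convention is not stated in the text.
    """
    return n_bytes / (gbps * 1e9) * 1e6


# Ratio of nominal HT 3.0 throughput to what is currently attainable.
speedup = HT3_GBPS / HT1_GBPS

if __name__ == "__main__":
    msg = 1_000_000  # hypothetical 1 MB message
    print(f"HT 1.1: {transfer_time_us(msg, HT1_GBPS):.1f} us")
    print(f"HT 3.0: {transfer_time_us(msg, HT3_GBPS):.1f} us")
    print(f"nominal gain of HT 3.0 over HT 1.1: {speedup:.1f}x")
```

Under these assumptions, sockets that supported the full HyperTransport 3.0 rate would cut the bandwidth-bound part of a transfer by a factor of about 2.6.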

The clock frequency is in the range of 2.2–2.5 GHz, which makes the Phenom an interesting alternative to the few RISC processors that are still available at this moment. The HyperTransport interconnection possibilities in particular make it highly attractive for building SMP-type clusters.
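The figures quoted in this section can be combined into a rough estimate of the theoretical peak performance per chip and of the memory bandwidth available per floating-point operation. The sketch below is a back-of-the-envelope calculation, not a vendor specification. It counts only the two floating-point results per cycle per core mentioned above, so packed SSE operation is not considered. It also assumes that the 6.4 GB/s memory bandwidth is a per-chip figure shared by all four cores. The function names are hypothetical.

```python
# Values taken from the text.
CORES = 4                  # quad-core chip
FP_RESULTS_PER_CYCLE = 2   # add + multiply units in superscalar mode
MEM_BW_GBPS = 6.4          # memory bandwidth (assumed per chip, shared by cores)


def peak_gflops(clock_ghz, cores=CORES, per_cycle=FP_RESULTS_PER_CYCLE):
    """Theoretical peak in Gflop/s: clock rate x results per cycle x cores."""
    return clock_ghz * per_cycle * cores


def bytes_per_flop(clock_ghz):
    """Memory bandwidth per peak flop: a simple machine-balance estimate."""
    return MEM_BW_GBPS / peak_gflops(clock_ghz)


if __name__ == "__main__":
    for clock in (2.2, 2.5):  # ends of the quoted clock-frequency range
        print(f"{clock} GHz: {peak_gflops(clock):.1f} Gflop/s peak, "
              f"{bytes_per_flop(clock):.2f} bytes/flop")
```

Under these assumptions the peak lies between 17.6 and 20 Gflop/s per chip, with roughly a third of a byte of memory bandwidth per flop. This suggests that memory-bound codes would stay well below the peak.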