IBM POWER7

    The POWER7 processor is presently the processor of IBM's Power 775 HPC system line. In addition, Hitachi is offering a variant of its SR16000 system with the POWER7 processor. Figure 11 shows the layout of the cores, caches, and memory controllers on the chip.

    Figure 11: Diagram of the IBM POWER7 chip layout.

    Like the POWER6, the chips are built in Silicon-On-Insulator technology, here with a 45 nm feature size, but otherwise the differences with the former generation are large. Not only has the number of cores quadrupled, the memory speed has also increased, going from DDR2 to DDR3 via two on-chip memory controllers. As in earlier POWER versions, the inbound and outbound bandwidth between memory and chip differ: 2 B/cycle in and 1.5 B/cycle out. With a bus frequency of 6.4 GHz and 4 in/out channels per controller this amounts to 51.2 GB/s inward and 38.4 GB/s outward. IBM asserts that an aggregate sustained bandwidth of ≈ 100 GB/s can be reached. Although this is very high in absolute terms, it is no luxury with a clock frequency of 3.5–3.86 GHz for the processors. Therefore it is possible to run the chip in so-called TurboCore mode. In this mode four of the 8 cores are turned off and the clock frequency is raised to 4.14 GHz, thus almost doubling the bandwidth available to each active core. As one core is capable of absorbing/producing 16 B/cycle when executing a fused floating-point multiply-add operation, the bandwidth requirement of a single core at 4 GHz is already 64 GB/s. So, the cache hierarchy and possible prefetching are extremely important for a reasonable occupation of the many functional units.
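The bandwidth figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers from the text; the function and variable names are illustrative and not part of any IBM tool.

```python
# Figures quoted in the text; the names below are illustrative only.
BUS_FREQ_GHZ = 6.4          # memory bus frequency
CHANNELS = 4                # in/out channels per memory controller
BYTES_IN_PER_CYCLE = 2.0    # inbound bytes per cycle per channel
BYTES_OUT_PER_CYCLE = 1.5   # outbound bytes per cycle per channel


def bandwidth_gb_s(bytes_per_cycle, freq_ghz=BUS_FREQ_GHZ, channels=CHANNELS):
    """Bandwidth in GB/s for one direction: B/cycle x GHz x channels."""
    return bytes_per_cycle * freq_ghz * channels


def core_demand_gb_s(clock_ghz, bytes_per_cycle=16):
    """Bandwidth one core could absorb/produce at 16 B/cycle."""
    return clock_ghz * bytes_per_cycle


inbound = bandwidth_gb_s(BYTES_IN_PER_CYCLE)
outbound = bandwidth_gb_s(BYTES_OUT_PER_CYCLE)
demand = core_demand_gb_s(4.0)

print(f"inbound : {inbound:.1f} GB/s")
print(f"outbound: {outbound:.1f} GB/s")
print(f"one core at 4 GHz could consume {demand:.1f} GB/s")
```

The comparison makes the point of the paragraph explicit: a single core running fused multiply-adds at full rate could in principle consume more than the quoted inbound figure, so the caches must supply most operands.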

    Another new feature of the POWER7 with regard to its predecessor is that the L3 cache has been moved onto the chip. To be able to do this, IBM chose to implement the 32 MB L3 cache in embedded DRAM (eDRAM) instead of the usual SRAM. eDRAM is slower than SRAM but much less bulky, and because the cache is now on-chip the latency is considerably lower (by about a factor of 6). The L3 cache communicates with the L2 caches that are private to each core. The L3 cache is partitioned: it contains 8 regions of 4 MB, one region per core. Each partition serves as a victim cache for the L2 cache to which it is dedicated and, in addition, for the other 7 L3 cache partitions.
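The victim-cache role of an L3 partition can be illustrated with a toy model. This is a simplified sketch, not a description of IBM's implementation: it assumes fully associative LRU caches that track line addresses only, it models a single core, and it leaves out the spilling of lines to the other 7 partitions. The class names `LRUCache` and `CoreCaches` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache that stores line addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, addr):
        """Return True on a hit and mark the line as most recently used."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def remove(self, addr):
        """Drop a line if it is present."""
        self.lines.pop(addr, None)

    def insert(self, addr):
        """Insert a line; return the evicted line address, or None."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)
            return victim
        return None


class CoreCaches:
    """One core's private L2 plus its dedicated L3 partition (toy model)."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = LRUCache(l2_lines)
        self.l3 = LRUCache(l3_lines)

    def access(self, addr):
        """Return which level served the access: 'L2', 'L3' or 'memory'."""
        if self.l2.lookup(addr):
            return "L2"
        if self.l3.lookup(addr):
            self.l3.remove(addr)      # the line moves back up into L2
            source = "L3"
        else:
            source = "memory"
        victim = self.l2.insert(addr)
        if victim is not None:
            self.l3.insert(victim)    # the L2 cast-out lands in the L3 partition
        return source


caches = CoreCaches(l2_lines=2, l3_lines=4)
trace = [caches.access(a) for a in (0, 1, 2, 0, 2)]
print(trace)
```

In this run line 0 is evicted from the two-line L2 when line 2 arrives and is later found again in the L3 partition instead of being fetched from memory, which is the behaviour a victim cache is meant to provide.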
    Each chip features five 10-B SMP links that support SMP operation of up to 32 sockets. The core, too, differs in many respects from its predecessor. A single core is depicted in Figure 12.

    Figure 12: Diagram of the IBM POWER7 core.

    To begin with, the number of floating-point units is doubled to four, each capable of a fused multiply-add operation per cycle. Assuming a clock frequency of 3.83 GHz, this means that a peak speed of 30.64 Gflop/s can be attained with these units. A feature that was omitted from the POWER6 core has been re-implemented in the POWER7 core: dynamic instruction scheduling assisted by the load and load reorder queues. As shown in Figure 12 there are two 128-bit VMX units. One of them executes vector instructions akin to the x86 SSE instructions. The other is a VMX permute unit that can order non-contiguous operands such that the VMX execute unit can handle them. The instruction set for the VMX unit is an implementation of the AltiVec instruction set that is also employed in the PowerPC processors. There are also similarities with the POWER6 processor: the core contains a decimal floating-point unit (DFU) and a checkpoint recovery unit that can re-schedule operations that have failed for some reason.
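The peak floating-point figure follows directly from the unit count and the clock frequency. The sketch below reproduces it under the usual convention that a fused multiply-add counts as 2 floating-point operations; the per-chip value is simply the per-core value multiplied by the 8 cores mentioned earlier, not a number quoted in the text. All names are illustrative.

```python
# Figures quoted in the text; names are illustrative only.
FP_UNITS = 4          # floating-point units per core
FLOPS_PER_FMA = 2     # a fused multiply-add counts as 2 flops (assumed convention)
CLOCK_GHZ = 3.83      # clock frequency assumed in the text
CORES_PER_CHIP = 8    # cores on one POWER7 chip


def peak_gflops_per_core(clock_ghz, fp_units=FP_UNITS, flops_per_fma=FLOPS_PER_FMA):
    """Theoretical peak in Gflop/s: units x flops per FMA x clock in GHz."""
    return fp_units * flops_per_fma * clock_ghz


core_peak = peak_gflops_per_core(CLOCK_GHZ)
chip_peak = CORES_PER_CHIP * core_peak   # derived here, not quoted in the text

print(f"per core: {core_peak:.2f} Gflop/s")
print(f"per chip: {chip_peak:.2f} Gflop/s")
```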

    Another difference, one that cannot be shown in the figure, is that the cores now support 4 SMT threads instead of 2. This is very helpful in keeping the large number of functional units busy. Eight instructions can be fetched from the L1 instruction cache per cycle. The instruction decode unit can handle 6 instructions simultaneously, while 8 instructions can be dispatched every cycle to the various functional units.

    The POWER7 core has elaborate power management features that reduce the power usage of parts that are idle for some time. There are two power-saving modes: nap mode and sleep mode. In the former the caches and TLBs stay coherent so that the core can be re-activated quickly. In sleep mode, however, the caches are purged and the clock is turned off. Only the minimum voltage needed to maintain the memory contents is applied. Obviously the wake-up time is longer in this case, but the power saving can be significant.

    As yet no POWER8 processor has been produced, although most of its features are already known. A follow-on to the POWER7, the POWER7+, is available with a clock frequency of up to 4.4 GHz and an L3 cache that is increased from 32 to 80 MB. However, this chip is not offered in the HPC P775 servers from IBM but rather in their enterprise systems, where the large L3 cache may offer a greater advantage. In addition, the higher clock rate with its associated higher power consumption would be less favorable in highly floating-point oriented workloads.