The SPARC processors

    Since Sun has been taken over by Oracle, all processor development has been shelved. The development of the SPARC processor architecture is now in the hands of Fujitsu, which advances it with its own SPARC64 implementation. Fujitsu/Siemens markets its HPC servers based on the latter processor. Below we discuss the current SPARC chip that is commercially available in the Fujitsu machines. Although a follow-on processor, the SPARC64 VIII, seems ready for incorporation in Japan's 10 Petaflop/s system that is currently being built, we only discuss the SPARC64 VII here, as this is the one that is commercially available. It is presently not known when its successor will appear on the open market.

    The SPARC64 VII is, obviously, Fujitsu's seventh generation of the processor. The processor must, of course, be able to execute the SPARC instruction set, but its internals are rather different from Sun's late implementations. Figure 20 shows a block diagram of one core of the quad-core SPARC64 VII.

    Figure 20: Block diagram of the Fujitsu SPARC64 VII processor core.

    Actually, the core architecture has not changed from the SPARC64 VI, but thanks to the decrease of the feature size from 90 nm to 65 nm, 4 cores can now be placed on a chip, while the highest available clock frequency has been raised from 2.4 GHz to 2.52 GHz.
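
    As a rough illustration of why the shrink makes room for more cores, the sketch below computes the ideal density gain of a 90 nm to 65 nm transition and the relative clock increase. It assumes perfect scaling of an unchanged design (area shrinking with the square of the feature size), which real chips only approximate; the function name is ours.

```python
def ideal_area_scale(old_nm: float, new_nm: float) -> float:
    """Ideal density gain for an unchanged design: the square of the
    linear shrink factor. Assumes perfect scaling (an idealization)."""
    return (old_nm / new_nm) ** 2


# Feature sizes and clock frequencies as quoted in the text.
density_gain = ideal_area_scale(90.0, 65.0)
clock_gain = 2.52 / 2.4 - 1.0

print(f"ideal density gain, 90 nm -> 65 nm: {density_gain:.2f}x")
print(f"clock frequency increase: {clock_gain:.1%}")
```

    Under this idealization the same silicon area holds almost twice as many transistors, while the clock frequency gain is a modest 5%.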

    The L1 instruction and data caches are 64 KB each, half the size of those in the SPARC64 VI core, and both are 2-way set-associative. This decrease in size is somewhat surprising and is probably due to the technology shrink to 65 nm feature size. There is also an Instruction Buffer (IBF) that contains up to 48 4-byte instructions and continues to feed the registers through the Instruction Word Register when an L1 I-cache miss has occurred. A maximum of four instructions can be scheduled each cycle; they find their way to the registers via the reservation stations for address generation (RSA), integer execution units (RSE), and floating-point units (RSF). The two general register files serve both the two Address Generation units, EAG-A and EAG-B, and the Integer Execution units, EX-A and EX-B. The latter two are not equivalent: only EX-A can execute multiply and divide instructions. There are also two floating-point register files (FPR) that feed the two Floating-Point units, FL-A and FL-B. These units differ from those of Sun in that they are able to execute fused multiply-add instructions, as is also the case in the POWER and Itanium processors. Consequently, a maximum of 4 floating-point results/cycle can be generated. In addition, FL-A and FL-B also perform divide and square root operations, in contrast to the SPARC4+, which has a separate unit for these operations. Because of their iterative nature, the divide and square root operations are not pipelined. The feedback from the execution units to the registers is decoupled by update buffers: GUB for the general registers and FUB for the floating-point registers.
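
    To make the cache figures above more concrete, the following minimal sketch derives the L1 geometry and the capacity of the Instruction Buffer. The cache size, associativity, and IBF figures come from the text; the 64-byte line size is an assumption made purely for illustration, since the text does not state it.

```python
def cache_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    if size_bytes % (ways * line_bytes) != 0:
        raise ValueError("cache size must be a multiple of ways * line size")
    return size_bytes // (ways * line_bytes)


L1_SIZE_BYTES = 64 * 1024   # from the text: 64 KB per L1 cache
L1_WAYS = 2                 # from the text: 2-way set-associative
ASSUMED_LINE_BYTES = 64     # ASSUMPTION: line size is not given in the text

l1_way_bytes = L1_SIZE_BYTES // L1_WAYS
l1_sets = cache_sets(L1_SIZE_BYTES, L1_WAYS, ASSUMED_LINE_BYTES)

# Instruction Buffer: up to 48 instructions of 4 bytes each (from the text).
ibf_bytes = 48 * 4

print(f"L1 capacity per way: {l1_way_bytes // 1024} KB")
print(f"L1 sets (assuming {ASSUMED_LINE_BYTES}-byte lines): {l1_sets}")
print(f"IBF capacity: {ibf_bytes} bytes")
```

    With the assumed line size each L1 cache would have 512 sets; a different line size changes only that number, not the 32 KB per way or the 192-byte IBF capacity.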

    The dispatch of instructions via the reservation stations, each of which can hold 10 instructions, gives the opportunity for speculative dispatch, i.e., dispatching instructions whose operands are not yet ready at the moment of dispatch but will be by the time the instruction is actually executed. The assumption is that this results in a more even flow of instructions to the execution units.

    The SPARC64 VII does not have a third-level cache, but on chip there is a large (6 MB) unified, 12-way set-associative, write-through L2 cache that is shared by the 4 cores in a processor, as can be seen in Figure 21. Note that the system bandwidth is the highest available; for the lower-end systems this bandwidth is about 8 GB/s.

    The Memory Management Unit (not shown in Figure 20) contains separate sets of Translation Look-aside Buffers (TLBs) for instructions and for data. Each set is composed of a 32-entry µTLB and a 1024-entry main TLB. The µTLBs are accessed through high-speed pipelines by their respective caches.
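
    The TLB sizes above translate into an amount of memory that can be addressed without a TLB miss, the so-called TLB reach. The sketch below computes it for both TLB levels and also gives the capacity of one way of the L2 cache. The entry counts and cache parameters come from the text; the 8 KB page size is an assumption (the base page size of the SPARC V9 architecture), and larger pages would increase the reach proportionally.

```python
def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    """Memory covered by a TLB when every entry maps one page."""
    return entries * page_bytes


ASSUMED_PAGE_BYTES = 8 * 1024   # ASSUMPTION: 8 KB base pages
MICRO_TLB_ENTRIES = 32          # from the text
MAIN_TLB_ENTRIES = 1024         # from the text

micro_reach = tlb_reach_bytes(MICRO_TLB_ENTRIES, ASSUMED_PAGE_BYTES)
main_reach = tlb_reach_bytes(MAIN_TLB_ENTRIES, ASSUMED_PAGE_BYTES)

# L2 cache: 6 MB, 12-way set-associative (from the text).
l2_way_bytes = (6 * 1024 * 1024) // 12

print(f"µTLB reach:       {micro_reach // 1024} KB")
print(f"main TLB reach:   {main_reach // (1024 * 1024)} MB")
print(f"L2 bytes per way: {l2_way_bytes // 1024} KB")
```

    Under the 8 KB assumption the µTLB covers 256 KB and the main TLB 8 MB, i.e., somewhat more than the 6 MB capacity of the L2 cache.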

    Figure 21: Block diagram of the Fujitsu SPARC64 VII processor chip. Four cores share the L2 cache.

    What cannot be shown in the diagrams is that, like the IBM and Intel processors, the SPARC64 VII is dual-threaded per core. The type of multithreading is similar to that found in the Intel processors and is called Simultaneous Multithreading; it differs from the type of multithreading present in the IBM processors, although that goes by the same name. At this moment the highest clock frequency available for the SPARC64 is 2.52 GHz. As already remarked, the floating-point units are capable of a fused multiply-add operation, like those of the POWER and Itanium processors, so the theoretical peak performance is presently 10.08 Gflop/s/core and consequently 40.3 Gflop/s/processor.
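
    The peak performance figures follow directly from the numbers given in this section: two floating-point units per core, each delivering two results per cycle through a fused multiply-add, at a clock frequency of 2.52 GHz, with four cores per chip. A minimal sketch of the arithmetic (the function and constant names are ours):

```python
def peak_gflops(clock_ghz: float, fp_units: int, flops_per_unit: int,
                cores: int = 1) -> float:
    """Theoretical peak in Gflop/s: clock * units * flops per unit * cores."""
    return clock_ghz * fp_units * flops_per_unit * cores


CLOCK_GHZ = 2.52     # highest available clock frequency (from the text)
FP_UNITS = 2         # FL-A and FL-B
FLOPS_PER_FMA = 2    # a fused multiply-add counts as two floating-point results
CORES_PER_CHIP = 4

per_core = peak_gflops(CLOCK_GHZ, FP_UNITS, FLOPS_PER_FMA)
per_chip = peak_gflops(CLOCK_GHZ, FP_UNITS, FLOPS_PER_FMA, CORES_PER_CHIP)

print(f"peak per core: {per_core:.2f} Gflop/s")
print(f"peak per chip: {per_chip:.2f} Gflop/s")
```

    This reproduces the 10.08 Gflop/s per core and 40.32 Gflop/s per processor; the peak is only attainable when every cycle issues a fused multiply-add on both units.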