The SPARC processors


    Since Sun was taken over by Oracle, the development of the SPARC processor family has been split into two distinct lines. The first is the SPARC T-series of multi-core, heavily multi-threaded processors, of which the T4 is the latest generation, with 8 cores and 8 threads per core at a clock frequency of 2.5–3 GHz. This is the processor line used by Oracle; it is, however, not intended for HPC use. The second is the SPARC64 line developed by Fujitsu, which has features geared towards heavy computation. The current generation is the SPARC64 IXfx.

    This processor was first produced at the end of 2011 with a feature size of 40 nm. It operates at a clock frequency of 1.848 GHz, somewhat lower than that of its predecessor, the SPARC64 VIIIfx, which ran at 2 GHz. However, the number of cores has been doubled from 8 to 16. This results in a power consumption of only 110 W at a peak performance of 236.5 Gflop/s. In many respects the SPARC64 IXfx resembles the VIIIfx, but there are some differences in the processor structure and in the instruction set that are meant to speed up floating-point computation. The chip layout is shown in Figure 18. The off-chip bandwidth to the memory is very high: 85 GB/s.
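
The figures quoted above can be combined into two common balance metrics: energy efficiency (Gflop/s per Watt) and memory balance (bytes of memory traffic available per floating-point operation). The following is a minimal sketch; the function names are illustrative and not taken from the original text, and the input values are simply the ones stated in this section.

```python
# Figures quoted in the text for the SPARC64 IXfx.
PEAK_GFLOPS = 236.5        # theoretical peak performance, Gflop/s
POWER_WATT = 110.0         # power consumption, W
MEM_BANDWIDTH_GBS = 85.0   # off-chip memory bandwidth, GB/s


def gflops_per_watt(peak_gflops: float, power_watt: float) -> float:
    """Energy efficiency at theoretical peak, in Gflop/s per Watt."""
    return peak_gflops / power_watt


def bytes_per_flop(bandwidth_gbs: float, peak_gflops: float) -> float:
    """Memory balance: bytes that can be moved per floating-point operation."""
    return bandwidth_gbs / peak_gflops


if __name__ == "__main__":
    eff = gflops_per_watt(PEAK_GFLOPS, POWER_WATT)
    bal = bytes_per_flop(MEM_BANDWIDTH_GBS, PEAK_GFLOPS)
    print(f"Efficiency: {eff:.2f} Gflop/s per Watt")
    print(f"Balance:    {bal:.2f} bytes per flop")
```

With the values above this gives roughly 2.15 Gflop/s per Watt and 0.36 bytes per flop. Both are theoretical peak ratios and say nothing about sustained performance.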

    Figure 18: Processor structure of the SPARC64 IXfx.

    Figure 19 shows a block diagram of a core of the SPARC64 IXfx. The L1 instruction and data caches are 32 KB each and both are 2-way set-associative. The IXfx has no L3 cache. A feature that cannot be displayed is the extension of the instruction set with vector instructions, which greatly reduce the overhead of vectorisable code, as is demonstrated in [23]. Furthermore, there is a hardware retry mechanism that re-executes instructions that were subject to single-bit errors.
    The Memory Management Unit (not shown in Figure 19) contains separate sets of Translation Look-aside Buffers (TLBs) for instructions and for data. Each set is composed of a 32-entry μTLB and a 1024-entry main TLB. The μTLBs are accessed via high-speed pipelines by their respective caches.
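
To give a feel for what these TLB sizes mean, the sketch below computes the "reach" of a TLB: the amount of memory it can map without a miss, which is the number of entries times the page size. The text does not state the page size in use, so the 8 KiB value below is purely an illustrative assumption (it is the base page size of the SPARC V9 architecture; larger pages would increase the reach proportionally). The function name is hypothetical.

```python
MICRO_TLB_ENTRIES = 32     # μTLB entries per set, from the text
MAIN_TLB_ENTRIES = 1024    # main TLB entries per set, from the text

# ASSUMPTION: page size is not given in the text; 8 KiB is illustrative only.
ASSUMED_PAGE_SIZE = 8 * 1024


def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory (in bytes) that can be mapped by a TLB without a miss."""
    return entries * page_size_bytes


if __name__ == "__main__":
    micro = tlb_reach_bytes(MICRO_TLB_ENTRIES, ASSUMED_PAGE_SIZE)
    main = tlb_reach_bytes(MAIN_TLB_ENTRIES, ASSUMED_PAGE_SIZE)
    print(f"μTLB reach:     {micro // 1024} KiB")
    print(f"main TLB reach: {main // (1024 * 1024)} MiB")
```

Under the assumed 8 KiB pages, the μTLB covers 256 KiB and the main TLB covers 8 MiB.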

    Figure 19: Block diagram of the Fujitsu SPARC64 IXfx processor core.

    There is also an Instruction Buffer (IBF) that contains up to 48 4-byte instructions and continues to feed the registers through the Instruction Word Register when an L1 I-cache miss has occurred. A maximum of four instructions can be scheduled each cycle and find their way via the reservation stations for address generation (RSA), integer execution units (RSE), and floating-point units (RSF) to the registers. The general register file serves both the two Address Generation units, EAG-A and EAG-B, and the two Integer Execution units, EX-A and EX-B. The latter two are not equivalent: only EX-A can execute multiply and divide instructions. The floating-point register file (FPR) has, like the GPR, been extended: from 64 entries to 256. This greatly helps in heavy loop unrolling and in the vector operations. The FPR feeds the four floating-point units FL-A, FL-B, FL-C, and FL-D, all of which are capable of performing fused multiply-add operations. Consequently, a maximum of 8 floating-point results per cycle can be generated. The feedback from the execution units to the registers is decoupled by update buffers: GUB for the general registers and FUB for the floating-point registers.
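
The numbers in the paragraph above imply a few simple derived quantities, sketched here. The function names are illustrative only. The "cycles covered" figure is a lower bound that assumes a full buffer being drained at the maximum rate of four instructions every cycle; real code will usually issue fewer instructions per cycle and so be covered for longer.

```python
IBF_ENTRIES = 48          # instructions held by the Instruction Buffer
INSTRUCTION_BYTES = 4     # size of one instruction
ISSUE_WIDTH = 4           # maximum instructions scheduled per cycle
FMA_UNITS = 4             # FL-A, FL-B, FL-C, FL-D
FLOPS_PER_FMA = 2         # a fused multiply-add yields a multiply and an add


def ibf_capacity_bytes(entries: int, instruction_bytes: int) -> int:
    """Total size of the instruction buffer in bytes."""
    return entries * instruction_bytes


def min_cycles_covered(entries: int, issue_width: int) -> int:
    """Fewest cycles a full buffer can keep issuing (ceiling division)."""
    return -(-entries // issue_width)


def fp_results_per_cycle(fma_units: int, flops_per_fma: int) -> int:
    """Maximum floating-point results per cycle per core."""
    return fma_units * flops_per_fma


if __name__ == "__main__":
    print(ibf_capacity_bytes(IBF_ENTRIES, INSTRUCTION_BYTES), "bytes in the IBF")
    print(min_cycles_covered(IBF_ENTRIES, ISSUE_WIDTH), "cycles covered at full issue rate")
    print(fp_results_per_cycle(FMA_UNITS, FLOPS_PER_FMA), "floating-point results per cycle")
```

This yields a 192-byte buffer, at least 12 cycles of instruction supply during an I-cache miss, and the 8 floating-point results per cycle quoted in the text.
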
    What cannot be shown in the diagrams is that, like the IBM and Intel processors, the SPARC64 IXfx is multi-threaded: dual-threaded in this case. As already remarked, the floating-point units are capable of a fused multiply-add operation, like those of the POWER processors. The theoretical peak performance with a clock frequency of 1.848 GHz, 16 cores per processor, and 8 floating-point results per clock cycle is therefore 236.5 Gflop/s per processor.
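
The peak performance figure follows directly from multiplying the three quantities named above. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def peak_gflops(clock_ghz: float, cores: int, results_per_cycle: int) -> float:
    """Theoretical peak in Gflop/s: clock (GHz) x cores x results per cycle."""
    return clock_ghz * cores * results_per_cycle


if __name__ == "__main__":
    # Values for the SPARC64 IXfx as given in the text.
    peak = peak_gflops(clock_ghz=1.848, cores=16, results_per_cycle=8)
    print(f"Theoretical peak: {peak:.1f} Gflop/s per processor")
```

The product is 236.544 Gflop/s, which rounds to the 236.5 Gflop/s quoted. Note that this peak is only attainable when every floating-point unit issues a fused multiply-add in every cycle.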