The SPARC processors

The situation with respect to the SPARC processors is rather unclear. Sun markets systems based on the UltraSPARC IV+ that it positions in the HPC market. By April 2004 Sun had shelved its own plans to produce UltraSPARC V and VI processors in favour of processor designs with many (≥ 8) processor cores, each capable of handling several execution threads. This so-called Rock processor is still some time away, which leaves the SPARC development in the hands of its partner Fujitsu, which will advance with its own SPARC64 implementation. Fujitsu markets a server based on the latter processor but at the same time also sells a newer, similar system based on the Itanium 2 processor. So, it is far from sure what the future of the various SPARC implementations will be. Nevertheless, systems with these processors are still here and therefore we discuss them in this section.

The UltraSPARC IV+

The UltraSPARC IV+ is the fifth generation of the UltraSPARC family. Like virtually all processor makers, Sun has put two processor cores on a chip from the UltraSPARC IV on. The CPU cores in the UltraSPARC IV were in fact slightly modified UltraSPARC III processors. The UltraSPARC IV+ is a technology shrink of the UltraSPARC IV, fabricated with a 90 nm feature size. We show a block diagram of the processor core and its embedding in the UltraSPARC IV chip in Figure 13.

Figure 13: Block diagram of the UltraSPARC IV+ processor core.

The processor is characterised by a large number of caches of various sorts, as can be seen in Figure 13. Apart from a 4-way set-associative cache of 64 KB, the Data Cache Unit (DCU) also contains a write cache and a prefetch cache, both of 2 KB. All these L1 caches operate at half speed: loads and stores from the processor can be done in 2 cycles. The prefetch cache is independent of the data cache and can load data when this is deemed appropriate. The write cache defers writes to the L2 cache and so may avoid unnecessary writes of individual bytes until entire cache lines have to be updated. The Instruction Issue Unit (IIU) contains a 64 KB 4-way set-associative instruction cache together with the instruction TLB, which is called the Instruction Translation Buffer in Sun's terminology. Thanks to the technology shrink, the size of the instruction cache could be doubled in comparison with the UltraSPARC III and IV. In addition, a 2 MB L2 cache and an L3 tag cache could be placed on the chip, while a 32 MB L3 cache was added off chip. The IIU also contains a so-called miss queue that holds instructions that are immediately available for the execute units when a branch has been mispredicted. Branch prediction is fully static in the UltraSPARC III. It is implemented as a 16 KB table in the IIU that is pipelined because of its size.
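The effect of the write cache described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the 64-byte line size, the oldest-first eviction, and the names `WriteCache`, `store`, `flush`, and `demo` are all hypothetical and are not taken from Sun's documentation. Only the 2 KB capacity comes from the text.

```python
LINE_SIZE = 64  # bytes per cache line; an assumed value, not given in the text


class WriteCache:
    """Toy write-combining buffer sitting in front of an L2 cache."""

    def __init__(self, capacity_bytes=2048, line_size=LINE_SIZE):
        self.line_size = line_size
        self.max_lines = capacity_bytes // line_size
        # line number -> set of byte offsets written; dicts keep insertion order
        self.lines = {}
        self.l2_writes = 0  # number of line-sized writes sent to the L2 cache

    def store(self, addr):
        """Record a single-byte store; merge it into an already buffered line."""
        line = addr // self.line_size
        if line not in self.lines:
            if len(self.lines) == self.max_lines:
                self._evict_oldest()
            self.lines[line] = set()
        self.lines[line].add(addr % self.line_size)

    def _evict_oldest(self):
        # Assumed policy: write back the line that was allocated first.
        oldest = next(iter(self.lines))
        del self.lines[oldest]
        self.l2_writes += 1

    def flush(self):
        """Write back every buffered line, one L2 write per line."""
        self.l2_writes += len(self.lines)
        self.lines.clear()


def demo():
    wc = WriteCache()
    for addr in range(256):  # 256 single-byte stores covering 4 lines
        wc.store(addr)
    wc.flush()
    return wc.l2_writes
```

In this model the 256 byte stores issued by `demo()` reach the L2 cache as 4 line writes, which is the kind of saving the write cache is meant to give.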

The Integer Execute Unit (IEU) has two Add/Logical Units and a branch unit. Integer adds and multiplies are pipelined but the divide operation is not. It is performed by an Arithmetic Special Unit (not shown in the figure) that does not burden the pipelines for the ALUs. The integer register file is effectively divided in two, which Sun calls the Working and the Architectural Register File. Operands are accessed and results stored in the working registers. When an exception occurs, the results to be undone in the working registers are overwritten by those from the architectural file. One of the enhancements with respect to the original UltraSPARC III design is the addition of hash indexing for the write cache. This should decrease the number of write misses and thus leave more store bandwidth for results that need storing.
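The interplay between the working and architectural registers can be sketched as follows. This is a simplified illustration, not Sun's implementation: the class name `RegisterFile`, its methods, the register count of 32, and the all-at-once `commit` are assumptions made for the example.

```python
class RegisterFile:
    """Toy model of a working/architectural register file pair."""

    def __init__(self, n_regs=32):  # register count is an assumed value
        self.working = [0] * n_regs        # holds results as they are produced
        self.architectural = [0] * n_regs  # holds the last committed state

    def read(self, reg):
        # Operands are always taken from the working registers.
        return self.working[reg]

    def write(self, reg, value):
        # Results go to the working registers first.
        self.working[reg] = value

    def commit(self):
        # Simplification: commit the complete working state in one step.
        self.architectural = list(self.working)

    def exception(self):
        # Undo uncommitted results by copying the architectural state back.
        self.working = list(self.architectural)


rf = RegisterFile()
rf.write(1, 10)
rf.commit()       # r1 = 10 is now architectural state
rf.write(1, 99)   # uncommitted result
rf.write(2, 7)    # uncommitted result
rf.exception()    # both uncommitted results are discarded
```

After the exception, `rf.read(1)` yields the committed value 10 again and `rf.read(2)` is back at its initial value.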

The floating-point unit (FPU) has two independent pipelined units for addition and multiplication and a non-pipelined unit for floating-point division and square-root computation, which require on the order of 20-25 cycles. The FPU also contains graphics hardware (not shown in Figure 13) that shares the pipelined adder and multiplier with general 64-bit calculations. For the chips delivered at 1.5 GHz, the theoretical peak performance is 3.0 Gflop/s per processor core. It is expected that the UltraSPARC IV+ technology can be shrunk to reach a slightly higher clock frequency by the end of its life cycle. In the UltraSPARC IV+ the FPUs are enhanced by adding hardware support for handling IEEE 754 floating-point errors (which can otherwise be very costly to handle properly).

As is evident from Figure 14, the Memory Control Unit (MCU) is on chip, as are the L3 cache controller (in the MCU) and the L3 cache tags. This shortens the latency of accesses from all memory levels. In addition, both controllers communicate with the System Interface Unit (SIU), which is also on chip, to keep in touch with the snoop pipe controller in the SIU. The processor has been built with multi-processing in mind and the snoop controller keeps track of data requests in the whole system to ensure coherency of the caches when required.

Figure 14: Chip layout of the UltraSPARC IV processor.

The UltraSPARC IV+ has been available since mid-2005. Sun refers to having the two processor cores on a chip and running one execution thread on each of them as Chip Multithreading (CMT). This is not quite what one would normally understand as multi-threading, because one would then expect more execution threads per processor core. So, the CMT terminology is somewhat confusing and one would hope that Sun will drop it in favour of the common use of the term.

SPARC64

For quite some time Fujitsu has been making its own SPARC implementation, called SPARC64. Presently the SPARC64 is in its fifth generation, the SPARC64 V. Obviously, the processor must be able to execute the SPARC instruction set, but the processor internals are rather different from Sun's implementation. Figure 15 shows a block diagram of the SPARC64 V.

Figure 15: Block diagram of the Fujitsu SPARC64 V processor.

Notwithstanding the mutual compatibility of the different SPARC implementations, there is quite some difference in the actual realisation of the processors, as can be seen by comparing the UltraSPARC IV+ processor core and the SPARC64 V diagrams in Figures 13 and 15, respectively. The L1 instruction and data caches are 128 KB each, two times larger than in the UltraSPARC IV+ core, and both are 2-way set-associative. There is also an Instruction Buffer (IBF) that contains up to 48 4-byte instructions and continues to feed the registers through the Instruction Word Register when an L1 I-cache miss has occurred. A maximum of four instructions can be scheduled each cycle and find their way via the reservation stations for address generation (RSA), integer execution units (RSE), and floating-point units (RSF) to the registers. The two general register files serve both the two Address Generation units, EAG-A and EAG-B, and the Integer Execution units, EX-A and EX-B. The latter two are not equivalent: only EX-A can execute multiply and divide instructions. There are also two floating-point register files (FPR) that feed the two Floating-Point units, FL-A and FL-B. These units differ from those of Sun in that they are able to execute fused multiply-add instructions, as is also the case in the POWER and Itanium processors. Consequently, a maximum of 4 floating-point results/cycle can be generated. In addition, FL-A and FL-B also perform divide and square root operations, in contrast to the UltraSPARC IV+, which has a separate unit for these operations. Because of their iterative nature the divide and square root operations are not pipelined. The feedback from the execution units to the registers is decoupled by update buffers: GUB for the general registers and FUB for the floating-point registers.

The dispatch of instructions via the reservation stations, each of which can hold 10 instructions, gives the opportunity of speculative dispatch, i.e., dispatching instructions whose operands are not yet ready at the moment of dispatch but will be by the time the instruction is actually executed. The assumption is that this results in a more even flow of instructions to the execution units.

The SPARC64 V does not have a third-level cache, but on chip there is a large (4 MB) unified L2 cache that is a 4-way set-associative write-through cache. Furthermore, the Memory Management Unit (not shown in Figure 15) contains separate sets of Translation Lookaside Buffers (TLBs) for instructions and for data. Each set is composed of a 32-entry µTLB and a 1024-entry main TLB. The µTLBs are accessed through high-speed pipelines by their respective caches.
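The two-level TLB organisation can be pictured with a small lookup model. It is a sketch under assumptions: only the entry counts (32 and 1024) come from the text, while the 8 KB page size, the LRU replacement, the dictionary standing in for the page table, and the names `TwoLevelTLB` and `translate` are hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 8192  # assumed page size in bytes, not stated in the text


class TwoLevelTLB:
    """Toy model of a small micro-TLB backed by a larger main TLB."""

    def __init__(self, micro_entries=32, main_entries=1024):
        self.micro = OrderedDict()  # virtual page number -> physical page number
        self.main = OrderedDict()
        self.micro_entries = micro_entries
        self.main_entries = main_entries

    @staticmethod
    def _insert(table, limit, vpn, ppn):
        table[vpn] = ppn
        table.move_to_end(vpn)
        if len(table) > limit:
            table.popitem(last=False)  # assumed LRU: drop least recently used

    def translate(self, vaddr, page_table):
        """Return (physical address, level that supplied the translation)."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.micro:
            self.micro.move_to_end(vpn)
            return self.micro[vpn] * PAGE_SIZE + offset, "micro"
        if vpn in self.main:
            ppn = self.main[vpn]
            self.main.move_to_end(vpn)
            level = "main"
        else:
            ppn = page_table[vpn]  # stands in for a full page-table walk
            self._insert(self.main, self.main_entries, vpn, ppn)
            level = "walk"
        self._insert(self.micro, self.micro_entries, vpn, ppn)
        return ppn * PAGE_SIZE + offset, level


page_table = {vpn: vpn + 100 for vpn in range(2000)}  # hypothetical mapping
tlb = TwoLevelTLB()
first = tlb.translate(0, page_table)    # misses both levels
second = tlb.translate(0, page_table)   # hits the micro-TLB
for vpn in range(1, 33):                # touch 32 other pages
    tlb.translate(vpn * PAGE_SIZE, page_table)
third = tlb.translate(5, page_table)    # page 0 left the micro-TLB, main TLB hits
```

The point of the arrangement is visible in the last lookup: a page that has dropped out of the small µTLB is still found in the main TLB, so no page-table walk is needed.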

At this moment the highest clock frequency available for the SPARC64 is 2.08 GHz. As already remarked, the floating-point units are capable of a fused multiply-add operation, like those of the POWER and Itanium processors, and so the theoretical peak performance is presently 8.32 Gflop/s. Fujitsu plans to bring out a dual-core SPARC64 VI by the end of 2006, in which the core is essentially the same as in the SPARC64 V, with a clock frequency of 2.4 GHz, and for 2007 even a quad-core SPARC64 VII is scheduled. However, not much about its structure is known yet.
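The peak performance figures quoted in this section follow from the clock frequency and the number of floating-point results per cycle. The short calculation below reproduces them. The function name `peak_gflops` and its parameters are our own, and counting a fused multiply-add as two floating-point operations is the usual convention assumed here.

```python
def peak_gflops(clock_ghz, fp_units, flops_per_unit_per_cycle):
    """Theoretical peak in Gflop/s: clock (GHz) x units x flops per unit per cycle."""
    return clock_ghz * fp_units * flops_per_unit_per_cycle


# UltraSPARC IV+ core: one pipelined adder and one pipelined multiplier,
# each delivering one result per cycle, at 1.5 GHz.
ultrasparc_iv_plus = peak_gflops(1.5, fp_units=2, flops_per_unit_per_cycle=1)

# SPARC64 V: units FL-A and FL-B, each able to issue a fused multiply-add
# (counted as 2 flops) per cycle, at 2.08 GHz.
sparc64_v = peak_gflops(2.08, fp_units=2, flops_per_unit_per_cycle=2)

print(f"UltraSPARC IV+ core: {ultrasparc_iv_plus:.2f} Gflop/s")
print(f"SPARC64 V:           {sparc64_v:.2f} Gflop/s")
```

These are theoretical upper bounds. They assume that every floating-point unit delivers a result in every cycle, which real codes will rarely achieve.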