|
The SPARC processors

Since Sun was taken over by Oracle, the development of the SPARC processor family has been split into two distinct lines. The first is the SPARC T-series: multi-core, heavily multi-threaded processors, of which the present T4 generation is the latest, with 8 cores and 8 threads per core at a clock frequency of 2.5–3 GHz. This is the processor line used by Oracle. The processors in this line, however, are not meant for HPC use. The other line is the SPARC64 line of processors developed by Fujitsu, which has features that are geared towards heavy computation. The current generation is the SPARC64 IXfx.

This processor was first produced at the end of 2011 with a feature size of 40 nm. It is operated at a clock frequency of 1.848 GHz, somewhat lower than that of its predecessor, the VIIIfx, which ran at 2 GHz. However, the number of cores has been doubled from 8 to 16. This results in a power usage of only 110 Watt at a peak performance of 236.5 Gflop/s. In many respects the SPARC64 IXfx resembles its predecessor, the SPARC64 VIIIfx, but there are some differences in the processor structure and in the instruction set that are meant to speed up floating-point computation. The chip layout is shown in Figure 18. The off-chip bandwidth to the memory is very high: 85 GB/s.
Figure 18: Processor structure of the SPARC64 IXfx.
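The quoted peak performance can be checked with a short calculation. The sketch below assumes the usual peak-rate estimate (cores × clock × floating-point results per cycle), which the text does not spell out, and takes the 8 results per cycle from the four fused multiply-add units per core described further on in this section. The variable and function names are illustrative only; the derived Gflop/s-per-Watt and byte-per-flop figures follow from the numbers given in the text and are not quoted from the source.

```python
# Figures taken from the text; the formula is the customary peak-rate
# estimate and is an assumption, not something the source states.
CORES = 16
CLOCK_GHZ = 1.848
FMA_UNITS_PER_CORE = 4    # FL-A, FL-B, FL-C, FL-D
FLOPS_PER_FMA = 2         # one multiply plus one add per fused operation
POWER_WATT = 110
MEM_BANDWIDTH_GBS = 85


def peak_gflops(cores, clock_ghz, fma_units, flops_per_fma=2):
    """Theoretical peak in Gflop/s: every FMA unit busy in every cycle."""
    return cores * clock_ghz * fma_units * flops_per_fma


peak = peak_gflops(CORES, CLOCK_GHZ, FMA_UNITS_PER_CORE, FLOPS_PER_FMA)
gflops_per_watt = peak / POWER_WATT          # power efficiency at peak
bytes_per_flop = MEM_BANDWIDTH_GBS / peak    # memory balance at peak

print(f"peak performance: {peak:.1f} Gflop/s")
print(f"power efficiency: {gflops_per_watt:.2f} Gflop/s per Watt")
print(f"memory balance:   {bytes_per_flop:.2f} byte/flop")
```

With the numbers above this reproduces the 236.5 Gflop/s given in the text, and it yields roughly 2.15 Gflop/s per Watt and 0.36 byte/flop at peak.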
Figure 19 shows a block diagram of a core of the
SPARC64 IXfx.
The L1 instruction and data caches are both 32 KB and 2-way set-associative.
The IXfx version has no L3 cache. A feature that cannot be displayed is the
extension of the instruction set with vector instructions, which greatly reduce
the overhead of vectorisable code, as is demonstrated in [23]. Furthermore, there is a hardware retry
mechanism that re-executes instructions that were subject to single-bit
errors.
Figure 19: Block diagram of the Fujitsu SPARC64 IXfx processor core.
There is also an Instruction Buffer (IBF) that contains up to 48 4-byte
instructions and continues to feed the registers through the Instruction Word
Register when an L1 I-cache miss has occurred. A maximum of four instructions
can be scheduled each cycle and find their way via the reservation stations for
address generation (RSA), integer execution units (RSE), and floating-point
units (RSF) to the registers. The general register file serves both the two
Address Generation units, EAG-A and -B, and the Integer Execution units, EX-A and
-B. The latter two are not equivalent: only EX-A can execute multiply and divide
instructions. The floating-point register file (FPR) has, like the GPR, been
extended: from 64 entries to 256. This greatly helps in heavy loop unrolling and
in the vector operations. The FPR feeds the four floating-point units FL-A,
FL-B, FL-C, and FL-D that all are capable of performing fused multiply-add
operations. Consequently, a maximum of 8 floating-point results/cycle can be
generated. The feedback from the execution units to the registers is decoupled
by update buffers: GUB for the general registers and FUB for the floating-point
registers. |