
Intel Pentium 4

Although Pentium processors are not used in integrated parallel systems these days, they play a major role in the cluster community, as most compute nodes in Beowulf clusters are of this type. We therefore briefly discuss this type of processor as well.
Intel provides only scant information on its processor, so a rough block diagram of the P4 processor can only be synthesized from various sources. It is shown in Figure 12.

Figure 12: Block diagram of the Intel Pentium 4.

There are a number of distinctive features with respect to the earlier Pentium generations. There are two main ways to increase the performance of a processor: raising the clock frequency and increasing the number of instructions per cycle (IPC). These two approaches are generally in conflict: increasing the IPC makes the chip more complicated, which has a negative impact on the clock frequency because more work has to be done and organised within the same clock cycle. Chip designers seldom succeed in raising both clock frequency and IPC simultaneously, and the Pentium 4 is no exception. Intel has opted for a high clock speed (initially about 40% higher than that of the Pentium III with the same fabrication technology) while the IPC decreased by 10-20%. This still gives a net performance gain, even if no other changes had been made to the processor.
To sustain the very high clock rate of the present processors, currently > 2 GHz, a very deep instruction pipeline is required. The instruction pipeline has no fewer than 20 stages, double the number in the Pentium III. Although this favours a high clock rate, the penalty for a pipeline miss (e.g., a branch mispredict) is much heavier, and Intel has therefore improved the branch prediction by increasing the size of the Branch Target Buffer from 0.5 to 4 KB.
In addition, the Pentium 4 has an execution trace cache that holds partly decoded instructions of former execution traces. These can be drawn upon, thus forgoing the instruction decode phase that might produce holes in the instruction pipeline. The allocator dispatches the decoded instructions, the "micro operations" (µops), to the appropriate µop queue: one for memory operations, another for integer and floating-point operations.
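The trade-off between clock frequency and IPC, and the heavier misprediction penalty of a deeper pipeline, can be made concrete with a little arithmetic. The sketch below is only illustrative: the 40% clock gain and the 10-20% IPC loss come from the text, while the base CPI, the branch fraction, the misprediction rate, and the approximation that the penalty equals the pipeline depth are assumptions chosen for the example, not measured Pentium figures.

```python
def net_speedup(clock_ratio, ipc_ratio):
    """Relative performance = relative clock frequency x relative IPC."""
    return clock_ratio * ipc_ratio


# Figures from the text: clock about 40% higher, IPC 10% to 20% lower.
best = net_speedup(1.40, 0.90)   # IPC down by 10%
worst = net_speedup(1.40, 0.80)  # IPC down by 20%
print(f"net gain between {worst:.2f}x and {best:.2f}x")


def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate,
                           penalty_cycles):
    """Simple average-CPI model: every mispredicted branch costs a flush."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# ASSUMPTIONS (hypothetical, for illustration only): base CPI of 1.0,
# 20% branches, 5% of them mispredicted, and a penalty roughly equal
# to the pipeline depth (10 stages versus 20 stages).
cpi_shallow = cycles_per_instruction(1.0, 0.20, 0.05, 10)
cpi_deep = cycles_per_instruction(1.0, 0.20, 0.05, 20)
print(f"CPI with 10-stage pipeline: {cpi_shallow:.2f}")
print(f"CPI with 20-stage pipeline: {cpi_deep:.2f}")
```

Under these assumed numbers the deeper pipeline loses noticeably more cycles to mispredicted branches, which is why better branch prediction matters more as the pipeline gets longer.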
Two integer Arithmetic/Logical Units (ALUs) are kept simple so that they can run at twice the clock speed. In addition, there is an ALU for complex integer operations that cannot be executed within one cycle. There is only one floating-point functional unit, which delivers one result per cycle. However, besides the normal floating-point unit, there are additional units that execute the Streaming SIMD Extensions 2 (SSE2) repertoire, a 144-member instruction set meant especially for multimedia and 3-D visualisation applications. The operands for these units are 128 bits long. The Intel compilers are able to address the SSE2 units, which in principle makes it possible to achieve twice the floating-point performance, since a 128-bit operand holds two 64-bit floating-point numbers.
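The factor of two follows directly from the operand width. As a rough sketch, the number of floating-point values per 128-bit operand and the resulting theoretical peak rate can be computed as follows. The 2 GHz clock and the assumption that the SSE2 units complete one packed operation per cycle are illustrative choices, not figures from Intel.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements that fit in one SIMD operand."""
    return register_bits // element_bits


def peak_flops(clock_hz, results_per_cycle, lanes):
    """Theoretical peak floating-point results per second."""
    return clock_hz * results_per_cycle * lanes


double_lanes = simd_lanes(128, 64)  # 64-bit floating-point numbers
single_lanes = simd_lanes(128, 32)  # 32-bit floating-point numbers

# ASSUMPTIONS (hypothetical): a 2 GHz clock and one packed SSE2
# operation completed per cycle.
clock_hz = 2.0e9
scalar_peak = peak_flops(clock_hz, 1, 1)
sse2_peak = peak_flops(clock_hz, 1, double_lanes)

print(f"64-bit values per operand: {double_lanes}")
print(f"32-bit values per operand: {single_lanes}")
print(f"peak ratio SSE2/scalar:    {sse2_peak / scalar_peak:.1f}")
```

This is an upper bound only: the gain is reached just for code that the compiler manages to vectorise.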
The primary cache is quite small by today's standards: 8 KB. This is again to accommodate the high clock speed. With a cache of this size it is possible to have a latency of 2 cycles, where it was 3 cycles in the Pentium III. The secondary cache has a size of 256 KB and a wide 256-bit bus, which amounts to a bandwidth of 54.4 GB/s (32 bytes per cycle, which corresponds to a clock frequency of 1.7 GHz). The memory bandwidth has also improved significantly over that of the Pentium III: although the bus cycle frequency is 133 MHz, four transactions can be done per cycle, making it effectively a 533 MHz bus. This should give quite an improvement for codes that cannot be kept in cache.
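The bandwidth figures can be checked with a short calculation. In the sketch below, the 256-bit bus width, the 133 MHz bus frequency, and the four transactions per cycle come from the text. The 1.7 GHz clock is an assumption, obtained by working backwards from the quoted 54.4 figure, and 133.33 MHz is used as the nominal value behind the rounded "133 MHz".

```python
def bandwidth_bytes_per_s(bus_bits, transfers_per_s):
    """Peak bandwidth of a bus: width in bytes times transfers per second."""
    return bus_bits / 8 * transfers_per_s


# Secondary cache: 256-bit bus, one transfer per clock cycle.
# ASSUMPTION: a 1.7 GHz core clock, inferred from the quoted figure.
l2_bandwidth = bandwidth_bytes_per_s(256, 1.7e9)
print(f"L2 bandwidth: {l2_bandwidth / 1e9:.1f} GB/s")

# Memory bus: nominal 133.33 MHz cycle, four transactions per cycle.
bus_cycle_hz = 133.33e6
transactions_per_cycle = 4
effective_bus_hz = bus_cycle_hz * transactions_per_cycle
print(f"effective bus rate: {effective_bus_hz / 1e6:.0f} MHz")
```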
How much of this potential is realised will depend heavily on the availability of compilers that can take advantage of all the facilities present in the P4 processor. If such compilers are available, the processor could form a good basis for an HPC platform.





Aad van der Steen
Mon Jul 29 13:57:44 MDT 2002