Although Pentium processors are not used in integrated parallel
systems these days, they play a major role in the cluster community, as
most compute nodes in Beowulf clusters are of this type. We therefore
also briefly discuss this type of processor.
Intel provides only scant information on its processors, so a rough
block diagram of the P4 processor can only be synthesized from
various sources. It is shown in Figure 12.
There are a number of distinctive features with respect to the earlier
Pentium generations. There are two main ways to increase the
performance of a processor: by raising the clock frequency and by
increasing the number of instructions per cycle (IPC). These two
approaches are generally in conflict: when one wants to increase the
IPC the chip will become more complicated. This will have a negative
impact on the clock frequency because more work has to be done and
organised within the same clock cycle. Only very seldom do chip
designers succeed in raising both clock frequency and IPC
simultaneously, and the Pentium 4 is no exception. Intel has opted for a
high clock speed (initially about 40% more than that of the Pentium III
with the same fabrication technology) while the IPC decreased by 10 to
20%. This still gives a net performance gain even if no other changes
had been made to the processor. To sustain the very high clock rate that
the present processors have, currently above 2 GHz, a very deep
instruction pipeline is required. The instruction pipeline has no fewer
than 20 stages, double the number of stages in that of the Pentium III.
Although this favours a high clock rate, the penalty for a pipeline
miss (e.g., a branch mis-predict) is much heavier. Intel has therefore
improved the branch prediction by increasing the size of the
Branch Target Buffer from 0.5 to 4 KB. In addition, the Pentium 4 has
an execution trace cache which holds partly decoded instructions of
former execution traces that can be drawn upon, thus forgoing the
instruction decode phase that might produce holes in the instruction
pipeline. The allocator dispatches the decoded instructions, "micro
operations", to the appropriate µop queue: one for memory
operations, another for integer and floating-point operations.
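
The clock-versus-IPC trade-off described above can be checked with a few lines of arithmetic. The sketch below assumes the simple model that performance scales with the product of clock frequency and IPC; the function name `relative_performance` is illustrative only, and the ratios are the ones quoted in the text.

```python
# Minimal sketch: performance modelled as clock frequency times IPC.
# This model and the function name are assumptions made for illustration.

def relative_performance(clock_ratio, ipc_ratio):
    """Return the speed-up over a baseline for given clock and IPC ratios."""
    return clock_ratio * ipc_ratio

clock_ratio = 1.40        # about 40% higher clock than the Pentium III
ipc_loss = (0.10, 0.20)   # IPC decreased by 10 to 20%

best = relative_performance(clock_ratio, 1.0 - ipc_loss[0])
worst = relative_performance(clock_ratio, 1.0 - ipc_loss[1])

print(f"net gain: {100 * (worst - 1):.0f}% to {100 * (best - 1):.0f}%")
```

Under this model the net gain lies between roughly 12% and 26%, which is why the lower IPC does not cancel the benefit of the higher clock.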
Figure 12: Block diagram of the Intel Pentium 4.
Two integer Arithmetic/Logical Units (ALUs) are kept simple so that
they can run at twice the clock speed. In addition there is an ALU
for complex integer operations that cannot be executed within one
cycle. There is only one Floating-point functional unit that delivers
one result per cycle. However, besides the normal Floating-point Unit,
there are also additional units that execute the Streaming SIMD
Extensions 2 (SSE2) repertoire of instructions, a 144-member
instruction set that is meant especially for multimedia and 3-D
visualisation applications. The length of the operands for these units
is 128 bits, so one operand can hold two 64-bit floating-point values.
The Intel compilers are able to target the SSE2 units, which in
principle makes it possible to achieve a two times higher
floating-point performance.
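
The factor of two follows from the operand length: a 128-bit operand holds two 64-bit floating-point values, so one packed instruction does the work of two scalar ones. The sketch below is a pure-Python emulation of that counting argument, not real SSE2 code; the helper names `lanes` and `packed_add` are hypothetical.

```python
# Emulation of packed arithmetic on 128-bit operands (illustrative only;
# no SSE2 instructions are executed here).

SSE2_OPERAND_BITS = 128   # operand length stated in the text

def lanes(element_bits):
    """Number of elements of the given width that fit in one operand."""
    return SSE2_OPERAND_BITS // element_bits

def packed_add(a, b):
    """Emulate one packed add: all lanes are summed by one instruction."""
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
width = lanes(64)         # two double-precision values per operand

scalar_instructions = len(x)   # one add per element without SSE2
packed_instructions = 0
result = []
for i in range(0, len(x), width):
    result += packed_add(x[i:i + width], y[i:i + width])
    packed_instructions += 1

print(result, scalar_instructions, packed_instructions)
```

For 32-bit operands the same count gives four lanes per operand. In practice the gain depends on whether the compiler can vectorise the loop at all.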
The primary cache is quite small by today's standards: 8 KB. This is
again to accommodate the high clock speed. With a cache of this size a
latency of two cycles is possible, where it was three cycles in the
Pentium III. The secondary cache has a size of 256 KB and has a wide
256-bit bus, which amounts to a bandwidth of 54.4 GB/s. The memory
bandwidth has also improved significantly over that of the Pentium
III: although the bus cycle frequency is 133 MHz, four transfers can be
done per bus cycle, making it effectively a 533 MHz bus. This should
give quite an improvement for codes that cannot be kept in cache.
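
The bandwidth figures above can be reproduced with simple arithmetic. The sketch below reads the quoted 54.4 figure as gigabytes per second, which is consistent with 32 bytes per cycle at a 1.7 GHz clock. The 64-bit width of the memory bus is an assumption that is not stated in the text, and the function name is illustrative.

```python
# Back-of-the-envelope bandwidth arithmetic (a sketch under stated assumptions).

def bandwidth_bytes_per_s(bus_bits, transfers_per_s):
    """Peak bandwidth of a bus that moves bus_bits per transfer."""
    return bus_bits / 8 * transfers_per_s

# Secondary cache: 256-bit bus, 54.4e9 bytes/s quoted (read as GB/s).
l2_bytes_per_cycle = 256 // 8
implied_clock_hz = 54.4e9 / l2_bytes_per_cycle

# Memory bus: 133 MHz bus cycle, four transfers per cycle.
effective_bus_hz = 4 * 133.33e6

ASSUMED_BUS_BITS = 64   # assumption: bus width is not given in the text
memory_bw = bandwidth_bytes_per_s(ASSUMED_BUS_BITS, effective_bus_hz)

print(f"implied clock: {implied_clock_hz / 1e9:.1f} GHz")
print(f"effective bus rate: {effective_bus_hz / 1e6:.0f} MHz")
print(f"peak memory bandwidth: {memory_bw / 1e9:.2f} GB/s")
```

Under the 64-bit assumption the peak memory bandwidth comes to a little over 4 GB/s, more than an order of magnitude below the secondary cache figure, which is why codes that do not fit in cache remain sensitive to bus speed.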
The performance obtained in practice will depend heavily on the
availability of compilers that are able to take advantage of all the
facilities present in the P4 processor. But if they can, the processor
could form a good basis for any HPC platform.