
Intel Itanium 2

The Itanium 2 is the second generation of Intel's IA-64 family of 64-bit processors. Its predecessor, the Itanium, has been out for almost a year but has not spread widely, primarily because the Itanium 2 was expected to follow quickly with projected performance levels up to twice that of the first Itanium. At the time of writing the Itanium 2 is due to become available within 1-2 months. It should improve on several aspects of the first generation, in particular integer processing and cache/memory bandwidth.

The Itanium family of processors has characteristics that differ from those of the RISC chips presented elsewhere in this section. A block diagram of the Itanium 2 is shown in Figure 11.

Figure 11: Block diagram of the Intel Itanium 2.

The clock frequency of the Itanium 2 in the products to be shipped will be around 1 GHz. Figure 11 shows a large number of functional units that must be kept busy. This is done with large instruction words of 128 bits that contain three 41-bit instructions and a 5-bit template that aids in steering and decoding the instructions. This idea is inherited from the Very Large Instruction Word (VLIW) machines that were on the market for some time about ten years ago. Two instruction words are fetched per cycle, so up to six instructions per cycle can be dispatched. The Itanium also has in common with these systems that the scheduling of instructions, unlike in RISC processors, is not done dynamically at run time but rather by the compiler. The VLIW-like operation is enhanced with predicated execution, which makes it possible to execute instructions in parallel that normally would have to wait for the result of a branch test. Intel calls this refreshed VLIW mode of operation EPIC, Explicit Parallel Instruction Computing.
Furthermore, a load instruction can be moved ahead of a branch or a store, and the loaded variable can be used early, by placing a test at the position the load originally came from to check whether the operations have been valid. To keep track of these advanced loads, an Advanced Load Address Table (ALAT) records them. When the validity of an operation that depends on an advanced load is checked, the ALAT is searched. When no entry is present, the operation chain leading to the check is invalidated and the appropriate fix-up code is executed. Note that this fix-up code is generated at compile time, so no control speculation hardware is needed for this kind of speculative execution. Such hardware would become exceedingly complex given the many functional units that may be in operation simultaneously.
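
The instruction word layout described above can be illustrated with a small sketch. It only shows that three 41-bit instructions plus a 5-bit template fill 128 bits exactly, and that two such words per cycle give six instructions. The placement of the template in the low-order bits follows the usual IA-64 convention. The function names are hypothetical and chosen for this illustration only.

```python
# Toy model of the 128-bit instruction word (bundle) layout.
# Assumption: template in the 5 low-order bits, followed by slots 0, 1, 2.
# pack_bundle/unpack_bundle are illustrative names, not part of any real tool.

TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 3*41 = 128


def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit instructions into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, instr in enumerate(slots):
        if not 0 <= instr < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= instr << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into its template and three instructions."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


# Two bundles per cycle, three instructions each.
BUNDLES_PER_CYCLE = 2
INSTRUCTIONS_PER_CYCLE = BUNDLES_PER_CYCLE * SLOTS_PER_BUNDLE
```

Packing a bundle and unpacking it again returns the original template and instructions, and the largest possible bundle occupies exactly 128 bits.
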
As can be seen from Figure 11, there are four floating-point units capable of performing Fused Multiply Accumulate (FMAC) operations. However, only two of these work at the full 82-bit precision that is the internal standard on Itanium processors, while the other two can only be used for 32-bit precision operations. When working in the customary 64-bit precision, the Itanium 2 has a theoretical peak performance of 4 Gflop/s at a clock frequency of 1 GHz. Using 32-bit floating-point arithmetic, the peak is doubled. The first-generation Itanium had four integer units for integer arithmetic and other integer or character manipulations. Because the integer performance of that processor was modest, two integer units have been added. In addition, there are four MMX units to accommodate instructions for multimedia operations, an inheritance from the Intel Pentium processor family. For compatibility with the Pentium family, a special IA-32 decode and control unit is present.
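
The peak figures quoted above follow from simple arithmetic. The sketch below assumes that each FMAC unit delivers one multiply and one add (two floating-point operations) per clock cycle, which reproduces the 4 Gflop/s stated in the text. The helper name `peak_gflops` is hypothetical.

```python
# Theoretical peak performance from the number of usable FMAC units.
# Assumption: one fused multiply-add per unit per cycle, counted as 2 flops.

def peak_gflops(fmac_units, clock_ghz, flops_per_fmac=2):
    """Return the theoretical peak in Gflop/s."""
    return fmac_units * flops_per_fmac * clock_ghz


CLOCK_GHZ = 1.0

# 64-bit precision: only the two full-precision units can be used.
peak_64bit = peak_gflops(2, CLOCK_GHZ)

# 32-bit precision: all four units can be used.
peak_32bit = peak_gflops(4, CLOCK_GHZ)
```

With two units available for 64-bit work the result is 4 Gflop/s. With all four units active in 32-bit mode it is 8 Gflop/s, i.e. double.
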
The register files for integers and floating-point numbers are large: 128 registers each. However, only the first 32 entries of these register files are fixed, while entries 33-128 are implemented as a register stack. The primary data and instruction caches are 4-way set-associative and rather small: 16 KB each. This is the same as in the first Itanium processor. However, the L1 cache now runs at full speed, twice as fast as before: data and instructions can be delivered to the registers every clock cycle. Furthermore, the secondary cache has been enlarged from 96 KB to 256 KB and is 8-way set-associative. Moreover, the L3 cache has been moved onto the chip and is no less than 3 MB. This cache structure greatly improves the bandwidth to the processor core, on average by a factor of 3. This contributes more to the performance improvement than the relatively modest increase in clock speed from 800 MHz to 1 GHz. The bandwidth from/to memory has also increased by more than a factor of 3. The bus is now 128 bits wide and operates at a clock frequency of 400 MHz, totaling 6.4 GB/s in comparison to 2.1 GB/s for its predecessor.
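
The memory bandwidth quoted above can be checked from the bus width and clock frequency. The sketch below assumes one transfer per bus clock and decimal gigabytes (10^9 bytes). The predecessor's 2.1 GB/s is taken directly from the text rather than derived. The function name is hypothetical.

```python
# Bus bandwidth from width and clock frequency.
# Assumptions: one transfer per bus clock, 1 GB = 10**9 bytes.

def bus_bandwidth_gb_per_s(width_bits, clock_mhz):
    """Return the peak bus bandwidth in GB/s."""
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * clock_mhz / 1000


itanium2_bw = bus_bandwidth_gb_per_s(128, 400)  # 16 bytes * 400 MHz

# Figure for the first Itanium as quoted in the text.
ITANIUM1_BW = 2.1

improvement = itanium2_bw / ITANIUM1_BW
```

A 128-bit bus moves 16 bytes per transfer, so 400 million transfers per second give 6.4 GB/s, slightly more than three times the predecessor's figure.
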

The introduction of the first Itanium was deferred time and again, which dampened interest in its use in high-performance systems. With the availability of the Itanium 2 in the second half of 2002, adoption is expected to speed up. Apart from HP/Compaq, SGI, NEC and Fujitsu will also include these processors in their systems in the not too distant future, while phasing out the Alpha, PA-RISC, MIPS and SUN processors.





Aad van der Steen
Mon Jul 29 13:41:50 MDT 2002