Systems under development


Although we mainly want to discuss real, marketable systems and not experimental, special-purpose, or speculative machines, it is worthwhile to look ahead a little and try to see what may be in store for us in the near future.

Below we discuss developments that may lead to commercial systems introduced on the market from somewhat more than half a year to a year from now. The commercial systems that result will sometimes deviate significantly from the original research models, depending on the way the development is done (the approaches in Japan and the USA differ considerably in this respect) and on the user group that is targeted.

A development that was of significance at the time was the introduction of Intel's IA-64 Itanium processor family. Six vendors offer Itanium 2-based systems at the moment, and HP has ended the marketing of its Alpha- and PA-RISC-based systems in favour of the Itanium processor family. Likewise, SGI stopped further development of MIPS-based machines. The only vendor going against this trend is SiCortex, which re-introduced a MIPS-based machine. This means that the processor base for HPC systems is rather narrow. However, the shock caused in the USA by the advent of the Japanese Earth Simulator has helped to refuel the funding of alternative processor and computer architecture research, the consequences of which we have seen in the last few years.

In the section on accelerators we already noted the considerable interest generated by systems that provide acceleration by means of FPGAs or other special computational accelerators like those from ClearSpeed. In the near future an HPC vendor cannot afford not to include such accelerators in its architectures somehow. One also cannot expect general processor and HPC vendors to ignore this trend. They will either integrate the emerging accelerator capability into their systems (as is in the road maps of, e.g., Cray and SGI, see below), try to incorporate accelerating devices on the chips themselves (as seems to be the way Intel is going), or provide ways to tightly integrate accelerator hardware with a CPU and memory via a fast direct connection. The latter we already see with AMD processors, and it will shortly also be the case with Intel processors. We briefly review the status of these developments below.

Cray Inc.

At the end of 2002 the next-generation vector processor from Cray Inc., the X1, was ready to ship. It built on the technology found in the Cray SV-1s. Cray widely publicises a roadmap of future systems as far ahead as around 2010, primarily based on the Cascade project. This project was started with help of DARPA's High Productivity Computing Systems (HPCS) initiative, which has as one of its goals that systems delivering 10 Pflop/s sustained should be available by 2010. This entails not only the necessary hardware but also a (possibly new) language to productively program such systems. Cascade was Cray's answer to this initiative. Together with IBM, Cray has continuing support from the HPCS program (HP, SGI, and Sun have fallen out).
Cray seems reasonably on track with its Cascade project: the XT5h system is already capable of housing X2, XT5, and XR1 boards within one infrastructure, and it may well be that XMT processors will be included in following generations. At the moment, however, the different processor types cannot yet work together seamlessly in the sense that one can freely mix code in one program that will be dispatched to the most appropriate processor type. That is the ultimate goal, to be realised in a sequence of future systems. The follow-on systems bear imaginative names like "Baker" (about end 2008, begin 2009), "Granite", and "Marble", ultimately leading to a system that should be able to deliver 10 Pflop/s sustained by 2010.
Until recently Cray was to some extent dependent on AMD for its scalar processors. In fact, Cray was hurt somewhat by the delay of AMD's quad-core Barcelona processor, but this has not led to a major slip in the scheduled plans, and Cray has recently entered into discussions with Intel which may well lead to a change of processor in the scalar-type nodes, or at least to a diversification. Cray's interest in the Intel processors has presumably been fuelled by the adoption of the QuickPath interconnect in the next generation of Intel's processors, which would allow them to be integrated into the Cray systems in a way very similar to that used with AMD's HyperTransport interface.

IBM

IBM has been working for some years on its BlueGene systems. Many of the first models, the BlueGene/L, have been installed in the last few years. The BlueGene/L follow-up, the BlueGene/P, has now been available for about a year, and several /P systems have been installed in Europe as well as in the USA. Theoretically, the BlueGene/P can attain a peak speed of 3 Pflop/s, and the BlueGene/Q will have a peak speed of around 10 Pflop/s. The BlueGene systems are hardly meant for the average HPC user but rather for a few special application fields that are able to benefit from the massive parallelism required to apply such systems successfully.
Of course the development of the POWERx processors will also make its mark: the POWER6 processor has the usual technology-related advantages over its predecessor, and the first POWER6-based systems have been in operation for a few months. Furthermore, it is a subject of research how to couple 8 cores such that a virtual vector processor with a peak speed of around 120 Gflop/s can be made. This approach is called ViVA (Virtual Vector Architecture). It is reminiscent of Hitachi's SR8000 processors or the MSP processors in the late Cray X1E. This development will take some years and may appear in the POWER7 processor, extending to the next generation(s) of the POWERx line.
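The quoted ~120 Gflop/s figure follows from a standard peak-performance calculation. The per-core parameters below (4 flops per cycle from two fused multiply-add units, a 3.75 GHz clock) are illustrative assumptions chosen to reproduce the quoted number, not published POWER7 specifications:

```python
def peak_gflops(cores, flops_per_cycle, clock_ghz):
    """Theoretical peak in Gflop/s: cores x flops/cycle x cycles/s."""
    return cores * flops_per_cycle * clock_ghz

# 8 coupled cores, two FMA units per core (4 flops/cycle),
# an assumed 3.75 GHz clock:
viva_peak = peak_gflops(cores=8, flops_per_cycle=4, clock_ghz=3.75)
# -> 120.0 Gflop/s, matching the ViVA target figure
```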
The Cell processor, and in particular the PowerXCell 8i, will be further integrated into the systems provided by IBM. This has already been done in the Roadrunner systems with their tri-blade configuration and, as is evident from the IBM 1350 cluster structure, the current cluster systems can already include the accelerator blades along with other processor blades. As with Cray, it is not yet possible to have one code target an arbitrary mix of processor blades, but this could be a next step.
Like Cray, IBM is one of the two vendors still supported by DARPA's HPCS program. Although this support is less important for IBM than for Cray, parts of the research now done on porting applications to BlueGene-type systems, on the viability of the ViVA concept, and on the integration of Cell processors are certainly helped by it. A system based on the POWER7 should be IBM's answer to DARPA's request for a machine capable of 10 Pflop/s sustained performance in 2010.

Intel-based systems

All systems based on the Itanium line of processors, i.e., those from Bull, Fujitsu, Hitachi, HP, NEC, and SGI, are critically dependent on Intel's ability to deliver the Tukwila processor on time; it is slated for 2008. Not only will the number of cores in this processor double to four while the modest clock frequency goes up into the 2 GHz realm; most importantly, the processor will finally be rid of the front-side bus, which is a serious bottleneck in accessing data from memory. The Tukwila processor will use the QuickPath Interconnect (QPI), formerly known as the Common System Interface (CSI), which presumably will provide a bandwidth of over 25.6 GB/s. This is of course necessary because of the increased number of cores per chip. In addition, the QPI specification will be open, as AMD has done with its HyperTransport bus in the Torrenza initiative. This means that both low-latency networks and attached computational accelerators can be connected directly at high speed. This in turn will allow vendors to diversify their products, possibly optimising them for specific application areas, similar to Cray's future plans (see above).
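The 25.6 GB/s figure is consistent with a simple link-bandwidth calculation. The link parameters used below (a 6.4 GT/s transfer rate and a 2-byte payload per transfer per direction, counted over both directions) are our assumptions for illustration:

```python
def link_bandwidth_gb_s(transfer_rate_gt_s, bytes_per_transfer, directions=2):
    """Aggregate link bandwidth: transfers/s x payload bytes x directions."""
    return transfer_rate_gt_s * bytes_per_transfer * directions

# Assumed parameters: 6.4 GT/s with 16 data bits (2 bytes) per direction,
# summed over both directions of the link:
qpi_bw = link_bandwidth_gb_s(6.4, 2)  # -> 25.6 GB/s
```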
Furthermore, the QPI will also be available for the Xeon line of processors, the next one being the Nehalem processor, which would even allow mixing Itanium-based and Xeon-based system components. In fact SGI plans to do so in its Altix systems, and Hitachi already provides this in its BladeSymphony systems. Other vendors may follow this trend, as it is another way of system diversification.

Intel is not insensitive to the fast increase in the use of computational accelerators, GPUs in particular. An answer might be the many-core processors that are part of Intel's future plans. The Larrabee processor, expected to be available in 2009, unites general CPU cores and about 16 simple GPU-like 512-bit wide SIMD cores on a chip with a feature size of 45 nm at a clock frequency of about 2--2.5 GHz. The SIMD cores would be connected by a 1024-bit wide ring. In all, the structure is somewhat reminiscent of the Cell processor. A rough estimate of the performance would be about 1 Tflop/s (for the right type of computations), in the same league as the high-end GPUs now sold by ATI and NVIDIA. Intel is not expected to stop here and will develop other many-core processors in the near future.
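The rough 1 Tflop/s estimate follows directly from the stated core count, SIMD width, and clock frequency; the assumption of fused multiply-add (2 flops per lane per cycle) and the 2 GHz operating point are ours:

```python
def simd_peak_gflops(cores, simd_bits, element_bits, flops_per_lane, clock_ghz):
    """Peak Gflop/s of a many-core SIMD chip: cores x lanes x flops x clock."""
    lanes = simd_bits // element_bits   # SIMD lanes per core
    return cores * lanes * flops_per_lane * clock_ghz

# 16 cores, 512-bit SIMD on 32-bit floats (16 lanes/core), assumed FMA
# (2 flops/lane/cycle), at the lower end of the quoted 2--2.5 GHz range:
larrabee_est = simd_peak_gflops(16, 512, 32, 2, 2.0)
# -> 1024.0 Gflop/s, i.e., about 1 Tflop/s single precision
```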

SGI

SGI has plans more or less similar to Cray's Cascade project: coupling of heterogeneous processor sets through its proprietary network, in this case a successor of the NUMAlink4 network architecture. A first step in that direction is the availability of the so-called RASC blades that can be put into the Altix 4700 infrastructure. Each RASC blade features 2 FPGAs that can be used as computational accelerators for certain algorithms in applications. A next step is mixing Itanium-based and Xeon-based components in the same system. Once the Common System Interface is available (see above) this should be doable without excessive costs, because the CSI chipset will support both the Itanium and Xeon processor variants. The idea is to further diversify the future systems, ultimately into a system with the codename "Ultraviolet". Development of such systems is quite costly, and unlike Cray and IBM, SGI does not have support from DARPA's HPCS program, so it remains to be seen whether these plans will pass the stage of intentions, given SGI's present difficult financial position.

SUN

Like Cray and IBM, SUN had been awarded a grant from DARPA to develop so-called high-productivity systems in DARPA's HPCS program. In 2007 SUN fell out of this program and consequently determined to concentrate even more on developing heavily multi-threaded processors. The second generation of the Niagara chip, the T2, is in production and has been on the market for some time now. It supports 64 threads with 8 processor cores and has a floating-point unit attached to each core. This is a large improvement over the former T1, which had only one floating-point unit per chip. Still, the T2 is not geared for the HPC area. That role is reserved for Sun's Rock processor, which is to come out somewhere in 2008. It has 16 processor cores in a 4×4 grid on the chip, each core supporting 2 threads. Each 4-core part shares 32 KB of L1 data cache and L1 instruction cache together with 512 KB of integrated L2 cache. The L1 and L2 caches are connected by a 4×4 crossbar. The Rock processor will be geared to HPC work. As yet, however, no details about the clock frequency are available, so it is hard to estimate what the impact of the Rock processor will be. SUN plans to produce two server variants: "Pebble", a one-socket version, and "Boulder", which may have 2, 4, or 8 sockets.
There are several other novel features in the processor. For one, the dual threads in a core are used for "scouting", i.e., one thread runs ahead to look for possibilities to prefetch operands that will be needed by the other thread. When the active computation thread stalls for whatever reason, the roles are reversed. Furthermore, SUN will probably implement Transactional Memory in the Rock processor. This would enable exclusive memory access for the threads within a program without the need for explicit locks in the threads. This would greatly simplify many multi-threaded applications and incur much less overhead.
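The lock-free programming model that Transactional Memory offers can be illustrated with a small software sketch: a transaction reads a snapshot, computes speculatively without holding any lock, and commits only if no conflicting update intervened, retrying otherwise. The class and method names below are our own illustration; Rock's hardware TM would perform the conflict detection in the cache hardware, with no per-object bookkeeping in software:

```python
import threading

class VersionedCell:
    """A minimal software-transactional cell: optimistic read, validated
    commit, retry on conflict. Purely illustrative of the TM idea."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._commit_lock = threading.Lock()  # internal only; callers never lock

    def transact(self, update):
        while True:                       # retry loop on conflict
            v0 = self._version
            snapshot = self._value
            new_value = update(snapshot)  # speculative work, no locks held
            with self._commit_lock:       # brief publish step
                if self._version == v0:   # no conflicting commit occurred
                    self._value = new_value
                    self._version += 1
                    return new_value
            # another transaction committed first: retry with a fresh snapshot
```

Application code calls `transact` with a pure update function and never takes an explicit lock, which is exactly the simplification the text describes.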
For the mainstream market SUN continues to rely on the SPARC64 developments from Fujitsu, as present in the Fujitsu/Siemens and SUN M9000 systems.