Systems under development


    Although we mainly want to discuss real, marketable systems rather than experimental, special-purpose, or even speculative machines, it is good to look ahead a little and try to see what may be in store for us in the near future.

    Below we discuss systems that may lead to commercial systems to be introduced on the market within roughly half a year to a year from now. The commercial systems that result from these developments will sometimes deviate significantly from the original research models, depending on the way the development is done (the approaches in Japan and the USA differ considerably in this respect) and on the user group that is targeted.

    A development that was, at the time, of significance was the introduction of Intel's IA-64 Itanium processor family. Six vendors are offering Itanium 2-based systems at the moment (be it not all for HPC purposes), and HP has ended the marketing of its Alpha and PA-RISC based systems in favour of the Itanium processor family. Likewise, SGI stopped the further development of MIPS processor based machines. This means that the processor base for HPC systems has become very narrow. However, the shock that was caused in the USA by the advent of the Japanese Earth Simulator system has helped in refueling the funding of alternative processor and computer architecture research, of which we have seen the consequences in the last few years.

    In the section on accelerators we already noted the considerable interest generated by systems that provide acceleration by means of FPGAs or other special computational accelerators like those from ClearSpeed, etc. Within the near future an HPC vendor cannot afford not to include such accelerators somehow in its architectures. One also cannot expect general processor and HPC vendors to ignore this trend. In some way they will either integrate the emerging accelerator capability into their systems (as is in the road maps of, e.g., Cray and SGI, see below), try to incorporate accelerating devices on the chips themselves (as seems the way AMD and Intel are going), or provide ways to tightly integrate accelerator hardware with a CPU and memory via a fast direct connection. We briefly review the status of these developments below.

    AMD

    AMD took great pains over the last few years to increase the number of standard general purpose x86_64 cores on its chips, the present number being 12 in the Magny-Cours processor. At the same time, AMD acquired the GPU manufacturer ATI a few years back, and it stands to reason that AMD wants to combine the technologies of both branches, as it indeed plans to do in its future Fusion program. However, products of this type to be used in HPC servers are still some years away (in contrast to those for notebooks and desktop systems that may come around the second half of 2011). The first architectural change will be in the AMD Bulldozer chips that will feature 2 128-bit FMAC units for floating-point processing and 2×4 integer units with 2 L1 caches and a shared L2 cache. In effect it looks much like a dual-core chip, be it that the chip resources are distributed in another way. At this moment it is impossible to make performance predictions for this new architecture, which is radically different from the processors produced lately by AMD. AMD will support the AVX instruction set for vector processing as defined by Intel.
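    To give an idea of what the two 128-bit FMAC units imply, one can sketch the theoretical double-precision peak of one such module. The clock frequency used below is purely hypothetical, as no final clock speeds had been announced:

```python
# Back-of-the-envelope peak for one Bulldozer module: 2 FMAC units,
# each 128 bits wide (2 double-precision lanes), where a fused
# multiply-add counts as 2 floating-point operations per lane per cycle.
def peak_dp_gflops(clock_ghz, fmac_units=2, dp_lanes=2, flops_per_fma=2):
    """Theoretical peak double-precision Gflop/s per module."""
    return clock_ghz * fmac_units * dp_lanes * flops_per_fma

# With a purely hypothetical 3.0 GHz clock:
print(peak_dp_gflops(3.0))  # 24.0 Gflop/s per module
```

    This is only an upper bound; sustained performance on real codes will be well below it.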

    Cray Inc.

    At the end of 2002 the next generation vector processor from Cray Inc., the X1, was ready to ship. It built on the technology found in the Cray SV-1s. Cray widely publicises a roadmap of future systems as far as around 2010, primarily based on the Cascade project. This project started with help of DARPA's High Productivity Computing Systems (HPCS) initiative, which has as one of its goals that 10 Pflop/s systems (sustained) should be available by 2010. This should not only entail the necessary hardware but also a (possibly new) language to productively program such systems. Cascade was Cray's answer to this initiative. Together with IBM, Cray has continuing support from the HPCS program (HP, SGI, and Sun have dropped out).
    Cray seems reasonably on track with its Cascade project, but it has done away with its former ideas of a very heterogeneous system that would integrate scalar and vector processors, as in the abandoned XT5h and its FPGA-accelerated processor boards. However, Cray now plans to integrate nodes that can accommodate GPUs, as most cluster vendors and, e.g., Bull are doing. So heterogeneity is creeping back in a different form. The follow-on systems bear imaginative names like "Baker", which is in fact the Cray XE6, "Granite", and "Marble", ultimately leading to a system that should be able to deliver 10 Pflop/s sustained by 2011. In the systems following the XE6 a successor of the already fast Gemini router is expected, the Aries. Not much is known yet of this router, except that the connection to the processors will no longer be based on HyperTransport but rather on PCI Express Gen3. This will give Cray the opportunity to use either AMD or Intel processors, whichever suits it best.

    IBM

    IBM has been working for some years on its BlueGene systems. Many of the first models, the BlueGene/L, have been installed in the last few years. The BlueGene/L follow-up, the BlueGene/P, has been available for about a year now, and several /P systems have been installed in Europe as well as in the USA. Theoretically, the BlueGene/P can attain a peak speed of 3 Pflop/s, and the BlueGene/Q, the next generation, was originally planned to have a peak speed of around 10 Pflop/s. The BlueGene/Q Sequoia system committed for Lawrence Livermore, however, should have a peak speed of 20 Pflop/s, using 1.6 million cores in nodes of 16 cores. The system is slated for 2012.
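    The quoted Sequoia figures are internally consistent, as a little arithmetic shows; all numbers below are taken directly from the text above:

```python
# Consistency check of the quoted Sequoia figures.
peak_pflops = 20        # system peak in Pflop/s
total_cores = 1.6e6     # 1.6 million cores
cores_per_node = 16

per_core_gflops = peak_pflops * 1e6 / total_cores   # Pflop/s -> Gflop/s
per_node_gflops = per_core_gflops * cores_per_node
nodes = total_cores / cores_per_node

print(per_core_gflops)   # 12.5 Gflop/s per core
print(per_node_gflops)   # 200.0 Gflop/s per node
print(int(nodes))        # 100000 nodes
```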
    The POWER7 processor is now available (see the Hitachi SR16000 XM1), but IBM itself does not yet market HPC systems with POWER7 processors. A second multi(10)-Pflop/s system containing this processor is the Blue Waters machine, to be installed at NCSA, USA in 2011. Four processors are packaged in a Quad Chip Module. With a clock frequency of over 3.5 GHz such a node will have a peak speed of almost 1 Tflop/s. Apart from using the new POWER7 processor (see the section on the POWER7 processor), this system will also extensively use optical coupling of the components. IBM has developed an optics-enabled processor board for this system that will do away with the multitude of cables that otherwise would be needed (see [27]). It may be expected that this technology will turn up in IBM's commercial HPC systems in the near future.
    The Cell processor, and in particular the PowerXCell 8i, was introduced by IBM as an accelerator besides its general purpose processors and marketed in the systems provided by IBM. This was done in the Roadrunner systems with the tri-blade configuration and, as evident from the section on the IBM 1350 cluster, commercialised in the current cluster systems along with other processor blades. However, there will be no follow-on to the Cell processor. The first version was developed and produced together with Sony and Toshiba, which do not use it for HPC purposes. These companies have shown no interest in a further development of this processor line, and continuing this line just for HPC purposes is economically not attractive where GPUs can reach similar or higher performance at lower cost.

    Intel-based systems

    As mentioned before in this report, the Itanium (IA-64) line of processors has become irrelevant for the HPC area. Intel instead is focussing on its multi-core x86_64 line of general processors of which the Sandy Bridge will be the next generation with the server versions to come out in 2011 (consumer versions will turn up late 2010). The Sandy Bridge server version will have at least 8 cores and will feature the AVX instruction set for vector processing in units that are 256 bits wide.

    Intel is not insensitive with respect to the fast increase of the use of computational accelerators, GPUs in particular. An answer might be the many-core processors that are part of Intel's future plans. The Larrabee processor that was expected to become available in 2010 was retracted, and instead Intel is exploring rather similar architectures, called Knights Ferry and, later on, Knights Corner; the latter is planned to be the first official product, while the former is presented as a development platform. Intel collectively calls this line of processors Many Integrated Core (MIC) processors. Like the retracted Larrabee, the (heterogeneous) cores will be connected by a fast ring bus. In the Knights Ferry, 32 cores with a feature size of 45 nm, combined with 512-bit wide vector units, operate at a clock frequency of about 2–2.5 GHz. The cores are connected by a 1024-bit wide ring. The Knights Corner will be built in 22 nm technology and will contain more than 50 cores. Whether this will be sufficient to ward off the adoption of GPUs as computational accelerators no one can predict for the moment.

    SGI

    Now that SGI has launched its Altix UV systems, it may be expected that it will continue in this direction. It is probable that, like Cray, next generations will adopt PCI Express instead of QPI to connect the processors to each other and to the hub, in order to become vendor-independent, but no official plans mention such a transition. It is also possible that nodes will be offered that contain GPUs alongside the general purpose processors in a node, as SGI does in its Altix ICE clusters. There are at the moment no announcements in that direction, however.

    Energy-efficient developments

    There is already considerable activity with regard to building Exaflop/s systems, foremost in the IESP (International Exascale Software Project, [14]). The time frame mentioned for the appearance of such systems is 2019–2020. A main concern, however, is the power consumption of such machines. With the current technologies, even with improvements caused by reduced feature size factored in, it would be roughly in the 120–150 MW range, i.e., the power consumption of a mid-sized city. This is obviously unacceptable, and the power requirement circulating in the IESP discussions is ≤ 20 MW.
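    Expressed as an energy-efficiency requirement, the gap between these budgets is striking. The small computation below merely restates the figures given in the text for a nominal 1 Eflop/s machine:

```python
# Efficiency in Gflop/s per watt needed for a 1 Eflop/s system
# under the total power budgets mentioned in the text.
EXAFLOP = 1e18  # flop/s

def gflops_per_watt(power_mw):
    """System-wide Gflop/s per watt for a 1 Eflop/s machine."""
    watts = power_mw * 1e6
    return (EXAFLOP / watts) / 1e9

print(round(gflops_per_watt(150), 1))  # 6.7 Gflop/s per W at the projected 150 MW
print(gflops_per_watt(20))             # 50.0 Gflop/s per W at the 20 MW IESP target
```

    In other words, meeting the IESP target requires roughly an order-of-magnitude improvement in Gflop/s per watt over a straightforward extrapolation of current technology.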
    This has fuelled the interest in alternative processor architectures like those used in embedded processors, and at the moment a research project, "Green Flash", is underway at Lawrence Berkeley Laboratory to build an Exaflop/s system using 20·10⁶ Tensilica Xtensa embedded processors. Besides that, a host of other manufacturers are now exploring this direction, among them ARM, Texas Instruments, Adapteva, Tilera, and many others.
    Although the processor is an important constituent in the power budget, there are others that are becoming increasingly important, among them memory and the processor interconnect. Directions to decrease the power consumption of these components lie in a transition to a form of non-volatile memory, i.e., memory that does not use energy to maintain its contents, and in the use of photonic interconnects. With respect to the latter development, the first step seems to have been taken by IBM with the Blue Waters system (see the section on IBM above). With regard to non-volatile memory there are a number of candidate technologies, like MRAM, RRAM, and memristors. MRAM is already used in embedded applications, but at an insufficient density to be applicable in present systems. RRAM and memristors may become available in a 2–3 year time frame. For storage, another major power consumer, SSDs in the form of Phase Change RAM may replace spinning disks in a few years.