Although we mainly want to discuss real, marketable systems and not experimental, special-purpose, or even speculative machines, it is good to look ahead a little and try to see what may be in store for us in the near future.
Below we discuss systems that may lead to commercial systems introduced on the market somewhat more than half a year to a year from now. The commercial systems that result will sometimes deviate significantly from the original research models, depending on the way the development is done (the approaches in Japan and the USA differ considerably in this respect) and on the user group that is targeted.
In the section on accelerators we already noted the considerable interest generated by systems that provide acceleration by means of GPUs, FPGAs, or other special computational accelerators like the Intel Phi. In the near future an HPC vendor cannot afford to omit such accelerators from its architectures. The reaction of many vendors has been to offer systems in which accelerators are incorporated (Bull, Cray, Eurotech, SGI, etc.). Chip vendors may also try to incorporate accelerating devices on the chips themselves (as seems the way AMD and Intel are going), or provide ways to tightly integrate accelerator hardware with a CPU and memory via a fast direct connection. We briefly review the status of these developments below.
AMD

AMD took great pains over the last few years to increase the number of standard general-purpose x86_64 cores on its chips; the present count is 16 in the Piledriver processor. This processor is the second implementation of its "Bulldozer" family of processors that can house many cores of different types. As AMD acquired the GPU manufacturer ATI a few years back, it stands to reason that AMD wants to make the most of combining the technologies of both branches. AMD speaks of its Fusion architecture in connection with these plans. In fact, the first of these processors, called Accelerated Processing Units (APUs), are already on the market for desktop systems, but until now no APUs fit for HPC server systems are available. The number of GPU cores and the number of SIMD units within them are presently still too low to be a match for the high-end GPUs as delivered by both AMD and NVIDIA. This is partly due to the packaging and partly to the memory: when the number of graphical units increases, the power usage increases with it. In addition, standard DDR3 memory, or the coming DDR4, does not deliver enough bandwidth to feed many graphical cores. The way forward probably lies in shrinking the feature size and in the availability of 3-D memory in the next few years. Still, it will be a challenge to strike a feasible (and commercially viable) balance between the many conflicting design parameters that are involved in such a tight integration.
Cray Inc.

With the introduction of the XC30 (Cascade) system Cray has made a switch to a very close connection with Intel: not only does the XC30 house Intel processors (presently Ivy Bridge, next Haswell), Cray also sold its crown jewel, the Aries router, to Intel. For the moment Cray is the only vendor that may use it, and the next-generation interconnect, the Pisces router, which is being developed together with Intel, will also be available only to Cray. However, that will not be the case for the generations after that. So, it will be difficult for Cray to distinguish itself from other vendors from then on.
It might be that Cray wants to diversify its services, as one might deduce from its Sonexion storage products and its uRIKA systems for HPA workloads. Indeed, the latter field seems to become more important every year, and providing special systems for this area could be a good prospect. For HPC, however, it seems hard to keep standing out among the many other vendors in a few years.
IBM

IBM has been working for years on its three generations of BlueGene systems. Many of the first models, the BlueGene/L, have now been replaced by its follow-up, the BlueGene/P, or by the latest generation, the BlueGene/Q (see the BlueGene page).
The other line of HPC systems of IBM is based on its POWERx processors. Presently this is the POWER7, which in due course will be replaced by its technology shrink, the POWER7+. Shortly the POWER8 will see the light, though not yet for HPC systems. The most interesting part of the current p775 system is, however, the proprietary optical interconnect, giving it a definite advantage over Infiniband in terms of bandwidth. For the moment, however, nothing is known about further development of this interconnect, which is quite advanced technology-wise but also quite expensive. So, IBM will attempt to lower the cost of this interconnect, which should be possible with the advances in optronic technology, while also improving its already impressive characteristics.
Intel-based systems

As mentioned before in this report, the Itanium (IA-64) line of processors has become irrelevant for the HPC area. Intel instead is focussing on its multi-core x86_64 line of general processors, of which Haswell will be the next generation, with the server versions to come out in 2014.
Intel is not insensitive to the fast increase in the use of computational accelerators. An answer might be its many-core processors, of which the Knights Bridge (Xeon Phi) is the first product that has to compete with the current generation of GPUs. As the peak performances of both are in the same ballpark, this may work for the next few years. The next generation, code-named Knights Landing, is rumored to be made in 14 nm technology, which alone would increase the peak to 4 Tflop/s for 64-bit floating-point arithmetic, apart from any architectural improvements that may increase the peak speed. As such architectural improvements will certainly occur, this can be regarded as a lower bound. In addition, one can expect that the Knights Landing can be placed directly in a socket, thus increasing the bandwidth. Whether it will still need separate GDDR memory or will be able to share memory with the CPUs is still a guess. It is certain, however, that with an increased core count the high-speed ring connection will not be able to sustain sufficient bandwidth. So, another way of interconnecting the cores will have to appear, although in what form is still in the dark.
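Theoretical peak figures like the 4 Tflop/s mentioned above follow from a simple product of core count, clock frequency, and floating-point operations per cycle per core. The sketch below illustrates the arithmetic; the parameter values are illustrative assumptions, not published Knights Landing specifications.

```python
# Back-of-the-envelope peak-performance estimate for a hypothetical
# many-core accelerator.  All parameter values are assumptions chosen
# for illustration only.

def peak_tflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in Tflop/s: cores x clock (GHz) x FLOPs/cycle/core."""
    return cores * clock_ghz * flops_per_cycle / 1000.0

# Example: 64 cores at 1.0 GHz, each with two 512-bit FMA units
# (8 doubles x 2 ops x 2 units = 32 FLOPs/cycle):
print(peak_tflops(64, 1.0, 32))    # 2.048 Tflop/s

# Doubling the core count (a plausible effect of a feature-size shrink)
# brings the figure into the rumored 4 Tflop/s range:
print(peak_tflops(128, 1.0, 32))   # 4.096 Tflop/s
```

Such estimates are upper bounds: sustained performance depends on whether the memory system can feed the vector units, which is exactly the bandwidth concern raised above.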
As already remarked above, Intel has bought the Aries switch technology from Cray, while it had already acquired the Infiniband vendor QLogic. Together with the fact that Intel has a PCIe Gen3 connection on chip, it is safe to presume that Intel is interested in bringing the interconnect technology onto the chip. In this way no interconnect switches are required anymore, a fact that should worry Infiniband vendors like Mellanox.
NVIDIA

It has been known for some time that NVIDIA in its Denver project plans to incorporate an ARM processor within the GPU. Where in the Kepler series it is already possible to launch compute kernels from within the GPU itself, the inclusion of the ARM processor can be seen as an extension that will enable a still more independent operation of the GPU with respect to the host CPU. As such it could become a serious threat to the Intel Phi accelerators, as one will undoubtedly be able to access the ARM processor as any RISC CPU and so make programmability much easier. The first Denver-based products are to be expected in 2015, while the direct successor of the Kepler GPU, the Maxwell, will appear in 2014. There is debate whether this will be in the first quarter or later in the year. If Maxwell is brought out early, it will be in 28 nm technology, because TSMC, which produces the chips, will not be ready for mass production in 20 nm technology in 1Q2014.
Energy-efficient developments

There is already considerable activity with regard to building Exaflop/s systems, foremost in the IESP (International Exascale Software Project). The time frame mentioned for the appearance of such systems is 2019-2020. However, it seems that in the USA the development money to keep this schedule is hard to come by, and a more probable year would be 2022. Both in Asia and in Europe there are also plans for developing Exascale systems, though not necessarily within the constraints as set by the USA's DoD. So, we still may see such (a) system(s) emerging before 2020.
A main concern, however, is the power consumption of such machines. With current technologies, even with the improvements brought by reduced feature size factored in, it would be roughly in the 120-150 MW range, i.e., the power consumption of a mid-sized city. This is obviously unacceptable, and the power requirement circulating in the IESP discussions is ≤ 20 MW, the constraint put in by the US Department of Defense. Even if one ignores such a hard constraint, it is obvious one cannot proceed with business as usual energy-wise.
This has fuelled the interest in alternative processor architectures like those used in embedded processors, and at the moment a research project, "Green Flash", is underway at Lawrence Berkeley Laboratory to build an Exaflop/s system using 20⋅10^6 Tensilica Xtensa embedded processors. Besides that, a host of other manufacturers are now exploring this direction, among them ARM, Texas Instruments, Adapteva, Tilera, and many others.
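Spreading both the exaflop target and the 20 MW budget over 20⋅10^6 processors shows what each embedded chip would have to deliver. This is pure arithmetic on the figures quoted in the text, not a published Green Flash design point:

```python
# Per-processor targets implied by an exaflop machine built from
# 20 million embedded processors under a 20 MW budget.

n_procs = 20.0e6                    # 20*10^6 Tensilica-class processors
flops_each = 1.0e18 / n_procs       # required performance per processor
watts_each = 20.0e6 / n_procs       # power budget per processor

print(flops_each / 1.0e9)           # 50.0 Gflop/s per processor
print(watts_each)                   # 1.0 W per processor
```

A budget of roughly one watt per processor explains the appeal of embedded designs, which are engineered for exactly that power class.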
Although the processor is an important constituent of the power budget, there are others that are becoming increasingly important, among them memory and the processor interconnect. Directions for decreasing the power consumption of these components lie in the transition to a form of non-volatile memory, i.e., memory that does not use energy to maintain its contents, and in the use of photonic interconnects. With respect to the latter development, the first step has been taken by IBM with its p775 eServer.
With regard to non-volatile memory one has to distinguish between storage class memory that can replace spinning disks (or the current generation of SSD Flash Memory) and the technologies that should replace DRAM as we currently use it. The former may be replaced by Phase Change RAM (PCRAM) in the very near future. For the latter type of memory there are a number of candidate technologies, like MRAM, FeRAM and memristors. MRAM is already used in embedded applications but at an insufficient density to be applicable in present systems. FeRAM and memristors may become available in a 2–3 year time frame.