Cray has come a long way in the realisation of true hybrid systems, in which
processors of different types work seamlessly together within one computer
infrastructure, and it is consistently working towards a hardware infrastructure
that allows for this ideal situation. The current product, named the Cray
XT5h, can harbour scalar Opteron-based XT5 nodes, X2 vector
nodes, and XR1 nodes that house a Xilinx FPGA. The XT5h is not yet
able to let all of them work together on one computational task, but the SIO
(Service and I/O) nodes can log on to the system and deal with any of the different
types of nodes within it. The common infrastructure is formed by the 3-D torus
network implemented by the SeaStar2+ communication routers. Each of the 6 ports of
the SeaStar2+ provides a bandwidth of 9.6 GB/s and connects to the X2,
XT5, and XR1 blades. The bandwidth to the X2 blades is 4.8 GB/s, for reasons not
disclosed in Cray's documentation.
The Cray Inc. X2
System parameters:
Remarks: The X2 vector processor blade fits in Cray's XT5h infrastructure, where the "h" stands for "hybrid". The processor is not very different from its predecessor, the X1E, but the clock frequency has gone up from 1.125 GHz to 1.6 GHz. A vector pipe set, containing an add, a multiply, and a miscellaneous-functions pipe, can generate 2 floating-point results/cycle. As there are 8 pipe sets in a processor, this yields a peak performance of 25.6 Gflop/s. Four CPUs are housed in one node with 32 or 64 GB of memory and operate in SMP mode. One X2 blade in turn houses two nodes, for a total peak performance of 204.8 Gflop/s.
As in its predecessors, the bandwidth from/to the memory of 28.5 GB/s is not sufficient to support the operation of the processors at full speed. The scalar processor in a CPU therefore has 2-way set-associative L1 instruction and data caches, while the unified 512 KB, 16-way set-associative L2 cache and the 8 MB L3 cache are shared by all data. Part of the bandwidth discrepancy has been met by decoupling the vector functional units from each other. For instance, the load/store unit is able to issue store instructions before the associated results are present. In this way the functional units incur minimal waiting times because they do not have to synchronise unnecessarily.
To integrate the X2 blade fully into the XT5h infrastructure it is connected to a SeaStar2+ router that links it to the other types of nodes in the system, albeit at half the normal link speed, i.e., at 4.8 GB/s. The X2 blades also have their own very high-bandwidth interconnect, implemented as a dense fat tree built from Cray's proprietary so-called YARC router chips. The density lies in the fact that the radix, i.e., the number of ports per router, is high: 64 ports/YARC chip. Four of these chips constitute one rank-1 router that connects 32 CPUs at level 1. A rank-2 router can in turn connect 128 rank-1 routers, etc., up to the maximum of 32K processors.
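The peak-performance figures quoted above follow directly from the clock frequency and the pipe configuration; a small sketch of the arithmetic, using only numbers from the text:

```python
# X2 peak-performance arithmetic, using the figures quoted in the text.
clock_ghz = 1.6          # clock frequency of the X2 processor
pipe_sets = 8            # vector pipe sets per processor
results_per_cycle = 2    # floating-point results per pipe set per cycle

cpu_peak = clock_ghz * pipe_sets * results_per_cycle  # Gflop/s per CPU
node_peak = 4 * cpu_peak                              # 4 CPUs per SMP node
blade_peak = 2 * node_peak                            # 2 nodes per X2 blade

print(cpu_peak)    # 25.6 Gflop/s per processor
print(blade_peak)  # 204.8 Gflop/s per blade
```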
A unique feature of the network is that it allows side links that connect rank-1 routers in order to couple subtrees statically. One may have reasons to configure the network in such a way, especially when one has a number of X2 boards that naturally matches such a configuration. The network topology can become rather complicated in this way, however. The worst-case distance in the maximally configured network is Ω = 7, which means that only 7 hops are needed to connect the two most distant processors in a 32K-processor configuration. The point-to-point bandwidth between processors is quite high: 15 GB/s, or almost 60% of the local memory bandwidth within a CPU. An extensive description of this interesting interconnect can be found in [40]. Within an SMP node OpenMP can be employed. When accessing other CPU boards one can use Cray's shmem library for one-sided communication, MPI, Co-Array Fortran, etc. Measured Performances: Results for a subset of the synthetic HPCC benchmark, some application kernels, and 7 application codes, compared against an Intel Woodcrest processor (HPCC subset) and a Cray XT4, are discussed in [41].
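The router hierarchy described above can be checked with a little arithmetic (all figures are taken from the text; the grouping into ranks follows the description):

```python
# Capacity of the X2 fat-tree interconnect, using the figures in the text.
ports_per_yarc = 64     # radix of one YARC router chip
cpus_per_rank1 = 32     # CPUs connected by one rank-1 router (4 YARC chips)
rank1_per_rank2 = 128   # rank-1 routers one rank-2 router can connect

# CPUs reachable under a single rank-2 router:
rank2_cpus = rank1_per_rank2 * cpus_per_rank1
print(rank2_cpus)  # 4096 CPUs; further levels extend this to the 32K maximum
```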
Cray Inc. XT5
System parameters:
Remarks: The quality of the layout of Cray's online product information has improved significantly over the years. Unfortunately, the information content, at least for the XT5, has been decreasing at the same rate. So, no more information is given about the processor on the XT5 blade than that AMD 2000 processors are used, which may be either dual- or quad-core. No per-processor performance is given either, only that a 192-processor, quad-core cabinet will have a peak performance of 7 Tflop/s. This amounts to a peak speed per core of 9.11 Gflop/s. As the fastest 2.5 GHz quad-core Phenom processor is capable of 5.0 Gflop/s/core with its regular floating-point units, it turns out that Cray has conveniently included the SSE capability of the processors, a practice that has not been exercised before, nor should it be recommended. Also, no information is given about the maximum configuration, as was done for its predecessors, the XT3 and XT4.
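The per-core figure above is easily recovered from the only number Cray does publish, the 7 Tflop/s cabinet peak:

```python
# Back-calculating the XT5 per-core peak from the quoted cabinet figure.
cabinet_peak_gflops = 7000.0  # 7 Tflop/s cabinet peak performance
processors = 192              # quad-core processors per cabinet
cores_per_processor = 4

per_core_gflops = cabinet_peak_gflops / (processors * cores_per_processor)
print(round(per_core_gflops, 2))  # 9.11 Gflop/s per core
```

As noted in the text, this exceeds the 5.0 Gflop/s/core of the regular floating-point units, so the SSE units must have been counted in.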
There is information about the interconnection network: it is based on
the SeaStar2+ router. The structure is the same as that of the SeaStar2 router,
but about 25% faster: the bi-directional link speed has gone up from 7.6
GB/s to 9.6 GB/s. The bandwidth from the SeaStar2+ to the processors is still
the same: 6.4 GB/s, the generic speed of AMD's HyperTransport 1.1. Measured Performances: As yet, no performance results are known for XT5 systems.
Cray Inc. XR1
System parameters:
Remarks: An XR1 blade contains two Xilinx Virtex-4 LX400 FPGAs, each packaged as one of DRC's Reconfigurable Processing Units (RPUs). This means that DRC has already preconfigured the I/O connection between the AMD Opteron on the board and the two RPUs. The connection is made directly via AMD's HyperTransport 1.0 protocol, making the bandwidth on and off the FPGAs quite high in comparison to most accelerator platforms. The AMD processor is used for the connection to the SeaStar2+ router in the XT5h infrastructure and as an intermediate processor from which the RPUs are activated. As the algorithms to be configured and run on an XR1 board can be so different, no sensible performance figures can be given, even as an upper bound. As already discussed in the section on DRC, various programming interfaces can be used, including Handel-C, Mitrion-C, etc., for the development of routines to be run on the XR1 boards. Measured Performances: As said, because of the enormous difference in speed between one application and another, no meaningful performance number can be defined. Only in terms of speedup against a CPU-based version can one reasonably assess the benefits of an XR1-accelerated system.
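Such a speedup-based assessment is straightforward; the sketch below uses made-up example timings (both values are hypothetical, purely to illustrate the measure):

```python
# Hypothetical illustration of assessing an FPGA-accelerated system
# by its speedup over a CPU-only baseline. The timings are invented
# example values, not measurements of any real XR1 workload.
cpu_time_s = 120.0   # assumed run time of the CPU-only version
fpga_time_s = 8.0    # assumed run time of the XR1-accelerated version

speedup = cpu_time_s / fpga_time_s
print(speedup)  # 15.0x for this invented example
```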