ClearSpeed/Petapath

    ClearSpeed is presently probably the only company that specifically makes computational accelerators for HPC. It has done so for some time, and its products are now in their third generation. Unlike GPUs, the ClearSpeed processors were designed from the start to operate on 64-bit floating-point data, and they provide full error correction. The latest processor is the CSX700 chip, which is packaged in a number of products. The most common is the e710 card, which fits in a PCIe slot of any PC or server unit. A variant with a different form factor but the same functionality is the e720 card, which can be put into blade servers. Petapath, a spin-off of ClearSpeed that targets HPC specifically, markets the so-called Feynman e740 and e780 devices. These units pack 4 and 8 e710 cards, respectively, in one unit and can be connected to a host processor by high-speed PCI Express (16× Gen. 2 PCIe at 8 GB/s).

    There is another feature that is peculiar to the e720 card: its power consumption is extremely low, 25 W maximal and 15 W typical. This is partly due to the low clock frequency of 250 MHz. Apart from the CSX700 processor, the e710 card contains 2 GB of DDR2 SDRAM and an FPGA that manages the data traffic to and from the card. The interconnect to the host system is compliant with PCIe 8×, which amounts to a bandwidth of 2 GB/s. ClearSpeed is quite forthcoming with technical details, so we are able to show a block diagram of the CSX processor in Figure 24.

    Figure 24: Block diagram of a ClearSpeed MTAP unit. Two of these units reside on a CSX700 chip.

    Two so-called Multi-Threaded Array Processor (MTAP) units are located on one CSX700 chip. As can be seen, an MTAP contains 96 processors (with 4 redundant ones per MTAP). They are controlled via the Poly Controller, "poly" being the indication for the data types that can be processed in parallel. The processing elements can communicate quickly among themselves via a dedicated ring network. Every cycle, a 64-bit data item can be shifted to the right or to the left through the ring. In Figure 25 we show the details of a processing element.
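
    The neighbour shifts through the ring can be illustrated with a small simulation. This is only a sketch in Python, not ClearSpeed code: the function names `ring_shift` and `ring_sum` are hypothetical, and the wrap-around at the ends of the ring is an assumption based on the word "ring" in the text. Each call to `ring_shift` with a step of 1 stands for one cycle in which every PE passes one 64-bit item to its neighbour.

```python
from collections import deque

NUM_PES = 96  # processing elements per MTAP, as stated in the text


def ring_shift(values, steps):
    """Return the per-PE values after moving them `steps` places around the ring.

    Positive `steps` shifts to the right, negative to the left.
    Wrap-around between the last and first PE is an assumption.
    """
    ring = deque(values)
    ring.rotate(steps)  # deque.rotate(n) moves every item n places to the right
    return list(ring)


def ring_sum(values):
    """Sum all PE values using only single-step shifts.

    Returns (per-PE totals, number of shift cycles used).
    """
    acc = list(values)          # running total held by each PE
    travelling = list(values)   # item currently passing through each PE
    cycles = 0
    for _ in range(len(values) - 1):
        travelling = ring_shift(travelling, 1)  # one cycle: pass item to the right
        acc = [a + t for a, t in zip(acc, travelling)]
        cycles += 1
    return acc, cycles


pe_values = list(range(NUM_PES))      # PE i initially holds the value i
shifted = ring_shift(pe_values, 1)    # PE 0 now holds the value that PE 95 had
totals, cycles = ring_sum(pe_values)  # every PE ends up with the complete sum
print(shifted[:3], totals[0], cycles)
```

    In this model a full reduction over the 96 PEs of one MTAP takes 95 shift cycles, which gives a feel for why a one-item-per-cycle neighbour link is useful for nearest-neighbour and reduction patterns.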

    Figure 25: Block diagram of a PE in an MTAP of a CSX700 chip. The numbers near the arrows indicate the number of bits that can be transferred per cycle.

    A maximum of two 64-bit floating-point results can be generated per cycle. As one MTAP contains 96 PEs and there are 2 MTAPs on a chip, the peak performance of a CSX700 chip is 2 × 96 × 2 × 0.25 GHz = 96 Gflop/s at a clock frequency of 250 MHz.
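
    The peak-performance arithmetic can be written out as a short Python sketch. The constants come from the text; the function names are hypothetical, and two points are assumptions: that each e710 card carries exactly one CSX700 chip, and that the quoted card power figures (25 W maximal, 15 W typical) can be set against the chip's peak rate. The multi-card and per-watt numbers are therefore derived estimates, not vendor specifications.

```python
FLOPS_PER_PE_PER_CYCLE = 2   # two 64-bit floating-point results per cycle
PES_PER_MTAP = 96            # active processing elements per MTAP
MTAPS_PER_CHIP = 2           # MTAP units on one CSX700
CLOCK_MHZ = 250              # clock frequency in MHz


def peak_gflops(chips=1):
    """Theoretical peak in Gflop/s for `chips` CSX700 processors."""
    mflops = (chips * FLOPS_PER_PE_PER_CYCLE * PES_PER_MTAP
              * MTAPS_PER_CHIP * CLOCK_MHZ)
    return mflops / 1000


def gflops_per_watt(watts, chips=1):
    """Peak Gflop/s divided by a power figure in watts (derived estimate)."""
    return peak_gflops(chips) / watts


chip_peak = peak_gflops()      # one CSX700: 96.0 Gflop/s
e740_peak = peak_gflops(4)     # 4 cards, assuming one chip per card
e780_peak = peak_gflops(8)     # 8 cards, assuming one chip per card
print(chip_peak, e740_peak, e780_peak)
print(gflops_per_watt(25), gflops_per_watt(15))
```

    These are peak figures only; sustained performance depends on how well an application keeps all PEs busy and on the data traffic over the PCIe link.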
    Note the Control & Debug unit present in an MTAP. It enables debugging within the accelerator at the PE level. This facility is missing in the GPUs and in the FPGA accelerators we will discuss later.
    ClearSpeed also employs an extended form of C, called Cn, for program development on the card. The extension is very slight, however: the keywords mono and poly are added to indicate data that should be processed serially or in parallel, respectively. Because ClearSpeed has been in the accelerator trade for quite some time, the SDK is very mature. Apart from the Cn compiler already mentioned, it contains a library with a large set of the BLAS/LAPACK routines, FFTs, and random number generators. For dense linear algebra there is an interface that enables calling the routines from a host program in Fortran. Furthermore, a graphical debugging and optimisation tool is present that can be used either stand-alone or embedded as a plug-in in IBM's Eclipse Integrated Development Environment (IDE).
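
    The distinction between mono and poly data can be illustrated with a small emulation. The sketch below is Python, not Cn, and the `Poly` class is a hypothetical stand-in: a poly value is modelled as one private copy per processing element, while a mono value is an ordinary scalar that is broadcast to every PE when the two are combined. The emulation runs sequentially; it only mimics the semantics, not the parallel execution on the card.

```python
NUM_PES = 96  # processing elements per MTAP, as stated in the text


class Poly:
    """Emulates a `poly` value: one private copy per processing element."""

    def __init__(self, per_pe_values):
        self.values = list(per_pe_values)

    def _lift(self, other):
        # A mono operand (plain scalar) is broadcast to every PE.
        if isinstance(other, Poly):
            return other.values
        return [other] * len(self.values)

    def __add__(self, other):
        return Poly(a + b for a, b in zip(self.values, self._lift(other)))

    def __mul__(self, other):
        return Poly(a * b for a, b in zip(self.values, self._lift(other)))

    # Addition and multiplication commute, so scalar-on-the-left reuses them.
    __radd__ = __add__
    __rmul__ = __mul__


scale = 2.0                                  # "mono": a single value
x = Poly(float(i) for i in range(NUM_PES))   # "poly": PE i holds its own x
y = scale * x + 1.0                          # one statement, applied on every PE
print(y.values[0], y.values[95], len(y.values))
```

    The point of the model is that a single source statement such as `scale * x + 1.0` describes the same operation on all 96 PEs at once, which is the programming style the two extra keywords are meant to support.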