Shared-memory MIMD machines


Figure 1 already showed one subclass of this type of machine. In fact, the single-processor vector machine discussed there is a special case of a more general type. The figure shows that more than one FPU and/or VPU may be present in one system.

In shared-memory MIMD machines several processors access a common memory from which they draw their instructions and data. As such a system is used to perform a common task by executing parts of the program in parallel, there must be some way to coordinate and synchronise the various parts of the program. This is done by a network that connects the CPU cores to each other and to the memory.
In fact, as present-day CPUs all harbour multiple cores (see, e.g., the AMD and Intel Xeon pages), a single multi-core CPU can already be regarded as an SM-MIMD system on a single chip. The following discussion therefore pertains to single CPUs as well, although their interconnection is less of a problem.

When more CPUs are added, the collective bandwidth to the memory ideally should increase linearly with the number of processors, while each processor should preferably communicate directly with all others without the much slower alternative of using the memory as an intermediate stage. Unfortunately, full interconnection is quite costly: the number of connections grows as O(n²) while the number of processors grows only as O(n). So, various alternatives have been tried. Figure 5 shows some of the interconnection structures that are (and have been) used.

As can be seen from the figure, a crossbar uses n² connections and an Ω-network uses n log2 n connections, while with the central bus there is only one connection. This is reflected in the use of each connection path for the different types of interconnection: for a crossbar each data path is direct and does not have to be shared with other elements. In the case of the Ω-network there are log2 n switching stages and as many data items may have to compete for any path. For the central data bus all data have to share the same bus, so n data items may compete at any time.
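The growth rates quoted above can be made concrete with a small sketch (the function name and the chosen processor counts are illustrative, not from the text):

```python
from math import log2

def connection_counts(n):
    """Connection paths needed to link n processors, using the
    complexity figures for the three interconnect types above."""
    return {
        "crossbar": n * n,          # every path direct and unshared
        "omega": int(n * log2(n)),  # log2(n) switching stages
        "bus": 1,                   # one shared path, n-way contention
    }

for n in (16, 64, 256):
    print(n, connection_counts(n))
```

For 256 processors a crossbar already needs 65,536 connections against 2,048 for an Ω-network, which is why logarithmic networks become more attractive as the processor count grows.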


    Figure 5: Some examples of interconnection structures used in shared-memory MIMD systems.

The bus connection is the least expensive solution, but it has the obvious drawback that bus contention may occur, thus slowing down the computations. Various intricate strategies have been devised, using caches associated with the CPUs, to minimise the bus traffic. This, however, leads to a more complicated bus structure, which raises the costs. In practice it has proved very hard to design buses that are fast enough, especially as the speed of the processors has been increasing very quickly; moreover, the bus imposes an upper bound on the number of processors thus connected that in practice appears not to exceed 10-20. In 1992 a new standard (IEEE P896) was defined for a fast bus to connect either internal system components or external systems. This bus, called the Scalable Coherent Interface (SCI), should provide a point-to-point bandwidth of 200-1,000 Mbyte/s. It was in fact used in the HP Exemplar systems, but also within clusters of workstations as offered by SCALI. The SCI is much more than a simple bus and it can act as the hardware network framework for distributed computing, see [20]. It has now effectively been superseded by InfiniBand, however (see the networks section).

A multi-stage crossbar is a network with logarithmic complexity; with respect to potential capacity and costs its structure is situated somewhere between a bus and a crossbar. The Ω-network depicted in Figure 5 is an example. Commercially available machines like the IBM eServer p775, the SGI Altix UV, and many others use such a network structure, but a number of experimental machines have also used this or a similar kind of interconnection. The BBN TC2000, which acted as a virtual shared-memory MIMD system, used an analogous type of network (a Butterfly network), and it is quite conceivable that new machines may use it, especially as the number of processors grows. For a large number of processors the n log2 n connections quickly become more attractive than the n² connections used in crossbars. Of course, the switches at the intermediate levels should be sufficiently fast to cope with the bandwidth required. Obviously, not only the structure but also the width of the links between the processors is important: a network using 16-bit parallel links will have a bandwidth 16 times higher than a network with the same topology implemented with serial links.
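How a message finds its way through the log2 n stages can be sketched with destination-tag (self-routing) through an Ω-network built from shuffle-exchange stages. The function below is a hypothetical illustration of this routing scheme, not taken from any of the machines named above:

```python
def omega_route(src, dst, n):
    """Return the sequence of intermediate ports a message visits in an
    n-input Omega network (n a power of two): each of the log2(n) stages
    performs a perfect shuffle, after which the switch output is chosen
    by the corresponding bit of the destination address."""
    stages = n.bit_length() - 1               # log2(n) switching stages
    path, cur = [], src
    for i in range(stages):
        bit = (dst >> (stages - 1 - i)) & 1   # destination bit for this stage
        cur = ((cur << 1) & (n - 1)) | bit    # shuffle, then switch setting
        path.append(cur)
    return path

print(omega_route(5, 3, 8))   # → [2, 5, 3]
```

Because each stage consumes one destination bit, any source reaches any destination in exactly log2 n hops, but two messages whose routes need the same switch output must compete for it, as noted above.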

The stand-alone multi-processor vector processors mentioned earlier used crossbars. This was feasible because the maximum number of processors within a system node was small (16 at most). In the late Cray X2 the number of processors had increased so much, however, that it had to change to a logarithmic network (see interconnects). Not only does it become harder to build a crossbar of sufficient speed for larger numbers of processors, the processors themselves generally also increase in speed individually, compounding the problem of matching the speed of the crossbar to the bandwidth required by the processors.

Whichever network is used, the type of processors could in principle be arbitrary for any topology. In practice, however, bus-structured machines do not have vector processors, as the speeds of these would grossly mismatch any bus that could be constructed at reasonable cost. All available bus-oriented systems use RISC processors, as far as they still exist. The local caches of the processors can sometimes alleviate the bandwidth problem if the data access can be satisfied by the caches, thus avoiding references to the memory.

The systems discussed in this subsection are of the MIMD type and therefore different tasks may run on different processors simultaneously. In many cases synchronisation between tasks is required, and again the interconnection structure is very important here. In the past, some Cray vector processors (the X-MP and Y-MP/C series) employed special communication registers within the CPUs by which they could communicate directly with the other CPUs they had to synchronise with. This is, however, not practised anymore, as it is viewed as too costly a feature. The systems may also synchronise via the shared memory. Generally, this is much slower but may still be acceptable when the synchronisation occurs relatively seldom. Of course, in bus-based systems communication also has to be done via a bus. This bus is mostly separated from the data bus to assure maximum speed for the synchronisation.
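Synchronisation via the shared memory, as described above, can be sketched with ordinary threads: each task deposits a partial result in shared storage and waits at a barrier before any task reads the whole. (A minimal sketch using Python threads; the variable names are illustrative.)

```python
import threading

NTASKS = 4
results = [0] * NTASKS               # shared memory common to all tasks
barrier = threading.Barrier(NTASKS)  # synchronisation point

def task(rank):
    results[rank] = rank * rank      # each task runs its own work (MIMD)
    barrier.wait()                   # synchronise via the shared state
    if rank == 0:                    # now every partial result is visible
        print("sum of squares:", sum(results))

threads = [threading.Thread(target=task, args=(r,)) for r in range(NTASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The barrier is the slow, general mechanism; the Cray communication registers mentioned above served the same purpose directly in hardware.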

As said, a single CPU can be regarded as a small SM-MIMD system in itself: up to sixteen processor cores are connected to each other by high-speed links. Because the number of cores is limited, this is mostly done by a crossbar, but high-speed ring connections are also employed.