Before going on to the descriptions of the machines themselves, it is
important to consider some mechanisms that are or have been used to
increase performance. The hardware structure or
architecture determines to a large extent what is and what is not
possible in speeding up a computer system beyond the
performance of a single CPU. Another important factor that must be
considered in combination with the hardware is the capability of
compilers to generate efficient code for the given
hardware platform. In many cases it is hard to distinguish between
hardware and software influences, and one has to be careful in the
interpretation of results when ascribing certain effects to hardware or
software peculiarities, or both. In this chapter we will place the most
emphasis on the hardware architecture. For a description of machines
that can be classified as "high-performance" the reader is
referred to [7] and [34].
For many years the taxonomy of Flynn
[11] has proven to be useful for the
classification of high-performance computers. This classification is
based on the way instruction streams and data streams are manipulated and
comprises four main architectural classes. We will first briefly sketch
these classes and afterwards fill in some details when each of the
classes is described separately.
- SISD machines: These are the conventional systems that use one CPU
for the execution of a program and hence can accommodate one instruction stream
that is executed serially. Nowadays almost all large servers have more than one
multi-core CPU, but each of these executes instruction streams that are unrelated.
Therefore, such systems still should be regarded as (a couple of) SISD machines
acting on different data spaces. Examples of SISD machines are, for instance,
workstations as offered by many vendors. The definition of SISD machines is
given here for completeness' sake; we will not discuss this type of machine in
this report.
- SIMD machines: Such systems often have a large number of processing
units that may all execute the same instruction on different data in lock-step.
So, a single instruction manipulates many data items in parallel. Examples of
SIMD machines in this class were the CPP DAP Gamma II and the Quadrics Apemille,
which have not been marketed for several years now. Nevertheless, the concept is
still interesting and has recently recurred in co-processors in HPC
systems, be it in a somewhat restricted form, as in GPUs.
- Another type of SIMD computation is vector processing. Until a few years ago
this used to be done in stand-alone vector processing systems as for instance
produced by Cray and NEC. Presently, the concept is mostly implemented in the
vector units of common chips; examples are the SSE and AVX units in AMD and
Intel chips (a small code sketch of this style of computation follows this list).
- MISD machines: Theoretically, in these types of machines multiple
instructions should act on a single stream of data. As yet no
practical machine in this class has been constructed, nor are such systems
easy to conceive. We will disregard them in the following discussions.
- MIMD machines: These machines execute several instruction
streams in parallel on different data. The difference with the
multi-processor SISD machines mentioned above lies in the fact that the
instructions and data are related because they represent different
parts of the same task to be executed. So, MIMD systems may run many
sub-tasks in parallel in order to shorten the time-to-solution for the
main task to be executed. There is a large variety of MIMD systems, and
especially in this class the Flynn taxonomy proves to be not fully
adequate for the classification of systems. Systems that behave very
differently, like a four-processor NEC SX-9 and a 10,000-processor
IBM Blue Gene/P, both fall into this class. In the following we will
therefore make another important distinction between classes of systems and
treat them accordingly.
- Shared memory systems: Shared memory systems have multiple CPUs, all
of which share the same address space. This means that the knowledge of
where data is stored is of no concern to the user as there is only one
memory, accessed by all CPUs on an equal basis. Shared memory systems can be
either SIMD or MIMD. We will sometimes use the abbreviations SM-SIMD and
SM-MIMD for the two subclasses.
- Distributed memory systems: In this case each CPU has its
own associated memory. The CPUs are connected by some network and may
exchange data between their respective memories when required. In
contrast to shared memory machines, the user must be aware of the
location of the data in the local memories and will have to move or
distribute these data explicitly when needed. Again, distributed memory
systems may be either SIMD or MIMD. The first class of SIMD systems
mentioned, which operate in lock-step, all have distributed memories
associated with the processors. As we will see, distributed-memory MIMD
systems exhibit a large variety in the topology of their interconnection
network. The details of this topology are largely hidden from the user,
which is quite helpful with respect to the portability of applications,
but at the same time may have an impact on performance.
For the distributed-memory systems we will sometimes use DM-SIMD and
DM-MIMD to indicate the two subclasses.
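To make the SIMD/vector concept referred to above more concrete, the small C
sketch below adds two arrays with AVX intrinsics, so that each instruction
operates on four double-precision numbers at once. The function name add_avx
is ours and the code merely illustrates the principle; it assumes an
AVX-capable x86 processor and a compiler switch such as -mavx.

    #include <immintrin.h>

    /* c[i] = a[i] + b[i]; one AVX instruction handles four doubles at a time. */
    void add_avx(const double *a, const double *b, double *c, int n)
    {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m256d va = _mm256_loadu_pd(&a[i]);            /* load 4 doubles  */
            __m256d vb = _mm256_loadu_pd(&b[i]);
            _mm256_storeu_pd(&c[i], _mm256_add_pd(va, vb)); /* add and store   */
        }
        for (; i < n; i++)                                  /* scalar remainder */
            c[i] = a[i] + b[i];
    }

In practice one seldom writes such intrinsics by hand: for simple loops like
this, compilers will usually generate the SSE/AVX instructions automatically.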
As already alluded to, although the difference between shared- and distributed-memory
machines seems clear-cut, this is not always entirely the case from the
user's point of view. For instance, the late Kendall Square Research systems
employed the idea of "virtual shared memory" at the hardware level. Virtual
shared memory can also be simulated at the programming level: a specification
of High Performance Fortran
[19] exists which, by means of compiler directives,
distributes the data over the available processors. Therefore, a system on
which HPF is implemented will in this case look like a shared memory machine to
the user. Other vendors of Massively Parallel Processing systems (sometimes
called MPP systems), like SGI, are also able to support proprietary
virtual shared-memory programming models because these physically
distributed-memory systems are able to address the whole collective address
space. So, for the user such systems have one global address space
spanning all of the memory in the system. We will say a little more about the
structure of such systems in the ccNUMA
section. In addition, packages like TreadMarks
([2]) provide a virtual shared memory
environment for networks of workstations. A good overview of such systems is
given in [9]. Since 2006 Intel has marketed its
"Cluster OpenMP" (based on TreadMarks) as a commercial product. It allows the
shared-memory OpenMP parallel model
[29] to be used on distributed-memory
clusters. For some years now, companies like ScaleMP and 3Leaf have also provided
products to aggregate physically distributed memory into virtual shared memory, in
the case of ScaleMP with a small amount of assisting hardware.
Lastly, so-called Partitioned Global Address Space
(PGAS) languages like Co-Array Fortran (CAF) and Unified Parallel C (UPC) are
gaining in popularity due to the recently emerging multi-core processors. With
a proper implementation these languages offer a global view of the data, and one
has language facilities that make it possible to specify processing of data
associated with a (set of) processor(s) without the need for explicitly moving
the data around.
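As a sketch of the PGAS idea, the fragment below uses Unified Parallel C (UPC),
one of the languages just named. The shared array is distributed over the
threads by the language itself, and the upc_forall construct assigns each
iteration to the thread that owns the corresponding element, so no explicit
data movement is needed. The array size and the printed element are arbitrary
choices of ours; the code assumes a UPC compiler (e.g., Berkeley UPC).

    #include <stdio.h>
    #include <upc.h>

    shared int a[10*THREADS];   /* distributed cyclically over all threads */

    int main(void)
    {
        /* The affinity expression &a[i] runs iteration i on the owner of a[i]. */
        upc_forall (int i = 0; i < 10*THREADS; i++; &a[i])
            a[i] = i * i;
        upc_barrier;            /* wait until all threads are done */
        if (MYTHREAD == 0)
            printf("a[3] = %d\n", a[3]);
        return 0;
    }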
Distributed processing takes the DM-MIMD concept one step
further: instead of many integrated processors in one or several
boxes, workstations, mainframes, etc. are connected by (Gigabit)
Ethernet, FDDI, or other networks and set to work concurrently on tasks in
the same program. Conceptually, this is no different from DM-MIMD
computing, but the communication between processors is often orders of
magnitude slower.
Packages that were initially made to realise distributed computing, like PVM
(standing for Parallel Virtual Machine)
[12] and MPI
(Message Passing Interface,
[24], [25]), have become de facto standards
for the "message passing" programming model.
MPI and PVM have become so widely accepted that they have been
adopted by virtually all major vendors of distributed-memory MIMD systems and
are even available on shared-memory MIMD systems for compatibility reasons. In
addition, there is a tendency to cluster shared-memory systems by a fast
communication network to obtain systems with a very high computational power.
E.g., the NEC SX-9 has this structure. So, within the clustered nodes a
shared-memory programming style can be used, while between nodes message
passing should be used. It must be said that PVM is not used very much anymore
and that MPI has more or less become the de facto standard.
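As a minimal illustration of the message-passing model, the C sketch below
uses MPI to combine one value from every process on process 0. Only the
explicit MPI_Reduce call moves data between the separate address spaces; the
summed values are just a toy example of ours.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int local = rank + 1;   /* each process contributes its own value */
        int sum = 0;
        /* Explicit communication: combine the local values on process 0. */
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum over %d processes: %d\n", size, sum);
        MPI_Finalize();
        return 0;
    }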
For SM-MIMD systems we should mention OpenMP
([29],
[5],
[6]), which
can be used to parallelise Fortran and C(++) programs by inserting comment
directives (Fortran 77/90/95) or pragmas (C/C++) into the code. OpenMP
has quickly been adopted by the major vendors and has become a well-established
standard for shared memory systems.
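To show the directive style just described, the C sketch below parallelises a
loop with a single OpenMP pragma; all threads operate on shared data in one
address space and the partial sums are combined by the reduction clause. The
harmonic-sum computation is merely a toy example of ours; compile with an
OpenMP switch such as -fopenmp.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;
        /* One pragma distributes the iterations over the available threads. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1);
        printf("Sum with up to %d threads: %f\n", omp_get_max_threads(), sum);
        return 0;
    }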
Note, however, that for both MPI-3 and OpenMP 3, the latest standards, many
systems/compilers implement only a part of these standards. One therefore has to
inquire carefully whether a particular system has the full functionality of
these standards available. The standard vendor documentation will almost never
be clear on this point.