This section contains the explanation of some often-used terms that either are not explained in the text or, by contrast, are described extensively and for which a short description may be convenient.
API: API stands for Application Program(mer) Interface. Usually it consists of a set of library functions that enable a program to access and/or control the functionality of certain non-standard devices, like an I/O device or a computational accelerator.
Architecture: The internal structure of a computer system or a chip that determines its operational functionality and performance.
Architectural class: Classification of computer systems according to their architecture: e.g., distributed memory MIMD computer, symmetric multi processor (SMP), etc. See this glossary and section architecture for the description of the various classes.
ASCI: Accelerated Strategic Computing Initiative. A massive funding project in the USA concerning research and production of high-performance systems. The main motivation is said to be the management of the USA nuclear stockpile by computational modeling instead of actual testing. ASCI has greatly influenced the development of high-performance systems in a single direction: clusters of SMP systems.
ASIC: Application Specific Integrated Circuit. A chip that is designed to fulfill a specific task in a computer system, e.g., for routing messages in a network.
Bank cycle time: The time needed by a (cache-)memory bank to recover from a data access request to that bank. Within the bank cycle time no other requests can be accepted.
Beowulf cluster: Cluster of PCs or workstations with a private network to connect them. Initially the name was used for do-it-yourself collections of PCs mostly connected by Ethernet and running Linux to have a cheap alternative for "integrated" parallel machines. Presently, the definition is wider including high-speed switched networks, fast RISC-based processors and complete vendor-preconfigured rack-mounted systems with either Linux or Windows as an operating system.
Bit-serial: The operation on data on a bit-by-bit basis rather than on byte or 4/8-byte data entities in parallel. Bit-serial operation is done in processor array machines where for signal and image processing this mode is advantageous.
Cache — data, instruction: Small, fast memory close to the CPU that can hold a part of the data or instructions to be processed. The primary or level 1 (L1) caches are virtually always located on the same chip as the CPU and are divided into a cache for instructions and one for data. A secondary or level 2 (L2) cache is sometimes located off-chip and holds both data and instructions. Caches are put into the system to hide the large latency that occurs when data have to be fetched from memory. By loading data and/or instructions that are likely to be needed into the caches, this latency can be significantly reduced.
Capability computing: A type of large-scale computing in which one wants to accommodate very large and time consuming computing tasks. This requires that parallel machines or clusters are managed with the highest priority for this type of computing possibly with the consequence that the computing resources in the system are not always used with the greatest efficiency.
Capacity computing: A type of large-scale computing in which one wants to use the system (cluster) with the highest possible throughput capacity, using the machine resources as efficiently as possible. This may have adverse effects on the performance of individual computing tasks while optimising the overall usage of the system.
ccNUMA: Cache Coherent Non-Uniform Memory Access. Machines that support this type of memory access have a physically distributed memory but logically it is shared. Because of the physical difference of the location of the data items, a data request may take a varying amount of time depending on the location of the data. As both the memory parts and the caches in such systems are distributed, a mechanism is necessary to keep the data consistent system-wide. There are various techniques to enforce this (directory memory, snoopy bus protocol). When one of these techniques is implemented the system is said to be cache coherent.
Clock cycle: Fundamental time unit of a computer. Every operation executed by the computer takes at least one and possibly multiple cycles. Typically, the clock cycle is now in the order of one to a quarter of a nanosecond.
Clock frequency: Reciprocal of the clock cycle: the number of cycles per second expressed in Hertz (Hz). Typical clock frequencies nowadays are 1–4 GHz.
Clos network: A logarithmic network in which the nodes are attached to switches that form a spine that ultimately connects all nodes.
Co-Array Fortran: Co-Array Fortran (CAF) is a so-called Partitioned Global Address Space programming language (PGAS language, see below) that extends Fortran by allowing the programmer to specify a processor number for data items within distributed data structures. This allows for processing such data without explicit data transfer between the processors. CAF has been incorporated in the Fortran 2008 standard.
Communication latency: Time overhead occurring when a message is sent over a communication network from one processor to another. Typically the latencies are in the order of a few µs for specially designed networks, like InfiniBand or Myrinet, to about 40 µs for Gbit Ethernet.
Control processor: The processor in a processor array machine that issues the instructions to be executed by all the processors in the processor array. Alternatively, the control processor may perform tasks in which the processors in the array are not involved, e.g., I/O operations or serial operations.
CRC: Cyclic Redundancy Check. A type of error detection/correction method based on treating a data item as a large binary number. This number is divided by another fixed binary number and the remainder is regarded as a checksum from which the correctness and sometimes the (type of) error can be recovered. CRC error detection is for instance used in SCI networks.
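The division step in the CRC entry can be sketched as follows. This is a minimal illustration, assuming the usual modulo-2 (XOR) form of the division; the function name `crc_remainder` and the 4-bit generator `1011` are chosen for the example only, and real networks use much longer, standardised generators.

```python
def crc_remainder(data_bits, generator):
    """Return the CRC checksum (as a bit string) of data_bits for a fixed generator.

    Both arguments are strings of '0'/'1'. The division is carried out
    modulo 2, so "subtraction" is a bitwise XOR.
    """
    n = len(generator) - 1                      # number of checksum bits
    work = [int(b) for b in data_bits] + [0] * n  # append n zero bits
    gen = [int(b) for b in generator]
    for i in range(len(data_bits)):
        if work[i] == 1:                        # leading bit set: XOR in the generator
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-n:])   # what is left is the remainder


checksum = crc_remainder("11010011101100", "1011")
# A receiver repeats the computation; a different remainder signals an error.
corrupted = crc_remainder("11010011101101", "1011")
```

The sender transmits the data together with `checksum`; the receiver recomputes the remainder and compares.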
Crossbar (multistage): A network in which all input ports are directly connected to all output ports without interference from messages from other ports. In a one-stage crossbar this has the effect that for instance all memory modules in a computer system are directly coupled to all CPUs. This is often the case in multi-CPU vector systems. In multistage crossbar networks the output ports of one crossbar module are coupled with the input ports of other crossbar modules. In this way one is able to build networks that grow with logarithmic complexity, thus reducing the cost of a large network.
Distributed Memory (DM): Architectural class of machines in which the memory of the system is distributed over the nodes in the system. Access to the data in the system has to be done via an interconnection network that connects the nodes and may be either explicit via message passing or implicit (either using HPF or automatically in a ccNUMA system).
Dragonfly topology: A hierarchical three-level interconnect topology for the levels router, group, and system. The crux in this topology is to have high-radix routers that tightly interconnect on a local and group level. This leads to a low number of hops for traversing the system.
EPIC: Explicitly Parallel Instruction Computing. This term was coined by Intel for its IA-64 chips and the Instruction Set that is defined for them. EPIC can be seen as Very Large Instruction Word computing with a few enhancements. The gist of it is that no dynamic instruction scheduling is performed as is done in RISC processors but rather that instruction scheduling and speculative execution of code are determined beforehand in the compilation stage of a program. This simplifies the chip design while potentially many instructions can be executed in parallel.
Fat tree: A network that has the structure of a binary (quad) tree but that is modified such that near the root the available bandwidth is higher than near the leaves. This stems from the fact that often a root processor has to gather or broadcast data to all other processors and without this modification contention would occur near the root.
Feature size: The typical distance between the various devices on a chip. By now processors with a feature size of 45 nanometer (10^-9 m, nm) are on the market. Lowering the feature size is beneficial in that the speed of the devices on the chip will increase because of the smaller distances the electrons have to travel. The feature size cannot be shrunk much more though, because of the leaking of current that can go up to unacceptable levels. The lower bound is now believed to be around 7–8 nm for silicon.
Flop: floating-point operation. Flop/s are used as a measure for the speed or performance of a computer. Because of the speed of present day computers, rather Megaflop/s (Mflop/s, 10^6 flop/s), Gigaflop/s (Gflop/s, 10^9 flop/s), Teraflop/s (Tflop/s, 10^12 flop/s), and Petaflop/s (Pflop/s, 10^15 flop/s) are used.
FPGA: FPGA stands for Field Programmable Gate Array. This is an array of logic gates that can be hardware-programmed to fulfill user-specified tasks. In this way one can devise special purpose functional units that may be very efficient for this limited task. As FPGAs can be reconfigured dynamically, be it only 100–1,000 times per second, it is theoretically possible to optimise them for more complex special tasks at speeds that are higher than what can be achieved with general purpose processors.
Frontside Bus: Bus that connects the main memory with the CPU core(s) via a memory controller (only in scalar processors, not in vector processors). The frontside bus has increasingly become a bottleneck in the performance of a system because of the limited capacity of delivering data to the core(s).
Functional unit: Unit in a CPU that is responsible for the execution of a predefined function, e.g., the loading of data in the primary cache or executing a floating-point addition.
Grid — 2-D, 3-D: A network structure where the nodes are connected in a 2-D or 3-D grid layout. In virtually all cases the end points of the grid are again connected to the starting points thus forming a 2-D or 3-D torus.
HBA: HBA stands for Host Bus Adaptor, also known as Host Channel Adaptor (specifically for InfiniBand). It is the part in an external network that constitutes the interface between the network itself and the PCI bus of the compute node. HBAs usually carry a good amount of processing intelligence themselves for initiating communication, buffering, checking for correctness, etc. HBAs tend to have different names in different networks: HCA or TCA for InfiniBand, LANai for Myrinet, etc.
HPA: High Performance (Data) Analytics. The analysis of very large amounts of data in order to find useful/interesting patterns. First used in marketing circles to track customer profiles but increasingly used for scientific data analysis, e.g., in genomics, climatology records, astronomy data, etc.
HPCS: Abbreviation of High Productivity Computer Systems: a program initiated by DARPA, the US Department of Defense agency that provides large-scale financial support to future (sometimes futuristic) research that might benefit the US military in some way. The HPCS program was set up to ensure that by 2010 computer systems will exist that are capable of a performance of 1 Pflop/s in real applications as opposed to the Theoretical Peak Performance which might be much higher. Initially Cray, HP, IBM, SGI, and SUN participated in the program. After repeated evaluations only Cray and IBM still get support from the HPCS program.
HPF: High Performance Fortran. A compiler and run time system that enables Fortran programs to run on a distributed memory system as if on a shared memory system. Data partitioning, processor layout, etc. are specified as comment directives, which makes it possible to run the program serially as well. The HPF implementations currently available commercially allow only for simple partitioning schemes and for all processors executing exactly the same code at the same time (on different data, the so-called Single Program Multiple Data (SPMD) mode).
Hypercube: A network with logarithmic complexity which has the structure of a generalised cube: to obtain a hypercube of the next dimension one doubles the perimeter of the structure and connects their vertices with the original structure.
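The hypercube structure can be made concrete with a small sketch. It assumes the common labelling convention, not spelled out in this glossary, in which the 2^d nodes of a d-dimensional hypercube are numbered in binary and two nodes are linked when their numbers differ in exactly one bit. The function names are illustrative.

```python
def hypercube_neighbours(node, dim):
    """Nodes directly linked to `node` in a hypercube of dimension `dim`."""
    # Flipping bit k gives the neighbour along dimension k.
    return [node ^ (1 << k) for k in range(dim)]


def hypercube_hops(a, b):
    """Minimum number of links between nodes a and b (their Hamming distance)."""
    return bin(a ^ b).count("1")


neighbours_of_0 = hypercube_neighbours(0, 3)   # 3-D cube, 8 nodes
farthest = hypercube_hops(0b000, 0b111)        # opposite corners of the cube
```

Under this labelling the largest distance between two nodes equals the dimension d, that is, the base-2 logarithm of the number of nodes, which is where the logarithmic complexity comes from.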
HyperTransport: An AMD-developed bus that directly connects a processor to its memory without use of a Frontside Bus at high speed. It also can connect directly to other processors and because the specification is open, also other types of devices, like computational accelerators can be connected in this fashion to the memory. Intel provides a similar type of bus, called QPI.
IDE: Integrated Development Environment. A software environment that integrates several tools to write, debug, and optimise programs. The added value of an IDE lies (should lie) in the direct feedback from the tools. This should shorten the development time for optimised programs for the target architecture. The most well-known IDE is probably IBM's Eclipse.
Instruction Set Architecture: The set of instructions that a CPU is designed to execute. The Instruction Set Architecture (ISA) represents the repertoire of instructions that the designers determined to be adequate for a certain CPU. Note that CPUs of different making may have the same ISA. For instance the AMD processors (purposely) implement the Intel IA-32 ISA on a processor with a different structure.
LUT: Look-up table. A measure for the amount of memory cells on an FPGA.
Memory bank: Part of (cache) memory that is addressed consecutively in the total set of memory banks, i.e., when data item a(n) is stored in bank b, data item a(n+1) is stored in bank b+1. (Cache) memory is divided in banks to evade the effects of the bank cycle time (see above). When data is stored or retrieved consecutively each bank has enough time to recover before the next request for that bank arrives.
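The interplay of memory banks and bank cycle time can be sketched with a toy model. The assumptions here are illustrative only: item n lives in bank n modulo the number of banks, at most one request is issued per cycle, and a bank is unavailable for `bank_cycle_time` cycles after accepting a request.

```python
def access_cycles(addresses, n_banks, bank_cycle_time):
    """Cycles needed to issue all requests in order under the toy model."""
    free_at = [0] * n_banks          # first cycle at which each bank accepts a request
    cycle = 0
    for addr in addresses:
        bank = addr % n_banks        # consecutive items go to consecutive banks
        cycle = max(cycle, free_at[bank])   # stall until the bank has recovered
        free_at[bank] = cycle + bank_cycle_time
        cycle += 1                   # the request itself occupies one cycle
    return cycle


consecutive = access_cycles(range(16), n_banks=8, bank_cycle_time=4)
same_bank = access_cycles(range(0, 128, 8), n_banks=8, bank_cycle_time=4)
```

In this model 16 consecutive accesses proceed without stalls, whereas 16 accesses with a stride of 8 all hit the same bank and have to wait for it to recover each time.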
Message passing: Style of parallel programming for distributed memory systems in which non-local data that is required must be transported explicitly to the processor(s) that need(s) it by appropriate send and receive messages.
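The message passing style can be imitated with the Python standard library. The sketch below is not MPI or PVM: threads stand in for processors and queues stand in for the network, and all names are made up for the illustration. What it shares with real message passing is that each worker sees only the data explicitly sent to it.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """A 'processor' that only sees data explicitly sent to it."""
    local_data = inbox.get()                 # blocking receive
    outbox.put((rank, sum(local_data)))      # send the partial result back


def message_passing_sum(values, n_workers=2):
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()
    for rank in range(n_workers):
        inboxes[rank].put(values[rank::n_workers])   # explicit send of each share
    for t in threads:
        t.join()
    # Collect one message per worker and combine the partial sums.
    return sum(results.get()[1] for _ in range(n_workers))


total = message_passing_sum(list(range(10)), n_workers=3)
```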
MPI: A message passing library, Message Passing Interface, that implements the message passing style of programming. Presently MPI is the de facto standard for this kind of programming.
Multi-core chip: A chip that contains more than one CPU core and (possibly common) caches. Due to the progression of the integration level more devices can be fitted on a chip. AMD, Fujitsu, IBM, and Intel make multi-core chips. Currently the maximum number of cores per chip is around 16.
Multithreading: A capability of a processor core to switch to another processing thread, i.e., a set of logically connected instructions that make up (part of) a process. This capability is used when a process thread stalls, for instance because necessary data are not yet available. Switching to another thread that has instructions that can be executed will yield a better processing utilisation.
NUMA factor: The difference in speed of accessing local and non-local data. For instance when it takes 3 times longer to access non-local data than local data, the NUMA factor is 3.
OpenMP: A shared memory parallel programming model in which shared memory systems and SMPs can be operated in parallel. The parallelisation is controlled by comment directives (in Fortran) or pragmas (in C and C++), so that the same programs also can be run unmodified on serial machines.
PCI bus: Bus on a PC node, typically used for I/O, but also to connect nodes with a communication network. The highest bandwidth of PCI-X, a common PCI bus version, is ≈ 1 GB/s, while its successor, PCI Express Generation 2, is now normally available with a bandwidth of 4–8 GB/s. The newest version, PCI Express Generation 3, allows for a maximum of 16 GB/s for a 16× connection.
PGAS languages: Partitioned Global Address Space languages. A family of languages that allow the programmer to specify how data items are distributed over the available processes. This gives the opportunity to process these data items in a global fashion without the need for explicit data transfer between processors. It is believed that at least for a part of the HPC user community this makes parallel programming more accessible. The best known languages are presently Unified Parallel C (UPC) and Co-Array Fortran (CAF). Titanium, a Java-like language, is also employed. Apart from these languages that are already in use (be it not extensively), Chapel, developed by Cray, and X10, developed by IBM, both under a contract with the US Department of Defense, have PGAS facilities (and more). However, the latter languages are still in the development phase and no complete compilers for these languages are available yet.
Pipelining: Segmenting a functional unit such that it can accept new operands every cycle while the total execution of the instruction may take many cycles. The pipeline construction works like a conveyor belt accepting units until the pipeline is filled and then producing results every cycle.
Processor array: System in which an array (mostly a 2-D grid) of simple processors execute their program instructions in lock-step under the control of a Control Processor.
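The conveyor-belt behaviour of a pipeline can be simulated in a few lines. This is an idealised sketch, assuming one new operand enters per cycle and there are no stalls; the function name and the representation of the stages as a list are illustrative.

```python
def simulate_pipeline(operands, n_stages):
    """Return (cycle, operand) pairs giving the cycle in which each result appears."""
    stages = [None] * n_stages       # stages[0] is the entry, stages[-1] the exit
    pending = list(operands)         # operands must not be None
    results = []
    cycle = 0
    while pending or any(s is not None for s in stages):
        cycle += 1
        # Every operand moves one stage along; a new one enters if available.
        entering = pending.pop(0) if pending else None
        stages = [entering] + stages[:-1]
        if stages[-1] is not None:   # an operand has reached the last stage
            results.append((cycle, stages[-1]))
    return results


results = simulate_pipeline(["a", "b", "c", "d", "e"], n_stages=4)
finish_cycles = [cycle for cycle, _ in results]
```

With 4 stages the first result appears after 4 cycles and one more follows every cycle, so n operands take 4 + n - 1 cycles instead of the 4n cycles an unsegmented unit of the same latency would need.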
PVM: Another message passing library that has been widely used. It was originally developed to run on collections of workstations and it can dynamically spawn or delete processes running a task. PVM now largely has been replaced by MPI.
Quad-core chip: A chip that contains four CPU cores and (possibly common) caches. Due to the progression of the integration level more devices can be fitted on a chip. AMD, Fujitsu, IBM, and Intel made quad-core chips that have been followed by so-called many-core chips that will hold 8–16 cores.
QPI: QuickPath Interconnect: Formerly known as Common System Interface (CSI). A bus structure developed by Intel and available since early 2009 that directly connects a (variety of) processor(s) of a system at high speed to its memory without the need for a Frontside Bus. It also can be used for connecting processors to each other. A similar type of interconnection, HyperTransport, has been provided for some years by AMD for its Opteron processors (see AMD for details).
Register file: The set of registers in a CPU that are independent targets for the code to be executed possibly complemented with registers that hold constants like 0/1, registers for renaming intermediary results, and in some cases a separate register stack to hold function arguments and routine return addresses.
RISC: Reduced Instruction Set Computer. A CPU with an instruction set that is simpler than that of the earlier Complex Instruction Set Computers (CISCs). The instruction set was reduced to simple instructions that ideally should execute in one cycle.
SDK: Software Development Kit. The term has become more common as many vendors that sell computational accelerator hardware also need to provide the software that is necessary to make the accelerator effectively usable for non-specialist HPC users.
Shared Memory (SM): Memory configuration of a computer in which all processors have direct access to all the memory in the system. Because of technological limitations on shared bandwidth generally not more than about 16 processors share a common memory.
shmem: One-sided fast communication library first provided by Cray for its systems. However, shmem implementations are also available for SGI and some other systems.
SMP: Symmetric Multi-Processing. This term is often used for compute nodes with shared memory that are part of a larger system and where this collection of nodes forms the total system. The nodes may be organised as a ccNUMA system or as a distributed memory system of which the nodes can be programmed using OpenMP while inter-node communication should be done by message passing.
TLB: Translation Look-aside Buffer. A specialised cache that holds a table of physical addresses as generated from the virtual addresses used in the program code.
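The role of a TLB can be sketched as a small cache in front of a full page table. Everything specific below is an assumption made for the illustration: the 4096-byte page size, the capacity of 4 entries, the least-recently-used replacement, and the dictionary standing in for the page table.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLB:
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table    # full map: virtual page -> physical frame
        self.capacity = capacity
        self.entries = OrderedDict()    # cached translations, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)          # mark as most recently used
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)    # evict least recently used
            self.entries[page] = self.page_table[page]  # slow path: consult page table
        return self.entries[page] * PAGE_SIZE + offset


tlb = TLB({0: 7, 1: 3})
first = tlb.translate(10)           # miss: page 0 is looked up in the page table
second = tlb.translate(4096 + 5)    # miss: page 1
third = tlb.translate(20)           # hit: page 0 is already cached
```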
Torus: Structure that results when the end points of a grid are wrapped around to connect to the starting points of that grid. This configuration is often used in the interconnection networks of parallel machines either with a 2-D grid or with 3-D grid.
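The wrap-around that turns a grid into a torus amounts to modular arithmetic on the node coordinates. A minimal sketch, with an illustrative function name, that works for 2-D and 3-D grids alike:

```python
def torus_neighbours(coord, shape):
    """Direct neighbours of node `coord` in a torus with the given grid shape."""
    neighbours = []
    for axis, size in enumerate(shape):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size   # end points wrap to the start
            neighbours.append(tuple(n))
    return neighbours


corner_2d = torus_neighbours((0, 0), (4, 4))
corner_3d = torus_neighbours((0, 0, 0), (4, 4, 4))
```

In a plain grid a corner node would have only 2 (2-D) or 3 (3-D) neighbours; with the wrap-around every node has 4 or 6.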
U: A unit used in defining the height of a component in a standard 19-inch wide rack system. 1 U is 44.5 mm or 1.75 inch.
UPC: Unified Parallel C (UPC). A PGAS language (see above). UPC is an extension of C that offers the possibility to specify how data items are distributed over the available processors, thus enabling processing these data without explicit data transfers between the processors.
Vector unit (pipe): A pipelined functional unit that is fed with operands from a vector register and will produce a result every cycle (after filling the pipeline) for the complete contents of the vector register.
Virtual Shared Memory: The emulation of a shared memory system on a distributed memory machine by a software layer.
VLIW processing: Very Large Instruction Word processing. The use of large instruction words to keep many functional units busy in parallel. The scheduling of instructions is done statically by the compiler and, as such, requires high quality code generation by that compiler. VLIW processing has been revived in the IA-64 chip architecture, there called EPIC (see above).