Figure 1:
Block diagram of a vector processor.
A single-processor vector machine has only one of the vector processors
depicted, and the system may even share its scalar floating-point capability
with the vector processor (as was the case in some
Cray systems). Note that the VPU is shown without a cache.
The majority of vector processors no longer employ a cache. In
many cases the vector unit cannot take advantage of it, and execution speed may
even be unfavourably affected because of frequent cache overflow. Of late,
however, this tendency has reversed because of the increasing gap in speed
between the memory and the processors: the Cray X2 has a cache, and the follow-on
of NEC's SX-9 vector system has a facility that is somewhat like a cache.
Although vector processors have existed that loaded their operands directly from
memory and stored the results immediately back in memory (CDC Cyber 205,
ETA-10), all present-day vector processors use vector registers. This usually
does not impair the speed of operations, while it provides much more flexibility in
gathering operands and manipulating intermediate results.
Because of the generic nature of Figure 1, no
details of the interconnection between the VPU and the memory are
shown. Still, these details are very important for the effective speed
of a vector operation: when the bandwidth between memory and the VPU is
too small, the VPU cannot be used to full advantage because
it has to wait for operands and/or has to wait before it can store
results. When the ratio of arithmetic to load/store operations is not
high enough to compensate for such situations, severe performance
losses may be incurred.
The influence of the number of load/store paths for the dyadic vector
operation c = a + b (a, b, and c vectors)
is depicted in Figure 2.
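
The effect of the number of paths can be approximated with a very simple
model. The Python sketch below is illustrative only and does not describe any
particular machine. It assumes that each path moves one operand per cycle,
that load and store paths are dedicated, that the VPU performs one arithmetic
operation per cycle, and that start-up latency and memory bank conflicts can
be ignored. The function and parameter names are hypothetical.

```python
from fractions import Fraction


def cycles_per_element(loads, stores, flops, load_paths, store_paths):
    """Idealized steady-state cycles needed per vector element.

    Assumption: loading, storing, and arithmetic overlap perfectly, so the
    slowest of the three streams determines the rate.
    """
    if load_paths < 1 or store_paths < 1:
        raise ValueError("at least one load path and one store path are required")
    if flops < 1:
        raise ValueError("at least one arithmetic operation per element is required")
    load_cycles = Fraction(loads, load_paths)      # cycles spent fetching operands
    store_cycles = Fraction(stores, store_paths)   # cycles spent writing results
    compute_cycles = Fraction(flops, 1)            # one arithmetic op per cycle (assumed)
    return max(load_cycles, store_cycles, compute_cycles)


def fraction_of_peak(loads, stores, flops, load_paths, store_paths):
    """Fraction of the peak arithmetic rate that the VPU can sustain."""
    needed = cycles_per_element(loads, stores, flops, load_paths, store_paths)
    return Fraction(flops, 1) / needed


if __name__ == "__main__":
    # c = a + b: two loads, one store, one addition per element.
    for load_paths in (1, 2):
        rate = fraction_of_peak(loads=2, stores=1, flops=1,
                                load_paths=load_paths, store_paths=1)
        print(f"{load_paths} load path(s), 1 store path: {rate} of peak")
```

Under these assumptions, c = a + b with one load path and one store path
sustains half of the peak rate, whereas two load paths and one store path
sustain the full rate. An operation that performs two arithmetic operations
per element on the same two operands reaches the full rate even with a single
load path, which is the compensating effect of a higher ratio of arithmetic to
load/store operations.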