              How to run the EuroBen Efficiency Benchmark.
              ===========================================

Below we describe how to install and run the EuroBen Efficiency
Benchmark. If you run into trouble, please mail:

                 Aad van der Steen; steen@phys.uu.nl 
                 -----------------------------------

The EuroBen Efficiency Benchmark has the following structure:

                         |-- Makefile
                         |-- commun/
                         |-- dddot/
                         |-- fft1d/
                         |-- gmxm/
                         |-- gmxv/
             effbench/ --|-- linful/
                         |-- linspar/
                         |-- ping/
                         |-- pingpong/
                         |-- qsort/
                         |-- smxv/
                         |-- transp/
                         |-- wvlt2d/

================================================================================
                         INSTALLATION AND EXECUTION
================================================================================
The master Makefile in effbench/ can be used for installation of the
13 programs:

commun  -- A test for the speed of various communication patterns (MPI).
dddot   -- A test for the speed of a distributed dot product (MPI).
fft1d   -- A test for the speed of a 1-D FFT.
gmxv    -- A test for the speed of the matrix-vector multiply Ab = c.
gmxm    -- A test for the speed of the matrix-matrix multiply AB = C.
linful  -- A test for the solution of a full linear system Ax = b.
linspar -- A test for the solution of a sparse linear system Ax = b.
ping    -- A very detailed test to assess bandwidth and latency
           in point-to-point communication (one-sided communication, MPI).
pingpong-- A very detailed ping-pong test to assess bandwidth and latency
           in point-to-point communication (two-sided communication, MPI).
qsort   -- A test for the speed of Quicksort on Integers and 8-byte Reals.	   	   
smxv    -- A test for the speed of the sparse matrix-vector multiply Ab = c in
           CRS format.
transp  -- A test for the speed of a global distributed matrix transpose (MPI).	   
wvlt2d  -- A test for the speed of a 2-D Haar Wavelet Transform.


We assume that, at least the first time, you will want to run all
programs with the same compiler options. You should perform the
following steps:

1) cd basics/
   1a - Modify the subroutine 'state.f' such that it reflects the state
        of the system: type of machine, compiler version, compiler
        options, OS version, etc.
   1b - OPTIONAL (Non-MPI programs):
	The program directories for the sequential programs contain
	the timing functions 'wclock.f' and 'cclock.c'. 'wclock.f' is a
	Fortran interface routine that calls 'cclock.c', which in turn
	relies on the 'gettimeofday' Unix system routine. This timer
	works almost everywhere (except under UNICOS) and delivers the
	wallclock time with a resolution of a few microseconds. It is
	generally better than the Fortran 90 routine 'System_Clock'.
        If, for any reason, you want to use another or better wallclock
        timer, modify the Real*8 function 'wclock.f' in basics/.

2) Go back to effbench/
   2a - Do a 'make state': The 'state.f' routine that you have modified
        is copied to all the program directories.
   2b - OPTIONAL (Non-MPI programs):
	If you have modified 'wclock.f' for the sequential programs in
	basics/, do a 'make clock-seq':  The 'wclock.f' is copied to
	the relevant program directories.
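
   For reference, steps 1 and 2 boil down to a command sequence like the
   one below ('vi' is only an example; use whatever editor you prefer):

        cd basics/
        vi state.f          # describe machine, compiler version, options, OS
        vi wclock.f         # optional: plug in another wallclock timer
        cd ..               # back to effbench/
        make state          # copy state.f to all program directories
        make clock-seq      # optional: copy wclock.f to the sequential programs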

3) cd install/
   3a - In install/ you will find header files with definitions for
        the 'make' utility.
   3a1: Sequential programs:
        Modify the 'Make.Incl-seq' so that it contains the correct
        name for the Fortran 90 compiler, the loader (usually the same as
        the compiler), and the options for the Fortran 90 and C
        compilers. For completeness' sake there are empty definitions
        for libraries (LIBS) and include files (INCS) you might want
        to use, but in normal situations they are not needed for the
        sequential programs. (An illustrative sketch of the three
        header files modified in this step is given after step 3a3.)
   3a2: Parallel programs:
        Modify the 'Make.Incl-mpi' so that it contains the correct
        name for the Fortran 90 compiler, the loader (usually the same as
        the compiler), and the options for the Fortran 90 compiler. The
	names for the compiler systems for MPI programs may be
	different from those for sequential programs. For completeness'
	sake there is an empty definition for the include file (INCS) you
	might want to use but in normal situations this is not needed.
	For libraries (LIBS) fill in the name of the MPI library
	(if necessary).
   3a3: Modify the 'Speed.Incl' file:
        It contains only one line, starting with '++++'.
        Replace it by the Theoretical Peak Performance of your
        system expressed in Mflop/s per CPU. So for a system with a
	Theoretical Peak Performance of 3.6 Gflop/s per processor:
	++++ --> 3600.
	NOTE: This should really be per processor (per socket if you will)
	      and NOT per core.
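
   As an illustration only (the actual variable names and layout of the
   header files may differ, and the compiler names and options used here,
   ifort, mpif90 and -O3, are merely placeholders for those of your own
   system), the edited files would contain roughly:

        Make.Incl-seq:  Fortran 90 compiler/loader  = ifort
                        Fortran 90 and C options    = -O3
                        LIBS, INCS                  = (left empty)

        Make.Incl-mpi:  Fortran 90 compiler/loader  = mpif90
                        Fortran 90 options          = -O3
                        LIBS                        = MPI library, if needed
                        INCS                        = (left empty)

        Speed.Incl:     the '++++' on its single line replaced by 3600
                        (for a peak of 3.6 Gflop/s per processor)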

4) Go back to effbench/
   Do a 'make lib'. This will cause an object library 'intlib.a' to be
   made that is used by the sequential numerical programs to compute
   the integral of the performance over the appropriate problem size
   ranges and to calculate latencies for some MPI programs.

5)
   5a: Do a 'make make-seq': This will cause the Makefiles in the
       directories of the sequential programs to be completed according
       to the specifications you made in 'install/Make.Incl-seq'.
   5b: Do a 'make make-mpi': This will cause the Makefiles in the
       directories of the MPI programs to be completed according
       to the specifications you made in 'install/Make.Incl-mpi'.

6) Do a 'make makeall': This will build, in each of the directories <prog>
   (where <prog> is 'commun/', 'dddot/', 'fft1d/', etc.), an executable
   named x.<prog>. This will take a minute.
   6a - For the non-MPI programs these can be run by: 'x.<prog>'.
   6b - For the MPI programs run them by: 'mpirun -np <p> x.<prog>'
        or 'mpiexec -n <p> x.<prog>', where <p> is the desired number
        of processes and x.<prog> the MPI executable (or by any
        equivalent of mpirun if this is not available; also see 8b below).

7) Do a 'make speed': This will cause the Theoretical Peak Performance
   to be set to the correct value in the relevant directories.
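
   Put together, steps 4 to 7 amount to the following commands, all issued
   from effbench/:

        make lib        # build the object library intlib.a
        make make-seq   # complete the Makefiles of the sequential programs
        make make-mpi   # complete the Makefiles of the MPI programs
        make makeall    # build the executables x.<prog>
        make speed      # set the Theoretical Peak Performance value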

8) 
   8a: Do a 'make runall': This will run all sequential programs in turn.   
       The results are placed in a directory called 'Log.`hostname`', where
       'hostname' is the local name of your system. This will take 
       a few minutes. The results have names '<prog>.log' where <prog> is
       any of the programs listed above.
   8b: For the MPI programs 'make runall' will cause the MPI programs to
       be run and the results to be transferred to 'Log.`hostname`'. The
       programs are run with the following numbers of processes:
         mpirun -np 6  x.commun   > commun.log
         mpirun -np 16 x.dddot    > dddot.log
         mpirun -np 2  x.ping     > ping.log
         mpirun -np 2  x.pingpong > pingpong.log
         mpirun -np 8  x.transp   > transp.log
	 
       NOTE: Although improbable, with newer MPI-2 implementations
             'mpirun -np <procs> <x.prog>' may have to be replaced by
             'mpiexec -n <procs> <x.prog>'.
             This is provided for in the scripts 'x.all' in the 5
             relevant directories: comment out the 'mpirun' line and
             uncomment the 'mpiexec' line, as sketched below.
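
       For instance, in ping/ the switch amounts to moving the comment
       character between two lines of roughly this form in 'x.all' (the
       exact contents of the script may differ slightly):

         #mpirun -np 2 x.ping > ping.log
         mpiexec -n 2 x.ping > ping.log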

================================================================================
                   CUSTOMISING THE RUNS: (OPTIONAL)
================================================================================
You might want to run some of the programs in an alternative setting.
This might include:

- Other compiler options.
  In that case do the following for any of the programs <prog>:
 9a) cd <prog>/
    9a1 - Modify the definition of 'FFLAGS' in the Makefile.
    9a2 - Modify the compiler options line in subroutine 'state.f'.
    9a3 - Do a 'make veryclean': this will remove all old objects and
          the executable.
    9a4 - Do a 'make'.
    9a5 - Do an 'x.all': this runs the program and writes the result to
         '<prog>.log'.
    9a6 - mv <prog>.log ../Log.`hostname`/ or,
    9a7 - ALTERNATIVELY, when you have run several programs with
          altered settings:
          a. cd ..
          b. Do a 'make collect': this causes any result file
             '<prog>.log' to be moved from the '<prog>/' directories
             to 'Log.`hostname`/'.
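
  As a concrete illustration of steps 9a1-9a7, using 'fft1d' (the same
  sequence applies to any of the other programs):

        cd fft1d/
        vi Makefile                      # 9a1: change FFLAGS
        vi state.f                       # 9a2: record the new options
        make veryclean                   # 9a3: remove old objects/executable
        make                             # 9a4: rebuild x.fft1d
        x.all                            # 9a5: run; writes fft1d.log
                                         #      (prefix with ./ if needed)
        mv fft1d.log ../Log.`hostname`/  # 9a6: move the result
        cd ..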

- Substitute library calls or other equivalent code instead of that
  of the model implementation.
 9b) cd <prog>/
    9b1 - Modify the definition of 'FFLAGS' in the Makefile (if required).
    9b2 - Modify the compiler options line in subroutine 'state.f' (if
          required).
    9b3 - Do a 'make veryclean': this will remove all old objects and
          the executable.
    9b4 - Invalidate the routines to be replaced by removing or renaming
          them and, if necessary, modify the Makefile accordingly.
	  Specifically: 
          A. For programs 'gmxm' and 'gmxv' it is assumed that you would
	     like to replace the given Fortran routines by the routines
	     'dgemm' and 'dgemv', respectively. If so, modify the zero in
	     the first line in 'gmxm.in' and 'gmxv.in' to an integer value
	     /= 0 (and invalidate the supplied BLAS routines in the
	     respective directories). If you use routines that are different
	     from the BLAS routines, still modify 'gmxm.in' and 'gmxv.in'
	     files by changing the zero to a non-zero value, but, in
	     addition, change the calls to 'dgemm' and 'dgemv' to that of
	     your own favorite library routines.
	  B. In program 'linful' the factorisation and solution are based
	     on the usual LAPACK routines. So, you only have to invalidate
	     the source routines present in the directory.
	  C. As there is no universally accepted standard for FFTs there
	     is no alternative to modifying the code in 'fft1d.f'. Replace
	     lines 81 and 82 by the call(s) to your favorite library
	     routine.	        
    9b5 - Do a 'make'.
    9b6 - Run the program as before. 
    9b7 - BE SURE TO REPORT THE SUBSTITUTION(S) IN THE RESULTS!
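
  As an example for 9b4.A: switching 'gmxm' to a library routine only
  requires turning the leading zero on the first line of 'gmxm.in' into a
  non-zero value (1 below is just an example), invalidating the supplied
  BLAS source (e.g. by renaming it), and rebuilding:

        cd gmxm/
        vi gmxm.in           # first line: 0 --> 1
        make veryclean
        make                 # link against your BLAS/dgemm; adapt the
                             # Makefile if required
        x.all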
   
================================================================================
           	     ABOUT THE EFFICIENCY MEASURE
================================================================================
1) Programs 'fft1d', 'linful', 'linspar', and 'wvlt2d' measure an overall
   efficiency (ratio of actual performance and theoretical peak performance) by
   integrating the actual performance over a range of problem sizes. For
   instance, 'linspar' is evaluated in the range N = 1000,...,20000 with 10
   additional problem sizes in between. The problem sizes are given in the
   appropriate '<prog>.in' file, with <prog> any of the four programs mentioned.
   If, for any reason (for instance because you suspect that the curve used for
   the integration does not capture the performance behaviour of your processor
   adequately), you wish to add measuring points WITHIN the range given for
   each of the programs, you are welcome to do so by adding the appropriate
   line(s) to the '<prog>.in' file(s). Note, however, that you are NOT allowed
   to modify the lower and upper bounds themselves.
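
   Purely as a hypothetical illustration (the real layout of the '.in'
   files may differ), an edited 'linspar.in' could gain extra lines like:

        1000         <-- existing lower bound: leave as is
        3500         <-- added measuring point
        7500         <-- added measuring point
        ...
        20000        <-- existing upper bound: leave as is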

2) The four programs report the fraction of the peak performance that is
   required to be attained, as well as the efficiency measure that is actually
   attained by integrating over the observation range. Obviously, the attained
   fraction of the theoretical peak performance must be greater than or equal
   to the required fraction. Also obviously, the higher this fraction, the
   higher the efficiency of the processor. It does not matter whether your
   final result is obtained by using the original code, by optimising it, or
   by using a library, as long as the library is a standard tool that is
   generally accessible to the average users of such processors.
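
   To make the measure concrete with illustrative numbers: on a processor
   with a Theoretical Peak Performance of 3600 Mflop/s, an integrated
   average performance of, say, 900 Mflop/s over the observation range
   gives an attained fraction of 900/3600 = 0.25, and this value must be
   at least equal to the required fraction reported by the program.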

================================================================================
           	           FURTHER REMARKS
================================================================================
1) The program 'ping' measures bandwidth and latency by means of one-sided
   MPI communication (MPI_Get and MPI_Put). At present many MPI implementations
   still do not support one-sided communication as required by MPI-2.
   Consequently, program 'ping' may not compile on your system, in which case
   you will have no result for it. Because of the slow adoption of full MPI-2
   we presently do not consider this result mandatory, but it certainly adds
   value to the total result of the benchmark.
2) Please run the benchmark FIRST AS-IS, i.e., with the minimal changes to get
   it running (probably none are necessary). Then, if you are inclined to do so,
   do the optimisations you have in mind and run again.

================================================================================
                                Lastly,
                         ====================
                         | Best of success! |
                         ====================
