Scalasca  (Scalasca 2.3.1, revision 14987)
Scalable Performance Analysis of Large-Scale Applications
Preparing a reference execution

As a first step of every performance analysis, a reference execution using an uninstrumented executable should be performed. On the one hand, this step verifies that the code executes cleanly and produces correct results; on the other hand, it later makes it possible to assess the overhead introduced by instrumentation and measurement. At this stage, an appropriate test configuration should be chosen, such that it is both repeatable and long enough to be representative. (Note that excessively long execution durations can make measurement analysis inconvenient or even prohibitive, and therefore should be avoided.)
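
To make the later overhead assessment concrete, the following minimal sketch computes the relative runtime dilation of a measured run with respect to the reference run. The helper name overhead_pct and the definition of overhead as (measured - reference) / reference are assumptions made for illustration and are not part of Scalasca; the input values are arbitrary and not taken from any measurement.

```shell
# Hypothetical helper (not part of Scalasca): print the relative overhead
# in percent of a measured runtime ($2) over the reference runtime ($1).
overhead_pct() {
    awk -v ref="$1" -v run="$2" \
        'BEGIN { printf "%.1f\n", (run - ref) / ref * 100 }'
}

# Arbitrary illustrative inputs: 100 s reference, 110 s measured
overhead_pct 100 110
```

In practice, both arguments would be wall-clock times reported by the application itself for the same test configuration.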

After unpacking the NPB-MPI source archive, the build system has to be adjusted to the respective environment. For the NAS benchmarks, this is accomplished by a Makefile snippet defining a number of variables used by a generic Makefile. This snippet is called make.def and has to reside in the config/ subdirectory, which already contains a template file that can be copied and adjusted appropriately. In particular, the MPI Fortran compiler wrapper and flags need to be specified, for example:

  MPIF77     = mpif77
  FFLAGS     = -O2

Note that the MPI C compiler wrapper and flags are not used for building BT, but may also be set accordingly to experiment with other NPB benchmarks.
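
The configuration step described above could be scripted roughly as follows. This is a sketch under assumptions: the helper name prepare_make_def is hypothetical, and the template is assumed to be named make.def.template, which should be checked against the contents of the config/ subdirectory of the unpacked archive.

```shell
# Hypothetical helper: create config/make.def from the template (unless it
# already exists) and list the Fortran settings that need to be reviewed.
# $1 is the top-level directory of the unpacked NPB-MPI sources.
prepare_make_def() {
    config_dir=$1/config
    if [ ! -f "$config_dir/make.def" ]; then
        # Assumed template name; verify it in your copy of the sources
        cp "$config_dir/make.def.template" "$config_dir/make.def"
    fi
    # Show the current MPI Fortran compiler wrapper and flags
    grep -E '^(MPIF77|FFLAGS)[[:space:]]*=' "$config_dir/make.def"
}

# Example call (the path is only an illustration):
#   prepare_make_def /tmp/NPB3.3-MPI
```

The helper deliberately does not edit the file; the listed variables still have to be adjusted by hand to match the local MPI installation.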

Next, the benchmark can be built from the top-level directory by running make, specifying the number of MPI ranks to use (for BT, this is required to be a square number) as well as the problem size on the command line:

  % make bt NPROCS=64 CLASS=D
     =      NAS Parallel Benchmarks 3.3      =
     =      MPI/F77/C                        =

  make[1]: Entering directory `/tmp/NPB3.3-MPI/BT'
  make[2]: Entering directory `/tmp/NPB3.3-MPI/sys'
  cc -g  -o setparams setparams.c
  make[2]: Leaving directory `/tmp/NPB3.3-MPI/sys'
  ../sys/setparams bt 64 D
  make[2]: Entering directory `/tmp/NPB3.3-MPI/BT'
  mpif77 -c  -O2 bt.f
  mpif77 -c  -O2 make_set.f
  mpif77 -c  -O2 initialize.f
  mpif77 -c  -O2 exact_solution.f
  mpif77 -c  -O2 exact_rhs.f
  mpif77 -c  -O2 set_constants.f
  mpif77 -c  -O2 adi.f
  mpif77 -c  -O2 define.f
  mpif77 -c  -O2 copy_faces.f
  mpif77 -c  -O2 rhs.f
  mpif77 -c  -O2 solve_subs.f
  mpif77 -c  -O2 x_solve.f
  mpif77 -c  -O2 y_solve.f
  mpif77 -c  -O2 z_solve.f
  mpif77 -c  -O2 add.f
  mpif77 -c  -O2 error.f
  mpif77 -c  -O2 verify.f
  mpif77 -c  -O2 setup_mpi.f
  cd ../common; mpif77 -c  -O2 print_results.f
  cd ../common; mpif77 -c  -O2 timers.f
  make[3]: Entering directory `/tmp/NPB3.3-MPI/BT'
  mpif77 -c  -O2 btio.f
  mpif77 -O2 -o ../bin/bt.D.64 bt.o make_set.o initialize.o exact_solution.o
  exact_rhs.o set_constants.o adi.o define.o copy_faces.o rhs.o solve_subs.o
  x_solve.o y_solve.o z_solve.o add.o error.o verify.o setup_mpi.o
  ../common/print_results.o ../common/timers.o btio.o
  make[3]: Leaving directory `/tmp/NPB3.3-MPI/BT'
  make[2]: Leaving directory `/tmp/NPB3.3-MPI/BT'
  make[1]: Leaving directory `/tmp/NPB3.3-MPI/BT'
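
Because BT requires a square number of MPI ranks, it can be convenient to check the rank count before starting a build. The following sketch uses a hypothetical helper, is_square, and only prints the make command it would run, so it can be tried without the benchmark sources.

```shell
# Hypothetical helper: succeed only if $1 is a positive perfect square.
is_square() {
    n=$1
    i=1
    # Find the smallest i with i*i >= n
    while [ $((i * i)) -lt "$n" ]; do
        i=$((i + 1))
    done
    [ $((i * i)) -eq "$n" ]
}

NPROCS=64
CLASS=D
if is_square "$NPROCS"; then
    # Print (rather than execute) the build command
    echo "make bt NPROCS=$NPROCS CLASS=$CLASS"
else
    echo "NPROCS=$NPROCS is not a square number" >&2
fi
```

Replacing the echo in the first branch with the command itself turns the check into a guarded build.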

Valid problem classes (of increasing size) are S, W, A, B, C, D and E, and can be used to adjust the benchmark runtime to the execution environment. For example, class S or W is appropriate for execution on a single-core laptop with 4 MPI ranks, while the other problem sizes are more suitable for "real" configurations.

The resulting executable encodes the benchmark configuration (benchmark, problem class, and number of MPI ranks) in its name and is placed into the bin/ subdirectory. For the example make command above, it is named bt.D.64. This binary can now be executed, either by submitting an appropriate batch job (which is beyond the scope of this user guide) or directly in an interactive session.

  % cd bin
  % mpiexec -n 64 ./bt.D.64

   NAS Parallel Benchmarks 3.3 -- BT Benchmark

   No input file Using compiled defaults
   Size:  408x 408x 408
   Iterations:  250    dt:   0.0000200
   Number of active processes:    64

   Time step    1
   Time step   20
   Time step   40
   Time step   60
   Time step   80
   Time step  100
   Time step  120
   Time step  140
   Time step  160
   Time step  180
   Time step  200
   Time step  220
   Time step  240
   Time step  250
   Verification being performed for class D
   accuracy setting for epsilon =  0.1000000000000E-07
   Comparison of RMS-norms of residual
             1 0.2533188551738E+05 0.2533188551738E+05 0.1479210131727E-12
             2 0.2346393716980E+04 0.2346393716980E+04 0.8488743310506E-13
             3 0.6294554366904E+04 0.6294554366904E+04 0.3034271788588E-14
             4 0.5352565376030E+04 0.5352565376030E+04 0.8597827149538E-13
             5 0.3905864038618E+05 0.3905864038618E+05 0.6650300273080E-13
   Comparison of RMS-norms of solution error
             1 0.3100009377557E+03 0.3100009377557E+03 0.1373406191445E-12
             2 0.2424086324913E+02 0.2424086324913E+02 0.1582835864248E-12
             3 0.7782212022645E+02 0.7782212022645E+02 0.4053872777553E-13
             4 0.6835623860116E+02 0.6835623860116E+02 0.3762882153975E-13
             5 0.6065737200368E+03 0.6065737200368E+03 0.2474004739002E-13
   Verification Successful

   BT Benchmark Completed.
   Class           =                        D
   Size            =            408x 408x 408
   Iterations      =                      250
   Time in seconds =                   462.95
   Total processes =                       64
   Compiled procs  =                       64
   Mop/s total     =                126009.74
   Mop/s/process   =                  1968.90
   Operation type  =           floating point
   Verification    =               SUCCESSFUL
   Version         =                      3.3
   Compile date    =              29 Jan 2015

   Compile options:
      MPIF77       = mpif77
      FLINK        = $(MPIF77)
      FMPI_LIB     = (none)
      FMPI_INC     = (none)
      FFLAGS       = -O2
      FLINKFLAGS   = -O2
      RAND         = (none)

   Please send the results of this run to:

   NPB Development Team

   If email is not available, send this to:

   MS T27A-1
   NASA Ames Research Center
   Moffett Field, CA  94035-1000

   Fax: 650-604-3957

Note that the application verified the correctness of its results and reported the wall-clock execution time of the core computation (462.95 seconds in this run).
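
Since the reference time is needed again when assessing measurement overhead, it may help to save the benchmark output and extract the relevant values from it. The sketch below assumes the output was captured in a file; the helper names npb_time and npb_verified and the log file name in the comment are hypothetical. The demonstration uses two summary lines copied from the run shown above.

```shell
# Hypothetical helpers that read a saved copy of the benchmark output,
# e.g. one captured with:  mpiexec -n 64 ./bt.D.64 | tee bt.D.64.ref.log
npb_time() {
    # Print the value after "=" on the "Time in seconds" line, without blanks
    awk -F= '/Time in seconds/ { gsub(/ /, "", $2); print $2 }' "$1"
}

npb_verified() {
    # Succeed only if the summary reports a successful verification
    grep -q 'Verification *= *SUCCESSFUL' "$1"
}

# Self-contained demonstration using two summary lines from the run above
log=$(mktemp)
cat > "$log" <<'EOF'
 Time in seconds =                   462.95
 Verification    =               SUCCESSFUL
EOF

if npb_verified "$log"; then
    echo "reference time: $(npb_time "$log") s"
fi
rm -f "$log"
```

Checking the verification status before recording the time avoids using a runtime from a run that produced incorrect results.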

Scalasca    Copyright © 1998–2016 Forschungszentrum Jülich GmbH, Jülich Supercomputing Centre
Copyright © 2009–2015 German Research School for Simulation Sciences GmbH, Laboratory for Parallel Programming