Scalasca
(Scalasca 2.5, revision 18206)
Scalable Performance Analysis of Large-Scale Applications
As a first step of every performance analysis, a reference execution using an uninstrumented executable should be performed. First, this step verifies that the code executes cleanly and produces correct results. Second, it later allows one to assess the run-time overhead introduced by instrumentation and measurement. Finally, it provides a baseline against which to compare after applying code optimizations. At this stage an appropriate test configuration should be chosen, such that it is both repeatable and long enough to be representative. (Note that excessively long execution durations can make measurement analysis inconvenient or even prohibitive, and therefore should be avoided.)
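As a minimal sketch, assuming an mpiexec-style launcher and the bt.D.144 binary built later in this section, the baseline wall-clock time of such a reference run could be recorded like this:

# Reference run of the uninstrumented binary; the measured time serves as the
# baseline for later overhead and optimization comparisons.
time mpiexec -n 144 ./bt.D.144

Alternatively, the "Time in seconds" value reported by the benchmark itself (see the example output below) can be used as the baseline.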
After unpacking the NPB-MPI source archive, the build system has to be adjusted to the respective environment. For the NAS benchmarks, this is accomplished by a Makefile snippet defining a number of variables used by a generic Makefile. This snippet is called make.def
and has to reside in the config/
subdirectory, which already contains a template file that can be copied and adjusted appropriately. In particular, the MPI Fortran compiler wrapper and flags need to be specified, for example:
MPIF77     = mpifort
FFLAGS     = -O2
FLINKFLAGS = -O2
Note that the MPI C compiler wrapper and flags are not used for building BT, but may also be set in the config/make.def
file accordingly to experiment with other NPB benchmarks.
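A hedged sketch of such additional settings is shown below; the wrapper name and flags are assumptions that depend on the local MPI installation:

# MPI C compiler wrapper and flags -- not needed for BT, but used by other
# NPB benchmarks built from the same config/make.def
MPICC      = mpicc
CFLAGS     = -O2
CLINKFLAGS = -O2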
Next, the benchmark can be built from the top-level directory by running make
, specifying the number of MPI ranks to use via the NPROCS
variable—for BT, this is required to be a square number—as well as the problem size via the CLASS
variable on the command line. Valid problem classes (of increasing size) are S, W, A, B, C, D, and E, and can be used to adjust the benchmark runtime to the execution environment. For example, class S or W is appropriate for execution on a laptop with 4 MPI ranks, while the other problem sizes are more suitable for "real" configurations. For the example run on JURECA, 144 MPI ranks and problem class D have been chosen:
$ make bt NPROCS=144 CLASS=D
   =========================================
   =      NAS Parallel Benchmarks 3.3      =
   =      MPI/F77/C                        =
   =========================================
cd BT; make NPROCS=144 CLASS=D SUBTYPE= VERSION=
make[1]: Entering directory `/tmp/NPB3.3-MPI/BT'
make[2]: Entering directory `/tmp/NPB3.3-MPI/sys'
cc -g -o setparams setparams.c
make[2]: Leaving directory `/tmp/NPB3.3-MPI/sys'
../sys/setparams bt 144 D
make[2]: Entering directory `/tmp/NPB3.3-MPI/BT'
mpifort -c -O2 bt.f
mpifort -c -O2 make_set.f
mpifort -c -O2 initialize.f
mpifort -c -O2 exact_solution.f
mpifort -c -O2 exact_rhs.f
mpifort -c -O2 set_constants.f
mpifort -c -O2 adi.f
mpifort -c -O2 define.f
mpifort -c -O2 copy_faces.f
mpifort -c -O2 rhs.f
mpifort -c -O2 solve_subs.f
mpifort -c -O2 x_solve.f
mpifort -c -O2 y_solve.f
mpifort -c -O2 z_solve.f
mpifort -c -O2 add.f
mpifort -c -O2 error.f
mpifort -c -O2 verify.f
mpifort -c -O2 setup_mpi.f
cd ../common; mpifort -c -O2 print_results.f
cd ../common; mpifort -c -O2 timers.f
make[3]: Entering directory `/tmp/NPB3.3-MPI/BT'
mpifort -c -O2 btio.f
mpifort -O2 -o ../bin/bt.D.144 bt.o make_set.o initialize.o exact_solution.o \
exact_rhs.o set_constants.o adi.o define.o copy_faces.o rhs.o solve_subs.o \
x_solve.o y_solve.o z_solve.o add.o error.o verify.o setup_mpi.o \
../common/print_results.o ../common/timers.o btio.o
make[3]: Leaving directory `/tmp/NPB3.3-MPI/BT'
make[2]: Leaving directory `/tmp/NPB3.3-MPI/BT'
make[1]: Leaving directory `/tmp/NPB3.3-MPI/BT'
The resulting executable encodes the benchmark configuration in its name and is placed into the bin/
subdirectory. For the example make
command above, it is named bt.D.144
. This binary can now be executed, either by submitting an appropriate batch job (which is beyond the scope of this user guide) or directly in an interactive session.
$ cd bin
$ mpiexec -n 144 ./bt.D.144


 NAS Parallel Benchmarks 3.3 -- BT Benchmark

 No input file inputbt.data. Using compiled defaults
 Size:  408x 408x 408
 Iterations: 250    dt:   0.0000200
 Number of active processes:   144

 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Time step  220
 Time step  240
 Time step  250
 Verification being performed for class D
 accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
           1 0.2533188551738E+05 0.2533188551738E+05 0.1497879774166E-12
           2 0.2346393716980E+04 0.2346393716980E+04 0.8488743310506E-13
           3 0.6294554366904E+04 0.6294554366904E+04 0.3034271788588E-14
           4 0.5352565376030E+04 0.5352565376030E+04 0.8308967344119E-13
           5 0.3905864038618E+05 0.3905864038618E+05 0.6650300273080E-13
 Comparison of RMS-norms of solution error
           1 0.3100009377557E+03 0.3100009377557E+03 0.1373406191445E-12
           2 0.2424086324913E+02 0.2424086324913E+02 0.1600422929406E-12
           3 0.7782212022645E+02 0.7782212022645E+02 0.4090394153928E-13
           4 0.6835623860116E+02 0.6835623860116E+02 0.3617356324816E-13
           5 0.6065737200368E+03 0.6065737200368E+03 0.2605201960010E-13
 Verification Successful


 BT Benchmark Completed.
 Class           =                        D
 Size            =            408x 408x 408
 Iterations      =                      250
 Time in seconds =                   216.00
 Total processes =                      144
 Compiled procs  =                      144
 Mop/s total     =                270070.08
 Mop/s/process   =                  1875.49
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              18 Mar 2019

 Compile options:
    MPIF77       = mpifort
    FLINK        = $(MPIF77)
    FMPI_LIB     = (none)
    FMPI_INC     = (none)
    FFLAGS       = -O2
    FLINKFLAGS   = -O2
    RAND         = (none)


 Please send feedbacks and/or the results of this run to:

 NPB Development Team
 Internet: npb@nas.nasa.gov
In the selected configuration, the BT benchmark executes 250 iterations of the time step loop, and then verifies that the result matches the expected outcome. Before exiting, the benchmark also reports some configuration details, as well as the wall-clock execution time (216.00 seconds) for the core computation.
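For completeness, a minimal batch script for the same run might look like the following sketch. It assumes a Slurm-based system (such as JURECA) with 24 cores per compute node; the partition name and wall-time limit are placeholders that have to be adapted to the local site:

#!/bin/bash
#SBATCH --nodes=6           # assumption: 6 nodes x 24 cores = 144 MPI ranks
#SBATCH --ntasks=144
#SBATCH --time=00:30:00     # assumed wall-time limit; adjust to the problem class
#SBATCH --partition=batch   # hypothetical partition name; site-specific

cd bin
srun -n 144 ./bt.D.144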