Optimizing the measurement configuration

To avoid drawing wrong conclusions based on performance data skewed by excessive measurement overhead, it is often necessary to optimize the measurement configuration before conducting additional experiments. This can be achieved in various ways, e.g., using runtime filtering, selective recording, or manual instrumentation to control the measurement. Please refer to the Score-P Manual [13] for details on the available options. In many cases, however, it is already sufficient to filter a small number of frequently executed but otherwise unimportant user functions to reduce the measurement overhead to an acceptable level. The selection of those routines has to be done with care, though, as it affects the granularity of the measurement, and overly aggressive filtering might "blur" the location of important hotspots.
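
For example, runtime filtering requires no re-instrumentation of the application. As a minimal sketch (using the filter file npb-bt.filt that is developed below), it is enough to point the measurement system at the filter via the standard Score-P environment variable before launching the next run:

  % export SCOREP_FILTERING_FILE=npb-bt.filt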

To help identify candidate functions for runtime filtering, the initial summary report can be scored using the -s option of the scalasca -examine command:

  % scalasca -examine -s scorep_bt_64_sum
  INFO: Post-processing runtime summarization report...
  scorep-score -r ./scorep_bt_64_sum/profile.cubex > ./scorep_bt_64_sum/scorep.score
  INFO: Score report written to ./scorep_bt_64_sum/scorep.score

  % head -n 20 scorep_bt_64_sum/scorep.score

  Estimated aggregate size of event trace:                   3700GB
  Estimated requirements for largest trace buffer (max_buf): 58GB
  Estimated memory requirements (SCOREP_TOTAL_MEMORY):       58GB
  (hint: When tracing set SCOREP_TOTAL_MEMORY=58GB to avoid intermediate flushes
   or reduce requirements using USR regions filters.)

  flt type     max_buf[B]          visits  time[s] time[%] time/     region
                                                           visit[us]
       ALL 62,076,748,138 152,783,214,921 60774.81   100.0      0.40 ALL
       USR 62,073,899,966 152,778,875,273 58840.43    96.8      0.39 USR
       MPI      2,267,202       2,909,568  1633.69     2.7    561.49 MPI
       COM        580,970       1,430,080   300.69     0.5    210.26 COM

       USR 20,525,692,668  50,517,453,756 12552.16    20.7      0.25  binvcrhs_
       USR 20,525,692,668  50,517,453,756  8069.43    13.3      0.16  matmul_sub_
       USR 20,525,692,668  50,517,453,756  6308.60    10.4      0.12  matvec_sub_
       USR    447,119,556   1,100,528,112   130.68     0.2      0.12  exact_solution_
       USR     50,922,378     124,121,508    19.78     0.0      0.16  binvrhs_
       MPI        855,834         771,456    11.17     0.0     14.48  MPI_Isend
       MPI        855,834         771,456     5.16     0.0      6.69  MPI_Irecv

As can be seen from the top of the score output, the estimated size of an event trace measurement without filtering applied is approximately 3.7 TiB, with the process-local maximum across all ranks being roughly 62 GB (~58 GiB). Considering the 24 GiB of main memory available on the JUROPA compute nodes and the 8 MPI ranks per node, a tracing experiment with this configuration is clearly prohibitive if disruptive intermediate trace buffer flushes are to be avoided.
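
Spelled out as a quick back-of-the-envelope check, using only the figures above:

  memory available per rank:     24 GiB / 8 ranks   =  ~3 GiB
  trace buffer needed per rank:  max_buf            =  ~58 GiB
  => the required buffers exceed the available memory by more than an order of magnitude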

The next section of the score output is a table showing how the trace memory requirements of a single process (column max_buf), as well as the overall number of visits and the CPU allocation time, are distributed among the function groups. Currently, the following groups are distinguished:

  MPI   pure MPI library functions
  OMP   pure OpenMP functions and regions (not present in this pure-MPI measurement)
  COM   user-level functions that appear on a call path to MPI or OpenMP operations
  USR   user-level functions that do not appear on any call path to MPI or OpenMP operations
  ALL   aggregate of all regions

The detailed breakdown by region below the summary classifies each region found in the summary report according to these function groups (column type). Investigating this part of the score report reveals that most of the trace data would be generated by about 50 billion calls to each of the three routines matvec_sub, matmul_sub, and binvcrhs, all classified as USR. Although the percentage of time spent in these routines at first glance suggests that they are important, the average time per visit is at most 250 nanoseconds (column time/visit). That is, the relative measurement overhead for these functions is substantial, and a significant share of the reported time is very likely spent in the Score-P measurement system rather than in the application itself. These routines therefore constitute good candidates for filtering (just as they are good candidates for inlining by the compiler). If the exact_solution routine, which generates roughly 447 MB of event data on a single rank while having very little impact on the runtime, is selected as well, a reasonable Score-P filter file looks like this:

  SCOREP_REGION_NAMES_BEGIN
      EXCLUDE
          binvcrhs_
          matvec_sub_
          matmul_sub_
          exact_solution_
  SCOREP_REGION_NAMES_END

Please refer to the Score-P User Manual [13] for a detailed description of the filter file format, how to filter based on file names, define (and combine) blacklists and whitelists, and how to use wildcards for convenience.
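
For illustration, the same exclusions can be written more compactly with wildcards (a sketch only; the patterns below also match name variants such as the compiler-generated trailing underscores):

  SCOREP_REGION_NAMES_BEGIN
      EXCLUDE
          binvcrhs*
          matvec_sub*
          matmul_sub*
          exact_solution*
  SCOREP_REGION_NAMES_END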

The effectiveness of this filter can be examined by scoring the initial summary report again, this time specifying the filter file via the -f option of the scalasca -examine command. In this way, a filter file can be developed incrementally, without having to conduct a separate measurement for each step just to investigate the effect of filtering individual functions.

  % scalasca -examine -s -f npb-bt.filt scorep_bt_64_sum
  scorep-score -f npb-bt.filt -r ./scorep_bt_64_sum/profile.cubex \
               > ./scorep_bt_64_sum/scorep.score_npb-bt.filt
  INFO: Score report written to ./scorep_bt_64_sum/scorep.score_npb-bt.filt

  % head -n 25 scorep_bt_64_sum/scorep.score_npb-bt.filt

  Estimated aggregate size of event trace:                   3298MB
  Estimated requirements for largest trace buffer (max_buf): 53MB
  Estimated memory requirements (SCOREP_TOTAL_MEMORY):       55MB
  (hint: When tracing set SCOREP_TOTAL_MEMORY=55MB to avoid intermediate flushes
   or reduce requirements using USR regions filters.)

  flt type     max_buf[B]          visits  time[s] time[%] time/     region
                                                           visit[us]
   -   ALL 62,076,748,138 152,783,214,921 60774.81   100.0      0.40 ALL
   -   USR 62,073,899,966 152,778,875,273 58840.43    96.8      0.39 USR
   -   MPI      2,267,202       2,909,568  1633.69     2.7    561.49 MPI
   -   COM        580,970       1,430,080   300.69     0.5    210.26 COM

   *   ALL     54,527,956     130,325,541 33713.95    55.5    258.69 ALL-FLT
   +   FLT 62,024,197,560 152,652,889,380 27060.86    44.5      0.18 FLT
   *   USR     51,679,784     125,985,893 31779.57    52.3    252.25 USR-FLT
   -   MPI      2,267,202       2,909,568  1633.69     2.7    561.49 MPI-FLT
   *   COM        580,970       1,430,080   300.69     0.5    210.26 COM-FLT

   +   USR 20,525,692,668  50,517,453,756 12552.16    20.7      0.25 binvcrhs_
   +   USR 20,525,692,668  50,517,453,756  8069.43    13.3      0.16 matmul_sub_
   +   USR 20,525,692,668  50,517,453,756  6308.60    10.4      0.12 matvec_sub_
   +   USR    447,119,556   1,100,528,112   130.68     0.2      0.12 exact_solution_
   -   USR     50,922,378     124,121,508    19.78     0.0      0.16 binvrhs_
   -   MPI        855,834         771,456    11.17     0.0     14.48 MPI_Isend

Below the (original) function group summary, the score report now also includes a second summary with the filter applied. Here, an additional group FLT is added, which subsumes all filtered regions. Moreover, the column flt indicates whether a region/function group is filtered ("+"), not filtered ("-"), or possibly partially filtered ("*", only used for function groups).

As expected, the estimated aggregate event trace size drops to about 3.3 GiB, and the process-local maximum across all ranks is reduced to 53 MiB. Since the Score-P measurement system also creates a number of internal data structures (e.g., to track MPI requests and communicators), the suggested setting for the SCOREP_TOTAL_MEMORY environment variable, which adjusts the maximum amount of memory used by the Score-P memory management, is 55 MiB when tracing is configured (see Section Trace collection and analysis).
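
A minimal sketch of how these values might be set for a subsequent trace experiment (the environment variables are standard Score-P settings; the launch line is illustrative only, and the recommended scalasca -analyze workflow is described in the next section):

  % export SCOREP_FILTERING_FILE=npb-bt.filt   # filter developed above
  % export SCOREP_TOTAL_MEMORY=55MB            # per-process buffer size suggested by scorep-score
  % export SCOREP_ENABLE_TRACING=true          # record an event trace in addition to the profile
  % mpiexec -np 64 ./bt.D.64                   # illustrative launch line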


