To avoid drawing wrong conclusions based on skewed performance data due to excessive measurement overhead, it is often necessary to optimize the measurement configuration before conducting additional experiments. This can be achieved in various ways, e.g., using runtime filtering, selective recording, or manual instrumentation controlling measurement. Please refer to the Score-P Manual [12] for details on the available options. However, in many cases it is already sufficient to filter a small number of frequently executed but otherwise unimportant user functions to reduce the measurement overhead to an acceptable level. The selection of those routines has to be done with care, though, as it affects the granularity of the measurement and too aggressive filtering might "blur" the location of important hotspots.
To help identify candidate functions for runtime filtering, the initial summary report can be scored using the -s option of the scalasca -examine command:
% scalasca -examine -s scorep_bt_64_sum
INFO: Post-processing runtime summarization report...
scorep-score -r ./scorep_bt_64_sum/profile.cubex > ./scorep_bt_64_sum/scorep.score
INFO: Score report written to ./scorep_bt_64_sum/scorep.score

% head -n 20 scorep_bt_64_sum/scorep.score
Estimated aggregate size of event trace:                   3700GB
Estimated requirements for largest trace buffer (max_buf): 58GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       58GB
(hint: When tracing set SCOREP_TOTAL_MEMORY=58GB to avoid intermediate flushes
 or reduce requirements using USR regions filters.)

flt type         max_buf[B]          visits  time[s] time[%] time/    region
                                                             visit[us]
    ALL  62,076,748,138 152,783,214,921 60774.81   100.0      0.40  ALL
    USR  62,073,899,966 152,778,875,273 58840.43    96.8      0.39  USR
    MPI       2,267,202       2,909,568  1633.69     2.7    561.49  MPI
    COM         580,970       1,430,080   300.69     0.5    210.26  COM

    USR  20,525,692,668  50,517,453,756 12552.16    20.7      0.25  binvcrhs_
    USR  20,525,692,668  50,517,453,756  8069.43    13.3      0.16  matmul_sub_
    USR  20,525,692,668  50,517,453,756  6308.60    10.4      0.12  matvec_sub_
    USR     447,119,556   1,100,528,112   130.68     0.2      0.12  exact_solution_
    USR      50,922,378     124,121,508    19.78     0.0      0.16  binvrhs_
    MPI         855,834         771,456    11.17     0.0     14.48  MPI_Isend
    MPI         855,834         771,456     5.16     0.0      6.69  MPI_Irecv
As can be seen from the top of the score output, the estimated size of an event trace measurement without filtering is approximately 3.7 TiB, with the process-local maximum across all ranks being roughly 62 GB (~58 GiB). With 8 MPI ranks per node, this amounts to well over 400 GiB of trace buffer per node, far exceeding the 24 GiB of main memory available on the JUROPA compute nodes. A tracing experiment with this configuration is therefore clearly prohibitive if disruptive intermediate trace buffer flushes are to be avoided.
The next section of the score output provides a table which shows how the trace memory requirements of a single process (column max_buf) as well as the overall number of visits and CPU allocation time are distributed among certain function groups. Currently, the following groups are distinguished:

MPI: MPI API functions.
OMP: OpenMP constructs and API functions.
COM: User functions/regions that appear on a call path to an OpenMP construct, or an OpenMP or MPI API function. Useful to provide the context of MPI/OpenMP usage.
USR: User functions/regions that do not appear on a call path to an OpenMP construct, or an OpenMP or MPI API function.
The detailed breakdown by region below the summary provides a classification according to these function groups (column type) for each region found in the summary report. Investigation of this part of the score report reveals that most of the trace data would be generated by about 50 billion calls to each of the three routines matvec_sub, matmul_sub and binvcrhs, all of which are classified as USR. Although the percentage of time spent in these routines at first glance suggests that they are important, the average time per visit is 250 nanoseconds or less (column time/visit). That is, the relative measurement overhead for these functions is substantial, and a significant fraction of the reported time is very likely spent in the Score-P measurement system rather than in the application itself. These routines are therefore good candidates for filtering (just as they are good candidates for inlining by the compiler). If the exact_solution routine, which generates about 447 MB of event data on a single rank while contributing very little runtime, is additionally selected, a reasonable Score-P filter file looks like this:
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    binvcrhs_
    matvec_sub_
    matmul_sub_
    exact_solution_
SCOREP_REGION_NAMES_END
Please refer to the Score-P User Manual [12] for a detailed description of the filter file format, how to filter based on file names, define (and combine) blacklists and whitelists, and how to use wildcards for convenience.
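For illustration only, the sketch below shows what a filter file combining region-name wildcards with file-name based rules might look like; the pattern mat*_sub_ and the source path */solve/*.f are invented for this example rather than taken from the BT measurement, and the exact matching and precedence semantics are those described in the Score-P manual:

# Sketch of a combined filter file; names and paths are hypothetical examples.
SCOREP_REGION_NAMES_BEGIN
  # Exclude all routines matching these patterns ('*' matches any characters) ...
  EXCLUDE
    binvcrhs_
    mat*_sub_
  # ... but keep this routine even though the pattern above matches it.
  INCLUDE matmul_sub_
SCOREP_REGION_NAMES_END

SCOREP_FILE_NAMES_BEGIN
  # Exclude every region defined in a source file below this (made-up) directory.
  EXCLUDE */solve/*.f
SCOREP_FILE_NAMES_END

In this sketch the INCLUDE rule acts as a whitelist entry overriding the broader wildcard EXCLUDE, illustrating how blacklists and whitelists can be combined in a single filter file.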
The effectiveness of this filter can be examined by scoring the initial summary report again, this time specifying the filter file using the -f option of the scalasca -examine command. This way a filter file can be developed incrementally, avoiding the need to conduct many measurements in order to investigate the effect of filtering individual functions step by step.
% scalasca -examine -s -f npb-bt.filt scorep_bt_64_sum
scorep-score -f npb-bt.filt -r ./scorep_bt_64_sum/profile.cubex \
  > ./scorep_bt_64_sum/scorep.score_npb-bt.filt
INFO: Score report written to ./scorep_bt_64_sum/scorep.score_npb-bt.filt

% head -n 25 scorep_bt_64_sum/scorep.score_npb-bt.filt
Estimated aggregate size of event trace:                   3298MB
Estimated requirements for largest trace buffer (max_buf): 53MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       55MB
(hint: When tracing set SCOREP_TOTAL_MEMORY=55MB to avoid intermediate flushes
 or reduce requirements using USR regions filters.)

flt type         max_buf[B]          visits  time[s] time[%] time/    region
                                                             visit[us]
 -  ALL  62,076,748,138 152,783,214,921 60774.81   100.0      0.40  ALL
 -  USR  62,073,899,966 152,778,875,273 58840.43    96.8      0.39  USR
 -  MPI       2,267,202       2,909,568  1633.69     2.7    561.49  MPI
 -  COM         580,970       1,430,080   300.69     0.5    210.26  COM

 *  ALL      54,527,956     130,325,541 33713.95    55.5    258.69  ALL-FLT
 +  FLT  62,024,197,560 152,652,889,380 27060.86    44.5      0.18  FLT
 *  USR      51,679,784     125,985,893 31779.57    52.3    252.25  USR-FLT
 -  MPI       2,267,202       2,909,568  1633.69     2.7    561.49  MPI-FLT
 *  COM         580,970       1,430,080   300.69     0.5    210.26  COM-FLT

 +  USR  20,525,692,668  50,517,453,756 12552.16    20.7      0.25  binvcrhs_
 +  USR  20,525,692,668  50,517,453,756  8069.43    13.3      0.16  matmul_sub_
 +  USR  20,525,692,668  50,517,453,756  6308.60    10.4      0.12  matvec_sub_
 +  USR     447,119,556   1,100,528,112   130.68     0.2      0.12  exact_solution_
 -  USR      50,922,378     124,121,508    19.78     0.0      0.16  binvrhs_
 -  MPI         855,834         771,456    11.17     0.0     14.48  MPI_Isend
Below the (original) function group summary, the score report now also includes a second summary with the filter applied. Here, an additional group FLT is added, which subsumes all filtered regions. Moreover, the column flt indicates whether a region/function group is filtered ("+"), not filtered ("-"), or possibly partially filtered ("*", only used for function groups).
As expected, the estimate for the aggregate event trace size drops to roughly 3.3 GiB, and the process-local maximum across all ranks is reduced to 53 MiB. Since the Score-P measurement system also creates a number of internal data structures (e.g., to track MPI requests and communicators), the suggested setting for the SCOREP_TOTAL_MEMORY environment variable, which limits the total amount of memory used by the Score-P memory management, is 55 MiB when tracing is configured (see Section Trace collection and analysis).
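As a sketch of how these settings might be carried over into the subsequent trace experiment, the filter file and the suggested memory limit can be exported before launching the instrumented run; the mpiexec invocation and the executable name bt.D.64 below are placeholders mirroring the 64-rank summary measurement, not commands taken from this guide:

% export SCOREP_TOTAL_MEMORY=55MB           # suggested value from the score report
% export SCOREP_FILTERING_FILE=npb-bt.filt  # apply the filter at measurement time
% scalasca -analyze -t mpiexec -n 64 ./bt.D.64

Equivalently, the filter file can be passed to the scalasca -analyze command via its -f option.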