Cube GUI User Guide  (CubeGUI 4.8.2, revision 7895e762)
Introduction to Cube GUI and its usage
Cube Advisor Plugin

Advisor is a standard plugin and is available as long as the measurement contains a Time metric. The main goal of the Advisor plugin is to provide users fast access to various performance evaluations of their HPC application.

Getting Started with Advisor

If the measurement contains the metric Time, CubeGUI will enable the Advisor plugin in the "General" tab of the plugins section.

Some Supported Assessments can be disabled due to missing performance properties, e.g. missing PAPI counters. In such cases one potential solution is to merge the original measurement with a measurement that includes the missing properties and run the analysis again. Measurement merging can be done with one of the context-free plugins "Merge" or "Mean".

Moreover, some assessments are hidden (e.g. Multiplicative Hybrid Assessment and JSC Hybrid Assessment) and are available only in "expert" mode (see Command line options).

Supported Assessments

Advisor supports various performance assessments, such as:

Only-MPI Assessment

Attempting to optimize the performance of a parallel code can be a daunting task, and often it is difficult to know where to start. For example, is the way computational work is divided a problem? Is the chosen communication scheme inefficient? Or does something else impact performance? To help address this issue, POP has defined a methodology for the analysis of parallel codes that provides a quantitative way of measuring the relative impact of the different factors inherent in parallelization. This section introduces these metrics, explains their meaning, and provides insight into the thinking behind them.

A feature of the methodology is that it uses a hierarchy of Only-MPI Assessment metrics, each reflecting a common cause of inefficiency in parallel programs. These metrics then allow a comparison of the parallel performance (e.g. over a range of thread/process counts, across different machines, or at different stages of optimization and tuning) to identify which characteristics of the code contribute to the inefficiency.

The first step in calculating these metrics is to use a suitable tool (e.g. Score-P or Extrae) to generate trace data while the code is executed. The traces contain information about the state of the code at a particular time, e.g. whether it is in a communication routine or doing useful computation, and also contain values from processor hardware counters, e.g. the number of instructions executed and the number of cycles.

The Only-MPI Assessment metrics are then calculated as efficiencies between 0 and 1, with higher numbers being better. In general, we regard efficiencies above 0.8 as acceptable, whereas lower values indicate performance issues that need to be explored in detail. The ultimate goal for POP is then that the user rectifies these underlying issues. Please note that the Only-MPI Assessment metrics can be computed only for inclusive callpaths, as they are less meaningful for exclusive callpaths. Furthermore, the Only-MPI Assessment metrics are not available in "Flat view" mode.

The approach outlined here is applicable to various parallelism paradigms; however, for simplicity the Only-MPI Assessment metrics presented here are formulated in terms of a distributed-memory message-passing environment, e.g. MPI. For this, the following values are calculated for each process from the trace data: time doing useful computation, time in communication, and the number of instructions and cycles during useful computation. Useful computation excludes time spent in the overhead of parallel paradigms (Computation time).
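To illustrate how such per-process values combine into an efficiency, the following sketch (not part of Cube; function and variable names are hypothetical) computes a parallel efficiency as the average useful computation time across processes divided by the total runtime, one common POP-style formulation:

```python
def parallel_efficiency(useful_times, runtime):
    """Average useful computation time across processes, divided by runtime."""
    return sum(useful_times) / (len(useful_times) * runtime)

# Four processes with a 10-second runtime; the imbalanced processes
# pull the average useful time (and thus the efficiency) down.
pe = parallel_efficiency([8.0, 9.0, 7.5, 9.5], 10.0)  # 0.85
```

With the 0.8 threshold mentioned above, an efficiency of 0.85 would still be regarded as acceptable.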

At the top of the hierarchy is Global Efficiency (GE), which we use to judge the overall quality of the parallelization. Typically, inefficiencies in parallel code have two main sources:

- inefficiencies in the parallelization itself, e.g. load imbalance or time spent in communication, and
- inefficiencies in the computation performed by each process,

and to reflect this we define two sub-metrics to measure these two inefficiencies. These are the Parallel Efficiency and the Computation Efficiency, and our top-level GE metric is the product of these two sub-metrics:

GE = Parallel Efficiency × Computation Efficiency
Note
Computation Efficiency can be computed only at scale with multiple measurements and is currently not supported by Advisor.
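The product rule above can be sketched in a few lines (illustrative only; the input values are made up, and the 0.8 threshold is the acceptability guideline stated earlier):

```python
ACCEPTABLE = 0.8  # efficiencies above this are regarded as acceptable

def global_efficiency(parallel_eff, computation_eff):
    """Top-level POP metric: product of the two sub-metrics."""
    return parallel_eff * computation_eff

ge = global_efficiency(0.9, 0.85)   # 0.765
needs_attention = ge < ACCEPTABLE   # True: explore the sub-metrics in detail
```

Note how two individually acceptable sub-metrics (0.9 and 0.85) can still multiply to a Global Efficiency below the 0.8 threshold.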

We sincerely hope this methodology will be adopted by our users and others and will form part of the project's legacy. If you would like to know more about the POP metrics and the tools used to generate them, please check out the rest of the Learning Material on our website, especially the document on POP Metrics.

Multiplicative Hybrid Assessment

Note
Multiplicative Hybrid Assessment is available only in "expert" mode (see Command line options).

This is one approach to extend the POP metrics for hybrid (MPI+OpenMP) applications. In this approach Parallel Efficiency is split into two components: Process Efficiency and Thread Efficiency.

In this analysis Parallel Efficiency (PE) can be computed as the product of these two sub-metrics:

PE = Process Efficiency × Thread Efficiency

Additive Hybrid Assessment

This is another approach to extend the POP metrics for hybrid (MPI+OpenMP) applications. In this approach, too, Parallel Efficiency is split into two components: Process Efficiency and Thread Efficiency.

In this analysis Parallel Efficiency (PE) can be computed directly or as the sum of these two sub-metrics minus one:

PE = Process Efficiency + Thread Efficiency - 1

This scheme has two advantages: each hybrid efficiency measures the absolute cost of the issue(s) under consideration, i.e. relative to the runtime, and the additive method gives more freedom in defining child metrics.
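A small sketch contrasting the two schemes (the efficiency values are made-up examples): in the additive model, each sub-metric's shortfall from 1 is its absolute cost relative to the runtime, and these costs simply add up:

```python
process_eff, thread_eff = 0.90, 0.85

pe_multiplicative = process_eff * thread_eff      # 0.765
pe_additive = process_eff + thread_eff - 1.0      # 0.75

# In the additive model, the total inefficiency is the sum of the
# per-paradigm costs, each expressed as a fraction of the runtime.
process_cost = 1.0 - process_eff                  # 0.10 of runtime
thread_cost = 1.0 - thread_eff                    # 0.15 of runtime
total_cost = 1.0 - pe_additive                    # 0.25 of runtime
```

This additivity of costs is what the first advantage above refers to; in the multiplicative scheme the individual contributions do not decompose this directly.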

BSC Hybrid Assessment

This is one approach to extend the POP metrics for hybrid (MPI+OpenMP) applications. It provides three types of efficiencies:

JSC Hybrid Assessment

Note
JSC Hybrid Assessment is available only in "expert" mode (see Command line options).

This is a JSC spin-off of the POP metrics for hybrid (MPI+OpenMP) applications. In this approach there are two sets of metrics:

There are two peculiarities of this model:

KNL Vectorization analysis

We investigate loops with regard to their degree of vectorization and offer suggestions for optimization candidates. This requires hardware-counter measurements obtained in multiple runs, due to the limited number of available counter registers. In the context of counter measurements this is not unusual for the Score-P work-flow. The suggestion of specific optimization candidates, on the other hand, is a deviation from the standard Score-P metric semantics.

The Score-P metric concept operates on the actual value of a metric (in absolute or relative terms), and analysis sometimes requires implicit information, e.g. whether a higher value is worse than a smaller one. This approach leaves the decision about the relevance of a metric value for a certain call-path to the user. They need to judge the severity of an issue based on knowledge of the hardware architecture, the source code, the input data, the use case, or even external parameters. Providing a generic set of thresholds that decide whether a metric value is problematic is a hard problem in general, as too many parameters are involved, some outside the scope of the performance analysis tool.

In the case of vectorization assistance we used the cooperation with Intel® to investigate the use of explicit knowledge about the architecture to provide such thresholds in that limited context. In the following we describe the metrics we focused on and the challenges they pose for the Score-P work-flow and analysis.

KNL Memory usage analysis

With Score-P, we measure the bandwidth values per code region outside of OpenMP parallel regions, due to the given uncore counter restrictions. Depending on the application, there might be many code regions that show a high bandwidth value. To find the most bandwidth-sensitive candidates among these regions, we sort them by their last-level cache (LLC) misses. This gives us the MCDRAM candidate metric per code region, as shown in Figure 4. We derive the MCDRAM candidate metric, i.e., we sort the high-bandwidth callpaths by their last-level cache misses, in the Cube plugin KNL advisor (see also 5.2). As input we use the PAPI-measured access counts for each DDR4 memory channel and the LLC counts measured with the PAPI-SCIPHI Score-P and Cube extensions for Intel Phi. We take care to measure the memory accesses only per process while running exclusively on a single KNL node.

As Score-P and Cube purely work on code regions, the MCDRAM candidates are also code regions. As a drawback, if a candidate code region accesses several data structures, we cannot point to the most bandwidth-sensitive structure. VTune [1], HPCToolkit [3][12] or ScaAnalyzer [13] might provide more detailed insight. In addition to this drawback, the above approach is not generally applicable for tools, as accessing counters from the uncore requires privileged access to a machine, either by setting the paranoia flag or by providing a special kernel module. On production machines this access is, for security reasons, often not granted. This does not only apply to memory accesses, but to all uncore counters.
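The candidate-selection step described above can be sketched as follows (a simplified illustration, not the actual KNL advisor implementation; the region names, values, and threshold are hypothetical): regions whose measured bandwidth exceeds a threshold are ranked by their LLC miss counts, and the top of the ranking is the best MCDRAM candidate:

```python
# Each entry: (region name, measured bandwidth in GB/s, LLC misses)
regions = [
    ("init",    12.0,  1_000_000),
    ("solver",  85.0, 40_000_000),
    ("io",       5.0,    200_000),
    ("stencil", 80.0, 90_000_000),
]

BANDWIDTH_THRESHOLD = 50.0  # GB/s; hypothetical cut-off for "high bandwidth"

# Keep only high-bandwidth regions, then rank them by last-level
# cache misses, most misses first.
candidates = sorted(
    (r for r in regions if r[1] > BANDWIDTH_THRESHOLD),
    key=lambda r: r[2],
    reverse=True,
)
# candidates[0] is the best MCDRAM candidate ("stencil" in this example)
```

Because the ranking works purely on code regions, this sketch also shows the drawback mentioned above: the result names a region, not the data structure responsible for the traffic.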


Cube Writer Library    Copyright © 1998–2022 Forschungszentrum Jülich GmbH, Jülich Supercomputing Centre
Copyright © 2009–2015 German Research School for Simulation Sciences GmbH, Laboratory for Parallel Programming