We investigate loops with regard to their degree of vectorization and offer suggestions for optimization candidates. This required hardware counter measurements, obtained in multiple runs, due to the limited number of available counter registers. In the context of counter measurements this is not unusual for the Score-P work-flow. The suggestion of specific optimization candidates, on the other hand, is a deviation from the standard Score-P metric semantics.

The Score-P metric concept operates on the actual value of a metric (in absolute or relative terms), and analysis sometimes requires implicit information, e.g. whether a higher value is worse than a lower one. This approach leaves the decision about the relevance of a metric value for a given call-path to the user. They need to judge the severity of an issue based on their knowledge of the hardware architecture, the source code, the input data, the use case, or even external parameters. Providing a generic set of thresholds that decide whether a metric value is problematic is a hard problem in general, as too many parameters are involved, some of them outside the scope of the performance analysis tool.

In the case of vectorization assistance, we used our cooperation with Intel® to investigate the use of explicit knowledge about the architecture for providing such thresholds in that limited context. In the following we describe the metrics we focused on and the challenges they pose for the Score-P work-flow and analysis.

KNL Vectorization metrics

We focus on three metrics. The first, the L1 compute to data access ratio, measures the computational density, i.e. the number of operations performed on average for each piece of loaded data. It can be used to judge how suitable an application is to run on the KNL architecture. Ideally, operations should be vectorized and each datum fetched from L1 cache should be used for multiple operations.

Similarly, the L2 compute to data access ratio is calculated as the number of vector operations relative to the loads that initially miss the L1 cache. While the L1 metric is critical in estimating a code's general suitability, the L2 metric is an indicator of whether the code is operating efficiently.

The thresholds are considered the limits where an investigation into the code section's vectorization would be useful. These limits are based on recommendations of Intel® for the KNL architecture, and while they hold true for most applications running on KNL, they are only guidelines and should be applied with care.
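As a minimal sketch of how the two compute to data access ratios and their thresholds could be evaluated for a single loop, assume the raw counter values have already been collected. The class `LoopCounters` and all function names are hypothetical and not part of Score-P. The thresholds used (L1 ratio below 1, L2 ratio below 100 times the L1 ratio) are the ones given for these metrics in this section, with the L2 factor read as a multiplication.

```python
from dataclasses import dataclass


@dataclass
class LoopCounters:
    """Hypothetical container for the raw counter values of one loop."""
    packed_simd: int    # UOPS_RETIRED.PACKED_SIMD
    all_loads: int      # MEM_UOPS_RETIRED.ALL_LOADS
    l1_miss_loads: int  # MEM_UOPS_RETIRED.L1_MISS_LOADS


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or infinity if nothing was counted
    in the denominator (assumption: such a loop is never flagged)."""
    return numerator / denominator if denominator else float("inf")


def l1_compute_to_data_access(c: LoopCounters) -> float:
    return ratio(c.packed_simd, c.all_loads)


def l2_compute_to_data_access(c: LoopCounters) -> float:
    return ratio(c.packed_simd, c.l1_miss_loads)


def flag_loop(c: LoopCounters) -> list[str]:
    """Return the names of the metrics that fall below their threshold."""
    l1 = l1_compute_to_data_access(c)
    l2 = l2_compute_to_data_access(c)
    flagged = []
    if l1 < 1.0:
        flagged.append("L1 compute to data access ratio")
    if l2 < 100.0 * l1:
        flagged.append("L2 compute to data access ratio")
    return flagged


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    example = LoopCounters(packed_simd=500, all_loads=1000, l1_miss_loads=100)
    print(flag_loop(example))
```

A flagged loop is only a candidate for closer inspection. Since the limits are guidelines, the result should be read together with knowledge of the code and its input data.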

An additional metric, the VPU intensity, offers a rule of thumb on how well a loop is vectorized: it is the proportion of vectorized operations among the total arithmetic operations. This metric should be applied only to small pieces of code. Certain non-arithmetic operations, such as mask manipulation instructions, are also counted as vector operations, which can skew this ratio. The metrics are defined as ratios of hardware counters provided by the KNL architecture. These counters can be accessed in Score-P through the PAPI metrics interface and can be measured at a call-path level on each thread.

Metric: L1 compute to data access ratio
Threshold: < 1
Definition: UOPS_RETIRED.PACKED_SIMD / MEM_UOPS_RETIRED.ALL_LOADS

Metric: L2 compute to data access ratio
Threshold: < 100 × L1 compute to data access ratio
Definition: UOPS_RETIRED.PACKED_SIMD / MEM_UOPS_RETIRED.L1_MISS_LOADS

Metric: VPU intensity
Threshold: < 0.5
Definition: UOPS_RETIRED.PACKED_SIMD / (UOPS_RETIRED.PACKED_SIMD + UOPS_RETIRED.SCALAR_SIMD)

To calculate all derived metrics, multiple native hardware counters have to be recorded. Since the KNL architecture provides only two general purpose counters per thread, multiple measurements have to be used to obtain the full set of counters required.
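The split into several measurements can be sketched as follows. This is a minimal illustration under stated assumptions, not the Score-P implementation: the helper names `plan_runs`, `merge_runs` and `vpu_intensity` are hypothetical, the counters are grouped in list order, and combining values from different runs assumes the application behaves reproducibly between runs.

```python
COUNTERS_PER_THREAD = 2  # general purpose counters per thread on KNL

# The four distinct native counters used by the three derived metrics.
REQUIRED_COUNTERS = [
    "UOPS_RETIRED.PACKED_SIMD",
    "UOPS_RETIRED.SCALAR_SIMD",
    "MEM_UOPS_RETIRED.ALL_LOADS",
    "MEM_UOPS_RETIRED.L1_MISS_LOADS",
]


def plan_runs(counters, per_run=COUNTERS_PER_THREAD):
    """Split the counter list into groups that fit into one run each."""
    return [counters[i:i + per_run] for i in range(0, len(counters), per_run)]


def merge_runs(run_results):
    """Combine the per-run counter values of one call-path into one dict.

    Assumption: every counter is measured in exactly one run, so a
    repeated name indicates a planning error.
    """
    merged = {}
    for result in run_results:
        for name, value in result.items():
            if name in merged:
                raise ValueError(f"counter measured twice: {name}")
            merged[name] = value
    return merged


def vpu_intensity(counters):
    """Packed SIMD share of all SIMD uops; None if no SIMD uops retired."""
    packed = counters["UOPS_RETIRED.PACKED_SIMD"]
    scalar = counters["UOPS_RETIRED.SCALAR_SIMD"]
    total = packed + scalar
    return packed / total if total else None


if __name__ == "__main__":
    for index, group in enumerate(plan_runs(REQUIRED_COUNTERS), start=1):
        print(f"run {index}: {', '.join(group)}")
```

With four distinct counters and two counters per thread, this grouping needs two runs. Placing both counters of the VPU intensity in the same run, as the list order above does, keeps that ratio independent of run-to-run variation, while the two compute to data access ratios combine values from different runs.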