List of metrics

Here we list the current metrics and detail their meaning.

  • Usage
    Average core usage over the runtime of the job per node (y-axis) and per core (x-axis) of the node.

Warning

The abscissa in this graph shows the core IDs instead of timestamps, and it includes both the "Physical" cores (first half) and the "Logical" ones (second half) for multithreaded processors.

  • CPU Usage
    1-min average CPU usage across all cores of a node. For multithreaded processors, the value can go up to 200% when both physical and logical cores are used.

  • Physical Cores Used
    Number of "Physical cores" with usage above 25% in the last minute. The "Physical cores" are represented by the first half of the core IDs in the graphs.

  • Logical Cores Used (For multithreaded processors)
    Number of "Logical cores" with usage above 25% in the last minute. The "Logical cores" are represented by the second half of the core IDs in the graphs.
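
As an illustration of how these counts relate to per-core usage, the sketch below counts cores above the 25% threshold using the psutil Python package. This is not how LLview collects the data, and the mapping of core IDs to physical/logical halves is an assumption that depends on the system.

```python
# Minimal sketch (not LLview's implementation): count cores whose usage
# exceeded 25% over a 1-second sample, assuming the first half of the
# core IDs are physical cores and the second half are logical ones.
import psutil

THRESHOLD = 25.0  # percent, as in the "Cores Used" metrics

per_core = psutil.cpu_percent(interval=1.0, percpu=True)
half = len(per_core) // 2

physical_used = sum(1 for usage in per_core[:half] if usage > THRESHOLD)
logical_used = sum(1 for usage in per_core[half:] if usage > THRESHOLD)

print(f"Physical cores used: {physical_used} / {half}")
print(f"Logical cores used:  {logical_used} / {half}")
```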

  • Load
    Average number of runnable processes (including those waiting for disk I/O) over the past minute, indicating short-term system load and responsiveness (e.g., a value of 1 means one core's worth of load on average, not a percentage).

Note

The Load is provided by Linux in three numbers: 1-, 5- and 15-min average loads. In the job reports, the Node: Load is obtained from Slurm, which at JSC contains the 1-min Load average.
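
On any Linux node, the same three load averages can be read directly from /proc/loadavg (or via os.getloadavg()). The following minimal sketch illustrates the values behind this metric; it is not LLview's actual collection path.

```python
# Minimal sketch: read the 1-, 5- and 15-minute load averages on Linux.
# LLview's "Node: Load" comes from Slurm, but the underlying values are
# the same ones exposed in /proc/loadavg.
with open("/proc/loadavg") as f:
    load1, load5, load15 = map(float, f.read().split()[:3])

print(f"1-min: {load1}, 5-min: {load5}, 15-min: {load15}")
# A value of 1.0 means, on average, one runnable process (one core's worth
# of load), not a percentage.
```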

  • Memory Usage
    Amount of allocated RAM memory (in GiB) in the node.

Note

In the job reports, the Node: Memory Usage graphs (both for CPU and GPU) are scaled by default from 0 up to the memory limit of the partition. A switch between Job and System limits can be found in the interactive reports.

Danger

Some system processes may use up to a few GiB of memory on the node, so it is better to plan for 10-15 GiB less than the maximum amount.
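
As an illustration of what this metric measures, the following minimal sketch reads a node's memory usage from /proc/meminfo (LLview's actual data source may differ).

```python
# Minimal sketch (not LLview's collection path): estimate the RAM used on
# a Linux node, in GiB, from /proc/meminfo. "MemAvailable" already accounts
# for reclaimable caches, so MemTotal - MemAvailable approximates the memory
# that is actually occupied, including system processes.
meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":")
        meminfo[key] = int(value.split()[0])  # values are reported in KiB

used_gib = (meminfo["MemTotal"] - meminfo["MemAvailable"]) / (1024 ** 2)
total_gib = meminfo["MemTotal"] / (1024 ** 2)
print(f"Used: {used_gib:.1f} GiB of {total_gib:.1f} GiB")
```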

  • Active SM
    Average fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors.

  • Utilization
    Percent of time over the past sample period during which one or more kernels was executing on the GPU.

Warning

The Utilization graph only reflects whether at least one kernel was running on the GPU - it does not indicate how occupied the GPU actually was. For this reason, it is recommended to also check the Active SM metric described above.
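
On a node, the raw utilization values can also be checked interactively with nvidia-smi. The snippet below is a minimal sketch of such a manual query (not LLview's collection path); "utilization.gpu" corresponds to the Utilization metric, and "utilization.memory" to the Memory Usage Rate metric described below.

```python
# Minimal sketch: query GPU utilization with nvidia-smi. "utilization.gpu"
# reports the percentage of time one or more kernels executed during the
# last sample period; it says nothing about how many SMs were busy.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,utilization.gpu,utilization.memory",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, gpu_util, mem_util = (field.strip() for field in line.split(","))
    print(f"GPU {idx}: utilization {gpu_util}%, memory activity {mem_util}%")
```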

  • Memory Usage
    Amount of memory (in GiB) used on the device by the context.

  • Temperature
    Current Temperature (in Celsius) on a given GPU.

Warning

Note that high temperatures may trigger slow down of the GPU frequency (see examples of High Temperature / GPU Throttling).

  • Clk Throttle Reason
    Information about the factors that are reducing the clock frequencies (a query sketch follows the note below). These are:

    1. GpuIdle - Nothing is running on the GPU and the clocks are dropping to Idle state.
    2. AppClkSet - GPU clocks are limited by the applications clocks setting.
    3. SwPwrCap - SW Power Scaling algorithm is reducing the clocks below the requested clocks because the GPU is consuming too much power.
    4. HWSlowDown - HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of:
                    * Temperature being too high
                    * External Power Brake Assertion being triggered (e.g. by the system power supply)
                    * Power draw being too high, with Fast Trigger protection reducing the clocks
    5. SyncBoost - This GPU has been added to a Sync Boost group with nvidia-smi or DCGM in order to maximize performance per watt. All GPUs in the Sync Boost group will boost to the minimum possible clocks across the entire group. Look at the throttle reasons for the other GPUs in the system to see why they are holding this one at lower clocks.
    6. SwThermSlDwn - SW Thermal Slowdown. This is an indicator of one or more of the following:
                      * Current GPU temperature above the GPU Max Operating Temperature
                      * Current memory temperature above the Memory Max Operating Temperature
    7. HwThermSlDwn - HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of:
                      * Temperature being too high
    8. PwrBrakeSlDwn - Power brake throttling to prevent the given racks from drawing more power than the facility can safely provide.

Note

The Clk Throttle Reason graphs are not shown when no throttling was ever active for the job.
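
When throttling does appear in the report, the currently active reasons can also be inspected on the node itself. The following is a minimal sketch of such a check with nvidia-smi (an illustrative query, not LLview's data path); the available field names may depend on the nvidia-smi version.

```python
# Minimal sketch: query the clock throttle reasons via nvidia-smi.
# Each field is reported as "Active" or "Not Active" per GPU.
import subprocess

fields = [
    "clocks_throttle_reasons.gpu_idle",
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.hw_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.hw_power_brake_slowdown",
]

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index," + ",".join(fields),
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    values = [v.strip() for v in line.split(",")]
    idx, flags = values[0], values[1:]
    active = [name for name, flag in zip(fields, flags) if flag == "Active"]
    print(f"GPU {idx}: {', '.join(active) if active else 'no throttling'}")
```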

  • StreamMP Clk
    Current frequency in MHz of SM (Streaming Multiprocessor) clock. The frequency may be slowed down for the reasons given above.

  • Memory Usage Rate
    Percent of time over the past sample period during which global (device) memory was being read or written.

  • Memory Clk
    Current frequency of the memory clock, in MHz.

  • Performance State
    The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).

Note

The Performance State graphs are only shown when the state differs from the default value of 0.

  • PCIE TX
    The GPU-centric transmission throughput across the PCIe bus (in GiB/s) over the past 20ms.

  • PCIE RX
    The GPU-centric receive throughput across the PCIe bus (in GiB/s) over the past 20ms.

Warning

The PCIE TX and PCIE RX graphs only include throughput via the PCIe bus, i.e., between the GPU and the CPU.

  • NVLink TX
    The rate of data transmitted over NVLink in GiB/s.

  • NVLink RX
    The rate of data received over NVLink in GiB/s.

LLview can report power metrics (in Watts) at several levels:

  • Node Power
    The total power draw for the entire node at the moment of sampling.

Note

"Node Power" values come from Slurm's CurrentWatts field (scontrol show nodes) and are snapshots taken once per minute. LLview may integrate these samples over time to estimate total energy consumption.

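Since these are once-per-minute power snapshots, an energy estimate amounts to numerically integrating the samples over the job's runtime. The following is a minimal sketch of such an estimate (an illustration of the idea, not necessarily how LLview computes it).

```python
# Minimal sketch: estimate energy (in Wh) from once-per-minute power
# samples (in W) by trapezoidal integration over the sample timestamps.
def estimate_energy_wh(timestamps_s, power_w):
    """timestamps_s: sample times in seconds; power_w: power draw in Watts."""
    energy_joule = 0.0
    for (t0, p0), (t1, p1) in zip(zip(timestamps_s, power_w),
                                  zip(timestamps_s[1:], power_w[1:])):
        energy_joule += 0.5 * (p0 + p1) * (t1 - t0)
    return energy_joule / 3600.0  # 1 Wh = 3600 J

# Example: three 1-minute samples around 500 W -> roughly 16.8 Wh
print(estimate_energy_wh([0, 60, 120], [480.0, 520.0, 500.0]))
```
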
  • CPU Power
    The instantaneous power consumed by the CPU package, including its memory controllers and system I/O.

  • CPU Power Cap
    The enforced power limit on the CPU package. Displaying this is useful when users can modify CPU power caps or when they deviate from the system default.

  • GPU Power
    The current power draw of each GPU device, including its onboard memory.

  • Superchip Power
    On Grace–Hopper systems, LLview also reports the power usage for each “superchip” (i.e. combined Grace and Hopper modules).

  • Read
    Average read data rate (in MiB/s) in the last minute.

  • Write
    Average write data rate (in MiB/s) in the last minute.

  • Open/Close Operations
    Average operation rate (in operations/s) in the last minute.

  • Data Input
    Average data input throughput (in MiB/s) in the last minute.

  • Data Output
    Average data output throughput (in MiB/s) in the last minute.

  • Packet Input
    Average packet input throughput (in pkt/s) in the last minute.

  • Packet Output
    Average packet output throughput (in pkt/s) in the last minute.

Attention

The Interconnect values refer to input and output transfers to/from a given node, so they do not include communication within the node itself. However, I/O data is also counted in the data transferred in or out of a node.