Build-Up Operation

Warning

JUPITER is currently in a Pre-Access phase, rather than production. Many details change during this Pre-Access phase, and so this documentation may slip out of synchronisation with the current state of the system or be incomplete.

In the case where information seems to be incorrect or out of date, please contact sc@fz-juelich.de.

JUPITER is currently in build-up. During this pre-production operation, some elements of the environment are not yet production-ready and/or aligned with the other JSC systems. Some notes on the current state follow.

Slurm & MPI

JUPITER currently deploys a different Slurm version compared to other JSC systems. Some plugins and customizations are not available, while other features exist only temporarily.

Prolog/Epilog Runtime Customization (PrEpS Framework)

Slurm traditionally allows runtime customization through SPANK plugins (Slurm Plugin Architecture for Node and Job Control). SPANK plugins are a very powerful way to extend and tweak the capabilities of Slurm according to site needs. They present, however, a higher barrier for development and deployment, which hinders quick rollout of tweaks that should be applied at the beginning of a job for the whole node. For that reason we are exploring a mechanism to change job- and node-wide settings (i.e., settings triggered on all nodes of the job at the beginning and cleaned up at the end). The current implementation relies on the --comment option provided by Slurm to pass arbitrary strings as comments for the job. This mechanism leverages Slurm’s native Prolog/Epilog hooks, which allow actions to be executed automatically on the allocated nodes before a job starts and after it finishes.

Warning

This mechanism is exploratory. The main benefit is much faster development and deployment of features that require changes at the beginning of a job (and clean-up at the end). If this approach proves to be reliable and beneficial, we might move the mechanism out of the --comment option and provide a custom SPANK plugin for these options, keeping the flexibility of this approach without abusing the --comment option.

This mechanism allows selected node-level settings to be:

  • Applied automatically before job execution

  • Reverted safely after job completion

  • Optionally overridden by users at submission time

User-Controlled Runtime Tuning

User overrides are currently passed via Slurm’s --comment option:

#SBATCH --comment="KEY=VALUE;KEY=VALUE"

These values are mapped to PrEpS configuration modules.

Note

The --comment option of sbatch passes user-specific overrides to the PrEpS framework. Multiple overrides can be specified as a semicolon-separated list, e.g.

#SBATCH --comment="CPU_POWER_CAP=200000000;NVIDIA_GPU_CLOCKS=990"

Warning

This interface is experimental and may evolve into a dedicated Slurm plugin.

Currently Supported Options

CPU-related options (a combined example follows this list):

  • CPU_POWER_CAP – Set the Grace CPU power cap in µW (default: 100000000, i.e. 100 W). The Grace CPU may use up to this amount as its share of the 680 W superchip module power budget. The value can be tuned below or above the 100 W default, up to a maximum of 300000000 (300 W). Raising the CPU power cap can result in less power being available to the GPU, as CPU and GPU share the 680 W total power cap of the superchip.

  • CPU_FREQUENCY_MIN=VALUE – Minimum CPU frequency (kHz) (default: 81000). This is the minimum frequency that the CPU is supposed to use during the lifetime of the job. Whether this is respected or not depends on the CPU governor selected.

  • CPU_FREQUENCY_MAX=VALUE – Maximum CPU frequency (kHz) (default: 3456000). This is the maximum frequency that the CPU is supposed to use during the lifetime of the job. Whether this is respected or not depends on the CPU governor selected.

  • CPU_GOVERNOR={conservative | ondemand | userspace | powersave | performance | schedutil} – CPU frequency scaling governor (default: performance).
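
As an illustration of the CPU options, a single --comment string can combine a governor with a frequency window; the values below are only examples, not tuning recommendations:

#SBATCH --comment="CPU_GOVERNOR=ondemand;CPU_FREQUENCY_MIN=1000000;CPU_FREQUENCY_MAX=3000000"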

NVIDIA GPUs

  • NVIDIA_GPU_CLOCKS=VALUE[,VALUE] – Lock GPU clocks in MHz (no default locking). Takes a single value as argument, or a range of two comma-separated values. For supported values refer to nvidia-smi -q -d SUPPORTED_CLOCKS.

  • NVIDIA_POWER_LIMIT=GPU_POWER_LIMIT,MODULE_POWER_LIMIT – GPU and module power limit in watts (default: 680,680). It takes two comma-separated values: the first limits the GPU power consumption, the second limits the power consumption of the whole module (GPU, CPU, and CPU memory). The module value has to be greater than or equal to the GPU value.

  • CUDA_MPS={0 | 1 | yes | no | enable | disable} – Enables CUDA Multi-Process Service (default: disabled). A combined example of the GPU options follows this list.
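
A sketch combining the GPU options; the clock values are purely illustrative and must be checked against nvidia-smi -q -d SUPPORTED_CLOCKS:

#SBATCH --comment="NVIDIA_GPU_CLOCKS=345,1980;CUDA_MPS=enable"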

Memory & Kernel

  • THP={always | madvise | never} – Transparent Huge Pages mode (default: madvise).

  • THP_SHMEM={advise | always | deny | force | never} – THP behavior for shared memory segments.

  • KERNEL_PARAMETERS=kernel.perf_event_paranoid={-1 | 0 | 1 | 2} – Additional kernel parameters applied at job runtime. Currently only kernel.perf_event_paranoid is supported. A combined example follows this list.
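
A sketch combining the memory and kernel options; the values are illustrative only:

#SBATCH --comment="THP=always;THP_SHMEM=advise;KERNEL_PARAMETERS=kernel.perf_event_paranoid=0"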

Examples

Disable Transparent Huge Pages and limit Grace CPU power to 200 W:

#SBATCH --comment="THP=never;CPU_POWER_CAP=200000000"

Set GPU application clocks for performance-focused workloads:

#SBATCH --comment="NVIDIA_GPU_CLOCKS=990"

GH200 Superchip Power Cap

Each GH200 superchip, comprising CPU and GPU, is power-capped to 680 W on compute nodes (900 W on login nodes). Within this TDP, the CPU is currently power-capped to 100 W by default, i.e. it may use up to 100 W of these 680 W. The CPU power cap may be set to as much as 300 W, or even below 100 W, with the above-mentioned methods. In any case, the GPU on the superchip may then use the remaining power budget for its operation. Refer to NVIDIA’s Grace documentation for more details.

Tuning for Large-Scale Execution

As you are using JUPITER, your workloads probably demand large-scale execution with many Slurm tasks / MPI processes. You might want to experiment with the following options.

Module: UCX-settings/RC-CUDA vs. UCX-settings/DC-CUDA

When loading an MPI framework on the system, like OpenMPI, the UCX-settings module will be loaded automatically with the RC-CUDA flavor. Especially for larger task counts it may be advisable to change to UCX-settings/DC-CUDA, to make use of dynamic connections rather than reliable connections. This is facilitated by setting the UCX_TLS environment variable, which you may be interested in inspecting.
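
A minimal sketch of switching to dynamic connections in a job script; the GCC/OpenMPI toolchain is only an example, and the exact module interplay may differ on the current software stage:

module load GCC OpenMPI           # example toolchain; UCX-settings/RC-CUDA is loaded automatically
module load UCX-settings/DC-CUDA  # switch to dynamic connections for large task counts
echo "$UCX_TLS"                   # inspect the transports selected by the module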

Rendezvous threshold: UCX_RNDV_THRESH

UCX uses heuristics to determine which protocol to use for data exchange. Setting the variable to an explicit size may improve communication bandwidth. Example: UCX_RNDV_THRESH=intra:131072,inter:131072

Further UCX variables

Other UCX variable settings which proved useful for some users are: UCX_RNDV_SCHEME=put_zcopy, UCX_MAX_RNDV_RAILS=1.

Additionally, the latest UCX releases changed the heuristics for selecting internal protocols and thresholds. Those heuristics have been problematic at large scale in some cases, resulting in subpar performance. If you run at a scale of more than 200 nodes, please consider exploring UCX_PROTO_ENABLE=no or manually tuning UCX_RNDV_THRESH and UCX_RNDV_SCHEME. If you see big differences in performance, please contact support at sc@fz-juelich.de.
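
For large runs these variables are typically exported in the job script before srun; the following sketch uses the values mentioned above as starting points for experimentation, not as verified recommendations:

export UCX_RNDV_THRESH=intra:131072,inter:131072
export UCX_RNDV_SCHEME=put_zcopy
export UCX_MAX_RNDV_RAILS=1
# export UCX_PROTO_ENABLE=no      # optionally disable the newer protocol-selection heuristics

srun ./app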

Health Checker

Currently, a different health checker is deployed compared to our production machines. While the deployed checks are thorough, some of the more lightweight health checks might not yet be performed.

Affinity

The GH200 superchip is – NUMA-wise – not as complex as other processors operated by JSC. The following options will result in a good affinity for standard cases:

srun --tasks-per-node=4 --cpus-per-task=72 ./app

See also Processor Affinity.

Stability

A large amount of work is currently being done non-stop on the system, with many operations inducing noise, especially in the network. Sub-standard performance and even occasional job failures might occur; please report them only if you think an underlying, general issue can be seen.

Storage Mounts

JUPITER has two filesystems, both entirely new: ExaSTORE and ExaFLASH.

ExaSTORE is already (nearly) fully available and mounted at /e/. The JUST6 filesystem from the non-JUPITER datacenter is also available (/p/), but only on the login nodes. All project-related environment variables like $PROJECT_cjsc point to the new file system. A fully automated data movement solution to move data between JUST6 and ExaSTORE will be documented and made available soon.

ExaSTORE storage is based on spinning disks and will comprise 22 building blocks in its final setup. Available are:

  • $PROJECT: 1 ExaSTORE building block for mid-term storage and operation (in the final setup this will be extended by 1 additional building block)

  • $HOME: Small per-user space for configuration files. Shares the building block with $PROJECT

  • $SCRATCH: Fast, temporary storage. Files not accessed for 90 days are deleted. Shares 20 ExaSTORE building blocks with $DATA

  • $DATA: Mid-term storage available on specific request (no workflow, yet); shares 20 ExaSTORE building blocks with $SCRATCH

Each building block has a theoretical bandwidth of about 60 GB/s. Depending on many factors, including number of used client nodes and I/O access pattern, a total theoretical bandwidth of 1.2 TB/s for reading and 1 TB/s for writing can be expected on $SCRATCH.
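
A common pattern, shown here only as a sketch with hypothetical paths, is to run a job's heavy I/O in $SCRATCH and to copy results worth keeping to $PROJECT afterwards, since files in $SCRATCH are deleted after 90 days without access:

cd "$SCRATCH"/my_run              # fast, temporary storage for job I/O (path is an example)
srun ./app
cp -r results "$PROJECT"/my_run/  # keep results on the mid-term $PROJECT file system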

ExaFLASH is currently in acceptance and will be made available soon at $FSCRATCH. It is entirely flash-storage-based and has higher access bandwidths.

Issue with OpenMP/ACC device offloading using GCC on JUPITER

There is a known issue where attempting to build a code with GCC 13.x/14.x using OpenMP/ACC device offloading on the AArch64 architecture (like JUPITER’s ARM CPUs) fails at the LTO stage.

This will not be fixed until the next software Stage.

The suggested workaround is to use Clang/LLVM or NVIDIA’s HPC compilers instead of GCC for building codes that rely on device offloading.
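
As a sketch of this workaround, OpenMP target offloading could be built with NVIDIA’s HPC compilers or with Clang/LLVM; the flags below are typical examples and may need adjustment for the installed compiler versions:

# NVIDIA HPC SDK (nvc/nvc++/nvfortran): OpenMP offload for the GH200's Hopper GPU
nvc++ -mp=gpu -gpu=cc90 -o app app.cpp

# Clang/LLVM: OpenMP offload to NVIDIA GPUs
clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o app app.cpp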