JURECA GPU Nodes
JURECA features a number of accelerated (GPU-equipped) compute nodes for applications that can take advantage of such devices via CUDA, OpenCL or OpenACC.
The GPU nodes in the JURECA DC module feature four NVIDIA A100 GPUs. The nodes are accessible in the dc-gpu partition. In order to access this partition, the argument -p dc-gpu (or --partition dc-gpu) must be provided to sbatch or srun.
In addition, as explained in Requesting Generic Resources and Features, the number of requested GPUs can be specified using the --gres=gpu:X argument, with X in the range one to four. If no gpu GRES is given for a job targeting the GPU partition, the submission filter automatically adds --gres=gpu:4.
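For illustration, a minimal srun invocation requesting two GPUs on a single node of the dc-gpu partition could look as follows; the budget placeholder and the time limit are only examples:
$ srun --account=<budget> --partition=dc-gpu --nodes=1 --ntasks=2 --gres=gpu:2 --time=00:10:00 nvidia-smi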
Through the job scheduler (Slurm), GPUs are associated with tasks. By default, one task is associated with one GPU. This is achieved by setting the environment variable CUDA_VISIBLE_DEVICES, which controls which GPUs are visible to an application.
Consider the following example:
$ srun --ntasks 4 bash -c 'echo "Rank: $PMI_RANK CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' | sort
Rank: 0 CUDA_VISIBLE_DEVICES: 0
Rank: 1 CUDA_VISIBLE_DEVICES: 1
Rank: 2 CUDA_VISIBLE_DEVICES: 2
Rank: 3 CUDA_VISIBLE_DEVICES: 3
In this case, the task with ID 0 has access to the GPU with ID 0, the task with ID 1 has access to the GPU with ID 1, and so on. An application launched like this with srun will see only one GPU, with the application-internal ID always being 0. In other words: by default, each GPU-using application sees one GPU per task (rank). It is ensured that the process accessing a GPU is launched on a CPU core with affinity to that GPU.
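To check which CPU cores have affinity to which GPU on a node, the topology matrix reported by nvidia-smi can be inspected, for example by running it inside an allocation (a sketch; the exact output depends on the node and driver version):
$ srun --account=<budget> --partition=dc-gpu --nodes=1 --gres=gpu:4 nvidia-smi topo -m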
The default behavior can be changed by overriding CUDA_VISIBLE_DEVICES before the srun invocation. In that case, the value of the environment variable is not changed by Slurm.
The behavior is currently the same for jobs with only one task: only one GPU is visible to the task. If a single task should see all four GPUs, please manually export CUDA_VISIBLE_DEVICES=0,1,2,3. This behavior is currently being worked on and will be modified in the future.
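As a sketch of this workaround, assuming Slurm leaves the manually exported value untouched as described above, a single task would then see all four GPUs:
$ export CUDA_VISIBLE_DEVICES=0,1,2,3
$ srun --ntasks 1 bash -c 'echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"'
CUDA_VISIBLE_DEVICES: 0,1,2,3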
Nvidia Profiling Tools and Clock Speed
One point to take into consideration when using NVIDIA profiling tools is the clock speed. For example, when using NVIDIA Nsight Compute, the tool's clock control locks the clocks to their base value by default, which is relatively low for some GPU models.
The reason for this behavior is that the values of many metrics are directly influenced by the current GPU SM and memory clock frequencies. The GPU might be in a higher-clocked state at some points during application execution and in a lower-clocked state at others, which would have a direct impact on the metrics being profiled.
To mitigate this non-determinism, NVIDIA Nsight Compute attempts to limit GPU clock frequencies to their base value. As a result, metric values are less impacted by the location of the kernel in the application, or by the number of the specific replay pass.
More information can be found at: NVIDIA Nsight Compute Documentation
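If the clocks should instead follow the application's regular behavior, Nsight Compute's clock control can be disabled; a hedged sketch using the ncu command-line profiler (metric values then become less reproducible, and option behavior may differ between ncu versions):
$ ncu --clock-control none -o profile ./gpu-prog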
Job Script Examples
Example: MPI application starting 16 tasks on 4 nodes using 128 CPUs per node and 4 GPUs per node. The program must be able to coordinate the access to the four GPU devices on each of the four nodes:
#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=dc-gpu
#SBATCH --gres=gpu:4

srun ./gpu-prog
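Assuming the script above is saved as gpu-job.sh (a hypothetical file name), it can be submitted with:
$ sbatch gpu-job.sh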