GPU Computing

JUWELS GPU Nodes

JUWELS features a number of accelerated (GPU-equipped) compute nodes for applications that can take advantage of such devices via CUDA, OpenCL or OpenACC.

JUWELS includes two types of GPU-equipped compute nodes: the Booster module features a large number of compute nodes equipped with four NVIDIA A100 GPUs each. Additionally, the Cluster module includes a smaller number of GPU-equipped compute nodes with four NVIDIA V100 SXM2 GPUs each.

Booster module

The GPU nodes in the JUWELS Booster feature four NVIDIA A100 GPUs each. The nodes are accessible in the booster partition. To use this partition, the argument -p booster (or --partition booster) must be passed to sbatch or salloc. In addition, as explained in Requesting Generic Resources, the number of requested GPUs must be specified using the --gres=gpu:X argument, with X in the range one to four.
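
For example, a minimal interactive allocation of a single Booster node with all four GPUs (the account <budget> is a placeholder for your compute budget) could look like this:

$ salloc --account=<budget> --partition=booster --nodes=1 --gres=gpu:4 --time=00:30:00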

To compile for Booster GPU nodes, dedicated login nodes are available. See Access.

An overview of the JUWELS Booster node and network configuration as well as some discussion about task placement can be found in a dedicated Booster overview.

Cluster module

The GPU nodes in the JUWELS Cluster feature four NVIDIA V100 SXM2 GPUs each. The nodes are accessible in the gpus partition. To access this partition, the argument -p gpus (or --partition gpus) must be passed to sbatch or salloc. In addition, as explained in Requesting Generic Resources, the number of requested GPUs must be specified using the --gres=gpu:X argument, with X in the range one to four.
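
Since X may be smaller than four, a job that needs only two of the GPUs of a node could, for instance, be allocated interactively like this (<budget> is again a placeholder):

$ salloc --account=<budget> --partition=gpus --nodes=1 --gres=gpu:2 --time=00:30:00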

GPU Visibility/Affinity

Through the job scheduler (Slurm), GPUs are associated with tasks. By default, one task is associated with one GPU. This is achieved by setting the environment variable CUDA_VISIBLE_DEVICES, which controls which GPUs are visible to an application.

Consider the following example:

$ srun --tasks 4 bash -c 'echo "Rank: $PMI_RANK   CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' | sort
Rank: 0   CUDA_VISIBLE_DEVICES: 0
Rank: 1   CUDA_VISIBLE_DEVICES: 1
Rank: 2   CUDA_VISIBLE_DEVICES: 2
Rank: 3   CUDA_VISIBLE_DEVICES: 3

In this case, the task with ID 0 has access to the GPU with ID 0, the task with ID 1 has access to the GPU with ID 1, and so on. An application launched like this with srun will see only one GPU, with the application-internal ID always being 0. In other words: by default, each GPU-using application sees one GPU per task (rank). It is ensured that the process accessing this GPU is launched on a CPU core with affinity to that GPU.

The default behavior can be changed by exporting CUDA_VISIBLE_DEVICES before the srun invocation. In that case, the value of the environment variable is not modified by Slurm and is passed to all tasks unchanged.
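
For example, exporting the variable before srun makes all four GPUs visible to every task; the expected output is sketched below:

$ export CUDA_VISIBLE_DEVICES=0,1,2,3
$ srun --tasks 4 bash -c 'echo "Rank: $PMI_RANK   CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' | sort
Rank: 0   CUDA_VISIBLE_DEVICES: 0,1,2,3
Rank: 1   CUDA_VISIBLE_DEVICES: 0,1,2,3
Rank: 2   CUDA_VISIBLE_DEVICES: 0,1,2,3
Rank: 3   CUDA_VISIBLE_DEVICES: 0,1,2,3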

More discussions and examples relating to GPU visibility and affinity can be found in the JUWELS Booster overview document. While the examples are specific to the node architecture of JUWELS Booster, the general behavior is the same on the JUWELS Cluster, albeit with a different NUMA layout (two NUMA domains per socket).

Note

Currently, the behavior is also the same for jobs with only one task: only one GPU is visible to the task. If a single task should use all four GPUs of a node, please manually export CUDA_VISIBLE_DEVICES=0,1,2,3, as shown in the sketch below. This behavior is currently being worked on and will soon be modified.
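
A minimal sketch of such a single-task job script, assuming a program gpu-prog that manages all four GPUs itself, could look like this:

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:15:00
#SBATCH --partition=<partition>
#SBATCH --gres=gpu:4

# Make all four GPUs visible to the single task (see the note above)
export CUDA_VISIBLE_DEVICES=0,1,2,3
srun ./gpu-prog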

Job Script Examples

Example 1: An MPI application starting 16 tasks on 4 nodes, using 4 GPUs per node. The program must be able to coordinate access to the four GPU devices on each of the four nodes:

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=4
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=4
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=<partition>
#SBATCH --gres=gpu:4

srun ./gpu-prog

where <partition> is either booster or gpus. Please note that the compute nodes of the booster and gpus partitions feature a different number of CPU cores.

Example 2: Four independent instances (job steps) of a GPU program running on a JUWELS Cluster node, each using one CPU thread and one GPU device. The instances are pinned to CPU cores 0, 10, 20 and 30, respectively:

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=1
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4

srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:0  ./gpu-prog &
srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:10 ./gpu-prog &
srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:20 ./gpu-prog &
srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:30 ./gpu-prog &

wait

The core IDs 0, 10, 20 and 30 place each job step in a different NUMA domain of the node (the JUWELS Cluster GPU nodes have two NUMA domains per socket).

Example 3: Four independent instances (job steps) of a GPU program running on a JUWELS Booster node, each using one CPU thread and one GPU device. The instances are pinned to CPU cores 18, 6, 42 and 30, respectively; on the Booster nodes, these cores lie in the NUMA domains with affinity to GPUs 0 through 3 (see the Booster overview):

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=1
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=booster
#SBATCH --gres=gpu:4

srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:18 ./gpu-prog &
srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:6  ./gpu-prog &
srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:42 ./gpu-prog &
srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:30 ./gpu-prog &

wait