JUWELS Booster Overview

JUWELS Booster consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs. The compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology. To log in, please see Access.

Node Configuration

The configuration of the JUWELS Booster compute nodes is as follows:

  • CPU: AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 [1] configuration (details on WikiChip)

  • Memory: 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)

  • GPU: 4 × NVIDIA A100 Tensor Core GPU with 40 GB; connected via NVLink3 to each other

  • Network: 4 × Mellanox ConnectX-6 HDR200 InfiniBand HCAs (200 Gbit/s each)

  • Periphery: CPU, GPU, and network adapter are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes). PCIe switches are configured in synthetic mode.

Figure: Schematic of a JUWELS Booster compute node (_images/juwelsbooster-node.svg)
[1] NPS-4: “NUMA domains per socket: 4”; each socket is divided into four NUMA domains. This configuration stems from the way the CPU is manufactured: it is not one monolithic die, but consists of four individual dies.
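To inspect this layout on an allocated compute node, standard tools can be run through Slurm. The following is a sketch; the exact output depends on the node and driver version:

$ srun --ntasks 1 nvidia-smi topo -m      # connectivity matrix of GPUs, HCAs, and CPU/NUMA affinities
$ srun --ntasks 1 numactl --hardware      # NUMA domains with their cores and memory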

System Network Topology

The InfiniBand network of JUWELS Booster is implemented as a DragonFly+ network.

48 nodes are combined into a switch group (cell) and interconnected in a full fat-tree topology with 10 leaf switches and 10 spine switches in a two-level configuration. 40 Tbit/s of bisection bandwidth is available within a cell.

Network topology within a JUWELS Booster cell

Sketch of the network topology within a JUWELS Booster cell with 48 nodes (N1 to N48), 10 level-1 switches (L1-1 to L1-10), and 10 level-2 switches (L2-1 to L2-10). Only a small subset of the links is shown for readability. The purple, 20th link leaving each level-2 switch indicates the connection to JUWELS Cluster, while the other 19 outgoing level-2 links connect to other cells.

20 cells are connected with 10 links between each pair of cells, delivering 4 Tbit/s of bisection bandwidth between cells. A total bisection bandwidth of 400 Tbit/s is available. 10 links of each cell connect to JUWELS Cluster.

Network topology between JUWELS Booster cells

Sketch of the network topology between the cells of JUWELS Booster. Only the links for cells 1 and 2 are shown as an example.

Affinity

The AMD host CPU is configured with 4 NUMA domains per socket (NPS-4), resulting in 8 NUMA domains (0-7) for the 2-socket system. Not every NUMA domain has a direct connection (affinity) to each GPU or HCA.

The batch submission system, Slurm, automatically selects the affine devices by default. The affinity is as follows, sorted by GPU ID.

JUWELS Booster Affinity Overview

NUMA Domain ID   GPU ID   HCA ID   Core IDs
3                0        0        18-23,66-71
1                1        1        6-11,54-59
7                2        2        42-47,90-95
5                3        3        30-35,78-83

Slurm

Good affinity defaults are selected by Slurm extensions in PSSlurm. They can be overridden by the user if desired.
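A typical allocation that matches these defaults uses one task per GPU. The following batch script is a minimal sketch: the account and application name are placeholders, and the partition name booster is an assumption.

#!/bin/bash
#SBATCH --account=<project>        # placeholder: your compute project
#SBATCH --partition=booster        # assumption: JUWELS Booster partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4        # one task per GPU
#SBATCH --gres=gpu:4               # request all four GPUs of the node
#SBATCH --time=00:30:00

srun ./app                         # placeholder application; inherits the affinity defaults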

GPU Devices

Slurm sets the CUDA_VISIBLE_DEVICES variable automatically, giving each rank access to the one GPU it is closest to.

$ srun --ntasks 4 bash -c 'echo "Rank: $PMI_RANK   CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' | sort
Rank: 0   CUDA_VISIBLE_DEVICES: 0
Rank: 1   CUDA_VISIBLE_DEVICES: 1
Rank: 2   CUDA_VISIBLE_DEVICES: 2
Rank: 3   CUDA_VISIBLE_DEVICES: 3

The variable is picked up by CUDA-based applications and can be used directly. Inside the application, the visible GPU has device ID 0, as application-internal numbering of the visible devices always starts at 0.

If CUDA_VISIBLE_DEVICES is set externally, the variable is respected by Slurm and passed through unchanged. Make sure to set it consciously!

$ export CUDA_VISIBLE_DEVICES=0,1,2,3
$ srun --ntasks 4 bash -c 'echo "Rank: $PMI_RANK   CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' | sort
Rank: 0   CUDA_VISIBLE_DEVICES: 0,1,2,3
Rank: 1   CUDA_VISIBLE_DEVICES: 0,1,2,3
Rank: 2   CUDA_VISIBLE_DEVICES: 0,1,2,3
Rank: 3   CUDA_VISIBLE_DEVICES: 0,1,2,3

NUMA Domains

By default, each task is bound to one core of the NUMA domain close to its GPU.

$ srun --cpu-bind=verbose --ntasks 4 bash -c 'echo "Rank: $PMI_RANK   CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' |& sort
cpu-bind=THREADS - jwb0021, task  0  0 [17540]: mask 0x40000 set
cpu-bind=THREADS - jwb0021, task  1  1 [17542]: mask 0x40 set
cpu-bind=THREADS - jwb0021, task  2  2 [17544]: mask 0x40000000000 set
cpu-bind=THREADS - jwb0021, task  3  3 [17547]: mask 0x40000000 set
Rank: 0   CUDA_VISIBLE_DEVICES: 0
Rank: 1   CUDA_VISIBLE_DEVICES: 1
Rank: 2   CUDA_VISIBLE_DEVICES: 2
Rank: 3   CUDA_VISIBLE_DEVICES: 3

This translates to the following binary masks, grouping the six physical cores of each of the 8 NUMA domains [2]:

Rank 0: 000000 000000 000000 100000 000000 000000 000000 000000
Rank 1: 000000 100000 000000 000000 000000 000000 000000 000000
Rank 2: 000000 000000 000000 000000 000000 000000 000000 100000
Rank 3: 000000 000000 000000 000000 000000 100000 000000 000000

Hence, the srun default binds each rank to the first CPU core of the NUMA domain close to its GPU, as per the table JUWELS Booster Affinity Overview. To extend the mask beyond the first core, for example to the first two cores, use --cpus-per-task=2, as sketched below.
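A sketch of such a call, printing each task's resulting core set with taskset (output not shown, as it depends on the node and Slurm configuration):

$ srun --ntasks 4 --cpus-per-task 2 --cpu-bind=verbose bash -c 'taskset -cp $$'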

To change the mask to allow all cores of the NUMA domain close to a GPU, select --cpu-bind=socket.

$ srun --cpu-bind=socket,verbose --ntasks 4 bash -c 'echo "Rank: $PMI_RANK   CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' |& sort
cpu-bind=SOCKETS - jwb0021, task  0  0 [17564]: mask 0xfc0000000000fc0000 set
cpu-bind=SOCKETS - jwb0021, task  1  1 [17566]: mask 0xfc0000000000fc0 set
cpu-bind=SOCKETS - jwb0021, task  2  2 [17568]: mask 0xfc0000000000fc0000000000 set
cpu-bind=SOCKETS - jwb0021, task  3  3 [17571]: mask 0xfc0000000000fc0000000 set
Rank: 0   CUDA_VISIBLE_DEVICES: 0
Rank: 1   CUDA_VISIBLE_DEVICES: 1
Rank: 2   CUDA_VISIBLE_DEVICES: 2
Rank: 3   CUDA_VISIBLE_DEVICES: 3

This translates to the following binary masks:

Rank 0: 000000 000000 000000 111111 000000 000000 000000 000000
Rank 1: 000000 111111 000000 000000 000000 000000 000000 000000
Rank 2: 000000 000000 000000 000000 000000 000000 000000 111111
Rank 3: 000000 000000 000000 000000 000000 111111 000000 000000

To override any pre-configured Slurm binding, use --cpu-bind=none, or any other valid CPU binding option, including mask_ldom; an example is sketched below.
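For example, NUMA-domain masks can be given explicitly per task. The following sketch reproduces the GPU-affine domains from the affinity table (task 0 → domain 3, task 1 → domain 1, task 2 → domain 7, task 3 → domain 5); ./app is a placeholder:

$ srun --ntasks 4 --cpu-bind=mask_ldom:0x8,0x2,0x80,0x20 ./app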

InfiniBand Adapters (HCAs)

Affinity to the InfiniBand adapters is steered through environment variables read by UCX, the network communication framework. For a given process, the affinity to a certain HCA can be set with

export UCX_NET_DEVICES=mlx5_0:1

In this case, the HCA with ID 0 is used. The HCA with ID 1 is called mlx5_1:1, and so on.
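To give each task the HCA matching its GPU, the variable can be derived from the local Slurm rank. A minimal sketch, assuming one task per GPU and ./app as a placeholder:

$ srun --ntasks 4 bash -c 'export UCX_NET_DEVICES=mlx5_${SLURM_LOCALID}:1; exec ./app'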

Overriding Defaults

While the configured Slurm defaults should deliver good performance in most cases, other bindings might be of interest depending on the behavior of the actual application.

We recommend using high-level Slurm options to steer the behavior of the generated masks.

If you would rather take control yourself, consider using a wrapper script as a prefix to the application; a simple example can be found here. Make sure to disable the Slurm-provided pinning with --cpu-bind=none!
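For illustration, such a wrapper could look like the following sketch (not the linked example): it derives GPU, HCA, and core set from the local Slurm rank, with the core ranges hard-coded from the affinity table above and ./app as a placeholder.

#!/bin/bash
# affinity_wrapper.sh -- hypothetical example wrapper, invoked as:
#   srun --cpu-bind=none ./affinity_wrapper.sh ./app
LOCAL_RANK=${SLURM_LOCALID:-0}

# Core ranges of the GPU-affine NUMA domains, ordered by GPU ID (see affinity table)
CORES=( "18-23,66-71" "6-11,54-59" "42-47,90-95" "30-35,78-83" )

export CUDA_VISIBLE_DEVICES=${LOCAL_RANK}            # GPU with the same ID as the local rank
export UCX_NET_DEVICES=mlx5_${LOCAL_RANK}:1          # matching HCA

# Bind to the cores of the affine NUMA domain and start the application
exec numactl --physcpubind=${CORES[${LOCAL_RANK}]} "$@"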

[2] This can be determined, e.g., by running " ".join([f'{int(mask.split(",")[0], base=16):096b}'[::-1][:48][6*i:6*(i+1)] for i in range(8)]) (using the 48 physical cores of the 96 SMT-2 logical cores of JUWELS Booster).