.. include:: system.rst

.. _juwels_booster_overview:

JUWELS Booster Overview
=======================

JUWELS Booster consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs. The GPUs are hosted by AMD EPYC Rome CPUs. The compute nodes are connected with HDR-200 InfiniBand in a DragonFly+ topology.

To log in, please see :ref:`access`.

Node Configuration
------------------

The configuration of a JUWELS Booster compute node is the following:

* **CPU**: AMD EPYC 7402 processor; 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 [#nps]_ configuration (details on WikiChip)
* **Memory**: 512 GB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
* **GPU**: 4 × NVIDIA A100 Tensor Core GPUs with 40 GB each; connected to each other via NVLink3
* **Network**: 4 × Mellanox HDR200 InfiniBand ConnectX 6 adapters (*HCAs*; 200 Gbit/s each)
* **Periphery**: CPU, GPUs, and network adapters are connected via 2 PCIe Gen 4 switches with 16 PCIe lanes going to each device (CPU socket: 2×16 lanes). The PCIe switches are configured in *synthetic mode*.

.. figure:: ./images/juwelsbooster-node.svg
   :name: jwbooster-node
   :align: center

.. [#nps] NPS-4: "NUMA domains per socket: 4"; the socket is divided into four domains. This configuration originates from the production of the CPU chip, which is not manufactured as one monolithic die, but rather consists of four individual dies.
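The layout described above can be inspected directly on a compute node. The following is a minimal sketch, assuming an interactive job allocation; the exact output depends on the node and the installed driver stack:

.. code-block:: bash

   # Inspect CPU, NUMA, and GPU/HCA topology on a compute node
   # (illustrative commands; output varies with node and driver version)
   lscpu | grep -E 'Socket|Core|NUMA'   # sockets, cores, NUMA domains
   numactl --hardware                   # NUMA domains and attached memory
   nvidia-smi topo -m                   # GPU <-> HCA <-> NUMA affinity matrix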
System Network Topology
-----------------------

The InfiniBand network of JUWELS Booster is implemented as a DragonFly+ network. 48 nodes are combined in a switch group (*cell*) and interconnected in a full fat-tree topology with 10 leaf switches and 10 spine switches in a two-level configuration. 40 Tbit/s of bisection bandwidth is available within a cell.

.. figure:: ./images/jwbooster-topology-intra.svg
   :name: jwbooster-topo-intra
   :align: center
   :width: 800px
   :alt: Network topology within a JUWELS Booster cell

   Sketch of the network topology within a JUWELS Booster cell with 48 nodes (``N 1`` to ``N 48``), 10 level 1 switches (``L1 1`` to ``L1 10``), and 10 level 2 switches (``L2 1`` to ``L2 10``). Only a small subset of the links is shown for readability. The purple, 20th link leaving each level 2 switch indicates the connection to JUWELS Cluster, while the other 19 outgoing level 2 links connect to other cells.

20 cells are connected with 10 links between each pair of cells, delivering 4 Tbit/s of bisection bandwidth between cells; in total, a bisection bandwidth of 400 Tbit/s is available. 10 links of each cell connect to JUWELS Cluster.

.. figure:: ./images/jwbooster-topology-inter.svg
   :name: jwbooster-topo-inter
   :align: center
   :alt: Network topology between JUWELS Booster cells

   Sketch of the network topology between the cells of JUWELS Booster. Only the links for cells 1 and 2 are shown as an example.

Affinity
--------

The AMD host CPU is configured with 4 NUMA domains per socket (NPS-4), resulting in 8 NUMA domains (0-7) for the 2-socket system. Not every NUMA domain has a direct connection (*affinity*) to every GPU or HCA. The batch system, Slurm, automatically selects the affine devices by default. The affinity is as follows, sorted by GPU ID.

.. _affinity-table-label:

.. table:: JUWELS Booster Affinity Overview

   +----------------+--------+--------+-------------+
   | NUMA Domain ID | GPU ID | HCA ID | Core IDs    |
   +================+========+========+=============+
   | 3              | 0      | 0      | 18-23,66-71 |
   +----------------+--------+--------+-------------+
   | 1              | 1      | 1      | 6-11,54-59  |
   +----------------+--------+--------+-------------+
   | 7              | 2      | 2      | 42-47,90-95 |
   +----------------+--------+--------+-------------+
   | 5              | 3      | 3      | 30-35,78-83 |
   +----------------+--------+--------+-------------+

Slurm
^^^^^

Good affinity defaults are selected by the Slurm extensions in PSSlurm; they can be overridden by the user.

GPU Devices
"""""""""""

Slurm sets the ``CUDA_VISIBLE_DEVICES`` variable automatically, giving each rank access to the one GPU it is close to.

.. code-block:: bash

   $ srun --ntasks 4 bash -c 'echo "Rank: $PMI_RANK CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' | sort
   Rank: 0 CUDA_VISIBLE_DEVICES: 0
   Rank: 1 CUDA_VISIBLE_DEVICES: 1
   Rank: 2 CUDA_VISIBLE_DEVICES: 2
   Rank: 3 CUDA_VISIBLE_DEVICES: 3

The variable is picked up by CUDA applications and can be used directly. Inside the application, the visible GPU has ID 0, as application-internal numbering always starts at 0.

If ``CUDA_VISIBLE_DEVICES`` is set externally, the variable is respected, passed through by Slurm, and not changed. Make sure to set it consciously!

.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES=0,1,2,3
   $ srun --ntasks 4 bash -c 'echo "Rank: $PMI_RANK CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' | sort
   Rank: 0 CUDA_VISIBLE_DEVICES: 0,1,2,3
   Rank: 1 CUDA_VISIBLE_DEVICES: 0,1,2,3
   Rank: 2 CUDA_VISIBLE_DEVICES: 0,1,2,3
   Rank: 3 CUDA_VISIBLE_DEVICES: 0,1,2,3

NUMA Domains
""""""""""""

By default, a task is bound to one core of the NUMA domain close to its GPU.

.. code-block:: bash

   $ srun --cpu-bind=verbose --ntasks 4 bash -c 'echo "Rank: $PMI_RANK CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' |& sort
   cpu-bind=THREADS - jwb0021, task 0 0 [17540]: mask 0x40000 set
   cpu-bind=THREADS - jwb0021, task 1 1 [17542]: mask 0x40 set
   cpu-bind=THREADS - jwb0021, task 2 2 [17544]: mask 0x40000000000 set
   cpu-bind=THREADS - jwb0021, task 3 3 [17547]: mask 0x40000000 set
   Rank: 0 CUDA_VISIBLE_DEVICES: 0
   Rank: 1 CUDA_VISIBLE_DEVICES: 1
   Rank: 2 CUDA_VISIBLE_DEVICES: 2
   Rank: 3 CUDA_VISIBLE_DEVICES: 3

This translates to the following binary masks, grouping the six physical cores of each of the 8 NUMA domains [#pythonscript]_:

.. code-block::

   Rank 0: 000000 000000 000000 100000 000000 000000 000000 000000
   Rank 1: 000000 100000 000000 000000 000000 000000 000000 000000
   Rank 2: 000000 000000 000000 000000 000000 000000 000000 100000
   Rank 3: 000000 000000 000000 000000 000000 100000 000000 000000

Hence, the ``srun`` default binds each rank to the first CPU core of the NUMA domain close to its GPU, as per the table :ref:`affinity-table-label`. To bind not only to the first core but, for example, to the first two cores, use ``--cpus-per-task=2``, as sketched below.
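A minimal sketch of such an invocation follows (command only; the exact masks reported by ``--cpu-bind=verbose`` depend on the node):

.. code-block:: bash

   # Request two CPUs per task so that each task's mask covers the first two
   # cores of its affine NUMA domain (output omitted; the reported masks
   # depend on the node).
   srun --ntasks 4 --cpus-per-task=2 --cpu-bind=verbose \
        bash -c 'echo "Rank: $PMI_RANK CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"'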
To change the mask to allow for *all* cores of the NUMA domain close to a GPU, select ``--cpu-bind=socket``.

.. code-block:: bash

   $ srun --cpu-bind=socket,verbose --ntasks 4 bash -c 'echo "Rank: $PMI_RANK CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' |& sort
   cpu-bind=SOCKETS - jwb0021, task 0 0 [17564]: mask 0xfc0000000000fc0000 set
   cpu-bind=SOCKETS - jwb0021, task 1 1 [17566]: mask 0xfc0000000000fc0 set
   cpu-bind=SOCKETS - jwb0021, task 2 2 [17568]: mask 0xfc0000000000fc0000000000 set
   cpu-bind=SOCKETS - jwb0021, task 3 3 [17571]: mask 0xfc0000000000fc0000000 set
   Rank: 0 CUDA_VISIBLE_DEVICES: 0
   Rank: 1 CUDA_VISIBLE_DEVICES: 1
   Rank: 2 CUDA_VISIBLE_DEVICES: 2
   Rank: 3 CUDA_VISIBLE_DEVICES: 3

This translates to the following binary masks:

.. code-block::

   Rank 0: 000000 000000 000000 111111 000000 000000 000000 000000
   Rank 1: 000000 111111 000000 000000 000000 000000 000000 000000
   Rank 2: 000000 000000 000000 000000 000000 000000 000000 111111
   Rank 3: 000000 000000 000000 000000 000000 111111 000000 000000

To override any pre-configured Slurm binding, use ``--cpu-bind=none`` or any other valid CPU binding option, including ``mask_ldom``.

.. [#pythonscript] This can be determined, e.g., by running ``" ".join([f'{int(mask.split(",")[0], base=16):096b}'[::-1][:48][6*i:6*(i+1)] for i in range(8)])`` (using the 48 physical cores with SMT-2, i.e. 96 logical cores, of JUWELS Booster).

InfiniBand Adapters (HCAs)
""""""""""""""""""""""""""

Affinity to the InfiniBand adapters is steered through environment variables of UCX, the network communication framework. For a given process, the affinity to a certain HCA can be set with

.. code-block:: bash

   export UCX_NET_DEVICES=mlx5_0:1

In this case, the HCA with ID 0 is used. The HCA with ID 1 is called ``mlx5_1:1``, and so on.

Overriding Defaults
^^^^^^^^^^^^^^^^^^^

While the configured Slurm defaults should deliver good performance in most cases, other bindings might be of interest depending on the behavior of the actual application. We recommend using high-level Slurm options to steer the generated masks. If you would rather take control yourself, you may consider using a wrapper script as a prefix to the application. A simple example can be found :download:`here <./affinity-wrapper.sh>`. Make sure to disable the Slurm-provided pinning with ``--cpu-bind=none``!
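For illustration, a minimal sketch of what such a wrapper could look like, assuming 4 tasks per node with node-local ranks 0 to 3 (this is a sketch, not the downloadable script above; the GPU, HCA, and NUMA domain numbering follows the affinity table):

.. code-block:: bash

   #!/bin/bash
   # Illustrative wrapper sketch (not the downloadable affinity-wrapper.sh):
   # pin each task to the GPU, HCA, and NUMA domain listed in the affinity
   # table, based on its node-local rank. Launch with Slurm pinning disabled:
   #   srun --ntasks-per-node 4 --cpu-bind=none ./wrapper.sh ./app
   LOCAL_RANK=${SLURM_LOCALID}

   # NUMA domain per GPU ID, taken from the affinity table above
   NUMA_DOMAINS=(3 1 7 5)

   export CUDA_VISIBLE_DEVICES=${LOCAL_RANK}
   export UCX_NET_DEVICES=mlx5_${LOCAL_RANK}:1

   # Bind CPU and memory to the affine NUMA domain and start the application
   exec numactl --cpunodebind=${NUMA_DOMAINS[${LOCAL_RANK}]} \
                --membind=${NUMA_DOMAINS[${LOCAL_RANK}]} "$@"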