:orphan:

.. include:: system.rst

.. _gpu_computing:

GPU Computing
=============

|SYSTEM_NAME| GPU Nodes
-----------------------

|SYSTEM_NAME| features a number of accelerated (GPU-equipped) compute nodes for applications that can take advantage of such devices, for example via CUDA, OpenCL or OpenACC.

.. ifconfig:: system_name == 'juwels'

   |SYSTEM_NAME| includes two types of GPU-equipped compute nodes: The Booster module features a large number of compute nodes equipped with four NVIDIA A100 SXM4 GPUs. Additionally the Cluster module includes a smaller number of GPU-equipped compute nodes with NVIDIA V100 SXM2 GPUs.

   Booster module
   ~~~~~~~~~~~~~~

   The GPU nodes in the |SYSTEM_NAME| Booster feature four NVIDIA A100 GPUs (40 GB memory). The nodes are accessible in the ``booster`` partition.
   In order to use this partition the argument ``-p booster`` (or ``--partition booster``) must be provided to ``sbatch`` or ``salloc``.
   In addition, as explained in :ref:`batch_generic_resources`, the number of requested GPUs must be specified using the ``--gres=gpu:X`` argument with ``X`` in the range one to four.

   To compile for Booster GPU nodes, dedicated login nodes are available. See :ref:`access`.

   An overview of the JUWELS Booster node and network configuration as well as some discussion about task placement can be found in a dedicated :ref:`Booster overview <juwels_booster_overview>`.

   Cluster module
   ~~~~~~~~~~~~~~

   The GPU nodes in the |SYSTEM_NAME| Cluster feature four NVIDIA V100 GPUs (16 GB memory). The nodes are accessible in the ``gpus`` partition.
   In order to access this partition the argument ``-p gpus`` (or ``--partition gpus``) must be provided to ``sbatch`` or ``salloc``.
   In addition, as explained in :ref:`batch_generic_resources`, the number of requested GPUs can be specified using the ``--gres=gpu:X`` argument with ``X`` in the range one to four.
   In case no ``gpu`` GRES is given to jobs targeting the GPU partitions, the submission filter will add ``--gres=gpu:4`` automatically.

.. ifconfig:: system_name == 'jureca'

   The GPU nodes in the |SYSTEM_NAME| DC module feature four NVIDIA A100 GPUs (40 GB memory). The nodes are accessible in the ``dc-cpu`` partition.
   In order to access this partition the argument ``-p dc-gpu`` (or ``--partition dc-gpu``) must be provided to ``sbatch`` or ``salloc``.
   In addition, as explained in :ref:`batch_generic_resources`, the number of requested GPUs can be specified using the ``--gres=gpu:X`` argument with ``X`` in the range one to four.
   In case no ``gpu`` GRES is given to jobs targeting the GPU partitions, the submission filter will add ``--gres=gpu:4`` automatically.
   Finally, you must also load the module ``MPI-settings/CUDA`` to ensure MPI is configured to properly communicate between GPUs.


.. ifconfig:: system_name == 'jusuf'

   The GPU compute nodes feature one NVIDIA V100 PCIe GPU (16 GB memory). The nodes are accessible in the ``gpus`` partition.
   In order to access this partition the argument ``-p gpus`` (or ``--partition gpus``) must be provided to ``sbatch`` or ``salloc``.
   As there is only one GPU per node no additional option for the number of GPUs has to be given (``--gres=gpu:1`` is the default). 

GPU Visibility/Affinity
-----------------------

Through the job scheduler (Slurm), GPUs are associated to tasks. Per default, one task is associated to one GPU. This is achieved by Slurm setting the environment variable ``CUDA_VISIBLE_DEVICES`` which steers available GPUs to an application.

Consider the following example:

.. code-block:: bash

  $ srun --ntasks 4 bash -c 'echo "Rank: $PMI_RANK   CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"' | sort
  Rank: 0   CUDA_VISIBLE_DEVICES: 0
  Rank: 1   CUDA_VISIBLE_DEVICES: 1
  Rank: 2   CUDA_VISIBLE_DEVICES: 2
  Rank: 3   CUDA_VISIBLE_DEVICES: 3

In this case, task with ID 0 has access to GPU with ID 0, task with ID 1 has access to GPU with ID 1, and so on. An application launched like this with ``srun`` will *see* only one GPU with the app-internal ID always being 0. In other words: In the default, each GPU-using application will see one GPU per task (/rank). It is ensured that the process which accesses this GPU will be launched on a CPU core which has affinity to the GPU.

The default behavior can be changed by overriding ``CUDA_VISIBLE_DEVICES`` before the ``srun`` invocation. In that case, the value of the environment variable is not changed.

.. ifconfig:: system_name == 'juwels'
  
  More discussions and examples relating GPU visibility and affinity can be found in the :ref:`JUWELS Booster overview document <juwels_booster_overview>`. While the examples are specific to the node architecture of JUWELS Booster, the general behavior is same on JUWELS Cluster -- with different NUMA domains per socket (2).


.. note::

  Also for jobs with only one task, the behavior is the same currently. In that case, only one GPU is visible to task. Please manually ``export CUDA_VISIBLE_DEVICES=0,1,2,3``. This behavior is currently being worked on and will soon be modified.

Setting GPU Clock Rates
-----------------------

NVIDIA GPUs support a variety of GPU clock rates. Per default, the GPUs are operated at the highest clock frequency. Users may opt to operate GPUs at a lower frequency, usually entailing lower application performance.

To set GPU clock rates, the ``--gpufreq`` switch to ``srun`` can be used. Possible operation frequencies can be retrieved through ``nvidia-smi`` (``nvidia-smi -q -d SUPPORTED_CLOCKS``). 

Usage example:

.. code-block:: bash

  $ srun --gpufreq=1410 nvidia-smi

.. note::

  Setting GPU clock rates is facilitated through a custom Slurm SPANK plugin, mimicking the upstream support of Slurm of ``--gpu-freq`` (note the additional hyphen). This canonical (hyphenated) switch is not supported on JSC's platforms, currently.


NVIDIA Profiling Tools and Clock Speed
--------------------------------------

.. ifconfig:: system_name != 'jedi'

  With current combination of hard- and software, the NVIDIA GPU profiling metrics cannot be read by multiple processes at the same time.
  For user side profiling to work as expected, the system-side monitoring needs to be deactivated.
  Users can deactivate it by adding the option ``--disable-dcgm`` to their job submissions (with ``sbatch`` or ``salloc``).

A point to take into consideration when using NVIDIA profiling tools is the clock frequency. For example, when using ``NVIDIA Nsight Compute``, the tool's clock control is expected to lock the clocks to their base value by default, which is relatively low for some GPU models.

The reason for this behavior is that for many metrics, their value is directly influenced by the current GPU SM and memory clock frequencies. The GPU might be in a higher clocked state at some points executing an application, and might be in a lower clocked state in other execution points. This would have a direct impact on the metrics being profiled.

To mitigate this non-determinism, NVIDIA Nsight Compute attempts to limit GPU clock frequencies to their base value. As a result, metric values are less impacted by the location of the kernel in the application, or by the number of the specific replay pass. 

More information can be found at: `NVIDIA Nsight Compute Documentation <https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#clock-control>`_

CUDA MPS: Multiple MPI Ranks per GPU
------------------------------------

Per default, we expect that exactly one process uses a GPU (either with one rank per GPU, or one rank for all GPUs). Some applications can benefit from over-subscription, where multiple ranks share a GPU. The `CUDA Multi-Process Service (MPS) <https://docs.nvidia.com/deploy/mps/index.html>`_ enables hardware-accelerated sharing of a GPU by processes from the same user. 

To configure GPU-equipped nodes for CUDA MPS and load the necessary software infrastructure, we provide a Slurm flag to be attached to job submissions (with ``sbatch`` or ``salloc``): ``--cuda-mps``. No further changes are needed, MPS can be used directly and transparently from applications by launching more than one task/rank per GPU.

We do not provide support for `CUDA Multi-Instance GPU (MIG) <https://docs.nvidia.com/datacenter/tesla/mig-user-guide/>`_, which conceptionally extends the MPS feature with multi-tenant support, since we provide exclusive node access.

Job Script Examples
-------------------

.. ifconfig:: system_name == 'juwels'

   **Example 1:** MPI application starting 16 tasks on 4 nodes using 4 GPUs per node. The program must be able to coordinate the access to the four GPU devices on each of the four nodes::
   
           #!/bin/bash -x
           #SBATCH --account=<budget>
           #SBATCH --nodes=4
           #SBATCH --ntasks=16
           #SBATCH --ntasks-per-node=4
           #SBATCH --output=gpu-out.%j
           #SBATCH --error=gpu-err.%j
           #SBATCH --time=00:15:00
           #SBATCH --partition=<partition>
           #SBATCH --gres=gpu:4
   
           srun ./gpu-prog

   where ``partition``  is either ``booster`` or ``gpus``. Please note that the ``booster`` or ``gpus`` compute nodes feature a different number of CPUs.

   **Example 2:** Four independent instances (job steps) of a GPU program running on a |SYSTEM_NAME| Cluster node using one CPU thread and one GPU device each. The program is pinned to CPU core 0, 10, 20 and 30, respectively::
   
           #!/bin/bash -x
           #SBATCH --account=<budget>
           #SBATCH --nodes=1
           #SBATCH --output=gpu-out.%j
           #SBATCH --error=gpu-err.%j
           #SBATCH --time=00:20:00
           #SBATCH --partition=gpus
           #SBATCH --gres=gpu:4
   
           srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:0  ./gpu-prog &
           srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:10 ./gpu-prog &
           srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:20 ./gpu-prog &
           srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:30 ./gpu-prog &
   
           wait

   where ``partition``  is either ``booster``, ``gpus`` (JUWELS) or ``dc-cpu`` (JURECA).

   **Example 3:** Four independent instances (job steps) of a GPU program running on a |SYSTEM_NAME| Booster node using one CPU thread and one GPU device each. The program is pinned to CPU core 18, 6, 42 and 30, respectively::
   
           #!/bin/bash -x
           #SBATCH --account=<budget>
           #SBATCH --nodes=1
           #SBATCH --output=gpu-out.%j
           #SBATCH --error=gpu-err.%j
           #SBATCH --time=00:20:00
           #SBATCH --partition=booster
           #SBATCH --gres=gpu:4
   
           srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:18 ./gpu-prog &
           srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:6  ./gpu-prog &
           srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:42 ./gpu-prog &
           srun --exclusive -n 1 --gres=gpu:1 --cpu-bind=map_cpu:30 ./gpu-prog &
   
           wait

.. ifconfig:: system_name == 'jureca'

   **Example:** MPI application starting 16 tasks on 4 nodes using 128 CPUs per node and 4 GPUs per node. The program must be able to coordinate the access to the four GPU devices on each of the four nodes::
   
           #!/bin/bash -x
           #SBATCH --account=<budget>
           #SBATCH --nodes=4
           #SBATCH --ntasks=16
           #SBATCH --ntasks-per-node=4
           #SBATCH --cpus-per-task=32
           #SBATCH --output=gpu-out.%j
           #SBATCH --error=gpu-err.%j
           #SBATCH --time=00:15:00
           #SBATCH --partition=dc-gpu
           #SBATCH --gres=gpu:4
   
           srun ./gpu-prog

.. ifconfig:: system_name == 'jusuf'

   **Example:** MPI application starting 96 tasks on 4 nodes using 24 CPUs per node and 1 GPU per node::
   
           #!/bin/bash -x
           #SBATCH --account=<budget>
           #SBATCH --nodes=4
           #SBATCH --ntasks=96
           #SBATCH --ntasks-per-node=24
           #SBATCH --output=gpu-out.%j
           #SBATCH --error=gpu-err.%j
           #SBATCH --time=00:15:00
           #SBATCH --partition=gpus
   
           [#SBATCH --gres=gpu:1]
   
           srun ./gpu-prog