:orphan: .. include:: system.rst .. _processor_affinity: Processor Affinity ================== .. ifconfig:: system_name == 'jedi' Each |SYSTEM_NAME| compute node consists of four sockets, each equipped with an NVIDIA GH200 Grace-Hopper superchip (consisting of one Grace CPU (with 120 GB host-side memory) and one H100 GPU (with 96 GB device-side memory)). Each CPU has 72 cores, so a |SYSTEM_NAME| node consists of 288 CPU cores distributed across four NUMA domains. In addition, the device-side memory of each GPU is exposed as another NUMA domain (as it can also be accessed from the CPU). Hence, the node features eight relevant NUMA domains. In total, ``numactl`` exposes 36 NUMA domains, but the remaining 28 domains are not used (they are associated with the MIG feature of the GPUs, which we are not using). Here is an example output of ``numactl``: .. code-block:: bash $ numactl -H available: 36 nodes (0-35) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 0 size: 121694 MB node 0 free: 117290 MB node 1 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 node 1 size: 122663 MB node 1 free: 119955 MB node 2 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 node 2 size: 122663 MB node 2 free: 111686 MB node 3 cpus: 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 node 3 size: 122551 MB node 3 free: 120063 MB node 4 cpus: node 4 size: 97280 MB node 4 free: 97275 MB node 5 cpus: ... node 12 cpus: node 12 size: 97280 MB node 12 free: 97275 MB ... On the GPU side, an example output of ``nvidia-smi`` is: .. code-block:: bash $ nvidia-smi topo -m GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X NV6 NV6 NV6 NODE SYS SYS SYS 0-71 0 4 GPU1 NV6 X NV6 NV6 SYS NODE SYS SYS 72-143 1 12 GPU2 NV6 NV6 X NV6 SYS SYS NODE SYS 144-215 2 20 GPU3 NV6 NV6 NV6 X SYS SYS SYS NODE 216-287 3 28 NIC0 NODE SYS SYS SYS X SYS SYS SYS NIC1 SYS NODE SYS SYS SYS X SYS SYS NIC2 SYS SYS NODE SYS SYS SYS X SYS NIC3 SYS SYS SYS NODE SYS SYS SYS X .. ifconfig:: system_name == 'juwels' Each |SYSTEM_NAME| Cluster compute node consists of two sockets, each with one CPU. Each CPU has 24 physical and 48 logical cores, so that one |SYSTEM_NAME| Cluster compute node consists of 48 physical and 96 logical cores distributed across two NUMA domains. The |SYSTEM_NAME| Booster compute nodes also feature two CPUs with 48 physical cores per node but with a different design, giving eight separate NUMA domains. .. ifconfig:: system_name in ('jureca', 'jusuf') Each |SYSTEM_NAME| compute node consists of two sockets, each with one CPU. Each CPU has 64 physical and 128 logical cores, so that one |SYSTEM_NAME| compute node consists of 128 physical and 256 logical cores. Each CPU is split into four separate NUMA domains for a total of eight NUMA domains per node.
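If you want to inspect the node topology yourself, standard Linux tools report the NUMA layout. A quick sketch (to be run inside a job allocation on a compute node of the target system):

.. code-block:: bash

    # NUMA domains with their cores and memory sizes
    srun --nodes=1 --ntasks=1 numactl -H

    # Compact overview of sockets, cores, and NUMA nodes
    srun --nodes=1 --ntasks=1 lscpu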
Binding a process or thread to a specific core - known as pinning - can improve the performance of your code by limiting the likelihood of remote memory accesses. Once your code runs and produces correct results on a system, the next step is performance improvement. The placement of processes and/or threads can play a significant role in the performance of applications that use multiple cores or accelerator hardware. In general, the Linux scheduler will periodically (re-)distribute all running processes across all available hardware threads to ensure similar usage of the threads. This leads to processes being moved from one thread, core, or socket to another within the compute node. Note that the allocated memory of a process does not necessarily move at the same time (or at all), which can make access to memory much slower. To avoid a potential performance loss due to process migration, processes are usually pinned (or bound) to a logical core through the resource management system. In the case of |SYSTEM_NAME|, this is Slurm. A pinned process (consisting of one or more threads) is bound to a specific set of cores and will only run on the cores in this set. The set, which is defined by an affinity mask, can be a single core or multiple cores and implicitly includes the 1st and 2nd level caches associated with those cores. Since the majority of applications benefit from strict pinning that prevents migration, all tasks in a job step are pinned to a set of cores by default, unless this is explicitly disabled. Further information about the default behaviour can be found below. .. note:: SchedMD changed the pinning behavior with Slurm version 22.05 (currently installed version: 23.02). Whilst our customised default setting improves the performance of average applications over no process binding at all, **specialised settings for your application can yield even better performance**. Pay attention to maximizing data locality while minimizing latency and resource contention, and have a clear understanding of the characteristics of your own code and the machine that the code is running on. .. ifconfig:: system_name in ('jureca') **JURECA-DC NUMA domain default:** The |SYSTEM_NAME|-DC CPU and GPU Slurm partitions are configured with 16 CPU cores in each of the 4 NUMA domains per socket (NPS-4), resulting in 8 NUMA domains (0-7) for the 2-socket systems. Not all sockets have a direct connection (affinity) to each GPU or HCA. To keep the NUMA-to-core assignment consistent across the entire |SYSTEM_NAME|-DC system, the GPU partition configuration is also applied to the CPU partition: .. table:: Assignment of NUMA Domain ID to Core IDs (and GPU ID for GPU partition only) +----------------+--------+------------------+ | NUMA Domain ID | GPU ID | Core IDs | +================+========+==================+ | 3 | 0 | 48-63, 176-191 | +----------------+--------+------------------+ | 1 | 1 | 16-31, 144-159 | +----------------+--------+------------------+ | 7 | 2 | 112-127, 240-255 | +----------------+--------+------------------+ | 5 | 3 | 80-95, 208-223 | +----------------+--------+------------------+ | 2 | | 32-47, 160-175 | +----------------+--------+------------------+ | 0 | | 0-15, 128-143 | +----------------+--------+------------------+ | 6 | | 96-111, 224-239 | +----------------+--------+------------------+ | 4 | | 64-79, 192-207 | +----------------+--------+------------------+ More details can be found :ref:`here `.
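Whichever pinning options you end up using, it is worth verifying the resulting binding. A small sketch (the ``./app`` executable is a placeholder; ``taskset`` is part of the standard util-linux tools):

.. code-block:: bash

    # Let Slurm report the affinity mask selected for every task
    srun --cpu-bind=verbose,threads ./app

    # Or print each task's affinity from inside the job step
    srun bash -c 'echo "task ${SLURM_PROCID}: $(taskset -cp $$)"'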
.. ifconfig:: system_name == 'jusuf' **JUSUF NUMA domain default:** The |SYSTEM_NAME| CPU and GPU Slurm partitions have the following configuration: 16 CPU cores in each of the 4 NUMA domains per socket (NPS-4), resulting in 8 NUMA domains (0-7) for the 2-socket systems. Not all sockets have a direct connection (affinity) to each GPU or HCA. To keep the NUMA-to-core configuration the same across the entire |SYSTEM_NAME| system, the GPU partition configuration is also applied to the CPU partition: .. table:: Assignment of NUMA Domain ID to Core IDs (and GPU ID for GPU partition only) +----------------+--------+------------------+ | NUMA Domain ID | GPU ID | Core IDs | +================+========+==================+ | 3 | 0 | 48-63, 176-191 | +----------------+--------+------------------+ | 1 | | 16-31, 144-159 | +----------------+--------+------------------+ | 7 | | 112-127, 240-255 | +----------------+--------+------------------+ | 5 | | 80-95, 208-223 | +----------------+--------+------------------+ | 2 | | 32-47, 160-175 | +----------------+--------+------------------+ | 0 | | 0-15, 128-143 | +----------------+--------+------------------+ | 6 | | 96-111, 224-239 | +----------------+--------+------------------+ | 4 | | 64-79, 192-207 | +----------------+--------+------------------+ More details can be found :ref:`here `. Slurm options ------------- .. ifconfig:: system_name == 'jedi' .. warning:: The remainder of this document does not apply to JEDI at this point in time, since it does not use ``psslurm``. Slurm allows users to modify the process affinity by means of the ``--cpu-bind``, ``--distribution`` and ``--hint`` options to ``srun``. While the available options to ``srun`` are standard across all Slurm installations, the implementation of process affinity is done in plugins and thus may differ between installations. On |SYSTEM_NAME| a custom pinning implementation is provided by Partec (*psslurm*). In contrast to other options, the processor affinity options need to be passed directly to ``srun`` and must not be given to ``sbatch`` or ``salloc``. In particular, these options cannot be specified in the header of a batch script. .. warning:: ``srun`` will no longer read in ``SLURM_CPUS_PER_TASK`` and will not inherit the option ``--cpus-per-task`` from ``sbatch``! This means you will have to explicitly specify ``--cpus-per-task`` in your ``srun`` calls, or set the new ``SRUN_CPUS_PER_TASK`` environment variable. If you want to keep using ``--cpus-per-task`` with ``sbatch``, then you will have to add: ``export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}``. .. warning:: Setting the option ``--cpus-per-task`` implies the option ``--exact``, which means that each step with ``--cpus-per-task`` will now only receive the minimum number of cores requested for that job step. The pinning will change (which has performance implications), and threads of different tasks may end up sharing the same core (using SMT). **Attention:** As a result, explicitly setting ``--cpus-per-task=1`` may result in a different affinity mask than using the implicit default, which is also 1. .. note:: As we expect that most of our users will neither want to use nor benefit from SMT, we have disabled SMT by default by setting ``--threads-per-core=1``. To use SMT, the ``--threads-per-core=2`` option must be set for ``sbatch`` or ``salloc``. Just setting it as an ``srun`` option is not enough. **Attention:** In our tests we have seen that enabling SMT can lead to suboptimal, non-intuitive affinity masks.
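Putting the warnings above (and the OpenMP note further below) together, a minimal hybrid job script could look like the following sketch (the executable name ``./hybrid_app`` and the chosen task and thread counts are placeholders, not recommendations):

.. code-block:: bash

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=12

    # Forward --cpus-per-task from sbatch to srun (see the warning above)
    export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

    # Keep the OpenMP threads within the affinity mask provided by Slurm
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
    export OMP_PROC_BIND=true

    srun ./hybrid_app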
.. warning:: We recommend **not** using ``--cpu-bind=sockets`` if you use more tasks than sockets; otherwise tasks will share the same hardware threads. If ``--cpus-per-task`` is to be used together with ``--cpu-bind=sockets``, then you usually want to override the implicit ``--exact`` by specifying ``--overcommit`` so that a task is allocated the full socket. .. warning:: Setting ``--hint`` can lead to unexpected pinning as it is mutually exclusive with the following options: ``--ntasks-per-core``, ``--threads-per-core``, ``-B`` and ``--cpu-bind`` (other than ``--cpu-bind=verbose``). We recommend **not** using the ``--hint`` option. .. note:: For hybrid and pure OpenMP applications, it is important to specify the correct value for ``--cpus-per-task`` to ensure a proper affinity mask and to set the ``OMP_NUM_THREADS`` environment variable accordingly. However, the individual threads of each MPI rank can still be moved between the logical threads matching the affinity mask of that rank. ``OMP_PROC_BIND=true`` can be used to prevent thread movement. For more advanced, OpenMP-internal affinity specifications, consult the documentation for ``OMP_PLACES`` or vendor-specific alternatives (``KMP_AFFINITY``/``GOMP_CPU_AFFINITY``). Terminology ^^^^^^^^^^^ **thread** One CPU thread. **task** Part of a job consisting of a number of requested CPU threads (specified by ``-c, --cpus-per-task``). **core** One physical CPU core can run multiple CPU threads. The CPU threads sitting on the same physical core share caches. **socket** Consists of a number of CPU threads, corresponding to the NUMA domains detailed above. ``--cpu-bind`` ^^^^^^^^^^^^^^ .. code-block:: none --cpu-bind=[{quiet,verbose},none|rank|map_cpu:<list>|mask_cpu:<list>|rank_ldom|map_ldom:<list>|mask_ldom:<list>|sockets|cores|threads|ldoms|boards] Implicit types ~~~~~~~~~~~~~~ .. table:: :align: left +----------------+----------------------------------------------------------------------------------------+ | ``none`` | Do not bind tasks to CPUs | +----------------+----------------------------------------------------------------------------------------+ | ``threads`` | | Each task is pinned to as many threads as it requests. Which threads each process | | | | gets is controlled by the ``--distribution`` option. (**Default**) | +----------------+----------------------------------------------------------------------------------------+ | ``rank`` | | Each task is pinned to as many threads as it requests, just filling cores | | | | consecutively. Spread the threads and tasks to as many cores as possible. | | | | This type is not influenced by the second and third part of the ``--distribution`` | | | | option. (old default until 12th May 2020) | +----------------+----------------------------------------------------------------------------------------+ | ``rank_ldom`` | | Each task is pinned to as many threads as it requests, just filling the nodes | | | | rank by rank cycling sockets and cores. This type is not influenced by the second | | | | and third level of the ``--distribution`` option. The threads of a task are always | | | | packed to as few cores as possible. This is the same as | | | | ``--cpu-bind=threads --distribution=*:cyclic:block`` | +----------------+----------------------------------------------------------------------------------------+ | ``sockets`` | | In a first step the requested CPU threads of a task are assigned in exactly the | | | | same way as with ``--cpu-bind=threads``. But the final affinity mask for the task | | | | is the whole socket on which any of its assigned threads is located. This means | | | | if a task is assigned to any thread that is part of a socket, it will be bound to | | | | the whole socket. ('Whole' here means each thread of the socket that is | | | | allocated to the job) | +----------------+----------------------------------------------------------------------------------------+ | ``cores`` | | In a first step the requested CPU threads of a task are assigned in exactly the | | | | same way as with ``--cpu-bind=threads``. But the final affinity mask for the task | | | | is the whole core on which any of its assigned threads is located. This means | | | | if a task is assigned to any thread that is part of a core, it will be bound to | | | | the whole core. ('Whole' here means each thread of the core that is | | | | allocated to the job) | +----------------+----------------------------------------------------------------------------------------+ | ``ldoms`` | This is the same as ``--cpu-bind=sockets`` | +----------------+----------------------------------------------------------------------------------------+ | ``boards`` | | Currently not supported on systems with more than one board per node. | | | | |SYSTEM_NAME| has only one board: same behavior as ``none`` | +----------------+----------------------------------------------------------------------------------------+ Explicit types ~~~~~~~~~~~~~~ .. table:: :align: left +----------------------+-----------------------------------------------------------------------------------------+ | ``map_cpu:<list>`` | Explicit passing of maps or masks to pin the tasks to threads in a round-robin fashion. | +----------------------+ + | ``mask_cpu:<list>`` | | +----------------------+-----------------------------------------------------------------------------------------+ | ``map_ldom:<list>`` | Explicit passing of maps or masks to pin the tasks to sockets in a round-robin fashion. | +----------------------+ + | ``mask_ldom:<list>`` | | +----------------------+-----------------------------------------------------------------------------------------+ .. note:: Explicitly specified masks or bindings are only honored when the job step has allocated every available CPU on the node. If you want to use a ``map_`` or ``mask_`` bind, then you should have the steps request a whole allocation (do not use ``--exact`` or ``--cpus-per-task`` or ``--exclusive``). You may also want to use ``--overlap`` so that other steps can also allocate all of the CPUs, while you keep control over the task-to-CPU binding via one of the map or mask options of ``--cpu-bind``. ``--distribution`` ^^^^^^^^^^^^^^^^^^ The string passed to ``--distribution/-m`` can have up to four parts separated by colon and comma: * The first part controls the distribution of the tasks over the nodes. * The second part controls the distribution of tasks over sockets inside one node. * The third part controls the distribution of tasks over cores inside one node. * The fourth part provides additional information concerning the distribution of tasks over nodes. .. code-block:: none --distribution/-m=<node_level>[:<socket_level>[:<core_level>[,Pack|NoPack]]] .. --distribution/-m=*|block|cyclic|arbitrary|plane=<size>[:*|block|cyclic|fcyclic[:*|block|cyclic|fcyclic]][,Pack|NoPack] First part (``node_level``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ ..
table:: :align: left +---------------------+------------------------------------------------------------------------------------+ | ``*`` | The default is ``block`` | +---------------------+------------------------------------------------------------------------------------+ | ``block`` | Distribute tasks to a node such that consecutive tasks share a node | +---------------------+------------------------------------------------------------------------------------+ | ``cyclic`` | | Distribute tasks to a node such that consecutive tasks are distributed over | | | | consecutive nodes (in a round-robin fashion) | +---------------------+------------------------------------------------------------------------------------+ | ``arbitrary`` | see https://slurm.schedmd.com/srun.html | +---------------------+------------------------------------------------------------------------------------+ | ``plane=`` | see https://slurm.schedmd.com/dist_plane.html | +---------------------+------------------------------------------------------------------------------------+ .. _second_part: Second part (``socket_level``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. table:: :align: left +------------------+------------------------------------------------------------------------------------+ | ``*`` | The default is ``cyclic`` | +------------------+------------------------------------------------------------------------------------+ | ``block`` | Each socket is first filled with tasks before the next socket will be used. | +------------------+------------------------------------------------------------------------------------+ | ``cyclic`` | Each task will be assigned to the next socket(s) in a round-robin fashion. | +------------------+------------------------------------------------------------------------------------+ | ``fcyclic`` | | Each thread inside a task will be assigned to the next socket in a round-robin | | | | fashion, spreading the task itself as much as possible over all sockets. | | | | ``fcyclic`` implies ``cyclic``. | +------------------+------------------------------------------------------------------------------------+ Third part (``core_level``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. table:: :align: left +------------------+------------------------------------------------------------------------------------+ | ``*`` | The default is inherited from the :ref:`second part ` | +------------------+------------------------------------------------------------------------------------+ | ``block`` | Each core is first filled with tasks before the next core will be used. | +------------------+------------------------------------------------------------------------------------+ | ``cyclic`` | | Each task will be assigned to the next core(s) in a round-robin fashion. | | | | The threads of a task will fill the cores. | +------------------+------------------------------------------------------------------------------------+ | ``fcyclic`` | | Each thread inside a task will be assigned to the next core in a round-robin | | | | fashion, spreading the task itself as much as possible over all cores. | +------------------+------------------------------------------------------------------------------------+ Fourth part ~~~~~~~~~~~ Optional control for task distribution over nodes. .. table:: :align: left +------------+------------------------------------------------------------------------------------+ | ``Pack`` | Default is NoPack. 
See: https://slurm.schedmd.com/srun.html | +------------+ + | ``NoPack`` | | +------------+------------------------------------------------------------------------------------+ ``--hint`` ^^^^^^^^^^ **We do not recommend using this option, as our tests have shown that it can lead to unexpected pinning.** Possible values are ``nomultithread``, ``compute_bound``, and ``memory_bound`` (They imply other options). .. code-block:: none --hint=nomultithread Affinity visualization tool --------------------------- We have tried to understand and implement the Slurm affinity rules. The result is our `PinningWebtool `_, which allows you to test and visualise different Slurm affinity setups yourself. A description of the displayed scheme can be found in the section :ref:`below `. .. _affinity_examples: Affinity examples ----------------- .. ifconfig:: system_name in ('jureca', 'jusuf') Visualization of the processor affinity in the following examples is done by the tool ``jscgetaffinity`` (a wrapper for ``psslurmgetbind``) which is also available on the login nodes of |SYSTEM_NAME|. The scheme shown represents a node of |SYSTEM_NAME| which has two sockets divided by the space in the middle, and each socket is in turn divided into 4 NUMA domains. Each column corresponds to one core; the first row shows the first (physical) thread of the corresponding core and the second row the SMT (logical thread) of the core. The number (``X``) followed by a ``:`` in the line above the described scheme represents the MPI task number for which the affinity mask is shown in the scheme. The number ``1`` in the scheme itself indicates that the task with its threads is scheduled on the corresponding hardware thread of the node. .. ifconfig:: system_name in ('juwels') Visualization of the processor affinity in the following examples is done by the tool ``jscgetaffinity`` (a wrapper for ``psslurmgetbind``) which is also available on the login nodes of |SYSTEM_NAME|. The scheme shown represents a node of |SYSTEM_NAME| which has two sockets divided by the space in the middle. Each column corresponds to one core; the first row shows the first (physical) thread of the corresponding core and the second row the SMT (logical thread) of the core. The number (``X``) followed by a ``:`` in the line above the described scheme represents the MPI task number for which the affinity mask is shown in the scheme. The number ``1`` in the scheme itself indicates that the task with its threads is scheduled on the corresponding hardware thread of the node. **Example:** One MPI task with two threads: .. ifconfig:: system_name == 'jedi' This JEDI information is still under construction .. .. figure:: ../shared/images/jscgetaffinity_jedi.jpg :name: jscgetaffinity_jedi :align: center .. ifconfig:: system_name == 'juwels' .. figure:: ../shared/images/2024-07-Processor-Affinity_JUWELS-Cluster.png :name: jscgetaffinity_juwels :align: center .. ifconfig:: system_name == 'jureca' For the purpose of presentation, ``0..0`` indicates that not all hardware threads of a NUMA domain on a |SYSTEM_NAME| node are shown (on a |SYSTEM_NAME| there are 16 physical and 16 SMT threads per NUMA domain). .. figure:: ../shared/images/2024-07-Processor-Affinity_JURECA.png :name: jscgetaffinity_jureca :align: center .. ifconfig:: system_name == 'jusuf' For the purpose of presentation, ``0..0`` indicates that not all hardware threads of a NUMA domain on a |SYSTEM_NAME| node are shown (on a |SYSTEM_NAME| there are 16 physical and 16 SMT threads per NUMA domain). .. 
figure:: ../shared/images/2024-07-Processor-Affinity_JUSUF.png :name: jscgetaffinity_jusuf :align: center .. only:: builder_html Further examples with a colored representation can be found here :download:`Talk: Pinning with psslurm <../shared/docs/Pinning_with_psslurm.pdf>`. .. _default_affinity: Default processor affinity ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. ifconfig:: system_name == 'juwels' The default processor affinity was changed on 8th August 2024 to the following setting: .. ifconfig:: system_name == 'jureca' The default processor affinity was changed on 6th August 2024 to the following setting: .. ifconfig:: system_name == 'jusuf' The default processor affinity was changed on 15th July 2024 to the following setting: .. code-block:: none --cpu-bind=threads --distribution=block:cyclic:cyclic --threads-per-core=1 The behavior of this combination is shown in the following examples for |SYSTEM_NAME|. .. ifconfig:: system_name == 'juwels' **Example 1:** Pure MPI application filling only the first thread of a core on a CPU node in alternating socket placement: .. code-block:: none srun --nodes=1 --ntasks=48 --cpus-per-task=1 .. code-block:: none $ jscgetaffinity -p juwels -H : -n 48 -c 1 0: 100000000000000000000000 000000000000000000000000 000000000000000000000000 000000000000000000000000 1: 000000000000000000000000 100000000000000000000000 000000000000000000000000 000000000000000000000000 2: 010000000000000000000000 000000000000000000000000 000000000000000000000000 000000000000000000000000 3: 000000000000000000000000 010000000000000000000000 000000000000000000000000 000000000000000000000000 ... 46: 000000000000000000000001 000000000000000000000000 000000000000000000000000 000000000000000000000000 47: 000000000000000000000000 000000000000000000000001 000000000000000000000000 000000000000000000000000 **Example 2:** Hybrid application (MPI + OpenMP) with 4 tasks per node in alternating socket placement and 12 threads per task on a CPU node: *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=12`` .. code-block:: none srun --nodes=1 --ntasks=4 --cpus-per-task=12 .. code-block:: none $ jscgetaffinity -p juwels -H : -n 4 -c 12 0: 111111111111000000000000 000000000000000000000000 000000000000000000000000 000000000000000000000000 1: 000000000000000000000000 111111111111000000000000 000000000000000000000000 000000000000000000000000 2: 000000000000111111111111 000000000000000000000000 000000000000000000000000 000000000000000000000000 3: 000000000000000000000000 000000000000111111111111 000000000000000000000000 000000000000000000000000 **Example 3:** Pure OpenMP application with 48 threads on a CPU node: *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=48`` .. code-block:: none srun --nodes=1 --ntasks=1 --cpus-per-task=48 .. code-block:: none $ jscgetaffinity -p juwels -H : -n 1 -c 48 0: 111111111111111111111111 111111111111111111111111 000000000000000000000000 000000000000000000000000 ..
ifconfig:: system_name == 'jusuf' Since JURECA-DC and JUSUF have the same CPU architecture, the examples from JURECA have been used here to simplify the documentation. When calling ``jscgetaffinity``, the ``jurecadc`` profile can also be replaced by ``jusuf``. .. ifconfig:: system_name in ('jureca', 'jusuf') For the purpose of presentation, ``0..0`` indicates that not all cores of a NUMA domain on a |SYSTEM_NAME| node are shown. **Example 1:** Pure MPI application filling only the first thread of a core on a node in alternating socket placement: .. code-block:: none srun --nodes=1 --ntasks=128 --cpus-per-task=1 .. code-block:: none $ jscgetaffinity -p jurecadc -H : -n 128 -c 1 0: 0..0 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 1: 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 2: 0..0 0..0 0..0 0..0 0..0 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 3: 0..0 0..0 0..0 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 4: 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 5: 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 6: 0..0 0..0 0..0 0..0 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 7: 0..0 0..0 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 ... 126: 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000001 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 127: 0..0 0..0 0..0 0..0 0000000000000001 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 **Example 2:** Hybrid application (MPI + OpenMP) with 16 tasks per node in round-robin socket placement and 8 threads per task on a node: *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=8`` .. code-block:: none srun --nodes=1 --ntasks=16 --cpus-per-task=8 .. code-block:: none $ jscgetaffinity -p jurecadc -H : -n 16 -c 8 0: 0..0 0..0 0..0 1111111100000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 1: 0..0 1111111100000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 ... 7: 0..0 0..0 0..0 0..0 1111111100000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 8: 0..0 0..0 0..0 0000000011111111 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 ... 14: 0..0 0..0 0..0 0..0 0..0 0..0 0000000011111111 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 15: 0..0 0..0 0..0 0..0 0000000011111111 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 **Example 3:** Pure OpenMP application with 128 threads on a node: *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=128`` .. code-block:: none srun --nodes=1 --ntasks=1 --cpus-per-task=128 .. 
code-block:: none $ jscgetaffinity -p jurecadc -H : -n 1 -c 128 0: 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 1111111111111111 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 Further examples ^^^^^^^^^^^^^^^^ .. ifconfig:: system_name == 'juwels' **Example 1:** Pure MPI application filling all threads on a CPU node (including SMT): *Hint:* Don't forget to add ``--threads-per-core=2`` in your ``sbatch`` or ``salloc``: In your job script: ``#SBATCH --threads-per-core=2`` .. code-block:: none #SBATCH --threads-per-core=2 srun --nodes=1 --ntasks=96 --cpus-per-task=1 .. code-block:: none $ jscgetaffinity -p juwels -H : -n 96 -c 1 --threads-per-core=2 0: 100000000000000000000000 000000000000000000000000 000000000000000000000000 000000000000000000000000 1: 000000000000000000000000 100000000000000000000000 000000000000000000000000 000000000000000000000000 ... 46: 000000000000000000000001 000000000000000000000000 000000000000000000000000 000000000000000000000000 47: 000000000000000000000000 000000000000000000000001 000000000000000000000000 000000000000000000000000 48: 000000000000000000000000 000000000000000000000000 100000000000000000000000 000000000000000000000000 49: 000000000000000000000000 000000000000000000000000 000000000000000000000000 100000000000000000000000 ... 94: 000000000000000000000000 000000000000000000000000 000000000000000000000001 000000000000000000000000 95: 000000000000000000000000 000000000000000000000000 000000000000000000000000 000000000000000000000001 **Example 2:** Hybrid application (MPI + OpenMP) with 4 tasks per node in consecutive order and 12 threads per task on a CPU node: *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=12`` .. code-block:: none srun --nodes=1 --ntasks=4 --cpus-per-task=12 --cpu-bind=rank .. code-block:: none $ jscgetaffinity -p juwels -H : -n 4 -c 12 --cpu-bind=rank 0: 111111111111000000000000 000000000000000000000000 000000000000000000000000 000000000000000000000000 1: 000000000000111111111111 000000000000000000000000 000000000000000000000000 000000000000000000000000 2: 000000000000000000000000 111111111111000000000000 000000000000000000000000 000000000000000000000000 3: 000000000000000000000000 000000000000111111111111 000000000000000000000000 000000000000000000000000 **Example 3:** Hybrid application (MPI + OpenMP) with 4 tasks per node and 16 threads per task. If you want to use more threads than physical cores (tasks-per-node * cpus-per-task > 48 per CPU node) on |SYSTEM_NAME|, you have to add ``--threads-per-core=2``: *Hint:* Don't forget to add ``--threads-per-core=2`` also for ``sbatch`` or ``salloc``: In your job script: ``#SBATCH --threads-per-core=2`` *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=16`` .. code-block:: none #SBATCH --threads-per-core=2 srun --nodes=1 --ntasks=4 --cpus-per-task=16 ..
code-block:: none $ jscgetaffinity -p juwels -H : -n 4 -c 16 --threads-per-core=2 0: 111111110000000000000000 000000000000000000000000 111111110000000000000000 000000000000000000000000 1: 000000000000000000000000 111111110000000000000000 000000000000000000000000 111111110000000000000000 2: 000000001111111100000000 000000000000000000000000 000000001111111100000000 000000000000000000000000 3: 000000000000000000000000 000000001111111100000000 000000000000000000000000 000000001111111100000000 **Example 4:** Pure OpenMP application with 48 threads on a single socket of a CPU node: *Hint:* Don't forget to add ``--threads-per-core=2`` also for ``sbatch`` or ``salloc``: In your job script: ``#SBATCH --threads-per-core=2`` *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=48`` .. code-block:: none #SBATCH --threads-per-core=2 srun --nodes=1 --ntasks=1 --cpus-per-task=48 .. code-block:: none $ jscgetaffinity -p juwels -H : -n 1 -c 48 --threads-per-core=2 0: 111111111111111111111111 000000000000000000000000 111111111111111111111111 000000000000000000000000 .. ifconfig:: system_name in ('jureca', 'jusuf') **Example 1:** Pure MPI application filling all threads on a node (including SMT): *Hint:* Don't forget to add ``--threads-per-core=2`` also for ``sbatch`` or ``salloc``: In your job script: ``#SBATCH --threads-per-core=2`` .. code-block:: none #SBATCH --threads-per-core=2 srun --nodes=1 --ntasks=256 --cpus-per-task=1 .. code-block:: none $ jscgetaffinity -p jurecadc -H : -n 256 -c 1 0: 0..0 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 1: 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 ... 126: 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000001 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 127: 0..0 0..0 0..0 0..0 0000000000000001 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 128: 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 129: 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 1000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 ... 254: 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000001 0..0 255: 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000001 0..0 0..0 0..0 **Example 2:** Hybrid application (MPI + OpenMP) with 16 tasks per node in consecutive order according to the reordering of the NUMA domains and with 8 threads per task on a single node: *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=8`` .. code-block:: none srun --nodes=1 --ntasks=16 --cpus-per-task=8 --cpu-bind=rank ..
code-block:: none $ jscgetaffinity -p jurecadc -H : -n 16 -c 8 --cpu-bind=rank 0: 0..0 0..0 0..0 1111111100000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 1: 0..0 0..0 0..0 0000000011111111 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 2: 0..0 1111111100000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 0..0 0..0 0..0 ... 13: 0..0 0..0 0..0 0..0 0..0 0..0 0000000011111111 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 14: 0..0 0..0 0..0 0..0 1111111100000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 15: 0..0 0..0 0..0 0..0 0000000011111111 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000000000000000 0..0 0..0 0..0 **Example 3:** Hybrid application (MPI + OpenMP) with 16 tasks per node and 12 threads per task. If you want to use more threads than physical cores (tasks-per-node * cpus-per-task > 128 per node) on |SYSTEM_NAME|, you have to add ``--threads-per-core=2``: *Hint:* Don't forget to add ``--threads-per-core=2`` also for ``sbatch`` or ``salloc``: In your job script: ``#SBATCH --threads-per-core=2`` *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=12`` .. code-block:: none #SBATCH --threads-per-core=2 srun --nodes=1 --ntasks=16 --cpus-per-task=12 .. code-block:: none $ jscgetaffinity -p jurecadc -H : -n 16 -c 12 0: 0..0 0..0 0..0 1111110000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 1111110000000000 0..0 0..0 0..0 0..0 1: 0..0 1111110000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 1111110000000000 0..0 0..0 0..0 0..0 0..0 0..0 ... 7: 0..0 0..0 0..0 0..0 1111110000000000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 1111110000000000 0..0 0..0 0..0 8: 0..0 0..0 0..0 0000001111110000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000001111110000 0..0 0..0 0..0 0..0 ... 14: 0..0 0..0 0..0 0..0 0..0 0..0 0000001111110000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000001111110000 0..0 15: 0..0 0..0 0..0 0..0 0000001111110000 0..0 0..0 0..0 0..0 0..0 0..0 0..0 0000001111110000 0..0 0..0 0..0 **Example 4:** Pure OpenMP application with 128 threads and SMT enabled: *Hint:* Don't forget to add ``--threads-per-core=2`` also for ``sbatch`` or ``salloc``: In your job script: ``#SBATCH --threads-per-core=2`` *Hint:* As stated in the note above, it is your responsibility to take care of the thread binding within the mask provided by Slurm to prevent the threads from moving. As a good starting point, you could add the following extra line to your job script: ``export OMP_PLACES=threads OMP_PROC_BIND=close OMP_NUM_THREADS=128`` .. code-block:: none #SBATCH --threads-per-core=2 srun --nodes=1 --ntasks=1 --cpus-per-task=128 .. code-block:: none $ jscgetaffinity -p jurecadc -H : -n 1 -c 128 --threads-per-core=2 0: 0..0 1111111111111111 0..0 1111111111111111 0..0 1111111111111111 0..0 1111111111111111 0..0 1111111111111111 0..0 1111111111111111 0..0 1111111111111111 0..0 1111111111111111 **Examples for manual pinning** For advanced use cases it can be desirable to manually specify the binding masks or core sets for each task. This is possible using the options ``--cpu-bind=map_cpu`` and ``--cpu-bind=mask_cpu``. For example, .. code-block:: none srun -n 2 --cpu-bind=map_cpu:1,5 spawns two tasks pinned to cores 1 and 5, respectively. The command ..
code-block:: none srun -n 2 --cpu-bind=mask_cpu:0x3,0xC spawns two tasks pinned to cores 0 and 1 (``0x3 = 3 = 2^0 + 2^1``) and cores 2 and 3 (``0xC = 12 = 2^2 + 2^3``), respectively.
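For larger or non-contiguous core sets, the hexadecimal masks can be generated instead of written by hand. The following is only a sketch in plain bash (the core ranges and the ``./app`` placeholder are illustrative assumptions, not recommendations; note that bash integer arithmetic is limited to 63 bits, so for higher core IDs a tool with arbitrary-precision integers is needed):

.. code-block:: bash

    # Build a hexadecimal CPU mask from a list of core IDs
    cores_to_mask() {
        local mask=0 core
        for core in "$@"; do
            mask=$(( mask | (1 << core) ))
        done
        printf '0x%x' "$mask"
    }

    mask0=$(cores_to_mask {0..7})     # cores 0-7  -> 0xff
    mask1=$(cores_to_mask {8..15})    # cores 8-15 -> 0xff00
    srun -n 2 --cpu-bind=mask_cpu:${mask0},${mask1} ./app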