.. include:: system.rst

.. _processor_affinity:

Processor Affinity
==================

.. ifconfig:: system_name == 'juwels'

   Each |SYSTEM_NAME| Cluster compute node consists of two sockets, each with one CPU.
   Each CPU has 24 physical and 48 logical cores, so that one |SYSTEM_NAME| Cluster compute node consists of 48 physical and 96 logical cores distributed over two NUMA domains.
   The |SYSTEM_NAME| Booster compute nodes also feature two CPUs with 48 physical cores per node, but eight separate NUMA domains.

.. ifconfig:: system_name in ('jureca', 'jusuf')

   Each |SYSTEM_NAME| compute node consists of two sockets, each with one CPU.
   Each CPU has 64 physical and 128 logical cores, so that one |SYSTEM_NAME| compute node consists of 128 physical and 256 logical cores.
   Each CPU is split into four separate NUMA domains for a total of eight NUMA domains per node.

.. warning::

   This documentation page is currently out of date due to a Slurm upgrade to version 22.05, which can have a major impact on the performance of your application. We are working on an update.

   - Unless running on SMT is explicitly requested, we strongly recommend using the ``--hint=nomultithread`` option (which implies ``--threads-per-core=1``). If you want to use SMT, we recommend setting ``--distribution=block:cyclic:cyclic``.
   - ``srun`` no longer reads ``SLURM_CPUS_PER_TASK`` and does not inherit the option ``--cpus-per-task`` from ``sbatch``! You therefore have to pass ``--cpus-per-task`` explicitly to your ``srun`` calls or set the new environment variable ``SRUN_CPUS_PER_TASK``. If you want to keep using ``--cpus-per-task`` with ``sbatch``, you have to add: ``export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}``.
   - In 22.05 the option ``--cpus-per-task`` implies ``--exact``, which means that each step with ``--cpus-per-task`` now only gets the minimum number of cores. The pinning changes (with implications for performance) and the tasks fill the hardware threads of the same cores. If you do not use SMT and want to keep the previous behaviour, where your threads run only on physical cores, add ``--threads-per-core=1`` to ``srun``.
   - If you have any questions, please contact SC support [sc@fz-juelich.de].

Pinning, the binding of a process or thread to a specific core, can improve the performance of your code by increasing the percentage of local memory accesses.

Once your code runs and produces correct results on a system, the next step is performance improvement.
For a code that uses multiple cores, the placement of processes and/or threads can play a significant role in performance.

In general, the Linux scheduler periodically (re-)distributes all running processes across all available threads to ensure similar usage of the threads.
As a result, processes are moved from one thread, core, or socket to another within the compute node.
Note that the allocated memory of a process does not necessarily move at the same time (or at all), which can make memory access much slower.
To avoid such a potential performance loss caused by process migration, processes are usually pinned (or bound) to a logical core through the resource management system, which is Slurm in the case of |SYSTEM_NAME|.
A pinned process (consisting of one or more threads) is bound to a specific set of cores and will only run on the cores in this set.
The set can consist of one or multiple logical cores, implicitly includes the 1st and 2nd level caches, and is defined with an affinity mask.
Since the majority of applications benefit from strict pinning that prevents migration, all tasks in a job step are pinned to a set of cores by default unless this is explicitly disabled.
Further information about the default behaviour can be found below.

.. note::

   Even though the default setting improves the performance of average applications over no process binding at all, specialised settings for your application can yield even better performance.
   Pay attention to maximizing data locality while minimizing latency and resource contention, and have a clear understanding of the characteristics of your own code and the machine that the code is running on.

.. ifconfig:: system_name in ('jureca')

   **Current JURECA-DC default:**

   The |SYSTEM_NAME|-DC CPU and GPU Slurm partitions are configured with 16 CPU cores in each of the 4 NUMA domains per socket (NPS-4), resulting in 8 NUMA domains (0-7) on the 2-socket systems.
   Not all sockets have a direct connection (affinity) to each GPU or HCA.
   To keep the NUMA-to-core assignment identical across the entire |SYSTEM_NAME|-DC system, the configuration of the GPU partition is also applied to the CPU partition:

   .. table:: GPU partition

      +----------------+--------+------------------+
      | NUMA Domain ID | GPU ID | Core IDs         |
      +================+========+==================+
      | 3              | 0      | 48-63, 176-191   |
      +----------------+--------+------------------+
      | 1              | 1      | 16-31, 144-159   |
      +----------------+--------+------------------+
      | 7              | 2      | 112-127, 240-255 |
      +----------------+--------+------------------+
      | 5              | 3      | 80-95, 208-223   |
      +----------------+--------+------------------+

   Slurm example:

   .. code-block:: none

      srun -p dc-gpu -N 1 -n 4 --gpus=4 --cpu-bind=socket

   This will set the affinity of each process to make it use the cores closest to the GPU with ID matching the process's rank.

   .. table:: CPU partition

      +----------------+--------+------------------+
      | NUMA Domain ID | GPU ID | Core IDs         |
      +================+========+==================+
      | 3              | 0      | 48-63, 176-191   |
      +----------------+--------+------------------+
      | 1              | 1      | 16-31, 144-159   |
      +----------------+--------+------------------+
      | 7              | 2      | 112-127, 240-255 |
      +----------------+--------+------------------+
      | 5              | 3      | 80-95, 208-223   |
      +----------------+--------+------------------+

   More details can be found :ref:`here `.

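
The recommendations from the warning at the top of this page can be combined in a job script.
The following is a minimal sketch, assuming a hybrid MPI + OpenMP application; the executable ``./my_app`` and the task and thread counts are placeholders chosen for illustration and need to be adapted to your application and system.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=12

   # srun no longer inherits --cpus-per-task from sbatch (Slurm 22.05),
   # so forward the value explicitly.
   export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

   # One OpenMP thread per CPU requested for each task.
   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

   # Affinity-related options are passed to srun, not to sbatch.
   # ./my_app is a hypothetical executable.
   srun --hint=nomultithread ./my_app

The sketch keeps all affinity-related options on the ``srun`` line, so that the ``#SBATCH`` header only describes the resources of the job.
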
Slurm options
-------------

Slurm allows users to modify the process affinity by means of the ``--cpu-bind``, ``--distribution`` and ``--hint`` options to ``srun``.
While the available options to ``srun`` are standard across all Slurm installations, the implementation of process affinity is done in plugins and thus may differ between installations.
On |SYSTEM_NAME| a custom pinning implementation by ParTec is used (*psslurm*).
In contrast to other options, the processor affinity options need to be passed directly to ``srun`` and must not be given to ``sbatch`` or ``salloc``.
In particular, these options cannot be specified in the header of a batch script.

.. note::

   It is important to specify the correct ``--cpus-per-task`` count to ensure a proper affinity mask for hybrid applications and to set the environment variable ``OMP_NUM_THREADS`` correspondingly.
   However, the individual threads of each MPI rank can still be moved between the logical threads matching the affinity mask of this rank.
   To avoid this as well, there are different ways to pin the threads to specific logical cores within the mask, e.g., through the OpenMP runtime library (Intel: ``KMP_AFFINITY``, GNU: ``GOMP_AFFINITY``) or through an API such as ``sched_setaffinity`` or ``kmp_set_affinity``, among others.

Terminology
^^^^^^^^^^^

**thread**
   One CPU thread.

**task**
   Part of a job consisting of a number of requested CPU threads (specified by ``-c, --cpus-per-task``). Usually this is a process.

**core**
   One physical CPU core, which can run multiple CPU threads. The CPU threads on the same physical core share caches (traditional name of the second memory locality domain).

**socket**
   Consists of a number of CPU threads with the same memory locality (traditional name of the topmost memory locality domain).

``--cpu-bind``
^^^^^^^^^^^^^^

.. code-block:: none

   --cpu-bind=[{quiet,verbose},none|rank|map_cpu:|mask_cpu:|rank_ldom|map_ldom:|mask_ldom:|sockets|cores|threads|ldoms|boards]

Implicit types
~~~~~~~~~~~~~~

.. table::
   :align: left

   +----------------+------------------------------------------------------------------------------------+
   | ``none``       | Do not bind tasks to CPUs                                                          |
   +----------------+------------------------------------------------------------------------------------+
   | ``rank``       | Each task is pinned to as many threads as it requests, just filling cores          |
   |                | consecutively. Spread the threads and tasks to as many cores as possible.          |
   |                | This type is not influenced by the second and third part of the ``--distribution`` |
   |                | option. (old default until 12th May 2020)                                          |
   +----------------+------------------------------------------------------------------------------------+
   | ``threads``    | Each task is pinned to as many threads as it requests. Which threads each process  |
   |                | gets is controlled by the ``--distribution`` option. (**Default**)                 |
   +----------------+------------------------------------------------------------------------------------+
   | ``rank_ldom``  | Each task is pinned to as many threads as it requests, just filling the nodes      |
   |                | rank by rank cycling sockets and cores. This type is not influenced by the second  |
   |                | and third level of the ``--distribution`` option. The threads of a task are always |
   |                | packed to as few cores as possible. This is the same as                            |
   |                | ``--cpu-bind=threads --distribution=*:cyclic:block``                               |
   +----------------+------------------------------------------------------------------------------------+
   | ``sockets``    | In a first step the requested CPU threads of a task are assigned in exactly the    |
   |                | same way as with ``--cpu-bind=threads``. But the final affinity mask for the task  |
   |                | is the whole socket where any thread is located that it is assigned to. This means |
   |                | if a task is assigned to any thread that is part of a socket, it will be bound to  |
   |                | the whole socket. (The 'whole' here means to each thread of the socket that is     |
   |                | allocated to the job)                                                              |
   +----------------+------------------------------------------------------------------------------------+
   | ``cores``      | In a first step the requested CPU threads of a task are assigned in exactly the    |
   |                | same way as with ``--cpu-bind=threads``. But the final affinity mask for the task  |
   |                | is the whole core where any thread is located that it is assigned to. This means   |
   |                | if a task is assigned to any thread that is part of a core, it will be bound to    |
   |                | the whole core. (The 'whole' here means to each thread of the core that is         |
   |                | allocated to the job)                                                              |
   +----------------+------------------------------------------------------------------------------------+
   | ``ldoms``      | This is the same as ``--cpu-bind=sockets``                                         |
   +----------------+------------------------------------------------------------------------------------+
   | ``boards``     | Currently not supported on systems with more than one board per node.              |
   |                | |SYSTEM_NAME| has only one board: same behavior as ``none``                        |
   +----------------+------------------------------------------------------------------------------------+

Explicit types
~~~~~~~~~~~~~~

.. table::
   :align: left

   +----------------------+-----------------------------------------------------------------------------------------+
   | ``map_cpu:``         | Explicit passing of maps or masks to pin the tasks to threads in a round-robin fashion. |
   +----------------------+                                                                                         +
   | ``mask_cpu:``        |                                                                                         |
   +----------------------+-----------------------------------------------------------------------------------------+
   | ``map_ldom:``        | Explicit passing of maps or masks to pin the tasks to sockets in a round-robin fashion. |
   +----------------------+                                                                                         +
   | ``mask_ldom:``       |                                                                                         |
   +----------------------+-----------------------------------------------------------------------------------------+

``--distribution``
^^^^^^^^^^^^^^^^^^

The string passed to ``--distribution/-m`` can have up to four parts separated by colons and a comma:

* The first part controls the distribution of the tasks over the nodes.
* The second part controls the distribution of tasks over sockets inside one node.
* The third part controls the distribution of tasks over cores inside one node.
* The fourth part provides additional information concerning the distribution of tasks over nodes.

.. code-block:: none

   --distribution/-m=<node_level>[:<socket_level>[:<core_level>[,Pack|NoPack]]]

.. --distribution/-m=*|block|cyclic|arbitrary|plane=[:*|block|cyclic|fcyclic[:*|block|cyclic|fcyclic]][,Pack|NoPack]

First part (``node_level``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table::
   :align: left

   +---------------------+------------------------------------------------------------------------------------+
   | ``*``               | The default is ``block``                                                           |
   +---------------------+------------------------------------------------------------------------------------+
   | ``block``           | Distribute tasks to a node such that consecutive tasks share a node                |
   +---------------------+------------------------------------------------------------------------------------+
   | ``cyclic``          | Distribute tasks to a node such that consecutive tasks are distributed over        |
   |                     | consecutive nodes (in a round-robin fashion)                                       |
   +---------------------+------------------------------------------------------------------------------------+
   | ``arbitrary``       | see https://slurm.schedmd.com/srun.html                                            |
   +---------------------+------------------------------------------------------------------------------------+
   | ``plane=``          | see https://slurm.schedmd.com/dist_plane.html                                      |
   +---------------------+------------------------------------------------------------------------------------+

Second part (``socket_level``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table::
   :align: left

   +------------------+------------------------------------------------------------------------------------+
   | ``*``            | The default is ``cyclic``                                                          |
   +------------------+------------------------------------------------------------------------------------+
   | ``block``        | Each socket is first filled with tasks before the next socket will be used.        |
   +------------------+------------------------------------------------------------------------------------+
   | ``cyclic``       | Each task will be assigned to the next socket(s) in a round-robin fashion.         |
   +------------------+------------------------------------------------------------------------------------+
   | ``fcyclic``      | Each thread inside a task will be assigned to the next socket in a round-robin     |
   |                  | fashion, spreading the task itself as much as possible over all sockets.           |
   |                  | ``fcyclic`` implies ``cyclic``.                                                    |
   +------------------+------------------------------------------------------------------------------------+

Third part (``core_level``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. table::
   :align: left

   +------------------+------------------------------------------------------------------------------------+
   | ``*``            | The default is ``fcyclic``                                                         |
   +------------------+------------------------------------------------------------------------------------+
   | ``block``        | Each core is first filled with tasks before the next core will be used.            |
   +------------------+------------------------------------------------------------------------------------+
   | ``cyclic``       | Each task will be assigned to the next core(s) in a round-robin fashion.           |
   |                  | The threads of a task will fill the cores.                                         |
   +------------------+------------------------------------------------------------------------------------+
   | ``fcyclic``      | Each thread inside a task will be assigned to the next core in a round-robin       |
   |                  | fashion, spreading the task itself as much as possible over all cores.             |
   |                  | ``fcyclic`` implies ``cyclic``.                                                    |
   +------------------+------------------------------------------------------------------------------------+

Fourth part
~~~~~~~~~~~

Optional control for task distribution over nodes.

.. table::
   :align: left

   +------------+------------------------------------------------------------------------------------+
   | ``Pack``   | Default is NoPack. See: https://slurm.schedmd.com/srun.html                        |
   +------------+                                                                                    +
   | ``NoPack`` |                                                                                    |
   +------------+------------------------------------------------------------------------------------+

``--hint``
^^^^^^^^^^

If the hint ``nomultithread`` is given, the affinity will be set as if there were only one thread per core on the nodes, and an error message will be issued if the total number of threads you are trying to use per node is higher than the number of available physical cores.

.. code-block:: none

   --hint=nomultithread

.. note::

   The hints ``compute_bound`` and ``memory_bound`` are currently not supported.

Affinity examples
-----------------

Visualization of the processor affinity in the following examples is done by the tool ``psslurmgetbind``, which is also available on the login nodes of |SYSTEM_NAME|.
The displayed scheme represents one node of |SYSTEM_NAME|, which has two sockets divided by the blank space in the middle.
Each column corresponds to one core; the first row shows the first (physical) thread of the corresponding core and the second row the SMT (logical thread) of the core.
The number (``X``) followed by a ``:`` in the line above the described scheme represents the MPI task number for which the affinity mask is shown in the scheme.
The number ``1`` in the scheme itself indicates that the task with its threads is scheduled on the corresponding hardware thread of the node.

**Example:** One MPI task with two threads:

.. ifconfig:: system_name == 'juwels'

   .. figure:: ../shared/images/psslurmgetbind_juwels.jpg
      :name: psslurmgetbind_juwels
      :align: center

.. ifconfig:: system_name == 'jureca'

   .. figure:: ../shared/images/psslurmgetbind_jureca.jpg
      :name: psslurmgetbind_jureca
      :align: center

.. ifconfig:: system_name == 'jusuf'

   For the purpose of presentation, ``...`` indicates that not all cores on a |SYSTEM_NAME| node are shown.

   .. figure:: ../shared/images/psslurmgetbind_jusuf.jpg
      :name: psslurmgetbind_jusuf
      :align: center

.. only:: builder_html

   Further examples with a colored representation can be found here :download:`Talk: Pinning with psslurm <../shared/docs/Pinning_with_psslurm.pdf>`.

.. _default_affinity:

Default processor affinity
^^^^^^^^^^^^^^^^^^^^^^^^^^

The default processor affinity was changed on 12th May 2020 to the following setting:

.. code-block:: none

   --cpu-bind=threads --distribution=block:cyclic:fcyclic

The behavior of this combination is shown in the following examples for |SYSTEM_NAME|.

.. ifconfig:: system_name == 'juwels'

   **Example 1:** Pure MPI application filling only the first thread of a core on a node:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=48 --cpus-per-task=1

   .. code-block:: none

      $ psslurmgetbind 2 24 2 -h : -n 48 -c 1
      0:   100000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      1:   000000000000000000000000 100000000000000000000000
           000000000000000000000000 000000000000000000000000
      ...
      2:   010000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      3:   000000000000000000000000 010000000000000000000000
           000000000000000000000000 000000000000000000000000
      ...
      46:  000000000000000000000001 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      47:  000000000000000000000000 000000000000000000000001
           000000000000000000000000 000000000000000000000000

   **Example 2:** Pure MPI application filling all threads on a node (including SMT):

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=96 --cpus-per-task=1

   .. code-block:: none

      $ psslurmgetbind 2 24 2 -h : -n 96 -c 1
      0:   100000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      1:   000000000000000000000000 100000000000000000000000
           000000000000000000000000 000000000000000000000000
      ...
      46:  000000000000000000000001 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      47:  000000000000000000000000 000000000000000000000001
           000000000000000000000000 000000000000000000000000
      48:  000000000000000000000000 000000000000000000000000
           100000000000000000000000 000000000000000000000000
      49:  000000000000000000000000 000000000000000000000000
           000000000000000000000000 100000000000000000000000
      ...
      94:  000000000000000000000000 000000000000000000000000
           000000000000000000000001 000000000000000000000000
      95:  000000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000001

   **Example 3:** Hybrid application (MPI + OpenMP) with 4 tasks per node and 16 threads per task:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=4 --cpus-per-task=16

   .. code-block:: none

      $ psslurmgetbind 2 24 2 -h : -n 4 -c 16
      0:   111111111111111100000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      1:   000000000000000000000000 111111111111111100000000
           000000000000000000000000 000000000000000000000000
      2:   000000000000000011111111 000000000000000000000000
           111111110000000000000000 000000000000000000000000
      3:   000000000000000000000000 000000000000000011111111
           000000000000000000000000 111111110000000000000000

   **Example 4:** Pure OpenMP application with 48 threads:

   .. code-block:: none

      srun --nodes=1 --cpus-per-task=48

   .. code-block:: none

      $ psslurmgetbind 2 24 2 -h : -n 1 -c 48
      0:   111111111111111111111111 000000000000000000000000
           111111111111111111111111 000000000000000000000000

.. ifconfig:: system_name in ('jureca', 'jusuf')

   **Example 1:** Pure MPI application filling only the first thread of a core on a node:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=128 --cpus-per-task=1

   .. code-block:: none

      $ psslurmgetbind 2 64 2 -h : -n 128 -c 1
      0:   100000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      1:   000000000000...000000000000 100000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      ...
      126: 000000000000...000000000001 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      127: 000000000000...000000000000 000000000000...000000000001
           000000000000...000000000000 000000000000...000000000000

   **Example 2:** Pure MPI application filling all threads on a node (including SMT):

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=256 --cpus-per-task=1

   .. code-block:: none

      $ psslurmgetbind 2 64 2 -h : -n 256 -c 1
      0:   100000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      1:   000000000000...000000000000 100000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      ...
      126: 000000000000...000000000001 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      127: 000000000000...000000000000 000000000000...000000000001
           000000000000...000000000000 000000000000...000000000000
      128: 000000000000...000000000000 000000000000...000000000000
           100000000000...000000000000 000000000000...000000000000
      129: 000000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 100000000000...000000000000
      ...
      254: 000000000000...000000000000 000000000000...000000000000
           000000000000...000000000001 000000000000...000000000000
      255: 000000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000001

   **Example 3:** Hybrid application (MPI + OpenMP) with 4 tasks per node and 36 threads per task:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=4 --cpus-per-task=36

   .. code-block:: none

      $ psslurmgetbind 2 64 2 -h : -n 4 -c 36
      0:   1111111111111111111111111111111111110000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
           0000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
      1:   0000000000000000000000000000000000000000000000000000000000000000 1111111111111111111111111111111111110000000000000000000000000000
           0000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
      2:   0000000000000000000000000000000000001111111111111111111111111111 0000000000000000000000000000000000000000000000000000000000000000
           1111111100000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
      3:   0000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000001111111111111111111111111111
           0000000000000000000000000000000000000000000000000000000000000000 1111111100000000000000000000000000000000000000000000000000000000

   **Example 4:** Pure OpenMP application with 128 threads:

   .. code-block:: none

      srun --nodes=1 --cpus-per-task=128

   .. code-block:: none

      $ psslurmgetbind 2 64 2 -h : -n 1 -c 128
      0:   111111111111...111111111111 000000000000...000000000000
           111111111111...111111111111 000000000000...000000000000

Further examples
^^^^^^^^^^^^^^^^

.. ifconfig:: system_name == 'juwels'

   **Example 1:** Hybrid application (MPI + OpenMP) with 4 tasks per node and 16 threads per task using ``--distribution=*:*:cyclic``.
   Most hybrid applications with tasks-per-node * cpus-per-task > 48 per node on |SYSTEM_NAME| should benefit from this setting:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=4 --cpus-per-task=16 --distribution=*:*:cyclic

   .. code-block:: none

      $ psslurmgetbind 2 24 2 -h : -n 4 -c 16 --distribution=*:*:cyclic
      0:   111111110000000000000000 000000000000000000000000
           111111110000000000000000 000000000000000000000000
      1:   000000000000000000000000 111111110000000000000000
           000000000000000000000000 111111110000000000000000
      2:   000000001111111100000000 000000000000000000000000
           000000001111111100000000 000000000000000000000000
      3:   000000000000000000000000 000000001111111100000000
           000000000000000000000000 000000001111111100000000

   **Example 2:** Pure MPI application using only the first thread of a core on a node with ``--cpu-bind=rank``:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=48 --cpus-per-task=1 --cpu-bind=rank

   .. code-block:: none

      $ psslurmgetbind 2 24 2 -h : -n 48 -c 1 --cpu-bind=rank
      0:   100000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      1:   010000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      ...
      46:  000000000000000000000000 000000000000000000000010
           000000000000000000000000 000000000000000000000000
      47:  000000000000000000000000 000000000000000000000001
           000000000000000000000000 000000000000000000000000

   **Example 3:** Pure MPI application filling all threads on a node (including SMT) with ``--cpu-bind=rank``:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=96 --cpus-per-task=1 --cpu-bind=rank

   .. code-block:: none

      $ psslurmgetbind 2 24 2 -h : -n 96 -c 1 --cpu-bind=rank
      0:   100000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      1:   010000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      ...
      46:  000000000000000000000000 000000000000000000000010
           000000000000000000000000 000000000000000000000000
      47:  000000000000000000000000 000000000000000000000001
           000000000000000000000000 000000000000000000000000
      48:  000000000000000000000000 000000000000000000000000
           100000000000000000000000 000000000000000000000000
      49:  000000000000000000000000 000000000000000000000000
           010000000000000000000000 000000000000000000000000
      ...
      94:  000000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000010
      95:  000000000000000000000000 000000000000000000000000
           000000000000000000000000 000000000000000000000001

   **Example 4:** Hybrid application (MPI + OpenMP) with 4 tasks per node and 16 threads per task with ``--cpu-bind=rank``:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=4 --cpus-per-task=16 --cpu-bind=rank

   .. code-block:: none

      $ psslurmgetbind 2 24 2 -h : -n 4 -c 16 --cpu-bind=rank
      0:   111111111111111100000000 000000000000000000000000
           000000000000000000000000 000000000000000000000000
      1:   000000000000000011111111 111111110000000000000000
           000000000000000000000000 000000000000000000000000
      2:   000000000000000000000000 000000001111111111111111
           000000000000000000000000 000000000000000000000000
      3:   000000000000000000000000 000000000000000000000000
           111111111111111100000000 000000000000000000000000

.. ifconfig:: system_name in ('jureca', 'jusuf')

   **Example 1:** Hybrid application (MPI + OpenMP) with 4 tasks per node and 36 threads per task using ``--distribution=*:*:cyclic``.
   Most hybrid applications with tasks-per-node * cpus-per-task > 128 per node on |SYSTEM_NAME| should benefit from this setting:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=4 --cpus-per-task=36 --distribution=*:*:cyclic

   .. code-block:: none

      $ psslurmgetbind 2 64 2 -h : -n 4 -c 36 --distribution=*:*:cyclic
      0:   1111111111111111110000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
           1111111111111111110000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
      1:   0000000000000000000000000000000000000000000000000000000000000000 1111111111111111110000000000000000000000000000000000000000000000
           0000000000000000000000000000000000000000000000000000000000000000 1111111111111111110000000000000000000000000000000000000000000000
      2:   0000000000000000001111111111111111110000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
           0000000000000000001111111111111111110000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
      3:   0000000000000000000000000000000000000000000000000000000000000000 0000000000000000001111111111111111110000000000000000000000000000
           0000000000000000000000000000000000000000000000000000000000000000 0000000000000000001111111111111111110000000000000000000000000000

   **Example 2:** Pure MPI application using only the first thread of a core on a node with ``--cpu-bind=rank``:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=128 --cpus-per-task=1 --cpu-bind=rank

   .. code-block:: none

      $ psslurmgetbind 2 64 2 -h : -n 128 -c 1 --cpu-bind=rank
      0:   100000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      1:   010000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      ...
      126: 000000000000...000000000000 000000000000...000000000010
           000000000000...000000000000 000000000000...000000000000
      127: 000000000000...000000000000 000000000000...000000000001
           000000000000...000000000000 000000000000...000000000000

   **Example 3:** Pure MPI application filling all threads on a node (including SMT) with ``--cpu-bind=rank``:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=256 --cpus-per-task=1 --cpu-bind=rank

   .. code-block:: none

      $ psslurmgetbind 2 64 2 -h : -n 256 -c 1 --cpu-bind=rank
      0:   100000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      1:   010000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000000
      ...
      126: 000000000000...000000000000 000000000000...000000000010
           000000000000...000000000000 000000000000...000000000000
      127: 000000000000...000000000000 000000000000...000000000001
           000000000000...000000000000 000000000000...000000000000
      128: 000000000000...000000000000 000000000000...000000000000
           100000000000...000000000000 000000000000...000000000000
      129: 000000000000...000000000000 000000000000...000000000000
           010000000000...000000000000 000000000000...000000000000
      ...
      254: 000000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000010
      255: 000000000000...000000000000 000000000000...000000000000
           000000000000...000000000000 000000000000...000000000001

   **Example 4:** Hybrid application (MPI + OpenMP) with 4 tasks per node and 36 threads per task with ``--cpu-bind=rank``:

   .. code-block:: none

      srun --nodes=1 --tasks-per-node=4 --cpus-per-task=36 --cpu-bind=rank

   .. code-block:: none

      $ psslurmgetbind 2 64 2 -h : -n 4 -c 36 --cpu-bind=rank
      0:   1111111111111111111111111111111111110000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
           0000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
      1:   0000000000000000000000000000000000001111111111111111111111111111 1111111100000000000000000000000000000000000000000000000000000000
           0000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
      2:   0000000000000000000000000000000000000000000000000000000000000000 0000000011111111111111111111111111111111111100000000000000000000
           0000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000
      3:   0000000000000000000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000011111111111111111111
           1111111111111111000000000000000000000000000000000000000000000000 0000000000000000000000000000000000000000000000000000000000000000

**Examples for manual pinning**

For advanced use cases it can be desirable to manually specify the binding masks or core sets for each task.
This is possible using the options ``--cpu-bind=map_cpu`` and ``--cpu-bind=mask_cpu``.
For example,

.. code-block:: none

   srun -n 2 --cpu-bind=map_cpu:1,5

spawns two tasks pinned to cores 1 and 5, respectively.
The command

.. code-block:: none

   srun -n 2 --cpu-bind=mask_cpu:0x3,0xC

spawns two tasks pinned to cores 0 and 1 (``0x3 = 3 = 2^0 + 2^1``) and cores 2 and 3 (``0xC = 12 = 2^2 + 2^3``), respectively.

Affinity visualisation
----------------------

You can use a web interface, available at https://apps.fz-juelich.de/jsc/llview/pinning, to test and visualise different Slurm affinity setups yourself.

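
The mask arithmetic used with ``--cpu-bind=mask_cpu`` in the manual pinning examples can be reproduced with plain shell arithmetic.
The following sketch builds the hexadecimal mask for a list of logical core IDs.
The function name ``cores_to_mask`` is made up for this illustration and is not part of Slurm or *psslurm*; because it relies on 64-bit shell integers, it is only meant for core IDs from 0 to 62.

.. code-block:: bash

   # Build a mask_cpu value from a list of logical core IDs.
   # Hypothetical helper; limited to core IDs 0-62 (64-bit shell integers).
   cores_to_mask() {
       mask=0
       for core in "$@"; do
           # Set the bit that corresponds to this core ID.
           mask=$(( mask | (1 << core) ))
       done
       printf '0x%X\n' "$mask"
   }

   cores_to_mask 0 1    # prints 0x3
   cores_to_mask 2 3    # prints 0xC

The two printed values are the masks from the example above and could be passed to ``srun`` as ``--cpu-bind=mask_cpu:0x3,0xC``.
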
Differences to vanilla Slurm (19.05)
------------------------------------

* Auto binding is not supported in *psslurm*.
* ``--cpu-bind=boards`` is not supported.
* The option ``--cpu-bind=rank`` is implemented differently in *psslurm*, since it is redundant and makes no sense without auto-pin. Slurm completely ignores the ``--cpus-per-task`` option here, *psslurm* does not.
* *psslurm* does NOT YET differentiate ``ldoms`` from ``sockets``. These keywords are currently treated as equivalent.
* *psslurm* does not consider the values given to the options ``--ntasks-per-core`` and ``--ntasks-per-socket``. (As far as we could observe, Slurm does not either, even though the srun manpage describes it otherwise.)
* The hints ``compute_bound`` and ``memory_bound`` are currently not supported.
* *psslurm* follows what is described in the srun manpage. (In many cases, Slurm does not.)