Processor Affinity

Each JURECA compute node features 24 physical and 48 logical cores (see SMT). The Linux operating system on each node is designed to balance the computational load dynamically by migrating processes between cores where necessary. For many high performance computing applications, however, dynamic load balancing is not beneficial: the load can be predicted a priori, and process migration may lead to performance loss on the JURECA compute nodes, which fall in the category of Non-Uniform Memory Access (NUMA) architectures. To avoid process migration, processes can be pinned (or bound) to a logical core through the resource management system. A pinned process (or thread) is bound to a specific set of cores (a single logical core or several) and will only run on the cores in this set.

Slurm allows users to modify the process binding by means of the --cpu_bind option to srun. While the options accepted by srun are standard across all Slurm installations, process affinity is implemented in plugins and may therefore differ between installations. On JURECA a custom pinning implementation is used. In contrast to other options, the processor affinity options must be passed directly to srun and must not be given to sbatch or salloc. In particular, the option cannot be specified in the header of a batch script.

Note

The option --cpu_bind=cores is not supported on JURECA and will be rejected by the batch system.
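
As an illustration, a binding option therefore belongs on the srun line inside the script body. The following minimal batch script is a sketch; the node, task, and time values as well as the executable name ./app are hypothetical:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=48
#SBATCH --time=00:10:00

# The affinity option is passed to srun directly; it cannot appear
# as an #SBATCH directive in the header above.
srun --cpu_bind=threads ./app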

Default processor affinity

Since the majority of applications benefit from strict pinning that prevents migration, all tasks in a job step are pinned to a set of cores by default, unless pinning is explicitly disabled. A heuristic determines the optimal core set for each task based on the job step specification. In job steps with --cpus-per-task=1 (the default) each task is pinned to a single logical core as shown in Fig. 2. In job steps with a --cpus-per-task count larger than one (e.g., threaded applications), each task/process is assigned a set of cores with cardinality matching the value of --cpus-per-task, see Fig. 3.
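
For example, a pure MPI job step with 48 tasks on a single node, matching the pinning shown in Fig. 2, could be launched as follows (the executable name ./app is a placeholder):

srun --ntasks=48 ./app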

_images/Default-1-48.png

Fig. 2 Visualization of the processor affinity of a 48 task job step on a single JURECA node. Each column corresponds to a logical core and each row to a task/process. A red dot indicates that the task can be scheduled on the corresponding core. For the purpose of presentation, stars are used to highlight cores/tasks 0, 6, 12, ..., 42.

_images/Default-8-6.png

Fig. 3 Visualization of the processor affinity of an 8 task job step with --cpus-per-task=6 (e.g., a hybrid MPI/OpenMP job with 8 MPI processes and OMP_NUM_THREADS=6). Pinning of the individual threads spawned by each task is not in the hands of the resource management system but is managed by the runtime (e.g., the OpenMP runtime library).

Note

It is important to specify the correct --cpus-per-task count to ensure an optimal pinning for hybrid applications.
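
For a hybrid MPI/OpenMP run such as the one visualized in Fig. 3, this means keeping the OpenMP thread count consistent with --cpus-per-task. A sketch with a hypothetical executable:

export OMP_NUM_THREADS=6
srun --ntasks=8 --cpus-per-task=6 ./hybrid_app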

The processor affinity masks generated with the options --cpu_bind=rank and --cpu_bind=threads coincide with the default binding scheme.
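
For instance, the following two invocations are expected to produce identical affinity masks (executable name hypothetical):

srun --ntasks=48 --cpu_bind=threads ./app
srun --ntasks=48 ./app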

Note

The distribution of processes across sockets can be affected with the option -m to srun. See srun(1) for more information.
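
As a sketch (executable name hypothetical), a block distribution of tasks across nodes combined with a cyclic distribution across sockets can be requested as:

srun --ntasks=4 -m block:cyclic ./app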

Binding to sockets

With the option --cpu_bind=sockets processes can be bound to sockets, see Fig. 4.

_images/Sockets-1-2.png

Fig. 4 Visualization of the processor affinity for a two task job step with --cpu_bind=sockets. The option --cpu_bind=sockets can be further combined with --hint=nomultithread (see SMT) to restrict task zero to cores 0 to 11 and task one to cores 12 to 23.

On JURECA, locality domains coincide with sockets so that --cpu_bind=ldoms and --cpu_bind=sockets give the same results.
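
A possible invocation combining both options, with a hypothetical executable, is:

srun --ntasks=2 --cpu_bind=sockets --hint=nomultithread ./app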

Manual pinning

For advanced use cases it can be desirable to manually specify the binding masks or core sets for each task. This is possible using the options --cpu_bind=map_cpu and --cpu_bind=mask_cpu. For example,

srun -n 2 --cpu_bind=map_cpu:1,5

spawns two tasks pinned to core 1 and 5, respectively. The command

srun -n 2 --cpu_bind=mask_cpu:0x3,0xC

spawns two tasks pinned to cores 0 and 1 (\(0x3 = 3 = 2^0 + 2^1\)) and cores 2 and 3 (\(0xC = 12 = 2^2 + 2^3\)), respectively.
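
To construct a mask for another core set, set bit i for every desired logical core i and write the resulting value in hexadecimal. As a further sketch (core sets chosen purely for illustration), pinning one task to cores 0 to 5 (0x3F) and a second task to cores 6 to 11 (0xFC0) reads:

srun -n 2 --cpu_bind=mask_cpu:0x3F,0xFC0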

Disabling pinning

Processor binding can be disabled using the argument --cpu_bind=none to srun. In this case, each thread may execute on any of the 48 logical cores and the scheduling of the processes is up to the operating system. On JURECA the options --cpu_bind=none and --cpu_bind=boards achieve the same result.
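
For example, with a hypothetical executable:

srun --ntasks=48 --cpu_bind=none ./app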