.. include:: system.rst

.. _mpmd:

MPMD: Multiple Program Multiple Data Execution Model
====================================================

.. ifconfig:: system_name == 'jupiter'

   .. include:: ../shared/jupiter_warning.rst

Basic Functionality
-------------------

Slurm supports starting multiple executables within one ``MPI_COMM_WORLD``
communicator (MPMD model) using the ``--multi-prog`` option. When the
``--multi-prog`` option is specified, ``srun`` expects a text file describing
the mapping from tasks to programs instead of an executable. As an example we
consider the simple MPI program::

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int size;
       int rank;

       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       printf(NAME "(%s): %d (%s) of %d\n",
              argv[0], rank, argv[1], size);
       return MPI_Finalize();
   }

We generate three separate executables via::

   mpicc -DNAME='"prog1.x"' -o prog1.x example.c
   mpicc -DNAME='"prog2.x"' -o prog2.x example.c
   mpicc -DNAME='"prog3.x"' -o prog3.x example.c

.. note::

   The compilation process may depend on the employed modules and
   environment. This example is only intended to demonstrate the MPMD support
   in Slurm.

We additionally create a ``multi.conf`` file specifying the mapping from task
numbers to programs::

   0-1,7 ./prog1.x %o
   2-3   ./prog2.x %t
   4-5,6 ./prog3.x %t

In this example, tasks zero, one, and seven execute ``prog1.x``; tasks two and
three execute ``prog2.x``; and tasks four, five, and six execute ``prog3.x``.
Slurm replaces the placeholder ``%t`` with the task number and the placeholder
``%o`` with the offset of the task within its range specification.
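The placeholder substitution can be illustrated with a small shell sketch.
The helper function ``expand_task`` below is hypothetical (not part of Slurm)
and handles only the ``%t`` and ``%o`` escapes, with no ``*`` wildcard or
other features of the real configuration format:

```shell
# expand_task <conf-file> <task-id>
# Print the command line a given task would execute, substituting
# %t (task number) and %o (offset within the task range) as srun does.
# Simplified, hypothetical sketch: only %t/%o, no '*' wildcard.
expand_task() {
    local conf=$1 task=$2
    local ranges cmd range lo hi t offset
    while read -r ranges cmd; do
        # Skip comments and empty lines
        [[ -z $ranges || $ranges == \#* ]] && continue
        offset=0
        for range in ${ranges//,/ }; do      # "0-1,7" -> "0-1" "7"
            lo=${range%-*}                   # a single "7" yields lo=hi=7
            hi=${range#*-}
            for ((t = lo; t <= hi; t++)); do
                if ((t == task)); then
                    cmd=${cmd//%t/$task}     # substitute the task number
                    echo "${cmd//%o/$offset}"
                    return 0
                fi
                offset=$((offset + 1))
            done
        done
    done < "$conf"
    return 1                                 # task not covered by any line
}

# With the multi.conf from above, task 7 is the third task of the
# range "0-1,7", so %o expands to 2:
#   expand_task multi.conf 7   ->   ./prog1.x 2
```

This also explains the last line of the job output shown below, where task 7
reports offset 2.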
When submitting the batch script::

   #!/bin/bash
   #SBATCH --account=
   #SBATCH --nodes=4
   #SBATCH --ntasks=8
   #SBATCH --ntasks-per-node=2

   srun --multi-prog multi.conf

with ``sbatch``, the output of the job equals::

   prog1.x(./prog1.x): 0 (0) of 8
   prog1.x(./prog1.x): 1 (1) of 8
   prog2.x(./prog2.x): 3 (3) of 8
   prog2.x(./prog2.x): 2 (2) of 8
   prog3.x(./prog3.x): 4 (4) of 8
   prog3.x(./prog3.x): 5 (5) of 8
   prog3.x(./prog3.x): 6 (6) of 8
   prog1.x(./prog1.x): 7 (2) of 8

Advanced Concurrent GPU and CPU Usage
-------------------------------------

|SYSTEM_NAME| features GPU-equipped nodes whose CPU hosts also provide
significant compute capability, so it can be beneficial to use the CPUs
concurrently with the GPUs. This can be implemented at different levels, for
example in the code itself or at the level of separate binaries. MPMD is one
option to facilitate this heterogeneous execution. As with all heterogeneous
execution, special care should be taken with the affinity of the processes
launched on each node.

In the following, we provide a CPU-GPU MPMD example outlining the strategy
for separate CPU and GPU affinities. The CPU intra-node parallelism is
assumed to be implemented via OpenMP. One binary has been compiled for CPU
usage, the other for GPU usage. Both binaries are launched in a single job
and share a common MPI communicator.

Step 1 - Remove Global Core Affinity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many job scripts set the ``cpus-per-task`` option in the header section
(including scripts generated by JUBE on JSC's HPC systems). This parameter
not only sets the CPU mask for the allocation in general, but is by default
also inherited by all job steps and tasks/MPI ranks. The option should either
be set sufficiently coarse for every job step the job includes, or removed
entirely from the header to enable finer-grained affinity options for
individual job steps. A typical MPMD job script header then looks like the
following:
.. code-block::

   …
   #SBATCH --nodes=1
   #SBATCH --ntasks=5
   #SBATCH --time=…
   …

.. note::

   The parameters determining the number of nodes and tasks MUST be set and
   MUST be consistent with the multi-prog config file. Here, the values are
   chosen to be consistent with the example below.

.. note::

   If this is done via JUBE, the simplest way to achieve it is to copy the
   files ``platform.xml`` and ``submit.job.in`` from the JUBE include path to
   a user-defined JUBE include path, and to edit the file ``submit.job.in``
   (the template for the job script) accordingly. This directory can then be
   used as a custom path in the JUBE config.

To allow fine-grained, manual affinity configuration of the MPMD programs,
the outer job step has to release control over CPU affinity and launch the
(child) MPMD programs with an open CPU mask, using the flag
``--cpu-bind=none``:

.. code-block::

   srun --cpu-bind=none,verbose --multi-prog multi.conf

The number of tasks, ``--ntasks=5`` here, is inherited from the outer
allocation.

Step 2 - Individual Affinities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Having manually cleared the CPU mask for all ranks, one can now use
``numactl`` to set affinities manually through individual CPU masks for the
MPMD programs. ``numactl`` is used as a wrapper around the program and can be
augmented by further wrapping tools, such as ``env`` to set the environment
of the subshell.

In the following example, a ``multi.conf`` configuration for a 5-task MPMD
job is shown, where task 0 executes a CPU program and tasks 1-4 execute (the
same) GPU program. The ``numactl`` option ``--physcpubind`` is used to bind
programs to certain physical CPU IDs. For task 0, typical environment
variables for OpenMP have been set; for tasks 1-4, the
``CUDA_VISIBLE_DEVICES`` environment variable is used to select a distinct
GPU per task.

.. ifconfig:: system_name == 'jupiter'
   .. code-block::

      0 env OMP_NUM_THREADS=124 OMP_PROC_BIND=spread OMP_PLACES=threads numactl --physcpubind=1-71,73-143,145-215,217-277 binary.cpu
      1 env CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=0 binary.gpu
      2 env CUDA_VISIBLE_DEVICES=1 numactl --physcpubind=72 binary.gpu
      3 env CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=144 binary.gpu
      4 env CUDA_VISIBLE_DEVICES=3 numactl --physcpubind=216 binary.gpu

.. ifconfig:: system_name == 'juwels'

   .. code-block::

      0 env OMP_NUM_THREADS=124 OMP_PROC_BIND=spread OMP_PLACES=threads numactl --physcpubind=1-5,7-17,19-29,31-41,43-47 binary.cpu
      1 env CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=18 binary.gpu
      2 env CUDA_VISIBLE_DEVICES=1 numactl --physcpubind=6 binary.gpu
      3 env CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=42 binary.gpu
      4 env CUDA_VISIBLE_DEVICES=3 numactl --physcpubind=30 binary.gpu

.. ifconfig:: system_name in ('jureca', 'jusuf')

   .. code-block::

      0 env OMP_NUM_THREADS=124 OMP_PROC_BIND=spread OMP_PLACES=threads numactl --physcpubind=0-15,17-47,49-79,81-111,113-127 binary.cpu
      1 env CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=48 binary.gpu
      2 env CUDA_VISIBLE_DEVICES=1 numactl --physcpubind=16 binary.gpu
      3 env CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=112 binary.gpu
      4 env CUDA_VISIBLE_DEVICES=3 numactl --physcpubind=80 binary.gpu

.. note::

   The ``physcpubind`` settings in this example take the NUMA domains into
   account: each GPU is driven by a single rank pinned to a core in the CPU
   NUMA domain that is physically connected to that GPU.

Further notes:

- Please verify that pinning occurs as you expect after defining these
  affinities. ``htop`` and ``nvidia-smi`` are good tools to do this.
  Additionally, we offer the ``sgoto`` command, which can be used to log in
  to a particular node of a particular job, as discussed in :ref:`sgoto`.
- This process can be implemented in our benchmarking framework, JUBE, as
  detailed in the notes above.

.. warning::

   When combining CPU and GPU computation, it is easy to end up with
   suboptimal performance, e.g.
   when the GPU needs to wait for results from the CPU or vice versa. It is
   important that you understand your code and ensure that appropriate load
   balancing is performed. As a first, straightforward step towards
   understanding your code's behaviour, our monitoring tool LLView can be
   used to check that the general distribution of tasks and the load on CPU
   and GPU are approximately as expected. Detailed instrumentation can be
   performed with tools such as Score-P, Cube, and Vampir. Load balancing
   depends on the tasks performed; it may, for example, turn out that using
   the CPU resources is ultimately not efficient at all.
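To complement ``htop`` and ``nvidia-smi`` for verifying pinning, a minimal
sketch of a wrapper script that each task can run to print the CPU mask it
actually received. The script name ``show_mask.sh`` is hypothetical;
``SLURM_PROCID`` is set by ``srun`` for every task, and the mask is read from
Linux's ``/proc`` interface:

```shell
#!/bin/bash
# show_mask.sh - hypothetical helper: print the CPU affinity this task
# actually received. Substitute it for (or run it before) the binaries
# listed in multi.conf to inspect the effect of --cpu-bind and numactl.
# Outside of Slurm, SLURM_PROCID is unset and "?" is printed instead.
mask=$(awk '/^Cpus_allowed_list/ {print $2}' /proc/self/status)
echo "task ${SLURM_PROCID:-?} on $(hostname): allowed CPUs ${mask}"
```

Launching this once with ``srun --cpu-bind=none --multi-prog`` and once
without the flag makes the effect of Step 1 directly visible.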