MPMD: Multiple Program Multiple Data Execution Model

Basic Functionality

Slurm supports starting multiple executables within one MPI_COMM_WORLD communicator (the MPMD model) via the --multi-prog option of srun. When --multi-prog is specified, srun expects, instead of an executable, a text file describing the mapping from task numbers to programs.

As an example, we consider the simple MPI program:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size;
    int rank;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* NAME is defined at compile time (-DNAME=...); argv[1] is the
       argument from the multi.conf line (%t or %o after expansion). */
    printf(NAME "(%s): %d (%s) of %d\n", argv[0], rank, argv[1], size);

    return MPI_Finalize();
}

We generate three separate executables via:

mpicc -DNAME='"prog1.x"' -o prog1.x example.c
mpicc -DNAME='"prog2.x"' -o prog2.x example.c
mpicc -DNAME='"prog3.x"' -o prog3.x example.c

Note

The compilation process may depend on the employed modules and environment. This example is only intended to demonstrate the MPMD support in Slurm.

We additionally create a multi.conf file specifying the mapping from task numbers to programs:

0-1,7 ./prog1.x %o
2-3   ./prog2.x %t
4-5,6 ./prog3.x %t

In this example, tasks zero, one, and seven will execute prog1.x; tasks two and three will execute prog2.x; and tasks four, five, and six will execute prog3.x. Slurm replaces the placeholder %t with the global task number and the placeholder %o with the offset of the task within the range specified on its line (e.g. task 7 is the third task of the range 0-1,7, so its %o is 2).
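To make the two placeholders concrete, the substitution Slurm performs for the multi.conf above can be expanded by hand. The following plain-bash sketch only tabulates the mapping; Slurm itself performs the real substitution at launch:

```shell
#!/bin/bash
# Hand-expanded view of the multi.conf above: for each task, which program runs
# and what %t (global task number) and %o (offset within the line's range)
# expand to. This merely prints the mapping as an illustration.
while read -r range prog; do
    offset=0
    for part in ${range//,/ }; do
        first=${part%-*}; last=${part#*-}
        for t in $(seq "$first" "$last"); do
            echo "task $t -> $prog  %t=$t  %o=$offset"
            offset=$((offset + 1))
        done
    done
done <<'EOF'
0-1,7 ./prog1.x
2-3 ./prog2.x
4-5,6 ./prog3.x
EOF
```

The printed mapping matches the job output shown below: for instance, task 7 runs ./prog1.x with %o expanding to 2.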

When submitting the batch script:

#!/bin/bash
#SBATCH --account=<budget>
#SBATCH --nodes=4
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=2
srun --multi-prog multi.conf

with sbatch, the output of the job resembles the following (the ordering of the lines is not deterministic):

prog1.x(./prog1.x): 0 (0) of 8
prog1.x(./prog1.x): 1 (1) of 8
prog2.x(./prog2.x): 3 (3) of 8
prog2.x(./prog2.x): 2 (2) of 8
prog3.x(./prog3.x): 4 (4) of 8
prog3.x(./prog3.x): 5 (5) of 8
prog3.x(./prog3.x): 6 (6) of 8
prog1.x(./prog1.x): 7 (2) of 8

Advanced Concurrent GPU and CPU Usage

JUWELS features GPU-equipped nodes whose host CPUs also provide significant compute capability. It can be beneficial to use these CPUs concurrently with the GPUs. This can be implemented at different levels, for example in the code itself or at the level of separate binaries; MPMD is one option to facilitate such heterogeneous execution.

As with all heterogeneous execution, special care should be taken with the affinity of processes launched on the node. In the following, we provide a CPU-GPU MPMD example outlining a strategy for separate CPU and GPU affinities.

In this example, the CPU intra-node parallelism is assumed to be implemented via OpenMP. One binary has been compiled for CPU usage, the other for GPU usage. These binaries are launched in a single job and share a common MPI communicator.

Step 1 - Remove Global Core Affinity

Many job scripts set the --cpus-per-task option in the header section (including scripts generated by JUBE on JSC's HPC systems). This parameter not only sets the CPU mask for the allocation in general, but is by default also inherited by all job steps and tasks/MPI ranks.

The option should either be set sufficiently coarse for every job step the job includes, or removed entirely from the header to enable finer-grained affinity settings for individual job steps.

A typical MPMD job script header then looks like the following:

…
#SBATCH --nodes=1
#SBATCH --ntasks=5
#SBATCH --time=…
…

Note

The parameters determining the number of nodes and tasks MUST be set and MUST be consistent with the multi-prog config file. Here, the values are chosen to be consistent with the example below.

Note

If this is done via JUBE, the simplest way is to copy the files platform.xml and submit.job.in from the JUBE include path into a user-defined include directory, and to edit submit.job.in (the template for the job script) accordingly. This directory can then be used as a custom include path in the JUBE configuration.

To allow fine-grained, manual affinity configuration of the MPMD programs, the outer job step has to release control over CPU affinity and launch the (child) MPMD programs with an open CPU mask, using the flag --cpu-bind=none.

srun --cpu-bind=none,verbose --multi-prog multi.conf

The number of tasks (--ntasks=5 in this example) is inherited from the outer allocation.

Step 2 - Individual Affinities

Having manually cleared the CPU mask for all ranks, one can now use numactl to set affinities through individual CPU masks for the MPMD programs. numactl is used as a wrapper around the program and can be augmented by further wrapper tools, such as env to set environment variables for the program.
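The wrapper pattern itself is plain process composition: each tool on the line adjusts the environment or the CPU mask and then executes the rest of the line. As a minimal illustration of the env part (using a stand-in command instead of a real binary; numactl would slot in the same way between env and the program):

```shell
# env sets variables for the process it launches; the wrapped command then
# sees them. Here a stand-in shell command plays the role of the binary.
env OMP_NUM_THREADS=4 sh -c 'echo "threads=$OMP_NUM_THREADS"'
```

This prints `threads=4`, confirming that the variable is visible only to the wrapped process, not to the calling shell.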

In the following example, a multi.conf configuration for a 5-task MPMD job is shown, in which task 0 executes a CPU program and tasks 1-4 execute (the same) GPU program. The numactl option --physcpubind is used to bind programs to specific physical CPU IDs. For task 0, typical OpenMP environment variables are set; for tasks 1-4, the CUDA_VISIBLE_DEVICES environment variable is used to select a distinct GPU per task.

0 env OMP_NUM_THREADS=124 OMP_PROC_BIND=spread OMP_PLACES=threads numactl --physcpubind=1-5,7-17,19-29,31-41,43-47 binary.cpu
1 env CUDA_VISIBLE_DEVICES=0 numactl --physcpubind=18 binary.gpu
2 env CUDA_VISIBLE_DEVICES=1 numactl --physcpubind=6 binary.gpu
3 env CUDA_VISIBLE_DEVICES=2 numactl --physcpubind=42 binary.gpu
4 env CUDA_VISIBLE_DEVICES=3 numactl --physcpubind=30 binary.gpu

Note

The physcpubind settings in this example should take NUMA domains into account, so that each GPU's rank is bound to cores in the CPU NUMA domain physically connected to that GPU.

Further notes:

  • Please verify that pinning occurs as you expect after defining these affinities. htop and nvidia-smi can be good options for doing this. Additionally, we offer the sgoto command, which can be used to log in to a particular node of a particular job, as discussed in sgoto.

  • This process can be implemented in our benchmarking framework, JUBE, as detailed in comments above.
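To inspect the resulting bindings directly, a tiny helper script can print the CPU mask each task actually received (a sketch; the script name and its use in a multi.conf line are our own, not part of the example above):

```shell
#!/bin/bash
# check_affinity.sh (hypothetical name): print which CPUs this process may run on.
# Launched through srun (e.g. via a multi.conf line such as "0 ./check_affinity.sh"),
# each task reports the mask it actually received.
rank="${SLURM_PROCID:-?}"   # global task number, set by srun; "?" outside Slurm
mask=$(grep Cpus_allowed_list /proc/self/status | awk '{print $2}')
echo "task ${rank}: allowed CPUs ${mask}"
```

Run in place of (or wrapped like) the real binaries, this shows whether the numactl masks from multi.conf were applied as intended.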

Warning

When combining CPU and GPU computation it is easy to end up with suboptimal performance, e.g. when the GPU needs to wait for results from the CPU or vice versa. It is important that you understand your code and ensure that appropriate load balancing is performed. As a first, straightforward check of your code's behaviour, our monitoring tool LLView can be used to verify that the general distribution of tasks and the load on CPU and GPU are approximately as expected. More detailed instrumentation can be performed with tools such as Score-P, Cube, and Vampir. Load balancing depends on the tasks performed, and it may turn out that using the CPU resources is ultimately not efficient at all.