Heterogeneous and Cross-Module Jobs

Heterogeneous Jobs

Support for heterogeneous jobs was introduced with Slurm 17.11. A heterogeneous job consists of several job components, each of which can have its own job options. In particular, the different components can request nodes from different partitions. That way, a heterogeneous job can, for example, span multiple modules of our supercomputers.

Specifying Individual Job Options

The syntax of the interactive and non-interactive submission mechanisms – salloc and sbatch – has been extended to allow the user to specify individual options for the different job components. For salloc, the sequence of command line arguments is partitioned into several blocks with the colon : acting as the separator. The resulting heterogeneous job will have as many job components as there were blocks of command line arguments. The first block of arguments contains the job options of the first job component as well as common job options that will apply to all other components. The second block contains options for the second job component and so on. The abstract syntax is as follows:

$ salloc <options 0 + common> : <options 1> [ : <options 2>... ]

The following invocation of salloc submits an interactive heterogeneous job that consists of two components, the first requesting one node from the partition_a partition, the second requesting 16 nodes from the partition_b partition.

$ salloc -A budget -p partition_a -N 1 : -p partition_b -N 16
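
The way a stand-alone colon partitions the arguments into blocks can be made visible with a small shell helper. This is only a sketch: split_het_blocks is a hypothetical function, not part of Slurm, and it merely prints the blocks without submitting anything.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): print one line per job
# component, starting a new block at every stand-alone ":" argument.
split_het_blocks() {
    local idx=0 block="" arg
    for arg in "$@"; do
        if [[ "$arg" == ":" ]]; then
            echo "component ${idx}: ${block}"
            idx=$((idx + 1))
            block=""
        else
            # Append the argument, separated by a space if the block is non-empty.
            block="${block:+$block }$arg"
        fi
    done
    echo "component ${idx}: ${block}"
}

# Dry run with the arguments of the salloc example above.
split_het_blocks -A budget -p partition_a -N 1 : -p partition_b -N 16
# prints:
#   component 0: -A budget -p partition_a -N 1
#   component 1: -p partition_b -N 16
```

Two blocks are printed, so the job would have two components. The -A budget option appears only in block 0 because common options are given in the first block.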

Submitting non-interactive heterogeneous jobs through sbatch works similarly, but the syntax for separating blocks of options in a batch script is slightly different. Instead of the colon :, batch scripts use the usual directive #SBATCH followed by the word hetjob as a separator:

#!/bin/bash
#SBATCH <options 0 + common>
#SBATCH hetjob
#SBATCH <options 1>
[
#SBATCH hetjob
#SBATCH <options 2>...
]

To submit a non-interactive heterogeneous job with the same setup as the interactive job above, the job script would read:

#!/bin/bash
#SBATCH -A budget -p partition_a -N 1
#SBATCH hetjob
#SBATCH -p partition_b -N 16
...
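
To check how the #SBATCH hetjob separator groups the directives of a job script before submitting it, the script can be scanned with awk. This is a minimal sketch: list_sbatch_components is a hypothetical helper, not a Slurm tool. It only looks at lines that begin with #SBATCH and ignores every other sbatch parsing rule.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm tool): list the #SBATCH options of a
# job script, grouped by job component. "#SBATCH hetjob" starts a new one.
list_sbatch_components() {
    awk '
        BEGIN { idx = 0 }
        /^#SBATCH[ \t]+hetjob[ \t]*$/ { idx++; next }
        /^#SBATCH[ \t]+/ {
            sub(/^#SBATCH[ \t]+/, "")   # strip the directive prefix
            print "component " idx ": " $0
        }
    ' "$1"
}

# Demo with the directives of the job script above, written to a temporary file.
script="$(mktemp)"
cat > "$script" <<'EOF'
#!/bin/bash
#SBATCH -A budget -p partition_a -N 1
#SBATCH hetjob
#SBATCH -p partition_b -N 16
EOF
list_sbatch_components "$script"
rm -f "$script"
# prints:
#   component 0: -A budget -p partition_a -N 1
#   component 1: -p partition_b -N 16
```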

As always, job options can also be specified on the sbatch command line, and options given on the command line can be mixed with those in the batch script. Again, the colon : acts as the separator between blocks of command line arguments. For example, the partitions on which particular job components should always run can be fixed in the job script, while the number of nodes is left to be specified on the command line. The following batch script, submitted via sbatch -N 1 : -N 16 <batch script>, results in the same heterogeneous job as the previous two examples.

#!/bin/bash
#SBATCH -A budget -p partition_a
#SBATCH hetjob
#SBATCH -p partition_b
...

An overview of the available partitions can be found on the Available Partitions page.

Running Job Components Side by Side

As with homogeneous jobs, applications are launched inside a heterogeneous job using srun. Like salloc and sbatch, srun can be used to specify different options and also commands to run for different components through blocks of command line arguments separated by the colon :.

$ srun <options and command 0> : <options and command 1> [ : <options and command 2> ]

For example, in a heterogeneous job with two components, srun accepts up to two blocks of arguments and commands:

$ srun --ntasks-per-node 24 ./prog1 : --ntasks-per-node 1 ./prog2

The first block applies to the first component, the second block to the second component and so on. If there are fewer blocks than job components, the resources of the remaining job components go unused as no application is launched there.

The option --het-group=<expr> can be used to explicitly assign a block of command line arguments to a job component. It takes as its argument <expr> either a single job component index in the range 0 ... n - 1, where n is the number of job components, or a range of indices like 1-3, or a comma-separated list of both indices and ranges like 1,3-5. The following invocation of srun runs the same application ./prog in components 0 and 2 of a three-component heterogeneous job, leaving component 1 idle:

$ srun --het-group=0,2 ./prog

The same application ./prog can be run in all three job components using:

$ srun --het-group=0-2 ./prog
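
To see which component indices a given <expr> selects, the expression can be expanded by hand or with a few lines of shell. This is a sketch under the syntax described above (single indices, ranges and comma-separated lists); expand_het_group is a hypothetical helper, not part of Slurm, and performs no validation of its input.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): expand a --het-group expression
# such as "1,3-5" into the list of selected job component indices.
expand_het_group() {
    local expr="$1" part lo hi i out=""
    local IFS=','              # split the expression at commas
    for part in $expr; do
        if [[ "$part" == *-* ]]; then
            lo="${part%-*}"    # text before the dash
            hi="${part#*-}"    # text after the dash
        else
            lo="$part"
            hi="$part"
        fi
        for ((i = lo; i <= hi; i++)); do
            out="${out:+$out }$i"
        done
    done
    echo "$out"
}

expand_het_group 0,2      # prints: 0 2
expand_het_group 0-2      # prints: 0 1 2
expand_het_group 1,3-5    # prints: 1 3 4 5
```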

For detailed information about Slurm, please take a look at the Batch system page. The official Slurm documentation on heterogeneous jobs provides additional information on this feature.

Loading Software in a Heterogeneous Environment

Executing applications in a modular environment, especially when different modules have different architectures or the dependencies of programs are not uniform, can be a challenging task.

Uniform Architecture and Dependencies

As long as the architectures of the given modules are uniform and the binaries to be executed have no mutually exclusive dependencies, one can rely on the module command. Take a look at Software Modules if module is new to you.

#!/bin/bash -x
#SBATCH ...
module load [...]
srun ./prog1 : ./prog2

Non-Uniform Architectures and Mutually Exclusive Dependencies

When submitting jobs to modules that have different architectures (and therefore different software stacks), or to a module other than the one the submitting node belongs to (for instance, a login node of the Cluster module submitting a Booster job), one needs to address the fact that the environment inherited by the job’s processes will not be correct. With this particular scenario in mind, we have developed a tool called xenv (short for eXtended env).

This tool modifies the environment by loading the desired software modules. The concept is the same as the module command. However, the module command relies on environment variables that might not be correctly set when submitting modular jobs to supercomputing modules that have a different software stack than the one used to submit the job. Because xenv is a node-local tool it knows which software stack is the correct one for each node, and where to locate the appropriate modules for it.

The basic usage is the following:

srun --account=<budget account> \
  --partition=<batch, ...> xenv -L GCC -L ParaStationMPI IMB-1 : \
  --partition=<booster, ...> xenv -L GCC -L ParaStationMPI IMB-1

The above example will run IMB-1 in two separate supercomputing modules, with a single communicator, but it will correctly load the software modules for each architecture. One could load different software modules if the job requires it. However, the sets of modules used by different job components should be mutually compatible, e.g., mixing different compilers should work, while different MPI libraries are unlikely to communicate successfully with one another. We should note that the order in which software modules are specified in xenv is important.

Because the list of modules can get long in some cases, xenv also supports loading software module collections. As a user you can create a collection for, let’s say, the Cluster, and another one for the Booster, and give them separate names. You can then load your long list of modules through the collection, as in the example below:

srun --account=<budget account> \
  --partition=<batch, ...> xenv -R cluster-collection IMB-1 : \
  --partition=<booster, ...> xenv -R booster-collection IMB-1
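
Such collections can be created with Lmod's module save command. The following is a minimal sketch, not a site-specific recipe: save_collection is a hypothetical wrapper, and the module names are simply those used in the earlier xenv example. It assumes that a collection should be saved from a node of the supercomputing module it is meant for, and that module purge gives a suitable clean starting point on your system.

```shell
#!/bin/bash
# Hypothetical wrapper around Lmod: load the given modules from a clean
# state and store them as a named collection.
# Usage: save_collection <collection name> <module> [<module> ...]
save_collection() {
    local name="$1"
    shift
    module purge            # start without previously loaded modules
    module load "$@"        # order matters, as noted for xenv above
    module save "$name"     # Lmod stores the collection under this name
}

# Example calls (assumption: each is run on a node of the matching module):
#   save_collection cluster-collection GCC ParaStationMPI
#   save_collection booster-collection GCC ParaStationMPI
```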

For more information on software module collections please take a look at https://lmod.readthedocs.io/en/latest/010_user.html#user-collections-label

MPI Traffic Across Modules

Since both supercomputing modules of JUWELS (Cluster and Booster) are part of the same InfiniBand fabric, there are no limitations when it comes to using the MPI of your choice. Both ParaStationMPI and OpenMPI work in heterogeneous jobs as long as the same MPI is used for all job components. Note, however, that IntelMPI is not available on the Booster: it lacks CUDA-awareness, so it makes no sense to deploy it in that module.