Heterogeneous and Cross-Module Jobs
With Slurm 17.11, support for heterogeneous jobs was introduced. A heterogeneous job consists of several job components, all of which can have individual job options. In particular, the different components can request nodes from different partitions. That way, a heterogeneous job can, for example, span multiple modules of our supercomputers.
Specifying Individual Job Options
The syntax of the interactive and non-interactive submission mechanisms – salloc and sbatch – as well as srun has been extended to allow the user to specify individual options for the different job components.
For salloc and srun, the sequence of command line arguments is partitioned into several blocks, with the colon
: acting as the separator.
The resulting heterogeneous job will have as many job components as there were blocks of command line arguments.
The first block of arguments contains the job options of the first job component as well as common job options that will apply to all other components.
The second block contains options for the second job component and so on.
The abstract syntax is as follows:
$ salloc <options 0 + common> : <options 1> [ : <options 2>... ]
The following invocation of
salloc submits an interactive heterogeneous job that consists of two components, the first requesting one node from the
partition_a partition, the second requesting 16 nodes from the partition_b partition:
$ salloc -A budget -p partition_a -N 1 : -p partition_b -N 16
Submitting non-interactive heterogeneous jobs through
sbatch works similarly, but the syntax for separating blocks of options in a batch script is slightly different.
Instead of the colon
:, batch scripts use the usual directive
#SBATCH followed by the word
hetjob as a separator:
#!/bin/bash
#SBATCH <options 0 + common>
#SBATCH hetjob
#SBATCH <options 1>
[#SBATCH hetjob
#SBATCH <options 2>...]
To submit a non-interactive heterogeneous job with the same setup as the interactive job above, the jobscript would read
#!/bin/bash
#SBATCH -A budget -p partition_a -N 1
#SBATCH hetjob
#SBATCH -p partition_b -N 16
...
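The script is then submitted as usual with a single sbatch call; Slurm assigns one job ID and displays the individual components with a +<offset> suffix in the queue. A minimal sketch (the script name hetjob.sh is a placeholder):

```shell
# Submit the heterogeneous job script (hypothetical file name):
sbatch hetjob.sh

# The components appear in the queue as <jobid>+0, <jobid>+1, ...
squeue --me
```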
As always, one can also specify job options on the
sbatch command line and even mix options specified on the command line and in the batch script.
Again, the colon
: acts as the separator of blocks of command line arguments.
For example, to pin particular job components to certain partitions, the partitions can be specified in the job script while the number of nodes is left to be given on the command line.
The following batch script, submitted via
sbatch -N 1 : -N 16 <batch script> results in the same heterogeneous job as the previous two examples.
#!/bin/bash
#SBATCH -A budget -p partition_a
#SBATCH hetjob
#SBATCH -p partition_b
...
An overview of the available partitions can be found in the Available Partitions section.
Running Job Components Side by Side
As with homogeneous jobs, applications are launched inside a heterogeneous job using
srun. Different options and commands for the different components are again specified through blocks of command line arguments separated by the colon:
$ srun <options and command 0> : <options and command 1> [ : <options and command 2> ]
For example, in a heterogeneous job with two components,
srun accepts up to two blocks of arguments and commands:
$ srun --ntasks-per-node 24 ./prog1 : --ntasks-per-node 1 ./prog2
The first block applies to the first component, the second block to the second component, and so on. If there are fewer blocks than job components, the resources of the remaining job components go unused, as no application is launched there.
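For instance, in the two-component job from above, an srun invocation with a single block of arguments (a sketch; ./prog1 is a placeholder) only launches on the first component, and the 16 nodes of the second component stay idle for the duration of that job step:

```shell
# Only one block of arguments: launches on job component 0 only;
# component 1 (the 16 partition_b nodes) remains unused.
srun --ntasks-per-node 24 ./prog1
```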
The option
--het-group=<expr> can be used to explicitly assign a block of command line arguments to a job component.
Its argument
<expr> is either a single job component index in the range
0 ... n - 1, where
n is the number of job components, a range of indices like
1-3, or a comma separated list of both indices and ranges.
The following invocation of
srun runs the same application
./prog in components
0 and 2 of a three-component heterogeneous job, leaving component 1 unused:
$ srun --het-group=0,2 ./prog
The same application
./prog can be run in all three job components using:
$ srun --het-group=0-2 ./prog
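The --het-group option can also be combined with per-block commands. A sketch, assuming the same three-component job and placeholder program names, that runs one program on component 0 and another on components 1 and 2:

```shell
# Map blocks to components explicitly (placeholder program names):
# ./prog1 runs in component 0, ./prog2 in components 1 and 2.
srun --het-group=0 ./prog1 : --het-group=1-2 ./prog2
```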
Loading Software in a Heterogeneous Environment
Executing applications in a modular environment can be a challenging task, especially when different modules have different architectures or the dependencies of the programs are not uniform.
Uniform Architecture and Dependencies
As long as the architectures of the given modules are uniform and there are no mutually exclusive dependencies for the binaries that are going to be executed, one can rely on the usual
module command.
Take a look at Software Modules if
module is new for you.
#!/bin/bash -x
#SBATCH ...
module load [...]
srun ./prog1 : ./prog2
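Filled in for the two-component example used throughout this page, the script might look as follows (the module names GCC and ParaStationMPI are placeholders for whatever your applications actually need):

```shell
#!/bin/bash -x
#SBATCH -A budget -p partition_a -N 1
#SBATCH hetjob
#SBATCH -p partition_b -N 16

# One module environment shared by all job components
# (placeholder module names):
module load GCC ParaStationMPI
srun ./prog1 : ./prog2
```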
Non-Uniform Architectures and Mutually Exclusive Dependencies
When submitting jobs to modules that have different architectures (and therefore different software stacks), or to modules other than the one the submitting node belongs to (for instance, a login node of a Cluster module submitting a Booster job), one needs to address the fact that the environment inherited by the job’s processes will not be correct. With this particular scenario in mind, we have developed a tool called
xenv (short for eXtended environment).
This tool modifies the environment by loading the desired software modules. The concept is the same as the
module command. However, the
module command relies on environment variables that might not be correctly set when submitting modular jobs to supercomputing modules that have a different software stack than the one used to submit the job. Because
xenv is a node-local tool, it knows which software stack is the correct one for each node and where to locate the appropriate modules for it.
The basic usage is the following:
srun --account=<budget account> \
     --partition=<batch, ...> xenv -L GCC -L ParaStationMPI IMB-1 : \
     --partition=<booster, ...> xenv -L GCC -L ParaStationMPI IMB-1
The above example will run
IMB-1 in two separate supercomputing modules, with a single communicator, but it will correctly load the software modules for each architecture. One could load different software modules if the job requires it. However, the sets of modules used by different job components should be mutually compatible, e.g., mixing different compilers should work, while different MPI libraries are unlikely to communicate successfully with one another. We should note that the order in which software modules are specified in
xenv is important.
Because the list of modules can get long in some cases,
xenv also supports loading software module collections. As a user, you can create a collection for, say, the Cluster and another one for the Booster, and give them separate names. Then you can load your long list of modules by referring to those names, as in the example below:
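Collections are the standard Lmod mechanism: load the desired modules once, then save them under a name with module save. A sketch, assuming hypothetical module and collection names matching the example below:

```shell
# Load the modules needed on the Cluster, then save them as a
# named collection (hypothetical module and collection names):
module load GCC ParaStationMPI
module save cluster-collection

# List the collections you have defined:
module savelist
```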
srun --account=<budget account> \
     --partition=<batch, ...> xenv -R cluster-collection IMB-1 : \
     --partition=<booster, ...> xenv -R booster-collection IMB-1
For more information on software module collections please take a look at https://lmod.readthedocs.io/en/latest/010_user.html#user-collections-label
MPI Traffic Across Modules
Since on JUWELS both supercomputing modules (Cluster and Booster) are part of the same InfiniBand fabric, there are no limitations when it comes to using the MPI of your choice. Both ParaStationMPI and OpenMPI work in heterogeneous jobs as long as the same MPI is used for all job components. Note, however, that IntelMPI is not available on the Booster: it lacks CUDA-awareness, and it therefore makes no sense to deploy it on that module.