Heterogeneous and Cross-Module Jobs

Heterogeneous Jobs

With Slurm 17.11, support for heterogeneous jobs was introduced. A heterogeneous job consists of several job components, each of which can have individual job options. In particular, the different components can request nodes from different partitions. That way, a heterogeneous job can, for example, span multiple modules of our supercomputers.

Specifying Individual Job Options

The syntax of the interactive and non-interactive submission mechanisms -- salloc and srun -- has been extended to allow the user to specify individual options for the different job components. The sequence of command line arguments is partitioned into several blocks, with the colon : acting as the separator. The resulting heterogeneous job has as many job components as there are blocks of command line arguments. The first block of arguments contains the job options of the first job component as well as common job options that apply to all other components. The second block contains the options for the second job component, and so on. The abstract syntax is as follows:

$ salloc <options 0 + common> : <options 1> [ : <options 2>... ]

The following invocation of salloc submits an interactive heterogeneous job that consists of two components, the first requesting one node from the partition_a partition, the second requesting 16 nodes from the partition_b partition.

$ salloc -A budget -p partition_a -N 1 : -p partition_b -N 16

Submitting non-interactive heterogeneous jobs through sbatch works similarly, but the syntax for separating blocks of options in a batch script is slightly different. Instead of the colon :, batch scripts use the usual directive #SBATCH followed by the word packjob as a separator:

#!/bin/bash
#SBATCH <options 0 + common>
#SBATCH packjob
#SBATCH <options 1>
[
#SBATCH packjob
#SBATCH <options 2>...
]

To submit a non-interactive heterogeneous job with the same setup as the interactive job above, the jobscript would read

#!/bin/bash
#SBATCH -A budget -p partition_a -N 1
#SBATCH packjob
#SBATCH -p partition_b -N 16
...

As always, one can also specify job options on the sbatch command line and even mix options specified on the command line and in the batch script. Again, the colon : acts as the separator between blocks of command line arguments. For example, to pin particular job components to certain partitions, the partitions could be specified in the job script, while the number of nodes is left to be specified on the command line. The following batch script, submitted via sbatch -N 1 : -N 16 <batch script>, results in the same heterogeneous job as the previous two examples.

#!/bin/bash
#SBATCH -A budget -p partition_a
#SBATCH packjob
#SBATCH -p partition_b
...

An overview of the available partitions can be found on the Quick Introduction page.

Running Job Components Side by Side

As with homogeneous jobs, applications are launched inside a heterogeneous job using srun. Like salloc and sbatch, srun accepts blocks of command line arguments separated by the colon :, which specify different options and different commands to run for the individual components.

$ srun <options and command 0> : <options and command 1> [ : <options and command 2> ]

For example, in a heterogeneous job with two components, srun accepts up to two blocks of arguments and commands:

$ srun --ntasks-per-node 24 ./prog1 : --ntasks-per-node 1 ./prog2

The first block applies to the first component, the second block to the second component, and so on. If there are fewer blocks than job components, the resources of the remaining job components go unused as no application is launched there.
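
For example, following the behaviour just described, the command below (with ./prog1 as a placeholder application) would launch tasks only in the first component of a two-component job and leave the nodes of the second component unused:

$ srun --ntasks-per-node 24 ./prog1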

The option --pack-group=<expr> can be used to explicitly assign a block of command line arguments to a job component. Its argument <expr> is either a single job component index in the range 0 ... n - 1, where n is the number of job components, a range of indices like 1-3, or a comma-separated list of indices and ranges like 1,3-5. The following invocation of srun runs the same application ./prog in components 0 and 2 of a three-component heterogeneous job, leaving component 1 idle:

$ srun --pack-group=0,2 ./prog

The same application ./prog can be run in all three job components using:

$ srun --pack-group=0-2 ./prog

For detailed information about Slurm, please take a look at the Quick Introduction and Batch system pages; the official Slurm documentation on heterogeneous jobs provides additional information on this feature.

Loading Software in a Heterogeneous Environment

Executing applications in a modular environment, especially when different modules have different architectures or the dependencies of programs are not uniform, can be a challenging task.

Uniform Architecture and Dependencies

As long as the architectures of the given modules are uniform and there are no mutually exclusive dependencies for the binaries that are going to be executed, one can rely on the module command. Take a look at the Quick Introduction if module is new to you.

#!/bin/bash -x
#SBATCH ...
module load [...]
srun ./prog1 : ./prog2

Non-Uniform Architectures and Mutually Exclusive Dependencies

A tool called xenv was implemented to ease the task of loading modules for heterogeneous jobs. For details on supported command line arguments, execute xenv -h on the given system.

srun --account=<budget account> \
  --partition=<batch, ...> xenv -L intel-para IMB-1 : \
  --partition=<knl, ...> xenv -L Architecture/KNL -L intel-para IMB-1

MPI Traffic Across Modules

When the nodes of a job belong to different interconnects and MPI communication is used, bridging has to take place. To support this workflow, e.g. running a job on a cluster with InfiniBand and a booster with Omni-Path, a gateway daemon (psgwd, ParaStation Gateway Daemon) was implemented that takes care of moving packets across fabrics.

For this functionality, ParaStationMPI is required and must be loaded via either module load or xenv!

Requesting Gateways

To request gateway nodes for a job, the mandatory option gw_num has to be specified at submit/allocation time.

  • There are in total 198 gateways available.
  • The gateways are exclusive resources; they are not shared across user jobs. This may change in the future.
  • There is currently no enforced maximum on the number of gateways per job, besides the total number of gateways. This may change in the future.
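
For example, an interactive allocation spanning two partitions and requesting two gateway nodes could look as follows (partition names and the budget account are placeholders, following the syntax introduced above):

salloc --gw_num=2 -A <budget account> \
  -p <batch, ...> -N 1 : \
  -p <booster, ...> -N 2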

Submitting Jobs

To start an interactive pack job using two gateway nodes, the following command must be used:

srun --gw_num=2 -A <budget account> \
  -p <batch, ...>  xenv [-L ...] -L ParaStationMPI ./prog1 : \
  -p <booster, ...> xenv [-L ...] -L ParaStationMPI ./prog2

When submitting a job that will run later, you can either specify the number of gateways at submit time:

sbatch --gw_num=2 ./submit-script.sbatch

or via an sbatch script directive:

#!/bin/bash
#SBATCH --gw_num=1
#SBATCH -A <budget account>
#SBATCH -p <batch, ...>
#SBATCH packjob
#SBATCH -p <booster, ...>

srun xenv [-L ...] -L ParaStationMPI ./prog1 : \
  xenv [-L ...] -L ParaStationMPI ./prog2

When you use the latter approach, you must make sure that gw_num is specified before the first packjob occurrence.

PSGWD

PSGWD Slurm Extension

The psgw plugin for the ParaStation management daemon extends the Slurm commands salloc, srun and sbatch with the following options:

--gw_num=#
Number of gateway nodes that have to be allocated.
--gw_file=path
Path of the routing file.
--gw_plugin=string
Name of the route plugin.

As long as no other path is specified, the routing file is generated in the directory where the job was submitted/started. With the option gw_file, a user-defined absolute path for the routing file can be specified.
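
For example, to have the routing file written to a specific location, the options could be combined as follows (the path is only an illustration):

srun --gw_num=1 --gw_file=$HOME/psgw-routes/route-example -N 1 hostname : -N 1 hostname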

PSGWD Routing

The routing of MPI traffic across the gateway nodes is performed by the ParaStation Gateway Daemon on a per-node-pair basis. When a certain number of gateway nodes is requested, an instance of psgwd is launched on each gateway. By default, given the list of cluster and booster nodes obtained at allocation time, the system assigns each cluster node - booster node pair to one of the previously launched psgwd instances. This mapping between cluster and booster nodes is saved into a routing file and used for routing the MPI traffic across the gateway nodes.

The routing can be influenced via the gw_plugin option:

srun --gw_plugin=$HOME/custom-route-plugin --gw_num=2 -N 1 hostname : -N 2 hostname

The gw_plugin option accepts either a label for a plugin already installed on the system, or the path to a user-defined plugin.

Currently two plugins are available on the JURECA system:

  • plugin01 is the default plugin (used when the gw_plugin option is not specified).
  • plugin02 is better suited for applications that use point-to-point communication between the same pairs of processes on the cluster and the booster, especially when the number of gateway nodes is low.
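
To select an installed plugin by its label rather than by a path, the label can be passed directly, e.g.:

srun --gw_plugin=plugin02 --gw_num=2 -N 1 hostname : -N 2 hostname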

The plugin file must include the functions that associate a gateway node with a cluster node - booster node pair. As an example, the code for plugin01 is shown here:

# Route function: Given the numerical IDs of nodes in partitions A and B, the function
# returns a tuple (error, numeral of gateway)
def routeConnectionS(sizePartA, sizePartB, numGwd, numeralNodeA, numeralNodeB):
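  # Spread the node pairs across the available gateways: the gateway numeral is the
  # sum of the two node numerals modulo the number of gateways.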
  numeralGw = (numeralNodeA + numeralNodeB) % numGwd

  return None, numeralGw

# Route function (extended interface): Make decision based on names of nodes to
# take topology into account
# def routeConnectionX(nodeListPartA, nodeListPartB, gwList, nodeA, nodeB):
#       return Exception("Not implemented"), gwList[0]
routeConnectionX = None

In the case of 2 cluster nodes, 2 booster nodes and 2 gateway nodes, this function results in the following mapping:

Cluster node   Booster node   Gateway node
0              0              0
1              0              1
0              1              1
1              1              0

PSGWD Gateway Assignment

If more gateways are requested than are available, the slurmctld prologue will fail for interactive jobs.

srun --gw_num=3 -N 1 hostname : -N 2 hostname
srun: psgw: requesting 3 gateway nodes
srun: job 158553 queued and waiting for resources
srun: job 158553 has been allocated resources
srun: PrologSlurmctld failed, job killed
srun: Force Terminated job 158553
srun: error: Job allocation 158553 has been revoked

If batch jobs run out of gateway resources, they will be re-queued and have to wait for 10 minutes before becoming eligible to be scheduled again.

Debugging

For debugging purposes, and to make sure the gateways are used, you might use

export PSP_DEBUG=3

You should see output like

<PSP:r0000003:CONNECT (192.168.12.34,26708,0x2,r0000003) to (192.168.12.41,29538,0x2,r0000004) via gw>
<PSP:r0000004:ACCEPT  (192.168.12.34,26708,0x2,r0000003) to (192.168.12.41,29538,0x2,r0000004) via gw>

JUROPA3

Because JUROPA3 has only one high-speed interconnect, using psgwd is only possible with PSP_GATEWAY=2. Exporting this environment variable boosts the priority of the gateway protocol over that of the default interconnect.

export PSP_GATEWAY=2
srun -A <budget account> \
  -p <cluster, ...> --gw_num=2 xenv -L ParaStationMPI ./prog1 : \
  -p <booster, ...> xenv -L ParaStationMPI ./prog2

JUROPA3 has 4 gateways available.