Batch system

Overview

On JUWELS, the Slurm (Simple Linux Utility for Resource Management) Workload Manager, a free and open-source resource manager and batch system, is employed. Slurm is a modern, extensible batch system that is widely deployed around the world on clusters of various sizes.

A Slurm installation consists of several programs and daemons. The slurmctld daemon is the central brain of the batch system, responsible for monitoring the available resources and scheduling batch jobs. It runs on an administrative node with a special setup to ensure availability in the case of hardware failures. Most user programs such as srun, sbatch, salloc and scontrol interact with slurmctld. For the purpose of job accounting, slurmctld communicates with the slurmdbd database daemon. Information from the accounting database can be queried using the sacct command. Slurm combines the functionality of a batch system and a resource manager. For this purpose Slurm provides the slurmd daemon, which runs on the compute nodes and interacts with slurmctld. To execute user processes, slurmd spawns slurmstepd instances which shepherd the user processes.

On JUWELS no slurmd daemon is running on the compute nodes. Instead, process management is performed by psid, the management daemon of the ParaStation Cluster Suite. The psslurm plugin to psid replaces slurmd on the compute nodes of JUWELS. Therefore only one daemon is required on each compute node for resource management, which minimizes the jitter that could affect large-scale applications.

Available Partitions

In Slurm multiple nodes can be grouped into partitions which are sets of nodes with associated limits (for wall-clock time, job size, etc.). Partitions can overlap. Please refer to the Quick Introduction page for a list of the available partitions on JUWELS.
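For example, a summary of the partitions, including their time limits and node counts, can be printed with:

sinfo --summarize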

Allocations, Jobs and Job Steps

In Slurm a job is an allocation of selected resources for a specific amount of time. A job allocation can be requested using sbatch and salloc. Within a job multiple job steps can be executed using srun that use all or a subset of the allocated compute nodes. Job steps may execute at the same time if the resource allocation permits it.

Slurm commands

Below, the most important Slurm user commands available on JUWELS are described.

sbatch

is used to submit a batch script (which can be a bash, Perl or Python script)

The script will be executed on the first node of the allocation chosen by the scheduler. The working directory of the job is the directory from which sbatch was invoked. Within the script, one or multiple srun commands can be used to create job steps and execute (MPI) parallel applications.

Note: mpiexec is not supported on JUWELS. srun is the only supported method to spawn MPI applications.
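As an illustration, a minimal job script might look as follows; <budget> and ./mpi-prog are placeholders for the compute budget and the MPI application:

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
srun ./mpi-prog

The script is submitted with sbatch <jobscript>; sbatch prints the ID of the new job.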

salloc

is used to request an allocation

When the job is started, a shell (or other program specified on the command line) is started on the submission host (login node). From the shell srun can be used to interactively spawn parallel applications. The allocation is released when the user exits the shell.
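For example, two nodes might be allocated interactively for 30 minutes with:

salloc --nodes=2 --time=00:30:00 --account=<budget>

A full transcript of such a session is shown in the Interactive Sessions section below.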

srun

is mainly used to create a job step within a job

srun can be executed with no arguments other than the program to run, in which case it uses the full allocation, or with additional arguments that restrict the job step to a subset of the allocated resources.
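For example, inside an allocation spanning two nodes, the following creates a job step running 12 tasks on a single node (./mpi-prog being a placeholder for the application):

srun --nodes=1 --ntasks=12 ./mpi-prog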

squeue

is used to query the list of pending and running jobs

By default it reports the pending jobs sorted by priority and, separately, the running jobs sorted by priority.
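For example, to list only your own jobs or to display the expected start times of your pending jobs:

squeue -u $USER
squeue --start -u $USER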

scancel
is used to cancel pending or running jobs or to send signals to processes in running jobs or job steps
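
For example, a job can be cancelled, or its processes sent a signal (here USR1), with:

scancel <jobid>
scancel --signal=USR1 <jobid>
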
scontrol

can be used to query information about compute nodes and running or recently completed jobs
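For example, detailed information about a compute node or a job can be displayed with:

scontrol show node <nodename>
scontrol show job <jobid>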

Note: For jobs that finished some time ago, scontrol show job <jobid> will not work; sacct -j <jobid> should be used instead.

sacct

is used to retrieve accounting information for jobs and job steps

For older jobs sacct queries the accounting database.
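For example, a compact report for a job and its job steps can be requested with:

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed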

sinfo
is used to retrieve information about the partitions and node states
sprio
can be used to query job priorities
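
For example, the state of a single partition and the priority components of a pending job can be inspected with:

sinfo -p <partition>
sprio -j <jobid>
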
smap

graphically shows the state of the partitions and nodes using a curses interface

We recommend LLview as an alternative; it is supported on all JSC machines.
sattach
allows attaching to the standard input, output or error of a running job
sstat
allows querying status information about a running job
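For example, the output of a running job step can be attached to, and its current resource usage queried, with:

sattach <jobid>.<stepid>
sstat -j <jobid>.<stepid> --format=JobID,AveCPU,MaxRSS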

Job Steps

The example below shows a job script in which two different job steps are initiated within one job. In total, 24 cores are allocated on two nodes, and each job step uses 12 cores on one of the compute nodes. The option --exclusive is passed to srun to ensure that distinct cores are allocated to each job step:

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=2
#SBATCH --ntasks=24
#SBATCH --ntasks-per-node=12
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
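# launch the two job steps in the background so that they run concurrently;
# wait blocks until both background steps have finished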
srun --exclusive -n 12 ./mpi-prog1 &
srun --exclusive -n 12 ./mpi-prog2 &
wait

Dependency Chains

Slurm supports dependency chains, i.e., collections of batch jobs with defined dependencies. Dependencies can be defined using the --dependency argument to sbatch:

sbatch --dependency=afterany:<jobid> <jobscript>

Slurm will guarantee that the new batch job (whose job ID is returned by sbatch) does not start before <jobid> terminates (successfully or not). It is possible to specify other types of dependencies, such as afterok which ensures that the new job will only start if <jobid> finished successfully.
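For example, besides afterany and afterok, dependency types such as afternotok (start only if <jobid> failed) and singleton (start only after all earlier jobs with the same job name and user have finished) can be used:

sbatch --dependency=afterok:<jobid> <jobscript>
sbatch --dependency=afternotok:<jobid> <jobscript>
sbatch --dependency=singleton <jobscript>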

Below, an example script for handling job chains is provided. The script submits a chain of ${NO_OF_JOBS} jobs, where each job only starts after the successful completion of its predecessor. Please note that a job which exceeds its time limit is not marked as successful:

#!/bin/bash -x
# submit a chain of jobs with dependencies
# number of jobs to submit
NO_OF_JOBS=<no of jobs>
# define jobscript
JOB_SCRIPT=<jobscript>
# submit the first job; awk extracts the job ID, the last field of the
# "Submitted batch job <jobid>" message printed by sbatch
echo "sbatch ${JOB_SCRIPT}"
JOBID=$(sbatch ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
# submit the remaining jobs, each depending on its predecessor
I=1
while [ ${I} -lt ${NO_OF_JOBS} ]; do
    echo "sbatch --dependency=afterok:${JOBID} ${JOB_SCRIPT}"
    JOBID=$(sbatch --dependency=afterok:${JOBID} ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
    let I=${I}+1
done

Interactive Sessions

As explained in the Quick Introduction, interactive sessions can be allocated using the salloc command. Below, a transcript of an exemplary interactive session is shown. Note that salloc executes a shell on the login node; srun can be run within the allocation without delay (the very first srun execution in a session may take slightly longer because of the node health checking it triggers). A shell can also be executed on a compute node using srun. In this case --pty has to be passed to srun to ensure that the bash is started in terminal mode:

[user1@juwels07 ~]$ hostname
juwels07.fz-juelich.de
[user1@juwels07 ~]$ salloc --nodes=2 --account=<budget>
salloc: Pending job allocation 218906
salloc: job 218906 queued and waiting for resources
salloc: job 218906 has been allocated resources
salloc: Granted job allocation 218906
user1@juwels07:~ $ hostname
juwels07.fz-juelich.de
user1@juwels07:~ $ srun --ntasks 2 --ntasks-per-node=2 hostname
jwc02n000.adm02.juwels.fzj.de
jwc03n024.adm03.juwels.fzj.de
user1@juwels07:~ $ srun --cpu_bind=none --nodes=1 --pty /bin/bash -i
[user1@jwc02n000 ~]$ hostname
jwc02n000.adm02.juwels.fzj.de
[user1@jwc02n000 ~]$ logout
user1@juwels07:~ $ hostname
juwels07.fz-juelich.de
user1@juwels07:~ $ exit
exit
salloc: Relinquishing job allocation 218906
[user1@juwels07 ~]$ hostname
juwels07.fz-juelich.de

To support X11 forwarding, the --forward-x option to srun is available.
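For example, an interactive shell with X11 forwarding might be obtained by combining --forward-x with the --pty option shown above:

srun --forward-x --pty /bin/bash -i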

Note: Your account will be charged per allocation whether the compute nodes are used or not. Batch submission is the preferred way to execute jobs.

Hold and Release Batch Jobs

Jobs that are in the pending state (i.e., not yet running) can be put on hold using:

scontrol hold <jobid>

Jobs that are on hold are still reported as pending (PD) by squeue, but the Reason shown by squeue or scontrol show job changes to JobHeldUser:

[user1@juwels07 ~]$ scontrol show job 218927
JobId=218927 JobName=jobscript.sh
   UserId=XXX(nnnn) GroupId=XXX(nnnn) MCS_label=N/A
   Priority=0 Nice=0 Account=XXX QOS=normal
   JobState=PENDING Reason=JobHeldUser Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2018-12-10T10:52:42 EligibleTime=Unknown
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=juwels07:14699
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2-2 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,node=2
   Socks/Node=* NtasksPerN:B:S:C=12:0:*:* CoreSpec=*
   MinCPUsNode=12 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=mem96 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/XXX/jobscript.sh
   WorkDir=/XXX
   StdErr=/XXX/mpi-err.218927
   StdIn=/dev/null
   StdOut=/XXX/mpi-out.218927
   Power=

The job can be released using:

$ scontrol release <jobid>