Batch system
JUWELS is accessed through a dedicated set of login nodes used to write and compile applications as well as to perform pre- and post-processing of simulation data. Access to the compute nodes in the system is controlled by the workload manager.
On JUWELS the Slurm (Simple Linux Utility for Resource Management) Workload Manager, a free open-source resource manager and batch system, is employed. Slurm is a modern, extensible batch system that is widely deployed around the world on clusters of various sizes.
A Slurm installation consists of several programs and daemons. The slurmctld
daemon is the central brain of the batch system responsible
for monitoring the available resources and scheduling batch jobs. The slurmctld
runs on an administrative node with a special setup to
ensure availability in the case of hardware failures. Most user programs such as srun
, sbatch
, salloc
and scontrol
interact with
the slurmctld
. For the purpose of job accounting slurmctld
communicates with the slurmdbd
database daemon.
Information from the accounting database can be queried using the sacct
command.
Slurm combines the functionality of the batch system and resource management. For this purpose Slurm provides the slurmd
daemon which
runs on the compute nodes and interacts with slurmctld
. To execute user processes, slurmstepd
instances are spawned by
slurmd
to shepherd the user processes.
On JUWELS no slurmd
is running on the compute nodes. Instead the process management is performed by psid
, the management daemon from the
Parastation Cluster Suite. A plugin psslurm
to psid
replaces slurmd
on the compute nodes of JUWELS. Therefore, only one daemon is
required on the compute nodes for resource management, which minimizes jitter that could affect large-scale applications.
Slurm Partitions
In Slurm, multiple nodes can be grouped into partitions, which are sets of nodes with associated limits (for wall-clock time, job size, etc.). In practice, these partitions can be used, for example, to request resources that have certain hardware characteristics (normal, large memory, accelerated, etc.) or that are dedicated to specific workloads (large production jobs, small debugging jobs, visualization, etc.).
Hardware Overview
JUWELS is a modular supercomputer consisting of a Cluster and a Booster module.
Note
Each module is equipped with dedicated login nodes. Job submission to other modules from these login nodes is possible but currently requires workarounds. For the time being we advise users to submit jobs for each module from its respective login partition.
JUWELS Cluster module
| Type | Quantity | Description |
|---|---|---|
| Standard / Slim nodes | 2271 | 48 cores, 96 GiB |
| Large memory nodes | 240 | 48 cores, 192 GiB |
| Accelerated nodes | 56 | 40 cores, 192 GiB, 4× V100 SXM2 GPUs |
| Login nodes | 12 | 40 cores, 768 GiB |
JUWELS Booster module
| Type | Quantity | Description |
|---|---|---|
| Booster nodes | 936 | 48 cores, 512 GiB, 4× A100 GPUs |
| Login nodes | 4 | 48 cores, 512 GiB |
Visualization login partition
| Type | Quantity | Description |
|---|---|---|
| Visualization login node | 4 | 40 cores, 768 GiB, P100 GPU |
Available Partitions
Compute nodes are used exclusively by jobs of a single user; no node sharing between jobs is done. The smallest allocation unit is one node (48 cores). Users will be charged for the number of compute nodes multiplied by the wall-clock time used. On each node, a share of the available memory is reserved and not available for application usage.
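The charging rule above (number of nodes times wall-clock time) can be sketched as a small shell helper. This is only an illustration: node_hours is a hypothetical function, not a JUWELS or Slurm tool, and the unit actually used for accounting on JUWELS may differ from plain node-hours.

```shell
# Estimate the node-hours consumed by a job: nodes x wall-clock time.
# node_hours is a hypothetical helper for illustration only.
node_hours() {
    local nodes=$1 seconds=$2
    awk -v n="$nodes" -v s="$seconds" 'BEGIN { printf "%.2f\n", n * s / 3600 }'
}

# 64 nodes used for 15 minutes (900 s)
node_hours 64 900
```

Because whole nodes are allocated, the estimate does not depend on how many cores of each node the application actually uses.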
The batch
, gpus
, mem192
and booster
partitions are intended for production jobs. To support development and code optimization, additional devel
partitions are available.
The batch
partition is the default partition used when no other partition is specified. It encompasses compute nodes in the JUWELS Cluster module with 96 GiB and 192 GiB main memory.
The gpus
partition provides access to JUWELS Cluster compute nodes with V100 GPUs.
The mem192
partition contains nodes in the JUWELS Cluster module with larger main memory.
The booster
partition encompasses compute nodes in the JUWELS Booster module.
A limit regarding the maximum number of running jobs per user is enforced. The precise values are adjusted to optimize system utilization. In general, the limit for the number of running jobs is lower for nocont projects.
In addition to the above mentioned partitions the large
and largebooster
partitions are available for large and full-system jobs.
The partitions are open for submission but jobs will only run in selected timeslots. The use of these partitions needs to be coordinated with the user support.
In order to request nodes with particular resources (gpu
) generic resources need to be requested at job submission.
JUWELS Cluster partitions
| Partition | Resource | Value |
|---|---|---|
| | max. wallclock time (normal / nocont) | 24 h / 6 h |
| | default wallclock time | 1 h |
| | min. / max. number of nodes | 1 / 1024 |
| | node types | mem96 (96 GiB) and mem192 (192 GiB) |
| | max. wallclock time (normal / nocont) | 24 h / 6 h |
| | default wallclock time | 1 h |
| | min. / max. number of nodes | 1 / 64 |
| | node types | |
| | max. wallclock time | 2 h |
| | default wallclock time | 30 min |
| | min. / max. number of nodes | 1 / 8 |
| | node types | |
| | max. wallclock time (normal / nocont) | 24 h / 6 h |
| | default wallclock time | 1 h |
| | min. / max. number of nodes | 1 / 46 |
| | node types | mem192, gpu:[1-4] (192 GiB, 4× V100 per node) |
| | max. wallclock time | 2 h |
| | default wallclock time | 1 h |
| | min. / max. number of nodes | 1 / 2 |
| | node types | mem192, gpu:[1-4] (192 GiB, 4× V100 per node) |
JUWELS Booster partitions
| Partition | Resource | Value |
|---|---|---|
| | max. wallclock time (normal / nocont) | 24 h / 6 h |
| | default wallclock time | 1 h |
| | min. / max. number of nodes | 1 / 384 |
| | node types | mem512, gpu:[1-4] (512 GiB, 4× A100 per node) |
| | max. wallclock time | 2 h |
| | default wallclock time | 1 h |
| | min. / max. number of nodes | 1 / 4 |
| | node types | mem512, gpu:[1-4] (512 GiB, 4× A100 per node) |
Allocations, Jobs and Job Steps
In Slurm a job is an allocation of selected resources for a specific amount of time. A job allocation can be requested using sbatch
and salloc
.
Within a job multiple job steps can be executed using srun
that use all or a subset of the allocated compute nodes. Job steps may execute at
the same time if the resource allocation permits it.
Writing a Batch Script
Users submit batch applications (usually bash scripts) using the sbatch
command. The script is executed on the first compute node in the allocation. To execute parallel MPI tasks users call srun
within their script.
Note
mpiexec
is not supported on JUWELS and has to be replaced by srun
.
The minimal template to be filled in is:
#!/bin/bash -x
#SBATCH --account=<budget account>
# budget account where contingent is taken from
#SBATCH --nodes=<no of nodes>
#SBATCH --ntasks=<no of tasks (MPI processes)>
# can be omitted if --nodes and --ntasks-per-node
# are given
#SBATCH --ntasks-per-node=<no of tasks per node>
# if keyword omitted: Max. 96 tasks per node
# (SMT enabled, see comment below)
#SBATCH --cpus-per-task=<no of threads per task>
# for OpenMP/hybrid jobs only
#SBATCH --output=<path of output file>
# if keyword omitted: Default is slurm-%j.out in
# the submission directory (%j is replaced by
# the job ID).
#SBATCH --error=<path of error file>
# if keyword omitted: Default is slurm-%j.out in
# the submission directory.
#SBATCH --time=<walltime>
#SBATCH --partition=<batch, booster, mem192, ...>
#SBATCH --gres=gpu:<n>
# for the gpus and booster partitions
# *** start of job script ***
# Note: The current working directory at this point is
# the directory where sbatch was executed.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun <executable>
Multiple srun
calls can be placed in a single batch script.
Options such as --account
, --nodes
, --ntasks
and --ntasks-per-node
are by default taken from the sbatch
arguments but can be overwritten for each srun
invocation.
The default partition on JUWELS, which is used if --partition
is omitted, is the batch
partition.
Note
If --ntasks-per-node
is omitted or set to a value higher than 48, SMT (simultaneous multithreading) will not be enabled automatically.
The Cluster and Booster compute nodes have 48 physical cores and the nodes in the gpus partition feature 40 physical cores. The number of logical cores is twice this number. To use the SMT capability, it must be activated manually by using the flag --threads-per-core=2
.
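The core counts above translate into simple arithmetic when sizing a job. The following sketch uses two hypothetical helpers (nodes_needed and fits_on_node are not Slurm commands) and assumes a node with 48 physical cores; for the gpus partition the value would be 40.

```shell
PHYS_CORES=48   # physical cores per Cluster/Booster node (40 in the gpus partition)

# Smallest node count that holds <tasks> MPI tasks at <tasks_per_node> tasks per node.
nodes_needed() {
    local tasks=$1 per_node=$2
    echo $(( (tasks + per_node - 1) / per_node ))
}

# Succeeds if tasks_per_node x cpus_per_task fits on one node.
# threads_per_core is 1 without SMT and 2 with --threads-per-core=2.
fits_on_node() {
    local tasks_per_node=$1 cpus_per_task=$2 threads_per_core=${3:-1}
    [ $(( tasks_per_node * cpus_per_task )) -le $(( PHYS_CORES * threads_per_core )) ]
}

nodes_needed 3072 48                                   # one task per physical core
nodes_needed 3072 96                                   # one task per hardware thread
fits_on_node 3 14   && echo "3 x 14 fits without SMT"
fits_on_node 4 24   || echo "4 x 24 does not fit without SMT"
fits_on_node 4 24 2 && echo "4 x 24 fits with SMT"
```

The first two calls print 64 and 32, which are the node counts needed for 3072 tasks at 48 and at 96 tasks per node.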
Job Script Examples
Note
For more information about the use of --cpus-per-task
, SRUN_CPUS_PER_TASK
and SBATCH_CPUS_PER_TASK
after the update to Slurm version 23.02, please refer to the
affinity documentation found here: https://apps.fz-juelich.de/jsc/hps/jureca/affinity.html
Example 1: MPI application starting 3072 tasks on 64 nodes using 48 CPUs per node (no SMT) running for max. 15 minutes:
#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=64
#SBATCH --ntasks=3072
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch
srun ./mpi-prog
Example 2: MPI application starting 3072 tasks on 32 nodes using 96 logical CPUs (hardware threads) per node (SMT enabled) running for max. 20 minutes:
#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=96
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch
srun ./mpi-prog
Example 3: Hybrid application starting 3 tasks per node on 64 allocated nodes and starting 14 threads per task (no SMT):
#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=14
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog
Example 4: Hybrid application starting 4 tasks per node on 64 allocated nodes and starting 24 threads per task (SMT enabled):
#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=24
#SBATCH --threads-per-core=2
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog
Example 5: MPI application starting 3072 tasks on 64 nodes using 48 CPUs per node (no SMT) running for max. 15 minutes on nodes with 192 GiB main memory. This example is identical to Example 1 except for the requested node type:
#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=64
#SBATCH --ntasks=3072
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=mem192
srun ./mpi-prog
The job script is submitted using:
$ sbatch <jobscript>
On success, sbatch
writes the job ID to standard out.
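When the job ID is needed in a script, it can be extracted from the confirmation line that sbatch prints ("Submitted batch job <jobid>"). The helper below is a hypothetical sketch that takes the last word of that line; the sample job ID is only illustrative.

```shell
# Extract the job ID from the confirmation line printed by sbatch.
# job_id_from is a hypothetical helper; it returns the last word of its argument.
job_id_from() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Typical use on a system with Slurm (commented out so the sketch runs anywhere):
#   JOBID=$(job_id_from "$(sbatch jobscript.sh)")

job_id_from "Submitted batch job 3116222"
```

Alternatively, sbatch --parsable prints only the job ID (followed by a semicolon and the cluster name on multi-cluster setups), which avoids the parsing step.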
Note
One can also define sbatch
options on the command line, e.g.:
$ sbatch --nodes=4 --account=<budget> --time=01:00:00 <jobscript>
Generic Resources, Features and Topology-aware Allocations
In order to request resources with special features (additional main memory, GPU devices) the --gres
option to sbatch
can be used.
For mem192
nodes, which are accessible via specific partitions, the --gres
option can be omitted.
Since the GPU and visualization nodes feature multiple user-visible GPU devices an additional quantity can be specified as shown in the following examples.
With the Slurm submission option --constraint
users can request resources/nodes according to Slurm Features.
Option |
Requested hardware features |
---|---|
|
192 GiB main memory |
|
Booster node, 4 GPUs per node |
|
Cluster node, 2 GPUs per node |
|
Cluster node, 4 GPUs per node |
|
XCST storage - largedata, largedata2 |
Complete list of Slurm GRES on JUWELS:
GRES |
Node Count |
---|---|
|
936 |
|
2271 |
|
240 |
|
56 |
Complete list of Slurm Features on JUWELS:
Features |
Count |
---|---|
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
48 |
|
38 |
|
10 |
|
24 |
|
2501 |
|
10 |
|
56 |
If no specific memory size is requested the default --gres=mem96
is automatically added to the submission to the JUWELS Cluster module.
Please note that jobs requesting 96 GiB may also run on nodes with 192 GiB if no other free resources are available.
If no gpu
GRES is given then --gres=gpu:4
is automatically added by Slurm’s submission filter for all partitions with GPU nodes.
Please note that GPU applications can request GPU devices per node via --gres=gpu:n
where n
can be 1
, 2
, 3
or 4
on GPU compute nodes.
Please refer to the JUWELS GPU computing page for examples.
Note
The charged computing time is independent of the number of specified GPUs. Production workloads must use all available GPU resources per node.
The XCST storage resource is available on all login nodes as well as on 10 Cluster compute nodes and 10 Booster compute nodes inside the usual default partitions batch
and booster
.
For an example on how to use it, please refer to How to access largedata on a limited number of computes within your jobs?
On JUWELS a tree topology is used in the Slurm configuration. The following table shows how many compute nodes are connected to each (InfiniBand) leaf switch on each system module. Note that Booster nodes have 4 HCAs, each connected to a different switch, so for scheduling purposes the real switches are aggregated into a single “virtual” switch per rack containing all nodes in that rack.
| System module | Slurm view of nodes per leaf switch |
|---|---|
| JUWELS Cluster | 21 or 24 |
| JUWELS Booster | 24 |
With the Slurm submission option --switches=<count>[@max-time]
users can request the maximum count of leaf switches that will be used for their jobs. This is
especially useful for network-bound applications, where network locality and maximum network performance are required. Optionally, users can also define the maximum
time to wait for the given number of switches to be available.
Please see GPU Computing for more details.
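A reasonable value for --switches can be derived from the nodes-per-switch figures in the table above. The sketch below computes the smallest number of leaf switches a job of a given size could span; min_switches is a hypothetical helper, and the result is only a lower bound, since free nodes are rarely packed perfectly.

```shell
# Lower bound on the number of leaf switches a job can span.
# min_switches is a hypothetical helper, not a Slurm command.
min_switches() {
    local nodes=$1 nodes_per_switch=$2
    echo $(( (nodes + nodes_per_switch - 1) / nodes_per_switch ))
}

# 64 Booster nodes with 24 nodes per (virtual) leaf switch
echo "--switches=$(min_switches 64 24)"
```

On the Cluster module, where a leaf switch may serve 21 or 24 nodes, using 21 gives the more conservative estimate. Requesting the bare minimum can lengthen the queue time, which is what the optional maximum wait time is for.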
Job Steps
The example below shows a job script where two different job steps are initiated within one job. In total 96 cores are allocated on
two nodes, and -n 48
makes each job step use 48 cores on one of the compute nodes. Additionally, in this example the option
--exclusive
is passed to srun
to ensure that distinct cores are allocated to each job step:
#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=2
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=48
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
srun --exclusive -n 48 ./mpi-prog1 &
srun --exclusive -n 48 ./mpi-prog2 &
wait
Dependency Chains
Slurm supports dependency chains, i.e., collections of batch jobs with defined dependencies. Dependencies can be defined using the --dependency
argument to sbatch
:
sbatch --dependency=afterany:<jobid> <jobscript>
Slurm will guarantee that the new batch job (whose job ID is returned by sbatch) does not start before <jobid>
terminates (successfully or not).
It is possible to specify other types of dependencies, such as afterok which ensures that the new job will only start if <jobid>
finished
successfully.
Below an example script for the handling of job chains is provided. The script submits a chain of ${NO_OF_JOBS}
jobs. A job will only start after
successful completion of its predecessor. Please note that a job which exceeds its time limit is not marked successful:
#!/bin/bash -x
# submit a chain of jobs with dependency
# number of jobs to submit
NO_OF_JOBS=<no of jobs>
# define jobscript
JOB_SCRIPT=<jobscript>
echo "sbatch ${JOB_SCRIPT}"
JOBID=$(sbatch ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
# the first job has already been submitted, so NO_OF_JOBS - 1 jobs remain
I=1
while [ ${I} -lt ${NO_OF_JOBS} ]; do
    echo "sbatch --dependency=afterok:${JOBID} ${JOB_SCRIPT}"
    JOBID=$(sbatch --dependency=afterok:${JOBID} ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
    I=$((I + 1))
done
Interactive Sessions
Interactive sessions can be allocated using the salloc
command:
$ salloc --partition=<devel|dc-cpu-devel|...> --nodes=2 --account=<budget> --time=00:30:00
Once an allocation has been made salloc
will start a shell on the login node (submission host). One can then execute srun
from within the shell, e.g.:
$ srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=7 ./hybrid-prog
The interactive session is terminated by exiting the shell. To obtain a shell on the first allocated compute node, one can start a remote shell from within the salloc
session and connect it to a pseudo terminal using:
$ srun --cpu-bind=none --nodes=2 --pty /bin/bash -i
The option --cpu-bind=none
is used to disable CPU binding for the spawned shell. To execute an MPI application, one uses srun
again from the remote shell. To support X11 forwarding the --forward-x
option to srun
is available. X11 forwarding is required for users who want to use applications or tools which provide a GUI.
Below a transcript of an exemplary interactive session is shown.
srun
can be run within the allocation without delay (note that the first srun
execution may take slightly longer due to the necessary node health checking performed upon the invocation of the very first srun
command within the session).
[user1@jwlogin08 ~]$ hostname
jwlogin08.juwels
[user1@jwlogin08 ~]$ salloc -n 2 --nodes=2 --account=<budget>
salloc: Granted job allocation 3116222
salloc: Waiting for resource configuration
salloc: Nodes jwc00n[017-018] are ready for job
[user1@jwlogin08 ~]$ hostname
jwlogin08.juwels
[user1@jwlogin08 ~]$ srun --ntasks 2 --ntasks-per-node=2 hostname
jwc00n017.juwels
jwc00n018.juwels
[user1@jwlogin08 ~]$ srun --cpu-bind=none --nodes=1 --pty /bin/bash -i
[user1@jwc00n017 ~]$ hostname
jwc00n017.juwels
[user1@jwc00n017 ~]$ logout
[user1@jwlogin08 ~]$ hostname
jwlogin08.juwels
[user1@jwlogin08 ~]$ exit
exit
salloc: Relinquishing job allocation 3116222
[user1@jwlogin08 ~]$ hostname
jwlogin08.juwels
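As the transcript shows, the shell started by salloc still runs on the login node, so hostname alone does not reveal whether an allocation is active. One way to check is the SLURM_JOB_ID environment variable, which Slurm sets inside an allocation. The helper name allocation_status is hypothetical.

```shell
# Report whether the current shell runs inside a Slurm allocation.
# Relies on SLURM_JOB_ID, which Slurm sets in the environment of an allocation.
allocation_status() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside allocation ${SLURM_JOB_ID}"
    else
        echo "no allocation"
    fi
}

allocation_status
```

Running this check before leaving a terminal unattended helps avoid being charged for an idle allocation.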
Note
Your account will be charged per allocation whether the compute nodes are used or not. Batch submission is the preferred way to execute jobs.
Hold and Release Batch Jobs
Jobs that are in pending state (i.e., not yet running) can be put on hold using:
scontrol hold <jobid>
Jobs that are on hold are still reported as pending (PD) by squeue
but the Reason
shown by squeue
or scontrol show job
is changed to JobHeldUser
:
[user1@jrlogin07 ~]$ scontrol show job <jobid>
JobId=<jobid> JobName=jobscript.sh
UserId=XXX(nnnn) GroupId=XXX(nnnn) MCS_label=N/A
Priority=0 Nice=0 Account=XXX QOS=normal
JobState=PENDING Reason=JobHeldUser Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
SubmitTime=2018-12-10T10:52:42 EligibleTime=Unknown
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=batch AllocNode:Sid=jrlogin07:14699
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=2-2 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,node=2
Socks/Node=* NtasksPerN:B:S:C=12:0:*:* CoreSpec=*
MinCPUsNode=12 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=mem96 Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/XXX/jobscript.sh
WorkDir=/XXX
StdErr=/XXX/mpi-err.<jobid>
StdIn=/dev/null
StdOut=/XXX/mpi-out.<jobid>
Power=
The job can be released using:
$ scontrol release <jobid>
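To check from a script whether a job is held, the Reason field can be extracted from the scontrol show job output shown above. The filter below is a minimal sketch; job_reason is a hypothetical helper that reads the scontrol output on standard input.

```shell
# Print the value of the Reason field from `scontrol show job` output on stdin.
# job_reason is a hypothetical helper for illustration.
job_reason() {
    tr ' ' '\n' | sed -n 's/^Reason=//p'
}

# On a system with Slurm:  scontrol show job <jobid> | job_reason
# Demonstration with the relevant line from the listing above:
printf 'JobState=PENDING Reason=JobHeldUser Dependency=(null)\n' | job_reason
```

For a job held by its owner, the filter prints JobHeldUser.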
Slurm commands
Below a list of the most important Slurm user commands available on JUWELS is given.
- sbatch is used to submit a batch script (which can be a bash, Perl or Python script).
  The script will be executed on the first node in the allocation chosen by the scheduler. The working directory coincides with the working directory of the sbatch program. Within the script one or multiple srun commands can be used to create job steps and execute (MPI) parallel applications.
  Note: mpiexec is not supported on JUWELS. srun is the only supported method to spawn MPI applications.
- salloc is used to request an allocation.
  When the job is started, a shell (or other program specified on the command line) is started on the submission host (login node). From the shell, srun can be used to interactively spawn parallel applications. The allocation is released when the user exits the shell.
- srun is mainly used to create a job step within a job.
  srun can be executed with only the program as argument, in which case it uses the full allocation, or with additional arguments that restrict the job step resources to a subset of the allocated processors.
- squeue queries the list of pending and running jobs.
  By default it reports the list of pending jobs sorted by priority and the list of running jobs sorted separately according to the job priority.
- scancel is used to cancel pending or running jobs or to send signals to processes in running jobs or job steps.
  Example: scancel <jobid>
- scontrol can be used to query information about compute nodes and running or recently completed jobs.
  Examples:
  scontrol show job <jobid> to show detailed information about pending, running or recently completed jobs
  scontrol update job <jobid> set ... to update a pending job
  Note: For old jobs scontrol show job <jobid> will not work and sacct -j <jobid> should be used instead.
- sacct is used to retrieve accounting information for jobs and job steps.
  For older jobs sacct queries the accounting database.
  Example: sacct -j <jobid>
- sinfo is used to retrieve information about the partitions and node states.
- sprio can be used to query job priorities.
- smap graphically shows the state of the partitions and nodes using a curses interface.
  We recommend Llview as an alternative which is supported on all JSC machines.
- sattach attaches to the standard input, output or error of a running job.
- sstat queries information about a running job.
Summary of sbatch and srun Options
The following table summarizes important sbatch
and srun
command options:
| Option | Description |
|---|---|
| --account | Budget account where contingent is taken from. |
| --nodes | Number of compute nodes used by the job. Can be omitted if --ntasks and --ntasks-per-node are given. |
| --ntasks | Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given. |
| --ntasks-per-node | Number of tasks per compute node. |
| --cpus-per-task | Number of logical CPUs (hardware threads) per task. This option is only relevant for hybrid/OpenMP jobs. |
| --job-name | A name for the job. |
| --output | Path to the job’s standard output. Slurm supports format strings containing replacement symbols such as %j (job ID). |
| --error | Path to the job’s standard error. Slurm supports format strings containing replacement symbols such as %j (job ID). |
| --time | Maximal wall-clock time of the job. |
| --partition | Partition to be used, e.g. batch or booster. |
| --mail-user | Define the mail address to receive mail notifications. |
| --mail-type | Define when to send mail notifications. |
| --pty | Execute the first task in pseudo terminal mode. |
| --forward-x | Enable X11 forwarding on the first allocated node. |
| --disable-turbomode | Disable turbo mode of all CPUs of the allocated nodes. |
More information is available on the man pages of sbatch
, srun
and salloc
which can be retrieved on the login nodes with the commands man sbatch
, man srun
and man salloc
, respectively, or in the Slurm documentation.
Frequency Scaling Performance Reliability
CPU frequency sets the pace at which instructions are executed by the CPU. A higher frequency results in:
- Higher power usage
- Possibly higher performance
Each CPU has a base frequency, which is the frequency that the CPU is guaranteed to work at.
Turbo mode means that the CPU increases the frequency above the base frequency, if temperature allows. Higher frequency results in more heat dissipation and a higher temperature. If the temperature passes the designed threshold, the CPU will tend to control the temperature by lowering the frequency, and this might affect the performance.
Therefore, the base frequency is more reliable since application performance does not depend on the current temperature of the allocated CPUs.
As a result, for repeatable performance measurements, it is recommended to pass --disable-turbomode
so that turbo mode is disabled and the CPUs run at the base frequency. The option is listed in Summary of sbatch and srun Options.
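Whether a measurement is repeatable can be judged by running the same job several times and comparing the run times. The sketch below summarises a list of timings as the mean and the relative spread (max - min) / mean; timing_spread is a hypothetical helper and the three timings are made-up sample values, not JUWELS measurements.

```shell
# Summarise repeated run times (in seconds): mean and relative spread.
# timing_spread is a hypothetical helper; pass one timing per argument.
timing_spread() {
    printf '%s\n' "$@" | awk '
        NR == 1 { min = $1; max = $1 }
        { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { mean = sum / NR
              printf "mean=%.2f spread=%.1f%%\n", mean, 100 * (max - min) / mean }'
}

# Made-up timings from three repetitions of the same job
timing_spread 10.0 10.5 9.5
```

A noticeably smaller spread with turbo mode disabled would indicate that frequency changes were a source of the variation.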