Quick Introduction

Welcome to the JUWELS system.

This page provides basic information for using JUWELS. More information can be found in the User Info on the JSC website (general documentation).

The documentation you are reading can be viewed either as a man page on the JUWELS system, using man juwels, or as HTML pages.

JUWELS Usage Model

JUWELS is accessed through a dedicated set of login nodes used to write and compile applications as well as to perform pre- and post-processing of simulation data. Access to the compute nodes in the system is controlled by the workload manager.

Data Management

The parallel filesystems available on JUWELS are mounted from JUST. The same filesystems are available on the JUDAC data gateway, which is the recommended system for initiating transfers of large files.
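
Large transfers can be initiated on JUDAC with standard tools such as rsync or scp. A minimal sketch, assuming the hostname judac.fz-juelich.de and a project path that you have to replace with your own:

$ rsync -avP results.tar.gz <userid>@judac.fz-juelich.de:/p/project/<project>/
$ scp -r ./input-data <userid>@judac.fz-juelich.de:/p/project/<project>/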

JUWELS Hardware Overview

Type                                                   Quantity  Description
Standard / Slim                                        2271      48 cores, 96 GiB
Large memory                                           240       48 cores, 192 GiB
Accelerated                                            48        40 cores, 192 GiB, 4x V100 SXM2 GPUs
Login                                                  12        40 cores, 768 GiB
Visualization login (availability will be announced)   4         40 cores, 768 GiB, P100 GPU

Software on JUWELS

For the usage of the module command, the compilers and the pre-installed applications, please see the JURECA software page. Please note that the organization of the software on JURECA and JUWELS is identical, but the installed software packages may differ.
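
As a quick orientation, a typical interaction with the module command might look as follows (a sketch; the module names Intel and ParaStationMPI are examples and may differ from what is actually installed):

$ module avail                        # list modules visible in the current stage
$ module spider <package>             # search all stages for a package
$ module load Intel ParaStationMPI    # load a compiler and a matching MPI
$ module list                         # show currently loaded modules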

Compile with the Intel Compilers

Please see the following page for examples of how to compile Fortran, C or C++ programs with the Intel compilers: JUWELS compilation information.
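
As an illustration, once a compiler and an MPI module are loaded, an MPI program can be built with the compiler wrappers. This is a sketch under the assumption that the Intel and ParaStationMPI modules and the usual wrapper names are available:

$ module load Intel ParaStationMPI
$ mpicc  -O2 -o mpi-prog mpi-prog.c     # C
$ mpicxx -O2 -o mpi-prog mpi-prog.cpp   # C++
$ mpif90 -O2 -o mpi-prog mpi-prog.f90   # Fortran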

Batch System on JUWELS

The batch system on JUWELS is Slurm (Simple Linux Utility for Resource Management), a free, open-source workload manager. The process management on JUWELS, however, is not performed by Slurm itself but by the proven ParaStation process management daemon.

Available Partitions

Compute nodes are used exclusively by jobs of a single user; nodes are not shared between jobs. The smallest allocation unit is one node (48 cores). Users are charged for the number of compute nodes multiplied by the wall-clock time used. On each node, a share of the available memory is reserved for the system and is not available to applications.

The default batch partition is intended for production jobs and encompasses nodes with 96 GiB and 192 GiB main memory. The mem192 partition contains nodes with 192 GiB main memory each. The gpus partition provides access to GPU-equipped compute nodes.

Please note: The job size and runtime limits are still subject to change during the early operation phase of the system. Please see the "message of the day" shown at login for more information.

Partition  Resource                                Value
batch      max. wallclock time (normal / nocont)   TBD / TBD
           default wallclock time                  1 h
           min. / max. number of nodes             TBD / TBD
           node types                              mem96 (96 GiB) and mem192 (192 GiB)
mem192     max. wallclock time (normal / nocont)   TBD / TBD
           default wallclock time                  1 h
           min. / max. number of nodes             TBD / TBD
           node types                              mem192 (192 GiB)
gpus       max. wallclock time (normal / nocont)   TBD / TBD
           default wallclock time                  1 h
           min. / max. number of nodes             TBD / TBD
           node types                              mem192, gpu:[1-4] (192 GiB, 4x V100 per node)

A limit regarding the maximum number of running jobs per user is enforced. The precise values are adjusted to optimize system utilization. In general, the limit for the number of running jobs is lower for nocont projects.

In addition to the above-mentioned partitions, the large partition is available for large and full-system jobs. The partition is open for submission, but jobs will only run in selected timeslots. The use of the large partition needs to be coordinated with the user support.

Writing a Batch Script

Users submit batch applications (usually bash scripts) using the sbatch command. The script is executed on the first compute node in the allocation. To execute parallel MPI tasks users call srun within their script.

Please note that mpiexec is not supported on JUWELS and has to be replaced by srun.
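
For instance, a launch line written elsewhere as mpiexec -n 96 ./mpi-prog becomes on JUWELS (the executable name and task count are placeholders):

srun --ntasks=96 ./mpi-prog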

The minimal template to be filled in is:

#!/bin/bash -x
#SBATCH --nodes=<no of nodes>
#SBATCH --ntasks=<no of tasks (MPI processes)>
# can be omitted if --nodes and --ntasks-per-node
# are given
#SBATCH --ntasks-per-node=<no of tasks per node>
# if keyword omitted: Max. 96 tasks per node
# (SMT enabled, see comment below)
#SBATCH --cpus-per-task=<no of threads per task>
# for OpenMP/hybrid jobs only
#SBATCH --output=<path of output file>
# if keyword omitted: Default is slurm-%j.out in
# the submission directory (%j is replaced by
# the job ID).
#SBATCH --error=<path of error file>
# if keyword omitted: Default is slurm-%j.out in
# the submission directory.
#SBATCH --time=<walltime>
#SBATCH --partition=<batch, mem192, gpus, ...>

# *** start of job script ***
# Note: The current working directory at this point is
# the directory where sbatch was executed.

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun <executable>

Multiple srun calls can be placed in a single batch script. Options such as --nodes, --ntasks and --ntasks-per-node are by default taken from the sbatch arguments but can be overridden for each srun invocation. The default partition on JUWELS, which is used if --partition is omitted, is the batch partition.
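
A sketch of a job script with two srun invocations, the second one overriding the job-level defaults (the executable names are placeholders):

#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=00:30:00
#SBATCH --partition=batch

# first step: uses the full allocation (2 nodes, 96 tasks)
srun ./pre-proc

# second step: overrides the defaults and runs on one node only
srun --nodes=1 --ntasks=48 ./mpi-prog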

Note: If --ntasks-per-node is omitted or set to a value higher than 48, SMT (simultaneous multithreading) will be enabled. Each standard compute node has 48 physical cores and 96 logical cores (40 and 80 on GPU nodes, respectively).

Job Script Examples

Example 1: MPI application starting 3072 tasks on 64 nodes using 48 CPUs per node (no SMT) running for max. 15 minutes:

#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --ntasks=3072
#SBATCH --ntasks-per-node=48
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch

srun ./mpi-prog

Example 2: MPI application starting 3072 tasks on 32 nodes using 96 logical CPUs (hardware threads) per node (SMT enabled) running for max. 20 minutes:

#!/bin/bash -x
#SBATCH --nodes=32
#SBATCH --ntasks=3072
#SBATCH --ntasks-per-node=96
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

srun ./mpi-prog

Example 3: Hybrid application starting 3 tasks per node on 64 allocated nodes with 14 threads per task (no SMT) running for max. 20 minutes:

#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --ntasks=192
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=14
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog

Example 4: Hybrid application starting 4 tasks per node on 64 allocated nodes and starting 24 threads per task (SMT enabled):

#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog

Example 5: MPI application starting 3072 tasks on 64 nodes using 48 CPUs per node (no SMT) running for max. 15 minutes on nodes with 192 GiB main memory. This example is identical to Example 1 except for the requested node type:

#!/bin/bash -x
#SBATCH --nodes=64
#SBATCH --ntasks=3072
#SBATCH --ntasks-per-node=48
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=mem192

srun ./mpi-prog

The job script is submitted using:

$ sbatch <jobscript>

On success, sbatch writes the job ID to standard out.

Note: One can also define sbatch options on the command line, e.g.:

$ sbatch --nodes=4 --time=01:00:00 <jobscript>

Requesting Generic Resources

In order to request resources with special features (additional main memory, GPU devices) the --gres option to sbatch can be used. For mem192 nodes, which are accessible via specific partitions, the --gres option can be omitted. Since the GPU and visualization nodes feature multiple user-visible GPU devices (4 on GPU compute nodes and 2 on visualization nodes) an additional quantity needs to be specified as shown in the following examples.

Option                          Requested hardware features
--partition=mem192              192 GiB main memory
--gres=gpu:2 --partition=gpus   2 GPUs per node
--gres=gpu:4 --partition=gpus   4 GPUs per node

If no specific memory size is requested the default --gres=mem96 is automatically added to the submission. Please note that jobs requesting 96 GiB may also run on nodes with 192 GiB if no other free resources are available.

The gpus partition will reject submissions if the corresponding resources are not requested. Please note that GPU applications will only be able to use as many GPUs per node as requested via --gres=gpu:n, where n can be 1, 2, 3 or 4 on GPU compute nodes and 1 or 2 on visualization nodes. Please refer to the JUWELS GPU computing page for examples (please note that JURECA and JUWELS feature the same number of GPUs per node but a different number of cores).
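
A sketch of a GPU job script requesting all four GPUs of a single node in the gpus partition (the executable name is a placeholder; GPU compute nodes have 40 physical cores, here split into 4 tasks with 10 cores each):

#!/bin/bash -x
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10
#SBATCH --gres=gpu:4
#SBATCH --partition=gpus
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:15:00

srun ./gpu-prog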

Interactive Sessions

Interactive sessions can be allocated using the salloc command:

$ salloc --partition=devel --nodes=2 --time=00:30:00

Once an allocation has been made, salloc starts a shell (bash) on the login node (the submission host). One can then execute srun from within this shell, e.g.:

$ srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=7 ./hybrid-prog

The interactive session is terminated by exiting the shell. In order to obtain a shell on the first allocated compute node, one can start a remote shell from within the salloc session and connect it to a pseudo terminal using:

$ srun --cpu_bind=none --nodes=2 --pty /bin/bash -i

The option --cpu_bind=none disables CPU binding for the spawned shell. In order to execute MPI applications, one uses srun again from within the remote shell. To support X11 forwarding, the --forward-x option to srun is available. X11 forwarding is required for users who want to run applications or tools that provide a GUI.
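
For example, an interactive shell with X11 forwarding on the first allocated node could be started as follows (a sketch; it assumes the login session itself was opened with X11 forwarding, e.g. via ssh -X):

$ srun --cpu_bind=none --nodes=1 --forward-x --pty /bin/bash -i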

Note: Your account will be charged per allocation whether the compute nodes are used or not. Batch submission is the preferred way to execute jobs.

Other Useful sbatch and srun Options

  • To receive e-mail notifications, users have to specify --mail-user=<e-mail address> and --mail-type=<type>, where valid types are BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50 and ARRAY_TASKS. Multiple type values may be specified in a comma-separated list.
  • Stdout and stderr can be combined by specifying the same file for the --output and --error options.
  • A job name can be specified using the --job-name option.
  • If --ntasks is omitted, the number of nodes can be specified as a range --nodes=<min no. of nodes>-<max no. of nodes>, allowing the scheduler to start the job with fewer nodes than the maximum requested if this reduces the wait time. A job script header combining several of these options is sketched below.
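
A sketch of a job script header combining several of the options above (job name, file names and mail address are placeholders):

#!/bin/bash -x
#SBATCH --job-name=my-simulation
#SBATCH --nodes=4-8
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00
#SBATCH --partition=batch
#SBATCH --output=job-%j.out
#SBATCH --error=job-%j.out           # same file as --output: stdout and stderr are combined
#SBATCH --mail-user=you@example.org
#SBATCH --mail-type=END,FAIL

srun ./mpi-prog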

Summary of sbatch and srun Options

The following table summarizes important sbatch and srun command options:

--nodes             Number of compute nodes used by the job. Can be omitted if --ntasks and --ntasks-per-node are given.
--ntasks            Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given.
--ntasks-per-node   Number of tasks per compute node.
--cpus-per-task     Number of logical CPUs (hardware threads) per task. Only relevant for hybrid/OpenMP jobs.
--output            Path to the job's standard output. Slurm supports format strings containing replacement symbols such as %j (job ID).
--error             Path to the job's standard error. Slurm supports format strings containing replacement symbols such as %j (job ID).
--time              Maximal wall-clock time of the job.
--partition         Partition to be used, e.g. batch, mem192, gpus or large. If omitted, batch is the default.
--mail-user         Mail address for notifications.
--mail-type         Define when to send mail notifications.
--pty               (srun only) Execute the first task in pseudo terminal mode.
--forward-x         (srun only) Enable X11 forwarding on the first allocated node.

More information is available on the man pages of sbatch, srun and salloc which can be retrieved on the login nodes with the commands man sbatch, man srun and man salloc, respectively.

Other Slurm Commands

squeue                          Show the status of all jobs.
scancel <jobid>                 Cancel a job.
scontrol show job <jobid>       Show detailed information about a pending, running or recently completed job.
scontrol update jobid=<jobid>   Update a pending job.
scontrol -h                     Show detailed information about scontrol.
sacct -j <jobid>                Query information about old jobs.
sprio                           Show job priorities.
smap                            Show the distribution of jobs. For a graphical interface users are referred to llview.
sinfo                           View information about nodes and partitions.
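
As an illustration, a few typical invocations (the job ID 123456 is a placeholder):

$ squeue -u $USER                                  # show only your own jobs
$ scontrol show job 123456                         # inspect a specific job
$ scontrol update jobid=123456 timelimit=00:30:00  # change the time limit of a pending job
$ sacct -j 123456 --format=JobID,Elapsed,State     # accounting data for a completed job
$ scancel 123456                                   # cancel the job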

For further information please see also the Slurm documentation.