Quick Introduction

Welcome to the JURECA system.

This is a very basic introduction to using JURECA. More information can be found in the User Info on the JSC website (general documentation).

The documentation you are reading is available both as a man page on the JURECA system (man jureca) and as HTML pages.

JURECA Usage Model

JURECA is accessed through a dedicated set of login nodes used to write and compile applications as well as to perform pre- and post-processing of simulation data. Access to the compute nodes in the system is controlled by the workload manager.

Data Management

The parallel filesystems available on JURECA are mounted from JUST. The same filesystems are available on the JUDAC data gateway, which is the recommended system for initiating transfers of large files.

JURECA Hardware Overview

Type                    Quantity  Description
Standard / Slim         1605      24 cores, 128 GiB
Fat (type 1)            128       24 cores, 256 GiB
Fat (type 2)            64        24 cores, 512 GiB
Accelerated             75        24 cores, 128 GiB, 2x K80
Login                   12        24 cores, 256 GiB
Visualization (type 1)  10        24 cores, 512 GiB, 2x K40
Visualization (type 2)  2         24 cores, 1 TiB, 2x K40
Booster (KNL)           1640      68 cores, 96 GiB

Software on JURECA

For the usage of the module command, compilers and pre-installed applications, please see the JURECA Software page. Please note that the organization of Software on JURECA and JUWELS is identical but the installed software packages may differ.

Compile with the Intel Compilers

Please see the following page for examples how to compile Fortran, C or C++ programs with the Intel compilers: compilation information.

Batch System on JURECA

The batch system on JURECA is the Slurm (Simple Linux Utility for Resource Management) Workload Manager, a free open-source batch system. The resource management on JURECA is not performed by Slurm but by the proven Parastation process management daemon.

Available Partitions

Compute nodes are used exclusively by jobs of a single user; nodes are not shared between jobs. The smallest allocation unit is one node (24 cores). Users are charged for the number of compute nodes multiplied by the wall-clock time used.
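The charging rule above amounts to a simple product. The following sketch computes it for a given node count and wall-clock time. The helper name charged_node_hours and the choice of node-hours as the unit are assumptions made for illustration; they are not part of the JURECA accounting tools.

```shell
#!/bin/bash
# Hypothetical helper: nodes x wall-clock time, printed in node-hours.
# Usage: charged_node_hours <nodes> <HH:MM:SS>
charged_node_hours() {
    local nodes=$1 h m s
    IFS=: read -r h m s <<< "$2"
    # 10# forces base 10 so that fields such as "08" are not read as octal.
    local seconds=$(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
    # Work in hundredths of a node-hour to stay within integer arithmetic
    # (the result is truncated, not rounded).
    local hundredths=$(( nodes * seconds * 100 / 3600 ))
    printf '%d.%02d\n' $(( hundredths / 100 )) $(( hundredths % 100 ))
}

charged_node_hours 4 00:15:00   # 4 nodes for 15 minutes
```

Because whole nodes are allocated, the result does not depend on how many of the cores of each node the job actually uses.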

On each node, a share of the available memory is reserved and not available to applications. Batch jobs are guaranteed a minimum of 4 GiB, 8 GiB and 16 GiB per core (96 GiB, 192 GiB and 384 GiB in total) on nodes with 128 GiB, 256 GiB and 512 GiB main memory, respectively.
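The per-node totals follow from the per-core guarantee multiplied by the 24 cores of a node. A minimal sketch of that arithmetic is shown here; the helper name guaranteed_mem_gib is hypothetical, and the per-core values are simply the ones stated above.

```shell
#!/bin/bash
# Hypothetical helper: guaranteed memory for a node with the given main
# memory size in GiB. Per-core values are the ones stated in the text.
# Prints "<GiB per core> <GiB per node>".
guaranteed_mem_gib() {
    local cores=24 per_core
    case $1 in
        128) per_core=4 ;;
        256) per_core=8 ;;
        512) per_core=16 ;;
        *)   echo "no guarantee listed for $1 GiB nodes" >&2; return 1 ;;
    esac
    echo "$per_core $(( per_core * cores ))"
}

for size in 128 256 512; do
    guaranteed_mem_gib "$size"
done
```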

In contrast to JUROPA, batch and interactive jobs can be used interchangeably on JURECA, and all limits apply to both use cases.

The default batch partition is intended for production jobs and encompasses nodes with 128 GiB and 256 GiB main memory. The mem512 partition contains 64 nodes with 512 GiB main memory each. The gpus and vis partitions provide access to GPU-equipped compute and visualization nodes. The GPU-equipped large memory nodes (1 TiB main memory) are accessible through the mem1024 as well as the vis partition. To support development and optimization of single-node performance, an additional devel partition is available. For software development and optimization efforts targeted at GPU-equipped compute nodes, the develgpus partition can be used.

The booster partition contains 1612 KNL nodes, which are configured in Quadrant NUMA mode with Cache memory mode. The 24 KNL nodes in the develbooster partition are configured in the same way but are meant for software development, small and short tests, and compilation of applications for the KNL architecture. The modetestbooster partition is intended for testing different KNL configurations and includes 4 KNL nodes, which are divided into 2 groups of 2 nodes each.

Partition        Resource                               Value
devel            max. wallclock time                    2 h
                 default wallclock time                 30 min
                 min. / max. number of nodes            1 / 8
                 max. no. of running jobs               4
                 node types                             mem128 (128 GiB)
develgpus        max. wallclock time                    2 h
                 default wallclock time                 30 min
                 min. / max. number of nodes            1 / 2
                 max. no. of running jobs               2
                 node types                             mem128,gpu:[1-4] (128 GiB, 2x K80 per node)
batch            max. wallclock time (normal / nocont)  24 h / 6 h
                 default wallclock time                 1 h
                 min. / max. number of nodes            1 / 256
                 node types                             mem128 (128 GiB) and mem256 (256 GiB)
mem256           max. wallclock time (normal / nocont)  24 h / 6 h
                 default wallclock time                 1 h
                 min. / max. number of nodes            1 / 128
                 node types                             mem256 (256 GiB)
mem512           max. wallclock time (normal / nocont)  24 h / 6 h
                 default wallclock time                 1 h
                 min. / max. number of nodes            1 / 32
                 node types                             mem512 (512 GiB)
mem1024          max. wallclock time (normal / nocont)  24 h / 6 h
                 default wallclock time                 1 h
                 node types                             mem1024 (1 TiB)
gpus             max. wallclock time (normal / nocont)  24 h / 6 h
                 default wallclock time                 1 h
                 min. / max. number of nodes            1 / 32
                 node types                             mem128,gpu:[1-4] (128 GiB, 2x K80 per node)
vis              max. wallclock time (normal / nocont)  24 h / 6 h
                 default wallclock time                 1 h
                 min. / max. number of nodes            1 / 4
                 node types                             mem512,gpu:[1-2] (512 GiB, 2x K40 per node)
                                                        mem1024,gpu:[1-2] (1 TiB, 2x K40 per node)
booster          max. wallclock time (normal / nocont)  24 h / 6 h
                 default wallclock time                 1 h
                 min. / max. number of nodes            1 / 512
                 node types                             mem96 with feature quadcache
develbooster     max. wallclock time                    6 h
                 default wallclock time                 30 min
                 min. / max. number of nodes            1 / 8
                 node types                             mem96 with feature quadcache
modetestbooster  max. wallclock time (normal / nocont)  24 h / 6 h
                 default wallclock time                 1 h
                 min. / max. number of nodes            1 / 4
                 node types                             mem96 with features snc4flat or snc4cache

A limit regarding the maximum number of running jobs per user is enforced. The precise values are adjusted to optimize system utilization. In general, the limit for the number of running jobs is lower for nocont projects.

In addition to the partitions mentioned above, the large and largebooster partitions are available for large and full-module jobs. These partitions are open for submission, but jobs will only run in selected timeslots. The max. wallclock time is not limited, but jobs with a wallclock time above 30 minutes need to be coordinated with the user support.

In order to request nodes with particular resources (mem256, mem512, mem1024, gpu) generic resources need to be requested at job submission. Please note that the mem256, mem512, mem1024, gpus, and develgpus partitions may only be used for applications requiring large memory or using GPU accelerators, respectively.

The modetestbooster partition is divided into 2 groups of 2 nodes each. Currently the KNL configurations are: a) SNC4 + Flat, b) SNC4 + Cache. In Slurm these groups have been configured with different "Features", and the current list of Features is: a) snc4flat and b) snc4cache. To select a particular configuration, users must also pass the constraint submission option: -C, --constraint=<feature>. For example: sbatch ... -p modetestbooster -C snc4flat .... Submissions to the modetestbooster partition are rejected if no Feature is specified.
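Since a submission without a Feature is rejected, it can be convenient to check the feature name before calling sbatch. The following is a minimal sketch of such a check. The helper name modetest_options is hypothetical; only the partition name and the two feature names are taken from the text above.

```shell
#!/bin/bash
# Hypothetical helper: print the partition and constraint options for a
# modetestbooster submission, refusing feature names not listed above.
modetest_options() {
    case ${1:-} in
        snc4flat|snc4cache)
            echo "--partition=modetestbooster --constraint=$1" ;;
        "")
            echo "a feature is required for modetestbooster" >&2; return 1 ;;
        *)
            echo "unknown feature: $1" >&2; return 1 ;;
    esac
}

modetest_options snc4cache
# Possible use (word splitting of the output is intended here):
#   sbatch $(modetest_options snc4flat) <jobscript>
```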

Writing a Batch Script

Users submit batch applications (usually bash scripts) using the sbatch command. The script is executed on the first compute node in the allocation. To execute parallel MPI tasks users call srun within their script.

Note

mpiexec is not supported on JURECA and has to be replaced by srun.

The minimal template to be filled is

#!/bin/bash -x
#SBATCH --account=<budget account>
# budget account where contingent is taken from
#SBATCH --nodes=<no of nodes>
#SBATCH --ntasks=<no of tasks (MPI processes)>
# can be omitted if --nodes and --ntasks-per-node
# are given
#SBATCH --ntasks-per-node=<no of tasks per node>
# if keyword omitted: Max. 48 tasks per node
# (SMT enabled, see comment below)
#SBATCH --cpus-per-task=<no of threads per task>
# for OpenMP/hybrid jobs only
#SBATCH --output=<path of output file>
# if keyword omitted: Default is slurm-%j.out in
# the submission directory (%j is replaced by
# the job ID).
#SBATCH --error=<path of error file>
# if keyword omitted: Default is slurm-%j.out in
# the submission directory.
#SBATCH --time=<walltime>
#SBATCH --partition=<batch, mem512, ...>

# *** start of job script ***
# Note: The current working directory at this point is
# the directory where sbatch was executed.

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun <executable>

Multiple srun calls can be placed in a single batch script. Options such as --account, --nodes, --ntasks and --ntasks-per-node are by default taken from the sbatch arguments but can be overridden for each srun invocation. The default partition on JURECA, which is used if --partition is omitted, is the batch partition.
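As an illustration of several srun calls in one script, the sketch below runs two job steps, the second of which overrides the task layout inherited from sbatch. The program names ./preprocess and ./postprocess, the function run_steps and the LAUNCHER variable (which allows a dry run with echo in place of srun) are all hypothetical and not part of JURECA or Slurm.

```shell
#!/bin/bash
#SBATCH --account=<budget>
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:30:00
#SBATCH --partition=batch

run_steps() {
    # LAUNCHER is a hypothetical dry-run switch; it defaults to srun.
    # Step 1 inherits --nodes and --ntasks-per-node from the sbatch options.
    "${LAUNCHER:-srun}" ./preprocess
    # Step 2 overrides the layout for this invocation only.
    "${LAUNCHER:-srun}" --nodes=1 --ntasks=1 ./postprocess
}

# Only launch the steps inside a Slurm job (Slurm sets SLURM_JOB_ID there).
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
    run_steps
fi
```

Calling LAUNCHER=echo run_steps outside of a job prints the two command lines instead of executing them, which can help when checking the options of each step.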

Note

If --ntasks-per-node is omitted or set to a value higher than 24, SMT (simultaneous multithreading) will be enabled. Each standard compute node has 24 physical cores and 48 logical cores.
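The following sketch checks whether a given task layout exceeds the 24 physical cores of a standard node. For hybrid jobs it compares tasks per node times CPUs per task against 24; this rule is inferred from the job script examples on this page and is an assumption, not a description of the scheduler's actual logic. The helper name needs_smt is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report whether a task layout needs SMT on a standard
# node with 24 physical cores.
# Usage: needs_smt <tasks per node> [<cpus per task>]
needs_smt() {
    local physical_cores=24
    local logical_cpus=$(( $1 * ${2:-1} ))
    if (( logical_cpus > physical_cores )); then
        echo yes
    else
        echo no
    fi
}

needs_smt 24      # pure MPI, one task per core
needs_smt 48      # pure MPI, one task per hardware thread
needs_smt 3 7     # hybrid, 21 threads per node
needs_smt 4 12    # hybrid, 48 threads per node
```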

Job Script Examples

Example 1: MPI application starting 96 tasks on 4 nodes using 24 CPUs per node (no SMT) running for max. 15 minutes:

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch

srun ./mpi-prog

Example 2: MPI application starting 1536 tasks on 32 nodes using 48 logical CPUs (hardware threads) per node (SMT enabled) running for max. 20 minutes:

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=48
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

srun ./mpi-prog

Example 3: Hybrid application starting 3 tasks per node on 4 allocated nodes and starting 7 threads per task (no SMT):

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=3
#SBATCH --cpus-per-task=7
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog

Example 4: Hybrid application starting 4 tasks per node on 3 allocated nodes and starting 12 threads per task (SMT enabled):

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog

Example 5: MPI application starting 96 tasks on 4 nodes using 24 CPUs per node (no SMT) running for max. 15 minutes on nodes with 256 GiB main memory. This example is identical to Example 1 except for the requested node type:

#!/bin/bash -x
#SBATCH --account=<budget>
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=mem256

srun ./mpi-prog

The job script is submitted using:

$ sbatch <jobscript>

On success, sbatch writes the job ID to standard output.
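Scripts that submit jobs often need the job ID for later commands such as scancel. The sketch below extracts it from the message printed by sbatch, assuming the usual Slurm format "Submitted batch job <id>"; the helper name job_id_from_sbatch and the sample job ID are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from a line of the form
# "Submitted batch job <id>" (assumed sbatch message format).
job_id_from_sbatch() {
    local line=$1
    if [[ $line =~ ^Submitted\ batch\ job\ ([0-9]+) ]]; then
        echo "${BASH_REMATCH[1]}"
    else
        echo "unexpected sbatch output: $line" >&2
        return 1
    fi
}

# Sample line used for illustration only; in a real workflow it would be
#   jobid=$(job_id_from_sbatch "$(sbatch <jobscript>)")
job_id_from_sbatch "Submitted batch job 1234567"
```

If the installed Slurm version supports it, sbatch --parsable prints the job ID without the surrounding text, which makes the parsing step unnecessary.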

Note

One can also define sbatch options on the command line, e.g.:

$ sbatch --nodes=4 --account=<budget> --time=01:00:00 <jobscript>

Requesting Generic Resources

In order to request resources with special features (additional main memory, GPU devices) the --gres option to sbatch can be used. For mem256 and mem512 nodes, which are accessible via specific partitions, the --gres option can be omitted. Since the GPU and visualization nodes feature multiple user-visible GPU devices (4 on GPU compute nodes and 2 on visualization nodes) an additional quantity needs to be specified as shown in the following examples.

Option                                Requested hardware features
--partition=mem256                    256 GiB main memory
--gres=mem512 --partition=mem512      512 GiB main memory
--gres=gpu:2 --partition=gpus         2 GPUs per node
--gres=gpu:4 --partition=gpus         4 GPUs per node
--gres=mem1024 --partition=mem1024    1 TiB main memory
--gres=mem1024,gpu:2 --partition=vis  1 TiB main memory and 2 GPUs per node

If no specific memory size is requested, the default --gres=mem128 is automatically added to the submission. Please note that jobs requesting 128 GiB may also run on nodes with 256 GiB if no other free resources are available. The nodes equipped with 1 TiB main memory are accessible through the mem1024 and vis partitions, depending on the intended use case.

The vis, gpus and develgpus partitions will reject submissions if the corresponding resources are not requested. Please note that GPU applications will only be able to use as many GPUs per node as requested via --gres=gpu:n where n can be 1, 2, 3 or 4 on GPU compute nodes and 1 or 2 on visualization nodes. Please refer to the GPU computing page for examples.
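The GPU count limits above can be checked before submission. The following sketch builds the --gres and --partition options for a GPU request and rejects counts outside the stated ranges. The helper name gpu_request is hypothetical, and the optional third argument (a memory resource such as mem1024, as in the vis row of the table above) is a convenience assumed for this example.

```shell
#!/bin/bash
# Hypothetical helper: build the options for a GPU request.
# Usage: gpu_request <partition> <GPUs per node> [<memory resource>]
gpu_request() {
    local partition=$1 n=$2 mem=${3:-} max
    case $partition in
        gpus|develgpus) max=4 ;;   # GPU compute nodes: 1 to 4 devices
        vis)            max=2 ;;   # visualization nodes: 1 or 2 devices
        *) echo "no GPU limit listed for partition $partition" >&2; return 1 ;;
    esac
    if (( n < 1 || n > max )); then
        echo "$partition allows 1 to $max GPUs per node" >&2
        return 1
    fi
    local gres="gpu:$n"
    if [[ -n $mem ]]; then
        gres="$mem,$gres"
    fi
    echo "--gres=$gres --partition=$partition"
}

gpu_request gpus 4
gpu_request vis 2 mem1024
```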

Interactive Sessions

Interactive sessions can be allocated using the salloc command:

$ salloc --partition=devel --nodes=2 --account=<budget> --time=00:30:00

Once an allocation has been made, salloc will start a bash shell on the login node (submission host). One can then execute srun from within this shell, e.g.:

$ srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=7 ./hybrid-prog

The interactive session is terminated by exiting the shell. In order to obtain a shell on the first allocated compute node, one can start a remote shell from within the salloc session and connect it to a pseudo terminal using:

$ srun --cpu_bind=none --nodes=2 --pty /bin/bash -i

The option --cpu_bind=none is used to disable CPU binding for the spawned shell. To execute an MPI application, one uses srun again from the remote shell. To support X11 forwarding, the --forward-x option to srun is available. X11 forwarding is required for users who want to use applications or tools which provide a GUI.

Note

Your account will be charged per allocation whether the compute nodes are used or not. Batch submission is the preferred way to execute jobs.

Other Useful sbatch and srun Options

  • To receive e-mail notifications when job events occur, specify --mail-user=<e-mail address> and set --mail-type=<type>. Valid types are BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50 and ARRAY_TASKS. Multiple type values may be specified in a comma-separated list.
  • Stdout and stderr can be combined by specifying the same file for the --output and --error option.
  • A job name can be specified using the --job-name option.
  • If --ntasks is omitted, the number of nodes can be specified as a range --nodes=<min no. of nodes>-<max no. of nodes>, allowing the scheduler to start the job with fewer nodes than the maximum requested if this reduces wait time.
  • There is a custom plugin implemented for the sbatch command which disables the turbo mode of all CPUs of the allocated nodes. To disable CPU turbo mode use --disable-turbomode.
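A comma-separated --mail-type value can be checked against the types listed above before it is passed to sbatch. This is a minimal sketch; the helper name valid_mail_types is hypothetical, and it only knows the types named on this page, so it may be stricter than Slurm itself.

```shell
#!/bin/bash
# Hypothetical helper: succeed if every entry of a comma-separated
# --mail-type value is one of the types listed in the text.
valid_mail_types() {
    local types type
    IFS=, read -r -a types <<< "$1"
    (( ${#types[@]} > 0 )) || return 1
    for type in "${types[@]}"; do
        case $type in
            BEGIN|END|FAIL|REQUEUE|ALL|ARRAY_TASKS) ;;
            TIME_LIMIT|TIME_LIMIT_90|TIME_LIMIT_80|TIME_LIMIT_50) ;;
            *) echo "invalid mail type: $type" >&2; return 1 ;;
        esac
    done
}

valid_mail_types "END,FAIL" && echo "ok"
```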

Summary of sbatch and srun Options

The following table summarizes important sbatch and srun command options:

--account Budget account where contingent is taken from.
--nodes Number of compute nodes used by the job. Can be omitted if --ntasks and --ntasks-per-node are given.
--ntasks Number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given.
--ntasks-per-node Number of tasks per compute node.
--cpus-per-task Number of logical CPUs (hardware threads) per task. This option is only relevant for hybrid/OpenMP jobs.
--output Path to the job's standard output. Slurm supports format strings containing replacement symbols such as %j (job ID).
--error Path to the job's standard error. Slurm supports format strings containing replacement symbols such as %j (job ID).
--time Maximal wall-clock time of the job.
--partition Partition to be used, e.g. batch or large. If omitted, batch is the default.
--mail-user Define the e-mail address that receives notifications.
--mail-type Define when to send e-mail notifications.
--pty (srun only) Execute the first task in pseudo terminal mode.
--forward-x (srun) Enable X11 forwarding on the first allocated node.
--disable-turbomode (sbatch) Disable turbo mode of all CPUs of the allocated nodes.

More information is available on the man pages of sbatch, srun and salloc which can be retrieved on the login nodes with the commands man sbatch, man srun and man salloc, respectively.

Other Slurm Commands

squeue Show status of all jobs.
scancel <jobid> Cancel a job.
scontrol show job <jobid> Show detailed information about a pending, running or recently completed job.
scontrol update job <jobid> set ... Update a pending job.
scontrol -h Show detailed information about scontrol.
sacct -j <jobid> Query information about old jobs.
sprio Show job priorities.
smap Show distribution of jobs. For a graphical interface users are referred to llview.
sinfo View information about nodes and partitions.

For further information please see also the Slurm documentation.