.. include:: system.rst

.. _batchsystem:

Batch system
============

|SYSTEM_NAME| is accessed through a dedicated set of login nodes used to write and compile applications as well as to perform pre- and post-processing of simulation data.
Access to the compute nodes in the system is controlled by the workload manager.

|SYSTEM_NAME| uses the Slurm (Simple Linux Utility for Resource Management) Workload Manager, a free, open-source resource manager and batch system.
Slurm is a modern, extensible batch system that is widely deployed around the world on clusters of various sizes.

A Slurm installation consists of several programs and daemons. The ``slurmctld`` daemon is the central brain of the batch system responsible
for monitoring the available resources and scheduling batch jobs. The ``slurmctld`` runs on an administrative node with a special setup to
ensure availability in the case of hardware failures. Most user programs such as ``srun``, ``sbatch``, ``salloc`` and ``scontrol`` interact with
the ``slurmctld``. For the purpose of job accounting ``slurmctld`` communicates with the ``slurmdbd`` database daemon.
Information from the accounting database can be queried using the ``sacct`` command.
Slurm combines the functionality of the batch system and resource management. For this purpose, Slurm provides the ``slurmd`` daemon, which
runs on the compute nodes and interacts with ``slurmctld``. To execute user processes, ``slurmd`` spawns ``slurmstepd`` instances that
shepherd them.

.. ifconfig:: system_name != 'jedi'

   On |SYSTEM_NAME|, no ``slurmd`` is running on the compute nodes. Instead, process management is performed by ``psid``, the management daemon of the
   ParaStation Cluster Suite. The ``psslurm`` plugin for ``psid`` replaces ``slurmd`` on the compute nodes of |SYSTEM_NAME|. As a result, only one daemon is
   required on the compute nodes for resource management, which minimizes jitter that could affect large-scale applications.
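
The interplay of the user commands described above can be illustrated with a minimal shell sketch. The helper name ``submit_and_report`` and the selection of ``sacct`` output fields are illustrative choices and not part of the system setup. The function submits a script with ``sbatch`` and then reads the job's record from the accounting database with ``sacct``.

.. code-block:: bash

   # Hypothetical helper: submit a job script and print its accounting record.
   # Assumes the Slurm client tools sbatch and sacct are available.
   submit_and_report() {
       script=$1
       # --parsable makes sbatch print only the job ID,
       # optionally followed by ";<cluster name>"
       jobid=$(sbatch --parsable "$script") || return 1
       jobid=${jobid%%;*}   # strip the optional cluster suffix
       echo "Submitted job $jobid"
       # Query the accounting database for this job
       sacct --jobs="$jobid" --format=JobID,JobName,Partition,State,Elapsed
   }

Calling ``submit_and_report job.sh`` would print the job ID followed by the accounting entry of the job; ``job.sh`` stands for any job script.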

.. _batch_avail_partitions:

Slurm Partitions
----------------
In Slurm, multiple nodes can be grouped into partitions, which are sets of nodes with associated limits (for wall-clock time, job size, etc.).
In practice, partitions can be used, for example, to request resources that have certain hardware characteristics (normal, large memory, accelerated, etc.) or that are dedicated to specific workloads (large production jobs, small debugging jobs, visualization, etc.).
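
The partitions and their limits can be inspected from a login node. The following sketch assumes the standard Slurm client tools ``sinfo`` and ``scontrol``; the helper names ``list_partitions`` and ``show_partition`` are hypothetical.

.. code-block:: bash

   # Hypothetical helper: one line per partition with
   # %P = partition name, %l = maximum wall-clock time, %D = number of nodes
   list_partitions() {
       sinfo --noheader --format="%P %l %D"
   }

   # Hypothetical helper: full configuration of a single partition
   show_partition() {
       scontrol show partition "$1"
   }

For example, ``show_partition <name>`` prints the settings Slurm stores for the partition ``<name>``, including its time limit and node list.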

Hardware Overview
^^^^^^^^^^^^^^^^^

.. ifconfig:: system_name == 'jedi'

   +----------------------------------+----------+--------------------------------------+
   | Type                             | Quantity | Description                          |
   +==================================+==========+======================================+
   | Standard                         |       48 | 288 cores, 480 GiB                   |
   +----------------------------------+----------+--------------------------------------+
   | Login                            |        1 | 72 cores, 574 GiB                    |
   +----------------------------------+----------+--------------------------------------+

.. ifconfig:: system_name == 'juwels'

   |SYSTEM_NAME| is a modular supercomputer consisting of a Cluster and a Booster module.

   .. note:: Each module is equipped with dedicated login nodes. Job submission to other modules from these logins is possible but currently requires workarounds. For the time being we advise users to submit jobs for each module from their respective login partition.

   JUWELS Cluster module
   """""""""""""""""""""

   +----------------------------------+----------+--------------------------------------+
   | Type                             | Quantity | Description                          |
   +==================================+==========+======================================+
   | Standard / Slim nodes            |     2271 | 48 cores, 96 GiB                     |
   +----------------------------------+----------+--------------------------------------+
   | Large memory nodes               |      240 | 48 cores, 192 GiB                    |
   +----------------------------------+----------+--------------------------------------+
   | Accelerated nodes                |       56 | 40 cores, 192 GiB, 4× V100 SXM2 GPUs |
   +----------------------------------+----------+--------------------------------------+
   | Login nodes                      |       12 | 40 cores, 768 GiB                    |
   +----------------------------------+----------+--------------------------------------+

   JUWELS Booster module
   """""""""""""""""""""

   +----------------------------------+----------+--------------------------------------+
   | Type                             | Quantity | Description                          |
   +==================================+==========+======================================+
   | Booster nodes                    |      936 | 48 cores, 512 GiB, 4× A100 GPUs      |
   +----------------------------------+----------+--------------------------------------+
   | Login nodes                      |        4 | 48 cores, 512 GiB                    |
   +----------------------------------+----------+--------------------------------------+

   Visualization login partition
   """""""""""""""""""""""""""""

   +----------------------------------+----------+--------------------------------------+
   | Type                             | Quantity | Description                          |
   +==================================+==========+======================================+
   | Visualization login node         |        4 | 40 cores, 768 GiB, P100 GPU          |
   +----------------------------------+----------+--------------------------------------+

.. ifconfig:: system_name == 'jureca'

   Login partition
   """""""""""""""

   +----------------------------------+----------+--------------------------------------+
   | Type                             | Quantity | Description                          |
   +==================================+==========+======================================+
   | Login nodes                      |       12 | 128 cores, 1 TiB, 2× Quadro RTX8000  |
   +----------------------------------+----------+--------------------------------------+

   JURECA DC module
   """"""""""""""""

   +----------------------------------+----------+--------------------------------------+
   | Type                             | Quantity | Description                          |
   +==================================+==========+======================================+
   | Standard / Slim nodes            |      480 | 128 cores, 512 GiB                   |
   +----------------------------------+----------+--------------------------------------+
   | Large memory nodes               |       96 | 128 cores, 1 TiB                     |
   +----------------------------------+----------+--------------------------------------+
   | Accelerated nodes                |      192 | 128 cores, 512 GiB, 4× A100 GPUs     |
   +----------------------------------+----------+--------------------------------------+

.. ifconfig:: system_name == 'jusuf'

   +----------------------------------+----------+--------------------------------------+
   | Type                             | Quantity | Description                          |
   +==================================+==========+======================================+
   | Standard / Slim                  |      136 | 128 cores, 256 GiB                   |
   +----------------------------------+----------+--------------------------------------+
   | Accelerated                      |       45 | 128 cores, 256 GiB, 1× V100 GPU      |
   +----------------------------------+----------+--------------------------------------+
   | Login                            |        4 | 128 cores, 256 GiB                   |
   +----------------------------------+----------+--------------------------------------+


.. _batch_available_partitions:

Available Partitions
^^^^^^^^^^^^^^^^^^^^

.. ifconfig:: system_name == 'jedi'

   Compute nodes are used exclusively by jobs of a single user; no node sharing between jobs is done.
   The smallest allocation unit is one node (288 cores).
   Users will be charged for the number of compute nodes multiplied by the wall-clock time used.
   On each node, a share of the available memory is reserved and not available for application usage.

   The system has only one partition called ``all``.

.. ifconfig:: system_name == 'juwels'

   Compute nodes are used exclusively by jobs of a single user; no node sharing between jobs is done.
   The smallest allocation unit is one node (48 cores).
   Users will be charged for the number of compute nodes multiplied by the wall-clock time used.
   On each node, a share of the available memory is reserved and not available for application usage.

   The ``batch``, ``gpus``, ``mem192`` and ``booster`` partitions are intended for production jobs. To support development and code optimization, additional ``devel`` partitions are available.

   The ``batch`` partition is the default partition used when no other partition is specified. It encompasses compute nodes in the |SYSTEM_NAME| Cluster module with 96 GiB and 192 GiB main memory.
   The ``gpus`` partition provides access to |SYSTEM_NAME| Cluster compute nodes with V100 GPUs.
   The ``mem192`` partition contains nodes in the |SYSTEM_NAME| Cluster module with larger main memory.
   The ``booster`` partition encompasses compute nodes in the |SYSTEM_NAME| Booster module.

.. ifconfig:: system_name == 'jureca'

   Compute nodes are used exclusively by jobs of a single user; no node sharing between jobs is done.
   The smallest allocation unit is one node (128 cores).
   Users will be charged for the number of compute nodes multiplied by the wall-clock time used.
   On each node, a share of the available memory is reserved and not available for application usage.

   The ``dc-cpu``, ``dc-gpu`` and ``dc-cpu-bigmem`` partitions are intended for production jobs. To support development and code optimization, additional ``devel`` partitions are available.

   The ``dc-cpu`` partition is the default partition used when no other partition is specified. It encompasses CPU-only compute nodes in the |SYSTEM_NAME| DC module with 512 GiB and 1024 GiB main memory.
   The ``dc-gpu`` partition provides access to |SYSTEM_NAME| compute nodes with A100 GPUs.
   The ``dc-cpu-bigmem`` partition contains nodes with 1 TiB main memory each.
   The ``dc-hwai`` partition is exclusively available to WestAI and HelmholtzAI users and features H100 GPUs.

.. ifconfig:: system_name == 'jusuf'

   Compute nodes are used exclusively by jobs of a single user; no node sharing between jobs is done.
   The smallest allocation unit is one node (128 cores).
   Users will be charged for the number of compute nodes multiplied by the wall-clock time used.
   On each node, a share of the available memory is reserved and not available for application usage.

   The ``batch`` and ``gpus`` partitions are intended for production jobs. To support development and code optimization, additional ``devel`` partitions are available. To support workloads that need to access the Internet, e.g. to retrieve training datasets for AI workloads, a ``scraper`` partition is available. Nodes belonging to that partition have direct access to the Internet. Outgoing SSH connections are blocked, and incoming connections are blocked by a firewall. Other outgoing connections, e.g. to remote web servers, are generally allowed, but JuNet firewall rules apply and may block access to certain IP addresses or ports.

   The ``batch`` partition is the default partition used when no other partition is specified. It encompasses CPU-only compute nodes with 256 GiB main memory.
   The ``gpus`` partition provides access to GPU-equipped compute nodes.

A limit on the maximum number of running jobs per user is enforced.
The precise values are adjusted to optimize system utilization.
In general, the limit for the number of running jobs is lower for nocont projects.
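
The accounting rule stated above for each system (number of compute nodes multiplied by wall-clock time) can be illustrated with a small calculation. The helper ``node_hours`` below is a hypothetical sketch. It expects the elapsed time as ``HH:MM:SS`` and does not handle the ``D-HH:MM:SS`` form that Slurm uses for durations of one day or more. How node-hours translate into the units of a project budget is not covered here.

.. code-block:: bash

   # Hypothetical helper: node-hours consumed by a job.
   # Usage: node_hours <number of nodes> <elapsed time as HH:MM:SS>
   node_hours() {
       # Inside the awk program $1, $2, $3 are the hour, minute and second fields
       echo "$2" | awk -F: -v n="$1" \
           '{ printf "%.2f\n", n * ($1 * 3600 + $2 * 60 + $3) / 3600 }'
   }

   node_hours 4 01:30:00   # 4 nodes for 1.5 h, prints 6.00

Because nodes are allocated exclusively, the result depends only on the number of nodes and the elapsed time, not on how many cores of each node the application actually used.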

.. ifconfig:: system_name == 'jedi'

   JEDI partitions
   ~~~~~~~~~~~~~~~

   +-------------------------------------+---------------------------------------+-----------------------------+
   | Partition                           | Resource                              | Value                       |
   +=====================================+=======================================+=============================+
   | ``all``                             | max. wallclock time (normal / nocont) |  6 h / 6 h                  |
   +-------------------------------------+---------------------------------------+-----------------------------+
   |                                     | default wallclock time                | 1 h                         |
   +-------------------------------------+---------------------------------------+-----------------------------+
   |                                     | min. / max. number of nodes           | 1 / 48                      |
   +-------------------------------------+---------------------------------------+-----------------------------+
   |                                     | node types                            | ``mem480`` (480 GiB)        |
   +-------------------------------------+---------------------------------------+-----------------------------+


.. ifconfig:: system_name == 'juwels'

   In addition to the above-mentioned partitions, the ``large`` and ``largebooster`` partitions are available for large and full-system jobs.
   These partitions are open for submission, but jobs will only run in selected time slots. The use of these partitions needs to be coordinated with the user support.

   To request nodes with particular resources (``gpu``), the corresponding generic resources need to be requested at job submission.
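
   As an illustration, a job script requesting all four GPUs of one node in the ``develgpus`` partition might look as follows. The budget name ``<budget>`` and the executable ``./my_app`` are placeholders, and the chosen values are only one combination within the limits listed in the tables below.

   .. code-block:: bash

      #!/bin/bash
      #SBATCH --account=<budget>      # placeholder: compute budget to charge
      #SBATCH --partition=develgpus   # GPU development partition
      #SBATCH --nodes=1
      #SBATCH --gres=gpu:4            # generic resource: 4 GPUs per node
      #SBATCH --time=00:30:00         # below the 2 h limit of the partition

      # Launch the (hypothetical) application on the allocated node
      srun ./my_app

   Without the ``--gres`` line, the job does not request the ``gpu`` generic resource described above.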

   JUWELS Cluster partitions
   ~~~~~~~~~~~~~~~~~~~~~~~~~

   +-----------------+---------------------------------------+-------------------------------+
   | Partition       | Resource                              | Value                         |
   +=================+=======================================+===============================+
   | ``batch``       | max. wallclock time (normal / nocont) | 24 h / 6 h                    |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 1 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 1024                      |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | | ``mem96`` (96 GiB) and      |
   |                 |                                       | | ``mem192`` (192 GiB)        |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| No                            |
   +-----------------+---------------------------------------+-------------------------------+
   | ``mem192``      | max. wallclock time (normal / nocont) | 24 h / 6 h                    |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 1 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 64                        |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | ``mem192`` (192 GiB)          |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| No                            |
   +-----------------+---------------------------------------+-------------------------------+
   | ``devel``       | max. wallclock time                   | 2 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 30 min                        |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 8                         |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | ``mem96`` (96 GiB)            |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| Yes                           |
   +-----------------+---------------------------------------+-------------------------------+
   | ``gpus``        | max. wallclock time (normal / nocont) | 24 h / 6 h                    |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 1 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 46                        |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | | ``mem192``, ``gpu:[1-4]``   |
   |                 |                                       | | (192 GiB, 4× V100 per node) |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| No                            |
   +-----------------+---------------------------------------+-------------------------------+
   | ``develgpus``   | max. wallclock time                   | 2 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 1 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 2                         |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | | ``mem192``, ``gpu:[1-4]``   |
   |                 |                                       | | (192 GiB, 4× V100 per node) |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| Yes                           |
   +-----------------+---------------------------------------+-------------------------------+

   JUWELS Booster partitions
   ~~~~~~~~~~~~~~~~~~~~~~~~~

   +------------------+---------------------------------------+-------------------------------+
   | Partition        | Resource                              | Value                         |
   +==================+=======================================+===============================+
   | ``booster``      | max. wallclock time (normal / nocont) | 24 h / 6 h                    |
   +------------------+---------------------------------------+-------------------------------+
   |                  | default wallclock time                | 1 h                           |
   +------------------+---------------------------------------+-------------------------------+
   |                  | min. / max. number of nodes           | 1 / 384                       |
   +------------------+---------------------------------------+-------------------------------+
   |                  | node types                            | | ``mem512``, ``gpu:[1-4]``   |
   |                  |                                       | | (512 GiB, 4× A100 per node) |
   +------------------+---------------------------------------+-------------------------------+
   |                  |:ref:`internet access<internet_access>`| No                            |
   +------------------+---------------------------------------+-------------------------------+
   | ``develbooster`` | max. wallclock time                   | 2 h                           |
   +------------------+---------------------------------------+-------------------------------+
   |                  | default wallclock time                | 1 h                           |
   +------------------+---------------------------------------+-------------------------------+
   |                  | min. / max. number of nodes           | 1 / 4                         |
   +------------------+---------------------------------------+-------------------------------+
   |                  | node types                            | | ``mem512``, ``gpu:[1-4]``   |
   |                  |                                       | | (512 GiB, 4× A100 per node) |
   +------------------+---------------------------------------+-------------------------------+
   |                  |:ref:`internet access<internet_access>`| Yes                           |
   +------------------+---------------------------------------+-------------------------------+

.. ifconfig:: system_name == 'jureca'

   In addition to the above-mentioned partitions, the ``dc-cpu-large`` and ``dc-gpu-large`` partitions are available for large and full-module jobs.
   These partitions are open for submission, but jobs will only run in selected time slots. The use of these partitions needs to be coordinated with the user support.

   To request nodes with particular resources (``mem1024``, ``gpu``), the corresponding generic resources need to be requested at job submission.
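
   A sketch of a job script header that asks for large-memory nodes in the ``dc-cpu`` partition is shown below. It assumes that the generic resource is requested under the name ``mem1024`` via ``--gres``, as the table below suggests; the budget name ``<budget>`` and the executable ``./my_app`` are placeholders.

   .. code-block:: bash

      #!/bin/bash
      #SBATCH --account=<budget>    # placeholder: compute budget to charge
      #SBATCH --partition=dc-cpu
      #SBATCH --nodes=2
      #SBATCH --gres=mem1024        # assumption: selects the 1024 GiB nodes
      #SBATCH --time=02:00:00

      # Launch the (hypothetical) application on the allocated nodes
      srun ./my_app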

   JURECA DC module partitions
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~

   +---------------------+---------------------------------------+-----------------------------------------+
   | Partition           | Resource                              | Value                                   |
   +=====================+=======================================+=========================================+
   | ``dc-cpu``          | max. wallclock time (normal / nocont) | 24 h / 6 h                              |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | default wallclock time                | 1 h                                     |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | min. / max. number of nodes           | 1 / 128                                 |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | node types / gres                     | | ``mem512`` (512 GiB) and              |
   |                     |                                       | | ``mem1024`` (1024 GiB)                |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | features                              | nodesubset@jrc0[710-719]: ``largedata`` |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     |:ref:`internet access<internet_access>`| No                                      |
   +---------------------+---------------------------------------+-----------------------------------------+
   | ``dc-gpu``          | max. wallclock time (normal / nocont) | 24 h / 6 h                              |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | default wallclock time                | 1 h                                     |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | min. / max. number of nodes           | 1 / 24                                  |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | node types / gres                     | | ``mem512``, ``gpu:[1-4]``             |
   |                     |                                       | | (512 GiB, 4× A100 per node)           |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     |:ref:`internet access<internet_access>`| No                                      |
   +---------------------+---------------------------------------+-----------------------------------------+
   | ``dc-cpu-bigmem``   | max. wallclock time (normal / nocont) | 24 h / 6 h                              |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | default wallclock time                | 1 h                                     |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | min. / max. number of nodes           | 1 / 48                                  |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | node types / gres                     | ``mem1024`` (1024 GiB)                  |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | features                              | ``bigmem`` (1024 GiB)                   |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     |:ref:`internet access<internet_access>`| No                                      |
   +---------------------+---------------------------------------+-----------------------------------------+
   | ``dc-cpu-devel``    | max. wallclock time                   | 2 h                                     |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | default wallclock time                | 30 min                                  |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | min. / max. number of nodes           | 1 / 4                                   |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | node types / gres                     | ``mem512`` (512 GiB)                    |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     |:ref:`internet access<internet_access>`| Yes                                     |
   +---------------------+---------------------------------------+-----------------------------------------+
   | ``dc-gpu-devel``    | max. wallclock time                   | 2 h                                     |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | default wallclock time                | 30 min                                  |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | min. / max. number of nodes           | 1 / 4                                   |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | node types / gres                     | | ``mem512``, ``gpu:[1-4]``             |
   |                     |                                       | | (512 GiB, 4× A100 per node)           |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     |:ref:`internet access<internet_access>`| Yes                                     |
   +---------------------+---------------------------------------+-----------------------------------------+
   | ``dc-hwai``         | max. wallclock time                   | 24 h                                    |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | default wallclock time                | 1 h                                     |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | min. / max. number of nodes           | 1 / 4                                   |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     | node types / gres                     | | ``mem512``, ``gpu:[1-4]``             |
   |                     |                                       | | (512 GiB, 4× H100 per node)           |
   +---------------------+---------------------------------------+-----------------------------------------+
   |                     |:ref:`internet access<internet_access>`| No                                      |
   +---------------------+---------------------------------------+-----------------------------------------+

.. ifconfig:: system_name == 'jusuf'

   +-----------------+---------------------------------------+-------------------------------+
   | Partition       | Resource                              | Value                         |
   +=================+=======================================+===============================+
   | ``batch``       | max. wallclock time (normal / nocont) | 24 h / 6 h                    |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 1 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 110                       |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | | ``mem256`` (256 GiB)        |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| No                            |
   +-----------------+---------------------------------------+-------------------------------+
   | ``scraper``     | max. wallclock time (normal / nocont) | 24 h / 6 h                    |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 1 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 10                        |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | | ``mem256`` (256 GiB)        |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| Yes                           |
   +-----------------+---------------------------------------+-------------------------------+
   | ``gpus``        | max. wallclock time (normal / nocont) | 24 h / 6 h                    |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 1 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 39                        |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | | ``mem256``, ``gpu1``        |
   |                 |                                       | | (256 GiB, 1x V100 per node) |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| No                            |
   +-----------------+---------------------------------------+-------------------------------+
   | ``develgpus``   | max. wallclock time (normal / nocont) | 24 h / 6 h                    |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | default wallclock time                | 1 h                           |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | min. / max. number of nodes           | 1 / 6                         |
   +-----------------+---------------------------------------+-------------------------------+
   |                 | node types                            | | ``mem256``, ``gpu1``        |
   |                 |                                       | | (256 GiB, 1x V100 per node) |
   +-----------------+---------------------------------------+-------------------------------+
   |                 |:ref:`internet access<internet_access>`| Yes                           |
   +-----------------+---------------------------------------+-------------------------------+

.. _internet_access:

Internet Access
---------------

For security reasons, internet access is not allowed on the compute nodes; please take this into consideration when running your jobs.
Internet access is, however, allowed on the login nodes and on the ``devel`` partitions to facilitate development activities.
If your production jobs require files available on the internet, consider downloading them first on the login nodes and using them from the jobs (some frameworks support running in "offline mode").
If this is not sufficient, and you need to run a "scraping" style workflow, please contact your Project Mentor or contact SC-Support at sc@fz-juelich.de for assistance.
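
As a minimal sketch of this staging approach, the input data can be fetched once on a login node and then read from the file system inside the job. The URL, the ``$SCRATCH`` target directory, the program name and its ``--input`` option are placeholders chosen for illustration, not fixed conventions of |SYSTEM_NAME|:

.. code-block:: bash

   # On a login node (internet access available): download the data once.
   # URL and paths are hypothetical placeholders.
   mkdir -p "${SCRATCH}/inputs"
   curl --fail --location --output "${SCRATCH}/inputs/dataset.tar.gz" \
        "https://example.org/dataset.tar.gz"

   # Inside the job script (no internet access): use the local copy only.
   srun ./my-prog --input "${SCRATCH}/inputs/dataset.tar.gz"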

.. ifconfig:: system_name == 'jusuf'

   The compute nodes on the ``scraper`` partition also have access to the internet. See :ref:`batch_available_partitions` for more information.

.. note::

   Even where internet access is allowed, only a few ports (such as the usual HTTP/HTTPS ports 80 and 443) are open, while many other ports are blocked.


.. _batch_allocations:

Allocations, Jobs and Job Steps
-------------------------------

In Slurm a job is an allocation of selected resources for a specific amount of time. A job allocation can be requested using ``sbatch`` and ``salloc``.
Within a job multiple job steps can be executed using ``srun`` that use all or a subset of the allocated compute nodes. Job steps may execute at
the same time if the resource allocation permits it.
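
A minimal sketch of two job steps running side by side within one allocation is shown below. The program names ``./prog-a`` and ``./prog-b`` are placeholders, and whether the steps really overlap in time depends on the resources that are free in the allocation:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --account=<budget>
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=4
   #SBATCH --time=00:10:00

   # Each job step requests one node; "&" sends the step to the
   # background so that the next srun can start right away.
   srun --nodes=1 --ntasks=4 ./prog-a &
   srun --nodes=1 --ntasks=4 ./prog-b &

   # Wait for both background job steps before the batch script exits.
   wait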

Writing a Batch Script
----------------------

Users submit batch applications (usually bash scripts) using the ``sbatch`` command. The script is executed on the first compute node in the allocation. To execute parallel MPI tasks users call ``srun`` within their script.

.. note::

   ``mpiexec`` is not supported on |SYSTEM_NAME| and has to be replaced by ``srun``.

The minimal template to be filled in is:

.. ifconfig:: system_name == 'jedi'

   .. code-block:: none

      #!/bin/bash -x
      #SBATCH --account=<budget account>
      # budget account where contingent is taken from
      #SBATCH --nodes=<no of nodes>
      #SBATCH --ntasks=<no of tasks (MPI processes)>
      # can be omitted if --nodes and --ntasks-per-node
      # are given
      #SBATCH --ntasks-per-node=<no of tasks per node>
      # if keyword omitted: Max. 288 tasks per node
      # (SMT enabled, see comment below)
      #SBATCH --cpus-per-task=<no of threads per task>
      # for OpenMP/hybrid jobs only
      #SBATCH --output=<path of output file>
      # if keyword omitted: Default is slurm-%j.out in
      # the submission directory (%j is replaced by
      # the job ID).
      #SBATCH --error=<path of error file>
      # if keyword omitted: Default is slurm-%j.out in
      # the submission directory.
      #SBATCH --time=<walltime>
      #SBATCH --partition=all

      # *** start of job script ***
      # Note: The current working directory at this point is
      # the directory where sbatch was executed.

      export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
      srun <executable>




.. ifconfig:: system_name == 'juwels'

   .. code-block:: none

      #!/bin/bash -x
      #SBATCH --account=<budget account>
      # budget account where contingent is taken from
      #SBATCH --nodes=<no of nodes>
      #SBATCH --ntasks=<no of tasks (MPI processes)>
      # can be omitted if --nodes and --ntasks-per-node
      # are given
      #SBATCH --ntasks-per-node=<no of tasks per node>
      # if keyword omitted: Max. 96 tasks per node
      # (SMT enabled, see comment below)
      #SBATCH --cpus-per-task=<no of threads per task>
      # for OpenMP/hybrid jobs only
      #SBATCH --output=<path of output file>
      # if keyword omitted: Default is slurm-%j.out in
      # the submission directory (%j is replaced by
      # the job ID).
      #SBATCH --error=<path of error file>
      # if keyword omitted: Default is slurm-%j.out in
      # the submission directory.
      #SBATCH --time=<walltime>
      #SBATCH --partition=<batch, booster, mem192, ...>
      #SBATCH --gres=gpu:<n>
      # for the gpus and booster partitions

      # *** start of job script ***
      # Note: The current working directory at this point is
      # the directory where sbatch was executed.

      export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
      srun <executable>

.. ifconfig:: system_name == 'jureca'

   .. code-block:: none

      #!/bin/bash -x
      #SBATCH --account=<budget account>
      # budget account where contingent is taken from
      #SBATCH --nodes=<no of nodes>
      #SBATCH --ntasks=<no of tasks (MPI processes)>
      # can be omitted if --nodes and --ntasks-per-node
      # are given
      #SBATCH --ntasks-per-node=<no of tasks per node>
      # if keyword omitted: Max. 256 tasks per node
      # (SMT enabled, see comment below)
      #SBATCH --cpus-per-task=<no of threads per task>
      # for OpenMP/hybrid jobs only
      #SBATCH --output=<path of output file>
      # if keyword omitted: Default is slurm-%j.out in
      # the submission directory (%j is replaced by
      # the job ID).
      #SBATCH --error=<path of error file>
      # if keyword omitted: Default is slurm-%j.out in
      # the submission directory.
      #SBATCH --time=<walltime>
      #SBATCH --partition=<dc-cpu, ...>

      # *** start of job script ***
      # Note: The current working directory at this point is
      # the directory where sbatch was executed.

      export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
      srun <executable>

.. ifconfig:: system_name == 'jusuf'

   .. code-block:: none

      #!/bin/bash -x
      #SBATCH --account=<budget account>
      # budget account where contingent is taken from
      #SBATCH --nodes=<no of nodes>
      #SBATCH --ntasks=<no of tasks (MPI processes)>
      # can be omitted if --nodes and --ntasks-per-node
      # are given
      #SBATCH --ntasks-per-node=<no of tasks per node>
      # if keyword omitted: Max. 256 tasks per node
      # (SMT enabled, see comment below)
      #SBATCH --cpus-per-task=<no of threads per task>
      # for OpenMP/hybrid jobs only
      #SBATCH --output=<path of output file>
      # if keyword omitted: Default is slurm-%j.out in
      # the submission directory (%j is replaced by
      # the job ID).
      #SBATCH --error=<path of error file>
      # if keyword omitted: Default is slurm-%j.out in
      # the submission directory.
      #SBATCH --time=<walltime>
      #SBATCH --partition=<batch, gpus, ...>

      # *** start of job script ***
      # Note: The current working directory at this point is
      # the directory where sbatch was executed.

      export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
      srun <executable>


Multiple ``srun`` calls can be placed in a single batch script.
Options such as ``--account``, ``--nodes``, ``--ntasks`` and ``--ntasks-per-node`` are by default taken from the ``sbatch`` arguments but can be overridden for each ``srun`` invocation.
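
As an illustration (a sketch with placeholder program names), the first job step below inherits its geometry from the ``sbatch`` options, while the second one sets ``--nodes`` and ``--ntasks`` itself for a smaller post-processing step:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --account=<budget>
   #SBATCH --nodes=4
   #SBATCH --ntasks-per-node=8
   #SBATCH --time=00:30:00

   # First job step: uses --nodes=4 and --ntasks-per-node=8 from sbatch.
   srun ./simulate

   # Second job step: runs only after the first has finished and
   # restricts itself to 8 tasks on a single node of the allocation.
   srun --nodes=1 --ntasks=8 ./postprocess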

.. ifconfig:: system_name not in ('jureca',  'jedi')

   The default partition on |SYSTEM_NAME|, which is used if ``--partition`` is omitted, is the ``batch`` partition.

.. ifconfig:: system_name == 'jedi'

   The default partition on |SYSTEM_NAME|, which is used if ``--partition`` is omitted, is the ``all`` partition.

.. ifconfig:: system_name == 'jureca'

   The default partition on |SYSTEM_NAME|, which is used if ``--partition`` is omitted, is the ``dc-cpu`` partition.

.. ifconfig:: system_name == 'juwels'

   .. note::

      SMT (simultaneous multithreading) is not enabled automatically, even if ``--ntasks-per-node`` is omitted or set to a value higher than 48.
      The Cluster and Booster compute nodes have 48 physical cores and the nodes in the ``gpus`` partition feature 40 physical cores. The number of logical cores is twice this number. To use the SMT capability, it must be activated manually by using the flag ``--threads-per-core=2``.

.. ifconfig:: system_name == 'jureca'

   .. note::

      SMT (simultaneous multithreading) is not enabled automatically, even if ``--ntasks-per-node`` is omitted or set to a value higher than 128.
      While each compute node in the DC module features 128 physical cores and 256 logical cores, use of the SMT capability must be activated manually by using the flag ``--threads-per-core=2``.

.. ifconfig:: system_name == 'jusuf'

   .. note::

      SMT (simultaneous multithreading) is not enabled automatically, even if ``--ntasks-per-node`` is omitted or set to a value higher than 128.
      While each compute node has 128 physical cores and 256 logical cores, use of the SMT capability must be activated manually by using the flag ``--threads-per-core=2``.
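
Whether a given combination of ``--ntasks-per-node`` and ``--cpus-per-task`` fits on a node can be estimated with simple arithmetic: the product must not exceed the number of physical cores times the hardware threads used per core. The helper below is a hypothetical convenience function, not part of Slurm, and it ignores any further placement rules the scheduler may apply:

.. code-block:: bash

   #!/bin/bash
   # Rough check that tasks-per-node x cpus-per-task fits on one node.
   # Arguments: tasks per node, CPUs per task, physical cores per node,
   # hardware threads used per core (1 = no SMT, 2 = --threads-per-core=2).
   fits_on_node() {
       local tasks_per_node="$1" cpus_per_task="$2"
       local cores="$3" threads_per_core="$4"
       # Exit status 0 if the request fits, 1 otherwise.
       (( tasks_per_node * cpus_per_task <= cores * threads_per_core ))
   }

   # Usage: 8 tasks x 32 CPUs on a 128-core node with SMT enabled.
   if fits_on_node 8 32 128 2; then
       echo "request fits on one node"
   fi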

.. _batch_job_example:

Job Script Examples
^^^^^^^^^^^^^^^^^^^

.. note::

   For more information about the use of ``--cpus-per-task``, ``SRUN_CPUS_PER_TASK`` and ``SBATCH_CPUS_PER_TASK`` after the update to Slurm version 23.02, please refer to the
   affinity documentation found here: https://apps.fz-juelich.de/jsc/hps/jureca/affinity.html

.. ifconfig:: system_name == 'jedi'

   **Example 1:** MPI application starting 1152 tasks on 4 nodes using 288 CPUs per node (no SMT) running for max. 15 minutes:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=4
         #SBATCH --ntasks=1152
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:15:00
         #SBATCH --partition=all

         srun ./mpi-prog

.. ifconfig:: system_name == 'juwels'

   **Example 1:** MPI application starting 3072 tasks on 64 nodes using 48 CPUs per node (no SMT) running for max. 15 minutes:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=64
         #SBATCH --ntasks=3072
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:15:00
         #SBATCH --partition=batch

         srun ./mpi-prog

   **Example 2:** MPI application starting 3072 tasks on 32 nodes using 96 logical CPUs (hardware threads) per node (SMT enabled) running for max. 20 minutes:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=32
         #SBATCH --ntasks-per-node=96
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=batch

         srun ./mpi-prog

   **Example 3:** Hybrid application starting 3 tasks per node on 64 allocated nodes and starting 14 threads per task (no SMT):

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=64
         #SBATCH --ntasks-per-node=3
         #SBATCH --cpus-per-task=14
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=batch

         export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
         export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
         srun ./hybrid-prog

   **Example 4:** Hybrid application starting 4 tasks per node on 64 allocated nodes and starting 24 threads per task (SMT enabled):

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=64
         #SBATCH --ntasks-per-node=4
         #SBATCH --cpus-per-task=24
         #SBATCH --threads-per-core=2
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=batch

         export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
         export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
         srun ./hybrid-prog

   **Example 5:** MPI application starting 3072 tasks on 64 nodes using 48 CPUs per node (no SMT) running for max. 15 minutes on nodes with 192 GiB main memory. This example is identical to Example 1 except for the requested node type:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=64
         #SBATCH --ntasks=3072
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:15:00
         #SBATCH --partition=mem192

         srun ./mpi-prog

.. ifconfig:: system_name == 'jureca'

   **Example 1:** MPI application starting 512 tasks on 4 nodes using 128 CPUs per node (no SMT) running for max. 15 minutes:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=4
         #SBATCH --ntasks-per-node=128
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:15:00
         #SBATCH --partition=dc-cpu

         srun ./mpi-prog

   **Example 2:** MPI application starting 4096 tasks on 16 nodes using 256 logical CPUs (hardware threads) per node (SMT enabled) running for max. 20 minutes:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=16
         #SBATCH --ntasks-per-node=256
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=dc-cpu

         srun ./mpi-prog

   **Example 3:** Hybrid application starting 8 tasks per node on 4 allocated nodes and starting 16 threads per task (no SMT):

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=4
         #SBATCH --ntasks-per-node=8
         #SBATCH --cpus-per-task=16
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=dc-cpu

         export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
         export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
         srun ./hybrid-prog

   **Example 4:** Hybrid application starting 8 tasks per node on 3 allocated nodes and starting 32 threads per task (SMT enabled):

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=3
         #SBATCH --ntasks-per-node=8
         #SBATCH --cpus-per-task=32
         #SBATCH --threads-per-core=2 
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=dc-cpu

         export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
         export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
         srun ./hybrid-prog

.. ifconfig:: system_name == 'jusuf'

   **Example 1:** MPI application starting 4096 tasks on 32 nodes using 128 CPUs per node (no SMT) running for max. 15 minutes:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=32
         #SBATCH --ntasks=4096
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:15:00
         #SBATCH --partition=batch

         srun ./mpi-prog

   **Example 2:** MPI application starting 8192 tasks on 32 nodes using 256 logical CPUs (hardware threads) per node (SMT enabled) running for max. 20 minutes:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=32
         #SBATCH --ntasks-per-node=256
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=batch

         srun ./mpi-prog

   **Example 3:** Hybrid application starting 3 tasks per node on 64 allocated nodes and starting 34 threads per task (no SMT):

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=64
         #SBATCH --ntasks-per-node=3
         #SBATCH --cpus-per-task=34
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=batch

         export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
         export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
         srun ./hybrid-prog

   **Example 4:** Hybrid application starting 4 tasks per node on 64 allocated nodes and starting 64 threads per task (SMT enabled):

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=64
         #SBATCH --ntasks-per-node=4
         #SBATCH --threads-per-core=2
         #SBATCH --cpus-per-task=64
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:20:00
         #SBATCH --partition=batch

         export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
         export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}
         srun ./hybrid-prog

   **Example 5:** MPI application starting 4096 tasks on 32 nodes using 128 CPUs per node (no SMT) running for max. 15 minutes on nodes in the ``gpus`` partition. This example is similar to Example 1 except for the requested node type:

   .. code-block:: none

         #!/bin/bash -x
         #SBATCH --account=<budget>
         #SBATCH --nodes=32
         #SBATCH --ntasks=4096
         #SBATCH --output=mpi-out.%j
         #SBATCH --error=mpi-err.%j
         #SBATCH --time=00:15:00
         #SBATCH --partition=gpus

         srun ./mpi-prog

The job script is submitted using:

.. code-block:: none

      $ sbatch <jobscript>

On success, ``sbatch`` writes the job ID to standard output.

.. note::

   One can also define ``sbatch`` options on the command line, e.g.:

   .. code-block:: none

      $ sbatch --nodes=4 --account=<budget> --time=01:00:00 <jobscript>
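
In scripts it is often useful to capture the job ID, for example to chain jobs. The sketch below assumes the usual single-cluster success message ``Submitted batch job <id>``; ``parse_job_id`` is a hypothetical helper and the job script names are placeholders. Alternatively, ``sbatch --parsable`` prints the job ID without the surrounding text.

.. code-block:: bash

   #!/bin/bash
   # Extract the job ID from the sbatch success message,
   # e.g. "Submitted batch job 123456" -> "123456".
   parse_job_id() {
       local message="$1"
       # Remove everything up to and including the last space.
       echo "${message##* }"
   }

   # Hypothetical usage on a login node (script names are placeholders):
   #   first=$(parse_job_id "$(sbatch preprocess.sh)")
   #   sbatch --dependency=afterok:"${first}" simulate.sh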

.. _batch_generic_resources:

Generic Resources, Features and Topology-aware Allocations
----------------------------------------------------------

.. ifconfig:: system_name == 'jedi'

   All nodes on JEDI are the same, so there is no need to differentiate via Slurm's generic resources or features mechanisms.

   .. note:: The charged computing time is independent of the number of specified GPUs. Production workloads must use all available GPU resources per node.

.. ifconfig:: system_name == 'juwels'

   In order to request resources with special features (additional main memory, GPU devices) the ``--gres`` option to ``sbatch`` can be used.
   For ``mem192`` nodes, which are accessible via specific partitions, the ``--gres`` option can be omitted.
   Since the GPU and visualization nodes feature multiple user-visible GPU devices an additional quantity can be specified as shown in the following examples.
   With the Slurm submission option ``--constraint`` users can request resources/nodes according to Slurm Features.

   +--------------------------------------+----------------------------------------------------+
   | Option                               | Requested hardware features                        |
   +======================================+====================================================+
   | ``--partition=mem192``               | 192 GiB main memory                                |
   +--------------------------------------+----------------------------------------------------+
   | ``--gres=gpu:4 --partition=booster`` | Booster node, 4 GPUs per node                      |
   +--------------------------------------+----------------------------------------------------+
   | ``--gres=gpu:2 --partition=gpus``    | Cluster node, 2 GPUs per node                      |
   +--------------------------------------+----------------------------------------------------+
   | ``--gres=gpu:4 --partition=gpus``    | Cluster node, 4 GPUs per node                      |
   +--------------------------------------+----------------------------------------------------+
   | ``--constraint=largedata``           | XCST storage - largedata, largedata2               |
   +--------------------------------------+----------------------------------------------------+

   Complete list of Slurm GRES on |SYSTEM_NAME|:

   +--------------+------------+
   | GRES         | Node Count |
   +==============+============+
   |  ``mem512``, |       936  |
   |  ``gpu:4``   |            |
   +--------------+------------+
   |   ``mem96``  |      2271  |
   +--------------+------------+
   |   ``mem96``, |       240  |
   |   ``mem192`` |            |
   +--------------+------------+
   |  ``mem96``,  |         56 |
   |  ``mem192``, |            |
   |  ``gpu:4``   |            |
   +--------------+------------+

   Complete list of Slurm Features on |SYSTEM_NAME|:

   +-------------------------+-------+
   | Features                | Count |
   +=========================+=======+
   | ``gpu``, ``bcell01``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell02``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell03``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell04``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell05``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell06``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell07``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell08``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell09``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell10``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell11``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell12``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell13``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell14``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell15``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell16``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell17``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell18``    | 48    |
   +-------------------------+-------+
   | ``gpu``, ``bcell19``    | 38    |
   +-------------------------+-------+
   | ``gpu``, ``bcell19``,   | 10    |
   | ``largedata``           |       |
   +-------------------------+-------+
   | ``gpu``, ``bcell20``    | 24    |
   +-------------------------+-------+
   | ``skylake``             | 2501  |
   +-------------------------+-------+
   | ``skylake``,            | 10    |
   | ``largedata``           |       |
   +-------------------------+-------+
   | ``skylake``, ``gpu``    | 56    |
   +-------------------------+-------+

   If no specific memory size is requested the default ``--gres=mem96`` is automatically added to the submission to the |SYSTEM_NAME| Cluster module.
   Please note that jobs requesting 96 GiB may also run on nodes with 192 GiB if no other free resources are available.

   If no ``gpu`` GRES is given then ``--gres=gpu:4`` is automatically added by Slurm's submission filter for all partitions with GPU nodes.
   Please note that GPU applications can request GPU devices per node via ``--gres=gpu:n`` where ``n`` can be ``1``, ``2``, ``3`` or ``4`` on GPU compute nodes.
   Please refer to the :ref:`JUWELS GPU computing page for examples <gpu_computing>`.

   .. note:: The charged computing time is independent of the number of specified GPUs. Production workloads must use all available GPU resources per node.

   The XCST storage resource is available on all Login systems as well as on 10 Cluster Compute nodes and 10 Booster Compute nodes inside the usual default batch partitions ``batch`` and ``booster``.
   For an example on how to use it, please refer to :ref:`How to access largedata on a limited number of computes within your jobs?<largedata-use>`

   On |SYSTEM_NAME| a tree topology is used in the Slurm configuration. The following table shows how many compute nodes are connected to each (InfiniBand) leaf switch
   on each system module. Note that Booster nodes have 4 HCAs, each connected to a different switch, so for scheduling purposes real switches are aggregated in
   a single "virtual" switch per rack containing all nodes in that rack.

   +----------------+-------------------------------------+
   | System module  | Slurm view of nodes per leaf switch |
   +================+=====================================+
   | JUWELS Cluster | 21 or 24                            |
   +----------------+-------------------------------------+
   | JUWELS Booster | 24                                  |
   +----------------+-------------------------------------+

   With the Slurm submission option ``--switches=<count>[@max-time]`` users can request the maximum count of leaf switches that will be used for their jobs. This is
   especially useful for network-bound applications, where network locality and maximum network performance are required. Optionally, users can also define the maximum
   time to wait for the given number of switches to be available.
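
   For example, the following sketch asks for all nodes of a 16-node job to be placed below a single leaf switch and lets the job wait at most 30 minutes for such a placement (the node count and waiting time are arbitrary illustrative values):

   .. code-block:: none

      $ sbatch --nodes=16 --switches=1@00:30:00 <jobscript>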

.. ifconfig:: system_name == 'jureca'

   In order to request resources with special features (additional main memory, GPU devices, largedata) the ``--gres`` option to ``sbatch`` can be used.
   For ``mem1024`` nodes, which are accessible via specific partitions, the ``--gres`` option can be omitted.
   Since the GPU and visualization nodes feature multiple user-visible GPU devices an additional quantity can be specified as shown in the following examples.
   With the Slurm submission option ``--constraint`` users can request resources/nodes according to Slurm Features.

   +--------------------------------------+----------------------------------------------------+
   | Option                               | Requested hardware features                        |
   +======================================+====================================================+
   | ``--partition=dc-cpu-bigmem``        | 1 TiB main memory                                  |
   +--------------------------------------+----------------------------------------------------+
   | ``--gres=gpu:2 --partition=dc-gpu``  | GPU node, 2 GPUs per node                          |
   +--------------------------------------+----------------------------------------------------+
   | ``--gres=gpu:4 --partition=dc-gpu``  | GPU node, 4 GPUs per node                          |
   +--------------------------------------+----------------------------------------------------+
   | ``--constraint=largedata``           | XCST storage - largedata, largedata2               |
   +--------------------------------------+----------------------------------------------------+
   | ``--constraint=bigmem``              | 1 TiB main memory                                  |
   +--------------------------------------+----------------------------------------------------+

   Complete list of Slurm GRES on |SYSTEM_NAME|:

   +--------------+------------+
   | GRES         | Node Count |
   +==============+============+
   |  ``mem512``  |       480  |
   +--------------+------------+
   |  ``mem512``, |       192  |
   |  ``gpu:4``   |            |
   +--------------+------------+
   |  ``mem512``, |        96  |
   |  ``mem1024`` |            |
   +--------------+------------+

   Complete list of Slurm Features on |SYSTEM_NAME|:

   +------------------------------------------------+-------+
   | Features                                       | Count |
   +================================================+=======+
   | ``cpu``, ``rack01``, ``cell01``                | 94    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack01``, ``cell01``, ``largedata`` |  2    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack02``, ``cell01``                | 94    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack02``, ``cell01``, ``largedata`` |  2    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack11``, ``cell06``                | 94    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack11``, ``cell06``, ``largedata`` |  2    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack12``, ``cell06``                | 94    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack12``, ``cell06``, ``largedata`` |  2    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack13``, ``cell07``                | 94    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack13``, ``cell07``, ``largedata`` |  2    |
   +------------------------------------------------+-------+
   | ``cpu``, ``rack14``, ``cell07``, ``bigmem``    | 96    |
   +------------------------------------------------+-------+
   | ``gpu``, ``rack03``, ``cell02``                | 24    |
   +------------------------------------------------+-------+
   | ``gpu``, ``rack04``, ``cell02``                | 24    |
   +------------------------------------------------+-------+
   | ``gpu``, ``rack05``, ``cell03``                | 24    |
   +------------------------------------------------+-------+
   | ``gpu``, ``rack06``, ``cell03``                | 24    |
   +------------------------------------------------+-------+
   | ``gpu``, ``rack07``, ``cell04``                | 24    |
   +------------------------------------------------+-------+
   | ``gpu``, ``rack08``, ``cell04``                | 24    |
   +------------------------------------------------+-------+
   | ``gpu``, ``rack09``, ``cell05``                | 24    |
   +------------------------------------------------+-------+
   | ``gpu``, ``rack10``, ``cell05``                | 24    |
   +------------------------------------------------+-------+

   If no specific memory size is requested the default ``--gres=mem512`` is automatically added to the submission.
   Please note that jobs requesting 512 GiB may also run on nodes with 1024 GiB if no other free resources are available.

   If no ``gpu`` GRES is given then ``--gres=gpu:4`` is automatically added by Slurm's submission filter for all partitions with GPU nodes.
   Please note that GPU applications can request GPU devices per node via ``--gres=gpu:n`` where ``n`` can be ``1``, ``2``, ``3`` or ``4`` on GPU compute nodes.
   Additionally, while not directly part of the batch system, it is important that the module ``MPI-settings/CUDA`` is loaded to ensure that MPI is configured to
   communicate properly between GPUs.
   Please refer to the :ref:`JURECA GPU computing page for examples <gpu_computing>`.

   .. note:: The charged computing time is independent of the number of specified GPUs. Production workloads must use all available GPU resources per node.

   The XCST storage resource is available on all login nodes as well as on 10 JURECA-DC compute nodes inside the usual default batch partition ``dc-cpu``.
   For an example of how to use it, please refer to :ref:`How to access largedata on a limited number of computes within your jobs?<largedata-use>`.

   On |SYSTEM_NAME| a tree topology is used in the Slurm configuration.
   The following table shows how many compute nodes are connected to each (InfiniBand) leaf switch in each partition.
   Note that GPU nodes have 2 HCAs, each connected to a different switch, so for scheduling purposes the real switches are aggregated into a single "virtual" switch per rack containing all nodes in that rack.

   +----------------+-------------------------------------+
   | Partition      | Slurm view of nodes per leaf switch |
   +================+=====================================+
   | JURECA-DC CPU  | 96                                  |
   +----------------+-------------------------------------+
   | JURECA-DC GPU  | 24                                  |
   +----------------+-------------------------------------+

   With the Slurm submission option ``--switches=<count>[@max-time]`` users can request the maximum count of leaf switches that will be used for their jobs.
   This is especially useful for network-bound applications, where network locality and maximum network performance are required.
   Optionally, users can also define the maximum time to wait for the given number of switches to become available.
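
   As a minimal sketch, a network-bound job could ask for all of its nodes to be placed below a single leaf switch and accept waiting up to 30 minutes for such a placement.
   The node count, the wait time and the program name ``./mpi-prog`` are illustrative assumptions, not site recommendations:

   .. code-block:: bash

      #!/bin/bash -x
      #SBATCH --account=<budget>
      #SBATCH --partition=dc-cpu
      #SBATCH --nodes=16
      #SBATCH --time=01:00:00
      # Request at most one leaf switch and wait up to 30 minutes for it
      #SBATCH --switches=1@00:30:00

      srun ./mpi-prog

   According to the Slurm documentation, the job stays pending until an allocation with the requested switch count is found or the given wait time expires.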

.. ifconfig:: system_name == 'jusuf'

   As all available resources are currently accessible via specific partitions (``batch``, ``gpus``, ``develgpus``), no additional options are needed to request resources with special features.
   Since the GPU nodes feature just one GPU device, the ``--gres=gpu:1`` option can be omitted.

   +-------------------------------------+--------------------------------------+
   | Option                              | Requested hardware features          |
   +=====================================+======================================+
   | ``--partition=batch``               | Regular CPU-only nodes               |
   +-------------------------------------+--------------------------------------+
   | | ``--partition=gpus``              | 1 GPU per node                       |
   | | ``--gres=gpu:1 --partition=gpus`` |                                      |
   +-------------------------------------+--------------------------------------+
   | ``--constraint=largedata``          | XCST storage - largedata, largedata2 |
   +-------------------------------------+--------------------------------------+

   Complete list of Slurm GRES on JUSUF:

   +--------+------------+
   | GRES   | Node Count |
   +========+============+
   | mem256 | 181        |
   +--------+------------+
   | gpu    | 45         |
   +--------+------------+

   Complete list of Slurm Features on JUSUF:

   +-----------+-------+
   | Features  | Count |
   +===========+=======+
   | normal    | 181   |
   +-----------+-------+
   | ams       | 16    |
   +-----------+-------+
   | scraper   | 10    |
   +-----------+-------+
   | largedata | 10    |
   +-----------+-------+

   The XCST storage resource is available on all login nodes as well as on 10 JUSUF Cluster compute nodes inside the usual default batch partition ``batch``.
   For an example of how to use it, please refer to :ref:`How to access largedata on a limited number of computes within your jobs?<largedata-use>`.

.. ifconfig:: system_name == 'juwels' or system_name == 'jureca' or system_name == 'jusuf'

  Please see :ref:`GPU Computing <gpu_computing>` for more details.

.. _batch_jobsteps:

Job Steps
---------

.. ifconfig:: system_name == 'jedi'

   The example below shows a job script where two different job steps are initiated within one job. In total 576 cores are allocated on
   two nodes, where ``-n 288`` causes each job step to use 288 cores on one of the compute nodes. Additionally, in this example the option
   ``--exclusive`` is passed to ``srun`` to ensure that distinct cores are allocated to each job step:

   .. code-block:: bash

      #!/bin/bash -x
      #SBATCH --account=<budget>
      #SBATCH --nodes=2
      #SBATCH --ntasks=576
      #SBATCH --ntasks-per-node=288
      #SBATCH --output=mpi-out.%j
      #SBATCH --error=mpi-err.%j
      #SBATCH --time=00:20:00

      srun --exclusive -n 288 ./mpi-prog1 &
      srun --exclusive -n 288 ./mpi-prog2 &

      wait

.. ifconfig:: system_name == 'juwels'

   The example below shows a job script where two different job steps are initiated within one job. In total 96 cores are allocated on
   two nodes, where ``-n 48`` causes each job step to use 48 cores on one of the compute nodes. Additionally, in this example the option
   ``--exclusive`` is passed to ``srun`` to ensure that distinct cores are allocated to each job step:

   .. code-block:: bash

      #!/bin/bash -x
      #SBATCH --account=<budget>
      #SBATCH --nodes=2
      #SBATCH --ntasks=96
      #SBATCH --ntasks-per-node=48
      #SBATCH --output=mpi-out.%j
      #SBATCH --error=mpi-err.%j
      #SBATCH --time=00:20:00

      srun --exclusive -n 48 ./mpi-prog1 &
      srun --exclusive -n 48 ./mpi-prog2 &

      wait

.. ifconfig:: system_name in ('jureca', 'jusuf')

   The example below shows a job script where two different job steps are initiated within one job. In total 256 cores are allocated on
   two nodes, where ``-n 128`` causes each job step to use 128 cores on one of the compute nodes. Additionally, in this example the option
   ``--exclusive`` is passed to ``srun`` to ensure that distinct cores are allocated to each job step:

   .. code-block:: bash

      #!/bin/bash -x
      #SBATCH --account=<budget>
      #SBATCH --nodes=2
      #SBATCH --ntasks=256
      #SBATCH --ntasks-per-node=128
      #SBATCH --output=mpi-out.%j
      #SBATCH --error=mpi-err.%j
      #SBATCH --time=00:20:00

      srun --exclusive -n 128 ./mpi-prog1 &
      srun --exclusive -n 128 ./mpi-prog2 &

      wait

.. _batch_dependency_chains:

Dependency Chains
-----------------
Slurm supports dependency chains, i.e., collections of batch jobs with defined dependencies. Dependencies can be defined using the ``--dependency`` argument to ``sbatch``:

.. code-block:: none

   sbatch --dependency=afterany:<jobid> <jobscript>

Slurm will guarantee that the new batch job (whose job ID is returned by ``sbatch``) does not start before ``<jobid>`` terminates (successfully or not).
It is possible to specify other types of dependencies, such as ``afterok``, which ensures that the new job will only start if ``<jobid>`` finished
successfully.

An example script for handling job chains is provided below. The script submits a chain of ``${NO_OF_JOBS}`` jobs. A job will only start after
the successful completion of its predecessor. Please note that a job which exceeds its time limit is not marked as successful:

.. code-block:: bash

   #!/bin/bash -x
   # submit a chain of jobs with dependency
   # number of jobs to submit
   NO_OF_JOBS=<no of jobs>
   # define jobscript
   JOB_SCRIPT=<jobscript>
   # submit the first job; the job ID is the last field of sbatch's output
   echo "sbatch ${JOB_SCRIPT}"
   JOBID=$(sbatch ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
   # submit the remaining NO_OF_JOBS-1 jobs, each depending on its predecessor
   I=1
   while [ ${I} -lt ${NO_OF_JOBS} ]; do
      echo "sbatch --dependency=afterok:${JOBID} ${JOB_SCRIPT}"
      JOBID=$(sbatch --dependency=afterok:${JOBID} ${JOB_SCRIPT} 2>&1 | awk '{print $(NF)}')
      I=$((I + 1))
   done
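
As an alternative sketch, the job ID can be obtained with the ``--parsable`` option of ``sbatch``, which prints only the job ID (followed by ``;<cluster>`` in multi-cluster setups) and so avoids parsing the human-readable message with ``awk``.
The function name ``submit_chain`` and the ``SBATCH_CMD`` override are hypothetical helpers introduced here for illustration only:

.. code-block:: bash

   #!/bin/bash
   # Sketch: submit a chain of jobs, each starting only after its predecessor
   # completed successfully. Prints the job ID of every submitted job.
   # SBATCH_CMD is a hypothetical hook to substitute a dry-run command.
   submit_chain() {
       local job_script=$1
       local no_of_jobs=$2
       local sbatch_cmd=${SBATCH_CMD:-sbatch}
       local jobid i

       (( no_of_jobs >= 1 )) || return 1

       # The first job of the chain has no dependency
       jobid=$("${sbatch_cmd}" --parsable "${job_script}") || return 1
       jobid=${jobid%%;*}   # strip an optional ";<cluster>" suffix
       echo "${jobid}"

       # Each remaining job depends on the successful end of the previous one
       for (( i = 1; i < no_of_jobs; i++ )); do
           jobid=$("${sbatch_cmd}" --parsable \
                   --dependency="afterok:${jobid}" "${job_script}") || return 1
           jobid=${jobid%%;*}
           echo "${jobid}"
       done
   }

   # Usage (after sourcing this file or inside the same script):
   #   submit_chain <jobscript> <no of jobs>

Stopping at the first failed submission avoids creating jobs that depend on an empty or invalid job ID.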

.. _batch_interactive_sessions:

Interactive Sessions
--------------------

Interactive sessions can be allocated using the ``salloc`` command:

.. code-block:: none

      $ salloc --partition=<devel|dc-cpu-devel|...> --nodes=2 --account=<budget> --time=00:30:00

Once an allocation has been made ``salloc`` will start a shell on the login node (submission host). One can then execute ``srun`` from within the shell, e.g.:

.. code-block:: none

      $ srun --ntasks=4 --ntasks-per-node=2 --cpus-per-task=7 ./hybrid-prog

The interactive session is terminated by exiting the shell. In order to obtain a shell on the first allocated compute node one can start a remote shell from within the ``salloc`` session and connect it to a pseudo terminal using:

.. code-block:: none

      $ srun --cpu-bind=none --nodes=2 --pty /bin/bash -i

The option ``--cpu-bind=none`` is used to disable CPU binding for the spawned shell. To execute an MPI application, use ``srun`` again from the remote shell. To support X11 forwarding the ``--forward-x`` option to ``srun`` is available. X11 forwarding is required for users who want to use applications or tools that provide a GUI.

Below a transcript of an exemplary interactive session is shown.
``srun`` can be run within the allocation without delay (note that the first ``srun`` execution may take slightly longer due to the necessary node health checking performed upon the invocation of the very first ``srun`` command within the session).

.. ifconfig:: system_name == 'jedi'

   .. code-block:: none

      [user1@jpblt-s01-01 ~]$ hostname
      jpblt-s01-01.jupiter.internal
      [user1@jpblt-s01-01 ~]$ salloc --nodes=2 --account=<budget>
      salloc: Granted job allocation 72
      salloc: Waiting for resource configuration
      salloc: Nodes jpbot-001-[17-18] are ready for job
      [user1@jpblt-s01-01 ~]$ hostname
      jpblt-s01-01.jupiter.internal
      [user1@jpblt-s01-01 ~]$ srun --ntasks 2 hostname
      jpbot-001-18.jupiter.internal
      jpbot-001-17.jupiter.internal
      [user1@jpblt-s01-01 ~]$ srun --cpu-bind=none --nodes=1 --pty /bin/bash -i
      [user1@jpbot-001-17 ~]$ hostname
      jpbot-001-17.jupiter.internal
      [user1@jpbot-001-17 ~]$ exit
      [user1@jpblt-s01-01 ~]$ hostname
      jpblt-s01-01.jupiter.internal
      [user1@jpblt-s01-01 ~]$ exit
      exit
      salloc: Relinquishing job allocation 72
      salloc: Job allocation 72 has been revoked.
      [user1@jpblt-s01-01 ~]$ hostname
      jpblt-s01-01.jupiter.internal

.. ifconfig:: system_name == 'juwels'

   .. code-block:: none

      [user1@jwlogin08 ~]$ hostname
      jwlogin08.juwels
      [user1@jwlogin08 ~]$ salloc -n 2 --nodes=2 --account=<budget>
      salloc: Granted job allocation 3116222
      salloc: Waiting for resource configuration
      salloc: Nodes jwc00n[017-018] are ready for job
      [user1@jwlogin08 ~]$ hostname
      jwlogin08.juwels
      [user1@jwlogin08 ~]$ srun --ntasks 2 --ntasks-per-node=2 hostname
      jwc00n017.juwels
      jwc00n018.juwels
      [user1@jwlogin08 ~]$ srun --cpu-bind=none --nodes=1 --pty /bin/bash -i
      [user1@jwc00n017 ~]$ hostname
      jwc00n017.juwels
      [user1@jwc00n017 ~]$ logout
      [user1@jwlogin08 ~]$ hostname
      jwlogin08.juwels
      [user1@jwlogin08 ~]$ exit
      exit
      salloc: Relinquishing job allocation 3116222
      [user1@jwlogin08 ~]$ hostname
      jwlogin08.juwels

.. ifconfig:: system_name == 'jureca'

   .. code-block:: none

      [user1@jrlogin04 ~]$ hostname
      jrlogin04.jureca
      [user1@jrlogin04 ~]$ salloc -n 2 --nodes=2 --account=<budget>
      salloc: Pending job allocation 2622
      salloc: job 2622 queued and waiting for resources
      salloc: job 2622 has been allocated resources
      salloc: Granted job allocation 2622
      salloc: Waiting for resource configuration
      salloc: Nodes jrc0690 are ready for job
      [user1@jrlogin04 ~]$ srun --ntasks 2 --ntasks-per-node=2 hostname
      jrc0690.jureca
      jrc0690.jureca
      [user1@jrlogin04 ~]$ srun --cpu-bind=none --nodes=1 --pty /bin/bash -i
      [user1@jrc0690 ~]$ hostname
      jrc0690.jureca
      [user1@jrc0690 ~]$ logout
      [user1@jrlogin04 ~]$ hostname
      jrlogin04.jureca
      [user1@jrlogin04 ~]$ exit
      exit
      salloc: Relinquishing job allocation 2622
      [user1@jrlogin04 ~]$ hostname
      jrlogin04.jureca

.. ifconfig:: system_name =='jusuf'

   .. code-block:: none

      [user1@jsfl03 ~]$ hostname
      jsfl03.jusuf
      [user1@jsfl03 ~]$ salloc --nodes=2 --account=<budget>
      salloc: Pending job allocation 1289
      salloc: job 1289 queued and waiting for resources
      salloc: job 1289 has been allocated resources
      salloc: Granted job allocation 1289
      salloc: Waiting for resource configuration
      salloc: Nodes jsfc[197-198] are ready for job
      [user1@jsfl03 ~]$ hostname
      jsfl03.jusuf
      [user1@jsfl03 ~]$ srun --ntasks 2 --ntasks-per-node=2 hostname
      jsfc198
      jsfc197
      [user1@jsfl03 ~]$ srun --cpu-bind=none --nodes=1 --pty /bin/bash -i
      [user1@jsfc197 ~]$ hostname
      jsfc197
      [user1@jsfc197 ~]$ logout
      [user1@jsfl03 ~]$ hostname
      jsfl03.jusuf
      [user1@jsfl03 ~]$ exit
      exit
      salloc: Relinquishing job allocation 1289
      [user1@jsfl03 ~]$ hostname
      jsfl03.jusuf


.. note::

   Your account will be charged per allocation whether the compute nodes are used or not.
   Batch submission is the preferred way to execute jobs.

.. _batch_hold_release_jobs:

Hold and Release Batch Jobs
---------------------------
Jobs that are in pending state (i.e., not yet running) can be put on hold using:

.. code-block:: none

   scontrol hold <jobid>

Jobs that are on hold are still reported as pending (*PD*) by ``squeue`` but the ``Reason`` shown by ``squeue`` or ``scontrol show job`` is changed to ``JobHeldUser``:

.. ifconfig:: system_name == 'juwels'

   .. code-block:: none

      [user1@jrlogin07 ~]$ scontrol show job <jobid>
      JobId=<jobid> JobName=jobscript.sh
         UserId=XXX(nnnn) GroupId=XXX(nnnn) MCS_label=N/A
         Priority=0 Nice=0 Account=XXX QOS=normal
         JobState=PENDING Reason=JobHeldUser Dependency=(null)
         Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
         RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
         SubmitTime=2018-12-10T10:52:42 EligibleTime=Unknown
         StartTime=Unknown EndTime=Unknown Deadline=N/A
         PreemptTime=None SuspendTime=None SecsPreSuspend=0
         Partition=batch AllocNode:Sid=jrlogin07:14699
         ReqNodeList=(null) ExcNodeList=(null)
         NodeList=(null)
         NumNodes=2-2 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
         TRES=cpu=24,node=2
         Socks/Node=* NtasksPerN:B:S:C=12:0:*:* CoreSpec=*
         MinCPUsNode=12 MinMemoryNode=0 MinTmpDiskNode=0
         Features=(null) DelayBoot=00:00:00
         Gres=mem96 Reservation=(null)
         OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
         Command=/XXX/jobscript.sh
         WorkDir=/XXX
         StdErr=/XXX/mpi-err.<jobid>
         StdIn=/dev/null
         StdOut=/XXX/mpi-out.<jobid>
         Power=

.. ifconfig:: system_name == 'jureca'

   .. code-block:: none

      [user1@jrlogin08 ~]$ scontrol show job <jobid>
      JobId=<jobid> JobName=jobscript.sh
      UserId=XXX(nnnn) GroupId=XXX(nnnn) MCS_label=N/A
      Priority=0 Nice=0 Account=XXX QOS=normal
      JobState=PENDING Reason=JobHeldUser Dependency=(null)
      Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
      RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
      SubmitTime=2020-11-28T11:44:26 EligibleTime=2020-11-28T11:44:26
      AccrueTime=Unknown
      StartTime=Unknown EndTime=Unknown Deadline=N/A
      SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-28T11:44:26
      Partition=dc-cpu AllocNode:Sid=jrlogin04:19969
      ReqNodeList=(null) ExcNodeList=(null)
      NodeList=(null)
      NumNodes=2-2 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
      TRES=cpu=2,node=2,billing=2
      Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
      MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
      Features=(null) DelayBoot=00:00:00
      Reservation=(null)
      OverSubscribe=NO Contiguous=0 Licenses=home@just,project@just,scratch@just Network=(null)
      Command=/XXX/jobscript.sh
      WorkDir=/XXX
      StdErr=/XXX/mpi-err.<jobid>
      StdIn=/dev/null
      StdOut=/XXX/mpi-out.<jobid>
      Power=
      TresPerNode=mem512

.. ifconfig:: system_name == 'jusuf'

   .. code-block:: none

      [user1@jrl02 ~]$ scontrol show job <jobid>
      JobId=<jobid> JobName=jobscript.sh
      UserId=XXX(nnnn) GroupId=XXX(nnnn) MCS_label=N/A
      Priority=0 Nice=0 Account=XXX QOS=normal
      JobState=PENDING Reason=JobHeldUser Dependency=(null)
      Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
      RunTime=00:00:00 TimeLimit=00:20:00 TimeMin=N/A
      SubmitTime=2020-02-20T12:40:45 EligibleTime=Unknown
      StartTime=Unknown EndTime=Unknown Deadline=N/A
      PreemptTime=None SuspendTime=None SecsPreSuspend=0
      LastSchedEval=2020-02-20T12:40:45
      Partition=batch AllocNode:Sid=jrl02:20538
      ReqNodeList=(null) ExcNodeList=(null)
      NodeList=(null)
      NumNodes=2-2 NumCPUs=128 NumTasks=128 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
      TRES=cpu=128,node=2
      Socks/Node=* NtasksPerN:B:S:C=64:0:*:* CoreSpec=*
      MinCPUsNode=64 MinMemoryNode=0 MinTmpDiskNode=0
      Features=(null) DelayBoot=00:00:00
      Gres=mem256 Reservation=(null)
      OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
      Command=/XXX/jobscript.sh
      WorkDir=/XXX
      StdErr=/XXX/mpi-err.<jobid>
      StdIn=/dev/null
      StdOut=/XXX/mpi-out.<jobid>

The job can be released using:

.. code-block:: none

   $ scontrol release <jobid>

.. _batch_slurm_commands:

Slurm commands
--------------
Below a list of the most important Slurm user commands available on |SYSTEM_NAME| is given.

**sbatch**
  is used to submit a batch script (which can be a bash, Perl or Python script)

  The script will be executed on the first node in the
  allocation chosen by the scheduler. The working directory coincides with the working directory of the sbatch program. Within the script one or
  multiple srun commands can be used to create job steps and execute (MPI) parallel applications.

  .. note::

     ``mpiexec`` is not supported on |SYSTEM_NAME|. ``srun`` is the only supported method to spawn MPI applications.

**salloc**
  is used to request an allocation

  When the job is started, a shell (or other program specified on the command line) is started on the
  submission host (login node). From the shell ``srun`` can be used to interactively spawn parallel applications. The allocation is released
  when the user exits the shell.

**srun**
  is mainly used to create a job step within a job

  ``srun`` can be executed with no arguments other than the program to use the full allocation,
  or with additional arguments to restrict the job step resources to a subset of the allocated processors.

**squeue**
  allows querying the list of pending and running jobs

  By default it reports the list of pending jobs sorted by priority and the list of
  running jobs sorted separately according to the job priority.

**scancel**
  is used to cancel pending or running jobs or to send signals to processes in running jobs or job steps

  Example: ``scancel <jobid>``

**scontrol**
   can be used to query information about compute nodes and running or recently completed jobs

   Examples:

   - ``scontrol show job <jobid>`` to show detailed information about pending, running or recently completed jobs
   - ``scontrol update job <jobid> set ...`` to update a pending job

   .. note::

      For old jobs ``scontrol show job <jobid>`` will not work and ``sacct -j <jobid>`` should be used instead.

**sacct**
  is used to retrieve accounting information for jobs and job steps

  For older jobs ``sacct`` queries the accounting database.

  Example: ``sacct -j <jobid>``

**sinfo**
  is used to retrieve information about the partitions and node states

**sprio**
  can be used to query job priorities

**smap**
  graphically shows the state of the partitions and nodes using a curses interface

  We recommend LLview as an alternative, which is supported on all JSC machines.

**sattach**
  allows attaching to the standard input, output or error of a running job

**sstat**
  allows querying information about a running job

.. _slurm_options:

Summary of sbatch and srun Options
----------------------------------

The following table summarizes important ``sbatch`` and ``srun`` command options:

+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--account``                    | Budget account where contingent is taken from.                                                                                           |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--nodes``                      | Number of compute nodes used by the job. Can be omitted if ``--ntasks`` and ``--ntasks-per-node`` are given.                             |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--ntasks``                     | Number of tasks (MPI processes). Can be omitted if ``--nodes`` and ``--ntasks-per-node`` are given. [#ntasks]_                           |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--ntasks-per-node``            | Number of tasks per compute node.                                                                                                        |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--cpus-per-task``              | Number of logical CPUs (hardware threads) per task. This option is only relevant for hybrid/OpenMP jobs.                                 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--job-name``                   | A name for the job                                                                                                                       |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--output``                     | Path to the job's standard output. Slurm supports format strings containing replacement symbols such as ``%j`` (job ID). [#stdcombine]_  |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--error``                      | Path to the job's standard error. Slurm supports format strings containing replacement symbols such as ``%j`` (job ID).                  |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--time``                       | Maximal wall-clock time of the job.                                                                                                      |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--partition``                  | Partition to be used, e.g. ``batch`` or ``large``. If omitted, ``batch`` is the default.                                                 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--mail-user``                  | Define the mail address to receive mail notification.                                                                                    |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--mail-type``                  | Define when to send mail notifications. [#mail-type]_                                                                                    |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--pty (srun only)``            | Execute the first task in pseudo terminal mode.                                                                                          |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--forward-x (srun)``           | Enable X11 forwarding on the first allocated node.                                                                                       |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| ``--disable-turbomode (sbatch)`` | Disable turbo mode of all CPUs of the allocated nodes.                                                                                   |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+

.. [#ntasks] If ``--ntasks`` is omitted the number of nodes can be specified as a range ``--nodes=<min no. of nodes>-<max no. of nodes>`` allowing the scheduler to start the job with fewer nodes than the maximum requested if this reduces wait time.
.. [#stdcombine] ``stdout`` and ``stderr`` can be combined by specifying the same file for the ``--output`` and ``--error`` option.
.. [#mail-type] Valid types: ``BEGIN``, ``END``, ``FAIL``, ``REQUEUE``, ``ALL``, ``TIME_LIMIT``, ``TIME_LIMIT_90``, ``TIME_LIMIT_80``, ``TIME_LIMIT_50``, ``ARRAY_TASKS`` to receive emails when events occur. Multiple type values may be specified in a comma separated list.
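
As an illustration, a batch script combining several of the options above might look as follows.
All values (node and task counts, job name, file names, mail address and the program name ``./hybrid-prog``) are placeholders chosen for this sketch:

.. code-block:: bash

   #!/bin/bash -x
   #SBATCH --account=<budget>
   #SBATCH --nodes=2
   #SBATCH --ntasks-per-node=4
   #SBATCH --cpus-per-task=8
   #SBATCH --job-name=hybrid-test
   #SBATCH --output=job-out.%j
   #SBATCH --error=job-err.%j
   #SBATCH --time=00:30:00
   #SBATCH --mail-user=<mail address>
   #SBATCH --mail-type=END,FAIL

   # --ntasks is omitted: it follows from --nodes and --ntasks-per-node
   export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
   # Pass the thread count on explicitly so the job step matches the allocation
   srun --cpus-per-task=${SLURM_CPUS_PER_TASK} ./hybrid-prog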

More information is available on the man pages of ``sbatch``, ``srun`` and ``salloc`` which can be retrieved on the login nodes with the commands ``man sbatch``, ``man srun`` and ``man salloc``, respectively, or in the `Slurm documentation`_.

.. _`Slurm documentation`: https://slurm.schedmd.com/documentation.html

CPU Limiting Options
--------------------

.. ifconfig:: system_name != 'jedi'

   CPU frequency sets the pace at which instructions are executed by the CPU. A higher frequency results in:

   - Higher power usage
   - Possibly higher performance

   Each CPU has a base frequency, which is the frequency that the CPU is
   operating at by default.

   Turbo mode means that the CPU increases the frequency above the base frequency, if conditions (such as temperature) allow. Higher frequency results in more heat dissipation and a higher temperature. If the temperature passes the designed threshold, the CPU will tend to control the temperature by lowering the frequency, and this might affect the performance.

   Therefore, the base frequency is more reproducible since application performance does not depend on the current temperature of the allocated CPUs.

   As a result, for repeatable performance measurements, it is recommended to use ``--disable-turbomode`` to disable turbo mode and run at the base frequency. The option is listed in :ref:`slurm_options`.
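
   A minimal sketch of a benchmark job script with turbo mode disabled is shown below; the node count, the time limit and the program name ``./benchmark-prog`` are placeholders assumed for illustration:

   .. code-block:: bash

      #!/bin/bash -x
      #SBATCH --account=<budget>
      #SBATCH --nodes=1
      #SBATCH --time=00:20:00
      # Disable turbo mode on all CPUs of the allocated nodes
      #SBATCH --disable-turbomode

      srun ./benchmark-prog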

.. ifconfig:: system_name == 'jedi'

   Each |SYSTEM_NAME| node includes 4 Grace Hopper Superchips, as covered in :ref:`the configuration details for JEDI <configuration>`.
   Each Superchip comprises a CPU and a GPU, and each Superchip receives a fixed total power budget of 680 W.

   By default, the CPU for each Superchip is limited to a power budget of 100 W, to maximise performance
   from the GPU, which for many applications will be delivering the bulk of compute performance.
   However, in some cases, where work is split between CPU and GPU, it may be advantageous
   to rebalance the power budget between CPU and GPU.

   JSC deploys a custom Slurm plugin, which provides the option ``--grace-power-cap=<cap-in-watts>`` (e.g.
   to set to 200 Watts use ``--grace-power-cap=200``).
   This option is available for ``sbatch``, ``srun`` and ``salloc`` commands, and can be set to values between
   100 and 300, changing the CPU power limit to the corresponding number of Watts. The option operates on a *per-node* level,
   i.e. it limits all Grace CPUs on that node to the given value at once (including CPUs with job steps that may already be running).

   .. warning::
      Currently, setting this value to a number higher than 300 will silently fail and
      the CPU power limit will be set to the default of 100 W.
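
   Because out-of-range values are not reported as errors, it can help to validate the value before submission.
   The following is a minimal sketch; the function name ``grace_power_cap_opt`` is a hypothetical helper and not part of Slurm or the JSC plugin, and the 100 to 300 range is taken from the description above:

   .. code-block:: bash

      #!/bin/bash
      # Hypothetical helper: print the --grace-power-cap option for a valid
      # value, or fail with a message if the value is outside 100-300 W.
      grace_power_cap_opt() {
          local cap=$1

          # Accept plain decimal integers only
          if ! [[ ${cap} =~ ^[0-9]+$ ]]; then
              echo "power cap must be an integer number of watts: '${cap}'" >&2
              return 1
          fi

          # 10# forces base 10 so that leading zeros are not read as octal
          cap=$(( 10#${cap} ))
          if (( cap < 100 || cap > 300 )); then
              echo "power cap ${cap} W is outside the range 100-300 W" >&2
              return 1
          fi

          echo "--grace-power-cap=${cap}"
      }

      # Usage: only submit if the value passed the check
      #   opt=$(grace_power_cap_opt 200) && sbatch "${opt}" <jobscript>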

   .. note::
      If this option is used in an ``srun`` command, the value will not be reset at the end of that
      ``srun`` command, i.e. if used in a batch script or ``salloc`` session with multiple job steps, subsequent job steps will use the same
      custom power limit.

   It is important to understand that raising power limits on the CPU can diminish the amount of power
   available to the GPU, and there is not a 1-to-1 relationship between the power available, clock speeds
   and performance. If using this option, it is important to benchmark your specific use-case to understand
   what benefits (if any) you can extract from it, and in which situations this unacceptably degrades GPU performance.