Known Issues on JUWELS

This page collects known issues affecting JUWELS’s system and application software.

Note

The following list of known issues is intended to provide a quick reference for users experiencing problems on JUWELS. We strongly encourage all users to report the occurrence of problems, whether listed below or not, to the user support.

Open Issues

Problems with commercial software like ANSYS using IntelMPI under Slurm 23.11

Added: 2024-12-11

Affects: All systems at JSC

Description: Job execution fails for commercial software such as ANSYS that ships with its own IntelMPI (EasyBuild modules cannot be used, since a separate IntelMPI version is bundled with the ANSYS software package).

Status: Open.

Workaround/Suggested Action: Add the following settings to your job script to allow multi-node jobs spawned by IntelMPI to run:

export I_MPI_HYDRA_BOOTSTRAP=ssh
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
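
For context, a minimal sketch of how these settings fit into a job script is shown below; the account, resource requests, and solver invocation are placeholders that depend on your project and ANSYS installation.

#!/bin/bash
#SBATCH --account=yourproject
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=01:00:00

# Let the bundled IntelMPI bootstrap its ranks via ssh under Slurm 23.11.
export I_MPI_HYDRA_BOOTSTRAP=ssh
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS

# Placeholder invocation; use the launch command of your ANSYS product here.
./run_ansys_case.sh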

Apptainer sandbox containers disabled on login nodes

Added: 2024-11-02

Affects: All HPC systems

Description: We have recently discovered a flaw that allows users to crash the Linux kernel when using Apptainer sandbox containers with IBM Storage Scale (formerly GPFS) as the backing file system. Login nodes in both JURECA and JUSUF have fallen victim to this issue, resulting in an unexpected reboot. To prevent users from losing work we have decided to temporarily disable sandbox containers on the login nodes while we wait for a fix for Storage Scale.

Status: Open (waiting for fix).

Workaround/Suggested Action: If sandbox containers are essential to your workflow, we suggest you use a compute node where the feature is still enabled. However, make sure to run the container from a local tmpfs such as /tmp or /dev/shm.
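
A minimal sketch of this compute-node workflow is shown below (partition, account, and image name are placeholders); building and running the sandbox from /dev/shm keeps it off the Storage Scale file systems.

# Allocate a compute node interactively (adjust partition/account/time to your project).
salloc --nodes=1 --partition=batch --account=yourproject --time=00:30:00

# Build the sandbox in node-local tmpfs instead of a Storage Scale file system.
srun --pty apptainer build --sandbox /dev/shm/mycontainer docker://ubuntu:22.04

# Run it from the tmpfs location.
srun --pty apptainer exec /dev/shm/mycontainer cat /etc/os-release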

Conda Disabled

Added: 2024-11-02

Affects: All HPC systems

Description: Usage of the Conda default channel may not be permitted for all users; access to the channel has therefore been blocked on the systems.

Status: Closed.

Workaround/Suggested Action: Use an alternative channel (conda-forge) or even an alternative, faster client (mamba). See the dedicated description.
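
A minimal sketch of switching to conda-forge is shown below; the exact client (conda or mamba) depends on how it is provided in your environment.

# Prefer conda-forge and give it strict priority.
conda config --add channels conda-forge
conda config --set channel_priority strict

# Alternatively, request the channel explicitly per command:
conda install -c conda-forge --override-channels numpy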

IP connectivity on compute nodes

Added: 2024-06-24

Affects: JURECA-DC, JUWELS Cluster, JUWELS Booster, and JUSUF

Description: IP traffic between compute nodes for compute tasks should go over the InfiniBand interface. That interface is not used automatically; failure to use it leads to poor performance or to outright failures when establishing communication between compute nodes.

This problem is most often observed with deep learning frameworks such as PyTorch, but can be worked around as described below.

Status: Open.

Workaround/Suggested Action: The problem can be avoided by appending an "i" to the hostname, e.g., converting jrc0001 to jrc0001i, or jwb0001.juwels to jwb0001i.juwels. These modified hostnames resolve to the IP address associated with the InfiniBand adapter, which is available in all connection cases. The code snippet below is an automatic solution for PyTorch that first sets a hostname and then appends the "i" if required. Note that launcher scripts that try to figure out the hostname automatically, such as torchrun, may require additional handling. For the torchrun launcher, these additional steps and other potential issues are documented in more detail in the comprehensive PyTorch at JSC recipe.

export MASTER_ADDR="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
if [ "$SYSTEMNAME" = juwelsbooster ] \
       || [ "$SYSTEMNAME" = juwels ] \
       || [ "$SYSTEMNAME" = jurecadc ] \
       || [ "$SYSTEMNAME" = jusuf ]; then
    # Allow communication over InfiniBand cells on JSC machines.
    MASTER_ADDR="$MASTER_ADDR"i
fi

Fortran 2008 MPI bindings rewrite array bounds

Added: 2023-08-17

Affects: All systems at JSC

Description: Due to a bug in versions of the gfortran compiler installed in software stages earlier than 2024, the Fortran 2008 bindings (use mpi_f08) of MPICH-based MPI libraries (e.g. ParaStationMPI) erroneously modify the bounds of arrays passed into MPI routines as buffers.

Status: Open.

Workaround/Suggested Action: The issue can be avoided by using one of the following (see the module-loading sketch after the list):

  • gfortran version 12 or later (available in software stage 2024) or

  • a Fortran compiler other than gfortran (e.g. the Intel Fortran compiler) or

  • an MPI library that is not based on MPICH (e.g. OpenMPI).
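
For the first two options, a minimal module-loading sketch is shown below; the exact module names depend on the software stage and should be verified with module avail.

# Option 1: software stage 2024, whose gfortran is version 12 or later.
module load Stages/2024 GCC ParaStationMPI

# Option 2: the Intel Fortran compiler instead of gfortran.
module load Stages/2024 Intel ParaStationMPI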

Process affinity

Added: 2023-08-03

Affects: All systems at JSC

Description: After an update of Slurm to version 22.05, the process affinity has changed, which results in unexpected pinning in certain cases. This can have a major impact on application performance.

Status: Open.

Workaround/Suggested Action: Further information can be found in the warning section of Processor Affinity.

Slurm: wrong default task pinning with odd number of tasks/node

Added: 2022-06-20

Affects: All systems at JSC

Description: With the default CPU binding ("--cpu-bind=threads"), the task pinning is not as expected when there is an odd number of tasks per node and each task uses at most half of the cores on the node.

With an even number of tasks per node, only physical cores are used by the tasks. With an odd number of tasks per node, SMT is enabled and different tasks share the hardware threads of the same cores (this should not happen). Below are a few examples on the JUWELS Cluster.

With 1 task/node and 48 cpus/task it uses SMT:

$ srun -N1 -n1 -c48 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n001, task  0  0 [7321]: mask 0xffffff000000ffffff set

With 2 tasks/node and 24 cpus/task it uses only physical cores:

$ srun -N1 -n2 -c24 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n001, task  0  0 [7340]: mask 0xffffff set
cpu_bind=THREADS - jwc00n001, task  1  1 [7341]: mask 0xffffff000000 set

With 3 tasks/node and 16 cpus/task it uses SMT (tasks 0 and 1 are on physical cores but task 2 uses SMT):

$ srun -N1 -n3 -c16 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n001, task  0  0 [7362]: mask 0xffff set
cpu_bind=THREADS - jwc00n001, task  1  1 [7363]: mask 0xffff000000 set
cpu_bind=THREADS - jwc00n001, task  2  2 [7364]: mask 0xff000000ff0000 set

With 4 tasks/node and 12 cpus/task it uses only physical cores:

$ srun -N1 -n4 -c12 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n001, task  0  0 [7387]: mask 0xfff set
cpu_bind=THREADS - jwc00n001, task  2  2 [7389]: mask 0xfff000 set
cpu_bind=THREADS - jwc00n001, task  1  1 [7388]: mask 0xfff000000 set
cpu_bind=THREADS - jwc00n001, task  3  3 [7390]: mask 0xfff000000000 set

Status: Open.

Workaround/Suggested Action: To work around this behavior, disable SMT with the srun option "--hint=nomultithread". You can compare the CPU masks in the following examples:

$ srun -N1 -n3 -c16 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n004, task  0  0 [17629]: mask 0x0000000000ffff set
cpu_bind=THREADS - jwc00n004, task  1  1 [17630]: mask 0x0000ffff000000 set
cpu_bind=THREADS - jwc00n004, task  2  2 [17631]: mask 0xff000000ff0000 set


$ srun -N1 -n3 -c16 --cpu-bind=verbose --hint=nomultithread exec
cpu_bind=THREADS - jwc00n004, task  0  0 [17652]: mask 0x00000000ffff set
cpu_bind=THREADS - jwc00n004, task  1  1 [17653]: mask 0x00ffff000000 set
cpu_bind=THREADS - jwc00n004, task  2  2 [17654]: mask 0xff0000ff0000 set

Slurm: srun options --exact and --exclusive change default pinning

Added: 2022-06-09

Affects: All systems at JSC

Description: In Slurm 21.08 the srun options "--exact" and "--exclusive" change the default pinning. For example, on JURECA:

$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose exec
cpu_bind=THREADS - jrc0731, task  0  0 [3027]: mask 0xffff0000000000000000000000000000ffff000000000000 set
...
$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose --exact exec
cpu_bind=THREADS - jrc0731, task  0  0 [3068]: mask 0x3000300030003000300030003000300030003000300030003000300030003 set
...
$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose --exclusive exec
cpu_bind=THREADS - jrc0731, task  0  0 [3068]: mask 0x3000300030003000300030003000300030003000300030003000300030003 set
...

As you can see, with the default pinning only physical cores are used, but with "--exact" or "--exclusive" Slurm pins the tasks to SMT cores (hardware threads). In effect, the task distribution changes to "cyclic".

Status: Open.

Workaround/Suggested Action: To work around this behavior, request block distribution of the tasks using the "-m" option, like this:

$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose --exact -m *:block exec
cpu_bind=THREADS - jrc0731, task  0  0 [3027]: mask 0xffff0000000000000000000000000000ffff000000000000 set
...
$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose --exclusive -m *:block exec
cpu_bind=THREADS - jrc0731, task  0  0 [3027]: mask 0xffff0000000000000000000000000000ffff000000000000 set
...

ParaStationMPI: Cannot allocate memory

Added: 2021-10-06

Affects: All systems at JSC

Description: Using ParaStationMPI, the following error might occur:

ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory

Status: Open.

Workaround/Suggested Action: Use mpi-settings/[CUDA-low-latency-UD,CUDA-UD,UCX-UD] (Stage < 2022) or UCX-settings/[UD,UD-CUDA] (Stage >= 2022) to reduce the memory footprint. The appropriate module depends on your requirements.
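
A minimal sketch for stages >= 2022 is shown below; check module avail UCX-settings for the variants available in your stage.

# Reduce the memory footprint by switching UCX to unreliable datagram (UD) transport.
module load UCX-settings/UD          # or UCX-settings/UD-CUDA for CUDA-aware communication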

Jobs cannot load/access software modules

Added: 2021-05-03

Affects: JUWELS Cluster and Booster

Description: The JUWELS system currently has two sets of login nodes, one associated with the Cluster part (juwels-cluster.fz-juelich.de), the other with the Booster part (juwels-booster.fz-juelich.de). Submitting jobs from the Cluster login nodes to Booster partitions and vice versa currently fails with error messages such as /p/software/juwels/lmod/8.4.1/libexec/lmod: No such file or directory or error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory.

Status: Open.

Workaround/Suggested Action: Please use either juwels-cluster.fz-juelich.de to submit jobs to Cluster partitions or juwels-booster.fz-juelich.de to submit jobs to Booster partitions.

Cannot connect using old OpenSSH clients

Added: 2020-06-15

Affects: All systems at JSC

Description: In response to the recent security incident, the SSH server on JUWELS has been configured to only use modern cryptography algorithms. As a side effect, it is no longer possible to connect to JUWELS using older SSH clients. For OpenSSH, at least version 6.7 released in 2014 is required. Some operating systems with very long term support ship with older versions, e.g. RHEL 6 ships with OpenSSH 5.3.

Status: Open.

Workaround/Suggested Action: Use a more recent SSH client with support for the newer cryptography algorithms. If you cannot update the OpenSSH client (e.g. because you are not the administrator of the system you are trying to connect from) you can install your own version of OpenSSH from https://www.openssh.com. Logging in from a different system with a newer SSH client is another option. If you have to transfer data from a system with an old SSH client to JUWELS (e.g. using scp) you may have to transfer the data to a third system with a newer SSH client first (scp’s command line option -3 can be used to automate this).
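
A sketch of such a three-way copy, run from an intermediate host with a sufficiently new OpenSSH client (hostnames, user names, and paths are placeholders):

# Copy from the old system to JUWELS through the local, newer SSH client.
scp -3 user@old-system:/data/results.tar user1@juwels-cluster.fz-juelich.de:/p/scratch/yourproject/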

IntelMPI crashes on MPI_Finalize if windows have not been freed

Added: 2020-02-13

Affects: JUWELS Cluster

Description: When using windows for one-sided communication with IntelMPI/2019.6.154, an arbitrary subset of processes may crash when calling MPI_Finalize.

Status: Open.

Workaround/Suggested Action: Ensure that all windows are freed with MPI_Win_free before MPI_Finalize is called.

Variations in runtime/performance

Added: 2018-08-09

Affects: JUWELS Cluster

Description: In some cases variations in runtime/performance of certain codes have been reported.

If you encounter such a case please let us know via sc@fz-juelich.de. Please include data which illustrates your case.

Status: Open.

Recently Resolved and Closed Issues

SLURM_NTASKS and SLURM_NPROCS not exported in jobscript

Added: 2024-08-08

Affects: All systems with Slurm 23.02

Description: The environment variables "SLURM_NTASKS" and "SLURM_NPROCS" are not exported in the job script when only "--ntasks-per-node" is given to sbatch without "-n".

Status: Resolved. Fixed in cli_filter.

Workaround/Suggested Action: To work around it, pass sbatch the option "-n"/"--ntasks" with the total number of tasks; you can keep "--ntasks-per-node".
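
A minimal job-script header illustrating the workaround (resource values are placeholders):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks=8               # total tasks; SLURM_NTASKS/SLURM_NPROCS are then exported again
#SBATCH --time=00:30:00

srun ./my_app                    # placeholder application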

ParaStationMPI: GPFS backend for ROMIO (MPI I/O)

Added: 2023-04-03

Update: 2023-06-12

Affects: All systems at JSC

Description: The GPFS backend for ROMIO (MPI I/O) in ParaStationMPI was enabled in the 2023 stage after a bug had been fixed. However, occasional segmentation faults have been observed when ParaStationMPI is used with the GPFS backend enabled, resulting in job failures. With the GPFS backend disabled, the issue is no longer reproducible and the jobs complete successfully.

Status: Resolved.

Workaround/Suggested Action: Versions 5.7.1-1 and 5.8.1-1 include a patch to address this issue and have been installed. If you are affected by this issue please explicitly load these versions.

JUST: GPFS hanging waiters lead to stuck I/O

Added: 2023-04-12

Update: As of 2023-05-26, all systems have been updated to a GPFS version that fixes the issue.

Affects: All systems at JSC

Description: We have been aware since the 15th of March that some users have seen their jobs cause waiters on JUST, which leads to these jobs hanging seemingly indefinitely on I/O. This issue has been observed for a specific set of jobs and occurred more frequently on JURECA than on other systems. IBM has identified a possible cause and is now in the process of developing a fix.

Status: Resolved.

Workaround/Suggested Action: There are no known workarounds. Once IBM releases the fix, we will shortly schedule a maintenance window and install the patch.

Job requeueing failures due to slurmctld prologue bug

Added: 2021-05-18

Affects: All systems at JSC

Description: There is a bug in slurmctld that currently breaks the prologue mechanism and job requeueing. Normally, before a job allocates any nodes, the prologue runs and, if it finds unhealthy nodes, drains them and requeues the job. Because of the bug, slurmctld now cancels jobs that were requeued at least once but finally landed on healthy nodes. We have reported this bug to SchedMD and they are working on it.

Status: Resolved.

$DATA not available on login nodes

Added: 2020-12-04

Affects: JURECA-DC, JUWELS Booster

Description: The $DATA file system is not mounted on the login nodes. We are working on making it available soon.

Status: Open.

Workaround/Suggested Action: Please access $DATA on JUDAC or a JUWELS Cluster login node.

GPU Device Handling

Added: 2020-12-01

Affects: JUWELS Cluster GPU partition, JUWELS Booster

Description: We are in the process of updating how GPU devices are distributed to Slurm tasks. The current implementation contains bugs that are currently being addressed. A temporary workaround has been added to the CUDA module on JUWELS Cluster. Some more details follow, including a suggestion for JUWELS Booster.

In the past, Slurm automatically exported CUDA_VISIBLE_DEVICES=0,1,2,3 at the start of jobs, allowing an application to see all four installed GPUs and utilize them. This always bore the latent possibility of using GPUs that did not have affinity to the socket the MPI process was running on. On JUWELS Booster, this behavior is more pronounced and slower by default. The intended change is to let Slurm assign GPUs to tasks taking the CPU-GPU affinity into account. As an example, rank 0 would only have access to GPU 0, by automatically setting CUDA_VISIBLE_DEVICES=0. Full user override remains possible when CUDA_VISIBLE_DEVICES is set manually outside of Slurm or if --cpu-bind=none is selected.

Unfortunately, while it works in most cases, the current implementation does not work in all of them. On the JUWELS Booster the GPU assignment is incorrect for tasks assigned to cores in certain NUMA domains, in particular 4 to 7, 12 to 15, etc. In these cases, the CUDA_VISIBLE_DEVICES environment variable is not set.

Fix description: Slurm now assigns the closest GPU to every process; even NUMA domains that do not have direct affinity to a GPU get the closest one assigned. Users should be aware of the case where fewer processes are requested than there are GPUs: each process still gets only a single GPU assigned. For example, managing all 4 GPUs from a single process requires setting CUDA_VISIBLE_DEVICES=0,1,2,3 manually.

Status: Closed.

Workaround/Suggested Action: On the JUWELS Cluster GPU nodes we recommend loading the CUDA module before job execution; the module exports CUDA_VISIBLE_DEVICES=0,1,2,3. On the JUWELS Booster we recommend limiting the CPU affinity masks to the NUMA domains 1, 3, 5, and 7, e.g., via [srun] --cpu-bind=map_ldoms:5,7,1,3. More complicated use cases may require you to export CUDA_VISIBLE_DEVICES manually after srun for each task in a wrapper script using the PMI_RANK and MPI_LOCALRANK_ID environment variables; see the sketch below.
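
A hedged sketch of such a wrapper script follows; which environment variable actually holds the node-local rank depends on the MPI/PMI stack in use, so MPI_LOCALRANK_ID is an assumption here, with Slurm's SLURM_LOCALID as a fallback.

#!/bin/bash
# gpu_wrapper.sh - assign one GPU per task based on the node-local rank, then run the application.
LOCAL_RANK="${MPI_LOCALRANK_ID:-${SLURM_LOCALID:-0}}"
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % 4 ))
exec "$@"

Example usage: srun --cpu-bind=map_ldoms:5,7,1,3 ./gpu_wrapper.sh ./my_gpu_app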

MPI_Allreduce bug in CUDA-Aware MVAPICH2-GDR

Added: 2020-01-17

Affects: JUWELS Cluster GPU nodes

Description: MPI_Allreduce produces wrong results and crashes for small buffers of double precision on the GPU.

For a complete description read the information on the following link: https://gist.github.com/AndiH/b929b50b4c8d25137e0bfee25db63791

Status: Closed.

Workaround/Suggested Action: No known workaround for 1 rank. MVAPICH2-GDR version 2.3.3 has been installed. That version works as intended when using more than 1 rank. With Stage 2020, the MVAPICH2 (GDR version) is not part of the default system software stack anymore.

Segmentation Faults with MVAPICH2

Added: 2019-03-11

Affects: JUWELS Cluster GPU nodes, JURECA Cluster (decommissioned in December 2020)

Description: It has been observed that MVAPICH2 (GDR version) is not reliably detecting GPU device memory pointers and therefore executes invalid memory operations on such buffers. This results in an application segmentation fault.

Status: Closed.

Workaround/Suggested Action: The behavior of the MPI implementation is dependent on the buffer sizes. For some applications, adjusting the eager size limits via the environment variables MV2_IBA_EAGER_THRESHOLD and MV2_RDMA_FAST_PATH_BUF_SIZE can improve the situation. However, this has been observed to create problems with the collectives implementation in MVAPICH2. Please contact the support in case you intend to adjust these values. With Stage 2020, the MVAPICH2 (GDR version) is not part of the default system software stack anymore.