Known Issues on JURECA

This page collects known issues affecting JURECA’s system and application software.

Note

The following list of known issues is intended to provide a quick reference for users experiencing problems on JURECA. We strongly encourage all users to report the occurrence of problems, whether listed below or not, to the user support.

Open Issues

ParaStationMPI: Cannot allocate memory

Added: 2021-10-06

Affects: All systems at JSC

Description: Using ParaStationMPI, the following error might occur:

ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory

Status: Open.

Workaround/Suggested Action: Load an mpi-settings module with a -UD suffix to reduce the memory footprint. Which module is appropriate depends on your requirements; the options are (see the example after the list):

  • mpi-settings/CUDA-low-latency-UD

  • mpi-settings/CUDA-UD

  • mpi-settings/UCX-UD
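As a rough sketch, selecting one of these modules in a job script could look like the following; the chosen mpi-settings module, the task count and the application name are placeholders to adapt to your own setup:

# after loading your usual compiler and ParaStationMPI modules:
module load mpi-settings/UCX-UD    # or CUDA-UD / CUDA-low-latency-UD, see the list above
srun -n 128 ./my_app               # placeholder task count and application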

Job requeueing failures due to slurmctld prologue bug

Added: 2021-05-18

Affects: All systems at JSC

Description: A bug in slurmctld currently breaks the prologue mechanism and job requeueing. Normally the prologue runs before a job allocates any nodes; if it finds unhealthy nodes, it drains them and requeues the job. Because of the bug, slurmctld cancels jobs that were requeued at least once, even if they finally landed on healthy nodes. We have reported the bug to SchedMD and they are working on it.

Status: Open.

Heterogeneous jobs across Cluster and Booster support only one job step

Added: 2020-07-20

Affects: JURECA

Description: Running multiple heterogeneous job steps using Cluster and Booster resources in the same allocation results in an error message such as

<PSP:r0000007:pscom4gateway: Error: Connecting gateway failed>

The problem does not occur for all job configurations.

Status: Open.

Workaround/Suggested Action: Please use separate allocations for job steps when using Cluster and Booster resources.
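As a rough sketch, instead of launching several heterogeneous job steps inside one allocation, each step can be given its own heterogeneous allocation. Partition names, node counts and binaries below are placeholders:

# one heterogeneous allocation per Cluster+Booster job step (all names are placeholders)
salloc --partition=<cluster-partition> -N 2 : --partition=<booster-partition> -N 4
srun ./cluster_part : ./booster_part    # the single heterogeneous job step in this allocation
exit                                    # release the allocation, then repeat for the next step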

Cannot connect using old OpenSSH clients

Added: 2020-06-15

Affects: All systems at JSC

Description: In response to the recent security incident, the SSH server on JURECA has been configured to only use modern cryptography algorithms. As a side effect, it is no longer possible to connect to JURECA using older SSH clients. For OpenSSH, at least version 6.7 released in 2014 is required. Some operating systems with very long term support ship with older versions, e.g. RHEL 6 ships with OpenSSH 5.3.

Status: Open.

Workaround/Suggested Action: Use a more recent SSH client with support for the newer cryptography algorithms. If you cannot update the OpenSSH client (e.g. because you are not the administrator of the system you are trying to connect from) you can install your own version of OpenSSH from https://www.openssh.com. Logging in from a different system with a newer SSH client is another option. If you have to transfer data from a system with an old SSH client to JURECA (e.g. using scp) you may have to transfer the data to a third system with a newer SSH client first (scp’s command line option -3 can be used to automate this).
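For example, assuming a third system with a sufficiently new OpenSSH client, scp -3 relays the data between the old system and JURECA through that third system; hostnames, user names and paths below are placeholders:

# run on the third system; data flows old system -> third system -> JURECA
scp -3 user@old-system:/path/to/data user@jureca.fz-juelich.de:/path/to/destination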

Intel compiler error with std::valarray and optimized headers

Added: 2016-03-16

Affects: JURECA

Description: An error was found in the implementation of several C++ std::valarray operations in the Intel compiler suite. It occurs when icpc's option -use-intel-optimized-headers is used.

Status: Open.

Workaround/Suggested Action: Users are strongly advised not to use the -use-intel-optimized-headers option on JURECA.
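In other words, compile as usual and simply leave the option out. A minimal sketch (file names and remaining flags are arbitrary examples):

icpc -O2 -std=c++11 -o myprog myprog.cpp    # do not add -use-intel-optimized-headers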

Recently Resolved and Closed Issues

$DATA not available on login nodes

Added: 2020-12-04

Affects: JURECA-DC, JUWELS Booster

Description: The $DATA file system is not mounted on the login nodes. We are working on making it available soon.

Status: Open.

Workaround/Suggested Action: Please access $DATA on JUDAC or a JUWELS Cluster login node.
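For example, one can log in to JUDAC and work with $DATA there; the user name is a placeholder and the login address should be taken from the JUDAC documentation:

ssh <user>@judac.fz-juelich.de    # JUDAC login; $DATA is mounted there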

libibcm warning by UCX

Added: 2020-12-04

Affects: JURECA-DC

Description: The following warning message is printed by every MPI rank in the job step:

libibcm: couldn't read ABI version

Status: Resolved.

Application crashes when using CUDA-MPS

Added: 2020-07-03

Affects: JURECA Cluster (decommissioned in December 2020)

Description: When using CUDA MPS during job allocation (salloc --cuda-mps […]) and selecting ParaStationMPI as the MPI runtime, some programs may fail due to an out-of-memory error (ERROR_OUT_OF_MEMORY).

Status: Open.

Workaround/Suggested Action: The issue is documented in the MPS documentation. Try compiling your program with -fPIC -fPIE / -pie. Alternatively, we found that calling cuInit(0) very early in the program flow (i.e. at the very beginning of your main()) solves the problem.

Finally, if you cannot modify your application, the call to cuInit(0) can also be achieved by writing a small external library that is preloaded into your program via the dynamic linker. See the following sketch. Note that this is highly discouraged, as it might interfere with other utilities making use of the same functionality (debuggers, profilers, …).

#include "cuda.h"
struct Initializer { Initializer() { cuInit(0); } };
Initializer I;
gcc -fPIC preload.cpp -shared -o preload.so -lcuda
LD_PRELOAD=./preload.so srun -n2 ./simpleMPI

Segmentation Faults with MVAPICH2

Added: 2019-03-11

Affects: JUWELS Cluster GPU nodes, JURECA Cluster (decommissioned in December 2020)

Description: It has been observed that MVAPICH2 (GDR version) is not reliably detecting GPU device memory pointers and therefore executes invalid memory operations on such buffers. This results in an application segmentation fault.

Status: Closed.

Workaround/Suggested Action: The behavior of the MPI implementation depends on the buffer sizes. For some applications, adjusting the eager size limits via the environment variables MV2_IBA_EAGER_THRESHOLD and MV2_RDMA_FAST_PATH_BUF_SIZE can improve the situation. However, this has been observed to create problems with the collectives implementation in MVAPICH2. Please contact user support if you intend to adjust these values. As of Stage 2020, MVAPICH2 (GDR version) is no longer part of the default system software stack.

Collectives in Intel MPI 2019 can lead to hanging processes or segmentation faults

Added: 2018-11-27

Affects: JURECA Cluster (decomissioned in December 2020)

Description: Problems with collective operations in Intel MPI 2019 have been observed. Segmentation faults in MPI_Allreduce, MPI_Alltoall and MPI_Alltoallv have been reproduced, and hangs in MPI_Allgather and MPI_Allgatherv have been observed. Since the occurrence depends on the algorithm chosen dynamically by the MPI implementation, the issue may or may not be visible depending on job and buffer sizes. Hangs in MPI_Cart_create have also been reported, likely due to problems with the underlying collective operations.

Status: Open.

Workaround/Suggested Action: The default Intel MPI in Stage 2018b has been changed to Intel MPI 2018.04. Alternatively, a fall-back to Stage 2018a may be an option.
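A sketch of falling back to the older stage in a job script, assuming the usual JSC stage-switching mechanism via $OTHERSTAGES (exact module names may differ; check module avail):

module use $OTHERSTAGES      # assumption: non-default stages are published here
module load Stages/2018a     # switch to the older stage
module load Intel IntelMPI   # then load compiler and MPI as usual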

Errors with IntelMPI and Slurm’s cyclic job/task distribution

Added: 2018-05-07

Affects: JURECA Cluster

Description: When IntelMPI is used together with srun's option --distribution=cyclic, or when the variable SLURM_DISTRIBUTION=cyclic is exported, there is a limitation on the maximum number of MPI tasks that can be spawned: jobs fail completely with more than 6 MPI tasks in total per job step.

Be aware that cyclic distribution is Slurm's default behavior when the number of tasks is not larger than the number of allocated nodes, which is typically the case when using compute nodes interactively. The problem was reported to Intel in 2017 and a future release may solve this issue.

Status: Open.

Workaround/Suggested Action: The recommended workarounds are (a job-script sketch combining options 2 and 3 follows the list):

  1. Avoid srun’s option --distribution=cyclic

  2. Unset SLURM_DISTRIBUTION inside the jobscript or export SLURM_DISTRIBUTION=block before starting the srun

  3. Export I_MPI_SLURM_EXT=0 to disable the optimized startup algorithm for IntelMPI
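A job-script sketch combining options 2 and 3; the task count and application name are placeholders:

export SLURM_DISTRIBUTION=block    # option 2 (alternatively: unset SLURM_DISTRIBUTION)
export I_MPI_SLURM_EXT=0           # option 3: disable IntelMPI's optimized startup algorithm
srun -n 48 ./my_intelmpi_app       # placeholder application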