Known Issues on JURECA
This page collects known issues affecting JURECA’s system and application software.
Note
The following list of known issues is intended as a quick reference for users experiencing problems on JURECA. We strongly encourage all users to report any problems, whether listed below or not, to user support.
Open Issues
ParaStationMPI: GPFS backend for ROMIO (MPI I/O)
Added: 2023-04-03
Affects: All systems at JSC
Description: The GPFS backend for ROMIO (MPI I/O) in ParaStationMPI was enabled in the 2023 stage after a bug had been fixed. However, occasional segmentation faults have been observed when ParaStationMPI is used with the GPFS backend enabled, resulting in job failures. With the GPFS backend disabled, the issue is no longer reproducible and the jobs complete successfully.
Status: Open.
Workaround/Suggested Action: If jobs using ParaStationMPI fail with segmentation faults whose traces point to GPFS, the UFS backend can be used instead of the GPFS backend. This can be done with the following setting:
export ROMIO_FSTYPE_FORCE="ufs:"
Please note that switching to the UFS backend also disables the IME backend. This does not mean that IME cannot be used; it simply works without some of the extra tuning that has been done for IME.
Flipping links
Added: 2022-09-28
Updated: 2023-05-25
Affects: JUWELS Booster and JURECA-DC
Description: A few months ago, we identified an issue with the InfiniBand cabling of our Sequana XH2000 machines. Under certain circumstances that are not easily reproducible at small scale, an InfiniBand adapter loses its link for a few seconds. This usually happens rarely, and if the communication library simply retries, it does not lead to a failure, only to a temporary delay. However, the issue is more pronounced with the NCCL library and shows up more frequently at large scale, particularly with PyTorch.
Current situation: Since the problem was identified, we have retrofitted all InfiniBand cables between compute nodes and switches with ferrite beads. This reduced the frequency of these events, but did not fully solve the problem as initially expected. We are working together with Atos to find a final solution. In the meantime, if your job ends unexpectedly with this or a similar error despite using the suggested workaround below, please contact sc@fz-juelich.de:
RuntimeError: NCCL communicator was aborted on rank X. Original reason for failure was: NCCL error: unhandled system error, NCCL version 21.2.7 ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Update 2023-05-25: All affected systems have received a second ferrite bead in the cables of the affected links. This has improved the situation significantly. However, some jobs still trigger the problem occasionally. The current strategy is to identify nodes where this happens and treat them as “weak” nodes which require a third ferrite bead or an InfiniBand card replacement. This is not considered a fix, but a workaround from the hardware side.
Status: Open.
Workaround/Suggested Action: While we wait for a definitive fix, and in addition to the effort on the hardware side, we currently recommend setting the following environment variables to mitigate the link flip issue:
export NCCL_IB_TIMEOUT=50
export UCX_RC_TIMEOUT=4s
export NCCL_IB_RETRY_CNT=10
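In a job script it can be convenient to apply these values only when nothing else has been configured, so that a deliberately chosen setting is not overwritten. The following is a minimal sketch of that idea; the function name set_link_flip_defaults is hypothetical and not part of any JSC tooling, and the values are the ones recommended above.

```shell
# Hypothetical helper: export the recommended values unless the variable
# is already set to a non-empty value in the environment.
set_link_flip_defaults() {
    export NCCL_IB_TIMEOUT="${NCCL_IB_TIMEOUT:-50}"
    export UCX_RC_TIMEOUT="${UCX_RC_TIMEOUT:-4s}"
    export NCCL_IB_RETRY_CNT="${NCCL_IB_RETRY_CNT:-10}"
}

set_link_flip_defaults

# Record the effective settings in the job output for later debugging.
printf 'NCCL_IB_TIMEOUT=%s UCX_RC_TIMEOUT=%s NCCL_IB_RETRY_CNT=%s\n' \
    "$NCCL_IB_TIMEOUT" "$UCX_RC_TIMEOUT" "$NCCL_IB_RETRY_CNT"
```

Calling the function before the srun line of the job step is sufficient, since exported variables are passed on to the tasks.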
Slurm: wrong default task pinning with odd number of tasks/node
Added: 2022-06-20
Affects: All systems at JSC
Description: With the default CPU binding (--cpu-bind=threads), the task pinning is not as expected when there is an odd number of tasks per node and the tasks use a number of cores less than or equal to half of the total cores on each node.
With an even number of tasks per node, only physical cores are used by the tasks. With an odd number of tasks per node, SMT is used and different tasks share the hardware threads of the same cores, which should not happen. The following examples are from JUWELS Cluster.
With 1 task/node and 48 cpus/task it uses SMT:
$ srun -N1 -n1 -c48 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n001, task 0 0 [7321]: mask 0xffffff000000ffffff set
With 2 tasks/node and 24 cpus/task it uses only physical cores:
$ srun -N1 -n2 -c24 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n001, task 0 0 [7340]: mask 0xffffff set
cpu_bind=THREADS - jwc00n001, task 1 1 [7341]: mask 0xffffff000000 set
With 3 tasks/node and 16 threads/task it uses SMT (task 0 and 1 are on physical cores but task 2 uses SMT):
$ srun -N1 -n3 -c16 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n001, task 0 0 [7362]: mask 0xffff set
cpu_bind=THREADS - jwc00n001, task 1 1 [7363]: mask 0xffff000000 set
cpu_bind=THREADS - jwc00n001, task 2 2 [7364]: mask 0xff000000ff0000 set
With 4 tasks/node and 12 cpus/task it uses only physical cores:
$ srun -N1 -n4 -c12 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n001, task 0 0 [7387]: mask 0xfff set
cpu_bind=THREADS - jwc00n001, task 2 2 [7389]: mask 0xfff000 set
cpu_bind=THREADS - jwc00n001, task 1 1 [7388]: mask 0xfff000000 set
cpu_bind=THREADS - jwc00n001, task 3 3 [7390]: mask 0xfff000000000 set
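The hexadecimal masks above are easier to compare once they are translated into lists of logical CPU numbers. The following is a minimal sketch of such a translation; mask_to_cpus is a hypothetical helper name, not a Slurm tool. It only relies on the fact that bit i of the mask stands for logical CPU i, and it walks the mask digit by digit so that masks wider than 64 bits are handled.

```shell
# Hypothetical helper: print the logical CPUs selected by a hexadecimal
# affinity mask as comma-separated "first-last" ranges.
mask_to_cpus() (
    hex=${1#0x}
    pos=0        # bit position of the lowest bit of the current hex digit
    start=-1     # first CPU of the range currently being built
    prev=-1      # last CPU seen so far
    out=""
    while [ -n "$hex" ]; do
        digit=${hex#"${hex%?}"}   # last (least significant) hex digit
        hex=${hex%?}              # drop it from the remaining string
        value=$(( 0x$digit ))
        bit=0
        while [ "$bit" -lt 4 ]; do
            if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
                cpu=$(( pos + bit ))
                if [ "$start" -lt 0 ]; then
                    start=$cpu
                elif [ "$cpu" -ne $(( prev + 1 )) ]; then
                    # Gap found: close the current range, start a new one.
                    out="${out}${start}-${prev},"
                    start=$cpu
                fi
                prev=$cpu
            fi
            bit=$(( bit + 1 ))
        done
        pos=$(( pos + 4 ))
    done
    if [ "$start" -ge 0 ]; then
        out="${out}${start}-${prev}"
    fi
    printf '%s\n' "$out"
)

mask_to_cpus 0xff000000ff0000   # prints 16-23,48-55
```

For task 2 of the three-task example this gives logical CPUs 16-23 and 48-55. Assuming that logical CPUs 48 and above are the second hardware threads of cores 0 and above (an assumption about the node's CPU numbering that is consistent with the masks shown here), CPUs 48-55 are the SMT siblings of cores 0-7, which already belong to task 0.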
Status: Open.
Workaround/Suggested Action: To work around this behavior, disable SMT with the srun option “--hint=nomultithread”. You can compare the CPU masks in the following examples:
$ srun -N1 -n3 -c16 --cpu-bind=verbose exec
cpu_bind=THREADS - jwc00n004, task 0 0 [17629]: mask 0x0000000000ffff set
cpu_bind=THREADS - jwc00n004, task 1 1 [17630]: mask 0x0000ffff000000 set
cpu_bind=THREADS - jwc00n004, task 2 2 [17631]: mask 0xff000000ff0000 set
$ srun -N1 -n3 -c16 --cpu-bind=verbose --hint=nomultithread exec
cpu_bind=THREADS - jwc00n004, task 0 0 [17652]: mask 0x00000000ffff set
cpu_bind=THREADS - jwc00n004, task 1 1 [17653]: mask 0x00ffff000000 set
cpu_bind=THREADS - jwc00n004, task 2 2 [17654]: mask 0xff0000ff0000 set
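Whether a given mask reaches into the second hardware threads can also be checked mechanically. The sketch below is a hypothetical helper (mask_uses_smt is not a Slurm command) and assumes that logical CPUs numbered at or above the physical core count of the node are SMT siblings, with 48 physical cores per node in these examples. That numbering is an assumption; it is consistent with the masks shown above but should be verified on the node in question.

```shell
# Hypothetical helper: succeed if the highest CPU selected by the mask is
# at or above the given number of physical cores per node.
mask_uses_smt() (
    hex=${1#0x}
    cores=$2
    # Drop leading zero digits so that the first digit holds the top bit.
    while [ "${hex#0}" != "$hex" ]; do
        hex=${hex#0}
    done
    [ -n "$hex" ] || exit 1            # empty mask: nothing is selected
    top=$(( 0x${hex%"${hex#?}"} ))     # value of the most significant digit
    highest=$(( 4 * (${#hex} - 1) ))   # bit position of that digit's bit 0
    while [ "$top" -gt 1 ]; do
        top=$(( top >> 1 ))
        highest=$(( highest + 1 ))
    done
    [ "$highest" -ge "$cores" ]
)

# Task 2 without and with --hint=nomultithread, 48 cores per node assumed.
if mask_uses_smt 0xff000000ff0000 48; then echo "default: uses SMT"; fi
if ! mask_uses_smt 0xff0000ff0000 48; then echo "nomultithread: physical cores only"; fi
```

The check only looks at the highest selected CPU, which is enough to tell the two cases apart under the stated numbering assumption.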
Slurm: srun options --exact and --exclusive change default pinning
Added: 2022-06-09
Affects: All systems at JSC
Description: In Slurm 21.08, the srun options “--exact” and “--exclusive” change the default pinning. For example, on JURECA:
$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose exec
cpu_bind=THREADS - jrc0731, task 0 0 [3027]: mask 0xffff0000000000000000000000000000ffff000000000000 set
...
$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose --exact exec
cpu_bind=THREADS - jrc0731, task 0 0 [3068]: mask 0x3000300030003000300030003000300030003000300030003000300030003 set
...
$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose --exclusive exec
cpu_bind=THREADS - jrc0731, task 0 0 [3068]: mask 0x3000300030003000300030003000300030003000300030003000300030003 set
...
As you can see, with the default pinning only physical cores are used, but with “--exact” or “--exclusive” Slurm pins the tasks to SMT cores (hardware threads). In effect, the task distribution changes to “cyclic”.
Status: Open.
Workaround/Suggested Action: To work around this behavior, request block distribution of the tasks using the option “-m”, like this:
$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose --exact -m *:block exec
cpu_bind=THREADS - jrc0731, task 0 0 [3027]: mask 0xffff0000000000000000000000000000ffff000000000000 set
...
$ srun -N1 --ntasks-per-node=1 -c32 --cpu-bind=verbose --exclusive -m *:block exec
cpu_bind=THREADS - jrc0731, task 0 0 [3027]: mask 0xffff0000000000000000000000000000ffff000000000000 set
...
ParaStationMPI: Cannot allocate memory
Added: 2021-10-06
Affects: All systems at JSC
Description: Using ParaStationMPI, the following error might occur:
ERROR mlx5dv_devx_obj_create(QP) failed, syndrome 0: Cannot allocate memory
Status: Open.
Workaround/Suggested Action: Use mpi-settings/[CUDA-low-latency-UD,CUDA-UD,UCX-UD] (Stage < 2022) or UCX-settings/[UD,UD-CUDA] (Stage >= 2022) to reduce the memory footprint. The particular module depends on the user requirements.
Cannot connect using old OpenSSH clients
Added: 2020-06-15
Affects: All systems at JSC
Description: In response to the recent security incident, the SSH server on JURECA has been configured to only use modern cryptography algorithms. As a side effect, it is no longer possible to connect to JURECA using older SSH clients. For OpenSSH, at least version 6.7 released in 2014 is required. Some operating systems with very long term support ship with older versions, e.g. RHEL 6 ships with OpenSSH 5.3.
Status: Open.
Workaround/Suggested Action:
Use a more recent SSH client with support for the newer cryptography algorithms.
If you cannot update the OpenSSH client (e.g. because you are not the administrator of the system you are trying to connect from), you can install your own version of OpenSSH from https://www.openssh.com.
Logging in from a different system with a newer SSH client is another option.
If you have to transfer data from a system with an old SSH client to JURECA (e.g. using scp), you may have to transfer the data to a third system with a newer SSH client first (scp’s command line option -3 can be used to automate this).
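To find out whether a client is affected, its version can be compared against the 6.7 minimum stated above. The following is a minimal sketch; ssh_client_new_enough is a hypothetical helper name. It assumes the version banner has the usual form printed by ssh -V, which starts with "OpenSSH_" followed by the version number. Vendor builds with a different prefix are not handled.

```shell
# Hypothetical helper: succeed if the given "ssh -V" banner reports
# OpenSSH 6.7 or newer.
ssh_client_new_enough() (
    version=${1#OpenSSH_}          # e.g. "5.3p1, OpenSSL 1.0.1e-fips ..."
    major=${version%%.*}           # text before the first dot
    rest=${version#*.}             # text after the first dot
    minor=${rest%%[!0-9]*}         # leading digits of that text
    case $major in
        ''|*[!0-9]*) exit 2 ;;     # banner not in the expected format
    esac
    [ "$major" -gt 6 ] || { [ "$major" -eq 6 ] && [ "${minor:-0}" -ge 7 ]; }
)

# Usage (ssh -V writes its banner to standard error):
#   if ssh_client_new_enough "$(ssh -V 2>&1)"; then echo "client is recent enough"; fi
```

The usage line is left as a comment so that the sketch does not depend on an ssh binary being installed where it is tried out.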
Intel compiler error with std::valarray and optimized headers
Added: 2016-03-16
Affects: JURECA
Description: An error was found in the implementation of several C++ std::valarray operations in the Intel compiler suite that occurs if the option -use-intel-optimized-headers of icpc is used.
Status: Open.
Workaround/Suggested Action: Users are strongly advised not to use the -use-intel-optimized-headers option on JURECA.
Recently Resolved and Closed Issues
JUST: GPFS hanging waiters lead to stuck I/O
Added: 2023-04-12
Update: As of 2023-05-26 all systems have been updated to a GPFS version that fixed the issue
Affects: All systems at JSC
Description: Since 15 March, we have been aware that some users have seen their jobs cause waiters on JUST, which leads to these jobs hanging seemingly indefinitely on I/O. This issue has been observed for a specific set of jobs and occurred more frequently on JURECA than on other systems. IBM has identified a possible cause and is now in the process of developing a fix.
Status: Resolved.
Workaround/Suggested Action: There are no known workarounds. Once IBM releases the fix, we will shortly schedule a maintenance window and install the patch.
Job requeueing failures due to slurmctld prologue bug
Added: 2021-05-18
Affects: All systems at JSC
Description: Due to a bug in slurmctld, the prologue mechanism and job requeueing are currently broken. Normally, before a job allocates any nodes, the prologue runs, and if it finds unhealthy nodes, it drains them and requeues the job. Because of the bug, slurmctld now cancels jobs that were requeued at least once but finally landed on healthy nodes. We have reported this bug to SchedMD and they are working on it.
Status: Resolved.
$DATA not available on login nodes
Added: 2020-12-04
Affects: JURECA-DC, JUWELS Booster
Description: The $DATA file system is not mounted on the login nodes. We are working on making it available soon.
Status: Open.
Workaround/Suggested Action: Please access $DATA on JUDAC or a JUWELS Cluster login node.
libibcm warning by UCX
Added: 2020-12-04
Affects: JURECA-DC
Description: The warning message
libibcm: couldn't read ABI version
is printed by every MPI rank in the job step.
Status: Resolved.
Heterogeneous jobs across Cluster and Booster support only one job step
Added: 2020-07-20
Affects: JURECA (Booster module decommissioned end of September 2022)
Description: Running multiple heterogeneous job steps using Cluster and Booster resources in the same allocation results in an error message such as
<PSP:r0000007:pscom4gateway: Error: Connecting gateway failed>
The problem does not occur for all job configurations.
Status: Open.
Workaround/Suggested Action: Please use separate allocations for job steps when using Cluster and Booster resources.
Application crashes when using CUDA-MPS
Added: 2020-07-03
Affects: JURECA Cluster (decommissioned in December 2020)
Description: When using CUDA MPS during job allocation (salloc --cuda-mps […]) and selecting ParaStationMPI as the MPI runtime, some programs may fail due to an out of memory error (ERROR_OUT_OF_MEMORY).
Status: Open.
Workaround: The issue is documented in the MPS documentation. Try to compile your program with -fPIC -fPIE / -pie. Alternatively, we found that making a call to cuInit(0); at the very beginning of the program flow (i.e. very early in your main()) solves the problem.
Finally, if you cannot modify your application, the call to cuInit(0) can also be achieved by writing a small external library that is preloaded into your program by the dynamic linker. See the following sketch. Note that this is highly discouraged, as it might interfere with other utilities making use of the same functionality (debuggers, profilers, …).
#include "cuda.h"
struct Initializer { Initializer() { cuInit(0); } };
Initializer I;
gcc -fPIC preload.cpp -shared -o preload.so -lcuda
LD_PRELOAD=./preload.so srun -n2 ./simpleMPI
Segmentation Faults with MVAPICH2
Added: 2019-03-11
Affects: JUWELS Cluster GPU nodes, JURECA Cluster (decommissioned in December 2020)
Description: It has been observed that MVAPICH2 (GDR version) is not reliably detecting GPU device memory pointers and therefore executes invalid memory operations on such buffers. This results in an application segmentation fault.
Status: Closed.
Workaround/Suggested Action: The behavior of the MPI implementation is dependent on the buffer sizes.
For some applications, adjusting the eager size limits via the environment variables MV2_IBA_EAGER_THRESHOLD and MV2_RDMA_FAST_PATH_BUF_SIZE can improve the situation.
However, this has been observed to create problems with the collectives implementation in MVAPICH2.
Please contact the support in case you intend to adjust these values.
With Stage 2020, the MVAPICH2 (GDR version) is not part of the default system software stack anymore.
Collectives in Intel MPI 2019 can lead to hanging processes or segmentation faults
Added: 2018-11-27
Affects: JURECA Cluster (decommissioned in December 2020)
Description: Problems with collective operations and Intel MPI 2019 have been observed.
Segmentation faults in MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv have been reproduced.
Hangs in MPI_Allgather, MPI_Allgatherv have been observed.
As the occurrence depends on the underlying dynamically chosen algorithm in the MPI implementation, the issue may or may not be visible depending on job and buffer sizes.
Hangs in MPI_Cart_create calls have been reported, likely due to problems with the underlying collective operations.
Status: Open.
Workaround/Suggested Action: The default Intel MPI in the Stage 2018b has been changed to Intel MPI 2018.04. Alternatively a fall-back to Stage 2018a may be an option.
Errors with IntelMPI and Slurm’s cyclic job/task distribution
Added: 2018-05-07
Affects: JURECA Cluster
Description: If IntelMPI is used together with srun’s option --distribution=cyclic, or if the variable SLURM_DISTRIBUTION=cyclic is exported, the maximum number of MPI tasks that can be spawned is limited, and jobs fail completely for more than 6 total MPI tasks in a job step. Be aware that the cyclic distribution is the default behavior of Slurm when using compute nodes interactively, i.e. when the number of tasks is no larger than the number of allocated nodes! The problem was reported to Intel in 2017 and a future release may solve this issue.
Status: Open.
Workaround/Suggested Action: The recommended workarounds are:
Avoid srun’s option --distribution=cyclic
Unset SLURM_DISTRIBUTION inside the jobscript or export SLURM_DISTRIBUTION=block before starting the srun
Export I_MPI_SLURM_EXT=0 to disable the optimized startup algorithm for IntelMPI
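The second workaround can be applied defensively near the top of a job script. The following is a minimal sketch; avoid_cyclic_distribution is a hypothetical helper name. It only replaces the exact value cyclic named in the description and leaves every other setting, including an unset variable, untouched.

```shell
# Hypothetical helper: switch an exported cyclic distribution to block
# before any srun is started.
avoid_cyclic_distribution() {
    case "${SLURM_DISTRIBUTION:-}" in
        cyclic) export SLURM_DISTRIBUTION=block ;;
    esac
}

avoid_cyclic_distribution

# Alternative from the list above, disabling the optimized startup instead:
# export I_MPI_SLURM_EXT=0
```

Note that this does not cover the case where cyclic is requested directly on the srun command line.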