Known Issues on JURECA

This page collects known issues affecting JURECA's system and application software.

Note

The following list of known issues is intended to provide a quick reference for users experiencing problems on JURECA. We strongly encourage all users to report problems, whether listed below or not, to user support.

Incorrect RMA rendezvous cache handling in ParaStation MPI and Intel MPI

Added: 2017-03-20

Description: In rare cases, Fortran applications using buffers that have been deallocated and subsequently reallocated for MPI communication observed data corruption in transit. The root cause was identified as an error in the handling of the rendezvous cache used by both MPI implementations. The error occurs only with newer Intel compilers.
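
The reports concern Fortran applications; the C++ sketch below only illustrates the equivalent deallocate/reallocate pattern, with message size, tags, and ranks chosen arbitrarily:

    // Illustrative only: a send buffer is deallocated and reallocated between
    // transfers, the situation in which the rendezvous cache was mishandled
    // (the new allocation may reuse the same virtual address).
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;          // large enough for the rendezvous protocol
        double* buf = new double[n]();
        if (rank == 0) {
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            delete[] buf;               // deallocate ...
            buf = new double[n]();      // ... and reallocate for the next transfer
            MPI_Send(buf, n, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        delete[] buf;
        MPI_Finalize();
        return 0;
    }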

Status: Resolved in stage 2016b and newer since 2017-04-11.

Workaround/Suggested Action: A fix for ParaStation MPI and a workaround for Intel MPI were installed in the 2016b stage on 2017-04-11. Users of this stage or newer stages are not affected by the problem.

Intel compiler error with std::valarray and optimized headers

Added: 2016-03-16

Description: An error was found in the implementation of several C++ std::valarray operations in the Intel compiler suite. It occurs if the icpc option -use-intel-optimized-headers is used.
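
The kind of code concerned is ordinary element-wise std::valarray arithmetic. The short example below only illustrates such operations; it is not a confirmed reproducer of the compiler error, and the file name in the comment is hypothetical:

    // Plain std::valarray arithmetic. Programs like this may be affected when
    // compiled with, e.g., "icpc -use-intel-optimized-headers example.cpp".
    #include <iostream>
    #include <valarray>

    int main() {
        std::valarray<double> a = {1.0, 2.0, 3.0, 4.0};
        std::valarray<double> b = {4.0, 3.0, 2.0, 1.0};
        std::valarray<double> c = a * b + 2.0;   // element-wise multiply and add
        std::cout << c.sum() << " " << c.max() << std::endl;
        return 0;
    }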

Status: Open.

Workaround/Suggested Action: Users are strongly advised not to use the -use-intel-optimized-headers option on JURECA.

Errors with Intel MPI and Slurm's cyclic job/task distribution

Added: 2018-05-07

Description: If Intel MPI is used together with srun's option --distribution=cyclic, or if the variable SLURM_DISTRIBUTION=cyclic is exported, the number of MPI tasks that can be spawned is limited: jobs fail completely for more than 6 MPI tasks in total in a job step.

Be aware that the cyclic distribution is the default behavior of Slurm when compute nodes are used interactively, i.e. when the number of tasks is no larger than the number of allocated nodes. The problem was already reported to Intel in 2017; a future release may resolve this issue.

Status: Open.

Workaround/Suggested Action: The recommended workarounds are:

  1. Avoid srun's option --distribution=cyclic
  2. Unset SLURM_DISTRIBUTION inside the jobscript or export SLURM_DISTRIBUTION=block before calling srun
  3. Export I_MPI_SLURM_EXT=0 to disable the optimized startup algorithm of Intel MPI

MPI_Gather and MPI_Gatherv hang with Intel MPI 2018.02

Added: 2018-08-25

Description: With Intel MPI version 2018.02, MPI_Gather hangs for large message sizes. MPI_Gatherv does not terminate either.
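
For illustration, the affected call pattern is a gather with a large per-rank contribution, sketched below. The message size at which the hang sets in is not specified here; the count in the sketch is arbitrary:

    // Gather with a large per-rank message; with Intel MPI 2018.02 such calls
    // were observed to hang. The count used here is illustrative only.
    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int count = 1 << 22;                       // ~32 MiB of doubles per rank
        std::vector<double> sendbuf(count, rank);
        std::vector<double> recvbuf;
        if (rank == 0) recvbuf.resize(static_cast<std::size_t>(count) * size);

        MPI_Gather(sendbuf.data(), count, MPI_DOUBLE,
                   recvbuf.data(), count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }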

Status: Workaround implemented.

Workaround/Suggested Action: Mitigating environment variables have been added to the module file.

Collectives in Intel MPI 2019 can lead to hanging processes or segmentation faults

Added: 2018-11-27

Description: Problems with collective operations in Intel MPI 2019 have been observed. Segmentation faults in MPI_Allreduce, MPI_Alltoall, and MPI_Alltoallv have been reproduced, and hangs in MPI_Allgather and MPI_Allgatherv have been observed. Since the occurrence depends on the algorithm chosen dynamically by the MPI implementation, the issue may or may not be visible depending on job and buffer sizes. Hangs in MPI_Cart_create calls have also been reported, likely due to problems in the underlying collective operations.

Status: Open.

Workaround/Suggested Action: The default Intel MPI in Stage 2018b has been changed to Intel MPI 2018.04. Alternatively, falling back to Stage 2018a may be an option.

Segmentation Faults with MVAPICH2

Added: 2019-03-11

Description: It has been observed that MVAPICH2 (GDR version) does not reliably detect GPU device memory pointers and therefore executes invalid memory operations on such buffers. This results in application segmentation faults.
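
For context, the affected pattern is a CUDA-aware MPI transfer, i.e. a pointer to GPU device memory is handed directly to MPI and the library is expected to recognize it as such. The sketch below only illustrates this pattern; message size and ranks are arbitrary:

    // A device pointer from cudaMalloc is passed directly to MPI. MVAPICH2-GDR
    // is expected to detect that the buffer lives in GPU memory; the issue
    // described above is that this detection is not reliable.
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double* d_buf = nullptr;
        cudaMalloc(reinterpret_cast<void**>(&d_buf), n * sizeof(double));

        if (rank == 0) {
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }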

Status: Open.

Workaround/Suggested Action: The behavior of the MPI implementation depends on the buffer sizes. For some applications, adjusting the eager size limits via the environment variables MV2_IBA_EAGER_THRESHOLD and MV2_RDMA_FAST_PATH_BUF_SIZE can improve the situation. However, this has been observed to create problems with the collectives implementation in MVAPICH2. Please contact user support if you intend to adjust these values.