Known Issues on JUWELS

This page collects known issues affecting the system and application software on JUWELS.

Please note: The following list of known issues is intended to provide a quick reference for users experiencing problems on JUWELS. We strongly encourage all users to report the occurrence of problems, whether listed below or not, to user support.

Application launch failures: Timeout waiting for task launch

Added: 2018-07-28

Description: Launch of applications may fail on JUWELS with the error message:

kvsprovider[3227]: Timeout: Not all clients called
pmi_init(): init=XXX left=YYY round=9
srun: error: timeout waiting for task launch, started XXX of YYY tasks

The problem is caused by a slow application startup; as a precaution, the launch is terminated after several minutes. The issue is observed particularly with larger allocations and higher numbers of MPI tasks per node. Please note that the system attempts 10 rounds before terminating the application and reports a warning in each round. If the number of rounds reported (round=X) is less than 9 and the subsequent srun: error message does not appear, the warnings do not affect the subsequent execution of your application.

Status: Resolved. Please note that a small number of kvsprovider warning messages may still occur from time to time, e.g., when the file systems are heavily loaded. This does not affect the execution of the application.

MPI failure: pscom_con_setup_ok() : connection in wrong state

Added: 2018-07-28

Description: The application terminates with the message:

<PSP:r0000XXX:pscom_con_setup_ok() : connection in wrong state : closed (openib)>

The error message may be followed by additional fatal errors reported by MPI.

Status: Resolved by a software update.

MPI failure: Other MPI error or read from socket failed

Added: 2018-07-28

Description: Running an application with ParaStation MPI results in a failure such as:

Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(907).......:
MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff32a31410,
count=1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed
mpid_irecv_done(107).....: read from socket failed -
request state:recv(pde)doneFatal error in MPI_Allreduce:
Other MPI error, error stack:
MPI_Allreduce(907).......:
MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fffbae01710,
count=1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed

The problem is intermittent: of two identical job executions, only one may be affected. The error read from socket failed indicates a failure reported by the low-level communication library of ParaStation MPI. Unfortunately, the same MPI error is reported for different underlying reasons, with no further information to distinguish them. The initial problem from July 2018 has been resolved; however, some simulations are still affected, likely for different underlying reasons.
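
For orientation, the call pattern appearing in the error stack above is an in-place MPI_Allreduce of a single MPI_INT over MPI_COMM_WORLD. A minimal sketch of such a call is shown below; since the failure is intermittent and its underlying causes differ, this is an illustration of the affected call pattern only, not a deterministic reproducer.

#include <mpi.h>
#include <stdio.h>

/* Minimal sketch of the in-place MPI_Allreduce call seen in the error
 * stack above. Illustration only; this does not deterministically
 * reproduce the failure. */
int main(int argc, char **argv)
{
    int value, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = rank; /* each rank contributes its rank number */

    /* In-place global sum of a single MPI_INT, as in the error stack */
    MPI_Allreduce(MPI_IN_PLACE, &value, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks: %d\n", value);

    MPI_Finalize();
    return 0;
}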

Status: Open.

Workaround/Suggested Action: No known workaround.

Hanging applications in MPI I/O

Added: 2018-07-28

Description: Applications hang in MPI I/O without making progress. The problem occurs with both ParaStation MPI and Intel MPI.
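
For readers unfamiliar with the term, MPI I/O refers to the file I/O routines of the MPI standard (MPI_File_open, MPI_File_write_at_all, and related calls). The following minimal sketch shows a collective MPI I/O write of the kind in which such hangs were observed; the file name and buffer size are illustrative choices, and the sketch is not a reproducer.

#include <mpi.h>

/* Sketch of a collective MPI I/O write. File name and buffer size are
 * illustrative choices; this is not a reproducer of the hang. */
int main(int argc, char **argv)
{
    int rank, data[1024];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 1024; ++i)
        data[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: every rank writes its block at a rank-dependent
     * offset, and all ranks must participate in the call. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(data),
                          data, 1024, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}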

Status: Resolved. The root cause has not been identified.

Variations in runtime/performance

Added: 2018-08-09

Description: In some cases, variations in the runtime/performance of certain codes have been reported.

If you encounter such a case, please let us know via sc@fz-juelich.de and include data that illustrates your case.

Status: Open.

MPI_Gather and MPI_Gatherv hang with Intel MPI 2018.02

Added: 2018-08-25

Description: With Intel MPI version 2018.02, MPI_Gather hangs for large message sizes; MPI_Gatherv also fails to terminate.
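
For illustration, a call of the affected form is sketched below: each rank contributes a large contiguous buffer that is gathered onto rank 0. The per-rank message size of 8 MiB is an arbitrary choice for illustration; the exact size at which the hang appears is not documented here.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of an MPI_Gather call with a large per-rank message, the kind of
 * call affected by the hang. The 8 MiB message size is illustrative. */
int main(int argc, char **argv)
{
    const size_t count = 8UL * 1024 * 1024; /* bytes per rank, illustrative */
    int rank, nprocs;
    char *sendbuf, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = malloc(count);
    memset(sendbuf, rank, count);
    if (rank == 0)
        recvbuf = malloc(count * (size_t)nprocs);

    /* Gather a large contiguous buffer from every rank onto rank 0 */
    MPI_Gather(sendbuf, (int)count, MPI_CHAR,
               recvbuf, (int)count, MPI_CHAR, 0, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}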

Status: Workaround implemented.

Workaround/Suggested Action: Mitigating environment variables have been added to the module file.

Collectives in Intel MPI 2019 can lead to hanging processes or segmentation faults

Added: 2018-11-27

Description: Problems with collective operations in Intel MPI 2019 have been observed. Segmentation faults in MPI_Allreduce, MPI_Alltoall, and MPI_Alltoallv have been reproduced, and hangs in MPI_Allgather and MPI_Allgatherv have been observed. As the occurrence depends on the algorithm chosen dynamically by the MPI implementation, the issue may or may not be visible depending on job and buffer sizes. Hangs in MPI_Cart_create calls have also been reported, likely due to problems with the underlying collective operations.
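
Because the visibility of the issue depends on the algorithm chosen for a given job and buffer size, one way to check a particular configuration is to run the affected collectives over a range of message sizes. The following is a hedged sketch of such a probe; the size range and the selection of collectives are illustrative choices, and completing the loop does not guarantee that an application is unaffected.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Probe sketch: exercise some of the collectives reported above over a
 * range of message sizes. Size range and collectives are illustrative. */
int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (size_t n = 1; n <= (1UL << 20); n *= 4) { /* 1 .. ~1M ints per rank */
        int *buf  = malloc(n * sizeof(int));
        int *send = malloc(n * sizeof(int) * (size_t)nprocs);
        int *recv = malloc(n * sizeof(int) * (size_t)nprocs);

        for (size_t i = 0; i < n; ++i)
            buf[i] = rank;
        for (size_t i = 0; i < n * (size_t)nprocs; ++i)
            send[i] = rank;

        MPI_Allreduce(MPI_IN_PLACE, buf, (int)n, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        MPI_Alltoall(send, (int)n, MPI_INT, recv, (int)n, MPI_INT, MPI_COMM_WORLD);
        MPI_Allgather(buf, (int)n, MPI_INT, recv, (int)n, MPI_INT, MPI_COMM_WORLD);

        if (rank == 0)
            printf("collectives completed for %zu ints per rank\n", n);

        free(buf);
        free(send);
        free(recv);
    }

    MPI_Finalize();
    return 0;
}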

Status: Open.

Workaround/Suggested Action: The default Intel MPI in Stage 2018b has been changed to Intel MPI 2018.04. Alternatively, falling back to Stage 2018a may be an option.

Segmentation Faults with MVAPICH2

Added: 2019-03-11

Description: It has been observed that MVAPICH2 (GDR version) does not reliably detect GPU device memory pointers and therefore executes invalid memory operations on such buffers, resulting in a segmentation fault in the application.
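
To illustrate the affected usage pattern, the sketch below passes a buffer allocated in GPU device memory directly to an MPI call, relying on the CUDA-awareness of the MPI library to detect the device pointer. It assumes a CUDA-aware MPI build (such as the MVAPICH2 GDR version) and the CUDA runtime; the buffer size is an illustrative choice and the sketch is not a reproducer.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Sketch of the affected pattern: a device-memory buffer is passed directly
 * to MPI, relying on the library to detect that the pointer refers to GPU
 * memory (CUDA-aware MPI). Illustration only, not a reproducer. */
int main(int argc, char **argv)
{
    int rank;
    double *d_buf;             /* pointer to GPU device memory */
    const int count = 1 << 20; /* illustrative buffer size */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, count * sizeof(double));
    cudaMemset(d_buf, 0, count * sizeof(double));

    /* The MPI library must recognize d_buf as device memory; if detection
     * fails, host-side memory operations are executed on an invalid address. */
    MPI_Allreduce(MPI_IN_PLACE, d_buf, count, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("allreduce on device buffer completed\n");

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}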

Status: Open.

Workaround/Suggested Action: The behavior of the MPI implementation depends on the buffer sizes. For some applications, adjusting the eager size limits via the environment variables MV2_IBA_EAGER_THRESHOLD and MV2_RDMA_FAST_PATH_BUF_SIZE can improve the situation; however, this has been observed to create problems with the collectives implementation in MVAPICH2. Please contact user support if you intend to adjust these values.