Known Issues on JUWELS

This page collects known issues affecting the system and application software on JUWELS.

Please note: The following list of known issues is intended to provide a quick reference for users experiencing problems on JUWELS. We strongly encourage all users to report the occurrence of problems, whether listed below or not, to user support.

Application launch failures: Timeout waiting for task launch

Added: 2018-07-28

Description: Launch of applications may fail on JUWELS with the error message:

kvsprovider[3227]: Timeout: Not all clients called
pmi_init(): init=XXX left=YYY round=9
srun: error: timeout waiting for task launch, started XXX of YYY tasks

The problem is caused by a slow application startup, which is terminated after several minutes as a precaution. The issue is observed particularly with larger allocations and higher numbers of MPI tasks per node. Please note that the system attempts 10 rounds before terminating the application and reports a warning in each round. If the number of rounds reported (round=X) is less than 9 and the srun: error message shown above does not follow, the warning does not affect the subsequent execution of your application.

Status: Resolved. Please note that a small number of kvsprovider warning messages may still occur from time to time, e.g., when the file systems are heavily loaded. This does not affect the execution of the application.

MPI failure: pscom_con_setup_ok() : connection in wrong state

Added: 2018-07-28

Description: The application terminates with the message:

<PSP:r0000XXX:pscom_con_setup_ok() : connection in wrong state : closed (openib)>

The error message may be followed by additional Fatal errors reported by MPI.

Status: Resolved.

MPI failure: Other MPI error or read from socket failed

Added: 2018-07-28

Description: The application terminates with (for example):

Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(907).......:
MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff32a31410,
count=1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed
mpid_irecv_done(107).....: read from socket failed -
request state:recv(pde)doneFatal error in MPI_Allreduce:
Other MPI error, error stack:
MPI_Allreduce(907).......:
MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fffbae01710,
count=1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed

The problem does not always occur, i.e., for two identical job executions the problem may occur in only one of them.
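
For reference, the call shown in the error stack above is an in-place MPI_Allreduce on a single MPI_INT. The following is a minimal, self-contained sketch of that pattern (the reduced value is purely illustrative), not a guaranteed reproducer:

/* allreduce_inplace.c -- minimal in-place MPI_Allreduce as in the error stack above */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = 1;  /* each rank contributes 1; the sum equals the number of ranks */
    MPI_Allreduce(MPI_IN_PLACE, &value, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d\n", value);

    MPI_Finalize();
    return 0;
}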

Status: Open. Due to recent system optimizations, this issue is expected to occur less frequently.

Workaround/Suggested Action: No known workaround.

Hanging applications in MPI I/O

Added: 2018-07-28

Description: Applications hang in MPI I/O without making progress. The problem occurs with both ParaStation MPI and Intel MPI.
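
The issue report does not name the specific MPI I/O routines involved. Purely as an illustration, the sketch below shows a typical collective MPI I/O pattern (each rank writing its own slot of a shared file with MPI_File_write_at_all); the file name "out.dat" is a placeholder:

/* mpiio_write.c -- illustrative collective MPI I/O pattern (not a confirmed reproducer) */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value  = rank;
    offset = (MPI_Offset)rank * sizeof(int);  /* each rank writes its own slot */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, offset, &value, 1, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}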

Status: Open.

Workaround/Suggested Action: No known workaround.

Variations in runtime/performance

Added: 2018-08-09

Description: In some cases variations in runtime/performance of certain codes have been reported.

If you encounter such a case, please let us know via sc@fz-juelich.de and include data that illustrates your case.
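
One simple way to produce such data (a sketch only, not a required procedure) is to time the affected code region with MPI_Wtime in each run and report the maximum across ranks:

/* timing.c -- sketch for collecting per-run timing data with MPI_Wtime */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double t0, t_local, t_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    /* ... code region whose runtime varies between runs ... */
    t_local = MPI_Wtime() - t0;

    /* the slowest rank determines the observed runtime */
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("elapsed: %f s\n", t_max);

    MPI_Finalize();
    return 0;
}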

Status: Open.

MPI_Gather and MPI_Gatherv hang with Intel MPI 18.02

Added: 2018-08-25

Description: With the latest Intel MPI 2018 version, MPI_Gather hangs for large message sizes; MPI_Gatherv likewise fails to terminate.
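
For illustration, the affected pattern is a plain MPI_Gather of a per-rank buffer onto rank 0. The sketch below uses an arbitrary buffer size as a stand-in for "large"; it is not a confirmed reproducer:

/* gather_large.c -- minimal MPI_Gather sketch; COUNT is an arbitrary placeholder */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define COUNT (1 << 20)   /* ints per rank; purely illustrative for "large" */

int main(int argc, char **argv)
{
    int rank, size;
    int *sendbuf, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(COUNT * sizeof(int));
    memset(sendbuf, 0, COUNT * sizeof(int));
    if (rank == 0)
        recvbuf = malloc((size_t)size * COUNT * sizeof(int));  /* root collects from all ranks */

    MPI_Gather(sendbuf, COUNT, MPI_INT, recvbuf, COUNT, MPI_INT, 0, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}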

Status: Workaround implemented.

Workaround/Suggested Action: Mitigating environment variables have been added to the module file.

Collectives in Intel MPI 2019 can lead to hanging processes or segmentation faults

Added: 2018-11-27

Description: Problems with collective operations and Intel MPI 2019 have been observed. Segmentation faults in MPI_Allreduce, MPI_Alltoall, and MPI_Alltoallv have been reproduced, and hangs in MPI_Allgather and MPI_Allgatherv have been observed. As the occurrence depends on the algorithm dynamically chosen by the MPI implementation, the issue may or may not be visible depending on job and buffer sizes. Hangs in MPI_Cart_create calls have also been reported, likely due to problems with the underlying collective operations.
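
As an illustration of the last point, a minimal MPI_Cart_create call is sketched below (a 1-D periodic Cartesian communicator; the dimensionality and periodicity are arbitrary choices, not taken from a reported case):

/* cart_create.c -- minimal MPI_Cart_create sketch (layout choices are arbitrary) */
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, dims[1] = {0}, periods[1] = {1};
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 1, dims);   /* let MPI choose the 1-D layout */
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}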

Status: Open.

Workaround/Suggested Action: Please use Stage 2018a with Intel MPI 2018.