Known Issues on JUWELS

This page collects known issues affecting the system and application software on JUWELS.

Please note: The following list of known issues is intended to provide a quick reference for users experiencing problems on JUWELS. We strongly encourage all users to report the occurrence of any problem, whether listed below or not, to user support.

Application launch failures: Timeout waiting for task launch

Added: 2018-07-28

Description: Launch of applications may fail on JUWELS with the error message:

kvsprovider[3227]: Timeout: Not all clients called
pmi_init(): init=XXX left=YYY round=9
srun: error: timeout waiting for task launch, started XXX of YYY tasks

The problem is caused by a slow startup, which leads to the launch being terminated after several minutes as a precaution. The issue is observed particularly with larger allocations and higher numbers of MPI tasks per node. Please note that the system attempts 10 rounds before terminating the application and reports a warning in each round. If the number of rounds reported (round=X) is less than 9 and the subsequent srun: error message does not appear, the warning does not affect the execution of your application.

Status: Resolved. Please note that a small number of kvsprovider warning messages may still occur from time to time, e.g., when the file systems are heavily loaded. This does not affect the execution of the application.

MPI failure: pscom_con_setup_ok() : connection in wrong state

Added: 2018-07-28

Description: The application terminates with the message:

<PSP:r0000XXX:pscom_con_setup_ok() : connection in wrong state : closed (openib)>

The error message may be followed by additional Fatal errors reported by MPI.

Status: Resolved.

MPI failure: Other MPI error or read from socket failed

Added: 2018-07-28

Description: The application terminates with (for example):

Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(907).......:
MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fff32a31410,
count=1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed
mpid_irecv_done(107).....: read from socket failed -
request state:recv(pde)doneFatal error in MPI_Allreduce:
Other MPI error, error stack:
MPI_Allreduce(907).......:
MPI_Allreduce(sbuf=MPI_IN_PLACE, rbuf=0x7fffbae01710,
count=1, MPI_INT, MPI_SUM, MPI_COMM_WORLD) failed

The problem does not occur deterministically; of two identical job executions, it may affect only one.
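
For reference, the call pattern shown in the error stack above is an in-place MPI_Allreduce of a single integer. The following minimal C sketch illustrates that pattern; it is not taken from an affected application, the buffer contents are arbitrary, and since the failure is intermittent it is not a guaranteed reproducer:

/* Minimal sketch of the call pattern from the error stack:
   an in-place MPI_Allreduce of one int over MPI_COMM_WORLD. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int value = 1;
    MPI_Init(&argc, &argv);
    /* Same signature as in the error message:
       sbuf=MPI_IN_PLACE, count=1, MPI_INT, MPI_SUM, MPI_COMM_WORLD */
    MPI_Allreduce(MPI_IN_PLACE, &value, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("sum over all ranks: %d\n", value);
    MPI_Finalize();
    return 0;
}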

Status: Open. Due to recent system optimizations, this problem is expected to occur less frequently.

Workaround/Suggested Action: No known workaround.

Hanging applications in MPI I/O

Added: 2018-07-28

Description: Applications hang in MPI I/O operations without making progress. The problem occurs with both ParaStation MPI and Intel MPI.
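
For illustration only, the following minimal C sketch shows a collective MPI I/O write of the kind that may be affected; the file name and data layout are placeholders and the code is not taken from an affected application:

/* Minimal sketch of a collective MPI I/O write (placeholder file name). */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    value = rank;
    /* Collective open and write; each rank writes one int at its own offset. */
    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(int),
                          &value, 1, MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}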

Status: Open.

Workaround/Suggested Action: No known workaround.

Variations in runtime/performance

Added: 2018-08-09

Description: Variations in the runtime/performance of certain codes have been reported in some cases.

If you encounter such a case, please let us know via sc@fz-juelich.de. Please include data that illustrates your case.

Status: Open.

MPI_Gather and MPI_Gatherv hang with Intel MPI 18.02

Added: 2018-08-25

Description: With the latest Intel MPI version (18.02), MPI_Gather hangs for large message sizes; MPI_Gatherv also fails to terminate.
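
For illustration, the following minimal C sketch shows the affected pattern, an MPI_Gather with a large per-rank message; the buffer size is chosen arbitrarily and the code is not taken from an affected application:

/* Minimal sketch: MPI_Gather of a large per-rank buffer to rank 0. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1 << 22;          /* ~4 million doubles per rank (arbitrary) */
    int rank, size;
    double *sendbuf, *recvbuf = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sendbuf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) sendbuf[i] = rank;
    if (rank == 0)
        recvbuf = malloc((size_t)size * count * sizeof(double));
    /* Large-message gather of the kind reported to hang with Intel MPI 18.02. */
    MPI_Gather(sendbuf, count, MPI_DOUBLE, recvbuf, count, MPI_DOUBLE,
               0, MPI_COMM_WORLD);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

The I_MPI_ADJUST_GATHER and I_MPI_ADJUST_GATHERV variables in the workaround below select a different algorithm for these collective operations.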

Status: Open.

Workaround/Suggested Action: Please export I_MPI_ADJUST_GATHER=1 and I_MPI_ADJUST_GATHERV=3 in your job script.