SCALASCA v2.6.1 OPEN ISSUES
===========================
                                                   (Status: December 2022)

This file lists known limitations and unimplemented features of various
Scalasca components.

-----------------------------------------------------------------------------

* Platform support

  - Scalasca has been tested on the following platforms:

      + IBM BladeCenter & iDataPlex clusters
      + Cray XC series (x86_64, AArch64)
      + various Linux-based clusters (x86_64, Power8/9, armhf, AArch64)
      + Intel Xeon Phi (KNL)

    In addition, the provided configure options (see INSTALL) may provide
    a good basis for building and testing the toolset on other systems.
    Please report success/failure on other platforms to the Scalasca
    development team.

  - The following platforms have not been tested recently; however, the
    supplied build system might still work on those systems:

      + IBM Blue Gene/Q
      + IBM Blue Gene/P
      + AIX-based clusters
      + Cray XT, XE, XK series
      + Fujitsu FX10, FX100, and K computer
      + Intel Xeon Phi (KNC)
      + Oracle/Sun Solaris/SPARC-based clusters

  - On the Intel Xeon Phi platform, only Intel compilers and Intel MPI
    are currently supported.

-----------------------------------------------------------------------------

* SCOUT parallel trace analysis

  - The OpenMP and hybrid MPI/OpenMP versions of the SCOUT parallel trace
    analyzer (and its associated libraries) have been found to either fail
    to build or to execute incorrectly with old PGI pgCC versions
    (verified with, e.g., v10.5, v13.9, v14.1).  Consequently, Scalasca
    should be configured using '--disable-openmp' to skip building the
    corresponding scout.omp and scout.hyb executables (see the sketch
    further below).  Pathscale and other compilers may have similar
    problems and may also need the same treatment.

  - If it is not possible to build the required versions of SCOUT, or if
    they fail to run reliably, it may be possible to substitute a version
    built with a different compiler (such as GCC) when doing measurement
    collection & analysis (e.g., in a batch job).

  - The MPI and hybrid MPI/OpenMP versions of the SCOUT parallel analyzer
    must be run as an MPI program with exactly the same number of
    processes as contained in the experiment to analyze: typically it
    will be convenient to launch SCOUT immediately following the
    measurement in a single batch script, so that the MPI launch command
    can be configured similarly for both steps.  The SCAN nexus executes
    SCOUT with the appropriate launch configuration when automatic trace
    analysis is specified.

  - If the appropriate variant of SCOUT (e.g., scout.hyb for hybrid
    OpenMP/MPI) is not located by SCAN, it attempts to substitute an
    alternative variant, which will generally result in only partial
    trace analysis (e.g., scout.mpi will ignore OpenMP events in hybrid
    traces).

  - SCOUT is unable to analyze traces of applications using MPI
    inter-communicators.

  - SCOUT is unable to analyze hybrid OpenMP/MPI traces of applications
    using MPI_THREAD_MULTIPLE, and is generally unable to handle
    MPI_THREAD_SERIALIZED; it is therefore necessary to enforce use of
    MPI_THREAD_FUNNELED.

  - SCOUT is unable to analyze traces of OpenMP or hybrid OpenMP/MPI
    applications using nested parallelism, untied tasks, or varying
    numbers of threads for parallel regions (e.g., due to
    "num_threads(#)" or "if" clauses).

  - Although SCOUT is able to handle traces including OpenMP tasking
    events, no tasking-specific analysis is performed yet.  Also,
    calculated wait states in OpenMP barriers as well as the results of
    the root-cause and critical-path analyses are currently incorrect in
    the presence of OpenMP tasks.
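    For illustration, a minimal sketch of the configure workaround and of
    a manual SCOUT launch as discussed above (archive name, process
    count, and launcher are placeholders; passing the experiment's
    'traces.otf2' anchor file as argument is an assumption about the
    SCOUT command line on a given installation):

        ./configure --disable-openmp ...   # skip scout.omp / scout.hyb
        make && make install

        # analyze a 32-process experiment with a (e.g., GCC-built) scout.mpi
        mpiexec -np 32 scout.mpi scorep_app_32_trace/traces.otf2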
  - For traces including OpenMP tasking events, the CubeGUI and CubeLib
    command-line tools up to and including version 4.5 show incorrect
    metric values for the whole-program call path (which exists for
    traces generated by Score-P v5.0 and above).  This is due to
    accounting tasks twice in the inclusify/exclusify calculations, and
    it also applies to derived metrics, for example, those calculated
    during report post-processing.  This issue will be addressed in
    future versions of CubeLib/CubeGUI.

  - Analysis of traces using POSIX threads is performed using the OpenMP
    and OpenMP/MPI versions of SCOUT, which will try to launch one
    analysis thread for every POSIX thread recorded in the trace.  That
    is, if the application repeatedly creates and joins POSIX threads,
    the analysis is likely to oversubscribe the system (thus potentially
    increasing analysis times significantly), and the OpenMP runtime may
    not even be able to create the required number of analysis threads.

  - SCOUT is unable to analyze traces of applications using CUDA or
    OpenCL that contain events for compute kernels and/or memory
    transfers recorded on separate locations.  However, analysis of
    traces including only host-side events from CUDA, OpenCL, or OpenACC
    is possible, also when used in combination with MPI and/or OpenMP.
    Care has to be taken, though, when interpreting analysis results of
    such experiments, as the critical-path and root-cause analyses only
    consider host-side events.  For example, SCOUT may identify a GPU
    synchronization operation as the delay causing some wait states,
    while the real culprits are the compute kernels causing the
    synchronization operation to block.

  - SCOUT is unable to analyze traces of applications using SHMEM, even
    when used in combination with MPI and/or OpenMP.

  - SCOUT is unable to analyze traces that include per-process metric
    events on separate locations.  In particular, this applies to memory
    tracking events measured using, for example,
    "SCOREP_MEMORY_RECORDING=true" or "SCOREP_MPI_MEMORY_RECORDING=true"
    (see the sketch further below).

  - SCOUT is unable to analyze incomplete traces or traces that it is
    unable to load entirely into memory.  Experiment archives are
    portable to other systems where sufficient processors with additional
    memory are available and a compatible version of SCOUT is installed;
    however, the size of such experiment archives typically prohibits
    this.

  - In rare cases, the SCOUT analysis of traces generated with Score-P
    v6.0 or earlier built against OTF2 v2.2 or higher may produce
    negative time values.  This is due to a code inconsistency in the
    given versions, which is fixed in Score-P v7.0 (using OTF2 v2.2 or
    higher).

  - SCOUT requires user-specified instrumentation blocks to correctly
    nest and match for it to be able to analyze the resulting measurement
    traces.  Similarly, collective operations must be recorded by all
    participating processes, and messages recorded as sent (or received)
    by one process must be recorded as received (or sent) by another
    process; otherwise SCOUT can be expected to deadlock during trace
    analysis.

  - SCOUT ignores hardware counter measurements recorded in traces.  If
    the measurement included simultaneous runtime summarization and
    tracing, the two reports are automatically combined during experiment
    post-processing.

  - SCOUT is unable to handle old EPILOG trace files stored in SIONlib
    containers.  Also, the lock contention analysis is not performed for
    EPILOG traces.
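    For example, to keep a trace analyzable with respect to the
    per-process metric limitation above, memory tracking can be left
    disabled for the tracing run.  A minimal sketch (launcher, process
    count, and target are placeholders):

        export SCOREP_MEMORY_RECORDING=false
        export SCOREP_MPI_MEMORY_RECORDING=false
        scan -t mpirun -np 4 target.exe arglist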
  - SCOUT may deadlock and be unable to analyze measurement experiments:
    should you suspect this to be the case, please save the experiment
    archive and contact the Scalasca development team for it to be
    investigated (see the instructions in the User Guide for details on
    which information to include in bug reports).

  - Traces that SCOUT is unable to analyze may still be visualized and
    interactively analyzed by 3rd-party tools such as VAMPIR.

  - Issues related to Score-P's sampling/unwinding feature:

    - The current implementation of handling traces generated with
      sampling/unwinding is known to be memory inefficient.  That is, it
      may only be possible to analyze small trace files.

    - Obviously, limitations of the Score-P measurement system also
      propagate to Scalasca via the generated trace data.  In particular,
      analysis results of programs using OpenMP are likely to include
      unexpected call paths for worker threads.

    - The aforementioned limitation also leads to the calculation of
      bogus OpenMP thread management times.

-----------------------------------------------------------------------------

* SCAN collection & analysis launcher

  This utility attempts to parse MPI launcher commands to be able to
  launch measurement collection along with subsequent trace analysis when
  appropriate.  It also attempts to determine whether measurement and
  analysis are likely to be blocked by various configuration issues
  before performing the actual launch(es).  Such basic checks might be
  invalid in certain circumstances and inhibit legitimate measurement and
  analysis launches.

  While SCAN has been tested with a selection of MPI launchers (on
  different systems, interactively and via batch systems), it is not
  possible to test all versions, combinations, and configuration/launch
  arguments; if the current support is inadequate for a particular setup,
  details should be sent to the developers for investigation.

  In general, launcher flags that require one or more arguments can be
  ignored by SCAN if they are quoted, e.g.,
  $MPIEXEC -np 32 "-ignore arg1 arg2" target arglist
  would ignore the "-ignore arg1 arg2" flag and its arguments.

  Although SCAN parses launcher arguments from the given command-line
  (and in certain cases also launcher environment variables), it does not
  parse launcher configurations from command-files (regardless of whether
  they are specified on the command-line or otherwise).  Since the part
  of the launcher configuration specified in this way is ignored by SCAN,
  but will be used for the measurement and analysis steps launched, this
  may lead to undesirable discrepancies.  If command-files are used for
  launcher configuration, it may therefore be necessary or desirable to
  repeat some of their specifications on the command-line to make them
  visible to SCAN.

  SCAN only parses the command-line as far as the target executable,
  assuming that subsequent flags/parameters are intended solely for the
  target itself.  Unfortunately, some launchers (notably POE) allow MPI
  configuration options after the target executable, where SCAN won't
  find them and therefore won't use them when launching the parallel
  trace analyzer.  A workaround is to specify POE configuration options
  via environment variables instead, e.g., specify MP_PROCS instead of
  -procs, as shown below.
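  For example (a sketch of the POE workaround just described; the process
  count is a placeholder):

      export MP_PROCS=32
      scan -t poe target.exe arglist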
  SCAN uses getopt_long_only() (either from the system's C library or GNU
  libiberty) to parse launcher options.  Older versions seem to have a
  bug that fails to stop parsing when the first non-option (typically the
  target executable) is encountered; a workaround in such cases is to
  insert "--" in the command-line before the target executable, e.g.,
  scan -t mpirun -np 4 -- target.exe arglist.

  If an MPI launcher is used that is not recognized by SCAN, such as one
  that has been locally customized, it can be specified via an
  environment variable, e.g., SCAN_MPI_LAUNCHER=mympirun, to have SCAN
  accept it.  Warning: in such a case, SCAN's parsing of the launcher's
  arguments may fail.

  Some MPI launchers result in some or all program output being buffered
  until execution terminates.  In such cases, SCAN_MPI_REDIRECT can be
  set to redirect program standard and error output to separate files in
  the experiment archive.

  If necessary, or preferred, measurement and analysis launches can be
  performed without using SCAN, resulting in "default" measurement
  collection or explicit trace analysis (based on the effective Score-P
  configuration).

  SCAN automatic trace analysis of hybrid MPI/OpenMP applications is
  primarily done with an MPI/OpenMP version of the SCOUT trace analyzer
  (scout.hyb).  When this is not available, or when the MPI-only version
  of the trace analyzer (scout.mpi) is specified, analysis results are
  provided for the master threads only.

-----------------------------------------------------------------------------

* Other known issues

  - In contrast to Score-P v6.0 and above, the Cube metric hierarchy
    remapping specification file shipped with Scalasca currently
    classifies all MPI request finalization calls (i.e.,
    MPI_Test[all|any|some] and MPI_Wait[all|any|some]) as point-to-point,
    regardless of whether they actually finalize a point-to-point
    request, a file I/O request, or a non-blocking collective operation.
    This causes incorrect results when merging already post-processed
    runtime summary and trace analysis reports.  However, this is not an
    issue when post-processing reports after merging (see the note
    below).  This inconsistency will be addressed in a future release.
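    For reference, the standard post-processing workflow, which (per the
    notes in the SCOUT section above) combines the reports during
    post-processing rather than merging already post-processed reports,
    is invoked on the experiment archive.  A minimal sketch, with a
    placeholder archive name:

        scalasca -examine scorep_app_32_trace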