Scalasca  (Scalasca 2.3.1, revision 14987)
Scalable Performance Analysis of Large-Scale Applications
A full workflow example

While the previous sections introduced the general usage workflow of Scalasca based on an abstract example, we will now walk through an example analysis of a moderately complex benchmark code using MPI: BT from the NAS Parallel Benchmarks (NPB-MPI 3.3) [10]. The BT benchmark implements a simulated CFD application using a block-tridiagonal solver for a synthetic system of nonlinear partial differential equations and consists of about 20 Fortran 77 source code files. Although BT does not exhibit significant performance bottlenecks – after all, it is a highly optimized benchmark – it serves as a good example to demonstrate the overall workflow, including typical configuration steps and how to avoid common pitfalls.

The example measurements (available for download on the Scalasca documentation web page [12]) were carried out using Scalasca in combination with Score-P 1.4 and Cube 4.3 on the JUROPA cluster at Jülich Supercomputing Centre. JUROPA's compute nodes are equipped with two Intel Xeon X5570 (Nehalem-EP) quad-core CPUs running at 2.93 GHz, and connected via a QDR InfiniBand fat-tree network. The code was compiled using Intel compilers and linked against ParTec ParaStation MPI (which is based on MPICH2). The example commands shown below – which are assumed to be available in PATH, e.g., after loading site-specific environment modules – should therefore be representative of using Scalasca in a typical HPC cluster environment.

Note
Remember that the Scalasca commands use other commands provided by Score-P and Cube. It is assumed that the executable directories of appropriate installations of all three components are available in the shell search path.
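As a quick sanity check before starting, a shell snippet along the following lines can verify that front-end commands from all three components are found in the search path. The tool names used here (`scalasca`, `scorep`, `cube_stat`) are representative front-end commands of the three packages; the module names needed to make them available are site-specific and not shown.

```shell
# Verify that Scalasca, Score-P, and Cube commands are in PATH.
# If any are missing, load the appropriate site-specific modules first.
missing=""
for tool in scalasca scorep cube_stat; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT found"
    missing="$missing $tool"
  fi
done
[ -z "$missing" ] || echo "Missing:$missing - check your environment modules."
```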


Scalasca    Copyright © 1998–2016 Forschungszentrum Jülich GmbH, Jülich Supercomputing Centre
Copyright © 2009–2015 German Research School for Simulation Sciences GmbH, Laboratory for Parallel Programming