Simultaneous Multithreading

The Haswell processors in JURECA offer the possibility of Simultaneous Multithreading (SMT) in the form of the Intel Hyper-Threading (HT) Technology. With HT enabled, each (physical) processor core can execute two threads or tasks (called processes in the following for simplicity) simultaneously. The operating system therefore lists a total of 48 logical cores or Hardware Threads (HWT) per node (see cat /proc/cpuinfo). A maximum of 48 processes can thus be executed on each compute node without overbooking.
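
The number of logical cores visible to the operating system can be checked directly on a compute node (e.g., from within an interactive job allocation). The following lines are only a quick sanity check using standard Linux tools:

# Count the logical cores (hardware threads) reported by the operating system;
# on a JURECA Haswell compute node this prints 48.
grep -c ^processor /proc/cpuinfo

# lscpu summarizes the topology (sockets, cores per socket, threads per core).
lscpu | grep -E '^CPU\(s\)|Thread|Core|Socket'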

Each JURECA compute node consists of two CPUs, located on sockets zero and one, with 12 physical cores each. The 24 physical cores are numbered 0 to 23 and the hardware threads are numbered 0 to 47 in a round-robin fashion, i.e., hardware threads n and n+24 reside on the same physical core. Fig. 1 depicts a node schematically and illustrates the numbering convention.

Fig. 1 Illustration of a JURECA compute node including hardware threads.
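
The numbering convention can also be verified on a compute node via the Linux sysfs topology interface; the following is a minimal sketch assuming the standard sysfs layout:

# List the hardware threads that share physical core 0; with the round-robin
# numbering described above this is expected to print "0,24".
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list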

Using HT on JURECA

The Slurm batch system on JURECA does not differentiate between physical cores and hardware threads; in the Slurm terminology, each hardware thread is a CPU. For this reason each compute node reports a total of 48 CPUs in the scontrol show node output. Whether or not threads share a physical core therefore depends on the number of tasks per node (--ntasks-per-node), the number of CPUs per task (--cpus-per-task) and the process pinning. Listing 1 shows a pure MPI job that requests 48 tasks per node and thus uses all hardware threads. The script starts mpi-prog on 4 nodes with 48 MPI tasks per node, so that two MPI tasks are executed on each physical core.

Listing 1 Pure MPI code
#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=192
#SBATCH --ntasks-per-node=48
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch

srun ./mpi-prog
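
The number of CPUs Slurm assumes for a node can be checked with scontrol; the node name jrc0001 below is only a placeholder and has to be replaced by an actual node name:

# Each hardware thread counts as one CPU, so CPUTot=48 is expected.
scontrol show node jrc0001 | grep -o 'CPUTot=[0-9]*'

# Inside a job allocation, the same number is available in an environment variable.
echo ${SLURM_CPUS_ON_NODE}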

Listing 2 shows a hybrid MPI/OpenMP job that requests 2 tasks per node with 24 CPUs per task, i.e., 48 CPUs per node, thus using all hardware threads. The script starts hybrid-prog on 3 nodes with 2 MPI tasks per node and 24 OpenMP threads per task.

Listing 2 Hybrid MPI/OpenMP code
#!/bin/bash -x
#SBATCH --nodes=3
#SBATCH --ntasks=6
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:20:00
#SBATCH --partition=batch

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hybrid-prog
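
If the placement of the OpenMP threads within each task should be controlled explicitly, the standard OpenMP 4.0 affinity variables can be added to Listing 2 before the srun call. Whether this is beneficial depends on the application and the OpenMP runtime in use:

# Keep the threads of each MPI task close together, one thread per hardware thread.
export OMP_PLACES=threads
export OMP_PROC_BIND=close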

The use of the second set of 24 hardware threads (logical CPUs 24 to 47) can be disabled by passing the option --hint=nomultithread to srun. With this option the same logical cores are overbooked as soon as more than 24 processes or threads per node are executed. For most applications this option is not beneficial and the default (--hint=multithread) should be used.
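
As an illustration, a variant of Listing 1 that is restricted to the 24 physical cores per node could look as follows (reusing the mpi-prog example from above):

#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch

# One MPI task per physical core; the hint prevents the use of the
# second hardware thread of each core.
srun --hint=nomultithread ./mpi-prog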

How to profit from HT

Processes which are running on the same physical core share several of the resources available to that core. Applications will therefore profit most from HT if the processes running on the same core are complementary in their usage of resources (e.g., complementary computation and memory-access phases). On the other hand, processes with similar resource usage may compete for bandwidth or functional units and hamper each other. We recommend testing whether or not your code profits from HT.

In order to test whether your application benefits from HT, compare the timings of two runs on the same number of physical cores (i.e., the number of nodes specified with --nodes should be the same for both jobs): one job without HT (\(t_1\)) and one job with HT (\(t_2\)). If \(t_2\) is lower than \(t_1\), your application benefits from HT. In practice, \(t_1 / t_2\) will be less than 1.5 (i.e., a runtime improvement of at most 50% can be achieved through HT). However, applications may show a smaller benefit or even slow down when using HT.
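
A minimal sketch of such a comparison, assuming two job scripts derived from Listing 1 (the script names and job IDs below are placeholders):

# Job 1, without HT: --ntasks-per-node=24 and srun --hint=nomultithread ./mpi-prog (t1)
sbatch job_noht.sh
# Job 2, with HT: --ntasks-per-node=48 and srun ./mpi-prog (t2)
sbatch job_ht.sh

# After both jobs have finished, read the elapsed wall-clock times from the
# accounting database (replace the job IDs with those returned by sbatch).
sacct -j 123456,123457 --format=JobID,JobName,Elapsed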

Please note that the process binding may have a significant impact on the measured run times \(t_1\) and \(t_2\).
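
The binding applied by Slurm can be reported at job start, which helps to interpret the measured timings; a short sketch (the exact option spelling may differ between Slurm versions):

# Print the CPU mask assigned to each task before the application starts.
srun --cpu-bind=verbose ./mpi-prog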