JURECA Evaluation Platform Overview

JURECA is equipped with additional nodes for evaluation and testing. To log in, please see Access.

MI200 nodes

The configuration of the JURECA DC MI200 compute nodes (2 nodes) is the following:

  • CPU: AMD EPYC 7443 processor (Milan); 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 [1] configuration (details for AMD EPYC 7443 on WikiChip)

  • Memory: 512 GiB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)

  • GPU: 4 × AMD MI250 GPUs, each with 128 GB memory; the GPUs are built as Multi-Chip Modules (MCM) and therefore appear as 8 GPUs with 64 GB memory each.

  • Network: 1 × Mellanox HDR InfiniBand ConnectX-6 HCA (100 Gbit/s) (not yet final)

Details about the hardware can be found on Gigabyte’s webpage.

Details about the node topology can be found in AMD’s CDNA2 whitepaper as Figure 2b.

MI200 nodes Slurm considerations

The MI200 nodes are accessible in the dc-mi200 partition, which is hidden by default.
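
For example, an interactive allocation could be requested as follows (a sketch following the salloc/srun pattern shown below for the IPU partition; rocm-smi simply lists the GPUs visible on the node):

$ salloc --account <budget> --partition dc-mi200 -N1
$ srun rocm-smi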

Known Issues

Building for AMD GPUs

Added: 2022-08-01

Affects: MI200 nodes

Description: Currently, the ROCm stack is not available on the JURECA DC login nodes.

Status: Open.

Workaround/Suggested Action: For the time being, please build your application directly on the AMD-GPU-equipped compute nodes. Please don’t block compute nodes unnecessarily and release them quickly after building.
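
A short build session might look like this (a sketch; saxpy.hip is a placeholder source file, and gfx90a is the GPU architecture of the MI250):

$ salloc --account <budget> --partition dc-mi200 -N1 --time=00:30:00
$ srun --pty bash -i
$ hipcc -O2 --offload-arch=gfx90a saxpy.hip -o saxpy
$ exit    # release the node as soon as the build is done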

InfiniBand Cards

Added: 2022-08-01

Affects: MI200 nodes

Description: The InfiniBand HCAs are currently installed into non-optimal PCIe slots due to power constraints.

Status: Open.

Workaround/Suggested Action: No action necessary. During high-bandwidth transfers, latency might be a little higher than expected. A fix might be on the way.

Graphcore IPU-POD4

The IPU-POD4 consists of two parts:

  • an AMD EPYC-based access server, on which user applications are launched, with
    • CPU: AMD EPYC 7413 (Milan); 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 [1] configuration (details for AMD EPYC 7413 on WikiChip)

    • Memory: 512 GiB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)

    • Network: 1 × Mellanox EDR InfiniBand ConnectX-5 (100 Gbit/s) to connect to other compute nodes and 1 × Mellanox 100 GigE ConnectX-5 to connect to the IPU-M2000

  • a Graphcore IPU-M2000 with four IPUs, which is connected directly to the access server.

Graphcore IPU-POD4 Slurm considerations

The access server is integrated into the Slurm batch system on JURECA-DC as its own partition with the name dc-ipu, which is hidden by default. To get access, use the regular Slurm mechanisms, e.g. for an “interactive” job from which you can run several job steps interactively:

$ salloc --account <budget> --partition dc-ipu -N1
salloc: Granted job allocation 10362000
salloc: Waiting for resource configuration
salloc: Nodes jrc0860 are ready for job

Then use srun to run commands on the access server:

$ srun hostname
jrc0860.jureca

From the access server, you can use all four IPUs in the M2000. Slurm restricts access to one user at a time.

Graphcore Software

Applications can make use of the IPUs through the Graphcore SDK or one of several ML frameworks that have been extended with Graphcore plugins, such as TensorFlow or PyTorch. Please see the Graphcore documentation library for information on how to use these. Graphcore provides binary distributions of its SDK, but these do not currently support the OS used on JURECA-DC (Rocky Linux 8). However, Graphcore publishes a set of container images with the software on Docker Hub. These can be used on JURECA-DC with the Apptainer container runtime.

Apptainer creates containers from images stored as a single file in the file system. To pull an image with the Poplar SDK onto the file system do:

$ apptainer pull poplar.sif docker://docker.io/graphcore/poplar:2.4.0

Afterwards you can run commands from inside a container based on the image like this:

$ srun apptainer run poplar.sif -- gc-info -a
Graphcore device listing:

-+- Id: [0], target:    [Fabric], PCI Domain: [3]
-+- Id: [1], target:    [Fabric], PCI Domain: [2]
-+- Id: [2], target:    [Fabric], PCI Domain: [1]
-+- Id: [3], target:    [Fabric], PCI Domain: [0]
-+- Id: [4], target: [Multi IPU]
 |--- Id: [0], DNC Id: [0], PCI Domain: [3]
 |--- Id: [1], DNC Id: [1], PCI Domain: [2]
-+- Id: [5], target: [Multi IPU]
 |--- Id: [2], DNC Id: [0], PCI Domain: [1]
 |--- Id: [3], DNC Id: [1], PCI Domain: [0]
-+- Id: [6], target: [Multi IPU]
 |--- Id: [0], DNC Id: [0], PCI Domain: [3]
 |--- Id: [1], DNC Id: [1], PCI Domain: [2]
 |--- Id: [2], DNC Id: [2], PCI Domain: [1]
 |--- Id: [3], DNC Id: [3], PCI Domain: [0]

Note how Slurm’s srun and the apptainer command are composed. This assumes that you have an active job allocation from a previous salloc. To run one of the Graphcore tutorial applications:

$ git clone -b sdk-release-2.4 https://github.com/graphcore/tutorials.git
$ apptainer build tensorflow.sif docker://docker.io/graphcore/tensorflow:1
$ srun apptainer run tensorflow.sif -- python3 tutorials/simple_applications/tensorflow/mnist/mnist.py
2022-04-08 13:03:38.627641: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:47] Poplar version: 2.4.0 (10a96ee536) Poplar package: 969064e2df
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/losses/losses_impl.py:121: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2022-04-08 13:03:41.144173: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2649785000 Hz
2022-04-08 13:03:41.458234: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1610] Device /device:IPU:0 attached to IPU: 0
2022-04-08 13:03:44.865634: I tensorflow/compiler/jit/xla_compilation_cache.cc:251] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
Compiling module cluster_18183311839169509267__.365:
[##################################################] 100% Compilation Finished [Elapsed: 00:00:08.5]

Loss: 1.5828259517669678
Time: 11.811853408813477

Loss: 1.5447135463078816
Time: 2.1141273975372314

Loss: 1.5387713934580485
Time: 2.111985206604004

Loss: 1.5387063130696614
Time: 2.105924129486084

Loss: 1.5317738628387452
Time: 2.110886812210083
Program ran successfully

If software is missing from an image, it can be installed by building a new image based on it. E.g., the Poplar SDK image comes without a compiler, meaning it can be used to run software compiled against the Poplar SDK, but not to compile it. An image with the Poplar SDK and a compatible compiler can be created from the following Dockerfile:

FROM docker.io/graphcore/poplar:2.4.0

RUN apt-get update && \
    apt-get install -y build-essential && \
    apt-get -y clean && \
    rm -rf /var/lib/apt/lists/*

Either build the Dockerfile with Docker and upload the resulting image to a registry from which you can apptainer pull it onto JURECA-DC, or use our container build system.
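
For the Docker route, a minimal sketch could look like the following (the registry and image name are placeholders):

$ docker build -t registry.example.com/<user>/poplar-devel:2.4.0 .
$ docker push registry.example.com/<user>/poplar-devel:2.4.0

On JURECA-DC, pull the pushed image as before:

$ apptainer pull poplar-devel.sif docker://registry.example.com/<user>/poplar-devel:2.4.0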

NVIDIA Arm HPC Dev Kit

We deploy 2 NVIDIA Arm HPC Dev Kits, each consisting of

  • an Ampere Altra Q80-30 CPU with 80 cores and 512 GB memory,

  • 2 NVIDIA A100-PCIe-40-GB GPUs,

  • 2 NVIDIA Mellanox BlueField2 DPUs (200 GbE)

Please find details in NVIDIA’s documentation about the Dev Kit.

Building for Arm

Currently, cross-compilation from the JURECA DC login nodes is not officially supported. The easiest approach is to build directly on the Arm compute nodes themselves.
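
For example, an interactive build session might look like this (a sketch; the partition name is a placeholder, stream.c stands in for your source file, and the -mcpu flag targets the Neoverse-N1 cores of the Altra Q80-30):

$ salloc --account <budget> --partition <arm-partition> -N1
$ srun --pty bash -i
$ gcc -O3 -mcpu=neoverse-n1 stream.c -o stream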

H100 node

The configuration of the JURECA DC H100 compute node is the following:

  • CPU: Intel Xeon Platinum 8452Y processor (Sapphire Rapids); 2 sockets, 36 cores per socket, SMT-2 (total: 2×36×2 = 144 threads) (details for Intel Xeon Platinum 8452Y on Intel ARK)

  • Memory: 512 GiB DDR5-4800 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket;

  • GPU: 4 × NVIDIA H100 PCIe GPUs, each with 80 GB memory;

  • Network: 1 × BlueField-2 ConnectX-6 DPU @ EDR (100 Gbit/s)

A subset of the GPUs is interconnected via NVLink (12 links each): GPU0 with GPU1, and GPU2 with GPU3. See the following GPU topology display of the node for details:

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    NIC0    CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     NODE    0-35,72-107     0
GPU1    NV12     X      SYS     SYS     NODE    0-35,72-107     0
GPU2    SYS     SYS      X      NV12    SYS     36-71,108-143   1
GPU3    SYS     SYS     NV12     X      SYS     36-71,108-143   1
NIC0    NODE    NODE    SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
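
When placing processes manually, the CPU and NUMA affinity columns above can be used for binding, for example with numactl (a sketch; ./app is a placeholder):

$ srun numactl --cpunodebind=0 --membind=0 ./app    # GPU0/GPU1 are attached to NUMA node 0
$ srun numactl --cpunodebind=1 --membind=1 ./app    # GPU2/GPU3 are attached to NUMA node 1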

H100 node Slurm considerations

The H100 node is accessible in the dc-h100 partition, which is hidden by default.

H100 software modules

The H100 node uses its own software stage. The main difference from the default Stages/2023 is that CUDA/12.0 is used, as this is the first CUDA version with full support for the new NVIDIA H100.

To load the software stage for the H100 on the login nodes you can load the corresponding Architecture module:

module load Architecture/jureca_spr

When executing a job on the H100 node, the modules can be loaded directly. As usual, please don’t block the H100 node for extended interactive sessions.
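
A minimal sketch of compiling for the H100 from within a job on the node (saxpy.cu is a placeholder; sm_90 is the compute capability of the H100):

module load CUDA/12.0
nvcc -O2 -arch=sm_90 saxpy.cu -o saxpy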

Grace Hopper nodes

The Grace Hopper nodes are two QCT QuantaGrid S74G-2U servers with the following configuration:

  • CPU: 1 × NVIDIA Grace, 72 cores

  • GPU: 1 × NVIDIA H100

  • Memory: 480 GiB LPDDR5X and 96 GiB HBM3

  • Network: 1 × NVIDIA ConnectX-7 @ 2 × EDR (200 Gbit/s)

  • Storage: 1 × SAMSUNG MZTL21T9 2 TB mounted at $LOCALSCRATCH (an alias for /local/scratch)
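
As a sketch, the node-local NVMe can be used as fast scratch space from within a job (the dataset path and application are placeholders):

$ srun --pty bash -i
$ cp -r <path-to-dataset> $LOCALSCRATCH/
$ ./my_app --data $LOCALSCRATCH/<dataset>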

Please find further information about NVIDIA’s Grace Hopper Superchip in NVIDIA’s documentation.

Grace Hopper nodes Slurm considerations

The Grace Hopper nodes are accessible in the dc-gh partition, which is hidden by default.

Grace Hopper software modules

The Grace Hopper nodes use their own software filesystem. The main difference from the JURECA-DC software filesystem is that the software is compiled for Arm; cross-compilation from the JURECA DC login nodes, which use x86-64 hardware, is therefore not supported. The available modules can be loaded when executing jobs on the Grace Hopper nodes. As usual, please do not block the Grace Hopper nodes for extended interactive sessions.
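
For example, from within a job on one of the nodes (a sketch; module names other than those listed in the known issue below are assumptions, and exact versions may differ):

$ srun --account <budget> --partition dc-gh -N1 --pty bash -i
$ module avail
$ module load GCC OpenMPI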

Known Issues

Thermal throttling of H100 GPUs

Added: 2024-01-11

Affects: Grace Hopper nodes

Description: The VBIOS in use on the H100 GPUs is known to incorrectly report that thermal throttling has been applied. This will be fixed in a future VBIOS update.

Status: Open.

Workaround/Suggested Action: None.

Software Stage under construction

Added: 2024-01-11

Affects: Grace Hopper nodes

Description: The software stage on the Grace Hopper nodes is still under construction. At the time of writing, the GCC compiler, OpenMPI, and the BLAS libraries BLIS, OpenBLAS, and NVPL are available. More modules may become available in the future.

Status: Open.

Workaround/Suggested Action: None.

[1]

NPS-4: “NUMA Domains per socket: 4”; the socket is divided into four NUMA domains. This configuration originates from how the CPU chip is produced: it is not manufactured as one monolithic die, but rather consists of four individual dies.