JURECA Evaluation Platform Overview
JURECA is equipped with additional nodes for evaluation and testing. To log in, please see Access.
MI200 nodes
The configuration of the JURECA DC MI200 compute nodes (2 nodes) is as follows:
CPU: AMD EPYC 7443 processor (Milan); 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 [1] configuration (details for AMD EPYC 7443 on WikiChip)
Memory: 512 GiB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
GPU: 4 × AMD MI250 GPUs, each with 128 GB of memory; the GPUs are built as Multi-Chip Modules (MCM) and therefore appear as 8 GPUs with 64 GB of memory each (see the example below).
Network: 1 × Mellanox HDR InfiniBand ConnectX-6 HCA (100 Gbit/s) (not yet final)
Details about the hardware can be found on Gigabyte’s webpage.
Details about the node topology can be found in AMD’s CDNA2 whitepaper as Figure 2b.
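Because each MI250 is a Multi-Chip Module with two Graphics Compute Dies (GCDs), the ROCm runtime enumerates eight logical GPUs per node. The following is a minimal sketch of how to inspect and select them from within a job allocation; it assumes the ROCm tools are available on the compute node (see Known Issues below), and my_hip_app as well as the shown device-ID grouping are illustrative assumptions that should be checked against the topology output:
$ srun rocm-smi --showid                       # list the 8 logical GPUs (GCDs) of a node
$ srun rocm-smi --showtopo                     # show which GCDs belong to the same MI250 package
$ ROCR_VISIBLE_DEVICES=0,1 srun ./my_hip_app   # restrict a HIP application to two GCDs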
MI200 nodes Slurm considerations
The MI200 nodes are accessible in the dc-mi200 partition, which is hidden by default.
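To get access, use the regular Slurm mechanisms, analogous to the IPU partition below; <budget> is a placeholder for your compute budget:
$ salloc --account <budget> --partition dc-mi200 -N1
$ srun hostname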
Known Issues
Building for AMD GPUs
Added: 2022-08-01
Affects: MI200 nodes
Description: Currently, the ROCm stack is not available on the JURECA DC login nodes.
Status: Open.
Workaround/Suggested Action: For the time being, please build your application directly on the AMD-GPU-equipped compute nodes. Please don’t block compute nodes unnecessarily and release them quickly after building.
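A minimal sketch of such a build session, assuming ROCm's hipcc is available on the compute nodes and using a hypothetical source file saxpy.hip (--offload-arch=gfx90a targets the MI200 series):
$ salloc --account <budget> --partition dc-mi200 -N1
$ srun --pty /bin/bash                     # interactive shell on the compute node
$ hipcc --offload-arch=gfx90a -o saxpy saxpy.hip
$ exit                                     # leave the node; also end the salloc session to release the allocation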
InfiniBand Cards
Added: 2022-08-01
Affects: MI200 nodes
Description: The InfiniBand HCAs are currently installed into non-optimal PCIe slots due to power constraints.
Status: Open.
Workaround/Suggested Action: No action necessary. During high-bandwidth transfers, latency might be a little higher than expected. A fix might be on the way.
Graphcore IPU-POD4
The IPU-POD4 consists of two parts:
- an AMD EPYC-based access server on which user applications are launched, with
CPU: AMD EPYC 7413 (Milan); 2 sockets, 24 cores per socket, SMT-2 (total: 2×24×2 = 96 threads) in NPS-4 [1] configuration (details for AMD EPYC 7413 on WikiChip)
Memory: 512 GiB DDR4-3200 RAM (of which at least 20 GB is taken by the system software stack, including the file system); 256 GB per socket; 8 memory channels per socket (2 channels per NUMA domain)
Network: 1 × Mellanox EDR InfiniBand ConnectX-5 (100 Gbit/s) to connect to other compute nodes and 1 × Mellanox 100 GigE ConnectX-5 to connect to the IPU-M2000
- a Graphcore IPU-M2000 which is connected directly to the access server with
IPUs: 4 × GC200 IPUs
Graphcore IPU-POD4 Slurm considerations
The access server is integrated into the Slurm batch system on JURECA-DC as its own partition with the name dc-ipu, which is hidden by default.
To get access, use the regular Slurm mechanisms, e.g. for an “interactive” job from which you can run several job steps interactively:
$ salloc --account <budget> --partition dc-ipu -N1
salloc: Granted job allocation 10362000
salloc: Waiting for resource configuration
salloc: Nodes jrc0860 are ready for job
Then use srun to run commands on the access server:
$ srun hostname
jrc0860.jureca
From the access server, you can use all four IPUs in the M2000. Slurm restricts access to one user at a time.
Graphcore Software
Applications can make use of the IPUs through the Graphcore SDK or one of several ML frameworks that have been extended with Graphcore plugins, such as TensorFlow or PyTorch. Please see the Graphcore documentation library for information on how to use these. Graphcore provides binary distributions of its SDK, but these do not currently support the OS used on JURECA-DC (Rocky Linux 8). However, Graphcore publishes a set of container images with the software on Docker Hub; these can be used on JURECA-DC with the Apptainer container runtime.
Apptainer creates containers from images stored as a single file in the file system. To pull an image with the Poplar SDK onto the file system, do:
$ apptainer pull poplar.sif docker://docker.io/graphcore/poplar:2.4.0
Afterwards you can run commands from inside a container based on the image like this:
$ srun apptainer run poplar.sif -- gc-info -a
Graphcore device listing:
-+- Id: [0], target: [Fabric], PCI Domain: [3]
-+- Id: [1], target: [Fabric], PCI Domain: [2]
-+- Id: [2], target: [Fabric], PCI Domain: [1]
-+- Id: [3], target: [Fabric], PCI Domain: [0]
-+- Id: [4], target: [Multi IPU]
|--- Id: [0], DNC Id: [0], PCI Domain: [3]
|--- Id: [1], DNC Id: [1], PCI Domain: [2]
-+- Id: [5], target: [Multi IPU]
|--- Id: [2], DNC Id: [0], PCI Domain: [1]
|--- Id: [3], DNC Id: [1], PCI Domain: [0]
-+- Id: [6], target: [Multi IPU]
|--- Id: [0], DNC Id: [0], PCI Domain: [3]
|--- Id: [1], DNC Id: [1], PCI Domain: [2]
|--- Id: [2], DNC Id: [2], PCI Domain: [1]
|--- Id: [3], DNC Id: [3], PCI Domain: [0]
Note how Slurm’s srun and the apptainer command are composed. This assumes that you have an active job allocation from a previous salloc.
To run one of the Graphcore tutorial applications:
$ git clone -b sdk-release-2.4 https://github.com/graphcore/tutorials.git
$ apptainer build tensorflow.sif docker://docker.io/graphcore/tensorflow:1
$ srun apptainer run tensorflow.sif -- python3 tutorials/simple_applications/tensorflow/mnist/mnist.py
2022-04-08 13:03:38.627641: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:47] Poplar version: 2.4.0 (10a96ee536) Poplar package: 969064e2df
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/compat/v2_compat.py:68: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/losses/losses_impl.py:121: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
2022-04-08 13:03:41.144173: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2649785000 Hz
2022-04-08 13:03:41.458234: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1610] Device /device:IPU:0 attached to IPU: 0
2022-04-08 13:03:44.865634: I tensorflow/compiler/jit/xla_compilation_cache.cc:251] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
Compiling module cluster_18183311839169509267__.365:
[##################################################] 100% Compilation Finished [Elapsed: 00:00:08.5]
Loss: 1.5828259517669678
Time: 11.811853408813477
Loss: 1.5447135463078816
Time: 2.1141273975372314
Loss: 1.5387713934580485
Time: 2.111985206604004
Loss: 1.5387063130696614
Time: 2.105924129486084
Loss: 1.5317738628387452
Time: 2.110886812210083
Program ran successfully
In case software is missing from one of the images, it can be installed by building a new image based on it. For example, the Poplar SDK image comes without a compiler, meaning it can be used to run software compiled against the Poplar SDK, but not to compile that software itself. An image with the Poplar SDK and a compatible compiler can be created from the following Dockerfile:
FROM docker.io/graphcore/poplar:2.4.0
RUN apt-get update && \
    apt-get install -y build-essential && \
    apt-get -y clean && \
    rm -rf /var/lib/apt/lists/*
Either process the Dockerfile with Docker and then upload the resulting image to a registry from which you can apptainer pull it to JURECA-DC, or use our container build system.
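A minimal sketch of the Docker route; the registry, user, and image names are placeholders:
$ docker build -t registry.example.com/<user>/poplar-gcc:2.4.0 .
$ docker push registry.example.com/<user>/poplar-gcc:2.4.0
Then, on JURECA-DC:
$ apptainer pull poplar-gcc.sif docker://registry.example.com/<user>/poplar-gcc:2.4.0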
NVIDIA Arm HPC Dev Kit
We deploy 2 NVIDIA Arm HPC Dev Kits, each consisting of
an Ampere Altra Q80-30 CPU with 80 cores and 512 GB memory,
2 NVIDIA A100-PCIe-40-GB GPUs,
2 NVIDIA Mellanox BlueField2 DPUs (200 GbE)
Please find details in NVIDIA’s documentation about the Dev Kit.
Building for Arm
Currently, cross-compilation from the JURECA DC login nodes is not officially supported. The easiest approach is to build directly on the Arm compute nodes themselves; a sketch follows below.
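A minimal sketch of a native build session, assuming an interactive allocation on one of the Arm nodes; the partition name and the source file app.c are placeholders:
$ salloc --account <budget> --partition <arm-partition> -N1
$ srun --pty /bin/bash
$ uname -m                        # should report aarch64
$ gcc -O3 -mcpu=native -o app app.c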