Container Runtime on JUDAC

What Containers Provide

Containers provide the ability to build, ship, and run applications. They typically use Linux kernel features (e.g. namespaces) to isolate containers from each other and from the underlying operating system, and they are more lightweight than virtual machines.

Several technologies are available to run containers, for example Docker, Shifter, Singularity, and Apptainer. On top of these, container orchestration middlewares such as Kubernetes or OpenShift have evolved. For shipping applications, these technologies typically use so-called images. An image contains a file system with a minimal operating system, the application, and some metadata. A well-known standard for containers (and especially images) is OCI. Images are built from recipes. While all container technologies point out their differences to Docker, its Dockerfile recipe format is widely known and supported by most of them. In most cases, providing a Dockerfile is therefore sufficient to build a proper image with the local container technology; the interactive creation of container images remains a fallback.
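
As an illustration, a minimal Dockerfile could look like the following sketch; the base image and package are just examples:

$ cat Dockerfile
FROM rockylinux:8
RUN yum -y install python3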

Container technologies evolved in the cloud computing field to help developers and operators easily test and run (web) services and databases, but they are increasingly making their way into HPC. Encapsulating an application into a ready-to-use container image can be easier than providing all dependencies for the application via e.g. EasyBuild or operating system packages.

Getting Access

To be granted access to the container runtime, you have to go to our user portal JuDoor.

On the webpage please proceed via

  1. Software

  2. Request access to restricted software

  3. Access to other restricted software

  4. Container Runtime Engine

  5. Get Access

  6. Accept the Service Level Description.

This will add your user account to the container group. Due to caching effects this might take some hours. Without that group membership, you cannot start containers!

Apptainer on JUDAC

Formerly, we provided Singularity on the systems. We have replaced Singularity with Apptainer, a fork maintained by the Linux Foundation.

We provide an up-to-date version of Apptainer. It is available as soon as access to the container group is granted; it is in the default PATH and does not require a module.

Backward compatibility with Singularity

Apptainer has put effort into being backward compatible with Singularity:

  • singularity will symlink to the apptainer binary.

  • The old SINGULARITY_ environment variables are respected unless there is a conflicting variable with the APPTAINER_ prefix; in that case, the APPTAINER_ variable is used (see the example after this list).

  • Apptainer will honor Singularity configuration details. See the Apptainer documentation for more details.
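
For example, the precedence rule for environment variables can be seen with the cache directory variables; the paths below are only placeholders:

$ # Old-style variable, still respected by Apptainer:
$ export SINGULARITY_CACHEDIR=/tmp/my-cache
$ # If both prefixes are set, the APPTAINER_ variable takes precedence:
$ export APPTAINER_CACHEDIR=/tmp/my-other-cache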

Apptainer Images

Building images with root privileges is not possible on JUDAC. See the Container Build System and the Rootless builds sections below for alternatives.

If you want to download images from Docker Hub or another registry, it might be helpful to override some Apptainer environment variables, because otherwise you might run into your HOME quota or fill up /tmp.

$ export APPTAINER_CACHEDIR=$(mktemp -d -p <WRITABLE_DIRECTORY>)
$ export APPTAINER_TMPDIR=$(mktemp -d -p <WRITABLE_DIRECTORY>)
$ apptainer pull centos.sif docker://centos:7
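
If the default cache below your HOME directory has already grown too large, it can be inspected and cleared with Apptainer's cache subcommands:

$ apptainer cache list
$ apptainer cache clean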

Pre-built container images can be obtained via the Singularity Registry, which we provide as a module on our systems. Run module load shpc and follow the User Guide to access them.
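
A typical shpc session could look like the following sketch; the container name is only an example, please consult the User Guide for the actual registry contents:

$ module load shpc
$ shpc show             # list available container recipes
$ shpc install python   # hypothetical example: install a container as a module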

Launching Containers via Slurm

To Slurm, Apptainer is just another executable and can be called as such.

The following snippet would launch an interactive shell into an Apptainer container running on a GPU compute node.

$ srun -N1 -p <partition> --gres gpu:1 --pty apptainer shell --nv /p/fastdata/singularity/centos.sif

where <partition> is one of the GPU partitions available on JUDAC.
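
For non-interactive jobs, a similar call can be placed in a batch script; the partition, budget, time limit, and Python script below are placeholders:

$ cat container_job.sbatch
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --partition=<partition>
#SBATCH --account=<budget>
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00

srun apptainer exec --nv /p/fastdata/singularity/centos.sif python3 my_script.py
$ sbatch container_job.sbatch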

Interfacing with the System Environment

HPC workloads have different requirements regarding interaction with their system environment than the microservice-type workloads that have been particularly successful when deployed as containers. Where container-friendly workloads are content with access to some CPU for computation, a few sockets for IP communication, and maybe a file system or an object store to keep state, HPC workloads often have to

  • be optimized for a specific kind of compute hardware (instruction set),

  • interface with specialized compute and communications hardware (e.g. GPGPUs and InfiniBand), and

  • interface with the software used for job orchestration.

These needs can be at odds with the desire to build container images that are portable across a large number of systems and over a long period of time. JSC, like other computing centres, expends considerable effort to provide a central software stack that is both up-to-date and tuned to make efficient use of our hardware resources. This section tries to provide some guidance on building container images that integrate well with typical HPC hardware and software environments.

The two Models

The Apptainer documentation has chapters on the interplay between containers and MPI and containers and GPGPUs. The former makes a useful distinction between two models:

  • Containerized Applications (the Apptainer documentation calls this the Bind Model), in which the container image contains the application itself, but otherwise re-uses large parts of the software stack provided by the HPC site.

  • User-Defined Software Stack (the Apptainer documentation calls this the Hybrid Model), in which the container image contains both the application and an underlying software stack.

We refer to the Apptainer documentation for a more detailed description of the two scenarios and a discussion of their respective pros and cons. The following sections provide hints for building container images that follow one of the models for use on the JSC systems:

Containerized Applications

On HPC systems, the HPC software stack provided by JSC is installed on a shared file system, which is bind mounted into Apptainer containers by the default configuration as /p/software/....

From the collection of all available packages, users select software via a module system based on Lmod. While Lmod itself is installed into the shared file system, it uses some dependencies (lua and a few lua libraries) that are installed as base operating system packages, so those would have to be available in the container image as well.

Lmod implements loading of software modules via environment variables. Under some circumstances, Apptainer modifies the environment when entering a container. Use apptainer inspect -e to make sure a particular container image does not have cleanup hooks that prevent Lmod from working. Alternatively, you can start from a fresh environment (apptainer run -e) and then initialize Lmod from inside the container.
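
For example, assuming the Lmod initialization script is reachable inside the container via the bind-mounted software file system (the paths and module name below are placeholders):

$ apptainer inspect -e mycontainer.sif     # check for environment cleanup hooks
$ apptainer shell -e mycontainer.sif       # enter the container with a clean environment
Apptainer> source /p/software/<system>/lmod/lmod/init/bash   # placeholder path to the Lmod init script
Apptainer> module load <some-module>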

The HPC software stack installed by JSC is not self-sufficient, but instead relies on several libraries installed as operating system packages. To see which libraries a particular configuration of software modules from the stack depends on, load the modules (outside a container) with module load and run:

$ echo -n $LD_LIBRARY_PATH | xargs -d: -n1 -I{} find {} -name "*.so*" | xargs ldd | awk '$3 ~ /usr/ { print $1 " " $2 " " $3 }' | sort -u

These libraries will either have to be installed into the container image or bind mounted from the host file system. For NVIDIA CUDA driver libraries in particular, Apptainer has an option --nv which finds these libraries on the host and automatically bind mounts them into the container.
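
As a sketch, a missing host library found this way can either be installed into the image or bind mounted at run time; the library and application paths below are hypothetical:

$ # Bind mount a single host library into the container at the same path:
$ apptainer exec --bind /usr/lib64/libexample.so.1 centos.sif ldd /opt/app/my_app
$ # The NVIDIA CUDA driver libraries are handled automatically by --nv:
$ apptainer exec --nv centos.sif nvidia-smi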

User-Defined Software Stack

While a user-defined software stack is typically more self-sufficient than a containerized application, it still has to interact with the system environment in several places, which influences how the software stack deployed as a container image has to be constructed.

Please see the documentation of our HPC systems for a reference of their hardware characteristics.

Another useful reference for how to build your own software stack in a way that makes efficient use of the hardware hosted by JSC is the EasyConfigs we use to build our own software stack. These can be found at https://github.com/easybuilders/JSC.

Apart from using appropriate compiler flags to optimize for a given CPU architecture, your software stack should be optimized for communication on the HPC fabric (InfiniBand on the current JSC systems). Our own software stack uses UCX as the communication framework, with a suite of lower level libraries to talk to the IB hardware.

To correctly set up and coordinate parallel computations, the parallel programming framework (typically MPI) has to exchange some information with the HPC resource manager (Slurm on JSC systems). This exchange can use one of several common interfaces. Nowadays, the PMIx interface typically fills this role, so your software stack should include it and both Slurm and the parallel programming framework should be instructed to use it. Our Slurm supports a few different process management interfaces, see srun --mpi=list.
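
As a sketch, launching an MPI application from a container image with PMIx as the interface between Slurm and the containerized MPI could look like this (the image and application names are placeholders, and pmix has to appear in the list of supported interfaces):

$ srun --mpi=list
$ srun -N2 --ntasks-per-node=4 --mpi=pmix \
      apptainer exec my_mpi_image.sif /opt/app/my_mpi_app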

Besides compile-time settings, some aspects of the parallel computing and communication frameworks have to be tuned to the system via run-time configuration. In our own software stack this is achieved using Lmod settings modules called UCX-settings and MPI-settings. To make efficient use of the hardware you will want to employ a similar configuration in your own stack.

For the NVIDIA CUDA driver libraries, there is a strong dependency between their version and the operating system kernel driver module. Thus, the driver library is commonly installed on the host system and bind mounted from there into the container, even in the user-defined software stack model. Apptainer has the --nv command line switch to automate this.

Container Build System

Warning

The Container Build System provided by JSC is deprecated. While it will stay online as-is for the foreseeable future, there will be no development efforts to fix existing issues or add new features. Please use the native container building functionality provided by Apptainer instead.

JSC provides a build system that can build images on behalf of the user, based on a Docker- or Singularity/Apptainer-file. Having a build system available is necessary because

  • Building images requires administrator privileges which regular users do not have on JSC’s clusters

  • Users might also not have the ability to build images on their local workstation

Building of images with JSC’s Build System takes place on a dedicated system that is external to the clusters. The dedicated system has different characteristics compared to the HPC machines (different CPU type, no GPUs); created images might therefore not be fully optimized for the targeted system.

Building Container Images via CLI

We provide a Python-based command line interface for the Container Build System. It is available via an EasyBuild module.

$ module load GCC Singularity-Tools

Afterwards you have the tool sib available. An additional configuration step is necessary to specify the API endpoint of JSC’s build system.

$ mkdir -p ~/.config/sib
$ cat > ~/.config/sib/settings.ini <<'EOF'
[config]
url_prefix=https://sbuild-hps.fz-juelich.de/
EOF

Warning

The CLI stores a file containing the list of built images at ~/.config/sib/data.json. Access to this file is not thread-safe, which may lead to container builds getting lost when multiple instances of sib are started in parallel.

An example of a full workflow:

$ cat Dockerfile-lcgm
FROM centos:7
RUN yum -y install epel-release
RUN yum -y install lcgdm lfc gfal2 gfal2-plugin-lfc
$ sib upload ./Dockerfile-lcgm lcgm
Recipe got successfully imported into Database
$ sib build --recipe-name lcgm --blocking
Build of recipe will be executed
Building ...
Build succeeded
$ sib download --recipe-name lcgm
Download succeeded

The --recipe-name is optional. If it is not provided, the client will assume that the last modified recipe is the target. With that, the workflow above can be simplified by omitting these values:

$ cat Dockerfile-lcgm
FROM centos:7
RUN yum -y install epel-release
RUN yum -y install lcgdm lfc gfal2 gfal2-plugin-lfc
$ sib upload ./Dockerfile-lcgm lcgm
Recipe got successfully imported into Database
$ sib build  --blocking
Build of recipe will be executed
Building ...
Build succeeded
$ sib download
Download succeeded

To build multiple containers in parallel without fearing a race condition in the client, you can omit the --blocking flag on the build. You can see an example of 2 parallel builds in the following:

$ cat Dockerfile-lcgm
FROM centos:7
RUN yum -y install epel-release
RUN yum -y install lcgdm lfc gfal2 gfal2-plugin-lfc
$ cat Dockerfile-httpd-rocky8
FROM rockylinux:8
RUN yum -y install httpd
$ sib upload ./Dockerfile-lcgm lcgm
Recipe got successfully imported into Database
$ sib upload ./Dockerfile-httpd-rocky8 httpd-rocky8
Recipe got successfully imported into Database
$ sib build --recipe-name lcgm
Build of recipe will be executed
$ sib build --recipe-name httpd-rocky8
Build of recipe will be executed
$ sib list
Container Name    Last Modified               Buildstatus
----------------  --------------------------  -------------
lcgm              2021-01-01T16:00:31.415926  BUILDING
httpd-rocky8      2021-01-01T16:00:31.415926  BUILDING
# Wait a bit of time
$ sib list
Container Name    Last Modified               Buildstatus
----------------  --------------------------  -------------
lcgm              2021-01-01T16:02:31.415926  SUCCESS
httpd-rocky8      2021-01-01T16:02:31.415926  SUCCESS
$ sib download --recipe-name lcgm
Download succeeded
$ sib download --recipe-name httpd-rocky8
Download succeeded

Adding additional files to the build process is supported as well: instead of uploading single files, you specify a directory, which is then compressed. The directory must contain a Dockerfile. It is also possible to provide that directory directly as a .tar.gz archive. Here is an example in which a given TensorFlow image is extended with a specific file that needs to be injected:

$ cat tensorflow_20.08-tf1-py3/add_mofed_version.sh
#!/bin/bash
# example usage: add_mofed_version.sh 4.5-1.0.1.0
export MOFED_VERSION=$1

DIR=$(dirname $(readlink -f ${BASH_SOURCE[0]}))

mkdir -p $DIR/${MOFED_VERSION%.*}
pushd $DIR/${MOFED_VERSION%.*} >/dev/null
curl -Ls http://www.mellanox.com/downloads/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu18.04-$(uname -m).tgz | \
    tar zx --strip-components=3 --wildcards \
        '*/DEBS/libibverbs1_51*' \
        '*/DEBS/libibverbs-dev*' \
        '*/DEBS/ibverbs-utils*' \
        '*/DEBS/ibverbs-providers*'
popd >/dev/null

$ cat tensorflow_20.08-tf1-py3/Dockerfile
FROM nvcr.io/nvidia/tensorflow:20.08-tf1-py3
COPY add_mofed_version.sh /opt/mellanox/DEBS/add_mofed_version.sh
RUN /opt/mellanox/DEBS/add_mofed_version.sh 5.1-0.6.6.0

$ sib upload tensorflow_20.08-tf1-py3 tensorflow_20.08-tf1-py3
Recipe got successfully imported into Database
$ sib build --blocking --recipe-name tensorflow_20.08-tf1-py3
Build of recipe will be executed
Building...
Build succeeded
$ sib download --recipe-name tensorflow_20.08-tf1-py3

To debug failures that happened during building, it is possible to obtain the Apptainer recipe that has been used as well as the build logs.

The Apptainer recipe can be obtained with sib content [--recipe-name your_recipe].

The build logs can be obtained with sib logs [--recipe-name your_recipe].

Container Build System REST API

You can download a specification of the full API of the Container Build System as an OpenAPI description here. The build system is intended to be used via the provided CLI client, but the REST API can also be used directly. Note that you need to save the UUIDs of recipes and containers as soon as you obtain them, as there is no way to retrieve them afterwards. There is no user-based authentication; authentication is done on a per-object basis, with the UUID acting as the secret.

Apptainer image building

Converting Dockerfiles to Apptainer recipes

Dockerfiles can be converted to Apptainer recipes with the spython module, which is included in the Apptainer-Tools module. While Dockerfiles and Apptainer recipes are not 100% compatible, the conversion should provide a good starting point, which may require some manual adjustments.

To generate an Apptainer recipe from a Dockerfile, use the following command:

$ spython recipe <Dockerfile> > <recipe.def>

If you want the converted output of the Dockerfile printed to the console, omit the output path and use the following command:

$ spython recipe <Dockerfile>
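
As a rough illustration, the converted recipe for the Dockerfile-lcgm example above would look approximately like the following (the exact output of spython may differ):

Bootstrap: docker
From: centos:7

%post
yum -y install epel-release
yum -y install lcgdm lfc gfal2 gfal2-plugin-lfc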

Rootless builds

Apptainer images can be built with the apptainer CLI tool. Recent versions of Apptainer support rootless builds through fakeroot, which means that images can be built directly on the HPC systems; however, some limitations apply compared to building with root privileges. Please refer to the Apptainer documentation on fakeroot for more information.

To build a container image via apptainer, use the following command:

$ apptainer build <container.sif> <recipe.def>
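
Depending on the Apptainer version and configuration, the rootless build may have to be requested explicitly with the --fakeroot flag; this is only a sketch:

$ apptainer build --fakeroot <container.sif> <recipe.def>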

Rootful builds

For rootful builds, the caveats of fakeroot do not apply. These builds are not possible on the HPC machines, but they can be done on a local machine where you have privileged access. The resulting image file can then be uploaded to the HPC machines and used there. Rootful builds use the same CLI interface as rootless builds, but must be run as the root user.