AI Workloads on JUWELS
As a GPU-focussed system, JUWELS is well suited to running current AI workflows. Our Simulation and Data Lab for Applied Machine Learning provides specialised documentation for getting started with these workflows.
Note
While some of these instructions can be generalised to other systems, most support is provided for JUWELS Booster, which is the most suitable of our current full-scale systems for AI workloads. This documentation will be expanded to cover JUPITER as it becomes available.
For getting started with AI workloads on our systems, we provide a series of pages:
Working with JSC Filesystems and AI
Basic guidance for working with AI workloads on our filesystems can be found here.
Installing Python software for AI
You can find guidance here on using our Python virtual environment template. The template makes it easy to create reproducible Python virtual environments that reuse as many of the software modules provided on the systems as possible for optimum performance, while still allowing you to install the specific Python packages or versions you need. It is set up for AI workloads by default, but is useful for any user of our systems.
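The template automates steps along the following lines (a minimal sketch of the idea; the module names are site-specific placeholders, and the template handles the details for you):

```shell
# Load the site-provided software stack first, so the venv can reuse its
# optimised packages (module names below are placeholders, not real ones):
# module load Stages/2025 GCC Python

# Create a venv that can see the module-provided packages
VENV_DIR="$(mktemp -d)/ai_env"
python3 -m venv --system-site-packages "$VENV_DIR"
source "$VENV_DIR/bin/activate"

# Project-specific extras are then installed on top, pinned for reproducibility:
# pip install my-extra-package==1.2.3
```

The `--system-site-packages` flag is what lets the venv fall back to the optimised module-provided packages instead of reinstalling everything from PyPI.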
Integrating with VSCode
Integration with VSCode for remote development and debugging is detailed here.
Git on HPC
The basics of working with Git projects, aimed particularly at those who may have less experience with academic High-Performance Computing (HPC) clusters, can be found here.
Quickstart guide for PyTorch Lightning and Hydra
To ease organising AI projects and configuring their environments, we provide a guide for using PyTorch Lightning and Hydra alongside the HPC Python environment template mentioned above, making it as painless as possible to get started with a deep learning workflow. This guide can be found here.
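To give a flavour of the approach, Hydra lets you keep all experiment settings in composable YAML files rather than hard-coding them. A sketch of what such a config might look like (the keys and values below are illustrative, not the template's actual schema):

```yaml
# config/config.yaml -- illustrative Hydra config (keys are examples only)
defaults:
  - _self_

trainer:
  max_epochs: 10
  devices: 4          # GPUs per node
  num_nodes: 1
  precision: "16-mixed"

model:
  lr: 3.0e-4
  hidden_dim: 256

data:
  batch_size: 128
  num_workers: 8
```

Any of these values can then be overridden from the command line (e.g. `trainer.max_epochs=50`) without editing code, which keeps runs reproducible and easy to compare.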
Handling Datasets with Many Files
HPC filesystems are traditionally tuned for large reads and writes to a small number of files. AI workflows often exhibit the opposite pattern, with extremely large numbers of small files. Handled naively, this can lead to significant performance degradation. Two ways of adapting datasets to perform better on our filesystems are detailed here.
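One common remedy (the linked page describes the approaches supported on our systems) is to pack many small samples into a few large archive "shards", so training reads a handful of large files sequentially. A minimal sketch using only the standard library, with illustrative file names and sizes:

```python
import io
import tarfile

def pack_shard(samples, shard_path):
    """Write an iterable of (name, bytes) samples into one tar archive."""
    with tarfile.open(shard_path, "w") as tar:
        for name, data in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

# 1000 tiny samples become a single file on the filesystem; training code
# then streams each shard with one large sequential read instead of
# opening thousands of tiny files.
samples = [(f"sample_{i}.txt", f"payload {i}".encode()) for i in range(1000)]
pack_shard(samples, "shard_000.tar")
```

Dataset libraries built on this idea (e.g. WebDataset-style tar sharding) follow the same pattern at scale.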
VLLM installation
vLLM is being used by a growing number of practitioners, but does not (at the time of writing) have an entirely straightforward installation on some systems.
On systems with x86 CPUs (all except JUPITER), pre-built binaries exist, so vLLM can simply be installed with uv pip install vllm.
On systems with ARM-based CPUs (such as JUPITER), pre-built binaries are not available.
Following this guide will help get you started with vLLM on ARM.
Scraping Workflows
If you need to run a “scraping” workflow, please contact your project mentor or SC-Support at sc@fz-juelich.de for assistance.
Blablador
Applying advanced Large Language Models (LLMs) requires significant compute resources and expertise that can be out of reach for many academic researchers. Helmholtz Blablador has been developed to give researchers access to scientific LLMs and to make such models broadly available. Pretrained models can be made accessible via a simple API, relieving academics of the burden of managing their own servers.
Blablador has several functions. It allows users to access a range of scientific LLMs made available by the Helmholtz AI community. Researchers can also add their own pretrained models to the central hub. Other scientists can then easily query the catalogue via the web or via the popular OpenAI API, for example to add these LLMs as functionality in other tools such as programming IDEs.
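Because Blablador exposes an OpenAI-compatible API, querying it requires only a standard HTTP request. A minimal sketch using just the standard library; the base URL and model alias below are assumptions for illustration, so check the Blablador documentation for the current values:

```python
import json
import urllib.request

# Assumed endpoint and model alias -- verify against the Blablador docs.
API_BASE = "https://api.helmholtz-blablador.fz-juelich.de/v1"

def build_chat_request(api_key: str, prompt: str, model: str = "alias-fast"):
    """Build an OpenAI-style chat completion request for Blablador."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To send the request (requires a valid API key):
# response = urllib.request.urlopen(build_chat_request(my_key, "Hello"))
# reply = json.load(response)["choices"][0]["message"]["content"]
```

The same endpoint can also be used with the official OpenAI client libraries by pointing their base URL at Blablador.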
Follow the instructions here to gain access to the Blablador API, allowing you to upload your own models.
To add Blablador functionality to Visual Studio Code or Sublime Text, follow the instructions here.
Blablador can also be made available through JAN.AI, a private and secure AI assistant. Instructions can be found here .
It is also possible to use Blablador to make queries about your own documents. This can be done through both LangChain (instructions here) and GPT4All (instructions here), the latter of which can also be used to query PDF files securely and privately, since the documents do not leave your computer.
Application-Specific AI documentation
We are also involved in efforts to apply AI in more domain- and application-specific areas, some of which are listed here.
AI4HPC is an open-source library to train AI models with CFD datasets on HPC systems.
AI4HPC consists of data manipulation routines tuned to handle CFD datasets, ML models useful for CFD analyses, and optimisations for HPC systems. AI4HPC also includes a benchmarking suite to test the limits of CPU- and GPU-based systems towards exascale, and a HyperParameter Optimization (HPO) suite for scalable HPO tasks.
itwinai is a platform intended to support general-purpose Machine Learning workflows for Digital Twin use cases, developed in the interTwin project. The goal of this platform is to provide ML researchers with an easy-to-use endpoint to manage general-purpose ML workflows, with limited engineering overhead, while providing state-of-the-art MLOps best practices. The platform is focussed on the MLOps step of a workflow, rather than pre-processing, authentication, workflow execution, etc.