Reference

xOPS Repository Structure

xOPS follows Ansible best practices with a clear separation of concerns:

  • roles/: 150+ reusable Ansible roles (the building blocks of configuration: one role per service or subsystem).

  • group_vars/: Variables scoped per system or host group (e.g., group_vars/juwels/), allowing per-cluster customization.

  • host_vars/: Variables specific to individual nodes, for host-level tuning.

  • files/: Static configuration files deployed as-is: SSH keys, certificates, pre-built configs.

  • templates/: Jinja2 templates rendered at deploy time (e.g., SLURM configs, kickstart files).

  • vault/: Encrypted secrets (passwords, API tokens) managed with Ansible Vault and GPG.

  • juwels/, jureca/, ...: Per-system inventory directories defining which hosts belong to which groups.

  • *.yml (root): Top-level playbooks, one per system, that orchestrate which roles run on which hosts.
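
As a rough illustration of how these pieces fit together, a top-level playbook for one system might look like the sketch below. The file name, the group names (juwels_login, juwels_compute), and the role names are hypothetical and are not taken from the xOPS repository.

```yaml
# juwels.yml (hypothetical file name): top-level playbook for one system.
# Group and role names are illustrative assumptions, not actual xOPS names.
- name: Configure login nodes
  hosts: juwels_login
  become: true
  roles:
    - common        # baseline settings shared by all nodes
    - sshd          # one role per service or subsystem

- name: Configure compute nodes
  hosts: juwels_compute
  become: true
  roles:
    - common
    - slurm_client
```

Under these assumptions, a dry run against the matching inventory directory could be started with: ansible-playbook -i juwels/ juwels.yml --check --diff. Ansible loads variables from group_vars/ and host_vars/ automatically, and a value set in host_vars/ takes precedence over the same variable set in group_vars/.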

How Changes Are Applied

Every change goes through a five-step deployment workflow: Edit, Merge Request, CI, Merge, Deploy.
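
The CI step of this workflow could be sketched as a GitLab CI job that lints the repository and syntax-checks a playbook on every merge request. The job name, container image, inventory path, and playbook name below are assumptions for illustration; the actual xOPS pipeline is not described here.

```yaml
# .gitlab-ci.yml (sketch): job name, image, and paths are assumptions.
stages:
  - lint

lint_playbooks:
  stage: lint
  image: python:3.12
  script:
    - pip install ansible ansible-lint
    # Static checks on roles and playbooks
    - ansible-lint
    # Parse the playbook without executing any tasks
    - ansible-playbook --syntax-check -i juwels/ juwels.yml
  rules:
    # Run only for merge request pipelines
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
```

Running the checks on the merge request, before the merge, means reviewers see lint and syntax failures alongside the diff rather than after deployment.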

Technology Stack

  • Configuration Management: Ansible (agentless, SSH-based automation)

  • Operating Systems: Red Hat Enterprise Linux, Rocky Linux

  • Job Scheduling: SLURM, UNICORE

  • Parallel Filesystems: IBM Spectrum Scale (GPFS), Ceph, NFS

  • Interconnect: InfiniBand (Mellanox/NVIDIA OFED), Ethernet

  • Accelerators: NVIDIA and AMD GPUs (driver and runtime management)

  • Monitoring: Prometheus, Grafana, Loki, Alertmanager

  • Containers: Apptainer (Singularity), Podman, Kubernetes

  • High Availability: Pacemaker/Corosync, HAProxy, Keepalived

  • Secrets Management: Ansible Vault with GPG encryption

  • Source Control & CI/CD: GitLab, GitLab CI/CD, Ansible AWX

Glossary

  • Ansible: An open-source IT automation engine that manages configuration, deployment, and orchestration over SSH; no agent software is required on managed nodes.

  • Playbook: A YAML file that declares what should be configured on which hosts. Think of it as a recipe that Ansible follows step by step.

  • Role: A self-contained, reusable unit of automation (e.g., “install and configure SLURM”). Roles are the building blocks assembled by playbooks.

  • Inventory: A listing of all servers (nodes) organised into groups. Each HPC system has its own inventory.

  • Configuration as Code: The practice of managing IT infrastructure configuration through machine-readable definition files rather than manual configuration, enabling version control, review, and repeatable deployments.

  • HPC: High-Performance Computing, the use of supercomputers and parallel processing to solve large-scale computational problems in science, engineering, and AI.

  • SLURM: Originally an acronym for Simple Linux Utility for Resource Management, now known as the Slurm Workload Manager. A widely used open-source job scheduler that allocates compute resources to user workloads on HPC clusters.

  • GPFS: General Parallel File System (IBM Spectrum Scale), a high-performance clustered filesystem designed for large-scale, data-intensive workloads.
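
To make the Inventory entry concrete, the sketch below shows what a small YAML inventory for one system could look like. The file name, host names, and subgroup names are hypothetical; only the idea of one inventory per system, organised into groups, comes from the text above.

```yaml
# juwels/hosts.yml (hypothetical): YAML inventory for one system.
# Host names and subgroup names are illustrative assumptions.
all:
  children:
    juwels:                      # system-wide group
      children:
        juwels_login:
          hosts:
            login01.example.org:
        juwels_compute:
          hosts:
            # Range pattern expands to node001 through node004
            node[001:004].example.org:
```

With a layout like this, a playbook can target a whole system (juwels) or a single subgroup (juwels_compute), and the groups each host belongs to can be listed with: ansible-inventory -i juwels/ --graph.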