The xOPS Project

xOPS is an Ansible-based Configuration-as-Code (CaC) repository that defines and enforces the configuration of the supercomputers, cloud platforms, and storage systems operated by the High-Performance Computing, Cloud and Data Systems and Services (HPCCDSS) division of the Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich GmbH.

xOPS codifies every aspect of system setup (e.g. user accounts, network interfaces, job schedulers, storage mounts, monitoring stacks, security policies) into repeatable, auditable automation scripts called playbooks and roles. Changes are tracked in Git, reviewed via merge requests, and applied consistently across thousands of nodes.
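As a rough illustration of what a playbook and its roles look like, the sketch below applies three roles to one inventory group. The group name `compute_nodes`, the role names, and the variable are hypothetical placeholders, not names taken from the xOPS repository.

```yaml
---
# Hypothetical playbook: group, role, and variable names are illustrative only.
- name: Configure compute nodes
  hosts: compute_nodes            # inventory group (assumed name)
  become: true                    # run tasks with elevated privileges
  vars:
    scheduler_partition: batch    # example variable consumed by a role
  roles:
    - users        # user accounts
    - network      # network interfaces
    - scheduler    # job scheduler configuration
```

A playbook of this kind can be previewed with `ansible-playbook --check --diff` before it is applied, so that reviewers see the pending changes without modifying any node.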

xOPS keeps large-scale infrastructure reliable and predictable while reducing operational risk. It provides a single source of truth for platform configuration, speeds up the rollout of new systems and updates, and helps teams recover quickly from incidents by recreating a previously known working state.

[Figure: JSC supercomputer hall, Jülich Supercomputing Centre]

Why Configuration as Code (CaC)?

CaC generally refers to separating configuration settings from the code that consumes them. Ideally, the configuration data is stored in source control, where it can be applied as-is or adjusted to match different environments. See the Ansible documentation for details.
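To make the separation concrete, here is a minimal sketch of an Ansible YAML inventory in which two environments share the same roles but receive different settings. All host names, group names, and values are assumptions chosen for illustration.

```yaml
---
# Hypothetical YAML inventory: host names, group names, and values are illustrative only.
all:
  children:
    test:
      hosts:
        test-node01:
      vars:
        ntp_servers:
          - ntp.test.example.org
        monitoring_enabled: false
    production:
      hosts:
        prod-node001:
        prod-node002:
      vars:
        ntp_servers:
          - ntp1.example.org
          - ntp2.example.org
        monitoring_enabled: true
```

The role logic stays identical for both groups. Only the variables differ, so adapting to a new environment means editing data rather than code.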

Auditability

Every change is committed to Git with an author, timestamp, and review trail. Auditors and compliance officers can trace who changed what, when, and why.

Consistency

Identical configurations are applied to hundreds of nodes simultaneously, eliminating “snowflake” servers and reducing human error.

Reproducibility

A new system can be bootstrapped from scratch, or an existing one rebuilt after a disaster, by running the relevant playbook.

Security

Secrets are encrypted with Ansible Vault. SSH keys, certificates, and access policies are managed centrally and rotated systematically.
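One common way to combine Ansible Vault with reviewable configuration is to keep a plain-text variables file that points at vaulted values. The sketch below assumes a hypothetical `group_vars/all/` layout and made-up variable names, and it may differ from how xOPS organizes its secrets.

```yaml
---
# Hypothetical group_vars/all/vars.yml: plain text, safe to read in merge requests.
# File layout and variable names are illustrative only.
ldap_bind_user: svc-ansible
ldap_bind_password: "{{ vault_ldap_bind_password }}"   # resolved from the encrypted file

# The companion file group_vars/all/vault.yml would define
# vault_ldap_bind_password and be encrypted as a whole with `ansible-vault encrypt`.
```

With this indirection, reviewers can see which secrets a role uses without seeing their values. The vault password itself can be changed with `ansible-vault rekey`.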

At a Glance

The following indicators give a quick snapshot of the current xOPS scope and the scale of the systems it manages.

| Metric | Value |
|---|---|
| HPC Systems | 6+ |
| Managed Nodes | 6000+ |
| Ansible Roles | 150+ |
| Playbooks | 36 |
| Inventory Groups | 40+ |

High-Level Architecture

The diagram below shows how the Ansible control node applies configuration to the various HPC clusters and their subsystems.

[Figure: High-level architecture diagram]

Managed Supercomputer Systems

xOPS manages the playbooks and inventory definitions for each of the following JSC production systems:

| System | Full Name | Description | Playbook |
|---|---|---|---|
| JUPITER | Joint Undertaking Pioneer for Innovative and Transformative Exascale Research | Europe’s first exascale-class supercomputer. | jupiter.yml |
| JUWELS | Jülich Wizard for European Leadership Science | Flagship hybrid CPU/GPU cluster with booster module. | juwels.yml |
| JURECA-DC | Jülich Research on Exascale Cluster Architectures | Data-centric system with 768+ compute nodes including AI accelerator prototypes. | jurecadc.yml |
| JSC Cloud | JSC Cloud Infrastructure | Cloud platform providing virtual machines and services alongside the HPC systems. | servers.yml |
| JUST | Jülich Storage Cluster | Central GPFS-based storage cluster for all HPC systems. | servers.yml |
| JUSUF | Jülich Support for Fenix | GPU-accelerated system for interactive and batch workloads. | jusuf.yml |
| JUZEA | Jülich Zone of Energy Abstraction | Smaller test and development system with container-based workloads. | juzea.yml |
| JUDAC | Jülich Data Access Server | Data access and transfer node for moving data between systems and external partners. | servers.yml |
| Supporting | HPSMC, DEEP, Gateways, … | Management clusters, SSH gateways, LDAP servers, CI runners, and monitoring infrastructure. | servers.yml |