The xOPS Project
xOPS is an Ansible-based Configuration-as-Code (CaC) repository that defines and enforces the configuration of the supercomputers, cloud platforms, and storage systems operated by the High-Performance Computing, Cloud and Data Systems and Services (HPCCDSS) division of the Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich GmbH.
xOPS codifies every aspect of system setup (e.g. user accounts, network interfaces, job schedulers, storage mounts, monitoring stacks, security policies) into repeatable, auditable automation scripts called playbooks and roles. Changes are tracked in Git, reviewed via merge requests, and applied consistently across thousands of nodes.
This project is used to keep large-scale infrastructure reliable and predictable while reducing operational risk. It provides a single source of truth for platform configuration, speeds up rollout of new systems and updates, and helps teams recover quickly by recreating previously known working states when incidents occur.
Why Configuration as Code (CaC)?
Configuration as Code (CaC) generally refers to separating configuration settings from the code that consumes them. Ideally, the configuration data is stored in source control, where it can be easily applied and adjusted to match different environments. Please see the Ansible documentation for details.
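The separation described above can be illustrated with a minimal sketch in plain Python. It is not taken from xOPS or from Ansible; the environment names, host names, and settings are hypothetical and chosen only to show one shared template being rendered with different, version-controllable data.

```python
from string import Template

# Hypothetical per-environment settings. In a CaC repository this data
# would live in version-controlled files rather than inside the code.
ENVIRONMENTS = {
    "test": {"ntp_server": "ntp.test.example.org", "log_level": "debug"},
    "production": {"ntp_server": "ntp.prod.example.org", "log_level": "warning"},
}

# The "code": a single template shared by every environment.
CONFIG_TEMPLATE = Template("server $ntp_server\nloglevel $log_level\n")


def render_config(environment: str) -> str:
    """Render the shared template with the settings of one environment."""
    return CONFIG_TEMPLATE.substitute(ENVIRONMENTS[environment])


if __name__ == "__main__":
    for name in ENVIRONMENTS:
        print(f"--- {name} ---")
        print(render_config(name), end="")
```

Adapting the output to a new environment then means adding a data entry, not editing the template or the rendering logic.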
Auditability
Every change is committed to Git with an author, timestamp, and review trail. Auditors and compliance officers can trace who changed what, when, and why.
Consistency
Identical configurations are applied to hundreds of nodes simultaneously, eliminating “snowflake” servers and reducing human error.
Reproducibility
A new system can be bootstrapped from scratch, or an existing one rebuilt after a disaster, by running the relevant playbook.
Security
Secrets are encrypted with Ansible Vault. SSH keys, certificates, and access policies are managed centrally and rotated systematically.
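The consistency and reproducibility points rest on one idea: a desired state is declared once and every node is converged towards it, so a second run finds nothing left to change. The sketch below shows that idea in plain Python under simplifying assumptions. It is not how Ansible is implemented, and the function names, node names, and settings are hypothetical.

```python
def plan_changes(desired: dict, actual: dict) -> dict:
    """Return only the settings that are missing or differ on a node."""
    return {
        key: value
        for key, value in desired.items()
        if key not in actual or actual[key] != value
    }


def converge(desired: dict, nodes: dict) -> dict:
    """Bring every node to the desired state and report the changes per node."""
    report = {}
    for name, state in nodes.items():
        changes = plan_changes(desired, state)
        state.update(changes)  # apply only what actually differs
        report[name] = changes
    return report


if __name__ == "__main__":
    # Hypothetical desired state and two nodes, one of which has drifted.
    desired = {"timezone": "Europe/Berlin", "password_auth": "no"}
    nodes = {
        "node001": {"timezone": "Europe/Berlin", "password_auth": "no"},
        "node002": {"timezone": "UTC", "password_auth": "no"},
    }
    print(converge(desired, nodes))  # first run corrects the drifted node
    print(converge(desired, nodes))  # second run reports no changes
```

Because only differing settings are reported and applied, the per-node report doubles as a drift check: a non-empty entry marks a node that had diverged from the declared configuration.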
At a Glance
The following indicators provide a quick snapshot of the current xOPS scope. They show the scale of systems managed.
| Metric | Value |
|---|---|
| HPC Systems | 6+ |
| Managed Nodes | 6000+ |
| Ansible Roles | 150+ |
| Playbooks | 36 |
| Inventory Groups | 40+ |
High-Level Architecture
The diagram below shows how the Ansible control node applies configuration to the various HPC clusters and their subsystems.
Managed Supercomputer Systems
Playbooks and inventory definitions for each of the following JSC production systems are managed by xOPS:
| System | Full Name | Description | Playbook |
|---|---|---|---|
| JUPITER | Joint Undertaking Pioneer for Innovative and Transformative Exascale Research | Europe’s first exascale-class supercomputer. | |
| JUWELS | Jülich Wizard for European Leadership Science | Flagship hybrid CPU/GPU cluster with booster module. | |
| JURECA-DC | Jülich Research on Exascale Cluster Architectures | Data-centric system with 768+ compute nodes including AI accelerator prototypes. | |
| JSC Cloud | JSC Cloud Infrastructure | Cloud platform providing virtual machines and services alongside the HPC systems. | |
| JUST | Jülich Storage Cluster | Central GPFS-based storage cluster for all HPC systems. | |
| JUSUF | Jülich Support for Fenix | GPU-accelerated system for interactive and batch workloads. | |
| JUZEA | Jülich Zone of Energy Abstraction | Smaller test and development system with container-based workloads. | |
| JUDAC | Jülich Data Access Server | Data access and transfer node for moving data between systems and external partners. | |
| Supporting | HPSMC, DEEP, Gateways, … | Management clusters, SSH gateways, LDAP servers, CI runners, and monitoring infrastructure. | |