Reference
xOPS Repository Structure
xOPS follows Ansible best practices with a clear separation of concerns:
| Directory | Purpose |
|---|---|
| | 150+ reusable Ansible roles, the building blocks of configuration: one role per service or subsystem. |
| | Variables scoped per system or host group (e.g., …). |
| | Variables specific to individual nodes, for host-level tuning. |
| | Static configuration files deployed as-is: SSH keys, certificates, pre-built configs. |
| | Jinja2 templates rendered at deploy time (e.g., SLURM configs, kickstart files). |
| | Encrypted secrets (passwords, API tokens) managed with Ansible Vault and GPG. |
| | Per-system inventory directories defining which hosts belong to which groups. |
| | Top-level playbooks, one per system, that orchestrate which roles run on which hosts. |
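
The split between group-scoped and host-scoped variables can be pictured as a layered lookup in which the more specific value wins. The following is a minimal Python sketch of that idea, not Ansible's own implementation. It assumes Ansible's default behaviour, where a host-level variable overrides a group-level variable of the same name. The variable names and values (`ntp_server`, `hugepages`) are hypothetical and not taken from xOPS.

```python
# Illustrative model only: shows a host-level value overriding a group-level one.
# Variable names and values are hypothetical, not taken from the xOPS repository.

def effective_vars(group_vars: dict, host_vars: dict) -> dict:
    """Return the variables a host would see: host values win over group values."""
    merged = dict(group_vars)  # start from the group-wide defaults (copy, do not mutate)
    merged.update(host_vars)   # host-level tuning overrides matching keys
    return merged


# Defaults shared by every host in a (hypothetical) group.
group_vars = {"ntp_server": "ntp.example.org", "hugepages": 0}

# Tuning for one (hypothetical) node that needs a different value.
host_vars = {"hugepages": 1024}

result = effective_vars(group_vars, host_vars)
print(result)
```

Only the overridden key changes for that node; every other host in the group keeps the group-wide default.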
How Changes Are Applied
Technology Stack
Configuration Management: Ansible (agentless, SSH-based automation)
Operating Systems: Red Hat Enterprise Linux, Rocky Linux
Job Scheduling: SLURM, UNICORE
Parallel Filesystems: IBM Spectrum Scale (GPFS), Ceph, NFS
Interconnect: InfiniBand (Mellanox/NVIDIA OFED), Ethernet
Accelerators: NVIDIA & AMD GPUs (driver & runtime management)
Monitoring: Prometheus, Grafana, Loki, Alertmanager
Containers: Apptainer (Singularity), Podman, Kubernetes
High Availability: Pacemaker/Corosync, HAProxy, Keepalived
Secrets Management: Ansible Vault with GPG encryption
Source Control & CI/CD: GitLab, GitLab CI, Ansible AWX
Glossary
| Term | Definition |
|---|---|
| Ansible | An open-source IT automation engine that manages configuration, deployment, and orchestration over SSH. No agent software is required on managed nodes. |
| Playbook | A YAML file that declares what should be configured on which hosts. Think of it as a recipe that Ansible follows step by step. |
| Role | A self-contained, reusable unit of automation (e.g., "install and configure SLURM"). Roles are the building blocks assembled by playbooks. |
| Inventory | A listing of all servers (nodes) organised into groups. Each HPC system has its own inventory. |
| Configuration as Code | The practice of managing IT infrastructure configuration through machine-readable definition files rather than manual changes, enabling version control, review, and repeatable deployments. |
| HPC | High-Performance Computing: using supercomputers and parallel processing to solve large-scale computational problems in science, engineering, and AI. |
| SLURM | Simple Linux Utility for Resource Management: a widely used open-source job scheduler that allocates compute resources to user workloads on HPC clusters. |
| GPFS | General Parallel File System (IBM Spectrum Scale): a high-performance clustered filesystem designed for large-scale, data-intensive workloads. |
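
The relationship between the glossary terms inventory, playbook, and role can be sketched as plain data. This is a simplified Python model for illustration, not how Ansible is implemented. All group, host, and role names are hypothetical. It assumes each play targets exactly one inventory group and applies its roles in the order listed.

```python
# Simplified model of inventory -> playbook -> roles.
# Every group, host, and role name here is hypothetical.

# Inventory: group name -> hosts that belong to the group.
inventory = {
    "login": ["login01", "login02"],
    "compute": ["node001", "node002"],
}

# Playbook: an ordered list of plays; each play applies roles to one group.
playbook = [
    {"hosts": "login", "roles": ["sshd", "slurm_client"]},
    {"hosts": "compute", "roles": ["slurm_client", "gpfs_client"]},
]


def roles_per_host(inventory: dict, playbook: list) -> dict:
    """Resolve which roles would run on each host, in play order."""
    plan = {}
    for play in playbook:
        # A play that names an unknown group simply matches no hosts.
        for host in inventory.get(play["hosts"], []):
            plan.setdefault(host, []).extend(play["roles"])
    return plan


plan = roles_per_host(inventory, playbook)
for host, roles in plan.items():
    print(host, "->", ", ".join(roles))
```

The same role (`slurm_client` in this sketch) is written once and reused by several plays, which is the reuse the Role entry describes.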