Core Concepts & Capabilities
Declarative Control
Define the desired state; Ansible converges each host to match it.
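As a minimal sketch of what declarative control looks like in practice, the playbook below states an end condition (a package present, a service running) rather than the commands to reach it. The host group `compute` and the choice of chrony are illustrative assumptions, not taken from this repository; the service name `chronyd` also assumes a RHEL-style distribution.

```yaml
# Hypothetical playbook: group and package names are illustrative only.
- name: Ensure time synchronisation is in place
  hosts: compute
  become: true
  tasks:
    - name: Install chrony
      ansible.builtin.package:
        name: chrony
        state: present      # desired state, not an install command

    - name: Enable and start chronyd
      ansible.builtin.service:
        name: chronyd       # assumes a RHEL-style service name
        state: started
        enabled: true
```

Running such a playbook a second time should report no changes, because every host already matches the declared state.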
System Inventory
Every cluster and node is catalogued in structured, versioned inventory files.
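A structured inventory of this kind might look like the following YAML sketch. All cluster, group, host, and variable names here are hypothetical placeholders; the actual layout of the inventory files is not described in this overview.

```yaml
# Hypothetical inventory: names and variables are illustrative only.
all:
  children:
    cluster_a:
      children:
        cluster_a_login:
          hosts:
            login01.cluster-a.example.org:
        cluster_a_compute:
          hosts:
            # Ansible expands this range to node001 through node004.
            node[001:004].cluster-a.example.org:
          vars:
            slurm_partition: batch
```

Because the file is plain text, it can be kept under version control like any other source file, and `ansible-inventory -i <file> --graph` prints the resulting group tree for inspection.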
Operational Security
Secrets stay in Vault, and deployments go through gated checks.
Key Capabilities
Job Scheduling & Resource Management
Configures SLURM controllers, daemons, partitions, accounting, and job submit filters across all clusters. Integrates with UNICORE for federated job submission.
Parallel Filesystems
Manages IBM Spectrum Scale (GPFS) clusters providing home, project, scratch, and data filesystems. Also handles Ceph distributed storage and NFS exports.
Network Fabric
Provisions InfiniBand (Mellanox OFED, OpenSM subnet managers) and Ethernet interfaces. Configures DNS, DHCP, firewalls, and SSH gateways.
Monitoring & Observability
Deploys a full Prometheus and Grafana stack with alerting (Alertmanager) and centralized log aggregation (Loki and Promtail).
Containers & Virtualization
Supports HPC container runtimes (Apptainer), system containers (Podman), and Kubernetes clusters for service orchestration.
Maintenance Operations
A dedicated maint.yml playbook orchestrates planned downtime: SLURM reservations, SSH banners, GPFS graceful shutdown, and status-page integration.
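To illustrate the first two of these steps, here is a rough sketch of how a maintenance play could create a SLURM reservation and publish an SSH banner. It is not the actual maint.yml: the group names `slurm_controller` and `login`, the variables `maint_start` and `maint_minutes`, and the banner path are all assumptions, and the banner task presumes sshd is already configured with `Banner /etc/issue.net`.

```yaml
# Hypothetical sketch, not the repository's maint.yml.
- name: Prepare planned downtime
  hosts: all
  become: true
  vars:
    maint_start: "2030-01-01T08:00:00"   # illustrative start time
    maint_minutes: 480                   # illustrative duration in minutes
  tasks:
    - name: Create a SLURM maintenance reservation
      ansible.builtin.command:
        cmd: >-
          scontrol create reservation ReservationName=maint
          StartTime={{ maint_start }} Duration={{ maint_minutes }}
          Nodes=ALL Users=root Flags=MAINT,IGNORE_JOBS
      run_once: true
      delegate_to: "{{ groups['slurm_controller'][0] }}"   # assumed group name

    - name: Publish the maintenance banner on login nodes
      ansible.builtin.copy:
        dest: /etc/issue.net             # assumes sshd uses this as its Banner file
        content: "Maintenance starts {{ maint_start }}.\n"
        mode: "0644"
      when: "'login' in group_names"     # assumed group name
```

Note that the `command` task is not idempotent as written; a real playbook would probably check for an existing reservation first. The GPFS shutdown and status-page steps are left out because they depend on site-specific ordering and endpoints.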