Core Concepts & Capabilities

Declarative Control

Declarative control diagram

Define the desired state once, and Ansible converges each host to match it.
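As a minimal sketch of what "desired state" looks like in practice, the playbook below declares that a package is installed and a service is running, rather than listing the commands to get there. The group name `compute` and the choice of chrony are illustrative assumptions, not taken from this repository.

```yaml
# Hypothetical playbook: the group, package, and service names are illustrative.
- name: Ensure time synchronisation is configured
  hosts: compute            # assumed inventory group
  become: true
  tasks:
    - name: chrony package is present
      ansible.builtin.package:
        name: chrony
        state: present

    - name: chronyd is running and enabled at boot
      ansible.builtin.service:
        name: chronyd       # service name on RHEL-family systems
        state: started
        enabled: true
```

Running it a second time against an already-converged host should report no changes, which is the property that makes repeated runs safe.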

System Inventory

System inventory diagram

Every cluster and node is catalogued in structured, versioned inventory files.
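A structured inventory of this kind could be sketched as follows, using Ansible's YAML inventory format. The file path, cluster name, hostnames, and the `slurm_cluster_name` variable are all hypothetical placeholders.

```yaml
# inventory/hosts.yml (hypothetical path and names)
all:
  children:
    cluster_a:
      children:
        cluster_a_login:
          hosts:
            login01.cluster-a.example.org:
        cluster_a_compute:
          hosts:
            # Range pattern expands to node001 through node004.
            node[001:004].cluster-a.example.org:
      vars:
        slurm_cluster_name: cluster_a   # illustrative group variable
```

Because the file is plain text, keeping it under version control gives a reviewable history of every node added, renamed, or retired.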

Operational Security

Operational security diagram

Secrets stay in Vault, and deployments go through gated checks.
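The sketch below shows one way secrets can be kept out of plain-text playbooks. It assumes "Vault" refers to Ansible Vault (the text does not say whether it is Ansible Vault or another product), and the group name, file paths, and template name are hypothetical.

```yaml
# Assumes Ansible Vault; the paths and names below are hypothetical.
- name: Configure accounting database credentials
  hosts: slurm_controllers        # assumed inventory group
  become: true
  vars_files:
    - vault/slurmdbd.yml          # encrypted with `ansible-vault encrypt`
  tasks:
    - name: Render slurmdbd.conf with the database password
      ansible.builtin.template:
        src: slurmdbd.conf.j2     # hypothetical template
        dest: /etc/slurm/slurmdbd.conf
        owner: slurm
        mode: "0600"
      no_log: true                # keep the secret out of task output
```

A play like this would be run with `ansible-playbook --ask-vault-pass` (or `--vault-password-file`). A dry run with `--check --diff` is one possible form of gate before the real deployment, though the checks this project actually uses are not described here.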

Key Capabilities

Job Scheduling & Resource Management

Configures SLURM controllers, daemons, partitions, accounting, and job submit filters across all clusters. Integrates with UNICORE for federated job submission.

Parallel Filesystems

Manages IBM Spectrum Scale (GPFS) clusters providing home, project, scratch, and data filesystems. Also handles Ceph distributed storage and NFS exports.

Network Fabric

Provisions InfiniBand (Mellanox OFED, OpenSM subnet managers) and Ethernet interfaces. Configures DNS, DHCP, firewalls, and SSH gateways.

Monitoring & Observability

Deploys a full Prometheus and Grafana stack with alerting (Alertmanager) and centralized log aggregation (Loki and Promtail).

Containers & Virtualization

Supports HPC container runtimes (Apptainer), system containers (Podman), and Kubernetes clusters for service orchestration.

Maintenance Operations

A dedicated maint.yml playbook orchestrates planned downtime: SLURM reservations, SSH banners, GPFS graceful shutdown, and status-page integration.
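To illustrate the first of these steps, the following sketch creates a SLURM maintenance reservation with `scontrol`. It is not the actual content of maint.yml; the group name, reservation name, and time window are assumptions, and the banner, GPFS, and status-page steps are not shown.

```yaml
# Illustrative only: not the actual contents of maint.yml.
- name: Reserve a maintenance window in SLURM
  hosts: slurm_controllers              # assumed inventory group
  become: true
  vars:
    maint_start: "2030-01-15T08:00:00"  # hypothetical window
    maint_duration: "08:00:00"          # hours:minutes:seconds
  tasks:
    - name: Create a cluster-wide maintenance reservation
      # Not idempotent as written: scontrol fails if the reservation exists.
      ansible.builtin.command:
        argv:
          - scontrol
          - create
          - reservation
          - ReservationName=maint
          - "StartTime={{ maint_start }}"
          - "Duration={{ maint_duration }}"
          - Users=root
          - Flags=MAINT,IGNORE_JOBS
          - Nodes=ALL
      run_once: true
```

With a reservation like this in place, the scheduler avoids starting jobs that would run into the window, so the nodes drain without jobs being cancelled by hand.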