Changelog

Current state

Installed software

  Software                  Version
  ----------------------    ----------------------------
  Rocky Linux               9.6
  Kernel Version            5.14.0-570.42.2.el9_6
  NVIDIA GPU Driver         580.65.06
  OFED                      25.04-OFED.25.04.0.6.0.1
  Slurm                     24.11.6-1.20250807git03d01a9
  ParaStation Management    6.4.1
  GPFS                      5.2.3-2
  Apptainer                 1.4.1-1
  PMIx                      5.0.8
  Default Software Stage    2025

Changelog entries

2025-09-23 Software update

Update type: OS Packages

OS Packages
  • Kernel Version has been updated to 5.14.0-570.42.2.el9_6 (from 5.14.0-570.32.1.el9_6)

2025-09-22 Update UCX

Update type: SW Modules

  • UCX has been updated to 1.18.1 (from 1.17.0)

2025-09-09 Software update

Update type: SW Modules

UCX-settings
  • UCX_CUDA_COPY_DMABUF=no has been removed from the UCX-settings/[RC,UD,DC]-CUDA modules: it is no longer necessary to prevent crashes, and it actually causes a performance regression with the latest OFED and NVIDIA driver
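If the old workaround is still exported in personal job scripts, it overrides the module defaults. A minimal sketch for spotting and dropping it (plain variable handling, nothing site-specific):

```shell
# Drop the obsolete workaround if a personal script still exports it;
# the UCX-settings modules no longer set this variable.
if [ "${UCX_CUDA_COPY_DMABUF:-}" = "no" ]; then
    unset UCX_CUDA_COPY_DMABUF
fi
echo "UCX_CUDA_COPY_DMABUF=${UCX_CUDA_COPY_DMABUF:-<unset>}"
```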

2025-09-02 Software update

Update type: OS Packages

OS Packages
  • Rocky Linux has been updated to 9.6 (from 9.5)

  • Kernel Version has been updated to 5.14.0-570.32.1.el9_6 (from 5.14.0-503.40.1.el9_5)

  • NVIDIA GPU Driver has been updated to 580.65.06 (from 570.133.20)

  • Slurm has been updated to 24.11.6-1.20250807git03d01a9 (from 24.11.5-1.20250602git2ed9014)

  • ParaStation Management has been updated to 6.4.1 (from 6.3.0)

  • GPFS has been updated to 5.2.3-2 (from 5.2.2-1.12)

  • Apptainer has been updated to 1.4.1-1 (from 1.3.6-1)

2025-07-22 Software update

Update type: OS Packages

OS Packages
  • ParaStation Management has been updated to 6.3.0 (from 6.2.4)

2025-07-01 Software update

Update type: OS Packages

OS Packages
  • ParaStation Management has been updated to 6.2.4 (from 6.2.3)

2025-06-17 Software update

Update type: OS Packages and Firmware

Firmware
  • ConnectX-7 HCAs have been updated to firmware version 28.45.1200

  • ConnectX-6 HCAs have been updated to firmware version 20.43.2566

OS Packages
  • Kernel Version has been updated to 5.14.0-503.40.1.el9_5 (from 5.14.0-503.38.1.el9_5)

  • OFED has been updated to 25.04-OFED.25.04.0.6.0.1 (from 25.01-OFED.25.01.0.6.0.1)

  • Slurm has been updated to 24.11.5-1.20250602git2ed9014 (from 23.11.10-1.20240920git20c5755)

  • ParaStation Management has been updated to 6.2.3 (from 6.1.1)

  • GPFS has been updated to 5.2.2-1.12 (from 5.2.2-1)

  • PMIx has been updated to 5.0.8 (from 5.0.6)

2025-04-29 Software update

Update type: OS Packages

OS Packages
  • Kernel Version has been updated to 5.14.0-503.38.1.el9_5 (from 5.14.0-503.26.1.el9_5)

  • NVIDIA GPU Driver has been updated to 570.133.20 (from 570.86.15)

2025-03-13 Maintenance

Extension of HWAI partition

Sixteen additional compute nodes have been added to the HWAI partition.

Slurm configuration changes

cgroup constraints have been enabled for (GPU) devices. Job steps can access only the GPUs that were requested with --gres=gpu:X or other GPU-related options.
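As a hedged illustration of the new behavior (the partition name and GPU count below are examples only, not taken from the site configuration):

```shell
#!/bin/bash
#SBATCH --partition=dc-gpu     # example partition
#SBATCH --gres=gpu:2           # request 2 of the node's GPUs

# With device cgroups enabled, this should list only the 2 allocated
# GPUs; the node's remaining GPUs are not visible to the job step.
srun nvidia-smi -L
```

A job step that requests no GPUs at all can consequently not open any GPU device.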

Update type: OS Packages

OS Packages
  • Rocky Linux has been updated to 9.5 (from 9.4)

  • Kernel Version has been updated to 5.14.0-503.26.1.el9_5 (from 5.14.0-427.33.1.el9_4)

  • NVIDIA GPU Driver has been updated to 570.86.15 (from 560.35.03)

  • OFED has been updated to 25.01-OFED.25.01.0.6.0.1 (from 24.07-OFED.24.07.0.6.1.1)

  • ParaStation Management has been updated to 6.1.1 (from 5.1.63)

  • GPFS has been updated to 5.2.2-1 (from 5.1.9-4)

  • PMIx has been updated to 5.0.6 (from 4.2.9)

2025-02-27 MemoryMax

Update type: Login nodes

  • MemoryMax has been set to 25% on individual user slices on login nodes
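Such a limit is typically realized with a systemd drop-in on the user slices; a sketch of what this could look like (the path and file name are assumptions, not the actual site configuration):

```ini
# /etc/systemd/system/user-.slice.d/50-memorymax.conf  (hypothetical path)
[Slice]
# Each user's slice may use at most 25% of the login node's memory
MemoryMax=25%
```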

2025-02-05 Change MPI-settings for OpenMPI

Update type: SW Modules

  • As of the 2025 stage, romio321 is not working, so its selection has been disabled in the MPI-settings modules. OpenMPI is now free to choose and prioritize an I/O component; currently ompio is selected.
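Users who want a specific I/O component rather than OpenMPI's automatic choice can pin it through the standard MCA mechanism; a small sketch (the exported value is illustrative):

```shell
# Pin OpenMPI's MPI-IO implementation explicitly via an MCA parameter.
export OMPI_MCA_io=ompio
echo "OMPI_MCA_io=$OMPI_MCA_io"
# ompi_info --param io all   # would list the available io components
```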

2025-01-29 Default UCX-settings module

Update type: SW Modules

  • RC (RC-CUDA on dc-hwai and dc-h100) has been made the default UCX-settings module in the 2025 stage. Until now the default was UD, by mistake.
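Users who relied on the old default can still pin it explicitly; a sketch (module names as above; availability depends on the loaded stage and the site's Lmod setup):

```shell
module avail UCX-settings    # shows the RC/UD/DC (and *-CUDA) variants
module load UCX-settings/UD  # pin the former default explicitly
```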

2024-12-11 Software update

Update type: OS Packages

OS Packages
  • Apptainer has been updated to 1.3.6-1 (from 1.3.2-1)

2024-11-26 Software update

Update type: OS Packages

OS Packages
  • fuse utility programs have been added to compute nodes

  • ParaStation Management has been updated to 5.1.63 (from 5.1.62)

2024-10-18 Maintenance

Configuration changes

  • File system oldscratch is no longer mounted.

  • Partition dc-wai renamed to dc-hwai.

  • Made /local/scratch on login nodes world-writable.

Update type: OS Packages

OS Packages
  • Rocky Linux has been updated to 9.4 (from 8.10)

  • Kernel Version has been updated to 5.14.0-427.33.1.el9_4 (from 4.18.0-553.el8_10)

  • NVIDIA GPU Driver has been updated to 560.35.03 (from 550.54.15)

  • OFED has been updated to 24.07-OFED.24.07.0.6.1.1 (from 24.04-OFED.24.04.0.6.6.1)

  • Slurm has been updated to 23.11.10-1.20240920git20c5755 (from 23.02.7-1.20240328git405c820)

  • ParaStation Management has been updated to 5.1.62 (from 5.1.61)

  • GPFS has been updated to 5.1.9-4 (from 5.1.9-3)

2024-08-06 Software update

Update type: OS Packages

OS Packages
  • Slurm has been updated to 23.02.7-1.20240328git405c820 (from 22.05.11-1.20231215gitc756517)

  • ParaStation Management has been updated to 5.1.61 (from 5.1.60)

2024-06-17 Software update

Update type: OS Packages

OS Packages
  • Rocky Linux has been updated to 8.10 (from 8.9)

  • Kernel Version has been updated to 4.18.0-553.el8_10 (from 4.18.0-513.18.1.el8_9)

  • NVIDIA GPU Driver has been updated to 550.54.15 (from 535.154.05)

  • OFED has been updated to 24.04-OFED.24.04.0.6.6.1 (from 23.10-OFED.23.10.1.1.9.1)

  • ParaStation Management has been updated to 5.1.60 (from 5.1.56)

  • GPFS has been updated to 5.1.9-3 (from 5.1.9-1)

  • Apptainer has been updated to 1.3.2-1 (from 1.2.4-1)

  • PMIx has been updated to 4.2.9 (from 4.2.6)

2024-03-07 Software update (Benedikt Steinbusch)

Update type: OS Packages

OS Packages:
  • Kernel 4.18.0-513.18.1.el8_9 (from 4.18.0-513.11.1.el8_9)

2024-02-29 Software update (Benedikt Steinbusch)

Only affects the Grace Hopper evaluation nodes.

Update type: OS Packages

OS Packages:
  • NVIDIA GPU drivers 550.54.14 (from 535.154.05)

2024-02-19 Software update (Benedikt Steinbusch)

Update type: OS Packages

OS Packages:
  • NVIDIA “open-source” GPU drivers 535.154.05 (from 535.129.03)

2024-01-22 Software update (Benedikt Steinbusch)

Only affects the Grace Hopper evaluation nodes.

Update type: OS Packages

OS Packages:
  • NVIDIA GPU drivers 535.154.05 (from 535.129.03)

2024-01-16 Software update (Benedikt Steinbusch)

Update type: OS Packages, Batch system

OS Packages:
  • General update to Rocky 8.9

  • SLURM has been updated to 22.05.11-1.20231215gitc756517 (from 22.05.10-2.20231203gitae058ea)

  • psmgmt has been updated to 5.1.59-1 (from 5.1.58-1).

  • Kernel 4.18.0-513.11.1.el8_9 (from 4.18.0-477.27.1.el8_8)

  • NVIDIA OFED 23.10-1.1.9.1 (from 23.07-0.5.1.2)

  • NVIDIA GPU drivers 535.129.03 (from 535.104.12)

  • GPFS 5.1.9-1 (from 5.1.8-2)

  • DDN IME 1.5.2-152129 (from 1.5.2-152128) with custom version of fuse

Batch System:
  • Slurm is now configured to use Linux cgroupsv2 for process management. As a consequence, CPU pinning will be more strictly enforced.

  • The Rocky Linux update results in slightly less memory being available on the compute nodes. The Slurm configuration has been updated to reflect that.
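A hedged sketch of the corresponding Slurm cgroup.conf knobs (illustrative only; the actual site configuration is not reproduced here):

```ini
# cgroup.conf (illustrative fragment)
CgroupPlugin=cgroup/v2
ConstrainCores=yes       # enforce CPU pinning via cgroups
ConstrainRAMSpace=yes
```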

2023-12-14 Software update (Damian Alvarez)

Update type: OS Packages, Batch system, SW Modules

OS Packages:
  • SLURM has been updated to 22.05.10-2.20231203gitae058ea to address newly-discovered security issues

  • psmgmt has been updated to 5.1.58-1

Software stack
  • netCDF in the 2024 stage has been rebuilt to add support for extra compression libraries

  • GCC in the 2024 stage has been recompiled to patch some bugs that appeared in combination with PyTorch

2023-11-02 PMIx update (Sebastian Achilles)

Update type: OS Packages

Packages:
  • PMIx 4.2.6

Configuration:
  • All OpenMPI installations have been rebuilt to include a patch necessary for the new PMIx

2023-10-19 Software update (Benedikt Steinbusch)

Update type: OS Packages, Firmware, Batch system, Configuration

Packages:
  • Kernel 4.18.0-477.27.1.el8_8.x86_64

  • NVIDIA OFED 23.07-0.5.1.2

  • NVIDIA GPU drivers 535.104.12

  • AMD GPU drivers 5.7

  • GPFS 5.1.8-2

  • Apptainer 1.2.4-1

  • DDN IME 1.5.2-152128

  • psmgmt-5.1.56-2

  • IB Switch firmware 27.2012.1010

  • IB HCA firmware 20.38.1900

Configuration:
  • SSH now rejects RSA keys

  • The Slurm devel partitions are now spread across multiple racks so that rack-wise maintenance procedures will no longer affect an entire partition at once

  • All OpenMPI installations rely now on a user-space provided PMIx
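Users still holding RSA keys can switch to an accepted key type; a minimal example (file name, comment, and empty passphrase are for illustration only):

```shell
# Generate an Ed25519 key pair as a replacement for a rejected RSA key.
ssh-keygen -t ed25519 -f ./id_ed25519_demo -N '' -C 'demo-key'
ssh-keygen -l -f ./id_ed25519_demo.pub   # show key type and fingerprint
```

The new public key then has to be uploaded through the usual key-management channel.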

2023-08-30 UCX-settings update (Damian Alvarez, JSC)

Update type: SW Modules

The UCX-settings/*CUDA modules now also set UCX_RNDV_FRAG_MEM_TYPE=cuda. This enables the GPU to initiate transfers of CUDA managed buffers, which can yield a large speed-up when Unified Memory (cudaMallocManaged()) is used, since staging of data is avoided.
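Outside the module, the same setting can be reproduced by hand; UCX reads it from the environment at process start (the variable name is the one the modules set, the surrounding commands are illustrative):

```shell
# What the UCX-settings/*CUDA modules now set: use CUDA memory for
# rendezvous-protocol staging fragments.
export UCX_RNDV_FRAG_MEM_TYPE=cuda
env | grep '^UCX_RNDV_FRAG_MEM_TYPE'
```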

2023-07-27 Software update (Benedikt Steinbusch)

Update type: OS Packages, Batch system

  • Rocky Linux 8.8

  • NVIDIA OFED 23.04-1.1.3.0

  • NVIDIA GPU drivers 535.54.03

  • AMD GPU drivers 5.6

  • GPFS 5.1.8-1

  • psmgmt-5.1.56-1

2023-07-03–2023-07-10 TS Update (Benedikt Steinbusch)

Update type: Other

The JURECA compute node racks were updated to technical state 068.03.

2023-06-27 Rolling update (Benedikt Steinbusch)

Update type: Rolling update, Slurm

Software Updates:
  • Slurm 22.05.9-1

2023-06-05 Rolling update (Benedikt Steinbusch)

Update type: Rolling update, PSMgmt

Software Updates:
  • PSMgmt 5.1.55-2

2023-05-16 Rolling update (Benedikt Steinbusch)

Update type: Rolling update, OS Packages, Storage

Software Updates:
  • Kernel 4.18.0-425.19.2.el8_7

  • NVIDIA OFED 5.8-2.0.3.0

  • NVIDIA GPU Driver 525.105.17-1

  • GPFS 5.1.7-1.5

  • Apptainer 1.1.8

2023-03-09 Emergency maintenance/update (Benedikt Steinbusch)

Update type: Maintenance, OS Packages, Storage, Skyways

Skyways:

Four additional Skyway gateways that provide connectivity to the JUST storage system have been put into production and configured as highly available redundant pairs with the existing four Skyways.

Software Updates:
  • GPFS 5.1.7-0 (from 5.1.6-1)

2023-02-28 10:00 to 2023-02-28 13:30 offline Maintenance (Benedikt Steinbusch)

Update type: Maintenance, SW Modules, Batch system, OS Packages, Firmware

Stage Update:

The default software stack has been changed to 2023. The other software stages remain accessible.

Slurm Update:

Slurm has been updated to version 22.05.

Software Updates:
  • OFED 5.8-1.1.2.1

  • GPFS 5.1.6-1 (from 5.1.4-1)

  • IME 1.5.2-152111 (from 1.5.2-152065)

  • NVIDIA driver 525.85.12 (from 515.65.07-1)

  • Apptainer 1.1.6-1 (from 1.1.3-1)

  • psmgmt 5.1.54-2 (from 5.1.52-5)

Firmware Updates:
  • Infiniband switches firmware 27.2010.5042

  • Infiniband HCA firmware 20.36.1010

2023-02-15 11:00 to 2023-02-16 13:15 online Maintenance (Benedikt Steinbusch)

Update type: Maintenance, Firmware

Firmware Updates:

Racks 12-14 (containing compute nodes jrc[0545-0832]) have been updated to Atos Technical state 67.02.

2022-11-29 07:00 to 17:23 offline Maintenance (Philipp Thörnig)

Update type: Maintenance, SW/FW/HW

All direct water-cooled HW was powered off, since the infrastructure loop was reconnected from the cold-water supply to the warm-water-supply cooling loop (the same loop JW-Booster is connected to).

SW Updates:
  • Rocky 8.7 (from 8.6)

  • MOFED 5.8-1.0.1.1 (from 5.7-1.0.2)

  • GPFS 5.1.5-1.10 (from 5.1.4-1)

  • NVIDIA driver 515.65.07-1 (from 515.65.01-1)

  • Apptainer 1.1.3-1 (from 1.0.3-1)

  • psmgmt 5.1.52-5 (from 5.1.50-4)

  • slurm_plugins_version 2.0-21088205.20221027git0d9ac96

  • Slurm Atos plugin updates sbb/sbf/eojr/beo

FW Updates:
  • BMC/HCA/BIOS FW updates at Service Island including the logins

Service Storage change:
  • ceph flag activation ceph osd set-require-min-compat-client luminous

  • ceph_client SW update to version pacific

GPFS GW OS Update Skyways:
  • OS update to V8.1.3000 at all four active/production Skyways

    • With this version, the long-missing HA functionality is finally available. The next step is to activate HA and the four remaining (currently inactive) Skyways during the next offline maintenance (this includes some config/routing adaptations at JR).

2022-11-02 09:00 to 18:26 offline Maintenance (Philipp Thörnig)

Update type: Maintenance, SW/FW/HW

  • psmgmt update: psmgmt-5.1.52-1

  • power-save functionality enabled in Slurm to power off idle systems and handle the automated power-on/onlining of computes. Longer job initialization phases are expected as a result; these are not accounted.

    • Online tuning will take place over the next few days.
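The Slurm power-saving machinery behind this is configured via slurm.conf; an illustrative sketch (all values and script paths are assumptions, not the production settings):

```ini
# slurm.conf power-saving knobs (illustrative values)
SuspendTime=600                        # power off nodes idle for 10 min
SuspendProgram=/etc/slurm/poweroff.sh  # hypothetical helper scripts
ResumeProgram=/etc/slurm/poweron.sh
ResumeTimeout=900                      # covers the longer job start-up
```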

JURECA-DC:
  • IB-SWITCH FW updates 27.2010.3118 to stabilize the switch-to-switch connections.

  • Seq2000 TS-Global update TS066.02 to bring all compute HW/Rack components to the latest FW versions.

JURECA Booster:
  • Module EOL -> powered down and removed from the Slurm/cluster config (disassembly will take place over the next few weeks)

2022-08-23 08:30 to 17:28 offline Maintenance (Philipp Thörnig)

Update type: Maintenance, SW/FW

  • Slurm Plugin jsc-slurm-plugins-nopshc installed

  • Login nodes jrlogin[01-12] are now connected to GPFS through Ethernet (logins no longer need to be closed when IB maintenance takes place in the future)

JURECA-DC:
  • IB-opensm new portgroup added: compute

  • IB-HCA-FW updates: 20.33.1048 (main target to reduce the link-down-events)

    • HCA-configuration adaptions: LOG_MAX_QUEUE 18

  • Rocky8.6 update including all related SW updates

    • kernel: 4.18.0-372.19.1.el8_6.x86_64

  • GPFS update: 5.1.4-1

  • psmgmt update: psmgmt-5.1.50-2

  • nvidia update: 515.65.01

  • OFED update: 5.7-1.0.2.0

JURECA Booster:
  • GPFS update: 5.1.4-1

  • psmgmt update: psmgmt-5.1.50-2

2022-06-28 09:00 to 12:56 online Maintenance (Philipp Thörnig)

Update type: Maintenance, SW

JURECA:
  • SW updates to fix important bug in slurm/psslurm:

    • psmgmt update 5.1.49-5

    • Slurm update 21.08.8-2

  • python2/38 cleanup

  • jrceph host rebooted: to fix slow ops warnings in log

  • keepalived rollout at logins: HA-IP prio added

  • Graphcore updates:

    • RNIC:[ UP ] Version:[ 2.5.0 ] [ bmc: gc-1.22.0 ] [ gatewayFpga_ipum-p2: 1.5.0 ] [ gwsw: 2.5.2 ] [ ipuofServer: v1.10.0 ] [ mcu: 2.5.6 ] [ systemFpga: 0x16 ] [ vipuStandalone: 1.17.0 ] [ virmAgent: 1.17.0 ]

  • JURECA-DC GPU overheating-check: 1 compute drained

  • HPL performance benchmark: 3 DC computes drained due to slow performance

2022-06-09 10:00 to 15:58 online Maintenance (Philipp Thörnig)

Update type: Maintenance, SW/FW

JURECA-DC:
  • HCA-FW updates JURECA-DC computes: new HCA-FW-Version 20.32.101

    • online_maintenance_20220609_gp StartTime=2022-06-09T10:00:00 EndTime=12:29:57

    • online_maintenance_20220609_cp StartTime=2022-06-09T14:00:00 EndTime=15:58:51

2022-05-31 08:00 planned Offline Maintenance Booster Module (Philipp Thörnig)

Update type: Maintenance, HW

JURECA Booster: (last service and support day; all open HW tickets were addressed)
  • 12 optical OPA-Cable replaced

  • 3 computes repaired

2022-05-12 Offline maintenance (C. Paschoulas, JSC)

Update type: Maintenance, HW

JURECA:
  • HW:

    • Infiniband switches re-configuration

2022-05-03 Global maintenance with general updates (C. Paschoulas, JSC)

Update type: Maintenance, HW + SW

JURECA:
  • SW:

    • GPFS updated to 5.1.3-1

    • OFED updated to 5.5-1.0.3.1

    • NVIDIA driver updated to 510.47.03

    • Kernel updated to 4.18.0-348.23.1

    • Slurm updated to 21.08

    • IME clients updated to 1.5.1.1-151131

    • Migrated from singularity to apptainer 1.0.1-1

  • HW:

    • 1 x IB-Switch was replaced

2022-04-08 10:00 to 11:09 online Maintenance (Philipp Thörnig)

Update type: Maintenance, SW

JURECA:
  • Slurm:

    • slurm.conf update to add new prototype systems

    • fixed TRESBillingWeights to count only the real cores (consistent with our dispatch accounting)

  • OS-SW: linux-firmware-20210702-103.gitd79c2677.el8.noarch removed from all DC computes to shrink the diskless image as much as possible.

2022-03-15&17 online Maintenance (Philipp Thörnig)

Update type: Maintenance, SW

JURECA:

GPFS update -> gpfs_version: '5.1.2-3', gpfs_gsk_version: '8.0.55-19.1'

ReservationName=gpfs_20220315 StartTime=2022-03-15T09:00:00 EndTime=2022-03-16T18:30:00 Duration=1-09:30:00
   Nodes=jrc[0001-0204,0213-0236,0245-0268,0277-0300,0309-0314,0437-0442,0449-0539,0727-0731,0737-0784,0850,0870,5401-6008,6617-6628] NodeCnt=1054
ReservationName=gpfs_20220317 StartTime=2022-03-17T09:00:00 EndTime=2022-03-18T18:30:00 Duration=1-09:30:00
   Nodes=jrc[0315-0332,0341-0364,0373-0396,0405-0428,0443-0448,0540-0726,0732-0832,0851,0871,6009-6616,6629-6640] NodeCnt=1006

GPFS will be updated on login and compute nodes in a rolling fashion. This means batches of login and compute nodes need to be taken out of production temporarily. The process will be mostly transparent, but the following login nodes won't be reachable at the specified times:

  • jureca[08-14].fz-juelich.de will be updated on Tuesday at 09:00 AM

  • jureca[01-07].fz-juelich.de will be updated on Thursday at 09:00 AM

New logins via the default DNS name jureca.fz-juelich.de will be possible at all times.

2022-03-15 Update: reservation gpfs_20220315 released at 11:57 and jureca[08-14] back online since 09:47

2022-03-17 Update: reservation gpfs_20220317 released at 15:15 and jureca[01-07] back online since 11:00

2022-03-08 08:30 planned Offline Maintenance (Philipp Thörnig)

Update type: Maintenance, SW, HW

JURECA:
  • Golden Client (GC) SW cleanup to reduce the diskless image size for all computes in the cluster.

  • easybuild: modules that expand the module path (GCCcore, compilers, mpi) were rebuilt with the following settings: --rebuild --module-only

  • disabled IPv6 usage in GRUB, ssh config, and ParaStation config

    • to apply the changes, a reboot of all hosts in the Service Island and all ~2000 computes was necessary.

  • Slurm config adaptions:

  • HW: ~5.5 hours of Ethernet cable replacement at all service racks (all cables on the IPMI, admin, and Ceph networks)

  • last webpage highmessage update after maintenance: High Message service deprecation

  • psmgmt update to psmgmt-5.1.46-0

JURECA-DC including logins and service island hosts+container:
  • pcs, containers, and sriov ansible-tag rollout across the whole Service Island and restart of all containers

  • compute: mlnx-nvme kernel module adaption to support sbb/sbf

  • compute and login: new ime/hpst/cscratch client configuration to support further testing while HPST is in maintenance

  • MLNX-Skyway reboot to fix some minor JUST connection issues (IPoIB)

2022-03-08 Change in user installations (Damian Alvarez, JSC)

Update type: Announcement, SW Modules

Change in user installations
  • The module structure has been changed so $MODULEPATH is not expanded depending on the existence of the $PROJECT variable. Now the variable used is $USERINSTALLATIONS, so the project software is not automatically activated when using jutil.
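In practice this means project software now has to be opted into explicitly; a sketch (the path is purely illustrative):

```shell
# Opt in to a project's software installations explicitly; the module
# system then extends $MODULEPATH from this location. With $PROJECT
# alone, nothing is activated anymore.
export USERINSTALLATIONS=/p/project/exampleproject/software
echo "USERINSTALLATIONS=$USERINSTALLATIONS"
```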

2021-12-13 08:30 to 2021-12-15 10:00 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW, HW

JURECA:
  • XCST-largedata and XCST-largedata2 mounted on 10 relocated DC computes and 10 relocated Booster computes (so that they are used through the default queue only when the system is completely filled with user jobs):

    • jrc[0710-0719],jrc[6600-6609]

  • GPFS update to 5.1.2-1

  • sssd adapted to react faster to new user changes on all computes and logins:

    • compute: entry_cache_[user,group]_timeout: 900 (=15 minutes)

    • login: entry_cache_[user,group]_timeout: 300 (=5 minutes)

  • Singularity update: 3.8.5-1

  • UNICORE update: 8.3.0-1

  • HPST/IME SW update: 1.5.1.1-151123

  • parastation updates due to RockyOS migration:

    • pscluster-console-5.2.1-1

    • psconfig-5.2.1-1

    • pshealthcheck-5.2.3-1
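The sssd cache tuning listed above corresponds to sssd.conf settings along these lines (the domain name is a placeholder; the values are the ones given above):

```ini
# sssd.conf, domain section (compute nodes; logins use 300 instead)
[domain/example]
entry_cache_user_timeout = 900    # 15 minutes
entry_cache_group_timeout = 900
```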

JURECA-DC including logins and service island hosts+container:
  • Fluid exchange on all XH2000 racks (rack-internal water loop / direct water cooling): the main reason for the two-day offline maintenance

  • TS-Global update (FW update of all components inside the XH2000 Racks)

  • OS update from CentOS8.4 to Rocky Linux release 8.5 4.18.0-348.2.1.el8_5

  • OFED update: 5.4-3.1.0.0

  • NVIDIA update:

    nvidia_version: '470.82.01'
    nvidia_version_gdrcopy: '2.3-3'
    nvidia_gpumond_version: '2.0-27.20201021git8c3d9b5a'
    nvidia_version_gpu_tools: '1.0-17.20160816git89d2162'
    nvidia_version_peer_memory: '1.1-750'
    
JURECA-Booster
  • OS update from CentOS8.3 to Rocky Linux release 8.4 4.18.0-305.19.1.el8_4

  • Omnipath SW opa Version update: 10.11.1.0.10

  • OPA-Switch FW/OS update (unmananged/managed): F/W ver:10.8.4.0.5

  • OPA-switch HW replace: edge_4_06_1

  • OPA-cable HW replace: 5 optical cable -> switch to switch (edge to root)

  • removed the powered down 400 KNLs from slurm config (see also maint 2021-11-04 info): jrc5[001-400]

  • Admin-System BIOS update: BIOS version 2.12.1

2021-12-06 emergency Maintenance (Philipp Thörnig)

2021-11-03 14:00 to 18:17 - JURECA emergency Offline Maintenance due to global GPFS outage

2021-11-04 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW, Acceptance tests

JURECA:
  • XCST-largedata and XCST-largedata2 mounted at 10 DC-Computes and 10 Booster-Computes:

    • jrc001[0-9],jrc541[0-9]

    • slurm feature/resource largedata available now

  • new slurm partitions for swmanage users. The following partitions overlap the devel partitions, but without the 2 hour time limit:

    dc-cpu-devel-sw
    dc-gpu-devel-sw
    booster-devel-sw
    
  • GPFS update to 5.1.2-0

  • Update psmgmt to psmgmt-5.1.44-2

    Version 5.1.44-2:
    =================
    Bugfixes:
     - Let visspank start without additional parameters
    
    Version 5.1.44-1:
    =================
    Bugfixes:
     - Various fixes on input forwarding in psidforwarder
 - Fix various warnings emitted by rpmbuild
    Enhancements:
     - Use mallinfo2() if available (#19)
    
    Version 5.1.44:
    ===============
    Bugfixes:
     - Use correct pack size for interactive steps
 - Step followers need to send step complete messages for pack jobs
     - Prevent possible segfault when a pack job is aborted at startup
     - Ensure nodes with different Slurm protocols can use tree forwarding
     - Prevent segfault when psslurm gets unloaded and protocol < 20.11
    Enhancements:
     - pspam: add option auth_groups to pam module
     - Optimize partition creation in psslurm
    Additional changes:
     - psslurm: Rename some variables to better reflect their meaning
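The swmanage devel partitions listed above could be expressed in slurm.conf roughly as follows (node list, group name, and limits are assumptions, not the production settings):

```ini
# slurm.conf (illustrative): same nodes as the devel partition, no 2 h limit
PartitionName=dc-cpu-devel-sw Nodes=jrc0001 AllowGroups=swmanage MaxTime=UNLIMITED
```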
    
JURECA-DC including logins:
  • OS update from CentOS8.3 to CentOS8.4 4.18.0-305.19.1.el8_4.x86_64

  • OFED update: 5.4-1.0.3.0

  • ibms update: 5.6.2

  • NVIDIA update:

    nvidia_version: '470.57.02'
    nvidia_version_gdrcopy: '2.3-2'
    nvidia_gpumond_version: '2.0-26.20201021git8c3d9b5a'
    nvidia_version_gpu_tools: '1.0-16.20160816git89d2162'
    nvidia_version_peer_memory: '1.1-746'
    
  • power capping (power envelope by now at 800KW) disabled:

    • Bull BEO CPU power capping on 180W disabled

    • Bull RLPL PSU-Rack-based capping disabled

JURECA-Booster
  • 400 KNLs powered down to make it possible to deactivate power capping at DC (see last point above): jrc5[001-400]

2021-11-03 emergency Maintenance (Philipp Thörnig)

2021-11-03 19:30 - JURECA emergency Offline Maintenance due to global GPFS outage

2021-09-14 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW

Infrastructure installed an additional PQ-Box at Trafo 3 to get more detail on transformer-level power consumption and peaks, so a general power reduction was needed from JW-Cluster, JSF, and JURECA.

JURECA:
  • linktest (bandwidth/latency)

  • OS: kernel.shmmax set to default OS value at all computes

  • EasyBuild update:

    • Update of default modules

      The default compilers have been changed during the maintenance.
      The new default compilers are:
       - GCC 10.3
       - NVHPC 21.5
       - Intel 2021.2.0
      
      New MPIs and CUDA are also part of this update.
      If you wish to keep using the old defaults please make sure you are loading the modules for those particular versions.
      
JURECA DC
  • To expand the monitoring capabilities the jr-ibms rbd device was increased to 2TB

  • EJR-Mojo installation at computes

  • BDPO installation at computes

  • HPST: new IME-Config

  • TS-Global 56.01 update (new BMC and BIOS FW at computes)

  • RLPL and BEO adaptations, since the new BMC FW (TS-Global update) now supports power-capping adjustments.

2021-08-18 emergency Maintenance (Philipp Thörnig)

2021-08-18 08:30 - JURECA emergency offline maintenance start due to global GPFS outage (DC and Booster)

2021-08-18 16:10 - JURECA emergency offline maintenance end (DC and Booster)

  • no system changes

2021-07-22 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW, Acceptance tests

JURECA:
Version 5.1.43:
===============
Bugfixes:
 - Let psslurm report the real memory of the local node in Megabytes
 - Fix partition creation for job packs in psslurm
 - Allow PAM SSH connections when cpuacc module is loaded
 - Make RPC REQUEST_JOB_NOTIFY compatible with Slurm 20.11
 - Make psgw option --gw_debug work with --gw_psgwd_per_node
 - Don't send signal twice on scancel
 - Do not prevent signal delivery in hetjobs
Enhancements:
 - Support interactive steps (#16)
 - Add support for --gpu-bind=map_gpu in psslurm
 - Add support for RPC REQUEST_RECONFIGURE_WITH_CONFIG and REQUEST_RECONFIGURE
 - Support hetjobs in pspmix
   * For this, distribute reservations to all nodes in partition
 - Rework map string parsing and support multiplying '*' in psslurm
 - Rename CPU env variables and leave in user env
   * Rename __PSSLURM_STEP_CORE_BITMAP to PSSLURM_STEP_CPUS
   * Rename __PSSLURM_JOB_CORE_BITMAP to PSSLURM_JOB_CPUS
 - Pass psid's log destination to plugins
Additional changes:
 - Add STEP_CPUS to main jail script
 - Prevent jail plugin from spamming the log
 - psslurm now manages job infos in a list instead of an array
 - Utilize different PMIX-macros within the code
 - Add NVIDIA Tesla V100 SXM2 32GB to nodeinfo config
JURECA-DC:
  • IOR and nsdperf acceptance tests

2021-07-15 Maintenance (Philipp Thörnig)

Update type: Maintenance, HW, SW

JURECA:
  • Slurm config adaption to fix a bug at modular job level: InactiveLimit=0

JURECA-DC:
  • GPGFS-GW: 4 Skyway HW replaced to match GA Version (jurecag01, jurecag03, jurecag05, and jurecag07)

    • OS installation and configuration after HW was installed

2021-07-07 to 2021-07-08 and 2021-07-14 JURECA-DC Module reservation (Philipp Thörnig)

Acceptance benchmarks while final power capping is in place:

2021-07-07 08:00 - JURECA offline maintenance start (DC only) - logins stay open

2021-07-08 19:26 - JURECA offline maintenance end (DC only)

2021-07-14 08:00 - JURECA offline maintenance start (DC only) - logins stay open

2021-06-29 Maintenance (Philipp Thörnig)

Update type: Maintenance, HW, SW

JURECA:
Version 5.1.42-1:
=================
Bugfix:
 - Fix possible segmentation fault in x11spank
 - Change psgw configuration option GATEWAY_ENV to change compute
   process' environment instead of psgwd

Version 5.1.42:
===============
Bugfixes:
 - Fix bug in jail script to set the oom score
 - Fix various memory leaks
Enhancements:
 - Add new psgw configuration option GATEWAY_ENV to set environment for psgwd
 - psslurm checks if PrologSlurmctld is set in slurm.conf
 - Improved syslog messages
 - Replace getdtablesize() by sysconf(_SC_OPEN_MAX) in psmom, too
Additional changes:
 - Merge fwCMD_printMessage() and fwCMD_printJobMsg() into fwCMD_printMsg()
 - Move doRead() et al. from psserial to PSCio_recvBuf()
   * Call PSCio_recvBuf() directly instead via PSID_readall()
 - Introduce PSCio_recvMsg() family of functions
 - Use PSCio_setFDblock() instead of fcntl()
JURECA-DC:
  • SW: Acceptance tests: Rack-based power Capping Phase 2 measurements

    • GPU 300W capping applied at all DC-GPU-Computes

    • Rack-based power capping set up and activated

    • 50ms power measurements with extra HW equipment being installed while HPL triggers GPU-Power-Peaks

  • HW: Rack06 WELB exchange

2021-06-17 Maintenance (Philipp Thörnig)

Update type: Maintenance, HW, SW

JURECA:
  • Minor slurm update inside 20.02.7-1

  • Update psmgmt to psmgmt-5.1.41-3

    • change log:

Version 5.1.41-3:
=================
Bugfixes:
 - Ignore if a spank plugin registers spank options only to srun
 - x11spank: fix handling of connect() return code
 - x11spank: use correct display string for xauth

The complete change log can be found at: https://github.com/ParaStation/psmgmt/blob/master/NEWS

JURECA DC
  • 4th phase of the TS-Global update: ts5503

    • PSU update finalization: Rack3+6+8+9+10+12

    • After the maintenance, many racks still showed offline PSUs and problematic power shelves; according to Atos, this has no impact on the computes that are now in production.

      • Rack 8 pws01 - power shelf failed

      • Rack 3 pws04 psu 1- no update possible

  • rack 3: PMC replaced

  • #9431 jrpmc06 / jrpmc09 / jrpmc10 flipping reachability

    • all pmcs reseated

    • jrpmc06 replaced

  • 62 IB-Cable reseated

JURECA Booster
  • ParaStation admin node jra58 updated to CentOS 8.3 by ParTec (the 205 computes behind it are currently drained due to delays with the update)

  • local resource $LOCALSCRATCH available again at computes

  • OPA-Cable replace

    • 4 optical cable

    • 4 copper cable at nodes

2021-06-07 08:30 to 2021-06-09 01:18 Maintenance (Philipp Thörnig)

Update type: Maintenance, HW, SW

JURECA:
  • Minor slurm release update to 20.02.7-1

  • Update psmgmt to psmgmt-5.1.41-2

    • change log:

Version 5.1.41-2:
=================
Bugfixes:
 - Ensure to call the correct callback for spank options

Version 5.1.41-1:
=================
Bugfixes:
 - Ensure the environment is setup properly for Spank
 - Forward runtime variables to spank_exit hook
Enhancements:
 - Use PSIDHOOK_EXEC_CLIENT_PREP in psslurm to call Spank hook SPANK_TASK_INIT
 - Make psslurm plugin init never fail without message
Additional changes:
 - Add hook PSIDHOOK_EXEC_CLIENT_PREP and bump plugin API version to 132

The complete change log can be found at: https://github.com/ParaStation/psmgmt/blob/master/NEWS

JURECA-DC:
  • 2 GPU Rack SOH HW replace tasks 6h each:

    • jrc0288@Rack05

    • jrc0350@Rack07

  • 2nd phase of the TS-Global update: ts5503

    • Due to limited time for the remaining updates, Atos decided to apply only the PSU update, which adds new functionality supporting rack-based power capping:

      • We faced major issues with these PSU updates; this was also the root cause of the maintenance being prolonged to Tuesday.

      • Atos is still analyzing the root cause and working to fix the current situation, in which many PSUs are offline or still at the old FW level.

      • After those updates, Rack11 was kept offline while the rest of the DC module went back into production, due to failing linktest bandwidth tests related to this rack.

2021-05-31 to 2021-06-01 JURECA-DC Module maintenance reservation (Philipp Thörnig)

From about 16:50 on 2021-05-31 we saw roughly 3-4 log lines per second in opensm.log. These IB problems triggered severe GPFS connection problems, and a full DC-module maintenance reservation was needed:

ReservationName=root_969 StartTime=2021-05-31T18:25:54 EndTime=2021-06-02T18:00:00 Duration=1-23:34:06
   Nodes=jrc[0001-0204,0213-0236,0245-0268,0277-0300,0309-0332,0341-0364,0373-0396,0405-0428,0437-0832] NodeCnt=768 CoreCnt=98304 Features=(null) PartitionName=dc-maint Flags=MAINT,IGNORE_JOBS,SPEC_NODES,PART_NODES
   TRES=cpu=196608
   Users=(null) Accounts=root Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
  • jurecag03 was the root cause, so all computes behind this GPFS gateway were affected: jrc[0001-0030,0034-0164,0166-0204,0213-0236,0245-0256].

  • reservation reduced to affected nodes at 13:38:

    • At the same time, we brought jrlogin0[5-8] (Login05 to Login08) into maintenance mode

ReservationName=root_969 StartTime=2021-05-31T18:25:54 ...
   Nodes=jrc[0001-0204,0213-0236,0245-0256] NodeCnt=240 ...
  • the reservation was completely released after the problem was solved at 17:24; the logins went back into production at the same time.

2021-05-27 to 2021-05-28 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW, Benchmarks

JURECA:
  • Update psmgmt to psmgmt-5.1.41-0

    • change log:

      • The complete changelog can be found at: https://github.com/ParaStation/psmgmt/blob/master/NEWS

Version 5.1.41:
===============
Bugfixes:
 - pspelogue removes "SPANK_" prefix from already prefixed variables (jwt:#9228)
 - Ensure PSP_SMP_NODE_ID is kept (#2911, meluxina:#92)
Enhancements:
 - Add support for spank options (spank_option_register(),
   spank_option_getopt(), and spank_options symbol)
 - Add support for slurm_spank_log()

JURECA-DC:
  • benchmarks to submit to the Top500/Green500/Graph500/HPCG lists

2021-05-20 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW, HW

JURECA:
  • size of /dev/shm increased at DC and Booster:

    • tmpfs /dev/shm tmpfs rw,nosuid,nodev,mode=1777,size=85% 0 0
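As a hedged illustration (the 85% figure is taken from the mount line above; everything else assumes a generic Linux node, not a JURECA-specific tool), the limit this mount option permits can be computed from /proc/meminfo:

```shell
# Sketch: compute what "size=85%" of a node's RAM allows /dev/shm to
# grow to, in kiB (assumes Linux with /proc/meminfo).
mem_total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
shm_size_kib=$((mem_total_kib * 85 / 100))
echo "/dev/shm may grow up to ${shm_size_kib} kiB"
```

Note that tmpfs only consumes memory for files actually stored in it; the size option is an upper bound, not a reservation.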

JURECA-Booster:
  • local gpfs manager OS update to CentOS 7.9 and GPFS 5.1.0-3

  • ps-admin node update (CentOS 8 update): further ParTec tasks

  • /etc/bashrc fixed to support/fix compiling at booster-devel partition

JURECA-DC:
  • sbb rpm update:

    • file descriptor limit increased by Atos R&D to avoid the user-visible "0: iolib: warning: access to a file descriptor higher than 1023" problems

  • HPL execution (Top500/Green500 submission) without capping

  • LWP: rpm installation at computes and logins

    • gnuplot

    • libpipeline

    • libomp-atos-9.0.0-1.20201118141416

  • defective Infiniband Switch jrc-05-L1-04 replaced

  • large queue jobs started after maint (one job in dc-cpu-large)
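The iolib warning quoted in this entry stems from the classic 1023-descriptor boundary of select(); a minimal, generic shell check of the per-process descriptor limits (an illustration only, not the Atos R&D fix itself):

```shell
# Sketch: show the soft and hard open-file limits of the current shell.
# Descriptors numbered above 1023 trigger warnings in select()-bound
# libraries unless the soft limit has been raised and the library
# tolerates high descriptor numbers.
soft_fds=$(ulimit -Sn)
hard_fds=$(ulimit -Hn)
echo "soft nofile limit: ${soft_fds}, hard nofile limit: ${hard_fds}"
```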

2021-05-12 Maintenance without closing the logins (Philipp Thörnig)

Update type: Maintenance, SW

JURECA:
  • psmgmt has been updated to 5.1.40-8. psmgmt changelog: https://github.com/ParaStation/psmgmt/blob/master/NEWS

JURECA-DC: ISSUES WITH CUDA_VISIBLE_DEVICES on PHASE 2 GPU NODES:

Due to issues with the CUDA_VISIBLE_DEVICES environment variable, which ensures access to GPU devices, the GPU nodes of phase 2 were taken offline. Update: a full system reservation was needed to apply a psslurm fix as soon as possible, at 2021-05-12 10:00. This reservation was to be released by 13:00 on the same day at the latest. Afterwards, the CUDA_VISIBLE_DEVICES environment variable was fixed everywhere.

2021-05-06 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW

JURECA:
  • psmgmt has been updated to 5.1.40-7. psmgmt changelog: https://github.com/ParaStation/psmgmt/blob/master/NEWS

  • Slurm patched to 20.02.6-1.20210429gitec7ac2caf7 to close a security vulnerability

  • /etc/hosts cleanup at Logins

  • Linktest latency and bandwidth at both modules

JURECA-Booster:
  • ps-admin node update (CentOS 8 update)

JURECA-DC:
  • Sequana Valve-HYC 2+1 mode configuration check

  • firestarter stress tests

  • Seq-compute BMC ntp configuration

  • jrlogin03 ethernet card enp225s0f[0,1] reseated (CEPH bond interface)

  • IB-Cable replace

JURECA-TEST:
  • IB-Cable replace and jrtlogin01 installation

2021-04-29 to 2021-04-30 Maintenance (Philipp Thörnig)

Update type: Maintenance, Acceptance

JURECA-DC:
  • phase 2 partial acceptance finished:

    • HPL/Firestarter

    • static BIOS power capping in place

    • full system job launch tests

    • HA checks

    • HW Stress tests

    • final facility water loop configuration in place

    • final number of compute nodes in production since 2021-05-01 00:00

      • Standard/Slim nodes: quantity up to 480

      • Accelerated nodes: quantity up to 192

2021-04-22 Maintenance (Philipp Thörnig)

Update type: Maintenance, OS, SW

JURECA-DC:
  • TS upgrade on several repaired computes

  • partial acceptance tests preparation took place (incl. Graph500 and HPL)

JURECA DC/BOOSTER:
  • 8 JR-Booster MPI-GWs - External IB-Connection fixed with IB-Switch reconfiguration and switch reboots

  • SW update at all nodes (top island and compute nodes):

    • CentOS update to 8.3.2011

      • Kernel update to 4.18.0-240.22.1.el8_3.x86_64

    • OFED: Updated to version 5.1-2580

    • Booster Omni-Path version update to 10.11.0.0.577

    • DC-GPU driver update to: 460.32.03

      • Remark: after the DC module came back online at ~19:30 on 2021-04-22 following the upgrade from CentOS 8.2 to 8.3, the GPU-equipped computes faced problems with the GPU driver in production. The GPU driver was fixed at ~08:40 on 2021-04-23.

2021-04-13 to 2021-04-15 Maintenance (Philipp Thörnig)

Update type: Maintenance

JURECA-DC:
  • Infiniband fabric cleanup (ports reseat, replace/switch config updates/reboots…)

  • IB-Switch replace at GPU Rack 6 switch L1-04

  • TS upgrade on several repaired computes

  • various partial acceptance tests took place

JURECA-TEST:
  • MB replace Service node 3 - jrtsrv01

  • IB Cable installation

  • Ethernet at Seq-Rack installation

JURECA DC/BOOSTER:
  • JR-Booster MPI-GWs jrq[001-198] - External IB-Connection fixed with IB-Switch reconfiguration and switch reboots

  • GPFS update to 5.1.0-3

JURECA-DC: HPST
  • Increase count for ime-scratch license

Slurm:
  • Deploy slurm role on jurecadc

2021-04-08 Maintenance (Philipp Thörnig)

Update type: Maintenance

JURECA-DC:
  • Infiniband fabric cleanup (ports reseat, replace, swap / switch config updates/reboots…)

  • one Seq-2000 SOH cable replacement (~3h HW task)

JURECA-DC: HPST
  • Increase count for ime-scratch license

Slurm:
  • Deploy slurm role on jurecadc

2021-03-25 Maintenance (Philipp Thörnig)

Update type: Maintenance

JURECA-DC:
  • Infiniband fabric cleanup (ports reseat, replace / switch config updates/reboots…)

  • two Seq-2000 SOH cable replacements (~3h HW task each)

  • CPU rack1 and GPU rack 7 (phase2) - test installation of upgrade PSU FW

  • GPU handling changed:

    • The default way of distributing GPU IDs and tasks has changed. Now, by default, one Slurm task will only see one GPU ID. See the JURECA documentation for details:

      • https://apps.fz-juelich.de/jsc/hps/jureca/gpu-computing.html#gpu-visibility-affinity
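A hedged illustration of the new default behaviour (the psslurm/Slurm mechanics themselves are not reproduced here): each task is started with CUDA_VISIBLE_DEVICES restricted to a single GPU ID, so code inside the task only sees that one device:

```shell
# Sketch: a child process only sees the GPU IDs listed in its
# CUDA_VISIBLE_DEVICES, mimicking what a Slurm task observes when the
# launcher assigns it a single GPU ID (here "0" as an example value).
seen=$(CUDA_VISIBLE_DEVICES=0 sh -c 'echo "$CUDA_VISIBLE_DEVICES"')
echo "task sees GPU ID(s): ${seen}"
```

Inside such a task, CUDA enumerates only the listed device and remaps it to local index 0.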

JURECA-DC: HPST
  • ime/HPST configuration adaption

Slurm:
  • reduce backfiller interval

  • new sbb/sbf plugin

2021-03-18 Maintenance (Philipp Thörnig)

Update type: Maintenance

JURECA-DC: Infiniband fabric cleanup (ports reseat, replace / switch fw updates/reboots…)
JURECA-DC: HPST
  • provisional IB cable replaced

  • IOR tests

MPI-GW DC/Booster
  • initial tests took place

Atos SBF software updates
  • libmooshika-1.0-202103101503.el8.x86_64.rpm

GPFS at DC
  • IOR tests at $SCRATCH

2021-03-11 Maintenance (Philipp Thörnig)

Update type: Maintenance

psmgmt-5.1.39-2 update on JURECA

psmgmt has been updated to 5.1.39-2. psmgmt changelog: https://github.com/ParaStation/psmgmt/blob/master/NEWS

JURECA-DC: Infiniband conclusion of Phase1 and Phase2 installation

After the installation/cleanup of the Phase 2 DFP fabric is finished, both IB fabrics (Phase 1 & 2) will be joined.

JURECA-DC: HPST connected to IB-Fabric again after replacing all cables

IME mounts are accessible again after the maintenance

JURECA-Booster: OPA
  • switch configuration adaption

  • 12 OPA-Cable replaced

  • Performance problems visible in Linktest solved

2021-03-02 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW Modules

psmgmt-5.1.38-2 update on JURECA

psmgmt has been updated to 5.1.38-2. This fixes a protocol incompatibility problem with slurmctld and a segmentation fault when using heterogeneous jobs. psmgmt changelog: https://github.com/ParaStation/psmgmt/blob/master/NEWS

UCX has been changed to 1.9.0 from 1.8.1 for both ParaStationMPI and OpenMPI (easybuild)

The default UCX version was updated to 1.9.0 on the JURECA DC module. If you would like to have this version in your jobs, please execute “ml UCX/1.9.0” after loading your MPI module. This version provides better performance and auto-selection of the closest HCA for communication.

2021-02-25 Maintenance (Philipp Thörnig)

Update type: Maintenance, SW Modules

pscom with gateway support added to JURECA-DC Easybuild stack

Details about psgwd can be found on slide 7 of the ParTec presentation.

Recompilation of GCC at JURECA-DC (easybuild)

GCC now supports GPU offload

JURECA-DC: HPST in maintenance/unmounted until 2021-03-11 due to IB cable renewal

Installation JURECA-DC module (including new Service Island for JURECA)

Phase1 end of 2020

JR-DC and JR-Booster update to CentOS 8.2

Phase2 beginning of 2021

2020-07-16 Maintenance (Benedikt von St. Vieth)

Update type: Maintenance

Row0[2,3] - shutdown all located components

In preparation for ppi4hpc we needed to shut down half of JURECA

Slurm - delete all components at Row0[2,3] from config

We should delete the disassembled computes from the Slurm config

Move Singularity installation from Easybuild to RPM

Singularity was loaded as an EasyBuild module before. Because we have the nosuid bit set on the shared filesystems, this no longer works, so we moved to an RPM-based installation. singularity is now in the path by default, but users have to be part of the container group. We will have an automated JuDoor workflow for joining that group, but this is not implemented yet. Because of this, a wrapper currently refers users to sc@fz-juelich.de. Please assign me to the tickets that pop up there.

2020-06-23 Maintenance (Benedikt von St. Vieth)

Update type: Maintenance

Reenable mmpmon/gpumon and add to HC

Because we squash root, the services are no longer able to write to GPFS. We now use another mechanism.

Bring IME into production
  • prepare GCs based on Ansible Role

  • add HC entries for filesystem and systemd service

On the Booster, jobs fail with OPA errors (libpsm2 bug)

Update libpsm2 to libpsm2-11.2.166-1.x86_64 to circumvent a bug that had been introduced earlier.

2020-02-04 New PGI compiler and Intel MPI version (Damian Alvarez)

Update type: SW Modules

  • PGI 19.10 installed (but not default)

  • IntelMPI 2019.6 installed (but not default)

2020-01-21 Maintenance-HPST-IB-extension (Benedikt von St. Vieth)

Update type: Maintenance, Network

MVAPICH issues due to wrong infiniband-diags version

With the update to OFED 4.7 in the last maintenance, an upstream infiniband-diags package was installed.

MVAPICH needs the following file:

[root@jrc0001 ~]# yum provides /usr/lib64/libibmad.so.12
libibmad-5.4.0.MLNX20190423.1d917ae-0.1.47100.x86_64 : OpenFabrics Alliance InfiniBand MAD library

but libibmad cannot be installed, because yum thinks libibmad is obsoleted by the upstream infiniband-diags, which only provides

/usr/lib64/libibmad.so.5

HPST IB extension - 3 Line-Cards and 72 Infiniband cables

1a. Online maintenance 2020-01-06

  • IB root switch extended by 3 line cards for HPST integration: jrs02, jrs03, jrs04 extended by one line card each.

1b. DDN optical cable installation

  • Two DDN technicians will install the 72 cables during 2020-01-06 to 2020-01-08 while the newly installed cards and affected ports in jrs01 are disabled.

2a. Offline maintenance 2020-01-21: line card activation / port enabling

  • While GPFS is unmounted, we will online the ports of the newly installed cables and analyze the health state of the fabric during the offline maintenance.

This was only partially done due to a lack of cables.

Update IME Client Software

In preparation for tomorrow's maintenance, the Ansible role was adjusted to update IME to

Jan 20 15:22:35 Updated: ime-common-1.3.1.1-131143.el7.x86_64
Jan 20 15:22:35 Updated: ime-ulockmgr-1.3.1.1-131143.el7.x86_64
Jan 20 15:22:36 Updated: ime-client-1.3.1.1-131143.el7.x86_64
Jan 20 15:22:37 Updated: ime-net-cci-1.3.1.1-131143.el7.x86_64
Jan 20 15:23:02 Updated: libcci-0.1.b8.ddn1.56-el7.x86_64
Jan 20 15:23:03 Updated: libisal-2.16.0.ddn2-el7.x86_64

and place a config there. The config is still temporary, but it has at least some content.

2019-12-10 Maintenance (Benedikt von St. Vieth)

Update type: Maintenance, OS Packages, SW Modules

Maintenance 2019-12-10

Update Slurm to 19.05.4-1.20191203git1b8453f491

jutil: update to version 19.12.0
  • bash completion support

  • more privileges to members of parateam group

  • all users can query the dataquota of all members in a project/group

  • better performance, bug fixes, and improved output formats with more options

Flexible module naming scheme
  • The user modules in production have been adapted to work with a flexible module naming scheme. Minor updates of compilers and MPIs are possible without full toolchain duplication now.

Update to CentOS 7.7

Together with

  • Kernel 3.10.0-1062.7.1

  • GPFS 5.0.4-1

  • OFED 4.7

  • OPA 10.10.0.0-445

12:dhclient-4.2.5-77.el7.centos.x86_64
12:dhcp-common-4.2.5-77.el7.centos.x86_64
12:dhcp-libs-4.2.5-77.el7.centos.x86_64
14:tcpdump-4.9.2-4.el7_7.1.x86_64
1:cups-libs-1.6.3-40.el7.x86_64
1:dmidecode-3.2-3.el7.x86_64
1:grub2-2.02-0.80.el7.centos.x86_64
1:grub2-common-2.02-0.80.el7.centos.noarch
1:grub2-efi-x64-2.02-0.80.el7.centos.x86_64
1:grub2-efi-x64-modules-2.02-0.80.el7.centos.noarch
1:grub2-pc-2.02-0.80.el7.centos.x86_64
1:grub2-pc-modules-2.02-0.80.el7.centos.noarch
1:grub2-tools-2.02-0.80.el7.centos.x86_64
1:grub2-tools-extra-2.02-0.80.el7.centos.x86_64
1:grub2-tools-minimal-2.02-0.80.el7.centos.x86_64
1:make-3.82-24.el7.x86_64
1:mariadb-libs-5.5.64-1.el7.x86_64
1:net-snmp-libs-5.7.2-43.el7.x86_64
1:nfs-utils-1.3.0-0.65.el7.x86_64
1:opa-address-resolution-10.10.0.0-445.el7.x86_64
1:opa-basic-tools-10.10.0.0-445.el7.x86_64
1:openssl-1.0.2k-19.el7.x86_64
1:openssl-devel-1.0.2k-19.el7.x86_64
1:openssl-libs-1.0.2k-19.el7.x86_64
1:quota-4.01-19.el7.x86_64
1:quota-nls-4.01-19.el7.noarch
1:smartmontools-7.0-1.el7.x86_64
2:ethtool-4.8-10.el7.x86_64
2:microcode_ctl-2.1-53.3.el7_7.x86_64
2:nmap-ncat-6.40-19.el7.x86_64
2:shadow-utils-4.6-5.el7.x86_64
2:vim-common-7.4.629-6.el7.x86_64
2:vim-enhanced-7.4.629-6.el7.x86_64
2:vim-filesystem-7.4.629-6.el7.x86_64
2:vim-minimal-7.4.629-6.el7.x86_64
32:bind-export-libs-9.11.4-9.P2.el7.x86_64
32:bind-libs-9.11.4-9.P2.el7.x86_64
32:bind-libs-lite-9.11.4-9.P2.el7.x86_64
32:bind-license-9.11.4-9.P2.el7.noarch
32:bind-utils-9.11.4-9.P2.el7.x86_64
3:irqbalance-1.0.7-12.el7.x86_64
3:mcelog-144-10.94d853b2ea81.el7.x86_64
3:nvidia-driver-latest-dkms-418.87.00-2.el7.x86_64
3:nvidia-driver-latest-dkms-cuda-418.87.00-2.el7.x86_64
3:nvidia-driver-latest-dkms-cuda-libs-418.87.00-2.el7.x86_64
3:nvidia-driver-latest-dkms-devel-418.87.00-2.el7.x86_64
3:nvidia-driver-latest-dkms-libs-418.87.00-2.el7.x86_64
3:nvidia-driver-latest-dkms-NvFBCOpenGL-418.87.00-2.el7.x86_64
3:nvidia-driver-latest-dkms-NVML-418.87.00-2.el7.x86_64
3:nvidia-modprobe-latest-dkms-418.87.00-2.el7.x86_64
3:nvidia-persistenced-latest-dkms-418.87.00-2.el7.x86_64
3:nvidia-xconfig-latest-dkms-418.87.00-2.el7.x86_64
7:device-mapper-1.02.158-2.el7_7.2.x86_64
7:device-mapper-event-1.02.158-2.el7_7.2.x86_64
7:device-mapper-event-libs-1.02.158-2.el7_7.2.x86_64
7:device-mapper-libs-1.02.158-2.el7_7.2.x86_64
7:lvm2-2.02.185-2.el7_7.2.x86_64
7:lvm2-libs-2.02.185-2.el7_7.2.x86_64
alsa-lib-1.1.8-1.el7.x86_64
audit-2.8.5-4.el7.x86_64
audit-libs-2.8.5-4.el7.x86_64
bash-4.2.46-33.el7.x86_64
binutils-2.27-41.base.el7_7.1.x86_64
biosdevname-0.7.3-2.el7.x86_64
ca-certificates-2019.2.32-76.el7_7.noarch
cairo-1.15.12-4.el7.x86_64
cairo-gobject-1.15.12-4.el7.x86_64
centos-release-7-7.1908.0.el7.centos.x86_64
coreutils-8.22-24.el7.x86_64
cpp-4.8.5-39.el7.x86_64
cronie-1.4.11-23.el7.x86_64
cronie-anacron-1.4.11-23.el7.x86_64
cryptsetup-libs-2.0.3-5.el7.x86_64
curl-7.29.0-54.el7_7.1.x86_64
dapl-2.1.10mlnx-OFED.3.4.2.1.0.47100.x86_64
device-mapper-persistent-data-0.8.5-1.el7.x86_64
diffutils-3.3-5.el7.x86_64
dnsmasq-2.76-10.el7_7.1.x86_64
dracut-033-564.el7.x86_64
dracut-config-rescue-033-564.el7.x86_64
dracut-network-033-564.el7.x86_64
dyninst-9.3.1-3.el7.x86_64
e2fsprogs-1.42.9-16.el7.x86_64
e2fsprogs-libs-1.42.9-16.el7.x86_64
efivar-libs-36-12.el7.x86_64
elfutils-0.176-2.el7.x86_64
elfutils-default-yama-scope-0.176-2.el7.noarch
elfutils-libelf-0.176-2.el7.x86_64
elfutils-libs-0.176-2.el7.x86_64
firewalld-0.6.3-2.el7_7.2.noarch
firewalld-filesystem-0.6.3-2.el7_7.2.noarch
freeipmi-1.5.7-3.el7.x86_64
freetype-2.8-14.el7.x86_64
fuse-2.9.2-11.el7.x86_64
gcc-4.8.5-39.el7.x86_64
gcc-c++-4.8.5-39.el7.x86_64
gcc-gfortran-4.8.5-39.el7.x86_64
gdb-7.6.1-115.el7.x86_64
gdrcopy-kmod-3.10.0-1062.7.1.el7.x86_64-2.0-3.el7.x86_64
GeoIP-1.5.0-14.el7.x86_64
geoipupdate-2.5.0-1.el7.x86_64
glib2-2.56.1-5.el7.x86_64
glibc-2.17-292.el7.i686
glibc-2.17-292.el7.x86_64
glibc-common-2.17-292.el7.x86_64
glibc-devel-2.17-292.el7.x86_64
glibc-headers-2.17-292.el7.x86_64
gpfs.base-5.0.4-0.x86_64
gpfs.base-5.0.4-1.x86_64
gpfs.docs-5.0.4-0.noarch
gpfs.docs-5.0.4-1.noarch
gpfs.gplbin-3.10.0-1062.7.1.el7.x86_64-5.0.4-0.el7.x86_64
gpfs.gplbin-3.10.0-1062.7.1.el7.x86_64-5.0.4-1.el7.x86_64
gpfs.gplbin-3.10.0-957.27.2.el7.x86_64-5.0.4-0.el7.x86_64
gpfs.gplbin-3.10.0-957.27.2.el7.x86_64-5.0.4-1.el7.x86_64
gpfs.msg.en_US-5.0.4-0.noarch
gpfs.msg.en_US-5.0.4-1.noarch
gpm-libs-1.20.7-6.el7.x86_64
grubby-8.28-26.el7.x86_64
gssproxy-0.7.0-26.el7.x86_64
hcoll-4.4.2938-1.47100.x86_64
hostname-3.13-3.el7_7.1.x86_64
http-parser-2.7.1-8.el7.x86_64
hwdata-0.252-9.3.el7.x86_64
ibacm-22.1-3.el7.x86_64
ibacm-47mlnx1-1.47100.x86_64
ibdump-5.0.0-3.47100.x86_64
ibutils-1.5.7.1-0.12.gdcaeae2.47100.x86_64
ime-client-1.3.0-1639.el7.x86_64
ime-common-1.3.0-1639.el7.x86_64
ime-net-cci-1.3.0-1639.el7.x86_64
ime-ulockmgr-1.3.0-1639.el7.x86_64
infiniband-diags-2.1.0-1.el7.x86_64
infiniband-diags-47mlnx1-1.47100.x86_64
infiniband-diags-compat-47mlnx1-1.47100.x86_64
initscripts-9.49.47-1.el7.x86_64
iproute-4.11.0-25.el7_7.2.x86_64
ipset-7.1-1.el7.x86_64
ipset-libs-7.1-1.el7.x86_64
iptables-1.4.21-33.el7.x86_64
ipxe-bootimgs-20180825-2.git133f4c.el7.noarch
jsc-slurm-plugins-1.2-19054100.20191023git2fc3e8f.el7.x86_64
jsc-slurm-plugins-cuda-1.2-19054100.20191023git2fc3e8f.el7.x86_64
jsc-slurm-plugins-globres-1.2-19054100.20191023git2fc3e8f.el7.x86_64
jsc-slurm-plugins-noturbo-1.2-19054100.20191023git2fc3e8f.el7.x86_64
jsc-slurm-plugins-perfparanoid-1.2-19054100.20191023git2fc3e8f.el7.x86_64
jsc-slurm-plugins-perftool-1.2-19054100.20191023git2fc3e8f.el7.x86_64
jsc-slurm-plugins-showglobres-1.2-19054100.20191023git2fc3e8f.el7.x86_64
jsc-slurm-plugins-vis-1.2-19054100.20191023git2fc3e8f.el7.x86_64
jsc-slurm-plugins-x11-1.2-19054100.20191023git2fc3e8f.el7.x86_64
kernel-3.10.0-1062.7.1.el7.x86_64
kernel-headers-3.10.0-1062.7.1.el7.x86_64
kernel-tools-3.10.0-1062.7.1.el7.x86_64
kernel-tools-libs-3.10.0-1062.7.1.el7.x86_64
kexec-tools-2.0.15-33.el7.x86_64
kmod-20-25.el7.x86_64
kmod-ifs-kernel-updates-3.10.0-1062.7.1.el7.x86_64-10.10.0.0.445-1880.x86_64
kmod-ifs-kernel-updates-3.10.0-957.27.2.el7.x86_64-10.10.0.0.445-1880.x86_64
kmod-ifs-kernel-updates-3.10.0-957.5.1.el7.x86_64-10.10.0.0.445-1880.x86_64
kmod-kernel-mft-mlnx-3.10.0-1062.7.1.el7.x86_64-4.13.0-1.x86_64
kmod-kernel-mft-mlnx-3.10.0-957.27.2.el7.x86_64-4.13.0-1.x86_64
kmod-libs-20-25.el7.x86_64
kmod-mlnx-ofa_kernel-3.10.0-1062.7.1.el7.x86_64-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64
kmod-mlnx-ofa_kernel-3.10.0-957.27.2.el7.x86_64-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64
kmod-rapl-3.10.0-1062.7.1.el7.x86_64-1.0-10.20160415git8b73fdd.el7.x86_64
kpartx-0.4.9-127.el7.x86_64
krb5-devel-1.15.1-37.el7_7.2.x86_64
krb5-libs-1.15.1-37.el7_7.2.x86_64
libatomic-4.8.5-39.el7.x86_64
libblkid-2.23.2-61.el7_7.1.x86_64
libcap-2.22-10.el7.x86_64
libcci-0.1.b8.ddn1.55-el7.x86_64
libcom_err-1.42.9-16.el7.x86_64
libcom_err-devel-1.42.9-16.el7.x86_64
libcurl-7.29.0-54.el7_7.1.x86_64
libdb-5.3.21-25.el7.x86_64
libdb-utils-5.3.21-25.el7.x86_64
libdrm-2.4.97-2.el7.x86_64
libfabric-1.7.1-0.x86_64
libgcc-4.8.5-39.el7.x86_64
libgfortran-4.8.5-39.el7.x86_64
libgomp-4.8.5-39.el7.x86_64
libgudev1-219-67.el7_7.2.x86_64
libibcm-41mlnx1-OFED.4.1.0.1.0.47100.x86_64
libibcm-devel-41mlnx1-OFED.4.1.0.1.0.47100.x86_64
libibumad-22.1-3.el7.x86_64
libibumad-47mlnx1-1.47100.x86_64
libibverbs-22.1-3.el7.x86_64
libibverbs-47mlnx1-1.47100.x86_64
libibverbs-utils-47mlnx1-1.47100.x86_64
libicu-50.2-3.el7.x86_64
libipa_hbac-1.16.4-21.el7_7.1.x86_64
libisal-2.16.0-el7.x86_64
libjpeg-turbo-1.2.90-8.el7.x86_64
libkadm5-1.15.1-37.el7_7.2.x86_64
libldb-1.4.2-1.el7.x86_64
libmount-2.23.2-61.el7_7.1.x86_64
libndp-1.2-9.el7.x86_64
libpsm2-11.2.86-1.x86_64
libpsm2-devel-11.2.86-1.x86_64
libquadmath-4.8.5-39.el7.x86_64
libquadmath-devel-4.8.5-39.el7.x86_64
librdmacm-22.1-3.el7.x86_64
librdmacm-47mlnx1-1.47100.x86_64
librdmacm-utils-47mlnx1-1.47100.x86_64
libsmartcols-2.23.2-61.el7_7.1.x86_64
libsmbclient-4.9.1-10.el7_7.x86_64
libss-1.42.9-16.el7.x86_64
libssh2-1.8.0-3.el7.x86_64
libsss_autofs-1.16.4-21.el7_7.1.x86_64
libsss_certmap-1.16.4-21.el7_7.1.x86_64
libsss_idmap-1.16.4-21.el7_7.1.x86_64
libsss_nss_idmap-1.16.4-21.el7_7.1.x86_64
libsss_sudo-1.16.4-21.el7_7.1.x86_64
libstdc++-4.8.5-39.el7.x86_64
libstdc++-devel-4.8.5-39.el7.x86_64
libtalloc-2.1.14-1.el7.x86_64
libtdb-1.3.16-1.el7.x86_64
libteam-1.27-9.el7.x86_64
libtevent-0.9.37-1.el7.x86_64
libtiff-4.0.3-32.el7.x86_64
libtirpc-0.2.4-0.16.el7.x86_64
libuuid-2.23.2-61.el7_7.1.x86_64
libwbclient-4.9.1-10.el7_7.x86_64
libX11-1.6.7-2.el7.x86_64
libX11-common-1.6.7-2.el7.noarch
libX11-devel-1.6.7-2.el7.x86_64
libxkbcommon-0.7.1-3.el7.x86_64
libXxf86misc-1.0.3-7.1.el7.x86_64
linux-firmware-20190429-72.gitddde598.el7.noarch
lm_sensors-libs-3.4.0-8.20160601gitf9185e5.el7.x86_64
lz4-1.7.5-3.el7.x86_64
mesa-filesystem-18.3.4-5.el7.x86_64
mesa-libEGL-18.3.4-5.el7.x86_64
mesa-libgbm-18.3.4-5.el7.x86_64
mesa-libGL-18.3.4-5.el7.x86_64
mesa-libglapi-18.3.4-5.el7.x86_64
mft-4.13.0-102.x86_64
mlnx-ofa_kernel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.x86_64
mlnxofed-docs-4.7-1.0.0.1.noarch
mstflint-4.13.0-1.41.g4e8819c.47100.x86_64
mxm-3.7.3112-1.47100.x86_64
ncdu-1.14.1-1.el7.x86_64
net-tools-2.0-0.25.20131004git.el7.x86_64
nscd-2.17-292.el7.x86_64
nspr-4.21.0-1.el7.x86_64
nss-3.44.0-4.el7.x86_64
nss-pem-1.0.3-7.el7.x86_64
nss-softokn-3.44.0-5.el7.x86_64
nss-softokn-freebl-3.44.0-5.el7.i686
nss-softokn-freebl-3.44.0-5.el7.x86_64
nss-sysinit-3.44.0-4.el7.x86_64
nss-tools-3.44.0-4.el7.x86_64
nss-util-3.44.0-3.el7.x86_64
ntp-4.2.6p5-29.el7.centos.x86_64
ntpdate-4.2.6p5-29.el7.centos.x86_64
numactl-2.0.12-3.el7_7.1.x86_64
numactl-libs-2.0.12-3.el7_7.1.x86_64
nvidia-kmod-3.10.0-1062.7.1.el7.x86_64-418.87.00-2.el7.x86_64
nvidia-kmod-3.10.0-957.27.2.el7.x86_64-418.87.00-2.el7.x86_64
nvidia-peer-memory-1.0-734.el7.x86_64
nvidia-peer-memory-kmod-3.10.0-1062.7.1.el7.x86_64-1.0-734.el7.x86_64
nvidia-peer-memory-kmod-3.10.0-957.27.2.el7.x86_64-1.0-734.el7.x86_64
nvidia-uvm-kmod-3.10.0-1062.7.1.el7.x86_64-418.87.00-2.el7.x86_64
ofed-scripts-4.7-OFED.4.7.1.0.0.x86_64
OpenIPMI-2.0.27-1.el7.x86_64
OpenIPMI-libs-2.0.27-1.el7.x86_64
OpenIPMI-modalias-2.0.27-1.el7.x86_64
opensm-libs-3.3.21-2.el7.x86_64
opensm-libs-5.5.0.MLNX20190923.1c78385-0.1.47100.x86_64
openssh-7.4p1-21.el7.x86_64
openssh-clients-7.4p1-21.el7.x86_64
openssh-server-7.4p1-21.el7.x86_64
pango-1.42.4-4.el7_7.x86_64
parted-3.1-31.el7.x86_64
passwd-0.79-5.el7.x86_64
patch-2.7.1-12.el7_7.x86_64
perf-3.10.0-1062.7.1.el7.x86_64
perftest-4.4-0.8.g7af08be.47100.x86_64
plymouth-0.8.9-0.32.20140113.el7.centos.x86_64
plymouth-core-libs-0.8.9-0.32.20140113.el7.centos.x86_64
plymouth-scripts-0.8.9-0.32.20140113.el7.centos.x86_64
policycoreutils-2.5-33.el7.x86_64
polkit-0.112-22.el7_7.1.x86_64
procps-ng-3.3.10-26.el7_7.1.x86_64
psmisc-22.20-16.el7.x86_64
pytalloc-2.1.14-1.el7.x86_64
python-2.7.5-86.el7.x86_64
python2-clustershell-1.8.2-1.el7.noarch
python2-rpm-macros-3-32.el7.noarch
python-babel-0.9.6-8.el7.noarch
python-chardet-2.2.1-3.el7.noarch
python-devel-2.7.5-86.el7.x86_64
python-firewall-0.6.3-2.el7_7.2.noarch
python-jinja2-2.7.2-4.el7.noarch
python-libs-2.7.5-86.el7.x86_64
python-markupsafe-0.11-10.el7.x86_64
python-perf-3.10.0-1062.7.1.el7.x86_64
python-rpm-macros-3-32.el7.noarch
python-srpm-macros-3-32.el7.noarch
python-sssdconfig-1.16.4-21.el7_7.1.noarch
qperf-0.4.9-9.47100.x86_64
rdma-core-22.1-3.el7.x86_64
rdma-core-47mlnx1-1.47100.x86_64
rdma-core-devel-22.1-3.el7.x86_64
rdma-core-devel-47mlnx1-1.47100.x86_64
readline-6.2-11.el7.x86_64
redhat-rpm-config-9.1.0-88.el7.centos.noarch
rpcbind-0.2.0-48.el7.x86_64
rpm-4.11.3-40.el7.x86_64
rpm-build-4.11.3-40.el7.x86_64
rpm-build-libs-4.11.3-40.el7.x86_64
rpm-libs-4.11.3-40.el7.x86_64
rpm-python-4.11.3-40.el7.x86_64
rsyslog-8.24.0-41.el7_7.2.x86_64
samba-client-libs-4.9.1-10.el7_7.x86_64
samba-common-4.9.1-10.el7_7.noarch
samba-common-libs-4.9.1-10.el7_7.x86_64
samba-common-tools-4.9.1-10.el7_7.x86_64
samba-libs-4.9.1-10.el7_7.x86_64
selinux-policy-3.13.1-252.el7_7.6.noarch
selinux-policy-targeted-3.13.1-252.el7_7.6.noarch
sepdk-kmod-3.10.0-1062.7.1.el7.x86_64-4.1-4.20180625snap.el7.x86_64
sharp-2.0.0.MLNX20190922.a9ebf22-1.47100.x86_64
slurm-19.05.4-1.20191203git1b8453f491.el7.x86_64
sos-3.7-10.el7.centos.noarch
sssd-1.16.4-21.el7_7.1.x86_64
sssd-ad-1.16.4-21.el7_7.1.x86_64
sssd-client-1.16.4-21.el7_7.1.x86_64
sssd-common-1.16.4-21.el7_7.1.x86_64
sssd-common-pac-1.16.4-21.el7_7.1.x86_64
sssd-ipa-1.16.4-21.el7_7.1.x86_64
sssd-krb5-1.16.4-21.el7_7.1.x86_64
sssd-krb5-common-1.16.4-21.el7_7.1.x86_64
sssd-ldap-1.16.4-21.el7_7.1.x86_64
sssd-proxy-1.16.4-21.el7_7.1.x86_64
sudo-1.8.23-4.el7_7.1.x86_64
sysstat-10.1.5-18.el7.x86_64
systemd-219-67.el7_7.2.x86_64
systemd-libs-219-67.el7_7.2.x86_64
systemd-sysv-219-67.el7_7.2.x86_64
systemtap-client-4.0-10.el7_7.x86_64
systemtap-runtime-4.0-10.el7_7.x86_64
teamd-1.27-9.el7.x86_64
tzdata-2019c-1.el7.noarch
ucx-1.7.0-1.47100.x86_64
unzip-6.0-20.el7.x86_64
urw-base35-bookman-fonts-20170801-10.el7.noarch
urw-base35-c059-fonts-20170801-10.el7.noarch
urw-base35-d050000l-fonts-20170801-10.el7.noarch
urw-base35-fonts-20170801-10.el7.noarch
urw-base35-fonts-common-20170801-10.el7.noarch
urw-base35-gothic-fonts-20170801-10.el7.noarch
urw-base35-nimbus-mono-ps-fonts-20170801-10.el7.noarch
urw-base35-nimbus-roman-fonts-20170801-10.el7.noarch
urw-base35-nimbus-sans-fonts-20170801-10.el7.noarch
urw-base35-p052-fonts-20170801-10.el7.noarch
urw-base35-standard-symbols-ps-fonts-20170801-10.el7.noarch
urw-base35-z003-fonts-20170801-10.el7.noarch
util-linux-2.23.2-61.el7_7.1.x86_64
vulkan-filesystem-1.1.97.0-1.el7.noarch
xfsprogs-4.5.0-20.el7.x86_64
xorg-x11-server-common-1.20.4-7.el7.x86_64
xorg-x11-server-utils-7.7-20.el7.x86_64
xorg-x11-server-Xorg-1.20.4-7.el7.x86_64
yum-3.4.3-163.el7.centos.noarch
yum-plugin-fastestmirror-1.1.31-52.el7.noarch
yum-utils-1.1.31-52.el7.noarch

2019-11-05 Maintenance 2019-11-05 (Benedikt von St. Vieth)

Update type: Maintenance, Batch system

UFM REST API not showing all nodes connected to the fabric

At the moment we see all nodes within UFM, but when we query its REST API, only parts of the JRQ systems are shown. This was solved during today's maintenance.

JR-Booster WCDs/PDUs FW update - RPC2 communications module 14.0.0.3

As per Dell/Vertiv, there is a new PDU/WCD RPC2 communications module FW available: 14.0.0.3

Mellanox FW update

The following updates are available for Infiniband equipment:

  • CS7500 -> image-X86_64-3.8.2004.img

  • SX6036G -> image-PPC_M460EX-3.6.8012.img

  • SB7790 -> fw-SwitchIB-rel-11_2000_2046-MSB7790-E_Ax.bin

  • MCX455A -> fw-ConnectX4-rel-12_25_1020-MCX455A-ECA_Ax-UEFI-14.18.19-FlexBoot-3.5.701.bin.zip

gdrcopy for JURECA

GDRCopy is meanwhile no longer classified by NVIDIA as merely a test/PoC, and there is now an official release: https://github.com/NVIDIA/gdrcopy/releases

Update psmgmt-5.1.26-0

psmgmt version psmgmt-5.1.26-0 is available.

psmgmt will be updated on all JURECA nodes.

change log:

Version 5.1.26:
===============
Bugfixes:
 - Prevent psgw plugin from crashing the daemon (j3t:#329)
 - Fix segfault when late srun replies arrive after step is gone (jrt:10050)
 - Ensure PSI_recvMsg() ignores interupted read() (jwt:#2494)
 - Let psslurm delay tasks via PSIDHOOK_RECV_SPAWNREQ (jwt:#2515)
 - Prevent segmentation fault if username resolution failes (jwt:#4234)
 - Ensure step in callback is still valid (jrt:#10122)
 - Consider byte-order when dropping SPAWNREQUEST (jwt:#4282)
 - Make PMI parameters fit into the line (psc:#332)
 - Add missing offset to SLURM_GTIDS for pack jobs (pct:#334)
 - Fix segfault in gres environment parsing
 - Ensure dupSlurmMsg() copies the complete structure
 - Prevent possible segfault in psgw plugin
 - Fix potential memory leaks unveiled by scan-build
 - Ensure to actually exit on PSIlog_exit().
Enhancements:
 - Add option --gw_psgwd_per_node start multiple psgwd on a gateway node
 - Show gres IDs at psslurm startup
 - Only report step timeout message on mother superior
 - Unify gres error reporting
 - Add option --gw_verbose to report psgw startup errors to file
 - Improve handling of psgw error files
 - Don't send message to parent known to be dead
Additional changes:
 - Adopt psroute.py to start multiple psgwd on a gateway node
The complete change log list can be found at:
https://github.com/ParaStation/psmgmt/blob/master/NEWS

Booster Firmware Update

ALL: iDRAC 2.70.70.70: https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=dnh17

C6320:
  • BIOS 2.1.2 -> 2.2.0: https://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=P60HW
  • NIC 18.8.9 -> 19.0.12: https://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=GK57C

R630:
  • BIOS 2.9.1 -> 2.10.5: https://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=1RKPD
  • NIC 18.8.9 -> 19.0.12: https://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=T6HGD
  • Backplane 2.23 -> 2.25: https://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=HRP1V

R430:
  • BIOS 2.9.1 -> 2.10.5: https://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=VH9R0
  • NIC 20.8.4 -> 21.40.21: https://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=K99RK

As soon as iDRAC firmware 2.70.x with the important fix becomes available, I will follow up.

Update Slurm to 19.05

On juropa3exp we have tested Slurm 19.05 with psmgmt 5.1.26, and we have the green light to upgrade JURECA as well.

This action happened during today's offline maintenance.

2019-10-30 Beginning of the changelog (Benedikt von St. Vieth)

Update type: Announcement

Beginning of the changelog for JURECA