Changelog
Current state
Installed software
Software | Version | Description |
---|---|---|
Rocky Linux | | |
Kernel Version | | |
NVIDIA GPU Driver | | |
OFED | | |
Slurm | | |
ParaStation Management | | |
GPFS | | |
Apptainer | | |
PMIx | | |
Default Software Stage | | |
Changelog entries
2025-09-22 Update UCX
Update type: SW Modules
UCX has been changed to 1.18.1 from 1.17.0
2025-09-09 Software update
Update type: OS Packages and SW Modules
OS Packages
Rocky Linux has been updated to 9.6 (from 9.5)
Kernel Version has been updated to 5.14.0-570.32.1.el9_6 (from 5.14.0-503.40.1.el9_5)
NVIDIA GPU Driver has been updated to 580.65.06 (from 570.133.20)
Slurm has been updated to 24.11.6-1.20250807git03d01a9 (from 24.11.5-1.20250602git2ed9014)
ParaStation Management has been updated to 6.4.1 (from 6.3.0)
GPFS has been updated to 5.2.3-2 (from 5.2.2-1.12)
Apptainer has been updated to 1.4.1-1 (from 1.3.6-1)
UCX-settings
UCX_CUDA_COPY_DMABUF=no has been removed for the UCX-settings/[RC,UD,DC]-CUDA modules, since it is no longer necessary to prevent crashes, and it actually causes a performance regression with the latest OFED and NVIDIA driver.
2025-07-24 Software update
Update type: OS Packages
OS Packages
ParaStation Management has been updated to 6.3.0 (from 6.2.3)
2025-06-24 Software update
Update type: OS Packages and Firmware
Firmware
ConnectX-6 HCAs have been updated to firmware version 20.43.2566
OS Packages
Kernel Version has been updated to 5.14.0-503.40.1.el9_5 (from 5.14.0-503.38.1.el9_5)
OFED has been updated to 25.04-OFED.25.04.0.6.0.1 (from 25.01-OFED.25.01.0.6.0.1)
Slurm has been updated to 24.11.5-1.20250602git2ed9014 (from 23.11.10-1.20240920git20c5755)
GPFS has been updated to 5.2.2-1.12 (from 5.2.2-1)
ParaStation Management has been updated to 6.2.3 (from 6.1.1)
PMIx has been updated to 5.0.8 (from 5.0.6)
2025-04-29 Software update
Update type: OS Packages
OS Packages
Kernel Version has been updated to 5.14.0-503.38.1.el9_5 (from 5.14.0-503.23.1.el9_5)
NVIDIA GPU Driver has been updated to 570.133.20 (from 570.86.15)
2025-03-20 Software update
Update type: OS Packages, SLURM configuration
OS Packages
Rocky Linux has been updated to 9.5 (from 9.4)
Kernel Version has been updated to 5.14.0-503.23.1.el9_5 (from 5.14.0-427.33.1.el9_4)
NVIDIA GPU Driver has been updated to 570.86.15 (from 560.35.03)
OFED has been updated to 25.01-OFED.25.01.0.6.0.1 (from 24.07-OFED.24.07.0.6.1.1)
GPFS has been updated to 5.2.2-1 (from 5.1.9-4)
PMIx has been updated to 5.0.6 (from 4.2.9)
SLURM Configuration
Cgroup constraints have been enabled for (GPU) devices, so job steps can access only the GPUs they requested
Update type: SW Modules
OpenMPI has been recompiled to incorporate this patch
2025-02-27 MemoryMax
Update type: Login nodes
MemoryMax has been set to 25% on individual user slices on login nodes
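As a rough illustration of what a 25% limit means in absolute terms, the sketch below computes the cap for a given amount of installed memory. It assumes the percentage is taken relative to the physical memory of the node, as systemd documents for MemoryMax=. The 512 GiB figure is a made-up example, not the actual size of the login nodes, and the function name is hypothetical.

```python
GIB = 1024 ** 3


def memory_max_bytes(total_bytes, percent=25.0):
    """Return the absolute cap implied by a percentage-based memory limit."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return int(total_bytes * percent / 100)


# Hypothetical login node with 512 GiB of RAM (assumed value for illustration).
limit = memory_max_bytes(512 * GIB)
print(f"per-user limit: {limit / GIB:.0f} GiB")
```

Processes of a single user that together exceed this cap are subject to the kernel's cgroup memory enforcement, so memory-hungry work belongs on compute nodes.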
2025-02-05 Change MPI-settings for OpenMPI
Update type: SW Modules
As of 2025 romio321 is not working, so we have disabled the selection of romio321 in the MPI-settings, giving OpenMPI the freedom to choose and prioritize. Currently ompio is selected.
2025-01-15 Default UCX-settings module
Update type: SW Modules
RC-CUDA has been made the default module for UCX-settings in the 2025 stage. Until now it was UD by mistake.
2024-12-18 Software update
Update type: OS Packages
OS Packages
ParaStation Management has been updated to 5.1.63 (from 5.1.62)
2024-12-11 Software update
Update type: OS Packages
OS Packages
Apptainer has been updated to 1.3.6-1 (from 1.3.2-1)
2024-10-30 Software update
Update type: OS Packages
OS Packages
Rocky Linux has been updated to 9.4 (from 8.10)
Kernel Version has been updated to 5.14.0-427.33.1.el9_4 (from 4.18.0-553.el8_10)
NVIDIA GPU Driver has been updated to 560.35.03 (from 550.54.15)
OFED has been updated to 24.07-OFED.24.07.0.6.1.1 (from 24.04-OFED.24.04.0.6.6.1)
Slurm has been updated to 23.11.10-1.20240920git20c5755 (from 23.02.7-1.20240328git405c820)
ParaStation Management has been updated to 5.1.62 (from 5.1.61)
GPFS has been updated to 5.1.9-4 (from 5.1.9-3)
2024-08-08 Software update
Update type: OS Packages, Network
OS Packages
Slurm has been updated to 23.02.7-1.20240328git405c820 (from 22.05.11-1.20231215gitc756517)
ParaStation Management has been updated to 5.1.61 (from 5.1.59)
PMIx has been updated to 4.2.9 (from 4.2.7)
Network
The Skyway firmware has been updated to 8.2.2302
2024-06-14 Subnet Manager Update (Damian Alvarez)
Update type: Network
Subnet Manager updated to mlnx_ib_mgmt-5.19.1
2024-06-14 Software update
Update type: OS Packages
OS Packages
Rocky Linux has been updated to 8.10 (from 8.9)
Kernel Version has been updated to 4.18.0-553.el8_10 (from 4.18.0-513.18.1.el8_9)
NVIDIA GPU Driver has been updated to 550.54.15 (from 535.154.05)
OFED has been updated to 24.04-OFED.24.04.0.6.6.1 (from 23.10-OFED.23.10.1.1.9.1)
GPFS has been updated to 5.1.9-3 (from 5.1.9-1)
Apptainer has been updated to 1.3.2-1 (from 1.2.3-1)
2024-01-16 Software update (Damian Alvarez)
Update type: OS Packages, Batch system, SW Modules
OS Packages:
General update to Rocky 8.9
SLURM has been updated to 22.05.11-1.20231215gitc756517 (from 22.05.10-2.20231203gitae058ea)
psmgmt has been updated to 5.1.59-1 (from 5.1.58-1)
Kernel 4.18.0-513.11.1.el8_9 (from 4.18.0-477.27.1.el8_8.x86_64)
NVIDIA OFED 23.10-1.1.9.1 (from 23.07-0.5.1.2)
NVIDIA GPU drivers 535.129.03 (from 535.104.12)
GPFS 5.1.9-1 (from 5.1.8-2)
DDN IME 1.5.2-152129 (from 1.5.2-152128)
HCA FW
CX6 cards have been updated to 20.39.2048
Software stack
UCX-settings now loads RC by default in JUWELS Cluster. Before, it was mistakenly loading UD.
2023-12-14 Software update (Damian Alvarez)
Update type: OS Packages, Batch system, SW Modules
OS Packages:
SLURM has been updated to 22.05.10-2.20231203gitae058ea to address newly discovered security issues
psmgmt has been updated to 5.1.58-1
Software stack
netCDF in the 2024 stage has been rebuilt to add support for extra compression libraries
GCC in the 2024 stage has been recompiled to patch some bugs that appeared in combination with PyTorch
2023-10-30 PMIx update (Sebastian Achilles)
Update type: OS Packages
Packages:
PMIx 4.2.6
Configuration:
All OpenMPI installations have been rebuilt to include a patch necessary for the new PMIx
2023-10-19 Software update (Damian Alvarez)
Update type: OS Packages, Batch system
Packages:
Kernel 4.18.0-477.27.1.el8_8.x86_64
NVIDIA OFED 23.07-0.5.1.2
NVIDIA GPU drivers 535.104.12
GPFS 5.1.8-2
Apptainer 1.2.3-1
DDN IME 1.5.2-152128
psmgmt-5.1.56-2
IB Switch firmware 27.2012.1010
Configuration:
ssh now rejects RSA keys
All OpenMPI installations now rely on a user-space provided PMIx
2023-08-30 UCX-settings update (Damian Alvarez, JSC)
Update type: SW Modules
The UCX-settings/*CUDA modules also set UCX_RNDV_FRAG_MEM_TYPE=cuda. This enables the GPU to initiate transfers of CUDA managed buffers. It can yield a large speed-up when Unified Memory (cudaMallocManaged()) is used, as staging of data is avoided.
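For experiments outside the module system, the same variable can be set by hand for a single launch. The following is only a sketch: the helper name is hypothetical, and on JUWELS the UCX-settings/*CUDA modules already take care of this.

```python
import os


def env_with_cuda_frags(base=None):
    """Return a copy of an environment with UCX_RNDV_FRAG_MEM_TYPE=cuda set."""
    env = dict(os.environ if base is None else base)
    env["UCX_RNDV_FRAG_MEM_TYPE"] = "cuda"
    return env


# The small base environment here is only for illustration.
env = env_with_cuda_frags({"PATH": "/usr/bin"})
print(env["UCX_RNDV_FRAG_MEM_TYPE"])
```

The resulting mapping can be passed as the env argument of subprocess.run when starting an application, which keeps the change local to that one process tree.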
2023-08-10 General maintenance/update (Damian Alvarez, JSC)
Update type: OS Packages, General configuration, Storage, Network, Other
Compute nodes update
The compute nodes are updated to:
Rocky 8.8 (from 8.7)
MOFED 23.04-OFED.23.04.1.1.3.1 (from 5.8-OFED.5.8.2.0.3.1)
GPFS 5.1.8-1 (from 5.1.7-1.5)
NVIDIA driver 535.54.03 (from 525.105.17)
psmgmt 5.1.56-2 (from 5.1.56-1)
2023-08-03 General maintenance/update (Damian Alvarez, JSC)
Update type: OS Packages, General configuration, Storage, Network, Other
Login nodes update
The login nodes are updated to:
Rocky 8.8 (from 8.7)
MOFED 23.04-OFED.23.04.1.1.3.1 (from 5.8-OFED.5.8.2.0.3.1)
GPFS 5.1.8-1 (from 5.1.7-1.5)
NVIDIA driver 535.54.03 (from 525.105.17)
2023-08-01 TS update, psmgmt update (Damian Alvarez, JSC)
Update type: OS Packages, Batch system, Other
The nodes in JUWELS Booster have been updated to psmgmt-5.1.56-1.
Racks [21,29,31-39] in JUWELS Booster have been updated to technical state 068.03
2023-07-31 TS Update (Damian Alvarez, JSC)
Update type: Other
Racks [22-28,30] in JUWELS Booster have been updated to technical state 068.03
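The bracketed rack and node ranges used throughout this changelog, such as [22-28,30] or jwlogin[01-06,10-11,22], are a compact range notation. A minimal sketch of how such an expression expands is given below. The function name is hypothetical, and it handles only a single bracket group per expression, not comma-separated lists of several groups.

```python
import re

# Optional prefix, one bracket group, optional suffix.
_PATTERN = re.compile(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)")


def expand_ranges(expr):
    """Expand e.g. 'jwlogin[01-03,10]' into a list of individual names."""
    match = _PATTERN.fullmatch(expr)
    if match is None:
        return [expr]  # no bracket group: nothing to expand
    prefix, body, suffix = match.groups()
    names = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start          # a single number is a range of one
        width = len(start)          # preserve zero padding such as '01'
        for number in range(int(start), int(end) + 1):
            names.append(f"{prefix}{number:0{width}d}{suffix}")
    return names


print(expand_ranges("[22-28,30]"))
print(expand_ranges("jwlogin[01-06,10-11,22]"))
```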
2023-07-27 TS update, psmgmt update (Damian Alvarez, JSC)
Update type: OS Packages, Batch system, Other
The nodes in JUWELS Cluster have been updated to psmgmt-5.1.56-1. The nodes in JUWELS Booster will be updated on 2023-08-01.
Racks [11-20] in JUWELS Booster have been updated to technical state 068.03
2023-07-26 TS Update (Damian Alvarez, JSC)
Update type: Other
Racks [04-10] in JUWELS Booster have been updated to technical state 068.03
2023-07-25 TS Update (Damian Alvarez, JSC)
Update type: Other
Racks [01-03] in JUWELS Booster have been updated to technical state 068.03
2023-05-23 – 2023-06-19 Rolling update (Ahmed Fahmy, JSC)
Top island nodes have been updated to the following versions:
Kernel 4.18.0-425.19.2.el8_7 (from 4.18.0-425.13.1.el8_7)
OFED 5.8-2.0.3.1 (from 5.8-1.1.2.1)
GPFS 5.1.7-1.5 (from 5.1.7-0)
NVIDIA driver 525.105.17 (from 525.85.12)
Apptainer 1.1.8-1 (from 1.1.5-1)
LXC 5.0.0-1 (from 3.0.4-2)
Slurm 22.05.9-1 (from 22.05.8-1)
Slurm Plugins 2.1 (from 2.0)
2023-05-25 – 2023-05-26 Rolling update (Damian Alvarez, JSC)
Update type: OS Packages, Storage
Compute nodes have been updated to the following versions:
Kernel 4.18.0-425.19.2.el8_7 (from 4.18.0-425.13.1.el8_7)
OFED 5.8-2.0.3.1 (from 5.8-1.1.2.1)
GPFS 5.1.7-1.5 (from 5.1.7-0)
NVIDIA driver 525.105.17 (from 525.85.12)
Apptainer 1.1.8-1 (from 1.1.5-1)
2023-05-23 Emergency maintenance/update (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages, Storage
During a storage outage the jwlogin[01-06,10-11,22],jwvis[00-03] nodes have been updated to the following versions:
Kernel 4.18.0-425.19.2.el8_7 (from 4.18.0-425.13.1.el8_7)
OFED 5.8-2.0.3.1 (from 5.8-1.1.2.1)
GPFS 5.1.7-1.5 (from 5.1.7-0)
NVIDIA driver 525.105.17 (from 525.85.12)
Apptainer 1.1.8-1 (from 1.1.5-1)
2023-03-09 Emergency maintenance/update (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages, Storage
GPFS software upgrade
GPFS has been updated everywhere to:
GPFS 5.1.7-0 (from 5.1.6-1)
2023-02-28 General maintenance/update (Damian Alvarez, JSC)
Update type: Maintenance, SW Modules, Batch system, OS Packages, Firmware
Stage Update:
The default software stack has been changed to 2023. The remaining software stages are nevertheless reachable.
Slurm Update:
Slurm has been updated to version 22.05.
Software Updates:
OFED 5.8-1.1.2.1
GPFS 5.1.6-1 (from 5.1.4-1)
IME 1.5.2-152111 (from 1.5.2-152065)
NVIDIA driver 525.85.12 (from 515.65.07-1)
Apptainer 1.1.6-1 (from 1.1.3-1)
psmgmt 5.1.53-1 (from 5.1.52-5)
Firmware Updates:
HDR InfiniBand switches firmware 27.2010.5042
EDR InfiniBand switches firmware 15.2010.5042
HDR InfiniBand HCA firmware 20.36.1010
2022-12-09 Emergency maintenance/update (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages, Storage
Compute nodes software downgrade
The compute nodes have been downgraded to:
GPFS 5.1.4-1 (from 5.1.5-1.10)
2022-12-05 Emergency maintenance/update (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages, Network, Other
Compute nodes software update
The compute nodes are updated to:
MOFED 5.8-1.1.2.1 (from 5.8-1.0.1.1)
InfiniBand Firmware updates
The following components in the InfiniBand network are updated:
Unmanaged Quantum based switches are updated to 27.2010.4102 (from 27.2010.3118)
Managed Quantum based switches are updated to 27.2010.4034 (from 27.2010.3118)
Switch-IB 2 based switches are updated to 15.2010.4102 (from 15.2010.3118)
2022-11-29 General maintenance/update (Damian Alvarez, JSC)
Update type: Maintenance, Announcement, OS Packages, General configuration, Batch system, Storage, Network, Other
Compute nodes software update
The compute nodes are updated to:
Rocky 8.7 (from 8.6)
MOFED 5.8-1.0.1.1 (from 5.7-1.0.2)
GPFS 5.1.5-1.10 (from 5.1.4-1)
NVIDIA driver 515.65.07-1 (from 515.65.01-1)
Apptainer 1.1.3-1 (from 1.0.3-1)
psmgmt 5.1.52-5 (from 5.1.50-4)
Skyway configuration
The skyway gateways have been configured in HA pairs, with 4 extra skyways being taken into production. As a side effect, extra bandwidth between JUWELS Booster and the JUST storage is now available.
2022-10-18 Cooling maintenance (Damian Alvarez, JSC)
Update type: Maintenance, Batch system, Storage, Network
New SLURM plugins available
cpufreq and gpufreq plugins are now available in JUWELS
New firmware version for Skyways
Updated to 8.1.3000
2022-10-12 psslurm change during unplanned downtime (Damian Alvarez, JSC)
Update type: Batch system, Other
The ENABLE_FPE_EXCEPTION option in psslurm.conf has been disabled as a response to applications crashing with underflow floating point exceptions being sent/forwarded by psid.
2022-09-07 Small update during unplanned downtime (Damian Alvarez, JSC)
Update type: Maintenance, Batch system, Network, Other
Compute nodes software update
The compute nodes are updated to:
psmgmt 5.1.50-5 (from 5.1.50-4). This corrects a bug in the PMIx server that affected MPI_Comm_split_type in OpenMPI, and therefore Horovod too
InfiniBand Firmware updates
The following components in the compute infrastructure are updated:
ConnectX-4 HCAs are downgraded to 12.28.2006 (from 12.32.1010), following a recommendation by NVIDIA
Skyway InfiniBand-Ethernet gateways are updated to 8.1.2000 (from 8.0.2300)
2022-08-30 General maintenance/update (Damian Alvarez, JSC)
Update type: Maintenance, Announcement, OS Packages, General configuration, Batch system, Storage, Network, Other
Compute nodes software update
The compute nodes are updated to:
Rocky 8.6 (from 8.5)
MOFED 5.7-1.0.2 (from 5.5-1.0.3)
GPFS 5.1.4-1 (from 5.1.3-1)
NVIDIA driver 515.65.01-1 (from 510.47.03-1)
Apptainer 1.0.3-1 (from 1.0.1-1)
psmgmt 5.1.50-4 (from 5.1.49-4)
InfiniBand Firmware updates
The following components in the compute infrastructure are updated:
ConnectX-6 HCAs are updated to 20.34.1002 (from 20.31.2006)
ConnectX-5 HCAs are updated to 16.34.1002
ConnectX-4 HCAs are updated to 12.32.1010 (from 12.30.1004)
Quantum based switches are updated to 27.2010.3118 (from 27.2010.2110)
Switch-IB 2 based switches are updated to 15.2010.3118 (from 15.2008.3328)
Slurm configuration update
SLURM now has the topology plugin active. That enables SLURM to make better decisions with respect to node allocation. It also enables users to use --switches=count[@time] in sbatch and salloc commands, where count is the maximum number of leaf switches used for a job, and time is the maximum time the job is made to wait for an opportunity to run within that limit.
In the JUWELS Cluster the count option maps one-to-one to switches.
In the JUWELS Booster the count option refers to racks rather than leaf switches, given the striping of the 4 links over different switches in the rack.
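As an illustration of the option syntax described above, the following sketch assembles the --switches argument for a submission. The helper names and the job script name job.sh are hypothetical. The time string is passed through unchecked and is assumed to be in a format Slurm accepts, such as hours:minutes:seconds.

```python
def switches_option(count, max_wait=None):
    """Build the --switches=count[@time] argument for sbatch or salloc."""
    if count < 1:
        raise ValueError("count must be a positive integer")
    if max_wait is None:
        return f"--switches={count}"
    return f"--switches={count}@{max_wait}"


def sbatch_command(script, count, max_wait=None):
    """Return the argument list for an sbatch call with a switch limit."""
    return ["sbatch", switches_option(count, max_wait), script]


# Ask for a single switch (a single rack on the Booster) and accept
# waiting up to one hour for such an allocation.
print(" ".join(sbatch_command("job.sh", 1, "01:00:00")))
```

A small count favours locality at the price of potentially longer queue times, which is why the optional time bound is worth setting.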
GPFS setup on login nodes
The GPFS setup on login nodes has been changed. Access to storage is now done over 100GbE instead of InfiniBand. That allows the login nodes to stay available when major work is being done in the InfiniBand fabric or in case of instabilities. Some nodes were kept in the old setup for evaluation purposes.
2022-05-18 Python clean up (Damian Alvarez, JSC)
Update type: OS Packages
Python 2 and 3.8 have been removed from the system.
2022-05-03 Global maintenance with general updates (Damian Alvarez, JSC)
Update type: Maintenance, Announcement, OS Packages, General configuration, Storage, Network, Other
General update
List of changes:
OFED updated to 5.5-1.0.3
NVIDIA driver updated to 510.47.03
Kernel updated to 4.18.0-348.23.1
Slurm updated to 21.08
Migrated to apptainer 1.0.1-1
GPFS parameter change
GPFS parameters have been changed to optimize metadata performance
2022-04-29 XH2000 IB Switch Update (Damian Alvarez, JSC)
Update type: Network
The firmware in the InfiniBand switches in XH2000 has been updated to 27.2010.2110 from 27.2008.3336
2022-04-12 IME Update (Damian Alvarez, JSC)
Update type: OS Packages, Storage
IME libraries have been updated from 1.5.1.1-151123 to 1.5.1.1-151130. That fixes a use case when using IME directly from Python scripts.
2022-03-08 Change in user installations (Damian Alvarez, JSC)
Update type: Announcement, SW Modules
Change in user installations
The module structure has been changed so $MODULEPATH is not expanded depending on the existence of the $PROJECT variable. Now the variable used is $USERINSTALLATIONS, so the project software is not automatically activated when using jutil.
2022-02-15 Stage update (Damian Alvarez, JSC)
Update type: Maintenance, Announcement, SW Modules, Storage, Network, Other
Stage update
The default software stack has been changed to 2022. The remaining software stages are nevertheless reachable.
Fabric components replaced
The juwelsg02:SX6036G gateway has been re-added to the fabric after a replacement
The jwb-25-L2-02 switch has been re-added to the fabric after a replacement
New HPST (IME) mount point
The mount point of HPST changed to /p/cscratch/fs
IB configuration
The order of the routing algorithms in the routing chain has been changed. updn is now the first one.
ARP settings have been tweaked to favour responding to ARP requests via the correct IPoIB interface
SR-IOV has been adapted in the corresponding service nodes to have different node and port GUID
2021-12-17 Rocky update (Damian Alvarez, JSC)
Update type: Maintenance, Announcement, OS Packages, General configuration, Batch system, Storage, Network, Other
Software updates
The system has been updated to Rocky Linux 8.5
OFED has been updated to 5.4-3.1.0
The NVIDIA driver has been updated to 470.82.01
GPFS has been updated to 5.1.2-1
Singularity has been updated to 3.8.5-1
Firmware/BIOS updates
The HCA FW has been updated:
12.30.1004 (EDR nodes)
20.31.2006 (HDR nodes)
The BIOS in the Cluster login nodes has been updated
The Technical State on the Cluster has been updated to 45.02
Storage updates
HPST (DDN-IME) is now accessible from Cluster and Booster nodes
General configuration updates
The sssd cache time has been reduced, so LDAP updates are refreshed faster
The priority of the different queues has been updated, to prioritize jobs that need nodes with large memory
Switch exchanges
jwc04isw218 has been replaced
jwb-27-L2-04 has been replaced
jwb-30-L1-02 has been replaced
Other changes
The cooling liquid in the Booster racks has been exchanged
2021-10-12 Maintenance (Damian Alvarez, JSC)
Update type: Maintenance, General configuration, Batch system, Storage, Network, Other
OpenSM configuration
OpenSM is now configured to dump the SA file to a shared filesystem, to improve failover times
Switch replacement
jwb-26-L2-01, jwb-27-L1-03 and jwb-39-L1-05 have been replaced
Update HCA FW in a variety of admin nodes
All the admin nodes have had their HCA FW version synced
largedata available in a subset of compute nodes
This filesystem is now available on 10 nodes of the cluster and 10 nodes of the booster. To request it, use --constraint=largedata in your sbatch/salloc command
Update psconfig and pshealthcheck
These packages have been updated to psconfig-5.2.1-1 and pshealthcheck-5.2.3-1
Overlapping partitions for swmanage users
The following partitions overlap the devel partitions, but without the 2 hour time limit:
devel-sw
develgpus-sw
develbooster-sw
2021-09-14 Module update (Damian Alvarez, JSC)
Update type: Maintenance, SW Modules, Network
New compilers and MPIs
The default compilers have been updated:
GCC/9.3.0 -> GCC/10.3.0
Intel/2020.2.254-GCC-9.3.0 -> Intel/2021.2.0-GCC-10.3.0
NVHPC/20.7-GCC-9.3.0 -> NVHPC/21.5-GCC-10.3.0
With these versions of the compilers the latest available MPIs have been also installed
jwb-16-L2-01 has been replaced
IME-FUSE client config update
The IME client has been updated and the configuration updated
2021-08-10 CentOS 8.4 update (Damian Alvarez, JSC)
Update type: Maintenance, Announcement, OS Packages, General configuration, Batch system, Network, Other
Software update
The system has been updated to
CentOS 8.4
OFED 5.4
gdrcopy 2.3
NVIDIA driver 470.57.02
psmgmt 5.1.43-0
Switch replacement
jwb-11-L1-05 and jwb-31-L2-01 have been replaced
MOTD announcement
It has been announced that during the next maintenance the default compilers will change to:
GCC 10.3 (from GCC 9.3)
NVHPC 21.5 (from NVHPC 20.11)
Intel 2021.2 (from Intel 2020.2)
2021-07-19 Update and clean up IB fabric (Damian Alvarez, JSC)
Update type: Maintenance, Batch system, Network, Other
Switch replacement
jwc03isw208 and jwc05isw118 have been replaced
Skyway cable mismatches
Fixed the cable mismatches in the skyways
InactiveLimit=0 in slurm.conf
Set to default
Fix PSID in jwb-02-L2-01
This switch had the wrong PSID and therefore the wrong FW.
FW update in all switches
All the switches have been updated to the latest version
2021-06-29 Skyway replacement, SLURM updates (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages, Batch system, Storage, Network
Update psmgmt
Updated to psmgmt-5.1.42-1
SLURM update
Minor update within 20.02.7-1
Fixes Spank plugin environment variables
Adds an additional check in the submission filter, to submit to the booster queue by default when submitting from Booster nodes
Skyway replacements
All the skyway units have been replaced to the GA HW version
This includes updating the software to the latest version
2021-06-08 GPFS and SLURM updates (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages, Batch system, Storage, Network
GPFS update
Updated to 5.1.1-1
Update psmgmt
Updated to psmgmt-5.1.41-0
SLURM update
Updated to 20.02.7-1
This gets rid of the GTK2 dependencies
Switch replacements
jwb-17-L2-03 has been replaced
jwb-13-L2-03 has been replaced
2021-05-11 SLURM update (Damian Alvarez, JSC)
Update type: Maintenance, Announcement, SW Modules, Batch system
SLURM update
Slurm has been patched to mitigate CVE-2021-31215
UCX as default for ParaStationMPI in the cluster
ParaStationMPI now uses UCX as default also in the cluster module
2021-04-16 Technical State update (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages
SLURM change
The old FQDN in JUWELS cluster have been removed from the SLURM configuration
TS Upgrade - TS 44.01
The following components have been updated:
PMC
EMC
WMC
TMC
HMC
BMC
Kernel update
The kernel has been updated to 4.18.0-240.22.1.el8_3.x86_64 in all nodes (top island and compute nodes)
2021-03-25 Acceptance tests (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages, Network, Other
OpenSM testing
A new OpenSM release 5.8.2 has been tested for failover. This version delays the re-registration of clients after the failover takes place, accelerating the process that way. The tests indicate good performance in the fabric and between the cluster and JUST, but no improvement between the booster and JUST over the Skyways.
The version has been reverted to 5.7.3 at the end of the maintenance due to extra problems connecting to the XCST
Updated psmgmt to 5.1.38-3
This new version fixes the bug that was preventing Slurm from performing X forwarding correctly
2021-03-16 Cluster-Booster links enabling (Damian Alvarez Mallon, JSC)
Update type: Maintenance, General configuration, Batch system, Network
Cluster-Booster cabling
The 200 links between cluster and booster have been reenabled for full bandwidth. 5 links are still not coming up:
top19:38 <-> jwb-23-L2-05:22
top19:39 <-> jwb-32-L2-05:26
top25:40 <-> jwb-36-L2-02:28
top28:38 <-> jwb-18-L2-03:19
top33:40 <-> jwb-36-L2-04:28
SLURM
The submission filter now correctly supports the --gpus option
OpenSM fixed DragonFly switch grouping
Now the configuration correctly considers the switches in the A4 rack
Switch replacement
jwc00isw106 has been replaced to fix a broken port
Various links rechecked/fixed between cluster switches and cluster gateways
juwelsg01:13
juwelsg02:6
juwelsg03:10
juwelsg03:12
juwelsg03:14
juwelsg03:17
juwelsg04:6
juwelsg04:9
juwelsg04:10
juwelsg04:11
juwelsg04:14
juwelsg04:16
juwelsg04:17
juwelsg04:18
Cabling jwslurm[00-01]
The SLURM nodes are now cabled with the admin rack
2021-03-09 Migration of cluster ISMAs (Damian Alvarez Mallon, JSC)
Update type: General configuration
Update of cluster ISMAs
They have been migrated to CentOS 8.3 (both baremetals). The PCS cluster per pair has also been deployed
Update of master nodes
They have been updated to CentOS 8.3 (both baremetals).
jwsm[00-01] recabling
They have been recabled on the booster admin switches
Remove CUDA_VISIBLE_DEVICES from environment on the juwels gpu nodes
With the update of psmgmt the workaround in the CUDA module is no longer necessary. It has been removed
2021-02-23 UCX update, ISMA migration (Damian Alvarez, JSC)
Update type: Maintenance, SW Modules, General configuration, Network
psmgmt update
psmgmt has been updated to 5.1.38-2. This fixes a protocol incompatibility problem with slurmctld and a segmentation fault when using heterogeneous jobs.
The cluster ISMAs have been migrated to CentOS 8
DNS have been updated to point to new containers
Imaging and configuration has been moved to new containers
PMSM has been moved out to a separate container due to its CentOS 7 requirement
OpenSM configuration
OpenSM now uses 2 ports per node.
User modules updates
GCC now supports GPU offload
UCX has been changed to 1.9.0 from 1.8.1 for both ParaStationMPI and OpenMPI
pscom has been updated to 5.4.7-1 (from 5.4.6-1). This fixes an issue where an error on pscom was not properly propagated to psmpi, leaving the job running without progressing
Increased size of /dev/shm
/dev/shm has been increased to 85% of the memory size
2021-02-09 CentOS 8.3 update (Damian Alvarez Mallon, JSC)
Update type: Maintenance, OS Packages, Storage, Network, Other
InfiniBand switches
jwc00isw216 has been replaced
jwc04isw114 has been replaced
OpenSM configuration
Increased the max number of SMPs on the wire to 32 (from 8).
Set the number of threads for routing calculation to 0, except for updn with lid tracking, where it is kept as 1
Extra HCAs added to the subnet managers to enable multiport MAD pushing
Software updates
Updated user exposed nodes to CentOS 8.3
Updated to OFED 5.1-2580
Updated to NVIDIA 460.32.03
Updated to Singularity 3.7.1
2021-01-28 FW updates (Damian Alvarez, JSC)
Update type: Maintenance, OS Packages, Storage, Network, Other
TS update on Booster nodes
Update of BIOS, CPLD and BMC on the nodes. Example node:
NAME       BOARD  COMPONENT   VERSION
pm3-bmc84  CER-G  BIOS        BIOS_RME090.18.25.001
                  CPLD        1.2
                  CPLD_CER    1.7
                  CPLD_CWG    0.1.2
                  SW0_CWG     1.2
                  SW1_CWG     1.2
                  FPGA_RDSTN  2.7
                  MC          60.39.00.0000
New PCIe switch FW
Version 1.2 fixes the bidirectional BW issue.
Enable assert on NSD checksum error
Enabled on the booster GPFS configuration, to protect JUST from the checksum errors caused by the skyways, which result in disks being taken down and affect general JUST availability
Setup new pscluster containers as part of cluster CentOS8 migration
The isma setup in cell 01 has been migrated to CentOS8
TOP-Lvl Switches: temporary cables
The cabling between cluster and booster has been limited to 5 cables to switches top[43-47]. This is temporary until the 200 links are correctly connected.
IB Switch FW update
Going from 27.2008.1904 to 27.2008.2102 on the HDR switches, and to 15.2008.2102 in EDR switches
Set MTU of the IPoIB Interfaces on the booster to 4000
As part of the skyway stabilization efforts, the MTU has been set to 4000, since the normal 4092 MTU was producing corrupted packets
Modify GPFS cluster on JUWELS (Cluster)
The GC and image have been moved to the new local GPFS cluster, getting rid of the legacy setup.
Update IME config
IME has been updated to 1.4.1.slice-141029 and reconfigured following DDN recommendations
2021-01-12 Various updates (Damian Alvarez Mallon, JSC)
Update type: Maintenance, Batch system, Storage, Network, Other
New Skyway configuration
The Skyway gateways connecting the booster to storage have been reconfigured to support 8x4 HCAs instead of 1x4
Update of SLURM
SLURM has been updated to 20.02.6
Update of GPFS in service nodes and GC
To 5.1.0-1 in both parts of the system
Update of psmgmt
To 5.1.35. Changes the pinning strategy on GPU nodes, to assign GPUs and HCAs properly when using more processes than GPUs
OpenSM configuration changes
Optimizations suggested by Mellanox to OpenSM configuration
InfiniBand work
Recabling of various broken links on the cluster part of the system
Cell00 HYC replacement
pm-hmc1 was faulty and has been replaced
jwslurm[00-01] renaming to jwslurm[01-02]
The newly installed baremetals are now hosting the jw-slurm container
Switch entries in DNS
Switch names and aliases have been added to the DNS. Necessary for IBMS
Modify GPFS cluster on JUWELS (Cluster)
The GPFS cluster managers have been temporarily moved to inactive logins.
2020-12-08 Maintenance for booster acceptance tests (Damian Alvarez, JSC)
Update type: Maintenance, Storage, Network
jwc07isw118 replaced
The switch is back in production, and with it the nodes connected to it
New route to a JUST subnet in the cluster images (CPU and GPU)
The following subnet route has been added in the cluster part: 134.94.76.0/23. This should be routed via the gateway corresponding to the pkey interface of the node, and over that pkey interface. In other words, one of these routes, depending on the node:
134.94.76.0/23 via 10.11.168.1 dev ib0.8007
134.94.76.0/23 via 10.11.160.1 dev ib0.8006
134.94.76.0/23 via 10.11.176.1 dev ib0.8008
New routes to a JUST subnet in the booster images
All the routes have been added in all the nodes. Depending on the group of nodes:
134.94.100.0/23 via 10.13.22.11 dev ib0
134.94.102.0/23 via 10.13.22.11 dev ib0
134.94.140.0/23 via 10.13.22.11 dev ib0
134.94.15.0/24 via 10.13.22.11 dev ib0
134.94.74.0/23 via 10.13.22.11 dev ib0
134.94.76.0/23 via 10.13.22.11 dev ib0
134.94.100.0/23 via 10.13.22.12 dev ib0
134.94.102.0/23 via 10.13.22.12 dev ib0
134.94.140.0/23 via 10.13.22.12 dev ib0
134.94.15.0/24 via 10.13.22.12 dev ib0
134.94.74.0/23 via 10.13.22.12 dev ib0
134.94.76.0/23 via 10.13.22.12 dev ib0
134.94.100.0/23 via 10.13.22.13 dev ib0
134.94.102.0/23 via 10.13.22.13 dev ib0
134.94.140.0/23 via 10.13.22.13 dev ib0
134.94.15.0/24 via 10.13.22.13 dev ib0
134.94.74.0/23 via 10.13.22.13 dev ib0
134.94.76.0/23 via 10.13.22.13 dev ib0
134.94.100.0/23 via 10.13.22.14 dev ib0
134.94.102.0/23 via 10.13.22.14 dev ib0
134.94.140.0/23 via 10.13.22.14 dev ib0
134.94.15.0/24 via 10.13.22.14 dev ib0
134.94.74.0/23 via 10.13.22.14 dev ib0
134.94.76.0/23 via 10.13.22.14 dev ib0
As always on the booster, no pkeys, just the main ib interface
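To check whether a given storage address is covered by one of the JUST subnets listed above, the prefixes can be evaluated with the Python standard library. This is only a sketch: the function name and the sample addresses are made up for illustration, and only the subnet list is taken from this entry.

```python
import ipaddress

# Destination subnets from the booster routes in this changelog entry.
JUST_SUBNETS = [ipaddress.ip_network(net) for net in (
    "134.94.100.0/23", "134.94.102.0/23", "134.94.140.0/23",
    "134.94.15.0/24", "134.94.74.0/23", "134.94.76.0/23",
)]


def matching_subnet(address):
    """Return the listed subnet containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for net in JUST_SUBNETS:
        if ip in net:
            return net
    return None


# Sample addresses are hypothetical. A /23 spans two consecutive /24
# blocks, so 134.94.77.x still belongs to 134.94.76.0/23.
print(matching_subnet("134.94.77.10"))
print(matching_subnet("134.94.78.1"))
```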
Update cluster and booster to psmgmt 5.1.34
Update cluster and booster to 5.1.34
IME software + config update
The IME config on the cluster needs some performance tuning for Skylakes. In addition a new IME version is available (1.4.1.slice-141026)
2020-11-10 SLURM cluster-booster unification (Damian Alvarez Mallon, JSC)
Update type: Maintenance, Announcement, SW Modules, General configuration, Batch system, Network
SLURM merge
Both cluster and booster slurm instances have been merged. It is possible now to submit jobs to both sides of the system
Cell 5
The cell is back in production
InfiniBand network
The FW has been updated in all compute HCAs and switches.
Cluster and Booster have been re-merged
The fabric has been also cleaned up
IME update
New version has been installed
OpenMPI failure in CentOS 8
In some circumstances, when doing MPI-IO one could see this failure
[jwc09n006.adm09.juwels.fzj.de:08897] mca_base_component_repository_open: unable to open mca_fs_gpfs: libevent_core-2.0.so.5: cannot open shared object file: No such file or directory (ignored)
[jwc09n006.adm09.juwels.fzj.de:08882] mca_base_component_repository_open: unable to open mca_fs_gpfs: libevent_core-2.0.so.5: cannot open shared object file: No such file or directory (ignored)
[jwc09n006.adm09.juwels.fzj.de:08907] mca_base_component_repository_open: unable to open mca_fs_gpfs: libevent_core-2.0.so.5: cannot open shared object file: No such file or directory (ignored)
That is a byproduct of compiling in CentOS 7 at the beginning of the stage deployment. It has been fixed during the maintenance by recompiling OpenMPI.
Remove ParaStationMPI GPFS support on ROMIO
Some users reported problems when using HDF5 on the new stage (on the booster). The issue is reliably resolved by setting ROMIO_FSTYPE_FORCE=ufs: in the environment. As relying on that variable also disables IME, ParaStationMPI has been recompiled without GPFS support on ROMIO.
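For reference, the workaround mentioned above can be sketched as follows. This is only an illustration of how the variable would be set in a job environment. The launch line and the application name `./hdf5_app` are placeholders, not taken from the changelog, and with the recompiled ParaStationMPI the setting should no longer be needed.

```shell
#!/bin/sh
# Workaround from the entry above: make ROMIO treat all files as generic
# UNIX file system files. The trailing colon is part of the value.
export ROMIO_FSTYPE_FORCE=ufs:

# Hypothetical launch line, shown as a comment only (./hdf5_app is a placeholder):
#   srun ./hdf5_app
```

Because the variable applies to every file opened through ROMIO, it also bypasses IME, which is why the recompilation was preferred over keeping this setting.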
Update cluster nodes to psmgmt 5.1.32-0
To homogenise with the booster, psmgmt 5.1.32-0 has been installed.
2020-11-02 Cluster-Booster InfiniBand merge, CentOS 8 migration and software stack update (Damian Alvarez Mallon, JSC)
Update type: Maintenance, SW Modules, OS Packages, General configuration, Network, Other
Change to 55V on the PSUs
Before it was set to 54V
Meant to address recent throttling events
InfiniBand FW updates
On HCAs and switches
InfiniBand merge
Both fabrics have been merged
Chain routing setup to have updn for IO, ftree for cluster part, and dfp for the Booster
Update IPoIB addresses
For the cluster/booster merge, new IP addresses for the IPoIB devices on the cluster nodes have been assigned
PSP_NETWORK has been updated for it
New justime IPoIB IPs
Migrate compute nodes to CentOS 8 images
Compute nodes have been upgraded to CentOS 8
Move to 2020 stage
The 2020 stage has been made default
10GbE card in juwels11
For the ceph network
Move software mountpoint
From /gpfs/software to /p/software/juwels
Change DNS RR on login nodes to migrate to CentOS 8
New login nodes based on CentOS (jwlogin[04-10] and jwvis[02-03])
2020-08-25 Regular maintenance (Damian Alvarez Mallon, JSC)
Update type: Maintenance, OS Packages, GPUs
Cell HW
HYCs in cells 6 and 7 have been replaced
TMC in cell 0 has been replaced
MAD control options
The maximum number of in-flight MAD datagrams is now limited to 1. This is set via a drop-in file in /system/openibd.service.d/ticket4005.conf
New nvidia driver
The driver version 450.56.01 has been installed, for CUDA 11 compatibility.
Reenable Singularity
Singularity has been deployed locally on the login and compute nodes via an RPM-based installation.
10 GbE cards for Ceph access
The following nodes have now extra 10 GbE cards:
jwm[00,01]
jwsm[00,01]
jwlogin[00-03]
jwvis[00,01]
2020-07-13 Network migration (Damian Alvarez)
Update type: Maintenance, Network, Other
Network migration
The admin network has been migrated to enable the integration with the Booster in the near future
Cell 09 switch backplane
The switch backplane (BOD/S) in cell 09 has been replaced
Ceph network
The ISMA, monitoring and SLURM nodes have been equipped with 10 GbE cards to access the Ceph network in the future
2020-06-23 HW maintenance (Damian Alvarez)
Update type: Maintenance, Network, Other
Replacement of HYC in cell 4
Replacement of switches
jwc00isw222
jwc02isw208
jwc03isw210
jwc05isw214
jwc07isw[216,218]
jwc04isw204
Update of pscluster containers
To CentOS 7.8
2020-06-04 Changes after security incident (Damian Alvarez)
Update type: Maintenance, Announcement, SW Modules, OS Packages, General configuration, Batch system, Storage, Network, Other
Security changes
User visible and incomplete list of changes:
Revoked ssh keys
Revoked ssh host keys
Strong recommendation of from clauses in authorized_keys
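As an illustration of the recommended from clauses, the following sketch builds an authorized_keys entry that restricts a key to a set of source addresses. The helper name `restrict_key`, the key material and the address patterns are all placeholders chosen for this example (192.0.2.0/24 is a range reserved for documentation); only the `from="..."` option itself is standard OpenSSH syntax.

```shell
#!/bin/sh
# Prefix a public key line with a from="..." restriction.
#   $1: comma-separated pattern list (hostnames, addresses or CIDR ranges)
#   $2: the public key line as found in a .pub file
restrict_key() {
    printf 'from="%s" %s\n' "$1" "$2"
}

# Placeholder key and patterns, for illustration only.
restrict_key "192.0.2.0/24,*.example.org" "ssh-ed25519 AAAA...placeholder user@laptop"
```

The resulting line would replace the unrestricted entry for that key in authorized_keys, so that the key is only accepted from the listed sources.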
CentOS update
Update from CentOS 7.7 to CentOS 7.8
Phase rebalancing
The electric phases have been permuted to balance the load evenly across all 3 phases
jwlogXX
Enabled SR-IOV setup
IB Firmware Upgrade
The firmware of the following components has been updated:
L1/L2/L3 switches
HCA Update for all nodes
New psmgmt 5.1.30
This includes the pinning changes
Rollout slurm role from hps-config
This also implies a few changes on the compute nodes.
New default modules
Updated defaults:
Default PGI module: 19.3 -> 19.10
Default Intel module: 2019.3 -> 2019.5
Default ParaStationMPI: 5.2.2-1 -> 5.4
Default IntelMPI: 2018.5 -> 2019.6
XDG_RUNTIME_DIR not existing in compute nodes
$XDG_RUNTIME_DIR is set on login to /run/usr/$ID. This directory is used by a few programs and libraries (like Qt), and is created by pam on a normal system. However, on compute nodes, the directory did not exist until this update.
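A job script that depends on this directory could verify it before starting, roughly as sketched below. The function name `xdg_runtime_ok` is made up for this example, and the check is an assumption about what a careful script might test, not something the changelog prescribes.

```shell
#!/bin/sh
# Succeeds only if XDG_RUNTIME_DIR is set and points to an existing,
# writable directory.
xdg_runtime_ok() {
    [ -n "${XDG_RUNTIME_DIR:-}" ] && [ -d "$XDG_RUNTIME_DIR" ] && [ -w "$XDG_RUNTIME_DIR" ]
}

if xdg_runtime_ok; then
    echo "XDG_RUNTIME_DIR usable: $XDG_RUNTIME_DIR"
else
    echo "XDG_RUNTIME_DIR missing or not writable" >&2
fi
```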
2020-04-28 Phase verification maintenance (Damian Alvarez)
Update type: Maintenance, General configuration
Phase load tests were performed
Roles from hps-config deployed on top island:
postfix
mellanox
2020-03-31 TS update (Damian Alvarez)
Update type: Maintenance
Finished update of TS, left incomplete in the previous maintenance
Deployed new OpenSM role (from hps-config)
2020-03-17 Technical State Update (Damian Alvarez)
Update type: Maintenance, OS Packages
New TS
Updated FW across the system
Add IME servers to DNS
IME servers have been added to the DNS
New packages
Minor OS update, including new minor kernel update
2020-02-11 MPI settings modules (Damian Alvarez)
Update type: SW Modules
MPI modules now enable the loading of mpi-settings for easier tuning of MPI parameters
2020-02-04 New PGI compiler and Intel MPI version (Damian Alvarez)
Update type: SW Modules
PGI 19.10 installed (but not default)
IntelMPI 2019.6 installed (but not default)
2020-01-28 Update to CentOS, OFED and bypass installation on cooling loop (Damian Alvarez)
Update type: Maintenance, SW Modules, OS Packages, Batch system, Other
CUDA MPS support on SLURM
JUWELS now supports CUDA MPS through SLURM. Example:
salloc -p develgpus --gres=gpu:4 -t 40 --cuda-mps zsh
Update to CentOS 7.7 on CPU and GPU nodes
The OS on the compute nodes has been updated to CentOS 7.7
OFED has been updated to 4.7-3
GPFS has been updated to 5.0.4-1
Update to CentOS on the top island
The OS on the admin nodes and containers has been updated to CentOS 7.7
OFED has been updated to 4.7-3
GPFS has been updated to 5.0.4-1
Update psmgmt to 5.1.28
User relevant changes:
Add OMPI_* variables to client environment in pspmix
Improve Slurm message reply code of psslurm
Revert “Enhancement: set SLURM_NTASKS_PER_NODE if it was not set by sbatch”
Default UCX
The default UCX for ParaStationMPI has been changed to 1.6.1
Cooling infrastructure
The primary cooling loop now has a bypass that allows the Sequana cells to control the primary-loop valves located on their racks.
2020-01-17 New MVAPICH2-GDR version (Damian Alvarez)
Update type: SW Modules
MVAPICH2-GDR 2.3.3 has been installed and made default in production
2020-01-14 Connection to HPST (Damian Alvarez)
Update type: Maintenance, OS Packages, Storage, Network
Connection with HPST
Install HPST client RPMs in the compute images
The following packages have been installed
ime-client
ime-net-cci
fuse
ime-common
ime-ulockmgr
libcci
libisal
python-babel
python-jinja2
python-markupsafe
2019-12-10 SLURM and IB fabric updates (Damian Alvarez)
Update type: Maintenance, Batch system, Network
JUWELS IB fabric
The FW of all the switches has been updated
SLURM update
SLURM is now installed as RPM, and has been updated to version 19.05.4
The maximum number of nodes in batch has been increased to 1024
jwlogin06 - network link down
The juwels06 container is again reachable from outside
Install fix to resolve increased IB errors when sideband is activated
A fix has been installed on the Sequana switches to enable sideband functionality without increasing the IB symbol error counters
2019-11-18 HDR switches, VR 2.2 update and others (Damian Alvarez)
Update type: Maintenance, SW Modules, Network, Other
JUWELS IB fabric
The top level EDR switches have been replaced by HDR switches
4 defective L1 and L2 switches have been replaced
IBACM has been disabled
Update VR version to 2.2
VR 2.2 has been installed on the compute nodes, to fix PVCCIN issues
Supermicro firmware
The service nodes BIOS and BMC have been updated to address PCI reordering problems
GPFS client configuration
/var/mmfs/mmsysmon/mmsysmonitor.conf has been updated with clitimeout = 32, maxretries = 5, csmspeed = 10 and maxcsmretries = 1 to avoid flooding on the quorum nodes:
2019-11-07_20:50:18.443+0100: [N] The server side TLS handshake with node 10.11.160.76 was cancelled: connection reset by peer (return code 420).
2019-11-07_20:50:18.443+0100: [E] sdrServ: Communication error on socket 3066 (10.11.160.76) @handleRpcReq/AuthenticateIncoming, [err 146] Internal server error
2019-11-07_20:50:18.444+0100: [N] The server side TLS handshake with node 10.11.171.75 was cancelled: connection reset by peer (return code 420).
2019-11-07_20:50:18.444+0100: [E] sdrServ: Communication error on socket 3064 (10.11.171.75) @handleRpcReq/AuthenticateIncoming, [err 146] Internal server error
Update LXC
LXC on admin nodes has been updated to 3.0.4 from 1.0.X
juwelsm01 and SELinux
SELinux has been enabled on juwelsm01. It was disabled by mistake.
Flexible module naming scheme
The user modules in production have been adapted to work with a flexible module naming scheme. Minor updates of compilers and MPIs are possible without full toolchain duplication now.
2019-11-07 IB network updates (Damian Alvarez)
Update type: Maintenance
JUWELS IB fabric update
The firmware of the switches and gateways has been updated to the latest version.
IBACM has been disabled
Supermicro firmware
The BMC and BIOS on admin nodes have been upgraded.
jwslurm[00-01], jwsm[00-01] and jwvis[00-01] are still pending
psmgmt
psmgmt has been updated to 5.1.26. This includes various fixes affecting user jobs, PMIx (for OpenMPI) and PMI (for Intel MPI), and psid crashes.
Update LXC
LXC has been updated on jwlogin10 (from lxc-1.0.11-2.el7.x86_64 to lxc-3.0.4-2.el7.x86_64) as a stability test
Updated nvidia driver on GPU partition
The nvidia driver has been updated from 418.40.04 to 418.87.00
Updated OFED on the login nodes and admin nodes to 4.6
The whole cluster is now running on the same OFED version
OS update on computes and logins
Minor kernel update
Minor update of packages
gdrcopy on the gpu nodes
The gdrdrv kernel module has been installed, and the gdrcopy service enabled
2019-10-24 Max jobs in queue (Damian Alvarez)
Update type: Batch system
The maximum number of jobs in the queues has been raised from 10000 to 20000
2019-10-22 Change in IPoIB qlen (Damian Alvarez)
Update type: Network
The QLEN in ib0 has been increased to 4096 from 256. The previous value was causing
RDP: MYsendto to XXX.XXX.XXX.XXX(0): No buffer space available
errors.
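The queue length of an interface can be inspected with the standard `ip` tool, as in the sketch below. The helper name `qlen_from_ip_output` is invented for this example, and how the value is actually applied on the system (the `ip link set` line in the comment is one common way) is an assumption rather than something stated in this entry.

```shell
#!/bin/sh
# Extract the transmit queue length from `ip link show` output read on stdin.
qlen_from_ip_output() {
    # Print the token that follows the first "qlen" keyword.
    awk '{ for (i = 1; i < NF; i++) if ($i == "qlen") { print $(i + 1); exit } }'
}

# Inspect ib0 only where the tool and the interface exist.
if ip link show dev ib0 >/dev/null 2>&1; then
    ip link show dev ib0 | qlen_from_ip_output
fi

# Changing the value needs root privileges; shown as a comment only:
#   ip link set dev ib0 txqueuelen 4096
```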
2019-10-17 Changes in nvidia and MVAPICH2 modules - OTRS #1031954 (Damian Alvarez)
Update type: SW Modules
The nvidia module now has a link libnvidia-ml.so -> libnvidia-ml.so.1 to allow applications to link to it, instead of using stubs
The MVAPICH2 compiler wrappers now point to $EBROOTNVIDIA/lib64 instead of $EBROOTCUDA/lib64/stubs
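One way to confirm the new link from a shell is sketched below. It assumes that the link lives in $EBROOTNVIDIA/lib64, which the entry suggests but does not state explicitly, and the helper name `nvml_link_target` is made up for this example.

```shell
#!/bin/sh
# Print the target of the libnvidia-ml.so symlink in the given directory.
nvml_link_target() {
    readlink "$1/libnvidia-ml.so"
}

# Only meaningful when the loaded module has defined EBROOTNVIDIA.
if [ -n "${EBROOTNVIDIA:-}" ] && [ -L "$EBROOTNVIDIA/lib64/libnvidia-ml.so" ]; then
    nvml_link_target "$EBROOTNVIDIA/lib64"
fi

# Linking an application against the real library instead of the stub
# (hypothetical source file, shown as a comment only):
#   gcc query.c -L"$EBROOTNVIDIA/lib64" -lnvidia-ml -o query
```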
2019-10-15 Updates in login nodes and large partition (Damian Alvarez)
Update type: OS Packages, Batch system
The large partition now excludes the devel nodes and has the same nodes as the batch partition
OFED 4.6 libraries have been installed in the login containers
2019-10-10 IPoIB update (Damian Alvarez)
Update type: Maintenance, SW Modules, Network
OFED update
A new driver patched by Mellanox has been installed that fixes the IPoIB issues the system had from the beginning. This update has also been applied on the GPU nodes. The OFED version is now 4.6.
Admin nodes
The update to CentOS 7.6 on the admin nodes has been completed (including master nodes, slurm and subnet manager nodes). GPFS has also been updated to 5.0.3 in the whole partition.
IntelMPI
The default Intel MPI module in the current stage (2019a) is now 2018.5.288, instead of 2019.3.199
VR firmware
The VR firmware has also been updated to 2.1 in Cell 9, following the cells that were upgraded in the previous maintenance.
2019-09-30 InfiniBand Update (Damian Alvarez)
Update type: Maintenance, General configuration, Network, Other
Updates in admin nodes infrastructure
Updated log baremetals
Update OFED on the compute nodes to 4.6
OFED has been updated to version 4.6, including a custom patch by Mellanox to address the IPoIB issues present in JUWELS. Scalability has been increased significantly. General roll out needs pending changes in the environment.
/etc/locale.conf in compute images
Locale updated to LANG="en_US.UTF-8"
2019-09-25 (Damian Alvarez)
Update type: OS Packages, General configuration
Installation of nvidia-libXNVCtrl in juwelvis[00-03]
LANG=en_US.UTF-8 has been set via /etc/locale.conf on the login nodes
2019-09-24 VR update (Damian Alvarez)
Update type: Maintenance, General configuration
Update VR version to 2.1 to address throttling events
The voltage regulator firmware has been upgraded to 2.1 to address most of the CPU throttling events we have seen in JUWELS.
Add ib.juwels.fzj.de to /etc/resolv.conf in compute image
Applications and tools can now resolve the hostnames of nodes in other cells. This is necessary for some functionality provided by TotalView
OS update in most of the admin nodes
This includes baremetal nodes and containers, but does not include all of them
Increased size of /dev/shm
/dev/shm on the compute nodes is now 85% of the total memory capacity
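The new limit can be checked on a node with a short script along these lines. It is a rough sketch: the helper name `expected_shm_kib` is invented here, and MemTotal from /proc/meminfo is used as an approximation of the total memory capacity, so small deviations from the reported /dev/shm size are to be expected.

```shell
#!/bin/sh
# Expected /dev/shm size in KiB for a given total memory in KiB
# (85%, rounded down).
expected_shm_kib() {
    echo $(( $1 * 85 / 100 ))
}

# Compare with the running system where both sources of information exist.
if [ -r /proc/meminfo ] && [ -d /dev/shm ]; then
    mem_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    shm_kib=$(df -Pk /dev/shm | awk 'NR == 2 { print $2 }')
    if [ -n "$mem_kib" ] && [ -n "$shm_kib" ]; then
        echo "expected: $(expected_shm_kib "$mem_kib") KiB, actual: $shm_kib KiB"
    fi
fi
```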
2019-09-11 Beginning of the changelog (Damian Alvarez)
Update type: Announcement
Initial state of the changelog