Information about $DATA incident from January 2021

This page collects information about the $DATA incident that began on 2021-01-26. Between 2021-01-26 and 2021-02-02, an improbable sequence of unrelated hardware failures combined with a firmware bug and unfortunately led to data loss. This page contains information about the next steps as well as answers to questions related to the incident and its aftermath. A description of the event timeline is provided at the bottom of the page.

FAQ

Which of my data is affected?

A list of all potentially affected files was communicated to the contact persons (PI or PA) of each data project on 2021-02-14. The provided list contains all files that had blocks on the affected disk group. Please note that this list was generated based on where the files' data blocks were stored at the time of the incident; we expect the list of actually affected files to be much shorter.

At this point, we cannot reliably assess which of these files are actually affected. Statistics gathered by comparing restored files with files in $DATA suggest that ca. 1-2% of the potentially affected files contain corrupted data blocks. The percentage will vary between projects depending on file size, since larger files span more blocks and are therefore more likely to include an affected one.

Is my data corrupted or missing entirely?

All inodes in the file system are intact. Hence, all files are visible with correct metadata and are fully accessible. However, the content of individual data blocks may still be corrupted (see the following question).

How can I detect data corruption in my files?

Warning

The completeness and accuracy of metadata (file size, time stamps, etc.) do not indicate the correctness of the data in the file!

Based on the analysis so far, we know that all faulty blocks are 512 KiB (524288 bytes) long due to the underlying RAID stripe layout (in rare cases multiple faulty blocks are adjacent). In most cases, the faulty blocks are zeroed out and can be detected by scanning for such zeroed regions. Unfortunately, not all affected blocks are zero.

In order to detect zero blocks of this size, we provide the tool find0s.el8.x, located at /p/largedata/bin, for use on JUDAC, JUWELS and JURECA. find0s.el8.x --quick <filename> searches <filename> for zero blocks whose size is a multiple of 512 KiB. find0s.el8.x (without --quick) will scan the file for all zero regions. Please note that, depending on the file format, zero blocks may be common in uncorrupted files.

Warning

find0s.el8.x will detect zeroed corrupted blocks, but files for which find0s.el8.x does not report a matching region are not necessarily corruption-free.
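
For illustration, the following Python sketch performs a check similar in spirit to the --quick mode described above: it reports 512 KiB file regions that are completely zero. It is not the official tool, and it assumes that corrupted stripes fall on 512 KiB boundaries within the file; the warning above applies to it as well.

    #!/usr/bin/env python3
    # Sketch only: report 512 KiB-aligned blocks that are entirely zero.
    # Assumes corrupted stripes fall on 512 KiB boundaries within the file,
    # which may not hold in every case; zero blocks can also be legitimate.
    import sys

    BLOCK = 512 * 1024          # 512 KiB, the size of the faulty RAID stripes
    ZERO = bytes(BLOCK)         # reference block of all zero bytes

    def scan(path):
        """Return offsets of 512 KiB-aligned blocks that are completely zero."""
        offsets = []
        with open(path, "rb") as f:
            offset = 0
            while True:
                chunk = f.read(BLOCK)
                if not chunk:
                    break
                # Only full-sized, completely zeroed blocks are reported;
                # a shorter final chunk cannot be a wiped 512 KiB stripe.
                if len(chunk) == BLOCK and chunk == ZERO:
                    offsets.append(offset)
                offset += len(chunk)
        return offsets

    if __name__ == "__main__":
        for name in sys.argv[1:]:
            for off in scan(name):
                print(f"{name}: zero 512 KiB block at offset {off}")

As with find0s.el8.x, a reported block is only a candidate for corruption; depending on the file format, zero blocks may be perfectly normal.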

Why did the RAID metadata recovery not work fully?

The recovery procedure used the metadata stored on disks in dg_b01 that were excluded from the rebuild process (and hence not subject to the wiping of the metadata). Due to this, the metadata on these drives was not updated during the rebuild and therefore reflects the stripe allocation at the start of the rebuild. It does not capture any relocation of RAID stripes performed by the rebuild algorithm.

Is the firmware bug fixed?

We are currently waiting for a released firmware fix; the bug fix is undergoing testing. Stop-gap measures are in place. We have been assured that a recurrence of the problem is not possible without manual actions, which are prohibited at this time.

There used to be snapshots, why can’t we use them?

Snapshots protect data on the file systems from accidental file deletion, overwrites and similar events, but they do not allow recovery of data in case of a system-internal corruption such as the one that occurred in this incident.

What will be the future measures to avoid such an incident?

Backup

Until now, $DATA has been offered as a storage service generally without out-of-system backup. We are working to enable a regular full backup of the file system and, once operational, will announce the general availability of this feature for all projects on $DATA. For technical reasons, we can only start providing backup functionality for $DATA once the restore to /p/largedata_restore/ is finalized. Hence, please expect backup to become available only several weeks from now.

Due to the large data volume in the file system and its continuously changing modification rate, the number of backup runs per week will vary and cannot be predicted at the moment. The backup will currently be performed on tape, but the technology may change without further notice. We intend to provide a mechanism for you to assess which files have an off-system copy, for incorporation into your workflows.

Please note that, given the file system size, a backup provides data safety (for data written to the file system before the last backup run) but does not guarantee data availability in case of an incident, since recalls of large volumes may take a very long time.

Temporary duplication

As a temporary measure until the full system backup is in place and has been validated, we intend to perform an in-system duplication of data to ensure that data blocks are available in more than one RAID system. This cuts the available storage space in half and hence is not a long-term option. It also does not provide the same safety as an off-system backup. Please note that the necessary duplication will take several days to complete and may degrade access performance while it is running.

File system reorganization

Over the course of the next months we will reorganize the $DATA structure and distribute projects over more file systems. This will improve the manageability of the file systems but also reduce the available bandwidth.

For how long will the largedata_restore filesystem be available?

We plan to leave largedata_restore available to users for six months. It will be removed on 2021-09-30.

What is the reason for my data volume being about twice as large as I expect it to be?

As a consequence of the duplication of data mentioned above, the volume of each file is counted twice. The duplication is a temporary measure until a proper out-of-system backup has been established. It is rather difficult to prevent this effect from being visible to users. Therefore, we decided to also increase the volume quota for each project to compensate for the effect. Quotas will be reset to normal once the duplication is disabled and the full backup has been established.

The number of files/inodes is not affected by this measure, so you will not see twice the number of files.
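
As a rough cross-check, the following Python sketch (an illustration only; the project path is a placeholder you need to adjust) sums the logical sizes of all files below a directory. While the duplication is active, the quota-reported usage for the project should be roughly twice this value.

    #!/usr/bin/env python3
    # Sketch only: sum logical file sizes below a directory so the result can
    # be compared with quota-reported usage, which is expected to be roughly
    # twice as large while the temporary duplication is active.
    import os
    import sys

    def logical_size(root):
        """Sum st_size over all files below root (logical, unduplicated size)."""
        total = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # skip files that vanish or are unreadable
        return total

    if __name__ == "__main__":
        root = sys.argv[1] if len(sys.argv) > 1 else "."
        tib = logical_size(root) / 2**40
        print(f"{root}: {tib:.2f} TiB of logical data "
              f"(expect roughly twice this in the quota report)")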

Event timeline

2021-01-26

Failure of one storage enclosure and loss of access to 14 drives. An I/O error on $DATA was visible when accessing files with blocks on these drives. Replacement of a seemingly unrelated drive.

2021-01-28

Attempts to revive the enclosure were not successful. $DATA was taken fully into maintenance.

2021-02-01

Successful replacement of the enclosure. Disk group dg_b01 started its rebuild, but dg_a01 remained in quarantine due to a lack of disks.

2021-02-02

Failure of controller b forced the rebuild to migrate to controller a. A dequarantine instruction for dg_a01 triggered a previously unknown firmware bug in the form of a race condition between this reconfiguration and the ongoing rebuild of disk group dg_b01. The bug resulted in RAID stripe metadata being cleared on drives.

2021-02-04

Loss of RAID metadata confirmed. Start of analysis of affected files and recovery options. Start of migration of data from dg_a01 to other disks.

2021-02-08

Start of a test of RAID metadata recovery from drives originally in dg_b01 that were excluded from the rebuild and had not been cleared by the bug.

Reduplication of metadata was triggered to ensure that two copies are available.

2021-02-12

Custom firmware for recovery has gone through testing.

The list of files with blocks on dg_b01 was compiled and cross-checked against the available copies on tape.

Start of the restore of files onto the new file system largedata_restore.

2021-02-15

List of affected files per project communicated to the contact persons of each data project.

Start of recovery actions.

2021-02-18

RAID metadata recovery finalized. Work moved to the file system level. After verification of several checksums, the migration of data from the disks in dg_b01 to other disks in the system was triggered, since dg_b01 was still in a critical state and additional data loss in case of a further drive failure was possible.

2021-02-19

NSDs on dg_a01 and dg_b01 were successfully removed from the system and two file system checks did not show problems (note: file system checks do not verify correctness of data blocks).

Start of an assessment of how much data on the file system is exposed to corruption, by comparison with the tape copies.

2021-02-23

$DATA made available in read-only mode on login and data access nodes, allowing users to access data and start validation. The file system /p/largedata_restore was also made available in read-only mode while the restore of data from tape continues. We are working on making markers available in the projects to allow you to assess the completeness of the restore for your project.

2021-03-02

$DATA made available in read-write mode.

2021-03-07

Start of data duplication for $DATA (on the file system layer).

2021-03-09

/p/largedata2 created.

2021-03-10

The restore of data to /p/largedata_restore has finished.