Information about $DATA incident from January 2021
This page collects information about the $DATA incident that occurred on 2021-01-26. In the period up to 2021-02-02, an improbable sequence of unrelated hardware failures coincided with a firmware bug, which unfortunately led to data loss. This page describes the next steps and answers questions related to the incident and its aftermath. A description of the event time line is provided at the bottom of the page.
FAQ
Event time line
- 2021-01-26:
Failure of one storage enclosure and loss of access to 14 drives. An I/O error on $DATA was visible when accessing files with blocks on these drives. Replacement of a seemingly unrelated drive.
- 2021-01-28:
Attempts to revive enclosure were not successful. $DATA was taken fully into maintenance.
- 2021-02-01:
Successful replacement of the enclosure. Disk group dg_b01 started rebuild but dg_a01 remained in quarantine due to lack of disks.
- 2021-02-02:
Failure of controller b forced rebuild to migrate to controller a. A dequarantine instruction for dg_a01 triggered a previously unknown firmware bug in the form of a race condition between the reconfiguration and the rebuild for disk group dg_b01. The bug resulted in RAID stripe metadata being cleared on drives.
- 2021-02-04:
Loss of RAID metadata confirmed. Start of analysis of affected files and recovery options. Start of migration of data from dg_a01 to other disks.
- 2021-02-08:
Start of test for RAID metadata recovery from drives originally in dg_b01 that were excluded from the rebuild and had not been cleared by the bug. Reduplication of metadata was triggered to ensure two copies are available.
- 2021-02-12:
Custom firmware for recovery has gone through testing. List of files with blocks on dg_b01 has been identified and cross-checked against the available copies on tape. Start of restore of files onto new file system largedata_restore.
- 2021-02-15:
List of affected files per project communicated to the contact persons for the data project.
- 2021-02-15:
Start of recovery actions.
- 2021-02-18:
RAID metadata recovery finalized. Work moves to file system level. After verification of several checksums, the migration of the data from the disks in dg_b01 to other disks in the system was triggered since dg_b01 was still in critical state and additional data loss in case of a drive failure was possible.
- 2021-02-19:
NSDs on dg_a01 and dg_b01 were successfully removed from the system and two file system checks did not show problems (note: file system checks do not verify correctness of data blocks). Start of assessment of the exposure of data on the file system to corruption by comparison with tape copies.
- 2021-02-23:
$DATA made available in read-only mode on login and data access nodes allowing users to access data and start validation. The file system /p/largedata_restore is made available in read-only mode while the restore of data from tape continues. We are working on making markers in the project available to allow you to assess the completeness of the restore for your project.
- 2021-03-02:
$DATA made available read-write.
- 2021-03-07:
Start of data duplication for $DATA (on file system layer).
- 2021-03-09:
/p/largedata2 created.
- 2021-03-10:
The restore of /p/largedata_restore has finished.
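
For the validation step mentioned in the 2021-02-23 entry, one possible approach is to compare checksums of files on $DATA with their restored counterparts. The following Python sketch is an illustration only, not a tool provided for this incident: the function names sha256_of and compare_trees are hypothetical, and it assumes that the restored tree mirrors the directory layout of the original project directory, which this page does not state.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original_root, restored_root):
    """Compare each regular file under original_root with its restored copy.

    Assumption: both trees share the same relative layout.
    Returns a dict with three lists of relative paths:
    'match', 'differ', and 'missing' (no restored copy found).
    """
    original_root = Path(original_root)
    restored_root = Path(restored_root)
    report = {"match": [], "differ": [], "missing": []}
    for source in sorted(original_root.rglob("*")):
        if not source.is_file():
            continue  # skip directories and special files
        relative = source.relative_to(original_root)
        candidate = restored_root / relative
        if not candidate.is_file():
            report["missing"].append(relative.as_posix())
        elif sha256_of(source) == sha256_of(candidate):
            report["match"].append(relative.as_posix())
        else:
            report["differ"].append(relative.as_posix())
    return report
```

A call such as compare_trees("/path/to/project_on_DATA", "/path/to/project_on_restore") (both paths are placeholders) returns the three lists. A mismatch by itself does not say which copy is correct: a file that was modified after its last tape backup will also differ, so entries under 'differ' need a manual check.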