Checksum Scrubbing
Introduction
The checksum scrubber is a background task that scans the Version Object Store (VOS) trees and verifies the integrity of the stored data against the checksums kept on a DAOS server. Corrective actions can be taken when corruption is detected.
The data that is scrubbed is the user data (created with daos_obj_update), not the metadata, though the metadata (VOS trees) is used for scanning/iterating the user data.
Design Overview
Background task
A per-pool-target ULT iterates the containers. If checksums and the scrubber are both enabled, it iterates the object tree. Each record value (SV or array) that is not already marked corrupted is scanned as follows:
Fetch the data
Calculate checksum for data
Compare calculated checksum with stored checksum. If they don’t match, silent data corruption has been detected
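The three steps above can be sketched as follows. This is a minimal illustration, not the DAOS implementation: `Record` and `scrub_records` are hypothetical names, and CRC32 from Python's `zlib` stands in for whatever checksum type the container is actually configured with.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a stored value (SV or array extent)."""
    data: bytes
    stored_csum: int
    corrupted: bool = False


def scrub_records(records):
    """Return the records whose recomputed checksum differs from the stored one."""
    mismatches = []
    for rec in records:
        if rec.corrupted:
            # Already flagged by an earlier pass, so it is not scanned again.
            continue
        data = rec.data                      # 1. fetch the data
        calculated = zlib.crc32(data)        # 2. calculate the checksum
        if calculated != rec.stored_csum:    # 3. compare with the stored checksum
            mismatches.append(rec)
    return mismatches
```

A record whose data no longer matches its stored checksum is returned as a mismatch; a record that was already marked corrupted is skipped without being recomputed.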
Silent Data Corruption
When silent data corruption has been detected the following actions will be taken:
Mark the record as corrupt
Raise an event using the DAOS RAS Notification system
Increment checksum error counters
If a configurable threshold of corruption has been reached, initiate a rebuild/drain operation
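The four actions can be sketched as one handler. Everything here is hypothetical: `TargetState`, `on_corruption`, the event tuple, and the `drain_requested` flag merely stand in for the real RAS event and rebuild/drain machinery. Treating "reached" as `>=` is an assumption, and a threshold of 0 is taken to disable eviction, as described for scrub-thresh.

```python
from dataclasses import dataclass, field


@dataclass
class TargetState:
    """Hypothetical per-pool-target bookkeeping."""
    threshold: int                      # 0 disables automatic eviction
    csum_errors: int = 0
    events: list = field(default_factory=list)
    drain_requested: bool = False


def on_corruption(target, record):
    """Apply the four actions listed above to one corrupted record (a dict)."""
    record["corrupted"] = True                          # mark the record as corrupt
    target.events.append(("csum_error", record["id"]))  # stand-in for a RAS event
    target.csum_errors += 1                             # increment the error counter
    # Assumption: the threshold counts as "reached" once errors >= threshold.
    if target.threshold > 0 and target.csum_errors >= target.threshold:
        target.drain_requested = True                   # stand-in for rebuild/drain
```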
User Interface
Interactive Mode (nice to have)
A dmg command to start scrubbing from a script (would only run once, then quit)
Checksum Scrubbing Properties
These properties can be set at the DAOS system level (the control plane still needs this ability) or at the individual pool level. If set at both, the pool configuration overrides the system configuration. Updates should take effect right away rather than waiting for the current full scrub to finish. For example, if the mode changes to be more aggressive, the scrubber should become more aggressive (based on the mode configuration) right away, not in the next scan.
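A minimal sketch of the override and "active right away" behavior, assuming a simple two-level lookup. `ScrubConfig` and `scan` are hypothetical names and do not reflect how DAOS stores or propagates properties.

```python
class ScrubConfig:
    """Hypothetical two-level property store: pool values override system values."""

    def __init__(self, system_props=None, pool_props=None):
        self.system_props = dict(system_props or {})
        self.pool_props = dict(pool_props or {})

    def get(self, name, default=None):
        # A pool-level value wins; otherwise fall back to the system level.
        if name in self.pool_props:
            return self.pool_props[name]
        return self.system_props.get(name, default)


def scan(records, config):
    """Yield (record, mode), re-reading the mode before every record.

    Looking the property up per record, rather than once per scan, is what
    lets an update take effect in the middle of a scan.
    """
    for rec in records:
        yield rec, config.get("scrub")
```

With this shape, changing a pool property while a scan is in progress affects the very next record rather than the next full scan.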
Pool Scrubber Mode (scrub) - How the scrubber will run for each pool target. The container configuration can disable scrubbing for the container, but it cannot alter the mode.
Lazy - Only run when system is idle (no IO activities and no space pressure)
Timed - Run at regular intervals regardless of IO activities, but without exceeding Max Impact
Pool Scrubber Frequency (scrub-freq) - How frequently checksums should be scrubbed. Checksums should not be scrubbed more frequently than this property specifies; depending on the Scrubber Mode, however, they may be scrubbed less frequently. The value is stored internally as a number of seconds, but dmg should be more user friendly by accepting minutes, hours, days, and weeks.
Pool Scrubber Max CPU Impact (scrub-max) - (Not implemented) Percentage of CPU used by the server scheduler to run the scrubber if there are other IO activities.
Scrubbing Threshold (scrub-thresh) - Number of checksum errors at which the pool target is evicted. A value of 0 disables auto eviction.
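The frequency and mode rules above can be sketched as two small helpers. This is illustrative only: `parse_scrub_freq` and `scrub_due` are hypothetical names, the single-letter unit suffixes are an assumption modeled on the `1w` form, and treating any unrecognized mode as disabled is also an assumption.

```python
import re

# Seconds per unit; the single-letter suffixes are an assumption modeled on "1w".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_scrub_freq(text):
    """Convert a string such as '1w' or '90m' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdw]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"invalid scrub frequency: {text!r}")
    count, unit = match.groups()
    # A bare number is interpreted as seconds, matching the internal storage.
    return int(count) * _UNIT_SECONDS[unit or "s"]


def scrub_due(mode, seconds_since_last, freq_seconds, system_idle):
    """Decide whether a new scrub may start under the given mode."""
    if seconds_since_last < freq_seconds:
        return False              # never scrub more often than scrub-freq
    if mode == "lazy":
        return system_idle        # lazy waits for an idle system
    if mode == "timed":
        return True               # timed runs regardless of IO activity
    return False                  # assumption: any other value means disabled
```

Because lazy mode additionally waits for the system to be idle, scrubs can happen less often than the configured frequency but never more often.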
The command to create a pool with scrubbing enabled might look like this:
dmg pool create --scm-size 1G --properties=scrub:lazy,scrub-freq:1w
# or
dmg pool create --scm-size 1G aPool
dmg pool set-prop aPool --properties=scrub:3,scrub-freq:1w
Container Properties (-> doc/user/container.md)
Container Disable Scrubbing - If scrubbing is enabled for a pool, a container can disable it for itself.
Pool Query (Not Implemented)
Include scrubber info on a pool query. (DAOS-7680)
Number of checksums scrubbed / estimated remaining
Last full pool scan completed (max of pool target completion)
Telemetry
The following telemetry metrics are gathered and can be reported for better understanding of how the scrubber is running. They will be gathered at both the pool and container level, with the exception of the Scrubber ULT Start.
Schedule
Scrubber ULT Start - datetime the scrubber service started
Scrubber Current Start - datetime the current scrubbing job started
Scrubber Last Completion - datetime of when the last scrubbing job completed
Last Duration - how long the last scrubber took to run to completion.
Checksum Calculated Counts
Total Checksum Count - Total number of checksums calculated over the life of the scrubber.
Last Checksum Count - number of checksums calculated during last scrubber job
Current Checksum Count - number of checksums calculated so far for the current scrubber job
Silent Data Corruption Counts
Total Silent Data Corruption - total number of silent data corruption instances found while scrubbing object values
Current Silent Data Corruption - number of silent data corruption instances found so far for the current scrubber job
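One way to picture how these metrics relate to each other is a small record that is updated as a job starts, progresses, and completes. The `ScrubMetrics` class and its field and method names are hypothetical; they are not the names used by the DAOS telemetry framework. Timestamps are passed in explicitly so the sketch stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrubMetrics:
    """Hypothetical container for the scrubber metrics listed above."""
    ult_start: float                         # Scrubber ULT Start
    current_start: Optional[float] = None    # Scrubber Current Start
    last_completion: Optional[float] = None  # Scrubber Last Completion
    last_duration: Optional[float] = None    # Last Duration
    total_csum_count: int = 0
    last_csum_count: int = 0
    current_csum_count: int = 0
    total_corruption: int = 0
    current_corruption: int = 0

    def job_started(self, now):
        # The "current" counters restart with every scrubbing job.
        self.current_start = now
        self.current_csum_count = 0
        self.current_corruption = 0

    def csum_calculated(self, corrupted=False):
        self.total_csum_count += 1
        self.current_csum_count += 1
        if corrupted:
            self.total_corruption += 1
            self.current_corruption += 1

    def job_completed(self, now):
        self.last_completion = now
        self.last_duration = now - self.current_start
        self.last_csum_count = self.current_csum_count
```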
Design Details & Implementation
Pool ULT
The code for the pool ULT is found in srv_pool_scrub.c. It can be a bit difficult to follow because there are several layers of callback functions due to the nature of how ULTs and the vos_iterator work, but the file is organized such that each function typically calls the function above it (either directly or indirectly as a callback). For example (~> is an indirect call, -> is a direct call):
ds_start_scrubbing_ult ~> scrubbing_ult -> scrub_pool ~> cont_iter_scrub_cb ->
scrub_cont ~> obj_iter_scrub_cb ...
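The difference between the two arrow types can be illustrated with a toy model. This is not the code in srv_pool_scrub.c: `iterate` is a hypothetical stand-in for an iterator that takes a callback, and the function names are shortened for the sketch. As in the file, each function is defined above its caller.

```python
def iterate(items, callback):
    """Hypothetical iterator: invokes the callback once per item."""
    for item in items:
        callback(item)


def obj_iter_cb(obj, visited):
    visited.append(obj)


def scrub_cont(cont, visited):
    # Indirect call (~>): the iterator, not scrub_cont, invokes obj_iter_cb.
    iterate(cont["objects"], lambda obj: obj_iter_cb(obj, visited))


def cont_iter_cb(cont, visited):
    # Direct call (->): the callback calls scrub_cont itself.
    scrub_cont(cont, visited)


def scrub_pool(pool, visited):
    # Indirect call (~>): cont_iter_cb runs as a callback of the iterator.
    iterate(pool["containers"], lambda cont: cont_iter_cb(cont, visited))
```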