Introduction

The checksum scrubber is a background task that scans the Version Object Store (VOS) trees and verifies data integrity against the checksums stored on a DAOS server. Corrective actions can be taken when corruption is detected.

Only user data (created with daos_obj_update) is scrubbed, not the metadata, although the metadata (VOS trees) is used for scanning and iterating over the user data.

Design Overview

Background task

A per-pool-target ULT iterates over the containers. If checksums and the scrubber are both enabled, the ULT iterates over the object tree. Each record value (SV or array) that is not already marked as corrupted is scanned.
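The skip-and-scan decisions described above can be illustrated with a minimal sketch. It is not the DAOS implementation: the `sketch_record` and `sketch_cont` types and the `sketch_scrub_target` function are hypothetical in-memory stand-ins for the VOS trees and the ULT, and the flags are assumed to live on the container purely for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a record value (SV or array) in a VOS tree. */
struct sketch_record {
	bool corrupted; /* already marked corrupted, so the scrubber skips it */
	bool scanned;   /* set once the scrubber has visited the record */
};

/* Hypothetical stand-in for a container and its records. */
struct sketch_cont {
	bool                  csum_enabled;
	bool                  scrub_enabled;
	struct sketch_record *recs;
	size_t                nr_recs;
};

/* Walk every container on one pool target; return the number of records scanned. */
static size_t
sketch_scrub_target(struct sketch_cont *conts, size_t nr_conts)
{
	size_t scanned = 0;

	for (size_t c = 0; c < nr_conts; c++) {
		struct sketch_cont *cont = &conts[c];

		/* Both checksums and the scrubber must be enabled. */
		if (!cont->csum_enabled || !cont->scrub_enabled)
			continue;

		for (size_t r = 0; r < cont->nr_recs; r++) {
			struct sketch_record *rec = &cont->recs[r];

			/* Records already marked corrupted are not scanned again. */
			if (rec->corrupted)
				continue;

			rec->scanned = true; /* checksum verification would happen here */
			scanned++;
		}
	}
	return scanned;
}
```

In the real scrubber the loops are driven by the VOS iterator and its callbacks rather than by array indexing; the sketch only captures which records are selected for scanning.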

Silent Data Corruption

When silent data corruption has been detected the following actions will be taken:

User Interface

Interactive Mode (nice to have)

Checksum Scrubbing Properties

These properties will be settable at the DAOS system level (the control plane still needs this ability) or at the individual pool level. If a property is set at both levels, the pool configuration overrides the system configuration. Updates should take effect right away rather than waiting for a full scrub to finish. For example, if the mode changes to be more aggressive, the scrubber should become more aggressive (based on the mode configuration) right away, not in the next scan.
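One way to picture the override rule is the following sketch. It assumes a hypothetical `sketch_prop` slot with an explicit "is set" flag and a hypothetical `sketch_effective` helper; neither is part of the DAOS code base, and the integer values are placeholders rather than real scrub modes.

```c
#include <stdbool.h>

/* Hypothetical property slot: is_set is false when no value was configured. */
struct sketch_prop {
	bool is_set;
	int  value;
};

/*
 * Resolve the value the scrubber should use: the pool setting wins,
 * then the system setting, then a built-in default.
 */
static int
sketch_effective(const struct sketch_prop *pool, const struct sketch_prop *sys, int dflt)
{
	if (pool->is_set)
		return pool->value;
	if (sys->is_set)
		return sys->value;
	return dflt;
}
```

Because the helper takes pointers and reads them on every call, a scrubber that calls it at each step (instead of caching the result when a scan starts) would see a property update immediately, which is the behavior the paragraph above asks for.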

The command to create a pool with scrubbing enabled might look like this:

dmg pool create --scm-size 1G --properties=scrub:lazy,scrub-freq:1w
# or
dmg pool create --scm-size 1G aPool
dmg pool set-prop aPool --properties=scrub:3,scrub-freq:1w

Container Properties (-> doc/user/container.md)

Pool Query (Not Implemented)

Include scrubber info on a pool query. (DAOS-7680)

Telemetry

The following telemetry metrics are gathered and can be reported for better understanding of how the scrubber is running. They will be gathered at both the pool and container level, with the exception of the Scrubber ULT Start.

Schedule

Checksum Calculated Counts

Silent Data Corruption Counts

Design Details & Implementation

Pool ULT

The code for the pool ULT is in srv_pool_scrub.c. It can be difficult to follow because ULTs and the vos_iterator require several layers of callback functions. The file is organized so that each function typically calls the function defined above it, either directly or indirectly as a callback. For example (~> is an indirect call, -> is a direct call):

ds_start_scrubbing_ult ~> scrubbing_ult -> scrub_pool ~> cont_iter_scrub_cb ->
    scrub_cont ~> obj_iter_scrub_cb ...
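The layering can be mimicked in a small, self-contained sketch. All names here (`sketch_iterate`, `sketch_obj_cb`, `sketch_scrub_cont`, `sketch_cont_cb`, `sketch_scrub_pool`) are hypothetical and only mirror the shape of the chain above; they are not the functions in srv_pool_scrub.c, and `sketch_iterate` is a toy replacement for the vos_iterator. Callees are defined above their callers, matching the file layout described earlier.

```c
#include <stddef.h>

/* Callback type used by the toy iterator. */
typedef int (*sketch_iter_cb_t)(int item, void *arg);

/* Toy iterator: invoke cb for each item and stop on the first non-zero return. */
static int
sketch_iterate(const int *items, size_t nr, sketch_iter_cb_t cb, void *arg)
{
	for (size_t i = 0; i < nr; i++) {
		int rc = cb(items[i], arg);

		if (rc != 0)
			return rc;
	}
	return 0;
}

/* Innermost layer: "scrub" one object by counting the visit. */
static int
sketch_obj_cb(int obj, void *arg)
{
	int *visited = arg;

	(void)obj;
	(*visited)++;
	return 0;
}

/* Each container is assumed to hold two placeholder objects. */
static int
sketch_scrub_cont(int cont, int *visited)
{
	int objs[2] = {cont, cont + 1};

	/* Indirect call (~>): the iterator invokes sketch_obj_cb. */
	return sketch_iterate(objs, 2, sketch_obj_cb, visited);
}

/* Container callback: direct call (->) to sketch_scrub_cont. */
static int
sketch_cont_cb(int cont, void *arg)
{
	return sketch_scrub_cont(cont, arg);
}

/* Outermost layer: indirect call (~>) to sketch_cont_cb through the iterator. */
static int
sketch_scrub_pool(const int *conts, size_t nr, int *visited)
{
	return sketch_iterate(conts, nr, sketch_cont_cb, visited);
}
```

Reading the sketch bottom-up gives the chain sketch_scrub_pool ~> sketch_cont_cb -> sketch_scrub_cont ~> sketch_obj_cb, the same alternation of indirect and direct calls as in the real file.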