Checksum Scrubbing

Introduction

The checksum scrubber is a background task that scans the Version Object Store (VOS) trees and verifies data integrity against the checksums stored on a DAOS Server. Corrective actions can be taken when corruption is detected.

The data that is scrubbed is the user data (created with daos_obj_update), not the metadata, though the metadata (VOS trees) is used for scanning/iterating the user data.

Design Overview

Background task

A ULT (user-level thread) runs on each pool target and iterates over the containers. If checksums and the scrubber are both enabled, the ULT iterates the object tree. Each record value (SV or array) that is not already marked as corrupted is scanned as follows:

  • Fetch the data

  • Calculate checksum for data

  • Compare calculated checksum with stored checksum. If they don’t match, silent data corruption has been detected
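The three steps above can be sketched in a few lines of C. This is a minimal illustration, not the DAOS implementation: the `scrub_record` structure, the `scrub_record_scan` function, and the use of a bitwise CRC-32 are all assumptions made for the example, and the real scrubber uses whichever checksum type the container is configured with.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a stored record value and its checksum. */
struct scrub_record {
	const uint8_t *data;        /* user data as fetched from storage */
	size_t         len;         /* length of the data in bytes */
	uint32_t       stored_csum; /* checksum saved when the data was written */
	bool           corrupted;   /* true if already marked corrupted */
};

enum scrub_result { SCRUB_SKIPPED, SCRUB_OK, SCRUB_CORRUPT };

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); chosen for illustration only. */
static uint32_t crc32_calc(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Scan one record: skip it if already marked, otherwise recompute and compare. */
static enum scrub_result scrub_record_scan(const struct scrub_record *rec)
{
	uint32_t calculated;

	if (rec->corrupted)
		return SCRUB_SKIPPED;

	/* "Fetch" is implicit here: rec->data already holds the value. */
	calculated = crc32_calc(rec->data, rec->len);

	/* A mismatch means silent data corruption has been detected. */
	return calculated == rec->stored_csum ? SCRUB_OK : SCRUB_CORRUPT;
}
```

The scan itself does not modify the record; what happens after a mismatch is described in the next section.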

Silent Data Corruption

When silent data corruption has been detected the following actions will be taken:

  • Mark the record as corrupt

  • Raise an event using the DAOS RAS Notification system

  • Increment checksum error counters

  • If a configurable threshold of corruption has been reached, initiate a rebuild/drain operation
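As a rough sketch of how these four actions might fit together, consider the C fragment below. The structure, field, and function names are hypothetical, the event hook is a stub rather than the DAOS RAS API, and the "greater than or equal" comparison against the threshold is an assumption; the only behaviour taken from this document is the order of the actions and the fact that a threshold of 0 disables automatic eviction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-target scrubber state. */
struct target_scrub_state {
	uint64_t csum_errors;     /* checksum error counter */
	uint64_t threshold;       /* 0 disables automatic eviction */
	bool     drain_requested; /* set once a rebuild/drain should start */
	uint64_t events_raised;   /* stub bookkeeping for this example only */
};

/* Stub: a real implementation would raise a RAS event here. */
static void raise_corruption_event(struct target_scrub_state *tgt)
{
	tgt->events_raised++;
}

/*
 * Apply the corrective actions for one corrupted record.
 * Returns true if a rebuild/drain has been requested for the target.
 */
static bool handle_corruption(struct target_scrub_state *tgt, bool *rec_corrupted)
{
	/* 1. Mark the record so later scans skip it. */
	*rec_corrupted = true;

	/* 2. Notify the administrator. */
	raise_corruption_event(tgt);

	/* 3. Count the error. */
	tgt->csum_errors++;

	/* 4. Request a drain once the configured threshold is reached (assumed >=). */
	if (tgt->threshold != 0 && tgt->csum_errors >= tgt->threshold)
		tgt->drain_requested = true;

	return tgt->drain_requested;
}
```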

User Interface

Interactive Mode (nice to have)

  • A dmg command to start scrubbing from a script (would only run once, then quit)

Checksum Scrubbing Properties

These properties will be settable at the DAOS system level (the control plane still needs this ability) or at the individual pool level. If a property is set at both levels, the pool configuration overrides the system configuration. Updates should take effect right away rather than waiting for a full scrub to finish. For example, if the mode changes to be more aggressive, the scrubber should become more aggressive (based on the mode configuration) immediately, not on the next scan.
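The override rule can be illustrated with a small C sketch. The `scrub_props` structure, its "set" flags, and the `scrub_props_resolve` function are hypothetical names invented for this example; they only show one way the "pool overrides system" precedence could be expressed.

```c
#include <stdbool.h>
#include <stdint.h>

enum scrub_mode { SCRUB_MODE_LAZY, SCRUB_MODE_TIMED };

/* Hypothetical property set; each value carries a flag saying whether it was set. */
struct scrub_props {
	bool            mode_set;
	enum scrub_mode mode;
	bool            freq_set;
	uint64_t        freq_sec; /* scrub frequency in seconds */
};

/*
 * Compute the effective properties for a pool: start from the system-level
 * values and let any value set on the pool take precedence.
 */
static struct scrub_props scrub_props_resolve(const struct scrub_props *sys,
					       const struct scrub_props *pool)
{
	struct scrub_props eff = *sys;

	if (pool->mode_set) {
		eff.mode_set = true;
		eff.mode = pool->mode;
	}
	if (pool->freq_set) {
		eff.freq_set = true;
		eff.freq_sec = pool->freq_sec;
	}
	return eff;
}
```

If the scrubber re-resolves the effective properties each time it is about to process the next item, rather than once per scan, a property change takes effect in the middle of a scan, which is the behaviour asked for above.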

  • Pool Scrubber Mode (scrub) - How the scrubber will run for each pool target. The container configuration can disable scrubbing for the container, but it cannot alter the mode.

    • Lazy - Only run when system is idle (no IO activities and no space pressure)

    • Timed - Run at regular intervals regardless of IO activity, but without exceeding the Max CPU Impact

  • Pool Scrubber Frequency (scrub-freq) - How frequently checksums should be scrubbed. Checksums should not be scrubbed more frequently than this property specifies; depending on the Scrubber Mode, they may be scrubbed less frequently. The value is stored internally as a number of seconds, but dmg should be more user friendly by accepting minutes, hours, days, or weeks.

  • Pool Scrubber Max CPU Impact (scrub-max) - (Not implemented) Percentage of CPU used by the server scheduler to run the scrubber if there are other IO activities.

  • Scrubbing Threshold (scrub-thresh) - Number of checksum errors at which the pool target is evicted. A value of 0 disables automatic eviction.
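To make the frequency format concrete, the sketch below converts a user-friendly string such as "1w" into the number of seconds stored internally. It is not the dmg parser: the function name `scrub_freq_parse` and the exact suffix letters (s, m, h, d, w, with a bare number meaning seconds) are assumptions for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "<number>[s|m|h|d|w]" into seconds.
 * Returns 0 on success and stores the result in *out; returns -1 on
 * malformed input or overflow. The suffix set is an assumption.
 */
static int scrub_freq_parse(const char *str, uint64_t *out)
{
	char               *end;
	unsigned long long  n;
	uint64_t            mult;

	/* Require a leading digit so signs and whitespace are rejected. */
	if (str == NULL || *str < '0' || *str > '9')
		return -1;

	errno = 0;
	n = strtoull(str, &end, 10);
	if (errno != 0)
		return -1;

	switch (*end) {
	case '\0':
	case 's': mult = 1;      break;
	case 'm': mult = 60;     break;
	case 'h': mult = 3600;   break;
	case 'd': mult = 86400;  break;
	case 'w': mult = 604800; break;
	default:  return -1;
	}

	/* Nothing may follow the single-letter suffix. */
	if (*end != '\0' && end[1] != '\0')
		return -1;

	/* Guard against overflow before multiplying. */
	if (n > UINT64_MAX / mult)
		return -1;

	*out = (uint64_t)n * mult;
	return 0;
}
```

With this sketch, "1w" yields 604800 seconds and "90m" yields 5400 seconds, while inputs such as "1x" or "w" are rejected.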

The command to create a pool with scrubbing enabled might look like this:

    dmg pool create --scm-size 1G --properties=scrub:lazy,scrub-freq:1w
    # or
    dmg pool create --scm-size 1G aPool
    dmg pool set-prop aPool --properties=scrub:3,scrub-freq:1w

Container Properties (-> doc/user/container.md)

  • Container Disable Scrubbing - If scrubbing is enabled for a pool, a container can disable it for itself.

Pool Query (Not Implemented)

Include scrubber info on a pool query. (DAOS-7680)

  • Number of checksums scrubbed / estimated remaining

  • Last full pool scan completed (max of pool target completion)

Telemetry

The following telemetry metrics are gathered and can be reported for better understanding of how the scrubber is running. They will be gathered at both the pool and container level, with the exception of the Scrubber ULT Start.

Schedule

  • Scrubber ULT Start - datetime the scrubber service started

  • Scrubber Current Start - datetime the current scrubbing job started

  • Scrubber Last Completion - datetime of when the last scrubbing job completed

  • Last Duration - how long the last scrubbing job took to run to completion

Checksum Calculated Counts

  • Total Checksum Count - Total number of checksums calculated over the life of the scrubber.

  • Last Checksum Count - number of checksums calculated during last scrubber job

  • Current Checksum Count - number of checksums calculated so far for the current scrubber job

Silent Data Corruption Counts

  • Total Silent Data Corruption - Total number of instances of silent data corruption found while scrubbing object values.

  • Current Silent Data Corruption - number of instances of silent data corruption found so far during the current scrubber job
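The relationship between the "total", "last", and "current" metrics may be easier to see in code. The following C sketch is an illustration under assumptions: the `scrub_metrics` structure and its helper functions are hypothetical, timestamps are plain epoch seconds passed in by the caller, and the real implementation reports these values through the DAOS telemetry framework rather than a plain struct.

```c
#include <stdint.h>

/* Hypothetical container for the metrics listed above. */
struct scrub_metrics {
	int64_t  ult_start;         /* when the scrubber ULT started */
	int64_t  current_start;     /* when the current job started */
	int64_t  last_completion;   /* when the last job completed */
	int64_t  last_duration_sec; /* how long the last job took */
	uint64_t csum_total;        /* checksums calculated over the scrubber's life */
	uint64_t csum_last;         /* checksums calculated during the last job */
	uint64_t csum_current;      /* checksums calculated so far in this job */
	uint64_t sdc_total;         /* corruption found over the scrubber's life */
	uint64_t sdc_current;       /* corruption found so far in this job */
};

/* A new job resets only the "current" values. */
static void scrub_metrics_job_start(struct scrub_metrics *m, int64_t now)
{
	m->current_start = now;
	m->csum_current = 0;
	m->sdc_current = 0;
}

static void scrub_metrics_csum_calculated(struct scrub_metrics *m)
{
	m->csum_total++;
	m->csum_current++;
}

static void scrub_metrics_corruption_found(struct scrub_metrics *m)
{
	m->sdc_total++;
	m->sdc_current++;
}

/* Completing a job rolls the "current" values into the "last" values. */
static void scrub_metrics_job_end(struct scrub_metrics *m, int64_t now)
{
	m->last_completion = now;
	m->last_duration_sec = now - m->current_start;
	m->csum_last = m->csum_current;
}
```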

Design Details & Implementation

Pool ULT

The code for the pool ULT is found in srv_pool_scrub.c. It can be a bit difficult to follow because ULTs and the vos_iterator both work through callbacks, which results in several layers of callback functions. The file is organized so that each function typically calls the function above it, either directly or indirectly as a callback. For example (~> is an indirect call, -> is a direct call):

ds_start_scrubbing_ult ~> scrubbing_ult -> scrub_pool ~> cont_iter_scrub_cb -> scrub_cont ~> obj_iter_scrub_cb ...
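The layering of direct and indirect calls can be mimicked with a toy example. Everything below is a simplified stand-in: the `toy_` functions only borrow their names from the chain above, and `toy_iterate` is a made-up generic iterator that plays the role of vos_iterator. None of the signatures or bodies reflect the real code in srv_pool_scrub.c. The definitions appear bottom-up, so each function calls the one defined above it.

```c
#include <stddef.h>

/* Callback type and generic iterator; a made-up stand-in for vos_iterator. */
typedef int (*toy_iter_cb_t)(void *item, void *arg);

static int toy_iterate(void *items, size_t count, size_t item_size,
		       toy_iter_cb_t cb, void *arg)
{
	char *p = items;

	for (size_t i = 0; i < count; i++) {
		int rc = cb(p + i * item_size, arg);

		if (rc != 0)
			return rc; /* stop on the first error */
	}
	return 0;
}

struct toy_obj       { int id; };
struct toy_cont      { struct toy_obj *objs; size_t nr_objs; };
struct toy_pool      { struct toy_cont *conts; size_t nr_conts; };
struct toy_scrub_ctx { size_t objs_visited; };

/* Invoked indirectly (~>) by the iterator for every object. */
static int toy_obj_iter_scrub_cb(void *item, void *arg)
{
	struct toy_scrub_ctx *ctx = arg;

	(void)item; /* a real callback would scrub the object's values here */
	ctx->objs_visited++;
	return 0;
}

/* Called directly (->) by the container callback below. */
static int toy_scrub_cont(struct toy_cont *cont, struct toy_scrub_ctx *ctx)
{
	return toy_iterate(cont->objs, cont->nr_objs, sizeof(cont->objs[0]),
			   toy_obj_iter_scrub_cb, ctx);
}

/* Invoked indirectly (~>) by the iterator for every container. */
static int toy_cont_iter_scrub_cb(void *item, void *arg)
{
	return toy_scrub_cont(item, arg);
}

/* Entry point for one pool target: iterate its containers. */
static int toy_scrub_pool(struct toy_pool *pool, struct toy_scrub_ctx *ctx)
{
	return toy_iterate(pool->conts, pool->nr_conts, sizeof(pool->conts[0]),
			   toy_cont_iter_scrub_cb, ctx);
}
```

Reading the file from the bottom up therefore follows the order in which the calls are made.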