Phase I

Introduction

It may take many years to implement all of the functionality described in the catastrophic recovery high-level design document. As the starting point, catastrophic recovery phase I focuses on the local debug tools, the infrastructure to run the distributed checker, and distributed pool and container consistency (i.e., passes 0 to 5).

Requirements & Use Cases

SRS ticket is available here: https://daosio.atlassian.net/browse/SRS-396

DDB Functionality

For navigation purposes and for referencing different parts of a VOS tree, the notion of a VOS path is used and can indicate any level of the tree. For example, /[uuid] would reference the container at that uuid. For DDB, the pool uuid is not needed in the VOS path because it will already be connected to a pool shard. Another example, /[uuid]/[obj_id]/dkey/akey, would reference the value at the given akey path. To make it more convenient to reference parts of the path, indexing can be used. For example, /[0] would reference the ‘first’ container on the pool shard. The order of containers, keys, and values is arbitrary but consistent. The listing command (described below) indicates appropriate indexes to use for the different parts of the path.
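
To make the path scheme concrete, the following minimal Go sketch parses a VOS path into its components. This is an illustration only; the type and function names are hypothetical and not part of ddb.

package main

import (
	"fmt"
	"strings"
)

// VosPath holds the components of a parsed VOS path such as
// "/[uuid]/[obj_id]/dkey/akey". Illustrative only.
type VosPath struct {
	Cont, Obj, Dkey, Akey string
}

// parseVosPath splits a VOS path into up to four components: container,
// object, dkey, and akey. Bracketed components such as "[0]" are index
// references rather than literal identifiers.
func parseVosPath(path string) (VosPath, error) {
	trimmed := strings.Trim(path, "/")
	if trimmed == "" {
		return VosPath{}, fmt.Errorf("empty VOS path: %q", path)
	}
	parts := strings.Split(trimmed, "/")
	if len(parts) > 4 {
		return VosPath{}, fmt.Errorf("too many components: %q", path)
	}
	var p VosPath
	fields := []*string{&p.Cont, &p.Obj, &p.Dkey, &p.Akey}
	for i, part := range parts {
		*fields[i] = part
	}
	return p, nil
}

func main() {
	p, _ := parseVosPath("/[0]/[1]/dkey/akey")
	fmt.Printf("cont=%s obj=%s dkey=%s akey=%s\n", p.Cont, p.Obj, p.Dkey, p.Akey)
}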

The following functionalities will be supported:

  • Navigate the VOS tree, which consists of pools, containers, objects, dkeys, akeys, and values.

  • Dump the value of an akey at a given VOS path to a file.

  • Insert, modify, or delete a value at a specific VOS path.

  • Dump and modify DTX, ILOG and VEA entries.

  • View SMD info.

  • Sync the SMD file from backups within a pool’s NVMe blob.

Requirements for Distributed Checker

In the catastrophic recovery phase I, the distributed checker should have the ability to detect and handle the following inconsistencies:

  • Orphan pool

The pool is claimed by engines but not registered with the MS; it is an orphan pool.

  • Dangling pool entry

The pool has an entry registered on the MS but is not claimed by any engine; the MS entry is dangling.

  • Broken pool service

Only a subset of the PS replicas is present, so there is no quorum for electing a PS leader.

  • Inconsistent pool label

The pool label recorded by MS does not match the pool label property from PS.

  • Orphan pool shard in pool map

An engine has some allocated storage but does not appear in the pool map. The pool shard on this engine is an orphan.

  • Dangling pool map entry

An engine is referenced in the pool map, but no storage is allocated on this engine. The pool map entry for this pool shard is a dangling reference.

  • Orphan container

The container has storage allocated on engines but does not exist in the container service (CS, which is currently combined with the PS). It is an orphan container.

  • Inconsistent container label

The container label recorded by CS does not match the container label property.

In addition to the above internal functionality, there are also some control-related requirements, including:

  • Switch system between check mode and regular mode.

  • (Re-)Start checker for all pools or specified pool(s) in the system.

  • Stop checker for all pools or specified pool(s).

  • Resume checker for all pools or specified pool(s) from the previously stopped/paused point (pass).

  • Query checker progress for all pools or specified pool(s).

  • Configure the repair policy for a specified inconsistency class.

  • Interact with the checker to handle a specified inconsistency.

Design Overview

The DAOS Debug Tool (ddb) allows users to interact with a file in the VOS format. These files are generally located on a DAOS server under the pool folder of the engine mount point, for example /mnt/daos/[pool_uuid]/vos-0, where “vos-0” is the VOS file for a pool shard. Only a single VOS file can be opened at a time, and ddb must run locally on the system with the VOS file. Also, the DAOS engine must not be connected to the VOS file for ddb to connect to it. The ddb tool can be run from the command line or as an interactive CLI.

Once the local consistency is verified on each storage node, the DAOS server should be able to start, and the management service (MS) should be up and running. Then the admin can control the DAOS checker to verify DAOS distributed internal consistency via the dmg check APIs offered by the control plane. The following figure shows the schematic diagram for the DAOS checker workflow.

Check Passes

As described in the high-level design document, the DAOS checker needs to verify multiple DAOS components at different levels via multiple scanning passes. We therefore define check phases corresponding to these different verifications.

enum CheckScanPhase {
    CSP_PREPARE = 0;      // Initial phase, prepare to start check on related engines.
    CSP_POOL_LIST = 1;    // Pool list consolidation.
    CSP_POOL_MBS = 2;     // Pool membership.
    CSP_POOL_CLEANUP = 3; // Pool cleanup.
    CSP_CONT_LIST = 4;    // Container list consolidation.
    CSP_CONT_CLEANUP = 5; // Container cleanup.

    // The following phases will be implemented in the future.
    CSP_DTX_RESYNC = 6;   // DTX resync and cleanup.
    CSP_OBJ_SCRUB = 7;    // RP/EC shards consistency verification with checksum scrub if available.
    CSP_REBUILD = 8;      // Object rebuild.
    CSP_AGGREGATION = 9;  // EC aggregation & VOS aggregation.
    CSP_DONE = 10;        // All done.
};

The DAOS checker drives the verification for each pool with its own dedicated ULT (on both the check leader and the check engines). One pool being blocked (such as for interaction) will not affect other pools, so the verification for multiple pools can run in parallel, potentially in different passes.
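
Since each pool is checked by its own ULT, per-pool blocking is isolated. Below is a minimal Go sketch of this execution model, with goroutines standing in for ULTs; the names are illustrative, not the actual engine code.

package main

import (
	"fmt"
	"sync"
)

// checkPool stands in for the per-pool verification ULT. If one pool
// blocks (e.g., waiting for admin interaction), only its goroutine
// stalls; the other pools keep progressing through their passes.
func checkPool(pool string, passes []string) {
	for _, pass := range passes {
		fmt.Printf("pool %s: running %s\n", pool, pass)
	}
}

func main() {
	passes := []string{"CSP_POOL_MBS", "CSP_POOL_CLEANUP", "CSP_CONT_LIST", "CSP_CONT_CLEANUP"}
	var wg sync.WaitGroup
	for _, pool := range []string{"pool-A", "pool-B"} {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			checkPool(p, passes)
		}(pool)
	}
	wg.Wait()
}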

Inconsistency Classification

The inconsistency classification defines the kinds of distributed inconsistency that the DAOS checker will detect and handle in this milestone, which include the following:

  • CIC_POOL_LESS_SVC_WITHOUT_QUORUM

Only a subset of the PS replicas is present, so there is no quorum for electing a PS leader.

  • CIC_POOL_NONEXIST_ON_MS

The pool is claimed by engines but not registered with the MS; it is an orphan pool.

  • CIC_POOL_NONEXIST_ON_ENGINE

The pool has an entry registered on the MS but is not claimed by any engine; the MS entry is dangling.

  • CIC_POOL_BAD_LABEL

The pool label recorded by MS does not match the pool label property from PS.

  • CIC_ENGINE_NONEXIST_IN_MAP

An engine has some allocated storage but does not appear in the pool map. The pool shard on this engine is an orphan.

  • CIC_ENGINE_HAS_NO_STORAGE

An engine is referenced in the pool map, but no storage is allocated on this engine. The pool map entry for this pool shard is a dangling reference.

  • CIC_CONT_NONEXIST_ON_PS

The container has storage allocated on engines but does not exist in the container service (CS, which is currently combined with the PS). It is an orphan container.

  • CIC_CONT_BAD_LABEL

The container label recorded by CS does not match the container label property.

Inconsistency Handling Policy

The inconsistency handling policy is the set of actions that the DAOS checker can use to handle these kinds of distributed inconsistencies. The admin can specify the handling policy for each kind of inconsistency when starting the DAOS checker.

  • CIA_DISCARD

Discard the unrecognized element: the pool shard, the whole pool, the container, and so on, depending on the detailed inconsistency.

  • CIA_READD

Re-add the missing element back to the system: register orphan pool to MS, add orphan pool target to the pool map, and so on, depending on the detailed inconsistency.

  • CIA_TRUST_MS

Trust the information recorded in MS, then fix the element that is inconsistent with MS.

  • CIA_TRUST_PS

Trust the information recorded in PS, then fix the element that is inconsistent with PS.

  • CIA_TRUST_TARGET

Trust the information recorded by the target, then fix the element that is inconsistent with the target.

  • CIA_INTERACT

Ask the admin to explicitly make the decision to handle the inconsistency.

  • CIA_IGNORE

Only record and report the inconsistency without further repair.

  • CIA_DEFAULT

If the admin does not specify the policy for some kind of inconsistency, then the DAOS checker will use the default action to handle that kind of inconsistency. The default action ultimately maps to one of the above policies, depending on the detailed inconsistency, as sketched below.
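
To illustrate how CIA_DEFAULT could resolve to a concrete action per inconsistency class, here is a hedged Go sketch. The mapping follows the [suggested] options listed in the per-inconsistency sections below; all type and constant names are illustrative rather than the actual DAOS definitions.

package main

import "fmt"

// Illustrative types mirroring the inconsistency classes and actions
// described in this document; not the actual DAOS definitions.
type InconsistencyClass int
type InconsistencyAction int

const (
	PoolNonexistOnMS InconsistencyClass = iota
	PoolNonexistOnEngine
	PoolLessSvcWithoutQuorum
	PoolBadLabel
	ContNonexistOnPS
	ContBadLabel
)

const (
	Discard InconsistencyAction = iota
	Readd
	TrustMS
	TrustPS
)

// defaultAction maps CIA_DEFAULT to a concrete action, following the
// [suggested] options in the per-inconsistency sections below.
func defaultAction(class InconsistencyClass) InconsistencyAction {
	switch class {
	case PoolNonexistOnMS:
		return Readd // re-add the orphan pool back to MS
	case PoolNonexistOnEngine:
		return Discard // discard the dangling pool entry from MS
	case PoolLessSvcWithoutQuorum, ContNonexistOnPS, ContBadLabel:
		return TrustPS
	case PoolBadLabel:
		return TrustMS
	default:
		return Discard
	}
}

func main() {
	fmt.Println(defaultAction(PoolBadLabel) == TrustMS) // prints: true
}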

Check Mode

Currently, the DAOS checker performs offline scanning and repair. We introduce a new “check” mode for these offline operations. Compared with regular mode, check mode imposes some restrictions:

  • Restricted pool service (PS)

    • Do not automatically start the pool service until the check leader explicitly triggers it.

    • Forbid new pool connections from clients and evict existing connections.

    • Disable automatic reconfiguration of pool service replicas.

  • Turn off some background ULTs

    • VOS/DTX/EC aggregation.

    • DTX batched commit.

    • Garbage collection.

    • Checksum scrub.

    • VEA flush.

The admin can switch the system mode via “dmg check enable/disable”.
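
A minimal sketch of how check mode might gate the background ULTs listed above; the names are illustrative and this is not the actual engine code.

package main

import "fmt"

// systemMode gates which background tasks may be scheduled.
type systemMode int

const (
	modeRegular systemMode = iota
	modeCheck
)

// backgroundTasks lists the background ULTs that check mode turns off.
var backgroundTasks = []string{
	"aggregation", "dtx_batched_commit", "gc", "csum_scrub", "vea_flush",
}

// shouldRun reports whether a background task may be scheduled; under
// check mode, all of the listed tasks are suppressed.
func shouldRun(mode systemMode, task string) bool {
	return mode != modeCheck
}

func main() {
	for _, t := range backgroundTasks {
		fmt.Printf("%s runs in check mode: %v\n", t, shouldRun(modeCheck, t))
	}
}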

Control Plane Enhancements for DAOS Checker

The DAOS Control Plane is responsible for coordinating the processes that provide storage services (Engines). It comprises the control plane service (daos_server), the client-side agent (daos_agent), and the administrative tool used for managing DAOS (dmg).

Management Service Database Updates

The DAOS Management Service (MS) is a Raft-replicated in-memory database that runs on a subset of the Control Plane service instances. This database contains the top-level set of information about the entire DAOS system (ranks within the system, the rank/fabric address mappings, and Pool service entries).

The MS database was expanded to allow replicated storage of checker findings. As the checker engine discovers issues in the DAOS storage, structured reports are sent to the MS, which then persists them in the MS DB.

Checker Reports

A checker report contains the following high-level information:

  • Sequence Number: A unique identifier for a given checker finding.

  • Finding Class: Corresponds to the inconsistency classification as described above.

  • Action: Records the action taken automatically if applicable, or as specified by the administrator.

  • Timestamp: Records the time that the report was generated.

  • Finding details: Depending on the inconsistency class, different details will be available to assist the administrator with understanding what the problem is (e.g., pool UUID/Label, target index, etc.).

  • Potential/Suggested repair actions: If the checker has not been configured to automatically repair a given inconsistency class, then the report will include potential repair actions for display to the administrator, including a suggested action when applicable.
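
The fields above could be modeled roughly as follows. This Go sketch is an assumption for illustration, not the actual MS database schema.

package main

import (
	"fmt"
	"time"
)

// CheckerReport models the high-level fields of a checker finding as
// described above. The field types are assumptions for illustration.
type CheckerReport struct {
	Seq       uint64    // unique identifier for this finding
	Class     string    // inconsistency class, e.g. "CIC_POOL_BAD_LABEL"
	Action    string    // action taken, or as specified by the administrator
	Timestamp time.Time // when the report was generated
	Details   string    // class-specific details (pool UUID/label, target index, ...)
	Options   []string  // potential repair actions if not auto-repaired
}

func main() {
	r := CheckerReport{
		Seq:       1,
		Class:     "CIC_POOL_BAD_LABEL",
		Action:    "CIA_INTERACT",
		Timestamp: time.Now(),
		Details:   "pool label mismatch between MS and PS",
		Options:   []string{"CIA_TRUST_MS", "CIA_TRUST_PS", "CIA_IGNORE"},
	}
	fmt.Printf("report #%d: %s (%s)\n", r.Seq, r.Class, r.Action)
}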

Engine Logic for DAOS Checker

The DAOS engine is the main body for scanning the system to detect distributed inconsistencies and then handle them accordingly.

Check Interaction

Strictly speaking, the DAOS checker is not an interactive debug tool that offers some kind of shell for handling user input interactively. Instead, DAOS check interaction is mainly used for the following purposes, without an interactive shell:

  • For sanity checks, the admin may want to confirm before repairing some critical issues, and so configures the DAOS checker to raise a check interaction when it hits a specified inconsistency.

  • The admin wants the DAOS checker to handle different inconsistency issues (of the same inconsistency class) with different actions, case by case.

  • If the DAOS checker is uncertain about how to handle some inconsistency, it will ask the admin to make the decision via check interaction.

DAOS check interaction is triggered by the check leader (described in the Check Instance section below) via an asynchronous check report upcall to the MS for an unresolved inconsistency. When the admin queries the check progress, related interaction requests will be shown. The admin can then make a decision and instruct the DAOS checker to handle the inconsistency via the “dmg check repair” API.
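
Conceptually, the interaction flow is a table of pending findings keyed by sequence number, each resolved later by an admin decision. The following Go sketch models that flow; the function names and data structures are hypothetical, not the actual DAOS API.

package main

import (
	"fmt"
	"sync"
)

// pending maps a finding's sequence number to the channel on which the
// reporting side waits for the admin's decision. Illustrative only.
var (
	mu      sync.Mutex
	pending = map[uint64]chan string{}
)

// registerFinding records an unresolved finding; the reporting side later
// blocks on the returned channel until the admin makes a decision.
func registerFinding(seq uint64) chan string {
	ch := make(chan string, 1)
	mu.Lock()
	defer mu.Unlock()
	pending[seq] = ch
	return ch
}

// adminRepair delivers the admin's chosen action for a pending finding,
// standing in for "dmg check repair".
func adminRepair(seq uint64, action string) {
	mu.Lock()
	ch, ok := pending[seq]
	delete(pending, seq)
	mu.Unlock()
	if ok {
		ch <- action
	}
}

func main() {
	decision := registerFinding(42) // checker reports finding #42
	adminRepair(42, "CIA_TRUST_MS") // admin resolves it
	fmt.Println("finding 42 resolved with", <-decision)
}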

Check Instance

Each time the DAOS checker is triggered, it generates a check instance. For each check instance, there is a unique leader instance (called the “check leader”) and several engine instances (called “check engines”).

The check leader is elected by the MS when starting the DAOS checker. It cannot be switched to another rank unless the admin wants to restart the DAOS checker from scratch. The main responsibilities of the check leader are as follows:

  • The unified interface with MS for the whole check process.

    • Execute check command (start/stop/query) from MS.

    • Report inconsistencies (handling actions and results) to the MS.

  • Control check engines.

    • Start/stop/query check engines.

    • Track check engines' progress and exclude dead ones.

  • Interact with the admin and forward the feedback to the related sponsor.

  • Drive the pass 0 and pass 1 processing with an independent ULT for each pool.

  • Track check progress for the pools in the current check instance's scope.

On each server rank, one check engine runs for the current check instance, including on the rank where the check leader resides. Here are the check engine's main responsibilities:

  • Execute check command (start/stop/query) from the check leader.

  • Drive the pass 2 to pass N processing with an independent ULT for each pool on it.

  • Report inconsistencies (handling actions and results) to the check leader.

  • Track check progress for the pool shards in the current check instance's scope.

Logic of Detecting and Handling Distributed Inconsistency

In this section, we describe how the DAOS checker detects and handles each kind of inconsistency.

Detect and Handle Orphan Pool

When starting the checker, every check engine will return its known pool list (shards, UUID, label, and PS replicas) in the CHK_START RPC reply to the check leader. The check leader will compare this with the pool list obtained from the MS (via the DRPC_METHOD_CHK_LIST_POOL upcall). If some pool only exists in the engine-reported list, then it is an orphan pool.
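
The comparison amounts to a two-way set difference between the engine-reported pool list and the MS pool list: one direction yields orphan pools, the other yields the dangling pool entries covered in the next subsection. A minimal Go sketch, with pools reduced to identifiers for illustration:

package main

import "fmt"

// comparePoolLists returns the pools known only to the engines (orphan
// pools) and the pools known only to the MS (dangling pool entries).
func comparePoolLists(engines, ms []string) (orphans, dangling []string) {
	msSet := map[string]bool{}
	for _, p := range ms {
		msSet[p] = true
	}
	engSet := map[string]bool{}
	for _, p := range engines {
		engSet[p] = true
		if !msSet[p] {
			orphans = append(orphans, p)
		}
	}
	for _, p := range ms {
		if !engSet[p] {
			dangling = append(dangling, p)
		}
	}
	return orphans, dangling
}

func main() {
	orphans, dangling := comparePoolLists(
		[]string{"pool-A", "pool-B"}, // reported via CHK_START replies
		[]string{"pool-B", "pool-C"}, // obtained from the MS pool list
	)
	fmt.Println("orphan pools:", orphans)      // [pool-A]
	fmt.Println("dangling entries:", dangling) // [pool-C]
}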

Solution Options (based on the handle policy, specified or not):

  • CIA_READD or CIA_TRUST_PS or CIA_DEFAULT: Re-add the orphan pool back to MS [suggested].

  • CIA_DISCARD or CIA_TRUST_MS: Destroy the orphan pool to release space.

  • CIA_IGNORE: Keep the orphan pool entry on engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Dangling Pool

In the above pool list comparison, if some pool only exists on the MS, then it is a dangling pool.

Solution Options (based on the handle policy, specified or not):

  • CIA_DISCARD or CIA_TRUST_PS or CIA_DEFAULT: Discard the dangling pool entry from MS [suggested].

  • CIA_IGNORE: Keep the dangling pool entry on MS, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Broken PS

For a pool that exists on engines (and was not destroyed as an orphan), the check leader will check whether there are enough pool service replicas for PS leader quorum. If not, then the pool service is broken.
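
Quorum here is a strict majority of the configured PS replicas, as the following small Go sketch illustrates:

package main

import "fmt"

// hasQuorum reports whether enough PS replicas are present to elect a
// PS leader: a strict majority of the configured replicas.
func hasQuorum(present, total int) bool {
	return present > total/2
}

func main() {
	fmt.Println(hasQuorum(2, 5)) // false: 2 of 5 replicas cannot elect a leader
	fmt.Println(hasQuorum(3, 5)) // true
}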

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_PS or CIA_DEFAULT: Start pool service under DICTATE mode [suggested].

  • CIA_DISCARD: Destroy the corrupted pool from related engines to release space.

  • CIA_IGNORE: Keep the corrupted pool on related engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Inconsistent Pool Label

Then, for a pool that exists on both the MS and engines, the check leader will compare the pool label from the engines with the one from the MS. If they do not match, then the pool label is bad.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_MS or CIA_DEFAULT: Trust MS pool label [suggested].

NOTE: From the check leader's perspective, it does not know which label is corrupted. However, the pool label on the MS is mainly used for mapping the pool from label to UUID and is more visible than the one stored in the PS property, which can be regarded as a backup. So unless the pool label on the MS is empty, the check leader will prefer to trust the pool label on the MS and fix the pool label in the pool property to make them consistent (see the sketch after this list).

  • CIA_TRUST_PS: Trust PS pool label and fix the pool label on MS.

  • CIA_IGNORE: Keep the inconsistent pool label, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.
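
A sketch of the trust preference described in the NOTE above; the same logic applies analogously to container labels later in this document. The function is hypothetical, not the actual repair code.

package main

import "fmt"

// chooseTrustedLabel implements the default label-repair preference: trust
// the MS copy unless it is empty, in which case fall back to the PS copy.
func chooseTrustedLabel(msLabel, psLabel string) (trusted string, fixSide string) {
	if msLabel != "" {
		return msLabel, "PS" // rewrite the PS property to match the MS
	}
	return psLabel, "MS" // MS label missing; restore it from the PS
}

func main() {
	label, side := chooseTrustedLabel("tank", "tank-old")
	fmt.Printf("trust %q, fix the %s copy\n", label, side)
}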

Detect and Handle Orphan Pool Shard

After the pool list comparison, for a pool that can start its pool service, the check leader will send the known pool shards list (from the CHK_START RPC reply) to the check engine on which the PS leader resides (via the CHK_POOL_MBS RPC). That check engine will then load the pool map and compare it with the given pool shards. If some pool shard only exists on the engine, then it is an orphan pool shard.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_PS or CIA_DISCARD or CIA_DEFAULT: Discard the orphan pool shard to release space [suggested].

  • CIA_IGNORE: Keep the orphan pool shard on engine; repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Dangling Pool Map Entry

In the above pool shards comparison, if an engine is referenced in the pool map but no storage is allocated on that engine, then the related pool map entry is a dangling reference.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_TARGET or CIA_DEFAULT: Mark the dangling map entry as DOWN in the pool map [suggested].

  • CIA_IGNORE: Keep the dangling map entry in the pool map, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Orphan Container

After pool cleanup, the check engine on the PS leader will collect the container list for the pool from the pool shards (via the CHK_CONT_LIST RPC). Then, for each container in the list, that check engine will check whether it exists in the container service (CS) or not. If not, then it is an orphan container.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_PS or CIA_DISCARD or CIA_DEFAULT: Destroy the orphan container to release space [suggested].

NOTE: Currently, we do not support adding the orphan container back to the CS; that may be implemented in the future when we have enough information to recover the properties and attributes for the orphan container.

  • CIA_IGNORE: Keep the orphan container on engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Inconsistent Container Label

Then, for each non-orphan container, the check engine will check whether its label in the CS matches the one in the container property. If not, then the container label is bad.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_PS or CIA_DEFAULT: Trust the container label in container service [suggested].

NOTE: Similar to the case of a bad pool label, from the check engine's perspective it does not know which label is corrupted. However, the container label on the PS (CS) is mainly used for mapping the container from label to UUID and is more visible than the one stored in the container property, which can be regarded as a backup. So unless the container label on the PS is empty, the check engine will prefer to trust the container label on the PS and fix the container label in the container property to make them consistent.

  • CIA_TRUST_TARGET: Trust the container label in container property.

  • CIA_IGNORE: Keep the inconsistent container label, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

User Interface

Both ddb and the distributed checker are user-visible tools and need friendly user interfaces.

DDB API

General Usage

  • ddb [OPTIONS] [<vos_file_path>] [COMMAND [COMMAND OPTIONS] [COMMAND ARGS]]

If a command is not specified on the command line, then ddb runs in interactive mode. Below is a list of commands. For more details about the commands, see the help output of ddb (also included in the Appendix).

General Commands

  • clear: Clears the screen in interactive mode.

  • exit: Exits the interactive shell.

  • help: Displays help information for a given command.

  • quit or q: Exits the interactive shell.

  • version: Prints the DDB version.

VOS Commands

For these commands to work, a VOS file must first be opened using the “open” command.

  • open: Opens the VOS file at the specified path.

  • close: Closes the currently opened VOS pool shard.

  • superblock_dump: Dumps the pool superblock information.

Interacting with the VOS Tree

  • ls: Lists containers, objects, dkeys, akeys, and values.

  • rm: Removes a branch of the VOS tree.

  • value_dump: Dumps a value at a given akey to the screen or a file.

  • value_load: Loads a value to a VOS path. Can be used to update the value of an existing key or create a new key.

Transactions

DTX and ILOG entries and tables support DAOS consistency between engines with transactions and incarnation logs.

  • dtx_act_abort: Marks the active DTX entry as aborted.

  • dtx_act_commit: Marks the active DTX entry as committed.

  • dtx_cmt_clear: Clears the DTX committed table.

  • dtx_dump: Dumps the DTX tables.

  • ilog_clear: Removes all the ILOG entries.

  • ilog_commit: Processes the ILOG.

  • ilog_dump: Dumps the ILOG.

Versioned Extent Allocator

The VEA table stores information about allocated and free regions on an NVMe SSD.

  • vea_dump: Dumps information from the VEA about free regions.

  • vea_update: Alters the VEA tree to mark a region as free.

SMD Commands

For these commands to work, no VOS file may currently be open.

  • smd: Displays information about the SMD file.

  • smd_sync: Restores the SMD file with backup from NVMe blob.

dmg check API

  • dmg check enable: Set system property and start ranks in checker mode.

  • dmg check disable: Clear system property and stop ranks.

  • dmg check start: Start a checker session for all pools or a subset.

Usage: dmg [OPTIONS] check start [start-OPTIONS] [[pool name or UUID [pool name or UUID]] ...]

[start command options]
      -n, --dry-run           Scan only; do not initiate repairs.
      -r, --reset             Reset the system check state.
      -f, --failout=[on|off]  Stop on failure.
      -a, --auto=[on|off]     Attempt to automatically repair problems.
      -O, --find-orphans      Find orphaned pools.
      -p, --policies=         Set repair policies.

  • dmg check stop: Stop the current checker session.

Usage: dmg [OPTIONS] check stop [[pool name or UUID [pool name or UUID]] ...]

  • dmg check query: Retrieve status of the checker session and any findings.

  • dmg check repair: Perform a repair operation for a given checker finding.

  • dmg check get-policy: Display the current checker action for each inconsistency class.

  • dmg check set-policy: Update the checker action for an inconsistency class.

MS Recovery Capabilities

In addition to the Pool and Container repair capabilities offered by the Engine-based Checker, the Management Service itself has some new features for assistance with recovering from certain failure scenarios. These features are described in the following sections documenting the new daos_server subcommands related to MS status and recovery.

  • daos_server ms status

This command is used to obtain status about the MS database stored on a given replica. With this information, the administrator may determine which of several replicas has the most recent committed raft logs (normally, all replicas should receive the same set of logs, but in a catastrophic failure, some replicas may not receive the most recent logs before being terminated).

  • daos_server ms recover

This command may be used to force a recovery of the MS database using this replica. In normal operations, once the MS has reached a quorum of peers, it will fall back to degraded (read-only) operations if it loses that quorum and will not allow write operations until it has regained quorum with previously contacted peers. If there is only one MS replica remaining, or the set of replicas needs to be updated to include a new set of peers, then this command may be used to force the MS back into bootstrap mode. In bootstrap mode, the first replica to start will automatically gain leadership and other replicas will be added as followers.

  • daos_server ms restore

This command is similar to the previously described recover command, but with the prerequisite step of reading the MS state from an on-disk snapshot before bootstrapping the replica. If the MS snapshots have been backed up to external storage, they can be used to restore the system metadata to whichever version of the database was current when the snapshot was taken.

Impacts

In phase I, the catastrophic recovery tools run in offline mode and will not affect any online operation; there is no impact on performance, the RPC protocol, rebuild logic, aggregation, or security, and there are no configuration changes or interoperability issues. Some new user interfaces will be introduced; please refer to the Requirements & Use Cases section for details.

Quality

The test will be done as described in the catastrophic recovery test plan.