Phase I

Introduction

It may take many years to implement all of the functionality described in the catastrophic recovery high-level design document. As the starting point, catastrophic recovery phase I focuses on the local debug tools, the infrastructure to run the distributed checker, and distributed pool and container consistency (i.e., passes 0 to 5).

Requirements & Use Cases

SRS ticket is available here: https://daosio.atlassian.net/browse/SRS-396

DDB Functionality

For navigation purposes and for referencing different parts of a VOS tree, the notion of a VOS path is used and can indicate any level of the tree. For example, /[uuid] would reference the container at that uuid. For DDB, the pool uuid is not needed in the VOS path because it will already be connected to a pool shard. Another example, /[uuid]/[obj_id]/dkey/akey, would reference the value at the given akey path. To make it more convenient to reference parts of the path, indexing can be used. For example, /[0] would reference the ‘first’ container on the pool shard. The order of containers, keys, and values is arbitrary but consistent. The listing command (described below) indicates appropriate indexes to use for the different parts of the path.
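
To make the path scheme concrete, the following minimal Go sketch parses a VOS path into its components. This is an illustration only; the type and function names are hypothetical and not part of ddb.

package main

import (
	"fmt"
	"strings"
)

// VosPath holds the components of a parsed VOS path such as
// "/[uuid]/[obj_id]/dkey/akey". Illustrative only.
type VosPath struct {
	Cont, Obj, Dkey, Akey string
}

// parseVosPath splits a VOS path into up to four components: container,
// object, dkey, and akey. Bracketed components such as "[0]" are index
// references rather than literal identifiers.
func parseVosPath(path string) (VosPath, error) {
	trimmed := strings.Trim(path, "/")
	if trimmed == "" {
		return VosPath{}, fmt.Errorf("empty VOS path: %q", path)
	}
	parts := strings.Split(trimmed, "/")
	if len(parts) > 4 {
		return VosPath{}, fmt.Errorf("too many components: %q", path)
	}
	var p VosPath
	fields := []*string{&p.Cont, &p.Obj, &p.Dkey, &p.Akey}
	for i, part := range parts {
		*fields[i] = part
	}
	return p, nil
}

func main() {
	p, _ := parseVosPath("/[0]/[1]/dkey/akey")
	fmt.Printf("cont=%s obj=%s dkey=%s akey=%s\n", p.Cont, p.Obj, p.Dkey, p.Akey)
}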

The following functionalities will be supported:

  • Navigate the VOS tree, which consists of pools, containers, objects, dkeys, akeys, and values.

  • Dump the value of an akey at a given VOS path to a file.

  • Insert, modify, or delete a value at a specific VOS path.

  • Dump and modify DTX, ILOG and VEA entries.

  • View SMD info.

  • Sync the SMD file from backups within a pool’s NVMe blob.

Requirements for Distributed Checker

In the catastrophic recovery phase I, the distributed checker should have the ability to detect and handle the following inconsistencies:

  • Orphan pool

The pool is claimed by engines but not registered with the MS; it is an orphan pool.

  • Dangling pool entry

The pool has an entry registered on the MS but is not claimed by any engine; the MS entry is dangling.

  • Broken pool service

Only a subset of the PS replicas is present, so there is no quorum for electing a PS leader.

  • Inconsistent pool label

The pool label recorded by MS does not match the pool label property from PS.

  • Orphan pool shard in pool map

An engine has some allocated storage but does not appear in the pool map. The pool shard on this engine is an orphan.

  • Dangling pool map entry

An engine is referenced in the pool map, but no storage is allocated on this engine. The pool map entry for this pool shard is a dangling reference.

  • Orphan container

The container has storage allocated on engines but does not exist in the container service (CS, which is currently combined with the PS). It is an orphan container.

  • Inconsistent container label

The container label recorded by CS does not match the container label property.

In addition to the above internal functionality, there are also some control-related requirements, including:

  • Switch system between check mode and regular mode.

  • (Re-)Start checker for all pools or specified pool(s) in the system.

  • Stop checker for all pools or specified pool(s).

  • Resume checker for all pools or specified pool(s) from the previously stopped/paused point (pass).

  • Query checker progress for all pools or specified pool(s).

  • Configure the repair policy for a specified inconsistency class.

  • Interact with the checker to handle a specified inconsistency.

Design Overview

The DAOS Debug Tool (ddb) allows users to interact with a file in the VOS format. These files are generally located on a DAOS server under the pool folder of the engine mount point, for example /mnt/daos/[pool_uuid]/vos-0, where “vos-0” is the VOS file for a pool shard. Only a single VOS file can be opened at a time, and ddb must run locally on the system with the VOS file. Also, the DAOS engine must not be connected to the VOS file for ddb to connect to it. The ddb tool can be run from the command line or as an interactive CLI.

Once the local consistency is verified on each storage node, the DAOS server should be able to start, and the management service (MS) should be up and running. Then the admin can control the DAOS checker to verify DAOS distributed internal consistency via the dmg check APIs offered by the control plane. The following figure shows the schematic diagram for the DAOS checker workflow.

Check Passes

As described in the high-level design document, the DAOS checker needs to verify multiple DAOS components at different levels via multiple scanning passes. We therefore define check phases corresponding to these different verifications.

enum CheckScanPhase {
    CSP_PREPARE = 0;      // Initial phase, prepare to start check on related engines.
    CSP_POOL_LIST = 1;    // Pool list consolidation.
    CSP_POOL_MBS = 2;     // Pool membership.
    CSP_POOL_CLEANUP = 3; // Pool cleanup.
    CSP_CONT_LIST = 4;    // Container list consolidation.
    CSP_CONT_CLEANUP = 5; // Container cleanup.

    // The following phases will be implemented in the future.
    CSP_DTX_RESYNC = 6;   // DTX resync and cleanup.
    CSP_OBJ_SCRUB = 7;    // RP/EC shards consistency verification with checksum scrub if available.
    CSP_REBUILD = 8;      // Object rebuild.
    CSP_AGGREGATION = 9;  // EC aggregation & VOS aggregation.
    CSP_DONE = 10;        // All done.
};

The DAOS checker drives the verification for each pool with its own dedicated ULT (on both the check leader and the check engines). One pool being blocked (such as for interaction) will not affect other pools, so the verification for multiple pools can run in parallel, potentially in different passes.
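
Since each pool is checked by its own ULT, per-pool blocking is isolated. Below is a minimal Go sketch of this execution model, with goroutines standing in for ULTs; the names are illustrative, not the actual engine code.

package main

import (
	"fmt"
	"sync"
)

// checkPool stands in for the per-pool verification ULT. If one pool
// blocks (e.g., waiting for admin interaction), only its goroutine
// stalls; the other pools keep progressing through their passes.
func checkPool(pool string, passes []string) {
	for _, pass := range passes {
		fmt.Printf("pool %s: running %s\n", pool, pass)
	}
}

func main() {
	passes := []string{"CSP_POOL_MBS", "CSP_POOL_CLEANUP", "CSP_CONT_LIST", "CSP_CONT_CLEANUP"}
	var wg sync.WaitGroup
	for _, pool := range []string{"pool-A", "pool-B"} {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			checkPool(p, passes)
		}(pool)
	}
	wg.Wait()
}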

Inconsistency Classification

The inconsistency classification defines the kinds of distributed inconsistency that the DAOS checker will detect and handle in this milestone, which include the following:

  • CIC_POOL_LESS_SVC_WITHOUT_QUORUM

Only a subset of the PS replicas is present, so there is no quorum for electing a PS leader.

  • CIC_POOL_NONEXIST_ON_MS

The pool is claimed by engines but not registered with the MS; it is an orphan pool.

  • CIC_POOL_NONEXIST_ON_ENGINE

The pool has an entry registered on the MS but is not claimed by any engine; the MS entry is dangling.

  • CIC_POOL_BAD_LABEL

The pool label recorded by MS does not match the pool label property from PS.

  • CIC_ENGINE_NONEXIST_IN_MAP

An engine has some allocated storage but does not appear in the pool map. The pool shard on this engine is an orphan.

  • CIC_ENGINE_HAS_NO_STORAGE

An engine is referenced in the pool map, but no storage is allocated on this engine. The pool map entry for this pool shard is a dangling reference.

  • CIC_CONT_NONEXIST_ON_PS

The container has storage allocated on engines but does not exist in the container service (CS, which is currently combined with the PS). It is an orphan container.

  • CIC_CONT_BAD_LABEL

The container label recorded by CS does not match the container label property.

Inconsistency Handling Policy

The inconsistency handling policy is the set of actions that the DAOS checker can use to handle these kinds of distributed inconsistencies. The admin can specify the handling policy for each kind of inconsistency when starting the DAOS checker.

  • CIA_DISCARD

Discard the unrecognized element: the pool shard, the whole pool, the container, and so on, depending on the detailed inconsistency.

  • CIA_READD

Re-add the missing element back to the system: register orphan pool to MS, add orphan pool target to the pool map, and so on, depending on the detailed inconsistency.

  • CIA_TRUST_MS

Trust the information recorded in MS, then fix the element that is inconsistent with MS.

  • CIA_TRUST_PS

Trust the information recorded in PS, then fix the element that is inconsistent with PS.

  • CIA_TRUST_TARGET

Trust the information recorded by the target, then fix the element that is inconsistent with the target.

  • CIA_INTERACT

Ask the admin to explicitly make the decision to handle the inconsistency.

  • CIA_IGNORE

Only record and report the inconsistency without further repair.

  • CIA_DEFAULT

If the admin does not specify the policy for some kind of inconsistency, then the DAOS checker will use the default action to handle that kind of inconsistency. The default action ultimately maps to one of the above policies, depending on the detailed inconsistency, as sketched below.
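
To illustrate how CIA_DEFAULT could resolve to a concrete action per inconsistency class, here is a hedged Go sketch. The mapping follows the [suggested] options listed in the per-inconsistency sections below; all type and constant names are illustrative rather than the actual DAOS definitions.

package main

import "fmt"

// Illustrative types mirroring the inconsistency classes and actions
// described in this document; not the actual DAOS definitions.
type InconsistencyClass int
type InconsistencyAction int

const (
	PoolNonexistOnMS InconsistencyClass = iota
	PoolNonexistOnEngine
	PoolLessSvcWithoutQuorum
	PoolBadLabel
	ContNonexistOnPS
	ContBadLabel
)

const (
	Discard InconsistencyAction = iota
	Readd
	TrustMS
	TrustPS
)

// defaultAction maps CIA_DEFAULT to a concrete action, following the
// [suggested] options in the per-inconsistency sections below.
func defaultAction(class InconsistencyClass) InconsistencyAction {
	switch class {
	case PoolNonexistOnMS:
		return Readd // re-add the orphan pool back to MS
	case PoolNonexistOnEngine:
		return Discard // discard the dangling pool entry from MS
	case PoolLessSvcWithoutQuorum, ContNonexistOnPS, ContBadLabel:
		return TrustPS
	case PoolBadLabel:
		return TrustMS
	default:
		return Discard
	}
}

func main() {
	fmt.Println(defaultAction(PoolBadLabel) == TrustMS) // prints: true
}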

Check Mode

Currently, the DAOS checker performs offline scanning and repair. We introduce a new “check” mode for these offline operations. Compared with regular mode, check mode imposes some restrictions:

  • Restricted pool service (PS)

    • Do not automatically start the pool service until the check leader explicitly triggers it.

    • Forbid new pool connections from clients and evict existing connections.

    • Disable automatic reconfiguration of pool service replicas.

  • Turn off some background ULTs

    • VOS/DTX/EC aggregation.

    • DTX batched commit.

    • Garbage collection.

    • Checksum scrub.

    • VEA flush.

The admin can switch the system mode via “dmg check enable/disable”.
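
A minimal sketch of how check mode might gate the background ULTs listed above; the names are illustrative and this is not the actual engine code.

package main

import "fmt"

// systemMode gates which background tasks may be scheduled.
type systemMode int

const (
	modeRegular systemMode = iota
	modeCheck
)

// backgroundTasks lists the background ULTs that check mode turns off.
var backgroundTasks = []string{
	"aggregation", "dtx_batched_commit", "gc", "csum_scrub", "vea_flush",
}

// shouldRun reports whether a background task may be scheduled; under
// check mode, all of the listed tasks are suppressed.
func shouldRun(mode systemMode, task string) bool {
	return mode != modeCheck
}

func main() {
	for _, t := range backgroundTasks {
		fmt.Printf("%s runs in check mode: %v\n", t, shouldRun(modeCheck, t))
	}
}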

Control Plane Enhancements for DAOS Checker

The DAOS Control Plane is responsible for coordinating the processes that provide storage services (Engines). It comprises the control plane service (daos_server), the client-side agent (daos_agent), and the administrative tool used for managing DAOS (dmg).

Management Service Database Updates

The DAOS Management Service (MS) is a Raft-replicated in-memory database that runs on a subset of the Control Plane service instances. This database contains the top-level set of information about the entire DAOS system (ranks within the system, the rank/fabric address mappings, and Pool service entries).

The MS database was expanded to allow replicated storage of checker findings. As the checker engine discovers issues in the DAOS storage, structured reports are sent to the MS, which then persists them in the MS DB.

Checker Reports

A checker report contains the following high-level information:

  • Sequence Number: A unique identifier for a given checker finding.

  • Finding Class: Corresponds to the inconsistency classification as described above.

  • Action: Records the action taken automatically if applicable, or as specified by the administrator.

  • Timestamp: Records the time that the report was generated.

  • Finding details: Depending on the inconsistency class, different details will be available to assist the administrator with understanding what the problem is (e.g., pool UUID/Label, target index, etc.).

  • Potential/Suggested repair actions: If the checker has not been configured to automatically repair a given inconsistency class, then the report will include potential repair actions for display to the administrator, including a suggested action when applicable.
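
The fields above could be modeled roughly as follows. This Go sketch is an assumption for illustration, not the actual MS database schema.

package main

import (
	"fmt"
	"time"
)

// CheckerReport models the high-level fields of a checker finding as
// described above. The field types are assumptions for illustration.
type CheckerReport struct {
	Seq       uint64    // unique identifier for this finding
	Class     string    // inconsistency class, e.g. "CIC_POOL_BAD_LABEL"
	Action    string    // action taken, or as specified by the administrator
	Timestamp time.Time // when the report was generated
	Details   string    // class-specific details (pool UUID/label, target index, ...)
	Options   []string  // potential repair actions if not auto-repaired
}

func main() {
	r := CheckerReport{
		Seq:       1,
		Class:     "CIC_POOL_BAD_LABEL",
		Action:    "CIA_INTERACT",
		Timestamp: time.Now(),
		Details:   "pool label mismatch between MS and PS",
		Options:   []string{"CIA_TRUST_MS", "CIA_TRUST_PS", "CIA_IGNORE"},
	}
	fmt.Printf("report #%d: %s (%s)\n", r.Seq, r.Class, r.Action)
}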

Engine Logic for DAOS Checker

The DAOS engine is the main body for scanning the system to detect distributed inconsistencies and then handle them accordingly.

Check Interaction

Strictly speaking, the DAOS checker is not an interactive debug tool that offers some kind of shell for handling user input interactively. Instead, DAOS check interaction is mainly used for the following purposes, without an interactive shell:

  • For sanity checks, the admin may want to confirm before repairing some critical issues, and so configures the DAOS checker to raise a check interaction when it hits a specified inconsistency.

  • The admin wants the DAOS checker to handle different inconsistency issues (of the same inconsistency class) with different actions, case by case.

  • If the DAOS checker is uncertain about how to handle some inconsistency, it will ask the admin to make the decision via check interaction.

DAOS check interaction is triggered by the check leader (described in the Check Instance section below) via an asynchronous check report upcall to the MS for an unresolved inconsistency. When the admin queries the check progress, related interaction requests will be shown. The admin can then make a decision and instruct the DAOS checker to handle the inconsistency via the “dmg check repair” API.
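
Conceptually, the interaction flow is a table of pending findings keyed by sequence number, each resolved later by an admin decision. The following Go sketch models that flow; the function names and data structures are hypothetical, not the actual DAOS API.

package main

import (
	"fmt"
	"sync"
)

// pending maps a finding's sequence number to the channel on which the
// reporting side waits for the admin's decision. Illustrative only.
var (
	mu      sync.Mutex
	pending = map[uint64]chan string{}
)

// registerFinding records an unresolved finding; the reporting side later
// blocks on the returned channel until the admin makes a decision.
func registerFinding(seq uint64) chan string {
	ch := make(chan string, 1)
	mu.Lock()
	defer mu.Unlock()
	pending[seq] = ch
	return ch
}

// adminRepair delivers the admin's chosen action for a pending finding,
// standing in for "dmg check repair".
func adminRepair(seq uint64, action string) {
	mu.Lock()
	ch, ok := pending[seq]
	delete(pending, seq)
	mu.Unlock()
	if ok {
		ch <- action
	}
}

func main() {
	decision := registerFinding(42) // checker reports finding #42
	adminRepair(42, "CIA_TRUST_MS") // admin resolves it
	fmt.Println("finding 42 resolved with", <-decision)
}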

Check Instance

Each time the DAOS checker is triggered, it generates a check instance. For each check instance, there is a unique leader instance (called the “check leader”) and several engine instances (called “check engines”).

The check leader is elected by the MS when starting the DAOS checker. It cannot be switched to another rank unless the admin wants to restart the DAOS checker from scratch. The main responsibilities of the check leader are as follows:

  • The unified interface with MS for the whole check process.

    • Execute check command (start/stop/query) from MS.

    • Report inconsistencies (handling actions and results) to the MS.

  • Control check engines.

    • Start/stop/query check engines.

    • Track check engines' progress and exclude dead ones.

  • Interact with the admin and forward the feedback to the related sponsor.

  • Drive the pass 0 and pass 1 processing with an independent ULT for each pool.

  • Track check progress for the pools in the current check instance's scope.

On each server rank, one check engine runs for the current check instance, including on the rank where the check leader resides. Here are the check engine's main responsibilities:

  • Execute check command (start/stop/query) from the check leader.

  • Drive the pass 2 to pass N processing with an independent ULT for each pool on it.

  • Report inconsistencies (handling actions and results) to the check leader.

  • Track check progress for the pool shards in the current check instance's scope.

Logic of Detecting and Handling Distributed Inconsistency

In this section, we describe how the DAOS checker detects and handles each kind of inconsistency.

Detect and Handle Orphan Pool

When starting the checker, every check engine will return its known pool list (shards, UUID, label, and PS replicas) in the CHK_START RPC reply to the check leader. The check leader will compare this with the pool list obtained from the MS (via the DRPC_METHOD_CHK_LIST_POOL upcall). If some pool only exists in the engine-reported list, then it is an orphan pool.
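
The comparison amounts to a two-way set difference between the engine-reported pool list and the MS pool list: one direction yields orphan pools, the other yields the dangling pool entries covered in the next subsection. A minimal Go sketch, with pools reduced to identifiers for illustration:

package main

import "fmt"

// comparePoolLists returns the pools known only to the engines (orphan
// pools) and the pools known only to the MS (dangling pool entries).
func comparePoolLists(engines, ms []string) (orphans, dangling []string) {
	msSet := map[string]bool{}
	for _, p := range ms {
		msSet[p] = true
	}
	engSet := map[string]bool{}
	for _, p := range engines {
		engSet[p] = true
		if !msSet[p] {
			orphans = append(orphans, p)
		}
	}
	for _, p := range ms {
		if !engSet[p] {
			dangling = append(dangling, p)
		}
	}
	return orphans, dangling
}

func main() {
	orphans, dangling := comparePoolLists(
		[]string{"pool-A", "pool-B"}, // reported via CHK_START replies
		[]string{"pool-B", "pool-C"}, // obtained from the MS pool list
	)
	fmt.Println("orphan pools:", orphans)      // [pool-A]
	fmt.Println("dangling entries:", dangling) // [pool-C]
}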

Solution Options (based on the handle policy, specified or not):

  • CIA_READD or CIA_TRUST_PS or CIA_DEFAULT: Re-add the orphan pool back to MS [suggested].

  • CIA_DISCARD or CIA_TRUST_MS: Destroy the orphan pool to release space.

  • CIA_IGNORE: Keep the orphan pool entry on engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Dangling Pool

In the above pool list comparison, if some pool only exists on the MS, then it is a dangling pool.

Solution Options (based on the handle policy, specified or not):

  • CIA_DISCARD or CIA_TRUST_PS or CIA_DEFAULT: Discard the dangling pool entry from MS [suggested].

  • CIA_IGNORE: Keep the dangling pool entry on MS, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Broken PS

For a pool that exists on engines (and was not destroyed as an orphan), the check leader will check whether there are enough pool service replicas for PS leader quorum. If not, then the pool service is broken.
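
Quorum here is a strict majority of the configured PS replicas, as the following small Go sketch illustrates:

package main

import "fmt"

// hasQuorum reports whether enough PS replicas are present to elect a
// PS leader: a strict majority of the configured replicas.
func hasQuorum(present, total int) bool {
	return present > total/2
}

func main() {
	fmt.Println(hasQuorum(2, 5)) // false: 2 of 5 replicas cannot elect a leader
	fmt.Println(hasQuorum(3, 5)) // true
}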

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_PS or CIA_DEFAULT: Start pool service under DICTATE mode [suggested].

  • CIA_DISCARD: Destroy the corrupted pool from related engines to release space.

  • CIA_IGNORE: Keep the corrupted pool on related engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Inconsistent Pool Label

Then, for a pool that exists on both the MS and engines, the check leader will compare the pool label from the engines with the one from the MS. If they do not match, then the pool label is bad.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_MS or CIA_DEFAULT: Trust MS pool label [suggested].

NOTE: From the check leader's perspective, it does not know which label is corrupted. However, the pool label on the MS is mainly used for mapping the pool from label to UUID and is more visible than the one stored in the PS property, which can be regarded as a backup. So unless the pool label on the MS is empty, the check leader will prefer to trust the pool label on the MS and fix the pool label in the pool property to make them consistent (see the sketch after this list).

  • CIA_TRUST_PS: Trust PS pool label and fix the pool label on MS.

  • CIA_IGNORE: Keep the inconsistent pool label, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.
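
A sketch of the trust preference described in the NOTE above; the same logic applies analogously to container labels later in this document. The function is hypothetical, not the actual repair code.

package main

import "fmt"

// chooseTrustedLabel implements the default label-repair preference: trust
// the MS copy unless it is empty, in which case fall back to the PS copy.
func chooseTrustedLabel(msLabel, psLabel string) (trusted string, fixSide string) {
	if msLabel != "" {
		return msLabel, "PS" // rewrite the PS property to match the MS
	}
	return psLabel, "MS" // MS label missing; restore it from the PS
}

func main() {
	label, side := chooseTrustedLabel("tank", "tank-old")
	fmt.Printf("trust %q, fix the %s copy\n", label, side)
}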

Detect and Handle Orphan Pool Shard

After the pool list comparison, for a pool that can start its pool service, the check leader will send the known pool shards list (from the CHK_START RPC reply) to the check engine on which the PS leader resides (via the CHK_POOL_MBS RPC). That check engine will then load the pool map and compare it with the given pool shards. If some pool shard only exists on the engine, then it is an orphan pool shard.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_PS or CIA_DISCARD or CIA_DEFAULT: Discard the orphan pool shard to release space [suggested].

  • CIA_IGNORE: Keep the orphan pool shard on engine; repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Dangling Pool Map Entry

In the above pool shards comparison, if an engine is referenced in the pool map but no storage is allocated on that engine, then the related pool map entry is a dangling reference.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_TARGET or CIA_DEFAULT: Mark the dangling map entry as DOWN in the pool map [suggested].

  • CIA_IGNORE: Keep the dangling map entry in the pool map, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Orphan Container

After pool cleanup, the check engine on the PS leader will collect the container list for the pool from the pool shards (via the CHK_CONT_LIST RPC). Then, for each container in the list, that check engine will check whether it exists in the container service (CS) or not. If not, then it is an orphan container.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_PS or CIA_DISCARD or CIA_DEFAULT: Destroy the orphan container to release space [suggested].

NOTE: Currently, we do not support adding the orphan container back to the CS; that may be implemented in the future when we have enough information to recover the properties and attributes for the orphan container.

  • CIA_IGNORE: Keep the orphan container on engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Inconsistent Container Label

Then, for each non-orphan container, the check engine will check whether its label in the CS matches the one in the container property. If not, then the container label is bad.

Solution Options (based on the handle policy, specified or not):

  • CIA_TRUST_PS or CIA_DEFAULT: Trust the container label in container service [suggested].

NOTE: Similar to the case of a bad pool label, from the check engine's perspective it does not know which label is corrupted. However, the container label on the PS (CS) is mainly used for mapping the container from label to UUID and is more visible than the one stored in the container property, which can be regarded as a backup. So unless the container label on the PS is empty, the check engine will prefer to trust the container label on the PS and fix the container label in the container property to make them consistent.

  • CIA_TRUST_TARGET: Trust the container label in container property.

  • CIA_IGNORE: Keep the inconsistent container label, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

User Interface

Both ddb and the distributed checker are user-visible tools and need friendly user interfaces.

DDB API

General Usage

  • ddb [OPTIONS] [<vos_file_path>] [COMMAND [COMMAND OPTIONS] [COMMAND ARGS]]

If a command is not specified on the command line, then ddb runs in interactive mode. Below is a list of commands. For more details about the commands, see the help output of ddb (also included in the Appendix).

General Commands

  • clear: Clears the screen in interactive mode.

  • exit: Exits the interactive shell.

  • help: Displays help information for a given command.

  • quit or q: Exits the interactive shell.

  • version: Prints the DDB version.

VOS Commands

For these commands to work, a VOS file must first be opened using the “open” command.

  • open: Opens the VOS file at the specified path.

  • close: Closes the currently opened VOS pool shard.

  • superblock_dump: Dumps the pool superblock information.

Interacting with the VOS Tree

  • ls: Lists containers, objects, dkeys, akeys, and values.

  • rm: Removes a branch of the VOS tree.

  • value_dump: Dumps a value at a given akey to the screen or a file.

  • value_load: Loads a value to a VOS path. Can be used to update the value of an existing key or create a new key.

Transactions

DTX and ILOG entries and tables support DAOS consistency between engines with transactions and incarnation logs.

  • dtx_act_abort: Marks the active DTX entry as aborted.

  • dtx_act_commit: Marks the active DTX entry as committed.

  • dtx_cmt_clear: Clears the DTX committed table.

  • dtx_dump: Dumps the DTX tables.

  • ilog_clear: Removes all the ILOG entries.

  • ilog_commit: Processes the ILOG.

  • ilog_dump: Dumps the ILOG.

Versioned Extent Allocator

The VEA table stores information about allocated and free regions on an NVMe SSD.

  • vea_dump: Dumps information from the VEA about free regions.

  • vea_update: Alters the VEA tree to mark a region as free.

SMD Commands

For these commands to work, no VOS file may currently be open.

  • smd: Displays information about the SMD file.

  • smd_sync: Restores the SMD file with backup from NVMe blob.

dmg check API

  • dmg check enable: Set system property and start ranks in checker mode.

  • dmg check disable: Clear system property and stop ranks.

  • dmg check start: Start a checker session for all pools or a subset.

Usage: dmg [OPTIONS] check start [start-OPTIONS] [[pool name or UUID [pool name or UUID]] ...]

[start command options]
      -n, --dry-run           Scan only; do not initiate repairs.
      -r, --reset             Reset the system check state.
      -f, --failout=[on|off]  Stop on failure.
      -a, --auto=[on|off]     Attempt to automatically repair problems.
      -O, --find-orphans      Find orphaned pools.
      -p, --policies=         Set repair policies.

  • dmg check stop: Stop the current checker session.

Usage: dmg [OPTIONS] check stop [[pool name or UUID [pool name or UUID]] ...]

  • dmg check query: Retrieve status of the checker session and any findings.

  • dmg check repair: Perform a repair operation for a given checker finding.

  • dmg check get-policy: Display the current checker action for each inconsistency class.

  • dmg check set-policy: Update the checker action for an inconsistency class.

MS Recovery Capabilities

In addition to the Pool and Container repair capabilities offered by the Engine-based Checker, the Management Service itself has some new features for assistance with recovering from certain failure scenarios. These features are described in the following sections documenting the new daos_server subcommands related to MS status and recovery.

  • daos_server ms status

This command is used to obtain status about the MS database stored on a given replica. With this information, the administrator may determine which of several replicas has the most recent committed raft logs (normally, all replicas should receive the same set of logs, but in a catastrophic failure, some replicas may not receive the most recent logs before being terminated).

  • daos_server ms recover

This command may be used to force a recovery of the MS database using this replica. In normal operations, once the MS has reached a quorum of peers, it will fall back to degraded (read-only) operations if it loses that quorum and will not allow write operations until it has regained quorum with previously contacted peers. If there is only one MS replica remaining, or the set of replicas needs to be updated to include a new set of peers, then this command may be used to force the MS back into bootstrap mode. In bootstrap mode, the first replica to start will automatically gain leadership and other replicas will be added as followers.

  • daos_server ms restore

This command is similar to the previously described recover command, but with the prerequisite step of reading the MS state from an on-disk snapshot before bootstrapping the replica. If the MS snapshots have been backed up to external storage, they can be used to restore the system metadata to whichever version of the database was current when the snapshot was taken.

Impacts

In phase I, the catastrophic recovery tools run in offline mode and will not affect any online operation; there is no impact on performance, the RPC protocol, rebuild logic, aggregation, or security, and there are no configuration changes or interoperability issues. Some new user interfaces will be introduced; please refer to the Requirements & Use Cases section for details.

Quality

The test will be done as described in the catastrophic recovery test plan.