Introduction

It may take many years to implement all the functionality described in the catastrophic recovery high level design document. As a start, catastrophic recovery phase I will focus on the local debug tools, the infrastructure to run the distributed checker, and distributed pool and container consistency (i.e., passes 0 to 5).

Requirements & Use Cases

The SRS ticket is SRS-396.

DDB Functionality

For navigation purposes and for referencing different parts of a VOS tree, the notion of a VOS path is used and can indicate any level of the tree. For example, /[uuid] would reference the container at that uuid. For DDB, the pool uuid is not needed in the VOS path because it will already be connected to a pool shard. Another example, /[uuid]/[obj_id]/dkey/akey, would reference the value at the given akey path. To make it more convenient to reference parts of the path, indexing can be used. For example, /[0] would reference the ‘first’ container on the pool shard. The order of containers, keys, and values is arbitrary but consistent. The listing command (described below) indicates appropriate indexes to use for the different parts of the path.
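The path convention above can be illustrated with a small parser. This is only a sketch and not ddb's actual implementation: the names `PathPart` and `parse_vos_path` are hypothetical, a bracketed integer such as [0] is assumed to be an index, any other segment is assumed to be a literal identifier, and keys containing "/" are not handled.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Tree levels below the pool shard, in path order (the pool is implicit in ddb).
LEVELS = ["container", "object", "dkey", "akey"]
_INDEX_RE = re.compile(r"^\[(\d+)\]$")


@dataclass
class PathPart:
    level: str                   # which tree level this segment addresses
    index: Optional[int] = None  # set when the segment is an index like [0]
    name: Optional[str] = None   # set when the segment is a literal identifier


def parse_vos_path(path: str) -> List[PathPart]:
    """Split a VOS path into per-level parts (hypothetical helper)."""
    segments = [s for s in path.strip().split("/") if s]
    if len(segments) > len(LEVELS):
        raise ValueError("path is deeper than container/object/dkey/akey")
    parts = []
    for level, seg in zip(LEVELS, segments):
        match = _INDEX_RE.match(seg)
        if match:
            parts.append(PathPart(level, index=int(match.group(1))))
        else:
            parts.append(PathPart(level, name=seg))
    return parts
```

With this sketch, parse_vos_path("/[0]") yields a single container part addressed by index 0, and a four-segment path yields one part per level down to the akey.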

...

  • Navigate the VOS tree, which consists of pools, containers, objects, dkeys, akeys, and values.

  • Dump the value of an akey at a given VOS path to a file.

  • Insert, modify, or delete a value at a specific VOS path.

  • Dump and modify DTX, ILOG and VEA entries.

  • View SMD info.

  • Sync the SMD file from backups within a pool’s NVMe blob.

Requirements for Distributed Checker

In the catastrophic recovery phase I, the distributed checker should have the ability to detect and handle the following inconsistencies:

...

  • Switch system between check mode and regular mode.

  • (Re-)Start checker for all pools or specified pool(s) in the system.

  • Stop checker for all pools or specified pool(s).

  • Resume checker for all pools or specified pool(s) from the point (pass) at which it was stopped or paused.

  • Query checker process for all pools or specified pool(s).

  • Configure the repair policy for a specified inconsistency.

  • Interact with the checker to handle specified inconsistency.

Design Overview

The DAOS Debug Tool (ddb) allows users to interact with a file in the VOS format. These files are generally located on a DAOS server under the pool folder of the engine mount point, for example /mnt/daos/[pool_uuid]/vos-0, where “vos-0” is the VOS file for a pool shard. Only a single VOS file can be opened at a time, and ddb must run locally on the system that holds the VOS file. In addition, the DAOS engine must not be connected to the VOS file when ddb opens it. The ddb tool can be run from the command line or as an interactive CLI.

Once local consistency has been verified on each storage node, the DAOS server should be able to start, and the management service (MS) should be up and running. The admin can then drive the DAOS checker to verify DAOS distributed internal consistency via the dmg check APIs offered by the control plane. The following figure shows the schematic diagram for the DAOS checker workflow.

...

Check Passes

As described in the high level design document, the DAOS checker needs to verify multiple DAOS components at different levels by scanning in multiple passes. We therefore define check phases corresponding to these different verifications.

...

The DAOS checker drives the verification for each pool with its own dedicated ULT (on both the check leader and the check engines). One pool being blocked (for example, waiting for interaction) does not affect other pools, so the verification of multiple pools can run in parallel, with different pools in different passes.

Inconsistency Classification

The inconsistency classification defines the kinds of distributed inconsistencies that the DAOS checker will detect and handle in this milestone, including the following:

...

The container label recorded by CS does not match the container label property.

Inconsistency Handling Policy

The inconsistency handling policy is the set of actions that the DAOS checker can use to handle these kinds of distributed inconsistencies. The admin can specify the handling policy for each kind of inconsistency when starting the DAOS checker.

...

If the admin does not specify the policy for some kind of inconsistency, then the DAOS checker will use the default action to handle that kind of inconsistency. The default action is ultimately mapped to one of the above policies, depending on the specific inconsistency.
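The fallback to a default action can be sketched as a simple lookup. This is an illustration under stated assumptions rather than the checker's real code: the inconsistency class names (such as "orphan_pool") and the function `resolve_action` are hypothetical, the CIA_* action names and the per-class meaning of CIA_DEFAULT are taken from the action lists in this document, and falling back to CIA_INTERACT for a class without a listed default is an assumption.

```python
DEFAULT_ACTION = "CIA_DEFAULT"

# What CIA_DEFAULT maps to for each class, as listed in this document.
# The class names used as keys are hypothetical labels for illustration.
DEFAULT_RESOLUTION = {
    "orphan_pool": "CIA_READD",
    "dangling_pool": "CIA_DISCARD",
    "broken_ps": "CIA_TRUST_PS",
    "orphan_pool_shard": "CIA_DISCARD",
    "dangling_pool_map_entry": "CIA_TRUST_TARGET",
}


def resolve_action(inconsistency: str, admin_policy: dict) -> str:
    """Return the concrete action for one kind of inconsistency."""
    # Use the admin's policy if one was given, otherwise the default action.
    action = admin_policy.get(inconsistency, DEFAULT_ACTION)
    if action == DEFAULT_ACTION:
        # Assumption: classes with no documented default ask the admin.
        return DEFAULT_RESOLUTION.get(inconsistency, "CIA_INTERACT")
    return action
```

For example, with an empty admin policy an orphan pool resolves to CIA_READD, while an explicit CIA_IGNORE policy for that class is returned unchanged.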

Check Mode

Currently, the DAOS checker performs offline scanning and repair. We introduce a new “check” mode for these offline operations. Compared with regular mode, there are some restrictions under check mode:

...

The admin can switch the system mode via “dmg check enable/disable”.

Control Plane Enhancements for DAOS Checker

The DAOS Control Plane is responsible for coordinating the processes that provide storage services (Engines). It comprises the control plane service (daos_server), the client-side agent (daos_agent), and the administrative tool used for managing DAOS (dmg).

Management Service Database Updates

The DAOS Management Service (MS) is a Raft-replicated in-memory database that runs on a subset of the Control Plane service instances. This database contains the top-level set of information about the entire DAOS system (ranks within the system, the rank/fabric address mappings, and Pool service entries).

The MS database was expanded to allow replicated storage of checker findings. As the checker engine discovers issues in the DAOS storage, structured reports are sent to the MS, which then persists them in the MS DB.

Checker Reports

A checker report contains the following high-level information:

  • Sequence Number: A unique identifier for a given checker finding.

  • Finding Class: Corresponds to the inconsistency classification as described above.

  • Action: Records the action taken automatically if applicable, or as specified by the administrator.

  • Timestamp: Records the time that the report was generated.

  • Finding details: Depending on the inconsistency class, different details will be available to assist the administrator with understanding what the problem is (e.g., pool UUID/Label, target index, etc.).

  • Potential/Suggested repair actions: If the checker has not been configured to automatically repair a given inconsistency class, then the report will include potential repair actions for display to the administrator, including a suggested action when applicable.
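The report fields listed above could be modelled roughly as follows. This is a sketch only: the actual report is a structured message defined by the control plane, and the type name `CheckerReport`, the field names, and the helper `needs_decision` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class CheckerReport:
    seq: int                       # Sequence Number: unique per finding
    finding_class: str             # Finding Class: inconsistency classification
    timestamp: datetime            # Timestamp: when the report was generated
    action: Optional[str] = None   # Action: taken automatically or by the admin
    details: Dict[str, str] = field(default_factory=dict)     # Finding details
    repair_choices: List[str] = field(default_factory=list)   # Potential repairs
    suggested: Optional[str] = None                            # Suggested repair

    def needs_decision(self) -> bool:
        """True when no action was taken yet and repair options are offered."""
        return self.action is None and bool(self.repair_choices)
```

A report with no recorded action and a non-empty list of repair choices is one that is waiting for the administrator to decide.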

Engine Logic for DAOS Checker

The DAOS engine does the main work of scanning the system to detect distributed inconsistencies and then handling them accordingly.

Check Interaction

Strictly speaking, the DAOS checker is not an interactive debug tool of the kind that offers a shell for handling user input. Instead, DAOS check interaction serves the following purposes without an interactive shell:

...

DAOS check interaction is triggered asynchronously by the check leader (described in the Check Instance section below) via a check report upcall to the MS for an unresolved inconsistency. When the admin queries the check process, the related interaction requests are shown. The admin can then make a decision and instruct the DAOS checker to handle the inconsistency via the “dmg check repair” API.

Check Instance

Each time the DAOS checker is triggered, it generates a check instance. Each check instance has a unique leader instance (called the “check leader”) and several engine instances (each called a “check engine”).

...

  • Execute check command (start/stop/query) from the check leader.

  • Drive the pass 2 to pass N processing with an independent ULT for each pool on it.

  • Report inconsistency (handle action and result) to the check leader.

  • Trace check process for the pool shards in current check instance’s scope.

Logic of Detecting and Handling Distributed Inconsistency

In this section, we describe how the DAOS checker detects and handles each kind of inconsistency.

Detect and Handle Orphan Pool

When starting the checker, every check engine returns its list of known pools (shards, UUID, label and PS replicas) to the check leader via the CHK_START RPC reply. The check leader compares them with the pools list obtained from the MS (via the DRPC_METHOD_CHK_LIST_POOL upcall). If a pool exists only in the engine-reported list, then it is an orphan pool.

...

  • CIA_READD or CIA_TRUST_PS or CIA_DEFAULT: Re-add the orphan pool back to MS [suggested].

  • CIA_DISCARD or CIA_TRUST_MS: Destroy the orphan pool to release space.

  • CIA_IGNORE: Keep the orphan pool entry on engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Dangling Pool

In the above pools list comparison, if some pool only exists on MS, then it is a dangling pool.

...

  • CIA_DISCARD or CIA_TRUST_PS or CIA_DEFAULT: Discard the dangling pool entry from MS [suggested].

  • CIA_IGNORE: Keep the dangling pool entry on MS, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.
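The pools list comparison used for orphan and dangling pool detection amounts to two set differences. The following is a minimal sketch, assuming pools are identified by UUID strings; the function name `classify_pools` is hypothetical and the real check leader works on richer per-pool records (shards, label, PS replicas).

```python
from typing import Iterable, Set, Tuple


def classify_pools(engine_pools: Iterable[str],
                   ms_pools: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Compare engine-reported pool UUIDs with the pools known to the MS."""
    engine_set = set(engine_pools)
    ms_set = set(ms_pools)
    orphan = engine_set - ms_set     # only on engines: orphan pools
    dangling = ms_set - engine_set   # only on MS: dangling pools
    return orphan, dangling
```

Pools present in both sets are neither orphan nor dangling and proceed to the later checks.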

Detect and Handle Broken PS

For a pool that exists on engines (and was not destroyed as an orphan), the check leader checks whether there are enough pool service replicas to form the PS leader quorum. If not, then the pool service is broken.

...

  • CIA_TRUST_PS or CIA_DEFAULT: Start pool service under DICTATE mode [suggested].

  • CIA_DISCARD: Destroy the corrupted pool from related engines to release space.

  • CIA_IGNORE: Keep the corrupted pool on related engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Inconsistent Pool Label

Next, for a pool that exists on both MS and engines, the check leader compares the pool label from the engines with the one from MS. If they do not match, then the pool label is bad.

...

  • CIA_TRUST_PS: Trust PS pool label and fix the pool label on MS.

  • CIA_IGNORE: Keep the inconsistent pool label, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Orphan Pool Shard

After the pools list comparison, for each pool that can start its pool service, the check leader sends (via the CHK_POOL_MBS RPC) the list of known pool shards (from the CHK_START RPC reply) to the check engine on which the PS leader resides. That check engine then loads the pool map and compares it with the given pool shards. If a pool shard exists only on the engine, then it is an orphan pool shard.

...

  • CIA_TRUST_PS or CIA_DISCARD or CIA_DEFAULT: Discard the orphan pool shard to release space [suggested].

  • CIA_IGNORE: Keep the orphan pool shard on engine; repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Dangling Pool Map Entry

In the above pool shards comparison, if an engine is referenced in the pool map but no storage is allocated on that engine, then the related pool map entry is a dangling reference.

...

  • CIA_TRUST_TARGET or CIA_DEFAULT: Mark the dangling map entry as DOWN in the pool map [suggested].

  • CIA_IGNORE: Keep the dangling map entry in the pool map, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.
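The pool shards comparison and the suggested repair for a dangling map entry can be sketched as below. This is an illustration, not the engine's logic: the pool map is reduced to a hypothetical dict from engine rank to a status string, the function names are invented, and only the DOWN status comes from this document (other status values, such as "UP" in the test data, are placeholders).

```python
from typing import Dict, Set, Tuple


def compare_pool_shards(pool_map: Dict[int, str],
                        reported_ranks: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Compare pool map entries with the ranks that reported a pool shard."""
    mapped = set(pool_map)
    orphan_shards = reported_ranks - mapped     # shard on engine, not in map
    dangling_entries = mapped - reported_ranks  # map entry without storage
    return orphan_shards, dangling_entries


def mark_dangling_down(pool_map: Dict[int, str],
                       dangling: Set[int]) -> Dict[int, str]:
    """Return a copy of the pool map with dangling entries set to DOWN."""
    return {rank: ("DOWN" if rank in dangling else status)
            for rank, status in pool_map.items()}
```

The second helper corresponds to the CIA_TRUST_TARGET action: the entry stays in the pool map but is no longer treated as available.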

Detect and Handle Orphan Container

After pool cleanup, the check engine on the PS leader collects the containers list for the pool (via the CHK_CONT_LIST RPC) from the pool shards. Then, for each container in the list, that check engine checks whether it exists in the container service (CS). If not, then it is an orphan container.

...

  • CIA_IGNORE: Keep the orphan container on engines, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

Detect and Handle Inconsistent Container Label

Next, for each non-orphan container, the check engine checks whether its label in the CS matches the one in the container property. If not, then the container label is bad.

...

  • CIA_TRUST_TARGET: Trust the container label in container property.

  • CIA_IGNORE: Keep the inconsistent container label, repair nothing.

  • CIA_INTERACT or others: Interact with the admin for the decision.

User Interface

Both ddb and the distributed checker are user-visible tools and need user-friendly interfaces.

DDB API

General Usage

  • ddb [OPTIONS] [<vos_file_path>] [COMMAND [COMMAND OPTIONS] [COMMAND ARGS]]

If a command is not specified on the command line, then ddb runs in interactive mode. Below is a list of commands. For more details about the commands, see the help output of ddb (also included in the Appendix).

General Commands

  • clear: Clears the screen in interactive mode.

  • exit: Exits the interactive shell.

  • help: Displays help information for a given command.

  • quit or q: Exits the interactive shell.

  • version: Prints the DDB version.

VOS Commands

For these commands to work, a VOS file must first be opened using the “open” command.

  • open: Opens the VOS file at the specified path.

  • close: Closes the currently opened VOS pool shard.

  • superblock_dump: Dumps the pool superblock information.

Interacting with the VOS Tree

  • ls: Lists containers, objects, dkeys, akeys, and values.

  • rm: Removes a branch of the VOS tree.

  • value_dump: Dumps a value at a given akey to the screen or a file.

  • value_load: Loads a value to a VOS path. Can be used to update the value of an existing key or create a new key.

Transactions

DTX and ILOG entries and tables support DAOS consistency between engines with transactions and incarnation logs.

  • dtx_act_abort: Marks the active DTX entry as aborted.

  • dtx_act_commit: Marks the active DTX entry as committed.

  • dtx_cmt_clear: Clears the DTX committed table.

  • dtx_dump: Dumps the DTX tables.

  • ilog_clear: Removes all the ILOG entries.

  • ilog_commit: Processes the ILOG.

  • ilog_dump: Dumps the ILOG.

Versioned Extent Allocator

The VEA table stores information about allocated and free regions on an NVMe SSD.

  • vea_dump: Dumps information from the VEA about free regions.

  • vea_update: Alters the VEA tree to mark a region as free.

SMD Commands

For these commands to work, ddb must not be connected to a VOS file.

  • smd: Displays information about the SMD file.

  • smd_sync: Restores the SMD file with backup from NVMe blob.

dmg check API

  • dmg check enable: Set system property and start ranks in checker mode.

  • dmg check disable: Clear system property and stop ranks.

  • dmg check start: Start a checker session for all pools or a subset.

...

Usage:
  dmg [OPTIONS] check set-policy [set-policy-OPTIONS] [Policies]

[set-policy command options]
      -d, --reset-defaults   Set all policies to their default action.
      -a, --all-interactive  Set all policies to interactive.

[set-policy command arguments]
  Policies:                  Repair policies (required unless --all-interactive is specified).

MS Recovery Capabilities

In addition to the Pool and Container repair capabilities offered by the Engine-based Checker, the Management Service itself has some new features for assistance with recovering from certain failure scenarios. These features are described in the following sections documenting the new daos_server subcommands related to MS status and recovery.

...

This command is similar to the previously described recover command, but with the prerequisite step of reading the MS state from an on-disk snapshot before bootstrapping the replica. If the MS snapshots have been backed up to external storage, they can be used to restore the system metadata to whichever version of the database was current when the snapshot was taken.

Impacts

In phase I, the catastrophic recovery tools run in offline mode and do not affect any online operation. There is no impact on performance, RPC protocol, rebuild logic, aggregation, or security, and there are neither configuration changes nor interoperability issues. Some new user interfaces will be introduced; please refer to the section Requirements & Use Cases for details.

Quality

Testing will be done as described in /wiki/spaces/DAOS/pages/11085840453.