Catastrophic Recovery

Introduction

DAOS employs various fault-tolerance mechanisms to cope with ordinary temporary or permanent hardware failures, such as the Raft consensus engine used for pool/container metadata, EC/replication used for user data, etc. These mechanisms ensure the system survives most common failures, and automatic self-healing mechanisms are in place to restore redundancy after a tolerable failure occurs. However, several factors can challenge this design:

  1. Users can control (and lower) the data protection on a per-container basis.

  2. The system may face unexpected events causing more failures than it is designed to tolerate.

  3. The self-healing mechanism may fail in some specific conditions (e.g., ENOSPC on the surviving nodes).

  4. Hardware bugs (e.g., broken flush, firmware issues, data corruption, ...) can cause massive corruption.

  5. Software bugs (e.g., corner cases, overflow, ...).

  6. Human errors.

The DAOS catastrophic recovery feature is introduced to address the failure cases above. While it is unreasonable to assume that all cases can be covered, this feature covers the most likely ones. The first goal is to detect corruptions and distributed consistency issues and then offer a remediation path whenever possible. Remediation options can range from a transparent, automatic fix to a manual repair or deletion of a pool or container. If catastrophic recovery fails, then the system will ultimately have to be reformatted. Another aspect to take into account is that the check and repair should complete in a reasonable amount of time; the framework should therefore provide an estimate of how long each pool is expected to take and allow the administrator to prioritize some pools over others.

The flow of catastrophic recovery is described in the diagram below:

Catastrophic recovery will be supported as an offline operation (with the opportunity to run it online in the future) and will involve three components:

  • Local tools (ddb) to navigate and alter DAOS files stored on persistent memory and SSD.

  • A new checker module to be loaded into the engine to verify distributed consistency (dmg check).

  • A middleware tool (daos fs check) to be run on a client side to verify middleware consistency.

While this design document aims at defining the final complete solution, it would require multiple years of development. As a consequence, only a subset of what is described in this document will be implemented. This implementation includes the local debug tools, the infrastructure to run the distributed checker, and distributed pool and container consistency (i.e., passes 0 to 5).

High-level Changes

Layout Changes

DAOS currently stores bootstrapping metadata primarily on PMEM and does not maintain any local redundancy. This limits the ability to recover from the corruption of a local configuration file or the failure of PMEM modules. To facilitate this recovery, the content of the superblock file will be replicated in the SPDK blobs maintained on the SSDs.

$ cat superblock
version: 1
uuid: 1fceeeb1-b432-4859-b61b-bce235da1c26
system: daos_server
rank: 0
uri: ofi+tcp;ofi_rxm://10.8.1.151:31416
validrank: true
hostfaultdomain: /wolf-151.wolf.hpdd.intel.com

Moreover, the administrator must preserve a copy of the server YAML file to rebuild the nvme.conf file required to configure SPDK.

Debug Tools

The list of tools (e.g., ipmctl, e2fsck, pmempool check) that might be required to repair the storage used by the control and data planes will be documented in the DAOS online manual.

Moreover, a procedure to restart the management service with a single replica will be provided. This will allow DAOS to cope with massive server failures preventing the management service from reaching quorum.

DAOS files stored on persistent memory use either an ASCII or the VOS format. The former can easily be edited, while the latter requires a specific tool to parse its content. A new tool (ddb) that can be run locally against the DAOS VOS files will be developed to parse the VOS format.

Distributed Consistency Checker

A new DAOS server module called checker will be developed to verify distributed consistency. The control plane will also be modified to start the engine in “checker” mode. In this mode, the administrator will be able to interact with the checker module in charge of verifying and repairing (when possible) distributed consistency.

Local Check & Repair

The first phase of catastrophic recovery consists of recovering the local storage (i.e., PMEM and SSD) so that the DAOS control and data planes can be started to verify distributed consistency (phase 2, see next section).

In this section, we present the existing tools that will be needed to fix the local storage and finally introduce a new DAOS-specific tool that will be developed in this milestone.

PMEM Device Recovery

A PMEM namespace can be provisioned to operate in one of the following four modes: fsdax, devdax, sector, and raw. All DAOS metadata are stored in a fsdax namespace exported as a block device (e.g., /dev/pmem0).

The ipmctl(1) tool allows users to debug and troubleshoot PMEM modules. It supports several diagnostic tests. References to the ipmctl documentation will be added to the DAOS documentation to point users to how to use ipmctl to fix PMEM modules.

[root@wolf-151 control_raft]# ipmctl start -diagnostic quick
--Test = Quick
  State = Ok
  Message = The quick health check succeeded.
--SubTest = Manageability
  State = Ok
--SubTest = Boot status
  State = Ok
--SubTest = Health
  State = Ok

The ndctl(1) tool allows users to manipulate PMEM namespaces and interact with the libnvdimm subsystem of the Linux kernel. It supports different options to check (e.g., NVDIMM address range scrubbing) and repair the NVDIMMs and namespace. Like ipmctl, the DAOS manual will be updated to point at the relevant sections of the ndctl documentation.
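For illustration, a possible scrubbing sequence is sketched below; the exact options should be verified against the ndctl documentation installed on the system.

$ ndctl list --namespaces --human     # list the fsdax namespaces backing the DAOS metadata
$ ndctl start-scrub                   # trigger an address range scrub (ARS) of the NVDIMMs
$ ndctl wait-scrub                    # block until the scrub completes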

Ext4 Filesystem Recovery

Once the PMEM devices have been restored to a healthy state, the ext4 filesystem created on each device may need to be verified and repaired. This is achieved via the e2fsck(8) tool. While DAOS enables specific ext4 features like flex_bg and packed_meta_blocks to guarantee that large contiguous space allocations are made, all those features are supported by recent kernels and Linux distributions.

The following is a sample output of e2fsck against a PMEM device.

[root@wolf-151 jlombard]# e2fsck -y -f /dev/pmem0
e2fsck 1.45.6 (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
daos: 39/1022112 files (2.6% non-contiguous), 29721922/1046642176 blocks

PMDK Pool Recovery

DAOS uses the PMDK library to manage persistence inside ext4 files. PMDK provides a tool called pmempool(1) whose check command supports verification and repair.

More functionality might be added to pmempool check over time, and the DAOS documentation will be updated accordingly as more check & repair cases are supported by the tool.
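As an illustration, checking and repairing a single VOS pool file could look like the sketch below. The file path is illustrative only (it depends on the SCM mount point and the pool UUID), and the set of inconsistencies that pmempool check can currently repair is limited, as noted above.

$ pmempool check /mnt/daos0/<pool_uuid>/vos-0      # report inconsistencies without modifying the file
$ pmempool check --repair --advanced --yes \
      --backup /tmp/vos-0.bak /mnt/daos0/<pool_uuid>/vos-0   # repair in place, keeping a backup copy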

DAOS Debug Tool

A new tool (ddb - DAOS Debug) to navigate through a file in the VOS format will be developed. It will also load SPDK (using the nvme.conf file stored on PMEM) and access the SPDK blobstore information.

The new tool will be similar to debugfs(8) for ext2/3/4 and will offer both a command line and interactive shell mode. It will support:

  • Parsing a VOS file, including:

    • Assessing the VOS superblock.

    • Listing containers, objects, dkeys, akeys, and records.

    • Dumping content of a value.

    • Dumping content of ilogs.

    • Iterating over the dtx committed and active tables.

    • Dumping the VEA tree storing information about free space on NVMe SSDs.

  • Altering a VOS file, including:

    • Deleting containers, objects, dkeys, akeys, or records.

    • Changing or inserting a new value.

    • Processing/removing ilogs.

    • Clearing the dtx committed table.

  • Resyncing the content of the SMD file with the information stored in the extended attributes of the blobs (SSD). For this process, the server YAML file (backed up by the administrator) must be passed as input so that the tool knows which devices the engine uses. The tool will then be able to reconstruct the SMD by scanning the existing blobstores when the SMD is corrupted, or delete/re-create blobs when there are orphan/missing blobs.
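To illustrate the intended user experience, a hypothetical interactive ddb session covering a few of the operations listed above is sketched below. Since the tool is yet to be developed, the command names, VOS path syntax, and file location are placeholders only and will be finalized during implementation.

$ ddb /mnt/daos0/<pool_uuid>/vos-0                     # open a VOS file in interactive mode
ddb> superblock                                        # display the VOS superblock
ddb> ls /                                              # list the containers stored in this file
ddb> ls /<cont_uuid>/<obj_id>                          # list the dkeys of an object
ddb> dump_value /<cont_uuid>/<obj_id>/<dkey>/<akey>    # dump the content of a value
ddb> rm /<cont_uuid>/<obj_id>                          # delete a corrupted object
ddb> quit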

Moreover, to restart the management service (required for the distributed consistency check), a procedure to start the control plane in single-replica mode will be documented. This will involve manipulating the internal Raft implementation used by the management service (see https://www.vaultproject.io/docs/concepts/recovery-mode).

Distributed Internal Consistency

Once the previous phase has been completed, the DAOS server should be able to start on each storage node, and the management service should be up and running. This section describes how distributed consistency will be verified in multiple passes.

Workflow

Once the control plane has been started, dmg will be used to start the DAOS engine in a specific checker mode. In this operating mode, the engines won’t serve any client requests, and only cross-server RPCs as defined by the checker module will be processed.

Once the engine is started in checker mode, the administrator will interact with the checker module via dmg. The administrator will typically be able to:

  • Verify that all required storage nodes have joined the system (it is critical to validate the membership before starting the distributed consistency checks).

  • Mark engines that won’t be able to rejoin the system as administratively excluded.

  • List all the pools currently registered in the management service.

The checker won’t be started until all storage nodes referenced in the system map are marked as either joined or administratively excluded.

The administrator will then be able to start the checker via “dmg system check”. This process runs in the background, and status will be maintained in the management service. The operator will be able to check the status of each pool via the dmg command. The status can be:

  • “Unchecked” when the checker hasn’t started against this pool yet.

  • “Checking” when the pool is still being checked. Details about the pass that is in progress will be provided.

  • “Checked” when the check has successfully completed all the passes.

  • “Stopped” when the checker is explicitly stopped by the user (e.g., to accelerate checking of other pools that might be more important).

  • “Paused” when a running check has been interrupted by a system stop or crash.

  • “Pending” when a problem is discovered, and a decision is required by the operator. Details about the fault and pending action will be provided. For each fault/action, a unique ID will be generated and recorded by the management service. The class this fault belongs to will also be reported.

  • “Failed” when the check couldn’t be completed due to an unrecoverable internal error such as an unsupported case, the failure of an engine or storage node, or a network outage while the check was in progress.

When a pending status is reported, some recovery options might be provided to the administrator. The administrator can record the decision via the dmg tool by specifying the failure identifier. The decision can be a specific repair action or skipping this particular failure (at the administrator's own risk). Once the decision is recorded via dmg, the checker can be resumed on the impacted pool; it will execute the recovery option selected by the operator and then continue the check.

For massively corrupted pools, the back-and-forth exchange between the checker and the operator can be overwhelming. Therefore, the administrator will have the ability to:

  • Trust the checker and let it choose the repair action by itself. It is worth noting that the checker will always prefer consistency over partial recovery and might thus aggressively remove pools or containers.

  • Record a generic decision that applies to all faults of the same class.

Those settings can be toggled for all pools in a system or on a per-pool basis. At any time, the operator will also be able to interrupt the in-progress check of a pool and delete the pool. Moreover, an estimate of the time to completion for each pool will be provided and adjusted along the way.

Once all surviving pools have been checked, the operator can use dmg to restart the system in normal mode to resume production.
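To summarize the workflow, a possible administrator session is sketched below. Apart from “dmg system check” mentioned above, the subcommand and option names are assumptions used for illustration only; the actual dmg interface will be defined during implementation.

$ dmg system query --verbose                       # verify that all expected engines have joined
$ dmg system check                                 # launch the distributed consistency check
$ dmg system check --query                         # poll the per-pool status (illustrative option)
$ dmg system check --repair <fault_id> <action>    # record a decision for a pending fault (illustrative)
$ dmg system stop                                  # once every pool reports “checked”,
$ dmg system start                                 # restart the system in normal mode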

Checker Module

Like rebuild, the checker module won’t have a client component (i.e., no part in libdaos); it will define its own RPC operation codes and reject known RPCs that might be sent by leftover applications still running on the cluster. Like other modules, the checker will interact with the control plane via the dRPC channel. It will receive instructions from the control plane on when to start, stop, and resume processing, and will report progress back to the control plane regularly.

The checker module will leverage the engine infrastructure (e.g., CART, Argobots, incast variables ...) and run over the primary CART group. It will load the other modules (i.e., pool, container, object) and start the associated service on demand, depending on the consistency passes it is running.

A root engine will be selected by the control plane as the check leader. The leader is in charge of the whole process for the current check instance, including starting, stopping, and querying the checker on the other engines (called check engines), reporting and forwarding detected inconsistencies and interaction requests, and so on.

In the initial version, the checker won’t be resilient to failures: if an engine fails while the checker is running, the checker will abort processing on the impacted pools and mark them as “failed”. SWIM won’t be started in this initial implementation. In the case of a transient failure, the operator will be able to restart the workflow.

Consistency Passes

Like traditional fsck tools, the DAOS checker executes different passes. Pools are processed in parallel, but each pass is executed sequentially for a given pool. This is required since a repair action taken in one pass can have important consequences in the subsequent ones.

Pass 0: Management Service & Membership

Before starting any processing, the checker will verify that the management service is up and running and is ready to store regular status updates. It will also verify that all storage nodes reported in the system map are in the joined or excluded state. The leader engine is also elected as part of this pass.

Pass 1: Pool List Consolidation

Each engine will scan its local storage and discover whether it is a member of a pool service. This is achieved by analyzing the presence of RDB files (RDB being the replicated database module used to manage pool and container metadata over the Raft consensus algorithm). Each engine will then report which pool services it is part of and how many pool service replicas it is aware of.

Upon completion of the distributed discovery, the check leader will analyze the content from check engines. There are several possible cases:

  • All members of the pool service are present and consistent.

  • Only a subset of the pool service replicas is present, but a quorum can still be formed.

  • Only a subset of the pool service replicas is present, and no quorum can be formed.

  • More members are reported than the pool service was created with.

In the last two cases, a repair action will be required. The administrator will be able to select which pool service replica to restart from and have new pool service replicas resilvered.

The list of discovered pools will then be compared to the pools that the management service is aware of. There are three cases:

  • The pool exists in both lists and is good to go.

  • The pool only exists in the management service and will thus have to be deleted from there.

  • The pool only exists in the discovery list. It could then either be registered to the management service or just deleted.

Once the list of pools is consolidated, each target will scan its local storage and delete any VOS files (PMEM) and blobs (SSD) associated with pools that are not in the consolidated pool list. This allows the checker to proactively reclaim the space used by any orphaned pool components. In the future, an attempt to recreate the pool service for those pools might be implemented.

Pass 2: Pool Membership

Each pool from the consolidated list created in pass 1 will then individually transition to pass 2. In this pass, the pool service will be brought up.

Each engine will then report whether it has any storage allocated (i.e., PMEM VOS files and SSD blobs) for a given pool. This list will then be compared to the pool map stored in the pool service. There are three cases:

  • An engine has some allocated storage but does not appear in the pool map. In this case, the associated files and blobs will be automatically deleted from the engine. In the future, an option to try to re-add this target to the pool might be provided.

  • An engine has some allocated storage and is marked as down in the pool map. In this case, the administrator can decide to either remove or leave (e.g., for future incremental reintegration) the files/blobs.

  • An engine is referenced in the pool map, but no storage is actually allocated on this engine. The node will be evicted from the pool map, and a rebuild will be triggered in a later pass.

Pass 3: Pool Cleanup

Pools that successfully went through the previous passes will then be cleaned up. This cleanup involves:

  • Verifying that the svcl list stored in the management service is consistent with the actual pool service membership.

  • Verifying that the pool label recorded by the management service is consistent with the pool label property stored in the pool service.

  • Revoking all pool connections recorded in the pool service.

Pass 4: Container List Consolidation

Each engine will scan its local storage and report which containers exist locally for a given pool. This global list of containers will then be compared to the containers registered with the pool service. Containers that have storage allocated on an engine but do not exist in the pool service will be automatically deleted. In the future, customizable policies might be provided to recover some data from those orphan containers.

Pass 5: Container Cleanup

The container module will be loaded. Each container that survived the previous pass will then be cleaned up. This involves:

  • Verifying that the container label recorded by the pool service is consistent with the container label property stored in the container service.

  • Revoking all container open handles from applications.

Pass 6: DTX Resync and Cleanup (future)

Once containers have been consolidated and cleaned up, DTX resync will be triggered to synchronize the status of DTX entries across VOS targets on the same or different engines. This consolidates data visibility, which is the basis for the subsequent object data consistency verification.

Since all old connections have been cleaned up in the former passes, the committed DTX entries are useless after DTX resync; they will therefore be removed to release space.

Pass 7: Object Scrubbing (future)

After object data visibility has been consolidated, we can verify object data consistency, including:

  • Running the checksum scrubber locally to verify local data consistency.

  • For replicated objects, verifying that all replicas are consistent using checksums.

  • For erasure coded objects, verifying that the parity fragment matches what is stored on the data fragments. This can again be done with checksums.

Pass 8: Rebuild and Aggregation (future)

In this last pass, regular rebuild (if necessary), erasure code aggregation and regular aggregation will be triggered in turn. Once those three steps have been successfully completed, the pass will be marked as completed.

An extra step might be implemented in the future as part of this pass to scan the objects on each target and delete object shards that are not supposed to be stored there based on the current pool map and algorithmic placement.

Lost Management Service (future)

Pass 0 of the checker might be extended in the future to support the reconstruction of the membership from the superblock information stored on PMEM or in the blobstores.

Next Steps

This milestone aims at building the checker infrastructure for catastrophic recovery. Many improvements can be made in the future, including but not limited to:

  • Implement object scrubbing.

  • Better resilience to failure when a fault happens while the checker is running.

  • More recovery options (e.g. trying to rebuild the pool service if it is completely gone).

  • Ability to run the checker online.