Middleware Consistency
Stakeholders
Johann, Mohamad, Xuezhao, Phil.
Introduction
The DAOS data model is very generic and users build different data models on top of it. Different data models and middleware libraries include:
DFS for POSIX containers (mapping DAOS objects to directories, files, and symlinks), maintaining a namespace and a user API on top of that.
PyDAOS for a flat namespace of dictionary objects that are mapped to internal DAOS KV objects.
HDF5 containers to support the HDF5 API and tools over a hierarchical namespace of groups, datasets, attributes, etc.
and many others that are in production or research.
The key point is that each of these middleware libraries maintains and understands its own data model over the DAOS data model. DAOS itself does not understand this data model and, for example, cannot distinguish between a POSIX root directory, a superblock (SB) object, or a regular file. From the DAOS API perspective and below, they are all regular objects.
After a catastrophic recovery event, the data model can become inconsistent as a result of repair actions from different passes. For instance, an object (a directory in a POSIX container) might be removed by the DAOS distributed checker. If the POSIX container is remounted, this directory will not be seen in the namespace, and everything under it becomes unreachable and leaked / lost in the container. This situation calls for a generic infrastructure that enables every middleware library to check and repair its data model after such an event.
This design document describes this generic infrastructure that DAOS will provide, and details two middleware libraries, DFS and PyDAOS, on how they will implement their own checkers using that infrastructure.
Requirements & Use Cases
As discussed in the previous section, when recovering from a catastrophic event, repair actions from DAOS could render some data models broken. The requirements for this work include developing:
Generic infrastructure / API extensions to support middleware consistency that are sufficient for existing data models to use for their repair actions.
Middleware tools supported in this work (POSIX and PyDAOS) that can properly utilize the new infrastructure to fix the namespace, avoiding any leaked objects / space in the container.
A testing infrastructure that allows corrupting containers to emulate a catastrophic recovery event, making it easy to test the consistency tools.
Use cases according to each middleware:
DFS/POSIX containers:
Under the DFS relaxed mode, and in some cases with the balanced mode, one could end up with orphaned files or directories due to the consistency model of the DFS API. The consistency tool for DFS should be able to detect all those broken links and orphaned objects and either punch them to reclaim the space, or link them into a lost+found directory under the root of the container so the user can either relink them in the namespace or punch them.
After a distributed checker run, the DFS namespace or data model can be broken in several ways:
Lost an entire directory object (all objects that had a link in that directory are orphaned)
Lost inode entry dkey(s) in one directory that could be a link to another file or directory (orphan subtree or orphan file).
Lost an entire file object (dangling entry in the namespace).
Lost chunk(s) of file object (corrupted file in the namespace).
Lost the container SB (container cannot be opened)
Corrupted container SB object
Lost the root object (everything is orphaned)
PyDAOS containers:
The PyDAOS namespace can also be broken after a distributed checker run. The PyDAOS data model is built on top of the DAOS KV API, so given that design, the main issues that could arise are (see the sketch after this list):
Lost a KV entry (orphaned object)
Lost an entire KV (everything is basically unlinked from the container)
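For illustration, here is a minimal sketch of the kind of link that can be lost, assuming a hypothetical PyDAOS-like layout where a root KV object maps each dictionary name to the OID of the KV object holding its items (the actual PyDAOS layout may differ):

#include <daos.h>
#include <daos_kv.h>

/* Hypothetical layout for illustration: a root KV entry links a
 * dictionary name to the OID of the KV object storing its items. If
 * this entry is lost ("lost a KV entry" above), the dictionary's KV
 * object still exists in the container but becomes unreachable. */
static int
link_dict(daos_handle_t root_kv_oh, const char *name, daos_obj_id_t dict_oid)
{
        return daos_kv_put(root_kv_oh, DAOS_TX_NONE, 0, name,
                           sizeof(dict_oid), &dict_oid, NULL /* synchronous */);
}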
Design Overview
The main infrastructure change to support the middleware consistency (MWC) tools, as discussed above, is an extension to the DAOS Object ID Table (OIT). Currently, the OIT API only allows one to create the OIT object and iterate through all the object IDs at a particular snapshot in the container. For the MWC tools, the OIT API needs some extensions to support the process of repairing the middleware model. This repair is a two-pass process:
Descend the middleware “namespace”, querying the object ID of every object that is visited / connected in the namespace. Every object ID that is seen should be marked as such in the OIT.
Iterate through all the object IDs in the OIT and return to the middleware tool all the unmarked objects (orphaned objects). The middleware tool can decide what to do with those objects: either punch them to reclaim space or reattach them to a lost+found for the user to relink later (see the sketch below).
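As a minimal sketch of pass 1, assuming the OIT marking extension proposed in the User Interface section and a middleware-provided namespace traversal that calls this helper for every reachable object:

#include <daos.h>

/* Pass 1 helper: called once per object reached while descending the
 * middleware namespace; stores a one-byte "checked" marker in the OIT. */
static int
visit_object(daos_handle_t oit_oh, daos_obj_id_t oid)
{
        uint8_t checked = 1;
        d_iov_t marker;

        d_iov_set(&marker, &checked, sizeof(checked));
        return daos_oit_mark(oit_oh, oid, &marker, NULL /* synchronous */);
}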
The two main DAOS components that will require changes to support this are:
The OIT API today does not support any updates or “markings” of object IDs and thus will need to be extended to support that.
The DAOS POSIX and PyDAOS middleware, which live within the DAOS source code, will need to provide tools to enact this recovery process.
In-scope:
DAOS infrastructure to support MWC tools
Snapshot improvements
Improve OIT creation time
Allow storing of a “marker” for each entry in the OIT
Provide API to iterate over the OIT, mark/unmark entries, and retrieve unmarked entries.
DFS MWC checker
Support POSIX container (file & directories)
Integrated into the daos utility
No orphaned objects or leaked space to exist in the container after the checker is run
All the orphaned objects are returned to the tool, which can either (through a user-specified option) punch them or add them to lost+found.
DFS Checker tool to reconstruct an SB if it was not found.
DFS Checker tool to optionally recreate the root object in the container in case it was not found, and put everything in lost+found in the container:
otherwise the user would have to destroy the container, as everything is leaked.
PyDAOS MWC checker
Support for Python dictionary over DAOS
Integrated into the daos utility
No orphaned objects or leaked space to exist in the container after the checker is run
All the orphaned objects are returned to the tool, which can either (through a user-specified option) punch them or add them to a lost+found KV object.
Middleware-specific Fault Injection tool:
To properly test all required use cases, we need a fault injection tool that performs the appropriate data corruptions and loss of objects / data in a container, since such cases will not be generated deterministically by every catastrophic event. This tool should be based on the DAOS API but still be aware of the data model, so it can generate the exact use cases to recover from.
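As a sketch of one injected fault (losing an entire directory object), assuming the tool resolves the victim path through libdfs and then punches the backing object directly via the DAOS object API:

#include <fcntl.h>
#include <daos.h>
#include <daos_fs.h>

/* Simulate "lost an entire directory object": resolve the directory
 * through DFS, then punch its backing DAOS object out from under the
 * namespace, orphaning everything linked below it. */
static int
inject_lost_dir(daos_handle_t coh, dfs_t *dfs, const char *path)
{
        dfs_obj_t     *dir;
        daos_obj_id_t  oid;
        daos_handle_t  oh;
        int            rc;

        rc = dfs_lookup(dfs, path, O_RDWR, &dir, NULL, NULL);
        if (rc)
                return rc;
        rc = dfs_obj2id(dir, &oid);
        dfs_release(dir);
        if (rc)
                return rc;
        rc = daos_obj_open(coh, oid, DAOS_OO_RW, &oh, NULL);
        if (rc)
                return rc;
        rc = daos_obj_punch(oh, DAOS_TX_NONE, 0, NULL);
        daos_obj_close(oh, NULL);
        return rc;
}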
Out-of-scope:
MWC checkers for other middleware on top of DAOS
The HDF5 DAOS VOL MWC checker is not implemented (only DFS and PyDAOS will be implemented for this work).
Some corruption cases for MW checkers:
A file that is corrupted (lost chunks in the array object) cannot be detected and the issue cannot be reported to the user.
An inode entry was partially corrupted (the akey value), meaning some metadata on the corresponding object was lost (this could be mode_t, mtime, ctime, uid, gid, the symlink value if the entry was a symlink, etc.).
This is undetectable by the DFS MWC tool and won’t be discovered until the user attempts to access that entry, which may fail; the user would then have to request a specific repair on the entry, which can be provided as a later extension to the repair tool.
Generic corruptions of the pool and container metadata (properties, attributes).
Outside of catastrophic recovery, a user might cause problems in the DFS namespace, like creating a loop: dir1->dir2->dir3->dir1.
Fixing such issues is out of scope for the MWC checker tool for DFS.
User Interface
To support the Middleware Consistency tools for different middleware libraries, we need some generic infrastructure that these tools can use to detect irregularities in their data model. This generic infrastructure lives in the Object ID Table (OIT) APIs, which let a tool go through the entire list of objects in a container and check whether any of those objects are missing from its data model. This API needs to be extended to allow marking objects as “checked”, so a tool can restart or do another pass through the list when needed. The following new APIs will be implemented:
int daos_oit_mark(daos_handle_t oh, daos_obj_id_t oid, d_iov_t *marker, daos_event_t *ev);
typedef int (daos_oit_filter_cb)(daos_obj_id_t oid, d_iov_t *marker);
int daos_oit_list_filter(daos_handle_t oh, daos_obj_id_t *oids, uint32_t *oids_nr, daos_anchor_t *anchor, daos_oit_filter_cb filter, daos_event_t *ev);
The first API allows a user to mark an object in the OIT with a marker (max size of 128 bits). For middleware consistency, this marker can be just a single bit flag indicating that the object has been “checked”. The second API lists all the object IDs in the OIT and calls a user-defined callback on each OID.
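For illustration, here is a minimal sketch of pass 2 built on these APIs. Two assumptions, pending the final API semantics: the filter callback selects an entry by returning non-zero, and an entry that was never marked carries an empty marker:

#include <daos.h>

/* Select entries that were never marked in pass 1 (assumed to carry an
 * empty marker); these are the orphan candidates. */
static int
orphan_filter(daos_obj_id_t oid, d_iov_t *marker)
{
        return marker == NULL || marker->iov_len == 0;
}

static int
list_orphans(daos_handle_t oit_oh)
{
        daos_obj_id_t oids[64];
        uint32_t      nr;
        daos_anchor_t anchor = {0};
        int           rc = 0;

        while (!daos_anchor_is_eof(&anchor)) {
                nr = 64;
                rc = daos_oit_list_filter(oit_oh, oids, &nr, &anchor,
                                          orphan_filter, NULL /* synchronous */);
                if (rc != 0)
                        break;
                /* nr orphaned OIDs: punch them or relink under lost+found */
        }
        return rc;
}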
Using the new OIT APIs, we will extend the daos tool to support middleware consistency. Since we are supporting POSIX and PYDAOS containers, the tool extension would be something like:
daos container checker <pool> <container>
which, depending on the container type, would do the appropriate scanning of the namespace (POSIX or PYDAOS) and the OIT, and fix the irregularities in the data model.
This tool will support only containers of type POSIX and PYTHON. Passing containers of other types will return ENOTSUP.
Impacts
No performance impacts. New APIs and tool extensions are required (see the User Interface section).
Quality
Testing of the work in this milestone is done at two layers, mirroring the development:
For testing the OIT API extensions, new component tests will be added to daos_test to verify that the API is working properly.
For testing the MWC checker tools we will add new functional tests to:
generate POSIX (using mdtest and IOR) and PyDAOS containers.
use the new fault injection tool to simulate the state after a catastrophic recovery.
all the supported states should be simulated.
use the middleware checker tools to fix the containers.
verify the state of the containers and that no orphan objects exist.
this can be done through the OIT list API and some size queries on the pool to verify there is no leaked space (see the sketch after this list)
Benchmark the performance of the checker tool on different container sizes.
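As a sketch of the space check referenced above, comparing pool free space before and after the checker run (the field names follow the public daos_pool_info_t structure; treat the exact layout as an assumption):

#include <daos.h>

/* Query total free pool space (SCM + NVMe). Comparing this value before
 * and after the checker run shows whether punched orphans actually
 * returned their space. */
static int
query_free_space(daos_handle_t poh, uint64_t *free_bytes)
{
        daos_pool_info_t info = {0};
        int              rc;

        info.pi_bits = DPI_SPACE;
        rc = daos_pool_query(poh, NULL, &info, NULL, NULL);
        if (rc == 0)
                *free_bytes = info.pi_space.ps_space.s_free[DAOS_MEDIA_SCM] +
                              info.pi_space.ps_space.s_free[DAOS_MEDIA_NVME];
        return rc;
}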
Project Milestones
OIT API extensions - Q4'22
Demo script for NRE milestone - Q4'22
MWC checker tool - Q1’23
Fault injection tool - Q1’23
Demo test cases - Q1’23
Validation - Q2’23