Tiered Containers (Phase 1)
Stakeholders
This feature will be driven as a community project among multiple companies/contributors.
Introduction
The purpose of this feature is to attach a container to an existing backend storage tier (object store or filesystem) and provide transparent caching.
It only targets POSIX containers, although the architecture should allow other container types to be supported in the future.
Both data and metadata (including the namespace) are imported into the container asynchronously and on demand upon access.
It should support POSIX filesystems (e.g. NFS, Lustre, GPFS) as well as cloud object stores like S3, GCS or blob storage.
The implementation will be done in multiple phases, delivering incremental functionality over the same baseline infrastructure.
The purpose of this page is to describe the baseline infrastructure for phase 1, which targets AI training. It will then be extended to eventually support caching the same backend tier into multiple DAOS systems, with cache pressure, writeback, prefetch and out-of-band update capabilities.
In concept, it is similar to the Lustre HSM feature, except that:
it is fully integrated into DAOS and does not rely on third-party software
the namespace does not need to be backed up separately
it does not rely on a central coordinator since DAOS metadata are natively distributed on all engines.
Requirements & Use Cases
The primary use case for phase 1 is training data for AI. A POSIX container (and POSIX only) can be attached to a GCS/S3 bucket or a POSIX namespace (called the “backend tier” in this document). The URI (i.e. bucket address or filesystem path) of the backend tier should be passed as a property to the container.
Once the container is successfully created, it can be immediately mounted via libdfs (and thus dfuse too). The namespace exposed through the container matches the one in the backend tier. Latency on first metadata or data access is higher than on subsequent ones since the metadata/data needs to be loaded from the backend tier and copied to DAOS.
Files and directories loaded from the backend tier are immutable in DAOS. This means that files/directories cannot be removed or renamed, and new files cannot be created in existing directories. Similarly, files from the backend tier cannot be modified. EROFS will be returned in all those cases.
New files and directories can be created at the root of the container. Those files and directories will exist in the DAOS container and won’t be propagated back to the backend tier.
In-scope
Ability to configure the DAOS server via the yaml file to allow users to create tiered containers.
Populating directory and file content from the backend tier on demand.
Exposing files backed by the backend storage tier as read-only.
Allowing new files (not existing in the backend storage tier) to be created in the same container. Those files won’t be written back to the backend tier.
A proper error should be propagated back to the DAOS client if the backend tier is not accessible.
ENOSPC should be returned to the application if the content of the backend tier could not be written into DAOS.
The whole file is prefetched into DAOS upon access.
(TBD, might be premature: Optimisations to prefetch a bunch of small files at once from backend tier).
Out-of-scope
Only POSIX containers are supported.
The backend storage tier should be pre-configured to be accessible from all the DAOS engines. For POSIX filesystems, the filesystem must be mounted on all the engines, and POSIX ACLs/permissions should be set to allow the user ID under which the DAOS engines run (via systemd) to access the filesystem in read-only mode (read-only is sufficient for phase 1; read-write will be required for phase 2). For S3/GCS, the buckets must be accessible from all the DAOS servers and ACLs should be configured to allow the user ID of the DAOS engine to access and list all objects in the bucket.
The backend tier is considered immutable while it is attached to a container, until the container is destroyed. Changes to the backend tier will result in undefined behaviour.
None of the new files created in the DAOS container are pushed to the backend tier. This means that those files are lost once the container is destroyed if no manual backup was done.
The pool should have enough space to fit the backend tier. If not, ENOSPC might be returned on any operation (including read or ls). No cache pressure is managed in DAOS: when the pool is full, it is full, since there is no data eviction.
Any user hints on prefetching or optimizing data movement are left to subsequent phases.
Many performance optimisations will be developed in subsequent phases.
Partial fetch of file content or namespace.
Design Overview
Phase 1 aims at defining a baseline infrastructure that supports the use case described above and provides an evolvable foundation for more advanced use cases in subsequent phases.
Support for tiering from the client side was attempted several years ago and exposed complexity that was difficult to overcome. For this new approach, it was thus decided to support tiering as a first-class feature in DAOS and to drive it from the engine, which has finer-grained visibility into and control over the data.
New daos_mover Service
The DAOS engine is a controlled event-driven environment where third-party libraries (e.g. libs3) that are not strictly required are not welcome. A bug in those libraries could cause the engine to crash and prevent access to all pools/containers even if tiering is not used. The engine is also written in C, and higher-level languages (e.g. Go, Rust, …) might be more suitable for accessing the backend storage tier.
A new service called daos_mover will thus be started by the control plane to manage access to the backend tier. It will run alongside the daos_engine and have dedicated cores reserved in the DAOS yaml file, not overlapping with the daos_engine cores. The daos_mover service sets up a drpc channel with the daos_engine to receive cache miss notifications from the engine.
The daos_mover service should implement a backend plugin interface to support different backend tiers, including S3, GCS, Lustre and POSIX filesystems.
For phase 1, the daos_mover is mostly responsible for handling cache misses from the engine, converting those requests into libdfs requests and copying data from the backend tier to the container. It will act as a regular client over libdfs and needs to be able to open and access the container. It is also acknowledged that the daos_mover process/threads on one node can issue write or read requests to non-local engines via libdfs/libdaos. This might be optimised in the future.
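As an illustration, the sketch below shows how the mover could service a cache miss for a file backed by a POSIX backend, using the libdfs calls it would issue as a regular client. The function name, the assumption that the drpc payload already carries the parent object, the entry name and the backend path, and the whole-file copy loop are illustrative only, not a committed interface.

/*
 * Minimal sketch of how the daos_mover could service a file cache miss
 * against a POSIX backend. The drpc payload is assumed to already carry
 * the parent directory object, the entry name and the backend path.
 * All names here are illustrative, not a committed API.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
#include <daos.h>
#include <daos_fs.h>

#define XFER_SZ	(1 << 20)	/* 1 MiB copy buffer */

static int
handle_file_miss(dfs_t *dfs, dfs_obj_t *parent, const char *name,
		 const char *backend_path)
{
	dfs_obj_t	*obj;
	d_sg_list_t	 sgl;
	d_iov_t		 iov;
	char		*buf;
	daos_off_t	 off = 0;
	ssize_t		 sz;
	int		 fd, rc;

	/* Open the file in the backend tier (read-only for phase 1). */
	fd = open(backend_path, O_RDONLY);
	if (fd < 0)
		return errno;

	/* Open the corresponding DFS file through the specific handle. */
	rc = dfs_open(dfs, parent, name, S_IFREG | 0444, O_RDWR,
		      0 /* default object class */, 0 /* default chunk size */,
		      NULL, &obj);
	if (rc)
		goto out_fd;

	buf = malloc(XFER_SZ);
	if (buf == NULL) {
		rc = ENOMEM;
		goto out_obj;
	}

	/* Whole-file prefetch: copy the backend content into the container. */
	while ((sz = read(fd, buf, XFER_SZ)) > 0) {
		d_iov_set(&iov, buf, sz);
		sgl.sg_nr	= 1;
		sgl.sg_nr_out	= 0;
		sgl.sg_iovs	= &iov;
		rc = dfs_write(dfs, obj, &sgl, off, NULL);
		if (rc)	/* e.g. ENOSPC if the pool cannot fit the file */
			break;
		off += sz;
	}

	free(buf);
out_obj:
	dfs_release(obj);
out_fd:
	close(fd);
	return rc;
}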
A raft service might be required among all the daos_mover processes to handle failures (see the fault management section), hopefully reusing code from the control plane. The raft service might also help with concurrency control, for instance to avoid duplicating the transfer of the same files from two different mover instances, and will give more flexibility in the future.
Core DAOS Changes
The intent is to minimise the changes required in the DAOS stack, and more particularly in the client libraries, which should be able to operate over a tiered container in the same way as over a normal container, modulo the higher latency upon first access.
Cache Miss
When receiving an I/O request for a missing object or dkey in a tiered container, the engine will have to send a cache miss notification to the daos_mover through the local drpc channel. The missing object or dkey will have to be translated into a DFS object and eventually into an object/file/directory in the backend tier. This mapping is detailed in the next section.
At the DFS level, a cache miss on a file should be triggered while opening the file (and not on entry lookup) to allow ls -l to be run by the user without triggering an import of all the files into DAOS. It is currently difficult for the engine to tell the difference between a lookup and an open, but a new intent flag can be added to the DAOS API to differentiate those two cases and have the mover trigger the file import only on open.
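The sketch below illustrates the intended distinction at the DFS level, assuming dfs_lookup_rel() stays import-free and dfs_open() carries the new intent down to the engine. The DAOS_COND_TIER_FETCH flag is purely hypothetical; the actual flag name, value and the layer where it is plumbed through remain to be defined.

/*
 * Illustration of the lookup vs. open distinction at the DFS level. The
 * DAOS_COND_TIER_FETCH flag is a placeholder for the new intent flag.
 */
#include <fcntl.h>
#include <sys/stat.h>
#include <daos.h>
#include <daos_fs.h>

#define DAOS_COND_TIER_FETCH	(1 << 28)	/* placeholder flag value */

static int
stat_then_open(dfs_t *dfs, dfs_obj_t *parent, const char *name)
{
	dfs_obj_t	*obj;
	struct stat	 stbuf;
	mode_t		 mode;
	int		 rc;

	/* "ls -l" path: lookup only, must not import the file content. */
	rc = dfs_lookup_rel(dfs, parent, name, O_RDONLY, &obj, &mode, &stbuf);
	if (rc)
		return rc;
	dfs_release(obj);

	/* Actual open: the intent flag tells the engine to notify the
	 * mover and import the file content on a cache miss. */
	rc = dfs_open(dfs, parent, name, S_IFREG,
		      O_RDONLY | DAOS_COND_TIER_FETCH, 0, 0, NULL, &obj);
	if (rc == 0)
		dfs_release(obj);
	return rc;
}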
Data Readiness Notifications
Upon a cache miss, the data mover will be in charge of pulling the missing data from the backend tier, and this might take some time. As a consequence, we cannot afford to just hold the reply until the data is copied, since the original RPC might time out on the client. Currently, DAOS has no mechanism to issue notifications/RPCs from a server to a client. This could be implemented by using RDMA: the server issues a PUT when the requested data is available, and libdaos can then retry the request. This mechanism could be reused by other features, like a flock implementation or snapshot notifications.
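A schematic view of the client-side flow is sketched below, under the assumption that the engine answers a request for not-yet-staged data with a retryable status and later RDMA-PUTs a ready marker into a small buffer registered by the client along with the request. All names in the sketch (ready_marker, submit_fetch, STAGING_IN_PROGRESS) are stand-ins, not an existing DAOS API.

/*
 * Sketch of the client-side wait-and-retry path for the proposed data
 * readiness notification. Everything here is illustrative.
 */
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

#define STAGING_IN_PROGRESS	1	/* placeholder retryable status */

struct ready_marker {
	/* Written by the engine via an RDMA PUT once the mover has staged
	 * the requested data into the container. */
	_Atomic uint64_t	generation;
};

/* Stand-in for the actual fetch RPC: it would register the marker's RDMA
 * key with the request and return STAGING_IN_PROGRESS when the data still
 * has to be pulled from the backend tier. */
static int
submit_fetch(struct ready_marker *mkr)
{
	(void)mkr;
	return 0;
}

static int
fetch_with_staging(struct ready_marker *mkr)
{
	uint64_t	gen = atomic_load(&mkr->generation);
	int		rc = submit_fetch(mkr);

	while (rc == STAGING_IN_PROGRESS) {
		/* Wait for the server-side RDMA PUT instead of holding the
		 * original RPC open until the copy completes. */
		while (atomic_load(&mkr->generation) == gen)
			sched_yield();
		gen = atomic_load(&mkr->generation);
		rc = submit_fetch(mkr);	/* retry now that data is resident */
	}
	return rc;
}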
Specific Handle
The daos_mover service needs specific permissions over libdfs/libdaos to be able to access the container, but also to write to objects that are supposed to be immutable (which a regular client is not allowed to do). Rebuild already uses a specific handle that (or at least its mechanism) can hopefully be shared with the daos_mover service without messing with ACLs.
Mapping between DAOS objects and backend objects/files
Identifiers on backend tier
Rebuilding the path to the root of the POSIX container is an expensive operation. That’s why each file/directory imported into the DAOS container should be associated with an identifier of the corresponding file/object in the backend tier. For Lustre, the FID can be used to uniquely identify and open any file or directory in the filesystem. The targeted cloud object stores rely on a path of up to 1024 characters to uniquely identify an object.
Storing Identifiers in DAOS
The identifier can be stored in the directory entry in a special xattr. This works as long as we import the whole file and don’t support partial file import (not in scope for phase 1).
The special xattr will be populated when importing the directory with all its entries and fetched by the mover to retrieve the file content on open.
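As an illustration, one possible realization using the existing libdfs xattr calls is sketched below. Whether the identifier ultimately lives in the directory entry itself or as an xattr on the imported object is an implementation detail, and the xattr name used here is a placeholder.

/*
 * Sketch of how the backend identifier could be recorded when a directory
 * is imported and read back by the mover when a file is opened. The xattr
 * name "user.daos.tier_id" is a placeholder; only the mover's specific
 * handle would be allowed to set it.
 */
#include <string.h>
#include <daos.h>
#include <daos_fs.h>

#define TIER_ID_XATTR	"user.daos.tier_id"
#define TIER_ID_MAX	1024	/* matches the 1024-char backend path limit */

/* Called while importing a directory entry: remember where the content
 * lives in the backend tier (e.g. a bucket path or a Lustre FID). */
static int
record_backend_id(dfs_t *dfs, dfs_obj_t *entry, const char *backend_id)
{
	return dfs_setxattr(dfs, entry, TIER_ID_XATTR, backend_id,
			    strlen(backend_id) + 1, 0);
}

/* Called on a cache miss at open time: retrieve the backend identifier so
 * the mover knows what to pull without rebuilding the full POSIX path. */
static int
lookup_backend_id(dfs_t *dfs, dfs_obj_t *entry, char *buf)
{
	daos_size_t size = TIER_ID_MAX;

	return dfs_getxattr(dfs, entry, TIER_ID_XATTR, buf, &size);
}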
Fault Management
DAOS Node Failure
Failure of DAOS nodes or VMs will result in not only the daos_engine being lost, but also the daos_mover process(es). This causes two issues:
The engine is responsible for tracking which RDMA keys to notify upon completion of outgoing transfers on the mover service. Upon rebuild on the surviving engines, this list of RDMA keys to notify will be lost, so the client will wait indefinitely and never be notified of the transfer completion.
If the daos_mover was in the process of copying data from the backend tier to the container, this transfer will be interrupted and won’t be restarted anywhere.
Several options can be considered to address this issue. The daos_mover can run a raft instance (similar to the control plane) and record the status of outgoing transfers, potentially with the RDMA key (if it can be serialised). Upon eviction of a daos_engine, the control plane could notify the daos_mover service, which would restart any outgoing transfer on a different mover node and use the local engine to notify the client.
Another, potentially simpler, solution is to have the client resend the request (with the RDMA key) upon pool map changes. The engine receiving the resent request would track the key and trigger the data transfer again. This might be acceptable as an initial implementation given that server failures are still not that frequent.
Backend Storage Tier Unavailability
The backend tier might become unresponsive while trying to transfer data to DAOS. The daos_mover should implement some retry capabilities, but eventually propagate the error back to the client application via the DAOS stack. ENOTCONN (i.e. transport endpoint not connected) sounds like the closest one in errno.h.
User Interface
Two new immutable container properties passed at creation time will be added:
Tiering: used to select the tiering mechanism, if any. The value is off by default if the container is not attached to any backend tier. It can then be set to different values to be determined.
Tier address (tier_add): the value is a string composed of a prefix identifying the protocol (i.e. S3, GCS, posix, …) followed by a path pointing at the backend storage tier. Examples are s3://[bucket_name]/, gcs://[bucket_name]/, lustre:/path/to/lustre/fs and posix:/path/to/root/dir.
Container creation will fail if the DAOS server cannot access/connect to the backend tier.
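A minimal sketch of what this could look like through the existing property API is shown below. The DAOS_PROP_CO_TIERING and DAOS_PROP_CO_TIER_ADD constants do not exist yet and are placeholders for the two new property types; daos_prop_alloc() and daos_cont_create_with_label() are the existing calls.

/*
 * Sketch of passing the two new (proposed) properties at container create
 * time. Property type values below are placeholders.
 */
#include <string.h>
#include <daos.h>

#define DAOS_PROP_CO_TIERING	0x9001	/* placeholder property type */
#define DAOS_PROP_CO_TIER_ADD	0x9002	/* placeholder property type */
#define DAOS_TIERING_ON		1	/* placeholder "tiering enabled" value */

static int
create_tiered_cont(daos_handle_t poh, const char *label, uuid_t *uuid)
{
	daos_prop_t	*prop;
	int		 rc;

	prop = daos_prop_alloc(2);
	if (prop == NULL)
		return -DER_NOMEM;

	/* Select the tiering mechanism (off by default). */
	prop->dpp_entries[0].dpe_type = DAOS_PROP_CO_TIERING;
	prop->dpp_entries[0].dpe_val  = DAOS_TIERING_ON;

	/* Point the container at the backend tier. */
	prop->dpp_entries[1].dpe_type = DAOS_PROP_CO_TIER_ADD;
	prop->dpp_entries[1].dpe_str  = strdup("s3://training-bucket/");

	/* Fails if the DAOS servers cannot reach the backend tier. */
	rc = daos_cont_create_with_label(poh, label, prop, uuid, NULL);
	daos_prop_free(prop);
	return rc;
}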
The DAOS server yaml file will also be extended to define whether the daos_mover process is enabled and what cores it should be bound to.
Impacts
Any performance impact?
Any API changes? If so, internal or external API? Any changes required to middleware? Any interop requirements?
Any VOS/config layout changes? How will migration be supported?
Any extra parameters required in the config file?
Any wire protocol change? How will interop be supported?
Any impact on the rebuild protocol?
Any impact on aggregation?
Any impact on security?
If the security credentials to access the backend are contained within the container metadata, this will allow any user with access to the DAOS container via DAOS ACLs to access the backend.
Does the daos_mover change identity to the user when accessing the backend?
Quality
How will the feature be tested? Unit tests, functional tests and system tests need to be covered.
Describe the extra soak/performance tests that should be added.
Project Milestones
The work for phase 1 can be divided into multiple components that can be somewhat developed independently to start with:
All changes to the DAOS control plane to support the yaml file additions, spawn the daos_mover process, wire all the mover processes together and set up the drpc channel with the engine.
daos_mover binary linking with the different backend plugins for S3, GCS and POSIX, and setting up the raft instance (if needed).
Implement the new intent flag to differentiate a dfs open from a lookup.
daos_engine changes to report cache misses via the drpc channel.
Generic notification mechanism to notify client nodes of long-running events.
Implement the specific handle for the daos_mover to use.
Code to map a cache miss to an actual data transfer from the backend tier and store the backend identifier in the DAOS entry.
Integrating all components together.
All tasks associated with phase 1 should be tracked under this jira ticket: https://daosio.atlassian.net/browse/DAOS-11430
Future Work (optional)
Known issues and future work that were considered out of scope.