WORM Feature
Stakeholders
Identify developer(s), component validation engineer(s) & reviewer(s).
Introduction
WORM stands for Write Once Read Many. WORM is a per-container feature that aims to improve performance for containers that are accessed in read-only mode after the initial data ingest.
The typical use case is an AI training dataset. Several parts of DAOS can be accelerated if we know that the data is immutable:
Aggressive caching can be done on the client side;
File size can be safely stored in the inode (dkey) and trusted;
VOS layout for the object can be optimized by simplifying the versioning and concurrency control mechanisms.
Requirements & Use Cases
Describe the key functionality, use cases and assumptions. This should include what is in-scope and out-of-scope.
Several use cases for data access today consist of read-only workloads on immutable data. For example:
Deep Learning or AI applications (MLPerf/RESNET, CosmoFlow, etc.) perform training and validation phases on existing datasets that consist of a large number of files (images).
Visualization applications on datasets produced by simulations; etc.
The key assumption, as the name of the feature suggests, is that the data in the container cannot be modified once the container is marked as a WORM container. Any further modification to the container's objects or metadata would result in a permission error (EPERM, or DER_NO_PERM at the DAOS API level). We can, however, still allow container attributes to be added, removed, or modified in this case.
The initial implementation and testing can be done for POSIX type containers, as the use cases are well known and testing is straightforward. In a later phase, we can explore extending WORM support to other types of containers. Generally, the server-side optimizations will apply to any type of container. The client-side and container layout optimizations, however, are middleware specific: each middleware would need to be updated to determine how it can leverage the WORM information to optimize its data access.
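The intended semantics can be illustrated with a small in-memory model. This is only a sketch, not DAOS code: the `WormContainer` class and its methods are hypothetical names, and the error is modeled with Python's `PermissionError` rather than a DAOS error code.

```python
import errno


class WormContainer:
    """Toy in-memory model of a container with a hypothetical WORM flag."""

    def __init__(self):
        self.objects = {}   # object name -> bytes
        self.attrs = {}     # container attributes (user metadata)
        self.worm = False

    def write(self, name, data):
        if self.worm:
            # Objects and their metadata are immutable once WORM is set.
            raise PermissionError(errno.EPERM, "container is WORM")
        self.objects[name] = data

    def read(self, name):
        return self.objects[name]

    def set_attr(self, key, value):
        # Container attributes stay mutable even after WORM is set.
        self.attrs[key] = value

    def mark_worm(self):
        self.worm = True


cont = WormContainer()
cont.write("img0", b"pixels")      # initial ingest succeeds
cont.mark_worm()
cont.set_attr("owner", "team-a")   # attribute updates are still allowed

try:
    cont.write("img0", b"other")   # any later write is rejected
    rejected = False
except PermissionError:
    rejected = True
```

Reads and attribute updates continue to work after `mark_worm()`, while the second write fails and leaves the original data untouched.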
Design Overview
Describe the key architecture decisions, benefits & drawbacks.
Describe what software component(s) will be modified and how. Diagrams are welcomed.
To mark a container as WORM, we can add a new container property that enables all the optimizations and layout changes that become possible once the container is known to be WORM. There are two cases to consider for how and when this property is set:
At container creation time, when data is first being added to or generated in the container. This would mean, for example, that after a key is inserted, it cannot be modified; once an array object is opened and written to, it cannot be written to again, etc. Enforcing that at the DAOS level might be challenging, so it might need to be the user's responsibility to ensure it.
On an existing container that has already been ingested or generated, and will be read-only.
As for the different optimizations that can be done, those can be middleware dependent or internal to DAOS:
DFS Layout optimization:
The DFS inode entry for every file and directory is stored in the parent directory that this object belongs to. The POSIX mapping in DFS to DAOS object is described here:
daos/src/client/dfs/README.md at master · daos-stack/daos
Today the entry for a file includes the file permission bits, ctime/mtime/atime, chunk size and object class. What it does NOT include is the file size. This was a deliberate decision in the DFS layout design: not maintaining the file size makes writes faster and provides better POSIX consistency. As part of a stat operation on a file, the DFS library has to do 2 operations:
fetch the inode entry for that file from the parent directory
perform a daos key query operation to calculate the file size from all the targets the file is sharded on.
The latter operation (getting the file size) can be very expensive, especially if the file is widely sharded / striped.
However, if the data is read-only, we know that the file size will not change anymore, so we can store the file size in the inode entry and avoid the file size query on every stat operation. We can accomplish that in either of 2 ways:
Before we set the container property for WORM, we can scan the POSIX namespace in the container, stat all the files, and store each file size in its inode entry. We can leverage a parallel tool like mpifileutils to do the scanning and updating in parallel to speed up this operation.
On data ingest, we can write the file size to the inode as we are populating the container. In that case, we will need to modify the ingest tool to tell the DAOS library that this data is going to be read-only once it's written.
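The effect on the stat path can be sketched as follows. This is a toy model rather than the DFS implementation: the dictionary-based inode entry, the `size` field, and the `FakeObjectStore` class are all hypothetical, and the counter simply stands in for the cost of the per-target size query.

```python
class FakeObjectStore:
    """Stand-in for the size query; counts how often it is issued."""

    def __init__(self, sizes):
        self.sizes = sizes
        self.size_queries = 0

    def query_size(self, oid):
        self.size_queries += 1
        return self.sizes[oid]


def stat(parent_dir, name, store):
    # Operation 1: fetch the inode entry from the parent directory.
    entry = parent_dir[name]
    # Hypothetical "size" field, present only once the layout is optimized.
    size = entry.get("size")
    if size is None:
        # Operation 2: the expensive size query, needed when the entry
        # does not carry the file size.
        size = store.query_size(entry["oid"])
    return {"mode": entry["mode"], "size": size}


store = FakeObjectStore({"oid-a": 4096})
parent = {"a.img": {"oid": "oid-a", "mode": 0o644}}

before = stat(parent, "a.img", store)   # issues one size query
parent["a.img"]["size"] = 4096          # size recorded in the inode entry
after = stat(parent, "a.img", store)    # no further size query
```

With the size stored in the entry, a stat reduces to the single inode fetch.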
Caching Optimizations:
When a POSIX container is mounted with dfuse, the default cache timeouts are short. For read-only use cases, those values can be higher to avoid fetching data blocks, or re-opening or stat-ing files, unnecessarily. When dfuse mounts a container, it can check whether the WORM property is set and, if it is, raise all the cache timeouts to something like 60 seconds by default instead of 1 second. Those fuse caching parameters include:
dfuse-attr-time
dfuse-dentry-time
dfuse-dentry-dir-time
dfuse-ndentry-time
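One possible way to pick the timeouts at mount time is sketched below. It assumes, for illustration only, that an explicitly set attribute takes precedence over the WORM default and that values are plain integers in seconds; the function name `effective_cache_times` is hypothetical.

```python
DFUSE_TIME_ATTRS = (
    "dfuse-attr-time",
    "dfuse-dentry-time",
    "dfuse-dentry-dir-time",
    "dfuse-ndentry-time",
)


def effective_cache_times(cont_attrs, is_worm, default=1, worm_default=60):
    """Return the cache timeout (seconds) to use for each dfuse attribute.

    Assumption: a value set explicitly on the container wins; otherwise the
    default depends on whether the container is marked WORM.
    """
    base = worm_default if is_worm else default
    return {name: int(cont_attrs.get(name, base)) for name in DFUSE_TIME_ATTRS}


regular = effective_cache_times({}, is_worm=False)
worm = effective_cache_times({}, is_worm=True)
tuned = effective_cache_times({"dfuse-attr-time": 300}, is_worm=True)
```

Keeping explicit settings authoritative lets an administrator still lower or raise an individual timeout on a WORM container.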
At some point, we may support caching in the DAOS client library at the DFS level. When that is available, the same mechanism used for dfuse can enable aggressive caching at that layer as well.
VOS layout
DAOS is a versioned object store and supports consistency using multi-version concurrency control (MVCC). Even though the MVCC protocol is optimistic, it does impose some overhead to detect conflicts. If we know a container is read-only, we can optimize the VOS layout by assuming that no snapshots need to be taken, which removes the need to maintain versioning. We can also disable all concurrency control mechanisms when accessing the container.
User Interface
How is the user/admin expected to interact with the new feature? Describe the API/tool.
What are all the tunables provided to the user/admin?
Any extra statistics that should be reported to the user/admin?
Explain how errors will be handled.
To set the WORM property, users can use the daos tool or API on a container (at create time, or on an existing container to mark it read-only).
When doing the layout optimization as the data is being generated / ingested, we need to consider how to enable the optimizations on the fly. For POSIX, for example, we need to maintain the file size as we write to the file. We can possibly also maintain the number of entries per directory, or a list of entries in the directory, for a faster readdir.
If we are doing the layout changes on an existing container, we need to consider how we implement the layout optimizations. Should the tool itself scan the container's POSIX namespace, or should we implement a different mechanism that does that? Usually the set-prop command and API only set a property value on the container and do not do any further work. So we probably need a new mechanism to implement the layout changes (query the file sizes and store them in the inode, any VOS layout optimizations, etc.). This can be accomplished by adding a new option to the daos tool:
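A minimal sketch of the on-the-fly bookkeeping, assuming a hypothetical ingest path in which every write goes through a wrapper that updates the inode. The `IngestFile` class, `create_file` helper, and the `size` / `nr_entries` fields are illustrative names, not part of the DFS layout.

```python
class IngestFile:
    """Tracks the file size in the inode entry as data is written."""

    def __init__(self, inode):
        self.inode = inode
        self.inode.setdefault("size", 0)
        self.extents = {}   # offset -> bytes, stands in for the array object

    def write(self, offset, data):
        self.extents[offset] = data
        # The file size is the highest byte offset written so far.
        self.inode["size"] = max(self.inode["size"], offset + len(data))


def create_file(directory, name):
    """Add an entry to a directory and keep its entry count up to date."""
    inode = {"size": 0}
    directory["entries"][name] = inode
    directory["nr_entries"] = len(directory["entries"])
    return IngestFile(inode)


root = {"entries": {}, "nr_entries": 0}
f = create_file(root, "sample.bin")
f.write(4096, b"x" * 100)   # sparse write: size becomes 4196
f.write(0, b"y" * 10)       # earlier offset does not shrink the size
```

Because writes may arrive out of order, the size is kept as a running maximum rather than a sum of bytes written.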
daos cont make-worm mypool mycont
We can also extend mpifileutils to support the scanning and layout changes in parallel (we would need to investigate which tool in MFU to extend to support this); otherwise, we will need to implement our own parallel tool.
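The conversion of an existing container could follow the outline below. This is a serial, in-memory sketch under stated assumptions: the nested-dictionary namespace, the `make_worm` function, and the `query_size` callback are hypothetical, and a real tool would walk the namespace in parallel. The one design point it illustrates is ordering: the property is set only after every inode has been updated, so an interrupted scan leaves the container unmarked.

```python
def make_worm(cont, query_size):
    """Store every file size in its inode entry, then set the WORM flag.

    Returns the number of file entries that were updated.
    """
    updated = 0
    stack = [cont["root"]]
    while stack:
        directory = stack.pop()
        for entry in directory.values():
            if entry["type"] == "dir":
                stack.append(entry["children"])
            else:
                entry["size"] = query_size(entry["oid"])
                updated += 1
    # Set the property last, once the layout changes are complete.
    cont["props"]["worm"] = True
    return updated


sizes = {"o1": 10, "o2": 20}
cont = {
    "props": {"worm": False},
    "root": {
        "a": {"type": "file", "oid": "o1"},
        "sub": {
            "type": "dir",
            "children": {"b": {"type": "file", "oid": "o2"}},
        },
    },
}
count = make_worm(cont, sizes.__getitem__)
```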
Impacts
Any performance impact?
Any API changes? If so, internal or external API? Any changes required to middleware? Any interop requirements?
Any VOS/config layout changes? How will migration be supported?
Any extra parameters required in the config file?
Any wire protocol change? How will interop be supported?
Any impact on the rebuild protocol?
Any impact on aggregation?
Any impact on security?
New APIs to be added for setting the WORM property.
New tools to be added for the layout changes.
A container marked as WORM with an optimized layout will not be accessible from older versions of DAOS that do not understand the layout changes.
Quality
How will the feature be tested? Unit tests, functional tests and system tests need to be covered.
Describe the extra soak/performance tests that should be added.
Project Milestones
Description of the different milestones delivering incremental functionality.
Describe what will work/not work and what will be validated.
Targeted date for each milestone.
References (optional)
External papers, web page (if any)
Future Work (optional)
Known issues and future work that was considered out-of-scope.