DAOS POSIX Best Practices
The purpose of this document is to explain how POSIX is supported over DAOS and to point out some best practices to achieve good performance and system stability when using the POSIX interface. Most of these best practices are highlighted in green throughout this document.
DAOS POSIX Containers
A container in DAOS is a dataset containing objects. A DAOS container can represent a POSIX namespace, a key-value store, a database, etc. One should look at a container as a single entity containing a large amount of data. Containers are described in more detail in the DAOS user guide:
https://docs.daos.io/v2.6/overview/storage/#daos-container
DAOS POSIX containers should be thought of as a namespace that can contain millions or billions of files and directories. The DAOS pool is not a root POSIX file system, and containers should not be thought of as directory objects contained within a DAOS pool. Instead, the POSIX file system is defined within a POSIX container.
DAOS has limitations on the number of containers as described here:
https://docs.daos.io/v2.6/overview/storage/#overview
DAOS containers can be accessed by different users by granting ACL access to the users that need it.
POSIX Container Interfaces Available
DAOS provides access to DAOS containers through several interfaces, including:
DFS: a POSIX-like interface in user space.
dfuse: a FUSE plugin that involves the kernel for using the POSIX API, plus interception libraries to bypass the kernel when needed.
MPI-IO: the standard API for parallel IO, accessed through a ROMIO user-space driver (it also works through the POSIX driver over dfuse).
HDF5: a middleware IO library and data format that uses the POSIX or MPI-IO API underneath. The best practices for HDF5 are the same as those for the underlying interface being used.
HDF5 also has a DAOS connector for the Virtual Object Layer, but that is out of scope for this document.
This section provides some guidance on when to use one of these interfaces. The figure below provides a reference for the following discussion:
DAOS File System (DFS)
The DFS interface is the most efficient because it calls directly into the DAOS library, libdaos, but it requires application code changes to use the DFS API for IO calls. The DFS library does not implement any user-space caching, which means that write-back, read-ahead, and metadata caches are not available when using the DFS API. More information about DFS is available here:
https://docs.daos.io/v2.6/user/filesystem/#libdfs
dfuse
dfuse provides access to DAOS through the Linux user-space file system, FUSE. A dfuse mountpoint is the easiest way to use DAOS as a regular parallel filesystem without requiring any changes to applications or user codes. However, dfuse is not always efficient because it involves the kernel which increases the latency for application I/O calls. Nonetheless, the kernel interface does offer some advantages in some situations because it supports write-back, read-ahead, and metadata caching.
One of the major limitations of dfuse today is that on client nodes with multiple network interfaces capable of processing IO requests to the DAOS server system, dfuse utilizes only a single interface. This may be improved in future versions of DAOS, but today it limits the performance of a single client node to one network interface.
More information about dfuse is available here:
https://docs.daos.io/v2.6/user/filesystem/#dfuse-daos-fuse
An interception library, libioil, is available to bypass the kernel for bulk IO (read/write) operations. The reduced latency results in performance similar to DFS. Metadata operations, however, are currently not intercepted and must pass through the kernel via FUSE.
More information about libioil is available here:
https://docs.daos.io/v2.6/user/filesystem/#interception-library-libioil
A newer interception library, libpil4dfs, was introduced in DAOS 2.4 to intercept both metadata and data operations, and it provides performance almost identical to using the DFS library directly. More info here:
https://docs.daos.io/v2.6/user/filesystem/#interception-library-libpil4dfs
Both interception libraries, when used, do not suffer from the single-network-link limitation that dfuse has on client platforms with multiple NICs. The reason is that each client process calls daos_init() in the interception library from the core it is running on, so DAOS creates an independent network context for it using the closest network interface on the NUMA socket where the process runs. On systems with multiple NICs per CPU socket, the agent uses a round-robin policy to select the NIC for each process's network context.
Given all that, applications running with multiple processes / MPI ranks per node should distribute the CPU bindings evenly between the two CPU sockets of the client node to utilize all the network interfaces and achieve the best performance.
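The per-process NIC selection described above can be sketched as a small simulation (a conceptual model only, not DAOS code; the socket layout and NIC names such as mlx5_0 are hypothetical):

```python
from itertools import cycle

def assign_nics(process_sockets, socket_nics):
    """Assign each process a NIC from its NUMA socket, round-robin per socket."""
    pickers = {sock: cycle(nics) for sock, nics in socket_nics.items()}
    return [next(pickers[sock]) for sock in process_sockets]

# 8 ranks bound evenly across 2 sockets, 2 NICs per socket (hypothetical names):
ranks = [0, 0, 0, 0, 1, 1, 1, 1]   # NUMA socket binding per rank
nics = {0: ["mlx5_0", "mlx5_1"], 1: ["mlx5_2", "mlx5_3"]}
print(assign_nics(ranks, nics))
# All four NICs end up in use; binding every rank to socket 0 instead
# would leave the socket-1 NICs idle.
```

This is why spreading ranks across both sockets matters: the NIC pool a process can draw from is determined by the socket it runs on.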
Some best practices when using dfuse are:
MPIIO
MPI-IO is another IO interface that DAOS supports, either over a dfuse mount or with OS bypass through the ROMIO ADIO driver (available in MPICH and Intel MPI). The driver can be selected either by adding a prefix (daos:filename) when accessing files through the MPI interface or by setting an environment variable, depending on the MPI implementation.
Note: With the DAOS 2.6.3 code base and later, file system detection is done automatically by ROMIO, and setting the file prefix or environment variables is no longer needed.
More information about DAOS and MPIIO is available here:
https://docs.daos.io/v2.6/user/mpi-io/
When using the DAOS ROMIO driver directly, one should consider a few things:
Redundancy and Object Class Settings
Configuring a DAOS POSIX container with the right redundancy and object class settings to achieve the best performance can be challenging for non-storage experts. The default settings are ideal in some situations but can be inefficient in others. DAOS does provide simplified properties and hints that the user can apply to change those defaults. This document explains the different knobs that can be tuned and how.
Redundancy Factor
The first property to consider is the redundancy factor. This property (rd_fac) describes the number of concurrent server failures that objects in the container are protected against. The rd_fac value is an integer between 0 (no data protection) and 5 (support for up to 5 simultaneous failures). The redundancy factor can be set at container creation time and cannot be modified afterward. If a user sets a redundancy factor of 1 and then tries to create a file or directory with no redundancy, the create will fail because it does not meet the redundancy requirement of the container.
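The rd_fac enforcement described above can be illustrated with a minimal sketch (an illustrative model, not the DAOS implementation; the oclass-to-protection mapping below is a hypothetical subset):

```python
# How many concurrent server failures each object class tolerates
# (hypothetical subset for illustration):
OCLASS_RD_FAC = {
    "S1": 0, "SX": 0,            # sharded only, no data protection
    "RP_2G1": 1, "RP_3G1": 2,    # n-way replication survives n-1 failures
    "EC_16P1GX": 1, "EC_16P2GX": 2,  # EC survives up to 'parity' failures
}

def check_create(container_rd_fac, oclass):
    """Object creation succeeds only if the oclass meets the container's rd_fac."""
    return OCLASS_RD_FAC[oclass] >= container_rd_fac

print(check_create(1, "S1"))         # False: S1 has no redundancy
print(check_create(1, "EC_16P1GX"))  # True: 1 parity cell covers rd_fac:1
```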
More information about DAOS redundancy factor is available here:
https://docs.daos.io/v2.6/user/container/#redundancy-factor
It is recommended on most production systems to set a redundancy factor of 2 or 3 when creating the container.
Object Class
The redundancy factor on the container determines the default object class of objects (files and directories) created in that container. The object class (oclass) controls the redundancy and sharding of the object and allows the client to do algorithmic placement when accessing the object.
The object class name describes how the data is placed and protected: the redundancy type and the sharding/group count of the objects.
DAOS offers several data protection schemes: sharding/striping (S*) with no redundancy (RF0), N-way replication (RP_*), and erasure coding (EC_*). RP and EC are available to provide single (RF1) or double (RF2) redundancy, and replication supports higher levels if desired (RF3/4/5, etc.). EC schemes can use 2, 4, 8, or 16 data stripes per full stripe write, with 1, 2, or 3 parity blocks.
DAOS will place objects in a single stripe on one target (S1, G1) or striped across all available targets (SX, GX).
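As an illustration of the naming scheme just described, the small parser below decodes an oclass name into its redundancy type, shard counts, and group count (purely illustrative; no such DAOS API exists, and the field names are invented for the example):

```python
import re

def parse_oclass(name):
    """Decode an object class name per the S* / RP_* / EC_* scheme above."""
    # S<n|X>: sharded/striped, no redundancy
    if m := re.fullmatch(r"S(\d+|X)", name):
        return {"type": "sharded", "groups": m.group(1)}
    # RP_<replicas>G<n|X>: n-way replication
    if m := re.fullmatch(r"RP_(\d+)G(\d+|X)", name):
        return {"type": "replication", "replicas": int(m.group(1)),
                "groups": m.group(2)}
    # EC_<data>P<parity>G<n|X>: erasure coding
    if m := re.fullmatch(r"EC_(\d+)P(\d+)G(\d+|X)", name):
        return {"type": "erasure code", "data": int(m.group(1)),
                "parity": int(m.group(2)), "groups": m.group(3)}
    raise ValueError(f"unrecognized object class: {name}")

print(parse_oclass("EC_16P2GX"))
# {'type': 'erasure code', 'data': 16, 'parity': 2, 'groups': 'X'}
print(parse_oclass("RP_3G1"))
# {'type': 'replication', 'replicas': 3, 'groups': '1'}
```

A group count of "X" means the object is striped across all available targets; "1" pins it to a single target.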
If not set by the user, DAOS automatically selects an object class (oclass) for files and directories using replication for directories and EC for files:
Files: Widely striped, EC oclass (SX, EC_xP1GX, EC_xP2GX, etc.)
Directories: Single stripe, Replicated oclass (S1, RP_2G1, RP_3G1, etc.)
The following table shows an example of default object class settings depending on the redundancy factor (RF) of the container in a large DAOS system (20+ servers):
     | RF0 | RF1       | RF2       | RF3       |
File | SX  | EC_16P1GX | EC_16P2GX | EC_16P3GX |
Dir  | S1  | RP_2G1    | RP_3G1    | RP_4G1    |
On smaller systems, the EC data shard count is smaller, depending on how many fault domains are available (for example, EC_8Pn, EC_4Pn, EC_2Pn).
The default object class for files and directories can be modified on container creation with the --file-oclass and --dir-oclass settings, respectively. But once set on a container, these default object classes cannot be changed at the container level. If desired, objects inside the container can be created with an object class other than the container default.
The default setting for object class on the container works well when files are large (>= GBs) and directories are relatively lean (< 10k entries).
For example, if one uses a single stripe (S1) object class for a large file, the bandwidth and space of accessing that file is limited to a single target on a single DAOS server. On the other hand, if one uses a widely striped (SX) object class, the bandwidth and space of that file scales across the entire set of targets in the pool. One of the disadvantages of using widely striped files, however, is that some metadata operations on that file, like querying the file size or removing the file, will be more expensive than single striped files since those operations involve a fan out of RPCs to every target in the system.
When applications create a large number of small files, as is common in the AI space (e.g., ResNet-50, MLPerf) or in HPC file-per-process workloads, the default file object class will actually be inefficient if the application does a stat() of every file. Changing the object class to something small-striped like S32/S16/S1 (or the equivalent G variants) is the most efficient choice in that case.
The default directory object class is usually fine for most applications we have seen and does not need to be changed. However, for some benchmarks like mdtest, where operations such as file creates, removes, and stats are measured, changing the object class of a directory to be widely striped is needed to achieve maximum IOPS.
To summarize the sharding settings from above:
Widely striped objects (SX, RP_*GX, or EC_*GX):
Small stripe objects (S1/2/4/16/32, RP_*G1/2/4/16/32, or EC_*G1/2/4/16/32):
As explained earlier, defaults can be set on a container, but the object class can be changed for each object in the container. See the oclass Recommendation section for more details.
Redundancy Types
DAOS achieves data redundancy with replication or erasure coding. With replication, the client writes multiple copies striped across different targets in a pool; with erasure coding, the client generates parity blocks that are written to server targets in addition to the data blocks.
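At equal protection levels, the two schemes have very different capacity costs, which is worth quantifying. A back-of-the-envelope sketch (raw capacity overhead only; it ignores metadata, rebuild cost, and partial-stripe effects):

```python
def replication_overhead(copies):
    """n-way replication stores n full copies of the data."""
    return float(copies)

def ec_overhead(data, parity):
    """EC stores the data once plus 'parity' extra cells per 'data' cells."""
    return (data + parity) / data

# Both of these survive 2 concurrent failures (rd_fac:2):
print(replication_overhead(3))  # RP_3G1  -> 3.0x raw capacity
print(ec_overhead(16, 2))       # EC_16P2 -> 1.125x raw capacity
```

This is why EC is attractive for large files: the same protection level costs a fraction of the space, at the price of the full/partial-stripe considerations discussed below.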
The performance characteristics and recommendations for each redundancy type are as follows:
Erasure Code:
Replication:
See the Erasure Coding Settings section on how to properly set the EC cell size when using POSIX containers to achieve the best performance with full-stripe IO.
oclass Recommendations
This table shows a recommendation for file object class settings:
Note that container hints are provided to help more with these settings as described here:
https://github.com/daos-stack/daos/tree/master/src/client/dfs#container-hints
As discussed before, the object class default settings can be set when creating a POSIX container. Each file or directory, when being created, can change that object class for the underlying DAOS object being created. Note that once a file or directory is created, one cannot change the object class of that object since in DAOS the object class of an existing object is immutable. The object class for objects that are being created can be changed from the container default in one of the following ways:
Using the DFS API: dfs_open(O_CREAT) and similar calls for creating files or directories take a parameter for the object class of the object being created, which can be set by the user. This of course works when the application uses the DFS API directly, but it is not possible with the POSIX API/dfuse or MPI-IO.
With MPI-IO, one can pass an MPI Info hint to the MPI_File_open call to set the oclass: romio_daos_obj_class.
When using dfuse on the command line or through the POSIX API, there is no programmatic way to pass a hint, since the POSIX API has no way to pass that information to the dfuse driver. One can, however, change the default object class for objects newly created under a directory (all sub-files and sub-directories, but not the directory itself) using the daos tool: daos fs set-attr --path=/mnt/dfuse/d1 --oclass=SX. In this case, d1 must already exist in the container mounted at the dfuse mountpoint, and anything created under d1 will then have object class SX. At this time, there is no option to set the file and directory object classes separately.
Similarly, one can create a new file with an oclass override with:
daos fs set-attr --path=/mnt/dfuse/f1 --oclass=S1
In this case, f1 shall not yet exist in the container and will be created as a new file with object class S1.
Erasure Coding Settings
DAOS Erasure Code (EC) settings can be complicated to set properly to achieve the best performance. The following describes some of the parameters that can be tuned to maximize performance when using EC oclasses.
EC Cell Size
DAOS provides different sizes of EC data (1, 2, 4, 8, 16) and parity (1, 2, 3) settings and allows modifying the EC cell size at the container level. The EC cell size controls the fragment of data that DAOS splits the application buffer into. If the application buffer is a multiple of the EC cell size property, then the IO is considered a full stripe write and provides optimal performance. In contrast, partial stripe IO adds more overhead as it requires more processing at the DAOS level.
Chunk Size
DFS is built on top of the DAOS object layer that implements this EC mechanism. DFS adds an additional concept of chunking, unrelated to EC or redundancy, that describes how a DFS file is striped over multiple DAOS keys. The DFS chunk size (1 MiB by default) can be changed when creating the container (--chunk-size) and/or per file at creation time. Though the DFS chunk size adds complexity when combined with the EC cell size, it is another knob that can be tuned for application performance.
Guidance for the DFS chunk size and EC cell size settings is summarized here: https://docs.daos.io/v2.6/user/container/#erasure-code.
The following table shows how those settings determine whether DAOS performs full or partial stripe updates (the table assumes that the application IO size is a multiple of the DFS chunk size):
EC Object Class | EC Cell Size | DFS Chunk Size   | Full or Partial Write?
16P2            | 128k         | 1m               | Partial: 1m divides into 8 128k cells, fewer than the 16 data shards of the object class
8P2             | 128k         | 1m               | Full
16P2            | 128k         | 2m, 4m, 8m, etc. | Full
16P2            | 256k         | 2m               | Partial
16P2            | 256k         | 4m, 8m, etc.     | Full
On most systems, it is recommended to use 128k EC cell size. As you can tell from the table above, using a minimum of 2m (4m is better to increase the inflight IO depth) for the DFS chunk size (and application IO size) is needed to do full stripe updates and achieve the best performance. Otherwise, the overhead will be the same as using replication.
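The full-versus-partial behavior in the table above reduces to a simple rule: a write is a full-stripe update when the IO size covers a whole EC stripe, i.e. the EC cell size times the number of data shards. A sketch of that rule (an assumption about the model based on the table, not DAOS code):

```python
KiB, MiB = 1024, 1024 * 1024

def is_full_stripe(io_size, cell_size, data_shards):
    """A write is a full-stripe update when it covers whole EC stripes."""
    stripe = cell_size * data_shards
    return io_size >= stripe and io_size % stripe == 0

# Reproducing rows of the table (IO size = DFS chunk size):
print(is_full_stripe(1 * MiB, 128 * KiB, 16))  # 16P2, 1m chunk -> False (partial)
print(is_full_stripe(1 * MiB, 128 * KiB, 8))   # 8P2,  1m chunk -> True  (full)
print(is_full_stripe(2 * MiB, 256 * KiB, 16))  # 16P2, 2m chunk -> False (partial)
print(is_full_stripe(4 * MiB, 256 * KiB, 16))  # 16P2, 4m chunk -> True  (full)
```

With a 128k cell size and 16 data shards, the full stripe is 2m, which is why a chunk size of at least 2m (4m preferred) is needed for full-stripe updates.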
Checksum Settings
End-to-end data integrity is a DAOS feature that uses checksums to protect against silent data corruption. Corruption can happen over the network and, in some cases, depending on the MR cache monitor being used. The performance overhead of enabling this feature is usually minimal.
It is recommended to turn on checksums when creating containers by adding the property as follows to container creates: --properties=cksum:crc32,srv_cksum:on
This will prevent the possibility of silent data corruption without much performance overhead.
Example Scenarios
Depending on the application use case and access pattern, here are some recommended options and properties to apply to containers at creation time or to output directories.
Applications accessing large files:
daos cont create pool cont --type=POSIX --dir-oclass=RP_3G1 --file-oclass=EC_16P2GX --properties=rd_fac:2,cksum:crc32,srv_cksum:on,ec_cell_sz:128kib --chunk-size=4mib
Applications accessing tiny files:
daos cont create pool cont --type=POSIX --dir-oclass=RP_3G1 --file-oclass=RP_3G1 --properties=rd_fac:2,cksum:crc32,srv_cksum:on,ec_cell_sz:128kib --chunk-size=4mib
Application with mixed patterns:
In that case, users should try to split the large and small files into separate directories and apply different settings to those directories:
daos cont create pool cont --type=POSIX --dir-oclass=RP_3G1 --file-oclass=EC_16P2GX --properties=rd_fac:2,cksum:crc32,srv_cksum:on,ec_cell_sz:128kib --chunk-size=4mib
mkdir /mnt/dfuse/dir_with_large_files
mkdir /mnt/dfuse/dir_with_tiny_files
daos fs set-attr --path=/mnt/dfuse/dir_with_tiny_files --oclass=RP_3G1
If the application cannot write files into separate directories, the advice is to give priority to the large-file settings, which will be inefficient for small files, so some performance hit is expected in that case.