DAOS S3 Mapping

Use libdfs to maintain compatibility with the POSIX namespace (lingua franca).

S3 Bucket = DAOS Container

Bucket metadata can be stored in container attributes (https://github.com/daos-stack/daos/blob/master/src/include/daos_cont.h#L436)

S3 buckets are mapped to DAOS containers. The DAOS containers need to be marked as POSIX
compliant to maintain interoperability with DFS. This is done by setting the layout type property of the
DAOS container to DAOS_PROP_CO_LAYOUT_POSIX. libdfs provides a method
(dfs_cont_create_with_label) that creates a DAOS container and initializes it to support DFS, and
this DGW implementation uses that method.
Bucket information (metadata, number of objects, size, etc.) will be added as DAOS container attributes
(using daos_cont_set_attr). The data will be encoded by DGW and stored in a single value (the working
name in the current implementation is "rgw_info").
Listing buckets within a pool uses daos_pool_list_cont to list the DAOS containers, returning only the
containers that have labels.

S3 Object = DAOS FS File (DFS)

Object metadata can be stored in extended attributes (https://github.com/daos-stack/daos/blob/master/src/include/daos_fs.h)

S3 objects map directly to DFS files, so we use the dfs_open or dfs_lookup API calls. The RGW object
metadata is stored inside the extended attributes of the files, using dfs_setxattr and
dfs_getxattr. Slashes (/) in keys are translated into directories (using dfs_mkdir) to make them
compatible with the POSIX API. If a different delimiter is specified in the S3 list-objects operation, then
we list all the objects and divide the results by that delimiter. Since this implementation uses the object’s
extended attributes to store metadata, maintaining a separate bucket index should not be needed. This
decision might be revisited in the future.
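
The delimiter handling described above can be illustrated with a small sketch. It assumes the full key
list has already been collected, and it rolls keys up into "common prefixes" the way S3 list-objects
responses do. The function name and return shape are hypothetical and not taken from the DGW code.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Split keys into direct contents and rolled-up common prefixes."""
    contents = []
    common_prefixes = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # An empty delimiter means "no grouping": every key is listed as-is.
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            contents.append(key)
        else:
            # Everything up to and including the first delimiter is rolled up.
            common_prefixes.add(prefix + rest[:cut + len(delimiter)])
    return contents, sorted(common_prefixes)


if __name__ == "__main__":
    keys = ["a/b/c.txt", "a/d.txt", "e.txt", "a-b/f.txt"]
    print(list_with_delimiter(keys, delimiter="/"))
    print(list_with_delimiter(keys, delimiter="-"))
```

With the default / delimiter the grouping matches the directory layout. With any other delimiter the
grouping has to be computed from the complete key list, as the sketch does.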

Basic Operations

When accessing objects, REST API requests arrive at DGW with three parameters: account information
(access key in S3), bucket or container name, and object name (or key). After authentication, DGW should
look up the user id from the given access keys and check that the user can access the bucket. Then, it
looks up the bucket within the pool using its label and stores the open DAOS container handle (from
daos_cont_open) in the bucket entity DaosBucket. Next, DFS is mounted using dfs_mount, and the
resulting DFS handle is also stored in the DaosBucket entity.
After the operation is done, all the handles created in this process should be closed in reverse order: the
DFS object handle, the DFS mount handle, then the container handle.
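
The reverse-order cleanup can be sketched as follows. This is an illustration only: the handle names are
plain strings standing in for the DAOS handles, and no DAOS call is made. It shows how a LIFO cleanup
stack yields the closing order described above, including when the operation fails midway.

```python
from contextlib import ExitStack


def run_operation(log):
    """Open three stand-in handles, do the work, then release them in reverse order."""
    with ExitStack() as stack:
        for name in ("container", "dfs mount", "dfs object"):
            log.append(f"open {name}")
            # Callbacks registered on an ExitStack run last-in, first-out.
            stack.callback(log.append, f"close {name}")
        log.append("do work")
    return log


if __name__ == "__main__":
    for line in run_operation([]):
        print(line)
```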

GET Operation

For a GET operation, the object key is looked up within the directory structure using
dfs_lookup, which internally traverses each directory in the key (split by /) until it finds the file. An
open DFS object handle is returned and stored in the DaosObject entity. The handle is then used to
read the object data using dfs_read. When the operation is done, the DFS object handle is released
using dfs_release.

PUT Operation

For a PUT operation, the object must be created along with all the directories in the object
key. To achieve this, the key is split by /. For each part of the key except the last, a directory is looked
up using dfs_lookup_rel, or created using dfs_mkdir if it does not exist, before moving on to the next
part. For the last part of the key, a DFS file is created using dfs_open and the open DFS file handle is
stored in the DaosObject entity. The handle is then used to write data to the object using dfs_write.
When the operation is done, the DFS object handle is released using dfs_release.
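
The key-splitting walk for PUT, and the matching traversal for GET, can be sketched against an in-memory
tree. This is a simplified model under stated assumptions: nested dicts stand in for DFS directories,
bytes values stand in for DFS files, and the function names are hypothetical. The comments note which
DFS call each step corresponds to.

```python
def put_object(root, key, data):
    """Create the intermediate directories for `key`, then store `data` as the file."""
    *dirs, name = key.split("/")
    node = root
    for part in dirs:
        child = node.get(part)            # corresponds to dfs_lookup_rel
        if child is None:
            child = node[part] = {}       # corresponds to dfs_mkdir
        elif not isinstance(child, dict):
            raise NotADirectoryError(part)
        node = child
    if isinstance(node.get(name), dict):
        raise IsADirectoryError(name)
    node[name] = bytes(data)              # corresponds to dfs_open + dfs_write


def get_object(root, key):
    """Walk each part of `key` and return the file contents."""
    node = root
    for part in key.split("/"):
        if not isinstance(node, dict) or part not in node:
            raise FileNotFoundError(key)
        node = node[part]
    if isinstance(node, dict):
        raise IsADirectoryError(key)
    return node


if __name__ == "__main__":
    tree = {}
    put_object(tree, "photos/2023/cat.jpg", b"\xff\xd8")
    print(tree)
    print(get_object(tree, "photos/2023/cat.jpg"))
```

In this model a name cannot be both a file and a directory, so writing the key a/b after a/b/c.txt is
rejected by the sketch.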

DELETE Operation

For a DELETE operation, the object’s parent directory is looked up within the directory
structure using dfs_lookup, and an open DFS directory handle is returned. The handle is then used to
remove the object using dfs_remove. When the operation is done, the DFS directory handle is released
using dfs_release.
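
The DELETE flow splits the key into a parent path and a leaf name. A minimal sketch follows, using nested
dicts as stand-ins for DFS directories and a hypothetical function name. Whether empty parent
directories are pruned afterwards is not stated in this design, so the sketch leaves them in place.

```python
def delete_object(root, key):
    """Look up the parent directory of `key`, then remove the leaf entry from it."""
    parent_path, _, name = key.rpartition("/")
    parent = root
    for part in (parent_path.split("/") if parent_path else []):
        parent = parent.get(part)         # corresponds to the dfs_lookup traversal
        if not isinstance(parent, dict):
            raise FileNotFoundError(key)
    if name not in parent:
        raise FileNotFoundError(key)
    if isinstance(parent[name], dict):
        raise IsADirectoryError(key)
    del parent[name]                      # corresponds to dfs_remove on the parent handle


if __name__ == "__main__":
    tree = {"a": {"b": {"c.txt": b"hi"}}, "top.txt": b"x"}
    delete_object(tree, "a/b/c.txt")
    print(tree)
```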


To integrate with S3, DGW needs to store some user information and metadata. This includes
user id, name, email, access keys, and the access policies for the user. This information is shared across
the whole system, so it must be stored in a place accessible at the pool level. DAOS supports simple ACL
policies on containers and pools, but S3 requires more user metadata, and the more complex S3 policies
do not map cleanly to DAOS ACL policies. Users are thus stored in the metadata container, as described
in the next section.

The Metadata Container

A new hidden DAOS container called _METADATA is used to store all pool-level metadata. The user list
and other metadata can then be stored as DFS objects within it. The same method could be used to store
other miscellaneous metadata we might need in the future in the same container.
S3 bucket names cannot contain underscores or capital letters
[https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html], so the
_METADATA bucket will not be accessible through the S3 interface. Additionally, a rule has been put in
place to prevent the bucket from being loaded.
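
The two safeguards can be sketched as a single check. The function below is hypothetical and is not the
rule used in the DGW code. Its pattern covers only the core S3 naming rules (3 to 63 characters;
lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit) and omits the
finer points such as adjacent dots and IP-address-like names.

```python
import re

METADATA_CONTAINER = "_METADATA"

# Core S3 bucket naming rules only; see the lead-in for what is omitted.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_loadable_bucket(label):
    """Return True if a container label may be exposed as an S3 bucket."""
    if label == METADATA_CONTAINER:
        return False                      # explicit rule: never load the metadata container
    return _BUCKET_NAME.fullmatch(label) is not None


if __name__ == "__main__":
    for label in ("my-bucket", "_METADATA", "MyBucket", "my_bucket"):
        print(label, is_loadable_bucket(label))
```

The explicit comparison is redundant with the pattern, since _METADATA already fails it, but it keeps
the container hidden even if the naming check is later relaxed.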
Figure: _METADATA directory structure

Multipart upload

Reference for multipart upload:
Multipart upload allows uploading a single object as a set of parts. Each part is a contiguous portion of
the object's data. The object parts could be uploaded independently and in any order. If transmission of
any part fails, it can be retransmitted without affecting other parts. After all parts of the object are
uploaded, the S3 layer assembles these parts and creates the object. The multipart upload is done in
three steps: initiating the upload, uploading the object parts, then completing the upload. Upon
completion, the object is constructed from the parts and is accessible like any other object in
the bucket.
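
The three steps can be modelled in a few lines. This is a sketch under stated assumptions: parts are kept
in memory, the class and method names are hypothetical, and completion simply concatenates parts in
ascending part-number order rather than validating a client-supplied part list as S3 does.

```python
class MultipartUpload:
    """In-memory model of initiate / upload part / complete."""

    def __init__(self):
        # Step 1: initiating the upload creates an empty set of parts.
        self.parts = {}

    def upload_part(self, part_number, data):
        # Step 2: parts may arrive in any order; re-sending a part replaces it.
        if part_number < 1:
            raise ValueError("part numbers start at 1")
        self.parts[part_number] = bytes(data)

    def complete(self):
        # Step 3: assemble the object from the parts in part-number order.
        return b"".join(self.parts[n] for n in sorted(self.parts))


if __name__ == "__main__":
    upload = MultipartUpload()
    upload.upload_part(2, b"world")
    upload.upload_part(1, b"hello ")
    print(upload.complete())
```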

Design summary:


Versioning is supported mostly out of the box by setting the version id in the object name. We maintain
a symbolic link, named with the suffix [latest], that points to the latest version of the object. This
allows easier access to the latest version of the object when reading the file or updating the latest version.

TODO: Define how to delete an object when versioning is turned on. Do we delete all versions? Just latest?
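
The naming scheme can be sketched as below. Only the [latest] suffix comes from this design. The
key[version-id] format for versioned names is an assumption made for illustration, as are the function
names, and two dicts stand in for the files and symbolic links of a directory.

```python
LATEST_SUFFIX = "[latest]"


def versioned_name(key, version_id):
    # Hypothetical format: the design only says the version id is part of the name.
    return f"{key}[{version_id}]"


def put_version(files, links, key, version_id, data):
    """Store a new version and repoint the [latest] link at it."""
    name = versioned_name(key, version_id)
    files[name] = data
    links[key + LATEST_SUFFIX] = name


def read_latest(files, links, key):
    """Follow the [latest] link to the newest version's data."""
    return files[links[key + LATEST_SUFFIX]]


if __name__ == "__main__":
    files, links = {}, {}
    put_version(files, links, "doc.txt", "v1", b"one")
    put_version(files, links, "doc.txt", "v2", b"two")
    print(read_latest(files, links, "doc.txt"))
```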



The DAOS object API does not support listing dkeys in order, so they are returned in arbitrary order.
This is a challenge when designing the list operation, because S3 listing requires keys to be returned in order.
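
One straightforward way to meet the ordering requirement is to collect the unordered keys and sort them
before paginating. This is a sketch of that fallback, not the design chosen here. The function name and
return shape are hypothetical, with marker and max_keys modelled on the S3 list-objects parameters.

```python
def list_page(unordered_keys, marker="", max_keys=1000):
    """Return one ordered page of keys that sort strictly after `marker`."""
    ordered = sorted(k for k in unordered_keys if k > marker)
    page = ordered[:max_keys]
    is_truncated = len(ordered) > max_keys
    # The last key of a truncated page is the marker for the next request.
    next_marker = page[-1] if is_truncated and page else ""
    return page, is_truncated, next_marker


if __name__ == "__main__":
    keys = {"b", "d", "a", "c"}
    page, truncated, marker = list_page(keys, max_keys=2)
    print(page, truncated, marker)
    print(list_page(keys, marker=marker, max_keys=2))
```

The cost is that every page request has to see all keys, which is why unordered enumeration is called out
as a challenge.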

Multipart Objects

The current design for multipart objects uploads the parts to the metadata bucket, then copies them
into the final object. This might cause some performance degradation on the complete multipart
operation. A structure where an object can have multiple parts is preferred.


The current design opens many object handles to reach an object; this is resolved by caching. Listing
files requires opening each file, which could be resolved by accessing lower layers.