S3

The goal of this project is to create an S3 layer to access a DAOS backend.
libds3 is a library shipped with DAOS that implements the basic methods needed to support the S3 API over a DAOS pool.

libds3 is purposely implemented over libdfs to maintain compatibility with the POSIX namespace (the lingua franca).
The intent is to be able to access a POSIX container concurrently as an S3 bucket and as a filesystem.

libds3 is intended as a building block for more complex HTTP-based S3 services.
It is currently integrated with the Ceph RADOS Gateway (RGW); see https://github.com/ceph/ceph/tree/main/src/rgw/store/daos for more details.

DAOS S3 Mapping

S3 Bucket = DAOS Container

Bucket metadata can be stored in container attributes (https://github.com/daos-stack/daos/blob/master/src/include/daos_cont.h#L436)

S3 buckets are mapped to DAOS containers. The DAOS containers need to be marked as POSIX
compliant to maintain interoperability with DFS. This is done by setting the layout type property of the
DAOS container to DAOS_PROP_CO_LAYOUT_POSIX. libdfs provides a method
(dfs_cont_create_with_label) that creates a DAOS container and initializes it to support DFS, and
this DGW implementation uses that method.
Bucket information (metadata, number of objects, size, etc.) will be added as DAOS container attributes
(using daos_cont_set_attr). The data will be encoded by DGW and stored in one value (WIP name in
the current implementation is "rgw_info").
Listing buckets within a pool uses daos_pool_list_cont to list DAOS containers, returning only the
containers that have labels.
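
As an illustration, a minimal sketch of creating a bucket this way might look like the following; error handling and cleanup on partial failure are trimmed, the helper name create_bucket is hypothetical, and the encoded buffer stands in for the DGW-encoded "rgw_info" value:

```c
#include <daos.h>
#include <daos_fs.h>

/* Sketch: create an S3 bucket as a POSIX (DFS) container and attach the
 * DGW-encoded bucket info as a container attribute. */
int
create_bucket(daos_handle_t poh, const char *bucket_name,
              const void *encoded_info, size_t info_len)
{
    daos_handle_t coh;
    dfs_t *dfs;
    int rc;

    /* Creates a container labeled with the bucket name, sets
     * DAOS_PROP_CO_LAYOUT_POSIX, and initializes it for DFS. */
    rc = dfs_cont_create_with_label(poh, bucket_name, NULL /* attr */,
                                    NULL /* uuid */, &coh, &dfs);
    if (rc)
        return rc;

    /* Store the DGW-encoded bucket info in a single container
     * attribute; "rgw_info" is the WIP name mentioned above. */
    char const *const names[]  = { "rgw_info" };
    void const *const values[] = { encoded_info };
    size_t const      sizes[]  = { info_len };

    rc = daos_cont_set_attr(coh, 1, names, values, sizes, NULL /* ev */);

    dfs_umount(dfs);
    daos_cont_close(coh, NULL /* ev */);
    return rc;
}
```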

S3 Object = DAOS FS File (DFS)

Object metadata can be stored in extended attributes (https://github.com/daos-stack/daos/blob/master/src/include/daos_fs.h)

S3 objects map directly to DFS files, so we use dfs_open or dfs_lookup API calls. The RGW object
metadata is stored inside the extended attributes of the files, using dfs_setxattr and
dfs_getxattr. Slashes (/) in keys are translated into directories (using dfs_mkdir) to keep the layout
compatible with the POSIX API. If a different delimiter is specified in the S3 list-objects operation, then
we list all the objects and divide the results by that delimiter. Since this implementation uses the object’s
extended attributes to store metadata, maintaining a separate bucket index should not be needed. This
decision might be revisited in the future.
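
For illustration, storing and reading back the encoded object metadata could look like the sketch below; the helper names and the xattr name "rgw_entry" used here are placeholders, not necessarily the names used by DGW:

```c
#include <daos.h>
#include <daos_fs.h>

/* Sketch: store and fetch DGW object metadata in an extended attribute
 * of the DFS file. "rgw_entry" is a placeholder xattr name. */
int
set_object_metadata(dfs_t *dfs, dfs_obj_t *obj,
                    const void *encoded_md, daos_size_t md_len)
{
    /* flags = 0: create the xattr or overwrite an existing value. */
    return dfs_setxattr(dfs, obj, "rgw_entry", encoded_md, md_len, 0);
}

int
get_object_metadata(dfs_t *dfs, dfs_obj_t *obj, void *buf, daos_size_t *len)
{
    /* On input *len is the buffer size; on success it is updated
     * with the actual size of the xattr value. */
    return dfs_getxattr(dfs, obj, "rgw_entry", buf, len);
}
```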

Basic Operations

When accessing objects, REST API requests arrive at DGW with three parameters: account information (access
key in S3), bucket or container name, and object name (or key). After authentication, DGW should look
up the user id from the given access keys and check that the user can access the bucket. Then, it looks
up the bucket within the pool using its label and stores the open DAOS container handle (from
daos_cont_open) in the bucket entity DaosBucket. Next, DFS is mounted using dfs_mount, and the
resulting DFS handle is also stored in the DaosBucket entity.
After the operation is done, all the handles created in this process should be closed in reverse order: the
DFS object handle, the DFS mount handle, then the container handle.
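
A minimal sketch of this open/close sequence, assuming the pool handle is already connected and with error handling trimmed, might look like this (the helper names are hypothetical):

```c
#include <daos.h>
#include <daos_fs.h>
#include <fcntl.h>

/* Sketch: open a bucket by its container label and mount DFS over it,
 * as DGW does when populating a DaosBucket entity. */
int
open_bucket(daos_handle_t poh, const char *bucket_label,
            daos_handle_t *coh, dfs_t **dfs)
{
    int rc;

    /* Look the container up by label and open it read-write. */
    rc = daos_cont_open(poh, bucket_label, DAOS_COO_RW, coh,
                        NULL /* info */, NULL /* ev */);
    if (rc)
        return rc;

    /* Mount DFS on the open container; DGW keeps this handle in the
     * DaosBucket entity for the lifetime of the request. */
    rc = dfs_mount(poh, *coh, O_RDWR, dfs);
    if (rc)
        daos_cont_close(*coh, NULL);
    return rc;
}

/* Handles are closed in reverse order: DFS object handles first (not
 * shown), then the DFS mount, then the container handle. */
void
close_bucket(daos_handle_t coh, dfs_t *dfs)
{
    dfs_umount(dfs);
    daos_cont_close(coh, NULL /* ev */);
}
```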

GET Operation

In the case of a GET operation, the object key is looked up within the directory structure using
dfs_lookup, which internally traverses each directory in the key (split by /) until it finds the file, and an
open DFS object handle is returned and stored in the DaosObject entity. The handle is then used to
read the object data using dfs_read. When the operation is done, the DFS object handle is released
using dfs_release.
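
A sketch of the GET path under these assumptions (the key is passed as the full path from the container root, e.g. "/a/b/key", the helper name is hypothetical, and error handling is trimmed):

```c
#include <daos.h>
#include <daos_fs.h>
#include <fcntl.h>

/* Sketch of the GET path: look up the key from the container root and
 * read a range of the object's data. */
int
get_object(dfs_t *dfs, const char *key_path, void *buf, daos_size_t len,
           daos_off_t off, daos_size_t *read_size)
{
    dfs_obj_t *obj;
    d_sg_list_t sgl;
    d_iov_t iov;
    int rc;

    /* key_path is the full path from the root, e.g. "/a/b/key";
     * dfs_lookup traverses each directory component internally. */
    rc = dfs_lookup(dfs, key_path, O_RDONLY, &obj, NULL, NULL);
    if (rc)
        return rc;

    d_iov_set(&iov, buf, len);
    sgl.sg_nr     = 1;
    sgl.sg_nr_out = 0;
    sgl.sg_iovs   = &iov;

    /* On return *read_size holds the number of bytes actually read. */
    rc = dfs_read(dfs, obj, &sgl, off, read_size, NULL /* ev */);

    dfs_release(obj);
    return rc;
}
```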

PUT Operation

In the case of a PUT operation, the object must be created along with all the directories in the object
key. To achieve this, the key is split by /; for each part of the key, a directory is looked up using
dfs_lookup_rel or created using dfs_mkdir if it does not exist, before moving on to the next part. For the
last part of the key, a DFS file is created using dfs_open and the open DFS file handle is stored in the
DaosObject entity. The handle is then used to write data to the object using dfs_write. When the
operation is done, the DFS object handle is released using dfs_release.
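
A sketch of the PUT path follows; the helper name is hypothetical, a NULL parent is assumed to denote the container root, and error handling is trimmed:

```c
#include <daos.h>
#include <daos_fs.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <string.h>
#include <sys/stat.h>

/* Sketch of the PUT path: walk the key one component at a time, creating
 * missing directories, then create the file and write the data. */
int
put_object(dfs_t *dfs, const char *key, const void *buf, daos_size_t len)
{
    char path[PATH_MAX];
    char *part, *next, *sptr;
    dfs_obj_t *parent = NULL;   /* NULL: container root */
    dfs_obj_t *file;
    d_sg_list_t sgl;
    d_iov_t iov;
    int rc;

    strncpy(path, key, sizeof(path) - 1);
    path[sizeof(path) - 1] = '\0';

    part = strtok_r(path, "/", &sptr);
    if (part == NULL)
        return EINVAL;

    /* Every component except the last is a directory. */
    while ((next = strtok_r(NULL, "/", &sptr)) != NULL) {
        dfs_obj_t *dir;

        rc = dfs_lookup_rel(dfs, parent, part, O_RDWR, &dir, NULL, NULL);
        if (rc == ENOENT) {
            /* Create the missing directory, then open it. */
            rc = dfs_mkdir(dfs, parent, part, 0755, 0 /* oclass */);
            if (rc == 0)
                rc = dfs_lookup_rel(dfs, parent, part, O_RDWR,
                                    &dir, NULL, NULL);
        }
        if (parent != NULL)
            dfs_release(parent);
        if (rc)
            return rc;
        parent = dir;
        part = next;
    }

    /* Last component: create (or open) the file and write the data. */
    rc = dfs_open(dfs, parent, part, S_IFREG | 0644, O_RDWR | O_CREAT,
                  0 /* oclass */, 0 /* chunk size */, NULL, &file);
    if (rc == 0) {
        d_iov_set(&iov, (void *)buf, len);
        sgl.sg_nr     = 1;
        sgl.sg_nr_out = 0;
        sgl.sg_iovs   = &iov;
        rc = dfs_write(dfs, file, &sgl, 0 /* offset */, NULL /* ev */);
        dfs_release(file);
    }
    if (parent != NULL)
        dfs_release(parent);
    return rc;
}
```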

DELETE Operation

In the case of a DELETE operation, the object’s parent directory is looked up within the directory
structure using dfs_lookup, and an open DFS directory handle is returned. The handle is then used to
remove the object using dfs_remove. When the operation is done, the DFS directory handle is released
using dfs_release.
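
A sketch of the DELETE path, splitting the key into its parent directory and final component (helper name hypothetical, error handling trimmed):

```c
#include <daos.h>
#include <daos_fs.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the DELETE path: open the parent directory of the key and
 * remove its last component. */
int
delete_object(dfs_t *dfs, const char *key)
{
    char dir_path[PATH_MAX], base_path[PATH_MAX];
    dfs_obj_t *parent;
    int rc;

    /* dirname()/basename() modify their argument, hence two copies. */
    snprintf(dir_path, sizeof(dir_path), "/%s", key);
    snprintf(base_path, sizeof(base_path), "/%s", key);

    rc = dfs_lookup(dfs, dirname(dir_path), O_RDWR, &parent, NULL, NULL);
    if (rc)
        return rc;

    /* force = false: removing a non-empty directory fails. */
    rc = dfs_remove(dfs, parent, basename(base_path), false, NULL /* oid */);

    dfs_release(parent);
    return rc;
}
```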

Users

In order to integrate with S3, DGW needs to store some user information and metadata. This includes
user id, name, email, access keys, and the access policies for the user. This information is shared across
the whole system, so it must be stored in a place accessible at the pool level. DAOS supports simple ACL
policies on containers and pools, but S3 requires more user metadata and the complicated S3 policies do
not map nicely to DAOS ACL policies. Users are thus stored in the metadata container, as described in the
next section.

The Metadata Container

A new hidden DAOS container called _METADATA is used to store all pool metadata. The user list and
other metadata can then be stored as DFS objects within it. The same method can be used to store
any other miscellaneous metadata we might need in the future.
S3 bucket names cannot contain underscores or capital letters
[https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html], so the
_METADATA bucket will not be accessible through the S3 interface. Additionally, a rule has been put in place to
prevent the bucket from being loaded.

_METADATA directory structure

  • /users/ contains all the user data as files, with the user id as the file name and the user data as
    the file contents (see the sketch after this list).

  • /emails/ contains soft link files. The name of each file is the email, while the file pointed to is
    the user data file in /users/.

  • /access_keys/ contains soft link files. The name of each file is the access key, while the file
    pointed to is the user data file in /users/.

  • /multipart/ contains multipart data for each bucket. See below for internal structure.
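
To illustrate the layout above, here is a sketch of adding a user entry together with its access-key link. It assumes the _METADATA container is already mounted as md_dfs and the top-level directories exist; the helper name is hypothetical, the /emails/ link would be created the same way, and error handling is trimmed:

```c
#include <daos.h>
#include <daos_fs.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/* Sketch: write a user entry under /users/<user_id> in the _METADATA
 * container and add an /access_keys/<access_key> symlink back to it. */
int
store_user(dfs_t *md_dfs, const char *user_id, const char *access_key,
           const void *encoded, daos_size_t len)
{
    dfs_obj_t *dir, *file, *link;
    char target[512];
    d_sg_list_t sgl;
    d_iov_t iov;
    int rc;

    /* /users/<user_id> holds the encoded user data. */
    rc = dfs_lookup(md_dfs, "/users", O_RDWR, &dir, NULL, NULL);
    if (rc)
        return rc;
    rc = dfs_open(md_dfs, dir, user_id, S_IFREG | 0644, O_RDWR | O_CREAT,
                  0, 0, NULL, &file);
    dfs_release(dir);
    if (rc)
        return rc;

    d_iov_set(&iov, (void *)encoded, len);
    sgl.sg_nr     = 1;
    sgl.sg_nr_out = 0;
    sgl.sg_iovs   = &iov;
    rc = dfs_write(md_dfs, file, &sgl, 0, NULL);
    dfs_release(file);
    if (rc)
        return rc;

    /* /access_keys/<access_key> is a soft link back to the user file. */
    snprintf(target, sizeof(target), "/users/%s", user_id);
    rc = dfs_lookup(md_dfs, "/access_keys", O_RDWR, &dir, NULL, NULL);
    if (rc)
        return rc;
    rc = dfs_open(md_dfs, dir, access_key, S_IFLNK | 0644,
                  O_RDWR | O_CREAT, 0, 0, target /* link value */, &link);
    if (rc == 0)
        dfs_release(link);
    dfs_release(dir);
    return rc;
}
```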

Multipart upload

For reference, an overview of multipart upload:

Multipart upload allows uploading a single object as a set of parts. Each part is a contiguous portion of
the object's data. The object parts can be uploaded independently and in any order. If transmission of
any part fails, it can be retransmitted without affecting other parts. After all parts of the object are
uploaded, the S3 layer assembles these parts and creates the object. The multipart upload is done in
three steps: initiating the upload, uploading the object parts, and completing the upload. Upon
completion, the object is constructed from the parts and the object is accessible like any other object in
the bucket.

Design summary:

  • Multipart upload initiation

    • Generate a unique multipart_upload_id that identifies the upload

    • Create the directory upload_dir =
      multipart/<bucket_name>/<multipart_upload_id> in the _METADATA bucket

    • Add any metadata and attributes related to the upload in the rgw_entry xattr of the
      upload_dir directory

  • Parts upload

    • Write the data of each part to an object under <upload_dir>/<part_id>

    • Metadata about each part is stored in the xattr rgw_part of the part object

  • Multipart upload completion (see the copy sketch after this list)

    • Create the object under the correct key and bucket, creating any missing directories in
      the process

    • Copy data from each part to the created object ordered by part_id

    • Move the metadata and attributes from the upload directory to the object

    • Delete the upload_dir directory created during initiation, including the parts.
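
A sketch of the copy performed at completion, assuming the part objects under upload_dir have already been opened in part_id order and the destination file has been created; the buffer size and helper name are illustrative, and error handling is trimmed:

```c
#include <daos.h>
#include <daos_fs.h>
#include <errno.h>
#include <stdlib.h>

/* Sketch: assemble the final object by copying each part, ordered by
 * part_id, from the upload directory into the destination file. */
#define COPY_BUF_SIZE (1U << 20)    /* 1 MiB copy buffer */

int
copy_parts(dfs_t *md_dfs, dfs_obj_t **parts, int nparts,
           dfs_t *dst_dfs, dfs_obj_t *dst_obj)
{
    char *buf = malloc(COPY_BUF_SIZE);
    daos_off_t dst_off = 0;
    int rc = 0;

    if (buf == NULL)
        return ENOMEM;

    for (int i = 0; i < nparts && rc == 0; i++) {
        daos_off_t part_off = 0;

        for (;;) {
            d_sg_list_t sgl;
            d_iov_t iov;
            daos_size_t got = 0;

            d_iov_set(&iov, buf, COPY_BUF_SIZE);
            sgl.sg_nr     = 1;
            sgl.sg_nr_out = 0;
            sgl.sg_iovs   = &iov;

            rc = dfs_read(md_dfs, parts[i], &sgl, part_off, &got, NULL);
            if (rc || got == 0)   /* error or end of this part */
                break;

            /* Append the bytes actually read to the final object. */
            d_iov_set(&iov, buf, got);
            rc = dfs_write(dst_dfs, dst_obj, &sgl, dst_off, NULL);
            if (rc)
                break;
            part_off += got;
            dst_off  += got;
        }
    }
    free(buf);
    return rc;
}
```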

Versioning

Versioning is supported mostly out of the box by encoding the version id in the object name. We maintain
a symbolic link with the suffix [latest] that points to the latest version of the object. This allows easier
access to the latest version when reading the file or updating which version is the latest.
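
A sketch of maintaining the [latest] link follows; the "<key>.<version_id>" and "<key>.[latest]" naming used here is an assumption for illustration and may differ from the actual DGW format, and error handling is trimmed:

```c
#include <daos.h>
#include <daos_fs.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Sketch: repoint "<key>.[latest]" at a specific object version by
 * replacing the symbolic link. */
int
mark_latest(dfs_t *dfs, dfs_obj_t *parent, const char *key,
            const char *version_id)
{
    char link_name[512], target[512];
    dfs_obj_t *link;
    int rc;

    snprintf(link_name, sizeof(link_name), "%s.[latest]", key);
    snprintf(target, sizeof(target), "%s.%s", key, version_id);

    /* Drop any existing [latest] link; ignore "not found". */
    rc = dfs_remove(dfs, parent, link_name, false, NULL);
    if (rc != 0 && rc != ENOENT)
        return rc;

    /* Create a fresh symlink whose value is the versioned name. */
    rc = dfs_open(dfs, parent, link_name, S_IFLNK | 0644,
                  O_RDWR | O_CREAT | O_EXCL, 0, 0, target, &link);
    if (rc == 0)
        dfs_release(link);
    return rc;
}
```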

TODO: Define the semantics of deleting an object when versioning is turned on. Do we delete all versions? Just the latest?

Challenges

Ordering

The DAOS object API does not support listing dkeys in order, so they are returned in arbitrary order. This
is a challenge when designing the list operation because we need to be able to list the keys in order.

Multipart Objects

The current design for multipart objects uploads the parts to the metadata bucket and then copies them into
the final object. This might cause some performance degradation in the complete-multipart operation. A
structure where an object can have multiple parts would be preferable.

Performance

The current design opens many object handles to reach an object; this is resolved by caching. Listing
files requires opening each file; this could be resolved by accessing lower layers.

Testing

Basic unit tests for the libds3 library must be developed following the same model as the libdfs tests.

We should go through the API and add test cases to verify the behavior and consistency of all libds3 API functions.
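
For illustration, a cmocka-style skeleton for the first bucket test in the table below might look like this. The ds3_connect/ds3_disconnect signatures and the daos_s3.h header name are assumptions based on the test descriptions and should be checked against the actual libds3 API:

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <setjmp.h>
#include <cmocka.h>

#include <daos.h>
#include <daos_s3.h>    /* assumed header name for the libds3 API */

/* Skeleton for DS3_BUCKET_TEST1. The ds3_* signatures below are
 * assumptions drawn from the test descriptions, not a verified API. */
static const char *pool_label = "ds3_test_pool";    /* pre-created pool */

static void
ds3_bucket_test1(void **state)
{
    ds3_t *ds3 = NULL;
    int rc;

    (void)state;

    rc = ds3_connect(pool_label, NULL /* sys */, &ds3, NULL /* event */);
    assert_int_equal(rc, 0);

    /* TODO: list buckets, create 10 buckets, list again, destroy all
     * buckets, then list again and expect an empty result. */

    rc = ds3_disconnect(ds3, NULL /* event */);
    assert_int_equal(rc, 0);
}

int
main(void)
{
    const struct CMUnitTest tests[] = {
        cmocka_unit_test(ds3_bucket_test1),
    };
    int rc;

    daos_init();
    rc = cmocka_run_group_tests(tests, NULL, NULL);
    daos_fini();
    return rc;
}
```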

| # | Component | Test Name | Description |
|---|-----------|-----------|-------------|
| 1 | Bucket | DS3_BUCKET_TEST1 | Create a pool, ds3_connect, list buckets, create 10 buckets, list buckets, destroy all buckets, list buckets, ds3_disconnect, destroy pool |
| 2 | | DS3_BUCKET_LIST_OBJ | Create pool, connect, create bucket, create 50 objects with random names and prefixes, verify ds3_bucket_list_obj lists objects in lexical order |
| 3 | | | ds3_disconnect, destroy pool |
| 4 | User | DS3_USER_ERROR | Create pool, ds3_connect, verify creating a user name and/or email address longer than DS3_MAX_USER_NAME fails |
| 5 | | DS3_USER_GET | Create multiple users, get each user by name, email, and key |
| 6 | | DS3_USER_UPDATE | ds3_user_set to change email addresses and access IDs. Verify gets of previous email addresses and keys fail and gets of changed email addresses and keys succeed |
| 7 | | DS3_USER_REMOVE | ds3_user_remove of all users succeeds, verify getting each by name, email, and key fails, verify a repeated ds3_user_remove of all users fails |
| 8 | | DS3_USER_RECREATE | Create user, get user by name, email, and key, attempt to re-create the same user(s) and check the expected error |
| 9 | | | ds3_disconnect, destroy pool |
| 10 | Object | DS3_OBJ_CREATE | Create pool, connect, create bucket, create multiple objects with 0-3 delimiters in the prefix and random encoded data |
| 11 | | DS3_OBJ_RW | Write random amounts of random data to each object and read back to verify |
| 12 | | DS3_OBJ_INFO | ds3_obj_get_info to verify encoded data, ds3_obj_set_info to change encoded data |
| 13 | | DS3_OBJ_DESTROY | ds3_obj_destroy all objects checking for success, destroy all objects again checking for error, verify ds3_bucket_list_obj returns an empty list |
| 14 | | DS3_OBJ_RECREATE | Create a previously destroyed object checking for success and an empty object, re-create the object checking for error |
| 15 | | | ds3_disconnect, destroy pool |
| 16 | Multipart | DS3_MULTI_REMOVE_ERR | Create pool, ds3_connect, verify ds3_upload_remove of a non-existent upload fails |
| 17 | | DS3_MULTI_INFO_ERR | Verify ds3_upload_get_info of a non-existent bucket or upload fails |
| 18 | | DS3_MULTI_LIST_PARTS | Create a multipart object with a random number of parts (2-32), part sizes, and part information. Verify ds3_upload_list_parts lists all parts and the part information matches the values set via ds3_part_set_info. Verify the data in each part matches the expected data. |
| 19 | | DS3_MULTI_LIST_MARKER | Issue ds3_upload_list_parts with a marker of zero and a number of parts less than created. Verify only parts less than that value and greater than the marker are returned, the marker points to the next expected part (number + 1), and is_truncated is true. Repeat the process to return all parts (using the value of the marker returned by previous calls). The last repetition should result in is_truncated being false. |
| 20 | | DS3_MULTI_CREATE | Create more multipart objects with random numbers of parts, part sizes, and part information. Verify ds3_bucket_list_multipart lists all multipart objects. For each object, verify ds3_upload_list_parts lists all parts and the part information matches the values set via ds3_part_set_info. Verify the data in each part matches the expected data. |
| 21 | | DS3_MULTI_REMOVE | Verify removing each previously created multipart object succeeds. Verify removing each previously created multipart object a second time fails. |
| 22 | Versioning | DS3_VERSION_MARK | Create pool, connect, create bucket, and create multiple versions of an object, each with different data. Mark one as the latest and verify *.[latest] contains the expected data |
| 23 | | DS3_VERSION_LIST | Verify ds3_bucket_list_obj does not return object versions when the list_versions parameter is false but does when the parameter is true |
| 24 | | | ds3_disconnect, destroy pool |