The goal of this project is to create an S3 layer for accessing a DAOS backend. libds3 is a library shipped with DAOS that implements the basic methods needed to support the S3 API over a DAOS pool.
libds3 is purposely implemented over libdfs to maintain compatibility with the POSIX namespace (the lingua franca). The intent is to be able to access a POSIX container concurrently as an S3 bucket and as a filesystem.
S3 buckets are mapped to DAOS containers. The DAOS containers need to be marked as POSIX compliant to maintain interoperability with DFS. This is done by setting the layout type property of the DAOS container to DAOS_PROP_CO_LAYOUT_POSIX. libdfs provides a method (dfs_cont_create_with_label) that creates a DAOS container and initializes it to support DFS, and this DGW implementation uses that method. Bucket information (metadata, number of objects, size, etc.) will be added as DAOS container attributes (using daos_cont_set_attr). The data will be encoded by DGW and stored in one value (the WIP name in the current implementation is "rgw_info"). Listing buckets within a pool uses daos_pool_list_cont to list DAOS containers, returning only the containers that have labels.
S3 objects map directly to DFS files, so we use the dfs_open or dfs_lookup API calls. The RGW object metadata is stored inside the extended attributes of the files, using dfs_setxattr and dfs_getxattr. Slashes (/) in keys are translated into directories (using dfs_mkdir) to make them compatible with the POSIX API. If a different delimiter is specified in the S3 list-objects operation, we list all the objects and then group the results by that delimiter. Since this implementation uses the object’s extended attributes to store metadata, maintaining a separate bucket index should not be needed. This decision might be revisited in the future.
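The delimiter handling described above can be sketched as follows. This is a minimal illustration on an in-memory list of keys, not the libds3 implementation; the function name list_with_delimiter and its parameters are hypothetical.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Split a flat key listing into plain objects and common prefixes."""
    contents = []          # keys with no delimiter after the prefix
    common_prefixes = []   # one entry per group of keys sharing a prefix
    seen = set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            contents.append(key)
            continue
        # Everything up to and including the first delimiter forms the group.
        group = prefix + rest[:cut + len(delimiter)]
        if group not in seen:
            seen.add(group)
            common_prefixes.append(group)
    return contents, common_prefixes


contents, prefixes = list_with_delimiter(["a-1", "a-2", "b-1", "c"], delimiter="-")
print(contents)   # ['c']
print(prefixes)   # ['a-', 'b-']
```

Because every key has to be listed before it can be grouped, the cost of this approach grows with the total number of objects in the bucket rather than with the size of the result.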
When accessing objects, REST API requests arrive at DGW with three parameters: account information (the access key in S3), the bucket or container name, and the object name (or key). After authentication, DGW should look up the user id from the given access keys and check that the user can access the bucket. Then, it looks up the bucket within the pool using its label and stores the open DAOS container handle (from daos_cont_open) in the bucket entity DaosBucket. Next, DFS is mounted using dfs_mount, and the resulting DFS handle is also stored in the DaosBucket entity. After the operation is done, all the handles created in this process should be closed in reverse order: the DFS object handle, the DFS mount handle, then the container handle.
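The reverse-order cleanup can be illustrated with a small sketch. It does not call DAOS; the Handle class and run_operation function are hypothetical stand-ins that only record the order in which handles are opened and closed.

```python
from contextlib import ExitStack


class Handle:
    """Stand-in for a DAOS or DFS handle; records open and close events."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        log.append(f"open {name}")

    def close(self):
        self.log.append(f"close {self.name}")


def run_operation(log):
    # ExitStack runs its callbacks last-in, first-out, so handles are
    # closed in the reverse of the order in which they were opened.
    with ExitStack() as stack:
        for name in ("container", "dfs mount", "dfs object"):
            handle = Handle(name, log)
            stack.callback(handle.close)
        log.append("do work")
    return log


for event in run_operation([]):
    print(event)
```

Registering each close as soon as the handle is opened also means a failure partway through still releases the handles that were already acquired.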
In the case of a GET operation, the object key is looked up within the directory structure using dfs_lookup, which internally traverses each directory in the key (split by /) until it finds the file. An open DFS object handle is returned and stored in the DaosObject entity. The handle is then used to read the object data using dfs_read. When the operation is done, the DFS object handle is released using dfs_release.
In the case of a PUT operation, the object must be created along with all the directories in the object key. To achieve this, the key is split by /. For each part of the key except the last, a directory is looked up using dfs_lookup_rel, or created using dfs_mkdir if it does not exist, before moving on to the next part. For the last part of the key, a DFS file is created using dfs_open and the open DFS file handle is stored in the DaosObject entity. The handle is then used to write data to the object using dfs_write. When the operation is done, the DFS object handle is released using dfs_release.
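The lookup-or-create walk for PUT can be sketched on an in-memory tree, where nested dictionaries stand in for directories and bytes values stand in for files. The function put_object is hypothetical, and the sketch makes no attempt to handle empty key parts such as a leading or doubled slash.

```python
def put_object(root, key, data):
    """Create any missing directories in `key`, then store `data` under the last part."""
    *dirs, name = key.split("/")
    node = root
    for part in dirs:
        child = node.get(part)           # lookup step (dfs_lookup_rel in the text)
        if child is None:
            child = node[part] = {}      # create step (dfs_mkdir in the text)
        elif not isinstance(child, dict):
            raise NotADirectoryError(part)
        node = child
    if isinstance(node.get(name), dict):
        raise IsADirectoryError(name)
    node[name] = bytes(data)             # file create and write
    return root


root = {}
put_object(root, "photos/2024/cat.jpg", b"meow")
put_object(root, "photos/readme.txt", b"hi")
print(root)
# {'photos': {'2024': {'cat.jpg': b'meow'}, 'readme.txt': b'hi'}}
```

The type checks show one consequence of mapping keys to a POSIX tree: a key such as photos/readme.txt/extra cannot coexist with the object photos/readme.txt, because the same name would have to be both a file and a directory.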
In the case of a DELETE operation, the object’s parent directory is looked up within the directory structure using dfs_lookup, and an open DFS directory handle is returned. The handle is then used to remove the object using dfs_remove. When the operation is done, the DFS directory handle is released using dfs_release.
In order to integrate with S3, DGW needs to store some user information and metadata. This includes the user id, name, email, access keys, and the access policies for the user. This information is shared across the whole system, so it must be stored in a place accessible at the pool level. DAOS supports simple ACL policies on containers and pools, but S3 requires more user metadata, and the more complex S3 policies do not map nicely to DAOS ACL policies. Users are thus stored in the metadata container, as described in the next section.
The Metadata Container
A new hidden DAOS container called _METADATA is used to store all pool metadata. The user list and other metadata could then be stored as DFS objects within it. The same method could be used to store other miscellaneous metadata we might need in the future in the same bucket. S3 bucket names cannot contain underscores or capital letters [https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html], so the _METADATA bucket will not be accessible to the S3 interface. Additionally, a rule has been set in place to prevent the bucket from being loaded.
_METADATA directory structure
/users/ contains all the user data as files, with the user id as the file name and the user data as the file contents.
/emails/ contains soft link files. The name of each file is the email, while the file pointed to is the user data file in /users/.
/access_keys/ contains soft link files. The name of each file is the access key, while the file pointed to is the user data file in /users/.
/multipart/ contains multipart data for each bucket. See below for internal structure.
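The soft-link index in /emails/ and /access_keys/ can be sketched on a local POSIX filesystem. This is only an illustration of the layout: the helper names add_user and get_user are hypothetical, and JSON is an assumed encoding for the user record, since the text does not specify how the data is encoded.

```python
import json
import os
import tempfile


def add_user(meta_root, user):
    """Store the user record in users/ and index it by email and access key."""
    for sub in ("users", "emails", "access_keys"):
        os.makedirs(os.path.join(meta_root, sub), exist_ok=True)
    with open(os.path.join(meta_root, "users", user["id"]), "w") as f:
        json.dump(user, f)
    # Relative target, resolved from the directory that holds the link.
    target = os.path.join("..", "users", user["id"])
    os.symlink(target, os.path.join(meta_root, "emails", user["email"]))
    for key in user["access_keys"]:
        os.symlink(target, os.path.join(meta_root, "access_keys", key))


def get_user(meta_root, index, name):
    """Look a user up through users/, emails/, or access_keys/."""
    with open(os.path.join(meta_root, index, name)) as f:   # follows the link
        return json.load(f)


with tempfile.TemporaryDirectory() as root:
    add_user(root, {"id": "alice", "email": "alice@example.com", "access_keys": ["AK1"]})
    print(get_user(root, "access_keys", "AK1")["id"])        # alice
    print(get_user(root, "emails", "alice@example.com")["id"])  # alice
```

Because the links only point at the record in /users/, changing the user data touches a single file, while changing an email or access key means removing the old link and creating a new one.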
Reference for multipart upload: https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
Multipart upload allows uploading a single object as a set of parts. Each part is a contiguous portion of the object's data. The object parts can be uploaded independently and in any order. If transmission of any part fails, it can be retransmitted without affecting other parts. After all parts of the object are uploaded, the S3 layer assembles these parts and creates the object. A multipart upload is done in three steps: initiating the upload, uploading the object parts, then completing the upload. Upon completion, the object is constructed from the parts and is accessible like any other object in the bucket.
Multipart upload initiation
Generate a unique multipart_upload_id that identifies the upload
Create the directory upload_dir = multipart/<bucket_name>/<multipart_upload_id> in the _METADATA bucket
Add any metadata and attributes related to the upload in the rgw_entry xattr of the upload_dir directory
Multipart upload of parts
Write the data of each part to an object under <upload_dir>/<part_id>
Metadata about each part is stored in the xattr rgw_part of the part object
Multipart upload completion
Create the object under the correct key and bucket, creating any missing directories in the process
Copy data from each part to the created object ordered by part_id
Move the metadata and attributes from the upload directory to the object
Delete the upload_dir directory created during initiation, including the parts.
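The three steps above can be sketched with an in-memory model. The class MultipartStore and its methods are hypothetical; dictionaries stand in for the _METADATA bucket and the target buckets, and the rgw_entry and rgw_part xattrs are left out for brevity.

```python
import uuid


class MultipartStore:
    def __init__(self):
        self.metadata = {}   # upload_dir -> {part_id: bytes}, stands in for _METADATA
        self.buckets = {}    # bucket name -> {key: bytes}

    def initiate(self, bucket):
        """Create the upload directory and return its unique upload id."""
        upload_id = uuid.uuid4().hex
        self.metadata[f"multipart/{bucket}/{upload_id}"] = {}
        return upload_id

    def upload_part(self, bucket, upload_id, part_id, data):
        """Store one part; re-uploading a part id replaces the earlier data."""
        self.metadata[f"multipart/{bucket}/{upload_id}"][part_id] = bytes(data)

    def complete(self, bucket, upload_id, key):
        """Assemble the parts in part_id order and remove the upload directory."""
        parts = self.metadata.pop(f"multipart/{bucket}/{upload_id}")
        body = b"".join(parts[part_id] for part_id in sorted(parts))
        self.buckets.setdefault(bucket, {})[key] = body
        return body


store = MultipartStore()
upload_id = store.initiate("photos")
store.upload_part("photos", upload_id, 2, b"world")    # parts may arrive in any order
store.upload_part("photos", upload_id, 1, b"hello ")
print(store.complete("photos", upload_id, "greeting.txt"))   # b'hello world'
```

The join in complete is the copy step that the design pays for on completion: every byte of every part is read from the upload directory and written again into the final object.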
Versioning is supported mostly out of the box by setting the version id in the object name. We maintain a symbolic link to the latest version of the object with the suffix [latest]. This is to allow easier access to the latest version of the object when reading the file or updating the latest version.
TODO: Delete an object when its versioning is turned on. Do we delete all versions? Just latest?
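A minimal sketch of the versioned naming scheme follows. The exact name format is an assumption made for illustration (the version id in square brackets appended to the key, mirroring the [latest] suffix), and the helpers put_version and get_latest are hypothetical. A plain dictionary stands in for the symbolic links.

```python
LATEST = "[latest]"


def put_version(objects, links, key, version_id, data, make_latest=True):
    """Store one version of `key` and optionally repoint the latest link to it."""
    name = f"{key}[{version_id}]"      # assumed naming format
    objects[name] = data
    if make_latest:
        links[key + LATEST] = name     # stands in for the symbolic link
    return name


def get_latest(objects, links, key):
    """Read through the latest link, as a reader of the unversioned key would."""
    return objects[links[key + LATEST]]


objects, links = {}, {}
put_version(objects, links, "report.txt", "v1", b"old")
put_version(objects, links, "report.txt", "v2", b"new")
print(get_latest(objects, links, "report.txt"))   # b'new'
print(sorted(objects))   # ['report.txt[v1]', 'report.txt[v2]']
```

Older versions stay addressable by their full names, which is also why the deletion question in the TODO needs an explicit answer: removing the link and removing the versions are separate operations.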
The DAOS object API does not support listing dkeys in order, so they are returned in arbitrary order. This is a challenge when designing the list operation because the keys must be listed in order.
The current design for multipart objects uploads the parts to the metadata bucket and then copies them to the object. This might cause some performance degradation on the complete multipart operation. A structure where an object can have multiple parts is preferred.
The current design opens a lot of object handles to reach an object - this is resolved by caching. Listing files requires opening each file - could be resolved by accessing lower layers.
We should go through the API and add test cases to verify the behavior and consistency of all libds3 API functions.
Create a pool, ds3_connect, list buckets, create 10 buckets, list buckets, destroy all buckets, list buckets, ds3_disconnect, destroy pool
Create pool, connect, create bucket, create 50 objects with random names and prefixes, verify ds3_bucket_list_obj lists objects in lexical order
ds3_disconnect, destroy pool
Create pool, ds3_connect, verify that creating a user with a name and/or email address longer than DS3_MAX_USER_NAME fails
Create multiple users, get each user by name, email, and key
ds3_user_set to change email addresses and access IDs. Verify that gets by the previous email addresses and keys fail and gets by the changed email addresses and keys succeed
ds3_user_remove of all users succeeds, verify get each by name, email, and key fails, verify (a repeated) ds3_user_remove of all users fails
Create user, get user by name, email, and key, attempt to re-create same user(s) and check expected error
ds3_disconnect, destroy pool
Create pool, connect, create bucket, create multiple objects with 0-3 delimiters in the prefix and random encoded data
Write random amounts of random data to each object and read back to verify
ds3_obj_get_info to verify encoded data, ds3_obj_set_info to change encoded data
ds3_obj_destroy all objects checking for success, destroy all objects again checking for error, ds3_bucket_list_obj returns an empty list
Create a previously destroyed object checking for success and an empty object, re-create object checking for error
ds3_disconnect, destroy pool
Create pool, ds3_connect, verify ds3_upload_remove a non-existent upload fails
Verify ds3_upload_get_info of non-existent bucket or uploads fails
Create a multipart object with a random number of parts (2-32), part sizes, and part information. Verify ds3_upload_list_parts lists all parts and the part information matches the values set via ds3_part_set_info. Verify the data in each part matches the expected data.
Issue ds3_upload_list_parts with a marker of zero and a number of parts less than the number created. Verify only parts less than the value and greater than the marker are returned, the marker points to the next expected part (number + 1), and is_truncated is true. Repeat the process to return all parts (use the value of the marker returned by previous calls). The last repetition should result in is_truncated being false.
Create more multipart objects with a random number of parts, part sizes, and part information. Verify ds3_bucket_list_multipart lists all multipart objects. For each object, verify ds3_upload_list_parts lists all parts and the part information matches the values set via ds3_part_set_info. Verify the data in each part matches the expected data.
Verify removing each previously created multipart object succeeds. Verify removing each previously created multipart object a second time fails.
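The paged part listing exercised by the tests above can be sketched as a loop that feeds each returned marker into the next call. This is an in-memory illustration, not the ds3_upload_list_parts implementation: the functions list_parts and list_all_parts are hypothetical, and the marker convention used here (the returned marker is the number of the last part returned, and the next call returns parts greater than it) is an assumption that may differ from libds3.

```python
def list_parts(part_numbers, marker=0, max_parts=10):
    """Return up to `max_parts` part numbers greater than `marker`, in order."""
    if max_parts < 1:
        raise ValueError("max_parts must be at least 1")
    pending = sorted(n for n in part_numbers if n > marker)
    page = pending[:max_parts]
    is_truncated = len(pending) > len(page)
    next_marker = page[-1] if page else marker   # assumed marker convention
    return page, next_marker, is_truncated


def list_all_parts(part_numbers, max_parts):
    """Repeat the paged call until is_truncated is false."""
    seen, marker, truncated = [], 0, True
    while truncated:
        page, marker, truncated = list_parts(part_numbers, marker, max_parts)
        seen.extend(page)
    return seen


print(list_parts([3, 1, 2, 5, 4], marker=0, max_parts=2))   # ([1, 2], 2, True)
print(list_all_parts(range(1, 8), max_parts=3))             # [1, 2, 3, 4, 5, 6, 7]
```

A test built on this pattern can check three things per call: the page never exceeds the requested size, every returned number is greater than the marker passed in, and only the final call reports is_truncated as false.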
Create pool, connect, create bucket, and create multiple versions of an object, each with different data. Mark one as the latest and verify *.[latest] contains the expected data
Verify ds3_bucket_list_obj does not return object versions when the list_versions parameter is false but does when the parameter is true