S3
The goal of this project is to create an S3 layer for accessing a DAOS backend.
libds3 is a library shipped with DAOS that implements the basic methods needed to support the S3 API over a DAOS pool.
An additional goal of the design is to allow S3 access to the DAOS backend while maintaining as much layout interoperability with the POSIX API as possible. In other words, the POSIX API will be a lingua franca: users can write a POSIX file and then read it as an object via S3 (and vice versa). In this design, slashes (/) in the object name are translated to directories in the POSIX interface. Since the POSIX interface internally uses libdfs for the layout, it makes sense to use libdfs for the S3 layer as well. A further benefit is that libdfs offers a straightforward API for accessing and creating files, which should make development easier than using the lower-level API. The lower-level API is still usable, but care must be taken not to corrupt the layout, so that interoperability is preserved. Additionally, compared to using the POSIX API directly in the implementation, libdfs stays in userspace and does not involve the Linux kernel.
libds3 is intended as a building block for more complex HTTP-based S3 services.
It is currently integrated with the Ceph RADOS Gateway (RGW); see https://github.com/ceph/ceph/tree/main/src/rgw/driver/daos for more details.
Overview
Ceph’s RGW has a SAL (Storage Abstraction Layer) interface, which DAOS should implement, similar to the RADOS implementation and the experimental DBStore implementation. SAL offers abstract interfaces for accessing buckets, objects and other S3 features without having to deal with S3 formatting. This implementation translates these features into DAOS.
DAOS S3 Mapping
S3 Bucket = DAOS Container
Bucket metadata can be stored in container attributes (https://github.com/daos-stack/daos/blob/master/src/include/daos_cont.h#L436)
S3 buckets are mapped to DAOS containers. The DAOS containers need to be marked as POSIX compliant to maintain interoperability with DFS. This is done by setting the layout type property of the DAOS container to DAOS_PROP_CO_LAYOUT_POSIX. libdfs provides a method (dfs_cont_create_with_label) that creates a DAOS container and initializes it to support DFS, and this DGW implementation uses that method.
Bucket information (metadata, number of objects, size, etc.) will be added as DAOS container attributes (using daos_cont_set_attr). The data will be encoded by DGW and stored in one value (the WIP name in the current implementation is "rgw_info"). Listing buckets within a pool uses daos_pool_list_cont to list DAOS containers, returning only the containers that have labels.
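The label filter described above can be sketched as follows. This is a minimal illustration only: `ContainerInfo` and `list_buckets` are hypothetical names, and the list stands in for whatever daos_pool_list_cont returns.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContainerInfo:
    """Hypothetical stand-in for one entry of a pool container listing."""
    uuid: str
    label: Optional[str]  # None or "" when the container has no label


def list_buckets(containers: List[ContainerInfo]) -> List[str]:
    """Return the labels of labelled containers; unlabelled ones are skipped."""
    return [c.label for c in containers if c.label]


containers = [
    ContainerInfo("c1", "photos"),
    ContainerInfo("c2", None),
    ContainerInfo("c3", "logs"),
]
print(list_buckets(containers))  # ['photos', 'logs']
```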
S3 Object = DAOS FS File (DFS)
Object metadata can be stored in extended attributes (https://github.com/daos-stack/daos/blob/master/src/include/daos_fs.h)
S3 objects map directly to DFS files, so we use the dfs_open or dfs_lookup API calls. The RGW object metadata is stored in the extended attributes of the files, using dfs_setxattr and dfs_getxattr. Slashes (/) in keys are translated into directories (using dfs_mkdir) to keep the layout compatible with the POSIX API. If a different delimiter is specified in the S3 list-objects operation, we list all the objects and then divide the results by that delimiter. Since this implementation uses the object’s extended attributes to store metadata, maintaining a separate bucket index should not be needed. This decision might be revisited in the future.
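As a rough sketch of the "list everything, then divide by delimiter" step, the function below groups a flat list of keys into direct entries and common prefixes, in the style of S3 list-objects. The function name and return shape are assumptions made for illustration and are not taken from libds3.

```python
from typing import List, Tuple


def group_by_delimiter(keys: List[str], prefix: str,
                       delimiter: str) -> Tuple[List[str], List[str]]:
    """Split keys under `prefix` into direct entries and common prefixes."""
    contents, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        # An empty delimiter means "no grouping": every key is a direct entry.
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            contents.append(key)
        else:
            common.add(prefix + rest[:cut + len(delimiter)])
    return contents, sorted(common)


keys = ["a-1", "a-2", "b-1-x", "c"]
print(group_by_delimiter(keys, "", "-"))    # (['c'], ['a-', 'b-'])
print(group_by_delimiter(keys, "a-", "-"))  # (['a-1', 'a-2'], [])
```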
Basic Operations
When accessing objects, REST API requests arrive at DGW with three parameters: account information (the access key in S3), bucket or container name, and object name (or key). After authentication, DGW should look up the user id from the given access key and check that the user can access the bucket. Then, it looks up the bucket within the pool using its label and stores the open DAOS container handle (from daos_cont_open) in the bucket entity DaosBucket. Next, DFS is mounted using dfs_mount, and the resulting DFS handle is also stored in the DaosBucket entity. After the operation is done, all the handles created in this process should be closed in reverse order: the DFS object handle, the DFS mount handle, then the container handle.
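The reverse-order cleanup can be illustrated with a small sketch. The `handle` context manager is a hypothetical stand-in for the real container, mount and object handles; only the ordering is the point here.

```python
from contextlib import ExitStack, contextmanager

closed = []  # records the order in which handles are closed


@contextmanager
def handle(name):
    """Hypothetical handle; stands in for a container, DFS mount or DFS object."""
    try:
        yield name
    finally:
        closed.append(name)


def run_operation():
    # ExitStack unwinds in LIFO order, which gives the reverse-order close.
    with ExitStack() as stack:
        stack.enter_context(handle("container"))   # role of daos_cont_open
        stack.enter_context(handle("dfs_mount"))   # role of dfs_mount
        stack.enter_context(handle("dfs_object"))  # role of dfs_lookup / dfs_open
        return "done"


run_operation()
print(closed)  # ['dfs_object', 'dfs_mount', 'container']
```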
GET Operation
In the case of a GET operation, the object key is looked up within the directory structure using dfs_lookup, which internally traverses each directory in the key (split by /) until it finds the file. An open DFS object handle is returned and stored in the DaosObject entity. The handle is then used to read the object data using dfs_read. When the operation is done, the DFS object handle is released using dfs_release.
PUT Operation
In the case of a PUT operation, the object must be created along with all the directories in the object key. To achieve this, the key is split by /. For each part of the key except the last, the directory is looked up using dfs_lookup_rel, or created using dfs_mkdir if it does not exist, before moving on to the next part. For the last part of the key, a DFS file is created using dfs_open and the open DFS file handle is stored in the DaosObject entity. The handle is then used to write data to the object using dfs_write. When the operation is done, the DFS object handle is released using dfs_release.
DELETE Operation
In the case of a DELETE operation, the object’s parent directory is looked up within the directory structure using dfs_lookup, and an open DFS directory handle is returned. The handle is then used to remove the object using dfs_remove. When the operation is done, the DFS directory handle is released using dfs_release.
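The GET, PUT and DELETE flows above can be sketched together with an in-memory tree. This is only a model of the key-to-directory logic: the `Bucket` class is hypothetical, nested dicts play the role of DFS directories, and the comments name the libdfs calls whose role each step imitates.

```python
class Bucket:
    """In-memory stand-in for a DFS namespace: dicts are directories, bytes are files."""

    def __init__(self):
        self.root = {}

    def _walk(self, parts, create=False):
        node = self.root
        for part in parts:
            child = node.get(part)
            if child is None:
                if not create:
                    raise FileNotFoundError("/".join(parts))
                child = node[part] = {}   # role of dfs_mkdir
            elif not isinstance(child, dict):
                raise NotADirectoryError(part)
            node = child                  # role of dfs_lookup_rel
        return node

    def put(self, key, data):
        *dirs, name = key.split("/")
        parent = self._walk(dirs, create=True)
        if isinstance(parent.get(name), dict):
            raise IsADirectoryError(key)
        parent[name] = bytes(data)        # role of dfs_open + dfs_write

    def get(self, key):
        *dirs, name = key.split("/")
        entry = self._walk(dirs).get(name)
        if not isinstance(entry, bytes):
            raise FileNotFoundError(key)
        return entry                      # role of dfs_read

    def delete(self, key):
        *dirs, name = key.split("/")
        parent = self._walk(dirs)         # look up the parent directory
        if not isinstance(parent.get(name), bytes):
            raise FileNotFoundError(key)
        del parent[name]                  # role of dfs_remove


bucket = Bucket()
bucket.put("photos/2024/cat.jpg", b"meow")
print(bucket.get("photos/2024/cat.jpg"))  # b'meow'
bucket.delete("photos/2024/cat.jpg")
```

In this sketch, empty parent directories are left in place after a delete; the text above does not say how the real implementation treats them.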
Users
In order to integrate with S3, DGW needs to store some user information and metadata. This includes user id, name, email, access keys, and the access policies for the user. This information is shared across the whole system, so it must be stored in a place accessible at the pool level. DAOS supports simple ACL policies on containers and pools, but S3 requires more user metadata, and the complicated S3 policies do not map nicely to DAOS ACL policies. Users are thus stored in the metadata container, as described in the next section.
The Metadata Container
A new hidden DAOS container called _METADATA is used to store all pool metadata. The user list and other metadata could then be stored as DFS objects within it. The same method could be used to store other miscellaneous metadata we might need in the future in the same bucket.
S3 bucket names cannot contain underscores or capital letters [https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html], so the _METADATA bucket will not be accessible to the S3 interface. Additionally, a rule has been set in place to prevent the bucket from being loaded.
_METADATA directory structure
/users/ contains all the user data as files, with the user id as the file name and the user data as the file contents.
/emails/ contains soft link files. The name of each file is the email, while the file pointed to is the user data file in /users/.
/access_keys/ contains soft link files. The name of each file is the access key, while the file pointed to is the user data file in /users/.
/multipart/ contains multipart data for each bucket. See below for internal structure.
Example Contents:
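As a hypothetical illustration of this layout, the sketch below builds /users/, /emails/ and /access_keys/ on a local temporary directory using ordinary files and symbolic links. The user id, email, access key and file contents are invented, and the use of relative link targets is an assumption; the real container is accessed through libdfs, not the local filesystem.

```python
import os
import tempfile


def add_user(root, user_id, email, access_key, data):
    """Write the user file and the two symlinks that index it."""
    for sub in ("users", "emails", "access_keys"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    with open(os.path.join(root, "users", user_id), "wb") as f:
        f.write(data)
    target = os.path.join("..", "users", user_id)  # relative link target (assumed)
    os.symlink(target, os.path.join(root, "emails", email))
    os.symlink(target, os.path.join(root, "access_keys", access_key))


def read_user_by(root, index, name):
    """Resolve a user through /emails/ or /access_keys/ by following the link."""
    with open(os.path.join(root, index, name), "rb") as f:
        return f.read()


root = tempfile.mkdtemp(prefix="metadata_")
add_user(root, "alice", "alice@example.com", "EXAMPLEKEY", b"encoded-user-data")
print(read_user_by(root, "emails", "alice@example.com"))
print(sorted(os.listdir(root)))
```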
Multipart upload
Reference for multipart upload:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
Multipart upload allows uploading a single object as a set of parts. Each part is a contiguous portion of the object's data. The object parts can be uploaded independently and in any order. If transmission of any part fails, it can be retransmitted without affecting other parts. After all parts of the object are uploaded, the S3 layer assembles these parts and creates the object. The multipart upload is done in three steps: initiating the upload, uploading the object parts, then completing the upload. Upon completion, the object is constructed from the parts and is accessible like any other object in the bucket.
Design summary:
Multipart upload initiation
Generate a unique multipart_upload_id that identifies the upload
Create the directory upload_dir = multipart/<bucket_name>/<multipart_upload_id> in the _METADATA bucket
Add any metadata and attributes related to the upload in the rgw_entry xattr of the upload_dir directory
Parts upload
Write the data of each part to an object under <upload_dir>/<part_id>
Metadata about each part is stored in the xattr rgw_part of the part object
Multipart upload completion
Create the object under the correct key and bucket, creating any missing directories in the process
Copy data from each part to the created object ordered by part_id
Move the metadata and attributes from the upload directory to the object
Delete the upload_dir directory created during initiation, including the parts.
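The three steps of the design summary can be sketched with an in-memory model. `MultipartStore` and its dictionaries are hypothetical stand-ins for the upload directory in _METADATA and for the target bucket, and numeric part ids are an assumption; the sketch mainly shows that parts may arrive in any order and are concatenated by part_id on completion.

```python
import uuid


class MultipartStore:
    def __init__(self):
        self.metadata = {}  # stands in for multipart/<bucket>/<upload_id>/ in _METADATA
        self.objects = {}   # stands in for the target buckets: (bucket, key) -> bytes

    def initiate(self, bucket, key):
        upload_id = uuid.uuid4().hex  # unique multipart_upload_id
        self.metadata[(bucket, upload_id)] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, bucket, upload_id, part_id, data):
        # Re-uploading a part simply overwrites the previous attempt.
        self.metadata[(bucket, upload_id)]["parts"][int(part_id)] = bytes(data)

    def complete(self, bucket, upload_id):
        upload = self.metadata[(bucket, upload_id)]
        # Copy the parts into the final object, ordered numerically by part id.
        body = b"".join(upload["parts"][n] for n in sorted(upload["parts"]))
        self.objects[(bucket, upload["key"])] = body
        del self.metadata[(bucket, upload_id)]  # drop upload_dir and its parts
        return upload["key"]


store = MultipartStore()
uid = store.initiate("photos", "big.bin")
store.upload_part("photos", uid, 2, b"BB")
store.upload_part("photos", uid, 10, b"CC")
store.upload_part("photos", uid, 1, b"AA")
store.complete("photos", uid)
print(store.objects[("photos", "big.bin")])  # b'AABBCC'
```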
Versioning
Versioning is supported mostly out of the box by setting the version id in the object name. We maintain a symbolic link to the latest version of the object with the suffix [latest]. This is to allow easier access to the latest version of the object when reading the file or updating the latest version.
Example Contents:
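As a hypothetical illustration, the sketch below models versioned names and the [latest] link in memory. The exact naming scheme (version id appended in square brackets) is an assumption made for this example, and the `links` dictionary stands in for a symbolic link.

```python
LATEST = "[latest]"


def versioned_name(key, version_id):
    """Assumed naming scheme: the version id is appended in brackets."""
    return f"{key}[{version_id}]"


class VersionedBucket:
    def __init__(self):
        self.files = {}  # versioned name -> bytes
        self.links = {}  # link name -> versioned name (stands in for a symlink)

    def put(self, key, version_id, data):
        name = versioned_name(key, version_id)
        self.files[name] = bytes(data)
        self.links[key + LATEST] = name  # repoint the link at the new version

    def get(self, key, version_id=None):
        if version_id is None:
            return self.files[self.links[key + LATEST]]
        return self.files[versioned_name(key, version_id)]


vb = VersionedBucket()
vb.put("report.txt", "v1", b"first")
vb.put("report.txt", "v2", b"second")
print(vb.get("report.txt"))        # b'second'
print(vb.get("report.txt", "v1"))  # b'first'
print(sorted(vb.files))            # ['report.txt[v1]', 'report.txt[v2]']
```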
TODO: Delete an object when its versioning is turned on. Do we delete all versions? Just latest?
Caching
Not yet implemented - more research is needed.
Goals:
Cache open bucket handles
Cache open object handles
Avoid deep searches when building the bucket index.
Cache metadata for objects
Candidates:
boost::lru_cache; value can be any object, including something that holds open pointers
ObjectCache in rgw_cache; value is anything that fits in a buffer list; no destructor
Intel’s gurt cache; used in dfs_sys. Maybe we should use dfs_sys
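Whichever candidate is chosen, a cache of open handles needs to release a handle when its entry is evicted. The sketch below shows that idea with a small LRU cache; `HandleCache`, `open_fn` and `close_fn` are hypothetical names and do not correspond to any of the candidates listed above.

```python
from collections import OrderedDict


class HandleCache:
    """LRU cache that calls `close_fn` on a handle when it is evicted."""

    def __init__(self, capacity, open_fn, close_fn):
        self.capacity = capacity
        self.open_fn = open_fn
        self.close_fn = close_fn
        self.entries = OrderedDict()

    def get(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            return self.entries[name]
        handle = self.entries[name] = self.open_fn(name)
        if len(self.entries) > self.capacity:
            _, evicted = self.entries.popitem(last=False)  # least recently used
            self.close_fn(evicted)
        return handle


opened, released = [], []
cache = HandleCache(
    capacity=2,
    open_fn=lambda n: opened.append(n) or f"handle:{n}",
    close_fn=released.append,
)
for bucket_name in ["a", "b", "a", "c"]:
    cache.get(bucket_name)
print(opened)    # ['a', 'b', 'c']
print(released)  # ['handle:b']
```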
Notifications
To be researched.
Object Lock
To be researched.
Challenges
Ordering
The DAOS object API does not support listing dkeys in order, so they are returned in arbitrary order. This is a challenge when designing the list operation, because S3 requires keys to be listed in order.
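One straightforward, if costly, way to cope with unordered keys is to collect them, sort them, and page through the result with a marker. The sketch below assumes a hypothetical `list_page` helper and a positive `max_keys`; it sorts the full key set on every call, which is exactly the overhead this challenge refers to.

```python
def list_page(unordered_keys, marker="", max_keys=1000):
    """Return (page, next_marker, is_truncated) in lexical order after `marker`."""
    ordered = sorted(k for k in unordered_keys if k > marker)
    page = ordered[:max_keys]
    truncated = len(ordered) > max_keys
    next_marker = page[-1] if truncated and page else ""
    return page, next_marker, truncated


keys = {"b", "d", "a", "c"}  # a set has no defined order, like the dkey listing
page, marker, more = list_page(keys, max_keys=3)
print(page, marker, more)    # ['a', 'b', 'c'] c True
page, marker, more = list_page(keys, marker=marker, max_keys=3)
print(page, marker, more)    # ['d']  False
```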
Multipart Objects
The current design for multipart objects uploads the parts to the metadata bucket and then copies them to the object. This might cause some performance degradation on the complete multipart operation. A structure where an object can have multiple parts is preferred.
Performance
The current design opens a lot of object handles to reach an object; this could be resolved by caching. Listing files requires opening each file; this could be resolved by accessing lower layers.
Testing
Basic unit tests for the libds3 library must be developed following the same model as libdfs (see https://github.com/daos-stack/daos/blob/master/src/tests/suite/dfs_unit_test.c).
We should go through the API and add test cases to verify the behavior and consistency of all libds3 API functions.
# | Component | Test Name | Description
---|---|---|---
1 | Bucket | DS3_BUCKET_TEST1 | Create a pool, ds3_connect, list buckets, create 10 buckets, list buckets, destroy all buckets, list buckets, ds3_disconnect, destroy pool
2 | | DS3_BUCKET_LIST_OBJ | Create pool, connect, create bucket, create 50 objects with random names and prefixes, ds3_bucket_list_obj lists objects in lexical order
3 | | | ds3_disconnect, destroy pool
4 | User | DS3_USER_ERROR | Create pool, ds3_connect, verify creating user name and/or email address longer than
5 | | DS3_USER_GET | Create multiple users, get each user by name, email, and key
6 | | DS3_USER_UPDATE | ds3_user_set to change email addresses and access IDs. Verify get of previous email addresses and keys fails and get of changed email addresses and keys succeeds
7 | | DS3_USER_REMOVE | ds3_user_remove of all users succeeds, verify get of each by name, email, and key fails, verify (a repeated) ds3_user_remove of all users fails
8 | | DS3_USER_RECREATE | Create user, get user by name, email, and key, attempt to re-create same user(s) and check expected error
9 | | | ds3_disconnect, destroy pool
10 | Object | DS3_OBJ_CREATE | Create pool, connect, create bucket, create multiple objects with 0-3 delimiters in the prefix and random encoded data
11 | | DS3_OBJ_RW | Write random amounts of random data to each object and read back to verify
12 | | DS3_OBJ_INFO | ds3_obj_get_info to verify encoded data, ds3_obj_set_info to change encoded data
13 | | DS3_OBJ_DESTROY | ds3_obj_destroy all objects checking for success, destroy all objects again checking for error, ds3_bucket_list_obj returns an empty list
14 | | DS3_OBJ_RECREATE | Create a previously destroyed object checking for success and an empty object, re-create object checking for error
15 | | | ds3_disconnect, destroy pool
16 | Multipart | DS3_MULTI_REMOVE_ERR | Create pool, ds3_connect, verify ds3_upload_remove of a non-existent upload fails
17 | | DS3_MULTI_INFO_ERR | Verify ds3_upload_get_info of non-existent bucket or uploads fails
18 | | DS3_MULTI_LIST_PARTS | Create a multipart object with a random number of parts (2-32), part sizes, and part information. Verify ds3_upload_list_parts lists all parts and the part information matches the values set via ds3_part_set_info. Verify the data in each part matches the expected.
19 | | DS3_MULTI_LIST_MARKER | Issue ds3_upload_list_parts with marker of zero and a number of parts less than created. Verify only parts less than value and greater than marker are returned, the marker points to the next expected part (number + 1) and is_truncated is true. Repeat process to return all parts (use value of marker returned by previous calls). Last repetition should result in is_truncated being false.
20 | | DS3_MULTI_CREATE | Create more multipart objects with random number of parts, part sizes, and part information. Verify ds3_bucket_list_multipart lists all multipart objects. For each object, verify ds3_upload_list_parts lists all parts and the part information matches the values set via ds3_part_set_info. Verify the data in each part matches the expected.
21 | | DS3_MULTI_REMOVE | Verify removing each previously created multipart object succeeds. Verify removing each previously created multipart object a second time fails.
22 | Versioning | DS3_VERSION_MARK | Create pool, connect, create bucket, and create multiple versions of an object, each with different data. Mark one as the latest and verify
23 | | DS3_VERSION_LIST | Verify ds3_bucket_list_obj does not return object versions when the
24 | | | ds3_disconnect, destroy pool
Video Demos
Demo 1:
The demo showed more S3 operations (get object, delete object) and demonstrated S3 object versioning and multipart upload.
Demo 2:
A demo of the initial proof of concept implementation can be found here:
The demo showed how to set up the S3 – DAOS system, in addition to some basic S3 operations: list-buckets, make-bucket, put object. It also demonstrated how DGW interacts with DFuse by creating objects as files, and that slashes (/) in the object path show up as directories.