WAL Detailed Design

Code Module & Layering

On the DAOS engine side, three modules will be changed to support VOS on blob:

  1. The umem module is at the bottom layer; a new umem type will be introduced to provide:

    1. Allocator: The new allocator manages space for both the mmap’d tmpfs and the meta blob. Like the libpmemobj allocator, it will provide delayed-atomicity allocation interfaces (reserve & publish) as well as transactional allocate/free interfaces.

    2. Local transaction: Besides the existing umem transaction interfaces, new interfaces may need to be introduced for VOS or other umem callers to track the changes made within a transaction. On transaction commit, the WAL transaction commit will be invoked with all the involved changes as input; on transaction abort, all involved changes should be reverted via an internal undo log (the undo operation should roll back to the exact original physical state). In the following sections of this document, I assume the tracked transaction changes will be represented by a ‘umem_action’ vector (a hedged sketch of one possible representation follows this list; it could change to a different form in the umem design).

    3. Page cache management: For phase 1, it’s just a 1:1 direct mapping between tmpfs space and meta blob space; however, we still need to manage the space at a (relatively large) page granularity and track the per-page pending, committed, and check-pointed transaction IDs. Given that the checkpoint & replay logic is tightly coupled with page cache management, we are going to implement checkpoint & replay in umem.

  2. The BIO module is on top of umem; it provides:

    1. Block device management: Manages SPDK blobs and associated I/O context for META/WAL.

    2. Metadata Block I/O APIs: Used by umem to load or flush pages.

    3. WAL: A physical redo log that ensures transaction atomicity. A WAL commit API and a set of APIs facilitating checkpoint & replay will be provided.

  3. VOS is the top-most of these three modules; it will be responsible for calling the BIO device management APIs to prepare the meta context and passing it to umem, along with the other callbacks required for WAL commit, checkpoint, and replay.
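
As a point of reference for the later sections, below is a minimal sketch of one possible ‘umem_action’ representation and of the per-page transaction-ID tracking mentioned above. The type names, fields and opcodes are illustrative assumptions only; the actual definitions belong to the umem design.

/* Hypothetical sketch only -- the actual definitions are part of the umem design */
enum umem_action_opc {
	UMEM_ACT_COPY,		/* memcpy() a new payload into meta space */
	UMEM_ACT_ASSIGN,	/* assign an integer value */
	UMEM_ACT_MOVE,		/* memmove() within meta space */
	UMEM_ACT_SET,		/* memset() a region */
};

struct umem_action {
	enum umem_action_opc	 ac_opc;
	uint64_t		 ac_offset;	/* offset within tmpfs/meta blob */
	uint64_t		 ac_size;	/* size of the affected region */
	void			*ac_payload;	/* new data for UMEM_ACT_COPY, etc. */
};

/* Hypothetical per-page tracking used by checkpoint & replay */
struct umem_page_info {
	uint64_t	pi_pending_id;		/* highest pending transaction ID */
	uint64_t	pi_committed_id;	/* highest committed transaction ID */
	uint64_t	pi_checkpointed_id;	/* highest check-pointed transaction ID */
};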

New BIO APIs

Meta data & WAL I/O context management APIs

(The per-xstream NVMe context and SMD should be extended to support META/WAL blob management for regular VOS pools, RDB, and sysdb; see “BIO & SMD Detailed Design” for details.)

/**
 * Create & format META/WAL blobs
 *
 * \param[in] xs_ctxt    Per-xstream NVMe context
 * \param[in] pool_id    Pool UUID
 * \param[in] meta_sz    META blob size in bytes
 * \param[in] wal_sz     WAL blob size in bytes
 * \param[in] csum_type  Checksum type used for WAL
 *
 * \return  Zero on success, negative value on error
 */
int bio_meta_create(struct bio_xs_context *xs_ctxt, uuid_t pool_id,
		    uint64_t meta_sz, uint64_t wal_sz, uint16_t csum_type);

/**
 * Destroy META/WAL blobs
 *
 * \param[in] xs_ctxt  Per-xstream NVMe context
 * \param[in] pool_id  Pool UUID
 *
 * \return  Zero on success, negative value on error
 */
int bio_meta_destroy(struct bio_xs_context *xs_ctxt, uuid_t pool_id);

/* Opaque meta instance for caller */
struct bio_meta_instance;

/**
 * Open META/WAL blobs, load WAL header, allocate opaque 'bio_meta_instance'
 *
 * \param[in]  xs_ctxt  Per-xstream NVMe context
 * \param[in]  pool_id  Pool UUID
 * \param[in]  umm      umem instance
 * \param[out] mi       Allocated meta instance
 *
 * \return  Zero on success, negative value on error
 */
int bio_meta_open(struct bio_xs_context *xs_ctxt, uuid_t pool_id,
		  struct umem_instance *umm, struct bio_meta_instance **mi);

/**
 * Close META/WAL blobs, free opaque 'bio_meta_instance'
 *
 * \param[in] mi  Meta instance to be freed
 *
 * \return  N/A
 */
void bio_meta_close(struct bio_meta_instance *mi);
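
A minimal usage sketch of the lifecycle above, assuming the caller already holds the per-xstream NVMe context, pool UUID and umem instance (error handling is abbreviated, and the blob sizes are arbitrary examples):

/* Sketch: create and open the META/WAL blobs for a pool, release them on shutdown */
static int
pool_meta_start(struct bio_xs_context *xs_ctxt, uuid_t pool_id,
		struct umem_instance *umm, struct bio_meta_instance **mi)
{
	int rc;

	/* Create & format the blobs (pool creation only); sizes are examples */
	rc = bio_meta_create(xs_ctxt, pool_id, 16ULL << 30 /* meta_sz */,
			     2ULL << 30 /* wal_sz */, 0 /* csum_type */);
	if (rc != 0)
		return rc;

	/* Open the blobs and load the WAL header for this pool */
	rc = bio_meta_open(xs_ctxt, pool_id, umm, mi);
	if (rc != 0)
		bio_meta_destroy(xs_ctxt, pool_id);
	return rc;
}

static void
pool_meta_stop(struct bio_meta_instance *mi)
{
	bio_meta_close(mi);
}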

Meta data I/O APIs

/**
 * Read meta data from meta blob
 *
 * \param[in]  mi    Meta instance
 * \param[in]  bsgl  BIO SGL to be read
 * \param[out] sgl   Read buffer SGL
 *
 * \return  Zero on success, negative value on error
 */
int bio_meta_readv(struct bio_meta_instance *mi, struct bio_sglist *bsgl,
		   d_sg_list_t *sgl);

/**
 * Write meta data to meta blob
 *
 * \param[in] mi    Meta instance
 * \param[in] bsgl  BIO SGL to be written
 * \param[in] sgl   Write buffer SGL
 *
 * \return  Zero on success, negative value on error
 */
int bio_meta_writev(struct bio_meta_instance *mi, struct bio_sglist *bsgl,
		    d_sg_list_t *sgl);
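
For illustration, below is a hedged sketch of flushing one page from tmpfs back to the meta blob with bio_meta_writev(); the SGL field names (bs_iovs/bs_nr, sg_iovs/sg_nr) and the bio_addr_set()/d_iov_set() helpers follow the existing BIO and d_sg_list_t conventions and are assumptions in the context of this design.

/* Sketch: flush one page buffer to a given offset range of the meta blob */
static int
flush_page(struct bio_meta_instance *mi, void *page_buf, uint64_t page_off,
	   uint64_t page_sz)
{
	struct bio_sglist	bsgl = { 0 };
	struct bio_iov		biov = { 0 };
	d_sg_list_t		sgl  = { 0 };
	d_iov_t			iov;

	/* Destination: offset range within the meta blob */
	bio_addr_set(&biov.bi_addr, DAOS_MEDIA_NVME, page_off);
	biov.bi_data_len = page_sz;
	bsgl.bs_iovs = &biov;
	bsgl.bs_nr = bsgl.bs_nr_out = 1;

	/* Source: the (copied) page buffer in DRAM */
	d_iov_set(&iov, page_buf, page_sz);
	sgl.sg_iovs = &iov;
	sgl.sg_nr = sgl.sg_nr_out = 1;

	return bio_meta_writev(mi, &bsgl, &sgl);
}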

WAL APIs

/**
 * Reserve WAL space and acquire next unused transaction ID
 *
 * \param[in]  mi     Meta instance
 * \param[out] tx_id  Next unused transaction ID
 *
 * \return  Zero on success, negative value on error
 */
int bio_wal_reserve(struct bio_meta_instance *mi, uint64_t *tx_id);

/**
 * Compare two transaction IDs (from same WAL instance)
 *
 * \param[in] mi   Meta instance
 * \param[in] id1  Transaction ID1
 * \param[in] id2  Transaction ID2
 *
 * \return  1: id1 > id2; -1: id1 < id2; 0: id1 == id2
 */
int bio_wal_id_cmp(struct bio_meta_instance *mi, uint64_t id1, uint64_t id2);

/**
 * Commit WAL transaction
 *
 * \param[in] mi      Meta instance
 * \param[in] tx_id   Transaction ID to be committed
 * \param[in] actv    Vector for changes involved in transaction
 * \param[in] act_nr  Vector size
 * \param[in] biod    BIO IO descriptor (Optional)
 *
 * \return  Zero on success, negative value on error
 */
int bio_wal_commit(struct bio_meta_instance *mi, uint64_t tx_id,
		   struct umem_action *actv, unsigned int act_nr,
		   struct bio_desc *biod);

/**
 * Start checkpoint procedure
 *
 * \param[in]  mi       Meta instance
 * \param[out] last_id  Highest transaction ID to be checkpointed
 *
 * \return  Zero on success, negative value on error
 */
int bio_wal_ckp_start(struct bio_meta_instance *mi, uint64_t *last_id);

/**
 * End checkpoint procedure, update WAL header, reclaim log space
 *
 * \param[in] mi       Meta instance
 * \param[in] last_id  Highest checkpointed transaction ID
 *
 * \return  Zero on success, negative value on error
 */
int bio_wal_ckp_end(struct bio_meta_instance *mi, uint64_t last_id);

/**
 * Replay committed transactions on post-crash recovery
 *
 * \param[in] mi         Meta instance
 * \param[in] replay_cb  Callback for transaction replay
 *
 * \return  Zero on success, negative value on error
 */
int bio_wal_replay(struct bio_meta_instance *mi,
		   int (*replay_cb)(struct umem_action *act));
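
Putting the WAL APIs together, below is a hedged sketch of a single transaction commit as seen from the WAL side; the umem_action vector is the hypothetical representation sketched earlier, and the yield/locking details are omitted.

/* Sketch: reserve a transaction ID, then commit the tracked changes */
static int
commit_tx(struct bio_meta_instance *mi, struct umem_action *actv,
	  unsigned int act_nr, struct bio_desc *data_biod)
{
	uint64_t	tx_id;
	int		rc;

	/* May yield if checkpointing hasn't reclaimed log space promptly */
	rc = bio_wal_reserve(mi, &tx_id);
	if (rc != 0)
		return rc;

	/*
	 * Persist the redo log for this transaction; when 'data_biod' is
	 * non-NULL, the commit also waits for the in-flight data blob I/O.
	 */
	return bio_wal_commit(mi, tx_id, actv, act_nr, data_biod);
}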

Asynchronous bio_iod_post()

When VOS is stored on SSD, there will be two NVMe I/O latencies for a large value (>= 4k) update:

  1. NVMe I/O to data blob for value;

  2. WAL commit on transaction commit;

To reduce update latency, these two NVMe I/Os could be submitted in parallel; that requires an asynchronous bio_iod_post() and makes the WAL commit wait on both the data I/O and the WAL I/O.
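
The asynchronous variant is not specified in this document; a hedged sketch of what its declaration could look like is given below, with completion conveyed through the bio_desc that bio_wal_commit() already accepts. The name and signature are assumptions.

/*
 * Hypothetical: submit the data blob I/O without waiting for completion;
 * bio_wal_commit() later waits on the same 'biod', so the WAL write and the
 * data write proceed in parallel.
 */
int bio_iod_post_async(struct bio_desc *biod);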

Callbacks for umem

See umem design

Use cases

Bulk value update steps

  1. vea_reserve() is called to reserve space on data blob for the value.

  2. RDMA the value to the DMA buffer, then bio_iod_post_async() is called to submit the SPDK write of the value to the data blob.

  3. bio_wal_reserve() is called to acquire the WAL transaction ID; please be aware that this function could yield when checkpointing fails to reclaim log space promptly.

  4. umem_tx_begin() is called to start local transaction.

  5. Update the VOS index and publish the reserve via vea_publish(); the new umem API will be called to track all the changes.

  6. umem_tx_commit() is called to commit the transaction. It first updates the per-page pending transaction ID, then calls bio_wal_commit() (through the registered callback); bio_wal_commit() internally waits for the value update from step 2 and any prior dependent WAL commits to complete. After bio_wal_commit() finishes, the per-page committed transaction ID is updated. (An end-to-end sketch of this flow follows this list.)

  7. Update in-memory visibility flag to make the value visible for fetch.

  8. Reply to client.
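
Below is a hedged end-to-end sketch of the steps above from the VOS update path; the vea_reserve()/vea_publish() calls and the umem change-tracking hook are abbreviated as comments, since their exact signatures belong to the VEA and umem designs, and bio_iod_post_async() is the hypothetical asynchronous API sketched earlier.

/* Sketch of a bulk (>= 4k) value update; error handling abbreviated */
static int
bulk_value_update(struct bio_meta_instance *mi, struct umem_instance *umm,
		  struct bio_desc *data_biod)
{
	uint64_t	tx_id;
	int		rc;

	/* 1. vea_reserve() space on the data blob for the value */

	/* 2. RDMA the value to the DMA buffer, submit the data blob write */
	rc = bio_iod_post_async(data_biod);
	if (rc != 0)
		return rc;

	/* 3. Acquire the WAL transaction ID (may yield on log-space pressure) */
	rc = bio_wal_reserve(mi, &tx_id);
	if (rc != 0)
		return rc;

	/* 4. Start the local transaction */
	rc = umem_tx_begin(umm, NULL);
	if (rc != 0)
		return rc;

	/* 5. Update the VOS index, vea_publish() the reserve; umem tracks the changes */

	/*
	 * 6. Commit: umem updates the per-page pending ID, invokes
	 *    bio_wal_commit() through the registered callback with the tracked
	 *    umem_action vector and 'data_biod', then updates the committed ID.
	 */
	rc = umem_tx_commit(umm);

	/* 7/8. Make the value visible for fetch and reply to the client */
	return rc;
}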

Small value update steps

  1. umem_reserve() is called to reserve space on meta blob for the value.

  2. Copy the value to the reserved space on tmpfs, then bio_iod_post_async() is called to submit the SPDK write of the value to the meta blob. (Two choices are listed below.)

    1. If we directly write the value to the meta blob here, WAL space could be saved; however, it relies on the allocator to ensure the value is not co-located on the same SSD page with other metadata or small values;

    2. If we store the value in the WAL along with the other changes in this transaction, it will require more WAL space but does not have to rely on a customized allocator (see the sketch after this list).

  3. bio_wal_reserve() is called to acquire the WAL transaction ID; please be aware that this function could yield when checkpointing fails to reclaim log space promptly.

  4. umem_tx_begin() is called to start local transaction.

  5. Update the VOS index and publish the reserve via umem_publish(); the new umem API will be called to track all the changes.

  6. umem_tx_commit() is called to commit the transaction. It first updates the per-page pending transaction ID, then calls bio_wal_commit() (through the registered callback); bio_wal_commit() internally waits for the value update from step 2 (not necessary if the value is stored in the WAL) and any prior dependent WAL commits to complete. After bio_wal_commit() finishes, the per-page committed transaction ID is updated.

  7. Update in-memory visibility flag to make the value visible for fetch.

  8. Reply to client.
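
As referenced in option 2 of step 2 above, below is a hedged sketch of how the small value could ride in the same umem_action vector that bio_wal_commit() already takes, so no separate data I/O is needed; it assumes the hypothetical umem_action layout sketched in the module overview.

/* Sketch (option 2): embed the small value in the redo log itself */
static void
track_small_value(struct umem_action *act, uint64_t meta_off,
		  void *value_buf, uint64_t value_len)
{
	act->ac_opc     = UMEM_ACT_COPY;
	act->ac_offset  = meta_off;	/* destination offset in the meta blob */
	act->ac_size    = value_len;
	act->ac_payload = value_buf;	/* value bytes are copied into the WAL entry */
}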

Checkpoint

Checkpoint should be triggered regularly by a per-VOS-pool User Level Thread (ULT). The timing of the checkpoint will be largely decided by the amount of committed transactions and the amount of free WAL space; to improve overall system efficiency, the activity of GC & aggregation could also be considered. On umempobj_close(), all committed transactions need to be check-pointed for a clean close. A umem checkpoint interface will be provided, and it works as follows (a hedged sketch is given after this list):

  1. Calls bio_wal_ckp_start() (through callback) to get the highest transaction ID to be check-pointed.

  2. Iterate over the page cache and check the per-page pending/committed/check-pointed IDs to see if the page needs to be check-pointed; the checkpoint needs to wait for pending transactions to be committed.

  3. Calls bio_meta_writev() to flush the page back to the meta blob. Some per-page metadata could be introduced to track, at SSD page granularity, the data touched by committed transactions; that could minimize write amplification on data flush. bio_meta_writev() should copy the data being flushed for the SPDK I/O, so that new transaction updates to the flushed area won’t be blocked.

  4. Updates per-page check-pointed ID (and other per-page metadata for finer flush granularity).

  5. Calls bio_wal_ckp_end() to update WAL header and reclaim WAL space.
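
A hedged sketch of the checkpoint procedure above, assuming the hypothetical per-page tracking structure from the module overview and a flush_page() helper like the one sketched under the meta data I/O APIs:

/*
 * Sketch: check-point every page whose committed ID is newer than its
 * check-pointed ID (waiting for pending transactions and capping at
 * 'last_id' are reduced to comments for brevity).
 */
static int
umem_checkpoint(struct bio_meta_instance *mi, struct umem_page_info *pages,
		void **page_bufs, unsigned int page_nr, uint64_t page_sz)
{
	uint64_t	last_id;
	unsigned int	i;
	int		rc;

	rc = bio_wal_ckp_start(mi, &last_id);
	if (rc != 0)
		return rc;

	for (i = 0; i < page_nr; i++) {
		struct umem_page_info *pi = &pages[i];

		if (bio_wal_id_cmp(mi, pi->pi_committed_id,
				   pi->pi_checkpointed_id) <= 0)
			continue;	/* nothing new on this page */

		/* ... wait for pending transactions on this page to commit ... */

		/* Flush a copy of the page so new updates aren't blocked */
		rc = flush_page(mi, page_bufs[i], (uint64_t)i * page_sz, page_sz);
		if (rc != 0)
			return rc;

		pi->pi_checkpointed_id = pi->pi_committed_id;
	}

	return bio_wal_ckp_end(mi, last_id);
}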

Replay

SSD data load and WAL replay will be done in umempobj_open() in the following steps (a hedged sketch of a replay callback is given after this list):

  1. Calls bio_meta_readv() (through callback) to load meta blob into tmpfs.

  2. Calls bio_wal_replay() to replay committed transactions, if any; bio_wal_replay() accepts a replay callback which replays the changes to tmpfs and updates the per-page committed ID.

  3. Regular checkpoint procedure is performed (this step could be deferred to checkpoint ULT).
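
Below is a hedged sketch of a replay callback that could be passed to bio_wal_replay(), applying each logged change to the tmpfs copy; it assumes the hypothetical umem_action layout from the module overview, and meta_off2addr() (meta offset to tmpfs address) is a hypothetical helper.

/* Sketch: apply one redo-log action to tmpfs during post-crash recovery */
static int
replay_one_action(struct umem_action *act)
{
	void *dst = meta_off2addr(act->ac_offset);	/* hypothetical helper */

	switch (act->ac_opc) {
	case UMEM_ACT_COPY:
		memcpy(dst, act->ac_payload, act->ac_size);
		break;
	/* ... other opcodes: UMEM_ACT_ASSIGN, UMEM_ACT_MOVE, UMEM_ACT_SET ... */
	default:
		return -DER_INVAL;
	}

	/* Update the per-page committed ID for the page(s) covered by 'dst' */
	return 0;
}

/* Usage from umempobj_open(): rc = bio_wal_replay(mi, replay_one_action); */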