
WAL APIs

Code Block
languagec
/**
 * Reserve WAL space and acquire next unused transaction ID
 *
 * \param[in]   mi          Meta instance
 * \param[out]  tx_id       Next unused transaction ID
 *
 * \return                  Zero on success, negative value on error
 */
int bio_wal_reserve(struct bio_meta_instance *mi, uint64_t *tx_id);

/**
 * Compare two transaction IDs (from same WAL instance)
 *
 * \param[in]   mi          Meta instance
 * \param[in]   id1         Transaction ID1
 * \param[in]   id2         Transaction ID2
 *
 * \return                  1: id1 > id2; -1: id1 < id2; 0: id1 == id2;
 */
int bio_wal_id_cmp(struct bio_meta_instance *mi, uint64_t id1, uint64_t id2);

/**
 * Commit WAL transaction
 *
 * \param[in]   mi          Meta instance
 * \param[in]   tx_id       Transaction ID to be committed
 * \param[in]   actv        Vector for changes involved in transaction
 * \param[in]   act_nr      Vector size
 * \param[in]   biod        BIO IO descriptor (Optional)
 *
 * \return                  Zero on success, negative value on error
 */
int bio_wal_commit(struct bio_meta_instance *mi, uint64_t tx_id,
                  struct umem_action *actv, unsigned int act_nr,
                  struct bio_desc *biod);

/**
 * Start checkpoint procedure
 *
 * \param[in]   mi          Meta instance
 * \param[out]  last_id     Highest transaction ID to be checkpointed
 *
 * \return                  Zero on success, negative value on error
 */
int bio_wal_ckp_start(struct bio_meta_instance *mi, uint64_t *last_id);

/**
 * End checkpoint procedure, update WAL header, reclaim log space
 *
 * \param[in]   mi          Meta instance
 * \param[in]   last_id     Highest checkpointed transaction ID
 *
 * \return                  Zero on success, negative value on error
 */
int bio_wal_ckp_end(struct bio_meta_instance *mi, uint64_t last_id);

/**
 * Replay committed transactions on post-crash recovery
 *
 * \param[in]   mi          Meta instance
 * \param[in]   replay_cb   Callback for transaction replay
 * \param[in]   max_replay_nr   Max number of transactions to be replayed
 *
 * \return                  0:   Nothing to be replayed;
 *                          +ve: Number of replayed transactions;
 *                          -ve: Error on replay;
 */
int bio_wal_replay(struct bio_meta_instance *mi,
                  int (*replay_cb)(struct umem_action *actv, unsigned int act_nr),
                  unsigned int max_replay_nr);

Asynchronous bio_iod_post()

When VOS is stored on SSD, there will be two NVMe I/O latencies for a large value (>= 4k) update:

  1. NVMe I/O to data blob for value;

  2. WAL commit on transaction commit;

To reduce update latency, these two NVMe I/Os could be submitted in parallel, which requires an asynchronous bio_iod_post() and makes the WAL commit wait on both the data I/O and the WAL I/O.

Code Block
languagec
/* Asynchronous version of bio_iod_post(), it doesn't wait for the NVMe I/O completion */
int bio_iod_post_async(struct bio_desc *biod, int err);
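
As a minimal sketch of the intended overlap (update_value() below is a hypothetical caller, not part of the proposed API): the data write is submitted first without waiting, then bio_wal_commit() submits the WAL write and waits for both I/Os.

Code Block
languagec
/* Data blob write is submitted without waiting; the subsequent WAL
 * commit waits on both the data I/O (via biod) and the WAL I/O. */
static int
update_value(struct bio_meta_instance *mi, struct bio_desc *biod,
             struct umem_action *actv, unsigned int act_nr, uint64_t tx_id)
{
        int rc;

        rc = bio_iod_post_async(biod, 0);  /* returns before NVMe completion */
        if (rc)
                return rc;

        /* data I/O in flight; bio_wal_commit() submits the WAL write
         * and waits for both to complete */
        return bio_wal_commit(mi, tx_id, actv, act_nr, biod);
}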

Callbacks for umem

See umem design

Use cases

Bulk value update steps

  1. vea_reserve() is called to reserve space on data blob for the value.

  2. RDMA the value into the DMA buffer, then bio_iod_post_async() is called to submit the SPDK write to the data blob for the value.

  3. bio_wal_reserve() is called to acquire the WAL transaction ID; note that this function could yield when checkpointing has failed to reclaim log space promptly.

  4. umem_tx_begin() is called to start local transaction.

  5. Update the VOS index and publish the reservation via vea_publish(); the new umem API will be called to track all the changes.

  6. umem_tx_commit() is called to commit the transaction. It first updates the per-page pending transaction ID, then calls bio_wal_commit() (through a registered callback); bio_wal_commit() internally waits for the value update from step 2 and other prior depended-on WAL commits to complete. Once bio_wal_commit() finishes, umem_tx_commit() updates the per-page committed transaction ID.

  7. Update in-memory visibility flag to make the value visible for fetch.

  8. Reply to client.
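
Put together, the flow above might look like the following sketch; bulk_value_update() is a hypothetical caller, and the vea_reserve()/VOS index/vea_publish() details are left as comments since their prototypes are outside this document:

Code Block
languagec
/* Illustrative bulk value update (steps 1-8 above, error handling trimmed) */
static int
bulk_value_update(struct bio_meta_instance *mi, struct bio_desc *biod,
                  struct umem_action *actv, unsigned int act_nr)
{
        uint64_t tx_id;
        int      rc;

        /* 1. vea_reserve() space on the data blob for the value */
        /* ... RDMA the value into the DMA buffer ... */

        rc = bio_iod_post_async(biod, 0);  /* 2. async write to data blob */
        if (rc)
                return rc;

        rc = bio_wal_reserve(mi, &tx_id);  /* 3. may yield while WAL space
                                            * is being reclaimed */
        if (rc)
                return rc;

        /* 4-5. umem_tx_begin(), update the VOS index, vea_publish();
         * the new umem API records every change into actv */

        /* 6. umem_tx_commit() invokes bio_wal_commit() through the
         * registered callback; shown directly here for illustration */
        rc = bio_wal_commit(mi, tx_id, actv, act_nr, biod);
        if (rc)
                return rc;

        /* 7. flip the in-memory visibility flag; 8. reply to client */
        return 0;
}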

Small value update steps

  1. umem_reserve() is called to reserve space on meta blob for the value.

  2. Copy the value to the reserved space on tmpfs, then bio_iod_post_async() is called to submit the SPDK write to the meta blob for the value. (Two choices listed below)

    1. If we directly write the value to the meta blob here, WAL space could be saved; however, this relies on the allocator to ensure the value is not co-located on the same SSD page as other metadata or small values;

    2. If we store the value in WAL along with other changes in this transaction, it will require more WAL space but does not have to rely on a customized allocator.

  3. bio_wal_reserve() is called to acquire the WAL transaction ID; note that this function could yield when checkpointing has failed to reclaim log space promptly.

  4. umem_tx_begin() is called to start local transaction.

  5. Update the VOS index and publish the reservation via umem_publish(); the new umem API will be called to track all the changes.

  6. umem_tx_commit() is called to commit the transaction. It first updates the per-page pending transaction ID, then calls bio_wal_commit() (through a registered callback); bio_wal_commit() internally waits for the value update from step 2 (this won't be necessary if the value is stored in WAL) and other prior depended-on WAL commits to complete. Once bio_wal_commit() finishes, umem_tx_commit() updates the per-page committed transaction ID.

  7. Update in-memory visibility flag to make the value visible for fetch.

  8. Reply to client.
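
For choice (b) in step 2, the commit path could be as simple as the sketch below; it assumes the value payload is already carried by one of the umem actions in actv (the exact struct umem_action layout belongs to the umem design), so a NULL bio_desc is passed and bio_wal_commit() has no data I/O to wait on:

Code Block
languagec
/* Illustrative commit of a small value embedded in the WAL transaction */
static int
small_value_commit(struct bio_meta_instance *mi,
                   struct umem_action *actv, unsigned int act_nr)
{
        uint64_t tx_id;
        int      rc;

        rc = bio_wal_reserve(mi, &tx_id);
        if (rc)
                return rc;

        /* NULL biod: only the WAL I/O itself needs to complete */
        return bio_wal_commit(mi, tx_id, actv, act_nr, NULL);
}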


Checkpoint

Checkpoint should be triggered regularly by a per-VOS-pool User Level Thread (ULT). The timing of checkpoints will largely be decided by the amount of committed transactions and the amount of free WAL space; to improve overall system efficiency, the activities of GC & aggregation could also be considered. On umempobj_close(), all committed transactions need to be check-pointed for a clean close. A umem checkpoint interface will be provided, and it works as follows:

  1. Calls bio_wal_ckp_start() (through callback) to get the highest transaction ID to be check-pointed.

  2. Iterates over the page cache and checks the per-page pending/committed/check-pointed IDs to see if a page needs to be check-pointed; checkpointing needs to wait for pending transactions to be committed.

  3. Calls bio_meta_writev() to flush pages back to the meta blob. Some per-page metadata could be introduced to track the data touched by committed transactions at SSD-page granularity, which could minimize write amplification on data flush. bio_meta_writev() should copy the data being flushed for the SPDK I/O, so that new transaction updates to the flushed area won't be blocked.

  4. Updates per-page check-pointed ID (and other per-page metadata for finer flush granularity).

  5. Calls bio_wal_ckp_end() to update WAL header and reclaim WAL space.
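
A condensed sketch of this procedure is below; the page-cache walk and the per-page ID fields in the comments are hypothetical names for the bookkeeping described in steps 2-4, and the bio_meta_writev() arguments are elided since its prototype isn't shown in this section:

Code Block
languagec
/* Illustrative checkpoint procedure (steps 1-5 above) */
static int
checkpoint(struct bio_meta_instance *mi)
{
        uint64_t last_id;
        int      rc;

        rc = bio_wal_ckp_start(mi, &last_id); /* 1. highest ID to checkpoint */
        if (rc)
                return rc;

        /* 2-4. for each cached page whose committed ID satisfies
         * bio_wal_id_cmp(mi, page_committed_id, last_id) <= 0:
         *   - wait for its pending transaction to be committed
         *   - bio_meta_writev(...) the touched ranges, with the data
         *     copied for the SPDK I/O so new updates aren't blocked
         *   - advance the per-page check-pointed ID
         */

        return bio_wal_ckp_end(mi, last_id);  /* 5. update header, reclaim space */
}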

Replay

SSD data load and WAL replay will be done in umempobj_open() in the following steps:

  1. Calls bio_meta_readv() (through callback) to load meta blob into tmpfs.

  2. Calls bio_wal_replay() to replay committed transactions, if any; bio_wal_replay() accepts a replay callback which replays changes to tmpfs and updates the per-page committed ID.

  3. Regular checkpoint procedure is performed (this step could be deferred to checkpoint ULT).
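
The open path could be sketched as below; replay_one() is a hypothetical callback that applies each action vector to tmpfs, and the meta blob load is left as a comment since bio_meta_readv()'s prototype isn't shown in this section:

Code Block
languagec
/* Illustrative recovery at umempobj_open() time (steps 1-3 above) */
static int
replay_one(struct umem_action *actv, unsigned int act_nr)
{
        /* apply each action to the tmpfs copy and update the
         * per-page committed ID */
        return 0;
}

static int
meta_open_recover(struct bio_meta_instance *mi, unsigned int max_replay_nr)
{
        int rc;

        /* 1. bio_meta_readv(...) to load the meta blob into tmpfs */

        /* 2. replay committed transactions, if any; rc is the number
         * of replayed transactions, or negative on error */
        rc = bio_wal_replay(mi, replay_one, max_replay_nr);
        if (rc < 0)
                return rc;

        /* 3. regular checkpoint, possibly deferred to the checkpoint ULT */
        return 0;
}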