...

  1. umem_reserve() is called to reserve space on the meta blob for the value. (The end-to-end flow is sketched in code after this list.)

  2. Copy the value to the reserved space on tmpfs, then bio_iod_post_async() is called to submit an SPDK write of the value to the meta blob. (Two choices are listed below.)

    1. If we write the value directly to the meta blob here, WAL space is saved; however, this relies on the allocator to ensure the value is not co-located on the same SSD page with other metadata or small values.

    2. If we store the value in the WAL along with the other changes in this transaction, it requires more WAL space but does not rely on a customized allocator.

  3. bio_wal_id_get() is called to acquire a WAL transaction ID. Be aware that this function could yield when checkpointing fails to reclaim log space promptly.

  4. umem_tx_begin() is called to start a local transaction.

  5. Update the VOS index and publish the reservation via umem_publish(); the new umem APIs will be called to track all the changes.

  6. umem_tx_commit() is called to commit the transaction. It first updates the per-page pending transaction ID, then calls bio_wal_commit() (through a registered callback). bio_wal_commit() internally waits for the value update from step 2 (this won't be necessary if the value is stored in the WAL) and for prior dependent WAL commits to complete; once bio_wal_commit() finishes, it updates the per-page committed transaction ID.

  7. Update the in-memory visibility flag to make the value visible for fetch.

  8. Reply to the client.
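
To make the flow above concrete, here is a minimal C sketch of the update path. All prototypes and helper names (update_value(), vos_index_update(), vos_value_set_visible(), the opaque handle types) are assumptions for illustration, not the actual DAOS declarations, and error handling is elided:

    /*
     * Minimal sketch of the update path; prototypes are assumptions,
     * not the real umem/bio declarations.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct umm_pool;    /* opaque stand-in for the umem pool instance */
    struct bio_iod;     /* opaque stand-in for the bio I/O descriptor */

    void     *umem_reserve(struct umm_pool *pool, size_t len);
    int       bio_iod_post_async(struct bio_iod *iod);
    uint64_t  bio_wal_id_get(struct umm_pool *pool);    /* may yield */
    int       umem_tx_begin(struct umm_pool *pool);
    int       umem_publish(struct umm_pool *pool, void *rsrvd);
    int       umem_tx_commit(struct umm_pool *pool);
    int       vos_index_update(struct umm_pool *pool, void *addr, size_t len);
    void      vos_value_set_visible(void *addr);

    int
    update_value(struct umm_pool *pool, struct bio_iod *iod,
                 const void *val, size_t len)
    {
        void     *dst;
        uint64_t  wal_id;

        dst = umem_reserve(pool, len);       /* 1. reserve on meta blob   */
        memcpy(dst, val, len);               /* 2. copy value to tmpfs... */
        bio_iod_post_async(iod);             /*    ...submit SPDK write   */
        wal_id = bio_wal_id_get(pool);       /* 3. acquire WAL tx ID      */
        (void)wal_id;                        /*    (carried in the tx)    */
        umem_tx_begin(pool);                 /* 4. start local tx         */
        vos_index_update(pool, dst, len);    /* 5. update VOS index...    */
        umem_publish(pool, dst);             /*    ...publish the reserve */
        umem_tx_commit(pool);                /* 6. commit: waits on step-2
                                              *    write & prior commits  */
        vos_value_set_visible(dst);          /* 7. visible for fetch      */
        return 0;                            /* 8. reply to client        */
    }

Note that with choice (b) above (value stored in the WAL), the step-2 SPDK write and the corresponding wait inside umem_tx_commit() would both drop out.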

Checkpoint

Checkpointing should be triggered regularly by a per-VOS-pool ULT. The timing of a checkpoint will be largely decided by the number of committed transactions and the amount of free WAL space; to improve overall system efficiency, the activities of GC and aggregation could also be considered. On umempobj_close(), all committed transactions need to be check-pointed for a clean close. A umem checkpoint interface will be provided; it works as follows (a code sketch follows the list):

  1. Calls bio_wal_ckp_start() (through a callback) to get the highest transaction ID to be check-pointed.

  2. Iterate over the page cache and check the per-page pending/committed/check-pointed IDs to see whether the page needs to be check-pointed; the checkpoint needs to wait for any pending transactions to be committed.

  3. Calls bio_meta_writev() to flush the page back to the meta blob. Some per-page metadata could be introduced to track the data touched by committed transactions at SSD-page granularity, which would minimize write amplification on data flush. bio_meta_writev() should copy the data being flushed for the SPDK I/O, so that new transaction updates to the flushed area won't be blocked.

  4. Updates the per-page check-pointed ID (and other per-page metadata for finer flush granularity).

  5. Calls bio_wal_ckp_end() to update the WAL header and reclaim WAL space.
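
Below is a minimal C sketch of this checkpoint loop. The per-page field names (pg_id_pending, pg_id_committed, pg_id_ckpt) and the simplified prototypes are assumptions for illustration; the real per-page metadata will likely track dirty extents at SSD-page granularity as noted in step 3:

    /*
     * Minimal sketch of the checkpoint flow; field names and prototypes
     * are assumptions for illustration only.
     */
    #include <stdint.h>

    struct umm_pool;

    struct umm_page {
        uint64_t pg_id_pending;      /* highest pending tx on this page   */
        uint64_t pg_id_committed;    /* highest committed tx on this page */
        uint64_t pg_id_ckpt;         /* last check-pointed tx             */
    };

    uint64_t bio_wal_ckp_start(struct umm_pool *pool);
    int      bio_meta_writev(struct umm_pool *pool, struct umm_page *pg);
    void     bio_wal_ckp_end(struct umm_pool *pool, uint64_t ckpt_id);
    void     wait_committed(struct umm_page *pg, uint64_t tx_id);

    void
    umem_checkpoint(struct umm_pool *pool, struct umm_page *pages, int nr)
    {
        uint64_t ckpt_id = bio_wal_ckp_start(pool);    /* 1. highest tx ID */
        int      i;

        for (i = 0; i < nr; i++) {                     /* 2. scan pages    */
            struct umm_page *pg = &pages[i];

            if (pg->pg_id_committed <= pg->pg_id_ckpt)
                continue;                              /* nothing to flush */
            if (pg->pg_id_pending > pg->pg_id_committed)
                wait_committed(pg, pg->pg_id_pending); /* wait for commit  */
            bio_meta_writev(pool, pg);                 /* 3. copy & flush  */
            pg->pg_id_ckpt = ckpt_id;                  /* 4. advance ID    */
        }
        bio_wal_ckp_end(pool, ckpt_id);                /* 5. reclaim WAL   */
    }

Copying the flushed data before the SPDK I/O (step 3) is what allows new transactions to keep updating the same pages while the flush is in flight.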

Replay

SSD data load and WAL replay will be done in umempobj_open() in the following steps (a code sketch follows the list):

  1. Calls bio_meta_readv() (through a callback) to load the meta blob into tmpfs.

  2. Calls bio_wal_replay() to replay committed transactions, if any; bio_wal_replay() accepts a replay callback that replays the changes to tmpfs and updates the per-page committed ID.

  3. The regular checkpoint procedure is performed (this step could be deferred to the checkpoint ULT).
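
A minimal C sketch of the open-time sequence is shown below. The wal_entry layout, the callback shape, and the helpers (page_set_committed(), umem_checkpoint_pool()) are assumptions for illustration; the real record format and callback signature may differ:

    /*
     * Minimal sketch of open-time load and replay; record layout and
     * prototypes are assumptions for illustration only.
     */
    #include <stdint.h>
    #include <string.h>

    struct umm_pool;

    struct wal_entry {
        void     *we_addr;     /* target address in tmpfs */
        void     *we_data;     /* logged payload          */
        size_t    we_len;
        uint64_t  we_tx_id;    /* committed tx ID         */
    };

    int  bio_meta_readv(struct umm_pool *pool);
    int  bio_wal_replay(struct umm_pool *pool,
                        int (*cb)(struct umm_pool *, struct wal_entry *));
    void page_set_committed(struct umm_pool *pool, void *addr, uint64_t id);
    void umem_checkpoint_pool(struct umm_pool *pool);

    /* Replay callback: apply one logged change, bump per-page committed ID. */
    static int
    replay_cb(struct umm_pool *pool, struct wal_entry *ent)
    {
        memcpy(ent->we_addr, ent->we_data, ent->we_len);
        page_set_committed(pool, ent->we_addr, ent->we_tx_id);
        return 0;
    }

    int
    pool_open_replay(struct umm_pool *pool)
    {
        bio_meta_readv(pool);               /* 1. load meta blob to tmpfs */
        bio_wal_replay(pool, replay_cb);    /* 2. replay committed txs    */
        umem_checkpoint_pool(pool);         /* 3. may defer to ckpt ULT   */
        return 0;
    }

Deferring step 3 to the checkpoint ULT would make open faster, at the cost of holding the replayed WAL space until the next checkpoint reclaims it.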