UMEM cache

The UMEM cache currently manages the mmap’d VOS pool space at a memory page (16MB) granularity, and a fixed page mapping table was already implemented during md-on-ssd phase I. For phase II we need to turn this fixed mapping into a dynamic mapping, to support an MD-blob larger than the mmap’d VOS file and on-demand page swapping.

The page mapping table is a runtime table initialized on VOS pool open. Whenever VOS needs to access the MD-blob, it only has to call a simple umem cache API to ensure the required data is loaded & pinned in memory pages; the arduous work of checkpointing & evicting old page data, loading new data, updating the mapping, etc. happens under the hood.

To avoid all the side effects of transaction restart, we’d like to guarantee that everything required by a transaction is loaded & pinned in cache before the transaction starts. For latency-sensitive operations such as update or fetch request handling, all the required data could be loaded in parallel when the operation starts.

The allocator needs to be aware of the page size, so that it never allocates across a page boundary. To keep things simple and efficient, the allocator can assume in most cases that everything it needs is already loaded in memory; the few exceptions (which require the allocator to explicitly map or load pages) are described later in this document. A boundary check is sketched below.
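For illustration only, a minimal sketch of such a boundary check, assuming a hypothetical helper name and the 16MB page size quoted above:

#include <stdbool.h>
#include <stdint.h>

#define CACHE_PAGE_SZ  (16ULL << 20)  /* 16MB memory page, per this design */

/*
 * Hypothetical helper (not part of the proposed API): return true if the
 * allocation [off, off + size) stays entirely within one cache page, i.e.
 * the allocator never hands out space crossing a page boundary.
 */
static inline bool
alloc_within_page(uint64_t off, uint64_t size)
{
    if (size == 0 || size > CACHE_PAGE_SZ)
        return false;

    return (off / CACHE_PAGE_SZ) == ((off + size - 1) / CACHE_PAGE_SZ);
}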

UMEM cache APIs

UMEM cache will essentially export three sets of APIs:

  1. Cache initialization & finalization APIs.

  2. Cache map, load & pin APIs.

  3. MD-blob offset & memory address conversion APIs.

Cache initialization & finalization APIs:

/**
 * Initialize umem cache.
 *
 * \param[in] store        UMEM store
 * \param[in] page_sz      Page size specified by caller
 * \param[in] tot_md_pgs   Total MD pages (md-blob size)
 * \param[in] tot_mem_pgs  Total memory pages (tmpfs vos file size)
 * \param[in] max_ne_pgs   Max non-evictable pages
 * \param[in] base         Start memory address
 * \param[in] evictable_fn Callback provided by allocator to tell if a page is evictable
 *
 * \return Zero on success, negative value on error
 */
int umem_cache_alloc(struct umem_store *store, uint32_t page_sz, uint32_t tot_md_pgs,
                     uint32_t tot_mem_pgs, uint32_t max_ne_pgs, void *base,
                     int (*evictable_fn)(uint32_t pg_id));

/**
 * Finalize umem cache.
 *
 * \param[in] store        UMEM store
 *
 * \return N/A
 */
void umem_cache_free(struct umem_store *store);

The allocator calls umem_cache_alloc() on pool open to initialize the cache; the key parameters are:

  • Page size, which is equal to allocator bucket size

  • Total meta pages, which represents the md-blob size

  • Total memory pages, which represents the tmpfs vos file size

  • Maximum non-evictable pages (should be less than total memory pages)

  • Start memory address

  • Callback provided by the allocator to tell whether an MD page is evictable or not

When the umem_cache_alloc() caller specifies the same tot_md_pgs and tot_mem_pgs, the umem cache runs in phase 1 mode and each memory page is mapped to its corresponding meta page.

The allocator calls umem_cache_free() on pool close.
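As a usage illustration only, a minimal open/close sketch; the wrapper functions, size parameters and evictable callback below are hypothetical, and the 16MB page size is simply the value quoted above:

/* Hypothetical pool-open path: field and parameter names are illustrative only. */
static int
pool_cache_init(struct umem_store *store, void *base, uint64_t md_blob_sz,
                uint64_t vos_file_sz, uint32_t max_ne_pgs,
                int (*evictable_fn)(uint32_t pg_id))
{
    uint32_t page_sz     = 16 << 20;              /* equals the allocator bucket size */
    uint32_t tot_md_pgs  = md_blob_sz / page_sz;  /* md-blob size in pages */
    uint32_t tot_mem_pgs = vos_file_sz / page_sz; /* tmpfs vos file size in pages */

    /* When tot_md_pgs == tot_mem_pgs, the cache runs in phase 1 mode. */
    return umem_cache_alloc(store, page_sz, tot_md_pgs, tot_mem_pgs,
                            max_ne_pgs, base, evictable_fn);
}

/* Hypothetical pool-close path. */
static void
pool_cache_fini(struct umem_store *store)
{
    umem_cache_free(store);
}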

Cache map, load and pin APIs:

struct umem_cache_range {
    umem_off_t  cr_off;   /* Offset within the md-blob */
    daos_size_t cr_size;  /* Range size */
};

/**
 * Map MD pages in specified range to memory pages. The range to be mapped should be
 * empty (no page loading required), and the map operation will fail with -DER_BUSY
 * if there are not enough free pages (no page eviction required).
 *
 * \param[in] store     UMEM store
 * \param[in] ranges    Ranges to be mapped
 * \param[in] range_nr  Number of ranges
 *
 * \return Zero on success, -DER_BUSY when not enough free pages.
 */
int umem_cache_map(struct umem_store *store, struct umem_cache_range *ranges, int range_nr);

/**
 * Load & map MD pages in specified range to memory pages.
 *
 * \param[in] store     UMEM store
 * \param[in] ranges    Ranges to be loaded & mapped
 * \param[in] range_nr  Number of ranges
 * \param[in] is_sys    Internal access from system ULTs (aggregation etc.)
 *
 * \return Zero on success, negative value on error.
 */
int umem_cache_load(struct umem_store *store, struct umem_cache_range *ranges, int range_nr,
                    bool is_sys);

On pool open, the allocator calls umem_cache_load() to load the non-empty non-evictable pages into memory.

In a transaction, the allocator might need to allocate space from an empty non-evictable page which isn’t in cache yet; in that case it can call umem_cache_map() to map the empty MD page to a free memory page. To avoid NVMe I/O (and thus a CPU yield) inside umem_cache_map(), we must ensure that there is always at least one free in-memory page (not in-use, not dirty) available for mapping.
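A rough sketch of that allocator path, purely for illustration; the helper name and the page-index-to-offset arithmetic are assumptions of this sketch:

/*
 * Hypothetical allocator helper: map one empty non-evictable MD page
 * (identified by its page index) before allocating from it. No load or
 * eviction is involved, so the call never yields; -DER_BUSY means no free
 * memory page was available.
 */
static int
map_empty_page(struct umem_store *store, uint32_t pg_id, uint32_t page_sz)
{
    struct umem_cache_range rg;

    rg.cr_off  = (umem_off_t)pg_id * page_sz;  /* offset within the md-blob */
    rg.cr_size = page_sz;                      /* whole page */

    return umem_cache_map(store, &rg, 1);
}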

/**
 * Load & map MD pages in specified range to memory pages, then take a
 * reference on the mapped memory pages, so that the pages won't be evicted
 * until unpin is called. It's usually for the cases where we need the pages
 * to stay loaded across a yield.
 *
 * \param[in] store     UMEM store
 * \param[in] ranges    Ranges to be pinned
 * \param[in] range_nr  Number of ranges
 * \param[in] is_sys    Internal access from system ULTs (aggregation etc.)
 *
 * \return Zero on success, negative value on error
 */
int umem_cache_pin(struct umem_store *store, struct umem_cache_range *ranges, int range_nr,
                   bool is_sys);

/**
 * Unpin the pages pinned by a prior umem_cache_pin().
 *
 * \param[in] store     UMEM store
 * \param[in] ranges    Ranges to be unpinned
 * \param[in] range_nr  Number of ranges
 *
 * \return N/A
 */
void umem_cache_unpin(struct umem_store *store, struct umem_cache_range *ranges, int range_nr);

VOS will call umem_cache_pin() or umem_cache_load() to load required regions in several places:

  • umem_cache_load() when entering the replay callback.

  • umem_cache_pin() when entering the update/fetch handler, umem_cache_unpin() on exiting (see the sketch after this list).

  • umem_cache_load() before the local transaction starts for DTX commit (it requires updating the ilog and SV/EV records).

  • umem_cache_pin() before iterator enters subtree, umem_cache_unpin() when iterator leaves subtree.

  • umem_cache_load() before the GC drain transaction starts (it's hard to predict which MD-blob regions will be accessed before starting the GC local transaction, so the GC drain code needs to be re-organized).

  • umem_cache_pin() after memory is reserved by umem_reserve() (the memory could be reserved from an empty zone which was not pinned, and it could be used for RDMA transfer), umem_cache_unpin() when the reserved memory is published or canceled.
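To illustrate the update/fetch handler case above, a minimal pin/unpin sketch; the handler skeleton and the do_update() worker are hypothetical:

/* Hypothetical worker that runs the actual local transaction (may yield). */
int do_update(struct umem_store *store);

/*
 * Hypothetical update handler skeleton: pin every md-blob range the handler
 * will touch, do the work (which may yield on NVMe I/O or RDMA), then unpin
 * on exit so the pages become evictable again.
 */
static int
obj_update_handler(struct umem_store *store, struct umem_cache_range *ranges, int range_nr)
{
    int rc;

    /* Load & pin all required ranges; loads can be issued in parallel internally. */
    rc = umem_cache_pin(store, ranges, range_nr, false /* not a system ULT */);
    if (rc != 0)
        return rc;

    rc = do_update(store);

    umem_cache_unpin(store, ranges, range_nr);
    return rc;
}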

MD-blob offset and memory address conversion APIs:

This pair of APIs can be called by VOS & the allocator to convert between an MD-blob offset and a memory address; the specified offset or pointer must already have been mapped in cache.
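As a hedged illustration only, the conversion pair might look like the following; the names umem_cache_off2ptr()/umem_cache_ptr2off() and the usage helper are assumptions of this sketch, not the final API definition:

/* Hypothetical conversion pair: translate through the runtime page mapping table. */
void       *umem_cache_off2ptr(struct umem_store *store, umem_off_t offset);
umem_off_t  umem_cache_ptr2off(struct umem_store *store, const void *ptr);

/* Hypothetical usage: follow an on-media offset to its in-memory copy. */
static void *
record_at(struct umem_store *store, umem_off_t rec_off)
{
    /* rec_off must belong to a range that was previously mapped, loaded or pinned. */
    return umem_cache_off2ptr(store, rec_off);
}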