VOS Changes

To support object eviction in phase 2, a bucket allocator (https://daosio.atlassian.net/wiki/spaces/DC/pages/11434950657 ) and dynamic bucket mapping (https://daosio.atlassian.net/wiki/spaces/DC/pages/11402969092 ) will be introduced. The bucket allocator will be used by VOS to centralize the allocations of an object (or a sub-tree) into a single bucket (or a few buckets), which allows VOS to load & pin the object in memory before starting a local transaction.

Object to bucket mapping

An object-to-bucket mapping table will be maintained by VOS. This table stores the bucket usage information for each object: the bucket ID used by the object, and how many allocations the object has performed in that bucket. On local transaction start, the table will be queried to acquire the bucket to be loaded & pinned. On each object alloc and object free call (see the definitions of object alloc and object free in the next section), the table will be updated accordingly.

In phase 2, only a single bucket per object is supported; this could be extended to multiple buckets in the future, and the table could likewise be extended into a sub-tree-to-bucket mapping.

To make future extension easier and simplify compatibility work, this table will be stored separately in a b-tree, and the durable format of vos_pool_df will be changed to hold the tree root.
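As a rough illustration, a minimal sketch of what one durable record of this table could look like is shown below; the struct and field names are hypothetical, not the final durable format.

```c
#include <stdint.h>

/*
 * Hypothetical durable record of the object-to-bucket mapping table
 * (a b-tree keyed by OID); names and layout are illustrative only.
 */
struct vos_obj_bkt_df {
	uint32_t	ob_bkt_id;	/* bucket ID used by the object */
	uint32_t	ob_alloc_cnt;	/* nr. of allocations the object performed in the bucket */
};
```

Presumably the count is increased on each object alloc and decreased on each object free, and a record whose count drops to zero could be removed from the table.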

Object alloc & object free

All allocations private to a VOS object are classified as object alloc; that includes all the tree node & record allocations from a VOS object tree and all its sub-trees. The bucket allocator will be applied to object alloc only. The bucket ID for an object can be acquired when preparing the object tree (by querying the object-to-bucket mapping table by OID), then passed down to sub-trees through btr_context or evt_context. The metadata shared between objects, such as the VEA heap, DTX metadata, GC bags, etc., will be allocated in non-evict-able zones through the legacy allocation API.
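The sketch below shows how the bucket ID might be threaded through the allocation path. The allocator entry points and the context field are placeholders for illustration, not the actual allocator API or the real btr_context/evt_context layout.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t umem_off_t;

/* Hypothetical allocator entry points standing in for the real API. */
umem_off_t umem_alloc_from_bucket(size_t size, uint32_t bkt_id); /* evict-able bucket space */
umem_off_t umem_alloc_non_evictable(size_t size);                /* legacy, shared metadata */

/* Bucket ID resolved from the object-to-bucket table when the object tree is
 * prepared, then carried by the tree context (btr_context/evt_context). */
struct tree_context {
	uint32_t	tc_bkt_id;
};

/* Object alloc: tree nodes/records private to the object go to its bucket. */
static umem_off_t
obj_tree_alloc(struct tree_context *tcx, size_t size)
{
	return umem_alloc_from_bucket(size, tcx->tc_bkt_id);
}

/* Shared metadata (VEA heap, DTX meta, GC bag, ...) stays non-evict-able. */
static umem_off_t
shared_meta_alloc(size_t size)
{
	return umem_alloc_non_evictable(size);
}
```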

Object free is done by GC, where we may have to change the durable format vos_gc_item to hold a bucket ID, so that the bucket ID can be acquired from vos_obj_df (queried by OID) and passed down to sub-trees through the vos_gc_item.

To avoid an unnecessary object-to-bucket table lookup when freeing non-evict-able space, the allocator should provide a helper function that tells whether a given address (or bucket) is evict-able.
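A sketch of how such a helper could be used on the free path is given below; umem_addr_is_evictable() and the other helper names are assumed for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

struct umem_instance;

/* Proposed helper: tell whether an address belongs to an evict-able bucket. */
bool umem_addr_is_evictable(struct umem_instance *umm, uint64_t addr);

/* Hypothetical helpers for the free path. */
void obj_bkt_table_dec(uint64_t oid_key);		/* decrease the object's alloc count */
void umem_obj_free(struct umem_instance *umm, uint64_t addr);

static void
gc_free_addr(struct umem_instance *umm, uint64_t oid_key, uint64_t addr)
{
	/* Only evict-able (bucket) space needs the object-to-bucket table updated. */
	if (umem_addr_is_evictable(umm, addr))
		obj_bkt_table_dec(oid_key);
	umem_obj_free(umm, addr);
}
```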

Load & pin bucket for object

VOS will need to call umem_cache_pin() to load & pin the required data in the following places (a rough sketch of the common pattern is shown after the list):

  1. At the beginning of the replay callback.

  2. At the beginning of the update and fetch handlers.

  3. Before the DTX commit transaction starts, since it will update the ilog or SV/EV records, which are allocated through the bucket allocator. (@Fan Yong could help on this)

  4. Before an iterator enters an object tree.

  5. Before GC drains a sub-tree. (The GC drain code needs to be re-organized, since the current GC may drain many objects in the same transaction, and those objects are unknown before the transaction starts.)
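The common pattern for the places above would look roughly like the following sketch; the signatures of the pin/unpin calls and the table lookup are assumptions, not the real umem_cache_pin() API.

```c
#include <stdint.h>

struct vos_container;

/* Hypothetical helpers; the real umem_cache_pin()/unpin() and table lookup
 * signatures will differ. */
int  obj_bkt_table_lookup(struct vos_container *cont, uint64_t oid_key, uint32_t *bkt_id);
int  umem_cache_pin_bucket(struct vos_container *cont, uint32_t bkt_id, void **pin_hdl);
void umem_cache_unpin_bucket(struct vos_container *cont, void *pin_hdl);
int  vos_obj_tx_run(struct vos_container *cont, uint64_t oid_key);

static int
vos_update_with_pin(struct vos_container *cont, uint64_t oid_key)
{
	uint32_t	 bkt_id;
	void		*pin_hdl;
	int		 rc;

	/* 1. query the object-to-bucket mapping table by OID */
	rc = obj_bkt_table_lookup(cont, oid_key, &bkt_id);
	if (rc != 0)
		return rc;

	/* 2. load & pin the bucket; may fail when memory buckets run short */
	rc = umem_cache_pin_bucket(cont, bkt_id, &pin_hdl);
	if (rc != 0)
		return rc;

	/* 3. run the local transaction against the pinned object */
	rc = vos_obj_tx_run(cont, oid_key);

	/* 4. unpin once the transaction is done */
	umem_cache_unpin_bucket(cont, pin_hdl);
	return rc;
}
```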

Multi-object transactions

When multiple objects are involved in a transaction, all the involved objects must be loaded & pinned beforehand. That results in a higher probability of loading failure when the system is running short of memory buckets, but we may live with it as long as the supported object count is limited. Another thing worth noting is that umem_cache_pin() needs to be carefully implemented to load buckets in a fixed order to avoid deadlock issues.
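One simple way to avoid deadlock is to always pin the involved buckets in ascending bucket ID order, as in the sketch below; the per-bucket pin call is a hypothetical placeholder.

```c
#include <stdint.h>
#include <stdlib.h>

int umem_cache_pin_bucket_id(uint32_t bkt_id);	/* hypothetical per-bucket pin call */

static int
bkt_id_cmp(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a;
	uint32_t y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/*
 * Pin the buckets of all involved objects in ascending bucket ID order, so
 * two transactions touching the same set of buckets never wait on each
 * other in a cycle.
 */
static int
pin_buckets_ordered(uint32_t *bkt_ids, int nr)
{
	int i, rc = 0;

	qsort(bkt_ids, nr, sizeof(*bkt_ids), bkt_id_cmp);
	for (i = 0; i < nr; i++) {
		rc = umem_cache_pin_bucket_id(bkt_ids[i]);
		if (rc != 0)
			break;	/* caller must unpin the buckets already pinned */
	}
	return rc;
}
```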

Heavy code changes would be required for the compound RPC handler & VOS DTX transaction; this might overlap with the DTX handle cleanup proposal from @Jeffrey V Olivier (Deactivated)? (https://daosio.atlassian.net/wiki/spaces/DC/pages/11428298775 )

Cache thrashing problem caused by iteration

There are many engine-internal services, such as VOS aggregation, that iterate the whole VOS tree repeatedly. If the umem cache always chooses victim buckets based on a simple LRU mechanism, the cache could easily be overwhelmed by those internal iterators. To solve this cache thrashing problem caused by iteration, umem_cache_pin() should take a parameter indicating the caller's purpose, so that when selecting victims, the umem cache can make a more reasonable decision based on this additional information.
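A sketch of what such a purpose hint could look like follows; the enum values and the victim-preference logic are illustrative, not the final umem cache design.

```c
#include <stdint.h>

/* Hypothetical purpose hint passed to umem_cache_pin(). */
enum umem_pin_purpose {
	UMEM_PIN_IO,		/* regular update/fetch, keep hot in the cache */
	UMEM_PIN_ITERATION,	/* background iterator (aggregation, GC, ...) */
};

struct cache_bkt {
	uint64_t		cb_last_use;		/* LRU timestamp */
	enum umem_pin_purpose	cb_last_purpose;	/* why the bucket was last pinned */
};

/*
 * Illustrative victim preference: buckets last touched by background
 * iterators are evicted before buckets touched by regular I/O, so a
 * full-tree scan cannot flush the whole cache. Returns negative if 'a'
 * is the better victim.
 */
static int
victim_prefer(const struct cache_bkt *a, const struct cache_bkt *b)
{
	if (a->cb_last_purpose != b->cb_last_purpose)
		return a->cb_last_purpose == UMEM_PIN_ITERATION ? -1 : 1;
	return a->cb_last_use < b->cb_last_use ? -1 : 1;	/* older first */
}
```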