DAV Allocator changes to support Memory Buckets

In this section we cover the high-level requirements defined for the DAV allocator to support Memory Buckets (the memory bucket allocator) and propose high-level changes to the design and implementation to meet them.

Requirements

VOS heap as a collection of Memory Buckets

The MD_ON_SSD project allows DAOS metadata, traditionally stored in PMEM, to be hosted in DRAM and backed by SSD. Phase-1 of the project required all the metadata to always be present in DRAM. Since DRAM capacity is much smaller than that of PMEM, we need a mechanism that allows less frequently accessed metadata to be evicted from DRAM and loaded back on demand. One way to achieve this is to split the VOS heap into a collection of big memory chunks called memory buckets. Allocations within a memory bucket will not cross the memory bucket boundary. Not all memory buckets are required to be present in DRAM; at any instance, only the fraction of memory buckets needed to satisfy the active client IO requests must be resident. This in turn allows the VOS heap for a particular pool on the target to be larger than the available DRAM.

Support evictable and non-evictable Memory Buckets

When processing client IO requests, DAOS uses a transaction model to operate on the metadata. The server thread is not expected to yield the CPU in the middle of a transaction on metadata. For phase-2 of the project, we will retain the same model. This implies that all memory buckets holding the data required for the successful completion of a transaction must be present in DRAM before the transaction starts. This adds complexity, not only in the logic for predetermining which memory buckets to load into DRAM but also in the time required to preload all the dependent memory buckets. We handle this by locking a fraction of the memory buckets into DRAM permanently; these are called non-evictable memory buckets. The type of memory bucket (evictable/non-evictable) to be used is determined by the caller/application at the time of allocation within the VOS heap.

Loading required memory buckets before starting a transaction

Phase-2 of the project further restricts the number of memory buckets to be loaded before starting a transaction to a maximum of one, by defining the rules below:

  1. New object creation will always happen in an evictable memory bucket with sufficient free space.

  2. Subsequent allocations related to the object will use the same memory bucket that was chosen during the first allocation. If there is no space left within that memory bucket, the allocation falls back to a non-evictable memory bucket.

  3. Flattened objects are stored in an evictable memory bucket that is either empty or holds other flattened objects and has sufficient space to host the flattened object's metadata as well.

  4. All other allocations will happen from non-evictable memory buckets.

This implies that all allocations related to an object will reside in a single evictable memory bucket, with any overflow spilling into non-evictable memory buckets, as sketched below.
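
A minimal sketch of how a caller could apply these rules when picking the zone hint for an allocation. The helper names are hypothetical and stand in for whatever VOS/allocator interfaces are finally defined; rule 3 (flattened objects) is omitted for brevity.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helpers, for illustration only. */
uint32_t dav_get_zone_evictable(uint64_t criteria);   /* evictable zone with enough space */
uint32_t vos_obj_get_zone(void *obj);                  /* zone chosen at first allocation   */
bool     zone_has_space(uint32_t zone_id, size_t size);

/* Returns the zone-id hint for an allocation; 0 means "use a non-evictable zone". */
static uint32_t
choose_zone_hint(void *obj, bool new_object, size_t size)
{
    if (new_object)
        return dav_get_zone_evictable(size);   /* rule 1 */

    uint32_t zid = vos_obj_get_zone(obj);
    if (zid != 0 && zone_has_space(zid, size))
        return zid;                            /* rule 2: reuse the object's bucket */

    return 0;                                  /* rules 2 (spill-over) and 4 */
}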

Allow Administrator to dynamically increase space reservation for VOS heap in SSD and DRAM

Administrators should be able to increase the VOS heap reservation for pools as part of memory and storage capacity upgrades. The allocator should not impose restrictions based on the initial heap size and should be able to dynamically extend the heap.

Heap Layout in MD-Blob

For phase-1 of the project the layout of the VOS heap is the same as that of PMDK, with each zone being approximately 16GB in size. Phase-2 requires splitting the heap into a collection of independent zones. Each zone will act as a memory bucket, and no allocation within a zone will spill over into an adjacent zone. The proposed new heap layout to support Memory Buckets is shown below:

[Diagram: layout of the VOS heap in the MD-Blob, split into zones that each act as a Memory Bucket]

The above diagram shows the layout of the VOS heap in the MD-Blob. The VOS heap is split into a collection of zones, and each zone represents a Memory Bucket of 16MB in size. The first 4K bytes of the zone hold the zone and chunk headers; currently around 520 bytes are needed for this header information, and the rest is reserved to hold information such as the checksum, usage, evict-ability, and other runtime information. There are a total of 63 chunks within the zone, each 260KB in size. A total of 2^32 zones can be supported.
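
These numbers are internally consistent; a small compile-time check (values taken from the description above, macro names are placeholders) illustrates the arithmetic:

#include <assert.h>

/* Values from the description above; names are placeholders. */
#define MB_ZONE_SIZE        (16u * 1024 * 1024)   /* 16 MB memory bucket                 */
#define MB_HEADER_SIZE      (4u * 1024)           /* zone + chunk headers (~520 B used)  */
#define MB_CHUNK_SIZE       (260u * 1024)         /* 260 KB per chunk                    */
#define MB_CHUNKS_PER_ZONE  63u

static_assert(MB_HEADER_SIZE + MB_CHUNKS_PER_ZONE * MB_CHUNK_SIZE == MB_ZONE_SIZE,
              "4 KB of headers plus 63 chunks of 260 KB fill the 16 MB zone exactly");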

The heap will grow in a lazy fashion. Of the n zones' worth of space reserved in the MD-Blob, the allocator will start with a small set of zones and expand only when an allocation request cannot be satisfied by the current set of zones marked as in use. The allocation fails with ENOMEM if the request would require extending beyond n zones. If the value of n is later raised as part of a dynamic expansion of the reservation, future allocations will succeed. The highest zone to which the heap has grown is tracked in the heap header for subsequent restarts.
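
A minimal sketch of the lazy-growth check, assuming the heap header tracks the reserved zone count (n) and the highest zone formatted so far; zone_init() is a hypothetical helper that formats a new zone in the MD-Blob.

#include <errno.h>
#include <stdint.h>

struct heap_hdr {
    uint32_t zones_reserved;   /* n: zones' worth of space reserved in the MD-Blob */
    uint32_t zones_in_use;     /* highest zone the heap has grown to (persisted)   */
};

int zone_init(struct heap_hdr *hdr, uint32_t zone_id);   /* hypothetical */

/* Called when the current set of in-use zones cannot satisfy an allocation. */
static int
heap_extend(struct heap_hdr *hdr)
{
    if (hdr->zones_in_use >= hdr->zones_reserved)
        return -ENOMEM;        /* fails until the administrator raises n */

    int rc = zone_init(hdr, hdr->zones_in_use);
    if (rc == 0)
        hdr->zones_in_use++;   /* recorded in the heap header for restarts */
    return rc;
}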

For Phase-2 of the project, we do not plan to support shrinking of the heap.

VOS Heap in DRAM

The VOS heap in DRAM will be managed by the umem_cache module (https://daosio.atlassian.net/wiki/spaces/DC/pages/11402969092). This cache is allocated by the allocator as part of opening the heap. The UMEM cache is divided into pages, each the same size as a zone. These pages are populated with memory buckets either by the allocator or by the VOS module. The UMEM cache maintains a translation between the zone_id/zone offset in the MD-Blob and the address of the associated Memory Bucket in the UMEM cache, and vice versa. A simplified diagram of this translation is given below.

[Diagram: translation between zones in the MD-Blob and Memory Bucket pages in the UMEM cache]

As shown in the above diagram, only a fraction j/n of the zones are loaded into memory at any instance of execution. Here j > m, where m is the number of zones marked non-evictable. Except for the heap header, no other memory block/chunk will have a direct mapping from its md-blob offset to a DRAM offset.

The allocator expects the following services from the UMEM cache (a prototype sketch follows the list):

  1. API for reserving page(s) to host non-evictable memory bucket(s) and creating a translation for the same. The allocator will make this request in the middle of a transaction if an allocation in the non-evictable memory bucket region triggers a heap extension. The UMEM cache is expected to satisfy this request by zeroing out k (<=4) free pages and creating translations with the corresponding zone offsets passed as arguments.

  2. API for reserving a page to host an evictable memory bucket, loading it from the md-blob or zeroing out the page based on the flags passed, and creating a translation for the same. This request will be made by the VOS module before a transaction.

  3. Address translation APIs: off2ptr(), off2zoneid() and ptr2off(). off2zoneid() should return zero if the zone is marked non-evictable.

  4. Implicitly evict/unpin an evictable and non-dirty memory bucket to make space for loading/pinning new memory buckets to satisfy (1) and (2). 
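
A hypothetical prototype sketch of the services listed above. Only the translation helpers in (3) are named in this document; the other names and all signatures here are assumptions, not the final umem_cache interface.

#include <stdint.h>

struct umem_cache;

/* (1) Zero out k (<= 4) free pages for non-evictable memory buckets and map
 *     them to the given zone offsets; callable in the middle of a transaction. */
int umem_cache_map_nonevictable(struct umem_cache *cache,
                                const uint64_t *zone_offs, unsigned int nr);

/* (2) Reserve a page for an evictable memory bucket, loading it from the
 *     md-blob or zeroing it depending on flags, and create the translation. */
int umem_cache_load_evictable(struct umem_cache *cache, uint64_t zone_off,
                              uint64_t flags);

/* (3) Address translation; off2zoneid() returns 0 for non-evictable zones. */
void     *umem_cache_off2ptr(struct umem_cache *cache, uint64_t off);
uint64_t  umem_cache_ptr2off(struct umem_cache *cache, void *ptr);
uint32_t  umem_cache_off2zoneid(struct umem_cache *cache, uint64_t off);

/* (4) Eviction of clean, unpinned evictable buckets happens implicitly inside
 *     (1) and (2) when free pages are needed, so no explicit API is shown. */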

Allocator API Changes

Allocation

The UMEM/BMEM APIs for allocation will now always use the extended allocation functions, such as dav_tx_xalloc(), dav_xreserve() and dav_xalloc(). The caller is expected to pass a bucket/zone id hint to these allocation routines through the "flags" argument so that the allocation can be directed to a specific memory bucket. A new macro, DAV_ZONE_ID(zone_id), can be used to encode the zone id hint into the flags.

Definition:

DAV_ZONE_ID(zone_id) - allocate an object from the zone specified by zone_id if possible. If zone_id is equal to 0, the allocation will happen from one of the zones marked as non-evictable. If the zone_id is greater than 0, the allocation will happen from the specified zone; the zone must exist, otherwise the behavior is undefined. If the zone_id maps to a non-evictable zone, or the allocation cannot be satisfied from the specified evictable zone, the allocation will happen from one of the non-evictable zones.

The allocation routines return and accept offsets relative to the beginning of the VOS heap in the MD-Blob.
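
A minimal usage sketch. The prototype of dav_tx_xalloc() and the bit layout behind DAV_ZONE_ID() are assumptions made only so the example is self-contained; the DAV headers are authoritative.

#include <stdint.h>
#include <stddef.h>

/* Assumed prototype for illustration. */
uint64_t dav_tx_xalloc(size_t size, uint64_t type_num, uint64_t flags);

/* Placeholder encoding so this sketch compiles; the real bit layout is
 * defined by the DAV headers. */
#define DAV_ZONE_ID(zone_id)  ((uint64_t)(zone_id) << 32)

/* Allocate 256 bytes from zone 42 if possible; per the rules above the
 * allocation falls back to a non-evictable zone if zone 42 is full. */
static uint64_t
alloc_from_zone42(void)
{
    return dav_tx_xalloc(256, 0 /* type_num */, DAV_ZONE_ID(42));
}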

Obtaining zone_id for new object allocation

A new API, dav_get_zone_evictable(criteria), will be introduced to help the VOS module (https://daosio.atlassian.net/wiki/spaces/DC/pages/11433246721) obtain the zone_id of an appropriate bucket under which a new object can be placed. The criteria argument determines the condition for selecting a memory bucket; its specification is still being worked out. An initial condition can be a minimum amount of available free space. Preference should be given to evictable memory buckets that are already pinned.
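
A short sketch of how VOS might use this API for a new object. The prototype, and in particular the form of the criteria argument, is an assumption since the specification is still being worked out.

#include <stdint.h>
#include <stddef.h>

/* Assumed prototype; an initial criterion could be minimum free space. */
uint32_t dav_get_zone_evictable(uint64_t min_free_space);

/* For a new object, VOS first picks an evictable zone and remembers the
 * returned zone_id for the object; the id is then passed to the allocation
 * routines via DAV_ZONE_ID() on this and subsequent allocations. */
static uint32_t
pick_zone_for_new_object(size_t obj_md_size)
{
    return dav_get_zone_evictable((uint64_t)obj_md_size);
}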

Address/Offset Translation

Both the UMEM module and the allocator will use the umem_cache APIs umem_cache_off2ptr() and umem_cache_ptr2off() to do the translation. The VOS module will use umem_cache_off2zoneid() to extract the zone_id from an offset.

Managing Zones within the Heap

The allocator will have an internal zone manager to manage zones. The manager should track and ensure that the number of evictable zones stays within a specified limit so that non-evictable zones can grow to their full defined quota. During pool open it should validate that the heap size (both md-blob and umem_cache) stored in the heap header is within the limits defined for the current run. It should also maintain a list of evictable memory buckets that can quickly satisfy the next call to dav_get_zone_evictable().
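
An illustrative sketch of the zone-manager state described above; the field names, limits, and list type are placeholders rather than the final implementation.

#include <stdint.h>
#include <sys/queue.h>

struct zone_entry {
    uint32_t                 ze_zone_id;
    uint64_t                 ze_free_space;
    TAILQ_ENTRY(zone_entry)  ze_link;
};

struct zone_manager {
    uint32_t                 zm_zones_in_use;    /* highest zone formatted so far          */
    uint32_t                 zm_evictable_nr;    /* current number of evictable zones      */
    uint32_t                 zm_evictable_limit; /* cap so non-evictable zones keep quota  */
    /* Evictable zones likely to satisfy the next dav_get_zone_evictable()
     * call, kept roughly ordered by available free space. */
    TAILQ_HEAD(, zone_entry) zm_candidates;
};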

Allocation Class for evictable Memory Bucket

By default, there are around 71 allocation classes defined, of which 55 are defined by the allocator and the remaining 16 by the application. Since we want all allocations related to an object to fit within a single evictable memory bucket, only a smaller subset of the allocation classes will be supported for evictable memory buckets. This allows memory blocks to be satisfied from the limited number of memory chunks (63) available within the memory bucket. Non-evictable memory buckets will continue to use all the allocation classes currently defined.

Runtime data structures for the heap

The DAV runtime keeps track of unused chunks and memory blocks that can accelerate the next allocation request for an allocation class. This data will now be split and maintained separately for each evictable memory bucket, while the non-evictable memory buckets will share a single runtime. As an optimization to save memory, the runtime for an evictable memory bucket can be destroyed when the memory bucket is evicted and rebuilt when it is loaded back.

Repopulating memory buckets upon restart

Upon target restart, the memory buckets in the UMEM cache must be repopulated with all non-evictable zones and with those evictable memory buckets required for the WAL replay to succeed. The allocator will scan through all the zones in the md-blob and populate all zones marked non-evictable into the UMEM cache. While replaying the WAL, if it finds any address translation failures, it will load the corresponding zone on demand and then retry replaying the WAL record for which the translation failed. It will also use this scan of the zones in the md-blob to build the data structures needed for managing evictable zones.
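
A sketch of this restart flow. The helper names and return conventions are hypothetical; wal_replay_one() is assumed to return 1 on an address-translation miss and report the zone that needs loading.

#include <stdint.h>

/* Hypothetical helpers, for illustration only. */
int  zone_is_nonevictable(uint32_t zone_id);
int  umem_cache_load_zone(uint32_t zone_id);
int  wal_replay_one(void *rec, uint32_t *missing_zone_id);

/* Populate non-evictable zones up front, then replay the WAL and fault in
 * evictable zones on translation failures. */
static int
heap_repopulate(uint32_t zones_in_use, void **wal_recs, unsigned int nr_recs)
{
    for (uint32_t zid = 0; zid < zones_in_use; zid++)
        if (zone_is_nonevictable(zid))
            umem_cache_load_zone(zid);   /* also feeds the evictable-zone bookkeeping */

    for (unsigned int i = 0; i < nr_recs; i++) {
        uint32_t missing;
        int      rc = wal_replay_one(wal_recs[i], &missing);

        if (rc == 1) {                           /* translation failed for this record */
            umem_cache_load_zone(missing);       /* load the evictable zone on demand  */
            rc = wal_replay_one(wal_recs[i], &missing);
        }
        if (rc != 0)
            return rc;
    }
    return 0;
}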

Defragmentation and Collocation

One of the challenges with having a limited number of allocation classes for evictable memory buckets is fragmentation. The allocator should be able to detect this condition and flag the memory bucket for defragmentation. The VOS module should use this information to move objects belonging to the flagged memory bucket to a new memory bucket.

Another requirement will be to migrate and collocate related data of an object from non-evictable memory buckets into an evictable memory bucket, thus freeing up memory in the non-evictable buckets. This will happen as part of object flattening for WORM objects. It is desirable to do this for hierarchical objects as well.