Tree optimizations

DAOS uses generic data structures like B+Tree and EVTree (a variant of Rectangle Tree) to index keys and values, these common trees can significantly simplify design and implementation of metadata stack. However, the initial design was based on the availability of PMEM, which can be multiple terabytes capacity, after eliminating PMEM from the system, reducing memory consumption of these sparse trees becomes crucial. This design will introduce two ways to achieve this goal.

Dynamic root B+Tree/EVTree

Both B+Tree and EVTree have a fixed tree order, which is the number of children can be attached to a tree node, the default order of B+Tree in VOS is 20 and EVTree is 15. It means VOS will allocate a piece of memory which has 20 or 15 slots for addresses of children, even if there is only one child attached to the tree node. It is a big waste for small object, because each tree node consumes hundreds of bytes even the actual need is tens of bytes. The current solution is, when DAOS allocates the first level tree node (the first tree node attached to root), it only requests memory size for 1 slot. If more keys or values are added to the tree, DAOS will try to reallocate tree node with more slots like 3, 5, …. It will eventually reach the default tree order, after that, if more children are added to the tree, the algorithm will switch to the regular code path, which can split tree node and increase the depth of the tree.

Implementation of these improvements are already in master branch of DAOS.

Single Child B+tree/EVTree

For small object, which may have a single dkey/akey or value, DAOS has to allocate a tree node for the only child (diagram on the left side). This still wastes of tens of bytes per tree, even with dynamic tree node support (tree node size dropped from hundreds of bytes to tens of bytes). Another optimization is attaching the only child directly to the root (diagram on the right side), this can entirely eliminate the need of allocating tree node for tiny object or Write-Once value of a key. The algorithm can switch to the original code path and allocate tree node when more keys or values are added to the tree, there shouldn’t be any backward compatibility issue.

Single-Child tree