Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

DAOS uses generic data structures like B+tree and EVTree (a variant of Rectangle Tree) to index keys and values, these common trees can significantly simplify design and implementation of metadata stack. However, the initial design was based on the availability of PMEM, which can be multiple terabytes capacity, after eliminating PMEM from the system, reducing memory consumption of these sparse trees becomes crucial. There are a few ways to achieve this goal:

Dynamic root B+Tree/EVTree

Both B+Tree and EVTree have a fixed tree order, which is the number of children can be added to a tree node, the default order of B+Tree in VOS is 20 and EVTree is 15. It means VOS will allocate a piece of memory which has 20 or 15 slots for addresses of children, even there is one child attached to the tree node. This is a big waste for small object, because each tree node consumes hundreds of bytes even if the actual need is tens of bytes. The current solution is while allocating the first level tree node (the first tree node attached to root), DAOS initially creates a small tree node with 1 slot for child. If more keys or values are added to the tree, DAOS will try to reallocate tree node with more slots: 3, 5, …, it will eventually reach the default tree order. After that, if more children are added to the tree, the algorithm will switch to the regular code path: split tree node and increase the depth of the tree.

These works are already in master branch of DAOS.

Single Child B+tree/EVTree

For small object, which may have a single dkey/akey or value, DAOS has to allocate a tree node for the only child (diagram on the left side). This is still a waste of tens of bytes per tree, even with dynamic tree node support (tree node size decreased from hundreds of bytes to tens of bytes). Another optimization is attaching the only child directly to the root (diagram on the right side), this can entirely eliminate the need of allocating tree node for tiny object or Write-Once value of a key. The algorithm can switch to the original code path and allocate tree node when more keys or values are added to the tree, there shouldn’t be any backward compatibility issue.

...