Phase 2

https://daosio.atlassian.net/browse/DAOS-13559

Terminology

  • MD-blob: metadata blob; all of its contents are copied to DRAM.

  • DT-blob: data blob; its contents can only be temporarily staged in DRAM while serving I/O.

  • Hierarchical object: the current DAOS object format, in which keys and values are indexed by trees.

  • Flatten: serialize an object's keys and values into a self-describing contiguous buffer.

Background

DAOS has a Python tool (daos_storage_estimator.py) that estimates internal metadata consumption based on the DFS data model. The numbers below are its results for 1 million 4K files.

  • Metadata: 1.0 GB

    • 196.28 MB (object)

    • 307.20 MB (dkey)

    • 329.00 MB (akey)

    • 192.00 MB (array value)

  • User data: 4.0 GB

Internal metadata is already 25% of the user data for 4K files, even though a few more things are not counted:

  • VEA and DTX space consumption are not considered

  • PMDK/DAV has its own internal metadata

If a DAOS storage server has 1TB of DRAM and reserves 128GB for the OS, DMA/RDMA buffers, the VOS object cache, the VEA index, DTX tables and so on, it is left with about 900GB for the MD-blobs of all pools. Based on the estimate above, each 4K file consumes about 1KB of internal metadata, so this storage server can store at most 900 million 4K files, which is only 3.6TB of user data. Given that a storage server can have 100TB or more of SSD capacity for user data, a DAOS server (MD-on-SSD phase I) can only make use of a tiny portion of its storage space if the application's dataset consists only of small files.
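
The arithmetic above can be reproduced with a short C sketch. The constants come from this section; the names are illustrative only:

    #include <stdio.h>

    int main(void)
    {
        double md_budget   = 900e9; /* DRAM left for MD-blobs, in bytes   */
        double md_per_file = 1e3;   /* ~1KB internal metadata per 4K file */
        double file_size   = 4e3;   /* 4K file, in bytes                  */

        double nfiles    = md_budget / md_per_file;   /* 900 million */
        double user_data = nfiles * file_size / 1e12; /* 3.6TB       */

        printf("files: %.0f million, user data: %.1fTB\n",
               nfiles / 1e6, user_data);
        return 0;
    }

The relation is linear in the per-file metadata size, so shrinking the metadata (next section) or evicting it from DRAM entirely (the sections after) directly multiplies the number of files a server can host.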

Overview

In this design, DAOS will not dynamically load or evict mapped metadata pages of the MD-blob; instead, it will manage objects and their metadata directly. For example, it can migrate a significant amount of internal metadata from the MD-blob to the DT-blob, after which the migrated metadata can be evicted from DRAM. During I/O handling, the evicted metadata can be brought back to DRAM from the DT-blob, and it can be evicted again when the system is under memory pressure.
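
This lifecycle can be summarized as a small state machine: migrate, evict under memory pressure, fetch on I/O. The sketch below is purely illustrative; all type and function names are hypothetical, not the actual DAOS implementation:

    #include <stdio.h>

    /* Where an object's metadata currently lives; hypothetical states. */
    enum md_location {
        MD_IN_MDBLOB, /* in DRAM, backed by the MD-blob (not evictable) */
        MD_IN_DTBLOB, /* in DRAM, backed by the DT-blob (evictable)     */
        MD_EVICTED,   /* only in the DT-blob, no DRAM copy              */
    };

    struct object_md {
        enum md_location loc;
    };

    /* Migrate metadata from MD-blob to DT-blob so it becomes evictable. */
    static void md_migrate(struct object_md *md)
    {
        if (md->loc == MD_IN_MDBLOB)
            md->loc = MD_IN_DTBLOB;
    }

    /* Under memory pressure: drop the DRAM copy, the DT-blob keeps the data. */
    static void md_evict(struct object_md *md)
    {
        if (md->loc == MD_IN_DTBLOB)
            md->loc = MD_EVICTED;
    }

    /* On I/O against an evicted object: read the metadata back from the
     * DT-blob into DRAM; it can be evicted again later. */
    static void md_fetch(struct object_md *md)
    {
        if (md->loc == MD_EVICTED)
            md->loc = MD_IN_DTBLOB;
    }

    int main(void)
    {
        struct object_md md = { .loc = MD_IN_MDBLOB };

        md_migrate(&md); /* migrated to DT-blob       */
        md_evict(&md);   /* evicted from DRAM         */
        md_fetch(&md);   /* brought back to serve I/O */
        printf("final location: %d\n", md.loc);
        return 0;
    }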

Reduce memory consumption

Reducing memory consumption allows a DAOS server to serve more small files even without evicting objects from DRAM. For example, if the internal metadata per 4K file can be reduced to 256 bytes, a DAOS server can host 3.6 billion 4K files, which is 14.4TB of user data overall, far better than the current case. But this is still not good enough given the massive SSD capacity a server node can have.

Object flattening and eviction

Evicting objects of unused datasets from DRAM is the eventual solution for DAOS to support billions of small files/objects on a commodity server. However, the main challenge of evicting an object from DRAM is that DAOS uses generic data structures for internal metadata, so an object can, in principle, be scattered across many memory pages. If such an object were evicted from DRAM, future I/O against it would trigger multiple cache misses and chained reads from SSD, which would badly hurt I/O performance. Serializing a small object into a contiguous buffer and storing it on SSD before eviction, which is called flattening in this document, guarantees that the entire object can be fetched into DRAM on a cache miss, so the read latency is no more than the latency of a single SSD read.
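
As a rough illustration of the flattened format, a self-describing buffer could be laid out as below. This is a hypothetical sketch, not the actual DAOS on-media layout:

    #include <stdint.h>

    /* One serialized record (dkey, akey, or value), stored back to back. */
    struct flat_entry {
        uint8_t  fe_type; /* dkey / akey / single value / array value */
        uint32_t fe_len;  /* payload length in bytes                  */
        /* fe_len bytes of payload follow */
    } __attribute__((packed));

    /* Buffer header: self-describing, so the whole object can be reloaded
     * with a single SSD read of fh_size bytes. */
    struct flat_header {
        uint32_t fh_magic;      /* format identifier            */
        uint32_t fh_version;    /* layout version               */
        uint64_t fh_size;       /* total buffer size in bytes   */
        uint32_t fh_nr_entries; /* number of flat_entry records */
        /* flat_entry records follow */
    } __attribute__((packed));

Because the total size is stored in the header, a cache miss costs exactly one read of a known length rather than a chain of tree-node reads.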

Memory bucket allocator

The flattened format can only be applied to small objects that are not frequently modified; DAOS has to keep the hierarchical format for large objects that can be frequently updated. Because there is no plan to manage a metadata cache in this design, if those objects stayed in the MD-blob they could not be evicted from DRAM even when the system is under memory pressure. The only way to evict hierarchical objects from DRAM is to migrate them into the DT-blob. In this case, DAOS should allocate a big chunk of memory, called a memory bucket, and then make small allocations for object metadata within the bucket. The memory bucket is written to the DT-blob by the checkpointing service; after that, the bucket can be freed once it is no longer referenced, which means the objects within it are evicted. The entire memory bucket can be brought back to DRAM before handling I/O, so there are no chained cache misses during I/O handling.
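
A minimal sketch of such a bucket, assuming a bump-pointer allocator and a per-bucket reference count (all names here are hypothetical, not DAOS APIs):

    #include <stdint.h>
    #include <stdlib.h>

    #define BUCKET_SIZE (1U << 21) /* e.g. a 2MB bucket */

    struct mem_bucket {
        uint32_t mb_used; /* bump-allocation offset          */
        uint32_t mb_ref;  /* live objects inside this bucket */
        uint8_t  mb_data[BUCKET_SIZE];
    };

    static struct mem_bucket *mb_create(void)
    {
        return calloc(1, sizeof(struct mem_bucket));
    }

    /* Small metadata allocation inside the bucket; every allocation
     * shares the bucket's lifetime. */
    static void *mb_alloc(struct mem_bucket *mb, uint32_t size)
    {
        void *ptr;

        if (mb->mb_used + size > BUCKET_SIZE)
            return NULL; /* bucket full, caller grabs a new one */
        ptr = &mb->mb_data[mb->mb_used];
        mb->mb_used += size;
        mb->mb_ref++;
        return ptr;
    }

    /* Dropping the last reference frees the whole bucket at once, i.e.
     * every object inside it is evicted; the checkpointed copy in the
     * DT-blob remains on SSD and can be reloaded in a single read. */
    static void mb_putref(struct mem_bucket *mb)
    {
        if (--mb->mb_ref == 0)
            free(mb);
    }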