Control-Plane changes
In phase 2 the meta-blob size and the VOS file size need to be specifiable separately at pool creation, whereas in phase 1 both have the same value so only a single size is specified for both.
Contents
Return VOS file capacity in addition to meta-blob size on pool query
Prerequisite tidy of control plane pool create logic
Pass meta blob size through pool create callstack
Requirements - https://daosio.atlassian.net/browse/DAOS-14223
Requirement is to add the capability to specify two extra parameters when calling into the engine over dRPC to create a pool:
The per-target VOS-file size (used to store metadata in tmpfs/ramdisk)
The per-target meta-blob size (in phase-II used to store a superset of metadata changes on meta-role SSDs)
In phase-I, the size of the VOS-file is calculated by dividing the ramdisk-size by the number of engine targets. The size of the meta-blob on meta-role-SSD is the number of SPDK blob clusters required to fit the size of the VOS-file in bytes (multiplied by the size of a SPDK blob cluster).
In phase-II, the size of the meta-blob (on-SSD) should be capable of being much larger than the VOS-file (on-RAMdisk) and potentially be auto calculated to use all available meta-role-SSD capacity.
These extra parameter values should be passed from the control plane to the engine on pool create and could be manually provided over dmg (as per --scm-size and --nvme-size) or calculated in the control plane. Validation should be performed on inputs to make sure they are viable within the RAM and SSD capacity limit.
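As a sketch of the viability check mentioned above (not the actual control-plane API; the function name, parameters and limits are all illustrative), a minimal Go example, assuming per-target byte values have already been derived:

// Minimal sketch of the input validation described above; illustrative only.
package main

import "fmt"

// validatePoolSizes checks that the requested per-target sizes fit within the
// available per-target RAM-disk and meta-role SSD capacities (all in bytes).
func validatePoolSizes(vosFileSize, metaBlobSize, ramdiskPerTgt, metaSSDPerTgt uint64) error {
	if vosFileSize > metaBlobSize {
		return fmt.Errorf("VOS-file size (%d) cannot exceed meta-blob size (%d)", vosFileSize, metaBlobSize)
	}
	if vosFileSize > ramdiskPerTgt {
		return fmt.Errorf("VOS-file size (%d) exceeds RAM-disk capacity per target (%d)", vosFileSize, ramdiskPerTgt)
	}
	if metaBlobSize > metaSSDPerTgt {
		return fmt.Errorf("meta-blob size (%d) exceeds meta-role SSD capacity per target (%d)", metaBlobSize, metaSSDPerTgt)
	}
	return nil
}

func main() {
	// Phase-2 example: 1 GiB VOS file backed by a 4 GiB meta-blob.
	if err := validatePoolSizes(1<<30, 4<<30, 2<<30, 8<<30); err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("requested sizes are viable")
}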
Associated display output updates are also required for dmg storage query usage, dmg pool create, dmg pool query and dmg pool list.
User-interface updates
When presenting pool-related SCM/NVMe info, the existing display output is not relevant for MD-on-SSD. The suggested changes are described here.
The significant non-presentation-layer change related to this ticket is to provide the “Total memory-file size” as distinct from the “Metadata on NVMe” stats in dRPC PoolQuery and PoolCreate responses. This task is covered in a separate ticket as it is not specifically related to the dmg/presentation layer: https://daosio.atlassian.net/browse/DAOS-16209.
Pool query display output - https://daosio.atlassian.net/browse/DAOS-16326
Currently the space info shows how pool storage is distributed in DAOS PMem mode where there are 2 distinct components, SCM and NVMe:
Pool space info:
- Target(VOS) count:1
- Storage tier 0 (SCM):
Total size: 2 B
Free: 1 B, min:0 B, max:0 B, mean:0 B
- Storage tier 1 (NVMe):
Total size: 2 B
Free: 1 B, min:0 B, max:0 B, mean:0 B
Rebuild failed, rc=0, status=2
In MD-on-SSD mode the usage statistics returned are meta-blob-on-SSD in tier-0 and data-blob-on-SSD in tier-1 (the use of “tiers” in this pool context is different from how tiers define engine-instance storage, where NVMe tiers map SSDs to “roles”). The suggested output in MD-on-SSD mode therefore reports Total and Free (aggregated across all targets) sizes for the Metadata and Data blobs, and additionally the Total (aggregated) size of all per-target memory (VOS-index) files:
Pool space info:
- Target count:1
- Total memory-file size: 1 B
- Metadata storage:
Total size: 2 B
Free: 1 B, min:0 B, max:0 B, mean:0 B
- Data storage:
Total size: 2 B
Free: 1 B, min:0 B, max:0 B, mean:0 B
Rebuild busy, 42 objs, 21 recs
The per-pool “Total” and “Free” sizes are shown as in the previous display format, under the “Metadata storage” and “Data storage” tiers, whilst the extra “Total memory-file size” entry reports the aggregated size of all per-target VOS-index files across the pool. As with the existing format, “min”, “max” and “mean” are per-target values.
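To make the aggregated-versus-per-target distinction concrete, a minimal Go sketch (illustrative values only, not dmg code) of how the Free/min/max/mean figures relate:

// Illustrative relationship between per-target stats and the aggregated total.
package main

import "fmt"

func main() {
	// Hypothetical per-target free bytes for the metadata component.
	perTargetFree := []uint64{100, 200, 300, 400}

	total := uint64(0)
	min, max := perTargetFree[0], perTargetFree[0]
	for _, f := range perTargetFree {
		total += f
		if f < min {
			min = f
		}
		if f > max {
			max = f
		}
	}
	mean := total / uint64(len(perTargetFree))

	// "Free" in the query output is the aggregated total; min/max/mean stay per-target.
	fmt.Printf("Free: %d B, min:%d B, max:%d B, mean:%d B\n", total, min, max, mean)
}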
Pool create display output - https://daosio.atlassian.net/browse/DAOS-14422
Similarly, dmg pool create response display output isn't relevant in its current form:
Pool created with 5.66%,94.34% storage tier ratio
-------------------------------------------------
UUID : 0000...
Service Leader : 0
Service Ranks : [0-2]
Storage Ranks : [0-3]
Total Size : 42 GB
Storage tier 0 (SCM) : 2.4 GB (600 MB / rank)
Storage tier 1 (NVMe): 40 GB (10 GB / rank)
and suggested output:
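A concrete suggested layout is not reproduced here; the following is only an illustrative sketch, assuming the SCM/NVMe tier lines are replaced by Metadata/Data storage entries in line with the pool query changes above (all labels and values are placeholders):

Pool created with 5.66%,94.34% storage tier ratio
-------------------------------------------------
UUID                 : 0000...
Service Leader       : 0
Service Ranks        : [0-2]
Storage Ranks        : [0-3]
Total Size           : 42 GB
Metadata Storage     : 2.4 GB (600 MB / rank)
Data Storage         : 40 GB (10 GB / rank)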
Memory-file usage
Note that in pool create output the memory-file (VOS-index on tmpfs) size is not shown and the Metadata storage stats will reflect the meta-blob-on-SSD. If an admin needs to know the memory-file size (which may be different from the meta-blob size) then dmg pool query can be run.
The “RAM cache”, also referred to as “memory file” and “VOS index file”, space usage is not displayed as its usefulness is questionable. The memory file represents a dynamic sliding-window subsection of the MD-on-SSD which doesn’t reflect metadata space usage across all SSD blobs.
Pool create commandline options
dmg pool create input options will also be changed to reflect MD-on-SSD, e.g. manual pool storage size parameters --scm-size and --nvme-size are to be replaced with --meta-size and --data-size. An additional --mem-ratio will be added to tune the memory-file to meta-blob capacity relationship in MD-on-SSD mode (mode indicated by specifying bdev “roles” in the server config file).
The RAM-disk/tmpfs space reserved for “memory files” will be determined by adjusting the --mem-ratio option of the dmg pool create command. The per-target in-memory VOS-file size will be equal to the meta-blob storage size multiplied by the --mem-ratio fraction. The meta-blob size is defined as the per-rank space reserved for "metadata blobs on SSD" and is requested using the --meta-size option, whilst the per-rank space reserved for “data” blobs is requested using the --data-size option. Both aforementioned values are automatically calculated if the --size auto option is used.
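A short worked example of that derivation, using illustrative numbers (a hypothetical 64G --meta-size per rank, 16 targets per rank and a 25% --mem-ratio; not actual control-plane code):

// Illustrative per-target size derivation from per-rank inputs.
package main

import "fmt"

func main() {
	const (
		metaSizePerRank = uint64(64) << 30 // --meta-size 64G (per rank)
		targetsPerRank  = uint64(16)
		memRatio        = 0.25 // --mem-ratio 25%
	)

	metaBlobPerTarget := metaSizePerRank / targetsPerRank                // 4 GiB meta-blob per target
	vosFilePerTarget := uint64(float64(metaBlobPerTarget) * memRatio)    // 1 GiB VOS file per target
	tmpfsPerRank := vosFilePerTarget * targetsPerRank                    // 16 GiB tmpfs needed per rank

	fmt.Printf("meta-blob per target:  %d GiB\n", metaBlobPerTarget>>30)
	fmt.Printf("VOS file per target:   %d GiB\n", vosFilePerTarget>>30)
	fmt.Printf("tmpfs needed per rank: %d GiB\n", tmpfsPerRank>>30)
}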
Storage query usage display updates - https://daosio.atlassian.net/browse/DAOS-16327
dmg storage query usage should report total SSD free space per-tier per-rank, regardless of meta/data usage split. The rationale behind this is that SCM/NVMe differentiation is not useful for MD-on-SSD, so just report blobstore total/free space across all tier SSDs. The main use case envisaged for dmg storage query usage is for an administrator to see how much space is available on each NVMe tier for the purpose of creating more pools. As such, the table layout should be one row per engine, with columns for total/free for each tier.
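An illustrative table layout along those lines (host names, ranks, tier labels and values are placeholders, not actual dmg output):

Hosts   Rank  Tier-1 Total  Tier-1 Free  Tier-2 Total  Tier-2 Free
-----   ----  ------------  -----------  ------------  -----------
node-1  0     3.2 TB        3.0 TB       6.4 TB        6.1 TB
node-1  1     3.2 TB        2.9 TB       6.4 TB        6.0 TB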
Pool list display updates - https://daosio.atlassian.net/browse/DAOS-16328
Consider and implement updates that create relevant pool list output in MD-on-SSD phase-2 mode.
These changes are only for MD-on-SSD mode and in PMem or non-MD-on-SSD-tmpfs mode the original display output will persist. There shouldn’t be many control-API changes for this display update so JSON output for the commands should remain mostly unchanged.
Old options (e.g. --scm-size) will be supported in MD-on-SSD mode for backward compatibility, but when they are used the mem-ratio will be hardcoded to a value of 1 (phase-1 mode) and the memory-file size will be equal to the meta-blob size.
Pool create --size updates - DAOS-16160: Correctly implement dmg pool create --size option for MD-on-SSD phase-II (Resolved)
Internal changes to properly support the --size pool create option.
This section covers implementation improvements to the total-size and percentage-of-resources pool size options when creating a pool in MD-on-SSD modes. An extra option to specify the ratio between VOS-file size and meta-blob size is also covered.
Clarification of terminology within the context of a DAOS pool
mem_size: vos-file size in ram-disk tmpfs (1 vos-file per-target, size is per-target)
meta_size: meta-blob size on META-role SSD (size is per-target)
data_size: data-blob size on DATA-role SSD (size is per-target)
pmem-mode: original non-md-on-ssd mode where metadata is stored on pmem
phase1-mode: md-on-ssd where mem_size == meta_size
phase2-mode: md-on-ssd where mem_size < meta_size
mem-ratio: ratio of mem_size to meta_size
Note: "{mem,meta,data}_size" refer to per-target allocations as units of blobs or files. This is distinct from the dmg pool create
commandline parameters --{meta,data}-size
which are used to specify per-rank allocations to be spread across all engine targets assigned to the rank.
Clarification of some basic assumptions
There is one vos file per target.
There is one meta blob per vos file.
If meta+data roles share a tier, meta and data blobs will be on the same SSD.
The engine's BIO module assigns SSDs from each tier to a VOS target in a round-robin manner. Depending on how roles have been assigned to NVMe tiers in the server configuration file (yaml), each VOS target may have 1 (all roles assigned to a single NVMe tier in yaml), 2 (roles distributed across two NVMe tiers in yaml) or 3 SSDs (3 NVMe tiers, each assigned only one role in yaml) assigned.
When a pool is created each VOS target will create the meta, wal, data blobs on the assigned SSD(s) respectively.
For a single DAOS pool, each target has a VOS pool and each VOS pool has a meta-blob. If a meta-role SSD is used by only one target there will be one meta-blob on that SSD; if it is shared by multiple targets there will be multiple meta-blobs.
If there are 2 meta-role SSDs (assuming wal and data roles are on separate SSDs/tiers), the two meta-role SSDs will be assigned to targets in a round-robin manner; with 8 targets, each meta-role SSD will be shared by 4 targets.
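A minimal sketch of that round-robin assignment (illustrative only, not the engine's BIO module code):

// Round-robin assignment of meta-role SSDs to VOS targets.
package main

import "fmt"

func main() {
	metaSSDs := []string{"ssd0", "ssd1"} // two meta-role SSDs in one NVMe tier
	numTargets := 8

	// Each VOS target gets a meta-role SSD in round-robin order, so with
	// 2 SSDs and 8 targets each SSD ends up holding 4 meta-blobs.
	for tgt := 0; tgt < numTargets; tgt++ {
		fmt.Printf("target %d -> %s\n", tgt, metaSSDs[tgt%len(metaSSDs)])
	}
}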
Notes on syntax
dmg pool query:
output metrics in capitals are aggregated, i.e. total pool values
output metrics in lowercase are per-target
dmg pool create:
--size is either a percentage of available capacity or a total size in bytes
--{scm,nvme,meta,data}-size are used to specify per-rank allocations to be spread across all VOS targets assigned to the rank
Problem statement
In MD-on-SSD mode the --size option can be used for either a percentage or a total pool size.
The --size option has 2 modes:
Percentage: The SCM/ramdisk and NVMe capacity across all servers is evaluated and the given percentage is selected from these capacities. The SCM:NVMe ratio of the pool will end up being the SCM:NVMe total capacity ratio on the hosts.
Total Pool Size: A pool of the given total size is created; the SCM:NVMe ratio and the list of ranks can be specified on the commandline. By default the ratio will be 6:94. The server capacities are evaluated and the total size is divided up across the available ranks.
Neither of these options works completely for MD-on-SSD, mostly because they assume that SSDs are used exclusively for DATA and SCM exclusively for MD. When NVMe roles are shared these assumptions break, and so do the --size modes. The current implementation counts any NVMe SSD with the DATA role as fully usable but does not consider the META role in any calculation. The SCM pool component is assumed to be synonymous with the mem-file, and NVMe META capacity is not factored in.
As a result of these limitations the user could run into confusing situations when using the --size option with MD-on-SSD.
Adjustments also need to be made to handle phase2-mode.
WAL-related sizes should be relatively insignificant in production environments and concern for WAL sizes should not be exposed to the user.
Proposed solution
To enable --size to work properly in phase2-mode (mem_size < meta_size), a new option (similar to the current --tier-ratio option) needs to be provided to specify the ratio between mem_size and meta_size. --mem-ratio will be used to specify this ratio during dmg pool create when the --size option is specified. When this option isn't specified, a default mem_size:meta_size ratio will be used.
Solution design
To allow automation of pool storage component size calculations:
mem_size will be derived from the --size value and tmpfs capacity
meta_size will be derived from mem_size and --mem-ratio (or the default ratio)
data_size will be derived from meta_size, data tier capacity (blobstore free space) and role assignment in the server yaml.
Tasks required when the --size option is specified as a total size:
Keep behaviour consistent with PMem mode where total pool size includes both DATA and META components (no change required)
Derive pool meta_size and data_size components using total_size and tier-ratio. (no change required)
Calculate mem_size as meta_size * mem-ratio (1 if not set). Ramdisk/tmpfs free capacity should then be checked to see if it can satisfy mem_size. (no change required - the pre-check, if needed, should be implemented at a later date; presumably create will fail with no-space due to fallocate)
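For example (illustrative numbers only): dmg pool create --size 1TB on two ranks with the default 6:94 tier ratio gives the pool roughly a 60 GB META component and a 940 GB DATA component (30 GB and 470 GB per rank); with --mem-ratio 50% each rank would then need about 15 GB of tmpfs for memory files, and the create would be expected to fail with no-space if that cannot be allocated.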
Tasks required when the --size option is specified as a percentage:
Derive mem_size and data_size components using percentage of available ramdisk and SSD capacity. The rationale for deriving space usage from ramdisk usage is that this is most likely to be the limiting factor as opposed to SSD usage.
Calculate mem_size as percentage of ramdisk/tmpfs free capacity.
Calculate meta_size as mem_size/mem-ratio.
data_size in pool taken from bdev scan where adjustments have already been made…
Subtract meta_size from free blobstore space if an SSD shares META+DATA roles
Subtract WAL size from free blobstore space if an SSD shares WAL+DATA roles
Adjustments updated to take account of mem-ratio
Adjustments updated to take account of target striping across bdev tier
Take percentage of the remainder
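A rough sketch of this percentage path, assuming a single rank and a single SSD sharing META, WAL and DATA roles and ignoring target striping across the bdev tier (all names and numbers are illustrative, not the control-plane implementation):

// Illustrative percentage-based derivation with shared-role adjustments.
package main

import "fmt"

func main() {
	// Illustrative inputs (not real scan results).
	var (
		pct         = 0.80              // --size 80%
		memRatio    = 0.25              // --mem-ratio 25%
		tmpfsFree   = uint64(64) << 30  // free ramdisk capacity on the rank
		blobFree    = uint64(800) << 30 // free blobstore space on the shared SSD
		walReserved = uint64(2) << 30   // assumed WAL reservation on the same SSD
	)

	// mem_size: percentage of ramdisk/tmpfs free capacity.
	memSize := uint64(pct * float64(tmpfsFree))

	// meta_size: mem_size scaled back up by the mem-ratio (meta_size = mem_size/mem-ratio).
	metaSize := uint64(float64(memSize) / memRatio)

	// data_size: percentage of the blobstore space remaining after subtracting
	// the META and WAL reservations sharing the same SSD.
	adjusted := blobFree - metaSize - walReserved
	dataSize := uint64(pct * float64(adjusted))

	fmt.Printf("mem_size:  %d GiB\n", memSize>>30)
	fmt.Printf("meta_size: %d GiB\n", metaSize>>30)
	fmt.Printf("data_size: %d GiB\n", dataSize>>30)
}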
Summary of behaviour after changes:
The default value of mem-ratio should be 1:1 (mem_size:meta_size) to preserve phase-1 behaviour if the --mem-ratio option is not used.
Phase-2 mode can be enabled by specifying a --mem-ratio value between 1 and 99 (percentage), either when using --size or the --meta-size and --data-size options.
Auto sizes will be limited by ramdisk free space. Increase utility of SSD capacity for metadata by reducing the --mem-ratio value.
The --tier-ratio option will remain incompatible with the --size option when specified as a percentage.
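Illustrative invocations under the behaviour summarised above (pool labels, sizes and the exact --mem-ratio value syntax are placeholders/assumptions, not prescribed usage):

# mem-ratio not given: phase-1 behaviour, mem_size == meta_size
dmg pool create tank0 --size 80%

# phase-2: memory files sized at 25% of the meta-blobs
dmg pool create tank1 --size 80% --mem-ratio 25%

# phase-2 with manual per-rank sizes
dmg pool create tank2 --meta-size 64G --data-size 1T --mem-ratio 25%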
Return VOS file capacity in addition to meta blob size on pool query - https://daosio.atlassian.net/browse/DAOS-16209
Problem statement
To perform the display updates described in DAOS-14422: Update dmg pool create for MD-on-SSD P2 (Resolved), additional information needs to be retrieved in pool query dRPC calls. The meta-blob total size (aggregated across all targets) and min/max/mean per-target stats should be displayed in addition to the memory-file (read: VOS-index file) aggregated total size across all targets.
In the phase 1 implementation
Pool query will send a POOL_TGT_QUERY broadcast to all ranks.
On each rank vos_space_query() is called to get the VOS-pool usage and capacity.
The VOS-file size, stored in the pool durable format vos_pool_df->pd_scm_sz, is used as the SCM capacity in vos_pool_query().
vos_space_query() currently calls into the PMDK/BMEM API to get space usage, so meta-blob usage & capacity (meta-blob capacity == VOS file capacity) is returned in phase 1.
Proposed solution
In phase 2, the SCM stats returned should reflect the meta-blob usage and capacity.
VOS-file size should also be returned in the pool query response, but as a separate field which may need adding in vos_space_query() (wire protocol change).
vos_space_query() needs to be modified to get the correct usage and capacity of the meta blob, including probably storing meta_sz instead of scm_sz in pd_scm_sz.
Prerequisite tidy of control plane pool create logic
Consolidate the pool create logic to improve clarity and coherence, with the aim of simplifying phase-2 related development.
DAOS-9556: Move pool creation with all remaining capacity to the DAOS management server (Resolved)
DAOS-16158: Improve clarity and coherence in pool create workflow (DAOS-14223/PR-13044) (Resolved)
Pass meta blob size through pool create callstack - DAOS-14416: Pass meta_sz through vos_pool_create call-stack (Resolved)
Add an extra parameter to dmg pool create to allow a meta-blob size that is different from the VOS-index file size.