Control-Plane changes

In phase 2 the meta-blob size and the VOS file size must be specifiable separately on pool creation, whereas in phase 1 both values are the same so only one size is specified for both.

Contents

Requirements

User-interface updates

Pool create --size updates

Return VOS file capacity in addition to meta-blob size on pool query

Prerequisite tidy of control plane pool create logic

Pass meta blob size through pool create callstack

 

Requirements - https://daosio.atlassian.net/browse/DAOS-14223

The requirement is to add the capability to specify two extra parameters when calling into the engine over dRPC to create a pool:

  • The per-target VOS-file size (used to store metadata in tmpfs/ramdisk)

  • The per-target meta-blob size (in phase-II this is used to store a superset of metadata changes on meta-role SSDs)

In phase-I, the size of the VOS-file is calculated by dividing the ramdisk size by the number of engine targets. The size of the meta-blob on a meta-role SSD is the VOS-file size rounded up to a whole number of SPDK blob clusters (i.e. the required cluster count multiplied by the SPDK blob cluster size).
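A minimal Go sketch of this phase-I calculation (hypothetical helper and parameter names that do not correspond to the actual control-plane code):

package sizing

// phase1Sizes splits the ramdisk evenly across engine targets and rounds the
// result up to whole SPDK blob clusters to get the meta-blob size.
// Illustrative only; error handling and alignment details are omitted.
func phase1Sizes(ramdiskBytes, clusterBytes, targetCount uint64) (vosFileBytes, metaBlobBytes uint64) {
	vosFileBytes = ramdiskBytes / targetCount
	clusters := (vosFileBytes + clusterBytes - 1) / clusterBytes // round up to whole clusters
	metaBlobBytes = clusters * clusterBytes
	return vosFileBytes, metaBlobBytes
}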

In phase-II, the size of the meta-blob (on-SSD) should be able to be much larger than the VOS-file (on-RAMdisk) and may be auto-calculated to use all available meta-role SSD capacity.

These extra parameter values should be passed from the control plane to the engine on pool create and could be either provided manually over dmg (as per --scm-size and --nvme-size) or calculated in the control plane. Validation should be performed on the inputs to make sure they are viable within the RAM and SSD capacity limits.

Associated display output updates are also required for dmg storage query usage, dmg pool create, dmg pool query and dmg pool list.

 

User-interface updates

The current display output presents pool-related SCM/NVMe info, which isn’t relevant for MD-on-SSD. The suggested changes are described here.

The significant non-presentation-layer change related to this ticket is to provide the “Total memory-file size” as distinct from the “Metadata on NVMe” stats in PoolQuery and PoolCreate dRPC responses. This task is covered in a separate ticket as it is not specifically related to the dmg/presentation layer: https://daosio.atlassian.net/browse/DAOS-16209.

Pool query display output - https://daosio.atlassian.net/browse/DAOS-16326

Currently the space info shows how pool storage is distributed in DAOS PMem mode where there are 2 distinct components, SCM and NVMe:

Pool space info:
- Target(VOS) count:1
- Storage tier 0 (SCM):
  Total size: 2 B
  Free: 1 B, min:0 B, max:0 B, mean:0 B
- Storage tier 1 (NVMe):
  Total size: 2 B
  Free: 1 B, min:0 B, max:0 B, mean:0 B
Rebuild failed, rc=0, status=2

The usage statistics returned are for the meta-blob-on-SSD in tier-0 and the data-blob-on-SSD in tier-1 (usage of “tiers” in this pool context differs from how tiers are used to define engine instance storage, where NVMe tiers map SSDs to “roles”). The suggested output in MD-on-SSD mode therefore reports Total and Free sizes (aggregated from all targets) for the Metadata and Data blobs, plus the total (aggregated) size of all per-target memory (VOS-index) files:

Pool space info:
- Target count:1
- Total memory-file size: 1 B
- Metadata storage:
  Total size: 2 B
  Free: 1 B, min:0 B, max:0 B, mean:0 B
- Data storage:
  Total size: 2 B
  Free: 1 B, min:0 B, max:0 B, mean:0 B
Rebuild busy, 42 objs, 21 recs

The per-pool “Total” and “Free” sizes are shown in the “Metadata storage” and “Data storage” tiers as in the previous display format, while an extra entry, “Total memory-file size”, reports the aggregated size of all per-target VOS-index files across the pool. As with the existing format, “min”, “max” and “mean” are per-target values.
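As an illustration of these aggregation rules, a minimal Go sketch (hypothetical helper, not the actual dmg code) showing that the pool-level value is the sum across targets while min/max/mean stay per-target:

package display

// aggregateFree relates a pool-level value to the per-target statistics:
// the pool value is the sum across targets, while min, max and mean remain
// per-target. Illustrative helper only.
func aggregateFree(perTarget []uint64) (total, min, max, mean uint64) {
	if len(perTarget) == 0 {
		return 0, 0, 0, 0
	}
	min = perTarget[0]
	for _, f := range perTarget {
		total += f
		if f < min {
			min = f
		}
		if f > max {
			max = f
		}
	}
	mean = total / uint64(len(perTarget))
	return total, min, max, mean
}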

Pool create display output - https://daosio.atlassian.net/browse/DAOS-14422

Similarly, dmg pool create response display output isn’t relevant in its current form:

Pool created with 5.66%,94.34% storage tier ratio
-------------------------------------------------
  UUID                 : 0000...
  Service Leader       : 0
  Service Ranks        : [0-2]
  Storage Ranks        : [0-3]
  Total Size           : 42 GB
  Storage tier 0 (SCM) : 2.4 GB (600 MB / rank)
  Storage tier 1 (NVMe): 40 GB (10 GB / rank)

and suggested output:

Memory-file usage

Note that in the pool create output the memory-file (VOS-index on tmpfs) size is not shown and the Metadata storage stats will reflect the meta-blob-on-SSD. If an admin needs to know the memory-file size (which may be different from the meta-blob size) then dmg pool query can be run.

Space usage of the “RAM cache”, also referred to as the “memory file” or “VOS index file”, is not displayed as its usefulness is questionable. The memory file represents a dynamic sliding-window subsection of the metadata on SSD and doesn’t reflect metadata space usage across all SSD blobs.

Pool create commandline options

dmg pool create input options will also be changed to reflect MD-on-SSD, e.g. the manual pool storage size parameters --scm-size and --nvme-size are to be replaced with --meta-size and --data-size. An additional --mem-ratio option will be added to tune the memory-file to meta-blob capacity relationship in MD-on-SSD mode (the mode is indicated by specifying bdev “roles” in the server config file).

The RAM-disk/tmpfs space reserved for “memory files” will be determined by adjusting the --mem-ratio option to the dmg pool create command. The size of the per-target in-memory VOS-file will be equal to the meta-blob storage size multiplied by the --mem-ratio fraction. The meta-blob size is defined as the per-rank space reserved for “metadata blobs on SSD” and is requested using the --meta-size option, whilst the per-rank space reserved for “data” blobs is requested using the --data-size option. Both of these values are automatically calculated if the --size auto option is used.
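As an illustration (values and exact flag syntax are arbitrary, using the option names proposed above), a phase-2 create request might look like:

dmg pool create tank --meta-size 100G --data-size 1T --mem-ratio 50%

With 16 targets per rank this would request a per-target meta-blob of roughly 100G/16 ≈ 6.3G and a per-target in-memory VOS-file of about half that (≈3.1G, from the 50% mem-ratio), while the 1T of data-blob space per rank is likewise spread across the rank’s targets.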

Storage query usage display updates - https://daosio.atlassian.net/browse/DAOS-16327

dmg storage query usage should report total SSD free space per-tier per-rank, regardless of meta/data usage split. The rationale behind this is that SCM/NVMe differentiation is not useful for MD-on-SSD, so just report blobstore total/free space across all tier SSDs. The main use case envisaged for dmg storage query usage is for an administrator to see how much space is available on each NVMe tier for the purpose of creating more pools. As such, the table layout should be one row per engine, with columns for total/free for each tier.

Pool list display updates - https://daosio.atlassian.net/browse/DAOS-16328

Consider and implement updates that create relevant pool list output in MD-on-SSD phase-2 mode.

 

These changes are only for MD-on-SSD mode; in PMem or non-MD-on-SSD tmpfs mode the original display output will persist. There shouldn’t be many control-API changes for this display update, so JSON output for the commands should remain mostly unchanged.

Old options (e.g. --scm-size) will be supported in MD-on-SSD mode for backward compatibility, but when they are used the mem-ratio will be hardcoded to a value of 1 (phase-1 mode) and the memory-file size will be equal to the meta-blob size.

Pool create --size updates - https://daosio.atlassian.net/browse/DAOS-16160

Internal changes to properly support the --size pool create option.

This section covers implementation improvements to the total-size and percentage-of-resources pool size options when creating a pool in MD-on-SSD modes. An extra option to specify the ratio between the VOS-file size and the meta-blob size is also covered.

Clarification of terminology within the context of a DAOS pool

  • mem_size: vos-file size in ram-disk tmpfs (1 vos-file per-target, size is per-target)

  • meta_size: meta-blob size on META-role SSD (size is per-target)

  • data_size: data-blob size on DATA-role SSD (size is per-target)

  • pmem-mode: original non-md-on-ssd mode where metadata is stored on pmem

  • phase1-mode: md-on-ssd where mem_size == meta_size

  • phase2-mode: md-on-ssd where mem_size < meta_size

  • mem-ratio: ratio of mem_size to meta_size

Note: "{mem,meta,data}_size" refer to per-target allocations as units of blobs or files. This is distinct from the dmg pool create commandline parameters --{meta,data}-size which are used to specify per-rank allocations to be spread across all engine targets assigned to the rank.
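For example, requesting --meta-size 32G on a rank with 16 targets results in a per-target meta_size of 2G; with a mem-ratio of 50% the per-target mem_size would then be 1G.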

Clarification of some basic assumptions

  • There is one vos file per target.

  • There is one meta blob per vos file.

  • If meta+data roles share a tier, meta and data blobs will be on the same SSD.

  • The engine's BIO module assigns SSDs from each tier to a VOS target in a round-robin manner. Depending on how roles have been assigned to NVMe tiers in the server configuration file (yaml), each VOS target may have 1 (all roles assigned to a single NVMe tier in yaml), 2 (roles distributed across two NVMe tiers in yaml) or 3 SSDs (3 NVMe tiers, each assigned only one role in yaml) assigned.

  • When a pool is created, each VOS target will create the meta, wal and data blobs on its assigned SSD(s) respectively.

  • For a single DAOS pool, each target will have a VOS pool, and each VOS pool has a meta-blob. If the meta SSD is used by one target only, there will be one meta-blob on the SSD; if it is shared by multiple targets, there will be multiple meta-blobs.

  • If there are 2 meta-role SSDs (assuming wal and data roles are on separate SSDs/tiers), the two meta-role SSDs will be assigned to targets in a round-robin manner; assuming there are 8 targets, each meta-role SSD will be shared by 4 targets. A sketch of this assignment follows the list.
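A minimal Go sketch of the round-robin assignment described above (illustrative only; the real logic lives in the engine's BIO module):

package sizing

// assignRoundRobin maps each VOS target index to one of the SSDs in a role
// tier in round-robin order.
func assignRoundRobin(ssds []string, targetCount int) map[int]string {
	assigned := make(map[int]string, targetCount)
	for tgt := 0; tgt < targetCount; tgt++ {
		assigned[tgt] = ssds[tgt%len(ssds)]
	}
	return assigned
}

// With two meta-role SSDs and eight targets, targets 0,2,4,6 land on the
// first SSD and targets 1,3,5,7 on the second, so each SSD is shared by
// four targets.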

Notes on syntax

dmg pool query

  • output metrics in capitals are aggregated i.e. total pool values

  • output metrics in lowercase are per-target

dmg pool create

  • --size is either a percentage of available capacity or a total size in bytes

  • --{scm,nvme,meta,data}-size are used to specify per-rank allocations to be spread across all VOS targets assigned to the rank

Problem statement

In MD-on-SSD mode the --size option can be used to give either a percentage or a total pool size.

The --size option has 2 modes:

  • Percentage: The SCM/ramdisk and NVMe capacity across all servers is evaluated and the given percentage is selected from these capacities. The SCM:NVMe ratio of the pool will end up being the SCM:NVMe total capacity ratio on the hosts.

  • Total Pool Size: A pool of the given total size is created; the SCM:NVMe ratio and the list of ranks can be specified on the commandline. By default the ratio will be 6:94. The server capacities are evaluated and the total size is divided up across the available ranks.
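For example (illustrative numbers), dmg pool create --size 1TB on four ranks with the default 6:94 ratio would reserve roughly 60GB of SCM in total (about 15GB per rank) and roughly 940GB of NVMe (about 235GB per rank).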

Neither of these options works completely for MD-on-SSD, mostly because they assume that SSDs are used exclusively for DATA and SCM exclusively for MD. When we have shared NVMe roles these assumptions break, and so do the --size modes. The current implementation counts any NVMe SSD with the DATA role as fully usable but does not consider the META role in any calculation. The SCM pool component is assumed to be synonymous with the mem-file and NVMe META capacity is not factored in.

As a result of these limitations the user could run into confusing situations when using the --size option with MD-on-SSD.

Adjustments also need to be made to handle phase2-mode.

WAL-related sizes should be relatively insignificant in production environments, and concern for WAL sizes should not be exposed to the user.

Proposed solution

To enable --size to work properly in phase2-mode (mem_size < meta_size), a new option (similar to the current --tier-ratio option) needs to be provided to specify the ratio between mem_size and meta_size. --mem-ratio will be used to specify this ratio during dmg pool create when the --size option is specified. When this option isn't specified, a default mem_size:meta_size ratio will be used.

 

Solution design

To allow automation of pool storage component size calculations:

  • mem_size will be derived from the --size value and tmpfs capacity

  • meta_size will be derived from mem_size and --mem-ratio (or default ratio)

  • data_size will be derived from meta_size, data tier capacity (blobstore free space) and role assignment in server yaml.

Tasks required when --size option is specified as total size:

  • Keep behaviour consistent with PMem mode where total pool size includes both DATA and META components (no change required)

  • Derive pool meta_size and data_size components using total_size and tier-ratio. (no change required)

  • Calculate mem_size as meta_size * mem-ratio (1 if not set). Ramdisk/tmpfs free capacity should then be checked to see if it can satisfy mem_size. (No change required; a pre-check, if needed, can be implemented at a later date, as create will presumably fail with no-space due to fallocate.)
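A minimal Go sketch of this total-size path (hypothetical helper names; per-rank/per-target splitting and validation omitted):

package sizing

// totalSizeMode derives the storage components when --size is given as a
// total pool size: meta/data are split by tier-ratio, mem_size follows from
// the mem-ratio.
func totalSizeMode(totalBytes uint64, tierRatio, memRatio float64) (metaSize, dataSize, memSize uint64) {
	metaSize = uint64(tierRatio * float64(totalBytes)) // e.g. 0.06 of the total by default
	dataSize = totalBytes - metaSize                   // remainder goes to data blobs
	memSize = uint64(float64(metaSize) * memRatio)     // mem-ratio defaults to 1 (phase-1 behaviour)
	return metaSize, dataSize, memSize
}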

Tasks required when --size option is specified as percentage:

  • Derive mem_size and data_size components using percentage of available ramdisk and SSD capacity. The rationale for deriving space usage from ramdisk usage is that this is most likely to be the limiting factor as opposed to SSD usage.

  • Calculate mem_size as percentage of ramdisk/tmpfs free capacity.

  • Calculate meta_size as mem_size/mem-ratio.

  • data_size in pool taken from bdev scan where adjustments have already been made…

    • Subtract meta_size from free blobstore space if an SSD shares META+DATA roles

    • Subtract WAL size from free blobstore space if an SSD shares WAL+DATA roles

    • Adjustments updated to take account of mem-ratio

    • Adjustments updated to take account of target striping across bdev tier

  • Take percentage of remainder
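A minimal Go sketch of the percentage path with the META-sharing adjustment described above (hypothetical helper names; the WAL adjustment and per-target striping are omitted for brevity):

package sizing

// percentMode derives the storage components when --size is given as a
// percentage of available capacity.
func percentMode(fraction, memRatio float64, tmpfsFree, blobstoreFree uint64, metaSharesDataSSD bool) (memSize, metaSize, dataSize uint64) {
	memSize = uint64(fraction * float64(tmpfsFree)) // percentage of ramdisk/tmpfs free space
	metaSize = uint64(float64(memSize) / memRatio)  // meta_size = mem_size / mem-ratio
	usable := blobstoreFree
	if metaSharesDataSSD && metaSize < usable {
		usable -= metaSize // reserve metadata space on the shared SSD first
	}
	dataSize = uint64(fraction * float64(usable)) // then take the percentage of the remainder
	return memSize, metaSize, dataSize
}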

 

Summary of behaviour after changes:

  • The default value of mem-ratio should be 1:1 (mem_size:meta_size) to preserve phase-1 behaviour if --mem-ratio option is not used.

  • phase-2 mode can be enabled by specifying a --mem-ratio value between 1 and 99 (a percentage), either with the --size option or with the --meta-size and --data-size options.

  • Auto sizes will be limited by ramdisk free space. To increase the utilisation of SSD capacity for metadata, reduce the --mem-ratio value.

  • --tier-ratio option will remain incompatible with --size option when specified as a percentage.

 

Return VOS file capacity in addition to meta blob size on pool query - https://daosio.atlassian.net/browse/DAOS-16209

Problem statement

To perform the display updates described in https://daosio.atlassian.net/browse/DAOS-14422, additional information needs to be retrieved in pool query dRPC calls. The meta-blob total size (aggregated across all targets) and min/max/mean per-target stats should be displayed, in addition to the memory-file (i.e. VOS-index file) total size aggregated across all targets.

In the phase 1 implementation:

  • Pool query will send POOL_TGT_QUERY broadcast to all ranks

  • On each rank vos_space_query() is called to get the VOS-pool usage and capacity

  • The VOS-file size, stored in the pool durable format vos_pool_df->pd_scm_sz, is used as the SCM capacity in vos_pool_query()

  • vos_space_query() currently calls into the PMDK/BMEM API to get space usage, so meta-blob usage & capacity (meta-blob capacity == VOS-file capacity) is returned in phase 1

Proposed solution

In phase 2, the SCM stats returned should reflect the meta-blob usage and capacity.

  • The VOS-file size should also be returned in the pool query response, but as a separate field, which may need adding in vos_space_query() (a wire-protocol change).

  • vos_space_query() needs to be modified to get the correct usage and capacity of the meta-blob, which probably includes storing meta_sz instead of scm_sz in pd_scm_sz. An illustrative sketch of the extended response shape follows.
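As an illustration only (field and type names here are hypothetical and do not correspond to the actual dRPC/protobuf definitions), the pool query response could carry the aggregated memory-file size alongside the existing per-tier stats:

package mgmt

// Illustrative sketch of a pool query response extended with the aggregated
// memory-file size.
type tierStats struct {
	Total uint64 // aggregated capacity across all targets
	Free  uint64 // aggregated free space across all targets
	Min   uint64 // per-target minimum free
	Max   uint64 // per-target maximum free
	Mean  uint64 // per-target mean free
}

type poolQueryResp struct {
	TierStats    []tierStats // [0] metadata blobs, [1] data blobs
	MemFileBytes uint64      // new field: total size of all per-target VOS-index files
}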

 

Prerequisite tidy of control plane pool create logic

Consolidate the control-plane pool create logic to improve clarity and coherence, with the aim of simplifying phase-2 related development.

 

Pass meta blob size through pool create callstack - https://daosio.atlassian.net/browse/DAOS-14416

Add an extra parameter to dmg pool create to allow a meta-blob size that is different from the VOS-index file size.