Proposal for DAOS tools consolidation.
Tools | dmg | daos | dcp |
---|---|---|---|
Target | Administrator | Users | Users |
API | Control plane API (Go) | Data plane API (C) | Data plane API (C) |
Authentication | Certificate | daos_agent | daos_agent |
Lustre Equivalent | lctl/mkfs/mount/IML(partially) | lfs | pcp |
Functionality | | | |
Syntax: dmg [resource] [action] [args]
daos [resource] [action] [args]
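The `[resource] [action] [args]` shape maps naturally onto nested subcommands. The following is a minimal Python sketch of that layout, not the actual dmg or daos implementation. It wires up only `pool query` with the `--pool`, `--sys` and `--svc` options listed in the tables of this proposal; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser(tool):
    """Build a parser for `<tool> [resource] [action] [args]` (illustrative only)."""
    parser = argparse.ArgumentParser(prog=tool)
    # First positional level: the resource (pool, container, ...).
    resources = parser.add_subparsers(dest="resource", required=True)

    # Only "pool" is sketched here; other resources would follow the same pattern.
    pool = resources.add_parser("pool")
    # Second positional level: the action on that resource.
    pool_actions = pool.add_subparsers(dest="action", required=True)

    query = pool_actions.add_parser("query")
    query.add_argument("--pool", required=True, metavar="UUID")
    query.add_argument("--sys", metavar="SYSNAME")
    query.add_argument("--svc", metavar="SRVRANKLIST")
    return parser


args = build_parser("daos").parse_args(
    ["pool", "query", "--pool=POOL_UUID", "--sys=SYSNAME"]
)
print(args.resource, args.action, args.pool, args.sys)
```

Keeping the resource and the action as separate parser levels means that both tools can share the same resource names while exposing different sets of actions.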
Specifying DAOS system name (formerly known as server group):
Specifying storage server ranks (e.g., pool create/add-storage/del-storage, and system drain/reintegrate/kill/exclude)
Specifying added or removed pool service replica ranks (for pool add-svc/del-svc):
Specifying number of pool service replicas (for pool create):
Specifying pool service replica ranks (legacy: currently required, but eventually will not be needed):
Specifying a fault domain / entire rack of servers (e.g., for pool create/add-storage/del-storage and system drain/reintegrate/kill/exclude)
Specifying targets (e.g., for system drain/reintegrate):
Specifying container snapshots
Notes:
There are two variants of the commands: 1) the user provides the pool and container UUIDs; and 2) the user provides only the unified namespace path to which the container is linked. In the second variant, the user provides neither the pool UUID nor the container UUID; the implementation resolves both by reading the extended filesystem attributes of the specified entity in the path.
The remaining container commands use the --cont=UUID form (the --path= option is available, but is not shown).
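The two command variants can be sketched as a single resolution step. This is a rough Python illustration under stated assumptions: the extended attribute name `user.example.daos` and the `POOL_UUID/CONT_UUID` value layout are hypothetical placeholders, and the real unified namespace encoding may differ. The `read_xattr` parameter is also an assumption, added so the logic can be exercised without a filesystem that carries the attribute.

```python
import os

# Hypothetical attribute name and value layout, used only for illustration.
XATTR_NAME = "user.example.daos"


def resolve_ids(pool=None, cont=None, path=None, read_xattr=None):
    """Return (pool_uuid, cont_uuid) for either command variant."""
    if path is not None:
        # Variant 2: only the path is given; UUIDs come from the xattr.
        if pool is not None or cont is not None:
            raise ValueError("give either --path or --pool/--cont, not both")
        reader = read_xattr or os.getxattr  # os.getxattr is Linux-only
        raw = reader(path, XATTR_NAME).decode()
        pool_uuid, cont_uuid = raw.split("/", 1)  # assumed "POOL/CONT" layout
        return pool_uuid, cont_uuid
    # Variant 1: the user supplies both UUIDs explicitly.
    if pool is None or cont is None:
        raise ValueError("variant 1 needs both --pool and --cont")
    return pool, cont


# Stand-in reader that mimics an attribute lookup on a linked directory.
def fake_reader(path, name):
    return b"pool-1234/cont-5678"


print(resolve_ids(path="/mnt/project", read_xattr=fake_reader))
print(resolve_ids(pool="pool-1234", cont="cont-5678"))
```

Rejecting a mix of `--path` with explicit UUIDs matches the note in the container table that `--pool=` should not be specified when querying by path.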
Tool | Component | Component Args | Operation | Operation Args | Description, Notes / Issues | API | Implemented? |
dmg | storage | scan | discover all storage available on the nodes applying filters from yaml file | Y | |||
query | report status & stats about storage | ||||||
query smd | --devices --pools | query the SMD device table (--devices) and/or the SMD pool table (--pools). | Y | ||||
query nvme-health | --hostlist="HOST:PORT" | query raw SPDK NVMe device health stats. Returns all stats for all NVMe SSDs on all hosts in hostlist. | Y | ||||
query blobstore-health | --devuuid="DEVICE_UUID" --tgtid="VOS_TGT_ID" | query BIO in-memory health data. Returns all BIO device health data and I/O errors & checksum error stats for given device UUID or VOS target ID. | Y | ||||
query device-state | --devuuid="DEVICE_UUID" | query the current device state of the given device UUID stored in SMD (i.e., NORMAL or FAULTY). | Y | ||||
set-faulty | --devuuid="DEVICE_UUID" | allow admin/user to manually set the device state of a given device to FAULTY (will trigger faulty device reaction callbacks). | Y | ||||
prep | device-specific configuration that may require a reboot. E.g. setting up AEP DIMMs in interleaved mode | Y | |||||
burnin | run fio against storage devices to verify that they operate correctly and to validate their performance. | ||||||
format | reset content of NVMe SSDs, format SCM with ext4, mount SCM and start the DAOS service (io_server) | Y | |||||
fwupdate | firmware update | ||||||
network | scan * | list discovered network interfaces * suggestion: report which interfaces and OFI providers would be used with the discovered interfaces | Y | ||||
query * | report status & stats about network interfaces * suggestion: perform a local test to indicate in advance whether an OFI runtime error is going to occur with the interface, for example as seen with daos_server: "na_ofi_getinfo(): fi_getinfo failed, rc=-61(No data available)". That error came from a VM build that did not have PSM2 devel. | ||||||
system | --sys=SYSNAME | query | report service status on all or a subset of the servers | Y | |||
list-pools | list all pools created (do we want an alias for this as "pool list"?) | Y | |||||
Same as above | list-ranks | list all DAOS system server ranks in the specified system ("query" currently lists all system ranks, do we need list-ranks?) | |||||
Same as above | stat * | report various stats about the service * alternative command name: get-statistics | |||||
Same as above | log * | report service logs * alternative: get-log | |||||
Same as above | debug * | change debug mask * alternative: set-debug | |||||
Same as above | drain | --fdomains=FDRANKLIST --fd=FDRANKLIST * --ranks=SRVRANKLIST --targets=SRVRANK:TGTRANK LIST --tgt=SRVRANK:TGTRANK LIST | drain a list of racks, list of servers, or list of targets in preparation for maintenance * use --fd= as a convenience (shorter than --fdomains) This one really does require ability to specify at target or SSD level. Use case: one of the SSDs in a server is about to fail, hot swap it after a drain and before a reintegrate. | ||||
Same as above | reintegrate | --fdomains=FDRANKLIST --fd=FDRANKLIST --ranks=SRVRANKLIST --targets=SRVRANK:TGTRANK LIST --tgt=SRVRANK:TGTRANK LIST | reintegrate a drained component | ||||
Same as above | stop | full shutdown of the DAOS service | Y | ||||
Same as above | start | restart service after full shutdown | Y | ||||
Same as above | kill | --fdomains=FDRANKLIST --fd=FDRANKLIST --ranks=SRVRANKLIST | abrupt shutdown of a particular server (really: set of servers, or whole fault domains/racks of servers) | ||||
Same as above | exclude * | --fdomains=FDRANKLIST --fd=FDRANKLIST --ranks=SRVRANKLIST | Remove node from DAOS system (really: set of servers, or whole fault domains/racks of servers) * alternative command names: del-nodes, del-servers? | ||||
scrub | start | Start background checksum scrubbing process (or resume after prior stop) | |||||
stop | Stop scrubbing process | ||||||
query | Report status of background checksum scrubbing process (e.g., number of corruptions found, percentage of storage scanned so far) | ||||||
pool | --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST * | query | Report pool status * (applies to all pool commands) given a pool UUID and DAOS system name, the implementation is eventually expected to look up the existing pool service replica SRVRANKLIST itself (i.e., removing the need for --svc=) | daos_pool_query() | Y | ||
Same as above | stat * | Get pool statistics * alternative command name: get-statistics | ? | ||||
Same as above | get-prop * | Get pool properties * alternative: prop (but I like having the commands be "verbs") | ? | ||||
set-prop | TBD | Y | |||||
Same as above | get-acl * overwrite-acl update-acl delete-acl | Get/set/delete pool access control | ? | Y | |||
Same as above | get-attr set-attr del-attr list-attrs | --attr=ATTRNAME (get,del) --value=VALUESTR (set) no arguments for list-attrs | Get / set user attributes | daos_pool_attr_get() daos_pool_attr_set() daos_pool_attr_list() | |||
Same as above | list-containers list-cont | List all containers in the pool | |||||
N/A | create | --user=USERNAME@, --group=GROUPNAME@, --mode=MODE --nsvc=NREP --sys=SYSNAME --ranks=SRVRANKLIST * --scm-size=SIZE --nvme-size=SIZE --fdomains=FDRANKLIST --fd=FDRANKLIST --acl-file=FILE | * change existing dmg --target= to --ranks= | daos_pool_create() | Y | ||
Same as above | destroy | --force | daos_pool_destroy() | Y | |||
Same as above | add-storage * | --fdomains=FDRANKLIST --fd=FDRANKLIST --ranks=SRVRANKLIST | Add a storage fault domain (rack) or list of servers to an existing pool * formerly named "extend" | daos_pool_extend() | |||
Same as above | del-storage * | --fdomains=FDRANKLIST --fd=FDRANKLIST --ranks=SRVRANKLIST | Remove a fault domain (rack) or list of servers from a pool. * formerly named "exclude" | daos_pool_exclude() | |||
Same as above | add-svc | --ranks=MORESRVRANKLIST | Add a pool service replica. Use --svc= to specify the current list of metadata service server ranks and --ranks= to specify the new ranks to add to the set. | ? | |||
Same as above | del-svc | --ranks=OLDSRVRANKLIST | Remove a pool service replica | ? | |||
Same as above | rebuild | Manage rebuild for a pool | ? | ||||
Same as above | rebalance | Trigger rebalance after add-storage(extend) by racks/servers | ? | ||||
Same as above | resize | --scm-size=SIZE --nvme-size=SIZE | Extend the size of a pool's existing targets | ||||
Same as above | evict | Evict all active pool connections | daos_pool_evict() | ||||
Same as above | lurk * | Dump activity on the pool * alternative: get-log | ? | ||||
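The drain and reintegrate rows above take `--targets=SRVRANK:TGTRANK LIST`. The sketch below shows one way such a value could be parsed in Python. It assumes the list is comma-separated, which this proposal does not specify, and `parse_targets` is a hypothetical helper name rather than part of dmg.

```python
def parse_targets(value):
    """Parse 'SRVRANK:TGTRANK,...' into sorted (server_rank, target_rank) tuples.

    The comma separator is an assumption; the proposal only names the
    SRVRANK:TGTRANK pair format.
    """
    pairs = set()
    for item in value.split(","):
        server, sep, target = item.strip().partition(":")
        if not sep or not server.isdecimal() or not target.isdecimal():
            raise ValueError(f"expected SRVRANK:TGTRANK, got {item!r}")
        pairs.add((int(server), int(target)))  # the set drops duplicates
    return sorted(pairs)


# Two targets on server rank 2 and one on server rank 5.
print(parse_targets("2:0,2:1,5:3"))  # [(2, 0), (2, 1), (5, 3)]
```

Parsing to (server, target) pairs keeps the SSD-level use case expressible: a single failing device can be drained without touching the other targets on the same server.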
Tool | Component | Operation | Arguments | Description and Notes / Issues | API | ||
daos | pool * | --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST | query, stat, get-prop | query: report pool status (rebuild/rebalancing status, ...). stat: report various stats about the pool (size, usage, number of containers, ...), same as dmg pool stat. get-prop: show pool properties. Note: daos pool is mostly "read-only", whereas "dmg pool" is used by the administrator, so the set-prop command is not available here. | Y (query, get-prop). Missing statistics support for "stat", but it may stay an admin/dmg thing? | ||
Same as above | get-attr set-attr del-attr list-attrs | --attr=ATTRNAME (get,del) --value=VALUESTR (set) no arguments for list-attrs | Y | ||||
Same as above | list-containers list-cont | List all containers in the specified pool | Y | ||||
container cont | Pool related (same as daos pool): --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST Container (choose 1): --cont=UUID OR --path=FILESYSDIR | query * | show container status query by container UUID with --cont or query by unified namespace (directory or file) --path=FSENTITY (like current duns resolve_path). Note: do not specify --pool= when querying by path. * alternative: get-status | daos_cont_query() | Y | ||
Same as above | stat | show various container statistics * alternative command name: get-statistics | ? | Missing statistics/metrics support for "stat". | |||
Same as above | get-attr set-attr del-attr list-attrs | --attr=ATTRNAME (get,del) --value=VALUESTR (set) no arguments for list-attrs | set/retrieve user attributes | Y | |||
Same as above | get-prop | * is there such a thing as getting container properties (like pool properties)? | ? | Y | |||
set-prop | TBD | Y (currently, only the "label" property is supported). | |||||
Same as above | list-objects list-obj | Enumerate all objects in the specified container | Y | ||||
Pool related: same as above Container related: (--cont=UUID) OR --path=FSENTITY --type=CONTTYPE --oclass=OBJCLASS --chunk_size=NBYTES | create | (implementation generates CUUID if not specified) Also optional are --path/type/oclass/chunk_size | create a container with specific properties (including type, object class, and chunk_size if provided) and link it with the path (if provided - similar to duns link_path, create a POSIX container with DFS-specific parameters). CONTTYPE: posix, hdf5 OBJCLASS: tiny, small, large, R2S, R2, repl_max | daos_cont_create() | Y | ||
Same as query above | destroy | --force * | destroy a container based on UUID or path (unlink path as well if provided) * current dcont destroy does not have the --force option | daos_cont_destroy() | Y | ||
Same as query above | create-snap, list-snaps, destroy-snap | --snap=NAME (create) --epc=NUM (destroy) --epcrange=RANGE (destroy) | create-snap: take a container snapshot. list-snaps: list all snapshots in the container. destroy-snap: destroy a single snapshot by epoch number, or all snapshots between two epoch numbers (inclusive of the begin/end epoch numbers?). | Y | |||
Same as query above | rollback | --snap=NAME --epc=NUM | Revert a container back to a previous snapshot specified by name or epoch number. | ||||
verify * | Validate content of a POSIX container * TBD - this was in a separate "daos fs" command section that has been merged into "daos cont" | ||||||
object obj | Pool, cont related (same as daos pool and daos cont): --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID Object: --oid=OID | query * | TBD: Epoch? | Show the layout of a particular DAOS object including all the targets where it is distributed | daos_obj_open()? daos_obj_fetch()? | Y | |
Same as above | list-keys | daos_obj_list_dkey()? daos_obj_list_akey() | |||||
Same as above | dump | Dump content of an object |
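The destroy-snap row in the container section leaves open whether `--epcrange=RANGE` includes its begin and end epochs. The Python sketch below illustrates the inclusive reading only, so the consequence of that choice is concrete. The `BEGIN-END` text format and the helper name `snapshots_in_range` are assumptions made for this example and are not defined by the proposal.

```python
def snapshots_in_range(epochs, epcrange):
    """Return the snapshot epochs selected by 'BEGIN-END', both ends inclusive.

    The 'BEGIN-END' format is assumed for illustration only.
    """
    begin_text, sep, end_text = epcrange.partition("-")
    if not sep:
        raise ValueError(f"expected BEGIN-END, got {epcrange!r}")
    begin, end = int(begin_text), int(end_text)
    if begin > end:
        raise ValueError("range begin must not exceed range end")
    # Inclusive comparison on both sides: snapshots taken exactly at the
    # begin or end epoch are selected as well.
    return [epoch for epoch in sorted(epochs) if begin <= epoch <= end]


print(snapshots_in_range([10, 20, 30, 40], "20-40"))  # [20, 30, 40]
```

With an exclusive end, the same call would leave the snapshot at epoch 40 in place, which is the behavioral difference the open question in the table needs to settle.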