DAOS Tools

Proposal for DAOS tools consolidation.






Control plane API (Go)

Data plane API (C)

Data plane API (C)





Lustre Equivalent




  • Storage provisioning
  • Burn-in
  • Firmware update
  • Data plane mgmt & monitoring
  • Configure/monitor scrubbing
  • Pool mgmt
  • Telemetry
  • Pool query
  • Container mgmt
  • Unified namespace mgmt
  • Container user attributes
  • Snapshots
  • Object debugging
  • POSIX container configuration
  • Parallel copy of POSIX containers
  • HDF5-level copy
  • Container parking

Syntax: dmg  [resource] [action] [args]
        daos [resource] [action] [args]
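The syntax above can be sketched as a small splitter that separates the tool, resource, and action from the remaining arguments. This is only an illustration of the proposed layout, not the implementation of dmg or daos; the names `Command` and `split_command` are hypothetical.

```python
from typing import List, NamedTuple

# The two tools named in this proposal.
TOOLS = ("dmg", "daos")


class Command(NamedTuple):
    tool: str        # "dmg" or "daos"
    resource: str    # e.g. "pool", "container", "system"
    action: str      # e.g. "create", "destroy", "list-pools"
    args: List[str]  # remaining --option=value arguments, unparsed


def split_command(argv: List[str]) -> Command:
    """Split argv into TOOL RESOURCE ACTION [ARGS...] (hypothetical helper)."""
    if len(argv) < 3:
        raise ValueError("expected: TOOL RESOURCE ACTION [ARGS...]")
    tool, resource, action, *args = argv
    if tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    return Command(tool, resource, action, args)
```

For example, `split_command("dmg pool create --sys=daos_server".split())` yields the resource `pool`, the action `create`, and one unparsed argument.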

Proposal: High Level Characteristics

Proposal: Characteristics: Resource Names Summary

Specifying DAOS system name (formerly known as server group):

  • --sys-name=SYSNAME ; or --sys=SYSNAME (example: --sys=daos_server)

Specifying storage server ranks (e.g., pool create/add-storage/del-storage, and system drain/reintegrate/kill/exclude)

  • --ranks=SRVRANKLIST (example: --ranks=0,1,2)

Specifying added or removed pool service replica ranks (for pool add-svc/del-svc):

  • --ranks=SRVRANKLIST (example: --ranks=0,1,2)

Specifying number of pool service replicas (for pool create):

  • --nsvc=NUM

Specifying pool service replica ranks (legacy: currently required, but eventually will not be needed):

  • --svc=SRVRANKLIST (example: --svc=1,2,3)

Specifying a fault domain / entire rack of servers (e.g., for pool create/add-storage/del-storage and system drain/reintegrate/kill/exclude)

  • --fdomains=FDRANKLIST (often a single item, but keeping a list for flexibility)
  • --fd=FDRANKLIST (shorter option name for convenience)

Specifying targets (e.g., for system drain/reintegrate):

  • List of Rank:Target pairs (--targets=SRVRANK:TGTRANK LIST ; or --tgt=SRVRANK:TGTRANK LIST)
    • Server 0 targets 0 and 1 (0:0,0:1)
    • Server 1 targets 2 and 4 (1:2,1:4)
    • Server 2 targets 0 and 1 (2:0,2:1)
    • (whole list all together) --tgt=0:0,0:1,1:2,1:4,2:0,2:1
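As a rough sketch of how a tool might interpret the Rank:Target pair list above, the following groups target indices by server rank. The function name `parse_targets` is hypothetical, and the strict rejection of malformed pairs is an assumption rather than something this proposal specifies.

```python
from collections import defaultdict
from typing import Dict, List


def parse_targets(value: str) -> Dict[int, List[int]]:
    """Parse 'SRVRANK:TGTRANK,...' into {server rank: [target, ...]}."""
    by_rank: Dict[int, List[int]] = defaultdict(list)
    for pair in value.split(","):
        rank_str, sep, tgt_str = pair.strip().partition(":")
        # Assumption: both halves must be non-negative integers.
        if not sep or not rank_str.isdigit() or not tgt_str.isdigit():
            raise ValueError(f"bad SRVRANK:TGTRANK pair: {pair!r}")
        by_rank[int(rank_str)].append(int(tgt_str))
    return dict(by_rank)
```

With the combined example from the list, `parse_targets("0:0,0:1,1:2,1:4,2:0,2:1")` groups targets 0 and 1 under server 0, targets 2 and 4 under server 1, and targets 0 and 1 under server 2.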

Specifying container snapshots

  • named snapshot: --snap=NAME
  • snapshot identified by a single epoch number: --epc=NUM
  • snapshots that sit within a specified range of epoch numbers: --epcrange=M-N
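The --epcrange=M-N form can be illustrated with a short sketch that parses the range and selects the snapshot epochs it covers. Whether the range bounds are inclusive is still an open question in this proposal; the sketch assumes they are. The function names are hypothetical.

```python
from typing import List, Tuple


def parse_epoch_range(value: str) -> Tuple[int, int]:
    """Parse 'M-N' into (M, N), requiring M <= N."""
    begin_str, sep, end_str = value.partition("-")
    if not sep or not begin_str.isdigit() or not end_str.isdigit():
        raise ValueError(f"bad epoch range: {value!r}")
    begin, end = int(begin_str), int(end_str)
    if begin > end:
        raise ValueError(f"epoch range is reversed: {value!r}")
    return begin, end


def select_snapshots(epochs: List[int], value: str) -> List[int]:
    """Return the snapshot epochs inside the range.

    Assumption: both bounds are inclusive.
    """
    begin, end = parse_epoch_range(value)
    return [e for e in epochs if begin <= e <= end]
```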

Proposal: Highlighted Operations and Tool Command Lines

Proposal: Highlighted Operations: System

  • List all pools in a DAOS system
    •  dmg system list-pools

Proposal: Highlighted Operations: Pool

  • Create (dmg pool create)
    • by server rank list
      • Specify only SCM storage
        •  dmg pool create --sys=SYSNAME --uid=UID --gid=GID --mode=MODE --nsvc=NREP --ranks=SRVRANKLIST --scm-size=SIZE
      • Specify SCM + NVMe storage
        • dmg pool create --sys=SYSNAME --uid=UID --gid=GID --mode=MODE --nsvc=NREP --ranks=SRVRANKLIST --scm-size=SIZE --nvme-size=SIZE
    • by fault domain (rack) rank list (using --fd shorthand for --fdomains)
      • dmg pool create --sys=SYSNAME --uid=UID --gid=GID --mode=MODE --nsvc=NREP --fd=FDRANKLIST --scm-size=SIZE --nvme-size=SIZE
  • Add pool service replicas (dmg pool add-svc)
    • Usage: dmg pool add-svc --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST  --ranks=MORESRVRANKSADDLIST
  • Remove pool service replicas (dmg pool del-svc)
    • Usage: dmg pool del-svc --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST  --ranks=OLDSRVRANKSDELLIST
  • Destroy pool in a DAOS system (dmg pool destroy)
    • Usage: dmg pool destroy --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST [--force]
  • List all containers in a pool (dmg pool list-containers ; or shorter command equivalent dmg pool list-cont)
    • Usage: daos pool list-containers --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST
    • Usage: daos pool list-cont --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST
  • Add storage (dmg pool add-storage aka extend)
    • Rack (all servers in the rack, and all targets in all of the servers; using --fd shorthand for --fdomains)
      • dmg pool add-storage --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST  --fd=FDRANKLIST
    • Servers (all targets on the servers)
      • dmg pool add-storage --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST  --ranks=SRVRANKLIST
  • Remove storage (dmg pool del-storage aka exclude)
    • Rack (all servers in the rack, and all targets in all of the servers; using --fd shorthand for --fdomains)
      •  dmg pool del-storage --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST  --fd=FDRANKLIST
    • Servers (all targets on the servers)
      • dmg pool del-storage --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --ranks=SRVRANKLIST
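To make the two pool create variants concrete, here is a minimal sketch that assembles the proposed command line and enforces that storage is selected either by server rank list or by fault domain list. It only reflects the option names proposed on this page, not any released dmg syntax. The function name and the assumption that the two selectors are mutually exclusive are illustrative.

```python
from typing import List, Optional, Sequence


def pool_create_argv(sys_name: str, uid: int, gid: int, mode: str, nsvc: int,
                     scm_size: str,
                     ranks: Optional[Sequence[int]] = None,
                     fdomains: Optional[Sequence[int]] = None,
                     nvme_size: Optional[str] = None) -> List[str]:
    """Build a 'dmg pool create' argument list using the proposed options."""
    # Assumption: exactly one of --ranks / --fd selects the storage.
    if (ranks is None) == (fdomains is None):
        raise ValueError("specify exactly one of ranks or fdomains")
    argv = ["dmg", "pool", "create", f"--sys={sys_name}", f"--uid={uid}",
            f"--gid={gid}", f"--mode={mode}", f"--nsvc={nsvc}"]
    if ranks is not None:
        argv.append("--ranks=" + ",".join(str(r) for r in ranks))
    else:
        argv.append("--fd=" + ",".join(str(d) for d in fdomains))
    argv.append(f"--scm-size={scm_size}")
    if nvme_size is not None:  # omitted for SCM-only pools
        argv.append(f"--nvme-size={nvme_size}")
    return argv
```

The SIZE and MODE values are passed through as opaque strings because this page does not define their format.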

Proposal: Highlighted Operations: Container


  • The command for all container operations is daos container (shown in the examples below). As a convenience, the shorter equivalent daos cont may be used.
  • The daos container list-objects command (shown in the examples below) has a shorter equivalent, daos container list-obj.
  • The resource for object commands is daos object. As a convenience, the shorter equivalent daos obj may be used.
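One way to support the short forms listed above is to normalize aliases to their long names before dispatching. The sketch below is only an illustration; the table and function names are hypothetical, and it covers just the three aliases mentioned in this list.

```python
from typing import Tuple

# Short resource names and the long names they stand for.
RESOURCE_ALIASES = {"cont": "container", "obj": "object"}

# Short action names, keyed by (long resource name, short action name).
ACTION_ALIASES = {("container", "list-obj"): "list-objects"}


def normalize(resource: str, action: str) -> Tuple[str, str]:
    """Map short resource/action names to their long equivalents."""
    resource = RESOURCE_ALIASES.get(resource, resource)
    action = ACTION_ALIASES.get((resource, action), action)
    return resource, action
```

Normalizing the resource first means `daos cont list-obj` and `daos container list-objects` resolve to the same handler.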

Container Create - by UUID and/or unified namespace path

  • Create a container in a pool (daos container create)
    • User-specified container UUID
      • Usage: daos container create --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID
    • No container UUID specified (implementation generates a random UUID as a convenience)
      • Usage: daos container create --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST
    • User-specified container UUID and user-specified unified namespace path to link the container to
      • Usage: daos container create --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID --path=/path/to/create_and_link --type=POSIX|HDF5 --oclass=tiny|small|large|R2|R2S|repl_max --chunk_size=BYTES
        • path is a directory for type=POSIX, and is a file for type=HDF5
        • oclass is DAOS object class
        • chunk_size is the chunk_size in bytes to use with files created in the container.
    • No container UUID specified, and user-specified unified namespace path to link the container to (implementation will generate a random UUID)
      • Usage: daos container create --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --path=/path/to/create_and_link --type=POSIX|HDF5 --oclass=tiny|small|large|R2|R2S|repl_max --chunk_size=BYTES
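The create options above can be sketched with Python's standard argparse module, which makes the allowed --type and --oclass values explicit and generates a random container UUID when --cont is omitted. This models the proposed option set only and is not the daos tool's actual parser; the helper names are hypothetical.

```python
import argparse
import uuid

CONT_TYPES = ("POSIX", "HDF5")
OBJ_CLASSES = ("tiny", "small", "large", "R2", "R2S", "repl_max")


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the proposed 'daos container create' options."""
    p = argparse.ArgumentParser(prog="daos container create")
    p.add_argument("--pool", required=True, type=uuid.UUID)
    p.add_argument("--sys", required=True)
    p.add_argument("--svc", required=True)
    p.add_argument("--cont", type=uuid.UUID, default=None)
    p.add_argument("--path")
    p.add_argument("--type", choices=CONT_TYPES)
    p.add_argument("--oclass", choices=OBJ_CLASSES)
    p.add_argument("--chunk_size", type=int)
    return p


def parse_create(argv):
    """Parse the options and fill in a random container UUID if none was given."""
    ns = build_parser().parse_args(argv)
    if ns.cont is None:
        ns.cont = uuid.uuid4()
    return ns
```

An unsupported value such as `--oclass=huge` causes argparse to print a usage error and exit.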

Container "Lookup" (All Other Commands) - by UUID or unified namespace path

There are two variants of these commands: 1) the user provides the pool and container UUIDs; or 2) the user provides only the unified namespace path to which the container is linked. In the second variant, the implementation resolves the pool and container UUIDs by reading the extended filesystem attributes of the specified entity in the path (i.e., the user provides neither the pool UUID nor the container UUID).

  • Destroy a container in a pool (daos container destroy)
    • Destroy by container UUID
      • Usage: daos container destroy --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID
    • Destroy by path that the container is linked to
      • Usage: daos container destroy --sys=SYSNAME --svc=SRVRANKLIST --path=/path/to/destroy_cont_and_unlink
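The choice between the two lookup variants can be sketched as a small validation step that runs before any UUID resolution. The function name `lookup_mode` is hypothetical, and treating a mix of --path with --pool or --cont as an error is an assumption based on the note that the pool is not specified when working by path.

```python
from typing import Optional


def lookup_mode(pool: Optional[str] = None,
                cont: Optional[str] = None,
                path: Optional[str] = None) -> str:
    """Return 'path' or 'uuid' depending on which options were supplied."""
    if path is not None:
        # Assumption: the path variant excludes explicit UUIDs.
        if pool is not None or cont is not None:
            raise ValueError("--path cannot be combined with --pool/--cont")
        return "path"
    if pool is None or cont is None:
        raise ValueError("need both --pool and --cont, or --path")
    return "uuid"
```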

The remaining container commands use the --cont=UUID form (the --path= option is available, but is not shown)

  • List all objects in a container (daos container list-objects)
    • Usage: daos container list-objects --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID
  • Create a snapshot of a container based on the latest committed epoch
    • Unnamed
      • Usage: daos container create-snap --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID
    • Named
      • Usage: daos container create-snap --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID --snap=mysnapname
  • List all snapshots in a container
    • Usage: daos container list-snaps --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID
  • Destroy container snapshot(s)
    • Single epoch snapshot
      • Usage: daos container destroy-snap --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID --epc=B
    • Multiple snapshots within an epoch range
      • Usage: daos container destroy-snap --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID --epcrange=B-D
  • Rollback container to specified snapshot
    • Rollback to a named snapshot
      • Usage: daos container rollback --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID --snap=mysnapname
    • Rollback to a snapshot at an epoch number
      • Usage: daos container rollback --pool=UUID --sys=SYSNAME --svc=SRVRANKLIST --cont=UUID --epc=A

Proposal: daosctl test considerations

  • Change --server-group to --sys=
  • Change --size to --scm-size?
  • Change --replicas=NUM_METADATA_REPLICAS to --nsvc=
  • Change --servers=SRVRANKLIST to --svc= (for pool replica ranks)
  • Change exclude-target --targets= to take a list of pairs (instead of current approach that makes pairs from 2 lists: --rank=ra,rb,rc and --targets=ta,tb,tc)
  • Change --server= to --rank=
  • Change --rank= to --ranks= (or svc= ???) for kill-leader - (what will we do for dmg kill? Probably --ranks=. Choose same)
  • Change --server=SERVER-LIST to  --ranks= (or svc= ???) for kill-server (what will we do for dmg kill? Choose same)
  • Change -c-uuid to --cont=CUUID
  • Change -p-uuid to --pool=PUUID
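Two of the changes above lend themselves to a short illustration: renaming options one-for-one, and replacing the two parallel lists used by exclude-target with a single list of Rank:Target pairs. The sketch covers only the unambiguous renames from the list; the helper names are hypothetical.

```python
from typing import Sequence

# One-for-one renames taken from the list above; context-dependent
# changes (e.g. --server=, --rank=) are deliberately left out.
RENAMES = {
    "--server-group": "--sys",
    "--replicas": "--nsvc",
    "-c-uuid": "--cont",
    "-p-uuid": "--pool",
}


def rename_option(arg: str) -> str:
    """Rewrite 'OLDNAME=value' to 'NEWNAME=value' when a rename is known."""
    name, sep, value = arg.partition("=")
    return RENAMES.get(name, name) + sep + value


def pair_targets(ranks: Sequence[int], targets: Sequence[int]) -> str:
    """Combine parallel --rank and --targets lists into 'rank:target' pairs."""
    if len(ranks) != len(targets):
        raise ValueError("rank and target lists must have the same length")
    return ",".join(f"{r}:{t}" for r, t in zip(ranks, targets))
```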

Proposal: Commands, Resources, Operations and Arguments


Component | Component Args | Operation | Operation Args | Description, Notes / Issues | API | Implemented?


discover all storage available on the nodes applying filters from yaml file

report status & stats about storage

query smd



query SMD device table

query SMD pool table.


query nvme-health | --hostlist="HOST:PORT" | query raw SPDK NVMe device health stats. Returns all stats for all NVMe SSDs on all hosts in hostlist.

query blobstore-health



query BIO in-memory health data. Returns all BIO device health data and I/O errors & checksum error stats for given device UUID or VOS target ID.

query device-state | --devuuid="DEVICE_UUID" | query the current device state of the given device UUID stored in SMD (i.e., NORMAL or FAULTY).

set-faulty | --devuuid="DEVICE_UUID" | allow admin/user to manually set the device state of a given device to FAULTY (will trigger faulty device reaction callbacks).


device-specific configuration that may require a reboot. E.g. setting up AEP DIMMs in interleaved mode



run fio against storage devices to verify that they operate correctly and to validate their performance.


reset content of NVMe SSDs, format SCM with ext4, mount SCM and start the DAOS service (io_server)


firmware update


 scan *

list discovered network interfaces

 * suggestion: report which interfaces and OFI providers would be used with the discovered interfaces


 query *
 report status & stats about network interfaces

 * suggestion: perform a local test to indicate in advance whether an OFI runtime error is going to occur with the interface, for example as seen with daos_server: "na_ofi_getinfo(): fi_getinfo failed, rc=-61(No data available)". In that case, the error came from a VM build that didn't have PSM2 devel.




report service status on all or a subset of the servers

list all pools created (do we want an alias for this as "pool list"?)

Same as above


 list all DAOS system server ranks in the specified system ("query" currently lists all system ranks, do we need list-ranks?)

Same as above stat *

report various stats about the service

 * alternative command name: get-statistics

Same as above log *
report service logs

 * alternative: get-log

Same as above

 debug *
 change debug mask

* alternative: set-debug

Same as above







drain a list of racks, list of servers, or list of targets in preparation for maintenance

 * use --fd= as a convenience (shorter than --fdomains)

This one really does require ability to specify at target or SSD level. Use case: one of the SSDs in a server is about to fail, hot swap it after a drain and before a reintegrate.

Same as above







reintegrate a drained component

Same as above stop
full shutdown of the DAOS service


Same as above start
restart service after full shutdown

Same as above kill




abrupt shutdown of a particular server (really: set of servers, or whole fault domains/racks of servers)

Same as above exclude *




Remove node from DAOS system (really: set of servers, or whole fault domains/racks of servers)

 * alternative command names: del-nodes, del-servers?

Start background checksum scrubbing process (or resume after prior stop)

Stop scrubbing process

Report status of background checksum scrubbing process (e.g., number of corruptions found, percentage of storage scanned so far)

 pool --pool=UUID



Report pool status

 * (applies to all pool commands) given a pool UUID and DAOS system name, the implementation is eventually expected to look up the existing pool service replica SRVRANKLIST (i.e., removing the need for --svc=)


Same as above stat *
Get pool statistics

 * alternative command name: get-statistics


Same as above get-prop *
Get pool properties

 * alternative: prop (but I like having the commands be "verbs")




Same as above get-acl *




Get/set/delete pool access control? Y

Same as above





 --attr=ATTRNAME (get,del)

 --value=VALUESTR (set)

no arguments for list-attrs

 Get / set user attributes


Same as above



List all containers in the pool

N/A create --user=USERNAME@,





 --ranks=SRVRANKLIST *






 * change existing dmg --target= to --ranks=


Same as above destroy --force

Same as above add-storage *




Add a storage fault domain (rack) or list of servers to an existing pool

 * formerly named "extend"


Same as above del-storage *




Remove a fault domain (rack) or list of servers from a pool.

 * formerly named "exclude"


Same as above

 add-svc --ranks=MORESRVRANKLIST

Add a pool service replica

--svc= to specify current list of metadata service server ranks; --ranks= to specify new ranks to add to the set.


Same as above del-svc --ranks=OLDSRVRANKLIST Remove a pool service replica ?

Same as above rebuild
Manage rebuild for a pool?

Same as above rebalance
Trigger rebalance after add-storage(extend) by racks/servers?

Same as above resize



Extend the size of a pool's existing targets

Same as above evict
Evict all active pool connections daos_pool_evict()

Same as above lurk *
Dump activity on the pool

 * alternative: get-log

Operation | Arguments | Description and Notes / Issues | API


 pool * --pool=UUID






report pool status (rebuild/rebalancing status, ...)

report various stats about the pool (size, usage, number of containers, ...), same as dmg pool stats

show pool properties

Note: daos pool is mostly "read-only" versus "dmg pool" used by the administrator. So the set-prop command is not available here.

Y (query, get-prop). Missing statistics support for "stat", but it may stay an admin/dmg thing ?

Same as above get-attr




 --attr=ATTRNAME (get,del)

 --value=VALUESTR (set)

no arguments for list-attrs


Same as above



List all containers in the specified pool



Pool related

(same as daos pool):




Container (choose 1):




 query *
show container status

query by container UUID with --cont 


query by unified namespace (directory or file) --path=FSENTITY (like current duns resolve_path). Note: do not specify --pool= when querying by path.

 * alternative: get-status


Same as above stat
show various container statistics

 * alternative command name: get-statistics

?Missing statistics/metrics support for "stat".

Same as above get-attr




--attr=ATTRNAME (get,del)

 --value=VALUESTR (set)

no arguments for list-attrs

 set/retrieve user attributes

Same as above get-prop

 * is there such a thing as getting container properties (like pool properties)?



Y (currently, only the "label" property is supported).

Same as above



Enumerate all objects in the specified container

Pool related:

same as above

Container related:








 (implementation generates CUUID if not specified)

Also optional are --path/type/oclass/chunk_size

create a container with specific properties (including type, object class, and chunk_size if provided) and link it with the path (if provided - similar to duns link_path, create a POSIX container with DFS-specific parameters). 

CONTTYPE: posix, hdf5

OBJCLASS: tiny, small, large, R2S, R2, repl_max


Same as query above destroy --force * destroy a container based on UUID or path (unlink path as well if provided)

* current dcont destroy does not have the --force option


Same as query above




--snap=NAME (create)

--epc=NUM (destroy)

--epcrange=RANGE (destroy)

Take, list, destroy container snapshots
Optionally name snapshots on creation. Snapshot created based on the most recent committed epoch.

List all snapshots in the container

Destroy a single snapshot by epoch number, or all snapshots between two epoch numbers (inclusive of the begin/end epoch numbers?).


Same as query above rollback



Revert a container back to a previous snapshot specified by name or epoch number.

 verify *

Validate content of a POSIX container

 * TBD - this was in a separate "daos fs" command section that has been merged into "daos cont"



Pool, cont related
(same as daos pool and daos cont):







 query *


TBD: Epoch? Show the layout of a particular DAOS object, including all the targets where it is distributed

 * get-layout (or get layout) instead of query


Same as above list-keys


Same as above dump
Dump content of an object

Proposal: TODO

  1. Container create use cases, unified namespace related additions
    1. container create by pool UUID + path (container UUID generated)
    2. query by path only
    3. Determine if create is to be done in "daos cont create" with more options, or have a uns or fs resource (e.g., daos uns).
  2. Container type specification (optional: unknown if not specified)
    1. --type=fs (or --type=posix)
    2. --type=block (future: e.g., spdk, virtio for cloud use cases)
    3. --type=hdf5
  3. Object class and chunk sizes.

Proposal: Opens

  1. How to expose the mapping of SSDs to VOS targets on a given server node
    1. Use cases
      • SSD fails:
        • system will detect, DAOS will rebuild pool excluding affected target (using SSD to target mapping and using DAOS API to exclude by target).
      • Admin gets predictive alert that SSD wear is high, could fail soon.
        • Admin needs a way to query hw topology and associate with affected targets, then invoke dmg exclude (dmg system drain specifying targets I think is what we decided instead)
    2. How:
      • Pool map, topology portion - does this contain topology details at this low level - or does it only go to the server/node level and stop there?
      • System map?
  2. How to number (or instead name?) the fault domains in the system (with a numeric rank just like servers?)
    1. How: system map?
  3. Server ranks: should we support ranges of consecutive ranks?
    1. Example: ranks 0-1023
      1. --ranks=0-1023
    2. Example: ranks 0-511 and 1000-1023:
      1. --ranks=0-511,1000-1023
  4.  daosctl related
    1. OK to include "test" commands from daosctl into the official product dmg admin and daos user tools? Options
      1. include test commands in the official tools
        1. Print in help messages
        2. Do not print in help messages, but support in the code
      2. Do not include test commands in official tools (maintain daosctl as a developer-only utility)
        1. keep the "test-" commands in daosctl, but remove the ones that are supported by "dmg" and "daos" tools (e.g., pool create, container create, ...)
    2. Is daosctl connect-pool needed?
  5. (more daosctl related) Should we have daos pool kill-leader command?
    1. daosctl kill-leader (kills one of the metadata service server ranks for a specific pool)
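For open item 3 (ranges of consecutive server ranks), the following sketch shows how a --ranks value mixing single ranks and ranges could be expanded into an explicit list. The range syntax is still an open question on this page, so the function name and the inclusive "first-last" interpretation are assumptions for discussion.

```python
from typing import List


def expand_ranks(value: str) -> List[int]:
    """Expand '0-511,1000-1023' or '0,1,2' into a flat list of ranks.

    Assumption: ranges are written first-last and include both ends.
    """
    ranks: List[int] = []
    for item in value.split(","):
        first, sep, last = item.strip().partition("-")
        if not first.isdigit() or (sep and not last.isdigit()):
            raise ValueError(f"bad rank or rank range: {item!r}")
        lo = int(first)
        hi = int(last) if sep else lo
        if lo > hi:
            raise ValueError(f"rank range is reversed: {item!r}")
        ranks.extend(range(lo, hi + 1))
    return ranks
```

A plain list such as `0,1,2` passes through unchanged, so the existing --ranks=SRVRANKLIST examples would remain valid under this interpretation.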