Client-Side Health Checks

Status: Draft / Under review / Approved / Implemented / Abandoned

Created: Apr 18, 2024

Last Updated: Apr 18, 2024

Authors: @Michael MacDonald

Objective

Provide simple client-side DAOS health and performance tooling to assist with functionality troubleshooting and performance investigations.

Background

Troubleshooting DAOS problems is difficult, even for experienced DAOS developers and admins.

Experience has shown that most users, and many new admins, do not know how to investigate a failed or misbehaving DAOS system. There is an evolving body of documentation and experience that can help with this problem, but documentation tends to suffer from neglect, and experienced people move on to other tasks.

Examples of initial troubleshooting steps that might be taken:

  • daos pool query default-pool: Attempt to connect to the Pool Service and view rebuild status, used/available storage, etc.

  • self_test: Attempt to run nondestructive network connectivity tests between clients and servers

  • daos version --json: Display the exact version of the client being used
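
The initial steps above could be bundled into a single collection pass. The following is a minimal sketch, not part of the proposed design: it assumes the daos binary is on the PATH, uses only the two daos invocations listed above, and leaves out self_test because its arguments depend on the site. The names INITIAL_STEPS, run_command, and collect_initial_steps are hypothetical.

```python
import subprocess
from typing import Callable, Dict, List

# Commands copied from the troubleshooting list above. The step names
# (dictionary keys) are hypothetical labels chosen for this sketch.
INITIAL_STEPS: Dict[str, List[str]] = {
    "client_version": ["daos", "version", "--json"],
    "pool_query": ["daos", "pool", "query", "default-pool"],
}


def run_command(argv: List[str]) -> dict:
    """Run one command and report its outcome without raising."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except OSError as exc:
        # Covers a missing binary or one that cannot be executed.
        return {"ok": False, "output": "", "error": str(exc)}
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": "", "error": "timed out after 30s"}
    return {"ok": proc.returncode == 0, "output": proc.stdout, "error": proc.stderr}


def collect_initial_steps(
    runner: Callable[[List[str]], dict] = run_command,
) -> Dict[str, dict]:
    """Run every initial step and return the results keyed by step name."""
    return {name: runner(argv) for name, argv in INITIAL_STEPS.items()}
```

Calling collect_initial_steps() on a client returns one result per step, so a failed step (for example, a missing binary) is recorded alongside the steps that succeeded rather than aborting the whole pass. The runner parameter exists only so the collection logic can be exercised without a DAOS installation.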

Beyond these steps, the investigation might continue by examining server logs and instance/VM dashboards, or even involve logging into the servers directly in order to run administrative commands.

Examples of steps that might be taken, assuming that investigators have the ability to log into servers:

  • dmg system query: Attempt to connect to the Management Service and view server membership details and status

  • daos_server version --json: Display the exact version of the server being used

  • Check/collect server/engine logs

Overview

A completely automated troubleshooting process is probably not feasible, but we should be able to provide some tooling to encode best practices and bootstrap an investigation. For simpler problems (agent misconfiguration, etc.), the tooling may even be able to provide a level of self-service diagnosis and resolution for users.

As a starting point, a new set of health-related subcommands will be added to the daos binary distributed with DAOS client packages.

A quick example:

$ daos health check
Component Versions
  Server Version: v2.4.2-ps-1.0
  Client Version: v2.4.2-42-g10b9df149e
Server Status: 24 Joined, 1 Excluded
Pool Status
  default-pool: Rebuilding (86% complete)
Container Status
  posix-container: Healthy

The exact details of the UI will need to be refined as part of the detailed design, but we can begin by defining the set of information that would be helpful for end users and serve as a starting place for support personnel:

  • Client/Agent/Server versions, with exact build information including tags

  • Simple server status (e.g. 24/25 Joined, 1/25 Excluded)

  • Simple pool status (Healthy, Rebuilding, Low Space, Unreachable)

  • Simple container status (Healthy, Unhealthy)
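
One way to picture the information set above is as a small data model. This is an illustrative sketch only, since the detailed design is still TBD: the class and field names are hypothetical, the status values are the ones listed above, and treating "Joined" as the only healthy server state is an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class PoolStatus(Enum):
    HEALTHY = "Healthy"
    REBUILDING = "Rebuilding"
    LOW_SPACE = "Low Space"
    UNREACHABLE = "Unreachable"


class ContainerStatus(Enum):
    HEALTHY = "Healthy"
    UNHEALTHY = "Unhealthy"


@dataclass
class HealthReport:
    """Hypothetical container for the data a health check would gather."""
    versions: Dict[str, str] = field(default_factory=dict)  # component -> build string
    server_states: List[str] = field(default_factory=list)  # one entry per server
    pools: Dict[str, PoolStatus] = field(default_factory=dict)
    containers: Dict[str, ContainerStatus] = field(default_factory=dict)

    def server_summary(self) -> str:
        """Render counts such as '24/25 Joined, 1/25 Excluded'."""
        total = len(self.server_states)
        counts = Counter(self.server_states)  # keeps first-seen order
        return ", ".join(f"{n}/{total} {state}" for state, n in counts.items())

    def is_healthy(self) -> bool:
        """True only if every server, pool, and container is in its healthy state."""
        return (
            all(state == "Joined" for state in self.server_states)  # assumption
            and all(s is PoolStatus.HEALTHY for s in self.pools.values())
            and all(s is ContainerStatus.HEALTHY for s in self.containers.values())
        )
```

Keeping the raw per-server states and deriving the summary string from them means the same report can back both the human-readable output and any machine-readable form the detailed design settles on.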

As part of collecting this basic set of information, the tool can learn a fair amount about potential problems and display potential resolutions. For example:

  • If the agent is not running, display an informative error with corrective guidance

  • If the servers cannot be contacted by the agent on the client’s behalf, indicate possible causes (incorrect agent configuration, connectivity issues, etc.) and solutions

  • If there are unhealthy components in the system (servers, pool, container), clearly identify them
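
The three cases above could be expressed as an ordered set of checks, where an earlier failure suppresses later checks that depend on it. The sketch below is hypothetical: the function name, its inputs, and the guidance wording are placeholders, and how the tool would actually detect each condition is left to the detailed design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    problem: str   # what was observed
    guidance: str  # what the user can try next


def diagnose(agent_running: bool, servers_reachable: bool,
             unhealthy: List[str]) -> List[Finding]:
    """Map observed conditions to findings, most fundamental problem first."""
    if not agent_running:
        # Nothing else can be checked without the agent, so stop here.
        return [Finding("agent is not running",
                        "Start the DAOS agent on this client and retry.")]
    if not servers_reachable:
        return [Finding("servers cannot be contacted via the agent",
                        "Possible causes: incorrect agent configuration, "
                        "connectivity issues.")]
    # Agent and servers are reachable: report each unhealthy component by name.
    return [Finding(f"unhealthy component: {name}",
                    "Report this component to the system administrator.")
            for name in unhealthy]
```

For example, diagnose(True, True, ["pool default-pool"]) yields a single finding naming the pool, while diagnose(False, True, []) yields only the agent finding.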

Beyond these binary healthy/not-healthy checks, the tool could provide more advanced checks to assist with performance troubleshooting. One obvious starting place for this kind of advanced functionality is a wrapper around the DAOS self_test utility, where the tool performs quick and nondestructive performance tests against each server in order to produce actual throughput and latency results that can be compared to expected results. Based on the actual numbers seen at the client side, the investigation may be able to more quickly zero in on suboptimal client placement relative to the servers or other networking issues.

Example output:

$ daos health net-test
This command will initiate a non-destructive evaluation of the network between this client and the DAOS servers in the system.

Throughput Results:
| Rank | Read Throughput | Write Throughput |
|------|-----------------|------------------|
|  0   |18Gbps (2.25GB/s)|  32Gbps (4GB/s)  |
...

Latency Results:
| Rank | Read Latency | Write Latency |
|------|--------------|---------------|
|  0   |     32ms     |    8.3ms      |
...
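
Comparing measured results to expected results, as described above, could be as simple as flagging ranks that fall below a fraction of the expected throughput. The sketch below is an assumption-laden illustration: the function names, the 80% tolerance, and the idea of a single expected value per system are all hypothetical. The unit conversion matches the example output (18Gbps is 2.25GB/s).

```python
from typing import Dict, List


def gbps_to_gigabytes_per_sec(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def below_expected(measured_gbps: Dict[int, float], expected_gbps: float,
                   tolerance: float = 0.8) -> List[int]:
    """Return the ranks whose throughput is under tolerance * expected.

    The default tolerance of 0.8 is a placeholder, not a recommended value.
    """
    threshold = expected_gbps * tolerance
    return sorted(rank for rank, value in measured_gbps.items()
                  if value < threshold)
```

With read throughput of 18Gbps on rank 0 and an expected 32Gbps, below_expected({0: 18.0}, 32.0) returns [0], pointing the investigation at the path between this client and that rank.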

Detailed Design

TBD