Additional RAS events for 2.8

Stakeholders

Identify developer(s), component validation engineer(s) & reviewer(s).

Introduction

RAS events can improve DAOS system observability for day-2 operations. At the moment only a limited number of events are covered, and they are available only in text logs, which makes it hard to get a quick overview and challenging to build any automation on top of them.

Storing RAS events durably in the mgmt service (MS) DB and providing access to them would greatly benefit operations and make it easier to embed DAOS in other systems; for example, an external system would be able to react to certain events.

Requirements & Use Cases

  • Store RAS events persistently in the mgmt service DB.

  • Expose stored events via API.

  • Add a dmg command to show and filter events.

    • Filtering by since and type should be enough for most cases, e.g. dmg system ras --since="26-10-2024T10:00:000"

    • The output should support both human-readable and machine-readable formats: the regular tab-formatted table by default, and JSON if the --json flag is provided.

  • If time allows, extend RAS events with:

    • pool creation/deletion

    • container creation/deletion

    • pool property changes

    • container property changes

    • pool extension/drain

    • new rank joined

  • Retention policy: large systems can generate a tremendous number of events. A basic retention policy should clean up old events (say, older than 30 days by default) to ensure the mgmt service does not run out of storage.

    • Optionally, there could be a CLI flag to clear events after they are read (for systems that read events and keep them in their own storage).

    • The retention policy can be implemented in the next release. A small system of 20 servers running for 6 months generated ~370k RAS events; even if every event is 1 KB, that is only about 370 MB, which leaves plenty of headroom before a retention policy is needed.

Design Overview

The overall idea is to reuse the existing event-raising infrastructure and the mgmt service DB: when an event is raised, in addition to being written to the log, it is also sent to the MS to be stored in the DB.

The durable event store should provide an API that the dmg command line uses to display events, with simple filters by date and type.

The infrastructure already in place allows RAS events to be forwarded from servers to the management service. It is built on a PubSub primitive, which should make it possible to add a durable storage handler subscribed to RASTypeAny. This would be similar to how EventLogger/EventForwarder implement the events.Handler interface with OnEvent.

The event firing mechanism is already implemented and in use via the ds_notify_ras* set of functions; therefore, only the set of event types needs to be expanded.

User Interface

The events are most relevant for system administration and automation/tooling, so access to them should be contained within the dmg tool.

To simplify integration with other systems, a machine-readable format should be supported: if the --json option is provided, the output switches from the human-readable table to JSON.

Usage examples:

  • dmg system ras - show the recent events (say, the last 24h)

  • dmg system ras --type pool_rebuild_started,pool_rebuild_finished - show only events of the listed types

  • dmg system ras --json --since="26-10-2024T10:00:000" --type=engine_died

Impacts

Any performance impact?

There might be a performance impact once the extended pool/container events are added. To mitigate it, batching of event storage, or even a separate service, might be preferable.
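One possible shape for such batching is sketched below, purely as an illustration. Batcher and its methods are hypothetical, events are represented as plain strings, and the flush callback stands in for a single write of the whole batch to the DB.

```go
package main

import (
	"fmt"
	"sync"
)

// Batcher is a hypothetical helper that buffers events and hands them to
// flush in groups of up to max, so the store sees one write per batch.
type Batcher struct {
	mu      sync.Mutex
	max     int
	pending []string
	flush   func(batch []string)
}

// NewBatcher creates a Batcher that flushes once max events are buffered.
func NewBatcher(max int, flush func([]string)) *Batcher {
	return &Batcher{max: max, flush: flush}
}

// Add buffers one event and flushes when the batch is full.
func (b *Batcher) Add(ev string) {
	b.mu.Lock()
	b.pending = append(b.pending, ev)
	var batch []string
	if len(b.pending) >= b.max {
		batch = b.pending
		b.pending = nil
	}
	b.mu.Unlock()

	// Call flush outside the lock so a slow write does not block Add.
	if batch != nil {
		b.flush(batch)
	}
}

// Flush writes out whatever is still buffered, e.g. on shutdown.
func (b *Batcher) Flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()

	if len(batch) > 0 {
		b.flush(batch)
	}
}

func main() {
	writes := 0
	b := NewBatcher(2, func(batch []string) {
		writes++
		fmt.Printf("write %d: %v\n", writes, batch)
	})

	for _, ev := range []string{"pool_created", "cont_created", "cont_destroyed"} {
		b.Add(ev)
	}
	b.Flush()
}
```

A production version would likely also flush on a timer so that a partially filled batch does not delay events indefinitely.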

References

Events package on control plane

Events on engine side

Event logger

Examples of firing the events