Additional RAS events for 2.8
Stakeholders
Identify developer(s), component validation engineer(s) & reviewer(s).
Introduction
RAS events can improve DAOS system observability for day 2 operations. At the moment only a limited number of events are covered, and they are available only in text logs, which makes it hard to get a quick overview and challenging to build any automation on top of them.
Storing RAS events durably in the mgmt service DB and providing access to them would greatly benefit operations and make it easier to embed DAOS in other systems: for example, an external system would be able to react to certain events.
Requirements & Use Cases
Store RAS events persistently in the mgmt service DB.
Expose stored events via API.
Add a dmg command to show and filter events.
Filters by since and type should be enough for most cases, e.g. dmg system ras --since="26-10-2024T10:00:000"
The output should support both human-readable and machine-digestible formats: the regular tab-formatted table by default, and JSON if the --json flag is provided.
If time allows, extend RAS events with:
pool creation/deletion
container creation/deletion
pool property changes
container property changes
pool extension/drain
new rank joined
Retention policy: large systems can generate a tremendous number of events. A basic retention policy should clean up old events (say, older than 30 days by default) to ensure the mgmt service keeps storage available.
Optionally, there could be a CLI flag to clear events after they are read (for systems that read events and keep them in their own storage).
Retention can be implemented in the next release. A small system of 20 servers running for 6 months generated ~370k RAS events; even if every event is 1 KB, that is only about 370 MB, which leaves plenty of headroom before a retention policy is needed.
Design Overview
The overall idea is to reuse the existing infrastructure for raising events together with the mgmt service DB: when an event is raised, in addition to being added to the log, it is also sent to the MS to be stored in the DB.
The durable event store should provide an API for the dmg command line to display events with simple filters by date and type.
The infrastructure already in place allows RAS events to be redirected from servers to the management service. It is built on a PubSub primitive, which should make it possible to add a durable storage handler subscribed to RASTypeAny. It would be similar to how EventLogger/EventForwarder implements the events.Handler interface with OnEvent.
The event firing mechanism is already implemented and in use via the ds_notify_ras* set of functions; therefore, only the set of event types needs to be expanded.
User Interface
The events are most relevant for system administration and automation/tooling, and should therefore be exposed through the dmg tool.
To simplify integration with other systems, a machine-readable format should be supported: if the --json option is provided, the human-readable output switches to JSON.
Usage examples:
dmg system ras - show the recent events (say, the last 24h)
dmg system ras --type pool_rebuild_started,pool_rebuild_started - show the required events
dmg system ras --json --since="26-10-2024T10:00:000" --type=engine_died
Impacts
Any performance impact?
There might be a performance impact once the extended pool/container events are added. To mitigate it, batching of event storage, or even a separate service, might be preferable.
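One possible form of the batching mentioned above is a small buffer that writes events to the store in groups. This is a sketch under assumptions: Batcher, NewBatcher and the flush callback are hypothetical, the batch size is arbitrary, and a real implementation would also need a periodic Flush so that a partially filled batch does not sit in memory indefinitely.

```go
package main

import "sync"

// RASEvent is a simplified stand-in for the control plane event type.
type RASEvent struct {
	ID  string
	Msg string
}

// Batcher buffers events and hands them to flush in groups of up to max.
type Batcher struct {
	mu      sync.Mutex
	pending []RASEvent
	max     int
	flush   func([]RASEvent) error
}

// NewBatcher creates a Batcher; a max below 1 is treated as 1.
func NewBatcher(max int, flush func([]RASEvent) error) *Batcher {
	if max < 1 {
		max = 1
	}
	return &Batcher{max: max, flush: flush}
}

// Add buffers the event and writes the batch out once it is full.
func (b *Batcher) Add(ev RASEvent) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, ev)
	if len(b.pending) < b.max {
		return nil
	}
	return b.flushLocked()
}

// Flush writes out whatever is buffered, e.g. on a timer or at shutdown.
func (b *Batcher) Flush() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.flushLocked()
}

// flushLocked must be called with b.mu held. On error the buffered
// events are kept so that a later call can retry them.
func (b *Batcher) flushLocked() error {
	if len(b.pending) == 0 {
		return nil
	}
	if err := b.flush(b.pending); err != nil {
		return err
	}
	b.pending = nil
	return nil
}
```

The lock is held while flush runs, which keeps the sketch simple but blocks concurrent Add calls during a write; the trade-off is that events buffered in memory are lost if the process dies before the next flush.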
References
Events package on control plane