...

  • Store RAS events persistently in the mgmt service DB.

  • Expose stored events via API.

  • Add a dmg command to show and filter events.

    • Filtering by since and type should be enough for most cases, e.g. dmg system ras --since="26-10-2024T10:00:000"

    • The output should support both human-readable and machine-readable formats: the regular tab-formatted table by default, and JSON if the --json flag is provided.

  • If time allows, extend RAS events with:

    • pool creation/deletion

    • container creation/deletion

    • pool property changes

    • container property changes

    • pool extension/drain

    • new rank joined

  • Retention policy - large systems can generate a tremendous number of events. A basic retention policy should clean up old events (say, older than 30 days by default) to ensure the mgmt service has storage available.

    • Optionally, there could be a CLI flag to clear events after they are read (for systems that read events and keep them in their own internal storage).

    • The retention policy can be implemented in the next release. A small system of 20 servers running for 6 months generated ~370k RAS events; even if every event is 1 KB, that is only about 370 MB, which leaves plenty of headroom before a retention policy is needed.
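The filtering, output and retention requirements above can be illustrated with a minimal Go sketch. It is not the proposed implementation: the RASEvent struct, its fields, the function names, the sample event IDs and types, and the in-memory slice standing in for the mgmt service DB are all assumptions made for illustration. The 30-day default is taken from the retention bullet.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"text/tabwriter"
	"time"
)

// defaultRetention mirrors the "30 days by default" suggestion in the requirements.
const defaultRetention = 30 * 24 * time.Hour

// RASEvent is a hypothetical, simplified record of a stored event.
type RASEvent struct {
	ID        string    `json:"id"`
	Type      string    `json:"type"`
	Timestamp time.Time `json:"timestamp"`
	Msg       string    `json:"msg"`
}

// filterEvents keeps events raised at or after since. An empty evType matches any type.
func filterEvents(events []RASEvent, since time.Time, evType string) []RASEvent {
	var out []RASEvent
	for _, e := range events {
		if e.Timestamp.Before(since) || (evType != "" && e.Type != evType) {
			continue
		}
		out = append(out, e)
	}
	return out
}

// pruneEvents drops events older than the retention window, measured back from now.
func pruneEvents(events []RASEvent, now time.Time, retention time.Duration) []RASEvent {
	cutoff := now.Add(-retention)
	var kept []RASEvent
	for _, e := range events {
		if !e.Timestamp.Before(cutoff) {
			kept = append(kept, e)
		}
	}
	return kept
}

// printEvents writes JSON when asJSON is set, otherwise a tab-aligned table.
func printEvents(w io.Writer, events []RASEvent, asJSON bool) error {
	if asJSON {
		enc := json.NewEncoder(w)
		enc.SetIndent("", "  ")
		return enc.Encode(events)
	}
	tw := tabwriter.NewWriter(w, 0, 8, 2, ' ', 0)
	fmt.Fprintln(tw, "TIME\tTYPE\tID\tMESSAGE")
	for _, e := range events {
		fmt.Fprintf(tw, "%s\t%s\t%s\t%s\n",
			e.Timestamp.Format(time.RFC3339), e.Type, e.ID, e.Msg)
	}
	return tw.Flush()
}

// sampleEvents returns illustrative data and the reference "now" used to build it.
func sampleEvents() ([]RASEvent, time.Time) {
	now := time.Date(2024, 10, 26, 12, 0, 0, 0, time.UTC)
	return []RASEvent{
		{ID: "engine_died", Type: "STATE_CHANGE", Timestamp: now.Add(-40 * 24 * time.Hour), Msg: "old event"},
		{ID: "pool_rebuild_started", Type: "INFO", Timestamp: now.Add(-3 * time.Hour), Msg: "rebuild started"},
		{ID: "engine_died", Type: "STATE_CHANGE", Timestamp: now.Add(-1 * time.Hour), Msg: "rank 3 down"},
	}, now
}

func main() {
	events, now := sampleEvents()
	events = pruneEvents(events, now, defaultRetention)

	since := time.Date(2024, 10, 26, 10, 0, 0, 0, time.UTC)
	recent := filterEvents(events, since, "")

	// Default table output first, then the same data as JSON.
	for _, asJSON := range []bool{false, true} {
		if err := printEvents(os.Stdout, recent, asJSON); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

In a real implementation the filtering and pruning would more likely be pushed down into the DB query than done over an in-memory slice. The sketch only shows the intended semantics of the since and type filters and of the age-based cleanup.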

Design Overview

The overall idea is to reuse the existing event-raising infrastructure and the mgmt service DB: when an event is raised, in addition to being written to the log, it is also sent to the MS (management service) to be stored in the DB.
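As a rough illustration of this "log and also store" flow, here is a minimal Go sketch. The Event, EventStore, memStore and Raiser names are hypothetical and do not correspond to existing DAOS types. The in-process Append call stands in for whatever mechanism actually forwards the event to the MS, and the in-memory slice stands in for the DB.

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

// Event is a hypothetical, minimal RAS event.
type Event struct {
	ID  string
	Msg string
}

// EventStore is a hypothetical stand-in for the mgmt service DB.
type EventStore interface {
	Append(Event) error
}

// memStore keeps events in memory; a real store would persist them.
type memStore struct {
	mu     sync.Mutex
	events []Event
}

// Append records the event under a lock so concurrent raisers are safe.
func (s *memStore) Append(e Event) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.events = append(s.events, e)
	return nil
}

// Count reports how many events have been stored so far.
func (s *memStore) Count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.events)
}

// Raiser logs each event, as before, and additionally forwards it to the store.
type Raiser struct {
	logger *log.Logger
	store  EventStore
}

// newRaiser wires a Raiser to the process-wide default logger.
func newRaiser(store EventStore) *Raiser {
	return &Raiser{logger: log.Default(), store: store}
}

// Raise writes the event to the log first, then attempts to store it.
// A storage failure is returned to the caller; the log entry is still written.
func (r *Raiser) Raise(e Event) error {
	r.logger.Printf("RAS event %s: %s", e.ID, e.Msg)
	if err := r.store.Append(e); err != nil {
		return fmt.Errorf("storing event %s: %w", e.ID, err)
	}
	return nil
}

func main() {
	store := &memStore{}
	raiser := newRaiser(store)

	if err := raiser.Raise(Event{ID: "example_event", Msg: "illustrative message"}); err != nil {
		log.Fatal(err)
	}
	fmt.Println("stored events:", store.Count())
}
```

Logging before storing means an event is never lost from the log if the store is unavailable. Whether a failed store should be retried or only reported is a design decision this sketch leaves open.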

...