RAS Events

The intent is to document the RAS events in a table in the online documentation, similar to what GPFS does (see https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1pdg_rasevents_gpfs.htm).

The format of the DAOS RAS event will be as follows:

Field	Optional/Mandatory	Description
ID	Mandatory	Unique event identifier referenced in the manual. 64-char string.
Type	Mandatory	Event type e.g. STATE_CHANGE or INFO_ONLY.
Timestamp	Mandatory	Fully qualified timestamp associated with the event, with microsecond resolution and a timezone offset to avoid locality issues.
Severity	Mandatory	Fatal/Warning/Error/Info
Msg	Mandatory	Human readable message.
HID	Optional	Identifies the hardware component involved in the event, e.g. PCI address for SSD, network interface, …
Rank	Optional	DAOS rank involved in the event.
PID	Optional	Identifier of the process involved in the RAS event.
TID	Optional	Identifier of the thread involved in the RAS event.
JOBID	Optional	Identifier of the job involved in the RAS event.
Hostname	Optional	Hostname of the node involved in the event.
PUUID	Optional	Pool UUID involved in the event, if any.
CUUID	Optional	Container UUID involved in the event, if relevant.
OID	Optional	Object identifier involved in the event, if relevant.
Control Operation	Optional	Recommended automatic action, if any.
Data	Optional	Specific instance data treated as a blob.
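
The field table above can be sketched as a data structure to show how the mandatory/optional split and the timestamp requirement might be enforced. This is only an illustration, not the DAOS implementation: the class name RASEvent, the attribute names, the Severity enum, and the sample event ID are hypothetical, and "64-char string" is assumed here to mean "at most 64 characters".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Severity(Enum):
    FATAL = "Fatal"
    WARNING = "Warning"
    ERROR = "Error"
    INFO = "Info"


@dataclass
class RASEvent:
    """Hypothetical container mirroring the field table."""

    # Mandatory fields.
    id: str
    type: str                      # e.g. "STATE_CHANGE" or "INFO_ONLY"
    timestamp: datetime
    severity: Severity
    msg: str
    # Optional fields default to None when not relevant to the event.
    hid: Optional[str] = None
    rank: Optional[int] = None
    pid: Optional[int] = None
    tid: Optional[int] = None
    jobid: Optional[str] = None
    hostname: Optional[str] = None
    puuid: Optional[str] = None
    cuuid: Optional[str] = None
    oid: Optional[str] = None
    control_op: Optional[str] = None
    data: Optional[bytes] = None   # opaque blob

    def __post_init__(self) -> None:
        # Assumption: the ID is a non-empty string of at most 64 characters.
        if not self.id or len(self.id) > 64:
            raise ValueError("id must be a non-empty string of at most 64 characters")
        # A naive datetime has no timezone offset, which the format requires.
        if self.timestamp.utcoffset() is None:
            raise ValueError("timestamp must carry a timezone offset")

    def timestamp_str(self) -> str:
        # ISO 8601 with microsecond resolution and the UTC offset.
        return self.timestamp.isoformat(timespec="microseconds")


# Usage: "rank_evicted" is a made-up event ID for illustration.
event = RASEvent(
    id="rank_evicted",
    type="STATE_CHANGE",
    timestamp=datetime(2020, 1, 1, 12, 0, 0, 123, tzinfo=timezone(timedelta(hours=1))),
    severity=Severity.WARNING,
    msg="Rank is evicted by SWIM",
    rank=3,
)
print(event.timestamp_str())
```

Rejecting naive timestamps at construction time is one way to guarantee that every emitted event carries an offset; a real implementation could equally normalize to UTC.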

RAS events include:

DAOS system is up
DAOS system is stopped
DAOS system is (re)formatted
Minimum number of storage devices (i.e. PMEM, SSD) not met
Cannot find requested network interface
Not enough huge pages
Rank is evicted by SWIM
I/O server / rank failed … and then restarted
Faulty SSD, or wear-levelling factor degrading below threshold
SCM error
Checksum error
Mgmt service lost redundancy
Pool service lost redundancy
Rank is evicted from pool
Rank is reintegrated in pool
Rebuild started and completed
Rebuild hit some failure (e.g. no space)
OSA space rebalancing completed/failed
Data might have been lost due to cascading failures taking us below the container’s redundancy factor
Watchdog triggered when ULT does not seem to be making progress
Aggregation hit some failures
Too many transaction aborts/retries preventing forward progress
A large number of out-of-bounds errors
Agent is unreachable
A lot of containers are being created (e.g. >1000 in a single pool)
A lot of snapshots are being created (e.g. >1000 for a single container)

A plugin interface will be available to allow emitting RAS events via different channels (e.g. syslog, ...). Events will also be logged into the io/control logs so that, when troubleshooting after the fact, the errors appear in line with the other messages emitted there.
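
One possible shape for such a plugin interface is sketched below, assuming events are passed around as plain dictionaries keyed by the field names from the table. Everything here is hypothetical (RASDispatcher, make_log_emitter, the channel names, the sample event ID) and stands in for whatever interface DAOS ends up defining; Python's standard logging module is used as a stand-in for the io/control logs.

```python
import logging
from typing import Callable, Dict, List

Event = Dict[str, str]
Emitter = Callable[[Event], None]

# Assumed mapping from the Severity field to logging levels.
_LEVELS = {
    "Fatal": logging.CRITICAL,
    "Error": logging.ERROR,
    "Warning": logging.WARNING,
    "Info": logging.INFO,
}


class RASDispatcher:
    """Hypothetical registry that fans each event out to every channel."""

    def __init__(self) -> None:
        self._emitters: Dict[str, Emitter] = {}

    def register(self, name: str, emitter: Emitter) -> None:
        self._emitters[name] = emitter

    def emit(self, event: Event) -> List[str]:
        # Deliver to all channels; one failing channel must not block
        # the others. Returns the names of the channels that failed.
        failed = []
        for name, emitter in self._emitters.items():
            try:
                emitter(event)
            except Exception:
                failed.append(name)
        return failed


def make_log_emitter(logger: logging.Logger) -> Emitter:
    """Build a channel that writes events in line with other log messages."""

    def emit(event: Event) -> None:
        level = _LEVELS.get(event["severity"], logging.INFO)
        logger.log(level, "RAS %s: %s", event["id"], event["msg"])

    return emit


# Usage: register a log channel and an in-memory channel, then emit.
captured: List[Event] = []
dispatcher = RASDispatcher()
dispatcher.register("log", make_log_emitter(logging.getLogger("daos.ras")))
dispatcher.register("memory", captured.append)
failed_channels = dispatcher.emit(
    {"id": "pool_svc_lost_redundancy", "severity": "Warning",
     "msg": "Pool service lost redundancy"}
)
```

A syslog channel could be registered the same way, for instance by wrapping a logger configured with logging.handlers.SysLogHandler in make_log_emitter.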