Regarding the RAS events, the intent is to provide a table in the online documentation similar to what GPFS does (see https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1pdg_rasevents_gpfs.htm).
The format of the DAOS RAS event will be as follows:
|Field|Required|Description|
|---|---|---|
|ID|Mandatory|Unique event identifier referenced in the manual. 64-char string.|
|Type|Mandatory|Event type, e.g. STATE_CHANGE or INFO_ONLY.|
|Timestamp|Mandatory|Fully qualified timestamp associated with the event. Microsecond resolution, including the timezone offset to avoid locality issues.|
|Message|Mandatory|Human-readable message.|
|Hardware ID|Optional|Hardware component involved in the event, e.g. PCI address for an SSD, network interface, …|
|Rank|Optional|DAOS rank involved in the event.|
|PID|Optional|Identifier of the process involved in the RAS event.|
|TID|Optional|Identifier of the thread involved in the RAS event.|
|JOBID|Optional|Identifier of the job involved in the RAS event.|
|Hostname|Mandatory|Hostname of the node involved in the event.|
|Pool|Optional|Pool UUID involved in the event, if any.|
|Container|Optional|Container UUID involved in the event, if relevant.|
|Object|Optional|Object identifier involved in the event, if relevant.|
|Action|Optional|Recommended automatic action, if any.|
|Data|Optional|Specific instance data treated as a blob.|
RAS events include:
- DAOS system is up
- DAOS system is stopped
- DAOS system is (re)formatted
- Minimal number of storage devices (i.e. PMEM, SSD) not met
- Cannot find requested network interface
- Not enough huge pages
- Rank is evicted by SWIM
- I/O server / rank failed and was then restarted
- Faulty SSD, or wear-leveling factor degraded below threshold
- SCM error
- Checksum error
- Management service lost redundancy
- Pool service lost redundancy
- Rank is evicted from pool
- Rank is reintegrated in pool
- Rebuild started and completed
- Rebuild hit some failure (e.g. no space)
- OSA space rebalancing completed/failed
- Data might have been lost after cascading failures reduced redundancy below the container's redundancy factor
- Watchdog triggered when a ULT does not seem to be making progress
- Aggregation hit some failures
- Too many transaction aborts/retries preventing forward progress
- Repeated out-of-bounds errors
- Agent is unreachable
- Large number of containers being created (e.g. >1000 in a single pool)
- Large number of snapshots being created (e.g. >1000 for a single container)
A plugin interface will be available to allow emitting RAS events via different channels (e.g. syslog, ...). Events will also be logged to the I/O server and control plane logs to ease troubleshooting after the fact, by placing them in line with the other messages emitted there.