RAS Events
The intent is to document the RAS events in a table in the online documentation, similar to what GPFS does (see https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1pdg_rasevents_gpfs.htm).
The format of the DAOS RAS event will be as follows:
Field | Optional/Mandatory | Description |
ID | Mandatory | Unique event identifier referenced in the manual. 64-char string. |
Type | Mandatory | Event type e.g. STATE_CHANGE or INFO_ONLY. |
Timestamp | Mandatory | Fully qualified timestamp associated with the event, with microsecond resolution and a timezone offset to avoid locality issues. |
Severity | Mandatory | Event severity: Fatal, Error, Warning or Info. |
Msg | Mandatory | Human readable message. |
HID | Optional | Identifies the hardware component involved in the event, e.g. PCI address for an SSD, network interface, … |
Rank | Optional | DAOS rank involved in the event. |
PID | Optional | Identifier of the process involved in the RAS event. |
TID | Optional | Identifier of the thread involved in the RAS event. |
JOBID | Optional | Identifier of the job involved in the RAS event. |
Hostname | Optional | Hostname of the node involved in the event. |
PUUID | Optional | Pool UUID involved in the event, if any. |
CUUID | Optional | Container UUID involved in the event, if relevant. |
OID | Optional | Object identifier involved in the event, if relevant. |
Control Operation | Optional | Recommended automatic action, if any. |
Data | Optional | Specific instance data treated as a blob. |
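As an illustration of the format above, the following minimal Go sketch shows one way the fields could be represented and the mandatory ones checked. The type name `RASEvent`, its methods, the helper `sampleEvent` and the event ID `rank_evicted` are assumptions made for this example and are not defined by this document. The sketch also assumes that "64-char string" means at most 64 characters.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// tsLayout renders microsecond resolution plus an explicit timezone offset.
const tsLayout = "2006-01-02T15:04:05.000000-07:00"

// maxIDLen is an assumption: "64-char string" is read as "at most 64 characters".
const maxIDLen = 64

// RASEvent is a hypothetical in-memory form of the event format.
// Optional fields are left empty (or nil) when they do not apply.
type RASEvent struct {
	ID        string    // mandatory
	Type      string    // mandatory, e.g. "STATE_CHANGE" or "INFO_ONLY"
	Timestamp time.Time // mandatory
	Severity  string    // mandatory: Fatal, Error, Warning or Info
	Msg       string    // mandatory
	HID       string    // optional
	Rank      *uint32   // optional
	PID       *int      // optional
	TID       *int      // optional
	JobID     string    // optional
	Hostname  string    // optional
	PUUID     string    // optional
	CUUID     string    // optional
	OID       string    // optional
	CtlOp     string    // optional, the "Control Operation" field
	Data      []byte    // optional, opaque blob
}

// Validate reports the first mandatory field that is missing or malformed.
func (e *RASEvent) Validate() error {
	switch {
	case e.ID == "":
		return errors.New("ras event: missing ID")
	case len(e.ID) > maxIDLen:
		return errors.New("ras event: ID longer than 64 characters")
	case e.Type == "":
		return errors.New("ras event: missing Type")
	case e.Timestamp.IsZero():
		return errors.New("ras event: missing Timestamp")
	case e.Severity == "":
		return errors.New("ras event: missing Severity")
	case e.Msg == "":
		return errors.New("ras event: missing Msg")
	}
	return nil
}

// FormattedTimestamp returns the timestamp with microseconds and timezone offset.
func (e *RASEvent) FormattedTimestamp() string {
	return e.Timestamp.Format(tsLayout)
}

// sampleEvent builds an event with a fixed timestamp in a UTC+02:00 zone.
func sampleEvent() *RASEvent {
	rank := uint32(3)
	return &RASEvent{
		ID:        "rank_evicted", // hypothetical identifier
		Type:      "STATE_CHANGE",
		Timestamp: time.Date(2019, time.March, 1, 12, 0, 0, 123000, time.FixedZone("", 2*60*60)),
		Severity:  "Warning",
		Msg:       "rank evicted by SWIM",
		Rank:      &rank,
	}
}

func main() {
	ev := sampleEvent()
	if err := ev.Validate(); err != nil {
		fmt.Println("invalid event:", err)
		return
	}
	fmt.Println(ev.FormattedTimestamp(), ev.Severity, ev.ID, ev.Msg)
}
```

Using pointers for `Rank`, `PID` and `TID` lets the sketch distinguish "not set" from a legitimate value of zero; the real encoding may well differ.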
RAS events include:
DAOS system is up
DAOS system is stopped
DAOS system is (re)formatted
Minimum number of storage devices (i.e. PMEM, SSD) not met
Cannot find requested network interface
Not enough huge pages
Rank is evicted by SWIM
I/O server / rank failed … and then restarted
Faulty SSD, or SSD wear-levelling factor degraded below threshold
SCM error
Checksum error
Mgmt service lost redundancy
Pool service lost redundancy
Rank is evicted from pool
Rank is reintegrated in pool
Rebuild started and completed
Rebuild hit some failure (e.g. no space)
OSA space rebalancing completed/failed
Data might have been lost due to cascading failures taking us below the container’s redundancy factor
Watchdog triggered when ULT does not seem to be making progress
Aggregation hit some failures
Too many transaction aborts/retries preventing forward progress
Large number of out-of-bounds errors
Agent is unreachable
A lot of containers are being created (e.g. >1000 in a single pool)
A lot of snapshots are being created (e.g. >1000 for a single container)
A plugin interface will be available to allow RAS events to be emitted via different channels (e.g. syslog, ...). Events will also be logged into the io/control logs to ease troubleshooting after the fact, by placing the errors in line with the other messages emitted there.
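One possible shape for such a plugin interface is sketched below in Go. The names `Event`, `Sink`, `WriterSink`, `Dispatcher` and `emitToBuffers`, the one-line output format and the event ID used in `main` are all hypothetical and chosen only for illustration. Each channel implements a single `Emit` method, and a dispatcher fans every event out to all registered channels. An `io.Writer` stands in here for both the log file and an external channel; a syslog channel would be another implementation of the same interface.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// Event is a trimmed-down, hypothetical event with only the fields this sketch needs.
type Event struct {
	ID       string
	Severity string
	Msg      string
}

// Sink is a hypothetical plugin interface: one implementation per channel.
type Sink interface {
	Emit(ev Event) error
}

// WriterSink writes one line per event to any io.Writer (a log file, stdout, ...).
type WriterSink struct {
	W io.Writer
}

// Emit formats the event as a single line and writes it to the underlying writer.
func (s WriterSink) Emit(ev Event) error {
	_, err := fmt.Fprintf(s.W, "RAS %s [%s] %s\n", ev.ID, ev.Severity, ev.Msg)
	return err
}

// Dispatcher fans each event out to every registered sink.
type Dispatcher struct {
	sinks []Sink
}

// Register adds a channel to the dispatcher.
func (d *Dispatcher) Register(s Sink) {
	d.sinks = append(d.sinks, s)
}

// Emit delivers ev to all sinks. It keeps going after a failure so that one
// broken channel does not silence the others, and returns the first error seen.
func (d *Dispatcher) Emit(ev Event) error {
	var first error
	for _, s := range d.sinks {
		if err := s.Emit(ev); err != nil && first == nil {
			first = err
		}
	}
	return first
}

// emitToBuffers sends one event to two in-memory sinks and returns what each received.
func emitToBuffers(ev Event) (string, string, error) {
	var logBuf, extBuf bytes.Buffer
	d := &Dispatcher{}
	d.Register(WriterSink{W: &logBuf}) // stands in for the io/control log
	d.Register(WriterSink{W: &extBuf}) // stands in for an external channel
	err := d.Emit(ev)
	return logBuf.String(), extBuf.String(), err
}

func main() {
	d := &Dispatcher{}
	d.Register(WriterSink{W: os.Stdout})
	ev := Event{ID: "rank_evicted", Severity: "Warning", Msg: "rank evicted by SWIM"}
	if err := d.Emit(ev); err != nil {
		fmt.Fprintln(os.Stderr, "emit failed:", err)
	}
}
```

Registering the log writer as just another sink keeps the "always log in line with other messages" behaviour and the optional channels on the same code path, which is one way to avoid the two drifting apart.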