RAS Events
For RAS events, the intent is to provide a table in the online documentation similar to the one GPFS provides (see https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1pdg_rasevents_gpfs.htm).
The format of the DAOS RAS event will be as follows:
| Field | Optional/Mandatory | Description |
| --- | --- | --- |
| ID | Mandatory | Unique event identifier referenced in the manual. 64-char string. |
| Type | Mandatory | Event type, e.g. STATE_CHANGE or INFO_ONLY. |
| Timestamp | Mandatory | Fully qualified timestamp associated with the event, with microsecond resolution and a timezone offset to avoid locality issues. |
| Severity | Mandatory | One of Fatal, Error, Warning or Info. |
| Msg | Mandatory | Human-readable message. |
| HID | Optional | Identifies the hardware component involved in the event, e.g. PCI address for an SSD, network interface, … |
| Rank | Optional | DAOS rank involved in the event. |
| PID | Optional | Identifier of the process involved in the RAS event. |
| TID | Optional | Identifier of the thread involved in the RAS event. |
| JOBID | Optional | Identifier of the job involved in the RAS event. |
| Hostname | Optional | Hostname of the node involved in the event. |
| PUUID | Optional | Pool UUID involved in the event, if any. |
| CUUID | Optional | Container UUID involved in the event, if relevant. |
| OID | Optional | Object identifier involved in the event, if relevant. |
| Control Operation | Optional | Recommended automatic action, if any. |
| Data | Optional | Instance-specific data, treated as a blob. |
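
To make the mandatory/optional split concrete, the table above can be sketched as a small record type. This is only an illustration, not the DAOS implementation: the class and attribute names, the `rank_evicted` identifier in the usage note, and the reading of "64-char string" as an upper bound on the ID length are all assumptions.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class EventType(Enum):
    STATE_CHANGE = "STATE_CHANGE"
    INFO_ONLY = "INFO_ONLY"


class Severity(Enum):
    FATAL = "Fatal"
    ERROR = "Error"
    WARNING = "Warning"
    INFO = "Info"


def now_timestamp() -> str:
    """Local time with microsecond resolution and an explicit UTC offset."""
    local = datetime.now(timezone.utc).astimezone()
    return local.isoformat(timespec="microseconds")


@dataclass
class RASEvent:
    # Mandatory fields (names are hypothetical, mirroring the table).
    id: str
    type: EventType
    severity: Severity
    msg: str
    timestamp: str = field(default_factory=now_timestamp)
    # Optional fields: left as None when they do not apply.
    hid: Optional[str] = None
    rank: Optional[int] = None
    pid: Optional[int] = None
    tid: Optional[int] = None
    jobid: Optional[str] = None
    hostname: Optional[str] = None
    puuid: Optional[str] = None
    cuuid: Optional[str] = None
    oid: Optional[str] = None
    control_op: Optional[str] = None
    data: Optional[bytes] = None

    def __post_init__(self) -> None:
        # Assumption: "64-char string" is treated as a maximum length.
        if not self.id or len(self.id) > 64:
            raise ValueError("ID must be 1 to 64 characters long")
        if not self.msg:
            raise ValueError("Msg is mandatory")

    def to_dict(self) -> dict:
        """Return only the fields that are set, with enums as plain strings."""
        out = {}
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None:
                continue
            out[f.name] = value.value if isinstance(value, Enum) else value
        return out
```

For example, `RASEvent(id="rank_evicted", type=EventType.STATE_CHANGE, severity=Severity.WARNING, msg="Rank 3 evicted", rank=3).to_dict()` would carry the five mandatory fields plus `rank`, and omit every optional field that was not supplied.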
RAS events include:
- DAOS system is up
- DAOS system is stopped
- DAOS system is (re)formatted
- Minimum number of storage devices (i.e. PMEM, SSD) not met
- Cannot find requested network interface
- Not enough huge pages
- Rank is evicted by SWIM
- I/O server / rank failed … and then restarted
- Faulty SSD, or wear-levelling factor degrading and below threshold
- SCM error
- Checksum error
- Mgmt service lost redundancy
- Pool service lost redundancy
- Rank is evicted from pool
- Rank is reintegrated in pool
- Rebuild started and completed
- Rebuild hit some failure (e.g. no space)
- OSA space rebalancing completed/failed
- Data might have been lost due to cascading failures taking us below the container’s redundancy factor
- Watchdog triggered when ULT does not seem to be making progress
- Aggregation hit some failures
- Too many transaction aborts/retries preventing forward progress
- A large number of out-of-bounds errors
- Agent is unreachable
- A lot of containers are being created (e.g. >1000 in a single pool)
- A lot of snapshots are being created (e.g. >1000 for a single container)
A plugin interface will be available to allow emitting RAS events via different channels (e.g. syslog, ...). Events will also be written to the I/O and control logs so that, when troubleshooting after the fact, they appear in line with the other messages emitted.
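
One possible shape for such a plugin interface is sketched below, assuming each channel implements a single `emit` method and a dispatcher fans every event out to all registered channels. The names `RASChannel`, `LogChannel`, `MemoryChannel` and `RASDispatcher` are hypothetical, and events are represented as plain dictionaries for brevity.

```python
import logging
from abc import ABC, abstractmethod
from typing import Dict, List


class RASChannel(ABC):
    """Hypothetical plugin interface: one implementation per output channel."""

    @abstractmethod
    def emit(self, event: Dict[str, str]) -> None:
        """Deliver a single RAS event to this channel."""


class LogChannel(RASChannel):
    """Writes events through a standard logger, in line with other messages.

    A syslog or file handler can be attached to the logger by the caller.
    """

    _LEVELS = {
        "Fatal": logging.CRITICAL,
        "Error": logging.ERROR,
        "Warning": logging.WARNING,
        "Info": logging.INFO,
    }

    def __init__(self, logger: logging.Logger) -> None:
        self.logger = logger

    def emit(self, event: Dict[str, str]) -> None:
        level = self._LEVELS.get(event.get("severity", "Info"), logging.INFO)
        self.logger.log(level, "RAS %s: %s", event["id"], event["msg"])


class MemoryChannel(RASChannel):
    """Keeps events in a list; handy for tests and local inspection."""

    def __init__(self) -> None:
        self.events: List[Dict[str, str]] = []

    def emit(self, event: Dict[str, str]) -> None:
        self.events.append(event)


class RASDispatcher:
    """Forwards each event to every registered channel."""

    def __init__(self) -> None:
        self._channels: List[RASChannel] = []

    def register(self, channel: RASChannel) -> None:
        self._channels.append(channel)

    def publish(self, event: Dict[str, str]) -> int:
        """Return the number of channels that accepted the event."""
        delivered = 0
        for channel in self._channels:
            try:
                channel.emit(event)
                delivered += 1
            except Exception:
                # A failing plugin must not prevent delivery to the others.
                logging.getLogger(__name__).exception("RAS channel failed")
        return delivered
```

In this sketch a failing channel is logged and skipped rather than aborting delivery, on the assumption that losing one output (say, an unreachable syslog daemon) should not hide the event from the remaining channels.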