RAS Events

For RAS events, the intent is to provide a table in the online documentation similar to the one GPFS provides (see https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1pdg_rasevents_gpfs.htm).

The format of the DAOS RAS event will be as follows:

| Field | Optional/Mandatory | Description |
|---|---|---|
| ID | Mandatory | Unique event identifier referenced in the manual. 64-char string. |
| Type | Mandatory | Event type, e.g. STATE_CHANGE or INFO_ONLY. |
| Timestamp | Mandatory | Fully qualified timestamp associated with the event, with microsecond resolution and a timezone offset to avoid locality issues. |
| Severity | Mandatory | One of Fatal, Warning, Error or Info. |
| Msg | Mandatory | Human-readable message. |
| HID | Optional | Identifies the hardware component involved in the event, e.g. PCI address for an SSD, network interface, … |
| Rank | Optional | DAOS rank involved in the event. |
| PID | Optional | Identifier of the process involved in the RAS event. |
| TID | Optional | Identifier of the thread involved in the RAS event. |
| JOBID | Optional | Identifier of the job involved in the RAS event. |
| Hostname | Optional | Hostname of the node involved in the event. |
| PUUID | Optional | Pool UUID involved in the event, if any. |
| CUUID | Optional | Container UUID involved in the event, if relevant. |
| OID | Optional | Object identifier involved in the event, if relevant. |
| Control Operation | Optional | Recommended automatic action, if any. |
| Data | Optional | Instance-specific data treated as a blob. |
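The field list above can be sketched as a small record type. This is only an illustration, not the DAOS implementation: the class name `RasEvent`, the Python attribute names, the helper `now_timestamp`, and the reading of "64-char string" as "at most 64 characters" are all assumptions made for this example.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Optional

# Severity values taken from the table; exact spelling/case is an assumption.
SEVERITIES = {"Fatal", "Warning", "Error", "Info"}


def now_timestamp() -> str:
    """Local time in ISO 8601 form with microseconds and a UTC offset."""
    return datetime.now().astimezone().isoformat(timespec="microseconds")


@dataclass
class RasEvent:
    # Mandatory fields.
    id: str
    type: str          # e.g. "STATE_CHANGE" or "INFO_ONLY"
    severity: str
    msg: str
    timestamp: str = field(default_factory=now_timestamp)
    # Optional fields; None means "not relevant to this event".
    hid: Optional[str] = None
    rank: Optional[int] = None
    pid: Optional[int] = None
    tid: Optional[int] = None
    jobid: Optional[str] = None
    hostname: Optional[str] = None
    puuid: Optional[str] = None
    cuuid: Optional[str] = None
    oid: Optional[str] = None
    control_op: Optional[str] = None
    data: Optional[bytes] = None

    def __post_init__(self) -> None:
        # Assumption: "64-char string" is treated as an upper bound.
        if not self.id or len(self.id) > 64:
            raise ValueError("id must be 1 to 64 characters long")
        if not self.type:
            raise ValueError("type is mandatory")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.msg:
            raise ValueError("msg is mandatory")

    def to_dict(self) -> dict:
        """Return only the fields that are set, omitting unused optionals."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

For example, `RasEvent(id="example_event", type="INFO_ONLY", severity="Info", msg="rank restarted", rank=3).to_dict()` yields a dictionary holding the five mandatory fields plus `rank`, with the other optional fields left out.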

 

RAS events include:

  • DAOS system is up
  • DAOS system is stopped
  • DAOS system is (re)formatted
  • Minimum number of storage devices (i.e. PMEM, SSD) not met
  • Cannot find requested network interface
  • Not enough huge pages
  • Rank is evicted by SWIM
  • I/O server / rank failed … and then restarted
  • Faulty SSD, or wear-levelling factor degrading below threshold
  • SCM error
  • Checksum error
  • Mgmt service lost redundancy
  • Pool service lost redundancy
  • Rank is evicted from pool
  • Rank is reintegrated in pool
  • Rebuild started and completed
  • Rebuild hit some failure (e.g. no space)
  • OSA space rebalancing completed/failed
  • Data might have been lost due to cascading failures that brought redundancy below the container’s redundancy factor
  • Watchdog triggered when ULT does not seem to be making progress
  • Aggregation hit some failures
  • Too many transaction aborts/retries preventing forward progress
  • A large number of out-of-bounds errors
  • Agent is unreachable
  • A lot of containers are being created (e.g. >1000 in a single pool)
  • A lot of snapshots are being created (e.g. >1000 for a single container)
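Count-based events such as the last two items would probably need to be raised once when the limit is crossed, not on every check. A minimal sketch of such an edge-triggered check follows; the class name `ThresholdMonitor` and its behaviour are hypothetical, and the default of 1000 is taken from the examples in the list.

```python
class ThresholdMonitor:
    """Reports the first time a per-key count exceeds a threshold."""

    def __init__(self, threshold: int = 1000) -> None:
        self.threshold = threshold
        self._raised = set()  # keys for which an event is already outstanding

    def update(self, key: str, count: int) -> bool:
        """Return True only when `count` first goes above the threshold.

        `key` would be a pool UUID (container count) or a container UUID
        (snapshot count). Dropping back to the threshold or below re-arms
        the check for that key.
        """
        if count > self.threshold:
            if key in self._raised:
                return False
            self._raised.add(key)
            return True
        self._raised.discard(key)
        return False
```

A caller would emit the RAS event whenever `update` returns True, so a pool sitting at 1500 containers produces one event, not one per check.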

A plugin interface will be available to emit RAS events via different channels (e.g. syslog, ...). Events will also be logged into the io/control logs, in line with any other emitted messages, to ease troubleshooting after the fact.
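One possible shape for such a plugin interface is sketched below. It is an assumption-laden illustration, not the planned DAOS API: `RasChannel`, `SyslogChannel`, `RasDispatcher`, `format_event`, the message format, and the severity-to-priority mappings are all invented for this example. Events are represented as plain dictionaries keyed by lower-case field names.

```python
import logging
from abc import ABC, abstractmethod

# Assumed mapping from the Severity field to Python logging levels.
LOG_LEVELS = {
    "Fatal": logging.CRITICAL,
    "Error": logging.ERROR,
    "Warning": logging.WARNING,
    "Info": logging.INFO,
}


def format_event(event: dict) -> str:
    """Render the mandatory fields as a single log line."""
    return f"RAS {event['id']} [{event['severity']}]: {event['msg']}"


class RasChannel(ABC):
    """Plugin interface: one subclass per output channel."""

    @abstractmethod
    def emit(self, event: dict) -> None:
        ...


class SyslogChannel(RasChannel):
    """Forwards events to the local syslog daemon (Unix only)."""

    def __init__(self) -> None:
        import syslog  # imported lazily because the module is Unix-only
        self._syslog = syslog
        self._priorities = {
            "Fatal": syslog.LOG_CRIT,
            "Error": syslog.LOG_ERR,
            "Warning": syslog.LOG_WARNING,
            "Info": syslog.LOG_INFO,
        }

    def emit(self, event: dict) -> None:
        prio = self._priorities.get(event["severity"], self._syslog.LOG_INFO)
        self._syslog.syslog(prio, format_event(event))


class RasDispatcher:
    """Logs every event locally, then fans it out to registered channels."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger
        self._channels = []

    def register(self, channel: RasChannel) -> None:
        self._channels.append(channel)

    def emit(self, event: dict) -> None:
        # Always log in line with other messages for later troubleshooting.
        level = LOG_LEVELS.get(event["severity"], logging.INFO)
        self._logger.log(level, format_event(event))
        for channel in self._channels:
            try:
                channel.emit(event)
            except Exception:
                # A failing channel must not prevent delivery to the others.
                self._logger.exception(
                    "RAS channel %s failed", type(channel).__name__
                )
```

In this sketch the local log write happens before any channel is called, so the event is still recorded in the regular logs when an external channel is unavailable.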