NVMe SSD LED Control in DAOS

In the Distributed Asynchronous Object Storage (DAOS) ecosystem, LED management is a critical component of the Reliability, Availability, and Serviceability (RAS) framework, ensuring that physical hardware states match the software's cluster inventory.

 

The Role of the Intel VMD and SPDK

DAOS operates primarily in userspace to bypass kernel overhead. When NVMe drives are behind an Intel Volume Management Device (VMD), DAOS manages LEDs via the SPDK (Storage Performance Development Kit) VMD driver.

Hardware Isolation

VMD creates a separate PCI domain. The SPDK driver interacts with this domain to standardise how blink codes (SFF-8489/IBPI) are sent to the backplane.

API Triggering

When a DAOS service (via daos_server) detects a disk event, it calls the spdk_vmd_set_led_state() API. This translates logical storage health into physical signals without requiring a kernel-level context switch.

 

LED States and Device Health

DAOS uses a solid amber LED to indicate the “FAULT” device state and a blinking LED to identify a drive to data center technicians. Note that these descriptions refer to a VMD implementation in which SPDK sets values that are then interpreted by the specific platform. Because platform implementations and their interpretations of the standard vary, the observed LED behavior can occasionally differ. The RAS LED events, however, are consistent in their message output.

LED normal state:

  • SPDK pattern: off

  • SPDK constant: SPDK_VMD_LED_STATE_OFF

  • DAOS constant: OFF

Identify (Locate)

Triggered manually by an administrator via dmg storage led identify. This helps physically locate a drive for planned maintenance. The identify state persists until it is cancelled or times out. No other state is shown during that period (identify overrides the normal and faulty states).

LED locate:

  • SPDK pattern: ~4Hz blinking

  • SPDK constant: SPDK_VMD_LED_STATE_IDENTIFY

  • DAOS constant: QUICK_BLINK
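The override rule described above can be sketched as a small precedence function. This is an illustrative model only, not the DAOS implementation: the function name, the identify_deadline parameter and the use of a monotonic clock are assumptions made for the example. The returned strings are the DAOS constants listed in this document (ON is the fault state).

```python
import time
from typing import Optional


def effective_led_state(faulty: bool,
                        identify_deadline: Optional[float],
                        now: Optional[float] = None) -> str:
    """Return the DAOS LED state string that would be displayed.

    identify_deadline is a hypothetical monotonic timestamp at which an
    identify request times out; None means no identify request is active
    (never requested, or already cancelled).
    """
    if now is None:
        now = time.monotonic()
    # Identify overrides both the normal and the faulty state while active.
    if identify_deadline is not None and now < identify_deadline:
        return "QUICK_BLINK"
    # Otherwise the LED reflects device health: ON (fault) or OFF (normal).
    return "ON" if faulty else "OFF"
```

For example, a faulty drive with an active identify request blinks until the deadline passes and then reverts to solid on.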

Fault (Critical Failure)

If the DAOS engine BIO module detects I/O (media) or checksum errors and the predefined faulty thresholds are exceeded, an “auto-faulty” reaction occurs. The “auto-faulty” thresholds are part of the self-healing framework designed to automatically isolate failing storage components (targets or engine ranks) to maintain data availability. This can happen if the drive becomes unresponsive or unreliable. The BIO module then automatically sets the LED to a solid amber “FAULT” state. This indicates that the drive has been evicted, is no longer part of the pool, and is safe to pull or hotplug.

LED faulty:

  • SPDK pattern: solid on

  • SPDK constant: SPDK_VMD_LED_STATE_FAULT

  • DAOS constant: ON

Rebuild/Activity (Not Yet Implemented)

While DAOS typically leaves green activity LEDs to the hardware firmware, the status LED may in the future (TODO) be programmed to indicate an active data rebuild or "Evacuation" process, signaling that the drive is still logically "in-use" despite pending removal.

LED rebuild:

  • SPDK pattern: ~1Hz blinking

  • SPDK constant: SPDK_VMD_LED_STATE_REBUILD

  • DAOS constant: SLOW_BLINK
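The four states listed in this section can be collected into a single lookup table, which is convenient for scripts that translate between the DAOS strings and the SPDK constants. The sketch below only restates the values given above. The LedState type, the LED_STATES table and the describe() helper are names made up for this example and are not part of DAOS or SPDK.

```python
from typing import Dict, NamedTuple


class LedState(NamedTuple):
    spdk_constant: str   # enum name as listed in this document
    pattern: str         # nominal blink pattern
    implemented: bool    # False for states marked "Not Yet Implemented"


# Keyed by the DAOS constant.
LED_STATES: Dict[str, LedState] = {
    "OFF": LedState("SPDK_VMD_LED_STATE_OFF", "off", True),
    "QUICK_BLINK": LedState("SPDK_VMD_LED_STATE_IDENTIFY", "~4Hz blinking", True),
    "ON": LedState("SPDK_VMD_LED_STATE_FAULT", "solid on", True),
    "SLOW_BLINK": LedState("SPDK_VMD_LED_STATE_REBUILD", "~1Hz blinking", False),
}


def describe(daos_state: str) -> str:
    """Return a one-line description of a DAOS LED state string."""
    state = LED_STATES[daos_state]  # raises KeyError for unknown strings
    suffix = "" if state.implemented else " (not yet implemented)"
    return f"{daos_state}: {state.pattern} via {state.spdk_constant}{suffix}"
```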

 

Non-VMD and Heterogeneous Platforms

On non-Intel platforms (like AMD or ARM) or configurations where VMD is disabled, DAOS relinquishes direct LED control to the OS or external orchestrators.

Enclosure Services

DAOS emits RAS events to syslog, and an external agent (such as ledctl or a BMC-based service) can monitor these events and update the LEDs via Native PCIe Enclosure Management (NPEM), a protocol defined in the PCI Express specification (https://pcisig.com/PCIExpress/ECN/Base/MSI-X).

Scripting

In these scenarios, external scripts can receive RAS events from DAOS, which describe LED patterns and device addresses, and toggle the LEDs via the IPMI or Redfish interfaces.
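A minimal sketch of such a script is shown here. It assumes the device_led_set message wording from the RAS event example in this document, and it deliberately leaves the platform-specific call open: set_led is a hypothetical callback that a site would implement with its own IPMI, Redfish or enclosure tooling. The name dispatch_led_events is also made up for illustration.

```python
import re
from typing import Callable, Iterable

# Message wording taken from the device_led_set example in this document.
LED_MSG = re.compile(r"LED on device (\S+) set to state (\w+)")


def dispatch_led_events(lines: Iterable[str],
                        set_led: Callable[[str, str], None]) -> int:
    """Forward LED state changes found in syslog lines to a backend.

    set_led(address, state) is a caller-supplied (hypothetical) function
    that performs the platform-specific LED update.
    Returns the number of events forwarded.
    """
    handled = 0
    for line in lines:
        # Ignore everything except LED state-change events.
        if "id: [device_led_set]" not in line:
            continue
        match = LED_MSG.search(line)
        if match is None:
            continue
        address, state = match.groups()
        set_led(address, state)
        handled += 1
    return handled
```

The function accepts any iterable of lines, so it can be fed from a file object, sys.stdin or a log-follow pipeline. Keeping the backend as a callback separates event parsing from the vendor-specific LED interface.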

 

RAS Event Schema (LED State Changes)

DAOS emits RAS events when a drive's health status changes or an LED toggles. These events are logged to syslog and can be consumed by any configured syslog application. See RAS Events for details on RAS events and schema.

Example: Device Replace Event

If a DAOS device is replaced with the dmg storage replace nvme command, the following RAS events are issued (the first to clear the LED state, the second to indicate the replacement):

Mar 4 14:49:58 edaos-15 install/bin/daos_server[1227720]: id: [device_led_set] ts: [2026-03-04T14:49:58.656807+0000] host: [edaos-15.daos] type: [INFO] sev: [NOTICE] msg: [LED on device 0000:5e:00.0 set to state OFF] pid: [1254274] rank: [0]

Mar 4 14:49:58 edaos-15 install/bin/daos_server[1227720]: id: [device_replace] ts: [2026-03-04T14:49:58.658815+0000] host: [edaos-15.daos] type: [INFO] sev: [NOTICE] msg: [Replaced device: b5d51aba with device b5d51aba#012] pid: [1254274] rank:[0]

Valid strings for LED state in RAS events are QUICK_BLINK (locate/identify), ON (fault), OFF (normal).
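As an illustration of consuming these events, the sketch below splits a syslog line into its "name: [value]" fields. It is based only on the layout visible in the example above, not on the full RAS schema, and parse_ras_event is a name chosen for this example. The optional whitespace in the pattern accommodates both "rank: [0]" and "rank:[0]", which both appear in the sample output.

```python
import re
from typing import Dict

# One "name: [value]" pair; values are assumed not to contain "]".
FIELD = re.compile(r"(\w+):\s*\[([^\]]*)\]")


def parse_ras_event(line: str) -> Dict[str, str]:
    """Return the bracketed fields of one RAS event line as a dict."""
    return dict(FIELD.findall(line))


if __name__ == "__main__":
    line = ("Mar 4 14:49:58 edaos-15 install/bin/daos_server[1227720]: "
            "id: [device_led_set] ts: [2026-03-04T14:49:58.656807+0000] "
            "host: [edaos-15.daos] type: [INFO] sev: [NOTICE] "
            "msg: [LED on device 0000:5e:00.0 set to state OFF] "
            "pid: [1254274] rank: [0]")
    event = parse_ras_event(line)
    print(event["id"], "->", event["msg"])
```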

 

Managing LEDs with DAOS’s dmg tool

The primary tool for managing NVMe storage LEDs in a DAOS environment is the dmg (DAOS Management) utility. Below are the specific command syntaxes used to manually control LED states for drive identification and maintenance workflows.

The dmg storage led identify command is used to physically locate a drive by blinking its status LED. The command takes DAOS device UUIDs and PCI addresses as drive IDs. See the command usage and the administration guide for more details.

Instructions on fault management can also be found in the administration guide, which includes references to dmg storage set nvme-faulty and dmg storage replace nvme. These commands turn the device LED on and off, respectively.

 

Practical Tips for Administrators

  • Scan First: Use dmg storage scan -v to retrieve a list of all drives, their current health status, and their UUIDs/PCI addresses.

  • JSON Output: For automation scripts, add the -j flag (e.g., dmg -j storage scan) to receive machine-readable output for parsing device IDs and other details.

  • Permissions: These commands must be run by a user with DAOS administrative certificates unless insecure mode is enabled.
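For the JSON output tip, a script does not need to hard-code the full document layout. The sketch below walks any parsed JSON document and collects every value stored under a given key. The sample document and the "pci_addr" field name are assumptions for illustration only, so check the actual dmg -j storage scan output on your system before relying on them.

```python
import json
from typing import Any, List


def collect_values(node: Any, key: str) -> List[Any]:
    """Recursively collect all values stored under `key` in parsed JSON."""
    found: List[Any] = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            else:
                found.extend(collect_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hypothetical sample; the real dmg output layout may differ.
SAMPLE = """
{"response": {"devices": [{"pci_addr": "0000:5e:00.0"},
                          {"pci_addr": "0000:5f:00.0"}]}}
"""

if __name__ == "__main__":
    for address in collect_values(json.loads(SAMPLE), "pci_addr"):
        print(address)
```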