Engine Suicide and Automatic Restart - Design Documentation

JIRA Ticket: DAOS-17427
Title: Handle engine suicides by automatically restarting the engines
Priority: P2-High
Status: In Review
Components: Control Plane, SWIM
Fix Version: 2.8 Community Release

Problem Statement

Current Behavior

When a DAOS engine detects that it has been excluded from the system (typically after a prolonged network disruption), it commits suicide by raising SIGKILL. The underlying problem is often temporary, yet manual intervention is still required to investigate it and to rejoin the engine to the system by restarting the process.

Desired Behavior

The engine should automatically restart and rejoin the system when it has been excluded from the group map but is still in a functional state (and therefore capable of emitting an event and self-terminating). This state indicates a high probability that the problem behind the exclusion was transient, so a rejoin is valid. The behavior will be enabled by default, with an option to disable automatic engine restart via the server config file. The solution should:

  1. Notify the local control server about the exclusion condition and self-termination (suicide)

  2. Allow the local control server to manage the process restart

  3. Automatically rejoin the system after restart

  4. Avoid repeated rapid restarts (rate limiting)

Solution Design

Architecture Overview

The solution implements an event-driven automatic restart mechanism using the existing RAS event system:

  1. Engine Detection: Engine detects the exclusion condition when it receives a CaRT event signifying that its rank has been excluded from the system group map

  2. Event Notification: Engine raises engine_suicide RAS event

  3. Control Plane Handling: Local control server receives and processes event

  4. Await Stop: Wait for engine process to exit completely

  5. Automatic Restart: Local control server restarts engine

  6. System Rejoin: Engine rejoins with new incarnation

Event-Based Communication

Event Type: engine_suicide (RAS_ENGINE_SUICIDE)

  • Type: INFO_ONLY (not a state change)

  • Severity: NOTICE

  • Fields: rank, incarnation, hostname, timestamp

  • Message: "excluded rank suicide detected"

Rationale for INFO_ONLY:

  • Event is local to the host (doesn't need MS database update)

  • Handled by local control server regardless of MS leadership

  • No need to forward to MS leader (unlike STATE_CHANGE events)
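The routing distinction above can be sketched as follows. This is a minimal illustration only: RASType, RASEvent and route are simplified, hypothetical stand-ins and not the definitions in src/control/events, and the field values in main are made up for the example.

```go
package main

import "fmt"

// RASType and RASEvent are hypothetical, simplified stand-ins for the
// real control-plane event types.
type RASType int

const (
	TypeInfoOnly RASType = iota
	TypeStateChange
)

type RASEvent struct {
	ID          string
	Type        RASType
	Rank        uint32
	Incarnation uint64
	Hostname    string
	Timestamp   string
	Msg         string
}

// route reports where an event is handled: STATE_CHANGE events are
// forwarded to the MS leader, INFO_ONLY events stay on the local
// control server.
func route(evt RASEvent) string {
	if evt.Type == TypeStateChange {
		return "forward-to-ms-leader"
	}
	return "handle-locally"
}

func main() {
	evt := RASEvent{
		ID:          "engine_suicide",
		Type:        TypeInfoOnly,
		Rank:        3,
		Incarnation: 42,
		Hostname:    "node-1", // example value
		Timestamp:   "2025-01-01T00:00:00Z", // example value
		Msg:         "excluded rank suicide detected",
	}
	fmt.Println(evt.ID, "->", route(evt))
}
```

Because the suicide event is INFO_ONLY, the sketch never needs to know which node currently holds MS leadership.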

Implementation: Control Plane Handler

Function: handleEngineSuicide() in src/control/server/server_utils.go

Responsibilities:

  1. Validate event (timestamp, rank)

  2. Locate engine instance by rank

  3. Verify single instance matches rank

  4. Wait for engine process to stop (using pollInstanceState)

  5. Request engine restart (via requestStart)

Error Handling:

  • Returns errors (propagated to event handler)

  • Logs errors for debugging

  • Gracefully handles missing ranks, timeouts

Subscription Model

The suicide handler is registered in both follower and leader modes:

Follower Mode (registerFollowerSubscriptions):

    srv.pubSub.Subscribe(events.RASTypeInfoOnly,
        events.HandlerFunc(func(ctx context.Context, evt *events.RASEvent) {
            switch evt.ID {
            case events.RASEngineSuicide:
                if err := handleEngineSuicide(ctx, srv, evt); err != nil {
                    srv.log.Errorf("handleEngineSuicide: %s", err)
                }
            }
        }))

Leader Mode (registerLeaderSubscriptions):

  • Same subscription as follower mode

  • Ensures local engines restart even when node is MS leader

Suicide-Restart Flow Diagram

Figure: suicide_restart_flow.png

State Machine

    [Engine Running]
            |
            | Detects exclusion
            v
    [Suicide Event Raised]
            |
            | Event published
            v
    [Control Server Notified]
            |
            | handleEngineSuicide()
            v
    [Wait for Engine Stop] <----- pollInstanceState() polls every 500ms
            |
            | IsStarted() == false
            v
    [Request Restart]
            |
            | requestStart()
            v
    [Engine Starting]
            |
            | System join with new incarnation
            v
    [Engine Running]

Design Decisions

1. Use RAS Event (Not dRPC or Exit Code)

Considered Alternatives:

  • New dRPC method

  • Special exit code (e.g., exit(DSS_EXIT_EXCLUDED))

Chosen: RAS Event System

Rationale:

  • ✅ Leverages existing infrastructure

  • ✅ Automatically logged to syslog

  • ✅ No new dRPC protocol needed

  • ✅ Avoids SPDK segfaults with exit codes

  • ✅ Consistent with other engine notifications

2. INFO_ONLY vs STATE_CHANGE

Chosen: INFO_ONLY

Rationale:

  • Event is local to the control server

  • No MS database update needed

  • Faster handling (no MS communication)

  • Works regardless of MS leadership

  • Restart is a local operation

3. Poll for Stop vs Immediate Restart

Chosen: Wait for stop using pollInstanceState()

Rationale:

  • ✅ Ensures complete shutdown before restart

  • ✅ Reuses existing polling infrastructure

  • ✅ Prevents failure to restart due to race condition

  • ✅ Configurable timeout via context

  • ✅ Consistent with StopRanks pattern

4. Handler in Both Follower and Leader

Chosen: Subscribe in both modes

Rationale:

  • Engine suicide is a local event

  • Restart must happen on local server

  • Independent of MS leadership

  • Ensures high availability

Rate Limiting and Safety

Current Implementation

  • Rate limiting is applied: at most one automatic restart is performed per minimum-delay period

    • disable_engine_auto_restart disables automatic restart when set (automatic restart is enabled by default)

    • engine_auto_restart_min_delay sets the minimum time between automatic restarts

  • Restart relies on engine state checks (won't restart if already running)

  • Context timeouts prevent indefinite waits

Testing Strategy

Unit Tests

File: src/control/server/server_utils_test.go

Coverage:

  • ✅ Event validation (timestamp, rank)

  • ✅ Engine lookup and filtering

  • ✅ Stop waiting mechanism

  • ✅ Restart request triggering

  • ✅ Error handling (all paths)

  • ✅ Edge cases (invalid ranks, timeouts)

  • ✅ Multi-engine scenarios

  • ✅ Subscription registration (follower/leader)

Integration Testing (TODO)

Scenario: Network disruption recovery

  1. Start DAOS cluster with multiple engines

  2. Inject network partition (iptables or network namespace)

  3. Wait for SWIM to detect and exclude engine

  4. Engine detects exclusion and commits suicide

  5. Verify suicide event is raised and logged

  6. Verify control server restarts engine

  7. Verify engine rejoins with new incarnation

  8. Restore network connectivity

  9. Verify system operates normally

API Surface

Public Functions (Engine)

    /**
     * Notify control plane that an excluded engine has committed suicide.
     *
     * \param[in] rank         Rank that committed suicide.
     * \param[in] incarnation  Incarnation of rank that committed suicide.
     *
     * \retval Zero on success, non-zero otherwise.
     */
    int ds_notify_rank_suicide(d_rank_t rank, uint64_t incarnation);

Internal Functions (Control)

    // Handle local engine suicide and restart engine to rejoin system.
    func handleEngineSuicide(ctx context.Context, srv *server, evt *events.RASEvent) error

Comparison with Related Features

vs. handleRankDead (SWIM)

    Aspect        handleRankDead      handleEngineSuicide
    Trigger       SWIM detection      Engine self-detection
    Event Type    STATE_CHANGE        INFO_ONLY
    MS Update     Yes (mark dead)     No
    Action        Mark dead in DB     Restart engine
    Scope         System-wide         Local server

Future Work

Potential Enhancements

  1. Rate Limiting Enhancements

    • Track restart attempts per rank

    • Exponential backoff on repeated suicides

    • Alert threshold for excessive restarts

  2. Metrics

    • Count of suicide events per rank

    • Average restart time

    • Success/failure rates

  3. Advanced Restart Strategy

    • Configurable restart policy

    • Dependency checking before restart

    • Health verification after restart

  4. Testing Tools

    • Fault injection interface (debug builds)

    • "Fake exclude" dmg command

    • Automated integration tests

References

Code Files

  • src/control/server/server_utils.go - Main handler

  • src/control/server/server_utils_test.go - Unit tests

  • src/control/events/ras.go - Event constants

  • src/engine/drpc_ras.c - Event raising

  • src/include/daos_srv/ras.h - Event definitions

Documentation

  • docs/admin/administration.md - Event reference

  • docs/overview/fault.md - Fault model concepts

  • src/control/events/README.md - Developer guide

Related Tickets

  • DAOS-17427 - Engine suicide automatic restart (this ticket)

  • Related to SWIM rank dead handling

  • Related to self-healing mechanisms

Glossary

  • Suicide: Engine self-termination when detecting removal from group map

  • Incarnation: Sequence number identifying engine restart instance

  • INFO_ONLY: Event type that is logged but doesn't update MS database

  • STATE_CHANGE: Event type that will be forwarded to MS database leader

  • Poll: Periodic checking of engine state

  • Harness: Control plane component managing engine instances

  • SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol
