Engine Suicide and Automatic Restart - Design Documentation
JIRA Ticket: DAOS-17427
Title: Handle engine suicides by automatically restarting the engines
Priority: P2-High
Status: In Review
Components: Control Plane, SWIM
Fix Version: 2.8 Community Release
Problem Statement
Current Behavior
When a DAOS engine detects that it has been excluded from the system (typically after a prolonged network disruption), it commits suicide by raising SIGKILL. The underlying problem is often temporary, yet manual intervention is still required to investigate it and to rejoin the engine to the system by restarting the process.
Desired Behavior
The engine should automatically restart and rejoin the system when it has been excluded from the group map but is still in a functional state (and therefore capable of emitting an event and self-terminating). This state indicates a high probability that the problem behind the exclusion was temporary, so a rejoin is valid. The behavior will be enabled by default, with an option in the server config file to disable automatic engine restart. In particular, the design should:
Notify the local control server about the exclusion condition and self-termination (suicide)
Allow the local control server to manage the process restart
Automatically rejoin the system after restart
Avoid repeated rapid restarts (rate limiting)
Solution Design
Architecture Overview
The solution implements an event-driven automatic restart mechanism using the existing RAS event system:
Engine Detection: Engine detects exclusion condition when it receives a CART event that signifies the rank has been excluded from the system group map.
Event Notification: Engine raises engine_suicide RAS event
Control Plane Handling: Local control server receives and processes event
Await Stop: Wait for engine process to exit completely
Automatic Restart: Local control server restarts engine
System Rejoin: Engine rejoins with new incarnation
Event-Based Communication
Event Type: engine_suicide (RAS_ENGINE_SUICIDE)
Type: INFO_ONLY (not a state change)
Severity: NOTICE
Fields: rank, incarnation, hostname, timestamp
Message: "excluded rank suicide detected"
Rationale for INFO_ONLY:
Event is local to the host (doesn't need MS database update)
Handled by local control server regardless of MS leadership
No need to forward to MS leader (unlike STATE_CHANGE events)
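The event fields and the INFO_ONLY routing rule above can be sketched in Go as follows. This is a minimal illustration only: `rasType`, `rasEvent`, `newEngineSuicideEvent` and `forwardToLeader` are hypothetical names, not the definitions in `src/control/events`, and modelling the rank as `uint32` and the timestamp as a string are assumptions.

```go
package main

// rasType and its constants are hypothetical names mirroring the two event
// types discussed in this document; they are not the DAOS definitions.
type rasType int

const (
	typeInfoOnly rasType = iota
	typeStateChange
)

// rasEvent holds the fields listed for the engine_suicide event.
// Field types are assumptions made for this sketch.
type rasEvent struct {
	ID          string
	Type        rasType
	Severity    string
	Rank        uint32
	Incarnation uint64
	Hostname    string
	Timestamp   string
	Message     string
}

// newEngineSuicideEvent builds an event with the type, severity and message
// described in this section.
func newEngineSuicideEvent(rank uint32, incarnation uint64, hostname, timestamp string) rasEvent {
	return rasEvent{
		ID:          "engine_suicide",
		Type:        typeInfoOnly,
		Severity:    "NOTICE",
		Rank:        rank,
		Incarnation: incarnation,
		Hostname:    hostname,
		Timestamp:   timestamp,
		Message:     "excluded rank suicide detected",
	}
}

// forwardToLeader reports whether an event has to reach the MS leader.
// Only state-change events do; info-only events are handled locally.
func forwardToLeader(evt rasEvent) bool {
	return evt.Type == typeStateChange
}
```

With these definitions, `forwardToLeader(newEngineSuicideEvent(3, 7, "node-1", "2025-01-01T00:00:00Z"))` returns false, which matches the rationale that the local control server handles the event regardless of MS leadership.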
Implementation: Control Plane Handler
Function: handleEngineSuicide() in src/control/server/server_utils.go
Responsibilities:
Validate event (timestamp, rank)
Locate engine instance by rank
Verify single instance matches rank
Wait for engine process to stop (using pollInstanceState)
Request engine restart (via requestStart)
Error Handling:
Returns errors (propagated to event handler)
Logs errors for debugging
Gracefully handles missing ranks, timeouts
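The responsibilities and error handling listed above could be arranged roughly as in the following sketch. It is not the actual `handleEngineSuicide()` implementation: `engineInstance`, `suicideEvent`, `handleSuicide` and `fakeEngine` are hypothetical stand-ins, and `WaitStopped` and `RequestStart` are assumed wrappers around `pollInstanceState` and `requestStart`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// engineInstance is a hypothetical stand-in for the control plane's view of a
// local engine; the real harness types are not reproduced here.
type engineInstance interface {
	Rank() uint32
	WaitStopped(ctx context.Context) error // assumed wrapper around pollInstanceState
	RequestStart()                         // assumed wrapper around requestStart
}

// suicideEvent carries only the fields that this sketch validates.
type suicideEvent struct {
	Rank      uint32
	HasRank   bool   // false if the event did not carry a rank
	Timestamp string // empty if the event did not carry a timestamp
}

// handleSuicide follows the listed responsibilities in order and returns an
// error, without restarting anything, as soon as one step fails.
func handleSuicide(ctx context.Context, engines []engineInstance, evt suicideEvent) error {
	// Validate the event.
	if evt.Timestamp == "" {
		return errors.New("suicide event has no timestamp")
	}
	if !evt.HasRank {
		return errors.New("suicide event has no rank")
	}

	// Locate the engine by rank and verify that exactly one instance matches.
	var matches []engineInstance
	for _, e := range engines {
		if e.Rank() == evt.Rank {
			matches = append(matches, e)
		}
	}
	if len(matches) != 1 {
		return fmt.Errorf("expected one local engine with rank %d, found %d",
			evt.Rank, len(matches))
	}

	// Wait for the process to exit before requesting a restart.
	if err := matches[0].WaitStopped(ctx); err != nil {
		return fmt.Errorf("waiting for rank %d to stop: %w", evt.Rank, err)
	}
	matches[0].RequestStart()
	return nil
}

// fakeEngine is a test double that records restart requests.
type fakeEngine struct {
	rank     uint32
	stopErr  error
	restarts int
}

func (f *fakeEngine) Rank() uint32                      { return f.rank }
func (f *fakeEngine) WaitStopped(context.Context) error { return f.stopErr }
func (f *fakeEngine) RequestStart()                     { f.restarts++ }
```

Returning errors rather than logging inside the function keeps the sketch in line with the error handling described above, where the caller in the event handler decides how to log the failure.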
Subscription Model
The suicide handler is registered in both follower and leader modes:
Follower Mode (registerFollowerSubscriptions):
srv.pubSub.Subscribe(events.RASTypeInfoOnly,
	events.HandlerFunc(func(ctx context.Context, evt *events.RASEvent) {
		switch evt.ID {
		case events.RASEngineSuicide:
			if err := handleEngineSuicide(ctx, srv, evt); err != nil {
				srv.log.Errorf("handleEngineSuicide: %s", err)
			}
		}
	}))
Leader Mode (registerLeaderSubscriptions):
Same subscription as follower mode
Ensures local engines restart even when node is MS leader
Suicide-Restart Flow Diagram
State Machine
[Engine Running]
|
| Detects exclusion
v
[Suicide Event Raised]
|
| Event published
v
[Control Server Notified]
|
| handleEngineSuicide()
v
[Wait for Engine Stop] <----- pollInstanceState() polls every 500ms
|
| IsStarted() == false
v
[Request Restart]
|
| requestStart()
v
[Engine Starting]
|
| System join with new incarnation
v
[Engine Running]
Design Decisions
1. Use RAS Event (Not dRPC or Exit Code)
Considered Alternatives:
New dRPC method
Special exit code (e.g., exit(DSS_EXIT_EXCLUDED))
Chosen: RAS Event System
Rationale:
✅ Leverages existing infrastructure
✅ Automatically logged to syslog
✅ No new dRPC protocol needed
✅ Avoids SPDK segfaults with exit codes
✅ Consistent with other engine notifications
2. INFO_ONLY vs STATE_CHANGE
Chosen: INFO_ONLY
Rationale:
Event is local to the control server
No MS database update needed
Faster handling (no MS communication)
Works regardless of MS leadership
Restart is a local operation
3. Poll for Stop vs Immediate Restart
Chosen: Wait for stop using pollInstanceState()
Rationale:
✅ Ensures complete shutdown before restart
✅ Reuses existing polling infrastructure
✅ Prevents failure to restart due to race condition
✅ Configurable timeout via context
✅ Consistent with StopRanks pattern
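As an illustration of the wait-for-stop decision, the sketch below polls an engine state check at a fixed interval until the engine reports stopped or the context expires. It does not reproduce `pollInstanceState()`: `waitUntilStopped` and `waitWithTimeout` are hypothetical helpers, and the `isStarted` callback stands in for the `IsStarted()` check shown in the state machine.

```go
package main

import (
	"context"
	"time"
)

// waitUntilStopped calls isStarted immediately and then once per interval.
// It returns nil as soon as isStarted reports false, or the context error if
// ctx is cancelled or times out first. The interval must be positive.
func waitUntilStopped(ctx context.Context, interval time.Duration, isStarted func() bool) error {
	if !isStarted() {
		return nil
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if !isStarted() {
				return nil
			}
		}
	}
}

// waitWithTimeout bounds the wait with a context deadline, so a process that
// never exits cannot block the handler indefinitely.
func waitWithTimeout(timeout, interval time.Duration, isStarted func() bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitUntilStopped(ctx, interval, isStarted)
}
```

A call such as `waitWithTimeout(30*time.Second, 500*time.Millisecond, inst.IsStarted)` would match the 500ms polling interval shown in the state machine; the 30 second timeout and the `inst` variable are assumptions made for the example.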
4. Handler in Both Follower and Leader
Chosen: Subscribe in both modes
Rationale:
Engine suicide is a local event
Restart must happen on local server
Independent of MS leadership
Ensures high availability
Rate Limiting and Safety
Current Implementation
Rate limiting is applied: at most one automatic restart is performed per restart time period
disable_engine_auto_restart disables automatic engine restart when set (automatic restart is enabled by default)
engine_auto_restart_min_delay sets the minimum time between automatic restarts
Restart relies on engine state checks (won't restart if already running)
Context timeouts prevent indefinite waits
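The safety rules above can be illustrated with a small limiter that refuses a restart when automatic restart is disabled or when the minimum delay has not yet elapsed. This is a sketch under assumptions, not the shipped logic: `restartLimiter`, `newRestartLimiter`, `allow` and `manualClock` are hypothetical names, tracking the last restart per rank is an assumption, and the two constructor parameters merely mirror `disable_engine_auto_restart` and `engine_auto_restart_min_delay`.

```go
package main

import (
	"sync"
	"time"
)

// restartLimiter permits at most one automatic restart per rank within
// minDelay. Per-rank tracking is an assumption made for this sketch.
type restartLimiter struct {
	mu       sync.Mutex
	disabled bool                 // mirrors disable_engine_auto_restart
	minDelay time.Duration        // mirrors engine_auto_restart_min_delay
	now      func() time.Time     // injected clock, so behavior is testable
	last     map[uint32]time.Time // time of the last permitted restart
}

func newRestartLimiter(disabled bool, minDelay time.Duration, now func() time.Time) *restartLimiter {
	return &restartLimiter{
		disabled: disabled,
		minDelay: minDelay,
		now:      now,
		last:     make(map[uint32]time.Time),
	}
}

// allow reports whether the rank may be restarted now and, if so, records the
// current time as its most recent restart.
func (l *restartLimiter) allow(rank uint32) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	if l.disabled {
		return false
	}
	now := l.now()
	if prev, ok := l.last[rank]; ok && now.Sub(prev) < l.minDelay {
		return false
	}
	l.last[rank] = now
	return true
}

// manualClock is a hand-advanced clock for exercising the limiter.
type manualClock struct{ t time.Time }

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }
```

In production code the clock argument would simply be `time.Now`; the manual clock exists only so that the minimum delay can be checked without sleeping.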
Testing Strategy
Unit Tests
File: src/control/server/server_utils_test.go
Coverage:
✅ Event validation (timestamp, rank)
✅ Engine lookup and filtering
✅ Stop waiting mechanism
✅ Restart request triggering
✅ Error handling (all paths)
✅ Edge cases (invalid ranks, timeouts)
✅ Multi-engine scenarios
✅ Subscription registration (follower/leader)
Integration Testing (TODO)
Scenario: Network disruption recovery
Start DAOS cluster with multiple engines
Inject network partition (iptables or network namespace)
Wait for SWIM to detect and exclude engine
Engine detects exclusion and commits suicide
Verify suicide event is raised and logged
Verify control server restarts engine
Verify engine rejoins with new incarnation
Restore network connectivity
Verify system operates normally
API Surface
Public Functions (Engine)
/**
* Notify control plane that an excluded engine has committed suicide.
*
* \param[in] rank Rank that committed suicide.
* \param[in] incarnation Incarnation of rank that committed suicide.
*
* \retval Zero on success, non-zero otherwise.
*/
int ds_notify_rank_suicide(d_rank_t rank, uint64_t incarnation);
Internal Functions (Control)
// Handle local engine suicide and restart engine to rejoin system.
func handleEngineSuicide(ctx context.Context, srv *server, evt *events.RASEvent) error
Comparison with Related Features
vs. handleRankDead (SWIM)
| Aspect | handleRankDead | handleEngineSuicide |
|---|---|---|
| Trigger | SWIM detection | Engine self-detection |
| Event Type | STATE_CHANGE | INFO_ONLY |
| MS Update | Yes (mark dead) | No |
| Action | Mark dead in DB | Restart engine |
| Scope | System-wide | Local server |
Future Work
Potential Enhancements
Rate Limiting
Track restart attempts per rank
Exponential backoff on repeated suicides
Alert threshold for excessive restarts
Metrics
Count of suicide events per rank
Average restart time
Success/failure rates
Advanced Restart Strategy
Configurable restart policy
Dependency checking before restart
Health verification after restart
Testing Tools
Fault injection interface (debug builds)
"Fake exclude" dmg command
Automated integration tests
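The exponential backoff listed under Rate Limiting above is not part of the current design. Purely as an illustration of what such a policy could compute, the following sketch doubles a base delay for each consecutive suicide of a rank and caps the result; `backoffDelay` and its parameters are hypothetical.

```go
package main

import "time"

// backoffDelay returns base doubled once per previous attempt, capped at max.
// attempt is the number of consecutive restarts already performed for a rank,
// so attempt 0 yields the base delay.
func backoffDelay(base, max time.Duration, attempt int) time.Duration {
	if base <= 0 || max <= 0 {
		return 0
	}
	d := base
	for i := 0; i < attempt; i++ {
		// Stop before doubling would exceed the cap (this also avoids overflow).
		if d > max/2 {
			return max
		}
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}
```

With a base of one second and a cap of one minute, successive attempts would wait 1s, 2s, 4s and so on, until the delay reaches the one minute cap.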
References
Code Files
src/control/server/server_utils.go - Main handler
src/control/server/server_utils_test.go - Unit tests
src/control/events/ras.go - Event constants
src/engine/drpc_ras.c - Event raising
src/include/daos_srv/ras.h - Event definitions
Documentation
docs/admin/administration.md - Event reference
docs/overview/fault.md - Fault model concepts
src/control/events/README.md - Developer guide
Related Tickets
DAOS-17427 - Engine suicide automatic restart (this ticket)
Related to SWIM rank dead handling
Related to self-healing mechanisms
Glossary
Suicide: Engine self-termination when detecting removal from group map
Incarnation: Sequence number identifying engine restart instance
INFO_ONLY: Event type that is logged but doesn't update MS database
STATE_CHANGE: Event type that will be forwarded to MS database leader
Poll: Periodic checking of engine state
Harness: Control plane component managing engine instances
SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol