Persistent Check Leader
Stakeholders
Developer: @Kris Jacque
Reviewers: @Fan Yong @Michael MacDonald @Tom Nabarro @Liang Zhen
Introduction
When the DAOS system is in checker mode, the check leader is an engine that coordinates the checker activities among other engines and returns the result of a check to the Management Service (MS), which runs in the control plane. The check leader is essentially a liaison between the control plane and the data plane while in this special mode.
In the initial implementation, the control plane presumes the check leader is local to the MS leader, which could be any MS replica. This becomes a problem when MS leadership changes during a check. The check leader owns all state information about the current checker instance. If MS leadership changes after a check has started, subsequent checker commands are sent to an engine that is not the check leader. This can produce confusing or misleading output, or even start a second check instance on a different check leader.
This has been seen in rare cases on our CI system (DAOS-18537), and could occur in the field. This fix is targeted for 2.8 if possible, otherwise 2.8.1.
Requirements & Use Cases
Requirements
The MS must preserve the check leader across MS leadership changes.
The MS leader must forward check leader requests to the current check leader.
If there is no current check leader saved, the MS leader must use a local engine.
If there is no current check leader saved, the MS must record the rank of the engine it has chosen.
If the request to the current check leader cannot be sent (e.g. the rank is dead), a new check leader must be chosen.
Use Case
Admin enables checker mode.
Admin starts a check.
MS leadership changes due to some issue.
Admin issues any checker command.
The checker instance should appear continuous from the admin’s perspective.
Design Overview
For the sake of simplicity, we will manage the check leader from the MS in the control plane. The MS leader receives all dmg checker commands, can record persistent system properties, and can communicate with other control plane nodes in the DAOS cluster. This approach limits changes to the control plane level.
High-Level Flow
MS leader check command handlers
All check command gRPC handlers will need to ensure the correct check leader is used.
dmg → gRPC → MS leader
MS leader gRPC handler
Read the check_leader system property
If it is not set or not valid:
Select a check leader locally on the MS leader
Set the check_leader system property to the selected rank
If reading the property fails with any other error, fail the request
Look up rank’s control address in system DB
If rank is local to the MS leader:
Send dRPC locally
Else:
Send CheckLeaderDrpc to check leader rank’s control address over gRPC
Process results and return
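The handler flow above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the daos_server implementation: `Rank`, `ErrNoLeader`, `checkDeps` and all of its fields are hypothetical names introduced here so the control flow can be shown in isolation.

```go
package main

import (
	"errors"
	"fmt"
)

// Rank is a stand-in for the engine rank type (hypothetical).
type Rank uint32

// ErrNoLeader means check_leader is unset or holds an invalid value.
var ErrNoLeader = errors.New("check leader not set or not valid")

// checkDeps bundles the operations the handler needs. Every field is a
// hypothetical hook, not an existing daos_server function.
type checkDeps struct {
	GetLeader   func() (Rank, error)       // read the check_leader property
	SetLeader   func(Rank) error           // persist the check_leader property
	PickLocal   func() (Rank, error)       // choose an engine on this node
	ControlAddr func(Rank) (string, error) // system DB lookup
	IsLocal     func(Rank) bool            // is the rank hosted on this node?
	SendLocal   func(r Rank, req []byte) ([]byte, error)
	Forward     func(addr string, r Rank, req []byte) ([]byte, error)
}

// handleCheckCmd resolves the check leader, selecting and recording one if
// none is stored, then delivers the request locally or via forwarding.
func handleCheckCmd(d checkDeps, req []byte) ([]byte, error) {
	leader, err := d.GetLeader()
	switch {
	case errors.Is(err, ErrNoLeader):
		if leader, err = d.PickLocal(); err != nil {
			return nil, fmt.Errorf("selecting check leader: %w", err)
		}
		if err = d.SetLeader(leader); err != nil {
			return nil, fmt.Errorf("recording check leader: %w", err)
		}
	case err != nil:
		// Any other error reading the property fails the request.
		return nil, fmt.Errorf("reading check leader: %w", err)
	}

	addr, err := d.ControlAddr(leader)
	if err != nil {
		return nil, fmt.Errorf("looking up rank %d: %w", leader, err)
	}
	if d.IsLocal(leader) {
		return d.SendLocal(leader, req)
	}
	return d.Forward(addr, leader, req)
}
```

Passing the dependencies as function fields keeps the sketch testable without a real system database or gRPC connection; a production version would more likely hang these off existing server types.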
CheckLeaderDrpc gRPC endpoint
This is a new gRPC endpoint restricted to servers.
MS leader gRPC handler → gRPC → MS replica hosting the check leader
MS CheckLeaderDrpc handler
Basic sanity checks
Is checker mode enabled?
Is the requested rank currently the check leader?
Is the requested rank local?
Is the requested rank in an appropriate state? (i.e. not Stopped or AdminExcluded)
Verify dRPC request
Checker module only - don’t allow an arbitrary dRPC request to be sent to this endpoint
Method is one we allow to be forwarded
Send dRPC request to rank
Return results directly to MS without further processing
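The sanity checks and request verification listed above might look something like the following Go sketch. All names here (`nodeState`, `forwardReq`, `validateForward`, the `RankState` constants) are hypothetical, and the checker module ID and allowed method set are passed in as data rather than hard-coded, since this document does not specify their values.

```go
package main

import (
	"errors"
	"fmt"
)

// Rank and RankState are simplified stand-ins (hypothetical).
type Rank uint32
type RankState int

const (
	StateJoined RankState = iota
	StateStopped
	StateAdminExcluded
)

// forwardReq is a hypothetical view of a CheckLeaderDrpc request.
type forwardReq struct {
	Rank   Rank
	Module int32
	Method int32
}

// nodeState is a hypothetical snapshot of what the receiving server knows.
type nodeState struct {
	CheckerEnabled bool
	CheckLeader    Rank
	LocalRanks     map[Rank]RankState // ranks hosted on this server
	CheckerModule  int32              // dRPC module ID of the checker
	AllowedMethods map[int32]bool     // methods permitted to be forwarded
}

// validateForward applies the sanity checks in order and returns the first
// failure, or nil if the dRPC request may be sent to the rank.
func validateForward(s nodeState, req forwardReq) error {
	if !s.CheckerEnabled {
		return errors.New("checker mode is not enabled")
	}
	if req.Rank != s.CheckLeader {
		return fmt.Errorf("rank %d is not the current check leader", req.Rank)
	}
	state, ok := s.LocalRanks[req.Rank]
	if !ok {
		return fmt.Errorf("rank %d is not local to this server", req.Rank)
	}
	if state == StateStopped || state == StateAdminExcluded {
		return fmt.Errorf("rank %d is not in a usable state", req.Rank)
	}
	if req.Module != s.CheckerModule {
		return fmt.Errorf("dRPC module %d may not be forwarded", req.Module)
	}
	if !s.AllowedMethods[req.Method] {
		return fmt.Errorf("dRPC method %d may not be forwarded", req.Method)
	}
	return nil
}
```

Rejecting the request before any dRPC is sent is what keeps this endpoint from becoming a general-purpose relay to arbitrary ranks.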
Checker Up-calls
While the checker instance is in progress, check engines currently report their status back to the check leader. Rather than having the check leader aggregate these results and report to the MS leader, this functionality can be taken over by the control plane.
dRPC methods
We can use the existing dRPC methods that are currently required to be called on the MS leader. These methods will be updated to be usable from any node by forwarding the command appropriately to the MS leader. This will require server-to-server gRPC endpoints to be added for each of these methods.
DRPC_METHOD_CHK_LIST_POOL - from CHK leader
DRPC_METHOD_CHK_REG_POOL - from CHK leader
DRPC_METHOD_CHK_DEREG_POOL - from CHK leader
DRPC_METHOD_CHK_REPORT - from either CHK leader or CHK engine
CHK_REPORT will be updated to report from each check engine. The results will be forwarded to the MS.
Up-call flow
Check engine reports its status to its local control plane via dRPC.
Local control plane forwards the call to the MS leader.
The MS leader updates the MS database accordingly.
If leadership changes while processing is in progress, an error will be reported.
The check engine’s control plane should retry with the new leader.
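The retry behavior in the up-call flow could be sketched as below. This assumes the MS leader signals a leadership change with a recognizable error, represented here by the hypothetical `ErrNotLeader`; `forwardReport` and its parameters are likewise illustrative names. The attempt limit is an assumption, and a real implementation would probably add backoff between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotLeader is a hypothetical error returned when the contacted node is
// not, or is no longer, the MS leader.
var ErrNotLeader = errors.New("not the MS leader")

// forwardReport sends a check engine's report to the MS leader, looking up
// the leader again and retrying whenever leadership has changed.
func forwardReport(
	findLeader func() (string, error),
	send func(addr string, report []byte) error,
	report []byte,
	maxAttempts int,
) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		addr, err := findLeader()
		if err != nil {
			return fmt.Errorf("finding MS leader: %w", err)
		}
		err = send(addr, report)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrNotLeader) {
			return err // not a leadership problem, so do not retry
		}
		lastErr = err
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}
```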
System property: check_leader
The MS will store a new internal system property, check_leader, that contains the rank number of the last selected check leader.
This property will be checked before issuing checker commands. If it is not yet set, or is invalid, the MS leader will select a check leader and record it.
Invalid values include:
NilRank
Non-numeric strings
Numbers that don’t correspond to real ranks in the system
If the current check leader cannot be reached, a new check leader must be chosen. Otherwise the check leader remains persistently stored for the life of the checker instance.
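As an illustration of the validity rules above, the stored property value could be parsed along these lines. This is a sketch only: `parseCheckLeader` and the `isMember` callback are hypothetical, and `NilRank` is assumed here to be the maximum 32-bit unsigned value.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"strconv"
)

// Rank is a stand-in for the engine rank type (hypothetical).
type Rank uint32

// NilRank is assumed to be the maximum uint32 value for this sketch.
const NilRank Rank = math.MaxUint32

// parseCheckLeader converts the stored check_leader value into a rank.
// An error means the caller should select and record a new check leader.
func parseCheckLeader(val string, isMember func(Rank) bool) (Rank, error) {
	if val == "" {
		return NilRank, errors.New("check_leader is not set")
	}
	n, err := strconv.ParseUint(val, 10, 32)
	if err != nil {
		return NilRank, fmt.Errorf("check_leader %q is not a rank number: %w", val, err)
	}
	r := Rank(n)
	if r == NilRank {
		return NilRank, errors.New("check_leader is NilRank")
	}
	if !isMember(r) {
		return NilRank, fmt.Errorf("check_leader %d is not a rank in the system", r)
	}
	return r, nil
}
```

Treating "unset" and "invalid" the same way at the call site matches the text: in both cases the MS leader selects a check leader and records it.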
Choosing a check leader
The check leader must always be on an MS replica. This allows its parent control plane to always have, at minimum, read access to the MS database.
When no leader has been selected yet, the control plane will select an engine belonging to the MS leader. Even if leadership is lost, the node remains an MS replica.
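One possible selection routine is sketched below. The text only requires an engine belonging to the MS leader; choosing the lowest-numbered usable rank is an assumption made here to keep the choice deterministic, and `chooseCheckLeader` and the simplified `RankState` values are hypothetical.

```go
package main

import "errors"

// Rank and RankState are simplified stand-ins (hypothetical).
type Rank uint32
type RankState int

const (
	StateJoined RankState = iota
	StateStopped
	StateAdminExcluded
)

// chooseCheckLeader picks the lowest-numbered joined engine hosted on the
// MS leader's node. local maps each locally hosted rank to its state.
func chooseCheckLeader(local map[Rank]RankState) (Rank, error) {
	var best Rank
	found := false
	for r, st := range local {
		if st != StateJoined {
			continue // skip stopped or excluded engines
		}
		if !found || r < best {
			best, found = r, true
		}
	}
	if !found {
		return 0, errors.New("no usable local engine on the MS leader")
	}
	return best, nil
}
```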
User Interface
From the user perspective, this change will be transparent. The checker instance should appear continuous to them, with no differences in how they interact with the existing feature.
Any errors reported by the underlying remote dRPC call will be processed as if they occurred locally.
Impacts
Performance
No impact on a normal running DAOS cluster.
Checker mode: Additional latency in checker commands due to the need to forward them over the management network. This should be negligible, assuming a functional network.
Security
New gRPC endpoint for MS nodes: This adds to the servers' public gRPC API. Protections already exist in certificate-enabled systems to prevent unauthorized access to this API. Additional sanity checks will prevent sending arbitrary dRPC messages to arbitrary ranks.
Interop
New gRPC endpoint for MS nodes: This is used internally by daos_server. We do not support mixed daos_server versions within the same cluster.
Quality
Unit testing will accompany all development changes.
Current functional tests intermittently trigger this issue. Additional tests TBD.
Add a method to force MS leadership change for testing purposes.
This could be included with dmg fault injection commands.
Project Milestones
Description of the different milestones delivering incremental functionality.
Describe what will work/not work and what will be validated.
Targeted date for each milestone.
References (optional)
External papers, web page (if any)
Concerns
Approvals
| Reviewed & Approved by | Names | Date |
|---|---|---|
| Feature Developers | Name(s) | |
| Tech Lead/Architect | Name(s) | |
| Test Engineers | Name(s) | |
| Engineering Managers | Name(s) | |

| Feature Test plan |
|---|
| |