Persistent Check Leader

Stakeholders

Identify developer(s), component validation engineer(s) & reviewer(s).

Developer: @Kris Jacque

Reviewers: @Fan Yong @Michael MacDonald @Tom Nabarro @Liang Zhen

Introduction

When the DAOS system is in checker mode, the check leader is an engine that coordinates the checker activities among other engines and returns the result of a check to the Management Service (MS), which runs in the control plane. The check leader is essentially a liaison between the control plane and the data plane while in this special mode.

In the initial implementation, the control plane presumes the check leader is local to the MS leader, which could be any MS replica. This becomes a problem when MS leadership changes during a check. The check leader owns all state information about the current checker instance. If MS leadership changes after a check has started, subsequent checker commands are sent to an engine that is not the check leader. This can produce confusing or misleading output, or even start a second check instance on a different check leader.

This problem has been seen in rare cases on our CI system (DAOS-18537) and could occur in the field. The fix is targeted for 2.8 if possible, otherwise 2.8.1.

Requirements & Use Cases

Requirements

  • The MS must preserve the check leader across MS leadership changes.

  • The MS leader must forward check leader requests to the current check leader.

  • If there is no current check leader saved, the MS leader must use a local engine.

  • If there is no current check leader saved, the MS must record the rank of the engine it has chosen.

  • If the request to the current check leader cannot be sent (e.g. the rank is dead), a new check leader must be chosen.

Use Case

  1. Admin enables checker mode.

  2. Admin starts a check.

  3. MS leadership changes due to some issue.

  4. Admin issues any checker command.

  5. The checker instance should appear continuous from the admin’s perspective.

Design Overview

Describe the key architecture decisions, benefits & drawbacks.
Describe what software component(s) will be modified and how. Diagrams are welcome.

For the sake of simplicity, we will manage the check leader from the MS in the control plane. The MS leader receives all dmg checker commands, can record persistent system properties, and can communicate with other control plane nodes in the DAOS cluster. This approach limits changes to the control plane level.

High-Level Flow

MS leader check command handlers

All check command gRPC handlers will need to ensure the correct check leader is used.

  • dmg → gRPC → MS leader

  • MS leader gRPC handler

    • Check the system prop

      • If not set or not valid:

        • Select a check leader locally on MS leader

        • Set system prop to selected leader

      • If another error, fail

    • Look up rank’s control address in system DB

    • If rank is local to the MS leader:

      • Send dRPC locally

    • Else:

      • Send CheckLeaderDrpc to check leader rank’s control address over gRPC

    • Process results and return
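The handler flow above can be sketched roughly as follows. This is a minimal illustration, not the actual control plane code: Rank, member, systemDB, route, resolveCheckLeader and the in-memory maps are all hypothetical stand-ins, and picking the lowest-numbered local engine is an assumption made only to keep the example deterministic. A real property lookup can also fail for other reasons, in which case the command should fail.

```go
package main

import (
	"errors"
	"strconv"
)

// Rank is a hypothetical stand-in for an engine rank number.
type Rank uint32

// member is a hypothetical, simplified system DB entry for one engine.
type member struct {
	ControlAddr string // control plane address hosting the rank
	Local       bool   // true if the rank runs on the current MS leader
}

// systemDB is a hypothetical in-memory stand-in for the MS database.
// Props must be non-nil because resolveCheckLeader may write to it.
type systemDB struct {
	Props   map[string]string
	Members map[Rank]member
}

// route says where a checker dRPC should be sent.
type route struct {
	Rank        Rank
	Local       bool   // true: send the dRPC locally
	ControlAddr string // false: forward via CheckLeaderDrpc to this address
}

const checkLeaderProp = "check_leader"

// savedLeader returns the recorded check leader if it is set and valid.
func savedLeader(db *systemDB) (Rank, bool) {
	val, ok := db.Props[checkLeaderProp]
	if !ok {
		return 0, false
	}
	n, err := strconv.ParseUint(val, 10, 32)
	if err != nil {
		return 0, false
	}
	_, known := db.Members[Rank(n)]
	return Rank(n), known
}

// pickLocal selects the lowest-numbered engine local to the MS leader
// (the tie-break rule is an assumption for this sketch).
func pickLocal(db *systemDB) (Rank, error) {
	found := false
	var best Rank
	for r, m := range db.Members {
		if m.Local && (!found || r < best) {
			best, found = r, true
		}
	}
	if !found {
		return 0, errors.New("no local engine available")
	}
	return best, nil
}

// resolveCheckLeader checks the system prop, selects and records a local
// leader if the prop is unset or invalid, then looks up where to send.
func resolveCheckLeader(db *systemDB) (route, error) {
	rank, ok := savedLeader(db)
	if !ok {
		var err error
		if rank, err = pickLocal(db); err != nil {
			return route{}, err
		}
		db.Props[checkLeaderProp] = strconv.FormatUint(uint64(rank), 10)
	}
	m := db.Members[rank]
	return route{Rank: rank, Local: m.Local, ControlAddr: m.ControlAddr}, nil
}
```

In this sketch the recorded rank survives a simulated MS leadership change: once check_leader is set, a later call from a different MS leader returns a route with Local set to false and the original replica's control address.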

CheckLeaderDrpc gRPC endpoint

This is a new gRPC endpoint restricted to servers.

  • MS leader gRPC handler → gRPC → MS replica hosting the check leader

    • MS CheckLeaderDrpc handler

      • Basic sanity checks

        • Is checker mode enabled?

        • Is the requested rank currently the check leader?

        • Is the requested rank local?

        • Is the requested rank in an appropriate state? (i.e. not Stopped or AdminExcluded)

      • Verify dRPC request

        • Checker module only - don’t allow an arbitrary dRPC request to be sent to this endpoint

        • Method is one we allow to be forwarded

      • Send dRPC request to rank

      • Return results directly to MS without further processing
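The checks listed above might look something like the sketch below. Every name here is hypothetical (nodeView, forwardReq, validateForward, the rank states, and the module and method IDs); the real values would come from the existing dRPC and membership definitions. The point is only that every check must pass before the dRPC request is handed to the rank.

```go
package main

import (
	"errors"
	"fmt"
)

// Rank is a hypothetical stand-in for an engine rank number.
type Rank uint32

// Hypothetical module and method IDs, used only for this illustration.
const (
	moduleChecker    int32 = 100
	methodCheckQuery int32 = 1
)

// rankState is a hypothetical, reduced set of member states.
type rankState int

const (
	stateJoined rankState = iota
	stateStopped
	stateAdminExcluded
)

// forwardReq is a hypothetical CheckLeaderDrpc request.
type forwardReq struct {
	Rank   Rank
	Module int32
	Method int32
	Body   []byte
}

// nodeView is a hypothetical snapshot of what the receiving replica knows.
type nodeView struct {
	CheckerEnabled bool
	CheckLeader    Rank
	LocalRanks     map[Rank]rankState // ranks hosted by this replica
	AllowedMethods map[int32]bool     // checker methods that may be forwarded
}

// validateForward applies the sanity checks and request verification in order.
func validateForward(n *nodeView, req *forwardReq) error {
	if !n.CheckerEnabled {
		return errors.New("checker mode is not enabled")
	}
	if req.Rank != n.CheckLeader {
		return fmt.Errorf("rank %d is not the current check leader", req.Rank)
	}
	state, local := n.LocalRanks[req.Rank]
	if !local {
		return fmt.Errorf("rank %d is not local to this replica", req.Rank)
	}
	if state == stateStopped || state == stateAdminExcluded {
		return fmt.Errorf("rank %d is not in an appropriate state", req.Rank)
	}
	if req.Module != moduleChecker {
		return fmt.Errorf("module %d may not be forwarded", req.Module)
	}
	if !n.AllowedMethods[req.Method] {
		return fmt.Errorf("method %d may not be forwarded", req.Method)
	}
	return nil
}
```

Using an explicit allow-list of methods, rather than a deny-list, means any checker method added later is rejected by default until someone decides it is safe to forward.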

Checker Up-calls

While the checker instance is in progress, check engines currently report their status back to the check leader. Rather than having the check leader aggregate these results and report them to the MS leader, the control plane can take over this functionality.

dRPC methods

We can use the existing dRPC methods that are currently required to be called on the MS leader. These methods will be updated to be usable from any node by forwarding the command appropriately to the MS leader. This will require server-to-server gRPC endpoints to be added for each of these methods.

  1. DRPC_METHOD_CHK_LIST_POOL - from CHK leader

  2. DRPC_METHOD_CHK_REG_POOL - from CHK leader

  3. DRPC_METHOD_CHK_DEREG_POOL - from CHK leader

  4. DRPC_METHOD_CHK_REPORT - from either CHK leader or CHK engine

    1. CHK_REPORT will be updated to report from each check engine. The results will be forwarded to the MS.

Up-call flow

  1. Check engine reports its status to its local control plane via dRPC.

  2. Local control plane forwards the call to the MS leader.

  3. The MS leader updates the MS database accordingly.

    1. If leadership changes while processing is in progress, an error will be reported.

    2. The check engine’s control plane should retry with the new leader.

System property: check_leader

The MS will store a new internal system property, check_leader, that contains the rank number of the last selected check leader.

This property will be checked before issuing checker commands. If it is not yet set, or is invalid, the MS leader will select a check leader and record it.

Invalid values include:

  • NilRank

  • Non-number strings

  • Numbers that don’t correspond to real ranks in the system

If the current check leader cannot be reached, a new check leader must be chosen. Otherwise the check leader remains persistently stored for the life of the checker instance.
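The validity rules for the stored value can be illustrated with a small parsing helper. This is a sketch under assumptions: parseCheckLeader and the isMember callback are hypothetical, and NilRank is assumed here to be the maximum 32-bit unsigned value. A non-nil error means the MS leader should select and record a new check leader.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// Rank is a hypothetical stand-in for an engine rank number.
type Rank uint32

// NilRank is assumed to be the maximum uint32 value for this sketch.
const NilRank Rank = math.MaxUint32

// parseCheckLeader validates a stored check_leader value. isMember reports
// whether a rank corresponds to a real rank in the system.
func parseCheckLeader(val string, isMember func(Rank) bool) (Rank, error) {
	n, err := strconv.ParseUint(val, 10, 32)
	if err != nil {
		// Covers empty strings, non-number strings and out-of-range values.
		return NilRank, fmt.Errorf("check_leader %q is not a rank number: %w", val, err)
	}
	rank := Rank(n)
	if rank == NilRank {
		return NilRank, fmt.Errorf("check_leader is NilRank")
	}
	if !isMember(rank) {
		return NilRank, fmt.Errorf("check_leader %d is not a rank in the system", rank)
	}
	return rank, nil
}
```

For example, with a four-rank system (ranks 0 through 3), the value "2" is accepted, while "abc", "4294967295" and "7" are all rejected.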

Choosing a check leader

The check leader must always be on an MS replica. This allows its parent control plane to always have, at minimum, read access to the MS database.

When no leader has been selected yet, the control plane will select an engine belonging to the MS leader. Even if leadership is lost, the node remains an MS replica.

User Interface

From the user perspective, this change will be transparent. The checker instance should appear continuous to them, with no differences in how they interact with the existing feature.

Any errors reported by the underlying remote dRPC call will be processed as if they occurred locally.

Impacts

Any performance impact?
Any API changes? If so, internal or external API? Any changes required to middleware? Any interop requirements?
Any VOS/config layout changes? How will migration be supported?
Any extra parameters required in the config file?
Any wire protocol change? How will interop be supported?
Any impact on the rebuild protocol?
Any impact on aggregation?
Any impact on security?

  • Performance

    • No impact on a normal running DAOS cluster.

    • Checker mode: Additional latency in checker commands due to the need to forward them over the management network. This should be negligible assuming a functional network.

  • Security

    • New gRPC endpoint for MS nodes: This adds to the servers' public gRPC API. Protections already exist in certificate-enabled systems to prevent unauthorized access to this API. Additional sanity checks will prevent sending arbitrary dRPC messages to arbitrary ranks.

  • Interop

    • New gRPC endpoint for MS nodes: This is used internally by daos_server. We do not support mixed daos_server versions within the same cluster.

Quality

How will the feature be tested? Unit tests, functional tests and system tests need to be covered.
Describe the extra soak/performance tests that should be added.

  • Unit testing will accompany all development changes.

  • Current functional tests intermittently trigger this issue. Additional tests TBD.

    • Add a method to force MS leadership change for testing purposes.

      • This could be included with dmg fault injection commands.

Project Milestones

Description of the different milestones delivering incremental functionality.
Describe what will work/not work and what will be validated.
Targeted date for each milestone.

References (optional)

External papers, web page (if any)

Concerns

 



Approvals

Reviewed & Approved by

Names

Date

Feature Developers

Name(s)



Tech Lead/Architect

Name(s)



Test Engineers

Name(s)



Engineering Managers

Name(s)



Feature Test plan

  • Provide link to test plan for feature under development




