Isolated Server

Stakeholders

Author(s): Johann Lombardi

Reviewer(s): Liang Zhen & Li Wei

Developer(s): TBD

Validation engineer(s): TBD

Introduction

If a DAOS engine is partitioned over the network from the other engines but is still reachable by some client nodes, this engine might still serve fetch/read requests and return stale data after it has been excluded by the other engines. The problem does not happen if the excluded engine has a way to learn of its exclusion (e.g. via SWIM), in which case it will shut itself down.

While such a scenario might seem unlikely, it can still happen and might be seen as a data corruption issue. This technical debt should thus be addressed.

Requirements & Use Cases

No stale or inconsistent data should be returned to clients, even transiently, in case of network partitioning. The use case considered in this design is when a set of DAOS engines (a set can be just one engine) is isolated from the other engines, but can still communicate with client nodes. The approach to address this technical debt should consider systems at the scale of Aurora (1k+ servers) and beyond.

Design Overview

Relying on SWIM to address this technical debt does not seem to be practical. For an isolated node to question whether it has been excluded, it would need to set an upper bound on how long it can go without being pinged before treating the silence as abnormal, which is difficult due to the probabilistic nature of SWIM and its weak membership management. Not to mention that if two nodes are isolated together, they might still ping each other.

An alternative mechanism is thus required. A traditional solution to this problem is to grant a lease to a server. If a server is not able to renew its lease, then it should stop serving I/Os. When an exclusion happens, the member servers should then wait for the lease time before excluding the node. For DAOS, this would mean that rebuild is delayed by the lease time, which widens the window in which a further failure could occur and potentially cause data loss.
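The traditional lease scheme described above can be modelled in a few lines. This is a minimal Python sketch, not DAOS code: the `Lease` class, the `safe_exclusion_time()` helper and the use of absolute timestamps in seconds are all assumptions made for illustration, and the clocks of the servers are assumed to be comparable.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Classic lease held by a server (hypothetical model, not DAOS code)."""
    expires_at: float  # absolute time, in seconds

    def renew(self, now: float, lease_time: float) -> None:
        # A successful renewal pushes the expiry forward by one lease time.
        self.expires_at = now + lease_time

    def may_serve(self, now: float) -> bool:
        # Without a successful renewal, the server stops serving I/Os.
        return now < self.expires_at


def safe_exclusion_time(detected_at: float, lease_time: float) -> float:
    """Time until which the other members wait before excluding the node.

    Assumes no lease is granted to the node after the failure is detected,
    so any lease it still holds has expired by the returned time.
    """
    return detected_at + lease_time
```

With a 10 second lease renewed at t=100, the server may serve until just before t=110. If the failure is detected at t=103, the members wait until t=113, and this waiting period is exactly the rebuild delay that the paragraph above identifies as the drawback.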

The proposed approach is to implement a scalable lease mechanism without delaying online rebuild by proceeding as follows:

  • The management service regularly issues an IV broadcast across all the engines to push a new lease deadline.

  • If a server passes the lease deadline, it tries to read the IV variable storing the latest deadline from its parent. If the parent does not reply, it tries to read from the IV root.

  • A node is allowed to serve I/Os until the latest lease deadline + lease time.

  • Upon exclusion, the node is marked as down in the pool map immediately and rebuild can proceed. That being said, new writes issued from client nodes to objects impacted by the rebuild should be delayed until the lease time has passed.
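The engine-side rule in the bullets above (serve until the latest lease deadline + lease time, and try to refresh once the deadline has passed) can be sketched as a small state check. This is a hedged Python model only: `EngineLease`, `LeaseState` and the state names are hypothetical, the boundary comparisons are chosen arbitrarily as inclusive, and ignoring a deadline older than the current one is an assumption that the design does not state.

```python
from dataclasses import dataclass
from enum import Enum


class LeaseState(Enum):
    VALID = "valid"      # before the deadline: serve I/Os normally
    GRACE = "grace"      # past the deadline: still serve, but try to refresh
    EXPIRED = "expired"  # past deadline + lease time: stop serving I/Os


@dataclass
class EngineLease:
    deadline: float    # latest lease deadline received through IV
    lease_time: float  # extra window granted past the deadline

    def update(self, new_deadline: float) -> None:
        # Assumption: deadlines only move forward; a stale value is ignored.
        self.deadline = max(self.deadline, new_deadline)

    def state(self, now: float) -> LeaseState:
        if now <= self.deadline:
            return LeaseState.VALID
        if now <= self.deadline + self.lease_time:
            return LeaseState.GRACE
        return LeaseState.EXPIRED

    def may_serve_io(self, now: float) -> bool:
        return self.state(now) is not LeaseState.EXPIRED
```

For example, with a deadline of 100 and a lease time of 10, the engine is in the `GRACE` state at t=105 and still serves I/Os, and it stops serving after t=110 unless a newer deadline has arrived in the meantime.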

The extra iv_fetch() from the parent and the extra lease time are expected to cover the case where the parent in the IV tree died before propagating the new lease deadline to its children. This might be avoided altogether if the IV tree is quickly reconfigured to route around the failed node.
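The parent-then-root fallback can be illustrated as follows. This Python sketch does not use the real IV API: `refresh_deadline()` and the `fetch` callback are hypothetical stand-ins for the fetch from a given node, with `None` modelling a node that does not reply.

```python
from typing import Callable, Optional

# Hypothetical fetch callback: returns the latest deadline known to a node,
# or None when that node does not reply.
FetchFn = Callable[[str], Optional[float]]


def refresh_deadline(current: float, parent: str, root: str,
                     fetch: FetchFn) -> float:
    """Return the newest deadline obtainable from the parent, then the root."""
    for node in (parent, root):
        fetched = fetch(node)
        if fetched is not None:
            # Assumption: never move the deadline backwards on a stale answer.
            return max(current, fetched)
    # Nobody answered: keep the current deadline and let the lease run out.
    return current
```

If neither the parent nor the root replies, the deadline is left unchanged, so an isolated engine stops serving I/Os once the lease deadline + lease time has passed.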

The following changes are thus required:

  • IV trees are currently created per pool. A new system-wide IV tree will thus be required. This tree can also be used to propagate system attributes.

  • A new ULT must be started in the management service to regularly issue an IV broadcast that updates the deadline.

  • The pool service should no longer listen to SWIM events and instead be notified by the management service when a node is excluded by SWIM. This allows the pool service to accurately determine when to stop delaying writes based on the time when the management service made the exclusion.

  • The system map should be broadcast along with the lease update.

  • The logic to try to refresh the lease when the lease deadline has passed must be implemented on the engine.

  • The engine should return a special error code to the client once the lease deadline + lease time has passed. The client should try to refresh the pool map when getting such an error code and resubmit the RPC.

  • Surviving engines must delay write processing until the lease time has passed after exclusion.
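The write delay on surviving engines reduces to a simple time comparison. This Python sketch follows the wording of the design rather than any existing code: the function names are hypothetical, and `exclusion_time` is assumed to be the time at which the management service made the exclusion, as notified to the pool service.

```python
from typing import Optional


def write_release_time(exclusion_time: float, lease_time: float) -> float:
    """Earliest time at which delayed writes may be processed."""
    return exclusion_time + lease_time


def must_delay_write(now: float, exclusion_time: Optional[float],
                     lease_time: float, impacted_by_rebuild: bool) -> bool:
    """True while a write has to wait for the excluded engine's lease."""
    if exclusion_time is None or not impacted_by_rebuild:
        # No exclusion in progress, or the object is not affected by it.
        return False
    return now < write_release_time(exclusion_time, lease_time)
```

With an exclusion at t=200 and a lease time of 10, writes to impacted objects are held until t=210, while rebuild itself and writes to unaffected objects proceed immediately.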

User Interface

How is the user/admin expected to interact with the new feature? Describe the API/tool.
What are all the tunables provided to the user/admin?
Any extra statistics that should be reported to the user/admin?
Explain how errors will be handled.

No user/admin-visible changes.

New statistics should be added.

Impacts

Any performance impact?
Any API changes? If so, internal or external API? Any changes required to middleware? Any interop requirements?
Any VOS/config layout changes? How will migration be supported?
Any extra parameters required in the config file?
Any wire protocol change? How will interop be supported?
Any impact on the rebuild protocol?
Any impact on aggregation?
Any impact on security?

Quality

How will the feature be tested? Unit tests, functional tests and system tests need to be covered.
Describe the extra soak/performance tests that should be added.

Project Milestones

Description of the different milestones delivering incremental functionality.
Describe what will work/not work and what will be validated.
Targeted date for each milestone.