Incremental reintegration

Introduction

In the current version of DAOS, integration wipe out all the existent data and always reconstruct from scratch, this could take very long time if the original storage target is nearly full because the time consumption equals to (capacity / bandwidth)

If the failure is just a temporary outage like accident power down or daemon crash, wiping out data before integration can generate a lot more unnecessary data movement. In order to avoid the extra data movement, “incremental reintegration” should be supported by recovery system of DAOS: committed data should be retained, only the missing data should be reconstructed by recovery system.

Requirements

A few requirements for this feature:

  • Full dataset wipe-out should be removed from the reintegration protocol

  • Majority of data should be retained for reintegration, but it is OK to sacrifice a small portion of data, which is generated lately, to simplify the protocol.

  • Data correctness verification after reintegration is an optional feature: today it can be done by tool running on client, in the future, the engine side should be able to invoke a service to do the check.

Design overview

DAOS uses a relax 2-phase commit protocol, which allows server to reply to client before the distributed transaction actually being committed, it is difficult to precisely determine validity of data after the failed target is brought back to the cluster.

In order to simplify the protocol and avoid significant amount RPC exchanges for checking outstanding transaction status, a few timestamps should be maintained:

  • Each target should maintain a “stable epoch” for all containers, data before “stable epoch” is guaranteed to be committed and aggregated, in other word, it’s 'immutable”

  • Different targets may have different “stable epoch” for different containers, pool leader should gather all these epochs and use the minimum value as the “global stable epoch”.

  • When an engine restarted for any reason, data before “global stable epoch” should always be retained.

  • Instead of discarding all data of the target being reintegrated, the reintegration service should only trim data after “global stable epoch”

  • After the trimming phase, integration service can start to reconstruct data after “global stable epoch”

  • For read-only container (or container w/o any update), “stable epoch” should auto-progress to the latest epoch.

  • logging stable epoch, admin can monitor and estimate incremental reintegration status.

In this way, there is no requirement of running extra transaction protocol and determine data validity, but at cost of trimming some valid data from the target.