Stakeholders

Identify developer(s), component validation engineer(s) & reviewer(s).

Introduction

This feature aims to build an infrastructure to manage pool format changes when upgrading DAOS to a new version (major, minor or bug fix). It does not cover the control plane and is limited to pool, container and object metadata and data (including placement). Pools are considered here as independent entities that can be upgraded independently.

This feature is targeted at DAOS 2.2 and should provide an upgrade path from 2.0.x to 2.2.x and upward.

Requirements & Use Cases

Describe the key functionality, use cases and assumptions. This should include what is in-scope and out-of-scope.

SRS ticket is available here: 

The administrator should be able to upgrade to a new DAOS version while preserving its data. The existing pools and containers should be accessible after the upgrade with the same level of functionality as before the upgrade. New features available at the pool, container or object level won’t be automatically enabled on existing pools. New containers can be created in the pool, but won’t use the new features. At that point, the admin can decide to downgrade to the previous version.

Once satisfied with the new version, the admin can upgrade each pool individually to enable the new features/formats. The pool being upgraded won’t be accessible during the upgrade (openers will be revoked and no new open requests will be allowed until it completes). Upon completion of the upgrade, the whole pool has been migrated to the new layout version and is available to users again.

Design Overview

Each DAOS version is associated with a layout version. For simplicity, this layout version is associated with the pool and covers all containers/objects in this pool. While this layout version might be bumped by one increment between releases, it might aggregate several independent layout changes including (non-exhaustive list):

The current layout version for a pool should be shown via pool properties. It is an integer that is not directly related to the DAOS version number. A pool can be upgraded to the latest supported version via “dmg pool upgrade”. A pool is inaccessible during the upgrade. The upgrade can complete promptly if the layout changes are minor (e.g. adding new properties, as in the case of 2.0 to 2.2) or be long running (e.g. placement changes). The dmg pool upgrade command returns immediately and the status is reported asynchronously via another pool property (similar to the container state property). The pool status can be reused for catastrophic recovery too and should probably be a bit flag if we want to record multiple states in the future.
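
To illustrate the bit-flag idea, here is a minimal Python sketch of a pool status that can record several states at once. The flag names and values (ONLINE, UPGRADING, RECOVERING) are assumptions made for this example, not the actual DAOS property encoding.

```python
from enum import IntFlag


class PoolStatus(IntFlag):
    """Hypothetical pool status bits; names and values are illustrative only."""
    ONLINE = 0            # no bit set: the pool is available
    UPGRADING = 1 << 0    # a layout upgrade is in progress
    RECOVERING = 1 << 1   # placeholder for a future catastrophic-recovery state


def is_accessible(status: PoolStatus) -> bool:
    """In this sketch, a pool is accessible only when no status bit is set."""
    return status == PoolStatus.ONLINE


# Two states recorded at once, then stored and decoded as a plain integer.
status = PoolStatus.UPGRADING | PoolStatus.RECOVERING
stored = int(status)             # integer that would be kept in the pool property
restored = PoolStatus(stored)    # decoding the stored integer back into flags
```

Keeping "online" as the zero value means that adding a new state later only requires a new bit, without changing how existing values are interpreted.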

If the system fails or is shut down during a pool upgrade, the process will be automatically resumed once the engine(s) are restarted. Until the upgrade completes, the status of the pool remains marked as upgrade in progress.
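
One possible way to make the upgrade resumable is to persist a progress record after each completed action. The Python sketch below is only an illustration under that assumption: the record layout (status, next_action) and the helper names are hypothetical, and it presumes that each action is safe to re-run if it was interrupted.

```python
from typing import Callable, Dict, List


def run_upgrade(record: Dict, actions: List[Callable[[], None]]) -> None:
    """Run the remaining actions, updating the progress record after each one."""
    record["status"] = "Upgrading"
    while record["next_action"] < len(actions):
        actions[record["next_action"]]()   # may raise if the engine stops
        record["next_action"] += 1         # a real service would make this durable
    record["status"] = "Online"


# Simulated run: the second action fails once, as if the engine was shut down.
log: List[str] = []
crash = {"armed": True}


def step_a() -> None:
    log.append("a")


def step_b() -> None:
    if crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("simulated engine shutdown")
    log.append("b")


record = {"status": "Online", "next_action": 0}
try:
    run_upgrade(record, [step_a, step_b])
except RuntimeError:
    pass
status_after_crash = record["status"]      # still marked as upgrading
run_upgrade(record, [step_a, step_b])       # restart: resumes at step_b
```

After the restart, only the action that had not completed is executed again, and the status returns to online once the last action finishes.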

A pool upgrade can be initiated only if the pool is in a stable state where no rebuild/reintegration/extension operation is in progress. Otherwise, an error will be reported to the admin, who should retry later.
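
The stable-state check could look like the following Python sketch. The PoolActivity fields and the PoolBusyError exception are hypothetical names chosen for this example; a real implementation would obtain this information from the pool service.

```python
from dataclasses import dataclass


class PoolBusyError(Exception):
    """Raised when an upgrade is requested while the pool is not stable."""


@dataclass
class PoolActivity:
    # Hypothetical flags describing operations currently running on the pool.
    rebuild: bool = False
    reintegration: bool = False
    extension: bool = False


def check_upgrade_allowed(activity: PoolActivity) -> None:
    """Raise PoolBusyError if any operation that blocks an upgrade is running."""
    running = [name for name, active in vars(activity).items() if active]
    if running:
        raise PoolBusyError(
            "upgrade refused, operation(s) in progress: "
            + ", ".join(running)
            + "; retry later"
        )
```

Calling check_upgrade_allowed(PoolActivity(rebuild=True)) raises PoolBusyError, while a pool with no operation in progress passes the check silently.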

For each version, we would need a list of actions that need to be performed in order to move to the next version. For instance:

When a pool is upgraded, the service can check and run the required actions for this pool, e.g. pool-A(V3) only requires the last 3 actions for V4, whereas pool-B(V2) has to run all actions of V3 and V4. Some of these actions might depend on data migration services (similar to the rebalance triggered by server addition).
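
The selection of actions can be sketched as a lookup in a per-version registry. In the Python example below, the registry contents and action names are placeholders invented for illustration (two actions for V3, three for V4); only the selection logic mirrors the pool-A/pool-B example above.

```python
from typing import Dict, List

# Hypothetical registry: actions needed to reach each layout version
# from the version immediately before it.
UPGRADE_ACTIONS: Dict[int, List[str]] = {
    3: ["v3_action_1", "v3_action_2"],
    4: ["v4_action_1", "v4_action_2", "v4_action_3"],
}


def plan_upgrade(current: int, target: int) -> List[str]:
    """Return, in order, the actions needed to go from current to target."""
    if target < current:
        raise ValueError("pool layout downgrade is not covered by this sketch")
    plan: List[str] = []
    for version in range(current + 1, target + 1):
        if version not in UPGRADE_ACTIONS:
            raise KeyError(f"no upgrade actions registered for V{version}")
        plan.extend(UPGRADE_ACTIONS[version])
    return plan


pool_a_plan = plan_upgrade(3, 4)   # only the three V4 actions
pool_b_plan = plan_upgrade(2, 4)   # all V3 actions, then all V4 actions
```

A pool that is already at the target version gets an empty plan, so running the upgrade on it is a no-op.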

For actions that require data migration, we may in the future require a certain percentage of free space before triggering the upgrade. The upgrade operation will fail to start if this requirement isn’t met, and it will be up to the admin to free up space or extend the pool before attempting the upgrade again.
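
As a small worked example of such a check, the Python sketch below tests whether a pool has at least a given percentage of free space. The function name and the 10% threshold used in the usage note are assumptions; the actual requirement has not been defined.

```python
def meets_free_space_requirement(total_bytes: int, free_bytes: int,
                                 required_free_pct: int) -> bool:
    """Return True if at least required_free_pct percent of the pool is free."""
    if total_bytes <= 0 or not 0 <= required_free_pct <= 100:
        raise ValueError("invalid pool size or percentage")
    # Integer arithmetic avoids rounding issues on large byte counts.
    return free_bytes * 100 >= required_free_pct * total_bytes
```

With a hypothetical 10% requirement, a 1000-byte pool with 100 bytes free passes the check, whereas the same pool with 99 bytes free does not and the upgrade would refuse to start.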

Please note that the container has a layout version property that is intended to be used by the middleware and is not interpreted by DAOS internally. This feature tracks the versioning at the pool level with new properties to be introduced.

User Interface

The primary interface will be the dmg pool upgrade command.

$ dmg pool get-prop mypool
[…]
Pool version: 1
Pool status: Online

$ dmg pool upgrade mypool
Upgrading mypool from version 1 to version 2

$ dmg pool get-prop mypool
[…]
Pool version: 1
Pool status: Upgrading

$ dmg pool get-prop mypool
[…]
Pool version: 2
Pool status: Online
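
Because the upgrade is asynchronous, a script may want to detect completion from the reported properties. The Python sketch below parses text in the "Name: value" form used in the mock-up above; that output format and the helper names are assumptions, and the real dmg output may differ.

```python
from typing import Dict


def parse_pool_props(output: str) -> Dict[str, str]:
    """Parse 'Name: value' lines; lines without a colon are ignored."""
    props: Dict[str, str] = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            props[name.strip()] = value.strip()
    return props


def upgrade_finished(output: str, target_version: int) -> bool:
    """True once the pool is back online at the target layout version."""
    props = parse_pool_props(output)
    return (props.get("Pool status") == "Online"
            and props.get("Pool version") == str(target_version))


in_progress = "Pool version: 1\nPool status: Upgrading\n"
completed = "Pool version: 2\nPool status: Online\n"
```

Checking both the status and the version avoids mistaking a pool that has not started upgrading yet (online at the old version) for one that has finished.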

Some metrics might be exported in the future to monitor the progress of the upgrade.

Impacts

Any performance impact?
Any API changes? If so, internal or external API? Any changes required to middleware? Any interop requirements?
Any VOS/config layout changes? How will migration be supported?
Any extra parameters required in the config file?
Any wire protocol change? How will interop be supported?
Any impact on the rebuild protocol?
Any impact on aggregation?
Any impact on security?

Quality

How will the feature be tested? Unit tests, functional tests and system tests need to be covered.
Describe the extra soak/performance tests that should be added.

A set of functional tests needs to be integrated into the CI:

In addition to this, end-to-end upgrade/downgrade tests should be developed.

Project Milestones

Description of the different milestones delivering incremental functionality.
Describe what will work/not work and what will be validated.
Targeted date for each milestone.