DAOS Rolling Upgrade
Feature requirements
DAOS should support rolling upgrade to ensure continuous availability and a seamless user experience, by upgrading storage engines incrementally rather than all at once, which would require downtime of the entire storage cluster. Rolling upgrade also aligns with the distributed architecture’s fault-tolerant design: if an issue arises during the upgrade, only a subset of nodes is affected, and the remaining storage engines can still provide I/O service without disruption.
Scope statements
These scenarios should be supported by DAOS:
Guarantee availability of service while incrementally upgrading DAOS servers.
The storage cluster may run in mixed-version mode for a long time, for example days or even weeks.
A rolling upgrade can be aborted, after which the DAOS system can roll back to the old version.
DAOS servers can switch online to the new-version protocol upon completion of the rolling upgrade.
If clients are upgraded before servers, clients can switch to the new protocol only after all servers are upgraded to the new version.
The DAOS client can automatically reconnect to upgraded servers and continue I/O with them.
Rolling upgrades are only supported between consecutive major versions; major version skipping is not allowed.
Rolling downgrade is not supported in 2.8 and 3.0, but the user can incrementally roll back upgraded engines to the old version before committing the rolling upgrade.
Definitions
Engine version
The hard-coded version of the engine software; it can differ from the runtime version of DAOS. A DAOS engine should support the protocols of both the engine version and the runtime version (system version), which should be consecutive in major version number.
System version
The runtime version of the DAOS storage system, persistently stored in a database of the management service. When a DAOS engine joins the DAOS system, it obtains the system version from the management service and runs with protocols matching that version. It should be noted that this database is only created by the mgmt service after a rolling upgrade is initiated, and is deleted once the upgrade is either completed or aborted.
During a rolling upgrade there are two system version numbers: one is the current runtime version (the legacy version), and the other is the next version of the rolling upgrade, which is inactive until the upgrade completes.
Pool version
The durable format version of a DAOS pool, which may or may not change between system versions. Durable format upgrade is not part of rolling upgrade; the administrator can decide whether to upgrade the durable format of each pool after the rolling upgrade completes. Durable format upgrade is already supported by DAOS and is not covered by this design.
Administrator Interfaces
The administrator should explicitly start a rolling upgrade by executing a new DMG command, “upgrade”, which notifies the management service of the beginning of the upgrade process; otherwise, any engine whose version differs from the system version will fail to join the storage cluster. The administrator should also explicitly complete or abort the rolling upgrade, to avoid ambiguous status and compatibility risks.
After the rolling upgrade is started, the administrator can shut down a certain number of DAOS engines, upgrade the RPMs on those engines, and then bring them back. Those engines are allowed to join the DAOS cluster and run with the current system version instead of the engine version.
dmg “system upgrade” command
Upgrades are non-concurrent operations. Once initiated by an administrator, no additional rolling upgrade can be started until the current process either completes or is explicitly aborted. The following sub-commands of “upgrade” might be introduced:
Prepare
The administrator should initiate the upgrade process using the sub-command “prepare”, which notifies the MGMT service to start maintaining the global upgrade status and managing the version of each engine. This sub-command has a few parameters:
Version: The administrator should designate the new version of the upgrade; a DAOS engine can only be upgraded to this specified version and rejoin the cluster under that version.
Policy: Additionally, the administrator can use the policy parameter to define the data rebuild strategy, such as whether the storage engines being upgraded should be evicted from the cluster, whether their stored data should be rebuilt, and how long to wait before initiating the rebuild. This allows the administrator to avoid unnecessary data migration during rapid upgrades, or to ensure data safety in slow upgrade scenarios. Potential options for the policy could include “no_eviction”, “no_rebuild”, “default”, etc.; a sketch of these options follows the list below.
No_eviction: I/O will be blocked until the engines being upgraded rejoin.
No_rebuild: engines being upgraded can be evicted and I/O can proceed; however, rebuild will not be triggered, in order to avoid data movement.
Default: both eviction and rebuild are enabled during upgrade.
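To make the prepare parameters concrete, below is a minimal Go sketch of how the management service might validate them. The type names (PrepareRequest, UpgradePolicy), the parseMajor helper, and the example version strings are assumptions made for illustration, not the actual DAOS control-plane API.

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // UpgradePolicy mirrors the proposed policy options of "prepare".
    type UpgradePolicy string

    const (
        PolicyNoEviction UpgradePolicy = "no_eviction" // block I/O until upgraded engines rejoin
        PolicyNoRebuild  UpgradePolicy = "no_rebuild"  // allow eviction, but never trigger rebuild
        PolicyDefault    UpgradePolicy = "default"     // eviction and rebuild both enabled
    )

    // PrepareRequest carries the parameters of the "prepare" sub-command.
    type PrepareRequest struct {
        NextVersion string        // target system version, e.g. "3.0"
        Policy      UpgradePolicy // data rebuild strategy during the upgrade
    }

    // parseMajor extracts the major version number from "X.Y[.Z]".
    func parseMajor(v string) (int, error) {
        return strconv.Atoi(strings.SplitN(v, ".", 2)[0])
    }

    // validatePrepare enforces the "consecutive major versions" rule before
    // the upgrade state is persisted by the management service.
    func validatePrepare(currentVersion string, req PrepareRequest) error {
        cur, err := parseMajor(currentVersion)
        if err != nil {
            return err
        }
        next, err := parseMajor(req.NextVersion)
        if err != nil {
            return err
        }
        if next != cur && next != cur+1 {
            return fmt.Errorf("cannot upgrade from %s to %s: major version skipping is not allowed",
                currentVersion, req.NextVersion)
        }
        return nil
    }

    func main() {
        req := PrepareRequest{NextVersion: "3.0", Policy: PolicyNoRebuild}
        fmt.Println(validatePrepare("2.8", req)) // <nil>: 2.x to 3.x is a consecutive major upgrade
    }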
Enable
After initiating a rolling upgrade with “dmg system upgrade prepare”, the administrator can specify a set of engines for upgrade by executing “dmg system upgrade enable --ranks=…”, where “ranks” accepts expressions identifying engine IDs. The designated engines must then be shut down for RPM updates. Importantly, engines not explicitly selected for upgrade should be unable to rejoin the storage cluster if they are updated accidentally. This explicit mechanism prevents operational errors that could compromise data safety.
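A small sketch, under the same assumptions as above, of how the mgmt service could track the ranks designated by “enable” and reject rejoin attempts from engines that were never selected or that run an unexpected version; the names upgradeState, enable, and canRejoin are hypothetical.

    package main

    import "fmt"

    // upgradeState is a hypothetical, in-memory view of the per-rank upgrade
    // bookkeeping held by the management service.
    type upgradeState struct {
        nextVersion  string
        enabledRanks map[uint32]bool // ranks explicitly selected via --ranks
    }

    // enable records the ranks the administrator designated for upgrade.
    func (s *upgradeState) enable(ranks []uint32) {
        for _, r := range ranks {
            s.enabledRanks[r] = true
        }
    }

    // canRejoin decides whether an engine coming back with new RPMs may join:
    // it must have been designated, and it must report exactly the version
    // that was given to "prepare".
    func (s *upgradeState) canRejoin(rank uint32, engineVersion string) error {
        if !s.enabledRanks[rank] {
            return fmt.Errorf("rank %d was not enabled for upgrade", rank)
        }
        if engineVersion != s.nextVersion {
            return fmt.Errorf("rank %d runs %s, expected %s", rank, engineVersion, s.nextVersion)
        }
        return nil
    }

    func main() {
        s := &upgradeState{nextVersion: "3.0", enabledRanks: map[uint32]bool{}}
        s.enable([]uint32{0, 1, 2, 3})
        fmt.Println(s.canRejoin(2, "3.0")) // <nil>
        fmt.Println(s.canRejoin(7, "3.0")) // rejected: rank 7 was never enabled
    }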
Commit
During a rolling upgrade, after all the data engines have been upgraded to the target version, the administrator runs the command “dmg system upgrade commit” to finalize the rolling upgrade. If all the engines in the cluster have indeed been upgraded to the initially specified version, the cluster will complete the rolling upgrade and switch to the new version protocol for future I/O services.
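The commit step could be sketched as follows; the types and the single-transaction note are assumptions about how the RAFT-backed store might be updated, not the real implementation.

    package main

    import "fmt"

    // systemState is an illustrative stand-in for the versioning data kept
    // by the management service.
    type systemState struct {
        currentVersion string            // legacy version, active until commit
        nextVersion    string            // target version given to "prepare"
        rankVersions   map[uint32]string // version each joined rank is running
    }

    // commit finalizes the rolling upgrade: it fails if any rank still runs
    // the legacy version, otherwise it promotes nextVersion to the current
    // system version (in DAOS this update would be a single transaction in
    // the RAFT-backed store, making it atomic and safe to retry).
    func (s *systemState) commit() error {
        for rank, v := range s.rankVersions {
            if v != s.nextVersion {
                return fmt.Errorf("rank %d still runs %s, cannot commit", rank, v)
            }
        }
        s.currentVersion = s.nextVersion
        s.nextVersion = ""
        return nil
    }

    func main() {
        s := &systemState{
            currentVersion: "2.8",
            nextVersion:    "3.0",
            rankVersions:   map[uint32]string{0: "3.0", 1: "3.0", 2: "2.8"},
        }
        fmt.Println(s.commit()) // rank 2 still runs 2.8, cannot commit
        s.rankVersions[2] = "3.0"
        fmt.Println(s.commit(), s.currentVersion) // <nil> 3.0
    }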
Abort
If a failure occurs during the rolling upgrade, the administrator can abort the upgrading process by running “dmg system upgrade abort”. Before executing this command, the administrator should downgrade the RPMs of the upgraded data engines while ensuring service continuity. Finally, the abort command can be executed to clear the rolling upgrade metadata maintained by the mgmt service.
Query
During the rolling upgrade, the administrator can use the query command to monitor the upgrade status, which includes the number of engines on the new/old versions and their version details.
RPM update
RPM updates can only be performed on engines that have been explicitly designated by the administrator via DMG commands. Otherwise, the updated engines will be unable to join the storage cluster.
Design details
The rolling upgrade requires modifications in several aspects:
Metadata Store: A new metadata store must be established on the mgmt service to maintain global upgrade status and information.
Control Plane Interface: Corresponding control plane interfaces should be provided to allow users to manage the rolling upgrade process.
RPC format extension and I/O protocol switch: The RPC format needs to be extended so that engines can detect version changes during normal communication and switch to the appropriate I/O protocol.
Client Compatibility: To ensure uninterrupted I/O services, the client module must support registering both the old and new I/O protocols and switching seamlessly between them.
Metadata store for rolling upgrade
The management service establishes a new RAFT-based metadata store to maintain cluster-wide rolling upgrade state, enabling upgrade continuity during node failures while ensuring data persistence and system reliability. This metadata store includes the following information:
System-wide rolling upgrade states
not_started, in_progress, completed, aborted.
The current system version and the next version
Ranks deployed with the legacy version
Ranks deployed with the next version
As previously mentioned, the legacy version remains the functional current version during the rolling upgrade; version switching only occurs after the full cluster has been upgraded to the next version. The diagram below illustrates the architecture described above.
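In addition to the diagram, the following Go sketch shows one possible shape of this persisted record; the field names and the JSON encoding are assumptions made for illustration only.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // UpgradeState enumerates the system-wide states listed above.
    type UpgradeState string

    const (
        StateNotStarted UpgradeState = "not_started"
        StateInProgress UpgradeState = "in_progress"
        StateCompleted  UpgradeState = "completed"
        StateAborted    UpgradeState = "aborted"
    )

    // UpgradeMetadata is the record that would be replicated through RAFT so
    // the upgrade can survive management-node failures.
    type UpgradeMetadata struct {
        State          UpgradeState `json:"state"`
        CurrentVersion string       `json:"current_version"` // legacy version, still active
        NextVersion    string       `json:"next_version"`    // inactive until commit
        LegacyRanks    []uint32     `json:"legacy_ranks"`    // ranks still on the legacy version
        UpgradedRanks  []uint32     `json:"upgraded_ranks"`  // ranks already on the next version
    }

    func main() {
        md := UpgradeMetadata{
            State:          StateInProgress,
            CurrentVersion: "2.8",
            NextVersion:    "3.0",
            LegacyRanks:    []uint32{2, 3},
            UpgradedRanks:  []uint32{0, 1},
        }
        out, _ := json.MarshalIndent(md, "", "  ")
        fmt.Println(string(out)) // roughly what a "query" could report to the administrator
    }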
Additionally, each engine manages metadata including:
Hard-coded engine version
The current system version obtained from the Mgmt service during cluster joining.
Join operations will fail if the engine's version is lower than the system version, or the major version difference exceeds 1.
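A hedged sketch of this join-time check, assuming simple “major.minor” version strings; checkJoin is a hypothetical helper, not an existing DAOS function.

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // parse turns "X.Y" into its numeric components.
    func parse(v string) (maj, min int) {
        parts := strings.SplitN(v, ".", 2)
        maj, _ = strconv.Atoi(parts[0])
        if len(parts) > 1 {
            min, _ = strconv.Atoi(parts[1])
        }
        return
    }

    // checkJoin applies the rule above: reject engines older than the system
    // version, and reject engines more than one major version ahead of it.
    func checkJoin(engineVersion, systemVersion string) error {
        emaj, emin := parse(engineVersion)
        smaj, smin := parse(systemVersion)
        if emaj < smaj || (emaj == smaj && emin < smin) {
            return fmt.Errorf("engine %s is older than system version %s", engineVersion, systemVersion)
        }
        if emaj-smaj > 1 {
            return fmt.Errorf("engine %s is more than one major version ahead of %s", engineVersion, systemVersion)
        }
        return nil // accepted: the engine runs the system-version protocol
    }

    func main() {
        fmt.Println(checkJoin("3.0", "2.8")) // <nil>
        fmt.Println(checkJoin("2.6", "2.8")) // rejected: older than the system version
        fmt.Println(checkJoin("4.0", "2.8")) // rejected: major version gap too large
    }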
Upon successful upgrade of all engines and administrator confirmation via 'dmg system upgrade commit', the mgmt service will change the current system version from the legacy version to the next version, and will thereafter enforce strict version homogeneity by prohibiting mixed-version engines from joining the storage cluster. During the commit, system version updates are atomic and idempotent. Post-commit validation ensures all ranks use the new-version protocols by default.
If an engine attempts RPC communication using an outdated version because it is out of sync, such requests will be rejected. The affected engine must then complete the version switch before retrying operations.
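For illustration, a minimal sketch of such a rejection path, assuming the sender’s protocol version travels in a common RPC header field; the header layout and the dispatch function are inventions for this example.

    package main

    import "fmt"

    // rpcHeader is an illustrative common header carried by every RPC.
    type rpcHeader struct {
        protoVersion string // protocol version the sender believes is active
    }

    // dispatch rejects requests from out-of-sync peers; the caller is
    // expected to refresh its cached system version and retry.
    func dispatch(h rpcHeader, activeVersion string) error {
        if h.protoVersion != activeVersion {
            return fmt.Errorf("protocol %s rejected, system now runs %s: refresh and retry",
                h.protoVersion, activeVersion)
        }
        return nil // handle the request with the active protocol
    }

    func main() {
        fmt.Println(dispatch(rpcHeader{protoVersion: "2.8"}, "3.0")) // rejected
        fmt.Println(dispatch(rpcHeader{protoVersion: "3.0"}, "3.0")) // <nil>
    }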
RPC protocol version
The selection of RPC protocol versions varies across different upgrade scenarios.
Server-First Upgrade Path
Legacy Client Support: Existing clients maintain stable connections using the original protocol version.
Backward Compatibility: Achieved via protocol version negotiation during pool connect handshake.
Client-First Upgrade Path
Automatic Protocol Adoption: Updated clients dynamically detect and switch to newer protocol versions after server upgrade.
Forward Compatibility Requirement: Servers must implement a version-aware handshake mechanism to notify clients about the version change.
In the former scenario, no protocol negotiation modifications are required. The implementation only needs to validate the network stack's capability to re-establish connections across all supported network types after server restarts/upgrades, and then ensure that communication resumes with automatic resend of pending RPCs.
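For context, the sketch below illustrates the kind of pool-connect version negotiation that already lets a legacy client keep its protocol version against an upgraded server, which is why this path needs no modification; the negotiate function and the version lists are hypothetical.

    package main

    import "fmt"

    // negotiate picks the highest protocol version supported by both sides.
    // An upgraded server still registers the legacy protocol, so an old
    // client simply keeps its original version.
    func negotiate(clientSupported, serverSupported []string) (string, error) {
        supported := map[string]bool{}
        for _, v := range serverSupported {
            supported[v] = true
        }
        // clientSupported is ordered from newest to oldest preference.
        for _, v := range clientSupported {
            if supported[v] {
                return v, nil
            }
        }
        return "", fmt.Errorf("no common protocol version")
    }

    func main() {
        // Legacy client against an upgraded server: it stays on 2.8.
        v, _ := negotiate([]string{"2.8"}, []string{"3.0", "2.8"})
        fmt.Println(v) // 2.8
    }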
In the second scenario, the DAOS client needs to register both the legacy and new RPC formats. During the upgrade, the client continues to use the legacy version as the system version, cached in the connection handle, for server communication. After the servers complete their rolling upgrade, they can loosely notify clients of the version change by setting the version in the common header of RPC replies. When the client detects the version change in a response, and the new system version matches one the client supports, it issues refresh RPCs to request additional metadata about the new version for its pool and container connections. Finally, it switches to the new I/O protocol to ensure compatibility and unlock all new features. The whole process happens automatically: the client keeps working with the old version until everything is ready, then transitions seamlessly to the new one without any service interruption.
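A hedged sketch of this client-side behavior follows; the replyHeader and refreshHandles names are inventions for illustration, and the real logic would live in the C client library rather than Go.

    package main

    import "fmt"

    // client models the version state a DAOS client might cache per connection.
    type client struct {
        activeVersion    string // version cached in the connection handle
        supportedVersion string // newest protocol this client has registered
    }

    // replyHeader is the illustrative common header of an RPC reply.
    type replyHeader struct {
        systemVersion string // system version advertised by the server
    }

    // onReply inspects every RPC reply; when the advertised system version
    // changes and the client supports it, the client refreshes pool and
    // container metadata and then switches to the new I/O protocol.
    func (c *client) onReply(h replyHeader) {
        if h.systemVersion == c.activeVersion {
            return // nothing changed, keep using the current protocol
        }
        if h.systemVersion != c.supportedVersion {
            return // client cannot speak this version, stay on the legacy protocol
        }
        c.refreshHandles(h.systemVersion) // fetch new-version metadata for pools/containers
        c.activeVersion = h.systemVersion // switch I/O protocol without interrupting service
    }

    func (c *client) refreshHandles(version string) {
        fmt.Println("refreshing pool/container handles for version", version)
    }

    func main() {
        c := &client{activeVersion: "2.8", supportedVersion: "3.0"}
        c.onReply(replyHeader{systemVersion: "2.8"}) // no-op while the upgrade is in progress
        c.onReply(replyHeader{systemVersion: "3.0"}) // servers committed: refresh, then switch
        fmt.Println("client now speaks", c.activeVersion)
    }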