Feature requirements
DAOS should support rolling upgrade to ensure continuous availability and a seamless user experience, by upgrading storage engines incrementally rather than all at once, which would require downtime of the entire storage cluster. Rolling upgrade also aligns with the fault-tolerant design of the distributed architecture: if an issue arises during the upgrade, only a subset of nodes is affected, and the remaining storage engines can still provide I/O service without disruption.
Scope statements
DAOS should support the following scenarios:
Service availability is guaranteed while DAOS servers are upgraded incrementally.
The storage cluster may run in mixed-version mode for a long time, for example days or even weeks.
A rolling upgrade can be aborted, after which the DAOS system can roll back to the old version.
DAOS servers can switch online to the protocol of the new version on completion of the rolling upgrade.
If clients are upgraded before servers, clients can switch to the new protocol only after all servers have been upgraded to the new version.
A DAOS client can automatically reconnect to upgraded servers and continue I/O with those servers.
Rolling upgrades are supported only between consecutive main versions; skipping a main version is not allowed.
Rolling downgrade is not supported in 2.8 and 3.0, but the user can incrementally roll back upgraded engines to the old version before committing the rolling upgrade.
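The consecutive-version rule above can be illustrated with a minimal sketch. It assumes main versions form a simple ordered list; the list contents and the function name `can_rolling_upgrade` are hypothetical and are not taken from the DAOS code base.

```python
# Hypothetical ordered list of main versions, used for illustration only.
# The real release sequence is defined by the DAOS project.
RELEASES = ["2.6", "2.8", "3.0"]


def can_rolling_upgrade(current, target, releases=RELEASES):
    """Return True only if target is the main version right after current."""
    if current not in releases or target not in releases:
        return False
    # A step of exactly one rejects version skipping (step > 1),
    # no-op upgrades (step == 0) and rolling downgrade (step < 0).
    return releases.index(target) - releases.index(current) == 1
```

Under these assumptions, `can_rolling_upgrade("2.8", "3.0")` is accepted, while skipping from "2.6" to "3.0" or going back from "3.0" to "2.8" is rejected.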
Definitions
Engine version
The hard-coded version of the engine software, which can differ from the runtime version of DAOS. A DAOS engine should support the protocols of both its engine version and the runtime version (system version), and these two versions should be consecutive main versions.
...
The runtime version of the DAOS storage system, which is stored persistently in a database of the management service. When a DAOS engine joins the DAOS system, it gets the system version from the management service and runs with the protocols matching this version. Note that this database is created by the management service only after a rolling upgrade is initiated, and it is deleted when the upgrade is either completed or aborted.
While a rolling upgrade is in progress, there are two system version numbers: the current runtime version (the legacy version), and the next version targeted by the rolling upgrade, which stays inactive until the upgrade completes.
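The relationship between the engine version and the two system version numbers could be modeled as in the following sketch. The names `SystemVersion`, `active`, `pending` and `runtime_protocol` are hypothetical and chosen for illustration; the actual management service database layout is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SystemVersion:
    active: str                    # current runtime (legacy) version
    pending: Optional[str] = None  # next version, inactive until completion

    def complete(self):
        """Completing the rolling upgrade activates the pending version."""
        if self.pending is None:
            raise RuntimeError("no rolling upgrade in progress")
        self.active, self.pending = self.pending, None

    def abort(self):
        """Aborting discards the pending version; the legacy one stays active."""
        self.pending = None


def runtime_protocol(engine_version, system):
    """Return the protocol version an engine runs with after joining."""
    # An engine is compatible if it is at the active version or at the
    # version targeted by the upgrade in progress.
    if engine_version not in (system.active, system.pending):
        raise ValueError("engine version is incompatible with the system")
    # Every engine follows the active system version until completion.
    return system.active
```

In this model, an engine already upgraded to "3.0" that joins a system with `active="2.8"` and `pending="3.0"` keeps running the "2.8" protocol, and switches to "3.0" only after `complete()` is called.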
...
The durable format version of a DAOS pool, which may or may not change between system versions. Durable format upgrade is not part of rolling upgrade: after the rolling upgrade completes, the administrator can decide for each pool whether to upgrade its durable format. Durable format upgrade is already supported by DAOS and is not covered by this design.
Administrator Interfaces
The administrator should explicitly start a rolling upgrade by executing a new DMG command, “upgrade”, which notifies the management service that the upgrade process is beginning; otherwise, any engine whose version differs from the system version will fail to join the storage cluster. The administrator should also explicitly complete or abort the rolling upgrade, to avoid ambiguous status and compatibility risks.
...
DMG system “upgrade” command
Upgrades are non-concurrent operations. Once an administrator initiates a rolling upgrade, no additional rolling upgrade can be started until the current one either completes or is explicitly aborted. The following sub-commands of “upgrade” might be introduced:
...
RPM updates can only be performed on engines that have been explicitly designated by the administrator via DMG commands. Otherwise, the updated engines will be unable to join the storage cluster.
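The non-concurrency rule and the join restriction for updated engines can be sketched as a small state model. This is an illustration under stated assumptions, not the management service implementation: the class, its method names and the use of an integer `rank` to identify an engine are all hypothetical, and the methods do not correspond to actual DMG sub-commands.

```python
class UpgradeError(Exception):
    pass


class ManagementService:
    """Hypothetical model of the upgrade state kept by the management service."""

    def __init__(self, system_version):
        self.system_version = system_version
        self.target_version = None  # set only while an upgrade is in progress
        self.designated = set()     # engines the administrator cleared for update

    def start_upgrade(self, target_version):
        # Upgrades are non-concurrent: refuse a second one.
        if self.target_version is not None:
            raise UpgradeError("a rolling upgrade is already in progress")
        self.target_version = target_version

    def designate(self, rank):
        if self.target_version is None:
            raise UpgradeError("no rolling upgrade in progress")
        self.designated.add(rank)

    def can_join(self, rank, engine_version):
        # Engines at the current system version can always join.
        if engine_version == self.system_version:
            return True
        # Updated engines can join only during an upgrade, and only if
        # the administrator designated them.
        return (self.target_version is not None
                and engine_version == self.target_version
                and rank in self.designated)

    def complete(self):
        # Checking that every engine has been upgraded is not modeled here.
        if self.target_version is None:
            raise UpgradeError("no rolling upgrade in progress")
        self.system_version = self.target_version
        self.target_version = None
        self.designated.clear()

    def abort(self):
        self.target_version = None
        self.designated.clear()
```

With this model, an engine updated to the new version is rejected by `can_join` until the upgrade has been started and that engine has been designated, and a second `start_upgrade` call fails until `complete()` or `abort()` has run.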
Design details
The rolling upgrade requires modifications in several aspects:
...