Introduction

In pmem mode, when a data SSD is about to wear out (i.e., the SSD faulty criteria are met), it is marked as FAULTY automatically by the BIO SSD monitor (or manually by the administrator); all the affected pool targets are then automatically marked as DOWN and rebuild is triggered accordingly. Once the faulty SSD is replaced with a new (hot-plugged) SSD through the ‘dmg storage replace’ command, reintegration is automatically triggered to bring back all the formerly DOWN targets. This mechanism ensures applications are not interrupted by SSD failures (as long as data redundancy is available).

Let’s recap what happens on the data SSD faulty and replacement events.

When a data SSD is marked as FAULTY (this essentially updates SMD to mark the device as FAULTY):

  • Enter faulty state: The in-memory device state transitions from BIO_BS_STATE_NORMAL to BIO_BS_STATE_FAULTY. Any NVMe I/O against the device will be rejected with a client retry-able error: -DER_NVME_IO.

  • Faulty reaction: A pre-registered BIO faulty reaction callback is repeatedly called to mark all the affected pool targets as DOWN through a client API (pool target exclude), until all the affected pool targets are in DOWN or DOWNOUT state (see the sketch after this list).

  • Enter teardown state: The in-memory device state transitions from BIO_BS_STATE_FAULTY to BIO_BS_STATE_TEARDOWN.

  • Teardown SPDK in-memory structures: Close the SPDK blobs and SPDK I/O channels, then unload the SPDK blobstore.

  • Enter out state: The in-memory device state transitions from BIO_BS_STATE_TEARDOWN to BIO_BS_STATE_OUT.
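For illustration, below is a minimal sketch of the faulty-reaction retry loop mentioned above. All names in it are hypothetical placeholders; only the retry-until-DOWN/DOWNOUT pattern reflects the actual design.

/* All names below are hypothetical placeholders; only the retry-until-
 * DOWN/DOWNOUT pattern reflects the actual faulty reaction design.
 */
enum tgt_state_sketch { TGT_UP, TGT_UPIN, TGT_DOWN, TGT_DOWNOUT };

struct affected_tgt {
	int			at_id;		/* pool target index */
	enum tgt_state_sketch	at_state;	/* current pool map state */
};

/* Placeholder for the pool target exclude client API */
static void tgt_exclude_one(struct affected_tgt *tgt);

static int
faulty_reaction_sketch(struct affected_tgt *tgts, int nr)
{
	int	done = 0;
	int	i;

	for (i = 0; i < nr; i++) {
		if (tgts[i].at_state == TGT_DOWN ||
		    tgts[i].at_state == TGT_DOWNOUT) {
			done++;
			continue;
		}
		/* Ask the pool service to exclude this target (mark it DOWN) */
		tgt_exclude_one(&tgts[i]);
	}
	/* Non-zero return means the reaction must be called again later */
	return nr - done;
}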

Once the device state is BIO_BS_STATE_OUT, the faulty device can be replaced by a new device through the 'dmg storage replace' command:

  • Replace device: Based on the information tracked by SMD, all blobs that existed on the old device are created on the new device, and SMD is then updated to replace the old device information with the new device’s.

  • Enter setup state: The in-memory device state transitions from BIO_BS_STATE_OUT to BIO_BS_STATE_SETUP.

  • Setup SPDK in-memory structures: Load SPDK blobstore, open existing SPDK blobs, open SPDK I/O channels.

  • Enter normal state: The in-memory device state transitions from BIO_BS_STATE_SETUP to BIO_BS_STATE_NORMAL.

  • Reint reaction: A pre-registered BIO reint reaction callback is repeatedly called to reintegrate the formerly DOWN pool targets through a client API (pool target reint), until all the affected pool targets are in UP or UPIN state. (The overall state progression of both flows is sketched after this list.)
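The two flows above walk the device through a fixed sequence of blobstore states. The sketch below summarizes that progression with a simplified C enum; the names are shortened stand-ins for the BIO_BS_STATE_* states and the helper is purely illustrative, not the actual BIO state machine.

/* Simplified view of the device state progression described above.
 * Faulty path:  NORMAL -> FAULTY -> TEARDOWN -> OUT
 * Replace path: OUT -> SETUP -> NORMAL
 */
enum bio_bs_state_sketch {
	BS_NORMAL,
	BS_FAULTY,
	BS_TEARDOWN,
	BS_OUT,
	BS_SETUP,
};

static enum bio_bs_state_sketch
next_state(enum bio_bs_state_sketch cur)
{
	switch (cur) {
	case BS_NORMAL:   return BS_FAULTY;	/* device marked FAULTY */
	case BS_FAULTY:   return BS_TEARDOWN;	/* faulty reaction completed */
	case BS_TEARDOWN: return BS_OUT;	/* SPDK structures torn down */
	case BS_OUT:      return BS_SETUP;	/* 'dmg storage replace' issued */
	case BS_SETUP:    return BS_NORMAL;	/* SPDK structures set up */
	default:          return cur;
	}
}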

In md-on-ssd mode, the foregoing data SSD faulty & reint workflow remains unchanged, but a few new challenges for WAL/Meta SSD faulty & reint events need to be addressed:

  1. Given that WAL commit errors & checkpointing flush errors are not tolerated in the current design, two extra faulty criteria will be added for WAL or meta SSDs: WAL commit failure and checkpointing flush failure.

  2. Once a WAL/Meta SSD is marked as FAULTY, the tmpfs contains uncommitted changes that cannot be undone in the current design. To avoid accessing those uncommitted changes, any inflight or incoming request needs to be rejected with the client retry-able error -DER_NVME_IO.

  3. Since a faulty WAL/Meta SSD makes the per-target in-memory structures (everything attached to ds_pool_child) invalid, all these in-memory structures need to be torn down in the “faulty reaction” phase and set up again in the “reint reaction” phase. Meanwhile, the per-target pool child impacted by a faulty WAL/Meta SSD should be skipped when starting pools on engine start; otherwise, the engine will fail to start.

  4. When SSD roles (WAL, meta or data) are assigned to separate devices, a faulty WAL (or meta) device invalidates the metadata (or WAL data) and data located on the other healthy device(s); hence all these invalidated blobs on the healthy device(s) need to be recreated in the “replace device” phase.

  5. When a faulty WAL/Meta SSD impacts RDB, the RDB in-memory structures need to be torn down in the “faulty reaction” phase, and all the impacted ranks should always be excluded from the PS membership until the faulty SSD is replaced.

Requirements for md-on-ssd Phase 1 Productionization

In scope

Issues #1 to #4 above should be addressed so that a faulty WAL/Meta SSD impacts only a minimal set of pool targets.

Out of scope

Properly addressing issue #5 might involve wire and disk format changes, which might not fit into the release schedule. For phase 1 productionization, we could use a simple workaround of excluding (then reintegrating) the whole rank when the faulty WAL/Meta SSD contains RDB; the downside of such a workaround is that all pool targets of the rank will be unnecessarily impacted by a single faulty WAL/Meta SSD.

Design Overview

Track per-target faulty in VOS layer

To address issue #2, a pair of VOS APIs will be introduced to set/get a per-target error code.

/* Set a global error code for current target, it'll be tracked in VOS TLS (vos_tls) */
void vos_tgt_error_set(int error);

/* Read the error code for current target */
int vos_tgt_error_get(void);

vos_tgt_error_set(-DER_NVME_IO) will be called on WAL commit failure (or checkpointing flush failure); vos_tgt_error_set(0) will be called in the “reint reaction” to clear the per-target global error code.
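A minimal sketch of how this pair could be backed by the per-xstream VOS TLS is shown below; the field name vtl_tgt_error and the accessor tgt_tls() are assumptions, not the final vos_tls layout.

/* Sketch only: assumes an int field (here named vtl_tgt_error) is added to
 * the per-xstream vos_tls structure; tgt_tls() is a hypothetical stand-in
 * for however the real code looks up the current xstream's vos_tls.
 */
struct vos_tls_sketch {
	int	vtl_tgt_error;	/* 0, or a retry-able error such as -DER_NVME_IO */
};

static struct vos_tls_sketch *tgt_tls(void);	/* hypothetical TLS accessor */

void
vos_tgt_error_set(int error)
{
	tgt_tls()->vtl_tgt_error = error;
}

int
vos_tgt_error_get(void)
{
	return tgt_tls()->vtl_tgt_error;
}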

All VOS object APIs will be modified to call vos_tgt_error_get() on function entry & exit; if the acquired error code is non-zero, the API will fail immediately with that value. Meanwhile, all the VOS object API callers (from regular pools or RDB) need to be examined to verify that the error code can be properly propagated back to the client.
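The entry/exit check could follow a pattern like the hypothetical wrapper below; the function name and argument list are placeholders for the real VOS object APIs.

/* Hypothetical VOS object API illustrating the intended entry/exit check. */
int
vos_obj_dummy_op(void *args)
{
	int rc;

	/* Entry check: reject the call up-front if the target is faulty */
	rc = vos_tgt_error_get();
	if (rc != 0)
		return rc;	/* e.g. -DER_NVME_IO, retry-able on the client */

	/* ... the actual object operation would go here ... */
	rc = 0;

	/* Exit check: the target may have turned faulty while we were running */
	if (rc == 0)
		rc = vos_tgt_error_get();

	return rc;
}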

Note: The target faulty information could also be acquired by querying SMD, but querying SMD directly on the performance-critical path is not affordable.

Mark faulty for WAL commit (or checkpointing flush) failure

On a WAL commit (vos_wal_commit()) or checkpointing flush (vos_meta_flush_post()) failure, vos_tgt_error_set(-DER_NVME_IO) will be called to mark the target as faulty; in addition, BIO will internally mark the corresponding WAL/Meta SSD as faulty automatically when any NVMe write error is hit.
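A hedged sketch of the hook point on the error path is shown below; the surrounding function is illustrative, only the call to vos_tgt_error_set(-DER_NVME_IO) reflects the design.

/* Illustrative error path only; not the actual vos_wal_commit() or
 * vos_meta_flush_post() body.
 */
static int
wal_or_flush_error_path_sketch(int rc)
{
	if (rc != 0) {
		/* The WAL commit (or checkpointing flush) failed: mark the
		 * target faulty so that subsequent VOS calls are rejected
		 * with a client retry-able error.
		 */
		vos_tgt_error_set(-DER_NVME_IO);
		rc = -DER_NVME_IO;
	}
	return rc;
}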

Start & Stop per-target pool individually

Currently all pools (ds_pool) are automatically started/stopped on engine start/stop, and each per-target pool (ds_pool_child) is consequently started/stopped by a thread collective call to pool_child_add_one()/pool_child_delete_one(). To address issue #3, we need to make the per-target pool able to be started/stopped individually, which may require quite a few changes to the current code:

  • Move container stop into pool_child_delete_one().

  • Close all per-target container handles in pool_child_delete_one().

  • Abort per-target rebuild ULTs in pool_child_delete_one().

  • Ensure that the thread collective calls won’t inadvertently fail due to a faulty target.

Once ds_pool_child can be started/stopped individually, we can stop/start the ds_pool_child in the “faulty reaction”/“reint reaction” phase; in addition, on engine start, we can query SMD and skip the ds_pool_child start for any target that has a faulty WAL/Meta SSD assigned.
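A rough sketch of that engine-start path is shown below; smd_tgt_is_faulty() and ds_pool_child_start_one() are hypothetical names used only for illustration.

/* Hypothetical engine-start helper: query SMD, skip the targets whose
 * WAL/Meta SSD is already marked FAULTY, and start the remaining pool
 * children. Both helpers below are placeholder names.
 */
struct ds_pool;
static int smd_tgt_is_faulty(int tgt_id);				/* placeholder */
static int ds_pool_child_start_one(struct ds_pool *pool, int tgt_id);	/* placeholder */

static int
pool_children_start_sketch(struct ds_pool *pool, int nr_tgts)
{
	int	tgt_id;
	int	rc = 0;

	for (tgt_id = 0; tgt_id < nr_tgts; tgt_id++) {
		if (smd_tgt_is_faulty(tgt_id))
			continue;	/* leave this ds_pool_child stopped */

		rc = ds_pool_child_start_one(pool, tgt_id);
		if (rc != 0)
			break;
	}
	return rc;
}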

Future Work

Addressing issue #5 may require:

  • Extend the pool map to have an extra sys target state, so that by checking the pool map, a rank with a DOWN sys target can always be excluded from the PS membership. This involves both wire protocol & on-disk layout changes.

  • Provide pool service step down/up APIs that can be called by the “faulty reaction”/“reint reaction”. These APIs should preferably be non-blocking and idempotent, and a query API could be provided to check whether the step up/down has completed (see the sketch below).
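Possible shapes for such APIs are sketched below; all names and signatures are assumptions, written only to make the non-blocking/idempotent requirement concrete.

/* Hypothetical prototypes; all names and signatures are assumptions. */

/* Ask the local replica to step down from (leave) the PS membership.
 * Non-blocking and idempotent: calling it again while a previous step
 * down is still in progress is a no-op.
 */
int ds_pool_svc_step_down_begin(uuid_t pool_uuid);

/* Ask the local replica to step up to (rejoin) the PS membership. */
int ds_pool_svc_step_up_begin(uuid_t pool_uuid);

/* Query whether the previously requested step down/up has completed. */
int ds_pool_svc_step_query(uuid_t pool_uuid, bool *done);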
