Introduction

In pmem mode, when a data SSD is about to wear out (that is, the SSD faulty criteria are satisfied), it is marked as FAULTY automatically by the BIO SSD health monitor, or manually by an administrator. All the impacted pool targets are then automatically marked as DOWN and rebuild is triggered accordingly. Once the faulty SSD is replaced by a new (hot-plugged) SSD through the ‘dmg storage replace’ command, reintegration is automatically triggered to bring back all the formerly DOWN pool targets. NVMe I/O errors against the faulty device are converted to a retryable error, -DER_NVME_IO, before being returned to the client, which ensures that applications are not interrupted by SSD failures (when data redundancy is available).

Let’s recap what is performed on data SSD faulty and replacement events.

When a data SSD is marked as FAULTY (which basically marks the device as FAULTY in SMD):

Enter faulty state: The in-memory device state transitions from BIO_BS_STATE_NORMAL to BIO_BS_STATE_FAULTY. Any NVMe I/O against the device will be rejected with a client-retryable error: -DER_NVME_IO.
Faulty reaction: A pre-registered BIO faulty reaction callback is repeatedly called to mark all the affected pool targets as DOWN through a client API (pool target exclude), until all the affected pool targets are in DOWN or DOWNOUT state.
Enter teardown state: The in-memory device state transitions from BIO_BS_STATE_FAULTY to BIO_BS_STATE_TEARDOWN.
Teardown SPDK in-memory structures: Repeatedly try to close SPDK blobs and SPDK I/O channels and to unload the SPDK blobstore, until everything is cleared.
Enter out state: The in-memory device state transitions from BIO_BS_STATE_TEARDOWN to BIO_BS_STATE_OUT.
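
The faulty-side steps above can be read as a small state machine driven by a periodic monitor. The following is a minimal sketch under stated assumptions, not the BIO implementation: every sketch_-prefixed name, the counters standing in for pending pool targets and open SPDK resources, and the placeholder error value are hypothetical. Only the BIO_BS_STATE_* names and the meaning of -DER_NVME_IO come from the text. Rejecting I/O in every state other than NORMAL is also an assumption.

```c
#include <assert.h>

/* State names follow the text; the numeric values are illustrative. */
enum bio_bs_state {
	BIO_BS_STATE_NORMAL,
	BIO_BS_STATE_FAULTY,
	BIO_BS_STATE_TEARDOWN,
	BIO_BS_STATE_OUT,
	BIO_BS_STATE_SETUP,
};

/* Placeholder magnitude for DER_NVME_IO, NOT the real DAOS value. */
#define SKETCH_DER_NVME_IO 1

/* Hypothetical per-device bookkeeping. */
struct sketch_dev {
	enum bio_bs_state state;
	int               tgts_not_down; /* targets not yet DOWN/DOWNOUT */
	int               spdk_refs;     /* blobs/channels still open */
};

/* Health monitor (or administrator) marks the device as faulty. */
void sketch_mark_faulty(struct sketch_dev *d)
{
	assert(d->state == BIO_BS_STATE_NORMAL);
	d->state = BIO_BS_STATE_FAULTY;
}

/* I/O admission: only a NORMAL device accepts NVMe I/O (assumption). */
int sketch_io_check(const struct sketch_dev *d)
{
	return d->state == BIO_BS_STATE_NORMAL ? 0 : -SKETCH_DER_NVME_IO;
}

/* One monitor tick: retry the current phase, advance when it is done. */
void sketch_faulty_tick(struct sketch_dev *d)
{
	switch (d->state) {
	case BIO_BS_STATE_FAULTY:
		/* Stands in for one "pool target exclude" attempt. */
		if (d->tgts_not_down > 0)
			d->tgts_not_down--;
		if (d->tgts_not_down == 0)
			d->state = BIO_BS_STATE_TEARDOWN;
		break;
	case BIO_BS_STATE_TEARDOWN:
		/* Stands in for closing one blob or I/O channel. */
		if (d->spdk_refs > 0)
			d->spdk_refs--;
		if (d->spdk_refs == 0)
			d->state = BIO_BS_STATE_OUT;
		break;
	default:
		break; /* nothing to do in the other states */
	}
}
```

Because each phase is retried on every tick rather than executed once, a transient failure of the exclude call or of a blob close simply delays the transition instead of aborting it.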

Once the device state is BIO_BS_STATE_OUT, the faulty device can be replaced by a new device through the 'dmg storage replace' command:

Replace device: According to the information in SMD, all blobs that existed on the old device are created on the new device; SMD is then updated to replace the old device information with that of the new device.
Enter setup state: The in-memory device state transitions from BIO_BS_STATE_OUT to BIO_BS_STATE_SETUP.
Setup SPDK in-memory structures: Repeatedly try to load the SPDK blobstore, open the existing SPDK blobs and open SPDK I/O channels, until everything is set up.
Enter normal state: The in-memory device state transitions from BIO_BS_STATE_SETUP to BIO_BS_STATE_NORMAL.
Reint reaction: A pre-registered BIO reint reaction callback is repeatedly called to reintegrate the formerly DOWN pool targets through a client API (pool target reint), until all the affected pool targets are in UP or UPIN state.
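
Taken together, the faulty and replacement steps form a single cycle: NORMAL, FAULTY, TEARDOWN, OUT, SETUP, and back to NORMAL. The sketch below encodes that cycle as a successor table so that an illegal jump (for example, replacing a device that has not reached OUT) can be detected. It is an illustration only: the sketch_-prefixed helpers and SKETCH_STATE_NR are hypothetical names, and the enum values are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* State names follow the text; the numeric values are illustrative. */
enum bio_bs_state {
	BIO_BS_STATE_NORMAL,
	BIO_BS_STATE_FAULTY,
	BIO_BS_STATE_TEARDOWN,
	BIO_BS_STATE_OUT,
	BIO_BS_STATE_SETUP,
};

#define SKETCH_STATE_NR 5 /* hypothetical: number of states above */

/* True if 'to' is the single legal successor of 'from' in the cycle. */
bool sketch_transition_ok(enum bio_bs_state from, enum bio_bs_state to)
{
	static const enum bio_bs_state next[SKETCH_STATE_NR] = {
		[BIO_BS_STATE_NORMAL]   = BIO_BS_STATE_FAULTY,
		[BIO_BS_STATE_FAULTY]   = BIO_BS_STATE_TEARDOWN,
		[BIO_BS_STATE_TEARDOWN] = BIO_BS_STATE_OUT,
		[BIO_BS_STATE_OUT]      = BIO_BS_STATE_SETUP,
		[BIO_BS_STATE_SETUP]    = BIO_BS_STATE_NORMAL,
	};

	if ((int)from < 0 || (int)from >= SKETCH_STATE_NR)
		return false;
	return next[from] == to;
}

/* 'dmg storage replace' is only meaningful once teardown has finished. */
bool sketch_replace_allowed(enum bio_bs_state cur)
{
	return cur == BIO_BS_STATE_OUT;
}

/* Apply a transition, asserting that it follows the cycle. */
void sketch_transition(enum bio_bs_state *cur, enum bio_bs_state to)
{
	assert(sketch_transition_ok(*cur, to));
	*cur = to;
}
```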

In md-on-ssd mode, the foregoing data SSD faulty and reint workflow remains unchanged, but a few new challenges for WAL/Meta SSD faulty and reint events need to be addressed:

1. Given that WAL commit errors and checkpointing flush errors cannot be tolerated (there is no easy way to UNDO, so a WAL/Meta SSD error is regarded as a fatal error), two extra faulty criteria will be added for WAL or meta SSDs: WAL commit failure and checkpointing flush failure.
2. Once a WAL/Meta SSD is marked as FAULTY, this implies that the tmpfs contains some uncommitted changes. To avoid access to the uncommitted changes, any inflight or incoming requests need to be rejected with a client-retryable error: -DER_NVME_IO.
3. Since a faulty WAL/Meta SSD invalidates the per-target in-memory structures (everything attached to ds_pool_child), all of these in-memory structures need to be torn down in the “faulty reaction” phase and set up again in the “reint reaction” phase. Meanwhile, on engine start, the per-target pool child impacted by a faulty WAL/Meta SSD should not be started; otherwise, engine start will fail.
4. When SSD roles (WAL, meta or data) are assigned to separate devices, a faulty WAL (or meta) device invalidates the metadata (or WAL data) and the data located on the other healthy device(s), hence all these invalid blobs on the healthy device(s) need to be recreated in the “replace device” phase.
5. When a faulty WAL/Meta SSD impacts RDB, the RDB in-memory structures need to be torn down in the “faulty reaction” phase, and all the impacted ranks should be excluded from the PS membership until the faulty SSD is replaced.

Requirements for md-on-ssd Phase 1 Productionization

In scope

Issues #1 to #4 above should be addressed so that a faulty WAL/Meta SSD impacts as few pool targets as possible.

Out of scope

Properly addressing issue #5 might involve wire and disk format changes, and that work might not fit into the release schedule. For phase 1 productionization, we could use a simple workaround: exclude (and later reintegrate) the whole rank when the faulty WAL/Meta SSD contains RDB. The downside of this workaround is that all the pool targets on the rank are unnecessarily impacted by a single faulty WAL/Meta SSD.

Design Overview

Track per-target faulty in VOS layer

To address issue #2, a pair of VOS APIs will be introduced to set and get the target error code.

/* Set a global error code for current target, it'll be tracked in VOS TLS (vos_tls) */
void vos_tgt_error_set(int error);

/* Read the error code for current target */
int vos_tgt_error_get(void);

vos_tgt_error_set(-DER_NVME_IO) will be called on WAL commit failure (or checkpointing flush failure), and vos_tgt_error_set(0) will be called in the "reint reaction" to clear the per-target global error code.

All VOS object APIs will be modified to call vos_tgt_error_get() on function entry and exit. If the acquired error code is non-zero, the API will fail immediately with that value. In addition, all the callers of the VOS object APIs (from regular pools or RDB) need to be examined to verify that the error code is correctly propagated back to the client.

Note: The target faulty information can also be acquired by querying SMD, but querying SMD directly on the performance-critical path is too expensive.
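
The entry and exit checks described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the VOS implementation: a C11 _Thread_local variable stands in for the field in VOS TLS (vos_tls), SKETCH_DER_NVME_IO is a placeholder rather than the real DAOS error value, and sketch_obj_update() with its two callback bodies are hypothetical names that stand in for a VOS object API and its internal work.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder magnitude for DER_NVME_IO, NOT the real DAOS value. */
#define SKETCH_DER_NVME_IO 1

/* Stand-in for the error field kept in VOS TLS (one per target). */
static _Thread_local int sketch_tgt_error;

/* Set a global error code for the current target. */
void vos_tgt_error_set(int error)
{
	assert(error <= 0); /* zero clears, negative marks the target faulty */
	sketch_tgt_error = error;
}

/* Read the error code for the current target. */
int vos_tgt_error_get(void)
{
	return sketch_tgt_error;
}

/* Hypothetical object API: 'body' stands in for the real work. */
int sketch_obj_update(int (*body)(void *), void *arg)
{
	int rc;

	/* Entry check: refuse to touch a target that is already faulty. */
	rc = vos_tgt_error_get();
	if (rc != 0)
		return rc;

	rc = body(arg);
	if (rc != 0)
		return rc;

	/* Exit check: the target may have turned faulty while we ran. */
	return vos_tgt_error_get();
}

/* Hypothetical body that succeeds. */
int sketch_body_ok(void *arg)
{
	(void)arg;
	return 0;
}

/* Hypothetical body during which a WAL commit failure is recorded. */
int sketch_body_faults(void *arg)
{
	(void)arg;
	vos_tgt_error_set(-SKETCH_DER_NVME_IO);
	return 0;
}
```

The exit check matters because an operation can report local success while a concurrent WAL commit on the same target fails. Without it, the caller could acknowledge a change that only exists in the tmpfs.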

Mark faulty on WAL commit (or checkpointing flush) failure

On WAL commit (vos_wal_commit()) or checkpointing flush (vos_meta_flush_post()) failure, vos_tgt_error_set(-DER_NVME_IO) will be called to mark the target as faulty. In addition, BIO will internally mark the corresponding WAL/Meta SSD as faulty automatically whenever an NVMe write error is hit.

Start & Stop per-target pool individually

Currently all pools (ds_pool) are automatically started/stopped on engine start/stop, and each per-target pool (ds_pool_child) is consequently started/stopped by a thread collective call of pool_child_add_one()/pool_child_delete_one(). To address issue #3, we need to make it possible to start/stop each per-target pool individually, which may require a considerable number of changes to the current code:

Move container stop into pool_child_delete_one().
Abort per-target rebuild ULTs in pool_child_delete_one().
Ensure all the thread collective calls won’t fail inadvertently due to the faulty target.

Once the ds_pool_child can be started/stopped individually, we can stop/start the ds_pool_child in the "faulty reaction"/"reint reaction" phase. Also, on engine start, we can query SMD and skip the ds_pool_child start on any target that has a faulty WAL/Meta SSD assigned.
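
A minimal sketch of this per-target start/stop behaviour follows. Everything in it is hypothetical and only illustrates the control flow: struct sketch_tgt models one target, its sys_dev_faulty flag stands in for the result of an SMD query, and the sketch_-prefixed functions stand in for the individual start/stop entry points that the changes listed above would enable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-target view used only for this sketch. */
struct sketch_tgt {
	int  tgt_id;
	bool sys_dev_faulty; /* stand-in for "SMD reports faulty WAL/Meta SSD" */
	bool child_started;  /* is the ds_pool_child running on this target? */
};

/*
 * Engine start: start the pool child on every healthy target and skip
 * targets with a faulty WAL/Meta SSD instead of failing the whole start.
 * Returns the number of pool children started.
 */
int sketch_pool_children_start(struct sketch_tgt *tgts, size_t nr)
{
	int started = 0;

	assert(tgts != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		if (tgts[i].sys_dev_faulty) {
			tgts[i].child_started = false;
			continue; /* skipped, not an error */
		}
		tgts[i].child_started = true;
		started++;
	}
	return started;
}

/* "Faulty reaction": stop only the pool child on the affected target. */
void sketch_faulty_reaction(struct sketch_tgt *t)
{
	t->sys_dev_faulty = true;
	t->child_started  = false;
}

/* "Reint reaction": the SSD was replaced, start the pool child again. */
void sketch_reint_reaction(struct sketch_tgt *t)
{
	t->sys_dev_faulty = false;
	t->child_started  = true;
}
```

Both reactions are written so that calling them twice leaves the target in the same state, which fits callbacks that are invoked repeatedly until the pool map reflects the change.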

Future Work

Addressing issue #5 may require:

Extend the pool map with an extra sys target state, so that by checking the pool map, a rank with a DOWN sys target can always be excluded from the PS membership. This involves both wire protocol and on-disk layout changes.
Provide pool service step down/up APIs which can be called by the “faulty reaction”/“reint reaction”. These APIs should preferably be non-blocking and idempotent, and a query API could be provided to check whether the step up/down has completed.
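
To make the "non-blocking and idempotent" requirement concrete, here is one possible shape for such an API. It is purely a sketch of the calling convention: no such functions exist in the text, and all names (sketch_ps_*), the intermediate states, and the sketch_ps_progress() helper that models background completion are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical replica states, including the in-progress ones. */
enum sketch_ps_state {
	SKETCH_PS_UP,
	SKETCH_PS_STEPPING_DOWN,
	SKETCH_PS_DOWN,
	SKETCH_PS_STEPPING_UP,
};

struct sketch_ps {
	enum sketch_ps_state state;
};

/* Request step down; returns immediately, safe to call repeatedly. */
void sketch_ps_step_down(struct sketch_ps *ps)
{
	assert(ps != NULL);
	if (ps->state == SKETCH_PS_UP || ps->state == SKETCH_PS_STEPPING_UP)
		ps->state = SKETCH_PS_STEPPING_DOWN;
	/* Already DOWN or STEPPING_DOWN: nothing to do. */
}

/* Request step up; returns immediately, safe to call repeatedly. */
void sketch_ps_step_up(struct sketch_ps *ps)
{
	assert(ps != NULL);
	if (ps->state == SKETCH_PS_DOWN || ps->state == SKETCH_PS_STEPPING_DOWN)
		ps->state = SKETCH_PS_STEPPING_UP;
	/* Already UP or STEPPING_UP: nothing to do. */
}

/* Models the background work finishing; not part of the public API. */
void sketch_ps_progress(struct sketch_ps *ps)
{
	assert(ps != NULL);
	if (ps->state == SKETCH_PS_STEPPING_DOWN)
		ps->state = SKETCH_PS_DOWN;
	else if (ps->state == SKETCH_PS_STEPPING_UP)
		ps->state = SKETCH_PS_UP;
}

/* Query APIs: has the requested step down/up completed? */
bool sketch_ps_is_down(const struct sketch_ps *ps)
{
	return ps->state == SKETCH_PS_DOWN;
}

bool sketch_ps_is_up(const struct sketch_ps *ps)
{
	return ps->state == SKETCH_PS_UP;
}
```

With this shape, a reaction callback that is invoked repeatedly can issue the request on every invocation and use the query call to decide when to move on, without ever blocking the caller.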

WAL/Meta SSD faulty and reintegration