Requirements for md-on-ssd Phase 1 Productionization

In scope

Issues #1 to #4 above should be addressed so that a faulty WAL/Meta SSD impacts only the minimal set of pool targets.

Out of scope

Properly addressing issue #5 might involve wire and disk format changes, which might not fit into the release schedule. For phase 1 productionization, we could use a simple workaround of excluding (then reintegrating) the whole rank when the faulty WAL/Meta SSD contains RDB. The downside of this workaround is that all pool targets of the rank will be unnecessarily impacted by a single faulty WAL/Meta SSD.

Design Overview

Track per-target fault state in the VOS layer

To address issue #2, a pair of VOS APIs will be introduced to set and get a per-target error code.

Code Block
/* Set a global error code for the current target; it'll be tracked in VOS TLS (vos_tls) */
void vos_tgt_error_set(int error);

/* Read the error code for the current target */
int vos_tgt_error_get(void);

vos_tgt_error_set(-DER_NVME_IO) will be called on WAL commit failure (or checkpointing flush failure), and vos_tgt_error_set(0) will be called in the "reint reaction" to clear the per-target global error code.

All VOS object APIs will be modified to call vos_tgt_error_get() on function entry and exit; if the acquired error code is non-zero, the API will fail immediately with that value. In addition, all VOS object API callers (from regular pools or RDB) need to be examined to verify that the error code is properly propagated back to the client.
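
The following is a minimal sketch of how this pair could be backed by per-xstream storage and consulted on API entry and exit; the vtl_tgt_error variable and vos_obj_update_example() below are illustrative assumptions, not the actual DAOS code.

Code Block
/* Illustrative sketch only: in practice the error code lives in vos_tls. */
static __thread int vtl_tgt_error;	/* per-xstream error code (assumption) */

void
vos_tgt_error_set(int error)
{
	vtl_tgt_error = error;
}

int
vos_tgt_error_get(void)
{
	return vtl_tgt_error;
}

/* Example entry/exit check in a VOS object API (hypothetical wrapper) */
int
vos_obj_update_example(void)
{
	int rc = vos_tgt_error_get();

	if (rc != 0)
		return rc;	/* target marked faulty, fail instantly */

	/* ... normal update path ... */

	return vos_tgt_error_get();	/* re-check on exit */
}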

Note: The target fault information could also be acquired by querying SMD, but querying SMD directly on the performance-critical path is not affordable.

Mark target faulty on WAL commit (or checkpointing flush) failure

On WAL commit (vos_wal_commit()) or checkpointing flush (vos_meta_flush_post()) failure, vos_tgt_error_set(-DER_NVME_IO) will be called to mark the target as faulty. In addition, BIO internally will automatically mark the corresponding WAL/Meta SSD as faulty when any NVMe write error is hit.
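
A hedged sketch of this failure hook follows; the helper name below is hypothetical, and only vos_wal_commit(), vos_meta_flush_post() and -DER_NVME_IO come from the design above.

Code Block
/* Hypothetical helper called from the vos_wal_commit() / vos_meta_flush_post()
 * error paths. BIO has already marked the WAL/Meta SSD faulty on the NVMe
 * write error; here we additionally mark the target faulty in VOS so that
 * subsequent VOS object calls fail fast. */
static void
vos_tgt_mark_faulty_on_io_error(int rc)
{
	if (rc != 0)
		vos_tgt_error_set(-DER_NVME_IO);
}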

Start & Stop per-target pool individually

Currently all pools (ds_pool) are automatically started/stopped on engine start/stop, and each per-target pool (ds_pool_child) is consequently started/stopped by a thread collective call to pool_child_add_one()/pool_child_delete_one(). To address issue #3, we need to make the per-target pool startable/stoppable individually, which may require quite a lot of changes to the current code (see the sketch after the list):

  • Move container stop into pool_child_delete_one().

  • Close all per-target container handles in pool_child_delete_one().

  • Abort per-target rebuild ULTs in pool_child_delete_one().

  • Ensure that none of the thread collective calls inadvertently fail due to a faulty target.
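
A hedged sketch of the resulting pool_child_delete_one() sequence is shown below; apart from pool_child_delete_one() itself, the helper names are assumptions used only to illustrate the ordering.

Code Block
/* Illustrative ordering only; helper names are assumptions. */
static int
pool_child_delete_one(void *uuid)
{
	struct ds_pool_child *child = pool_child_lookup(uuid);	/* assumed lookup */

	if (child == NULL)
		return 0;	/* already stopped, keep the call idempotent */

	cont_child_stop_all(child);		/* stop per-target containers (moved here) */
	cont_child_close_handles(child);	/* close all per-target container handles */
	rebuild_ult_abort_all(child);		/* abort per-target rebuild ULTs */

	ds_pool_child_put(child);		/* drop the start reference */
	return 0;
}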

Once the ds_pool_child can be started/stopped individually, we could stop/start the ds_pool_child in the "faulty reaction"/"reint reaction" phases. Also, on engine start, we could query SMD and skip the ds_pool_child start for any target that has a faulty WAL/Meta SSD assigned.
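
A hedged sketch of the engine-start side, assuming a hypothetical SMD-backed check (tgt_wal_meta_dev_faulty() below is not an existing API):

Code Block
static int
pool_child_start_one(void *uuid)
{
	/* Assumed helper: returns true when SMD shows the WAL/Meta SSD
	 * assigned to the current target in a faulty state. */
	if (tgt_wal_meta_dev_faulty())
		return 0;	/* leave this ds_pool_child stopped */

	/* Existing start path (signature simplified for the sketch) */
	return pool_child_add_one(uuid);
}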

Future Work

Addressing issue #5 may require:

  • Extend the pool map with an extra sys target state, so that by checking the pool map, a rank with a DOWN sys target can always be excluded from PS membership. This involves both wire protocol and on-disk layout changes.

  • Provide pool service step-down/step-up APIs that can be called by the "faulty reaction"/"reint reaction". These APIs should preferably be non-blocking and idempotent, and a query API could be provided to check whether the step up/down has completed (see the sketch below).
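
A hedged sketch of what such APIs could look like; all names and signatures below are assumptions, not an existing DAOS interface.

Code Block
/* Ask the local PS replica to resign from / rejoin PS membership for the
 * given pool. Both calls return immediately, the transition completes
 * asynchronously, and repeated calls are harmless (idempotent). */
int ds_pool_svc_step_down_begin(uuid_t pool_uuid);
int ds_pool_svc_step_up_begin(uuid_t pool_uuid);

/* Query whether the previously requested step up/down has completed. */
int ds_pool_svc_step_query(uuid_t pool_uuid, bool *completed);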