Metadata on SSD Phase 1 Test Plan

Introduction - An overview of Metadata on SSD can be found at Metadata on SSDs - DAOS Community - Confluence (atlassian.net)

This test plan covers additional tests needed to verify the extra work required upon engine restart. This includes loading the blob from the SSD and re-applying any changes from the WAL.
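
As a rough mental model of that restart work, the sketch below rebuilds state from a checkpointed copy plus the WAL entries committed after it. It is a toy illustration only: `rehydrate`, the dict-based state, and the sequence numbers are assumptions made for this example and do not reflect the DAOS engine's actual data structures.

```python
def rehydrate(checkpoint, checkpoint_seq, wal):
    """Rebuild in-memory state from a checkpoint plus newer WAL entries.

    checkpoint:     dict of key -> value as last flushed (hypothetical model).
    checkpoint_seq: sequence number of the last WAL entry already contained
                    in the checkpoint.
    wal:            list of (seq, key, value) tuples in commit order.
    Returns (state, replayed_count).
    """
    state = dict(checkpoint)  # start from the checkpointed copy
    replayed = 0
    for seq, key, value in wal:
        if seq <= checkpoint_seq:
            continue  # already reflected in the checkpoint, nothing to replay
        state[key] = value  # re-apply the logged change
        replayed += 1
    return state, replayed


# Hypothetical WAL with three committed updates.
wal = [(1, "a", 1), (2, "b", 2), (3, "a", 3)]

# Unsynchronized: the checkpoint only covers entry 1, so entries 2 and 3 replay.
unsynced_state, unsynced_replayed = rehydrate({"a": 1}, 1, wal)

# Synchronized: the checkpoint covers every entry, so nothing replays.
synced_state, synced_replayed = rehydrate({"a": 3, "b": 2}, 3, wal)
```

In this model a synchronized WAL and checkpoint replays nothing, while an unsynchronized WAL re-applies the newer entries. Both paths end in the same state, which is the property the "w/ WAL replay" and "w/o WAL replay" cases in the table below check through an ior read.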

HW Requirements - Identify the hardware that will be required to execute the tests covered by the plan.

SW Components - DAOS, Python, Avocado, IOR, Dfuse, POSIX environment

Opens/Limitations - List any unresolved issues or limitations of the testing.

Test Description	JIRA	Priority	Test Steps	Resource	Release
Verify data access after engine restart w/ WAL replay + w/ check pointing (unsynchronized WAL & VOS)	DAOS-13009: Implement verify data access after engine restart w/ WAL replay + w/ check pointing test case (Resolved)	P1	1. Start 2 DAOS servers with 1 engine per server. 2. Create a single pool and container. 3. Run ior w/ DFS to populate the container with data. 4. After ior has completed, shut down every engine cleanly (`dmg system stop`). 5. Restart each engine (`dmg system start`). 6. Verify the previously written data matches with an ior read.		2.4
Verify POSIX data access after engine restart (check modification timestamp?)	DAOS-13010: Implement verify POSIX data access after engine restart test case (Resolved)	P1	1. Start 2 DAOS servers with 1 engine per server. 2. Create a single pool and a POSIX container. 3. Start dfuse. 4. Write data to the dfuse mount point and then read it back. 5. After the read has completed, unmount dfuse. 6. Shut down every engine cleanly (`dmg system stop`). 7. Restart each engine (`dmg system start`). 8. Remount dfuse. 9. Verify the previously written data exists. 10. Verify more data can be written.		2.4
Verify device roles in dmg storage query output	https://daosio.atlassian.net/browse/DAOS-13011	P1	1. Start 1 DAOS server with 1 engine per server. 2. Get a list of device information (`dmg storage query list-devices`). 3. Verify each device’s role entry matches the expected value based upon the server storage configuration.		2.4
Verify data access after engine restart w/o WAL replay + w/ check pointing (synchronized WAL & VOS)	https://daosio.atlassian.net/browse/DAOS-13012	P2	1. Start 2 DAOS servers with 1 engine per server. 2. Create a single pool and container. 3. Run ior w/ DFS to populate the container with data. 4. Confirm that all data has been checkpointed. 5. After ior has completed, shut down every engine cleanly (`dmg system stop`). 6. Restart each engine (`dmg system start`). 7. Verify the previously written data matches with an ior read.	DAOS-13016: Add new mechanism to verify data has been checkpointed (Resolved)	2.4
Verify data access after engine restart w/ WAL replay + w/o check pointing (unsynchronized WAL & VOS)	https://daosio.atlassian.net/browse/DAOS-13013	P2	1. Start 2 DAOS servers with 1 engine per server. 2. Create a single pool and container. 3. Disable checkpointing. 4. Run ior w/ DFS to populate the container with a small amount of data. 5. After ior has completed, shut down every engine cleanly (`dmg system stop`). 6. Restart each engine (`dmg system start`). 7. Verify the previously written data matches with an ior read.	https://daosio.atlassian.net/browse/DAOS-13017	2.4
Verify snapshots after engine restart	https://daosio.atlassian.net/browse/DAOS-13014	P2	1. Start 2 DAOS servers with 1 engine per server. 2. Create a single pool and a container in the pool. 3. Run ior w/ DFS to populate the container with persistent data, then create a snapshot (`daos container create-snap`). Perform this step three times. 4. Verify all three snapshots exist (`daos container list-snaps`). 5. Remove the second snapshot (`daos container destroy-snap`). 6. Verify that two snapshots exist (`daos container list-snaps`). 7. Shut down every engine cleanly (`dmg system stop --force`). 8. Restart each engine (`dmg system start`). 9. Verify all engines have joined (`dmg system query`). 10. Verify that two snapshots exist (`daos container list-snaps`). 11. Remove the two snapshots (`daos container destroy-snap`). 12. Verify that no snapshots exist (`daos container list-snaps`).		2.4
Verify pool & container attributes after engine restart	DAOS-13015: Implement verify pool & container attributes after engine restart test case (Resolved)	P2	1. Start 3 DAOS servers with 1 engine on each server. 2. Create multiple pools and containers. 3. List the current pool and container attributes. 4. Modify at least one different attribute on each pool and container. 5. Shut down every engine cleanly (`dmg system stop`). 6. Restart each engine (`dmg system start`). 7. Verify each modified pool and container attribute is still set.		2.4
Verify the specific metrics to track activity of md_on_ssd.	DAOS-11626: Add tests for new md_on_ssd metrics (Resolved)	P2	test_wal_commit_metrics: 1. Start 2 DAOS servers with 1 engine on each server. 2. Verify the engine_dmabuff_wal_* metrics are 0. 3. Create a pool. 4. Verify the engine_dmabuff_wal_sz metric is greater than 0. 5. Verify the engine_dmabuff_wal_waiters metrics are 0. test_wal_reply_metrics: 1. Start 2 DAOS servers with 1 engine on each server. 2. Verify the engine_pool_vos_rehydration_replay_* metrics are 0. 3. Create a pool. 4. Verify the engine_pool_vos_rehydration_replay_count metric is 1. 5. Verify the engine_pool_vos_rehydration_replay_entries metric is > 0. 6. Verify the engine_pool_vos_rehydration_replay_size metric is > 0. 7. Verify the engine_pool_vos_rehydration_replay_time metric is within 10,000 - 50,000. 8. Verify the engine_pool_vos_rehydration_replay_transactions metric is > 0. test_wal_checkpoint_metrics: 1. Start 2 DAOS servers with 1 engine on each server. 2. Verify the engine_pool_checkpoint_* metrics are 0. 3. Create a pool w/ checkpointing disabled (pool1). 4. Verify the engine_pool_checkpoint_* metrics are 0 for pool1. 5. Create a pool w/ checkpointing enabled (pool2). 6. Verify the engine_pool_checkpoint_* metrics are 0 for pool1. 7. Verify the engine_pool_checkpoint_dirty_chunks metrics are within 0-300 for pool2. 8. Verify the engine_pool_checkpoint_dirty_pages metrics are within 0-3 for pool2. 9. Verify the engine_pool_checkpoint_duration metrics are within 0-300 for pool2. 10. Verify the engine_pool_checkpoint_iovs_copied metrics are > 0 for pool2. 11. Verify the engine_pool_checkpoint_wal_purged metrics are >= 0 for pool2. 12. Create a container for pool2. 13. Use ior to write data to pool2. 14. Wait double the checkpoint frequency to allow checkpointing to complete. 15. Verify the engine_pool_checkpoint_* metrics are 0 for pool1. 16. Verify the engine_pool_checkpoint_dirty_chunks metrics are within 0-300 for pool2. 17. Verify the engine_pool_checkpoint_dirty_pages metrics are within 0-3 for pool2. 18. Verify the engine_pool_checkpoint_duration metrics are within 0-300 for pool2. 19. Verify the engine_pool_checkpoint_iovs_copied metrics are > 0 for pool2. 20. Verify the engine_pool_checkpoint_wal_purged metrics are > 0 for pool2.		2.6
Add running a subset of PR tests with MD on SSD in master PRs	DAOS-13530: Add running a subset of pr tests with MD on SSD in master PRs (Resolved)	P1	Update the existing nvme/fault.py test to expect a stopped server when setting a fault on a device that has "has_sys_xs" set to true. Additional testing is handled by https://daosio.atlassian.net/wiki/spaces/DC/pages/11161927681		2.6
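
Most of the restart cases in the table share the same stop, start, and verify sequence. A minimal Python sketch of that sequence follows. Only the `dmg system stop`, `dmg system start`, and `dmg system query` commands come from the test steps; the helper names (`run_command`, `restart_engines`, `restart_and_verify`) and the injectable `run` callable are hypothetical and are not part of the DAOS functional test framework.

```python
import shlex
import subprocess


def run_command(command):
    """Run a CLI command and return its stdout; raises if it exits non-zero."""
    result = subprocess.run(
        shlex.split(command), capture_output=True, text=True, check=True)
    return result.stdout


def restart_engines(run=run_command, force=False):
    """Stop every engine, start them again, and return the system query output."""
    stop = "dmg system stop --force" if force else "dmg system stop"
    run(stop)
    run("dmg system start")
    return run("dmg system query")


def restart_and_verify(write_data, read_data, run=run_command):
    """Write data, restart the engines, and check the data reads back unchanged.

    write_data and read_data are caller-supplied callables (for example,
    wrappers around an ior write and an ior read); both are assumptions here.
    """
    expected = write_data()
    restart_engines(run)
    actual = read_data()
    if actual != expected:
        raise AssertionError("data read after restart does not match data written")
    return actual
```

Passing `run` as an argument lets the sequence be exercised with a stub in place of a live DAOS system, so the ordering of the commands can be checked without any servers.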

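The range checks listed for test_wal_checkpoint_metrics can be expressed as a small table of bounds. The sketch below uses the pool2 limits stated for the checks made after ior has written data; the `out_of_range` helper and the flat name-to-value dictionary are assumptions for illustration, not the format returned by the DAOS telemetry tooling. It also assumes the metrics are integer counters, so "> 0" is written as a lower bound of 1.

```python
import math

# Inclusive (low, high) bounds for pool2 after ior has written data.
# Assumption: values are integer counters, so "> 0" becomes a lower bound of 1.
CHECKPOINT_RANGES = {
    "engine_pool_checkpoint_dirty_chunks": (0, 300),
    "engine_pool_checkpoint_dirty_pages": (0, 3),
    "engine_pool_checkpoint_duration": (0, 300),
    "engine_pool_checkpoint_iovs_copied": (1, math.inf),
    "engine_pool_checkpoint_wal_purged": (1, math.inf),
}


def out_of_range(metrics, ranges):
    """Return {name: value} for each metric that is missing or outside its range."""
    failures = {}
    for name, (low, high) in ranges.items():
        value = metrics.get(name)
        if value is None or not low <= value <= high:
            failures[name] = value
    return failures


# Hypothetical sample values, not measured results.
sample = {
    "engine_pool_checkpoint_dirty_chunks": 12,
    "engine_pool_checkpoint_dirty_pages": 1,
    "engine_pool_checkpoint_duration": 40,
    "engine_pool_checkpoint_iovs_copied": 5,
    "engine_pool_checkpoint_wal_purged": 2,
}
failures = out_of_range(sample, CHECKPOINT_RANGES)
```

Returning every failing metric at once, instead of stopping at the first, keeps a single test run informative when several bounds are violated together.
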
Reviewed and Approved By

Test Engineer	Phil Henderson
Feature Developer
Date	Initial review on Feb 9, 2023

Feature Design Document

*Link to design document for feature under test

*Functional testing - Full functionality tests that verify all corner cases and can be automated in CI. To be run in the daily or weekly build, where it makes sense.

*Negative testing - Ensures the application can gracefully handle invalid input or unexpected user behavior. To be run in the daily or weekly build, where it makes sense.

*Stress testing - Endurance test for checking resource limitations. To be run on weekly build or manually.