This test plan covers additional tests needed to verify the extra work required upon engine restart. This includes loading the blob from the SSD and re-applying any changes from the WAL.
HW Requirements - Identify the hardware required to execute the tests covered by this plan.
Get a list of device information (dmg storage query list-devices)
Verify each device’s role entry matches the expected value based upon the server storage configuration
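The role-verification step could be automated with a small helper that parses JSON output from `dmg storage query list-devices -j`. This is only a sketch: the field names (`host_storage_map`, `devices`, `roles`) and the sample payload are assumptions, not the tool's guaranteed output shape.

```python
import json

# Stand-in for `dmg storage query list-devices -j` output.
# The key names below are assumptions about the JSON layout.
SAMPLE = json.dumps({
    "host_storage_map": {
        "server1": {"devices": [
            {"uuid": "a1b2", "roles": "data,meta,wal"},
            {"uuid": "c3d4", "roles": "data"},
        ]},
    },
})

def verify_device_roles(output, expected):
    """Return the UUIDs whose 'roles' entry does not match the value
    expected from the server storage configuration."""
    mismatches = []
    for host in json.loads(output)["host_storage_map"].values():
        for dev in host["devices"]:
            if dev["roles"] != expected.get(dev["uuid"]):
                mismatches.append(dev["uuid"])
    return mismatches

print(verify_device_roles(SAMPLE, {"a1b2": "data,meta,wal", "c3d4": "data"}))  # → []
```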
2.4
Verify data access after engine restart w/o WAL replay + w/ checkpointing
(synchronized WAL & VOS)
P2
Start 2 DAOS servers with 1 engine per server
Create a single pool and container
Run ior w/ DFS to populate the container with data
Confirm that all data has been checkpointed
After ior has completed, shut down every engine cleanly (dmg system stop)
Restart each engine (dmg system start)
Verify the previously written data matches with an ior read
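The "confirm that all data has been checkpointed" step could poll a checkpoint metric until it drains. A minimal sketch, where `read_dirty_chunks` is any zero-argument callable the test supplies (e.g. wrapping a telemetry query); the draining sample values are made up:

```python
def wait_until_checkpointed(read_dirty_chunks, attempts=10):
    """Poll until the dirty-chunk count reaches 0, returning True on
    success and False if it never drains within `attempts` polls."""
    for _ in range(attempts):
        if read_dirty_chunks() == 0:
            return True
    return False

# Simulated metric that drains to zero over three polls.
samples = iter([120, 40, 0])
print(wait_until_checkpointed(lambda: next(samples)))  # → True
```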
2.4
Verify data access after engine restart w/ WAL replay + w/o checkpointing (unsynchronized WAL & VOS)
P2
Start 2 DAOS servers with 1 engine per server
Create a single pool and container
Disable checkpointing
Run ior w/ DFS to populate the container with a small amount of data
After ior has completed, shut down every engine cleanly (dmg system stop)
Restart each engine (dmg system start)
Verify the previously written data matches with an ior read
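The final read-verify step compares the data written before the restart against what ior reads back afterward. One way to express that comparison in an automated test is to fingerprint both byte streams; the helper below is illustrative (ior's own read-verify mode would normally perform this check):

```python
import hashlib

def fingerprint(chunks):
    """Hash an iterable of byte chunks so the pre-restart and
    post-restart reads can be compared without holding everything
    in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()

written = fingerprint(b"block-%d" % i for i in range(4))
reread = fingerprint(b"block-%d" % i for i in range(4))
print(written == reread)  # → True
```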
2.4
Verify snapshots after engine restart
P2
Start 2 DAOS servers with 1 engine per server
Create a single pool and a container in the pool
Run ior w/ DFS to populate the container with persistent data followed by creating a snapshot (daos container create-snap). Repeat this three times.
Verify all three snapshots exist (daos container list-snaps)
Remove the second snapshot (daos container destroy-snap)
Verify that two snapshots exist (daos container list-snaps)
Shut down every engine (dmg system stop --force)
Restart each engine (dmg system start)
Verify all engines have joined (dmg system query)
Verify that two snapshots exist (daos container list-snaps)
Remove the two snapshots (daos container destroy-snap)
Verify that no snapshots exist (daos container list-snaps)
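The repeated list-snaps checks above reduce to counting epochs in the command output. A sketch, assuming `daos container list-snaps` prints one hex epoch per line (the sample text is made up and the real formatting may differ):

```python
# Stand-in for `daos container list-snaps` output.
SAMPLE_OUTPUT = """\
Container's snapshots :
0x26762f67aa000001
0x26762f67aa000002
"""

def count_snapshots(output):
    """Count the epoch-like (hex) lines in list-snaps output."""
    return sum(1 for line in output.splitlines()
               if line.strip().startswith("0x"))

print(count_snapshots(SAMPLE_OUTPUT))  # → 2
```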
2.4
Verify pool & container attributes after engine restart
P2
Start 3 DAOS servers with 1 engine on each server
Create multiple pools and containers
List the current pool and container attributes
Modify at least one different attribute on each pool and container
Shut down every engine cleanly (dmg system stop)
Restart each engine (dmg system start)
Verify each modified pool and container attribute is still set
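Verifying that the modified attributes survive the restart amounts to diffing the attribute sets captured before the shutdown against the ones listed afterward. A minimal sketch using hypothetical attribute names:

```python
def attribute_diff(before, after):
    """Return the attribute names whose values changed or vanished
    across the restart."""
    return sorted(name for name, value in before.items()
                  if after.get(name) != value)

# Hypothetical pool/container attributes captured before the restart.
before = {"label": "pool-a", "self_heal": "exclude"}
print(attribute_diff(before, dict(before)))         # → []
print(attribute_diff(before, {"label": "pool-a"}))  # → ['self_heal']
```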
2.4
Verify the specific metrics that track md_on_ssd activity.
P2
test_wal_commit_metrics
Start 2 DAOS servers with 1 engine on each server
Verify the engine_dmabuff_wal_* metrics are 0
Create a pool
Verify the engine_dmabuff_wal_sz metric is greater than 0
Verify the engine_dmabuff_wal_waiters metrics are 0
test_wal_replay_metrics
Start 2 DAOS servers with 1 engine on each server
Verify the engine_pool_vos_rehydration_replay_* metrics are 0
Create a pool
Verify the engine_pool_vos_rehydration_replay_count metric is 1
Verify the engine_pool_vos_rehydration_replay_entries metric is > 0
Verify the engine_pool_vos_rehydration_replay_size metric is > 0
Verify the engine_pool_vos_rehydration_replay_time metric is within 10,000 - 50,000
Verify the engine_pool_vos_rehydration_replay_transactions metric is > 0
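The replay-metric assertions above are all range checks, which a shared helper could express uniformly. The metric names mirror the plan; the sample values and the bounds layout are assumptions for illustration:

```python
def out_of_range(metrics, expectations):
    """Return the metric names whose values fall outside their
    inclusive (low, high) bounds; missing metrics also fail."""
    return sorted(name for name, (low, high) in expectations.items()
                  if not low <= metrics.get(name, low - 1) <= high)

metrics = {
    "engine_pool_vos_rehydration_replay_count": 1,
    "engine_pool_vos_rehydration_replay_time": 23000,
}
expectations = {
    "engine_pool_vos_rehydration_replay_count": (1, 1),
    "engine_pool_vos_rehydration_replay_time": (10000, 50000),
}
print(out_of_range(metrics, expectations))  # → []
```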
test_wal_checkpoint_metrics
Start 2 DAOS servers with 1 engine on each server
Verify the engine_pool_checkpoint_* metrics are 0
Create a pool w/ checkpointing disabled (pool1)
Verify the engine_pool_checkpoint_* metrics are 0 for pool1
Create a pool w/ checkpointing enabled (pool2)
Verify the engine_pool_checkpoint_* metrics are 0 for pool1
Verify the engine_pool_checkpoint_dirty_chunks metrics are within 0-300 for pool2
Verify the engine_pool_checkpoint_dirty_pages metrics are within 0-3 for pool2
Verify the engine_pool_checkpoint_duration metrics are within 0-300 for pool2
Verify the engine_pool_checkpoint_iovs_copied metrics are > 0 for pool2
Verify the engine_pool_checkpoint_wal_purged metrics are >= 0 for pool2
Create a container for pool2
Use ior to write data to pool2
Wait twice the checkpoint frequency to allow checkpointing to complete
Verify the engine_pool_checkpoint_* metrics are 0 for pool1
Verify the engine_pool_checkpoint_dirty_chunks metrics are within 0-300 for pool2
Verify the engine_pool_checkpoint_dirty_pages metrics are within 0-3 for pool2
Verify the engine_pool_checkpoint_duration metrics are within 0-300 for pool2
Verify the engine_pool_checkpoint_iovs_copied metrics are > 0 for pool2
Verify the engine_pool_checkpoint_wal_purged metrics are > 0 for pool2
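The pool1/pool2 contrast above (checkpointing disabled vs. enabled) can be asserted by summing the checkpoint samples seen for each pool. The nested-dict layout is an assumption for illustration, not the telemetry wire format:

```python
def checkpoint_activity(per_pool_metrics, pool):
    """Sum all engine_pool_checkpoint_* samples recorded for one pool;
    a pool with checkpointing disabled should stay at 0."""
    return sum(value for name, value in per_pool_metrics.get(pool, {}).items()
               if name.startswith("engine_pool_checkpoint_"))

metrics = {
    "pool1": {"engine_pool_checkpoint_dirty_chunks": 0},
    "pool2": {"engine_pool_checkpoint_dirty_chunks": 12,
              "engine_pool_checkpoint_iovs_copied": 4},
}
print(checkpoint_activity(metrics, "pool1"))  # → 0
print(checkpoint_activity(metrics, "pool2"))  # → 16
```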
2.6
Add running a subset of PR tests with MD on SSD in master PRs
P1
Update the existing nvme/fault.py test to expect a stopped server when setting a fault on a device that has "has_sys_xs" set to true.
Additional testing handled by
2.6
Reviewed and Approved By
Test Engineer
Phil Henderson
Feature Developer
Date
Initial review on Feb 9, 2023
Feature Design Document
*Link to design document for feature under test
*Functional testing - Full functionality tests, verify all corner cases, can be automated in CI. To be run either in daily/weekly build, where it makes sense.
*Negative testing - Ensures application can gracefully handle invalid input or unexpected user behavior. To be run either in daily/weekly build, where it makes sense.
*Stress testing - Endurance test for checking resource limitations. To be run on weekly build or manually.