Metadata on SSD Phase 1 Test Plan

Introduction - For an overview of Metadata on SSD, see "Metadata on SSDs" in the DAOS Community Confluence space (atlassian.net).

This test plan covers additional tests needed to verify the extra work required upon engine restart. This includes loading the blob from the SSD and re-applying any changes from the WAL.

HW Requirements - identify the hardware that will be required to execute the tests covered by the plan.

SW Components - DAOS, Python, Avocado, IOR, Dfuse, POSIX environment

Opens/Limitations - list any unresolved issues or limitations of the testing.



Test Description

JIRA

Priority

Test Steps

Resource

Release

Verify data access after engine restart w/ WAL replay + w/ check pointing
(unsynchronized WAL & VOS)

 

DAOS-13009: Implement verify data access after engine restart w/ WAL replay + w/ check pointing test case (Resolved)

P1

  1. Start 2 DAOS servers with 1 engine per server

  2. Create a single pool and container

  3. Run ior w/ DFS to populate the container with data

  4. After ior has completed, shut down every engine cleanly (dmg system stop)

  5. Restart each engine (dmg system start)

  6. Verify the previously written data matches with an ior read



2.4
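The stop/start sequence in the steps above could be driven from Python roughly as follows. This is a minimal sketch, not the project's test code: it assumes `dmg` is on the PATH, and the names `restart_cycle`, `populate`, and `verify` are hypothetical. The ior write and read are passed in as callables because the exact ior options depend on the cluster setup.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(cmd), check=True)


def restart_cycle(populate: Callable[[], None],
                  verify: Callable[[], None],
                  run: Runner = run_checked) -> List[str]:
    """Populate data, stop and start all engines, then verify the data.

    populate/verify are caller-supplied (hypothetical) wrappers around the
    ior write and ior read. Returns the ordered list of actions performed.
    """
    log: List[str] = []
    populate()                                # step 3: ior write w/ DFS
    log.append("populate")
    for cmd in (["dmg", "system", "stop"],    # step 4
                ["dmg", "system", "start"]):  # step 5
        run(cmd)
        log.append(" ".join(cmd))
    verify()                                  # step 6: ior read and compare
    log.append("verify")
    return log
```

Passing a recording function as `run` gives a dry run that checks the ordering without touching a cluster.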

Verify POSIX data access after engine restart

(check modification timestamp?)

DAOS-13010: Implement verify POSIX data access after engine restart test case (Resolved)

P1

  1. Start 2 DAOS servers with 1 engine per server

  2. Create a single pool and a POSIX container

  3. Start dfuse

  4. Write data to the dfuse mount point and then read it back

  5. After the read has completed, unmount dfuse

  6. Shut down every engine cleanly (dmg system stop)

  7. Restart each engine (dmg system start)

  8. Remount dfuse

  9. Verify the previously written data exists

  10. Verify more data can be written

 

2.4
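The "data still exists" check in the POSIX steps above can be expressed with ordinary file operations, since dfuse exposes the container as a directory. The sketch below records a SHA-256 digest at write time and compares it after the remount; the helper names are hypothetical, and any directory can stand in for the mount point. The modification-time comparison is optional, reflecting the open question noted in the test description.

```python
import hashlib
import tempfile
from pathlib import Path


def write_file(mount: Path, name: str, data: bytes) -> dict:
    """Write data under the mount point and record its digest and mtime."""
    path = mount / name
    path.write_bytes(data)
    return {
        "name": name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "mtime_ns": path.stat().st_mtime_ns,
    }


def verify_file(mount: Path, record: dict, check_mtime: bool = False) -> bool:
    """Return True if the file exists and still matches the recorded state."""
    path = mount / record["name"]
    if not path.is_file():
        return False
    if hashlib.sha256(path.read_bytes()).hexdigest() != record["sha256"]:
        return False
    if check_mtime and path.stat().st_mtime_ns != record["mtime_ns"]:
        return False
    return True
```

In a test, `write_file` would run before the unmount and engine restart, and `verify_file` after the remount; writing a second file afterwards covers the final step.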

Verify device roles in dmg storage query output

DAOS-13011: Implement verify device roles in dmg storage query output test case (Resolved)

P1

  1. Start 1 DAOS server with 1 engine per server

  2. Get a list of device information (dmg storage query list-devices)

  3. Verify each device’s role entry matches the expected value based upon the server storage configuration

 

2.4
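The role comparison in step 3 above amounts to matching two mappings. The sketch below assumes the `dmg storage query list-devices` output has already been parsed into one dict per device; the field names `addr` and `roles`, the PCI addresses, and the role strings used in the usage example are illustrative assumptions, not the actual output schema.

```python
from typing import Dict, Iterable, List, Set


def role_mismatches(devices: Iterable[dict],
                    expected: Dict[str, Set[str]]) -> List[str]:
    """Compare reported device roles against the configured ones.

    devices:  parsed query records, each with hypothetical keys
              'addr' (device address) and 'roles' (list of role names).
    expected: address -> set of roles derived from the server storage config.
    Returns a list of human-readable mismatch descriptions (empty if OK).
    """
    errors: List[str] = []
    seen: Set[str] = set()
    for dev in devices:
        addr = dev["addr"]
        seen.add(addr)
        roles = set(dev.get("roles", []))
        want = expected.get(addr)
        if want is None:
            errors.append(f"{addr}: not in server configuration")
        elif roles != want:
            errors.append(
                f"{addr}: roles {sorted(roles)} != expected {sorted(want)}")
    # Devices that were configured but never reported
    for addr in sorted(set(expected) - seen):
        errors.append(f"{addr}: missing from query output")
    return errors
```

For example, `role_mismatches([{"addr": "0000:81:00.0", "roles": ["wal", "meta"]}], {"0000:81:00.0": {"meta", "wal"}})` returns an empty list because role order is ignored.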

Verify data access after engine restart w/o WAL replay + w/ check pointing

(synchronized WAL & VOS)

DAOS-13012: Implement verify data access after engine restart w/o WAL replay + w/ check pointing test case (Resolved)

P2

  1. Start 2 DAOS servers with 1 engine per server

  2. Create a single pool and container

  3. Run ior w/ DFS to populate the container with data

  4. Confirm that all data has been checkpointed

  5. After ior has completed, shut down every engine cleanly (dmg system stop)

  6. Restart each engine (dmg system start)

  7. Verify the previously written data matches with an ior read

DAOS-13016: Add new mechanism to verify data has been checkpointed (Resolved)

2.4
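This plan does not describe how "all data has been checkpointed" is detected (DAOS-13016 tracks that mechanism), so the sketch below only shows the surrounding wait loop. The `predicate` argument is a placeholder for whatever check the mechanism provides; `wait_until` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float,
               interval: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll predicate until it returns True or timeout seconds elapse.

    The predicate is always evaluated at least once. Returns True if it
    succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A test would call this between the ior write and the engine shutdown, and fail if it returns False. The `clock` and `sleep` parameters exist only so the loop can be exercised without real delays.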

Verify data access after engine restart w/ WAL replay + w/o check pointing
(unsynchronized WAL & VOS)

DAOS-13013: Implement verify data access after engine restart w/ WAL replay + w/o check pointing test case (Resolved)

P2

  1. Start 2 DAOS servers with 1 engine per server

  2. Create a single pool and container

  3. Disable checkpointing

  4. Run ior w/ DFS to populate the container with a small amount of data

  5. After ior has completed, shut down every engine cleanly (dmg system stop)

  6. Restart each engine (dmg system start)

  7. Verify the previously written data matches with an ior read

DAOS-13017: Add a new mechanism for disabling checkpointing (Resolved)

2.4

Verify snapshots after engine restart

DAOS-13014: Implement verify snapshots after engine restart test case (Resolved)

P2

  1. Start 2 DAOS servers with 1 engine per server

  2. Create a single pool and container in the pool

  3. Run ior w/ DFS to populate the container with persistent data followed by creating a snapshot (daos container create-snap). Repeat this three times.

  4. Verify all three snapshots exist (daos container list-snaps)

  5. Remove the second snapshot (daos container destroy-snap)

  6. Verify that two snapshots exist (daos container list-snaps)

  7. Shut down every engine cleanly (dmg system stop --force)

  8. Restart each engine (dmg system start)

  9. Verify all engines have joined (dmg system query)

  10. Verify that two snapshots exist (daos container list-snaps)

  11. Remove the two snapshots (daos container destroy-snap)

  12. Verify that no snapshots exist (daos container list-snaps)

 

2.4
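The expected snapshot counts in the steps above (3, 2, 2, 0) can be checked with a small flow function. The sketch below is illustrative only: `cont` is any object offering `create_snap`, `list_snaps`, and `destroy_snap` methods (hypothetical wrappers around the corresponding `daos container` commands), and `FakeContainer` is an in-memory stand-in used to dry-run the flow.

```python
from typing import Callable, List


class FakeContainer:
    """In-memory stand-in for a container; used only to dry-run the flow."""

    def __init__(self) -> None:
        self._next_epoch = 1
        self._snaps: List[int] = []

    def create_snap(self) -> int:
        epoch = self._next_epoch
        self._next_epoch += 1
        self._snaps.append(epoch)
        return epoch

    def list_snaps(self) -> List[int]:
        return list(self._snaps)

    def destroy_snap(self, epoch: int) -> None:
        self._snaps.remove(epoch)


def snapshot_flow(cont,
                  write_data: Callable[[], None],
                  restart: Callable[[], None]) -> List[int]:
    """Run the snapshot steps and return the snapshot count at each check."""
    counts: List[int] = []
    epochs: List[int] = []
    for _ in range(3):                      # step 3: write, then snapshot
        write_data()
        epochs.append(cont.create_snap())
    counts.append(len(cont.list_snaps()))   # step 4: expect 3
    cont.destroy_snap(epochs[1])            # step 5: remove the second one
    counts.append(len(cont.list_snaps()))   # step 6: expect 2
    restart()                               # steps 7-9: stop, start, query
    remaining = cont.list_snaps()
    counts.append(len(remaining))           # step 10: expect 2
    for epoch in remaining:                 # step 11
        cont.destroy_snap(epoch)
    counts.append(len(cont.list_snaps()))   # step 12: expect 0
    return counts
```

A real test would assert that the returned list equals `[3, 2, 2, 0]`.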

Verify pool & container attributes after engine restart

https://daosio.atlassian.net/browse/DAOS-13015

P2

  1. Start 3 DAOS servers with 1 engine on each server

  2. Create multiple pools and containers

  3. List the current pool and container attributes

  4. Modify at least one different attribute on each pool and container

  5. Shut down every engine cleanly (dmg system stop)

  6. Restart each engine (dmg system start)

  7. Verify each modified pool and container attribute is still set

 

2.4
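The final verification step above reduces to comparing the attributes set before the restart with those read back afterwards. A minimal sketch, assuming the attributes of each pool or container have already been collected into plain dicts (how they are listed is outside this example, and `attribute_diff` is a hypothetical helper name):

```python
from typing import Any, Dict, Tuple

_MISSING = object()  # sentinel so a missing attribute differs from a None value


def attribute_diff(expected: Dict[str, Any],
                   actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (expected, actual)} for attributes that were lost or changed.

    Attributes present in actual but not in expected are ignored, since the
    test only cares that each modified attribute is still set.
    """
    diff: Dict[str, Tuple[Any, Any]] = {}
    for name, want in expected.items():
        got = actual.get(name, _MISSING)
        if got is _MISSING:
            diff[name] = (want, None)
        elif got != want:
            diff[name] = (want, got)
    return diff
```

Running this once per pool and per container, and asserting that every result is empty, covers the case where different attributes were modified on each object.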

Verify the specific metrics to track activity of md_on_ssd.

DAOS-11626: Add tests for new md_on_ssd metrics (Resolved)

P2

test_wal_commit_metrics

  1. Start 2 DAOS servers with 1 engine on each server

  2. Verify the engine_dmabuff_wal_* metrics are 0

  3. Create a pool

  4. Verify the engine_dmabuff_wal_sz metric is greater than 0

  5. Verify the engine_dmabuff_wal_waiters metrics are 0

test_wal_reply_metrics

  1. Start 2 DAOS servers with 1 engine on each server

  2. Verify the engine_pool_vos_rehydration_replay_* metrics are 0

  3. Create a pool

  4. Verify the engine_pool_vos_rehydration_replay_count metric is 1

  5. Verify the engine_pool_vos_rehydration_replay_entries metric is > 0

  6. Verify the engine_pool_vos_rehydration_replay_size metric is > 0

  7. Verify the engine_pool_vos_rehydration_replay_time metric is within 10,000 - 50,000

  8. Verify the engine_pool_vos_rehydration_replay_transactions metric is > 0

test_wal_checkpoint_metrics

  1. Start 2 DAOS servers with 1 engine on each server

  2. Verify the engine_pool_checkpoint_* metrics are 0

  3. Create a pool w/ check pointing disabled (pool1)

  4. Verify the engine_pool_checkpoint_* metrics are 0 for pool1

  5. Create a pool w/ check pointing enabled (pool2)

  6. Verify the engine_pool_checkpoint_* metrics are 0 for pool1

  7. Verify the engine_pool_checkpoint_dirty_chunks metrics are within 0-300 for pool2

  8. Verify the engine_pool_checkpoint_dirty_pages metrics are within 0-3 for pool2

  9. Verify the engine_pool_checkpoint_duration metrics are within 0-300 for pool2

  10. Verify the engine_pool_checkpoint_iovs_copied metrics are > 0 for pool2

  11. Verify the engine_pool_checkpoint_wal_purged metrics are >= 0 for pool2

  12. Create a container for pool2

  13. Use ior to write data to pool2

  14. Wait for twice the checkpoint frequency interval to allow checkpointing to complete

  15. Verify the engine_pool_checkpoint_* metrics are 0 for pool1

  16. Verify the engine_pool_checkpoint_dirty_chunks metrics are within 0-300 for pool2

  17. Verify the engine_pool_checkpoint_dirty_pages metrics are within 0-3 for pool2

  18. Verify the engine_pool_checkpoint_duration metrics are within 0-300 for pool2

  19. Verify the engine_pool_checkpoint_iovs_copied metrics are > 0 for pool2

  20. Verify the engine_pool_checkpoint_wal_purged metrics are > 0 for pool2

 

2.6
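The range checks in test_wal_checkpoint_metrics above can be kept in one table so the expectations are easy to review. The sketch below encodes the post-ior expectations for pool2 (steps 16-20) and the "all zero" wildcard check for pool1. It assumes the metric values have already been collected into a name-to-value dict per pool, however they are gathered; the helper names are hypothetical.

```python
from fnmatch import fnmatch
from typing import Callable, Dict, List

Check = Callable[[float], bool]

# Expectations for pool2 after ior, taken from steps 16-20 above.
POOL2_AFTER_IOR: Dict[str, Check] = {
    "engine_pool_checkpoint_dirty_chunks": lambda v: 0 <= v <= 300,
    "engine_pool_checkpoint_dirty_pages": lambda v: 0 <= v <= 3,
    "engine_pool_checkpoint_duration": lambda v: 0 <= v <= 300,
    "engine_pool_checkpoint_iovs_copied": lambda v: v > 0,
    "engine_pool_checkpoint_wal_purged": lambda v: v > 0,
}


def failed_metrics(values: Dict[str, float],
                   checks: Dict[str, Check]) -> List[str]:
    """Return the sorted names of metrics that are missing or out of range."""
    return sorted(
        name for name, ok in checks.items()
        if name not in values or not ok(values[name])
    )


def nonzero_matching(values: Dict[str, float], pattern: str) -> List[str]:
    """Return the sorted names matching a wildcard pattern whose value is not 0."""
    return sorted(
        name for name, value in values.items()
        if fnmatch(name, pattern) and value != 0
    )
```

For pool1, `nonzero_matching(pool1_values, "engine_pool_checkpoint_*")` should be empty at every step; for pool2, `failed_metrics(pool2_values, POOL2_AFTER_IOR)` should be empty after the wait in step 14.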

Run a subset of PR tests with MD on SSD in master PRs

DAOS-13530: Add running a subset of pr tests with MD on SSD in master PRs (Resolved)

P1

Update the existing nvme/fault.py test to expect a stopped server when setting a device fault that has "has_sys_xs" set to true.

Additional testing handled by https://daosio.atlassian.net/wiki/spaces/DC/pages/11161927681

 

2.6

 

 



Reviewed and Approved By

 


Test Engineer

Phil Henderson

Feature Developer

 

Date

Initial review on Feb 9, 2023

 

Feature Design Document

*Link to design document for feature under test

 

*Functional testing - Full functionality tests, verify all corner cases, can be automated in CI. To be run either in daily/weekly build, where it makes sense.

*Negative testing - Ensures application can gracefully handle invalid input or unexpected user behavior. To be run either in daily/weekly build, where it makes sense.

*Stress testing - Endurance test for checking resource limitations. To be run on weekly build or manually.