This is currently a work in progress. See https://github.com/daos-stack/daos_scaled_testing/blob/master/database/README.md for general usage.

Running the Validation Suite

TODO

Description of test cases

Frontera Test Plan

Validation Suite Workflow

These instructions cover running the Frontera performance suite for validation of Test Builds, Release Candidates, and performance-sensitive features.

Info

Some details are my personal organizational style and can be tweaked as desired - Dalton Bohning

Clone Test Scripts

Using a fresh clone of the test scripts keeps the configuration isolated from other ongoing work. For example, clone into a directory named after the ticket number.

Code Block
cd $WORK/TESTS
git clone git@github.com:daos-stack/daos_scaled_testing.git daos-xxxx
cd daos-xxxx/frontera

Build DAOS

Using a separate build directory makes it easy to reference the build later - perhaps several versions later.

Code Block
vim run_build.sh
# BUILD_DIR="${WORK}/BUILDS/v2.2.0-rc1"
# DAOS_BRANCH="release/2.2"
# DAOS_COMMIT="v2.2.0-rc1"

Running the build on a compute node is quicker, but not required.

Code Block
idev
# Wait for session
./run_build.sh
# Get a cup of coffee

Common Reasons for Build Failures

TODO

Run Sanity Tests

Before executing hundreds of test cases, make sure the test scripts and DAOS are behaving as expected. For example, sometimes a simple daos or dmg interface change can cause every test to fail.
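
As a quicker first check, you can verify that the daos and dmg command-line tools from the build respond at all. This is only a sketch - the install/bin layout under the build directory is an assumption and may differ for your build.

Code Block
# Paths assume a standard install/bin layout under the build directory - adjust as needed
"${WORK}/BUILDS/v2.2.0-rc1/latest/daos/install/bin/daos" version
"${WORK}/BUILDS/v2.2.0-rc1/latest/daos/install/bin/dmg" version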

Code Block
vim run_testlist.py
# env['JOBNAME']     = "daos-xxxx-sanity"
# env['DAOS_DIR']    = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR']     = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1_sanity"))
./run_testlist.py -r tests/sanity/

After the tests complete, sanity-check the results. Every test should report Pass.

Code Block
./get_results.py --tests all "${WORK}/RESULTS/v2.2.0-rc1_sanity"
cat "${WORK}/RESULTS/v2.2.0-rc1_sanity/*.csv"

Optionally, you could look through one or more logs to check for unexpected warnings or errors.
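
For instance, a quick recursive grep over the results directory can surface obvious problems. This is only a sketch - the exact log layout depends on the test scripts.

Code Block
# List files containing "error" or "fatal" (case-insensitive); adjust patterns as needed
grep -rilE "error|fatal" "${WORK}/RESULTS/v2.2.0-rc1_sanity" | head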

Run Validation Suite

Info

Frontera only allows queuing 100 jobs per user, so tests will need to be executed in batches.
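
Before submitting another batch, check how many jobs you already have queued or running:

Code Block
showq -u   # lists your jobs; keep the total under the 100-job limit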

Info

It’s not strictly required to run the tests serially, but this seems to reduce the variance and interference between jobs. Because of how the priority queue works, running serially doesn’t necessarily take longer than running in parallel - especially since fewer jobs need to be re-run in case of variance/noise/interference.

Info

The --dryrun option to run_testlist.py helps make sure the expected variants will be run.
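
For example, to preview the variants for a group without submitting any jobs (assuming --dryrun can be combined with the usual arguments):

Code Block
./run_testlist.py --dryrun tests/basic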

Basic Tests

There are currently 48 variants in this group.

Code Block
vim run_testlist.py
# env['JOBNAME']     = "daos-xxxx"
# env['DAOS_DIR']    = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR']     = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1"))
./run_testlist.py -r --serial tests/basic

EC Tests

Use showq -u to get the SLURM_JOB_ID of the last job in the queue, if any.
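
For example, list your queued jobs and note the ID of the last one:

Code Block
showq -u
# Note the SLURM job ID of the last queued job and pass it to --slurm_dep_afterany below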

There are currently 42 EC IOR variants.

Info

There are also rf0 variants, but those are not usually run. The --filter option runs only the EC variants.

Code Block
./run_testlist.py --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> --filter "oclass=EC_16P2G1 oclass=EC_16P2GX oclass=EC_8P2G1 oclass=EC_8P2GX oclass=EC_4P2G1 oclass=EC_4P2GX oclass=EC_2P1G1 oclass=EC_2P1GX" tests/ec_vs_rf0_complex/ior*

There are currently 42 EC MDTest variants.

Code Block
./run_testlist.py --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> --filter "oclass=EC_16P2G1 oclass=EC_16P2GX oclass=EC_8P2G1 oclass=EC_8P2GX oclass=EC_4P2G1 oclass=EC_4P2GX oclass=EC_2P1G1 oclass=EC_2P1GX" tests/ec_vs_rf0_complex/mdtest*

Rebuild Tests

There is currently 1 variant in this group.

It’s helpful, but not necessary, to use a different results directory.

Code Block
vim run_testlist.py
# env['JOBNAME']     = "daos-xxxx-rebuild"
# env['DAOS_DIR']    = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR']     = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1_rebuild"))
./run_testlist.py -r --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> tests/rebuild/load_ec.py

Max Scale Tests

There is currently 1 variant in this group.

It’s helpful, but not necessary, to use a different results directory.

Code Block
vim run_testlist.py
# env['JOBNAME']     = "daos-xxxx-max"
# env['DAOS_DIR']    = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR']     = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1_max"))
./run_testlist.py -r --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> tests/max/max.py

TCP Tests

There is currently 1 variant in this group.

It’s helpful, but not necessary, to use a different results directory.

Note

Don’t forget to change daos_server.yml and env_daos back to Verbs after running with TCP!

Code Block
vim daos_server.yml
# provider: ofi+verbs;ofi_rxm           # <-- Comment out or remove this
# provider: ofi+tcp;ofi_rxm             # <-- Uncomment/add this
#  - FI_OFI_RXM_DEF_TCP_WAIT_OBJ=pollfd # <-- Uncomment/add this

vim env_daos
# PROVIDER="${2:-tcp}"

Code Block
vim run_testlist.py
# env['JOBNAME']     = "daos-xxxx-tcp"
# env['DAOS_DIR']    = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR']     = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1_tcp"))
./run_testlist.py -r --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> --filter "daos_servers=2,daos_clients=8" tests/basic/mdtest_easy.py
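
Once the TCP runs are done, revert the provider settings per the note above. This is only a sketch - restore whatever the original values were in your checkout (verbs is assumed to be the original default in env_daos).

Code Block
vim daos_server.yml
# provider: ofi+verbs;ofi_rxm           # <-- Uncomment/restore this
# provider: ofi+tcp;ofi_rxm             # <-- Comment out or remove this
#  - FI_OFI_RXM_DEF_TCP_WAIT_OBJ=pollfd # <-- Comment out or remove this

vim env_daos
# PROVIDER="${2:-verbs}"                # <-- Assumed original value; restore yours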

Gather Results

TODO

Insert to Database

TODO

Generate Reports

TODO

Analyze

TODO