...

Good Conduct

Before doing any work on Frontera, you should read and understand Good Conduct on Frontera.

You should also be aware of the limited credits for running jobs. After logging in, you can run:

...

Code Block

language	bash

cd ${WORK}/TESTS/daos_scaled_testing/frontera
python3 -m pip install --upgrade pip
python3 -m pip install --user -r python3_requirements.txt

Build MPI packages - Optional - Not Recommended

Expand

By default, the system-installed MVAPICH2 is used, and this is the recommended setup. If you want to use MPICH or OpenMPI, they must be built and installed from scratch.

Since we only build with a single core on login nodes (remember Citizenship on Frontera), this may take a while to complete.

Code Block

language	bash

cd ${WORK}/TESTS/daos_scaled_testing/frontera
./build_and_install_tools.sh

This script is not well maintained and may need adjustment.

Build DAOS

Edit run_build.sh:

...

Code Block

language	bash

idev
# wait
cd ${WORK}/TESTS/daos_scaled_testing/frontera
./run_build.sh

Running Tests

...

This is currently a work in progress. See https://github.com/daos-stack/daos_scaled_testing/blob/master/database/README.md for general usage.

Description of test cases

/wiki/spaces/DC/pages/2183563249

Validation Suite Workflow

These instructions cover running the Frontera performance suite for validation of Test Builds, Release Candidates, and performance-sensitive features.

Info
Some details are my personal organizational style and can be tweaked as desired - Dalton Bohning

Clone Test Scripts

Using a fresh clone of the test scripts keeps the configuration isolated from other ongoing work. For example, clone into a directory named after the ticket number.

Code Block

language	bash

cd $WORK/TESTS
git clone git@github.com:daos-stack/daos_scaled_testing.git daos-xxxx
cd daos-xxxx/frontera

Build DAOS

Info
Using a separate build directory for each version makes it easy to reference the build later - perhaps several versions later.

Info
Running on a compute node is much quicker, but not required.

Code Block

language	bash

idev
# Wait for session
./run_build.sh --build-dir "${WORK}/BUILDS/v2.2.0-rc1" \
               --daos-branch "release/2.2" \
               --daos-commit "v2.2.0-rc1"
# Get a cup of coffee

Common Reasons for Build Failures

Merge conflict in control

DAOS requires sudo/root permission, but we do not have sudo/root on Frontera. In run_build.sh:merge_extra_daos_branches(), we merge a “hack” branch for DAOS to compile and run on Frontera. Using the latest branch as a base, we need to create a new branch, rebase on the conflicting commit in DAOS, resolve conflicts, push the branch, and update run_build.sh.

For example, the branch dbohning/io500-base-e2a10d7 is based on master commit e2a10d7, and has 5 “hacks”. Coordinate with Dalton Bohning as necessary if a new conflict arises.

Run Sanity Tests

Before executing hundreds of test cases, make sure the test scripts and DAOS are behaving as expected. For example, sometimes a simple daos or dmg interface change can cause every test to fail.

Code Block

language	bash

./run_testlist.py --jobname "daos-sanity" \
                  --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
                  --res_dir "${WORK}/RESULTS/v2.2.0-rc1_sanity" \
                  -r \
                  tests/sanity/

After the tests complete, sanity check the results. All tests should contain Pass.

Code Block

language	bash

./get_results.py --tests all "${WORK}/RESULTS/v2.2.0-rc1_sanity"
# Keep the glob outside the quotes so the shell expands it
cat "${WORK}/RESULTS/v2.2.0-rc1_sanity/"*.csv

Optionally, you could look through one or more logs to check for unexpected warnings or errors.
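
The "all tests should contain Pass" check can also be scripted. The following is a minimal sketch, not part of the test scripts: list_non_pass is a hypothetical helper, and it assumes each CSV has exactly one header line and that a passing test has the literal string Pass somewhere in its row. Adjust it if the CSV layout differs.

```bash
# Hypothetical helper: print every data row that does not contain "Pass".
# Assumption: one header line per CSV, and passing rows contain "Pass".
list_non_pass() {
    local csv
    for csv in "$@"; do
        # FNR > 1 skips the header line; NF skips empty lines.
        awk -v f="${csv}" 'FNR > 1 && NF && !/Pass/ { print f ": " $0 }' "${csv}"
    done
}

# Example (not run here):
# list_non_pass "${WORK}/RESULTS/v2.2.0-rc1_sanity/"*.csv
```

No output means that every data row contained Pass.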

Run Validation Suite

...

TODO

Description of test cases

...

Info
It is recommended to run the tests serially with --serial in order to reduce the variance and interference between jobs. Running in parallel tends to cause more interference between larger jobs, which then requires re-running those jobs. And because Frontera uses a priority queue to schedule jobs, the overall time-to-result is about the same when run serially.
Info
The --dryrun option to run_testlist.py will list the test variants to be run, but not actually run them.
Info
The --jobname, --daos_dir, and --res_dir options to run_testlist.py can also be edited in the script itself instead of specifying on the command line.

RF0 and EC Tests

There are currently 48 RF0 variants, 42 EC IOR variants, and 42 EC MDTest variants.

Code Block

language	bash

./run_testlist.py --jobname "daos-xxxx" \
                  --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
                  --res_dir "${WORK}/RESULTS/v2.2.0-rc1" \
                  -r --serial \
                  tests/basic tests/ec_vs_rf0_complex

Rebuild Tests

There is currently 1 variant in this group.

It’s helpful to use a different results directory from the previous tests since these results aren’t stored in the database.

Code Block

language	bash

./run_testlist.py --jobname "daos-xxxx-rebuild" \
                  --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
                  --res_dir "${WORK}/RESULTS/v2.2.0-rc1_rebuild" \
                  -r --serial \
                  --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
                  tests/rebuild/load_ec.py
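
One way to find a value for <LAST_SLURM_JOB_ID> is to pick the highest job ID that Slurm lists for your user. This is a rough sketch: last_job_id is a hypothetical helper, and it assumes plain numeric job IDs (no array suffixes) and that the most recently submitted job has the highest ID.

```bash
# Hypothetical helper: read job IDs (one per line) on stdin and
# print the numerically largest one.
last_job_id() {
    sort -n | tail -n 1
}

# Example (not run here): list only the job IDs of your own jobs,
# without a header, and keep the largest.
# squeue -u "${USER}" -h -o "%i" | last_job_id
```

Double check the printed ID against the squeue output before passing it to --slurm_dep_afterany.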

Max Scale Tests

There is currently 1 variant in this group.

It’s helpful to use a different results directory from the previous tests since these results aren’t stored in the database.

Code Block

language	bash

./run_testlist.py --jobname "daos-xxxx-max" \
                  --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
                  --res_dir "${WORK}/RESULTS/v2.2.0-rc1_max" \
                  -r --serial \
                  --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
                  tests/max/max.py

TCP Tests

There is currently 1 variant in this group.

It’s helpful to use a different results directory from the previous tests since these results aren’t stored in the database.

Note
Don’t forget to change daos_server.yml and env_daos back to Verbs after running with TCP!

Code Block

language	bash

vim daos_server.yml
# provider: ofi+verbs;ofi_rxm           # <-- Comment/remove this
# provider: ofi+tcp;ofi_rxm             # <-- Uncomment/add this
#  - FI_OFI_RXM_DEF_TCP_WAIT_OBJ=pollfd # <-- Uncomment/add this

vim env_daos
# PROVIDER="${2:-tcp}"

Code Block

language	bash

./run_testlist.py --jobname "daos-xxxx-tcp" \
                  --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
                  --res_dir "${WORK}/RESULTS/v2.2.0-rc1_tcp" \
                  -r --serial \
                  --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
                  --filter "daos_servers=2,daos_clients=8" \
                  tests/basic/mdtest_easy.py
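
To help with the reminder about switching back to Verbs, a quick check of which provider line is currently active might look like the sketch below. active_provider is a hypothetical helper; it assumes the provider setting sits on its own line and that disabled alternatives are commented out with #, as in the snippet above. It does not check env_daos.

```bash
# Hypothetical helper: print the value of each uncommented "provider:" line.
# Lines starting with "#" do not match and are ignored.
active_provider() {
    sed -n 's/^[[:space:]]*provider:[[:space:]]*//p' "$1"
}

# Example (not run here):
# active_provider daos_server.yml
```

If the output still mentions tcp after the TCP run, the file has not been switched back yet.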

Gather Results

The results can be gathered and the CSVs optionally emailed to avoid SCP:

Code Block
./get_results.py ${WORK}/RESULTS/v2.2.0-rc1 --email <YOUR_EMAIL>

It’s not necessary to email the other variants, since they aren’t inserted into the database.

Code Block
./get_results.py ${WORK}/RESULTS/v2.2.0-rc1_rebuild
./get_results.py ${WORK}/RESULTS/v2.2.0-rc1_max
./get_results.py ${WORK}/RESULTS/v2.2.0-rc1_tcp

Common Reasons for Test Failures

clock drift is too high

The test framework sanity checks that the nodes are in-sync. This is usually transient, and re-running the job is the easiest solution. If this occurs very frequently, TACC might need to be notified.

out of memory

Frontera compute nodes only have ~85G RAM each. Most tests run with at most 60 seconds of write. Even with dedup and a special IOR patch, it’s possible to run out of space. If reducing the ior/mdtest stonewall time helps, then it’s probably just a space constraint.

Reporting

Info
The database is hosted on and only accessible from wolf-17vm4. Eventually, we need a more robust solution.

Navigate to Scripts

Code Block

language	bash

cd /home/dbohning/frontera/daos_scaled_testing/database

Insert to Database

From daos_scaled_testing/database:

Code Block
./db.py import results_ior raw/frontera_performance/results_ior/ior_v2.2.0-rc1.csv
./db.py import results_mdtest raw/frontera_performance/results_mdtest/mdtest_v2.2.0-rc1.csv

You should see:

Code Block
66 rows inserted
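
To cross-check the reported number, you could count the data rows in the CSV before importing. This is only a sketch: csv_data_rows is a hypothetical helper, it assumes a single header line, and it assumes db.py inserts one row per CSV data row, which is not confirmed here. The 66 above is just sample output.

```bash
# Hypothetical helper: count the non-empty lines after the header line.
csv_data_rows() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Example (not run here):
# csv_data_rows raw/frontera_performance/results_ior/ior_v2.2.0-rc1.csv
```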

Archive the Raw CSVs

Archive the raw CSV data in the SharePoint archive.

Generate Reports

There are two standard reports, which we currently base on v2.0.0. It’s best to keep using this baseline so there is a common reference point.

--base-commit is the hash of the version you want to compare against, i.e. the baseline.

--commit is the hash for the version you just ran with.

Code Block

language	bash

./db.py report basic --base-commit efdf65b --commit e68757d --of xlsx -o "v2.2.0-rc1 (rf=0) (base v2.0.0).xlsx"
./db.py report ec --base-commit efdf65b --commit e68757d --of xlsx -o "v2.2.0-rc1 (EC) (base v2.0.0).xlsx"

Upload Reports

The generated spreadsheets should be uploaded here. Create a new folder, following the existing naming convention.

Analyze Reports

Each sheet of each report has a % column for each metric on the far right. The percent difference is for base-commit -> commit. At the bottom is min, max, and mean for each column. Anything less than -5% is colored red. These are “tentative regressions”. If an entire column - i.e. all variants - of a sheet/group shows a regression, it’s more likely a “real regression”. If only some rows show regressions, it might just be variance with DAOS on Frontera. Re-run each of these “tentative regressions”. If performance is on-par with the baseline, it was most likely due to system variance. If performance is still low, it might be a “real regression”. Create a ticket for “real regressions”. Replace “tentative regressions” with the “better” result.
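
The -5% rule can be illustrated with a small calculation. This sketch is not taken from db.py: pct_diff and is_tentative_regression are hypothetical helpers, and the formula (new - base) / base * 100 is an assumption about how the % column is computed. It also assumes a metric where higher is better.

```bash
# Hypothetical helper: percent difference going from base to new.
pct_diff() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.2f\n", (new - base) * 100 / base }'
}

# Hypothetical helper: succeeds (exit 0) when the difference is below -5%.
is_tentative_regression() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { exit !((new - base) * 100 / base < -5) }'
}

# Example: a metric dropping from 100 to 90 is a -10% change.
# pct_diff 100 90
# is_tentative_regression 100 90 && echo "re-run this variant"
```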

Replacing Results

Reverse the import by deleting:

Code Block
./db.py delete results_ior raw/frontera_performance/results_ior/ior_v2.2.0-rc1.csv
./db.py delete results_mdtest raw/frontera_performance/results_mdtest/mdtest_v2.2.0-rc1.csv

Delete the “bad” job results
Re-generate the CSVs with get_results.py
Re-archive the CSVs if necessary
Re-import with db.py import
Re-generate the reports with db.py report
Replace the spreadsheets in SharePoint

Analyze Reports Again

The second analysis is similar to the first, except we don’t want to just keep running “tentative regressions”. If a regression is there after multiple runs, it’s likely a “real regression”. Deciding when results are just variance is mostly quantitative, but somewhat of an art.

Create Tickets

Create tickets for regressions and be sure to include or link to the logs.

Archive Logs

All job logs should be archived. There isn’t a centralized location, so put them somewhere stable on wolf, set permissions with chmod -R go+rx, and put the path on the ticket.
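
The copy and permission step could be wrapped as follows. This is a sketch under assumptions: archive_logs is a hypothetical helper, and both paths in the example are placeholders since there is no standard archive location.

```bash
# Hypothetical helper: copy a results directory to an archive location
# and make it readable and traversable by group and others.
archive_logs() {
    local src="$1" dest="$2"
    mkdir -p "${dest}" || return 1
    # Copy the contents of src (including hidden files) into dest.
    cp -r "${src}/." "${dest}/" || return 1
    chmod -R go+rx "${dest}"
}

# Example (not run here; both paths are placeholders):
# archive_logs /path/to/RESULTS/v2.2.0-rc1 /path/on/wolf/daos-xxxx_logs
```

The destination path is the one to put on the ticket.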

Version	Old Version 37	New Version 57
Changes made by	Dalton Bohning	Dalton Bohning
Saved on	Jul 13, 2022	Mar 08, 2024
