We have an automated test infrastructure for building and running DAOS tests on Frontera.
Please see below for instructions on building and running tests.
Before doing any work on Frontera, you should read and understand Good Conduct on Frontera.
You should also be aware of the limited credits for running jobs. After logging in, you can run:
/usr/local/etc/taccinfo
--------------------- Project balances for user dbohninx ----------------------
| Name           Avail SUs     Expires |
| STAR-Intel         #####  YYYY-MM-DD |
All of these setup instructions should be run on a login node (e.g., login3.frontera).
Add the following line to ~/.bashrc. There should be an if block labeled "SECTION 2" where you should put this.
export PATH=$HOME/.local/bin:$PATH
mkdir -p ${WORK}/{BUILDS,TESTS,RESULTS,WEEKLY_RESULTS,TOOLS}
cd ${WORK}/TESTS
git clone https://github.com/daos-stack/daos_scaled_testing
These packages are required by some of the .py scripts that post-process results.
cd ${WORK}/TESTS/daos_scaled_testing/frontera
python3 -m pip install --upgrade pip
python3 -m pip install --user -r python3_requirements.txt
By default, the system-installed MVAPICH2 is used and recommended. If you want to use MPICH or OpenMPI, they must be installed from scratch. Since we only build with a single core on login nodes (remember Citizenship on Frontera), this may take a while to complete.
This script is not well maintained and may need adjustment.
Edit run_build.sh:
vim run_build.sh
Configure these lines:
BUILD_DIR="${WORK}/BUILDS/"
DAOS_BRANCH="master"
Optionally, you can choose to build a specific branch, commit, or cherry pick.
When executed on a login node, run_build.sh will only use a single process. It is recommended to build on a development node. This will build DAOS in <BUILD_DIR>/<date>/daos and build the latest hpc/ior.
idev
# Wait for the session to start
cd ${WORK}/TESTS/daos_scaled_testing/frontera
./run_build.sh
The test script should be executed on a login node. It uses Slurm to reserve nodes and run jobs.
cd ${WORK}/TESTS/daos_scaled_testing/frontera
vim run_testlist.py
...
# Configure these lines
env['JOBNAME'] = "<sbatch_jobname>"
env['DAOS_DIR'] = abspath(expandvars("${WORK}/BUILDS/latest/daos"))  # Path to daos
env['RES_DIR'] = abspath(expandvars("${WORK}/RESULTS"))  # Path to test results
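The DAOS_DIR and RES_DIR values rely on os.path.expandvars, which leaves ${WORK} untouched when the variable is unset. abspath then quietly turns the result into a path under the current directory. The following is a minimal sketch of a pre-flight check, assuming nothing about run_testlist.py beyond the lines shown above. The helper name resolve_path and the fallback value for WORK are hypothetical.

```python
import os
from os.path import abspath, expandvars


def resolve_path(template):
    """Expand environment variables and fail loudly if any remain unexpanded."""
    expanded = expandvars(template)
    if "$" in expanded:
        raise ValueError(f"unset environment variable in path: {template}")
    return abspath(expanded)


if __name__ == "__main__":
    # Hypothetical fallback for illustration only; WORK is normally already set.
    os.environ.setdefault("WORK", "/tmp/example_work")
    print(resolve_path("${WORK}/BUILDS/latest/daos"))
```

Running a check like this before submitting jobs may catch a misconfigured environment earlier than a failed batch job would.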
The DAOS configs are daos_scaled_testing/frontera/daos_{server,agent,control}.yml.
The client / test runner environment is defined in daos_scaled_testing/frontera/env_daos.
Test configs are in the daos_scaled_testing/frontera/tests directory. Each config is a Python file that defines a list of test variants to run. For example, in tests/sanity/ior.py:
'''
IOR sanity tests.
Defines a list 'tests' containing dictionary items of tests.
'''

# Default environment variables used by each test
env_vars = {
    'pool_size': '85G',
    'chunk_size': '1M',
    'segments': '1',
    'xfer_size': '1M',
    'block_size': '150G',
    'sw_time': '5',
    'iterations': '1',
    'ppc': 32
}

# List of tests
tests = [
    {
        'test_group': 'IOR',
        'test_name': 'ior_sanity',
        'oclass': 'SX',
        'scale': [
            # (num_servers, num_clients, timeout_minutes)
            (1, 1, 1),
        ],
        'env_vars': dict(env_vars),
        'enabled': True
    },
]
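To make the structure concrete, the sketch below expands a tests list like the one above into one entry per scale tuple and skips disabled tests. It only illustrates the data layout. expand_variants is a hypothetical helper, not the logic run_testlist.py actually uses, and the second example entry is made up.

```python
def expand_variants(tests):
    """Yield one dict per (test, scale) pair for tests marked 'enabled'."""
    for test in tests:
        if not test.get("enabled", False):
            continue
        for num_servers, num_clients, timeout_minutes in test["scale"]:
            yield {
                "test_name": test["test_name"],
                "oclass": test["oclass"],
                "num_servers": num_servers,
                "num_clients": num_clients,
                "timeout_minutes": timeout_minutes,
            }


if __name__ == "__main__":
    # Trimmed-down copy of the sanity config, plus a hypothetical disabled entry.
    example_tests = [
        {"test_group": "IOR", "test_name": "ior_sanity", "oclass": "SX",
         "scale": [(1, 1, 1)], "enabled": True},
        {"test_group": "IOR", "test_name": "ior_disabled_example", "oclass": "SX",
         "scale": [(2, 4, 5)], "enabled": False},
    ]
    for variant in expand_variants(example_tests):
        print(variant)
```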
These parameters can be configured as desired. Only tests with 'enabled': True will run. To execute some tests:
$ ./run_testlist.py tests/sanity/ior.py
Importing tests from tests/sanity/ior.py
001. Running ior_sanity SX, 1 servers, 1 clients, 1048576 ec_ell_size
...
Submitted batch job 3728480
You can then monitor the status of your jobs in the queue:
showq -u
When all the sbatch jobs complete, you should see the results at <RES_DIR>/<date>. You can extract the IOR and MdTest results into CSV format by running:
cd ${WORK}/TESTS/daos_scaled_testing/frontera
# Get all ior and mdtest results
./get_results.py <RES_DIR>
# Get all ior and mdtest results, and email the result CSVs
./get_results.py <RES_DIR> --email first.last@intel.com
This is currently a work in progress. See https://github.com/daos-stack/daos_scaled_testing/blob/master/database/README.md for general usage.
These instructions cover running the Frontera performance suite for validation of Test Builds, Release Candidates, and performance-sensitive features.
Some details reflect my personal organizational style and can be tweaked as desired - Dalton Bohning
Using a fresh clone of the test scripts keeps the configuration isolated from other ongoing work. For example, clone into a directory named after the ticket number.
cd $WORK/TESTS
git clone git@github.com:daos-stack/daos_scaled_testing.git daos-xxxx
cd daos-xxxx/frontera
Using a separate build directory for each version makes it easy to reference the build later - perhaps several versions later.
Running on a compute node is much quicker, but not required.
idev
# Wait for session
./run_build.sh --build-dir "${WORK}/BUILDS/v2.2.0-rc1" \
    --daos-branch "release/2.2" \
    --daos-commit "v2.2.0-rc1"
# Get a cup of coffee
DAOS requires sudo/root permission, but we do not have sudo/root on Frontera. In run_build.sh:merge_extra_daos_branches(), we merge a “hack” branch so that DAOS compiles and runs on Frontera. When that branch conflicts with a newer DAOS commit, use the latest hack branch as a base to create a new branch, rebase it on the conflicting DAOS commit, resolve the conflicts, push the branch, and update run_build.sh.
For example, the branch dbohning/io500-base-e2a10d7 is based on master commit e2a10d7, and has 5 “hacks”. Coordinate with Dalton Bohning as necessary if a new conflict arises.
Before executing hundreds of test cases, make sure the test scripts and DAOS are behaving as expected. For example, sometimes a simple daos or dmg interface change can cause every test to fail.
./run_testlist.py --jobname "daos-sanity" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1_sanity" \
    -r \
    tests/sanity/
After the tests complete, sanity check the results. All tests should contain Pass.
./get_results.py --tests all "${WORK}/RESULTS/v2.2.0-rc1_sanity"
cat "${WORK}/RESULTS/v2.2.0-rc1_sanity"/*.csv
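The manual check can also be scripted. The sketch below flags any CSV row in which no field contains the text Pass. The column layout produced by get_results.py is not described on this page, so the code deliberately scans every field rather than assuming a column name. find_non_passing is a hypothetical helper, and treating the first row as a header is an assumption.

```python
import csv
import glob
import os
import sys


def find_non_passing(csv_path):
    """Return the data rows of csv_path in which no field contains 'Pass'."""
    with open(csv_path, newline="") as csv_file:
        rows = list(csv.reader(csv_file))
    # Assumption: the first row is a header and the rest are results.
    return [
        row for row in rows[1:]
        if row and not any("Pass" in field for field in row)
    ]


if __name__ == "__main__":
    results_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in sorted(glob.glob(os.path.join(results_dir, "*.csv"))):
        bad_rows = find_non_passing(path)
        print(f"{path}: {len(bad_rows)} row(s) without Pass")
```

Pass the results directory as the first argument. Any file reporting a nonzero count is worth opening by hand.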
Optionally, you could look through one or more logs to check for unexpected warnings or errors.
It is recommended to run the tests serially with --serial in order to reduce the variance and interference between jobs. Running in parallel tends to have more interference between larger jobs, which then requires re-running those jobs. And because Frontera uses a priority queue to schedule jobs, the overall time-to-result is about the same when run serially.
The --dryrun option to run_testlist.py will list the test variants to be run, but not actually run them.
The --jobname, --daos_dir, and --res_dir options to run_testlist.py can also be edited in the script itself instead of specifying on the command line.
There are currently 48 RF0 variants, 42 EC IOR variants, and 42 EC MDTest variants.
./run_testlist.py --jobname "daos-xxxx" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1" \
    -r --serial \
    tests/basic tests/ec_vs_rf0_complex
There is currently 1 variant in this group.
It’s helpful to use a different results directory from the previous tests since these results aren’t stored in the database.
./run_testlist.py --jobname "daos-xxxx-rebuild" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1_rebuild" \
    -r --serial \
    --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
    tests/rebuild/load_ec.py
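The <LAST_SLURM_JOB_ID> placeholder needs a real job ID. sbatch reports each submission as "Submitted batch job <id>", as in the run_testlist.py output shown earlier. The following is a minimal sketch for pulling the last ID out of captured output, assuming the output was saved as text and that the last submitted job is the one to depend on. last_job_id is a hypothetical helper, not part of the test scripts.

```python
import re


def last_job_id(output):
    """Return the last Slurm job ID reported in the given text, or None."""
    job_ids = re.findall(r"Submitted batch job (\d+)", output)
    return job_ids[-1] if job_ids else None


if __name__ == "__main__":
    sample = (
        "Importing tests from tests/sanity/ior.py\n"
        "001. Running ior_sanity SX, 1 servers, 1 clients, 1048576 ec_ell_size\n"
        "Submitted batch job 3728480\n"
    )
    print(last_job_id(sample))  # prints 3728480
```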
There is currently 1 variant in this group.
It’s helpful to use a different results directory from the previous tests since these results aren’t stored in the database.
./run_testlist.py --jobname "daos-xxxx-max" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1_max" \
    -r --serial \
    --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
    tests/max/max.py
There is currently 1 variant in this group.
It’s helpful to use a different results directory from the previous tests since these results aren’t stored in the database.
Don’t forget to change daos_server.yml and env_daos back to Verbs after running with TCP!
vim daos_server.yml
# provider: ofi+verbs;ofi_rxm  # <-- Comment/remove this
# provider: ofi+tcp;ofi_rxm  # <-- Uncomment/add this
# - FI_OFI_RXM_DEF_TCP_WAIT_OBJ=pollfd  # <-- Uncomment/add this
vim env_daos
# PROVIDER="${2:-tcp}"
./run_testlist.py --jobname "daos-xxxx-tcp" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1_tcp" \
    -r --serial \
    --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
    --filter "daos_servers=2,daos_clients=8" \
    tests/basic/mdtest_easy.py
The results can be gathered and the CSVs optionally emailed to avoid SCP:
./get_results.py ${WORK}/RESULTS/v2.2.0-rc1 --email <YOUR_EMAIL>
It’s not necessary to email the other variants, since they aren’t inserted into the database.
./get_results.py ${WORK}/RESULTS/v2.2.0-rc1_rebuild
./get_results.py ${WORK}/RESULTS/v2.2.0-rc1_max
./get_results.py ${WORK}/RESULTS/v2.2.0-rc1_tcp
The test framework sanity checks that the nodes are in sync. A failure of this check is usually transient, and re-running the job is the easiest solution. If it occurs very frequently, TACC might need to be notified.
Frontera compute nodes only have ~85G RAM each. Most tests run with at most 60 seconds of write. Even with dedup and a special IOR patch, it’s possible to run out of space. If reducing the ior/mdtest stonewall time helps, then it’s probably just a space constraint.
The database is hosted on and only accessible from wolf-17vm4. Eventually, we need a more robust solution.
cd /home/dbohning/frontera/daos_scaled_testing/database
From daos_scaled_testing/database:
./db.py import results_ior raw/frontera_performance/results_ior/ior_v2.2.0-rc1.csv
./db.py import results_mdtest raw/frontera_performance/results_mdtest/mdtest_v2.2.0-rc1.csv
You should see:
66 rows inserted
Archive the raw CSV data in the SharePoint archive.
There are two standard reports, which we currently base on v2.0.0. It’s best to keep using this baseline so there is a common reference point.
--base-commit is the hash for the version you want to compare against, i.e., the baseline.
--commit is the hash for the version you just ran with.
./db.py report basic --base-commit efdf65b --commit e68757d --of xlsx -o "v2.2.0-rc1 (rf=0) (base v2.0.0).xlsx"
./db.py report ec --base-commit efdf65b --commit e68757d --of xlsx -o "v2.2.0-rc1 (EC) (base v2.0.0).xlsx"
The generated spreadsheets should be uploaded here. Create a new folder, following the existing naming convention.
Each sheet of each report has a % column for each metric on the far right. The percent difference is for base-commit -> commit. At the bottom are the min, max, and mean for each column. Anything less than -5% is colored red. These are “tentative regressions”. If an entire column (i.e., all variants) of a sheet/group shows a regression, it’s more likely a “real regression”. If only some rows show regressions, it might just be variance with DAOS on Frontera. Re-run each of these “tentative regressions”. If performance is on par with the baseline, the difference was most likely due to system variance. If performance is still low, it might be a “real regression”. Create a ticket for “real regressions”. Replace “tentative regressions” with the “better” result.
Reverse the import by deleting:
./db.py delete results_ior raw/frontera_performance/results_ior/ior_v2.2.0-rc1.csv
./db.py delete results_mdtest raw/frontera_performance/results_mdtest/mdtest_v2.2.0-rc1.csv
Delete the “bad” job results
Re-generate the CSVs with get_results.py
Re-archive the CSVs if necessary
Re-import with db.py import
Re-generate the reports with db.py report
Replace the spreadsheets in SharePoint
The second analysis is similar to the first, except we don’t want to just keep running “tentative regressions”. If a regression is there after multiple runs, it’s likely a “real regression”. Deciding when results are just variance is mostly quantitative, but somewhat of an art.
Create tickets for regressions and be sure to include or link to the logs.
All job logs should be archived. There isn’t a centralized location, so put them somewhere stable on wolf, set permissions with chmod -R go+rx, and put the path on the ticket.