We have an automated test infrastructure for building and running DAOS tests on Frontera.
Please see below for instructions on building and running tests.
Before doing any work on Frontera, you should read and understand Citizenship on Frontera.
You should also be aware of the limited credits for running jobs. After logging in, you can run:
```shell
$ /usr/local/etc/taccinfo
--------------------- Project balances for user dbohninx ----------------------
| Name            Avail SUs     Expires |
| STAR-Intel          #####  YYYY-MM-DD |
```
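If you want to keep an eye on remaining credits programmatically, the balance table could be scraped with a small parser. This is a hypothetical sketch (the function name and the exact column layout are assumptions based on the output shown above):

```python
import re

def parse_su_balances(text):
    """Parse project SU balances from taccinfo-style output.

    Illustrative sketch: assumes each project row looks like
    '| NAME  12345  YYYY-MM-DD |', as in the table above.
    """
    rows = []
    for match in re.finditer(r'\|\s*(\S+)\s+(\d+)\s+(\d{4}-\d{2}-\d{2})\s*\|', text):
        name, avail, expires = match.groups()
        rows.append({'name': name, 'avail_sus': int(avail), 'expires': expires})
    return rows

sample = """
--------------------- Project balances for user dbohninx ----------------------
| STAR-Intel    12345    2024-12-31 |
"""
print(parse_su_balances(sample))
```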
All of these setup instructions should be run on a login node (e.g. login3.frontera).
Add the following line to ~/.bashrc. There should be an if block labeled "SECTION 2" where you should put this.

```shell
export PATH=$HOME/.local/bin:$PATH
```
```shell
mkdir -p ${WORK}/{BUILDS,TESTS,RESULTS,WEEKLY_RESULTS,TOOLS}
cd ${WORK}/TESTS
git clone https://github.com/daos-stack/daos_scaled_testing
```
These packages are needed by some of the .py scripts that post-process results.
```shell
cd ${WORK}/TESTS/daos_scaled_testing
python3 -m pip install --upgrade pip
python3 -m pip install --user -r python3_requirements.txt
```
By default, the system-installed MVAPICH2 is used and recommended. If you want to use MPICH or OpenMPI, they must be built from source.
Since we only build with a single core on login nodes (remember Citizenship on Frontera), this may take a while to complete.
```shell
cd ${WORK}/TESTS/daos_scaled_testing/frontera
./build_and_install_tools.sh
```
This script is not well maintained and may need adjustment.
Edit run_build.sh:
```shell
vim run_build.sh
```
Configure these lines:
```shell
BUILD_DIR="${WORK}/BUILDS/"
DAOS_BRANCH="master"
```
Optionally, you can build a specific branch, commit, or cherry-pick.
When executed on a login node, run_build.sh will only use a single process, so it is recommended to build on a development node instead. This will build DAOS in <BUILD_DIR>/<date>/daos and build the latest hpc/ior.
```shell
idev  # wait for the session
./run_build.sh
```
The test script should be executed on a login node; it uses Slurm to reserve nodes and run jobs.
```
cd ${WORK}/TESTS/daos_scaled_testing/frontera
vim run_testlist.py
...
# Configure these lines
env['JOBNAME'] = "<sbatch_jobname>"
env['DAOS_DIR'] = abspath(expandvars("${WORK}/BUILDS/latest/daos"))  # Path to daos
env['RES_DIR'] = abspath(expandvars("${WORK}/RESULTS"))              # Path to test results
```
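The abspath(expandvars(...)) pattern above first substitutes ${WORK}-style variables from the environment and then normalizes the result to an absolute path. A quick standalone illustration:

```python
import os
from os.path import abspath, expandvars

# expandvars() substitutes ${VAR} references using the current environment;
# abspath() then normalizes the result to an absolute path.
os.environ['WORK'] = '/tmp/work'
path = abspath(expandvars('${WORK}/BUILDS/latest/daos'))
print(path)  # /tmp/work/BUILDS/latest/daos
```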
The DAOS configs are daos_scaled_testing/frontera/daos_{server,agent,control}.yml. The client / test runner environment is defined in daos_scaled_testing/frontera/env_daos.
Test configs are in the daos_scaled_testing/frontera/tests directory. Each config is a Python file that defines a list of test variants to run. For example, in tests/sanity/ior.py:
```python
'''
IOR sanity tests.
Defines a list 'tests' containing dictionary items of tests.
'''

# Default environment variables used by each test
env_vars = {
    'pool_size': '85G',
    'chunk_size': '1M',
    'segments': '1',
    'xfer_size': '1M',
    'block_size': '150G',
    'sw_time': '5',
    'iterations': '1',
    'ppc': 32
}

# List of tests
tests = [
    {
        'test_group': 'IOR',
        'test_name': 'ior_sanity',
        'oclass': 'SX',
        'scale': [
            # (num_servers, num_clients, timeout_minutes)
            (1, 1, 1),
        ],
        'env_vars': dict(env_vars),
        'enabled': True
    },
]
```
These parameters can be configured as desired. Only tests with 'enabled': True will run. To execute some tests:
```
$ ./run_testlist.py tests/sanity/ior.py
Importing tests from tests/sanity/ior.py
001. Running ior_sanity SX, 1 servers, 1 clients, 1048576 ec_ell_size ...
Submitted batch job 3728480
```
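The exact internals of run_testlist.py are not shown here, but a minimal sketch of how such a config file might be consumed (import the file, skip disabled variants, expand each scale tuple) could look like this. The helper names are illustrative, not the script's actual API:

```python
import importlib.util

def load_tests(config_path):
    """Import a test config file and return its 'tests' list.

    Illustrative helper, not run_testlist.py's actual implementation.
    """
    spec = importlib.util.spec_from_file_location('test_config', config_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.tests

def expand_variants(tests):
    """Yield one (name, servers, clients, timeout) tuple per enabled scale entry."""
    for test in tests:
        if not test.get('enabled'):
            continue  # only tests with 'enabled': True are run
        for servers, clients, timeout in test['scale']:
            yield (test['test_name'], servers, clients, timeout)

demo = [{'test_name': 'ior_sanity', 'scale': [(1, 1, 1)], 'enabled': True},
        {'test_name': 'disabled_one', 'scale': [(2, 4, 10)], 'enabled': False}]
print(list(expand_variants(demo)))  # [('ior_sanity', 1, 1, 1)]
```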
You can then monitor the status of your jobs in the queue:
```shell
showq -u
```
When all the sbatch jobs complete, you should see the results at <RES_DIR>/<date>. You can extract the IOR and MdTest results into CSV format by running:
```shell
cd ${WORK}/TESTS/daos_scaled_testing/frontera

# Get all ior and mdtest results
./get_results.py <RES_DIR>

# Get all ior and mdtest results, and email the result CSVs
./get_results.py <RES_DIR> --email first.last@intel.com
```
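get_results.py does the real extraction; as a rough illustration of the idea, IOR prints summary lines such as "Max Write: ... MiB/sec" that can be scraped into structured records. This is a simplified sketch, not the script's actual parsing:

```python
import re

def extract_ior_summary(log_text):
    """Scrape 'Max Write'/'Max Read' summary lines from IOR output.

    Simplified sketch; get_results.py is the real implementation.
    """
    results = {}
    for op in ('Write', 'Read'):
        match = re.search(rf'Max {op}:\s+([\d.]+)\s+MiB/sec', log_text)
        if match:
            results[op.lower()] = float(match.group(1))
    return results

log = """
Max Write: 1024.50 MiB/sec (1074.26 MB/sec)
Max Read:  2048.00 MiB/sec (2147.48 MB/sec)
"""
print(extract_ior_summary(log))  # {'write': 1024.5, 'read': 2048.0}
```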
This is currently a work in progress. See https://github.com/daos-stack/daos_scaled_testing/blob/master/database/README.md for general usage.
These instructions cover running the Frontera performance suite for validation of Test Builds, Release Candidates, and performance-sensitive features.
Some details are my personal organizational style and can be tweaked as desired. - Dalton Bohning
Using a fresh clone of the test scripts keeps the configuration isolated from other ongoing work, e.g. by cloning into a directory named after the ticket number.
```shell
cd $WORK/TESTS
git clone git@github.com:daos-stack/daos_scaled_testing.git daos-xxxx
cd daos-xxxx/frontera
```
Using a separate build directory makes it easy to reference the build later, perhaps several versions later.
```shell
vim run_build.sh
# BUILD_DIR="${WORK}/BUILDS/v2.2.0-rc1"
# DAOS_BRANCH="release/2.2"
# DAOS_COMMIT="v2.2.0-rc1"
```
Running on a compute node is quicker, but not required.
```shell
idev            # Wait for session
./run_build.sh  # Get a cup of coffee
```
TODO
Before executing hundreds of test cases, make sure the test scripts and DAOS are behaving as expected. For example, sometimes a simple daos or dmg interface change can cause every test to fail.
```shell
./run_testlist.py --jobname "daos-sanity" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1_sanity" \
    -r tests/sanity/
```
After the tests complete, sanity check the results. All tests should contain Pass.
```shell
./get_results.py --tests all "${WORK}/RESULTS/v2.2.0-rc1_sanity"
cat "${WORK}/RESULTS/v2.2.0-rc1_sanity"/*.csv
```

Note that the glob must be outside the quotes, or the shell will not expand it.
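The manual Pass check can also be scripted. A small sketch that flags any CSV row whose status column is not "Pass" (the column names here are assumptions; adjust them to match the CSVs get_results.py actually emits):

```python
import csv
import io

def non_passing_rows(csv_text, status_col='status'):
    """Return rows whose status column is anything other than 'Pass'.

    The column name 'status' is an assumption about the CSV layout.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row.get(status_col) != 'Pass']

sample = "test_name,status\nior_sanity,Pass\nior_easy,Failed\n"
print(non_passing_rows(sample))  # [{'test_name': 'ior_easy', 'status': 'Failed'}]
```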
Optionally, you could look through one or more logs to check for unexpected warnings or errors.
Frontera only allows queuing 100 jobs per user, so tests will need to be executed in batches.
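Given the 100-job cap, a long variant list has to be split before submission. A trivial chunking helper (illustrative only, not part of the test scripts) shows the idea:

```python
def batches(variants, limit=100):
    """Split a list of test variants into queue-sized batches."""
    return [variants[i:i + limit] for i in range(0, len(variants), limit)]

# 250 variants -> batches of 100, 100, and 50
sizes = [len(b) for b in batches(list(range(250)))]
print(sizes)  # [100, 100, 50]
```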
It’s not strictly required to run the tests serially, but doing so seems to reduce the variance and interference between jobs. Because of how the priority queue works, running serially doesn’t necessarily take longer than running in parallel, especially since fewer jobs need to be re-run in case of variance/noise/interference.
The --dryrun option to run_testlist.py helps make sure the expected variants will be run.
The --jobname, --daos_dir, and --res_dir options to run_testlist.py can also be edited in the script itself instead of being specified on the command line.
There are currently 48 variants in this group.
```shell
./run_testlist.py --jobname "daos-xxxx" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1" \
    -r --serial \
    tests/basic
```
Use showq -u to get the SLURM_JOB_ID of the last job in the queue, if any.
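The --slurm_dep_afterany option chains a submission behind an earlier job; Slurm itself exposes this as sbatch --dependency=afterany:<jobid>. A sketch of building such a submission command (sbatch_command is an illustrative helper; nothing is actually submitted here, and run_testlist.py drives sbatch itself):

```python
def sbatch_command(script, jobname, dep_job_id=None):
    """Build an sbatch command list; with a dependency, the new job waits
    until the given job finishes (in any state) before it can start.
    """
    cmd = ['sbatch', '--job-name', jobname]
    if dep_job_id is not None:
        cmd += ['--dependency', f'afterany:{dep_job_id}']
    cmd.append(script)
    return cmd

print(sbatch_command('run_ior.sh', 'daos-xxxx', dep_job_id=3728480))
```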
There are currently 42 EC IOR variants.
There are also rf0 variants, but those are not usually run. The --filter option runs only the EC variants.
```shell
./run_testlist.py --jobname "daos-xxxx" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1" \
    -r --serial \
    --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
    --filter "oclass=EC_16P2G1 oclass=EC_16P2GX oclass=EC_8P2G1 oclass=EC_8P2GX oclass=EC_4P2G1 oclass=EC_4P2GX oclass=EC_2P1G1 oclass=EC_2P1GX" \
    tests/ec_vs_rf0_complex/ior*
```
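The --filter string appears to be a space-separated list of key=value terms. A hedged sketch of how variants might be matched against it, assuming a variant is selected when any term matches (check run_testlist.py for the real semantics):

```python
def parse_filter(filter_str):
    """Parse a space-separated 'key=value' filter string into pairs."""
    return [term.split('=', 1) for term in filter_str.split()]

def matches(variant, filter_terms):
    """A variant is selected if any key=value term matches it.

    Assumed OR semantics; illustrative, not the script's actual logic.
    """
    return any(str(variant.get(key)) == value for key, value in filter_terms)

terms = parse_filter('oclass=EC_16P2G1 oclass=EC_8P2GX')
print(matches({'oclass': 'EC_16P2G1'}, terms))  # True
print(matches({'oclass': 'SX'}, terms))         # False
```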
There are currently 42 EC MDTest variants.
```shell
./run_testlist.py --jobname "daos-xxxx" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1" \
    -r --serial \
    --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
    --filter "oclass=EC_16P2G1 oclass=EC_16P2GX oclass=EC_8P2G1 oclass=EC_8P2GX oclass=EC_4P2G1 oclass=EC_4P2GX oclass=EC_2P1G1 oclass=EC_2P1GX" \
    tests/ec_vs_rf0_complex/mdtest*
```
There is currently 1 variant in this group.
It’s helpful, but not necessary, to use a different results directory.
```shell
./run_testlist.py --jobname "daos-xxxx-rebuild" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1_rebuild" \
    -r --serial \
    --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
    tests/rebuild/load_ec.py
```
There is currently 1 variant in this group.
It’s helpful, but not necessary, to use a different results directory.
```shell
./run_testlist.py --jobname "daos-xxxx-max" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1_max" \
    -r --serial \
    --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
    tests/max/max.py
```
There is currently 1 variant in this group.
It’s helpful, but not necessary, to use a different results directory.
Don’t forget to change daos_server.yml and env_daos back to Verbs after running with TCP!
```shell
vim daos_server.yml
# provider: ofi+verbs;ofi_rxm             # <-- Comment/remove this
# provider: ofi+tcp;ofi_rxm               # <-- Uncomment/add this
# - FI_OFI_RXM_DEF_TCP_WAIT_OBJ=pollfd    # <-- Uncomment/add this

vim env_daos
# PROVIDER="${2:-tcp}"
```
```shell
./run_testlist.py --jobname "daos-xxxx-tcp" \
    --daos_dir "${WORK}/BUILDS/v2.2.0-rc1/latest/daos" \
    --res_dir "${WORK}/RESULTS/v2.2.0-rc1_tcp" \
    -r --serial \
    --slurm_dep_afterany <LAST_SLURM_JOB_ID> \
    --filter "daos_servers=2,daos_clients=8" \
    tests/basic/mdtest_easy.py
```
TODO
TODO
TODO
TODO