DAOS on Frontera

We have an automated test infrastructure for building and running DAOS tests on Frontera.

Please see below for instructions on building and running tests.

Citizenship

Before doing any work on Frontera, you should read and understand Citizenship on Frontera.

You should also be aware of the limited credits for running jobs. After logging in, you can run:

/usr/local/etc/taccinfo
--------------------- Project balances for user dbohninx ----------------------
| Name            Avail SUs      Expires    |
| STAR-Intel      #####          YYYY-MM-DD |

Initial Setup

All of these setup instructions should be run on a login node (e.g. login3.frontera).

Add your local binary directory to your PATH

Add the following line to ~/.bashrc, inside the if block labeled "SECTION 2".

export PATH=$HOME/.local/bin:$PATH

Setup directories and clone the test scripts

mkdir -p ${WORK}/{BUILDS,TESTS,RESULTS,WEEKLY_RESULTS,TOOLS}
cd ${WORK}/TESTS
git clone https://github.com/daos-stack/daos_scaled_testing

Install python dependencies for post-job scripts

These packages are needed by some of the .py scripts that post-process results.

cd ${WORK}/TESTS/daos_scaled_testing
python3 -m pip install --upgrade pip
python3 -m pip install --user -r python3_requirements.txt

Build MPI packages - Optional

By default, the system-installed MVAPICH2 is used and recommended. If you want to use MPICH or OpenMPI, they must be built from scratch.

Since we only build with a single core on login nodes (remember Citizenship on Frontera), this may take a while to complete.

cd ${WORK}/TESTS/daos_scaled_testing/frontera
./build_and_install_tools.sh

This script is not well maintained and may need adjustment.

Build DAOS

Edit run_build.sh:

vim run_build.sh

Configure these lines:

BUILD_DIR="${WORK}/BUILDS/"
DAOS_BRANCH="master"

Optionally, you can build a specific branch, commit, or cherry-pick.
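For example, the same file can be configured to pin the build to a release tag (values are illustrative):

```shell
# Illustrative run_build.sh settings for pinning to a tag
BUILD_DIR="${WORK}/BUILDS/v2.2.0-rc1"
DAOS_BRANCH="release/2.2"
DAOS_COMMIT="v2.2.0-rc1"
```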

When executed on a login node, run_build.sh will only use a single process, so it is recommended to build on a development node. The script builds DAOS in <BUILD_DIR>/<date>/daos and also builds the latest hpc/ior.

idev
# wait
./run_build.sh

Running Tests

The test script should be executed on a login node; it uses Slurm to reserve nodes and run the jobs.

Configure run_testlist.py:

cd ${WORK}/TESTS/daos_scaled_testing/frontera
vim run_testlist.py
...
# Configure these lines
env['JOBNAME'] = "<sbatch_jobname>"
env['DAOS_DIR'] = abspath(expandvars("${WORK}/BUILDS/latest/daos"))  # Path to daos
env['RES_DIR'] = abspath(expandvars("${WORK}/RESULTS"))  # Path to test results

Configure DAOS Configs

The DAOS configs are daos_scaled_testing/frontera/daos_{server,agent,control}.yml.

The client / test runner environment is defined in daos_scaled_testing/frontera/env_daos.

Test Variants

Test configs are in the daos_scaled_testing/frontera/tests directory. Each config is a python file that defines a list of test variants to run. For example, in tests/sanity/ior.py:

'''
IOR sanity tests.
Defines a list 'tests' containing dictionary items of tests.
'''

# Default environment variables used by each test
env_vars = {
    'pool_size': '85G',
    'chunk_size': '1M',
    'segments': '1',
    'xfer_size': '1M',
    'block_size': '150G',
    'sw_time': '5',
    'iterations': '1',
    'ppc': 32
}

# List of tests
tests = [
    {
        'test_group': 'IOR',
        'test_name': 'ior_sanity',
        'oclass': 'SX',
        'scale': [
            # (num_servers, num_clients, timeout_minutes)
            (1, 1, 1),
        ],
        'env_vars': dict(env_vars),
        'enabled': True
    },
]
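Conceptually, run_testlist.py imports a file like this and submits only the variants flagged as enabled. A minimal sketch of that selection step (the enabled_variants helper is hypothetical, not part of the actual runner):

```python
# Hypothetical sketch of the selection step: only variants with
# 'enabled': True are submitted to the queue.
tests = [
    {'test_name': 'ior_sanity', 'oclass': 'SX', 'enabled': True},
    {'test_name': 'ior_hard', 'oclass': 'SX', 'enabled': False},
]

def enabled_variants(tests):
    """Return only the variants flagged to run."""
    return [t for t in tests if t.get('enabled')]

print([t['test_name'] for t in enabled_variants(tests)])  # → ['ior_sanity']
```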

These parameters can be configured as desired. Only tests with 'enabled': True will run. To execute some tests:

$ ./run_testlist.py tests/sanity/ior.py
Importing tests from tests/sanity/ior.py
001. Running ior_sanity SX, 1 servers, 1 clients, 1048576 ec_ell_size
...
Submitted batch job 3728480

You can then monitor the status of your jobs in the queue:

showq -u

Get test results

When all the sbatch jobs complete, you should see the results at <RES_DIR>/<date>. You can extract the IOR and MdTest results into CSV format by running:

cd ${WORK}/TESTS/daos_scaled_testing/frontera

# Get all ior and mdtest results
./get_results.py <RES_DIR>

# Get all ior and mdtest results, and email the result CSVs
./get_results.py <RES_DIR> --email first.last@intel.com

Storing and retrieving tests in the database

This is currently a work in progress. See https://github.com/daos-stack/daos_scaled_testing/blob/master/database/README.md for general usage.

Description of test cases

Frontera Test Plan

Validation Suite Workflow

These instructions cover running the Frontera performance suite for validation of Test Builds, Release Candidates, and performance-sensitive features.

Some details are my personal organizational style and can be tweaked as desired - @Dalton Bohning

Clone Test Scripts

Using a fresh clone of the test scripts keeps the configuration isolated from other ongoing work. For example, clone into a directory named after the ticket number.

cd $WORK/TESTS
git clone git@github.com:daos-stack/daos_scaled_testing.git daos-xxxx
cd daos-xxxx/frontera

Build DAOS

Using a separate build directory makes it easy to reference the build later - perhaps several versions later.

vim run_build.sh
# BUILD_DIR="${WORK}/BUILDS/v2.2.0-rc1"
# DAOS_BRANCH="release/2.2"
# DAOS_COMMIT="v2.2.0-rc1"

Running on a compute node is quicker, but not required.

idev
# Wait for session
./run_build.sh
# Get a cup of coffee

Common Reasons for Build Failures

TODO

Run Sanity Tests

Before executing hundreds of test cases, make sure the test scripts and DAOS are behaving as expected. For example, sometimes a simple daos or dmg interface change can cause every test to fail.

vim run_testlist.py
# env['JOBNAME'] = "daos-xxxx-sanity"
# env['DAOS_DIR'] = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR'] = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1_sanity"))
./run_testlist.py -r tests/sanity/

After the tests complete, sanity check the results. Every test should report Pass.

./get_results.py --tests all "${WORK}/RESULTS/v2.2.0-rc1_sanity"
cat "${WORK}/RESULTS/v2.2.0-rc1_sanity"/*.csv
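With many CSVs, a short script can flag failures instead of eyeballing the output. This is only a sketch: the 'Status' column name and 'Pass' value are assumptions about the CSV layout, so adjust them to match the actual get_results.py output.

```python
import csv
import io

# Assumed CSV layout (column 'Status', value 'Pass') - verify against
# the real get_results.py output before relying on this.
sample = "Test,Status\nior_sanity,Pass\nmdtest_sanity,Pass\n"

def all_passed(csv_text):
    """Return True only if every result row reports Pass."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return bool(rows) and all(r['Status'] == 'Pass' for r in rows)

print(all_passed(sample))  # → True
```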

Optionally, you could look through one or more logs to check for unexpected warnings or errors.

Run Validation Suite

Frontera only allows queuing 100 jobs per user, so tests will need to be executed in batches.
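To stay under the 100-job limit, a variant list can be split into queue-sized batches and submitted one batch at a time. A minimal illustration of the batching arithmetic (the numbers and helper are placeholders, not part of the test scripts):

```python
def batches(variants, limit=100):
    """Split a list of test variants into queue-sized batches."""
    return [variants[i:i + limit] for i in range(0, len(variants), limit)]

jobs = list(range(250))  # stand-in for 250 pending test variants
print([len(b) for b in batches(jobs)])  # → [100, 100, 50]
```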

It’s not strictly required to run the tests serially, but doing so seems to reduce the variance and interference between jobs. Because of how the priority queue works, running serially doesn’t necessarily take longer than running in parallel, especially since fewer jobs need to be re-run due to variance/noise/interference.

The --dryrun option to run_testlist.py helps make sure the expected variants will be run.

Basic Tests

There are currently 48 variants in this group.

vim run_testlist.py
# env['JOBNAME'] = "daos-xxxx"
# env['DAOS_DIR'] = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR'] = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1"))
./run_testlist.py -r --serial tests/basic

EC Tests

Use showq -u to get the SLURM_JOB_ID of the last job in the queue, if any.

There are currently 42 EC IOR variants.

There are also rf0 variants, but those are not usually run. The --filter option runs only the EC variants.

./run_testlist.py --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> --filter "oclass=EC_16P2G1 oclass=EC_16P2GX oclass=EC_8P2G1 oclass=EC_8P2GX oclass=EC_4P2G1 oclass=EC_4P2GX oclass=EC_2P1G1 oclass=EC_2P1GX" tests/ec_vs_rf0_complex/ior*

There are currently 42 EC MDTest variants.

./run_testlist.py --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> --filter "oclass=EC_16P2G1 oclass=EC_16P2GX oclass=EC_8P2G1 oclass=EC_8P2GX oclass=EC_4P2G1 oclass=EC_4P2GX oclass=EC_2P1G1 oclass=EC_2P1GX" tests/ec_vs_rf0_complex/mdtest*
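The same long --filter string is used for both the IOR and MDTest commands above, so it can be generated once and reused. A small sketch (the object-class list is taken straight from those commands; the snippet itself is just a convenience, not part of the test scripts):

```python
# Build the --filter argument used by both EC test commands above.
oclasses = ['EC_16P2G1', 'EC_16P2GX', 'EC_8P2G1', 'EC_8P2GX',
            'EC_4P2G1', 'EC_4P2GX', 'EC_2P1G1', 'EC_2P1GX']
filter_arg = ' '.join(f'oclass={oc}' for oc in oclasses)
print(filter_arg)
```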

Rebuild Tests

There is currently 1 variant in this group.

It’s helpful, but not necessary, to use a different results directory.

vim run_testlist.py
# env['JOBNAME'] = "daos-xxxx-rebuild"
# env['DAOS_DIR'] = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR'] = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1_rebuild"))
./run_testlist.py -r --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> tests/rebuild/load_ec.py

Max Scale Tests

There is currently 1 variant in this group.

It’s helpful, but not necessary, to use a different results directory.

vim run_testlist.py
# env['JOBNAME'] = "daos-xxxx-max"
# env['DAOS_DIR'] = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR'] = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1_max"))
./run_testlist.py -r --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> tests/max/max.py

TCP Tests

There is currently 1 variant in this group.

It’s helpful, but not necessary, to use a different results directory.

Don’t forget to change daos_server.yml and env_daos back to Verbs after running with TCP!

vim daos_server.yml
# provider: ofi+verbs;ofi_rxm             # <-- Comment out/remove this
# provider: ofi+tcp;ofi_rxm               # <-- Uncomment/add this
# - FI_OFI_RXM_DEF_TCP_WAIT_OBJ=pollfd    # <-- Uncomment/add this

vim env_daos
# PROVIDER="${2:-tcp}"
vim run_testlist.py
# env['JOBNAME'] = "daos-xxxx-tcp"
# env['DAOS_DIR'] = abspath(expandvars("${WORK}/BUILDS/v2.2.0-rc1/latest/daos"))
# env['RES_DIR'] = abspath(expandvars("${WORK}/RESULTS/v2.2.0-rc1_tcp"))
./run_testlist.py -r --serial --slurm_dep_afterany <LAST_SLURM_JOB_ID> --filter "daos_servers=2,daos_clients=8" tests/basic/mdtest_easy.py

Gather Results

TODO

Insert to Database

TODO

Generate Reports

TODO

Analyze

TODO