How to Run DAOS on TACC Stampede2
Build DAOS locally and copy the content to the Stampede2 login node:
Log in to Stampede2 and get the full path of your home folder, for example: /home1/12345/samirrav
Log in as root on the reserved Boro machine where you want to build the code.
Create the folder /home1/12345/samirrav (this must match the TACC home folder path).
Clone the DAOS repo: git clone https://github.com/daos-stack/daos.git
Run git merge origin/tanabarr/control-no-ipmctl (this patch is required to build the code without ipmctl on Stampede2 hardware).
Build the DAOS code.
Clone the IOR repo: git clone https://github.com/daos-stack/ior-hpc.git
Build the IOR code and install it into the DAOS install folder, /home1/12345/samirrav/daos/install/
Create a tar.bz2 archive of the daos folder: tar -cjvf daos.tar.bz2 daos/
Copy daos.tar.bz2 to your local machine.
From the local machine, scp the archive to the TACC login node. (You can scp directly from the Boro system to the TACC machine, but I do it locally via WinSCP because I don't want to go through token verification every time.) The copy takes a few minutes.
On the Stampede2 node, extract the archive at the same location as on Boro: cd /home1/12345/samirrav ; tar xvfj daos.tar.bz2
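The archive and extract steps can be sketched as two small shell functions with a guard on the path match. This is only an illustration: the names TACC_HOME, pack_daos and unpack_daos are hypothetical, and the guard assumes that $HOME on Stampede2 is the home folder path noted in the first step.

```shell
# Example path from this page; replace with your own Stampede2 home folder.
TACC_HOME=/home1/12345/samirrav

# On the build machine: archive the daos/ tree located under $TACC_HOME.
pack_daos() {
    ( cd "$TACC_HOME" && tar -cjvf daos.tar.bz2 daos/ )
}

# On Stampede2: refuse to extract anywhere other than the build-time path.
# Assumption: $HOME on Stampede2 equals the home folder path used for the build.
unpack_daos() {
    if [ "$HOME" != "$TACC_HOME" ]; then
        echo "error: HOME is $HOME but the build used $TACC_HOME" >&2
        return 1
    fi
    ( cd "$TACC_HOME" && tar -xvjf daos.tar.bz2 )
}
```

Run pack_daos on the build machine and unpack_daos on Stampede2 after the archive has been copied into the home folder.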
Have your environment setup script ready with the bin and lib paths exported (as we do in the Boro or Wolf environments).
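A minimal sketch of such an environment script follows. It assumes the install tree sits under /home1/12345/samirrav/daos/install and uses bin, lib, lib64 and include subdirectories; that layout is an assumption, so adjust it to what your build produced. DAOS_INSTALL is an illustrative variable name.

```shell
# Source this file; do not execute it.
# DAOS_INSTALL is an illustrative name; the default is the example path from this page.
DAOS_INSTALL="${DAOS_INSTALL:-/home1/12345/samirrav/daos/install}"

# Assumed layout: bin/, lib/, lib64/ and include/ under the install prefix.
export PATH="$DAOS_INSTALL/bin:$PATH"
export LD_LIBRARY_PATH="$DAOS_INSTALL/lib:$DAOS_INSTALL/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export CPATH="$DAOS_INSTALL/include${CPATH:+:$CPATH}"
```

These are the same three variables that the orterun command on this page forwards with -x, so source the script in the session that launches the server.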
At this point you are ready to run the server. You can either do it manually or run an sbatch script.
Run manually (During Development):
Reserve the number of machines you need with idev. For example, idev -m 60 -N 3 -p skx-dev reserves 3 nodes for 60 minutes (skx-dev has a limit of 2 hours). You might have to wait for some time, but skx-dev is faster because you cannot request more than 4 nodes. If you need more, use skx-normal, which has a limit of 128. Refer to https://portal.tacc.utexas.edu/user-guides/stampede2#running-queues
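The queue choice can be sketched as a small helper. The node limits (4 for skx-dev, 128 for skx-normal) are the ones quoted on this page and may change, so check the user guide. The function name pick_queue is hypothetical.

```shell
# pick_queue N: print the queue name for an N-node request.
# Limits are those quoted on this page, not queried from Slurm.
pick_queue() {
    nodes="$1"
    if [ "$nodes" -le 4 ]; then
        echo "skx-dev"
    elif [ "$nodes" -le 128 ]; then
        echo "skx-normal"
    else
        echo "error: $nodes nodes exceeds the skx-normal limit of 128" >&2
        return 1
    fi
}

# Print (not run) the idev request for 3 nodes for 60 minutes.
NODES=3
MINUTES=60
echo "idev -m $MINUTES -N $NODES -p $(pick_queue "$NODES")"
```

The last line prints idev -m 60 -N 3 -p skx-dev, which you can then run by hand.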
Once Slurm reserves the machines, you will be on the console of one of them. You can ssh to the other machines from the same session or from another Stampede2 login session; access lasts as long as the reservation does. Once the reservation ends and the nodes are released you lose ssh access, so make sure you collect all the logs needed for debugging in case of failure.
Now you can start the server manually: orterun --np 1 -x CPATH -x PATH -x LD_LIBRARY_PATH --report-uri /home1/12345/samirrav/hostsfile/uri.txt --enable-recovery daos_server start -i -a /home1/12345/samirrav/daos/install/tmp/ -o /home1/12345/samirrav/hostsfile/daos_server_psm2.yml --debug
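For readability, the same command can be wrapped in a function with the common path prefix in a variable. The flags and their order are exactly those of the command above. The names DAOS_BASE and start_server are illustrative only.

```shell
# Example home path from this page; replace with your own.
DAOS_BASE=/home1/12345/samirrav

# start_server: launch one daos_server through orterun, forwarding the
# environment variables exported by the setup script.
start_server() {
    orterun --np 1 \
        -x CPATH -x PATH -x LD_LIBRARY_PATH \
        --report-uri "$DAOS_BASE/hostsfile/uri.txt" \
        --enable-recovery \
        daos_server start -i \
        -a "$DAOS_BASE/daos/install/tmp/" \
        -o "$DAOS_BASE/hostsfile/daos_server_psm2.yml" \
        --debug
}
```

Call start_server from the reserved node after sourcing your environment script.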
From another machine, create the pool using the dmg command, or run any other client-side operation.
Run via SBATCH:
Use the sample script available from https://jira.hpdd.intel.com/secure/attachment/31378/script_backup.sh
Change the parameters below based on your requirements:
Set the number of DAOS servers and clients.
Create the log directory, for example /scratch/12345/samirrav/Log, and make sure it matches the path in the sbatch script.
Now run the sbatch script.
sbatch scripts/main.sh IOR
-----------------------------------------------------------------
Welcome to the Stampede2 Supercomputer12345
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/12345/samirrav)...OK
--> Verifying availability of your work dir (/work/12345/samirrav/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/12345/samirrav)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (STAR-Intel)...OK
Submitted batch job 4551152
The job is queued, and each check above prints OK as it passes. If something goes wrong at any stage, the job is not queued and the user needs to debug the sbatch script.
Check the status of the job using the command below. The status updates as the job gets its resources and runs.
login1(1038)$ squeue | grep samir
4551152 skx-dev test_dao samirrav PD 0:00 3 (Resources)
Once the job is finished, the logs are copied to the Log/4551152/ folder. This includes all the server, client, and agent logs from every system that was part of the job.
The user can cancel the job at any time using scancel JOB_ID (for example, scancel 4551152).
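Instead of checking squeue by hand, the status check can be wrapped in a polling loop. This is a sketch that relies only on standard squeue options (-h to drop the header, -j to select a job, -o "%t" to print the compact state such as PD or R). The function names job_state and wait_for_job are hypothetical.

```shell
# job_state JOB_ID: print the compact Slurm state (PD, R, CG, ...),
# or nothing once the job has left the queue.
job_state() {
    squeue -h -j "$1" -o "%t" 2>/dev/null
}

# wait_for_job JOB_ID [INTERVAL_SECONDS]: poll until the job is no longer listed.
wait_for_job() {
    while state="$(job_state "$1")" && [ -n "$state" ]; do
        echo "job $1: $state"
        sleep "${2:-30}"
    done
    echo "job $1 is no longer in the queue"
}
```

For example, wait_for_job 4551152 60 polls once a minute. To list only your own jobs without grep, squeue -u "$USER" also works.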
Avocado setup on TACC (with Python 2):
Packages that need to be installed:
pip install --user avocado-framework==57.0
pip install --user avocado_framework_plugin_loader_yaml==57.0
pip install --user avocado_framework_plugin_result_html==57.0
pip install --user avocado_framework_plugin_varianter_yaml_to_mux==57.0
pip install --user clustershell
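Since the four Avocado packages share one pinned version, the installs above can also be written as a loop. This is a convenience sketch; the variable and function names are illustrative.

```shell
# Version and package names are the ones listed above.
AVOCADO_VERSION=57.0
AVOCADO_PACKAGES="avocado-framework
avocado_framework_plugin_loader_yaml
avocado_framework_plugin_result_html
avocado_framework_plugin_varianter_yaml_to_mux"

# install_avocado: install the pinned Avocado packages, then clustershell.
install_avocado() {
    for pkg in $AVOCADO_PACKAGES; do
        pip install --user "$pkg==$AVOCADO_VERSION" || return 1
    done
    pip install --user clustershell
}
```

With --user, pip places the avocado command under ~/.local/bin on Linux, so make sure that directory is on your PATH before running the sanity test.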
Avocado Sanity test:
login2(1221)$ avocado variants --tree -m daos/src/tests/ftest/pool/attribute.yaml
Multiplex tree representation:
 ┗━━ run
      ┣━━ hosts
      ┣━━ server_config
      ┗━━ attrtests
           ┣━━ createmode
           ┣━━ createset
           ┣━━ createsize
           ┣━━ name_handles
           ┃    ╚══ validlongname
           ┗━━ value_handles
                ╚══ validvalue
DAOS patch to run Avocado test on TACC:
Avocado test run: