- Log in to Stampede2 and note your home folder's full path, for example: /home1/12345/samirrav
- Log in as root on the reserved Boro machine where you want to build the code
- Create the folder /home1/12345/samirrav (same path as on Stampede2)
- Clone the DAOS repo: git clone https://github.com/daos-stack/daos.git
- Run git merge origin/tanabarr/control-no-ipmctl (this patch is required to build the code without ipmctl on the Stampede2 hardware)
- Build the DAOS code (a build sketch follows)
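A minimal build sketch, assuming the scons-based build described in the DAOS README; exact options may differ on your branch:

    cd /home1/12345/samirrav/daos
    git submodule update --init        # pull in the DAOS submodules
    scons --build-deps=yes install     # installs into daos/install by default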
- Clone the IOR repo: git clone https://github.com/daos-stack/ior-hpc.git
- Build IOR and install it into the DAOS install folder /home1/12345/samirrav/daos/install/ (see the sketch below)
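A sketch of the IOR build, assuming the usual autotools flow of the ior-hpc fork; the exact name of the DAOS backend option is an assumption, so verify it with ./configure --help:

    cd /home1/12345/samirrav/ior-hpc
    ./bootstrap
    ./configure --prefix=/home1/12345/samirrav/daos/install \
                --with-daos=/home1/12345/samirrav/daos/install   # backend flag: check ./configure --help
    make && make install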
- Now create a tar.bz2 of the daos folder: tar -cjvf daos.tar.bz2 daos/
- Copy that daos.tar.bz2 to your local machine.
- From the local machine, scp it to the TACC login node (you can scp directly from the Boro system to the TACC machine, but I go through my local machine with WinSCP so I don't have to do token verification every time). The scp will take a few minutes.
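For example, from the local machine (stampede2.tacc.utexas.edu is the standard Stampede2 login host; you will be prompted for your password and TACC token):

    scp daos.tar.bz2 samirrav@stampede2.tacc.utexas.edu:/home1/12345/samirrav/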
- On TACC, untar the bz2 file in the same location as on Boro: cd /home1/12345/samirrav ; tar xvfj daos.tar.bz2
- Have your environment setup script ready to point at the bin/lib paths (same as we do in the Boro or Wolf environment); a sketch follows.
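A minimal sketch of such a setup script (the install path matches the location used above):

    # setup_env.sh - source this before running any DAOS binaries
    export DAOS_INSTALL=/home1/12345/samirrav/daos/install
    export PATH=$DAOS_INSTALL/bin:$DAOS_INSTALL/sbin:$PATH
    export LD_LIBRARY_PATH=$DAOS_INSTALL/lib:$DAOS_INSTALL/lib64:$LD_LIBRARY_PATH
    export CPATH=$DAOS_INSTALL/include:$CPATH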
- At this point you are ready to run the server, either manually or via an sbatch script.
Run manually:
- Request the number of machines you want, for example: idev -m 60 -N 3 -p skx-dev. This reserves 3 nodes for 60 minutes (skx-dev has a limit of 2 hours). You might have to wait for some time, but skx-dev is faster because you cannot request more than 4 nodes there. If you need more, use skx-normal, which has a limit of 128 nodes.
- Once Slurm reserves the machines, you land on the console of one of them. You can ssh to the other machines from that session or from a new SSH session. (You can access the machines for as long as you hold the reservation.)
- Now you can start the server manually: orterun --np 1 -x CPATH -x PATH -x LD_LIBRARY_PATH --report-uri /home1/12345/samirrav/hostsfile/uri.txt --enable-recovery daos_server start -i -a /home1/12345/samirrav/daos/install/tmp/ -o /home1/12345/samirrav/hostsfile/daos_server_psm2.yml --debug
- From another machine, create the pool using the dmg command, for example:
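A hedged example only; dmg flags vary across DAOS versions, so verify with dmg pool create --help (-i matches the insecure mode the server was started with, and the host and size below are placeholders):

    dmg -i -l <server-node>:10001 pool create --scm-size=4G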
Run via SBATCH:
- Use the sample script available from https://jira.hpdd.intel.com/secure/attachment/31378/script_backup.sh
- Change below parameter based on your requirement
Slurm Header Description #SBATCH -p skx-dev Partition name where you want to queue the JOB. Each partition has it's own limitation of node and number of hours node can be used. How many JOB can be queued on the partition, You can refer https://portal.tacc.utexas.edu/user-guides/stampede2#running-queues #SBATCH -N 3
# Total Number of nodes, In this case It's 3 [You need to have NO_OF_SERVERS + NO_OF_CLIENTS +1 one more system needs to be reserved, which will be used for initiating tests. So if you need 1 server and 1 client for testing,need to reserve 3 system for it. If you want 126 server and 1 CN need to reserve 128] #SBATCH -n 144
# Total Number of mpi tasks (48 x Total No of nodes) #SBATCH -t 02:00:00
Run time keep it close so in worst case some thing goes wrong it wont end up utilizing the node hours #SBATCH --mail-user=samir.raval@intel.com
Your email ID so once the script is launched you will notify when JOB started and when JOB finished with it's return code
For example:
Slurm Job_id=4546499 Name=test_daos1 Began, Queued time 04:30:54 (04:30:54 is the time the job waited in the queue before starting)
Slurm Job_id=4546499 Name=test_daos1 Ended, Run time 00:01:33, COMPLETED, ExitCode 0 (00:01:33 is the time the job took to complete)
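Putting the directives above together, the top of the sbatch script looks something like this (values are examples only; the job name and --mail-type lines are assumptions, not taken from the sample script):

    #!/bin/bash
    #SBATCH -J test_daos1                       # job name shown by squeue
    #SBATCH -p skx-dev                          # partition / queue
    #SBATCH -N 3                                # nodes = servers + clients + 1
    #SBATCH -n 144                              # MPI tasks = 48 x nodes
    #SBATCH -t 02:00:00                         # wall-clock limit
    #SBATCH --mail-user=samir.raval@intel.com   # where begin/end mails go
    #SBATCH --mail-type=ALL                     # assumption: mail on begin/end/fail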
- Change the DAOS server and client counts in the script:
  - DAOS_SERVERS: 1
  - DAOS_CLIENTS: 1
- Now run the sbatch script:
sbatch scripts/main.sh IOR
-----------------------------------------------------------------
Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/06739/samirrav)...OK
--> Verifying availability of your work dir (/work/06739/samirrav/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/06739/samirrav)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-dev)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (STAR-Intel)...OK
Submitted batch job 4551152
- The job gets queued and you will see "OK" printed for each check; if something goes wrong, the job will not be queued and you will need to debug the sbatch script.
- Check the status of the job with squeue; the output updates as the job gets resources and runs.
- login1(1038)$ squeue | grep samir
4551152 skx-dev test_dao samirrav PD 0:00 3 (Resources)
login1(1039)$
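Once the job starts, the ST column changes from PD (pending) to R (running). By default Slurm writes the job's output to slurm-<jobid>.out in the submission directory (unless the script sets #SBATCH -o), so you can follow or cancel the job with the usual Slurm commands, for example:

    squeue -u samirrav          # show only your jobs
    tail -f slurm-4551152.out   # follow the job output
    scancel 4551152             # cancel the job if needed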