  • Use the sample script available from https://jira.hpdd.intel.com/secure/attachment/31378/script_backup.sh 
  • Change the parameters below based on your requirements.


    Slurm header and description

    #SBATCH -p skx-dev
        Partition name where the job is queued. Each partition has its own limits on the number of nodes, the number of hours a node can be used, and how many jobs can be queued. See https://portal.tacc.utexas.edu/user-guides/stampede2#running-queues

    #SBATCH -N 3
        Total number of nodes, 3 in this case. Reserve NO_OF_SERVERS + NO_OF_CLIENTS + 1 systems; the extra system is used for initiating the tests. For example, 1 server and 1 client require 3 systems, and 126 servers and 1 client node (CN) require 128.

    #SBATCH -n 144
        Total number of MPI tasks (48 x total number of nodes).

    #SBATCH -t 02:00:00
        Run time limit. Keep it close to the expected run time so that, if something goes wrong, the job does not use up extra node hours.

    #SBATCH --mail-user=samir.raval@intel.com
        Your email address. Once the script is launched, you are notified when the job starts and when it finishes, along with its return code.

    For example:

    Slurm Job_id=4546499 Name=test_daos1 Began, Queued time 04:30:54 (04:30:54 is the time the job waited in the queue before it started)

    Slurm Job_id=4546499 Name=test_daos1 Ended, Run time 00:01:33, COMPLETED, ExitCode 0 (00:01:33 is the time the job took to complete)

  • Set the number of DAOS servers and clients.


    System used for    Count
    DAOS_SERVERS       1
    DAOS_CLIENTS       1
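The -N and -n header values follow from these two counts. Here is a minimal sketch of that arithmetic, assuming 48 MPI tasks per node as given in the header description. The function names total_nodes and total_tasks are hypothetical helpers for illustration and are not part of the sample script.

```shell
#!/bin/sh
# Sketch: derive the "#SBATCH -N" and "#SBATCH -n" values from the
# DAOS server and client counts. Helper names are illustrative only.

DAOS_SERVERS=1
DAOS_CLIENTS=1
TASKS_PER_NODE=48   # from the rule "48 x total number of nodes"

# total_nodes SERVERS CLIENTS: servers + clients + 1 node to initiate tests
total_nodes() {
    echo $(( $1 + $2 + 1 ))
}

# total_tasks NODES: MPI tasks for the given node count
total_tasks() {
    echo $(( $1 * TASKS_PER_NODE ))
}

nodes=$(total_nodes "$DAOS_SERVERS" "$DAOS_CLIENTS")
tasks=$(total_tasks "$nodes")

# Print the two header lines to copy into the sbatch script.
echo "#SBATCH -N $nodes"
echo "#SBATCH -n $tasks"
```

With 1 server and 1 client this prints -N 3 and -n 144, matching the header values shown earlier.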
  • Now submit the sbatch script:
    sbatch scripts/main.sh IOR

    -----------------------------------------------------------------
              Welcome to the Stampede2 Supercomputer
    -----------------------------------------------------------------

    No reservation for this job
    --> Verifying valid submit host (login1)...OK
    --> Verifying valid jobname...OK
    --> Enforcing max jobs per user...OK
    --> Verifying availability of your home dir (/home1/0673912345/samirrav)...OK
    --> Verifying availability of your work dir (/work/0673912345/samirrav/stampede2)...OK
    --> Verifying availability of your scratch dir (/scratch/0673912345/samirrav)...OK
    --> Verifying valid ssh keys...OK
    --> Verifying access to desired queue (skx-dev)...OK
    --> Verifying job request is within current queue limits...OK
    --> Checking available allocation (STAR-Intel)...OK
    Submitted batch job 4551152


  • The job is queued and each check prints "OK". If something goes wrong, the job is not queued and the user needs to debug the sbatch script.
  • Check the status of the job with squeue. The status updates as the job gets resources and runs.
    login1(1038)$ squeue | grep samir
    4551152 skx-dev test_dao samirrav PD 0:00 3 (Resources)
    login1(1039)$
  • Once the job finishes, the logs are copied to the Log/4551152/ folder. This includes all the server, client, and agent logs from every system that was part of the job.
  • The user can cancel the job at any time with scancel JOB_ID (for example, scancel 4551152).
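
The job ID printed at submission ties the later steps together: it selects the job in squeue, names the log folder, and is the argument to scancel. Here is a minimal sketch that extracts the ID from captured sbatch output and prints those follow-up commands. The function name job_id_from_output is a hypothetical helper, and the sample text is copied from the run shown above rather than produced by a live submission.

```shell
#!/bin/sh
# Sketch: pull the job ID out of captured sbatch output, then print the
# status command, log folder, and cancel command for that job.

# job_id_from_output: reads sbatch output on stdin and prints the numeric
# ID from the "Submitted batch job <id>" line (prints nothing if absent).
job_id_from_output() {
    sed -n 's/^[[:space:]]*Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Sample text from the run above; in practice this would be the captured
# output of: sbatch scripts/main.sh IOR
sample_output='--> Checking available allocation (STAR-Intel)...OK
Submitted batch job 4551152'

job_id=$(printf '%s\n' "$sample_output" | job_id_from_output)

if [ -n "$job_id" ]; then
    echo "status: squeue -j $job_id"
    echo "logs:   Log/$job_id/"
    echo "cancel: scancel $job_id"
else
    echo "job was not queued; check the sbatch script" >&2
fi
```

The commands are only printed, so the sketch runs on a machine without Slurm installed.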