...

NOTE: THESE ARE NOT TO BE APPLIED TO 2.0 TESTING. USE THE QUICKSTARTS IN THE 2.0 ONLINE DOCUMENTATION.


Introduction

...

The quick start requires a minimum of 1 server with PMEM and SSDs connected via an InfiniBand storage network, plus 1 client node and 1 admin node without PMEM/SSD but on the same InfiniBand storage network.

All nodes should have a base openSUSE or SLES 15.2 installation, along with:

  • sudo access configured
  • password-less ssh configured
  • pdsh installed (or some other means of running multiple remote commands in parallel)

In addition the server nodes should also have:

For the use of the commands outlined on this page the following shell variables will need to be defined:

  • ADMIN_NODE
  • CLIENT_NODES
  • SERVER_NODES
  • ALL_NODES

Install pdsh on the admin node

Code Block
sudo zypper install pdsh


For example, if one wanted to use node-1 as their admin node, node-2 and node-3 as client nodes, and node-4, node-5 and node-6 as their server nodes, then these variables would be defined as:

Code Block
ADMIN_NODE=node-1
CLIENT_NODES=node-2,node-3
SERVER_NODES=node-4,node-5,node-6
ALL_NODES=$ADMIN_NODE,$CLIENT_NODES,$SERVER_NODES

...

Note

If a client node is also serving as an admin node then exclude $ADMIN_NODE from the ALL_NODES assignment to prevent duplication, e.g.

ALL_NODES=$CLIENT_NODES,$SERVER_NODES
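As an alternative to editing the ALL_NODES assignment by hand, the duplicate can be dropped automatically. The following is a minimal sketch, assuming a bash-compatible shell; join_nodes is a hypothetical helper name that is not part of DAOS or pdsh, and the node names are only examples.

```shell
# Hypothetical helper (not part of DAOS or pdsh): join comma-separated
# node lists into one list, keeping only the first occurrence of each node.
join_nodes() {
    local IFS=,
    # "$*" joins the arguments with commas; tr puts one node per line,
    # awk drops empty lines and repeats, paste rebuilds the comma list.
    printf '%s\n' "$*" | tr ',' '\n' | awk 'NF && !seen[$0]++' | paste -sd, -
}

# Example: node-1 is both the admin node and a client node
ADMIN_NODE=node-1
CLIENT_NODES=node-1,node-2
SERVER_NODES=node-4,node-5
ALL_NODES=$(join_nodes "$ADMIN_NODE" "$CLIENT_NODES" "$SERVER_NODES")
echo "$ALL_NODES"
```

With this approach the same assignment works whether or not the admin node doubles as a client node.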

Set-Up

Please refer here for the initial set-up, which consists of installing the RPMs, generating and setting up certificates, setting up config files, and starting the servers and agents.

Note

For this quick start, the daos-tests package will need to be installed on the client nodes.



The following applications will be run from a client node:

...

Code Block
SHARED_DIR=<shared dir by all nodes>
export FI_UNIVERSE_SIZE=2048
export OFI_INTERFACE=eth0
export CRT_PHY_ADDR_STR="ofi+sockets"

# Run self_test --help for more details on the parameters

# Generate the attach info file (SHARED_DIR must be writable by sudo)
sudo daos_agent -o /etc/daos/daos_agent.yml -l $SHARED_DIR/daos_agent.log dump-attachinfo -o $SHARED_DIR/daos_server.attach_info_tmp

# Run (--endpoint takes ranks:tags; for 4 servers use --endpoint 0-3:0):
self_test --path $SHARED_DIR --group-name daos_server --endpoint 0-1:0


# Sample output:

Adding endpoints:                                                                                   
  ranks: 0-1 (# ranks = 2)                                                                          
  tags: 0 (# tags = 1)                                                                              
Warning: No --master-endpoint specified; using this command line application as the master endpoint 
Self Test Parameters:                                                                               
  Group name to test against: daos_server                                                           
  # endpoints:                2                                                                     
  Message sizes:              [(200000-BULK_GET 200000-BULK_PUT), (200000-BULK_GET 0-EMPTY), (0-EMPTY 200000-BULK_PUT), (200000-BULK_GET 1000-IOV), (1000-IOV 200000-BULK_PUT), (1000-IOV 1000-IOV), (1000-IOV 0-EMPTY), (0-EMPTY 1000-IOV), (0-EMPTY 0-EMPTY)]                                                                                                             
  Buffer addresses end with:  <Default>                                                                                                                                               
  Repetitions per size:       20000                                                                                                                                                   
  Max inflight RPCs:          1000                                                                                                                                                    

CLI [rank=0 pid=3255]   Attached daos_server
##################################################
Results for message size (200000-BULK_GET 200000-BULK_PUT) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 222.67
        RPC Throughput (RPCs/sec): 584
        RPC Latencies (us):           
                Min    : 27191        
                25th  %: 940293       
                Median : 1678137      
                75th  %: 2416765      
                Max    : 3148987      
                Average: 1671626      
                Std Dev: 821872.40    
        RPC Failures: 0               

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 2416764                             
                1:0 - 969063                              

##################################################
Results for message size (200000-BULK_GET 0-EMPTY) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 112.08
        RPC Throughput (RPCs/sec): 588
        RPC Latencies (us):           
                Min    : 2880         
                25th  %: 1156162      
                Median : 1617356      
                75th  %: 2185604      
                Max    : 2730569      
                Average: 1659133      
                Std Dev: 605053.68    
        RPC Failures: 0               

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 2185589                             
                1:0 - 1181363                             

##################################################
Results for message size (0-EMPTY 200000-BULK_PUT) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 112.11
        RPC Throughput (RPCs/sec): 588
        RPC Latencies (us):           
                Min    : 4956         
                25th  %: 747786       
                Median : 1558111      
                75th  %: 2583834      
                Max    : 3437395      
                Average: 1659959      
                Std Dev: 1078975.59   
        RPC Failures: 0               

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 2583826                             
                1:0 - 776862                              

##################################################
Results for message size (200000-BULK_GET 1000-IOV) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 112.57
        RPC Throughput (RPCs/sec): 587
        RPC Latencies (us):           
                Min    : 2755         
                25th  %: 12341        
                Median : 1385716      
                75th  %: 3393178      
                Max    : 3399349      
                Average: 1660125      
                Std Dev: 1446054.82   
        RPC Failures: 0               

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 12343                               
                1:0 - 3393174                             

##################################################
Results for message size (1000-IOV 200000-BULK_PUT) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 112.68
        RPC Throughput (RPCs/sec): 588
        RPC Latencies (us):           
                Min    : 4557         
                25th  %: 522380       
                Median : 1640322      
                75th  %: 2725419      
                Max    : 3441963      
                Average: 1661254      
                Std Dev: 1147206.09   
        RPC Failures: 0               

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 600190                              
                1:0 - 2725402                             

##################################################
Results for message size (1000-IOV 1000-IOV) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 88.87
        RPC Throughput (RPCs/sec): 46595
        RPC Latencies (us):             
                Min    : 1165           
                25th  %: 21374          
                Median : 21473          
                75th  %: 21572          
                Max    : 21961          
                Average: 20923          
                Std Dev: 2786.99        
        RPC Failures: 0                 

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 21430                               
                1:0 - 21516                               

##################################################
Results for message size (1000-IOV 0-EMPTY) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 59.03
        RPC Throughput (RPCs/sec): 61902
        RPC Latencies (us):             
                Min    : 1164           
                25th  %: 15544          
                Median : 16104          
                75th  %: 16575          
                Max    : 17237          
                Average: 15696          
                Std Dev: 2126.37        
        RPC Failures: 0                 

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 15579                               
                1:0 - 16571                               

##################################################
Results for message size (0-EMPTY 1000-IOV) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 46.93
        RPC Throughput (RPCs/sec): 49209
        RPC Latencies (us):             
                Min    : 945            
                25th  %: 20327          
                Median : 20393
                75th  %: 20434
                Max    : 20576
                Average: 19821
                Std Dev: 2699.27
        RPC Failures: 0

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 20393
                1:0 - 20393

##################################################
Results for message size (0-EMPTY 0-EMPTY) (max_inflight_rpcs = 1000):

Master Endpoint 2:0
-------------------
        RPC Bandwidth (MB/sec): 0.00
        RPC Throughput (RPCs/sec): 65839
        RPC Latencies (us):
                Min    : 879
                25th  %: 14529
                Median : 15108
                75th  %: 15650
                Max    : 16528
                Average: 14765
                Std Dev: 2087.87
        RPC Failures: 0

        Endpoint results (rank:tag - Median Latency (us)):
                0:0 - 14569
                1:0 - 15649
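The self_test report above is long, and the figures usually compared between runs are the bandwidth and throughput for each message size. The following is a minimal sketch of a filter that reduces a saved report to one line per message size. It assumes the report has the layout of the sample output above; summarize_self_test is a hypothetical helper name, not a DAOS tool.

```shell
# Hypothetical helper (not part of DAOS): print one tab-separated line per
# message size: <size pair> <bandwidth MB/sec> <throughput RPCs/sec>.
# Reads the files given as arguments, or standard input if none are given.
summarize_self_test() {
    awk '
        /^Results for message size/ {
            size = $0
            sub(/^[^(]*\(/, "", size)   # drop everything up to the first "("
            sub(/\).*$/, "", size)      # drop everything from the first ")"
        }
        /RPC Bandwidth \(MB\/sec\):/    { bw = $NF }
        /RPC Throughput \(RPCs\/sec\):/ { printf "%s\t%s\t%s\n", size, bw, $NF }
    ' "$@"
}

# Example using two result blocks copied from the sample output
summarize_self_test <<'EOF'
Results for message size (200000-BULK_GET 200000-BULK_PUT) (max_inflight_rpcs = 1000):
        RPC Bandwidth (MB/sec): 222.67
        RPC Throughput (RPCs/sec): 584
Results for message size (0-EMPTY 0-EMPTY) (max_inflight_rpcs = 1000):
        RPC Bandwidth (MB/sec): 0.00
        RPC Throughput (RPCs/sec): 65839
EOF
```

A saved report can be passed as a file argument, or the self_test output can be piped straight into the function.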


...

Code Block
module load gnu-openmpi/3.1.6

# or

export LD_LIBRARY_PATH=<openmpi lib path>:$LD_LIBRARY_PATH
export PATH=<openmpi bin path>:$PATH


export D_LOG_FILE=/tmp/daos_perf.log


# Single process
daos_perf -a 64 -d 256 -c R2S -P 20G -T daos -s 1k -R "U;pV" -g /etc/daos/daos_control.yml



# MPI
orterun --enable-recovery -x D_LOG_FILE=/tmp/daos_perf_daos.log --host <host name>:4 --map-by node --mca btl_openib_warn_default_gid_prefix "0" --mca btl "tcp,self" --mca oob "tcp" --mca pml "ob1" --mca btl_tcp_if_include "eth0" --np 4 --tag-output /usr/bin/daos_perf -a 64 -d 256 -c R2S -P 20G -T daos -s 1k -R "U;pV" -g /etc/daos/daos_control.yml

# Sample Output:

Test :
        DAOS R2S (full stack, 2 replica)
Pool :
        9c88849b-b0d6-4444-bb39-42769a7a1ef5
Parameters :
        pool size     : SCM: 20480 MB, NVMe: 0 MB
        credits       : -1 (sync I/O for -ve)
        obj_per_cont  : 1 x 1 (procs)
        dkey_per_obj  : 256
        akey_per_dkey : 64
        recx_per_akey : 16
        value type    : single
        stride size   : 1024
        zero copy     : no
        VOS file      : <NULL>
Running test=UPDATE
Running UPDATE test (iteration=1)
UPDATE successfully completed:
        duration : 91.385233  sec
        bandwith : 2.801      MB/sec
        rate     : 2868.56    IO/sec
        latency  : 348.607    us (nonsense if credits > 1)
Duration across processes:
        MAX duration : 91.385233  sec
        MIN duration : 91.385233  sec
        Average duration : 91.385233  sec
Completed test=UPDATE 
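The figures in the UPDATE summary can be cross-checked against the parameters listed above them: the number of I/Os is dkey_per_obj x akey_per_dkey x recx_per_akey, the rate is that count divided by the duration, and the bandwidth is the rate multiplied by the stride size. The following is a minimal sketch of that arithmetic. It assumes that "MB" in the report means 2^20 bytes, which is inferred from the sample numbers rather than documented here; perf_summary is a hypothetical helper name.

```shell
# Hypothetical helper (not part of DAOS): recompute the daos_perf summary.
# Arguments: dkeys akeys recx stride_bytes duration_sec
perf_summary() {
    awk -v d="$1" -v a="$2" -v r="$3" -v s="$4" -v t="$5" 'BEGIN {
        ios  = d * a * r            # total number of I/O operations
        rate = ios / t              # I/Os per second
        bw   = rate * s / 1048576   # assumption: 1 MB = 2^20 bytes
        lat  = t / ios * 1e6        # mean time per synchronous I/O, in us
        printf "ios=%d rate=%.2f IO/sec bandwidth=%.3f MB/sec latency=%.3f us\n", ios, rate, bw, lat
    }'
}

# Values from the sample run: -d 256, -a 64, 16 recx per akey, 1k stride
perf_summary 256 64 16 1024 91.385233
# prints: ios=262144 rate=2868.56 IO/sec bandwidth=2.801 MB/sec latency=348.607 us
```

These match the rate, bandwidth and latency lines in the sample output, which suggests that with synchronous I/O (credits = -1) the reported latency is simply the inverse of the rate.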

...

Code Block
dmg pool create --label=daos_test_pool --size=500G
  
# Sample output 
Creating DAOS pool with automatic storage allocation: 500 GB NVMe + 6.00% SCM

Pool created with 6.00% SCM/NVMe ratio
---------------------------------------

  UUID          : acf889b6-f290-4d7b-823a-5fae0014a64d

  Service Ranks : 0

  Storage Ranks : 0

  Total Size    : 530 GB

  SCM           : 30 GB (30 GB / rank)

  NVMe          : 500 GB (500 GB / rank)


dmg pool list

# Sample output
Pool UUID                            Svc Replicas
--------------                       ----------------
acf889b6-f290-4d7b-823a-5fae0014a64d 0

DAOS_POOL=<pool uuid>   # define on all clients
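Rather than copying the UUID by hand, it can be picked out of the pool creation report. The following is a minimal sketch, assuming the report contains a "UUID : <value>" line as in the sample output above; pool_uuid_from_output is a hypothetical helper name, not a dmg subcommand.

```shell
# Hypothetical helper (not part of DAOS): print the value of the first
# "UUID : <value>" line found in the input (files or standard input).
pool_uuid_from_output() {
    awk '$1 == "UUID" && $2 == ":" { print $3; exit }' "$@"
}

# Example using lines copied from the sample output
DAOS_POOL=$(pool_uuid_from_output <<'EOF'
Pool created with 6.00% SCM/NVMe ratio
---------------------------------------

  UUID          : acf889b6-f290-4d7b-823a-5fae0014a64d

  Service Ranks : 0
EOF
)
echo "$DAOS_POOL"
```

In practice the pool creation output would be piped into the function in place of the here-document. The variable still has to be defined on every client.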

...

Code Block
daos cont create --type=POSIX --oclass=SX --pool=$DAOS_POOL
DAOS_CONT=<cont uuid>   # define on all clients

...

Code Block
# Create directory
mkdir -p /tmp/daos_dfuse/daos_test

# Use dfuse to mount the daos container to the above directory
dfuse --container $DAOS_CONT --disable-caching --mountpoint /tmp/daos_dfuse/daos_test --pool $DAOS_POOL

# Verify that the filesystem type is dfuse
df -h

# Sample output
dfuse                                                       500G   17G   34G  34% /tmp/daos_dfuse/daos_test
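The check above can also be scripted so that later steps run only when the mount is in place. The following is a minimal sketch, assuming df lists the mount with "dfuse" in the first column and the mount point in the last column, as in the sample output; is_dfuse_mount is a hypothetical helper name.

```shell
# Hypothetical helper (not part of DAOS): succeed if df-style input on
# standard input lists the given mount point with filesystem "dfuse".
is_dfuse_mount() {
    awk -v mp="$1" '$1 == "dfuse" && $NF == mp { found = 1 } END { exit !found }'
}

# Example using the sample df line; on a client, pipe in `df -h` instead
sample='dfuse    500G   17G   34G  34% /tmp/daos_dfuse/daos_test'
if printf '%s\n' "$sample" | is_dfuse_mount /tmp/daos_dfuse/daos_test; then
    echo "dfuse mount found"
fi
```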

...

Code Block
module load gnu-mpich/3.4~a2

# or

export LD_LIBRARY_PATH=<mpich lib path>:$LD_LIBRARY_PATH
export PATH=<mpich bin path>:$PATH


# Download ior source 
git clone https://github.com/hpc/ior.git 

# Build IOR 
cd ior 
./bootstrap 
mkdir build;cd build 
MPICC=mpicc ../configure --with-daos=/usr --prefix=<your dir> 
make 
make install 

# Add IOR to the paths: add <your dir>/lib to LD_LIBRARY_PATH and <your dir>/bin to PATH

...

Code Block
mpirun -hosts <hosts> -np 16 --ppn 16 dcp --bufsize 64MB --chunksize 128MB /tmp/daos_dfuse/daos_test daos://$DAOS_POOL/$DAOS_CONT3


#Sample output

[2021-04-29T23:55:52] Walking /tmp/daos_dfuse/daos_test
[2021-04-29T23:55:52] Walked 11 items in 0.026 secs (417.452 items/sec) ...
[2021-04-29T23:55:52] Walked 11 items in 0.026 seconds (415.641 items/sec)
[2021-04-29T23:55:52] Copying to /
[2021-04-29T23:55:52] Items: 11
[2021-04-29T23:55:52]   Directories: 1
[2021-04-29T23:55:52]   Files: 10
[2021-04-29T23:55:52]   Links: 0
[2021-04-29T23:55:52] Data: 10.000 GiB (1.000 GiB per file)
[2021-04-29T23:55:52] Creating 1 directories
[2021-04-29T23:55:52] Creating 10 files.
[2021-04-29T23:55:52] Copying data.
[2021-04-29T23:56:53] Copied 1.312 GiB (13%) in 61.194 secs (21.963 MiB/s) 405 secs left ...
[2021-04-29T23:58:11] Copied 6.000 GiB (60%) in 139.322 secs (44.099 MiB/s) 93 secs left ...
[2021-04-29T23:58:11] Copied 10.000 GiB (100%) in 139.322 secs (73.499 MiB/s) done
[2021-04-29T23:58:11] Copy data: 10.000 GiB (10737418240 bytes)
[2021-04-29T23:58:11] Copy rate: 73.499 MiB/s (10737418240 bytes in 139.322 seconds)
[2021-04-29T23:58:11] Syncing data to disk.
[2021-04-29T23:58:11] Sync completed in 0.006 seconds.
[2021-04-29T23:58:11] Fixing permissions.
[2021-04-29T23:58:11] Updated 11 items in 0.002 seconds (4822.579 items/sec)
[2021-04-29T23:58:11] Syncing directory updates to disk.
[2021-04-29T23:58:11] Sync completed in 0.001 seconds.
[2021-04-29T23:58:11] Started: Apr-29-2021,23:55:52
[2021-04-29T23:58:11] Completed: Apr-29-2021,23:58:11
[2021-04-29T23:58:11] Seconds: 139.335
[2021-04-29T23:58:11] Items: 11
[2021-04-29T23:58:11]   Directories: 1
[2021-04-29T23:58:11]   Files: 10
[2021-04-29T23:58:11]   Links: 0
[2021-04-29T23:58:11] Data: 10.000 GiB (10737418240 bytes)
[2021-04-29T23:58:11] Rate: 73.492 MiB/s (10737418240 bytes in 139.335 seconds)


# Create directory
mkdir /tmp/datamover3

# Run
mpirun -hosts <host> --ppn 16 -np 16 dcp --bufsize 64MB --chunksize 128MB daos://$DAOS_POOL/$DAOS_CONT3 /tmp/datamover3/

# Sample output
[2021-04-30T00:02:14] Walking /
[2021-04-30T00:02:15] Walked 12 items in 0.112 secs (107.354 items/sec) ...
[2021-04-30T00:02:15] Walked 12 items in 0.112 seconds (107.236 items/sec)
[2021-04-30T00:02:15] Copying to /tmp/datamover3
[2021-04-30T00:02:15] Items: 12
[2021-04-30T00:02:15]   Directories: 2
[2021-04-30T00:02:15]   Files: 10
[2021-04-30T00:02:15]   Links: 0
[2021-04-30T00:02:15] Data: 10.000 GiB (1.000 GiB per file)
[2021-04-30T00:02:15] Creating 2 directories
[2021-04-30T00:02:15] Original directory exists, skip the creation: `/tmp/datamover3/' (errno=17 File exists)
[2021-04-30T00:02:15] Creating 10 files.
[2021-04-30T00:02:15] Copying data.
[2021-04-30T00:03:15] Copied 1.938 GiB (19%) in 60.341 secs (32.880 MiB/s) 251 secs left ...
[2021-04-30T00:03:46] Copied 8.750 GiB (88%) in 91.953 secs (97.441 MiB/s) 13 secs left ...
[2021-04-30T00:03:46] Copied 10.000 GiB (100%) in 91.953 secs (111.361 MiB/s) done
[2021-04-30T00:03:46] Copy data: 10.000 GiB (10737418240 bytes)
[2021-04-30T00:03:46] Copy rate: 111.361 MiB/s (10737418240 bytes in 91.954 seconds)
[2021-04-30T00:03:46] Syncing data to disk.
[2021-04-30T00:03:47] Sync completed in 0.135 seconds.
[2021-04-30T00:03:47] Fixing permissions.
[2021-04-30T00:03:47] Updated 12 items in 0.000 seconds (71195.069 items/sec)
[2021-04-30T00:03:47] Syncing directory updates to disk.
[2021-04-30T00:03:47] Sync completed in 0.001 seconds.
[2021-04-30T00:03:47] Started: Apr-30-2021,00:02:15
[2021-04-30T00:03:47] Completed: Apr-30-2021,00:03:47
[2021-04-30T00:03:47] Seconds: 92.091
[2021-04-30T00:03:47] Items: 12
[2021-04-30T00:03:47]   Directories: 2
[2021-04-30T00:03:47]   Files: 10
[2021-04-30T00:03:47]   Links: 0
[2021-04-30T00:03:47] Data: 10.000 GiB (10737418240 bytes)
[2021-04-30T00:03:47] Rate: 111.194 MiB/s (10737418240 bytes in 92.091 seconds)


# Verify the two directories have the same content
mjean@wolf-184:~/build> ls -la /tmp/datamover3/daos_test/
total 10485808
drwxr-xr-x 2 mjean mjean       4096 Apr 30 00:02 .
drwxr-xr-x 3 mjean mjean       4096 Apr 30 00:02 ..
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000000
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000001
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000002
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000003
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000004
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000005
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000006
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000007
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000008
-rw-r--r-- 1 mjean mjean 1073741824 Apr 30 00:03 testfile.00000009
mjean@wolf-184:~/build> ls -la /tmp/daos_dfuse/daos_test/
total 10485760
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000000
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000001
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000002
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000003
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000004
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000005
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000006
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000007
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000008
-rw-r--r-- 1 mjean mjean 1073741824 Apr 29 16:31 testfile.00000009
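Comparing the two listings by eye confirms only the file names and sizes. A byte-for-byte comparison is a stronger check. The following is a minimal sketch using the standard diff -r; same_tree is a hypothetical helper name, and the temporary directories exist only to make the example self-contained.

```shell
# Hypothetical helper: succeed only if both directory trees contain the
# same files with identical contents (diff -r compares recursively).
same_tree() {
    diff -r "$1" "$2" > /dev/null
}

# Self-contained demonstration with small temporary directories
src=$(mktemp -d)
dst=$(mktemp -d)
printf 'data\n' > "$src/testfile.00000000"
cp "$src/testfile.00000000" "$dst/"
same_tree "$src" "$dst" && echo "identical"

printf 'changed\n' > "$dst/testfile.00000000"
same_tree "$src" "$dst" || echo "differs"
rm -rf "$src" "$dst"
```

For the copy above, the call would be same_tree /tmp/daos_dfuse/daos_test /tmp/datamover3/daos_test. Note that this reads all 10 GiB from both locations, so it takes noticeably longer than ls.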

...