
Testing Area

Test

Test Priority (1 - HIGH, 2 - LOW)

Number of Servers

Number of Clients

Input Parameter

Expected Result

Observed Result

Defect

Notes

Expected SUs (1 node * 1 hour = 1 SU)
Server YAML config options

Verify the test cases in the sections below with specific server config options in the YAML file

1



target = [16]

nr_xs_helpers = [1]

CRT_CTX_SHARE_ADDR=[0, 1]

No server crash.

Performance increases linearly.



No individual test is needed; the tests below can be run with this configuration.
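The bracketed option lists above can be read as a small sweep. The following is a minimal sketch, assuming each list holds the values to try for that option and that every combination should be exercised. `config_matrix` is a hypothetical helper for planning, not part of DAOS.

```python
from itertools import product

# Option values copied from the table above. Treating each list as a
# set of values to sweep is an assumption of this sketch.
SERVER_OPTIONS = {
    "target": [16],
    "nr_xs_helpers": [1],
    "CRT_CTX_SHARE_ADDR": [0, 1],
}


def config_matrix(options):
    """Return one dict per combination of option values."""
    keys = list(options)
    value_lists = [options[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


for cfg in config_matrix(SERVER_OPTIONS):
    print(cfg)
```

With the values listed here the sweep has two entries, one per `CRT_CTX_SHARE_ADDR` setting.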

Performance

No Replica 

Run IOR and collect BW

Run IOR small size and collect IOPS

1

1,

8,

32,

128

128

1,

16,

96,

256

740

protocol : daos

Transfer Size: 256B, 4K, 128K, 512K, 1M (Do non-standard sizes also need to be covered?)

Block Size: 64M (depends on the number of processes, since the file size grows with it)

FPP and SSF
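The IOR inputs above form a matrix of transfer sizes and access modes. The following is a minimal sketch that enumerates it, assuming size suffixes are powers of 1024 and that every transfer size is run in both FPP and SSF mode. `to_bytes` and `ior_matrix` are hypothetical helper names.

```python
from itertools import product

# Size suffixes treated as powers of 1024 (an assumption of this sketch).
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

TRANSFER_SIZES = ["256B", "4K", "128K", "512K", "1M"]  # from the table
BLOCK_SIZE = "64M"                                     # from the table
ACCESS_MODES = ["FPP", "SSF"]                          # from the table


def to_bytes(size):
    """Convert a size such as '4K' or '64M' to bytes."""
    return int(size[:-1]) * UNITS[size[-1].upper()]


def ior_matrix():
    """Return one entry per (access mode, transfer size) combination."""
    block = to_bytes(BLOCK_SIZE)
    return [
        {
            "mode": mode,
            "transfer_size": xfer,
            "block_size": BLOCK_SIZE,
            "transfers_per_block": block // to_bytes(xfer),
        }
        for mode, xfer in product(ACCESS_MODES, TRANSFER_SIZES)
    ]


for case in ior_matrix():
    print(case)
```

This gives ten runs per server/client combination, which helps when estimating node time.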

A single server achieved ~12 GB/s read/write, so bandwidth should scale linearly.

With 128 servers, bandwidth should be close to 1.5 TB/s?
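The linear-scaling expectation is simple arithmetic and can be checked against measured results. The following is a minimal sketch, assuming the single-server figure is ~12 GB/s and that ideal scaling is a straight multiple of the server count. The function names are hypothetical.

```python
SINGLE_SERVER_GB_PER_S = 12  # approximate single-server result from the table


def ideal_bandwidth(servers, per_server=SINGLE_SERVER_GB_PER_S):
    """Ideal aggregate bandwidth in GB/s under perfectly linear scaling."""
    return servers * per_server


def scaling_efficiency(observed_gb_per_s, servers, per_server=SINGLE_SERVER_GB_PER_S):
    """Fraction of the ideal bandwidth that was actually observed."""
    return observed_gb_per_s / ideal_bandwidth(servers, per_server)


for servers in (1, 8, 32, 128):
    print(servers, "servers ->", ideal_bandwidth(servers), "GB/s ideal")
```

For 128 servers the ideal figure is 1536 GB/s, which is the ~1.5 TB/s quoted above.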




1406 Nodes

taking ~30 min 


703
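The SU figures in the last column follow from the rule in the header (1 node * 1 hour = 1 SU). The following is a minimal sketch of that calculation; `service_units` is a hypothetical helper name.

```python
def service_units(nodes, minutes):
    """Service units consumed: 1 node for 1 hour equals 1 SU."""
    return nodes * minutes / 60


print(service_units(1406, 30))  # 1406 nodes for ~30 min
print(service_units(1020, 30))  # 1020 nodes for ~30 min
print(service_units(868, 120))  # 868 nodes for 2 hours
```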

Replica 2 Way

Run IOR and collect BW

Run IOR small size and collect IOPS

1

8,

32,

128

16,

96,

740

Same As Above



1020 Nodes

for ~30 min

510

Replica 3 Way

Run IOR and collect BW

Run IOR small size and collect IOPS

1

8,

32,

128

16,

96,

740

Same As Above



1020 Nodes

for ~30 min

510

Replica 4 Way

Run IOR and collect BW

Run IOR small size and collect IOPS

1

8,

32,

128

16,

96,

740

Same As Above



1020 Nodes

for ~30 min

510

Do any erasure-coded object classes need to be run? Maybe with a medium size?

EC_2P1G1
EC_2P2G1
EC_8P2G1

1?

32

96

Same As Above?



128 nodes for ~60 min

120

Metadata Test (Using MDTest)

1

1,

8,

32,

128

128

1,

16,

96,

256

740

How many tasks per client: 1, 4, or only 8?

Which class types should be tested?

-n = 1000 (every process will create/stat/read/remove this many items)

-z = 0 and 20 (depth of the hierarchical directory structure)
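Because `-n` is per process, the total number of items grows with the client and task counts. The following is a minimal sketch of that arithmetic, assuming every task on every client runs with the same `-n`. The task-per-client candidates are taken from the open question above, and `total_items` is a hypothetical helper name.

```python
ITEMS_PER_PROCESS = 1000       # mdtest -n, from the table
DEPTHS = [0, 20]               # mdtest -z values, from the table
TASKS_PER_CLIENT = [1, 4, 8]   # candidates from the open question


def total_items(clients, tasks_per_client, items_per_process=ITEMS_PER_PROCESS):
    """Total items created across all processes in one run."""
    return clients * tasks_per_client * items_per_process


for tasks in TASKS_PER_CLIENT:
    print(tasks, "tasks/client on 740 clients ->", total_items(740, tasks), "items")
```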

Result with 1 server, 1 client is available from

https://jira.hpdd.intel.com/secure/attachment/31383/sbatch_run.txt




1406 Nodes

taking ~15 min 


350
CART self_test

1

2

32

126

1

1

1

orterun --timeout 3600 --mca mtl ^psm2,ofi -x FI_PSM2_DISCONNECT=1 -np 1 -ompi-server <urifile> self_test --group-name daos_server --endpoint 0-<NO_OF_SERVER>:0 --master-endpoint 0-<NO_OF_SERVER>:0 --message-sizes 'b1048576',' b1048576 0','0 b1048576',' b1048576 i2048',' i2048 b1048576',' i2048',' i2048 0','0 i2048','0' --max-inflight-rpcs 1 --repetitions 100


https://wiki.hpdd.intel.com/download/attachments/114950812/2SN_1CN_TACC-Stampede2_20191022_144511.txt?api=v2


https://wiki.hpdd.intel.com/download/attachments/114950812/32SN_1CN_TACC-Stampede2_20191022_172517.txt?api=v2



POSIX (Fuse)

2?

32

96

Run IOR in POSIX mode. Are we at the point where we can get full performance?



128

for ~60 min

128
DFS

2

Not sure whether we want to cover DFS, since the test cases above already cover DAOS with IOR.





HDF5?

2?

32

96

Any specific test we want to run?





FIO?


Do we want to test this?





Functionality and Scale testing

Run all daos_test

2

128

740




868 nodes for ~60 min

Single server/Max clients

1

867

Create pool, Query pool

Run IOR (Specific size?)

Pool create should work fine. IOR will run with ~5000 tasks and should succeed. Query pool info after the IOR run and compare the pool size to the file size.
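The pool-size check needs an expected file size to compare against. The following is a minimal sketch, assuming the aggregate IOR file size is tasks * block size * segments and reusing the 64M block size from the performance rows, since this row leaves the size open. Both function names are hypothetical.

```python
MIB = 1024**2


def expected_file_size(tasks, block_size_bytes, segments=1):
    """Aggregate bytes written by IOR: one block per task per segment."""
    return tasks * block_size_bytes * segments


def pool_overhead(pool_used_bytes, file_size_bytes):
    """Ratio of pool space used to data written (1.0 means no overhead)."""
    return pool_used_bytes / file_size_bytes


# 5000 tasks with a 64M block each (block size is an assumption here).
size = expected_file_size(5000, 64 * MIB)
print(size, "bytes expected")
```

A ratio well above 1.0 after the run would point to replication or metadata overhead and is worth recording next to the observed result.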


868 nodes for ~30 min

434

Max servers/single client

867

1

Create pool, Query pool

Run IOR 

Pool create should work fine. IOR will run with 8 tasks and should succeed. Query pool info after the IOR run and compare the pool size to the file size.


868 nodes for ~30 min

434

Large number of Pools (~1000)



128

Does the server count seem OK?

740

Create a large number of pools (~90MB each).

Write a small amount of data with IOR.

Restart all the servers.

Query all the pools.

Read the IOR data from each pool with verification.

What other operations are needed after pool creation?


Measure the server restart time with this many pools.

Pool query should report correct sizes after the IOR write.

IOR read with data validation should work after all servers restart.
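One way to check the pool-query expectation across the restart is to compare per-pool usage recorded before and after. The following is a minimal sketch, assuming the usage figures have already been extracted from `dmg pool query` output into dictionaries keyed by pool ID; the parsing is not shown, and `changed_pools` is a hypothetical helper name.

```python
def changed_pools(before, after):
    """Return IDs of pools that are missing or report different usage.

    Both arguments map pool ID -> used bytes (hypothetical structure).
    """
    return sorted(
        pool_id for pool_id, used in before.items() if after.get(pool_id) != used
    )


# Made-up example data: pool-2 changed, pool-3 is missing after restart.
before = {"pool-1": 1000, "pool-2": 2000, "pool-3": 3000}
after = {"pool-1": 1000, "pool-2": 2500}
print(changed_pools(before, after))
```

An empty result means every pool survived the restart with the same reported usage.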




868 nodes for ~60 min

868

dmg utility testing

for example: pool query




dmg pool create 

dmg pool query

dmg pool destroy

Anything more to cover? Some of these tools will be covered in other test cases.







Negative Scenarios with Scalability

Server failure and rebuild data

128

740

1

Create multiple pools.

Write IOR data with 2-, 3-, and 4-way replication and with multiple groups.

Kill servers one by one, up to a maximum of 64 (half the requested size)?

After each server kill, read the IOR data and verify the content.

Multiple servers can be killed at once (2/4/8). Object data will be lost if all copies are lost. Maybe we can verify that the remaining system is functional.
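The data-loss condition above (an object is lost only when every copy is gone) can be modeled as a set check. The following is a minimal sketch. The placement map and rank numbers are hypothetical test data, not DAOS's actual placement algorithm, and rebuild between kills is ignored.

```python
def lost_objects(placement, killed_ranks):
    """Return IDs of objects whose replicas all sit on killed ranks.

    placement maps object ID -> list of ranks holding a replica.
    """
    killed = set(killed_ranks)
    return sorted(
        obj_id for obj_id, ranks in placement.items() if set(ranks) <= killed
    )


# Made-up placement: one object each with 2-, 3-, and 4-way replication.
placement = {
    "obj-2way": [0, 1],
    "obj-3way": [1, 2, 3],
    "obj-4way": [4, 5, 6, 7],
}
print(lost_objects(placement, [0, 1]))        # both 2-way copies gone
print(lost_objects(placement, [1, 2, 3, 4]))  # all 3-way copies gone
```

Any object not reported as lost should still read back and verify after the kills.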

Rebuild should happen for all objects, and data should not be corrupted after a server failure.


868 nodes for ~2 hours

868

daos_run_io_conf

128

740

2

This excludes ranks and adds them back in a loop for a given number of iterations.

We can have a maximum of 16 targets and include all ranks. The test excludes a rank at random and adds it back.

Pool query is also part of this test, to verify the usage.
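The exclude-and-add-back loop described above can be modeled in a few lines to reason about the expected sequence of events. The following is a minimal sketch of the loop only; it is not how daos_run_io_conf is implemented, and the rank list, iteration count, and `exclude_and_add_back` name are all hypothetical.

```python
import random


def exclude_and_add_back(ranks, iterations, seed=0):
    """Randomly exclude one rank per iteration, then add it back.

    Returns the event history and the final set of active ranks.
    """
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    active = set(ranks)
    history = []
    for _ in range(iterations):
        victim = rng.choice(sorted(active))
        active.remove(victim)
        history.append(("exclude", victim))
        # A pool query would go here to verify usage while the rank is out.
        active.add(victim)
        history.append(("add", victim))
    return history, active


history, active = exclude_and_add_back(range(8), iterations=5)
for event in history:
    print(event)
```

Seeding the random generator keeps the exclusion order reproducible, which helps when a failure needs to be replayed.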

We have not tried this on TACC. It works locally, but a few issues caught during local testing still need to be resolved (DAOS-3510).


868 nodes for ~30 min

434

Reliability and Data Integrity (Soak testing)

Current Soak testing







868

for 2 hours 

1736





















