TACC system information:
- No NVMe
- No persistent memory
- Servers will use SCM only (tmpfs, which is limited to ~90G)

Nodes available, by queue name:

Queue Name | Node Type | Max Nodes per Job (assoc'd cores)* | Max Duration | Max Jobs in Queue* | Charge Rate (per node-hour) |
---|---|---|---|---|---|
skx-dev | SKX | 4 nodes (192 cores)* | 2 hrs | 1* | 1 SU |
skx-normal | SKX | 128 nodes (6,144 cores)* | 48 hrs | 25* | 1 SU |
skx-large** | SKX | 868 nodes (41,664 cores)* | 48 hrs | 3* | 1 SU |

Stampede2 SKX Compute Node Specifications:
- Model: Intel Xeon Platinum 8160 ("Skylake")
- Total cores per SKX node: 48 cores on two sockets (24 cores/socket)
- Hardware threads per core: 2
- Hardware threads per node: 48 x 2 = 96
- Clock rate: 2.1GHz nominal (1.4-3.7GHz depending on instruction set and number of active cores)
- RAM: 192GB (2.67GHz) DDR4
- Cache: 32KB L1 data cache per core; 1MB L2 per core; 33MB L3 per socket. Each socket can cache up to 57MB (sum of L2 and L3 capacity).
- Local storage: 144GB /tmp partition on a 200GB SSD (size of the /tmp partition as of 14 Nov 2017).
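
The queue limits listed earlier determine which queue a given server/client combination can run in. The following is a minimal sketch of that check, not part of the original plan. It assumes that servers and clients occupy separate nodes within a single job, and the helper name `smallest_queue` is hypothetical. The node counts in the usage loop are the server and client counts from the performance tests, paired positionally (also an assumption).

```python
# Per-job node limits copied from the Stampede2 SKX queue table.
QUEUE_MAX_NODES = {
    "skx-dev": 4,
    "skx-normal": 128,
    "skx-large": 868,
}
CORES_PER_SKX_NODE = 48  # from the SKX compute node specifications


def smallest_queue(servers, clients):
    """Return the smallest queue whose node limit fits servers + clients.

    Assumption: servers and clients run on separate nodes in one job.
    Returns None when no queue is large enough.
    """
    nodes = servers + clients
    for name, max_nodes in sorted(QUEUE_MAX_NODES.items(), key=lambda kv: kv[1]):
        if nodes <= max_nodes:
            return name
    return None


if __name__ == "__main__":
    # Assumed positional pairing of the planned server and client counts.
    for servers, clients in [(1, 1), (8, 16), (32, 96), (128, 256), (128, 740)]:
        nodes = servers + clients
        print(servers, clients, nodes, nodes * CORES_PER_SKX_NODE,
              smallest_queue(servers, clients))
```

Under these assumptions, the largest planned run (128 servers and 740 clients) uses exactly the 868-node limit of skx-large.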
Server:
Server environment variables (if any are set):
Client Configuration:
Configuration:
Environment variables (if any are set):
Testing Area | Test | Test Priority (1 - HIGH, 2 - LOW) | Number of Servers | Number of Clients | Input Parameter | Expected Result | Observed Result | Defect | Notes | Expected SUs (1 node * 1 hour = 1 SU) | |
---|---|---|---|---|---|---|---|---|---|---|---|
Server YAML config options | Verify the test cases in the sections below with specific server config options in the YAML file | 1 | | | target = [16] nr_xs_helpers = [1] CRT_CTX_SHARE_ADDR = [0, 1] | No server crash; performance increases linearly | | | No individual test is needed; the tests below can be run with this configuration | | |
Performance | No replica: run IOR and collect bandwidth (BW); run IOR with small sizes and collect IOPS | 1 | 1, 8, 32, 128; 128 | 1, 16, 96, 256; 740 | Transfer size: 256B, 4K, 128K, 512K, 1M (do non-standard sizes also need to be covered?). Block size: 64M (depends on the number of processes, since the file size grows with it). FPP and SSF. | A single server achieved ~12GB read/write, so it should scale linearly. | | | 1406 nodes taking ~30 min | 703 | |
Performance | Replica 2-way: run IOR and collect BW; run IOR with small sizes and collect IOPS | 1 | 8, 32, 128 | 16, 96, 740 | | | | | | | |
Performance | Replica 3-way: run IOR and collect BW; run IOR with small sizes and collect IOPS | 1 | 8, 32, 128 | 16, 96, 740 | | | | | | | |
Performance | Replica 4-way: run IOR and collect BW; run IOR with small sizes and collect IOPS | 1 | 8, 32, 128 | 16, 96, 740 | | | | | | | |
Performance | Metadata test (using MDTest) | 1 | 1, 8, 32, 128; 128 | 1, 16, 96, 256; 740 | How many tasks per client: 1, 4, or only 8? Which class type should be tested? -n = 1000 (every process will create/stat/read/remove). -z = 0 and 20 (depth of the hierarchical directory structure). | | | | Results with 1 server and 1 client are available at https://jira.hpdd.intel.com/secure/attachment/31383/sbatch_run.txt | | |
Performance | CART self_test | | | | | | | | | | |
Performance | POSIX (FUSE) | 2? | 32 | 96 | Run IOR in POSIX mode. Can we expect full performance at this point? | | | | | | |
Performance | FIO? | | | | | | | | Do we want to test this? | | |
Functionality and Scale testing | Run all daos_test | | 128 | 740 | | | | | | | |
Functionality and Scale testing | Single server/max clients | | 1 | 767 | Create pool, query pool, run IOR (specific size?) | Pool creation should work. IOR will run with ~5000 tasks and should succeed. Query pool info after the IOR run and compare the pool size with the file size. | | | | | |
Functionality and Scale testing | Max servers/single client | | 767 | 1 | Create pool, query pool, run IOR | Pool creation should work. IOR will run with 8 tasks and should succeed. Query pool info after the IOR run and compare the pool size with the file size. | | | | | |
Functionality and Scale testing | Large number of pools (~1000) | | 128 (is this count OK?) | 740 | Create a large number of pools (~90MB each) and write small data with IOR. Restart all the servers. Query all the pools. Read the IOR data from each pool with verification. What other operations are needed after pool creation? | Measure server restart time with this many pools. Pool query should report correct sizes after the IOR write. IOR read should succeed with data validation after all servers restart. | | | | | |
dmg utility testing for example: pool query | |||||||||||
Negative Scenarios with Scalability | |||||||||||
Reliability and Data Integrity (Soak testing) | |||||||||||
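
The SU figure in the "No replica" performance row follows from the table's own rule (1 node * 1 hour = 1 SU). The sketch below reproduces that arithmetic. It assumes one run per server/client pair, with every run taking the ~30 minutes given in the Notes column; the function name `estimate_su` is hypothetical. The total node count does not depend on how server and client counts are paired.

```python
# Server and client node counts from the "No replica" performance row.
# Assumption: one run per (servers, clients) pair, paired positionally.
RUNS = [(1, 1), (8, 16), (32, 96), (128, 256), (128, 740)]


def estimate_su(runs, hours_per_run, su_per_node_hour=1.0):
    """Return (total_nodes, service_units) for a list of (servers, clients) runs."""
    total_nodes = sum(servers + clients for servers, clients in runs)
    return total_nodes, total_nodes * hours_per_run * su_per_node_hour


if __name__ == "__main__":
    nodes, su = estimate_su(RUNS, hours_per_run=0.5)
    print(f"{nodes} nodes x 0.5 h = {su:g} SU")  # 1406 nodes x 0.5 h = 703 SU
```

If the same 30-minute assumption were applied to a replica row (8, 32, 128 servers with 16, 96, 740 clients), the estimate would be 1020 nodes and 510 SU; the table does not state a duration for those rows.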

Questions:
- Do we need to test replica counts other than those mentioned above?
- Does any erasure-coding class need to be tested?
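
For the IOR sweep described in the Performance rows, the combinations of transfer size and file mode can be enumerated up front. This is a minimal sketch, assuming the standard IOR options `-t` (transfer size), `-b` (block size), `-F` (file per process), `-w` and `-r`. The MPI launcher, the API selection, and any DAOS-specific options are deliberately left out because the plan does not specify them; the names `ior_args` and `sweep` are hypothetical.

```python
from itertools import product

# Values taken from the Input Parameter column of the performance row.
TRANSFER_SIZES = ["256", "4k", "128k", "512k", "1m"]
BLOCK_SIZE = "64m"
# FPP = file per process (-F); SSF = single shared file (IOR default).
FILE_MODES = {"FPP": ["-F"], "SSF": []}


def ior_args(transfer_size, mode):
    """Build an IOR argument list for one write+read run (launcher not included)."""
    return ["ior", "-w", "-r", "-t", transfer_size, "-b", BLOCK_SIZE] + FILE_MODES[mode]


def sweep():
    """Return (transfer_size, mode, args) for every planned combination."""
    return [(t, m, ior_args(t, m)) for t, m in product(TRANSFER_SIZES, FILE_MODES)]


if __name__ == "__main__":
    for transfer_size, mode, args in sweep():
        print(mode, transfer_size, " ".join(args))
```

The sweep yields 10 combinations per server/client configuration, which may help when sizing the job duration for each run.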