
TACC system information:

  • No NVMe SSDs
  • No persistent memory
  • The server will use SCM emulation only (tmpfs, which is only ~90 GB)
  • Available nodes by queue name:

    Queue Name    Node Type    Max Nodes per Job (assoc'd cores)*    Max Duration    Max Jobs in Queue*    Charge Rate (per node-hour)
    skx-dev       SKX          4 nodes (192 cores)*                  2 hrs           1*                    1 SU
    skx-normal    SKX          128 nodes (6,144 cores)*              48 hrs          25*                   1 SU
    skx-large**   SKX          868 nodes (41,664 cores)*             48 hrs          3*                    1 SU

    (A sample Slurm request against these queues is sketched after the node specifications below.)
  • Stampede2 SKX Compute Node Specifications

    Model: Intel Xeon Platinum 8160 ("Skylake")
    Total cores per SKX node: 48 cores on two sockets (24 cores/socket)
    Hardware threads per core: 2
    Hardware threads per node: 48 x 2 = 96
    Clock rate: 2.1GHz nominal (1.4-3.7GHz depending on instruction set and number of active cores)
    RAM: 192GB (2.67GHz) DDR4
    Cache: 32KB L1 data cache per core; 1MB L2 per core; 33MB L3 per socket. Each socket can cache up to 57MB (sum of L2 and L3 capacity).
    Local storage: 144GB /tmp partition on a 200GB SSD. Size of /tmp partition as of 14 Nov 2017.
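
As referenced above, a minimal Slurm batch header for requesting SKX nodes from these queues. This is a sketch: the job name, node counts, and the host-list split between servers and clients are illustrative.

#!/bin/bash
#SBATCH -J daos-test           # job name (illustrative)
#SBATCH -p skx-normal          # queue from the table above
#SBATCH -N 24                  # total nodes, e.g. 8 servers + 16 clients
#SBATCH -t 02:00:00            # wall time; SU charge = nodes x hours x rate

# Split the allocation into server and client host lists.
scontrol show hostnames "$SLURM_JOB_NODELIST" > all_hosts.txt
head -n 8  all_hosts.txt > server_hosts.txt
tail -n 16 all_hosts.txt > client_hosts.txt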

Server:

DAOS server YAML configuration:
# single server instance per config file for now
servers:
-
  targets: 16                		# Confirm the number of targets
  first_core: 0              		# offset of the first core for service xstreams
  nr_xs_helpers: 1           		# count of offload/helper xstreams per target
  fabric_iface: ib0          		# map to OFI_INTERFACE=ib0
  fabric_iface_port: 31416   		# map to OFI_PORT=31416
  log_mask: ERR              		# map to D_LOG_MASK=ERR
  log_file: /tmp/daos_server.log 	# map to D_LOG_FILE=/tmp/daos_server.log

  # Environment variable values should be supplied without encapsulating quotes.
  env_vars:                 # influence DAOS IO Server behaviour by setting env variables
  - CRT_TIMEOUT=120
  - CRT_CREDIT_EP_CTX=0
  - PSM2_MULTI_EP=1
  - CRT_CTX_SHARE_ADDR=1
  - PMEMOBJ_CONF=prefault.at_open=1;prefault.at_create=1;  # Do we need this?
  - PMEM_IS_PMEM_FORCE=1								   # Do we need this?

  # Storage definitions

  # When scm_class is set to ram, tmpfs will be used to emulate SCM.
  # The size of ram is specified by scm_size in GB units.
  scm_mount: /dev/shm   # map to -s /dev/shm
  scm_class: ram
  scm_size: 90

  # When scm_class is set to dcpm, scm_list is the list of device paths for
  # AppDirect pmem namespaces (currently only one per server supported).
  # scm_class: dcpm
  # scm_list: [/dev/pmem0]

  # If using NVMe SSD (will write /mnt/daos/daos_nvme.conf and start I/O
  # service with -n <path>)
  # bdev_class: nvme
  # bdev_list: ["0000:81:00.0"]  # generate regular nvme.conf

  # If emulating NVMe SSD with malloc devices
  # bdev_class: malloc  # map to VOS_BDEV_CLASS=MALLOC
  # bdev_size: 4                # malloc size of each device in GB.
  # bdev_number: 1              # generate nvme.conf as follows:
              # [Malloc]
              #   NumberOfLuns 1
              #   LunSizeInMB 4000

  # If emulating NVMe SSD over kernel block device
  # bdev_class: kdev            # map to VOS_BDEV_CLASS=AIO
  # bdev_list: [/dev/sdc]       # generate nvme.conf as follows:
              # [AIO]
              #   AIO /dev/sdc AIO2

  # If emulating NVMe SSD with backend file
  # bdev_class: file            # map to VOS_BDEV_CLASS=AIO
  # bdev_size: 16           # file size in GB. Create file if does not exist.
  # bdev_list: [/tmp/daos-bdev] # generate nvme.conf as follows:
              # [AIO]
              #   AIO /tmp/aiofile AIO1 4096
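
A sketch of launching the servers with this configuration file (the config path and node count are illustrative; this assumes the orterun-based startup used by DAOS at the time):

orterun -np 8 --hostfile server_hosts.txt --enable-recovery \
    daos_server start -o /path/to/daos_server.yml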

Server environment variables (if any are set):

Client Configuration:

Configuration:

Environment variables (if any are set):



Testing Areas and Tests

Test priority: 1 = HIGH, 2 = LOW. Expected SUs assume 1 node * 1 hour = 1 SU. Each test below lists its priority, number of servers and clients, input parameters, expected result, and notes; observed results and defects will be recorded as the tests are run.

Server YAML config options
  Test: Verify the test cases from the sections below with specific server config options in the YAML file.
  Priority: 1
  Input parameters:
    targets = [16]
    nr_xs_helpers = [1]
    CRT_CTX_SHARE_ADDR = [0, 1]
  Expected result: No server crash; performance increases linearly.
  Notes: No individual test is needed; the tests below can exercise this configuration.

Performance

No Replica
  Test: Run IOR and collect bandwidth; run IOR with small transfer sizes and collect IOPS.
  Priority: 1
  Servers/Clients: 1/1, 8/16, 32/96, 128/256, 128/740
  Input parameters:
    Transfer sizes: 256B, 4K, 128K, 512K, 1M
    Block size: 64M (the aggregate file size depends on the number of processes)
    FPP and SSF
  Expected result: A single server got ~12 GB read/write, so it should scale linearly.
  Expected SUs: 703 (1,406 nodes in total across the configurations, taking ~30 min each)
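
A sketch of the corresponding IOR invocations (assuming an IOR build with the DFS backend; the pool/container UUIDs, hostfile, and task counts are placeholders):

# SSF (single shared file) bandwidth run; add -F for file-per-process (FPP)
mpirun -np 256 --hostfile client_hosts.txt \
    ior -a DFS -w -r -C -e -t 1m -b 64m \
        --dfs.pool <pool_uuid> --dfs.cont <cont_uuid>

# For the IOPS runs, repeat with the small transfer sizes, e.g. -t 256 or -t 4k.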

Replica 2-Way
  Test: Run IOR and collect bandwidth; run IOR with small transfer sizes and collect IOPS.
  Priority: 1
  Servers/Clients: 8/16, 32/96, 128/740
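
IOR itself does not set the redundancy level; with the DFS backend it comes from the DAOS object class. A sketch for the replica rows, assuming object class names of the form RP_2G1/RP_3G1/RP_4G1 and IOR's --dfs.oclass option:

# 2-way replicated objects; use RP_3G1 / RP_4G1 for the 3-way / 4-way rows below
mpirun -np 96 --hostfile client_hosts.txt \
    ior -a DFS -w -r -t 1m -b 64m \
        --dfs.pool <pool_uuid> --dfs.cont <cont_uuid> --dfs.oclass RP_2G1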








Replica 3-Way
  Test: Run IOR and collect bandwidth; run IOR with small transfer sizes and collect IOPS.
  Priority: 1
  Servers/Clients: 8/16, 32/96, 128/740

Replica 4-Way
  Test: Run IOR and collect bandwidth; run IOR with small transfer sizes and collect IOPS.
  Priority: 1
  Servers/Clients: 8/16, 32/96, 128/740

Metadata Test (Using MDTest)
  Priority: 1
  Servers/Clients: 1/1, 8/16, 32/96, 128/256, 128/740
  Input parameters:
    -n = 1000 (every process will create/stat/read/remove 1000 items)
    -z = 0 and 20 (depth of the hierarchical directory structure)
  Notes:
    How many tasks per client: 1, 4, or only 8?
    Which class type should be tested?
    A result with 1 server and 1 client is available from
    https://jira.hpdd.intel.com/secure/attachment/31383/sbatch_run.txt
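
A sketch of an MDTest invocation matching these parameters (assuming an mdtest build with the DFS backend; UUIDs and task counts are placeholders):

# -n: items created/stat'ed/read/removed per process; -z: tree depth (run with 0 and 20)
mpirun -np 256 --hostfile client_hosts.txt \
    mdtest -a DFS -n 1000 -z 20 -u \
           --dfs.pool <pool_uuid> --dfs.cont <cont_uuid>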






CART self_test
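
A sketch of a CART self_test run for measuring raw RPC latency/bandwidth against the server group (the endpoint ranks, message size, and repetition counts are placeholders):

self_test --group-name daos_server --endpoint 0-7:0 \
          --message-sizes "b1048576" --max-inflight-rpcs 16 --repetitions 100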









POSIX (Fuse)
  Priority: 2?

Functionality and Scale testing

Run all daos_test
  Servers/Clients: 128/740
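
A sketch of launching the suite (daos_test is MPI-launched; the subtest flags differ per release, so check daos_test -h for the current list):

# Run the full suite from the client nodes; individual subtests can be
# selected with their single-letter flags (e.g. -p for pool tests).
orterun -np 740 --hostfile client_hosts.txt daos_test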






Single server / Max clients
  Servers/Clients: 1/739
  Test: Create pool, query pool; run IOR (specific size?).
  Expected result: Pool create should work fine. IOR will run with ~5,000 tasks, so it should succeed. Query the pool info after the IOR run and compare the reported pool usage with the file size.




Max servers / Single client
  Servers/Clients: 739/1
  Test: Create pool, query pool; run IOR.
  Expected result: Pool create should work fine. IOR will run with 8 tasks, so it should succeed. Query the pool info after the IOR run and compare the reported pool usage with the file size.
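
A sketch of the pool create/query steps shared by the two rows above (assuming recent dmg syntax; the size and UUID are placeholders):

dmg pool create --size 1TB        # create a pool spanning the servers
dmg pool query <pool_uuid>        # compare reported usage with the IOR file size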




Large number of Pools (~1000)
  Servers/Clients: 128/740 (is this count OK?)
  Test:
    Create a large number of pools (~90 MB each).
    Write small data to each with IOR.
    Restart all the servers.
    Query all the pools.
    Read the IOR data back from each pool with verification.
    (What other operations are needed after pool creation?)
  Expected result:
    Measure the server restart time with this many pools.
    Pool query should report correct sizes after the IOR writes.
    IOR reads should work, with data validation, after all servers restart.
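
A sketch of the pool-creation loop (hypothetical flags: older releases used --scm-size instead of --size, the pool label argument is assumed, and UUID capture depends on the tool's output format):

# Create ~1000 small pools, keeping the output for the later query/read pass
for i in $(seq 1 1000); do
    dmg pool create --size 90MB pool_$i | tee -a pool_create.log
done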






dmg utility testing
  Test: Exercise the dmg administrative commands, for example pool query.
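
Representative dmg checks (a sketch assuming recent dmg subcommands; the UUID is a placeholder):

dmg system query                  # confirm all server ranks have joined
dmg pool query <pool_uuid>        # target count, free space, rebuild status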











Negative Scenarios with Scalability

Reliability and Data Integrity (Soak testing)

Question:

  • Knights Landing has more nodes, and we have tried running the DAOS server on it. Can KNL be preferred over Skylake? KNL has 2,048 nodes compared to only 868 for SKX, and its charge rate is 0.8 SU per node-hour versus 1.0 for SKX, so tests cost 20% less and we can run more of them.