Test dmg scalability on TDS

Description

The intent is to evaluate the scalability of dmg and the control plane. This needs just one client node and as many servers as possible. Let’s start with the TDS and then try to scale further on Aurora. Ideally, this test runs with the CXI provider and the 2.2 image.

for N in (1, 2, 4, 8, 16, 32)

  1. Measure time to format N engines

  2. Measure time to run dmg system query

  3. Measure time to create one pool spanning all the engines and using the full capacity

  4. Measure time of dmg pool list

  5. Measure time of dmg pool query

  6. Measure time to destroy the pool

  7. Measure time to create 50 pools spanning all the engines, with each pool using 1/50th of the capacity

  8. Measure time of dmg pool list

  9. Measure time of dmg pool query for one of those pools

  10. Measure time to collect the metrics from one engine with dmg

  11. Measure time to destroy all the pools

  12. Measure time to stop the system with dmg system stop

  13. Measure time to start the system with dmg system start
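The measurement steps above could be driven by a small timing harness. The following is a minimal sketch, not the script that was actually used for this ticket: the helper names `time_command` and `run_steps` are hypothetical, and the demo uses a placeholder command so that it runs on a machine without DAOS installed.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run one command and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode


def run_steps(steps):
    """Time each (label, argv) step in order.

    Returns a list of (label, elapsed_seconds, return_code) tuples.
    """
    rows = []
    for label, argv in steps:
        elapsed, code = time_command(argv)
        rows.append((label, elapsed, code))
    return rows


if __name__ == "__main__":
    # Placeholder step (assumption): replace with the real dmg invocations
    # for the system under test.
    demo_steps = [("noop", [sys.executable, "-c", "pass"])]
    for label, seconds, code in run_steps(demo_steps):
        print(f"{label:<16} {seconds:8.3f} s  rc={code}")
```

On a real system, the step list would hold the commands named in this ticket, for example `("system query", ["dmg", "system", "query"])` or `("pool list", ["dmg", "pool", "list"])`. Any further options (pool sizes, pool labels, insecure mode) depend on the deployment and are deliberately left out here.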

Activity

Kurniawan Alfizah May 18, 2022 at 5:49 PM

Hi Michael, regarding your question on DAOS-10508 about “metrics - 50”: it is the time to collect metrics via dmg while all 50 pools exist. Basically, it is running 'dmg -i telem metrics query' while we have 50 pools, compared with running the same command after we destroy all those pools (the “metrics” row).

Michael:
ok, got it. thanks. so it makes sense that it takes longer because there are more metrics. maybe some room for improvement there, will think about it.

Mjmac Macdonald May 10, 2022 at 2:59 PM

Wow, yeah. Very nice improvement in format times. Nice work on that!

Thanks for re-running.

The times all seem reasonable to me, although I’m wondering what the “metrics - 50” times represent, exactly. Is that the time to collect metrics via dmg for each of the 50 pools?

Kurniawan Alfizah May 10, 2022 at 2:34 PM

Reran with PR-8603, since it has been included in the latest image on Sunspot (thanks, Maureen). Here are the results; they look good.

N                       1          2          4          8          8 - 3 AP   16         16 - 3 AP  32         32 - 3 AP
format engines          18.875     33.006     33.766     33.578     33.382     33.452     33.91      33.894     33.791
system query            0.098      0.102      0.125      0.113      1.388      0.164      0.142      0.137      0.137
one pool create         4.672      4.691      4.765      4.461      5.26       4.801      4.847      4.888      4.923
pool list               0.112      0.126      0.129      0.119      0.391      0.394      0.147      0.374      0.373
pool query              0.089      0.099      0.098      0.111      0.368      0.362      0.11       0.346      0.346
pool destroy            1.774      1.847      1.856      2.099      2.689      1.892      1.812      2.642      2.399
50 pools create         82.14      82.906     84.88      82.45      87.016     84.016     88.044     86.67      89.34
pool list (50 pools)    0.951      0.931      8.733      8.435      9.499      9.04       9.747      11.06      8.55
pool query (50 pools)   0.099      0.113      0.347      0.352      0.359      0.108      0.111      0.113      0.359
metrics - 50            69.903     142.015    163.281    163.15     162.268    162.41     163.161    159.965    161.69
pool destroy all        36.187     36.11      61.143     59.93      62.691     60.168     66.216     65.477     57.97
metrics                 1.802      3.453      3.322      3.492      3.45       3.415      3.491      3.549      3.442
dmg sys stop            1.621      1.61       1.628      1.615      1.622      1.611      1.624      1.621      1.627
dmg sys start           16.12      16.63      16.62      16.62      16.63      16.63      16.14      16.62      16.63
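To put the “metrics - 50” and “50 pools create” rows in perspective, the numbers above can be turned into two simple ratios. This is only an illustrative calculation on the values reported in this comment (the N = 1, 2, 4, 8, 16, 32 columns, without the “3 AP” variants); the variable names are made up for the sketch.

```python
# Values copied from the results above (columns N = 1, 2, 4, 8, 16, 32).
n_engines = [1, 2, 4, 8, 16, 32]
metrics_50_pools = [69.903, 142.015, 163.281, 163.15, 162.41, 159.965]
metrics_no_pools = [1.802, 3.453, 3.322, 3.492, 3.415, 3.549]
create_50_pools = [82.14, 82.906, 84.88, 82.45, 84.016, 86.67]
create_one_pool = [4.672, 4.691, 4.765, 4.461, 4.801, 4.888]

# How much slower metrics collection is with 50 pools than with none.
ratios = [with_pools / without for with_pools, without
          in zip(metrics_50_pools, metrics_no_pools)]

# Average time per pool when creating 50 small pools.
per_pool_create = [total / 50 for total in create_50_pools]

for n, ratio, per_pool, single in zip(
        n_engines, ratios, per_pool_create, create_one_pool):
    print(f"N={n:<3} metrics slowdown x{ratio:5.1f}   "
          f"per-pool create {per_pool:5.2f} (single full-size pool: {single})")
```

In this data set, each of the 50 small pools is on average cheaper to create than the single full-capacity pool, while metrics collection slows down by well over an order of magnitude once the 50 pools exist.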

Mjmac Macdonald May 9, 2022 at 1:58 PM

That PR to improve NVMe storage format times has now landed on release/2.2.

Mjmac Macdonald May 8, 2022 at 2:42 PM

It does seem to scale pretty well; that’s nice to see. I think we should redo the storage format test once the PR to improve NVMe storage format times lands on the release/2.2 branch – that should improve the time significantly.

Done

Created May 3, 2022 at 9:23 PM
Updated March 6, 2024 at 11:18 PM
Resolved May 18, 2022 at 5:49 PM