Test dmg scalability on TDS
Description
Activity
Kurniawan Alfizah May 18, 2022 at 5:49 PM
Hi Michael, regarding your question on DAOS-10508 about "metrics - 50": it is the time to collect metrics via dmg while those 50 pools exist. Basically, it runs 'dmg -i telem metrics query' with the 50 pools in place, compared to running the same command after we destroy all of those pools (the "metrics" row).
Michael:
OK, got it, thanks. So it makes sense that it takes longer, because there are more metrics to collect. There may be some room for improvement there; I will think about it.
Mjmac Macdonald May 10, 2022 at 2:59 PM
Wow, yeah. Very nice improvement in format times. Nice work on that!
Thanks for re-running.
The times all seem reasonable to me, although I’m wondering what the “metrics - 50” times represent, exactly. Is that the time to collect metrics via dmg for each of the 50 pools?
Kurniawan Alfizah May 10, 2022 at 2:34 PM (edited)
Re-ran with PR-8603, which is now included in the latest image on Sunspot (thanks, Maureen). Here are the results; they look good.
N | 1 | 2 | 4 | 8 | 8 - 3AP | 16 | 16 - 3AP | 32 | 32 - 3AP |
format engines | 18.875 | 33.006 | 33.766 | 33.578 | 33.382 | 33.452 | 33.91 | 33.894 | 33.791 |
system query | 0.098 | 0.102 | 0.125 | 0.113 | 1.388 | 0.164 | 0.142 | 0.137 | 0.137 |
one pool create | 4.672 | 4.691 | 4.765 | 4.461 | 5.26 | 4.801 | 4.847 | 4.888 | 4.923 |
pool list | 0.112 | 0.126 | 0.129 | 0.119 | 0.391 | 0.394 | 0.147 | 0.374 | 0.373 |
pool query | 0.089 | 0.099 | 0.098 | 0.111 | 0.368 | 0.362 | 0.11 | 0.346 | 0.346 |
pool destroy | 1.774 | 1.847 | 1.856 | 2.099 | 2.689 | 1.892 | 1.812 | 2.642 | 2.399 |
50 pools create | 82.14 | 82.906 | 84.88 | 82.45 | 87.016 | 84.016 | 88.044 | 86.67 | 89.34 |
pool list (50 pools) | 0.951 | 0.931 | 8.733 | 8.435 | 9.499 | 9.04 | 9.747 | 11.06 | 8.55 |
pool query (one of 50 pools) | 0.099 | 0.113 | 0.347 | 0.352 | 0.359 | 0.108 | 0.111 | 0.113 | 0.359 |
metrics - 50 | 69.903 | 142.015 | 163.281 | 163.15 | 162.268 | 162.41 | 163.161 | 159.965 | 161.69 |
pool destroy all | 36.187 | 36.11 | 61.143 | 59.93 | 62.691 | 60.168 | 66.216 | 65.477 | 57.97 |
metrics | 1.802 | 3.453 | 3.322 | 3.492 | 3.45 | 3.415 | 3.491 | 3.549 | 3.442 |
dmg sys stop | 1.621 | 1.61 | 1.628 | 1.615 | 1.622 | 1.611 | 1.624 | 1.621 | 1.627 |
dmg sys start | 16.12 | 16.63 | 16.62 | 16.62 | 16.63 | 16.63 | 16.14 | 16.62 | 16.63 |
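As a rough way to read the two metrics rows together, the sketch below subtracts the "metrics" time (no pools) from the "metrics - 50" time and divides by 50, giving an approximate per-pool cost for each column. It assumes the extra time grows linearly with the number of pools, which the ticket does not confirm; the values are copied from the table and are in the table's own time units.

```python
# Values copied from the "metrics - 50" and "metrics" rows of the table.
COLUMNS = ["1", "2", "4", "8", "8 - 3AP", "16", "16 - 3AP", "32", "32 - 3AP"]
METRICS_50 = [69.903, 142.015, 163.281, 163.15, 162.268,
              162.41, 163.161, 159.965, 161.69]
METRICS_0 = [1.802, 3.453, 3.322, 3.492, 3.45,
             3.415, 3.491, 3.549, 3.442]
POOLS = 50


def per_pool_overhead(with_pools, without_pools, pools=POOLS):
    """Extra metrics-query time attributable to each pool.

    Assumption: the overhead is linear in the pool count.
    """
    return [(w - wo) / pools for w, wo in zip(with_pools, without_pools)]


overhead = per_pool_overhead(METRICS_50, METRICS_0)
for label, cost in zip(COLUMNS, overhead):
    print(f"N={label}: {cost:.3f} per pool")
```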
Mjmac Macdonald May 9, 2022 at 1:58 PM
That PR to improve NVMe storage format times has now been landed to release/2.2.
Mjmac Macdonald May 8, 2022 at 2:42 PM
It does seem to scale pretty well; that’s nice to see. I think we should redo the storage format test once the PR to improve NVMe storage format times lands on the release/2.2 branch – that should improve the time significantly.
The intent is to evaluate the scalability of dmg and the control plane. For this, we need just one client node and as many servers as possible. Let’s start with the TDS and then try to scale further on Aurora. It would be great to run this test with the CXI provider and the 2.2 image.
for N in (1, 2, 4, 8, 16, 32)
Measure time to format N engines
Measure time to run dmg system query
Measure time to create one pool spanning all the engines and using the full capacity
Measure time of dmg pool list
Measure time of dmg pool query
Measure time to destroy the pool
Measure time to create 50 pools spanning all the engines, with each pool using 1/50th of the capacity
Measure time of dmg pool list
Measure time of dmg pool query for one of those pools
Measure time to collect the metrics from one engine with dmg
Measure time to destroy all the pools
Measure time to stop the system with dmg system stop
Measure time to start the system with dmg system start
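The measurement steps above could be scripted along these lines. This is a minimal sketch, not the harness actually used for the results in this ticket: `time_command`, `run_steps`, and `STEPS` are hypothetical names, and only dmg subcommands already named in the ticket appear. Options such as `-i`, host lists, pool sizes, and pool labels are deliberately left out and would need to be added for a real run.

```python
import subprocess
import time


def time_command(argv, run=subprocess.run):
    """Return the wall-clock seconds taken to run one command."""
    start = time.perf_counter()
    # check=True raises if the command exits non-zero, so a failed
    # step is never recorded as a valid timing.
    run(list(argv), check=True, capture_output=True)
    return time.perf_counter() - start


# Hypothetical step table: only steps that need no pool-specific
# arguments are listed; pool create/query/destroy would be added
# with site-specific options.
STEPS = {
    "system query": ["dmg", "system", "query"],
    "pool list": ["dmg", "pool", "list"],
    "dmg sys stop": ["dmg", "system", "stop"],
    "dmg sys start": ["dmg", "system", "start"],
}


def run_steps(steps, run=subprocess.run):
    """Time each step in order and return {step name: seconds}."""
    return {name: round(time_command(argv, run), 3)
            for name, argv in steps.items()}
```

Calling `run_steps(STEPS)` on the client node once per value of N would give one column of timings. The `run` parameter exists so the command runner can be replaced, for example with a stub when dmg is not installed.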