DAOS engine scalability improvements
Introduction
A DAOS storage cluster should be able to scale to thousands of storage servers and handle I/O requests from millions of client processes. All the distributed protocols used by DAOS can support this level of scalability. However, from the perspective of software design and implementation, if there is no resource control on the engine side, constantly sending broadcasts to all targets of all engines, or processing RPCs from all client processes, can quickly consume all memory on a storage node, and the service will end up being killed by the OOM killer.
Requirements
There are a few requirements to improve DAOS engine scalability:
Collect metrics and understand memory consumption per I/O RPC
Define an upper limit for the number of RPCs being processed concurrently
A certain amount of resources should be reserved for the rebuild system, the aggregation service, and other emergency services
Start rejecting RPCs when the threshold is reached.
Clients can retry a rejected RPC after a certain time interval
The retry interval should be randomized, otherwise the RPCs will be rejected again if all clients retry at the same time
Rejected RPCs may have higher priority than regular RPCs
Add priority to RPC?
SWIM RPCs always have the top priority; top-priority RPCs should not be rejected
Separate thresholds for regular RPCs and high-priority RPCs.
The client stack can bump the RPC priority when retrying
Review memory consumption of other engine components
ULT stack size
VOS caches
regular data structure size
Generic memory consumption reduction on the engine side.
Design Overview
DAOS server RPC throttling
Resources on a DAOS server are limited. If service threads keep creating a ULT for every incoming RPC, at some point the processing capability will not be able to keep up with the receiving rate, and the DAOS engine will be killed by the OOM killer.
In order to control resource consumption on the engine side:
The DAOS engine should track the number of incoming RPCs and created ULTs; these counters are per-xstream.
ULT counter: RPCs being processed
Queued-RPC counter: RPCs queued (no ULT created yet)
The engine shall stop handling incoming RPCs after reaching a certain threshold and simply reject them. A new error code is returned to the client, and a hint can optionally be returned as well
For example, the hint could be based on the number of RPCs processed, queued, and rejected in the past 10 seconds
The hint can be a retry time interval: 5-10 seconds, 10-20 seconds, or 20-30 seconds.
Based on the returned error code, the DAOS client will retry the RPC after a random time interval (within a certain range) if there is no hint, or after the server-hinted time interval.
Both the client-chosen retry interval and the server-hinted retry interval should be somewhat randomized, otherwise the fabric or server will be overloaded again and RPCs can time out because of congestion. Both sides are sketched below.
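The following is a minimal sketch of the per-xstream admission check described above. All names (xs_rpc_stats, RPC_INFLIGHT_MAX, xs_rpc_admit, etc.) and the threshold values are illustrative assumptions, not existing DAOS symbols; the real counters would live in the per-xstream scheduler state.

```c
#include <stdint.h>
#include <stdbool.h>

#define RPC_INFLIGHT_MAX   2048   /* ULTs currently processing RPCs (illustrative) */
#define RPC_QUEUED_MAX     8192   /* RPCs queued without a ULT yet (illustrative)  */

struct xs_rpc_stats {
	uint32_t  rs_inflight;   /* ULT counter: RPCs being processed        */
	uint32_t  rs_queued;     /* queued-RPC counter: no ULT created yet   */
	uint32_t  rs_rejected;   /* RPCs rejected in the recent window       */
};

/* Suggested retry interval (seconds) returned to the client as a hint,
 * derived from how loaded this xstream currently is. */
static uint32_t
xs_retry_hint(const struct xs_rpc_stats *st)
{
	uint32_t load = st->rs_inflight + st->rs_queued + st->rs_rejected;

	if (load < RPC_QUEUED_MAX)
		return 5;	/* suggest retry within 5-10 seconds   */
	if (load < 2 * RPC_QUEUED_MAX)
		return 10;	/* suggest retry within 10-20 seconds  */
	return 20;		/* suggest retry within 20-30 seconds  */
}

/* Called before creating a ULT for an incoming RPC; high-priority RPCs
 * (e.g. SWIM) bypass the check entirely and are never rejected. */
static bool
xs_rpc_admit(struct xs_rpc_stats *st, bool high_prio, uint32_t *hint)
{
	if (high_prio)
		return true;

	if (st->rs_inflight >= RPC_INFLIGHT_MAX ||
	    st->rs_queued   >= RPC_QUEUED_MAX) {
		st->rs_rejected++;
		*hint = xs_retry_hint(st);   /* optional hint in the reject reply */
		return false;                /* reply with the new "busy" error   */
	}

	st->rs_queued++;
	return true;
}
```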
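And a sketch of the client-side retry delay computation, assuming the server hint encodes the lower bound of the suggested retry window; the names, the doubling of the hint for the upper bound, and the default window are assumptions for illustration only. The point is that both the hinted and the locally chosen interval are randomized so rejected clients do not retry in lockstep.

```c
#include <stdint.h>
#include <stdlib.h>

#define RETRY_MIN_SEC  5	/* default retry window when no hint is given */
#define RETRY_MAX_SEC  30

/* hint_sec == 0 means the server did not return a hint.
 * rand() is assumed to have been seeded (srand()) at client init. */
static uint32_t
rpc_retry_delay(uint32_t hint_sec)
{
	uint32_t lo = hint_sec != 0 ? hint_sec     : RETRY_MIN_SEC;
	uint32_t hi = hint_sec != 0 ? hint_sec * 2 : RETRY_MAX_SEC;

	/* Pick a uniformly random delay in [lo, hi) to spread retries out. */
	return lo + (uint32_t)(rand() % (hi - lo));
}
```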
ULT management on helper xstream
Today, whenever an I/O xstream offloads RPC forwarding to a helper xstream, it always creates a new ULT on the helper xstream. This is expensive because each ULT consumes tens of kilobytes, and context switching between ULTs also consumes CPU cycles.
Because RPC forwarding has a very simple state machine, instead of creating thousands (or more) ULTs for each RPC batch, the design is as follows (a sketch appears after this list):
Each helper xstream creates one RPC-forwarding ULT
It also maintains a queue of RPC-forwarding tasks
Other xstreams can queue an RPC-forwarding task on one of the helper xstreams
The RPC-forwarding ULT processes the queued RPC-forwarding tasks
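Below is a minimal sketch of a single RPC-forwarding ULT per helper xstream draining a task queue, instead of one ULT per forwarded RPC. Argobots primitives are used directly for illustration; the queue/task types and the forward_rpc() callback are made-up names, not the actual DAOS data structures.

```c
#include <abt.h>
#include <stdbool.h>
#include <stdlib.h>

/* Placeholder for the real RPC-forwarding logic. */
static void
forward_rpc(void *rpc)
{
	(void)rpc;
}

struct fwd_task {
	struct fwd_task	*ft_next;
	void		*ft_rpc;	/* RPC to forward */
};

struct fwd_queue {
	/* fq_lock/fq_cond are created with ABT_mutex_create()/ABT_cond_create()
	 * at helper-xstream setup (omitted here). */
	ABT_mutex	 fq_lock;
	ABT_cond	 fq_cond;
	struct fwd_task	*fq_head;
	struct fwd_task	*fq_tail;
	bool		 fq_stop;
};

/* Called from I/O xstreams: enqueue a forwarding task and wake the
 * forwarding ULT on the helper xstream. */
static void
fwd_queue_push(struct fwd_queue *fq, struct fwd_task *task)
{
	task->ft_next = NULL;
	ABT_mutex_lock(fq->fq_lock);
	if (fq->fq_tail != NULL)
		fq->fq_tail->ft_next = task;
	else
		fq->fq_head = task;
	fq->fq_tail = task;
	ABT_cond_signal(fq->fq_cond);
	ABT_mutex_unlock(fq->fq_lock);
}

/* Body of the single RPC-forwarding ULT created (via ABT_thread_create())
 * on each helper xstream. */
static void
fwd_ult_main(void *arg)
{
	struct fwd_queue *fq = arg;

	ABT_mutex_lock(fq->fq_lock);
	while (!fq->fq_stop) {
		struct fwd_task *task = fq->fq_head;

		if (task == NULL) {
			ABT_cond_wait(fq->fq_cond, fq->fq_lock);
			continue;
		}
		fq->fq_head = task->ft_next;
		if (fq->fq_head == NULL)
			fq->fq_tail = NULL;
		ABT_mutex_unlock(fq->fq_lock);

		forward_rpc(task->ft_rpc);
		free(task);

		ABT_mutex_lock(fq->fq_lock);
	}
	ABT_mutex_unlock(fq->fq_lock);
}
```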
Collective punch of large object
Details can be found here: DAOS-14105: Punch large-scaled object collectively
Low-latency traffic class for SWIM messages
Based on discussion with HPE, we should use “low-latency traffic class” for SWIM, but this is not exported by Mercury yet.
SWIM RPC should never be throttled by DAOS engine.
Tracking ticket: DAOS-14262: Use low-latency traffic class for SWIM
New resend-check mechanism
Currently, the DAOS server depends on the DTX table to detect resent RPCs. But there is no client-to-server RPC reply ACK mechanism, so the DAOS server has to hold the related committed DTX entry until it is removed via DTX aggregation. There are two main issues:
The DTX table will occupy a lot of resources, especially DRAM. If DTX aggregation cannot purge the entries in time, it may cause a server OOM.
DTX aggregation handling a huge number of committed DTX entries will affect regular I/O progress, causing periodic performance waves.
In order to reduce the number of committed DTX entries in the DTX table, the key point is to reduce the time a committed DTX entry is maintained. The basic idea is to introduce an indirect RPC reply ACK from client to server; the ACK is piggybacked on a subsequent RPC to the same target. Some key ideas:
When a thread sends an RPC (of any kind) to a target, it checks whether previous modification RPCs have been replied to; if yes, it piggybacks their information on the current RPC to the target. The server then knows that the related DTX entries can be removed from the committed DTX table.
On the server side, if an RPC contains reply ACK information for former RPCs, the server marks the related DTX entries as removable, and a subsequent DTX aggregation will remove them from the committed DTX table some time later. On the other hand, if the server cannot get a DTX's reply ACK in time, the related DTX entry will eventually be removed by DTX aggregation once it is older than a time-based threshold.
To support sending multiple I/O requests to the same target concurrently, the client will maintain a credits array for each thread. Before the thread sends a (modification) RPC to the related target, it needs to obtain a credit and hold that credit until the RPC is replied to or fails out (a minimal client-side sketch follows this list).
Another important factor is DTX aggregation. Currently, DTX aggregation is an independent, VOS-target-local behavior. That cannot match the requirements of the above RPC-reply-ACK based resend detection. The key reason is that a DTX is not a local transaction; it may cross multiple targets (for a replicated object, an EC object, or a distributed transaction). This means that a DTX entry with the same ID may exist on multiple targets, but the RPC carrying the ACK for a former RPC is only sent to one target. The target that received the ACK then needs to make the other related targets aware of it. Note that the current DTX may work against a redundancy group different from the one with the ACK piggybacked in the RPC, so we cannot directly piggyback the ACK on the regular I/O forwarding of the current DTX. In such cases, changing DTX aggregation into a global operation is more suitable. It will cause more RPCs among servers, but since we will make it a batched operation, the relative overhead of global DTX aggregation should not be too serious.
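The following is a minimal client-side sketch of the piggybacked reply-ACK idea, assuming a per-thread, per-target credit count and a small batch of un-ACKed DTX IDs carried on the next RPC. The struct layouts, constants, and function names are illustrative assumptions only; the real DTX identifiers and RPC encoding in DAOS differ.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define ACK_BATCH_MAX	16	/* max DTX ACKs piggybacked per RPC (illustrative)   */
#define CREDITS_PER_TGT	8	/* concurrent modification RPCs per target (illustr.) */

struct dtx_id {
	uint64_t	di_hi;
	uint64_t	di_lo;
};

/* Per-thread, per-target state on the client. */
struct tgt_track {
	uint32_t	tt_credits;		/* free send credits              */
	uint32_t	tt_ack_nr;		/* replied but un-ACKed DTXs      */
	struct dtx_id	tt_acks[ACK_BATCH_MAX];
};

/* A reply arrived: release the credit and remember the DTX ID so the
 * ACK can ride on the next RPC to this target. */
static void
tgt_reply_done(struct tgt_track *tt, const struct dtx_id *id)
{
	tt->tt_credits++;
	if (tt->tt_ack_nr < ACK_BATCH_MAX)
		tt->tt_acks[tt->tt_ack_nr++] = *id;
}

/* Before sending any RPC to this target: take a credit (for modification
 * RPCs) and move the pending ACKs into the request, so the server can mark
 * the corresponding committed DTX entries as removable. */
static int
tgt_prepare_rpc(struct tgt_track *tt, bool modification,
		struct dtx_id *acks_out, uint32_t *ack_nr_out)
{
	if (modification) {
		if (tt->tt_credits == 0)
			return -1;	/* caller waits until a credit is released */
		tt->tt_credits--;
	}

	*ack_nr_out = tt->tt_ack_nr;
	memcpy(acks_out, tt->tt_acks, tt->tt_ack_nr * sizeof(*acks_out));
	tt->tt_ack_nr = 0;
	return 0;
}
```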
This feature may include RPC format changes, so interoperability should be considered too.
Tracking ticket: https://daosio.atlassian.net/browse/DAOS-14231