DAOS QoS

Background

DAOS is designed to support multi-tenancy, so QoS is an important feature to guarantee fairness of service between tenants. However, today DAOS has no QoS framework: it simply uses a FIFO queue for I/O requests, which can neither guarantee fairness nor support request priority. In addition, the DAOS client has no mechanism to throttle RPC sending. A malicious application (or a legitimate application at large scale) can start hundreds of processes and send millions of RPCs to a server, consuming significant resources on that server. The DAOS engine has no throttling on RPC receiving either, so an engine may underperform or even be killed by the OOM killer, because it can indefinitely allocate resources for incoming I/O requests while being unable to keep up with processing them.

To resolve these issues, this document provides a high-level design of DAOS QoS, which includes three areas:

  • QoS framework to guarantee fairness between different users and to support RPC priority.

  • Server-side throttling, which should prevent a server from indefinitely receiving incoming requests and eventually losing the capability of processing requests.

  • Client-side throttling, which should prevent a client from sending out an unlimited number of RPCs.

Design Overview

QoS framework of DAOS

QoS Session (QS)

A new concept, the “QoS Session”, is introduced in this design. A QoS Session (QS) describes all the server resources allocated for the associated dataset and for the services managing that dataset. For example, an I/O request only consumes bandwidth assigned to its QS; aggregation and rebuild services likewise only use the CPU cycles assigned to the QS and operate on the data associated with it. A QS is a global domain; all engines include the same set of QSs.

A QS includes tasks, task-queues, and credits. A QS can be assigned a specified percentage of the server's resources, which is quantified as “credits”. A task, which is an execution unit, consumes a certain number of credits. If a task requires more credits than remain in the QS it belongs to, the task is put on one of the task-queues of the QS until the required credits become available.
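The following minimal sketch shows this bookkeeping in C. All names here (struct qos_session, struct qos_task, the field layout) are illustrative assumptions for this document, not a final API:

    #include <stdint.h>
    #include <uuid/uuid.h>

    struct qos_task;    /* execution unit; sketched under "QoS Task (QT)" */

    struct qos_session {
            uuid_t               qs_id;          /* global session UUID */
            struct qos_session  *qs_parent;      /* NULL for a top-level QS */
            struct qos_session **qs_children;    /* nested sub-sessions */
            int                  qs_nr_children;
            int                  qs_rr_cursor;   /* round-robin poll position */
            struct qos_task    **qs_queue;       /* FIFO task-queue (default) */
            int                  qs_queue_len;
            uint64_t             qs_op_credits;  /* remaining operation credits */
            uint64_t             qs_pl_credits;  /* remaining payload credits */
    };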

QSs can be nested, but a QS cannot belong to multiple parent QSs. The resources assigned to a QS must always be equal to or less than the remaining resources of its parent.

A QS provides an API for callers to poll a task for running. A QS may have a customized task selection policy and sub-session polling schema. By default, round-robin is the schema for sub-session polling, and FIFO is the policy for task selection. Advanced policies and schemas are not covered in this design document.
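A minimal sketch of this default poll path, assuming the qos_session layout above and the QT credit fields (qt_op_cost, qt_pl_cost) sketched under “QoS Task (QT)” below:

    #include <string.h>

    static struct qos_task *
    qos_session_poll(struct qos_session *qs)
    {
            struct qos_task *qt;
            int              i;

            /* Round-robin over sub-sessions, resuming after the one
             * that produced a task on the previous poll. A sub-session's
             * credits are assumed to have been carved out of the
             * parent's at assignment time, so no extra charge here. */
            for (i = 0; i < qs->qs_nr_children; i++) {
                    int idx = (qs->qs_rr_cursor + i + 1) %
                              qs->qs_nr_children;

                    qt = qos_session_poll(qs->qs_children[idx]);
                    if (qt != NULL) {
                            qs->qs_rr_cursor = idx;
                            return qt;
                    }
            }

            /* FIFO: only the head of this session's own queue is considered. */
            if (qs->qs_queue_len == 0)
                    return NULL;
            qt = qs->qs_queue[0];
            if (qt->qt_op_cost > qs->qs_op_credits ||
                qt->qt_pl_cost > qs->qs_pl_credits)
                    return NULL;        /* head must wait for credits */

            /* Pop the head and charge its credits until it completes. */
            memmove(&qs->qs_queue[0], &qs->qs_queue[1],
                    --qs->qs_queue_len * sizeof(qt));
            qs->qs_op_credits -= qt->qt_op_cost;
            qs->qs_pl_credits -= qt->qt_pl_cost;
            return qt;
    }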

QoS Credits

There are two types of credits for DAOS QoS: operation credits and payload credits.

  • Operation credit: A task always consumes one operation credit. In other words, the total number of operation credits of a QS equals the number of running ULTs on the server. Because system ULTs can occupy the CPU for a longer time, each of them may consume more than one operation credit.

  • Payload credit: A task can consume N payload credits, where N is the I/O size divided by 1024 (i.e., the size in KiB). Sometimes it is hard to estimate the I/O size before running the task, for example for the aggregation service, so a “reserve” API should be provided that allows a task to reserve extra resources and return the unused portion before starting I/O, as sketched below.
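A sketch of such a reserve API, under assumed names (qos_credit_reserve and qos_credit_unreserve are hypothetical):

    #include <stdint.h>

    struct qos_session;    /* per the QS sketch above */

    /* Reserve up to max_pl payload credits before the I/O size is
     * known; returns the number actually granted. */
    uint64_t qos_credit_reserve(struct qos_session *qs, uint64_t max_pl);

    /* Return the credits that turned out to be unneeded once the
     * real I/O size is known. */
    void qos_credit_unreserve(struct qos_session *qs, uint64_t unused_pl);

    /* Example: an aggregation task over-reserves, then trims before
     * starting its I/O. */
    static void
    aggregation_prepare(struct qos_session *qs)
    {
            uint64_t granted = qos_credit_reserve(qs, 4096); /* <= 4 MiB */
            uint64_t actual  = 1024;  /* 1 MiB, known after scanning */

            qos_credit_unreserve(qs, granted - actual);
            /* ... start the I/O holding `actual` payload credits ... */
    }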

QoS Task (QT)

A QoS Task (QT) is an execution unit; it includes a callback function and the required credits (operation credits and payload credits). There are two types of QTs: user QTs and system QTs.
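A minimal sketch of the QT structure, again with illustrative names only:

    #include <stdint.h>

    typedef void (*qt_func_t)(void *arg);

    struct qos_task {
            qt_func_t qt_func;     /* user QT: the RPC handler;
                                    * system QT: e.g. an aggregation step */
            void     *qt_arg;      /* argument passed to qt_func */
            uint64_t  qt_op_cost;  /* operation credits (1 for user QTs,
                                    * possibly more for system ULTs) */
            uint64_t  qt_pl_cost;  /* payload credits, i.e. size / 1024 */
    };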

  • User QT

When a request arrives at the server, it is assigned to a QS based on the QS-ID (a UUID) carried by the RPC. The request is not scheduled immediately; instead it is queued as a QT. The polling ULT checks whether the QS still has enough credits for the QT and decides whether it can be scheduled or should wait for resources.

The DAOS engine shall create a ULT for a scheduled QT, and the credits held by the task are returned to the QS on completion of the ULT. The callback function of a user QT is the RPC handler.
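Since engine ULTs are Argobots threads, the dispatch step could look like the sketch below. It assumes the qos_session/qos_task layouts sketched above, and qos_credits_return() is an assumed helper, not an existing function:

    #include <stdint.h>
    #include <abt.h>

    /* Assumed helper: give the charged credits back to the QS. */
    void qos_credits_return(struct qos_session *qs, uint64_t op,
                            uint64_t pl);

    struct qt_ult_arg {
            struct qos_session *qs;
            struct qos_task    *qt;
    };

    static void
    qt_ult_func(void *varg)
    {
            struct qt_ult_arg *arg = varg;

            /* For a user QT the callback is the RPC handler. */
            arg->qt->qt_func(arg->qt->qt_arg);

            /* Completion: return the credits held by this QT. */
            qos_credits_return(arg->qs, arg->qt->qt_op_cost,
                               arg->qt->qt_pl_cost);
    }

    /* Called from the engine after qos_session_poll() returned a
     * runnable QT. */
    static int
    qt_dispatch(ABT_pool pool, struct qt_ult_arg *arg)
    {
            return ABT_thread_create(pool, qt_ult_func, arg,
                                     ABT_THREAD_ATTR_NULL, NULL);
    }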

  • System QT

The DAOS engine has a few background services: EC aggregation service, VOS aggregation service, VOS GC service, checksum scrubber, and rebuild service. When activated, each of them will consume CPU cycles and I/O bandwidth. Resources consumed by these background services should be taken from the QS they belong to.

Not every QS has all the system QTs. For example, if a user creates a QS for each container, there will not be dedicated GC and aggregation QTs for each individual container; instead they share the QTs of their parent, which is the pool QS.

DAOS storage model and QoS Session (QS)

DAOS has no per-object access control; it can only control user access at the pool or container level. By default, DAOS creates a session for each active pool, and a sub-session for each container within the pool. No credits are assigned to these sessions by default; instead QTs are polled from QSs in a round-robin manner. Within a QS, QTs are selected from a FIFO queue by default.

Besides container sub-sessions, a pool session also includes a system sub-session that holds system QTs such as aggregation, GC, scrubber, and rebuild. By default, there are no dedicated aggregation or other system services for each container, but DAOS will provide an interface that allows the user to allocate dedicated system QTs for a container.

If a container has dedicated system QTs, the pool-level system QTs should skip this container to avoid putting more resources on it than expected, as sketched below.
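An illustrative skip check; the structures and the has_own_agg_qt flag are assumptions made for this sketch:

    #include <stdbool.h>

    struct cont_qs {
            bool has_own_agg_qt;   /* container owns a dedicated
                                    * aggregation QT */
    };

    struct pool_qs {
            int              nr_conts;
            struct cont_qs **conts;
    };

    void cont_aggregate_one(struct cont_qs *cont);   /* assumed helper */

    /* Pool-level aggregation QT: visit each container, skipping those
     * that pay for aggregation out of their own QS. */
    static void
    pool_aggregate(struct pool_qs *pool)
    {
            int i;

            for (i = 0; i < pool->nr_conts; i++) {
                    if (pool->conts[i]->has_own_agg_qt)
                            continue;
                    cont_aggregate_one(pool->conts[i]);
            }
    }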

Administrators can assign resources to a QS, for example, 50% of operation credits and 30% of the overall bandwidth to a pool QS.

Custom QoS Session

The DAOS QoS framework should provide an interface through dmg that allows administrators to create a custom QoS session. A custom QS has no tasks or sub-sessions within it after creation, but the administrator can assign pools or containers to it. All the assigned pools and containers share the resources of the custom QS, and each of them is represented as a sub-session within it.

Administrators can also create sub-sessions within a custom QS and add system QTs to those sub-sessions.
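Purely for illustration, such an interface might look like the following; none of these dmg subcommands or options exist today, and the final command-line syntax is out of scope for this document:

    $ dmg qos create --session tenant-A
    $ dmg qos assign --session tenant-A --pool io500 --cont ckpt
    $ dmg qos limit --session tenant-A --op-credits 50% --bandwidth 30%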

QT selection policy

FIFO-based QT selection has several potential issues:

  • It cannot guarantee fairness for requests from different clients, because one client can submit thousands of RPCs in a second, and all the other clients have to wait until those RPCs are processed.

  • If a QT requires too many credits, it cannot be scheduled. When such a QT is already at the head of the queue, it may block all the following small requests for a long time. Simply reordering requests cannot resolve the issue, because the large request can then starve if the QS keeps moving small requests to the head of the queue.

To resolve these issues, an advanced QT selection policy should be provided. There are existing algorithms that can be used, but they will not be discussed in this document.

Server-Side RPC Throttling

DAOS currently has no server-side RPC throttling. If I/O workloads are not well balanced in a large cluster, hundreds of millions of RPCs can be submitted to a single server. In this case, they will all be queued as tasks on QSs and will pin a lot of resources. The server may not have enough CPU cycles or memory to process those queued tasks at a regular speed. Even worse, it may be killed by the OOM killer as memory pressure grows due to the ever-increasing number of queued tasks.

To prevent incoming requests from pinning too many server resources when the server cannot catch up with already queued tasks, a QS may reject additional task creation once the number of its queued tasks exceeds a threshold. The rejected request should return a hint about how long to wait before retrying.

Each QS should maintain its own statistics on task scheduling rate and the average waiting time of queued tasks. If a QS decides to reject a new QT, it should provide a time hint based on these historical statistics, so the client can retry accordingly.
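A sketch of the admission check, under assumed names and an arbitrary threshold; the retry hint here is derived from the session's observed scheduling rate:

    #include <errno.h>
    #include <stdint.h>

    struct qs_throttle {
            uint32_t queued;      /* tasks currently queued on this QS */
            uint32_t limit;       /* rejection threshold */
            uint32_t sched_rate;  /* tasks scheduled per second
                                   * (moving average) */
    };

    /* Returns 0 if the task may be queued, or -EAGAIN with a retry
     * hint in milliseconds. */
    static int
    qs_admit(struct qs_throttle *t, uint32_t *hint_ms)
    {
            if (t->queued < t->limit) {
                    t->queued++;
                    return 0;
            }
            /* Time to drain the current backlog at the observed rate;
             * fall back to a fixed hint if no history exists yet. */
            *hint_ms = t->sched_rate != 0 ?
                       t->queued * 1000 / t->sched_rate : 1000;
            return -EAGAIN;
    }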

Client-Side RPC Throttling

Today DAOS/CaRT can only limit the number of in-flight RPCs per process, but because the DAOS client is a user-space library, a compute node can start an arbitrary number of client processes and submit millions of RPCs to every engine. It is typical for HPC applications to run one task per physical CPU core, which results in O(100) processes per client node.

Kernel filesystems do not have this issue because the kernel-space I/O daemon is the only agent sending RPCs; it can easily control the total number of RPCs from all client processes running on the same node. The DAOS userspace client library is lightweight and more flexible, but it lives within the private address space and network context of the application process, so it is hard to throttle the total rate of RPC sending at the node level.

Creating an I/O agent is one option: all I/O requests would go through this agent, and client processes would no longer communicate directly with the servers. However, this approach requires significant changes to the client-side code and network stack, and it also incurs an extra memory copy.

Instead of such a heavyweight DAOS I/O agent, this design introduces a session-based mechanism: the DAOS agent creates a shared memory region and maintains all the session IDs being accessed from the node. It also allocates a certain number of operation credits to each session ID, for example, 128 credits.

Before a client process submits an I/O request, it should access the shared memory and obtain a credit, and return the credit on completion of the operation. If no credit is available for the session, the process should claim a spot in the wait-queue in shared memory and queue the request in its own context. Each time the process calls progress, it should poll the status of its waiting spot and submit the RPC once the spot is marked ready. After an RPC completes, the credit it held is returned to the shared memory and the first waiting spot is marked ready, so the other waiters can find out while polling. A minimal sketch of this credit pool follows.
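The sketch below uses C11 atomics in a region the agent would create with shm_open()/mmap() (omitted here). All names and sizes are illustrative assumptions, and a production version must additionally handle the wakeup race between a waiter claiming a slot and a credit being returned concurrently:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define QOS_SESSION_CREDITS 128    /* example from the text */
    #define QOS_WAITQ_SLOTS     1024   /* illustrative wait-queue size */

    struct qos_shm_session {
            atomic_uint credits;       /* free operation credits */
            atomic_uint waitq_head;    /* next slot to wake up */
            atomic_uint waitq_tail;    /* next slot to claim */
            atomic_bool ready[QOS_WAITQ_SLOTS]; /* per-slot wakeup flags */
    };

    /* Take a credit, or claim a wait-queue slot that the caller polls
     * from its progress loop. Returns 1 on success, else 0 with the
     * claimed slot stored in *slot. */
    static int
    qos_credit_get(struct qos_shm_session *s, unsigned *slot)
    {
            unsigned c = atomic_load(&s->credits);

            while (c > 0) {
                    if (atomic_compare_exchange_weak(&s->credits, &c,
                                                     c - 1))
                            return 1;
            }
            *slot = atomic_fetch_add(&s->waitq_tail, 1) % QOS_WAITQ_SLOTS;
            atomic_store(&s->ready[*slot], false);
            return 0;
    }

    /* Return a credit on RPC completion: wake the oldest waiter if
     * any, otherwise put the credit back in the pool. */
    static void
    qos_credit_put(struct qos_shm_session *s)
    {
            unsigned head = atomic_load(&s->waitq_head);

            if (head != atomic_load(&s->waitq_tail) &&
                atomic_compare_exchange_strong(&s->waitq_head, &head,
                                               head + 1)) {
                    atomic_store(&s->ready[head % QOS_WAITQ_SLOTS], true);
                    return;
            }
            atomic_fetch_add(&s->credits, 1);
    }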