Flock Design
Background
flock(2) is an advisory lock interface in Linux that can be applied to files to control access. It offers different lock modes, allowing processes to share or exclusively access a protected resource. However, flock only works on regular files and relies on cooperation among applications: an application that does not attempt to acquire the lock can still access the file without blocking, even if another application holds an exclusive lock.
Mandatory locks, which the Linux kernel enforces based on file modes, also exist, but they are not relevant to this discussion.
Often, flock isn't used to protect file content itself but rather to control access to resources. For instance, it might be applied to a zero-length file to ensure only one job configures or initializes resources before others consume them. Although range locks exist in Linux, they are rarely used in practice.
Objective
This design document proposes enhancing DAOS with distributed flock functionality, enabling locking across different client nodes.
Note that this design doc doesn't aim to implement the lockf(2) interface, which has different semantics and capabilities.
Overview
This section describes, from a high level, how flock works in ParallelStore. First, a new RPC type, DAOS_OBJ_RPC_LOCK, will be added for this purpose. The object lock, unlock, and refresh operations will all be packed into this single RPC, along with the lock mode where applicable.
The following diagram demonstrates a typical process of two clients requesting conflicting locks:
To address the common failure modes of distributed systems, where clients and servers can be lost at any time, the design must remain operational when that happens. In general, the client has to periodically refresh its granted lock to prove that it is still alive, and the server has to store the granted lock in persistent storage to leverage the existing redundancy architecture of DAOS.
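As a rough illustration of how the packed request could look, the C sketch below shows one possible input structure for DAOS_OBJ_RPC_LOCK; all field, enum, and type names here are assumptions, not the final wire format.

    #include <stdint.h>

    /* Illustrative sketch only: these names are assumptions, not the final API. */
    enum flock_op {
        FLOCK_OP_LOCK,          /* acquire a lock (shared or exclusive) */
        FLOCK_OP_UNLOCK,        /* release a previously granted lock */
        FLOCK_OP_REFRESH,       /* renew the lease on a granted lock */
    };

    enum flock_mode {
        FLOCK_MODE_NONE,        /* no mode, e.g. for unlock or refresh */
        FLOCK_MODE_SHARED,
        FLOCK_MODE_EXCL,
    };

    /* Input packed into the single DAOS_OBJ_RPC_LOCK RPC. */
    struct flock_in {
        uint32_t fi_op;         /* enum flock_op */
        uint32_t fi_mode;       /* enum flock_mode; only meaningful for lock */
        uint64_t fi_cookie;     /* identifies an already granted lock */
        uint64_t fi_enqueue_id; /* optional, best-effort queue fairness */
    };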
Lock Mode
To satisfy the flock semantics, there are two lock modes: shared and exclusive. A shared lock can be granted to multiple clients, while an exclusive lock can only be granted to one. If a lock request arrives while a conflicting lock has already been granted, the request is added to a queue. Due to scalability concerns, the queue is never saved to persistent storage, because its length would be proportional to the number of clients. For the same reason, we won't record which clients a shared lock has been granted to in persistent storage.
In principle, for each object the server keeps at most two lock records, each with a reference count: a shared lock may have a reference count greater than 1, while an exclusive lock's reference count can only be 1. When a lock is granted, the server updates the flock aKey in the directory entry for the object, and the update is replicated across shards if applicable.
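For illustration, the per-object, in-memory bookkeeping implied above could be as small as the sketch below; the names are assumptions, and only the reference counts are essential.

    #include <stdint.h>

    /* Sketch of per-object lock bookkeeping kept by the server in memory. */
    struct obj_flock_state {
        uint64_t shared_cookie; /* cookie of the shared lock, 0 if none */
        uint32_t shared_refs;   /* may exceed 1 as more clients share the lock */
        uint64_t excl_cookie;   /* cookie of the exclusive lock, 0 if none */
        uint32_t excl_refs;     /* always 0 or 1 */
    };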
Lock Request
The client lock request, triggered by a user request, passes through the fuse kernel and is received by the dfuse daemon. If the server can grant the lock, it will initiate a transaction and write the corresponding data structure (describing the lock) as an aKey into the VOS. The server will start a lease timer only if there are conflicting lock requests in the queue; if the client then fails to refresh the lock before the lease expires, the server will revoke it and may grant it to other clients.
If the lock cannot be granted due to a conflict, the request is added to a queue and enters long-polling mode. In this mode, the server delays the reply for rpc_timeout / 2 and responds to the client with an EAGAIN error. The client, upon seeing this error code, will retry the lock request.
If the lock request is granted, the client will schedule refresh requests. If not, the server replies with EAGAIN, and the client reissues the request. The EAGAIN error only indicates that the request cannot be fulfilled at that time, not necessarily a true error.
Once finished using the lock, the client sends a lock release request. If the releasing client is the last holder (always true for exclusive locks, sometimes for shared ones), the server picks the next client in the queue and grants it a new lock; for shared locks, it grants all applicable queued clients. If there is no request in the queue, the server simply punches the flock aKey.
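The client-side acquire loop suggested by this flow might look like the sketch below; dc_flock_rpc() and schedule_refresh() are hypothetical helpers, and the FLOCK_* names reuse the illustrative definitions from the earlier sketch.

    #include <errno.h>
    #include <stdint.h>

    /* Hypothetical helper: issues DAOS_OBJ_RPC_LOCK and returns 0 on success,
     * -EAGAIN when the long poll expired, or another negative errno. */
    int dc_flock_rpc(uint32_t op, uint32_t mode, uint64_t *cookie);

    /* Hypothetical helper: arms the periodic refresh for a granted lock. */
    void schedule_refresh(uint64_t cookie);

    /* Acquire a lock, retrying for as long as the server replies EAGAIN. */
    static int flock_acquire(uint32_t mode, uint64_t *cookie)
    {
        int rc;

        do {
            rc = dc_flock_rpc(FLOCK_OP_LOCK, mode, cookie);
            /* -EAGAIN only means "not granted yet", not a real error. */
        } while (rc == -EAGAIN);

        if (rc == 0)
            schedule_refresh(*cookie);
        return rc;
    }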
Server Side Processing
In the last section, we briefly discussed how servers handle lock requests. This section will elaborate on the design specifics.
Lock Wait Queue
The lock wait queue will be stored in memory. If the server crashes, the queue's state will be lost. The client will receive a timeout for its lock request, observe the pool map change, and then requeue the lock request on the new server. This means the queue order will not be preserved. The server will periodically iterate through the wait queue, respond to requests with an EAGAIN error, and delete them from the queue. The client should then requeue the request.
Once a conflicting lock is released, the server grants the next request in the queue and expects the granted client to send periodic refresh requests. The server will start a lease timer, and the lock will be revoked if the client fails to refresh it before the lease expires. To tolerate transient network hiccups, the lease timer should be at least twice the refresh interval.
Adding a timestamp or enqueue_id to lock requests could help maintain fairness, but fairness is not guaranteed, because the earliest request might be in the middle of a retry at the time of unlock.
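One possible shape for an in-memory wait-queue entry is sketched below, assuming the TAILQ macros from sys/queue.h; the field names, including enqueue_id, are illustrative.

    #include <stdint.h>
    #include <sys/queue.h>

    /* Illustrative wait-queue entry; it lives only in memory and is lost
     * (and later rebuilt by client retries) if the server crashes. */
    struct flock_waiter {
        TAILQ_ENTRY(flock_waiter) fw_link;
        uint64_t fw_enqueue_id; /* monotonic id, best-effort fairness only */
        uint64_t fw_deadline;   /* when to reply EAGAIN and drop the entry */
        uint32_t fw_mode;       /* requested lock mode */
        void    *fw_rpc;        /* RPC copied out of the multi-recv buffer */
    };

    TAILQ_HEAD(flock_waitq, flock_waiter);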
Polling Mode
When a lock request cannot be fulfilled immediately because of conflicting locks, it enters polling mode. In polling mode, the server delays for rpc_timeout / 2 seconds and, if the lock still cannot be granted by then, replies to the request with EAGAIN. Polling mode exits immediately once the lock can be granted, in which case the client instead receives a reply carrying a successfully granted lock handle.
When a lock request enters polling mode, the corresponding RPC must be copied out of the receive buffer. This prevents exhaustion of the CRT multi-recv buffer.
The purpose of polling mode is to mimic the long-polling technique used by web servers, with the goal of reducing the volume of lock request traffic.
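A hedged sketch of the periodic sweep that enforces the rpc_timeout / 2 deadline, reusing the flock_waiter entry from the previous sketch; flock_reply() is a hypothetical helper.

    #include <errno.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/queue.h>

    /* Hypothetical helper: sends the delayed reply for a parked RPC. */
    void flock_reply(void *rpc, int status);

    /* Walk the wait queue and reply EAGAIN to every request whose
     * long-poll deadline (rpc_timeout / 2) has passed. */
    static void flock_poll_sweep(struct flock_waitq *wq, uint64_t now)
    {
        struct flock_waiter *w = TAILQ_FIRST(wq);

        while (w != NULL) {
            struct flock_waiter *next = TAILQ_NEXT(w, fw_link);

            if (now >= w->fw_deadline) {
                flock_reply(w->fw_rpc, -EAGAIN);
                TAILQ_REMOVE(wq, w, fw_link);
                free(w);
            }
            w = next;
        }
    }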
Persistent State
The persistent state is stored as an aKey of the dKey that represents the directory entry for the file. This means the flock request will be sent against the directory oid with the corresponding dKey; we do this to leverage the redundancy mechanism that already exists in DAOS.
The persistent state is only present while clients hold a granted lock. At a minimum, the persistent state stores the lock's description, which includes the following (see the sketch after this list):
lock cookie - uniquely identifies the lock; we need this to identify obsolete lock requests from the client side
lock reference count - number of clients the lock is granted to for a shared lock; must be 1 for an exclusive lock
lock mode
last seen pool version
(optionally) lock extent - currently [0, eof)
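Taken together, the persisted aKey value could be a small fixed-size record like the C sketch below; the field names and layout are assumptions rather than the final on-disk format.

    #include <stdint.h>

    /* Sketch of the flock value stored as an aKey under the directory-entry
     * dKey; replicated across shards together with the rest of the entry. */
    struct flock_persist {
        uint64_t fp_cookie;   /* lock cookie: uniquely identifies the lock */
        uint32_t fp_refs;     /* granted clients; must be 1 for exclusive */
        uint32_t fp_mode;     /* shared or exclusive */
        uint32_t fp_pool_ver; /* last seen pool version */
        uint32_t fp_pad;      /* keep 64-bit fields aligned */
        uint64_t fp_ext_lo;   /* optional lock extent, currently 0 */
        uint64_t fp_ext_hi;   /* optional lock extent, currently eof */
    };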
The persistent state is stored as an aKey of the object and replicated across object shards, if applicable. Client-specific state is not maintained in persistent storage due to scalability concerns. Instead, client state is kept in memory, which means it will be lost if the server crashes. Recovering this state after a server crash is discussed in the following section.
The persistent state can only be updated by the server hosting the lead shard.
Server Recovery
If the server hosting the lead shard crashes, a pool map change will occur and a new shard will take over as the lead. When the new lead comes online, it will start a grace period to restore the list of granted clients. The grace period must exceed twice the suggested refresh interval. During this time, the server will not grant new locks; instead, it waits for clients to send refresh requests. Ideally, the lock's reference count will match the number of clients that refresh. If not, the server will reset the reference count, and the clients that failed to refresh will lose their granted locks.
Queued clients do not require special handling: they will detect the server crash and requeue their requests on the new server. However, queue fairness will not be guaranteed.
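A minimal sketch of how the new lead could gate requests during the grace period, reusing the illustrative flock_in and obj_flock_state types from earlier sketches; flock_readmit() is a hypothetical helper that rebuilds the granted list from refresh requests.

    #include <errno.h>
    #include <stdint.h>

    /* Hypothetical helper: re-admit a refreshing client into the granted
     * list and adjust the in-memory reference count accordingly. */
    int flock_readmit(struct obj_flock_state *st, struct flock_in *in);

    /* During the grace period only refresh requests are accepted; new lock
     * requests are told to retry later. Returns 0 once the grace period is
     * over so normal processing can continue. */
    static int flock_handle_in_grace(struct obj_flock_state *st,
                                     struct flock_in *in,
                                     uint64_t now, uint64_t grace_end)
    {
        if (now >= grace_end)
            return 0;                     /* grace period over */
        if (in->fi_op == FLOCK_OP_REFRESH)
            return flock_readmit(st, in);
        return -EAGAIN;                   /* defer new lock requests */
    }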
Lock Conversion
The flock(2) man page mandates that the implementation support lock conversion: if a new lock request is made on the same file descriptor from the same process, the existing lock should be converted. This implies that conflicting lock requests on the same file descriptor will not deadlock. This behavior will be implemented because some applications may depend on it.
However, implementing this in fuse presents technical challenges, because fuse does not pass the process identifier along with the flock request to dfuse. In fuse, every flock request carries a lock_owner, a hash value computed from the file structure pointer and the fuse session pointer. It therefore does not uniquely identify a file descriptor of a process: there is a small chance of hash collision, meaning two different open files could map to the same lock_owner. Properly fixing this would require patching the fuse kernel and libfuse, which is not easy. In the initial version, we will use lock_owner as the unique identifier anyway, and this problem will be addressed in a later phase.
A single RPC can complete a conversion from an exclusive lock to a shared lock. Converting a shared lock to an exclusive lock, however, requires two RPCs: the first unlocks the original lock, and the second requests a new exclusive lock.
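The client-side conversion path could then be driven as in the sketch below, reusing the hypothetical dc_flock_rpc() and flock_acquire() helpers and the FLOCK_* names from the earlier sketches.

    /* Exclusive -> shared: a single lock RPC carrying the existing cookie.
     * Shared -> exclusive: an unlock RPC followed by a fresh exclusive
     * request, so the lock is briefly dropped in between. */
    static int flock_convert(uint64_t *cookie, uint32_t old_mode,
                             uint32_t new_mode)
    {
        int rc;

        if (old_mode == FLOCK_MODE_EXCL && new_mode == FLOCK_MODE_SHARED)
            return dc_flock_rpc(FLOCK_OP_LOCK, new_mode, cookie);

        rc = dc_flock_rpc(FLOCK_OP_UNLOCK, FLOCK_MODE_NONE, cookie);
        if (rc != 0)
            return rc;
        return flock_acquire(new_mode, cookie);
    }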
Client Crash
If a client holding a granted lock crashes, it will cease sending refresh requests to the server. The server will then remove the client from the granted list and decrement the lock's reference count, similar to the lock release process.
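As a sketch of the server side of this path, the lease-expiry handler below mirrors the unlock path, again reusing the illustrative obj_flock_state and flock_waitq types; flock_punch_akey() and flock_grant_next() are hypothetical helpers.

    /* Hypothetical helpers: remove the persistent flock aKey, and grant the
     * lock to the next eligible waiter(s) in the queue. */
    void flock_punch_akey(struct obj_flock_state *st);
    void flock_grant_next(struct obj_flock_state *st, struct flock_waitq *wq);

    /* A granted client missed its refresh deadline: drop its reference and,
     * once nobody holds the lock, clean up and wake the queue. */
    static void flock_client_expired(struct obj_flock_state *st,
                                     struct flock_waitq *wq, uint32_t mode)
    {
        if (mode == FLOCK_MODE_SHARED && st->shared_refs > 0)
            st->shared_refs--;
        else if (mode == FLOCK_MODE_EXCL)
            st->excl_refs = 0;

        if (st->shared_refs == 0 && st->excl_refs == 0) {
            flock_punch_akey(st);
            flock_grant_next(st, wq);
        }
    }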