Interception library design document

Stakeholders

Lei Huang, Mohamad Chaarawi, Ashley Pittman

Introduction

Applications normally must be modified to use the DFS API in order to take full advantage of the performance DAOS provides. This requires extra porting and maintenance effort from software developers. For applications that do not support DFS, dfuse lets them use the POSIX interface without code changes, but with degraded I/O performance compared with the native DFS API. The interception library libioil.so was introduced to boost read/write bandwidth by bypassing dfuse for read()/write() and their variants. Metadata-related functions (e.g., open, stat, unlink) still suffer from the inferior performance imposed by dfuse. A new interception library that intercepts all I/O-related functions in glibc (read/write as well as metadata-related functions) is required to deliver POSIX performance comparable to that of DFS.

Design Overview

The design of this interception library is straightforward. We need a way to implement function interception and to provide replacements for all functions of interest in libc, libpthread and ld. To bypass dfuse, we need fake file descriptors (fds) managed in user space. Fake fds can be assigned large integer values so that they are easily distinguished from the real fds allocated by the Linux kernel. All I/O-related functions have to be intercepted because the kernel does not recognize the fake fds allocated by our interception library.
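The fake fd idea can be sketched as follows. This is a minimal illustration, not the library's actual code: FAKE_FD_BASE, FAKE_FD_MAX, struct fake_file and the three functions are hypothetical names and values chosen for the example. The design only states that fake fds are large integers managed in user space.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values: the design only says fake fds are "large integers". */
#define FAKE_FD_BASE 0x20000000
#define FAKE_FD_MAX  1024

/* Per-file state that the kernel would normally keep for a real fd. */
struct fake_file {
    bool in_use;
    long offset; /* user-space file position */
};

/* Not thread-safe; a real table would need locking. */
static struct fake_file fake_table[FAKE_FD_MAX];

/* Allocate the lowest free slot and return it as a fake fd, or -1 if full. */
int fake_fd_alloc(void)
{
    for (int i = 0; i < FAKE_FD_MAX; i++) {
        if (!fake_table[i].in_use) {
            fake_table[i].in_use = true;
            fake_table[i].offset = 0;
            return FAKE_FD_BASE + i;
        }
    }
    return -1;
}

/* True when fd lies in the user-space range and must never reach the kernel. */
bool is_fake_fd(int fd)
{
    return fd >= FAKE_FD_BASE && fd < FAKE_FD_BASE + FAKE_FD_MAX;
}

/* Release a fake fd; returns 0 on success, -1 if fd is not an open fake fd. */
int fake_fd_free(int fd)
{
    if (!is_fake_fd(fd) || !fake_table[fd - FAKE_FD_BASE].in_use)
        return -1;
    fake_table[fd - FAKE_FD_BASE].in_use = false;
    return 0;
}
```

Every intercepted function that takes an fd would first call something like is_fake_fd() and forward to the original libc function only when the check fails.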

There are different techniques for function interception. The easiest is to implement the function one wants to intercept and export its name from the shared library. However, this method does not work for internal function calls, e.g., when target functions in libc are called from inside other libc functions.
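A minimal sketch of the symbol-export approach, using close() as the example. It assumes the common pattern of looking up the original function with dlsym(RTLD_NEXT, ...); the source does not say how the library forwards calls. FAKE_FD_BASE and intercepted_calls are hypothetical names introduced for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for RTLD_NEXT */
#endif
#include <dlfcn.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define FAKE_FD_BASE 0x20000000 /* hypothetical threshold for fake fds */

int intercepted_calls; /* counts calls handled entirely in user space */

static int (*real_close)(int);

/* Exported with the same name as the libc function, so the dynamic
 * linker resolves the application's calls to this definition first. */
int close(int fd)
{
    if (fd >= FAKE_FD_BASE) {
        /* Fake fd: the kernel knows nothing about it, handle it here. */
        intercepted_calls++;
        return 0;
    }
    if (real_close == NULL) {
        /* Find the next definition of close() in load order (libc's). */
        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
        if (real_close == NULL) {
            errno = ENOSYS;
            return -1;
        }
    }
    return real_close(fd);
}
```

Built as a shared object and loaded with LD_PRELOAD (older glibc versions also need -ldl at link time), this catches calls made by the application, but not calls that libc makes to close() internally.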

We use a function interception technique called a trampoline in this library when needed. It overwrites the entry of a target function with a jump instruction that redirects to the new implementation. The overwritten instructions are backed up in memory and executed when the original function is invoked. With this method we can control precisely which functions we intercept.
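The two pieces of data a trampoline needs, the saved entry bytes and the jump that replaces them, can be sketched as below. This is an illustration under stated assumptions: it uses the x86_64 5-byte near jump (opcode 0xE9 followed by a 32-bit displacement), which the source does not confirm as the encoding the library uses, and struct patch, encode_jmp_rel32() and prepare_patch() are hypothetical names. The sketch only builds the bytes; it does not patch live code.

```c
#include <stdint.h>
#include <string.h>

#define JMP_REL32_LEN 5 /* opcode 0xE9 + 32-bit displacement */

struct patch {
    uint8_t saved[JMP_REL32_LEN]; /* original entry bytes, kept for the trampoline */
    uint8_t jump[JMP_REL32_LEN];  /* bytes that would overwrite the entry */
};

/* Encode "jmp rel32" located at address `from` and targeting `to`.
 * The displacement is relative to the end of the jump instruction.
 * Returns 0 on success, -1 if the target is out of 32-bit range. */
int encode_jmp_rel32(uint8_t out[JMP_REL32_LEN], uint64_t from, uint64_t to)
{
    int64_t rel = (int64_t)(to - (from + JMP_REL32_LEN));
    if (rel < INT32_MIN || rel > INT32_MAX)
        return -1;
    uint32_t u = (uint32_t)(int32_t)rel;
    out[0] = 0xE9;
    for (int i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(u >> (8 * i)); /* little-endian displacement */
    return 0;
}

/* Back up the entry bytes and compute the replacement jump.
 * A real implementation must back up whole instructions, which needs an
 * instruction-length decoder, and must make the page writable first. */
int prepare_patch(struct patch *p, const uint8_t *entry_bytes,
                  uint64_t entry_addr, uint64_t new_addr)
{
    memcpy(p->saved, entry_bytes, JMP_REL32_LEN);
    return encode_jmp_rel32(p->jump, entry_addr, new_addr);
}
```

To call the original function, the trampoline would execute the saved bytes and then jump back to the entry address plus the number of bytes that were overwritten.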

In principle, we need to implement all I/O-related functions in our interception library since fake fds are used. This means a large number of APIs are involved and a lot of testing will be needed. It may take significant time for the library to become stable and support most applications.

User Interface

In the current implementation, simple string comparison for paths and integer comparison for fds are used to determine whether a file or directory is on a DAOS filesystem. A list of DAOS container mount points is maintained.
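The path check can be sketched as a prefix comparison against the mount point list. This is a minimal sketch, not the library's code: path_under_mount() and find_daos_mount() are hypothetical names, and the sketch assumes paths are already absolute and normalized (no "..", no symlinks to resolve).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when `path` equals `mount` or lies beneath it. The comparison ends
 * on a component boundary, so "/dfs2/x" does not match the mount "/dfs". */
bool path_under_mount(const char *path, const char *mount)
{
    size_t n = strlen(mount);

    if (n == 1 && mount[0] == '/')
        return path[0] == '/';      /* root contains every absolute path */
    if (n > 1 && mount[n - 1] == '/')
        n--;                        /* tolerate a trailing slash */
    if (strncmp(path, mount, n) != 0)
        return false;
    return path[n] == '\0' || path[n] == '/';
}

/* Return the index of the first mount point containing `path`, or -1. */
int find_daos_mount(const char *path, const char *const mounts[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (path_under_mount(path, mounts[i]))
            return (int)i;
    }
    return -1;
}
```

A return value of -1 means the path is not on DAOS, so the call would be forwarded to the original libc function.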

The current interception library can query /proc/mounts to determine dfuse mount points. One mount point can also be passed through environment variables for testing purposes.
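Discovering dfuse mount points from the kernel's mount table (for example /proc/self/mounts) can be sketched as below. The filesystem type string "fuse.daos" is an assumption about how dfuse mounts are reported, and dfuse_mount_from_line() and list_dfuse_mounts() are hypothetical names. The sketch does not decode escaped characters such as \040 in mount points; getmntent() handles those.

```c
#include <stdio.h>
#include <string.h>

#define DFUSE_FS_TYPE "fuse.daos" /* assumed type reported for dfuse mounts */

/* Parse one mount table line: "device mountpoint fstype options dump pass".
 * Returns 1 and copies the mount point to `out` for a dfuse mount,
 * 0 for any other filesystem, -1 for a malformed or oversized line. */
int dfuse_mount_from_line(const char *line, char *out, size_t out_len)
{
    char dev[256], dir[4096], type[64];

    if (sscanf(line, "%255s %4095s %63s", dev, dir, type) != 3)
        return -1;
    if (strcmp(type, DFUSE_FS_TYPE) != 0)
        return 0;
    if (strlen(dir) >= out_len)
        return -1;
    strcpy(out, dir);
    return 1;
}

/* Print every dfuse mount point found in `mounts_path`.
 * Returns the number found, or -1 if the file cannot be opened. */
int list_dfuse_mounts(const char *mounts_path)
{
    char line[8192], dir[4096];
    int found = 0;
    FILE *f = fopen(mounts_path, "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (dfuse_mount_from_line(line, dir, sizeof(dir)) == 1) {
            puts(dir);
            found++;
        }
    }
    fclose(f);
    return found;
}
```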

Example:

export DAOS_POOL="testpool"
export DAOS_CONTAINER="testcont"
export DAOS_MOUNT_POINT="/dfs"

export LD_PRELOAD=/usr/lib64/libpil4dfs.so

mpirun -np xx mdtest …

Limitations

Stability issues: Many APIs are involved, and we have not carefully tested each function. There may be bugs, functions that are not yet covered or intercepted, etc.

The current code was developed and tested on x86_64. Work to port the library to Arm64 is ongoing, but it has not been tested on Arm64 yet.

dfuse has to be mounted for pool and container handle resolution and to handle some operations on the container that we do not support yet.

  • Multiple pools and containers within a single dfuse mount point are not supported yet (each container accessed should be mounted separately), i.e., there is no UNS support (due to concerns about the overhead of getfattr())

  • Large overhead for small tasks (200~300 ms in daos_init())

  • Does not work for statically linked executables (such as Go applications)

  • No support yet for creating a process whose executable and shared object files are stored on DAOS

  • No support yet for applications using fork (ibverbs does not work due to a verbs limitation; TCP does not seem to work either)

These unsupported features remain available through dfuse.

Note:

Running an executable stored on DFS is also possible with memfd_create() and fexecve(); we will explore this later. A user-space cache is probably needed to get reasonable performance.
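The memfd_create() and fexecve() combination could look roughly like the sketch below. It assumes the executable image has already been read from DFS into a memory buffer; memfd_from_buffer() and exec_from_memory() are hypothetical names, and this is an exploration aid rather than a planned implementation.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for memfd_create() */
#endif
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy `len` bytes into an anonymous in-memory file.
 * Returns the file descriptor, or -1 on failure. */
int memfd_from_buffer(const char *name, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = memfd_create(name, MFD_CLOEXEC);

    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return fd;
}

/* Execute an ELF image held in memory; returns -1 only if the exec fails.
 * MFD_CLOEXEC is fine for ELF binaries but breaks "#!" scripts, because
 * the interpreter can no longer open the descriptor after the exec. */
int exec_from_memory(const void *image, size_t len,
                     char *const argv[], char *const envp[])
{
    int fd = memfd_from_buffer("daos-exec", image, len);

    if (fd < 0)
        return -1;
    fexecve(fd, argv, envp);
    close(fd); /* reached only when fexecve() failed */
    return -1;
}
```

Shared objects stored on DAOS would need separate handling, since the dynamic loader opens them by path.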