Pydaos Torch module
Introduction
PyTorch integration with DAOS allows using DAOS as fast storage for machine learning training datasets. A native PyTorch Dataset backed by a DAOS module provides smoother integration, without the need for dfuse and the interception library. It should also provide increased throughput for large sample files, as well as control over prefetching and asynchronous parallel loading in batches.
Requirements & Use Cases
The module should implement the PyTorch Map Style Dataset and Iterable Dataset interfaces.
Supporting these interfaces opens the way to leverage the existing Python DataLoader, which provides batching, collation, multiple workers and numerous other features out of the box, as well as integration with existing software frameworks. The module should be built on publicly available client interfaces/libraries (in this case libdaos/libdfs) to avoid dependencies on server-side changes or other internal parts (with the exception of process forking), so that the same client interoperability is achieved out of the box, as described in the Interoperability Matrix. Upgrading/downgrading versions or applying bug fixes should thus be a matter of upgrading the client libraries.
Additionally, support for the Checkpoint interface is required (an implementation example can be found here, and the integration with the DLIO benchmark here).
Design Overview
At its core, a PyTorch Map Style Dataset represents a static array of sample items that can be addressed by index. To implement such an interface only two methods are required: __len__ and __getitem__. This implies that the dataset should load the full namespace during its initialization, before it can be used.
There is an additional optimization that can be implemented for a map-style dataset: the __getitems__ method, which allows batch loading of multiple samples in parallel.
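A minimal sketch of what such a map-style dataset could look like is shown below; the pre-scanned paths list and the read_fn wrapper are placeholders for illustration, not the actual module API:

from torch.utils.data import Dataset

class SampleMapDataset(Dataset):
    """Illustrative sketch only: a map-style dataset over a pre-scanned namespace."""

    def __init__(self, paths, read_fn):
        self.paths = paths        # list of sample paths loaded during initialization
        self.read_fn = read_fn    # assumed callable wrapping dfs_read() in the shim

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        return self.read_fn(self.paths[index])

    def __getitems__(self, indices):
        # Optional batch hook: the real implementation would submit these reads
        # in parallel on a DAOS event queue rather than looping sequentially.
        return [self.read_fn(self.paths[i]) for i in indices]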
The module itself will reside under the src/client/pydaos/torch directory in the DAOS source tree and consist of a shim layer written in C and an interfacing Python module. The module should support setuptools installation alongside the pydaos module.
Implementing __getitem__ is a straightforward call passthrough from the interfacing Python module to the underlying dfs_read function in the shim layer. To avoid unnecessary data copies from the Python environment to the C shim library, the shim layer should use buffers provided by the caller.
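As an illustration of the caller-provided buffer approach, a read path could look roughly like this; the shim.read call and its signature are assumptions for illustration, not the actual shim API:

def read_sample(shim, path, size):
    # The buffer is allocated by the Python caller and filled in place by the
    # shim (which maps the call to dfs_read()), avoiding an extra copy.
    buf = bytearray(size)
    nread = shim.read(path, buf)      # hypothetical call: returns number of bytes read
    return memoryview(buf)[:nread]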
As a positive side effect of this approach, when GDS memory buffers become available in the future, the torch module should be able to support RDMA directly to the GPU.
The __getitems__ implementation should be event driven via the DAOS event queue primitive, with a limit on the number of in-flight requests, to maximise utilisation of the network. To maximise throughput, each worker process should have its own event queue. This can be achieved by providing a worker_init_fn for the developed Dataset, as sketched below.
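From the user's side this could look roughly as follows; the worker_init method name is an assumption for illustration:

from torch.utils.data import DataLoader
import pydaos.torch

ds = pydaos.torch.Dataset(pool="pool", cont="container")

loader = DataLoader(
    ds,
    batch_size=32,
    num_workers=4,
    worker_init_fn=ds.worker_init,    # assumed hook: each worker creates its own event queue
)

for batch in loader:
    pass                              # feed the batch to the training loop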
The initial implementation of namespace scanning assumes a simple recursive directory listing, with a separate milestone to optimise it with parallel directory reading via DAOS dfs_obj_anchor_split(), which would also be handy for the Iterable Dataset implementation.
The Iterable Dataset should implement the __iter__() protocol, which provides an iterator over the samples in the dataset. This type of dataset should also support parallel iteration via worker_init_fn(), in the same manner as batching in __getitems__ of the map-style dataset, with the same set of optimizations.
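For illustration, parallel iteration with standard PyTorch worker sharding might look roughly like this; the keys list and read_fn are placeholders, while the actual module would read samples through the shim layer:

import math
from torch.utils.data import IterableDataset, get_worker_info

class SampleIterableDataset(IterableDataset):
    """Illustrative sketch only: shards a flat list of sample keys across workers."""

    def __init__(self, keys, read_fn):
        self.keys = keys          # e.g. object paths discovered while scanning the namespace
        self.read_fn = read_fn    # assumed callable that fetches one sample by key

    def __iter__(self):
        info = get_worker_info()
        if info is None:          # single-process data loading
            shard = self.keys
        else:                     # split the keys evenly across the loader workers
            per_worker = math.ceil(len(self.keys) / info.num_workers)
            start = info.id * per_worker
            shard = self.keys[start:start + per_worker]
        for key in shard:
            yield self.read_fn(key)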
The checkpointing interface on the PyTorch side expects an implementation of the Python io.BufferedIOBase interface, which maps directly to dfs_read/dfs_write calls in the shim layer.
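A rough sketch of the write side is shown below; the shim.write call and its signature are assumptions, with the real implementation forwarding to dfs_write in the shim layer:

import io
import torch

class CheckpointWriter(io.BufferedIOBase):
    """Illustrative sketch only: a file-like object that forwards writes to DAOS."""

    def __init__(self, shim, path):
        self._shim = shim
        self._path = path
        self._offset = 0

    def writable(self):
        return True

    def write(self, data):
        # Hypothetical shim call mapping directly to dfs_write() at the current offset.
        written = self._shim.write(self._path, data, self._offset)
        self._offset += written
        return written

# Usage: torch.save(model.state_dict(), CheckpointWriter(shim, "/checkpoints/model.pt"))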
At the moment DAOS DFS does not provide any caching, but there is ongoing work on it. As soon as it becomes available, it can be added transparently, with minimal effort, to evaluate the performance increase. Even in the preview stage it should be safe to do so, as the Dataset provides only read-only access to the underlying training samples.
User Interface
The user interface for the pydaos.torch module consists of creating an instance of the appropriate Dataset (map-style or iterable) and setting the necessary parameters in its constructor. The only required parameters are the pool and container labels; the others are mostly for fine-tuning for a specific workload:
ds = pydaos.torch.Dataset(pool="pool", cont="container")
ds1 = pydaos.torch.Dataset(pool="pool", cont="cont", path="/cv/samples/train/")
Dataset parameters should be documented with python docstrings describing their usage.
Quality
Unit testing is to be done with the DAOS NLT tooling, and functional testing via the DAOS functional test suite, to cover typical Dataset and DataLoader usage scenarios.
Performance benchmarking is intended to be done with DLIO Benchmark.
Project Milestones
There are six milestones with self-explanatory names:
Design
Map Style Dataset implementation
Iterable Style Dataset Implementation
Namespace scanning optimisation
Checkpointing
Documentation and scale testing
They can be tracked in Jira: DAOS-16355: pydaos.torch module.
Impacts
No known impact, apart from better adoption of DAOS by providing an additional way to access data on DAOS storage and integrating it into the AI/ML ecosystem.