Half of the IO500 Top 10 systems use DAOS as their backend storage system. Moving from HDD to SSD increased performance and greatly reduced latency. Moving from SSD to storage-class memory (SCM) improves performance again, and SCM additionally supports byte-granular transactional access with rollback after exceptions. Because SCM does not support DMA, accessing the device costs a large amount of CPU resources. On the software side, RDMA zero copy is designed to optimise performance; on the hardware side, Intel proposed the DSA accelerator. This combination of a software revolution and next-generation storage hardware made us interested in trying DAOS on Arm64 servers.
Build Issue Fix
Persistent memory is widely used in DAOS. Currently, DAOS only supports X86_64 with Optane memory and SSDs. For development purposes, however, ordinary memory can be used to emulate the features of Optane persistent memory, and this emulation makes it possible to run DAOS on Arm64. Bringing the project to an entirely new architecture exposed some build and compatibility issues.
Data structure check issue: The static data structure size check does not work on Arm64. The root cause is pthread_mutex_t, whose size varies between platforms. In our case it is 48 bytes on Arm64 but 40 bytes on X86_64, which triggers the assert on the structure size check. With help from upstream, this issue has now been resolved.
SPDK build: The SPDK build issue is also solved. It was caused by the architecture being hardcoded to X86_64, and we changed the code so that SPDK builds natively for the host architecture.
IPMCTL build: Another problem came from ipmctl, Intel's persistent memory management tool, which does not support Arm64. On Arm64 we removed the dependency on libipmctl.so and added a null (stub) interface to stay compatible with the DAOS framework.
Telemetry: The telemetry build only supported amd64 and had to be changed to support both amd64 and arm64.
dup2: dup2 does not work on Arm64, so we use dup3 instead.
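The data structure check issue listed above can be illustrated with a small sketch. It uses Python's ctypes rather than DAOS's own C code, and the structure layout and function names are hypothetical; only the 40-byte and 48-byte mutex sizes come from the text.

```python
import ctypes

# Mutex sizes reported in the text (bytes). Everything else is hypothetical.
MUTEX_SIZE = {"x86_64": 40, "aarch64": 48}


def make_struct(arch):
    """Build a made-up structure that embeds a mutex of the given arch's size."""

    class Example(ctypes.Structure):
        _fields_ = [
            # Opaque stand-in for pthread_mutex_t
            ("lock", ctypes.c_ubyte * MUTEX_SIZE[arch]),
            ("refcount", ctypes.c_uint64),
        ]

    return Example


def check_size(arch, expected):
    """Mimic a static size assertion that uses a hard-coded expected size."""
    return ctypes.sizeof(make_struct(arch)) == expected


# A size check written against the x86_64 layout: 40-byte mutex + 8-byte counter.
EXPECTED = 48

print("x86_64 passes:", check_size("x86_64", EXPECTED))
print("aarch64 passes:", check_size("aarch64", EXPECTED))
print("aarch64 size:", ctypes.sizeof(make_struct("aarch64")))
```

A check that hard-codes the size of the x86_64 layout fails as soon as the embedded mutex grows by 8 bytes, which is why size checks of this kind need to be derived per architecture.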
After the build issues were fixed, we set up the DAOS testing environment and ran into some new challenges.
Crash on 64K page
After deploying on Arm64, we tried to set up the DAOS environment, but it always crashed with many “??” entries in the call stack. Before the crash, the call stack went through the mercury network module. Apart from this there were no further details, even with all kernel and DAOS debug information enabled.
Based on our experience with other projects, we suspected a kernel page size issue. X86_64 uses a 4K base page size, while Arm64 supports 4K, 16K and 64K pages. After switching to a kernel configured with 4K pages, the crash disappeared.
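Before deploying, the page size of the running kernel can be checked with a few lines. This is a minimal sketch using only the Python standard library on a Unix-like system; the helper names are our own and are not part of DAOS.

```python
import os


def page_size_bytes():
    """Return the kernel page size of the running system in bytes."""
    return os.sysconf("SC_PAGE_SIZE")


def is_4k_kernel(size=None):
    """True if the given (or detected) page size is 4K."""
    if size is None:
        size = page_size_bytes()
    return size == 4096


size = page_size_bytes()
print(f"page size: {size} bytes, 4K kernel: {is_4k_kernel(size)}")
```

On a 64K-page Arm64 kernel this reports 65536 bytes, which is the configuration under which we saw the crash.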
Missing network interface: Without a usable network interface, DAOS exits with an exception. DAOS does not support virtual network devices such as bonded or VLAN interfaces, so it must run on a physical network interface.
Another tricky problem is network timeouts, of which there are two types. The first is caused by cluster clock synchronisation: when the clocks are out of sync, server-client communication hits the timeout threshold. The second is more complicated. It appears after DAOS has been running for a while and makes DAOS slow. Digging further, we found that CPU usage was not high but very little memory was free. The root cause was memory allocation falling into the slow path. Since we use ordinary memory to emulate persistent memory, reducing the amount of emulated persistent memory made the timeouts go away.
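One way to watch for the low-memory condition described above is to sample MemTotal and MemAvailable from /proc/meminfo on Linux. The sketch below is an assumption about how such monitoring could look, not something DAOS provides; the 5% threshold and the sample numbers are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


def available_fraction(info):
    """Fraction of total memory the kernel reports as available."""
    return info["MemAvailable"] / info["MemTotal"]


def low_memory(info, min_fraction=0.05):
    """min_fraction is a hypothetical alert threshold, not a DAOS setting."""
    return available_fraction(info) < min_fraction


# Hypothetical sample: a 512 GiB host with about 2 GiB still available.
SAMPLE = """MemTotal:       536870912 kB
MemFree:          1048576 kB
MemAvailable:     2097152 kB
"""

info = parse_meminfo(SAMPLE)
print(f"available: {available_fraction(info):.2%}, low: {low_memory(info)}")
```

On a live server, `read_meminfo()` would replace the sample text, and the check would be run periodically while the benchmark is active.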
IO500 Test Suites
The IO500 test was terminated after running for 60-70 seconds, and the reported error was that the test had used up all the memory. At that point the SCM space was exhausted, although there was still free NVMe space. DAOS stores its metadata in SCM and does not move metadata management to NVMe even when no SCM space is left. To work around this, we increased the memory from 256G to 512G.
Even so, IO500 could still run out of SCM after about 200 seconds, while the task must run for 300 seconds for the IO500 result to be valid. SCM usage therefore had to be reduced to avoid the metadata management issue. SCM is used to store metadata, unaligned data and small data blocks. In particular, DAOS stores data smaller than 4K in SCM.
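The size-based placement rule described above can be sketched as a simple threshold test. The function names and the sample I/O sizes are hypothetical; only the 4K threshold comes from the text, and metadata and unaligned data, which also consume SCM, are not modelled.

```python
def placement(io_size, threshold=4096):
    """Return the tier a write of io_size bytes would land on."""
    return "SCM" if io_size < threshold else "NVMe"


def scm_bytes(io_sizes, threshold=4096):
    """Total bytes from a list of writes that would be stored in SCM."""
    return sum(size for size in io_sizes if size < threshold)


# Hypothetical mix of write sizes in bytes.
workload = [512, 2048, 3000, 4096, 1 << 20]

for size in workload:
    print(size, placement(size))

print("SCM bytes at 4K threshold:", scm_bytes(workload))
print("SCM bytes at 2K threshold:", scm_bytes(workload, threshold=2048))
```

Lowering the threshold moves the mid-sized writes to NVMe, so the same workload consumes less SCM.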
We changed the small data block threshold so that only data smaller than 2K goes to SCM, and the IO500 test suites could then finish. But when we increased the number of targets and the concurrency, SCM was used up again. DAOS reserves 2GB of SCM and 10GB of NVMe for every target, and this cannot be modified through the API. Because our memory-emulated PMEM is a limited resource, we changed the SCM reservation to 512MB per target. With this change, the IO500 tests no longer failed because of SCM, and performance improved.
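The effect of the per-target reservation is simple arithmetic, sketched below. The sketch assumes the reservation scales linearly with the number of targets; the target counts are hypothetical, and only the 2GB and 512MB figures come from the text.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def reserved_scm(targets, per_target_bytes):
    """Total SCM set aside when every target reserves per_target_bytes."""
    return targets * per_target_bytes


# Hypothetical target counts for one server.
for targets in (8, 16, 32):
    before = reserved_scm(targets, 2 * GIB)
    after = reserved_scm(targets, 512 * MIB)
    print(
        f"{targets} targets: {before // GIB} GiB reserved at 2GB/target, "
        f"{after // GIB} GiB at 512MB/target"
    )
```

With emulated PMEM carved out of ordinary memory, the reservation grows quickly with the target count, which is why shrinking it per target freed enough SCM for the test to complete.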
Client trial: Some parts of the mdtest results did not improve as the number of clients increased, which suggested a client-side performance problem. We set up 10 clients, and 4 of them did not improve performance. The results continued to increase after we replaced those clients with higher-performance nodes.
End to end verification test: The SHA512 test on a newly created container always failed on Arm64, whereas the isal_crypto SHA512 test cases pass on amd64. We have submitted a patch that fixes this issue.
IO500 only measures performance and does not check data consistency. To test data correctness, we designed a procedure that copies a large amount of data of different sizes to the DAOS filesystem, restarts the DAOS cluster, and then reads the data back from DAOS. We compared the MD5 of the data written to DAOS with the MD5 of the data read back; the checksums matched, so the data is consistent. Data consistency also held when we extended the test to 2 or 3 replicas and to EC 2+1, 2+2 and 4+2.
In the stress test, the server side core dumps when data is copied back from DAOS. This issue needs more investigation; the log suggests that a NULL value triggers an assert in the release version.
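The copy-and-compare procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library: the function names and file sizes are hypothetical, temporary directories stand in for the source directory and the DAOS filesystem mount point, and the cluster restart is only marked by a comment.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path, chunk=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_round_trip(src_dir, mount_dir, sizes):
    """Write random files of the given sizes, copy them to mount_dir,
    and return the names of files whose MD5 differs after the copy."""
    mismatches = []
    for size in sizes:
        name = f"blob_{size}.bin"
        src = os.path.join(src_dir, name)
        dst = os.path.join(mount_dir, name)
        with open(src, "wb") as f:
            f.write(os.urandom(size))
        shutil.copyfile(src, dst)
        # In the real procedure the DAOS cluster is restarted at this point.
        if md5_of(src) != md5_of(dst):
            mismatches.append(name)
    return mismatches


# Hypothetical sizes chosen to straddle the 4K boundary.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as mnt:
    bad = verify_round_trip(src, mnt, [0, 1, 4095, 4096, 1 << 20])
    print("mismatched files:", bad)
```

An empty list means every file read back with the same checksum it was written with. In a real run, `mount_dir` would point at the mounted DAOS filesystem.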
Block device support
DAOS now plans to use containers to support block devices, but a container does not support presetting its capacity. DAOS offers the dfs plugin to support fio tests, but dfs is a file system, not a block device. In our fio 4K tests, latency was 110 us for random write, 160 us for random read, and 150 us for mixed read/write. A single server reached 420K IOPS for concurrent 4K writes.
Arm64 test case failure fixes
We ran the daos_test and run_test.sh test suites. On the Arm64 platform, some vos small IO tests, epoch tests and object checks failed. These failures come from the test code itself, which does not initialise some local variables correctly. On X86_64 the uninitialised local variables happened to be 0, while on Arm64 the value of a variable that is not initialised is random. Apart from the test cases mentioned above, all other test cases pass.
The many asserts inserted in the code made debugging on the Arm64 platform go quite smoothly. DAOS uses several external modules, such as SPDK, PMDK, ISAL and ISAL-crypto. Almost all of them have already been enabled and optimised on Arm64, so they are quite stable and need no further patches to work on the new platform.
The dashboard built with Prometheus and Grafana makes it easy to diagnose performance, as all test results are visualised. Every target uses a single thread, so there are no threading concurrency issues and we avoid the Arm64 weak memory ordering problem.
We plan to add Arm64 CI support, covering multi-OS builds, unit tests and integration tests. We will also take the master branch code for performance testing.
We also plan to use battery-backed memory to validate PMDK cache flushing and data correctness. Once DAOS supports the UCX and NVMe-oF RDMA modules, we will continue to validate that functionality on Arm64.
Arm64 Patches List for DAOS
https://github.com/daos-stack/daos/pull/9505 DAOS-10922 control: Implement ipmctl for aarch64 (#10922) #9505
https://github.com/daos-stack/daos/pull/9487 DAOS-10899 test: use expected variable to do assert checking (#10899) #9487
https://github.com/daos-stack/daos/pull/9486 DAOS-10898 obj: fix the assert in small_io test (#10898) #9486
https://github.com/daos-stack/daos/pull/9456 DAOS-10891 tests: fix incorrect assert in check_oclass (#10891) #9456