nvm0612: Reclaiming free NVMe space is too slow, causing ENOSPACE

Description

The DAOS 2.2 requirement is to sustain multiple iterations of IOR-easy without hitting ENOSPACE, where each iteration writes more than 50% of the total pool capacity (e.g., 80%).

Original issue on 1.3 (partially addressed in 2.0):

On a DAOS server with ~25TB of NVMe space, an IOR-easy job that writes ~6.3TB per iteration fails in the 4th iteration, with

This should not happen: the files from the previous iterations have been deleted, so their space should be reclaimed as the pool fills up. Even worse, after the job aborts, the "daos cont destroy" that runs in the Slurm epilog is so slow that the epilog itself times out.
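As a rough sanity check on the numbers above, the failure in the 4th iteration is what one would expect if essentially none of the deleted space were reclaimed between iterations (3 x 6.3TB = 18.9TB fits in ~25TB, 4 x 6.3TB = 25.2TB does not). The sketch below is a simplified model, not DAOS code: the function name, the `reclaimed_fraction` parameter, and the assumption that usable capacity is exactly 25TB are all hypothetical, and it ignores metadata overhead and reclaim timing.

```python
def first_failing_iteration(capacity_tb, write_tb, reclaimed_fraction=0.0,
                            max_iterations=100):
    """Return the 1-based iteration whose write no longer fits, or None.

    Hypothetical model: each iteration writes `write_tb`, then deletes its
    files; only `reclaimed_fraction` of that space is usable again before
    the next iteration starts.
    """
    used = 0.0
    for iteration in range(1, max_iterations + 1):
        if used + write_tb > capacity_tb:
            return iteration  # this write would run out of space
        used += write_tb
        used -= write_tb * reclaimed_fraction  # space reclaimed in time
    return None  # never ran out of space within max_iterations


# Values from the report: ~25TB of NVMe, ~6.3TB written per iteration.
no_reclaim = first_failing_iteration(25.0, 6.3, reclaimed_fraction=0.0)
full_reclaim = first_failing_iteration(25.0, 6.3, reclaimed_fraction=1.0)
print(no_reclaim, full_reclaim)
```

Under these assumptions, zero reclaim reproduces the reported failure point, while full reclaim never runs out of space, which is the behavior the requirement asks for.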

This is on daos-1.3.106-3.7265.g74926ea0.el7.x86_64 (clients daos-1.3.106-3.7265.g74926ea0.el8.x86_64)

Attachments: 9

Activity

Ivan Poddubnyy 
February 9, 2023 at 6:18 PM

and , could you clarify whether you are still seeing the problem with 2.2.0?

Please open new ticket(s) and attach the logs if you do. Thank you.

I am going to close this ticket since the original problem has been resolved in 2.2.0.

Hongxun Lin 
November 15, 2022 at 7:44 AM

I found two vea_flush calls in the function vos_gc_pool: the first with “force = true”, but the second with “false”. Is there a reason for that? In my fio test, I tried setting the second vea_flush to “force = true” as well, and the “fio io_u error, error=No space left on device” errors and exits no longer occurred. I am not sure whether we could make this change.

Yawei Niu 
November 14, 2022 at 2:25 AM

We used this patch for fio test based on version 2.2.0, but an engine (dual engine) broke during the test, we are not sure what the problem is. Do you have an opinion on that?

I’m not quite sure how the engine broke during the test; maybe you hit DAOS-11092? Could you try to apply the fix of to see if your issue could be fixed?

limiaomiao 
November 14, 2022 at 2:09 AM

We used this patch https://github.com/daos-stack/daos/pull/9147 for fio test based on version 2.2.0, but an engine (dual engine) broke during the test, we are not sure what the problem is. Do you have an opinion on that?

system log:

daos_engine log:

Ivan Poddubnyy 
October 18, 2022 at 5:33 PM

Based on a comment in #community (Slack), the problem still exists in 2.2-rc3.

Fixed

Created October 31, 2021 at 1:56 PM
Updated February 9, 2023 at 6:19 PM
Resolved February 9, 2023 at 6:19 PM