Aurora ECB testing: 256ECB:105 servers - ior with MPIIO/DFS EC_16P2GX hung

Description

Config:
Clients: 256 ECB nodes
Servers: 105
Daos: master
Provider: cxi
ppn: 104


MPIIO console output (had to ctrl+C):

 

No server evictions were observed, but the log files show numerous timeouts.

One of the server logs test_io_sys_admin_daos_server_1.log.81395.rank=6:

 

One of the client logs x4610c3s2b0n0_test_io_sys_admin_daos_client.log:

 

  • Tried with DFS (EC_16P2GX) and got the same timeout messages as with MPIIO.

  • Tried DFS (EC_16P2GX) with aggregation disabled; still no difference.

Please only refer to logs from 2024/05/24 01:19:21 onward. The entire DAOS system was restarted at this time, and the reported failure occurred after the restart.
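For anyone going through the attached logs, a minimal sketch of trimming a log to entries at or after the restart time is given here. It assumes each entry starts with a `MM/DD-HH:MM:SS.ff` timestamp with no year (an assumption about the log format, not something confirmed in this ticket), and the helper name `lines_after` is hypothetical.

```python
import re
from datetime import datetime

# Assumed line prefix: "MM/DD-HH:MM:SS.ff" (no year in the log itself).
_STAMP = re.compile(r"^(\d{2})/(\d{2})-(\d{2}):(\d{2}):(\d{2})")

# Cutoff from the report: the system restart on 2024/05/24 01:19:21.
RESTART = datetime(2024, 5, 24, 1, 19, 21)


def lines_after(lines, cutoff=RESTART):
    """Yield lines stamped at or after cutoff.

    Lines without a timestamp (continuations) inherit the decision
    made for the most recent timestamped line.
    """
    keep = False
    for line in lines:
        m = _STAMP.match(line)
        if m:
            month, day, hour, minute, second = map(int, m.groups())
            # The year is taken from the cutoff because the assumed format omits it.
            stamp = datetime(cutoff.year, month, day, hour, minute, second)
            keep = stamp >= cutoff
        if keep:
            yield line
```

To use it, open a log file in text mode and iterate over `lines_after(f)`, printing each line. If the real logs use a different timestamp layout, only the regular expression needs to change.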

Attachments: 29

Activity

Scott P. - January 22, 2025 at 6:16 PM

Resolved in discussion with Gordon. Not seen in 2.6.x.

Jerome Soumagne - August 16, 2024 at 3:15 PM

Yes, something happens in the cxi provider/driver in this case; that’s the same as the log posted in

Mohamad Chaarawi - August 16, 2024 at 2:17 PM

For the third run that actually did start IOR:

With 1.5k ECB and EC_8P2GX, we see the original issue of:

So EC_8P2 has the same problem.

Mohamad Chaarawi - August 16, 2024 at 2:11 PM

For both issues of -1020 (failed to initialize DAOS), the test did not even start; it failed right at the beginning. The error in both cases is:

 

 

That could be a network issue on the node itself, or maybe something else. It could also be intermittent, as I have seen this sometimes and it goes away. Any idea?

So we will need to retry those cases, but now please use the daos_testing queue (with 1.5k nodes instead of 2k).

Maureen Jean - August 16, 2024 at 1:42 PM

I did not see the request to run with the patch until now. The build we are currently using has 2.6 tip as of 2 days ago plus PR-14817. Should I apply PR-14934 and PR-14817?

I was able to run 3 test cases last night, and the results are posted in the wiki.

Done

Details

Bug Exposure: 3-Medium
Bug Source: Unknown
Number of Occurrences: 1

Created May 24, 2024 at 3:06 PM
Updated January 27, 2025 at 6:53 PM
Resolved January 22, 2025 at 6:16 PM