5-15-18
Test Configuration
Tip of master, commit 7cffa6f494db6d95df1f08033ccd5cbfe36f93eb
All tests run with ofi+psm2, ib0.
daos_test: Run with 8 servers (boro-[3-9],boro-58) and 2 clients (boro-[59-60]). Servers were killed and /mnt/daos was cleaned between the runs listed below.
Tests requiring a pool created via dmg used a 4GB pool, with boro-59 as the client.
mpich tests used boro-3 as server, boro-59 as client, with a 1GB pool.
Test Results
daos_test
Separate runs with cleanup in between:
- -mpcCiAeoRd - PASS
- -r - PASS
daosperf
1K Records
CREDITS=1
CREDITS=8
4K Records
CREDITS=1
IOR, single client 4GB pool
IOR, 2 client 10GB pool, data verification enabled
daos_bench
kv-idx-update
Time: 66.753337 seconds (14980.524402 ops per second)
kv-dkey-update
Time: 0.011910 seconds (8396.287980 ops per second)
kv-akey-update
Time: 0.010629 seconds (9407.848418 ops per second)
kv-dkey-fetch
Time: 0.006670 seconds (14993.068912 ops per second)
kv-akey-fetch
Time: 0.006212 seconds (16098.672065 ops per second)
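As a quick consistency check on the daos_bench figures, multiplying each reported time by its reported rate should recover the number of operations in that run. The sketch below is not part of daos_bench; the dictionary and function names are illustrative, and the values are copied from the output above.

```python
# Reported (seconds, ops per second) pairs copied from the daos_bench output.
results = {
    "kv-idx-update": (66.753337, 14980.524402),
    "kv-dkey-update": (0.011910, 8396.287980),
    "kv-akey-update": (0.010629, 9407.848418),
    "kv-dkey-fetch": (0.006670, 14993.068912),
    "kv-akey-fetch": (0.006212, 16098.672065),
}


def implied_ops(seconds, ops_per_second):
    """Return the operation count implied by a reported time and rate."""
    return round(seconds * ops_per_second)


# Hypothetical helper table: test name -> implied number of operations.
op_counts = {name: implied_ops(t, r) for name, (t, r) in results.items()}

for name, count in op_counts.items():
    print(f"{name}: ~{count} operations")
```

By this arithmetic, kv-idx-update corresponds to roughly 1,000,000 operations and the other four tests to roughly 100 each, so the kv-idx-update rate is averaged over a far longer run than the others.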
mpich tests
Results:
All tests reported No Errors in ~50% of attempts; daos_server segfaulted in the other ~50%. No ticket has been filed yet, though Mohamad has begun debugging. For convenience, the text of his emails to the daos-devel-request mailing list is pasted here:
Stephen ran into a server segfault when testing mpich. I verified and replicated the segfault, and it seems to be related to ofi/PSM2 (trace below). I have not seen this segfault before, and it does not occur every time. I just ran the mpi-io tests and replicated it once on the first run and another time on the second run, so it is easy to replicate, although not always.

Here are my env variables:
export CRT_PHY_ADDR_STR="ofi+psm2"
export OFI_INTERFACE=ib0
export OFI_PORT=26850
export FI_PSM2_NAME_SERVER=1
export PSM2_MULTI_EP=1
export FI_SOCKETS_MAX_CONN_RETRY=1
export CRT_CTX_SHARE_ADDR=1
export CRT_CTX_NUM=8
export D_LOG_MASK=CRIT
export DD_SUBSYS=0

On the client I use the same as above, plus the following (CRT_CTX_NUM is overwritten):
export DAOS_SINGLETON_CLI=1
export CRT_ATTACH_INFO_PATH=~
export FI_PSM2_DISCONNECT=1
export CRT_CTX_NUM=2

From the trace it seems we run into this on a pool connect? Again, I have never run into this before, so it seems like a regression on the server side. Does this look familiar?

Segfault trace:
#0 0x00007ffff3f0cd0f in psmx2_write_generic () from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#1 0x00007ffff3f0e458 in psmx2_writev () from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#2 0x00007ffff536f35e in fi_writev (context=0x7fffcd23a928, key=2, addr=<optimized out>, dest_addr=3940649673949191, count=1, desc=0x7fffd81e0538, iov=0x7fffd81e0540, ep=0x7fde60)
    at /scratch/mschaara/DEPS/ofi/include/rdma/fi_rma.h:130
#3 na_ofi_put (na_class=0x6d88d0, context=0x7fffcc02d510, callback=<optimized out>, arg=<optimized out>, local_mem_handle=<optimized out>, local_offset=<optimized out>, remote_mem_handle=0x7fffcd23e900, remote_offset=0, length=64, remote_addr=0x7fffcd23d8a0, remote_id=0 '\000', op_id=0x7fffcd23c9f0)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/na/na_ofi.c:3642
#4 0x00007ffff558ddd1 in hg_bulk_transfer_pieces (na_bulk_op=na_bulk_op@entry=0x7ffff558dbf0 <hg_bulk_na_put>, origin_addr=origin_addr@entry=0x7fffcd23d8a0, origin_id=0 '\000', hg_bulk_origin=hg_bulk_origin@entry=0x7fffcd239840, origin_segment_start_index=origin_segment_start_index@entry=0, origin_segment_start_offset=origin_segment_start_offset@entry=0, hg_bulk_local=hg_bulk_local@entry=0x7fffcd23a7b0, local_segment_start_index=local_segment_start_index@entry=0, local_segment_start_offset=local_segment_start_offset@entry=0, size=size@entry=64, scatter_gather=scatter_gather@entry=0 '\000', hg_bulk_op_id=hg_bulk_op_id@entry=0x7fffcd23e760, na_op_count=na_op_count@entry=0x0, use_sm=0 '\000')
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:784
#5 0x00007ffff558fb98 in hg_bulk_transfer (op_id=0x7fffd81e0838, op_id@entry=0x40, size=64, size@entry=0, local_offset=<optimized out>, hg_bulk_local=0x7fffcd23a7b0, hg_bulk_local@entry=0x0, origin_offset=<optimized out>, hg_bulk_origin=0x7fffcd239840, origin_id=<optimized out>, origin_addr=<optimized out>, op=<optimized out>, arg=0x7fffcd23eaf0, callback=0x7ffff7754080 <crt_hg_bulk_transfer_cb>, context=<optimized out>)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:955
#6 HG_Bulk_transfer_id (context=<optimized out>, callback=callback@entry=0x7ffff7754080 <crt_hg_bulk_transfer_cb>, arg=arg@entry=0x7fffcd23eaf0, op=<optimized out>, origin_addr=<optimized out>, origin_id=origin_id@entry=0 '\000', origin_handle=0x7fffcd239840, origin_offset=origin_offset@entry=0, local_handle=local_handle@entry=0x7fffcd23a7b0, local_offset=local_offset@entry=0, size=size@entry=64, op_id=op_id@entry=0x7fffd81e0838)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:1721
#7 0x00007ffff558fde2 in HG_Bulk_transfer (context=<optimized out>, callback=callback@entry=0x7ffff7754080 <crt_hg_bulk_transfer_cb>, arg=arg@entry=0x7fffcd23eaf0, op=<optimized out>, origin_addr=<optimized out>, origin_handle=<optimized out>, origin_offset=0, local_handle=0x7fffcd23a7b0, local_offset=0, size=64, op_id=op_id@entry=0x7fffd81e0838)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:1643
#8 0x00007ffff77591ba in crt_hg_bulk_transfer (bulk_desc=bulk_desc@entry=0x7fffd81e0880, complete_cb=0x7fffda5c0e10 <bulk_cb>, arg=0x7fffd81e0840, opid=0x7fffd81e0838)
    at src/cart/crt_hg.c:1654
#9 0x00007ffff77383a3 in crt_bulk_transfer (bulk_desc=bulk_desc@entry=0x7fffd81e0880, complete_cb=complete_cb@entry=0x7fffda5c0e10 <bulk_cb>, arg=arg@entry=0x7fffd81e0840, opid=opid@entry=0x7fffd81e0838)
    at src/cart/crt_bulk.c:172
#10 0x00007fffda5c0d64 in transfer_map_buf (tx=tx@entry=0x7fffd81e0a10, svc=<optimized out>, rpc=rpc@entry=0x7fffcd23e348, remote_bulk=0x7fffcd239840, required_buf_size=required_buf_size@entry=0x7fffcd23e48c)
    at src/pool/srv_pool.c:1689
#11 0x00007fffda5c433e in ds_pool_connect_handler (rpc=0x7fffcd23e348)
    at src/pool/srv_pool.c:1801
#12 0x00007ffff7773a00 in crt_handle_rpc (arg=0x7fffcd23e348)
    at src/cart/crt_rpc.c:1745
#13 0x00007ffff6a8f708 in ABTD_thread_func_wrapper () from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#14 0x00007ffff6a8fc91 in make_fcontext () from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#15 0x0000000000000000 in ?? ()

Thanks,
Mohamad

----------------------------------------------------

I also realized it’s not always the same trace where it segfaults. Here is another one:

(gdb) bt
#0 0x00007ffff33b60ab in __psm2_mq_isend2 (mq=0x71c8d0, dest=0x7fffcd53c370, flags=0, stag=0x7fffd8195ae0, buf=0x7fffd8d6d800, len=184, context=0x7fffcc068788, req=0x7fffd8195ad8)
    at /scratch/ESSIO/OPA/opa-psm2/psm_mq.c:733
#1 0x00007ffff3f05594 in psmx2_tagged_send_no_flag_av_table () from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#2 0x00007ffff536fa03 in fi_tsend (context=0x7fffcc068788, tag=4294967297, dest_addr=3940649673949431, desc=0x7fffcc031a40, len=184, buf=0x7fffd8d6d800, ep=0x7fde60)
    at /scratch/mschaara/DEPS/ofi/include/rdma/fi_tagged.h:116
#3 na_ofi_msg_send_expected (na_class=0x6d88d0, context=0x7fffcc02d510, callback=<optimized out>, arg=<optimized out>, buf=0x7fffd8d6d800, buf_size=184, plugin_data=0x7fffcc031a40, dest=0x7fffcd6316b0, target_id=0 '\000', tag=1, op_id=0x7fffcc068610)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/na/na_ofi.c:3298
#4 0x00007ffff5585583 in hg_core_respond_na (hg_core_handle=<optimized out>)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_core.c:2084
#5 0x00007ffff558895b in HG_Core_respond (handle=0x7fffcc068510, callback=callback@entry=0x7ffff558a290 <hg_core_respond_cb>, arg=arg@entry=0x7fffcc068930, flags=<optimized out>, payload_size=<optimized out>)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_core.c:4794
#6 0x00007ffff558d82a in HG_Respond (handle=0x7fffcc068930, callback=callback@entry=0x7ffff77547a0 <crt_hg_reply_send_cb>, arg=arg@entry=0x7fffcd7280f0, out_struct=out_struct@entry=0x7fffcd728170)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury.c:2305
#7 0x00007ffff77579f1 in crt_hg_reply_send (rpc_priv=rpc_priv@entry=0x7fffcd7280f0)
    at src/cart/crt_hg.c:1312
#8 0x00007ffff77741b4 in crt_reply_send (req=req@entry=0x7fffcd728148)
    at src/cart/crt_rpc.c:1527
#9 0x00007fffda1999cf in ds_obj_rw_complete (map_version=1, status=0, ioh=..., cont_hdl=<optimized out>, rpc=0x7fffcd728148)
    at src/object/srv_obj.c:102
#10 ds_obj_rw_handler (rpc=0x7fffcd728148)
    at src/object/srv_obj.c:724
#11 0x00007ffff7773a00 in crt_handle_rpc (arg=0x7fffcd728148)
    at src/cart/crt_rpc.c:1745
#12 0x00007ffff6a8f708 in ABTD_thread_func_wrapper () from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#13 0x00007ffff6a8fc91 in make_fcontext () from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#14 0x0000000000000000 in ?? ()
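
gdb backtraces captured from a paged terminal session often pick up "---Type <return> to continue, or q <return> to quit---" prompts and lose their line breaks when pasted into email. A minimal sketch of cleaning such a capture before comparing traces from different runs follows. The helper names split_frames and function_name are hypothetical, and the sample is a short excerpt of the first trace in the email.

```python
import re

PAGER_PROMPT = "---Type <return> to continue, or q <return> to quit---"
# A frame starts with '#<number>' at the start of the text or after whitespace.
FRAME_START = re.compile(r"(?<!\S)#(\d+)\s+")
# Optional '0x... in ' prefix, then the function name, then ' ('.
FUNC_NAME = re.compile(r"^(?:0x[0-9a-fA-F]+ in )?(\S+) \(")


def split_frames(raw):
    """Return [(frame_number, frame_text), ...] from a flattened gdb backtrace."""
    text = raw.replace(PAGER_PROMPT, " ")
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    matches = list(FRAME_START.finditer(text))
    frames = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        frames.append((int(m.group(1)), text[m.end():end].strip()))
    return frames


def function_name(frame_text):
    """Return the function name of a frame, or None if it cannot be found."""
    m = FUNC_NAME.match(frame_text)
    return m.group(1) if m else None


SAMPLE = (
    "#0 0x00007ffff3f0cd0f in psmx2_write_generic () from "
    "/scratch/mschaara/DEPS/ofi/lib/libfabric.so.1 "
    "#1 0x00007ffff3f0e458 in psmx2_writev () from "
    "/scratch/mschaara/DEPS/ofi/lib/libfabric.so.1 "
    "#2 0x00007ffff536f35e in fi_writev (context=0x7fffcd23a928, key=2, "
    "addr=<optimized out>, dest_addr=3940649673949191, count=1, "
    "desc=0x7fffd81e0538, iov=0x7fffd81e0540, ep=0x7fde60) "
    "---Type <return> to continue, or q <return> to quit--- "
    "at /scratch/mschaara/DEPS/ofi/include/rdma/fi_rma.h:130"
)

frames = split_frames(SAMPLE)
for number, frame_text in frames:
    print(f"#{number} {function_name(frame_text)}")
```

Comparing only the function names per frame makes it easier to see where two traces diverge, for example that one run faults under ds_pool_connect_handler and another under ds_obj_rw_handler. Running gdb with "set pagination off" before "bt" avoids the prompts in the first place.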