IOR, 2 client 10GB pool, data verification enabled
[sdwillso@boro-59 ~]$ orterun -np 1 --hostfile ~/hostlists/daos_client_hostlist --mca mtl ^psm2,ofi --ompi-server file:~/scripts/uri.txt ior -W -v -i 5 -a DAOS -w -o `uuidgen` -b 10g -t 1m -O daospool=2b8ce7c0-5bba-4fc0-909b-6d968b252b93,daosrecordsize=1m,daosstripesize=1m,daosstripecount=1024,daosaios=16,daosobjectclass=LARGE,daosPoolSvc=1,daosepoch=1
IOR-3.0.1: MPI Coordinated Test of Parallel I/O
Began: Tue May 15 22:16:04 2018
Command line used: ior -W -v -i 5 -a DAOS -w -o f6b0d019-98f2-4c49-bd25-25a0d66c6c2f -b 10g -t 1m -O daospool=2b8ce7c0-5bba-4fc0-909b-6d968b252b93,daosrecordsize=1m,daosstripesize=1m,daosstripecount=1024,daosaios=16,daosobjectclass=LARGE,daosPoolSvc=1,daosepoch=1
Machine: Linux boro-59.boro.hpdd.intel.com
Start time skew across all tasks: 0.00 sec
Test 0 started: Tue May 15 22:16:04 2018
Path: /home/sdwillso
FS: 3.8 TiB Used FS: 9.0% Inodes: 250.0 Mi Used Inodes: 1.8%
Participating tasks: 1
[0] WARNING: USING daosStripeMax CAUSES READS TO RETURN INVALID DATA
Summary:
api = DAOS
test filename = f6b0d019-98f2-4c49-bd25-25a0d66c6c2f
access = single-shared-file, independent
pattern = segmented (1 segment)
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 1 (1 per node)
repetitions = 5
xfersize = 1 MiB
blocksize = 10 GiB
aggregate filesize = 10 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
Commencing write performance test: Tue May 15 22:16:04 2018
write 3724 10485760 1024.00 0.001198 2.75 0.002832 2.75 0
Verifying contents of the file(s) just written.
Tue May 15 22:16:07 2018
remove - - - - - - 0.007806 0
Commencing write performance test: Tue May 15 22:16:15 2018
write 3769 10485760 1024.00 0.000693 2.71 0.003378 2.72 1
Verifying contents of the file(s) just written.
Tue May 15 22:16:17 2018
remove - - - - - - 0.007733 1
Commencing write performance test: Tue May 15 22:16:25 2018
write 3749 10485760 1024.00 0.000684 2.73 0.002396 2.73 2
Verifying contents of the file(s) just written.
Tue May 15 22:16:28 2018
remove - - - - - - 0.007591 2
Commencing write performance test: Tue May 15 22:16:36 2018
write 3774 10485760 1024.00 0.000679 2.71 0.002356 2.71 3
Verifying contents of the file(s) just written.
Tue May 15 22:16:39 2018
remove - - - - - - 0.007547 3
Commencing write performance test: Tue May 15 22:16:47 2018
write 3784 10485760 1024.00 0.000760 2.70 0.002375 2.71 4
Verifying contents of the file(s) just written.
Tue May 15 22:16:49 2018
remove - - - - - - 0.007593 4
Max Write: 3784.03 MiB/sec (3967.84 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write 3784.03 3723.54 3760.10 21.55 2.72342 0 1 1 5 0 0 1 0 0 1 10737418240 1048576 10737418240 DAOS 0
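The summary row can be roughly cross-checked from the per-iteration lines. The sketch below is a sanity check, not IOR's own code. It assumes the rounded `bw(MiB/s)` values printed above as input, and it assumes a population standard deviation (divide by N). Because the inputs are rounded, the mean and StdDev land close to the reported 3760.10 and 21.55 rather than exactly on them. The MiB/s to MB/s conversion is a fixed factor of 1024²/1000².

```python
import math

# Per-iteration write bandwidth in MiB/s, copied from the rounded
# "bw(MiB/s)" column of the five write lines above.
write_bw_mib = [3724, 3769, 3749, 3774, 3784]

MIB = 1024 ** 2  # bytes per MiB
MB = 1000 ** 2   # bytes per MB


def summarize(samples):
    """Return max, min, mean and population standard deviation."""
    mean = sum(samples) / len(samples)
    # Assumption: divide by N (population), not N - 1 (sample).
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return {
        "max": max(samples),
        "min": min(samples),
        "mean": mean,
        "stddev": math.sqrt(variance),
    }


def mib_to_mb(value_mib):
    """Convert a rate in MiB/s to MB/s."""
    return value_mib * MIB / MB


stats = summarize(write_bw_mib)
print(stats)

# "Max Write: 3784.03 MiB/sec" expressed in MB/sec.
print(round(mib_to_mb(3784.03), 2))

# Bandwidth implied by the first iteration: 10 GiB written in 2.75 s.
print(10 * 1024 / 2.75)
```

The last line gives a value slightly below the reported 3724 MiB/s for iteration 0, which is consistent with the `total(s)` column being rounded to two decimals.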
daos_bench
kv-idx-update
Time: 66.753337 seconds (14980.524402 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi --ompi-server file:~/scripts/uri.txt daosbench --test=kv-idx-update --testid=1 --svc=1 --dpool=357aa259-1ab1-4fe3-9125-71bab9ed9139 --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:27:17 2018
=================================
===============================
Test Setup
---------------
Test: kv-idx-update
DAOS pool :357aa259-1ab1-4fe3-9125-71bab9ed9139
DAOS container :7e5c5fb5-b197-4752-8e56-a45ffabc9005
Value buffer size: 64
Number of processes: 1
Number of indexes/process: 1000000
Number of asynchronous I/O: 32
===============================
kv-idx-update
Time: 66.753337 seconds (14980.524402 ops per second)
kv-dkey-update
Time: 0.011910 seconds (8396.287980 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi --ompi-server file:~/scripts/uri.txt daosbench --test=kv-dkey-update --testid=1 --svc=1 --dpool=4b5b1db0-245e-4037-aa61-47cb5006cace --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:36:16 2018
=================================
===============================
Test Setup
---------------
Test: kv-dkey-update
DAOS pool :4b5b1db0-245e-4037-aa61-47cb5006cace
DAOS container :10bb507e-cd78-4d38-9865-961c10480a37
Value buffer size: 64
Number of processes: 1
Number of keys/process: 100
Number of asynchronous I/O: 32
===============================
kv-dkey-update
Time: 0.011910 seconds (8396.287980 ops per second)
kv-akey-update
Time: 0.010629 seconds (9407.848418 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi --ompi-server file:~/scripts/uri.txt daosbench --test=kv-akey-update --testid=1 --svc=1 --dpool=c67bcdbd-36ec-4e00-857f-204df99f0646 --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:39:29 2018
=================================
===============================
Test Setup
---------------
Test: kv-akey-update
DAOS pool :c67bcdbd-36ec-4e00-857f-204df99f0646
DAOS container :288e9ad9-42e6-4016-9284-93d2a011f1f3
Value buffer size: 64
Number of processes: 1
Number of keys/process: 100
Number of asynchronous I/O: 32
===============================
kv-akey-update
Time: 0.010629 seconds (9407.848418 ops per second)
kv-dkey-fetch
Time: 0.006670 seconds (14993.068912 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi --ompi-server file:~/scripts/uri.txt daosbench --test=kv-dkey-fetch --testid=1 --svc=1 --dpool=f7849c43-4508-4fc3-a866-de3a998cd3a7 --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:41:31 2018
=================================
===============================
Test Setup
---------------
Test: kv-dkey-fetch
DAOS pool :f7849c43-4508-4fc3-a866-de3a998cd3a7
DAOS container :318250b7-e15b-442c-b746-29e7b2cc7229
Value buffer size: 64
Number of processes: 1
Number of keys/process: 100
Number of asynchronous I/O: 32
===============================
kv-dkey-fetch
Time: 0.006670 seconds (14993.068912 ops per second)
kv-akey-fetch
Time: 0.006212 seconds (16098.672065 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi --ompi-server file:~/scripts/uri.txt daosbench --test=kv-akey-fetch --testid=1 --svc=1 --dpool=5999fb3b-2f4d-4c3f-b7d5-79a9be8d13a4 --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:43:19 2018
=================================
===============================
Test Setup
---------------
Test: kv-akey-fetch
DAOS pool :5999fb3b-2f4d-4c3f-b7d5-79a9be8d13a4
DAOS container :b3291b9d-a3bc-4d3e-b250-e21e3d8597fa
Value buffer size: 64
Number of processes: 1
Number of keys/process: 100
Number of asynchronous I/O: 32
===============================
kv-akey-fetch
Time: 0.006212 seconds (16098.672065 ops per second)
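The ops-per-second figures above are the operation count divided by the elapsed time. The sketch below recomputes them as a consistency check. It assumes the counts printed in each Test Setup block: 1000000 indexes for kv-idx-update and 100 keys for the dkey/akey tests. Note that those four tests report 100 keys per process even though `--indexes=1000000` was passed on the command line. The `runs` table and the `ops_per_second` helper are illustrative names, not part of daosbench.

```python
import math

# (operation count, elapsed seconds, reported ops/s) per test,
# copied from the daosbench output above.
runs = {
    "kv-idx-update": (1_000_000, 66.753337, 14980.524402),
    "kv-dkey-update": (100, 0.011910, 8396.287980),
    "kv-akey-update": (100, 0.010629, 9407.848418),
    "kv-dkey-fetch": (100, 0.006670, 14993.068912),
    "kv-akey-fetch": (100, 0.006212, 16098.672065),
}


def ops_per_second(count, seconds):
    """Throughput as operations divided by elapsed time."""
    return count / seconds


rates = {}
for name, (count, seconds, reported) in runs.items():
    rates[name] = ops_per_second(count, seconds)
    # The printed time is rounded to microseconds, so allow a small
    # relative difference against the reported rate.
    consistent = math.isclose(rates[name], reported, rel_tol=2e-4)
    print(f"{name:15s} {rates[name]:12.2f} ops/s  consistent={consistent}")
```

The 100-key runs last only a few milliseconds each, so their rates are far more sensitive to timer rounding and startup cost than the one-million-index run.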
mpich tests
Results:
No Errors on all tests in ~50% of attempts; daos_server segfaulted in the other ~50%. No ticket has been filed yet, though Mohamad has begun debugging. For convenience, the text of his email exchange with the daos-devel-request mailing list is pasted here:
Stephen ran into a server segfault when testing mpich. I verified and replicated the segfault, and it seems it has to do with ofi/PSM2 (trace below).
I didn’t see that segfault before, and it doesn’t occur all the time. I just ran the mpi-io tests and replicated it once on the first run and another time on the second run, so it’s easy to replicate, although not every time.
Here are my env variables:
export CRT_PHY_ADDR_STR="ofi+psm2"
export OFI_INTERFACE=ib0
export OFI_PORT=26850
export FI_PSM2_NAME_SERVER=1
export PSM2_MULTI_EP=1
export FI_SOCKETS_MAX_CONN_RETRY=1
export CRT_CTX_SHARE_ADDR=1
export CRT_CTX_NUM=8
export D_LOG_MASK=CRIT
export DD_SUBSYS=0
On the client I have the same as above but I add this (CRT_CTX_NUM is overwritten):
export DAOS_SINGLETON_CLI=1
export CRT_ATTACH_INFO_PATH=~
export FI_PSM2_DISCONNECT=1
export CRT_CTX_NUM=2
From the trace it seems we run into this on a pool connect?
Again, I never ran into this before, so it seems like a regression on the server side.
Does this look familiar?
Segfault trace:
#0 0x00007ffff3f0cd0f in psmx2_write_generic ()
from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#1 0x00007ffff3f0e458 in psmx2_writev ()
from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#2 0x00007ffff536f35e in fi_writev (context=0x7fffcd23a928, key=2,
addr=<optimized out>, dest_addr=3940649673949191, count=1,
desc=0x7fffd81e0538, iov=0x7fffd81e0540, ep=0x7fde60)
at /scratch/mschaara/DEPS/ofi/include/rdma/fi_rma.h:130
#3 na_ofi_put (na_class=0x6d88d0, context=0x7fffcc02d510,
callback=<optimized out>, arg=<optimized out>,
local_mem_handle=<optimized out>, local_offset=<optimized out>,
remote_mem_handle=0x7fffcd23e900, remote_offset=0, length=64,
remote_addr=0x7fffcd23d8a0, remote_id=0 '\000', op_id=0x7fffcd23c9f0)
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/na/na_ofi.c:3642
#4 0x00007ffff558ddd1 in hg_bulk_transfer_pieces (
na_bulk_op=na_bulk_op@entry=0x7ffff558dbf0 <hg_bulk_na_put>,
origin_addr=origin_addr@entry=0x7fffcd23d8a0, origin_id=0 '\000',
hg_bulk_origin=hg_bulk_origin@entry=0x7fffcd239840,
origin_segment_start_index=origin_segment_start_index@entry=0,
origin_segment_start_offset=origin_segment_start_offset@entry=0,
hg_bulk_local=hg_bulk_local@entry=0x7fffcd23a7b0,
local_segment_start_index=local_segment_start_index@entry=0,
local_segment_start_offset=local_segment_start_offset@entry=0,
size=size@entry=64, scatter_gather=scatter_gather@entry=0 '\000',
hg_bulk_op_id=hg_bulk_op_id@entry=0x7fffcd23e760,
na_op_count=na_op_count@entry=0x0, use_sm=0 '\000')
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:784
#5 0x00007ffff558fb98 in hg_bulk_transfer (op_id=0x7fffd81e0838,
op_id@entry=0x40, size=64, size@entry=0, local_offset=<optimized out>,
hg_bulk_local=0x7fffcd23a7b0, hg_bulk_local@entry=0x0,
origin_offset=<optimized out>, hg_bulk_origin=0x7fffcd239840,
origin_id=<optimized out>, origin_addr=<optimized out>,
op=<optimized out>, arg=0x7fffcd23eaf0,
callback=0x7ffff7754080 <crt_hg_bulk_transfer_cb>,
context=<optimized out>)
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:955
#6 HG_Bulk_transfer_id (context=<optimized out>,
callback=callback@entry=0x7ffff7754080 <crt_hg_bulk_transfer_cb>,
arg=arg@entry=0x7fffcd23eaf0, op=<optimized out>,
origin_addr=<optimized out>, origin_id=origin_id@entry=0 '\000',
origin_handle=0x7fffcd239840, origin_offset=origin_offset@entry=0,
local_handle=local_handle@entry=0x7fffcd23a7b0,
local_offset=local_offset@entry=0, size=size@entry=64,
op_id=op_id@entry=0x7fffd81e0838)
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:1721
#7 0x00007ffff558fde2 in HG_Bulk_transfer (context=<optimized out>,
callback=callback@entry=0x7ffff7754080 <crt_hg_bulk_transfer_cb>,
arg=arg@entry=0x7fffcd23eaf0, op=<optimized out>,
origin_addr=<optimized out>, origin_handle=<optimized out>,
origin_offset=0, local_handle=0x7fffcd23a7b0, local_offset=0, size=64,
op_id=op_id@entry=0x7fffd81e0838)
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:1643
#8 0x00007ffff77591ba in crt_hg_bulk_transfer (
bulk_desc=bulk_desc@entry=0x7fffd81e0880,
complete_cb=0x7fffda5c0e10 <bulk_cb>, arg=0x7fffd81e0840,
opid=0x7fffd81e0838) at src/cart/crt_hg.c:1654
#9 0x00007ffff77383a3 in crt_bulk_transfer (
bulk_desc=bulk_desc@entry=0x7fffd81e0880,
complete_cb=complete_cb@entry=0x7fffda5c0e10 <bulk_cb>,
arg=arg@entry=0x7fffd81e0840, opid=opid@entry=0x7fffd81e0838)
at src/cart/crt_bulk.c:172
#10 0x00007fffda5c0d64 in transfer_map_buf (tx=tx@entry=0x7fffd81e0a10,
svc=<optimized out>, rpc=rpc@entry=0x7fffcd23e348,
remote_bulk=0x7fffcd239840,
required_buf_size=required_buf_size@entry=0x7fffcd23e48c)
at src/pool/srv_pool.c:1689
#11 0x00007fffda5c433e in ds_pool_connect_handler (rpc=0x7fffcd23e348)
at src/pool/srv_pool.c:1801
#12 0x00007ffff7773a00 in crt_handle_rpc (arg=0x7fffcd23e348)
at src/cart/crt_rpc.c:1745
#13 0x00007ffff6a8f708 in ABTD_thread_func_wrapper ()
from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#14 0x00007ffff6a8fc91 in make_fcontext ()
from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#15 0x0000000000000000 in ?? ()
Thanks,
Mohamad
----------------------------------------------------
I also realized it’s not always the same trace where it segfaults. Here is another one:
(gdb) bt
#0 0x00007ffff33b60ab in __psm2_mq_isend2 (mq=0x71c8d0, dest=0x7fffcd53c370,
flags=0, stag=0x7fffd8195ae0, buf=0x7fffd8d6d800, len=184,
context=0x7fffcc068788, req=0x7fffd8195ad8)
at /scratch/ESSIO/OPA/opa-psm2/psm_mq.c:733
#1 0x00007ffff3f05594 in psmx2_tagged_send_no_flag_av_table ()
from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#2 0x00007ffff536fa03 in fi_tsend (context=0x7fffcc068788, tag=4294967297,
dest_addr=3940649673949431, desc=0x7fffcc031a40, len=184,
buf=0x7fffd8d6d800, ep=0x7fde60)
at /scratch/mschaara/DEPS/ofi/include/rdma/fi_tagged.h:116
#3 na_ofi_msg_send_expected (na_class=0x6d88d0, context=0x7fffcc02d510,
callback=<optimized out>, arg=<optimized out>, buf=0x7fffd8d6d800,
buf_size=184, plugin_data=0x7fffcc031a40, dest=0x7fffcd6316b0,
target_id=0 '\000', tag=1, op_id=0x7fffcc068610)
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/na/na_ofi.c:3298
#4 0x00007ffff5585583 in hg_core_respond_na (hg_core_handle=<optimized out>)
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_core.c:2084
#5 0x00007ffff558895b in HG_Core_respond (handle=0x7fffcc068510,
callback=callback@entry=0x7ffff558a290 <hg_core_respond_cb>,
arg=arg@entry=0x7fffcc068930, flags=<optimized out>,
payload_size=<optimized out>)
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_core.c:4794
#6 0x00007ffff558d82a in HG_Respond (handle=0x7fffcc068930,
callback=callback@entry=0x7ffff77547a0 <crt_hg_reply_send_cb>,
arg=arg@entry=0x7fffcd7280f0, out_struct=out_struct@entry=0x7fffcd728170)
at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury.c:2305
#7 0x00007ffff77579f1 in crt_hg_reply_send (
rpc_priv=rpc_priv@entry=0x7fffcd7280f0) at src/cart/crt_hg.c:1312
#8 0x00007ffff77741b4 in crt_reply_send (req=req@entry=0x7fffcd728148)
at src/cart/crt_rpc.c:1527
#9 0x00007fffda1999cf in ds_obj_rw_complete (map_version=1, status=0,
ioh=..., cont_hdl=<optimized out>, rpc=0x7fffcd728148)
at src/object/srv_obj.c:102
#10 ds_obj_rw_handler (rpc=0x7fffcd728148) at src/object/srv_obj.c:724
#11 0x00007ffff7773a00 in crt_handle_rpc (arg=0x7fffcd728148)
at src/cart/crt_rpc.c:1745
#12 0x00007ffff6a8f708 in ABTD_thread_func_wrapper ()
from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#13 0x00007ffff6a8fc91 in make_fcontext ()
from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#14 0x0000000000000000 in ?? ()
[sdwillso@boro-59 test]$ ./run_daos_tests daos:test_file
**** Testing I/O functions ****
**** Testing simple.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing async.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing async-multiple.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing coll_test.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing excl.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
../../../../src/mpi/romio/adio/ad_daos/ad_daos_open.c:281 ADIOI_DAOS_Open() - Array exists (EXCL mode) (-1004)
No Errors
**** Testing file_info.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing i_noncontig.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing noncontig.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing noncontig_coll.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing noncontig_coll2.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing aggregation1.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing aggregation2.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing hindexed ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
-------------------------------------------------------
[ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 ]
[ 0] 0 1 2 3 4 5 D E F G H I
[ 1]
[ 2] 6 7 8 9 : ; J K L M N O
[ 3]
[ 4]
[ 5] X Y Z [ \ ] l m n o p q
[ 6]
[ 7] ^ _ ` a b c r s t u v w
[ 8]
[ 9]
[10] 0 1 2 3 4 5 D E F G H I
[11]
[12] 6 7 8 9 : ; J K L M N O
[13]
[14]
[15] X Y Z [ \ ] l m n o p q
[16]
[17] ^ _ ` a b c r s t u v w
[18]
[19]
No Errors
**** Testing split_coll.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing psimple.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing error.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing status.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing types_with_zeros ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing darray_read ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing fcoll_test.f ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing pfcoll_test.f ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
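Since the run alternates between clean passes and server segfaults, it can help to summarize each captured log mechanically instead of reading it by eye. The sketch below is one possible way to do that, assuming the output format shown above: a `**** Testing <name> ****` banner per test, followed by a body that contains a `No Errors` line on success. Banners with no body, such as `**** Testing I/O functions ****`, are treated as group headers. The function names and the `hypothetical_crash.c` entry are illustrative and do not come from the real run.

```python
import re

# Matches banners such as "**** Testing simple.c ****".
BANNER = re.compile(r"^\*{4} Testing (.+?) \*{4}$")


def parse_sections(log_text):
    """Split the log into (test name, body lines) pairs."""
    sections = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = BANNER.match(line)
        if match:
            sections.append((match.group(1), []))
        elif sections:
            sections[-1][1].append(line)
    return sections


def summarize(log_text):
    """Map each test name to True if its body contains "No Errors"."""
    results = {}
    for name, body in parse_sections(log_text):
        if not body:
            continue  # banner with no body: assumed to be a group header
        results[name] = "No Errors" in body
    return results


# Excerpt in the format of the log above. The last section is a
# hypothetical example of a test cut short by a server crash.
sample = """\
**** Testing I/O functions ****
**** Testing simple.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
No Errors
**** Testing excl.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
../../../../src/mpi/romio/adio/ad_daos/ad_daos_open.c:281 ADIOI_DAOS_Open() - Array exists (EXCL mode) (-1004)
No Errors
**** Testing hypothetical_crash.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
"""

results = summarize(sample)
failed = [name for name, ok in results.items() if not ok]
print(f"{len(results)} tests, {len(failed)} without 'No Errors': {failed}")
```

Diagnostic lines inside a passing test, like the `ADIOI_DAOS_Open()` message in excl.c, do not affect the result, because only the presence of `No Errors` is checked.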