5-15-18

Test Configuration

Tip of master, commit 7cffa6f494db6d95df1f08033ccd5cbfe36f93eb

All tests were run with ofi+psm2 over ib0.
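
For reference, the fabric selection corresponds to these environment settings (the same values appear verbatim in the email excerpt under "mpich tests" below):

export CRT_PHY_ADDR_STR="ofi+psm2"   # CaRT transport: libfabric with the PSM2 provider
export OFI_INTERFACE=ib0             # fabric interface the provider binds to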

daos_test: Run with 8 servers (boro-[3-9], boro-58) and 2 clients (boro-[59-60]). Servers were killed and /mnt/daos cleaned between the runs listed below.

Tests requiring a pool to be created via dmg used a 4 GB pool; these used boro-59 as the client.

The mpich tests used boro-3 as the server and boro-59 as the client, with a 1 GB pool.
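
The pool-creation commands themselves are not captured in this report. A minimal sketch, assuming the 2018-era dmg utility and the orterun/--ompi-server launch pattern used elsewhere in this report (the dmg flags and launch wrapper are assumptions, not taken from a log; the sizes are the ones stated above):

# Hypothetical invocations; dmg prints the pool UUID and service rank list consumed by ior/daosbench below.
orterun --ompi-server file:~/scripts/uri.txt dmg create --size=4G   # 4 GB pool for the dmg-based tests
orterun --ompi-server file:~/scripts/uri.txt dmg create --size=1G   # 1 GB pool for the mpich tests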

Test Results

daos_test

Separate runs with cleanup in between (a launch sketch follows the list):

  • -mpcCiAeoRd - PASS
  • -r - PASS
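
The launch lines were not captured; below is a sketch consistent with the orterun pattern used for the other tools in this report. The -np 2 value matches the two clients listed above, the host.cli.2 hostfile name is hypothetical, and only the test flags are taken from the list above:

orterun -np 2 --hostfile ~/scripts/host.cli.2 --ompi-server file:~/scripts/uri.txt daos_test -mpcCiAeoRd
orterun -np 2 --hostfile ~/scripts/host.cli.2 --ompi-server file:~/scripts/uri.txt daos_test -r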

daos_perf

1K Records

CREDITS=1

[sdwillso@boro-59 ~]$ CREDITS=1 ./daos_m/src/tests/daos_perf.sh daos 200 1000 1K
+ /home/sdwillso/daos_m/opt/ompi/bin/orterun -quiet --hostfile /home/sdwillso/scripts/host.cli.1 --ompi-server file:/home/sdwillso/scripts/uri.txt -x DD_SUBSYS= -x DD_MASK= -x D_LOG_FILE=/tmp/daos_perf.log /home/sdwillso/daos_m/install/bin/daos_perf -T daos -P 2G -d 1 -a 200 -r 1000 -s 1K -C 1 -t -z
Test :
	DAOS (full stack)
Parameters :
	pool size     : 2048 MB
	credits       : 1 (sync I/O for -ve)
	obj_per_cont  : 1 x 8 (procs)
	dkey_per_obj  : 1
	akey_per_dkey : 200
	recx_per_akey : 1000
	value type    : single
	value size    : 1024
	zero copy     : yes
	overwrite     : yes
	VOS file      : <NULL>
Started...
update successfully completed:
	duration : 11.453220  sec
	bandwith : 136.425    MB/sec
	rate     : 139698.71  IO/sec
	latency  : 7.158      us (nonsense if credits > 1)
Duration across processes:
MAX duration : 11.449782  sec
MIN duration : 6.149746   sec
Average duration : 9.161313   sec
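
Sanity check on the run above: bandwidth is rate × value size, 139,698.71 IO/s × 1,024 B ≈ 136.4 MiB/s, which matches the reported figure; the "bandwith" line (misspelling is the tool's own output) therefore appears to report MiB/s despite its MB/sec label.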

CREDITS=8

[sdwillso@boro-59 ~]$ CREDITS=8 ./daos_m/src/tests/daos_perf.sh daos 200 1000 1K
+ /home/sdwillso/daos_m/opt/ompi/bin/orterun -quiet --hostfile /home/sdwillso/scripts/host.cli.1 --ompi-server file:/home/sdwillso/scripts/uri.txt -x DD_SUBSYS= -x DD_MASK= -x D_LOG_FILE=/tmp/daos_perf.log /home/sdwillso/daos_m/install/bin/daos_perf -T daos -P 2G -d 1 -a 200 -r 1000 -s 1K -C 8 -t -z
Test :
	DAOS (full stack)
Parameters :
	pool size     : 2048 MB
	credits       : 8 (sync I/O for -ve)
	obj_per_cont  : 1 x 8 (procs)
	dkey_per_obj  : 1
	akey_per_dkey : 200
	recx_per_akey : 1000
	value type    : single
	value size    : 1024
	zero copy     : yes
	overwrite     : yes
	VOS file      : <NULL>
Started...
update successfully completed:
	duration : 9.665754   sec
	bandwith : 161.653    MB/sec
	rate     : 165532.87  IO/sec
	latency  : 6.041      us (nonsense if credits > 1)
Duration across processes:
MAX duration : 9.664282   sec
MIN duration : 3.791001   sec
Average duration : 7.475405   sec
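
Raising CREDITS from 1 to 8 (credits bound the number of I/O requests in flight per process) improved the update rate from ~139.7K to ~165.5K IO/s, roughly an 18% gain.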

4K Records

CREDITS=1

[sdwillso@boro-59 ~]$ CREDITS=1 ./daos_m/src/tests/daos_perf.sh daos 200 1000 4K
+ /home/sdwillso/daos_m/opt/ompi/bin/orterun -quiet --hostfile /home/sdwillso/scripts/host.cli.1 --ompi-server file:/home/sdwillso/scripts/uri.txt -x DD_SUBSYS= -x DD_MASK= -x D_LOG_FILE=/tmp/daos_perf.log /home/sdwillso/daos_m/install/bin/daos_perf -T daos -P 2G -d 1 -a 200 -r 1000 -s 4K -C 1 -t -z
Test :
	DAOS (full stack)
Parameters :
	pool size     : 2048 MB
	credits       : 1 (sync I/O for -ve)
	obj_per_cont  : 1 x 8 (procs)
	dkey_per_obj  : 1
	akey_per_dkey : 200
	recx_per_akey : 1000
	value type    : single
	value size    : 4096
	zero copy     : yes
	overwrite     : yes
	VOS file      : <NULL>
Started...
update successfully completed:
	duration : 14.699587  sec
	bandwith : 425.182    MB/sec
	rate     : 108846.60  IO/sec
	latency  : 9.187      us (nonsense if credits > 1)
Duration across processes:
MAX duration : 14.699079  sec
MIN duration : 6.184355   sec
Average duration : 10.489390  sec

IOR, single client, 4 GB pool

[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi  --ompi-server file:~/scripts/uri.txt ior -v  -i 5 -a DAOS -w -o `uuidgen` -b 10g -t 1m -O daospool=57de6d7e-313d-467c-9cf9-817c7261a3f1,daosrecordsize=1m,daosstripesize=1m,daosstripecount=1024,daosaios=16,daosobjectclass=LARGE,daosPoolSvc=1,daosepoch=1
IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Tue May 15 18:23:26 2018
Command line used: ior -v -i 5 -a DAOS -w -o 4ae6dedf-f455-4100-9c49-6c36f18d205a -b 10g -t 1m -O daospool=57de6d7e-313d-467c-9cf9-817c7261a3f1,daosrecordsize=1m,daosstripesize=1m,daosstripecount=1024,daosaios=16,daosobjectclass=LARGE,daosPoolSvc=1,daosepoch=1
Machine: Linux boro-59.boro.hpdd.intel.com
Start time skew across all tasks: 0.00 sec

Test 0 started: Tue May 15 18:23:26 2018
Path: /home/sdwillso
FS: 3.8 TiB   Used FS: 9.0%   Inodes: 250.0 Mi   Used Inodes: 1.8%
Participating tasks: 1
[0] WARNING: USING daosStripeMax CAUSES READS TO RETURN INVALID DATA
Summary:
	api                = DAOS
	test filename      = 4ae6dedf-f455-4100-9c49-6c36f18d205a
	access             = single-shared-file, independent
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file= no tasks offsets
	clients            = 1 (1 per node)
	repetitions        = 5
	xfersize           = 1 MiB
	blocksize          = 10 GiB
	aggregate filesize = 10 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
Commencing write performance test: Tue May 15 18:23:26 2018
write     6122       10485760   1024.00    0.001180   1.67       0.003044   1.67       0   
remove    -          -          -          -          -          -          0.008123   0   
Commencing write performance test: Tue May 15 18:23:28 2018
write     6047       10485760   1024.00    0.000790   1.69       0.003101   1.69       1   
remove    -          -          -          -          -          -          0.007797   1   
Commencing write performance test: Tue May 15 18:23:30 2018
write     5787       10485760   1024.00    0.000708   1.77       0.003190   1.77       2   
remove    -          -          -          -          -          -          0.007731   2   
Commencing write performance test: Tue May 15 18:23:32 2018
write     6078       10485760   1024.00    0.000711   1.68       0.002450   1.68       3   
remove    -          -          -          -          -          -          0.007542   3   
Commencing write performance test: Tue May 15 18:23:33 2018
write     6120       10485760   1024.00    0.000705   1.67       0.002252   1.67       4   
remove    -          -          -          -          -          -          0.007559   4   

Max Write: 6122.37 MiB/sec (6419.77 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write        6122.37    5786.92    6030.81     125.16    1.69870 0 1 1 5 0 0 1 0 0 1 10737418240 1048576 10737418240 DAOS 0

IOR, 2 clients, 10 GB pool, data verification enabled

[sdwillso@boro-59 ~]$ orterun -np 1 --hostfile ~/hostlists/daos_client_hostlist --mca mtl ^psm2,ofi  --ompi-server file:~/scripts/uri.txt ior -W -v  -i 5 -a DAOS -w -o `uuidgen` -b 10g -t 1m -O daospool=2b8ce7c0-5bba-4fc0-909b-6d968b252b93,daosrecordsize=1m,daosstripesize=1m,daosstripecount=1024,daosaios=16,daosobjectclass=LARGE,daosPoolSvc=1,daosepoch=1
IOR-3.0.1: MPI Coordinated Test of Parallel I/O

Began: Tue May 15 22:16:04 2018
Command line used: ior -W -v -i 5 -a DAOS -w -o f6b0d019-98f2-4c49-bd25-25a0d66c6c2f -b 10g -t 1m -O daospool=2b8ce7c0-5bba-4fc0-909b-6d968b252b93,daosrecordsize=1m,daosstripesize=1m,daosstripecount=1024,daosaios=16,daosobjectclass=LARGE,daosPoolSvc=1,daosepoch=1
Machine: Linux boro-59.boro.hpdd.intel.com
Start time skew across all tasks: 0.00 sec

Test 0 started: Tue May 15 22:16:04 2018
Path: /home/sdwillso
FS: 3.8 TiB   Used FS: 9.0%   Inodes: 250.0 Mi   Used Inodes: 1.8%
Participating tasks: 1
[0] WARNING: USING daosStripeMax CAUSES READS TO RETURN INVALID DATA
Summary:
	api                = DAOS
	test filename      = f6b0d019-98f2-4c49-bd25-25a0d66c6c2f
	access             = single-shared-file, independent
	pattern            = segmented (1 segment)
	ordering in a file = sequential offsets
	ordering inter file= no tasks offsets
	clients            = 1 (1 per node)
	repetitions        = 5
	xfersize           = 1 MiB
	blocksize          = 10 GiB
	aggregate filesize = 10 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
Commencing write performance test: Tue May 15 22:16:04 2018
write     3724       10485760   1024.00    0.001198   2.75       0.002832   2.75       0   
Verifying contents of the file(s) just written.
Tue May 15 22:16:07 2018

remove    -          -          -          -          -          -          0.007806   0   
Commencing write performance test: Tue May 15 22:16:15 2018
write     3769       10485760   1024.00    0.000693   2.71       0.003378   2.72       1   
Verifying contents of the file(s) just written.
Tue May 15 22:16:17 2018

remove    -          -          -          -          -          -          0.007733   1   
Commencing write performance test: Tue May 15 22:16:25 2018
write     3749       10485760   1024.00    0.000684   2.73       0.002396   2.73       2   
Verifying contents of the file(s) just written.
Tue May 15 22:16:28 2018

remove    -          -          -          -          -          -          0.007591   2   
Commencing write performance test: Tue May 15 22:16:36 2018
write     3774       10485760   1024.00    0.000679   2.71       0.002356   2.71       3   
Verifying contents of the file(s) just written.
Tue May 15 22:16:39 2018

remove    -          -          -          -          -          -          0.007547   3   
Commencing write performance test: Tue May 15 22:16:47 2018
write     3784       10485760   1024.00    0.000760   2.70       0.002375   2.71       4   
Verifying contents of the file(s) just written.
Tue May 15 22:16:49 2018

remove    -          -          -          -          -          -          0.007593   4   

Max Write: 3784.03 MiB/sec (3967.84 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write        3784.03    3723.54    3760.10      21.55    2.72342 0 1 1 5 0 0 1 0 0 1 10737418240 1048576 10737418240 DAOS 0

daosbench

kv-idx-update

Time: 66.753337 seconds (14980.524402 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi  --ompi-server file:~/scripts/uri.txt daosbench --test=kv-idx-update --testid=1 --svc=1 --dpool=357aa259-1ab1-4fe3-9125-71bab9ed9139 --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:27:17 2018
=================================
===============================
Test Setup
---------------
Test: kv-idx-update
DAOS pool :357aa259-1ab1-4fe3-9125-71bab9ed9139
DAOS container :7e5c5fb5-b197-4752-8e56-a45ffabc9005
Value buffer size: 64
Number of processes: 1
Number of indexes/process: 1000000
Number of asynchronous I/O: 32
===============================
kv-idx-update
Time: 66.753337 seconds (14980.524402 ops per second)

kv-dkey-update

Time: 0.011910 seconds (8396.287980 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi  --ompi-server file:~/scripts/uri.txt daosbench --test=kv-dkey-update --testid=1 --svc=1 --dpool=4b5b1db0-245e-4037-aa61-47cb5006cace --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:36:16 2018
=================================
===============================
Test Setup
---------------
Test: kv-dkey-update
DAOS pool :4b5b1db0-245e-4037-aa61-47cb5006cace
DAOS container :10bb507e-cd78-4d38-9865-961c10480a37
Value buffer size: 64
Number of processes: 1
Number of keys/process: 100
Number of asynchronous I/O: 32
===============================
kv-dkey-update
Time: 0.011910 seconds (8396.287980 ops per second)
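
Note: although --indexes=1000000 is passed on the command line, the dkey/akey variants report "Number of keys/process: 100" and time per-key operations, which explains the sub-second runtimes (100 ops / 0.011910 s ≈ 8,396 ops/s, matching the reported rate); only kv-idx-update iterates over the full million indexes (1,000,000 / 66.753337 s ≈ 14,981 ops/s).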

kv-akey-update

Time: 0.010629 seconds (9407.848418 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi  --ompi-server file:~/scripts/uri.txt daosbench --test=kv-akey-update --testid=1 --svc=1 --dpool=c67bcdbd-36ec-4e00-857f-204df99f0646 --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:39:29 2018
=================================
===============================
Test Setup
---------------
Test: kv-akey-update
DAOS pool :c67bcdbd-36ec-4e00-857f-204df99f0646
DAOS container :288e9ad9-42e6-4016-9284-93d2a011f1f3
Value buffer size: 64
Number of processes: 1
Number of keys/process: 100
Number of asynchronous I/O: 32
===============================
kv-akey-update
Time: 0.010629 seconds (9407.848418 ops per second)

kv-dkey-fetch

Time: 0.006670 seconds (14993.068912 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi  --ompi-server file:~/scripts/uri.txt daosbench --test=kv-dkey-fetch --testid=1 --svc=1 --dpool=f7849c43-4508-4fc3-a866-de3a998cd3a7 --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:41:31 2018
=================================
===============================
Test Setup
---------------
Test: kv-dkey-fetch
DAOS pool :f7849c43-4508-4fc3-a866-de3a998cd3a7
DAOS container :318250b7-e15b-442c-b746-29e7b2cc7229
Value buffer size: 64
Number of processes: 1
Number of keys/process: 100
Number of asynchronous I/O: 32
===============================
kv-dkey-fetch
Time: 0.006670 seconds (14993.068912 ops per second)

kv-akey-fetch

Time: 0.006212 seconds (16098.672065 ops per second)
[sdwillso@boro-59 ~]$ orterun -np 1 --mca mtl ^psm2,ofi  --ompi-server file:~/scripts/uri.txt daosbench --test=kv-akey-fetch --testid=1 --svc=1 --dpool=5999fb3b-2f4d-4c3f-b7d5-79a9be8d13a4 --container=`uuidgen` --object-class=tiny --aios=32 --indexes=1000000
================================
DAOSBENCH (KV)
Started at
Tue May 15 18:43:19 2018
=================================
===============================
Test Setup
---------------
Test: kv-akey-fetch
DAOS pool :5999fb3b-2f4d-4c3f-b7d5-79a9be8d13a4
DAOS container :b3291b9d-a3bc-4d3e-b250-e21e3d8597fa
Value buffer size: 64
Number of processes: 1
Number of keys/process: 100
Number of asynchronous I/O: 32
===============================
kv-akey-fetch
Time: 0.006212 seconds (16098.672065 ops per second)

mpich tests

Results:

All tests reported "No Errors" in roughly 50% of attempts; in the other ~50%, daos_server segfaulted. No ticket has been filed yet, though Mohamad has begun debugging. For convenience, the text of his email exchange on the daos-devel-request mailing list is pasted here:

Stephen ran into a server segfault when testing mpich. I verified and replicated the segfault, and it seems it has to do with ofi/PSM2 (trace below).
I didn’t see that segfault before, and it doesn’t occur all the time. I just ran the mpi-io tests and replicated it once on the first run and another time on the second run, so it’s easy to replicate, although not always.
 
Here are my env variables:
export CRT_PHY_ADDR_STR="ofi+psm2"
export OFI_INTERFACE=ib0
export OFI_PORT=26850
export FI_PSM2_NAME_SERVER=1
export PSM2_MULTI_EP=1
export FI_SOCKETS_MAX_CONN_RETRY=1
export CRT_CTX_SHARE_ADDR=1
export CRT_CTX_NUM=8
export D_LOG_MASK=CRIT
export DD_SUBSYS=0
 
On the client I have the same as above but I add this (CRT_CTX_NUM is overwritten):
export DAOS_SINGLETON_CLI=1
export CRT_ATTACH_INFO_PATH=~
export FI_PSM2_DISCONNECT=1
export CRT_CTX_NUM=2
 
From the trace it seems we run into this on a pool connect?
Again, I never ran into this before, so it seems like a regression on the server side.
 
Does this look familiar?
 
Segfault trace:
#0  0x00007ffff3f0cd0f in psmx2_write_generic ()
   from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#1  0x00007ffff3f0e458 in psmx2_writev ()
   from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#2  0x00007ffff536f35e in fi_writev (context=0x7fffcd23a928, key=2, 
    addr=<optimized out>, dest_addr=3940649673949191, count=1, 
    desc=0x7fffd81e0538, iov=0x7fffd81e0540, ep=0x7fde60)
    at /scratch/mschaara/DEPS/ofi/include/rdma/fi_rma.h:130
#3  na_ofi_put (na_class=0x6d88d0, context=0x7fffcc02d510, 
    callback=<optimized out>, arg=<optimized out>, 
    local_mem_handle=<optimized out>, local_offset=<optimized out>, 
    remote_mem_handle=0x7fffcd23e900, remote_offset=0, length=64, 
    remote_addr=0x7fffcd23d8a0, remote_id=0 '\000', op_id=0x7fffcd23c9f0)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/na/na_ofi.c:3642
#4  0x00007ffff558ddd1 in hg_bulk_transfer_pieces (
    na_bulk_op=na_bulk_op@entry=0x7ffff558dbf0 <hg_bulk_na_put>, 
    origin_addr=origin_addr@entry=0x7fffcd23d8a0, origin_id=0 '\000', 
    hg_bulk_origin=hg_bulk_origin@entry=0x7fffcd239840, 
    origin_segment_start_index=origin_segment_start_index@entry=0, 
    origin_segment_start_offset=origin_segment_start_offset@entry=0, 
    hg_bulk_local=hg_bulk_local@entry=0x7fffcd23a7b0, 
    local_segment_start_index=local_segment_start_index@entry=0, 
    local_segment_start_offset=local_segment_start_offset@entry=0, 
    size=size@entry=64, scatter_gather=scatter_gather@entry=0 '\000', 
    hg_bulk_op_id=hg_bulk_op_id@entry=0x7fffcd23e760, 
    na_op_count=na_op_count@entry=0x0, use_sm=0 '\000')
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:784
#5  0x00007ffff558fb98 in hg_bulk_transfer (op_id=0x7fffd81e0838, 
    op_id@entry=0x40, size=64, size@entry=0, local_offset=<optimized out>, 
    hg_bulk_local=0x7fffcd23a7b0, hg_bulk_local@entry=0x0, 
    origin_offset=<optimized out>, hg_bulk_origin=0x7fffcd239840, 
    origin_id=<optimized out>, origin_addr=<optimized out>, 
    op=<optimized out>, arg=0x7fffcd23eaf0, 
    callback=0x7ffff7754080 <crt_hg_bulk_transfer_cb>, 
    context=<optimized out>)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:955
#6  HG_Bulk_transfer_id (context=<optimized out>, 
    callback=callback@entry=0x7ffff7754080 <crt_hg_bulk_transfer_cb>, 
    arg=arg@entry=0x7fffcd23eaf0, op=<optimized out>, 
    origin_addr=<optimized out>, origin_id=origin_id@entry=0 '\000', 
    origin_handle=0x7fffcd239840, origin_offset=origin_offset@entry=0, 
    local_handle=local_handle@entry=0x7fffcd23a7b0, 
    local_offset=local_offset@entry=0, size=size@entry=64, 
    op_id=op_id@entry=0x7fffd81e0838)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:1721
#7  0x00007ffff558fde2 in HG_Bulk_transfer (context=<optimized out>, 
    callback=callback@entry=0x7ffff7754080 <crt_hg_bulk_transfer_cb>, 
    arg=arg@entry=0x7fffcd23eaf0, op=<optimized out>, 
    origin_addr=<optimized out>, origin_handle=<optimized out>, 
    origin_offset=0, local_handle=0x7fffcd23a7b0, local_offset=0, size=64, 
    op_id=op_id@entry=0x7fffd81e0838)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_bulk.c:1643
#8  0x00007ffff77591ba in crt_hg_bulk_transfer (
    bulk_desc=bulk_desc@entry=0x7fffd81e0880, 
    complete_cb=0x7fffda5c0e10 <bulk_cb>, arg=0x7fffd81e0840, 
    opid=0x7fffd81e0838) at src/cart/crt_hg.c:1654
#9  0x00007ffff77383a3 in crt_bulk_transfer (
    bulk_desc=bulk_desc@entry=0x7fffd81e0880, 
    complete_cb=complete_cb@entry=0x7fffda5c0e10 <bulk_cb>, 
    arg=arg@entry=0x7fffd81e0840, opid=opid@entry=0x7fffd81e0838)
    at src/cart/crt_bulk.c:172
#10 0x00007fffda5c0d64 in transfer_map_buf (tx=tx@entry=0x7fffd81e0a10, 
    svc=<optimized out>, rpc=rpc@entry=0x7fffcd23e348, 
    remote_bulk=0x7fffcd239840, 
    required_buf_size=required_buf_size@entry=0x7fffcd23e48c)
    at src/pool/srv_pool.c:1689
#11 0x00007fffda5c433e in ds_pool_connect_handler (rpc=0x7fffcd23e348)
    at src/pool/srv_pool.c:1801
#12 0x00007ffff7773a00 in crt_handle_rpc (arg=0x7fffcd23e348)
    at src/cart/crt_rpc.c:1745
#13 0x00007ffff6a8f708 in ABTD_thread_func_wrapper ()
   from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#14 0x00007ffff6a8fc91 in make_fcontext ()
   from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#15 0x0000000000000000 in ?? ()
 
Thanks,
Mohamad


----------------------------------------------------

I also realized it’s not always the same trace where it segfaults; here is another one:
 
(gdb) bt
#0  0x00007ffff33b60ab in __psm2_mq_isend2 (mq=0x71c8d0, dest=0x7fffcd53c370, 
    flags=0, stag=0x7fffd8195ae0, buf=0x7fffd8d6d800, len=184, 
    context=0x7fffcc068788, req=0x7fffd8195ad8)
    at /scratch/ESSIO/OPA/opa-psm2/psm_mq.c:733
#1  0x00007ffff3f05594 in psmx2_tagged_send_no_flag_av_table ()
   from /scratch/mschaara/DEPS/ofi/lib/libfabric.so.1
#2  0x00007ffff536fa03 in fi_tsend (context=0x7fffcc068788, tag=4294967297, 
    dest_addr=3940649673949431, desc=0x7fffcc031a40, len=184, 
    buf=0x7fffd8d6d800, ep=0x7fde60)
    at /scratch/mschaara/DEPS/ofi/include/rdma/fi_tagged.h:116
#3  na_ofi_msg_send_expected (na_class=0x6d88d0, context=0x7fffcc02d510, 
    callback=<optimized out>, arg=<optimized out>, buf=0x7fffd8d6d800, 
    buf_size=184, plugin_data=0x7fffcc031a40, dest=0x7fffcd6316b0, 
    target_id=0 '\000', tag=1, op_id=0x7fffcc068610)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/na/na_ofi.c:3298
#4  0x00007ffff5585583 in hg_core_respond_na (hg_core_handle=<optimized out>)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_core.c:2084
#5  0x00007ffff558895b in HG_Core_respond (handle=0x7fffcc068510, 
    callback=callback@entry=0x7ffff558a290 <hg_core_respond_cb>, 
    arg=arg@entry=0x7fffcc068930, flags=<optimized out>, 
    payload_size=<optimized out>)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury_core.c:4794
#6  0x00007ffff558d82a in HG_Respond (handle=0x7fffcc068930, 
    callback=callback@entry=0x7ffff77547a0 <crt_hg_reply_send_cb>, 
    arg=arg@entry=0x7fffcd7280f0, out_struct=out_struct@entry=0x7fffcd728170)
    at /home/mschaara/source/deps_daos/daos_m/_build.external/mercury/src/mercury.c:2305
#7  0x00007ffff77579f1 in crt_hg_reply_send (
    rpc_priv=rpc_priv@entry=0x7fffcd7280f0) at src/cart/crt_hg.c:1312
#8  0x00007ffff77741b4 in crt_reply_send (req=req@entry=0x7fffcd728148)
    at src/cart/crt_rpc.c:1527
#9  0x00007fffda1999cf in ds_obj_rw_complete (map_version=1, status=0, 
    ioh=..., cont_hdl=<optimized out>, rpc=0x7fffcd728148)
    at src/object/srv_obj.c:102
#10 ds_obj_rw_handler (rpc=0x7fffcd728148) at src/object/srv_obj.c:724
#11 0x00007ffff7773a00 in crt_handle_rpc (arg=0x7fffcd728148)
    at src/cart/crt_rpc.c:1745
#12 0x00007ffff6a8f708 in ABTD_thread_func_wrapper ()
   from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#13 0x00007ffff6a8fc91 in make_fcontext ()
   from /scratch/mschaara/DEPS//argobots/lib/libabt.so.0
#14 0x0000000000000000 in ?? ()

For reference, a passing run of the mpich I/O test suite:
[sdwillso@boro-59 test]$ ./run_daos_tests daos:test_file
**** Testing I/O functions ****
**** Testing simple.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing async.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing async-multiple.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing coll_test.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing excl.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
../../../../src/mpi/romio/adio/ad_daos/ad_daos_open.c:281 ADIOI_DAOS_Open() - Array exists (EXCL mode) (-1004)

 No Errors
**** Testing file_info.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing i_noncontig.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing noncontig.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing noncontig_coll.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing noncontig_coll2.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing aggregation1.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing aggregation2.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing hindexed ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
-------------------------------------------------------
   [ 0 1 2 3 4 5 6 7 8 9  0 1 2 3 4 5 6 7 8 9 ]

[ 0] 0 1 2     3 4 5      D E F     G H I    
[ 1]                                         
[ 2] 6 7 8     9 : ;      J K L     M N O    
[ 3]                                         
[ 4]                                         
[ 5] X Y Z     [ \ ]      l m n     o p q    
[ 6]                                         
[ 7] ^ _ `     a b c      r s t     u v w    
[ 8]                                         
[ 9]                                         

[10] 0 1 2     3 4 5      D E F     G H I    
[11]                                         
[12] 6 7 8     9 : ;      J K L     M N O    
[13]                                         
[14]                                         
[15] X Y Z     [ \ ]      l m n     o p q    
[16]                                         
[17] ^ _ `     a b c      r s t     u v w    
[18]                                         
[19]                                         

 No Errors
**** Testing split_coll.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing psimple.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing error.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing status.c ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing types_with_zeros ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing darray_read ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
 No Errors
**** Testing fcoll_test.f ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
  No Errors
**** Testing pfcoll_test.f ****
POOL UUID = b6bc255f-c7df-4807-bb9c-44204137b1ac
SVC LIST = 0
  No Errors
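
Note on excl.c above: the "Array exists (EXCL mode)" message is presumably the expected failure path the test deliberately exercises (an MPI_MODE_EXCL open of an existing file must fail), which is why that test still reports No Errors.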