Bug #1740

Failure at LrtsInit with OFI build with gni provider on Edison

Added by Nitin Bhat 12 days ago. Updated 3 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
11/07/2017
Due date:
% Done:
0%


Description

nbhat4@nid00031:~/software/charm/ofi-linux-x86_64/examples/converse/pingpong> make test
../../../bin/testrun  ./pingpong +p2

Running on 2 processors:  ./pingpong
charmrun>  /usr/bin/setarch x86_64 -R  mpirun -np 2  ./pingpong
Charm++>ofi> provider: gni
Charm++>ofi> control progress: 1
Charm++>ofi> data progress: 1
Charm++>ofi> maximum inject message size: 64
Charm++>ofi> eager maximum message size: 65536 (maximum header size: 40)
Charm++>ofi> cq entries count: 8
Charm++>ofi> use inject: 1
Charm++>ofi> maximum rma size: 4294967295
Charm++>ofi> mr mode: 0x1
Charm++>ofi> use memory pool: 0
The value of ret is -13 and the error message is Unknown error -13------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: OFI::LrtsInit::fi_endpoint error
[0] Stack Traceback:
  [0:0] CmiAbortHelper+0x4d  [0x41a40d]
  [0:1]   [0x41a47b]
  [0:2] LrtsInit+0xb07  [0x41f2f7]
  [0:3] ConverseInit+0xd5  [0x421395]
  [0:4] main+0x13  [0x4195d3]
  [0:5] __libc_start_main+0xf5  [0x2aaaabcca6e5]
  [0:6] _start+0x29  [0x419709]
Makefile:18: recipe for target 'test' failed
make: *** [test] Error 1

The return value of fi_endpoint is -13, which corresponds to FI_EACCES (Permission denied).
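As a quick cross-check of that mapping, here is a minimal sketch that decodes a negative return value by negating it and looking it up in the system errno table. It assumes the value falls in the range where libfabric reuses the standard errno numbers, so libfabric-specific codes are not covered. The helper name `describe_fi_error` is hypothetical and not part of Charm++ or libfabric.

```python
import errno
import os


def describe_fi_error(ret):
    """Return (symbolic name, message) for a negative errno-style return value.

    Assumption: the code is one of the standard errno values that libfabric
    reuses; provider- or libfabric-specific codes are reported as UNKNOWN.
    """
    code = -ret if ret < 0 else ret
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = describe_fi_error(-13)
    print(f"ret=-13 -> {name}: {message}")
```

The "Unknown error -13" text in the output above is consistent with the negative value having been passed to the string lookup without negating it first, though that is only a guess about the reporting code.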

I think the chosen provider is:

nbhat4@nid00031:~/software/charm/ofi-linux-x86_64/examples/converse/pingpong> fi_info
provider: gni
    fabric: gni
    domain: /sys/class/gni/kgni0
    version: 1.1
    type: FI_EP_RDM
    protocol: FI_PROTO_GNI

And the libfabric version is:

nbhat4@edison02:~/software/charm> fi_info --version
fi_info: 1.5.0
libfabric: 1.5.0
libfabric api: 1.5


Related issues

Related to Charm++ - Bug #1738: Failure at LrtsInit with OFI build with verbs provider on Golub (New, 11/06/2017)

History

#1 Updated by Nitin Bhat 12 days ago

  • Subject changed from Failure at LrtsInit with OFI build on gni on Edison to Failure at LrtsInit with OFI build with gni provider on Edison

#2 Updated by Nitin Bhat 12 days ago

  • Related to Bug #1738: Failure at LrtsInit with OFI build with verbs provider on Golub added

#3 Updated by Nitin Bhat 12 days ago

  • Description updated (diff)

#4 Updated by Yohann Burette 11 days ago

I got access to Edison, downloaded and built fabtests. Then I tried to run fi_rdm using 2 nodes from the debug pool.

Using FI_PROVIDER=sockets, fi_rdm works.

Using FI_PROVIDER=gni, I get the following error:

libfabric:gni:domain:gnix_domain_open():577<trace> [15403:1] 
libfabric:gni:fabric:gnix_domain_open():586<info> [15403:1] failed to find authorization key, creating new authorization key
libfabric:gni:fabric:__gnix_alps_init():276<warn> [15403:1] lli get response failed, alps_status=4(No such file or directory)
libfabric:gni:fabric:gnixu_get_rdma_credentials():412<warn> [15403:1] __gnix_app_init() failed, ret=-5(No such file or directory)
libfabric:gni:domain:_gnix_auth_key_enable():126<info> [15403:1] set resource limits: pkey=00002aaa ptag=170 reserved=64 registration_limit=192 reserved_keys=192-255
libfabric:gni:domain:gnix_domain_open():592<info> [15403:1] authorization key=0x60f3e0 ptag 170 cookie 0x2aaa
libfabric:gni:mr:__mr_reg():181<trace> [15403:1] 
libfabric:gni:mr:_gnix_mr_cache_init():991<trace> [15403:1] 
libfabric:gni:mr:_gnix_mr_cache_init():991<trace> [15403:1] 
libfabric:gni:mr:_gnix_mr_cache_register():1527<trace> [15403:1] 
libfabric:gni:ep_ctrl:gnix_nic_alloc():907<trace> [15403:1] 
libfabric:gni:ep_ctrl:gnix_nic_alloc():1010<warn> [15403:1] GNI_CdmAttach returned GNI_RC_INVALID_PARAM
libfabric:gni:fabric:_gnix_dump_gni_res():677<warn> [15403:1] Device Resources:
dev res:       MDD, avail: 4089 res: 409 held: 0 total: 4095
dev res:        CQ, avail: 2042 res: 10 held: 0 total: 2047
dev res:       FMA, avail: 126 res: 4 held: 0 total: 127
dev res:        CE, avail: 4 res: 0 held: 0 total: 4
dev res:       DLA, avail: 16384 res: 1024 held: 0 total: 16384
dev res:       TCR, avail: 63416 res: 0 held: 0 total: 16
dev res:       DVA, avail: 4398046511104 res: 1099511627776 held: 0 total: 4398046511104
dev res:      VMDH, avail: 4 res: 0 held: 0 total: 4
libfabric:gni:fabric:_gnix_dump_gni_res():693<warn> [15403:1] Job Resources:
libfabric:gni:mr:__gnix_generic_register():404<info> [15403:1] could not allocate nic to do mr_reg, ret=-22
libfabric:gni:mr:__mr_cache_create_registration():1451<info> [15403:1] failed to register memory with callback
fi_mr_reg(): common/shared.c:395, ret=-12 (Cannot allocate memory)

I am unfamiliar with GNI, but it looks like I'm missing some kind of resource. Do I need to request access to the NIC when issuing salloc?
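To pull the warnings out of a trace like the one above, a small filter can help. The following is a sketch only: the regular expression is inferred from the log lines in this comment rather than from any documented libfabric log format, and the names `pid` and `idx` for the bracketed pair are assumptions.

```python
import re

# Pattern inferred from lines such as:
# libfabric:gni:ep_ctrl:gnix_nic_alloc():1010<warn> [15403:1] GNI_CdmAttach returned ...
LOG_LINE = re.compile(
    r"^libfabric:(?P<provider>\w+):(?P<subsystem>\w+):"
    r"(?P<function>\w+)\(\):(?P<line>\d+)"
    r"<(?P<level>\w+)>\s*\[(?P<pid>\d+):(?P<idx>\d+)\]\s*(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields for a libfabric-style log line, or None."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def warnings_only(lines):
    """Keep only the parsed records whose level is 'warn'."""
    records = (parse_log_line(line) for line in lines)
    return [rec for rec in records if rec and rec["level"] == "warn"]


SAMPLE = [
    "libfabric:gni:ep_ctrl:gnix_nic_alloc():907<trace> [15403:1] ",
    "libfabric:gni:ep_ctrl:gnix_nic_alloc():1010<warn> [15403:1] GNI_CdmAttach returned GNI_RC_INVALID_PARAM",
    "dev res:       MDD, avail: 4089 res: 409 held: 0 total: 4095",
]

if __name__ == "__main__":
    for rec in warnings_only(SAMPLE):
        print(f"{rec['function']}:{rec['line']}: {rec['message']}")
```

Lines that do not match the pattern, such as the "dev res:" resource dump, are skipped rather than treated as errors.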

I do not know if it's related, but ibv_devinfo doesn't show any HCA on the second allocated node:

yburette@nid00077:~> ibv_devinfo
Failed to get IB devices list: Function not implemented

#5 Updated by Yohann Burette 3 days ago

I tried on Cori instead of Edison.

I still can't get fabtests to run, but I was able to run Charm++.

The tricky part is to build with the slurmpmi option:

$ ./build charm++ ofi-linux-x86_64 slurmpmi --incdir /global/common/cori/software/libfabric/1.5.0/gnu/include --libdir /global/common/cori/software/libfabric/1.5.0/gnu/lib --incdir /usr/include/slurm --libdir /usr/lib64/slurmpmi --with-production -j32

Then run NAMD with srun:

$ srun -N 2 -n 4 --export=LD_LIBRARY_PATH=/usr/lib64/slurmpmi:/global/common/cori/software/libfabric/1.5.0/gnu/lib ./namd2 $HOME/benchmarks/tiny/tiny.namd
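For scripting the launch, the srun invocation above can be assembled from its parts. This is a sketch under stated assumptions: `build_srun_command` is a hypothetical helper, the library directories are the ones quoted in this comment, and the input file name in the usage lines is a placeholder.

```python
def build_srun_command(nodes, tasks, lib_dirs, program, program_args=()):
    """Build an srun argument list that exports LD_LIBRARY_PATH to the job.

    lib_dirs is joined with ':' in the order given, matching the
    --export=LD_LIBRARY_PATH=... form used in the command above.
    """
    export = "--export=LD_LIBRARY_PATH=" + ":".join(lib_dirs)
    return ["srun", "-N", str(nodes), "-n", str(tasks), export,
            program, *program_args]


if __name__ == "__main__":
    cmd = build_srun_command(
        nodes=2,
        tasks=4,
        lib_dirs=[
            "/usr/lib64/slurmpmi",
            "/global/common/cori/software/libfabric/1.5.0/gnu/lib",
        ],
        program="./namd2",
        program_args=["tiny.namd"],  # placeholder input file
    )
    print(" ".join(cmd))
```

The resulting list can be passed directly to `subprocess.run` on a machine where srun is available.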

#6 Updated by Yohann Burette 3 days ago

Just to report that this also works on Edison now.
