Bug #1738

Failure at LrtsInit with OFI build with verbs provider on Golub

Added by Jaemin Choi 13 days ago. Updated 10 days ago.

Status: New
Priority: Normal
Assignee:
Category: Machine Layers
Target version: -
Start date: 11/06/2017
Due date:
% Done: 0%


Description

When running the 1darray hello example program on Golub with the OFI build, I encountered the following error:

jchoi157@golub292:1darray$ mpirun ./hello
Charm++>ofi> provider: verbs;ofi_rxm
Charm++>ofi> control progress: 1
Charm++>ofi> data progress: 2
Charm++>ofi> maximum inject message size: 16320
Charm++>ofi> eager maximum message size: 65536 (maximum header size: 40)
Charm++>ofi> cq entries count: 8
Charm++>ofi> use inject: 1
Charm++>ofi> maximum rma size: 1073741824
Charm++>ofi> mr mode: 0x1
Charm++>ofi> use memory pool: 0
Charm++>ofi> use request cache: 0
Charm++>ofi> number of pre-allocated recvs: 8
Charm++>ofi> exchanging addresses over OFI
Error value: -11
[0] Stack Traceback:
  [0:0] CmiAbortHelper+0x4d  [0x527a7d]
  [0:1]   [0x527aeb]
  [0:2] LrtsInit+0xa33  [0x52c883]
  [0:3] ConverseInit+0xd5  [0x52ea85]
  [0:4] main+0x27  [0x4810c7]
  [0:5] __libc_start_main+0xfd  [0x2b4c920e7d1d]
  [0:6]   [0x47c619]
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: OFI::LrtsInit::fi_tsend error

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 256
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:1@golub295] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
[proxy:0:1@golub295] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@golub295] main (./pm/pmiserv/pmip.c:214): demux engine error waiting for event
[mpiexec@golub292] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:69): one of the processes terminated badly; aborting
[mpiexec@golub292] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@golub292] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:199): launcher returned error waiting for completion
[mpiexec@golub292] main (./ui/mpich/mpiexec.c:385): process manager error waiting for completion

The error value, -11, corresponds to -FI_EAGAIN (EAGAIN is 11 on Linux), i.e. 'try again'.
I will email the Intel folks about it.
Attached are the outputs of fi_info and fi_info --version.

fi_info.txt (2.91 KB) Jaemin Choi, 11/06/2017 04:49 PM


Related issues

Related to Charm++ - Bug #1740: Failure at LrtsInit with OFI build with gni provider on Edison (New, 11/07/2017)
Blocks Charm++ - Support #1674: Add 'ofi' target to autobuild (In Progress, 09/11/2017)

History

#1 Updated by Jaemin Choi 13 days ago

#2 Updated by Nitin Bhat 12 days ago

  • Subject changed from Failure at LrtsInit with OFI build on verbs, golub to Failure at LrtsInit with OFI build with verbs provider on Golub

#3 Updated by Nitin Bhat 12 days ago

  • Related to Bug #1740: Failure at LrtsInit with OFI build with gni provider on Edison added

#4 Updated by Nitin Bhat 10 days ago

Got the same error on iForge (another InfiniBand cluster at NCSA).
