Bug #1966

OFI non-SMP fails when #PUs > #cores

Added by Joseph Hutter about 1 month ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee:
Category: Machine Layers
Target version: -
Start date: 08/16/2018
Due date:
% Done: 0%
Tags: ofi

Description

I get this error both with a batch file that requests 128 CPUs per task and 1 task per node, and with one that requests 128 tasks per node and 1 CPU per task:

Running on 128 processors:  ./barnes -in=10m.dat -killat=2
charmrun>  /bin/setarch x86_64 -R  mpirun -np 128  ./barnes -in=10m.dat -killat=2
c417-014.stampede2.tacc.utexas.edu.199777hfi_userinit: assign_context command failed: Device or resource busy
c417-014.stampede2.tacc.utexas.edu.199777psmi_context_open: hfi_userinit: failed, trying again (1/3)
c417-014.stampede2.tacc.utexas.edu.199777hfi_userinit: assign_context command failed: Device or resource busy
c417-014.stampede2.tacc.utexas.edu.199777psmi_context_open: hfi_userinit: failed, trying again (2/3)
c417-014.stampede2.tacc.utexas.edu.199777hfi_userinit: assign_context command failed: Device or resource busy
c417-014.stampede2.tacc.utexas.edu.199777psmi_context_open: hfi_userinit: failed, trying again (3/3)
c417-014.stampede2.tacc.utexas.edu.199777hfi_userinit: assign_context command failed: Device or resource busy
[0] Stack Traceback:
 [0:0] _Z14CmiAbortHelperPKcS0_S0_ii+0x4d  [0x5741cd]
 [0:1]   [0x57423b]
 [0:2] _Z8LrtsInitPiPPPcS_S_+0xa78  [0x579188]
 [0:3] ConverseInit+0x214  [0x57ae74]
 [0:4] charm_main+0x27  [0x4bde47]
 [0:5] __libc_start_main+0xf5  [0x2ac2152ffc05]
 [0:6]   [0x498989]

When running in a basic idev session, I see this with +p69:

Running on 69 processors:  ./barnes -in=10m.dat -killat=2
charmrun>  /bin/setarch x86_64 -R  mpirun -np 69  ./barnes -in=10m.dat -killat=2
Charm++>ofi> provider: psm2
Charm++>ofi> control progress: 2
Charm++>ofi> data progress: 2
Charm++>ofi> maximum inject message size: 64
Charm++>ofi> eager maximum message size: 65536 (maximum header size: 40)
Charm++>ofi> cq entries count: 8
Charm++>ofi> use inject: 1
Charm++>ofi> maximum rma size: 4294967295
Charm++>ofi> mr mode: 0x2
Charm++>ofi> use memory pool: 0
Charm++>ofi> use request cache: 0
Charm++>ofi> number of pre-allocated recvs: 8
Charm++>ofi> exchanging addresses over OFI
c455-102.stampede2.tacc.utexas.edu.33662hfi_userinit: assign_context command failed: Device or resource busy
c455-102.stampede2.tacc.utexas.edu.33662psmi_context_open: hfi_userinit: failed, trying again (1/3)
c455-102.stampede2.tacc.utexas.edu.33662hfi_userinit: assign_context command failed: Device or resource busy
c455-102.stampede2.tacc.utexas.edu.33662psmi_context_open: hfi_userinit: failed, trying again (2/3)
c455-102.stampede2.tacc.utexas.edu.33662hfi_userinit: assign_context command failed: Device or resource busy
c455-102.stampede2.tacc.utexas.edu.33662psmi_context_open: hfi_userinit: failed, trying again (3/3)
c455-102.stampede2.tacc.utexas.edu.33662hfi_userinit: assign_context command failed: Device or resource busy
c455-102.stampede2.tacc.utexas.edu.33662PSM2 can't open hfi unit: -1 (err=23)
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: OFI::LrtsInit::fi_domain error
[0] Stack Traceback:
 [0:0] _Z14CmiAbortHelperPKcS0_S0_ii+0x4d  [0x5741cd]
 [0:1]   [0x57423b]
 [0:2] _Z8LrtsInitPiPPPcS_S_+0xa78  [0x579188]
 [0:3] ConverseInit+0x214  [0x57ae74]
 [0:4] charm_main+0x27  [0x4bde47]
 [0:5] __libc_start_main+0xf5  [0x2b9685a6ac05]
 [0:6]   [0x498989]
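
For triage, it may help to count how many processes hit the context-assignment failure in a captured log. The following is a minimal sketch, not part of Charm++ or PSM2: the helper name `count_context_failures` is hypothetical, and it assumes the log lines keep the `<hostname>.<pid>hfi_userinit: assign_context command failed` shape shown above, with the host and PID fused onto the function name.

```python
import re
from collections import Counter

# Matches lines such as:
#   c455-102.stampede2.tacc.utexas.edu.33662hfi_userinit: assign_context command failed: Device or resource busy
# Assumption: the prefix is "<hostname>.<pid>" immediately followed by the function name.
FAILURE = re.compile(
    r"^(?P<host>\S+?)\.(?P<pid>\d+)hfi_userinit: assign_context command failed"
)


def count_context_failures(log_text):
    """Map (host, pid) to the number of failed assign_context attempts in log_text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE.match(line.strip())
        if match:
            counts[(match.group("host"), int(match.group("pid")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_failures.py < job_output.log
    for (host, pid), n in sorted(count_context_failures(sys.stdin.read()).items()):
        print(f"{host} pid {pid}: {n} failed assign_context attempts")
```

In the logs above, each failing process reports four such lines (the initial attempt plus three retries), so the number of distinct (host, pid) keys approximates how many ranks could not obtain an HFI context.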

History

#1 Updated by Evan Ramos about 1 month ago

  • Description updated (diff)

#2 Updated by Evan Ramos about 1 month ago

  • Assignee set to Nitin Bhat
