Bug #1962

AMPI aborts from objid.h when run with >= 8192 ranks in non-production builds

Added by Sam White 4 months ago. Updated 4 months ago.

Status: Merged
Priority: Normal
Assignee:
Category: AMPI
Target version: 6.9.0
Start date: 08/08/2018
Due date:
% Done: 0%


Description

Even the simplest AMPI program can fail at startup with the following:

[7] Assertion "(CmiUInt8)gid.idx <= (COLLECTION_MASK >> ELEMENT_BITS)" failed in file ./../include/objid.h line 22.
------------- Processor 7 Exiting: Called CmiAbort ------------
Reason: Assertion "(CmiUInt8)gid.idx <= (COLLECTION_MASK >> ELEMENT_BITS)" failed in file ./../include/objid.h line 22.
[7] Stack Traceback:
  [7:0] _Z14CmiAbortHelperPKcS0_S0_ii+0xb3  [0x878550]
  [7:1] CmiGetNonLocal+0  [0x878589]
  [7:2] CmiCopyMsg+0  [0x8829f5]
  [7:3] _ZN2ck5ObjIDC2E10_ckGroupIDm+0x3e  [0x780ae6]
  [7:4] _ZNK12ArrayElement7ckGetIDEv+0x43  [0x7c3ded]
  [7:5] _ZN12ArrayElement10initBasicsEv+0x10b  [0x7b742f]
  [7:6] _ZN12ArrayElementC1Ev+0x41  [0x7b74c9]
  [7:7] _ZN13ArrayElementTIiEC1Ev+0x18  [0x66ef7c]
  [7:8] _ZN7CBaseT1I13ArrayElementTIiE11CProxy_ampiEC2IJEEEDpT_+0x1d  [0x711461]
  [7:9] _ZN4ampiC1E9CkArrayIDRK14ampiCommStruct+0x34  [0x6c9c26]
  [7:10] _ZN12CkIndex_ampi20_call_ampi_marshall2EPvS0_+0xf8  [0x6f33aa]
  [7:11] CkDeliverMessageFree+0x4d  [0x749966]
  [7:12] _ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x1b6  [0x774186]
  [7:13] _ZN8CkLocMgr15addElementToRecEP8CkLocRecP7CkArrayP12CkMigratableiPv+0x112  [0x7755cc]
  [7:14] _ZN8CkLocMgr10addElementE9CkArrayIDRK12CkArrayIndexP12CkMigratableiPv+0xf1  [0x775499]
  [7:15] _ZN7CkArray13insertElementEP14CkArrayMessageRK12CkArrayIndexPi+0x19b  [0x7ba689]
  [7:16] _ZN7CkArray13insertElementEO19CkMarshalledMessageRK12CkArrayIndexPi+0x3b  [0x7ba4eb]
  [7:17] _ZN15CkIndex_CkArray29_call_insertElement_marshall2EPvS0_+0x10f  [0x7be41d]
  [7:18] CkDeliverMessageFree+0x4d  [0x749966]
  [7:19]   [0x749ab1]
  [7:20]   [0x749cac]
  [7:21]   [0x74b447]
  [7:22]   [0x74b51b]
  [7:23] _Z15_processHandlerPvP11CkCoreState+0x115  [0x74bb8e]
  [7:24] CmiHandleMessage+0x89  [0x87f66a]
  [7:25] CsdScheduleForever+0xad  [0x87f8f4]
  [7:26] CsdScheduler+0x16  [0x87f825]
  [7:27]   [0x878359]
  [7:28] ConverseInit+0x5ec  [0x878260]
  [7:29] charm_main+0x3f  [0x73c1ee]
  [7:30] main+0x20  [0x734e3f]
  [7:31] __libc_start_main+0xf0  [0x7ffff6f72830]
  [7:32] _start+0x29  [0x6651d9]
Fatal error on PE 7> Assertion "(CmiUInt8)gid.idx <= (COLLECTION_MASK >> ELEMENT_BITS)" failed in file ./../include/objid.h line 22.
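
For reference, the program itself needs nothing special; a hello-world MPI program is enough to hit this once the virtual-processor count is large enough, in a non-production build (i.e. one built without --with-production, so the CmiAssert is enabled). The build and launch commands below are just one plausible way to reproduce:

```cpp
// hello.cpp -- any MPI program this trivial reproduces the abort; nothing
// AMPI-specific is needed in the source itself.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  std::printf("Hello from rank %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}
```

Built with ampicxx and launched with 8192 or more virtual ranks, e.g. `./charmrun +p8 ./hello +vp8192`, this aborts during startup with a trace like the one above.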

History

#1 Updated by Sam White 4 months ago

  • Subject changed from AMPI failures with more than 4096 ranks to AMPI aborts from objid.h when run with more than 4096 ranks in non-production builds

#2 Updated by Sam White 4 months ago

  • Subject changed from AMPI aborts from objid.h when run with more than 4096 ranks in non-production builds to AMPI aborts from objid.h when run with >= 8192 ranks in non-production builds

I think I know the problem here. When you launch an AMPI program with `n` VPs, you get a TCharm chare array with `n` elements, an ampiParent chare array with `n` elements, an ampi chare array with `n` elements representing MPI_COMM_WORLD, and `n` chare arrays with 1 element each, representing MPI_COMM_SELF for each VP. I think that last part (`n` one-element chare arrays) is more chare arrays than Charm's 64-bit ID can handle: 8192 = 2^13, and there are only 13 bits devoted to the "collection" index inside the 64-bit ID.
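
To make the arithmetic concrete, here is a rough sketch of the packing that the assertion in objid.h is guarding. The constant names mirror the assertion text, but apart from the 13 collection bits mentioned above, the bit widths are assumptions for illustration, not the literal objid.h values:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative layout of the 64-bit object ID: an element index, a
// "collection" (chare array) index, and a few leftover tag bits.
constexpr unsigned ELEMENT_BITS    = 48;   // assumed width of the element index
constexpr unsigned COLLECTION_BITS = 13;   // matches the 2^13 = 8192 failure point
constexpr uint64_t COLLECTION_MASK =
    ((uint64_t(1) << COLLECTION_BITS) - 1) << ELEMENT_BITS;

uint64_t packObjID(uint64_t collectionIdx, uint64_t elementIdx) {
  // This is the check from the trace: the chare-array index must fit in
  // COLLECTION_BITS. With 13 bits the maximum index is 8191, so the 8193rd
  // chare array overflows; with one MPI_COMM_SELF array per rank, ~8k ranks
  // is enough to get there.
  assert(collectionIdx <= (COLLECTION_MASK >> ELEMENT_BITS));
  return (collectionIdx << ELEMENT_BITS)
       | (elementIdx & ((uint64_t(1) << ELEMENT_BITS) - 1));
}
```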

#3 Updated by Sam White 4 months ago

So we could change the implementation of MPI_COMM_SELF from a 1-element chare array to just a C++ object, though then we'd need special handling for MPI_COMM_SELF everywhere in AMPI (i.e. if (comm == MPI_COMM_SELF) { call the C++ method inline } else { use a CProxy to make an asynchronous call }), which is a pain...
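
As a sketch of the kind of special-casing every communicating routine would grow, here is roughly what that branch looks like; all of the helper names below are hypothetical, purely for illustration:

```cpp
#include <mpi.h>

// Hypothetical helpers, named only for illustration -- not real AMPI API.
struct SelfComm  { int barrier() { return MPI_SUCCESS; } };
struct AmpiProxy { void startBarrier() {} };
SelfComm  *localSelfComm(MPI_Comm comm);      // per-rank local COMM_SELF object
AmpiProxy &getAmpiProxy(MPI_Comm comm);       // chare-array proxy for this comm
int blockUntilBarrierDone(MPI_Comm comm);     // suspend until completion

// What each AMPI routine would need if MPI_COMM_SELF were a plain C++ object
// instead of a 1-element chare array.
int AMPI_Barrier_sketch(MPI_Comm comm) {
  if (comm == MPI_COMM_SELF) {
    // Local object: no messaging, run the operation inline on this rank.
    return localSelfComm(comm)->barrier();
  } else {
    // Usual path: go through the chare-array proxy so the call stays asynchronous.
    getAmpiProxy(comm).startBarrier();
    return blockUntilBarrierDone(comm);
  }
}
```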

#4 Updated by Sam White 4 months ago

This issue has presumably been present since this change, which implemented MPI_COMM_SELF as chare arrays: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/2753/

We implemented that change because we were seeing all kinds of failures in AMPI's special handling of COMM_SELF.

#5 Updated by Sam White 4 months ago

Core discussion:
- MPI_COMM_SELF should not be implemented using single-element chare arrays. The alternative is to implement it as a C++ object, but that causes code bloat, since every routine in AMPI would have to check for comm == MPI_COMM_SELF and special-case it.
- A temporary fix, for 6.9.0, is to increase the COLLECTION_BITS and decrease the ELEMENT_BITS accordingly.
- A more permanent fix is to store the number of collection vs. element bits in the ObjID (or elsewhere) so that the split is parameterized rather than fixed at compile time.
- Subcommunicators more generally (besides MPI_COMM_SELF) in AMPI need a scalable implementation.
- The check on overflow of collection and element bits needs to be a true assert(), not CmiAssert(), so it's always enabled (see the sketch after this list).
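
A sketch of that last point, reusing the layout constants from the earlier sketch; the function name and exact form of the check are assumptions, the point is only that the check must not compile out in production builds the way CmiAssert() does:

```cpp
// Sketch: an always-on overflow check when a collection index is packed into
// the 64-bit ID, instead of a CmiAssert() that disappears in production builds.
void setCollectionIndex(uint64_t &id, uint64_t collectionIdx) {
  if (collectionIdx > (COLLECTION_MASK >> ELEMENT_BITS)) {
    // Unconditional abort: overflowing COLLECTION_BITS would otherwise silently
    // corrupt the ID by bleeding into the neighboring bit fields.
    CmiAbort("ObjID: too many chare arrays for COLLECTION_BITS in the 64-bit ID");
  }
  id = (id & ~COLLECTION_MASK) | (collectionIdx << ELEMENT_BITS);
}
```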

#6 Updated by Sam White 4 months ago

Here's the workaround, in which I somewhat arbitrarily chose the new ratio of collection vs element bits so that we can run AMPI with 16 million ranks: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4463/

#7 Updated by Sam White 4 months ago

  • Target version set to 6.9.0
  • Status changed from New to Implemented
  • Category set to AMPI

The workaround should be sufficient for 6.9.0, but longer term I'll open new Redmine issues for the remaining work on AMPI and on 64-bit IDs.

#8 Updated by Sam White 4 months ago

  • Status changed from Implemented to Merged
