Bug #1792

AMPI failing occasionally during migration in PUP of MPI_Info objects

Added by Sam White over 1 year ago. Updated over 1 year ago.

Status: Merged
Priority: High
Assignee:
Category: AMPI
Target version:
Start date: 02/06/2018
Due date:
% Done: 0%

Description

On multicore-darwin-x86_64, in tests/ampi/megampi/:

$ lldb -- ./pgm +vp2 +p2 
(lldb) target create "./pgm" 
Current executable set to './pgm' (x86_64).
(lldb) settings set -- target.run-args  "+vp2" "+p2" 
(lldb) r
Process 68449 launched: './pgm' (x86_64)
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  2 threads
Converse/Charm++ Commit ID: v6.8.0-435-gfccd8e804
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
[0] RandCentLB created
[0] Testing: Testing...
[1] migrated from 1 to 0
[0] Testing: Testing...
[0] migrated from 0 to 1
[1] migrated from 0 to 1
[0] Testing: Testing...
pgm(68449,0x700000081000) malloc: *** error for object 0x5: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Process 68449 stopped
* thread #2: tid = 0x5b605, 0x00007fff91882f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
    frame #0: 0x00007fff91882f06 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff91882f06 <+10>: jae    0x7fff91882f10            ; <+20>
    0x7fff91882f08 <+12>: movq   %rax, %rdi
    0x7fff91882f0b <+15>: jmp    0x7fff9187d7cd            ; cerror_nocancel
    0x7fff91882f10 <+20>: retq   
(lldb) bt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.
* thread #2: tid = 0x5b605, 0x00007fff91882f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGABRT
  * frame #0: 0x00007fff91882f06 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff8466b4ec libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff92a6b6df libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff89c8d041 libsystem_malloc.dylib`free + 425
    frame #4: 0x00000001000182c5 pgm`CkPupPtrVec<InfoStruct, CkPupAllocatePtr<InfoStruct> >::~CkPupPtrVec() [inlined] CkZeroPtr<KeyvalPair, CkPupAllocatePtr<KeyvalPair> >::destroy() + 18 at cklists.h:475 [opt]
    frame #5: 0x00000001000182b3 pgm`CkPupPtrVec<InfoStruct, CkPupAllocatePtr<InfoStruct> >::~CkPupPtrVec() + 4 at cklists.h:498 [opt]
    frame #6: 0x00000001000182af pgm`CkPupPtrVec<InfoStruct, CkPupAllocatePtr<InfoStruct> >::~CkPupPtrVec() [inlined] CkPupPtrVec<KeyvalPair, CkPupAllocatePtr<KeyvalPair> >::~CkPupPtrVec() + 15 at cklists.h:496 [opt]
    frame #7: 0x00000001000182a0 pgm`CkPupPtrVec<InfoStruct, CkPupAllocatePtr<InfoStruct> >::~CkPupPtrVec() [inlined] InfoStruct::~InfoStruct(this=<unavailable>) at ampiimpl.h:323 [opt]
    frame #8: 0x00000001000182a0 pgm`CkPupPtrVec<InfoStruct, CkPupAllocatePtr<InfoStruct> >::~CkPupPtrVec() [inlined] InfoStruct::~InfoStruct(this=<unavailable>) at ampiimpl.h:323 [opt]
    frame #9: 0x00000001000182a0 pgm`CkPupPtrVec<InfoStruct, CkPupAllocatePtr<InfoStruct> >::~CkPupPtrVec() [inlined] CkZeroPtr<InfoStruct, CkPupAllocatePtr<InfoStruct> >::destroy() + 13 at cklists.h:475 [opt]
    frame #10: 0x0000000100018293 pgm`CkPupPtrVec<InfoStruct, CkPupAllocatePtr<InfoStruct> >::~CkPupPtrVec() + 31 at cklists.h:498 [opt]
    frame #11: 0x0000000100018274 pgm`CkPupPtrVec<InfoStruct, CkPupAllocatePtr<InfoStruct> >::~CkPupPtrVec(this=<unavailable>) + 20 at cklists.h:496 [opt]
    frame #12: 0x000000010001a79f pgm`ampiParent::~ampiParent(this=<unavailable>, vtt=<unavailable>) + 287 at ampi.C:1395 [opt]
    frame #13: 0x000000010001ab55 pgm`ampiParent::~ampiParent() [inlined] ampiParent::~ampiParent(this=0x000000010201dc00) + 21 at ampi.C:1392 [opt]
    frame #14: 0x000000010001ab49 pgm`ampiParent::~ampiParent(this=0x000000010201dc00) + 9 at ampi.C:1392 [opt]
    frame #15: 0x000000010009c161 pgm`CkArray::deleteElt(this=<unavailable>, id=<unavailable>) + 225 at ckarray.h:525 [opt]
    frame #16: 0x0000000100094ffc pgm`CkLocMgr::emigrate(this=<unavailable>, rec=<unavailable>, toPe=<unavailable>) + 700 at cklocation.C:3143 [opt]
    frame #17: 0x00000001000956b4 pgm`CkLocRec::staticMigrate(LDObjHandle, int) [inlined] CkLocRec::migrateMe(this=<unavailable>, toPe=0) + 13 at cklocation.C:1927 [opt]
    frame #18: 0x00000001000956a7 pgm`CkLocRec::staticMigrate(LDObjHandle, int) [inlined] CkLocRec::recvMigrate(this=<unavailable>) + 6 at cklocation.C:2063 [opt]
    frame #19: 0x00000001000956a1 pgm`CkLocRec::staticMigrate(h=LDObjHandle @ 0x0000700000080c30, dest=0) + 17 at cklocation.C:2056 [opt]
    frame #20: 0x000000010012cb77 pgm`LBDB::Migrate(LDObjHandle, int) [inlined] LBOM::Migrate(this=<unavailable>, dest=0) + 49 at LBOM.h:37 [opt]
    frame #21: 0x000000010012cb46 pgm`LBDB::Migrate(this=0x0000000100605e00, h=LDObjHandle @ 0x0000700000080c90, dest=0) + 118 at LBDBManager.C:324 [opt]
    frame #22: 0x000000010010e9e1 pgm`::LDMigrate(_h=LDObjHandle @ 0x0000700000080cd0, dest=<unavailable>) + 65 at lbdb.C:399 [opt]
    frame #23: 0x000000010013d800 pgm`CentralLB::ProcessReceiveMigration() [inlined] LBDatabase::Migrate(dest=<unavailable>) + 624 at LBDatabase.h:345 [opt]
    frame #24: 0x000000010013d7a2 pgm`CentralLB::ProcessReceiveMigration(this=<unavailable>) + 530 at CentralLB.C:1170 [opt]
    frame #25: 0x0000000100140b11 pgm`CkIndex_CentralLB::_call_redn_wrapper_ProcessReceiveMigration_void(impl_msg=0x0000000100544410, impl_obj_void=<unavailable>) + 17 at CentralLB.def.h:920 [opt]
    frame #26: 0x0000000100083404 pgm`_processHandler(void*, CkCoreState*) [inlined] CkDeliverMessageFree(msg=0x0000000100544410, obj=0x0000000100605f60) + 23 at ck.C:597 [opt]
    frame #27: 0x00000001000833ed pgm`_processHandler(void*, CkCoreState*) [inlined] _invokeEntryNoTrace(obj=0x0000000100605f60) + 4 at ck.C:641 [opt]
    frame #28: 0x00000001000833e9 pgm`_processHandler(void*, CkCoreState*) [inlined] _invokeEntry(obj=0x0000000100605f60) at ck.C:659 [opt]
    frame #29: 0x00000001000833e9 pgm`_processHandler(void*, CkCoreState*) [inlined] _deliverForBocMsg(CkCoreState*, int, envelope*, IrrGroup*) at ck.C:1083 [opt]
    frame #30: 0x00000001000833e9 pgm`_processHandler(void*, CkCoreState*) [inlined] _processForBocMsg(CkCoreState*, envelope*) at ck.C:1101 [opt]
    frame #31: 0x00000001000833e9 pgm`_processHandler(converseMsg=<unavailable>, ck=0x0000000100741d90) + 2809 at ck.C:1267 [opt]
    frame #32: 0x0000000100160735 pgm`CsdScheduleForever [inlined] CmiHandleMessage(msg=0x00000001005443c0) + 34 at convcore.c:1646 [opt]
    frame #33: 0x0000000100160713 pgm`CsdScheduleForever + 211 at convcore.c:1888 [opt]
    frame #34: 0x0000000100160455 pgm`CsdScheduler(maxmsgs=<unavailable>) + 21 at convcore.c:1824 [opt]
    frame #35: 0x000000010015ae32 pgm`ConverseRunPE(everReturn=0) + 738 at machine-common-core.c:1442 [opt]
    frame #36: 0x000000010015e346 pgm`call_startfn(vindex=0x0000000000000001) + 102 at machine-smp.c:414 [opt]
    frame #37: 0x00007fff8466899d libsystem_pthread.dylib`_pthread_body + 131
    frame #38: 0x00007fff8466891a libsystem_pthread.dylib`_pthread_start + 168
    frame #39: 0x00007fff84666351 libsystem_pthread.dylib`thread_start + 13
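
For context on the failure class: the abort comes from free() being called on the bogus address 0x5 while the CkPupPtrVec of keyval pairs inside the MPI_Info table is torn down, as the emigrating ampiParent element is deleted. The sketch below is hypothetical (Entry and OwnedPtrTable are illustrative names, not the actual AMPI types); it only shows the general pattern a pup'able vector of owned pointers has to follow so its destructor never walks stale or garbage pointers: null the slots on unpacking before filling them, pup the pointees rather than raw pointer values, and null each slot after deleting it.

// Hypothetical sketch (not the actual AMPI code): a pup'able table of owned
// pointers whose destructor stays safe across pack/unpack and deletion.
#include <vector>
#include "pup.h"   // Charm++ PUP framework

struct Entry {
  int key = 0, val = 0;
  void pup(PUP::er &p) { p | key; p | val; }
};

struct OwnedPtrTable {
  std::vector<Entry *> slots;

  void pup(PUP::er &p) {
    int n = static_cast<int>(slots.size());
    p | n;
    if (p.isUnpacking()) slots.assign(n, nullptr);  // never leave garbage pointer values
    for (int i = 0; i < n; i++) {
      if (p.isUnpacking()) slots[i] = new Entry;    // allocate before filling
      slots[i]->pup(p);                             // pup the pointee, never the raw pointer
    }
  }

  ~OwnedPtrTable() {
    for (Entry *&e : slots) { delete e; e = nullptr; }  // delete once, then null the slot
  }
};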

History

#1 Updated by Sam White over 1 year ago

This is showing up in some of the Jenkins builds for gerrit, but it only fails ~33% of the time on my machine.

#2 Updated by Sam White over 1 year ago

  • Status changed from In Progress to Implemented

Fix here: https://charm.cs.illinois.edu/gerrit/#/c/3637/

Our internal handling of MPI_Info objects is pretty ugly, since it is based on CkPupPtrVec rather than std::vector. Cleaning that up is a slightly bigger task, though.
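
A rough sketch of what that cleanup could look like, assuming the keyval pairs were simply held by value in a std::vector (the names below are loosely modeled on InfoStruct/KeyvalPair from ampiimpl.h, but this is not the actual patch):

// Hypothetical sketch: value-semantics storage for MPI_Info keyval pairs,
// so there is no manual pointer ownership to get wrong during PUP or destruction.
#include <string>
#include <vector>
#include "pup.h"
#include "pup_stl.h"   // PUP support for std::string and std::vector

struct KeyvalPair {
  std::string key, val;
  void pup(PUP::er &p) { p | key; p | val; }
};

struct InfoStruct {
  std::vector<KeyvalPair> pairs;       // copied, moved, and destroyed safely by default

  void pup(PUP::er &p) { p | pairs; }  // pup_stl.h pups the vector element-wise
};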

#3 Updated by Sam White over 1 year ago

  • Status changed from Implemented to Merged
