Bug #1822

megatest/multisection test failures caused by changes to group dependencies

Added by Sam White 7 months ago. Updated 6 months ago.

Status: Merged
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 03/07/2018
Due date:
% Done: 0%


Description

Check if one of the recent patches (group dependence, zero copy API) caused this failure on netlrts-darwin-x86_64-smp:

../../../bin/testrun  ./pgm +p3  ++local

...

test 8: initiated [multisectiontest (ebohm)]
pgm(18125,0x700000081000) malloc: *** error for object 0x10050b580: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
------------- Processor 3 Exiting: Caught Signal ------------
Reason: unknown signal
[3] Stack Traceback:
  [3:0] 0   libsystem_platform.dylib            0x00007fffa098b52a _sigtramp + 26
  [3:1] 1   ???                                 0x00007fff633f45c8 0x0 + 140734858479048
  [3:2] 2   libsystem_c.dylib                   0x00007fff8f2b86df abort + 129
  [3:3] 3   libsystem_malloc.dylib              0x00007fff9c254396 szone_error + 626
  [3:4] 4   libsystem_malloc.dylib              0x00007fff9c2491cd szone_free_definite_size + 3663
  [3:5] 5   pgm                                 0x0000000100187670 IntegrateAckDatagram + 560
  [3:6] 6   pgm                                 0x0000000100187af2 CommunicationServerNet + 386
  [3:7] 7   pgm                                 0x000000010018505c ConverseRunPE + 668
  [3:8] 8   pgm                                 0x00000001001883b6 call_startfn + 102
  [3:9] 9   libsystem_pthread.dylib             0x00007fff9d71699d _pthread_body + 131
  [3:10] 10  libsystem_pthread.dylib             0x00007fff9d71691a _pthread_body + 0
  [3:11] 11  libsystem_pthread.dylib             0x00007fff9d714351 thread_start + 13

valgrind.out (65.3 KB) Sam White, 03/13/2018 01:26 PM

History

#1 Updated by Sam White 7 months ago

  • Subject changed from charm++ megatest failure in to charm++ megatest failure on netlrts-darwin-x86_64-smp

#2 Updated by Sam White 7 months ago

netlrts-darwin-x86_64 (non-SMP) also failed this test in last night's autobuild.

#3 Updated by Sam White 6 months ago

  • Subject changed from charm++ megatest failure on netlrts-darwin-x86_64-smp to charm++ megatest/multisection test failures

netlrts-linux-x86_64-smp appears to have hung in this test last night, and the various Darwin builds have all been failing it. Have you bisected this to see which patch caused it?

#4 Updated by Sam White 6 months ago

I bisected the failure to the Group Dependence patch. I had to run "./pgm +p4" many times to trigger the failure, but I see no failures before that patch.

Here's an lldb backtrace on multicore-darwin-x86_64:

  * frame #0: 0x00007fff94f7df06 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff87d664ec libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff961666df libsystem_c.dylib`abort + 129
    frame #3: 0x00007fff8d394396 libsystem_malloc.dylib`szone_error + 626
    frame #4: 0x00007fff8d38a5f4 libsystem_malloc.dylib`tiny_free_list_remove_ptr + 289
    frame #5: 0x00007fff8d388946 libsystem_malloc.dylib`szone_free_definite_size + 1480
    frame #6: 0x00000001001c3895 pgm`free_nomigrate(mem=0x0000000104b04c10) + 21 at libmemory-default.c:806
    frame #7: 0x00000001001fd9ae pgm`CmiFree(blk=0x0000000104b04c20) + 110 at convcore.c:3128
    frame #8: 0x00000001000a8e28 pgm`::CkFreeMsg(msg=0x0000000104b04c70) + 40 at msgalloc.C:71
    frame #9: 0x000000010009c885 pgm`CMessage_CkMarshallMsg::dealloc(p=0x0000000104b04c70) + 21 at CkMarshall.def.h:48
    frame #10: 0x000000010009460d pgm`::CkDeliverMessageFree(epIdx=60, msg=0x0000000104b04c70, obj=0x0000000100e02da0) + 141 at ck.C:603
    frame #11: 0x000000010009512b pgm`_invokeEntryNoTrace(epIdx=60, env=0x0000000104b04c20, obj=0x0000000100e02da0) + 59 at ck.C:641
    frame #12: 0x0000000100094ef9 pgm`::CkCreateLocalGroup(groupID=(idx = 59), epIdx=60, env=0x0000000104b04c20) + 537 at ck.C:737
    frame #13: 0x00000001000960ad pgm`_processBocInitMsg(ck=0x0000000100e009b0, env=0x0000000104b04c20) + 109 at ck.C:1172
    frame #14: 0x0000000100096390 pgm`_processHandler(converseMsg=0x0000000104b04c20, ck=0x0000000100e009b0) + 224 at ck.C:1274
    frame #15: 0x00000001001fc112 pgm`CmiHandleMessage(msg=0x0000000104b04c20) + 146 at convcore.c:1652
    frame #16: 0x00000001001fc57d pgm`CsdScheduleForever + 173 at convcore.c:1894
    frame #17: 0x00000001001fc16a pgm`CsdScheduler(maxmsgs=-1) + 26 at convcore.c:1830
    frame #18: 0x00000001001f3672 pgm`ConverseRunPE(everReturn=0) + 994 at machine-common-core.c:1527
    frame #19: 0x00000001001f8d54 pgm`call_startfn(vindex=0x0000000000000002) + 196 at machine-smp.c:414
    frame #20: 0x00007fff87d6399d libsystem_pthread.dylib`_pthread_body + 131
    frame #21: 0x00007fff87d6391a libsystem_pthread.dylib`_pthread_start + 168
    frame #22: 0x00007fff87d61351 libsystem_pthread.dylib`thread_start + 13

Also note this warning on ICC v18.0 coming from the offending commit:

msgalloc.C(123): warning #873: function "CMessage_CkMarshallMsg::operator new(size_t={unsigned long}, int, int, int)" has no corresponding operator delete (to be called if an exception is thrown during initialization of an allocated object)
      CkMarshallMsg *m=new (size,opts->getPriorityBits(),opts->getGroupDepNum())CkMarshallMsg;
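
For context on warning #873: in standard C++, when a constructor throws inside a placement new-expression, the runtime releases the storage only if a placement operator delete with the same extra parameters exists. The sketch below illustrates that rule with a hypothetical DemoMsg type. It is not the Charm++ CkMarshallMsg code, and the parameter names are assumptions chosen to mirror the (size, priority bits, group dependencies) call shown in the warning.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <stdexcept>

// Hypothetical message type for illustration only; not the real CkMarshallMsg.
struct DemoMsg {
    static int placementDeletes;  // counts calls to the matching placement delete

    explicit DemoMsg(bool fail) {
        if (fail) throw std::runtime_error("constructor failed");
    }

    // Placement allocation function taking three extra int arguments.
    static void* operator new(std::size_t size, int payloadBytes,
                              int /*prioBits*/, int groupDeps) {
        std::size_t total = size + static_cast<std::size_t>(payloadBytes)
                          + static_cast<std::size_t>(groupDeps) * sizeof(int);
        void* p = std::malloc(total);
        if (!p) throw std::bad_alloc();
        return p;
    }

    // Matching placement deallocation function: same extra parameter types.
    // It is called automatically if the constructor throws.
    static void operator delete(void* p, int, int, int) noexcept {
        ++placementDeletes;
        std::free(p);
    }

    // Usual deallocation function, used by an ordinary `delete msg;`.
    static void operator delete(void* p) noexcept { std::free(p); }
};

int DemoMsg::placementDeletes = 0;

// Returns how many times the placement delete ran.
int demoPlacementDelete() {
    try {
        DemoMsg* m = new (16, 0, 1) DemoMsg(true);  // constructor throws
        delete m;                                   // never reached
    } catch (const std::runtime_error&) {
        // Storage was already released through the placement delete.
    }
    DemoMsg* ok = new (16, 0, 1) DemoMsg(false);    // constructor succeeds
    delete ok;                                      // uses the usual delete
    return DemoMsg::placementDeletes;
}
```

Without the four-argument operator delete, the first allocation would simply leak when the constructor throws, which is the situation the compiler is warning about.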

#5 Updated by Sam White 6 months ago

Running the multisection test under Valgrind shows a bunch of leaks coming from _allocEnv() and the following invalid writes in envelope::setGroupDep():

==69346== 2 errors in context 1 of 47:
==69346== Invalid write of size 4
==69346==    at 0x100011417: envelope::setGroupDep(_ckGroupID const&, int) (envelope.h:440)
==69346==    by 0x100011D56: multisectiontest_main::multisectiontest_main() (multisectiontest.C:153)
==69346==    by 0x100013F8A: CkIndex_multisectiontest_main::_call_multisectiontest_main_void(void*, void*) (multisectiontest.def.h:320)
==69346==    by 0x100093F20: CkDeliverMessageFree (ck.C:597)
==69346==    by 0x100094A8A: _invokeEntryNoTrace(int, envelope*, void*) (ck.C:641)
==69346==    by 0x10009A956: _invokeEntry(int, envelope*, void*) (ck.C:652)
==69346==    by 0x100096836: _processNewVChareMsg(CkCoreState*, envelope*) (ck.C:992)
==69346==    by 0x100095E4E: _processHandler(void*, CkCoreState*) (ck.C:1313)
==69346==    by 0x1001F5171: CmiHandleMessage (convcore.c:1652)
==69346==    by 0x1001F55CC: CsdScheduleForever (convcore.c:1894)
==69346==    by 0x1001F51C9: CsdScheduler (convcore.c:1830)
==69346==    by 0x1001EC6E1: ConverseRunPE (machine-common-core.c:1514)
==69346==  Address 0x101599d10 is 0 bytes after a block of size 160 alloc'd
==69346==    at 0x100793681: malloc (vg_replace_malloc.c:302)
==69346==    by 0x1001BC8F4: malloc_nomigrate (libmemory-default.c:798)
==69346==    by 0x1001F684A: CmiAlloc (convcore.c:3023)
==69346==    by 0x1000A31A2: envelope::alloc(unsigned char, unsigned int, unsigned short, unsigned short) (envelope.h:322)
==69346==    by 0x1000A27B9: _allocEnv(int, int, int, int) (envelope.h:497)
==69346==    by 0x1000A2A88: CkAllocMsg (msgalloc.C:40)
==69346==    by 0x1000113C4: CMessage_multisectionAID_msg::operator new(unsigned long, int) (multisectiontest.def.h:140)
==69346==    by 0x100011CD0: multisectiontest_main::multisectiontest_main() (multisectiontest.C:148)
==69346==    by 0x100013F8A: CkIndex_multisectiontest_main::_call_multisectiontest_main_void(void*, void*) (multisectiontest.def.h:320)
==69346==    by 0x100093F20: CkDeliverMessageFree (ck.C:597)
==69346==    by 0x100094A8A: _invokeEntryNoTrace(int, envelope*, void*) (ck.C:641)
==69346==    by 0x10009A956: _invokeEntry(int, envelope*, void*) (ck.C:652)
==69346==
==69346==
==69346== 2 errors in context 2 of 47:
==69346== Invalid write of size 4
==69346==    at 0x100011417: envelope::setGroupDep(_ckGroupID const&, int) (envelope.h:440)
==69346==    by 0x100011CC1: multisectiontest_main::multisectiontest_main() (multisectiontest.C:146)
==69346==    by 0x100013F8A: CkIndex_multisectiontest_main::_call_multisectiontest_main_void(void*, void*) (multisectiontest.def.h:320)
==69346==    by 0x100093F20: CkDeliverMessageFree (ck.C:597)
==69346==    by 0x100094A8A: _invokeEntryNoTrace(int, envelope*, void*) (ck.C:641)
==69346==    by 0x10009A956: _invokeEntry(int, envelope*, void*) (ck.C:652)
==69346==    by 0x100096836: _processNewVChareMsg(CkCoreState*, envelope*) (ck.C:992)
==69346==    by 0x100095E4E: _processHandler(void*, CkCoreState*) (ck.C:1313)
==69346==    by 0x1001F5171: CmiHandleMessage (convcore.c:1652)
==69346==    by 0x1001F55CC: CsdScheduleForever (convcore.c:1894)
==69346==    by 0x1001F51C9: CsdScheduler (convcore.c:1830)
==69346==    by 0x1001EC6E1: ConverseRunPE (machine-common-core.c:1514)
==69346==  Address 0x1020a5010 is 0 bytes after a block of size 160 alloc'd
==69346==    at 0x100793681: malloc (vg_replace_malloc.c:302)
==69346==    by 0x1001BC8F4: malloc_nomigrate (libmemory-default.c:798)
==69346==    by 0x1001F684A: CmiAlloc (convcore.c:3023)
==69346==    by 0x1000A31A2: envelope::alloc(unsigned char, unsigned int, unsigned short, unsigned short) (envelope.h:322)
==69346==    by 0x1000A27B9: _allocEnv(int, int, int, int) (envelope.h:497)
==69346==    by 0x1000A2A88: CkAllocMsg (msgalloc.C:40)
==69346==    by 0x1000113C4: CMessage_multisectionAID_msg::operator new(unsigned long, int) (multisectiontest.def.h:140)
==69346==    by 0x100011C2F: multisectiontest_main::multisectiontest_main() (multisectiontest.C:141)
==69346==    by 0x100013F8A: CkIndex_multisectiontest_main::_call_multisectiontest_main_void(void*, void*) (multisectiontest.def.h:320)
==69346==    by 0x100093F20: CkDeliverMessageFree (ck.C:597)
==69346==    by 0x100094A8A: _invokeEntryNoTrace(int, envelope*, void*) (ck.C:641)
==69346==    by 0x10009A956: _invokeEntry(int, envelope*, void*) (ck.C:652)

#7 Updated by Sam White 6 months ago

I added the valgrind output to the issue description above.

#8 Updated by Sam White 6 months ago

  • Subject changed from charm++ megatest/multisection test failures to megatest/multisection test failures caused by changes to group dependencies

multisectiontest sets group dependencies on messages that are multicast over a section via CkMulticast. Perhaps group dependency handling is missing somewhere in the section multicast code?

#9 Updated by Nitin Bhat 6 months ago

  • Status changed from New to In Progress

#10 Updated by Nitin Bhat 6 months ago

  • Status changed from In Progress to Implemented

Thanks for the valgrind outputs, Sam.
The bug occurred because the Charm++ message in the multisection test was not allocated with a groupDependence field, so calling setGroupDep caused an invalid write. The fix was to change the message allocation call to reserve one groupDependence field.

Fix: https://charm.cs.illinois.edu/gerrit/#/c/3850/
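
To illustrate the failure mode described above, here is a minimal, self-contained model of a message header followed by a payload and a variable number of trailing dependency slots. All names (ModelEnvelope, modelAlloc, modelSetGroupDep) are hypothetical and the layout is an assumption made for illustration; this is not the Charm++ envelope implementation. With zero slots reserved, an unchecked write to slot 0 would land exactly at the end of the allocated block, which matches the "0 bytes after a block" report from Valgrind. The checked setter shows one possible way to reject such a write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Simplified stand-in for a message header; NOT the real Charm++ envelope.
struct ModelEnvelope {
    unsigned short groupDepNum;   // dependency slots reserved at allocation time
    unsigned int   payloadBytes;  // size of the user payload that follows
};

struct ModelGroupID { int idx; };  // 4-byte ID, like the size-4 write in the report

// Total block size: header + payload + trailing dependency slots.
std::size_t modelTotalSize(unsigned int payloadBytes, unsigned short groupDepNum) {
    return sizeof(ModelEnvelope) + payloadBytes
         + static_cast<std::size_t>(groupDepNum) * sizeof(ModelGroupID);
}

// Allocates a zeroed block and records how many slots were reserved.
ModelEnvelope* modelAlloc(unsigned int payloadBytes, unsigned short groupDepNum) {
    void* raw = std::calloc(1, modelTotalSize(payloadBytes, groupDepNum));
    if (!raw) throw std::bad_alloc();
    ModelEnvelope* env = static_cast<ModelEnvelope*>(raw);
    env->groupDepNum = groupDepNum;
    env->payloadBytes = payloadBytes;
    return env;
}

// Checked setter: returns false instead of writing past the end of the block.
bool modelSetGroupDep(ModelEnvelope* env, ModelGroupID gid, int index) {
    if (index < 0 || index >= env->groupDepNum) return false;  // no slot reserved
    char* slots = reinterpret_cast<char*>(env)
                + sizeof(ModelEnvelope) + env->payloadBytes;
    std::memcpy(slots + static_cast<std::size_t>(index) * sizeof(ModelGroupID),
                &gid, sizeof gid);
    return true;
}

// Returns true if a message allocated with `reserved` slots accepts a write to slot 0.
bool modelCanSetFirstDep(unsigned short reserved) {
    ModelEnvelope* env = modelAlloc(64, reserved);
    bool ok = modelSetGroupDep(env, ModelGroupID{59}, 0);
    std::free(env);
    return ok;
}
```

In this model, modelCanSetFirstDep(0) corresponds to the failing test (no slot reserved) and modelCanSetFirstDep(1) to the fixed allocation that reserves one slot.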

#11 Updated by Sam White 6 months ago

  • Status changed from Implemented to Merged

Evan created a patch to prevent similar issues in user code: https://charm.cs.illinois.edu/gerrit/#/c/3871/
