Project

General

Profile

Bug #480

Calling 'migrateMe' in SDAG serial block accesses members of deallocated chare

Added by Robert Steinke about 5 years ago. Updated over 1 year ago.

Status:
New
Priority:
Low
Assignee:
-
Category:
Migration
Target version:
-
Start date:
04/22/2014
Due date:
% Done:

0%


Description

I have some Charm++ code in ADHydro-crash.tar.gz that crashes when run on more than one processor. It doesn't crash if it is only run on one processor, or if I remove the migrateMe call at the end of mesh_element.cpp. It also doesn't crash if I rearrange the code to remove a while loop from the structured dagger code in ADHydro-nocrash.tar.gz. So it appears to be related to structured dagger and migration.

ADHydro-nocrash.tar.gz - Very similar code that doesn't crash (2.14 KB) Robert Steinke, 04/22/2014 05:26 PM

ADHydro-crash.tar.gz - Code that crashes (2.13 KB) Robert Steinke, 04/22/2014 05:26 PM

ADHydro-crash2.tar.gz - variant that crashes even when using RotateLB instead of migrateMe (2.29 KB) Robert Steinke, 04/24/2014 03:49 PM

ADHydro-crash2-output.txt View - Output of running ADHydro-crash2 on charm-6.5.1 (2.54 KB) Robert Steinke, 04/24/2014 04:15 PM

History

#1 Updated by Phil Miller about 5 years ago

  • Target version changed from 6.5.1 to 6.6.0
  • Assignee changed from PPL to Phil Miller

Targeting 6.6.0 since we're in the release preparation stages for that, and SDAG's implementation has been overhauled since the 6.5 release. I'll try to triage the problem this evening.

#2 Updated by Phil Miller about 5 years ago

  • Target version deleted (6.6.0)
  • Subject changed from Crash involving SDAG and migration to Calling 'migrateMe' in SDAG serial block accesses members of deallocated chare

OK, so this is a known limitation (documented, but rather poorly) on the use of the migrateMe function. When it's called, the element calling it is packed up, destructed, and de-allocated on its original host PE in-line, immediately, at the point of the call. When called inside SDAG code, that means that the SDAG control flow code subsequent to the call will be operating on an object that's no longer there.

If you use migration triggered by AtSync or periodic load balancing, this won't arise, since the infrastructure-controlled migration paths operate outside execution of the subject chares' entry methods. Since that generally matches live application usage patterns, while the demonstration here is commented as 'simulating' load balancing, I'd recommend working with that approach.

As a workaround, if you really want to use migrateMe() specifically, the way to do it would be to stick the call in a stand-alone entry method outside of any SDAG

// in mesh_element.ci
entry void doMigrate(int pe);
// in mesh_element.cpp
void MeshElement::doMigrate(int pe)
{
  migrateMe(pe);
}
and call that via thisProxy[thisIndex].doMigrate(pe) where you're currently calling migrateMe() explicitly in MeshElement::receiveMessage().

The documentation definitely needs to be improved to address this case explicitly, or the implementation of migrateMe made somewhat more forgiving.

As a point of interest, even the 'nocrash' version of the example you submitted exhibits this behavior, though in much milder (and hence non-crashing form). Per valgrind,

==15967== Invalid read of size 1
==15967==    at 0x4F52D0: MeshElement::_while_0_end(Closure_MeshElement::sendDoTimestep_3_closure*) (mesh_element.def.h:648)
==15967==    by 0x4F5350: MeshElement::_slist_2_end(Closure_MeshElement::sendDoTimestep_3_closure*) (mesh_element.def.h:667)
==15967==    by 0x4F539C: MeshElement::_case_0_end(Closure_MeshElement::sendDoTimestep_3_closure*) (mesh_element.def.h:681)
==15967==    by 0x4F5462: MeshElement::_caselist_0_end(Closure_MeshElement::sendDoTimestep_3_closure*, SDAG::CSpeculator*) (mesh_element.def.h:700)
==15967==    by 0x4F5696: MeshElement::_when_0_end(Closure_MeshElement::sendDoTimestep_3_closure*, SDAG::CSpeculator*, Closure_MeshElement::sendMessage_4_closure*) (mesh_element.def.h:746)
==15967==    by 0x4F56FB: MeshElement::_slist_3_end(Closure_MeshElement::sendDoTimestep_3_closure*, SDAG::CSpeculator*, Closure_MeshElement::sendMessage_4_closure*) (mesh_element.def.h:760)
==15967==    by 0x4F5843: MeshElement::_atomic_2(Closure_MeshElement::sendDoTimestep_3_closure*, SDAG::CSpeculator*, Closure_MeshElement::sendMessage_4_closure*) (mesh_element.def.h:783)
==15967==    by 0x4F56C9: MeshElement::_slist_3(Closure_MeshElement::sendDoTimestep_3_closure*, SDAG::CSpeculator*, Closure_MeshElement::sendMessage_4_closure*) (mesh_element.def.h:753)
==15967==    by 0x4F557C: MeshElement::_when_0(Closure_MeshElement::sendDoTimestep_3_closure*, SDAG::CSpeculator*, int) (mesh_element.def.h:728)
==15967==    by 0x4F59FF: MeshElement::sendMessage(Closure_MeshElement::sendMessage_4_closure*) (mesh_element.def.h:808)
==15967==    by 0x4F4621: CkIndex_MeshElement::_call_sendMessage_marshall4(void*, void*) (mesh_element.def.h:371)
==15967==    by 0x51279B: CkDeliverMessageFree (ck.C:593)
==15967==  Address 0x5bb1174 is 180 bytes inside a block of size 200 free'd
==15967==    at 0x4C2A82E: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15967==    by 0x4F228B: Chare::operator delete(void*) (charm++.h:227)
==15967==    by 0x501927: MeshElement::~MeshElement() (mesh_element.h:6)
==15967==    by 0x53124E: CkLocMgr::reclaim(CkArrayIndex const&, int) (cklocation.C:2649)
==15967==    by 0x52F0B7: CkLocRec_local::~CkLocRec_local() (cklocation.C:1639)
==15967==    by 0x52F187: CkLocRec_local::~CkLocRec_local() (cklocation.C:1645)
==15967==    by 0x532499: CkLocMgr::emigrate(CkLocRec_local*, int) (cklocation.C:3203)
==15967==    by 0x52F1BE: CkLocRec_local::migrateMe(int) (cklocation.C:1650)
==15967==    by 0x4F6E6F: CkMigratable::ckMigrate(int) (ckmigratable.h:52)
==15967==    by 0x4F6EC3: ArrayElement::migrateMe(int) (ckarray.h:537)
==15967==    by 0x4F38B9: MeshElement::receiveMessage() (mesh_element.cpp:49)
==15967==    by 0x4F57D4: MeshElement::_atomic_2(Closure_MeshElement::sendDoTimestep_3_closure*, SDAG::CSpeculator*, Closure_MeshElement::sendMessage_4_closure*) (mesh_element.def.h:776)

#3 Updated by Robert Steinke about 5 years ago

Thanks for the explanation, and thanks for getting to this so quickly. This resolves the issue for me. I don't really need to use migrateMe. I just thought it would be an easy way to test and to force a chare to be moved. I'm doing a simple example with only two chares so I thought a load balancer might wind up not moving anything. But now I know my chares are migratable so I'll just turn on the periodic load balancer.

#4 Updated by Phil Miller about 5 years ago

If you want to stress migration paths in your own code (e.g. PUP routines, process-local references), you can link the RotateLB strategy, which on every invocation cycles every migratable object to the next PE.

#5 Updated by Robert Steinke about 5 years ago

When the documentation gets updated you could also fix the fact that RotateLB isn't listed in section 7.2 "Available Load Balancing Strategies".

#6 Updated by Robert Steinke about 5 years ago

I don't know if I should open up a new ticket for this, but I found a variation that still crashes even when using a load balancer like RotateLB and not calling migrateMe. I'm now passing a struct to the MeshElement::sendInitialize() message and declaring PUPbytes() for the struct.

If run on more than one PE it crashes the first time the load balancer runs, but if I remove the struct argument from line 8 of mesh_element.ci and the two places it is called on lines 16 and 23 of adhydro.cpp then it runs fine.

#7 Updated by Phil Miller about 5 years ago

Could you post the full output from the crash you've observed? I can't reproduce this on a recent build (e.g. 6bc4cf42200374568bb49bdc3cc8d86c30739e97 committed 2014-04-22, git described as "v6.6.0-rc3-29-g6bc4cf4" in startup output). This may be a bug we've fixed since the release tarball that I'm guessing you were working with.

#8 Updated by Robert Steinke about 5 years ago

Here's the output, and I just noticed this assertion failure:

Assertion failed in file ./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h at line 638: el->usage

The output of my real code is a lot longer and I missed it, and I didn't look at the output of the reduced example other than to see that it crashed.

#9 Updated by Robert Steinke about 5 years ago

The file mentioned in the assert is in the MPI code. I am using MPICH 3.0.1.

#10 Updated by Phil Miller about 5 years ago

I've confirmed that there was an out-of-bounds memory access in our generated code in the released 6.5.1 code that has since been fixed in the SDAG overhaul I mentioned earlier. The 6.6 stable release is coming any day now, but in the meantime, the recent release candidates tagged in our git repository should be a much more pleasant experience for present development than the last release.

#11 Updated by Robert Steinke about 5 years ago

Thanks. I'll start using 6.6. Sorry to find a bug that you've already fixed.

#12 Updated by Phil Miller over 4 years ago

  • Priority changed from Normal to Low

#13 Updated by Phil Miller over 1 year ago

  • Assignee deleted (Phil Miller)

Also available in: Atom PDF