Bug #1539

Failure in migration when using RDMA sends in AMPI

Added by Sam White about 2 years ago. Updated about 2 years ago.

Status: Merged
Priority: High
Assignee:
Category: AMPI
Target version:
Start date: 04/28/2017
Due date:
% Done: 0%


Description

Several autobuild targets segfaulted in ampi/Cjacobi3D last night after AMPI RDMA was merged: mpi-linux-x86_64, mpi-linux-x86_64-smp, verbs-linux-x86_64, verbs-linux-x86_64-smp, and mpi-crayxe all failed in the same place. That's both verbs targets, and most but not all MPI targets.
The test consistently fails on the first RDMA message after the first migration when running on >= 3 PEs. It never seems to fail on just 2 PEs, and it makes no difference whether Isomalloc or PUP-based migration is used.

Valgrind output on mpi-linux-x86_64 shows that an invalid read of size 8 is always happening after migration when processing an RDMA message.

==28092== Invalid read of size 8
==28092==    at 0x4C2F79E: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28092==    by 0x6743A0: serializeContig(char*, char*, unsigned long, int) (ddt.C:17)
==28092==    by 0x675D00: CkDDT_DataType::serialize(char*, char*, int, int) const (ddt.C:549)
==28092==    by 0x65B7A6: ampi::processRdmaMsg(void*, int, int, int, void*, int, int, int) (ampi.C:2827)
==28092==    by 0x62ECF6: IReq::receiveRdma(ampi*, char*, int, int, int, int) (ampi.C:5382)
==28092==    by 0x625F1B: ampi::inorderRdma(char*, int, int, int, int, int, int) (ampi.C:2571)
==28092==    by 0x625D05: ampi::genericRdma(char*, int, int, int, int, int, int) (ampi.C:2525)
==28092==    by 0x6486FB: CkIndex_ampi::_call_genericRdma_marshall8(void*, void*) (ampi.def.h:3290)
==28092==    by 0x68D590: CkDeliverMessageFree (ck.C:593)
==28092==    by 0x6AE0DE: CkLocRec::invokeEntry(CkMigratable*, void*, int, bool) (cklocation.C:1969)
==28092==    by 0x6B0CAF: CkLocMgr::deliverMsg(CkArrayMessage*, CkArrayID, unsigned long, CkArrayIndex const*, CkDeliver_t, int) (cklocation.C:2656)
==28092==    by 0x69406B: CkArray::deliver(CkArrayMessage*, CkDeliver_t) (ckarray.h:688)
==28092==  Address 0x6d00180 is 0 bytes after a block of size 80,176 alloc'd
==28092==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28092==    by 0x79F946: malloc_nomigrate (libmemory-default.c:724)
==28092==    by 0x7A9B70: CmiAlloc (convcore.c:2930)
==28092==    by 0x7A4A23: PumpMsgs (machine.c:804)
==28092==    by 0x7A4C86: LrtsAdvanceCommunication (machine.c:1124)
==28092==    by 0x7A394E: AdvanceCommunication (machine-common-core.c:1317)
==28092==    by 0x7A3C16: CmiGetNonLocal (machine-common-core.c:1487)
==28092==    by 0x7A7281: CsdNextMessage (convcore.c:1772)
==28092==    by 0x7A73F5: CsdScheduleForever (convcore.c:1897)
==28092==    by 0x7A732C: CsdScheduler (convcore.c:1836)
==28092==    by 0x7A3919: ConverseRunPE (machine-common-core.c:1297)
==28092==    by 0x7A3805: ConverseInit (machine-common-core.c:1198)

What's weird is that this code path looks like it is doing the right thing: AMPI is copying memory from the recv buffer allocated in LRTS to the application's buffer (the messages in Cjacobi3D are 80,000 bytes long).
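For context, serializeContig in the trace above boils down to a plain memcpy. The sketch below is a simplified stand-in for that copy (not the actual ddt.C code), just to show where the invalid read gets reported: if the source pointer lies at or past the end of the CmiAlloc'd block, the copy reads out of bounds.

#include <cstring>
#include <cstddef>

// Simplified stand-in for the contiguous-copy path in the trace above
// (not the actual ddt.C code): an unchecked memcpy from the transport's
// receive buffer into the application's buffer.
static void copyContig(char *userBuf, const char *src, std::size_t nBytes) {
  std::memcpy(userBuf, src, nBytes);  // reads [src, src + nBytes) unconditionally
}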


Related issues

Related to Charm++ - Feature #1546: RDMA example with migration (Merged, 05/02/2017)

History

#1 Updated by Sam White about 2 years ago

Even if I run with NullLB, these two leaks still show up in Valgrind for an mpi-linux-x86_64 build (I haven't tried valgrind on any other RDMA layer).

==31616== 14,976 (11,088 direct, 3,888 indirect) bytes in 154 blocks are definitely lost in loss record 556 of 575
==31616==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31616==    by 0x71420A: CkRdmaIssueRgets(envelope*) (ckrdma.C:182)
==31616==    by 0x68F7A2: _processHandler(void*, CkCoreState*) (ck.C:1197)
==31616==    by 0x7A71A2: CmiHandleMessage (convcore.c:1663)
==31616==    by 0x7A746C: CsdScheduleForever (convcore.c:1900)
==31616==    by 0x7A737A: CsdScheduler (convcore.c:1836)
==31616==    by 0x7A3967: ConverseRunPE (machine-common-core.c:1297)
==31616==    by 0x7A3853: ConverseInit (machine-common-core.c:1198)
==31616==    by 0x67AB36: main (main.C:18)
==31616== 28,224 bytes in 196 blocks are definitely lost in loss record 567 of 575
==31616==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31616==    by 0x79F994: malloc_nomigrate (libmemory-default.c:724)
==31616==    by 0x7A9BBE: CmiAlloc (convcore.c:2930)
==31616==    by 0x683F91: envelope::alloc(unsigned char, unsigned int, unsigned short) (envelope.h:312)
==31616==    by 0x68430D: _allocEnv(int, int, int) (envelope.h:485)
==31616==    by 0x6990E8: CkAllocMsg (msgalloc.C:21)
==31616==    by 0x6A59FC: CMessage_CkDataMsg::alloc(int, unsigned long, int*, int) (CkCallback.def.h:238)
==31616==    by 0x6A589A: CMessage_CkDataMsg::operator new(unsigned long, int*, int) (CkCallback.def.h:220)
==31616==    by 0x6A5413: CkDataMsg::buildNew(int, void const*) (ckcallback.C:473)
==31616==    by 0x6A4367: CkCallback::send(int, void const*) const (ckcallback.C:208)
==31616==    by 0x713F03: CkHandleRdmaCookie(void*) (ckrdma.C:100)
==31616==    by 0x7A47B6: ReleaseSentMessages (machine.c:615)

The latter leak is fixed if I mark the entry method that receives the CkDataMsg RDMA completion callback with the [nokeep] attribute, but the RDMA documentation and examples do not use [nokeep], nor do they delete the CkDataMsg themselves. Running valgrind on examples/charm++/rdma/ shows these same two leaks.
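For reference, a minimal sketch of the two options mentioned above, assuming a hypothetical chare Recv with a completion-callback entry method rdmaDone (illustrative only, not code from the RDMA examples):

// Hypothetical .ci declaration: with [nokeep] the runtime keeps ownership of
// the CkDataMsg and frees it after the entry method returns.
entry [nokeep] void rdmaDone(CkDataMsg *msg);

// Hypothetical C++ body without [nokeep]: the entry method owns the message
// and must delete it itself, otherwise it leaks as reported above.
void Recv::rdmaDone(CkDataMsg *msg) {
  // ... mark the send buffer as reusable, record completion, etc. ...
  delete msg;
}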

#2 Updated by Sam White about 2 years ago

The second leak above also happens on multicore builds... I've opened a separate issue for memory leaks in RDMA: https://charm.cs.illinois.edu/redmine/issues/1540

#3 Updated by Sam White about 2 years ago

  • Priority changed from Normal to High

Still no real diagnosis of the underlying problem here...

==7618== 13770 errors in context 4 of 56:
==7618== Invalid read of size 8
==7618==    at 0x4C2F790: memcpy@@GLIBC_2.14 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7618==    by 0x677EE4: serializeContig(char*, char*, unsigned long, int) (ddt.C:17)
==7618==    by 0x679844: CkDDT_DataType::serialize(char*, char*, int, int) const (ddt.C:549)
==7618==    by 0x65E4BE: ampi::processRdmaMsg(void*, int, int, int, void*, int, int, int) (ampi.C:2855)
==7618==    by 0x631684: IReq::receiveRdma(ampi*, char*, int, int, int, int) (ampi.C:5410)
==7618==    by 0x6287FD: ampi::inorderRdma(char*, int, int, int, int, int, int) (ampi.C:2599)
==7618==    by 0x6285E7: ampi::genericRdma(char*, int, int, int, int, int, int) (ampi.C:2553)
==7618==    by 0x64B2FD: CkIndex_ampi::_call_genericRdma_marshall8(void*, void*) (ampi.def.h:3290)
==7618==    by 0x690D6E: CkDeliverMessageFree (ck.C:593)
==7618==    by 0x6B18BC: CkLocRec::invokeEntry(CkMigratable*, void*, int, bool) (cklocation.C:1969)
==7618==    by 0x6B448D: CkLocMgr::deliverMsg(CkArrayMessage*, CkArrayID, unsigned long, CkArrayIndex const*, CkDeliver_t, int) (cklocation.C:2656)
==7618==    by 0x697849: CkArray::deliver(CkArrayMessage*, CkDeliver_t) (ckarray.h:688)
==7618==  Address 0x6d02710 is 105,120 bytes inside an unallocated block of size 839,152 in arena "client" 

#4 Updated by Sam White about 2 years ago

  • Subject changed from Failure in AMPI RDMA recv path after migration to Failure in AMPI migration in TCharm::pup when unpacking

This is not related to AMPI's use of RDMA at all, and can be reproduced on commits before the AMPI RDMA patch was merged. I need to do a full git bisect to determine the cause... Autobuild failures first showed up on April 28th, but there were a ton of changes merged on the 27th...

#5 Updated by Sam White about 2 years ago

Never mind: the issue seems to be different on mpi-darwin-x86_64, and not being able to use ++debug on MPI builds hurts. It is related to AMPI RDMA on Linux.

#6 Updated by Phil Miller about 2 years ago

The commits in the time window in question that look like they could be the culprit are as follows:

e276540 Bug #1530: move isomalloc_sync warning to memory-isomalloc.
9df6086 AMPI: nicer abort message.
27e8038 AMPI #1163: use disp_unit from window creation
#465ccec charmxi #1527: Add dummy static_cast to impl_e_opts to avoid compiler warning
76aa311 AMPI: fix accumulate routine's ptr arithmetic and eliminate unused arguments
05e5a4b AMPI: support derived datatypes in Alltoall and nbor coll routines
#5d303c2 charmxi #1526: Remove unused parameter created in migration constructor
#fda165f Bug #1537: fixing the bug that breaks shrink/expand
22ac668 AMPI: execute local sends of contiguous data inline using direct memcpy
97ba145 AMPI: fix Sendrecv_replace and assume Sendrecv buffers are not the same
#38debf5 Bug #1528 charmxi: Message type with no variable arrays can't be declared with { }
#e351aed Bug #1529: Easy 'nolb' build-time option to elide LB support
d12a6a7 Bug #1493: Fix the original patch by moving CkLocRec deletion
a0d96d5 AMPI: fix memory leak in tuple redn handling for gathers and non-commutative redns
9cb75c2 AMPI: relax ordering of sends/recvs in Alltoall{v,w} and nbor coll routines
9349919 AMPI #1111: avoid sender-side copy for large contiguous messages using Charm++'s zero copy RDMA send API
#ebb101f Bug #1487 Enable cuda compilation without GPU_MEMPOOL being defined
#79c9795 Bug #1488 Restrict maximum initial pinned memory to 1 GiB
#e90d142 Bug #1506: Fix: examples/hello/4darray wasn't using messages for multicast
f4d1581 Bug #1410: fix memory leak in Tuple reducers consisting of set/concat/custom reducers

$ git log --oneline f4d158^..e27654 | wc -l
      20

I've #'ed out the commits that seem like they can't be related, leaving just 12 suspects. That should be about 4 builds to bisect.

#7 Updated by Sam White about 2 years ago

The culprit is the AMPI RDMA patch (below) on mpi-linux-x86_64:

9349919 AMPI #1111: avoid sender-side copy for large contiguous messages using Charm++'s zero copy RDMA send API

#8 Updated by Sam White about 2 years ago

  • Subject changed from Failure in AMPI migration in TCharm::pup when unpacking to Failure in AMPI migration when using RDMA sends

#9 Updated by Sam White about 2 years ago

I think I've found the problem, though I have no idea why it's happening: RDMA messages are not being forwarded to objects after they have migrated.

Basically, I added a print statement at the very top of ampi::genericRdma (the RDMA entry method where this is failing); after migration, the buffer passed into the method is bad, and the segfault happens in that print statement.

Here's the output from ckarray.C, cklocation.C, and ampi.C for examples/ampi/Cjacobi3D/ for 8 VPs on 3 PEs with RotateLB on an mpi-linux-x86_64 build. The failure happens on VP 6 after it migrates from PE 2 to PE 0:

...
LocMgr on 2: Load balancer wants to migrate 6 to 0
LocMgr on 2: In CkMigratable::pup 6
LocMgr on 2: Removing barrier for element 6
LocMgr on 2: Destroying record for element 6
LocMgr on 2: Unregistering element 6 from load balancer
LocMgr on 2: In CkMigratable::~CkMigratable 6
...
[1] AMPI vp 2 send: tag=1, src=2, comm=9000000 (to 6)      <--- RDMA message send
LocMgr on 1: send 6
...
LocMgr on 0: Adding new record for element 6
LocMgr on 0: Registering element 6 with load balancer
LocMgr on 0: Element 6 migrated in
LocMgr on 0: Constructing element 6 of array
LocMgr on 0:    Invoking entry 178 on element 6
LocMgr on 0: Constructing element 6 of array
LocMgr on 0:    Invoking entry 140 on element 6
LocMgr on 0: Constructing element 6 of array
LocMgr on 0:    Invoking entry 300 on element 6
ArrayBOC on 0:   ArrayElement::pup()
LocMgr on 0: In CkMigratable::pup 6
LocMgr on 0: Registering barrier client for 6
LocMgr on 0: Registering barrier client for 6
LocMgr on 0: Registering barrier client for 6
...
In mpi machine.c on 2, handling a recv'ed RDMA msg 0x2a06860      <--- recv'ed RDMA address on PE 2
...
2 updateLocation on 1
LocMgr on 1: Sending update back to 2 for element 1
...
LocMgr on 0: DummyResumeFromSync called
LocMgr on 0: dummyAtSync called
LocMgr on 0: Element 6 resuming from sync
2 updateLocation on 1
CharmLB> RotateLB: PE [0] step 0 finished at 174.775194 duration 0.059976 s
...
LocMgr on 0: Sending update back to 1 for element 6
LocMgr on 0:    Invoking entry 149 on element 6
[0] in ampi::genericRdma on index 6, size=80000, seq=2, srcRank=2, tag=0, comm=9000000, ssendReq=0, buf 0x2a06900      <--- recv'ed RDMA address on PE 0

------------- Processor 0 Exiting: Caught Signal ------------
Reason: Segmentation fault
[0] Stack Traceback:
[0:0]   [0x7a555f]
[0:1] +0x10330  [0x7ffff6e11330]
[0:2] _ZN4ampi11genericRdmaEPciiiiii+0x8c  [0x625d1e]
[0:3] _ZN12CkIndex_ampi27_call_genericRdma_marshall8EPvS0_+0x158  [0x6482ba]
[0:4] CkDeliverMessageFree+0x53  [0x68cf03]
[0:5] _ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x220  [0x6add30]
[0:6] _ZN8CkLocMgr10deliverMsgEP14CkArrayMessage9CkArrayIDmPK12CkArrayIndex11CkDeliver_ti+0x53c  [0x6b0bca]
[0:7] _ZN7CkArray7deliverEP14CkArrayMessage11CkDeliver_t+0x68  [0x6939de]
[0:8]   [0x68f04e]
[0:9] _Z15_processHandlerPvP11CkCoreState+0x222  [0x68f272]
[0:10] CmiHandleMessage+0x54  [0x7a787b]
[0:11] CsdScheduleForever+0xd0  [0x7a7b45]
[0:12] CsdScheduler+0x16  [0x7a7a53]
[0:13]   [0x7a3ffc]
[0:14] ConverseInit+0x3ac  [0x7a3ee8]
[0:15] main+0x44  [0x67a469]
[0:16] __libc_start_main+0xf5  [0x7ffff6a5af45]

1 updateLocation on 0

Note that the two addresses differ by 160 bytes (0x2a06900 - 0x2a06860 = 0xa0 = 160), which I've verified is the size of the message envelope. So it looks like the message is not being forwarded for some reason.

The failure doesn't show up when using multicore or MPI-SMP (with 1 process) builds, even though the AMPI code path is exactly the same regardless. This makes sense since the RDMA message pointer would still be valid within a process.

I can't see how this could be an AMPI issue, since the buffer is bad before it is ever passed into AMPI, though I can't rule out that AMPI is doing something weird in migration that then trips this up. But until we have any proof that RDMA + migration works, I think we should assume it is not working. I opened issue #1546 to get an example RDMA + migration program.

#10 Updated by Phil Miller about 2 years ago

#11 Updated by Sam White about 2 years ago

  • Subject changed from Failure in AMPI migration when using RDMA sends to Failure in migration when using RDMA sends

#12 Updated by Sam White about 2 years ago

To reproduce this, do './build AMPI mpi-linux-x86_64 -g -O0' then 'make test' in examples/ampi/Cjacobi3D/.

The problem arises when a chare does an RDMA send to another chare, and the receiving chare migrates between the time the rget finishes and the time the entry method is actually invoked on it. In that case the message must be forwarded to the receiver's new PE; the forwarding does happen, but the rdma parameter's pointer still points to the address at which the rget was received on the old PE, not to a valid address on the new one.

#13 Updated by Vipul Harsh about 2 years ago

I looked at the code, and it never actually updates the pointers inside the rdma wrappers in the message, so the rdma pointers are invalid on another PE. I am trying to think of a fix.

#14 Updated by Phil Miller about 2 years ago

If that's all, I think the message just needs to have pack called on it before it gets forwarded, and unpack after it arrives.
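A minimal sketch of that idea; the helpers below are hypothetical (the real forwarding path lives in cklocation.C), and whether the generic CkPackMessage/CkUnpackMessage hooks actually cover the rdma wrappers is exactly the open question here:

// Hypothetical helpers, sketching the suggestion in comment #14 only.
void forwardToNewPe(envelope *env, int newPe) {
  CkPackMessage(&env);                                  // serialize pointers before the hop
  CmiSyncSendAndFree(newPe, env->getTotalsize(), (char *)env);
}

void onForwardedArrival(envelope *env) {
  CkUnpackMessage(&env);                                // rebuild pointers against local memory
  // ... deliver to the migrated array element as usual ...
}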

#15 Updated by Vipul Harsh about 2 years ago

  • Status changed from In Progress to Implemented

#16 Updated by Vipul Harsh about 2 years ago

  • Assignee changed from Sam White to Vipul Harsh

#17 Updated by Phil Miller about 2 years ago

  • Status changed from Implemented to Merged

#18 Updated by Sam White about 2 years ago

  • Status changed from Merged to In Progress

We're still seeing the same failure in autobuild, though now less frequently. See multicore-linux-x86_64 and mpi-linux-x86_64-smp from last night, and multicore-darwin-x86_64 from the night before that.

#20 Updated by Phil Miller about 2 years ago

  • Assignee changed from Vipul Harsh to Nitin Bhat

#21 Updated by Sam White about 2 years ago

Core decided that this bug and support for SDAG should be prioritized over the GNI RDMA implementation.

#22 Updated by Matthias Diener about 2 years ago

Note that the problem occurs only sometimes. On my laptop, running with netlrts-darwin-x86_64-smp, make test in examples/ampi/Cjacobi3D crashes in about 20% of runs. So this problem might be more common than the autobuild statistics indicate.

#23 Updated by Nitin Bhat about 2 years ago

Looking at the core dump from a multicore-linux-x86_64 failure, it looks like the crash doesn't happen through the RDMA code path (it occurs at the delete msg statement at ampi.C:5418).

(gdb) bt
#0 0x00002aaaab746035 in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00002aaaab74979b in __GI_abort () at abort.c:91
#2 0x000000000077e7f5 in charmrun_abort (s=0x815ea8 "CmiFree reference count was zero-- is this a duplicate free?") at machine.c:1034
#3 0x000000000077e24b in LrtsAbort (message=0x815ea8 "CmiFree reference count was zero-- is this a duplicate free?") at machine.c:557
#4 0x000000000077d8cf in CmiAbortHelper (source=0x814b11 "Called CmiAbort", message=0x815ea8 "CmiFree reference count was zero-- is this a duplicate free?",
suggestion=0x0, tellDebugger=1, framesToSkip=0) at machine-common-core.c:1453
#5 0x000000000077d8fe in CmiAbort (message=0x815ea8 "CmiFree reference count was zero-- is this a duplicate free?") at machine-common-core.c:1457
#6 0x00000000007887d6 in CmiFree (blk=0x2aaaacabeeb0) at convcore.c:3007
#7 0x000000000067ac94 in CkFreeMsg (msg=0x2aaaacabef00) at msgalloc.C:51
#8 0x0000000000624598 in CMessage_AmpiMsg::dealloc (p=0x2aaaacabef00) at ampi.def.h:1117
#9 0x00000000006368a3 in CMessage_AmpiMsg::operator delete (p=0x2aaaacabef00) at ampi.decl.h:22
#10 0x0000000000612373 in IReq::receive (this=0x2aaaac18d530, ptr=0x2aaaad59f010, msg=0x2aaaacabef00) at ampi.C:5418
#11 0x0000000000614766 in ampi::irecv (this=0x2aaaad59f010, buf=0x2aaaaf25ea68, count=10000, type=0, src=6, tag=3, comm=9000000, request=0x471dfadfffcc) at ampi.C:6011
#12 0x00000000006148c5 in AMPI_Irecv (buf=0x2aaaaf25ea68, count=10000, type=0, src=6, tag=3, comm=9000000, request=0x471dfadfffcc) at ampi.C:6055
#13 0x00000000005b3f1e in AMPI_Main_cpp(int, char**) ()
#14 0x000000000060121b in AMPI_Fallback_Main (argc=5, argv=0x2aaaac08b0c0) at ampi.C:833
#15 0x000000000063c760 in MPI_threadstart_t::start (this=0x471dfae000d0) at ampi.C:1040
#16 0x000000000060185f in AMPI_threadstart (data=0x2aaaac16e6f0) at ampi.C:1060
#17 0x00000000005b4f3a in startTCharmThread (msg=0x2aaaac16e6d0) at tcharm.C:175
#18 0x000000000077a528 in CthStartThread (fn1=0, fn2=5984007, arg1=10922, arg2=2887182032) at libthreads-default.c:1688
#19 0x00002aaaab758200 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#20 0x0000000000000000 in ?? ()

#24 Updated by Sam White about 2 years ago

RDMA sends are being used; the failure here just doesn't look like RDMA because the message arrived before the matching recv was posted. That means AMPI had to copy the rdma entry method parameter into a message and hold it until the recv was posted; once it was, AMPI could copy the data into the recv buffer and finally delete the message. AFAIK this error doesn't happen if you build AMPI with -DAMPI_RDMA_IMPL=0.
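A minimal, self-contained sketch of that buffering scheme; the table and function names here are illustrative, not AMPI's actual AmmTable/IReq code:

#include <cstring>
#include <cstddef>
#include <map>
#include <vector>

// Illustrative unexpected-message buffering: a payload that arrives before
// the matching recv is posted is copied into a table; when the recv is
// posted it is matched, copied into the user buffer, and the buffered copy
// is destroyed exactly once.
static std::map<int, std::vector<char> > unexpectedMsgs;  // keyed by tag (simplified)

void onEarlyArrival(int tag, const char *payload, std::size_t len) {
  unexpectedMsgs[tag].assign(payload, payload + len);     // keep a private copy
}

bool onRecvPosted(int tag, char *userBuf, std::size_t userLen) {
  auto it = unexpectedMsgs.find(tag);
  if (it == unexpectedMsgs.end()) return false;           // nothing buffered yet
  std::size_t n = it->second.size() < userLen ? it->second.size() : userLen;
  std::memcpy(userBuf, it->second.data(), n);             // copy into the recv buffer
  unexpectedMsgs.erase(it);                               // single owner: freed exactly once
  return true;
}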

#25 Updated by Nitin Bhat about 2 years ago

The AMPI RDMA bug (examples/ampi/Cjacobi3D) occurs intermittently via the following sequence of steps:

1. Migration occurs and the AMPI chare on which the bug occurs (chare a) migrates to a new PE.
2. Another AMPI chare (chare b) invokes the remote entry method genericRdma on chare a.
3. genericRdma is invoked on chare a (on its new PE), and inside it the AmpiMsg is added to the AmmTable `msgs` (used for storing out-of-order messages).
4. Soon after, the other in-order messages are processed, which makes this out-of-order message in-order, so it is picked up: `IReq::receive` is called for it, and it is processed and deleted.
5. Another msg with the same address as the one just deleted shows up later in the AmmTable (this doesn't happen through AmmPut, though).
6. This msg is picked up by `ampi::irecv`, which calls `IReq::receive`, and the delete there causes the double-free error.

One more observation: the double-free error occurs for the out-of-order message (it is the only out-of-order message in the entire run).

The fix for the bug seems to be removing the `AmmPut` for out-of-order messages, i.e. reverting the patch https://charm.cs.illinois.edu/gerrit/#/c/2509/.
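A simplified illustration of the ownership problem that fix addresses (Msg and the table below are stand-ins, not AMPI's actual AmpiMsg/AmmTable code): once a message is on the path that hands it to IReq::receive, which deletes it, no table entry may still alias it.

#include <vector>

struct Msg { int tag; };

// Stand-in for the double delete described in the steps above.
void processOutOfOrder(Msg *msg, std::vector<Msg *> &table, bool alsoPutInTable) {
  if (alsoPutInTable)
    table.push_back(msg);  // BUG: the table now aliases a message that...
  delete msg;              // ...gets deleted here (as IReq::receive does).
                           // A later recv matching the stale table entry
                           // deletes it again: "CmiFree reference count was zero".
}
// Dropping the extra AmmPut gives the message a single owner on this path.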

With the fix, I successfully tested the Cjacobi3D example with regular queues and randomized queues, ~50 runs in a loop, on the following builds:
mpi-linux-x86_64,
mpi-linux-x86_64-smp,
multicore-linux-x86_64,
netlrts-linux-x86_64-smp,
mpi-gnicrayxc

#26 Updated by Phil Miller about 2 years ago

  • Tags set to ampi, migration, rdma
  • Subject changed from Failure in migration when using RDMA sends to Failure in migration when using RDMA sends in AMPI

#27 Updated by Phil Miller about 2 years ago

  • Status changed from In Progress to Implemented

#28 Updated by Phil Miller about 2 years ago

  • Status changed from Implemented to Merged
