Bug #1540

Memory leaks in RDMA

Added by Sam White about 2 years ago. Updated about 2 years ago.

Status: Merged
Priority: High
Assignee:
Category: -
Target version:
Start date: 04/28/2017
Due date:
% Done: 100%
Spent time:

Description

Running valgrind on examples/charm++/rdma/ shows two memory leaks on at least the mpi-linux-x86_64 and multicore-linux-x86_64 build targets (I haven't tried verbs or pami).

==31616== 14,976 (11,088 direct, 3,888 indirect) bytes in 154 blocks are definitely lost in loss record 556 of 575
==31616==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31616==    by 0x71420A: CkRdmaIssueRgets(envelope*) (ckrdma.C:182)
==31616==    by 0x68F7A2: _processHandler(void*, CkCoreState*) (ck.C:1197)
==31616==    by 0x7A71A2: CmiHandleMessage (convcore.c:1663)
==31616==    by 0x7A746C: CsdScheduleForever (convcore.c:1900)
==31616==    by 0x7A737A: CsdScheduler (convcore.c:1836)
==31616==    by 0x7A3967: ConverseRunPE (machine-common-core.c:1297)
==31616==    by 0x7A3853: ConverseInit (machine-common-core.c:1198)
==31616==    by 0x67AB36: main (main.C:18)
==31616== 28,224 bytes in 196 blocks are definitely lost in loss record 567 of 575
==31616==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31616==    by 0x79F994: malloc_nomigrate (libmemory-default.c:724)
==31616==    by 0x7A9BBE: CmiAlloc (convcore.c:2930)
==31616==    by 0x683F91: envelope::alloc(unsigned char, unsigned int, unsigned short) (envelope.h:312)
==31616==    by 0x68430D: _allocEnv(int, int, int) (envelope.h:485)
==31616==    by 0x6990E8: CkAllocMsg (msgalloc.C:21)
==31616==    by 0x6A59FC: CMessage_CkDataMsg::alloc(int, unsigned long, int*, int) (CkCallback.def.h:238)
==31616==    by 0x6A589A: CMessage_CkDataMsg::operator new(unsigned long, int*, int) (CkCallback.def.h:220)
==31616==    by 0x6A5413: CkDataMsg::buildNew(int, void const*) (ckcallback.C:473)
==31616==    by 0x6A4367: CkCallback::send(int, void const*) const (ckcallback.C:208)
==31616==    by 0x713F03: CkHandleRdmaCookie(void*) (ckrdma.C:100)
==31616==    by 0x7A47B6: ReleaseSentMessages (machine.c:615)

The latter leak is fixed if I mark the entry method that receives the CkDataMsg RDMA completion callback with the [nokeep] attribute, but the RDMA documentation and examples do not use [nokeep] and they don't delete the CkDataMsg itself.

History

#1 Updated by Sam White about 2 years ago

The first one above doesn't happen on multicore because multicore doesn't actually ever call CkRdmaIssueRgets...

#2 Updated by Sam White about 2 years ago

  • Status changed from New to In Progress

Fix for the leak in the MPI layer's RDMA implementation: https://charm.cs.illinois.edu/gerrit/#/c/2461/

Can Nitin/Vipul please respond here about the callback entry method and what exactly the semantics of that are? Should users always mark it [nokeep]?

#3 Updated by Nitin Bhat about 2 years ago

We didn't really work out the semantics of the callback function. But I think it would be better to have the Charm runtime deallocate the memory and have the user mark the callback as nokeep. Right?

#4 Updated by Phil Miller about 2 years ago

This is just delivering a message, like any other, right? If so, the user can choose to manually delete the message, store it somewhere for later use, mark the method [nokeep] and let the RTS handle it, etc.
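The three disposal options listed above can be sketched with a stand-in message type. This is not Charm++ code: `MockDataMsg`, `Receiver`, the handler functions, and the `deliver*` helpers are all hypothetical names invented for illustration, and the "runtime" here is just a plain function that frees the message after the handler returns. It only models who is responsible for the delete in each case.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a runtime-delivered message such as CkDataMsg.
// The counter tracks live messages so a missing delete is observable.
static int liveMessages = 0;

struct MockDataMsg {
    int payload;
    explicit MockDataMsg(int p) : payload(p) { ++liveMessages; }
    ~MockDataMsg() { --liveMessages; }
};

// Option 1: the handler owns the message and deletes it when done.
void handlerDeletes(MockDataMsg* msg) {
    assert(msg != nullptr);
    // ... use msg->payload here ...
    delete msg;
}

// Option 2: the handler stores the message for later use; the holder
// frees it when it is no longer needed (here, when Receiver is destroyed).
struct Receiver {
    std::unique_ptr<MockDataMsg> saved;
    void handlerKeeps(MockDataMsg* msg) { saved.reset(msg); }
};

// Option 3: models a [nokeep]-style handler. It may read the message but
// must not delete or retain it; the caller frees it afterwards.
int handlerNoKeep(const MockDataMsg* msg) {
    assert(msg != nullptr);
    return msg->payload;
}

// Hypothetical delivery helpers standing in for the runtime.
void deliverOwned(void (*handler)(MockDataMsg*), int payload) {
    handler(new MockDataMsg(payload));  // ownership passes to the handler
}

int deliverNoKeep(int (*handler)(const MockDataMsg*), int payload) {
    MockDataMsg* msg = new MockDataMsg(payload);
    int result = handler(msg);
    delete msg;                         // the "runtime" keeps ownership
    return result;
}
```

A handler that takes an owned message and does none of these three things reproduces the second leak in the report: every delivery leaves one message allocated.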

#5 Updated by Phil Miller about 2 years ago

The documentation and examples may need to be more explicit about disposing of the message, though.

#6 Updated by Sam White about 2 years ago

Yeah, it should be a regular Charm message but the documentation says "It should be noted that the received CkDataMsg should not be deallocated by the user as it belongs to the Charm++ Runtime. It will be deallocated by the runtime after the execution of the entry method."

#7 Updated by Sam White about 2 years ago

  • Assignee changed from Nitin Bhat to Sam White
  • Status changed from In Progress to Implemented

Updated documentation & example program: https://charm.cs.illinois.edu/gerrit/#/c/2469/

Update AMPI RDMA use: https://charm.cs.illinois.edu/gerrit/#/c/2466/

#8 Updated by Sam White about 2 years ago

  • Status changed from Implemented to In Progress

Nitin will investigate whether the leak I fixed in the MPI layer also occurs in the PAMILRTS and Verbs layers.

#9 Updated by Nitin Bhat about 2 years ago

  • Assignee changed from Sam White to Nitin Bhat

#10 Updated by Nitin Bhat about 2 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Implemented

Fix for leaks in the machine layer implementations (PAMI, Verbs, MPI): https://charm.cs.illinois.edu/gerrit/#/c/2461/

In the machine layer code, the first leak was due to the unfreed allocation of the machine-specific information associated with the receiver in CkIssueRgets on obtaining the metadata msg. This leak was fixed by freeing this allocation in the machine layer on completing the RDMA operations.

The second leak was due to the unfreed allocation of the acknowledgement struct CmiRdmaAck associated with the sender in CmiSetRdmaAck. This leak was fixed by freeing this allocation in the machine layer after calling the corresponding Charm layer ack handler method on receiving the ack message.
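The shape of the two fixes described above can be sketched roughly as follows. This is a simplified model, not the actual machine layer code: `MockRecvInfo`, `MockRdmaAck`, `onMetadataMsg`, `onOpComplete`, `setAck`, `onAckMsg`, and the tracked allocator are hypothetical names, and the real structures and call sites in the PAMI, Verbs, and MPI layers differ. The point is only where the free happens relative to completion and to the ack handler call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Allocation counter so an unfreed struct shows up as a nonzero count.
static int outstandingAllocs = 0;

void* trackedAlloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p != nullptr) ++outstandingAllocs;
    return p;
}

void trackedFree(void* p) {
    if (p != nullptr) { --outstandingAllocs; std::free(p); }
}

// Hypothetical receiver-side, machine-specific info allocated when the
// metadata message arrives.
struct MockRecvInfo { int pendingOps; };

MockRecvInfo* onMetadataMsg(int numOps) {
    MockRecvInfo* info =
        static_cast<MockRecvInfo*>(trackedAlloc(sizeof(MockRecvInfo)));
    assert(info != nullptr);
    info->pendingOps = numOps;
    return info;
}

// First fix: free the receiver info once the last RDMA operation completes.
// Returns true when the info was freed.
bool onOpComplete(MockRecvInfo* info) {
    assert(info != nullptr && info->pendingOps > 0);
    if (--info->pendingOps == 0) {
        trackedFree(info);
        return true;
    }
    return false;
}

// Hypothetical sender-side ack struct, playing the role of CmiRdmaAck.
typedef void (*AckHandler)(void* token);
struct MockRdmaAck { AckHandler handler; void* token; };

MockRdmaAck* setAck(AckHandler handler, void* token) {
    MockRdmaAck* ack =
        static_cast<MockRdmaAck*>(trackedAlloc(sizeof(MockRdmaAck)));
    assert(ack != nullptr);
    ack->handler = handler;
    ack->token = token;
    return ack;
}

// Second fix: call the upper-layer ack handler first, then free the struct.
void onAckMsg(MockRdmaAck* ack) {
    assert(ack != nullptr);
    ack->handler(ack->token);
    trackedFree(ack);
}

// Example ack handler: counts how many acks were delivered.
void countAck(void* token) { ++*static_cast<int*>(token); }
```

Freeing after the handler call matters in this sketch because the handler reads its token through the ack struct; freeing earlier would turn the leak into a use-after-free.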

#11 Updated by Phil Miller about 2 years ago

  • Tags changed from #rdma to #rdma, memory-leak

Is that fix addressing all of the known leaks? If so, this can be marked Merged.

#12 Updated by Nitin Bhat about 2 years ago

That fixes the memory leaks in the lower layer implementations.

In the rdma example too, there was a memory leak in the callback entry method as the CkDataMsg provided wasn't being deleted. The patch for that is here: https://charm.cs.illinois.edu/gerrit/#/c/2469/. That isn't merged yet.

#13 Updated by Phil Miller about 2 years ago

  • Status changed from Implemented to Merged
