Memory leaks in RDMA
Running valgrind on examples/charm++/rdma/ reports two memory leaks on at least the mpi-linux-x86_64 and multicore-linux-x86_64 build targets (I haven't tried verbs or pami).
==31616== 14,976 (11,088 direct, 3,888 indirect) bytes in 154 blocks are definitely lost in loss record 556 of 575
==31616==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31616==    by 0x71420A: CkRdmaIssueRgets(envelope*) (ckrdma.C:182)
==31616==    by 0x68F7A2: _processHandler(void*, CkCoreState*) (ck.C:1197)
==31616==    by 0x7A71A2: CmiHandleMessage (convcore.c:1663)
==31616==    by 0x7A746C: CsdScheduleForever (convcore.c:1900)
==31616==    by 0x7A737A: CsdScheduler (convcore.c:1836)
==31616==    by 0x7A3967: ConverseRunPE (machine-common-core.c:1297)
==31616==    by 0x7A3853: ConverseInit (machine-common-core.c:1198)
==31616==    by 0x67AB36: main (main.C:18)

==31616== 28,224 bytes in 196 blocks are definitely lost in loss record 567 of 575
==31616==    at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==31616==    by 0x79F994: malloc_nomigrate (libmemory-default.c:724)
==31616==    by 0x7A9BBE: CmiAlloc (convcore.c:2930)
==31616==    by 0x683F91: envelope::alloc(unsigned char, unsigned int, unsigned short) (envelope.h:312)
==31616==    by 0x68430D: _allocEnv(int, int, int) (envelope.h:485)
==31616==    by 0x6990E8: CkAllocMsg (msgalloc.C:21)
==31616==    by 0x6A59FC: CMessage_CkDataMsg::alloc(int, unsigned long, int*, int) (CkCallback.def.h:238)
==31616==    by 0x6A589A: CMessage_CkDataMsg::operator new(unsigned long, int*, int) (CkCallback.def.h:220)
==31616==    by 0x6A5413: CkDataMsg::buildNew(int, void const*) (ckcallback.C:473)
==31616==    by 0x6A4367: CkCallback::send(int, void const*) const (ckcallback.C:208)
==31616==    by 0x713F03: CkHandleRdmaCookie(void*) (ckrdma.C:100)
==31616==    by 0x7A47B6: ReleaseSentMessages (machine.c:615)
The latter leak goes away if I mark the entry method that receives the CkDataMsg RDMA completion callback with the [nokeep] attribute, but the RDMA documentation and examples neither use [nokeep] nor delete the CkDataMsg themselves.
#2 Updated by Sam White about 2 years ago
- Status changed from New to In Progress
Fix for the leak in the MPI layer's RDMA implementation: https://charm.cs.illinois.edu/gerrit/#/c/2461/
Can Nitin/Vipul please respond here about the callback entry method and what exactly the semantics of that are? Should users always mark it [nokeep]?
#6 Updated by Sam White about 2 years ago
Yeah, it should be a regular Charm message but the documentation says "It should be noted that the received CkDataMsg should not be deallocated by the user as it belongs to the Charm++ Runtime. It will be deallocated by the runtime after the execution of the entry method."
#10 Updated by Nitin Bhat about 2 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Implemented
Fix for leaks in the machine layer implementations (PAMI, Verbs, MPI): https://charm.cs.illinois.edu/gerrit/#/c/2461/
In the machine layer code, the first leak was caused by never freeing the machine-specific information that is allocated for the receiver in CkRdmaIssueRgets when the metadata message is obtained. This allocation is now freed in the machine layer once the RDMA operations complete.
The second leak was caused by never freeing the acknowledgement struct CmiRdmaAck that is allocated for the sender in CmiSetRdmaAck. This allocation is now freed in the machine layer after the corresponding Charm-layer ack handler has been called on receipt of the ack message.
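For anyone following along, here is a rough sketch of the allocate-on-issue, free-on-completion pattern described above. It is plain C++ and does not use the Charm++ sources: RecvInfo, AckInfo, issueGets, onGetComplete, setAck, onAckReceived and simulate are all hypothetical names made up for illustration, and the real machine-layer code in the patch may be structured quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-ins for illustration only; NOT the real Charm++ types.
struct RecvInfo { int pendingGets; };  // machine-specific info kept per receive
struct AckInfo {                       // sender-side acknowledgement record
  void (*handler)(void*);
  void* token;
};

static int liveBlocks = 0;  // blocks allocated through trackedAlloc, not yet freed

static void* trackedAlloc(std::size_t bytes) {
  void* p = std::malloc(bytes);
  assert(p != nullptr);
  ++liveBlocks;
  return p;
}

static void trackedFree(void* p) {
  std::free(p);
  --liveBlocks;
}

// Stand-in for the upper-layer ack handler: just counts invocations.
static void ackHandler(void* token) { ++*static_cast<int*>(token); }

// Receiver: allocate the bookkeeping when the metadata arrives and gets are issued.
RecvInfo* issueGets(int numBuffers) {
  assert(numBuffers > 0);
  RecvInfo* info = static_cast<RecvInfo*>(trackedAlloc(sizeof(RecvInfo)));
  info->pendingGets = numBuffers;
  return info;
}

// Receiver: called once per completed get; the last completion frees the block.
void onGetComplete(RecvInfo* info) {
  if (--info->pendingGets == 0) trackedFree(info);
}

// Sender: allocate the ack record when the send is set up.
AckInfo* setAck(void (*handler)(void*), void* token) {
  AckInfo* ack = static_cast<AckInfo*>(trackedAlloc(sizeof(AckInfo)));
  ack->handler = handler;
  ack->token = token;
  return ack;
}

// Sender: run the upper-layer handler first, then release the record.
void onAckReceived(AckInfo* ack) {
  ack->handler(ack->token);
  trackedFree(ack);
}

// Runs `transfers` send/receive pairs and returns the number of blocks still live.
int simulate(int transfers, int buffersPerTransfer) {
  int acksHandled = 0;
  for (int t = 0; t < transfers; ++t) {
    AckInfo* ack = setAck(ackHandler, &acksHandled);
    RecvInfo* info = issueGets(buffersPerTransfer);
    for (int b = 0; b < buffersPerTransfer; ++b) onGetComplete(info);
    onAckReceived(ack);
  }
  assert(acksHandled == transfers);
  return liveBlocks;
}
```

With both frees in place, simulate returns 0 for any positive buffer count. Dropping either trackedFree call would leave one block live per transfer, which is the shape of the per-message losses valgrind reported.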
#12 Updated by Nitin Bhat about 2 years ago
That fixes the memory leaks in the lower layer implementations.
The rdma example itself also leaked memory in the callback entry method, because the CkDataMsg passed to it was never deleted. The patch for that is here: https://charm.cs.illinois.edu/gerrit/#/c/2469/. That isn't merged yet.
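To make the ownership question concrete, here is a small stand-alone model of the two ways the callback message can be released: the entry method deletes it, or the runtime frees it after the entry method returns (which is how I understand [nokeep] to behave). This is plain C++ with made-up names (DataMsg, Entry, leakyEntry, deletingEntry, leakedAfter); it is not the Charm++ API and not the code from the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for a callback message; counts instances still alive.
struct DataMsg {
  static int live;
  DataMsg() { ++live; }
  ~DataMsg() { --live; }
};
int DataMsg::live = 0;

using Entry = void (*)(DataMsg*);

// Entry method that ignores ownership: the message is never released.
void leakyEntry(DataMsg*) {}

// Entry method that takes ownership and deletes the message when done.
void deletingEntry(DataMsg* msg) { delete msg; }

// Delivers `callbacks` messages to `entry` and returns how many were leaked.
// When `nokeep` is true the "runtime" frees the message after the entry
// returns, so the entry must NOT delete it (that would be a double free).
int leakedAfter(int callbacks, Entry entry, bool nokeep) {
  assert(callbacks >= 0);
  const int before = DataMsg::live;
  for (int i = 0; i < callbacks; ++i) {
    DataMsg* msg = new DataMsg;
    entry(msg);
    if (nokeep) delete msg;
  }
  return DataMsg::live - before;
}
```

In this model, leakedAfter(196, leakyEntry, false) leaks all 196 messages, while either deleting in the entry method or letting the runtime free the message brings the count to zero. The two fixes must not be combined on the same entry method.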