Bug #1307

AMPI_Comm_free should free the ampi instance

Added by Sam White over 2 years ago. Updated over 1 year ago.

Status: Merged
Priority: Normal
Assignee:
Category: AMPI
Target version:
Start date: 11/20/2016
Due date:
% Done: 0%


Description

Right now, AMPI_Comm_free is a no-op, so the ampi instance behind a freed communicator is never reclaimed. That can inflate the memory usage of applications that create and destroy communicators over time. We haven't seen that use case yet, but our pending implementation of MPI_Comm_create_group does exactly that.


Related issues

Related to Charm++ - Bug #1312: Deleting an array disables reclamation for all arrays bound to that location manager (Merged, 11/27/2016)

History

#1 Updated by Sam White over 2 years ago

Also, this is important for AMPI users of MPI_Comm_split_type, because those communicators can be invalidated by migration.

#2 Updated by Sam White over 2 years ago

  • Status changed from New to In Progress

The AMPI-level implementation is here, but it needs an underlying fix in Charm++'s location manager to support bound array deletions: https://charm.cs.illinois.edu/gerrit/#/c/2000/

#3 Updated by Phil Miller over 2 years ago

  • Subject changed from MPI_Comm_free should free the ampi instance to AMPI_Comm_free should free the ampi instance

#4 Updated by Sam White over 2 years ago

Here's the failure from running tests/ampi/megampi on the above patch:

Starting program: /dcsdata/home/swhite/tmp/charm/netlrts-linux-x86_64/tests/ampi/megampi/pgm +vp2
Charm++: standalone mode (not using charmrun)
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.7.0-453-g9b89026
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
[0] RandCentLB created
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: cannot compare two indices of different cardinality
[0] Stack Traceback:
  [0:0] CmiAbortHelper+0xb3  [0x673965]
  [0:1] CmiAbort+0x2d  [0x6739a0]
  [0:2] _ZltRK12CkArrayIndexS1_+0x2e  [0x5bce24]
  [0:3] _ZNKSt4lessI12CkArrayIndexEclERKS0_S3_+0x27  [0x5bf1a3]
  [0:4] _ZNSt8_Rb_treeI12CkArrayIndexSt4pairIKS0_mESt10_Select1stIS3_ESt4lessIS0_ESaIS3_EE14_M_lower_boundEPSt13_Rb_tree_nodeIS3_ESC_RS2_+0x3c  [0x5bf206]
  [0:5] _ZNSt8_Rb_treeI12CkArrayIndexSt4pairIKS0_mESt10_Select1stIS3_ESt4lessIS0_ESaIS3_EE4findERS2_+0x45  [0x5bed71]
  [0:6] _ZNSt3mapI12CkArrayIndexmSt4lessIS0_ESaISt4pairIKS0_mEEE4findERS4_+0x23  [0x5be159]
  [0:7] _ZN7CkArray6lookupERK12CkArrayIndex+0x32  [0x5bcf3a]
  [0:8] _ZNK23CProxyElement_ArrayBase7ckLocalEv+0x2c  [0x5ee1b4]
  [0:9] _ZNK26CProxyElement_ArrayElement7ckLocalEv+0x18  [0x53f064]
  [0:10] _ZNK20CProxyElement_TCharm7ckLocalEv+0x18  [0x53f86a]
  [0:11] _ZN10ampiParent10prepareCtvEv+0x3d  [0x54f167]
  [0:12] _ZN10ampiParentC1Ei13CProxy_TCharm+0x201  [0x54e629]
  [0:13] _ZN18CkIndex_ampiParent26_call_ampiParent_marshall1EPvS0_+0xec  [0x569f18]
  [0:14] CkDeliverMessageFree+0x4e  [0x5ab4ad]
  [0:15] _ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x1bb  [0x5c877b]
  [0:16] _ZN8CkLocMgr15addElementToRecEP8CkLocRecP7CkArrayP12CkMigratableiPv+0x116  [0x5ca2a0]
  [0:17] _ZN8CkLocMgr10addElementE9CkArrayIDRK12CkArrayIndexP12CkMigratableiPv+0xf1  [0x5ca169]
  [0:18] _ZN7CkArray13insertElementEP14CkArrayMessageRK12CkArrayIndexPi+0x19b  [0x5ef627]
  [0:19] _ZN7CkArray13insertInitialERK12CkArrayIndexPv+0x4b  [0x5efb1d]
  [0:20] _ZN10CkArrayMap15populateInitialEiR14CkArrayOptionsPvP7CkArray+0x308  [0x5c67a6]
  [0:21] _ZN8CkLocMgr15populateInitialER14CkArrayOptionsPvP7CkArray+0x52  [0x5f7cf8]
  [0:22] _ZN7CkArrayC1ER14CkArrayOptionsR19CkMarshalledMessage10_ckGroupID+0x32e  [0x5eec2c]
  [0:23] _ZN15CkIndex_CkArray23_call_CkArray_marshall1EPvS0_+0x103  [0x5f24f7]
  [0:24] CkDeliverMessageFree+0x4e  [0x5ab4ad]
  [0:25]   [0x5ab5f3]
  [0:26] CkCreateLocalGroup+0x204  [0x5abd3f]
  [0:27] _Z12_createGroup10_ckGroupIDP8envelope+0x174  [0x5ac0f2]
  [0:28]   [0x5ac2f3]
  [0:29] CkCreateGroup+0xec  [0x5ac4d9]
  [0:30] _ZN14CProxy_CkArray5ckNewERK14CkArrayOptionsRK19CkMarshalledMessageRK10_ckGroupIDPK14CkEntryOptions+0x185  [0x5f2175]
  [0:31]   [0x5edbb4]
  [0:32] _ZN16CProxy_ArrayBase13ckCreateArrayEP14CkArrayMessageiRK14CkArrayOptions+0x57  [0x5edc8b]
  [0:33] _ZN19CProxy_ArrayElement13ckCreateArrayEP14CkArrayMessageiRK14CkArrayOptions+0x28  [0x53f19a]
  [0:34] _ZN17CProxy_ampiParent13ckCreateArrayEP14CkArrayMessageiRK14CkArrayOptions+0x28  [0x579350]
  [0:35] _ZN17CProxy_ampiParent5ckNewEiRK13CProxy_TCharmRK14CkArrayOptionsPK14CkEntryOptions+0x111  [0x569721]
  [0:36]   [0x54db63]
  [0:37] AMPI_Init+0x60  [0x55613a]
  [0:38] _Z13AMPI_Main_cppiPPc+0x39  [0x538656]
  [0:39] AMPI_Fallback_Main+0x25  [0x54d2fa]
  [0:40] _ZN17MPI_threadstart_t5startEv+0x5c  [0x57e104]
  [0:41] AMPI_threadstart+0x37  [0x54d5d1]
  [0:42]   [0x53917f]
  [0:43] CthStartThread+0x59  [0x671285]
  [0:44] +0x49800  [0x7ffff723e800]

#5 Updated by Sam White over 2 years ago

  • Status changed from In Progress to Implemented

The AMPI-level changes are implemented; this is now blocked only on the Charm++-level issue #1312.

#6 Updated by Sam White over 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

#7 Updated by Sam White over 2 years ago

We can at least clear all of the ampi instance's heap memory from AMPI_Comm_free, even if we can't delete the chare array elements themselves: https://charm.cs.illinois.edu/gerrit/#/c/2192/
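To illustrate the idea of reclaiming heap memory while the object itself stays alive, here is a minimal sketch. `CommInstance`, its members, and `releaseHeap` are hypothetical names made up for this example; they are not AMPI's actual classes, and the real patch may work differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-communicator object that the runtime
// cannot delete yet, but whose heap storage is no longer needed.
struct CommInstance {
    std::vector<int> ranks;           // rank list for the communicator
    std::vector<char> messageBuffer;  // scratch space for messages
    bool freed = false;               // set once the user has freed the comm

    // Release heap storage but leave the object itself in place.
    void releaseHeap() {
        // clear() alone would keep the allocation. Swapping with an empty
        // temporary hands the storage to the temporary, which frees it.
        std::vector<int>().swap(ranks);
        std::vector<char>().swap(messageBuffer);
        assert(ranks.empty() && messageBuffer.empty());
        freed = true;
    }

    // Bytes currently reserved on the heap by this object's containers.
    std::size_t heapBytes() const {
        return ranks.capacity() * sizeof(int) + messageBuffer.capacity();
    }
};
```

The `freed` flag is there so that later accesses through a stale handle could be detected and rejected, since the emptied object still exists.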

#8 Updated by Sam White over 2 years ago

  • Status changed from Implemented to In Progress

We need to reference count communicators because a user could create an intercommunicator out of two comms and then free both of the original comms, and the intercomm should still function as expected.
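A rough sketch of the reference-counting scheme described above, under stated assumptions: `CommTable`, `CommRecord`, and `CommHandle` are hypothetical names for illustration only and do not reflect how AMPI stores communicators. Each communicator starts with one reference for the user's handle, and an intercommunicator takes an extra reference on each comm it was built from.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical types for illustration; not AMPI's actual data structures.
using CommHandle = int;

struct CommRecord {
    int refCount = 0;         // user handle plus dependent communicators
    CommHandle parentA = -1;  // for an intercomm: the comms it was built from
    CommHandle parentB = -1;
};

class CommTable {
public:
    // Create a plain communicator; the user's handle counts as one reference.
    CommHandle create() {
        CommHandle h = next_++;
        table_[h].refCount = 1;
        return h;
    }

    // Create an intercommunicator that keeps both parents alive.
    CommHandle createInter(CommHandle a, CommHandle b) {
        CommRecord &ra = table_.at(a);  // throws std::out_of_range if unknown
        CommRecord &rb = table_.at(b);
        ++ra.refCount;
        ++rb.refCount;
        CommHandle h = create();
        table_.at(h).parentA = a;
        table_.at(h).parentB = b;
        return h;
    }

    // User-visible free: drop only the user's reference.
    void userFree(CommHandle h) { release(h); }

    // True while the communicator's state is still stored.
    bool stateAlive(CommHandle h) const { return table_.count(h) != 0; }
    std::size_t liveCount() const { return table_.size(); }

private:
    void release(CommHandle h) {
        CommRecord &rec = table_.at(h);
        assert(rec.refCount > 0);
        if (--rec.refCount > 0) return;  // someone still depends on it
        CommHandle a = rec.parentA, b = rec.parentB;
        table_.erase(h);                 // last reference gone: reclaim state
        if (a >= 0) release(a);          // then drop our hold on the parents
        if (b >= 0) release(b);
    }

    std::map<CommHandle, CommRecord> table_;
    CommHandle next_ = 0;
};
```

With this scheme, freeing both original comms leaves their state in place until the intercomm is freed too, at which point all three records are reclaimed.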

#9 Updated by Sam White almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#10 Updated by Sam White over 1 year ago

The basic fix was merged: https://charm.cs.illinois.edu/gerrit/#/c/2000/

We still don't support reference counting for nonblocking communication calls, so if a user sends a message and then frees the comm before the message is received, we'll abort.
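For context, the MPI standard specifies that MPI_Comm_free only marks the communicator for deallocation and that pending operations using it complete normally. One way to get that behavior is to let each in-flight message hold a reference to the communicator's state. The sketch below shows the idea with `std::shared_ptr`; `Runtime`, `CommState`, and `InFlightMsg` are hypothetical names for this example, not AMPI's implementation.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for per-communicator state.
struct CommState {
    int id;
    int delivered = 0;  // number of messages delivered on this comm
    explicit CommState(int i) : id(i) {}
};

// A message that has been sent but not yet received.
struct InFlightMsg {
    std::shared_ptr<CommState> comm;  // keeps the state alive until delivery
    std::string payload;
};

class Runtime {
public:
    // The returned pointer plays the role of the user's communicator handle.
    std::shared_ptr<CommState> createComm(int id) {
        return std::make_shared<CommState>(id);
    }

    // Nonblocking send: queue the message along with a reference to the comm.
    void isend(const std::shared_ptr<CommState> &comm, std::string payload) {
        pending_.push(InFlightMsg{comm, std::move(payload)});
    }

    // Deliver one queued message; returns false if nothing is pending.
    bool deliverOne() {
        if (pending_.empty()) return false;
        InFlightMsg msg = std::move(pending_.front());
        pending_.pop();
        assert(msg.comm);       // state is still valid even if the user freed it
        msg.comm->delivered++;
        return true;            // msg is destroyed here, dropping its reference
    }

private:
    std::queue<InFlightMsg> pending_;
};
```

In this sketch, resetting the user's pointer after `isend` corresponds to freeing the comm early: the state survives until `deliverOne` has processed the last message that refers to it.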

#11 Updated by Sam White over 1 year ago

  • Status changed from In Progress to Merged
