
Bug #2030

tests/ampi/megampi crashes in MPI_Comm_free

Added by Evan Ramos 7 months ago. Updated about 2 months ago.

Status: In Progress
Priority: Normal
Assignee: Evan Ramos
Category: -
Target version: 6.10.0
Start date: 11/19/2018
Due date: -
% Done: 0%
Tags: ampi

Description

This failure shows up in autobuild every few days.

../../../bin/testrun  ./pgm +p2 +vp4  

Running on 2 processors:  ./pgm +vp4 
charmrun> /cygdrive/c/Program Files/Microsoft MPI/Bin/mpiexec -n 2  ./pgm +vp4 

Charm++> Running on MPI version: 2.0
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 2 processes, 1 worker threads (PEs) + 1 comm threads per process, 2 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++ warning> fences and atomic operations not available in native assembly
Converse/Charm++ Commit ID: v6.9.0-0-gc3d50ef
Charm++> Disabling isomalloc because mmap() does not work.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 4 cores x 2 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.016 seconds.
CharmLB> RandCentLB created.

job aborted:
[ranks] message

[0] terminated

[1] process exited without calling finalize

---- error analysis -----

[1] on CS-DEXTERITY
./pgm ended prematurely and may have crashed. exit code 0xc0000005

---- error analysis -----
make[3]: *** [Makefile:24: test] Error 127
make[2]: *** [Makefile:38: test-megampi] Error 2
make[1]: *** [Makefile:37: test-ampi] Error 2
make: *** [Makefile:165: test] Error 2
make[3]: Leaving directory '/home/nikhil/autobuild/mpi-win-x86_64-smp/charm/mpi-win-x86_64-smp/tests/ampi/megampi'

http://charm.cs.illinois.edu/autobuild/old.2018_11_06__01_01/mpi-win-x86_64-smp.txt
http://charm.cs.illinois.edu/autobuild/old.2018_11_10__01_01/mpi-win-x86_64-smp.txt
http://charm.cs.illinois.edu/autobuild/old.2018_11_14__01_01/mpi-win-x86_64-smp.txt

tsan.log.3262 (402 KB) Evan Ramos, 03/15/2019 03:30 PM

History

#1 Updated by Eric Bohm 7 months ago

  • Assignee set to Evan Ramos

#2 Updated by Evan Ramos 5 months ago

  • Target version deleted (6.9.1)

#4 Updated by Sam White 4 months ago

It happened on mpi-win-smp today, and generally it seems to happen somewhat frequently, though not every time.

#5 Updated by Evan Ramos 4 months ago

  • Target version set to 6.10.0

#6 Updated by Evan Ramos 3 months ago

I managed to catch this crash in Visual Studio's debugger: ampi::getRank() is called with its this pointer pointing to garbage.

pgm.exe!ampi::getRank() Line 2588
    at tmp\libs\ck-libs\ampi\ampiimpl.h(2588)
pgm.exe!MPI_Comm_free(int * comm) Line 8994
    at tmp\libs\ck-libs\ampi\ampi.c(8994)
pgm.exe!AMPI_Main_cpp(int argc, char * * argv) Line 494
    at tests\ampi\megampi\test.c(494)
pgm.exe!AMPI_Fallback_Main(int argc, char * * argv) Line 830
    at tmp\libs\ck-libs\ampi\ampi.c(830)
pgm.exe!MPI_threadstart_t::start() Line 1059
    at tmp\libs\ck-libs\ampi\ampi.c(1059)
pgm.exe!AMPI_threadstart(void * data) Line 1076
    at tmp\libs\ck-libs\ampi\ampi.c(1076)
pgm.exe!startTCharmThread(TCharmInitMsg * msg) Line 164
    at tmp\libs\ck-libs\tcharm\tcharm.c(164)
pgm.exe!FiberSetUp(void * fiberData) Line 1371
    at tmp\threads.c(1371)
[External Code]

I tried the following change to help diagnose the problem:

diff --git a/src/libs/ck-libs/ampi/ampi.C b/src/libs/ck-libs/ampi/ampi.C
index dad98cf50..8f32e8e77 100644
--- a/src/libs/ck-libs/ampi/ampi.C
+++ b/src/libs/ck-libs/ampi/ampi.C
@@ -8990,7 +8990,9 @@ AMPI_API_IMPL(int, MPI_Comm_free, MPI_Comm *comm)
     //ret = parent->freeUserKeyvals(*comm, parent->getKeyvals(*comm));
     if (*comm != MPI_COMM_WORLD && *comm != MPI_COMM_SELF) {
       ampi* ptr = getAmpiInstance(*comm);
+      CmiEnforce(*comm == ptr->getCommStruct().getComm()); // assertion 1
       ptr->barrier();
+      CmiEnforce(*comm == ptr->getCommStruct().getComm()); // assertion 2
       if (ptr->getRank() == 0) {
         CProxy_CkArray(ptr->ckGetArrayID()).ckDestroy();
       }

The odd thing is that assertion 1 succeeds but assertion 2 fails.

ptr->barrier() calls thread->suspend(), which calls CthSuspend(). I suspect the problem is there.

Alternatively, there is the following comment in tcharm_impl.h:

        /* SUBTLE: We have to do the get() because "this" may have changed
         * during a migration-suspend.  If you access *any* members
         * from this point onward, you'll cause heap corruption if
         * we're resuming from migration!  (OSL 2003/9/23) */
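
This comment prescribes re-fetching from thread-local state after any potential suspend point. A minimal usage sketch of that pattern (TCharm::get() is the accessor the comment refers to; treat suspend() as a stand-in for any suspending call, and this as an illustration rather than the actual fix):

static void illustrateRefreshAfterSuspend(void)
{
  TCharm *tc = TCharm::get(); // valid before the suspend point
  tc->suspend();              // the thread may migrate while suspended
  tc = TCharm::get();         // re-fetch: the pre-suspend pointer may be stale
}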

I tried changing assertion 2 to CmiEnforce(*comm == getAmpiInstance(*comm)->getCommStruct().getComm()); but it still failed, this time with a null pointer dereference here:

pgm.exe!CkArray::lookup(const CkArrayIndex & idx) Line 595
    at tmp\ckarray.h(595)
pgm.exe!CProxyElement_ArrayBase::ckLocal() Line 743
    at tmp\ckarray.c(743)
pgm.exe!CProxyElement_ArrayElement::ckLocal() Line 1031
    at include\ckarray.decl.h(1031)
pgm.exe!CProxyElement_ampi::ckLocal() Line 1650
    at tmp\libs\ck-libs\ampi\ampi.decl.h(1650)
pgm.exe!ampiParent::comm2ampi(int comm) Line 2170
    at tmp\libs\ck-libs\ampi\ampiimpl.h(2170)
pgm.exe!getAmpiInstance(int comm) Line 3799
    at tmp\libs\ck-libs\ampi\ampi.c(3799)
pgm.exe!MPI_Comm_free(int * comm) Line 8996
    at tmp\libs\ck-libs\ampi\ampi.c(8996)
pgm.exe!AMPI_Main_cpp(int argc, char * * argv) Line 494
    at tests\ampi\megampi\test.c(494)
pgm.exe!AMPI_Fallback_Main(int argc, char * * argv) Line 830
    at tmp\libs\ck-libs\ampi\ampi.c(830)
pgm.exe!MPI_threadstart_t::start() Line 1059
    at tmp\libs\ck-libs\ampi\ampi.c(1059)
pgm.exe!AMPI_threadstart(void * data) Line 1076
    at tmp\libs\ck-libs\ampi\ampi.c(1076)
pgm.exe!startTCharmThread(TCharmInitMsg * msg) Line 164
    at tmp\libs\ck-libs\tcharm\tcharm.c(164)
pgm.exe!FiberSetUp(void * fiberData) Line 1371
    at tmp\threads.c(1371)
[External Code]

I am more inclined to believe the problem is in CthSuspend(), because this failure does not always occur, and it only occurs on Windows.

#7 Updated by Evan Ramos 3 months ago

I tried running megampi on Linux with ThreadSanitizer, and the list of data races it reported was substantial. Some of them look like candidates for this issue, including races on AMPI implementation details relevant to the failure seen on Windows.

./build AMPI multicore-linux-x86_64 tsan -j8 -g3 -fsanitize=thread && cd multicore-linux-x86_64-tsan/tests/ampi/megampi/ && make -j8 OPTS="-g3 -fsanitize=thread" && TSAN_OPTIONS='log_path=tsan.log' ./pgm +p4 +vp2 +tcharm_nomig +noisomalloc

#8 Updated by Sam White 3 months ago

Can you post the tsan output here?

#9 Updated by Evan Ramos 3 months ago

  • File tsan.log.3262 added

#10 Updated by Evan Ramos 3 months ago

I ran megampi on Windows with a Microsoft tool called Application Verifier, and it pointed out the following two additional problems, though I'm not sure either can be blamed for this issue.

1. "Invalid TLS index used for current stack trace."

        <avrf:logEntry Time="2019-03-15 : 17:46:57" LayerName="Handles" StopCode="0x301" Severity="Error">
            <avrf:message>Invalid TLS index used for current stack trace.</avrf:message>
            <avrf:parameter1>ffffffff - Invalid TLS index.</avrf:parameter1>
            <avrf:parameter2>abba - Expected lower part of the index.</avrf:parameter2>
            <avrf:parameter3>0 - Not used.</avrf:parameter3>
            <avrf:parameter4>0 - Not used.</avrf:parameter4>
            <avrf:stackTrace>
                <avrf:trace>vfbasics!+7ffe620caef9 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620cb12f ( @ 0)</avrf:trace>
                <avrf:trace>pgm!CmiGetState+10 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-smp.c @ 115)</avrf:trace>
                <avrf:trace>pgm!CmiMyPe+9 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 399)</avrf:trace>
                <avrf:trace>pgm!CmiAddCLA+18 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 325)</avrf:trace>
                <avrf:trace>pgm!CmiGetArgFlagDesc+29 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 579)</avrf:trace>
                <avrf:trace>pgm!CmiGetArgFlag+24 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 589)</avrf:trace>
                <avrf:trace>pgm!ConverseInit+2e (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 1197)</avrf:trace>
                <avrf:trace>pgm!charm_main+41 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\init.c @ 1713)</avrf:trace>
                <avrf:trace>pgm!main+1b (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\main.c @ 6)</avrf:trace>
                <avrf:trace>pgm!invoke_main+34 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 79)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main_seh+12e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main+e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 331)</avrf:trace>
                <avrf:trace>pgm!mainCRTStartup+9 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 17)</avrf:trace>
                <avrf:trace>KERNEL32!BaseThreadInitThunk+14 ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!RtlUserThreadStart+21 ( @ 0)</avrf:trace>
            </avrf:stackTrace>
        </avrf:logEntry>
CmiState CmiGetState(void)
{
  CmiState result;
  result = (CmiState)TlsGetValue(Cmi_state_key); // Cmi_state_key is 0xFFFFFFFF here
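
An index of ffffffff equals TLS_OUT_OF_INDEXES, the value TlsAlloc() returns on failure, so the excerpt above is consistent with reading the key before (or without) a successful TlsAlloc(). A guarded allocation sketch, assuming the key is created somewhere during startup (the function name here is hypothetical):

#include <windows.h>

static DWORD Cmi_state_key = TLS_OUT_OF_INDEXES;

static void CmiStateKeyInit(void)
{
  Cmi_state_key = TlsAlloc();
  if (Cmi_state_key == TLS_OUT_OF_INDEXES)
    CmiAbort("TlsAlloc() failed; cannot store per-thread CmiState");
}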

2. "NULL handle passed as parameter. A valid handle must be used."

        <avrf:logEntry Time="2019-03-15 : 17:48:34" LayerName="Handles" StopCode="0x303" Severity="Error">
            <avrf:message>NULL handle passed as parameter. A valid handle must be used.</avrf:message>
            <avrf:parameter1>0 - Not used.</avrf:parameter1>
            <avrf:parameter2>0 - Not used.</avrf:parameter2>
            <avrf:parameter3>0 - Not used.</avrf:parameter3>
            <avrf:parameter4>0 - Not used.</avrf:parameter4>
            <avrf:stackTrace>
                <avrf:trace>vfbasics!+7ffe620b3138 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5847 ( @ 0)</avrf:trace>
                <avrf:trace>KERNELBASE!WaitForSingleObjectEx+a2 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5342 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c53c8 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5342 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c53a5 ( @ 0)</avrf:trace>
                <avrf:trace>pgm!LrtsLock+19 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 1975)</avrf:trace>
                <avrf:trace>pgm!CmiArgInit+15 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 372)</avrf:trace>
                <avrf:trace>pgm!ConverseCommonInit+34d (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 3816)</avrf:trace>
                <avrf:trace>pgm!ConverseRunPE+39c (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 1578)</avrf:trace>
                <avrf:trace>pgm!ConverseInit+66f (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 1500)</avrf:trace>
                <avrf:trace>pgm!charm_main+41 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\init.c @ 1713)</avrf:trace>
                <avrf:trace>pgm!main+1b (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\main.c @ 6)</avrf:trace>
                <avrf:trace>pgm!invoke_main+34 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 79)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main_seh+12e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main+e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 331)</avrf:trace>
                <avrf:trace>pgm!mainCRTStartup+9 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 17)</avrf:trace>
                <avrf:trace>KERNEL32!BaseThreadInitThunk+14 ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!RtlUserThreadStart+21 ( @ 0)</avrf:trace>
            </avrf:stackTrace>
        </avrf:logEntry>
void CmiArgInit(char **argv) {
    int i;
    CmiLock(_smp_mutex); // _smp_mutex is null here
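
On Windows, CmiLock() on a null lock ends up in WaitForSingleObjectEx(NULL, ...), which is exactly what Application Verifier flags above. A defensive sketch (illustration only; lazily creating the lock here is itself racy, so the real fix would be to create _smp_mutex during single-threaded startup, before CmiArgInit() can run):

void CmiArgInit(char **argv)
{
  if (_smp_mutex == NULL)
    _smp_mutex = CmiCreateLock(); // ensure the lock exists before first use
  CmiLock(_smp_mutex);
  /* ... existing argument registration ... */
  CmiUnlock(_smp_mutex);
}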

#11 Updated by Evan Ramos 3 months ago

  • Status changed from New to In Progress

#12 Updated by Evan Ramos 3 months ago

Data races on these global variables in TCharm and AMPI are potential causes of this issue:

static mpi_comm_worlds mpi_worlds;
int _mpi_nworlds;

static CProxy_ampiWorlds ampiWorldsGroup;

CtvExtern(TCharm *,_curTCharm);
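
As a standalone illustration (not Charm++ code) of why a plain global like _mpi_nworlds shows up in the TSan log: two threads updating it without synchronization constitute a data race, and TSan reports exactly this pattern.

#include <thread>

int nworlds = 0; // stand-in for an unguarded global such as _mpi_nworlds

int main()
{
  // Built with -fsanitize=thread, this reports a data race on 'nworlds'.
  std::thread a([] { ++nworlds; });
  std::thread b([] { ++nworlds; });
  a.join();
  b.join();
  return 0;
}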

#13 Updated by Evan Ramos about 2 months ago

  • Subject changed from tests/ampi/megampi sometimes fails on mpi-win-x86_64-smp to tests/ampi/megampi crashes in MPI_Comm_free
  • Tags set to ampi

It looks like the original failure on mpi-win-x86_64-smp and multicore-win-x86_64 is due to two compounding problems. The first is that AMPI's MPI_Comm_free does not refresh its ampi * pointer after calling a barrier, during which migration may take place. This patch fixes that oversight and cleans up pointer refreshing after migration across all of AMPI: https://charm.cs.illinois.edu/gerrit/c/charm/+/5095
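
In outline, the change is the refresh pattern below (a sketch based on the diff in note #6, not the exact patch text):

if (*comm != MPI_COMM_WORLD && *comm != MPI_COMM_SELF) {
  ampi* ptr = getAmpiInstance(*comm);
  ptr->barrier();               // may suspend; the ampi object can migrate here
  ptr = getAmpiInstance(*comm); // refresh the pointer after the suspend point
  if (ptr->getRank() == 0) {
    CProxy_CkArray(ptr->ckGetArrayID()).ckDestroy();
  }
}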

With this patch in place, a second issue is exposed: after migration, CProxyElement_ArrayBase::ckLocalBranch() sometimes returns null during the call to getAmpiInstance. Fortunately, this issue is easily reproducible on Linux and macOS in addition to Windows.

A one-liner to reproduce this is:

./build AMPI netlrts-linux-x86_64-smp -j8 -g3 && pushd netlrts-linux-x86_64-smp/tests/ampi/megampi && make -j8 OPTS="-g3" && ./charmrun ./pgm +p2 +vp4 ++local ++debug-no-pause +isomalloc_sync +CmiSleepOnIdle

Swap +vp2 in for +vp4 to crash at a different location in megampi, and linux for darwin to reproduce the same crash on macOS.

+p2 +vp2:

Thread 1 "pgm" received signal SIGSEGV, Segmentation fault.
0x000055555583085c in CkArray::lookup (this=0x0, idx=...) at ckarray.h:595
595         if (locMgr->lookupID(idx,id)) {
(gdb) bt
#0  0x000055555583085c in CkArray::lookup (this=0x0, idx=...) at ckarray.h:595
#1  0x000055555588a2a0 in CProxyElement_ArrayBase::ckLocal (this=0x2aaaa0903880) at ckarray.C:742
#2  0x0000555555775646 in CProxyElement_ArrayElement::ckLocal (this=0x2aaaa0903880) at ../../../../bin/../include/CkArray.decl.h:1031
#3  0x00005555557c2766 in CProxyElement_ampi::ckLocal (this=0x2aaaa0903880) at ampi.decl.h:1650
#4  0x00005555557c58a4 in ampiParent::comm2ampi (this=0x555555d46900, comm=1000003) at ampiimpl.h:2162
#5  0x000055555578e377 in getAmpiInstance (comm=1000003) at ampi.C:3787
#6  0x00005555557907f0 in ampi::suspend (this=0x555555dcd4d0) at ampi.C:4569
#7  0x0000555555790781 in ampi::barrier (this=0x555555dcd4d0) at ampi.C:4557
#8  0x00005555557a20e8 in MPI_Comm_free (comm=0x2aaaa0903e34) at ampi.C:9014
#9  0x000055555576e42f in AMPI_Main_cpp (argc=1, argv=0x555555d49320) at test.C:490
#10 0x0000555555784994 in AMPI_Fallback_Main (argc=1, argv=0x555555d49320) at ampi.C:829
#11 0x00005555557c711d in MPI_threadstart_t::start (this=0x2aaaa0903f68) at ampi.C:1055
#12 0x000055555578518e in AMPI_threadstart (data=0x555555d45e00) at ampi.C:1075
#13 0x000055555576efad in startTCharmThread (msg=0x555555d45de0) at tcharm.C:163
#14 0x0000555555921d4f in CthStartThread (arg=...) at threads.c:1784
#15 0x000055555592220f in make_fcontext ()
#16 0x0000000000000000 in ?? ()

+p2 +vp4:

Thread 1 "pgm" received signal SIGSEGV, Segmentation fault.
0x000055555583085c in CkArray::lookup (this=0x0, idx=...) at ckarray.h:595
595         if (locMgr->lookupID(idx,id)) {
(gdb) bt
#0  0x000055555583085c in CkArray::lookup (this=0x0, idx=...) at ckarray.h:595
#1  0x000055555588a2a0 in CProxyElement_ArrayBase::ckLocal (this=0x2aaaa0903880) at ckarray.C:742
#2  0x0000555555775646 in CProxyElement_ArrayElement::ckLocal (this=0x2aaaa0903880) at ../../../../bin/../include/CkArray.decl.h:1031
#3  0x00005555557c2766 in CProxyElement_ampi::ckLocal (this=0x2aaaa0903880) at ampi.decl.h:1650
#4  0x00005555557c58a4 in ampiParent::comm2ampi (this=0x555555d47740, comm=1000002) at ampiimpl.h:2162
#5  0x000055555578e377 in getAmpiInstance (comm=1000002) at ampi.C:3787
#6  0x00005555557907f0 in ampi::suspend (this=0x555555d50f20) at ampi.C:4569
#7  0x0000555555790781 in ampi::barrier (this=0x555555d50f20) at ampi.C:4557
#8  0x00005555557a20e8 in MPI_Comm_free (comm=0x2aaaa0903e80) at ampi.C:9014
#9  0x000055555576dd7c in AMPI_Main_cpp (argc=1, argv=0x555555d4c4e0) at test.C:356
#10 0x0000555555784994 in AMPI_Fallback_Main (argc=1, argv=0x555555d4c4e0) at ampi.C:829
#11 0x00005555557c711d in MPI_threadstart_t::start (this=0x2aaaa0903f68) at ampi.C:1055
#12 0x000055555578518e in AMPI_threadstart (data=0x555555d46690) at ampi.C:1075
#13 0x000055555576efad in startTCharmThread (msg=0x555555d46670) at tcharm.C:163
#14 0x0000555555921d4f in CthStartThread (arg=...) at threads.c:1784
#15 0x000055555592220f in make_fcontext ()
#16 0x0000000000000000 in ?? ()
