Bug #2030

tests/ampi/megampi sometimes fails on mpi-win-x86_64-smp

Added by Evan Ramos 4 months ago. Updated 3 days ago.

Status: In Progress
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 11/19/2018
Due date:
% Done: 0%


Description

This failure shows up in autobuild every few days.

../../../bin/testrun  ./pgm +p2 +vp4  

Running on 2 processors:  ./pgm +vp4 
charmrun> /cygdrive/c/Program Files/Microsoft MPI/Bin/mpiexec -n 2  ./pgm +vp4 

Charm++> Running on MPI version: 2.0
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 2 processes, 1 worker threads (PEs) + 1 comm threads per process, 2 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++ warning> fences and atomic operations not available in native assembly
Converse/Charm++ Commit ID: v6.9.0-0-gc3d50ef
Charm++> Disabling isomalloc because mmap() does not work.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 4 cores x 2 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.016 seconds.
CharmLB> RandCentLB created.

job aborted:
[ranks] message

[0] terminated

[1] process exited without calling finalize

---- error analysis -----

[1] on CS-DEXTERITY
./pgm ended prematurely and may have crashed. exit code 0xc0000005

---- error analysis -----
make[3]: *** [Makefile:24: test] Error 127
make[2]: *** [Makefile:38: test-megampi] Error 2
make[1]: *** [Makefile:37: test-ampi] Error 2
make: *** [Makefile:165: test] Error 2
make[3]: Leaving directory '/home/nikhil/autobuild/mpi-win-x86_64-smp/charm/mpi-win-x86_64-smp/tests/ampi/megampi'

http://charm.cs.illinois.edu/autobuild/old.2018_11_06__01_01/mpi-win-x86_64-smp.txt
http://charm.cs.illinois.edu/autobuild/old.2018_11_10__01_01/mpi-win-x86_64-smp.txt
http://charm.cs.illinois.edu/autobuild/old.2018_11_14__01_01/mpi-win-x86_64-smp.txt

tsan.log.3262 (402 KB) Evan Ramos, 03/15/2019 03:30 PM

History

#1 Updated by Eric Bohm 4 months ago

  • Assignee set to Evan Ramos

#2 Updated by Evan Ramos 2 months ago

  • Target version deleted (6.9.1)

#4 Updated by Sam White about 1 month ago

It happened on mpi-win-smp today, and it seems to happen fairly frequently, though not every time.

#5 Updated by Evan Ramos 25 days ago

  • Target version set to 6.10.0

#6 Updated by Evan Ramos 12 days ago

I managed to catch this crash in Visual Studio's debugger. ampi::getRank() is called with this pointing to garbage.

pgm.exe!ampi::getRank() Line 2588
    at tmp\libs\ck-libs\ampi\ampiimpl.h(2588)
pgm.exe!MPI_Comm_free(int * comm) Line 8994
    at tmp\libs\ck-libs\ampi\ampi.c(8994)
pgm.exe!AMPI_Main_cpp(int argc, char * * argv) Line 494
    at tests\ampi\megampi\test.c(494)
pgm.exe!AMPI_Fallback_Main(int argc, char * * argv) Line 830
    at tmp\libs\ck-libs\ampi\ampi.c(830)
pgm.exe!MPI_threadstart_t::start() Line 1059
    at tmp\libs\ck-libs\ampi\ampi.c(1059)
pgm.exe!AMPI_threadstart(void * data) Line 1076
    at tmp\libs\ck-libs\ampi\ampi.c(1076)
pgm.exe!startTCharmThread(TCharmInitMsg * msg) Line 164
    at tmp\libs\ck-libs\tcharm\tcharm.c(164)
pgm.exe!FiberSetUp(void * fiberData) Line 1371
    at tmp\threads.c(1371)
[External Code]

I tried the following change to help diagnose the problem:

diff --git a/src/libs/ck-libs/ampi/ampi.C b/src/libs/ck-libs/ampi/ampi.C
index dad98cf50..8f32e8e77 100644
--- a/src/libs/ck-libs/ampi/ampi.C
+++ b/src/libs/ck-libs/ampi/ampi.C
@@ -8990,7 +8990,9 @@ AMPI_API_IMPL(int, MPI_Comm_free, MPI_Comm *comm)
     //ret = parent->freeUserKeyvals(*comm, parent->getKeyvals(*comm));
     if (*comm != MPI_COMM_WORLD && *comm != MPI_COMM_SELF) {
       ampi* ptr = getAmpiInstance(*comm);
+      CmiEnforce(*comm == ptr->getCommStruct().getComm()); // assertion 1
       ptr->barrier();
+      CmiEnforce(*comm == ptr->getCommStruct().getComm()); // assertion 2
       if (ptr->getRank() == 0) {
         CProxy_CkArray(ptr->ckGetArrayID()).ckDestroy();
       }

The odd thing is that assertion 1 succeeds but assertion 2 fails.

ptr->barrier() calls thread->suspend(), which calls CthSuspend(). I suspect the problem is there.

Alternatively, there is the following comment in tcharm_impl.h:

        /* SUBTLE: We have to do the get() because "this" may have changed
         * during a migration-suspend.  If you access *any* members
         * from this point onward, you'll cause heap corruption if
         * we're resuming from migration!  (OSL 2003/9/23) */

I tried changing assertion 2 to CmiEnforce(*comm == getAmpiInstance(*comm)->getCommStruct().getComm()); but it still failed, this time with a null pointer dereference here:

pgm.exe!CkArray::lookup(const CkArrayIndex & idx) Line 595
    at tmp\ckarray.h(595)
pgm.exe!CProxyElement_ArrayBase::ckLocal() Line 743
    at tmp\ckarray.c(743)
pgm.exe!CProxyElement_ArrayElement::ckLocal() Line 1031
    at include\ckarray.decl.h(1031)
pgm.exe!CProxyElement_ampi::ckLocal() Line 1650
    at tmp\libs\ck-libs\ampi\ampi.decl.h(1650)
pgm.exe!ampiParent::comm2ampi(int comm) Line 2170
    at tmp\libs\ck-libs\ampi\ampiimpl.h(2170)
pgm.exe!getAmpiInstance(int comm) Line 3799
    at tmp\libs\ck-libs\ampi\ampi.c(3799)
pgm.exe!MPI_Comm_free(int * comm) Line 8996
    at tmp\libs\ck-libs\ampi\ampi.c(8996)
pgm.exe!AMPI_Main_cpp(int argc, char * * argv) Line 494
    at tests\ampi\megampi\test.c(494)
pgm.exe!AMPI_Fallback_Main(int argc, char * * argv) Line 830
    at tmp\libs\ck-libs\ampi\ampi.c(830)
pgm.exe!MPI_threadstart_t::start() Line 1059
    at tmp\libs\ck-libs\ampi\ampi.c(1059)
pgm.exe!AMPI_threadstart(void * data) Line 1076
    at tmp\libs\ck-libs\ampi\ampi.c(1076)
pgm.exe!startTCharmThread(TCharmInitMsg * msg) Line 164
    at tmp\libs\ck-libs\tcharm\tcharm.c(164)
pgm.exe!FiberSetUp(void * fiberData) Line 1371
    at tmp\threads.c(1371)
[External Code]

I am more inclined to believe the problem is in CthSuspend(), because this failure does not always occur and it only occurs on Windows.
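
For readers unfamiliar with the pattern that the tcharm_impl.h comment describes, here is a minimal sketch of re-fetching an object by its handle after a suspend point instead of reusing a cached pointer. Every name in it (FakeComm, registry, lookup, fakeSuspend, rankAfterBarrier) is hypothetical and stands in for the real AMPI types; it is not the AMPI implementation. As noted above, re-fetching through getAmpiInstance() did not resolve this particular failure, so this only illustrates the idea and is not a proposed fix.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical stand-in for an AMPI communicator object; not the real ampi class.
struct FakeComm {
  int commId;
  int rank;
};

// Hypothetical registry mapping communicator handles to live objects.
static std::map<int, std::unique_ptr<FakeComm>> registry;

// Returns the current object for a handle, or nullptr if none is registered.
FakeComm* lookup(int commId) {
  auto it = registry.find(commId);
  return it == registry.end() ? nullptr : it->second.get();
}

// Simulates a suspend during which the object is moved to new storage,
// as a migration would do. Pointers obtained before this call become stale.
void fakeSuspend(int commId) {
  auto it = registry.find(commId);
  if (it == registry.end()) return;
  auto moved = std::make_unique<FakeComm>(*it->second);  // copy into new storage
  it->second = std::move(moved);                         // old object is destroyed here
}

// Re-fetches by handle after the suspend instead of reusing the cached pointer.
int rankAfterBarrier(int commId) {
  FakeComm* ptr = lookup(commId);
  if (ptr == nullptr) return -1;
  fakeSuspend(commId);   // ptr must not be dereferenced after this line
  ptr = lookup(commId);  // refresh from the stable handle
  return ptr ? ptr->rank : -1;
}
```

The handle (an int here) is the only value kept across the suspend; the pointer is treated as invalid the moment control returns.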

#7 Updated by Evan Ramos 3 days ago

I tried running megampi on Linux with ThreadSanitizer, and the list of data races was substantial. Some of them look like candidates for this issue, including races in AMPI implementation details relevant to the failure seen on Windows.

./build AMPI multicore-linux-x86_64 tsan -j8 -g3 -fsanitize=thread && cd multicore-linux-x86_64-tsan/tests/ampi/megampi/ && make -j8 OPTS="-g3 -fsanitize=thread" && TSAN_OPTIONS='log_path=tsan.log' ./pgm +p4 +vp2 +tcharm_nomig +noisomalloc

#8 Updated by Sam White 3 days ago

Can you post the tsan output here?

#9 Updated by Evan Ramos 3 days ago

#10 Updated by Evan Ramos 3 days ago

I ran megampi on Windows with a Microsoft tool called Application Verifier, and it pointed out these two additional problems, but I'm not sure either can be blamed for this issue.

1. "Invalid TLS index used for current stack trace."

        <avrf:logEntry Time="2019-03-15 : 17:46:57" LayerName="Handles" StopCode="0x301" Severity="Error">
            <avrf:message>Invalid TLS index used for current stack trace.</avrf:message>
            <avrf:parameter1>ffffffff - Invalid TLS index.</avrf:parameter1>
            <avrf:parameter2>abba - Expected lower part of the index.</avrf:parameter2>
            <avrf:parameter3>0 - Not used.</avrf:parameter3>
            <avrf:parameter4>0 - Not used.</avrf:parameter4>
            <avrf:stackTrace>
                <avrf:trace>vfbasics!+7ffe620caef9 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620cb12f ( @ 0)</avrf:trace>
                <avrf:trace>pgm!CmiGetState+10 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-smp.c @ 115)</avrf:trace>
                <avrf:trace>pgm!CmiMyPe+9 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 399)</avrf:trace>
                <avrf:trace>pgm!CmiAddCLA+18 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 325)</avrf:trace>
                <avrf:trace>pgm!CmiGetArgFlagDesc+29 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 579)</avrf:trace>
                <avrf:trace>pgm!CmiGetArgFlag+24 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 589)</avrf:trace>
                <avrf:trace>pgm!ConverseInit+2e (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 1197)</avrf:trace>
                <avrf:trace>pgm!charm_main+41 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\init.c @ 1713)</avrf:trace>
                <avrf:trace>pgm!main+1b (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\main.c @ 6)</avrf:trace>
                <avrf:trace>pgm!invoke_main+34 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 79)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main_seh+12e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main+e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 331)</avrf:trace>
                <avrf:trace>pgm!mainCRTStartup+9 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 17)</avrf:trace>
                <avrf:trace>KERNEL32!BaseThreadInitThunk+14 ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!RtlUserThreadStart+21 ( @ 0)</avrf:trace>
            </avrf:stackTrace>
        </avrf:logEntry>
CmiState CmiGetState(void)
{
  CmiState result;
  result = (CmiState)TlsGetValue(Cmi_state_key); // Cmi_state_key is 0xFFFFFFFF here

2. "NULL handle passed as parameter. A valid handle must be used."

        <avrf:logEntry Time="2019-03-15 : 17:48:34" LayerName="Handles" StopCode="0x303" Severity="Error">
            <avrf:message>NULL handle passed as parameter. A valid handle must be used.</avrf:message>
            <avrf:parameter1>0 - Not used.</avrf:parameter1>
            <avrf:parameter2>0 - Not used.</avrf:parameter2>
            <avrf:parameter3>0 - Not used.</avrf:parameter3>
            <avrf:parameter4>0 - Not used.</avrf:parameter4>
            <avrf:stackTrace>
                <avrf:trace>vfbasics!+7ffe620b3138 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5847 ( @ 0)</avrf:trace>
                <avrf:trace>KERNELBASE!WaitForSingleObjectEx+a2 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5342 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c53c8 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c5342 ( @ 0)</avrf:trace>
                <avrf:trace>vfbasics!+7ffe620c53a5 ( @ 0)</avrf:trace>
                <avrf:trace>pgm!LrtsLock+19 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 1975)</avrf:trace>
                <avrf:trace>pgm!CmiArgInit+15 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 372)</avrf:trace>
                <avrf:trace>pgm!ConverseCommonInit+34d (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\convcore.c @ 3816)</avrf:trace>
                <avrf:trace>pgm!ConverseRunPE+39c (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 1578)</avrf:trace>
                <avrf:trace>pgm!ConverseInit+66f (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\machine-common-core.c @ 1500)</avrf:trace>
                <avrf:trace>pgm!charm_main+41 (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\init.c @ 1713)</avrf:trace>
                <avrf:trace>pgm!main+1b (c:\msys64\home\evan\charm\multicore-win-x86_64\tmp\main.c @ 6)</avrf:trace>
                <avrf:trace>pgm!invoke_main+34 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 79)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main_seh+12e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288)</avrf:trace>
                <avrf:trace>pgm!__scrt_common_main+e (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 331)</avrf:trace>
                <avrf:trace>pgm!mainCRTStartup+9 (d:\agent\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_main.cpp @ 17)</avrf:trace>
                <avrf:trace>KERNEL32!BaseThreadInitThunk+14 ( @ 0)</avrf:trace>
                <avrf:trace>ntdll!RtlUserThreadStart+21 ( @ 0)</avrf:trace>
            </avrf:stackTrace>
        </avrf:logEntry>
void CmiArgInit(char **argv) {
    int i;
    CmiLock(_smp_mutex); // _smp_mutex is null here
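
Both reports above appear to involve a global being used before it holds a valid value: Cmi_state_key is 0xFFFFFFFF and _smp_mutex is null at the point of use. A minimal sketch of a startup check for that condition follows. It is a portable model only and calls no Windows API; RuntimeGlobals, kInvalidKey, InitCheck, and checkInitialized are hypothetical names that do not exist in Charm++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sentinel mirroring the 0xFFFFFFFF value observed for Cmi_state_key.
constexpr std::uint32_t kInvalidKey = 0xFFFFFFFFu;

// Hypothetical model of the two globals named in the Application Verifier reports.
struct RuntimeGlobals {
  std::uint32_t stateKey = kInvalidKey;  // would be set by a TLS allocation call
  void* smpMutex = nullptr;              // would be set by a mutex creation call
};

enum class InitCheck { Ok, KeyNotAllocated, MutexNotCreated };

// Reports the first global that still holds its "not initialized" value.
InitCheck checkInitialized(const RuntimeGlobals& g) {
  if (g.stateKey == kInvalidKey) return InitCheck::KeyNotAllocated;
  if (g.smpMutex == nullptr) return InitCheck::MutexNotCreated;
  return InitCheck::Ok;
}
```

A check like this, placed ahead of the first use during startup, would turn a silent invalid-handle call into an explicit diagnostic. Whether either condition is related to the megampi crash is not established here.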

#11 Updated by Evan Ramos 3 days ago

  • Status changed from New to In Progress
