Bug #1706

MPI LrtsAbort doesn't kill all replicas

Added by Jim Phillips almost 2 years ago. Updated almost 2 years ago.

Status: Merged
Priority: High
Assignee:
Category: Machine Layers
Target version:
Start date: 10/06/2017
Due date:
% Done: 0%


Description

A user reports that when one replica on Stampede2 dies, the others keep running. It looks like the machine_exit code doesn't kill the other replicas. Instead, it appears to try to shut things down cleanly, which is not consistent with an abort call, particularly when that is what MPI_Abort is supposed to do. The other machine layers should also be inspected for this behavior.

void LrtsAbort(const char *message) {
    char *m;

    m = CmiAlloc(CmiMsgHeaderSizeBytes);
    CmiSetHandler(m, machine_exit_idx);
    CmiSyncBroadcastAndFree(CmiMsgHeaderSizeBytes, m);
    machine_exit(m);
    /* Program never reaches here */
    MPI_Abort(charmComm, 1);
}

static void machine_exit(char *m) {
    EmergencyExit();
    /*printf("--> %d: machine_exit\n",CmiMyPe());*/
    fflush(stdout);
    CmiNodeBarrier();
    if (CmiMyRank() == 0) {
        MPI_Barrier(charmComm);
        /*printf("==> %d: passed barrier\n",CmiMyPe());*/
        MPI_Abort(charmComm, 1);
    } else {
        while (1) CmiYield();
    }
}

History

#1 Updated by Jim Phillips almost 2 years ago

Also, we have rank 0 (rather than the communication thread) making MPI calls, which is not kosher for MPI_THREAD_FUNNELED.

#2 Updated by Phil Miller almost 2 years ago

  • Assignee set to Sam White

Assign to Sam as MPI machine layer owner.

I'm thinking LrtsAbort should just call MPI_Abort and not try to do anything more.

#3 Updated by Jim Phillips almost 2 years ago

Or have the comm thread call MPI_Abort.

#4 Updated by Sam White almost 2 years ago

  • Status changed from New to In Progress

I'm not sure how we should go about making this safe for thread level FUNNELED in SMP mode. Some options:
1. Add a flag to the objects that get posted to the comm thread's PCQueue for sending, and have the comm thread check that flag before every send.
2. Add some kind of separate atomic flag that the comm thread checks every once in a while; workers set it when they want to abort.
3. Use MPI's thread level SERIALIZED, do a node-all barrier (including the comm thread), then call MPI_Abort.

Options 1 and 2 will add some small overhead to the common case of not aborting. It looks like the existing implementation of LrtsAbort in the MPI layer basically does option 3, except it still sets the thread level to FUNNELED. I don't think there is any extra overhead in setting SERIALIZED over FUNNELED in any common MPI implementation...
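For illustration, option 2 from the list above might look roughly like the sketch below. It is not taken from the Charm++ sources: abort_requested, worker_request_abort, comm_thread_poll_abort, and stub_mpi_abort are hypothetical names, and stub_mpi_abort stands in for MPI_Abort(charmComm, 1) so the sketch compiles without MPI. The point is only that workers never touch MPI; they set a flag, and the comm thread makes the abort call when it next polls.

```c
#include <stdatomic.h>

/* Hypothetical flag: set by any worker, read by the comm thread. */
static atomic_int abort_requested = 0;

/* Records the code passed to the stub; -1 means "never called". */
static int stub_abort_code = -1;

/* Stand-in for MPI_Abort(charmComm, code), so the sketch runs without MPI. */
static void stub_mpi_abort(int code) { stub_abort_code = code; }

/* Worker side: request an abort without making any MPI call. */
static void worker_request_abort(void) {
    atomic_store_explicit(&abort_requested, 1, memory_order_release);
}

/* Comm thread side: meant to be called once per progress-loop iteration.
   Returns 1 if an abort was requested and performed, 0 otherwise. */
static int comm_thread_poll_abort(void) {
    if (atomic_load_explicit(&abort_requested, memory_order_acquire)) {
        stub_mpi_abort(1);
        return 1;
    }
    return 0;
}
```

The cost in the non-aborting case is one atomic load per poll, which is the small overhead mentioned for this option.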

Here's a patch without any fix for the thread safety issue: https://charm.cs.illinois.edu/gerrit/#/c/3111/

#5 Updated by Jim Phillips almost 2 years ago

Regarding 3, if you could get the comm thread to participate in a node-all barrier you could just have it call MPI_Abort.

Why not hijack commThdExit in machine-common-core.c, since it is already being checked on a regular basis? If the comm thread is going to be checking a flag on a regular basis, there is no reason for it not to be a nested flag that, if set, enables a check for both clean exit and abort.

static void CommunicationServer(int sleepTime) {
#if CMK_SMP
    AdvanceCommunication(1);

    if (commThdExit == CmiMyNodeSize()) {
        MACHSTATE(2, "CommunicationServer exiting {");
        LrtsDrainResources();
        MACHSTATE(2, "} CommunicationServer EXIT");

        ConverseCommonExit();

#if CMK_USE_XPMEM
        CmiExitXpmem();
#endif  
        CmiNodeAllBarrier();
        LrtsExit();
    }
#endif
}
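One way to read the nested-flag suggestion is sketched below. It assumes commThdExit is a counter that each worker increments on exit, as the comparison against CmiMyNodeSize() suggests. Everything else is hypothetical: comm_thd_exit stands in for commThdExit, NODE_SIZE for CmiMyNodeSize(), and ABORT_BIT, comm_action, and the three functions are invented for this example. A nonzero value is the outer flag; only then does the comm thread distinguish abort from clean exit.

```c
#include <stdatomic.h>

enum { NODE_SIZE = 4 };        /* assumed worker count; stands in for CmiMyNodeSize() */
enum { ABORT_BIT = 1 << 30 };  /* hypothetical marker bit, far above any node size */

typedef enum { COMM_CONTINUE, COMM_CLEAN_EXIT, COMM_ABORT } comm_action;

static atomic_int comm_thd_exit = 0;  /* stands in for commThdExit */

/* Worker reached normal exit: bump the counter. */
static void worker_signal_exit(void) { atomic_fetch_add(&comm_thd_exit, 1); }

/* Worker wants to abort: set the marker bit without disturbing the count. */
static void worker_signal_abort(void) { atomic_fetch_or(&comm_thd_exit, ABORT_BIT); }

/* Comm thread: the common case is one load and one comparison against zero. */
static comm_action comm_check_exit(void) {
    int v = atomic_load(&comm_thd_exit);
    if (v == 0) return COMM_CONTINUE;            /* fast path: nothing requested */
    if (v & ABORT_BIT) return COMM_ABORT;        /* abort takes priority over clean exit */
    if (v == NODE_SIZE) return COMM_CLEAN_EXIT;  /* every worker has exited */
    return COMM_CONTINUE;                        /* some workers have exited, not all */
}
```

In the real CommunicationServer, the COMM_CLEAN_EXIT branch would correspond to the existing drain-and-exit sequence, and the COMM_ABORT branch would be where the comm thread calls MPI_Abort.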

#6 Updated by Phil Miller almost 2 years ago

I'm kind of inclined to question the assumption that (presumably) an application-level call to CmiAbort should bring down all partitions. Inasmuch as running with partitions means having essentially separate instances of the entire job and runtime that just happen to have been scheduled together, working instances should be able to continue just fine. Calls to an API at the partitioned level (i.e. Cmi*) should logically stay within a partition.

That logic breaks down as soon as one wants to actually have any interaction among the partitions, as in replica-exchange methods, because then those communication operations have to report failure, callers have to check for it, etc.

#7 Updated by Jim Phillips almost 2 years ago

How would you detect that the partition you are sending an asynchronous message to has aborted?
Also, there are many ways to run multiple independent jobs. The partition feature exists to allow those jobs to communicate.

#8 Updated by Phil Miller almost 2 years ago

  • Status changed from In Progress to Merged
