Bug #1076

failure to exit after long run on PSC Bridges

Added by Jim Phillips about 3 years ago. Updated over 1 year ago.

Status: Rejected
Priority: Normal
Assignee:
Category: Machine Layers
Target version: -
Start date: 05/15/2016
Due date:
% Done: 0%
Spent time:

Description

A long MPI run on 9 nodes of PSC Bridges failed to exit and hung until the job was killed. Note the lack of "[Partition 0][Node 0] End of program" in the log below. A short run exits fine.

Converse/Charm++ Commit ID: v6.7.0-42-gf835f2a-namd-charm-6.7.0-build-2016-Feb-07-94042
Info: Based on Charm++/Converse 60800 for mpi-linux-x86_64-iccstatic
Info: Built Tue Apr 12 12:17:35 EDT 2016 by jphillip on br006.pvt.bridges.psc.edu
Info: 1 NAMD  2.11  Linux-x86_64-MPI  252    r379.pvt.bridges.psc.edu  jphillip
Info: Running on 252 processors, 252 nodes, 9 physical nodes.
...
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 10000000
WRITING COORDINATES TO OUTPUT FILE AT STEP 10000000
The last position output (seq=-2) takes 0.005 seconds, 540.750 MB of memory in use
WRITING VELOCITIES TO OUTPUT FILE AT STEP 10000000
The last velocity output (seq=-2) takes 0.005 seconds, 540.750 MB of memory in use
====================================================

WallClock: 47768.617188  CPUTime: 47768.617188  Memory: 540.750000 MB
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 132466.0 ON r379 CANCELLED AT 2016-05-15T10:29:44 ***

History

#1 Updated by Jim Phillips about 3 years ago

This seems reproducible: 1M-step jobs exit fine, while 10M-step jobs hang as above.

#2 Updated by Jim Phillips about 3 years ago

I get the same symptom even for NAMD 2.10 with Charm++ v6.6.1-rc1-1-gba7c3c3-namd-charm-6.6.1-build-2014-Dec-08-28969:

WallClock: 48653.468750  CPUTime: 48653.468750  Memory: 539.621094 MB
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: *** STEP 132767.0 ON r256 CANCELLED AT 2016-05-17T00:31:57 ***

Note that NAMD is definitely calling CkExit:

void BackEnd::exit(void) {
  float cpuTime = CmiCpuTimer() - cpuTime_start;
  float wallTime = CmiWallTimer() - wallTime_start;
  CmiPrintf("====================================================\n\n" 
            "WallClock: %f  CPUTime: %f  Memory: %f MB\n",
            wallTime, cpuTime, memusage_MB());
  CkExit();
}

Also, shorter runs exit normally:

WallClock: 5.761169  CPUTime: 5.761169  Memory: 539.691406 MB
[Partition 0][Node 0] End of program

#3 Updated by Jim Phillips about 3 years ago

The Stampede MPI version exits normally.

#4 Updated by Eric Bohm about 3 years ago

  • Assignee set to Bilge Acun

#5 Updated by Jim Phillips about 3 years ago

According to Rick Costa of PSC User Services: "The same hanging issue has surfaced for other packages."

There is an MPI_Barrier call in LrtsDrainResources that is likely what is hanging.

#6 Updated by Jim Phillips about 3 years ago

Jim Phillips wrote:

> According to Rick Costa of PSC User Services: "The same hanging issue has surfaced for other packages."
>
> There is an MPI_Barrier call in LrtsDrainResources that is likely what is hanging.

Nevermind. This comment was in reference to a startup hang, not an exit hang.

#7 Updated by Jim Phillips about 3 years ago

Jim Phillips wrote:

> Jim Phillips wrote:
>
> > According to Rick Costa of PSC User Services: "The same hanging issue has surfaced for other packages."
> >
> > There is an MPI_Barrier call in LrtsDrainResources that is likely what is hanging.
>
> Nevermind. This comment was in reference to a startup hang, not an exit hang.

OK, do mind. "The job hangs are all on long-running jobs, not on startup. I know of 3 NAMD users with this problem and 1 Trinity user. There may be other users. We are looking into this problem."

#8 Updated by Jim Phillips about 3 years ago

On Bridges, after a long run, 30 of the 252 ranks never make the MPI_Barrier call in LrtsDrainResources():

[jphillip@br006 ~]$ grep "entering MPI_Barrier" ~jphillip/jun8_long.log | wc -l
222
[jphillip@br006 ~]$ grep "entering MPI_Barrier" ~jphillip/jun8_long.log | sort -n -k2 | head
PE 0 entering MPI_Barrier
PE 1 entering MPI_Barrier
PE 2 entering MPI_Barrier
PE 3 entering MPI_Barrier
PE 4 entering MPI_Barrier
PE 5 entering MPI_Barrier
PE 6 entering MPI_Barrier
PE 8 entering MPI_Barrier
PE 9 entering MPI_Barrier
PE 10 entering MPI_Barrier

My assumption is that the message-draining loop at the beginning of LrtsDrainResources() never completes because the receiving ranks have already entered MPI_Barrier and are no longer servicing incoming messages, so CheckAllAsyncMsgsSent() never returns true (see the sketch after the listing below).

void LrtsDrainResources() {
#if !CMK_SMP
    while (!CheckAllAsyncMsgsSent()) {
        PumpMsgs();
        ReleaseSentMessages();
    }
#else
    if(Cmi_smp_mode_setting == COMM_THREAD_SEND_RECV){
        while (!MsgQueueEmpty() || !CheckAllAsyncMsgsSent()) {
            ReleaseSentMessages();
            SendMsgBuf();
            PumpMsgs();
        }
    }else if(Cmi_smp_mode_setting == COMM_THREAD_ONLY_RECV) {
        while(!CheckAllAsyncMsgsSent()) {
            ReleaseSentMessages();
        }
    }
#endif
#if CMK_MEM_CHECKPOINT || CMK_MESSAGE_LOGGING
    if (CmiMyPe() == 0 && CmiMyPartition()==0)
    {
      mpi_end_spare();
    }
#endif
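    /* the CmiPrintf calls below are debug instrumentation added for this investigation */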
    CmiPrintf("PE %d entering MPI_Barrier\n", CmiMyPe());
    MACHSTATE(2, "Machine exit barrier begin {");
    START_EVENT();
    if (MPI_SUCCESS != MPI_Barrier(charmComm))
        CmiAbort("LrtsDrainResources: MPI_Barrier failed!\n");
    END_EVENT(10);
    MACHSTATE(2, "} Machine exit barrier end");
    CmiPrintf("PE %d finished MPI_Barrier\n", CmiMyPe());
}
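
For illustration, here is a minimal standalone sketch of the stuck end state described above. This is not NAMD or Charm++ code; the file name, message size, and two-rank setup are arbitrary. Rank 0 spins in MPI_Test on an Isend for which no matching receive is ever posted, while rank 1 has already entered the exit barrier, so neither rank can proceed.

/* deadlock_sketch.c (hypothetical file name) -- run with exactly 2 ranks.
 * Deliberately hangs when the message is large enough to use the
 * rendezvous protocol, mirroring the traces in this report. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int N = 1 << 22;               /* well above typical eager limits */
    char *buf = calloc(N, 1);

    if (rank == 0) {
        MPI_Request req;
        MPI_Status sts;
        int done = 0;
        MPI_Isend(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        while (!done) {                   /* mirrors the ReleaseSentMessages() spin */
            MPI_Test(&req, &done, &sts);
        }
        printf("PE %d entering MPI_Barrier\n", rank);
    } else {
        /* rank 1 never posts the matching receive -- analogous to a PE that
         * has stopped calling PumpMsgs() -- and goes straight to the barrier */
        printf("PE %d entering MPI_Barrier\n", rank);
    }
    MPI_Barrier(MPI_COMM_WORLD);          /* collective exit barrier */
    printf("PE %d finished MPI_Barrier\n", rank);

    free(buf);
    MPI_Finalize();
    return 0;
}

In the real run the receiver would have posted the receive from PumpMsgs() had it still been draining, so this sketch only reproduces the stuck end state, not the race that leads into it.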

#9 Updated by Jim Phillips about 3 years ago

This is a stack trace from rank 7 (hanging):

#0  0x00007f55cd42b117 in sched_yield () from /lib64/libc.so.6
#1  0x00007f55cdd48105 in PMPIDI_CH3I_Progress (progress_state=0x2403040, is_blocking=0) at ../../src/mpid/ch3/channels/nemesis/src/ch3_progress.c:913
#2  0x00007f55cdfe0d1a in MPIR_Test_impl (request=0x2403040, flag=0x0, status=0x2168ffb0) at ../../src/mpi/pt2pt/test.c:52
#3  0x00007f55cdfe094c in PMPI_Test (request=0x2403040, flag=0x0, status=0x2168ffb0) at ../../src/mpi/pt2pt/test.c:162
#4  0x0000000000ea92e4 in ReleaseSentMessages () at machine.c:566
#5  0x0000000000ea9834 in LrtsDrainResources () at machine.c:1105
#6  0x0000000000ea8abc in ConverseExit () at machine-common-core.c:1335
#7  0x0000000000d4e01f in _exitHandler (env=0x26a1560) at init.C:619
#8  0x0000000000eac334 in CmiHandleMessage (msg=0x26a1560) at convcore.c:1633
#9  0x0000000000eac695 in CsdScheduleForever () at convcore.c:1861
#10 0x0000000000eac58b in CsdScheduler (maxmsgs=-1) at convcore.c:1797
#11 0x00000000004c3127 in master_init (argc=<optimized out>, argv=<optimized out>) at src/BackEnd.C:156
#12 BackEnd::init (argc=37761088, argv=0x0) at src/BackEnd.C:247
#13 0x00000000004bb8c3 in main (argc=37761088, argv=0x0) at src/mainfunc.C:43

#4  0x0000000000ea92e4 in ReleaseSentMessages () at machine.c:566
566             if (MPI_Test(&(msg_tmp->req), &done, &sts) != MPI_SUCCESS)
(gdb) print *msg_tmp
$1 = {msg = 0x46ce530 "", next = 0x26b8e10, size = 5876, destpe = 86, mode = 1, req = -335544264}

and rank 8 (in MPI_Barrier):

#0  0x00007f384b170117 in sched_yield () from /lib64/libc.so.6
#1  0x00007f384ba8d105 in PMPIDI_CH3I_Progress (progress_state=0x199f040, is_blocking=0) at ../../src/mpid/ch3/channels/nemesis/src/ch3_progress.c:913
#2  0x00007f384bb98e39 in MPIC_Wait (request_ptr=0x199f040) at ../../src/mpi/coll/helper_fns.c:281
#3  0x00007f384bb98b32 in MPIC_Sendrecv (sendbuf=0x199f040, sendcount=0, sendtype=1276230016, dest=-1, sendtag=12830, recvbuf=0xd, recvcount=0, recvtype=1275068685, source=7, recvtag=1, comm=-2080374783, status=0x1,
    errflag=0x7ffcbb451400) at ../../src/mpi/coll/helper_fns.c:559
#4  0x00007f384ba7244a in MPIR_Barrier_intra (comm_ptr=0x199f040, errflag=0x0) at ../../src/mpi/coll/barrier.c:515
#5  0x00007f384ba71f0a in MPIR_Barrier_advanced (comm_ptr=0x199f040, errflag=0x0) at ../../src/mpi/coll/barrier.c:791
#6  0x00007f384ba71d17 in I_MPIR_Barrier_intra (comm_ptr=0x199f040, errflag=0x0) at ../../src/mpi/coll/barrier.c:738
#7  0x00007f384ba71bdc in MPIR_Barrier (comm_ptr=0x199f040, errflag=0x0) at ../../src/mpi/coll/barrier.c:648
#8  0x00007f384ba71b0c in MPIR_Barrier_impl (comm_ptr=0x199f040, errflag=0x0) at ../../src/mpi/coll/barrier.c:699
#9  0x00007f384ba73932 in PMPI_Barrier (comm=26865728) at ../../src/mpi/coll/barrier.c:914
#10 0x0000000000ea985d in LrtsDrainResources () at machine.c:1129
#11 0x0000000000ea8abc in ConverseExit () at machine-common-core.c:1335
#12 0x0000000000d4e01f in _exitHandler (env=0x3cadb20) at init.C:619
#13 0x0000000000eac334 in CmiHandleMessage (msg=0x3cadb20) at convcore.c:1633
#14 0x0000000000eac695 in CsdScheduleForever () at convcore.c:1861
#15 0x0000000000eac58b in CsdScheduler (maxmsgs=-1) at convcore.c:1797
#16 0x00000000004c3127 in master_init (argc=<optimized out>, argv=<optimized out>) at src/BackEnd.C:156
#17 BackEnd::init (argc=26865728, argv=0x0) at src/BackEnd.C:247
#18 0x00000000004bb8c3 in main (argc=26865728, argv=0x0) at src/mainfunc.C:43

and rank 15 (hanging):

#0  0x00007fa566d5f117 in sched_yield () from /lib64/libc.so.6
#1  0x00007fa56767c105 in PMPIDI_CH3I_Progress (progress_state=0x1d3d040, is_blocking=0) at ../../src/mpid/ch3/channels/nemesis/src/ch3_progress.c:913
#2  0x00007fa567914d1a in MPIR_Test_impl (request=0x1d3d040, flag=0x0, status=0x1a43313d) at ../../src/mpi/pt2pt/test.c:52
#3  0x00007fa56791494c in PMPI_Test (request=0x1d3d040, flag=0x0, status=0x1a43313d) at ../../src/mpi/pt2pt/test.c:162
#4  0x0000000000ea92e4 in ReleaseSentMessages () at machine.c:566
#5  0x0000000000ea9834 in LrtsDrainResources () at machine.c:1105
#6  0x0000000000ea8abc in ConverseExit () at machine-common-core.c:1335
#7  0x0000000000d4e01f in _exitHandler (env=0x2005a30) at init.C:619
#8  0x0000000000eac334 in CmiHandleMessage (msg=0x2005a30) at convcore.c:1633
#9  0x0000000000eac695 in CsdScheduleForever () at convcore.c:1861
#10 0x0000000000eac58b in CsdScheduler (maxmsgs=-1) at convcore.c:1797
#11 0x00000000004c3127 in master_init (argc=<optimized out>, argv=<optimized out>) at src/BackEnd.C:156
#12 BackEnd::init (argc=30658624, argv=0x0) at src/BackEnd.C:247
#13 0x00000000004bb8c3 in main (argc=30658624, argv=0x0) at src/mainfunc.C:43

#4  0x0000000000ea92e4 in ReleaseSentMessages () at machine.c:566
566             if (MPI_Test(&(msg_tmp->req), &done, &sts) != MPI_SUCCESS)
(gdb) print *msg_tmp
$1 = {msg = 0x3ef4550 "", next = 0x3070fb0, size = 1636, destpe = 153, mode = 1, req = -335544301}

Both destination PEs (destpe 86 and 153 above) have already entered MPI_Barrier.

#10 Updated by Jim Phillips about 3 years ago

One thing that confuses me is that NAMD performs quiescence detection before exiting, so where are these outstanding messages coming from?

#11 Updated by Jim Phillips about 3 years ago

The outgoing messages from rank 7 all seem to be of the same type:

(gdb) print *(CmiMsgHeaderExt*)(msg_tmp->msg)
$5 = {rank = 0, hdl = 47, xhdl = 39, info = 45, stratid = 0, redID = 0, root = 0, cksum = 0 '\000', magic = 126 '~'}
(gdb) print *(CmiMsgHeaderExt*)(msg_tmp->next->msg)
$6 = {rank = 0, hdl = 47, xhdl = 39, info = 45, stratid = 0, redID = 0, root = 0, cksum = 0 '\000', magic = 126 '~'}
(gdb) print *(CmiMsgHeaderExt*)(msg_tmp->next->next->msg)
$7 = {rank = 0, hdl = 47, xhdl = 39, info = 45, stratid = 0, redID = 0, root = 0, cksum = 0 '\000', magic = 126 '~'}
(gdb) print *(CmiMsgHeaderExt*)(msg_tmp->next->next->next->msg)
$11 = {rank = 0, hdl = 47, xhdl = 39, info = 45, stratid = 0, redID = 0, root = 0, cksum = 0 '\000', magic = 126 '~'}
(gdb) print *(CmiMsgHeaderExt*)(msg_tmp->next->next->next->next->msg)
$12 = {rank = 0, hdl = 47, xhdl = 39, info = 45, stratid = 0, redID = 0, root = 0, cksum = 0 '\000', magic = 126 '~'}
(gdb) print *(CmiMsgHeaderExt*)(msg_tmp->next->next->next->next->next->msg)
$13 = {rank = 0, hdl = 47, xhdl = 39, info = 45, stratid = 0, redID = 0, root = 0, cksum = 0 '\000', magic = 126 '~'}
(gdb) print *(CmiMsgHeaderExt*)(msg_tmp->next->next->next->next->next->next->msg)
$14 = {rank = 0, hdl = 47, xhdl = 39, info = 45, stratid = 0, redID = 0, root = 0, cksum = 0 '\000', magic = 126 '~'}
(gdb) print *(CmiMsgHeaderExt*)(msg_tmp->next->next->next->next->next->next->next->msg)
$15 = {rank = 0, hdl = 47, xhdl = 39, info = 45, stratid = 0, redID = 0, root = 0, cksum = 0 '\000', magic = 126 '~'}

(gdb) print _charmHandlerIdx
$56 = 39
(gdb) print index_skipCldHandler
$57 = 47
(gdb) print _infoIdx
$64 = 45

(gdb) print *(envelope*)msg_tmp->msg
$66 = {core = "\000\000/\000'\000-\000\000\000\000\000\000\000\000\000\000~\000", type = {chare = {
      ptr = 0x1d, forAnyPe = 0, bype = 0}, group = {g = {idx = 29}, rednMgr = {idx = 0}, dep = {idx = 0}, 
      epoch = 0, arrayEp = 0}, array = {index = {nInts = 29, dimension = 0, {index = {0, 0, 0}, 
          indexShorts = {0, 0, 0, 0, 0, 0}}}, listenerData = {0, 0}, arr = {idx = 0}, hopCount = 0 '\000', 
      ifNotThere = 0 '\000'}, objid = {id = 29, hopCount = 0 '\000', ifNotThere = 0 '\000'}, roData = {
      count = 29}, roMsg = {roIdx = 29}}, pe = 7, totalsize = 5684, event = 298609646, ref = 0, 
  priobits = 32, epIdx = 433, attribs = {msgIdx = 74 'J', mtype = 5 '\005', queueing = 4 '\004', 
    isPacked = 1 '\001', isUsed = 1 '\001'}, align = 0x45b07fd ""}
(gdb) print *(envelope*)msg_tmp->next->msg
$67 = {core = "\000\000/\000'\000-\000\000\000\000\000\000\000\000\000\000~\000", type = {chare = {
      ptr = 0x1d, forAnyPe = 0, bype = 0}, group = {g = {idx = 29}, rednMgr = {idx = 0}, dep = {idx = 0}, 
      epoch = 0, arrayEp = 0}, array = {index = {nInts = 29, dimension = 0, {index = {0, 0, 0}, 
          indexShorts = {0, 0, 0, 0, 0, 0}}}, listenerData = {0, 0}, arr = {idx = 0}, hopCount = 0 '\000', 
      ifNotThere = 0 '\000'}, objid = {id = 29, hopCount = 0 '\000', ifNotThere = 0 '\000'}, roData = {
      count = 29}, roMsg = {roIdx = 29}}, pe = 7, totalsize = 5684, event = 298626068, ref = 0, 
  priobits = 32, epIdx = 433, attribs = {msgIdx = 74 'J', mtype = 5 '\005', queueing = 4 '\004', 
    isPacked = 1 '\001', isUsed = 1 '\001'}, align = 0x479692d ""}
(gdb) print *(envelope*)msg_tmp->next->next->msg
$68 = {core = "\000\000/\000'\000-\000\000\000\000\000\000\000\000\000\000~\000", type = {chare = {
      ptr = 0x1d, forAnyPe = 0, bype = 0}, group = {g = {idx = 29}, rednMgr = {idx = 0}, dep = {idx = 0}, 
      epoch = 0, arrayEp = 0}, array = {index = {nInts = 29, dimension = 0, {index = {0, 0, 0}, 
          indexShorts = {0, 0, 0, 0, 0, 0}}}, listenerData = {0, 0}, arr = {idx = 0}, hopCount = 0 '\000', 
      ifNotThere = 0 '\000'}, objid = {id = 29, hopCount = 0 '\000', ifNotThere = 0 '\000'}, roData = {
      count = 29}, roMsg = {roIdx = 29}}, pe = 7, totalsize = 5588, event = 298627346, ref = 0, 
  priobits = 32, epIdx = 433, attribs = {msgIdx = 74 'J', mtype = 5 '\005', queueing = 4 '\004', 
    isPacked = 1 '\001', isUsed = 1 '\001'}, align = 0x46d292d ""}

(gdb) print Cpv_BOCclass_group_
$69 = {{workDistrib = {idx = 12}, patchMgr = {idx = 14}, proxyMgr = {idx = 13}, computeMgr = {idx = 15}, 
    computePmeMgr = {idx = 29}, nodePmeMgr = {idx = 1}, computeExtMgr = {idx = 17}, computeGBISserMgr = {
      idx = 18}, computeFmmSerialMgr = {idx = 19}, computeMsmSerialMgr = {idx = 20}, computeMsmMsaMgr = {
      idx = 0}, computeMsmMgr = {idx = 21}, multicastMgr = {idx = 22}, reductionMgr = {idx = 16}, 
    collectionMgr = {idx = 24}, broadcastMgr = {idx = 25}, ldbCoordinator = {idx = 26}, sync = {idx = 23}, 
    node = {idx = 28}, ioMgr = {idx = 0}, nodeProxyMgr = {idx = 2}, dataExchanger = {idx = 27}}, {
    workDistrib = {idx = 0}, patchMgr = {idx = 0}, proxyMgr = {idx = 0}, computeMgr = {idx = 0}, 
    computePmeMgr = {idx = 0}, nodePmeMgr = {idx = 0}, computeExtMgr = {idx = 0}, computeGBISserMgr = {
      idx = 0}, computeFmmSerialMgr = {idx = 0}, computeMsmSerialMgr = {idx = 0}, computeMsmMsaMgr = {
      idx = 0}, computeMsmMgr = {idx = 0}, multicastMgr = {idx = 0}, reductionMgr = {idx = 0}, 
    collectionMgr = {idx = 0}, broadcastMgr = {idx = 0}, ldbCoordinator = {idx = 0}, sync = {idx = 0}, 
    node = {idx = 0}, ioMgr = {idx = 0}, nodeProxyMgr = {idx = 0}, dataExchanger = {idx = 0}}}

Hope that last bit is useful. If I'm reading this correctly, the group index 29 in these envelopes matches computePmeMgr (idx = 29 above), so the messages are for computePmeMgr.

#12 Updated by Jim Phillips about 3 years ago

The hang does not occur with OpenMPI, so this is likely an issue with Intel's MPI library.

#13 Updated by Jim Phillips about 3 years ago

The hang also does not occur with MVAPICH.

#14 Updated by Phil Miller about 2 years ago

  • Description updated (diff)

Format description

#15 Updated by Phil Miller about 2 years ago

It would be helpful to know whether this is still an issue. MPI_Test is not supposed to be a blocking operation, though the hanging ranks may be spinning on it such that it always appears in the stack trace. If the hang does still occur, the entry method being invoked by those messages can be seen by looking up env->epIdx in _entryTable.
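
For reference, with a core from one of the stuck senders that lookup would be roughly the following (a hypothetical gdb session, not output from this run; how gdb reaches the underlying array of _entryTable may differ between Charm++ versions):

(gdb) print ((envelope*)msg_tmp->msg)->epIdx
(gdb) print _entryTable[433]->name

where 433 is the epIdx seen in the envelope dumps above and name is the registered entry method name.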

#16 Updated by Jim Phillips about 2 years ago

The hangs went away after switching from Intel MPI to OpenMPI. I'm assuming it was a bug in Intel MPI on OmniPath.

#17 Updated by Sam White over 1 year ago

  • Status changed from New to Rejected
