Bug #1832

SMP hangs in megatest/multisectiontest

Added by Sam White over 1 year ago. Updated over 1 year ago.

Status: Merged
Priority: High
Assignee:
Category: -
Target version:
Start date: 03/17/2018
Due date:
% Done: 0%


Description

Jenkins builds and the netlrts-linux-x86_64-smp autobuild target are hanging pretty consistently in tests/charm++/megatest/multisectiontest, like this:

test 36: initiated [multi groupmulti (gengbin)]
test 36: completed (0.23 sec)
test 37: initiated [multi groupsectiontest (ebohm)]
test 37: completed (0.54 sec)
test 38: initiated [multi multisectiontest (ebohm)]
Build timed out (after 60 minutes). Marking the build as aborted.

https://charm.cs.illinois.edu/autobuild/cur/netlrts-linux-x86_64-smp.txt

History

#1 Updated by Nitin Bhat over 1 year ago

  • Status changed from New to In Progress

#2 Updated by Sam White over 1 year ago

  • Priority changed from Normal to High

This is one of the few remaining bugs that would block the release of 6.9.0.

#3 Updated by Juan Galvez over 1 year ago

Do we know the commit that triggered this bug?

#4 Updated by Sam White over 1 year ago

Originally [1] caused failures in this test. That initial failure was then fixed by [2], but now we are seeing this hang in the same test.

[1] https://charm.cs.illinois.edu/gerrit/#/c/3528/
[2] https://charm.cs.illinois.edu/gerrit/#/c/3850/

#5 Updated by Juan Galvez over 1 year ago

Yeah, I thought it had been fixed by the above patch.

So what we are getting now is a random hang, not always in the same place?

#6 Updated by Juan Galvez over 1 year ago

Most of the hangs or crashes are in megatest/multisectiontest, but today SMP seemed to fail in tests/converse/commbench. Not sure whether that was just a spurious failure.

#7 Updated by Juan Galvez over 1 year ago

This is a QD (quiescence detection) bug introduced by the group dependence patch.

It hangs multisectiontest only because that test uses QD. The hang occurs only if the new "groupdependence" test added to megatest runs (without the groupdependence test it never hangs).
It also hangs only if, during the groupdependence test, the combination of a group dependence and message arrival order causes a message (intended for either a singleton chare or a nodegroup) to be temporarily buffered. This doesn't happen very frequently, which is why the bug is hard to replicate.

The actual cause of the bug is that the QD counters for the buffered message are incremented more than once, because the call to `ck->process()` for these msg types occurs in `_processHandler`, which is called every time the msg is dequeued. The message is dequeued the first time it arrives, and again each time it is released after being buffered because of a group dependency.

The solution appears to be to call `ck->process()` not in `_processHandler` but in the function that processes the msg (e.g. `_processNodeBocInitMsg`), right before the entry method is invoked. If this is done, note that the `_initHandler` function in init.C also needs to be modified so that it no longer calls `CpvAccess(_qd)->process();` for NodeBocInitMsg; otherwise QD doesn't work.
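
To make the double counting concrete, here is a minimal standalone sketch of the mechanism described above. It is a toy model, not Charm++ code: `QdCounter`, `Msg`, `dependencyReady`, and `countProcessedForBufferedMsg` are all hypothetical names invented for illustration, and the loop only imitates a message being dequeued, buffered once on an unmet group dependency, released, and dequeued again.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for a QD "processed" counter; not the real Charm++ QD state.
struct QdCounter {
    int processed = 0;
    void process() { ++processed; }
};

// Hypothetical message. `dependencyReady` models whether the group it
// depends on is available yet.
struct Msg {
    bool dependencyReady = false;
};

// Simulates one message that arrives before its group dependency is met,
// is buffered once, and is then released and dequeued a second time.
// countInHandler == true  : count on every dequeue (the placement described as buggy).
// countInHandler == false : count only right before the entry method is invoked.
// Returns how many times the single message was counted as processed.
int countProcessedForBufferedMsg(bool countInHandler) {
    QdCounter qd;
    std::deque<Msg*> queue;     // toy scheduler queue
    std::deque<Msg*> buffered;  // messages waiting on a dependency
    Msg msg;
    queue.push_back(&msg);

    int invoked = 0;  // number of times the "entry method" actually ran
    while (!queue.empty()) {
        Msg* m = queue.front();
        queue.pop_front();

        // Counting at the top of the handler runs once per dequeue.
        if (countInHandler) qd.process();

        if (!m->dependencyReady) {
            // Dependency not met: buffer the message.
            buffered.push_back(m);
            // Model the dependency being satisfied, then release the message
            // back onto the queue so it is dequeued again.
            m->dependencyReady = true;
            queue.push_back(buffered.front());
            buffered.pop_front();
            continue;
        }

        // Counting here runs exactly once, just before delivery.
        if (!countInHandler) qd.process();
        ++invoked;
    }

    assert(invoked == 1);  // the entry method runs once in both variants
    return qd.processed;
}
```

In this model, counting on every dequeue records the one message twice, while counting just before delivery records it once. With mismatched created/processed totals a quiescence check can never balance, which is consistent with the hang described above.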

#8 Updated by Nitin Bhat over 1 year ago

  • Status changed from In Progress to Implemented

#9 Updated by Nitin Bhat over 1 year ago

  • Status changed from Implemented to Merged
