SMP hangs in megatest/multisectiontest
Jenkins builds and the netlrts-linux-x86_64-smp autobuild target are hanging pretty consistently in tests/charm++/megatest/multisectiontest, like this:
test 36: initiated [multi groupmulti (gengbin)]
test 36: completed (0.23 sec)
test 37: initiated [multi groupsectiontest (ebohm)]
test 37: completed (0.54 sec)
test 38: initiated [multi multisectiontest (ebohm)]
Build timed out (after 60 minutes). Marking the build as aborted.
#7 Updated by Juan Galvez over 1 year ago
This is a QD bug introduced by the group dependence patch.
It hangs multisectiontest only because that test uses QD (quiescence detection). The hang occurs only if the new "groupdependence" test added to megatest runs (without the groupdependence test it never hangs).
And it hangs only if, during the groupdependence test, the combination of a group dependence and the message arrival order causes a message (intended for either a singleton chare or a nodegroup) to be temporarily buffered. This situation doesn't happen very frequently, which is why the bug is hard to replicate.
The actual cause of the bug is that the QD counters for a buffered message are incremented more than once: the call to `ck->process()` for these message types occurs in `_processHandler`, which is called every time the message is dequeued. A message is dequeued once when it first arrives, and again each time it is released after being buffered because of a group dependency.
The solution appears to be to call `ck->process()` not in `_processHandler` but in the function that processes the message (e.g. `_processNodeBocInitMsg`), right before the entry method is invoked. If this is done, the `_initHandler` function in init.C must also be modified so that it no longer calls `CpvAccess(_qd)->process();` for NodeBocInitMsg; otherwise QD doesn't work.
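
To make the double counting concrete, here is a toy model of the situation described above. It is not Charm++ code: `QdCounters`, `Msg`, and `simulate` are hypothetical names, and the sketch assumes that QD can only declare quiescence when the number of messages created equals the number processed. One message arrives before its dependency is ready, gets buffered, and is then released and dequeued a second time.

```cpp
#include <cassert>
#include <deque>

// Toy model only; none of these names exist in Charm++.
struct QdCounters {
    int created = 0;    // messages sent
    int processed = 0;  // messages counted as processed
};

struct Msg {
    int id;
    bool needsDependency;  // target depends on a group that may not exist yet
};

// Delivers a single message that arrives before its dependency is satisfied.
// countOnDequeue == true  : count in the dequeue handler (models the bug).
// countOnDequeue == false : count right before the entry method runs (models the fix).
QdCounters simulate(bool countOnDequeue) {
    QdCounters qd;
    std::deque<Msg> queue;
    std::deque<Msg> buffered;
    bool dependencyReady = false;

    queue.push_back({1, true});
    qd.created++;

    while (!queue.empty()) {
        Msg m = queue.front();
        queue.pop_front();

        if (countOnDequeue) qd.processed++;  // runs on every dequeue

        if (m.needsDependency && !dependencyReady) {
            // Dependency missing: buffer the message instead of invoking it.
            buffered.push_back(m);
            // Later the dependency appears and buffered messages are re-enqueued.
            dependencyReady = true;
            while (!buffered.empty()) {
                queue.push_back(buffered.front());
                buffered.pop_front();
            }
            continue;
        }

        if (!countOnDequeue) qd.processed++;  // runs once, at invocation
        // The entry method would be invoked here.
    }
    return qd;
}
```

In this model, `simulate(true)` ends with one message created but two processed, so the counters never balance, while `simulate(false)` ends with both counters at one.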
#8 Updated by Nitin Bhat over 1 year ago
- Status changed from In Progress to Implemented