Bug #1658

Premature detection of Quiescence when TRAM is being used

Added by Laxmikant "Sanjay" Kale 8 days ago. Updated 8 days ago.


This was identified in the Charm++ version of Quicksilver by Karthik and Nikhil, in an LLNL internship project. There are multiple "aggregate" TRAM entry methods and non-aggregate regular entry methods executing during the time the problem arises. It is non-deterministic. Justin and Sanjay wrote a test program (in late July) that also reproduces the problem non-deterministically. Phil may have additional comments, since he was looking at it too.

(362 Bytes) Laxmikant "Sanjay" Kale, 08/12/2017 02:48 PM

hello.C View (1.92 KB) Laxmikant "Sanjay" Kale, 08/12/2017 02:49 PM

qdFix.patch View (1.81 KB) Laxmikant "Sanjay" Kale, 08/12/2017 02:55 PM


#1 Updated by Laxmikant "Sanjay" Kale 8 days ago

The bug is (I am reasonably sure) due to a faulty quiescence detection algorithm in qd.C. It employs 2 phases. Once the counts of created and processed messages (summed up via a Converse custom reduction) are equal, it does one more round to confirm that nothing has changed (which is a normal possibility, since there may be 2 messages m1 and m2 such that only m1's creation and m2's processing have been counted). It uses a notion of "dirty" on each processor to track whether something has changed. The check is at qd.h L121:
int isDirty(void) { return ((mProcessed > oProcessed) || cDirty); }
oProcessed is the number of messages processed at the previous pass (o for old) of the algorithm. Note that it is not tracking oCreated in a similar manner. And that is the bug. If a message is created but remains stuck in a TRAM (NDMeshStreamer) buffer awaiting timeout, the system won't notice the change, in a narrow sliver of situations.
One could fix this by tracking oCreated, but it's much better to fix the algorithm to rely solely on counts (and idleness).
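
To make the gap concrete, here is a minimal stand-alone sketch of the per-PE bookkeeping. It is a simplified model, not the actual class in qd.h: the struct name PeCounters, the helpers create(), process() and endPass(), and the variant isDirtyTrackingCreated() are all hypothetical, and it assumes (as described above) that creating a message does not by itself set cDirty.

```cpp
#include <cassert>
#include <cstdint>

// Simplified per-PE counters, loosely modeled on the fields named above.
// Illustrative only; this is NOT the real QD state class from qd.h.
struct PeCounters {
  std::int64_t mCreated = 0;    // messages created so far on this PE
  std::int64_t mProcessed = 0;  // messages processed so far on this PE
  std::int64_t oCreated = 0;    // hypothetical: mCreated at the previous pass
  std::int64_t oProcessed = 0;  // mProcessed at the previous pass
  bool cDirty = false;

  // Assumption: creation only bumps the counter and does not set cDirty.
  void create(std::int64_t n = 1) { assert(n >= 0); mCreated += n; }
  void process(std::int64_t n = 1) { assert(n >= 0); mProcessed += n; }

  // Snapshot taken at the end of a pass of the algorithm.
  void endPass() {
    oCreated = mCreated;
    oProcessed = mProcessed;
    cDirty = false;
  }

  // Same condition as the isDirty() quoted from qd.h.
  bool isDirty() const { return (mProcessed > oProcessed) || cDirty; }

  // Hypothetical variant that also notices newly created messages.
  bool isDirtyTrackingCreated() const {
    return isDirty() || (mCreated > oCreated);
  }
};
```

In this model, a PE that calls create() after endPass() and then leaves the message sitting in a buffer reports isDirty() == false, while the oCreated-tracking variant reports true. That is the kind of change the confirmation round can miss.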

#2 Updated by Laxmikant "Sanjay" Kale 8 days ago

A simple test program I wrote (after fixing the bug) demonstrates the problem consistently. It's a variation on hello, with 2 methods, one aggregated and one not. Each works like a chain. It fails most of the time with the current QD and succeeds with the fix. (Fix coming up in the next log).

#3 Updated by Laxmikant "Sanjay" Kale 8 days ago

The patch that fixes it is attached. I suggest, after due scrutiny and testing, we merge this, so that users have a bug-fixed version. But after that, we should clean up the code to remove the second "check dirty-ness only" phase, and any related code.

The fix is basically to maintain the global count (created and processed, which are the same number at that point) from the last time they were equal. Next time they are equal, they also need to be equal to this saved count, for us to declare quiescence. This is the original algorithm of our QD paper in 1993. I don't know who changed it to make it track "dirty" and why. For now, I have retained the dirty-tracking phase at the end, to get a bug-fix out expeditiously.
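
The count-based rule described above can be sketched as follows. This is only an illustration of the idea, not the contents of qdFix.patch: the class name CountBasedQd and its observe() method are hypothetical, and the sketch assumes it is fed the globally reduced created and processed totals once per pass.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical detector fed with the globally reduced counts after each pass.
class CountBasedQd {
 public:
  // Returns true when quiescence can be declared after this pass.
  bool observe(std::int64_t created, std::int64_t processed) {
    assert(created >= 0 && processed >= 0);
    if (created != processed) return false;  // messages still outstanding
    // Counts agree: they must also match the value saved the last time
    // they agreed, otherwise something happened in between.
    const bool quiescent = haveSaved_ && (saved_ == created);
    saved_ = created;
    haveSaved_ = true;
    return quiescent;
  }

 private:
  bool haveSaved_ = false;   // has an equal count been seen yet?
  std::int64_t saved_ = 0;   // count from the last pass where both agreed
};
```

With the m1/m2 scenario from comment #1, the first pass might see (1, 1) even though work is outstanding. The next pass sees (2, 2), which differs from the saved value, so quiescence is declared only once a later pass reports (2, 2) again.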

Will someone volunteer to test the patch, and move it through the Gerrit review?

#4 Updated by Laxmikant "Sanjay" Kale 8 days ago

  • Assignee changed from Laxmikant "Sanjay" Kale to Phil Miller

#5 Updated by Phil Miller 8 days ago

  • Status changed from New to Implemented

Confirmed that the attached test code fails consistently on netlrts-darwin-x86_64, with 2 PEs running on a single host with pre-patch code.

#7 Updated by Phil Miller 8 days ago

Confirmed that the patch fixes the test, even when modified to run itself in multiple iterations in a single job.
