Bug #1870

Hang in mpi-linux-x86_64-syncft when run with 1 PE

Added by Nitin Bhat over 1 year ago. Updated over 1 year ago.

Status: Merged
Priority: High
Assignee:
Category: Fault Tolerance
Target version:
Start date: 04/17/2018
Due date:
% Done: 0%


Description

This issue was initially thought to be associated with the Zerocopy API, but it is a separate issue.

Discussion log:

#1 Updated by Sam White 10 days ago
The issue looks to be that mpi/conv-mach-syncft.h was not updated with the additional 'msgType' field from the MPI direct API patch. Adding the field there fixes the build, but then I get hangs in tests/charm++/megatest/... Maybe it needs to be in a certain order or something?

#2 Updated by Sam White 8 days ago
This is a release blocker

#3 Updated by Nitin Bhat 8 days ago
I see the hang as well when I add the fix. However, I am intermittently seeing the hang on earlier commits too. I tried previous commits, going back to https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3580/, but I have yet to pinpoint the specific commit that causes the hang. It's surprising that this wasn't hanging on the autobuild machine (respect); it hangs quite frequently on my machine (charity). I'll manually test it on respect and see if I encounter the hang.

#4 Updated by Nitin Bhat 8 days ago
Launching a serial run under gdb (gdb ./pgm) shows the hang at the following location:

CkMemCheckPT::BuddyPE (pe=0, this=0xceeec0) at ckmemcheckpoint.C:158
158 while (budpe == pe || isFailed(budpe))

migration: requires at least 2 processors.
test 44: completed (0.00 sec)
test 45: initiated [multi marshall (olawlor)]
test 45: completed (0.06 sec)
test 46: initiated [multi priomsg (fang)]
test 46: completed (0.00 sec)
test 47: initiated [multi priotest (mlind)]
test 47: completed (0.00 sec)
test 48: initiated [multi statistics (olawlor)]
test 48: completed (0.00 sec)
test 49: initiated [multi reduction (olawlor)]
test 49: completed (0.00 sec)
test 50: initiated [multi immediatering (gengbin)]
test 50: completed (0.00 sec)
test 51: initiated [multi callback (olawlor)]
test 51: completed (0.00 sec)
test 52: initiated [all-at-once]
varsize: requires at least 2 processors
varraystest: requires at least 2 processors
groupsectiontest: requires at least 2 processors
migration: requires at least 2 processors.
multisectiontest: requires at least 2 processors
^C
Thread 1 "pgm" received signal SIGINT, Interrupt.
CkMemCheckPT::BuddyPE (pe=0, this=0xceeec0) at ckmemcheckpoint.C:158
158      while (budpe == pe || isFailed(budpe))
(gdb)

#5 Updated by Sam White 8 days ago
Which of the while expressions is always true?

#6 Updated by Nitin Bhat 8 days ago
The first expression is true. The while loop seems to become an infinite loop when both pe and budpe are 0.

Printing from inside the while loop gives the following:

[0] inside while loop budpe:0, pe:0, isFailed(budpe):0, CkNumPes():1
[0] inside while loop budpe:0, pe:0, isFailed(budpe):0, CkNumPes():1
[0] inside while loop budpe:0, pe:0, isFailed(budpe):0, CkNumPes():1
.....

^ same print repeats
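
For reference, the loop follows a buddy-selection pattern roughly like this (a sketch; the exact body of CkMemCheckPT::BuddyPE in ckmemcheckpoint.C may differ slightly):

int CkMemCheckPT::BuddyPE(int pe)
{
  int budpe = pe;
  // advance until we find a PE that is neither ourselves nor marked failed
  while (budpe == pe || isFailed(budpe))
    budpe = (budpe + 1) % CkNumPes();
  return budpe;
}

With CkNumPes() == 1, the modulo wraps budpe straight back to 0 == pe, so the first expression stays true forever.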

#7 Updated by Sam White 8 days ago
Yeah, in-memory checkpointing can't work in the +p1 case, since there's no remote memory to checkpoint and restart from. I think the BuddyPE routine should abort if CkNumPes()==1.

There's a separate question of what Charm++ should do when a user calls CkStartMemCheckpoint() in a run on 1 PE. Should we print a warning and return, or should we abort? I'm not sure.
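
A minimal sketch of the two options, assuming the check would live in the public entry point CkStartMemCheckpoint(CkCallback &) (illustrative only, not an actual patch):

void CkStartMemCheckpoint(CkCallback &cb)
{
  if (CkNumPes() == 1) {
    // Option A: warn and return, treating the checkpoint request as a no-op
    CkPrintf("Warning: in-memory checkpointing requires at least 2 PEs; ignoring request.\n");
    cb.send();
    return;
    // Option B would instead be:
    // CkAbort("In-memory checkpointing requires at least 2 PEs");
  }
  // ... normal checkpoint path ...
}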

#8 Updated by Sam White 7 days ago
Of course we also need to make the test skip this case when run with +p1 to avoid failing out of the tests.

With that case skipped, does mpi-syncft pass all the tests?
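
For reference, megatest modules that need multiple PEs already guard themselves with a pattern along these lines (the function name here is hypothetical, following the megatest convention; the checkpoint test would need the same kind of guard):

void syncft_skip_init(void)  // hypothetical init routine for the test being skipped
{
  if (CkNumPes() < 2) {
    CkError("syncfttest: requires at least 2 processors.\n");
    megatest_finish();  // report completion so the driver moves on instead of hanging
    return;
  }
  // ... start the real test here ...
}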

#9 Updated by Sam White 4 days ago
Hmm, the same test passes for netlrts-darwin-x86_64-syncft. I fixed a recently introduced build error for syncft here: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3984/

#10 Updated by Nitin Bhat about 20 hours ago
It seems the reason this bug shows up only on MPI is this code (around line 352) in CkMemCheckPT::CkMemCheckPT() in ckmemcheckpoint.C:


#if CMK_CONVERSE_MPI
  void pingBuddy();
  void pingCheckHandler();
  CcdCallOnCondition(CcdPERIODIC_100ms,(CcdVoidFn)pingBuddy,NULL);
  CcdCallOnCondition(CcdPERIODIC_5s,(CcdVoidFn)pingCheckHandler,NULL);
#endif

This is called from ConverseInit and results in pingBuddy being called, which causes the hang. This code is not called on other build architectures.

#11 Updated by Sam White about 14 hours ago
I think there should be an "if (CkNumPes() > 1)" around that block?
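
Roughly like this (a sketch of the suggestion, not the merged patch):

#if CMK_CONVERSE_MPI
  if (CkNumPes() > 1) {
    void pingBuddy();
    void pingCheckHandler();
    CcdCallOnCondition(CcdPERIODIC_100ms,(CcdVoidFn)pingBuddy,NULL);
    CcdCallOnCondition(CcdPERIODIC_5s,(CcdVoidFn)pingCheckHandler,NULL);
  }
#endif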

#12 Updated by Sam White about 10 hours ago
Does the test pass when run with +p2?

#13 Updated by Nitin Bhat about 1 hour ago
Yeah, I think the solution is to put that guard around the block.

Also, I think the reason this was not caught is that previous autobuilds tested it minimally with just "make -C ../tests mpisyncfttest" after LIBS was built, and none of those tests ran with just one PE.

Yes, it passes with +p2, +p3 and +p4. I tested with around 40 iterations.

History

#1 Updated by Nitin Bhat over 1 year ago

  • Status changed from New to Implemented

#2 Updated by Nitin Bhat over 1 year ago

  • Status changed from Implemented to Merged
