Project

General

Profile

Bug #1849

mpi-linux-x86_64-syncft broken by zero copy API

Added by Sam White 16 days ago. Updated 6 days ago.

Status:
Merged
Priority:
High
Assignee:
Category:
Machine Layers
Target version:
Start date:
04/07/2018
Due date:
% Done:

0%


Description

../bin/charmc  -optimize -production   -I. -o machine.o machine.c
In file included from machine-common-core.c:439:0,
                 from machine.c:413:
machine-cma.c: In function ‘readShmCma’:
machine-cma.c:53:15: warning: implicit declaration of function ‘process_vm_readv’ [-Wimplicit-function-declaration]
   int nread = process_vm_readv(remote_pid, local, numOps, remote, numOps, 0);
               ^
machine-cma.c: In function ‘writeShmCma’:
machine-cma.c:71:15: warning: implicit declaration of function ‘process_vm_writev’ [-Wimplicit-function-declaration]
   int nread = process_vm_writev(remote_pid, local, numOps, remote, numOps, 0);
               ^
machine.c: In function ‘LrtsSendFunc’:
machine.c:47:63: error: ‘CmiMsgHeaderBasic {aka struct <anonymous>}’ has no member named ‘msgType’
 #define CMI_MSGTYPE(msg)            ((CmiMsgHeaderBasic *)msg)->msgType
                                                               ^
machine.c:542:5: note: in expansion of macro ‘CMI_MSGTYPE’
     CMI_MSGTYPE(msg) = REGULAR;
     ^
machine-onesided.c: In function ‘LrtsIssueRget’:
machine.c:47:63: error: ‘CmiMsgHeaderBasic {aka struct <anonymous>}’ has no member named ‘msgType’
 #define CMI_MSGTYPE(msg)            ((CmiMsgHeaderBasic *)msg)->msgType
                                                               ^
machine-onesided.c:133:3: note: in expansion of macro ‘CMI_MSGTYPE’
   CMI_MSGTYPE(postInfoMsg) = POST_DIRECT_SEND;
   ^
machine-onesided.c: In function ‘LrtsIssueRput’:
machine.c:47:63: error: ‘CmiMsgHeaderBasic {aka struct <anonymous>}’ has no member named ‘msgType’
 #define CMI_MSGTYPE(msg)            ((CmiMsgHeaderBasic *)msg)->msgType
                                                               ^
machine-onesided.c:202:3: note: in expansion of macro ‘CMI_MSGTYPE’
   CMI_MSGTYPE(postInfoMsg) = POST_DIRECT_RECV;
   ^
machine.c: In function ‘PumpMsgs’:
machine.c:47:63: error: ‘CmiMsgHeaderBasic {aka struct <anonymous>}’ has no member named ‘msgType’
 #define CMI_MSGTYPE(msg)            ((CmiMsgHeaderBasic *)msg)->msgType
                                                               ^
machine.c:878:16: note: in expansion of macro ‘CMI_MSGTYPE’
             if(CMI_MSGTYPE(msg) == REGULAR) {
                ^
machine.c:47:63: error: ‘CmiMsgHeaderBasic {aka struct <anonymous>}’ has no member named ‘msgType’
 #define CMI_MSGTYPE(msg)            ((CmiMsgHeaderBasic *)msg)->msgType
                                                               ^
machine.c:880:23: note: in expansion of macro ‘CMI_MSGTYPE’
             } else if(CMI_MSGTYPE(msg) == POST_DIRECT_RECV || CMI_MSGTYPE(msg) == POST_DIRECT_SEND) {
                       ^
machine.c:47:63: error: ‘CmiMsgHeaderBasic {aka struct <anonymous>}’ has no member named ‘msgType’
 #define CMI_MSGTYPE(msg)            ((CmiMsgHeaderBasic *)msg)->msgType
                                                               ^
machine.c:880:63: note: in expansion of macro ‘CMI_MSGTYPE’
             } else if(CMI_MSGTYPE(msg) == POST_DIRECT_RECV || CMI_MSGTYPE(msg) == POST_DIRECT_SEND) {
                                                               ^
machine.c:47:63: error: ‘CmiMsgHeaderBasic {aka struct <anonymous>}’ has no member named ‘msgType’
 #define CMI_MSGTYPE(msg)            ((CmiMsgHeaderBasic *)msg)->msgType
                                                               ^
machine.c:891:18: note: in expansion of macro ‘CMI_MSGTYPE’
               if(CMI_MSGTYPE(msg) == POST_DIRECT_RECV) {
                  ^

History

#1 Updated by Sam White 16 days ago

The issue looks to be that mpi/conv-mach-syncft.h was not updated with the additional field 'msgType' in the MPI direct API patch, but when I try adding it there it fixes the build but then I get hangs in tests/charm++/megatest/... Maybe it needs to be in a certain order or something?

#2 Updated by Sam White 14 days ago

  • Priority changed from Normal to High

This is a release blocker

#3 Updated by Nitin Bhat 14 days ago

I see the hang as well, when I add the fix. However, I am intermittently seeing the hang on earlier commits too. I tried previous commits and went backwards upto https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3580/. I am yet to pinpoint this to a specific commit that causes a hang. It's surprising how this wasn't hanging on the autobuild machine (respect). It's hanging quite frequently on my machine (charity). I'll manually test it on respect and see if I encounter the hang.

#4 Updated by Nitin Bhat 14 days ago

Launching a serial version using gdb ./pgm causes the hang at the following location:

CkMemCheckPT::BuddyPE (pe=0, this=0xceeec0) at ckmemcheckpoint.C:158
158 while (budpe == pe || isFailed(budpe))

migration: requires at least 2 processors.
test 44: completed (0.00 sec)
test 45: initiated [multi marshall (olawlor)]
test 45: completed (0.06 sec)
test 46: initiated [multi priomsg (fang)]
test 46: completed (0.00 sec)
test 47: initiated [multi priotest (mlind)]
test 47: completed (0.00 sec)
test 48: initiated [multi statistics (olawlor)]
test 48: completed (0.00 sec)
test 49: initiated [multi reduction (olawlor)]
test 49: completed (0.00 sec)
test 50: initiated [multi immediatering (gengbin)]
test 50: completed (0.00 sec)
test 51: initiated [multi callback (olawlor)]
test 51: completed (0.00 sec)
test 52: initiated [all-at-once]
varsize: requires at least 2 processors
varraystest: requires at least 2 processors
groupsectiontest: requires at least 2 processors
migration: requires at least 2 processors.
multisectiontest: requires at least 2 processors
^C
Thread 1 "pgm" received signal SIGINT, Interrupt.
CkMemCheckPT::BuddyPE (pe=0, this=0xceeec0) at ckmemcheckpoint.C:158
158      while (budpe == pe || isFailed(budpe))
(gdb)

#5 Updated by Sam White 14 days ago

Which of the while expressions is always true?

#6 Updated by Nitin Bhat 14 days ago

The first expression is true. The while loop seems to become an infinite loop when both pe and budpe are 0.

Print from inside the while loop returns the following:

[0] inside while loop budpe:0, pe:0, isFailed(budpe):0, CkNumPes():1
[0] inside while loop budpe:0, pe:0, isFailed(budpe):0, CkNumPes():1
[0] inside while loop budpe:0, pe:0, isFailed(budpe):0, CkNumPes():1
..... 
^ same print repeats

#7 Updated by Sam White 14 days ago

  • Status changed from New to In Progress

Yeah, in-memory checkpointing can't work in the +p1 case, since there's no remote memory to checkpoint and restart from. I think the buddyPE routine should abort if CkNumPes()==1.

There's a separate question of what Charm++ should do when a user calls CkStartMemCheckpoint() in a run on 1 PE. Should we print a warning and return, or should we abort? I'm not sure

#8 Updated by Sam White 13 days ago

Of course we also need to make the test skip this case when run with +p1 to avoid failing out of the tests.

With skipping that, does mpi-syncft pass all the tests?

#9 Updated by Sam White 10 days ago

Hmm, the same test passes for netlrts-darwin-x86_64-syncft. I fixed a recently introduced build error for syncft here: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3984/

#10 Updated by Nitin Bhat 7 days ago

It seems like the reason this bug is showing up on MPI alone is this code in CkMemCheckPT::CkMemCheckPT() in ckmemcheckpoint.C

352:#if CMK_CONVERSE_MPI
353-  void pingBuddy();
354-  void pingCheckHandler();
355-  CcdCallOnCondition(CcdPERIODIC_100ms,(CcdVoidFn)pingBuddy,NULL);
356-  CcdCallOnCondition(CcdPERIODIC_5s,(CcdVoidFn)pingCheckHandler,NULL);
357-#endif

This is called in ConverseInit and results in calling of the pingBuddy method that causes the hang. This is not called on other build archs.

#11 Updated by Sam White 6 days ago

I think there should be an "if (CkNumPes() > 1)" around that block?

#12 Updated by Sam White 6 days ago

Does the test pass when run with +p2?

#13 Updated by Nitin Bhat 6 days ago

Yeah, I think the solution is to have that block around it.

Also, I think the reason this was not caught was because in previous autobuilds, it was minimally tested with just "make -C ../tests mpisyncfttest" after LIBS was built. And none of those tests tested with just one PE.

Yes, it passes with +p2, +p3 and +p4. I tested with around 40 iterations.

#14 Updated by Nitin Bhat 6 days ago

  • Status changed from In Progress to Implemented

#15 Updated by Nitin Bhat 6 days ago

  • Status changed from Implemented to Merged

Also available in: Atom PDF