Bug #1279

Proactive fault tolerance fails due to sending message to dead node.

Added by Justin Miron over 2 years ago. Updated almost 2 years ago.

Status: New
Priority: Normal
Assignee:
Category: Fault Tolerance
Target version: -
Start date: 11/03/2016
Due date:
% Done: 50%


Description

CkLocMgr::deliverMsg attempts to send messages to evacuated chares.

The send aborts at this check:

    if (!CmiNodeAlive(destPE) && destPE != allowMessagesOnly) {
        CkAbort("Cannot send to a chare on a dead node");
    }

CmiNodeAlive checks whether the valid-processor bit is set for destPE.
allowMessagesOnly should be set to msg->pe on every node during the ACK to the evacuation. This happens AFTER the evacuation has occurred and the PE has announced it.

Because allowMessagesOnly is set after the valid-processor bit is cleared to 0, a message whose delivery is attempted between these two events can trigger the abort.

Investigating setting the allowMessagesOnly value first.
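
To make the ordering problem concrete, here is a minimal toy model of the check. It is not Charm++ code: NodeView, Order, and deliverableBetweenUpdates are hypothetical names, and the vector of flags only stands in for the valid-processor bits. It shows what a delivery attempted between the two updates would see under each ordering.

```cpp
#include <cassert>
#include <vector>

// Toy model of the per-node state consulted by the check. NOT Charm++ code;
// all names here are hypothetical.
struct NodeView {
    std::vector<bool> alive;     // stands in for the valid-processor bits
    int allowMessagesOnly = -1;  // PE still allowed to receive messages; -1 = none

    explicit NodeView(int numPEs) : alive(numPEs, true) {}

    // Mirrors the condition from the report: delivery is permitted only if
    // the PE is alive or is the explicitly allowed PE.
    bool canDeliver(int destPE) const {
        assert(destPE >= 0 && destPE < static_cast<int>(alive.size()));
        return alive[destPE] || destPE == allowMessagesOnly;
    }
};

enum class Order { BitFirst, AllowFirst };

// Applies only the FIRST of the two updates in the given order, then reports
// whether a delivery attempted at that moment would pass the check.
bool deliverableBetweenUpdates(Order order, int evacuatedPE, int numPEs) {
    NodeView view(numPEs);
    if (order == Order::BitFirst) {
        view.alive[evacuatedPE] = false;        // bit cleared, allow not yet set
    } else {
        view.allowMessagesOnly = evacuatedPE;   // allow set, bit not yet cleared
    }
    return view.canDeliver(evacuatedPE);
}
```

In this model, clearing the bit first leaves a window in which the check rejects the evacuated PE, while setting allowMessagesOnly first does not. Whether reordering is sufficient in the real runtime is exactly what is being investigated here.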

History

#1 Updated by Justin Miron over 2 years ago

This check may not be necessary. If the destination PE was on a node that is now dead, the code should call DeliverUnknown, since the chare may have been migrated. That triggers a delivery to the homePE, however, so it will still fail if the homePE is the dead processor.

Check referred to:

    if (!CmiNodeAlive(destPE) && destPE != allowMessagesOnly) {
        CkAbort("Cannot send to a chare on a dead node");
    }
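
The fallback described in this comment can be sketched as a small routing decision. This is only an illustration under stated assumptions: Route and chooseRoute are hypothetical names, the alive vector stands in for CmiNodeAlive, and the real DeliverUnknown path does more than this.

```cpp
#include <cassert>
#include <vector>

// Hypothetical outcome of a delivery attempt; not a Charm++ type.
enum class Route { Direct, ViaHome, Undeliverable };

// Sketch of the proposed behavior: instead of aborting when the last known
// PE is dead, fall back to the home PE, and fail only if that is dead too.
Route chooseRoute(int lastKnownPE, int homePE, const std::vector<bool>& alive) {
    const int numPEs = static_cast<int>(alive.size());
    assert(lastKnownPE >= 0 && lastKnownPE < numPEs);
    assert(homePE >= 0 && homePE < numPEs);

    if (alive[lastKnownPE]) return Route::Direct;   // normal delivery
    if (alive[homePE])      return Route::ViaHome;  // chare may have migrated
    return Route::Undeliverable;                    // home PE is dead as well
}
```

The last branch is the case this comment warns about: dropping the check only helps while the homePE itself is still alive.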

#2 Updated by Sam White over 2 years ago

This was changed as part of the 64-bit ID changes. Look at line 2635 here:
https://charm.cs.illinois.edu/gerrit/#/c/1217/5/src/ck-core/cklocation.C

#3 Updated by Justin Miron over 2 years ago

  • % Done changed from 0 to 50

Thanks, that helps a lot.

Reinserting the getNextPE code works when the next PE is derived from the evacuated destPE integer. Using the CkArrayIndices causes problems because the CkArrayIndex* passed in is sometimes NULL.

getNextPE previously used a hash of the CkArrayIndices; an equivalent is needed for the integers.

The CkAbort is now avoided, but proactive fault tolerance still hangs.
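
One possible shape for the integer equivalent of the hash mentioned above is sketched here. It is an assumption, not the original getNextPE: nextAlivePE is a hypothetical name, the mixing step is a splitmix64-style finalizer chosen only for illustration, and the alive vector stands in for CmiNodeAlive.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: picks a live PE other than deadPE for a 64-bit id.
// Returns -1 if no such PE exists.
int nextAlivePE(std::uint64_t id, int deadPE, const std::vector<bool>& alive) {
    const int numPEs = static_cast<int>(alive.size());
    assert(deadPE < numPEs);
    if (numPEs == 0) return -1;

    // splitmix64-style mixing so that nearby ids spread across PEs.
    std::uint64_t h = id;
    h ^= h >> 30; h *= 0xbf58476d1ce4e5b9ULL;
    h ^= h >> 27; h *= 0x94d049bb133111ebULL;
    h ^= h >> 31;

    // Start at the hashed PE and probe linearly until a live PE is found.
    const int start = static_cast<int>(h % static_cast<std::uint64_t>(numPEs));
    for (int i = 0; i < numPEs; ++i) {
        const int pe = (start + i) % numPEs;
        if (pe != deadPE && alive[pe]) return pe;
    }
    return -1;  // every candidate PE is dead or excluded
}
```

Because the result depends only on the id and the set of live PEs, every node that agrees on which PEs are alive computes the same target, which is the property a hash-based getNextPE would need.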

#4 Updated by Sam White over 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

#5 Updated by Eric Bohm about 2 years ago

  • Assignee changed from Justin Miron to PPL

#6 Updated by Sam White almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#7 Updated by Phil Miller almost 2 years ago

  • Target version deleted (6.9.0)

#8 Updated by Sam White almost 2 years ago

  • Assignee deleted (PPL)
  • Status changed from In Progress to New

#9 Updated by Eric Bohm almost 2 years ago

  • Assignee set to Juan Galvez
