Bug #1507

ckio test failure on gni-crayxc

Added by Sam White 7 months ago. Updated 3 months ago.

Status:
Merged
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
04/16/2017
Due date:
% Done:

0%


Description

The ckio test has failed for the past 2+ days on autobuild on gni-crayxc, seemingly due to a race condition between array construction and use: http://charm.cs.uiuc.edu/autobuild/cur/gni-crayxc.txt

make[3]: Entering directory `/scratch1/scratchdirs/acun/autobuild/gni-crayxc/charm/gni-crayxc/tests/charm++/io'
../../../bin/testrun  ./pgm +p4 4  

Running on 4 processors:  ./pgm 4
srun -n 4 -c 2 ./pgm 4
srun: Job step creation temporarily disabled, retrying
srun: Job step created
Charm++> Running on Gemini (GNI) with 4 processes
Charm++> static SMSG
Charm++> SMSG memory: 19.8KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 4
Converse/Charm++ Commit ID: 7788005
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (48-way SMP).
Main ran
Main saw file ready
Main saw file ready
Main saw file ready
Main saw file ready
Main saw file ready
Main saw file ready
Main saw session ready
Main saw session ready
Main saw session ready
Main saw session ready
Main saw session ready
Main saw session ready
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Cannot send a message from an array without a local branch

Related issues

Related to Charm++ - Feature #1352: CkArrayOptions callback for completion of chare array initialization (Merged, 01/08/2017)
Related to Charm++ - Bug #1652: CkArray::ckDestroy() does not delete CkMulticastMgr (Merged, 08/04/2017)
Related to Charm++ - Bug #1647: ckNew(): CkReductionMgr not constructed on all PEs (Closed, 08/01/2017)

History

#1 Updated by Phil Miller 7 months ago

  • Priority changed from Normal to High

#2 Updated by Phil Miller 7 months ago

I may have a simpler test case for this, or at least one that exhibits the same CmiAbort behavior.

#3 Updated by Phil Miller 7 months ago

The interesting note from the case I looked into is that a sane array ID is returned into the proxy object, but the manager and elements themselves all somehow see zero instead, and hence don't get back an object pointer when they go looking for it.

#4 Updated by Thomas Quinn 7 months ago

I've been having a CkIO failure on a ChaNGa production run on Blue Waters that may be related to this. The symptom in my case is that a CkReductionMgr associated with the CkIO WriteSession group is having its AddToInactiveList() method called, and the object members are garbage (in particular, inactiveList is bad).

#5 Updated by Phil Miller 6 months ago

  • Assignee changed from Ronak Buch to Phil Miller

#6 Updated by Phil Miller 6 months ago

  • Related to Feature #1352: CkArrayOptions callback for completion of chare array initialization added

#7 Updated by Phil Miller 6 months ago

Looks like the issue is that a message referencing the newly constructed write session is reaching PEs other than 0 before the constructor for the write session array has run there. I can't rely on a group dependence for the ready message CkIO sends out, because that dependence may only be satisfied on one PE that has the array manager constructed, which then broadcasts the array ID to PEs that don't yet have it constructed. So, #1352 would be really helpful here.
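To make the ordering problem above concrete, here is a minimal sketch in plain C++ that does not use any Charm++ API. Every name in it (SimulatedPe, InitBarrier, racyStartupAborts, guardedStartupDeliveries) is hypothetical and exists only for illustration. It assumes the failure is modeled as "a PE receives a message for an array whose local manager has not been constructed yet", and it models a completion callback in the spirit of #1352 as a simple counter that fires once every PE has reported construction.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for one PE's view of a distributed array.
struct SimulatedPe {
    bool managerConstructed = false;
    int delivered = 0;

    void constructManager() { managerConstructed = true; }

    // Deliver a message that targets the array. Throws when the local
    // manager is missing, loosely mirroring the CmiAbort in the log.
    void deliver() {
        if (!managerConstructed)
            throw std::runtime_error("no local branch for array");
        ++delivered;
    }
};

// Hypothetical completion barrier: runs onReady exactly once, after
// every PE has reported that its manager is constructed.
class InitBarrier {
public:
    InitBarrier(std::size_t numPes, std::function<void()> onReady)
        : remaining_(numPes), onReady_(std::move(onReady)) {}

    void reportConstructed() {
        if (remaining_ == 0) return;       // already fired, or no PEs
        if (--remaining_ == 0) onReady_();
    }

private:
    std::size_t remaining_;
    std::function<void()> onReady_;
};

// Racy ordering: the "ready" broadcast goes out as soon as PE 0 has
// constructed its manager. Returns true if some PE would have aborted.
bool racyStartupAborts(std::size_t numPes) {
    if (numPes == 0) return false;
    std::vector<SimulatedPe> pes(numPes);
    pes[0].constructManager();
    try {
        for (auto& pe : pes) pe.deliver();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Guarded ordering: the broadcast is issued only from the barrier's
// callback. Returns the total number of messages delivered.
int guardedStartupDeliveries(std::size_t numPes) {
    std::vector<SimulatedPe> pes(numPes);
    InitBarrier barrier(numPes, [&pes] {
        for (auto& pe : pes) pe.deliver();
    });
    for (auto& pe : pes) {
        pe.constructManager();
        barrier.reportConstructed();
    }
    int total = 0;
    for (const auto& pe : pes) total += pe.delivered;
    return total;
}
```

With four simulated PEs, racyStartupAborts(4) reports an abort because PEs 1 through 3 have no manager when the broadcast arrives, while guardedStartupDeliveries(4) delivers to all four. The real fix may well differ in mechanism; the sketch only shows why a dependence satisfied on a single PE is not enough.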

#8 Updated by Phil Miller 6 months ago

  • Status changed from New to In Progress

#9 Updated by Phil Miller 6 months ago

  • Status changed from In Progress to Implemented

https://charm.cs.illinois.edu/gerrit/2519

Tom, if you're still seeing this issue, could you try the above patch and let us know if that resolves it? Note that you'll have to pull in its predecessor as well, since it relies on a just-implemented bit of functionality that's not on mainline yet.

#10 Updated by Thomas Quinn 6 months ago

It will take me a little while to reproduce my problem, since it usually happens after restarting from a checkpoint.

But I will give it a go.

#11 Updated by Phil Miller 6 months ago

  • Status changed from Implemented to In Progress

I'm seeing issues with that patch on simple ChaNGa test runs. Working through them now.

#12 Updated by Phil Miller 6 months ago

  • Status changed from In Progress to Implemented

The underlying issue with the patch provided, given that it was failing after restart from a checkpoint, was that an array map ID wasn't marked readonly, and so got lost between the initial and restarted run. Having fixed that (see patch series), ChaNGa's tests seem to run correctly for me.
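As a rough illustration of how an ID can go missing across a restart, the following toy model in plain C++ saves only the values flagged as readonly at checkpoint time and rebuilds state from that snapshot on restart. It is a sketch under stated assumptions, not Charm++ code: Globals, Snapshot, checkpoint, restart, and mapIdAfterRestart are all hypothetical names, the values 7 and 3 are arbitrary, and 0 is assumed to mean "unset".

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical snapshot format: name -> saved value.
using Snapshot = std::map<std::string, int>;

// Hypothetical global state; 0 stands for "unset".
struct Globals {
    int mapId = 0;
    int sessionId = 0;
};

// Save only the values the toy registry treats as readonly.
// sessionId is always saved; mapId only if it was marked readonly.
Snapshot checkpoint(const Globals& g, bool mapIdIsReadonly) {
    Snapshot s;
    s["sessionId"] = g.sessionId;
    if (mapIdIsReadonly) s["mapId"] = g.mapId;
    return s;
}

// Rebuild state in a "fresh process": anything missing from the
// snapshot keeps its default (unset) value.
Globals restart(const Snapshot& s) {
    Globals g;
    auto it = s.find("sessionId");
    if (it != s.end()) g.sessionId = it->second;
    it = s.find("mapId");
    if (it != s.end()) g.mapId = it->second;
    return g;
}

// Run one checkpoint/restart cycle and report the surviving map ID.
int mapIdAfterRestart(bool mapIdIsReadonly) {
    Globals g;
    g.mapId = 7;      // arbitrary illustrative value
    g.sessionId = 3;  // arbitrary illustrative value
    return restart(checkpoint(g, mapIdIsReadonly)).mapId;
}
```

In this model, mapIdAfterRestart(false) comes back as 0 while mapIdAfterRestart(true) preserves 7, which matches the shape of the failure described above: the initial run works, and only the restarted run sees an unset ID.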

#13 Updated by Phil Miller 6 months ago

  • Status changed from Implemented to Merged

#14 Updated by Thomas Quinn 6 months ago

I'm not sure if it is the same bug, but after several checkpoints I get the following errors while in CkIO:
------------- Processor 2760 Exiting: Called CmiAbort ------------
Reason: Error! This group proxy has not been initialized!
------------- Processor 14927 Exiting: Called CmiAbort ------------
Reason: Requested adjustment to prior reduction!

This is with v6.8.0-beta1-156-gfff8f25 on both a Cray XE and a Cray XC.

#15 Updated by Phil Miller 6 months ago

  • Status changed from Merged to In Progress

Re-opening pending investigation

#16 Updated by Thomas Quinn 6 months ago

I tried this with a later version of charm (cd4d6f8), and things seem to now be OK.
Perhaps this was related to Bug #1568: Fix chare array initCallback to deliver a CkReductionMsg?

#17 Updated by Phil Miller 6 months ago

  • Status changed from In Progress to Merged

The subsequently fixed issue you found seems likely to have been the culprit. Reclosing for now.

#18 Updated by Thomas Quinn 4 months ago

I'm getting this failure again on Blue Waters using charm version 5fd855b. It seems to only be happening with a large node count (1024) and after a restart. I've got the latest crash in ~trq/scratch/h201824mr if that helps.

#19 Updated by Phil Miller 3 months ago

  • Target version changed from 6.8.0 to 6.8.1
  • Status changed from Merged to Feedback

Tom, with the other fix that you recently pushed, could you test that this still reproduces, and potentially open a new bug with the full details of the current observed failure?

#20 Updated by Thomas Quinn 3 months ago

The fix I pushed is a direct result of digging into the crash here. I'm almost certain that this is fixed, but I am rerunning to be sure.

#21 Updated by Phil Miller 3 months ago

  • Related to Bug #1652: CkArray::ckDestroy() does not delete CkMulticastMgr added

#22 Updated by Phil Miller 3 months ago

  • Related to Bug #1647: ckNew(): CkReductionMgr not constructed on all PEs added

#23 Updated by Thomas Quinn 3 months ago

With the recent fix, I can no longer reproduce this bug. I recommend this issue be closed.

#24 Updated by Phil Miller 3 months ago

  • Target version changed from 6.8.1 to 6.8.0
  • Status changed from Feedback to Merged

Thank you, done. Sorry for the trouble, and thank you for the fixes.
