Bug #1507

ckio test failure on gni-crayxc

Added by Sam White about 1 month ago. Updated 1 day ago.

Status: In Progress
Priority: High
Assignee:
Category: -
Target version:
Start date: 04/16/2017
Due date:
% Done: 0%


Description

The ckio test has failed on autobuild on gni-crayxc for the past 2+ days, seemingly due to a race condition between array construction and use: http://charm.cs.uiuc.edu/autobuild/cur/gni-crayxc.txt

make[3]: Entering directory `/scratch1/scratchdirs/acun/autobuild/gni-crayxc/charm/gni-crayxc/tests/charm++/io'
../../../bin/testrun  ./pgm +p4 4  

Running on 4 processors:  ./pgm 4
srun -n 4 -c 2 ./pgm 4
srun: Job step creation temporarily disabled, retrying
srun: Job step created
Charm++> Running on Gemini (GNI) with 4 processes
Charm++> static SMSG
Charm++> SMSG memory: 19.8KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 4
Converse/Charm++ Commit ID: 7788005
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (48-way SMP).
Main ran
Main saw file ready
Main saw file ready
Main saw file ready
Main saw file ready
Main saw file ready
Main saw file ready
Main saw session ready
Main saw session ready
Main saw session ready
Main saw session ready
Main saw session ready
Main saw session ready
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Cannot send a message from an array without a local branch

Related issues

Related to Charm++ - Feature #1352: CkArrayOptions callback for completion of chare array initialization Merged 01/08/2017

History

#1 Updated by Phil Miller about 1 month ago

  • Priority changed from Normal to High

#2 Updated by Phil Miller about 1 month ago

I may have a simpler test case for this, or at least one that exhibits the same CmiAbort behavior.

#3 Updated by Phil Miller about 1 month ago

The interesting note from the case I looked into is that a sane array ID is returned into the proxy object, but the manager and elements themselves all somehow see zero instead, and hence don't get back an object pointer when they go looking for it.

#4 Updated by Thomas Quinn 29 days ago

I've been having a CkIO failure on a ChaNGa production run on Blue Waters that may be related to this. The symptom in my case is that a CkReductionMgr associated with the CkIO WriteSession group is having its AddToInactiveList() method called, and the object members are garbage (in particular, inactiveList is bad).

#5 Updated by Phil Miller 17 days ago

  • Assignee changed from Ronak Buch to Phil Miller

#6 Updated by Phil Miller 16 days ago

  • Related to Feature #1352: CkArrayOptions callback for completion of chare array initialization added

#7 Updated by Phil Miller 16 days ago

It looks like the issue is that a message referencing the newly constructed write session is reaching PEs other than 0 before the constructor for the write session array has run there. I can't rely on a group dependence for the ready message CkIO spits out, because that dependence may only apply on the one PE that has the array manager constructed, which then broadcasts the array ID to PEs that don't yet have it constructed. So, #1352 would be really helpful here.
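
To make the ordering problem concrete, here is a minimal sketch in plain C++ rather than Charm++. ToyArrayInit, constructedOn, and canDeliver are hypothetical names invented for this illustration; they are not Charm++ APIs and this is not the code in the patch. The sketch assumes that the completion callback requested in #1352 fires only after every PE has run the array manager's constructor, which is the property the ready message needs.

```cpp
#include <functional>
#include <set>
#include <utility>

// Toy model only: a hypothetical type, not a Charm++ class.
class ToyArrayInit {
public:
    ToyArrayInit(int numPes, std::function<void()> onReady)
        : numPes_(numPes), onReady_(std::move(onReady)) {}

    // Record that the array manager's constructor has run on `pe`.
    // The completion callback fires once, after every PE has reported.
    void constructedOn(int pe) {
        built_.insert(pe);
        if (!fired_ && static_cast<int>(built_.size()) == numPes_) {
            fired_ = true;
            onReady_();
        }
    }

    // A message naming the array can be handled on `pe` only if the
    // local manager already exists there. In the failing run above,
    // the equivalent situation ends in CmiAbort.
    bool canDeliver(int pe) const { return built_.count(pe) != 0; }

private:
    int numPes_;
    std::function<void()> onReady_;
    std::set<int> built_;
    bool fired_ = false;
};
```

In this model, announcing the session as soon as constructedOn(0) returns leaves canDeliver(1) false, which corresponds to the abort on Processor 1. Announcing it from the onReady callback avoids that window.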

#8 Updated by Phil Miller 16 days ago

  • Status changed from New to In Progress

#9 Updated by Phil Miller 15 days ago

  • Status changed from In Progress to Implemented

https://charm.cs.illinois.edu/gerrit/2519

Tom, if you're still seeing this issue, could you try the above patch and let us know if that resolves it? Note that you'll have to pull in its predecessor as well, since it relies on a just-implemented bit of functionality that's not on mainline yet.

#10 Updated by Thomas Quinn 15 days ago

It will take me a little while to reproduce my problem, since it usually happens after restarting from a checkpoint.

But I will give it a go.

#11 Updated by Phil Miller 11 days ago

  • Status changed from Implemented to In Progress

I'm seeing issues with that patch on simple ChaNGa test runs. Working through them now.

#12 Updated by Phil Miller 11 days ago

  • Status changed from In Progress to Implemented

The underlying issue with the patch provided, given that it was failing after restart from a checkpoint, was that an array map ID wasn't marked readonly and so got lost between the initial and restarted runs. Having fixed that (see patch series), ChaNGa's tests seem to run correctly for me.
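
As a rough illustration of that failure mode, the sketch below is a toy model in plain C++. ToyProcess, checkpoint, restart, and lookup are hypothetical names, and the real checkpoint code is not structured this way. It assumes only what is described above: values marked readonly survive a restart, and anything else set during the initial run does not.

```cpp
#include <map>
#include <string>

// Toy model only: hypothetical types, not the Charm++ checkpoint code.
struct ToyProcess {
    std::map<std::string, int> readonlyVars;  // marked readonly: saved
    std::map<std::string, int> plainGlobals;  // not marked: not saved
};

using ToyCheckpoint = std::map<std::string, int>;

// Save only the values that were marked readonly.
ToyCheckpoint checkpoint(const ToyProcess& p) { return p.readonlyVars; }

// A restarted process starts empty and restores only what was saved.
ToyProcess restart(const ToyCheckpoint& c) {
    ToyProcess p;
    p.readonlyVars = c;
    return p;
}

// Look a value up in either table; 0 stands for an uninitialized ID.
int lookup(const ToyProcess& p, const std::string& name) {
    auto r = p.readonlyVars.find(name);
    if (r != p.readonlyVars.end()) return r->second;
    auto g = p.plainGlobals.find(name);
    return g != p.plainGlobals.end() ? g->second : 0;
}
```

In this model, an ID stored in plainGlobals reads back as 0 after restart(checkpoint(p)), while one stored in readonlyVars keeps its value.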

#13 Updated by Phil Miller 11 days ago

  • Status changed from Implemented to Merged

#14 Updated by Thomas Quinn 2 days ago

I'm not sure if it is the same bug, but after several checkpoints I get the following errors while in CkIO:
------------- Processor 2760 Exiting: Called CmiAbort ------------
Reason: Error! This group proxy has not been initialized!
------------- Processor 14927 Exiting: Called CmiAbort ------------
Reason: Requested adjustment to prior reduction!

This is with v6.8.0-beta1-156-gfff8f25 on both a Cray XE and a Cray XC.

#15 Updated by Phil Miller 1 day ago

  • Status changed from Merged to In Progress

Re-opening pending investigation
