Bug #1323

megatest multisection test failures

Added by Sam White over 2 years ago. Updated over 2 years ago.

Status: Merged
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 12/07/2016
Due date:
% Done: 0%


Description

Core decided on three actions: document the expected and disallowed usage in the manual, update the test itself to cover more relevant cases, and have the runtime issue an error if the user does the wrong thing.


Related issues

Related to Charm++ - Feature #1187: Automatic delegation of section work to CkMulticastMgr (Merged, 08/25/2016)
Related to Charm++ - Feature #1352: CkArrayOptions callback for completion of chare array initialization (Merged, 01/08/2017)

History

#1 Updated by Sam White over 2 years ago

  • Status changed from New to Merged
  • Closed date set to 2016-12-13 11:55:31.033404

Turn off automatic section delegation in the test: https://charm.cs.illinois.edu/gerrit/#/c/2046/
Update the sections manual: https://charm.cs.illinois.edu/gerrit/#/c/2038/

#2 Updated by Phil Miller over 2 years ago

  • Status changed from Merged to In Progress
  • Closed date deleted (2016-12-13 11:55:31.033404)

We're still seeing failures of the section delegation parts of megatest in the Jenkins nightly build. Please take a look.

#3 Updated by Phil Miller over 2 years ago

Here's valgrind output from a current run. I had to compile megatest with -O1 to trigger the failure. Higher optimization inlines some calls and loses a stack frame or two; with less optimization it doesn't crash.

==4203== Invalid read of size 1
==4203==    at 0x5A01AB: CProxySection_ArrayBase::ckAutoDelegate(int) (ckarray.C:551)
==4203==    by 0x4D959B: ckAutoDelegate (CkArray.decl.h:1222)
==4203==    by 0x4D959B: CProxySection_ArrayElement (CkArray.decl.h:1211)
==4203==    by 0x4D959B: CProxySection_multisectiontest_array1d (multisectiontest.decl.h:1434)
==4203==    by 0x4D959B: multisectiontest_grp::recvID(multisectionAID_msg*) (multisectiontest.C:316)
==4203==    by 0x4D98D5: CkIndex_multisectiontest_grp::_call_recvID_multisectionAID_msg(void*, void*) (multisectiontest.def.h:793)
==4203==    by 0x54C3D9: CkDeliverMessageFree (ck.C:593)
==4203==    by 0x553FA1: _invokeEntryNoTrace (ck.C:637)
==4203==    by 0x553FA1: _invokeEntry (ck.C:655)
==4203==    by 0x553FA1: _deliverForBocMsg (ck.C:1095)
==4203==    by 0x553FA1: _processForBocMsg (ck.C:1107)
==4203==    by 0x553FA1: _processHandler(void*, CkCoreState*) (ck.C:1238)
==4203==    by 0x61DC37: CmiHandleMessage (convcore.c:1814)
==4203==    by 0x61DC37: CsdScheduleForever (convcore.c:2051)
==4203==    by 0x61DF6C: CsdScheduler (convcore.c:1987)
==4203==    by 0x61C546: ConverseRunPE (machine.c:2732)
==4203==    by 0x61C546: ConverseInit (machine.c:3106)
==4203==    by 0x4C80D6: main (main.C:18)
==4203==  Address 0x228 is not stack'd, malloc'd or (recently) free'd

#4 Updated by Phil Miller over 2 years ago

  • Status changed from In Progress to Implemented

There's a pretty obvious race between the ckNew calls to construct the various arrays and the recvID call made on the group that's supposed to create a delegated section referencing those arrays.

Fix here:
https://charm.cs.illinois.edu/gerrit/2092
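
To make the race in this comment concrete, here is a minimal sketch in plain C++ (not Charm++ code, and not the patch linked above). It assumes a hypothetical per-PE table of array managers, `PeState`, and a hypothetical `makeSection` that plays the role of the recvID handler. If the handler runs before the array has been constructed on that PE, the lookup yields a null pointer, which would be consistent with the invalid read at a small address in the valgrind output. The sketch checks for that case rather than dereferencing.

```cpp
#include <cassert>
#include <map>
#include <optional>

// Hypothetical stand-in for an array manager that exists on a PE
// only after the corresponding construction message has been processed.
struct ArrayManager {
    int numElements;
};

// Hypothetical per-PE state: maps a manager GID to its constructed manager.
class PeState {
public:
    void constructArray(int gid, int numElements) {
        managers_[gid] = ArrayManager{numElements};
    }

    // Returns nullptr when the array has not been constructed on this PE yet.
    const ArrayManager* lookup(int gid) const {
        auto it = managers_.find(gid);
        return it == managers_.end() ? nullptr : &it->second;
    }

private:
    std::map<int, ArrayManager> managers_;
};

// Models a recvID-style handler that builds a section over array `gid`.
// Returns the section size, or std::nullopt if the handler arrived before
// the array was constructed (the losing side of the race).
std::optional<int> makeSection(const PeState& pe, int gid) {
    const ArrayManager* mgr = pe.lookup(gid);
    if (mgr == nullptr) {
        return std::nullopt;  // an unchecked dereference here would crash
    }
    return mgr->numElements;
}
```

Calling `makeSection` before `constructArray` for the same GID reports the missing array instead of crashing; calling it afterwards returns the element count.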

#5 Updated by Phil Miller over 2 years ago

  • Status changed from Implemented to In Progress

./charmrun +p4 ++local ./pgm
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.053 seconds.
Converse/Charm++ Commit ID: v6.7.0-531-g59e27d7a0
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Megatest is running on 4 nodes 4 processors. 
test 0: initiated [groupring (milind)]
test 0: completed (0.05 sec)
test 1: initiated [nodering (milind)]
test 1: completed (0.01 sec)
test 2: initiated [varsizetest (mjlang)]
test 2: completed (0.00 sec)
test 3: initiated [varsizetest2 (phil)]
test 3: completed (0.00 sec)
test 4: initiated [varraystest (milind)]
test 4: completed (0.00 sec)
test 5: initiated [groupcast (mjlang)]
test 5: completed (0.00 sec)
test 6: initiated [groupmulti (gengbin)]
test 6: completed (0.00 sec)
test 7: initiated [groupsectiontest (ebohm)]
test 7: completed (0.00 sec)
test 8: initiated [multisectiontest (ebohm)]
Pe 0 Array 57
Pe 0 Array 57
Pe 0 Array 57
Pe 0 Array 59
Pe 0 Array 59
Pe 0 Array 59
Pe 0 Array 61
Pe 0 Array 61
Pe 0 Array 61
PE 0 group 53 depends on 61
PE 0 group 54 depends on 61
PE 0 group 55 depends on 61
Pe 1 Array 57
Pe 1 Array 57
Pe 1 Array 57
Pe 1 Array 59
Pe 1 Array 59
Pe 1 Array 59
Pe 1 Array 61
Pe 1 Array 61
Pe 1 Array 61
PE 1 group 53 depends on 61
PE 1 group 54 depends on 61
PE 1 group 55 depends on 61
PE 3 group 53 depends on 61
Pe 2 Array 57
Pe 2 Array 57
Pe 2 Array 59
Pe 2 Array 59
Pe 2 Array 61
Pe 2 Array 61
PE 2 group 53 depends on 61
PE 2 group 54 depends on 61
PE 2 group 55 depends on 61
Charmrun> error on request socket to node 3 '127.0.0.1'--
Socket closed before recv.

Apparently, the section delivery across the groups isn't respecting the group dependency set on the message. Notice that the message is delivered on PE 3 even though that PE hasn't constructed the array with manager GID 61 yet.

#6 Updated by Phil Miller over 2 years ago

  • Status changed from In Progress to Implemented

It turns out, no group message delivery (other than construction) was respecting dependencies. Fixing that made the patch above work.
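
As an illustration of what respecting a dependency means for delivery, here is a minimal sketch in plain C++. It is a model under stated assumptions, not the runtime's implementation: `DependencyScheduler`, `deliver`, `markConstructed`, and `pendingCount` are hypothetical names. A message that depends on a GID is buffered until that GID has been constructed on the PE, then delivered in arrival order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical model of per-PE delivery that honors a dependency on a GID.
class DependencyScheduler {
public:
    using Handler = std::function<void()>;

    // Run the handler now if its dependency is already constructed;
    // otherwise buffer it until markConstructed(dependsOn) is called.
    void deliver(int dependsOn, Handler handler) {
        if (constructed_.count(dependsOn) != 0) {
            handler();
        } else {
            pending_[dependsOn].push_back(std::move(handler));
        }
    }

    // Record that `gid` now exists on this PE and flush anything waiting on it.
    void markConstructed(int gid) {
        constructed_.insert(gid);
        auto it = pending_.find(gid);
        if (it == pending_.end()) {
            return;
        }
        // Move the queue out first so handlers may safely call deliver().
        std::vector<Handler> ready = std::move(it->second);
        pending_.erase(it);
        for (auto& handler : ready) {
            handler();
        }
    }

    // Number of buffered messages across all unmet dependencies.
    std::size_t pendingCount() const {
        std::size_t total = 0;
        for (const auto& entry : pending_) {
            total += entry.second.size();
        }
        return total;
    }

private:
    std::set<int> constructed_;
    std::map<int, std::vector<Handler>> pending_;
};
```

In terms of the log above, a message that depends on GID 61 and arrives on PE 3 early would sit in the pending queue until the array with that manager GID is constructed there.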

#8 Updated by Sam White over 2 years ago

Never mind, we're still getting a segfault from this test: https://charm.cs.illinois.edu/autobuild/cur/gni-crayxc.txt

#9 Updated by Sam White over 2 years ago

  • Status changed from Implemented to In Progress

Marking 'In Progress' since we continue to see failures here, though some of the issues have been fixed.

#10 Updated by Phil Miller over 2 years ago

  • Closed date set to 2017-02-07 17:10:58.548583
  • Status changed from In Progress to Merged
