Bug #1509

-tracemode summary always fails an assertion at exit

Added by Sam White 7 days ago. Updated 4 days ago.

Status: New
Priority: Normal
Assignee:
Category: Tracing
Target version:
Start date: 04/17/2017
Due date:
% Done: 0%


Description

If you run make test OPTS="-tracemode summary" in tests/, the first test program (megatest) fails an assertion:

../../../bin/testrun  ./pgm +p1  ++local
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.006 seconds.
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.0-beta1-46-g80fa50245
Charm++: Tracemode Summary enabled.
Trace: traceroot: ./pgm
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Megatest is running on 1 nodes 1 processors. 
test 0: initiated [groupring (milind)]
test 0: completed (0.00 sec)
test 1: initiated [nodering (milind)]
test 1: completed (0.00 sec)
test 2: initiated [varsizetest (mjlang)]
varsize: requires at least 2 processors
test 2: completed (0.00 sec)
test 3: initiated [varsizetest2 (phil)]
test 3: completed (0.00 sec)
test 4: initiated [varraystest (milind)]
varraystest: requires at least 2 processors
test 4: completed (0.00 sec)
test 5: initiated [groupcast (mjlang)]
test 5: completed (0.00 sec)
test 6: initiated [groupmulti (gengbin)]
test 6: completed (0.00 sec)
test 7: initiated [groupsectiontest (ebohm)]
groupsectiontest: requires at least 2 processors
test 7: completed (0.00 sec)
test 8: initiated [multisectiontest (ebohm)]
multisectiontest: requires at least 2 processors
test 8: completed (0.00 sec)
test 9: initiated [nodecast (milind)]
test 9: completed (0.00 sec)
test 10: initiated [synctest (mjlang)]
[0] Assertion "inIdle == 0 && inExec == 1" failed in file trace-summary.C line 756.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Assertion "inIdle == 0 && inExec == 1" failed in file trace-summary.C line 756.
[0] Stack Traceback:
  [0:0] 0   pgm                                 0x00000001001e11d3 CmiAbortHelper + 179
  [0:1] 1   pgm                                 0x00000001001dec4b CmiAbort + 43
  [0:2] 2   pgm                                 0x00000001001e9e63 __cmi_assert + 51
  [0:3] 3   pgm                                 0x00000001002266fd _ZN12TraceSummary10endExecuteEv + 61
  [0:4] 4   pgm                                 0x000000010008fc97 _ZN10TraceArray10endExecuteEv + 103
  [0:5] 5   pgm                                 0x00000001000d47e9 _ZL12_invokeEntryiP8envelopePv + 521
  [0:6] 6   pgm                                 0x00000001000d049d _ZL24_processForPlainChareMsgP11CkCoreStateP8envelope + 285
  [0:7] 7   pgm                                 0x00000001000cfd53 _Z15_processHandlerPvP11CkCoreState + 595
  [0:8] 8   pgm                                 0x00000001001e8c28 CmiHandleMessage + 72
  [0:9] 9   pgm                                 0x00000001001e8f53 CsdScheduleForever + 195
  [0:10] 10  pgm                                 0x00000001001e8c7a CsdScheduler + 26
  [0:11] 11  pgm                                 0x00000001001e0df7 ConverseRunPE + 279
  [0:12] 12  pgm                                 0x00000001001e08f3 ConverseInit + 931
  [0:13] 13  pgm                                 0x00000001000b22c5 main + 69
  [0:14] 14  pgm                                 0x0000000100000f94 start + 52
  [0:15] 15  ???                                 0x0000000000000001 0x0 + 1
Fatal error on PE 0> Assertion "inIdle == 0 && inExec == 1" failed in file trace-summary.C line 756.

History

#1 Updated by Sam White 7 days ago

  • Subject changed from Failed assertion with -tracemode summary in charm++ megatest to -tracemode summary always fails an assertion at exit

This seems to happen on any program using -tracemode summary?

This is examples/charm++/hello/1darray, which fails similarly but on a different assertion during exit:

[0] CombineSummary called!
[0] Assertion "inIdle == 0 && inExec == 0" failed in file trace-summary.C line 824.
  [0:0] 0   hello                               0x0000000100135773 CmiAbortHelper + 179
  [0:1] 1   hello                               0x00000001001331eb CmiAbort + 43
  [0:2] 2   hello                               0x000000010013e403 __cmi_assert + 51
  [0:3] 3   hello                               0x000000010017c550 _ZN12TraceSummary9beginIdleEd + 96
  [0:4] 4   hello                               0x0000000100006967 _ZN10TraceArray9beginIdleEd + 151
  [0:5] 5   hello                               0x00000001000068bf traceCommonBeginIdle + 31
  [0:6] 6   hello                               0x0000000100141bec call_cblist_keep + 124
  [0:7] 7   hello                               0x0000000100141a71 CcdRaiseCondition + 129
  [0:8] 8   hello                               0x000000010013d13d CsdBeginIdle + 45
  [0:9] 9   hello                               0x000000010013d522 CsdScheduleForever + 242
  [0:10] 10  hello                               0x000000010013d21a CsdScheduler + 26
  [0:11] 11  hello                               0x000000010000b335 CkExit + 213
  [0:12] 12  hello                               0x00000001000019c2 _ZN4Main4doneEv + 34
  [0:13] 13  hello                               0x000000010000199a _ZN12CkIndex_Main15_call_done_voidEPvS0_ + 42
  [0:14] 14  hello                               0x0000000100022191 CkDeliverMessageFree + 65
  [0:15] 15  hello                               0x0000000100022d5b _ZL19_invokeEntryNoTraceiP8envelopePv + 59
  [0:16] 16  hello                               0x0000000100028ecb _ZL12_invokeEntryiP8envelopePv + 299
  [0:17] 17  hello                               0x0000000100024a3d _ZL24_processForPlainChareMsgP11CkCoreStateP8envelope + 285
  [0:18] 18  hello                               0x00000001000242f3 _Z15_processHandlerPvP11CkCoreState + 595
  [0:19] 19  hello                               0x000000010013d1c8 CmiHandleMessage + 72
  [0:20] 20  hello                               0x000000010013d4f3 CsdScheduleForever + 195
  [0:21] 21  hello                               0x000000010013d21a CsdScheduler + 26
  [0:22] 22  hello                               0x0000000100135397 ConverseRunPE + 279
  [0:23] 23  hello                               0x0000000100134e93 ConverseInit + 931
  [0:24] 24  hello                               0x0000000100006555 main + 69
  [0:25] 25  hello                               0x00000001000010a4 start + 52

#2 Updated by Ronak Buch 5 days ago

I couldn't reproduce this on the latest Charm (HEAD: ddc864b7e) on netlrts-linux-x86_64 (built with ./build charm++ netlrts-linux-x86_64 --enable-tracing --with-production).

Maybe just a Darwin bug?

#3 Updated by Sam White 5 days ago

You built with --with-production, which disables CmiAsserts, and the failure I saw is a failed CmiAssert.

#4 Updated by Ronak Buch 5 days ago

Okay, I'm able to reproduce it for runs with more than 1 PE without production.

#5 Updated by Phil Miller 4 days ago

Looking at the second assertion failure, I'm not too surprised. The exit process goes from the middle of a running entry method (the caller of CkExit) right back into the scheduler, without any indication to the tracing framework that execution of the entry method was stopped. So, when it sees what looks like new idle time while execution is apparently ongoing, it is justifiably confused. Perhaps CkExit's re-entry into the scheduler loop should make the appropriate tracing call to end execution of the calling entry method.
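
To make the sequence described above concrete, here is a minimal toy model of the two flags named in the assertion messages. It is a sketch under assumptions, not the actual trace-summary.C code: ToyTrace and exitFromEntryMethod are hypothetical names, the checks in beginExecute and endIdle are guesses (only the endExecute and beginIdle conditions appear in the logs above), and the model records an inconsistency instead of aborting.

```cpp
#include <cassert>

// Toy model only: mirrors the inIdle/inExec flags from the assertion
// messages. This is NOT the real TraceSummary class.
struct ToyTrace {
    int inIdle = 0;
    int inExec = 0;
    bool consistent = true;  // set to false where the real code would assert

    // Assumed precondition (not shown in the logs).
    void beginExecute() { check(inIdle == 0 && inExec == 0); inExec = 1; }
    // Condition taken from the assertion reported at line 756.
    void endExecute()   { check(inIdle == 0 && inExec == 1); inExec = 0; }
    // Condition taken from the assertion reported at line 824.
    void beginIdle()    { check(inIdle == 0 && inExec == 0); inIdle = 1; }
    // Assumed precondition (not shown in the logs).
    void endIdle()      { check(inIdle == 1 && inExec == 0); inIdle = 0; }

private:
    void check(bool ok) { if (!ok) consistent = false; }
};

// Hypothetical stand-in for an entry method that calls CkExit, which then
// re-enters the scheduler loop and goes idle. Returns true if the tracing
// state stayed consistent.
bool exitFromEntryMethod(bool endExecutionBeforeScheduler) {
    ToyTrace t;
    t.beginExecute();                 // scheduler starts the entry method
    if (endExecutionBeforeScheduler)
        t.endExecute();               // suggested: report that execution stopped
    t.beginIdle();                    // nested scheduler loop finds no work
    return t.consistent;
}
```

In this model, exitFromEntryMethod(false) reports an inconsistent state, matching the second failure (idle time begins while inExec is still 1), and exitFromEntryMethod(true) stays consistent. Whether the real fix belongs in CkExit itself or elsewhere in the exit path is not settled here.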
