Project

General

Profile

Bug #1510

Hang in tests/charm++/chkpt when using -tracemode perfReport

Added by Sam White 6 months ago. Updated 15 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Tracing
Target version:
Start date:
04/17/2017
Due date:
% Done:

0%


Description

Besides this one test, running make test OPTS="-tracemode perfReport" passes all tests.

../../../bin/testrun  ./hello +p4 +restart log  ++local
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.005 seconds.
Charm++> Running in non-SMP mode: numPes 4
Converse/Charm++ Commit ID: v6.8.0-beta1-46-g80fa50245
Trace: traceroot: ./hello
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Received 1 arguments: { |./hello| }
Main's MigCtor. a=987(0x1011404fc), b[0]=654(0x101140500), b[1]=321, old PE number 4
Main's PUPer. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
[3] data on Group 3
[1] data on Group 1
[2] data on Group 2
[3] data on NOdeGroup 3
[1] data on NOdeGroup 1
CHello's PUPer. step=3.
[0] data on Group 0
[2] data on NOdeGroup 2
[0] data on NOdeGroup 0
[0]CkRestartMain done. sending out callback.
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 4 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 5 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 6 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 7 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 8 done
myClient. a=123(0x1011404fc), b[0]=456(0x101140500), b[1]=789
step 9 done

Related issues

Related to Charm++ - Bug #1201: SMP builds segfault on NULL lock in tests/charm++/chkpt Rejected 09/07/2016
Related to Charm++ - Bug #1616: Race condition during ckRestart causes hang after merging patch for bug #1576 Merged 06/21/2017

History

#1 Updated by Phil Miller 6 months ago

I think we can reasonably defer this, unless we think it's indicative of some broader deficiency that users will encounter.

#2 Updated by Sam White 6 months ago

  • Target version changed from 6.8.0 to 6.8.1

#3 Updated by Sam White 6 months ago

Trying this on linux now (instead of darwin) it works, so we should be safe to add "-tracemode perfReport" to netlrts-linux or something on autobuild.

#4 Updated by Phil Miller 3 months ago

  • Related to Bug #1201: SMP builds segfault on NULL lock in tests/charm++/chkpt added

#5 Updated by Phil Miller 3 months ago

  • Related to Bug #1616: Race condition during ckRestart causes hang after merging patch for bug #1576 added

#6 Updated by Ronak Buch 30 days ago

  • Status changed from New to In Progress

I was able to reproduce this on Linux, it looks like it's because it the PICS code tries to exit prematurely, which closes an output file that it keeps trying to use:

Fatal error: glibc detected an invalid stdio handle

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
58    ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

#7 Updated by Phil Miller 17 days ago

Taking a second look at this, are we particularly concerned with tracing support across checkpoint/restart? Do we actually expect resulting performance analysis to be meaningful?

I'm thinking the right resolution may actually be to simply throw an error on attempts to restart from a checkpoint when some tracing module is active.

Alternately, without having looked at the code or run the test myself, is the underlying cause just one of a file not being re-opened after restart?

#8 Updated by Phil Miller 17 days ago

OK, I'm running this now, and actually seeing a crash in the exit path in the first run of ./hello:

(gdb) bt
#0  0x00007ffff6f9c2d5 in _IO_vfprintf_internal (s=0xb05200, format=0x6e83a5 "NEWITER %d %d %d %llu %d\n", ap=ap@entry=0x7fffffffd8a8) at vfprintf.c:1317
#1  0x00007ffff6fa47f7 in __fprintf (stream=<optimized out>, format=<optimized out>) at fprintf.c:32
#2  0x00000000006a3935 in TraceAutoPerfBOC::globalPerfAnalyze (this=0xb02770, msg=0xb04f00) at picsautoperf.C:444
#3  0x00000000006afc15 in CkIndex_TraceAutoPerfBOC::_call_globalPerfAnalyze_CkReductionMsg (impl_msg=0xb04f00, impl_obj_void=0xb02770) at TraceAutoPerf.def.h:2996
#4  0x00000000005822e6 in CkDeliverMessageFree (epIdx=166, msg=0xb04f00, obj=0xb02770) at ck.C:597
#5  0x0000000000582432 in _invokeEntryNoTrace (epIdx=166, env=0xb04eb0, obj=0xb02770) at ck.C:641
#6  0x0000000000582542 in _invokeEntry (epIdx=166, env=0xb04eb0, obj=0xb02770) at ck.C:652
#7  0x0000000000583a15 in _deliverForBocMsg (ck=0xa08620, epIdx=166, env=0xb04eb0, obj=0xb02770) at ck.C:1083
#8  0x00000000005852d9 in CkSendMsgBranchInline (eIdx=166, msg=0xb04f00, destPE=0, gID=..., opts=5) at ck.C:1844
#9  0x0000000000585341 in CkSendMsgBranch (eIdx=166, msg=0xb04f00, pe=0, gID=..., opts=5) at ck.C:1856
#10 0x00000000006a8356 in CProxyElement_TraceAutoPerfBOC::globalPerfAnalyze (this=0x7fffffffdc90, impl_msg=0xb04f00) at TraceAutoPerf.def.h:1201
#11 0x00000000006a26a4 in TraceAutoPerfBOC::gatherSummary (this=0xb02770, msg=0xb04f00) at picsautoperf.C:165
#12 0x00000000006a359d in TraceAutoPerfBOC::getPerfData (this=0xb02770, reductionPE=0, cb=...) at picsautoperf.C:399
#13 0x00000000006af42a in CkIndex_TraceAutoPerfBOC::_call_getPerfData_marshall15 (impl_msg=0xb02ff0, impl_obj_void=0xb02770) at TraceAutoPerf.def.h:2892
#14 0x00000000005822e6 in CkDeliverMessageFree (epIdx=164, msg=0xb02ff0, obj=0xb02770) at ck.C:597
#15 0x0000000000582432 in _invokeEntryNoTrace (epIdx=164, env=0xb02fa0, obj=0xb02770) at ck.C:641
#16 0x0000000000582542 in _invokeEntry (epIdx=164, env=0xb02fa0, obj=0xb02770) at ck.C:652
#17 0x0000000000583a15 in _deliverForBocMsg (ck=0xa08620, epIdx=164, env=0xb02fa0, obj=0xb02770) at ck.C:1083
#18 0x0000000000583b1e in _processForBocMsg (ck=0xa08620, env=0xb02fa0) at ck.C:1101
#19 0x0000000000584093 in _processHandler (converseMsg=0xb02fa0, ck=0xa08620) at ck.C:1263
#20 0x000000000065fd11 in CmiHandleMessage (msg=0xb02fa0) at convcore.c:1678
#21 0x00000000006600a1 in CsdScheduleForever () at convcore.c:1915
#22 0x000000000065ffe7 in CsdScheduler (maxmsgs=-1) at convcore.c:1851
#23 0x0000000000574b07 in CkExit () at init.C:950
#24 0x0000000000598dfe in CkCallback::send (this=0xb059f8, msg=0xb059e0) at ckcallback.C:224
#25 0x00000000005f6546 in CkReductionMgr::finishReduction (this=0xb00570) at ckreduction.C:766
#26 0x00000000005f621c in CkReductionMgr::addContribution (this=0xb00570, m=0xb03d50) at ckreduction.C:676
#27 0x00000000005f59f1 in CkReductionMgr::contribute (this=0xb00570, ci=0xb01fc0, m=0xb03d50) at ckreduction.C:444
#28 0x00000000005e3c51 in ArrayElement::contribute (this=0xb01f40, dataSize=4, data=0xb01fe8, type=CkReduction::max_int, cb=..., userFlag=65535) at ckarray.C:396
#29 0x0000000000570d74 in Hello::SayHi (this=0xb01f40) at hello.C:96
#30 0x000000000056cb2a in CkIndex_Hello::_call_SayHi_void (impl_msg=0xb02ff0, impl_obj_void=0xb01f40) at hello.def.h:430
#31 0x00000000005823dc in CkDeliverMessageReadonly (epIdx=140, msg=0xb039b0, obj=0xb01f40) at ck.C:631
#32 0x00000000005a2867 in CkLocRec::invokeEntry (this=0xb02000, obj=0xb01f40, msg=0xb039b0, epIdx=140, doFree=false) at cklocation.C:2030
#33 0x0000000000587d82 in CkMigratable::ckInvokeEntry (this=0xb01f40, epIdx=140, msg=0xb039b0, doFree=false) at ckmigratable.h:72
#34 0x00000000005e7c09 in CkArrayBroadcaster::deliver (this=0xb008c0, bcast=0xb039b0, el=0xb01f40, doFree=false) at ckarray.C:1315
#35 0x00000000005e84f3 in CkArray::recvBroadcast (this=0xb00570, m=0xb039b0) at ckarray.C:1543
#36 0x00000000005ec00d in CkIndex_CkArray::_call_recvBroadcast_CkMessage (impl_msg=0xb039b0, impl_obj_void=0xb00570) at CkArray.def.h:1151
#37 0x00000000005822e6 in CkDeliverMessageFree (epIdx=67, msg=0xb039b0, obj=0xb00570) at ck.C:597
#38 0x0000000000582432 in _invokeEntryNoTrace (epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:641
#39 0x0000000000582615 in _invokeEntry (epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:659
#40 0x0000000000583a15 in _deliverForBocMsg (ck=0xa08620, epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:1083
#41 0x0000000000583b1e in _processForBocMsg (ck=0xa08620, env=0xb03960) at ck.C:1101
#42 0x0000000000584093 in _processHandler (converseMsg=0xb03960, ck=0xa08620) at ck.C:1263
#43 0x000000000065fd11 in CmiHandleMessage (msg=0xb03960) at convcore.c:1678
#44 0x00000000006600a1 in CsdScheduleForever () at convcore.c:1915
#45 0x000000000065ffe7 in CsdScheduler (maxmsgs=-1) at convcore.c:1851
#46 0x000000000065851d in ConverseRunPE (everReturn=0) at machine-common-core.c:1298
#47 0x00000000006581c7 in ConverseInit (argc=1, argv=0x7fffffffea98, fn=0x574e41 <_initCharm(int, char**)>, usched=0, initret=0) at machine-common-core.c:1199
#48 0x0000000000573113 in main (argc=1, argv=0x7fffffffea98) at main.C:18

CkpvAccess(fpSummary) is previously closed here:

(gdb) bt
#0  _IO_new_fclose (fp=0xb05200) at iofclose.c:39
#1  0x00000000006a51e2 in traceAutoPerfExitFunction () at picsautoperf.C:730
#2  0x0000000000573ec9 in _exitHandler (env=0xb02fa0) at init.C:560
#3  0x000000000065fd11 in CmiHandleMessage (msg=0xb02fa0) at convcore.c:1678
#4  0x00000000006600a1 in CsdScheduleForever () at convcore.c:1915
#5  0x000000000065ffe7 in CsdScheduler (maxmsgs=-1) at convcore.c:1851
#6  0x0000000000574b07 in CkExit () at init.C:950
#7  0x0000000000598dfe in CkCallback::send (this=0xb059f8, msg=0xb059e0) at ckcallback.C:224
#8  0x00000000005f6546 in CkReductionMgr::finishReduction (this=0xb00570) at ckreduction.C:766
#9  0x00000000005f621c in CkReductionMgr::addContribution (this=0xb00570, m=0xb03d50) at ckreduction.C:676
#10 0x00000000005f59f1 in CkReductionMgr::contribute (this=0xb00570, ci=0xb01fc0, m=0xb03d50) at ckreduction.C:444
#11 0x00000000005e3c51 in ArrayElement::contribute (this=0xb01f40, dataSize=4, data=0xb01fe8, type=CkReduction::max_int, cb=..., userFlag=65535) at ckarray.C:396
#12 0x0000000000570d74 in Hello::SayHi (this=0xb01f40) at hello.C:96
#13 0x000000000056cb2a in CkIndex_Hello::_call_SayHi_void (impl_msg=0xb02ff0, impl_obj_void=0xb01f40) at hello.def.h:430
#14 0x00000000005823dc in CkDeliverMessageReadonly (epIdx=140, msg=0xb039b0, obj=0xb01f40) at ck.C:631
#15 0x00000000005a2867 in CkLocRec::invokeEntry (this=0xb02000, obj=0xb01f40, msg=0xb039b0, epIdx=140, doFree=false) at cklocation.C:2030
#16 0x0000000000587d82 in CkMigratable::ckInvokeEntry (this=0xb01f40, epIdx=140, msg=0xb039b0, doFree=false) at ckmigratable.h:72
#17 0x00000000005e7c09 in CkArrayBroadcaster::deliver (this=0xb008c0, bcast=0xb039b0, el=0xb01f40, doFree=false) at ckarray.C:1315
#18 0x00000000005e84f3 in CkArray::recvBroadcast (this=0xb00570, m=0xb039b0) at ckarray.C:1543
#19 0x00000000005ec00d in CkIndex_CkArray::_call_recvBroadcast_CkMessage (impl_msg=0xb039b0, impl_obj_void=0xb00570) at CkArray.def.h:1151
#20 0x00000000005822e6 in CkDeliverMessageFree (epIdx=67, msg=0xb039b0, obj=0xb00570) at ck.C:597
#21 0x0000000000582432 in _invokeEntryNoTrace (epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:641
#22 0x0000000000582615 in _invokeEntry (epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:659
#23 0x0000000000583a15 in _deliverForBocMsg (ck=0xa08620, epIdx=67, env=0xb03960, obj=0xb00570) at ck.C:1083
#24 0x0000000000583b1e in _processForBocMsg (ck=0xa08620, env=0xb03960) at ck.C:1101
#25 0x0000000000584093 in _processHandler (converseMsg=0xb03960, ck=0xa08620) at ck.C:1263
#26 0x000000000065fd11 in CmiHandleMessage (msg=0xb03960) at convcore.c:1678
#27 0x00000000006600a1 in CsdScheduleForever () at convcore.c:1915
#28 0x000000000065ffe7 in CsdScheduler (maxmsgs=-1) at convcore.c:1851
#29 0x000000000065851d in ConverseRunPE (everReturn=0) at machine-common-core.c:1298
#30 0x00000000006581c7 in ConverseInit (argc=1, argv=0x7fffffffea98, fn=0x574e41 <_initCharm(int, char**)>, usched=0, initret=0) at machine-common-core.c:1199
#31 0x0000000000573113 in main (argc=1, argv=0x7fffffffea98) at main.C:18

There's some screwiness in the control flow here, that maybe relied on some race condition to work at all initially:

extern "C" void traceAutoPerfExitFunction() {
  if (autoPerfProxy.ckGetGroupID().isZero()) {
    CkContinueExit();
    return;
  }

  /* Starts copying of data */
  if(user_call == 0){  // Do not call them by default if the user is calling them                                                                                                                                                  
    autoPerfProxy.endPhase();
    autoPerfProxy.endStepResumeCb(true, CkMyPe(), CkCallbackResumeThread());
  }

  CkpvAccess(isExit) = true;
  autoPerfProxy.getPerfData(0, CkCallback::ignore );

  if(CkpvAccess(fpSummary)!=NULL){
    fflush(CkpvAccess(fpSummary));
    fclose(CkpvAccess(fpSummary));
  }
}

Note the call to getPerfData, which calls globalPerfAnalyze, which writes to the FILE* in question.

I think the right thing here may be to just not explicitly close that file in the trace module's exit handler, since the underlying C library will take care of that when exit() gets called. I'll try that now.

#9 Updated by Phil Miller 17 days ago

https://charm.cs.illinois.edu/gerrit/3097 tracemode perfReport: don't close file in race with code that will write to it
https://charm.cs.illinois.edu/gerrit/3098 tracemode perfReport: register custom reduce in initnode function, as documented

These two fixes make it feasible to at least reproduce this issue as reported initially, including on multicore/smp builds.

I do still see a hang when attempting to restart from checkpoint, even on 1 PE checkpoint, 1 PE restart:

$ rm -rf log
phil@prowess:/scratch/phil/charm/multicore-linux-x86_64-v6.8.0-84-g329e0989f/tests/charm++/chkpt$ ./hello
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  1 threads
Converse/Charm++ Commit ID: v6.8.0-84-g329e0989f
Trace: traceroot: ./hello
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Running Hello on 1 processors for 8 elements
Charm++ - PICS > Enabled pics autoPerf ......
step 0 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 1 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 2 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
[0] Checkpoint starting in log
step 3 done
Main's PUPer. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
CHello's PUPer. step=3.
Checkpoint to disk finished in 0.463060s, sending out the cb...
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 4 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 5 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 6 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 7 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 8 done
myClient. a=123(0xaffd2c), b[0]=456(0xaffd30), b[1]=789
step 9 done

PICS Data: PEs in group: 1
IDLE%: 0.00
OVERHEAD%: 0.00
UTIL%: 1.00
AVG_ENTRY_DURATION: 0.004332

PICS Data: PEs in group: 1
IDLE%: 0.00
OVERHEAD%: 0.00
UTIL%: 1.00
AVG_ENTRY_DURATION: 0.004332
[Partition 0][Node 0] End of program

$ ./hello +restart log
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  1 threads
Converse/Charm++ Commit ID: v6.8.0-84-g329e0989f
Trace: traceroot: ./hello
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Received 1 arguments: { |./hello| }
Main's MigCtor. a=987(0xa1e57c), b[0]=654(0xa1e580), b[1]=321, old PE number 1
Main's PUPer. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
CHello's PUPer. step=3.
[0] data on Group 0
[0] data on NOdeGroup 0
[0]CkRestartMain done. sending out callback.
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 4 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 5 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 6 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 7 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 8 done
myClient. a=123(0xa1e57c), b[0]=456(0xa1e580), b[1]=789
step 9 done
^C

#10 Updated by Phil Miller 17 days ago

Hang happens in traceAutoPerfExitFunction, at autoPerfProxy.endStepResumeCb(true, CkMyPe(), CkCallbackResumeThread());

I'll see why that callback never gets triggered.

#11 Updated by Phil Miller 16 days ago

So it looks like after the restart, t->getTraceOn() returns false, and on multiple paths, this means that the code does nothing, and the callback simply never gets called. The control flow here is astoundingly convoluted, and I'm beyond my tolerance for wanting to dig through it.

#12 Updated by Phil Miller 15 days ago

  • Target version changed from 6.8.1 to 6.9.0

Also available in: Atom PDF