Bug #1145

PathHistory breaks chkpt test on multicore-linux64

Added by Sam White almost 3 years ago. Updated over 2 years ago.

Status: Merged
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 07/20/2016
Due date:
% Done: 0%


Description

Multicore-linux64 autobuild failure:

make[3]: Entering directory `/scratch/autobuild/multicore/charm/multicore-linux64/tests/charm++/chkpt'
../../../bin/charmc -g -g  hello.ci
../../../bin/charmc -g -g -c hello.C
../../../bin/charmc -g -g -language charm++ -o hello hello.o 
rm -rf log/
rm -fr log
../../../bin/testrun  ./hello +p4  
Running command: ./hello +p4

------------- Processor 3 Exiting: Called CmiAbort ------------
Reason: pathHistoryManager cannot be pupped.
------------- Processor 2 Exiting: Called CmiAbort ------------
Reason: pathHistoryManager cannot be pupped.
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: pathHistoryManager cannot be pupped.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: pathHistoryManager cannot be pupped.
Charm++ fatal error:
pathHistoryManager cannot be pupped.
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  4 threads
Converse/Charm++ Commit ID: b240e23
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Running Hello on 4 processors for 8 elements
step 0 done
myClient. a=123(0xa4a6bc), b[0]=456(0xa4a6c0), b[1]=789
step 1 done
myClient. a=123(0xa4a6bc), b[0]=456(0xa4a6c0), b[1]=789
step 2 done
myClient. a=123(0xa4a6bc), b[0]=456(0xa4a6c0), b[1]=789
[0] Checkpoint starting in log
step 3 done
Main's PUPer. a=123(0xa4a6bc), b[0]=456(0xa4a6c0), b[1]=789
CHello's PUPer. step=3.
[2] Stack Traceback:
  [2:0] CmiAbortHelper+0xbd  [0x5eb5ef]
  [2:1] CmiAbort+0x2d  [0x5eb62a]
  [2:2] _ZN18pathHistoryManager3pupERN3PUP2erE+0x1a  [0x5acd3a]
  [2:3] _ZN18recursive_pup_implI18pathHistoryManagerLi1EEclEPS0_RN3PUP2erE+0x4d  [0x5adecb]
  [2:4] _Z13recursive_pupI18pathHistoryManagerEvPT_RN3PUP2erE+0x27  [0x5ad963]
  [2:5] _ZN7CBaseT1I5Group25CProxy_pathHistoryManagerE11virtual_pupERN3PUP2erE+0x48  [0x5ac302]
  [2:6]   [0x58480f]
  [2:7] _Z14CkPupGroupDataRN3PUP2erE+0x4b  [0x584891]
  [2:8] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallbackb+0x20b  [0x583935]
  [2:9] _ZN23CkIndex_CkCheckpointMgr26_call_Checkpoint_marshall2EPvS0_+0xef  [0x5863bf]
  [2:10] CkDeliverMessageFree+0x4e  [0x50fb95]
  [2:11]   [0x50fcdb]
  [2:12]   [0x50fdef]
  [2:13]   [0x51133a]
  [2:14]   [0x5113e0]
  [2:15] _Z15_processHandlerPvP11CkCoreState+0x126  [0x5118ae]
  [2:16] CmiHandleMessage+0x49  [0x5f2a6f]
  [2:17] CsdScheduleForever+0x9b  [0x5f2e04]
  [2:18] CsdScheduler+0x16  [0x5f2d47]
  [2:19]   [0x5eb348]
  [2:20]   [0x5e8a2d]
  [2:21] +0x7e9a  [0x2aaaaacd7e9a]
  [2:22] clone+0x6d  [0x2aaaab80236d]
[0] Stack Traceback:
  [0:0] CmiAbortHelper+0xbd  [0x5eb5ef]
  [0:1] CmiAbort+0x2d  [0x5eb62a]
  [0:2] _ZN18pathHistoryManager3pupERN3PUP2erE+0x1a  [0x5acd3a]
  [0:3] _ZN18recursive_pup_implI18pathHistoryManagerLi1EEclEPS0_RN3PUP2erE+0x4d  [0x5adecb]
  [0:4] _Z13recursive_pupI18pathHistoryManagerEvPT_RN3PUP2erE+0x27  [0x5ad963]
  [0:5] _ZN7CBaseT1I5Group25CProxy_pathHistoryManagerE11virtual_pupERN3PUP2erE+0x48  [0x5ac302]
  [0:6]   [0x58480f]
  [0:7] _Z14CkPupGroupDataRN3PUP2erE+0x4b  [0x584891]
  [0:8] _ZN15CkCheckpointMgr10CheckpointEPKcR10CkCallbackb+0x20b  [0x583935]
  [0:9] _ZN23CkIndex_CkCheckpointMgr26_call_Checkpoint_marshall2EPvS0_+0xef  [0x5863bf]
  [0:10] CkDeliverMessageFree+0x4e  [0x50fb95]
  [0:11]   [0x50fcdb]
  [0:12]   [0x50fdef]
  [0:13]   [0x51133a]
  [0:14]   [0x5113e0]
  [0:15] _Z15_processHandlerPvP11CkCoreState+0x126  [0x5118ae]
  [0:16] CmiHandleMessage+0x49  [0x5f2a6f]
  [0:17] CsdScheduleForever+0x9b  [0x5f2e04]
  [0:18] CsdScheduler+0x16  [0x5f2d47]
  [0:19]   [0x5eb348]
  [0:20] ConverseInit+0x362  [0x5eafdb]
  [0:21] main+0x3f  [0x5013c6]
  [0:22] __libc_start_main+0xed  [0x2aaaab7307ed]
  [0:23]   [0x4fa4d9]
../../../bin/testrun: line 63: 26214 Segmentation fault      (core dumped) ./charmrun $*
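
The mangled frames in the traceback above are easier to read once demangled. The following is a minimal sketch, assuming a GCC or Clang toolchain that ships `<cxxabi.h>`; the helper name `demangle` is hypothetical and is not part of Charm++.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Demangle an Itanium C++ ABI symbol name.
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result with malloc.
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

Applied to frame 2 of either traceback, `demangle("_ZN18pathHistoryManager3pupERN3PUP2erE")` yields `pathHistoryManager::pup(PUP::er&)`, which is the function that calls CmiAbort. Frame 7 decodes to `CkPupGroupData(PUP::er&)`, so the abort is reached while the checkpoint manager is pupping group data.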

History

#1 Updated by Sam White almost 3 years ago

  • Subject changed from Chkpt test failure in multicore-linux64 to PathHistory breaks chkpt test on multicore-linux64
  • Target version set to 6.8.0

This is from the recent critical path header changes for the PICS merge: https://charm.cs.illinois.edu/gerrit/#/c/893/

#2 Updated by Phil Miller almost 3 years ago

  • Assignee set to Phil Miller

Strange that this only fails for multicore-linux64.

#3 Updated by Phil Miller almost 3 years ago

So, this crashes the same test on multicore-darwin-x86_64 too. What I'm confused about is why it doesn't crash the chkpt test on other targets.

#4 Updated by Sam White almost 3 years ago

It happens in a netlrts-linux-x86_64-syncft build I just tried...

#5 Updated by Phil Miller almost 3 years ago

Looking at autobuild, several targets show up good, and I explicitly checked that they run that test to completion. I think they should all be failing, and this should have been caught by the Jenkins run.

#6 Updated by Phil Miller over 2 years ago

  • Status changed from New to Implemented

http://charm.cs.illinois.edu/gerrit/1331

There will still be an incompatibility between checkpoint/restart and critical path tracing, but it won't be enabled by default.
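
The outcome described above can be illustrated with a plain C++ mock. This is only a sketch of the failure pattern, not the content of the linked change and not Charm++ code: `MockPuper`, `MockGroup`, `HelloGroup`, `TracingGroup`, `createGroups`, and `checkpoint` are all hypothetical names, and an exception stands in for CmiAbort so the behavior can be observed without killing the process.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a PUP::er; it only records which groups were visited.
struct MockPuper {
    std::vector<std::string> visited;
};

// Hypothetical stand-in for a group with a virtual pup routine.
struct MockGroup {
    virtual ~MockGroup() = default;
    virtual void pup(MockPuper& p) = 0;
};

// An ordinary group that can be checkpointed.
struct HelloGroup : MockGroup {
    void pup(MockPuper& p) override { p.visited.push_back("HelloGroup"); }
};

// Plays the role of pathHistoryManager in the log: its pup routine refuses to run.
struct TracingGroup : MockGroup {
    void pup(MockPuper&) override {
        throw std::runtime_error("pathHistoryManager cannot be pupped.");
    }
};

// Assumed startup behavior: the tracing group exists only when explicitly enabled.
std::vector<std::unique_ptr<MockGroup>> createGroups(bool enableCriticalPath) {
    std::vector<std::unique_ptr<MockGroup>> groups;
    groups.push_back(std::make_unique<HelloGroup>());
    if (enableCriticalPath) {
        groups.push_back(std::make_unique<TracingGroup>());
    }
    return groups;
}

// A checkpoint pups every group; it fails if any group refuses.
bool checkpoint(const std::vector<std::unique_ptr<MockGroup>>& groups) {
    MockPuper p;
    try {
        for (const auto& g : groups) {
            g->pup(p);
        }
    } catch (const std::runtime_error&) {
        return false;
    }
    return true;
}
```

In this mock, `checkpoint(createGroups(false))` succeeds because the non-puppable group is never created, while `checkpoint(createGroups(true))` still fails, which mirrors the remaining incompatibility when tracing is turned on.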

#7 Updated by Phil Miller over 2 years ago

  • Status changed from Implemented to Merged
  • Closed date set to 2016-08-22 09:47:48.061131

#8 Updated by Phil Miller over 2 years ago

  • Target version changed from 6.8.0 to 6.8.0-beta1
