Bug #1080

multicore projections tracing runs hang at startup on 129 pes

Added by Jim Phillips about 3 years ago. Updated over 2 years ago.

Status: Merged
Priority: Urgent
Assignee:
Category: -
Target version:
Start date: 05/19/2016
Due date:
% Done: 0%

Description

jim@triton$./namd2.prj +p129 $APOA1 --outputenergies 100 +logsize 1000000 +traceroot /Scr/jim/triton +traceoff 
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  129 threads
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.7.1-rc1-0-gbdf6a1b-namd-charm-6.7.1-proj-build-2016-Apr-16-217305
Trace: logsize: 1000000
Charm++: Tracemode Projections enabled.
Trace: traceroot: /Scr/jim/triton/namd2.prj
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.  
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (272-way SMP).
Charm++> cpu topology info is gathered in 0.047 seconds.
[hangs until killed]

jim@triton$./namd2.prj +p128 $APOA1 --outputenergies 100 +logsize 1000000 +traceroot /Scr/jim/triton +traceoff 
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  128 threads
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.7.1-rc1-0-gbdf6a1b-namd-charm-6.7.1-proj-build-2016-Apr-16-217305
Trace: logsize: 1000000
Charm++: Tracemode Projections enabled.
Trace: traceroot: /Scr/jim/triton/namd2.prj
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.  
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (272-way SMP).
Charm++> cpu topology info is gathered in 0.054 seconds.
Info: NAMD 2.11 for Linux-x86_64-multicore
Info: 

History

#1 Updated by Eric Bohm about 3 years ago

  • Assignee set to Ronak Buch

#2 Updated by Ronak Buch about 3 years ago

  • Status changed from New to In Progress

#3 Updated by Phil Miller about 3 years ago

  • Description updated (diff)
  • Target version set to 6.8.0

#4 Updated by Ronak Buch about 3 years ago

  • Status changed from In Progress to New
  • Target version deleted (6.8.0)

I was able to replicate this on triton, but (somewhat oddly) I was also able to run a 129 PE job with Projections successfully. I'm not sure why it sometimes works. I'll keep looking into it.

#5 Updated by Ronak Buch about 3 years ago

  • Status changed from New to In Progress
  • Target version set to 6.8.0

I'm not sure how the status and target version were changed in my earlier update. I'm setting them back here.

#6 Updated by Phil Miller over 2 years ago

Status? This could be release critical.

#7 Updated by Phil Miller over 2 years ago

  • Target version changed from 6.8.0 to 6.8.0-beta1

#8 Updated by Ronak Buch over 2 years ago

I'll try to dig into this with ThreadSanitizer or Valgrind (using --fair-sched=yes). Since the run sometimes succeeds, we suspect a race condition that shows up at high PE counts.

#9 Updated by Sam White over 2 years ago

Seonmyeong reported a segfault with Projections on KNL using >128 PEs. Adding him as a 'watcher' here.

#10 Updated by Sam White over 2 years ago

EDIT: never mind.

#11 Updated by Ronak Buch over 2 years ago

With error checking enabled, the stack trace looks like:

#0  0x00007ffff6c31845 in raise () from /lib/libc.so.6
#1  0x00007ffff6c35390 in abort () from /lib/libc.so.6
#2  0x0000000000612e7e in charmrun_abort (
    s=0x66a5e0 "Can only do registrations from rank 0 processors") at machine.c:1029
#3  0x00000000006128d2 in LrtsAbort (
    message=0x66a5e0 "Can only do registrations from rank 0 processors")
    at machine.c:551
#4  0x000000000061221a in CmiAbortHelper (source=0x681d71 "Called CmiAbort", 
    message=0x66a5e0 "Can only do registrations from rank 0 processors", 
    suggestion=0x0, tellDebugger=1, framesToSkip=0) at machine-common-core.c:1451
#5  0x0000000000612249 in CmiAbort (
    message=0x66a5e0 "Can only do registrations from rank 0 processors")
    at machine-common-core.c:1455
#6  0x000000000053c970 in CkRegisteredInfo<EntryInfo>::add (
    this=0x90f820 <_entryTable>, t=0x7ffa040e3610) at register.h:275
#7  0x000000000053b635 in CkRegisterEp (
    name=0x675bc8 "AddToInactiveList(CkReductionInactiveMsg* impl_msg)", 
    call=0x5bbb42 <CkIndex_CkReductionMgr::_call_AddToInactiveList_CkReductionInactiveMsg(void*, void*)>, msgIdx=0, chareIdx=0, ck_ep_flags=8) at register.C:63
#8  0x00000000005bbb2b in CkIndex_CkReductionMgr::reg_AddToInactiveList_CkReductionInactiveMsg () at CkReduction.def.h:1256
#9  0x00000000005c2506 in CkIndex_CkReductionMgr::idx_AddToInactiveList_CkReductionInactiveMsg () at CkReduction.decl.h:1011
#10 0x00000000005ba4c1 in CProxyElement_CkReductionMgr::AddToInactiveList (
    this=0x7fffb37a54e0, impl_msg=0x7ffa04000910) at CkReduction.def.h:857
#11 0x00000000005ade47 in CkReductionMgr::informParentInactive (this=0x7ffa040e2ee0)
    at ckreduction.C:553
#12 0x00000000005adb6a in CkReductionMgr::checkIsActive (this=0x7ffa040e2ee0)
    at ckreduction.C:499
#13 0x00000000005ad65d in CkReductionMgr::doneCreatingContributors (
    this=0x7ffa040e2ee0) at ckreduction.C:312
#14 0x00000000005a9b3c in CkArrayReducer::ckEndInserting (this=0x7ffa040e3340)
    at ckarray.C:147
#15 0x00000000005a0153 in CkArray::remoteDoneInserting (this=0x7ffa040e2ee0)
    at ckarray.C:1232
#16 0x00000000005a4c24 in CkIndex_CkArray::_call_remoteDoneInserting_void (
    impl_msg=0x7fff6c1179f0, impl_obj_void=0x7ffa040e2ee0) at CkArray.def.h:974
#17 0x0000000000541e2b in CkDeliverMessageFree (epIdx=71, msg=0x7fff6c1179f0, 
    obj=0x7ffa040e2ee0) at ck.C:593
#18 0x0000000000541f7f in _invokeEntryNoTrace (epIdx=71, env=0x7fff6c1179a0, 
    obj=0x7ffa040e2ee0) at ck.C:637
#19 0x000000000054208e in _invokeEntry (epIdx=71, env=0x7fff6c1179a0, 
    obj=0x7ffa040e2ee0) at ck.C:648
#20 0x000000000054369f in _deliverForBocMsg (ck=0x7ffcbc04b110, epIdx=71, 
    env=0x7fff6c1179a0, obj=0x7ffa040e2ee0) at ck.C:1095
#21 0x0000000000543752 in _processForBocMsg (ck=0x7ffcbc04b110, env=0x7fff6c1179a0)
    at ck.C:1107
#22 0x0000000000543bc7 in _processHandler (converseMsg=0x7fff6c1179a0, 
    ck=0x7ffcbc04b110) at ck.C:1238
#23 0x000000000061944b in CmiHandleMessage (msg=0x7fff6c1179a0) at convcore.c:1814
#24 0x00000000006197d8 in CsdScheduleForever () at convcore.c:2051
#25 0x000000000061971e in CsdScheduler (maxmsgs=-1) at convcore.c:1987
#26 0x0000000000611f6f in ConverseRunPE (everReturn=0) at machine-common-core.c:1297
#27 0x000000000060f6ea in call_startfn (vindex=0x7f) at machine-smp.c:406
#28 0x00007ffff79af9ca in start_thread () from /lib/libpthread.so.0
#29 0x00007ffff6ce945d in clone () from /lib/libc.so.6
#30 0x0000000000000000 in ?? ()

Without error checking, it segfaults with the following stack trace:

#0  0x00007ffff79b1d80 in pthread_mutex_destroy () from /lib/libpthread.so.0
#1  0x000000000060ee42 in LrtsDestroyLock (lock=0x0) at machine-common-core.c:1656
#2  0x000000000060c2f6 in CmiDestroyLocks () at machine-smp.c:519
#3  0x000000000060ee6f in machine_exit (status=1) at machine.c:331
#4  0x000000000060f12b in LrtsAbort (
    message=0x665d10 "Charm++: Invalid entry method executed.  Perhaps there is an unregistered module?") at machine.c:540
#5  0x000000000060eacc in CmiAbortHelper (source=0x677fe9 "Called CmiAbort", 
    message=0x665d10 "Charm++: Invalid entry method executed.  Perhaps there is an unregistered module?", suggestion=0x0, tellDebugger=1, framesToSkip=0)
    at machine-common-core.c:1451
#6  0x000000000060eafb in CmiAbort (
    message=0x665d10 "Charm++: Invalid entry method executed.  Perhaps there is an unregistered module?") at machine-common-core.c:1455
#7  0x000000000053a5f2 in ckInvalidCallFn (msg=0x7fff540e5080, obj=0x7ffe38342ca0)
    at register.C:42
#8  0x0000000000540c05 in CkDeliverMessageFree (epIdx=0, msg=0x7fff540e5080, 
    obj=0x7ffe38342ca0) at ck.C:593
#9  0x0000000000540d34 in _invokeEntryNoTrace (epIdx=0, env=0x7fff540e5030, 
    obj=0x7ffe38342ca0) at ck.C:637
#10 0x0000000000540e43 in _invokeEntry (epIdx=0, env=0x7fff540e5030, 
    obj=0x7ffe38342ca0) at ck.C:648
#11 0x0000000000542323 in _deliverForBocMsg (ck=0x7ffdd0080270, epIdx=0, 
    env=0x7fff540e5030, obj=0x7ffe38342ca0) at ck.C:1095
#12 0x00000000005423d6 in _processForBocMsg (ck=0x7ffdd0080270, env=0x7fff540e5030)
    at ck.C:1107
#13 0x000000000054283a in _processHandler (converseMsg=0x7fff540e5030, 
    ck=0x7ffdd0080270) at ck.C:1238
#14 0x0000000000615bec in CmiHandleMessage (msg=0x7fff540e5030) at convcore.c:1814
#15 0x0000000000615f79 in CsdScheduleForever () at convcore.c:2051
#16 0x0000000000615ebf in CsdScheduler (maxmsgs=-1) at convcore.c:1987
#17 0x000000000060e821 in ConverseRunPE (everReturn=0) at machine-common-core.c:1297
#18 0x000000000060c134 in call_startfn (vindex=0x3a) at machine-smp.c:406
#19 0x00007ffff79af9ca in start_thread () from /lib/libpthread.so.0
#20 0x00007ffff6ce945d in clone () from /lib/libc.so.6
#21 0x0000000000000000 in ?? ()

#12 Updated by Ronak Buch over 2 years ago

  • Status changed from In Progress to Implemented

This turned out to be caused by an array overflow that overwrote a static guard. With the guard clobbered, a function that had already been called once, and whose result had already been assigned to a static variable, was called again, which caused the erroneous behavior.

To fix this, the array, which is responsible for storing reductions, was changed to a vector. The fix is at https://charm.cs.illinois.edu/gerrit/#/c/2117/
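
A minimal sketch of the pattern described above, not the actual Charm++ code. All names here (registerEntry, entryIndex, FixedReducerTable, GrowableReducerTable, kMaxReducers) are hypothetical, and the capacity of 128 is an assumption picked only because the failure showed up at 129 PEs; the real array size is not stated in this report. The sketch shows the function-local static idiom that relies on a hidden guard, a fixed-capacity table with the bounds check that an unchecked write would lack, and a vector-backed table that simply grows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a reducer function type.
using ReducerFn = int (*)(int, int);

int addInts(int a, int b) { return a + b; }

// Counts how many times the (hypothetical) registration routine runs.
int g_registrationCalls = 0;

int registerEntry() {
    ++g_registrationCalls;
    return 7;  // arbitrary illustrative index
}

// Function-local static: the compiler keeps a hidden guard next to idx so
// that registerEntry() runs only on the first call. If stray writes from a
// neighboring array clobbered that guard, the initializer could run again.
int entryIndex() {
    static int idx = registerEntry();
    return idx;
}

// Assumed capacity, for illustration only.
constexpr std::size_t kMaxReducers = 128;

// "Before" shape: fixed-size storage. The check in add() is what an
// unchecked version would be missing; without it, entry 129 would write
// past the end of slots into whatever memory follows.
struct FixedReducerTable {
    ReducerFn slots[kMaxReducers] = {};
    std::size_t count = 0;

    bool add(ReducerFn fn) {
        if (count >= kMaxReducers) return false;  // refuse instead of overflowing
        slots[count++] = fn;
        return true;
    }
};

// "After" shape: vector-backed storage grows on demand, so there is no
// fixed end to run past.
struct GrowableReducerTable {
    std::vector<ReducerFn> slots;

    std::size_t add(ReducerFn fn) {
        slots.push_back(fn);
        return slots.size() - 1;
    }

    ReducerFn get(std::size_t i) const {
        assert(i < slots.size());
        return slots[i];
    }
};
```

With these definitions, calling entryIndex() repeatedly runs registerEntry() once, the fixed table rejects its 129th entry, and the growable table accepts it.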

#13 Updated by Phil Miller over 2 years ago

  • Status changed from Implemented to Merged
  • Closed date set to 2017-01-18 10:52:28.657898
