Bug #1653

NeighborLB segfaults during startup in SMP/multicore builds

Added by Sam White almost 2 years ago. Updated over 1 year ago.

Load Balancing

Running with NeighborLB causes a segfault during initialization on multicore builds. I haven't tried SMP builds, but non-SMP builds are fine.

$ gdb --args ./stencil3d +p4  64 32 +balancer NeighborLB
(gdb) r
Starting program: /dcsdata/home/swhite/tmp/charm5/multicore-linux-x86_64/examples/charm++/load_balancing/stencil3d/stencil3d +p4 64 32 +balancer NeighborLB
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/".
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode:  4 threads
[New Thread 0x7ffff6fd2700 (LWP 14080)]
[New Thread 0x7ffff67d1700 (LWP 14081)]
[New Thread 0x7ffff5fd0700 (LWP 14082)]
Converse/Charm++ Commit ID: v6.8.0-beta1-313-gc9c2334bd
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.

Program received signal SIGSEGV, Segmentation fault.
__GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
66    ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
#1  0x000000000061715a in LBTopoLookup (name=0x649c18 "torus_nd_5") at topology.C:1331
#2  0x00000000005cb971 in NborBaseLB::NborBaseLB (this=0xa29a40, __vtt_parm=0x62d6f0 <VTT for NeighborLB+16>, opt=..., __in_chrg=<optimized out>)
    at NborBaseLB.C:44
#3  0x00000000004eea44 in CBaseT1<CkLBOptions> (args#0=..., __vtt_parm=0x62d6e8 <VTT for NeighborLB+8>, this=this@entry=0xa29a40, 
    __in_chrg=<optimized out>) at charm++.h:420
#4  NeighborLB::NeighborLB (this=this@entry=0xa29a40, opt=..., __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at NeighborLB.C:12
#5  0x00000000004eeba2 in CkIndex_NeighborLB::_call_NeighborLB_marshall1 (impl_msg=<optimized out>, impl_obj_void=0xa29a40) at NeighborLB.def.h:104
#6  0x00000000005032a0 in CkDeliverMessageFree (epIdx=241, msg=0xa29860, obj=<optimized out>) at ck.C:593
#7  0x0000000000503f1e in _invokeEntryNoTrace (obj=0xa29a40, env=0xa29810, epIdx=241) at ck.C:637
#8  CkCreateLocalGroup (groupID=..., groupID@entry=..., epIdx=epIdx@entry=241, env=0xa29810) at ck.C:733
#9  0x0000000000505ac7 in _createGroup (groupID=groupID@entry=..., env=env@entry=0xa29810) at ck.C:808
#10 0x0000000000505b28 in _groupCreate (env=0xa29810) at ck.C:844
#11 CkCreateGroup (cIdx=<optimized out>, eIdx=<optimized out>, msg=0xa29860) at ck.C:890
#12 0x00000000004eef5b in CProxy_NeighborLB::ckNew (impl_noname_0=..., impl_e_opts=impl_e_opts@entry=0x0) at NeighborLB.def.h:56
#13 0x00000000004ef046 in CreateNeighborLB () at NeighborLB.C:10
#14 0x000000000056a6ce in LBDBInit::LBDBInit (this=0x948770, m=0x947ec0) at LBDatabase.C:131
#15 0x00000000004fd39d in _initCharm (unused_argc=<optimized out>, argv=argv@entry=0x7fffffffe998) at init.C:1489
#16 0x00000000005e760e in ConverseRunPE (everReturn=everReturn@entry=0) at machine-common-core.c:1296
#17 0x00000000005e8788 in ConverseInit (argc=6, argv=0x7fffffffe998, fn=<optimized out>, usched=<optimized out>, initret=0)
    at machine-common-core.c:1198
#18 0x00000000004bba37 in main (argc=<optimized out>, argv=<optimized out>) at main.C:18


#1 Updated by Sam White almost 2 years ago

  • Subject changed from NeighborLB failure during startup in multicore builds to NeighborLB segfaults during startup in SMP/multicore builds

The same failure is seen in SMP mode.

#2 Updated by Kavitha Chandrasekar over 1 year ago

The fix is to initialize the CmiNodeLock used in topology.C from _initCharm() in SMP mode, so that it exists before LBTopoLookup tries to take it. We may also need to check whether the use of LBTopology should be replaced by TopoManager, but that can probably be addressed separately.
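
For illustration only, here is a minimal standalone sketch of the init-before-use pattern described in this comment. It is not the actual Charm++ patch: `NodeLock`, `initTopoLock`, and `topoLookup` are hypothetical names, and `std::mutex` stands in for the Converse node lock (in Charm++ the corresponding calls would presumably be `CmiCreateLock()`, `CmiLock()`, and `CmiUnlock()`). The lock handle starts out null, matching the `mutex=0x0` in the backtrace, and only becomes usable after the startup hook has run.

```cpp
#include <mutex>
#include <string>

// Hypothetical stand-in for a node-level lock handle (assumption: the real
// code uses a CmiNodeLock; std::mutex is used here to stay self-contained).
using NodeLock = std::mutex*;

static std::mutex lockStorage;       // backing storage for the lock
static NodeLock topoLock = nullptr;  // null until initTopoLock() runs

// Hypothetical startup hook. In the fix described above, the equivalent
// initialization would be triggered from _initCharm() in SMP mode,
// before any load balancer constructor runs.
void initTopoLock() {
  if (topoLock == nullptr) {
    topoLock = &lockStorage;
  }
}

// Hypothetical lookup guarded by the node lock. Instead of passing a null
// handle to the lock call (the crash in the backtrace), it reports failure.
bool topoLookup(const std::string& name) {
  if (topoLock == nullptr) {
    return false;  // lock was never initialized
  }
  std::lock_guard<std::mutex> guard(*topoLock);
  return !name.empty();  // placeholder for the real table lookup
}
```

In this sketch, calling `topoLookup("torus_nd_5")` before `initTopoLock()` returns `false` rather than crashing, and the same call succeeds afterwards. The real code takes the lock unconditionally, which is why the ordering of initialization relative to the NeighborLB constructor matters.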

#3 Updated by Sam White over 1 year ago

  • Status changed from New to Implemented

#4 Updated by Sam White over 1 year ago

  • Status changed from Implemented to Merged
