Bug #1439

net-linux-x86_64-ibverbs-smp-iccstatic with tracing or debug enabled segfaults with ++ppn > 1

Added by Jim Phillips 3 months ago. Updated about 2 months ago.

Status: Merged
Priority: Normal
Assignee:
Category: Tracing
Target version: -
Start date: 02/19/2017
Due date:
% Done: 0%

Description

The verbs-linux-x86_64-iccstatic build works fine. I hit this while trying to do a performance comparison against the net build, configured with:

net-linux-x86_64-ibverbs-smp-iccstatic --no-build-shared --enable-tracing --enable-tracing-commthread -optimize

Starting program: /Projects/namd2/Programmers/jim/charm-6.8.0-proj-build-2017-Feb-19-137687/charm-6.8.0-pre/net-linux-x86_64-ibverbs-smp-iccstatic/bin/megatest +p2
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
[Thread debugging using libthread_db enabled]
Charm++: standalone mode (not using charmrun)
[New Thread 0x2aaaac348940 (LWP 27786)]
[New Thread 0x2aaaacd49940 (LWP 27787)]
Converse/Charm++ Commit ID: v6.7.0-674-g3378248-namd-charm-6.8.0-proj-build-2017-Feb-19-137687
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.  
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (64-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
[Thread 0x2aaaacd49940 (LWP 27787) exited]
Megatest is running on 1 nodes 2 processors. 

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaaac348940 (LWP 27786)]
0x0000000000510781 in CpdBeforeEp(int, void*, void*) ()
(gdb) where
#0  0x0000000000510781 in CpdBeforeEp(int, void*, void*) ()
#1  0x0000000000504bbe in CkDeliverMessageFree ()
#2  0x00000000004f0bbb in CkCreateLocalGroup ()
#3  0x0000000000505c24 in _processBocInitMsg(CkCoreState*, envelope*) ()
#4  0x00000000004e790f in _INTERNAL_6_init_C_1f988cfe::_initHandler(void*, CkCoreState*) ()
#5  0x000000000066cdb4 in CsdScheduler ()
#6  0x000000000064fe97 in ConverseRunPE ()
#7  0x000000000064f9e9 in call_startfn ()
#8  0x000000379760683d in start_thread () from /lib64/libpthread.so.0
#9  0x0000003796ad514d in clone () from /lib64/libc.so.6

Tried a debug build:

0x0000000000515bfa in CpdBeforeEp (ep=5, obj=0x2aaab406ac10, msg=0xa50510) at debug-charm.C:64
64      if (CpvAccess(cmiArgDebugFlag)) {
(gdb) where
#0  0x0000000000515bfa in CpdBeforeEp (ep=5, obj=0x2aaab406ac10, msg=0xa50510) at debug-charm.C:64
#1  0x00000000004ff95f in CkDeliverMessageFree (epIdx=5, msg=0xa50510, obj=0x2aaab406ac10) at ck.C:591

It looks like cmiArgDebugFlag isn't initialized for all threads:

src/arch/net/machine.c ConverseInit()

#if CMK_CCS_AVAILABLE
  CpvInitialize(int, cmiArgDebugFlag);
  CpvAccess(cmiArgDebugFlag) = 0;
#endif

(gdb) break machine.c:2983
Breakpoint 2 at 0x617cc8: file machine.c, line 2983.
(gdb) run +p2
Starting program: /Projects/namd2/Programmers/jim/charm-6.8.0-debug-build-2017-Feb-19-49741/charm-6.8.0-pre/net-linux-x86_64-ibverbs-smp-iccstatic/bin/megatest +p2
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
[Thread debugging using libthread_db enabled]

Breakpoint 2, ConverseInit (argc=2, argv=0x7fffffffdd18, fn=0x4eef31 <_initCharm(int, char**)>, usc=0, 
    everReturn=0) at machine.c:2983
2983      CpvInitialize(int, cmiArgDebugFlag);
(gdb) next
2984      CpvAccess(cmiArgDebugFlag) = 0;
(gdb) next
2986      Cmi_truecrash = 0;
(gdb) cont
Continuing.
Charm++: standalone mode (not using charmrun)
[New Thread 0x2aaaac348940 (LWP 28499)]
[New Thread 0x2aaaacd49940 (LWP 28500)]
Converse/Charm++ Commit ID: v6.7.0-674-g3378248-namd-charm-6.8.0-debug-build-2017-Feb-19-49741
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.  
Charm++> Using STL-based msgQ:
Charm++> Using randomized msgQ. Priorities will not be respected!
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (64-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
[Thread 0x2aaaacd49940 (LWP 28500) exited]
Megatest is running on 1 nodes 2 processors. 

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaaac348940 (LWP 28499)]
0x0000000000515bfa in CpdBeforeEp (ep=5, obj=0x2aaab4059190, msg=0xa695e0) at debug-charm.C:64
64      if (CpvAccess(cmiArgDebugFlag)) {

Compare to verbs-..., where the breakpoint on CpvInitialize is hit once per thread:

(gdb) break machine-common-core.c:1217
Breakpoint 1 at 0x611334: file machine-common-core.c, line 1217.
(gdb) run +p2
The program being debugged has been started already.
Start it from the beginning? (y or n) y

Starting program: /Projects/namd2/Programmers/jim/charm-6.8.0-debug-build-2017-Feb-19-49741/charm-6.8.0-pre/verbs-linux-x86_64-smp-iccstatic/bin/megatest +p2
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000
[Thread debugging using libthread_db enabled]
Charm++: standalone mode (not using charmrun)
Charm++> Running in SMP mode: numNodes 1,  2 worker threads per process
Charm++> The comm. thread both sends and receives messages
[New Thread 0x2aaaac348940 (LWP 28687)]
[Switching to Thread 0x2aaaac348940 (LWP 28687)]

Breakpoint 1, ConverseRunPE (everReturn=0) at machine-common-core.c:1217
1217      CpvInitialize(int, cmiArgDebugFlag);
(gdb) cont
Continuing.
[New Thread 0x2aaaacd49940 (LWP 28690)]
[Switching to Thread 0x2aaaacd49940 (LWP 28690)]

Breakpoint 1, ConverseRunPE (everReturn=0) at machine-common-core.c:1217
1217      CpvInitialize(int, cmiArgDebugFlag);
(gdb) cont
Continuing.
[Switching to Thread 0x2aaaaaae6540 (LWP 28618)]

Breakpoint 1, ConverseRunPE (everReturn=0) at machine-common-core.c:1217
1217      CpvInitialize(int, cmiArgDebugFlag);
(gdb) cont
Continuing.
Converse/Charm++ Commit ID: v6.7.0-674-g3378248-namd-charm-6.8.0-debug-build-2017-Feb-19-49741
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.  
Charm++> Using STL-based msgQ:
Charm++> Using randomized msgQ. Priorities will not be respected!
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (64-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
Megatest is running on 1 nodes 2 processors. 
test 0: initiated [completion_test (phil)]
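
The two gdb sessions above suggest the difference: in the net build the CpvInitialize breakpoint in ConverseInit() is hit once, before the worker threads exist, while in the verbs build ConverseRunPE() hits it on every thread. A minimal sketch of that failure mode follows, assuming Cpv storage is per-thread in SMP builds. It is not the real Cpv implementation; debug_flag, init_debug_flag and worker_sees_flag are hypothetical names, and a plain C11 thread-local pointer stands in for the Cpv slot.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Cpv slot: every thread has its own pointer,
   which stays NULL until that thread allocates storage for it. */
static _Thread_local int *debug_flag = NULL;

/* Stand-in for the CpvInitialize + assignment pair.
   It only affects the thread that calls it. */
static void init_debug_flag(void) {
    if (debug_flag == NULL)
        debug_flag = malloc(sizeof *debug_flag);
    if (debug_flag != NULL)
        *debug_flag = 0;
}

/* Worker body: optionally initialize, then report whether this thread's
   slot could be dereferenced safely. */
static void *worker(void *arg) {
    assert(arg != NULL);
    if (*(int *)arg)
        init_debug_flag();
    int usable = (debug_flag != NULL);
    free(debug_flag);
    debug_flag = NULL;
    return (void *)(intptr_t)usable;
}

/* Returns 1 if a new thread sees initialized storage, 0 if it does not,
   and -1 if the thread could not be created. */
int worker_sees_flag(int init_in_worker) {
    pthread_t t;
    void *result = NULL;
    init_debug_flag(); /* main-thread-only init, done before spawning */
    if (pthread_create(&t, NULL, worker, &init_in_worker) != 0)
        return -1;
    pthread_join(t, &result);
    return (int)(intptr_t)result;
}
```

Calling worker_sees_flag(0) models the net build: the main thread initialized its own slot, but the worker's slot is still NULL, so an unguarded read like the one at debug-charm.C:64 would fault. Calling worker_sees_flag(1) models the verbs build, where each thread runs the initialization itself.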

Attachment: net-smp-proj.patch (1.48 KB) Jim Phillips, 02/20/2017 02:01 PM

History

#1 Updated by Jim Phillips 3 months ago

This issue was known and fixed in lrts last spring:
https://charm.cs.illinois.edu/gerrit/gitweb?p=charm.git;a=commit;h=4667d328c13e2eedcb03307e1d3074dc52bb41ea
https://charm.cs.illinois.edu/gerrit/gitweb?p=charm.git;a=commit;h=1e4157cd1b6641c1e7d6f98c96afb847c65c8883
From Phil's commit comment:

 The 'net' builds may not be fixed, but I find I don't care. We're
deprecating them anyway.

#2 Updated by Jim Phillips 3 months ago

Cannot build with "--disable-ccs" as a workaround because charmrun fails to compile:

../../bin/charmc -optimize -production  -gcc-name=gcc44 -gxx-name=g++44  -Wno-error -lm -I.. -c -seq -DCMK_NOT_USE_CONVERSE=1 -DNOTIFY charmrun.C
charmrun.C(2174): error: identifier "write_stdio_duplicate" is undefined
    write_stdio_duplicate(msg->data);
    ^

charmrun.C(2182): error: identifier "write_stdio_duplicate" is undefined
    write_stdio_duplicate(msg->data);
    ^

charmrun.C(2190): error: identifier "write_stdio_duplicate" is undefined
    write_stdio_duplicate(msg->data);
    ^

charmrun.C(2204): error: identifier "write_stdio_duplicate" is undefined
    write_stdio_duplicate(msg->data);
    ^

charmrun.C(3006): error: identifier "req_ccs_connect" is undefined
        req_ccs_connect();
        ^

#3 Updated by Jim Phillips 3 months ago

Patch attached.

#4 Updated by Eric Bohm 3 months ago

  • Assignee set to Bilge Acun

#5 Updated by Bilge Acun about 2 months ago

  • Status changed from New to Implemented

#6 Updated by Bilge Acun about 2 months ago

  • Status changed from Implemented to Merged
