Bug #1331

Isomalloc hangs in startup for Clang non-SMP builds

Added by Sam White over 2 years ago. Updated about 1 year ago.

Status: Merged
Priority: Normal
Assignee: -
Category: AMPI
Target version: -
Start date: 12/21/2016
Due date: -
% Done: 0%
Tags: -

Description

The above patch doesn't fix the problem on Clang, which appears to be an initialization ordering issue:

Starting program: /dcsdata/home/swhite/tmp/charm/netlrts-linux-x86_64-clang/examples/ampi/Cjacobi3D/jacobi.iso 1 1 1 1 +vp1

Program received signal SIGSEGV, Segmentation fault.
0x00000000005fd491 in mspace_malloc ()
(gdb) bt
#0  0x00000000005fd491 in mspace_malloc ()
#1  0x0000000000602aa6 in mm_malloc ()
#2  0x0000000000604c00 in meta_malloc ()
#3  0x0000000000604ad2 in malloc ()
#4  0x00007ffff762adad in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff762aea9 in operator new[](unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x000000000063b278 in CkHashtable::buildTable (this=0x978468 <Cpv_threadCBs_+72>, newLen=5) at ckhashtable.C:116
#7  0x000000000063b481 in CkHashtable::CkHashtable (this=0x978468 <Cpv_threadCBs_+72>, layout_=..., initLen=5, NloadFactor=0.5, 
    Nhash=0x4edb20 <CkHashtableAdaptorT<unsigned int>::staticHash(void const*, unsigned long)>, 
    Ncompare=0x4edb50 <CkHashtableAdaptorT<unsigned int>::staticCompare(void const*, void const*, unsigned long)>) at ckhashtable.C:152
#8  0x00000000004edb0d in CkHashtableTslow<CkHashtableAdaptorT<unsigned int>, CkCallback*>::CkHashtableTslow (
    this=0x978468 <Cpv_threadCBs_+72>, initLen=5, NloadFactor=0.5, 
    Nhash=0x4edb20 <CkHashtableAdaptorT<unsigned int>::staticHash(void const*, unsigned long)>, 
    Ncompare=0x4edb50 <CkHashtableAdaptorT<unsigned int>::staticCompare(void const*, void const*, unsigned long)>) at ./ckhashtable.h:331
#9  0x00000000004ec055 in CkHashtableT<CkHashtableAdaptorT<unsigned int>, CkCallback*>::CkHashtableT (this=0x978468 <Cpv_threadCBs_+72>, 
    initLen=5, NloadFactor=0.5) at ./ckhashtable.h:360
#10 0x0000000000405c51 in __cxx_global_var_init1 () at ckcallback.C:19
#11 0x0000000000405dbe in global constructors keyed to a () at ckcallback.C:763
#12 0x000000000066463d in __libc_csu_init ()
#13 0x00007ffff7012ed5 in __libc_start_main (main=0x4bb830 <main(int, char**)>, argc=6, argv=0x7fffffffe9c8, init=0x6645f0 <__libc_csu_init>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe9b8) at libc-start.c:246
#14 0x0000000000407fd4 in _start ()

Related issues

Related to Charm++ - Bug #1310: Isomalloc hangs in CpvInitialize during startup for multicore/smp builds Merged 11/23/2016
Related to Charm++ - Bug #1337: Cpv Declarations of types with constructors may induce 'static initialization order fiasco' New 12/21/2016

History

#1 Updated by Sam White over 2 years ago

The patch mentioned in the description, which fixed a hang in Isomalloc for multicore/SMP builds, is here: https://charm.cs.illinois.edu/redmine/issues/1310

Phil recommended using ASan and/or Valgrind to track down uninitialized variables.

#2 Updated by Phil Miller over 2 years ago

The right fix for this particular issue would probably be to change that Cpv into a pointer to CkHashtable, so that the code setting it up can do the actual allocation after global constructors run.
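A minimal sketch of that suggestion, assuming a stand-in table type rather than the real CkHashtable and Cpv macros. The names `CallbackTable`, `initThreadCallbacks`, `threadCallbacks`, and `threadCallbacksReady` are hypothetical and only illustrate the pattern: the global becomes a plain pointer, which needs no constructor call before main(), and the allocation moves into an explicit init function.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical stand-in for the real hashtable type; not Charm++ code.
using CallbackTable = std::unordered_map<unsigned int, int>;

// Before: a namespace-scope object whose constructor may allocate during
// static initialization, before main() and before any allocator setup:
//   static CallbackTable threadCBs;

// After: a plain pointer. It is zero-initialized statically and runs no
// constructor, so nothing is allocated on the pre-main() path.
static CallbackTable* threadCBs = nullptr;

// Hypothetical init hook, meant to run once the memory subsystem is ready.
void initThreadCallbacks() {
  if (threadCBs == nullptr) threadCBs = new CallbackTable();
}

// Reports whether the table has been allocated yet.
bool threadCallbacksReady() { return threadCBs != nullptr; }

// Accessor that fails loudly on use-before-init instead of corrupting memory.
CallbackTable& threadCallbacks() {
  assert(threadCBs != nullptr && "initThreadCallbacks() must run first");
  return *threadCBs;
}
```

In this sketch the caller would invoke `initThreadCallbacks()` from its startup code after memory initialization, then go through `threadCallbacks()` everywhere the global object was used directly.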

#3 Updated by Sam White over 2 years ago

Ugh, running with asan we get a segfault in asan's initialization:

$ ../../../bin/ampicxx -fsanitize=address -g -O0 -c -DNO_PUP jacobi.C -o jacobi.iso.o
$ ../../../bin/ampicxx -fsanitize=address -g -O0 -o jacobi.iso jacobi.iso.o -module CommonLBs -memory isomalloc
$ gdb --args ./jacobi.iso 1 1 1 1 +vp1
Starting program: /dcsdata/home/swhite/tmp/charm/netlrts-linux-x86_64-clang/examples/ampi/Cjacobi3D/jacobi.iso 1 1 1 1 +vp1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00000000016c527b in calloc (nelem=140737488349056, nelem@entry=32, size=140737488349120, size@entry=1) at libmemory-isomalloc.c:665
665    {
(gdb) bt
#0  0x00000000016c527b in calloc (nelem=140737488349056, nelem@entry=32, size=140737488349120, size@entry=1) at libmemory-isomalloc.c:665
#1  0x00007ffff7bd7690 in _dlerror_run (operate=operate@entry=0x7ffff7bd7130 <dlsym_doit>, args=args@entry=0x7fffffffe890) at dlerror.c:141
#2  0x00007ffff7bd7198 in __dlsym (handle=<optimized out>, name=<optimized out>) at dlsym.c:70
#3  0x00000000005df300 in __interception::GetRealFunctionAddress(char const*, unsigned long*, unsigned long, unsigned long) ()
#4  0x00000000005c9d01 in __asan::InitializeAsanInterceptors() ()
#5  0x00000000005dccbd in __asan_init_v3 ()
#6  0x00007ffff7dea25a in _dl_init (main_map=0x7ffff7ffe1c8, argc=6, argv=0x7fffffffe9c8, env=0x7fffffffea00) at dl-init.c:111
#7  0x00007ffff7ddb30a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#8  0x0000000000000006 in ?? ()
#9  0x00007fffffffec1f in ?? ()
#10 0x00007fffffffec7c in ?? ()
#11 0x00007fffffffec7e in ?? ()
#12 0x00007fffffffec80 in ?? ()
#13 0x00007fffffffec82 in ?? ()
#14 0x00007fffffffec84 in ?? ()
#15 0x0000000000000000 in ?? ()

Edit: I'll take a look at changing that Cpv into a pointer.

#4 Updated by Sam White over 2 years ago

Valgrind shows an invalid write in mspace_malloc (memory-gnu-internal.c:4948):

$ /usr/bin/valgrind ./jacobi.iso 1 1 1 1 +vp1
==23286== Memcheck, a memory error detector
==23286== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==23286== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==23286== Command: ./jacobi.iso 1 1 1 1 +vp1
==23286== 
==23286== Invalid write of size 8
==23286==    at 0x5FD158: mspace_malloc (memory-gnu-internal.c:4948)
==23286==    by 0x602765: mm_malloc (memory-gnu.c:786)
==23286==    by 0x602D31: mm_realloc (memory-gnu.c:866)
==23286==    by 0x6025ED: realloc_hook_ini (memory-gnu.c:476)
==23286==    by 0x602D11: mm_realloc (memory-gnu.c:858)
==23286==    by 0x604BC0: meta_realloc (memory-isomalloc.c:129)
==23286==    by 0x604B49: realloc (libmemory-isomalloc.c:680)
==23286==    by 0x611C70: CmiAddCLA (convcore.c:292)
==23286==    by 0x61235B: CmiGetArgFlagDesc (convcore.c:524)
==23286==    by 0x612414: CmiGetArgFlag (convcore.c:534)
==23286==    by 0x60A81D: ConverseInit (machine-common-core.c:1055)
==23286==    by 0x4BB726: main (main.C:18)
==23286==  Address 0xe44f0d08e84752ac is not stack'd, malloc'd or (recently) free'd

#5 Updated by Phil Miller over 2 years ago

Writes to addresses that our malloc implementation hasn't returned yet are likely to do that. Just get the hashtable constructor, with its allocation call(s), off the pre-main() static initialization path, and it should be fine.

#6 Updated by Sam White over 2 years ago

After refactoring two static variables with constructors (threadCBs in ck-core/ckcallback.C and lbTopoMap in conv-ldb/topology.C) into pointers that are allocated later, I now consistently see a hang (segfault under gdb) from realloc calls in CmiAddCLA. I think we need to initialize Converse memory via CmiMemoryInit() earlier in Charm's initialization; right now it is called from ConverseCommonInit(). Currently, CmiMemoryInit takes argv as an argument...

Program received signal SIGSEGV, Segmentation fault.
0x00000000005772a4 in mspace_malloc ()
(gdb) bt
#0  0x00000000005772a4 in mspace_malloc ()
#1  0x0000000000579181 in mspace_realloc ()
#2  0x0000000000579f07 in mm_realloc ()
#3  0x000000000057b122 in realloc ()
#4  0x00000000005822b8 in CmiAddCLA ()
#5  0x000000000058244c in CmiGetArgIntDesc ()
#6  0x000000000057e4e7 in LrtsInit ()
#7  0x000000000057dd74 in ConverseInit ()
#8  0x00000000004903c4 in main ()

#7 Updated by Sam White over 2 years ago

  • Status changed from New to In Progress

#8 Updated by Sam White over 2 years ago

This doesn't affect netlrts-linux-x86_64-clang-smp or netlrts-darwin-x86_64-smp, which have slightly different init sequences with regard to CLAs.

#9 Updated by Sam White over 2 years ago

  • Subject changed from Isomalloc hangs in startup for clang builds to Isomalloc hangs in startup for Clang non-SMP builds

#10 Updated by Sam White over 2 years ago

The first call to malloc now always results in a hang (segfault under gdb in standalone mode). Even moving the CmiMemoryInit call up doesn't fix this, because CmiMemoryInit allocates some initial memory and then we fail in that:

#0  0x00000000005fd3c8 in mspace_malloc ()
#1  0x00000000006029f6 in mm_malloc ()
#2  0x00000000006049a0 in CmiOutOfMemoryInit ()
#3  0x00000000006048f2 in CmiMemoryInit ()
#4  0x000000000060aac2 in ConverseInit (argc=6, argv=0x7fffffffe9c8, fn=0x4c0d50 <_initCharm(int, char**)>, usched=0, initret=0)
    at ./machine-common-core.c:1054
#5  0x00000000004bb847 in main (argc=6, argv=0x7fffffffe9c8) at main.C:18

#11 Updated by Sam White over 2 years ago

TODO:
1. git bisect to see if this has ever worked
2. Run under valgrind with an eye on uninitialized variables.

#12 Updated by Phil Miller over 2 years ago

When I link against -memory os-isomalloc instead of -memory isomalloc (i.e. using the system malloc as the baseline backend rather than our own old version of dlmalloc), it runs under gdb without crashing. On 2 PEs, it runs with +balancer GreedyLB with migrations happening. Valgrind shows nothing more than the usual complaints about isomalloc pup.

I think that suggests we switch it over in the default tests, and maybe consider further obsoleting the internal malloc for use with isomalloc.

#13 Updated by Sam White over 2 years ago

Great, I'll push the change to our tests to gerrit and start trying out os-isomalloc on a few different machines. Also, perhaps our old dlmalloc is partially responsible for the overhead we have seen with running on isomalloc before (ignoring migration costs), such as here: https://charm.cs.illinois.edu/redmine/issues/681

#14 Updated by Sam White over 2 years ago

os-isomalloc seems to work everywhere isomalloc does, plus it works on clang non-SMP builds. The only caveat is that on Blue Waters os-isomalloc also requires linking with '-dynamic' as explained here: https://charm.cs.illinois.edu/redmine/issues/50

Change to os-isomalloc for tests/examples is here: https://charm.cs.illinois.edu/gerrit/#/c/2128/

#15 Updated by Sam White over 2 years ago

Looking at autobuild from last night, os-isomalloc passed the tests that isomalloc had been passing, except on netlrts-linux-smp, netlrts-linux-x86_64-smp, and netlrts-linux-x86_64-clang-smp. Darwin tests didn't make it to the Isomalloc tests. Here's the failure with os-isomalloc on the SMP builds:

[3] Stack Traceback:
  [3:0]   [0x82ab5c9]
  [3:1] [0xf7fdb400]
  [3:2] __pthread_mutex_lock+0x17  [0xf7f9dcb7]
  [3:3] LrtsInitCpuTopo+0x19d  [0x82c5afd]
  [3:4] _Z10_initCharmiPPc+0xe65  [0x81a9685]

#16 Updated by Phil Miller over 2 years ago

It looked like fairly similar failures across several of them to me. I suspect that the one we're seeing now ought to be easier to debug.

#17 Updated by Sam White over 2 years ago

Yes, the failure is that a CmiNodeLock in conv-core/cputopology.C is NULL when it shouldn't be. Trying to figure out how to properly synchronize creation and use of it now.

Edit: the static variable's constructor is not an issue.
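One generic way to make sure a lock exists before anyone uses it is to create it on first use. This is a sketch only, using std::mutex in place of CmiNodeLock, and is not the fix that was applied in Charm++; `topoLock` and `guardedIncrement` are hypothetical names. It relies on the C++11 guarantee that a function-local static is initialized exactly once, even if several threads reach it at the same time.

```cpp
#include <mutex>

// Hypothetical accessor; in the real code the lock is a CmiNodeLock.
// The function-local static is initialized exactly once (thread-safe since
// C++11), so a caller can never observe a missing or NULL lock.
std::mutex& topoLock() {
  static std::mutex lock;
  return lock;
}

// Example user: every access goes through topoLock(), so creation and use
// cannot be reordered relative to each other.
int guardedIncrement(int& counter) {
  std::lock_guard<std::mutex> guard(topoLock());
  counter += 1;
  return counter;
}
```

The trade-off is a small check on every call to the accessor. An explicit barrier after the lock is created by one thread would be the other common approach.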

#18 Updated by Phil Miller over 2 years ago

Does rearranging that to not be a static constructor fix this crash?

#19 Updated by Sam White over 2 years ago

I opened a separate issue for os-isomalloc failures on SMP builds here: https://charm.cs.illinois.edu/redmine/issues/1375

#20 Updated by Sam White over 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

#21 Updated by Sam White about 2 years ago

After upgrading prowess to Xenial (default compiler: gcc 5.4.0), Jenkins nightly build failed with mpi-linux-x86_64 in Isomalloc startup, similar to Clang above.

./build AMPI mpi-linux-x86_64 -j16 -g -O0

$ cd examples/ampi/Cjacobi3D/
$ make OPTS="-g -O0" 
$ ./charmrun +p3  ./jacobi.iso 2 2 2 40 +vp8 +balancer RotateLB +LBDebug 1
[prowess:24844] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7ffff6e39390]
[prowess:24844] [ 1] ./jacobi.iso(mspace_free+0x126)[0x675326]
[prowess:24844] [ 2] ./jacobi.iso(mm_free+0x8a)[0x67642a]
[prowess:24844] [ 3] ./jacobi.iso(free+0x16)[0x6772c6]
[prowess:24844] [ 4] /usr/lib/libopen-rte.so.12(orte_util_convert_string_to_process_name+0x85)[0x7ffff65f5545]
[prowess:24844] [ 5] /usr/lib/libopen-rte.so.12(+0x42e7b)[0x7ffff6617e7b]
[prowess:24844] [ 6] /usr/lib/libopen-rte.so.12(orte_oob_base_set_addr+0x54)[0x7ffff6618b34]
[prowess:24844] [ 7] /usr/lib/libopen-pal.so.13(opal_libevent2021_event_base_loop+0x7f9)[0x7ffff638fd19]
[prowess:24844] [ 8] /usr/lib/libopen-rte.so.12(+0x35b8e)[0x7ffff660ab8e]
[prowess:24844] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7ffff6e2f6ba]
[prowess:24844] [10] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7ffff6b653dd]

This fails ~25% of the time on gcc-5.4.0. os-isomalloc works, so we could potentially fall back to that here, as we already do on Clang in non-SMP mode.

#22 Updated by Sam White about 2 years ago

os-isomalloc is now the default version of isomalloc used on all non-SMP builds: https://charm.cs.illinois.edu/gerrit/#/c/2767/

#23 Updated by Sam White almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#24 Updated by Sam White almost 2 years ago

We ended up needing to revert the above patch because it broke autobuild targets for uth-linux-x86_64, mpi-crayxc, and gni-crayxc (and possibly for the builds on verbs-linux-x86_64 but that autobuild target wasn't working at the time): https://charm.cs.illinois.edu/gerrit/#/c/2779/

#25 Updated by Sam White over 1 year ago

  • Target version deleted (6.9.0)

#26 Updated by Sam White over 1 year ago

  • Tags set to isomalloc

#27 Updated by Evan Ramos about 1 year ago

Testing https://charm.cs.illinois.edu/gerrit/4318 on Ambition and a Charmworks office machine, I don't observe any issues running an AMPI program built with -memory isomalloc and netlrts-linux-x86_64-clang.

#28 Updated by Sam White about 1 year ago

Hmm, what about on Darwin?

#29 Updated by Evan Ramos about 1 year ago

-memory isomalloc still crashes on Darwin.

$ lldb -- ./jacobi.iso 1 1 1 1 +vp1
(lldb) target create "./jacobi.iso" 
Current executable set to './jacobi.iso' (x86_64).
(lldb) settings set -- target.run-args  "1" "1" "1" "1" "+vp1" 
(lldb) r
Process 926 launched: './jacobi.iso' (x86_64)
Process 926 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff67288b6e libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff67288b6e <+10>: jae    0x7fff67288b78            ; <+20>
    0x7fff67288b70 <+12>: movq   %rax, %rdi
    0x7fff67288b73 <+15>: jmp    0x7fff6727fb00            ; cerror_nocancel
    0x7fff67288b78 <+20>: retq   
Target 0: (jacobi.iso) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff67288b6e libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff67453080 libsystem_pthread.dylib`pthread_kill + 333
    frame #2: 0x00007fff671e41ae libsystem_c.dylib`abort + 127
    frame #3: 0x000000010031221e jacobi.iso`::mspace_free(msp=0x0000000100515280, mem=0x0000000100b00100) at memory-gnu-internal.c:5079
    frame #4: 0x0000000100314982 jacobi.iso`mm_free(mem=0x0000000100b00100) at memory-gnu.c:849
    frame #5: 0x000000010031686c jacobi.iso`meta_free(mem=0x0000000100b00100) at memory-isomalloc.c:118
    frame #6: 0x000000010030ebe4 jacobi.iso`::free(mem=0x0000000100b00100) at libmemory-isomalloc.C:678
    frame #7: 0x000000010037e77a jacobi.iso`cmi_hwloc__free_infos(infos=0x0000000100a1b220, count=5) at topology.c:289
    frame #8: 0x000000010037ed2b jacobi.iso`hwloc__free_object_contents(obj=0x0000000100a1b080) at topology.c:385
    frame #9: 0x000000010037ece5 jacobi.iso`cmi_hwloc_free_unlinked_object(obj=0x0000000100a1b080) at topology.c:404
    frame #10: 0x0000000100383299 jacobi.iso`hwloc_topology_clear_tree(topology=0x0000000100a1a1d0, root=0x0000000100a1b080) at topology.c:2948
    frame #11: 0x0000000100383283 jacobi.iso`hwloc_topology_clear_tree(topology=0x0000000100a1a1d0, root=0x0000000100a1baf0) at topology.c:2945
    frame #12: 0x0000000100383283 jacobi.iso`hwloc_topology_clear_tree(topology=0x0000000100a1a1d0, root=0x0000000100a1aa90) at topology.c:2945
    frame #13: 0x00000001003831a3 jacobi.iso`cmi_hwloc_topology_clear(topology=0x0000000100a1a1d0) at topology.c:2955
    frame #14: 0x0000000100380fb7 jacobi.iso`cmi_hwloc_topology_destroy(topology=0x0000000100a1a1d0) at topology.c:2971
    frame #15: 0x0000000100343e18 jacobi.iso`CmiInitHwlocTopology at cpuaffinity.c:59
    frame #16: 0x000000010031dd45 jacobi.iso`::ConverseInit(argc=6, argv=0x00007ffeefbff9e0, fn=(jacobi.iso`_initCharm(int, char**) at init.C:1151), usched=0, initret=0) at machine-common-core.C:1203
    frame #17: 0x0000000100148a7e jacobi.iso`::charm_main(argc=6, argv=0x00007ffeefbff9e0) at init.C:1698
    frame #18: 0x000000010013d692 jacobi.iso`main(argc=6, argv=0x00007ffeefbff9e0) at main.C:5
    frame #19: 0x00000001000017b4 jacobi.iso`start + 52

#30 Updated by Sam White about 1 year ago

This has been merged to make -memory isomalloc work by default on Clang non-SMP. We still change isomalloc to os-isomalloc on Darwin: https://charm.cs.illinois.edu/gerrit/4318

#31 Updated by Sam White about 1 year ago

  • Target version set to 6.9.0
  • Status changed from In Progress to Merged
