Bug #1375

os-isomalloc failures during startup on SMP builds

Added by Sam White almost 2 years ago. Updated 5 months ago.

Status:
Merged
Priority:
Normal
Assignee:
Category:
SMP
Target version:
Start date:
01/22/2017
Due date:
% Done:

0%


Description

os-isomalloc fails during startup on the first access to a lock in cputopology.C. os-isomalloc works on Clang non-SMP builds, where isomalloc does not, so fixing this issue should give us an isomalloc solution that works everywhere isomalloc can work.

Here's the failure on the SMP builds:

[3] Stack Traceback:
thread #2: tid = 0x8bc76, 0x00007fff95ae245c libsystem_pthread.dylib`pthread_mutex_lock, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)

  * frame #0: 0x00007fff95ae245c libsystem_pthread.dylib`pthread_mutex_lock
    frame #1: 0x000000010023f275 jacobi.iso`LrtsLock(lock=0x0000000000000000) + 21 at machine-common-core.c:1647
    frame #2: 0x0000000100266e56 jacobi.iso`::LrtsInitCpuTopo(argv=0x0000000100900de0) + 918 at cputopology.C:540
    frame #3: 0x0000000100267a95 jacobi.iso`::CmiInitCPUTopology(argv=0x0000000100900de0) + 21 at cputopology.C:652
    frame #4: 0x00000001000d619d jacobi.iso`_initCharm(unused_argc=6, argv=0x0000000100900de0) + 9613 at init.C:1389
    frame #5: 0x00000001002418c8 jacobi.iso`ConverseRunPE(everReturn=0) + 888 at machine-common-core.c:1290
    frame #6: 0x0000000100246a44 jacobi.iso`call_startfn(vindex=0x0000000000000001) + 196 at machine-smp.c:406
    frame #7: 0x00007fff95ae499d libsystem_pthread.dylib`_pthread_body + 131
    frame #8: 0x00007fff95ae491a libsystem_pthread.dylib`_pthread_start + 168
    frame #9: 0x00007fff95ae2351 libsystem_pthread.dylib`thread_start + 13

History

#1 Updated by Sam White almost 2 years ago

Fixed the hang during initialization: https://charm.cs.illinois.edu/gerrit/#/c/2151/

But now we get a failure after migration, when unpacking AMPI ranks on a new PE, inside CmiIsomallocBlockListActivate().

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6dd2700 (LWP 23963)]
0x000000000064e235 in CmiIsomallocBlockListActivate ()
(gdb) bt
#0  0x000000000064e235 in CmiIsomallocBlockListActivate ()
#1  0x00000000004d7e34 in activateThread () at tcharm_impl.h:244
#2  TCharm::pup (this=0x7ffff0170600, p=...) at tcharm.C:273
#3  0x000000000058f212 in CkLocMgr::pupElementsFor (this=this@entry=0x7ffff0169bb0, p=..., rec=rec@entry=0x7ffff016b130, 
    type=type@entry=CkElementCreation_migrate, rebuild=rebuild@entry=false) at cklocation.C:2997
#4  0x000000000058f48f in CkLocMgr::immigrate (this=0x7ffff0169bb0, msg=0x7ffff53cd070) at cklocation.C:3181
#5  0x000000000056ca80 in CkDeliverMessageFree (epIdx=36, msg=0x7ffff53cd070, obj=<optimized out>) at ck.C:593
#6  0x000000000057309f in _deliverForBocMsg (ck=<optimized out>, obj=<optimized out>, env=<optimized out>, epIdx=<optimized out>) at ck.C:1092
#7  _processForBocMsg (env=<optimized out>, ck=<optimized out>) at ck.C:1110
#8  _processHandler (converseMsg=<optimized out>, ck=<optimized out>) at ck.C:1242
#9  0x0000000000658545 in CmiHandleMessage (msg=<optimized out>) at convcore.c:1806
#10 CsdScheduleForever () at convcore.c:2043
#11 0x000000000065889d in CsdScheduler (maxmsgs=maxmsgs@entry=-1) at convcore.c:1979
#12 0x00000000006557da in ConverseRunPE (everReturn=0) at machine-common-core.c:1297
#13 0x0000000000655845 in call_startfn (vindex=0x1) at machine-smp.c:406
#14 0x00007ffff7bc4184 in start_thread (arg=0x7ffff6dd2700) at pthread_create.c:312
#15 0x00007ffff6ecd37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

#2 Updated by Sam White almost 2 years ago

After more testing, it appears the above change broke '-memory isomalloc' in SMP mode: it was already broken for Clang non-SMP, and now it hangs during startup in SMP mode. I'm not sure how we want to prioritize the remaining issues, fixes, and testing.

The remaining issues with both isomalloc and os-isomalloc are difficult to debug. I think for the 6.8.0 release '-memory isomalloc' should be the recommended and tested method, with a release note that os-isomalloc can be used for Clang non-SMP mode.

Reverted the above patch.

#3 Updated by Sam White almost 2 years ago

  • Category changed from AMPI to SMP

#4 Updated by Sam White almost 2 years ago

  • Priority changed from Urgent to Normal

#5 Updated by Sam White almost 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

#6 Updated by Sam White over 1 year ago

os-isomalloc is now the default version of isomalloc used on all non-SMP builds: https://charm.cs.illinois.edu/gerrit/#/c/2767/

Edit: the above patch was reverted since it broke other builds... os-isomalloc is now only the default for Clang non-SMP.

#7 Updated by Sam White over 1 year ago

  • Target version changed from 6.8.1 to 6.9.0

#8 Updated by Sam White about 1 year ago

  • Target version deleted (6.9.0)

#9 Updated by Phil Miller about 1 year ago

Here's the real fix for the hang:
https://charm.cs.illinois.edu/gerrit/3204
It turns out that the barrier calls on the worker and comm threads were misaligned, which eventually deadlocked at the point noted in the observations above.

After applying this, we still get a crash with os-isomalloc when ppn > 2, but it's progress.

#10 Updated by Phil Miller about 1 year ago

  • Target version set to 6.9.0

The subsequent crash happened simply because there were CpvInitialize calls in meta_init(), and the call was guarded by if (CmiMyRank() == 0), so those variables were only initialized on rank 0 PEs. Sam's fixing that up, and cleaning up assignments in other memory modules that would still need to be guarded.

Hopefully, this lets us use os-isomalloc pretty uniformly, and be less troubled by the places where non-os isomalloc currently doesn't work.
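
The rank-0 guard problem described above can be illustrated without Charm++ at all. The following is only a minimal sketch, not the actual meta_init() code: the names `slot`, `init_guarded`, `init_per_rank`, `count_initialized`, `reset_slots`, and `NUM_RANKS` are hypothetical, and a plain array indexed by rank stands in for a Cpv variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_RANKS 4 /* hypothetical number of PEs (ranks) in one SMP process */

/* Stand-in for a per-PE variable: one slot per rank. */
static int *slot[NUM_RANKS];

/* Buggy pattern: the whole setup sits behind a rank-0 guard, so the slots
 * belonging to ranks 1..NUM_RANKS-1 are never allocated. */
void init_guarded(int my_rank) {
    assert(my_rank >= 0 && my_rank < NUM_RANKS);
    if (my_rank == 0) {
        slot[my_rank] = (int *)calloc(1, sizeof(int));
    }
}

/* Fixed pattern: every rank initializes its own slot. */
void init_per_rank(int my_rank) {
    assert(my_rank >= 0 && my_rank < NUM_RANKS);
    if (slot[my_rank] == NULL) {
        slot[my_rank] = (int *)calloc(1, sizeof(int));
    }
}

/* Number of ranks whose slot is usable. */
int count_initialized(void) {
    int n = 0;
    for (int r = 0; r < NUM_RANKS; r++) {
        if (slot[r] != NULL) n++;
    }
    return n;
}

/* Free every slot so the two patterns can be compared from a clean state. */
void reset_slots(void) {
    for (int r = 0; r < NUM_RANKS; r++) {
        free(slot[r]);
        slot[r] = NULL;
    }
}
```

In this model, calling init_guarded(r) for every rank leaves only one slot initialized, while calling init_per_rank(r) for every rank initializes all of them. Any rank other than 0 that dereferences its slot after the guarded version would hit a NULL pointer.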

#11 Updated by Sam White about 1 year ago

  • Status changed from In Progress to Implemented

This fixes memory os-isomalloc and updates the other modules to work in SMP mode: https://charm.cs.illinois.edu/gerrit/#/c/3206/

#12 Updated by Sam White about 1 year ago

  • Status changed from Implemented to Merged
  • Tags set to #smp, ampi, isomalloc

Patch to make os-isomalloc the recommended and tested version of Isomalloc everywhere: https://charm.cs.illinois.edu/gerrit/#/c/3208/

#13 Updated by Sam White 5 months ago

  • Assignee changed from Sam White to Evan Ramos
  • Target version deleted (6.9.0)
  • Status changed from Merged to New

Looks like I accidentally marked this merged. It is not implemented, though I think the last comment on the gerrit patch might help: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/3208/

#14 Updated by Evan Ramos 5 months ago

#15 Updated by Sam White 5 months ago

That fixed most of the issues with os-isomalloc, but I think it still doesn't work on Cori with its version of libc. It needs to be tested there again.

#16 Updated by Evan Ramos 5 months ago

It still fails due to dlsym calling calloc.

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000000002028793f in _INTERNALb3ae16b5::meta_malloc (size=32) at memory-isomalloc.c:105
#2  _INTERNALb3ae16b5::meta_calloc (nelem=<optimized out>, size=<optimized out>) at memory-isomalloc.c:123
#3  calloc (nelem=32, nelem@entry=1, size=size@entry=32) at libmemory-os-isomalloc.C:322
#4  0x00000000206669cc in _dlerror_run (operate=operate@entry=0x20666640 <dlsym_doit>, args=args@entry=0x7fffffff58e0) at dlerror.c:141
#5  0x0000000020666691 in __dlsym (handle=<optimized out>, name=<optimized out>, dl_caller=<optimized out>) at dlsym.c:70
#6  0x000000002028880a in initialize_memory_wrapper () at memory-os-wrapper.C:32
#7  0x0000000020288782 in initialize_memory_wrapper_malloc (size=32) at libmemory-os-isomalloc.C:133
#8  0x0000000020287823 in _INTERNALb3ae16b5::meta_malloc (size=<optimized out>) at memory-isomalloc.c:105
#9  malloc (size=32, size@entry=64) at libmemory-os-isomalloc.C:320
#10 0x00000000206bca63 in _dl_get_origin () at ../sysdeps/unix/sysv/linux/dl-origin.c:50
#11 0x00000000206bf45f in _dl_non_dynamic_init () at dl-support.c:307
#12 0x00000000206c0998 in __libc_init_first (argc=argc@entry=1, argv=argv@entry=0x7fffffff6ab8, envp=0x7fffffff6ac8) at ../csu/init-first.c:79
#13 0x00000000206344f6 in __libc_start_main (main=0x200f6130 <main>, argc=1, argv=0x7fffffff6ab8, init=0x20634a40 <__libc_csu_init>, fini=0x20634ad0 <__libc_csu_fini>, rtld_fini=0x0, stack_end=0x7fffffff6aa8) at libc-start.c:225
#14 0x000000002000ab29 in _start () at ../sysdeps/x86_64/start.S:118
(gdb) bt full
#0  0x0000000000000000 in ?? ()
No symbol table info available.
#1  0x000000002028793f in _INTERNALb3ae16b5::meta_malloc (size=32) at memory-isomalloc.c:105
        ret = 0x0
        _isomalloc_thread = 0
#2  _INTERNALb3ae16b5::meta_calloc (nelem=<optimized out>, size=<optimized out>) at memory-isomalloc.c:123
No locals.
#3  calloc (nelem=32, nelem@entry=1, size=size@entry=32) at libmemory-os-isomalloc.C:322
No locals.
#4  0x00000000206669cc in _dlerror_run (operate=operate@entry=0x20666640 <dlsym_doit>, args=args@entry=0x7fffffff58e0) at dlerror.c:141
        result = <optimized out>
#5  0x0000000020666691 in __dlsym (handle=<optimized out>, name=<optimized out>, dl_caller=<optimized out>) at dlsym.c:70
        args = {handle = 0xffffffffffffffff, name = 0x206f7b04 "realloc", who = 0x2028880a <initialize_memory_wrapper+42>, sym = 0x0}
        result = <optimized out>
#6  0x000000002028880a in initialize_memory_wrapper () at memory-os-wrapper.C:32
No locals.
#7  0x0000000020288782 in initialize_memory_wrapper_malloc (size=32) at libmemory-os-isomalloc.C:133
        malloc_wrapper = 1
#8  0x0000000020287823 in _INTERNALb3ae16b5::meta_malloc (size=<optimized out>) at memory-isomalloc.c:105
        _isomalloc_thread = 0
#9  malloc (size=32, size@entry=64) at libmemory-os-isomalloc.C:320
No locals.
#10 0x00000000206bca63 in _dl_get_origin () at ../sysdeps/unix/sysv/linux/dl-origin.c:50
        linkval = "/global/u2/e/earamos/charm/gni-crayxc-smp/tests/ampi/isomalloc/os-isomalloc", '\000' <repeats 3605 times>...
        result = <optimized out>
        len = 63
        __PRETTY_FUNCTION__ = "_dl_get_origin" 
#11 0x00000000206bf45f in _dl_non_dynamic_init () at dl-support.c:307
No locals.
#12 0x00000000206c0998 in __libc_init_first (argc=argc@entry=1, argv=argv@entry=0x7fffffff6ab8, envp=0x7fffffff6ac8) at ../csu/init-first.c:79
No locals.
#13 0x00000000206344f6 in __libc_start_main (main=0x200f6130 <main>, argc=1, argv=0x7fffffff6ab8, init=0x20634a40 <__libc_csu_init>, fini=0x20634ad0 <__libc_csu_fini>, rtld_fini=0x0, stack_end=0x7fffffff6aa8) at libc-start.c:225
        result = <optimized out>
        ev = 0x7fffffff6ac8
        auxvec = <optimized out>
        __PRETTY_FUNCTION__ = "__libc_start_main" 
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {0, 0, 0, 0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#14 0x000000002000ab29 in _start () at ../sysdeps/x86_64/start.S:118
No locals.

#17 Updated by Evan Ramos 5 months ago

I tried the trick described in some Stack Overflow entries: using a static buffer to satisfy heap allocations made inside dlsym, to avoid the crash. However, on gni-crayxc-smp, dlsym returns NULL when looking up malloc and friends, and the program then crashes when trying to fprintf the output of dlerror(). Using GDB I can see the intended error message, internal to dlsym: "RTLD_NEXT used in code not dynamically loaded". Solving this would require restructuring memory.C and memory-os-wrapper.C so that the latter contains the redefinitions of the allocators, compiling it as a shared object, and running all resulting programs with an LD_PRELOAD line containing it.
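
For reference, the static-buffer trick mentioned above can be sketched roughly as follows. This is a hedged illustration, not the code from memory-os-wrapper.C: `boot_arena`, `boot_calloc`, `boot_owns`, `shim_calloc`, `shim_free`, `real_calloc`, and `BOOT_ARENA_SIZE` are hypothetical names, the arena size is arbitrary, and the dlsym lookup itself is left out (the sketch only assumes `real_calloc` stays NULL until that lookup has finished).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BOOT_ARENA_SIZE 4096 /* arbitrary size; must be a multiple of 16 */

/* Static arena; the extra union members only force a generous alignment. */
static union {
    unsigned char bytes[BOOT_ARENA_SIZE];
    long double align_ld;
    void *align_ptr;
} boot_arena;
static size_t boot_used = 0;

typedef void *(*calloc_fn)(size_t, size_t);
/* Assumed to stay NULL until the real calloc has been resolved. */
static calloc_fn real_calloc = NULL;

/* True if p points into the static arena. */
int boot_owns(const void *p) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)boot_arena.bytes;
    return a >= lo && a < lo + BOOT_ARENA_SIZE;
}

/* Bump allocator over the arena. Static storage starts zeroed and is never
 * reused, so the memory handed out is already zero-filled. */
static void *boot_calloc(size_t nelem, size_t size) {
    if (size != 0 && nelem > SIZE_MAX / size) return NULL; /* overflow */
    size_t bytes = nelem * size;
    if (bytes > BOOT_ARENA_SIZE - boot_used) return NULL;  /* arena full */
    void *p = boot_arena.bytes + boot_used;
    boot_used += (bytes + 15u) & ~(size_t)15u; /* keep offsets 16-aligned */
    return p;
}

/* Serve requests from the arena while the real allocator is unresolved. */
void *shim_calloc(size_t nelem, size_t size) {
    if (real_calloc == NULL) return boot_calloc(nelem, size);
    return real_calloc(nelem, size);
}

/* Arena pointers must never reach the real free(); a real interposer would
 * call the resolved free here instead of libc's directly. */
void shim_free(void *p) {
    if (p == NULL || boot_owns(p)) return;
    free(p);
}
```

The design choice here is that bootstrap allocations are simply leaked: they are few and small, and skipping them in shim_free avoids handing the real allocator a pointer it never issued. As noted above, this only sidesteps the allocation inside dlsym; it does not help when dlsym itself returns NULL.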

Is there any reason we can't use the GNU hooks instead?

#18 Updated by Sam White 5 months ago

Not that I know of. It looks like there's already some support for that, guarded by CMK_MEMORY_BUILD_GNU_HOOKS in memory.C.

#19 Updated by Evan Ramos 5 months ago

Reading up on it, the issue with the hooks is that when we want to call into glibc's malloc, we have to remove the hooks, and that is not thread-safe.
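
The window in question can be modeled in a few lines. The sketch below is a deterministic, single-threaded model and does not use glibc's hook interface: `dispatch`, `hook_alloc`, `base_alloc`, `other_thread_request`, and `arm_concurrent_request` are hypothetical names, and the "second thread" is simulated by a callback that fires while the hook is removed.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*alloc_fn)(void);

static void hook_alloc(void);
static void base_alloc(void);

/* Dispatch pointer playing the role of a global malloc hook. */
static alloc_fn dispatch = hook_alloc;

int hooked_calls = 0; /* requests that went through the hook */
int bypass_calls = 0; /* requests that arrived while the hook was removed */

/* If set, runs once inside base_alloc to model a second thread calling the
 * allocator while the first thread has the hook removed. */
static alloc_fn pending_other_thread = NULL;

/* Public entry point: every "allocation" goes through the dispatch pointer. */
void alloc_entry(void) { dispatch(); }

static void base_alloc(void) {
    if (pending_other_thread != NULL) {
        alloc_fn other = pending_other_thread;
        pending_other_thread = NULL; /* fire only once */
        other();
    }
}

static void hook_alloc(void) {
    hooked_calls++;
    dispatch = base_alloc; /* remove the hook to avoid recursing into it */
    base_alloc();          /* "call into the underlying malloc" */
    dispatch = hook_alloc; /* restore the hook */
}

/* The modeled second thread: it sees whatever dispatch holds right now. */
void other_thread_request(void) {
    if (dispatch != hook_alloc) bypass_calls++;
    alloc_entry();
}

/* Arm the model so the next hooked call is interrupted by a second request. */
void arm_concurrent_request(void) {
    pending_other_thread = other_thread_request;
}
```

After arm_concurrent_request() followed by one alloc_entry(), the model records one hooked call and one bypassed call: the second request ran entirely in the underlying allocator because the global pointer was temporarily unhooked. With real threads the same interleaving is possible but not deterministic, which is the thread-safety concern raised above.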

#20 Updated by Evan Ramos 5 months ago

I think I understand the underlying issue here:

../../../bin/../lib/libconv-util.a(sockRoutines.o): In function `skt_lookup_ip':
/global/u2/e/earamos/charm/gni-crayxc-smp/tmp/sockRoutines.c:302: warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
> ldd isomalloc
    not a dynamic executable

gni-crayxc has this problem because the Cray compiler wrappers force binaries to be statically linked. In glibc, malloc and co. are weak symbols, which is why it normally works for us to redefine our own copies. A statically linked glibc also explains why dlsym failed to find any malloc: because we replace every externally-used symbol in glibc's malloc.o, that code is not present in the resulting binary at all.

This means that if we can find a way to force Cray's wrappers to compile a dynamic binary, os-isomalloc should work with no further effort, and if we cannot, then there is no (thread-safe) way to do it.

#21 Updated by Evan Ramos 5 months ago

Adding -dynamic to the link command causes missing GNI_* and PMI_* symbols, meaning static linking is the only option. We may need charmc to switch the user to non-OS isomalloc when the Cray compiler wrappers are in use.

#22 Updated by Sam White 5 months ago

I see. We could contact the Blue Waters (NCSA) or Cori (NERSC) administrators to ask whether there's a way to force dynamic linking; otherwise, we can go ahead with having charmc emit a warning and switch from os-isomalloc to isomalloc on Cray systems.

#23 Updated by Evan Ramos 5 months ago

  • Status changed from New to In Progress

With a change to conv-mach-craype.sh, dynamic linking can be made to work, and os-isomalloc with it.

#25 Updated by Sam White 5 months ago

  • Status changed from Implemented to Merged
  • Target version set to 6.9.0
