Bug #1294

Darwin SMP failure in exit

Added by Sam White over 2 years ago. Updated over 2 years ago.

Status: Merged
Priority: High
Category: -
Target version:
Start date: 11/09/2016
Due date:
% Done: 100%


Description

Autobuild shows a failure for netlrts-darwin-x86_64-smp in tests/charm++/sdag/case/, seemingly during exit:

../../../../bin/testrun  +p2 ./caseTest  ++local
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.006 seconds.
Charm++> Running in SMP mode: numNodes 2,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: 6899066
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (16-way SMP).
Charm++> cpu topology info is gathered in 0.007 seconds.
running
case test1, test3
     => test5
non-case test4
non-case test2
[Partition 0][Node 0] End of program
Charmrun> error on request socket to node 0 '127.0.0.1'--
Socket closed before recv.
make[4]: *** [test] Error 1
make[3]: *** [test] Error 1
make[2]: *** [test] Error 1
make[1]: *** [test] Error 1
make: *** [test] Error 2

vg - Valgrind output (97.5 KB) Michael Robson, 01/11/2017 10:14 AM

main.c (107 Bytes) Michael Robson, 01/20/2017 04:46 PM

bar.c (394 Bytes) Michael Robson, 01/20/2017 04:46 PM

Makefile (185 Bytes) Michael Robson, 01/20/2017 04:46 PM

History

#1 Updated by Sam White over 2 years ago

The failure is consistently reproducible.

#2 Updated by Eric Bohm over 2 years ago

  • Assignee set to Michael Robson

#3 Updated by Sam White over 2 years ago

  • Target version set to 6.8.0

This is failing every night in autobuild.

#4 Updated by Phil Miller over 2 years ago

We're now apparently seeing this in tests/util/check as well.

#5 Updated by Phil Miller over 2 years ago

Check autobuild logs to see when failures started. Correlate date with when Sam upgraded OS on Wit.

Try running under TSan (ThreadSanitizer).

#6 Updated by Sam White over 2 years ago

I upgraded Wit to OS X 10.11 on Nov 7th.

#7 Updated by Phil Miller over 2 years ago

  • Target version changed from 6.8.0 to 6.8.0-beta1

#8 Updated by Sam White over 2 years ago

  • Priority changed from Normal to High

#9 Updated by Sam White over 2 years ago

I took a quick look at this. The segfault is reproducible in standalone mode on a 'netlrts-darwin-x86_64 smp --with-production --enable-error-checking -g' build.

$ lldb -- ./caseTest
(lldb) target create "./caseTest" 
Current executable set to './caseTest' (x86_64).
(lldb) r
Process 80802 launched: './caseTest' (x86_64)
Charm++: standalone mode (not using charmrun)
Charm++> Running in SMP mode: numNodes 1,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-484-g7ab0fba
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
running
case test1, test3
     => test5
non-case test4
non-case test2
[Partition 0][Node 0] End of program
Process 80802 stopped
* thread #1: tid = 0x83634, 0x00007fff90d2c2da libdyld.dylib`stack_not_16_byte_aligned_error, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x00007fff90d2c2da libdyld.dylib`stack_not_16_byte_aligned_error
libdyld.dylib`stack_not_16_byte_aligned_error:
->  0x7fff90d2c2da <+0>: movdqa %xmm0, (%rsp)
    0x7fff90d2c2df <+5>: int3   

libdyld.dylib`_dyld_func_lookup:
    0x7fff90d2c2e0 <+0>: pushq  %rbp
    0x7fff90d2c2e1 <+1>: movq   %rsp, %rbp
(lldb) bt
* thread #1: tid = 0x83634, 0x00007fff90d2c2da libdyld.dylib`stack_not_16_byte_aligned_error, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x00007fff90d2c2da libdyld.dylib`stack_not_16_byte_aligned_error
    frame #1: 0x00007fff5fbff420
    frame #2: 0x00000001000f8cc5 caseTest`CsdScheduleForever [inlined] CmiHandleMessage(msg=0x00000001000212d0) + 34 at convcore.c:1818 [opt]
    frame #3: 0x00000001000f8ca3 caseTest`CsdScheduleForever + 211 at convcore.c:2055 [opt]
    frame #4: 0x00000001000f89e5 caseTest`CsdScheduler(maxmsgs=<unavailable>) + 21 at convcore.c:1991 [opt]
    frame #5: 0x000000010000703a caseTest`Main::_atomic_6(this=0x000000010044ce90) + 42 at STDIN:24
    frame #6: 0x0000000100006d96 caseTest`Main::_when_5(this=0x000000010044ce90) + 150 at caseTest.def.h:841
    frame #7: 0x0000000100006cf5 caseTest`Main::_when_4_end(this=0x000000010044ce90) + 21 at caseTest.def.h:817
    frame #8: 0x0000000100006cce caseTest`Main::_atomic_5(this=0x000000010044ce90) + 46 at caseTest.def.h:831
    frame #9: 0x00000001000042e9 caseTest`Main::_when_4(this=0x000000010044ce90) + 153 at caseTest.def.h:802
    frame #10: 0x00000001000036d4 caseTest`Main::test4(this=0x000000010044ce90) + 324 at caseTest.def.h:935
    frame #11: 0x000000010000358a caseTest`CkIndex_Main::_call_test4_void(impl_msg=0x000000010044acc0, impl_obj_void=0x000000010044ce90) + 42 at caseTest.def.h:397
    frame #12: 0x000000010002eb9a caseTest`::CkDeliverMessageFree(epIdx=149, msg=0x000000010044acc0, obj=0x000000010044ce90) + 138 at ck.C:593 [opt]
    frame #13: 0x0000000100031e0c caseTest`_processHandler(void*, CkCoreState*) [inlined] _invokeEntryNoTrace(epIdx=149, obj=<unavailable>) + 20 at ck.C:637 [opt]
    frame #14: 0x0000000100031df8 caseTest`_processHandler(void*, CkCoreState*) [inlined] _invokeEntry(epIdx=149, obj=<unavailable>) at ck.C:655 [opt]
    frame #15: 0x0000000100031df8 caseTest`_processHandler(void*, CkCoreState*) [inlined] _processForPlainChareMsg(CkCoreState*, envelope*) + 265 at ck.C:1006 [opt]
    frame #16: 0x0000000100031cef caseTest`_processHandler(converseMsg=<unavailable>, ck=<unavailable>) + 4175 at ck.C:1272 [opt]
    frame #17: 0x00000001000f8cc5 caseTest`CsdScheduleForever [inlined] CmiHandleMessage(msg=0x000000010044ac70) + 34 at convcore.c:1818 [opt]
    frame #18: 0x00000001000f8ca3 caseTest`CsdScheduleForever + 211 at convcore.c:2055 [opt]
    frame #19: 0x00000001000f89e5 caseTest`CsdScheduler(maxmsgs=<unavailable>) + 21 at convcore.c:1991 [opt]
    frame #20: 0x00000001000f3569 caseTest`ConverseRunPE(everReturn=0) + 745 at machine-common-core.c:1297 [opt]
    frame #21: 0x00000001000f2125 caseTest`ConverseInit(argc=<unavailable>, argv=<unavailable>, fn=<unavailable>, usched=<unavailable>, initret=0) + 933 at machine-common-core.c:1198 [opt]
    frame #22: 0x000000010001eb0e caseTest`main(argc=<unavailable>, argv=<unavailable>) + 46 at main.C:18 [opt]
    frame #23: 0x00000001000017f4 caseTest`start + 52

#10 Updated by Phil Miller over 2 years ago

If the issue is, as hinted, that the stack is not 16-byte aligned, then it should be pretty straightforward to adjust the allocation of the new thread stack that we use to spawn the scheduler inside CkExit (so that we don't return to user code after the call).
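As a rough illustration of the adjustment described above: the x86-64 calling convention expects the stack pointer to be 16-byte aligned at each call, and the `movdqa %xmm0, (%rsp)` in the trace faults when it is not. The sketch below rounds the top of a raw stack buffer down to a 16-byte boundary. The helper names `align_down_16` and `aligned_stack_top` are hypothetical and are not Charm++ or Converse functions; how the exit-time stack is actually allocated in the runtime is not shown in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round an address down to a multiple of 16. */
static uintptr_t align_down_16(uintptr_t addr)
{
    return addr & ~(uintptr_t)15;
}

/*
 * Hypothetical helper: given a raw stack buffer, return the highest
 * 16-byte-aligned address that does not lie past the end of the buffer.
 * On x86-64 the stack grows downward, so this is a candidate initial
 * stack pointer for a new thread of control.
 */
static void *aligned_stack_top(void *base, size_t size)
{
    assert(size >= 16); /* guarantees the result stays inside the buffer */
    return (void *)align_down_16((uintptr_t)base + size);
}
```

Rounding down rather than up keeps the pointer inside the buffer at the cost of at most 15 unused bytes at the top.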

#11 Updated by Michael Robson over 2 years ago

  • File vg added
  • % Done changed from 0 to 30
  • Status changed from New to In Progress

I am only able to reproduce this error on Wit and not on my personal laptop. Running Valgrind gives the attached output. I was able to "fix" this by inserting any non-blank line of code, but I am still working on how to actually align the stack properly.

#12 Updated by Sam White over 2 years ago

Have you tried allocating an aligned buffer and passing that to pthread_attr_setstack() and pthread_create()?
What version of OS X are you running on your laptop? I think Wit was upgraded to OS X 10.11 around the time this error started occurring.
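A minimal sketch of that suggestion, using only standard POSIX calls. It is not the Charm++ code path: the names `SKETCH_STACK_SIZE`, `mark_done`, and `run_on_aligned_stack` are hypothetical, and the 512 KiB size is an arbitrary choice assumed to be a multiple of the page size. The buffer is page-aligned (which also satisfies 16-byte alignment) because some platforms, Darwin included, reject stack addresses and sizes that are not page multiples.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stack size for the sketch: 512 KiB. */
#define SKETCH_STACK_SIZE ((size_t)512 * 1024)

/* Hypothetical thread body: records that it ran on the custom stack. */
static void *mark_done(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Run fn(arg) on a thread whose stack is a caller-allocated aligned buffer.
 * Returns 0 on success or an errno-style code from the failing call. */
static int run_on_aligned_stack(void *(*fn)(void *), void *arg)
{
    long page = sysconf(_SC_PAGESIZE);
    void *stack = NULL;
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    /* Assumption of this sketch: the size is a whole number of pages. */
    assert(page > 0 && SKETCH_STACK_SIZE % (size_t)page == 0);

    /* Page-aligned allocation; the address is the lowest byte of the stack. */
    rc = posix_memalign(&stack, (size_t)page, SKETCH_STACK_SIZE);
    if (rc != 0)
        return rc;

    rc = pthread_attr_init(&attr);
    if (rc != 0) {
        free(stack);
        return rc;
    }

    rc = pthread_attr_setstack(&attr, stack, SKETCH_STACK_SIZE);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);

    /* The buffer may only be freed once the thread has finished with it. */
    if (rc == 0)
        rc = pthread_join(tid, NULL);
    free(stack);
    return rc;
}
```

Calling `run_on_aligned_stack(mark_done, &flag)` with an `int flag = 0` should return 0 and leave `flag` set to 1 if the platform accepts the stack.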

#13 Updated by Michael Robson over 2 years ago

  • File bar.c added
  • File Makefile added
  • Status changed from In Progress to Implemented
  • % Done changed from 30 to 90
  • File main.c added

We (Phil and I) isolated what we think might be a compiler bug that's causing this issue. See the attached files for a reproduction. In the meantime, I've reworked the Yield condition during exit to circumvent the compiler issue. Patch is here: https://charm.cs.illinois.edu/gerrit/#/c/2146/

#14 Updated by Phil Miller over 2 years ago

Reported to Apple as Bug #30186931. There doesn't seem to be any way in their system to CC others, so I'll update if/when something moves there.

#15 Updated by Michael Robson over 2 years ago

  • Closed date set to 2017-01-25 14:55:03.431810
  • % Done changed from 90 to 100
  • Status changed from Implemented to Merged
