Bug #879

segfaults in megatest groupcast on little-endian Power8

Added by Jim Phillips over 3 years ago. Updated about 1 year ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 11/06/2015
Due date:
% Done: 0%
Spent time:

Description

On Crest (Power8 little-endian) at ORNL:

(Had to remove -Q! from compiler options to build.)

net-linux-ppc-ibverbs-smp-xlc64
./charmrun ++local +p24 ./pgm +ppn 3
...
test 5: initiated [groupcast (mjlang)]
------------- Processor 9 Exiting: Caught Signal ------------
Signal: segmentation violation

multicore-linux-ppc-xlc64
./pgm +p12
...
test 5: initiated [groupcast (mjlang)]
Segmentation fault

6.6.1 builds and tests OK with gcc or xlC.

Latest won't build with gcc (needs some unimplemented feature).

History

#1 Updated by Phil Miller over 3 years ago

Could you say what version of gcc is in use there that won't build current charm mainline?

Do non-SMP/multicore builds crash in the same place? Like, a simple netlrts-linux-ppc? If so, what's the minimum PE count to make it happen?

#2 Updated by Phil Miller over 3 years ago

  • Description updated (diff)

For reference, mainline charm should currently build with GCC 4.3 or later. The only feature beyond 4.2 that I know we rely on is C++11 variadic templates.

#3 Updated by Jim Phillips over 3 years ago

It's gcc 4.9.2. This is the beginning of the errors:

../bin/charmc   -I.   -c -o DummyLB.o DummyLB.C
In file included from /usr/include/c++/4.9/vector:63:0,
                 from ../bin/../include/ckarrayindex.h:8,
                 from envelope.h:11,
                 from CkMarshall.decl.h:4,
                 from charm++.h:134,
                 from LBDatabase.decl.h:3,
                 from LBDatabase.h:99,
                 from BaseLB.h:9,
                 from CentralLB.h:9,
                 from DummyLB.h:9,
                 from DummyLB.C:6:
/usr/include/c++/4.9/bits/stl_uninitialized.h: In function '_ForwardIterator std::uninitialized_copy(_InputIterator, _InputIterator, _ForwardIterator)':
/usr/include/c++/4.9/bits/stl_uninitialized.h:124:15: sorry, unimplemented: logical operation on vector type
            && __assignable>::
               ^
/usr/include/c++/4.9/bits/stl_uninitialized.h:124:27: error: template argument 1 is invalid
            && __assignable>::
                           ^
/usr/include/c++/4.9/bits/stl_uninitialized.h: In function 'void std::uninitialized_fill(_ForwardIterator, _ForwardIterator, const _Tp&)':
/usr/include/c++/4.9/bits/stl_uninitialized.h:184:61: sorry, unimplemented: logical operation on vector type
       std::__uninitialized_fill<__is_trivial(_ValueType) && __assignable>::
                                                             ^
/usr/include/c++/4.9/bits/stl_uninitialized.h:184:73: error: template argument 1 is invalid
       std::__uninitialized_fill<__is_trivial(_ValueType) && __assignable>::
                                                                         ^

#4 Updated by Phil Miller over 3 years ago

That looks like some defect in the system's installation of GCC and libstdc++. My hunch would be that the g++ actually being invoked is not 4.9.2, but rather some earlier version instead, but still pointed at version 4.9 headers. Regardless, it's something that needs to be reported and fixed upstream.

#5 Updated by Jim Phillips over 3 years ago

net-linux-ppc-xlc64 fails immediately:

crest-login1:~/crest/charm-6.7.0-pre/net-linux-ppc-xlc64/tests/charm++/megatest> ./charmrun ++local +p24 ./pgm
Charmrun> started all node programs in 1.169 seconds.
Converse/Charm++ Commit ID: v6.6.0-396-gc64e6a8-namd-charm-6.7.0-build-2015-Oct-28-178036
[18] Assertion "CkReduction::nReducers == count" failed in file ckreduction.C line 2860.
------------- Processor 18 Exiting: Called CmiAbort ------------
Reason:
[18] Stack Traceback:
  [18:0]   [0x10435278]
  [18:1]   [0x1043aa40]
  [18:2]   [0x102d5c10]
  [18:3]   [0x101e057c]
  [18:4]   [0x101e1d14]
  [18:5]   [0x10435e60]
  [18:6]   [0x10437ca8]
  [18:7]   [0x101d0ecc]
  [18:8] +0x24d00  [0x3fffa84f4d00]
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
  [18:9] __libc_start_main+0xc8  [0x3fffa84f4ef8]
Fatal error on PE 18>

#6 Updated by Jim Phillips over 3 years ago

Phil Miller wrote:

That looks like some defect in the system's installation of GCC and libstdc++. My hunch would be that the g++ actually being invoked is not 4.9.2, but rather some earlier version instead, but still pointed at version 4.9 headers. Regardless, it's something that needs to be reported and fixed upstream.

gcc (Ubuntu 4.9.2-0ubuntu1~14.04) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I think it's just not implemented and they know it. Looking at the header it looks like this is only an issue if you enable C++11.

#7 Updated by Phil Miller over 3 years ago

OK, this is strange indeed. That exact code works just fine on my system with gcc 4.9.3. I have a vague suspicion that there's something weird or hacked up in that environment. For some reason, it thinks __assignable or the result of __is_trivial() is a SIMD vector type. And indeed, logical operations on those were not defined in 4.9, but have been since - contrast the following:

#8 Updated by Jim Phillips over 3 years ago

Having traced the gcc issue to simd.h including altivec.h, which defines vector and bool, gcc multicore-linux-ppc build and megatest both work fine now.

There is still the issue with xlC:

crest-login1:~/crest/charm/multicore-linux-ppc-xlc64/tests/charm++/megatest> ./pgm 32
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 1 threads
Converse/Charm++ Commit ID: v6.7.0-rc1-0-g0a7e6d0
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
[0] Assertion "CkReduction::nReducers == count" failed in file ckreduction.C line 2842.

#9 Updated by Jim Phillips over 3 years ago

The exact same assertion fails with net-linux-ppc-xlc64 as with multicore-linux-ppc-xlc64:

crest-login1:~/crest/charm/net-linux-ppc-xlc64/tests/charm++/megatest> ./charmrun ++local +p32 ./pgm
Charmrun> started all node programs in 0.290 seconds.
Converse/Charm++ Commit ID: v6.7.0-rc1-0-g0a7e6d0
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
[12] Assertion "CkReduction::nReducers == count" failed in file ckreduction.C line 2842.

#10 Updated by Eric Bohm over 3 years ago

  • Assignee set to Michael Robson

#11 Updated by Jim Phillips over 3 years ago

Ping.

#12 Updated by Michael Robson over 3 years ago

  • Status changed from New to In Progress

Nikhil and I encountered this same bug this morning while working on the new PAMI layer. We're pretty sure it's a compiler bug because it works for gcc and clang.

#13 Updated by Phil Miller over 3 years ago

It may also be an indication of undefined behavior. Perhaps compile with -fsanitize=undefined?

#14 Updated by Michael Robson over 3 years ago

Inside void CkReductionMgr::sanitycheck() (in ckreduction.C) we added print statements to check various values:

mprobson@crest2:~/pami_charm/pami-linux-ppc64le/tests/charm++/simplearrayhello$ poe ./hello
ATTENTION: 0031-393  Ignoring -resd/MP_RESD specified for batch job
Choosing optimized barrier algorithm name I1:Barrier:P2P:P2P
Converse/Charm++ Commit ID: v6.7.0-36-gf30494a
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.  
[0,0.001434 ] _initCharm started
CharmLB> Load balancer assumes all CPUs are same.
count: 0    nReducers: 26     reducerTable[0]: (nil)    reducerTable[count]: (nil)
reducerTable[0]: (nil)
reducerTable[1]: (nil)
reducerTable[2]: (nil)
reducerTable[3]: (nil)
reducerTable[4]: (nil)
reducerTable[5]: (nil)
reducerTable[6]: (nil)
reducerTable[7]: (nil)
reducerTable[8]: (nil)
reducerTable[9]: (nil)
reducerTable[10]: (nil)
reducerTable[11]: (nil)
reducerTable[12]: (nil)
reducerTable[13]: (nil)
reducerTable[14]: (nil)
reducerTable[15]: (nil)
reducerTable[16]: (nil)
reducerTable[17]: (nil)
reducerTable[18]: (nil)
reducerTable[19]: (nil)
reducerTable[20]: (nil)
reducerTable[21]: (nil)
reducerTable[22]: (nil)
reducerTable[23]: (nil)
reducerTable[24]: (nil)
reducerTable[25]: (nil)
[0] Assertion "CkReduction::nReducers == count" failed in file ckreduction.C line 2867.
------------- Processor 0 Exiting: Called CmiAbort ------------
{snd:0,rcv:0} Reason: Assertion "CkReduction::nReducers == count" failed in file ckreduction.C line 2867.
hello: machine.c:1092: void CmiAbort(const char *): Assertion `0' failed.
ERROR: 0031-250  task 0: Aborted

As you can see, the reducerTable entries are all set to (nil)/NULL instead of the function pointers' addresses. But when we add a print of the actual address to see whether it is really null:

printf("::invalid_reducer: %p\n", ::invalid_reducer);

We get the following:

mprobson@crest2:~/pami_charm/pami-linux-ppc64le/tests/charm++/simplearrayhello$ poe ./hello
ATTENTION: 0031-393  Ignoring -resd/MP_RESD specified for batch job
Choosing optimized barrier algorithm name I1:Barrier:P2P:P2P
Converse/Charm++ Commit ID: v6.7.0-36-gf30494a
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.  
[0,0.001375 ] _initCharm started
CharmLB> Load balancer assumes all CPUs are same.
count: 1    nReducers: 26     reducerTable[0]: 0x101116c4    reducerTable[count]: (nil)
reducerTable[0]: 0x101116c4
reducerTable[1]: (nil)
reducerTable[2]: (nil)
reducerTable[3]: (nil)
reducerTable[4]: (nil)
reducerTable[5]: (nil)
reducerTable[6]: (nil)
reducerTable[7]: (nil)
reducerTable[8]: (nil)
reducerTable[9]: (nil)
reducerTable[10]: (nil)
reducerTable[11]: (nil)
reducerTable[12]: (nil)
reducerTable[13]: (nil)
reducerTable[14]: (nil)
reducerTable[15]: (nil)
reducerTable[16]: (nil)
reducerTable[17]: (nil)
reducerTable[18]: (nil)
reducerTable[19]: (nil)
reducerTable[20]: (nil)
reducerTable[21]: (nil)
reducerTable[22]: (nil)
reducerTable[23]: (nil)
reducerTable[24]: (nil)
reducerTable[25]: (nil)
::invalid_reducer: 0x101116c4
[0] Assertion "CkReduction::nReducers == count" failed in file ckreduction.C line 2867.
------------- Processor 0 Exiting: Called CmiAbort ------------
{snd:0,rcv:0} Reason: Assertion "CkReduction::nReducers == count" failed in file ckreduction.C line 2867.
hello: machine.c:1092: void CmiAbort(const char *): Assertion `0' failed.
ERROR: 0031-250  task 0: Aborted

#15 Updated by Michael Robson over 3 years ago

Phil Miller wrote:

It may also be an indication of undefined behavior. Perhaps compile with -fsanitize=undefined?

I rebuilt charm with -fsanitize=undefined and didn't see a change in behavior.

#16 Updated by Sam White about 1 year ago

  • Status changed from In Progress to Rejected
  • Assignee deleted (Michael Robson)

I don't think this is relevant anymore. We now have Charm running on Summit and other POWER8/9 systems.

If I'm wrong, please re-open the issue.
