Project

General

Profile

Bug #1045

Array pointers in messages can be overwritten by generated code in the message class constructor (causes crash on GCC 6+)

Added by Phil Miller about 3 years ago. Updated about 2 years ago.

Status:
Merged
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
04/27/2016
Due date:
% Done:

0%


Description

https://charm.cs.illinois.edu/private/tms/listlog.php?param=1111

Upcoming changes in GCC 6 optimizer capabilities may resurface this old bug even with the workaround implemented
https://gcc.gnu.org/gcc-6/changes.html

-flifetime-dse is more aggressive in dead-store elimination in situations where a memory store to a location precedes a constructor to the memory location.

If the various bits of memory allocation and construction get inlined, potentially even across object files with link-time optimization, then the stores to the pointers from the message struct to the associated spots in the memory buffer allocated for it and its payload might get eliminated.


Related issues

Related to Charm-NG - Feature #1562: Enable message allocation, construction, packing, etc, without generated .ci file code New 05/16/2017

History

#1 Updated by Eric Bohm about 3 years ago

  • Assignee set to Ralf Gunter CorrĂȘa Carvalho

#2 Updated by Phil Miller almost 3 years ago

  • Assignee changed from Ralf Gunter CorrĂȘa Carvalho to Phil Miller
  • Priority changed from Normal to High

I need to investigate more, but I think this has come to the fore.

#3 Updated by Phil Miller almost 3 years ago

This crashes in megatest

./build test netlrts-linux-x86_64 --suffix=gcc6-prod-lifetime-dse --with-production --enable-errorchecking -j4 -g

Whereas this:

./build test netlrts-linux-x86_64 --suffix=gcc6-prod-no-lifetime-dse --with-production --enable-errorchecking -j4 -g -fno-lifetime-dse

Does not crash, successfully runs the full suite of tests and examples.

#4 Updated by Phil Miller almost 3 years ago

More details on the optimization here:
https://gcc.gnu.org/gcc-6/porting_to.html

#5 Updated by Phil Miller almost 3 years ago

  • Target version set to 6.8.0-beta1

#6 Updated by Thomas Quinn almost 3 years ago

ChaNGa also dies in a CkReduction when compiled with GCC 6.2.1. Recompiling with fno-lifetime-dse fixes the problem.

#7 Updated by Phil Miller almost 3 years ago

  • Subject changed from Array pointers in messages can be overwritten by generated code in the message class constructor to Array pointers in messages can be overwritten by generated code in the message class constructor (causes crash on GCC 6+)

#8 Updated by Phil Miller almost 3 years ago

  • Status changed from New to Implemented

#9 Updated by Phil Miller over 2 years ago

  • Status changed from Implemented to Merged
  • translation missing: en.field_closed_date set to 2016-10-21 15:48:29.272387

#10 Updated by Sam White over 2 years ago

On net-win64 we get compiler warnings because MSVC doesn't support the -fno-lifetime-dse option:

Ignored Unrecognized argument -fno-lifetime-dse

#11 Updated by Sam White over 2 years ago

XLC actually gives an error for this, so the build aborts:

../bin/charmc -host  -I../../src/xlat-i/ -I../../src/xlat-i/sdag/ -I.   -c -o xi-main.o ../../src/xlat-i/xi-main.C
g++: unrecognized option '-qlanglvl=extended0x'
cc1plus: error: unrecognized command line option "-fno-lifetime-dse" 

#12 Updated by Phil Miller over 2 years ago

  • Status changed from Merged to In Progress
  • translation missing: en.field_closed_date deleted (2016-10-21 15:48:29.272387)

Ugh, the configure test is supposed to catch that. I'll fix.

#13 Updated by Nikhil Jain over 2 years ago

Phil Miller wrote:

Ugh, the configure test is supposed to catch that. I'll fix.

Config does catch it, but on BG/Q cross compilation means we use different compilers for SEQ and rest.

#14 Updated by Phil Miller over 2 years ago

On examination, there is no case where the flags we're detecting in configure should actually be passed to the native/host compiler. So, disable passing those flags to the native compiler, and we've fixed the Blue Gene Q issue, and partially mitigated the (less problematic) MSVC issue:
https://charm.cs.illinois.edu/gerrit/#/c/1972/
Would test on BG/Q, but it's not back online from maintenance yet.

#15 Updated by Phil Miller over 2 years ago

  • Status changed from In Progress to Implemented

Subsequent fixes pushed, awaiting review.

#16 Updated by Phil Miller over 2 years ago

  • translation missing: en.field_closed_date set to 2016-11-09 12:20:08.622928
  • Status changed from Implemented to Merged

It'll be nice to see this get through autobuild, but the fix is in.

#17 Updated by Phil Miller about 2 years ago

  • Related to Feature #1562: Enable message allocation, construction, packing, etc, without generated .ci file code added

#18 Updated by Phil Miller about 2 years ago

My original notes from 2011-01-07:

Shortly before Christmas, Akhil and I isolated a bug in Charm++ code
that only exhibits in newer compilers. Specifically, when creating
messages containing variable length arrays, we rely on the memory
allocation returned by our `operator new()` being passed to the
message's constructor unmodified. In some circumstances [1], this is
not the case, and the pointers that the allocator initializes as
offsets into the allocated buffer get overwritten. Subsequent code
dereferencing those pointers goes astray.

Two major things to consider:
1. We should not make a new (non-bugfix) release with a bug like this
looming over our heads, and
2. I'm pretty sure fixing this will require an API change, though the
magnitude of that change spans many possibilities.

The plan for this meeting is to explore what those possibilities are,
and what they entail in terms of implementation effort, application
adaptation effort, maintainability, and performance. I would
appreciate if group members with experience developing against APIs
with RDMA-like constructs (ibverbs, LAPI) would attend.

[1] Right now, the one I know of is when the compiler generates a
default constructor. There may be others, and that set may vary over
time, compiler version, options, etc.

Email exchange with Jim, same day:

Jim sent the following while we met:

This might be a product of new initialization rules for plain old data. Try removing the final () from both of these lines in varsizetest2.C:

varsizetest2_Msg *m = new (sizes,0) varsizetest2_Msg();

m = new (sizes,0) varsizetest2_Msg();

The () after the class name is an explicit initializer, which causes all elements to be default inititialized to zero. If you declare a constructor then it's no longer plain old data, and the default constructor is expected to initialize everything.

See http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=423

Try making the CMessage_varsizetest2_Msg constructor protected in the .decl.h file, which would actually solve the problem if it works.

If that doesn't work, the appropriate API change would be to move the pointer declarations to the CMessage_varsizetest2_Msg class. The user would have to comment them out (or use a macro) in the varsizetest2_Msg declaration and they would only exist in the .ci file.

Next message:

Maybe also try adding a private copy constructor to CMessage_varsizetest2_Msg in the .decl.h file.

Anything to keep C++0X from making the message plain old data.

http://en.wikipedia.org/wiki/C%2B%2B0x

The insight about POD versus class-with-constructor is one we missed, but it doesn't really matter. The compiler is allowed to do whatever it wants with the block of memory returned by operator new() before calling the constructor. XLC even supports a debugging flag to have such memory initialized to a particular pattern. The compiler could also ask for extra bytes that it will take off the beginning of the allocation for whatever purpose, a possibility that we ignore.

The API change suggested is the same as the minimal-effort change we came up with.

Meeting minutes, same day:

We explored the issue, and concluded that some API change would be necessary.

There are two prime possibilities:

When a message declared in a .ci file mentions variable length arrays, the pointers to those arrays should be declared in the base CMessage_foo class and initialized to the appropriate offset in its constructor, using length data hidden somewhere by the runtime. The requires user code to remove the same declarations, possibly conditional on the version of Charm++ in which this change occurs.

We could transition to an API where the client code provides a list of pieces of memory that should end up the in the outgoing message, with lengths and possibly strides. This matches the underlying APIs available in TCP (sendv()), LAPI, and probably others. This smoothly enables later implementation of optional zero-copy optimization if the caller promises not to modify the buffers in question until the send is complete.

January 18:

After discussions with people, I've started along the following path:

1. Implement the partial workaround that Gengbin described, by modifying charmxi as follows:
a. for each message class, declare an array _ck_sizes of size_t with enough slots for all of the variable length arrays
b. in CMessage_foo::operator new(), save the sizes passed in as arguments to _ck_sizes
c. Move the setting of pointers from CMessage_foo::operator new() to CMessage_foo::CMessage_foo(), using the saved sizes
2. Start exploring forward-looking APIs that obviate the still-possible issue of overwriting between CMessage_foo::CMessage_foo() and Message::Message()

Currently experiencing difficulty with 1a, in that CpvStaticDeclare won't work for static class members. Can't just declare outside the class with the name appended, because of templates. Will continue tomorrow.

Next update:

Coded bugfix. Tested net-linux-x86_64{,-smp} through added test in megatest.

Commits 5d1a75ed740aaefe35d2902c7b70276e7abcd5ef and c0776f0edad053d8315e6486f1f9ee4e3b28f7e3.

Final note, January 2012:

To actually make this bug impossible would still require API changes

Also available in: Atom PDF