Project

General

Profile

Feature #921

Entry tag [inline] is unable to optimize away most of the overhead

Added by Eric Bohm over 3 years ago. Updated 12 months ago.

Status:
Implemented
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
12/14/2015
Due date:
% Done:

0%


Description

I've been studying a fine grained code which attempts to use [inline] to reduce the overhead of computing across elements when they are known to be co-resident on the PE.

However, [inline] performs over an order of magnitude worse than
if(proxy[i]->ckLocal()){ user function } else { proxy[i].entry()}

In the case of a marshaled call we go through the full process of constructing the message before we even test whether the target can be inlined. Then execution goes through CkSend with CK_SEND_INLINE set so that we ultimately land in CkSendMsgInline. Which will unpack the data we just packed, and call the function through its entry wrapper, with tracing disabled. In terms of runtime system overhead, the only real benefit to this scheme is that we don't take a trip through the scheduler. We're still incurring message pack and unpack overhead for data which doesn't actually have to go anywhere.

Is there a good reason why we need to abide by that restriction? Why can't we generate code which will test locality and make a function call to the user code, and only hit the rest of this overhead in the else case?


Related issues

Related to Charm++ - Bug #937: [local] entry methods don't set tracing event dependencies Merged 01/03/2016
Related to Charm++ - Bug #1679: Do Not Require Default Constructors for Serializable Classes Merged 09/18/2017
Related to Charm++ - Bug #1699: [inline] entry methods should use perfect forwarding in C++11 Merged 09/30/2017

History

#1 Updated by Nikhil Jain over 3 years ago

Another place we saw this being an issue is ROSS. Replacing an inline call by a pointer saving + C++ style call gave us upto 15% performance boost. (PHOLD benchmark is really sensitive for low remote case).

#2 Updated by Eric Bohm over 3 years ago

  • Priority changed from Normal to Low

On further study [inline] is stuck with the copy implementation under current semantics. Pass by value means that a copy must be made. It could be made slightly more efficient by not going through the full message construction process, but you are going to suffer copy overhead for [inline] unless we change what it means.

#3 Updated by Phil Miller over 3 years ago

I don't think that's strictly true. If the receiver takes a const T& then we can pass the object directly through. I think a wrapper to guarantee value semantics on the caller side (pass by const reference if possible, pass by value of a copy otherwise) shouldn't be too hard. Even in the non-const case, that saves us a copy in the [inline] std::vector case, since we would avoid copying data into a message and then immediately back out - it would go from the original vector passed by the caller, to the argument vector passed to the callee.

#4 Updated by Phil Miller over 3 years ago

Here's a rough draft of the proposed wrapper (untested code):

template <typename T>
struct const_or_copy
{
  const T& t;
  const_or_copy(const T& arg) : t(arg) { }
  operator T() { return t; }
  operator const T&() { return t; }
};

If we stick that around each argument that gets passed through, I think we'd be set for any parameter that the user's function ultimately declares as const.

#5 Updated by Eric Bohm over 3 years ago

Good point on use of const as a way to support pass by reference when the user function takes const input. I think that allows the optimization to go forward without creating debugging confusion.

#6 Updated by Phil Miller over 3 years ago

So, looking at the RTS code on the delivery path, there is at least some good reason that we go to the trouble - tracing and LB accounting. If we don't want to screw up other functionality, we'll need to refactor things in the messaging infrastructure so that generated code can make that happen properly.

It looks like some of that very same instrumentation is only activated when not doing inline delivery - message creation tracing and the accounting in the LB database are only called if (type == CkDeliver_queue). That may just be to avoid incorrectly recording message send twice, though.

On further investigation, part of the necessary refactoring seems to have been done, for ChaNGa using [local] entry methods, based on comments around CkMigratable::timingBeforeCall and CkMigratable::timingAfterCall and the code in xi::Entry::genArrayDefs().

#7 Updated by Phil Miller over 3 years ago

  • Priority changed from Low to Normal
  • Target version set to 6.8.0

On further analysis, I think the right thing to do for now may actually be to have the proxy's method take all arguments by const& and to always pass those along unmolested. This automatically handles the following concerns without spurious copies or instance constructions:

  • entry method templates (unsure how the above code handles this)
  • conversion from argument(s) of type A to parameters of type B (not handled by the above code, without further modification)
  • arguments that are a subtype of the declared parameters

The cases this won't handle are as follows:

  • the method's implementation takes some parameter by non-const reference, leading to compilation failure. This might occur in practice, because existing code generation creates temporary instances that the method may want to manipulate in place, and allow to be discarded after it returns.
  • move optimizations will be ineffective if we 'decay' what would have been xvalues (e.g. temporaries, move()'d-from variables) to const lvalue references

The move optimization can be deferred to later, since it's not necessary for the common copy elisions we're looking at right now.

The failure on non-const references may be bad, because it could force a copy on other cases. The solution to those is that the method should take its parameter by value, then the compiler-/charmxi-generated code will move() from the temporary or copy as needed.

We can test for move/rvalue support with

#if defined(__cpp_rvalue_references) && __cpp_rvalue_references >= 200610

And if it's not there, then just leave off the std::move wrapping arguments in the temporary case.

This means that we may need to distinguish [local] entry methods from others, since those should allow non-const references to pass through.

#8 Updated by Phil Miller over 3 years ago

So, there's some apparent differences in how various code wrapping array element entry method invocation works between [local] (implemented by charmxi) and [inline] (implemented across various bits of RTS) entry methods. Here's the order in which they each push various stuff on the stack, spread over many different functions currently:

[local] generated code             [inline] RTS code for array          BOC (inline/not)  deliver Chare msg   [inline] chare
                                   LB suspend current running object    LB suspend
                                   Grid Q
                                   LB start timing
Trace execution / Appwork          Trace execution / Appwork            Trace             Trace               (never trace?!)
LB suspend, start timing
Debug                              Debug                                Debug             Debug               Debug

Debug should clearly be last, since that's the cut-over point for running in user-provided code. I think the LB suspension should be first, since everything else that happens in this interval is legitimately overhead. I think tracing should be before LB starts timing, since the unfortunate event of a log flush should be attributed to the processor's background load, and not the object in question. So, my preferred order is as follows (where applicable, for each object type):

  • LB suspends current object
  • Grid Q
  • Trace
  • LB timing
  • Debug

What I think I'd like to do about some/all of this is move all of that stuff into a virtual method on the object classes themselves (array element, Chare, Group), that both the generated code and the RTS can call. Thus, we simplify other code, consolidate these widely-spread concerns, and reduce duplication.

In looking at the current implementation, I also noticed that [local] entry methods don't set the event dependency field in tracing, so there's no possibility of traceback through those events. I opened #937 about that.

#9 Updated by Phil Miller over 3 years ago

Partial fix, just for arrays and not groups, pushed here: https://charm.cs.illinois.edu/gerrit/#/c/976/

#10 Updated by Phil Miller over 3 years ago

  • Status changed from New to In Progress

#11 Updated by Eric Bohm over 3 years ago

Testing of this change shows that it reduces the overhead of [inline] vs if(a.ckLocal)... by an order of magnitude. Former case was that [inline] took 20x time to completion vs manual locality test for a fine grained benchmark. Now it is only 2x. Profiling implicates SDAG constructs used in the [inline] based application level implementation in the remaining overhead. Will update with more detailed results later after more study and experimentation.

#12 Updated by Phil Miller over 3 years ago

Eric M or Nikhil: could one of you test PHOLD against the change linked above to see how performance looks now with [inline] compared to the manual pointer test mentioned above?

#13 Updated by Eric Bohm over 3 years ago

  • Assignee set to Eric Bohm

#14 Updated by Phil Miller almost 3 years ago

  • Assignee changed from Eric Bohm to Phil Miller

#15 Updated by Phil Miller almost 3 years ago

As of commit e82cc92d6ba6a2f6a0dbfb155e55fd2e8dbfb822 (including change 976 linked in comment 9):

In the non-SDAG case, passing through by const & cleanly avoids copying for PE-local [inline] calls.

If the entry method definition takes its argument by value, we get a copy, as expected. If it takes its argument by non-const reference, we get a compilation error.

So, on to the SDAG case, I guess.

#16 Updated by Phil Miller almost 3 years ago

  • Status changed from In Progress to Implemented

This should take care of allowing argument passing to [inline] methods by const& to elide copies even through SDAG when the caller and callee are on the same PE:
https://charm.cs.illinois.edu/gerrit/1282

#17 Updated by Phil Miller over 2 years ago

  • Status changed from Implemented to In Progress

#18 Updated by Phil Miller over 2 years ago

More involved wrapper definition that's been lingering on the whiteboard at Charmworks for too long, while this work has been on ice:

template <typename T>
struct arg_wrapper {
  T *t;
  char space[sizeof(T)] alignas(T); // Cray's compiler doesn't support alignas, as of version 8.5
  bool isOwned() { return t == &space; }

  arg_wrapper(const T& v)
  : t(&v)
  { }

  arg_wrapper(T&& v)
  : t(space)
  { new (space) T(move(v)); }

  ~arg_wrapper()
  {
    if (isOwned())
      t->~T();
  }

  operator const T& { return *t; }
  operator T() { return *t; }
  operator T&&()
  {
    if (!isOwned()) {
      new (space) T(*t);
      t = (T*)space; // Added per note in comment #19
    }

    return move(*t);
  }
};

In theory, this should be able to pass everything the right way through.

#19 Updated by Phil Miller about 2 years ago

Comment 18's definition of the rvalue conversion operator was missing t = space; in the non-owned case

#20 Updated by Phil Miller about 2 years ago

  • Target version changed from 6.8.0 to 6.9.0

#21 Updated by Nils Deppe over 1 year ago

If I've understood the above discussion correctly then this is the right place (otherwise I can create a new ticket). Looking through the generated hpp file I find:

template < typename System >                                                                                                                     
template < typename ReceiveTag, typename ReceiveData_t >                                                                                         
void CkIndex_ElementChare < System > ::_call_receive_data_marshall5(void* impl_msg, void* impl_obj_void)                                         
{                                                                                                                                                
  ElementChare < System > * impl_obj = static_cast<ElementChare < System >  *>(impl_obj_void);                                                   
  CkMarshallMsg *impl_msg_typed=(CkMarshallMsg *)impl_msg;                                                                                       
  char *impl_buf=impl_msg_typed->msgBuf;                                                                                                         
  /*Unmarshall pup'd fields: const Instance_t &instance, const ReceiveData_t &t*/                                                                
  PUP::fromMem implP(impl_buf);                                                                                                                  
  Instance_t instance; implP|instance;                                                                                                           
  ReceiveData_t t; implP|t;                                                                                                                      
  impl_buf+=CK_ALIGN(implP.size(),16);                                                                                                           
  /*Unmarshall arrays:*/                                                                                                                         
  impl_obj->template receive_data< ReceiveTag, ReceiveData_t >(instance, t);                                                                     
} 

where the last call should be calling std::move() (which is trivial to write outside of the std namespace if necessary). To "emulate" this we currently move the instance and t we receive in ElementChare since we know that it should be an rvalue reference in C++11. However, [inline] entry methods break this (theoretically, it sounds like not in practice) since then I may be passed a legitimate lvalue reference, which I shouldn't move, but instead should either copy or just use once. This can be fixed by using forwarding references in the receive_data function of ElementChare, since a const-lvalue reference cannot be moved. However, it would be much less worrisome if this was handle correctly by calling move in the generated header file, so that using forwarding references with std::forward works correctly. By correctly I mean both in the sense of the standard and in the sense that we should not be passing lvalue references when rvalues are the correct answer.

#22 Updated by Phil Miller over 1 year ago

  • Related to Bug #1679: Do Not Require Default Constructors for Serializable Classes added

#23 Updated by Phil Miller over 1 year ago

  • Assignee changed from Phil Miller to Eric Bohm

#24 Updated by Sam White over 1 year ago

We may want to re-assign this to Evan? I think he might have some knowledge of this work, and I'm not sure Eric has time for it right now or the next few weeks

#25 Updated by Sam White about 1 year ago

  • Tracker changed from Bug to Feature

#26 Updated by Eric Bohm about 1 year ago

  • Related to Bug #1699: [inline] entry methods should use perfect forwarding in C++11 added

#27 Updated by Eric Bohm about 1 year ago

  • Assignee changed from Eric Bohm to Evan Ramos

Reassigning to Evan in the hopes that he will have the right mix of time and expertise.

#28 Updated by Evan Ramos about 1 year ago

I'm having trouble understanding what problem(s) still need fixing for this issue to be considered complete. It appears that the discussion has shifted since the issue description was originally written, possibly without the original matter being 100% resolved, and I'm unsure if (or how much) the most recent discussion relates to or overlaps with Bug #1699.

#29 Updated by Sam White about 1 year ago

Yeah, I am unsure what is left to do here that isn't covered by Bug #1699. It might be good to have another round of profiling/timing done for the following cases on 1 PE, to clarify the state of things:
1. [inline] entry method
2. [local] entry method
3. Normal entry method
4. Manually checking ckLocal() and then calling the C++ function directly

#30 Updated by Evan Ramos about 1 year ago

Is there any existing code to profile/time these?

#31 Updated by Sam White about 1 year ago

tests/charm++/pingpong has some of them, though I don't think it passes its arguments by const&

#32 Updated by Evan Ramos 12 months ago

  • Status changed from In Progress to Implemented

Sam White wrote:

Yeah, I am unsure what is left to do here that isn't covered by Bug #1699. It might be good to have another round of profiling/timing done for the following cases on 1 PE, to clarify the state of things:
1. [inline] entry method
2. [local] entry method
3. Normal entry method
4. Manually checking ckLocal() and then calling the C++ function directly

Should these test cases block v6.9 release?

#33 Updated by Sam White 12 months ago

Not really, we can do the testing on the beta version of the release as part of our usual release testing.

#34 Updated by Sam White 12 months ago

  • Target version deleted (6.9.0)

Also available in: Atom PDF