Project

General

Profile

Bug #1379

SDAG doesn't properly handle callbacks to [reductiontarget] methods with refnums

Added by Sam White over 2 years ago. Updated about 2 years ago.

Status:
Merged
Priority:
High
Assignee:
Category:
Charmxi
Target version:
Start date:
01/24/2017
Due date:
% Done:

0%


Description

If you want to do something like 'when rednResult[refnum](int iter, double sum) serial { }' where rednResult is a [reductiontarget] entry method, SDAG does not do the reference number matching for it correctly and the program hangs. Potential workarounds are to either not use a [reductiontarget] and instead use an explicit CkReductionMsg*, or to do a reduction targeted at another entry method and have that entry method then invoke another method that you then match in an SDAG when clause.

Ed ran into this problem, and will supply a simple example to reproduce it with.

refNumExample.tar (13 KB) Sam White, 01/25/2017 02:53 PM

History

#1 Updated by Sam White over 2 years ago

  • Assignee deleted (Ralf Gunter CorrĂȘa Carvalho)

#2 Updated by Sam White over 2 years ago

Simple example program to reproduce this issue (hang) is attached. Written by Ed Hutter.

#3 Updated by Phil Miller over 2 years ago

I think we should try to get this dealt with before 6.8.0 is released

#4 Updated by Sam White over 2 years ago

  • Target version changed from 6.8.1 to 6.8.0

Semi-related is the issue of tuple reductions not being able to be delivered to [reductiontarget] entry methods: https://charm.cs.illinois.edu/redmine/issues/1250

#5 Updated by Sam White over 2 years ago

  • Assignee set to Eric Mikida

#6 Updated by Eric Mikida over 2 years ago

My thoughts after looking at this:
There are two ways in which reference numbers are used in SDAG which are "well-defined". If it's a marshaled message, then the first int argument is the reference number. If it is a message, then it is the refnum field of the message.

The issue here is that we are setting the refnum on a callback, which is a different case then the above two, since a callback can technically result in a message or a marshalled message. When a callback results in a message then everything works fine. But when it results in a marshaled message (such as when we are doing reductiontargets) then the refnum is ignored. This would lead me to believe that it is also broken when using a callback with a point-to-point marshaled message.

I'm not really sure how to proceed though with fixing this. If we keep the callback around to use the refnum that seems to be one option, but it would be a bit of a weird API change in that some marshaled messages would have the refnum as the first arg, and others would just be matching an "invisible" refnum from the perspective of the SDAG code. It would also probably break if the method was called sometimes with a callback and sometimes without one. We could also maybe try some tricks with generating two versions of a method so that there's a version with a refnum, and a version without...but that also seems like it would overly complicate the ci file even moreso.

This to me seems like a larger design/API change than a small bug fix.

#7 Updated by Phil Miller over 2 years ago

I think we may be able to simplify things.

For a common parameter-marshalled method invocation, we can make sure that the first (must be integral) argument is copied into the message object's refnum field when the message is constructed at the send side.

For a callback that has a refnum set, whatever message it gets passed gets its refnum set to that value. With the way things currently work, this can only be an explicit message, not something that's been parameter marshalled.

For a [reductiontarget] method, we have a CkReductionMsg whose refnum field may have already been set. Then there's an apparent potential conflict between that first argument and whatever the callback may have said. I think we can resolve that conflict in favor of the refnum field set on the message that came from the callback. Any integer argument coming out of unpacking the reduction message is going to be a count of objects or something like that.

So, on the sending side, we need to make sure the packed up message has its refnum field set to whatever's expected. On the receiving side, we can always just look at the message refnum field, and not worry about what we're unpacking.

Does that make sense? Where I think I'm contradicting Eric M's comment just above is that I don't see how a common parameter marshalled message can ever get passed to a callback.

#8 Updated by Eric Mikida over 2 years ago

OK I didn't know that you couldn't do marshaling in callbacks although there are still two oddities about what you propose IMO:

1) Methods that are reduction callbacks can still be invoked as a regular entry method call, so in that case you would expect the refNum to be the first argument.

2) The when blocks for reduction targets vs regular marshaled messages have a weird mismatch in expected behavior now. For example, if you encounter "when receive[refnum](int x) serial { ... }" in a large SDAG file, you can no longer tell by just inspection whether x is the refnum, or if this is a reduction target and the refnum is going to come from somewhere else (that looks somewhat magical by the way). For small files, sure the declaration is probably right there anyways, but it does seem to still be adding a bit of magic. In every other case, where the refnum comes from is readily available.

These two points may be somewhat minor, and I think we could still go ahead with Phil's solution at least for now. We just have to make sure to grab and save the refnum before the callback/message gets deleted on the receiver side. I'm just wondering if there is a better long-term solution for handling all refnums in general.

#9 Updated by Phil Miller about 2 years ago

  • Priority changed from Normal to High

#10 Updated by Eric Mikida about 2 years ago

So the change proposed by Phil is easy to implement, but I'm worried about what it might break...
At the point in the code where we are about to use the callback to send the final result of the reduction, the refNum of the reduction message is set to the value of it's user flag:
CkSetRefNum(result, result->getUserFlag());

So setting the refnum to the refnum of the callback would overwrite this. I'm not sure if that's OK or if this user flag value is actually used by anyone for anything. My inclination would be no, and even if it is, the value of the flag is still carried around in getUserFlag() so it's not like it's being lost.

#11 Updated by Eric Mikida about 2 years ago

  • Status changed from New to In Progress

Current issue:
Upon receiving a message, generated code creates a closure for that message, which in the case of marshalling or reduction targets contains the unpacked method arguments. For reduction targets, this critically doesn't unpack the refNum, and it is essentially lost.

It then passes the closure to the actual entry method, which then creates a Buffer object which contains the entry index, closure, and refNum. At this point it sets the refnum to the first argument, which is incorrect.

Proposed solution:
Move the refNum to the closure object rather than the buffer object. Intuitively it seems like the refNum is part of the closure anyways since it is part of the actual entry method call instance.

#12 Updated by Eric Mikida about 2 years ago

  • Status changed from In Progress to Implemented

A solution is implemented in https://charm.cs.illinois.edu/gerrit/#/c/2484/

As mentioned in the commit message, cases where the first argument to the reduction target is not of a type that can be assigned to a refnum will still not work, however I don't know if that is something required for 6.8.0. It would require a bit more thought and probably a more complex solution in the code generation side to do, aside from the fact that the point-to-point versions of the entry method would make no sense to have in a when with a refnum.

#13 Updated by Eric Mikida about 2 years ago

Updated the patch so it can now handle all types of reductions by using a templated method for setting the refnum of a closure. However, it can be improved in the future when C++11 support allows for enable_if and is_assignable.

#14 Updated by Eric Mikida about 2 years ago

  • Status changed from Implemented to Merged

#15 Updated by Phil Miller about 2 years ago

  • Status changed from Merged to In Progress

See Sam's comments in Gerrit about breaking the Bigsim build

#16 Updated by Eric Mikida about 2 years ago

  • Status changed from In Progress to Implemented

Was a quick fix. The error was just a code gen issue that split an assignment. Fix posted https://charm.cs.illinois.edu/gerrit/2497

#17 Updated by Sam White about 2 years ago

  • Status changed from Implemented to Merged

Also available in: Atom PDF