CkLoop fixed tree limits helper recruitment
CkLoop uses a tree of branching 4 when CkMyNodeSize() >= 8. This means that if you have ppn=9 and rank 0 calls CkLoop, then ranks 5-8 are only recruited if rank 1 is available. If ranks 1-4 are busy then no helpers are activated. This is unexpected and performance-limiting for the use case where CkLoop work is urgent but PEs are generally busy with lower-priority tasks of small grainsize.
A more dynamic approach would be to use messages in the node queue to push notify messages for the PEs, one message per X PEs on the node (probably large X to amortize overhead, but some limit to allow scaling to large thead counts).
#1 Updated by Jim Phillips over 1 year ago
Proof-of-concept patch attached. With this change I see the expected behavior, with all PEs able to steal work.
Issues with this patch:
1) Had to declare CmiPushNode() in CkLoop.C, should be added to converse.h and made non-static in all machine layers.
2) No tests for CMK_NODE_QUEUE_AVAILABLE
3) Modifies srcRank in notifyMsg, which I think might be enqueued from previous entries on other PEs. This would be unsafe if int write was non-atomic, or if CkLoop is sensitive to extra notify messages being received. It is done because there is no way to tell if message came from node queue or pe queue.
4) Only modified USE_CONVERSE_NOTIFICATION branch. Do we need the other side?
Other possible improvements:
1) For numChunks << nodesize it would be better to put numChunks/numChunks-1 messages in the node queue rather than messaging every PE.
2) If no work left don't send the notifies.
3) The implementation of tree broadcast notification broke tracing by over-writing srcRank to -1.
#5 Updated by Seonmyeong Bak 5 months ago
I think this patch can increase the overhead of CkLoop initialization by eliminating the use of implicit tree in CkLoop. (One PE pushes all the messages).
We provide this option but it seems better to add this scheduling with existing scheduling based on tree.
We may not care about other implementations for CkLoop because they are not used and perform slower than USE_CONVERSE_NOTIFICATION.
We may add a field in converseNotifyMsg so that we can tell if the message comes from node queue or pe queue.
#6 Updated by Jim Phillips 5 months ago
Actually, the overhead (number of messages sent and received) is at worst identical. The latency is only higher if all of the other non-leaf PEs in the tree are idle, which is unlikely in practice, and then the delay is only the time to send nodesize messages. If no other PE is available then the overhead is lower since the calling PE only sends the one node message and receives it after finishing the entry. The calling PE even gets to start its work earlier since it only sends one message.
#8 Updated by Seonmyeong Bak 5 months ago
- Status changed from In Progress to Implemented
Added API to set scheduling policy for CkLoop: CkLoop_SetSchedPolicy(sched schedPolicy);
Three options available: NODE_QUEUE, TREE, LIST.
NODE_QUEUE works as this issue suggests.
TREE is the original implementation which uses Tree for delivering CkLoop work.
LIST is the original implementation which doesn't use Tree.