Feature #1144

Batched message delivery to objects for better cache behavior

Added by Phil Miller almost 3 years ago. Updated about 1 year ago.

Transcript of a Slack conversation identifying multiple applications that could benefit from better cache utilization by ensuring that multiple messages to a given object get delivered in a batch, rather than interleaved with messages to other objects.

Eric Mikida
15:05 alright, so this is a pattern that has come up in the disney work, but would also have potential applications in PDES and/or GW as well
15:06 but the idea being that oftentimes I find I have entry methods like "doWork(WorkRequest* msg)"
15:07 where basically, this chare just gets a bunch of work requests and fulfills them one at a time as they come along.
15:07 But in certain circumstances it seems like it would be useful to be able to process them more than just one at a time, i.e. something like "doWork(WorkRequest** msgs, int count)"
15:08 it's a similar idea to the object queue, except the application has a bit more control over it as well. It would be particularly useful in the ray tracing application, because cache usage there is super important
15:09 as well as aggregation of fine grain requests, which would work better if we could deal with more work requests at once to be able to aggregate over all of them
Phil Miller
15:10 As a potential application-level design, it may be worth trying the following:
15:12 When a message is received, just push it into an object-level buffer (vector or some sort of priority queue, as appropriate). Check a ‘work pending’ flag - if it’s not set, set it and call an entry method to ‘do all queued work’. If it’s set, just return. Fiddle with relative priorities of those things as appropriate
Eric Mikida
15:13 yeah that's not a bad idea. didn't think of using the flag to send another message.
15:13 the tricky part was not knowing at all how many messages a chare will receive, but having the flag and an extra message should cover that nicely
15:14 oddly enough the newer implementation of PDES is going to do something similar to this, I just failed to connect it to my current needs
Phil Miller
15:14 If you want something that’ll deliver rounds of messages, and then kick off rounds of work, you could even push that to a PE level
15:14 by having the work call be a CcdOnIdle handler instead
15:14 and each object just puts itself in the ‘has work’ list
15:15 And if you’ve really got 3 distinct applications that all need essentially this behavior, for similar enough reasons, that’s probably a sign it belongs as a runtime feature
15:16 perhaps a [batch] entry method attribute, that says to let calls accumulate and run them in batches
Eric Mikida
15:16 yeah. is the object queue currently implemented btw?
Phil Miller
15:17 Yes, it’s there in the present code. IIRC, the cleanup leading up to 64-bit ID work actually removes it
Eric Mikida
15:17 ok
Phil Miller
15:17 The object queue functionality is a build-time option, controlled by a flag that’s something like CMK_GRID_QUEUE
15:18 because it was originally built to support object queueing to schedule objects that were on the ‘border’ between distributed resources with weak networking between them to run at higher object-level priority
Eric Mikida
15:18 gotcha
15:19 I'll probably look into that as well, since it does cover part of the use case I'm looking to address
Phil Miller
15:20 I suspect that work may get dusted off in the mid-term future at PPL anyway, since we’re going to see much wider/stronger nodes (e.g. POWER9) with relatively much less bandwidth
Eric Mikida
15:20 yup. and Harshitha and co are already looking into improving the existing queues and adding new ones, as far as I know
Phil Miller
15:22 I think those are at the lower levels, for inter-PE message passing, not at the level we're talking about, that knows about objects
15:22 Though it’s possible I don’t know all the work going on there


#1 Updated by Phil Miller almost 3 years ago

I realize that it may be possible to 'trick' Xiang's out-of-core scheduling code to do this, by having each object specify that it needs some 'token' that the OOC scheduler will have to copy in/out of memory. If we set parameters so that the OOC scheduler thinks it only has room for one such token, then it will try to deliver a bunch of messages that depend on the token at once, before moving on to deliver other messages.

#2 Updated by Eric Bohm about 1 year ago

  • Project changed from Charm++ to Charm-NG

Shifted to Charm-NG as this should be considered in the mix of what we do to revise scheduling.
