Mechanism (fence/barrier) to delay execution of long methods until latency-sensitive methods are done
- NAMD wants to let multicast proxies fully distribute atom data before non-bonded computes start running
- ClothSim wants to respond to mesh information requests for remote collision work before starting local calculations
- OpenAtom has a use case, per Eric B (please edit to fill in)
- CharmLU wants to progress active-panel updates and reductions before any trailing updates
Right now, they each implement their own barrier-like construct to hold off the long methods until the latency-sensitive bits are finished. Four independent applications is more than enough to justify implementing a common mechanism in the Charm++ model.
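The shared mechanism the four applications re-implement could be sketched as follows. This is a minimal illustration in plain C++, not the Charm++ API; the `HoldScheduler`, `Task`, and `pump` names are hypothetical, and the "long-running" flag stands in for whatever entry-method attribute or runtime prediction would mark a method as holdable. While a hold is active, long methods are parked in a side queue so latency-sensitive work drains first; releasing the hold makes them deliverable again.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical sketch of a per-PE hold: long-running tasks are deferred
// while a hold is active, latency-sensitive tasks run immediately.
struct Task {
    bool longRunning;             // would come from an entry-method attribute
    std::function<void()> run;
};

class HoldScheduler {
    std::deque<Task> ready;       // deliverable now
    std::deque<Task> held;        // long-running tasks deferred by the hold
    bool holdActive = false;
public:
    void setHold(bool on) {
        holdActive = on;
        if (!on) {                // releasing: held tasks become deliverable
            while (!held.empty()) {
                ready.push_back(held.front());
                held.pop_front();
            }
        }
    }
    void enqueue(Task t) {
        if (holdActive && t.longRunning) held.push_back(std::move(t));
        else ready.push_back(std::move(t));
    }
    // Drain everything currently deliverable; returns number executed.
    int pump() {
        int n = 0;
        while (!ready.empty()) {
            Task t = std::move(ready.front());
            ready.pop_front();
            t.run();
            ++n;
        }
        return n;
    }
};
```

In NAMD terms: set the hold at the start of the step, let patch-data messages run, and release when the last patch is ready, at which point all queued computes fire.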
#1 Updated by Jim Phillips over 4 years ago
A few design issues:
- The user has to specify either which messages should be held or which should not. I would suggest a scale with a default value, letting each hold specify the minimum expected running length that should be held. We could also let the Charm++ runtime measure entry-point durations and predict which entry points are likely to be long-running.
- At some point long-running entry points from step N should no longer be held, while those from step N+1 should be.
- Does a newly created hold affect only newly received/generated messages, or does it also apply to entries already in the queue?
- In NAMD things are simpler. We hold all computes until the last patch is ready, so at the point the hold is lifted on a PE we know that all computes for step N are waiting in the queue and there are no computes for step N+1 yet. This is not true for PME, CUDA, or any other compute that returns and is rescheduled within the same timestep, but we don't hold those. Multiple timestepping also complicates things: a patch may only have computes on a node for even timesteps, so it could get a step ahead and confuse things.
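The step-N-versus-step-N+1 issue above suggests tagging holds with a generation number. The following plain-C++ sketch (hypothetical names; not the Charm++ API) holds long-running work whose step tag is beyond the last released step, so step-N computes can run while identical entry points for step N+1 stay parked. It also takes one possible answer to the queue-scanning question: a hold classifies each task once, at submission, and never revisits entries already marked runnable.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <vector>

// Hypothetical sketch: generation-tagged holds.
class SteppedHoldScheduler {
    int releasedStep = 0;                            // long work tagged <= this may run
    std::deque<std::function<void()>> ready;
    std::multimap<int, std::function<void()>> held;  // long work keyed by step
public:
    void submit(int step, bool longRunning, std::function<void()> f) {
        if (longRunning && step > releasedStep) held.emplace(step, std::move(f));
        else ready.push_back(std::move(f));
    }
    void release(int step) {                         // e.g., last patch ready for `step`
        releasedStep = step;
        auto it = held.begin();
        while (it != held.end() && it->first <= step) {
            ready.push_back(std::move(it->second));
            it = held.erase(it);
        }
    }
    int pump() {                                     // drain deliverable work
        int n = 0;
        for (; !ready.empty(); ++n) {
            auto f = std::move(ready.front());
            ready.pop_front();
            f();
        }
        return n;
    }
};
```

A compute that returns and reschedules itself within the same timestep (the PME/CUDA case above) would simply resubmit with the same step tag and run immediately, since that step is already released.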
#4 Updated by Eric Bohm over 3 years ago
In OpenAtom, the non-local particle force and energy computation has its dependencies met early in the timestep and could be launched then. However, its output, particularly the forces, is not needed until late in the timestep, when the forces are integrated.
This is currently (crudely) managed by a runtime-configurable parameter that launches the non-local computation alongside one of the other phases (gspace, rho, realspace), together with prioritized messages. The message traffic associated with the non-local computation is proportional to the size of the molecular system (i.e., it is the structure-factor information for the portion of the system being computed locally). The computation time is N^2 log N multiplied by the number of atom types and the number of m channels.
Therefore, in fairly simple systems with a small number of types and channels, this is relatively fast and, network permitting, can be overlapped with the electron density phase. That is desirable, as the electron density computation is one of our Amdahl bottlenecks due to its relatively small size. However, the electron density computation is on the critical path for our heaviest hitter, the electron state wave function computation, so if the attempted overlap interferes with and delays the density (or the electron state), there is a net loss in performance.
In more complex systems, with many types and channels, the non-local computation can grow to dominate the total timestep, and should therefore be started as early as is profitable, modulo self-interference with the electron state or density.
Therefore, there are two distinct but related kinds of activity that could be adaptively managed: the communication of the structure-factor data, and the computation that depends on it.
At a lower priority, there are also the global reductions to compute a variety of total energy quantities. The computation associated with these is tiny (a sum of one double per energy quantity, for fewer than a dozen quantities). In most cases these are not prominent on the critical path, and could therefore be held back if the communication infrastructure is struggling to manage the overall message volume. There is no code to manage that now: we contribute to the reduction spanning tree as soon as we have the necessary data and hope that it doesn't contribute to performance problems.
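Since no code manages this today, one hypothetical shape for it (plain C++, illustrative only; `DeferredReduction` and its methods are invented names): rather than contributing each energy quantity the moment it is ready, a PE-local buffer accumulates all of them and releases a single combined contribution at a chosen flush point, e.g., once latency-sensitive traffic has drained.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical sketch: defer tiny per-energy reduction contributions
// and release them as one combined contribution at a flush point.
class DeferredReduction {
    std::vector<double> partials;   // one entry per energy quantity
    bool flushed = false;
public:
    void contribute(double v) { partials.push_back(v); }  // cheap, PE-local
    // Produces the combined contribution exactly once; the caller would
    // hand `out` to the reduction spanning tree at this point.
    bool flush(double& out) {
        if (flushed || partials.empty()) return false;
        out = std::accumulate(partials.begin(), partials.end(), 0.0);
        flushed = true;
        return true;
    }
};
```

The trade-off is latency for the reduction result versus less interference with the large-message traffic; since these quantities are rarely on the critical path, that trade is usually safe.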
The basic issue here ties into critical-path management, as seen in Isaac's work (http://charm.cs.illinois.edu/newPapers/10-24/paper.pdf and http://www.ijnc.org/index.php/ijnc/article/view/15) and Yanhua's PICS control-point follow-up work.
Isaac's implementation came with a fair amount of message header freight, and Yanhua's PICS infrastructure did not demonstrate production level improvement for the general case.
This is a much narrower idea, so it may be possible to slice out something from the prior work that solves just this one issue in a robust way.