Feature #634

Mechanism (fence/barrier) to delay execution of long methods until latency-sensitive methods are done

Added by Phil Miller over 4 years ago. Updated 10 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date: 12/18/2014
Due date:
% Done: 0%


Description

Several applications encounter cases where there is latency-sensitive work exposed concurrently with long-running but less latency-sensitive work that could delay it on a given node:
  • NAMD wants to let multicast proxies fully distribute atom data before non-bonded computes start running
  • ClothSim wants to respond to mesh information requests for remote collision work before starting local calculations
  • OpenAtom has a use case, per Eric B (please edit to fill in)
  • CharmLU wants to progress active-panel updates and reductions before any trailing updates

Right now, they each implement their own barrier-like construct to hold off the long methods until the latency-sensitive bits are finished. Four independent applications is more than enough to justify implementing a common mechanism in the Charm++ model.

History

#1 Updated by Jim Phillips over 4 years ago

A few design issues:

  • The user has to specify either which messages should be held or which messages should not be held. I would suggest using a scale with a default value and letting each hold specify the minimum length that should be held. We could also let the Charm++ runtime measure and then predict which entry points might be long-running.
  • At some point long-running entry points from step N should not be held, but those from step N+1 should.
  • Does a newly created hold only affect newly received/generated messages or does it look at existing queue entries too?
  • In NAMD things are simplified. We hold all computes until the last patch is ready, which means that at the point the hold is lifted on a PE, we know that all computes for step N are waiting in the queue but there are no computes for step N+1. This is not true for PME, CUDA, or any other compute that returns and is rescheduled for the same timestep, but we don't hold those. Multiple timestepping also complicates things, because it is possible that a patch only has computes on a node for even timesteps, so it could get a step ahead and confuse things.
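
To make the first and third design issues above concrete, here is a minimal, standalone C++ sketch of a per-PE queue with a length-scale hold. It is not Charm++ API: `Task`, `HoldingScheduler`, `setHold`, `liftHold`, and `runOne` are all hypothetical names, and the integer `length` stands in for the suggested scale. The sketch assumes one particular answer to the third question: the hold is checked at dequeue time, so it also catches entries that were already queued when the hold was created. It does not address the step N versus step N+1 issue.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration only; none of these names exist in Charm++.
struct Task {
    int length;                 // user-assigned scale: larger means longer-running
    std::function<void()> run;  // the work to perform
};

class HoldingScheduler {
public:
    void enqueue(Task t) { ready_.push_back(std::move(t)); }

    // Hold back every task whose length is >= minLength.
    void setHold(int minLength) {
        assert(minLength >= 0);
        holdActive_ = true;
        minHeldLength_ = minLength;
    }

    // Release the hold; held tasks go back to the front in their original order.
    void liftHold() {
        holdActive_ = false;
        ready_.insert(ready_.begin(),
                      std::make_move_iterator(held_.begin()),
                      std::make_move_iterator(held_.end()));
        held_.clear();
    }

    // Run the next task that is not held. Returns false if nothing is runnable.
    bool runOne() {
        while (!ready_.empty()) {
            Task t = std::move(ready_.front());
            ready_.pop_front();
            if (holdActive_ && t.length >= minHeldLength_) {
                held_.push_back(std::move(t));  // set aside, checked at dequeue time
                continue;
            }
            t.run();
            return true;
        }
        return false;
    }

private:
    bool holdActive_ = false;
    int minHeldLength_ = 0;
    std::deque<Task> ready_;
    std::deque<Task> held_;
};

// Usage: a long task queued before the hold is still delayed behind short ones.
std::vector<std::string> demo() {
    std::vector<std::string> log;
    HoldingScheduler s;
    s.enqueue({10, [&log] { log.push_back("trailing-update"); }});
    s.setHold(5);
    s.enqueue({1, [&log] { log.push_back("panel-update"); }});
    s.enqueue({1, [&log] { log.push_back("reduction"); }});
    while (s.runOne()) {}  // drains only the short, latency-sensitive tasks
    s.liftHold();
    while (s.runOne()) {}  // now the long task runs
    return log;
}
```

Calling `demo()` yields the order panel-update, reduction, trailing-update. Checking at dequeue time keeps `setHold` cheap; the alternative of filtering only at enqueue time would let already-queued long tasks run ahead of the latency-sensitive work.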

#2 Updated by Nikhil Jain over 3 years ago

  • Target version changed from 6.7.0 to 6.7.1

#3 Updated by Phil Miller over 3 years ago

  • Target version changed from 6.7.1 to 6.8.0

Truly new features that add application-level interfaces shouldn't be going into point releases.

#4 Updated by Eric Bohm over 3 years ago

In OpenAtom, the non-local particle force and energy computation has its dependencies met early in the time step, so it can be launched early. However, its output, particularly the force, is not critical until late in the time step, when forces are integrated.

This is currently (crudely) managed by a runtime-configurable parameter that launches the non-local computation alongside one of the other phases (gspace, rho, realspace), along with prioritized messages. The message traffic associated with the non-local computation is proportional to the size of the molecular system (i.e., it is the structure factor information for the portion of the system being computed locally). The computation time is N^2 log N multiplied by the number of atom types and the number of m channels.

Therefore, in fairly simple systems with a small number of types and channels, this is relatively fast and, network permitting, can be overlapped with the electron density phase. That is desirable, as the electron density computation is one of our Amdahl bottleneck problems due to its relatively small size. However, the electron density computation is in the main line of the critical path for our heaviest hitter, the electron state wave function computation, so if the attempted overlap delays the density (or the electron state) by interfering with it, there is a net loss in performance.

In more complex systems, with many types and channels, the non-local computation can grow to dominate the total time step, and should therefore be started as early as is otherwise profitable, modulo self-interference with the electron state or density.

Therefore, there are two distinct but related kinds of activity that could be adaptively managed: the communication of the structure factor data, and the computation that depends upon it.

At a lower priority, there are also the global reductions to compute a variety of total energy quantities. The computation associated with these is tiny (sum of a double per energy for less than a dozen quantities). In most cases, these are not prominent in the critical path, and could therefore be held back if the communication infrastructure is struggling to manage the large overall message volume. There is no code to manage that now. We contribute to the reduction spanning tree as soon as we have the necessary data and hope that it doesn't contribute to performance problems.

#5 Updated by Eric Bohm over 2 years ago

  • Target version changed from 6.8.0 to 6.9.0

#6 Updated by Eric Bohm over 1 year ago

  • Target version changed from 6.9.0 to 7 (Next Generation Charm++)
  • Priority changed from High to Normal

#7 Updated by Eric Bohm 10 months ago

The basic issue here ties into critical path management, as seen in Isaac's work (http://charm.cs.illinois.edu/newPapers/10-24/paper.pdf and http://www.ijnc.org/index.php/ijnc/article/view/15) and Yanhua's PICS control point follow-up work.

Isaac's implementation came with a fair amount of message header freight, and Yanhua's PICS infrastructure did not demonstrate production-level improvement for the general case.

This is a much narrower idea, so it may be possible to slice out something from the prior work that solves just this one issue in a robust way.
