Automatic process launching, thread spawning, and hardware binding
Top-level tracking issue for integrating hwloc with the RTS and using it to make life easier for users who are not running detailed experiments on the thread-to-hardware mapping. This should mostly obviate the need for +ppn, +pemap, +commap, +setcpuaffinity, and in many cases +p or other specifications of how much hardware to use.
#2 Updated by Phil Miller over 2 years ago
Existing implementation is on branch hwloc: https://charm.cs.illinois.edu/gerrit/gitweb?p=charm.git;a=shortlog;h=refs/heads/hwloc
#4 Updated by Phil Miller over 2 years ago
Before getting into how it works, there are a few caveats of the current implementation I should note:
- The current implementation assumes that the compute nodes have identical hardware to the node on which 'charmrun' executes, and that the compute nodes are all uniform. We intend to fix this, but possibly not in the initial release.
- Argument sets that inadvertently oversubscribe the available hardware should generally emit a warning at startup, but will still perform poorly (likely worse than the current code).
- When passed arguments that don't fully specify the desired configuration, the defaults selected may be sub-optimal. In some cases, we don't yet have a definite sense of what those defaults 'should' be.
Feedback on desirable resolutions of the above points, where non-obvious, will be helpful.
Usage of the new functionality involves specifying three bits of information, each of which should be more orthogonal than the current options.
The first argument is the number of physical compute nodes, or 'hosts', on which the job should run:
These will be drawn from the nodelist file in order. How that file is located, and which group within it to use, remains as documented in the manual.
The second is the number of OS processes to launch, in terms of hardware units within the hosts, based on any one of the following arguments:
++processPerHost Y
++processPerSocket Y
++processPerCore Y
++processPerPU Y
In this context, 'PU' refers to a hardware thread, of which there may be several per physical core if the hardware supports simultaneous multithreading (e.g. Hyper-Threading).
In a build of Charm++ with the 'smp' option, each such process will comprise a communication thread and one or more worker threads, as determined by the options below. In builds without the 'smp' option, each process will comprise exactly one worker thread: thus, the above options are definitive, and the options below will be meaningless.
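As a rough illustration of how the ++processPer* options might translate into process counts, the sketch below multiplies out a host topology. The `Topology` class, its example values (2 sockets, 8 cores per socket, 2 PUs per core), and the reading of Y as "Y processes per hardware unit" are assumptions made for illustration; none of this is taken from hwloc or from the charmrun implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical uniform host description (not read from hwloc)."""
    sockets_per_host: int
    cores_per_socket: int
    pus_per_core: int

    def units_per_host(self, unit: str) -> int:
        """Number of hardware units of the given kind on one host."""
        cores = self.sockets_per_host * self.cores_per_socket
        counts = {
            "Host": 1,
            "Socket": self.sockets_per_host,
            "Core": cores,
            "PU": cores * self.pus_per_core,
        }
        return counts[unit]


def processes_per_host(topo: Topology, unit: str, y: int) -> int:
    """Processes on one host for ++processPer<unit> y (assumed: y per unit)."""
    return topo.units_per_host(unit) * y


def total_processes(topo: Topology, hosts: int, unit: str, y: int) -> int:
    """Processes across the whole job, assuming every host is identical."""
    return hosts * processes_per_host(topo, unit, y)


# Illustrative host: 2 sockets x 8 cores x 2 PUs.
example = Topology(sockets_per_host=2, cores_per_socket=8, pus_per_core=2)
```

With this example host, one process per socket gives 2 processes per host, and one process per core on 4 hosts gives 64 processes in total. The uniform-host assumption mirrors the first caveat listed above.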
The third is the number of OS threads to launch, again in terms of hardware units, based on any one of the following arguments:
++oneWthPerHost
++oneWthPerSocket
++oneWthPerCore
++oneWthPerPU
These will always be subsidiary to the processes specified above. Requesting threads at a coarser hardware granularity than the processes containing them therefore makes no sense, and the job will abort.
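The granularity rule can be sketched as a small check. This is only an illustration under stated assumptions: the `UNITS_PER_HOST` table describes a hypothetical host, one process per process-level unit is assumed, and a Python exception stands in for the job abort. It is not the actual charmrun logic.

```python
# Hardware units ordered from coarsest to finest.
GRANULARITY = ["Host", "Socket", "Core", "PU"]

# Hypothetical uniform host: 2 sockets x 8 cores x 2 PUs (illustrative only).
UNITS_PER_HOST = {"Host": 1, "Socket": 2, "Core": 16, "PU": 32}


def worker_threads_per_process(process_unit: str, thread_unit: str) -> int:
    """Worker threads in one process, assuming one process per process_unit.

    Raises ValueError (standing in for the job abort) when the thread unit
    is coarser than the process unit.
    """
    if GRANULARITY.index(thread_unit) < GRANULARITY.index(process_unit):
        raise ValueError(
            f"threads per {thread_unit} is coarser than processes per {process_unit}"
        )
    # Each process owns one process_unit; count the thread_units inside it.
    return UNITS_PER_HOST[thread_unit] // UNITS_PER_HOST[process_unit]
```

On this example host, one process per socket with one worker thread per core yields 8 worker threads per process, while one process per core with one worker thread per socket is rejected.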
Alternatively, to match resource usage more precisely to application characteristics, 'smp' builds still support the ++ppn option, which asks for an exact number of worker threads in each process, in addition to the communication thread.
Whether using ++oneWthPer* or ++ppn, use of the ++processPer* options will activate automatic thread affinity binding, obviating the previous use of +pemap options.
The ++oneWthPer* options will naturally bind each thread to the corresponding hardware unit for which it was launched, all within the hardware unit of the containing process.
The ++ppn option, when passed a value that doesn't fully subscribe all hardware threads, works with an additional degree of freedom. Namely, there's the choice whether to 'bunch' the threads to share hardware, or to 'spread' the threads to use as much independent hardware as possible. We have chosen to spread the thread bindings. This favors maximum overall execution unit throughput and total cache capacity, rather than data sharing within each cache.
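To make the 'bunch' versus 'spread' distinction concrete, here is a minimal sketch of the two placement orders for the PUs owned by one process. The PU numbering, the `cores` layout, and both function names are hypothetical; the real binding order chosen by the implementation may differ in detail.

```python
from typing import List


def bunch(cores: List[List[int]], n: int) -> List[int]:
    """Fill every PU of a core before moving on to the next core."""
    flat = [pu for core in cores for pu in core]
    if n > len(flat):
        raise ValueError("more threads requested than PUs available")
    return flat[:n]


def spread(cores: List[List[int]], n: int) -> List[int]:
    """Take one PU from each core in turn, so threads share as little as possible."""
    total = sum(len(core) for core in cores)
    if n > total:
        raise ValueError("more threads requested than PUs available")
    order = []
    depth = max(len(core) for core in cores)
    for i in range(depth):          # i-th PU within each core
        for core in cores:          # visit the cores round-robin
            if i < len(core):
                order.append(core[i])
    return order[:n]


# Hypothetical process spanning 4 cores with 2 PUs each.
cores = [[0, 1], [2, 3], [4, 5], [6, 7]]
```

For 4 worker threads on this layout, `bunch` selects PUs 0, 1, 2, 3 (two cores, both hardware threads of each), while `spread` selects PUs 0, 2, 4, 6 (one hardware thread on each of four cores), which corresponds to the spreading behavior described above.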
In general, we expect a given application to have some optimal choice of process and thread arrangement, leaving only the number of hosts on which to run as a variable.