Feature #1173

Automatic process launching, thread spawning, and hardware binding

Added by Phil Miller over 2 years ago. Updated 6 months ago.

Status: In Progress
Priority: High
Assignee: Evan Ramos
Category: -
Target version: -
Start date: 08/23/2016
Due date:
% Done: 60%
Tags: provisioning

Description

Top-level tracking issue for work based on integrating hwloc with the RTS and using it to make life easier for users when they're not trying to do detailed experiments on the thread->hardware mapping. This should mostly obviate the need for +ppn, +pemap, +commap, +setcpuaffinity, and in many cases +p or other specifications of how much hardware to use.
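
For illustration, on a hypothetical dual-socket host with 12 cores per socket, a launch that today needs explicit maps along the lines of

./charmrun +p 22 ++ppn 11 +pemap 1-11,13-23 +commap 0,12 ./pgm

should be expressible with no hardware-specific arguments, roughly as

./charmrun ++processPerSocket 1 ++oneWthPerCore ./pgm

(the binary name and core numbering are placeholders, and the two invocations are only roughly equivalent; the new options are described in the notes below).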


Subtasks

Feature #1039: reject pemap/commap with duplicate or too few cpus (New, Evan Ramos)

Bug #1174: Use hwloc data from compute host, rather than assuming they're identical to the host running charmrun (Merged, Evan Ramos)

Feature #1175: Don't require autoconf to be installed on user systems for hwloc build (Merged, Evan Ramos)

Feature #1176: Detect unsupported non-uniformity of processes/threads in charmrun, and error (Merged, Evan Ramos)

Feature #1177: Support variable numbers of processes per host, each with identical thread counts (New, Evan Ramos)

Feature #1178: Support automated launch/spawn/bind when using charmrun ++mpiexec (Merged, Evan Ramos)

Feature #1179: Support automated launch/spawn/bind on gni-crayxc/gni-crayxe systems (New, Evan Ramos)

Feature #1180: Support automated thread spawn/bind for standalone runs (Merged, Evan Ramos)

Feature #1181: Support automated process launch on a single host for standalone runs (Merged, Evan Ramos)

Feature #1182: Support automated thread spawn/bind for MPI SMP builds (New, Evan Ramos)


Related issues

Related to Charm++ - Feature #973: multicore: spawn a thread per core by default (Merged, 02/14/2016)
Related to Charm++ - Bug #833: mpi smp build is locked to one core per node by default (Merged, 09/11/2015)

History

#1 Updated by Phil Miller over 2 years ago

  • Description updated (diff)

#3 Updated by Phil Miller over 2 years ago

For new readers: the implementation on that branch currently works for runs spawned with charmrun for netlrts and verbs build targets, with and without SMP.

#4 Updated by Phil Miller over 2 years ago

CAVEATS AND LIMITATIONS
Before getting into how it works, there are a few caveats of the current implementation I should note:
  • The current implementation assumes that the compute nodes have identical hardware to the node on which 'charmrun' executes, and that the compute nodes are all uniform. We intend to fix this, but possibly not in the initial release.
  • Sets of arguments that inadvertently oversubscribe the available hardware should generally trigger a warning at startup, but will still run and give poor results (likely worse than the current code).
  • When passed arguments that don't fully specify the desired configuration, the defaults selected may be sub-optimal. In some cases, we don't yet have a definite sense of what those defaults 'should' be.
Feedback on desirable resolutions of the above points, where non-obvious, will be helpful.

GENERAL USAGE
Usage of the new functionality involves specifying three pieces of information, which are more orthogonal to one another than the current options.

The first argument is the number of physical compute nodes, or 'hosts', on which the job should run:
++numHosts X
These will be drawn from the nodelist file in order. How that file is located, and which group within it to use, remains as documented in the manual.
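
For instance, a minimal sketch (with './pgm' as a placeholder binary) requesting four hosts from the nodelist:

./charmrun ++numHosts 4 ./pgm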

The second is the number of OS processes to launch, in terms of hardware units within the hosts, based on any one of the following arguments:

++processPerHost Y
++processPerSocket Y
++processPerCore Y
++processPerPU Y

In this context, 'PU' refers to a hardware thread, of which there may be several per physical core if Hyperthreading is supported.

In a build of Charm++ with the 'smp' option, each such process will comprise a communication thread and one or more worker threads, as determined by the options below. In builds without the 'smp' option, each process will comprise exactly one worker thread: thus, the above options are definitive, and the options below will be meaningless.
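
As a sketch (the counts and the './pgm' binary are placeholders), launching two processes on every socket of each of four hosts:

./charmrun ++numHosts 4 ++processPerSocket 2 ./pgm

In an 'smp' build, the worker-thread count of each of those processes would then come from the options below.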

The third is the number of OS threads to launch, again in terms of hardware units.

++oneWthPerHost
++oneWthPerSocket
++oneWthPerCore
++oneWthPerPU

These will always be subsidiary to the processes specified above. Thus, it doesn't make sense to ask for threads at a coarser hardware granularity than the processes containing them; if that is requested, the job will abort.
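
For example (placeholder binary again), one process per socket, with one worker thread bound to each core of that socket:

./charmrun ++numHosts 4 ++processPerSocket 1 ++oneWthPerCore ./pgm

Reversing the granularity, e.g. combining ++processPerCore 1 with ++oneWthPerSocket, is exactly the coarser-than-process case described above and would abort.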

Alternatively, to specify resource usage more precisely to match application characteristics, in 'smp' builds we still support
++ppn Z
to ask for an exact number of worker threads in each process, in addition to the communication thread.

Whether using ++oneWthPer* or ++ppn, use of the ++processPer* options will activate automatic thread affinity binding, obviating the previous use of +pemap options.
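
A sketch of the ++ppn form (the thread count and binary name are illustrative):

./charmrun ++numHosts 4 ++processPerSocket 1 ++ppn 6 ./pgm

Because ++processPerSocket is given, the six worker threads and the communication thread of each process are bound automatically; no +pemap or +commap is needed.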

The ++oneWthPer* options will naturally bind each thread to the corresponding hardware unit for which it was launched, all within the hardware unit of the containing process.

The ++ppn option, when passed a value that doesn't fully subscribe all hardware threads, introduces an additional degree of freedom: whether to 'bunch' the threads so they share hardware, or to 'spread' them across as much independent hardware as possible. We have chosen to spread the thread bindings, favoring maximum overall execution-unit throughput and total cache capacity over data sharing within each cache.
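
As a concrete sketch of the 'spread' choice, assume a hypothetical socket with 6 physical cores and 2 PUs per core, launched with ++processPerSocket 1 ++ppn 3: the three worker threads would each be bound to their own physical core, rather than being packed onto the hyperthreads of the first two cores, trading cache sharing for more independent execution units and cache capacity.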

In general, we expect a given application to have some optimal choice of process and thread arrangement, leaving only the number of hosts on which to run as a variable.

#5 Updated by Jim Phillips over 2 years ago

Two missing capabilities:

1) reserve one (more?) core or PU for the OS - equivalent to -r 1 on Cray's aprun

2) only use 2 (or some other number) hyperthreads per core - equivalent to -j 2 on Cray's aprun, matters on KNL and POWER

#6 Updated by Phil Miller about 2 years ago

  • Target version changed from 6.8.0 to 6.8.0-beta1

#7 Updated by Phil Miller about 2 years ago

  • Target version changed from 6.8.0-beta1 to 6.8.0

#8 Updated by Eric Bohm almost 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

#9 Updated by Sam White over 1 year ago

  • Target version changed from 6.8.1 to 6.9.0

#10 Updated by Phil Miller about 1 year ago

  • Assignee changed from Phil Miller to Evan Ramos

#11 Updated by Evan Ramos 12 months ago

  • Status changed from In Progress to Implemented

#12 Updated by Evan Ramos 12 months ago

  • Status changed from Implemented to In Progress

#13 Updated by Evan Ramos 10 months ago

  • Target version deleted (6.9.0)

Removing the version tag from this meta-task since the remaining subtasks have not been discussed for inclusion with the upcoming version.

#14 Updated by Evan Ramos 6 months ago

  • Tags set to provisioning
