++oneWthPerSocket doesn't work on Darwin
$ ./charmrun ./jacobi 2 2 2 40 +vp8 +balancer RotateLB +LBDebug 1 ++local ++np 1 ++oneWthPerSocket
Charmrun> scalable start enabled.
Charmrun> Error: Invalid request for 0 PEs among 1 processes per host.
This should do the same thing as ++oneWthPerHost and use 1 PE.
But if my MacBook has a single socket with 8 PUs, why shouldn't "++processPerSocket 1 ++oneWthPerSocket" launch 1 process with one worker thread and one comm thread in it, for a netlrts-darwin-x86_64-smp build?
If I did "++processPerPU 1 ++oneWthPerPU", then we should warn the user (or abort? Need to check what the behavior was before the hwloc-based launch options...) that they have oversubscribed the hardware with threads.
#8 Updated by Evan Ramos 5 months ago
./charmrun ++local ++processPerPU 1 ++oneWthPerPU ./hello
Charmrun> scalable start enabled.
Charmrun> Error: Invalid request for 0 PEs among 8 processes per host.
I agree that the error message should be clearer. Beyond that, I'm not sure what the best course of action is here. I currently special-case ++oneWthPerHost to allocate one worker thread and one comm thread; otherwise the option would be entirely useless. Maybe the same thing needs to be done for the other options.
Here's my understanding of the current hwloc-based command line arguments. For a concrete example, let's say we have 1 host with 2 sockets, each socket has 8 cores and each core has 2 PUs. Then the host has in total 16 cores or 32 PUs.
Non-SMP mode is mostly straightforward, but there are a couple questions as to the placement of threads:
Non-SMP Mode:
++processPerHost 1 = 1 process
++processPerHost 2 = 2 processes (first 2 PUs or first 2 cores or one per socket?)
++processPerHost 16 = 16 processes (first 16 PUs or one per core?)
++processPerHost 32 = 32 processes, one per PU
++processPerSocket 1 = 2 processes, one on each of the sockets
++processPerSocket 8 = 16 processes (one per core or first 8 PUs of each socket?)
++processPerSocket 16 = 32 processes, one on each of the PUs
++processPerCore 1 = 16 processes, one on each of the cores
++processPerCore 2 = 32 processes, one on each of the PUs
++processPerPU 1 = 32 processes, one on each of the PUs
++processPerHost >32 = error
++processPerSocket >16 = error
++processPerCore >2 = error
++processPerPU >1 = error
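To make the non-SMP counts above easy to check, here is a minimal Python sketch of the arithmetic for the example host (2 sockets, 8 cores per socket, 2 PUs per core). The function name, the dictionaries, and the error text are hypothetical and only model the list above; this is not charmrun's actual implementation, and it says nothing about where the processes are placed.

```python
# Hypothetical model of the example host: 2 sockets x 8 cores x 2 PUs.
UNITS_PER_HOST = {"host": 1, "socket": 2, "core": 16, "pu": 32}
# Number of PUs contained in one unit of each kind.
PUS_PER_UNIT = {"host": 32, "socket": 16, "core": 2, "pu": 1}


def non_smp_process_count(unit, per_unit):
    """Total processes on the host for '++processPer<unit> <per_unit>'.

    Assumption: a request is an error once it asks for more processes
    than there are PUs inside one unit, as in the list above.
    """
    if per_unit < 1:
        raise ValueError("process count must be at least 1")
    if per_unit > PUS_PER_UNIT[unit]:
        raise ValueError(
            f"{per_unit} processes per {unit} exceeds "
            f"{PUS_PER_UNIT[unit]} PUs per {unit}"
        )
    return per_unit * UNITS_PER_HOST[unit]


if __name__ == "__main__":
    for unit, n in [("host", 2), ("socket", 8), ("core", 2), ("pu", 1)]:
        print(f"++processPer{unit.capitalize()} {n} ->",
              non_smp_process_count(unit, n), "processes")
```

The open placement questions (first N PUs versus one per core or socket) are deliberately outside this sketch.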
SMP Mode:
++processPerHost 1 ++oneWthPerHost = error (currently special-cased to work with 1 wth + 1 commth)
++processPerHost 1 ++oneWthPerSocket = 1 process with 1 wth + 1 commth, each on a different socket
++processPerHost 1 ++oneWthPerCore = 1 process with 15 wth, each on a different core, + 1 commth
++processPerHost 1 ++oneWthPerPU = 1 process with 31 wth + 1 commth
++processPerSocket 1 ++oneWthPerHost = error
++processPerSocket 1 ++oneWthPerSocket = error (could be special-cased?)
++processPerSocket 1 ++oneWthPerCore = 2 processes, 1 per socket, with 7 wth + 1 commth each
++processPerSocket 1 ++oneWthPerPU = 2 processes, 1 per socket, with 15 wth + 1 commth each
++processPerCore 1 ++oneWthPerHost = error
++processPerCore 1 ++oneWthPerSocket = error
++processPerCore 1 ++oneWthPerCore = error (could be special-cased?)
++processPerCore 1 ++oneWthPerPU = 16 processes, 1 per core, with 1 wth + 1 commth each
++processPerPU 1 ++oneWthPer* = error
++processPerSocket 2 ++oneWthPerCore = 4 processes, 2 per socket, with 3 wth + 1 commth each
++processPerSocket 2 ++oneWthPerPU = 4 processes, 2 per socket, with 7 wth + 1 commth each
Is this right? Obviously there are more combinations that can be used.
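The SMP entries above all seem to follow one rule: divide the number of worker-thread units on the host by the number of processes, then give up one of those slots to the comm thread. A minimal Python sketch of that rule, again for the 2-socket, 16-core, 32-PU example host, is below. The function name and error text are hypothetical, and the sketch ignores the existing ++oneWthPerHost special case; it is a reading of the list, not charmrun's code.

```python
# Hypothetical model of the example host: 2 sockets x 8 cores x 2 PUs.
UNITS_PER_HOST = {"host": 1, "socket": 2, "core": 16, "pu": 32}


def smp_layout(process_unit, processes_per_unit, wth_unit):
    """Return (processes, worker_threads_per_process) for
    '++processPer<process_unit> N ++oneWthPer<wth_unit>'.

    Assumption: each process gets an equal share of the wth units,
    and one slot of that share is taken by the comm thread.
    """
    processes = processes_per_unit * UNITS_PER_HOST[process_unit]
    slots_per_process = UNITS_PER_HOST[wth_unit] // processes
    worker_threads = slots_per_process - 1  # one slot goes to the comm thread
    if worker_threads < 1:
        raise ValueError(
            f"no worker threads left for {processes} processes "
            f"with one thread per {wth_unit}"
        )
    return processes, worker_threads


if __name__ == "__main__":
    print(smp_layout("host", 1, "core"))    # 1 process, 15 wth
    print(smp_layout("socket", 2, "pu"))    # 4 processes, 7 wth each
```

Under this rule, the single-socket MacBook case from the report ("++processPerSocket 1 ++oneWthPerSocket") leaves zero worker threads, which matches the "0 PEs" in the error message.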