SMP/non-SMP agnostic job launching arguments
Currently a user has to know whether Charm++ was built in SMP mode or not when they are running their application. We should support running with the same set of arguments regardless of SMP or non-SMP mode. This would be nice to have in 6.9.0, but is not necessary for it.
I think we might want to avoid "PE", since that term is overloaded in Charm++ already: a PE is sometimes described like a Core/PU, but at other times it only encapsulates worker threads (CkNumPes() doesn't count comm threads). I think "Thread" would be better, since we no longer need to specify the difference between worker and comm threads; the hwloc-based topology code can account for the comm threads automatically.
I'll propose ++threadsPer(Host|Socket|Core|PU) and ++processPer(Host|Socket|Core|PU), where the ++threadsPer* argument overrides the ++processPer* one in non-SMP mode (++processPer* is still consumed by charmrun but is ignored, and maybe a warning is printed). This way, if a user launches a program using ++threadsPerCore 1 ++processPerSocket 1 on a host with, say, 2 sockets, 16 cores/socket, and 2 PUs/core, we get:
- in non-SMP mode: 32 processes, 1 per core.
- in SMP mode: 2 processes, 1 per socket, each with 15 worker threads + 1 comm thread.
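To make the arithmetic behind those two cases explicit, here is a minimal Python sketch of how the proposed arguments could resolve to a layout. It is only an illustration of the proposal, not charmrun code: the Host class and the layout() function are hypothetical names, and it assumes that in SMP mode one thread per process is taken out of the requested thread count to serve as the comm thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical description of one host's hardware topology."""
    sockets: int
    cores_per_socket: int
    pus_per_core: int

    def count(self, unit: str) -> int:
        """Number of hardware units of the given kind on this host."""
        cores = self.sockets * self.cores_per_socket
        units = {
            "host": 1,
            "socket": self.sockets,
            "core": cores,
            "pu": cores * self.pus_per_core,
        }
        return units[unit]


def layout(host, threads_per, process_per, smp):
    """Resolve (unit, count) pairs such as ("core", 1) into a process layout.

    threads_per models the proposed ++threadsPer* argument and
    process_per models the proposed ++processPer* argument.
    """
    t_unit, t_count = threads_per
    total_threads = host.count(t_unit) * t_count
    if not smp:
        # Non-SMP: each thread is its own process; ++processPer* is ignored.
        return {"processes": total_threads, "workers": 1, "comm": 0}
    p_unit, p_count = process_per
    processes = host.count(p_unit) * p_count
    threads_per_process = total_threads // processes
    # Assumption: one thread per process is repurposed as the comm thread.
    return {"processes": processes, "workers": threads_per_process - 1, "comm": 1}


host = Host(sockets=2, cores_per_socket=16, pus_per_core=2)
non_smp = layout(host, ("core", 1), ("socket", 1), smp=False)
smp = layout(host, ("core", 1), ("socket", 1), smp=True)
print(non_smp)
print(smp)
```

With the example host above, the non-SMP call yields 32 single-threaded processes and the SMP call yields 2 processes with 15 worker threads and 1 comm thread each, matching the two bullets.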
Does that make sense? Maybe "Thread" is just confusing when running in non-SMP mode...
#4 Updated by Evan Ramos 2 months ago
Where is "PE" described like a Core/PU? The only divergence I am aware of is how a PE is a worker thread in SMP mode and a process in non-SMP mode, and this makes the term a good fit for describing provisioning in an SMP-agnostic way.
I agree that comm threads make the choice of terminology difficult, but "thread" feels too unclear in non-SMP mode. My current implementation of the hwloc-informed parameters already steals from the worker thread count to provision a comm thread, so I don't think this is a significant issue.
$ ./hello +oneWthPerPU
...
Charm++> Running in SMP mode: numNodes 1, 7 worker threads per process
...
Charm++> Running on 1 unique compute nodes (8-way SMP).
After reading over our documentation, I find it is better than I remembered at avoiding saying that PEs are cores/hyperthreads.
From the Charm++ manual section 1.2:
"On each PE ('PE' stands for a 'Processing Element'. PEs are akin to processor cores; see section 1.4 for a precise description), there is a scheduler operating with its own private pool of messages."
From section 1.4:
"In a Charm++ program, a PE is a unit of mapping and scheduling: each PE has a scheduler with an associated pool of messages. Each chare is assumed to reside on one PE at a time. Depending on the runtime command-line parameters, a PE may be associated with a subset of cores or hardware threads."
"For example, on a machine with 16-core nodes, where each core has two hardware threads, one may launch a Charm++ program with one or multiple (logical) nodes per physical node. One may choose 32 PEs per (logical) node, and one logical node per physical node. Alternatively, one can launch it with 12 PEs per logical node, and 1 logical node per physical node. One can also choose to partition the physical node, and launch it with 4 logical nodes per physical node (for example), and 4 PEs per node. It is not a general practice in Charm++ to oversubscribe the underlying physical cores or hardware threads on each node."
I read the first part of section 1.4 as meaning that only worker threads are PEs (since chares can't be mapped to comm threads), but the second part of 1.4 reads to me like comm threads count as PEs since it implies that you can run with 32 PEs in 1 logical node and you won't be oversubscribed.
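The two readings can be checked against the manual's three example configurations with a short Python sketch. The function name threads_needed is hypothetical, and the sketch assumes SMP mode with exactly one comm thread per logical node on a node with 16 cores x 2 hardware threads = 32 PUs.

```python
# Hardware threads (PUs) on the manual's example node: 16 cores x 2 each.
PUS_PER_NODE = 16 * 2


def threads_needed(logical_nodes, pes_per_node, comm_counts_as_pe):
    """Total OS threads on one physical node for a given configuration.

    Assumes SMP mode with one comm thread per logical node.
    """
    pes = logical_nodes * pes_per_node
    if comm_counts_as_pe:
        # Reading B: the comm thread is already included in the PE count.
        return pes
    # Reading A: PEs are worker threads only, so comm threads are extra.
    return pes + logical_nodes


# (logical nodes per physical node, PEs per logical node) from the manual.
configs = [(1, 32), (1, 12), (4, 4)]

for nodes, pes in configs:
    for comm_counts_as_pe in (False, True):
        total = threads_needed(nodes, pes, comm_counts_as_pe)
        status = "oversubscribed" if total > PUS_PER_NODE else "fits"
        print(nodes, pes, comm_counts_as_pe, total, status)
```

Only the first configuration distinguishes the readings: with worker-only PEs, 32 PEs plus 1 comm thread is 33 threads on 32 PUs, while the other two configurations fit under either reading.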
But maybe I'm just reading too much into that part of the documentation, and we should just update the documentation to say that a PE is a worker thread in SMP mode or a process in non-SMP mode. I now think your suggestion of PEsPer* is better than ++threadsPer*.