AMPI implements subcommunicators in an unscalable fashion
This is the underlying issue behind https://charm.cs.illinois.edu/redmine/issues/1962
MPI_COMM_SELF is the pathological case, creating 'n' 1-element chare arrays, but in general AMPI currently implements every subcommunicator as a chare array, so scalability is limited by the number of collections Charm++ can encode in its 64-bit ID. After the patch for the issue above, AMPI can run with up to 16M ranks; to go further we would run out of collection bits in the ID.
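To make the scaling concern concrete, here is an illustrative back-of-the-envelope calculation. The bit split below is an assumption for illustration only, not Charm++'s actual 64-bit ID layout; the point is that chare-array-based subcommunicators consume one collection each, so collection usage grows linearly with the number of communicators created.

```python
# Illustrative arithmetic only: the 24/40 bit split is an assumed layout,
# not Charm++'s actual 64-bit ID encoding.
ID_BITS = 64
ELEMENT_BITS = 24                          # assume ~16M ranks per collection
COLLECTION_BITS = ID_BITS - ELEMENT_BITS   # bits left over to number collections

max_collections = 2 ** COLLECTION_BITS

# With chare-array-based subcommunicators, every communicator is a new
# collection; MPI_COMM_SELF alone adds one 1-element array per rank.
ranks = 16 * 2**20                         # 16M ranks
collections_for_self = ranks               # one collection per rank, just for MPI_COMM_SELF

print(max_collections, collections_for_self)
```

Even with generous bit budgets, any per-rank or per-split collection cost eventually collides with a fixed-width ID.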
An alternative would be to implement subcommunicators using sections, with additional data structures to translate between each subcommunicator's ranks and the parent communicator's ranks.
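The extra bookkeeping could be as simple as one translation table per subcommunicator. The sketch below uses hypothetical names and plain Python data structures, not AMPI's actual code; it just shows the two lookups needed (sub rank to parent rank for outgoing messages, parent rank to sub rank for incoming ones):

```python
class Subcomm:
    """Hypothetical sketch: a subcommunicator viewed as a section of its
    parent communicator, plus rank-translation tables."""

    def __init__(self, parent_ranks):
        # sub rank -> parent rank: just the ordered member list
        self.to_parent = list(parent_ranks)
        # parent rank -> sub rank: reverse lookup for incoming messages
        self.from_parent = {p: s for s, p in enumerate(self.to_parent)}

    def size(self):
        return len(self.to_parent)

    def parent_rank(self, sub_rank):
        return self.to_parent[sub_rank]

    def sub_rank(self, parent_rank):
        # None if the parent rank is not a member (akin to MPI_UNDEFINED)
        return self.from_parent.get(parent_rank)

# e.g. a subcommunicator over the even parent ranks 0..7
evens = Subcomm([0, 2, 4, 6])
```

The memory cost is O(size of the subcommunicator) on its members, rather than a whole new collection in the runtime's ID space.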
Two things we want in AMPI's implementation of subcommunicators:
- The same syntax for AMPI to communicate over every communicator: the same type of proxy for remote method invocation, and the same syntax for broadcasts, reductions, etc.
- Consistent performance when messaging over any communicator.
MPI_COMM_SELF also currently consumes several KB per VP, which could be saved if we created MPI_COMM_SELF lazily, on first use.
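Lazy creation could follow the standard create-on-first-use pattern. The sketch below uses hypothetical names (not AMPI's code): the per-VP state defers building the self-communicator's tables until something actually asks for them.

```python
class RankState:
    """Hypothetical per-VP state; MPI_COMM_SELF's tables are built on demand."""

    def __init__(self, world_rank):
        self.world_rank = world_rank
        self._comm_self = None   # several KB of tables, deferred until needed

    def comm_self(self):
        # create on first use; later calls return the cached object
        if self._comm_self is None:
            self._comm_self = {"ranks": [self.world_rank], "size": 1}
        return self._comm_self

vp = RankState(world_rank=5)     # nothing allocated for MPI_COMM_SELF yet
```

VPs that never touch MPI_COMM_SELF then pay nothing for it.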
Also, a nice feature of sections is that collective messages only need to reach the PEs that actually host section members, whereas chare array broadcasts go to all PEs.