24. Partitioning in Charm++

With the latest 6.5.0 release, Charm++ has been augmented with support for partitioning. The key idea is to divide the allocated set of nodes into subsets that run independent Charm++ instances. These Charm++ instances (called partitions from now on) have a unique identifier, can be programmed to do different tasks, and can interact with each other. Addition of the partitioning scheme does not affect the existing code base or codes that do not want to use partitioning. Some of the use cases of partitioning are replicated NAMD, replica based fault tolerance, studying mapping performance etc. In some aspects, partitioning is similar to disjoint communicator creation in MPI.

24.1 Overview

Charm++ stack has three components - Charm++, Converse and machine layer. In general, machine layer handles the exchange of messages among nodes, and interacts with the next layer in the stack - Converse. Converse is responsible for scheduling of tasks (including user code) and is used by Charm++ to execute the user application. Charm++ is the top-most level in which the applications are written. During partitioning, Charm++ and machine layers are unaware of the partitioning. Charm++ assumes its partition to be the entire world, whereas machine layer considers the whole set of allocated nodes as one partition. During start up, converse divides the allocated set of nodes into partitions, in each of which Charm++ instances are run. It performs the necessary translations as interactions happen with Charm++ and the machine layer. The partitions can communicate with each other using the Converse API described later.

24.2 Ranking

Charm++ stack assigns a rank to every processing element (PE). In the non-partitioned version, a rank assigned to a PE is same at all three layers of Charm++ stack. This rank also (generally) coincides with the rank provided to processors/cores by the underlying job scheduler. The importance of these ranks derive from the fact that they are used for multiple purposes. Partitioning leads to segregation of the notion of ranks at different levels of Charm++ stack. What used to be the PE is now a local rank within a partition running a Charm++ instance. Existing methods such as CkMyPe(), CkMyNode(), CmiMyPe(), etc continue to provide these local ranks. Hence, existing codes do not require any change as long as inter-partition interaction is not required.

On the other hand, machine layer is provided with the target ranks that are globally unique. These ranks can be obtained using functions with Global suffix such as CmiNumNodesGlobal(), CmiMyNodeGlobal(), CmiMyPeGlobal() etc.

Converse, which operates at a layer between Charm++ and machine layer, performs the required transitions. It maintains relevant information for any conversion. Information related to partitions can be obtained using Converse level functions such as CmiMyPartition(), CmiNumPartitions(), etc. If required, one can also obtain the mapping of a local rank to a global rank using functions such as CmiGetPeGlobal(int perank, int partition) and CmiGetNodeGlobal(int noderank, int partition). These functions take two arguments - the local rank and the partition number. For example, CmiGetNodeGlobal(5, 2) will return the global rank of the node that belongs to partition 2 and has a local rank of 5 in partition 2. The inverse translation, from global rank to local rank, is not supported.

24.3 Startup and Partitioning

A number of compile time and runtime parameters are available for users who want to run multiple partitions in one single job.

24.4 Redirecting output from individual partitions

Output to standard output (stdout) from various partitions can be directed to separate files by passing the target path as a command line option. The run time parameter +stdout <path> is to be used for this purpose. The <path> may contain the C format specifier %d, which will be replaced by the partition number. In case, %d is specified multiple times, only the first three instances from the left will be replaced by the partition number (other or additional format specifiers will result in undefined behavior). If a format specifier is not specified, the partition number will be appended as a suffix to the specified path. Example usage:

24.5 Inter-partition Communication

A new API was added to Converse to enable sending messages from one replica to another. Currently, following functions are available for the same

Users who have coded in Converse will find these functions to be very similar to basic Converse functions for send – CmiSyncSend and CmiSyncSendAndFree. Given the local rank of a PE and the partition it belongs to, these two functions will pass the message to the machine layer. CmiInterSyncSend does not return till “message” is ready for reuse. CmiInterSyncSendAndFree passes the ownership of “message” to Charm++ RTS, which will free the message when the send is complete. Each converse message contains a message header, which makes those messages active – they contain information about their handlers. These handlers can be registered using existing API in Charm++ - CmiRegisterHandler. CmiInterNodeSend and CmiInterNodeSendAndFree are counterparts to these functions that allow sending of a message to a node (in SMP mode).