- 23.1 Overview
- 23.2 Ranking
- 23.3 Startup and Partitioning
- 23.4 Redirecting Output from Individual Partitions
- 23.5 Inter-partition Communication
CmiMyPe(), etc., continue to provide these partition-local ranks. Hence, existing codes do not require any change as long as inter-partition interaction is not required.
The machine layer, on the other hand, is provided with target ranks that are globally unique. These ranks can be obtained using functions with the Global suffix, such as CmiNumNodesGlobal() and CmiGetNodeGlobal().
Converse, which operates at a layer between Charm++ and the machine layer, performs the required translations. It maintains the relevant information for any conversion. Information related to partitions can be obtained using Converse-level functions such as CmiMyPartition() and CmiNumPartitions(). If required, one can also obtain the mapping of a local rank to a global rank using functions such as CmiGetPeGlobal(int perank, int partition) and CmiGetNodeGlobal(int noderank, int partition). These functions take two arguments: the local rank and the partition number. For example, CmiGetNodeGlobal(5, 2) will return the global rank of the node that has local rank 5 in partition 2. The inverse translation, from global rank to local rank, is not supported.
A number of compile-time and runtime parameters are available for users who want to run multiple partitions in a single job.
+replicas <replica_number> - number of partitions to be created. If no further options are provided, the allocated cores/nodes are divided equally among the partitions. Only this option is supported as of the 6.5.0 release; the remaining options are supported starting with 6.6.0.
+master_partition - assign one core/node as the master partition (partition 0), and divide the remaining cores/nodes equally among the remaining partitions.
+partition_sizes L[-U[:S[.R]]]#W[,...] - defines the sizes of the partitions. A single number identifies a particular partition. Two numbers separated by a dash identify an inclusive range (a lower bound and an upper bound). If they are followed by a colon and another number (a stride), that range will be stepped through in increments of the stride. Within each stride, a dot followed by a run indicates how many partitions to use from that starting point. Finally, a compulsory number sign (#) followed by a width defines the size of each of the partitions identified so far. For example, the sequence 0-4:2#10,1#5,3#15 states that partitions 0, 2, and 4 should be of size 10, partition 1 of size 5, and partition 3 of size 15. In SMP mode, these sizes are in terms of nodes, and all worker threads associated with a node are assigned to that node's partition. This option conflicts with +master_partition.
+partition_topology - use a default topology-aware scheme to partition the allocated nodes.
+partition_topology_scheme <scheme> - use the given scheme to partition the allocated nodes. Currently, two generalized schemes are supported that should be useful on torus networks. If the scheme is set to 1, the allocated nodes are traversed plane by plane during partitioning. A Hilbert-curve-based traversal is used with scheme 2.
Compile-time parameter -custom-part, runtime parameter +use_custom_partition - enables use of user-defined partitioning. In order to implement a new partitioning scheme, a user must link an object exporting a C function with the following prototype:
extern "C" void createCustomPartitions(int numparts, int *partitionSize, int *nodeMap);
numparts (input) - the number of partitions to be created.
partitionSize (input) - an array that contains the size of each partition.
nodeMap (output, preallocated) - a preallocated array of length CmiNumNodesGlobal(). Entry i in this array specifies the new global node rank of the node with default node rank i. The entries in this array are divided block-wise to create partitions, i.e., the first partitionSize[0] new ranks form the first partition, the next partitionSize[1] new ranks form the second partition, and so on.
When this function is invoked to create the partitions, TopoManager is configured to view all the allocated nodes as one partition. The partition-based API is not yet initialized at this point and should not be used. The link-time parameter -custom-part is required to be passed to charmc for successful compilation.
To redirect the standard output of individual partitions to separate files, +stdout <path> is to be used. The <path> may contain the C format specifier %d, which will be replaced by the partition number. If %d is specified multiple times, only the first three instances from the left will be replaced by the partition number (other or additional format specifiers will result in undefined behavior). If no format specifier is given, the partition number will be appended as a suffix to the specified path. Example usage:
+stdout out/%d/log will write to out/0/log, out/1/log, out/2/log, ...
+stdout log will write to log.0, log.1, log.2, ...
+stdout out/%d/log%d will write to out/0/log0, out/1/log1, out/2/log2, ...
A new API was added to Converse to enable sending messages from one replica to another. Currently, the following functions are available for this purpose:
- CmiInterSyncSend(local_rank, partition, size, message)
- CmiInterSyncSendAndFree(local_rank, partition, size, message)
- CmiInterSyncNodeSend(local_node, partition, size, message)
- CmiInterSyncNodeSendAndFree(local_node, partition, size, message)
Users who have coded in Converse will find these functions very similar to the basic Converse send functions - CmiSyncSend and CmiSyncSendAndFree. Given the local rank of a PE and the partition it belongs to, these functions pass the message to the machine layer. CmiInterSyncSend does not return until "message" is ready for reuse. CmiInterSyncSendAndFree passes ownership of "message" to the Charm++ RTS, which will free the message when the send is complete. Each Converse message contains a message header, which makes these messages active - they carry information about their handlers. These handlers can be registered using the existing Charm++ API - CmiRegisterHandler. CmiInterSyncNodeSend and CmiInterSyncNodeSendAndFree are the counterparts to these functions that allow sending a message to a node (in SMP mode).
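Putting these pieces together, a minimal inter-partition exchange might look like the following sketch. It is illustrative rather than definitive: the handler-index variable, message layout, and broadcast pattern are assumptions, and only the Converse calls named in this chapter (plus standard Converse utilities such as CmiAlloc, CmiSetHandler, and CmiMsgHeaderSizeBytes) are used. The sketch requires the Charm++/Converse runtime and is not compilable standalone.

```c
#include "converse.h"
#include <string.h>

/* Illustrative sketch: send a short text message from the current PE to
 * the same local PE rank in every other partition. */
static int replyHandlerIdx;  /* hypothetical; set during Converse startup */

static void replyHandler(void *msg)
{
    char *payload = (char *)msg + CmiMsgHeaderSizeBytes;
    CmiPrintf("[%d/%d] got: %s\n", CmiMyPartition(), CmiMyPe(), payload);
    CmiFree(msg);
}

static void broadcastToReplicas(const char *text)
{
    int size = CmiMsgHeaderSizeBytes + (int)strlen(text) + 1;
    for (int part = 0; part < CmiNumPartitions(); part++) {
        if (part == CmiMyPartition()) continue;
        char *msg = (char *)CmiAlloc(size);
        CmiSetHandler(msg, replyHandlerIdx);
        strcpy(msg + CmiMsgHeaderSizeBytes, text);
        /* Ownership of msg passes to the RTS; it is freed after the send. */
        CmiInterSyncSendAndFree(CmiMyPe(), part, size, msg);
    }
}
```

During startup the handler would be registered with replyHandlerIdx = CmiRegisterHandler(replyHandler), after which the receiving PE's scheduler invokes replyHandler when the message arrives.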