10. Optimizing Entry Method Invocation

10.1 Messages

Although Charm++ supports automated parameter marshalling for entry methods, you can also manually handle the process of packing and unpacking parameters by using messages. A message encapsulates all the parameters sent to an entry method. Since the parameters are already encapsulated, sending messages is often more efficient than parameter marshalling, and can help to avoid unnecessary copying. Messages are also useful when the receiver is unable to process the contents of a message at the time that it receives it. For example, consider a tiled matrix multiplication program, wherein each chare receives an $A$-tile and a $B$-tile before computing a partial result for $C = A \times B$. If we were using parameter marshalled entry methods, a chare would have to copy the first tile it received, in order to save it for when it has both the tiles it needs. Then, upon receiving the second tile, the chare would use the second tile and the first (saved) tile to compute a partial result. However, using messages, we would just save a pointer to the message encapsulating the tile received first, instead of copying the tile data itself.
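The save-a-pointer pattern can be sketched in plain C++, outside the Charm++ runtime; TileMsg and MatMulChare below are illustrative stand-ins for a Charm++ message type and chare, not real Charm++ API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical message type standing in for a Charm++ tile message.
struct TileMsg {
    std::vector<double> tile;  // row-major tile data
};

// Stands in for a chare computing one C-tile of C = A x B.
class MatMulChare {
    TileMsg* saved = nullptr;  // pointer to the first tile's message; no copy made
public:
    std::vector<double> partialC;

    // Called once per incoming tile; arrival order of the A/B tiles doesn't matter.
    void receiveTile(TileMsg* msg, std::size_t n) {
        if (!saved) {           // first tile: just remember the message pointer
            saved = msg;
            return;
        }
        // second tile: compute the n x n partial product from both tiles
        partialC.assign(n * n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                for (std::size_t k = 0; k < n; ++k)
                    partialC[i*n + j] += saved->tile[i*n + k] * msg->tile[k*n + j];
        delete saved;           // the receiver owns its messages and must free them
        delete msg;
        saved = nullptr;
    }
};
```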

Managing the memory buffer associated with a message. As suggested in the example above, the biggest difference between marshalled parameters and messages is that an entry method invocation is assumed to keep the message that it is passed. That is, the Charm++ runtime system assumes that code in the body of the invoked entry method will explicitly manage the memory associated with the message that it is passed. Therefore, in order to avoid leaking memory, the body of an entry method must either delete the message that it receives, or save a pointer to it and delete it at a later point in the execution of the code. Moreover, in the Charm++ execution model, once you pass a message buffer to the runtime system (via an asynchronous entry method invocation), you should not reuse the buffer. That is, after you have passed a message buffer into an asynchronous entry method invocation, you shouldn't access its fields, or pass that same buffer into a second entry method invocation. Note that this rule doesn't preclude the single reuse of an input message: consider an entry method invocation $i_1$, which receives as input the message buffer $m_1$. Then, $m_1$ may be passed to an asynchronous entry method invocation $i_2$. However, once $i_2$ has been issued with $m_1$ as its input parameter, $m_1$ cannot be used in any further entry method invocations.

Several kinds of messages are available. Regular Charm++ messages are objects of fixed size. With some additional coding, messages can also contain pointers or variable-length arrays (arrays with sizes specified at runtime) whose pointers remain valid when the messages are sent across processors. Also available is a mechanism for assigning priorities to a message regardless of its type. A detailed discussion of priorities appears later in this section.

10.1.1 Message Types

Fixed-Size Messages. The simplest type of message is a fixed-size message. The size of each data member of such a message should be known at compile time. Therefore, such a message may encapsulate primitive data types, user-defined data types that don't maintain pointers to memory locations, and static arrays of the aforementioned types.

Variable-Size Messages. Very often, the size of the data contained in a message is not known until runtime. For such scenarios, you can use variable-size (varsize) messages. A varsize message can encapsulate several arrays, each of whose size is determined at run time. The space required for these encapsulated, variable-length arrays is allocated along with the message itself, so that the entire message comprises a contiguous buffer of memory.
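The contiguous layout can be sketched in plain C++; the Header type, ALIGN8 macro, and allocVarsize function below are illustrative assumptions mimicking what the runtime does internally, not Charm++ API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// ALIGN8 mimics the 8-byte alignment applied between sections of the buffer.
#define ALIGN8(x) (((x) + 7) & ~std::size_t(7))

// Fixed-size "header" part of a hypothetical varsize message: counts plus
// pointers that will point into the same allocation.
struct Header { int nInts; int nDoubles; int* ints; double* doubles; };

Header* allocVarsize(int nInts, int nDoubles) {
    std::size_t intStart    = ALIGN8(sizeof(Header));
    std::size_t doubleStart = ALIGN8(intStart + nInts * sizeof(int));
    std::size_t total       = doubleStart + nDoubles * sizeof(double);
    char* raw = static_cast<char*>(std::malloc(total));  // one contiguous buffer
    Header* h = new (raw) Header{nInts, nDoubles, nullptr, nullptr};
    h->ints    = reinterpret_cast<int*>(raw + intStart);       // arrays live
    h->doubles = reinterpret_cast<double*>(raw + doubleStart); // inside the buffer
    return h;
}
```

A single free of the base pointer releases the header and both arrays, which is exactly why varsize messages avoid per-field allocation.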

Packed Messages. A packed message is used to communicate non-linear data structures via messages. However, we defer a more detailed description of their use to § 10.1.3.

10.1.2 Using Messages In Your Program

There are five steps to incorporating a (fixed or varsize) message type in your Charm++ program: (1) declare the message type in the .ci file; (2) define the message type in the .h file; (3) allocate the message; (4) pass the message to an asynchronous entry method invocation; and (5) deallocate the message to free its associated memory resources.

Declaring Your Message Type. Like all other entities involved in asynchronous entry method invocation, messages must be declared in the .ci file. This allows the Charm++ translator to generate support code for messages. Message declaration is straightforward for fixed-size messages. Given a message of type MyFixedSizeMsg, simply include the following in the .ci file:

 message MyFixedSizeMsg;

For varsize messages, the .ci declaration must also include the names and types of the variable-length arrays that the message will encapsulate. The following example illustrates this requirement. In it, a message of type MyVarsizeMsg, which encapsulates three variable-length arrays of different types, is declared:

 message MyVarsizeMsg {
   int arr1[];
   double arr2[];
   MyPointerlessStruct arr3[];
 };

Defining Your Message Type. Once a message type has been declared to the Charm++ translator, its type definition must be provided. Your message type must inherit from a specific generated base class. If the type of your message is T, then class T must inherit from CMessage_T. This is true for both fixed and varsize messages. As an example, for our fixed size message type MyFixedSizeMsg above, we might write the following in the .h file:

class MyFixedSizeMsg : public CMessage_MyFixedSizeMsg {
  int var1;
  MyPointerlessStruct var2;
  double arr3[10];

  // Normal C++ methods, constructors, etc. go here
};

In particular, note the inclusion of the static array of doubles, arr3, whose size is known at compile time to be that of ten doubles. Similarly, for our example varsize message of type MyVarsizeMsg, we would write something like:

class MyVarsizeMsg : public CMessage_MyVarsizeMsg {
  // variable-length arrays
  int *arr1;
  double *arr2;
  MyPointerlessStruct *arr3;
  // members that are not variable-length arrays
  int x,y;
  double z;

  // Normal C++ methods, constructors, etc. go here
};

Note that the .h definition of the class type must contain data members whose names and types match those specified in the .ci declaration. In addition, if any of the data members are private or protected, the class should declare CMessage_MyVarsizeMsg to be a friend class. Finally, there are no limitations on the member methods of message classes, except that the message class may not redefine operators new or delete.


Creating a Message. With the .ci declaration and .h definition in place, messages can be allocated and used in the program. Messages are allocated using the C++ new operator:

 MessageType *msgptr =
  new [(int sz1, int sz2, ... , int priobits=0)] MessageType[(constructor arguments)];

The arguments enclosed within the square brackets are optional, and are used only when allocating messages with variable length arrays or prioritized messages. These arguments are not specified for fixed size messages. For instance, to allocate a message of our example message MyFixedSizeMsg, we write:

MyFixedSizeMsg *msg = new MyFixedSizeMsg(<constructor args>);

In order to allocate a varsize message, we must pass appropriate values to the arguments of the overloaded new operator presented previously. Arguments sz1, sz2, ... denote the size (in number of elements) of the memory blocks that need to be allocated and assigned to the pointers (variable-length arrays) that the message contains. The priobits argument denotes the size of a bitvector (number of bits) that will be used to store the message priority. So, if we wanted to create MyVarsizeMsg whose arr1, arr2 and arr3 arrays contain 10, 20 and 7 elements of their respective types, we would write:

MyVarsizeMsg *msg = new (10, 20, 7) MyVarsizeMsg(<constructor args>);

Further, to add a 32-bit priority bitvector to this message, we would write:

MyVarsizeMsg *msg = new (10, 20, 7, sizeof(uint32_t)*8) MyVarsizeMsg(<constructor args>);

Notice the last argument to the overloaded new operator, which specifies the number of bits used to store message priority. The section on prioritized execution (§ 10.3) describes how priorities can be employed in your program.

Another version of the overloaded new operator allows you to pass in an array containing the size of each variable-length array, rather than specifying individual sizes as separate arguments. For example, we could create a message of type MyVarsizeMsg in the following manner:

int sizes[3];
sizes[0] = 10;               // arr1 will have 10 elements
sizes[1] = 20;               // arr2 will have 20 elements
sizes[2] = 7;                // arr3 will have 7 elements
MyVarsizeMsg *msg = new(sizes, 0) MyVarsizeMsg(<constructor args>); // 0 priority bits

Sending a Message. Once we have a properly allocated message, we can set the various elements of the encapsulated arrays in the following manner:

  msg->arr1[13] = 1;
  msg->arr2[5] = 32.82;
  msg->arr3[2] = MyPointerlessStruct();
  // etc.

And pass it to an asynchronous entry method invocation, thereby sending it to the corresponding chare (here someProxy and entryName stand in for an actual proxy and entry method):

  someProxy[someIndex].entryName(msg);

When a message is sent, i.e. passed to an asynchronous entry method invocation, the programmer relinquishes control of it; the space allocated for the message is freed by the runtime system. However, when a message is received at an entry point, it is not freed by the runtime system. As mentioned at the start of this section, received messages may be reused or deleted by the programmer. Finally, messages are deleted using the standard C++ delete operator.

10.1.3 Message Packing

The Charm++ interface translator generates implementations of three static methods for the message class CMessage_mtype. These methods have the prototypes:

    static void* alloc(int msgnum, size_t size, int* array, int priobits);
    static void* pack(mtype*);
    static mtype* unpack(void*);

One may choose not to use the translator-generated methods and instead override these implementations with custom alloc, pack and unpack static methods of the mtype class. The alloc method will be called when the message is allocated using the C++ new operator; the programmer never needs to call it explicitly. Note that all elements of the message are allocated when the message is created with new, so there is no need to call new separately to allocate any of the fields of the message. This differs from a packed message, where each field requires individual allocation. The alloc method should actually allocate the message using CkAllocMsg, whose signature is given below:

void *CkAllocMsg(int msgnum, int size, int priobits); 

For varsize messages, the static methods alloc, pack, and unpack are generated by the interface translator. For example, for a varsize message class VarsizeMessage containing a variable-length int array firstArray and a variable-length double array secondArray, these methods would be similar to:

// allocate memory for varmessage so charm can keep track of memory
static void* alloc(int msgnum, size_t size, int* array, int priobits)
{
  int totalsize, first_start, second_start;
  // array is passed in when the message is allocated using new (see below).
  // size is the amount of space needed for the part of the message known
  // about at compile time.  Depending on their values, a segfault may
  // occur if memory addressing is not on an 8-byte boundary, so align
  // with the ALIGN8 macro
  first_start = ALIGN8(size);  // 8-byte align with this macro
  second_start = ALIGN8(first_start + array[0]*sizeof(int));
  totalsize = second_start + array[1]*sizeof(double);
  VarsizeMessage* newMsg =
    (VarsizeMessage*) CkAllocMsg(msgnum, totalsize, priobits);
  // make firstArray point to the end of newMsg in memory
  newMsg->firstArray = (int*) ((char*)newMsg + first_start);
  // make secondArray point past the end of firstArray in memory
  newMsg->secondArray = (double*) ((char*)newMsg + second_start);
  return (void*) newMsg;
}

// returns pointer to memory containing packed message
static void* pack(VarsizeMessage* in)
{
  // set firstArray to an offset from the start of in
  in->firstArray = (int*) ((char*)in->firstArray - (char*)in);
  // set secondArray to the corresponding offset
  in->secondArray = (double*) ((char*)in->secondArray - (char*)in);
  return in;
}

// returns new message from raw memory
static VarsizeMessage* unpack(void* inbuf)
{
  VarsizeMessage* me = (VarsizeMessage*)inbuf;
  // restore firstArray to an absolute address in memory
  me->firstArray = (int*) ((size_t)me->firstArray + (char*)me);
  // likewise for secondArray
  me->secondArray = (double*) ((size_t)me->secondArray + (char*)me);
  return me;
}
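The pointer-to-offset conversion used by pack and unpack can be demonstrated standalone; the Blob struct and the two helper functions below are hypothetical, not part of Charm++:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A flat struct whose pointer members point into its own trailing storage,
// mirroring the layout of a varsize message.
struct Blob {
    int* firstArray;
    double* secondArray;
    char payload[64];  // the arrays live in this trailing storage
};

void packPointers(Blob* b) {
    // replace each pointer by its byte offset from the start of the struct
    b->firstArray  = reinterpret_cast<int*>(
        reinterpret_cast<char*>(b->firstArray) - reinterpret_cast<char*>(b));
    b->secondArray = reinterpret_cast<double*>(
        reinterpret_cast<char*>(b->secondArray) - reinterpret_cast<char*>(b));
}

void unpackPointers(Blob* b) {
    // restore each pointer by adding the saved offset back to the (possibly
    // different) base address of the struct
    b->firstArray  = reinterpret_cast<int*>(
        reinterpret_cast<char*>(b) + reinterpret_cast<std::uintptr_t>(b->firstArray));
    b->secondArray = reinterpret_cast<double*>(
        reinterpret_cast<char*>(b) + reinterpret_cast<std::uintptr_t>(b->secondArray));
}
```

Because the offsets are relative to the struct's own base address, the struct can be copied byte-for-byte to another address space and unpacked there.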

The pointers in a varsize message can exist in two states. At creation, they are valid C++ pointers to the start of the arrays. After packing, they become offsets from the start of the message to the start of the pointed-to data. Unpacking restores them to pointers.

Custom Packed Messages

In many cases, a message must store a non-linear data structure using pointers. Examples of these are binary trees, hash tables etc. Thus, the message itself contains only a pointer to the actual data. When the message is sent to the same processor, these pointers point to the original locations, which are within the address space of the same processor. However, when such a message is sent to other processors, these pointers will point to invalid locations.

Thus, the programmer needs a way to ``serialize'' these messages only if the message crosses the address-space boundary. Charm++ provides a way to do this serialization by allowing the developer to override the default serialization methods generated by the Charm++ interface translator. Note that this low-level serialization has nothing to do with parameter marshalling or the PUP framework described later.

Packed messages are declared in the .ci file the same way as ordinary messages:

message PMessage;

Like all messages, the class PMessage needs to inherit from CMessage_PMessage and should provide two static methods: pack and unpack. These methods are called by the Charm++ runtime system when the message is determined to be crossing an address-space boundary. The prototypes for these methods are as follows:

static void *PMessage::pack(PMessage *in);

static PMessage *PMessage::unpack(void *in);

Typically, the pack method allocates a flat buffer (using CkAllocBuffer), serializes the data structure into it, deletes the input message, and returns the buffer.

On the receiving processor, the unpack method is called. Typically, it allocates a message (again using CkAllocBuffer), reconstructs the data structure from the flat buffer, and returns the resulting message.

Here is an example of a packed-message implementation:

// File:

mainmodule PackExample {
  message PackedMessage;
  // ... chare and entry method declarations ...
};

// File: pgm.h

class PackedMessage : public CMessage_PackedMessage {
 public:
  BinaryTree<char> btree; // A non-linear data structure
  static void* pack(PackedMessage*);
  static PackedMessage* unpack(void*);

  // Normal C++ methods, constructors, etc. go here
};

// File: pgm.C


void* PackedMessage::pack(PackedMessage* inmsg)
{
  int treesize = inmsg->btree.getFlattenedSize();
  int totalsize = treesize + sizeof(int);
  char *buf = (char*)CkAllocBuffer(inmsg, totalsize);
  // buf is now just raw memory to store the data structure
  int num_nodes = inmsg->btree.getNumNodes();
  memcpy(buf, &num_nodes, sizeof(int));  // copy num_nodes into the buffer
  buf = buf + sizeof(int);               // don't overwrite num_nodes
  // copy the tree into the buffer, giving the size of the buffer minus the header
  inmsg->btree.Flatten((void*)buf, treesize);
  buf = buf - sizeof(int);              // don't lose num_nodes
  delete inmsg;
  return (void*) buf;
}


PackedMessage* PackedMessage::unpack(void* inbuf)
{
  // inbuf is the raw memory allocated and assigned in pack
  char* buf = (char*) inbuf;
  int num_nodes;
  memcpy(&num_nodes, buf, sizeof(int));
  buf = buf + sizeof(int);
  // allocate the message through the Charm RTS
  PackedMessage* pmsg =
    (PackedMessage*)CkAllocBuffer(inbuf, sizeof(PackedMessage));
  // call the in-place (placement) constructor of PackedMessage, which uses
  // the memory allocated by CkAllocBuffer and rebuilds the btree from the
  // raw buffer buf and the number of nodes
  pmsg = new ((void*)pmsg) PackedMessage(buf, num_nodes);
  return pmsg;
}

PackedMessage* pm = new PackedMessage();  // just like always 


While serializing an arbitrary data structure into a flat buffer, one must be very wary of any possible alignment problems. Thus, if possible, the buffer itself should be declared to be a flat struct. This will allow the C++ compiler to ensure proper alignment of all its member fields.

10.2 Entry Method Attributes

Charm++ provides a handful of special attributes that entry methods may have. In order to give a particular entry method an attribute, you must specify the keyword for the desired attribute in the attribute list of that entry method's .ci file declaration. The syntax for this is as follows:

entry [attribute1, ..., attributeN] void EntryMethod(parameters);

Charm++ currently offers the following attributes that one may assign to an entry method: threaded, sync, exclusive, nokeep, notrace, appwork, immediate, expedited, inline, local, python, reductiontarget, aggregate.

threaded entry methods run in their own non-preemptible threads. These entry methods may perform blocking operations, such as calls to a sync entry method, or may explicitly suspend themselves. For more details, refer to section 12.1.

sync entry methods are special in that calls to them are blocking: they do not return control to the caller until the method finishes execution completely. Sync methods may have return values; however, they may only return messages or data types that have the PUP method implemented. Callers must run in a thread separate from the runtime scheduler, e.g. a threaded entry method. Calls expecting a return value will receive it as the return value of the proxy invocation:

 ReturnMsg* m;
 m = A[i].foo(a, b, c);

For more details, refer to section 12.2.

exclusive entry methods should only exist on NodeGroup objects. One such entry method will not execute while some other exclusive entry method belonging to the same NodeGroup object is executing on the same node. In other words, if one exclusive method of a NodeGroup object is executing on node N, and another one is scheduled to run on the same node, the second exclusive method will wait to execute until the first one finishes. An example can be found in tests/charm++/pingpong.

nokeep entry methods take only a message as their lone argument, and the memory buffer for this message is managed by the Charm++ runtime system rather than by the user. This means that the user has to guarantee that the message will not be buffered for later usage or freed in the user code. Additionally, users are not allowed to modify the contents of a nokeep message, since for a broadcast the same message can be reused for all entry method invocations on each PE. If a user frees the message or modifies its contents, a runtime error may result. An example can be found in examples/charm++/histogram_group.

notrace entry methods will not be traced during execution. As a result, they will not be considered or displayed in Projections for performance analysis. Additionally, immediate entry methods are notrace by default and will not be traced during execution.

appwork entry methods will be marked as executing application work. This attribute is used for performance analysis.

immediate entry methods are executed in an ``immediate'' fashion: they skip the message scheduling that normal entry methods undergo. Immediate entry methods can only be associated with NodeGroup objects; otherwise, a compilation error will result. If the destination of such an entry method is on the local node, the method will be executed in the context of the regular PE, regardless of the execution mode of the Charm++ runtime. However, in SMP mode, if the destination of the method is on a remote node, the method will be executed in the context of the communication thread. For that reason, immediate entry methods should be reserved for code that is performance-critical and short-running, because long-running entry methods can delay communication by occupying the communication thread with entry method execution rather than remote communication.

Such entry methods can be useful for implementing multicasts/reductions as well as data lookup when such operations are on the performance-critical path. On a given Charm++ PE, skipping the normal message scheduling prevents the execution of immediate entry methods from being delayed by entry functions that could take a long time to finish. Immediate entry methods are implicitly ``exclusive'' on each node, meaning that the execution of one immediate message will not be interrupted by another. The function CmiProbeImmediateMsg() can be called in user code to probe for and process immediate messages periodically. Also note that immediate entry methods are notrace by default and are not traced during execution. An example of an immediate entry method can be found in examples/charm++/immediateEntryMethod.

expedited entry methods skip the priority-based message queue in the Charm++ runtime. This is useful for messages that require prompt processing when adding the immediate attribute does not apply. Compared with the immediate attribute, the expedited attribute provides a more general solution that works for all types of Charm++ objects, i.e. Chare, Group, NodeGroup and Chare Array. However, expedited entry methods will still be scheduled in the lower-level Converse message queue and processed in order of message arrival. Therefore, they may still suffer from delays caused by long-running entry methods. An example can be found in examples/charm++/satisfiability.

inline entry methods will be called as a normal C++ member function if the message recipient happens to be on the same PE. The call to the function will happen inline, and control will return to the calling function after the inline method completes. Because of this, these entry methods need to be re-entrant, as they could be called multiple times recursively. Parameters to the inline method will be passed by reference to avoid any copying, packing, or unpacking of the parameters. This makes inline calls effective when large amounts of data are being passed, and copying or packing the data would be an expensive operation. Perfect forwarding has been implemented to allow for seamless passing of both lvalue and rvalue references. Note that calls with rvalue references must take place in the same translation unit as the .decl.h file to allow for the appropriate template instantiations. Alternatively, the method can be made templated and referenced from multiple translation units via CK_TEMPLATES_ONLY. An explicit instantiation of all lvalue references is provided for compatibility with existing code. If the recipient resides on a different PE, a regular message is sent with the message arguments packed up using PUP, and inline has no effect. An example ``inlineem'' can be found in tests/charm++/megatest.

local entry methods are equivalent to normal function calls: the entry method is always executed immediately. This feature is available only for Group objects and Chare Array objects. The user has to guarantee that the recipient chare element resides on the same PE; otherwise, the application will abort with a failure. If the local entry method uses parameter marshalling, instead of marshalling input parameters into a message, it will pass them directly to the callee. This implies that the callee can modify the caller's data if method parameters are passed by pointer or reference. The above description of perfect forwarding for inline entry methods also applies to local entry methods. Furthermore, input parameters are not required to be PUPable. Since these entry methods always execute immediately, they are allowed to have a non-void return value. An example can be found in examples/charm++/hello/local.

python entry methods are enabled to be called from Python scripts as explained in chapter 26. Note that the object owning the method must also be declared with the keyword python. Refer to chapter 26 for more details.

reductiontarget entry methods can be used as the target of reductions while taking arguments of the same type as specified in the contribute call, rather than a message of type CkReductionMsg. See section 4.6 for more information.

aggregate entry methods have the data sent to them aggregated into larger messages before being sent, to reduce fine-grained overhead. The aggregation is handled by the Topological Routing and Aggregation Module (TRAM). The argument to such an entry method must be a single fixed-size object. More details on TRAM are given in the TRAM section of the libraries manual.

10.3 Controlling Delivery Order

By default, Charm++ processes messages in roughly FIFO order as they arrive at a PE. For most programs, this behavior is fine. However, for optimal performance, some programs need more explicit control over the order in which messages are processed. Charm++ allows you to adjust delivery order on a per-message basis. An example program demonstrating how to modify delivery order for messages and parameter marshaling can be found in examples/charm++/prio.

Queueing Strategies

The order in which messages are processed in the recipient's queue can be set by explicitly choosing a queueing strategy using one of the following constants. These constants can be applied when sending a message or when invoking an entry method using parameter marshaling:

  CK_QUEUEING_FIFO: first-in, first-out ordering
  CK_QUEUEING_LIFO: last-in, first-out ordering
  CK_QUEUEING_IFIFO: FIFO ordering among integer priorities
  CK_QUEUEING_ILIFO: LIFO ordering among integer priorities
  CK_QUEUEING_BFIFO: FIFO ordering among bitvector priorities
  CK_QUEUEING_BLIFO: LIFO ordering among bitvector priorities
  CK_QUEUEING_LFIFO: FIFO ordering among long integer priorities
  CK_QUEUEING_LLIFO: LIFO ordering among long integer priorities

Parameter Marshaling

For parameter marshaling, the queueing type can be set in a CkEntryOptions object, which is passed to the entry method invocation as an optional final parameter.

  CkEntryOptions opts1, opts2;
  opts1.setQueueing(CK_QUEUEING_FIFO);
  opts2.setQueueing(CK_QUEUEING_LIFO);

  chare.entry_name(arg1, arg2, opts1);
  chare.entry_name(arg1, arg2, opts2);

When the message with opts1 arrives at its destination, it will be pushed onto the end of the message queue as usual. However, when the message with opts2 arrives, it will be pushed onto the front of the message queue.

Messages

For messages, the CkSetQueueing function can be used to change the order in which messages are processed, where queueingtype is one of the constants listed above:

void CkSetQueueing(MsgType message, int queueingtype)

The first two options, CK_QUEUEING_FIFO and CK_QUEUEING_LIFO, are used as follows:

  MsgType *msg1 = new MsgType;
  CkSetQueueing(msg1, CK_QUEUEING_FIFO);

  MsgType *msg2 = new MsgType;
  CkSetQueueing(msg2, CK_QUEUEING_LIFO);

Similar to the parameter marshalled case described above, msg1 will be pushed onto the end of the message queue, while msg2 will be pushed onto the front of the message queue.

Prioritized Execution

The basic FIFO and LIFO strategies are sufficient to approximate parallel breadth-first and depth-first explorations of a problem space, but they do not allow more fine-grained control. To provide that degree of control, Charm++ also allows explicit prioritization of messages.

The other six queueing strategies involve the use of priorities. There are two kinds of priorities which can be attached to a message: integer priorities and bitvector priorities. These correspond to the I and B queueing strategies, respectively. In both cases, numerically lower priorities will be dequeued and delivered before numerically greater priorities. The FIFO and LIFO queueing strategies then control the relative order in which messages of the same priority will be delivered.
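The delivery semantics of integer priorities can be approximated with a standard min-heap; this plain C++ sketch is illustrative only and ignores the FIFO/LIFO tie-breaking a real scheduler applies among equal priorities:

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// A stand-in for a prioritized message: an integer priority plus a tag.
struct Msg { int prio; std::string tag; };

// Comparator making std::priority_queue a min-heap on prio, so numerically
// smaller priorities are dequeued first, as with CK_QUEUEING_IFIFO.
struct ByPrio {
    bool operator()(const Msg& a, const Msg& b) const { return a.prio > b.prio; }
};

std::vector<std::string> deliveryOrder(std::vector<Msg> msgs) {
    std::priority_queue<Msg, std::vector<Msg>, ByPrio> q;
    for (const Msg& m : msgs) q.push(m);
    std::vector<std::string> order;
    while (!q.empty()) { order.push_back(q.top().tag); q.pop(); }
    return order;
}
```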

To attach a priority field to a message, one needs to set aside space in the message's buffer while allocating the message. To achieve this, the size of the priority field in bits should be specified as a placement argument to the new operator, as described in section 10.1.2. Although the size of the priority field is specified in bits, it is always padded to an integral number of ints. A pointer to the priority part of the message buffer can be obtained with this call:

void *CkPriorityPtr(MsgType msg)
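The padding rule can be sketched as a small helper; this mirrors, but is not, the runtime's internal conversion, and assumes 32-bit ints:

```cpp
#include <cassert>

// Number of 32-bit ints needed to hold a priority field of priobits bits,
// rounding up to whole words (assumes a 32-bit int).
int prioBitsToInts(int priobits) {
    return (priobits + 31) / 32;
}
```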

Integer priorities are quite straightforward. One allocates a message with an extra integer parameter to ``new'' (see the first line of the example below), which sets aside enough space (in bits) in the message to hold the priority. One then stores the priority in the message. Finally, one informs the system that the message contains an integer priority using CkSetQueueing:

  MsgType *msg = new (8*sizeof(int)) MsgType;
  *(int*)CkPriorityPtr(msg) = prio;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);

Bitvector Prioritization

Bitvector priorities are arbitrary-length bit-strings representing fixed-point numbers in the range 0 to 1. For example, the bit-string ``001001'' represents the binary fraction 0.001001. As with integer priorities, higher numbers represent lower priorities. However, bitvectors can be of arbitrary length, and hence the priority numbers they represent can be of arbitrary precision.
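The fixed-point interpretation can be sketched as follows; bitvectorValue is an illustrative helper, not a Charm++ function:

```cpp
#include <cassert>
#include <string>

// Interpret a bit-string as a binary fraction in [0, 1):
// bit i (0-based) contributes 2^-(i+1) when set.
// e.g. "001001" -> 0.001001 (binary) = 1/8 + 1/64 = 0.140625
double bitvectorValue(const std::string& bits) {
    double value = 0.0, weight = 0.5;
    for (char c : bits) {
        if (c == '1') value += weight;
        weight *= 0.5;
    }
    return value;
}
```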

Arbitrary-precision priorities are often useful in AI search-tree applications. Suppose we have a heuristic suggesting that tree node $N_1$ should be searched before tree node $N_2$. We therefore designate that node $N_1$ and its descendants will use high priorities, and that node $N_2$ and its descendants will use lower priorities. We have effectively split the range of possible priorities in two. If several such heuristics fire in sequence, we can easily split the priority range in two enough times that no significant bits remain, and the search begins to fail for lack of meaningful priorities to assign. The solution is to use arbitrary-precision priorities, i.e. bitvector priorities.

To assign a bitvector priority, two methods are available. The first is to obtain a pointer to the priority field using CkPriorityPtr, and then manually set the bits using the bit-setting operations inherent to C. To achieve this, one must know the format of the bitvector, which is as follows: the bitvector is represented as an array of unsigned integers. The most significant bit of the first integer contains the first bit of the bitvector. The remaining bits of the first integer contain the next 31 bits of the bitvector. Subsequent integers contain 32 bits each. If the size of the bitvector is not a multiple of 32, then the last integer contains 0 bits for padding in the least-significant bits of the integer.
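The layout just described can be sketched with a small helper, assuming 32-bit unsigned ints as the description does; setPrioBit is illustrative, not Charm++ API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Set bit k of a bitvector priority stored MSB-first in 32-bit words:
// bit k lives in word k/32, at bit position 31 - k%32.
void setPrioBit(std::vector<std::uint32_t>& words, int k) {
    words[k / 32] |= std::uint32_t(1) << (31 - k % 32);
}
```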

The second way to assign priorities is only useful for those who are using the priority range-splitting described above. The root of the tree is assigned the null priority-string. Each child is assigned its parent's priority with some number of bits concatenated. The net effect is that the entire priority of a branch is within a small epsilon of the priority of its root.
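The range-splitting scheme can be sketched with character bit-strings; this illustrative helper stands in for manipulating the packed bitvector directly:

```cpp
#include <cassert>
#include <string>

// A child's bitvector priority is its parent's priority with extra bits
// appended.  For strings where neither is a prefix of the other, lexicographic
// comparison of the bit-strings matches the numeric order of the fractions
// they encode, so every descendant of a higher-priority branch still sorts
// ahead of a lower-priority branch.
std::string childPriority(const std::string& parent, const std::string& extraBits) {
    return parent + extraBits;
}
```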

It is possible to utilize unprioritized messages, integer priorities, and bitvector priorities in the same program; the runtime imposes a consistent relative order among all of them.

Additionally, long integer priorities can be specified by the L strategy.

A final reminder about prioritized execution: Charm++ processes messages in roughly the order you specify; it never guarantees that it will deliver the messages in precisely the order you specify. Thus, the correctness of your program should never depend on the order in which the runtime delivers messages. However, the runtime makes a serious attempt to be ``close'', so priorities can strongly affect the efficiency of your program.

Skipping the Queue

Some operations that one might want to perform are sufficiently latency-sensitive that they should never wait in line behind other messages. The Charm++ runtime offers two attributes for entry methods, expedited and immediate, to serve these needs. For more information on these attributes, see section 10.2 and the example in tests/charm++/megatest.

10.4 Zero Copy Messaging API

Apart from using messages, Charm++ also provides APIs to avoid sender- and receiver-side copies. On RDMA-enabled networks such as GNI, Verbs, PAMI and OFI, these APIs internally use one-sided Remote Direct Memory Access (RDMA) operations. For large arrays (a few hundred KB or more), the cost of copying while marshalling the message can be quite high. Using these APIs not only saves the expensive copy operation but also reduces the application's memory footprint by avoiding data duplication. Saving these costs for large arrays proves to be a significant optimization, yielding faster message send and receive times in addition to an overall improvement in performance because of lower memory consumption. On the other hand, using these APIs for small arrays can lead to a drop in performance due to the overhead associated with one-sided communication, mainly the additional small messages required for the metadata message and the acknowledgment sent on completion. For within-process data transfers, this API uses a regular memcpy to achieve zero copy semantics. Similarly, on CMA-enabled machines, in a few cases, this API takes advantage of CMA to perform inter-process, intra-physical-host data transfers. This API is also functional on non-RDMA-enabled networks like regular Ethernet, except that it does not avoid copies and behaves like a regular Charm++ entry method invocation. There are two APIs that provide zero copy semantics in Charm++: the Zero Copy Entry Method API and the Zero Copy Direct API.

10.4.1 Zero Copy Direct API

The Zero copy Direct API allows users to explicitly invoke a standard set of methods on predefined buffer information objects to avoid copies. Unlike the entry method API, which performs the zero copy operation on every zero copy entry method invocation, the direct API provides a more flexible interface: the user can exploit the persistent nature of iterative applications to perform zero copy operations using the same buffer information objects across iteration boundaries. It is also more beneficial than the Zero copy Entry Method Send API because, whereas the entry method API avoids only the sender side copy, the Direct API avoids both sender and receiver side copies.

To send an array using the zero copy Direct API, define a CkNcpyBuffer object on the sender chare specifying the pointer, size, a CkCallback object and an optional mode parameter.

CkCallback srcCb(CkIndex_Ping1::sourceDone, thisProxy[thisIndex]);
// CkNcpyBuffer object representing the source buffer

CkNcpyBuffer source(arr1, arr1Size * sizeof(int), srcCb, CK_BUFFER_REG);

When used inside a CkNcpyBuffer object that represents the source buffer information, the callback is invoked to signal that the get or put call has completed and that the buffer is safe to reuse. In those cases where the application can determine the safety of reusing the buffer through other synchronization mechanisms, the callback is not needed, and CkCallback::ignore can be passed as the callback parameter. The optional mode parameter determines the network registration mode for the buffer. It is only relevant on networks requiring explicit memory registration for performing RDMA operations; these networks include GNI, OFI, and Verbs. When the mode is not specified by the user, the default mode CK_BUFFER_REG is used.

Similarly, to receive an array using the Zero copy Direct API, define another CkNcpyBuffer object on the receiver chare specifying the pointer, the size, a CkCallback object, and an optional mode parameter. When used inside a CkNcpyBuffer object that represents the destination buffer, the callback is invoked to signal the completion of the data transfer into the buffer. In those cases where the application can determine data transfer completion through other synchronization mechanisms, the callback is not needed, and CkCallback::ignore can be passed as the callback parameter.

CkCallback destCb(CkIndex_Ping1::destinationDone, thisProxy[thisIndex]);
// CkNcpyBuffer object representing the destination buffer

CkNcpyBuffer dest(arr2, arr2Size * sizeof(int), destCb, CK_BUFFER_REG);

Once the source CkNcpyBuffer and destination CkNcpyBuffer objects have been defined on the sender and receiver chares, to perform a get operation, send the source CkNcpyBuffer object to the receiver chare. This can be done using a regular entry method invocation as shown in the following code snippet, where the sender, arrProxy[0] sends its source object to the receiver chare, arrProxy[1].

// On Index 0 of arrProxy chare array
// Send the source object to the receiver
// (the entry method name recvNcpySrcObj is illustrative)
arrProxy[1].recvNcpySrcObj(source);

After receiving the sender's source CkNcpyBuffer object, the receiver can perform a get operation on its destination CkNcpyBuffer object by passing the source object as an argument to the runtime defined get method as shown in the following code snippet.

// On Index 1 of arrProxy chare array
// Call get on the local destination object passing the source object
// (src is the source CkNcpyBuffer received from Index 0)
dest.get(src);

This call performs a get operation, reading the remote source buffer into the local destination buffer. Similarly, a receiver's destination CkNcpyBuffer object can be sent to the sender for the sender to perform a put on its source object by passing the source CkNcpyBuffer object as an argument to the runtime defined put method as shown in in the code snippet.

// On Index 1 of arrProxy chare array
// Send the destination object to the sender
// (the entry method name recvNcpyDestObj is illustrative)
arrProxy[0].recvNcpyDestObj(dest);

// On Index 0 of arrProxy chare array
// Call put on the local source object passing the destination object
// (dest is the destination CkNcpyBuffer received from Index 1)
source.put(dest);


After the completion of either a get or a put, the callbacks specified in both the objects are invoked. Within the CkNcpyBuffer source callback, sourceDone(), the buffer can be safely modified or freed as shown in the following code snippet.

// Invoked by the runtime on source (Index 0)

void sourceDone() {
    // update the buffer to the next pointer
}

Similarly, inside the CkNcpyBuffer destination callback, destinationDone(), the user is guaranteed that the data transfer is complete into the destination buffer and the user can begin operating on the newly available data as shown in the following code snippet.

// Invoked by the runtime on destination (Index 1)

void destinationDone() {
    // received data, begin computing
}

The callback methods can also take a pointer to a CkDataMsg message. This message can be used to access the original buffer information object, i.e., the CkNcpyBuffer object used for the zero copy transfer. The buffer information object available in the callback allows access to all of its information, including the buffer pointer and the arbitrary reference pointer set using the method setRef. It is important to note that only the source CkNcpyBuffer object is accessible using the CkDataMsg in the source callback; similarly, only the destination CkNcpyBuffer object is accessible using the CkDataMsg in the destination callback. The following code snippet illustrates accessing the original buffer pointer in the callback method by casting the data field of the CkDataMsg object into a CkNcpyBuffer object.

// Invoked by the runtime on source (Index 0)

void sourceDone(CkDataMsg *msg) {
    // Cast msg->data to a CkNcpyBuffer to get the source buffer information object
    CkNcpyBuffer *source = (CkNcpyBuffer *)(msg->data);

    // access the buffer pointer and free it
    free((void *)(source->ptr));
}

The following code snippet illustrates the usage of the setRef method.

const void *refPtr = &index;

CkNcpyBuffer source(arr1, arr1Size * sizeof(int), srcCb, CK_BUFFER_REG);
source.setRef(refPtr);


Similar to the buffer pointer, the user-set arbitrary reference pointer can also be accessed in the callback method. This is shown in the next code snippet.

// Invoked by the runtime on source (Index 0)

void sourceDone(CkDataMsg *msg) {
    // Cast msg->data to a CkNcpyBuffer
    CkNcpyBuffer *src = (CkNcpyBuffer *)(msg->data);

    // access the buffer pointer and free it
    free((void *)(src->ptr));

    // get the reference pointer
    const void *refPtr = src->ref;
}

The usage of CkDataMsg and setRef to access the original buffer pointer and the arbitrary reference pointer is illustrated in examples/charm++/zerocopy/direct_api/unreg/simple_get.

Both the source and destination buffers are of the same type i.e. CkNcpyBuffer. What distinguishes a source buffer from a destination buffer is the way the get or put call is made. A valid get call using two CkNcpyBuffer objects obj1 and obj2 is performed as obj1.get(obj2), where obj1 is the local destination buffer object and obj2 is the remote source buffer object that was passed to this PE. Similarly, a valid put call using two CkNcpyBuffer objects obj1 and obj2 is performed as obj1.put(obj2), where obj1 is the local source buffer object and obj2 is the remote destination buffer object that was passed to this PE.

In addition to the callbacks, the return values of get and put also indicate the completion of data transfer between the buffers. When the source and destination buffers are within the same process or on different processes within the same CMA-enabled physical node, the zerocopy data transfer happens immediately without an asynchronous RDMA call. In such cases, both get and put return an enum value of CkNcpyStatus::complete, which indicates the completion of the zerocopy data transfer. On the other hand, in the case of an asynchronous RDMA call, the data transfer is not immediate, and the return value of the get and put methods is CkNcpyStatus::incomplete. This indicates that the data transfer is in progress and not necessarily complete. Use of CkNcpyStatus in an application is illustrated in examples/charm++/zerocopy/direct_api/reg/simple_get.
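This return-value check can be sketched as follows (a fragment, not a complete program; dest and src are CkNcpyBuffer objects as defined earlier, and the handleDataReady() helper is hypothetical):

```cpp
// Sketch: distinguishing immediate and asynchronous completion of a get
CkNcpyStatus status = dest.get(src);
if (status == CkNcpyStatus::complete) {
  // Same-process or CMA transfer: the data is already in the
  // destination buffer, so it can be consumed right away.
  handleDataReady();
} else {
  // CkNcpyStatus::incomplete: an RDMA transfer is in flight;
  // wait for the destination callback before touching the buffer.
}
```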

Since callbacks in Charm++ can carry a reference number, the callbacks passed into CkNcpyBuffer can be assigned one using the method cb.setRefNum(num). Upon callback invocation, these reference numbers can be used to identify the buffers that were passed into the CkNcpyBuffer objects. The reference number of the callback can be accessed using the CkDataMsg argument of the callback function: for a callback receiving CkDataMsg *msg, the reference number is obtained by calling CkGetRefNum(msg). This is illustrated in examples/charm++/zerocopy/direct_api/unreg/simple_get, and is specifically useful where there is an indexed collection of buffers, in which case the reference number can be used to index the collection.
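This pattern can be sketched as follows (a fragment; the buffers array, bufSize, and numBuffers are hypothetical application state):

```cpp
// Sketch: tagging each callback with the index of its buffer
for (int i = 0; i < numBuffers; i++) {
  CkCallback cb(CkIndex_Ping1::sourceDone(NULL), thisProxy[thisIndex]);
  cb.setRefNum(i);  // tag the callback with the buffer's index
  CkNcpyBuffer src(buffers[i], bufSize, cb, CK_BUFFER_REG);
  // ... send src to the receiver as shown earlier ...
}

// In the callback, recover the index from the CkDataMsg:
void sourceDone(CkDataMsg *msg) {
  int i = CkGetRefNum(msg);  // identifies which buffer completed
  // buffers[i] is now safe to reuse
}
```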

Note that CkNcpyBuffer objects can be either statically declared or dynamically allocated. Additionally, the objects are reusable across iteration boundaries, i.e., after a CkNcpyBuffer object has been sent, the remote PE can use the same object to perform get or put operations. This pattern of using the same objects across iterations is demonstrated in examples/charm++/zerocopy/direct_api/reg/pingpong.
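The reuse pattern can be sketched as follows (a fragment; the entry method name recvNcpySrcObj, the member variable srcSaved, and the nextIteration() method are illustrative):

```cpp
// Sketch: reusing the same CkNcpyBuffer objects across iterations.
// The source object is received once and saved as a member variable.
void recvNcpySrcObj(CkNcpyBuffer src) {
  srcSaved = src;
}

// Each iteration reuses the saved source and local destination objects
// to pull fresh data, with no new buffer information objects created.
void nextIteration() {
  dest.get(srcSaved);
}
```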

This API is demonstrated in examples/charm++/zerocopy/direct_api.

Memory Registration and Modes of Operation

There are four modes of operation for the Zero Copy Direct API. These modes act as control switches on networks that require memory registration, such as GNI, OFI, and Verbs, in order to perform RDMA operations. They dictate the functioning of the API, providing flexible options based on user requirements. On other networks, where network memory management is not necessary (Netlrts) or is handled internally by the lower-level networking API (PAMI, MPI), these switches are still supported to maintain API consistency, and all behave in the same default mode of operation.

CK_BUFFER_REG

: CK_BUFFER_REG is the default mode, used when no mode is passed. This mode doesn't distinguish between non-network and network data transfers. When this mode is passed, the buffer is registered immediately; it can be used for both non-network sends (memcpy) and network sends without requiring an extra message to be sent by the runtime system in the latter case. This mode is demonstrated in examples/charm++/zerocopy/direct_api/reg.

CK_BUFFER_UNREG

: When this mode is passed, the buffer is initially unregistered, and it is registered only for network transfers where registration is absolutely required. For example, if the target buffer is on the same PE or the same logical node (or process), registration is avoided because the get internally performs a memcpy. On the other hand, if the target buffer resides on a remote PE on a different logical node, the get is implemented through an RDMA call requiring registration; in that case, the RTS sends a small message to register the buffer and perform the RDMA operation. This mode is demonstrated in examples/charm++/zerocopy/direct_api/unreg.

CK_BUFFER_PREREG

: This mode is only beneficial on machine layers that use pre-registered memory pools. In Charm++, the GNI and Verbs machine layers use pre-registered memory pools to avoid registration costs. On other machine layers, this mode is supported but behaves like CK_BUFFER_REG. A custom allocator, CkRdmaAlloc, can be used to allocate a buffer from a pool of pre-registered memory to avoid the expensive malloc and memory registration costs. For a buffer allocated through CkRdmaAlloc, the mode CK_BUFFER_PREREG should be passed to indicate the use of a mempooled buffer to the RTS. A buffer allocated with CkRdmaAlloc can be deallocated by calling the custom deallocator, CkRdmaFree. Although CkRdmaAlloc and CkRdmaFree are beneficial only on GNI and Verbs, they are functional on other networks, where they allocate regular memory similar to a malloc call. Importantly, with the CK_BUFFER_PREREG mode, the allocated buffer's base pointer should be used without any offset; using a buffer pointer with an offset will cause a segmentation fault. This mode is demonstrated in examples/charm++/zerocopy/direct_api/prereg.

CK_BUFFER_NOREG

: This mode is used for just storing pointer information without any actual networking or registration information. It cannot be used for performing any zerocopy operations. This mode was added because it was useful for implementing a runtime system feature.

Memory De-registration

Similar to memory registration, there is a method available to de-register memory after the completion of the operations. This allows other buffers to use the registered memory, as machines/networks are limited by the maximum amount of registered or pinned memory. Registered memory can be de-registered by calling the deregisterMem() method on the CkNcpyBuffer object.

Other Methods

In addition to deregisterMem(), there are other methods in a CkNcpyBuffer object that offer other utilities. The init(const void *ptr, size_t cnt, CkCallback &cb, unsigned short int mode=CK_BUFFER_UNREG) method can be used to re-initialize the CkNcpyBuffer object to new values similar to the ones that were passed in the constructor. For example, after using a CkNcpyBuffer object called srcInfo, the user can re-initialize the same object with other values. This is shown in the following code snippet.

// initialize src with new values

src.init(ptr, 200, newCb, CK_BUFFER_REG);

Additionally, the user can use another method, registerMem, to register a buffer that has been de-registered. Note that it is not required to call registerMem after a new initialization using init, as registerMem is called internally on every new initialization. The usage of registerMem is illustrated in the following code snippet. Also note that following de-registration, if the buffer is to be registered again, registerMem must be called even in the CK_BUFFER_PREREG mode when the buffer is allocated from a pre-registered mempool; this is required to set the registration memory handles and will not incur any registration costs.

// register previously de-registered buffer
src.registerMem();

10.4.2 Zero Copy Entry Method Send API

The Zero copy Entry Method Send API allows the user to avoid the sender side copy, without avoiding the receiver side copy. It relieves the user of the responsibility of making additional calls to support zero copy semantics. It extends the capability of existing entry methods with slight modifications in order to send large buffers without a copy.

To send an array using the zero copy message send API, specify the array parameter in the .ci file with the nocopy specifier.

entry void foo (int size, nocopy int arr[size]);

While calling the entry method from the .C file, wrap the array, i.e., the pointer, in a CkSendBuffer wrapper.

arrayProxy[0].foo(500000, CkSendBuffer(arrPtr));

Until the RDMA operation is completed, it is not safe to modify the buffer. To be notified on completion of the RDMA operation, pass an optional callback object in the CkSendBuffer wrapper associated with the specific nocopy array.

CkCallback cb(CkIndex_Foo::zerocopySent(NULL), thisProxy[thisIndex]);

arrayProxy[0].foo(500000, CkSendBuffer(arrPtr, cb));

The callback will be invoked on completion of the RDMA operation associated with the corresponding array. Inside the callback, it is safe to overwrite the buffer sent via the zero copy entry method send API and this buffer can be accessed by dereferencing the CkDataMsg received in the callback.

//called when RDMA operation is completed

void zerocopySent(CkDataMsg *m)
{
  // get access to the pointer and free the allocated buffer
  void *ptr = *((void **)(m->data));
  free(ptr);
  delete m;
}

The RDMA call is associated with a nocopy array rather than the entry method. In the case of sending multiple nocopy arrays, each RDMA call is independent of the other. Hence, the callback applies to only the array it is attached to and not to all the nocopy arrays passed in an entry method invocation. On completion of the RDMA call for each array, the corresponding callback is separately invoked.

As an example, for an entry method with two nocopy array parameters, each called with the same callback, the callback will be invoked twice: on completing the transfer of each of the two nocopy parameters.

For multiple arrays to be sent via RDMA, declare the entry method in the .ci file as:

entry void foo (int size1, nocopy int arr1[size1], int size2, nocopy double arr2[size2]);

In the .C file, it is also possible to have different callbacks associated with each nocopy array.

CkCallback cb1(CkIndex_Foo::zerocopySent1(NULL), thisProxy[thisIndex]);

CkCallback cb2(CkIndex_Foo::zerocopySent2(NULL), thisProxy[thisIndex]);

arrayProxy[0].foo(500000, CkSendBuffer(arrPtr1, cb1), 1024000, CkSendBuffer(arrPtr2, cb2));

This API is demonstrated in examples/charm++/zerocopy/entry_method_api and tests/charm++/pingpong.

It should be noted that calls to entry methods with nocopy specified parameters are currently only supported for point-to-point operations and not for collective operations. Additionally, there is no support for migration of chares that have pending RDMA transfer requests.

It should also be noted that the benefit of this API is seen for large arrays only on RDMA-enabled networks. On networks which do not support RDMA, and for within-process sends (which use shared memory), the API is functional but doesn't show any performance benefit, as it behaves like a regular entry method that copies its arguments.

Table 10.1 displays the message size thresholds for the zero copy entry method send API on popular systems and build architectures. These results were obtained by running examples/charm++/zerocopy/entry_method_api/pingpong in non-SMP mode on production builds. For message sizes greater than or equal to the displayed thresholds, the zero copy API is found to perform better than the regular message send API. For network layers that are not pamilrts, gni, verbs, ofi or mpi, the generic implementation is used.

Table 10.1: Message size thresholds for which the Zero Copy Entry Method Send API performs better than the regular API

Machine                        Network          Build Architecture    Intra Processor   Intra Host   Inter Host
Blue Gene/Q (Vesta)            PAMI             pamilrts-bluegeneq    4 MB              32 KB        256 KB
Cray XC30 (Edison)             Aries            gni-crayxc            1 MB              2 MB         2 MB
Cray XC30 (Edison)             Aries            mpi-crayxc            256 KB            8 KB         32 KB
Dell Cluster (Golub)           Infiniband       verbs-linux-x86_64    128 KB            2 MB         1 MB
Dell Cluster (Golub)           Infiniband       mpi-linux-x86_64      128 KB            32 KB        64 KB
Intel Cluster (Bridges)        Intel Omni-Path  ofi-linux-x86_64      64 KB             32 KB        32 KB
Intel KNL Cluster (Stampede2)  Intel Omni-Path  ofi-linux-x86_64      1 MB              64 KB        64 KB