10. Optimizing Entry Method Invocation


10.1 Messages

Although Charm++ supports automated parameter marshalling for entry methods, you can also manually handle the packing and unpacking of parameters by using messages. A message encapsulates all the parameters sent to an entry method. Since the parameters are already encapsulated, sending messages is often more efficient than parameter marshalling, and can help to avoid unnecessary copying. Messages are also useful when the receiver is unable to process the contents of a message at the time that it receives it. For example, consider a tiled matrix multiplication program, wherein each chare receives an $A$-tile and a $B$-tile before computing a partial result for $C = A \times B$. If we were using parameter-marshalled entry methods, a chare would have to copy the first tile it received, in order to save it until it has both the tiles it needs. Then, upon receiving the second tile, the chare would use the second tile and the first (saved) tile to compute a partial result. Using messages, however, we would just save a pointer to the message encapsulating the tile received first, instead of copying the tile data itself.

Managing the memory buffer associated with a message. As suggested in the example above, the biggest difference between marshalled parameters and messages is that an entry method invocation is assumed to keep the message that it is passed. That is, the Charm++ runtime system assumes that code in the body of the invoked entry method will explicitly manage the memory associated with the message that it receives. Therefore, in order to avoid leaking memory, the body of an entry method must either delete the message that it receives, or save a pointer to it and delete it at a later point in the execution of the code. Moreover, in the Charm++ execution model, once you pass a message buffer to the runtime system (via an asynchronous entry method invocation), you should not reuse the buffer. That is, after you have passed a message buffer into an asynchronous entry method invocation, you shouldn't access its fields or pass that same buffer into a second entry method invocation. Note that this rule doesn't preclude the single reuse of an input message: consider an entry method invocation $i_1$, which receives as input the message buffer $m_1$. Then, $m_1$ may be passed to an asynchronous entry method invocation $i_2$. However, once $i_2$ has been issued with $m_1$ as its input parameter, $m_1$ cannot be used in any further entry method invocations.

Several kinds of message are available. Regular Charm++ messages are objects of fixed size. With some additional coding, messages can also contain pointers or variable-length arrays (arrays with sizes specified at runtime), with those pointers remaining valid when messages are sent across processors. Also available is a mechanism for assigning priorities to a message, regardless of its type. A detailed discussion of priorities appears later in this section.

10.1.1 Message Types

Fixed-Size Messages. The simplest type of message is a fixed-size message. The size of each data member of such a message should be known at compile time. Therefore, such a message may encapsulate primitive data types, user-defined data types that don't maintain pointers to memory locations, and static arrays of the aforementioned types.

Variable-Size Messages. Very often, the size of the data contained in a message is not known until runtime. For such scenarios, you can use variable-size ( varsize ) messages. A varsize message can encapsulate several arrays, each of whose size is determined at run time. The space required for these encapsulated variable-length arrays is allocated along with the message itself, so that the entire message comprises a single contiguous buffer of memory.
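To make the contiguous-layout idea concrete, here is a plain C++ sketch (hypothetical names; this is not the Charm++ API, which performs this allocation for you) that places a fixed-size header and two runtime-sized arrays in a single allocation:

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical illustration (plain C++, NOT the Charm++ API): one contiguous
// malloc'd block holds the fixed-size "message" header plus two runtime-sized
// arrays, with the header's pointers aimed into that same block.
struct VarsizeLayout {
    int    *arr1;   // points just past the header
    double *arr2;   // points after arr1, 8-byte aligned
    int n1, n2;
};

// Round x up to the next multiple of 8, like the ALIGN8 macro used later.
size_t align8(size_t x) { return (x + 7) & ~static_cast<size_t>(7); }

VarsizeLayout *allocContiguous(int n1, int n2) {
    size_t off1 = align8(sizeof(VarsizeLayout));
    size_t off2 = align8(off1 + n1 * sizeof(int));
    char *block = static_cast<char *>(std::malloc(off2 + n2 * sizeof(double)));
    VarsizeLayout *m = reinterpret_cast<VarsizeLayout *>(block);
    m->n1 = n1;  m->n2 = n2;
    m->arr1 = reinterpret_cast<int *>(block + off1);
    m->arr2 = reinterpret_cast<double *>(block + off2);
    return m;
}
```

Because everything lives in one block, such a message can be shipped (and freed) as a unit; this is essentially what the translator-generated alloc method in § 10.1.3 does.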

Packed Messages. A packed message is used to communicate non-linear data structures via messages. However, we defer a more detailed description of their use to § 10.1.3.

10.1.2 Using Messages In Your Program

There are five steps to incorporating a (fixed or varsize) message type in your Charm++ program: (1) declare the message type in the .ci file; (2) define the message type in a .h file; (3) allocate the message; (4) pass the message to an asynchronous entry method invocation; and (5) deallocate the message to free the associated memory resources.


Declaring Your Message Type. Like all other entities involved in asynchronous entry method invocation, messages must be declared in the .ci file. This allows the Charm++ translator to generate support code for messages. Message declaration is straightforward for fixed-size messages. Given a message of type MyFixedSizeMsg , simply include the following in the .ci file:


  message MyFixedSizeMsg;

For varsize messages, the .ci declaration must also include the names and types of the variable-length arrays that the message will encapsulate. The following example illustrates this requirement. In it, a message of type MyVarsizeMsg , which encapsulates three variable-length arrays of different types, is declared:

  message MyVarsizeMsg {
   int arr1[];
   double arr2[];
   MyPointerlessStruct arr3[];
 };


Defining Your Message Type. Once a message type has been declared to the Charm++ translator, its type definition must be provided. Your message type must inherit from a specific generated base class. If the type of your message is T , then class T must inherit from CMessage_T . This is true for both fixed and varsize messages. As an example, for our fixed size message type MyFixedSizeMsg above, we might write the following in the .h file:


 class MyFixedSizeMsg : public CMessage_MyFixedSizeMsg {
  int var1;
  MyPointerlessStruct var2;
  double arr3[10];

  // Normal C++ methods, constructors, etc. go here
};

In particular, note the inclusion of the static array of doubles, arr3 , whose size is known at compile time to be ten. Similarly, for our example varsize message of type MyVarsizeMsg , we would write something like:

 class MyVarsizeMsg : public CMessage_MyVarsizeMsg {
  // variable-length arrays
  int *arr1;
  double *arr2;
  MyPointerlessStruct *arr3;
  
  // members that are not variable-length arrays 
  int x,y;
  double z;

  // Normal C++ methods, constructors, etc. go here
};

Note that the .h definition of the class type must contain data members whose names and types match those specified in the .ci declaration. In addition, if any of the data members are private or protected , the class should declare CMessage_MyVarsizeMsg to be a friend class. Finally, there are no limitations on the member methods of message classes, except that the message class may not redefine the operators new or delete .


Creating a Message. With the .ci declaration and .h definition in place, messages can be allocated and used in the program. Messages are allocated using the C++ new operator:


  MessageType *msgptr =
  new [(int sz1, int sz2, ... , int priobits=0)] MessageType[(constructor arguments)];

The arguments enclosed within the square brackets are optional, and are used only when allocating messages with variable length arrays or prioritized messages. These arguments are not specified for fixed size messages. For instance, to allocate a message of our example message MyFixedSizeMsg , we write:


 MyFixedSizeMsg *msg = new MyFixedSizeMsg(<constructor args>);

In order to allocate a varsize message, we must pass appropriate values to the arguments of the overloaded new operator presented previously. Arguments sz1, sz2, ... denote the size (in number of elements) of the memory blocks that need to be allocated and assigned to the pointers (variable-length arrays) that the message contains. The priobits argument denotes the size of a bitvector (number of bits) that will be used to store the message priority. So, if we wanted to create MyVarsizeMsg whose arr1 , arr2 and arr3 arrays contain 10, 20 and 7 elements of their respective types, we would write:


 MyVarsizeMsg *msg = new (10, 20, 7) MyVarsizeMsg(<constructor args>);

Further, to add a 32-bit priority bitvector to this message, we would write:


 MyVarsizeMsg *msg = new (10, 20, 7, sizeof(uint32_t)*8) MyVarsizeMsg(<constructor args>);

Notice the last argument to the overloaded new operator, which specifies the number of bits used to store message priority. The section on prioritized execution (§  10.3 ) describes how priorities can be employed in your program.

Another version of the overloaded new operator allows you to pass in an array containing the size of each variable-length array, rather than specifying individual sizes as separate arguments. For example, we could create a message of type MyVarsizeMsg in the following manner:


  int sizes[3];
  sizes[0] = 10;               // arr1 will have 10 elements
  sizes[1] = 20;               // arr2 will have 20 elements
  sizes[2] = 7;                // arr3 will have 7 elements

  MyVarsizeMsg *msg = new(sizes, 0) MyVarsizeMsg(<constructor args>); // 0 priority bits


Sending a Message. Once we have a properly allocated message, we can set the various elements of the encapsulated arrays in the following manner:


   msg->arr1[13] = 1;
  msg->arr2[5] = 32.82;
  msg->arr3[2] = MyPointerlessStruct();
  // etc.

And pass it to an asynchronous entry method invocation, thereby sending it to the corresponding chare:


 myChareArray[someIndex].foo(msg);

When a message is sent , i.e., passed to an asynchronous entry method invocation, the programmer relinquishes control of it; the space allocated for the message will be freed by the runtime system. However, when a message is received at an entry point, it is not freed by the runtime system. As mentioned at the start of this section, a received message may be reused or deleted by the programmer. Finally, messages are deleted using the standard C++ delete operator.

10.1.3 Message Packing

The Charm++ interface translator generates implementations of three static methods for the message class CMessage_mtype . These methods have the prototypes:


     static void* alloc(int msgnum, size_t size, int* array, int priobits);
    static void* pack(mtype*);
    static mtype* unpack(void*);

One may choose not to use the translator-generated methods and instead override these implementations with custom alloc , pack , and unpack static methods of the mtype class. The alloc method will be called when the message is allocated using the C++ new operator; the programmer never needs to call it explicitly. Note that all elements of the message are allocated when the message is created with new : there is no need to call new to allocate any of the fields of the message. This differs from a packed message, where each field requires individual allocation. The alloc method should actually allocate the message using CkAllocMsg , whose signature is given below:


 void *CkAllocMsg(int msgnum, int size, int priobits); 

For varsize messages, these static methods alloc , pack , and unpack are generated by the interface translator. For example, for a VarsizeMessage class containing an int array firstArray and a double array secondArray , these methods would be similar to:


 // allocate memory for varmessage so charm can keep track of memory
 static void* alloc(int msgnum, size_t size, int* array, int priobits)
{
  int totalsize, first_start, second_start;
  // array is passed in when the message is allocated using new (see below).
  // size is the amount of space needed for the part of the message known
  // about at compile time.  Depending on their values, sometimes a segfault
  // will occur if memory addressing is not on 8-byte boundary, so altered
  // with ALIGN8
  first_start = ALIGN8(size);  // 8-byte align with this macro
  second_start = ALIGN8(first_start + array[0]*sizeof(int));
  totalsize = second_start + array[1]*sizeof(double);
  VarsizeMessage* newMsg = 
    (VarsizeMessage*) CkAllocMsg(msgnum, totalsize, priobits);
  // make firstArray point to end of newMsg in memory
  newMsg->firstArray = (int*) ((char*)newMsg + first_start);
  // make secondArray point to after end of firstArray in memory
  newMsg->secondArray = (double*) ((char*)newMsg + second_start);

  return (void*) newMsg;
}

 // returns pointer to memory containing packed message
 static void* pack(VarsizeMessage* in)
{
  // set firstArray an offset from the start of in
  in->firstArray = (int*) ((char*)in->firstArray - (char*)in);
  // set secondArray to the appropriate offset
  in->secondArray = (double*) ((char*)in->secondArray - (char*)in);
  return in;
}

// returns new message from raw memory
static VarsizeMessage* unpack(void* inbuf)
{
  VarsizeMessage* me = (VarsizeMessage*)inbuf;
  // return first array to absolute address in memory
  me->firstArray = (int*) ((size_t)me->firstArray + (char*)me);
  // likewise for secondArray
  me->secondArray = (double*) ((size_t)me->secondArray + (char*)me);
  return me;
}

The pointers in a varsize message can exist in two states. At creation, they are valid C++ pointers to the start of the arrays. After packing, they become offsets from the start of the message to the start of the pointed-to data. Unpacking restores them to pointers.
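This two-state scheme can be exercised in isolation with plain C++ (hypothetical names; not Charm++ code): pack converts the array pointer into its byte offset from the message base, the raw bytes are copied to a new address, and unpack restores an absolute pointer there.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a varsize message: the array pointer normally
// points into the message's own allocation.
struct DemoMsg {
    int *arr;      // a real pointer before pack(); a byte offset after pack()
    int  data[4];  // the storage the pointer refers to
};

// pack: store the pointer as its offset from the start of the message, so the
// raw bytes stay meaningful when copied to a different address.
void packDemo(DemoMsg *m) {
    m->arr = reinterpret_cast<int *>(
        reinterpret_cast<char *>(m->arr) - reinterpret_cast<char *>(m));
}

// unpack: turn the stored offset back into an absolute pointer at the
// message's (possibly new) address.
void unpackDemo(DemoMsg *m) {
    m->arr = reinterpret_cast<int *>(
        reinterpret_cast<char *>(m) + reinterpret_cast<uintptr_t>(m->arr));
}
```

A memcpy of the packed bytes to another location, followed by unpack, yields a message whose pointer is valid at the new address; this mirrors what the runtime does when a varsize message crosses an address-space boundary.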


10.1.3.1 Custom Packed Messages

In many cases, a message must store a non-linear data structure using pointers; examples include binary trees and hash tables. In such cases, the message itself contains only a pointer to the actual data. When the message is sent to the same processor, these pointers point to the original locations, which are within the address space of that same processor. However, when such a message is sent to other processors, these pointers would point to invalid locations.

Thus, the programmer needs a way to ``serialize'' these messages only if the message crosses the address-space boundary. Charm++ provides a way to do this serialization by allowing the developer to override the default serialization methods generated by the Charm++ interface translator. Note that this low-level serialization has nothing to do with parameter marshalling or the PUP framework described later.

Packed messages are declared in the .ci file the same way as ordinary messages:


 message PMessage;

Like all messages, the class PMessage needs to inherit from CMessage_PMessage and should provide two static methods: pack and unpack . These methods are called by the Charm++ runtime system when the message is determined to be crossing an address-space boundary. The prototypes for these methods are as follows:


 static void *PMessage::pack(PMessage *in);
 static PMessage *PMessage::unpack(void *in);

Typically, the following tasks are done in the pack method:

- Determine the size of the flattened data and allocate a buffer of that size with CkAllocBuffer.
- Serialize the data structure (along with any bookkeeping information, such as element counts) into the buffer.
- Delete the original message and return a pointer to the buffer.

On the receiving processor, the unpack method is called. Typically, the following tasks are done in the unpack method:

- Allocate a message of the proper size with CkAllocBuffer.
- Reconstruct the data structure from the information in the flat buffer.
- Free the received buffer with CkFreeMsg and return a pointer to the new message.

Here is an example of a packed-message implementation:


 // File: pgm.ci

mainmodule PackExample {
  ...
  message PackedMessage;
  ...
};

// File: pgm.h
...

class PackedMessage : public CMessage_PackedMessage
{
  public:
    BinaryTree<char> btree; // A non-linear data structure 
    static void* pack(PackedMessage*);
    static PackedMessage* unpack(void*);
    ...
};
...

// File: pgm.C
...

void*
PackedMessage::pack(PackedMessage* inmsg)
{
  int treesize = inmsg->btree.getFlattenedSize();
  int totalsize = treesize + sizeof(int);
  char *buf = (char*)CkAllocBuffer(inmsg, totalsize);
  // buf is now just raw memory to store the data structure
  int num_nodes = inmsg->btree.getNumNodes();
  memcpy(buf, &num_nodes, sizeof(int));  // copy numnodes into buffer
  buf = buf + sizeof(int);               // don't overwrite numnodes
  // copies into buffer, give size of buffer minus header
  inmsg->btree.Flatten((void*)buf, treesize);    
  buf = buf - sizeof(int);              // don't lose numnodes
  delete inmsg;
  return (void*) buf;
}


PackedMessage*
PackedMessage::unpack(void* inbuf)
{
  // inbuf is the raw memory allocated and assigned in pack
  char* buf = (char*) inbuf;
  int num_nodes;
  memcpy(&num_nodes, buf, sizeof(int));
  buf = buf + sizeof(int);
  // allocate the message through Charm RTS
  PackedMessage* pmsg = 
    (PackedMessage*)CkAllocBuffer(inbuf, sizeof(PackedMessage));
  // call "inplace" constructor of PackedMessage that calls constructor
  // of PackedMessage using the memory allocated by CkAllocBuffer,
  // takes a raw buffer inbuf, the number of nodes, and constructs the btree
  pmsg = new ((void*)pmsg) PackedMessage(buf, num_nodes);  
  CkFreeMsg(inbuf);
  return pmsg;
}
... 

PackedMessage* pm = new PackedMessage();  // just like always 

pm->btree.Insert('A');
...

While serializing an arbitrary data structure into a flat buffer, one must be very wary of any possible alignment problems. Thus, if possible, the buffer itself should be declared to be a flat struct. This will allow the C++ compiler to ensure proper alignment of all its member fields.
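The flat-struct recommendation can be shown with a small plain C++ sketch (hypothetical types, independent of Charm++): because the struct contains no pointers, one memcpy flattens it into raw bytes and another reconstructs it at a properly aligned address, with the compiler guaranteeing member alignment inside the struct.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical flat "wire" struct: no pointers, so it can be flattened with a
// single memcpy, and the compiler lays out its members with correct alignment.
struct FlatNode {
    int    key;
    double weight;
    char   tag[8];
};

// Round-trip one node through a raw byte buffer, as a serializer would.
FlatNode roundTrip(const FlatNode &in) {
    char buf[sizeof(FlatNode)];
    std::memcpy(buf, &in, sizeof(FlatNode));   // flatten into raw bytes
    FlatNode out;                              // a properly aligned destination
    std::memcpy(&out, buf, sizeof(FlatNode));  // reconstruct
    return out;
}
```

Copying whole structs this way sidesteps the misaligned-access problems that arise when individual fields are read directly out of an arbitrary offset in a char buffer.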

10.1.3.2 Immediate Messages

Immediate messages are special messages that skip the Charm++ scheduler; they can be executed in an ``immediate'' fashion, even in the middle of a normally running entry method. They are supported only on NodeGroups.


10.2 Entry Method Attributes

Charm++ provides a handful of special attributes that entry methods may have. In order to give a particular entry method an attribute, you must specify the keyword for the desired attribute in the attribute list of that entry method's .ci file declaration. The syntax for this is as follows:


 entry [attribute1, ..., attributeN] void EntryMethod(parameters);

Charm++ currently offers the following attributes that one may assign to an entry method: threaded , sync , exclusive , nokeep , notrace , appwork , immediate , expedited , inline , local , python , reductiontarget , and aggregate .

threaded
entry methods run in their own non-preemptible threads. These entry methods may perform blocking operations, such as calls to a sync entry method, or explicitly suspending themselves. For more details, refer to section  12.1 .

sync
entry methods are special in that calls to them are blocking: they do not return control to the caller until the method finishes executing completely. Sync methods may have return values; however, they may only return messages or data types that have the PUP method implemented. Callers must run in a thread separate from the runtime scheduler, e.g., in a threaded entry method. A call expecting a return value will receive it as the return value of the proxy invocation:


  ReturnMsg* m;
 m = A[i].foo(a, b, c);

For more details, refer to section  12.2 .

exclusive
entry methods should only exist on NodeGroup objects. Such an entry method will not execute while any other exclusive entry method of the same NodeGroup object is executing on the same node. In other words, if one exclusive method of a NodeGroup object is executing on node N, and another one is scheduled to run on the same node, the second exclusive method will wait until the first one finishes. An example can be found in tests/charm++/pingpong .

nokeep
entry methods take only a message as their argument, and the memory buffer for this message is managed by the Charm++ runtime system rather than by user code. This means the user must guarantee that the message is neither buffered for later use nor freed in user code; otherwise, a runtime error will result. Such entry methods enable runtime optimizations such as reusing the message memory. An example can be found in examples/charm++/histogram_group .

notrace
entry methods will not be traced during execution. As a result, they will not be recorded or displayed in Projections for performance analysis.

appwork
this entry method will be marked as executing application work. It will be used for performance analysis.

immediate
entry methods are executed in an ``immediate'' fashion, as they skip the message scheduling that normal entry methods go through. Immediate entry methods should only be associated with NodeGroup objects, although this is not checked during compilation. If the destination of such an entry method is on the local node, the method will be executed in the context of the regular PE, regardless of the execution mode of the Charm++ runtime. However, in SMP mode, if the destination of the method is on a remote node, the method will be executed in the context of the communication thread. Such entry methods can be useful for implementing multicasts/reductions, as well as data lookup, when such operations are on the performance-critical path. On a given Charm++ PE, skipping the normal message scheduling prevents the execution of immediate entry methods from being delayed by entry methods that could take a long time to finish. Immediate entry methods are implicitly ``exclusive'' on each node, meaning that one execution of an immediate message will not be interrupted by another. The function CmiProbeImmediateMsg() can be called in user code to probe and process immediate messages periodically. An example, ``immediatering'', can be found in tests/charm++/megatest .

expedited
entry methods skip the priority-based message queue in the Charm++ runtime. This is useful for messages that require prompt processing when adding the immediate attribute does not apply. Compared with the immediate attribute, the expedited attribute provides a more general solution that works for all types of Charm++ objects, i.e., Chare, Group, NodeGroup, and Chare Array. However, expedited entry methods will still be scheduled in the lower-level Converse message queue and processed in the order of message arrival. Therefore, they may still suffer delays caused by long-running entry methods. An example can be found in examples/charm++/satisfiability .

inline
entry methods will be called as a normal C++ member function if the message recipient happens to be on the same PE. The call to the function will happen inline, and control will return to the calling function after the inline method completes. Because of this, these entry methods need to be re-entrant as they could be called multiple times recursively. Parameters to the inline method will be passed by const reference to avoid any copying, packing, or unpacking of the parameters. This makes inline calls effective when large amounts of data are being passed, and copying or packing the data would be an expensive operation. If the recipient resides on a different PE, a regular message is sent with the message arguments packed up using PUP, and inline has no effect. An example ``inlineem'' can be found in tests/charm++/megatest .

local
entry methods are equivalent to normal function calls: the entry method is always executed immediately. This feature is available only for Group objects and Chare Array objects. The user must guarantee that the recipient chare element resides on the same PE; otherwise, the application will abort on a failure. If the local entry method uses parameter marshalling, then instead of marshalling the input parameters into a message, it will pass them directly to the callee. This implies that the callee can modify the caller's data if method parameters are passed by pointer or reference. Furthermore, the input parameters are not required to be PUPable. Since these entry methods always execute immediately, they are allowed to have a non-void return value; nevertheless, the return type of the method must be a pointer. An example can be found in examples/charm++/hello/local .

python
entry methods are enabled to be called from python scripts as explained in chapter  26 . Note that the object owning the method must also be declared with the keyword python . Refer to chapter  26 for more details.

reductiontarget
entry methods may be used as the target of reductions, despite not taking CkReductionMsg as an argument. See section  4.6 for more references.

aggregate
data sent to this entry method will be aggregated into larger messages before being sent, to reduce fine-grained overhead. The aggregation is handled by the Topological Routing and Aggregation Module (TRAM). The argument to this entry method must be a single fixed-size object. More details on TRAM are given in the Converse and Charm++ Libraries Manual (http://charm.cs.illinois.edu/manuals/html/libraries/5.html).

10.3 Controlling Delivery Order

By default, Charm++ processes messages in roughly FIFO order as they arrive at a PE. For most programs, this behavior is fine. However, for optimal performance, some programs need more explicit control over the order in which messages are processed. Charm++ allows you to adjust delivery order on a per-message basis. An example program demonstrating how to modify delivery order for messages and parameter marshaling can be found in examples/charm++/prio .


10.3.0.1 Queueing Strategies

The order in which messages are processed in the recipient's queue can be set by explicitly setting the queueing strategy using one of the following constants. These constants can be applied when sending a message or when invoking an entry method using parameter marshaling:

- CK_QUEUEING_FIFO : first-in, first-out ordering
- CK_QUEUEING_LIFO : last-in, first-out ordering
- CK_QUEUEING_IFIFO : FIFO ordering among integer-prioritized messages
- CK_QUEUEING_ILIFO : LIFO ordering among integer-prioritized messages
- CK_QUEUEING_BFIFO : FIFO ordering among bitvector-prioritized messages
- CK_QUEUEING_BLIFO : LIFO ordering among bitvector-prioritized messages
- CK_QUEUEING_LFIFO : FIFO ordering among long-integer-prioritized messages
- CK_QUEUEING_LLIFO : LIFO ordering among long-integer-prioritized messages

10.3.0.2 Parameter Marshaling

For parameter marshaling, the queueing type can be set on a CkEntryOptions object, which is passed to the entry method invocation as an optional final parameter.


   CkEntryOptions opts1, opts2;
  opts1.setQueueing(CK_QUEUEING_FIFO);
  opts2.setQueueing(CK_QUEUEING_LIFO);

  chare.entry_name(arg1, arg2, opts1);
  chare.entry_name(arg1, arg2, opts2);

When the message with opts1 arrives at its destination, it will be pushed onto the end of the message queue as usual. However, when the message with opts2 arrives, it will be pushed onto the front of the message queue.
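The queueing behavior described above can be mimicked with an ordinary deque (a sketch of the idea only, not the runtime's actual queue implementation): FIFO enqueues go on the back, LIFO enqueues on the front, and the scheduler always dequeues from the front.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Sketch of the two queueing disciplines: FIFO messages are enqueued at the
// back, LIFO messages at the front; the scheduler pops from the front.
enum Strategy { FIFO, LIFO };

void enqueueMsg(std::deque<int> &q, int msg, Strategy s) {
    if (s == FIFO) q.push_back(msg);
    else           q.push_front(msg);
}

// Dequeue everything in processing order.
std::vector<int> drain(std::deque<int> &q) {
    std::vector<int> order;
    while (!q.empty()) { order.push_back(q.front()); q.pop_front(); }
    return order;
}
```

A LIFO message thus "jumps the line" ahead of messages already waiting, while FIFO messages retain arrival order among themselves.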

10.3.0.3 Messages

For messages, the CkSetQueueing function can be used to change the order in which messages are processed, where queueingtype is one of the above constants.

void CkSetQueueing(MsgType message, int queueingtype)

The first two options, CK_QUEUEING_FIFO and CK_QUEUEING_LIFO , are used as follows:


   MsgType *msg1 = new MsgType;
  CkSetQueueing(msg1, CK_QUEUEING_FIFO);

  MsgType *msg2 = new MsgType;
  CkSetQueueing(msg2, CK_QUEUEING_LIFO);

Similar to the parameter marshalled case described above, msg1 will be pushed onto the end of the message queue, while msg2 will be pushed onto the front of the message queue.


10.3.0.4 Prioritized Execution

The basic FIFO and LIFO strategies are sufficient to approximate parallel breadth-first and depth-first explorations of a problem space, but they do not allow more fine-grained control. To provide that degree of control, Charm++ also allows explicit prioritization of messages.

The other six queueing strategies involve the use of priorities . There are two kinds of priorities which can be attached to a message: integer priorities and bitvector priorities . These correspond to the I and B queueing strategies, respectively. In both cases, numerically lower priorities will be dequeued and delivered before numerically greater priorities. The FIFO and LIFO queueing strategies then control the relative order in which messages of the same priority will be delivered.

To attach a priority field to a message, one needs to set aside space in the message's buffer while allocating the message . To achieve this, the size of the priority field in bits should be specified as a placement argument to the new operator, as described in section 10.1.2. Although the size of the priority field is specified in bits, it is always padded to an integral number of ints. A pointer to the priority part of the message buffer can be obtained with this call:
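The padding rule can be expressed as a small helper (hypothetical; not part of the Charm++ API):

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper mirroring the documented padding rule: a priority field
// of `priobits` bits occupies a whole number of ints in the message buffer.
size_t prioFieldBytes(int priobits) {
    const size_t bitsPerInt = CHAR_BIT * sizeof(int);   // typically 32
    return (priobits + bitsPerInt - 1) / bitsPerInt * sizeof(int);
}
```

So, for example, requesting anywhere from 1 to 32 priority bits consumes one int's worth of space on a typical platform.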

void *CkPriorityPtr(MsgType msg)

Integer priorities are quite straightforward. One allocates a message with an extra integer parameter to ``new'' (see the first line of the example below), which sets aside enough space (in bits) in the message to hold the priority. One then stores the priority in the message. Finally, one informs the system that the message contains an integer priority using CkSetQueueing :


   MsgType *msg = new (8*sizeof(int)) MsgType;
  *(int*)CkPriorityPtr(msg) = prio;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);

10.3.0.5 Bitvector Prioritization

Bitvector priorities are arbitrary-length bit-strings representing fixed-point numbers in the range 0 to 1. For example, the bit-string ``001001'' represents the binary fixed-point number 0.001001. As with integer priorities, higher numbers represent lower priorities. However, bitvectors can be of arbitrary length, and hence the priority numbers they represent can be of arbitrary precision.
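The fixed-point interpretation can be sketched in plain C++ (an illustration, not a Charm++ routine): the first bit is worth 1/2, the next 1/4, and so on.

```cpp
#include <cassert>
#include <string>

// Interpret a bit-string priority as the fixed-point fraction it denotes in
// [0, 1): the first bit is worth 1/2, the second 1/4, and so on.
double prioValue(const std::string &bits) {
    double value = 0.0, place = 0.5;
    for (char b : bits) {
        if (b == '1') value += place;
        place *= 0.5;
    }
    return value;
}
```

Note that any priority beginning ``0...'' compares below any beginning ``1...'': a shared prefix confines all descendants to a sub-interval of the range, which is what the range-splitting scheme discussed below relies on.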

Arbitrary-precision priorities are often useful in AI search-tree applications. Suppose we have a heuristic suggesting that tree node $N_1$ should be searched before tree node $N_2$ . We therefore designate that node $N_1$ and its descendants will use high priorities, and that node $N_2$ and its descendants will use lower priorities. We have effectively split the range of possible priorities in two. If several such heuristics fire in sequence, we can easily split the priority range in two enough times that no significant bits remain, and the search begins to fail for lack of meaningful priorities to assign. The solution is to use arbitrary-precision priorities, i.e. bitvector priorities.

To assign a bitvector priority, two methods are available. The first is to obtain a pointer to the priority field using CkPriorityPtr , and then manually set the bits using the bit-setting operations inherent to C. To achieve this, one must know the format of the bitvector, which is as follows: the bitvector is represented as an array of unsigned integers. The most significant bit of the first integer contains the first bit of the bitvector. The remaining bits of the first integer contain the next 31 bits of the bitvector. Subsequent integers contain 32 bits each. If the size of the bitvector is not a multiple of 32, then the last integer contains 0 bits for padding in the least-significant bits of the integer.
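The layout just described can be sketched with a small helper (hypothetical, not a Charm++ routine; it assumes 32-bit unsigned ints, as on common platforms):

```cpp
#include <cassert>
#include <vector>

// Set bit `k` of a bitvector priority stored per the layout above: bit 0 is
// the MOST significant bit of word 0, bit 32 the most significant bit of
// word 1, and so on. Assumes 32-bit unsigned ints.
void setPrioBit(std::vector<unsigned int> &words, int k) {
    int word  = k / 32;
    int shift = 31 - (k % 32);   // MSB-first within each word
    words[word] |= (1u << shift);
}
```

In real code one would apply such bit operations to the buffer returned by CkPriorityPtr rather than to a separate vector.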

The second way to assign priorities is only useful for those who are using the priority range-splitting described above. The root of the tree is assigned the null priority-string. Each child is assigned its parent's priority with some number of bits concatenated. The net effect is that the entire priority of a branch is within a small epsilon of the priority of its root.

It is possible to utilize unprioritized messages, integer priorities, and bitvector priorities in the same program; the runtime orders all of them by comparing their effective priority values, with the FIFO or LIFO strategy breaking ties among messages of equal priority.

Additionally, long integer priorities can be specified by the L strategy.

A final reminder about prioritized execution: Charm++ processes messages in roughly the order you specify; it never guarantees that it will deliver the messages in precisely the order you specify. Thus, the correctness of your program should never depend on the order in which the runtime delivers messages. However, it makes a serious attempt to be ``close'', so priorities can strongly affect the efficiency of your program.

10.3.0.6 Skipping the Queue

Some operations that one might want to perform are sufficiently latency-sensitive that they should never wait in line behind other messages. The Charm++ runtime offers two attributes for entry methods, expedited and immediate , to serve these needs. For more information on these attributes, see Section  10.2 and the example in tests/charm++/megatest/immediatering.ci .


10.4 Zero Copy Message Send API

Apart from using messages, Charm++ also provides a zero copy message send API to avoid copies for entry method invocations that use parameter marshalling instead of messages. It makes use of one-sided communication on an underlying Remote Direct Memory Access (RDMA) enabled network. For large arrays (a few hundred kilobytes or more), the cost of copying during message marshalling can be quite high. Using this API not only saves the expensive copy operation but also reduces the application's memory footprint by avoiding data duplication. Saving these costs for large arrays proves to be a significant optimization in achieving faster message send times. On the other hand, using the zero copy message send API for small arrays can lead to a drop in performance, due to the overhead associated with one-sided communication.

To send an array using the zero copy message send API, specify the array parameter in the .ci file with the nocopy specifier.

 entry void foo (int size, nocopy int arr[size]);

When calling the entry method from the .C file, wrap the array, i.e., the pointer, in a CkSendBuffer wrapper:


 arrayProxy[0].foo(500000, CkSendBuffer(arrPtr));

Until the RDMA operation is completed, it is not safe to modify the buffer. To be notified on completion of the RDMA operation, pass an optional callback object in the CkSendBuffer wrapper associated with the specific nocopy array.


 CkCallback cb(CkIndex_Foo::zerocopySent(NULL), thisProxy[thisIndex]);
 arrayProxy[0].foo(500000, CkSendBuffer(arrPtr, cb));

The callback will be invoked on completion of the RDMA operation associated with the corresponding array. Inside the callback, it is safe to overwrite the buffer sent via the zero copy API and this buffer can be accessed by dereferencing the CkDataMsg received in the callback.


 // called when the RDMA operation is completed
 void zerocopySent(CkDataMsg *m)
{
  //get access to the pointer and free the allocated buffer
  void *ptr = *((void **)(m->data));
  free(ptr);
  delete m;
}

The RDMA call is associated with a nocopy array rather than the entry method. In the case of sending multiple nocopy arrays, each RDMA call is independent of the other. Hence, the callback applies to only the array it is attached to and not to all the nocopy arrays passed in an entry method invocation. On completion of the RDMA call for each array, the corresponding callback is separately invoked.

As an example, for an entry method with two nocopy array parameters, each called with the same callback, the callback will be invoked twice: on completing the transfer of each of the two nocopy parameters.



For multiple arrays to be sent via RDMA, declare the entry method in the .ci file as:


 entry void foo (int size1, nocopy int arr1[size1], int size2, nocopy double arr2[size2]);

In the .C file, it is also possible to have different callbacks associated with each nocopy array.


 CkCallback cb1(CkIndex_Foo::zerocopySent1(NULL), thisProxy[thisIndex]);
 CkCallback cb2(CkIndex_Foo::zerocopySent2(NULL), thisProxy[thisIndex]);

 arrayProxy[0].foo(500000, CkSendBuffer(arrPtr1, cb1), 1024000, CkSendBuffer(arrPtr2, cb2));

This API is demonstrated in examples/charm++/zerocopy and tests/charm++/pingpong.



It should be noted that calls to entry methods with nocopy -specified parameters are currently supported only for point-to-point operations, not for collective operations. Additionally, there is no support for migration of chares that have pending RDMA transfer requests.



It should also be noted that the benefit of this API is seen for large arrays only on RDMA-enabled networks. On networks that do not support RDMA, and for within-process sends (which use shared memory), the API is functional but shows no performance benefit, as it behaves like a regular entry method that copies its arguments. Currently, the benefit of the API is mostly in terms of reducing the application's memory footprint, as the API is largely unoptimized. Optimized versions of this API are expected to be released in the future.