Subsections

7 . Load Balancing

7 . 1 Using Converse Load Balancers

This module defines a function CldEnqueue that sends a message to a lightly-loaded processor. It automates the process of finding a lightly-loaded processor. The function CldEnqueue is extremely sophisticated. It does not choose a processor, send the message, and forget it. Rather, it puts the message into a pool of movable work. The pool of movable work gradually shrinks as it is consumed (processed), but in most programs, there is usually quite a bit of movable work available at any given time. As load conditions shift, the load balancers shifts the pool around, compensating. Any given message may be shifted more than once, as part of the pool. CldEnqueue also accounts for priorities. Normal load-balancers try to make sure that all processors have some work to do. The function CldEnqueue goes a step further: it tries to make sure that all processors have some reasonably high-priority work to do. This can be extremely helpful in AI search applications. The two assertions above should be qualified: CldEnqueue can use these sophisticated strategies, but it is also possible to configure it for different behavior. When you compile and link your program, you choose a load-balancing strategy . That means you link in one of several implementations of the load-balancer. Most are sophisticated, as described above. But some are simple and cheap, like the random strategy. The process of choosing a strategy is described in the manual Converse Installation and Usage . Before you send a message using CldEnqueue , you must write an info function with this prototype: void InfoFn(void *msg, CldPackFn *pfn, int *len, int *queueing, int *priobits, unsigned int *prioptr);
The load balancer will call the info function when it needs to know various things about the message. The load balancer will pass in the message via the parameter msg . The info function's job is to ``fill in'' the other parameters. It must compute the length of the message, and store it at *len . It must determine the pack function for the message, and store a pointer to it at *pfm . It must identify the priority of the message, and the queueing strategy that must be used, storing this information at *queueing , *priobits , and *prioptr . Caution: the priority will not be copied, so the *prioptr should probably be made to point to the message itself. After the user of CldEnqueue writes the ``info'' function, the user must register it, using this: int CldRegisterInfoFn(CldInfoFn fn)
Accepts a pointer to an info-function. Returns an integer index for the info-function. This index will be needed in CldEnqueue . Normally, when you send a message, you pack up a bunch of data into a message, send it, and unpack it at the receiving end. It is sometimes possible to perform an optimization, though. If the message is bound for a processor within the same address space, it isn't always necessary to copy all the data into the message. Instead, it may be sufficient to send a message containing only a pointer to the data. This saves much packing, unpacking, and copying effort. It is frequently useful, since in a properly load-balanced program, a great many messages stay inside a single address space. With CldEnqueue, you don't know in advance whether a message is going to cross address-space boundaries or not. If it's to cross address spaces, you need to use the ``long form'', but if it's to stay inside an address space, you want to use the faster ``short form''. We call this ``conditional packing.'' When you send a message with CldEnqueue , you should initially assume it will not cross address space boundaries. In other words, you should send the ``short form'' of the message, containing pointers. If the message is about to leave the address space, the load balancer will call your pack function, which must have this prototype: void PackFn(void **msg)
The pack function is handed a pointer to a pointer to the message (yes, a pointer to a pointer). The pack function is allowed to alter the message in place, or replace the message with a completely different message. The intent is that the pack function should replace the ``short form'' of the message with the ``long form'' of the message. Note that if it replaces the message, it should CmiFree the old message. Of course, sometimes you don't use conditional packing. In that case, there is only one form of the message. In that case, your pack function can be a no-op.

Pack functions must be registered using this:

int CldRegisterPackFn(CldPackFn fn)
Accepts a pointer to an pack-function. Returns an integer index for the pack-function. This index will be needed in CldEnqueue .

Normally, CldEnqueue sends a message to a lightly-loaded processor. After doing this, it enqueues the message with the appropriate priority. The function CldEnqueue can also be used as a mechanism to simply enqueue a message on a remote processor with a priority. In other words, it can be used as a prioritized send-function. To do this, one of the CldEnqueue parameters allows you to override the load-balancing behavior and lets you choose a processor yourself.

The prototype for CldEnqueue is as follows:

void CldEnqueue(int pe, void *msg, int infofn)
The argument msg is a pointer to the message. The parameter infofn represents a function that can analyze the message. The parameter packfn represents a function that can pack the message. If the parameter pe is CLD_ANYWHERE , the message is sent to a lightly-loaded processor and enqueued with the appropriate priority. If the parameter pe is a processor number, the message is sent to the specified processor and enqueued with the appropriate priority. CldEnqueue frees the message buffer using CmiFree .

The following simple example illustrates how a Converse program can make use of the load balancers.

hello.c:


 #include <stdio.h>
#include "converse.h"
#define CHARES 10


void startup(int argc, char *argv[]);

void registerAndInitialize();


typedef struct pemsgstruct
{
  char header[CmiExtHeaderSizeBytes];
  int pe, id, pfnx;
  int queuing, priobits;
  unsigned int prioptr;
} pemsg;


CpvDeclare(int, MyHandlerIndex);

CpvDeclare(int, InfoFnIndex);

CpvDeclare(int, PackFnIndex);


int main(int argc, char *argv[]) 
{
  ConverseInit(argc, argv, startup, 0, 0);
  CsdScheduler(-1);
}


void startup(int argc, char *argv[])
{
  pemsg *msg;
  int i;
  
  registerAndInitialize();
  for (i=0; i<CHARES; i++) {
    msg = (pemsg *)malloc(sizeof(pemsg));
    msg->pe = CmiMyPe();
    msg->id = i;
    msg->pfnx = CpvAccess(PackFnIndex);
    msg->queuing = CQS_QUEUEING_FIFO;
    msg->priobits = 0;
    msg->prioptr = 0;
    CmiSetHandler(msg, CpvAccess(MyHandlerIndex));
    CmiPrintf("[%d] sending message %d\n", msg->pe, msg->id);
    CldEnqueue(CLD_ANYWHERE, msg, CpvAccess(InfoFnIndex));
    /*    CmiSyncSend(i, sizeof(pemsg), &msg); */
  }
}


void MyHandler(pemsg *msg)
{
  CmiPrintf("Message %d created on %d handled by %d.\n", msg->id, msg->pe, 
     CmiMyPe());
}


void InfoFn(pemsg *msg, CldPackFn *pfn, int *len, int *queuing, int *priobits, 
     unsigned int *prioptr)
{
  *pfn = (CldPackFn)CmiHandlerToFunction(msg->pfnx);
  *len = sizeof(pemsg);
  *queuing = msg->queuing;
  *priobits = msg->priobits;
  prioptr = &(msg->prioptr);
}


void PackFn(pemsg **msg)




void registerAndInitialize()

  CpvInitialize(int, MyHandlerIndex);
  CpvAccess(MyHandlerIndex) = CmiRegisterHandler(MyHandler);
  CpvInitialize(int, InfoFnIndex);
  CpvAccess(InfoFnIndex) = CldRegisterInfoFn((CldInfoFn)InfoFn);
  CpvInitialize(int, PackFnIndex);
  CpvAccess(PackFnIndex) = CldRegisterPackFn((CldPackFn)PackFn);


7 . 2 How to Write a Load Balancer for Converse /Charm++

7 . 2 . 1 Introduction

This manual details how to write your own general-purpose message-based load balancer for Converse . A Converse load balancer can be used by any Converse program, but also serves as a seed load balancer for Charm++ chare creation messages. Specifically, to use a load balancer, you would pass messages to CldEnqueue rather than directly to the scheduler. This is the default behavior with chare creation message in Charm++ . Thus, the primary provision of a new load balancer is an implementation of the CldEnqueue function.

7 . 2 . 2 Existing Load Balancers and Provided Utilities

Throughout this manual, we will occasionally refer to the source code of two provided load balancers, the random initial placement load balancer ( cldb.rand.c ) and the virtual topology-based load balancer ( cldb.neighbor.c ) which applies virtual topology including dense graph to construct neighbors. The functioning of these balancers is described in the Charm++ manual load balancing section.

In addition, a special utility is provided that allows us to add and remove load-balanced messages from the scheduler's queue. The source code for this is available in cldb.c . The usage of this utility will also be described here in detail.

7 . 3 A Sample Load Balancer

This manual steps through the design of a load balancer using an example which we will call test . The test load balancer has each processor periodically send half of its load to its neighbor in a ring. Specifically, for N processors, processor K will send approximately half of its load to (K+1)%N, every 100 milliseconds (this is an example only; we leave the genius approaches up to you).

7 . 3 . 1 Minimal Requirements

The minimal requirements for a load balancer are illustrated by the following code.


 #include <stdio.h>
#include "converse.h"


char *CldGetStrategy(void)
{
  return "test";
}


CpvDeclare(int, CldHandlerIndex);


void CldHandler(void *msg)
{
  CldInfoFn ifn; CldPackFn pfn;
  int len, queueing, priobits; unsigned int *prioptr;
  
  CmiGrabBuffer((void **)&msg);
  CldRestoreHandler(msg);
  ifn = (CldInfoFn)CmiHandlerToFunction(CmiGetInfo(msg));
  ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
  CsdEnqueueGeneral(msg, queueing, priobits, prioptr);
}


void CldEnqueue(int pe, void *msg, int infofn)
{
  int len, queueing, priobits; unsigned int *prioptr;
  CldInfoFn ifn = (CldInfoFn)CmiHandlerToFunction(infofn);
  CldPackFn pfn;

  if (pe == CLD_ANYWHERE) {
    /* do what you want with the message; in this case we'll just keep
       it local */
    ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
    CmiSetInfo(msg,infofn);
    CsdEnqueueGeneral(msg, queueing, priobits, prioptr);
  }
  else {
    /* pe contains a particular destination or broadcast */
    ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
    if (pfn) {
      pfn(&msg);
      ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
    }
    CldSwitchHandler(msg, CpvAccess(CldHandlerIndex));
    CmiSetInfo(msg,infofn);
    if (pe==CLD_BROADCAST) 
      CmiSyncBroadcastAndFree(len, msg);
    else if (pe==CLD_BROADCAST_ALL)
      CmiSyncBroadcastAllAndFree(len, msg);
    else CmiSyncSendAndFree(pe, len, msg);
  }
}


void CldModuleInit()
{
  CpvInitialize(int, CldHandlerIndex);
  CpvAccess(CldHandlerIndex) = CmiRegisterHandler(CldHandler);
  CldModuleGeneralInit();
}

The primary function a load balancer must provide is the CldEnqueue function, which has the following prototype:

void CldEnqueue (int pe, void *msg, int infofn);

This function takes three parameters: pe , msg and infofn . pe is the intended destination of the msg . pe may take on one of the following values:

CldEnqueue must handle all of these possibilities. The only case in which the load balancer should get control of a message is when pe = CLD_ANYWHERE . All other messages must be sent off to their intended destinations and passed on to the scheduler as if they never came in contact with the load balancer.

The integer parameter infofn is a handler index for a user-provided function that allows CldEnqueue to extract information about (mostly components of) the message msg .

Thus, an implementation of the CldEnqueue function might have the following structure:


 void CldEnqueue(int pe, void *msg, int infofn)
{
  ...
  if (pe == CLD_ANYWHERE)
    /* These messages can be load balanced */
  else if (pe == CmiMyPe())
    /* Enqueue the message in the scheduler locally */
  else if (pe==CLD_BROADCAST) 
    /* Broadcast to all but self */
  else if (pe==CLD_BROADCAST_ALL)
    /* Broadcast to all plus self */
  else /* Specific processor number was specified */
    /* Send to specific processor */
}

In order to fill in the code above, we need to know more about the message before we can send it off to a scheduler's queue, either locally or remotely. For this, we have the info function. The prototype of an info function must be as follows:

void ifn(msg, pfn, len, queueing, priobits, prioptr);
void *msg;
CldPackFn *pfn;
int *len, *queueing, *priobits;
unsigned int **prioptr;

Thus, to use the info function, we need to get the actual function via the handler index provided to CldEnqueue . Typically, CldEnqueue would contain the following declarations:


   int len, queueing, priobits; 
  unsigned int *prioptr;
  CldPackFn pfn;
  CldInfoFn ifn = (CldInfoFn)CmiHandlerToFunction(infofn);

Subsequently, a call to ifn would look like this:


   ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);

The info function extracts information from the message about its size, queuing strategy and priority, and also a pack function, which will be used when we need to send the message elsewhere. For now, consider the case where the message is to be locally enqueued:


   ...
  else if (pe == CmiMyPe())
    {
      ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
      CsdEnqueueGeneral(msg, queueing, priobits, prioptr);
    }
  ...

Thus, we see the info function is used to extract info from the message that is necessary to pass on to CsdEnqueueGeneral .

In order to send the message to a remote destination and enqueue it in the scheduler, we need to pack it up with a special pack function so that it has room for extra handler information and a reference to the info function. Therefore, before we handle the last three cases of CldEnqueue , we have a little extra work to do:


   ...
  else
    {
      ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
      if (pfn) {
        pfn(&msg);
        ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
      }
      CldSwitchHandler(msg, CpvAccess(CldHandlerIndex));
      CmiSetInfo(msg,infofn);
      ...

Calling the info function once gets the pack function we need, if there is one. We then call the pack function which rearranges the message leaving space for the info function, which we will need to call on the message when it is received at its destination, and also room for the extra handler that will be used on the receiving side to do the actual enqueuing. CldSwitchHandler is used to set this extra handler, and the receiving side must restore the original handler.

In the above code, we call the info function again because some of the values may have changed in the packing process.

Finally, we handle our last few cases:


   ...
      if (pe==CLD_BROADCAST) 
        CmiSyncBroadcastAndFree(len, msg);
      else if (pe==CLD_BROADCAST_ALL)
        CmiSyncBroadcastAllAndFree(len, msg);
      else CmiSyncSendAndFree(pe, len, msg);
    }
}

The above example also provides CldHandler which is used to receive messages that CldEnqueue forwards to other processors.


 CpvDeclare(int, CldHandlerIndex);


void CldHandler(void *msg)
{
  CldInfoFn ifn; CldPackFn pfn;
  int len, queueing, priobits; unsigned int *prioptr;
  
  CmiGrabBuffer((void **)&msg);
  CldRestoreHandler(msg);
  ifn = (CldInfoFn)CmiHandlerToFunction(CmiGetInfo(msg));
  ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
  CsdEnqueueGeneral(msg, queueing, priobits, prioptr);
}

Note that the CldHandler properly restores the message's original handler using CldRestoreHandler , and calls the info function to obtain the proper parameters to pass on to the scheduler. We talk about this more below.

Finally, Converse initialization functions call CldModuleInit to initialize the load balancer module.


 void CldModuleInit()
{
  CpvInitialize(int, CldHandlerIndex);
  CpvAccess(CldHandlerIndex) = CmiRegisterHandler(CldHandler);
  CldModuleGeneralInit();

  /* call other init processes here */
  CldBalance();
}

7 . 3 . 2 Provided Load Balancing Facilities

Converse provides a number of structures and functions to aid in load balancing (see cldb.c). Foremost amongst these is a method for queuing tokens of messages in a processor's scheduler in a way that they can be removed and relocated to a different processor at any time. The interface for this module is as follows:


 void CldSwitchHandler(char *cmsg, int handler)

void CldRestoreHandler(char *cmsg)

int CldCountTokens()

int CldLoad()

void CldPutToken(char *msg)

void CldGetToken(char **msg)

void CldModuleGeneralInit()

Messages normally have a handler index associated with them, but in addition they have extra space for an additional handler. This is used by the load balancer when we use an intermediate handler (typically CldHandler ) to handle the message when it is received after relocation. To do this, we use CldSwitchHandler to temporarily swap the intended handler with the load balancer handler. When the message is received, CldRestoreHandler is used to change back to the intended handler.

CldPutToken puts a message in the scheduler queue in such a way that it can be retrieved from the queue. Once the message gets handled, it can no longer be retrieved. CldGetToken retrieves a message that was placed in the scheduler queue in this way. CldCountTokens tells you how many tokens are currently retrievable. CldLoad gives a slightly more accurate estimate of message load by counting the total number of messages in the scheduler queue.

CldModuleGeneralInit is used to initialize this load balancer helper module. It is typically called from the load balancer's CldModuleInit function.

The helper module also provides the following functions:


 void CldMultipleSend(int pe, int numToSend)

int CldRegisterInfoFn(CldInfoFn fn)

int CldRegisterPackFn(CldPackFn fn)

CldMultipleSend is generally useful for any load balancer that sends multiple messages to one processor. It works with the token queue module described above. It attempts to retrieve up to numToSend messages, and then packs them together and sends them, via CmiMultipleSend, to pe . If the number and/or size of the messages sent is very large, CldMultipleSend will transmit them in reasonably sized parcels. In addition, the CldBalanceHandler and its associated declarations and initializations are required to use it.

You may want to use the three status variables. These can be used to keep track of what your LB is doing (see usage in cldb.neighbor.c and itc++queens program).


 CpvDeclare(int, CldRelocatedMessages);

CpvDeclare(int, CldLoadBalanceMessages);

CpvDeclare(int, CldMessageChunks);

The two register functions register info and pack functions, returning an index for the functions. Info functions are used by the load balancer to extract the various components from a message. Amongst these components is the pack function index. If necessary, the pack function can be used to pack a message that is about to be relocated to another processor. Information on how to write info and pack functions is available in the load balancing section of the Converse Extensions manual.

7 . 3 . 3 Finishing the Test Balancer

The test balancer is a somewhat silly strategy in which every processor attempts to get rid of half of its load by periodically sending it to someone else, regardless of the load at the destination. Hopefully, you won't actually use this for anything important!

The test load balancer is available in charm/src/Common/conv-ldb/cldb.test.c. To try out your own load balancer you can use this filename and SUPER_INSTALL will compile it and you can link it into your Charm++ programs with -balance test. (To add your own new balancers permanently and give them another name other than "test" you will need to change the Makefile used by SUPER_INSTALL. Don't worry about this for now.) The cldb.test.c provides a good starting point for new load balancers.

Look at the code for the test balancer below, starting with the CldEnqueue function. This is almost exactly as described earlier. One exception is the handling of a few extra cases: specifically if we are running the program on only one processor, we don't want to do any load balancing. The other obvious difference is in the first case: how do we handle messages that can be load balanced? Rather than enqueuing the message directly with the scheduler, we make use of the token queue. This means that messages can later be removed for relocation. CldPutToken adds the message to the token queue on the local processor.


 #include <stdio.h>
#include "converse.h"
#define PERIOD 100
#define MAXMSGBFRSIZE 100000


char *CldGetStrategy(void)
{
  return "test";
}


CpvDeclare(int, CldHandlerIndex);

CpvDeclare(int, CldBalanceHandlerIndex);

CpvDeclare(int, CldRelocatedMessages);

CpvDeclare(int, CldLoadBalanceMessages);

CpvDeclare(int, CldMessageChunks);


void CldDistributeTokens()

  int destPe = (CmiMyPe()+1)%CmiNumPes(), numToSend;

  numToSend = CldLoad() / 2;
  if (numToSend > CldCountTokens())
    numToSend = CldCountTokens() / 2;
  if (numToSend > 0)
    CldMultipleSend(destPe, numToSend);
  CcdCallFnAfter((CcdVoidFn)CldDistributeTokens, NULL, PERIOD);



void CldBalanceHandler(void *msg)

  CmiGrabBuffer((void **)&msg);
  CldRestoreHandler(msg);
  CldPutToken(msg);



void CldHandler(void *msg)

  CldInfoFn ifn; CldPackFn pfn;
  int len, queueing, priobits; unsigned int *prioptr;
  
  CmiGrabBuffer((void **)&msg);
  CldRestoreHandler(msg);
  ifn = (CldInfoFn)CmiHandlerToFunction(CmiGetInfo(msg));
  ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
  CsdEnqueueGeneral(msg, queueing, priobits, prioptr);



void CldEnqueue(int pe, void *msg, int infofn)

  int len, queueing, priobits; unsigned int *prioptr;
  CldInfoFn ifn = (CldInfoFn)CmiHandlerToFunction(infofn);
  CldPackFn pfn;

  if ((pe == CLD_ANYWHERE) && (CmiNumPes() > 1)) 
    ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
    CmiSetInfo(msg,infofn);
    CldPutToken(msg); 
   
  else if ((pe == CmiMyPe()) || (CmiNumPes() == 1)) 
    ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
    CmiSetInfo(msg,infofn);
    CsdEnqueueGeneral(msg, queueing, priobits, prioptr);
  
  else 
    ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
    if (pfn) 
      pfn(&msg);
      ifn(msg, &pfn, &len, &queueing, &priobits, &prioptr);
    
    CldSwitchHandler(msg, CpvAccess(CldHandlerIndex));
    CmiSetInfo(msg,infofn);
    if (pe==CLD_BROADCAST) 
      CmiSyncBroadcastAndFree(len, msg);
    else if (pe==CLD_BROADCAST_ALL)
      CmiSyncBroadcastAllAndFree(len, msg);
    else CmiSyncSendAndFree(pe, len, msg);
  



void CldModuleInit()

  CpvInitialize(int, CldHandlerIndex);
  CpvAccess(CldHandlerIndex) = CmiRegisterHandler(CldHandler);
  CpvInitialize(int, CldBalanceHandlerIndex);
  CpvAccess(CldBalanceHandlerIndex) = CmiRegisterHandler(CldBalanceHandler);
  CpvInitialize(int, CldRelocatedMessages);
  CpvInitialize(int, CldLoadBalanceMessages);
  CpvInitialize(int, CldMessageChunks);
  CpvAccess(CldRelocatedMessages) = CpvAccess(CldLoadBalanceMessages) = 
    CpvAccess(CldMessageChunks) = 0;
  CldModuleGeneralInit();
  if (CmiNumPes() > 1)
    CldDistributeTokens();


Now look two functions up from CldEnqueue . We have an additional handler besides the CldHandler : the CldBalanceHandler . The purpose of this special handler is to receive messages that can be still be relocated again in the future. Just like the first case of CldEnqueue uses CldPutToken to keep the message retrievable, CldBalanceHandler does the same with relocatable messages it receives. CldHandler is only used when we no longer want the message to have the potential for relocation. It places messages irretrievably in the scheduler queue.

Next we look at our initialization functions to see how the process gets started. The CldModuleInit function gets called by the common Converse initialization code and starts off the periodic load distribution process by making a call to CldDistributeTokens . The entirety of the balancing is handled by the periodic invocation of this function. It computes an approximation of half of the PE's total load ( CsdLength ()), and if that amount exceeds the number of movable messages ( CldCountTokens ()), we attempt to move all of the movable messages. To do this, we pass this number of messages to move and the number of the PE to move them to, to the CldMultipleSend function.