

7. Load Balancing

Load balancing in Charm++ is enabled by its ability to place, or migrate, chares or chare array elements. A typical application exploiting this feature constructs many more chares than processors and enables their runtime migration.

Iterative applications, which are commonplace in physical simulations, are the most suitable target for Charm++'s measurement-based load balancing techniques. Such applications may contain a series of time steps and/or iterative solvers that run to convergence. For such computations, a heuristic principle that we call the ``principle of persistence'' typically holds: the computational loads and communication patterns between objects (chares) tend to persist over multiple iterations, even in dynamic applications. In such cases, the recent past is a good predictor of the near future, and measurement-based chare migration strategies are useful. Currently these apply to chare-array elements, but they may be extended to chares in the future.

For applications without such iterative structure, or with iterative structure but without predictability (i.e., where the principle of persistence does not apply), Charm++ supports ``seed balancers'' that move ``seeds'' for new chares among processors (possibly repeatedly) to achieve load balance. These strategies are currently available for both chares and chare arrays. Seed balancers were the original load balancers provided in Charm since the late 1980s. They are extremely useful for state-space search applications, and are also useful in other computations as well as in conjunction with migration strategies.

For iterative computations in which there is some correlation between iterations/steps, but it is either not strong or the machine environment is not predictable (due to noise from OS interrupts on small time steps, or time-shared desktop machines), one can use a combination of the two kinds of strategies. The baseline load balancing is provided by migration strategies, but in each iteration one also spawns off work in the form of chares that can run on any processor. The seed balancer handles such work as it arises.

Examples are in examples/charm++/load_balancing and tests/charm++/load_balancing.


7.1 Measurement-based Object Migration Strategies

In Charm++, objects (except groups and nodegroups) can migrate from processor to processor at runtime. Object migration can potentially improve the performance of a parallel program by moving objects from overloaded processors to underloaded ones.

Charm++ implements a generic, measurement-based load balancing framework which automatically instruments all Charm++ objects, collects computation load and communication structure during execution, and stores them in a load balancing database. Charm++ then provides a collection of load balancing strategies whose job is to decide on a new mapping of objects to processors based on the information in the database. Such measurement-based strategies are effective when objects in a Charm++ application can reasonably be assumed to exhibit temporal correlation in their computation and communication patterns, i.e., the future can to some extent be predicted from the historical measurement data, allowing effective load balancing without application-specific knowledge.

Two key terms in the Charm++ load balancing framework are:


7.2 Available Load Balancing Strategies

Load balancing can be performed in a centralized, fully distributed, or hierarchical fashion.

In centralized approaches, the entire machine's load and communication structure is accumulated at a single point, typically processor 0, followed by a decision-making process that determines the new distribution of Charm++ objects. Centralized load balancing requires synchronization, which may incur overhead and delay. However, because the decision process has global knowledge of the entire platform, it tends to be more accurate.

In distributed approaches, load data is exchanged only among neighboring processors, and there is no global synchronization. However, such approaches will not, in general, restore load balance immediately; the process is iterated until load balance is achieved.

In hierarchical approaches, processors are divided into independent, autonomous processor groups, and these groups are organized into hierarchies, thereby decentralizing the load balancing task. Different strategies can be used to balance the load on processors inside each group and across groups in a hierarchical fashion.

Listed below are some of the available non-trivial centralized load balancers and their brief descriptions:

Listed below are the distributed load balancers:

An example of a hierarchical strategy can be found in:

Users can choose any load balancing strategy they think is appropriate for their application. The compiler and runtime options are described in Section 7.6.


7.3 Load Balancing Chare Arrays

The load balancing framework is well integrated with the chare array implementation: when a chare array is created, it automatically registers its elements with the load balancing framework. The instrumentation of compute time (wall/CPU time) and communication patterns is done automatically, and APIs are provided for users to trigger load balancing. To use the load balancer, you must make your array elements migratable (see the migration section above) and choose a load balancing strategy (see Section 7.2 for a description of the available load balancing strategies).

There are three different ways to use load balancing for chare arrays, meeting different application needs. These methods differ in how and when a load balancing phase starts. The three methods are: periodic load balancing mode, at sync mode, and manual mode.

In periodic load balancing mode, a user specifies only how often load balancing should occur, using the +LBPeriod runtime option to specify the time interval.

In at sync mode, the application explicitly invokes the load balancer at appropriate points (generally at a pre-existing synchronization boundary) by inserting a call to AtSync in the application source code.

In the prior two load balancing modes, users do not need to worry about how to start load balancing. However, there is one scenario in which these automatic modes fail to work: when array elements are created by dynamic insertion. This is because both modes require the application to have a fixed number of objects at the time of load balancing; the array manager needs to maintain a head count of local array elements for the local barrier. In this case, the application must use manual mode to trigger the load balancer.

The detailed APIs of these three methods are described as follows:

  1. Periodic load balancing mode: In the default setting, load balancing happens whenever the array elements are ready, with an interval of 1 second. It is desirable for the application to set a larger interval using the +LBPeriod runtime option. For example, ``+LBPeriod 5.0'' can be used to start load balancing roughly every 5 seconds. By default, array elements may be asked to migrate at any time, provided that they are not in the middle of executing an entry method. This default behavior corresponds to the array element's usesAtSync variable being false.
  2. At sync mode: Using this method, elements can be migrated only at certain points in the execution, namely when the application invokes AtSync(). In order to use at sync mode, one should set usesAtSync to true in the array element constructor. When an element is ready to migrate, it calls AtSync(). When all local elements have called AtSync, the load balancer is triggered. Once all migrations are completed, the load balancer calls the virtual function ArrayElement::ResumeFromSync() on each of the array elements. This function can be redefined in the application.

    Note that the minimum time between AtSync() load balancing steps is controlled by the LBPeriod. If you need unusually frequent load balancing (more often than every 500 ms), set this value via +LBPeriod or SetLBPeriod() to a number shorter than your load balancing interval.

    Note that AtSync() is not a blocking call; it merely gives a hint to the load balancer that it is time for load balancing. Between AtSync and ResumeFromSync, the object may be migrated. One can choose to let objects continue working on incoming messages; however, keep in mind that the object may suddenly show up on another processor, and make sure that no operations that could prevent migration are performed. This is the automatic way of doing load balancing, in which the application does not need to define ResumeFromSync().

    The more commonly used approach is to force the object to be idle until load balancing finishes. The user places an AtSync call at the end of some iteration, and when all elements reach that call, load balancing is triggered. The objects can start executing again when ResumeFromSync() is called. In this case, the user redefines ResumeFromSync() to trigger the next iteration of the application. This manual way of using at sync mode results in a barrier at load balancing (see the example in Section 7.8).

  3. Manual mode: The load balancer can also be started manually. To switch to manual mode, the application calls TurnManualLBOn() on every processor, which prevents the load balancer from starting automatically. TurnManualLBOn() should be called as early as possible in the program. It can be called in the initialization part of the program, for example from a global variable constructor or in an initproc call (Section 9.1). It can also be called in the constructor of a static array or before the doneInserting call for a dynamic array. It can be called multiple times on one processor, but only the last call takes effect.

    The function call CkStartLB() starts load balancing immediately. This call should be made at only one place on only one processor. The function is not blocking; the object will continue to process messages, and the load balancing, when triggered, happens in the background.

    TurnManualLBOff() turns off manual load balancing and switches back to the automatic load balancing mode. A minimal sketch of manual mode appears below.
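
Below is a minimal sketch of manual mode. The module, routine, and chare names are illustrative; only TurnManualLBOn(), CkStartLB(), and TurnManualLBOff() are the calls described above, and the sketch assumes the .ci file declares the initproc routine shown.

    /* Manual mode sketch. The .ci file is assumed to declare
         initproc void turnOnManualLB();
       so that this routine runs once on every processor at startup. */
    void turnOnManualLB() {
      TurnManualLBOn();              // prevent the load balancer from starting automatically
    }

    /* Later, from one place on only one processor (e.g. once per application phase): */
    void Driver::startBalancing() {  // Driver is an illustrative chare
      CkStartLB();                   // non-blocking; balancing proceeds in the background
    }

    /* To switch back to automatic load balancing: */
    //   TurnManualLBOff();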


7.4 Migrating objects

Load balancers migrate objects automatically. For an array element to migrate, the user must provide a ``pup'' routine; see Section 6.3 for how to write a ``pup'' for an array element.

In general, one needs to pack a complete snapshot of the member data of an array element in the pup routine, because migration of the object may happen at any time. In load balancing schemes where the user explicitly controls when load balancing occurs, the user may choose to pack only part of the data and skip temporary data.

An array element can migrate itself by calling the migrateMe(destination processor) member function; this call must be the last action in an element entry method. The system can also migrate array elements for load balancing (see Section 7.3).
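
For example (the class and entry method names below are illustrative), an element might relocate itself as the final action of an entry method:

    // Illustrative entry method of a 1D array element; migrateMe must be its last action.
    void MyElement::relocate(int destPe) {
      // ... complete any work this entry method needs to do first ...
      migrateMe(destPe);   // ask the runtime to move this element to processor destPe
    }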

To migrate your array element to another processor, the Charm++ runtime will:

Migration constructors, then, are normally empty; all the unpacking and allocation of data items is done in the element's pup routine. Deallocation is done in the element destructor as usual.
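
As a minimal sketch (the class, member names, and the .ci declaration it assumes are illustrative), an array element whose state includes heap-allocated data might be written as:

    class Particles : public CBase_Particles {
      int n;             // number of particles owned by this element
      double *coords;    // heap-allocated state, 3*n doubles

    public:
      Particles() : n(0), coords(NULL) {}
      Particles(CkMigrateMessage *msg) : coords(NULL) {}   // migration constructor: empty

      void pup(PUP::er &p) {
        CBase_Particles::pup(p);
        p | n;
        if (p.isUnpacking()) coords = new double[3 * n];   // allocate on the new processor
        PUParray(p, coords, 3 * n);                        // pack/unpack the data itself
      }

      ~Particles() { delete[] coords; }   // deallocation in the destructor, as usual
    };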

7.5 Other utility functions

There are several utility functions that can be called in applications to configure the load balancer, etc. These functions are:


7.6 Compiler and runtime options to use the load balancing module

Load balancing strategies are implemented as libraries in Charm++. This allows programmers to easily experiment with different existing strategies by simply linking a pool of strategy modules and choosing one to use at runtime via a command line option.

Note: linking a load balancing module is different from activating it:

Below are descriptions of the compiler and runtime options; a combined usage example follows this list:

  1. compile time options:

    • -module NeighborLB -module GreedyCommLB ...
      links the modules NeighborLB, GreedyCommLB, etc. into an application, but these load balancers will remain inactive at execution time unless overridden by other runtime options.
    • -module CommonLBs
      links a special module CommonLBs which includes some commonly used Charm++ built-in load balancers. The commonly used load balancers include BlockLB, CommLB, DummyLB, GreedyAgentLB, GreedyCommLB, GreedyLB, NeighborCommLB, NeighborLB, OrbLB, PhasebyArrayLB, RandCentLB, RecBipartLB, RefineLB, RefineCommLB, RotateLB, TreeMatchLB, RefineSwapLB, CommAwareRefineLB .
    • -balancer GreedyCommLB
      links the load balancer GreedyCommLB and invokes it at runtime.
    • -balancer GreedyCommLB -balancer RefineLB
      invokes GreedyCommLB at the first load balancing step and RefineLB in all subsequent load balancing steps.
    • -balancer ComboCentLB:GreedyLB,RefineLB
      You can create a new combination load balancer made of multiple load balancers. In the above example, GreedyLB and RefineLB strategies are applied one after the other in each load balancing step.

    The list of existing load balancers is given in Section 7.2. Note: you can have multiple -module *LB options. LB modules are linked into the program but are not activated automatically at runtime. Using -balancer A at compile time will activate load balancer A automatically at runtime. Having -balancer A implies -module A, so you do not have to write -module A again, although doing so is not invalid. Using CommonLBs is a convenient way to link against the commonly used existing load balancers. One such load balancer, called MetisLB, requires the METIS library, which is located at:

    charm/src/libs/ck-libs/parmetis/METISLib/.

    A prerequisite for using this library is to compile the METIS library by running ``make METIS'' under charm/tmp after compiling Charm++.

  2. runtime options:

    Runtime balancer selection options are similar to the compile time options as described above, but they can be used to override those compile time options.

    • +balancer help
      displays all available balancers that have been linked in.
    • +balancer GreedyCommLB
      invokes GreedyCommLB.
    • +balancer GreedyCommLB +balancer RefineLB
      invokes GreedyCommLB at the first load balancing step and RefineLB in all subsequent load balancing steps.
    • +balancer ComboCentLB:GreedyLB,RefineLB
      same as the example in the -balancer compile time option.

    Note: the +balancer option works only if the corresponding load balancer module has been linked at compile time. Giving +balancer with a wrong LB name results in a runtime error. When you have used -balancer A as a compile time option, you do not need to use +balancer A again to activate it at runtime. However, you can use +balancer B to override the compile time option and activate B instead of A.

  3. Handling the case that no load balancer is activated by users

    When no balancer is linked by the user, but the program counts on a load balancer because it uses AtSync() and expects ResumeFromSync() to be called to continue, a special load balancer called NullLB is automatically created to run the program. This default load balancer simply calls ResumeFromSync() after AtSync(), which keeps the program from hanging after calling AtSync(). NullLB is suppressed if another load balancer is created.

  4. Other useful runtime options

    There are a few other runtime options for load balancing that may be useful:

    • +LBDebug {verbose level}
      {verbose level} can be any non-negative integer; 0 turns off verbose output. This option asks the load balancer to output load balancing information to stdout. The larger the verbose level, the more verbose the output.
    • +LBPeriod {seconds}
      {seconds} can be any floating-point number. This option sets the minimum period in seconds between two consecutive load balancing steps. The default value is 1 second; that is, a load balancing step will not happen until 1 second after the previous load balancing step.
    • +LBSameCpus
      This option simply tells the load balancer that all processors are of the same speed. The load balancer will then skip the measurement of CPU speed at runtime.
    • +LBObjOnly
      This tells the load balancer to ignore processor background load when making migration decisions.
    • +LBSyncResume
      After a load balancing step, a processor can normally resume computation once all objects have arrived on that processor, even while other processors are still working on migrations. If this turns out to be a problem, i.e. some processors start computing while others are still busy migrating objects, then this option can be used to force a global barrier so that processors resume computation only after migrations are completed on all processors.
    • +LBOff
      This option turns off load balancing instrumentation of both CPU and communication usage at startup time.
    • +LBCommOff
      This option turns off load balancing instrumentation of communication at startup time. Instrumentation of CPU usage is left on.
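
As an illustration of how these options combine, a possible build and launch might look like the following sketch (the program name pgm, its object file, and the processor count are placeholders):

    # Link several common balancers and pick GreedyLB as the compile-time default
    charmc -o pgm pgm.o -module CommonLBs -balancer GreedyLB

    # Run on 8 processors, overriding the compile-time choice with RefineLB,
    # printing load balancing diagnostics and balancing at most once every 5 seconds
    ./charmrun +p8 ./pgm +balancer RefineLB +LBDebug 1 +LBPeriod 5.0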


7.7 Seed load balancers - load balancing Chares at creation time

Seed load balancing involves the movement of object creation messages, or "seeds", to balance work across a set of processors. This scheme is used to balance chares at creation time; after the chare constructor has executed on a processor, the seed balancer does not migrate it. Several seed load balancers with different movement strategies are currently available. Examples can be found in examples/charm++/NQueen.

  1. random
    A strategy that places seeds randomly when they are created and does no movement of seeds thereafter. This is used as the default seed load balancer.
  2. neighbor
    A strategy which imposes a virtual topology on the processors; load exchange happens among neighbors only. A processor initiates load balancing and sends work to its neighbors when it becomes overloaded. The default topology is mesh2D; one can use a command line option to choose another topology such as ring, mesh3D, or dense graph.
  3. spray
    A strategy which imposes a spanning-tree organization on the processors and computes a global average load via periodic reductions among all processors. It uses this average to determine how seeds should be distributed.
  4. workstealing
    A strategy in which an idle processor requests work from a randomly chosen processor and steals chares from it.

Other strategies can also be explored by following the simple API of the seed load balancer.

Compile and run time options for seed load balancers

To choose a seed load balancer other than the default rand strategy, use the link time command line option -balance foo.

When using the neighbor seed load balancer, one can also specify the virtual topology at runtime using +LBTopo topo, where topo can be one of: (a) ring, (b) mesh2d, (c) mesh3d, and (d) graph.
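
For example, a possible build and launch using the neighbor seed balancer might be (the program name pgm and the processor count are placeholders):

    # Link the neighbor seed load balancer at link time
    charmc -o pgm pgm.o -balance neighbor

    # Run on 16 processors, selecting a 3D mesh virtual topology at runtime
    ./charmrun +p16 ./pgm +LBTopo mesh3d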

To write a seed load balancer, name your file cldb.foo.c, where foo is the strategy name. Compile it as a library under charm/lib, named libcldb-foo.a, where foo is the strategy name used above. One can then use -balance foo as a compile time option to charmc to link with the foo seed load balancer.


7.8 Simple Load Balancer Usage Example - Automatic with Sync LB

A simple example of how to use a load balancer in sync mode in one's application is presented below.

    
     /*** lbexample.ci ***/
     mainmodule lbexample {
       readonly CProxy_Main mainProxy;
       readonly int nElements;
       mainchare Main {
         entry Main(CkArgMsg *m);
         entry void done(void);
       };
       array [1D] LBExample {
         entry LBExample(void);
         entry void doWork();
       };
     };
    
   

----------------------------------------

    
     /*** lbexample.C ***/
     #include <stdio.h>
     #include "lbexample.decl.h"
     /*readonly*/ CProxy_Main mainProxy;
     /*readonly*/ int nElements;
     #define MAX_WORK_CNT 50
     #define LB_INTERVAL 5
     /*mainchare*/
     class Main : public CBase_Main
     {
     private:
       int count;
     public:
       Main(CkArgMsg* m)
       {
         /*....Initialization....*/
         count = 0;          // completion counter for done() callbacks
         nElements = 16;     // number of array elements (could instead be read from m->argv)
         mainProxy = thisProxy;
         CProxy_LBExample arr = CProxy_LBExample::ckNew(nElements);
         arr.doWork();
       };
       void done(void)
       {
         count++;
         if(count==nElements){
           CkPrintf("All done");
           CkExit();
         }
       };
     };
     /*array [1D]*/
     class LBExample : public CBase_LBExample
     {
     private:
       int workcnt;
     public:
       LBExample()
       {
         workcnt=0;
         /* May initialize some variables to be used in doWork */
         //Must be set to true to make AtSync work
         usesAtSync = true;
       }
       LBExample(CkMigrateMessage *m) { /* Migration constructor - invoked when chare migrates */ }
       
       /* Must be written for migration to succeed */
       void pup(PUP::er &p){
         CBase_LBExample::pup(p);
         p|workcnt;
         /* There may be some more variables used in doWork */
       }
      
       void doWork()
       {
         /* Do work proportional to the chare index to see the effects of LB */
         
         workcnt++;
         if(workcnt==MAX_WORK_CNT){
           mainProxy.done();
           return;   // this element has finished all of its work
         }
         
         if(workcnt%LB_INTERVAL==0)
           AtSync();
         else
           doWork();
       }
       
       void ResumeFromSync(){
         doWork();
       }
     };
     #include "lbexample.def.h"