24 . Tuning and Developing Load Balancers

24 . 1 Load Balancing Simulation

The simulation feature of the load balancing framework allows the users to collect information about the compute WALL/CPU time and communication of the chares during a particular run of the program and use this information later to test the different load balancing strategies to see which one is suitable for the program behavior. Currently, this feature is supported only for the centralized load balancing strategies. For this, the load balancing framework accepts the following command line options:
  1. +LBDump StepStart
    This will dump the compute and the communication data collected by the load balancing framework starting from the load balancing step StepStart into a file on the disk. The name of the file is given by the +LBDumpFile option. The load balancing step in the program is numbered starting from 0. Negative value for StepStart will be converted to 0.
  2. +LBDumpSteps StepsNo
    This option specifies the number of load balancing steps for which data will be dumped to disk. If omitted, its default value is 1. The program will exit after StepsNo files are created.
  3. +LBDumpFile FileName
    This option specifies the base name of the file created with the load balancing data. If this option is not specified, the framework uses the default file lbdata.dat . Since multiple steps are allowed, a number corresponding to the step number is appended to the filename in the form Filename.# ; this applies to both dump and simulation.
  4. +LBSim StepStart
    This option instructs the framework to do the simulation starting from StepStart step. When this option is specified, the load balancing data along with the step number will be read from the file specified in the +LBDumpFile option. The program will print the results of the balancing for a number of steps given by the +LBSimSteps option, and then will exit.
  5. +LBSimSteps StepsNo
    This option is applicable only to the simulation mode. It specifies the number of load balancing steps for which the data will be dumped. The default value is 1.
  6. +LBSimProcs
    With this option, the user can change the number of processors specified to the load balancing strategy. It may be used to test the strategy in the cases where some processor crashes or a new processor becomes available. If this number is not changed since the original run, starting from the second step file, the program will print other additional information about how the simulated load differs from the real load during the run (considering all strategies that were applied while running). This may be used to test the validity of a load balancer prediction over the reality. If the strategies used during run and simulation differ, the additional data printed may not be useful.
Here is an example which collects the data for a 1000 processor run of a program

 ./charmrun pgm +p1000 +balancer RandCentLB +LBDump 2 +LBDumpSteps 4 +LBDumpFile lbsim.dat

This will collect data on files lbsim.dat.2,3,4,5. We can use this data to analyze the performance of various centralized strategies using:

 ./charmrun pgm +balancer <Strategy to test> +LBSim 2 +LBSimSteps 4 +LBDumpFile lbsim.dat
[+LBSimProcs 900]

Please note that this does not invoke the real application. In fact, ''pgm'' can be replaced with any generic application which calls centralized load balancer. An example can be found in tests/charm++/load_balancing/lb_test .

24 . 2 Future load predictor

When objects do not follow the assumption that the future workload will be the same as the past, the load balancer might not have the right information to do a good rebalancing job. To prevent this, the user can provide a transition function to the load balancer to predict what will be the future workload, given the past instrumented one. For this, the user can provide a specific class which inherits from LBPredictorFunction and implement the appropriate functions. Here is the abstract class:

 class LBPredictorFunction {

  int num_params;
  virtual void initialize_params(double *x);

  virtual double predict(double x, double *params) =0;
  virtual void print(double *params) PredictorPrintf("LB: unknown model");;
  virtual void function(double x, double *param, double &y, double *dyda) =0;

Other than these functions, the user should provide a constructor which must initialize num_params to the number of parameters the model has to learn. This number is the dimension of param and dyda in the previous functions. For the given example, the constructor is {num_params = 2;} . If the model for computation is not known, the user can leave the system to use the default function. As seen, the function can have several parameters which will be learned during the execution of the program. For this, user can be add the following command line arguments to specify the learning behavior:
  1. +LBPredictorWindow size
    This parameter specifies the number of statistics steps the load balancer will store. The greater this number is, the better the approximation of the workload will be, but more memory is required to store the intermediate information. The default is 20.
  2. +LBPredictorDelay steps
    This will tell how many load balancer steps to wait before considering the function parameters learned and starting to use the mode. The load balancer will collect statistics for a +LBPredictorWindow steps, but it will start using the model as soon as +LBPredictorDelay information are collected. The default is 10.
Moreover, another flag can be set to enable the predictor from command line: +LBPredictor .
Other than the command line options, there are some methods which can be called from the user program to modify the predictor. These methods are:

An example can be found in tests/charm++/load_balancing/lb_test/predictor .

24 . 3 Control CPU Load Statistics

Charm++ programmers can modify the CPU load data in the load balancing database before a load balancing phase starts (which is the time when load balancing database is collected and used by load balancing strategies).

In an array element, the following function can be invoked to overwrite the CPU load that is measured by the load balancing framework.

    double newTiming;

setObjTime() is defined as a method of class CkMigratable , which is the superclass of all array elements.

The users can also retrieve the current timing that the load balancing runtime has measured for the current array element using getObjTime() .

    double measuredTiming; 
   measuredTiming = getObjTime(); 

This is useful when the users want to derive a new CPU load based on the existing one.

24 . 4 Model-based Load Balancing

The user can choose to feed load balancer with their own CPU timing for each Chare based on certain computational model of the applications.

To do so, in the array element's constructor, the user first needs to turn off automatic CPU load measurement completely by setting

    usesAutoMeasure = false;

The user must also implement the following function to the chare array classes:

    virtual void CkMigratable::UserSetLBLoad();      // defined in base class

This function serves as a callback that is called on each chare object when AtSync() is called and ready to do load balancing. The implementation of UserSetLBLoad() is simply to set the current chare object's CPU load in load balancing framework. setObjTime() described above can be used for this.

24 . 5 Writing a new load balancing strategy

Charm++ programmers can choose an existing load balancing strategy from Charm++ 's built-in strategies(see  7.2 ) for the best performance based on the characteristics of their applications. However, they can also choose to write their own load balancing strategies.

The Charm++ load balancing framework provides a simple scheme to incorporate new load balancing strategies. The programmer needs to write their strategy for load balancing based on the instrumented ProcArray and ObjGraph provided by the load balancing framework. This strategy is implemented within this function:

 void FooLB::work(LDStats *stats) {
  /** ========================== INITIALIZATION ============================= */
  ProcArray *parr = new ProcArray(stats);
  ObjGraph *ogr = new ObjGraph(stats);

  /** ============================= STRATEGY ================================ */
  /// The strategy goes here
  /// The strategy goes here
  /// The strategy goes here
  /// The strategy goes here
  /// The strategy goes here

  /** ============================== CLEANUP ================================ */

Figure  24.1 explains the two data structures available to the strategy: ProcArray and ObjGraph. Using them, the strategy should assign objects to new processors where it wants to be migrated through the setNewPe() method. src/ck-ldb/GreedyLB.C can be referred.

Figure 24.1: ProcArray and ObjGraph data structures to be used when writing a load balancing strategy
Image ckgraph

Incorporating this strategy into the Charm++ build framework is explained in the next section.

24 . 6 Adding a load balancer to Charm++

Let us assume that we are writing a new centralized load balancer called FooLB. The next few steps explain the steps of adding the load balancer to the Charm++ build system:

  1. Create files named, FooLB.h and FooLB.C in directory of src/ck-ldb . One can choose to copy and rename the files GraphPartLB.* and rename the class name in those files.

  2. Implement the strategy in the FooLB class method -- FooLB::work(LDStats* stats) as described in the previous section.

  3. Build charm for your platform (This will create the required links in the tmp directory).

  4. To compile the strategy files, first add FooLB in the ALL_LDBS list in charm/tmp/ Also comment out the line containing UNCOMMON_LDBS in If FooLB will require some libraries at link time, you also need to create the dependency file called libmoduleFooLB.dep. Run the script in charm/tmp, which creates the new Makefile named ``''.

  5. Run ``make depends'' to update dependence rule of Charm++ files. And run ``make charm++'' to compile Charm++ which includes the new load balancing strategy files.

24 . 7 Understand Load Balancing Database Data Structure

To write a load balancing strategy, you need to know what information is measured during the runtime and how it is represented in the load balancing database data structure.

There are mainly 3 categories of information: a) processor information including processor speed, background load; b) object information including per object CPU/WallClock compute time and c) communication information .

The database data structure named L DStats is defined in CentralLB.h :

  struct ProcStats {  // per processor
    LBRealType total_walltime;
    LBRealType total_cputime;
    LBRealType idletime;
    LBRealType bg_walltime;
    LBRealType bg_cputime;
    int pe_speed;
    double utilization;
    bool available;
    int   n_objs;

  struct LDStats { // load balancing database
    ProcStats  *procs;
    int count;

    int   n_objs;
    int   n_migrateobjs;
    LDObjData* objData;

    int   n_comm;
    LDCommData* commData;

    int  *from_proc, *to_proc;

  1. LBRealType is the data type for load balancer measured time. It is "double" by default. User can specify the type to float at Charm++ compile time if want. For example, ./build charm++ net-linux-x86_64 -with-lbtime-type=float;
  2. procs array defines processor attributes and usage data for each processor;
  3. objData array records per object information, LDObjData is defined in lbdb.h ;
  4. commData array records per communication information. LDCommData is defined in lbdb.h .