Bug #1275

DistributedLB: Objects not migrating after strategy runs

Added by Phil Miller about 2 years ago. Updated 9 months ago.

Load Balancing



Will attach a demonstration case shortly

simple-pic-charm.tgz (14.9 KB) Phil Miller, 11/02/2016 02:00 PM


#1 Updated by Sam White about 2 years ago

Eric M just submitted a patch for an issue with DistributedLB:

#2 Updated by Phil Miller about 2 years ago

Neither of the pending patches in Gerrit resolves this issue.

Demonstration code attached.

#3 Updated by Kavitha Chandrasekar about 2 years ago

The migration phase after the strategy phase seems to work okay for me. In the Makefile, the test uses 2 PEs. I believe we need at least 3 PEs, since each underloaded PE sends info to 2 random neighbors. With +p2 it would hang trying to find a second neighbor:

do {
  rand_nbor2 = rand() % CkNumPes();
} while ((rand_nbor2 == CkMyPe()) || (rand_nbor2 == rand_nbor1));
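One way to avoid the hang is to draw neighbors only from the PEs that are still eligible, and to return fewer than two neighbors when fewer exist. The following is a minimal standalone sketch, not the DistributedLB code: pickNeighbors and its parameters are hypothetical names, and my_pe and num_pes stand in for CkMyPe() and CkNumPes().

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical helper (not part of DistributedLB): returns up to two distinct
// neighbor PEs, neither equal to my_pe. With fewer than 3 PEs it returns
// fewer than two neighbors instead of looping forever.
std::vector<int> pickNeighbors(int my_pe, int num_pes, std::mt19937& rng) {
  assert(num_pes >= 1 && my_pe >= 0 && my_pe < num_pes);
  std::vector<int> nbors;
  const int others = num_pes - 1;  // candidate PEs, excluding my_pe
  if (others == 0) return nbors;

  // Draw an index among the other PEs, then shift past my_pe.
  std::uniform_int_distribution<int> first(0, others - 1);
  int n1 = first(rng);
  if (n1 >= my_pe) ++n1;
  nbors.push_back(n1);
  if (others == 1) return nbors;

  // Draw among the remaining PEs, then shift past both excluded values,
  // lower one first so the second shift sees the adjusted index.
  std::uniform_int_distribution<int> second(0, others - 2);
  int n2 = second(rng);
  const int lo = std::min(my_pe, n1);
  const int hi = std::max(my_pe, n1);
  if (n2 >= lo) ++n2;
  if (n2 >= hi) ++n2;
  nbors.push_back(n2);
  return nbors;
}
```

Because each draw is made over exactly the eligible PEs, no retry loop is needed, and the +p2 case simply yields a single neighbor.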

#4 Updated by Phil Miller about 2 years ago

The potential to hang on 2 PEs seems like it should also be fixed.

I was running on 40 PEs when I observed the issue. I think I know what could be causing the lack of migration in that case. The overloaded PEs have objects that each would take an underloaded PE all the way past the average, past the overload threshold, to overloaded. Right now, the underloaded PEs will reject the load in that case, and the overloaded PE will keep trying to find a recipient. In this circumstance, it will never find a willing recipient. Does that make sense?

If my hypothesis is right, I think what the strategy should do instead is accept load transfers if the resulting arrangement is sufficiently less overloaded than it was without making the transfer. Maybe the previously underloaded PE can then also try to shed some of its load. Or, if we're planning on using MetaLB anyway, we can just let it take some of the overload, improving the situation to some degree, and trust that the strategy will be run again soon.

#5 Updated by Sam White almost 2 years ago

  • Category changed from Migration to Load Balancing

#6 Updated by Kavitha Chandrasekar almost 2 years ago

  • Status changed from New to In Progress

As you suggest, the transfer might make the receiver PE overloaded, so the objects are not migrated. Before sending to an underloaded PE, the donor PE checks that transferring the object to the receiver PE does not make it overloaded (i.e., (p_load + obj->load) < avg_load).

To check whether the arrangement becomes sufficiently less overloaded overall, we could use the condition ((p_load + obj->load) < avg_load) || ((p_load + obj->load) < my_load), which could increase the number of migrations performed. When run on a single node, this logic seems to increase the initial migrations. Testing at scale is still needed.
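To make the difference between the two checks concrete, here is a small standalone sketch. The function names acceptsStrict and acceptsRelaxed are hypothetical; p_load, avg_load, and my_load follow the names used above, with obj_load standing in for obj->load.

```cpp
#include <cassert>

// Strict check: the receiver must stay below the average after the transfer.
bool acceptsStrict(double p_load, double obj_load, double avg_load) {
  assert(obj_load >= 0.0);
  return (p_load + obj_load) < avg_load;
}

// Relaxed check: additionally allow the transfer when the receiver ends up
// lighter than the donor currently is (my_load is the donor's load before
// the transfer), so the heavier of the two PEs gets lighter.
bool acceptsRelaxed(double p_load, double obj_load, double avg_load,
                    double my_load) {
  return acceptsStrict(p_load, obj_load, avg_load) ||
         (p_load + obj_load) < my_load;
}
```

For example, with avg_load = 10, a donor at my_load = 20, a receiver at p_load = 5, and an object of load 8, the strict check rejects the transfer (13 is not below 10) while the relaxed check accepts it (13 is below 20). After the transfer the pair sits at 12 and 13 instead of 20 and 5.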

#7 Updated by Phil Miller over 1 year ago

  • Priority changed from Normal to High

#8 Updated by Michael Robson over 1 year ago

  • Tags set to changa, namd, openatom

#9 Updated by Kavitha Chandrasekar over 1 year ago

Submitted a patch to fix the hang when there are only 2 PEs.

#10 Updated by Ronak Buch over 1 year ago

Does your patch fix this in general or just for the 2 PE case, Kavitha?

#11 Updated by Kavitha Chandrasekar over 1 year ago

It fixes only the 2 PE case. The migration issue will likely be fixed with a separate patch/branch. With the fixes in that patch/branch, objects usually migrate as expected.

#12 Updated by Kavitha Chandrasekar over 1 year ago

I have updated the Gerrit patch with Harshitha's fix from her branch. This version of DistributedLB performs load balancing in multiple phases: when the target threshold is not met, it attempts another phase of load balancing with an updated threshold. For a simple stencil load-balancing example, it performs migrations and balances load as expected.
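The multi-phase idea can be illustrated with a toy, sequential sketch. This is not the patch itself: how the real code updates its threshold is not described here, so the sketch assumes the acceptance limit starts at the average load and is relaxed by a fixed step after each phase that misses the target. All names (PeObjects, peLoad, runPhase, balance, target_ratio, relax_step, max_phases) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using PeObjects = std::vector<std::vector<double>>;  // object loads per PE

double peLoad(const std::vector<double>& objs) {
  return std::accumulate(objs.begin(), objs.end(), 0.0);
}

// One phase: every PE above `limit` offers its objects to the first PE that
// would stay at or below `limit` after receiving. Returns migrations made.
int runPhase(PeObjects& pes, double limit) {
  int migrations = 0;
  for (std::size_t src = 0; src < pes.size(); ++src) {
    std::size_t i = 0;
    while (i < pes[src].size() && peLoad(pes[src]) > limit) {
      const double obj = pes[src][i];
      bool moved = false;
      for (std::size_t dst = 0; dst < pes.size() && !moved; ++dst) {
        if (dst != src && peLoad(pes[dst]) + obj <= limit) {
          pes[dst].push_back(obj);
          pes[src].erase(pes[src].begin() + static_cast<std::ptrdiff_t>(i));
          ++migrations;
          moved = true;
        }
      }
      if (!moved) ++i;  // nobody accepted this object; try the next one
    }
  }
  return migrations;
}

// Runs phases until the maximum PE load is within target_ratio * average,
// relaxing the limit by relax_step (as a fraction of the average) each time.
int balance(PeObjects& pes, double target_ratio, double relax_step,
            int max_phases) {
  assert(!pes.empty() && relax_step > 0.0);
  double total = 0.0;
  for (const auto& objs : pes) total += peLoad(objs);
  const double avg = total / static_cast<double>(pes.size());
  double ratio = 1.0;  // assumption: first phase allows no PE above average
  int migrations = 0;
  for (int phase = 0; phase < max_phases; ++phase) {
    migrations += runPhase(pes, ratio * avg);
    double max_load = 0.0;
    for (const auto& objs : pes) max_load = std::max(max_load, peLoad(objs));
    if (max_load <= target_ratio * avg) break;
    ratio += relax_step;  // target not met: retry with a looser limit
  }
  return migrations;
}
```

With four PEs holding objects {8, 8, 8}, {2}, {2}, and {} (average 7), a single phase at the strict limit moves nothing, because any object of load 8 would push a receiver above the average. Allowing further phases with target_ratio = 1.5 and relax_step = 0.25 produces two migrations and leaves the PE loads at 8, 10, 2, and 8.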

#13 Updated by Phil Miller over 1 year ago

  • Target version changed from 6.8.0 to 6.9.0

#14 Updated by Sam White 11 months ago

  • Status changed from In Progress to Implemented

#15 Updated by Kavitha Chandrasekar 10 months ago

  • Assignee changed from Kavitha Chandrasekar to Eric Mikida

#16 Updated by Juan Galvez 9 months ago

  • Status changed from Implemented to Merged
