DistributedLB: Objects not migrating after strategy runs
Will attach a demonstration case shortly
#3 Updated by Kavitha Chandrasekar over 1 year ago
The migration phase after the strategy phase seems to work for me. In the Makefile, the test uses 2 PEs. I believe we need at least 3 PEs, since each underloaded PE sends its info to 2 random neighbors. With +p2 it would hang while trying to find a second neighbor:
do {
  rand_nbor2 = rand() % CkNumPes();
} while ((rand_nbor2 == CkMyPe()) || (rand_nbor2 == rand_nbor1));
#4 Updated by Phil Miller over 1 year ago
The potential to hang on 2 PEs seems like it should also be fixed.
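One possible way to remove the hang is to cap the number of neighbors requested at the number of other PEs available. The sketch below is plain C++ rather than Charm++ code: pick_neighbors and its parameters are hypothetical names, and my_pe and num_pes stand in for CkMyPe() and CkNumPes().

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical helper, not part of DistributedLB. Returns up to `want`
// distinct PEs, none equal to my_pe. When fewer than `want` other PEs exist,
// it returns as many as there are instead of looping forever.
std::vector<int> pick_neighbors(int my_pe, int num_pes, int want,
                                std::mt19937& rng) {
  assert(num_pes > 0 && my_pe >= 0 && my_pe < num_pes);
  // Never ask for more neighbors than there are other PEs.
  const int limit = std::max(0, std::min(want, num_pes - 1));
  std::vector<int> picked;
  std::uniform_int_distribution<int> dist(0, num_pes - 1);
  while (static_cast<int>(picked.size()) < limit) {
    const int cand = dist(rng);
    if (cand == my_pe) continue;  // skip ourselves
    if (std::find(picked.begin(), picked.end(), cand) != picked.end())
      continue;                   // skip duplicates
    picked.push_back(cand);
  }
  return picked;
}
```

With 2 PEs this returns the single other PE, and with 1 PE it returns an empty list, so the caller has to handle getting fewer neighbors than it asked for.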
I was running on 40 PEs when I observed the issue. I think I know what could be causing the lack of migration in that case. The overloaded PEs have objects that each would take an underloaded PE all the way past the average, past the overload threshold, to overloaded. Right now, the underloaded PEs will reject the load in that case, and the overloaded PE will keep trying to find a recipient. In this circumstance, it will never find a willing recipient. Does that make sense?
If my hypothesis is right, I think what the strategy should do instead is accept load transfers if the resulting arrangement is sufficiently less overloaded than it was without making the transfer. Maybe the previously underloaded PE can then also try to shed some of its load. Or, if we're planning on using MetaLB anyway, we can just let it take some of the overload, improving the situation to some degree, and trust that the strategy will be run again soon.
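To make the "sufficiently less overloaded" idea concrete, here is a minimal sketch of one way it could be expressed: compare the worse of the two PEs before and after the transfer. The function name transfer_improves and the min_gain margin are assumptions for illustration, not anything in the current strategy.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical acceptance test: the transfer is worthwhile if the more loaded
// of the two PEs ends up lower than before by more than min_gain.
bool transfer_improves(double donor_load, double receiver_load,
                       double obj_load, double min_gain = 0.0) {
  assert(obj_load >= 0.0 && obj_load <= donor_load);
  const double before = std::max(donor_load, receiver_load);
  const double after =
      std::max(donor_load - obj_load, receiver_load + obj_load);
  return (before - after) > min_gain;
}
```

For example, with a donor at 10, a receiver at 2, and an average of 5, moving an object of load 4 leaves both PEs at 6. The receiver is now above the average, but the pair's maximum dropped from 10 to 6, so the transfer is accepted. Moving an object of load 9 would push the receiver to 11 and is rejected.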
#6 Updated by Kavitha Chandrasekar over 1 year ago
- Status changed from New to In Progress
As you suggest, a transfer might make the receiver PE overloaded, in which case the object is not migrated. Before sending to an underloaded PE, the donor PE checks that transferring the object does not make the receiver overloaded, i.e. (p_load + obj->load) < avg_load.
To check whether the arrangement becomes sufficiently less overloaded overall, we could use the condition ((p_load + obj->load) < avg_load) || ((p_load + obj->load) < my_load), which could increase the number of migrations performed. When run on a single node, this logic seems to increase the initial migrations. It still needs testing at scale.
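The two conditions discussed here can be compared side by side in a small standalone sketch. The function names are made up for illustration. The parameters mirror the variables above: p_load is the receiver's load, obj_load stands for obj->load, and my_load is the donor's current load.

```cpp
#include <cassert>

// Current check: the receiver must stay below the average after the transfer.
bool fits_under_average(double p_load, double obj_load, double avg_load) {
  assert(obj_load >= 0.0);
  return (p_load + obj_load) < avg_load;
}

// Proposed relaxed check: also accept if the receiver ends up less loaded
// than the donor currently is.
bool fits_relaxed(double p_load, double obj_load, double avg_load,
                  double my_load) {
  assert(obj_load >= 0.0);
  return ((p_load + obj_load) < avg_load) ||
         ((p_load + obj_load) < my_load);
}
```

With p_load = 2, obj_load = 4, avg_load = 5 and my_load = 10, the current check rejects the transfer (6 is not below 5) while the relaxed check accepts it (6 is below 10). An object of load 9 is still rejected by both, since the receiver would end up at 11.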
#12 Updated by Kavitha Chandrasekar about 1 year ago
I have updated the gerrit patch https://charm.cs.illinois.edu/gerrit/#/c/1951/ with Harshitha's fix from her branch. This version of DistributedLB performs load balancing in multiple phases: when the target threshold is not met, it attempts another phase of load balancing with an updated threshold. For a simple stencil load-balancing example, it performs migrations and balances load as expected.
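For readers who have not looked at the patch, the multi-phase idea can be illustrated with a toy model. This is only a sketch and not the code in the patch: every name is hypothetical, the PEs are plain vectors of object loads, and the update rule (relaxing the threshold by a fixed step each phase) is an assumption, since how the threshold is updated is not described here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Pe = std::vector<double>;  // toy model: one PE = its object loads

double load_of(const Pe& pe) {
  return std::accumulate(pe.begin(), pe.end(), 0.0);
}

double max_load(const std::vector<Pe>& pes) {
  double m = 0.0;
  for (const Pe& pe : pes) m = std::max(m, load_of(pe));
  return m;
}

// One phase: each PE above `threshold` tries to hand objects to the first PE
// that would stay at or below `threshold` after receiving them.
void run_phase(std::vector<Pe>& pes, double threshold) {
  for (std::size_t d = 0; d < pes.size(); ++d) {
    for (std::size_t i = 0;
         i < pes[d].size() && load_of(pes[d]) > threshold;) {
      const double obj = pes[d][i];
      bool placed = false;
      for (std::size_t r = 0; r < pes.size() && !placed; ++r) {
        if (r == d || load_of(pes[r]) + obj > threshold) continue;
        pes[r].push_back(obj);
        pes[d].erase(pes[d].begin() + static_cast<std::ptrdiff_t>(i));
        placed = true;
      }
      if (!placed) ++i;  // nobody can take this object; try the next one
    }
  }
}

// Repeats phases until the max load is within the current threshold, relaxing
// the threshold ratio by `step` after each phase that misses it (assumed rule).
// Returns the number of phases run.
int balance(std::vector<Pe>& pes, double ratio, double step, int max_phases) {
  assert(!pes.empty());
  double total = 0.0;
  for (const Pe& pe : pes) total += load_of(pe);
  const double avg = total / static_cast<double>(pes.size());
  for (int phase = 1; phase <= max_phases; ++phase) {
    run_phase(pes, ratio * avg);
    if (max_load(pes) <= ratio * avg) return phase;
    ratio += step;
  }
  return max_phases;
}
```

In this model, starting from PEs holding {4, 4, 4}, {1}, {1} and {} (average 3.5) with an initial ratio of 1.0 and a step of 0.2, no single phase at the tightest threshold can place the objects of load 4, but later phases with a looser threshold bring the maximum load down from 12 to 5.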