Feature #641

protect load balancer from variable cpu clock

Added by Jim Phillips over 4 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Category: -
Target version: -
Start date: 12/31/2014
Due date: -
% Done: 0%
Spent time: -

Description

When running netlrts-smp on Linux with two processes on a single node for GPU-accelerated NAMD, the OS sees some cores (generally those belonging to one process or the other) as less busy and lowers their clock frequency. This exaggerates the load imbalance and leaves the other process even more overloaded after load balancing. Setting the CPU frequency scaling governors to "performance" makes the issue go away, but we need to understand why the OS isn't reading the CPUs as fully loaded, or find a way for the load balancer to cope with time-varying CPU speeds.
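
For reference, a minimal sketch (not part of the original report) of the workaround mentioned above: forcing every core's cpufreq governor to "performance" by writing the standard Linux sysfs files. It assumes the usual /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor layout and needs root privileges; the same effect is normally achieved with distribution tools such as cpupower.

    // set_performance_governor.cpp -- illustrative workaround sketch.
    // Writes "performance" to each core's scaling_governor sysfs file.
    // Requires root; paths assume the standard Linux cpufreq sysfs layout.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    int main() {
      unsigned ncpus = std::thread::hardware_concurrency();
      for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu)
                           + "/cpufreq/scaling_governor";
        std::ofstream gov(path);
        if (!gov) {
          std::cerr << "cannot open " << path << " (missing cpufreq or not root?)\n";
          continue;
        }
        gov << "performance\n";   // keep this core at full clock speed
      }
      return 0;
    }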

History

#1 Updated by Jim Phillips over 4 years ago

The OS was reading the CPU as not loaded due to frequent calls to CmiMachineProgressImpl() in NAMD entry methods. These calls were spending a lot of time waiting on something in the machine layer. When the CmiMachineProgressImpl() calls are removed, the OS keeps the cores at full or near-full speed. It is possible that it is a bug for user code to call CmiMachineProgressImpl() in non-smp builds, or at least it shouldn't serve any purpose when there is a communication thread. These calls were originally added to ensure that high-priority incoming messages were received promptly.

#2 Updated by Jim Phillips over 4 years ago

That should be "It is possible that it is a bug for user code to call CmiMachineProgressImpl() in smp builds". In any case, I assume this is related to the issue of the comm thread holding the comm lock while sending and receiving messages rather than only when manipulating queues.

#3 Updated by Nikhil Jain over 4 years ago

  • Assignee set to Harshitha Menon

#4 Updated by Harshitha Menon almost 4 years ago

Is this bug related to https://charm.cs.illinois.edu/redmine/issues/642? If so, does that fix eliminate this problem?

#5 Updated by Jim Phillips almost 4 years ago

I assume that fixes the problem. I also eliminated the CmiMachineProgressImpl() calls in NAMD for smp builds.
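
For reference, a hedged sketch of the kind of guard this change amounts to; the helper name pollIncomingMessages() is hypothetical and this is not the actual NAMD code. CmiMachineProgressImpl() and the CMK_SMP macro come from Charm++/Converse (declared via converse.h).

    // Illustrative only -- not the actual NAMD source.
    #include "converse.h"   // Charm++/Converse: CmiMachineProgressImpl(), CMK_SMP

    // Hypothetical helper invoked from an entry method that wants high-priority
    // incoming messages delivered promptly.
    static void pollIncomingMessages() {
    #if !CMK_SMP
      // Non-SMP build: there is no dedicated communication thread, so drive the
      // machine layer explicitly to receive pending messages.
      CmiMachineProgressImpl();
    #endif
      // SMP build: the communication thread already makes progress, and calling
      // CmiMachineProgressImpl() here can stall on the machine layer (e.g. the
      // comm lock) long enough for the OS to see the worker core as idle and
      // lower its clock frequency.
    }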

#6 Updated by Nikhil Jain over 3 years ago

  • Target version changed from 6.7.0 to 6.8.0

#7 Updated by Eric Bohm over 2 years ago

  • Assignee changed from Harshitha Menon to Kavitha Chandrasekar

#8 Updated by Sam White over 2 years ago

It looks like this was fixed a while ago in the following two commits:
https://charm.cs.illinois.edu/gerrit/#/c/577/
https://charm.cs.illinois.edu/gerrit/#/c/638/

Anything left to do on this issue?

#9 Updated by Phil Miller about 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

#10 Updated by Eric Bohm almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#11 Updated by Sam White over 1 year ago

  • Target version deleted (6.9.0)

Please close this issue if it is done
