Bug #1035

Idle PEs compete with comm thread for node queue lock

Added by Jim Phillips over 2 years ago. Updated 3 months ago.

Status: In Progress
Priority: Normal
Category: Machine Layers
Target version:
Start date: 04/14/2016
Due date:
% Done: 0%

Description

We've observed via Projections that, in SMP mode, when most of the PEs on a node are idle the communication thread is very slow at adding incoming messages to the node queue. I assume this is because there is a single lock on the node queue, which the idle PEs constantly fight over, preventing the comm thread from adding work.
I would suggest a two-level locking approach: add a consumer lock that PEs must obtain before taking the queue lock to receive messages. That way idle PEs compete only with each other (which is fine). PEs should try the consumer lock but not wait for it to become available, so as not to delay checking their private queues. Adding to the node queue would still require only the main lock.
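
A minimal sketch of this two-level scheme (untested; the consumer lock CmiNodeRecvConsumerLock is hypothetical, while the other calls mirror the existing CmiGetNonLocalNodeQ shown in the history below):

void *CmiGetNonLocalNodeQ(void) {
  CmiState cs = CmiGetState();
  char *result = 0;
  CmiIdleLock_checkMessage(&cs->idle);
  if (!CMIQueueEmpty(CsvAccess(NodeState).NodeRecv)) {
    /* Try, but never wait for, the hypothetical consumer lock; an idle
       PE that loses this race returns to polling its private queue. */
    if (CmiTryLock(CsvAccess(NodeState).CmiNodeRecvConsumerLock) == 0) {
      /* At most one PE reaches the main lock at a time, so the comm
         thread contends with only a single consumer when pushing. */
      CmiLock(CsvAccess(NodeState).CmiNodeRecvLock);
      result = (char *) CMIQueuePop(CsvAccess(NodeState).NodeRecv);
      CmiUnlock(CsvAccess(NodeState).CmiNodeRecvLock);
      CmiUnlock(CsvAccess(NodeState).CmiNodeRecvConsumerLock);
    }
  }
  return result;
}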

History

#1 Updated by Jim Phillips over 2 years ago

This appears to be the node queue receive function in src/arch/util/machine-common-core.c:

void *CmiGetNonLocalNodeQ(void) {
  CmiState cs = CmiGetState();
  char *result = 0;
  CmiIdleLock_checkMessage(&cs->idle);
  /* Unsynchronized emptiness check, then block on the node queue lock to pop. */
  if (!CMIQueueEmpty(CsvAccess(NodeState).NodeRecv)) {
    MACHSTATE1(3,"CmiGetNonLocalNodeQ begin %d {", CmiMyPe());
    CmiLock(CsvAccess(NodeState).CmiNodeRecvLock);
    result = (char *) CMIQueuePop(CsvAccess(NodeState).NodeRecv);
    CmiUnlock(CsvAccess(NodeState).CmiNodeRecvLock);
    MACHSTATE1(3,"} CmiGetNonLocalNodeQ end %d ", CmiMyPe());
  }
  return result;
}

Rather than checking that the queue is non-empty and then calling CmiLock, it may be better to call CmiTryLock, e.g.:

void *CmiGetNonLocalNodeQ(void) {
  CmiState cs = CmiGetState();
  char *result = 0;
  CmiIdleLock_checkMessage(&cs->idle);
  while (!CMIQueueEmpty(CsvAccess(NodeState).NodeRecv)) {
    /* CmiTryLock returns nonzero on failure: recheck emptiness and retry
       rather than blocking on the lock. */
    if (CmiTryLock(CsvAccess(NodeState).CmiNodeRecvLock)) continue;
    MACHSTATE1(3,"CmiGetNonLocalNodeQ begin %d {", CmiMyPe());
    result = (char *) CMIQueuePop(CsvAccess(NodeState).NodeRecv);
    CmiUnlock(CsvAccess(NodeState).CmiNodeRecvLock);
    MACHSTATE1(3,"} CmiGetNonLocalNodeQ end %d ", CmiMyPe());
    break;
  }
  return result;
}

What is confusing is that PCQueuePop appears to already be thread-safe; if so, the real question is why CmiPushNode() and CmiSendNodeSelf() need to take CmiNodeRecvLock before calling CMIQueuePush.

#2 Updated by Sam White over 2 years ago

  • Target version changed from 6.7.1 to 6.8.0

#3 Updated by Eric Bohm over 2 years ago

  • Assignee set to Harshitha Menon

#4 Updated by Harshitha Menon over 2 years ago

Jim,
Bilge mentioned that you had implemented this and sent a patch to Antti-Pekka, but it did not show any benefit. Can you update this issue with the details?

Thank you

#5 Updated by Jim Phillips over 2 years ago

I tried the above variant, as well as adding a lock to restrict node queue access to a single consumer at a time. Neither showed any improvement in the observed performance.

#6 Updated by Eric Bohm about 2 years ago

  • Assignee changed from Harshitha Menon to Kavitha Chandrasekar

#7 Updated by Sam White almost 2 years ago

The new lockless PCQueue should alleviate this issue: https://charm.cs.illinois.edu/gerrit/#/c/1302/

#8 Updated by Sam White almost 2 years ago

  • Status changed from New to Implemented
  • Subject changed from idle PEs compete with comm thread for node queue lock to Idle PEs compete with comm thread for node queue lock
  • Assignee changed from Kavitha Chandrasekar to Bilge Acun

The new lockless queue (gerrit patch linked above) will address this issue, so re-assigning to Bilge as the owner of that patch.

#9 Updated by Sam White over 1 year ago

  • Target version changed from 6.8.0 to 6.8.1

We are not planning to merge the lockless queue before the 6.8.0 release, since it is too high-risk this close to release.

#10 Updated by Eric Bohm about 1 year ago

  • Target version changed from 6.8.1 to 6.9.0

#11 Updated by Eric Bohm about 1 year ago

  • Assignee changed from Bilge Acun to Michael Robson

#12 Updated by Sam White 11 months ago

  • Status changed from Implemented to In Progress

This issue needs to be re-evaluated.

#13 Updated by Sam White 10 months ago

One proposed solution: instead of having one multi-producer/single-consumer queue per PE that the comm thread and all worker threads alike push work into, split that queue into two: a single-producer/single-consumer queue at each PE that only the comm thread pushes messages to, and a multi-producer/single-consumer queue per PE that all the other worker threads in its process push messages to. This would eliminate the contention between the comm thread and the worker threads, though it would add an extra queue for the scheduler to poll (see the sketch below).
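
A rough sketch of what the per-PE receive path might look like under this split; the queue field names (commQ, workerQ) and the pop functions are hypothetical placeholders, not existing Converse APIs:

void *CmiPollSplitQueues(CmiState cs) {
  void *msg;
  /* Drain the comm thread's single-producer/single-consumer queue first,
     since network messages are typically what an idle PE is waiting on. */
  if ((msg = SPSCQueuePop(cs->commQ)) != NULL)
    return msg;
  /* Then poll the multi-producer/single-consumer queue fed by the other
     worker threads in this process. */
  return MPSCQueuePop(cs->workerQ);
}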

Abhishek and Samarth spent some time last semester benchmarking the different queues we have, so we could reassign this to them. Seonmyeong also has experience with C++11 atomics and with the Csd module.

Anyway, it seems like a stretch for this to be resolved before 6.9.0.

#14 Updated by Sam White 10 months ago

  • Target version changed from 6.9.0 to 6.9.1

Since there hasn't been any movement on this recently and it doesn't involve any user-visible API changes, retargeting to 6.9.1.

#15 Updated by Sam White 3 months ago

Is there a reproducible test program that exhibits this issue, so that we can get someone working on it?
