Feature #1321

multiple communication threads per process

Added by Jim Phillips over 2 years ago. Updated over 1 year ago.

Status: New
Priority: High
Assignee:
Category: Machine Layers
Start date: 12/05/2016
Due date:
% Done: 0%


Description

KNL scaling, particularly on Omni-Path, appears to be limited by the communication thread even when using 7 or 13 processes per node. In addition, running multiple processes increases memory usage and precludes dynamic load balancing across the entire processor. It would be better to distribute the communication load across multiple threads, or to offload it from the communication thread to the PEs.

History

#1 Updated by Sam White over 2 years ago

Just a note for future work on this issue: look at the problem resolved in patch set #7 of the NodeGPU commit (https://charm.cs.illinois.edu/gerrit/#/c/1886/). The basic issue is that Converse does not have any abstraction for a node barrier among worker threads only, so the code assumes that there is one comm thread to do that. This assumption is repeated elsewhere.
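For reference, a node barrier restricted to worker threads could look roughly like the sketch below. This is purely illustrative and not Converse's actual API; CmiWorkerBarrierInit and CmiNodeBarrierWorkers are invented names, assuming a POSIX threads environment.

/* Hypothetical sketch of a node barrier among worker threads only.
 * Not Converse code; names are invented for illustration. */
#include <pthread.h>

static pthread_barrier_t worker_barrier;

/* Called once at node startup with the number of worker threads
 * (PEs) in this process, excluding any comm threads. */
void CmiWorkerBarrierInit(int num_workers) {
    pthread_barrier_init(&worker_barrier, NULL, num_workers);
}

/* Each worker thread calls this; comm threads never do, so the
 * barrier no longer encodes the "exactly one comm thread" assumption. */
void CmiNodeBarrierWorkers(void) {
    pthread_barrier_wait(&worker_barrier);
}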

#2 Updated by Sam White over 2 years ago

  • Target version set to 6.9.0
  • Subject changed from multiple communication threads to multiple communication threads per process

#3 Updated by Jim Phillips over 2 years ago

Multiple communication threads per process would be good for distributing Charm++ internal work, but will not help with PSM2 bottlenecks, per the docs:

"PSM2 is currently a single-threaded library. This means that you cannot make any
concurrent PSM2 library calls. While threads may be a valid execution model for the
wider set of potential PSM2 clients, applications should currently expect better effective
use of IntelĀ® Omni-Path resources (and hence better performance) by dedicating a
single PSM2 communication endpoint to every CPU core."

In an IXPUG presentation, one application solved this by dedicating several MPI ranks per node to communication; these ranks exchanged work via shared memory with a single multi-threaded worker rank on the node.
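For illustration only (the presentation's actual code is not shown here), that pattern might be sketched with MPI shared-memory windows as follows; the single-flag "mailbox" is a deliberate simplification of what would really be a ring buffer with proper memory synchronization (MPI_Win_sync).

/* Illustrative sketch of the dedicated-comm-rank pattern: on each
 * node, node-local rank 0 is the multi-threaded worker and the
 * remaining node-local ranks are comm ranks. */
#include <mpi.h>
#include <stdint.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Group the ranks that share a node into one communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* One shared-memory slot per node-local rank. */
    MPI_Win win;
    int64_t *base;
    MPI_Win_allocate_shared(sizeof(int64_t), sizeof(int64_t),
                            MPI_INFO_NULL, node_comm, &base, &win);
    volatile int64_t *slot = base;
    *slot = 0;
    MPI_Barrier(node_comm);

    if (node_rank == 0) {
        /* Worker rank: hand a "send request" to every comm rank. */
        for (int r = 1; r < node_size; r++) {
            MPI_Aint sz; int disp; int64_t *peer;
            MPI_Win_shared_query(win, r, &sz, &disp, &peer);
            *(volatile int64_t *)peer = 42;
        }
    } else {
        /* Comm rank: poll its slot, then perform the actual network
         * I/O (e.g. PSM2 calls) on the worker's behalf. */
        while (*slot == 0) { /* spin; real code would back off */ }
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}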

#4 Updated by Eric Bohm about 2 years ago

  • Assignee set to Ronak Buch

#5 Updated by Mikhail Shiryaev almost 2 years ago

Hello team, is there any update on this work?
The latest versions of PSM2 and OFI have improved support for multi-threaded applications, so multiple communication threads in Charm++ could fit well.

#6 Updated by Ronak Buch almost 2 years ago

I've been doing more detailed tracing on OFI. Here's a representative example of the machine state tracing (two processes, ppn 13 on a single node):

[13 130.246069]> OFI::recv_callback {
[13 130.246134]> --> eager msg (e->len=4208 msg_size=4208)
[13 130.246204]> Pushing message into rank 5's queue 0x2acd2c0011e0{
[13 130.246262]> } Pushing message into rank 5's queue done
[13 130.246325]> Reposting recv req 0x13a3050 buf=0x2acd381e9610
[13 130.246383]> } OFI::recv_callback done

Total time (recv_callback): 0.000314 s -> 314 microseconds

CmiPushPE: 0.000058 s -> 58 microseconds

This is much slower than expected. Some of this is likely due to tracing and measurement overhead, but it looks very slow even without that. My suspicion is that locking or some other slowdown is responsible. I'm going to do more runs to try to isolate the behavior (with less overhead and without the possibility of contention), but this seems to indicate an issue somewhere in the stack.
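To separate tracing overhead and lock cost from the push itself, a standalone micro-benchmark along these lines could help. This is a simplified stand-in, not the actual Charm++ queue or CmiPushPE implementation:

/* Micro-benchmark sketch: time a mutex-protected queue push in
 * isolation. Build with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static void *queue[8];
static int tail = 0;

static void push(void *msg) {
    pthread_mutex_lock(&lock);
    queue[tail++] = msg;
    pthread_mutex_unlock(&lock);
}

int main(void) {
    struct timespec t0, t1;
    int dummy = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        tail = 0;          /* reset so the tiny queue never overflows */
        push(&dummy);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    /* An uncontended lock + push should cost tens of nanoseconds,
     * far below the ~58 us observed above; a much larger number
     * under contention would implicate the lock, not the push. */
    printf("avg push: %.1f ns\n", ns / ITERS);
    return 0;
}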

#7 Updated by Mikhail Shiryaev almost 2 years ago

Ronak, do your results mean that CmiPushPE is the main hotspot for the communication thread?
As far as I understand, there are multiple "recv" queues, one per worker thread, and the communication thread just pushes each entry into one of the queues based on the worker's local thread index (retrieved from the message header). So do you assume that there is contention between one worker thread and one comm thread around a shared queue?
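If the contention really is between exactly one producer (the comm thread) and one consumer (the owning worker), a lock-free single-producer/single-consumer ring is a standard remedy. A minimal C11 sketch follows; this is not the queue Charm++ actually uses.

/* Lock-free SPSC ring: one comm-thread producer, one worker consumer.
 * No mutex on the hot path, so the two threads never block each other. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024  /* must be a power of two */

typedef struct {
    void *slots[RING_SIZE];
    _Atomic size_t head;  /* advanced only by the consumer (worker) */
    _Atomic size_t tail;  /* advanced only by the producer (comm thread) */
} spsc_ring;

/* Comm thread: enqueue a message; returns false when full. */
static bool ring_push(spsc_ring *r, void *msg) {
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == RING_SIZE) return false;           /* full */
    r->slots[t & (RING_SIZE - 1)] = msg;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}

/* Worker thread: dequeue a message; returns NULL when empty. */
static void *ring_pop(spsc_ring *r) {
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (h == t) return NULL;                        /* empty */
    void *msg = r->slots[h & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return msg;
}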

#8 Updated by Ronak Buch over 1 year ago

  • Target version changed from 6.9.0 to 7 (Next Generation Charm++)
