Bug #1676

Replicas slower than separate jobs on GNI systems

Added by Jim Phillips 2 months ago. Updated about 1 month ago.

Status: New
Priority: Normal
Category: Machine Layers
Target version:
Start date: 09/13/2017
Due date:
% Done: 0%
Spent time:
Description

On Blue Waters, a 200-node, 50-replica (4 nodes per replica) non-smp apoa1 run is uniformly and significantly slower than a separate 4-node run.
No idea why. Topology-aware partitioning seems to be working OK. Testing MPI on Bridges to see if it happens there too.

History

#1 Updated by Sam White 2 months ago

What commit of charm are you using? We recently merged changes to make broadcasts and reductions topology-aware.

#2 Updated by Jim Phillips 2 months ago

6.8.0 from Sept 5 (v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093).
No observed performance difference on Bridges.

#3 Updated by Jim Phillips 2 months ago

I see the exact same performance for v6.7.0-574-g7d61794-namd-charm-6.8.0-build-2017-Jan-23-80737 and v6.7.0-0-g46f867c-namd-charm-6.7.0-build-2015-Dec-21-45876.
Definitely not a recent change.

#4 Updated by Jim Phillips 2 months ago

The bug does not affect the MPI layer on Blue Waters.
Still need to test verbs.

#5 Updated by Jim Phillips 2 months ago

The verbs layer does not appear to be affected.

#6 Updated by Phil Miller 2 months ago

Just to be clear, the only context in which this bug has been observed is GNI on Blue Waters?

MPI there is unaffected, verbs is unaffected. MPI on Bridges is not affected, I think?

We may not hold the 6.8.1 release for this, since it's not any sort of recent issue or regression. We'll obviously try to get it dealt with quickly.

#7 Updated by Sam White 2 months ago

  • Subject changed from replicas slower than separate jobs to Replicas slower than separate jobs on GNI on Blue Waters

#8 Updated by Jim Phillips 2 months ago

Correct, as far as I know this is a GNI issue.
I've only tested on Blue Waters. It may or may not affect Titan, Eos, Edison, Cori, Theta, Piz Daint, etc.

#9 Updated by Eric Bohm 2 months ago

  • Assignee set to Karthik Senthil

#10 Updated by Jim Phillips about 2 months ago

I can confirm that the bug also affects Cori (XC40), so I would assume all XC/XE/XK machines.

#11 Updated by Phil Miller about 2 months ago

Asking about test-case reduction, since there are basically no progress notes on this issue:
  • Is a 2 node, 2 replica job markedly slower than a 1 node single job?
  • If no to the above, is a 4 node, 2 replica job slower than a 2 node single job?

#12 Updated by Jim Phillips about 2 months ago

No and no. I've been using 4 nodes per replica, non-smp. The effect starts to be visible above the noise at 16 replicas and stands out at 64 replicas.

#13 Updated by Phil Miller about 2 months ago

Ok, so the effect grows in magnitude with replica count, and requires at least a few nodes to occur.

What about the other direction - say a 64- or 128-node job with 2 replicas?

Do you have data to know if all of the replicas are slow, or are they mostly fast, and getting delayed by some interaction with one or a few slow replicas?

#14 Updated by Jim Phillips about 2 months ago

All of the replicas are uniformly slow. There is no inter-replica interaction.
I haven't looked at large node counts with small replica counts.

#15 Updated by Phil Miller about 2 months ago

  • Subject changed from Replicas slower than separate jobs on GNI on Blue Waters to Replicas slower than separate jobs on GNI system

#16 Updated by Phil Miller about 2 months ago

  • Subject changed from Replicas slower than separate jobs on GNI system to Replicas slower than separate jobs on GNI systems

#17 Updated by Jim Phillips about 2 months ago

From some basic profiling it appears that the amount of time spent in alloc_mempool_block (but not the number of calls) increases dramatically as the number of nodes and replicas is increased proportionately (from 8 nodes 2 replicas to 16 nodes 4 replicas). I'm getting crashes beyond 16 nodes.

#18 Updated by Phil Miller about 2 months ago

Could you try the same test (4 nodes per replica, increasing replica count) with +useDynamicSmsg? I'm kinda suspecting a lot of memory is being set aside for communication among increasing numbers of nodes, even as the actual communication graph has low fixed degree.
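
As a back-of-the-envelope illustration of that suspicion (a sketch only, assuming static SMSG reserves one fixed-size mailbox per remote PE across the whole job, and using a placeholder per-peer buffer size rather than the real GNI value), per-PE setup memory under static SMSG grows with total job size even though a 4-node replica only ever exchanges messages with the 123 other PEs in its own partition:

/* Sketch only: models per-PE SMSG mailbox memory, assuming one
 * fixed-size buffer per remote PE (static SMSG) versus one per
 * peer inside the replica's own partition.  PER_PEER_BYTES is a
 * placeholder, not the actual GNI buffer size. */
#include <stdio.h>

#define PER_PEER_BYTES 5000.0   /* hypothetical mailbox size per remote PE */

static double smsg_mb(long peers) {
    return peers * PER_PEER_BYTES / (1024.0 * 1024.0);
}

int main(void) {
    const long pes_per_node  = 31;       /* non-smp, +pemap 0-30     */
    const long replica_pes   = 4 * 31;   /* 4-node replica = 124 PEs */
    const long node_counts[] = {8, 16, 64, 200};

    for (int i = 0; i < 4; ++i) {
        long total_pes = node_counts[i] * pes_per_node;
        printf("%4ld nodes: whole-job mailboxes ~%5.1f MB/PE, "
               "own-replica peers ~%4.2f MB/PE\n",
               node_counts[i],
               smsg_mb(total_pes - 1),    /* static: every PE in the job */
               smsg_mb(replica_pes - 1)); /* only the 123 in-replica PEs */
    }
    return 0;
}

Under those assumptions the per-PE reservation at 200 nodes is roughly 50x what the replica's own communication graph needs, which is the kind of gap +useDynamicSmsg is meant to close.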

#19 Updated by Phil Miller about 2 months ago

And while you're at it, could you post your full command line and the runtime's startup output?

#20 Updated by Jim Phillips about 2 months ago

For 16 nodes:
aprun -n 496 -r 1 -N 31 -d 1 /u/sciteam/jphillip/NAMD_LATEST_CRAY-XE-ugni-BlueWaters/namd2 +pemap 0-30 +useDynamicSmsg +replicas 4 +stdout output/%d/test22.%d.log /u/sciteam/jphillip/apoa1/apoa1.namd

From a 20-node run:
Charm++> Running on Gemini (GNI) with 620 processes
Charm++> static SMSG
Charm++> SMSG memory: 3061.2KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 620
Charm++> Using recursive bisection (scheme 3) for topology aware partitions

and the first replica of that run:
Converse/Charm++ Commit ID: v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 0-30
Charm++> Running on 4 unique compute nodes (32-way SMP).
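
Plugging the numbers from this startup output into the same back-of-the-envelope model as above: 3061.2 KB spread over 619 remote PEs is about 4.9 KB per peer, so for the original 200-node non-smp job (6200 PEs) a whole-job static SMSG reservation would come to roughly 30 MB per PE, i.e. close to 1 GB per 31-PE node, sized by the full job rather than by the 124-PE replica. This is only an extrapolation from one startup line, assuming the SMSG figure scales linearly with the number of remote PEs; the dynamic-SMSG result in the next comment suggests it is not the whole story.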

#21 Updated by Jim Phillips about 2 months ago

Sorry, dynamic and static SMSG have indistinguishable performance at large replica counts, although the final WallClock output is actually longer for dynamic SMSG than for static SMSG.

#22 Updated by Phil Miller about 1 month ago

  • Target version changed from 6.8.1 to 6.9.0
