Bug #1676

Replicas slower than separate jobs on GNI on Blue Waters

Added by Jim Phillips 12 days ago. Updated 5 days ago.

Status: New
Priority: Normal
Category: Machine Layers
Target version:
Start date: 09/13/2017
Due date:
% Done: 0%
Spent time:

Description

On Blue Waters, a non-SMP apoa1 run on 200 nodes with 50 replicas (4 nodes per replica) is uniformly and significantly slower than a separate 4-node run.
No idea why. Topology-aware partitioning seems to be working OK. I am testing MPI on Bridges to see if it happens there too.

History

#1 Updated by Sam White 12 days ago

What commit of charm are you using? We recently merged changes to make broadcasts and reductions topology-aware.

#2 Updated by Jim Phillips 12 days ago

6.8.0 from Sept 5 (v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093).
No observed performance difference on Bridges.

#3 Updated by Jim Phillips 12 days ago

I see the exact same performance for v6.7.0-574-g7d61794-namd-charm-6.8.0-build-2017-Jan-23-80737 and v6.7.0-0-g46f867c-namd-charm-6.7.0-build-2015-Dec-21-45876.
Definitely not a recent change.

#4 Updated by Jim Phillips 11 days ago

The bug does not affect the MPI layer on Blue Waters.
Still need to test verbs.

#5 Updated by Jim Phillips 10 days ago

The verbs layer does not appear to be affected.

#6 Updated by Phil Miller 10 days ago

Just to be clear, the only context in which this bug has been observed is GNI on Blue Waters?

MPI there is unaffected, verbs is unaffected. MPI on Bridges is not affected, I think?

We may not hold the 6.8.1 release for this, since it's not any sort of recent issue or regression. We'll obviously try to get it dealt with quickly.

#7 Updated by Sam White 10 days ago

  • Subject changed from replicas slower than separate jobs to Replicas slower than separate jobs on GNI on Blue Waters

#8 Updated by Jim Phillips 10 days ago

Correct, as far as I know this is a GNI issue.
I've only tested on Blue Waters. It may or may not affect Titan, Eos, Edison, Cori, Theta, Piz Daint, etc.

#9 Updated by Eric Bohm 5 days ago

  • Assignee set to Karthik Senthil
