Replicas slower than separate jobs on GNI systems
On Blue Waters, a 200-node, 50-replica (4 nodes per replica), non-SMP run of apoa1 is uniformly and significantly slower than a separate 4-node run.
No idea why. Topology-aware partitioning seems to be working OK. Testing MPI on Bridges to see if it happens there too.
#6 Updated by Phil Miller 2 months ago
Just to be clear, the only context in which this bug has been observed is GNI on Blue Waters?
MPI there is unaffected, and verbs is unaffected. MPI on Bridges is not affected either, I think?
We may not hold the 6.8.1 release for this, since it's not any sort of recent issue or regression. We'll obviously try to get it dealt with quickly.
#13 Updated by Phil Miller about 2 months ago
Ok, so the effect grows in magnitude with replica count, and requires at least a few nodes to occur.
What about the other direction - say, a 64- or 128-node job with 2 replicas?
Do you have data showing whether all of the replicas are slow, or whether they are mostly fast and getting delayed by some interaction with one or a few slow replicas?
#17 Updated by Jim Phillips about 2 months ago
From some basic profiling, it appears that the amount of time spent in alloc_mempool_block (but not the number of calls) increases dramatically as the number of nodes and replicas is increased proportionately (from 8 nodes / 2 replicas to 16 nodes / 4 replicas). I'm getting crashes beyond 16 nodes.
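For context on that measurement: the telling detail is that the call count stays flat while the time per call grows. Below is a minimal sketch of this style of instrumentation, bracketing the allocator with a monotonic clock and accumulating time and call count separately. The names wrapped_alloc and the accumulators are hypothetical, std::malloc stands in for the mempool allocator, and the real alloc_mempool_block in the GNI machine layer has a different signature.

// Sketch: attribute wall time to an allocator and count calls separately,
// so "more time" can be distinguished from "more calls". Hypothetical names;
// std::malloc stands in for the actual mempool allocator.
#include <chrono>
#include <cstdio>
#include <cstdlib>

static double g_alloc_seconds = 0.0;
static long   g_alloc_calls   = 0;

void* wrapped_alloc(std::size_t size) {
    auto t0 = std::chrono::steady_clock::now();
    void* p = std::malloc(size);  // stand-in for alloc_mempool_block
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    g_alloc_seconds += dt.count();
    ++g_alloc_calls;
    return p;
}

void report_alloc_time() {
    std::printf("alloc: %.3f s over %ld calls (%.2f us/call)\n",
                g_alloc_seconds, g_alloc_calls,
                g_alloc_calls ? 1e6 * g_alloc_seconds / g_alloc_calls : 0.0);
}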
#18 Updated by Phil Miller about 2 months ago
Could you try the same test (4 nodes per replica, increasing replica count) with +useDynamicSmsg? I suspect a lot of memory is being set aside for communication among increasing numbers of nodes, even though the actual communication graph has a low, fixed degree.
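If static SMSG preallocates a mailbox for every peer PE, per-process SMSG memory grows linearly with total PE count and aggregate memory grows quadratically, which would fit an effect that worsens with replica count. Here is a back-of-envelope sketch of that model; the ~4.9KB per-peer figure is inferred from the "SMSG memory: 3061.2KB" line for 620 PEs in the log below, and the per-peer preallocation assumption is mine, not confirmed from the machine-layer source.

// Rough model: static SMSG memory per PE = per-peer mailbox size x (PEs - 1).
// The per-peer size is inferred from the log below; everything else follows.
#include <cstdio>

int main() {
    const double kb_per_peer = 3061.2 / 619.0;          // ~4.9 KB, inferred
    const int pe_counts[] = {124, 248, 496, 620, 6200}; // 4- to 200-node scales
    for (int pes : pe_counts) {
        double per_pe_mb = kb_per_peer * (pes - 1) / 1024.0;
        double total_gb  = per_pe_mb * pes / 1024.0;
        std::printf("%5d PEs: %6.1f MB per PE, %7.1f GB aggregate\n",
                    pes, per_pe_mb, total_gb);
    }
}

At the 200-node / 50-replica scale of the original report (6200 PEs), this model predicts roughly 30MB per PE and on the order of 180GB in aggregate, almost all of it unused if each replica mostly talks within its own 4 nodes.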
#20 Updated by Jim Phillips about 2 months ago
For 16 nodes (496 PEs = 16 nodes x 31 PEs/node, i.e. 4 replicas of 124 PEs / 4 nodes each):
aprun -n 496 -r 1 -N 31 -d 1 /u/sciteam/jphillip/NAMD_LATEST_CRAY-XE-ugni-BlueWaters/namd2 +pemap 0-30 +useDynamicSmsg +replicas 4 +stdout output/%d/test22.%d.log /u/sciteam/jphillip/apoa1/apoa1.namd
From a 20-node run (620 PEs; at 4 nodes / 124 PEs per replica, that is 5 replicas):
Charm++> Running on Gemini (GNI) with 620 processes
Charm++> static SMSG
Charm++> SMSG memory: 3061.2KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 620
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
and the first replica of that run:
Converse/Charm++ Commit ID: v6.8.0-0-ga36028e-namd-charm-6.8.0-build-2017-Sep-05-28093
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 0-30
Charm++> Running on 4 unique compute nodes (32-way SMP).
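For reference, the "recursive bisection (scheme 3)" line above is how Charm++ carves the allocation into per-replica partitions. Below is a minimal sketch of the general recursive-bisection idea on 3-D torus coordinates; it illustrates the concept only and is not the actual scheme-3 implementation.

// Sketch: recursively split the node list along its widest coordinate
// dimension until each group corresponds to one partition. Conceptual
// illustration only; not Charm++'s scheme-3 code.
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

struct Node { std::array<int, 3> coord; int id; };

void bisect(std::vector<Node> nodes, int parts,
            std::vector<std::vector<int>>& out) {
    if (parts == 1) {
        out.emplace_back();
        for (const Node& n : nodes) out.back().push_back(n.id);
        return;
    }
    // Pick the dimension with the largest coordinate spread.
    int dim = 0, best = -1;
    for (int d = 0; d < 3; ++d) {
        auto mm = std::minmax_element(nodes.begin(), nodes.end(),
            [d](const Node& a, const Node& b) { return a.coord[d] < b.coord[d]; });
        int spread = mm.second->coord[d] - mm.first->coord[d];
        if (spread > best) { best = spread; dim = d; }
    }
    // Sort along that dimension and cut proportionally, keeping halves compact.
    std::sort(nodes.begin(), nodes.end(),
        [dim](const Node& a, const Node& b) { return a.coord[dim] < b.coord[dim]; });
    int loParts = parts / 2;
    std::size_t cut = nodes.size() * loParts / parts;
    bisect(std::vector<Node>(nodes.begin(), nodes.begin() + cut), loParts, out);
    bisect(std::vector<Node>(nodes.begin() + cut, nodes.end()), parts - loParts, out);
}

Keeping each partition contiguous in the torus is what yields the clean "Running on 4 unique compute nodes" mapping seen above.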