Bug #1715

20% slowdown in ChaNGa after commit 159fd36fc

Added by Thomas Quinn almost 2 years ago. Updated over 1 year ago.

Status:
Merged
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
10/17/2017
Due date:
% Done:

0%

Tags:

Description

A ChaNGa user reported a noticeable slowdown when compiling with recent versions of Charm++. A "git bisect" session points to commit 159fd36fc as the cause:

Author: Eric Mikida <>
Date: Tue May 23 15:32:20 2017 -0500

Short-circuit receive code path for quicker array element lookup
The primary focus here is to improve performance for fine-grained
messaging for arrays. This is part of an overall effort to more fully
integrate 64-bit IDs (#178/#179) and to decouple messaging from
CkLocMgr (#165).
Change-Id: Ib5edbb8983421d82b8c0e3055e05939a7f8139d6

The slowdown can be demonstrated by running the "testcosmo" test in the ChaNGa source directory on a 4-node Ivy Bridge (20 cores/node) system connected with InfiniBand (NASA Pleiades) with the "verbs-linux-x86_64 smp" build, and summing the total time spent in "Gravity/SPH". The following awk script may help:

/Calculating gravity and SPH/ {sum2 += $14} END {print sum2}

Comparing commit 159fd36fc with the previous commit (938cb13a9) shows 56 seconds vs 47 seconds. The command to run is:
$CHARMRUN +p 72 ++ppn 18 $CHANGA -wall 15 +balancer MultistepNodeLB_notopo +setcpuaffinity +commap 0 +pemap 1-18 cube300.param >& cube300_new.out

new_time_profile.png (210 KB) Thomas Quinn, 10/24/2017 11:18 AM

com_proc_old.png (334 KB) Thomas Quinn, 10/24/2017 11:21 AM

com_proc_new.png (327 KB) Thomas Quinn, 10/24/2017 11:21 AM

History

#1 Updated by Sam White over 1 year ago

  • Assignee set to Eric Mikida

Do you happen to know if that input case running on that many PEs will result in having many more chares per PE than normal for ChaNGa?

And do you know if that commit improved ChaNGa performance for other input cases?

We have an open issue to use a better hash table implementation than std::unordered_map for our specific use case, so that may also help this: https://charm.cs.illinois.edu/redmine/issues/1065

#2 Updated by Thomas Quinn over 1 year ago

The given case doesn't have an unusual number of chares/PE. It does have a very small compute/communication ratio: I would normally run this problem on only one node. The original case reported by a user has a somewhat larger than usual chares/PE, and it also has a large amount of communication.

Running testcosmo on only 2 nodes, the commit does improve performance by ~5%.

#3 Updated by Eric Mikida over 1 year ago

I don't think this slowdown has to do with the use of unordered map. Even before this change, the location manager used an unordered map; it was just buried in a much deeper call stack. All this change did was pull that lookup up to the point where a message is received. My guess is that the performance degradation occurs because this benchmark may be doing very frequent migrations, so checking this top-level map almost always results in misses, and we fall back to the old code path.

How frequently is load balancing done in this benchmark, and roughly what percentage of objects end up migrating? If you run the benchmark without load balancing does the change still have a negative impact?

#4 Updated by Thomas Quinn over 1 year ago

I tried the experiment of running the benchmark without load balancing (well, one load balance at the very beginning of the simulation). The new version takes 48s compared to 40s for the old version. I conclude that migration is not the issue here.

#5 Updated by Thomas Quinn over 1 year ago

I took a look at these runs with "gprof" and projections. With gprof, the noticeable thing is that DeliverViaNetwork() takes 5.54s in the new version vs 2.61s in the old.

Attached are some projections plots: From the time profile, I see that there is a lot more time spent in "overhead" in the new run.

The communications plot is interesting: in the old run there are a number of updateLocation(CmiUint8, int nowOnPE) entries (the brown) that are missing in the new one. Note that there is no migration going on. Looking at the timelines, these updateLocation() calls seem to occur when calls are made to the CkCacheManager group.

#6 Updated by Eric Mikida over 1 year ago

Looking into this more now. The lack of updateLocation methods makes sense, because they are part of the location manager, and the new code path bypasses the location manager when it's not needed. The two PNGs you uploaded seem to be broken, however, and just show up as all black for me.

I'll also look into the gprof results you got. My suspicion is that the delivery method should take a bit longer due to the added hash-table lookup, but this should be more than made up for by the fact that the location manager code path is bypassed. I'm not sure why that is not the case here, but I'll keep digging.

#7 Updated by Thomas Quinn over 1 year ago

I can see the PNGs if I first download them, then view them with "display".

#8 Updated by Thomas Quinn over 1 year ago

I don't know if this helps, but I notice an even bigger slowdown in the "tree building" phase of ChaNGa: nearly a factor of 2 slower with the new version of Charm++. This phase is characterized by a lot of small (100-byte and ~200-byte) messages.

#9 Updated by Laxmikant "Sanjay" Kale over 1 year ago

Can someone upload the two gprof outputs to compare?
(The comm PNGs are not viewable on my Mac, and I don't have "display".) The comm views will be useful to check whether the caching is not working and all messages are going 2-3 hops (via the home PE).

#10 Updated by Eric Mikida over 1 year ago

  • Priority changed from Normal to High
  • Target version set to 6.9.0

We think we have narrowed down the issue. Should have more info and possibly a patch to test by end of day tomorrow.

#11 Updated by Thomas Quinn over 1 year ago

I did make a "comm" projections plot, and I think I am seeing what Sanjay is suggesting: a message goes to a comm thread on another node, then goes to the comm thread on the same node, then goes to another PE on the node from which it was sent.

#12 Updated by Thomas Quinn over 1 year ago

And one more data point: if I use "NullLB", tree building speeds up by a factor of 3!

#13 Updated by Eric Mikida over 1 year ago

Yes, it looks like when things migrate, we are no longer catching 'multi-hop' messages and updating the sender. So what were previously a small number of multi-hop messages that were quickly corrected are now permanently multi-hop messages. NullLB removes all migration, so we don't run into this problem. I'm updating the short-circuited path to still check for multi-hop messages.

#14 Updated by Eric Mikida over 1 year ago

Pushed a patch here: https://charm.cs.illinois.edu/gerrit/3219.

Jaemin ran some tests on golub using the verbs SMP build, and it looks like performance is back to normal for the test described in this bug. Below are the times (in seconds) for three runs with each of the builds in question.

Current Charm: 40.0676, 40.4806, 40.2761
Before Eric's old patch: 37.1362, 37.1666, 37.2152
After Eric's new patch: 37.2735, 37.2117, 36.9612

#15 Updated by Sam White over 1 year ago

  • Status changed from New to Implemented

#16 Updated by Jaemin Choi over 1 year ago

  • Status changed from Implemented to Merged

#17 Updated by Thomas Quinn over 1 year ago

I've tested the patched version with the problem case above. The short-circuit now actually improves performance!
