Bug #1774

Thread migration fails on ppc64le builds

Added by Nitin Bhat 5 months ago. Updated about 1 month ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 01/11/2018
Due date: -
% Done: 0%


Description

On SummitDev, the megampi test (tests/ampi/megampi) with the pami-linux-ppc64le-smp build hangs with the following output:

Running on 2 processors:  ./pgm +vp2
jsrun -n2 ./pgm +vp2
Choosing optimized barrier algorithm name I0:HybridBinomial:SHMEM:P2P
Converse/Charm++ Commit ID: v6.8.2-58-gc448343
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Setting default affinity
Charm++> Running on 1 unique compute nodes (160-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
[0] RandCentLB created
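
For reference, a sketch of the reproduction steps (the build invocation and directory layout are assumptions based on the usual Charm++ build script; the run line mirrors the transcript above):

# Assumed build of the AMPI target in question:
./build AMPI pami-linux-ppc64le smp --with-production -j16
# Run the failing test from the build tree:
cd pami-linux-ppc64le-smp/tests/ampi/megampi
make test    # issues "jsrun -n2 ./pgm +vp2", which hangs after "[0] RandCentLB created"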


Related issues

Related to Charm++ - Bug #1775: chkpt test hangs for pami-linux-ppc64le-smp & pamilrts-linux-ppc64le-smp build New 01/11/2018

History

#1 Updated by Nitin Bhat 5 months ago

  • Description updated (diff)

#2 Updated by Sam White 5 months ago

Looks like you need to run with '+isomalloc_sync', but there might be other issues too.

#3 Updated by Nitin Bhat 5 months ago

  • Subject changed from megampi test hangs for pami-linux-ppc64le-smp build to megampi test hangs for pami-linux-ppc64le-smp & pamilrts-linux-ppc64le-smp builds

Seeing the hang even with '+isomalloc_sync'.

A similar hang is seen with the pamilrts-linux-ppc64le-smp build as well. (https://charm.cs.illinois.edu/gerrit/#/c/3141/)

#4 Updated by Nitin Bhat 5 months ago

  • Related to Bug #1775: chkpt test hangs for pami-linux-ppc64le-smp & pamilrts-linux-ppc64le-smp build added

#5 Updated by Sam White about 2 months ago

Does AMPI work on Summit (rather than SummitDev)? We want it working for v6.9.0. Also does it work on verbs-linux-ppc64le? And do you know if verbs or pami(lrts) is faster on Summit?

#6 Updated by Nitin Bhat about 1 month ago

I see that it hangs for all ppc64le targets (both smp and non-smp; all of verbs, pami, and pamilrts).

I'm not sure how verbs compares to the pami(lrts) targets in terms of performance, as I haven't compared them. However, given that the NAMD folks use pami for their runs, I'm inclined to believe that pami is the fastest target.

#7 Updated by Eric Bohm about 1 month ago

  • Assignee set to Nitin Bhat

#8 Updated by Sam White about 1 month ago

My current hypothesis is that context threads aren't working, so we should switch the default to uFcontext threads (which are faster anyway, and already the default on all x86_64 builds). To test, we just need to link an AMPI program with "-thread uFcontext".

#9 Updated by Sam White about 1 month ago

If that doesn't work, then my next guess would be that Isomalloc is broken, possibly because this system needs '+isomalloc_sync' and that part of it is broken. But AMPI should work without migration if you pass '+tcharm_nomig' at runtime.
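
For example, reusing the run line from the description (the placement of '+tcharm_nomig' alongside the other runtime flags is an assumption):

# Sketch: run megampi with thread migration disabled to isolate Isomalloc:
jsrun -n2 ./pgm +vp2 +tcharm_nomig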

#10 Updated by Nitin Bhat about 1 month ago

Programs linked with "-thread uFcontext" crash with a segfault:

bash-4.2$ make test OPTS="-thread uFcontext" 
../../../bin/ampicxx -thread uFcontext -c test.C
../../../bin/ampicxx -thread uFcontext -o pgm test.o -balancer RandCentLB
../../../bin/testrun  ./pgm +p1 +vp1

Running on 1 processors:  ./pgm +vp1
jsrun -r1 -c2 ./pgm +vp1
Choosing optimized barrier algorithm name I0:OneTaskBarrier:OneTask:OneTask
Charm++> Running in SMP mode: numNodes 1,  1 worker threads per process
Charm++> There's no comm. thread. Work threads both send and receive messages
Converse/Charm++ Commit ID: v6.8.2-667-ga69de39
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (176-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
CharmLB> RandCentLB created.
ERROR:  One or more process terminated with signal 11
make: *** [test] Error 139

When running with "+isomalloc_sync", there's a hang right at startup:

bash-4.2$ make test TESTOPTS="+isomalloc_sync" 
../../../bin/testrun  ./pgm +p1 +vp1  +isomalloc_sync

Running on 1 processors:  ./pgm +vp1 +isomalloc_sync
jsrun -r1 -c2 ./pgm +vp1 +isomalloc_sync
Choosing optimized barrier algorithm name I0:OneTaskBarrier:OneTask:OneTask
Charm++> Running in SMP mode: numNodes 1,  1 worker threads per process
Charm++> There's no comm. thread. Work threads both send and receive messages
Converse/Charm++ Commit ID: v6.8.2-667-ga69de39
Charm++> Synchronizing isomalloc memory region...

#11 Updated by Sam White about 1 month ago

  • Subject changed from megampi test hangs for pami-linux-ppc64le-smp & pamilrts-linux-ppc64le-smp builds to Thread migration fails on ppc64le builds

I think Isomalloc needs +isomalloc_sync on this system, but isomalloc_sync itself is hanging during startup. We need to debug that hang further.
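
One possible way to see where it's stuck (a sketch; assumes an interactive allocation where gdb can attach to the running task, and uses 'pgrep -n pgm' only as a convenience for grabbing the newest matching PID):

# Start the hanging run in the background, then dump a backtrace of all threads:
jsrun -r1 -c2 ./pgm +vp1 +isomalloc_sync &
gdb -p $(pgrep -n pgm) -batch -ex 'thread apply all bt'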

I opened a new issue for uFcontext threads on ppc64le: https://charm.cs.illinois.edu/redmine/issues/1913
