Bug #1796

Support for partitions on netlrts-linux-x86_64-tcp builds

Added by Sam White 8 months ago. Updated 7 months ago.

Status:
New
Priority:
Normal
Category:
-
Target version:
-
Start date:
02/08/2018
Due date:
% Done:
0%


Description

The test runs to completion and prints its final output, but the process never exits.

../../../bin/testrun  ./hello +p4 10 2 +partitions 2 ++local
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.009 seconds.
Charm++> Running in non-SMP mode: numPes 4
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: 879601b
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: 879601b
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Running Hello on 2 processors for 10 elements
[0] Hello 0 created
[0] Hello 1 created
[0] Hello 2 created
[0] Hello 3 created
[0] Hello 4 created
[0] Hi[17] from element 0
[0] Hi[18] from element 1
[0] Hi[19] from element 2
[0] Hi[20] from element 3
[0] Hi[21] from element 4
[1] Hello 5 created
[1] Hello 6 created
[1] Hello 7 created
[1] Hello 8 created
[1] Hello 9 created
[1] Hi[22] from element 5
[1] Hi[23] from element 6
[1] Hi[24] from element 7
[1] Hi[25] from element 8
[1] Hi[26] from element 9
All done
[Partition 0][Node 0] End of program
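
Because the run prints "End of program" for partition 0 and then never returns, wrapping the command in a time limit makes the hang easy to detect from a script. A minimal sketch, assuming GNU coreutils `timeout` is available; the function name `run_or_report_hang` and the 60-second limit are made up for illustration:

```shell
# Run a command under a time limit and report whether it exited or hung.
# Relies on GNU coreutils `timeout`, which exits with status 124 when the
# time limit is reached.
run_or_report_hang() {
    limit="$1"; shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG: no exit within ${limit}s: $*"
    else
        echo "EXITED: status $status"
    fi
    return "$status"
}

# Usage with the command from this report (60 s is an arbitrary limit):
# run_or_report_hang 60 ../../../bin/testrun ./hello +p4 10 2 +partitions 2 ++local
```

The wrapper returns the original status, so a calling script can still fail on a hang.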

History

#1 Updated by Karthik Senthil 7 months ago

I looked into this further using gdb on a debug build. It looks like the process for the second partition hangs in CPU topology initialization.
This does not happen if there is only one PE per partition.

Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: v6.8.2-282-gddd93ef3a
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7267730 in __poll_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff7267730 in __poll_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x0000000000631a0d in CheckSocketsReady (withDelayMs=0, output=1)
    at machine-tcp.c:151
#2  0x0000000000631d3b in CommunicationServerNet (sleepTime=0, where=2)
    at machine-tcp.c:243
#3  0x0000000000632aa7 in LrtsAdvanceCommunication (whileidle=0)
    at machine.c:1672
#4  0x000000000062e9be in AdvanceCommunication (whenidle=0)
    at machine-common-core.c:1392
#5  0x000000000062ec29 in CmiGetNonLocal () at machine-common-core.c:1562
#6  0x0000000000634f34 in CsdNextMessage (s=0x7fffffffe290) at convcore.c:1754
#7  0x0000000000635280 in CsdSchedulePoll () at convcore.c:1949
#8  0x000000000064b3b2 in LrtsInitCpuTopo (argv=0x7fffffffe808)
    at cputopology.C:604
#9  0x000000000064b60d in CmiInitCPUTopology (argv=0x7fffffffe808)
    at cputopology.C:694
#10 0x0000000000547a3c in _initCharm (unused_argc=2, argv=0x7fffffffe808)
    at init.C:1362
#11 0x000000000062e977 in ConverseRunPE (everReturn=0)
    at machine-common-core.c:1371
#12 0x000000000062e892 in ConverseInit (argc=4, argv=0x7fffffffe808, 
    fn=0x5473b5 <_initCharm(int, char**)>, usched=0, initret=0)
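
To collect a backtrace like the one above from a process that is already hung, gdb can be attached non-interactively. A minimal sketch; `hung_bt_command` is a hypothetical helper that only prints the gdb invocation for a given PID, so the command can be reviewed before it is run:

```shell
# Print a gdb command line that attaches to a running process, dumps the
# backtrace of every thread, and then exits. -p, -batch and -ex are
# standard gdb command-line options.
hung_bt_command() {
    pid="$1"
    printf 'gdb -p %s -batch -ex "thread apply all bt"\n' "$pid"
}

# Usage: print one command per hung ./hello process.
# for pid in $(pgrep -x hello); do hung_bt_command "$pid"; done
```

Attaching to a running process may require ptrace permission on some systems.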

#2 Updated by Sam White 7 months ago

Please run this partitions test on v6.8.2 to see whether the recent changes to charmrun caused this issue.

#3 Updated by Karthik Senthil 7 months ago

I see the same hang for this test even with Charm++ v6.8.2.

#4 Updated by Karthik Senthil 7 months ago

  • Status changed from New to Implemented

Gerrit patch to skip the partitions test for TCP build: https://charm.cs.illinois.edu/gerrit/#/c/3811/
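
The patch itself is at the link above and is not reproduced here. Purely as an illustration of the idea, a test driver could decide whether to skip based on the build name. The function name `is_tcp_build` and the matching on a `-tcp` component are assumptions, not taken from the patch:

```shell
# Return success (0) when the given build name has a "tcp" component,
# e.g. netlrts-linux-x86_64-tcp. Hypothetical helper, not the actual patch.
is_tcp_build() {
    case "$1" in
        *-tcp|*-tcp-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
# if is_tcp_build "netlrts-linux-x86_64-tcp"; then
#     echo "skipping partitions test on TCP build"
# fi
```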

#5 Updated by Evan Ramos 7 months ago

The patch looks good as far as not letting autobuild hang any more. I'm not sure we can call the issue implemented, though, since the patch only keeps our infrastructure from failing rather than resolving the underlying problem.

#6 Updated by Sam White 7 months ago

  • Status changed from Implemented to New
  • Subject changed from netlrts-linux-x86_64-tcp hangs in autobuild on partitions test to Support for partitions on netlrts-linux-x86_64-tcp builds
  • Target version deleted (6.9.0)

Yes, let's leave the issue open, since the patch is just a workaround. I don't think the real issue is a release blocker for 6.9.0 though, since tcp builds are rare.
