Bug #1796

netlrts-linux-x86_64-tcp hangs in autobuild on partitions test

Added by Sam White 17 days ago. Updated 11 days ago.

Status: New
Priority: Normal
Category: -
Target version:
Start date: 02/08/2018
Due date:
% Done: 0%


Description

The test appears to finish but doesn't actually exit.

../../../bin/testrun  ./hello +p4 10 2 +partitions 2 ++local
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.009 seconds.
Charm++> Running in non-SMP mode: numPes 4
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: 879601b
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: 879601b
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Running Hello on 2 processors for 10 elements
[0] Hello 0 created
[0] Hello 1 created
[0] Hello 2 created
[0] Hello 3 created
[0] Hello 4 created
[0] Hi[17] from element 0
[0] Hi[18] from element 1
[0] Hi[19] from element 2
[0] Hi[20] from element 3
[0] Hi[21] from element 4
[1] Hello 5 created
[1] Hello 6 created
[1] Hello 7 created
[1] Hello 8 created
[1] Hello 9 created
[1] Hi[22] from element 5
[1] Hi[23] from element 6
[1] Hi[24] from element 7
[1] Hi[25] from element 8
[1] Hi[26] from element 9
All done
[Partition 0][Node 0] End of program
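
For anyone trying to catch this outside the autobuild, the sketch below runs a test command under a timeout and separates "printed its final output but never exited" from an ordinary hang. It is not part of the Charm++ test harness. The helper names `run_with_timeout` and `classify`, the 60-second limit, and the choice of "End of program" as the completion marker are all assumptions made for illustration. It relies on POSIX process groups so that child processes started by the launcher are killed as well.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exited, returncode, output).

    The child is started in its own session so that, on timeout, the whole
    process group (launcher plus anything it spawned) can be killed.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # POSIX only: child becomes a group leader
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return True, proc.returncode, out
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # group already gone
        out, _ = proc.communicate()
        return False, None, out


def classify(exited, output, marker="End of program"):
    """Label a run; the marker string is an assumed completion signal."""
    if exited:
        return "exited"
    if marker in output:
        return "finished-but-did-not-exit"
    return "hung-before-finishing"


if __name__ == "__main__":
    # Usage: python check_hang.py <test command and its arguments>
    exited, code, output = run_with_timeout(sys.argv[1:], timeout_s=60)
    sys.stdout.write(output)
    print("result:", classify(exited, output), "returncode:", code)
```

Passing the command line from the description as the arguments would be expected to report "finished-but-did-not-exit" if the hang reproduces, since the output above ends with "End of program" while the process keeps running.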

History

#1 Updated by Karthik Senthil 11 days ago

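I looked into this further using gdb and a debug build. It looks like the process for the second partition hangs in CPU topology initialization: the backtrace below shows it polling the network from CsdSchedulePoll, called from LrtsInitCpuTopo.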
This does not happen if there is only one PE per partition.

Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: v6.8.2-282-gddd93ef3a
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7267730 in __poll_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff7267730 in __poll_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x0000000000631a0d in CheckSocketsReady (withDelayMs=0, output=1)
    at machine-tcp.c:151
#2  0x0000000000631d3b in CommunicationServerNet (sleepTime=0, where=2)
    at machine-tcp.c:243
#3  0x0000000000632aa7 in LrtsAdvanceCommunication (whileidle=0)
    at machine.c:1672
#4  0x000000000062e9be in AdvanceCommunication (whenidle=0)
    at machine-common-core.c:1392
#5  0x000000000062ec29 in CmiGetNonLocal () at machine-common-core.c:1562
#6  0x0000000000634f34 in CsdNextMessage (s=0x7fffffffe290) at convcore.c:1754
#7  0x0000000000635280 in CsdSchedulePoll () at convcore.c:1949
#8  0x000000000064b3b2 in LrtsInitCpuTopo (argv=0x7fffffffe808)
    at cputopology.C:604
#9  0x000000000064b60d in CmiInitCPUTopology (argv=0x7fffffffe808)
    at cputopology.C:694
#10 0x0000000000547a3c in _initCharm (unused_argc=2, argv=0x7fffffffe808)
    at init.C:1362
#11 0x000000000062e977 in ConverseRunPE (everReturn=0)
    at machine-common-core.c:1371
#12 0x000000000062e892 in ConverseInit (argc=4, argv=0x7fffffffe808, 
    fn=0x5473b5 <_initCharm(int, char**)>, usched=0, initret=0)
