LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors
I'm trying to run ibverbs-smp NAMD on Stampede, and on 512+ nodes I regularly get segfaults during startup. I've tracked them down to the fact that on the net layer LrtsInitCpuTopo() does its global physical node search by hijacking the message loop until it finishes. This means that NAMD's WorkDistrib group tends to be created on some nodes before the topology information is generated.
By some miracle the MIC port doesn't have this issue. I think David Kunzman was encountering it but he has his MIC startup code sitting at the exact right place to introduce a delay. I can probably do a workaround for the NAMD release, but this is really confusing.
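The ordering problem described above can be illustrated with a toy model. This is only a sketch of one possible reading of the report, not Charm++ code: `ToyPe`, `Msg`, `pollOnce`, `initTopoByPolling` and `runToy` are all hypothetical names, and the "scheduler" is just a queue drained in arrival order. It assumes the topology init spins a generic poll loop that runs whatever message is at the head of the queue, so a group constructor that arrives first executes before the topology data exists.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Toy model only: none of these names exist in Charm++ or NAMD.
enum class Msg { GroupCtor, TopoReply };

struct ToyPe {
    std::deque<Msg> queue;         // pending messages, in arrival order
    bool topoReady = false;        // set once the topology reply is handled
    std::vector<std::string> log;  // record of what ran, in order

    // Generic scheduler poll: runs whatever is at the head of the queue,
    // regardless of message type.
    void pollOnce() {
        if (queue.empty()) return;
        Msg m = queue.front();
        queue.pop_front();
        if (m == Msg::TopoReply) {
            topoReady = true;
            log.push_back("topology ready");
        } else {
            log.push_back(topoReady ? "group ctor: topology ok"
                                    : "group ctor: topology MISSING");
        }
    }

    // Stand-in for a topology init that borrows the message loop
    // until its own reply has been processed.
    void initTopoByPolling() {
        while (!topoReady && !queue.empty()) pollOnce();
    }
};

// Run the toy startup sequence for a given message arrival order.
std::vector<std::string> runToy(std::deque<Msg> arrivals) {
    ToyPe pe;
    pe.queue = std::move(arrivals);
    pe.initTopoByPolling();                   // may run unrelated messages
    while (!pe.queue.empty()) pe.pollOnce();  // normal scheduling afterwards
    assert(pe.queue.empty());
    return pe.log;
}
```

In this model, `runToy({Msg::GroupCtor, Msg::TopoReply})` logs the group constructor running with the topology missing, while the reverse arrival order is harmless. That would be consistent with the failure appearing only on some nodes and only at scale.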
#16 Updated by Sam White almost 2 years ago
The above merged change doesn't change anything for bug #1381: Crash in LrtsInitCpuTopo() on Quartz with verbs layer.
./charmrun +p2 ./hello ++mpiexec ++remote-shell ./mysrun
Charmrun> scalable start enabled.
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.214 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.7.0-1005-g8b8bb11
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Length mismatch!!
Stack Traceback:
  [0:0] CmiAbortHelper+0xb3 [0x613f78]
  [0:1] CmiAbort+0x2d [0x613fb3]
  [0:2] [0x618d3a]
  [0:3] [0x618fda]
  [0:4] [0x618928]
  [0:5] [0x618843]
  [0:6] [0x618c54]
  [0:7] LrtsAdvanceCommunication+0x1a [0x61bd6e]
  [0:8] [0x613dac]
  [0:9] CmiGetNonLocal+0x75 [0x61402a]
  [0:10] CsdNextMessage+0x9b [0x61e12d]
  [0:11] CsdSchedulePoll+0x73 [0x61e471]
  [0:12] LrtsInitCpuTopo+0x2e5 [0x634491]
  [0:13] CmiInitCPUTopology+0x18 [0x634672]
  [0:14] _Z10_initCharmiPPc+0x651 [0x534ad2]
  [0:15] [0x613d66]
  [0:16] ConverseInit+0x324 [0x613c82]
  [0:17] main+0x3f [0x5327bc]
  [0:18] __libc_start_main+0xf5 [0x2aaaab929b35]
  [0:19] [0x52dab9]
#17 Updated by Juan Galvez almost 2 years ago
Makes sense. Looking at bug #1381, the error occurs during low-level processing of messages sent in the first phase of LrtsInitCpuTopo(), which happens before the code added in my patch. The failure is likely in processing either the CmiReduce (if the error is on PE0) or the broadcast message sent from PE0.
#18 Updated by Juan Galvez almost 2 years ago
#21 Updated by Juan Galvez over 1 year ago
- Status changed from In Progress to New
- Target version changed from 6.8.1 to 6.9.0
I'm pretty sure this bug has been solved, because the above-mentioned patches prevent Charm init from progressing on any PE until InitCpuTopo has completed.
Also, there is currently no way to replicate it: Stampede was decommissioned and there is no way to do large-scale ibverbs runs.
For now, pushing this to 6.9, but the NAMD group should decide whether to retire it.
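The gating idea in this update (init does not progress until InitCpuTopo has completed) can be sketched with a toy model. This is not the actual patch, whose details are not shown in this thread; `GatedPe`, `Msg` and `runGated` are hypothetical names. The sketch assumes that any non-topology message seen while the topology is still pending is set aside and replayed afterwards.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Toy model only: none of these names exist in Charm++ or NAMD.
enum class Msg { GroupCtor, TopoReply };

struct GatedPe {
    std::deque<Msg> queue;         // pending messages, in arrival order
    std::deque<Msg> deferred;      // messages held back until topology is ready
    bool topoReady = false;
    std::vector<std::string> log;  // record of what ran, in order

    void pollOnce() {
        if (queue.empty()) return;
        Msg m = queue.front();
        queue.pop_front();
        if (m == Msg::TopoReply) {
            topoReady = true;
            log.push_back("topology ready");
        } else if (!topoReady) {
            deferred.push_back(m);  // the gate: do not run init work yet
        } else {
            log.push_back("group ctor: topology ok");
        }
    }
};

// Run the gated toy startup sequence for a given message arrival order.
std::vector<std::string> runGated(std::deque<Msg> arrivals) {
    GatedPe pe;
    pe.queue = std::move(arrivals);
    // Phase 1: poll until the topology reply has been handled.
    while (!pe.topoReady && !pe.queue.empty()) pe.pollOnce();
    // Phase 2: replay held-back messages ahead of anything still queued.
    while (!pe.deferred.empty()) {
        pe.queue.push_front(pe.deferred.back());
        pe.deferred.pop_back();
    }
    while (!pe.queue.empty()) pe.pollOnce();
    assert(pe.deferred.empty());
    return pe.log;
}
```

With the gate in place, `runGated({Msg::GroupCtor, Msg::TopoReply})` handles the topology reply first and only then runs the group constructor, whatever the arrival order.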