Crash in LrtsInitCpuTopo() on Quartz with verbs layer
Quartz is a new cluster with an Intel Omni-Path 100 Gb/s interconnect. If I build on the verbs layer and run with more than one process (non-SMP with +p > 1, or SMP with numNodes > 1), the run fails with the abort shown below. The mpi layer works.
$ ./charmrun +p2 ./pgm ++mpiexec ++remote-shell ./mysrun
Charmrun> scalable start enabled.
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.227 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.7.0-585-g1255a4a
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Length mismatch!!

Stack Traceback:
  [0:0] CmiAbortHelper+0x63 [0x5c8d23]
  [0:1] [0x5cc189]
  [0:2] CmiGetNonLocal+0xa3 [0x5cd033]
  [0:3] CsdNextMessage+0x75 [0x5d2655]
  [0:4] CsdSchedulePoll+0x48 [0x5d2998]
  [0:5] LrtsInitCpuTopo+0x375 [0x5e1a05]
  [0:6] _Z10_initCharmiPPc+0xa6b [0x5051db]
  [0:7] ConverseInit+0x2fe [0x5d0f1e]
  [0:8] main+0x27 [0x490dd7]
  [0:9] __libc_start_main+0xf5 [0x2aaaab928b35]
  [0:10] [0x4914af]
Fatal error on PE 0> Length mismatch!!
#3 Updated by Jim Phillips over 2 years ago
I only see this:
which kept it from being on by default.
#7 Updated by Bilge Acun over 2 years ago
I have reproduced this bug on 2 nodes; it works with +p > 1 on a single node. This is likely the same issue we saw on Bridges:
https://charm.cs.illinois.edu/redmine/issues/1048 That issue disappeared on Bridges after they updated some libraries, so I am leaving this for now, to be tested again once the machine is more mature.
#11 Updated by Juan Galvez about 2 years ago
This particular error occurs during low-level processing of messages sent in the first phase of LrtsInitCpuTopo(), which happens before the code added in this patch (https://charm.cs.illinois.edu/gerrit/#/c/2723/). It is likely hit while processing either the CmiReduce (plausible if the error is on PE0) or the broadcast message sent from PE0. I think it's a low-level issue that is unrelated to the code in cputopology.
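To illustrate the kind of low-level check that could produce a "Length mismatch!!" abort, here is a minimal sketch in C. It is not the actual verbs machine-layer code: the header layout, `demo_header_t`, `demo_status_t`, and `demo_check_length` are all hypothetical names made up for this example, and it simply assumes the receiver compares a size announced in the message header against the number of bytes actually delivered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header. This is NOT the real Charm++ verbs-layer layout;
 * it only carries the total size the sender claims (header included). */
typedef struct {
    uint32_t declared_size;
} demo_header_t;

typedef enum {
    DEMO_OK = 0,
    DEMO_TOO_SHORT,        /* fewer bytes than a header */
    DEMO_LENGTH_MISMATCH   /* header size and delivered size disagree */
} demo_status_t;

/* Compare the size announced in the header with the bytes actually received. */
static demo_status_t demo_check_length(const void *buf, size_t received_bytes)
{
    demo_header_t hdr;

    assert(buf != NULL);
    if (received_bytes < sizeof hdr)
        return DEMO_TOO_SHORT;

    /* Copy the header out instead of casting, to avoid unaligned access. */
    memcpy(&hdr, buf, sizeof hdr);

    return (hdr.declared_size == received_bytes) ? DEMO_OK
                                                 : DEMO_LENGTH_MISMATCH;
}
```

Under this assumption, a reduction or broadcast message that arrives truncated, or with a corrupted header, would fail the check on the receiving PE regardless of what cputopology does with the payload afterwards.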