verbs crashes on Stampede KNL and Bridges
Charm++ programs crash during startup with a "Length mismatch" abort when using the ibverbs SMP build on Stampede KNL.
Build command for Charm++:
./build charm++ verbs-linux-x86_64 smp -j16 --with-production
The following is the trace for the crash with examples/charm++/hello/1darray:
./charmrun +p4 ./hello ++mpiexec ++remote-shell 'ibrun -o 0'
Charmrun> scalable start enabled.
Charmrun> IBVERBS version of charmrun
TACC: Starting up job 59560
TACC: Starting parallel tasks...
Charmrun> started all node programs in 2.072 seconds.
Charm++> Running in SMP mode: numNodes 4, 1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-414-g0bd333d
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
line 1551 size: 0, len:60.
------------- Processor 4 Exiting: Called CmiAbort ------------
Reason: Length mismatch!!
 Stack Traceback:
[4:0] CmiAbortHelper+0x65 [0x526aa5]
[4:1] [0x52a87a]
[4:2] [0x52abe4]
[4:3] LrtsInitCpuTopo+0x295 [0x5453c5]
[4:4] _Z10_initCharmiPPc+0x1822 [0x47a2a2]
[4:5] [0x52c65e]
[4:6] [0x52c706]
[4:7] +0x7dc5 [0x2b5d1354fdc5]
[4:8] clone+0x6d [0x2b5d1450473d]
Fatal error on PE 4> Length mismatch!!
#4 Updated by Jaemin Choi over 2 years ago
- Subject changed from ibverbs crashes in SMP mode on Stampede KNL to ibverbs crashes on Stampede KNL
The crashes also happen in non-SMP mode.
I thought it might be a compiler issue (icc is the default on Stampede 1.5), so I tested with gcc, but the error persists.
It could be an issue with the interconnect, since both Stampede 1.5 and Quartz use Intel Omni-Path interconnects, and the crash does not occur on Stampede 1.0.
In the length mismatch line, size is 0 whereas len is 60.
#8 Updated by Sam White over 2 years ago
- Subject changed from ibverbs crashes on Stampede KNL to verbs crashes on Stampede KNL
- Tags set to #machi, #verbs
Verbs is one of the options on OmniPath (or at least most OmniPath systems). Which layer performs best (verbs, mpi, or ofi) is a question we'd like to answer (netlrts should be worse than all these others), and the first step to answering that is to have them all working.
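To compare the layers, each one needs its own build. A minimal sketch that only prints the build commands, assuming the other layers follow the same target naming as the verbs build reported above (mpi-linux-x86_64, ofi-linux-x86_64, netlrts-linux-x86_64) and accept the same options; build_cmd is a hypothetical helper, not part of Charm++:

```shell
#!/bin/sh
# Hypothetical helper: print the Charm++ build command for one machine layer.
# Assumption: every layer uses the <layer>-linux-x86_64 target name and takes
# the same options as the verbs build command in the original report.
build_cmd() {
    echo "./build charm++ $1-linux-x86_64 smp -j16 --with-production"
}

# Dry run: list the commands instead of executing them.
for layer in verbs mpi ofi netlrts; do
    build_cmd "$layer"
done
```

Running the printed commands from the top of the Charm++ source tree should give one build per layer to benchmark against the others.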
#13 Updated by Nitin Bhat about 2 years ago
For examples/converse/pingpong, I saw another verbs-layer crash:
[nbhat4@r135 pingpong]$ ./charmrun + Commit ID: v6.8.0-beta1-307-g0cc5e9c
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
Size=1024 bytes, time=9.088993 microseconds one-way
Size=2048 bytes, time=8.373976 microseconds one-way
Size=4096 bytes, time=9.494543 microseconds one-way
Size=8192 bytes, time=18.795013 microseconds one-way
Size=16384 bytes, time=27.297497 microseconds one-way
 wc0 status 10 wc[i].opcode 2
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Work completion error in sendCq
 Stack Traceback:
[1:0] CmiAbortHelper+0x63 [0x415533]
[1:3] CcdRaiseCondition+0x106 [0x4233a6]
[1:4] CsdScheduleForever+0x12a [0x4207ea]
[1:5] CsdScheduler+0x2d [0x420a5d]
[1:6] ConverseInit+0x35a [0x41ee9a]
[1:7] main+0x2d [0x413fdf]
[1:8] __libc_start_main+0xf5 [0x2b9fcd594b15]
Fatal error on PE 1> Work completion error in sendCq
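The "wc0 status 10 wc[i].opcode 2" line in the log above can be decoded against the ibv_wc_status and ibv_wc_opcode enums from libibverbs (infiniband/verbs.h). A small sketch, assuming the values printed by the verbs layer are the raw enum values; wc_status_name and wc_opcode_name are hypothetical helpers, and only the first entries of each enum are listed:

```shell
#!/bin/sh
# Hypothetical helper: map a numeric ibv_wc_status value to its enum name.
# Values follow the declaration order in <infiniband/verbs.h>.
wc_status_name() {
    case "$1" in
        0)  echo IBV_WC_SUCCESS ;;
        1)  echo IBV_WC_LOC_LEN_ERR ;;
        2)  echo IBV_WC_LOC_QP_OP_ERR ;;
        3)  echo IBV_WC_LOC_EEC_OP_ERR ;;
        4)  echo IBV_WC_LOC_PROT_ERR ;;
        5)  echo IBV_WC_WR_FLUSH_ERR ;;
        6)  echo IBV_WC_MW_BIND_ERR ;;
        7)  echo IBV_WC_BAD_RESP_ERR ;;
        8)  echo IBV_WC_LOC_ACCESS_ERR ;;
        9)  echo IBV_WC_REM_INV_REQ_ERR ;;
        10) echo IBV_WC_REM_ACCESS_ERR ;;
        11) echo IBV_WC_REM_OP_ERR ;;
        12) echo IBV_WC_RETRY_EXC_ERR ;;
        13) echo IBV_WC_RNR_RETRY_EXC_ERR ;;
        *)  echo "unknown status $1" ;;
    esac
}

# Hypothetical helper: map a numeric ibv_wc_opcode value to its enum name.
wc_opcode_name() {
    case "$1" in
        0) echo IBV_WC_SEND ;;
        1) echo IBV_WC_RDMA_WRITE ;;
        2) echo IBV_WC_RDMA_READ ;;
        3) echo IBV_WC_COMP_SWAP ;;
        4) echo IBV_WC_FETCH_ADD ;;
        *) echo "unknown opcode $1" ;;
    esac
}

# Decode the values from the pingpong log.
echo "status 10 -> $(wc_status_name 10)"
echo "opcode 2 -> $(wc_opcode_name 2)"
```

If that assumption holds, the failed work completion is an RDMA read that the remote side rejected with an access error, and it shows up right after the 16384-byte measurement.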