Bug #1409

verbs crashes on Stampede KNL and Bridges

Added by Karthik Senthil over 2 years ago. Updated almost 2 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: Machine Layers
Target version:
Start date: 02/10/2017
Due date:
% Done: 0%


Description

Charm++ programs crash before launch with a "Length mismatch" abort for the ibverbs SMP build on Stampede KNL.

Build command for Charm++:

./build charm++ verbs-linux-x86_64 smp -j16 --with-production

The following is the trace of the crash with examples/charm++/hello/1darray.
Run command:

./charmrun +p4 ./hello ++mpiexec ++remote-shell 'ibrun -o 0'

Charmrun> scalable start enabled. 
Charmrun> IBVERBS version of charmrun
TACC: Starting up job 59560
TACC: Starting parallel tasks...
Charmrun> started all node programs in 2.072 seconds.
Charm++> Running in SMP mode: numNodes 4,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-414-g0bd333d
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.  
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
line 1551 size: 0, len:60.
------------- Processor 4 Exiting: Called CmiAbort ------------
Reason: 

        Length mismatch!!

[4] Stack Traceback:
  [4:0] CmiAbortHelper+0x65  [0x526aa5]
  [4:1]   [0x52a87a]
  [4:2]   [0x52abe4]
  [4:3] LrtsInitCpuTopo+0x295  [0x5453c5]
  [4:4] _Z10_initCharmiPPc+0x1822  [0x47a2a2]
  [4:5]   [0x52c65e]
  [4:6]   [0x52c706]
  [4:7] +0x7dc5  [0x2b5d1354fdc5]
  [4:8] clone+0x6d  [0x2b5d1450473d]
Fatal error on PE 4> 

        Length mismatch!!

Related issues

Related to Charm++ - Bug #1381: Crash in LrtsInitCpuTopo() on Quartz with verbs layer (Rejected, 01/26/2017)
Related to Charm++ - Bug #1048: Verbs on Bridges at PSC crashes or hangs (Rejected, 04/28/2016)
Related to Charm++ - Bug #529: LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors (Closed, 07/06/2014)

History

#1 Updated by Jim Phillips over 2 years ago

Try adding --with-qlogic to the build line.

#2 Updated by Karthik Senthil over 2 years ago

The issue persists when building with the --with-qlogic option.

#3 Updated by Sam White over 2 years ago

  • Priority changed from Normal to High
  • Target version set to 6.8.0-beta1

This is the same issue as on the Quartz machine: https://charm.cs.illinois.edu/redmine/issues/1381

#4 Updated by Jaemin Choi over 2 years ago

  • Subject changed from ibverbs crashes in SMP mode on Stampede KNL to ibverbs crashes on Stampede KNL

The crashes also happen in non-SMP mode.
I thought it might be a compiler issue (icc is the standard compiler on Stampede 1.5), so I tested with gcc, but the error persists.
It could be an issue with the interconnect: both Stampede 1.5 and Quartz use Intel Omni-Path interconnects, and the crash does not occur on Stampede 1.0.
In the length mismatch line, size is 0 whereas len is 60.
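
When comparing logs from several machines, the size/len pair can be pulled out mechanically. The following is a minimal sketch, assuming the diagnostic line always has the form seen in the trace in the description ("line 1551 size: 0, len:60."). `find_length_mismatches` is a hypothetical helper for log triage, not part of Charm++.

```python
import re

# Matches the diagnostic line printed just before the abort, e.g.
# "line 1551 size: 0, len:60." (format assumed from the trace in this report).
MISMATCH_RE = re.compile(r"line\s+(\d+)\s+size:\s*(\d+),\s*len:\s*(\d+)")


def find_length_mismatches(log_text):
    """Return a (source_line, size, length) tuple for each diagnostic line found."""
    return [tuple(int(g) for g in m.groups())
            for m in MISMATCH_RE.finditer(log_text)]


if __name__ == "__main__":
    sample = "CharmLB> Load balancer assumes all CPUs are same.\nline 1551 size: 0, len:60.\n"
    for src_line, size, length in find_length_mismatches(sample):
        print(f"source line {src_line}: size={size}, len={length}")
```

Log lines that do not match the assumed format are ignored, so the helper returns an empty list for runs that did not hit this abort.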

#5 Updated by Phil Miller over 2 years ago

  • Description updated (diff)

Format description

#6 Updated by Phil Miller over 2 years ago

  • Target version changed from 6.8.0-beta1 to 6.8.1

#7 Updated by Jaemin Choi over 2 years ago

I accidentally compiled and tested the netlrts layer, and that worked fine.
That brings up my question: are we supposed to use the verbs layer on top of OmniPath?

#8 Updated by Sam White over 2 years ago

  • Subject changed from ibverbs crashes on Stampede KNL to verbs crashes on Stampede KNL
  • Tags set to #machi, #verbs

Verbs is one of the options on OmniPath (or at least most OmniPath systems). Which layer performs best (verbs, mpi, or ofi) is a question we'd like to answer (netlrts should be worse than all these others), and the first step to answering that is to have them all working.

#9 Updated by Phil Miller over 2 years ago

  • Tags changed from #machi, #verbs to #verbs, #machine-layers

#10 Updated by Phil Miller about 2 years ago

  • Related to Bug #529: LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors added

#11 Updated by Nitin Bhat about 2 years ago

I saw this crash for verbs-linux-x86_64 on Bridges as well.

#12 Updated by Nitin Bhat about 2 years ago

  • Subject changed from verbs crashes on Stampede KNL to verbs crashes on Stampede KNL and Bridges

#13 Updated by Nitin Bhat about 2 years ago

For examples/converse/pingpong, I saw another verbs layer related crash:

[nbhat4@r135 pingpong]$ ./charmrun +p2 ./pingpong ++mpiexec
Charmrun> scalable start enabled.
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.446 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.8.0-beta1-307-g0cc5e9c
Warning> Randomization of virtual memory (ASLR) is turned on in the kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try running with '+isomalloc_sync'.
Charm++> scheduler running in netpoll mode.
Size=1024 bytes, time=9.088993 microseconds one-way
Size=2048 bytes, time=8.373976 microseconds one-way
Size=4096 bytes, time=9.494543 microseconds one-way
Size=8192 bytes, time=18.795013 microseconds one-way
Size=16384 bytes, time=27.297497 microseconds one-way
[1] wc0 status 10 wc[i].opcode 2
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Work completion error in sendCq
[1] Stack Traceback:
[1:0] CmiAbortHelper+0x63 [0x415533]
[1:1] [0x41974c]
[1:2] [0x41b8c5]
[1:3] CcdRaiseCondition+0x106 [0x4233a6]
[1:4] CsdScheduleForever+0x12a [0x4207ea]
[1:5] CsdScheduler+0x2d [0x420a5d]
[1:6] ConverseInit+0x35a [0x41ee9a]
[1:7] main+0x2d [0x413fdf]
[1:8] __libc_start_main+0xf5 [0x2b9fcd594b15]
[1:9] [0x413b31]
Fatal error on PE 1> Work completion error in sendCq
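
The log above shows pingpong completing several message sizes before the work completion error. To see where a run stopped, or to compare timings across machine layers, the "Size=... time=..." lines can be parsed. This is a minimal sketch, assuming the output format shown in this log. `parse_pingpong` and `last_completed_size` are hypothetical helper names, not part of Charm++.

```python
import re

# Matches pingpong result lines such as
# "Size=1024 bytes, time=9.088993 microseconds one-way"
# (format assumed from the log in this comment).
PINGPONG_RE = re.compile(r"Size=(\d+) bytes, time=([\d.]+) microseconds one-way")


def parse_pingpong(log_text):
    """Return a list of (size_bytes, one_way_microseconds) pairs."""
    return [(int(size), float(usec)) for size, usec in PINGPONG_RE.findall(log_text)]


def last_completed_size(log_text):
    """Return the largest message size that reported a time, or None."""
    sizes = [size for size, _ in parse_pingpong(log_text)]
    return max(sizes) if sizes else None


if __name__ == "__main__":
    sample = (
        "Size=8192 bytes, time=18.795013 microseconds one-way\n"
        "Size=16384 bytes, time=27.297497 microseconds one-way\n"
        "[1] wc0 status 10 wc[i].opcode 2\n"
    )
    print(last_completed_size(sample))
```

The helper only reports which sizes finished. It does not say which message size triggered the failure.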

#14 Updated by Sam White almost 2 years ago

  • Priority changed from High to Normal

This is probably not a real issue, since MPI and OFI will be in 6.8.1 and OFI should be the preferred comm layer on both these machines.

#15 Updated by Phil Miller almost 2 years ago

(Updating all of #1048, #1381, and #1409.)

Since these machines are all ostensibly going to be using the ofi network layer from 6.8.1 onward, do we want to close these bugs as 'rejected'?

#16 Updated by Sam White almost 2 years ago

  • Status changed from New to Rejected

OFI should be used on Stampede2 and Bridges.
