Bug #1381

Crash in LrtsInitCpuTopo() on Quartz with verbs layer

Added by Sam White over 2 years ago. Updated almost 2 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: Machine Layers
Target version:
Start date: 01/26/2017
Due date:
% Done: 0%


Description

Quartz is a new cluster with an Intel Omni-Path 100 Gb/s interconnect. If I build with the verbs layer and run on more than one process (non-SMP with +p > 1, or SMP with numNodes > 1), the run aborts during startup with "Length mismatch!!". The same program works with the mpi layer.

$ ./charmrun +p2 ./pgm ++mpiexec ++remote-shell ./mysrun
Charmrun> scalable start enabled. 
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.227 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.7.0-585-g1255a4a
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: 
        Length mismatch!!

[0] Stack Traceback:
  [0:0] CmiAbortHelper+0x63  [0x5c8d23]
  [0:1]   [0x5cc189]
  [0:2] CmiGetNonLocal+0xa3  [0x5cd033]
  [0:3] CsdNextMessage+0x75  [0x5d2655]
  [0:4] CsdSchedulePoll+0x48  [0x5d2998]
  [0:5] LrtsInitCpuTopo+0x375  [0x5e1a05]
  [0:6] _Z10_initCharmiPPc+0xa6b  [0x5051db]
  [0:7] ConverseInit+0x2fe  [0x5d0f1e]
  [0:8] main+0x27  [0x490dd7]
  [0:9] __libc_start_main+0xf5  [0x2aaaab928b35]
  [0:10]   [0x4914af]
Fatal error on PE 0> 

        Length mismatch!!

Related issues

Related to Charm++ - Bug #1409: verbs crashes on Stampede KNL and Bridges (Rejected, 02/10/2017)
Related to Charm++ - Bug #1048: Verbs on Bridges at PSC crashes or hangs (Rejected, 04/28/2016)
Related to Charm++ - Bug #529: LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors (Closed, 07/06/2014)

History

#1 Updated by Jim Phillips over 2 years ago

You can fix this by adding --with-qlogic to the build command line, although it will probably still be slower than MPI.

#2 Updated by Sam White over 2 years ago

I added Bilge as a watcher, since I thought we had automated the choice of that build option?

#4 Updated by Sam White over 2 years ago

Building with --with-qlogic didn't help; I get the same error.

#5 Updated by Sam White over 2 years ago

This is the commit that I thought automated the Mellanox vs. QLogic choice in verbs: https://charm.cs.illinois.edu/gerrit/#/c/702/

#6 Updated by Sam White over 2 years ago

  • Category set to Machine Layers
  • Assignee set to Bilge Acun

#7 Updated by Bilge Acun over 2 years ago

I have reproduced this bug on 2 nodes; it works with +p>1 on a single node. This is likely the same issue we saw on Bridges (https://charm.cs.illinois.edu/redmine/issues/1048). That issue disappeared after some libraries on Bridges were updated, so I am leaving this for now, to be retested once the machine is more mature.

#8 Updated by Phil Miller over 2 years ago

  • Target version set to 6.8.0-beta1

#9 Updated by Phil Miller over 2 years ago

  • Target version changed from 6.8.0-beta1 to 6.8.1

#10 Updated by Phil Miller about 2 years ago

  • Related to Bug #529: LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors added

#11 Updated by Juan Galvez about 2 years ago

This particular error occurs during low-level processing of messages sent in the first phase of LrtsInitCpuTopo(), which happens before the code added in this patch (https://charm.cs.illinois.edu/gerrit/#/c/2723/). It is likely hit while processing either the CmiReduce (if the error is on PE0, it may be the CmiReduce) or the broadcast message sent from PE0. I think it's a low-level issue and doesn't have to do with the code in cputopology.
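
For readers unfamiliar with this class of failure, here is a minimal sketch of the kind of consistency check that can produce a "Length mismatch" abort when a message is received. It is an illustration only: the msg_header struct and the check_length function are hypothetical names, and this is not the actual Charm++ verbs-layer code or message format.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical header layout, for illustration only.
 * This is NOT the Charm++ verbs-layer message format. */
typedef struct {
    unsigned int declared_len; /* total message length claimed by the sender */
} msg_header;

/* Compare the length recorded in the header with the number of bytes
 * that actually arrived. Returns 0 on a match and -1 on a mismatch. */
int check_length(const msg_header *hdr, size_t received_len) {
    if ((size_t)hdr->declared_len != received_len) {
        fprintf(stderr, "Length mismatch: header says %u, received %zu\n",
                hdr->declared_len, received_len);
        return -1;
    }
    return 0;
}
```

A check of this shape fails whenever the transport delivers a different number of bytes than the sender recorded, whatever code sent the message. That is consistent with the suspicion above that the problem is in the machine layer and not in cputopology.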

#12 Updated by Juan Galvez about 2 years ago

We could easily disable cputopology and see whether the program progresses past that point. If the problem is in the machine-level implementation, which I think is the case, it would crash somewhere else instead.

#13 Updated by Eric Bohm almost 2 years ago

  • Assignee changed from Bilge Acun to Jaemin Choi

#14 Updated by Phil Miller almost 2 years ago

(Updating all of #1048, #1381, #1409)

Since these machines are all ostensibly going to be using the ofi network layer from 6.8.1 onward, do we want to close these bugs as 'rejected'?

#15 Updated by Sam White almost 2 years ago

  • Status changed from New to Rejected

OFI should be used on Quartz.
