Bug #529

LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors

Added by Jim Phillips over 4 years ago. Updated 11 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: Machine Layers
Target version:
Start date: 07/06/2014
Due date:
% Done: 0%
Spent time:

Description

I'm trying to run ibverbs-smp NAMD on Stampede, and on 512+ nodes I regularly get segfaults during startup. I've tracked them down to the fact that on the net layer LrtsInitCpuTopo() does its global physical node search by hijacking the message loop until it finishes. This means that NAMD's WorkDistrib group tends to be created on some nodes before the topology information is generated.

By some miracle the MIC port doesn't have this issue. I think David Kunzman was encountering it but he has his MIC startup code sitting at the exact right place to introduce a delay. I can probably do a workaround for the NAMD release, but this is really confusing.
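The ordering problem described above can be reproduced in a small single-process model. This is only a sketch under stated assumptions: MiniRuntime, poll_once, init_topology and constructor_saw_topology are hypothetical names invented for illustration, and the code does not use or reproduce any Charm++ or NAMD internals. It shows how an init routine that waits for its own reply by pumping the message queue ends up delivering a group-creation message that was queued ahead of that reply.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-ins; none of these names come from Charm++ or NAMD.
struct MiniRuntime {
    std::deque<std::function<void()>> queue;  // pending messages
    bool topology_ready = false;

    // Deliver one queued message, if any (a stand-in for one scheduler poll).
    void poll_once() {
        if (queue.empty()) return;
        auto msg = std::move(queue.front());
        queue.pop_front();
        msg();
    }

    // Topology init that waits for its own reply by pumping the message
    // loop, so unrelated messages ahead of the reply are delivered too.
    void init_topology() {
        bool reply_seen = false;
        queue.push_back([&] { reply_seen = true; });
        while (!reply_seen) poll_once();
        topology_ready = true;
    }
};

// Returns what a group constructor observed when its creation message was
// already queued before topology init started.
bool constructor_saw_topology() {
    MiniRuntime rt;
    bool seen = true;
    rt.queue.push_back([&] { seen = rt.topology_ready; });  // group creation
    rt.init_topology();
    assert(rt.topology_ready);  // init did finish, just too late
    return seen;
}
```

In this model constructor_saw_topology() returns false: the constructor runs inside the init routine's own polling loop, before topology_ready is set.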


Related issues

Related to Charm++ - Bug #1381: Crash in LrtsInitCpuTopo() on Quartz with verbs layer (Rejected, 01/26/2017)
Related to Charm++ - Bug #1409: verbs crashes on Stampede KNL and Bridges (Rejected, 02/10/2017)
Related to Charm++ - Bug #1048: Verbs on Bridges at PSC crashes or hangs (Rejected, 04/28/2016)

History

#1 Updated by Eric Bohm over 4 years ago

Core group decision:

Make "delay until main" an option. Also detect the condition where a topology query is made before the topology is initialized, and issue a meaningful error.
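The second half of that decision, failing loudly on an early query, could look roughly like the sketch below. It is an illustration only: TopologyTable, publish and node_of are hypothetical names, not Charm++ API, and a real runtime would more likely abort (for example through CmiAbort) than throw a C++ exception. The exception is used here so the behavior is easy to check in isolation.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical topology table; names are illustrative, not Charm++ API.
class TopologyTable {
public:
    // Called once, when the physical node search has finished.
    void publish(std::vector<int> node_of_pe) {
        assert(!initialized_);  // publishing twice would be a logic bug
        node_of_pe_ = std::move(node_of_pe);
        initialized_ = true;
    }

    // Returns the physical node of a PE, or fails loudly if called too early.
    int node_of(int pe) const {
        if (!initialized_) {
            throw std::logic_error(
                "topology queried for PE " + std::to_string(pe) +
                " before CPU topology initialization completed");
        }
        return node_of_pe_.at(static_cast<std::size_t>(pe));
    }

private:
    std::vector<int> node_of_pe_;
    bool initialized_ = false;
};
```

With a guard like this, an early query from a group constructor would report what went wrong and for which PE, instead of surfacing later as a segfault.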

#2 Updated by Eric Bohm over 4 years ago

  • Assignee set to Ronak Buch

#3 Updated by Ronak Buch over 4 years ago

  • Status changed from New to In Progress

#4 Updated by Jim Phillips over 4 years ago

  • Target version changed from 6.6.0 to 6.7.0

A workaround (basically CcdCallOnCondition(CcdTOPOLOGY_AVAIL, ...)) has been checked into NAMD so this issue is no longer urgent, but it is a surprising behavior that should be fixed.
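The general shape of that workaround, deferring topology-dependent work until a condition is raised, can be sketched without the runtime. Everything below is a mock: ConditionRegistry, call_on, raise and TOPOLOGY_AVAILABLE are hypothetical names, and this is not how Converse implements CcdCallOnCondition. In particular, running the callback immediately when the condition has already been raised is an assumption of this sketch and may differ from Converse's behavior.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical condition registry modelling the "call on condition" idea.
// This is NOT Converse's implementation, and the names are made up.
enum Condition { TOPOLOGY_AVAILABLE };

class ConditionRegistry {
public:
    // Queue fn to run when cond is raised; run it now if already raised.
    void call_on(Condition cond, std::function<void()> fn) {
        if (raised_[cond]) { fn(); return; }
        pending_[cond].push_back(std::move(fn));
    }

    // Mark cond as raised and run every callback waiting on it.
    void raise(Condition cond) {
        raised_[cond] = true;
        auto waiting = std::move(pending_[cond]);
        pending_[cond].clear();
        for (auto& fn : waiting) fn();
    }

private:
    std::map<Condition, bool> raised_;
    std::map<Condition, std::vector<std::function<void()>>> pending_;
};

// Demo: the topology-dependent step is deferred, so it never observes
// missing topology. Returns true if the step ran with topology ready.
bool deferred_step_saw_topology() {
    ConditionRegistry reg;
    bool topology_ready = false;
    bool ran = false, saw = false;
    reg.call_on(TOPOLOGY_AVAILABLE, [&] { ran = true; saw = topology_ready; });
    assert(!ran);            // nothing runs until the condition is raised
    topology_ready = true;   // topology search completes
    reg.raise(TOPOLOGY_AVAILABLE);
    return ran && saw;
}
```

The cost of this pattern is that every topology-dependent startup step has to opt in, which is why it works as a workaround in NAMD but does not remove the surprising behavior in the runtime itself.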

#5 Updated by Ronak Buch almost 4 years ago

  • Priority changed from High to Normal

#6 Updated by Nikhil Jain about 3 years ago

Ping. What is the status here? Is this bug for real?

#7 Updated by Phil Miller about 3 years ago

  • Target version changed from 6.7.0 to 6.8.0

NAMD has its workaround, so not critical.

#8 Updated by Ronak Buch almost 3 years ago

  • Assignee changed from Ronak Buch to Vipul Harsh

#9 Updated by Bilge Acun over 1 year ago

This issue https://charm.cs.illinois.edu/redmine/issues/1381 might also be related to this.

#10 Updated by Phil Miller over 1 year ago

  • Related to Bug #1381: Crash in LrtsInitCpuTopo() on Quartz with verbs layer added

#11 Updated by Phil Miller over 1 year ago

  • Target version changed from 6.8.0 to 6.8.1

Based on the preference for MPI machine layer on OmniPath systems for now, we're deferring this.

#12 Updated by Phil Miller over 1 year ago

  • Assignee changed from Vipul Harsh to Juan Galvez

#13 Updated by Phil Miller over 1 year ago

This may have been addressed by https://charm.cs.illinois.edu/gerrit/#/c/2723/, which synchronizes startup to ensure topology is available before reaching user code.

#14 Updated by Phil Miller over 1 year ago

  • Related to Bug #1409: verbs crashes on Stampede KNL and Bridges added

#15 Updated by Phil Miller over 1 year ago

  • Related to Bug #1048: Verbs on Bridges at PSC crashes or hangs added

#16 Updated by Sam White over 1 year ago

The above merged change doesn't change anything for bug #1381: Crash in LrtsInitCpuTopo() on Quartz with verbs layer.

./charmrun +p2 ./hello ++mpiexec ++remote-shell ./mysrun
Charmrun> scalable start enabled. 
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.214 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.7.0-1005-g8b8bb11
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: 

        Length mismatch!!

[0] Stack Traceback:
  [0:0] CmiAbortHelper+0xb3  [0x613f78]
  [0:1] CmiAbort+0x2d  [0x613fb3]
  [0:2]   [0x618d3a]
  [0:3]   [0x618fda]
  [0:4]   [0x618928]
  [0:5]   [0x618843]
  [0:6]   [0x618c54]
  [0:7] LrtsAdvanceCommunication+0x1a  [0x61bd6e]
  [0:8]   [0x613dac]
  [0:9] CmiGetNonLocal+0x75  [0x61402a]
  [0:10] CsdNextMessage+0x9b  [0x61e12d]
  [0:11] CsdSchedulePoll+0x73  [0x61e471]
  [0:12] LrtsInitCpuTopo+0x2e5  [0x634491]
  [0:13] CmiInitCPUTopology+0x18  [0x634672]
  [0:14] _Z10_initCharmiPPc+0x651  [0x534ad2]
  [0:15]   [0x613d66]
  [0:16] ConverseInit+0x324  [0x613c82]
  [0:17] main+0x3f  [0x5327bc]
  [0:18] __libc_start_main+0xf5  [0x2aaaab929b35]
  [0:19]   [0x52dab9]

#17 Updated by Juan Galvez over 1 year ago

Makes sense. Looking at bug #1381, the error occurs during low-level processing of messages sent in the first phase of LrtsInitCpuTopo(), which happens before the code added in my patch. The error is likely in processing either the CmiReduce (if the error is on PE0, it may be the CmiReduce) or the broadcast message sent from PE0.

#18 Updated by Juan Galvez over 1 year ago

This particular bug has probably been solved by the recent cputopology patches:
https://charm.cs.illinois.edu/gerrit/#/c/2723/
https://charm.cs.illinois.edu/gerrit/#/c/2735/

But it should be tested (the NAMD workaround would probably need to be disabled to verify).

#19 Updated by Sam White over 1 year ago

I still get the exact same "Length mismatch!!" abort with verbs-linux-x86_64 on Quartz on today's master version of charm.

#20 Updated by Juan Galvez over 1 year ago

I'm referring to this bug (529), not 1381.

#21 Updated by Juan Galvez about 1 year ago

  • Status changed from In Progress to New
  • Target version changed from 6.8.1 to 6.9.0

I'm pretty sure this bug has been solved because the above-mentioned patches prevent Charm init from progressing on all PEs until InitCpuTopo has completed.

Also, there is currently no way to replicate it. Stampede was decommissioned and there is no way to do large-scale ibverbs runs.

For now, pushing this to 6.9, but the NAMD group should decide whether to retire it.

#22 Updated by Eric Bohm 12 months ago

  • Status changed from New to Resolved

#23 Updated by Juan Galvez 11 months ago

  • Status changed from Resolved to Closed
