Bug #1932

Converse pingpong crashes for netlrts-linux-x86_64-smp build during nightly build tests

Added by Nitin Bhat 2 months ago. Updated about 2 months ago.

Status: Merged
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 06/12/2018
Due date:
% Done: 0%


Description

Nightly build output: http://ppl-jenkins:8080/job/Nightly-Build/label=xenial,platform=netlrts-linux-x86_64-smp/1591/console

make[3]: Entering directory '/scratch/jenkins/builds/Nightly-Build/label=xenial,platform=netlrts-linux-x86_64-smp@1591/charm/netlrts-linux-x86_64-smp/examples/converse/pingpong'
../../../bin/charmc -optimize -production  -language converse++ -c pingpong.C
../../../bin/charmc -optimize -production  -language converse++ -o pingpong pingpong.o
../../../bin/testrun  ./pingpong +p2  ++ppn 2
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 1.286 seconds.
Charm++> Running in SMP mode: 1 processes, 2 worker threads (PEs) + 1 comm threads per process, 2 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: 42773fd
Charm++> scheduler running in netpoll mode.
Charm++> Running on 1 hosts (2 sockets x 4 cores x 1 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.012 seconds.
Pingpong with iterations = 1000, minMsgSize = 512, maxMsgSize = 16384, increase factor = 2
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Msg handler does not exist, possible race condition during init

[1] Stack Traceback:
  [1:0] _Z14CmiAbortHelperPKcS0_S0_ii+0x56  [0x420796]
  [1:1]   [0x4207fb]
  [1:2]   [0x42bfe4]
  [1:3] CsdSchedulePoll+0x85  [0x42c565]
  [1:4] LrtsInitCpuTopo+0x425  [0x43f285]
  [1:5] _Z6mymainiPPc+0x2d6  [0x41d6c6]
  [1:6]   [0x428525]
  [1:7]   [0x428843]
  [1:8] +0x76ba  [0x7ffff7bc16ba]
  [1:9] clone+0x6d  [0x7ffff705641d]
Fatal error on PE 1> Msg handler does not exist, possible race condition during init

Makefile:18: recipe for target 'test' failed

History

#1 Updated by Nitin Bhat 2 months ago

  • Subject changed from netlrts-linux-x86_64-smp fails converse pingpong on nightly build tests to Converse pingpong crashes for netlrts-linux-x86_64-smp build during nightly build tests

#2 Updated by Evan Ramos 2 months ago

This crash was a single random failure. Every other build in the month before it passed, and the build the day afterward (today) passed as well.

#3 Updated by Nitin Bhat 2 months ago

Okay, but I think this requires more investigation because the crash indicates that the message was not set with the right Converse handler. I can look into this by running this version for about 1000 iterations to see whether I can reproduce the issue.

#4 Updated by Nitin Bhat 2 months ago

  • Assignee set to Nitin Bhat

#5 Updated by Nitin Bhat about 2 months ago

The bug is caused by a race condition between the two PEs. The call to CmiInitCPUTopology polls the scheduler for messages on PE 1 (CsdSchedulePoll() is called inside CmiInitCPUTopology in cputopology.c), so PE 1 can receive a message from PE 0 before PE 1 registers its message handlers (which happens in the pingpong code after the topology initialization). This happens when PE 0 is ahead in execution and sends a message with node1Handler before PE 1 has registered node1Handler. PE 1 then aborts with an invalid handler error because the handler index of the received message is greater than the maximum registered handler id.

The fix is to call CmiInitCPUTopology after the handlers are registered, so that PE 1 cannot receive a message whose handler is not yet registered.
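
The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not Converse code: ToyPe, register_handler, poll, init_then_register and register_then_init are hypothetical names invented for illustration. The model also starts with an empty handler table, a simplification of the real runtime. It shows that polling a queue that already holds a message fails when the handler index is not yet in the table, and succeeds when registration happens first.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy model of one PE: a handler table plus an incoming message queue.
// None of these names are Converse APIs; they are hypothetical stand-ins.
struct ToyPe {
    std::vector<std::function<void()>> handlers;  // index = handler id
    std::queue<std::size_t> inbox;                // a message is just a handler id
    int delivered = 0;

    // Stand-in for handler registration: returns the new handler's index.
    std::size_t register_handler(std::function<void()> fn) {
        handlers.push_back(std::move(fn));
        return handlers.size() - 1;
    }

    // Stand-in for a scheduler poll: drains the inbox.
    // Returns false when a message names a handler index that is not
    // registered yet (the point where the real runtime aborts).
    bool poll() {
        while (!inbox.empty()) {
            std::size_t id = inbox.front();
            inbox.pop();
            if (id >= handlers.size()) return false;
            handlers[id]();
            ++delivered;
        }
        return true;
    }
};

// Order seen in the crash: the init step polls before registration.
bool init_then_register() {
    ToyPe pe;
    pe.inbox.push(0);            // the peer is ahead and has already sent handler id 0
    bool ok = pe.poll();         // topology-init stand-in polls the scheduler
    pe.register_handler([] {});  // registration happens too late
    return ok;
}

// Order after the fix: register first, then run the init step that polls.
bool register_then_init() {
    ToyPe pe;
    pe.inbox.push(0);
    std::size_t id = pe.register_handler([] {});
    assert(id == 0);             // the first registered handler gets index 0
    return pe.poll();
}
```

In this model, init_then_register() reports a failed poll and register_then_init() reports a successful one, which mirrors why moving the topology initialization after handler registration avoids the abort.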

Fix: https://charm.cs.illinois.edu/gerrit/#/c/charm/+/4276/

#6 Updated by Nitin Bhat about 2 months ago

  • Status changed from New to Implemented

#7 Updated by Sam White about 2 months ago

  • Status changed from Implemented to Merged
