Converse pingpong crashes for netlrts-linux-x86_64-smp build during nightly build tests
make: Entering directory '/scratch/jenkins/builds/Nightly-Build/label=xenial,platform=netlrts-linux-x86_64-smp@1591/charm/netlrts-linux-x86_64-smp/examples/converse/pingpong'
../../../bin/charmc -optimize -production -language converse++ -c pingpong.C
../../../bin/charmc -optimize -production -language converse++ -o pingpong pingpong.o
../../../bin/testrun ./pingpong +p2 ++ppn 2
Charmrun> scalable start enabled.
Charmrun> started all node programs in 1.286 seconds.
Charm++> Running in SMP mode: 1 processes, 2 worker threads (PEs) + 1 comm threads per process, 2 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: 42773fd
Charm++> scheduler running in netpoll mode.
Charm++> Running on 1 hosts (2 sockets x 4 cores x 1 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.012 seconds.
Pingpong with iterations = 1000, minMsgSize = 512, maxMsgSize = 16384, increase factor = 2
------------- Processor 1 Exiting: Called CmiAbort ------------
Reason: Msg handler does not exist, possible race condition during init

Stack Traceback:
  [1:0] _Z14CmiAbortHelperPKcS0_S0_ii+0x56 [0x420796]
  [1:1] [0x4207fb]
  [1:2] [0x42bfe4]
  [1:3] CsdSchedulePoll+0x85 [0x42c565]
  [1:4] LrtsInitCpuTopo+0x425 [0x43f285]
  [1:5] _Z6mymainiPPc+0x2d6 [0x41d6c6]
  [1:6] [0x428525]
  [1:7] [0x428843]
  [1:8] +0x76ba [0x7ffff7bc16ba]
  [1:9] clone+0x6d [0x7ffff705641d]
Fatal error on PE 1> Msg handler does not exist, possible race condition during init
Makefile:18: recipe for target 'test' failed
#5 Updated by Nitin Bhat 8 months ago
The bug is caused by a race condition between the two PEs. CmiInitCPUTopology (in cputopology.c) calls CsdSchedulePoll(), so PE 1 polls the scheduler for messages during topology initialization. The pingpong code registers its message handlers only after topology initialization, so if PE 0 is ahead in execution, it can send a message with node1Handler before PE 1 has registered node1Handler. PE 1 then receives a message whose handler index is greater than the maximum registered handler id, which triggers the invalid handler abort.
The fix is to call CmiInitCPUTopology after the handlers have been registered, so that PE 1 cannot receive a message for a handler it has not registered yet.
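To make the ordering problem concrete, here is a minimal toy model of it in plain C++. It does not use the Converse API: ToyPE, Message, pollBeforeRegister and registerBeforePoll are hypothetical names invented for this illustration, and the bounds check only approximates the invalid handler check described above.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a message carrying a handler index.
struct Message {
    std::size_t handlerIdx;  // index of the handler that should process it
    int payload;
};

// Hypothetical stand-in for one PE with a handler table and an inbox.
class ToyPE {
public:
    // Register a handler and return its index in the table.
    std::size_t registerHandler(std::function<void(const Message&)> fn) {
        handlers_.push_back(std::move(fn));
        return handlers_.size() - 1;
    }

    // Another PE delivers a message to this PE.
    void enqueue(const Message& m) { inbox_.push(m); }

    // Drain the inbox. Returns false as soon as a message names a handler
    // index that has not been registered yet (the real runtime aborts here).
    bool poll() {
        while (!inbox_.empty()) {
            Message m = inbox_.front();
            inbox_.pop();
            if (m.handlerIdx >= handlers_.size()) return false;
            handlers_[m.handlerIdx](m);
        }
        return true;
    }

private:
    std::vector<std::function<void(const Message&)>> handlers_;
    std::queue<Message> inbox_;
};

// Failing order: the PE polls (as topology init does) before registering.
bool pollBeforeRegister() {
    ToyPE pe1;
    pe1.enqueue({0, 42});  // PE 0 ran ahead and already sent a message
    bool ok = pe1.poll();  // stands in for the poll during topology init
    pe1.registerHandler([](const Message&) {});  // too late
    return ok;
}

// Fixed order: register handlers first, then poll.
bool registerBeforePoll() {
    ToyPE pe1;
    pe1.registerHandler([](const Message&) {});
    pe1.enqueue({0, 42});
    return pe1.poll();
}
```

In this model pollBeforeRegister() reports a failure and registerBeforePoll() succeeds, which mirrors why moving the topology initialization after handler registration removes the race.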