Bug #1614

SDAG Rdma example /examples/rdma/stencil3D failing on nightly build because of migration

Added by Nitin Bhat about 2 years ago. Updated about 2 years ago.

Status: Merged
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 06/19/2017
Due date:
% Done: 100%
Tags:

Description

Running on 2 processors: ./stencil3d 64 32 +balancer RefineLB +ppn 2
charmrun> /usr/bin/setarch x86_64 -R mpirun -np 2 ./stencil3d 64 32 +balancer RefineLB +ppn 2
Charm++> Running on MPI version: 2.1
Charm++> level of thread support used: MPI_THREAD_SINGLE (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: numNodes 2, 2 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: 3383678
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.020 seconds.
[0] RefineLB created

STENCIL COMPUTATION WITH BARRIERS
Running Stencil on 4 processors with (2, 2, 2) chares
Array Dimensions: 64 64 64
Block Dimensions: 32 32 32
[respect:01590] *** Process received signal ***
[respect:01590] Signal: Segmentation fault (11)
[respect:01590] Signal code: Address not mapped (1)
[respect:01590] Failing at address: 0x2aaabc29c5e0
[respect:01590] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x2aaaaacdfcb0]
[respect:01590] [ 1] ./stencil3d(_ZN7Stencil13processGhostsEiiiPd+0x268) [0x4ddfa8]
[respect:01590] [ 2] ./stencil3d(_ZN7Stencil9_serial_0EPN15Closure_Stencil23receiveGhosts_5_closureE+0x26) [0x4dcb86]
[respect:01590] [ 3] ./stencil3d(_ZN7Stencil7_when_0Ei+0x1f3) [0x4dc4a3]
[respect:01590] [ 4] ./stencil3d(CkDeliverMessageFree+0x92) [0x5271d2]
[respect:01590] [ 5] ./stencil3d(_ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x54) [0x54e1a4]
[respect:01590] [ 6] ./stencil3d(_ZN8CkLocMgr10deliverMsgEP14CkArrayMessage9CkArrayIDmPK12CkArrayIndex11CkDeliver_ti+0x3d1) [0x54e5e1]
[respect:01590] [ 7] ./stencil3d(_Z15_processHandlerPvP11CkCoreState+0xbe9) [0x52d979]
[respect:01590] [ 8] ./stencil3d(CsdScheduleForever+0x48) [0x62f5b8]
[respect:01590] [ 9] ./stencil3d(CsdScheduler+0x2d) [0x62f82d]
[respect:01590] [10] ./stencil3d() [0x62d7d2]
[respect:01590] [11] ./stencil3d() [0x62dd97]
[respect:01590] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x2aaaaacd7e9a]
[respect:01590] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x2aaaabf1638d]
[respect:01590] *** End of error message ***

The failure occurs because the generated code pups the closure structure incorrectly. An rdmawrapper inside a closure structure should be pupped as an array, i.e. the array contents should be marshalled into the buffer, because rdma pointers are invalid on a remote processor.
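To illustrate the point about marshalling contents rather than pointers, here is a minimal standalone sketch. It does not use Charm++ at all: `GhostRef`, `packGhosts`, `unpackGhosts` and `roundTripPreservesGhosts` are hypothetical names made up for this example, and the byte layout is an assumption, not what charmxi generates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a closure field that refers to a ghost buffer.
struct GhostRef {
    double*     ptr;    // meaningful only in the address space that created it
    std::size_t count;  // number of doubles behind ptr
};

// Marshal the element count and the array contents (not the pointer value).
std::vector<char> packGhosts(const GhostRef& g) {
    std::vector<char> buf(sizeof(std::size_t) + g.count * sizeof(double));
    std::memcpy(buf.data(), &g.count, sizeof(std::size_t));
    if (g.count > 0) {
        std::memcpy(buf.data() + sizeof(std::size_t), g.ptr,
                    g.count * sizeof(double));
    }
    return buf;
}

// Rebuild the data on the receiving side; `storage` owns the new copy and
// the returned GhostRef points into it.
GhostRef unpackGhosts(const std::vector<char>& buf,
                      std::vector<double>& storage) {
    assert(buf.size() >= sizeof(std::size_t));
    std::size_t count = 0;
    std::memcpy(&count, buf.data(), sizeof(std::size_t));
    assert(buf.size() >= sizeof(std::size_t) + count * sizeof(double));
    storage.resize(count);
    if (count > 0) {
        std::memcpy(storage.data(), buf.data() + sizeof(std::size_t),
                    count * sizeof(double));
    }
    return GhostRef{storage.data(), count};
}

// Round trip in which the source array is destroyed before unpacking,
// mimicking the old PE's memory no longer being reachable.
bool roundTripPreservesGhosts() {
    std::vector<char> wire;
    {
        std::vector<double> src = {1.0, 2.5, -3.0};
        wire = packGhosts(GhostRef{src.data(), src.size()});
    }  // src is gone here; a copied pointer would now dangle
    std::vector<double> dst;
    GhostRef g = unpackGhosts(wire, dst);
    return g.count == 3 && g.ptr == dst.data() &&
           g.ptr[0] == 1.0 && g.ptr[1] == 2.5 && g.ptr[2] == -3.0;
}
```

Had `packGhosts` copied only `g.ptr`, the receiver would hold an address that is not mapped locally, which is consistent with the "Address not mapped" signal code in the trace above.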

History

#1 Updated by Nitin Bhat about 2 years ago

  • % Done changed from 0 to 100
  • Target version set to 6.8.0
  • Assignee set to Nitin Bhat
  • Status changed from In Progress to Implemented
  • Tags changed from #rdma to #rdma, charmxi

Fix: https://charm.cs.illinois.edu/gerrit/#/c/2726/

This bug was caught in the nightly build for the target mpi-linux-x86_64-smp on the machine respect. The crash happens when the rdma buffer is used on the new PE, after the closure structure has been unpacked there following migration. The fix adjusts the rdma pointers before and after migration in the closure structure's pup method, which is generated by charmxi.
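A rough sketch of the pointer adjustment idea, for readers who have not seen the generated code. This is not the charmxi output and not Charm++'s PUP::er: `ToyPupper`, `ToyClosure` and `migrateClosure` are hypothetical names, and the single-method pack/unpack shape is only assumed to resemble the real pup method.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Toy bidirectional serializer: the same call packs or unpacks depending on
// how the object was constructed. Purely illustrative.
class ToyPupper {
public:
    ToyPupper() : unpacking_(false), pos_(0) {}
    explicit ToyPupper(std::vector<char> bytes)
        : bytes_(std::move(bytes)), unpacking_(true), pos_(0) {}

    bool isUnpacking() const { return unpacking_; }

    // Copy n bytes out of p when packing, or into p when unpacking.
    void bytes(void* p, std::size_t n) {
        if (n == 0) return;
        if (unpacking_) {
            assert(pos_ + n <= bytes_.size());
            std::memcpy(p, bytes_.data() + pos_, n);
            pos_ += n;
        } else {
            const char* c = static_cast<const char*>(p);
            bytes_.insert(bytes_.end(), c, c + n);
        }
    }

    const std::vector<char>& buffer() const { return bytes_; }

private:
    std::vector<char> bytes_;
    bool unpacking_;
    std::size_t pos_;
};

// Hypothetical closure holding a non-owning pointer to ghost data.
struct ToyClosure {
    int iter = 0;
    std::size_t count = 0;
    double* ghosts = nullptr;      // must never be carried across PEs as-is
    std::vector<double> owned;     // local storage used after unpacking

    void pup(ToyPupper& p) {
        p.bytes(&iter, sizeof iter);
        p.bytes(&count, sizeof count);
        if (p.isUnpacking()) {
            owned.resize(count);
            ghosts = owned.data();  // re-point at memory valid on this side
        }
        p.bytes(ghosts, count * sizeof(double));  // contents, not the address
    }
};

// Pack on the "old PE", drop the source memory, unpack on the "new PE".
bool migrateClosure() {
    ToyPupper packer;
    {
        std::vector<double> src = {4.0, 5.0};
        ToyClosure before;
        before.iter = 7;
        before.count = src.size();
        before.ghosts = src.data();
        before.pup(packer);
    }
    ToyPupper unpacker(packer.buffer());
    ToyClosure after;
    after.pup(unpacker);
    return after.iter == 7 && after.count == 2 &&
           after.ghosts == after.owned.data() &&
           after.ghosts[0] == 4.0 && after.ghosts[1] == 5.0;
}
```

The step that matters is re-pointing `ghosts` before the contents are read back; without it the unpacked closure would still carry the old PE's address.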

#2 Updated by Sam White about 2 years ago

  • Status changed from Implemented to Merged
