SDAG RDMA example /examples/rdma/stencil3D failing on nightly build because of migration
Running on 2 processors: ./stencil3d 64 32 + Commit ID: 3383678
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.020 seconds.
 RefineLB created
STENCIL COMPUTATION WITH BARRIERS
Running Stencil on 4 processors with (2, 2, 2) chares
Array Dimensions: 64 64 64
Block Dimensions: 32 32 32
[respect:01590] *** Process received signal ***
[respect:01590] Signal: Segmentation fault (11)
[respect:01590] Signal code: Address not mapped (1)
[respect:01590] Failing at address: 0x2aaabc29c5e0
[respect:01590] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x2aaaaacdfcb0]
[respect:01590] [ 1] ./stencil3d(_ZN7Stencil13processGhostsEiiiPd+0x268) [0x4ddfa8]
[respect:01590] [ 2] ./stencil3d(_ZN7Stencil9_serial_0EPN15Closure_Stencil23receiveGhosts_5_closureE+0x26) [0x4dcb86]
[respect:01590] [ 3] ./stencil3d(_ZN7Stencil7_when_0Ei+0x1f3) [0x4dc4a3]
[respect:01590] [ 4] ./stencil3d(CkDeliverMessageFree+0x92) [0x5271d2]
[respect:01590] [ 5] ./stencil3d(_ZN8CkLocRec11invokeEntryEP12CkMigratablePvib+0x54) [0x54e1a4]
[respect:01590] [ 6] ./stencil3d(_ZN8CkLocMgr10deliverMsgEP14CkArrayMessage9CkArrayIDmPK12CkArrayIndex11CkDeliver_ti+0x3d1) [0x54e5e1]
[respect:01590] [ 7] ./stencil3d(_Z15_processHandlerPvP11CkCoreState+0xbe9) [0x52d979]
[respect:01590] [ 8] ./stencil3d(CsdScheduleForever+0x48) [0x62f5b8]
[respect:01590] [ 9] ./stencil3d(CsdScheduler+0x2d) [0x62f82d]
[respect:01590] [10] ./stencil3d() [0x62d7d2]
[respect:01590] [11] ./stencil3d() [0x62dd97]
[respect:01590] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x2aaaaacd7e9a]
[respect:01590] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x2aaaabf1638d]
[respect:01590] *** End of error message ***
The failure is caused by the generated code pupping the closure structure incorrectly. An RDMA wrapper inside a closure structure should be pupped as an array, i.e. the array contents should be marshalled into the buffer, because RDMA pointers are invalid on a remote processor.
#1 Updated by Nitin Bhat about 2 years ago
- % Done changed from 0 to 100
- Target version set to 6.8.0
- Assignee set to Nitin Bhat
- Status changed from In Progress to Implemented
- Tags changed from #rdma to #rdma, charmxi
This bug was caught in the nightly build for target mpi-linux-x86_64-smp on the machine respect. The crash happens when the RDMA buffer is used on the new PE after the closure structure has been unpacked there following migration. The fix was to adjust the RDMA pointers before and after migration in the pup method that charmxi generates for the closure structure.
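
For illustration only, here is a minimal standalone sketch of the idea behind the fix: serialize the array contents rather than the raw pointer, then re-point the wrapper at local storage after unpacking. This is not the code charmxi generates and it does not use the Charm++ PUP API. `RdmaWrapper`, `GhostClosure`, `packClosure`, and `unpackClosure` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an RDMA wrapper: a non-owning pointer plus a count.
struct RdmaWrapper {
    double* ptr = nullptr;
    std::size_t count = 0;
};

// Hypothetical stand-in for a generated SDAG closure that buffers arguments.
struct GhostClosure {
    int dir = 0;
    RdmaWrapper ghosts;
    std::vector<double> owned;  // local backing storage after unpacking
};

// Pack: write dir, the element count, and the array CONTENTS (not the pointer).
std::vector<char> packClosure(const GhostClosure& c) {
    const std::size_t bytes = c.ghosts.count * sizeof(double);
    std::vector<char> buf(sizeof(int) + sizeof(std::size_t) + bytes);
    char* p = buf.data();
    std::memcpy(p, &c.dir, sizeof(int));
    p += sizeof(int);
    std::memcpy(p, &c.ghosts.count, sizeof(std::size_t));
    p += sizeof(std::size_t);
    if (bytes > 0) std::memcpy(p, c.ghosts.ptr, bytes);
    return buf;
}

// Unpack: copy the contents into storage owned by the destination closure,
// then adjust the wrapper's pointer so it refers to that local storage.
void unpackClosure(const std::vector<char>& buf, GhostClosure& c) {
    const std::size_t header = sizeof(int) + sizeof(std::size_t);
    assert(buf.size() >= header);
    const char* p = buf.data();
    std::memcpy(&c.dir, p, sizeof(int));
    p += sizeof(int);
    std::size_t n = 0;
    std::memcpy(&n, p, sizeof(std::size_t));
    p += sizeof(std::size_t);
    assert(buf.size() == header + n * sizeof(double));
    c.owned.resize(n);
    if (n > 0) std::memcpy(c.owned.data(), p, n * sizeof(double));
    c.ghosts.ptr = c.owned.data();  // pointer fixed up on the "new PE"
    c.ghosts.count = n;
}
```

Had the pack step copied only `ghosts.ptr`, the unpacked closure would hold an address from the old PE's address space, which matches the "Address not mapped" fault in `Stencil::processGhosts` shown in the trace above.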