Bug #1273

Tracemode utilization crashes in production build of Charm++

Added by Karthik Senthil over 2 years ago. Updated over 1 year ago.


Charm++ programs compiled in production mode with tracemode utilization crash with a segmentation fault. The error occurs for larger examples such as wave2d or jacobi2d (32000x16000).

Further analysis of the bug led to the following observations:
  1. The error occurs in the sumDetailCompressedReduction reducer in trace-utilization.C.
  2. Specifically, in the mergeCompressedBin method, minEp is not updated correctly because junk values are read from srcBufferArray.
  3. The program does not crash in non-production mode, possibly because the contents of srcBufferArray are zeroed out (during malloc).
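
The failure mode described in the observations above can be illustrated with a minimal sketch. This is not the code from trace-utilization.C: the names Bin, makeZeroedBin, makeJunkBin and findMinEp are hypothetical, and the bin layout (one utilization value per EP, with 0 meaning "no data") is an assumption made for illustration. The "junk" buffer is filled explicitly so that the example has defined behavior, whereas a real uninitialized buffer would hold arbitrary values.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bin layout (an assumption, not the Charm++ one):
// one utilization value per entry point (EP); 0 means "no data".
using Bin = std::vector<short>;

// Zero-filled buffer: unused slots read as 0.
Bin makeZeroedBin(std::size_t numEps) {
    return Bin(numEps, 0);
}

// Stand-in for a buffer whose unused slots were never initialized.
// The junk value is written explicitly to keep the demo well defined.
Bin makeJunkBin(std::size_t numEps, short junk) {
    return Bin(numEps, junk);
}

// Returns the lowest EP index holding a non-zero value, or -1 if none.
int findMinEp(const Bin& bin) {
    for (std::size_t ep = 0; ep < bin.size(); ++ep) {
        if (bin[ep] != 0) {
            return static_cast<int>(ep);
        }
    }
    return -1;
}
```

With a zeroed buffer, the search finds the EP that was actually written. With junk in the unused slots, the search stops at the first junk value instead, which is the kind of incorrect minEp described in observation 2. Explicitly zero-filling the source buffer before use is one way to make the two build modes behave the same.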

Example crash log:

Completed 1160 iterations
Completed 1180 iterations
Completed 1200 iterations
------------- Processor 0 Exiting: Caught Signal ------------
Reason: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
[0] Stack Traceback:
  [0:0]   [0x577183]
  [0:1] +0x36cb0  [0x7ffff7006cb0]
  [0:2] _ZN16compressedBuffer5writeIsEEvT_i+0x2c  [0x5ad406]
  [0:3] _ZN16compressedBuffer4pushIsEEiT_+0x35  [0x5ace09]
  [0:4] _Z18mergeCompressedBinP16compressedBufferiPiiRS_+0x262  [0x5aa4d7]
  [0:5] _Z28sumDetailCompressedReductioniPP14CkReductionMsg+0x220  [0x5a99aa]
  [0:6] _ZN14CkReductionMgr14reduceMessagesEv+0x2d1  [0x51d7b9]
  [0:7] _ZN14CkReductionMgr15finishReductionEv+0x162  [0x51cecc]
  [0:8] _ZN14CkReductionMgr7RecvMsgEP14CkReductionMsg+0xc3  [0x51d39b]
  [0:9] _ZN22CkIndex_CkReductionMgr28_call_RecvMsg_CkReductionMsgEPvS0_+0x2b  [0x526e13]
  [0:10] CkDeliverMessageFree+0x39  [0x4e3ca7]
  [0:11]   [0x4e3d9f]
  [0:12]   [0x4e3ebf]
  [0:13]   [0x4e53ad]
  [0:14]   [0x4e5438]
  [0:15] _Z15_processHandlerPvP11CkCoreState+0x126  [0x4e58ef]
  [0:16] CmiHandleMessage+0x4d  [0x57da31]
  [0:17] CsdScheduleForever+0xad  [0x57dcb2]
  [0:18] CsdScheduler+0x16  [0x57dbe3]
  [0:19]   [0x576c17]
  [0:20] ConverseInit+0x2e0  [0x576b21]
  [0:21] main+0x3f  [0x4d5048]
  [0:22] __libc_start_main+0xf5  [0x7ffff6ff1f45]
  [0:23]   [0x4c4b09]
Charmrun> error on request socket to node 0 'localhost'--


Bug #1274: Tracemode utilization produces incorrect results for EP utilization metric | Merged | Ronak Buch


#1 Updated by Sam White over 2 years ago

  • Category set to Tracing
  • Target version set to 6.8.1

#2 Updated by Ronak Buch over 2 years ago

  • Status changed from New to In Progress

These were fixed as far as I remember, but I think Karthik had the relevant information (I sat down with him at his desk to fix this and he was supposed to make the logs afterward, I think). I'll check back to see if these were really done.

#3 Updated by Sam White almost 2 years ago

Karthik, can you test this again?

#4 Updated by Karthik Senthil almost 2 years ago

I tested this again on my lab machine (netlrts-linux-x86_64-smp) for wave2d and jacobi2d. I didn't run into any crashes.

#5 Updated by Ronak Buch almost 2 years ago

  • Status changed from In Progress to Closed

Was likely fixed a while ago, but never updated. Since it's not reproducible, I'll close it for now.
