Bug #1629

Valgrind shows invalid writes at GNI_SmsgSendWTag calls for charm programs built on gni-crayxc

Added by Nitin Bhat 4 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assignee:
Category: Machine Layers
Target version: -
Start date: 07/10/2017
Due date:
% Done: 0%


Description

1. Build charm with target gni-crayxc and flags -g -O0
2. Build examples/charm++/hello/1darray with -g -O0
3. Run the example with valgrind: module load valgrind && srun -n 2 valgrind --leak-check=full --log-file=VG.out.%p --trace-children=yes ./hello

Valgrind output:

==23801== Invalid write of size 8
==23801==    at 0x201A85EF: GNII_SmsgSend (in /global/u1/n/nbhat4/software/charm/examples/charm++/hello/1darray/hello)
==23801==    by 0x201A9940: GNI_SmsgSendWTag (in /global/u1/n/nbhat4/software/charm/examples/charm++/hello/1darray/hello)
==23801==    by 0x200EB3E0: send_smsg_message (machine.c:1532)
==23801==    by 0x200EBB00: LrtsSendFunc (machine.c:1779)
==23801==    by 0x200E8DAA: CmiInterSendNetworkFunc (machine-common-core.c:598)
==23801==    by 0x200E8CD2: CmiSendNetworkFunc (machine-common-core.c:553)
==23801==    by 0x200E6EF7: SendSpanningChildren (machine-broadcast.c:118)
==23801==    by 0x200E6F60: SendSpanningChildrenProc (machine-broadcast.c:176)
==23801==    by 0x200E6FFD: CmiSyncBroadcastFn1 (machine-broadcast.c:219)
==23801==    by 0x200E7028: CmiSyncBroadcastFn (machine-broadcast.c:251)
==23801==    by 0x20010980: Converse::CmiSyncBroadcast(int, char*) (middle-conv.h:69)
==23801==    by 0x20013C5C: _createGroup(_ckGroupID, envelope*) (ck.C:803)
==23801==  Address 0x2aaaeaaaf068 is not stack'd, malloc'd or (recently) free'd

History

#1 Updated by Sam White 4 months ago

Are there other leaks that look like this that don't come from the Charm initialization phases? This particular leak, from topological tree creation, is known to show up across different comm layers: https://charm.cs.illinois.edu/redmine/issues/1202

#2 Updated by Phil Miller 4 months ago

I know we want to make a comparison to other errors that Valgrind reports, but this one isn't a leak. It's an outright out-of-bounds memory access.

Since it looks like this is happening inside the GNI library itself, there are two cases we need to distinguish: either we're calling it with a bad parameter, or it's defective internally.

Does this reproduce in converse/commbench, or whatever calls the Converse broadcast function? How much further down the stack can we push a reproduction, to figure out whether it's our fault or not?

#3 Updated by Phil Miller 4 months ago

In converse/commbench, I get the initial report from ~10 instruction addresses in GNII_SmsgSend, and the following additional errors (ignoring use of uninitialized bytes for the moment):

==19352== Invalid read of size 8
==19352==    at 0x2007EAEA: GNII_DlaProgress (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x2007AB27: GNI_CqGetEvent (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x2001115D: PumpLocalTransactions (machine.c:2915)
==19352==    by 0x20012053: LrtsAdvanceCommunication (machine.c:3530)
==19352==    by 0x2000DD91: AdvanceCommunication (machine-common-core.c:1317)
==19352==    by 0x2000E01F: CmiGetNonLocal (machine-common-core.c:1487)
==19352==    by 0x200166B4: CsdNextMessage (convcore.c:1779)
==19352==    by 0x200167FE: CsdScheduleForever (convcore.c:1904)
==19352==    by 0x20016758: CsdScheduler (convcore.c:1843)
==19352==    by 0x2000DD54: ConverseRunPE (machine-common-core.c:1297)
==19352==    by 0x2000DC5B: ConverseInit (machine-common-core.c:1198)
==19352==    by 0x20001F0C: main (commbench.c:159)
==19352==  Address 0x2aaaeaab0000 is not stack'd, malloc'd or (recently) free'd

==19352== Invalid read of size 8
==19352==    at 0x200A600F: GNII_GenAllocSeqid (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200AB112: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200ACDE0: GNI_SmsgSendWTag (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x2000EFA4: send_smsg_message (machine.c:1532)
==19352==    by 0x2000F6C4: LrtsSendFunc (machine.c:1779)
==19352==    by 0x2000C96E: CmiInterSendNetworkFunc (machine-common-core.c:598)
==19352==    by 0x2000CA31: CmiInterFreeSendFn (machine-common-core.c:630)
==19352==    by 0x2000C999: CmiFreeSendFn (machine-common-core.c:604)
==19352==    by 0x20002B61: doTrials (proc.c:37)
==19352==    by 0x200165A3: CmiHandleMessage (convcore.c:1670)
==19352==    by 0x20016827: CsdScheduleForever (convcore.c:1907)
==19352==    by 0x20016758: CsdScheduler (convcore.c:1843)
==19352==  Address 0x2aaaeaab5000 is not stack'd, malloc'd or (recently) free'd

==19352== Invalid write of size 8
==19352==    at 0x2007AD75: GNI_CqGetEvent (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x2000FF71: PumpNetworkSmsg (machine.c:2202)
==19352==    by 0x20011F65: LrtsAdvanceCommunication (machine.c:3496)
==19352==    by 0x2000DD91: AdvanceCommunication (machine-common-core.c:1317)
==19352==    by 0x2000E0E8: CmiNotifyStillIdle (machine-common-core.c:1544)
==19352==    by 0x2001A25B: call_cblist_keep (conv-conds.c:160)
==19352==    by 0x2001AF26: CcdRaiseCondition (conv-conds.c:524)
==19352==    by 0x200164F3: CsdStillIdle (convcore.c:1617)
==19352==    by 0x2001684D: CsdScheduleForever (convcore.c:1927)
==19352==    by 0x20016758: CsdScheduler (convcore.c:1843)
==19352==    by 0x2000DD54: ConverseRunPE (machine-common-core.c:1297)
==19352==    by 0x2000DC5B: ConverseInit (machine-common-core.c:1198)
==19352==  Address 0x2aaaeaab7000 is not stack'd, malloc'd or (recently) free'd

==19352== Invalid write of size 8
==19352==    at 0x2007EB84: GNII_DlaProgress (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x2007AB27: GNI_CqGetEvent (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x2001115D: PumpLocalTransactions (machine.c:2915)
==19352==    by 0x20012053: LrtsAdvanceCommunication (machine.c:3530)
==19352==    by 0x2000DD91: AdvanceCommunication (machine-common-core.c:1317)
==19352==    by 0x2000E01F: CmiGetNonLocal (machine-common-core.c:1487)
==19352==    by 0x200166B4: CsdNextMessage (convcore.c:1779)
==19352==    by 0x200167FE: CsdScheduleForever (convcore.c:1904)
==19352==    by 0x20016758: CsdScheduler (convcore.c:1843)
==19352==    by 0x2000DD54: ConverseRunPE (machine-common-core.c:1297)
==19352==    by 0x2000DC5B: ConverseInit (machine-common-core.c:1198)
==19352==    by 0x20001F0C: main (commbench.c:159)
==19352==  Address 0x2aaaeaab0000 is not stack'd, malloc'd or (recently) free'd

==19352== Invalid read of size 4
==19352==    at 0x20085649: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200A1670: GNII_PostRdma (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200A20FA: GNI_PostRdma (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200118FD: SendRdmaMsg (machine.c:3182)
==19352==    by 0x20012101: LrtsAdvanceCommunication (machine.c:3569)
==19352==    by 0x2000DD91: AdvanceCommunication (machine-common-core.c:1317)
==19352==    by 0x2000E0E8: CmiNotifyStillIdle (machine-common-core.c:1544)
==19352==    by 0x2001A25B: call_cblist_keep (conv-conds.c:160)
==19352==    by 0x2001AF26: CcdRaiseCondition (conv-conds.c:524)
==19352==    by 0x200164F3: CsdStillIdle (convcore.c:1617)
==19352==    by 0x2001684D: CsdScheduleForever (convcore.c:1927)
==19352==    by 0x20016758: CsdScheduler (convcore.c:1843)
==19352==  Address 0x2aaaaaaab008 is not stack'd, malloc'd or (recently) free'd

==19352== Invalid write of size 8
==19352==    at 0x20085991: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200A1670: GNII_PostRdma (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200A20FA: GNI_PostRdma (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200118FD: SendRdmaMsg (machine.c:3182)
==19352==    by 0x20012101: LrtsAdvanceCommunication (machine.c:3569)
==19352==    by 0x2000DD91: AdvanceCommunication (machine-common-core.c:1317)
==19352==    by 0x2000E0E8: CmiNotifyStillIdle (machine-common-core.c:1544)
==19352==    by 0x2001A25B: call_cblist_keep (conv-conds.c:160)
==19352==    by 0x2001AF26: CcdRaiseCondition (conv-conds.c:524)
==19352==    by 0x200164F3: CsdStillIdle (convcore.c:1617)
==19352==    by 0x2001684D: CsdScheduleForever (convcore.c:1927)
==19352==    by 0x20016758: CsdScheduler (convcore.c:1843)
==19352==  Address 0x2aaaeaaaf000 is not stack'd, malloc'd or (recently) free'd

==19352== Invalid write of size 8
==19352==    at 0x200AD1AD: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200AEAB9: GNII_SmsgRelease (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200100A1: PumpNetworkSmsg (machine.c:2242)
==19352==    by 0x20011F65: LrtsAdvanceCommunication (machine.c:3496)
==19352==    by 0x2000DD91: AdvanceCommunication (machine-common-core.c:1317)
==19352==    by 0x2000E01F: CmiGetNonLocal (machine-common-core.c:1487)
==19352==    by 0x200166B4: CsdNextMessage (convcore.c:1779)
==19352==    by 0x200167FE: CsdScheduleForever (convcore.c:1904)
==19352==    by 0x20016758: CsdScheduler (convcore.c:1843)
==19352==    by 0x2000DD54: ConverseRunPE (machine-common-core.c:1297)
==19352==    by 0x2000DC5B: ConverseInit (machine-common-core.c:1198)
==19352==    by 0x20001F0C: main (commbench.c:159)
==19352==  Address 0x2aaaeaaaf000 is not stack'd, malloc'd or (recently) free'd

==19352== Invalid read of size 8
==19352==    at 0x200AD3C5: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200AEAB9: GNII_SmsgRelease (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
==19352==    by 0x200100A1: PumpNetworkSmsg (machine.c:2242)
==19352==    by 0x20011F65: LrtsAdvanceCommunication (machine.c:3496)
==19352==    by 0x2000DD91: AdvanceCommunication (machine-common-core.c:1317)
==19352==    by 0x2000E01F: CmiGetNonLocal (machine-common-core.c:1487)
==19352==    by 0x200166B4: CsdNextMessage (convcore.c:1779)
==19352==    by 0x200167FE: CsdScheduleForever (convcore.c:1904)
==19352==    by 0x20016758: CsdScheduler (convcore.c:1843)
==19352==    by 0x2000DD54: ConverseRunPE (machine-common-core.c:1297)
==19352==    by 0x2000DC5B: ConverseInit (machine-common-core.c:1198)
==19352==    by 0x20001F0C: main (commbench.c:159)
==19352==  Address 0x2aaaeaab5008 is not stack'd, malloc'd or (recently) free'd

And a summary of the various junk seen in this run:

      3     at 0x200AD8B9: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200AD8C4: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200AD8E7: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200ADAC0: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200ADACE: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200ADD78: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200ADDA7: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200ADE02: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200ADF60: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200ADF68: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      3     at 0x200ADF70: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x2007EB84: GNII_DlaProgress (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x2007EC3D: GNII_DlaProgress (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x20085649: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x20085664: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x20085991: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x2008599D: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200859AA: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200859B7: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200859CB: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200859DF: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x20085AB6: GNII_PostFlbte (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200A600F: GNII_GenAllocSeqid (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200A601C: GNII_GenAllocSeqid (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AB45D: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AD1AD: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AD1B8: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AD1DE: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AD3B7: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AD3C5: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AD66F: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AD6A0: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200AD6FF: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200ADE54: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200ADE5C: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      4     at 0x200ADE64: return_back_credits (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
      5     at 0x2007EAEA: GNII_DlaProgress (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x2007AD75: GNI_CqGetEvent (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB027: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB03B: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB05E: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB06A: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB076: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB11B: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB13F: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB3E4: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB3F9: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200AB449: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     12     at 0x200ABA8F: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)
     13     at 0x200AB01C: GNII_SmsgSend (in /global/u2/p/pmiller/charm/gni-crayxc/tests/converse/commbench/pgm)

This looks a whole lot to me like we're mis-allocating some smsg space somewhere, or we're allocating it in some way that valgrind doesn't recognize at all.
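
For anyone who wants to regenerate a per-site summary like the one above from their own logs: the command used for it isn't shown in this thread, so the following is only a sketch of one way to get a similar tally. It counts the innermost ("at") frame of each report by instruction address and function name. The function names `tally_fault_sites` and `main` are made up for illustration, and the log files are assumed to be passed on the command line (for example the `vg.*` files).

```python
import re
import sys
from collections import Counter

# Matches only the innermost ("at") frame of a Valgrind report, e.g.
#   ==19352==    at 0x2007EAEA: GNII_DlaProgress (in /path/to/pgm)
# Caller frames ("by 0x...") are deliberately ignored.
TOP_FRAME = re.compile(r"^==\d+==\s+at (0x[0-9A-Fa-f]+): (.+?) \(")


def tally_fault_sites(lines):
    """Count reports per (instruction address, function) pair."""
    counts = Counter()
    for line in lines:
        match = TOP_FRAME.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def main(paths):
    # Hypothetical driver: merge the tallies of all per-process logs.
    counts = Counter()
    for path in paths:
        with open(path, errors="replace") as log:
            counts.update(tally_fault_sites(log))
    # Least frequent first, mirroring the ordering of `sort | uniq -c | sort -n`.
    for (addr, func), n in sorted(counts.items(), key=lambda kv: (kv[1], kv[0])):
        print(f"{n:7d}     at {addr}: {func}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Note that this counts report blocks as they appear in the log files, so it also picks up the "at" frames of any leak or uninitialized-value reports unless those are filtered out first.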

#4 Updated by Phil Miller 4 months ago

And the addresses being touched in this run:

> grep -h 'is not stack' vg.* | sed 's/==[0-9]*==  //' | sort | uniq | less
Address 0x2aaaaaaab008 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaaaaab00c is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaaaaaf000 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaaaaaf008 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaaaaaf010 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaaaaaf020 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaaaaaf030 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaaaaaf038 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaaaaaf078 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf000 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf008 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf010 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf018 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf020 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf028 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf040 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf050 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaaaf068 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaab0000 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaab1558 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaab4000 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaab5000 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaab5008 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaab6000 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaab7000 is not stack'd, malloc'd or (recently) free'd
Address 0x2aaaeaab8000 is not stack'd, malloc'd or (recently) free'd

Doing the math, they span a range just a hair over a gigabyte wide: from 0x2aaaaaaab008 to 0x2aaaeaab8000, which is 1 GiB plus about 52 KiB.
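
The arithmetic behind that estimate can be checked with a few lines of Python. This is just a sketch over the addresses listed in comment #4; the helper names `span` and `split_at_largest_gap` are invented here. It also splits the list at its largest gap, which shows the addresses falling into two tight clusters rather than being spread across the whole range.

```python
GIB = 1 << 30

# Addresses copied from the grep output in comment #4.
ADDRESSES = [
    0x2aaaaaaab008, 0x2aaaaaaab00c, 0x2aaaaaaaf000, 0x2aaaaaaaf008,
    0x2aaaaaaaf010, 0x2aaaaaaaf020, 0x2aaaaaaaf030, 0x2aaaaaaaf038,
    0x2aaaaaaaf078, 0x2aaaeaaaf000, 0x2aaaeaaaf008, 0x2aaaeaaaf010,
    0x2aaaeaaaf018, 0x2aaaeaaaf020, 0x2aaaeaaaf028, 0x2aaaeaaaf040,
    0x2aaaeaaaf050, 0x2aaaeaaaf068, 0x2aaaeaab0000, 0x2aaaeaab1558,
    0x2aaaeaab4000, 0x2aaaeaab5000, 0x2aaaeaab5008, 0x2aaaeaab6000,
    0x2aaaeaab7000, 0x2aaaeaab8000,
]


def span(addrs):
    """Distance in bytes between the lowest and highest address."""
    return max(addrs) - min(addrs)


def split_at_largest_gap(addrs):
    """Split the sorted addresses into two groups at the widest gap."""
    ordered = sorted(addrs)
    gap, index = max(
        (b - a, i) for i, (a, b) in enumerate(zip(ordered, ordered[1:]))
    )
    return ordered[: index + 1], ordered[index + 1 :], gap


if __name__ == "__main__":
    total = span(ADDRESSES)
    low, high, gap = split_at_largest_gap(ADDRESSES)
    print(f"total span:   {total} bytes (1 GiB + {total - GIB} bytes)")
    print(f"low cluster:  {len(low)} addresses, {span(low)} bytes wide")
    print(f"high cluster: {len(high)} addresses, {span(high)} bytes wide")
    print(f"gap between clusters: {gap} bytes")
```

The total comes out to 1073795064 bytes, i.e. 1 GiB plus 53240 bytes. The low cluster (9 addresses) and the high cluster (17 addresses) are each only a few tens of KiB wide, so almost all of the gigabyte is the gap between them.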

#5 Updated by Phil Miller 4 months ago

A third possibility that I neglected in comment #2: all of this space is properly allocated, just through a function that valgrind hasn't been taught to recognize.
