Bug #1042

verbs layer runs out of memory regions and segfaults

Added by Thomas Quinn about 3 years ago. Updated over 2 years ago.

Status: Merged
Priority: Normal
Assignee:
Category: Machine Layers
Target version:
Start date: 04/20/2016
Due date:
% Done: 100%


Description

Running ChaNGa with the dwf1b.6144 benchmark compiled with verbs-linux-x86_64 smp --with-production on 8 nodes of Ivybridges (NAS Pleiades) terminates with the following segfault:

Big step 10 took 41.174330 seconds.
------------- Processor 115 Exiting: Caught Signal ------------
Reason: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid' (memory paranoid requires '+netpoll' at runtime).
[115] Stack Traceback:
  [115:0]   [0x79600c]
  [115:1] +0x32910  [0x7fffec57c910]
  [115:2] DeliverViaNetwork+0x1d4  [0x7996c4]
  [115:3] DeliverOutgoingMessage+0xc0  [0x7986d0]
  [115:4] CmiInterSendNetworkFunc+0xed  [0x7987fd]
  [115:5] _ZN8CkLocMgr14deliverUnknownEP14CkArrayMessage11CkDeliver_ti+0x5c  [0x6e21ec]
  [115:6] _ZN8CkLocMgr7deliverEP9CkMessage11CkDeliver_ti+0x17a  [0x6e261a]
  [115:7] _ZN9TreePiece9ioShuffleEP14CkReductionMsg+0x6c0  [0x60a3d0]
  [115:8] CkDeliverMessageReadonly+0x43  [0x6c6d43]
  [115:9] _ZN14CkLocRec_local11invokeEntryEP12CkMigratablePvib+0xc5  [0x6ea835]
  [115:10] _ZN18CkArrayBroadcaster7deliverEP14CkArrayMessageP12ArrayElementb+0x9b  [0x6f885b]
  [115:11] _ZN7CkArray13recvBroadcastEP9CkMessage+0xef  [0x6f905f]
  [115:12] _Z15_processHandlerPvP11CkCoreState+0x921  [0x6cdeb1]
  [115:13] CsdScheduleForever+0x85  [0x79e825]
  [115:14] CsdScheduler+0x2d  [0x79eb7d]
  [115:15]   [0x79b2aa]
  [115:16]   [0x79b316]
  [115:17] +0x7806  [0x7fffed6cf806]
  [115:18] clone+0x6d  [0x7fffec6289bd]
Fatal error on PE 115> segmentation violation

The root cause is a failed ibv_reg_mr() call at line 2318 of src/arch/verbs/machine-ibverbs.c. The CmiAssert() there is a no-op when compiling with --with-production, so the failure is not caught at the point where it happens. This should probably be changed to a CmiEnforce().
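To illustrate why an always-on check matters here, the following is a minimal sketch and not the Charm++ code. It assumes a hypothetical stand-in, fake_reg_mr(), that returns NULL once a made-up cap (FAKE_MR_LIMIT) on pinned regions is reached, mirroring how ibv_reg_mr() returns NULL on failure. The wrapper register_checked() is also hypothetical; it tests the result in every build instead of relying on an assertion that production builds compile out.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cap on pinned regions; the real limit depends on the HCA and driver. */
#define FAKE_MR_LIMIT 4

/* Hypothetical stand-in for a registered memory region handle. */
struct fake_mr {
    void *addr;
    size_t length;
};

static int fake_mr_count = 0;

/* Stand-in for ibv_reg_mr(): returns NULL once the cap is reached. */
static struct fake_mr *fake_reg_mr(void *addr, size_t length)
{
    if (fake_mr_count >= FAKE_MR_LIMIT)
        return NULL;
    struct fake_mr *mr = malloc(sizeof *mr);
    if (mr == NULL)
        return NULL;
    mr->addr = addr;
    mr->length = length;
    fake_mr_count++;
    return mr;
}

/* Always-on check: reports the failure instead of letting a NULL handle
 * be dereferenced later. Returns 0 on success, -1 on failure. */
static int register_checked(void *addr, size_t length, struct fake_mr **out)
{
    struct fake_mr *mr = fake_reg_mr(addr, length);
    if (mr == NULL) {
        fprintf(stderr, "memory registration failed after %d regions\n",
                fake_mr_count);
        return -1;
    }
    *out = mr;
    return 0;
}
```

With a check like this, exhausting the regions produces an explicit error at the registration site rather than a segfault further down the send path.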

The failure seems to come from running out of memory regions to pin, so increasing the block allocation ratio (blockAllocRatio) fixes this:

diff --git a/src/arch/verbs/machine-ibverbs.c b/src/arch/verbs/machine-ibverbs.c
index e1854b2..3bc92df 100644
--- a/src/arch/verbs/machine-ibverbs.c
+++ b/src/arch/verbs/machine-ibverbs.c
@@ -558,7 +558,7 @@ loop:
        /*      blockAllocRatio=16;
                blockThreshold=8;*/

-       blockAllocRatio=64;
+       blockAllocRatio=128;
        blockThreshold=9;

Note that two years ago we made a similar increase, from 32 to 64.
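As a rough illustration of why a larger ratio helps, the sketch below uses a simplified model that is an assumption, not the actual allocator logic in machine-ibverbs.c: each pinned region is taken to cover blockAllocRatio buffers, so the number of regions needed is the buffer count divided by the ratio, rounded up. The function name regions_needed() and the buffer counts used with it are hypothetical.

```c
#include <stddef.h>

/* Simplified model (assumption): one pinned region per group of `ratio`
 * buffers. Returns 0 for an invalid ratio of 0; callers should treat
 * that as an error. */
static size_t regions_needed(size_t buffers, size_t ratio)
{
    if (ratio == 0)
        return 0;
    return (buffers + ratio - 1) / ratio; /* ceiling division */
}
```

Under this model, 8192 buffers (an arbitrary figure chosen for illustration) need 128 regions at a ratio of 64 and 64 regions at a ratio of 128, so doubling the ratio halves the number of registrations.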


Related issues

Related to Charm++ - Bug #1043: ChaNGa crashes on verbs for dwf.6144 on 128 nodes Rejected 04/20/2016

History

#2 Updated by Eric Bohm almost 3 years ago

  • Assignee set to Bilge Acun

#3 Updated by Sam White over 2 years ago

  • % Done changed from 0 to 100
  • Closed date set to 2016-08-04 13:35:09.741102
  • Category set to Machine Layers
  • Status changed from New to Merged
  • Target version set to 6.8.0

#4 Updated by Phil Miller over 2 years ago

  • Target version changed from 6.8.0 to 6.8.0-beta1
