Bug #1421

Running leanmd with error checking enabled in Charm++ triggers assertion error in lbdb.h

Added by Juan Galvez over 2 years ago. Updated about 2 years ago.

Status: Merged
Priority: High
Category: Load Balancing
Target version:
Start date: 02/15/2017
Due date:
% Done: 0%


Description

Building Charm++ with error checking:
./build charm++ netlrts-linux-x86_64 --with-production --enable-error-checking

Then building leanmd and running make test produces this error:

./charmrun +p4 ./leanmd 4 4 4 10 3 3 +balancer GreedyLB +LBDebug 1 ++local
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.004 seconds.
Charm++> Running in non-SMP mode: numPes 4
Converse/Charm++ Commit ID: v6.7.0-660-g23a5d86a8
Charm++> scheduler running in netpoll mode.
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
[0] GreedyLB created

LENNARD JONES MOLECULAR DYNAMICS START UP ...

Input Parameters...
Cell Array Dimension X:4 Y:4 Z:4 of size 15 15 30
Final Step Count:10
First LB Step:3
LB Period:3

Cells: 4 X 4 X 4 .... created
Computes: 2432 .... created
Starting simulation .... 

Step 1 Benchmark Time 259.245157 ms/step
Step 2 Benchmark Time 178.256035 ms/step
Step 3 Benchmark Time 154.880047 ms/step

CharmLB> GreedyLB: PE [0] step 0 starting at 0.659501 Memory: 3.862915 MB
[0] Assertion "type==2" failed in file lbdb.h line 232.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Assertion "type==2" failed in file lbdb.h line 232.
[0] Stack Traceback:
  [0:0] CmiAbortHelper+0x63  [0x62f093]
  [0:1] _ZN6BaseLB7LDStats11getRecvHashER10LDCommData+0x33  [0x5d4e23]
  [0:2] _ZN9CentralLB27removeCommDataOfDeletedObjsEPN6BaseLB7LDStatsE+0x13f  [0x5eb8ef]
  [0:3] _ZN9CentralLB11LoadBalanceEv+0xcc  [0x5ebaec]
  [0:4] CkDeliverMessageFree+0x39  [0x51e4c9]
  [0:5] _Z15_processHandlerPvP11CkCoreState+0x4b3  [0x523e43]
  [0:6] CsdScheduleForever+0x68  [0x634eb8]
  [0:7] CsdScheduler+0x2d  [0x6351ed]
  [0:8] ConverseInit+0x37a  [0x6335ca]
  [0:9] main+0x21  [0x4c1051]
  [0:10] __libc_start_main+0xf5  [0x2aaaab511f45]
  [0:11]   [0x4c173f]
Fatal error on PE 0> Assertion "type==2" failed in file lbdb.h line 232.
make: *** [test] Error 1

I suspect it has to do with multicast messages.

History

#1 Updated by Kavitha Chandrasekar over 2 years ago

The assertion seems to fail on type 3 (LD_OBJLIST_MSG), which is used for multicast messages, as you suggest. The CentralLB::removeNonMigratable method seems to ignore commData of type LD_OBJLIST_MSG. Maybe CentralLB::removeCommDataOfDeletedObjs could ignore LD_OBJLIST_MSG as well?
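To make the "ignore it" option concrete, here is a minimal sketch. It does not use the real Charm++ types: CommType, CommRecord, receiverOf and pruneDeletedReceivers are hypothetical stand-ins, and only the numeric values 2 and 3 come from this report. The accessor mimics a getter guarded by an assertion like the one at lbdb.h line 232, and the pruning loop only calls it for point-to-point records.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; NOT the real Charm++ definitions.
// Value 3 corresponds to LD_OBJLIST_MSG as reported above; the other names
// and the value 1 are assumptions.
enum CommType { PROC_MSG = 1, OBJ_MSG = 2, OBJLIST_MSG = 3 };

struct CommRecord {
  int type;        // one of CommType
  int receiverId;  // meaningful only when type == OBJ_MSG
};

// Mimics an accessor guarded by an assertion of the form "type==2".
inline int receiverOf(const CommRecord& c) {
  assert(c.type == OBJ_MSG);
  return c.receiverId;
}

// Drops point-to-point records whose receiver no longer exists.
// Multicast (OBJLIST_MSG) and processor (PROC_MSG) records are kept as-is,
// so the guarded accessor is never called on them.
// Returns the number of records kept.
inline std::size_t pruneDeletedReceivers(std::vector<CommRecord>& comm,
                                         const std::vector<bool>& alive) {
  std::size_t kept = 0;
  for (std::size_t i = 0; i < comm.size(); ++i) {
    bool keep = true;
    if (comm[i].type == OBJ_MSG) {
      const int r = receiverOf(comm[i]);
      keep = r >= 0 && static_cast<std::size_t>(r) < alive.size() && alive[r];
    }
    if (keep) comm[kept++] = comm[i];
  }
  comm.resize(kept);
  return kept;
}
```

The trade-off in this variant is that a multicast record may still list destinations that have been deleted; it avoids the assertion but does not clean those up.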

#2 Updated by Juan Galvez over 2 years ago

A solution might be to add support for multicast messages in CentralLB::removeCommDataOfDeletedObjs: go through all the destinations of a multicast message and check whether any of them have been deleted.
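A rough sketch of that idea, again with hypothetical stand-in types rather than the real Charm++ structures (DestKind, CommEntry, isAlive and pruneDeletedDestinations are all made up for illustration). Each record carries a list of destination ids. Deleted destinations are filtered out, and a record is dropped only when none of its destinations survive.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration; NOT the real Charm++ definitions.
enum DestKind { TO_PROC = 1, TO_OBJ = 2, TO_OBJLIST = 3 };

struct CommEntry {
  int kind;                // one of DestKind
  std::vector<int> dests;  // one id for TO_OBJ, several for TO_OBJLIST,
                           // unused for TO_PROC
};

// An object id is alive if it is in range and marked as still existing.
inline bool isAlive(int id, const std::vector<bool>& alive) {
  return id >= 0 && static_cast<std::size_t>(id) < alive.size() && alive[id];
}

// Removes deleted destinations from every object-directed record.
// A record is dropped entirely once it has no live destination left;
// processor-directed records are always kept.
inline void pruneDeletedDestinations(std::vector<CommEntry>& comm,
                                     const std::vector<bool>& alive) {
  std::vector<CommEntry> kept;
  for (CommEntry& e : comm) {
    if (e.kind == TO_PROC) {
      kept.push_back(e);
      continue;
    }
    std::vector<int> live;
    for (int id : e.dests) {
      if (isAlive(id, alive)) live.push_back(id);
    }
    if (live.empty()) continue;  // every destination was deleted
    e.dests = live;
    kept.push_back(e);
  }
  comm.swap(kept);
}
```

Treating the point-to-point case as a one-element destination list keeps a single code path for both kinds of record. Whether the actual fix is structured this way is not stated in this report.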

#3 Updated by Sam White over 2 years ago

  • Subject changed from Running leanmd with error checking enabled in Charm++triggers assertion error in lbdb.h to Running leanmd with error checking enabled in Charm++ triggers assertion error in lbdb.h
  • Target version set to 6.8.0

#4 Updated by Eric Bohm over 2 years ago

  • Assignee set to Kavitha Chandrasekar

#5 Updated by Kavitha Chandrasekar over 2 years ago

  • Status changed from New to In Progress

I am able to add a condition that checks the destinations of a multicast message and removes any that no longer exist. However, I am still trying to construct a case in leanmd where an object is removed from the multicast group, so that I can test this path.

#6 Updated by Phil Miller about 2 years ago

  • Priority changed from Normal to High

#7 Updated by Sam White about 2 years ago

  • Status changed from In Progress to Implemented

#9 Updated by Sam White about 2 years ago

  • Status changed from Implemented to Merged
