Running leanmd with error checking enabled in Charm++ triggers assertion error in lbdb.h
Building Charm++ with error checking:
./build charm++ netlrts-linux-x86_64 --with-production --enable-error-checking
Then building leanmd and running make test on it produces this error:
./charmrun +p4 ./leanmd 4 4 4 10 3 3 +balancer GreedyLB +LBDebug 1 ++local
Charmrun> scalable start enabled.
Charmrun> started all node programs in 0.004 seconds.
Charm++> Running in non-SMP mode: numPes 4
Converse/Charm++ Commit ID: v6.7.0-660-g23a5d86a8
Charm++> scheduler running in netpoll mode.
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
GreedyLB created
LENNARD JONES MOLECULAR DYNAMICS START UP ...
Input Parameters...
Cell Array Dimension X:4 Y:4 Z:4 of size 15 15 30
Final Step Count:10
First LB Step:3
LB Period:3
Cells: 4 X 4 X 4 .... created
Computes: 2432 .... created
Starting simulation ....
Step 1 Benchmark Time 259.245157 ms/step
Step 2 Benchmark Time 178.256035 ms/step
Step 3 Benchmark Time 154.880047 ms/step
CharmLB> GreedyLB: PE  step 0 starting at 0.659501 Memory: 3.862915 MB
Assertion "type==2" failed in file lbdb.h line 232.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Assertion "type==2" failed in file lbdb.h line 232.
Stack Traceback:
[0:0] CmiAbortHelper+0x63 [0x62f093]
[0:1] _ZN6BaseLB7LDStats11getRecvHashER10LDCommData+0x33 [0x5d4e23]
[0:2] _ZN9CentralLB27removeCommDataOfDeletedObjsEPN6BaseLB7LDStatsE+0x13f [0x5eb8ef]
[0:3] _ZN9CentralLB11LoadBalanceEv+0xcc [0x5ebaec]
[0:4] CkDeliverMessageFree+0x39 [0x51e4c9]
[0:5] _Z15_processHandlerPvP11CkCoreState+0x4b3 [0x523e43]
[0:6] CsdScheduleForever+0x68 [0x634eb8]
[0:7] CsdScheduler+0x2d [0x6351ed]
[0:8] ConverseInit+0x37a [0x6335ca]
[0:9] main+0x21 [0x4c1051]
[0:10] __libc_start_main+0xf5 [0x2aaaab511f45]
[0:11] [0x4c173f]
Fatal error on PE 0> Assertion "type==2" failed in file lbdb.h line 232.
make: *** [test] Error 1
I suspect it has to do with multicast messages.
#1 Updated by Kavitha Chandrasekar over 2 years ago
The assertion seems to fail on type 3, LD_OBJLIST_MSG, for multicast messages as you suggest. The CentralLB::removeNonMigratable method seems to ignore LD_OBJLIST_MSG type commData. Maybe LD_OBJLIST_MSG could be ignored in CentralLB::removeCommDataOfDeletedObjs as well?
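To make the suggestion concrete, here is a minimal standalone sketch of the "skip the multicast records" idea. It is not the Charm++ code: SketchCommData, the SKETCH_* constants, receiverOf and dropCommToDeletedObjs are hypothetical stand-ins, and I am assuming that type 2 denotes a record with a single object destination (type 3 is the LD_OBJLIST_MSG case mentioned above).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the comm record types; only the value 3
// (LD_OBJLIST_MSG) is named in this report, the rest is assumed.
enum SketchCommType { SKETCH_PROC_MSG = 1, SKETCH_OBJ_MSG = 2, SKETCH_OBJLIST_MSG = 3 };

struct SketchCommData {
  int type;  // one of SketchCommType
  int dest;  // receiver object index, meaningful only for SKETCH_OBJ_MSG
};

// Mirrors the shape of the failing accessor: it is only valid for
// single-destination records and asserts otherwise.
inline int receiverOf(const SketchCommData& c) {
  assert(c.type == SKETCH_OBJ_MSG);
  return c.dest;
}

// Drops single-destination records whose receiver no longer exists.
// Records of any other type (including multicast) are left untouched,
// so receiverOf is never called on them. Returns the number removed.
inline std::size_t dropCommToDeletedObjs(std::vector<SketchCommData>& comm,
                                         const std::vector<bool>& alive) {
  const std::size_t before = comm.size();
  comm.erase(std::remove_if(comm.begin(), comm.end(),
                            [&alive](const SketchCommData& c) {
                              if (c.type != SKETCH_OBJ_MSG) return false;  // skip, do not assert
                              const int r = receiverOf(c);
                              return r < 0 ||
                                     static_cast<std::size_t>(r) >= alive.size() ||
                                     !alive[static_cast<std::size_t>(r)];
                            }),
             comm.end());
  return before - comm.size();
}
```

With this guard, a record list that mixes single-destination and multicast entries loses only the single-destination entries whose receiver is gone. The trade-off is that stale destinations inside multicast records are not cleaned up at all.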
#5 Updated by Kavitha Chandrasekar over 2 years ago
- Status changed from New to In Progress
I am able to add a condition to check for destinations of a multicast message and remove them if they don't exist anymore. However, I am still trying to generate a case in leanmd where an object is removed from the multicast group, in order to test this case.
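For reference, the destination check can be sketched in isolation roughly as follows. This is a simplified illustration under my own assumptions, not the actual patch: SketchMulticast and pruneDeadDestinations are hypothetical names, and objects are identified by plain integers rather than by the real load balancer handles.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a multicast comm record: just a list of
// destination object ids.
struct SketchMulticast {
  std::vector<int> dests;
};

// Erases every destination that is not in liveObjs and returns how many
// destinations remain. A caller could drop the whole record when the
// result is 0 (that policy is an assumption, not something decided here).
inline std::size_t pruneDeadDestinations(SketchMulticast& m,
                                         const std::unordered_set<int>& liveObjs) {
  m.dests.erase(std::remove_if(m.dests.begin(), m.dests.end(),
                               [&liveObjs](int id) { return liveObjs.count(id) == 0; }),
                m.dests.end());
  return m.dests.size();
}
```

A unit-level check of this logic does not need leanmd: a record with destinations {1, 2, 3} pruned against live objects {1, 3} should end up with {1, 3}. Reproducing the situation inside leanmd itself is still the open part.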