Bug #1786

Assertion "thisDim < thatDim" failed in file cklocation.C line 2880

Added by Thomas Quinn over 1 year ago. Updated over 1 year ago.

So far, this only happens when I "build ChaNGa mpi-crayxc smp -O2". (A gni-crayxc smp build works.)
A print statement shows that the "idx.index[0]" in CkLocMgr::checkInBounds() is garbage; perhaps uninitialized.

It fails immediately when running ChaNGa; the output is:

Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: numNodes 1,  35 worker threads per process
Charm++> The comm. thread both sends and receives messages
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: v6.8.2-288-gcbb953362
CharmLB> Load balancer assumes all CPUs are same.
WARNING: bKDK parameter ignored; KDK is always used.
WARNING: bStandard parameter ignored; Output is always standard.
ChaNGa version 3.3, commit v3.3-76-g39ba07e
Running on 35 processors/ 1 nodes with 8192 TreePieces
yieldPeriod set to 5
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...SFC Peano-Hilbert
Created 8192 pieces of tree
dim: 0 :this: -2147483648 that 8192
[34] Assertion "thisDim < thatDim" failed in file cklocation.C line 2882.


#1 Updated by Thomas Quinn over 1 year ago

seems to have introduced this bug.

#2 Updated by Sam White over 1 year ago

It's weird that this would show up in mpi-crayxc-smp but not gni-crayxc-smp. What machine is this on? It would be nice for us to reproduce the Charm++ build, ChaNGa build, and run command you are using.

#3 Updated by Thomas Quinn over 1 year ago

Machine: "Piz Daint".
Currently Loaded Modulefiles:
1) modules/
2) eproxy/2.0.16-
3) cray-mpich/7.6.0
4) slurm/17.02.9-1
5) xalt/daint-2016.11
6) gcc/5.3.0
7) craype-haswell
8) craype-network-aries
9) craype/2.5.12
10) cray-libsci/17.06.1
11) udreg/2.3.2-
12) ugni/6.0.14-
13) pmi/5.0.12
14) dmapp/7.1.1-
15) gni-headers/5.0.11-
16) xpmem/2.2.2-
17) job/2.2.2-
18) dvs/2.7_2.2.32-
19) alps/6.4.1-
20) rca/2.2.11-
21) atp/2.1.1
22) perftools-base/6.5.1
23) PrgEnv-gnu/6.0.4
24) craype-hugepages2M
25) papi/

Build command:
./build ChaNGa mpi-crayxc smp -j8 -O2

ChaNGa: "configure; make"

Run command (with one node / 36 physical cores):
srun -n $SLURM_NTASKS --ntasks-per-node 1 -d 36 ./ChaNGa -killat 7 -g -b 32 -wall 5 -p 8192 -v 1 +balancer MultistepLB_notopo bench10M.param ++ppn 35 +commap 0 +pemap 1-35 +setcpuaffinity

#4 Updated by Sam White over 1 year ago

  • Target version set to 6.9.0
  • Assignee set to Seonmyeong Bak

We don't have access to Piz Daint, but Edison at NERSC is similar. Note the use of 'PrgEnv-gnu', 'gcc/5.3.0', and 'craype-hugepages2M' modules.

#5 Updated by Sam White over 1 year ago

It would be good to run this under Valgrind and see if anything comes up from the Cth* or *context routines.

#6 Updated by Eric Bohm over 1 year ago

We're seeing a possibly related topo issue with larger scale runs in OpenAtom on BlueWaters, still trying to narrow down the issue.

[0] Assertion "idx.dimension == bounds.dimension" failed in file cklocation.C line 2874.

#7 Updated by Thomas Quinn over 1 year ago

This seems to be a memory corruption issue: I tried a smaller problem, and I got malloc failures:
free(): invalid pointer
at: queueing.C:634
CqsDeqEnqueueFifo(&(q->zeroprio), data);

Unfortunately valgrind takes forever: a program that dies in 9 seconds takes more than 4 hours under valgrind. Any other suggestions for memory tracking?
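
One cheaper alternative might be AddressSanitizer, which typically costs around a 2x slowdown rather than Valgrind's 20-50x. A hypothetical way to enable it (the exact flag placement in the Charm++ and ChaNGa build scripts is an assumption and may need adjusting):

```shell
# Rebuild Charm++ with AddressSanitizer instrumentation; extra options
# are passed through to the compiler by the build script (hypothetical).
./build ChaNGa mpi-crayxc smp -j8 -g -fsanitize=address

# Rebuild ChaNGa with the same instrumentation (hypothetical flags).
./configure CXXFLAGS="-g -fsanitize=address"
make
```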

#8 Updated by Seonmyeong Bak over 1 year ago

Try compiling Charm++ with -DCMK_NOT_USE_TLS_THREAD=1.

Tom, can you share the input data so that I can reproduce your bug on Edison?
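
The suggested rebuild might look like this (a sketch; passing the define straight through the build script is an assumption):

```shell
# Hypothetical rebuild with TLS-based thread-local variables disabled;
# extra options are forwarded to the compiler by the build script.
./build ChaNGa mpi-crayxc smp -j8 -O2 -DCMK_NOT_USE_TLS_THREAD=1
```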

#10 Updated by Thomas Quinn over 1 year ago

Running after compiling with -fsanitize=address gives:

Domain decomposition...SFC Peano-Hilbert
Created 128 pieces of tree
=================================================================
==8492==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6310000147e8 at pc 0x00002099a591 bp 0x6310000147d0 sp 0x6310000147c8
READ of size 8 at 0x6310000147e8 thread T1
#0 0x2099a590 in CthStartThread (/scratch/snx3000/trq/bench/disk1M/ChaNGa+0x2099a590)
#1 0x2099a93e in make_fcontext (/scratch/snx3000/trq/bench/disk1M/ChaNGa+0x2099a93e)

0x6310000147e8 is located 9192 bytes to the right of 72704-byte region [0x631000000800,0x631000012400)
allocated by thread T0 here:
#0 0x2aaaaadaab30 in _interceptor_malloc ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libsanitizer/asan/
#1 0x2aaaae11ab35 in pool ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libstdc++-v3/libsupc++/
#2 0x2aaaae11ab35 in __static_initialization_and_destruction_0 ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libstdc++-v3/libsupc++/
#3 0x2aaaae11ab35 in _GLOBAL ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libstdc++-v3/libsupc++/

#11 Updated by Sam White over 1 year ago

  • Priority changed from Normal to High
  • Status changed from New to In Progress

#12 Updated by Phil Miller over 1 year ago

  • Description updated (diff)

#13 Updated by Seonmyeong Bak over 1 year ago

This issue happened because mpi-crayxe/mpi-crayxc set JCONTEXT in conv-mach-smp.h, while no other target sets its user-level thread configuration flags in conv-mach-smp separately.

Other targets where uFcontext is enabled don't have this issue, for the reason above.
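
For illustration, the mismatch described above could look like this in a conv-mach header (a hypothetical sketch, not the actual Charm++ source; the macro names are assumptions based on the thread implementations named in this thread):

```c
/* Hypothetical sketch of the configuration mismatch -- not the actual
 * conv-mach-smp.h contents.
 *
 * mpi-crayxe/mpi-crayxc forced the setjmp-based user-level threads
 * (JCONTEXT) from the shared SMP header: */
#define CMK_THREADS_USE_JCONTEXT 1

/* ...while other targets choose their thread implementation (e.g. the
 * fcontext-based one) in their own per-target conv-mach headers, so the
 * shared SMP header should not hard-code a choice. */
```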

#14 Updated by Seonmyeong Bak over 1 year ago

  • Status changed from In Progress to Implemented

#16 Updated by Sam White over 1 year ago

  • Status changed from Implemented to Merged
