Bug #1786

Assertion "thisDim < thatDim" failed in file cklocation.C line 2880

Added by Thomas Quinn over 1 year ago. Updated over 1 year ago.

Status: Merged
Priority: High
Category: -
Target version:
Start date: 01/30/2018
Due date:
% Done: 0%


Description

So far, this only happens when I "build ChaNGa mpi-crayxc smp -O2". (A gni-crayxc smp build works.)
A print statement shows that the "idx.index[0]" in CkLocMgr::checkInBounds() is garbage; perhaps uninitialized.

It fails immediately when running ChaNGa; the output is:

Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: numNodes 1,  35 worker threads per process
Charm++> The comm. thread both sends and receives messages
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: v6.8.2-288-gcbb953362
CharmLB> Load balancer assumes all CPUs are same.
WARNING: bKDK parameter ignored; KDK is always used.
WARNING: bStandard parameter ignored; Output is always standard.
ChaNGa version 3.3, commit v3.3-76-g39ba07e
Running on 35 processors/ 1 nodes with 8192 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...SFC Peano-Hilbert
Created 8192 pieces of tree
dim: 0 :this: -2147483648 that 8192
[34] Assertion "thisDim < thatDim" failed in file cklocation.C line 2882.
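
One way to read the printed values: as signed integers, -2147483648 < 8192 holds, so the assertion could only fail if the two values are compared as unsigned. The sketch below illustrates that reading. It is not the Charm++ source; `inBounds` and `assertInBounds` are hypothetical names, and the unsigned comparison is an assumption that this report does not confirm.

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for the per-dimension bounds check.
// Assumption: index and bound are compared as unsigned integers, so a
// garbage value such as INT_MIN (printed as -2147483648) becomes
// 2147483648u and lands far outside the bound.
bool inBounds(int index, int bound) {
  return static_cast<unsigned int>(index) < static_cast<unsigned int>(bound);
}

// Mirrors the shape of the failing assertion for a single dimension.
void assertInBounds(int index, int bound) {
  assert(inBounds(index, bound) && "thisDim < thatDim");
}
```

Under that assumption, `inBounds(INT_MIN, 8192)` is false while any index from 0 to 8191 passes, which matches an index that was never initialized or was overwritten rather than one that is merely off by one.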

History

#1 Updated by Thomas Quinn over 1 year ago

Change https://charm.cs.illinois.edu/gerrit/#/c/3482/
seems to have introduced this bug.

#2 Updated by Sam White over 1 year ago

It's weird that this would show up in mpi-crayxc-smp but not gni-crayxc-smp. What machine is this on? It would be nice for us to reproduce the Charm++ build, ChaNGa build, and run command you are using.

#3 Updated by Thomas Quinn over 1 year ago

Machine: "Piz Daint".
Modules:
Currently Loaded Modulefiles:
1) modules/3.2.10.6
2) eproxy/2.0.16-6.0.4.1_3.1__g001b199.ari
3) cray-mpich/7.6.0
4) slurm/17.02.9-1
5) xalt/daint-2016.11
6) gcc/5.3.0
7) craype-haswell
8) craype-network-aries
9) craype/2.5.12
10) cray-libsci/17.06.1
11) udreg/2.3.2-6.0.4.0_12.2__g2f9c3ee.ari
12) ugni/6.0.14-6.0.4.0_14.1__ge7db4a2.ari
13) pmi/5.0.12
14) dmapp/7.1.1-6.0.4.0_46.2__gb8abda2.ari
15) gni-headers/5.0.11-6.0.4.0_7.2__g7136988.ari
16) xpmem/2.2.2-6.0.4.0_3.1__g43b0535.ari
17) job/2.2.2-6.0.4.0_8.2__g3c644b5.ari
18) dvs/2.7_2.2.32-6.0.4.1_7.1__ged1923a
19) alps/6.4.1-6.0.4.0_7.2__g86d0f3d.ari
20) rca/2.2.11-6.0.4.0_13.2__g84de67a.ari
21) atp/2.1.1
22) perftools-base/6.5.1
23) PrgEnv-gnu/6.0.4
24) craype-hugepages2M
25) papi/5.5.1.2

Build command:
./build ChaNGa mpi-crayxc smp -j8 -O2

ChaNGa: "configure; make"

Run command (with one node/36 physical cores):
srun -n $SLURM_NTASKS --ntasks-per-node 1 -d 36 ./ChaNGa -killat 7 -g -b 32 -wall 5 -p 8192 -v 1 +balancer MultistepLB_notopo bench10M.param ++ppn 35 +commap 0 +pemap 1-35 +setcpuaffinity

#4 Updated by Sam White over 1 year ago

  • Target version set to 6.9.0
  • Assignee set to Seonmyeong Bak

We don't have access to Piz Daint, but Edison at NERSC is similar. Note the use of 'PrgEnv-gnu', 'gcc/5.3.0', and 'craype-hugepages2M' modules.

#5 Updated by Sam White over 1 year ago

It would be good to run this under Valgrind and see if anything comes up from the Cth* or *context routines.

#6 Updated by Eric Bohm over 1 year ago

We're seeing a possibly related topo issue with larger scale runs in OpenAtom on BlueWaters, still trying to narrow down the issue.

[0] Assertion "idx.dimension == bounds.dimension" failed in file cklocation.C line 2874.

#7 Updated by Thomas Quinn over 1 year ago

This seems to be a memory corruption issue: I tried a smaller problem, and I got malloc failures:
free(): invalid pointer
at: queueing.C:634
CqsDeqEnqueueFifo(&(q->zeroprio), data);

Unfortunately valgrind takes forever: a program that dies in 9 seconds takes more than 4 hours under valgrind. Any other suggestions for memory tracking?

#8 Updated by Seonmyeong Bak over 1 year ago

Try to compile charm with -DCMK_NOT_USE_TLS_THREAD=1.

Tom, can you share the input data so that I can replay your bug on Edison?

#10 Updated by Thomas Quinn over 1 year ago

Running after compiling with -fsanitize=address gives:

Domain decomposition...SFC Peano-Hilbert
Created 128 pieces of tree
=================================================================
==8492==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6310000147e8 at pc 0x00002099a591 bp 0x6310000147d0 sp 0x6310000147c8
READ of size 8 at 0x6310000147e8 thread T1
#0 0x2099a590 in CthStartThread (/scratch/snx3000/trq/bench/disk1M/ChaNGa+0x2099a590)
#1 0x2099a93e in make_fcontext (/scratch/snx3000/trq/bench/disk1M/ChaNGa+0x2099a93e)

0x6310000147e8 is located 9192 bytes to the right of 72704-byte region [0x631000000800,0x631000012400)
allocated by thread T0 here:
#0 0x2aaaaadaab30 in __interceptor_malloc ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libsanitizer/asan/asan_malloc_linux.cc:62
#1 0x2aaaae11ab35 in pool ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libstdc++-v3/libsupc++/eh_alloc.cc:123
#2 0x2aaaae11ab35 in __static_initialization_and_destruction_0 ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libstdc++-v3/libsupc++/eh_alloc.cc:250
#3 0x2aaaae11ab35 in _GLOBAL__sub_I_eh_alloc.cc ../../../../cray-gcc-7.1.0-201705230545.65f29659747b4/libstdc++-v3/libsupc++/eh_alloc.cc:326

#11 Updated by Sam White over 1 year ago

  • Priority changed from Normal to High
  • Status changed from New to In Progress

#12 Updated by Phil Miller over 1 year ago

  • Description updated (diff)

#13 Updated by Seonmyeong Bak over 1 year ago

This issue occurred because mpi-crayxe and mpi-crayxc set JCONTEXT in conv-mach-smp.h, whereas no other target sets user-level thread configuration flags separately in conv-mach-smp.h.

Other targets where uFcontext is enabled do not have this issue, for the reason above.
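
A compile-time guard is one way a conflict like this could be caught early. The following is a minimal sketch only: `DEMO_USE_JCONTEXT`, `DEMO_USE_FCONTEXT`, and `enabledThreadBackends` are hypothetical names invented for illustration, not Charm++ macros, and it assumes the problem amounts to a per-target header selecting a second thread backend alongside the default one.

```cpp
#include <cassert>

// Hypothetical stand-ins for per-target thread-backend flags. In a real
// build these would come from the machine-specific configuration headers;
// the names here are invented for this sketch.
#ifndef DEMO_USE_JCONTEXT
#define DEMO_USE_JCONTEXT 0
#endif
#ifndef DEMO_USE_FCONTEXT
#define DEMO_USE_FCONTEXT 1
#endif

// Counts how many user-level thread backends the configuration selects.
constexpr int enabledThreadBackends() {
  return (DEMO_USE_JCONTEXT ? 1 : 0) + (DEMO_USE_FCONTEXT ? 1 : 0);
}

// Rejects a build in which two backends are selected at once, for example
// when a target-specific header sets one flag and a global default sets
// the other.
static_assert(enabledThreadBackends() <= 1,
              "more than one user-level thread backend is enabled");
```

Compiling this sketch with `-DDEMO_USE_JCONTEXT=1` trips the `static_assert`, turning a runtime memory corruption into a build error under the stated assumption.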

#14 Updated by Seonmyeong Bak over 1 year ago

  • Status changed from In Progress to Implemented

#16 Updated by Sam White over 1 year ago

  • Status changed from Implemented to Merged
