Bug #1828

Infinite recursion inside malloc_info in CmiMemoryUsage

Added by Thomas Quinn 3 months ago. Updated 3 months ago.

Status: Merged
Priority: Normal
Assignee: -
Category: -
Target version: 6.9.0
Start date: 03/06/2018
Due date:
% Done: 100%

Description

I'm getting an infinite recursion when running ChaNGa on Piz Daint:

#27986 0x000000002027c950 in bool std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char>, (std::__detail::_RegexExecutorPolicy)0, false>(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
#27987 0x00000000202777a9 in findFirstCaptures ()
#27988 0x0000000020267b00 in MemusageMallocinfo ()
#27989 0x0000000020267d82 in CmiMemoryUsage ()

(yes, those are the stack height numbers!)

This is compiled with gcc 7.1.0 on Piz Daint with:
./build ChaNGa gni-crayxc hugepages smp -j8 --with-production

Note that I have
export MEMORYUSAGE_NO_MALLINFO=1
in my job script.


Subtasks

Bug #1819: bigsim failing lb_test inside CmiMemoryUsage() (Merged)

History

#1 Updated by Sam White 3 months ago

  • Target version set to 6.9.0

#2 Updated by Shaoqin Lu 3 months ago

Hi Thomas, I developed the malloc_info feature. I haven't been able to reproduce this locally, and I don't have access to that supercomputer right now. Would you mind trying the following in your charm build folder, so I can narrow down the potential issue?

in
charm/<build>/examples/charm++/hello/1darray/hello.C
insert the following two lines in `Main`

printf("%s\n", CmiMemoryUsageReporter());
printf("%lu\n", CmiMemoryUsage());

Then make and run hello, and let me know the result. Ideally it should print "Mallocinfo" followed by the number of bytes in use.

#3 Updated by Thomas Quinn 3 months ago

Here is what I get:

Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 1, 1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.8.2-453-g5b68b3ce4
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (72-way SMP).
Running Hello on 1 processors for 5 elements
Hello 0 created
Hello 1 created
Hello 2 created
Hello 3 created
Hello 4 created
Mallocinfo
1098442131584
Main::initDone reached
Hi17 from element 0
Hi18 from element 1
Hi19 from element 2
Hi20 from element 3
Hi21 from element 4
All done
[Partition 0][Node 0] End of program

#4 Updated by Shaoqin Lu 3 months ago

Thomas Quinn wrote:

Here is what I get:

Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 1, 1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.8.2-453-g5b68b3ce4
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (72-way SMP).
Running Hello on 1 processors for 5 elements
Hello 0 created
Hello 1 created
Hello 2 created
Hello 3 created
Hello 4 created
Mallocinfo
1098442131584
Main::initDone reached
Hi17 from element 0
Hi18 from element 1
Hi19 from element 2
Hi20 from element 3
Hi21 from element 4
All done
[Partition 0][Node 0] End of program

This is the expected behavior on the Charm++ side. The problem might involve ChaNGa; I will keep looking into this.

#5 Updated by Thomas Quinn 3 months ago

I got hello to die by
1) adding the printfs to the SayHi() method.
2) running with:
srun -C mc -n 1 --tasks-per-node 1 -c 8 ./hello ++ppn 8

This gives me:
Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 1, 8 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.8.2-453-g5b68b3ce4
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (72-way SMP).
Running Hello on 8 processors for 5 elements
Hello 0 created
Mallocinfo
1098449516672
Hello 4 created
Hello 3 created
Hello 2 created
Hello 1 created
Main::initDone reached
Hi17 from element 0
Mallocinfo
1098451695744
Hi18 from element 1
Mallocinfo
1098451720320
Hi19 from element 2
Mallocinfo
1098451748992
Hi20 from element 3
srun: error: nid00114: task 0: Segmentation fault (core dumped)

#6 Updated by Shaoqin Lu 3 months ago

Thomas Quinn wrote:

I got hello to die by
1) adding the printfs to the SayHi() method.
2) running with:
srun -C mc -n 1 --tasks-per-node 1 -c 8 ./hello ++ppn 8

This gives me:
Charm++> Running on Gemini (GNI) with 1 processes
Charm++> static SMSG
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 2048K
Charm++> Running in SMP mode: numNodes 1, 8 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.8.2-453-g5b68b3ce4
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (72-way SMP).
Running Hello on 8 processors for 5 elements
Hello 0 created
Mallocinfo
1098449516672
Hello 4 created
Hello 3 created
Hello 2 created
Hello 1 created
Main::initDone reached
Hi17 from element 0
Mallocinfo
1098451695744
Hi18 from element 1
Mallocinfo
1098451720320
Hi19 from element 2
Mallocinfo
1098451748992
Hi20 from element 3
srun: error: nid00114: task 0: Segmentation fault (core dumped)

Thanks for this update. This is a different error than the one with ChaNGa, right? So far I haven't been able to reproduce either of them. I am still trying different ways to reproduce, and I will work with Eric on Monday to see what I can do to gain access to that particular system. If you don't mind, could you post a stack trace for this segmentation fault?

#7 Updated by Shaoqin Lu 3 months ago

Also, if this significantly blocks progress, you can remove the body of MemusageMallocinfo(void) in src/conv-core/memory.c
and replace it with `return 0;`; that should resolve everything related to this. Really sorry about the trouble.

#8 Updated by Thomas Quinn 3 months ago

This is the same bug I get with ChaNGa: the seg fault is caused by an infinite recursion.

#9 Updated by Shaoqin Lu 3 months ago

Thomas Quinn wrote:

This is the same bug I get with ChaNGa: the seg fault is caused by an infinite recursion.

Okay, thanks. I still can't reproduce this; I ran with ChaNGa as well but had no luck. The fact that it fails after a few successful calls makes me suspect a race condition, but I double-checked all the functions involved and all of them are MT-safe. An infinitely matching regular expression is also unlikely, since some of the calls actually succeed. I can only wait to gain access to the machine in order to make progress on this issue. Sorry about this.

#10 Updated by Shaoqin Lu 3 months ago

I just want to post an update on this issue. I am able to reproduce this on a Cray XC machine, and it is a multithreading-related issue: it only shows up in SMP mode, and intermittently, when many threads call regex_search, the code falls into what looks like infinite recursion. The crayxc build is not the only one affected; the netlrts build is affected too. It seems to be specific to this machine, though, since I can't reproduce it on my Ubuntu box. There is no obvious bug in the code itself, so I may need to investigate some library source code. I will discuss this with my mentor and maybe push a temporary fix to disable this function on crayxc.

Feel free to let me know if you have any other concerns.

#11 Updated by Sam White 3 months ago

This failure can also be reproduced on a Linux machine using a non-SMP bigsim build: https://charm.cs.illinois.edu/redmine/issues/1819

#12 Updated by Sam White 3 months ago

  • Status changed from New to In Progress

#13 Updated by Thomas Quinn 3 months ago

Somewhat of an aside, but the usage numbers reported by malloc_info() on the cray-xc are not informative. I doubt that "hello.C" is allocating a Terabyte of RAM.

#14 Updated by Shaoqin Lu 3 months ago

Thomas Quinn wrote:

Somewhat of an aside, but the usage numbers reported by malloc_info() on the cray-xc are not informative. I doubt that "hello.C" is allocating a Terabyte of RAM.

Yup I noticed this as well. Thanks for the note, I will address this too.

#15 Updated by Sam White 3 months ago

  • Subject changed from Infinite recursion in CmiMemory Usage to Infinite recursion inside malloc_info in CmiMemoryUsage

#16 Updated by Shaoqin Lu 3 months ago

Hi Thomas, I am filing a bug report against g++ for this issue. Would you mind posting the
`g++ -v` and `uname -a` output from Piz Daint? The machine I have access to has a 5.x version of g++, which is too old to support the bug report.

#18 Updated by Thomas Quinn 3 months ago

gcc -v:
Using built-in specs.
COLLECT_GCC=/opt/gcc/5.3.0/bin/../snos/bin/gcc
COLLECT_LTO_WRAPPER=/opt/gcc/5.3.0/snos/libexec/gcc/x86_64-suse-linux/5.3.0/lto-wrapper
Target: x86_64-suse-linux
Configured with: ../cray-gcc-5.3.0/configure --prefix=/opt/gcc/5.3.0/snos --disable-nls --libdir=/opt/gcc/5.3.0/snos/lib --enable-languages=c,c++,fortran --with-gxx-include-dir=/opt/gcc/5.3.0/snos/include/g++ --with-slibdir=/opt/gcc/5.3.0/snos/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --build=x86_64-suse-linux --with-ppl --with-cloog
Thread model: posix
gcc version 5.3.0 20151204 (Cray Inc.) (GCC)

uname -a:
Linux daint105 4.4.74-92.38-default #1 SMP Tue Sep 12 19:43:46 UTC 2017 (545c055) x86_64 x86_64 x86_64 GNU/Linux

#19 Updated by Shaoqin Lu 3 months ago

Thomas Quinn wrote:

gcc -v:
Using built-in specs.
COLLECT_GCC=/opt/gcc/5.3.0/bin/../snos/bin/gcc
COLLECT_LTO_WRAPPER=/opt/gcc/5.3.0/snos/libexec/gcc/x86_64-suse-linux/5.3.0/lto-wrapper
Target: x86_64-suse-linux
Configured with: ../cray-gcc-5.3.0/configure --prefix=/opt/gcc/5.3.0/snos --disable-nls --libdir=/opt/gcc/5.3.0/snos/lib --enable-languages=c,c++,fortran --with-gxx-include-dir=/opt/gcc/5.3.0/snos/include/g++ --with-slibdir=/opt/gcc/5.3.0/snos/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --build=x86_64-suse-linux --with-ppl --with-cloog
Thread model: posix
gcc version 5.3.0 20151204 (Cray Inc.) (GCC)

uname -a:
Linux daint105 4.4.74-92.38-default #1 SMP Tue Sep 12 19:43:46 UTC 2017 (545c055) x86_64 x86_64 x86_64 GNU/Linux

This is still GCC 5.3, though. Would you mind trying the 7.1 version?

#20 Updated by Thomas Quinn 3 months ago

I originally got the failure with this version:

trq@daint105:~/src/charm> gcc -v
Using built-in specs.
COLLECT_GCC=/opt/gcc/7.1.0/bin/../snos/bin/gcc
COLLECT_LTO_WRAPPER=/opt/gcc/7.1.0/snos/libexec/gcc/x86_64-suse-linux/7.1.0/lto-wrapper
Target: x86_64-suse-linux
Configured with: ../cray-gcc-7.1.0-201705230545.65f29659747b4/configure --prefix=/opt/gcc/7.1.0/snos --disable-nls --libdir=/opt/gcc/7.1.0/snos/lib --enable-languages=c,c++,fortran --with-gxx-include-dir=/opt/gcc/7.1.0/snos/include/g++ --with-slibdir=/opt/gcc/7.1.0/snos/lib --with-system-zlib --enable-shared --enable-__cxa_atexit --build=x86_64-suse-linux --with-ppl --with-cloog --disable-multilib
Thread model: posix
gcc version 7.1.0 20170502 (Cray Inc.) (GCC)

#21 Updated by Shaoqin Lu 3 months ago

This issue is potentially resolved. See the g++ stdlib thread:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84865

Basically, the stack overflow is expected behavior, not an infinite loop. Compiling Charm++ with -O2, which inlines the recursive call, resolves the segfault for me. That said, I still need to figure out why it's reporting a terabyte of RAM.

Feel free to try out -O2 and let me know the result. Thanks.

#22 Updated by Thomas Quinn 3 months ago

I built with "-g -O2", and I still get the crash.

#23 Updated by Shaoqin Lu 3 months ago

Does -O2 on its own work, though?

I know relying on compiler optimization is not a reliable fix. I will come up with a workaround (maybe setting the stack size on threads) or stop using regex altogether. Allow me a couple of days for that :)

#24 Updated by Thomas Quinn 3 months ago

I still get the crash with "-O2".

Running with
./hello +stack-size 10000000 ++ppn 8
gets me past the problem.

#25 Updated by Shaoqin Lu 3 months ago

Thomas Quinn wrote:

I still get the crash with "-O2".

Running with
./hello +stack-size 10000000 ++ppn 8
gets me past the problem.

Cool. This is definitely not a long-term solution, since the default build still crashes and the reported numbers are not useful. I will update this issue once a proper fix is implemented.

#26 Updated by Sam White 3 months ago

We reverted the malloc_info patch for now.

#27 Updated by Sam White 3 months ago

  • Status changed from In Progress to Merged
