Bug #1929

AMPIF print and write statements break when tlsglobals is enabled

Added by Evan Ramos 4 months ago. Updated 3 months ago.

Status: Merged
Priority: Normal
Assignee:
Category: AMPI
Target version:
Start date: 06/08/2018
Due date:
% Done: 0%
Tags:

Description

When running an AMPI Fortran program with tlsglobals, formatted output statements of the form WRITE(*,[fmt]) and PRINT [fmt] crash inside libgfortran. Changing these to the list-directed forms WRITE(*,*) and PRINT * works around the crash but loses the intended formatting of the output, and is not something we should require of users in order to AMPI-ize their code.


Related issues

Related to Charm++ - Bug #1268: AMPIF issues due to C++ main routine Merged 10/27/2016

History

#1 Updated by Evan Ramos 4 months ago

I am using a debug build of libgfortran to investigate further. For reference, this is how I built it:

sudo apt install flex bison libgmp-dev libgmp3-dev libgmp10 libmpc-dev libmpc3 libmpfr-dev libmpfr6
git clone git://gcc.gnu.org/git/gcc.git
mkdir build
cd build
../gcc/configure --disable-multilib
make -j

libgfortran will be one of the last pieces to build. For partial rebuilding, make can be run from build/x86_64-pc-linux-gnu/libgfortran/.

To run against it, copy build/x86_64-pc-linux-gnu/libgfortran/.libs/libgfortran.* into the directory containing your binary, then manually re-run the link command, adding -L. -rpath-origin to the ampif90 invocation.

#2 Updated by Evan Ramos 4 months ago

Compiling libgfortran with -O0 instead of -O2 seems to delay crashing until later in MiniGhost's execution.

-O2:

Charm++: standalone mode (not using charmrun)
Charm++> Running in non-SMP mode: 1 processes (PEs)
Converse/Charm++ Commit ID: v6.8.2-743-gb94e029c7
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 6 cores x 2 PUs = 12-way SMP)
Charm++> cpu topology info is gathered in 0.000 seconds.
Charm++> -tlsglobals enabled for privatization of thread-local variables.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b7ec2c in next_char (fmt=fmt@entry=0x555555ddf6d0, literal=literal@entry=0) at ../../../src/libgfortran/io/format.c:196
196          c = toupper (*fmt->format_string++);
(gdb) bt
#0  0x00007ffff7b7ec2c in next_char (fmt=fmt@entry=0x555555ddf6d0, literal=literal@entry=0) at ../../../src/libgfortran/io/format.c:196
#1  0x00007ffff7b7ed44 in format_lex (fmt=fmt@entry=0x555555ddf6d0) at ../../../src/libgfortran/io/format.c:309
#2  0x00007ffff7b805fe in format_lex (fmt=0x555555ddf6d0) at ../../../src/libgfortran/io/format.c:1346
#3  _gfortrani_parse_format (dtp=dtp@entry=0x4010ff400) at ../../../src/libgfortran/io/format.c:1348
#4  0x00007ffff7b8fa28 in data_transfer_init (dtp=dtp@entry=0x4010ff400, read_flag=read_flag@entry=0) at ../../../src/libgfortran/io/transfer.c:2793
#5  0x00007ffff7b904a4 in _gfortran_st_write (dtp=dtp@entry=0x4010ff400) at ../../../src/libgfortran/io/transfer.c:4133
#6  0x000055555572b7b7 in mg_utils_mod::mg_print_header (comm_method=10, stencil=21, ierr=0) at MG_UTILS.F:323
#7  0x00005555557408ac in mini_ghost (scaling_in=<optimized out>, nx_in=<optimized out>, ny_in=<optimized out>, nz_in=<optimized out>, nvars_in=<optimized out>, percent_sum_in=<optimized out>, nspikes_in=1, ntsteps_in=100, stencil_in=21, comm_method_in=10, bc_in=31, error_tol_in=8, report_diffusion_in=20, npx_in=1, npy_in=1, npz_in=1, report_perf_in=0, 
    cp_method_in=1, cp_interval_in=0, cp_file_in=..., restart_cp_num_in=-2, restart_file_in=..., debug_grid_in=0, _cp_file_in=1433575663, _restart_file_in=1439154320) at DRIVER.F:175
#8  0x000055555572a208 in AMPI_Main (argc=1, argv=0x555555c7c090) at main.c:375
#9  0x00005555557cb142 in AMPI_Main_c (argc=1, argv=0x555555c7c090) at compat_ampi.c:15
#10 0x0000555555754655 in AMPI_Fallback_Main (argc=1, argv=0x555555c7c090) at ampi.C:826
#11 0x0000555555797c44 in MPI_threadstart_t::start (this=0x401100098) at ampi.C:1031
#12 0x0000555555754c0f in AMPI_threadstart (data=0x555555dcef40) at ampi.C:1051
#13 0x0000555555741366 in startTCharmThread (msg=0x555555dcef20) at tcharm.C:175
#14 0x00005555558c5261 in CthStartThread (arg=...) at libthreads-default-tls.c:1770
#15 0x00005555558c56ff in make_fcontext () at make_x86_64_sysv_elf_gas.S:70
#16 0x0000000000000000 in ?? ()
(gdb) disas
Dump of assembler code for function next_char:
   0x00007ffff7b7ebf0 <+0>:    xor    $0x1,%esi
   0x00007ffff7b7ebf3 <+3>:    push   %r12
   0x00007ffff7b7ebf5 <+5>:    push   %rbp
   0x00007ffff7b7ebf6 <+6>:    mov    %esi,%r12d
   0x00007ffff7b7ebf9 <+9>:    push   %rbx
   0x00007ffff7b7ebfa <+10>:    mov    0x24(%rdi),%ebp
   0x00007ffff7b7ebfd <+13>:    mov    %rdi,%rbx
   0x00007ffff7b7ec00 <+16>:    and    $0x1,%r12d
   0x00007ffff7b7ec04 <+20>:    jmp    0x7ffff7b7ec47 <next_char+87>
   0x00007ffff7b7ec06 <+22>:    nopw   %cs:0x0(%rax,%rax,1)
   0x00007ffff7b7ec10 <+32>:    sub    $0x1,%ebp
   0x00007ffff7b7ec13 <+35>:    mov    %ebp,0x24(%rbx)
   0x00007ffff7b7ec16 <+38>:    callq  0x7ffff7a1b0f0 <__ctype_toupper_loc@plt>
   0x00007ffff7b7ec1b <+43>:    mov    (%rax),%rdx
   0x00007ffff7b7ec1e <+46>:    mov    (%rbx),%rax
   0x00007ffff7b7ec21 <+49>:    lea    0x1(%rax),%rcx
   0x00007ffff7b7ec25 <+53>:    mov    %rcx,(%rbx)
   0x00007ffff7b7ec28 <+56>:    movsbq (%rax),%rax
=> 0x00007ffff7b7ec2c <+60>:    mov    (%rdx,%rax,4),%eax
   0x00007ffff7b7ec2f <+63>:    cmp    $0x20,%eax
   0x00007ffff7b7ec32 <+66>:    mov    %al,0x18(%rbx)
   0x00007ffff7b7ec35 <+69>:    sete   %cl
   0x00007ffff7b7ec38 <+72>:    cmp    $0x9,%eax
   0x00007ffff7b7ec3b <+75>:    sete   %dl
   0x00007ffff7b7ec3e <+78>:    or     %dl,%cl
   0x00007ffff7b7ec40 <+80>:    je     0x7ffff7b7ec50 <next_char+96>
   0x00007ffff7b7ec42 <+82>:    test   %r12b,%r12b
   0x00007ffff7b7ec45 <+85>:    je     0x7ffff7b7ec50 <next_char+96>
   0x00007ffff7b7ec47 <+87>:    test   %ebp,%ebp
   0x00007ffff7b7ec49 <+89>:    jne    0x7ffff7b7ec10 <next_char+32>
   0x00007ffff7b7ec4b <+91>:    mov    $0xffffffff,%eax
   0x00007ffff7b7ec50 <+96>:    pop    %rbx
   0x00007ffff7b7ec51 <+97>:    pop    %rbp
   0x00007ffff7b7ec52 <+98>:    pop    %r12
   0x00007ffff7b7ec54 <+100>:    retq   
End of assembler dump.

-O0:

Charm++: standalone mode (not using charmrun)
Charm++> Running in non-SMP mode: 1 processes (PEs)
Converse/Charm++ Commit ID: v6.8.2-743-gb94e029c7
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 6 cores x 2 PUs = 12-way SMP)
Charm++> cpu topology info is gathered in 0.000 seconds.
Charm++> -tlsglobals enabled for privatization of thread-local variables.

 ========================================================
           Mantevo miniapp MiniGhost experiment
 ========================================================

 Communication strategy: full message aggregation (COMM_METHOD_BSPMA)

 Computation: 5 pt difference stencil on a 2D grid (STENCIL_2D5PT)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b67dfb in format_lex (fmt=0x555555de0730) at ../../../src/libgfortran/io/format.c:370
370          if (!isdigit (c))
(gdb) bt
#0  0x00007ffff7b67dfb in format_lex (fmt=0x555555de0730) at ../../../src/libgfortran/io/format.c:370
#1  0x00007ffff7b68f7f in parse_format_list (dtp=0x4010ff400, seen_dd=0x4010ff1a7) at ../../../src/libgfortran/io/format.c:1096
#2  0x00007ffff7b695e2 in _gfortrani_parse_format (dtp=0x4010ff400) at ../../../src/libgfortran/io/format.c:1349
#3  0x00007ffff7b7cf52 in data_transfer_init (dtp=0x4010ff400, read_flag=0) at ../../../src/libgfortran/io/transfer.c:2793
#4  0x00007ffff7b7fd21 in _gfortran_st_write (dtp=0x4010ff400) at ../../../src/libgfortran/io/transfer.c:4133
#5  0x000055555572ba15 in mg_utils_mod::mg_print_header (comm_method=<optimized out>, stencil=<optimized out>, ierr=0) at MG_UTILS.F:378
#6  0x00005555557408ac in mini_ghost (scaling_in=<optimized out>, nx_in=<optimized out>, ny_in=<optimized out>, nz_in=<optimized out>, nvars_in=<optimized out>, percent_sum_in=<optimized out>, nspikes_in=1, ntsteps_in=100, stencil_in=21, comm_method_in=10, bc_in=31, error_tol_in=8, report_diffusion_in=20, npx_in=1, npy_in=1, npz_in=1, report_perf_in=0, 
    cp_method_in=1, cp_interval_in=0, cp_file_in=..., restart_cp_num_in=-2, restart_file_in=..., debug_grid_in=0, _cp_file_in=1433575663, _restart_file_in=1439154320) at DRIVER.F:175
#7  0x000055555572a208 in AMPI_Main (argc=1, argv=0x555555c7c090) at main.c:375
#8  0x00005555557cb142 in AMPI_Main_c (argc=1, argv=0x555555c7c090) at compat_ampi.c:15
#9  0x0000555555754655 in AMPI_Fallback_Main (argc=1, argv=0x555555c7c090) at ampi.C:826
#10 0x0000555555797c44 in MPI_threadstart_t::start (this=0x401100098) at ampi.C:1031
#11 0x0000555555754c0f in AMPI_threadstart (data=0x555555dcef40) at ampi.C:1051
#12 0x0000555555741366 in startTCharmThread (msg=0x555555dcef20) at tcharm.C:175
#13 0x00005555558c5261 in CthStartThread (arg=...) at libthreads-default-tls.c:1770
#14 0x00005555558c56ff in make_fcontext () at make_x86_64_sysv_elf_gas.S:70
#15 0x0000000000000000 in ?? ()
(gdb) disas
Dump of assembler code for function format_lex:
[...]
   0x00007ffff7b67de7 <+447>:    callq  0x7ffff79abc70 <__ctype_b_loc@plt>
   0x00007ffff7b67dec <+452>:    mov    (%rax),%rax
   0x00007ffff7b67def <+455>:    mov    -0xc(%rbp),%edx
   0x00007ffff7b67df2 <+458>:    movslq %edx,%rdx
   0x00007ffff7b67df5 <+461>:    add    %rdx,%rdx
   0x00007ffff7b67df8 <+464>:    add    %rdx,%rax
=> 0x00007ffff7b67dfb <+467>:    movzwl (%rax),%eax
   0x00007ffff7b67dfe <+470>:    movzwl %ax,%eax
   0x00007ffff7b67e01 <+473>:    and    $0x800,%eax
   0x00007ffff7b67e06 <+478>:    test   %eax,%eax
   0x00007ffff7b67e08 <+480>:    je     0x7ffff7b67e2d <format_lex+517>
[...]

Interestingly, the section of assembly code crashing with -O0 matches a commented explanation here: https://stackoverflow.com/a/50296176

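Both crashes share one characteristic: the segfault occurs while reading through the table pointer returned by __ctype_toupper_loc() (-O2) or __ctype_b_loc() (-O0). These are implementation details of the functions in ctype.h (here toupper and isdigit), and they use thread-local storage.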
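For anyone following along, here is a rough C sketch of what the -O0 disassembly above is doing. It assumes glibc, where isdigit() is implemented as a lookup in a table reached through the thread-local pointer returned by __ctype_b_loc(). The helper name isdigit_via_table is made up for illustration. On little-endian glibc, _ISdigit is 0x800, which lines up with the `and $0x800,%eax` in the dump.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: spells out the table lookup that glibc's isdigit()
 * performs. __ctype_b_loc() and _ISdigit are glibc-specific. */
static bool isdigit_via_table(int c)
{
    /* glibc documents the table as valid for indices -128..255. */
    assert(c >= -128 && c <= 255);

    /* __ctype_b_loc() returns the address of a thread-local pointer.
     * Dereferencing it reads TLS, which is where the crash occurs if
     * the thread's TLS block does not cover that slot. */
    const unsigned short *table = *__ctype_b_loc();

    /* Each table entry is a bitmask of character classes. */
    return (table[c] & _ISdigit) != 0;
}
```

If the TLS block set up for a user-level thread does not include libc's portion, the `*__ctype_b_loc()` read lands on garbage, and the subsequent table access faults as in the backtraces above.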

#3 Updated by Evan Ramos 4 months ago

This seems relevant, but doesn't specifically help narrow down a solution: https://gcc.gnu.org/onlinedocs/gfortran/Thread-safety-of-the-runtime-library.html

#4 Updated by Evan Ramos 4 months ago

  • Status changed from New to In Progress

https://charm.cs.illinois.edu/gerrit/4329 is a start, but it only resolves the issue when the program is statically linked.

We need to account for shared objects in the size of the TLS segment we allocate. I'm currently researching how that can be done.

(gdb) p (char *)getTLS() - (char *)((unsigned short int **)__ctype_b_loc())
$5 = 1496
(gdb) p phdr->p_memsz
$7 = 1360
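In the gdb output above, the __ctype_b_loc slot appears to sit 1496 bytes below the address returned by getTLS(), while the program header consulted reports only 1360 bytes, so the slot falls outside what was allocated. As a minimal sketch of one way to see what the shared objects contribute (not necessarily what the eventual fix does), dl_iterate_phdr() from glibc's link.h can walk every loaded object and add up its PT_TLS segment. The struct and function names below are made up for illustration. The sum ignores alignment padding and the thread control block, so treat it as a lower bound rather than the real static TLS size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for dl_iterate_phdr() in <link.h> */
#endif
#include <assert.h>
#include <link.h>
#include <stddef.h>

/* Hypothetical accumulator filled in by the callback below. */
struct tls_totals {
    size_t memsz;    /* sum of p_memsz over all PT_TLS segments */
    size_t objects;  /* number of loaded objects with a PT_TLS segment */
};

/* Called once per loaded object (main executable and each shared library). */
static int add_tls_segment(struct dl_phdr_info *info, size_t size, void *data)
{
    struct tls_totals *totals = data;
    (void)size;
    assert(totals != NULL);

    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *phdr = &info->dlpi_phdr[i];
        if (phdr->p_type == PT_TLS) {
            totals->memsz += phdr->p_memsz;
            totals->objects++;
        }
    }
    return 0;  /* returning zero continues the iteration */
}

/* Hypothetical helper: lower bound on the TLS bytes needed by everything
 * currently loaded, ignoring alignment padding. */
static struct tls_totals sum_tls_segments(void)
{
    struct tls_totals totals = {0, 0};
    dl_iterate_phdr(add_tls_segment, &totals);
    return totals;
}
```

Comparing sum_tls_segments().memsz with the main executable's p_memsz alone shows how much is missed when only the executable's own TLS segment is counted.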

#5 Updated by Evan Ramos 4 months ago

  • Status changed from In Progress to Implemented

#6 Updated by Evan Ramos 4 months ago

  • Related to Bug #1268: AMPIF issues due to C++ main routine added

#7 Updated by Sam White 3 months ago

  • Status changed from Implemented to Merged
  • Target version set to 6.9.0
