Project

General

Profile

Bug #965

ampicc -swapglobals is broken for ld v2.24+

Added by Nikhil Jain over 3 years ago. Updated 8 months ago.

Status:
Closed
Priority:
Low
Assignee:
Category:
AMPI
Target version:
-
Start date:
01/31/2016
Due date:
% Done:

0%


Description

With gcc 4.8/ld 2.24, global variables are not listed in the relocation records. Thus preventing swapglobals from swapping pointers to them.


Related issues

Related to Charm++ - Bug #422: ParFUM crashes with parallel partitioning Merged 02/22/2014

History

#1 Updated by Eric Bohm over 3 years ago

  • Assignee set to Sam White

#2 Updated by Sam White over 3 years ago

  • Target version set to 6.7.1
  • Category set to AMPI
  • Subject changed from Non functional -swapglobal in current gcc/ld to -swapglobals is broken for current gcc/ld

#3 Updated by Sam White over 3 years ago

  • Target version changed from 6.7.1 to 6.8.0

#4 Updated by Phil Miller almost 3 years ago

  • Assignee changed from Sam White to Phil Miller

#5 Updated by Sam White almost 3 years ago

Using swapglobals with gcc-4.7.1 (on Cab at LLNL) results in aborts at runtime with the message:

Reason: CtgGlobalList: symbol table and GOT address mismatch!

Edit: Cab has ld v2.20

#6 Updated by Sam White over 2 years ago

  • Target version changed from 6.8.0 to 6.8.1

#7 Updated by Phil Miller over 2 years ago

I just tried out tests/ampi/jacobi3d on Courage, Ubuntu 14.04 with GCC 4.8.4 and ld 2.24, and it doesn't output the error message. However, swapglobals also doesn't actually work correctly. All 8 of my virtual ranks see the same value a global variable that should have been swapped to privatize for each of them.

#8 Updated by Phil Miller over 2 years ago

OK, so first issue on this is that the relocations in question are actually being optimized out at link time, so they're not even present in the final executable:

Object file disassembly of accessing 'global_rank':
55e:       48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 565 <AMPI_Main_cpp(int, char**)+0x60>
                   561: R_X86_64_GOTPCREL  global_rank-0x4
565:       89 10                   mov    %edx,(%rax)

Executable file disassembly of accessing 'global rank' (the same instruction position):
55b7cb:       48 8d 05 f6 9d 46 00    lea    0x469df6(%rip),%rax        # 9c55c8 <global_rank>
55b7d2:       89 10                   mov    %edx,(%rax)

So, I'll need to figure out how to disable that optimization, and get it to leave the relocations in place.

Once that's done, the relocation type also appears to potentially be a mismatch from what we look for when setting up the runtime's tables of adjustments to make to the GOT. GOTPCREL as seen above, vs GLOB_DAT in the code. If we're lucky, those are synonymous.

#9 Updated by Phil Miller over 2 years ago

Here's the patch that implemented that opitmization in GNU LD:
https://sourceware.org/ml/binutils/2012-08/msg00498.html
It would be present in binutils 2.23 (released 2012-11) and onward.

I'm trying things out on might, with ld 2.22, to see if it works absent that optimization.

#10 Updated by Phil Miller over 2 years ago

And indeed, when compiled on Might, swapglobals appears to operate correctly.

#11 Updated by Phil Miller over 2 years ago

Moving to PIE compilation/linking on the newer LD doesn't fix things. It still sees a 'local' reference to a global variable, and rewrites it to bypass the GOT

#12 Updated by Phil Miller over 2 years ago

The modified code to apply this conversion in modern ld is unconditional. Local references always get smashed. If we want swapglobals to ever work, we'll have to use a modified ld to do it.

#13 Updated by Phil Miller over 2 years ago

  • Status changed from New to In Progress

I have confirmed that with a modified ld to disable this optimization/conversion, and calling it by setting its path in the -B option to g++ at link time, I can get proper operation of -swapglobals. While this is a technically sufficient solution, it may not be practically deployable.

#14 Updated by Phil Miller over 2 years ago

Hacked ld in repo

git clone charmgit:users/phil/binutils-gdb -b phil/ampi-swapglobals-hack

#15 Updated by Phil Miller over 2 years ago

Pleasantly, the affected bits of code in ld are all part of libbfd, which popular distributions seem to link dynamically. So, we can be less dependent on compiler flags to get us the right path, and just provide suitably rebuilt modified versions of libbfd.so that we can point to with LD_LIBRARY_PATH or LD_PRELOAD. This rebuild will be somewhat more system-specific than a modified ld binary, so we'll have to see how the tradeoff goes between the various options.

The same trick applies to various TLS-based transformations as well.

#16 Updated by Sam White almost 2 years ago

  • Target version changed from 6.8.1 to 6.9.0

#17 Updated by Phil Miller over 1 year ago

Mass re-assign AMPI-related issues on my plate to Sam, for subsequent redistribution.

#18 Updated by Phil Miller over 1 year ago

  • Assignee changed from Phil Miller to Sam White

Mass re-assign AMPI-related issues on my plate to Sam, for subsequent redistribution.

For real this time.

#19 Updated by Sam White over 1 year ago

I'm not exactly sure what's left here, is it integration/distribution of the patched libbfd with AMPI?

#20 Updated by Phil Miller over 1 year ago

The integration work, and then validation that it's a practical path forward for deployment.

#21 Updated by Evan Ramos over 1 year ago

This is Phil's patch, which applies on top of binutils-gdb commit c41cf6fdf514fce6b69f8f875b6903b2a3910f89:

From af4a9518d3fb577b818a11822c03d5ce2e4aa6cb Mon Sep 17 00:00:00 2001
From: Phil Miller <mille121@illinois.edu>
Date: Mon, 3 Apr 2017 18:23:41 -0500
Subject: [PATCH] ld/bfd: Disable conversion of mov foo@GOT(%rip), %reg to lea
 foo@(%rip), %reg to re-enable AMPI -swapglobals

This conversion is a nice optimization to avoid indirection costs for references to variables that are local to the object making the reference. Unfortunately, it disables privatization by swapping pointers in the GOT, because global variable references no longer go through the GOT.
---
 bfd/elf64-x86-64.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/bfd/elf64-x86-64.c b/bfd/elf64-x86-64.c
index 6d92c79c93..33ab255870 100644
--- a/bfd/elf64-x86-64.c
+++ b/bfd/elf64-x86-64.c
@@ -1841,6 +1841,10 @@ elf_x86_64_convert_load_reloc (bfd *abfd, asection *sec,
       if (!relocx)
     return TRUE;
     }
+  else
+    {
+      return TRUE;
+    }

   /* We convert only to R_X86_64_PC32:
      1. Branch.
-- 
2.16.1

#22 Updated by Evan Ramos over 1 year ago

The patch applies cleanly on the binutils-gdb git repository, and I can build libbfd.so with `./configure --enable-shared && make all-bfd`.

#23 Updated by Sam White over 1 year ago

Great. But now since libbfd is GPL and we are not, we can't just distribute our modified version as part of Charm++/AMPI (note that MPC is CeCILL-C, which is GPL compliant). Some things that I think we can do are 1) Submit a feature request for a flag to be added to ld/libbfd that would optionally disable the optimization in future versions of ld, 2) Add a check for the version of ld to our configure script to tell users if their linker doesn't support our privatization techniques (and make our modified version available in a separate repository?), 3) See if ld.gold, which doesn't use libbfd, exhibits the same behavior, and 4) Discuss Charm++/AMPI licensing. Those are just my initial thoughts, we'll need to discuss this more.

#24 Updated by Phil Miller over 1 year ago

Since libbfd or a modified linker are not part of a binary linked with the runtime/application code themself, the licensing concern isn't a problem. You'll just need to comply with the GPL with respect to that library/program.

#25 Updated by Sam White over 1 year ago

  • Assignee changed from Sam White to Evan Ramos
  • Subject changed from -swapglobals is broken for current gcc/ld to ampicc -swapglobals is broken for ld v2.23+

Ah right, thanks!

#26 Updated by Evan Ramos over 1 year ago

  • Subject changed from ampicc -swapglobals is broken for ld v2.23+ to ampicc -swapglobals is broken for ld v2.24+

Upon closer inspection, the change in question did not make its way into a release until 2.24.

#27 Updated by Evan Ramos over 1 year ago

I have created versions of Phil's patch for all affected releases of ld, and also expanded it to cover i386.

I tested with gold (by adding `-fuse-ld=gold`) and swapglobals does not work with it.

#28 Updated by Sam White over 1 year ago

In Core we decided that we should do the following for this:
1. Add a configure check for the version of ld, if the version is 2.24+ and the user tries to use -swapglobals, print an error message that directs them to a repo that contains our patched versions of binutils as well as directions on how to download and install the right one. We could potentially have our configure script download, patch, and build the right version of binutils. We didn't think that was worth all the work, though a Charmworks intern could do it.
2. Email the binutils maintainers to see if they would be willing to conditionalize the optimization that our patches disable.

#29 Updated by Evan Ramos over 1 year ago

It looks like swapglobals outright crashes with more up-to-date systems:

Reading symbols from ./swapglobals...done.
(gdb) r
Starting program: /home/evan/Charmworks/charm/netlrts-linux-x86_64/tests/ampi/privatization/swapglobals
Charm++: standalone mode (not using charmrun)
Charm++> Running in non-SMP mode: numPes 1
Converse/Charm++ Commit ID: v6.8.2-377-gaa719d38d
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
WARNING: Running swapglobals without blacklist, globals from libraries might be unnecessarily swapped.

Program received signal SIGSEGV, Segmentation fault.
0x0000555555857362 in CtgGlobalList::CtgGlobalList (this=0x555555bed010) at global-elfgot.C:337
337             if ((void *)*gGot != (void *)symt[symindx].st_value)
(gdb) bt
#0  0x0000555555857362 in CtgGlobalList::CtgGlobalList (this=0x555555bed010) at global-elfgot.C:337
#1  0x0000555555857575 in CtgInit () at global-elfgot.C:436
#2  0x00005555556f5ae8 in TCharm::procInit () at tcharm.C:93
#3  0x00005555557789f7 in InitCallTable::enumerateInitCalls (this=0x555555bb4040 <_initCallTable>) at init.C:1030
#4  0x0000555555779157 in _initCharm (unused_argc=1, argv=0x7fffffffe488) at init.C:1282
#5  0x000055555585d8b1 in ConverseRunPE (everReturn=0) at machine-common-core.c:1441
#6  0x000055555585d7a4 in ConverseInit (argc=1, argv=0x7fffffffe488, fn=0x555555778b61 <_initCharm(int, char**)>, usched=0, initret=0) at machine-common-core.c:1337
#7  0x0000555555771e42 in main (argc=1, argv=0x7fffffffe488) at main.C:9
(gdb) p gGot
$1 = (ELFXX_TYPE_Addr *) 0x64cfa8
(gdb) p symindx
$2 = 183
(gdb) p *gGot
Cannot access memory at address 0x64cfa8
(gdb) p symt[symindx]
$3 = {st_name = 634484, st_info = 32 ' ', st_other = 0 '\000', st_shndx = 0, st_value = 0, st_size = 0}
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 17.10
Release:        17.10
Codename:       artful
$ ld --version
GNU ld (GNU Binutils for Ubuntu) 2.29.1

$ gcc --version
gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0

This happens regardless of whether libbfd has been patched or not.

#30 Updated by Evan Ramos over 1 year ago

  • Target version deleted (6.9.0)

Removing v6.9 target as per Core meeting. The crash on new systems has been resolved but swapglobals does not function even with the libbfd patch. Considering the limitations and performance overhead of swapglobals, we have decided to de-prioritize it.

#31 Updated by Evan Ramos over 1 year ago

Patches and documentation for them uploaded here: https://charm.cs.illinois.edu/gerrit/gitweb?p=libbfd-patches.git;a=summary

#32 Updated by Sam White 8 months ago

  • Priority changed from Normal to Low

We could maybe mark this "Closed" or something. It's an approach we aren't planning on investing more effort in moving forward.

#33 Updated by Evan Ramos 8 months ago

  • Status changed from In Progress to Closed

Also available in: Atom PDF