Bug 100593 - [ELF] -fno-pic: Use GOT to take address of an external default visibility function
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 11.0
Importance: P3 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: visibility
Depends on:
Blocks: visibility
Reported: 2021-05-14 00:44 UTC by Fangrui Song
Modified: 2025-01-29 17:19 UTC (History)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Fangrui Song 2021-05-14 00:44:37 UTC
Most ELF targets use an absolute relocation (e.g. R_X86_64_32) to take the address of a default visibility non-definition function declaration.
The absolute relocation can cause a canonical PLT entry (st_shndx=0, st_value!=0; the term is used by a few LLD developers but is not broadly adopted).
If the defining DSO is linked with -Bsymbolic-functions (or -Bsymbolic), the address taken within the DSO and the address taken outside of it will be different.
Since C++ requires uniqueness of the address, this violates the language standard.

Outside of the GNU ELF world, many dynamic linking implementations have shifted to direct binding and non-interposition by default.
We have rants from people complaining about shared object performance.
(e.g. https://lore.kernel.org/lkml/CAHk-=whs8QZf3YnifdLv57+FhBi5_WeNTG1B-suOES=RcUSmQg@mail.gmail.com/ "Re: Very slow clang kernel config .." https://www.facebook.com/dan.colascione/posts/10107358290728348 "Python is 1.3x faster when compiled in a way that re-examines shitty technical decisions from the 1990s.")
I believe ld -Bsymbolic-functions can materialize most of the savings other implementations provide, without introducing complex things to ELF.
However, since -Bsymbolic-functions doesn't play well with -fno-pic's canonical PLT entries, we should fix -fno-pic.

Converting a direct access to a GOT access when taking a function's address is unlikely to be on a performance-critical path,
so let's just do it.
Static linking is happy, too - the linker can either optimize out the GOT (x86-64 GOTPCRELX, PPC64 TOC) or prefill the GOT entry with
a constant.

Once -fno-pic has the sane behavior (GOT by default), more and more shared objects can be optionally built with -Bsymbolic-functions -
if they don't intend to support interposition, while still being compatible with -fno-pic executables.

How effective is -Bsymbolic-functions? As a data point, my x86_64 Linux kernel defconfig build is 15% faster with a Clang that was linked with -Bsymbolic-functions.
(83% of JUMP_SLOT relocations are eliminated!)

% cat a.c
extern void fun();
void *get() { return (void*)fun; }

% gcc -fno-pic -S a.c -O2 -o -
get:
.LFB0:
        .cfi_startproc
        movl    $fun, %eax
        ret
% aarch64-linux-gnu-gcc -fno-pic -S a.c -O2 -o -
...
        adrp    x0, fun
        add     x0, x0, :lo12:fun

# good, ppc64 elfv2 always uses TOC
% powerpc64le-linux-gnu-gcc -fno-pic -S a.c -O2 -o -
...
        addis 3,2,.LC0@toc@ha
        ld 3,.LC0@toc@l(3)
Comment 1 Alexander Monakov 2021-05-17 04:57:49 UTC
It is not necessary to change -fno-pic code generation to gain most of the -Bsymbolic benefit: as you say, the most important point is to avoid jumping via PLT trampolines (or, with -fno-plt, GOT loads) for function calls, so the linker could do -Bsymbolic relaxation for sites where address doesn't matter (calls and jumps) while keeping a dynamic relocation for address loads? Under some new option of course, like -Bsymbolic-plt. Right?
Comment 2 Fangrui Song 2021-05-17 05:27:49 UTC
(In reply to Alexander Monakov from comment #1)
> It is not necessary to change -fno-pic code generation to gain most of the
> -Bsymbolic benefit

It is necessary, otherwise the function address taken from the -Bsymbolic/-Bsymbolic-functions/-Bsymbolic-global-functions shared object may be different from the address taken from the -fno-pic code.

The ELF hack is called canonical PLT entry, similar to copy relocations.

> as you say, the most important point is to avoid jumping
> via PLT trampolines (or, with -fno-plt, GOT loads) for function calls, so
> the linker could do -Bsymbolic relaxation for sites where address doesn't
> matter (calls and jumps) while keeping a dynamic relocation for address
> loads? Under some new option of course, like -Bsymbolic-plt. Right?

There are two points: (1) R_*_JUMP_SLOT symbol lookup cost (2) whether call sites get penalized by the PLT indirection.

-fno-pic code must use GOT (instead of an absolute relocation) for default visibility external function access to be compatible with a -Bsymbolic/-Bsymbolic-functions/-Bsymbolic-global-functions shared object.
Comment 3 Alexander Monakov 2021-05-17 06:13:13 UTC
I understand what you're saying, but it seems we're talking past each other.

I agree that if a library is linked with any -Bsymbolic* flag, the main executable is at risk of broken address uniqueness unless it uses GOT indirection.

I am saying that if the library was linked with a more restrictive variant of -Bsymbolic (that I called -Bsymbolic-plt), it would still get most of the benefit of -Bsymbolic, while remaining compatible with unmodified executables.

Would you agree?
Comment 4 Fangrui Song 2021-05-17 07:55:37 UTC
(In reply to Alexander Monakov from comment #3)
> I understand what you're saying, but it seems we're talking past each other.
> 
> I agree that if a library is linked with any -Bsymbolic* flag, the main
> executable is at risk of broken address uniqueness unless it uses GOT
> indirection.
> 
> I am saying that if the library was linked with a more restrictive variant
> of -Bsymbolic (that I called -Bsymbolic-plt), it would still get most of the
> benefit of -Bsymbolic, while remaining compatible with unmodified
> executables.
> 
> Would you agree?

You misunderstand this. Emitting GOT-generating relocation in -fno-pic mode is the only way to avoid canonical PLT entry, if the function turns out to be defined in a shared object. No -Bsymbolic variant can make this compatible.

Our goal is to eliminate symbol lookup for the function definition in the shared object. We must eliminate symbolic dynamic relocations, i.e. no JUMP_SLOT, no GLOB_DAT, no R_X86_64_64. The linker must set an address in the shared object and bind references to that address. In many programs (not long-running, not all code paths are exercised), the symbol lookup may cost more than the PLT indirection, given the sheer amount of symbol lookups.

Now a -fno-pic program uses an absolute/PC-relative relocation => the linker must set an address in the executable's address space as well. The traditional ELF hack (st_value!=0, st_shndx=0) achieves this and lets the shared object's symbol reference bind to the executable's definition. Note that we have explicitly eliminated symbol lookup for the defining shared object, so pointer equality cannot be satisfied at all.
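A toy model may make the argument above easier to follow. The C sketch below is not linker or ld.so code: resolve_fun, struct fun_addr, and the two placeholder addresses are made up for illustration, and it assumes nothing else interposes fun. It only encodes the rule that an absolute relocation in the executable forces a canonical PLT entry that wins global symbol lookup, while a -Bsymbolic* shared object binds its own references to its own definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder addresses (assumptions); real values depend on link and load layout. */
#define EXE_PLT_FUN UINT64_C(0x401020)       /* canonical PLT entry in the executable */
#define DSO_FUN     UINT64_C(0x7f0000001000) /* definition inside the shared object */

struct fun_addr {
    uint64_t in_exe; /* value of &fun as seen by the executable */
    uint64_t in_dso; /* value of &fun as seen inside the defining DSO */
};

/* exe_uses_got: the executable takes &fun through a GOT entry, so no
 *               canonical PLT entry is created.
 * dso_symbolic: the DSO was linked with a -Bsymbolic* option, so its own
 *               references to fun are bound locally with no symbol lookup. */
struct fun_addr resolve_fun(bool exe_uses_got, bool dso_symbolic)
{
    /* Address that a global symbol lookup of fun yields at run time. */
    uint64_t global = exe_uses_got ? DSO_FUN : EXE_PLT_FUN;

    struct fun_addr r;
    r.in_exe = global;
    r.in_dso = dso_symbolic ? DSO_FUN : global;
    return r;
}
```

In this model only the combination exe_uses_got=false, dso_symbolic=true yields two different addresses; every other combination agrees, which is the compatibility claim made above.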
Comment 5 Alexander Monakov 2021-05-17 10:07:57 UTC
Hm, I still don't think I'm misunderstanding what you're saying. I'm familiar with the ELF standard (and FWIW I have read your blog posts on related matters). I am responding to this sentiment from the opening comment:

> I believe ld -Bsymbolic-functions can materialize most of the savings other
> implementations provide, without introducing complex things to ELF.
> However, since -Bsymbolic-functions doesn't play well with -fno-pic's
> canonical PLT entries, we should fix -fno-pic.

I am saying that fixing -fno-pic is not the only possible way forward. Rather, a restricted -Bsymbolic-functions that relaxes relocations that are not address-significant still allows getting some (but not all) of the benefits for unchanged -fno-pic executables.

> You misunderstand this. Emitting GOT-generating relocation in -fno-pic mode
> is the only way to avoid canonical PLT entry, if the function turns out to
> be defined in a shared object. No -Bsymbolic variant can make this
> compatible.

Well, if you frame the goal as "eliminate canonical PLT entries", then yes, but that in itself surely is not the end goal? The end goals are reducing startup time (which my idea helps only partially since it may bind direct calls but not e.g. vtable definitions) and runtime overheads (where again my proposal is weaker but not significantly so, assuming address loads are rarely on hot paths).


To clarify once more. I am not outright rejecting the idea in your opening comment. I am saying that there potentially is a lighter-weight alternative, which may be implementable purely in the linker, and still gets most of the benefit you're promoting (like in your Clang example). Which is nice, because it can be rolled out sooner, individual libraries/distros/users can opt-in and experiment as they like, etc.
Comment 6 Fangrui Song 2021-05-17 18:38:13 UTC
(In reply to Alexander Monakov from comment #5)
> Hm, I still don't think I'm misunderstanding what you're saying. I'm
> familiar with the ELF standard (and FWIW I have read your blog posts on
> related matters). I am responding to this sentiment from the opening comment:
> 
> > I believe ld -Bsymbolic-functions can materialize most of the savings other
> > implementations provide, without introducing complex things to ELF.
> > However, since -Bsymbolic-functions doesn't play well with -fno-pic's
> > canonical PLT entries, we should fix -fno-pic.
> 
> I am saying that fixing -fno-pic is not the only possible way forward.
> Rather, a restricted -Bsymbolic-functions that relaxes relocations that are
> not address-significant still allows getting some (but not all) of the
> benefits for unchanged -fno-pic executables.

You are right. A pure linker approach is possible. However, I think the
approach is inelegant, because the linker would have different preemptibility ideas on
different relocation types and (as you said) indirect calls like vtable definitions
are not optimized.

Let's say the proposed linker option for shared objects is -Bsymbolic-plt.
The discussion below focuses on default visibility definitions which would otherwise be preemptible.

Let's categorize the relocation types first.

PLT-generating: R_X86_64_PLT32
GOT-generating: R_X86_64_GOTPCREL, R_X86_64_GOTPCRELX, R_X86_64_REX_GOTPCRELX
absolute (symbolic): R_X86_64_64

There are three choices.

(a) If all relocation types are PLT-generating, bind branch targets directly and suppress the PLT entry.
If GOT-generating/absolute relocations are present, don't change behaviors.
This choice is less effective for some otherwise address-insignificant functions, e.g. non-vague-linkage virtual functions.

(b) If all relocation types are R_X86_64_PLT32 or GOT-generating, bind branch targets directly and suppress the PLT entry.
If GOT-generating relocations are present, produce a GOT entry and an associated R_X86_64_GLOB_DAT.
If absolute relocations are present, don't change behaviors.

(c) Always bind branch targets directly and suppress the PLT entry.
If GOT-generating relocations are present, produce a GOT entry and an associated R_X86_64_GLOB_DAT.
If absolute relocations are present, produce outstanding dynamic relocations of the same type.
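The three choices can be written down as a small decision table. The following C sketch is only a model of the hypothetical -Bsymbolic-plt discussed here; struct refs, struct outcome, and decide are illustrative names, not part of any linker. It takes the kinds of relocations that reference one default-visibility function definition and reports what each choice would leave in the shared object, treating "don't change behaviors" as keeping the usual PLT entry and JUMP_SLOT.

```c
#include <assert.h>
#include <stdbool.h>

/* Kinds of relocations that reference one default-visibility function. */
struct refs {
    bool plt; /* PLT-generating, e.g. R_X86_64_PLT32 */
    bool got; /* GOT-generating, e.g. R_X86_64_GOTPCREL[X] */
    bool abs; /* absolute (symbolic), e.g. R_X86_64_64 */
};

/* What the shared object would end up with for that function. */
struct outcome {
    bool bind_direct; /* branches bound locally, PLT entry suppressed */
    bool jump_slot;   /* PLT entry kept with an R_X86_64_JUMP_SLOT */
    bool glob_dat;    /* GOT entry with an R_X86_64_GLOB_DAT */
    bool abs_dynrel;  /* dynamic relocation of the absolute type */
};

struct outcome decide(char choice, struct refs r)
{
    assert(choice == 'a' || choice == 'b' || choice == 'c');

    bool eligible = true;            /* (c): always bind branches */
    if (choice == 'a')
        eligible = !r.got && !r.abs; /* (a): only PLT-generating references */
    else if (choice == 'b')
        eligible = !r.abs;           /* (b): PLT32 or GOT-generating references */

    struct outcome o;
    o.bind_direct = eligible && r.plt;
    o.jump_slot = r.plt && !o.bind_direct;
    o.glob_dat = r.got;   /* a GOT reference keeps its symbolic GOT entry */
    o.abs_dynrel = r.abs; /* an absolute reference keeps its dynamic relocation */
    return o;
}
```

For a function referenced by both a PLT32 and a GOTPCREL relocation, (a) keeps the JUMP_SLOT, while (b) and (c) bind the branch directly and leave only the GLOB_DAT.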


> > You misunderstand this. Emitting GOT-generating relocation in -fno-pic mode
> > is the only way to avoid canonical PLT entry, if the function turns out to
> > be defined in a shared object. No -Bsymbolic variant can make this
> > compatible.
> 
> Well, if you frame the goal as "eliminate canonical PLT entries", then yes,
> but that in itself surely is not the end goal? The end goals are reducing
> startup time (which my idea helps only partially since it may bind direct
> calls but not e.g. vtable definitions) and runtime overheads (where again my
> proposal is weaker but not significantly so, assuming address loads are
> rarely on hot paths).

Yes, the end goal is to reduce startup time and bind call targets directly if feasible.
Yes, -Bsymbolic-plt can help the goal partially.

> 
> To clarify once more. I am not outright rejecting the idea in your opening
> comment. I am saying that there potentially is a lighter-weight alternative,
> which may be implementable purely in the linker, and still gets most of the
> benefit you're promoting (like in your Clang example). Which is nice,
> because it can be rolled out sooner, individual libraries/distros/users can
> opt-in and experiment as they like, etc.

Such a -Bsymbolic-plt can achieve some goals.
But given that the function pointer equality problems are usually benign (-fno-pic is relatively uncommon in many areas; making use of such pointer equality is not a common practice),
I'd hope we just don't add that intermediate linker option.
Comment 7 Alexander Monakov 2021-05-18 09:16:56 UTC
Thanks. I agree that inferring address significance on the linker side is problematic.

Thinking about your original request, I was about to say that it would be very reasonable to do under -fno-plt flag, but then I found it was already implemented for x86-64 in gcc-7 and for 32-bit x86 in gcc-8. Compiling

int f();
void *g()
{
  return f;
}

with -fno-pic -fno-plt yields

g:
        movq    f@GOTPCREL(%rip), %rax
        ret

(yields GOTPCRELX relocation) and

g:
        movl    f@GOT, %eax
        ret

on 32-bit (yields GOT32X relocation), so on x86 it's already implemented?
Comment 8 Fangrui Song 2021-05-18 17:47:44 UTC
Seems that -fno-plt -fno-pic does have the required properties.
A side effect is that all external calls use the call *f@GOTPCREL(%rip) (x86-64) or call *f@GOT (x86-32) form.

The instruction is one byte longer. (Calling a function is a common case. Taking the address in a non-vtable case is uncommon. So I'd rather punish the uncommon address taking).
When the linker notices that the branch target is defined in the executable, it can optimize out the GOT to use an addr32 prefix instead.
(gold and ld.lld haven't implemented the optimization for 32-bit)

__attribute__((noplt))
int f();
void h() {}

void *g()
{
  h();       // call h
  f();       // call *f@GOTPCREL(%rip)
  return f;  // movq f@GOTPCREL(%rip), %rax
}
Comment 9 Fangrui Song 2021-05-26 22:15:31 UTC
I have a patch to implement this in Clang.

It'd be good to have a name even if GCC wants to postpone the implementation for now. How about -fdirect-access-external-function and -fno-direct-access-external-function? It is similar to the feature request for -fdirect-access-external-data.
Comment 10 Alexander Monakov 2021-05-27 07:30:30 UTC
Is there something wrong or undesirable with making this under -fno-plt (or the noplt attribute as in your example)?

(after all, it is a kind of PLT-avoidance transformation, just for addressing rather than direct calling/jumping)
Comment 11 Fangrui Song 2021-06-04 18:05:49 UTC
(In reply to Alexander Monakov from comment #10)
> Is there something wrong or undesirable with making this under -fno-plt (or
> the noplt attribute as in your example)?
> 
> (after all, it is a kind of PLT-avoidance transformation, just for
> addressing rather than direct calling/jumping)

-fno-plt is generally undesired due to longer branch instructions and the performance loss when the branch target is defined in the exe/so and the linker is gold/ld.lld (they cannot optimize jmp *got to jmp target).

For non-x86, -fno-plt doesn't exist at all. If implemented, it would require many more instructions, which is certainly undesirable.

So -fno-plt can never be a default.

Using GOT to take the address of an external function in -fno-pic is just a better default. I want this to become the default behavior, so it should not be under -fno-plt.
Comment 12 H.J. Lu 2021-06-06 15:01:40 UTC
We should handle it in the whole Linux software stack:

https://gitlab.com/x86-psABIs/x86-64-ABI/-/issues/8

not just in compiler.
Comment 13 Fangrui Song 2021-06-06 17:36:25 UTC
(In reply to H.J. Lu from comment #12)
> We should handle it in the whole Linux software stack:
> 
> https://gitlab.com/x86-psABIs/x86-64-ABI/-/issues/8
> 
> not just in compiler.

It is great that you have the desire to fix these fundamental issues :)

I think a GNU_PROPERTY marker is over-engineering. See https://gitlab.com/x86-psABIs/x86-64-ABI/-/issues/8 for details. Many things (including this and PR98112) can be changed today. Once -fno-direct-access-external-data/-fno-direct-access-external-function becomes the prevailing -fno-pic default, make ld warn by default about R_*_COPY/canonical PLT entries. After a while (say one or two years), let glibc ld.so warn about R_*_COPY/canonical PLT entries.
Comment 14 GCC Commits 2022-02-09 12:39:22 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:ab0b5fbfe90168d2e470aefb19e0cf31526290bc

commit r12-7126-gab0b5fbfe90168d2e470aefb19e0cf31526290bc
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Sat Jun 19 05:12:48 2021 -0700

    x86: Add -m[no-]direct-extern-access
    
    Add -m[no-]direct-extern-access and nodirect_extern_access attribute.
    -mdirect-extern-access is the default.  With nodirect_extern_access
    attribute, GOT is always used to access undefined data and function
    symbols with nodirect_extern_access attribute, including in PIE and
    non-PIE.  With -mno-direct-extern-access:
    
    1. Always use GOT to access undefined data and function symbols,
       including in PIE and non-PIE.  These will avoid copy relocations
       in executables.  This is compatible with existing executables and
       shared libraries.
    2. In executable and shared library, bind symbols with the STV_PROTECTED
       visibility locally:
       a. The address of data symbol is the address of data body.
       b. For systems without function descriptor, the function pointer is
          the address of function body.
       c. The resulting shared libraries may be incompatible with
          executables which have copy relocations on protected symbols or
          use executable PLT entries as function addresses for protected
          functions in shared libraries.
    3. Update asm_preferred_eh_data_format to select PC relative EH encoding
       format with -mno-direct-extern-access to avoid copy relocation.
    4. Add ix86_reloc_rw_mask for TARGET_ASM_RELOC_RW_MASK to avoid copy
       relocation with -mno-direct-extern-access.
    
    gcc/
    
            PR target/35513
            PR target/100593
            * config/i386/gnu-property.cc: Include "i386-protos.h".
            (file_end_indicate_exec_stack_and_gnu_property): Generate
            a GNU_PROPERTY_1_NEEDED note for -mno-direct-extern-access or
            nodirect_extern_access attribute.
            * config/i386/i386-options.cc
            (handle_nodirect_extern_access_attribute): New function.
            (ix86_attribute_table): Add nodirect_extern_access attribute.
            * config/i386/i386-protos.h (ix86_force_load_from_GOT_p): Add a
            bool argument.
            (ix86_has_no_direct_extern_access): New.
            * config/i386/i386.cc (ix86_has_no_direct_extern_access): New.
            (ix86_force_load_from_GOT_p): Add a bool argument to indicate
            call operand.  Force non-call load from GOT for
            -mno-direct-extern-access or nodirect_extern_access attribute.
            (legitimate_pic_address_disp_p): Avoid copy relocation in PIE
            for -mno-direct-extern-access or nodirect_extern_access attribute.
            (ix86_print_operand): Pass true to ix86_force_load_from_GOT_p
            for call operand.
            (asm_preferred_eh_data_format): Use PC-relative format for
            -mno-direct-extern-access to avoid copy relocation.  Check
            ptr_mode instead of TARGET_64BIT when selecting DW_EH_PE_sdata4.
            (ix86_binds_local_p): Set ix86_has_no_direct_extern_access to
            true for -mno-direct-extern-access or nodirect_extern_access
            attribute.  Don't treat protected data as extern and avoid copy
            relocation on common symbol with -mno-direct-extern-access or
            nodirect_extern_access attribute.
            (ix86_reloc_rw_mask): New to avoid copy relocation for
            -mno-direct-extern-access.
            (TARGET_ASM_RELOC_RW_MASK): New.
            * config/i386/i386.opt: Add -mdirect-extern-access.
            * doc/extend.texi: Document nodirect_extern_access attribute.
            * doc/invoke.texi: Document -m[no-]direct-extern-access.
    
    gcc/testsuite/
    
            PR target/35513
            PR target/100593
            * g++.target/i386/pr35513-1.C: New file.
            * g++.target/i386/pr35513-2.C: Likewise.
            * gcc.target/i386/pr35513-1a.c: Likewise.
            * gcc.target/i386/pr35513-1b.c: Likewise.
            * gcc.target/i386/pr35513-2a.c: Likewise.
            * gcc.target/i386/pr35513-2b.c: Likewise.
            * gcc.target/i386/pr35513-3a.c: Likewise.
            * gcc.target/i386/pr35513-3b.c: Likewise.
            * gcc.target/i386/pr35513-4a.c: Likewise.
            * gcc.target/i386/pr35513-4b.c: Likewise.
            * gcc.target/i386/pr35513-5a.c: Likewise.
            * gcc.target/i386/pr35513-5b.c: Likewise.
            * gcc.target/i386/pr35513-6a.c: Likewise.
            * gcc.target/i386/pr35513-6b.c: Likewise.
            * gcc.target/i386/pr35513-7a.c: Likewise.
            * gcc.target/i386/pr35513-7b.c: Likewise.
            * gcc.target/i386/pr35513-8.c: Likewise.
            * gcc.target/i386/pr35513-9a.c: Likewise.
            * gcc.target/i386/pr35513-9b.c: Likewise.
            * gcc.target/i386/pr35513-10a.c: Likewise.
            * gcc.target/i386/pr35513-10b.c: Likewise.
            * gcc.target/i386/pr35513-11a.c: Likewise.
            * gcc.target/i386/pr35513-11b.c: Likewise.
            * gcc.target/i386/pr35513-12a.c: Likewise.
            * gcc.target/i386/pr35513-12b.c: Likewise.
Comment 15 Nick Desaulniers 2022-04-29 19:39:34 UTC
Any chance we could get -mdirect-extern-access implemented for aarch64?  Otherwise we're discussing the use of `#pragma GCC visibility push(hidden)` for use in the linux kernel since it's slightly more portable at the moment.

https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-3-ardb@kernel.org/
Comment 16 Andrew Pinski 2023-01-04 18:54:39 UTC
Fixed for GCC 12.