Bug 23756 - Missed optimization for PIC code with internal visibility
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.0.2
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, visibility
Depends on:
Blocks: visibility
Reported: 2005-09-06 20:40 UTC by Guillaume Melquiond
Modified: 2021-09-17 00:26 UTC (History)

See Also:
Host:
Target: i?86-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-01-11 00:00:00


Attachments
Test case for the internal function enhancement (110 bytes, text/x-c)
2010-01-28 14:54 UTC, christophe.guillon

Description Guillaume Melquiond 2005-09-06 20:40:49 UTC
This bug report is in fact a wishlist for an optimization described in the GCC
manual but unfortunately not implemented (at least not for x86). About the
"internal" visibility of a symbol, the manual states: "By indicating that a
symbol cannot be called from outside the module, GCC may for instance omit the
load of a PIC register since it is known that the calling function loaded the
correct value."

This is a great idea, since loading the GOT register on x86 is a costly
operation. Even with "-march=pentium3" (it prevents the return address predictor
of the processor from going into lalaland because of the load), the PIC version
of the testcase still runs twice as slow. Although g() has already loaded the
GOT address in the %ebx callee-save register, f() will load it once again in %ecx.

The optimization described for the "internal" visibility would prevent such a
reload, since GCC would have complete control over the callers. I did not see
anything in the psABI that would disallow such an optimization. Hence this
wishlist. This was tested with GCC 4.0.2 and compiled by "gcc -O -fPIC" (or -fpic).

Testcase:

        extern int a;

        __attribute__((visibility("internal")))
        void f(void) { ++a; }

        void g(void) { a = 0; f(); }


Excerpt from the generated assembly code:

080483c9 <g>:
 80483c9:       55                      push   %ebp
 80483ca:       89 e5                   mov    %esp,%ebp
 80483cc:       53                      push   %ebx
 80483cd:       e8 00 00 00 00          call   80483d2 <g+0x9>  \
 80483d2:       5b                      pop    %ebx              | first load
 80483d3:       81 c3 2e 12 00 00       add    $0x122e,%ebx     /
 80483d9:       8b 83 f8 ff ff ff       mov    0xfffffff8(%ebx),%eax
 80483df:       c7 00 00 00 00 00       movl   $0x0,(%eax)
 80483e5:       e8 c6 ff ff ff          call   80483b0 <f>
 ...
080483b0 <f>:
 80483b0:       55                      push   %ebp
 80483b1:       89 e5                   mov    %esp,%ebp
 80483b3:       e8 00 00 00 00          call   80483b8 <f+0x8>  \
 80483b8:       59                      pop    %ecx              | second load
 80483b9:       81 c1 48 12 00 00       add    $0x1248,%ecx     /
 80483bf:       8b 81 f8 ff ff ff       mov    0xfffffff8(%ecx),%eax
 80483c5:       ff 00                   incl   (%eax)
 80483c7:       5d                      pop    %ebp
 80483c8:       c3                      ret


Note: it is impossible to specify both the "internal" visibility and the
"static" qualifier (GCC complains). And using only "static" does not help here
either.

$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v
--enable-languages=c,c++,java,f95,objc,ada,treelang --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib --enable-nls
--without-included-gettext --enable-threads=posix --program-suffix=-4.0
--enable-__cxa_atexit --enable-libstdcxx-allocator=mt --enable-clocale=gnu
--enable-libstdcxx-debug --enable-java-gc=boehm --enable-java-awt=gtk
--enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-4.0-1.4.2.0/jre
--enable-mpfr --disable-werror --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.0.2 20050821 (prerelease) (Debian 4.0.1-6)
Comment 1 christophe.guillon 2010-01-28 14:54:42 UTC
Created attachment 19740
Test case for the internal function enhancement
Comment 2 christophe.guillon 2010-01-28 14:55:35 UTC
This enhancement is still pending on gcc 4.4.3 on x86.
A function declared internal still sets up the GOT pointer, although this could be avoided in a callee-set-GOT model such as the x86 ABI.
By definition, an internal function can only be accessed through functions of the same module. Thus, if the ABI forces the GOT pointer to be in ebx (which I think is the case), it does not have to be rematerialized.

See the original description also.
Compile the attached internal.c file with:
$ gcc -O3 -S -fpic internal.c

Check the .s file:
        .type   f, @function
f:
        call    __i686.get_pc_thunk.cx
        addl    $_GLOBAL_OFFSET_TABLE_, %ecx

The GOT pointer is materialized in ecx although, f being internal, it is guaranteed to be available in ebx.
Comment 3 Jakub Jelinek 2010-01-28 15:37:20 UTC
x86 ABI doesn't have any such guarantees.
Consider a function foo exported from the library which just calls this static or visibility ("internal") function bar (but doesn't call any function through PLT nor uses any global variables).
When this function foo is called from main, %ebx will contain garbage; when it is called from a function inside some other shared library, %ebx will contain the _GLOBAL_OFFSET_TABLE_ address of that other shared library.
As foo doesn't call anything through the PLT, it doesn't need to compute the PIC register in %ebx (and, as it doesn't use any global variable, it doesn't need to compute it at all, not even in some other register).  It then calls this bar function, which would assume %ebx contains the right value of the PIC register for the shared library in question.
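
The scenario above can be sketched in C as follows. This is only an illustration: foo and bar are the names used in the comment, while the global `counter` is a hypothetical addition (defined in the same file so the sketch compiles on its own) that gives bar a reason to need the GOT address.

```c
/* Sketch of the counterexample; `counter` is a hypothetical global. */
int counter;  /* default visibility: typically reached through the GOT under -fPIC */

/* Internal helper: needs the GOT address to reach `counter`. */
__attribute__((visibility("internal")))
void bar(void) { ++counter; }

/* Exported entry point: it touches no global variable and makes no call
   through the PLT, so nothing obliges the compiler to load the GOT
   address into %ebx before the direct call to bar(). */
void foo(void) { bar(); }
```

If bar skipped its own GOT load and trusted %ebx, a call to foo from the main executable or from another shared library would leave bar with whatever value the caller happened to have in %ebx.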

This optimization would only be possible if a whole-file (or LTO) analysis had been performed and had detected that some static (or hidden, but without address taken) function is only called from functions that are already known to compute the PIC register in %ebx. Alternatively, there could be a mode in which the exported functions always set it up in case they might directly or indirectly call such functions; those callees could then have the load optimized away.  Unfortunately, whether %ebx (or any kind of PIC pointer) is needed is determined late during the RTL optimizations, long after the IPA passes that could make this determination have run.
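
The whole-file check described above can be modelled with a toy call graph. The sketch below is not GCC code and does not reflect any GCC data structure: `struct fn`, `struct edge`, and `can_omit_pic_load` are hypothetical names, and the model conservatively looks only at direct callers, assuming that a caller marked `sets_pic` still holds the PIC value in %ebx at the call site.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy description of one function in the unit (hypothetical model). */
struct fn {
    const char *name;
    bool local;          /* static, or hidden/internal visibility */
    bool address_taken;  /* may be reached through a pointer */
    bool sets_pic;       /* known to load the PIC register into %ebx */
};

/* A direct call edge, as indices into the array of struct fn. */
struct edge { size_t caller, callee; };

/* Returns true if function f could skip its own PIC-register load:
   it is local, its address is never taken, it has at least one direct
   caller, and every direct caller is known to set the PIC register. */
bool can_omit_pic_load(const struct fn *fns, const struct edge *edges,
                       size_t n_edges, size_t f)
{
    if (!fns[f].local || fns[f].address_taken)
        return false;

    bool has_caller = false;
    for (size_t i = 0; i < n_edges; i++) {
        if (edges[i].callee != f)
            continue;
        has_caller = true;
        if (!fns[edges[i].caller].sets_pic)
            return false;  /* one unsafe caller is enough to forbid it */
    }
    return has_caller;
}
```

In this model, the f of the original testcase (called only from g, which loads %ebx) would qualify, while bar called from a foo that never sets up %ebx would not. The difficulty raised in the comment remains: in the real compiler the `sets_pic` information is only known late, after the interprocedural passes have run.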
Comment 4 christophe.guillon 2010-01-28 16:20:27 UTC
Thanks for the detailed reply, I fully agree with your points:
- first, it is indeed a matter of choice in the ABI (or compiler). The assumption would be that a function that calls another function of the same module (or a function of an undetermined module) must set the GP as soon as its own visibility is hidden or more visible. What actually happens in the code that I am optimizing is that it is generally better to have the parent function set the GOT pointer, and most of the time it is set anyway. Hence I just observe that this choice, which is made on some architectures, is a good trade-off.
 Thus it is indeed a request for enhancement on the ABI/compiler pair.
- second, it will be a good motivating case in the context of interprocedural analysis.
 It can be considered in this case as a request for enhancement in the interprocedural analysis framework.
 
Comment 5 Richard Biener 2012-01-11 14:30:01 UTC
Confirmed.