This bug-report is in fact a wishlist for an optimization described in GCC manual, yet not implemented unfortunately (at least not for x86). About the "internal" visibility of a symbol, the manual states:

"By indicating that a symbol cannot be called from outside the module, GCC may for instance omit the load of a PIC register since it is known that the calling function loaded the correct value."

This is a great idea, since loading the GOT register on x86 is a costly operation. Even with "-march=pentium3" (it prevents the return address predictor of the processor from going into lalaland because of the load), the PIC version of the testcase still runs twice as slow.

Although g() has already loaded the GOT address in the %ebx callee-save register, f() will load it once again in %ecx. The optimization described for the "internal" visibility would prevent such a reload, since GCC would have complete control over the callers. I did not see anything in the psABI that would disallow such an optimization. Hence this wishlist.

This was tested with GCC 4.0.2 and compiled by "gcc -O -fPIC" (or -fpic).

Testcase:

extern int a;
__attribute__((visibility("internal"))) void f(void) { ++a; }
void g(void) { a = 0; f(); }

Excerpt from the generated assembly code:

080483c9 <g>:
 80483c9:  55                    push   %ebp
 80483ca:  89 e5                 mov    %esp,%ebp
 80483cc:  53                    push   %ebx
 80483cd:  e8 00 00 00 00        call   80483d2 <g+0x9>        \
 80483d2:  5b                    pop    %ebx                    | first load
 80483d3:  81 c3 2e 12 00 00     add    $0x122e,%ebx           /
 80483d9:  8b 83 f8 ff ff ff     mov    0xfffffff8(%ebx),%eax
 80483df:  c7 00 00 00 00 00     movl   $0x0,(%eax)
 80483e5:  e8 c6 ff ff ff        call   80483b0 <f>
 ...

080483b0 <f>:
 80483b0:  55                    push   %ebp
 80483b1:  89 e5                 mov    %esp,%ebp
 80483b3:  e8 00 00 00 00        call   80483b8 <f+0x8>        \
 80483b8:  59                    pop    %ecx                    | second load
 80483b9:  81 c1 48 12 00 00     add    $0x1248,%ecx           /
 80483bf:  8b 81 f8 ff ff ff     mov    0xfffffff8(%ecx),%eax
 80483c5:  ff 00                 incl   (%eax)
 80483c7:  5d                    pop    %ebp
 80483c8:  c3                    ret

Note: it is impossible to specify both the "internal" visibility and the "static" qualifier (GCC complains). And using only "static" does not help here either.

$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,java,f95,objc,ada,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --enable-nls --without-included-gettext --enable-threads=posix --program-suffix=-4.0 --enable-__cxa_atexit --enable-libstdcxx-allocator=mt --enable-clocale=gnu --enable-libstdcxx-debug --enable-java-gc=boehm --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-4.0-1.4.2.0/jre --enable-mpfr --disable-werror --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.0.2 20050821 (prerelease) (Debian 4.0.1-6)
Created attachment 19740: Test case for the internal function enhancement
This enhancement is still pending in GCC 4.4.3 on x86. A function declared internal sets up the GOT pointer even though this could be avoided in a callee-set-GOT model such as the x86 ABI. By definition, an internal function can only be reached through functions of the same module; thus, if the ABI forces the GOT pointer to be in ebx (which I think is the case), it does not have to be rematerialized. See also the original description.

Compile the attached internal.c file with:

$ gcc -O3 -S -fpic internal.c

Check the .s file:

        .type   f, @function
f:
        call    __i686.get_pc_thunk.cx
        addl    $_GLOBAL_OFFSET_TABLE_, %ecx

The GOT pointer is materialized in ecx although it is guaranteed to be available in ebx, provided that f is actually internal.
The x86 ABI doesn't give any such guarantee. Consider a function foo exported from the library which just calls this static or visibility ("internal") function bar (but neither calls any function through the PLT nor uses any global variable). When foo is called from main, %ebx will contain garbage; when it is called from a function inside some other shared library, %ebx will contain the _GLOBAL_OFFSET_TABLE_ of that other shared library. As foo doesn't call anything through the PLT, it doesn't need to compute the PIC register in %ebx (and, as it doesn't use any global variable, it doesn't need to compute it at all, not even in some other register). It then calls bar, which would assume that %ebx contains the right value of the PIC register for the shared library in question.

This optimization would only be possible in one of two ways. Either some whole-file (or LTO) analysis has been performed and has detected that some static (or hidden, but without its address taken) function is only called from functions that are already known to compute the PIC register in %ebx. Or there is a mode in which exported functions always set it up in case they might directly or indirectly call such functions; those callees could then have the setup optimized away.

Unfortunately, whether %ebx (or any kind of PIC pointer) is needed is determined late during the RTL optimizations, a long time after the IPA passes that could determine this have run.
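The counterexample above can be sketched in C as follows. This is only an illustration: the names foo, bar and counter are hypothetical and do not come from the attached testcase, and whether the compiler really omits the PIC setup in foo depends on the GCC version and options (e.g. "gcc -O2 -fPIC -shared" on x86).

```c
/* Hypothetical library source; all names are illustrative only. */

/* A global with default visibility: under -fPIC on x86 it is
 * reached through the GOT, so any function touching it needs
 * the GOT address in some register. */
int counter;

/* bar is internal to the module and uses a global, so today it
 * computes the GOT address itself. The requested optimization
 * would have it rely on the caller's %ebx instead. */
__attribute__((visibility("internal"), noinline))
int bar(void)
{
    return ++counter;
}

/* foo is exported. It makes no PLT call (bar is bound locally)
 * and uses no global itself, so the compiler has no reason to
 * compute the GOT address here. On entry %ebx holds whatever
 * foo's caller left in it: garbage when called from main, or
 * the GOT of another shared library. */
int foo(void)
{
    return bar();
}
```

If bar assumed a valid %ebx, a call to foo from outside the library would make bar access the wrong address, which is why the assumption is only safe when every caller of bar is known to set up %ebx.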
Thanks for the detailed reply. I fully agree with your points:

- First, it is indeed a matter of choice in the ABI (or compiler). The assumption would be that a function that calls another function of the same module (or a function of an undetermined module) must set the GOT pointer whenever its own visibility is hidden or more visible. In the code that I am optimizing, it is generally better to have the parent function set the GOT pointer, and most of the time it is set anyway. Hence I simply observe that this choice, which is made on some architectures, is a good trade-off. It is thus indeed a request for enhancement of the ABI/compiler pair.

- Second, it will be a good motivating case in the context of interprocedural analysis. In this case it can be considered a request for enhancement in the interprocedural analysis framework.
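As a rough illustration of the interprocedural condition discussed in this thread, here is a toy model in C. It is not GCC code and does not use any GCC internal API: struct fn_info, its fields, and can_reuse_caller_pic are all hypothetical names invented for this sketch. It only encodes the rule "a non-exported function whose address is not taken may skip the PIC setup if every known caller sets up the PIC register in %ebx".

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CALLERS 4

/* Hypothetical per-function summary; not a GCC data structure. */
struct fn_info {
    const char *name;
    bool exported;          /* callable from outside the module */
    bool address_taken;     /* may be called indirectly */
    bool sets_pic_in_ebx;   /* computes the GOT address in %ebx */
    size_t n_callers;
    const struct fn_info *callers[MAX_CALLERS];
};

/* Returns true if f could rely on %ebx as set by its callers.
 * Conservative: a function with no known caller is rejected. */
static bool can_reuse_caller_pic(const struct fn_info *f)
{
    if (f->exported || f->address_taken || f->n_callers == 0)
        return false;
    for (size_t i = 0; i < f->n_callers; i++) {
        if (!f->callers[i]->sets_pic_in_ebx)
            return false;
    }
    return true;
}

/* The testcase from the report: g sets %ebx, then calls f. */
static const struct fn_info g_fn = {
    .name = "g", .exported = true, .sets_pic_in_ebx = true,
};
static const struct fn_info f_fn = {
    .name = "f", .sets_pic_in_ebx = true,
    .n_callers = 1, .callers = { &g_fn },
};

/* The counterexample: foo never computes the PIC register. */
static const struct fn_info foo_fn = {
    .name = "foo", .exported = true, .sets_pic_in_ebx = false,
};
static const struct fn_info bar_fn = {
    .name = "bar", .sets_pic_in_ebx = true,
    .n_callers = 1, .callers = { &foo_fn },
};
```

With these summaries, can_reuse_caller_pic(&f_fn) holds while can_reuse_caller_pic(&bar_fn) does not. The difficulty raised in the thread remains: in GCC the sets_pic_in_ebx fact is only known late in the RTL passes, after the IPA passes have run.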
Confirmed.