When generating x86 position-independent code, GCC permanently reserves EBX as the GOT register. Even in functions that make no use of global data, EBX cannot be used as a general-purpose register. This both slows down code that's under register pressure and forces inline asm that needs an argument in EBX (e.g. syscalls) to use ugly temp register shuffling to make gcc happy.
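To make the inline asm complaint concrete, here is a minimal sketch of the kind of shuffling meant above. It assumes 32-bit x86 Linux and the `int $0x80` syscall entry. The name `demo_syscall1` and the choice of EDX as the temporary are illustrative, not taken from any particular library. On other targets the sketch falls back to the libc `syscall()` wrapper so that it still builds.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for the syscall() and getpgid() declarations */
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* demo_syscall1 is a hypothetical helper: issue a one-argument Linux syscall. */
long demo_syscall1(long n, long a1)
{
#if defined(__i386__) && defined(__linux__)
    long ret;
    /*
     * The kernel wants the first argument in EBX, but with EBX reserved as
     * the GOT register under -fPIC the asm cannot simply ask for it with a
     * "b" constraint. So the argument is passed in EDX and swapped into EBX
     * around the syscall; the second xchg restores both registers.
     * The raw return value is -errno on failure.
     */
    __asm__ __volatile__ (
        "xchg %%ebx, %%edx\n\t"
        "int $0x80\n\t"
        "xchg %%ebx, %%edx"
        : "=a"(ret)
        : "a"(n), "d"(a1)
        : "memory");
    return ret;
#else
    /* Not i386: no EBX problem to demonstrate, so use the libc wrapper
     * (which reports failure as -1 with errno set instead). */
    return syscall(n, a1);
#endif
}
```

If EBX were spillable, the asm could request the argument in EBX directly and both `xchg` instructions would disappear.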
My proposal (which I realize may be difficult to implement, but is still worth stating) is that the GOT register, EBX, should be treated as spillable like any other register. In particular, the following consequences should result:
- If a function is not using the GOT (not accessing global or file-local static symbols or making non-hidden function calls), all GP registers can be used just like in non-PIC code. A pure function with no
- If a function is only using a "GOT register" for PC-relative data access, it should not go to the trouble of actually adjusting the PC obtained to point to the GOT. Instead it should generate addressing relative to the PC address that gets loaded into the register.
- In a function that's not making calls through the PLT (i.e. a leaf function or a function that only calls hidden/protected functions), the "GOT register" need not be EBX. Any register could be used, and in fact in some trivial functions, using a call-clobbered register would avoid having to save/restore EBX on the stack.
- In any function where EBX or any other register is being used to store the GOT address, it should be spillable (either pushed to stack, or simply discarded and reloaded with the standard load sequence when it's needed again later) just like a register caching any other data, so that under register pressure or inline asm constraints, the register becomes temporarily available for another use.
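As a rough illustration of the cases in the list above, the following sketch shows three small functions that differ in what they need from the GOT when compiled with -fPIC for i386. All names (`mix`, `sbox`, `lookup`, `measure`) are made up for this example, and the comments describe the code generation the proposal asks for, not what any given GCC version actually emits.

```c
#include <stddef.h>
#include <string.h>

/* File-local data: reachable relative to the PC (or the GOT base),
 * with no GOT entry of its own. */
static unsigned sbox[4] = { 0x11, 0x22, 0x33, 0x44 };

/* Touches no global data and makes no calls, so it needs no GOT pointer
 * at all; every general-purpose register, EBX included, could be free. */
unsigned mix(unsigned a, unsigned b, unsigned c)
{
    return (a ^ b) + (b << 1) - c;
}

/* Leaf function that only reads file-local data. Any register could hold
 * the PC-derived base address, including a call-clobbered one, which
 * would avoid saving and restoring EBX. */
unsigned lookup(unsigned i)
{
    return sbox[i & 3];
}

/* Calls an external function. If that call goes through the PLT, EBX
 * must hold the GOT address at the call instruction; under the proposal
 * this would be the only place where EBX is special. (A compiler may
 * also expand strlen inline, in which case no PLT call happens.) */
size_t measure(const char *s)
{
    return strlen(s);
}
```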
It seems like all of these very positive consequences would fall out of just treating GOT and GOT-relative addressing as address expressions based on the GOT address, which could be cached in registers just like any other expression, instead of hard-coding the GOT register as a special reserved register. The only remaining special-case/hard-coding would be treating the need for EBX to contain the GOT address when making calls through the PLT as an extra constraint of the function call ABI.
By the way, the code that inspired this report is crypt_blowfish.c and the corresponding asm by Solar Designer. We've been experimenting with performance characteristics while integrating it into musl libc, and I found that the C code is just as fast as the hand-optimized asm on the machine I was testing it on when using static libraries without -fPIC, but takes over 30% more runtime when built with -fPIC due to running out of registers.
I think the GOT is introduced too late to do any fancy analysis on whether
we need it or not. I also think that for outgoing function calls the ABI
relies on a properly set up GOT, even for those that bind locally and thus
do not go through the PLT.
> I think the GOT is introduced too late to do any fancy analysis
> on whether we need it or not.
This may be true, but if so, it's a highly suboptimal design that's hurting performance badly. It costs 30% on the cryptographic code I looked at, and from working on FFmpeg in the past, I remember quite a few cases where PIC hurt performance by similarly significant, measurable amounts. If there's any way the changes I describe could be targeted, even just in the long term, I think it would make a big difference for a lot of software.
> I also think that for outgoing function calls the ABI
> relies on a properly set up GOT, even for those that bind
> locally and thus do not go through the PLT.
The extern function call ABI on x86 does not allow the caller to depend on EBX containing the GOT address. This is because the callee has no way of knowing whether it was called by the same DSO it resides in. If not, the GOT address will be invalid for it.
For static functions whose addresses never leak out of the translation unit they're defined in, the calling convention is up to GCC. Ideally it would assume the GOT register is already loaded in such functions (as long as all the callers use the GOT), but in reality it rarely does. This is a separate code generation QoI issue that should perhaps be addressed as its own bug.
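The distinction between the two kinds of static function can be sketched as follows. The names (`counter`, `bump`, `reset`, `get_hook`, `bump_twice`) are hypothetical, and the comments describe what the compiler would be allowed to assume, not what it is known to do.

```c
/* File-local state, so every function touching it needs a GOT-relative
 * or PC-relative base under -fPIC on i386. */
static int counter;

/* Only ever called directly from this file and its address is never
 * taken, so the compiler is free to choose its calling convention,
 * for instance expecting the caller to have the GOT base loaded already. */
static int bump(int step)
{
    counter += step;
    return counter;
}

/* The address of this function escapes through get_hook(), so code in
 * another DSO may call it. It has to follow the normal ABI and cannot
 * assume anything about EBX on entry. */
static int reset(int value)
{
    counter = value;
    return counter;
}

typedef int (*hook_fn)(int);

hook_fn get_hook(void)
{
    return reset;
}

/* Extern entry point: callable from any DSO, so it must load the GOT
 * base itself; it could then share that base with both bump() calls. */
int bump_twice(int step)
{
    bump(step);
    return bump(step);
}
```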
Fixed in 5.0