This is the mail archive of the
mailing list for the GCC project.
PIC calls without PLT, generic implementation
- From: Alexander Monakov <amonakov at ispras dot ru>
- To: gcc-patches at gcc dot gnu dot org
- Cc: Alexander Monakov <amonakov at ispras dot ru>, Rich Felker <dalias at libc dot org>, Sriraman Tallam <tmsriram at google dot com>
- Date: Mon, 4 May 2015 19:37:53 +0300
- Subject: PIC calls without PLT, generic implementation
A recent post by Sriraman prompts me to post my -fno-plt approach sooner rather
than later; I have been working on no-PLT PIC codegen in the last few days too.
Although I'm posting a patch series, half of it is i386 backend tuning and can
go in independently. Except for one patch, where it is noted specifically, the
patches were bootstrapped and regtested together, not separately, on x86-64.
Likewise, the improvement claimed below was obtained with GCC with all patches
applied, the only difference being the -fno-plt flag.
The approach taken here is different. Instead of adjusting call expansion in
the back end, I force the callee address to be loaded into a pseudo at RTL
expansion time, similar to "function CSE", which is not enabled on most
targets. The address load (which loads from the GOT) can be moved out of loops,
scheduled, or, on x86, re-fused with the indirect jump by peepholes. On 32-bit
x86, it also allows the compiler to use registers other than %ebx for the GOT
pointer (which can be a win since %ebx is callee-saved).
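As a rough source-level analogy (not what GCC actually emits, and not part of
the patches), the effect of loading the callee address into a pseudo can be
pictured with an explicit function pointer in C. The function external_work
below is a hypothetical stand-in for a routine that would live in another
shared library; the local pointer plays the role of the pseudo that receives
the GOT load once, outside the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a function defined in another shared library;
   in a real PIC build its address would be resolved through the GOT. */
int external_work(int x)
{
    return x * 2 + 1;
}

/* Models a conventional call site: the callee is named at every call, so
   under -fPIC each iteration would normally go through the PLT stub. */
long sum_direct(const int *v, size_t n)
{
    long total = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += external_work(v[i]);
    return total;
}

/* Models the no-PLT expansion: the callee address is loaded once into a
   local (standing in for the pseudo register), and the loop body makes
   indirect calls through it. */
long sum_hoisted(const int *v, size_t n)
{
    int (*callee)(int) = external_work; /* stands in for the single GOT load */
    long total = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += callee(v[i]);
    return total;
}
```

Both functions compute the same result; the second merely makes visible the
address load that the optimizers are then free to hoist or schedule.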
The benefit of the PLT is the possibility of lazy relocation. Lazy relocation
is not possible with BIND_NOW, in particular when the -z relro -z now flags are
used at link time as a security hardening measure. Performance-critical
executables do not particularly need the PLT and lazy relocation either, unless
they are run very frequently with each individual run being extremely short;
but in that case they can benefit massively from static linking or less
massively from prelinking, and with prelinking they can get the benefit of
no-PLT.
I've used LLVM/Clang to evaluate the performance impact of PLT-less PIC codegen.
I configured with
    cmake -DLLVM_ENABLE_PIC=ON -DBUILD_SHARED_LIBS=ON \
from the 3.6 release branch; this configuration mimics the non-static build
that e.g. openSUSE uses, and produces a Clang that depends on 112 clang/llvm
shared libraries, with roughly 24000 externally visible functions.
With no input files, time is mostly spent on dynamic linking, so without
prelink there's a predictable regression, from 55 to 140 ms. On C++ hello
world, I get:
            PLT       no-PLT    PLT+BIND_NOW
[32bit]     430 ms    535 ms    590 ms
[64bit]     410 ms    495 ms    555 ms
So no-PLT is >20% slower than the default (lazy PLT), but already about 10%
faster than PLT when non-lazy binding is forced.
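The percentages can be rechecked from the hello-world timings with a small
helper; percent_change is a name introduced here for illustration only.
Relative to the default PLT build, no-PLT comes out about 24.4% (32-bit) and
20.7% (64-bit) slower. Relative to PLT+BIND_NOW, it comes out about 9.3% and
10.8% faster; the exact figure depends on which column is taken as the
baseline.

```c
#include <assert.h>

/* Percentage by which value_ms differs from base_ms.
   Positive means slower than the baseline, negative means faster. */
double percent_change(double base_ms, double value_ms)
{
    assert(base_ms > 0.0);
    return (value_ms - base_ms) / base_ms * 100.0;
}
```

For example, percent_change(430.0, 535.0) compares 32-bit no-PLT against the
default PLT build, and percent_change(590.0, 535.0) compares it against
PLT+BIND_NOW.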
On tramp3d compilation with -O2 -g I get:
            PLT       no-PLT
[32bit]     49.0 s    43.3 s
[64bit]     41.6 s    36.8 s
So on long-running compiles -fno-plt is a very significant win. Note that I'm
using Clang as a (perhaps extreme) example of PIC-call-intensive code, but the
argument about -fno-plt being useful for performance should apply generally.
When looking at code size changes, there's a 1% improvement on 32-bit
libstdc++ and a small regression on 64-bit. On LLVM/Clang, there's an overall
size regression on both 32-bit and 64-bit; I've tried to analyze it and so far
have come up with one possible cause, which is detailed in the IRA REG_EQUIV
patch.