This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



PIC calls without PLT, generic implementation


A recent post by Sriraman prompts me to post my -fno-plt approach sooner rather
than later; I have been working on no-PLT PIC codegen in the last few days too.
Although I'm posting a patch series, half of it is i386 backend tuning and can
go in independently.  Except for one patch, where it is noted specifically, the
patches were bootstrapped and regtested together, not separately, on x86-64.
Likewise, the improvement claimed below was obtained with GCC with all patches
applied, the only difference being the -fno-plt flag.

The approach taken here is different.  Instead of adjusting call expansion in
the back end, I force the callee address to be loaded into a pseudo at RTL
expansion time, similar to "function CSE", which is not enabled on most
targets.  The address load (which loads from the GOT) can be moved out of
loops, scheduled, or, on x86, re-fused with the indirect jump by peepholes.  On
32-bit x86, it also allows the compiler to use registers other than %ebx for
the GOT pointer (which can be a win since %ebx is callee-saved).
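
As a rough source-level analogy only (the patch works on RTL, not on C source),
the effect of loading the callee address into a pseudo resembles taking the
function's address once and calling through a local pointer.  In the sketch
below, sum_abs is a hypothetical function and abs() merely stands in for a
function defined in another shared object; an optimizing compiler may well
fold this particular C code back into direct calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical example: sum of absolute values.  abs() stands in for a
   function that would normally be reached through the PLT.  */
int
sum_abs (const int *v, size_t n)
{
  assert (v != NULL || n == 0);

  /* Analogue of the GOT load: obtain the callee's address once, outside
     the loop, and keep it in a local (the "pseudo").  */
  int (*callee) (int) = abs;

  int sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += callee (v[i]);	/* indirect call through the loaded address */
  return sum;
}
```

The point of the sketch is that the address load is an ordinary value that can
be placed before the loop, rather than being hidden inside each call.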

The benefit of the PLT is that it makes lazy relocation possible.  Lazy
relocation is not available with BIND_NOW, in particular when the -z relro
-z now flags are used at link time as a security hardening measure.
Performance-critical executables do not particularly need the PLT and lazy
relocation either, unless they are run very frequently and each individual
run is extremely short; but in that case they can benefit massively from
static linking, or less massively from prelinking, and with prelinking they
can get the benefit of no-PLT.

I've used LLVM/Clang to evaluate the performance impact of PLT-less PIC codegen.
I configured with
  cmake -DLLVM_ENABLE_PIC=ON -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=OFF
from the 3.6 release branch; this configuration mimics the non-static build
that e.g. OpenSUSE is using, and produces a Clang dependent on 112 clang/llvm
shared libraries, with roughly 24000 externally visible functions.

With no input files, run time is dominated by dynamic linking, so without
prelink there's a predictable regression, from 55 to 140 ms.  On C++ hello
world, I get:
            PLT   no-PLT  PLT+BIND_NOW
[32bit]  430 ms   535 ms  590 ms
[64bit]  410 ms   495 ms  555 ms

So no-PLT is >20% slower than the default (lazy PLT), but already >10% faster
than PLT when non-lazy binding is forced.
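
For reference, the percentages can be re-derived from the hello world table
above.  This is only a sketch of the arithmetic; the helper names pct_change
and speedup are made up for illustration.  With the table's numbers, no-PLT
takes about 24% (32-bit) and 21% (64-bit) longer than PLT, and runs about
1.10x and 1.12x as fast as PLT+BIND_NOW.

```c
#include <assert.h>

/* Relative change of time `t` against baseline `base`, in percent.
   A positive result means `t` is longer than the baseline.  */
double
pct_change (double base, double t)
{
  assert (base > 0.0);
  return (t - base) / base * 100.0;
}

/* Ratio of the slower time to the faster time, i.e. how many times
   faster the second configuration runs.  */
double
speedup (double t_slow, double t_fast)
{
  assert (t_fast > 0.0);
  return t_slow / t_fast;
}
```

For example, pct_change (430, 535) gives the 32-bit slowdown of no-PLT against
PLT, and speedup (590, 535) gives the 32-bit gain of no-PLT over PLT+BIND_NOW.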

On tramp3d compilation with -O2 -g I get:
            PLT   no-PLT
[32bit]  49.0 s   43.3 s
[64bit]  41.6 s   36.8 s

So on long-running compiles -fno-plt is a very significant win (roughly 12%
less time on both 32-bit and 64-bit).  Note that I'm using Clang as a (perhaps
extreme) example of PIC-call-intensive code, but the argument about -fno-plt
being useful for performance should apply generally.

Looking at code size changes, there's a 1% improvement on 32-bit libstdc++ and
a small regression on 64-bit.  On LLVM/Clang, there's an overall size
regression on both 32-bit and 64-bit; I've tried to analyze it and so far have
come up with one possible cause, which is detailed in the IRA REG_EQUIV patch.

Thanks.
Alexander

