This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [RFC][PATCH][X86_64] Eliminate PLT stubs for specified external functions via -fno-plt=
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Michael Matz <matz at suse dot de>
- Cc: Sriraman Tallam <tmsriram at google dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, David Li <davidxl at google dot com>
- Date: Sun, 10 May 2015 08:19:41 -0700
On Sat, May 9, 2015 at 9:34 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Mon, May 4, 2015 at 7:45 AM, Michael Matz <matz@suse.de> wrote:
>> Hi,
>>
>> On Thu, 30 Apr 2015, Sriraman Tallam wrote:
>>
>>> We noticed that one of our benchmarks sped up by ~1% when we eliminated
>>> PLT stubs for some of the hot external library functions like memcmp,
>>> pow. The win was from better icache and itlb performance. The main
>>> reason was that the PLT stubs had no spatial locality with the
>>> call-sites. I have started looking at ways to tell the compiler to
>>> eliminate PLT stubs (in effect, inline them) for specified external
>>> functions, for x86_64. I have a proposal and a patch and I would like to
>>> hear what you think.
>>>
>>> This comes with caveats. This cannot be generally done for all
>>> functions marked extern as it is impossible for the compiler to say if a
>>> function is "truly extern" (defined in a shared library). If a function
>>> is not truly extern (ends up defined in the final executable), then
>>> calling it indirectly is a performance penalty as it could have been a
>>> direct call.
>>
>> This can be fixed by Alan's idea.
>>
>>> Further, the newly created GOT entries are fixed up at
>>> start-up and do not get lazily bound.
>>
>> And this can be fixed by some enhancements in the linker and dynamic
>> linker. The idea is to still generate a PLT stub and make its GOT entry
>> point to it initially (like a normal got.plt slot). Then the first
>> indirect call will use the address of PLT entry (starting lazy resolution)
>> and update the GOT slot with the real address, so further indirect calls
>> will directly go to the function.
>>
>> This requires a new asm marker (and hence new reloc) as normally if
>> there's a GOT slot it's filled by the real symbol's address, unlike if
>> there's only a got.plt slot. E.g. a
>>
>> call *foo@GOTPLT(%rip)
>>
>> would generate a GOT slot (and fill its address into the above call insn), but
>> generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one.
>>
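The lazy-binding scheme described above can be illustrated with a toy model. This is only a sketch in Python, not linker or dynamic-linker code: the names GotSlot, resolve_symbol and real_foo are hypothetical, and the "symbol table" is a plain dictionary. It shows a GOT slot that initially points at a PLT-style stub, so the first indirect call triggers resolution and patches the slot, and later indirect calls go straight to the function.

```python
# Toy model of a lazily bound GOT slot; all names here are hypothetical.

resolver_calls = 0  # counts how often the simulated dynamic linker runs


def real_foo(x):
    """Stands in for a function defined in a shared library."""
    return x + 1


def resolve_symbol(name):
    """Simulated dynamic-linker lookup over an assumed symbol table."""
    global resolver_calls
    resolver_calls += 1
    return {"foo": real_foo}[name]


class GotSlot:
    """One GOT entry: holds the address an indirect call jumps to."""

    def __init__(self, name):
        self.name = name
        # Initially the slot points at the PLT stub, not the real function.
        self.target = self._plt_stub

    def _plt_stub(self, *args):
        # First call: resolve, patch the slot, then call the real function.
        self.target = resolve_symbol(self.name)
        return self.target(*args)

    def call(self, *args):
        # Models an indirect call through the slot.
        return self.target(*args)


foo_slot = GotSlot("foo")
first = foo_slot.call(1)   # goes through the stub and resolves foo
second = foo_slot.call(2)  # goes directly to real_foo
```

In this model the resolver runs exactly once, which is the property a JUMP_SLOT-style slot would preserve, in contrast to a GLOB_DAT slot that is filled in at start-up.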
>
> I added "relax" prefix support to the x86 assembler on the users/hjl/relax
> branch at
>
> https://sourceware.org/git/?p=binutils-gdb.git;a=summary
>
> [hjl@gnu-tools-1 relax-3]$ cat r.S
> .text
> relax jmp foo
> relax call foo
> relax jmp foo@plt
> relax call foo@plt
> [hjl@gnu-tools-1 relax-3]$ ./as -o r.o r.S
> [hjl@gnu-tools-1 relax-3]$ ./objdump -drw r.o
>
> r.o: file format elf64-x86-64
>
>
> Disassembly of section .text:
>
> 0000000000000000 <.text>:
> 0: 66 e9 00 00 00 00 data16 jmpq 0x6 2: R_X86_64_RELAX_PC32 foo-0x4
> 6: 66 e8 00 00 00 00 data16 callq 0xc 8: R_X86_64_RELAX_PC32 foo-0x4
> c: 66 e9 00 00 00 00 data16 jmpq 0x12 e: R_X86_64_RELAX_PLT32 foo-0x4
> 12: 66 e8 00 00 00 00 data16 callq 0x18 14: R_X86_64_RELAX_PLT32 foo-0x4
> [hjl@gnu-tools-1 relax-3]$
>
> Right now, the relax relocations are treated as PC32/PLT32 relocations.
> I am working on linker support.
>
I implemented the linker support for x86-64:
00000000 <main>:
0: 48 83 ec 08 sub $0x8,%rsp
4: e8 00 00 00 00 callq 9 <main+0x9> 5: R_X86_64_PC32 plt-0x4
9: e8 00 00 00 00 callq e <main+0xe> a: R_X86_64_PLT32 plt-0x4
e: e8 00 00 00 00 callq 13 <main+0x13> f: R_X86_64_PC32 bar-0x4
13: 66 e8 00 00 00 00 data16 callq 19 <main+0x19> 15: R_X86_64_RELAX_PC32 bar-0x4
19: 66 e8 00 00 00 00 data16 callq 1f <main+0x1f> 1b: R_X86_64_RELAX_PLT32 bar-0x4
1f: 66 e8 00 00 00 00 data16 callq 25 <main+0x25> 21: R_X86_64_RELAX_PC32 foo-0x4
25: 66 e8 00 00 00 00 data16 callq 2b <main+0x2b> 27: R_X86_64_RELAX_PLT32 foo-0x4
2b: 31 c0 xor %eax,%eax
2d: 48 83 c4 08 add $0x8,%rsp
31: c3 retq
00400460 <main>:
400460: 48 83 ec 08 sub $0x8,%rsp
400464: e8 d7 ff ff ff callq 400440 <plt@plt>
400469: e8 d2 ff ff ff callq 400440 <plt@plt>
40046e: e8 ad ff ff ff callq 400420 <bar@plt>
400473: ff 15 ff 03 20 00 callq *0x2003ff(%rip) # 600878 <_DYNAMIC+0xf8>
400479: ff 15 f9 03 20 00 callq *0x2003f9(%rip) # 600878 <_DYNAMIC+0xf8>
40047f: 66 e8 f3 00 00 00 data16 callq 400578 <foo>
400485: 66 e8 ed 00 00 00 data16 callq 400578 <foo>
40048b: 31 c0 xor %eax,%eax
40048d: 48 83 c4 08 add $0x8,%rsp
400491: c3 retq
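
The addresses in the linked listing can be cross-checked by hand: a rel32 displacement, or a RIP-relative operand, is a little-endian signed 32-bit value added to the address of the next instruction. The Python sketch below redoes that arithmetic for the seven calls in main. The helper name rel32_target is hypothetical and not part of binutils; the addresses, lengths and bytes are copied from the listing.

```python
# Recompute the addresses shown in the linked disassembly of main.

def rel32_target(insn_addr, insn_len, disp_bytes):
    """Address reached by a rel32 branch or a RIP-relative operand.

    disp_bytes holds the last four bytes of the instruction, in hex.
    """
    disp = int.from_bytes(bytes.fromhex(disp_bytes), "little", signed=True)
    return insn_addr + insn_len + disp


# (instruction address, instruction length, displacement bytes)
calls = [
    (0x400464, 5, "d7 ff ff ff"),  # direct call, plt@plt
    (0x400469, 5, "d2 ff ff ff"),  # direct call, plt@plt
    (0x40046E, 5, "ad ff ff ff"),  # direct call, bar@plt
    (0x400473, 6, "ff 03 20 00"),  # indirect call; result is the slot address
    (0x400479, 6, "f9 03 20 00"),  # indirect call; result is the slot address
    (0x40047F, 6, "f3 00 00 00"),  # data16 direct call, foo
    (0x400485, 6, "ed 00 00 00"),  # data16 direct call, foo
]

targets = [rel32_target(addr, length, disp) for addr, length, disp in calls]
for (addr, _, _), target in zip(calls, targets):
    print(f"{addr:#x} -> {target:#x}")
```

For the two ff 15 forms the computed value is the address of the slot being loaded (600878), not the final call target. Both relaxed calls to bar read the same slot, while both relaxed calls to foo reach 400578 directly.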
Sriraman, can you give it a try?
--
H.J.