[PATCH 2/5] x86: Add -mindirect-branch-loop=

Fri Jan 12 16:02:00 GMT 2018

On 01/12/2018 08:09 AM, Kumar, Venkataramanan wrote:
> Hi all, 
> 
>> -----Original Message-----
>> From: Kumar, Venkataramanan
>> Sent: Friday, January 12, 2018 8:16 PM
>> To: 'H.J. Lu' <hjl.tools@gmail.com>; Martin Jambor <mjambor@suse.cz>
>> Cc: Nagarajan, Muthu kumar raj <Muthukumarraj.Nagarajan@amd.com>;
>> GCC Patches <gcc-patches@gcc.gnu.org>; Jeff Law <law@redhat.com>; Uros
>> Bizjak (ubizjak@gmail.com) <ubizjak@gmail.com>; 'Jan Hubicka'
>> <jh@suse.de>
>> Subject: RE: [PATCH 2/5] x86: Add -mindirect-branch-loop=
>>
>> Hi all,
>>
>>> -----Original Message-----
>>> From: H.J. Lu [mailto:hjl.tools@gmail.com]
>>> Sent: Friday, January 12, 2018 7:36 PM
>>> To: Martin Jambor <mjambor@suse.cz>
>>> Cc: Nagarajan, Muthu kumar raj <Muthukumarraj.Nagarajan@amd.com>;
>>> Kumar, Venkataramanan <Venkataramanan.Kumar@amd.com>; GCC
>> Patches
>>> <gcc-patches@gcc.gnu.org>; Jeff Law <law@redhat.com>
>>> Subject: Re: [PATCH 2/5] x86: Add -mindirect-branch-loop=
>>>
>>> On Fri, Jan 12, 2018 at 4:38 AM, Martin Jambor <mjambor@suse.cz> wrote:
>>>> Hi,
>>>>
>>>> On Thu, Jan 11 2018, Jeff Law wrote:
>>>>> On 01/07/2018 03:59 PM, H.J. Lu wrote:
>>>>>> Add -mindirect-branch-loop= option to control loop filler in call
>>>>>> and return thunks generated by -mindirect-branch=.  'lfence' uses
>>> "lfence"
>>>>>> as loop filler.  'pause' uses "pause" as loop filler.  'nop' uses "nop"
>>>>>> as loop filler.  The default is 'lfence'.
>>>>>>
>>>>>> gcc/
>>>>>>
>>>>>>      * config/i386/i386-opts.h (indirect_branch_loop): New.
>>>>>>      * config/i386/i386.c (output_indirect_thunk): Support
>>>>>>      -mindirect-branch-loop=.
>>>>>>      * config/i386/i386.opt (mindirect-branch-loop=): New option.
>>>>>>      (indirect_branch_loop): New.
>>>>>>      (lfence): Likewise.
>>>>>>      (pause): Likewise.
>>>>>>      (nop): Likewise.
>>>>>>      * doc/invoke.texi: Document -mindirect-branch-loop= option.
>>>>>>
>>>>>> gcc/testsuite/
>>>>>>
>>>>>>      * gcc.target/i386/indirect-thunk-loop-1.c: New test.
>>>>>>      * gcc.target/i386/indirect-thunk-loop-2.c: Likewise.
>>>>>>      * gcc.target/i386/indirect-thunk-loop-3.c: Likewise.
>>>>>>      * gcc.target/i386/indirect-thunk-loop-4.c: Likewise.
>>>>>>      * gcc.target/i386/indirect-thunk-loop-5.c: Likewise.
>>>>> I think we should drop the ability to change the filler until such
>>>>> time as we really need it.  Just pick one and go with it.  I think
>>>>> David suggested that they wanted "pause".  I'm obviously fine with that.
>>>>>
>>>>
>>>> unless I am mistaken (which is frankly quite possible, I am still
>>>> not quite up to speed about the nuances), AMD strongly prefers the
>>>> lfence variant.  OTOH, IIUC, in kernel this will be run-time patched
>>>> but so it does not matter in the most pressing case and we might
>>>> want to have a mechanism doing something similar for protecting
>> userspace later on.
>>>> But perhaps it is enough to keep the option?
>>>>
>>>> Muthu and/or Venkat, can you please comment?
>>>
>>> If we do want it, I will submit a separate patch AFTER the current
>>> patch set has been approved and checked into GCC 8.
>>>
>>
>> As per AMD architects, using â€œlfenceâ€ in â€œretpolineâ€ is better than â€œpauseâ€ for
>> our targets.
>> So please allow filler to use "lfence".
>>
> We also leant that "lfence" is a dispatch serializing instruction.  The Pause instruction is not serializing on AMD processors and has high latencies. 
So the concern I have if we end up wanting different sequences for
different micro-architectures is you either up having to build multiple
kernels, libraries and executables or you end up doing dynamic patching
(ifuncs which are often used for this kind of dynamic selection don't
really help in this particular scenario).

It's unlikely we'll see distributors building multiple kernels,
libraries and executables within a processor family.  The kernel does
have dynamic patching, but we don't have that for anything in user space
(but I think folks will be looking at that in the relatively near future).

So it's difficult to see a scenario where having the compiler make
different codegen choices for these sequences based on the
microarchitecture is all that useful.

ISTM we want to pick a sequence that is reasonably OK, then look to
dynamic patching to micro-optimize if the benefit is significant.
Ideally going forward these issues will be largely addressed at the
hardware level with perhaps some cooperation with kernel and repolines
become a mitigation technique for legacy hardware.

Thoughts Venkat?

Jeff