102952 – New code-gen options for retpolines and straight line speculation

Bug 102952 - New code-gen options for retpolines and straight line speculation

Summary: New code-gen options for retpolines and straight line speculation

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	12.0

Importance:	P3 enhancement
Target Milestone:	12.0
Assignee:	H.J. Lu

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-10-26 15:54 UTC by Andrew Cooper
Modified:	2022-02-16 14:01 UTC (History)
CC List:	8 users (show)

See Also:	https://bugs.llvm.org/show_bug.cgi?id=52323
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2021-10-27 00:00:00

Attachments
A patch to add -mharden-sls= (1.28 KB, patch) 2021-10-27 14:50 UTC, H.J. Lu	Details \| Diff
A patch to add -mindirect-branch-cs-prefix (787 bytes, patch) 2021-10-27 15:01 UTC, H.J. Lu	Details \| Diff
The v2 patch to add -mharden-sls= (1.27 KB, patch) 2021-10-27 17:55 UTC, H.J. Lu	Details \| Diff
kernel patch to test -mharden-sls=all (1.01 KB, patch) 2021-10-27 20:14 UTC, peterz	Details \| Diff
The v2 patch to add -mharden-sls= (1.76 KB, patch) 2021-10-27 21:42 UTC, H.J. Lu	Details \| Diff
The v4 patch to add -mharden-sls= (1.86 KB, patch) 2021-10-27 22:53 UTC, H.J. Lu	Details \| Diff
A patch to rename -harden-sls=indirect-branch to -harden-sls=indirect-jmp (1.35 KB, patch) 2022-01-06 13:21 UTC, H.J. Lu	Details \| Diff
Show Obsolete (3) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Andrew Cooper 2021-10-26 15:54:02 UTC

Hello

[FYI, this is being cross-requested of Clang too]

Linux and other kernel level software makes use of -mindirect-branch=thunk-extern to be able to alter the handling of indirect branches at boot. It turns out to be advantageous to inline the thunks when retpoline is not in use. https://lore.kernel.org/lkml/20211026120132.613201817@infradead.org/ is some infrastructure to make this work.

In some cases, we want to be able to inline an `lfence; jmp *%reg` thunk. This is fine for the low 8 registers, but not fine for %r{8..15} where the REX prefix pushes the replacement size to being 6 bytes.

It would be very useful to have a code-gen option to write out `call %cs:__x86_indirect_thunk_r{8..15}` where the redundant %cs prefix will increase the instruction length to 6, allowing the non-retpoline form to be inlined.

Relatedly, x86 straight line speculation has been discussed before, but without any action taken. It would be helpful to have a code gen option which would emit `int3` following any `ret` instruction, and any indirect jump, as neither of these two cases have following architectural execution.

The reason these two are related is that if both options are in use, we want an extra byte of replacement space to be able to inline `lfence; jmp *%reg; int3`.

Third (and possibly only for future optimisations), Clang has been observed to spot conditional tail calls as `Jcc __x86_indirect_thunk_*`. This is a 6 byte source size, but needs up to 9 bytes of space for inlining including an `int3` for straight line speculation reasons (See https://lore.kernel.org/lkml/20211026120310.359986601@infradead.org/ for full details). It might be enough to simply prohibit an optimisation like this when trying to pad retpolines for inlineability.

Comment 1 Andrew Cooper 2021-10-26 15:56:11 UTC

https://bugs.llvm.org/show_bug.cgi?id=52323 for Clang cross-request.

Comment 2 Andrew Cooper 2021-10-26 17:11:46 UTC

PeterZ has suggested that the straight line speculation case can be dis-entangled with the thunk inlining case.

If an `int3` is emitted following any `jmp __x86_indirect_thunk_*` instruction (i.e. treated as an indirect jump despite retpoline), then the inlining logic need not worry about straight line speculation at all.

However, this does depend on not generating `jcc __x86_indirect_thunk_*` as inlining that would require an additional `int3` for SLS safety.

Comment 3 H.J. Lu 2021-10-27 14:50:49 UTC

Created attachment 51678 [details]
A patch to add -mharden-sls=

x86: Add -mharden-sls=[none|all|return|indirect-branch]

Generate code to mitigate against straight line speculation.

Comment 4 H.J. Lu 2021-10-27 15:01:41 UTC

Created attachment 51679 [details]
A patch to add -mindirect-branch-cs-prefix

It adds CS prefix to call and jmp to thunk when converting indirect call
and jump.

Comment 5 H.J. Lu 2021-10-27 15:03:26 UTC

(In reply to Andrew Cooper from comment #0)
> Hello
> 
> [FYI, this is being cross-requested of Clang too]
> 
> Linux and other kernel level software makes use of
> -mindirect-branch=thunk-extern to be able to alter the handling of indirect
> branches at boot.  It turns out to be advantageous to inline the thunks when
> retpoline is not in use. 
> https://lore.kernel.org/lkml/20211026120132.613201817@infradead.org/ is some
> infrastructure to make this work.
> 
> In some cases, we want to be able to inline an `lfence; jmp *%reg` thunk. 
> This is fine for the low 8 registers, but not fine for %r{8..15} where the
> REX prefix pushes the replacement size to being 6 bytes.
> 
> It would be very useful to have a code-gen option to write out `call
> %cs:__x86_indirect_thunk_r{8..15}` where the redundant %cs prefix will
> increase the instruction length to 6, allowing the non-retpoline form to be
> inlined.
> 

-mindirect-branch-cs-prefix

> Relatedly, x86 straight line speculation has been discussed before, but
> without any action taken.  It would be helpful to have a code gen option
> which would emit `int3` following any `ret` instruction, and any indirect
> jump, as neither of these two cases have following architectural execution.
> 
> The reason these two are related is that if both options are in use, we want
> an extra byte of replacement space to be able to inline `lfence; jmp *%reg;
> int3`.
> 

-mharden-sls=[none|all|return|indirect-branch]

Let me know if they work.  I also need testcases.

> Third (and possibly only for future optimisations), Clang has been observed
> to spot conditional tail calls as `Jcc __x86_indirect_thunk_*`.  This is a 6
> byte source size, but needs up to 9 bytes of space for inlining including an
> `int3` for straight line speculation reasons (See
> https://lore.kernel.org/lkml/20211026120310.359986601@infradead.org/ for
> full details).  It might be enough to simply prohibit an optimisation like
> this when trying to pad retpolines for inlineability.

I don't think GCC does that at all.

Comment 6 H.J. Lu 2021-10-27 17:55:31 UTC

Created attachment 51681 [details]
The v2 patch to add -mharden-sls=

Comment 7 peterz 2021-10-27 19:59:21 UTC

(In reply to H.J. Lu from comment #3)
> Created attachment 51678 [details]
> A patch to add -mharden-sls=
> 
> x86: Add -mharden-sls=[none|all|return|indirect-branch]
> 
> Generate code to mitigate against straight line speculation.

I'm getting (a lot) spurious int3 instructions with this, for example:

0000000000000280 <do_SYSENTER_32>:
 280:   48 81 8f 90 00 00 00 00 02 00 00        orq    $0x200,0x90(%rdi)
 28b:   48 8b 47 20             mov    0x20(%rdi),%rax
 28f:   48 89 87 98 00 00 00    mov    %rax,0x98(%rdi)
 296:   e9 75 ff ff ff          jmp    210 <do_fast_syscall_32>
 29b:   cc                      int3

That's not an *indirect* jump there.

Comment 8 H.J. Lu 2021-10-27 20:00:17 UTC

(In reply to peterz from comment #7)
> (In reply to H.J. Lu from comment #3)
> > Created attachment 51678 [details]
> > A patch to add -mharden-sls=
> > 
> > x86: Add -mharden-sls=[none|all|return|indirect-branch]
> > 
> > Generate code to mitigate against straight line speculation.
> 
> I'm getting (a lot) spurious int3 instructions with this, for example:
> 
> 0000000000000280 <do_SYSENTER_32>:
>  280:   48 81 8f 90 00 00 00 00 02 00 00        orq    $0x200,0x90(%rdi)
>  28b:   48 8b 47 20             mov    0x20(%rdi),%rax
>  28f:   48 89 87 98 00 00 00    mov    %rax,0x98(%rdi)
>  296:   e9 75 ff ff ff          jmp    210 <do_fast_syscall_32>
>  29b:   cc                      int3
> 
> That's not an *indirect* jump there.

Please provide a testcase.

Comment 9 peterz 2021-10-27 20:14:04 UTC

Created attachment 51683 [details]
kernel patch to test -mharden-sls=all

$ make O=defconfig CC=gcc-12.0.0 arch/x86/entry/common.o
...
arch/x86/entry/common.o: warning: objtool: do_SYSENTER_32()+0x1b: unreachable instruction

Comment 10 peterz 2021-10-27 20:20:30 UTC

(In reply to H.J. Lu from comment #4)
> Created attachment 51679 [details]
> A patch to add -mindirect-branch-cs-prefix
> 
> It adds CS prefix to call and jmp to thunk when converting indirect call
> and jump.

This seems to work as expected, I get CS prefix on all high reg thunk calls and no CS prefix on the low reg calls. Example:

29241:       e8 00 00 00 00          call   29246 <intel_dsc_enable.cold+0x6e5>      29242: R_X86_64_PLT32   __x86_indirect_thunk_rax-0x4
292c8:       2e e8 00 00 00 00       cs call 292ce <intel_dsc_enable.cold+0x76d>     292ca: R_X86_64_PLT32   __x86_indirect_thunk_r10-0x4

Comment 11 H.J. Lu 2021-10-27 21:42:16 UTC

Created attachment 51684 [details]
The v2 patch to add -mharden-sls=

Comment 12 H.J. Lu 2021-10-27 21:42:37 UTC

(In reply to peterz from comment #9)
> Created attachment 51683 [details]
> kernel patch to test -mharden-sls=all
> 
> $ make O=defconfig CC=gcc-12.0.0 arch/x86/entry/common.o
> ...
> arch/x86/entry/common.o: warning: objtool: do_SYSENTER_32()+0x1b:
> unreachable instruction

Please try the v2 patch.

Comment 13 peterz 2021-10-27 22:07:08 UTC

(In reply to H.J. Lu from comment #12)
> (In reply to peterz from comment #9)
> > Created attachment 51683 [details]
> > kernel patch to test -mharden-sls=all
> > 
> > $ make O=defconfig CC=gcc-12.0.0 arch/x86/entry/common.o
> > ...
> > arch/x86/entry/common.o: warning: objtool: do_SYSENTER_32()+0x1b:
> > unreachable instruction
> 
> Please try the v2 patch.

Per comment #6 this should be v3, no? Anyway, the good news is that I now seem to have a kernel image with lots of extra int3 instructions, but all in the right place.

*However*, I seem to be missing a few:

  36f4:       41 5f                   pop    %r15
  36f6:       e9 00 00 00 00          jmp    36fb <__do_set_cpus_allowed+0x5b>
        36f7: R_X86_64_PLT32    __x86_indirect_thunk_rax-0x4
  36fb:       48 8b 87 90 02 00 00    mov    0x290(%rdi),%rax

There should be one after the jmp __x86_indirect_thunk_* thingy. I'll do an objtool patch to search for missing int3, but that'll have to wait until tomorrow, it's past midnight.

Comment 14 H.J. Lu 2021-10-27 22:12:52 UTC

(In reply to peterz from comment #13)
> (In reply to H.J. Lu from comment #12)
> > (In reply to peterz from comment #9)
> > > Created attachment 51683 [details]
> > > kernel patch to test -mharden-sls=all
> > > 
> > > $ make O=defconfig CC=gcc-12.0.0 arch/x86/entry/common.o
> > > ...
> > > arch/x86/entry/common.o: warning: objtool: do_SYSENTER_32()+0x1b:
> > > unreachable instruction
> > 
> > Please try the v2 patch.
> 
> Per comment #6 this should be v3, no? Anyway, the good news is that I now
> seem to have a kernel image with lots of extra int3 instructions, but all in
> the right place.
> 
> *However*, I seem to be missing a few:
> 
>   36f4:       41 5f                   pop    %r15
>   36f6:       e9 00 00 00 00          jmp    36fb
> <__do_set_cpus_allowed+0x5b>
>         36f7: R_X86_64_PLT32    __x86_indirect_thunk_rax-0x4

This is a direct branch.

>   36fb:       48 8b 87 90 02 00 00    mov    0x290(%rdi),%rax
> 
> There should be one after the jmp __x86_indirect_thunk_* thingy. I'll do an
> objtool patch to search for missing int3, but that'll have to wait until
> tomorrow, it's past midnight.

Comment 15 Andrew Cooper 2021-10-27 22:16:57 UTC

So this is the irritating corner case where the two options are linked.

*If* we are using -mindirect-branch-cs-prefix, then we intend to rewrite `jmp __x86_indirect_thunk_*` to `jmp *%reg` or `lfence; jmp *%reg` based on boot time configuration/settings.

In this case, we still need to fit the `int3` for SLS protection in somewhere.

The two options are:
1) Special case `jmp __x86_indirect_thunk_*` as if it were an indirect jump and write out an `int3` directly, or
2) Pad one extra %cs prefix on the jmp, so we've got space to insert one at boot time.

Comment 16 H.J. Lu 2021-10-27 22:39:54 UTC

(In reply to Andrew Cooper from comment #15)
> So this is the irritating corner case where the two options are linked.
> 
> *If* we are using -mindirect-branch-cs-prefix, then we intend to rewrite
> `jmp __x86_indirect_thunk_*` to `jmp *%reg` or `lfence; jmp *%reg` based on
> boot time configuration/settings.
> 
> In this case, we still need to fit the `int3` for SLS protection in
> somewhere.
> 
> The two options are:
> 1) Special case `jmp __x86_indirect_thunk_*` as if it were an indirect jump
> and write out an `int3` directly, or

I can do this.

> 2) Pad one extra %cs prefix on the jmp, so we've got space to insert one at
> boot time.

Comment 17 H.J. Lu 2021-10-27 22:42:59 UTC

[hjl@gnu-tgl-2 pr102952]$ cat z2.i
extern void (*fptr) (int, int);

void
foo (int x, int y)
{
  fptr (x, y);
}
[hjl@gnu-tgl-2 pr102952]$ make z2.s
/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/ -O2 -mindirect-branch=thunk -mindirect-branch-cs-prefix -mharden-sls=all -S z2.i
[hjl@gnu-tgl-2 pr102952]$ cat z2.s
	.file	"z2.i"
	.text
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	movq	fptr(%rip), %rax
	jmp	__x86_indirect_thunk_rax

Is int3 needed here?

	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.section	.text.__x86_indirect_thunk_rax,"axG",@progbits,__x86_indirect_thunk_rax,comdat
	.globl	__x86_indirect_thunk_rax
	.hidden	__x86_indirect_thunk_rax
	.type	__x86_indirect_thunk_rax, @function
__x86_indirect_thunk_rax:
.LFB1:
	.cfi_startproc
	call	.LIND1
.LIND0:
	pause
	lfence
	jmp	.LIND0
.LIND1:
	.cfi_def_cfa_offset 16
	mov	%rax, (%rsp)
	ret
	int3   <<<<<<<<<<<<<<<<<<<< Is this needed?
	.cfi_endproc
.LFE1:
	.ident	"GCC: (GNU) 12.0.0 20211027 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-tgl-2 pr102952]$

Comment 18 Andrew Cooper 2021-10-27 22:46:02 UTC

Yes to both.

Comment 19 H.J. Lu 2021-10-27 22:53:24 UTC

Created attachment 51685 [details]
The v4 patch to add -mharden-sls=

Comment 20 peterz 2021-10-28 07:30:25 UTC

(In reply to H.J. Lu from comment #19)
> Created attachment 51685 [details]
> The v4 patch to add -mharden-sls=

That looks to do the right thing! Let me go write more validation stuff to double check. Thanks!

Comment 21 peterz 2021-10-28 07:43:57 UTC

(In reply to H.J. Lu from comment #19)
> Created attachment 51685 [details]
> The v4 patch to add -mharden-sls=

I seem to have found one 'funny':

kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x23: unreachable instruction

  27e6:       48 8b 14 d5 00 00 00    mov    0x0(,%rdx,8),%rdx
  27ed:       00
       27ea: R_X86_64_32S      .rodata..c_jump_table
  27ee:       e9 00 00 00 00          jmp    27f3 <___bpf_prog_run+0x23>
       27ef: R_X86_64_PLT32    __x86_indirect_thunk_rdx-0x4
  27f3:       cc                      int3
  27f4:       cc                      int3
  27f5:       8b 53 04                mov    0x4(%rbx),%edx


Other than that, it seems I need to go clean up our asm ...

Comment 22 Andrew Cooper 2021-10-28 22:07:47 UTC

One curious thing I have discovered.  While auditing the -mharden-sls=all code generation in Xen, I found examples where I got "ret int3 ret int3" with no intervening instructions.

It turns out this is not a regression in this change.  It is a pre-existing missing optimisation, which is made more obvious when every ret is extended with an int3.

It occurs for functions with either no stack frame at all, or functions which have an early exit before setting up the stack frame.  Some examples which occur at -O1 do not occur at -O2.

One curious example which does still repro at -O2 is this.  We have a hash lookup function:

struct context *sidtab_search(struct sidtab *s, u32 sid)
{
    int hvalue;
    struct sidtab_node *cur;

    if ( !s )
        return NULL;

    hvalue = SIDTAB_HASH(sid);
    cur = s->htable[hvalue];
    while ( cur != NULL && sid > cur->sid )
        cur = cur->next;

    if ( cur == NULL || sid != cur->sid )
    {
        /* Remap invalid SIDs to the unlabeled SID. */
        sid = SECINITSID_UNLABELED;
        hvalue = SIDTAB_HASH(sid);
        cur = s->htable[hvalue];
        while ( cur != NULL && sid > cur->sid )
            cur = cur->next;
        if ( !cur || sid != cur->sid )
            return NULL;
    }

    return &cur->context;
}

which compiles (reformatted a little for width - unmodified: https://paste.debian.net/hidden/7bf675d6/) to:

<sidtab_search>:
         48 85 ff             test   %rdi,%rdi
/------- 74 63                je     <sidtab_search+0x68>
|        48 8b 17             mov    (%rdi),%rdx
|        89 f0                mov    %esi,%eax
|        83 e0 7f             and    $0x7f,%eax
|        48 8b 04 c2          mov    (%rdx,%rax,8),%rax
|        48 85 c0             test   %rax,%rax
|   /--- 75 13                jne    <sidtab_search+0x29>
|  /|--- eb 17                jmp    <sidtab_search+0x2f>
|  ||    0f 1f 84 00 00 00 00 nopl   0x0(%rax,%rax,1)
|  ||    00 
|  ||/-> 48 8b 40 48          mov    0x48(%rax),%rax
|  |||   48 85 c0             test   %rax,%rax
|  +||-- 74 06                je     <sidtab_search+0x2f>
|  |\|-> 39 30                cmp    %esi,(%rax)
|  | \-- 72 f3                jb     <sidtab_search+0x20>
| /|---- 74 24                je     <sidtab_search+0x53>
| |\---> 48 8b 42 28          mov    0x28(%rdx),%rax
| |      48 85 c0             test   %rax,%rax
| | /--- 75 11                jne    <sidtab_search+0x49>
|/|-|--- eb 32                jmp    <sidtab_search+0x6c> // (1)
||| |    66 0f 1f 44 00 00    nopw   0x0(%rax,%rax,1)
||| |/-> 48 8b 40 48          mov    0x48(%rax),%rax
||| ||   48 85 c0             test   %rax,%rax
|||/||-- 74 17                je     <sidtab_search+0x60> // (2)
||||\|-> 83 38 04             cmpl   $0x4,(%rax)
|||| \-- 76 f2                jbe    <sidtab_search+0x40>
||||     83 38 05             cmpl   $0x5,(%rax)
+|||---- 75 15                jne    <sidtab_search+0x68>
||\|---> 48 83 c0 08          add    $0x8,%rax
|| |     c3                   retq   
|| |     cc                   int3   
|| |     0f 1f 80 00 00 00 00 nopl   0x0(%rax)
|| \---> c3                   retq                        // Target of (2)
||       cc                   int3   
||       66 0f 1f 44 00 00    nopw   0x0(%rax,%rax,1)
\|-----> 31 c0                xor    %eax,%eax
 |       c3                   retq   
 |       cc                   int3   
 \-----> c3                   retq                        // Target of (1)
         cc                   int3   
         66 90                xchg   %ax,%ax

There are 4 exits in total.  Two have to set up %eax, so they can't usefully be merged.

However, the unconditional jmp at (1) is 2 bytes, and could fully contain its target ret;int3 without even impacting the surrounding padding.  Whether it inlines or merges, this drops 4 bytes.

The conditional jump at (2) could be folded in to any of the other exit paths, dropping 16 bytes from the total size size.

I have no idea how easy/hard this may be to track down, or whether it is worth pursuing urgently, but it probably does want looking at, seeing as SLS hardening doubles the hit.

Comment 23 H.J. Lu 2021-10-28 22:26:13 UTC

(In reply to Andrew Cooper from comment #22)
> One curious thing I have discovered.  While auditing the -mharden-sls=all
> code generation in Xen, I found examples where I got "ret int3 ret int3"
> with no intervening instructions.
> 
> It turns out this is not a regression in this change.  It is a pre-existing
> missing optimisation, which is made more obvious when every ret is extended
> with an int3.
> 
> It occurs for functions with either no stack frame at all, or functions
> which have an early exit before setting up the stack frame.  Some examples
> which occur at -O1 do not occur at -O2.
> 
> One curious example which does still repro at -O2 is this.  We have a hash
> lookup function:
> 
> struct context *sidtab_search(struct sidtab *s, u32 sid)
> {
>     int hvalue;
>     struct sidtab_node *cur;
> 
>     if ( !s )
>         return NULL;
> 
>     hvalue = SIDTAB_HASH(sid);
>     cur = s->htable[hvalue];
>     while ( cur != NULL && sid > cur->sid )
>         cur = cur->next;
> 
>     if ( cur == NULL || sid != cur->sid )
>     {
>         /* Remap invalid SIDs to the unlabeled SID. */
>         sid = SECINITSID_UNLABELED;
>         hvalue = SIDTAB_HASH(sid);
>         cur = s->htable[hvalue];
>         while ( cur != NULL && sid > cur->sid )
>             cur = cur->next;
>         if ( !cur || sid != cur->sid )
>             return NULL;
>     }
> 
>     return &cur->context;
> }
> 
> which compiles (reformatted a little for width - unmodified:
> https://paste.debian.net/hidden/7bf675d6/) to:
> 
> <sidtab_search>:
>          48 85 ff             test   %rdi,%rdi
> /------- 74 63                je     <sidtab_search+0x68>
> |        48 8b 17             mov    (%rdi),%rdx
> |        89 f0                mov    %esi,%eax
> |        83 e0 7f             and    $0x7f,%eax
> |        48 8b 04 c2          mov    (%rdx,%rax,8),%rax
> |        48 85 c0             test   %rax,%rax
> |   /--- 75 13                jne    <sidtab_search+0x29>
> |  /|--- eb 17                jmp    <sidtab_search+0x2f>
> |  ||    0f 1f 84 00 00 00 00 nopl   0x0(%rax,%rax,1)
> |  ||    00 
> |  ||/-> 48 8b 40 48          mov    0x48(%rax),%rax
> |  |||   48 85 c0             test   %rax,%rax
> |  +||-- 74 06                je     <sidtab_search+0x2f>
> |  |\|-> 39 30                cmp    %esi,(%rax)
> |  | \-- 72 f3                jb     <sidtab_search+0x20>
> | /|---- 74 24                je     <sidtab_search+0x53>
> | |\---> 48 8b 42 28          mov    0x28(%rdx),%rax
> | |      48 85 c0             test   %rax,%rax
> | | /--- 75 11                jne    <sidtab_search+0x49>
> |/|-|--- eb 32                jmp    <sidtab_search+0x6c> // (1)
> ||| |    66 0f 1f 44 00 00    nopw   0x0(%rax,%rax,1)
> ||| |/-> 48 8b 40 48          mov    0x48(%rax),%rax
> ||| ||   48 85 c0             test   %rax,%rax
> |||/||-- 74 17                je     <sidtab_search+0x60> // (2)
> ||||\|-> 83 38 04             cmpl   $0x4,(%rax)
> |||| \-- 76 f2                jbe    <sidtab_search+0x40>
> ||||     83 38 05             cmpl   $0x5,(%rax)
> +|||---- 75 15                jne    <sidtab_search+0x68>
> ||\|---> 48 83 c0 08          add    $0x8,%rax
> || |     c3                   retq   
> || |     cc                   int3   
> || |     0f 1f 80 00 00 00 00 nopl   0x0(%rax)
> || \---> c3                   retq                        // Target of (2)
> ||       cc                   int3   
> ||       66 0f 1f 44 00 00    nopw   0x0(%rax,%rax,1)
> \|-----> 31 c0                xor    %eax,%eax
>  |       c3                   retq   
>  |       cc                   int3   
>  \-----> c3                   retq                        // Target of (1)
>          cc                   int3   
>          66 90                xchg   %ax,%ax
> 
> There are 4 exits in total.  Two have to set up %eax, so they can't usefully
> be merged.
> 
> However, the unconditional jmp at (1) is 2 bytes, and could fully contain
> its target ret;int3 without even impacting the surrounding padding.  Whether
> it inlines or merges, this drops 4 bytes.
> 
> The conditional jump at (2) could be folded in to any of the other exit
> paths, dropping 16 bytes from the total size size.
> 
> I have no idea how easy/hard this may be to track down, or whether it is
> worth pursuing urgently, but it probably does want looking at, seeing as SLS
> hardening doubles the hit.

Please open a separate bug to track it.  Should shrink-wrap handle it?

Comment 24 H.J. Lu 2021-11-15 14:27:42 UTC

Should I submit the current patches?

Comment 25 peterz 2021-11-16 12:57:26 UTC

(In reply to H.J. Lu from comment #24)
> Should I submit the current patches?

Yes, I'd say so. Once merged I'll send a kernel patch to use -mindirect-branch-cs-prefix for all RETPOLINE builds. I posted an SLS patch but haven't really had much feedback on that yet, we'll see.

Comment 26 H.J. Lu 2021-11-16 18:52:04 UTC

(In reply to peterz from comment #25)
> (In reply to H.J. Lu from comment #24)
> > Should I submit the current patches?
> 
> Yes, I'd say so. Once merged I'll send a kernel patch to use
> -mindirect-branch-cs-prefix for all RETPOLINE builds. I posted an SLS patch
> but haven't really had much feedback on that yet, we'll see.

Patches are at

https://gcc.gnu.org/pipermail/gcc-patches/2021-November/584636.html
https://gcc.gnu.org/pipermail/gcc-patches/2021-November/584637.html

Comment 27 GCC Commits 2021-11-17 21:35:55 UTC

The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:53a643f8568067d7700a9f2facc8ba39974973d3

commit r12-5353-g53a643f8568067d7700a9f2facc8ba39974973d3
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 27 07:48:54 2021 -0700

    x86: Add -mharden-sls=[none|all|return|indirect-branch]
    
    Add -mharden-sls= to mitigate against straight line speculation (SLS)
    for function return and indirect branch by adding an INT3 instruction
    after function return and indirect branch.
    
    gcc/
    
            PR target/102952
            * config/i386/i386-opts.h (harden_sls): New enum.
            * config/i386/i386.c (output_indirect_thunk): Mitigate against
            SLS for function return.
            (ix86_output_function_return): Likewise.
            (ix86_output_jmp_thunk_or_indirect): Mitigate against indirect
            branch.
            (ix86_output_indirect_jmp): Likewise.
            (ix86_output_call_insn): Likewise.
            * config/i386/i386.opt: Add -mharden-sls=.
            * doc/invoke.texi: Document -mharden-sls=.
    
    gcc/testsuite/
    
            PR target/102952
            * gcc.target/i386/harden-sls-1.c: New test.
            * gcc.target/i386/harden-sls-2.c: Likewise.
            * gcc.target/i386/harden-sls-3.c: Likewise.
            * gcc.target/i386/harden-sls-4.c: Likewise.
            * gcc.target/i386/harden-sls-5.c: Likewise.

Comment 28 GCC Commits 2021-11-18 16:26:04 UTC

The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:2196a681d7810ad8b227bf983f38ba716620545e

commit r12-5377-g2196a681d7810ad8b227bf983f38ba716620545e
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 27 06:27:15 2021 -0700

    x86: Add -mindirect-branch-cs-prefix
    
    Add -mindirect-branch-cs-prefix to add CS prefix to call and jmp to
    indirect thunk with branch target in r8-r15 registers so that the call
    and jmp instruction length is 6 bytes to allow them to be replaced with
    "lfence; call *%r8-r15" or "lfence; jmp *%r8-r15" at run-time.
    
    gcc/
    
            PR target/102952
            * config/i386/i386.c (ix86_output_jmp_thunk_or_indirect): Emit
            CS prefix for -mindirect-branch-cs-prefix.
            (ix86_output_indirect_branch_via_reg): Likewise.
            * config/i386/i386.opt: Add -mindirect-branch-cs-prefix.
            * doc/invoke.texi: Document -mindirect-branch-cs-prefix.
    
    gcc/testsuite/
    
            PR target/102952
            * gcc.target/i386/indirect-thunk-cs-prefix-1.c: New test.
            * gcc.target/i386/indirect-thunk-cs-prefix-2.c: Likewise.

Comment 29 H.J. Lu 2021-11-18 18:30:33 UTC

Fixed for GCC 12.

Comment 30 Andrew Cooper 2022-01-06 09:51:46 UTC

(In reply to CVS Commits from comment #27)
> 
>     x86: Add -mharden-sls=[none|all|return|indirect-branch]
> 
It occurs to me that `indirect-branch` needs renaming to be `indirect-jmp` as the logic specifically does not make changes to code generation relating to `call` instructions.

Comment 31 H.J. Lu 2022-01-06 13:21:55 UTC

Created attachment 52134 [details]
A patch to rename -harden-sls=indirect-branch to -harden-sls=indirect-jmp

Comment 32 H.J. Lu 2022-01-06 13:23:23 UTC

(In reply to Andrew Cooper from comment #30)
> (In reply to CVS Commits from comment #27)
> > 
> >     x86: Add -mharden-sls=[none|all|return|indirect-branch]
> > 
> It occurs to me that `indirect-branch` needs renaming to be `indirect-jmp`
> as the logic specifically does not make changes to code generation relating
> to `call` instructions.

Does the patch in comment #31 look OK?

Comment 33 Andrew Cooper 2022-01-06 18:13:13 UTC

Looks good to me

Comment 34 GCC Commits 2022-01-06 19:53:53 UTC

The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:ed8060950c64f2e449aaf90e438aa26d0d9d0b31

commit r12-6320-ged8060950c64f2e449aaf90e438aa26d0d9d0b31
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Jan 5 16:33:16 2022 -0800

    x86: Rename -harden-sls=indirect-branch to -harden-sls=indirect-jmp
    
    Indirect branch also includes indirect call instructions.  Rename
    -harden-sls=indirect-branch to -harden-sls=indirect-jmp to match its
    intended behavior.
    
            PR target/102952
            * config/i386/i386-opts.h (harden_sls): Replace
            harden_sls_indirect_branch with harden_sls_indirect_jmp.
            * config/i386/i386.c (ix86_output_jmp_thunk_or_indirect):
            Likewise.
            (ix86_output_indirect_jmp): Likewise.
            (ix86_output_call_insn): Likewise.
            * config/i386/i386.opt: Replace indirect-branch with
            indirect-jmp.  Replace harden_sls_indirect_branch with
            harden_sls_indirect_jmp.
            * doc/invoke.texi (-harden-sls=): Replace indirect-branch with
            indirect-jmp.

Comment 35 H.J. Lu 2022-01-06 20:12:35 UTC

Fixed.

Comment 36 Richard Biener 2022-01-31 08:04:57 UTC

HJ, does it make sense to backport these to branches since AFAIU they also address possible security issues?

Comment 37 H.J. Lu 2022-01-31 13:34:50 UTC

(In reply to Richard Biener from comment #36)
> HJ, does it make sense to backport these to branches since AFAIU they also
> address possible security issues?

Yes, Linux kernel needs it.  I will work on it.

Comment 38 H.J. Lu 2022-01-31 15:43:28 UTC

(In reply to H.J. Lu from comment #37)
> (In reply to Richard Biener from comment #36)
> > HJ, does it make sense to backport these to branches since AFAIU they also
> > address possible security issues?
> 
> Yes, Linux kernel needs it.  I will work on it.

I am testing GCC 11 backport on users/hjl/pr102952/gcc-11 branch:

https://gitlab.com/x86-gcc/gcc/-/commits/users/hjl/pr102952/gcc-11

Comment 39 H.J. Lu 2022-01-31 18:56:38 UTC

The GCC 11 backport is posted at

https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589527.html

Comment 40 Andrew Cooper 2022-02-07 23:06:46 UTC

I've given the GCC-11 branch a test and everything appears to be in order.

Comment 41 GCC Commits 2022-02-16 14:01:15 UTC

The releases/gcc-11 branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:39d944c4237e5d35e28a2668d3b9a2e0f6f7bd01

commit r11-9575-g39d944c4237e5d35e28a2668d3b9a2e0f6f7bd01
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 27 07:48:54 2021 -0700

    x86: Add -mharden-sls=[none|all|return|indirect-branch]
    
    Add -mharden-sls= to mitigate against straight line speculation (SLS)
    for function return and indirect branch by adding an INT3 instruction
    after function return and indirect branch.
    
    gcc/
    
            PR target/102952
            * config/i386/i386-opts.h (harden_sls): New enum.
            * config/i386/i386.c (output_indirect_thunk): Mitigate against
            SLS for function return.
            (ix86_output_function_return): Likewise.
            (ix86_output_jmp_thunk_or_indirect): Mitigate against indirect
            branch.
            (ix86_output_indirect_jmp): Likewise.
            (ix86_output_call_insn): Likewise.
            * config/i386/i386.opt: Add -mharden-sls=.
            * doc/invoke.texi: Document -mharden-sls=.
    
    gcc/testsuite/
    
            PR target/102952
            * gcc.target/i386/harden-sls-1.c: New test.
            * gcc.target/i386/harden-sls-2.c: Likewise.
            * gcc.target/i386/harden-sls-3.c: Likewise.
            * gcc.target/i386/harden-sls-4.c: Likewise.
            * gcc.target/i386/harden-sls-5.c: Likewise.
    
    (cherry picked from commit 53a643f8568067d7700a9f2facc8ba39974973d3)

Comment 42 GCC Commits 2022-02-16 14:01:20 UTC

The releases/gcc-11 branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:5d928740a533cd9e78673fad7ea86d20b2142277

commit r11-9576-g5d928740a533cd9e78673fad7ea86d20b2142277
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Oct 27 06:27:15 2021 -0700

    x86: Add -mindirect-branch-cs-prefix
    
    Add -mindirect-branch-cs-prefix to add CS prefix to call and jmp to
    indirect thunk with branch target in r8-r15 registers so that the call
    and jmp instruction length is 6 bytes to allow them to be replaced with
    "lfence; call *%r8-r15" or "lfence; jmp *%r8-r15" at run-time.
    
    gcc/
    
            PR target/102952
            * config/i386/i386.c (ix86_output_jmp_thunk_or_indirect): Emit
            CS prefix for -mindirect-branch-cs-prefix.
            (ix86_output_indirect_branch_via_reg): Likewise.
            * config/i386/i386.opt: Add -mindirect-branch-cs-prefix.
            * doc/invoke.texi: Document -mindirect-branch-cs-prefix.
    
    gcc/testsuite/
    
            PR target/102952
            * gcc.target/i386/indirect-thunk-cs-prefix-1.c: New test.
            * gcc.target/i386/indirect-thunk-cs-prefix-2.c: Likewise.
    
    (cherry picked from commit 2196a681d7810ad8b227bf983f38ba716620545e)

Comment 43 GCC Commits 2022-02-16 14:01:25 UTC

The releases/gcc-11 branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:58a4e292e8507a2968bfd2b10631ba95d5440c97

commit r11-9577-g58a4e292e8507a2968bfd2b10631ba95d5440c97
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Jan 5 16:33:16 2022 -0800

    x86: Rename -harden-sls=indirect-branch to -harden-sls=indirect-jmp
    
    Indirect branch also includes indirect call instructions.  Rename
    -harden-sls=indirect-branch to -harden-sls=indirect-jmp to match its
    intended behavior.
    
            PR target/102952
            * config/i386/i386-opts.h (harden_sls): Replace
            harden_sls_indirect_branch with harden_sls_indirect_jmp.
            * config/i386/i386.c (ix86_output_jmp_thunk_or_indirect):
            Likewise.
            (ix86_output_indirect_jmp): Likewise.
            (ix86_output_call_insn): Likewise.
            * config/i386/i386.opt: Replace indirect-branch with
            indirect-jmp.  Replace harden_sls_indirect_branch with
            harden_sls_indirect_jmp.
            * doc/invoke.texi (-harden-sls=): Replace indirect-branch with
            indirect-jmp.
    
    (cherry picked from commit ed8060950c64f2e449aaf90e438aa26d0d9d0b31)