Bug 90582 - AArch64 stack-protector wastes an instruction on address-generation
Summary: AArch64 stack-protector wastes an instruction on address-generation
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.2.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2019-05-22 22:37 UTC by Peter Cordes
Modified: 2024-01-25 09:09 UTC
CC List: 1 user

See Also:
Host:
Target: aarch64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Peter Cordes 2019-05-22 22:37:19 UTC
void protect_me() {
    volatile int buf[2];
    buf[1] = 3;
}

https://godbolt.org/z/xdlr5w AArch64 gcc8.2 -O3 -fstack-protector-strong

protect_me:
        stp     x29, x30, [sp, -32]!
        adrp    x0, __stack_chk_guard
        add     x0, x0, :lo12:__stack_chk_guard         ### this instruction
        mov     x29, sp                         # frame pointer even though -fomit-frame-pointer is part of -O3.  Goes away with explicit -fomit-frame-pointer

        ldr     x1, [x0]                        # copy the cookie
        str     x1, [sp, 24]
        mov     x1,0                            # and destroy the reg

        mov     w1, 3                           # right before it's already destroyed
        str     w1, [sp, 20]             # buf[1] = 3

        ldr     x1, [sp, 24]                    # canary
        ldr     x0, [x0]                        # reload the key (destroys the key pointer)
        eor     x0, x1, x0
        cbnz    x0, .L5
        ldp     x29, x30, [sp], 32              # FP and LR save/restore (for some reason?)
        ret
.L5:
              # can the store of the link register go here, for backtracing?
        bl      __stack_chk_fail

A function that returns a global can embed the low 12 bits of the address into the load instruction.  AArch64 instructions are fixed-width, so there's no reason (AFAIK) not to do this.

f:
        adrp    x0, foo
        ldr     w0, [x0, #:lo12:foo]
        ret
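
f is what you get for a plain global load, e.g. from source like this (my reconstruction; only the asm was shown above):

int foo;
int f() { return foo; }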

I'm not an AArch64 performance expert; it's conceivable that a zero displacement is worth spending an extra instruction on when the address is used twice, but it seems unlikely.

So we should be doing:

        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]  # in prologue to copy cookie
        ... 
        ldr     x0, [x0, #:lo12:__stack_chk_guard]  # in epilogue to check cookie

This also avoids leaving an exact pointer right to __stack_chk_guard in a register, in case a vulnerable callee or code in the function body can be tricked into dereferencing it and leaking the cookie.  (In non-leaf functions, we generate the pointer in a call-preserved register like x19, so yes it will be floating around in a register for callees).
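
The non-leaf case looks roughly like this (a sketch from memory, not verified against actual GCC output; some_callee is a placeholder):

        str     x19, [sp, 16]                      # save call-preserved reg
        adrp    x19, __stack_chk_guard
        add     x19, x19, :lo12:__stack_chk_guard  # exact guard pointer, live across calls
        ldr     x0, [x19]
        str     x0, [sp, 24]                       # copy cookie into the frame
        ...
        bl      some_callee                        # x19 still points at the guard here
        ...
        ldr     x0, [x19]                          # reload the key for the check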

I'd hate to suggest destroying the pointer when copying to the stack, because that would require another adrp later.

Finding a gadget that has exactly the right offset (the low 12 bits of __stack_chk_guard's address) is a lot less likely than finding an ldr from [x0].  Of course this will introduce a lot of LDR instructions with a #:lo12:__stack_chk_guard offset, but hopefully they won't be part of useful gadgets, because they lead to writing the stack, or to an EOR/CBNZ to __stack_chk_fail.

----

I don't see a way to optimize canary^key == 0 any further, unlike x86-64 PR 90568.  I assume EOR / CBNZ is at least as efficient as SUBS / BNE on all/most AArch64 microarchitectures, but someone should check.
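
For reference, the two candidate sequences (same register usage assumed in both):

        eor     x0, x1, x0          # x0 = canary ^ key; flags untouched
        cbnz    x0, .L5             # compare-and-branch on a register

        subs    x0, x1, x0          # x0 = canary - key; sets NZCV
        bne     .L5                 # branch on the Z flag

Both are two instructions either way; the main difference is which pair a given core can fuse.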

----

-O3 includes -fomit-frame-pointer according to -fverbose-asm, but functions protected with -fstack-protector-strong still get a frame pointer in x29 (costing a MOV x29, sp instruction, and a save/restore with STP/LDP along with x30).

However, explicitly using -fomit-frame-pointer stops that from happening.  Is that a separate bug, or am I missing something?

----

Without stack-protector, the function is vastly simpler:

protect_me:
        sub     sp, sp, #16
        mov     w0, 3
        str     w0, [sp, 12]
        add     sp, sp, 16
        ret

Does stack-protector really need to spill/reload x29/x30 (FP and LR)?  Bouncing the return address through memory seems inefficient, even though branch prediction does hide that latency.

Is that just so __stack_chk_fail can backtrace?  Can we move the store of the link register into the __stack_chk_fail branch, off the fast path?

Or if we do unconditionally store x30 (the link register), at least don't bother reloading it in a leaf function if register allocation didn't need to clobber it.  Unlike x86-64, the return address can't be attacked with buffer overflows if it stays safe in a register the whole function.
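
Something like this for the leaf case (hypothetical codegen, not what GCC emits today):

protect_me:
        sub     sp, sp, #32
        str     x30, [sp, 24]           # stored only so __stack_chk_fail can backtrace
        ...                             # cookie copy / body / cookie check
        cbnz    x0, .L5
        add     sp, sp, 32              # no reload: x30 still holds the return address
        ret
.L5:
        bl      __stack_chk_fail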

Obviously my test-case with a volatile array and no inputs at all is making -fstack-protector-strong look dumb by protecting a perfectly safe function.  IDK how common it is to have leaf functions with arrays or structs that just use them for some computation on function args or globals and then return, maybe after copying the array back to somewhere else.  A sort function might use a tmp array.
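
A hypothetical example of the pattern I mean:

void reverse4(int *dst, const int *src) {
    int tmp[4];                    /* local array: -fstack-protector-strong
                                      instruments this even though no pointer
                                      to tmp ever escapes */
    for (int i = 0; i < 4; i++)
        tmp[i] = src[3 - i];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}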

With -fstack-protector (not -strong), maybe instead of optimizing the store of the link register into the block that calls __stack_chk_fail, we should use that as a sign that a leaf function which doesn't spill LR is not *itself* vulnerable to ROP attacks, and not stack-protect it.

-strong might still want to stack-protect to defend against overwriting the caller's stack frame, which might contain function pointers or classes with vtable pointers.  Or saved registers that are about to be restored.  Or to detect that something bad happened to it, even if it won't immediately result in a bad control transfer.

Or maybe not, and maybe -strong would also want to drop protection from leaf functions that don't (need to) spill LR.


----

The mov x1, 0 in the prologue is redundant: mov w1, 3 is about to destroy the value in x1 anyway (a write to w1 zero-extends into all of x1), and it can't fault.  (Reporting that bug separately; it's not AArch64-specific.)
Comment 1 Andrew Pinski 2019-05-22 23:07:51 UTC
> I assume EOR / CBNZ is at least as efficient as SUBS / BNE on
> all/most AArch64 microarchitectures, but someone should check.

It is similar to x86 in that respect on some cores (Marvell's cores, mostly).
That is, ThunderX, ThunderX2, OcteonTX, and OcteonTX2 can all macro-combine the two instructions into one micro-op.
Comment 2 Andrew Pinski 2024-01-25 08:48:14 UTC
(In reply to Andrew Pinski from comment #1)
> > I assume EOR / CBNZ is at least as efficient as SUBS / BNE on
> > all/most AArch64 microarchitectures, but someone should check.
> 
> It is similar to x86 in that respect on some cores (Marvell's cores,
> mostly). That is, ThunderX, ThunderX2, OcteonTX, and OcteonTX2 can all
> macro-combine the two instructions into one micro-op.

Even on most non-Marvell cores now, subs/bne is better than eor/cbnz.


Anyways, starting with GCC 10.3/9.4 we get:
        ldr     x2, [x0]
        subs    x1, x1, x2
        mov     x2, 0
        bne     .L5

Which we can't fuse anyways (the mov x2, 0 lands between the subs and the bne).  I wonder if we should clobber x1 too.


Note on the -fomit-frame-pointer issue: it is not really an issue, as only -momit-leaf-frame-pointer is turned on by default, and now the function is NOT a leaf function due to the call to __stack_chk_fail.

>        mov     x1,0                            # and destroy the reg
>        mov     w1, 3                           # right before it's already destroyed

This is by design: GCC does not go back and figure out whether the zeroing could be removed, because deleting it by accident might introduce a "security hole".  Always emitting it ensures that does NOT happen.


As for the other issue, dealing with the address formation: it is a small missed optimization and might not help in general, or at all.