Bug 105928 - [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)
Summary: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 13.0
Importance: P3 normal
Target Milestone: 14.0
Assignee: Wilco
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2022-06-11 20:19 UTC by Peter Cordes
Modified: 2023-09-18 12:33 UTC
CC List: 2 users

See Also:
Host:
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-07-05 00:00:00


Description Peter Cordes 2022-06-11 20:19:57 UTC
void foo(unsigned long *p) {
    *p = 0xdeadbeefdeadbeef;
}

cleverly compiles to the following (https://godbolt.org/z/b3oqao5Kz):

        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But producing the value in a register takes 4 instructions:

unsigned long constant(){
    return 0xdeadbeefdeadbeef;
}

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        movk    x0, 0xbeef, lsl 32
        movk    x0, 0xdead, lsl 48
        ret

At least with -Os, and maybe at -O2 or -O3 if it's efficient, we could be doing a shifted ADD or ORR to broadcast a zero-extended 32-bit value to 64-bit.

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        add     x0, x0, x0, lsl 32

Some CPUs may fuse sequences of movk, and shifted operands for ALU ops may take extra time in some CPUs, so this might not actually be optimal for performance, but it is smaller for -Os and -Oz.
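
For reference, the identity the trick relies on is easy to check: if the two 32-bit halves of a 64-bit constant are equal, adding (or ORing) the zero-extended low half with itself shifted left by 32 reproduces the whole constant. A minimal C sketch of that check (the function is only illustrative, not part of any testcase):

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t c  = 0xdeadbeefdeadbeef;
    uint64_t lo = (uint32_t)c;          /* what MOV + MOVK builds in w1 */
    assert(c == lo + (lo << 32));       /* add x0, x0, x0, lsl 32 */
    assert(c == (lo | (lo << 32)));     /* orr x0, x0, x0, lsl 32 */
    return 0;
}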

We should also be using that trick for stores to _Atomic or volatile long*, where we currently do MOV + 3x MOVK and then an STR, even with ARMv8.4-a, which guarantees atomicity.
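
For concreteness, such a case might look like the following (a sketch; the function name and the relaxed atomic store are only illustrative). Here the full 64-bit value has to be materialized in one register before a single STR, so the three-instruction expansion would save an instruction as well:

#include <stdatomic.h>

void store_atomic(_Atomic unsigned long *p) {
    /* currently MOV + 3x MOVK to build the constant, then one STR */
    atomic_store_explicit(p, 0xdeadbeefdeadbeef, memory_order_relaxed);
}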


---

ARMv8.4-a and later guarantees atomicity for ldp/stp within an aligned 16-byte chunk, so we should use MOV/MOVK / STP there even for volatile or __ATOMIC_RELAXED.  But presumably that's a different part of GCC's internals, so I'll report that separately.
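
For completeness, the volatile counterpart of foo above would be the candidate for that (a sketch; the function name is illustrative):

void foo_volatile(volatile unsigned long *p) {
    /* per the above: with ARMv8.4-a, MOV/MOVK + STP of the two 32-bit halves
       would be acceptable here, since the STP is guaranteed to be atomic */
    *p = 0xdeadbeefdeadbeef;
}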
Comment 1 Richard Sandiford 2022-07-05 11:48:34 UTC
Confirmed.
Comment 2 Wilco 2023-09-13 13:08:58 UTC
Shifted logical operations are single cycle on all recent cores.
Comment 4 GCC Commits 2023-09-18 12:28:13 UTC
The master branch has been updated by Wilco Dijkstra <wilco@gcc.gnu.org>:

https://gcc.gnu.org/g:fc7070025d1a6668ff6cb4391f84771a7662def7

commit r14-4096-gfc7070025d1a6668ff6cb4391f84771a7662def7
Author: Wilco Dijkstra <wilco.dijkstra@arm.com>
Date:   Wed Sep 13 13:21:50 2023 +0100

    AArch64: Improve immediate expansion [PR105928]
    
    Support immediate expansion of immediates which can be created from 2 MOVKs
    and a shifted ORR or BIC instruction.  Change aarch64_split_dimode_const_store
    to apply if we save one instruction.
    
    This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
    
    gcc/ChangeLog:
            PR target/105928
            * config/aarch64/aarch64.cc (aarch64_internal_mov_immediate):
            Add support for immediates using shifted ORR/BIC.
            (aarch64_split_dimode_const_store): Apply if we save one instruction.
            * config/aarch64/aarch64.md (<LOGICAL:optab>_<SHIFT:optab><mode>3):
            Make pattern global.
    
    gcc/testsuite:
            PR target/105928
            * gcc.target/aarch64/pr105928.c: Add new test.
            * gcc.target/aarch64/vect-cse-codegen.c: Fix test.
Comment 5 Wilco 2023-09-18 12:33:37 UTC
Fixed