Bug 111502 - Suboptimal unaligned 2/4-byte memcpy on strict-align targets
Summary: Suboptimal unaligned 2/4-byte memcpy on strict-align targets
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 13.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 50417
Blocks:
Reported: 2023-09-20 17:51 UTC by Lasse Collin
Modified: 2023-09-21 06:22 UTC
CC List: 2 users

See Also:
Host:
Target: riscv
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Lasse Collin 2023-09-20 17:51:16 UTC
I was playing with RISC-V GCC 12.2.0 from Arch Linux. I noticed inefficient-looking assembly output in code that uses memcpy to access 32-bit unaligned integers. I tried Godbolt with 16/32-bit integers, and it seems that the same weirdness happens on RV32 & RV64 with GCC 13.2.0 and trunk, and also on a few other targets. (Clang's output looks OK.)

For a little endian target:

#include <stdint.h>
#include <string.h>

uint32_t bytes16(const uint8_t *b)
{
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8);
}

uint32_t copy16(const uint8_t *b)
{
    uint16_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}

riscv64-linux-gnu-gcc -march=rv64gc -O2 -mtune=size

bytes16:
        lhu     a0,0(a0)
        ret

copy16:
        lhu     a0,0(a0)
        ret

That looks good because -mno-strict-align is the default.

Without -mtune=size, unaligned access isn't used (the output is the same as with -mstrict-align):

riscv64-linux-gnu-gcc -march=rv64gc -O2

bytes16:
        lbu     a5,1(a0)
        lbu     a0,0(a0)
        slli    a5,a5,8
        or      a0,a5,a0
        ret

copy16:
        lbu     a4,0(a0)
        lbu     a5,1(a0)
        addi    sp,sp,-16
        sb      a4,14(sp)
        sb      a5,15(sp)
        lhu     a0,14(sp)
        addi    sp,sp,16
        jr      ra

bytes16 looks good, but copy16 is weird: the bytes are copied to an aligned location on the stack and then loaded back.
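
Roughly, the emitted copy16 behaves as if the function had been written like this (just a sketch of what the assembly above does; the function name is made up and this isn't GCC output):

#include <stdint.h>

uint32_t copy16_as_emitted(const uint8_t *b)
{
    uint16_t v;                   /* 2-byte-aligned stack slot */
    ((uint8_t *)&v)[0] = b[0];    /* lbu + sb */
    ((uint8_t *)&v)[1] = b[1];    /* lbu + sb */
    return v;                     /* lhu from the aligned slot */
}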

On Godbolt it happens with GCC 13.2.0 on RV32, RV64, ARM64 (but only when using -mstrict-align), MIPS64EL, and SPARC & SPARC64 (the comparison needs a big endian bytes16). For ARM64 and MIPS64EL the oldest GCC on Godbolt is GCC 5.4, and the same thing happens with that too.

32-bit reads with -O2 behave similarly. With -Os a call to memcpy is emitted for copy32 but not for bytes32.

#include <stdint.h>
#include <string.h>

uint32_t bytes32(const uint8_t *b)
{
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8)
        | ((uint32_t)b[2] << 16)
        | ((uint32_t)b[3] << 24);
}

uint32_t copy32(const uint8_t *b)
{
    uint32_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}

riscv64-linux-gnu-gcc -march=rv64gc -O2

bytes32:
        lbu     a4,1(a0)
        lbu     a3,0(a0)
        lbu     a5,2(a0)
        lbu     a0,3(a0)
        slli    a4,a4,8
        or      a4,a4,a3
        slli    a5,a5,16
        or      a5,a5,a4
        slli    a0,a0,24
        or      a0,a0,a5
        sext.w  a0,a0
        ret

copy32:
        lbu     a2,0(a0)
        lbu     a3,1(a0)
        lbu     a4,2(a0)
        lbu     a5,3(a0)
        addi    sp,sp,-16
        sb      a2,12(sp)
        sb      a3,13(sp)
        sb      a4,14(sp)
        sb      a5,15(sp)
        lw      a0,12(sp)
        addi    sp,sp,16
        jr      ra

riscv64-linux-gnu-gcc -march=rv64gc -Os

bytes32:
        lbu     a4,1(a0)
        lbu     a5,0(a0)
        slli    a4,a4,8
        or      a4,a4,a5
        lbu     a5,2(a0)
        lbu     a0,3(a0)
        slli    a5,a5,16
        or      a5,a5,a4
        slli    a0,a0,24
        or      a0,a0,a5
        sext.w  a0,a0
        ret

copy32:
        addi    sp,sp,-32
        mv      a1,a0
        li      a2,4
        addi    a0,sp,12
        sd      ra,24(sp)
        call    memcpy@plt
        ld      ra,24(sp)
        lw      a0,12(sp)
        addi    sp,sp,32
        jr      ra

I probably cannot test any proposed fixes but I hope this report is still useful. Thanks!
Comment 1 Andrew Waterman 2023-09-20 17:56:57 UTC
This isn't actually a bug.  Quoting the RVA profile spec, "misaligned loads and stores might execute extremely slowly"--which is code for the possibility that they might be trapped and emulated, taking hundreds of clock cycles apiece.  So the default behavior of emitting byte accesses is best when generating generic code.  (Of course, when tuning for a particular microarchitecture, the shorter code sequence may be emitted.)
Comment 2 Lasse Collin 2023-09-20 18:37:34 UTC
Byte access by default is good when the compiler doesn't know whether unaligned access is fast on the target processor. There is no disagreement here.

What I suspect is a bug is the instruction sequence used for byte access in the copy16 and copy32 cases. copy16 uses 2 * lbu + 2 * sb + 1 * lhu, that is, five memory operations to load an unaligned 16-bit integer. copy32 uses 4 * lbu + 4 * sb + 1 * lw, that is, nine memory operations to load an unaligned 32-bit integer.

bytes16 needs two memory operations and bytes32 needs four. Clang generates this kind of code from both bytesxx and copyxx.
Comment 3 Andrew Pinski 2023-09-20 18:42:28 UTC
This is a dup of another bug. Basically memcpy is not changed into an unaligned load ...
Comment 4 Andrew Pinski 2023-09-20 18:49:03 UTC
See PR 50417.
Comment 5 Lasse Collin 2023-09-20 20:06:46 UTC
If I understood correctly, PR 50417 is about wishing that GCC would infer that a pointer given to memcpy has alignment higher than one. In my examples the alignment of the uint8_t *b argument is one and thus byte-by-byte access is needed (if the target processor doesn't have fast unaligned access, determined from -mtune and -mno-strict-align).

My report is about the instruction sequence used for the byte-by-byte access.
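
For contrast, here is a sketch (not code from PR 50417; the function name is made up) of how higher alignment can be communicated explicitly when it really is guaranteed, in which case a single aligned load can be used even with -mstrict-align:

#include <stdint.h>
#include <string.h>

uint32_t copy32_known_aligned(const uint8_t *b)
{
    uint32_t v;
    /* The caller guarantees 4-byte alignment; with that known,
       a single aligned lw can be emitted. */
    memcpy(&v, __builtin_assume_aligned(b, 4), sizeof(v));
    return v;
}

In my examples no such guarantee exists, so the question is only which byte-by-byte sequence gets emitted.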

Omitting the stack pointer manipulation and return instruction, this is bytes16:

        lbu     a5,1(a0)
        lbu     a0,0(a0)
        slli    a5,a5,8
        or      a0,a5,a0

And copy16:

        lbu     a4,0(a0)
        lbu     a5,1(a0)
        sb      a4,14(sp)
        sb      a5,15(sp)
        lhu     a0,14(sp)

Is the latter as good as the former? If yes, then this report might be invalid and I apologize for the noise.

PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to memcpy, so that is the same detail as the -Os case in my first message. Calling memcpy instead of expanding it inline saves six bytes in RV64C. On ARM64 with -Os -mstrict-align the call doesn't save space:

bytes32:
        ldrb    w1, [x0]
        ldrb    w2, [x0, 1]
        orr     x2, x1, x2, lsl 8
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        orr     x1, x2, x1, lsl 16
        orr     w0, w1, w0, lsl 24
        ret

copy32:
        stp     x29, x30, [sp, -32]!
        mov     x1, x0
        mov     x2, 4
        mov     x29, sp
        add     x0, sp, 28
        bl      memcpy
        ldr     w0, [sp, 28]
        ldp     x29, x30, [sp], 32
        ret

And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:

bytes32:
        ldrb    w4, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w3, [x0, 3]
        orr     x2, x4, x2, lsl 8
        orr     x0, x2, x1, lsl 16
        orr     w0, w0, w3, lsl 24
        ret

copy32:
        sub     sp, sp, #16
        ldrb    w3, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        strb    w3, [sp, 12]
        strb    w2, [sp, 13]
        strb    w1, [sp, 14]
        strb    w0, [sp, 15]
        ldr     w0, [sp, 12]
        add     sp, sp, 16
        ret

ARM64 with -mstrict-align might be a contrived example in practice though.
Comment 6 Andrew Waterman 2023-09-20 20:20:16 UTC
Ack, I misunderstood your earlier message.  You're of course right that the load/load/shift/or sequence is preferable to the load/load/store/store/load sequence, on just about any practical implementation.  That the memcpy version is optimized less well does seem to be disjoint from the issue Andrew mentioned.
Comment 7 Richard Biener 2023-09-21 06:22:22 UTC
You need to trace RTL expansion where it decides to use a temporary stack slot; there's likely some instruction pattern missing to compose the larger words.
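
For example (the file name here is made up and the exact dump suffix depends on the GCC version), the expand dump from something like

riscv64-linux-gnu-gcc -march=rv64gc -O2 -S -fdump-rtl-expand copy16.c

ends up in a file like copy16.c.*r.expand and shows the RTL chosen when copy16 is expanded, including the temporary stack slot.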