I was playing with RISC-V GCC 12.2.0 from Arch Linux. I noticed inefficient-looking assembly output in code that uses memcpy to access 32-bit unaligned integers. I tried Godbolt with 16/32-bit integers and it seems that the same weirdness happens with RV32 & RV64 with GCC 13.2.0 and trunk, and also on a few other targets. (Clang's output looks OK.)

For a little endian target:

#include <stdint.h>
#include <string.h>

uint32_t bytes16(const uint8_t *b)
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8);
}

uint32_t copy16(const uint8_t *b)
{
    uint16_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}

riscv64-linux-gnu-gcc -march=rv64gc -O2 -mtune=size

bytes16:
        lhu     a0,0(a0)
        ret
copy16:
        lhu     a0,0(a0)
        ret

That looks good because -mno-strict-align is the default. After omitting -mtune=size, unaligned access isn't used (the output is the same as with -mstrict-align):

riscv64-linux-gnu-gcc -march=rv64gc -O2

bytes16:
        lbu     a5,1(a0)
        lbu     a0,0(a0)
        slli    a5,a5,8
        or      a0,a5,a0
        ret
copy16:
        lbu     a4,0(a0)
        lbu     a5,1(a0)
        addi    sp,sp,-16
        sb      a4,14(sp)
        sb      a5,15(sp)
        lhu     a0,14(sp)
        addi    sp,sp,16
        jr      ra

bytes16 looks good but copy16 is weird: the bytes are copied to an aligned location on the stack and then loaded back. On Godbolt this happens with GCC 13.2.0 on RV32, RV64, ARM64 (but only when using -mstrict-align), MIPS64EL, and SPARC & SPARC64 (the comparison needs a big endian bytes16). For ARM64 and MIPS64EL the oldest GCC on Godbolt is GCC 5.4, and the same thing happens with it too.

32-bit reads with -O2 behave similarly. With -Os a call to memcpy is emitted for copy32 but not for bytes32.

#include <stdint.h>
#include <string.h>

uint32_t bytes32(const uint8_t *b)
{
    return (uint32_t)b[0]
            | ((uint32_t)b[1] << 8)
            | ((uint32_t)b[2] << 16)
            | ((uint32_t)b[3] << 24);
}

uint32_t copy32(const uint8_t *b)
{
    uint32_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}

riscv64-linux-gnu-gcc -march=rv64gc -O2

bytes32:
        lbu     a4,1(a0)
        lbu     a3,0(a0)
        lbu     a5,2(a0)
        lbu     a0,3(a0)
        slli    a4,a4,8
        or      a4,a4,a3
        slli    a5,a5,16
        or      a5,a5,a4
        slli    a0,a0,24
        or      a0,a0,a5
        sext.w  a0,a0
        ret
copy32:
        lbu     a2,0(a0)
        lbu     a3,1(a0)
        lbu     a4,2(a0)
        lbu     a5,3(a0)
        addi    sp,sp,-16
        sb      a2,12(sp)
        sb      a3,13(sp)
        sb      a4,14(sp)
        sb      a5,15(sp)
        lw      a0,12(sp)
        addi    sp,sp,16
        jr      ra

riscv64-linux-gnu-gcc -march=rv64gc -Os

bytes32:
        lbu     a4,1(a0)
        lbu     a5,0(a0)
        slli    a4,a4,8
        or      a4,a4,a5
        lbu     a5,2(a0)
        lbu     a0,3(a0)
        slli    a5,a5,16
        or      a5,a5,a4
        slli    a0,a0,24
        or      a0,a0,a5
        sext.w  a0,a0
        ret
copy32:
        addi    sp,sp,-32
        mv      a1,a0
        li      a2,4
        addi    a0,sp,12
        sd      ra,24(sp)
        call    memcpy@plt
        ld      ra,24(sp)
        lw      a0,12(sp)
        addi    sp,sp,32
        jr      ra

I probably cannot test any proposed fixes but I hope this report is still useful. Thanks!
This isn't actually a bug. Quoting the RVA profile spec, "misaligned loads and stores might execute extremely slowly"--which is code for the possibility that they might be trapped and emulated, taking hundreds of clock cycles apiece. So the default behavior of emitting byte accesses is best when generating generic code. (Of course, when tuning for a particular microarchitecture, the shorter code sequence may be emitted.)
Byte access by default is good when the compiler doesn't know whether unaligned access is fast on the target processor. There is no disagreement here. What I suspect is a bug is the instruction sequence used for the byte access in the copy16 and copy32 cases. copy16 uses 2 * lbu + 2 * sb + 1 * lhu, that is, five memory operations to load an unaligned 16-bit integer. copy32 uses 4 * lbu + 4 * sb + 1 * lw, that is, nine memory operations to load a 32-bit integer. bytes16 needs only two memory operations and bytes32 needs four. Clang generates the minimal byte-access code for both the bytesxx and copyxx versions.
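(For reference, a quick way to count the memory operations is to compile to assembly on stdout and count the load/store mnemonics. A rough sketch, assuming the test case from the first message is saved as unaligned.c:

riscv64-linux-gnu-gcc -march=rv64gc -O2 -S -o - unaligned.c \
    | grep -cE '\b(l[bhwd]u?|s[bhwd])\b'

This counts every load/store line in the whole file rather than per function, so it is only a rough check; reading the per-function output directly is more precise.)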
This is a dup of another bug. Basically memcpy is not changed into an unaligned load ...
See pr 50417
If I understood correctly, PR 50417 is about wishing that GCC would infer that a pointer given to memcpy has alignment higher than one. In my examples the alignment of the uint8_t *b argument is one and thus byte-by-byte access is needed (if the target processor doesn't have fast unaligned access, determined from -mtune and -mno-strict-align). My report is about the instruction sequence used for the byte-by-byte access. Omitting the stack pointer manipulation and return instruction, this is bytes16:

        lbu     a5,1(a0)
        lbu     a0,0(a0)
        slli    a5,a5,8
        or      a0,a5,a0

And copy16:

        lbu     a4,0(a0)
        lbu     a5,1(a0)
        sb      a4,14(sp)
        sb      a5,15(sp)
        lhu     a0,14(sp)

Is the latter as good code as the former? If it is, then this report might be invalid and I apologize for the noise.

PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to memcpy, so that is the same detail as the -Os case in my first message. Calling memcpy instead of expanding it inline saves six bytes in RV64C. On ARM64 with -Os -mstrict-align the call doesn't save space:

bytes32:
        ldrb    w1, [x0]
        ldrb    w2, [x0, 1]
        orr     x2, x1, x2, lsl 8
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        orr     x1, x2, x1, lsl 16
        orr     w0, w1, w0, lsl 24
        ret
copy32:
        stp     x29, x30, [sp, -32]!
        mov     x1, x0
        mov     x2, 4
        mov     x29, sp
        add     x0, sp, 28
        bl      memcpy
        ldr     w0, [sp, 28]
        ldp     x29, x30, [sp], 32
        ret

And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:

bytes32:
        ldrb    w4, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w3, [x0, 3]
        orr     x2, x4, x2, lsl 8
        orr     x0, x2, x1, lsl 16
        orr     w0, w0, w3, lsl 24
        ret
copy32:
        sub     sp, sp, #16
        ldrb    w3, [x0]
        ldrb    w2, [x0, 1]
        ldrb    w1, [x0, 2]
        ldrb    w0, [x0, 3]
        strb    w3, [sp, 12]
        strb    w2, [sp, 13]
        strb    w1, [sp, 14]
        strb    w0, [sp, 15]
        ldr     w0, [sp, 12]
        add     sp, sp, 16
        ret

ARM64 with -mstrict-align might be a contrived example in practice though.
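(For reference, the ARM64 comparison should be reproducible locally with an aarch64 cross-compiler; the compiler name and the test.c file name here are just placeholders for whichever toolchain and file hold the 32-bit example:

aarch64-linux-gnu-gcc -Os -mstrict-align -S -o - test.c
aarch64-linux-gnu-gcc -O2 -mstrict-align -S -o - test.c
)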
Ack, I misunderstood your earlier message. You're of course right that the load/load/shift/or sequence is preferable to the load/load/store/store/load sequence on just about any practical implementation. That the memcpy version is optimized less well does seem to be disjoint from the issue Andrew mentioned.
You need to trace RTL expansion where it decides to use a temporary stack slot; there's likely some instruction pattern missing to compose the larger words.
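(A possible starting point, assuming the test case is saved as copy16.c: dump the final GIMPLE and the expand-pass RTL and see where the temporary slot first appears.

riscv64-linux-gnu-gcc -march=rv64gc -O2 -c copy16.c \
    -fdump-tree-optimized -fdump-rtl-expand

The copy16.c.*.optimized dump shows the GIMPLE that expansion starts from, and copy16.c.*.expand shows the RTL in which the stack temporary is introduced.)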